提交 · b640f042faa2a2fad6464f259a8afec06e2f6386 · OpenHarmony / kernel_linux

12 6月, 2009 5 次提交

memcg: don't use bootmem allocator in setup code · 959982fe

由 Yinghai Lu 提交于 5月 28, 2009

The bootmem allocator is no longer available for page_cgroup_init() because we
set up the kernel slab allocator much earlier now.

Cc: Ingo Molnar <mingo@elte.hu>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: NYinghai Lu <yinghai@kernel.org>
Signed-off-by: NPekka Enberg <penberg@cs.helsinki.fi>

959982fe

vmalloc: use kzalloc() instead of alloc_bootmem() · 43ebdac4

由 Pekka Enberg 提交于 5月 25, 2009

We can call vmalloc_init() after kmem_cache_init() and use kzalloc() instead of
the bootmem allocator when initializing vmalloc data structures.
Acked-by: NJohannes Weiner <hannes@cmpxchg.org>
Acked-by: NLinus Torvalds <torvalds@linux-foundation.org>
Acked-by: NNick Piggin <npiggin@suse.de>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Yinghai Lu <yinghai@kernel.org>
Signed-off-by: NPekka Enberg <penberg@cs.helsinki.fi>

43ebdac4

slab: setup allocators earlier in the boot sequence · 83b519e8

由 Pekka Enberg 提交于 6月 10, 2009

This patch makes kmalloc() available earlier in the boot sequence so we can get
rid of some bootmem allocations. The bulk of the changes are due to
kmem_cache_init() being called with interrupts disabled which requires some
changes to allocator boostrap code.

Note: 32-bit x86 does WP protect test in mem_init() so we must setup traps
before we call mem_init() during boot as reported by Ingo Molnar:

  We have a hard crash in the WP-protect code:

  [    0.000000] Checking if this processor honours the WP bit even in supervisor mode...BUG: Int 14: CR2 ffcff000
  [    0.000000]      EDI 00000188  ESI 00000ac7  EBP c17eaf9c  ESP c17eaf8c
  [    0.000000]      EBX 000014e0  EDX 0000000e  ECX 01856067  EAX 00000001
  [    0.000000]      err 00000003  EIP c10135b1   CS 00000060  flg 00010002
  [    0.000000] Stack: c17eafa8 c17fd410 c16747bc c17eafc4 c17fd7e5 000011fd f8616000 c18237cc
  [    0.000000]        00099800 c17bb000 c17eafec c17f1668 000001c5 c17f1322 c166e039 c1822bf0
  [    0.000000]        c166e033 c153a014 c18237cc 00020800 c17eaff8 c17f106a 00020800 01ba5003
  [    0.000000] Pid: 0, comm: swapper Not tainted 2.6.30-tip-02161-g7a74539-dirty #52203
  [    0.000000] Call Trace:
  [    0.000000]  [<c15357c2>] ? printk+0x14/0x16
  [    0.000000]  [<c10135b1>] ? do_test_wp_bit+0x19/0x23
  [    0.000000]  [<c17fd410>] ? test_wp_bit+0x26/0x64
  [    0.000000]  [<c17fd7e5>] ? mem_init+0x1ba/0x1d8
  [    0.000000]  [<c17f1668>] ? start_kernel+0x164/0x2f7
  [    0.000000]  [<c17f1322>] ? unknown_bootoption+0x0/0x19c
  [    0.000000]  [<c17f106a>] ? __init_begin+0x6a/0x6f
Acked-by: NJohannes Weiner <hannes@cmpxchg.org>
Acked-by Linus Torvalds <torvalds@linux-foundation.org>
Cc: Christoph Lameter <cl@linux-foundation.org>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Matt Mackall <mpm@selenic.com>
Cc: Nick Piggin <npiggin@suse.de>
Cc: Yinghai Lu <yinghai@kernel.org>
Signed-off-by: NPekka Enberg <penberg@cs.helsinki.fi>

83b519e8

bootmem: fix slab fallback on numa · c91c4773

由 Pekka Enberg 提交于 6月 11, 2009

If the user requested bootmem allocation on a specific node, we should use
kzalloc_node() for the fallback allocation.

Cc: Ingo Molnar <mingo@elte.hu>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Yinghai Lu <yinghai@kernel.org>
Signed-off-by: NPekka Enberg <penberg@cs.helsinki.fi>

c91c4773

bootmem: use slab if bootmem is no longer available · 441c7e0a

由 Pekka Enberg 提交于 6月 10, 2009

As a preparation for initializing the slab allocator early, make sure the
bootmem allocator does not crash and burn if someone calls it after slab is up;
otherwise we'd need a flag day for switching to early slab.
Acked-by: NJohannes Weiner <hannes@cmpxchg.org>
Acked-by: NLinus Torvalds <torvalds@linux-foundation.org>
Cc: Christoph Lameter <cl@linux-foundation.org>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Matt Mackall <mpm@selenic.com>
Cc: Nick Piggin <npiggin@suse.de>
Cc: Yinghai Lu <yinghai@kernel.org>
Signed-off-by: NPekka Enberg <penberg@cs.helsinki.fi>

441c7e0a

10 6月, 2009 2 次提交

nommu: Provide mmap_min_addr definition. · 35f2c2f6

由 Paul Mundt 提交于 6月 09, 2009

With the "security: use mmap_min_addr indepedently of security models"
change, mmap_min_addr is used in common areas, which susbsequently blows
up the nommu build. This stubs in the definition in the nommu case as
well.
Signed-off-by: NPaul Mundt <lethal@linux-sh.org>

--

 mm/nommu.c |    3 +++
 1 file changed, 3 insertions(+)
Signed-off-by: NJames Morris <jmorris@namei.org>

35f2c2f6

tracing/events: convert block trace points to TRACE_EVENT() · 55782138

由 Li Zefan 提交于 6月 09, 2009

TRACE_EVENT is a more generic way to define tracepoints. Doing so adds
these new capabilities to this tracepoint:

  - zero-copy and per-cpu splice() tracing
  - binary tracing without printf overhead
  - structured logging records exposed under /debug/tracing/events
  - trace events embedded in function tracer output and other plugins
  - user-defined, per tracepoint filter expressions
  ...

Cons:

  - no dev_t info for the output of plug, unplug_timer and unplug_io events.
    no dev_t info for getrq and sleeprq events if bio == NULL.
    no dev_t info for rq_abort,...,rq_requeue events if rq->rq_disk == NULL.

    This is mainly because we can't get the deivce from a request queue.
    But this may change in the future.

  - A packet command is converted to a string in TP_assign, not TP_print.
    While blktrace do the convertion just before output.

    Since pc requests should be rather rare, this is not a big issue.

  - In blktrace, an event can have 2 different print formats, but a TRACE_EVENT
    has a unique format, which means we have some unused data in a trace entry.

    The overhead is minimized by using __dynamic_array() instead of __array().

I've benchmarked the ioctl blktrace vs the splice based TRACE_EVENT tracing:

      dd                   dd + ioctl blktrace       dd + TRACE_EVENT (splice)
1     7.36s, 42.7 MB/s     7.50s, 42.0 MB/s          7.41s, 42.5 MB/s
2     7.43s, 42.3 MB/s     7.48s, 42.1 MB/s          7.43s, 42.4 MB/s
3     7.38s, 42.6 MB/s     7.45s, 42.2 MB/s          7.41s, 42.5 MB/s

So the overhead of tracing is very small, and no regression when using
those trace events vs blktrace.

And the binary output of TRACE_EVENT is much smaller than blktrace:

 # ls -l -h
 -rw-r--r-- 1 root root 8.8M 06-09 13:24 sda.blktrace.0
 -rw-r--r-- 1 root root 195K 06-09 13:24 sda.blktrace.1
 -rw-r--r-- 1 root root 2.7M 06-09 13:25 trace_splice.out

Following are some comparisons between TRACE_EVENT and blktrace:

plug:
  kjournald-480   [000]   303.084981: block_plug: [kjournald]
  kjournald-480   [000]   303.084981:   8,0    P   N [kjournald]

unplug_io:
  kblockd/0-118   [000]   300.052973: block_unplug_io: [kblockd/0] 1
  kblockd/0-118   [000]   300.052974:   8,0    U   N [kblockd/0] 1

remap:
  kjournald-480   [000]   303.085042: block_remap: 8,0 W 102736992 + 8 <- (8,8) 33384
  kjournald-480   [000]   303.085043:   8,0    A   W 102736992 + 8 <- (8,8) 33384

bio_backmerge:
  kjournald-480   [000]   303.085086: block_bio_backmerge: 8,0 W 102737032 + 8 [kjournald]
  kjournald-480   [000]   303.085086:   8,0    M   W 102737032 + 8 [kjournald]

getrq:
  kjournald-480   [000]   303.084974: block_getrq: 8,0 W 102736984 + 8 [kjournald]
  kjournald-480   [000]   303.084975:   8,0    G   W 102736984 + 8 [kjournald]

  bash-2066  [001]  1072.953770:   8,0    G   N [bash]
  bash-2066  [001]  1072.953773: block_getrq: 0,0 N 0 + 0 [bash]

rq_complete:
  konsole-2065  [001]   300.053184: block_rq_complete: 8,0 W () 103669040 + 16 [0]
  konsole-2065  [001]   300.053191:   8,0    C   W 103669040 + 16 [0]

  ksoftirqd/1-7   [001]  1072.953811:   8,0    C   N (5a 00 08 00 00 00 00 00 24 00) [0]
  ksoftirqd/1-7   [001]  1072.953813: block_rq_complete: 0,0 N (5a 00 08 00 00 00 00 00 24 00) 0 + 0 [0]

rq_insert:
  kjournald-480   [000]   303.084985: block_rq_insert: 8,0 W 0 () 102736984 + 8 [kjournald]
  kjournald-480   [000]   303.084986:   8,0    I   W 102736984 + 8 [kjournald]

Changelog from v2 -> v3:

- use the newly introduced __dynamic_array().

Changelog from v1 -> v2:

- use __string() instead of __array() to minimize the memory required
  to store hex dump of rq->cmd().

- support large pc requests.

- add missing blk_fill_rwbs_rq() in block_rq_requeue TRACE_EVENT.

- some cleanups.
Signed-off-by: NLi Zefan <lizf@cn.fujitsu.com>
LKML-Reference: <4A2DF669.5070905@cn.fujitsu.com>
Signed-off-by: NSteven Rostedt <rostedt@goodmis.org>

55782138

04 6月, 2009 1 次提交

security: use mmap_min_addr indepedently of security models · e0a94c2a

由 Christoph Lameter 提交于 6月 03, 2009

This patch removes the dependency of mmap_min_addr on CONFIG_SECURITY.
It also sets a default mmap_min_addr of 4096.

mmapping of addresses below 4096 will only be possible for processes
with CAP_SYS_RAWIO.
Signed-off-by: NChristoph Lameter <cl@linux-foundation.org>
Acked-by: NEric Paris <eparis@redhat.com>
Looks-ok-by: NLinus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: NJames Morris <jmorris@namei.org>

e0a94c2a

29 5月, 2009 4 次提交

memcg: fix build warning and avoid checking for mem != null again and again · 46f7e602

由 Nikanth Karthikesan 提交于 5月 28, 2009

Fix build warning, "mem_cgroup_is_obsolete defined but not used" when
CONFIG_DEBUG_VM is not set.  Also avoid checking for !mem again and again.
Signed-off-by: NNikanth Karthikesan <knikanth@suse.de>
Acked-by: NPekka Enberg <penberg@cs.helsinki.fi>
Acked-by: NKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

46f7e602

mm: account for MAP_SHARED mappings using VM_MAYSHARE and not VM_SHARED in hugetlbfs · f83a275d

由 Mel Gorman 提交于 5月 28, 2009

Addresses http://bugzilla.kernel.org/show_bug.cgi?id=13302

hugetlbfs reserves huge pages but does not fault them at mmap() time to
ensure that future faults succeed.  The reservation behaviour differs
depending on whether the mapping was mapped MAP_SHARED or MAP_PRIVATE.
For MAP_SHARED mappings, hugepages are reserved when mmap() is first
called and are tracked based on information associated with the inode.
Other processes mapping MAP_SHARED use the same reservation.  MAP_PRIVATE
track the reservations based on the VMA created as part of the mmap()
operation.  Each process mapping MAP_PRIVATE must make its own
reservation.

hugetlbfs currently checks if a VMA is MAP_SHARED with the VM_SHARED flag
and not VM_MAYSHARE.  For file-backed mappings, such as hugetlbfs,
VM_SHARED is set only if the mapping is MAP_SHARED and the file was opened
read-write.  If a shared memory mapping was mapped shared-read-write for
populating of data and mapped shared-read-only by other processes, then
hugetlbfs would account for the mapping as if it was MAP_PRIVATE.  This
causes processes to fail to map the file MAP_SHARED even though it should
succeed as the reservation is there.

This patch alters mm/hugetlb.c and replaces VM_SHARED with VM_MAYSHARE
when the intent of the code was to check whether the VMA was mapped
MAP_SHARED or MAP_PRIVATE.
Signed-off-by: NMel Gorman <mel@csn.ul.ie>
Cc: Hugh Dickins <hugh.dickins@tiscali.co.uk>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: <stable@kernel.org>
Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: <starlight@binnacle.cx>
Cc: Eric B Munson <ebmunson@us.ibm.com>
Cc: Adam Litke <agl@us.ibm.com>
Cc: Andy Whitcroft <apw@canonical.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

f83a275d

memcg: fix deadlock between lock_page_cgroup and mapping tree_lock · e767e056

由 Daisuke Nishimura 提交于 5月 28, 2009

mapping->tree_lock can be acquired from interrupt context.  Then,
following dead lock can occur.

Assume "A" as a page.

 CPU0:
       lock_page_cgroup(A)
		interrupted
			-> take mapping->tree_lock.
 CPU1:
       take mapping->tree_lock
		-> lock_page_cgroup(A)

This patch tries to fix above deadlock by moving memcg's hook to out of
mapping->tree_lock.  charge/uncharge of pagecache/swapcache is protected
by page lock, not tree_lock.

After this patch, lock_page_cgroup() is not called under mapping->tree_lock.
Signed-off-by: NKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: NDaisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

e767e056

oom: fix possible oom_dump_tasks NULL pointer · 6d2661ed

由 David Rientjes 提交于 5月 28, 2009

When /proc/sys/vm/oom_dump_tasks is enabled, it is possible to get a NULL
pointer for tasks that have detached mm's since task_lock() is not held
during the tasklist scan.  Add the task_lock().
Acked-by: NNick Piggin <npiggin@suse.de>
Acked-by: NMel Gorman <mel@csn.ul.ie>
Cc: Rik van Riel <riel@redhat.com>
Signed-off-by: NDavid Rientjes <rientjes@google.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

6d2661ed

23 5月, 2009 1 次提交

block: Use accessor functions for queue limits · ae03bf63

由 Martin K. Petersen 提交于 5月 22, 2009

Convert all external users of queue limits to using wrapper functions
instead of poking the request queue variables directly.
Signed-off-by: NMartin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: NJens Axboe <jens.axboe@oracle.com>

ae03bf63

22 5月, 2009 3 次提交

integrity: move ima_counts_get · c9d9ac52

由 Mimi Zohar 提交于 5月 19, 2009

Based on discussion on lkml (Andrew Morton and Eric Paris),
move ima_counts_get down a layer into shmem/hugetlb__file_setup().
Resolves drm shmem_file_setup() usage case as well.

HD comment:
  I still think you're doing this at the wrong level, but recognize
  that you probably won't be persuaded until a few more users of
  alloc_file() emerge, all wanting your ima_counts_get().

  Resolving GEM's shmem_file_setup() is an improvement, so I'll say
Acked-by: NHugh Dickins <hugh.dickins@tiscali.co.uk>
Signed-off-by: NMimi Zohar <zohar@us.ibm.com>
Signed-off-by: NJames Morris <jmorris@namei.org>

c9d9ac52

integrity: path_check update · b9fc745d

由 Mimi Zohar 提交于 5月 19, 2009

- Add support in ima_path_check() for integrity checking without
incrementing the counts. (Required for nfsd.)
- rename and export opencount_get to ima_counts_get
- replace ima_shm_check calls with ima_counts_get
- export ima_path_check
Signed-off-by: NMimi Zohar <zohar@us.ibm.com>
Signed-off-by: NJames Morris <jmorris@namei.org>

b9fc745d

hugh: update email address · 98f32602

由 Hugh Dickins 提交于 5月 21, 2009

My old address will shut down in a few days time: remove it from the tree,
and add a tmpfs (shmem filesystem) maintainer entry with the new address.
Signed-off-by: NHugh Dickins <hugh@veritas.com>
Signed-off-by: NHugh Dickins <hugh.dickins@tiscali.co.uk>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

98f32602

18 5月, 2009 3 次提交

[ARM] Double check memmap is actually valid with a memmap has unexpected holes V2 · eb33575c

由 Mel Gorman 提交于 5月 13, 2009

pfn_valid() is meant to be able to tell if a given PFN has valid memmap
associated with it or not. In FLATMEM, it is expected that holes always
have valid memmap as long as there is valid PFNs either side of the hole.
In SPARSEMEM, it is assumed that a valid section has a memmap for the
entire section.

However, ARM and maybe other embedded architectures in the future free
memmap backing holes to save memory on the assumption the memmap is never
used. The page_zone linkages are then broken even though pfn_valid()
returns true. A walker of the full memmap must then do this additional
check to ensure the memmap they are looking at is sane by making sure the
zone and PFN linkages are still valid. This is expensive, but walkers of
the full memmap are extremely rare.

This was caught before for FLATMEM and hacked around but it hits again for
SPARSEMEM because the page_zone linkages can look ok where the PFN linkages
are totally screwed. This looks like a hatchet job but the reality is that
any clean solution would end up consumning all the memory saved by punching
these unexpected holes in the memmap. For example, we tried marking the
memmap within the section invalid but the section size exceeds the size of
the hole in most cases so pfn_valid() starts returning false where valid
memmap exists. Shrinking the size of the section would increase memory
consumption offsetting the gains.

This patch identifies when an architecture is punching unexpected holes
in the memmap that the memory model cannot automatically detect and sets
ARCH_HAS_HOLES_MEMORYMODEL. At the moment, this is restricted to EP93xx
which is the model sub-architecture this has been reported on but may expand
later. When set, walkers of the full memmap must call memmap_valid_within()
for each PFN and passing in what it expects the page and zone to be for
that PFN. If it finds the linkages to be broken, it assumes the memmap is
invalid for that PFN.
Signed-off-by: NMel Gorman <mel@csn.ul.ie>
Signed-off-by: NRussell King <rmk+kernel@arm.linux.org.uk>

eb33575c

mm, x86: remove MEMORY_HOTPLUG_RESERVE related code · 888a589f

由 Yinghai Lu 提交于 5月 15, 2009

after:

 | commit b263295d
 | Author: Christoph Lameter <clameter@sgi.com>
 | Date:   Wed Jan 30 13:30:47 2008 +0100
 |
 |    x86: 64-bit, make sparsemem vmemmap the only memory model

we don't have MEMORY_HOTPLUG_RESERVE anymore.

Historically, x86-64 had an architecture-specific method for memory hotplug
whereby it scanned the SRAT for physical memory ranges that could be
potentially used for memory hot-add later. By reserving those ranges
without physical memory, the memmap would be allocated and left dormant
until needed. This depended on the DISCONTIG memory model which has been
removed so the code implementing HOTPLUG_RESERVE is now dead.

This patch removes the dead code used by MEMORY_HOTPLUG_RESERVE.

(Changelog authored by Mel.)

v2: updated changelog, and remove hotadd= in doc

[ Impact: remove dead code ]
Signed-off-by: NYinghai Lu <yinghai@kernel.org>
Reviewed-by: NChristoph Lameter <cl@linux-foundation.org>
Reviewed-by: NMel Gorman <mel@csn.ul.ie>
Workflow-found-OK-by: NAndrew Morton <akpm@linux-foundation.org>
LKML-Reference: <4A0C4910.7090508@kernel.org>
Signed-off-by: NIngo Molnar <mingo@elte.hu>

888a589f

page-writeback: fix the calculation of the oldest_jif in wb_kupdate() · 22ef37ee

由 Toshiyuki Okajima 提交于 5月 16, 2009

wb_kupdate() function has a bug on linux-2.6.30-rc5.  This bug causes
generic_sync_sb_inodes() to start to write inodes back much earlier than
our expectations because it miscalculates oldest_jif in wb_kupdate().

This bug was introduced in 704503d8
('mm: fix proc_dointvec_userhz_jiffies "breakage"').
Signed-off-by: NToshiyuki Okajima <toshi.okajima@jp.fujitsu.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

22ef37ee

15 5月, 2009 1 次提交

Revert "mm: add /proc controls for pdflush threads" · cd17cbfd

由 Jens Axboe 提交于 5月 15, 2009

This reverts commit fafd688e.

Work is progressing to switch away from pdflush as the process backing
for flushing out dirty data. So it seems pointless to add more knobs
to control pdflush threads. The original author of the patch did not
have any specific use cases for adding the knobs, so we can easily
revert this before 2.6.30 to avoid having to maintain this API
forever.
Signed-off-by: NJens Axboe <jens.axboe@oracle.com>

cd17cbfd

13 5月, 2009 1 次提交

Revert "Ignore madvise(MADV_WILLNEED) for hugetlbfs-backed regions" · 0f181328

由 Linus Torvalds 提交于 5月 13, 2009

This reverts commit a425a638.

Now that the previous commit removed the "readpage" actor for hugetlb
files, read-ahead will no longer mess up the mapping, and there's no
longer any reason to treat hugetlbfs mappings specially.
Tested-and-acked-by: NMel Gorman <mel@csn.ul.ie>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

0f181328

08 5月, 2009 1 次提交

NOMMU: Don't check vm_region::vm_start is page aligned in add_nommu_region() · 8c9ed899

由 David Howells 提交于 5月 07, 2009

Don't check vm_region::vm_start is page aligned in add_nommu_region() because
the region may reflect some non-page-aligned mapped file, such as could be
obtained from RomFS XIP.
Signed-off-by: NDavid Howells <dhowells@redhat.com>
Acked-by: NGreg Ungerer <gerg@uclinux.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

8c9ed899

07 5月, 2009 5 次提交

nommu: make the initial mmap allocation excess behaviour Kconfig configurable · fc4d5c29

由 David Howells 提交于 5月 06, 2009

NOMMU mmap() has an option controlled by a sysctl variable that determines
whether the allocations made by do_mmap_private() should have the excess
space trimmed off and returned to the allocator.  Make the initial setting
of this variable a Kconfig configuration option.

The reason there can be excess space is that the allocator only allocates
in power-of-2 size chunks, but mmap()'s can be made in sizes that aren't a
power of 2.

There are two alternatives:

 (1) Keep the excess as dead space.  The dead space then remains unused for the
     lifetime of the mapping.  Mappings of shared objects such as libc, ld.so
     or busybox's text segment may retain their dead space forever.

 (2) Return the excess to the allocator.  This means that the dead space is
     limited to less than a page per mapping, but it means that for a transient
     process, there's more chance of fragmentation as the excess space may be
     reused fairly quickly.

During the boot process, a lot of transient processes are created, and
this can cause a lot of fragmentation as the pagecache and various slabs
grow greatly during this time.

By turning off the trimming of excess space during boot and disabling
batching of frees, Coldfire can manage to boot.

A better way of doing things might be to have /sbin/init turn this option
off.  By that point libc, ld.so and init - which are all long-duration
processes - have all been loaded and trimmed.
Reported-by: NLanttor Guo <lanttor.guo@freescale.com>
Signed-off-by: NDavid Howells <dhowells@redhat.com>
Tested-by: NLanttor Guo <lanttor.guo@freescale.com>
Cc: Greg Ungerer <gerg@snapgear.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

fc4d5c29

nommu: clamp zone_batchsize() to 0 under NOMMU conditions · 3a6be87f

由 David Howells 提交于 5月 06, 2009

Clamp zone_batchsize() to 0 under NOMMU conditions to stop
free_hot_cold_page() from queueing and batching frees.

The problem is that under NOMMU conditions it is really important to be
able to allocate large contiguous chunks of memory, but when munmap() or
exit_mmap() releases big stretches of memory, return of these to the buddy
allocator can be deferred, and when it does finally happen, it can be in
small chunks.

Whilst the fragmentation this incurs isn't so much of a problem under MMU
conditions as userspace VM is glued together from individual pages with
the aid of the MMU, it is a real problem if there isn't an MMU.

By clamping the page freeing queue size to 0, pages are returned to the
allocator immediately, and the buddy detector is more likely to be able to
glue them together into large chunks immediately, and fragmentation is
less likely to occur.

By disabling batching of frees, and by turning off the trimming of excess
space during boot, Coldfire can manage to boot.
Reported-by: NLanttor Guo <lanttor.guo@freescale.com>
Signed-off-by: NDavid Howells <dhowells@redhat.com>
Tested-by: NLanttor Guo <lanttor.guo@freescale.com>
Cc: Greg Ungerer <gerg@snapgear.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

3a6be87f

mm: use roundown_pow_of_two() in zone_batchsize() · 9155203a

由 David Howells 提交于 5月 06, 2009

Use roundown_pow_of_two(N) in zone_batchsize() rather than (1 <<
(fls(N)-1)) as they are equivalent, and with the former it is easier to
see what is going on.
Signed-off-by: NDavid Howells <dhowells@redhat.com>
Tested-by: NLanttor Guo <lanttor.guo@freescale.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

9155203a

alloc_vmap_area: fix memory leak · 2498ce42

由 Ralph Wuerthner 提交于 5月 06, 2009

If alloc_vmap_area() fails the allocated struct vmap_area has to be freed.
Signed-off-by: NRalph Wuerthner <ralphw@linux.vnet.ibm.com>
Reviewed-by: NChristoph Lameter <cl@linux-foundation.org>
Reviewed-by: NMinchan Kim <minchan.kim@gmail.com>
Cc: Nick Piggin <npiggin@suse.de>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

2498ce42

oom: prevent livelock when oom_kill_allocating_task is set · 184101bf

由 David Rientjes 提交于 5月 06, 2009

When /proc/sys/vm/oom_kill_allocating_task is set for large systems that
want to avoid the lengthy tasklist scan, it's possible to livelock if
current is ineligible for oom kill.  This normally happens when it is set
to OOM_DISABLE, but is also possible if any threads are sharing the same
->mm with a different tgid.

So change __out_of_memory() to fall back to the full task-list scan if it
was unable to kill `current'.

Cc: Nick Piggin <npiggin@suse.de>
Signed-off-by: NDavid Rientjes <rientjes@google.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

184101bf

06 5月, 2009 3 次提交

mm: SLOB fix reclaim_state · 1f0532eb

由 Nick Piggin 提交于 5月 05, 2009

SLOB does not correctly account reclaim_state.reclaimed_slab, so it will
break memory reclaim. Account it like SLAB does.

Cc: stable@kernel.org
Cc: linux-mm@kvack.org
Acked-by: NMatt Mackall <mpm@selenic.com>
Acked-by: NChristoph Lameter <cl@linux-foundation.org>
Signed-off-by: NNick Piggin <npiggin@suse.de>
Signed-off-by: NPekka Enberg <penberg@cs.helsinki.fi>

1f0532eb

mm: SLUB fix reclaim_state · 1eb5ac64

由 Nick Piggin 提交于 5月 05, 2009

SLUB does not correctly account reclaim_state.reclaimed_slab, so it will
break memory reclaim. Account it like SLAB does.

Cc: stable@kernel.org
Cc: linux-mm@kvack.org
Cc: Matt Mackall <mpm@selenic.com>
Acked-by: NChristoph Lameter <cl@linux-foundation.org>
Signed-off-by: NNick Piggin <npiggin@suse.de>
Signed-off-by: NPekka Enberg <penberg@cs.helsinki.fi>

1eb5ac64

Ignore madvise(MADV_WILLNEED) for hugetlbfs-backed regions · a425a638

由 Mel Gorman 提交于 5月 05, 2009

madvise(MADV_WILLNEED) forces page cache readahead on a range of memory
backed by a file.  The assumption is made that the page required is
order-0 and "normal" page cache.

On hugetlbfs, this assumption is not true and order-0 pages are
allocated and inserted into the hugetlbfs page cache.  This leaks
hugetlbfs page reservations and can cause BUGs to trigger related to
corrupted page tables.

This patch causes MADV_WILLNEED to be ignored for hugetlbfs-backed
regions.
Signed-off-by: NMel Gorman <mel@csn.ul.ie>
Cc: stable@kernel.org
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

a425a638

03 5月, 2009 6 次提交

vmscan: avoid multiplication overflow in shrink_zone() · 8713e012

由 Andrew Morton 提交于 4月 30, 2009

Local variable `scan' can overflow on zones which are larger than

	(2G * 4k) / 100 = 80GB.

Making it 64-bit on 64-bit will fix that up.

Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Lee Schermerhorn <lee.schermerhorn@hp.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

8713e012

mm: fix Committed_AS underflow on large NR_CPUS environment · 00a62ce9

由 KOSAKI Motohiro 提交于 4月 30, 2009

The Committed_AS field can underflow in certain situations:

>         # while true; do cat /proc/meminfo  | grep _AS; sleep 1; done | uniq -c
>               1 Committed_AS: 18446744073709323392 kB
>              11 Committed_AS: 18446744073709455488 kB
>               6 Committed_AS:    35136 kB
>               5 Committed_AS: 18446744073709454400 kB
>               7 Committed_AS:    35904 kB
>               3 Committed_AS: 18446744073709453248 kB
>               2 Committed_AS:    34752 kB
>               9 Committed_AS: 18446744073709453248 kB
>               8 Committed_AS:    34752 kB
>               3 Committed_AS: 18446744073709320960 kB
>               7 Committed_AS: 18446744073709454080 kB
>               3 Committed_AS: 18446744073709320960 kB
>               5 Committed_AS: 18446744073709454080 kB
>               6 Committed_AS: 18446744073709320960 kB

Because NR_CPUS can be greater than 1000 and meminfo_proc_show() does
not check for underflow.

But NR_CPUS proportional isn't good calculation.  In general,
possibility of lock contention is proportional to the number of online
cpus, not theorical maximum cpus (NR_CPUS).

The current kernel has generic percpu-counter stuff.  using it is right
way.  it makes code simplify and percpu_counter_read_positive() don't
make underflow issue.
Reported-by: NDave Hansen <dave@linux.vnet.ibm.com>
Signed-off-by: NKOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Eric B Munson <ebmunson@us.ibm.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Christoph Lameter <cl@linux-foundation.org>
Cc: <stable@kernel.org>		[All kernel versions]
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

00a62ce9

memcg: fix mem_cgroup_shrink_usage() · ae3abae6

由 Daisuke Nishimura 提交于 4月 30, 2009

Current mem_cgroup_shrink_usage() has two problems.

1. It doesn't call mem_cgroup_out_of_memory and doesn't update
   last_oom_jiffies, so pagefault_out_of_memory invokes global OOM.

2. Considering hierarchy, shrinking has to be done from the
   mem_over_limit, not from the memcg which the page would be charged to.

mem_cgroup_try_charge_swapin() does all of these things properly, so we
use it and call cancel_charge_swapin when it succeeded.

The name of "shrink_usage" is not appropriate for this behavior, so we
change it too.
Signed-off-by: NDaisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Acked-by: NKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Li Zefan <lizf@cn.fujitsu.cn>
Cc: Paul Menage <menage@google.com>
Cc: Dhaval Giani <dhaval@linux.vnet.ibm.com>
Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Cc: YAMAMOTO Takashi <yamamoto@valinux.co.jp>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

ae3abae6

mm: close page_mkwrite races · b827e496

由 Nick Piggin 提交于 4月 30, 2009

Change page_mkwrite to allow implementations to return with the page
locked, and also change it's callers (in page fault paths) to hold the
lock until the page is marked dirty.  This allows the filesystem to have
full control of page dirtying events coming from the VM.

Rather than simply hold the page locked over the page_mkwrite call, we
call page_mkwrite with the page unlocked and allow callers to return with
it locked, so filesystems can avoid LOR conditions with page lock.

The problem with the current scheme is this: a filesystem that wants to
associate some metadata with a page as long as the page is dirty, will
perform this manipulation in its ->page_mkwrite.  It currently then must
return with the page unlocked and may not hold any other locks (according
to existing page_mkwrite convention).

In this window, the VM could write out the page, clearing page-dirty.  The
filesystem has no good way to detect that a dirty pte is about to be
attached, so it will happily write out the page, at which point, the
filesystem may manipulate the metadata to reflect that the page is no
longer dirty.

It is not always possible to perform the required metadata manipulation in
->set_page_dirty, because that function cannot block or fail.  The
filesystem may need to allocate some data structure, for example.

And the VM cannot mark the pte dirty before page_mkwrite, because
page_mkwrite is allowed to fail, so we must not allow any window where the
page could be written to if page_mkwrite does fail.

This solution of holding the page locked over the 3 critical operations
(page_mkwrite, setting the pte dirty, and finally setting the page dirty)
closes out races nicely, preventing page cleaning for writeout being
initiated in that window.  This provides the filesystem with a strong
synchronisation against the VM here.

- Sage needs this race closed for ceph filesystem.
- Trond for NFS (http://bugzilla.kernel.org/show_bug.cgi?id=12913).
- I need it for fsblock.
- I suspect other filesystems may need it too (eg. btrfs).
- I have converted buffer.c to the new locking. Even simple block allocation
  under dirty pages might be susceptible to i_size changing under partial page
  at the end of file (we also have a buffer.c-side problem here, but it cannot
  be fixed properly without this patch).
- Other filesystems (eg. NFS, maybe btrfs) will need to change their
  page_mkwrite functions themselves.

[ This also moves page_mkwrite another step closer to fault, which should
  eventually allow page_mkwrite to be moved into ->fault, and thus avoiding a
  filesystem calldown and page lock/unlock cycle in __do_fault. ]

[akpm@linux-foundation.org: fix derefs of NULL ->mapping]
Cc: Sage Weil <sage@newdream.net>
Cc: Trond Myklebust <trond.myklebust@fys.uio.no>
Signed-off-by: NNick Piggin <npiggin@suse.de>
Cc: Valdis Kletnieks <Valdis.Kletnieks@vt.edu>
Cc: <stable@kernel.org>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

b827e496

memcg: fix try_get_mem_cgroup_from_swapcache() · c0bd3f63

由 Daisuke Nishimura 提交于 4月 30, 2009

This is a bugfix for commit 3c776e64
("memcg: charge swapcache to proper memcg").

Used bit of swapcache is solid under page lock, but considering
move_account, pc->mem_cgroup is not.

We need lock_page_cgroup() anyway.
Signed-off-by: NDaisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Acked-by: NKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

c0bd3f63

mm: fix pageref leak in do_swap_page() · bc43f75c

由 Johannes Weiner 提交于 4月 30, 2009

By the time the memory cgroup code is notified about a swapin we
already hold a reference on the fault page.

If the cgroup callback fails make sure to unlock AND release the page
reference which was taken by lookup_swap_cach(), or we leak the reference.
Signed-off-by: NJohannes Weiner <hannes@cmpxchg.org>
Cc: Balbir Singh <balbir@linux.vnet.ibm.com>
Reviewed-by: NMinchan Kim <minchan.kim@gmail.com>
Acked-by: NKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

bc43f75c

24 4月, 2009 1 次提交

x86, bts, mm: clean up buffer allocation · 1cb81b14

由 Markus Metzger 提交于 4月 24, 2009

The current mm interface is asymetric. One function allocates a locked
buffer, another function only refunds the memory.

Change this to have two functions for accounting and refunding locked
memory, respectively; and do the actual buffer allocation in ptrace.

[ Impact: refactor BTS buffer allocation code ]
Signed-off-by: NMarkus Metzger <markus.t.metzger@intel.com>
Acked-by: NAndrew Morton <akpm@linux-foundation.org>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <20090424095143.A30265@sedona.ch.intel.com>
Signed-off-by: NIngo Molnar <mingo@elte.hu>

1cb81b14

23 4月, 2009 1 次提交

slub: enforce MAX_ORDER · 818cf590

由 David Rientjes 提交于 4月 23, 2009

slub_max_order may not be equal to or greater than MAX_ORDER.

Additionally, if a single object cannot be placed in a slab of
slub_max_order, it still must allocate slabs below MAX_ORDER.
Acked-by: NChristoph Lameter <cl@linux-foundation.org>
Signed-off-by: NDavid Rientjes <rientjes@google.com>
Signed-off-by: NPekka Enberg <penberg@cs.helsinki.fi>

818cf590

22 4月, 2009 1 次提交

vmscan,memcg: reintroduce sc->may_swap · 2e2e4259

由 KOSAKI Motohiro 提交于 4月 21, 2009

Commit a6dc60f8 ("vmscan: rename
sc.may_swap to may_unmap") removed the may_swap flag, but memcg had used
it as a flag for "we need to use swap?", as the name indicate.

And in the current implementation, memcg cannot reclaim mapped file
caches when mem+swap hits the limit.

re-introduce may_swap flag and handle it at get_scan_ratio().  This
patch doesn't influence any scan_control users other than memcg.
Signed-off-by: NKOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: NDaisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Acked-by: NJohannes Weiner <hannes@cmpxchg.org>
Cc: Balbir Singh <balbir@linux.vnet.ibm.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

2e2e4259

19 4月, 2009 1 次提交

PM/Hibernate: Fix memory shrinking · a21e2553

由 Rafael J. Wysocki 提交于 4月 18, 2009

Commit d979677c ("mm: shrink_all_memory(): use sc.nr_reclaimed")
broke the memory shrinking used by hibernation, becuse it did not update
shrink_all_zones() in accordance with the other changes it made.

Fix this by making shrink_all_zones() update sc->nr_reclaimed instead of
overwriting its value.

This fixes http://bugzilla.kernel.org/show_bug.cgi?id=13058Reported-and-tested-by: NAlan Jenkins <alan-jenkins@tuffmail.co.uk>
Signed-off-by: NRafael J. Wysocki <rjw@sisk.pl>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

a21e2553

OpenHarmony / kernel_linux 上一次同步 3 年多

OpenHarmony / kernel_linux
上一次同步 3 年多