提交 · 15b7c5142049e7efc3071280e1370dc3b8add6f5 · openeuler / raspberrypi-kernel

06 10月, 2010 1 次提交

SLUB: Optimize slab_free() debug check · 15b7c514

由 Pekka Enberg 提交于 10月 02, 2010

This patch optimizes slab_free() debug check to use "c->node != NUMA_NO_NODE"
instead of "c->node >= 0" because the former generates smaller code on x86-64:

  Before:

    4736:       48 39 70 08             cmp    %rsi,0x8(%rax)
    473a:       75 26                   jne    4762 <kfree+0xa2>
    473c:       44 8b 48 10             mov    0x10(%rax),%r9d
    4740:       45 85 c9                test   %r9d,%r9d
    4743:       78 1d                   js     4762 <kfree+0xa2>

  After:

    4736:       48 39 70 08             cmp    %rsi,0x8(%rax)
    473a:       75 23                   jne    475f <kfree+0x9f>
    473c:       83 78 10 ff             cmpl   $0xffffffffffffffff,0x10(%rax)
    4740:       74 1d                   je     475f <kfree+0x9f>

This patch also cleans up __slab_alloc() to use NUMA_NO_NODE instead of "-1"
for enabling debugging for a per-CPU cache.
Acked-by: NChristoph Lameter <cl@linux.com>
Acked-by: NDavid Rientjes <rientjes@google.com>
Signed-off-by: NPekka Enberg <penberg@kernel.org>

15b7c514

02 10月, 2010 22 次提交

slub: Move NUMA-related functions under CONFIG_NUMA · 5d1f57e4

由 Namhyung Kim 提交于 9月 29, 2010

Make kmalloc_cache_alloc_node_notrace(), kmalloc_large_node()
and __kmalloc_node_track_caller() to be compiled only when
CONFIG_NUMA is selected.
Acked-by: NDavid Rientjes <rientjes@google.com>
Signed-off-by: NNamhyung Kim <namhyung@gmail.com>
Signed-off-by: NPekka Enberg <penberg@kernel.org>

5d1f57e4

slub: Add lock release annotation · 3478973d

由 Namhyung Kim 提交于 9月 29, 2010

The unfreeze_slab() releases page's PG_locked bit but was missing
proper annotation. The deactivate_slab() needs to be marked also
since it calls unfreeze_slab() without grabbing the lock.
Acked-by: NDavid Rientjes <rientjes@google.com>
Signed-off-by: NNamhyung Kim <namhyung@gmail.com>
Signed-off-by: NPekka Enberg <penberg@kernel.org>

3478973d

slub: Fix signedness warnings · a5dd5c11

由 Namhyung Kim 提交于 9月 29, 2010

The bit-ops routines require its arg to be a pointer to unsigned long.
This leads sparse to complain about different signedness as follows:

 mm/slub.c:2425:49: warning: incorrect type in argument 2 (different signedness)
 mm/slub.c:2425:49:    expected unsigned long volatile *addr
 mm/slub.c:2425:49:    got long *map
Acked-by: NChristoph Lameter <cl@linux.com>
Acked-by: NDavid Rientjes <rientjes@google.com>
Signed-off-by: NNamhyung Kim <namhyung@gmail.com>
Signed-off-by: NPekka Enberg <penberg@kernel.org>

a5dd5c11

slub: extract common code to remove objects from partial list without locking · 62e346a8

由 Christoph Lameter 提交于 9月 28, 2010

There are a couple of places where repeat the same statements when removing
a page from the partial list. Consolidate that into __remove_partial().
Acked-by: NDavid Rientjes <rientjes@google.com>
Signed-off-by: NChristoph Lameter <cl@linux.com>
Signed-off-by: NPekka Enberg <penberg@kernel.org>

62e346a8

SLUB: Pass active and inactive redzone flags instead of boolean to debug functions · f7cb1933

由 Christoph Lameter 提交于 9月 29, 2010

Pass the actual values used for inactive and active redzoning to the
functions that check the objects. Avoids a lot of the ? : things to
lookup the values in the functions.
Acked-by: NDavid Rientjes <rientjes@google.com>
Signed-off-by: NChristoph Lameter <cl@linux.com>
Signed-off-by: NPekka Enberg <penberg@kernel.org>

f7cb1933

slub: reduce differences between SMP and NUMA · 7340cc84

由 Christoph Lameter 提交于 9月 28, 2010

Reduce the #ifdefs and simplify bootstrap by making SMP and NUMA as much alike
as possible. This means that there will be an additional indirection to get to
the kmem_cache_node field under SMP.
Acked-by: NDavid Rientjes <rientjes@google.com>
Signed-off-by: NChristoph Lameter <cl@linux.com>
Signed-off-by: NPekka Enberg <penberg@kernel.org>

7340cc84

Revert "Slub: UP bandaid" · ed59ecbf

由 Pekka Enberg 提交于 9月 18, 2010

This reverts commit 5249d039500f05a5ab379286b1d23ab9b04d3f2c. It's not needed
after commit bbddff05 ("percpu: use percpu
allocator on UP too").

ed59ecbf

percpu: clear memory allocated with the km allocator · ed6c1115

由 Tejun Heo 提交于 9月 10, 2010

Percpu allocator should clear memory before returning it but the km
allocator forgot to do it.  Fix it.
Signed-off-by: NTejun Heo <tj@kernel.org>
Reported-by: NPeter Zijlstra <peterz@infradead.org>
Acked-by: NPeter Zijlstra <peterz@infradead.org>

ed6c1115

percpu: use percpu allocator on UP too · 9b8327bb

由 Tejun Heo 提交于 9月 03, 2010

On UP, percpu allocations were redirected to kmalloc.  This has the
following problems.

* For certain amount of allocations (determined by
  PERCPU_DYNAMIC_EARLY_SLOTS and PERCPU_DYNAMIC_EARLY_SIZE), percpu
  allocator can be used before the usual kernel memory allocator is
  brought online.  On SMP, this is used to initialize the kernel
  memory allocator.

* percpu allocator honors alignment upto PAGE_SIZE but kmalloc()
  doesn't.  For example, workqueue makes use of larger alignments for
  cpu_workqueues.

Currently, users of percpu allocators need to handle UP differently,
which is somewhat fragile and ugly.  Other than small amount of
memory, there isn't much to lose by enabling percpu allocator on UP.
It can simply use kernel memory based chunk allocation which was added
for SMP archs w/o MMUs.

This patch removes mm/percpu_up.c, builds mm/percpu.c on UP too and
makes UP build use percpu-km.  As percpu addresses and kernel
addresses are always identity mapped and static percpu variables don't
need any special treatment, nothing is arch dependent and mm/percpu.c
implements generic setup_per_cpu_areas() for UP.
Signed-off-by: NTejun Heo <tj@kernel.org>
Cc: Christoph Lameter <cl@linux-foundation.org>
Cc: Pekka Enberg <penberg@cs.helsinki.fi>

9b8327bb

vmalloc: pcpu_get/free_vm_areas() aren't needed on UP · 0bc14062

由 Tejun Heo 提交于 9月 03, 2010

These functions are used only by percpu memory allocator on SMP.
Don't build them on UP.
Signed-off-by: NTejun Heo <tj@kernel.org>
Cc: Nick Piggin <npiggin@kernel.dk>

0bc14062

SLUB: Fix merged slab cache names · 84c1cf62

由 Pekka Enberg 提交于 9月 14, 2010

As explained by Linus "I'm Proud to be an American" Torvalds:

  Looking at the merging code, I actually think it's totally
  buggy. If you have something like this:

   - load module A: create slab cache A

   - load module B: create slab cache B that can merge with A

   - unload module A

   - "cat /proc/slabinfo": BOOM. Oops.

  exactly because the name is not handled correctly, and you'll have
  module B holding open a slab cache that has a name pointer that points
  to module A that no longer exists.

This patch fixes the problem by using kstrdup() to allocate dynamic memory for
->name of "struct kmem_cache" as suggested by Christoph Lameter.
Acked-by: NChristoph Lameter <cl@linux.com>
Cc: David Rientjes <rientjes@google.com>
Reported-by: NLinus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: NPekka Enberg <penberg@kernel.org>

Conflicts:

	mm/slub.c

84c1cf62

Slub: UP bandaid · db210e70

由 Christoph Lameter 提交于 8月 26, 2010

Since the percpu allocator does not provide early allocation in UP mode (only
in SMP configurations) use __get_free_page() to improvise a compound page
allocation that can be later freed via kfree().

Compound pages will be released when the cpu caches are resized.
Acked-by: NDavid Rientjes <rientjes@google.com>
Signed-off-by: NChristoph Lameter <cl@linux.com>
Signed-off-by: NPekka Enberg <penberg@kernel.org>

db210e70

slub: fix SLUB_RESILIENCY_TEST for dynamic kmalloc caches · a016471a

由 David Rientjes 提交于 8月 25, 2010

Now that the kmalloc_caches array is dynamically allocated at boot,
SLUB_RESILIENCY_TEST needs to be fixed to pass the correct type.
Acked-by: NChristoph Lameter <cl@linux.com>
Signed-off-by: NDavid Rientjes <rientjes@google.com>
Signed-off-by: NPekka Enberg <penberg@kernel.org>

a016471a

slub: Fix up missing kmalloc_cache -> kmem_cache_node case for memoryhotplug · 8de66a0c

由 Christoph Lameter 提交于 8月 25, 2010

Memory hotplug allocates and frees per node structures. Use the correct name.
Acked-by: NDavid Rientjes <rientjes@google.com>
Acked-by: NRandy Dunlap <randy.dunlap@oracle.com>
Signed-off-by: NChristoph Lameter <cl@linux.com>
Signed-off-by: NPekka Enberg <penberg@kernel.org>

8de66a0c

slub: Add dummy functions for the !SLUB_DEBUG case · 7d550c56

由 Christoph Lameter 提交于 8月 25, 2010

On Wed, 25 Aug 2010, Randy Dunlap wrote:
> mm/slub.c:1732: error: implicit declaration of function 'slab_pre_alloc_hook'
> mm/slub.c:1751: error: implicit declaration of function 'slab_post_alloc_hook'
> mm/slub.c:1881: error: implicit declaration of function 'slab_free_hook'
> mm/slub.c:1886: error: implicit declaration of function 'slab_free_hook_irq'

Empty functions are missing if the runtime debuggability option is compiled
out.

Provide the fall back functions to empty hooks if SLUB_DEBUG is not set.
Acked-by: NRandy Dunlap <randy.dunlap@oracle.com>
Acked-by: NDavid Rientjes <rientjes@google.com>
Signed-off-by: NChristoph Lameter <cl@linux.com>
Signed-off-by: NPekka Enberg <penberg@kernel.org>

7d550c56

slob: fix gfp flags for order-0 page allocations · 8df275af

由 David Rientjes 提交于 8月 22, 2010

kmalloc_node() may allocate higher order slob pages, but the __GFP_COMP
bit is only passed to the page allocator and not represented in the
tracepoint event.  The bit should be passed to trace_kmalloc_node() as
well.
Acked-by: NMatt Mackall <mpm@selenic.com>
Reviewed-by: NChristoph Lameter <cl@linux.com>
Signed-off-by: NDavid Rientjes <rientjes@google.com>
Signed-off-by: NPekka Enberg <penberg@kernel.org>

8df275af

slub: Move gfpflag masking out of the hotpath · c1d50836

由 Christoph Lameter 提交于 8月 20, 2010

Move the gfpflags masking into the hooks for checkers and into the slowpaths.
gfpflag masking requires access to a global variable and thus adds an
additional cacheline reference to the hotpaths.

If no hooks are active then the gfpflag masking will result in
code that the compiler can toss out.
Acked-by: NDavid Rientjes <rientjes@google.com>
Signed-off-by: NChristoph Lameter <cl@linux-foundation.org>
Signed-off-by: NPekka Enberg <penberg@kernel.org>

c1d50836

slub: Extract hooks for memory checkers from hotpaths · c016b0bd

由 Christoph Lameter 提交于 8月 20, 2010

Extract the code that memory checkers and other verification tools use from
the hotpaths. Makes it easier to add new ones and reduces the disturbances
of the hotpaths.
Signed-off-by: NChristoph Lameter <cl@linux-foundation.org>
Signed-off-by: NPekka Enberg <penberg@kernel.org>

c016b0bd

slub: Dynamically size kmalloc cache allocations · 51df1142

由 Christoph Lameter 提交于 8月 20, 2010

kmalloc caches are statically defined and may take up a lot of space just
because the sizes of the node array has to be dimensioned for the largest
node count supported.

This patch makes the size of the kmem_cache structure dynamic throughout by
creating a kmem_cache slab cache for the kmem_cache objects. The bootstrap
occurs by allocating the initial one or two kmem_cache objects from the
page allocator.

C2->C3
	- Fix various issues indicated by David
	- Make create kmalloc_cache return a kmem_cache * pointer.
Acked-by: NDavid Rientjes <rientjes@google.com>
Signed-off-by: NChristoph Lameter <cl@linux-foundation.org>
Signed-off-by: NPekka Enberg <penberg@kernel.org>

51df1142

slub: Remove static kmem_cache_cpu array for boot · 6c182dc0

由 Christoph Lameter 提交于 8月 20, 2010

The percpu allocator can now handle allocations during early boot.
So drop the static kmem_cache_cpu array.

Cc: Tejun Heo <tj@kernel.org>
Acked-by: NDavid Rientjes <rientjes@google.com>
Signed-off-by: NChristoph Lameter <cl@linux-foundation.org>
Signed-off-by: NPekka Enberg <penberg@kernel.org>

6c182dc0

slub: Remove dynamic dma slab allocation · 55136592

由 Christoph Lameter 提交于 8月 20, 2010

Remove the dynamic dma slab allocation since this causes too many issues with
nested locks etc etc. The change avoids passing gfpflags into many functions.

V3->V4:
- Create dma caches in kmem_cache_init() instead of kmem_cache_init_late().
Acked-by: NDavid Rientjes <rientjes@google.com>
Signed-off-by: NChristoph Lameter <cl@linux-foundation.org>
Signed-off-by: NPekka Enberg <penberg@kernel.org>

55136592

slub: Force no inlining of debug functions · 1537066c

由 Christoph Lameter 提交于 8月 20, 2010

Compiler folds the debgging functions into the critical paths.
Avoid that by adding noinline to the functions that check for
problems.
Acked-by: NDavid Rientjes <rientjes@google.com>
Signed-off-by: NChristoph Lameter <cl@linux.com>
Signed-off-by: NPekka Enberg <penberg@kernel.org>

1537066c

26 9月, 2010 1 次提交

Avoid pgoff overflow in remap_file_pages · 5ec1055a

由 Larry Woodman 提交于 9月 24, 2010

Thomas Pollet noticed that the remap_file_pages() system call in
fremap.c has a potential overflow in the first part of the if statement
below, which could cause it to process bogus input parameters.
Specifically the pgoff + size parameters could be wrap thereby
preventing the system call from failing when it should.
Reported-by: NThomas Pollet <thomas.pollet@gmail.com>
Signed-off-by: NLarry Woodman <lwoodman@redhat.com>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

5ec1055a

25 9月, 2010 1 次提交

fremap: get rid of broken 'end' variable · e92b05de

由 Linus Torvalds 提交于 9月 24, 2010

Thomas Pollet points out that the 'end' variable is broken. It was
computed based on start/size before they were page-aligned, and as such
doesn't actually match any of the other actions we take. The overflow
test on end was also redundant, since we had already tested it with the
properly aligned version.

So just get rid of it entirely. The one remaining use for that broken
variable can just use 'start+size' like all the other cases already did.
Reported-by: NThomas Pollet <thomas.pollet@gmail.com>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

e92b05de

24 9月, 2010 4 次提交

hugetlb, rmap: add BUG_ON(!PageLocked) in hugetlb_add_anon_rmap() · a850ea30

由 Naoya Horiguchi 提交于 9月 10, 2010

Confirming page lock is held in hugetlb_add_anon_rmap() may be useful
to detect possible future problems.
Signed-off-by: NNaoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Acked-by: NRik van Riel <riel@redhat.com>
Acked-by: NAndrea Arcangeli <aarcange@redhat.com>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

a850ea30

hugetlb, rmap: fix confusing page locking in hugetlb_cow() · 56c9cfb1

由 Naoya Horiguchi 提交于 9月 10, 2010

The "if (!trylock_page)" block in the avoidcopy path of hugetlb_cow()
looks confusing and is buggy.  Originally this trylock_page() was
intended to make sure that old_page is locked even when old_page !=
pagecache_page, because then only pagecache_page is locked.

This patch fixes it by moving page locking into hugetlb_fault().
Signed-off-by: NNaoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Acked-by: NRik van Riel <riel@redhat.com>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

56c9cfb1

hugetlb, rmap: use hugepage_add_new_anon_rmap() in hugetlb_cow() · cd67f0d2

由 Naoya Horiguchi 提交于 9月 10, 2010

Obviously, setting anon_vma for COWed hugepage should be done
by hugepage_add_new_anon_rmap() to scan vmas faster.
This patch fixes it.
Signed-off-by: NNaoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Acked-by: NAndrea Arcangeli <aarcange@redhat.com>
Reviewed-by: NRik van Riel <riel@redhat.com>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

cd67f0d2

hugetlb, rmap: always use anon_vma root pointer · 433abed6

由 Naoya Horiguchi 提交于 9月 10, 2010

This patch applies Andrea's fix given by the following patch into hugepage
rmapping code:

  commit 288468c3
  Author: Andrea Arcangeli <aarcange@redhat.com>
  Date:   Mon Aug 9 17:19:09 2010 -0700

This patch uses anon_vma->root and avoids unnecessary overwriting when
anon_vma is already set up.
Signed-off-by: NNaoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Acked-by: NAndrea Arcangeli <aarcange@redhat.com>
Reviewed-by: NRik van Riel <riel@redhat.com>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

433abed6

23 9月, 2010 4 次提交

mmap: call unlink_anon_vmas() in __split_vma() in case of error · 2aeadc30

由 Andrea Arcangeli 提交于 9月 22, 2010

If __split_vma fails because of an out of memory condition the
anon_vma_chain isn't teardown and freed potentially leading to rmap walks
accessing freed vma information plus there's a memleak.
Signed-off-by: NAndrea Arcangeli <aarcange@redhat.com>
Acked-by: NJohannes Weiner <jweiner@redhat.com>
Acked-by: NRik van Riel <riel@redhat.com>
Acked-by: NHugh Dickins <hughd@google.com>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: <stable@kernel.org>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

2aeadc30

oom: filter unkillable tasks from tasklist dump · e85bfd3a

由 David Rientjes 提交于 9月 22, 2010

/proc/sys/vm/oom_dump_tasks is enabled by default, so it's necessary to
limit as much information as possible that it should emit.

The tasklist dump should be filtered to only those tasks that are eligible
for oom kill.  This is already done for memcg ooms, but this patch extends
it to both cpuset and mempolicy ooms as well as init.

In addition to suppressing irrelevant information, this also reduces
confusion since users currently don't know which tasks in the tasklist
aren't eligible for kill (such as those attached to cpusets or bound to
mempolicies with a disjoint set of mems or nodes, respectively) since that
information is not shown.
Signed-off-by: NDavid Rientjes <rientjes@google.com>
Reviewed-by: NKOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

e85bfd3a

vmscan: check all_unreclaimable in direct reclaim path · d1908362

由 Minchan Kim 提交于 9月 22, 2010

M.  Vefa Bicakci reported 2.6.35 kernel hang up when hibernation on his
32bit 3GB mem machine.
(https://bugzilla.kernel.org/show_bug.cgi?id=16771). Also he bisected
the regression to

  commit bb21c7ce
  Author: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
  Date:   Fri Jun 4 14:15:05 2010 -0700

     vmscan: fix do_try_to_free_pages() return value when priority==0 reclaim failure

At first impression, this seemed very strange because the above commit
only chenged function return value and hibernate_preallocate_memory()
ignore return value of shrink_all_memory().  But it's related.

Now, page allocation from hibernation code may enter infinite loop if the
system has highmem.  The reasons are that vmscan don't care enough OOM
case when oom_killer_disabled.

The problem sequence is following as.

1. hibernation
2. oom_disable
3. alloc_pages
4. do_try_to_free_pages
       if (scanning_global_lru(sc) && !all_unreclaimable)
               return 1;

If kswapd is not freozen, it would set zone->all_unreclaimable to 1 and
then shrink_zones maybe return true(ie, all_unreclaimable is true).  So at
last, alloc_pages could go to _nopage_.  If it is, it should have no
problem.

This patch adds all_unreclaimable check to protect in direct reclaim path,
too.  It can care of hibernation OOM case and help bailout
all_unreclaimable case slightly.
Signed-off-by: NKOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: NMinchan Kim <minchan.kim@gmail.com>
Reported-by: NM. Vefa Bicakci <bicave@superonline.com>
Reported-by: <caiqian@redhat.com>
Reviewed-by: NJohannes Weiner <hannes@cmpxchg.org>
Tested-by: <caiqian@redhat.com>
Acked-by: NRafael J. Wysocki <rjw@sisk.pl>
Acked-by: NRik van Riel <riel@redhat.com>
Acked-by: NKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: <stable@kernel.org>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

d1908362

oom: always return a badness score of non-zero for eligible tasks · f19e8aa1

由 David Rientjes 提交于 9月 22, 2010

A task's badness score is roughly a proportion of its rss and swap
compared to the system's capacity.  The scale ranges from 0 to 1000 with
the highest score chosen for kill.  Thus, this scale operates on a
resolution of 0.1% of RAM + swap.  Admin tasks are also given a 3% bonus,
so the badness score of an admin task using 3% of memory, for example,
would still be 0.

It's possible that an exceptionally large number of tasks will combine to
exhaust all resources but never have a single task that uses more than
0.1% of RAM and swap (or 3.0% for admin tasks).

This patch ensures that the badness score of any eligible task is never 0
so the machine doesn't unnecessarily panic because it cannot find a task
to kill.
Signed-off-by: NDavid Rientjes <rientjes@google.com>
Cc: Dave Hansen <dave@linux.vnet.ibm.com>
Cc: Nitin Gupta <ngupta@vflare.org>
Cc: Pekka Enberg <penberg@cs.helsinki.fi>
Cc: Minchan Kim <minchan.kim@gmail.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

f19e8aa1

22 9月, 2010 1 次提交

bdi: Initialize noop_backing_dev_info properly · 976e48f8

由 Jan Kara 提交于 9月 21, 2010

Properly initialize this backing dev info so that writeback code does not
barf when getting to it e.g. via sb->s_bdi.

Cc: stable@kernel.org
Signed-off-by: NJan Kara <jack@suse.cz>
Signed-off-by: NJens Axboe <jaxboe@fusionio.com>

976e48f8

21 9月, 2010 2 次提交

percpu: fix pcpu_last_unit_cpu · 46b30ea9

由 Tejun Heo 提交于 9月 21, 2010

pcpu_first/last_unit_cpu are used to track which cpu has the first and
last units assigned.  This in turn is used to determine the span of a
chunk for man/unmap cache flushes and whether an address belongs to
the first chunk or not in per_cpu_ptr_to_phys().

When the number of possible CPUs isn't power of two, a chunk may
contain unassigned units towards the end of a chunk.  The logic to
determine pcpu_last_unit_cpu was incorrect when there was an unused
unit at the end of a chunk.  It failed to ignore the unused unit and
assigned the unused marker NR_CPUS to pcpu_last_unit_cpu.

This was discovered through kdump failure which was caused by
malfunctioning per_cpu_ptr_to_phys() on a kvm setup with 50 possible
CPUs by CAI Qian.
Signed-off-by: NTejun Heo <tj@kernel.org>
Reported-by: NCAI Qian <caiqian@redhat.com>
Cc: stable@kernel.org

46b30ea9

mm: further fix swapin race condition · 31c4a3d3

由 Hugh Dickins 提交于 9月 19, 2010

Commit 4969c119 ("mm: fix swapin race condition") is now agreed to
be incomplete.  There's a race, not very much less likely than the
original race envisaged, in which it is further necessary to check that
the swapcache page's swap has not changed.

Here's the reasoning: cast in terms of reuse_swap_page(), but probably
could be reformulated to rely on try_to_free_swap() instead, or on
swapoff+swapon.

A, faults into do_swap_page(): does page1 = lookup_swap_cache(swap1) and
comes through the lock_page(page1).

B, a racing thread of the same process, faults on the same address: does
page1 = lookup_swap_cache(swap1) and now waits in lock_page(page1), but
for whatever reason is unlucky not to get the lock any time soon.

A carries on through do_swap_page(), a write fault, but cannot reuse the
swap page1 (another reference to swap1).  Unlocks the page1 (but B
doesn't get it yet), does COW in do_wp_page(), page2 now in that pte.

C, perhaps the parent of A+B, comes in and write faults the same swap
page1 into its mm, reuse_swap_page() succeeds this time, swap1 is freed.

kswapd comes in after some time (B still unlucky) and swaps out some
pages from A+B and C: it allocates the original swap1 to page2 in A+B,
and some other swap2 to the original page1 now in C.  But does not
immediately free page1 (actually it couldn't: B holds a reference),
leaving it in swap cache for now.

B at last gets the lock on page1, hooray! Is PageSwapCache(page1)? Yes.
Is pte_same(*page_table, orig_pte)? Yes, because page2 has now been
given the swap1 which page1 used to have.  So B proceeds to insert page1
into A+B's page_table, though its content now belongs to C, quite
different from what A wrote there.

B ought to have checked that page1's swap was still swap1.
Signed-off-by: NHugh Dickins <hughd@google.com>
Reviewed-by: NRik van Riel <riel@redhat.com>
Cc: stable@kernel.org
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

31c4a3d3

10 9月, 2010 4 次提交

mm: page allocator: drain per-cpu lists after direct reclaim allocation fails · 9ee493ce

由 Mel Gorman 提交于 9月 09, 2010

When under significant memory pressure, a process enters direct reclaim
and immediately afterwards tries to allocate a page.  If it fails and no
further progress is made, it's possible the system will go OOM.  However,
on systems with large amounts of memory, it's possible that a significant
number of pages are on per-cpu lists and inaccessible to the calling
process.  This leads to a process entering direct reclaim more often than
it should increasing the pressure on the system and compounding the
problem.

This patch notes that if direct reclaim is making progress but allocations
are still failing that the system is already under heavy pressure.  In
this case, it drains the per-cpu lists and tries the allocation a second
time before continuing.
Signed-off-by: NMel Gorman <mel@csn.ul.ie>
Reviewed-by: NMinchan Kim <minchan.kim@gmail.com>
Reviewed-by: NKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Reviewed-by: NKOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Reviewed-by: NChristoph Lameter <cl@linux.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

9ee493ce

mm: page allocator: calculate a better estimate of NR_FREE_PAGES when memory... · aa454840

由 Christoph Lameter 提交于 9月 09, 2010

mm: page allocator: calculate a better estimate of NR_FREE_PAGES when memory is low and kswapd is awake

Ordinarily watermark checks are based on the vmstat NR_FREE_PAGES as it is
cheaper than scanning a number of lists. To avoid synchronization
overhead, counter deltas are maintained on a per-cpu basis and drained
both periodically and when the delta is above a threshold. On large CPU
systems, the difference between the estimated and real value of
NR_FREE_PAGES can be very high. If NR_FREE_PAGES is much higher than
number of real free page in buddy, the VM can allocate pages below min
watermark, at worst reducing the real number of pages to zero. Even if
the OOM killer kills some victim for freeing memory, it may not free
memory if the exit path requires a new page resulting in livelock.

This patch introduces a zone_page_state_snapshot() function (courtesy of
Christoph) that takes a slightly more accurate view of an arbitrary vmstat
counter. It is used to read NR_FREE_PAGES while kswapd is awake to avoid
the watermark being accidentally broken. The estimate is not perfect and
may result in cache line bounces but is expected to be lighter than the
IPI calls necessary to continually drain the per-cpu counters while kswapd
is awake.
Signed-off-by: NChristoph Lameter <cl@linux.com>
Signed-off-by: NMel Gorman <mel@csn.ul.ie>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

aa454840

mm: page allocator: update free page counters after pages are placed on the free list · 72853e29

由 Mel Gorman 提交于 9月 09, 2010

When allocating a page, the system uses NR_FREE_PAGES counters to
determine if watermarks would remain intact after the allocation was made.
This check is made without interrupts disabled or the zone lock held and
so is race-prone by nature. Unfortunately, when pages are being freed in
batch, the counters are updated before the pages are added on the list.
During this window, the counters are misleading as the pages do not exist
yet. When under significant pressure on systems with large numbers of
CPUs, it's possible for processes to make progress even though they should
have been stalled. This is particularly problematic if a number of the
processes are using GFP_ATOMIC as the min watermark can be accidentally
breached and in extreme cases, the system can livelock.

This patch updates the counters after the pages have been added to the
list. This makes the allocator more cautious with respect to preserving
the watermarks and mitigates livelock possibilities.

[akpm@linux-foundation.org: avoid modifying incoming args]
Signed-off-by: NMel Gorman <mel@csn.ul.ie>
Reviewed-by: NRik van Riel <riel@redhat.com>
Reviewed-by: NMinchan Kim <minchan.kim@gmail.com>
Reviewed-by: NKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Reviewed-by: NChristoph Lameter <cl@linux.com>
Reviewed-by: NKOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Acked-by: NJohannes Weiner <hannes@cmpxchg.org>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

72853e29

vmstat: update zone stat threshold when onlining a cpu · 5ee28a44

由 KAMEZAWA Hiroyuki 提交于 9月 09, 2010

refresh_zone_stat_thresholds() calculates parameter based on the number of
online cpus.  It's called at cpu offlining but needs to be called at
onlining, too.
Signed-off-by: NKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Christoph Lameter <cl@linux-foundation.org>
Acked-by: NMel Gorman <mel@csn.ul.ie>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

5ee28a44