- 02 Oct 2010, 13 commits
-
-
By Tejun Heo

These functions are used only by the percpu memory allocator on SMP. Don't build them on UP.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Nick Piggin <npiggin@kernel.dk>
-
By Pekka Enberg

As explained by Linus "I'm Proud to be an American" Torvalds:

  Looking at the merging code, I actually think it's totally buggy. If you have something like this:

   - load module A: create slab cache A
   - load module B: create slab cache B that can merge with A
   - unload module A
   - "cat /proc/slabinfo": BOOM. Oops.

  exactly because the name is not handled correctly, and you'll have module B holding open a slab cache that has a name pointer that points to module A that no longer exists.

This patch fixes the problem by using kstrdup() to allocate dynamic memory for the ->name field of struct kmem_cache, as suggested by Christoph Lameter.

Acked-by: Christoph Lameter <cl@linux.com>
Cc: David Rientjes <rientjes@google.com>
Reported-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Pekka Enberg <penberg@kernel.org>
Conflicts:
	mm/slub.c
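The lifetime bug is easy to see in miniature. Below is a hedged user-space analogue (plain C, not the kernel code; the struct and the cache_create* names are illustrative): keeping only the caller's name pointer leaves it dangling once the caller's memory goes away, while duplicating the string, as the kstrdup() fix does, decouples the two lifetimes.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

struct cache { char *name; };

/* Buggy variant: keeps the caller's pointer, mirroring the old
 * behaviour where ->name pointed into module A's memory. */
struct cache *cache_create_buggy(const char *name)
{
    struct cache *c = malloc(sizeof(*c));
    c->name = (char *)name;          /* dangles once the caller frees it */
    return c;
}

/* Fixed variant: duplicate the string, as the kstrdup() patch does,
 * so the cache owns its name for its whole lifetime. */
struct cache *cache_create(const char *name)
{
    struct cache *c = malloc(sizeof(*c));
    c->name = strdup(name);          /* independent lifetime */
    return c;
}

int main(void)
{
    char *module_a_name = strdup("cache_A");
    struct cache *c = cache_create(module_a_name);
    free(module_a_name);             /* "unload module A" */
    printf("%s\n", c->name);         /* safe only with the fixed variant */
    free(c->name);
    free(c);
    return 0;
}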
-
By Christoph Lameter

Since the percpu allocator does not provide early allocation in UP mode (only in SMP configurations), use __get_free_page() to improvise a compound-page allocation that can later be freed via kfree(). Compound pages will be released when the cpu caches are resized.

Acked-by: David Rientjes <rientjes@google.com>
Signed-off-by: Christoph Lameter <cl@linux.com>
Signed-off-by: Pekka Enberg <penberg@kernel.org>
-
By David Rientjes

Now that the kmalloc_caches array is dynamically allocated at boot, SLUB_RESILIENCY_TEST needs to be fixed to pass the correct type.

Acked-by: Christoph Lameter <cl@linux.com>
Signed-off-by: David Rientjes <rientjes@google.com>
Signed-off-by: Pekka Enberg <penberg@kernel.org>
-
By Christoph Lameter

Memory hotplug allocates and frees per-node structures. Use the correct name.

Acked-by: David Rientjes <rientjes@google.com>
Acked-by: Randy Dunlap <randy.dunlap@oracle.com>
Signed-off-by: Christoph Lameter <cl@linux.com>
Signed-off-by: Pekka Enberg <penberg@kernel.org>
-
By Christoph Lameter

On Wed, 25 Aug 2010, Randy Dunlap wrote:

> mm/slub.c:1732: error: implicit declaration of function 'slab_pre_alloc_hook'
> mm/slub.c:1751: error: implicit declaration of function 'slab_post_alloc_hook'
> mm/slub.c:1881: error: implicit declaration of function 'slab_free_hook'
> mm/slub.c:1886: error: implicit declaration of function 'slab_free_hook_irq'

Empty functions are missing if the runtime debuggability option is compiled out. Provide fallbacks to empty hooks if SLUB_DEBUG is not set.

Acked-by: Randy Dunlap <randy.dunlap@oracle.com>
Acked-by: David Rientjes <rientjes@google.com>
Signed-off-by: Christoph Lameter <cl@linux.com>
Signed-off-by: Pekka Enberg <penberg@kernel.org>
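The fix follows the usual kernel compile-out pattern. Here is a hedged, self-contained sketch of it, reusing two of the hook names from the build errors above but with simplified signatures that are not the real mm/slub.c prototypes:

#include <stdio.h>

/* Toggle to mimic CONFIG_SLUB_DEBUG being set or not. */
/* #define CONFIG_SLUB_DEBUG 1 */

#ifdef CONFIG_SLUB_DEBUG
static void slab_pre_alloc_hook(void)  { printf("pre-alloc checks\n"); }
static void slab_post_alloc_hook(void) { printf("post-alloc checks\n"); }
#else
/* Empty fallbacks: callers compile unchanged and the calls vanish. */
static inline void slab_pre_alloc_hook(void)  { }
static inline void slab_post_alloc_hook(void) { }
#endif

int main(void)
{
    slab_pre_alloc_hook();
    /* ... allocation fast path ... */
    slab_post_alloc_hook();
    return 0;
}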
-
By David Rientjes

kmalloc_node() may allocate higher-order slob pages, but the __GFP_COMP bit is only passed to the page allocator and not represented in the tracepoint event. The bit should be passed to trace_kmalloc_node() as well.

Acked-by: Matt Mackall <mpm@selenic.com>
Reviewed-by: Christoph Lameter <cl@linux.com>
Signed-off-by: David Rientjes <rientjes@google.com>
Signed-off-by: Pekka Enberg <penberg@kernel.org>
-
By Christoph Lameter

Move the gfpflags masking into the hooks for checkers and into the slowpaths. gfpflags masking requires access to a global variable and thus adds an additional cacheline reference to the hotpaths. If no hooks are active, the gfpflags masking will result in code that the compiler can toss out.

Acked-by: David Rientjes <rientjes@google.com>
Signed-off-by: Christoph Lameter <cl@linux-foundation.org>
Signed-off-by: Pekka Enberg <penberg@kernel.org>
-
By Christoph Lameter

Extract the code that memory checkers and other verification tools use from the hotpaths. This makes it easier to add new ones and reduces disturbance of the hotpaths.

Signed-off-by: Christoph Lameter <cl@linux-foundation.org>
Signed-off-by: Pekka Enberg <penberg@kernel.org>
-
By Christoph Lameter

kmalloc caches are statically defined and may take up a lot of space just because the size of the node array has to be dimensioned for the largest node count supported. This patch makes the size of the kmem_cache structure dynamic throughout by creating a kmem_cache slab cache for the kmem_cache objects. The bootstrap occurs by allocating the initial one or two kmem_cache objects from the page allocator.

C2->C3:
- Fix various issues indicated by David
- Make create_kmalloc_cache return a kmem_cache * pointer.

Acked-by: David Rientjes <rientjes@google.com>
Signed-off-by: Christoph Lameter <cl@linux-foundation.org>
Signed-off-by: Pekka Enberg <penberg@kernel.org>
-
By Christoph Lameter

The percpu allocator can now handle allocations during early boot, so drop the static kmem_cache_cpu array.

Cc: Tejun Heo <tj@kernel.org>
Acked-by: David Rientjes <rientjes@google.com>
Signed-off-by: Christoph Lameter <cl@linux-foundation.org>
Signed-off-by: Pekka Enberg <penberg@kernel.org>
-
By Christoph Lameter

Remove the dynamic dma slab allocation, since it causes too many issues with nested locks and the like. The change avoids passing gfpflags into many functions.

V3->V4:
- Create dma caches in kmem_cache_init() instead of kmem_cache_init_late().

Acked-by: David Rientjes <rientjes@google.com>
Signed-off-by: Christoph Lameter <cl@linux-foundation.org>
Signed-off-by: Pekka Enberg <penberg@kernel.org>
-
By Christoph Lameter

The compiler folds the debugging functions into the critical paths. Avoid that by adding noinline to the functions that check for problems.

Acked-by: David Rientjes <rientjes@google.com>
Signed-off-by: Christoph Lameter <cl@linux.com>
Signed-off-by: Pekka Enberg <penberg@kernel.org>
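For illustration, a minimal standalone sketch of the technique (the kernel's noinline macro expands to the GCC/Clang attribute used here; check_object and hot_path are invented names, not the mm/slub.c functions):

#include <stdio.h>

/* The kernel's noinline macro expands to this attribute. */
#define noinline __attribute__((noinline))

/* Debug-only consistency check: noinline keeps its body (and its
 * register pressure) out of the hot caller instead of being folded
 * into it by the optimizer. */
static noinline int check_object(const void *obj)
{
    return obj != NULL;   /* stand-in for the real verification */
}

static int hot_path(const void *obj)
{
    if (!check_object(obj))   /* one call instruction, nothing more */
        return -1;
    return 0;
}

int main(void)
{
    int x = 42;
    printf("%d\n", hot_path(&x));
    return 0;
}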
-
- 26 Sep 2010, 1 commit
-
-
By Larry Woodman

Thomas Pollet noticed that the remap_file_pages() system call in fremap.c has a potential overflow in the first part of the if statement below, which could cause it to process bogus input parameters. Specifically, the pgoff + size parameters could wrap, thereby preventing the system call from failing when it should.

Reported-by: Thomas Pollet <thomas.pollet@gmail.com>
Signed-off-by: Larry Woodman <lwoodman@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
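As a hedged standalone sketch of this class of bug (variable names illustrative, not the fremap.c originals): with unsigned arithmetic, pgoff plus the page count can wrap past zero and slip under a limit check unless the wrap is rejected explicitly.

#include <stdio.h>

#define PAGE_SHIFT 12UL

/* Buggy check: pgoff + nr_pages can wrap, making bogus input look valid. */
static int check_buggy(unsigned long pgoff, unsigned long size, unsigned long max)
{
    return pgoff + (size >> PAGE_SHIFT) <= max;   /* wraps! */
}

/* Robust check: reject the wrap explicitly before comparing. */
static int check_fixed(unsigned long pgoff, unsigned long size, unsigned long max)
{
    unsigned long nr = size >> PAGE_SHIFT;
    if (pgoff + nr < pgoff)          /* unsigned overflow detected */
        return 0;
    return pgoff + nr <= max;
}

int main(void)
{
    unsigned long pgoff = ~0UL - 1, size = 4UL << PAGE_SHIFT, max = 1000;
    printf("buggy: %d fixed: %d\n",
           check_buggy(pgoff, size, max),   /* 1: the wrap slips through */
           check_fixed(pgoff, size, max));  /* 0: correctly rejected */
    return 0;
}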
-
- 25 Sep 2010, 1 commit
-
-
By Linus Torvalds

Thomas Pollet points out that the 'end' variable is broken. It was computed based on start/size before they were page-aligned, and as such doesn't actually match any of the other actions we take. The overflow test on end was also redundant, since we had already tested it with the properly aligned version. So just get rid of it entirely. The one remaining use for that broken variable can just use 'start+size' like all the other cases already did.

Reported-by: Thomas Pollet <thomas.pollet@gmail.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
- 24 Sep 2010, 4 commits
-
-
By Naoya Horiguchi

Confirming that the page lock is held in hugetlb_add_anon_rmap() may be useful for detecting possible future problems.

Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Acked-by: Rik van Riel <riel@redhat.com>
Acked-by: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
By Naoya Horiguchi

The "if (!trylock_page)" block in the avoidcopy path of hugetlb_cow() looks confusing and is buggy. Originally this trylock_page() was intended to make sure that old_page is locked even when old_page != pagecache_page, because in that case only pagecache_page is locked. This patch fixes it by moving the page locking into hugetlb_fault().

Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Acked-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
By Naoya Horiguchi

Obviously, setting the anon_vma for a COWed hugepage should be done by hugepage_add_new_anon_rmap() to scan vmas faster. This patch fixes that.

Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Acked-by: Andrea Arcangeli <aarcange@redhat.com>
Reviewed-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
By Naoya Horiguchi

This patch applies Andrea's fix from the following commit to the hugepage rmapping code:

  commit 288468c3
  Author: Andrea Arcangeli <aarcange@redhat.com>
  Date: Mon Aug 9 17:19:09 2010 -0700

It uses anon_vma->root and avoids unnecessary overwriting when the anon_vma is already set up.

Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Acked-by: Andrea Arcangeli <aarcange@redhat.com>
Reviewed-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
- 23 Sep 2010, 4 commits
-
-
By Andrea Arcangeli

If __split_vma fails because of an out-of-memory condition, the anon_vma_chain isn't torn down and freed, potentially leading to rmap walks accessing freed vma information; there is also a memory leak.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Acked-by: Johannes Weiner <jweiner@redhat.com>
Acked-by: Rik van Riel <riel@redhat.com>
Acked-by: Hugh Dickins <hughd@google.com>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
By David Rientjes

/proc/sys/vm/oom_dump_tasks is enabled by default, so it's necessary to limit the information it emits as much as possible. The tasklist dump should be filtered to only those tasks that are eligible for oom kill. This is already done for memcg ooms, but this patch extends it to cpuset and mempolicy ooms as well as init. In addition to suppressing irrelevant information, this also reduces confusion, since users currently can't tell which tasks in the tasklist aren't eligible for kill (such as those attached to cpusets or bound to mempolicies with a disjoint set of mems or nodes, respectively) because that information is not shown.

Signed-off-by: David Rientjes <rientjes@google.com>
Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
By Minchan Kim

M. Vefa Bicakci reported a 2.6.35 kernel hang during hibernation on his 32-bit 3GB machine (https://bugzilla.kernel.org/show_bug.cgi?id=16771). He also bisected the regression to:

  commit bb21c7ce
  Author: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
  Date: Fri Jun 4 14:15:05 2010 -0700

      vmscan: fix do_try_to_free_pages() return value when priority==0 reclaim failure

At first this seemed very strange, because the above commit only changed a function return value, and hibernate_preallocate_memory() ignores the return value of shrink_all_memory(). But it is related: page allocation from hibernation code may now enter an infinite loop if the system has highmem, because vmscan doesn't care enough about the OOM case when oom_killer_disabled.

The problem sequence is as follows:

1. hibernation
2. oom_disable
3. alloc_pages
4. do_try_to_free_pages
   if (scanning_global_lru(sc) && !all_unreclaimable)
       return 1;

If kswapd is not frozen, it would set zone->all_unreclaimable to 1, and then shrink_zones may return true (i.e., all_unreclaimable is true), so alloc_pages could eventually reach the nopage path; if it did, there would be no problem. This patch adds an all_unreclaimable check to protect the direct reclaim path, too. It handles the hibernation OOM case and helps bail out of the all_unreclaimable case slightly.

Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Minchan Kim <minchan.kim@gmail.com>
Reported-by: M. Vefa Bicakci <bicave@superonline.com>
Reported-by: <caiqian@redhat.com>
Reviewed-by: Johannes Weiner <hannes@cmpxchg.org>
Tested-by: <caiqian@redhat.com>
Acked-by: Rafael J. Wysocki <rjw@sisk.pl>
Acked-by: Rik van Riel <riel@redhat.com>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
By David Rientjes

A task's badness score is roughly the proportion of its rss and swap to the system's capacity. The scale ranges from 0 to 1000, with the highest score chosen for kill. Thus, this scale operates at a resolution of 0.1% of RAM + swap. Admin tasks are also given a 3% bonus, so the badness score of an admin task using 3% of memory, for example, would still be 0. It's possible that an exceptionally large number of tasks will combine to exhaust all resources without a single task ever using more than 0.1% of RAM and swap (or 3.0% for admin tasks). This patch ensures that the badness score of any eligible task is never 0, so the machine doesn't unnecessarily panic because it cannot find a task to kill.

Signed-off-by: David Rientjes <rientjes@google.com>
Cc: Dave Hansen <dave@linux.vnet.ibm.com>
Cc: Nitin Gupta <ngupta@vflare.org>
Cc: Pekka Enberg <penberg@cs.helsinki.fi>
Cc: Minchan Kim <minchan.kim@gmail.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
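A simplified sketch of the rounding problem and the floor applied by the fix; the arithmetic below is illustrative and omits details such as the admin bonus, so it is not the exact oom_kill.c formula:

#include <stdio.h>

/* Badness ~ proportion of (rss + swap) to total capacity, scaled 0..1000. */
static long badness(long task_pages, long total_pages)
{
    long points = task_pages * 1000 / total_pages;  /* truncates to 0 below 0.1% */
    if (points == 0)
        points = 1;   /* the fix: an eligible task never scores 0 */
    return points;
}

int main(void)
{
    long total = 4L * 1024 * 1024;               /* 4M pages of RAM + swap */
    printf("%ld\n", badness(1000, total));       /* ~0.02% usage: 1, not 0 */
    printf("%ld\n", badness(total / 2, total));  /* 50% usage: 500 */
    return 0;
}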
-
- 22 Sep 2010, 1 commit
-
-
By Jan Kara

Properly initialize this backing dev info so that writeback code does not barf when getting to it, e.g. via sb->s_bdi.

Cc: stable@kernel.org
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
-
- 21 Sep 2010, 2 commits
-
-
By Tejun Heo

pcpu_first/last_unit_cpu are used to track which cpus have the first and last units assigned. This in turn is used to determine the span of a chunk for map/unmap cache flushes and whether an address belongs to the first chunk or not in per_cpu_ptr_to_phys(). When the number of possible CPUs isn't a power of two, a chunk may contain unassigned units towards its end. The logic to determine pcpu_last_unit_cpu was incorrect when there was an unused unit at the end of a chunk: it failed to ignore the unused unit and assigned the unused marker NR_CPUS to pcpu_last_unit_cpu. This was discovered through a kdump failure caused by a malfunctioning per_cpu_ptr_to_phys() on a kvm setup with 50 possible CPUs, reported by CAI Qian.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: CAI Qian <caiqian@redhat.com>
Cc: stable@kernel.org
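A minimal sketch of the corrected scan, assuming a unit-to-cpu map where unassigned trailing units carry the NR_CPUS sentinel; the data layout is illustrative, not the percpu allocator's real structures:

#include <stdio.h>

#define NR_CPUS 64   /* sentinel marking an unassigned unit */

/* Walk the chunk's unit->cpu map and record the first and last
 * *assigned* cpus, skipping the NR_CPUS markers that pad a
 * non-power-of-two cpu count out to the chunk's unit count. */
static void find_unit_span(const int *unit_map, int nr_units,
                           int *first, int *last)
{
    *first = *last = NR_CPUS;
    for (int i = 0; i < nr_units; i++) {
        if (unit_map[i] == NR_CPUS)
            continue;            /* the bug was failing to skip these */
        if (*first == NR_CPUS)
            *first = unit_map[i];
        *last = unit_map[i];
    }
}

int main(void)
{
    /* 50 possible cpus padded to 52 units, echoing the kdump report. */
    int unit_map[52];
    for (int i = 0; i < 50; i++)
        unit_map[i] = i;
    unit_map[50] = unit_map[51] = NR_CPUS;

    int first, last;
    find_unit_span(unit_map, 52, &first, &last);
    printf("first=%d last=%d\n", first, last);   /* first=0 last=49 */
    return 0;
}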
-
By Hugh Dickins

Commit 4969c119 ("mm: fix swapin race condition") is now agreed to be incomplete. There's a race, not very much less likely than the original race envisaged, in which it is further necessary to check that the swapcache page's swap has not changed.

Here's the reasoning, cast in terms of reuse_swap_page(), though it could probably be reformulated to rely on try_to_free_swap() instead, or on swapoff+swapon.

A, faulting into do_swap_page(): does page1 = lookup_swap_cache(swap1) and comes through the lock_page(page1).

B, a racing thread of the same process, faults on the same address: does page1 = lookup_swap_cache(swap1) and now waits in lock_page(page1), but for whatever reason is unlucky not to get the lock any time soon.

A carries on through do_swap_page(), a write fault, but cannot reuse the swap page1 (another reference to swap1). It unlocks page1 (but B doesn't get it yet), does COW in do_wp_page(), and page2 is now in that pte.

C, perhaps the parent of A+B, comes in and write faults the same swap page1 into its mm; reuse_swap_page() succeeds this time, and swap1 is freed.

kswapd comes in after some time (B still unlucky) and swaps out some pages from A+B and C: it allocates the original swap1 to page2 in A+B, and some other swap2 to the original page1 now in C. But it does not immediately free page1 (actually it couldn't: B holds a reference), leaving it in swap cache for now.

B at last gets the lock on page1, hooray! Is PageSwapCache(page1)? Yes. Is pte_same(*page_table, orig_pte)? Yes, because page2 has now been given the swap1 which page1 used to have. So B proceeds to insert page1 into A+B's page_table, though its content now belongs to C, quite different from what A wrote there.

B ought to have checked that page1's swap was still swap1.

Signed-off-by: Hugh Dickins <hughd@google.com>
Reviewed-by: Rik van Riel <riel@redhat.com>
Cc: stable@kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
- 10 Sep 2010, 14 commits
-
-
By Mel Gorman

When under significant memory pressure, a process enters direct reclaim and immediately afterwards tries to allocate a page. If it fails and no further progress is made, it's possible the system will go OOM. However, on systems with large amounts of memory, it's possible that a significant number of pages are on per-cpu lists and inaccessible to the calling process. This leads to a process entering direct reclaim more often than it should, increasing the pressure on the system and compounding the problem. This patch notes that if direct reclaim is making progress but allocations are still failing, the system is already under heavy pressure. In this case, it drains the per-cpu lists and tries the allocation a second time before continuing.

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Reviewed-by: Christoph Lameter <cl@linux.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
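The shape of the change, as a hedged generic sketch rather than the real page_alloc.c code (try_alloc and drain_local_caches are stand-ins, and malloc stands in for the page allocator): when reclaim made progress but the allocation still failed, drain the CPU-local lists once and retry before escalating.

#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

/* Illustrative stand-ins; the real code works on struct page and
 * per-cpu pagesets, not malloc. */
static void *try_alloc(size_t n)      { return malloc(n); }
static void drain_local_caches(void)  { /* flush per-cpu free lists */ }

/* Progress + failure means pages are likely stranded on per-cpu
 * lists, so drain once and retry before giving up (and eventually
 * going OOM). */
static void *alloc_after_reclaim(size_t n, bool reclaim_progress)
{
    void *p = try_alloc(n);
    if (!p && reclaim_progress) {
        drain_local_caches();
        p = try_alloc(n);     /* second and final attempt here */
    }
    return p;
}

int main(void)
{
    void *p = alloc_after_reclaim(64, true);
    free(p);
    return 0;
}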
-
By Christoph Lameter

mm: page allocator: calculate a better estimate of NR_FREE_PAGES when memory is low and kswapd is awake

Ordinarily, watermark checks are based on the vmstat NR_FREE_PAGES, as it is cheaper than scanning a number of lists. To avoid synchronization overhead, counter deltas are maintained on a per-cpu basis and drained both periodically and when the delta is above a threshold. On large-CPU systems, the difference between the estimated and real value of NR_FREE_PAGES can be very high. If NR_FREE_PAGES is much higher than the number of really free pages in the buddy allocator, the VM can allocate pages below the min watermark, at worst reducing the real number of free pages to zero. Even if the OOM killer kills some victim to free memory, it may not free memory if the exit path requires a new page, resulting in livelock.

This patch introduces a zone_page_state_snapshot() function (courtesy of Christoph) that takes a slightly more accurate view of an arbitrary vmstat counter. It is used to read NR_FREE_PAGES while kswapd is awake to avoid the watermark being accidentally broken. The estimate is not perfect and may result in cache line bounces, but is expected to be lighter than the IPI calls necessary to continually drain the per-cpu counters while kswapd is awake.

Signed-off-by: Christoph Lameter <cl@linux.com>
Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
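A hedged model of what such a snapshot read does, with plain arrays standing in for real per-cpu data: fold the unflushed per-cpu deltas into the global counter instead of trusting the global value alone.

#include <stdio.h>

#define NR_CPUS 4

static long global_free_pages = 1000;  /* periodically synced value */
static long cpu_delta[NR_CPUS] = { 30, -500, 12, -300 }; /* unflushed deltas */

/* Cheap read: may be wildly off when the deltas are large. */
static long state_estimate(void) { return global_free_pages; }

/* Snapshot read: fold in every cpu's pending delta, as
 * zone_page_state_snapshot() does while kswapd is awake. */
static long state_snapshot(void)
{
    long v = global_free_pages;
    for (int cpu = 0; cpu < NR_CPUS; cpu++)
        v += cpu_delta[cpu];
    return v < 0 ? 0 : v;
}

int main(void)
{
    printf("estimate=%ld snapshot=%ld\n", state_estimate(), state_snapshot());
    /* estimate=1000 snapshot=242: the watermark check needs the latter */
    return 0;
}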
-
By Mel Gorman

When allocating a page, the system uses NR_FREE_PAGES counters to determine whether the watermarks would remain intact after the allocation is made. This check is made without interrupts disabled or the zone lock held, and so is race-prone by nature. Unfortunately, when pages are being freed in batch, the counters are updated before the pages are added to the list. During this window, the counters are misleading, as the pages do not exist yet. When under significant pressure on systems with large numbers of CPUs, it's possible for processes to make progress even though they should have been stalled. This is particularly problematic if a number of the processes are using GFP_ATOMIC, as the min watermark can be accidentally breached and, in extreme cases, the system can livelock. This patch updates the counters after the pages have been added to the list. This makes the allocator more cautious with respect to preserving the watermarks and mitigates livelock possibilities.

[akpm@linux-foundation.org: avoid modifying incoming args]
Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Reviewed-by: Rik van Riel <riel@redhat.com>
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Reviewed-by: Christoph Lameter <cl@linux.com>
Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
By KAMEZAWA Hiroyuki

refresh_zone_stat_thresholds() calculates its parameter based on the number of online cpus. It's called at cpu offlining, but needs to be called at onlining, too.

Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Christoph Lameter <cl@linux-foundation.org>
Acked-by: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
By Hugh Dickins

Tests with recent firmware on Intel X25-M 80GB and OCZ Vertex 60GB SSDs show a shift since I last tested in December: in part because of firmware updates, in part because of the necessary move from barriers to awaiting completion at the block layer. While discard at swapon still shows as slightly beneficial on both, discarding a 1MB swap cluster when allocating is now disadvantageous: it adds 25% overhead on Intel and 230% on OCZ (YMMV).

Surrender: discard as presently implemented is more hindrance than help for swap; but it might prove useful on other devices, or with improvements. So continue to do the discard at swapon, but make discard while swapping conditional on a SWAP_FLAG_DISCARD to sys_swapon() (which has been using only the lower 16 bits of its int flags). We can add a --discard or -d to swapon(8), and a "discard" swap option in /etc/fstab: matching the mount option for btrfs, ext4, fat, gfs2, nilfs2.

Signed-off-by: Hugh Dickins <hughd@google.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Nigel Cunningham <nigel@tuxonice.net>
Cc: Tejun Heo <tj@kernel.org>
Cc: Jens Axboe <jaxboe@fusionio.com>
Cc: James Bottomley <James.Bottomley@hansenpartnership.com>
Cc: "Martin K. Petersen" <martin.petersen@oracle.com>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
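A hedged sketch of the flags split described above: since sys_swapon() historically consumed only the low 16 bits, a new flag naturally lands in the upper bits. The numeric values below follow that layout but are assumptions for illustration, not verified kernel headers.

#include <stdio.h>

/* Lower 16 bits were already in use (priority etc.); new flags go above. */
#define SWAP_FLAG_PREFER     0x8000    /* assumed value */
#define SWAP_FLAG_PRIO_MASK  0x7fff    /* assumed value */
#define SWAP_FLAG_DISCARD    0x10000   /* assumed value, bit 16 */

static void parse_swapon_flags(int flags)
{
    if (flags & SWAP_FLAG_PREFER)
        printf("priority=%d\n", flags & SWAP_FLAG_PRIO_MASK);
    if (flags & SWAP_FLAG_DISCARD)
        printf("discard while swapping enabled\n");
}

int main(void)
{
    /* e.g. swapon -p 5 --discard */
    parse_swapon_flags(SWAP_FLAG_PREFER | 5 | SWAP_FLAG_DISCARD);
    return 0;
}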
-
By Christoph Hellwig

The swap code already uses synchronous discards; there is no need to add I/O barriers. This fixes the worst of the terrible slowdown in swap allocation for hibernation reported on 2.6.35 by Nigel Cunningham, but does not entirely eliminate that regression.

[tj@kernel.org: superfluous newlines removed]
Signed-off-by: Christoph Hellwig <hch@lst.de>
Tested-by: Nigel Cunningham <nigel@tuxonice.net>
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Hugh Dickins <hughd@google.com>
Cc: Jens Axboe <jaxboe@fusionio.com>
Cc: James Bottomley <James.Bottomley@hansenpartnership.com>
Cc: "Martin K. Petersen" <martin.petersen@oracle.com>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
By Hugh Dickins

Move the hibernation check from scan_swap_map() into try_to_free_swap(): to catch not only the common case when hibernation's allocation itself triggers swap reuse, but also the less likely case when concurrent page reclaim (shrink_page_list) might happen to call try_to_free_swap on a page. Hibernation already clears __GFP_IO from the gfp_allowed_mask to stop reclaim from going to swap: check that to prevent swap reuse too.

Signed-off-by: Hugh Dickins <hughd@google.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: "Rafael J. Wysocki" <rjw@sisk.pl>
Cc: Ondrej Zary <linux@rainbow-software.org>
Cc: Andrea Gelmini <andrea.gelmini@gmail.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Nigel Cunningham <nigel@tuxonice.net>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
By Hugh Dickins

Please revert 2.6.36-rc commit d2997b10 ("hibernation: freeze swap at hibernation"). It complicated matters by adding a second swap allocation path, just for hibernation, without in any way fixing the issue it was intended to address: page reclaim after fixing the hibernation image might free swap from a page already imaged as swapcache, letting its swap be reallocated to store a different page of the image, resulting in data corruption if the imaged page were freed as clean and then swapped back in. Pages freed to si->swap_map were still in danger of being reallocated by the alternative allocation path.

I guess it inadvertently fixed slow SSD swap allocation for hibernation, as reported by Nigel Cunningham, by missing out the discards that occur on the usual swap allocation path; but that was unintentional, and needs a separate fix.

Signed-off-by: Hugh Dickins <hughd@google.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: "Rafael J. Wysocki" <rjw@sisk.pl>
Cc: Ondrej Zary <linux@rainbow-software.org>
Cc: Andrea Gelmini <andrea.gelmini@gmail.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Nigel Cunningham <nigel@tuxonice.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
By Gary King

I have been seeing problems on Tegra 2 (ARMv7 SMP) systems with HIGHMEM enabled on 2.6.35 (plus some patches targeted at 2.6.36 to perform cache maintenance lazily), and the root cause appears to be that the mm bouncing code is calling flush_dcache_page before it copies the bounce buffer into the bio. The bounced page needs to be flushed after data is copied into it, to ensure that architecture implementations can synchronize instruction and data caches if necessary.

Signed-off-by: Gary King <gking@nvidia.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Russell King <rmk@arm.linux.org.uk>
Acked-by: Jens Axboe <axboe@kernel.dk>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
By KAMEZAWA Hiroyuki

next_active_pageblock() is for finding the next _used_ freeblock. It skips several blocks when it finds a chunk of free pages larger than a pageblock. But it has two bugs:

1. We hold no lock, so page_order(page) - pageblock_order can be negative.
2. The "pageblocks_stride +=" is wrong; it should skip page_order(p) pages.

Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
By Minchan Kim

Iram reported that compaction's too_many_isolated() loops forever (http://www.spinics.net/lists/linux-mm/msg08123.html). The meminfo when the situation happened showed zero inactive anon; that's because the system had had no memory pressure until then. While all anon pages were on the active lru, compaction could isolate from the active lru as well as the inactive lru. That is a different thing from vmscan's isolation, which is why there are two separate too_many_isolated() checks. While compaction can isolate pages from both the active and inactive lists, the current implementation of its too_many_isolated() only considers the inactive list, and that caused Iram's problem. This patch handles active and inactive fairly, because we can't predict where compaction will isolate pages from, or how many. It changes (nr_isolated > nr_inactive) to nr_isolated > (nr_active + nr_inactive) / 2.

Signed-off-by: Minchan Kim <minchan.kim@gmail.com>
Reported-by: Iram Shahzad <iram.shahzad@jp.fujitsu.com>
Acked-by: Mel Gorman <mel@csn.ul.ie>
Acked-by: Wu Fengguang <fengguang.wu@intel.com>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
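The before/after predicate, as a hedged standalone sketch of the arithmetic rather than compaction.c itself:

#include <stdio.h>
#include <stdbool.h>

/* Old check: compares isolated pages against the inactive LRU size
 * only, which stalls forever when the inactive list is empty. */
static bool too_many_isolated_old(long isolated, long active, long inactive)
{
    (void)active;
    return isolated > inactive;
}

/* New check: compaction isolates from both LRUs, so compare against
 * half of their combined size. */
static bool too_many_isolated_new(long isolated, long active, long inactive)
{
    return isolated > (active + inactive) / 2;
}

int main(void)
{
    /* Iram's case: no inactive anon at all. */
    long isolated = 32, active = 100000, inactive = 0;
    printf("old=%d new=%d\n",
           too_many_isolated_old(isolated, active, inactive),  /* 1: stall */
           too_many_isolated_new(isolated, active, inactive)); /* 0: proceed */
    return 0;
}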
-
By Andrea Arcangeli

COMPACTION enables MIGRATION, but MIGRATION spawns a warning if NUMA or memory hotplug aren't selected, even though MIGRATION doesn't actually depend on them. I guess it's just trying to be strict, doing a double check on who's enabling it, but it doesn't know that compaction also enables MIGRATION.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Acked-by: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
By Andrea Arcangeli

The pte_same check is reliable only if the swap entry remains pinned (by the page lock on swapcache). We also have to ensure the swapcache isn't removed before we take the page lock, as try_to_free_swap won't care about the page pin. One of the possible impacts of this patch is that a KSM-shared page can point to the anon_vma of another process, which could exit before the page is freed. This can leave a page with a pointer to a recycled anon_vma object, or worse, a pointer to something that is no longer an anon_vma.

[riel@redhat.com: changelog help]
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Acked-by: Hugh Dickins <hughd@google.com>
Reviewed-by: Rik van Riel <riel@redhat.com>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
By Stefan Bader

So it can be used by all that need to check for that.

Signed-off-by: Stefan Bader <stefan.bader@canonical.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-