Commits · ea941f0e2a8c02ae876cd73deb4e1557248f258c · openeuler / raspberrypi-kernel

27 Oct, 2010 13 commits

writeback: add nr_dirtied and nr_written to /proc/vmstat · ea941f0e

Committed by Michael Rubin on Oct 26, 2010

To help developers and applications gain visibility into writeback
behaviour, add two entries to vm_stat_items and /proc/vmstat.  This will
allow us to track the "written" and "dirtied" counts.

   # grep nr_dirtied /proc/vmstat
   nr_dirtied 3747
   # grep nr_written /proc/vmstat
   nr_written 3618

Signed-off-by: Michael Rubin <mrubin@google.com>
Reviewed-by: Wu Fengguang <fengguang.wu@intel.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

mm: add account_page_writeback() · f629d1c9

Committed by Michael Rubin on Oct 26, 2010

To help developers and applications gain visibility into writeback
behaviour this patch adds two counters to /proc/vmstat.

  # grep nr_dirtied /proc/vmstat
  nr_dirtied 3747
  # grep nr_written /proc/vmstat
  nr_written 3618

These entries allow user apps to understand writeback behaviour over time
and learn how it is impacting their performance.  Currently there is no
way to inspect dirty and writeback speed over time; nr_dirty and
nr_writeback only report how many pages are in each state at the moment
they are read.

These entries are necessary to give visibility into writeback behaviour.
We have /proc/diskstats, which lets us understand the IO in the block
layer.  We have blktrace for more in-depth understanding.  We have
e2fsprogs and debugfs to give insight into the file system's behaviour,
but we don't offer our users the ability to understand what writeback is
doing.  There is no way to know how active it is over the whole system,
whether it's falling behind, or to quantify its efforts.  With these values
exported, users can easily see how much data applications are sending
through writeback and at what rate writeback is processing this
data.  Comparing the rates of change of the two allows developers to
see when writeback is not able to keep up with incoming traffic and the
rate at which dirty memory is being sent to the IO back end.  This allows
folks to understand their IO workloads and track kernel issues.  Non-kernel
engineers at Google often use these counters to solve puzzling performance
problems.
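
As a rough illustration of the rate comparison described above, here is a minimal sketch that turns two /proc/vmstat samples into pages-per-second figures. The helper names, the 10-second interval, and the second sample are assumptions made for the example; only the first sample's values come from the commit message.

```python
def parse_vmstat(text):
    """Parse /proc/vmstat-style 'name value' lines into a dict of ints."""
    counters = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[1].isdigit():
            counters[parts[0]] = int(parts[1])
    return counters


def writeback_rates(before, after, interval_s):
    """Return (dirtied, written) pages per second between two samples."""
    dirtied = (after["nr_dirtied"] - before["nr_dirtied"]) / interval_s
    written = (after["nr_written"] - before["nr_written"]) / interval_s
    return dirtied, written


# First sample: values quoted in the commit message.
sample_a = parse_vmstat("nr_dirtied 3747\nnr_written 3618\n")
# Second sample: made-up values, assumed to be read 10 seconds later.
sample_b = parse_vmstat("nr_dirtied 5747\nnr_written 4618\n")

dirtied_rate, written_rate = writeback_rates(sample_a, sample_b, 10.0)
# Pages dirtied faster than they are written suggests writeback is lagging.
falling_behind = dirtied_rate > written_rate
```

On a live system the two text blobs would come from reading /proc/vmstat twice; both counters only ever grow, so the difference between samples is what matters.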

Patch #4 adds a per-node vmstat file with nr_dirtied and nr_written

Patch #5 adds writeback thresholds to /proc/vmstat

Currently these values are in debugfs. But they should be promoted to
/proc since they are useful for developers who are writing databases
and file servers and are not debugging the kernel.

The output is as below:

 # grep threshold /proc/vmstat
 nr_pages_dirty_threshold 409111
 nr_pages_dirty_background_threshold 818223

This patch:

This allows code outside of the mm core to safely manipulate page
writeback state and not worry about the other accounting.  Not using these
routines means that some code will lose track of the accounting and we get
bugs.

Modify nilfs2 to use the interface.

Signed-off-by: Michael Rubin <mrubin@google.com>
Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Reviewed-by: Wu Fengguang <fengguang.wu@intel.com>
Cc: KONISHI Ryusuke <konishi.ryusuke@lab.ntt.co.jp>
Cc: Jiro SEKIBA <jir@unicus.jp>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

mm/mempolicy.c: check return code of check_range · 0def08e3

Committed by Vasiliy Kulikov on Oct 26, 2010

Function check_range may return ERR_PTR(...). Check for it.

Signed-off-by: Vasiliy Kulikov <segooon@gmail.com>
Acked-by: David Rientjes <rientjes@google.com>
Reviewed-by: Christoph Lameter <cl@linux.com>
Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

vmscan: prevent background aging of anon page in no swap system · 74e3f3c3

Committed by Minchan Kim on Oct 26, 2010

Ying Han reported that background aging of anon pages on a system with no
swap causes unnecessary TLB flushes.

When I sent an earlier patch (69c85481), I wanted this change, but Rik
objected and we kept aging anon pages to give them a chance to be promoted
from the inactive to the active LRU.

That has two problems.

1) Systems without swap

It never makes sense to age anon pages.

2) Swap configured but not yet enabled with swapon

It doesn't make sense to age anon pages until swapon time.  But this is
arguable.  If we have aged anon pages by the time of swapon, the VM has
already moved anon pages from the active to the inactive list.  Then, when
the admin runs swapon, the VM does not reclaim hot pages, so hot pages are
protected from swapout.

But let's think about it.  When does swapon happen?  It depends on the
admin; we can't predict it.  Nonetheless, we have been aging anon pages to
protect hot pages from swapout.  That means we pay run-time overhead while
below the high watermark in order to avoid hot-page swap-in/out overhead
when the VM decides to swap out.  Is that worthwhile?  Let's look in more
detail.  We don't promote anon pages on a system without swap.  So even
though the VM ages anon pages, the pages stay on the inactive LRU for a
long time.  That means many of the pages there will have their access bit
set again, so hot/cold separation based on the access bit is pointless.

This patch prevents unnecessary demotion of anon pages when swap is not
configured or not yet enabled.  In addition, when swap is not configured,
inactive_anon_is_low can be compiled out.

A side effect is that hot anon pages could be swapped out when the admin
runs swapon.  But I think it would reach a steady state sooner or later, so
it's not a big problem.

We could lose something, but we gain more (no TLB flushes and no
unnecessary function calls to demote anon pages).

Signed-off-by: Ying Han <yinghan@google.com>
Signed-off-by: Minchan Kim <minchan.kim@gmail.com>
Reviewed-by: Rik van Riel <riel@redhat.com>
Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

memory hotplug: unify is_removable and offline detection code · 49ac8255

Committed by KAMEZAWA Hiroyuki on Oct 26, 2010

Currently, the sysfs interface of memory hotplug shows whether a section is
removable or not.  But it checks only the migratetype of pages and doesn't
check the details of the cluster of pages.

Memory hotplug's set_migratetype_isolate() has the same kind of
check, too.

This patch adds the function __count_unmovable_pages() and makes the two
checks above use the same logic.  As a result, is_removable and the
hotremove code use the same logic.  No changes in the hotremove logic itself.

TODO: need to find a way to check RECLAIMABLE. But, thinking about it a bit,
      calling shrink_slab() against a range before starting memory hotremove
      sounds better. If so, this patch's logic doesn't need to be changed.

[akpm@linux-foundation.org: coding-style fixes]

Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Reported-by: Michal Hocko <mhocko@suse.cz>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

memory hotplug: fix notifier's return value check · 4b20477f

Committed by KAMEZAWA Hiroyuki on Oct 26, 2010

Even if the notifier cannot find any pages, it doesn't mean no pages are
available.  And if there are no notifiers registered, this condition will
always be true and memory hotplug will return -EBUSY.

This is a bug but not critical.

In most cases, a pageblock which will be offlined is MIGRATE_MOVABLE.  This
"notifier" is called only when the pageblock is _not_ MIGRATE_MOVABLE.
But if it is not MIGRATE_MOVABLE, memory hotplug commonly fails anyway.
So no one noticed this bug.

Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

mm: compaction: fix COMPACTPAGEFAILED counting · cf608ac1

Committed by Minchan Kim on Oct 26, 2010

Presently update_nr_listpages() has no effect.  That's because the lists
passed to it are always empty just after calling migrate_pages().  Since
aaa994b3, migrate_pages() cleans up the list of pages that failed to
migrate before returning.

 [PATCH] page migration: handle freeing of pages in migrate_pages()

 Do not leave pages on the lists passed to migrate_pages().  Seems that we will
 not need any postprocessing of pages.  This will simplify the handling of
 pages by the callers of migrate_pages().

At that time, we thought we didn't need any postprocessing of pages.  But
the situation has changed.  Compaction needs to know the number of
pages that failed to migrate for the COMPACTPAGEFAILED stat.

This patch introduces a new rule: the caller of migrate_pages() must call
putback_lru_pages().  The caller now cleans up the lists itself, so it has a
chance to postprocess the pages.  [suggested by Christoph Lameter]
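
The new calling convention can be modelled in a few lines of userspace Python. This is only a toy sketch: migrate_pages_model() and compact_once() are hypothetical stand-ins, pages are plain integers, and the COMPACTPAGES counter is assumed here as the counterpart of COMPACTPAGEFAILED.

```python
def migrate_pages_model(pages, can_migrate):
    """Toy stand-in for migrate_pages(): pages that migrate leave the list,
    pages that fail stay on it for the caller to deal with."""
    failed = [page for page in pages if not can_migrate(page)]
    pages[:] = failed
    return len(failed)


def compact_once(pages, can_migrate, stats):
    """Caller side: count failures while they are still on the list,
    then put the leftovers back (modelled by clearing the list)."""
    nr_migrate = len(pages)
    migrate_pages_model(pages, can_migrate)
    nr_remaining = len(pages)  # non-zero only because the list was not emptied
    stats["COMPACTPAGES"] += nr_migrate - nr_remaining
    stats["COMPACTPAGEFAILED"] += nr_remaining
    pages.clear()  # stands in for putback_lru_pages()
    return stats


# Hypothetical run: even-numbered pages migrate, odd-numbered ones fail.
stats = compact_once([1, 2, 3, 4, 5], lambda page: page % 2 == 0,
                     {"COMPACTPAGES": 0, "COMPACTPAGEFAILED": 0})
```

If the migration step emptied the list itself, as it did before this change, nr_remaining would always be zero and the failure counter could never move.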

Signed-off-by: Minchan Kim <minchan.kim@gmail.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Andi Kleen <andi@firstfloor.org>
Reviewed-by: Mel Gorman <mel@csn.ul.ie>
Reviewed-by: Wu Fengguang <fengguang.wu@intel.com>
Acked-by: Christoph Lameter <cl@linux.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

mm: only build per-node scan_unevictable functions when NUMA is enabled · e4455abb

Committed by Thadeu Lima de Souza Cascardo on Oct 26, 2010

Non-NUMA systems never create these files anyway, since they are only
created by the driver subsystem when NUMA is configured.

[akpm@linux-foundation.org: cleanup]

Signed-off-by: Thadeu Lima de Souza Cascardo <cascardo@holoscopio.com>
Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Lee Schermerhorn <lee.schermerhorn@hp.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

writeback: remove nonblocking/encountered_congestion references · 1b430bee

Committed by Wu Fengguang on Oct 26, 2010

This removes more dead code that was somehow missed by commit 0d99519e
(writeback: remove unused nonblocking and congestion checks).  There is
no behavior change except for the removal of two entries from one of the
ext4 tracing interfaces.

The nonblocking checks in ->writepages are no longer used because the
flusher now prefers to block on get_request_wait() rather than skip inodes
on IO congestion.  The latter leads to more seeky IO.

The nonblocking checks in ->writepage are no longer used because they are
redundant with the WB_SYNC_NONE check.

We no longer set ->nonblocking in VM page out and page migration, because
a) it's effectively redundant with WB_SYNC_NONE in current code
b) its old semantic of "Don't get stuck on request queues" is misbehavior:
   that would skip some dirty inodes on congestion and page out others, which
   is unfair in terms of LRU age.

Inspired by Christoph Hellwig. Thanks!

Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
Cc: Theodore Ts'o <tytso@mit.edu>
Cc: David Howells <dhowells@redhat.com>
Cc: Sage Weil <sage@newdream.net>
Cc: Steve French <sfrench@samba.org>
Cc: Chris Mason <chris.mason@oracle.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

oom: kill all threads sharing oom killed task's mm · 1e99bad0

Committed by David Rientjes on Oct 26, 2010

It's necessary to kill all threads that share an oom killed task's mm if
the goal is to lead to future memory freeing.

This patch reintroduces the code removed in 8c5cd6f3 (oom: oom_kill
doesn't kill vfork parent (or child)) since it is obsoleted.

It's now guaranteed that any task passed to oom_kill_task() does not share
an mm with any thread that is unkillable.  Thus, we're safe to issue a
SIGKILL to any thread sharing the same mm.

This is especially necessary to solve an mm->mmap_sem livelock issue
where an oom killed thread must acquire the lock in the exit path while
another thread is holding it in the page allocator while trying to
allocate memory itself (and will preempt the oom killer since a task was
already killed).  Since tasks with pending fatal signals are now granted
access to memory reserves, the thread holding the lock may quickly
allocate and release the lock so that the oom killed task may exit.

This mainly is for threads that are cloned with CLONE_VM but not
CLONE_THREAD, so they are in a different thread group.  Non-NPTL threads
exist in the wild and this change is necessary to prevent the livelock in
such cases.  We care more about preventing the livelock than incurring the
additional tasklist scan in the oom killer when a task has been killed.
Systems that are sufficiently large to not want the tasklist scan in the
oom killer in the first place already have the option of enabling
/proc/sys/vm/oom_kill_allocating_task, which was designed specifically for
that purpose.

This code had existed in the oom killer for over eight years dating back
to the 2.4 kernel.

[akpm@linux-foundation.org: add nice comment]

Signed-off-by: David Rientjes <rientjes@google.com>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Ying Han <yinghan@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

oom: avoid killing a task if a thread sharing its mm cannot be killed · e18641e1

Committed by David Rientjes on Oct 26, 2010

The oom killer's goal is to kill a memory-hogging task so that it may
exit, free its memory, and allow the current context to allocate the
memory that triggered it in the first place.  Thus, killing a task is
pointless if other threads sharing its mm cannot be killed because of its
/proc/pid/oom_adj or /proc/pid/oom_score_adj value.

This patch checks whether any other thread sharing p->mm has an
oom_score_adj of OOM_SCORE_ADJ_MIN.  If so, the thread cannot be killed
and oom_badness(p) returns 0, meaning it's unkillable.
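
The rule can be sketched as a small userspace model. The task dictionaries, the shares_unkillable_mm() helper, and the use of rss as the score are all assumptions for illustration; the real oom_badness() computes its score differently, and only the "return 0 when a thread sharing the mm is at OOM_SCORE_ADJ_MIN" rule is taken from the text above.

```python
OOM_SCORE_ADJ_MIN = -1000  # value that marks a task as exempt from the oom killer


def shares_unkillable_mm(task, all_tasks):
    """True if any thread using the same mm is at OOM_SCORE_ADJ_MIN."""
    return any(other["mm"] == task["mm"]
               and other["oom_score_adj"] == OOM_SCORE_ADJ_MIN
               for other in all_tasks)


def badness(task, all_tasks):
    """Toy score: 0 means 'do not pick this task'."""
    if shares_unkillable_mm(task, all_tasks):
        return 0
    return task["rss"]  # placeholder score, not the kernel's heuristic


# Hypothetical task list: pids 1 and 2 share mm "A", pid 2 is protected.
tasks = [
    {"pid": 1, "mm": "A", "oom_score_adj": 0, "rss": 500},
    {"pid": 2, "mm": "A", "oom_score_adj": OOM_SCORE_ADJ_MIN, "rss": 500},
    {"pid": 3, "mm": "B", "oom_score_adj": 0, "rss": 300},
]
scores = {task["pid"]: badness(task, tasks) for task in tasks}
```

Killing pid 1 here would free nothing, because pid 2 keeps the shared mm alive, so the model scores both as 0.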

Signed-off-by: David Rientjes <rientjes@google.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Ying Han <yinghan@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

mm, page-allocator: do not check the state of a non-existant buddy during free · b7f50cfa

Committed by Mel Gorman on Oct 26, 2010

There is a bug in commit 6dda9d55 ("page allocator: reduce fragmentation
in buddy allocator by adding buddies that are merging to the tail of the
free lists") that means a buddy at order MAX_ORDER is checked for merging.
A page of this order never exists so at times, an effectively random
piece of memory is being checked.

Alan Curry has reported that this is causing memory corruption in
userspace data on a PPC32 platform (http://lkml.org/lkml/2010/10/9/32).
It is not clear why this is happening.  It could be a cache coherency
problem where pages mapped in both user and kernel space are getting
different cache lines due to the bad read from kernel space
(http://lkml.org/lkml/2010/10/13/179).  It could also be that there are
some special registers being io-remapped at the end of the memmap array
and that a read has special meaning on them.  Compiler bugs have been
ruled out because the assembly before and after the patch looks relatively
harmless.

This patch fixes the problem by ensuring we are not reading a possibly
invalid location of memory.  It's not clear why the read causes corruption
but one way or the other it is a buggy read.
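
To make the order arithmetic concrete, here is a small sketch of buddy index calculation and one possible guard. MAX_ORDER = 11 and the exact form of the bound in may_check_higher_buddy() are assumptions for this example, not a quotation of the patch; the function names are hypothetical.

```python
MAX_ORDER = 11  # assumption: free blocks exist only for orders 0 .. MAX_ORDER - 1


def find_buddy_index(page_idx, order):
    """Index of the buddy of the block starting at page_idx."""
    return page_idx ^ (1 << order)


def find_combined_index(page_idx, order):
    """Start index of the order + 1 block formed by a block and its buddy."""
    return page_idx & ~(1 << order)


def may_check_higher_buddy(order):
    """Freeing at `order` may peek at the buddy of the order + 1 block.
    That peek is only meaningful if the two could merge into an order + 2
    block, and order + 2 must not exceed MAX_ORDER - 1."""
    return order < MAX_ORDER - 2


# Which orders would be allowed to look at the higher-order buddy.
allowed = [order for order in range(MAX_ORDER) if may_check_higher_buddy(order)]
```

With these assumptions the check is skipped for orders 9 and 10, where the "higher buddy" would describe a block the allocator never tracks.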

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Cc: Corrado Zoccolo <czoccolo@gmail.com>
Reported-by: Alan Curry <pacman@kosh.dhis.org>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Christoph Lameter <cl@linux-foundation.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

mm: fix return value of scan_lru_pages in memory unplug · f8f72ad5

Committed by KAMEZAWA Hiroyuki on Oct 26, 2010

scan_lru_pages returns a pfn. So its type should be "unsigned long",
not "int".

Note: I guess this has worked until now because memory hotplug testers'
      machines do not have very much memory....
      physical address < 32bit << PAGE_SHIFT.
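
The truncation can be reproduced in userspace with fixed-width integers. This is a sketch assuming 4 KiB pages (PAGE_SHIFT = 12) and a 32-bit C int; the helper name and the sample addresses are illustrative only.

```python
import ctypes

PAGE_SHIFT = 12  # assumption: 4 KiB pages


def as_c_int(value):
    """Value seen by a caller if `value` is returned through a 32-bit int."""
    return ctypes.c_int32(value).value


# pfn of a page 4 GiB into physical memory: still fits in an int.
small_pfn = (4 << 30) >> PAGE_SHIFT
# pfn of a page 8 TiB into physical memory: 2**31, one past INT_MAX.
big_pfn = (8 << 40) >> PAGE_SHIFT

small_ok = as_c_int(small_pfn) == small_pfn
big_truncated = as_c_int(big_pfn)
```

Under these assumptions the return value only goes wrong once physical addresses reach 2**31 pages, which matches the note about test machines not having enough memory to notice.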

Reported-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

25 Oct, 2010 1 commit

[S390] add support for nonquiescing sske · e2b8d7af

Committed by Martin Schwidefsky on Oct 25, 2010

Improve performance of the sske operation by using the nonquiescing
variant if the affected page has no mappings established. On machines
with no support for the new sske variant the mask bit will be ignored.

Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>

24 Oct, 2010 1 commit

export __get_user_pages_fast() function · 45888a0c

Committed by Xiao Guangrong on Aug 22, 2010

This function is used by KVM to pin a process's pages in atomic context.

Define a 'weak' function so that architectures which do not support it
still build.

Acked-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>

19 Oct, 2010 1 commit

memory_hotplug: drop spurious calls to flush_scheduled_work() · 10ccd846

Committed by Tejun Heo on Oct 19, 2010

lru_add_drain_all() uses schedule_on_each_cpu() which is synchronous.
There is no reason to call flush_scheduled_work() after
lru_add_drain_all().  Drop the spurious calls.

This is to prepare for the deprecation and removal of
flush_scheduled_work().

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
Acked-by: Mel Gorman <mel@csn.ul.ie>

12 Oct, 2010 2 commits

memblock: Annotate memblock functions with __init_memblock · cd79481d

Committed by Yinghai Lu on Oct 11, 2010

Stephen found

WARNING: mm/built-in.o(.text+0x25ab8): Section mismatch in reference from the function memblock_find_base() to the function .init.text:memblock_find_region()
The function memblock_find_base() references
the function __init memblock_find_region().
This is often because memblock_find_base lacks a __init
annotation or the annotation of memblock_find_region is wrong.

So let memblock_find_region() use __init_memblock instead of __init
directly.

Also fix one function that did not have __init* to be __init_memblock.

Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Yinghai Lu <yinghai@kernel.org>
LKML-Reference: <4CB366B1.40405@kernel.org>
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>

memblock: Allow memblock_init to be called early · 236260b9

Committed by Jeremy Fitzhardinge on Oct 06, 2010

The Xen setup code needs to call memblock_x86_reserve_range() very early,
so allow it to initialize the memblock subsystem before doing so.  The
second memblock_init() is ignored.
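
One common way to get "the second call is ignored" behaviour is an initialised flag, sketched below. The MemblockModel class and its methods are hypothetical and only model the idea; how the kernel actually guards memblock_init() is not shown in this commit message.

```python
class MemblockModel:
    """Toy model of a subsystem whose init may be called more than once."""

    def __init__(self):
        self.initialized = False
        self.reserved = []

    def init(self):
        """Set up state on the first call; later calls are ignored."""
        if self.initialized:
            return False
        self.reserved = []
        self.initialized = True
        return True

    def reserve_range(self, start, end):
        """Record a reserved [start, end) range."""
        self.reserved.append((start, end))


memblock = MemblockModel()
first = memblock.init()                  # early caller (e.g. Xen setup code)
memblock.reserve_range(0x1000, 0x2000)   # made-up range for illustration
second = memblock.init()                 # regular boot path, now a no-op
```

The point of ignoring the second call is that the early reservation survives; re-running the setup would wipe it.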

Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Cc: Yinghai Lu <yinghai@kernel.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
LKML-Reference: <4CACFDAD.3090900@goop.org>
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>

11 Oct, 2010 1 commit

Fix migration.c compilation on s390 · 3ef8fd7f

Committed by Andi Kleen on Oct 11, 2010

31bit s390 doesn't have huge pages and failed with:

> mm/migrate.c: In function 'remove_migration_pte':
> mm/migrate.c:143:3: error: implicit declaration of function 'pte_mkhuge'
> mm/migrate.c:143:7: error: incompatible types when assigning to type 'pte_t' from type 'int'

Put that code into an ifdef.

Reported by Heiko Carstens

Signed-off-by: Andi Kleen <ak@linux.intel.com>

08 Oct, 2010 19 commits

HWPOISON: Remove retry loop for try_to_unmap · a08c80eb

Committed by Andi Kleen on Sep 27, 2010

We don't retry in other temporary failure cases and there were no
reports of retries happening. I think the original reason it was
added was also just an early bug, not an observation of the race.

So remove the loop for now, but keep a warning message.

Signed-off-by: Andi Kleen <ak@linux.intel.com>

HWPOISON: Turn addr_valid from bitfield into char · 9033ae16

Committed by Andi Kleen on Sep 27, 2010

The addr_valid flag is the only flag in "to_kill" and it's slightly more
efficient to have it as char instead of a bitfield.

Signed-off-by: Andi Kleen <ak@linux.intel.com>

HWPOISON: Disable DEBUG by default · 898e70d1

Committed by Andi Kleen on Sep 27, 2010

Now that only a few obscure messages are left as pr_debug, disable
output of pr_debug in memory-failure.c by default.

Signed-off-by: Andi Kleen <ak@linux.intel.com>

HWPOISON: Convert pr_debugs to pr_info · fb46e735

Committed by Andi Kleen on Sep 27, 2010

Convert a lot of pr_debugs in memory-failure.c that are generally useful
to pr_info. It's reasonable to print at least one message by default
saying why offlining succeeded or failed.

Signed-off-by: Andi Kleen <ak@linux.intel.com>

HWPOISON: Improve comments in memory-failure.c · 1c80b990

Committed by Andi Kleen on Sep 27, 2010

Clean up and improve the overview comment in memory-failure.c

Tidy some grammar issues in other comments.

Signed-off-by: Andi Kleen <ak@linux.intel.com>

Encode huge page size for VM_FAULT_HWPOISON errors · aa50d3a7

Committed by Andi Kleen on Oct 06, 2010

This fixes a problem introduced with the hugetlb hwpoison handling.

The user space SIGBUS signalling wants to know the size of the hugepage
that caused a HWPOISON fault.

Unfortunately the architecture page fault handlers do not have easy
access to the struct page.

Pass the information out in the fault error code instead.

I added a separate VM_FAULT_HWPOISON_LARGE bit for this case and encode
the hpage index in some free upper bits of the fault code. The small
page hwpoison stays with the VM_FAULT_HWPOISON name to minimize
changes.

Also add code to hugetlb.h to convert that index into a page shift.

Will be used in a further patch.
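
The encoding idea can be sketched as plain bit manipulation. The numeric flag values, the shift of 12, the 4-bit index width, and the table of huge page shifts below are all assumptions chosen for the example; only the flag names and the "index in upper bits, converted to a page shift" scheme come from the description above.

```python
PAGE_SHIFT = 12                     # assumption: 4 KiB base pages
VM_FAULT_HWPOISON = 0x0010          # assumed flag value
VM_FAULT_HWPOISON_LARGE = 0x0020    # assumed flag value
HINDEX_SHIFT = 12                   # assumed position of the index field
HINDEX_MASK = 0xF                   # assumed width of the index field


def encode_large_fault(hindex):
    """Fault code for a poisoned huge page of hstate index `hindex`."""
    return VM_FAULT_HWPOISON_LARGE | (hindex << HINDEX_SHIFT)


def fault_page_shift(fault, hstate_shifts):
    """Recover the page shift to report, given only the fault code."""
    if fault & VM_FAULT_HWPOISON_LARGE:
        hindex = (fault >> HINDEX_SHIFT) & HINDEX_MASK
        return hstate_shifts[hindex]
    return PAGE_SHIFT


# Hypothetical table: index 0 is a 2 MiB huge page, index 1 a 1 GiB one.
hstate_shifts = [21, 30]
code = encode_large_fault(1)
shift = fault_page_shift(code, hstate_shifts)
```

The fault handler never touches the struct page here; everything it needs travels in the return code.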

Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: fengguang.wu@intel.com
Signed-off-by: Andi Kleen <ak@linux.intel.com>

hugepage: move is_hugepage_on_freelist inside ifdef to avoid warning · d5bd9106

Committed by Andi Kleen on Sep 27, 2010

Fixes warning reported by Stephen Rothwell

mm/hugetlb.c:2950: warning: 'is_hugepage_on_freelist' defined but not used

for the !CONFIG_MEMORY_FAILURE case.

Signed-off-by: Andi Kleen <ak@linux.intel.com>

Clean up __page_set_anon_rmap · 4e1c1975

Committed by Andi Kleen on Sep 22, 2010

Linus asked for a cleanup of __page_set_anon_rmap to make
it look more like the cleaner huge pages version.

Factor out the duplicated PageAnon check into a single check
at the beginning of the function.

Remove obsolete comments and rewrite them into standard English.

No functional changes.

Signed-off-by: Andi Kleen <ak@linux.intel.com>

HWPOISON, hugetlb: fix unpoison for hugepage · 6a90181c

Committed by Naoya Horiguchi on Sep 08, 2010

Currently unpoisoning hugepages doesn't work correctly because
clearing PG_HWPoison is done outside if (TestClearPageHWPoison).
This patch fixes it.

Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>
Acked-by: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: Andi Kleen <ak@linux.intel.com>

HWPOISON, hugetlb: soft offlining for hugepage · d950b958

Committed by Naoya Horiguchi on Sep 08, 2010

This patch extends the soft offlining framework to support hugepages.
When corrected memory errors occur repeatedly on a hugepage,
we can choose to stop using it by migrating data onto another hugepage
and disabling the original (maybe half-broken) one.

ChangeLog since v4:
- branch soft_offline_page() for hugepage

ChangeLog since v3:
- remove comment about "ToDo: hugepage soft-offline"

ChangeLog since v2:
- move refcount handling into isolate_lru_page()

ChangeLog since v1:
- add double check in isolating hwpoisoned hugepage
- define free/non-free checker for hugepage
- postpone calling put_page() for hugepage in soft_offline_page()

Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>
Acked-by: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: Andi Kleen <ak@linux.intel.com>

HWPOSION, hugetlb: recover from free hugepage error when !MF_COUNT_INCREASED · 8c6c2ecb

Committed by Naoya Horiguchi on Sep 08, 2010

Currently error recovery for free hugepages works only for MF_COUNT_INCREASED.
This patch enables the !MF_COUNT_INCREASED case.

Free hugepages can be handled directly by alloc_huge_page() and
dequeue_hwpoisoned_huge_page(), and both of them are protected
by hugetlb_lock, so there is no race between them.

Note that this patch defines the refcount of a HWPoisoned hugepage
dequeued from the freelist as 1, rather than the present 0, so that we
can avoid a race between unpoison and memory failure on a free hugepage.
This is reasonable because, unlike free buddy pages, a free hugepage
is governed by hugetlbfs even after error handling finishes.
It also makes the unpoison code added in a later patch cleaner.

Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>
Acked-by: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: Andi Kleen <ak@linux.intel.com>

hugetlb: move refcounting in hugepage allocation inside hugetlb_lock · a9869b83

Committed by Naoya Horiguchi on Sep 08, 2010

Currently alloc_huge_page() raises the page refcount outside hugetlb_lock,
but this causes a race when dequeue_hwpoison_huge_page() runs concurrently
with alloc_huge_page().
To avoid it, this patch moves set_page_refcounted() inside hugetlb_lock.

Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
Acked-by: Mel Gorman <mel@csn.ul.ie>
Reviewed-by: Christoph Lameter <cl@linux.com>
Signed-off-by: Andi Kleen <ak@linux.intel.com>

HWPOISON, hugetlb: add free check to dequeue_hwpoison_huge_page() · 6de2b1aa

Committed by Naoya Horiguchi on Sep 08, 2010

This check is necessary to avoid a race between dequeue and allocation,
which can cause a free hugepage to be dequeued twice and make the kernel
unstable.

Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
Acked-by: Mel Gorman <mel@csn.ul.ie>
Reviewed-by: Christoph Lameter <cl@linux.com>
Signed-off-by: Andi Kleen <ak@linux.intel.com>

hugetlb: hugepage migration core · 290408d4

Committed by Naoya Horiguchi on Sep 08, 2010

This patch extends page migration code to support hugepage migration.
One of the potential users of this feature is soft offlining, which
is triggered by corrected memory errors (added by the next patch.)

Todo:
- there are other users of page migration such as memory policy,
  memory hotplug and memory compaction.
  They are not ready for hugepage support for now.

ChangeLog since v4:
- define migrate_huge_pages()
- remove changes on isolation/putback_lru_page()

ChangeLog since v2:
- refactor isolate/putback_lru_page() to handle hugepage
- add comment about race on unmap_and_move_huge_page()

ChangeLog since v1:
- divide migration code path for hugepage
- define routine checking migration swap entry for hugetlb
- replace "goto" with "if/else" in remove_migration_pte()

Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>
Acked-by: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: Andi Kleen <ak@linux.intel.com>

hugetlb: redefine hugepage copy functions · 0ebabb41

Committed by Naoya Horiguchi on Sep 08, 2010

This patch modifies hugepage copy functions to have only destination
and source hugepages as arguments for later use.
The old ones are renamed from copy_{gigantic,huge}_page() to
copy_user_{gigantic,huge}_page().
This naming convention is consistent with that between copy_highpage()
and copy_user_highpage().

ChangeLog since v4:
- add blank line between local declaration and code
- remove unnecessary might_sleep()

ChangeLog since v2:
- change copy_huge_page() from macro to inline dummy function
  to avoid compile warning when !CONFIG_HUGETLB_PAGE.

Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Acked-by: Mel Gorman <mel@csn.ul.ie>
Reviewed-by: Christoph Lameter <cl@linux.com>
Signed-off-by: Andi Kleen <ak@linux.intel.com>

hugetlb: add allocate function for hugepage migration · bf50bab2

Committed by Naoya Horiguchi on Sep 08, 2010

We can't use existing hugepage allocation functions to allocate hugepage
for page migration, because page migration can happen asynchronously with
the running processes and page migration users should call the allocation
function with physical addresses (not virtual addresses) as arguments.

ChangeLog since v3:
- unify alloc_buddy_huge_page() and alloc_buddy_huge_page_node()

ChangeLog since v2:
- remove unnecessary get/put_mems_allowed() (thanks to David Rientjes)

ChangeLog since v1:
- add comment on top of alloc_huge_page_no_vma()

Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Acked-by: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>
Reviewed-by: Christoph Lameter <cl@linux.com>
Signed-off-by: Andi Kleen <ak@linux.intel.com>

hugetlb: fix metadata corruption in hugetlb_fault() · 998b4382

Committed by Naoya Horiguchi on Sep 08, 2010

Since the PageHWPoison() check is there to avoid leaving a hwpoisoned page
in the pagecache mapped to the process, it should be done in the
"found in pagecache" branch, not in the common path.
Otherwise, metadata corruption occurs if a memory failure happens between
alloc_huge_page() and lock_page(), because the page fault fails with metadata
changes left behind (such as refcount, mapcount, etc.)

This patch moves the check to the "found in pagecache" branch and fixes the
problem.

ChangeLog since v2:
- remove retry check in "new allocation" path.
- make description more detailed
- change patch name from "HWPOISON, hugetlb: move PG_HWPoison bit check"

Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>
Acked-by: Mel Gorman <mel@csn.ul.ie>
Reviewed-by: Wu Fengguang <fengguang.wu@intel.com>
Reviewed-by: Christoph Lameter <cl@linux.com>
Signed-off-by: Andi Kleen <ak@linux.intel.com>

memcg: fix thresholds with use_hierarchy == 1 · ad4ca5f4

Committed by Kirill A. Shutemov on Oct 07, 2010

We need to check parent's thresholds if parent has use_hierarchy == 1 to
be sure that parent's threshold events will be triggered even if parent
itself is not active (no MEM_CGROUP_EVENTS).
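
The idea of checking ancestors can be modelled as a walk up the group tree. The Group class and both helpers are hypothetical names for this sketch, and the walk shown is only one plausible reading of the description, not the kernel's code.

```python
class Group:
    """Toy memory cgroup with an optional parent."""

    def __init__(self, name, parent=None, use_hierarchy=False):
        self.name = name
        self.parent = parent
        self.use_hierarchy = use_hierarchy


def hierarchical_parent(group):
    """Parent whose thresholds must also be checked, or None."""
    parent = group.parent
    if parent is not None and parent.use_hierarchy:
        return parent
    return None


def groups_to_check(group):
    """Names of all groups whose thresholds an event in `group` may cross."""
    names = []
    while group is not None:
        names.append(group.name)
        group = hierarchical_parent(group)
    return names


# Hypothetical tree: root -> mid -> leaf with hierarchy enabled throughout,
# plus a flat parent that does not account for its child.
root = Group("root", use_hierarchy=True)
mid = Group("mid", parent=root, use_hierarchy=True)
leaf = Group("leaf", parent=mid, use_hierarchy=True)
flat = Group("flat")
child = Group("child", parent=flat)
```

An event charged in leaf changes the usage of mid and root too, so their thresholds need checking even if no task runs directly in them.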

Signed-off-by: Kirill A. Shutemov <kirill@shutemov.name>
Reviewed-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Acked-by: Balbir Singh <balbir@linux.vnet.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

mm: alloc_large_system_hash() printk overflow on 16TB boot · f241e660

Committed by Robin Holt on Oct 07, 2010

During boot of a 16TB system, the following is printed:
Dentry cache hash table entries: -2147483648 (order: 22, 17179869184 bytes)
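
The numbers in that boot line are consistent with an entry count of 2**31 being printed through a 32-bit signed format. The sketch below reproduces the line under that assumption, together with an assumed 4 KiB page size; the helper name is hypothetical.

```python
import ctypes

PAGE_SIZE = 4096  # assumption: 4 KiB pages


def printf_d(value):
    """Text that a 32-bit signed integer format would show for `value`."""
    return str(ctypes.c_int32(value).value)


entries = 1 << 31                      # assumed entry count on the 16TB system
order = 22                             # from the boot message
size_bytes = (1 << order) * PAGE_SIZE  # 2**22 pages of 4 KiB each

line = "Dentry cache hash table entries: %s (order: %d, %d bytes)" % (
    printf_d(entries), order, size_bytes)
```

Printed with a format wide enough for the value, the same count would read 2147483648.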

Signed-off-by: Robin Holt <holt@sgi.com>
Reviewed-by: WANG Cong <xiyou.wangcong@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

07 Oct, 2010 2 commits

HWPOISON: Stop shrinking at right page count · 47f43e7e

Committed by Andi Kleen on Sep 28, 2010

When we call the slab shrinker to free a page we need to stop at
page count one because the caller always holds a single reference, not zero.

This avoids useless looping over slab shrinkers and freeing too much
memory.

Reviewed-by: Wu Fengguang <fengguang.wu@intel.com>
Signed-off-by: Andi Kleen <ak@linux.intel.com>

HWPOISON: Report correct address granuality for AO huge page errors · 0d9ee6a2

Committed by Andi Kleen on Sep 27, 2010

The SIGBUS user space signalling is supposed to report the
address granularity of a corruption. Pass this information correctly
for huge pages by querying the hpage order.
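
As a worked example of the arithmetic, the reported granularity can be expressed as the number of low address bits covered by the page: the page order plus the base page shift. The sketch assumes 4 KiB base pages and an order-9 (2 MiB) huge page; the function names are illustrative.

```python
PAGE_SHIFT = 12  # assumption: 4 KiB base pages


def addr_lsb(page_order):
    """Least significant valid address bit for a page of the given order."""
    return page_order + PAGE_SHIFT


def granule_bytes(page_order):
    """Size in bytes of the region the reported address refers to."""
    return 1 << addr_lsb(page_order)


base_page = (addr_lsb(0), granule_bytes(0))   # ordinary page
huge_page = (addr_lsb(9), granule_bytes(9))   # assumed order-9 huge page
```

Reporting the base page value for a huge page would tell user space that only 4 KiB is affected when the whole 2 MiB mapping is.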

Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Reviewed-by: Wu Fengguang <fengguang.wu@intel.com>
Signed-off-by: Andi Kleen <ak@linux.intel.com>