1. 11 January 2012, 29 commits
    • mm: bootmem: try harder to free pages in bulk · 799f933a
      Johannes Weiner committed
      The loop that frees pages to the page allocator while bootstrapping tries
      to free higher-order blocks only when the starting address is aligned to
      that block size.  Otherwise it will free all pages on that node
      one-by-one.
      
      Change it to free individual pages up to the first aligned block and then
      try higher-order frees from there.
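      
      The resulting loop is roughly the following (a simplified sketch of
      the idea, not the exact kernel code; all_bits_free() stands in for the
      bitmap-word test the real code does inline):
      
      	while (start < end) {
      		if (IS_ALIGNED(start, BITS_PER_LONG) &&
      		    start + BITS_PER_LONG <= end &&
      		    all_bits_free(bdata, start)) {
      			/* whole aligned block is free: free it in bulk */
      			int order = ilog2(BITS_PER_LONG);
      
      			__free_pages_bootmem(pfn_to_page(start), order);
      			start += BITS_PER_LONG;
      		} else {
      			/* unaligned head or partially reserved word: free
      			 * single pages (reserved-page checks omitted) */
      			__free_pages_bootmem(pfn_to_page(start), 0);
      			start++;
      		}
      	}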
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Uwe Kleine-König <u.kleine-koenig@pengutronix.de>
      Cc: Tejun Heo <tj@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: bootmem: drop superfluous range check when freeing pages in bulk · 560a036b
      Johannes Weiner committed
      The area node_bootmem_map represents is aligned to BITS_PER_LONG, and
      all bits in any aligned word of that map are valid.  When the
      represented area extends beyond the end of the node, the non-existent
      pages will be marked as reserved.
      
      As a result, when freeing a page block, doing an explicit range check for
      whether that block is within the node's range is redundant as the bitmap
      is consulted anyway to see whether all pages in the block are unreserved.
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Uwe Kleine-König <u.kleine-koenig@pengutronix.de>
      Cc: Tejun Heo <tj@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: page_alloc: generalize order handling in __free_pages_bootmem() · c3993076
      Johannes Weiner committed
      __free_pages_bootmem() used to special-case higher-order frees to save
      individual page checking with free_pages_bulk().
      
      Nowadays, both zero order and non-zero order frees use free_pages(), which
      checks each individual page anyway, and so there is little point in making
      the distinction anymore.  The higher-order loop will work just fine for
      zero order pages.
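      
      The unified path then looks roughly like this (a sketch reconstructed
      from the description above; for zero order the loop body simply runs
      once):
      
      	unsigned int nr_pages = 1 << order;
      	unsigned int loop;
      
      	prefetchw(page);
      	for (loop = 0; loop < nr_pages; loop++) {
      		struct page *p = &page[loop];
      
      		if (loop + 1 < nr_pages)
      			prefetchw(p + 1);
      		__ClearPageReserved(p);
      		set_page_count(p, 0);
      	}
      
      	set_page_refcounted(page);
      	__free_pages(page, order);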
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Uwe Kleine-König <u.kleine-koenig@pengutronix.de>
      Cc: Tejun Heo <tj@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • tracepoint: add tracepoints for debugging oom_score_adj · 43d2b113
      KAMEZAWA Hiroyuki committed
      oom_score_adj is used for guarding processes from the OOM killer.  One
      problem is that it is inherited at fork().  When a daemon sets
      oom_score_adj and then forks children, it is hard to know where the
      value was set.
      
      This patch adds three tracepoints, useful for debugging:
        - creating new task
        - renaming a task (exec)
        - set oom_score_adj
      
      To debug, users need to enable the relevant tracepoints.  Filtering may be useful, for example:
      
      # EVENT=/sys/kernel/debug/tracing/events/task/
      # echo "oom_score_adj != 0" > $EVENT/task_newtask/filter
      # echo "oom_score_adj != 0" > $EVENT/task_rename/filter
      # echo 1 > $EVENT/enable
      # EVENT=/sys/kernel/debug/tracing/events/oom/
      # echo 1 > $EVENT/enable
      
      The output will look like this:
      # grep oom /sys/kernel/debug/tracing/trace
      bash-7699  [007] d..3  5140.744510: oom_score_adj_update: pid=7699 comm=bash oom_score_adj=-1000
      bash-7699  [007] ...1  5151.818022: task_newtask: pid=7729 comm=bash clone_flags=1200011 oom_score_adj=-1000
      ls-7729  [003] ...2  5151.818504: task_rename: pid=7729 oldcomm=bash newcomm=ls oom_score_adj=-1000
      bash-7699  [002] ...1  5175.701468: task_newtask: pid=7730 comm=bash clone_flags=1200011 oom_score_adj=-1000
      grep-7730  [007] ...2  5175.701993: task_rename: pid=7730 oldcomm=bash newcomm=grep oom_score_adj=-1000
      Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Acked-by: David Rientjes <rientjes@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: simplify find_vma_prev() · 6bd4837d
      KOSAKI Motohiro committed
      commit 297c5eee ("mm: make the vma list be doubly linked") added the
      vm_prev member to vm_area_struct.  We can simplify find_vma_prev() by
      using it.  Also, this change helps to improve page fault performance
      because it has stronger locality of reference.
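      
      With vm_prev available, the lookup reduces to something like this
      sketch:
      
      	struct vm_area_struct *
      	find_vma_prev(struct mm_struct *mm, unsigned long addr,
      		      struct vm_area_struct **pprev)
      	{
      		struct vm_area_struct *vma;
      
      		vma = find_vma(mm, addr);
      		*pprev = vma ? vma->vm_prev : NULL;
      
      		return vma;
      	}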
      Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Shaohua Li <shaohua.li@intel.com>
      Cc: Michal Hocko <mhocko@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mremap: enforce rmap src/dst vma ordering in case of vma_merge() succeeding in copy_vma() · 948f017b
      Andrea Arcangeli committed
      migrate was doing an rmap_walk with speculative lock-less access on
      pagetables.  That could lead to it not serializing properly against
      mremap PT locks.  But a second problem remains in the order of vmas in
      the same_anon_vma list used by the rmap_walk.
      
      If vma_merge succeeds in copy_vma, the src vma could be placed after the
      dst vma in the same_anon_vma list.  That could still lead to migrate
      missing some pte.
      
      This patch adds an anon_vma_moveto_tail() function that forces the dst
      vma to the end of the list before mremap starts, to solve the problem.
      
      If the mremap is very large and there are a lot of parents or children
      sharing the anon_vma root lock, this should still scale better than
      taking the anon_vma root lock around every pte copy for practically the
      whole duration of mremap.
      
      Update: Hugh noticed that special care is needed in the error path,
      where move_page_tables() runs in the reverse direction; a second
      anon_vma_moveto_tail() call is needed there.
      
      This program exercises anon_vma_moveto_tail():
      
      ===
      
      #define _GNU_SOURCE		/* for mremap() */
      #include <stdio.h>
      #include <stdlib.h>
      #include <string.h>
      #include <sys/mman.h>
      #include <sys/time.h>
      
      /* SIZE is not defined in the original posting; an assumed value: */
      #define SIZE (8*1024*1024)
      
      int main()
      {
      	static struct timeval oldstamp, newstamp;
      	long diffsec;
      	char *p, *p2, *p3, *p4;
      	if (posix_memalign((void **)&p, 2*1024*1024, SIZE))
      		perror("memalign"), exit(1);
      	if (posix_memalign((void **)&p2, 2*1024*1024, SIZE))
      		perror("memalign"), exit(1);
      	if (posix_memalign((void **)&p3, 2*1024*1024, SIZE))
      		perror("memalign"), exit(1);
      
      	memset(p, 0xff, SIZE);
      	printf("%p\n", p);
      	memset(p2, 0xff, SIZE);
      	memset(p3, 0x77, 4096);
      	if (memcmp(p, p2, SIZE))
      		printf("error\n");
      	p4 = mremap(p+SIZE/2, SIZE/2, SIZE/2, MREMAP_FIXED|MREMAP_MAYMOVE, p3);
      	if (p4 != p3)
      		perror("mremap"), exit(1);
      	p4 = mremap(p4, SIZE/2, SIZE/2, MREMAP_FIXED|MREMAP_MAYMOVE, p+SIZE/2);
      	if (p4 != p+SIZE/2)
      		perror("mremap"), exit(1);
      	if (memcmp(p, p2, SIZE))
      		printf("error\n");
      	printf("ok\n");
      
      	return 0;
      }
      ===
      
      $ perf probe -a anon_vma_moveto_tail
      Add new event:
        probe:anon_vma_moveto_tail (on anon_vma_moveto_tail)
      
      You can now use it on all perf tools, such as:
      
              perf record -e probe:anon_vma_moveto_tail -aR sleep 1
      
      $ perf record -e probe:anon_vma_moveto_tail -aR ./anon_vma_moveto_tail
      0x7f2ca2800000
      ok
      [ perf record: Woken up 1 times to write data ]
      [ perf record: Captured and wrote 0.043 MB perf.data (~1860 samples) ]
      $ perf report --stdio
         100.00%  anon_vma_moveto  [kernel.kallsyms]  [k] anon_vma_moveto_tail
      Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
      Reported-by: Nai Xia <nai.xia@gmail.com>
      Acked-by: Mel Gorman <mgorman@suse.de>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Pawel Sikora <pluto@agmk.net>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: fix off-by-two in __zone_watermark_ok() · df0a6daa
      Michal Hocko committed
      Commit 88f5acf8 ("mm: page allocator: adjust the per-cpu counter
      threshold when memory is low") changed the way free_pages is
      calculated, but forgot that we used to do free_pages - ((1 << order) -
      1), so we ended up with an off-by-two when calculating free_pages.
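      
      Conceptually, the corrected check discounts the pages the request
      itself consumes before comparing against the watermark (sketch):
      
      	/* an order-N request takes (1 << order) pages; all but the
      	 * first must be subtracted from the free count up front */
      	free_pages -= (1 << order) - 1;
      	if (free_pages <= min + z->lowmem_reserve[classzone_idx])
      		return false;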
      Reported-by: Wang Sheng-Hui <shhuiw@gmail.com>
      Signed-off-by: Michal Hocko <mhocko@suse.cz>
      Acked-by: Mel Gorman <mgorman@suse.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • bootmem: micro optimize freeing pages in bulk · 9571a982
      Uwe Kleine-König committed
      The first entry of bdata->node_bootmem_map holds the data for
      bdata->node_min_pfn up to bdata->node_min_pfn + BITS_PER_LONG - 1.  So the
      test for freeing all pages of a single map entry can be slightly relaxed.
      
      Moreover, use DIV_ROUND_UP in another place instead of open-coding it.
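      
      For reference, the two forms are equivalent (illustrative operands):
      
      	/* open-coded */
      	words = (pages + BITS_PER_LONG - 1) / BITS_PER_LONG;
      	/* with the helper from kernel.h */
      	words = DIV_ROUND_UP(pages, BITS_PER_LONG);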
      Signed-off-by: Uwe Kleine-König <u.kleine-koenig@pengutronix.de>
      Cc: Johannes Weiner <hannes@saeurebad.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: compaction: push isolate search base of compact control one pfn ahead · 31b8384a
      Hillf Danton committed
      Once isolated, the current pfn will not be scanned and isolated again
      even if another round is necessary, so push the isolate_migratepages
      search base of the given compact_control one pfn ahead.
      Signed-off-by: Hillf Danton <dhillf@gmail.com>
      Reviewed-by: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: filemap: pass __GFP_WRITE from grab_cache_page_write_begin() · 0faa70cb
      Johannes Weiner committed
      Tell the page allocator that pages allocated through
      grab_cache_page_write_begin() are expected to become dirty soon.
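      
      A sketch of the change (simplified; the exact guarding condition in the
      patch may differ):
      
      	gfp_t gfp_mask = mapping_gfp_mask(mapping);
      
      	if (mapping_cap_account_dirty(mapping))
      		gfp_mask |= __GFP_WRITE;	/* will be dirtied soon */
      	page = __page_cache_alloc(gfp_mask);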
      Signed-off-by: Johannes Weiner <jweiner@redhat.com>
      Reviewed-by: Rik van Riel <riel@redhat.com>
      Acked-by: Mel Gorman <mgorman@suse.de>
      Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
      Reviewed-by: Michal Hocko <mhocko@suse.cz>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Wu Fengguang <fengguang.wu@intel.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Shaohua Li <shaohua.li@intel.com>
      Cc: Chris Mason <chris.mason@oracle.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: try to distribute dirty pages fairly across zones · a756cf59
      Johannes Weiner committed
      The maximum number of dirty pages that exist in the system at any time is
      determined by a number of pages considered dirtyable and a user-configured
      percentage of those, or an absolute number in bytes.
      
      This number of dirtyable pages is the sum of memory provided by all the
      zones in the system minus their lowmem reserves and high watermarks, so
      that the system can retain a healthy number of free pages without having
      to reclaim dirty pages.
      
      But there is a flaw in that we have a zoned page allocator which does not
      care about the global state but rather the state of individual memory
      zones.  And right now there is nothing that prevents one zone from filling
      up with dirty pages while other zones are spared, which frequently leads
      to situations where kswapd, in order to restore the watermark of free
      pages, does indeed have to write pages from that zone's LRU list.  This
      can interfere so badly with IO from the flusher threads that major
      filesystems (btrfs, xfs, ext4) mostly ignore write requests from reclaim
      already, taking away the VM's only possibility to keep such a zone
      balanced, aside from hoping the flushers will soon clean pages from that
      zone.
      
      Enter per-zone dirty limits.  They are to a zone's dirtyable memory what
      the global limit is to the global amount of dirtyable memory, and try to
      make sure that no single zone receives more than its fair share of the
      globally allowed dirty pages in the first place.  As the number of pages
      considered dirtyable excludes the zones' lowmem reserves and high
      watermarks, the maximum number of dirty pages in a zone is such that the
      zone can always be balanced without requiring page cleaning.
      
      As this is a placement decision in the page allocator and pages are
      dirtied only after the allocation, this patch allows allocators to pass
      __GFP_WRITE when they know in advance that the page will be written to and
      become dirty soon.  The page allocator will then attempt to allocate from
      the first zone of the zonelist - which on NUMA is determined by the task's
      NUMA memory policy - that has not exceeded its dirty limit.
      
      At first glance, it would appear that the diversion to lower zones can
      increase pressure on them, but this is not the case.  With a full high
      zone, allocations will be diverted to lower zones eventually, so it is
      more of a shift in timing of the lower zone allocations.  Workloads that
      previously could fit their dirty pages completely in the higher zone may
      be forced to allocate from lower zones, but the number of pages that
      "spill over" is itself limited by the lower zones' dirty constraints,
      and thus unlikely to become a problem.
      
      For now, the problem of unfair dirty page distribution remains for NUMA
      configurations where the zones allowed for allocation are in sum not big
      enough to trigger the global dirty limits, wake up the flusher threads and
      remedy the situation.  Because of this, an allocation that could not
      succeed on any of the considered zones is allowed to ignore the dirty
      limits before going into direct reclaim or even failing the allocation,
      until a future patch changes the global dirty throttling and flusher
      thread activation so that they take individual zone states into account.
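      
      In the allocator's zonelist walk this amounts to, roughly (sketch;
      zone_dirty_ok() is the per-zone counterpart of the global dirty check):
      
      	for_each_zone_zonelist(zone, z, zonelist, high_zoneidx) {
      		if ((gfp_mask & __GFP_WRITE) && !zone_dirty_ok(zone))
      			continue;	/* zone has its share of dirty pages */
      		/* ... watermark checks and the actual allocation ... */
      	}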
      
      			Test results
      
      15M DMA + 3246M DMA32 + 504M Normal = 3765M memory
      40% dirty ratio
      16G USB thumb drive
      10 runs of dd if=/dev/zero of=disk/zeroes bs=32k count=$((10 << 15))
      
      		seconds			nr_vmscan_write
      		        (stddev)	       min|     median|        max
      xfs
      vanilla:	 549.747( 3.492)	     0.000|      0.000|      0.000
      patched:	 550.996( 3.802)	     0.000|      0.000|      0.000
      
      fuse-ntfs
      vanilla:	1183.094(53.178)	 54349.000|  59341.000|  65163.000
      patched:	 558.049(17.914)	     0.000|      0.000|     43.000
      
      btrfs
      vanilla:	 573.679(14.015)	156657.000| 460178.000| 606926.000
      patched:	 563.365(11.368)	     0.000|      0.000|   1362.000
      
      ext4
      vanilla:	 561.197(15.782)	     0.000|2725438.000|4143837.000
      patched:	 568.806(17.496)	     0.000|      0.000|      0.000
      Signed-off-by: Johannes Weiner <jweiner@redhat.com>
      Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
      Acked-by: Mel Gorman <mgorman@suse.de>
      Reviewed-by: Michal Hocko <mhocko@suse.cz>
      Tested-by: Wu Fengguang <fengguang.wu@intel.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Shaohua Li <shaohua.li@intel.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Chris Mason <chris.mason@oracle.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: writeback: cleanups in preparation for per-zone dirty limits · ccafa287
      Johannes Weiner committed
      The next patch will introduce per-zone dirty limiting functions in
      addition to the traditional global dirty limiting.
      
      Rename determine_dirtyable_memory() to global_dirtyable_memory() before
      adding the zone-specific version, and fix up its documentation.
      
      Also, move the functions to determine the dirtyable memory and the
      function to calculate the dirty limit based on that together so that their
      relationship is more apparent and that they can be commented on as a
      group.
      Signed-off-by: Johannes Weiner <jweiner@redhat.com>
      Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
      Acked-by: Mel Gorman <mel@suse.de>
      Reviewed-by: Michal Hocko <mhocko@suse.cz>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Wu Fengguang <fengguang.wu@intel.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Shaohua Li <shaohua.li@intel.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Chris Mason <chris.mason@oracle.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: exclude reserved pages from dirtyable memory · ab8fabd4
      Johannes Weiner committed
      Per-zone dirty limits try to distribute page cache pages allocated for
      writing across zones in proportion to the individual zone sizes, to reduce
      the likelihood of reclaim having to write back individual pages from the
      LRU lists in order to make progress.
      
      This patch:
      
      The amount of dirtyable pages should not include the full number of free
      pages: there is a number of reserved pages that the page allocator and
      kswapd always try to keep free.
      
      The closer (reclaimable pages - dirty pages) is to the number of reserved
      pages, the more likely it becomes for reclaim to run into dirty pages:
      
             +----------+ ---
             |   anon   |  |
             +----------+  |
             |          |  |
             |          |  -- dirty limit new    -- flusher new
             |   file   |  |                     |
             |          |  |                     |
             |          |  -- dirty limit old    -- flusher old
             |          |                        |
             +----------+                       --- reclaim
             | reserved |
             +----------+
             |  kernel  |
             +----------+
      
      This patch introduces a per-zone dirty reserve that takes both the lowmem
      reserve as well as the high watermark of the zone into account, and a
      global sum of those per-zone values that is subtracted from the global
      amount of dirtyable pages.  The lowmem reserve is unavailable to page
      cache allocations and kswapd tries to keep the high watermark free.  We
      don't want to end up in a situation where reclaim has to clean pages in
      order to balance zones.
      
      Not treating reserved pages as dirtyable on a global level is only a
      conceptual fix.  In reality, dirty pages are not distributed equally
      across zones and reclaim runs into dirty pages on a regular basis.
      
      But it is important to get this right before tackling the problem on a
      per-zone level, where the distance between reclaim and the dirty pages is
      mostly much smaller in absolute numbers.
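      
      A sketch of the bookkeeping described above (simplified; names follow
      the changelog rather than the final code):
      
      	unsigned long reserve_pages = 0;
      	struct zone *zone;
      
      	for_each_zone(zone) {
      		unsigned long max = 0;
      		int i;
      
      		/* the largest lowmem reserve is off-limits to page cache */
      		for (i = 0; i < MAX_NR_ZONES; i++)
      			if (zone->lowmem_reserve[i] > max)
      				max = zone->lowmem_reserve[i];
      		/* kswapd tries to keep the high watermark free */
      		max += high_wmark_pages(zone);
      
      		zone->dirty_balance_reserve = max;
      		reserve_pages += max;
      	}
      	/* subtracted from the global amount of dirtyable pages */
      	dirty_balance_reserve = reserve_pages;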
      
      [akpm@linux-foundation.org: fix highmem build]
      Signed-off-by: Johannes Weiner <jweiner@redhat.com>
      Reviewed-by: Rik van Riel <riel@redhat.com>
      Reviewed-by: Michal Hocko <mhocko@suse.cz>
      Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
      Acked-by: Mel Gorman <mgorman@suse.de>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Wu Fengguang <fengguang.wu@intel.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Shaohua Li <shaohua.li@intel.com>
      Cc: Chris Mason <chris.mason@oracle.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • vmscan: add task name to warn_scan_unevictable() messages · 25bd91bd
      KOSAKI Motohiro committed
      If we need to understand a use case, the calling program's name is
      critically important.  Show it.
      Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      David Rientjes <rientjes@google.com>
      Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
      Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • fadvise: only initiate writeback for specified range with FADV_DONTNEED · ad8a1b55
      Shawn Bohrer committed
      Previously POSIX_FADV_DONTNEED would start writeback for the entire file
      when the bdi was not write congested.  This negatively impacts performance
      if the file contains dirty pages outside of the requested range.  This
      change uses __filemap_fdatawrite_range() to only initiate writeback for
      the requested range.
      Signed-off-by: Shawn Bohrer <sbohrer@rgmadvisors.com>
      Acked-by: Johannes Weiner <jweiner@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • slub: min order when debug_guardpage_minorder > 0 · fc8d8620
      Stanislaw Gruszka committed
      Disable slub debug facilities and allocate slabs at minimal order when
      debug_guardpage_minorder > 0, to increase the probability of catching
      random memory corruption via a CPU exception.
      Signed-off-by: Stanislaw Gruszka <sgruszka@redhat.com>
      Cc: "Rafael J. Wysocki" <rjw@sisk.pl>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Acked-by: Christoph Lameter <cl@linux.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Stanislaw Gruszka <sgruszka@redhat.com>
      Cc: Pekka Enberg <penberg@cs.helsinki.fi>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: more intensive memory corruption debugging · c0a32fc5
      Stanislaw Gruszka committed
      With CONFIG_DEBUG_PAGEALLOC configured, the CPU will generate an
      exception on any access (read or write) to an unallocated page, which
      permits us to catch code that corrupts memory.  However, the kernel
      tries to maximise memory usage, hence there are usually few free pages
      in the system and buggy code usually corrupts some crucial data.
      
      This patch changes the buddy allocator to keep more free/protected pages
      and to interlace free/protected and allocated pages to increase the
      probability of catching corruption.
      
      When the kernel is compiled with CONFIG_DEBUG_PAGEALLOC,
      debug_guardpage_minorder defines the minimum order used by the page
      allocator to grant a request.  The requested size will be returned with
      the remaining pages used as guard pages.
      
      The default value of debug_guardpage_minorder is zero: no change from
      current behaviour.
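      
      A sketch of the split path in the buddy allocator (simplified;
      set_page_guard() is a stand-in name for the guard-page bookkeeping the
      patch introduces):
      
      	/* in expand(): splitting a high-order block down to 'low' */
      	while (high > low) {
      		area--;
      		high--;
      		size >>= 1;
      		if (high < debug_guardpage_minorder()) {
      			/* turn the buddy into inaccessible guard pages
      			 * instead of putting it back on the free list */
      			set_page_guard(zone, &page[size], high);
      			continue;
      		}
      		list_add(&page[size].lru, &area->free_list[migratetype]);
      		area->nr_free++;
      	}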
      
      [akpm@linux-foundation.org: tweak documentation, s/flg/flag/]
      Signed-off-by: Stanislaw Gruszka <sgruszka@redhat.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: "Rafael J. Wysocki" <rjw@sisk.pl>
      Cc: Christoph Lameter <cl@linux-foundation.org>
      Cc: Pekka Enberg <penberg@cs.helsinki.fi>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/hugetlb.c: fix virtual address handling in hugetlb fault · 1e16a539
      KAMEZAWA Hiroyuki committed
      handle_mm_fault() passes 'faulted' address to hugetlb_fault().  This
      address is not aligned to a hugepage boundary.
      
      Most of the functions for hugetlb pages are aware of that and calculate an
      alignment themselves.  However some functions such as
      copy_user_huge_page() and clear_huge_page() don't handle alignment by
      themselves.
      
      This patch makes hugetlb_fault() fix the alignment and pass an aligned
      address (the address of the faulted hugepage) to those functions.
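      
      The fix boils down to masking the faulting address before passing it on
      (see the "use &=" note below):
      
      	address &= huge_page_mask(h);	/* align to the hugepage boundary */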
      
      [akpm@linux-foundation.org: use &=]
      Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • hugetlb: clarify hugetlb_instantiation_mutex usage · ef009b25
      Michal Hocko committed
      Let's make it clear that we cannot race with other fault handlers due
      to the hugetlb (global) mutex.  Also make it clear that we want to keep
      the pte_same checks anyway, to make a transition away from the global
      mutex easier.
      Signed-off-by: Michal Hocko <mhocko@suse.cz>
      Cc: Hillf Danton <dhillf@gmail.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Johannes Weiner <jweiner@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • hugetlb: detect race upon page allocation failure during COW · a734bcc8
      Hillf Danton committed
      Currently we do not recheck pte_same in hugetlb_cow after retaking the
      page table lock in the page-allocation-failure code path; we simply
      retry.  This is not an issue at the moment because the hugetlb fault
      path is protected by hugetlb_instantiation_mutex, so we cannot race.
      
      The original page is locked and so we cannot race even with the page
      migration.
      
      Let's add the pte_same check anyway as we want to be consistent with the
      other check later in this function and be safe if we ever remove the
      mutex.
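      
      The recheck follows the usual pattern (sketch, simplified from the
      failure path described above):
      
      	spin_lock(&mm->page_table_lock);
      	ptep = huge_pte_offset(mm, address & huge_page_mask(h));
      	if (likely(pte_same(huge_ptep_get(ptep), pte))) {
      		/* pte unchanged: still safe to back out and retry */
      	}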
      
      [mhocko@suse.cz: reworded the changelog]
      Signed-off-by: Hillf Danton <dhillf@gmail.com>
      Signed-off-by: Michal Hocko <mhocko@suse.cz>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Johannes Weiner <jweiner@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: avoid livelock on !__GFP_FS allocations · f90ac398
      Mel Gorman committed
      Colin Cross reported:
      
        Under the following conditions, __alloc_pages_slowpath can loop forever:
        gfp_mask & __GFP_WAIT is true
        gfp_mask & __GFP_FS is false
        reclaim and compaction make no progress
        order <= PAGE_ALLOC_COSTLY_ORDER
      
        These conditions happen very often during suspend and resume,
        when pm_restrict_gfp_mask() effectively converts all GFP_KERNEL
        allocations into __GFP_WAIT.
      
        The oom killer is not run because gfp_mask & __GFP_FS is false,
        but should_alloc_retry will always return true when order is less
        than PAGE_ALLOC_COSTLY_ORDER.
      
      In his fix, he avoided retrying the allocation if reclaim made no
      progress and __GFP_FS was not set.  The problem is that this would
      cause GFP_NOIO allocations to fail that previously succeeded, which
      would be very unfortunate.
      
      The big difference between GFP_NOIO and suspend converting GFP_KERNEL to
      behave like GFP_NOIO is that normally flushers will be cleaning pages and
      kswapd reclaims pages allowing GFP_NOIO to succeed after a short delay.
      The same does not necessarily apply during suspend as the storage device
      may be suspended.
      
      This patch special cases the suspend case to fail the page allocation if
      reclaim cannot make progress and adds some documentation on how
      gfp_allowed_mask is currently used.  Failing allocations like this may
      cause suspend to abort but that is better than a livelock.
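      
      The special case amounts to a bail-out in the retry decision, roughly
      (sketch; pm_suspended_storage() names the test for whether
      pm_restrict_gfp_mask() is in effect):
      
      	/* in should_alloc_retry(): */
      	if (!did_some_progress && pm_suspended_storage())
      		return 0;	/* fail the allocation, do not loop */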
      
      [mgorman@suse.de: Rework fix to be suspend specific]
      [rientjes@google.com: Move suspended device check to should_alloc_retry]
      Reported-by: Colin Cross <ccross@android.com>
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Acked-by: David Rientjes <rientjes@google.com>
      Cc: Minchan Kim <minchan.kim@gmail.com>
      Cc: Pekka Enberg <penberg@cs.helsinki.fi>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: reduce the amount of work done when updating min_free_kbytes · 938929f1
      Mel Gorman committed
      When min_free_kbytes is updated, some pageblocks are marked
      MIGRATE_RESERVE.  Ordinarily, this work is unnoticeable as it happens
      early in boot, but on large machines with 1TB of memory it has been
      reported to delay boot times, probably due to the NUMA distances
      involved.
      
      The bulk of the work is due to calling pageblock_is_reserved() an
      unnecessary number of times and accessing far more struct page metadata
      than is necessary.  This patch significantly reduces the amount of work
      done by setup_zone_migrate_reserve(), improving boot times on 1TB
      machines.
      
      [akpm@linux-foundation.org: coding-style fixes]
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: migrate: one less atomic operation · 937a94c9
      Jacobo Giralt committed
      migrate_page_move_mapping() drops a reference from the old page after
      unfreezing its counter.  Both operations can be merged into a single
      atomic operation by directly unfreezing to one less reference.
      
      The same applies to migrate_huge_page_move_mapping().
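      
      The two operations collapse into one, roughly:
      
      	/* before: unfreeze, then drop the old page's extra reference */
      	page_unfreeze_refs(page, expected_count);
      	__put_page(page);
      
      	/* after: unfreeze directly to one less reference */
      	page_unfreeze_refs(page, expected_count - 1);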
      Signed-off-by: Jacobo Giralt <jacobo.giralt@gmail.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Minchan Kim <minchan.kim@gmail.com>
      Cc: Johannes Weiner <jweiner@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm-tracepoint: rename page-free events · b413d48a
      Konstantin Khlebnikov committed
      Rename mm_page_free_direct to mm_page_free and mm_pagevec_free to
      mm_page_free_batched.
      
      Since v2.6.33-5426-gc475dab6 the kernel triggers mm_page_free_direct
      for all freed pages, not only for directly freed ones.  So let's name
      it properly.  For pages freed via a page list we also trigger the
      mm_page_free_batched event.
      Signed-off-by: Konstantin Khlebnikov <khlebnikov@openvz.org>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
      Cc: Hugh Dickins <hughd@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: remove unused pagevec_free · da066ad3
      Konstantin Khlebnikov committed
      It is not exported and nobody uses it now.
      Signed-off-by: Konstantin Khlebnikov <khlebnikov@openvz.org>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
      Acked-by: Hugh Dickins <hughd@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: add free_hot_cold_page_list() helper · cc59850e
      Konstantin Khlebnikov committed
      This patch adds a helper, free_hot_cold_page_list(), to free a list of
      0-order pages.  It frees pages directly from the list, without a
      temporary page-vector.  It also calls trace_mm_pagevec_free() to
      simulate pagevec_free() behaviour.
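      
      The helper is essentially a direct walk over the list (a sketch based
      on the description above):
      
      	void free_hot_cold_page_list(struct list_head *list, int cold)
      	{
      		struct page *page, *next;
      
      		list_for_each_entry_safe(page, next, list, lru) {
      			trace_mm_pagevec_free(page, cold);
      			free_hot_cold_page(page, cold);
      		}
      	}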
      
      bloat-o-meter:
      
      add/remove: 1/1 grow/shrink: 1/3 up/down: 267/-295 (-28)
      function                                     old     new   delta
      free_hot_cold_page_list                        -     264    +264
      get_page_from_freelist                      2129    2132      +3
      __pagevec_free                               243     239      -4
      split_free_page                              380     373      -7
      release_pages                                606     510     -96
      free_page_list                               188       -    -188
      Signed-off-by: Konstantin Khlebnikov <khlebnikov@openvz.org>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Acked-by: Minchan Kim <minchan.kim@gmail.com>
      Acked-by: Hugh Dickins <hughd@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • vmscan: activate executable pages after first usage · c909e993
      Konstantin Khlebnikov committed
      Logic added in commit 8cab4754 ("vmscan: make mapped executable pages
      the first class citizen") was noticeably weakened in commit
      64574746 ("vmscan: detect mapped file pages used only once").
      
      Currently these pages can become "first class citizens" only after
      their second use.  After this patch, page_check_references() activates
      them after the first use, so executable code gets an even better chance
      to stay in memory.
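      
      In page_check_references() this amounts to a short test on the vma
      flags, roughly:
      
      	/* referenced, mapped and executable: activate right away */
      	if (vm_flags & VM_EXEC)
      		return PAGEREF_ACTIVATE;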
      Signed-off-by: Konstantin Khlebnikov <khlebnikov@openvz.org>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: Minchan Kim <minchan.kim@gmail.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Wu Fengguang <fengguang.wu@intel.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Nick Piggin <npiggin@kernel.dk>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Shaohua Li <shaohua.li@intel.com>
      Cc: Rik van Riel <riel@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • vmscan: promote shared file mapped pages · 34dbc67a
      Konstantin Khlebnikov committed
      Commit 64574746 ("vmscan: detect mapped file pages used only once")
      greatly decreases the lifetime of singly-used mapped file pages.
      Unfortunately, it also decreases the lifetime of all shared mapped file
      pages, because after commit bf3f3bc5 ("mm: don't mark_page_accessed
      in fault path") the page-fault handler does not mark the page active or
      even referenced.
      
      Thus page_check_references() activates a file page only if it was used
      twice while on the inactive list, whereas it activates anon pages after
      the first access.  The inactive list can be small enough that the
      reclaimer accidentally throws away any widely used page if it wasn't
      used twice in a short period.
      
      After this patch page_check_references() also activates a file-mapped
      page at the first inactive-list scan if the page is already used
      multiple times via several ptes.
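      
      The corresponding check, sketched:
      
      	/* referenced through several ptes: widely shared, activate */
      	if (referenced_ptes > 1)
      		return PAGEREF_ACTIVATE;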
      
      I found this while trying to fix a degradation in rhel6 (~2.6.32)
      relative to rhel5 (~2.6.18).  There is a complete mess with >100
      web/mail/spam/ftp containers; they share all their files, but there are
      a lot of anonymous pages: ~500mb shared file-mapped memory and 15-20Gb
      non-shared anonymous memory.  In this situation major pagefaults are
      very costly, because all containers share the same page.  In my load
      the kernel created a disproportionate pressure on the file memory
      compared with the anonymous; they equaled only if I raised swappiness
      up to 150 =)
      
      These patches actually didn't help a lot with my problem, but I saw a
      noticeable (10-20 times) reduction in the count and average time of
      major pagefaults in file-mapped areas.
      
      Actually both patches are fixes for commit v2.6.33-5448-g64574746: it
      was aimed at one scenario (singly-used pages), but it broke the logic
      in other scenarios (shared and/or executable pages).
      Signed-off-by: Konstantin Khlebnikov <khlebnikov@openvz.org>
      Acked-by: Pekka Enberg <penberg@kernel.org>
      Acked-by: Minchan Kim <minchan.kim@gmail.com>
      Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Wu Fengguang <fengguang.wu@intel.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Nick Piggin <npiggin@kernel.dk>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Shaohua Li <shaohua.li@intel.com>
      Cc: Rik van Riel <riel@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/page-writeback.c: make determine_dirtyable_memory static again · 1edf2234
      Johannes Weiner committed
      The tracing ring-buffer used this function briefly, but not anymore.
      Make it local to the writeback code again.
      
      Also, move the function so that no forward declaration needs to be
      reintroduced.
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Mel Gorman <mgorman@suse.de>
      Reviewed-by: Michal Hocko <mhocko@suse.cz>
      Cc: Wu Fengguang <fengguang.wu@intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  2. 10 January 2012, 1 commit
  3. 08 January 2012, 1 commit
    • net: fix sock_clone reference mismatch with tcp memcontrol · f3f511e1
      Glauber Costa committed
      Sockets can also be created through sock_clone. Because it copies
      all data in the sock structure, it also copies the memcg-related pointer,
      and all should be fine. However, since we now use reference counts in
      socket creation, we are left with some sockets that have no reference
      counts. It matters when we destroy them, since it leads to a mismatch.
      Signed-off-by: Glauber Costa <glommer@parallels.com>
      CC: David S. Miller <davem@davemloft.net>
      CC: Greg Thelen <gthelen@google.com>
      CC: Hiroyuki Kamezawa <kamezawa.hiroyu@jp.fujitsu.com>
      CC: Laurent Chavey <chavey@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  4. 07 January 2012, 1 commit
  5. 04 January 2012, 8 commits