1. 01 Aug, 2012 13 commits
    • M
      mm: sl[au]b: add knowledge of PFMEMALLOC reserve pages · 072bb0aa
      Mel Gorman committed
      When a user or administrator requires swap for their application, they
      create a swap partition or file, format it with mkswap and activate it
      with swapon.  Swap over the network is considered as an option in diskless
      systems.  The two likely scenarios are blade servers used as part of a
      cluster, where the form factor or maintenance costs do not allow the use
      of disks, and thin clients.
      
      The Linux Terminal Server Project recommends the use of the Network Block
      Device (NBD) for swap according to the manual at
      https://sourceforge.net/projects/ltsp/files/Docs-Admin-Guide/LTSPManual.pdf/download
      There are also documentation and tutorials on how to set up swap over NBD at
      places like https://help.ubuntu.com/community/UbuntuLTSP/EnableNBDSWAP The
      nbd-client also documents the use of NBD as swap.  Despite this, the fact
      is that a machine using NBD for swap can deadlock within minutes if swap
      is used intensively.  This patch series addresses the problem.
      
      The core issue is that network block devices do not use mempools like
      normal block devices do.  As the host cannot control where they receive
      packets from, they cannot reliably work out in advance how much memory
      they might need.  Some years ago, Peter Zijlstra developed a series of
      patches that supported swap over NFS, which at least one distribution is
      carrying within their kernels.  This patch series borrows very heavily
      from Peter's work to support swapping over NBD as a pre-requisite to
      supporting swap-over-NFS.  The bulk of the complexity is concerned with
      preserving memory that is allocated from the PFMEMALLOC reserves for use
      by the network layer which is needed for both NBD and NFS.
      
      Patch 1 adds knowledge of the PFMEMALLOC reserves to SLAB and SLUB to
      	preserve access to pages allocated under low memory situations
      	to callers that are freeing memory.
      
      Patch 2 optimises the SLUB fast path to avoid pfmemalloc checks
      
      Patch 3 introduces __GFP_MEMALLOC to allow access to the PFMEMALLOC
      	reserves without setting PFMEMALLOC.
      
      Patch 4 opens the possibility for softirqs to use PFMEMALLOC reserves
      	for later use by network packet processing.
      
      Patch 5 only sets page->pfmemalloc when ALLOC_NO_WATERMARKS was required
      
      Patch 6 ignores memory policies when ALLOC_NO_WATERMARKS is set.
      
      Patches 7-12 allow network processing to use PFMEMALLOC reserves when
      	the socket has been marked as being used by the VM to clean pages. If
      	packets are received and stored in pages that were allocated under
      	low-memory situations and are unrelated to the VM, the packets
      	are dropped.
      
      	Patch 11 reintroduces __skb_alloc_page which the networking
      	folk may object to but is needed in some cases to propagate
      	pfmemalloc from a newly allocated page to an skb. If there is a
      	strong objection, this patch can be dropped with the impact being
      	that swap-over-network will be slower in some cases but it should
      	not fail.
      
      Patch 13 is a micro-optimisation to avoid a function call in the
      	common case.
      
      Patch 14 tags NBD sockets as being SOCK_MEMALLOC so they can use
      	PFMEMALLOC if necessary.
      
      Patch 15 notes that it is still possible for the PFMEMALLOC reserve
      	to be depleted. To prevent this, direct reclaimers get throttled on
      	a waitqueue if 50% of the PFMEMALLOC reserves are depleted.  It is
      	expected that kswapd and the direct reclaimers already running
      	will clean enough pages for the low watermark to be reached and
      	the throttled processes are woken up.
      
      Patch 16 adds a statistic to track how often processes get throttled
      
      Some basic performance testing was run using kernel builds, netperf on
      loopback for UDP and TCP, hackbench (pipes and sockets), iozone and
      sysbench.  Each of them was expected to use the sl*b allocators
      reasonably heavily but there did not appear to be significant performance
      variances.
      
      For testing swap-over-NBD, a machine was booted with 2G of RAM with a
      swapfile backed by NBD.  8*NUM_CPU processes were started that create
      anonymous memory mappings and read them linearly in a loop.  The total
      size of the mappings was 4*PHYSICAL_MEMORY to use swap heavily under
      memory pressure.
      
      Without the patches and using SLUB, the machine locks up within minutes;
      with them applied, it runs to completion.  With SLAB, the story is
      different as an unpatched kernel runs to completion.  However, the patched
      kernel completed the test 45% faster.
      
      MICRO
                                               3.5.0-rc2 3.5.0-rc2
                                                 vanilla   swapnbd
      Unrecognised test vmscan-anon-mmap-write
      MMTests Statistics: duration
      Sys Time Running Test (seconds)             197.80    173.07
      User+Sys Time Running Test (seconds)        206.96    182.03
      Total Elapsed Time (seconds)               3240.70   1762.09
      
      This patch: mm: sl[au]b: add knowledge of PFMEMALLOC reserve pages
      
      Allocations of pages below the min watermark run a risk of the machine
      hanging due to a lack of memory.  To prevent this, only callers who have
      PF_MEMALLOC or TIF_MEMDIE set and are not processing an interrupt are
      allowed to allocate with ALLOC_NO_WATERMARKS.  Once they are allocated to
      a slab though, nothing prevents other callers consuming free objects
      within those slabs.  This patch limits access to slab pages that were
      allocated from the PFMEMALLOC reserves.
      
      When this patch is applied, pages allocated from below the low watermark
      are returned with page->pfmemalloc set and it is up to the caller to
      determine how the page should be protected.  SLAB restricts access to any
      page with page->pfmemalloc set to callers which are known to be able to
      access the PFMEMALLOC reserve.  If one is not available, an attempt is
      made to allocate a new page rather than use a reserve.  SLUB is a bit more
      relaxed in that it only records if the current per-CPU page was allocated
      from PFMEMALLOC reserve and uses another partial slab if the caller does
      not have the necessary GFP or process flags.  This was found to be
      sufficient in tests to avoid hangs due to SLUB generally maintaining
      smaller lists than SLAB.
      
      In low-memory conditions it does mean that !PFMEMALLOC allocators can fail
      a slab allocation even though free objects are available because they are
      being preserved for callers that are freeing pages.
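
      As a rough userspace illustration of the access rule described above
      (not the kernel implementation), the check can be sketched as below.
      The names struct slab_page, struct alloc_ctx, ctx_may_use_reserves()
      and slab_alloc_allowed() are hypothetical, and the two booleans in
      struct alloc_ctx merely stand in for the GFP and process flag checks
      mentioned in the text.

      ```c
      #include <assert.h>
      #include <stdbool.h>

      /* Hypothetical stand-in for a slab page; only the fields needed here. */
      struct slab_page {
          bool pfmemalloc;            /* page came from the PFMEMALLOC reserves */
          unsigned int free_objects;  /* objects still available in this slab */
      };

      /* Hypothetical summary of the allocating caller's context. */
      struct alloc_ctx {
          bool gfp_allows_reserves;   /* GFP flags permit use of the reserves */
          bool task_allows_reserves;  /* e.g. PF_MEMALLOC or TIF_MEMDIE is set */
      };

      bool ctx_may_use_reserves(const struct alloc_ctx *ctx)
      {
          return ctx->gfp_allows_reserves || ctx->task_allows_reserves;
      }

      /* Decide whether this caller may take an object from this slab page. */
      bool slab_alloc_allowed(const struct slab_page *page,
                              const struct alloc_ctx *ctx)
      {
          assert(page && ctx);
          if (page->free_objects == 0)
              return false;
          /* Objects on reserve-backed pages are kept for reserve users. */
          if (page->pfmemalloc && !ctx_may_use_reserves(ctx))
              return false;
          return true;
      }
      ```

      In this sketch an ordinary caller is refused an object from a
      reserve-backed page even though free objects exist, which mirrors the
      behaviour described in the following paragraph.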
      
      [a.p.zijlstra@chello.nl: Original implementation]
      [sebastian@breakpoint.cc: Correct order of page flag clearing]
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Cc: David Miller <davem@davemloft.net>
      Cc: Neil Brown <neilb@suse.de>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Mike Christie <michaelc@cs.wisc.edu>
      Cc: Eric B Munson <emunson@mgebm.net>
      Cc: Eric Dumazet <eric.dumazet@gmail.com>
      Cc: Sebastian Andrzej Siewior <sebastian@breakpoint.cc>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Christoph Lameter <cl@linux.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • M
      memory-hotplug: fix kswapd looping forever problem · 702d1a6e
      Minchan Kim committed
      When hotplug offlining happens on zone A, it starts to mark freed pages as
      MIGRATE_ISOLATE type in buddy to prevent further allocation.
      (MIGRATE_ISOLATE is a rather ironic type because the pages are apparently
      in buddy but we can't allocate them).
      
      When a memory shortage happens during hotplug offlining, the current task
      starts to reclaim, then wakes up kswapd.  Kswapd checks the watermark, then
      goes to sleep because the current zone_watermark_ok_safe doesn't consider
      the MIGRATE_ISOLATE freed page count.  The current task continues to reclaim
      in the direct reclaim path without kswapd's help.  The problem is that
      zone->all_unreclaimable is set only by kswapd, so the current task would
      loop forever like below.
      
      __alloc_pages_slowpath
      restart:
      	wake_all_kswapd
      rebalance:
      	__alloc_pages_direct_reclaim
      		do_try_to_free_pages
      			if global_reclaim && !all_unreclaimable
      				return 1; /* It means we did did_some_progress */
      	skip __alloc_pages_may_oom
      	should_alloc_retry
      		goto rebalance;
      
      If we apply KOSAKI's patch[1], which doesn't depend on kswapd for
      setting zone->all_unreclaimable, we can solve this problem by killing some
      task in the direct reclaim path.  But it still doesn't wake up kswapd.  It
      could still be a problem if another subsystem needs a GFP_ATOMIC request.
      So kswapd should consider MIGRATE_ISOLATE when it calculates free pages
      BEFORE going to sleep.
      
      This patch counts the number of MIGRATE_ISOLATE pageblocks, and
      zone_watermark_ok_safe will consider them if the system has such blocks
      (fortunately, this is very rare, so the overhead is not a problem, and
      kswapd is never a hot path).
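
      The adjusted watermark check can be sketched in userspace C as follows.
      This is only an illustration under stated assumptions: struct
      zone_counts and watermark_ok_excluding_isolated() are hypothetical
      names, and estimating the isolated free pages as "isolated pageblocks
      times pages per pageblock" is an assumed upper bound, not necessarily
      what the patch computes.

      ```c
      #include <assert.h>
      #include <stdbool.h>

      /* Hypothetical per-zone counters used by the sketch. */
      struct zone_counts {
          long nr_free;             /* pages on the buddy free lists */
          long nr_isolated_blocks;  /* pageblocks marked MIGRATE_ISOLATE */
          long pages_per_block;     /* pages in one pageblock */
      };

      /* Return true if the zone is above "mark" once isolated pages are ignored. */
      bool watermark_ok_excluding_isolated(const struct zone_counts *z, long mark)
      {
          long usable = z->nr_free;

          assert(z->nr_isolated_blocks >= 0 && z->pages_per_block > 0);
          /* Only adjust when isolation is in progress (the rare case). */
          if (z->nr_isolated_blocks > 0)
              usable -= z->nr_isolated_blocks * z->pages_per_block;
          return usable > mark;
      }
      ```

      With such a check, a zone whose free pages are mostly isolated no
      longer looks balanced, so kswapd would keep reclaiming instead of
      going back to sleep.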
      
      Copied and modified from Mel's quote:
      "
      Ideal solution would be "allocating" the pageblock.
      It would keep the free space accounting as it is but historically,
      memory hotplug didn't allocate pages because it would be difficult to
      detect if a pageblock was isolated or if part of some balloon.
      Allocating just full pageblocks would work around this.  However,
      it would play very badly with CMA.
      "
      
      [1] http://lkml.org/lkml/2012/6/14/74
      
      [akpm@linux-foundation.org: simplify nr_zone_isolate_freepages(), rework zone_watermark_ok_safe() comment, simplify set_pageblock_isolate() and restore_pageblock_isolate()]
      [akpm@linux-foundation.org: fix CONFIG_MEMORY_ISOLATION=n build]
      Signed-off-by: Minchan Kim <minchan@kernel.org>
      Suggested-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Tested-by: Aaditya Kumar <aaditya.kumar.30@gmail.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Michal Hocko <mhocko@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • M
      mm: fix free page check in zone_watermark_ok() · 2cfed075
      Minchan Kim committed
      __zone_watermark_ok currently compares free_pages, which is a signed type,
      with z->lowmem_reserve[classzone_idx], which is unsigned.  This might lead
      to sign overflow if free_pages doesn't satisfy the given order (or it was
      already negative), and then we rely on the following order loop to fix
      it (which doesn't work for order-0).  Let's fix the type conversion and
      not rely on the given value of free_pages or follow-up fixups.
      
      This patch fixes it because "memory-hotplug: fix kswapd looping forever
      problem" depends on this.
      
      As a benefit of this patch, it no longer relies on the loop to exit
      __zone_watermark_ok in the high-order case, and it makes the first test
      effective (i.e., if (free_pages <= min + lowmem_reserve)).
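
      The signed/unsigned pitfall can be reproduced in a small userspace
      sketch.  The function names below are hypothetical, and the cast in
      the second function is one plausible way to keep the comparison
      signed; it is not claimed to be the exact change made by the patch.
      The first function writes out explicitly the conversion that C applies
      implicitly when a long is compared with an unsigned long.

      ```c
      #include <assert.h>
      #include <limits.h>
      #include <stdbool.h>

      /* Pre-fix behaviour, with the implicit conversion to unsigned spelled out. */
      bool below_min_unsigned(long free_pages, long min, unsigned long lowmem_reserve)
      {
          return (unsigned long)free_pages <= (unsigned long)min + lowmem_reserve;
      }

      /* Signed comparison: a negative free_pages stays negative. */
      bool below_min_signed(long free_pages, long min, unsigned long lowmem_reserve)
      {
          assert(lowmem_reserve <= (unsigned long)LONG_MAX);
          return free_pages <= min + (long)lowmem_reserve;
      }
      ```

      With free_pages = -1, the unsigned variant sees a huge positive value
      and reports that the zone is not below the watermark, while the signed
      variant correctly reports that it is.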
      
      Aaditya reported this problem when he tested my hotplug patch.
      Reported-by: Aaditya Kumar <aaditya.kumar@ap.sony.com>
      Tested-by: Aaditya Kumar <aaditya.kumar@ap.sony.com>
      Signed-off-by: Aaditya Kumar <aaditya.kumar@ap.sony.com>
      Signed-off-by: Minchan Kim <minchan@kernel.org>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Reviewed-by: Michal Hocko <mhocko@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • M
      mm: factor out memory isolate functions · ee6f509c
      Minchan Kim committed
      mm/page_alloc.c has some memory isolation functions but they are used only
      when we enable CONFIG_{CMA|MEMORY_HOTPLUG|MEMORY_FAILURE}.  So let's make
      them configurable via a new CONFIG_MEMORY_ISOLATION so that binary size
      can be reduced and we can simply check CONFIG_MEMORY_ISOLATION instead of
      defined CONFIG_{CMA|MEMORY_HOTPLUG|MEMORY_FAILURE}.
      Signed-off-by: Minchan Kim <minchan@kernel.org>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: Marek Szyprowski <m.szyprowski@samsung.com>
      Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Michal Hocko <mhocko@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • J
      mm/hotplug: mark memory hotplug code in page_alloc.c as __meminit · 4ed7e022
      Jiang Liu committed
      Mark functions used by both boot and memory hotplug as __meminit to reduce
      memory footprint when memory hotplug is disabled.
      
      Also guard zone_pcp_update() with CONFIG_MEMORY_HOTPLUG because it's only
      used by memory hotplug code.
      Signed-off-by: Jiang Liu <liuj97@gmail.com>
      Cc: Wei Wang <Bessel.Wang@huawei.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Rusty Russell <rusty@rustcorp.com.au>
      Cc: Yinghai Lu <yinghai@kernel.org>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Keping Chen <chenkeping@huawei.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • J
      mm/hotplug: free zone->pageset when a zone becomes empty · 340175b7
      Jiang Liu committed
      When a zone becomes empty after memory offlining, free zone->pageset.
      Otherwise it will cause a memory leak when adding memory to the empty zone
      again because build_all_zonelists() will allocate zone->pageset for an
      empty zone.
      Signed-off-by: Jiang Liu <liuj97@gmail.com>
      Signed-off-by: Wei Wang <Bessel.Wang@huawei.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Rusty Russell <rusty@rustcorp.com.au>
      Cc: Yinghai Lu <yinghai@kernel.org>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Keping Chen <chenkeping@huawei.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • J
      mm/hotplug: correctly setup fallback zonelists when creating new pgdat · 9adb62a5
      Jiang Liu committed
      When hotadd_new_pgdat() is called to create new pgdat for a new node, a
      fallback zonelist should be created for the new node.  There's code to try
      to achieve that in hotadd_new_pgdat() as below:
      
      	/*
      	 * The node we allocated has no zone fallback lists. For avoiding
      	 * to access not-initialized zonelist, build here.
      	 */
      	mutex_lock(&zonelists_mutex);
      	build_all_zonelists(pgdat, NULL);
      	mutex_unlock(&zonelists_mutex);
      
      But it doesn't work as expected.  When hotadd_new_pgdat() is called, the
      new node is still in offline state because node_set_online(nid) hasn't
      been called yet.  And build_all_zonelists() only builds zonelists for
      online nodes as:
      
              for_each_online_node(nid) {
                      pg_data_t *pgdat = NODE_DATA(nid);
      
                      build_zonelists(pgdat);
                      build_zonelist_cache(pgdat);
              }
      
      So although we want to create a zonelist for the new pgdat, none is built.
      Fix this by adding a new parameter "pgdat" to build_all_zonelists() so that
      it builds a zonelist for the new pgdat too.
      Signed-off-by: Jiang Liu <liuj97@gmail.com>
      Signed-off-by: Xishi Qiu <qiuxishi@huawei.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Rusty Russell <rusty@rustcorp.com.au>
      Cc: Yinghai Lu <yinghai@kernel.org>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Keping Chen <chenkeping@huawei.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • X
      mm: setup pageblock_order before it's used by sparsemem · ca57df79
      Xishi Qiu committed
      On architectures with CONFIG_HUGETLB_PAGE_SIZE_VARIABLE set, such as
      Itanium, pageblock_order is a variable with default value of 0.  It's set
      to the right value by set_pageblock_order() in function
      free_area_init_core().
      
      But pageblock_order may be used by sparse_init() before free_area_init_core()
      is called, along the path:
      sparse_init()
          ->sparse_early_usemaps_alloc_node()
              ->usemap_size()
                  ->SECTION_BLOCKFLAGS_BITS
                      ->((1UL << (PFN_SECTION_SHIFT - pageblock_order)) *
                          NR_PAGEBLOCK_BITS)
      
      The uninitialized pageblock_order will cause memory to be wasted because
      usemap_size() returns a much bigger value than is really needed.
      
      For example, on an Itanium platform,
      sparse_init() pageblock_order=0 usemap_size=24576
      free_area_init_core() before pageblock_order=0, usemap_size=24576
      free_area_init_core() after pageblock_order=12, usemap_size=8
      
      That means 24K of memory is wasted for each section, so fix it by calling
      set_pageblock_order() from sparse_init().
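
      The figures above can be reproduced with a small userspace sketch of
      the size calculation.  usemap_size_bytes() is a hypothetical helper,
      and the parameter values used with it (a PFN_SECTION_SHIFT of 16 and
      3 pageblock flag bits) are assumptions chosen because they match the
      quoted numbers; they are not taken from this commit message.  The
      final rounding to an 8-byte word is likewise an assumption.

      ```c
      #include <assert.h>

      /* Bytes needed for one section's pageblock flags, rounded up to a word. */
      unsigned long usemap_size_bytes(unsigned int pfn_section_shift,
                                      unsigned int pageblock_order,
                                      unsigned int nr_pageblock_bits)
      {
          const unsigned long word = 8;  /* assumed sizeof(unsigned long) */
          unsigned long bits, bytes;

          assert(pageblock_order <= pfn_section_shift);
          /* One group of flag bits per pageblock in the section. */
          bits = (1UL << (pfn_section_shift - pageblock_order)) * nr_pageblock_bits;
          bytes = (bits + 7) / 8;
          return (bytes + word - 1) / word * word;
      }
      ```

      Under these assumptions, usemap_size_bytes(16, 0, 3) gives 24576 and
      usemap_size_bytes(16, 12, 3) gives 8, the two values reported for the
      Itanium platform.
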
      Signed-off-by: Xishi Qiu <qiuxishi@huawei.com>
      Signed-off-by: Jiang Liu <liuj97@gmail.com>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: Yinghai Lu <yinghai@kernel.org>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Keping Chen <chenkeping@huawei.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • R
      mm: have order > 0 compaction start off where it left · 7db8889a
      Rik van Riel committed
      Order > 0 compaction stops when enough free pages of the correct page
      order have been coalesced.  When doing subsequent higher order
      allocations, it is possible for compaction to be invoked many times.
      
      However, the compaction code always starts out looking for things to
      compact at the start of the zone, and for free pages to compact things to
      at the end of the zone.
      
      This can cause quadratic behaviour, with isolate_freepages starting at the
      end of the zone each time, even though previous invocations of the
      compaction code already filled up all free memory on that end of the zone.
      
      This can cause isolate_freepages to take enormous amounts of CPU with
      certain workloads on larger memory systems.
      
      The obvious solution is to have isolate_freepages remember where it left
      off last time, and continue at that point the next time it gets invoked
      for an order > 0 compaction.  This could cause compaction to fail if
      cc->free_pfn and cc->migrate_pfn are close together initially; in that
      case we restart from the end of the zone and try once more.
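
      The "remember where we left off" idea can be sketched in userspace C
      as below.  This is an illustration only: struct zone_sketch,
      free_scan_start() and should_restart_from_end() are hypothetical
      names, and the way a stale cached position is detected is an
      assumption rather than the patch's logic.

      ```c
      #include <assert.h>
      #include <stdbool.h>

      /* Hypothetical zone description with a cached free-scanner position. */
      struct zone_sketch {
          unsigned long start_pfn;
          unsigned long end_pfn;          /* one past the last pfn of the zone */
          unsigned long cached_free_pfn;  /* where the free scanner last stopped */
      };

      /* Choose the pfn at which the free page scanner begins this run. */
      unsigned long free_scan_start(const struct zone_sketch *z, int order)
      {
          assert(z->start_pfn < z->end_pfn);
          if (order == -1)
              return z->end_pfn;  /* forced full compaction scans the whole zone */
          if (z->cached_free_pfn <= z->start_pfn || z->cached_free_pfn > z->end_pfn)
              return z->end_pfn;  /* nothing usable cached: start from the end */
          return z->cached_free_pfn;
      }

      /* If the scanners met after a cached start, retry once from the zone end. */
      bool should_restart_from_end(unsigned long free_pfn, unsigned long migrate_pfn,
                                   bool started_from_cache, bool already_restarted)
      {
          return free_pfn <= migrate_pfn && started_from_cache && !already_restarted;
      }
      ```

      Resuming from the cached position avoids rescanning the already
      filled end of the zone on every invocation, and the single restart
      covers the case where the two scanners start out close together.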
      
      Forced full (order == -1) compactions are left alone.
      
      [akpm@linux-foundation.org: checkpatch fixes]
      [akpm@linux-foundation.org: s/laste/last/, use 80 cols]
      Signed-off-by: Rik van Riel <riel@redhat.com>
      Reported-by: Jim Schutt <jaschut@sandia.gov>
      Tested-by: Jim Schutt <jaschut@sandia.gov>
      Cc: Minchan Kim <minchan.kim@gmail.com>
      Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Acked-by: Mel Gorman <mel@csn.ul.ie>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • M
      mm: clean up __count_immobile_pages() · 80934513
      Minchan Kim committed
      The __count_immobile_pages() naming is rather awkward.  Choose a
      clearer name and add a comment.
      Signed-off-by: Minchan Kim <minchan@kernel.org>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Michal Hocko <mhocko@suse.cz>
      Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Bartlomiej Zolnierkiewicz <b.zolnierkie@samsung.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • M
      mm: do not use page_count() without a page pin · 97d255c8
      Minchan Kim committed
      d179e84b ("mm: vmscan: do not use page_count without a page pin") fixed
      this problem in vmscan.c but the same problem exists in
      __count_immobile_pages().

      The description from d179e84b is copied below.
      
      "It is unsafe to run page_count during the physical pfn scan because
      compound_head could trip on a dangling pointer when reading
      page->first_page if the compound page is being freed by another CPU."
      Signed-off-by: Minchan Kim <minchan@kernel.org>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Michal Hocko <mhocko@suse.cz>
      Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Wanpeng Li <liwp.linux@gmail.com>
      Cc: Bartlomiej Zolnierkiewicz <b.zolnierkie@samsung.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • K
      mm: clear pages_scanned only if draining a pcp adds pages to the buddy allocator again · 2a13515c
      KOSAKI Motohiro committed
      commit 2ff754fa ("mm: clear pages_scanned only if draining a pcp adds
      pages to the buddy allocator again") fixed one free_pcppages_bulk()
      misuse.  But two other misuses still exist.

      This patch fixes them.
      Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Acked-by: David Rientjes <rientjes@google.com>
      Acked-by: Mel Gorman <mel@csn.ul.ie>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Reviewed-by: Minchan Kim <minchan@kernel.org>
      Cc: Wu Fengguang <fengguang.wu@intel.com>
      Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Rik van Riel <riel@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • G
      mm/buddy: cleanup on should_fail_alloc_page · deaf386e
      Gavin Shan committed
      Currently, function should_fail() has "bool" for its return value, so it's
      reasonable to change the return value of function should_fail_alloc_page()
      into "bool" as well.
      
      The patch does cleanup on function should_fail_alloc_page() to have "bool"
      for its return value.
      Signed-off-by: Gavin Shan <shangw@linux.vnet.ibm.com>
      Acked-by: David Rientjes <rientjes@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  2. 06 Jul, 2012 1 commit
    • R
      mm: cma: don't replace lowmem pages with highmem · 6a6dccba
      Rabin Vincent committed
      The filesystem layer expects pages in the block device's mapping to not
      be in highmem (the mapping's gfp mask is set in bdget()), but CMA can
      currently replace lowmem pages with highmem pages, leading to crashes in
      filesystem code such as the one below:
      
        Unable to handle kernel NULL pointer dereference at virtual address 00000400
        pgd = c0c98000
        [00000400] *pgd=00c91831, *pte=00000000, *ppte=00000000
        Internal error: Oops: 817 [#1] PREEMPT SMP ARM
        CPU: 0    Not tainted  (3.5.0-rc5+ #80)
        PC is at __memzero+0x24/0x80
        ...
        Process fsstress (pid: 323, stack limit = 0xc0cbc2f0)
        Backtrace:
        [<c010e3f0>] (ext4_getblk+0x0/0x180) from [<c010e58c>] (ext4_bread+0x1c/0x98)
        [<c010e570>] (ext4_bread+0x0/0x98) from [<c0117944>] (ext4_mkdir+0x160/0x3bc)
         r4:c15337f0
        [<c01177e4>] (ext4_mkdir+0x0/0x3bc) from [<c00c29e0>] (vfs_mkdir+0x8c/0x98)
        [<c00c2954>] (vfs_mkdir+0x0/0x98) from [<c00c2a60>] (sys_mkdirat+0x74/0xac)
         r6:00000000 r5:c152eb40 r4:000001ff r3:c14b43f0
        [<c00c29ec>] (sys_mkdirat+0x0/0xac) from [<c00c2ab8>] (sys_mkdir+0x20/0x24)
         r6:beccdcf0 r5:00074000 r4:beccdbbc
        [<c00c2a98>] (sys_mkdir+0x0/0x24) from [<c000e3c0>] (ret_fast_syscall+0x0/0x30)
      
      Fix this by replacing only highmem pages with highmem.
      Reported-by: Laura Abbott <lauraa@codeaurora.org>
      Signed-off-by: Rabin Vincent <rabin@rab.in>
      Acked-by: Michal Nazarewicz <mina86@mina86.com>
      Signed-off-by: Marek Szyprowski <m.szyprowski@samsung.com>
  3. 04 Jun, 2012 1 commit
  4. 30 May, 2012 8 commits
  5. 24 May, 2012 1 commit
  6. 21 May, 2012 9 commits
  7. 12 May, 2012 1 commit
    • H
      mm: raise MemFree by reverting percpu_pagelist_fraction to 0 · 1b76b02f
      Hugh Dickins committed
      Why is there less MemFree than there used to be?  It perturbed a test,
      so I've just been bisecting linux-next, and now find the offender went
      upstream yesterday.
      
      Commit 93278814 "mm: fix division by 0 in percpu_pagelist_fraction()"
      mistakenly initialized percpu_pagelist_fraction to the sysctl's minimum 8,
      which leaves 1/8th of memory on percpu lists (on each cpu??); but most of
      us expect it to be left unset at 0 (and it's not then used as a divisor).
      
        MemTotal: 8061476kB  8061476kB  8061476kB  8061476kB  8061476kB  8061476kB
        Repetitive test with percpu_pagelist_fraction 8:
        MemFree:  6948420kB  6237172kB  6949696kB  6840692kB  6949048kB  6862984kB
        Same test with percpu_pagelist_fraction back to 0:
        MemFree:  7945000kB  7944908kB  7948568kB  7949060kB  7948796kB  7948812kB
      Signed-off-by: Hugh Dickins <hughd@google.com>
      [ We really should fix the crazy sysctl interface too, but that's a
        separate thing - Linus ]
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  8. 11 May, 2012 1 commit
  9. 08 May, 2012 1 commit
  10. 29 Mar, 2012 2 commits
    • G
      mm: only IPI CPUs to drain local pages if they exist · 74046494
      Gilad Ben-Yossef committed
      Calculate a cpumask of CPUs with per-cpu pages in any zone and only send
      an IPI requesting CPUs to drain these pages to the buddy allocator if they
      actually have pages when asked to flush.
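
      The mask calculation can be sketched in userspace C as follows.  This
      is a simplified illustration, not the kernel code: cpus_needing_drain()
      and SKETCH_MAX_CPUS are hypothetical names, a 64-bit integer stands in
      for a cpumask, and a single per-CPU page count stands in for the
      per-zone, per-CPU page lists.

      ```c
      #include <assert.h>
      #include <stdint.h>

      #define SKETCH_MAX_CPUS 64  /* limit imposed by the 64-bit mask in this sketch */

      /* Build a bitmask of CPUs that currently hold per-cpu pages. */
      uint64_t cpus_needing_drain(const unsigned int *pcp_count, unsigned int nr_cpus)
      {
          uint64_t mask = 0;

          assert(nr_cpus <= SKETCH_MAX_CPUS);
          for (unsigned int cpu = 0; cpu < nr_cpus; cpu++) {
              /* Only CPUs with something to drain are worth interrupting. */
              if (pcp_count[cpu] > 0)
                  mask |= UINT64_C(1) << cpu;
          }
          return mask;
      }
      ```

      A caller would then send the drain request only to the CPUs whose bit
      is set, and skip the IPI entirely when the mask is empty.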
      
      This patch saves 85%+ of the IPIs asking to drain per-cpu pages under
      severe memory pressure that leads to OOM.  In these cases multiple,
      possibly concurrent, allocation requests end up in the direct reclaim code
      path.  The per-cpu pages are reclaimed on the first allocation failure, so
      for most of the subsequent allocation attempts, until the memory
      pressure is off (possibly via the OOM killer), there are no per-cpu pages
      on most CPUs (and there can easily be hundreds of them).
      
      This also has the side effect of shortening the average latency of direct
      reclaim by one or more orders of magnitude since waiting for all the CPUs to
      ACK the IPI takes a long time.
      
      Tested by running "hackbench 400" on an 8 CPU x86 VM and observing the
      difference between the number of direct reclaim attempts that end up in
      drain_all_pages() and those where more than 1/2 of the online CPUs had any
      per-cpu pages in them, using the vmstat counters introduced in the next
      patch in the series and using /proc/interrupts.

      In the test scenario, this was seen to save around 3600 global
      IPIs after triggering an OOM on a concurrent workload:
      
      $ cat /proc/vmstat | tail -n 2
      pcp_global_drain 0
      pcp_global_ipi_saved 0
      
      $ cat /proc/interrupts | grep CAL
      CAL:          1          2          1          2          2          2          2          2   Function call interrupts
      
      $ hackbench 400
      [OOM messages snipped]
      
      $ cat /proc/vmstat | tail -n 2
      pcp_global_drain 3647
      pcp_global_ipi_saved 3642
      
      $ cat /proc/interrupts | grep CAL
      CAL:          6         13          6          3          3          3         12          7   Function call interrupts
      
      Please note that if the global drain is removed from the direct reclaim
      path, as a patch from Mel Gorman currently suggests, this should be
      replaced with an on_each_cpu_cond invocation.
      Signed-off-by: Gilad Ben-Yossef <gilad@benyossef.com>
      Acked-by: Mel Gorman <mel@csn.ul.ie>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Acked-by: Christoph Lameter <cl@linux.com>
      Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Acked-by: Michal Nazarewicz <mina86@mina86.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • D
      mm, coredump: fail allocations when coredumping instead of oom killing · 29fd66d2
      David Rientjes committed
      The size of coredump files is limited by RLIMIT_CORE, however, allocating
      large amounts of memory results in three negative consequences:
      
       - the coredumping process may be chosen for oom kill and quickly deplete
         all memory reserves in oom conditions preventing further progress from
         being made or tasks from exiting,
      
       - the coredumping process may cause other processes to be oom killed
         without fault of their own as the result of a SIGSEGV, for example, in
         the coredumping process, or
      
       - the coredumping process may result in a livelock while writing to the
         dump file if it needs memory to allocate while other threads are in
         the exit path waiting on the coredumper to complete.
      
      This is fixed by implying __GFP_NORETRY in the page allocator for
      coredumping processes when reclaim has failed so the allocations fail and
      the process continues to exit.
      Signed-off-by: David Rientjes <rientjes@google.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Minchan Kim <minchan.kim@gmail.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  11. 22 Mar, 2012 2 commits