1. 12 1月, 2013 2 次提交
    • M
      mm: compaction: partially revert capture of suitable high-order page · 8fb74b9f
      Mel Gorman 提交于
      Eric Wong reported on 3.7 and 3.8-rc2 that ppoll() got stuck when
      waiting for POLLIN on a local TCP socket.  It was easier to trigger if
      there was disk IO and dirty pages at the same time and he bisected it to
      commit 1fb3f8ca ("mm: compaction: capture a suitable high-order page
      immediately when it is made available").
      
      The intention of that patch was to improve high-order allocations under
      memory pressure after changes made to reclaim in 3.6 drastically hurt
      THP allocations but the approach was flawed.  For Eric, the problem was
      that page->pfmemalloc was not being cleared for captured pages leading
      to a poor interaction with swap-over-NFS support causing the packets to
      be dropped.  However, I identified a few more problems with the patch
      including the fact that it can increase contention on zone->lock in some
      cases which could result in async direct compaction being aborted early.
      
      In retrospect the capture patch took the wrong approach.  What it should
      have done is mark the pageblock being migrated as MIGRATE_ISOLATE if it
      was allocating for THP and avoided races that way.  While the patch was
      showing to improve allocation success rates at the time, the benefit is
      marginal given the relative complexity and it should be revisited from
      scratch in the context of the other reclaim-related changes that have
      taken place since the patch was first written and tested.  This patch
      partially reverts commit 1fb3f8ca ("mm: compaction: capture a
      suitable high-order page immediately when it is made available").
      Reported-and-tested-by: NEric Wong <normalperson@yhbt.net>
      Tested-by: NEric Dumazet <eric.dumazet@gmail.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: NMel Gorman <mgorman@suse.de>
      Cc: David Miller <davem@davemloft.net>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      8fb74b9f
    • J
      mm: compaction: fix echo 1 > compact_memory return error issue · 7964c06d
      Jason Liu 提交于
      when run the folloing command under shell, it will return error
      
        sh/$ echo 1 > /proc/sys/vm/compact_memory
        sh/$ sh: write error: Bad address
      
      After strace, I found the following log:
      
        ...
        write(1, "1\n", 2)               = 3
        write(1, "", 4294967295)         = -1 EFAULT (Bad address)
        write(2, "echo: write error: Bad address\n", 31echo: write error: Bad address
        ) = 31
      
      This tells system return 3(COMPACT_COMPLETE) after write data to
      compact_memory.
      
      The fix is to make the system just return 0 instead 3(COMPACT_COMPLETE)
      from sysctl_compaction_handler after compaction_nodes finished.
      Signed-off-by: NJason Liu <r64343@freescale.com>
      Suggested-by: NDavid Rientjes <rientjes@google.com>
      Acked-by: NMel Gorman <mgorman@suse.de>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Acked-by: NDavid Rientjes <rientjes@google.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      7964c06d
  2. 21 12月, 2012 1 次提交
  3. 13 12月, 2012 1 次提交
  4. 12 12月, 2012 2 次提交
    • R
      mm: introduce putback_movable_pages() · 5733c7d1
      Rafael Aquini 提交于
      The PATCH "mm: introduce compaction and migration for virtio ballooned pages"
      hacks around putback_lru_pages() in order to allow ballooned pages to be
      re-inserted on balloon page list as if a ballooned page was like a LRU page.
      
      As ballooned pages are not legitimate LRU pages, this patch introduces
      putback_movable_pages() to properly cope with cases where the isolated
      pageset contains ballooned pages and LRU pages, thus fixing the mentioned
      inelegant hack around putback_lru_pages().
      Signed-off-by: NRafael Aquini <aquini@redhat.com>
      Cc: Rusty Russell <rusty@rustcorp.com.au>
      Cc: "Michael S. Tsirkin" <mst@redhat.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      5733c7d1
    • R
      mm: introduce compaction and migration for ballooned pages · bf6bddf1
      Rafael Aquini 提交于
      Memory fragmentation introduced by ballooning might reduce significantly
      the number of 2MB contiguous memory blocks that can be used within a guest,
      thus imposing performance penalties associated with the reduced number of
      transparent huge pages that could be used by the guest workload.
      
      This patch introduces the helper functions as well as the necessary changes
      to teach compaction and migration bits how to cope with pages which are
      part of a guest memory balloon, in order to make them movable by memory
      compaction procedures.
      Signed-off-by: NRafael Aquini <aquini@redhat.com>
      Acked-by: NMel Gorman <mel@csn.ul.ie>
      Cc: Rusty Russell <rusty@rustcorp.com.au>
      Cc: "Michael S. Tsirkin" <mst@redhat.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      bf6bddf1
  5. 11 12月, 2012 3 次提交
    • M
      mm: compaction: Add scanned and isolated counters for compaction · 397487db
      Mel Gorman 提交于
      Compaction already has tracepoints to count scanned and isolated pages
      but it requires that ftrace be enabled and if that information has to be
      written to disk then it can be disruptive. This patch adds vmstat counters
      for compaction called compact_migrate_scanned, compact_free_scanned and
      compact_isolated.
      
      With these counters, it is possible to define a basic cost model for
      compaction. This approximates of how much work compaction is doing and can
      be compared that with an oprofile showing TLB misses and see if the cost of
      compaction is being offset by THP for example. Minimally a compaction patch
      can be evaluated in terms of whether it increases or decreases cost. The
      basic cost model looks like this
      
      Fundamental unit u:	a word	sizeof(void *)
      
      Ca  = cost of struct page access = sizeof(struct page) / u
      
      Cmc = Cost migrate page copy = (Ca + PAGE_SIZE/u) * 2
      Cmf = Cost migrate failure   = Ca * 2
      Ci  = Cost page isolation    = (Ca + Wi)
      	where Wi is a constant that should reflect the approximate
      	cost of the locking operation.
      
      Csm = Cost migrate scanning = Ca
      Csf = Cost free    scanning = Ca
      
      Overall cost =	(Csm * compact_migrate_scanned) +
      	      	(Csf * compact_free_scanned)    +
      	      	(Ci  * compact_isolated)	+
      		(Cmc * pgmigrate_success)	+
      		(Cmf * pgmigrate_failed)
      
      Where the values are read from /proc/vmstat.
      
      This is very basic and ignores certain costs such as the allocation cost
      to do a migrate page copy but any improvement to the model would still
      use the same vmstat counters.
      Signed-off-by: NMel Gorman <mgorman@suse.de>
      Reviewed-by: NRik van Riel <riel@redhat.com>
      397487db
    • M
      mm: migrate: Add a tracepoint for migrate_pages · 7b2a2d4a
      Mel Gorman 提交于
      The pgmigrate_success and pgmigrate_fail vmstat counters tells the user
      about migration activity but not the type or the reason. This patch adds
      a tracepoint to identify the type of page migration and why the page is
      being migrated.
      Signed-off-by: NMel Gorman <mgorman@suse.de>
      Reviewed-by: NRik van Riel <riel@redhat.com>
      7b2a2d4a
    • M
      mm: compaction: Move migration fail/success stats to migrate.c · 5647bc29
      Mel Gorman 提交于
      The compact_pages_moved and compact_pagemigrate_failed events are
      convenient for determining if compaction is active and to what
      degree migration is succeeding but it's at the wrong level. Other
      users of migration may also want to know if migration is working
      properly and this will be particularly true for any automated
      NUMA migration. This patch moves the counters down to migration
      with the new events called pgmigrate_success and pgmigrate_fail.
      The compact_blocks_moved counter is removed because while it was
      useful for debugging initially, it's worthless now as no meaningful
      conclusions can be drawn from its value.
      Signed-off-by: NMel Gorman <mgorman@suse.de>
      Reviewed-by: NRik van Riel <riel@redhat.com>
      5647bc29
  6. 07 12月, 2012 1 次提交
    • M
      mm: compaction: validate pfn range passed to isolate_freepages_block · 60177d31
      Mel Gorman 提交于
      Commit 0bf380bc ("mm: compaction: check pfn_valid when entering a
      new MAX_ORDER_NR_PAGES block during isolation for migration") added a
      check for pfn_valid() when isolating pages for migration as the scanner
      does not necessarily start pageblock-aligned.
      
      Since commit c89511ab ("mm: compaction: Restart compaction from near
      where it left off"), the free scanner has the same problem.  This patch
      makes sure that the pfn range passed to isolate_freepages_block() is
      within the same block so that pfn_valid() checks are unnecessary.
      
      In answer to Henrik's wondering why others have not reported this:
      reproducing this requires a large enough hole with the right aligment to
      have compaction walk into a PFN range with no memmap.  Size and
      alignment depends in the memory model - 4M for FLATMEM and 128M for
      SPARSEMEM on x86.  It needs a "lucky" machine.
      Reported-by: NHenrik Rydberg <rydberg@euromail.se>
      Signed-off-by: NMel Gorman <mgorman@suse.de>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      60177d31
  7. 20 10月, 2012 1 次提交
  8. 09 10月, 2012 13 次提交
    • M
      CMA: migrate mlocked pages · e46a2879
      Minchan Kim 提交于
      Presently CMA cannot migrate mlocked pages so it ends up failing to allocate
      contiguous memory space.
      
      This patch makes mlocked pages be migrated out.  Of course, it can affect
      realtime processes but in CMA usecase, contiguous memory allocation failing
      is far worse than access latency to an mlocked page being variable while
      CMA is running.  If someone wants to make the system realtime, he shouldn't
      enable CMA because stalls can still happen at random times.
      
      [akpm@linux-foundation.org: tweak comment text, per Mel]
      Signed-off-by: NMinchan Kim <minchan@kernel.org>
      Acked-by: NMel Gorman <mgorman@suse.de>
      Cc: Michal Nazarewicz <mina86@mina86.com>
      Cc: Bartlomiej Zolnierkiewicz <b.zolnierkie@samsung.com>
      Cc: Marek Szyprowski <m.szyprowski@samsung.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      e46a2879
    • M
      mm: compaction: clear PG_migrate_skip based on compaction and reclaim activity · 62997027
      Mel Gorman 提交于
      Compaction caches if a pageblock was scanned and no pages were isolated so
      that the pageblocks can be skipped in the future to reduce scanning.  This
      information is not cleared by the page allocator based on activity due to
      the impact it would have to the page allocator fast paths.  Hence there is
      a requirement that something clear the cache or pageblocks will be skipped
      forever.  Currently the cache is cleared if there were a number of recent
      allocation failures and it has not been cleared within the last 5 seconds.
      Time-based decisions like this are terrible as they have no relationship
      to VM activity and is basically a big hammer.
      
      Unfortunately, accurate heuristics would add cost to some hot paths so
      this patch implements a rough heuristic.  There are two cases where the
      cache is cleared.
      
      1. If a !kswapd process completes a compaction cycle (migrate and free
         scanner meet), the zone is marked compact_blockskip_flush. When kswapd
         goes to sleep, it will clear the cache. This is expected to be the
         common case where the cache is cleared. It does not really matter if
         kswapd happens to be asleep or going to sleep when the flag is set as
         it will be woken on the next allocation request.
      
      2. If there have been multiple failures recently and compaction just
         finished being deferred then a process will clear the cache and start a
         full scan.  This situation happens if there are multiple high-order
         allocation requests under heavy memory pressure.
      
      The clearing of the PG_migrate_skip bits and other scans is inherently
      racy but the race is harmless.  For allocations that can fail such as THP,
      they will simply fail.  For requests that cannot fail, they will retry the
      allocation.  Tests indicated that scanning rates were roughly similar to
      when the time-based heuristic was used and the allocation success rates
      were similar.
      Signed-off-by: NMel Gorman <mgorman@suse.de>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Richard Davies <richard@arachsys.com>
      Cc: Shaohua Li <shli@kernel.org>
      Cc: Avi Kivity <avi@redhat.com>
      Cc: Rafael Aquini <aquini@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      62997027
    • M
      mm: compaction: Restart compaction from near where it left off · c89511ab
      Mel Gorman 提交于
      This is almost entirely based on Rik's previous patches and discussions
      with him about how this might be implemented.
      
      Order > 0 compaction stops when enough free pages of the correct page
      order have been coalesced.  When doing subsequent higher order
      allocations, it is possible for compaction to be invoked many times.
      
      However, the compaction code always starts out looking for things to
      compact at the start of the zone, and for free pages to compact things to
      at the end of the zone.
      
      This can cause quadratic behaviour, with isolate_freepages starting at the
      end of the zone each time, even though previous invocations of the
      compaction code already filled up all free memory on that end of the zone.
       This can cause isolate_freepages to take enormous amounts of CPU with
      certain workloads on larger memory systems.
      
      This patch caches where the migration and free scanner should start from
      on subsequent compaction invocations using the pageblock-skip information.
       When compaction starts it begins from the cached restart points and will
      update the cached restart points until a page is isolated or a pageblock
      is skipped that would have been scanned by synchronous compaction.
      Signed-off-by: NMel Gorman <mgorman@suse.de>
      Acked-by: NRik van Riel <riel@redhat.com>
      Cc: Richard Davies <richard@arachsys.com>
      Cc: Shaohua Li <shli@kernel.org>
      Cc: Avi Kivity <avi@redhat.com>
      Acked-by: NRafael Aquini <aquini@redhat.com>
      Cc: Fengguang Wu <fengguang.wu@intel.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      c89511ab
    • M
      mm: compaction: cache if a pageblock was scanned and no pages were isolated · bb13ffeb
      Mel Gorman 提交于
      When compaction was implemented it was known that scanning could
      potentially be excessive.  The ideal was that a counter be maintained for
      each pageblock but maintaining this information would incur a severe
      penalty due to a shared writable cache line.  It has reached the point
      where the scanning costs are a serious problem, particularly on
      long-lived systems where a large process starts and allocates a large
      number of THPs at the same time.
      
      Instead of using a shared counter, this patch adds another bit to the
      pageblock flags called PG_migrate_skip.  If a pageblock is scanned by
      either migrate or free scanner and 0 pages were isolated, the pageblock is
      marked to be skipped in the future.  When scanning, this bit is checked
      before any scanning takes place and the block skipped if set.
      
      The main difficulty with a patch like this is "when to ignore the cached
      information?" If it's ignored too often, the scanning rates will still be
      excessive.  If the information is too stale then allocations will fail
      that might have otherwise succeeded.  In this patch
      
      o CMA always ignores the information
      o If the migrate and free scanner meet then the cached information will
        be discarded if it's at least 5 seconds since the last time the cache
        was discarded
      o If there are a large number of allocation failures, discard the cache.
      
      The time-based heuristic is very clumsy but there are few choices for a
      better event.  Depending solely on multiple allocation failures still
      allows excessive scanning when THP allocations are failing in quick
      succession due to memory pressure.  Waiting until memory pressure is
      relieved would cause compaction to continually fail instead of using
      reclaim/compaction to try allocate the page.  The time-based mechanism is
      clumsy but a better option is not obvious.
      Signed-off-by: NMel Gorman <mgorman@suse.de>
      Acked-by: NRik van Riel <riel@redhat.com>
      Cc: Richard Davies <richard@arachsys.com>
      Cc: Shaohua Li <shli@kernel.org>
      Cc: Avi Kivity <avi@redhat.com>
      Acked-by: NRafael Aquini <aquini@redhat.com>
      Cc: Fengguang Wu <fengguang.wu@intel.com>
      Cc: Michal Nazarewicz <mina86@mina86.com>
      Cc: Bartlomiej Zolnierkiewicz <b.zolnierkie@samsung.com>
      Cc: Kyungmin Park <kyungmin.park@samsung.com>
      Cc: Mark Brown <broonie@opensource.wolfsonmicro.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      bb13ffeb
    • M
      revert "mm: have order > 0 compaction start off where it left" · 753341a4
      Mel Gorman 提交于
      This reverts commit 7db8889a ("mm: have order > 0 compaction start
      off where it left") and commit de74f1cc ("mm: have order > 0 compaction
      start near a pageblock with free pages").  These patches were a good
      idea and tests confirmed that they massively reduced the amount of
      scanning but the implementation is complex and tricky to understand.  A
      later patch will cache what pageblocks should be skipped and
      reimplements the concept of compact_cached_free_pfn on top for both
      migration and free scanners.
      Signed-off-by: NMel Gorman <mgorman@suse.de>
      Acked-by: NRik van Riel <riel@redhat.com>
      Cc: Richard Davies <richard@arachsys.com>
      Cc: Shaohua Li <shli@kernel.org>
      Cc: Avi Kivity <avi@redhat.com>
      Acked-by: NRafael Aquini <aquini@redhat.com>
      Acked-by: NMinchan Kim <minchan@kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      753341a4
    • M
      mm: compaction: acquire the zone->lock as late as possible · f40d1e42
      Mel Gorman 提交于
      Compaction's free scanner acquires the zone->lock when checking for
      PageBuddy pages and isolating them.  It does this even if there are no
      PageBuddy pages in the range.
      
      This patch defers acquiring the zone lock for as long as possible.  In the
      event there are no free pages in the pageblock then the lock will not be
      acquired at all which reduces contention on zone->lock.
      Signed-off-by: NMel Gorman <mgorman@suse.de>
      Acked-by: NRik van Riel <riel@redhat.com>
      Cc: Richard Davies <richard@arachsys.com>
      Cc: Shaohua Li <shli@kernel.org>
      Cc: Avi Kivity <avi@redhat.com>
      Acked-by: NRafael Aquini <aquini@redhat.com>
      Acked-by: NMinchan Kim <minchan@kernel.org>
      Tested-by: NPeter Ujfalusi <peter.ujfalusi@ti.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      f40d1e42
    • M
      mm: compaction: acquire the zone->lru_lock as late as possible · 2a1402aa
      Mel Gorman 提交于
      Richard Davies and Shaohua Li have both reported lock contention problems
      in compaction on the zone and LRU locks as well as significant amounts of
      time being spent in compaction.  This series aims to reduce lock
      contention and scanning rates to reduce that CPU usage.  Richard reported
      at https://lkml.org/lkml/2012/9/21/91 that this series made a big
      different to a problem he reported in August:
      
         http://marc.info/?l=kvm&m=134511507015614&w=2
      
      Patch 1 defers acquiring the zone->lru_lock as long as possible.
      
      Patch 2 defers acquiring the zone->lock as lock as possible.
      
      Patch 3 reverts Rik's "skip-free" patches as the core concept gets
      	reimplemented later and the remaining patches are easier to
      	understand if this is reverted first.
      
      Patch 4 adds a pageblock-skip bit to the pageblock flags to cache what
      	pageblocks should be skipped by the migrate and free scanners.
      	This drastically reduces the amount of scanning compaction has
      	to do.
      
      Patch 5 reimplements something similar to Rik's idea except it uses the
      	pageblock-skip information to decide where the scanners should
      	restart from and does not need to wrap around.
      
      I tested this on 3.6-rc6 + linux-next/akpm. Kernels tested were
      
      akpm-20120920	3.6-rc6 + linux-next/akpm as of Septeber 20th, 2012
      lesslock	Patches 1-6
      revert		Patches 1-7
      cachefail	Patches 1-8
      skipuseless	Patches 1-9
      
      Stress high-order allocation tests looked ok.  Success rates are more or
      less the same with the full series applied but there is an expectation
      that there is less opportunity to race with other allocation requests if
      there is less scanning.  The time to complete the tests did not vary that
      much and are uninteresting as were the vmstat statistics so I will not
      present them here.
      
      Using ftrace I recorded how much scanning was done by compaction and got this
      
                                  3.6.0-rc6     3.6.0-rc6   3.6.0-rc6  3.6.0-rc6 3.6.0-rc6
                                  akpm-20120920 lockless  revert-v2r2  cachefail skipuseless
      
      Total   free    scanned         360753976  515414028  565479007   17103281   18916589
      Total   free    isolated          2852429    3597369    4048601     670493     727840
      Total   free    efficiency        0.0079%    0.0070%    0.0072%    0.0392%    0.0385%
      Total   migrate scanned         247728664  822729112 1004645830   17946827   14118903
      Total   migrate isolated          2555324    3245937    3437501     616359     658616
      Total   migrate efficiency        0.0103%    0.0039%    0.0034%    0.0343%    0.0466%
      
      The efficiency is worthless because of the nature of the test and the
      number of failures.  The really interesting point as far as this patch
      series is concerned is the number of pages scanned.  Note that reverting
      Rik's patches massively increases the number of pages scanned indicating
      that those patches really did make a difference to CPU usage.
      
      However, caching what pageblocks should be skipped has a much higher
      impact.  With patches 1-8 applied, free page and migrate page scanning are
      both reduced by 95% in comparison to the akpm kernel.  If the basic
      concept of Rik's patches are implemened on top then scanning then the free
      scanner barely changed but migrate scanning was further reduced.  That
      said, tests on 3.6-rc5 indicated that the last patch had greater impact
      than what was measured here so it is a bit variable.
      
      One way or the other, this series has a large impact on the amount of
      scanning compaction does when there is a storm of THP allocations.
      
      This patch:
      
      Compaction's migrate scanner acquires the zone->lru_lock when scanning a
      range of pages looking for LRU pages to acquire.  It does this even if
      there are no LRU pages in the range.  If multiple processes are compacting
      then this can cause severe locking contention.  To make matters worse
      commit b2eef8c0 ("mm: compaction: minimise the time IRQs are disabled
      while isolating pages for migration") releases the lru_lock every
      SWAP_CLUSTER_MAX pages that are scanned.
      
      This patch makes two changes to how the migrate scanner acquires the LRU
      lock.  First, it only releases the LRU lock every SWAP_CLUSTER_MAX pages
      if the lock is contended.  This reduces the number of times it
      unnecessarily disables and re-enables IRQs.  The second is that it defers
      acquiring the LRU lock for as long as possible.  If there are no LRU pages
      or the only LRU pages are transhuge then the LRU lock will not be acquired
      at all which reduces contention on zone->lru_lock.
      
      [minchan@kernel.org: augment comment]
      [akpm@linux-foundation.org: tweak comment text]
      Signed-off-by: NMel Gorman <mgorman@suse.de>
      Acked-by: NRik van Riel <riel@redhat.com>
      Cc: Richard Davies <richard@arachsys.com>
      Cc: Shaohua Li <shli@kernel.org>
      Cc: Avi Kivity <avi@redhat.com>
      Acked-by: NRafael Aquini <aquini@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      2a1402aa
    • M
      mm: compaction: Update try_to_compact_pages()kerneldoc comment · 661c4cb9
      Mel Gorman 提交于
      Parameters were added without documentation, tut tut.
      Signed-off-by: NMel Gorman <mgorman@suse.de>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      661c4cb9
    • M
      mm: compaction: move fatal signal check out of compact_checklock_irqsave · 3cc668f4
      Mel Gorman 提交于
      Commit c67fe375 ("mm: compaction: Abort async compaction if locks
      are contended or taking too long") addressed a lock contention problem
      in compaction by introducing compact_checklock_irqsave() that effecively
      aborting async compaction in the event of compaction.
      
      To preserve existing behaviour it also moved a fatal_signal_pending()
      check into compact_checklock_irqsave() but that is very misleading.  It
      "hides" the check within a locking function but has nothing to do with
      locking as such.  It just happens to work in a desirable fashion.
      
      This patch moves the fatal_signal_pending() check to
      isolate_migratepages_range() where it belongs.  Arguably the same check
      should also happen when isolating pages for freeing but it's overkill.
      Signed-off-by: NMel Gorman <mgorman@suse.de>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Shaohua Li <shli@kernel.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      3cc668f4
    • S
      mm: compaction: abort compaction loop if lock is contended or run too long · e64c5237
      Shaohua Li 提交于
      isolate_migratepages_range() might isolate no pages if for example when
      zone->lru_lock is contended and running asynchronous compaction. In this
      case, we should abort compaction, otherwise, compact_zone will run a
      useless loop and make zone->lru_lock is even contended.
      
      An additional check is added to ensure that cc.migratepages and
      cc.freepages get properly drained whan compaction is aborted.
      
      [minchan@kernel.org: Putback pages isolated for migration if aborting]
      [akpm@linux-foundation.org: compact_zone_order requires non-NULL arg contended]
      [akpm@linux-foundation.org: make compact_zone_order() require non-NULL arg `contended']
      [minchan@kernel.org: Putback pages isolated for migration if aborting]
      Signed-off-by: NAndrea Arcangeli <aarcange@redhat.com>
      Signed-off-by: NShaohua Li <shli@fusionio.com>
      Signed-off-by: NMel Gorman <mgorman@suse.de>
      Acked-by: NMinchan Kim <minchan@kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      e64c5237
    • B
      cma: fix watermark checking · d95ea5d1
      Bartlomiej Zolnierkiewicz 提交于
      * Add ALLOC_CMA alloc flag and pass it to [__]zone_watermark_ok()
        (from Minchan Kim).
      
      * During watermark check decrease available free pages number by
        free CMA pages number if necessary (unmovable allocations cannot
        use pages from CMA areas).
      Signed-off-by: NBartlomiej Zolnierkiewicz <b.zolnierkie@samsung.com>
      Signed-off-by: NKyungmin Park <kyungmin.park@samsung.com>
      Cc: Marek Szyprowski <m.szyprowski@samsung.com>
      Cc: Michal Nazarewicz <mina86@mina86.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Hugh Dickins <hughd@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      d95ea5d1
    • M
      mm: compaction: capture a suitable high-order page immediately when it is made available · 1fb3f8ca
      Mel Gorman 提交于
      While compaction is migrating pages to free up large contiguous blocks
      for allocation it races with other allocation requests that may steal
      these blocks or break them up.  This patch alters direct compaction to
      capture a suitable free page as soon as it becomes available to reduce
      this race.  It uses similar logic to split_free_page() to ensure that
      watermarks are still obeyed.
      Signed-off-by: NMel Gorman <mgorman@suse.de>
      Reviewed-by: NRik van Riel <riel@redhat.com>
      Reviewed-by: NMinchan Kim <minchan@kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      1fb3f8ca
    • M
      mm: compaction: update comment in try_to_compact_pages · 4ffb6335
      Mel Gorman 提交于
      Allocation success rates have been far lower since 3.4 due to commit
      fe2c2a10 ("vmscan: reclaim at order 0 when compaction is enabled").
      This commit was introduced for good reasons and it was known in advance
      that the success rates would suffer but it was justified on the grounds
      that the high allocation success rates were achieved by aggressive
      reclaim.  Success rates are expected to suffer even more in 3.6 due to
      commit 7db8889a ("mm: have order > 0 compaction start off where it
      left") which testing has shown to severely reduce allocation success
      rates under load - to 0% in one case.
      
      This series aims to improve the allocation success rates without
      regressing the benefits of commit fe2c2a10.  The series is based on
      latest mmotm and takes into account the __GFP_NO_KSWAPD flag is going
      away.
      
      Patch 1 updates a stale comment seeing as I was in the general area.
      
      Patch 2 updates reclaim/compaction to reclaim pages scaled on the number
      	of recent failures.
      
      Patch 3 captures suitable high-order pages freed by compaction to reduce
      	races with parallel allocation requests.
      
      Patch 4 fixes the upstream commit [7db8889a: mm: have order > 0 compaction
      	start off where it left] to enable compaction again
      
      Patch 5 identifies when compacion is taking too long due to contention
      	and aborts.
      
      STRESS-HIGHALLOC
      		 3.6-rc1-akpm	  full-series
      Pass 1          36.00 ( 0.00%)    51.00 (15.00%)
      Pass 2          42.00 ( 0.00%)    63.00 (21.00%)
      while Rested    86.00 ( 0.00%)    86.00 ( 0.00%)
      
      From
      
        http://www.csn.ul.ie/~mel/postings/mmtests-20120424/global-dhp__stress-highalloc-performance-ext3/hydra/comparison.html
      
      I know that the allocation success rates in 3.3.6 was 78% in comparison
      to 36% in in the current akpm tree.  With the full series applied, the
      success rates are up to around 51% with some variability in the results.
      This is not as high a success rate but it does not reclaim excessively
      which is a key point.
      
      MMTests Statistics: vmstat
      Page Ins                                     3050912     3078892
      Page Outs                                    8033528     8039096
      Swap Ins                                           0           0
      Swap Outs                                          0           0
      
      Note that swap in/out rates remain at 0. In 3.3.6 with 78% success rates
      there were 71881 pages swapped out.
      
      Direct pages scanned                           70942      122976
      Kswapd pages scanned                         1366300     1520122
      Kswapd pages reclaimed                       1366214     1484629
      Direct pages reclaimed                         70936      105716
      Kswapd efficiency                                99%         97%
      Kswapd velocity                             1072.550    1182.615
      Direct efficiency                                99%         85%
      Direct velocity                               55.690      95.672
      
      The kswapd velocity changes very little as expected.  kswapd velocity is
      around the 1000 pages/sec mark where as in kernel 3.3.6 with the high
      allocation success rates it was 8140 pages/second.  Direct velocity is
      higher as a result of patch 2 of the series but this is expected and is
      acceptable.  The direct reclaim and kswapd velocities change very little.
      
      If these get accepted for merging then there is a difficulty in how they
      should be handled.  7db8889a ("mm: have order > 0 compaction start off
      where it left") is broken but it is already in 3.6-rc1 and needs to be
      fixed.  However, if just patch 4 from this series is applied then Jim
      Schutt's workload is known to break again as his workload also requires
      patch 5.  While it would be preferred to have all these patches in 3.6 to
      improve compaction in general, it would at least be acceptable if just
      patches 4 and 5 were merged to 3.6 to fix a known problem without breaking
      compaction completely.  On the face of it, that would force
      __GFP_NO_KSWAPD patches to be merged at the same time but I can do a
      version of this series with __GFP_NO_KSWAPD change reverted and then
      rebase it on top of this series.  That might be best overall because I
      note that the __GFP_NO_KSWAPD patch should have removed
      deferred_compaction from page_alloc.c but it didn't but fixing that causes
      collisions with this series.
      
      This patch:
      
      The comment about order applied when the check was order >
      PAGE_ALLOC_COSTLY_ORDER which has not been the case since c5a73c3d ("thp:
      use compaction for all allocation orders").  Fixing the comment while I'm
      in the general area.
      Signed-off-by: NMel Gorman <mgorman@suse.de>
      Reviewed-by: NRik van Riel <riel@redhat.com>
      Reviewed-by: NMinchan Kim <minchan@kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      4ffb6335
  9. 22 8月, 2012 3 次提交
    • M
      mm: compaction: Abort async compaction if locks are contended or taking too long · c67fe375
      Mel Gorman 提交于
      Jim Schutt reported a problem that pointed at compaction contending
      heavily on locks.  The workload is straight-forward and in his own words;
      
      	The systems in question have 24 SAS drives spread across 3 HBAs,
      	running 24 Ceph OSD instances, one per drive.  FWIW these servers
      	are dual-socket Intel 5675 Xeons w/48 GB memory.  I've got ~160
      	Ceph Linux clients doing dd simultaneously to a Ceph file system
      	backed by 12 of these servers.
      
      Early in the test everything looks fine
      
        procs -------------------memory------------------ ---swap-- -----io---- --system-- -----cpu-------
         r  b       swpd       free       buff      cache   si   so    bi    bo   in   cs  us sy  id wa st
        31 15          0     287216        576   38606628    0    0     2  1158    2   14   1  3  95  0  0
        27 15          0     225288        576   38583384    0    0    18 2222016 203357 134876  11 56  17 15  0
        28 17          0     219256        576   38544736    0    0    11 2305932 203141 146296  11 49  23 17  0
         6 18          0     215596        576   38552872    0    0     7 2363207 215264 166502  12 45  22 20  0
        22 18          0     226984        576   38596404    0    0     3 2445741 223114 179527  12 43  23 22  0
      
      and then it goes to pot
      
        procs -------------------memory------------------ ---swap-- -----io---- --system-- -----cpu-------
         r  b       swpd       free       buff      cache   si   so    bi    bo   in   cs  us sy  id wa st
        163  8          0     464308        576   36791368    0    0    11 22210  866  536   3 13  79  4  0
        207 14          0     917752        576   36181928    0    0   712 1345376 134598 47367   7 90   1  2  0
        123 12          0     685516        576   36296148    0    0   429 1386615 158494 60077   8 84   5  3  0
        123 12          0     598572        576   36333728    0    0  1107 1233281 147542 62351   7 84   5  4  0
        622  7          0     660768        576   36118264    0    0   557 1345548 151394 59353   7 85   4  3  0
        223 11          0     283960        576   36463868    0    0    46 1107160 121846 33006   6 93   1  1  0
      
      Note that system CPU usage is very high blocks being written out has
      dropped by 42%. He analysed this with perf and found
      
        perf record -g -a sleep 10
        perf report --sort symbol --call-graph fractal,5
          34.63%  [k] _raw_spin_lock_irqsave
                  |
                  |--97.30%-- isolate_freepages
                  |          compaction_alloc
                  |          unmap_and_move
                  |          migrate_pages
                  |          compact_zone
                  |          compact_zone_order
                  |          try_to_compact_pages
                  |          __alloc_pages_direct_compact
                  |          __alloc_pages_slowpath
                  |          __alloc_pages_nodemask
                  |          alloc_pages_vma
                  |          do_huge_pmd_anonymous_page
                  |          handle_mm_fault
                  |          do_page_fault
                  |          page_fault
                  |          |
                  |          |--87.39%-- skb_copy_datagram_iovec
                  |          |          tcp_recvmsg
                  |          |          inet_recvmsg
                  |          |          sock_recvmsg
                  |          |          sys_recvfrom
                  |          |          system_call
                  |          |          __recv
                  |          |          |
                  |          |           --100.00%-- (nil)
                  |          |
                  |           --12.61%-- memcpy
                   --2.70%-- [...]
      
      There was other data but primarily it is all showing that compaction is
      contended heavily on the zone->lock and zone->lru_lock.
      
      commit [b2eef8c0: mm: compaction: minimise the time IRQs are disabled
      while isolating pages for migration] noted that it was possible for
      migration to hold the lru_lock for an excessive amount of time. Very
      broadly speaking this patch expands the concept.
      
      This patch introduces compact_checklock_irqsave() to check if a lock
      is contended or the process needs to be scheduled. If either condition
      is true then async compaction is aborted and the caller is informed.
      The page allocator will fail a THP allocation if compaction failed due
      to contention. This patch also introduces compact_trylock_irqsave()
      which will acquire the lock only if it is not contended and the process
      does not need to schedule.
      Reported-by: NJim Schutt <jaschut@sandia.gov>
      Tested-by: NJim Schutt <jaschut@sandia.gov>
      Signed-off-by: NMel Gorman <mgorman@suse.de>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      c67fe375
    • M
      mm: have order > 0 compaction start near a pageblock with free pages · de74f1cc
      Mel Gorman 提交于
      Commit 7db8889a ("mm: have order > 0 compaction start off where it
      left") introduced a caching mechanism to reduce the amount work the free
      page scanner does in compaction.  However, it has a problem.  Consider
      two process simultaneously scanning free pages
      
      					    			C
      	Process A		M     S     			F
      			|---------------------------------------|
      	Process B		M 	FS
      
      	C is zone->compact_cached_free_pfn
      	S is cc->start_pfree_pfn
      	M is cc->migrate_pfn
      	F is cc->free_pfn
      
      In this diagram, Process A has just reached its migrate scanner, wrapped
      around and updated compact_cached_free_pfn accordingly.
      
      Simultaneously, Process B finishes isolating in a block and updates
      compact_cached_free_pfn again to the location of its free scanner.
      
      Process A moves to "end_of_zone - one_pageblock" and runs this check
      
                      if (cc->order > 0 && (!cc->wrapped ||
                                            zone->compact_cached_free_pfn >
                                            cc->start_free_pfn))
                              pfn = min(pfn, zone->compact_cached_free_pfn);
      
      compact_cached_free_pfn is above where it started so the free scanner
      skips almost the entire space it should have scanned.  When there are
      multiple processes compacting it can end in a situation where the entire
      zone is not being scanned at all.  Further, it is possible for two
      processes to ping-pong update to compact_cached_free_pfn which is just
      random.
      
      Overall, the end result wrecks allocation success rates.
      
      There is not an obvious way around this problem without introducing new
      locking and state so this patch takes a different approach.
      
      First, it gets rid of the skip logic because it's not clear that it
      matters if two free scanners happen to be in the same block but with
      racing updates it's too easy for it to skip over blocks it should not.
      
      Second, it updates compact_cached_free_pfn in a more limited set of
      circumstances.
      
      If a scanner has wrapped, it updates compact_cached_free_pfn to the end
      	of the zone. When a wrapped scanner isolates a page, it updates
      	compact_cached_free_pfn to point to the highest pageblock it
      	can isolate pages from.
      
      If a scanner has not wrapped when it has finished isolated pages it
      	checks if compact_cached_free_pfn is pointing to the end of the
      	zone. If so, the value is updated to point to the highest
      	pageblock that pages were isolated from. This value will not
      	be updated again until a free page scanner wraps and resets
      	compact_cached_free_pfn.
      
      This is not optimal and it can still race but the compact_cached_free_pfn
      will be pointing to or very near a pageblock with free pages.
      Signed-off-by: NMel Gorman <mgorman@suse.de>
      Reviewed-by: NRik van Riel <riel@redhat.com>
      Reviewed-by: NMinchan Kim <minchan@kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      de74f1cc
    • M
      mm/compaction.c: fix deferring compaction mistake · c81758fb
      Minchan Kim 提交于
      Commit aff62249 ("vmscan: only defer compaction for failed order and
      higher") fixed bad deferring policy but made mistake about checking
      compact_order_failed in __compact_pgdat().  So it can't update
      compact_order_failed with the new order.  This ends up preventing
      correct operation of policy deferral.  This patch fixes it.
      Signed-off-by: NMinchan Kim <minchan@kernel.org>
      Reviewed-by: NRik van Riel <riel@redhat.com>
      Acked-by: NMel Gorman <mel@csn.ul.ie>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      c81758fb
  10. 01 8月, 2012 1 次提交
    • R
      mm: have order > 0 compaction start off where it left · 7db8889a
      Rik van Riel 提交于
      Order > 0 compaction stops when enough free pages of the correct page
      order have been coalesced.  When doing subsequent higher order
      allocations, it is possible for compaction to be invoked many times.
      
      However, the compaction code always starts out looking for things to
      compact at the start of the zone, and for free pages to compact things to
      at the end of the zone.
      
      This can cause quadratic behaviour, with isolate_freepages starting at the
      end of the zone each time, even though previous invocations of the
      compaction code already filled up all free memory on that end of the zone.
      
      This can cause isolate_freepages to take enormous amounts of CPU with
      certain workloads on larger memory systems.
      
      The obvious solution is to have isolate_freepages remember where it left
      off last time, and continue at that point the next time it gets invoked
      for an order > 0 compaction.  This could cause compaction to fail if
      cc->free_pfn and cc->migrate_pfn are close together initially, in that
      case we restart from the end of the zone and try once more.
      
      Forced full (order == -1) compactions are left alone.
      
      [akpm@linux-foundation.org: checkpatch fixes]
      [akpm@linux-foundation.org: s/laste/last/, use 80 cols]
      Signed-off-by: NRik van Riel <riel@redhat.com>
      Reported-by: NJim Schutt <jaschut@sandia.gov>
      Tested-by: NJim Schutt <jaschut@sandia.gov>
      Cc: Minchan Kim <minchan.kim@gmail.com>
      Reviewed-by: NKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Acked-by: NMel Gorman <mel@csn.ul.ie>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      7db8889a
  11. 12 7月, 2012 1 次提交
    • D
      mm, thp: abort compaction if migration page cannot be charged to memcg · 4bf2bba3
      David Rientjes 提交于
      If page migration cannot charge the temporary page to the memcg,
      migrate_pages() will return -ENOMEM.  This isn't considered in memory
      compaction however, and the loop continues to iterate over all
      pageblocks trying to isolate and migrate pages.  If a small number of
      very large memcgs happen to be oom, however, these attempts will mostly
      be futile leading to an enormous amout of cpu consumption due to the
      page migration failures.
      
      This patch will short circuit and fail memory compaction if
      migrate_pages() returns -ENOMEM.  COMPACT_PARTIAL is returned in case
      some migrations were successful so that the page allocator will retry.
      Signed-off-by: NDavid Rientjes <rientjes@google.com>
      Acked-by: NMel Gorman <mgorman@suse.de>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      4bf2bba3
  12. 04 6月, 2012 1 次提交
  13. 30 5月, 2012 3 次提交
    • H
      mm/memcg: apply add/del_page to lruvec · fa9add64
      Hugh Dickins 提交于
      Take lruvec further: pass it instead of zone to add_page_to_lru_list() and
      del_page_from_lru_list(); and pagevec_lru_move_fn() pass lruvec down to
      its target functions.
      
      This cleanup eliminates a swathe of cruft in memcontrol.c, including
      mem_cgroup_lru_add_list(), mem_cgroup_lru_del_list() and
      mem_cgroup_lru_move_lists() - which never actually touched the lists.
      
      In their place, mem_cgroup_page_lruvec() to decide the lruvec, previously
      a side-effect of add, and mem_cgroup_update_lru_size() to maintain the
      lru_size stats.
      
      Whilst these are simplifications in their own right, the goal is to bring
      the evaluation of lruvec next to the spin_locking of the lrus, in
      preparation for a future patch.
      Signed-off-by: NHugh Dickins <hughd@google.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Acked-by: NKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Acked-by: NMichal Hocko <mhocko@suse.cz>
      Acked-by: NKonstantin Khlebnikov <khlebnikov@openvz.org>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      fa9add64
    • K
      mm: remove lru type checks from __isolate_lru_page() · f3fd4a61
      Konstantin Khlebnikov 提交于
      After patch "mm: forbid lumpy-reclaim in shrink_active_list()" we can
      completely remove anon/file and active/inactive lru type filters from
      __isolate_lru_page(), because isolation for 0-order reclaim always
      isolates pages from right lru list.  And pages-isolation for lumpy
      shrink_inactive_list() or memory-compaction anyway allowed to isolate
      pages from all evictable lru lists.
      Signed-off-by: NKonstantin Khlebnikov <khlebnikov@openvz.org>
      Acked-by: NKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Hugh Dickins <hughd@google.com>
      Acked-by: NMichal Hocko <mhocko@suse.cz>
      Cc: Glauber Costa <glommer@parallels.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      f3fd4a61
    • B
      mm: compaction: handle incorrect MIGRATE_UNMOVABLE type pageblocks · 5ceb9ce6
      Bartlomiej Zolnierkiewicz 提交于
      When MIGRATE_UNMOVABLE pages are freed from MIGRATE_UNMOVABLE type
      pageblock (and some MIGRATE_MOVABLE pages are left in it) waiting until an
      allocation takes ownership of the block may take too long.  The type of
      the pageblock remains unchanged so the pageblock cannot be used as a
      migration target during compaction.
      
      Fix it by:
      
      * Adding enum compact_mode (COMPACT_ASYNC_[MOVABLE,UNMOVABLE], and
        COMPACT_SYNC) and then converting sync field in struct compact_control
        to use it.
      
      * Adding nr_pageblocks_skipped field to struct compact_control and
        tracking how many destination pageblocks were of MIGRATE_UNMOVABLE type.
         If COMPACT_ASYNC_MOVABLE mode compaction ran fully in
        try_to_compact_pages() (COMPACT_COMPLETE) it implies that there is not a
        suitable page for allocation.  In this case then check how if there were
        enough MIGRATE_UNMOVABLE pageblocks to try a second pass in
        COMPACT_ASYNC_UNMOVABLE mode.
      
      * Scanning the MIGRATE_UNMOVABLE pageblocks (during COMPACT_SYNC and
        COMPACT_ASYNC_UNMOVABLE compaction modes) and building a count based on
        finding PageBuddy pages, page_count(page) == 0 or PageLRU pages.  If all
        pages within the MIGRATE_UNMOVABLE pageblock are in one of those three
        sets change the whole pageblock type to MIGRATE_MOVABLE.
      
      My particular test case (on a ARM EXYNOS4 device with 512 MiB, which means
      131072 standard 4KiB pages in 'Normal' zone) is to:
      
      - allocate 120000 pages for kernel's usage
      - free every second page (60000 pages) of memory just allocated
      - allocate and use 60000 pages from user space
      - free remaining 60000 pages of kernel memory
        (now we have fragmented memory occupied mostly by user space pages)
      - try to allocate 100 order-9 (2048 KiB) pages for kernel's usage
      
      The results:
      - with compaction disabled I get 11 successful allocations
      - with compaction enabled - 14 successful allocations
      - with this patch I'm able to get all 100 successful allocations
      
      NOTE: If we can make kswapd aware of order-0 request during compaction, we
      can enhance kswapd with changing mode to COMPACT_ASYNC_FULL
      (COMPACT_ASYNC_MOVABLE + COMPACT_ASYNC_UNMOVABLE).  Please see the
      following thread:
      
      	http://marc.info/?l=linux-mm&m=133552069417068&w=2
      
      [minchan@kernel.org: minor cleanups]
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Marek Szyprowski <m.szyprowski@samsung.com>
      Signed-off-by: NBartlomiej Zolnierkiewicz <b.zolnierkie@samsung.com>
      Signed-off-by: NKyungmin Park <kyungmin.park@samsung.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      5ceb9ce6
  14. 21 5月, 2012 5 次提交
  15. 22 3月, 2012 2 次提交
    • D
      mm: compaction: make compact_control order signed · aad6ec37
      Dan Carpenter 提交于
      "order" is -1 when compacting via /proc/sys/vm/compact_memory.  Making
      it unsigned causes a bug in __compact_pgdat() when we test:
      
      	if (cc->order < 0 || !compaction_deferred(zone, cc->order))
      		compact_zone(zone, cc);
      
      [akpm@linux-foundation.org: make __compact_pgdat()'s comparison match other code sites]
      Signed-off-by: NDan Carpenter <dan.carpenter@oracle.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Minchan Kim <minchan@kernel.org>
      Reviewed-by: NRik van Riel <riel@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      aad6ec37
    • H
      compact_pgdat: workaround lockdep warning in kswapd · 8575ec29
      Hugh Dickins 提交于
      I get this lockdep warning from swapping load on linux-next, due to
      "vmscan: kswapd carefully call compaction".
      
      =================================
      [ INFO: inconsistent lock state ]
      3.3.0-rc2-next-20120201 #5 Not tainted
      ---------------------------------
      inconsistent {RECLAIM_FS-ON-W} -> {IN-RECLAIM_FS-W} usage.
      kswapd0/28 [HC0[0]:SC0[0]:HE1:SE1] takes:
       (pcpu_alloc_mutex){+.+.?.}, at: [<ffffffff810d6684>] pcpu_alloc+0x67/0x325
      {RECLAIM_FS-ON-W} state was registered at:
        [<ffffffff81099b75>] mark_held_locks+0xd7/0x103
        [<ffffffff8109a13c>] lockdep_trace_alloc+0x85/0x9e
        [<ffffffff810f6bdc>] __kmalloc+0x6c/0x14b
        [<ffffffff810d57fd>] pcpu_mem_zalloc+0x59/0x62
        [<ffffffff810d5d16>] pcpu_extend_area_map+0x26/0xb1
        [<ffffffff810d679f>] pcpu_alloc+0x182/0x325
        [<ffffffff810d694d>] __alloc_percpu+0xb/0xd
        [<ffffffff8142ebfd>] snmp_mib_init+0x1e/0x2e
        [<ffffffff8185cd8d>] ipv4_mib_init_net+0x7a/0x184
        [<ffffffff813dc963>] ops_init.clone.0+0x6b/0x73
        [<ffffffff813dc9cc>] register_pernet_operations+0x61/0xa0
        [<ffffffff813dca8e>] register_pernet_subsys+0x29/0x42
        [<ffffffff8185d044>] inet_init+0x1ad/0x252
        [<ffffffff810002e3>] do_one_initcall+0x7a/0x12f
        [<ffffffff81832bc5>] kernel_init+0x9d/0x11e
        [<ffffffff814e51e4>] kernel_thread_helper+0x4/0x10
      irq event stamp: 656613
      hardirqs last  enabled at (656613): [<ffffffff814e0ddc>] __mutex_unlock_slowpath+0x104/0x128
      hardirqs last disabled at (656612): [<ffffffff814e0d34>] __mutex_unlock_slowpath+0x5c/0x128
      softirqs last  enabled at (655568): [<ffffffff8105b4a5>] __do_softirq+0x120/0x136
      softirqs last disabled at (654757): [<ffffffff814e52dc>] call_softirq+0x1c/0x30
      
      other info that might help us debug this:
       Possible unsafe locking scenario:
      
             CPU0
             ----
        lock(pcpu_alloc_mutex);
        <Interrupt>
          lock(pcpu_alloc_mutex);
      
       *** DEADLOCK ***
      
      no locks held by kswapd0/28.
      
      stack backtrace:
      Pid: 28, comm: kswapd0 Not tainted 3.3.0-rc2-next-20120201 #5
      Call Trace:
       [<ffffffff810981f4>] print_usage_bug+0x1bf/0x1d0
       [<ffffffff81096c3e>] ? print_irq_inversion_bug+0x1d9/0x1d9
       [<ffffffff810982c0>] mark_lock_irq+0xbb/0x22e
       [<ffffffff810c5399>] ? free_hot_cold_page+0x13d/0x14f
       [<ffffffff81098684>] mark_lock+0x251/0x331
       [<ffffffff81098893>] mark_irqflags+0x12f/0x141
       [<ffffffff81098e32>] __lock_acquire+0x58d/0x753
       [<ffffffff810d6684>] ? pcpu_alloc+0x67/0x325
       [<ffffffff81099433>] lock_acquire+0x54/0x6a
       [<ffffffff810d6684>] ? pcpu_alloc+0x67/0x325
       [<ffffffff8107a5b8>] ? add_preempt_count+0xa9/0xae
       [<ffffffff814e0a21>] mutex_lock_nested+0x5e/0x315
       [<ffffffff810d6684>] ? pcpu_alloc+0x67/0x325
       [<ffffffff81098f81>] ? __lock_acquire+0x6dc/0x753
       [<ffffffff810c9fb0>] ? __pagevec_release+0x2c/0x2c
       [<ffffffff810d6684>] pcpu_alloc+0x67/0x325
       [<ffffffff810c9fb0>] ? __pagevec_release+0x2c/0x2c
       [<ffffffff810d694d>] __alloc_percpu+0xb/0xd
       [<ffffffff8106c35e>] schedule_on_each_cpu+0x23/0x110
       [<ffffffff810c9fcb>] lru_add_drain_all+0x10/0x12
       [<ffffffff810f126f>] __compact_pgdat+0x20/0x182
       [<ffffffff810f15c2>] compact_pgdat+0x27/0x29
       [<ffffffff810c306b>] ? zone_watermark_ok+0x1a/0x1c
       [<ffffffff810cdf6f>] balance_pgdat+0x732/0x751
       [<ffffffff810ce0ed>] kswapd+0x15f/0x178
       [<ffffffff810cdf8e>] ? balance_pgdat+0x751/0x751
       [<ffffffff8106fd11>] kthread+0x84/0x8c
       [<ffffffff814e51e4>] kernel_thread_helper+0x4/0x10
       [<ffffffff810787ed>] ? finish_task_switch+0x85/0xea
       [<ffffffff814e3861>] ? retint_restore_args+0xe/0xe
       [<ffffffff8106fc8d>] ? __init_kthread_worker+0x56/0x56
       [<ffffffff814e51e0>] ? gs_change+0xb/0xb
      
      The RECLAIM_FS notations indicate that it's doing the GFP_FS checking that
      Nick hacked into lockdep a while back: I think we're intended to read that
      "<Interrupt>" in the DEADLOCK scenario as "<Direct reclaim>".
      
      I'm hazy, I have not reached any conclusion as to whether it's right to
      complain or not; but I believe it's uneasy about kswapd now doing the
      mutex_lock(&pcpu_alloc_mutex) which lru_add_drain_all() entails.  Nor have
      I reached any conclusion as to whether it's important for kswapd to do
      that draining or not.
      
      But so as not to get blocked on this, with lockdep disabled from giving
      further reports, here's a patch which removes the lru_add_drain_all() from
      kswapd's callpath (and calls it only once from compact_nodes(), instead of
      once per node).
      Signed-off-by: NHugh Dickins <hughd@google.com>
      Cc: Rik van Riel <riel@redhat.com>
      Acked-by: NMel Gorman <mel@csn.ul.ie>
      Cc: Tejun Heo <tj@kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      8575ec29