1. 12 Dec, 2012 1 commit
    • mm: introduce compaction and migration for ballooned pages · bf6bddf1
      Committed by Rafael Aquini
      Memory fragmentation introduced by ballooning might significantly reduce
      the number of 2MB contiguous memory blocks that can be used within a guest,
      thus imposing performance penalties associated with the reduced number of
      transparent huge pages that could be used by the guest workload.
      
      This patch introduces the helper functions as well as the necessary changes
      to teach compaction and migration bits how to cope with pages which are
      part of a guest memory balloon, in order to make them movable by memory
      compaction procedures.
      Signed-off-by: Rafael Aquini <aquini@redhat.com>
      Acked-by: Mel Gorman <mel@csn.ul.ie>
      Cc: Rusty Russell <rusty@rustcorp.com.au>
      Cc: "Michael S. Tsirkin" <mst@redhat.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  2. 07 Dec, 2012 1 commit
    • mm: compaction: validate pfn range passed to isolate_freepages_block · 60177d31
      Committed by Mel Gorman
      Commit 0bf380bc ("mm: compaction: check pfn_valid when entering a
      new MAX_ORDER_NR_PAGES block during isolation for migration") added a
      check for pfn_valid() when isolating pages for migration as the scanner
      does not necessarily start pageblock-aligned.
      
      Since commit c89511ab ("mm: compaction: Restart compaction from near
      where it left off"), the free scanner has the same problem.  This patch
      makes sure that the pfn range passed to isolate_freepages_block() is
      within the same block so that pfn_valid() checks are unnecessary.
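The clamping described above can be illustrated with a small userspace sketch. This is not the kernel code: the constant `PAGEBLOCK_NR_PAGES`, its value of 512, and the helper names `align_up` and `scan_end_pfn` are assumptions made for illustration only.

```c
/* Assumption: 512 pages per pageblock (2MB with 4KB pages); illustrative only. */
#define PAGEBLOCK_NR_PAGES 512UL

/* Round pfn up to the next multiple of align (align must be a power of two). */
static unsigned long align_up(unsigned long pfn, unsigned long align)
{
	return (pfn + align - 1) & ~(align - 1);
}

/*
 * Hypothetical helper: return the exclusive end of the range to scan when
 * starting at pfn, clamped so that it never crosses into the next pageblock
 * and never exceeds zone_end_pfn.
 */
static unsigned long scan_end_pfn(unsigned long pfn, unsigned long zone_end_pfn)
{
	unsigned long end = align_up(pfn + 1, PAGEBLOCK_NR_PAGES);

	return end < zone_end_pfn ? end : zone_end_pfn;
}
```

With these assumptions, a scanner restarting at an unaligned pfn such as 700 would be handed the range up to 1024, so every pfn it touches lies in the block that was already validated.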
      
      In answer to Henrik's wondering why others have not reported this:
      reproducing this requires a large enough hole with the right alignment to
      have compaction walk into a PFN range with no memmap.  Size and
      alignment depend on the memory model - 4M for FLATMEM and 128M for
      SPARSEMEM on x86.  It needs a "lucky" machine.
      Reported-by: Henrik Rydberg <rydberg@euromail.se>
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  3. 20 Oct, 2012 1 commit
  4. 09 Oct, 2012 13 commits
    • CMA: migrate mlocked pages · e46a2879
      Committed by Minchan Kim
      Presently CMA cannot migrate mlocked pages so it ends up failing to allocate
      contiguous memory space.
      
      This patch makes mlocked pages be migrated out.  Of course, it can affect
      realtime processes but in CMA usecase, contiguous memory allocation failing
      is far worse than access latency to an mlocked page being variable while
      CMA is running.  If someone wants to make the system realtime, he shouldn't
      enable CMA because stalls can still happen at random times.
      
      [akpm@linux-foundation.org: tweak comment text, per Mel]
      Signed-off-by: Minchan Kim <minchan@kernel.org>
      Acked-by: Mel Gorman <mgorman@suse.de>
      Cc: Michal Nazarewicz <mina86@mina86.com>
      Cc: Bartlomiej Zolnierkiewicz <b.zolnierkie@samsung.com>
      Cc: Marek Szyprowski <m.szyprowski@samsung.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: compaction: clear PG_migrate_skip based on compaction and reclaim activity · 62997027
      Committed by Mel Gorman
      Compaction caches if a pageblock was scanned and no pages were isolated so
      that the pageblocks can be skipped in the future to reduce scanning.  This
      information is not cleared by the page allocator based on activity due to
      the impact it would have to the page allocator fast paths.  Hence there is
      a requirement that something clear the cache or pageblocks will be skipped
      forever.  Currently the cache is cleared if there were a number of recent
      allocation failures and it has not been cleared within the last 5 seconds.
      Time-based decisions like this are terrible as they have no relationship
      to VM activity and are basically a big hammer.
      
      Unfortunately, accurate heuristics would add cost to some hot paths so
      this patch implements a rough heuristic.  There are two cases where the
      cache is cleared.
      
      1. If a !kswapd process completes a compaction cycle (migrate and free
         scanner meet), the zone is marked compact_blockskip_flush. When kswapd
         goes to sleep, it will clear the cache. This is expected to be the
         common case where the cache is cleared. It does not really matter if
         kswapd happens to be asleep or going to sleep when the flag is set as
         it will be woken on the next allocation request.
      
      2. If there have been multiple failures recently and compaction just
         finished being deferred then a process will clear the cache and start a
         full scan.  This situation happens if there are multiple high-order
         allocation requests under heavy memory pressure.
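The two clearing cases listed above can be modelled roughly as follows. This is a userspace sketch under stated assumptions: only the flag name `compact_blockskip_flush` comes from the text, while the struct, the counter fields, and all function names are hypothetical stand-ins.

```c
#include <stdbool.h>

struct zone_flush_sketch {
	bool compact_blockskip_flush;  /* set when a full compaction cycle completes */
	unsigned int skip_bits_set;    /* stand-in for the number of skip bits set */
	unsigned int resets;           /* how many times the cache was cleared */
};

/* Clear all cached skip hints for the zone. */
static void reset_skip_hints(struct zone_flush_sketch *z)
{
	z->skip_bits_set = 0;
	z->compact_blockskip_flush = false;
	z->resets++;
}

/* Case 1, first half: the migrate and free scanners met. */
static void compaction_cycle_complete(struct zone_flush_sketch *z, bool is_kswapd)
{
	if (!is_kswapd)
		z->compact_blockskip_flush = true;
}

/* Case 1, second half: kswapd clears the cache on its way to sleep. */
static void kswapd_going_to_sleep(struct zone_flush_sketch *z)
{
	if (z->compact_blockskip_flush)
		reset_skip_hints(z);
}

/* Case 2: repeated failures, and compaction has just stopped being deferred. */
static void compaction_after_deferral(struct zone_flush_sketch *z, bool deferral_just_ended)
{
	if (deferral_just_ended)
		reset_skip_hints(z);
}
```

The point of the split in case 1 is that the process doing compaction only sets a flag, and the comparatively expensive clearing is left to kswapd when it is about to go idle.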
      
      The clearing of the PG_migrate_skip bits and other scans is inherently
      racy but the race is harmless.  For allocations that can fail such as THP,
      they will simply fail.  For requests that cannot fail, they will retry the
      allocation.  Tests indicated that scanning rates were roughly similar to
      when the time-based heuristic was used and the allocation success rates
      were similar.
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Richard Davies <richard@arachsys.com>
      Cc: Shaohua Li <shli@kernel.org>
      Cc: Avi Kivity <avi@redhat.com>
      Cc: Rafael Aquini <aquini@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: compaction: Restart compaction from near where it left off · c89511ab
      Committed by Mel Gorman
      This is almost entirely based on Rik's previous patches and discussions
      with him about how this might be implemented.
      
      Order > 0 compaction stops when enough free pages of the correct page
      order have been coalesced.  When doing subsequent higher order
      allocations, it is possible for compaction to be invoked many times.
      
      However, the compaction code always starts out looking for things to
      compact at the start of the zone, and for free pages to compact things to
      at the end of the zone.
      
      This can cause quadratic behaviour, with isolate_freepages starting at the
      end of the zone each time, even though previous invocations of the
      compaction code already filled up all free memory on that end of the zone.
      This can cause isolate_freepages to take enormous amounts of CPU with
      certain workloads on larger memory systems.
      
      This patch caches where the migration and free scanner should start from
      on subsequent compaction invocations using the pageblock-skip information.
      When compaction starts it begins from the cached restart points and will
      update the cached restart points until a page is isolated or a pageblock
      is skipped that would have been scanned by synchronous compaction.
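The idea of cached restart points can be sketched in userspace as below. This is a deliberately simplified model, not the kernel implementation: the struct, its field names, the per-call `budget`, and the rule that positions are always saved are assumptions, and the pageblock-skip interaction is left out.

```c
struct zone_restart_sketch {
	unsigned long start_pfn, end_pfn;   /* zone spans [start_pfn, end_pfn) */
	unsigned long cached_migrate_pfn;   /* where the migrate scanner resumes */
	unsigned long cached_free_pfn;      /* where the free scanner resumes */
};

/* Forget the cached positions: scanners restart from the zone edges. */
static void reset_cached_positions(struct zone_restart_sketch *z)
{
	z->cached_migrate_pfn = z->start_pfn;
	z->cached_free_pfn = z->end_pfn;
}

/*
 * One compaction invocation. Both scanners resume from the cached points
 * and move towards each other for at most `budget` steps. Returns 1 if the
 * scanners met (a full cycle), in which case the cache is reset; returns 0
 * otherwise, after saving the new positions for the next invocation.
 */
static int compact_once(struct zone_restart_sketch *z, unsigned long budget)
{
	unsigned long migrate = z->cached_migrate_pfn;
	unsigned long free = z->cached_free_pfn;

	for (; budget > 0 && migrate < free; budget--) {
		migrate++;
		if (migrate < free)
			free--;
	}

	if (migrate >= free) {
		reset_cached_positions(z);
		return 1;
	}

	z->cached_migrate_pfn = migrate;
	z->cached_free_pfn = free;
	return 0;
}
```

Because each call picks up where the previous one stopped, repeated invocations cover the zone once in total rather than rescanning the same exhausted ends every time, which is the quadratic behaviour the message describes.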
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Acked-by: Rik van Riel <riel@redhat.com>
      Cc: Richard Davies <richard@arachsys.com>
      Cc: Shaohua Li <shli@kernel.org>
      Cc: Avi Kivity <avi@redhat.com>
      Acked-by: Rafael Aquini <aquini@redhat.com>
      Cc: Fengguang Wu <fengguang.wu@intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: compaction: cache if a pageblock was scanned and no pages were isolated · bb13ffeb
      Committed by Mel Gorman
      When compaction was implemented it was known that scanning could
      potentially be excessive.  The ideal was that a counter be maintained for
      each pageblock but maintaining this information would incur a severe
      penalty due to a shared writable cache line.  It has reached the point
      where the scanning costs are a serious problem, particularly on
      long-lived systems where a large process starts and allocates a large
      number of THPs at the same time.
      
      Instead of using a shared counter, this patch adds another bit to the
      pageblock flags called PG_migrate_skip.  If a pageblock is scanned by
      either migrate or free scanner and 0 pages were isolated, the pageblock is
      marked to be skipped in the future.  When scanning, this bit is checked
      before any scanning takes place and the block skipped if set.
      
      The main difficulty with a patch like this is "when to ignore the cached
      information?" If it's ignored too often, the scanning rates will still be
      excessive.  If the information is too stale then allocations will fail
      that might have otherwise succeeded.  In this patch
      
      o CMA always ignores the information
      o If the migrate and free scanner meet then the cached information will
        be discarded if it's at least 5 seconds since the last time the cache
        was discarded
      o If there are a large number of allocation failures, discard the cache.
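The skip bit and the rule that some callers ignore it can be pictured with a small userspace model. The sketch below is illustrative only: the array-based zone, the `movable` counts, and the names `zone_skip_sketch` and `scan_zone` are assumptions, and the `ignore_skip` parameter stands in for a caller such as CMA that always ignores the cached information.

```c
#include <stdbool.h>
#include <stddef.h>

#define NR_BLOCKS 8   /* assumption: a tiny zone of 8 pageblocks */

struct zone_skip_sketch {
	bool skip[NR_BLOCKS];      /* stands in for the per-pageblock skip bit */
	int movable[NR_BLOCKS];    /* pages that could be isolated in each block */
	unsigned long scanned;     /* number of blocks actually scanned */
};

/* Walk every block once and return the number of pages isolated. */
static int scan_zone(struct zone_skip_sketch *z, bool ignore_skip)
{
	int isolated = 0;

	for (size_t i = 0; i < NR_BLOCKS; i++) {
		/* Check the cached hint before doing any scanning work. */
		if (!ignore_skip && z->skip[i])
			continue;

		z->scanned++;
		if (z->movable[i] == 0)
			z->skip[i] = true;   /* nothing isolated: skip it next time */
		else
			isolated += z->movable[i];
	}
	return isolated;
}
```

In this model a second pass with `ignore_skip` false scans only the blocks that yielded pages the first time, while a pass with `ignore_skip` true pays the full scanning cost again.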
      
      The time-based heuristic is very clumsy but there are few choices for a
      better event.  Depending solely on multiple allocation failures still
      allows excessive scanning when THP allocations are failing in quick
      succession due to memory pressure.  Waiting until memory pressure is
      relieved would cause compaction to continually fail instead of using
      reclaim/compaction to try to allocate the page.  The time-based mechanism is
      clumsy but a better option is not obvious.
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Acked-by: Rik van Riel <riel@redhat.com>
      Cc: Richard Davies <richard@arachsys.com>
      Cc: Shaohua Li <shli@kernel.org>
      Cc: Avi Kivity <avi@redhat.com>
      Acked-by: Rafael Aquini <aquini@redhat.com>
      Cc: Fengguang Wu <fengguang.wu@intel.com>
      Cc: Michal Nazarewicz <mina86@mina86.com>
      Cc: Bartlomiej Zolnierkiewicz <b.zolnierkie@samsung.com>
      Cc: Kyungmin Park <kyungmin.park@samsung.com>
      Cc: Mark Brown <broonie@opensource.wolfsonmicro.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • revert "mm: have order > 0 compaction start off where it left" · 753341a4
      Committed by Mel Gorman
      This reverts commit 7db8889a ("mm: have order > 0 compaction start
      off where it left") and commit de74f1cc ("mm: have order > 0 compaction
      start near a pageblock with free pages").  These patches were a good
      idea and tests confirmed that they massively reduced the amount of
      scanning but the implementation is complex and tricky to understand.  A
      later patch will cache what pageblocks should be skipped and
      reimplement the concept of compact_cached_free_pfn on top for both
      migration and free scanners.
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Acked-by: Rik van Riel <riel@redhat.com>
      Cc: Richard Davies <richard@arachsys.com>
      Cc: Shaohua Li <shli@kernel.org>
      Cc: Avi Kivity <avi@redhat.com>
      Acked-by: Rafael Aquini <aquini@redhat.com>
      Acked-by: Minchan Kim <minchan@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: compaction: acquire the zone->lock as late as possible · f40d1e42
      Committed by Mel Gorman
      Compaction's free scanner acquires the zone->lock when checking for
      PageBuddy pages and isolating them.  It does this even if there are no
      PageBuddy pages in the range.
      
      This patch defers acquiring the zone lock for as long as possible.  In the
      event there are no free pages in the pageblock then the lock will not be
      acquired at all which reduces contention on zone->lock.
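Deferring lock acquisition can be sketched as follows. This is a single-threaded userspace model, so the lock is only a counter; `lock_sketch`, `lock_acquire`, `lock_release`, and `isolate_free_sketch` are hypothetical names and not kernel functions.

```c
#include <stdbool.h>
#include <stddef.h>

struct lock_sketch {
	bool held;
	unsigned int acquisitions;   /* how many times the lock was taken */
};

static void lock_acquire(struct lock_sketch *l)
{
	l->held = true;
	l->acquisitions++;
}

static void lock_release(struct lock_sketch *l)
{
	l->held = false;
}

/* Count free pages in is_free[0..n), taking the lock only once a candidate is seen. */
static size_t isolate_free_sketch(const bool *is_free, size_t n, struct lock_sketch *lock)
{
	size_t isolated = 0;

	for (size_t i = 0; i < n; i++) {
		/* Cheap check made without holding the lock. */
		if (!is_free[i])
			continue;

		/* First candidate found: only now pay for the lock. */
		if (!lock->held)
			lock_acquire(lock);

		/* Recheck under the lock; in real concurrent code the first test may have raced. */
		if (is_free[i])
			isolated++;
	}

	if (lock->held)
		lock_release(lock);
	return isolated;
}
```

When a block holds no free pages, the acquisition counter stays at zero, which is the case the commit message singles out as reducing contention.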
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Acked-by: Rik van Riel <riel@redhat.com>
      Cc: Richard Davies <richard@arachsys.com>
      Cc: Shaohua Li <shli@kernel.org>
      Cc: Avi Kivity <avi@redhat.com>
      Acked-by: Rafael Aquini <aquini@redhat.com>
      Acked-by: Minchan Kim <minchan@kernel.org>
      Tested-by: Peter Ujfalusi <peter.ujfalusi@ti.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: compaction: acquire the zone->lru_lock as late as possible · 2a1402aa
      Committed by Mel Gorman
      Richard Davies and Shaohua Li have both reported lock contention problems
      in compaction on the zone and LRU locks as well as significant amounts of
      time being spent in compaction.  This series aims to reduce lock
      contention and scanning rates to reduce that CPU usage.  Richard reported
      at https://lkml.org/lkml/2012/9/21/91 that this series made a big
      difference to a problem he reported in August:
      
         http://marc.info/?l=kvm&m=134511507015614&w=2
      
      Patch 1 defers acquiring the zone->lru_lock as long as possible.
      
      Patch 2 defers acquiring the zone->lock as long as possible.
      
      Patch 3 reverts Rik's "skip-free" patches as the core concept gets
      	reimplemented later and the remaining patches are easier to
      	understand if this is reverted first.
      
      Patch 4 adds a pageblock-skip bit to the pageblock flags to cache what
      	pageblocks should be skipped by the migrate and free scanners.
      	This drastically reduces the amount of scanning compaction has
      	to do.
      
      Patch 5 reimplements something similar to Rik's idea except it uses the
      	pageblock-skip information to decide where the scanners should
      	restart from and does not need to wrap around.
      
      I tested this on 3.6-rc6 + linux-next/akpm. Kernels tested were
      
      akpm-20120920	3.6-rc6 + linux-next/akpm as of September 20th, 2012
      lesslock	Patches 1-6
      revert		Patches 1-7
      cachefail	Patches 1-8
      skipuseless	Patches 1-9
      
      Stress high-order allocation tests looked ok.  Success rates are more or
      less the same with the full series applied but there is an expectation
      that there is less opportunity to race with other allocation requests if
      there is less scanning.  The time to complete the tests did not vary that
      much and are uninteresting as were the vmstat statistics so I will not
      present them here.
      
      Using ftrace I recorded how much scanning was done by compaction and got this
      
                                   3.6.0-rc6     3.6.0-rc6     3.6.0-rc6     3.6.0-rc6     3.6.0-rc6
                               akpm-20120920      lockless   revert-v2r2     cachefail   skipuseless
      
      Total free scanned           360753976     515414028     565479007      17103281      18916589
      Total free isolated            2852429       3597369       4048601        670493        727840
      Total free efficiency          0.0079%       0.0070%       0.0072%       0.0392%       0.0385%
      Total migrate scanned        247728664     822729112    1004645830      17946827      14118903
      Total migrate isolated         2555324       3245937       3437501        616359        658616
      Total migrate efficiency       0.0103%       0.0039%       0.0034%       0.0343%       0.0466%
      
      The efficiency is worthless because of the nature of the test and the
      number of failures.  The really interesting point as far as this patch
      series is concerned is the number of pages scanned.  Note that reverting
      Rik's patches massively increases the number of pages scanned indicating
      that those patches really did make a difference to CPU usage.
      
      However, caching what pageblocks should be skipped has a much higher
      impact.  With patches 1-8 applied, free page and migrate page scanning are
      both reduced by 95% in comparison to the akpm kernel.  If the basic
      concept of Rik's patches is implemented on top then the free scanner
      barely changed but migrate scanning was further reduced.  That
      said, tests on 3.6-rc5 indicated that the last patch had greater impact
      than what was measured here so it is a bit variable.
      
      One way or the other, this series has a large impact on the amount of
      scanning compaction does when there is a storm of THP allocations.
      
      This patch:
      
      Compaction's migrate scanner acquires the zone->lru_lock when scanning a
      range of pages looking for LRU pages to acquire.  It does this even if
      there are no LRU pages in the range.  If multiple processes are compacting
      then this can cause severe locking contention.  To make matters worse
      commit b2eef8c0 ("mm: compaction: minimise the time IRQs are disabled
      while isolating pages for migration") releases the lru_lock every
      SWAP_CLUSTER_MAX pages that are scanned.
      
      This patch makes two changes to how the migrate scanner acquires the LRU
      lock.  First, it only releases the LRU lock every SWAP_CLUSTER_MAX pages
      if the lock is contended.  This reduces the number of times it
      unnecessarily disables and re-enables IRQs.  The second is that it defers
      acquiring the LRU lock for as long as possible.  If there are no LRU pages
      or the only LRU pages are transhuge then the LRU lock will not be acquired
      at all which reduces contention on zone->lru_lock.
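The first of the two changes, releasing the lock at batch boundaries only when someone else is waiting, can be sketched as below. This is an illustrative single-threaded model: `BATCH_PAGES` with a value of 32 stands in for SWAP_CLUSTER_MAX as an assumption, and the `contended` flag is simply supplied by the caller rather than detected on a real lock.

```c
#include <stdbool.h>

/* Assumption: a batch of 32 pages stands in for SWAP_CLUSTER_MAX. */
#define BATCH_PAGES 32UL

struct lru_lock_sketch {
	bool contended;          /* pretend another CPU is waiting for the lock */
	unsigned int releases;   /* how often the scanner dropped the lock */
};

/*
 * Walk nr_pages pages with the (imaginary) lock held. At each batch
 * boundary the lock is dropped and retaken only if it is contended.
 */
static void scan_with_lock(struct lru_lock_sketch *lock, unsigned long nr_pages)
{
	for (unsigned long pfn = 0; pfn < nr_pages; pfn++) {
		bool batch_boundary = pfn != 0 && pfn % BATCH_PAGES == 0;

		if (batch_boundary && lock->contended)
			lock->releases++;   /* unlock, let the waiter in, relock */

		/* ... examine the page at pfn here ... */
	}
}
```

With no contention the scanner never drops the lock over a 128-page range, so it avoids the repeated disabling and re-enabling of IRQs that an unconditional release would cost.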
      
      [minchan@kernel.org: augment comment]
      [akpm@linux-foundation.org: tweak comment text]
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Acked-by: Rik van Riel <riel@redhat.com>
      Cc: Richard Davies <richard@arachsys.com>
      Cc: Shaohua Li <shli@kernel.org>
      Cc: Avi Kivity <avi@redhat.com>
      Acked-by: Rafael Aquini <aquini@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: compaction: Update try_to_compact_pages() kerneldoc comment · 661c4cb9
      Committed by Mel Gorman
      Parameters were added without documentation, tut tut.
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: compaction: move fatal signal check out of compact_checklock_irqsave · 3cc668f4
      Committed by Mel Gorman
      Commit c67fe375 ("mm: compaction: Abort async compaction if locks
      are contended or taking too long") addressed a lock contention problem
      in compaction by introducing compact_checklock_irqsave() that effectively
      aborts async compaction in the event of contention.
      
      To preserve existing behaviour it also moved a fatal_signal_pending()
      check into compact_checklock_irqsave() but that is very misleading.  It
      "hides" the check within a locking function but has nothing to do with
      locking as such.  It just happens to work in a desirable fashion.
      
      This patch moves the fatal_signal_pending() check to
      isolate_migratepages_range() where it belongs.  Arguably the same check
      should also happen when isolating pages for freeing but it's overkill.
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Shaohua Li <shli@kernel.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: compaction: abort compaction loop if lock is contended or run too long · e64c5237
      Committed by Shaohua Li
      isolate_migratepages_range() might isolate no pages, for example when
      zone->lru_lock is contended and asynchronous compaction is running. In
      this case we should abort compaction; otherwise compact_zone will run a
      useless loop and make zone->lru_lock even more contended.
      
      An additional check is added to ensure that cc.migratepages and
      cc.freepages get properly drained when compaction is aborted.
      
      [minchan@kernel.org: Putback pages isolated for migration if aborting]
      [akpm@linux-foundation.org: compact_zone_order requires non-NULL arg contended]
      [akpm@linux-foundation.org: make compact_zone_order() require non-NULL arg `contended']
      Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
      Signed-off-by: Shaohua Li <shli@fusionio.com>
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Acked-by: Minchan Kim <minchan@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • cma: fix watermark checking · d95ea5d1
      Committed by Bartlomiej Zolnierkiewicz
      * Add ALLOC_CMA alloc flag and pass it to [__]zone_watermark_ok()
        (from Minchan Kim).
      
      * During watermark check decrease available free pages number by
        free CMA pages number if necessary (unmovable allocations cannot
        use pages from CMA areas).
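The adjusted watermark check described in the two points above can be sketched like this. It is a simplified userspace model, not the kernel's zone_watermark_ok(): the flag value `ALLOC_CMA_SKETCH`, the function name, and the flat `mark` parameter are assumptions, and details such as lowmem reserves and per-order checks are omitted.

```c
#include <stdbool.h>

/* Assumption: an arbitrary bit standing in for the ALLOC_CMA alloc flag. */
#define ALLOC_CMA_SKETCH 0x1

/*
 * Return true if the zone has more than `mark` usable free pages.
 * Allocations that may not use CMA areas do not get to count free CMA pages.
 */
static bool watermark_ok_sketch(long free_pages, long free_cma_pages,
				long mark, int alloc_flags)
{
	if (!(alloc_flags & ALLOC_CMA_SKETCH))
		free_pages -= free_cma_pages;

	return free_pages > mark;
}
```

For example, a zone with 1000 free pages of which 400 are in CMA areas passes a watermark of 700 for a movable allocation carrying the flag, but fails it for an unmovable allocation that cannot touch the CMA pages.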
      Signed-off-by: Bartlomiej Zolnierkiewicz <b.zolnierkie@samsung.com>
      Signed-off-by: Kyungmin Park <kyungmin.park@samsung.com>
      Cc: Marek Szyprowski <m.szyprowski@samsung.com>
      Cc: Michal Nazarewicz <mina86@mina86.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Hugh Dickins <hughd@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: compaction: capture a suitable high-order page immediately when it is made available · 1fb3f8ca
      Committed by Mel Gorman
      While compaction is migrating pages to free up large contiguous blocks
      for allocation it races with other allocation requests that may steal
      these blocks or break them up.  This patch alters direct compaction to
      capture a suitable free page as soon as it becomes available to reduce
      this race.  It uses similar logic to split_free_page() to ensure that
      watermarks are still obeyed.
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Reviewed-by: Rik van Riel <riel@redhat.com>
      Reviewed-by: Minchan Kim <minchan@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: compaction: update comment in try_to_compact_pages · 4ffb6335
      Committed by Mel Gorman
      Allocation success rates have been far lower since 3.4 due to commit
      fe2c2a10 ("vmscan: reclaim at order 0 when compaction is enabled").
      This commit was introduced for good reasons and it was known in advance
      that the success rates would suffer but it was justified on the grounds
      that the high allocation success rates were achieved by aggressive
      reclaim.  Success rates are expected to suffer even more in 3.6 due to
      commit 7db8889a ("mm: have order > 0 compaction start off where it
      left") which testing has shown to severely reduce allocation success
      rates under load - to 0% in one case.
      
      This series aims to improve the allocation success rates without
      regressing the benefits of commit fe2c2a10.  The series is based on
      latest mmotm and takes into account the __GFP_NO_KSWAPD flag is going
      away.
      
      Patch 1 updates a stale comment seeing as I was in the general area.
      
      Patch 2 updates reclaim/compaction to reclaim pages scaled on the number
      	of recent failures.
      
      Patch 3 captures suitable high-order pages freed by compaction to reduce
      	races with parallel allocation requests.
      
      Patch 4 fixes the upstream commit [7db8889a: mm: have order > 0 compaction
      	start off where it left] to enable compaction again
      
      Patch 5 identifies when compaction is taking too long due to contention
      	and aborts.
      
      STRESS-HIGHALLOC
      		 3.6-rc1-akpm	  full-series
      Pass 1          36.00 ( 0.00%)    51.00 (15.00%)
      Pass 2          42.00 ( 0.00%)    63.00 (21.00%)
      while Rested    86.00 ( 0.00%)    86.00 ( 0.00%)
      
      From
      
        http://www.csn.ul.ie/~mel/postings/mmtests-20120424/global-dhp__stress-highalloc-performance-ext3/hydra/comparison.html
      
      I know that the allocation success rate in 3.3.6 was 78% in comparison
      to 36% in the current akpm tree.  With the full series applied, the
      success rates are up to around 51% with some variability in the results.
      This is not as high a success rate but it does not reclaim excessively
      which is a key point.
      
      MMTests Statistics: vmstat
      Page Ins                                     3050912     3078892
      Page Outs                                    8033528     8039096
      Swap Ins                                           0           0
      Swap Outs                                          0           0
      
      Note that swap in/out rates remain at 0. In 3.3.6 with 78% success rates
      there were 71881 pages swapped out.
      
      Direct pages scanned                           70942      122976
      Kswapd pages scanned                         1366300     1520122
      Kswapd pages reclaimed                       1366214     1484629
      Direct pages reclaimed                         70936      105716
      Kswapd efficiency                                99%         97%
      Kswapd velocity                             1072.550    1182.615
      Direct efficiency                                99%         85%
      Direct velocity                               55.690      95.672
      
      The kswapd velocity changes very little as expected.  kswapd velocity is
      around the 1000 pages/sec mark whereas in kernel 3.3.6 with the high
      allocation success rates it was 8140 pages/second.  Direct velocity is
      higher as a result of patch 2 of the series but this is expected and is
      acceptable.  The direct reclaim and kswapd velocities change very little.
      
      If these get accepted for merging then there is a difficulty in how they
      should be handled.  7db8889a ("mm: have order > 0 compaction start off
      where it left") is broken but it is already in 3.6-rc1 and needs to be
      fixed.  However, if just patch 4 from this series is applied then Jim
      Schutt's workload is known to break again as his workload also requires
      patch 5.  While it would be preferred to have all these patches in 3.6 to
      improve compaction in general, it would at least be acceptable if just
      patches 4 and 5 were merged to 3.6 to fix a known problem without breaking
      compaction completely.  On the face of it, that would force
      __GFP_NO_KSWAPD patches to be merged at the same time but I can do a
      version of this series with __GFP_NO_KSWAPD change reverted and then
      rebase it on top of this series.  That might be best overall because I
      note that the __GFP_NO_KSWAPD patch should have removed
      deferred_compaction from page_alloc.c but did not, and fixing that
      causes collisions with this series.
      
      This patch:
      
      The comment about order applied when the check was order >
      PAGE_ALLOC_COSTLY_ORDER which has not been the case since c5a73c3d ("thp:
      use compaction for all allocation orders").  Fixing the comment while I'm
      in the general area.
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Reviewed-by: Rik van Riel <riel@redhat.com>
      Reviewed-by: Minchan Kim <minchan@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  5. 22 Aug, 2012 3 commits
    • mm: compaction: Abort async compaction if locks are contended or taking too long · c67fe375
      Committed by Mel Gorman
      Jim Schutt reported a problem that pointed at compaction contending
      heavily on locks.  The workload is straight-forward and in his own words;
      
      	The systems in question have 24 SAS drives spread across 3 HBAs,
      	running 24 Ceph OSD instances, one per drive.  FWIW these servers
      	are dual-socket Intel 5675 Xeons w/48 GB memory.  I've got ~160
      	Ceph Linux clients doing dd simultaneously to a Ceph file system
      	backed by 12 of these servers.
      
      Early in the test everything looks fine
      
        procs -------------------memory------------------ ---swap-- -----io---- --system-- -----cpu-------
         r  b       swpd       free       buff      cache   si   so    bi    bo   in   cs  us sy  id wa st
        31 15          0     287216        576   38606628    0    0     2  1158    2   14   1  3  95  0  0
        27 15          0     225288        576   38583384    0    0    18 2222016 203357 134876  11 56  17 15  0
        28 17          0     219256        576   38544736    0    0    11 2305932 203141 146296  11 49  23 17  0
         6 18          0     215596        576   38552872    0    0     7 2363207 215264 166502  12 45  22 20  0
        22 18          0     226984        576   38596404    0    0     3 2445741 223114 179527  12 43  23 22  0
      
      and then it goes to pot
      
        procs -------------------memory------------------ ---swap-- -----io---- --system-- -----cpu-------
         r  b       swpd       free       buff      cache   si   so    bi    bo   in   cs  us sy  id wa st
        163  8          0     464308        576   36791368    0    0    11 22210  866  536   3 13  79  4  0
        207 14          0     917752        576   36181928    0    0   712 1345376 134598 47367   7 90   1  2  0
        123 12          0     685516        576   36296148    0    0   429 1386615 158494 60077   8 84   5  3  0
        123 12          0     598572        576   36333728    0    0  1107 1233281 147542 62351   7 84   5  4  0
        622  7          0     660768        576   36118264    0    0   557 1345548 151394 59353   7 85   4  3  0
        223 11          0     283960        576   36463868    0    0    46 1107160 121846 33006   6 93   1  1  0
      
      Note that system CPU usage is very high and blocks being written out have
      dropped by 42%. He analysed this with perf and found
      
        perf record -g -a sleep 10
        perf report --sort symbol --call-graph fractal,5
          34.63%  [k] _raw_spin_lock_irqsave
                  |
                  |--97.30%-- isolate_freepages
                  |          compaction_alloc
                  |          unmap_and_move
                  |          migrate_pages
                  |          compact_zone
                  |          compact_zone_order
                  |          try_to_compact_pages
                  |          __alloc_pages_direct_compact
                  |          __alloc_pages_slowpath
                  |          __alloc_pages_nodemask
                  |          alloc_pages_vma
                  |          do_huge_pmd_anonymous_page
                  |          handle_mm_fault
                  |          do_page_fault
                  |          page_fault
                  |          |
                  |          |--87.39%-- skb_copy_datagram_iovec
                  |          |          tcp_recvmsg
                  |          |          inet_recvmsg
                  |          |          sock_recvmsg
                  |          |          sys_recvfrom
                  |          |          system_call
                  |          |          __recv
                  |          |          |
                  |          |           --100.00%-- (nil)
                  |          |
                  |           --12.61%-- memcpy
                   --2.70%-- [...]
      
      There was other data but primarily it is all showing that compaction is
      contended heavily on the zone->lock and zone->lru_lock.
      
      commit [b2eef8c0: mm: compaction: minimise the time IRQs are disabled
      while isolating pages for migration] noted that it was possible for
      migration to hold the lru_lock for an excessive amount of time. Very
      broadly speaking this patch expands the concept.
      
      This patch introduces compact_checklock_irqsave() to check if a lock
      is contended or the process needs to be scheduled. If either condition
      is true then async compaction is aborted and the caller is informed.
      The page allocator will fail a THP allocation if compaction failed due
      to contention. This patch also introduces compact_trylock_irqsave()
      which will acquire the lock only if it is not contended and the process
      does not need to schedule.
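
A minimal userspace sketch of the "give up instead of waiting" idea described above. It is not the kernel code: struct sketch_lock, struct sketch_control and the helper names are hypothetical, and a C11 atomic_flag stands in for a spinlock.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical userspace stand-in for a kernel spinlock. */
struct sketch_lock {
	atomic_flag held;
};

/* Hypothetical per-run state, loosely modelled on struct compact_control. */
struct sketch_control {
	bool sync;       /* true: the caller is allowed to wait */
	bool contended;  /* set when an async run gave up on a lock */
};

/*
 * Take the lock only if doing so will not stall an async caller.
 * Returns true with the lock held, or false after recording the abort.
 */
static bool sketch_trylock(struct sketch_lock *lock, bool need_resched,
			   struct sketch_control *cc)
{
	if (need_resched && !cc->sync) {
		cc->contended = true;   /* async: yield the CPU instead */
		return false;
	}
	if (!atomic_flag_test_and_set(&lock->held))
		return true;            /* uncontended: lock acquired */
	if (!cc->sync) {
		cc->contended = true;   /* async: abort and inform the caller */
		return false;
	}
	while (atomic_flag_test_and_set(&lock->held))
		;                       /* sync: spin until the lock is free */
	return true;
}

static void sketch_unlock(struct sketch_lock *lock)
{
	atomic_flag_clear(&lock->held);
}
```

A caller that sees cc->contended set would fail the THP allocation rather than retry, which is the behaviour the message attributes to the page allocator.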
      Reported-by: Jim Schutt <jaschut@sandia.gov>
      Tested-by: Jim Schutt <jaschut@sandia.gov>
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      c67fe375
    • M
      mm: have order > 0 compaction start near a pageblock with free pages · de74f1cc
      Mel Gorman committed
      Commit 7db8889a ("mm: have order > 0 compaction start off where it
      left") introduced a caching mechanism to reduce the amount of work the
      free page scanner does in compaction.  However, it has a problem.
      Consider two processes simultaneously scanning free pages
      
      					    			C
      	Process A		M     S     			F
      			|---------------------------------------|
      	Process B		M 	FS
      
      	C is zone->compact_cached_free_pfn
      	S is cc->start_free_pfn
      	M is cc->migrate_pfn
      	F is cc->free_pfn
      
      In this diagram, Process A has just reached its migrate scanner, wrapped
      around and updated compact_cached_free_pfn accordingly.
      
      Simultaneously, Process B finishes isolating in a block and updates
      compact_cached_free_pfn again to the location of its free scanner.
      
      Process A moves to "end_of_zone - one_pageblock" and runs this check
      
                      if (cc->order > 0 && (!cc->wrapped ||
                                            zone->compact_cached_free_pfn >
                                            cc->start_free_pfn))
                              pfn = min(pfn, zone->compact_cached_free_pfn);
      
      compact_cached_free_pfn is above where it started so the free scanner
      skips almost the entire space it should have scanned.  When there are
      multiple processes compacting it can end in a situation where the entire
      zone is not being scanned at all.  Further, it is possible for two
      processes to ping-pong updates to compact_cached_free_pfn, leaving its
      value effectively random.
      
      Overall, the end result wrecks allocation success rates.
      
      There is not an obvious way around this problem without introducing new
      locking and state so this patch takes a different approach.
      
      First, it gets rid of the skip logic.  It is not clear that it matters if
      two free scanners happen to be in the same block, but with racing
      updates it is too easy for a scanner to skip over blocks it should not.
      
      Second, it updates compact_cached_free_pfn in a more limited set of
      circumstances.
      
      If a scanner has wrapped, it updates compact_cached_free_pfn to the end
      	of the zone. When a wrapped scanner isolates a page, it updates
      	compact_cached_free_pfn to point to the highest pageblock it
      	can isolate pages from.
      
      If a scanner has not wrapped when it has finished isolating pages, it
      	checks if compact_cached_free_pfn is pointing to the end of the
      	zone. If so, the value is updated to point to the highest
      	pageblock that pages were isolated from. This value will not
      	be updated again until a free page scanner wraps and resets
      	compact_cached_free_pfn.
      
      This is not optimal and it can still race but the compact_cached_free_pfn
      will be pointing to or very near a pageblock with free pages.
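
The two update rules above can be sketched as follows. This is a simplified model under stated assumptions: struct sketch_zone and struct sketch_scanner are hypothetical, "end of the zone" is a plain end_pfn rather than a pageblock-aligned value, and no locking is shown.

```c
#include <stdbool.h>

/* Hypothetical, simplified stand-in for the relevant struct zone fields. */
struct sketch_zone {
	unsigned long end_pfn;                 /* treated as "end of the zone" */
	unsigned long compact_cached_free_pfn;
};

/* Hypothetical per-scanner state. */
struct sketch_scanner {
	bool wrapped;   /* free scanner restarted from the end of the zone */
};

/* A scanner that wraps resets the cache to the end of the zone. */
static void sketch_on_wrap(struct sketch_zone *zone, struct sketch_scanner *sc)
{
	sc->wrapped = true;
	zone->compact_cached_free_pfn = zone->end_pfn;
}

/*
 * Called with the highest pageblock pfn that pages were isolated from.
 * A wrapped scanner always publishes it; an unwrapped scanner only does
 * so while the cache still points at the end of the zone.
 */
static void sketch_on_isolate(struct sketch_zone *zone,
			      const struct sketch_scanner *sc,
			      unsigned long high_pfn)
{
	if (sc->wrapped || zone->compact_cached_free_pfn == zone->end_pfn)
		zone->compact_cached_free_pfn = high_pfn;
}
```

In this model an unwrapped scanner can set the cached value once and is then ignored until some scanner wraps, which is why the value stays at or near a pageblock that recently had free pages.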
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Reviewed-by: Rik van Riel <riel@redhat.com>
      Reviewed-by: Minchan Kim <minchan@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      de74f1cc
    • M
      mm/compaction.c: fix deferring compaction mistake · c81758fb
      Minchan Kim committed
      Commit aff62249 ("vmscan: only defer compaction for failed order and
      higher") fixed a bad deferring policy but made a mistake in checking
      compact_order_failed in __compact_pgdat(), so it can't update
      compact_order_failed with the new order.  This ends up preventing
      correct operation of the deferral policy.  This patch fixes it.
      Signed-off-by: Minchan Kim <minchan@kernel.org>
      Reviewed-by: Rik van Riel <riel@redhat.com>
      Acked-by: Mel Gorman <mel@csn.ul.ie>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      c81758fb
  6. 01 Aug, 2012 1 commit
    • R
      mm: have order > 0 compaction start off where it left · 7db8889a
      Rik van Riel committed
      Order > 0 compaction stops when enough free pages of the correct page
      order have been coalesced.  When doing subsequent higher order
      allocations, it is possible for compaction to be invoked many times.
      
      However, the compaction code always starts out looking for things to
      compact at the start of the zone, and for free pages to compact things to
      at the end of the zone.
      
      This can cause quadratic behaviour, with isolate_freepages starting at the
      end of the zone each time, even though previous invocations of the
      compaction code already filled up all free memory on that end of the zone.
      
      This can cause isolate_freepages to take enormous amounts of CPU with
      certain workloads on larger memory systems.
      
      The obvious solution is to have isolate_freepages remember where it left
      off last time, and continue at that point the next time it gets invoked
      for an order > 0 compaction.  This could cause compaction to fail if
      cc->free_pfn and cc->migrate_pfn are close together initially; in that
      case we restart from the end of the zone and try once more.
      
      Forced full (order == -1) compactions are left alone.
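
A rough sketch of the resume-then-retry behaviour described above, assuming a hypothetical struct sketch_cc and helper names; the real scanners, locking and pageblock alignment are left out.

```c
#include <stdbool.h>

/* Hypothetical, simplified stand-in for struct compact_control. */
struct sketch_cc {
	int order;                  /* -1 means a forced full compaction */
	unsigned long migrate_pfn;  /* migrate scanner, moves up */
	unsigned long free_pfn;     /* free scanner, moves down */
	bool wrapped;               /* free scanner already restarted once */
};

/* Pick the free scanner's starting point for a new compaction run. */
static void sketch_start(struct sketch_cc *cc, unsigned long zone_start,
			 unsigned long zone_end, unsigned long cached_free_pfn)
{
	cc->migrate_pfn = zone_start;
	cc->wrapped = false;
	/* Only order > 0 runs resume from the remembered position. */
	cc->free_pfn = (cc->order > 0) ? cached_free_pfn : zone_end;
}

/*
 * Called when the two scanners meet.  Returns true if the run should
 * continue after restarting the free scanner from the end of the zone,
 * false if compaction is finished.
 */
static bool sketch_scanners_met(struct sketch_cc *cc, unsigned long zone_end)
{
	if (cc->order > 0 && !cc->wrapped) {
		cc->free_pfn = zone_end;
		cc->wrapped = true;
		return true;
	}
	return false;
}
```

The single retry bounds the extra work: a run that started from a stale cached position gets exactly one more pass from the end of the zone.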
      
      [akpm@linux-foundation.org: checkpatch fixes]
      [akpm@linux-foundation.org: s/laste/last/, use 80 cols]
      Signed-off-by: Rik van Riel <riel@redhat.com>
      Reported-by: Jim Schutt <jaschut@sandia.gov>
      Tested-by: Jim Schutt <jaschut@sandia.gov>
      Cc: Minchan Kim <minchan.kim@gmail.com>
      Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Acked-by: Mel Gorman <mel@csn.ul.ie>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      7db8889a
  7. 12 Jul, 2012 1 commit
    • D
      mm, thp: abort compaction if migration page cannot be charged to memcg · 4bf2bba3
      David Rientjes committed
      If page migration cannot charge the temporary page to the memcg,
      migrate_pages() will return -ENOMEM.  This isn't considered in memory
      compaction however, and the loop continues to iterate over all
      pageblocks trying to isolate and migrate pages.  If a small number of
      very large memcgs happen to be oom, however, these attempts will mostly
      be futile leading to an enormous amount of cpu consumption due to the
      page migration failures.
      
      This patch will short circuit and fail memory compaction if
      migrate_pages() returns -ENOMEM.  COMPACT_PARTIAL is returned in case
      some migrations were successful so that the page allocator will retry.
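
The short circuit can be illustrated with a small model. Everything here is an assumption made for illustration: the array of per-pageblock results replaces real calls to migrate_pages(), and the enum only mirrors the COMPACT_PARTIAL and COMPACT_COMPLETE names used in these messages.

```c
#include <errno.h>
#include <stddef.h>

enum sketch_result { SKETCH_PARTIAL, SKETCH_COMPLETE };

/*
 * Walk hypothetical per-pageblock migration results (0 on success, a
 * negative errno on failure) and stop at the first -ENOMEM instead of
 * trying every remaining pageblock.
 */
static enum sketch_result sketch_compact(const int *migrate_ret, size_t n,
					 size_t *blocks_tried)
{
	size_t i;

	for (i = 0; i < n; i++) {
		if (migrate_ret[i] == -ENOMEM) {
			*blocks_tried = i + 1;
			return SKETCH_PARTIAL;  /* let the allocator retry */
		}
	}
	*blocks_tried = n;
	return SKETCH_COMPLETE;
}
```

Other failures such as -EAGAIN do not end the loop in this model; only the memcg charge failure is treated as a reason to stop early.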
      Signed-off-by: David Rientjes <rientjes@google.com>
      Acked-by: Mel Gorman <mgorman@suse.de>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      4bf2bba3
  8. 04 Jun, 2012 1 commit
  9. 30 May, 2012 3 commits
    • H
      mm/memcg: apply add/del_page to lruvec · fa9add64
      Hugh Dickins committed
      Take lruvec further: pass it instead of zone to add_page_to_lru_list() and
      del_page_from_lru_list(); and pagevec_lru_move_fn() pass lruvec down to
      its target functions.
      
      This cleanup eliminates a swathe of cruft in memcontrol.c, including
      mem_cgroup_lru_add_list(), mem_cgroup_lru_del_list() and
      mem_cgroup_lru_move_lists() - which never actually touched the lists.
      
      In their place, mem_cgroup_page_lruvec() to decide the lruvec, previously
      a side-effect of add, and mem_cgroup_update_lru_size() to maintain the
      lru_size stats.
      
      Whilst these are simplifications in their own right, the goal is to bring
      the evaluation of lruvec next to the spin_locking of the lrus, in
      preparation for a future patch.
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Acked-by: Michal Hocko <mhocko@suse.cz>
      Acked-by: Konstantin Khlebnikov <khlebnikov@openvz.org>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      fa9add64
    • K
      mm: remove lru type checks from __isolate_lru_page() · f3fd4a61
      Konstantin Khlebnikov committed
      After patch "mm: forbid lumpy-reclaim in shrink_active_list()" we can
      completely remove anon/file and active/inactive lru type filters from
      __isolate_lru_page(), because isolation for 0-order reclaim always
      isolates pages from the right lru list.  And page isolation for lumpy
      shrink_inactive_list() or memory compaction is allowed to isolate
      pages from all evictable lru lists anyway.
      Signed-off-by: Konstantin Khlebnikov <khlebnikov@openvz.org>
      Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Hugh Dickins <hughd@google.com>
      Acked-by: Michal Hocko <mhocko@suse.cz>
      Cc: Glauber Costa <glommer@parallels.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      f3fd4a61
    • B
      mm: compaction: handle incorrect MIGRATE_UNMOVABLE type pageblocks · 5ceb9ce6
      Bartlomiej Zolnierkiewicz committed
      When MIGRATE_UNMOVABLE pages are freed from a MIGRATE_UNMOVABLE type
      pageblock (and some MIGRATE_MOVABLE pages are left in it), waiting until
      an allocation takes ownership of the block may take too long.  The type
      of the pageblock remains unchanged so the pageblock cannot be used as a
      migration target during compaction.
      
      Fix it by:
      
      * Adding enum compact_mode (COMPACT_ASYNC_[MOVABLE,UNMOVABLE], and
        COMPACT_SYNC) and then converting sync field in struct compact_control
        to use it.
      
      * Adding nr_pageblocks_skipped field to struct compact_control and
        tracking how many destination pageblocks were of MIGRATE_UNMOVABLE type.
        If COMPACT_ASYNC_MOVABLE mode compaction ran fully in
        try_to_compact_pages() (COMPACT_COMPLETE) it implies that there is not a
        suitable page for allocation.  In this case, check whether there were
        enough MIGRATE_UNMOVABLE pageblocks to try a second pass in
        COMPACT_ASYNC_UNMOVABLE mode.
      
      * Scanning the MIGRATE_UNMOVABLE pageblocks (during COMPACT_SYNC and
        COMPACT_ASYNC_UNMOVABLE compaction modes) and building a count based on
        finding PageBuddy pages, page_count(page) == 0 or PageLRU pages.  If all
        pages within the MIGRATE_UNMOVABLE pageblock are in one of those three
        sets change the whole pageblock type to MIGRATE_MOVABLE.
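
The retyping rule in the last bullet can be sketched like this. The types below are hypothetical summaries of the three page states the text names; the sketch does not touch real struct page or pageblock flags.

```c
#include <stdbool.h>
#include <stddef.h>

enum sketch_migratetype { SKETCH_UNMOVABLE, SKETCH_MOVABLE };

/* Hypothetical per-page summary of the three states the text names. */
struct sketch_page {
	bool buddy;   /* PageBuddy: free in the buddy allocator */
	int  count;   /* page_count(page) */
	bool lru;     /* PageLRU: on an LRU list */
};

static bool sketch_page_ok(const struct sketch_page *p)
{
	return p->buddy || p->count == 0 || p->lru;
}

/* Retype the block only if every page falls in one of the three sets. */
static enum sketch_migratetype
sketch_rescue(const struct sketch_page *pages, size_t n,
	      enum sketch_migratetype type)
{
	size_t i, ok = 0;

	if (type != SKETCH_UNMOVABLE)
		return type;
	for (i = 0; i < n; i++)
		ok += sketch_page_ok(&pages[i]);
	return ok == n ? SKETCH_MOVABLE : SKETCH_UNMOVABLE;
}
```

A single pinned page that is neither free nor on the LRU keeps the whole block unmovable in this model, which matches the "all pages" condition in the text.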
      
      My particular test case (on an ARM EXYNOS4 device with 512 MiB, which means
      131072 standard 4KiB pages in 'Normal' zone) is to:
      
      - allocate 120000 pages for kernel's usage
      - free every second page (60000 pages) of memory just allocated
      - allocate and use 60000 pages from user space
      - free remaining 60000 pages of kernel memory
        (now we have fragmented memory occupied mostly by user space pages)
      - try to allocate 100 order-9 (2048 KiB) pages for kernel's usage
      
      The results:
      - with compaction disabled I get 11 successful allocations
      - with compaction enabled - 14 successful allocations
      - with this patch I'm able to get all 100 successful allocations
      
      NOTE: If we can make kswapd aware of order-0 requests during compaction,
      we can enhance kswapd by changing the mode to COMPACT_ASYNC_FULL
      (COMPACT_ASYNC_MOVABLE + COMPACT_ASYNC_UNMOVABLE).  Please see the
      following thread:
      
      	http://marc.info/?l=linux-mm&m=133552069417068&w=2
      
      [minchan@kernel.org: minor cleanups]
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Marek Szyprowski <m.szyprowski@samsung.com>
      Signed-off-by: Bartlomiej Zolnierkiewicz <b.zolnierkie@samsung.com>
      Signed-off-by: Kyungmin Park <kyungmin.park@samsung.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      5ceb9ce6
  10. 21 May, 2012 5 commits
  11. 22 Mar, 2012 4 commits
    • D
      mm: compaction: make compact_control order signed · aad6ec37
      Dan Carpenter committed
      "order" is -1 when compacting via /proc/sys/vm/compact_memory.  Making
      it unsigned causes a bug in __compact_pgdat() when we test:
      
      	if (cc->order < 0 || !compaction_deferred(zone, cc->order))
      		compact_zone(zone, cc);
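
A small standalone illustration of why the signedness matters. The two structs are hypothetical; they only contrast an unsigned and a signed order field. The unsigned case widens the stored value before comparing so that the result does not depend on compiler warnings about always-false comparisons.

```c
#include <stdbool.h>

/* Hypothetical structs contrasting the two field types. */
struct sketch_cc_unsigned { unsigned int order; };
struct sketch_cc_signed   { int order; };

/* Storing -1 in an unsigned field yields a large positive value. */
static bool sketch_forced_unsigned(int requested)
{
	struct sketch_cc_unsigned cc = { (unsigned int)requested };

	/* Widen first so the comparison reflects the stored value. */
	return (long long)cc.order < 0;
}

/* With a signed field the "order < 0" test works as intended. */
static bool sketch_forced_signed(int requested)
{
	struct sketch_cc_signed cc = { requested };

	return cc.order < 0;
}
```

With the unsigned field the forced-compaction test can never succeed, so the request falls through to the compaction_deferred() check and may be skipped.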
      
      [akpm@linux-foundation.org: make __compact_pgdat()'s comparison match other code sites]
      Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Minchan Kim <minchan@kernel.org>
      Reviewed-by: Rik van Riel <riel@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      aad6ec37
    • H
      compact_pgdat: workaround lockdep warning in kswapd · 8575ec29
      Hugh Dickins committed
      I get this lockdep warning from swapping load on linux-next, due to
      "vmscan: kswapd carefully call compaction".
      
      =================================
      [ INFO: inconsistent lock state ]
      3.3.0-rc2-next-20120201 #5 Not tainted
      ---------------------------------
      inconsistent {RECLAIM_FS-ON-W} -> {IN-RECLAIM_FS-W} usage.
      kswapd0/28 [HC0[0]:SC0[0]:HE1:SE1] takes:
       (pcpu_alloc_mutex){+.+.?.}, at: [<ffffffff810d6684>] pcpu_alloc+0x67/0x325
      {RECLAIM_FS-ON-W} state was registered at:
        [<ffffffff81099b75>] mark_held_locks+0xd7/0x103
        [<ffffffff8109a13c>] lockdep_trace_alloc+0x85/0x9e
        [<ffffffff810f6bdc>] __kmalloc+0x6c/0x14b
        [<ffffffff810d57fd>] pcpu_mem_zalloc+0x59/0x62
        [<ffffffff810d5d16>] pcpu_extend_area_map+0x26/0xb1
        [<ffffffff810d679f>] pcpu_alloc+0x182/0x325
        [<ffffffff810d694d>] __alloc_percpu+0xb/0xd
        [<ffffffff8142ebfd>] snmp_mib_init+0x1e/0x2e
        [<ffffffff8185cd8d>] ipv4_mib_init_net+0x7a/0x184
        [<ffffffff813dc963>] ops_init.clone.0+0x6b/0x73
        [<ffffffff813dc9cc>] register_pernet_operations+0x61/0xa0
        [<ffffffff813dca8e>] register_pernet_subsys+0x29/0x42
        [<ffffffff8185d044>] inet_init+0x1ad/0x252
        [<ffffffff810002e3>] do_one_initcall+0x7a/0x12f
        [<ffffffff81832bc5>] kernel_init+0x9d/0x11e
        [<ffffffff814e51e4>] kernel_thread_helper+0x4/0x10
      irq event stamp: 656613
      hardirqs last  enabled at (656613): [<ffffffff814e0ddc>] __mutex_unlock_slowpath+0x104/0x128
      hardirqs last disabled at (656612): [<ffffffff814e0d34>] __mutex_unlock_slowpath+0x5c/0x128
      softirqs last  enabled at (655568): [<ffffffff8105b4a5>] __do_softirq+0x120/0x136
      softirqs last disabled at (654757): [<ffffffff814e52dc>] call_softirq+0x1c/0x30
      
      other info that might help us debug this:
       Possible unsafe locking scenario:
      
             CPU0
             ----
        lock(pcpu_alloc_mutex);
        <Interrupt>
          lock(pcpu_alloc_mutex);
      
       *** DEADLOCK ***
      
      no locks held by kswapd0/28.
      
      stack backtrace:
      Pid: 28, comm: kswapd0 Not tainted 3.3.0-rc2-next-20120201 #5
      Call Trace:
       [<ffffffff810981f4>] print_usage_bug+0x1bf/0x1d0
       [<ffffffff81096c3e>] ? print_irq_inversion_bug+0x1d9/0x1d9
       [<ffffffff810982c0>] mark_lock_irq+0xbb/0x22e
       [<ffffffff810c5399>] ? free_hot_cold_page+0x13d/0x14f
       [<ffffffff81098684>] mark_lock+0x251/0x331
       [<ffffffff81098893>] mark_irqflags+0x12f/0x141
       [<ffffffff81098e32>] __lock_acquire+0x58d/0x753
       [<ffffffff810d6684>] ? pcpu_alloc+0x67/0x325
       [<ffffffff81099433>] lock_acquire+0x54/0x6a
       [<ffffffff810d6684>] ? pcpu_alloc+0x67/0x325
       [<ffffffff8107a5b8>] ? add_preempt_count+0xa9/0xae
       [<ffffffff814e0a21>] mutex_lock_nested+0x5e/0x315
       [<ffffffff810d6684>] ? pcpu_alloc+0x67/0x325
       [<ffffffff81098f81>] ? __lock_acquire+0x6dc/0x753
       [<ffffffff810c9fb0>] ? __pagevec_release+0x2c/0x2c
       [<ffffffff810d6684>] pcpu_alloc+0x67/0x325
       [<ffffffff810c9fb0>] ? __pagevec_release+0x2c/0x2c
       [<ffffffff810d694d>] __alloc_percpu+0xb/0xd
       [<ffffffff8106c35e>] schedule_on_each_cpu+0x23/0x110
       [<ffffffff810c9fcb>] lru_add_drain_all+0x10/0x12
       [<ffffffff810f126f>] __compact_pgdat+0x20/0x182
       [<ffffffff810f15c2>] compact_pgdat+0x27/0x29
       [<ffffffff810c306b>] ? zone_watermark_ok+0x1a/0x1c
       [<ffffffff810cdf6f>] balance_pgdat+0x732/0x751
       [<ffffffff810ce0ed>] kswapd+0x15f/0x178
       [<ffffffff810cdf8e>] ? balance_pgdat+0x751/0x751
       [<ffffffff8106fd11>] kthread+0x84/0x8c
       [<ffffffff814e51e4>] kernel_thread_helper+0x4/0x10
       [<ffffffff810787ed>] ? finish_task_switch+0x85/0xea
       [<ffffffff814e3861>] ? retint_restore_args+0xe/0xe
       [<ffffffff8106fc8d>] ? __init_kthread_worker+0x56/0x56
       [<ffffffff814e51e0>] ? gs_change+0xb/0xb
      
      The RECLAIM_FS notations indicate that it's doing the GFP_FS checking that
      Nick hacked into lockdep a while back: I think we're intended to read that
      "<Interrupt>" in the DEADLOCK scenario as "<Direct reclaim>".
      
      I'm hazy, I have not reached any conclusion as to whether it's right to
      complain or not; but I believe it's uneasy about kswapd now doing the
      mutex_lock(&pcpu_alloc_mutex) which lru_add_drain_all() entails.  Nor have
      I reached any conclusion as to whether it's important for kswapd to do
      that draining or not.
      
      But so as not to get blocked on this, with lockdep disabled from giving
      further reports, here's a patch which removes the lru_add_drain_all() from
      kswapd's callpath (and calls it only once from compact_nodes(), instead of
      once per node).
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Cc: Rik van Riel <riel@redhat.com>
      Acked-by: Mel Gorman <mel@csn.ul.ie>
      Cc: Tejun Heo <tj@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      8575ec29
    • R
      vmscan: only defer compaction for failed order and higher · aff62249
      Rik van Riel committed
      Currently a failed order-9 (transparent hugepage) compaction can lead to
      memory compaction being temporarily disabled for a memory zone, even if
      we only need compaction for an order-2 allocation, e.g. for jumbo frames
      networking.
      
      The fix is relatively straightforward: keep track of the highest order at
      which compaction is succeeding, and only defer compaction for orders at
      which compaction is failing.
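
A minimal sketch of per-order deferral, under stated assumptions: struct sketch_zone and the helpers are hypothetical, compact_order_failed is the field name mentioned elsewhere in this log, and the back-off counter that decides how long deferral lasts is reduced to a single flag.

```c
#include <stdbool.h>

/* Hypothetical, simplified stand-in for the relevant struct zone fields. */
struct sketch_zone {
	int  compact_order_failed;  /* lowest order at which compaction failed */
	bool deferring;             /* stands in for the real back-off counter */
};

/* Record a failure: remember the lowest failing order. */
static void sketch_defer(struct sketch_zone *z, int order)
{
	z->deferring = true;
	if (order < z->compact_order_failed)
		z->compact_order_failed = order;
}

/* Orders below the failing order are never deferred. */
static bool sketch_deferred(const struct sketch_zone *z, int order)
{
	if (order < z->compact_order_failed)
		return false;
	return z->deferring;
}

/* Record a success: orders up to this one are known to work. */
static void sketch_success(struct sketch_zone *z, int order)
{
	if (order >= z->compact_order_failed)
		z->compact_order_failed = order + 1;
}
```

In this model a failed order-9 attempt leaves an order-2 request free to compact, which is the case the message is concerned with.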
      Signed-off-by: Rik van Riel <riel@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Acked-by: Mel Gorman <mel@csn.ul.ie>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Minchan Kim <minchan.kim@gmail.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Hillf Danton <dhillf@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      aff62249
    • R
      vmscan: kswapd carefully call compaction · 7be62de9
      Rik van Riel committed
      With CONFIG_COMPACTION enabled, kswapd does not try to free contiguous
      free pages, even when it is woken for a higher order request.

      This could be bad for e.g. jumbo frame network allocations, which are
      done from interrupt context and cannot compact memory themselves.
      Allocation failure rates in the network receive path have been observed
      to be higher than before in kernels with compaction enabled.
      
      Teach kswapd to defragment the memory zones in a node, but only if
      required and compaction is not deferred in a zone.
      
      [akpm@linux-foundation.org: reduce scope of zones_need_compaction]
      Signed-off-by: Rik van Riel <riel@redhat.com>
      Acked-by: Mel Gorman <mel@csn.ul.ie>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Minchan Kim <minchan.kim@gmail.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Hillf Danton <dhillf@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      7be62de9
  12. 09 Feb, 2012 1 commit
    • M
      mm: compaction: check for overlapping nodes during isolation for migration · dc908600
      Mel Gorman committed
      When isolating pages for migration, migration starts at the start of a
      zone while the free scanner starts at the end of the zone.  Migration
      avoids entering a new zone by never going beyond the free scanner.
      
      Unfortunately, in very rare cases nodes can overlap.  When this happens,
      migration isolates pages without the LRU lock held, corrupting lists
      which will trigger errors in reclaim or during page free such as in the
      following oops
      
        BUG: unable to handle kernel NULL pointer dereference at 0000000000000008
        IP: [<ffffffff810f795c>] free_pcppages_bulk+0xcc/0x450
        PGD 1dda554067 PUD 1e1cb58067 PMD 0
        Oops: 0000 [#1] SMP
        CPU 37
        Pid: 17088, comm: memcg_process_s Tainted: G            X
        RIP: free_pcppages_bulk+0xcc/0x450
        Process memcg_process_s (pid: 17088, threadinfo ffff881c2926e000, task ffff881c2926c0c0)
        Call Trace:
          free_hot_cold_page+0x17e/0x1f0
          __pagevec_free+0x90/0xb0
          release_pages+0x22a/0x260
          pagevec_lru_move_fn+0xf3/0x110
          putback_lru_page+0x66/0xe0
          unmap_and_move+0x156/0x180
          migrate_pages+0x9e/0x1b0
          compact_zone+0x1f3/0x2f0
          compact_zone_order+0xa2/0xe0
          try_to_compact_pages+0xdf/0x110
          __alloc_pages_direct_compact+0xee/0x1c0
          __alloc_pages_slowpath+0x370/0x830
          __alloc_pages_nodemask+0x1b1/0x1c0
          alloc_pages_vma+0x9b/0x160
          do_huge_pmd_anonymous_page+0x160/0x270
          do_page_fault+0x207/0x4c0
          page_fault+0x25/0x30
      
      The "X" in the taint flag means that external modules were loaded but
      is unrelated to the bug triggering.  The real problem was that the PFN
      layout looks like this
      
        Zone PFN ranges:
          DMA      0x00000010 -> 0x00001000
          DMA32    0x00001000 -> 0x00100000
          Normal   0x00100000 -> 0x01e80000
        Movable zone start PFN for each node
        early_node_map[14] active PFN ranges
            0: 0x00000010 -> 0x0000009b
            0: 0x00000100 -> 0x0007a1ec
            0: 0x0007a354 -> 0x0007a379
            0: 0x0007f7ff -> 0x0007f800
            0: 0x00100000 -> 0x00680000
            1: 0x00680000 -> 0x00e80000
            0: 0x00e80000 -> 0x01080000
            1: 0x01080000 -> 0x01280000
            0: 0x01280000 -> 0x01480000
            1: 0x01480000 -> 0x01680000
            0: 0x01680000 -> 0x01880000
            1: 0x01880000 -> 0x01a80000
            0: 0x01a80000 -> 0x01c80000
            1: 0x01c80000 -> 0x01e80000
      
      The fix is straight-forward.  isolate_migratepages() has to make a
      similar check to isolate_freepages() to ensure that it never isolates
      pages from a zone it does not hold the LRU lock for.
      
      This was discovered in a 3.0-based kernel but it affects 3.1.x, 3.2.x
      and current mainline.
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Acked-by: Michal Nazarewicz <mina86@mina86.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      dc908600
  13. 04 Feb, 2012 1 commit
    • M
      mm: compaction: check pfn_valid when entering a new MAX_ORDER_NR_PAGES block... · 0bf380bc
      Mel Gorman committed
      mm: compaction: check pfn_valid when entering a new MAX_ORDER_NR_PAGES block during isolation for migration
      
      When isolating for migration, migration starts at the start of a zone
      which is not necessarily pageblock aligned.  Further, it stops isolating
      when COMPACT_CLUSTER_MAX pages are isolated so migrate_pfn is generally
      not aligned.  This allows isolate_migratepages() to call pfn_to_page() on
      an invalid PFN which can result in a crash.  This was originally reported
      against a 3.0-based kernel with the following trace in a crash dump.
      
      PID: 9902   TASK: d47aecd0  CPU: 0   COMMAND: "memcg_process_s"
       #0 [d72d3ad0] crash_kexec at c028cfdb
       #1 [d72d3b24] oops_end at c05c5322
       #2 [d72d3b38] __bad_area_nosemaphore at c0227e60
       #3 [d72d3bec] bad_area at c0227fb6
       #4 [d72d3c00] do_page_fault at c05c72ec
       #5 [d72d3c80] error_code (via page_fault) at c05c47a4
          EAX: 00000000  EBX: 000c0000  ECX: 00000001  EDX: 00000807  EBP: 000c0000
          DS:  007b      ESI: 00000001  ES:  007b      EDI: f3000a80  GS:  6f50
          CS:  0060      EIP: c030b15a  ERR: ffffffff  EFLAGS: 00010002
       #6 [d72d3cb4] isolate_migratepages at c030b15a
       #7 [d72d3d14] zone_watermark_ok at c02d26cb
       #8 [d72d3d2c] compact_zone at c030b8de
       #9 [d72d3d68] compact_zone_order at c030bba1
      #10 [d72d3db4] try_to_compact_pages at c030bc84
      #11 [d72d3ddc] __alloc_pages_direct_compact at c02d61e7
      #12 [d72d3e08] __alloc_pages_slowpath at c02d66c7
      #13 [d72d3e78] __alloc_pages_nodemask at c02d6a97
      #14 [d72d3eb8] alloc_pages_vma at c030a845
      #15 [d72d3ed4] do_huge_pmd_anonymous_page at c03178eb
      #16 [d72d3f00] handle_mm_fault at c02f36c6
      #17 [d72d3f30] do_page_fault at c05c70ed
      #18 [d72d3fb0] error_code (via page_fault) at c05c47a4
          EAX: b71ff000  EBX: 00000001  ECX: 00001600  EDX: 00000431
          DS:  007b      ESI: 08048950  ES:  007b      EDI: bfaa3788
          SS:  007b      ESP: bfaa36e0  EBP: bfaa3828  GS:  6f50
          CS:  0073      EIP: 080487c8  ERR: ffffffff  EFLAGS: 00010202
      
      It was also reported by Herbert van den Bergh against a 3.1-based kernel
      with the following snippet from the console log.
      
      BUG: unable to handle kernel paging request at 01c00008
      IP: [<c0522399>] isolate_migratepages+0x119/0x390
      *pdpt = 000000002f7ce001 *pde = 0000000000000000
      
      It is expected that it also affects 3.2.x and current mainline.
      
      The problem is that pfn_valid is only called on the first PFN being
      checked and that PFN is not necessarily aligned.  Let's say we have a
      case like this
      
      H = MAX_ORDER_NR_PAGES boundary
      | = pageblock boundary
      m = cc->migrate_pfn
      f = cc->free_pfn
      o = memory hole
      
      H------|------H------|----m-Hoooooo|ooooooH-f----|------H
      
      The migrate_pfn is just below a memory hole and the free scanner is beyond
      the hole.  When isolate_migratepages starts, it scans from migrate_pfn to
      migrate_pfn+pageblock_nr_pages which is now in a memory hole.  It checks
      pfn_valid() on the first PFN but then scans into the hole where there are
      not necessarily valid struct pages.
      
      This patch ensures that isolate_migratepages calls pfn_valid when
      necessary.
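
The boundary check can be sketched in userspace as follows. The memory map, the hole and the block size of 1024 pages are assumptions for illustration; sketch_pfn_valid() merely plays the role of pfn_valid(), and the first pfn is assumed to have been validated by the caller, as in the description above.

```c
#include <stdbool.h>

#define SKETCH_MAX_ORDER_NR_PAGES 1024UL  /* assumed value for illustration */

/* Hypothetical memory map with one hole of invalid pfns. */
struct sketch_memmap {
	unsigned long hole_start;
	unsigned long hole_end;      /* hole is [hole_start, hole_end) */
	unsigned long valid_checks;  /* how often validity was queried */
};

static bool sketch_pfn_valid(struct sketch_memmap *m, unsigned long pfn)
{
	m->valid_checks++;
	return pfn < m->hole_start || pfn >= m->hole_end;
}

/*
 * Count the pfns in [low, end) that are safe to touch.  Validity is
 * re-checked whenever the scan enters a new MAX_ORDER_NR_PAGES block,
 * and an invalid block is skipped whole.
 */
static unsigned long sketch_scan(struct sketch_memmap *m,
				 unsigned long low, unsigned long end)
{
	unsigned long pfn, touched = 0;

	for (pfn = low; pfn < end; pfn++) {
		if ((pfn & (SKETCH_MAX_ORDER_NR_PAGES - 1)) == 0 &&
		    !sketch_pfn_valid(m, pfn)) {
			pfn += SKETCH_MAX_ORDER_NR_PAGES - 1;
			continue;
		}
		touched++;
	}
	return touched;
}
```

Checking only at block boundaries keeps the cost low: a scan that starts unaligned pays for one validity check per block it enters rather than one per pfn.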
      Reported-by: Herbert van den Bergh <herbert.van.den.bergh@oracle.com>
      Tested-by: Herbert van den Bergh <herbert.van.den.bergh@oracle.com>
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Acked-by: Michal Nazarewicz <mina86@mina86.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      0bf380bc
  14. 13 Jan, 2012 4 commits
    • M
      mm: compaction: introduce sync-light migration for use by compaction · a6bc32b8
      Mel Gorman committed
      This patch adds a lightweight sync migration mode, MIGRATE_SYNC_LIGHT,
      that avoids writing back pages to backing storage.  Async compaction
      maps to MIGRATE_ASYNC while sync compaction maps to MIGRATE_SYNC_LIGHT.
      For other migrate_pages users such as memory hotplug, MIGRATE_SYNC is
      used.
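
The mapping in this paragraph can be written down as a tiny sketch. The three mode names come from the text; the enum, its values and the two helpers are illustrative only.

```c
#include <stdbool.h>

/* Mode names follow the text; the enum itself is illustrative. */
enum sketch_migrate_mode {
	SKETCH_MIGRATE_ASYNC,
	SKETCH_MIGRATE_SYNC_LIGHT,
	SKETCH_MIGRATE_SYNC,
};

/* Compaction never asks for full sync migration. */
static enum sketch_migrate_mode sketch_compaction_mode(bool sync)
{
	return sync ? SKETCH_MIGRATE_SYNC_LIGHT : SKETCH_MIGRATE_ASYNC;
}

/* Only full sync migration may write a dirty page back to storage. */
static bool sketch_may_writeback(enum sketch_migrate_mode mode)
{
	return mode == SKETCH_MIGRATE_SYNC;
}
```

Callers such as memory hotplug would pass SKETCH_MIGRATE_SYNC directly in this model, since they can afford to wait for writeback.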
      
      This avoids sync compaction stalling for an excessive length of time,
      particularly when copying files to a USB stick where there might be a
      large number of dirty pages backed by a filesystem that does not support
      ->writepages.
      
      [aarcange@redhat.com: This patch is heavily based on Andrea's work]
      [akpm@linux-foundation.org: fix fs/nfs/write.c build]
      [akpm@linux-foundation.org: fix fs/btrfs/disk-io.c build]
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Reviewed-by: Rik van Riel <riel@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Minchan Kim <minchan.kim@gmail.com>
      Cc: Dave Jones <davej@redhat.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Andy Isaacson <adi@hexapodia.org>
      Cc: Nai Xia <nai.xia@gmail.com>
      Cc: Johannes Weiner <jweiner@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      a6bc32b8
    • M
      mm: compaction: make isolate_lru_page() filter-aware again · c8244935
      Mel Gorman committed
      Commit 39deaf85 ("mm: compaction: make isolate_lru_page() filter-aware")
      noted that compaction does not migrate dirty or writeback pages and that
      it was meaningless to pick the page and re-add it to the LRU list.  This
      had to be partially reverted because some dirty pages can be migrated by
      compaction without blocking.
      
      This patch updates "mm: compaction: make isolate_lru_page" by skipping
      over pages that migration has no possibility of migrating to minimise LRU
      disruption.
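
      A minimal sketch of such a filter, assuming the rule summarised in the
      series description elsewhere in this log (pages under writeback are
      skipped, as are dirty pages whose mapping has no ->migratepage callback,
      while dirty pages without a mapping stay eligible).  The toy_page struct
      and function name are hypothetical; the real check operates on struct
      page and its address_space.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy page state; the real struct page is far richer. */
struct toy_page {
    bool writeback;        /* currently under writeback */
    bool dirty;            /* has changes not yet written out */
    bool has_mapping;      /* belongs to an address space */
    bool has_migratepage;  /* that mapping provides ->migratepage */
};

/*
 * Return true if the page is worth isolating for non-blocking migration.
 * Pages that migration could not move anyway are left on the LRU so that
 * isolating and re-adding them does not disturb LRU ordering.
 */
static bool toy_can_isolate_for_migration(const struct toy_page *p)
{
    assert(p);

    /* All a migrator could do with a writeback page is wait for it. */
    if (p->writeback)
        return false;

    /* A dirty page with a mapping needs a ->migratepage callback. */
    if (p->dirty && p->has_mapping && !p->has_migratepage)
        return false;

    return true;
}
```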
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Reviewed-by: Rik van Riel <riel@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Reviewed-by: Minchan Kim <minchan@kernel.org>
      Cc: Dave Jones <davej@redhat.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Andy Isaacson <adi@hexapodia.org>
      Cc: Nai Xia <nai.xia@gmail.com>
      Cc: Johannes Weiner <jweiner@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      c8244935
    • M
      mm: compaction: use synchronous compaction for /proc/sys/vm/compact_memory · b16d3d5a
      Mel Gorman committed
      When asynchronous compaction was introduced, the
      /proc/sys/vm/compact_memory handler should have been updated to always use
      synchronous compaction.  This did not happen so this patch addresses it.
      
      The assumption is if a user writes to /proc/sys/vm/compact_memory, they
      are willing for that process to stall.
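
      For reference, triggering compaction from userspace amounts to writing
      1 to that file.  The helper below is a hypothetical illustration (the
      function name is made up, and the path is a parameter so it can be
      tried against an ordinary file); on a real system it needs root and a
      kernel built with CONFIG_COMPACTION, and with this patch the write may
      stall while compaction runs synchronously.

```c
#include <assert.h>
#include <stdio.h>

/*
 * Write "1" to the given sysctl file to request compaction.
 * Returns 0 on success and -1 if the file cannot be opened or written.
 */
static int toy_trigger_compaction(const char *path)
{
    FILE *f;
    int ok;

    assert(path);

    f = fopen(path, "w");
    if (!f)
        return -1;

    ok = fputs("1\n", f) >= 0;
    /* fclose() flushes the buffer, so a failed write shows up here too. */
    if (fclose(f) != 0)
        ok = 0;

    return ok ? 0 : -1;
}
```

      Typical use would be toy_trigger_compaction("/proc/sys/vm/compact_memory").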
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Reviewed-by: Andrea Arcangeli <aarcange@redhat.com>
      Reviewed-by: Rik van Riel <riel@redhat.com>
      Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
      Cc: Dave Jones <davej@redhat.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Andy Isaacson <adi@hexapodia.org>
      Cc: Nai Xia <nai.xia@gmail.com>
      Cc: Johannes Weiner <jweiner@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      b16d3d5a
    • M
      mm: compaction: allow compaction to isolate dirty pages · a77ebd33
      Mel Gorman committed
      Short summary: There are severe stalls when a USB stick using VFAT is
      used with THP enabled that are reduced by this series.  If you are
      experiencing this problem, please test and report back.  Considering I
      have seen complaints from openSUSE and Fedora users on this as well as a
      few private mails, I'm guessing it's a widespread issue.  This is a new
      type of USB-related stall because it is due to synchronous compaction
      writing, whereas in the past the big problem was dirty pages reaching
      the end of the LRU and being written by reclaim.
      
      Am cc'ing Andrew this time and this series would replace
      mm-do-not-stall-in-synchronous-compaction-for-thp-allocations.patch.
      I'm also cc'ing Dave Jones as he might have merged that patch to Fedora
      for wider testing and ideally it would be reverted and replaced by this
      series.
      
      That said, the later patches could really do with some review.  If this
      series is not the answer then a new direction needs to be discussed
      because as it is, the stalls are unacceptable as the results in this
      leader show.
      
      For testers that try backporting this to 3.1, it won't work because
      there is a non-obvious dependency on not writing back pages in direct
      reclaim so you need those patches too.
      
      Changelog since V5
      o Rebase to 3.2-rc5
      o Tidy up the changelogs a bit
      
      Changelog since V4
      o Added reviewed-bys, credited Andrea properly for sync-light
      o Allow dirty pages without mappings to be considered for migration
      o Bound the number of pages freed for compaction
      o Isolate PageReclaim pages on their own LRU list
      
      This is against 3.2-rc5 and follows on from discussions on "mm: Do
      not stall in synchronous compaction for THP allocations" and "[RFC
      PATCH 0/5] Reduce compaction-related stalls". Initially, the proposed
      patch eliminated stalls due to compaction which sometimes resulted in
      user-visible interactivity problems on browsers by simply never using
      sync compaction. The downside was that THP allocation success rates
      were lower because dirty pages were not being migrated as reported by
      Andrea. His approach to fixing this was nacked on the grounds that
      it reverted fixes from Rik, already merged, that reduced the number of
      pages reclaimed, as excessive reclaim severely impacted his workload's
      performance.
      
      This series attempts to reconcile the requirements of maximising THP
      usage, without stalling in a user-visible fashion due to compaction
      or cheating by reclaiming an excessive number of pages.
      
      Patch 1 partially reverts commit 39deaf85 to allow migration to isolate
      	dirty pages. This is because migration can move some dirty
      	pages without blocking.
      
      Patch 2 notes that the /proc/sys/vm/compact_memory handler is not using
      	synchronous compaction when it should be. This is unrelated
      	to the reported stalls but is worth fixing.
      
      Patch 3 checks if we isolated a compound page during lumpy scan and
      	accounts for it properly. For the most part, this affects
      	tracing so it's unrelated to the stalls but worth fixing.
      
      Patch 4 notes that it is possible to abort reclaim early for compaction
      	and return 0 to the page allocator potentially entering the
      	"may oom" path. This has not been observed in practice but
      	the rest of the series potentially makes it easier to happen.
      
      Patch 5 adds a sync parameter to the migratepage callback and gives
      	the callback responsibility for migrating the page without
      	blocking if sync==false. For example, fallback_migrate_page
      	will not call writepage if sync==false. This increases the
      	number of pages that can be handled by asynchronous compaction
      	thereby reducing stalls.
      
      Patch 6 restores filter-awareness to isolate_lru_page for migration.
      	In practice, it means that pages under writeback and pages
      	without a ->migratepage callback will not be isolated
      	for migration.
      
      Patch 7 avoids calling direct reclaim if compaction is deferred but
      	makes sure that compaction is only deferred if sync
      	compaction was used.
      
      Patch 8 introduces a sync-light migration mechanism that sync compaction
      	uses. The objective is to allow some stalls but to not call
      	->writepage which can lead to significant user-visible stalls.
      
      Patch 9 notes that while we want to abort reclaim ASAP to allow
      	compaction to go ahead, we leave a very small window of
      	opportunity for compaction to run. This patch allows more pages
      	to be freed by reclaim but bounds the number to a reasonable
      	level based on the high watermark on each zone.
      
      Patch 10 allows slabs to be shrunk even after compaction_ready() is
      	true for one zone. This is to avoid a problem whereby a single
      	small zone can abort reclaim even though no pages have been
      	reclaimed and no suitably large zone is in a usable state.
      
      Patch 11 fixes a problem with the rate of page scanning. As reclaim is
      	rarely stalling on pages under writeback it means that scan
      	rates are very high. This is particularly true for direct
      	reclaim which is not calling writepage. The vmstat figures
      	implied that much of this was busy work with PageReclaim pages
      	marked for immediate reclaim. This patch is a prototype that
      	moves these pages to their own LRU list.
      
      This has been tested and other than 2 USB keys getting trashed,
      nothing horrible fell out. That said, I am a bit unhappy with the
      rescue logic in patch 11 but did not find a better way around it. It
      does significantly reduce scan rates and System CPU time indicating
      it is the right direction to take.
      
      What is of critical importance is that stalls due to compaction
      are massively reduced even though sync compaction was still
      allowed. Testing from people complaining about stalls copying to USBs
      with THP enabled are particularly welcome.
      
      The following tests all involve THP usage and USB keys in some
      way. Each test follows this type of pattern
      
      1. Read from some fast storage, be it raw device or file. Each time
         the copy finishes, start again until the test ends
      2. Write a large file to a filesystem on a USB stick. Each time the copy
         finishes, start again until the test ends
      3. When memory is low, start an alloc process that creates a mapping
         the size of physical memory to stress THP allocation. This is the
         "real" part of the test and the part that is meant to trigger
         stalls when THP is enabled. Copying continues in the background.
      4. Record the CPU usage and time to execute of the alloc process
      5. Record the number of THP allocs and fallbacks as well as the number of THP
         pages in use at the end of the test just before alloc exited
      6. Run the test 5 times to get an idea of variability
      7. Between each run, sync is run and caches dropped and the test
         waits until nr_dirty is a small number to avoid interference
         or caching between iterations that would skew the figures.
      
      The individual tests were then
      
      writebackCPDeviceBasevfat
      	Disable THP, read from a raw device (sda), vfat on USB stick
      writebackCPDeviceBaseext4
      	Disable THP, read from a raw device (sda), ext4 on USB stick
      writebackCPDevicevfat
      	THP enabled, read from a raw device (sda), vfat on USB stick
      writebackCPDeviceext4
      	THP enabled, read from a raw device (sda), ext4 on USB stick
      writebackCPFilevfat
      	THP enabled, read from a file on fast storage and USB, both vfat
      writebackCPFileext4
      	THP enabled, read from a file on fast storage and USB, both ext4
      
      The kernels tested were
      
      3.1		3.1
      vanilla		3.2-rc5
      freemore	Patches 1-10
      immediate	Patches 1-11
      andrea		The 8 patches Andrea posted as a basis of comparison
      
      The results are very long unfortunately. I'll start with the case
      where we are not using THP at all
      
      writebackCPDeviceBasevfat
                         3.1.0-vanilla         rc5-vanilla       freemore-v6r1        isolate-v6r1         andrea-v2r1
      System Time         1.28 (    0.00%)   54.49 (-4143.46%)   48.63 (-3687.69%)    4.69 ( -265.11%)   51.88 (-3940.81%)
      +/-                 0.06 (    0.00%)    2.45 (-4305.55%)    4.75 (-8430.57%)    7.46 (-13282.76%)    4.76 (-8440.70%)
      User Time           0.09 (    0.00%)    0.05 (   40.91%)    0.06 (   29.55%)    0.07 (   15.91%)    0.06 (   27.27%)
      +/-                 0.02 (    0.00%)    0.01 (   45.39%)    0.02 (   25.07%)    0.00 (   77.06%)    0.01 (   52.24%)
      Elapsed Time      110.27 (    0.00%)   56.38 (   48.87%)   49.95 (   54.70%)   11.77 (   89.33%)   53.43 (   51.54%)
      +/-                 7.33 (    0.00%)    3.77 (   48.61%)    4.94 (   32.63%)    6.71 (    8.50%)    4.76 (   35.03%)
      THP Active          0.00 (    0.00%)    0.00 (    0.00%)    0.00 (    0.00%)    0.00 (    0.00%)    0.00 (    0.00%)
      +/-                 0.00 (    0.00%)    0.00 (    0.00%)    0.00 (    0.00%)    0.00 (    0.00%)    0.00 (    0.00%)
      Fault Alloc         0.00 (    0.00%)    0.00 (    0.00%)    0.00 (    0.00%)    0.00 (    0.00%)    0.00 (    0.00%)
      +/-                 0.00 (    0.00%)    0.00 (    0.00%)    0.00 (    0.00%)    0.00 (    0.00%)    0.00 (    0.00%)
      Fault Fallback      0.00 (    0.00%)    0.00 (    0.00%)    0.00 (    0.00%)    0.00 (    0.00%)    0.00 (    0.00%)
      +/-                 0.00 (    0.00%)    0.00 (    0.00%)    0.00 (    0.00%)    0.00 (    0.00%)    0.00 (    0.00%)
      
      The THP figures are obviously all 0 because THP was disabled. The
      main thing to watch is the elapsed times and how they compare to
      times when THP is enabled later. It's also important to note that
      elapsed time is improved by this series as System CPU time is much
      reduced.
      
      writebackCPDevicevfat
      
                         3.1.0-vanilla         rc5-vanilla       freemore-v6r1        isolate-v6r1         andrea-v2r1
      System Time         1.22 (    0.00%)   13.89 (-1040.72%)   46.40 (-3709.20%)    4.44 ( -264.37%)   47.37 (-3789.33%)
      +/-                 0.06 (    0.00%)   22.82 (-37635.56%)    3.84 (-6249.44%)    6.48 (-10618.92%)    6.60 (-10818.53%)
      User Time           0.06 (    0.00%)    0.06 (   -6.90%)    0.05 (   17.24%)    0.05 (   13.79%)    0.04 (   31.03%)
      +/-                 0.01 (    0.00%)    0.01 (   33.33%)    0.01 (   33.33%)    0.01 (   39.14%)    0.01 (   25.46%)
      Elapsed Time     10445.54 (    0.00%) 2249.92 (   78.46%)   70.06 (   99.33%)   16.59 (   99.84%)  472.43 (   95.48%)
      +/-               643.98 (    0.00%)  811.62 (  -26.03%)   10.02 (   98.44%)    7.03 (   98.91%)   59.99 (   90.68%)
      THP Active         15.60 (    0.00%)   35.20 (  225.64%)   65.00 (  416.67%)   70.80 (  453.85%)   62.20 (  398.72%)
      +/-                18.48 (    0.00%)   51.29 (  277.59%)   15.99 (   86.52%)   37.91 (  205.18%)   22.02 (  119.18%)
      Fault Alloc       121.80 (    0.00%)   76.60 (   62.89%)  155.40 (  127.59%)  181.20 (  148.77%)  286.60 (  235.30%)
      +/-                73.51 (    0.00%)   61.11 (   83.12%)   34.89 (   47.46%)   31.88 (   43.36%)   68.13 (   92.68%)
      Fault Fallback    881.20 (    0.00%)  926.60 (   -5.15%)  847.60 (    3.81%)  822.00 (    6.72%)  716.60 (   18.68%)
      +/-                73.51 (    0.00%)   61.26 (   16.67%)   34.89 (   52.54%)   31.65 (   56.94%)   67.75 (    7.84%)
      MMTests Statistics: duration
      User/Sys Time Running Test (seconds)       3540.88   1945.37    716.04     64.97   1937.03
      Total Elapsed Time (seconds)              52417.33  11425.90    501.02    230.95   2520.28
      
      The first thing to note is the "Elapsed Time" for the vanilla kernels
      of 2249 seconds versus 56 with THP disabled which might explain the
      reports of USB stalls with THP enabled. Applying the patches brings
      performance in line with THP-disabled performance while isolating
      pages for immediate reclaim from the LRU cuts down System CPU time.
      
      The "Fault Alloc" success rate figures are also improved. The vanilla
      kernel only managed to allocate 76.6 pages on average over the course
      of 5 iterations whereas applying the series allocated 181.20 on
      average, albeit well within variance. It's worth noting that
      applying the series at least decreases the amount of variance, which
      implies an improvement.
      
      Andrea's series had a higher success rate for THP allocations but
      at a severe cost to elapsed time which is still better than vanilla
      but still much worse than disabling THP altogether. One can bring my
      series close to Andrea's by removing this check
      
              /*
               * If compaction is deferred for high-order allocations, it is because
               * sync compaction recently failed. If this is the case and the caller
               * has requested the system not be heavily disrupted, fail the
               * allocation now instead of entering direct reclaim
               */
              if (deferred_compaction && (gfp_mask & __GFP_NO_KSWAPD))
                      goto nopage;
      
      I didn't include a patch that removed the above check because hurting
      overall performance to improve the THP figure is not what the average
      user wants. It's something to consider though if someone really wants
      to maximise THP usage no matter what it does to the workload initially.
      
      This is a summary of vmstat figures from the same test.
      
                                             3.1.0-vanilla rc5-vanilla freemore-v6r1 isolate-v6r1 andrea-v2r1
      Page Ins                                  3257266139  1111844061    17263623    10901575   161423219
      Page Outs                                   81054922    30364312     3626530     3657687     8753730
      Swap Ins                                        3294        2851        6560        4964        4592
      Swap Outs                                     390073      528094      620197      790912      698285
      Direct pages scanned                      1077581700  3024951463  1764930052   115140570  5901188831
      Kswapd pages scanned                        34826043     7112868     2131265     1686942     1893966
      Kswapd pages reclaimed                      28950067     4911036     1246044      966475     1497726
      Direct pages reclaimed                     805148398   280167837     3623473     2215044    40809360
      Kswapd efficiency                                83%         69%         58%         57%         79%
      Kswapd velocity                              664.399     622.521    4253.852    7304.360     751.490
      Direct efficiency                                74%          9%          0%          1%          0%
      Direct velocity                            20557.737  264745.137 3522673.849  498551.938 2341481.435
      Percentage direct scans                          96%         99%         99%         98%         99%
      Page writes by reclaim                        722646      529174      620319      791018      699198
      Page writes file                              332573        1080         122         106         913
      Page writes anon                              390073      528094      620197      790912      698285
      Page reclaim immediate                             0  2552514720  1635858848   111281140  5478375032
      Page rescued immediate                             0           0           0       87848           0
      Slabs scanned                                  23552       23552        9216        8192        9216
      Direct inode steals                              231           0           0           0           0
      Kswapd inode steals                                0           0           0           0           0
      Kswapd skipped wait                            28076         786           0          61           6
      THP fault alloc                                  609         383         753         906        1433
      THP collapse alloc                                12           6           0           0           6
      THP splits                                       536         211         456         593        1136
      THP fault fallback                              4406        4633        4263        4110        3583
      THP collapse fail                                120         127           0           0           4
      Compaction stalls                               1810         728         623         779        3200
      Compaction success                               196          53          60          80         123
      Compaction failures                             1614         675         563         699        3077
      Compaction pages moved                        193158       53545      243185      333457      226688
      Compaction move failure                         9952        9396       16424       23676       45070
      
      The main things to look at are
      
      1. Page In/out figures are much reduced by the series.
      
      2. Direct page scanning is incredibly high (264745.137 pages scanned
         per second on the vanilla kernel) but isolating PageReclaim pages
         on their own list reduces the number of pages scanned significantly.
      
      3. The fact that "Page rescued immediate" is a positive number implies
         that we sometimes race removing pages from the LRU_IMMEDIATE list
         that need to be put back on a normal LRU but it happens only for
         roughly 0.08% of the pages marked for immediate reclaim.
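
      The derived rows in the vmstat summary appear to follow from the raw
      counters: velocity matches pages scanned divided by "Total Elapsed Time",
      and efficiency matches pages reclaimed as a percentage of pages scanned.
      That relationship is inferred from the numbers rather than stated in the
      changelog, and the helper names below are hypothetical.

```c
#include <assert.h>

/* Pages scanned per second of total elapsed test time. */
static double toy_velocity(double pages_scanned, double elapsed_seconds)
{
    assert(elapsed_seconds > 0.0);
    return pages_scanned / elapsed_seconds;
}

/* "part" expressed as a percentage of "whole". */
static double toy_percent(double part, double whole)
{
    assert(whole > 0.0);
    return 100.0 * part / whole;
}
```

      For rc5-vanilla, toy_velocity(3024951463, 11425.90) lands on the reported
      direct velocity of about 264745 pages per second, and
      toy_percent(280167837, 3024951463) gives the reported direct efficiency
      of about 9%.  For isolate-v6r1, toy_percent(87848, 111281140) puts the
      rescued share of immediate-reclaim pages just under 0.08%.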
      
      writebackCPDeviceext4
                         3.1.0-vanilla         rc5-vanilla       freemore-v6r1        isolate-v6r1         andrea-v2r1
      System Time         1.51 (    0.00%)    1.77 (  -17.66%)    1.46 (    2.92%)    1.15 (   23.77%)    1.89 (  -25.63%)
      +/-                 0.27 (    0.00%)    0.67 ( -148.52%)    0.33 (  -22.76%)    0.30 (  -11.15%)    0.19 (   30.16%)
      User Time           0.03 (    0.00%)    0.04 (  -37.50%)    0.05 (  -62.50%)    0.07 ( -112.50%)    0.04 (  -18.75%)
      +/-                 0.01 (    0.00%)    0.02 ( -146.64%)    0.02 (  -97.91%)    0.02 (  -75.59%)    0.02 (  -63.30%)
      Elapsed Time      124.93 (    0.00%)  114.49 (    8.36%)   96.77 (   22.55%)   27.48 (   78.00%)  205.70 (  -64.65%)
      +/-                20.20 (    0.00%)   74.39 ( -268.34%)   59.88 ( -196.48%)    7.72 (   61.79%)   25.03 (  -23.95%)
      THP Active        161.80 (    0.00%)   83.60 (   51.67%)  141.20 (   87.27%)   84.60 (   52.29%)   82.60 (   51.05%)
      +/-                71.95 (    0.00%)   43.80 (   60.88%)   26.91 (   37.40%)   59.02 (   82.03%)   52.13 (   72.45%)
      Fault Alloc       471.40 (    0.00%)  228.60 (   48.49%)  282.20 (   59.86%)  225.20 (   47.77%)  388.40 (   82.39%)
      +/-                88.07 (    0.00%)   87.42 (   99.26%)   73.79 (   83.78%)  109.62 (  124.47%)   82.62 (   93.81%)
      Fault Fallback    531.60 (    0.00%)  774.60 (  -45.71%)  720.80 (  -35.59%)  777.80 (  -46.31%)  614.80 (  -15.65%)
      +/-                88.07 (    0.00%)   87.26 (    0.92%)   73.79 (   16.22%)  109.62 (  -24.47%)   82.29 (    6.56%)
      MMTests Statistics: duration
      User/Sys Time Running Test (seconds)         50.22     33.76     30.65     24.14    128.45
      Total Elapsed Time (seconds)               1113.73   1132.19   1029.45    759.49   1707.26
      
      Similar test but the USB stick is using ext4 instead of vfat. As
      ext4 does not use writepage for migration, the large stalls due to
      compaction when THP is enabled are not observed. Still, isolating
      PageReclaim pages on their own list helped completion time largely
      by reducing the number of pages scanned by direct reclaim although
      time spent in congestion_wait could also be a factor.
      
      Again, Andrea's series had far higher success rates for THP allocation
      at the cost of elapsed time. I didn't look too closely but a quick
      look at the vmstat figures tells me kswapd reclaimed 8 times more pages
      than the patch series and direct reclaim reclaimed roughly three times
      as many pages. It follows that if memory is aggressively reclaimed,
      there will be more available for THP.
      
      writebackCPFilevfat
                         3.1.0-vanilla         rc5-vanilla       freemore-v6r1        isolate-v6r1         andrea-v2r1
      System Time         1.76 (    0.00%)   29.10 (-1555.52%)   46.01 (-2517.18%)    4.79 ( -172.35%)   54.89 (-3022.53%)
      +/-                 0.14 (    0.00%)   25.61 (-18185.17%)    2.15 (-1434.83%)    6.60 (-4610.03%)    9.75 (-6863.76%)
      User Time           0.05 (    0.00%)    0.07 (  -45.83%)    0.05 (   -4.17%)    0.06 (  -29.17%)    0.06 (  -16.67%)
      +/-                 0.02 (    0.00%)    0.02 (   20.11%)    0.02 (   -3.14%)    0.01 (   31.58%)    0.01 (   47.41%)
      Elapsed Time     22520.79 (    0.00%) 1082.85 (   95.19%)   73.30 (   99.67%)   32.43 (   99.86%)  291.84 (  98.70%)
      +/-              7277.23 (    0.00%)  706.29 (   90.29%)   19.05 (   99.74%)   17.05 (   99.77%)  125.55 (   98.27%)
      THP Active         83.80 (    0.00%)   12.80 (   15.27%)   15.60 (   18.62%)   13.00 (   15.51%)    0.80 (    0.95%)
      +/-                66.81 (    0.00%)   20.19 (   30.22%)    5.92 (    8.86%)   15.06 (   22.54%)    1.17 (    1.75%)
      Fault Alloc       171.00 (    0.00%)   67.80 (   39.65%)   97.40 (   56.96%)  125.60 (   73.45%)  133.00 (   77.78%)
      +/-                82.91 (    0.00%)   30.69 (   37.02%)   53.91 (   65.02%)   55.05 (   66.40%)   21.19 (   25.56%)
      Fault Fallback    832.00 (    0.00%)  935.20 (  -12.40%)  906.00 (   -8.89%)  877.40 (   -5.46%)  870.20 (   -4.59%)
      +/-                82.91 (    0.00%)   30.69 (   62.98%)   54.01 (   34.86%)   55.05 (   33.60%)   20.91 (   74.78%)
      MMTests Statistics: duration
      User/Sys Time Running Test (seconds)       7229.81    928.42    704.52     80.68   1330.76
      Total Elapsed Time (seconds)             112849.04   5618.69    571.11    360.54   1664.28
      
      In this case, the test is reading/writing only from filesystems but as
      it's vfat, it's slow due to calling writepage during compaction. Little
      to observe really - the time to complete the test goes way down
      with the series applied and THP allocation success rates go up in
      comparison to 3.2-rc5.  The success rates are lower than 3.1.0 but
      the elapsed time for that kernel is abysmal so it is not really a
      sensible comparison.
      
      As before, Andrea's series allocates more THPs at the cost of overall
      performance.
      
      writebackCPFileext4
                         3.1.0-vanilla         rc5-vanilla       freemore-v6r1        isolate-v6r1         andrea-v2r1
      System Time         1.51 (    0.00%)    1.77 (  -17.66%)    1.46 (    2.92%)    1.15 (   23.77%)    1.89 (  -25.63%)
      +/-                 0.27 (    0.00%)    0.67 ( -148.52%)    0.33 (  -22.76%)    0.30 (  -11.15%)    0.19 (   30.16%)
      User Time           0.03 (    0.00%)    0.04 (  -37.50%)    0.05 (  -62.50%)    0.07 ( -112.50%)    0.04 (  -18.75%)
      +/-                 0.01 (    0.00%)    0.02 ( -146.64%)    0.02 (  -97.91%)    0.02 (  -75.59%)    0.02 (  -63.30%)
      Elapsed Time      124.93 (    0.00%)  114.49 (    8.36%)   96.77 (   22.55%)   27.48 (   78.00%)  205.70 (  -64.65%)
      +/-                20.20 (    0.00%)   74.39 ( -268.34%)   59.88 ( -196.48%)    7.72 (   61.79%)   25.03 (  -23.95%)
      THP Active        161.80 (    0.00%)   83.60 (   51.67%)  141.20 (   87.27%)   84.60 (   52.29%)   82.60 (   51.05%)
      +/-                71.95 (    0.00%)   43.80 (   60.88%)   26.91 (   37.40%)   59.02 (   82.03%)   52.13 (   72.45%)
      Fault Alloc       471.40 (    0.00%)  228.60 (   48.49%)  282.20 (   59.86%)  225.20 (   47.77%)  388.40 (   82.39%)
      +/-                88.07 (    0.00%)   87.42 (   99.26%)   73.79 (   83.78%)  109.62 (  124.47%)   82.62 (   93.81%)
      Fault Fallback    531.60 (    0.00%)  774.60 (  -45.71%)  720.80 (  -35.59%)  777.80 (  -46.31%)  614.80 (  -15.65%)
      +/-                88.07 (    0.00%)   87.26 (    0.92%)   73.79 (   16.22%)  109.62 (  -24.47%)   82.29 (    6.56%)
      MMTests Statistics: duration
      User/Sys Time Running Test (seconds)         50.22     33.76     30.65     24.14    128.45
      Total Elapsed Time (seconds)               1113.73   1132.19   1029.45    759.49   1707.26
      
      Same type of story - elapsed times go down. In this case, allocation
      success rates are roughly the same. As before, Andrea's series has
      higher success rates but takes a lot longer.
      
      Overall the series does reduce latencies and while the tests are
      inherently racy as alloc competes with the cp processes, the variability
      was included. The THP allocation rates are not as high as they could
      be but that is because we would have to be more aggressive about
      reclaim and compaction, impacting overall performance.
      
      This patch:
      
      Commit 39deaf85 ("mm: compaction: make isolate_lru_page() filter-aware")
      noted that compaction does not migrate dirty or writeback pages and that
      it was meaningless to pick the page and re-add it to the LRU list.
      
      What was missed during review is that asynchronous migration moves dirty
      pages if their ->migratepage callback is migrate_page() because these can
      be moved without blocking.  This potentially impacted hugepage allocation
      success rates by a factor depending on how many dirty pages are in the
      system.
      
      This patch partially reverts 39deaf85 to allow migration to isolate dirty
      pages again.  This increases how much compaction disrupts the LRU but that
      is addressed later in the series.
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Reviewed-by: Andrea Arcangeli <aarcange@redhat.com>
      Reviewed-by: Rik van Riel <riel@redhat.com>
      Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
      Cc: Dave Jones <davej@redhat.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Andy Isaacson <adi@hexapodia.org>
      Cc: Nai Xia <nai.xia@gmail.com>
      Cc: Johannes Weiner <jweiner@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      a77ebd33