1. 09 Feb, 2012 1 commit
    • M
      mm: compaction: check for overlapping nodes during isolation for migration · dc908600
      Mel Gorman committed
      When isolating pages for migration, migration starts at the start of a
      zone while the free scanner starts at the end of the zone.  Migration
      avoids entering a new zone by never going beyond the free scanner.
      
      Unfortunately, in very rare cases nodes can overlap.  When this happens,
      migration isolates pages without the LRU lock held, corrupting lists
      which will trigger errors in reclaim or during page free such as in the
      following oops
      
        BUG: unable to handle kernel NULL pointer dereference at 0000000000000008
        IP: [<ffffffff810f795c>] free_pcppages_bulk+0xcc/0x450
        PGD 1dda554067 PUD 1e1cb58067 PMD 0
        Oops: 0000 [#1] SMP
        CPU 37
        Pid: 17088, comm: memcg_process_s Tainted: G            X
        RIP: free_pcppages_bulk+0xcc/0x450
        Process memcg_process_s (pid: 17088, threadinfo ffff881c2926e000, task ffff881c2926c0c0)
        Call Trace:
          free_hot_cold_page+0x17e/0x1f0
          __pagevec_free+0x90/0xb0
          release_pages+0x22a/0x260
          pagevec_lru_move_fn+0xf3/0x110
          putback_lru_page+0x66/0xe0
          unmap_and_move+0x156/0x180
          migrate_pages+0x9e/0x1b0
          compact_zone+0x1f3/0x2f0
          compact_zone_order+0xa2/0xe0
          try_to_compact_pages+0xdf/0x110
          __alloc_pages_direct_compact+0xee/0x1c0
          __alloc_pages_slowpath+0x370/0x830
          __alloc_pages_nodemask+0x1b1/0x1c0
          alloc_pages_vma+0x9b/0x160
          do_huge_pmd_anonymous_page+0x160/0x270
          do_page_fault+0x207/0x4c0
          page_fault+0x25/0x30
      
      The "X" in the taint flag means that external modules were loaded but
      is unrelated to the bug triggering.  The real problem was that the PFN
      layout looks like this
      
        Zone PFN ranges:
          DMA      0x00000010 -> 0x00001000
          DMA32    0x00001000 -> 0x00100000
          Normal   0x00100000 -> 0x01e80000
        Movable zone start PFN for each node
        early_node_map[14] active PFN ranges
            0: 0x00000010 -> 0x0000009b
            0: 0x00000100 -> 0x0007a1ec
            0: 0x0007a354 -> 0x0007a379
            0: 0x0007f7ff -> 0x0007f800
            0: 0x00100000 -> 0x00680000
            1: 0x00680000 -> 0x00e80000
            0: 0x00e80000 -> 0x01080000
            1: 0x01080000 -> 0x01280000
            0: 0x01280000 -> 0x01480000
            1: 0x01480000 -> 0x01680000
            0: 0x01680000 -> 0x01880000
            1: 0x01880000 -> 0x01a80000
            0: 0x01a80000 -> 0x01c80000
            1: 0x01c80000 -> 0x01e80000
      
      The fix is straightforward.  isolate_migratepages() has to make a
      similar check to isolate_freepages() to ensure that it never isolates pages
      from a zone it does not hold the LRU lock for.
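
The overlap is visible in the early_node_map above: ranges owned by node 0 and node 1 alternate inside the Normal zone's PFN range, so stopping at the free scanner alone does not keep the migration scanner inside one node. The following is a minimal user-space sketch of the kind of ownership check described, not the actual patch. The first four Normal-zone ranges are copied from the log; `struct pfn_range`, `pfn_to_nid()` and `scan_limit()` are hypothetical names used only for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* One entry of the early_node_map printed in the log above. */
struct pfn_range {
    int nid;              /* owning node */
    unsigned long start;  /* first PFN, inclusive */
    unsigned long end;    /* last PFN, exclusive */
};

/* First four Normal-zone ranges from the log; nodes 0 and 1 interleave. */
static const struct pfn_range node_map[] = {
    { 0, 0x00100000, 0x00680000 },
    { 1, 0x00680000, 0x00e80000 },
    { 0, 0x00e80000, 0x01080000 },
    { 1, 0x01080000, 0x01280000 },
};

/* Hypothetical helper: node owning a PFN, or -1 if no range covers it. */
static int pfn_to_nid(unsigned long pfn)
{
    for (size_t i = 0; i < sizeof(node_map) / sizeof(node_map[0]); i++)
        if (pfn >= node_map[i].start && pfn < node_map[i].end)
            return node_map[i].nid;
    return -1;
}

/*
 * Probe forward from start_pfn in increments of step and return the first
 * probed PFN that does not belong to node nid, or free_pfn if the scan
 * reaches the free scanner first.
 */
static unsigned long scan_limit(int nid, unsigned long start_pfn,
                                unsigned long free_pfn, unsigned long step)
{
    unsigned long pfn = start_pfn;

    assert(step > 0);
    while (pfn < free_pfn && pfn_to_nid(pfn) == nid)
        pfn += step;
    return pfn < free_pfn ? pfn : free_pfn;
}
```

With a free scanner at PFN 0x01000000, a scan for node 0 starting at 0x00100000 has to stop at 0x00680000, where node 1's memory begins, even though the free scanner is much further on.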
      
      This was discovered in a 3.0-based kernel but it affects 3.1.x, 3.2.x
      and current mainline.
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Acked-by: Michal Nazarewicz <mina86@mina86.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  2. 04 Feb, 2012 1 commit
    • M
      mm: compaction: check pfn_valid when entering a new MAX_ORDER_NR_PAGES block... · 0bf380bc
      Mel Gorman committed
      mm: compaction: check pfn_valid when entering a new MAX_ORDER_NR_PAGES block during isolation for migration
      
      When isolating for migration, migration starts at the start of a zone
      which is not necessarily pageblock aligned.  Further, it stops isolating
      when COMPACT_CLUSTER_MAX pages are isolated so migrate_pfn is generally
      not aligned.  This allows isolate_migratepages() to call pfn_to_page() on
      an invalid PFN which can result in a crash.  This was originally reported
      against a 3.0-based kernel with the following trace in a crash dump.
      
      PID: 9902   TASK: d47aecd0  CPU: 0   COMMAND: "memcg_process_s"
       #0 [d72d3ad0] crash_kexec at c028cfdb
       #1 [d72d3b24] oops_end at c05c5322
       #2 [d72d3b38] __bad_area_nosemaphore at c0227e60
       #3 [d72d3bec] bad_area at c0227fb6
       #4 [d72d3c00] do_page_fault at c05c72ec
       #5 [d72d3c80] error_code (via page_fault) at c05c47a4
          EAX: 00000000  EBX: 000c0000  ECX: 00000001  EDX: 00000807  EBP: 000c0000
          DS:  007b      ESI: 00000001  ES:  007b      EDI: f3000a80  GS:  6f50
          CS:  0060      EIP: c030b15a  ERR: ffffffff  EFLAGS: 00010002
       #6 [d72d3cb4] isolate_migratepages at c030b15a
       #7 [d72d3d14] zone_watermark_ok at c02d26cb
       #8 [d72d3d2c] compact_zone at c030b8de
       #9 [d72d3d68] compact_zone_order at c030bba1
      #10 [d72d3db4] try_to_compact_pages at c030bc84
      #11 [d72d3ddc] __alloc_pages_direct_compact at c02d61e7
      #12 [d72d3e08] __alloc_pages_slowpath at c02d66c7
      #13 [d72d3e78] __alloc_pages_nodemask at c02d6a97
      #14 [d72d3eb8] alloc_pages_vma at c030a845
      #15 [d72d3ed4] do_huge_pmd_anonymous_page at c03178eb
      #16 [d72d3f00] handle_mm_fault at c02f36c6
      #17 [d72d3f30] do_page_fault at c05c70ed
      #18 [d72d3fb0] error_code (via page_fault) at c05c47a4
          EAX: b71ff000  EBX: 00000001  ECX: 00001600  EDX: 00000431
          DS:  007b      ESI: 08048950  ES:  007b      EDI: bfaa3788
          SS:  007b      ESP: bfaa36e0  EBP: bfaa3828  GS:  6f50
          CS:  0073      EIP: 080487c8  ERR: ffffffff  EFLAGS: 00010202
      
      It was also reported by Herbert van den Bergh against a 3.1-based kernel
      with the following snippet from the console log.
      
      BUG: unable to handle kernel paging request at 01c00008
      IP: [<c0522399>] isolate_migratepages+0x119/0x390
      *pdpt = 000000002f7ce001 *pde = 0000000000000000
      
      It is expected that it also affects 3.2.x and current mainline.
      
      The problem is that pfn_valid is only called on the first PFN being
      checked and that PFN is not necessarily aligned.  Let's say we have a case
      like this
      
      H = MAX_ORDER_NR_PAGES boundary
      | = pageblock boundary
      m = cc->migrate_pfn
      f = cc->free_pfn
      o = memory hole
      
      H------|------H------|----m-Hoooooo|ooooooH-f----|------H
      
      The migrate_pfn is just below a memory hole and the free scanner is beyond
      the hole.  When isolate_migratepages() starts, it scans from migrate_pfn to
      migrate_pfn+pageblock_nr_pages, which now extends into the memory hole.  It
      checks pfn_valid() on the first PFN but then scans into the hole where there
      are not necessarily valid struct pages.
      
      This patch ensures that isolate_migratepages calls pfn_valid when
      necessary.
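
As a rough illustration of that rule, here is a user-space sketch rather than the actual patch: validity is re-checked each time the scan crosses into a new MAX_ORDER_NR_PAGES block, and an invalid block is skipped in one step. The value chosen for `MAX_ORDER_NR_PAGES`, the position of the hole, and the names `pfn_valid_model()` and `count_scannable()` are all assumptions made for the example.

```c
#include <assert.h>
#include <stdbool.h>

/* Assumed value for illustration; the real one depends on the kernel config. */
#define MAX_ORDER_NR_PAGES 1024UL

/* Hypothetical memory hole covering exactly one MAX_ORDER_NR_PAGES block. */
static const unsigned long hole_start = 3 * MAX_ORDER_NR_PAGES;
static const unsigned long hole_end   = 4 * MAX_ORDER_NR_PAGES;

/* Stand-in for pfn_valid(): false inside the hole. */
static bool pfn_valid_model(unsigned long pfn)
{
    return pfn < hole_start || pfn >= hole_end;
}

/*
 * Count the PFNs in [low_pfn, end_pfn) that the scanner may touch.
 * The caller is assumed to have validated the (possibly unaligned) first
 * PFN; after that, validity is re-checked at every MAX_ORDER_NR_PAGES
 * boundary and an invalid block is skipped as a whole.
 */
static unsigned long count_scannable(unsigned long low_pfn,
                                     unsigned long end_pfn)
{
    unsigned long scanned = 0;

    assert(low_pfn >= end_pfn || pfn_valid_model(low_pfn));
    while (low_pfn < end_pfn) {
        if ((low_pfn & (MAX_ORDER_NR_PAGES - 1)) == 0 &&
            !pfn_valid_model(low_pfn)) {
            low_pfn += MAX_ORDER_NR_PAGES;  /* skip the whole block */
            continue;
        }
        scanned++;
        low_pfn++;
    }
    return scanned;
}
```

Starting 100 PFNs below the hole and ending 50 PFNs past it, the sketch counts 150 PFNs and never touches a PFN inside the hole, which mirrors the "m just below the hole" case in the diagram.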
      Reported-by: Herbert van den Bergh <herbert.van.den.bergh@oracle.com>
      Tested-by: Herbert van den Bergh <herbert.van.den.bergh@oracle.com>
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Acked-by: Michal Nazarewicz <mina86@mina86.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  3. 13 Jan, 2012 4 commits
    • M
      mm: compaction: introduce sync-light migration for use by compaction · a6bc32b8
      Mel Gorman committed
      This patch adds a lightweight sync migrate operation MIGRATE_SYNC_LIGHT
      mode that avoids writing back pages to backing storage.  Async compaction
      maps to MIGRATE_ASYNC while sync compaction maps to MIGRATE_SYNC_LIGHT.
      For other migrate_pages users such as memory hotplug, MIGRATE_SYNC is
      used.
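
The mapping described above can be sketched as follows. This is a self-contained illustration, not the kernel's definition: the three mode names come from the commit message, while the enum values and the helper names `compaction_migrate_mode()` and `may_writeback()` are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

/* Mode names are from the commit message; the numeric values are arbitrary. */
enum migrate_mode {
    MIGRATE_ASYNC,       /* never block */
    MIGRATE_SYNC_LIGHT,  /* may block, but does not write pages back */
    MIGRATE_SYNC,        /* may block and write pages back */
};

/* Hypothetical helper: the mode a compaction run would request. */
static enum migrate_mode compaction_migrate_mode(bool sync)
{
    return sync ? MIGRATE_SYNC_LIGHT : MIGRATE_ASYNC;
}

/* Hypothetical helper: only full sync migration may write back to storage. */
static bool may_writeback(enum migrate_mode mode)
{
    assert(mode >= MIGRATE_ASYNC && mode <= MIGRATE_SYNC);
    return mode == MIGRATE_SYNC;
}
```

Under this model neither flavour of compaction ever reaches writeback; only other migrate_pages users that pass MIGRATE_SYNC, such as memory hotplug, do.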
      
      This avoids sync compaction stalling for an excessive length of time,
      particularly when copying files to a USB stick where there might be a
      large number of dirty pages backed by a filesystem that does not support
      ->writepages.
      
      [aarcange@redhat.com: This patch is heavily based on Andrea's work]
      [akpm@linux-foundation.org: fix fs/nfs/write.c build]
      [akpm@linux-foundation.org: fix fs/btrfs/disk-io.c build]
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Reviewed-by: Rik van Riel <riel@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Minchan Kim <minchan.kim@gmail.com>
      Cc: Dave Jones <davej@redhat.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Andy Isaacson <adi@hexapodia.org>
      Cc: Nai Xia <nai.xia@gmail.com>
      Cc: Johannes Weiner <jweiner@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • M
      mm: compaction: make isolate_lru_page() filter-aware again · c8244935
      Mel Gorman committed
      Commit 39deaf85 ("mm: compaction: make isolate_lru_page() filter-aware")
      noted that compaction does not migrate dirty or writeback pages and that
      it was meaningless to pick the page and re-add it to the LRU list.  This
      had to be partially reverted because some dirty pages can be migrated by
      compaction without blocking.
      
      This patch updates "mm: compaction: make isolate_lru_page" by skipping
      over pages that migration has no possibility of migrating to minimise LRU
      disruption.
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Reviewed-by: Rik van Riel <riel@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Reviewed-by: Minchan Kim <minchan@kernel.org>
      Cc: Dave Jones <davej@redhat.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Andy Isaacson <adi@hexapodia.org>
      Cc: Nai Xia <nai.xia@gmail.com>
      Cc: Johannes Weiner <jweiner@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • M
      mm: compaction: use synchronous compaction for /proc/sys/vm/compact_memory · b16d3d5a
      Mel Gorman committed
      When asynchronous compaction was introduced, the
      /proc/sys/vm/compact_memory handler should have been updated to always use
      synchronous compaction.  This did not happen so this patch addresses it.
      
      The assumption is if a user writes to /proc/sys/vm/compact_memory, they
      are willing for that process to stall.
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Reviewed-by: Andrea Arcangeli <aarcange@redhat.com>
      Reviewed-by: Rik van Riel <riel@redhat.com>
      Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
      Cc: Dave Jones <davej@redhat.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Andy Isaacson <adi@hexapodia.org>
      Cc: Nai Xia <nai.xia@gmail.com>
      Cc: Johannes Weiner <jweiner@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • M
      mm: compaction: allow compaction to isolate dirty pages · a77ebd33
      Mel Gorman committed
      Short summary: There are severe stalls when a USB stick using VFAT is
      used with THP enabled that are reduced by this series.  If you are
      experiencing this problem, please test and report back.  Considering I
      have seen complaints from openSUSE and Fedora users on this as well as a
      few private mails, I'm guessing it's a widespread issue.  This is a new
      type of USB-related stall because it is due to synchronous compaction
      writing, whereas in the past the big problem was dirty pages reaching
      the end of the LRU and being written by reclaim.
      
      Am cc'ing Andrew this time and this series would replace
      mm-do-not-stall-in-synchronous-compaction-for-thp-allocations.patch.
      I'm also cc'ing Dave Jones as he might have merged that patch to Fedora
      for wider testing and ideally it would be reverted and replaced by this
      series.
      
      That said, the later patches could really do with some review.  If this
      series is not the answer then a new direction needs to be discussed
      because as it is, the stalls are unacceptable as the results in this
      leader show.
      
      For testers that try backporting this to 3.1, it won't work because
      there is a non-obvious dependency on not writing back pages in direct
      reclaim so you need those patches too.
      
      Changelog since V5
      o Rebase to 3.2-rc5
      o Tidy up the changelogs a bit
      
      Changelog since V4
      o Added reviewed-bys, credited Andrea properly for sync-light
      o Allow dirty pages without mappings to be considered for migration
      o Bound the number of pages freed for compaction
      o Isolate PageReclaim pages on their own LRU list
      
      This is against 3.2-rc5 and follows on from discussions on "mm: Do
      not stall in synchronous compaction for THP allocations" and "[RFC
      PATCH 0/5] Reduce compaction-related stalls". Initially, the proposed
      patch eliminated stalls due to compaction which sometimes resulted in
      user-visible interactivity problems on browsers by simply never using
      sync compaction. The downside was that THP allocation success rates
      were lower because dirty pages were not being migrated as reported by
      Andrea. His approach to fixing this was nacked on the grounds that
      it reverted fixes from Rik that reduced the number of pages
      reclaimed, as it severely impacted his workload's performance.
      
      This series attempts to reconcile the requirements of maximising THP
      usage, without stalling in a user-visible fashion due to compaction
      or cheating by reclaiming an excessive number of pages.
      
      Patch 1 partially reverts commit 39deaf85 to allow migration to isolate
      	dirty pages. This is because migration can move some dirty
      	pages without blocking.
      
      Patch 2 notes that the /proc/sys/vm/compact_memory handler is not using
      	synchronous compaction when it should be. This is unrelated
      	to the reported stalls but is worth fixing.
      
      Patch 3 checks if we isolated a compound page during lumpy scan and
      	account for it properly. For the most part, this affects
      	tracing so it's unrelated to the stalls but worth fixing.
      
      Patch 4 notes that it is possible to abort reclaim early for compaction
      	and return 0 to the page allocator potentially entering the
      	"may oom" path. This has not been observed in practice but
      	the rest of the series potentially makes it easier to happen.
      
      Patch 5 adds a sync parameter to the migratepage callback and gives
      	the callback responsibility for migrating the page without
      	blocking if sync==false. For example, fallback_migrate_page
      	will not call writepage if sync==false. This increases the
      	number of pages that can be handled by asynchronous compaction
      	thereby reducing stalls.
      
      Patch 6 restores filter-awareness to isolate_lru_page for migration.
      	In practice, it means that pages under writeback and pages
      	without a ->migratepage callback will not be isolated
      	for migration.
      
      Patch 7 avoids calling direct reclaim if compaction is deferred but
      	makes sure that compaction is only deferred if sync
      	compaction was used.
      
      Patch 8 introduces a sync-light migration mechanism that sync compaction
      	uses. The objective is to allow some stalls but to not call
      	->writepage which can lead to significant user-visible stalls.
      
      Patch 9 notes that while we want to abort reclaim ASAP to allow
      	compaction to go ahead that we leave a very small window of
      	opportunity for compaction to run. This patch allows more pages
      	to be freed by reclaim but bounds the number to a reasonable
      	level based on the high watermark on each zone.
      
      Patch 10 allows slabs to be shrunk even after compaction_ready() is
      	true for one zone. This is to avoid a problem whereby a single
      	small zone can abort reclaim even though no pages have been
      	reclaimed and no suitably large zone is in a usable state.
      
      Patch 11 fixes a problem with the rate of page scanning. As reclaim is
      	rarely stalling on pages under writeback it means that scan
      	rates are very high. This is particularly true for direct
      	reclaim which is not calling writepage. The vmstat figures
      	implied that much of this was busy work with PageReclaim pages
      	marked for immediate reclaim. This patch is a prototype that
      	moves these pages to their own LRU list.
      
      This has been tested and other than 2 USB keys getting trashed,
      nothing horrible fell out. That said, I am a bit unhappy with the
      rescue logic in patch 11 but did not find a better way around it. It
      does significantly reduce scan rates and System CPU time indicating
      it is the right direction to take.
      
      What is of critical importance is that stalls due to compaction
      are massively reduced even though sync compaction was still
      allowed. Testing from people complaining about stalls copying to USBs
      with THP enabled are particularly welcome.
      
      The following tests all involve THP usage and USB keys in some
      way. Each test follows this type of pattern
      
      1. Read from some fast storage, be it raw device or file. Each time
         the copy finishes, start again until the test ends
      2. Write a large file to a filesystem on a USB stick. Each time the copy
         finishes, start again until the test ends
      3. When memory is low, start an alloc process that creates a mapping
         the size of physical memory to stress THP allocation. This is the
         "real" part of the test and the part that is meant to trigger
         stalls when THP is enabled. Copying continues in the background.
      4. Record the CPU usage and time to execute of the alloc process
      5. Record the number of THP allocs and fallbacks as well as the number of THP
         pages in use at the end of the test just before alloc exited
      6. Run the test 5 times to get an idea of variability
      7. Between each run, sync is run and caches dropped and the test
         waits until nr_dirty is a small number to avoid interference
         or caching between iterations that would skew the figures.
      
      The individual tests were then
      
      writebackCPDeviceBasevfat
      	Disable THP, read from a raw device (sda), vfat on USB stick
      writebackCPDeviceBaseext4
      	Disable THP, read from a raw device (sda), ext4 on USB stick
      writebackCPDevicevfat
      	THP enabled, read from a raw device (sda), vfat on USB stick
      writebackCPDeviceext4
      	THP enabled, read from a raw device (sda), ext4 on USB stick
      writebackCPFilevfat
      	THP enabled, read from a file on fast storage and USB, both vfat
      writebackCPFileext4
      	THP enabled, read from a file on fast storage and USB, both ext4
      
      The kernels tested were
      
      3.1		3.1
      vanilla		3.2-rc5
      freemore	Patches 1-10
      immediate	Patches 1-11
      andrea		The 8 patches Andrea posted as a basis of comparison
      
      The results are very long unfortunately. I'll start with the case
      where we are not using THP at all
      
      writebackCPDeviceBasevfat
                         3.1.0-vanilla         rc5-vanilla       freemore-v6r1        isolate-v6r1         andrea-v2r1
      System Time         1.28 (    0.00%)   54.49 (-4143.46%)   48.63 (-3687.69%)    4.69 ( -265.11%)   51.88 (-3940.81%)
      +/-                 0.06 (    0.00%)    2.45 (-4305.55%)    4.75 (-8430.57%)    7.46 (-13282.76%)    4.76 (-8440.70%)
      User Time           0.09 (    0.00%)    0.05 (   40.91%)    0.06 (   29.55%)    0.07 (   15.91%)    0.06 (   27.27%)
      +/-                 0.02 (    0.00%)    0.01 (   45.39%)    0.02 (   25.07%)    0.00 (   77.06%)    0.01 (   52.24%)
      Elapsed Time      110.27 (    0.00%)   56.38 (   48.87%)   49.95 (   54.70%)   11.77 (   89.33%)   53.43 (   51.54%)
      +/-                 7.33 (    0.00%)    3.77 (   48.61%)    4.94 (   32.63%)    6.71 (    8.50%)    4.76 (   35.03%)
      THP Active          0.00 (    0.00%)    0.00 (    0.00%)    0.00 (    0.00%)    0.00 (    0.00%)    0.00 (    0.00%)
      +/-                 0.00 (    0.00%)    0.00 (    0.00%)    0.00 (    0.00%)    0.00 (    0.00%)    0.00 (    0.00%)
      Fault Alloc         0.00 (    0.00%)    0.00 (    0.00%)    0.00 (    0.00%)    0.00 (    0.00%)    0.00 (    0.00%)
      +/-                 0.00 (    0.00%)    0.00 (    0.00%)    0.00 (    0.00%)    0.00 (    0.00%)    0.00 (    0.00%)
      Fault Fallback      0.00 (    0.00%)    0.00 (    0.00%)    0.00 (    0.00%)    0.00 (    0.00%)    0.00 (    0.00%)
      +/-                 0.00 (    0.00%)    0.00 (    0.00%)    0.00 (    0.00%)    0.00 (    0.00%)    0.00 (    0.00%)
      
      The THP figures are obviously all 0 because THP was disabled. The
      main thing to watch is the elapsed times and how they compare to
      times when THP is enabled later. It's also important to note that
      elapsed time is improved by this series as System CPU time is much
      reduced.
      
      writebackCPDevicevfat
      
                         3.1.0-vanilla         rc5-vanilla       freemore-v6r1        isolate-v6r1         andrea-v2r1
      System Time         1.22 (    0.00%)   13.89 (-1040.72%)   46.40 (-3709.20%)    4.44 ( -264.37%)   47.37 (-3789.33%)
      +/-                 0.06 (    0.00%)   22.82 (-37635.56%)    3.84 (-6249.44%)    6.48 (-10618.92%)    6.60 (-10818.53%)
      User Time           0.06 (    0.00%)    0.06 (   -6.90%)    0.05 (   17.24%)    0.05 (   13.79%)    0.04 (   31.03%)
      +/-                 0.01 (    0.00%)    0.01 (   33.33%)    0.01 (   33.33%)    0.01 (   39.14%)    0.01 (   25.46%)
      Elapsed Time     10445.54 (    0.00%) 2249.92 (   78.46%)   70.06 (   99.33%)   16.59 (   99.84%)  472.43 (   95.48%)
      +/-               643.98 (    0.00%)  811.62 (  -26.03%)   10.02 (   98.44%)    7.03 (   98.91%)   59.99 (   90.68%)
      THP Active         15.60 (    0.00%)   35.20 (  225.64%)   65.00 (  416.67%)   70.80 (  453.85%)   62.20 (  398.72%)
      +/-                18.48 (    0.00%)   51.29 (  277.59%)   15.99 (   86.52%)   37.91 (  205.18%)   22.02 (  119.18%)
      Fault Alloc       121.80 (    0.00%)   76.60 (   62.89%)  155.40 (  127.59%)  181.20 (  148.77%)  286.60 (  235.30%)
      +/-                73.51 (    0.00%)   61.11 (   83.12%)   34.89 (   47.46%)   31.88 (   43.36%)   68.13 (   92.68%)
      Fault Fallback    881.20 (    0.00%)  926.60 (   -5.15%)  847.60 (    3.81%)  822.00 (    6.72%)  716.60 (   18.68%)
      +/-                73.51 (    0.00%)   61.26 (   16.67%)   34.89 (   52.54%)   31.65 (   56.94%)   67.75 (    7.84%)
      MMTests Statistics: duration
      User/Sys Time Running Test (seconds)       3540.88   1945.37    716.04     64.97   1937.03
      Total Elapsed Time (seconds)              52417.33  11425.90    501.02    230.95   2520.28
      
      The first thing to note is the "Elapsed Time" for the vanilla kernels
      of 2249 seconds versus 56 with THP disabled which might explain the
      reports of USB stalls with THP enabled. Applying the patches brings
      performance in line with THP-disabled performance while isolating
      pages for immediate reclaim from the LRU cuts down System CPU time.
      
      The "Fault Alloc" success rate figures are also improved. The vanilla
      kernel only managed to allocate 76.6 pages on average over the course
      of 5 iterations whereas applying the series allocated 181.20 on
      average, albeit well within variance. It's worth noting that
      applying the series at least decreases the amount of variance, which
      implies an improvement.
      
      Andrea's series had a higher success rate for THP allocations but
      at a severe cost to elapsed time which is still better than vanilla
      but still much worse than disabling THP altogether. One can bring my
      series close to Andrea's by removing this check
      
              /*
               * If compaction is deferred for high-order allocations, it is because
               * sync compaction recently failed. If this is the case and the caller
               * has requested the system not be heavily disrupted, fail the
               * allocation now instead of entering direct reclaim
               */
              if (deferred_compaction && (gfp_mask & __GFP_NO_KSWAPD))
                      goto nopage;
      
      I didn't include a patch that removed the above check because hurting
      overall performance to improve the THP figure is not what the average
      user wants. It's something to consider though if someone really wants
      to maximise THP usage no matter what it does to the workload initially.
      
      This is a summary of vmstat figures from the same test.
      
                                             3.1.0-vanilla rc5-vanilla freemore-v6r1 isolate-v6r1 andrea-v2r1
      Page Ins                                  3257266139  1111844061    17263623    10901575   161423219
      Page Outs                                   81054922    30364312     3626530     3657687     8753730
      Swap Ins                                        3294        2851        6560        4964        4592
      Swap Outs                                     390073      528094      620197      790912      698285
      Direct pages scanned                      1077581700  3024951463  1764930052   115140570  5901188831
      Kswapd pages scanned                        34826043     7112868     2131265     1686942     1893966
      Kswapd pages reclaimed                      28950067     4911036     1246044      966475     1497726
      Direct pages reclaimed                     805148398   280167837     3623473     2215044    40809360
      Kswapd efficiency                                83%         69%         58%         57%         79%
      Kswapd velocity                              664.399     622.521    4253.852    7304.360     751.490
      Direct efficiency                                74%          9%          0%          1%          0%
      Direct velocity                            20557.737  264745.137 3522673.849  498551.938 2341481.435
      Percentage direct scans                          96%         99%         99%         98%         99%
      Page writes by reclaim                        722646      529174      620319      791018      699198
      Page writes file                              332573        1080         122         106         913
      Page writes anon                              390073      528094      620197      790912      698285
      Page reclaim immediate                             0  2552514720  1635858848   111281140  5478375032
      Page rescued immediate                             0           0           0       87848           0
      Slabs scanned                                  23552       23552        9216        8192        9216
      Direct inode steals                              231           0           0           0           0
      Kswapd inode steals                                0           0           0           0           0
      Kswapd skipped wait                            28076         786           0          61           6
      THP fault alloc                                  609         383         753         906        1433
      THP collapse alloc                                12           6           0           0           6
      THP splits                                       536         211         456         593        1136
      THP fault fallback                              4406        4633        4263        4110        3583
      THP collapse fail                                120         127           0           0           4
      Compaction stalls                               1810         728         623         779        3200
      Compaction success                               196          53          60          80         123
      Compaction failures                             1614         675         563         699        3077
      Compaction pages moved                        193158       53545      243185      333457      226688
      Compaction move failure                         9952        9396       16424       23676       45070
      
      The main things to look at are
      
      1. Page In/out figures are much reduced by the series.
      
      2. Direct page scanning is incredibly high (264745.137 pages scanned
         per second on the vanilla kernel) but isolating PageReclaim pages
         on their own list reduces the number of pages scanned significantly.
      
      3. The fact that "Page rescued immediate" is a positive number implies
         that we sometimes race removing pages from the LRU_IMMEDIATE list
         that need to be put back on a normal LRU but it happens only for
         0.07% of the pages marked for immediate reclaim.
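
The derived rows in the vmstat summary appear consistent with simple ratios of the raw counters. The sketch below assumes that "velocity" is pages scanned divided by the "Total Elapsed Time" reported in the MMTests duration block, and that "efficiency" is pages reclaimed as a percentage of pages scanned. Both definitions are inferred from the numbers rather than stated in the text, and the function names are hypothetical.

```c
#include <assert.h>

/* Assumed definition: pages scanned per second of total elapsed test time. */
static double scan_velocity(double pages_scanned, double elapsed_seconds)
{
    assert(elapsed_seconds > 0.0);
    return pages_scanned / elapsed_seconds;
}

/* Assumed definition: percentage of scanned pages that were reclaimed. */
static double reclaim_efficiency(double reclaimed, double scanned)
{
    return scanned > 0.0 ? 100.0 * reclaimed / scanned : 0.0;
}

/* Share of one counter in another, as a percentage. */
static double percent_of(double part, double whole)
{
    return whole > 0.0 ? 100.0 * part / whole : 0.0;
}
```

For rc5-vanilla, `scan_velocity(3024951463.0, 11425.90)` comes to roughly 264745 pages per second, which matches the "Direct velocity" row, and `reclaim_efficiency(280167837.0, 3024951463.0)` is a little over 9%, matching "Direct efficiency". For isolate-v6r1, `percent_of(87848.0, 111281140.0)` gives just under 0.08%, the rescued share quoted in item 3.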
      
      writebackCPDeviceext4
                         3.1.0-vanilla         rc5-vanilla       freemore-v6r1        isolate-v6r1         andrea-v2r1
      System Time         1.51 (    0.00%)    1.77 (  -17.66%)    1.46 (    2.92%)    1.15 (   23.77%)    1.89 (  -25.63%)
      +/-                 0.27 (    0.00%)    0.67 ( -148.52%)    0.33 (  -22.76%)    0.30 (  -11.15%)    0.19 (   30.16%)
      User Time           0.03 (    0.00%)    0.04 (  -37.50%)    0.05 (  -62.50%)    0.07 ( -112.50%)    0.04 (  -18.75%)
      +/-                 0.01 (    0.00%)    0.02 ( -146.64%)    0.02 (  -97.91%)    0.02 (  -75.59%)    0.02 (  -63.30%)
      Elapsed Time      124.93 (    0.00%)  114.49 (    8.36%)   96.77 (   22.55%)   27.48 (   78.00%)  205.70 (  -64.65%)
      +/-                20.20 (    0.00%)   74.39 ( -268.34%)   59.88 ( -196.48%)    7.72 (   61.79%)   25.03 (  -23.95%)
      THP Active        161.80 (    0.00%)   83.60 (   51.67%)  141.20 (   87.27%)   84.60 (   52.29%)   82.60 (   51.05%)
      +/-                71.95 (    0.00%)   43.80 (   60.88%)   26.91 (   37.40%)   59.02 (   82.03%)   52.13 (   72.45%)
      Fault Alloc       471.40 (    0.00%)  228.60 (   48.49%)  282.20 (   59.86%)  225.20 (   47.77%)  388.40 (   82.39%)
      +/-                88.07 (    0.00%)   87.42 (   99.26%)   73.79 (   83.78%)  109.62 (  124.47%)   82.62 (   93.81%)
      Fault Fallback    531.60 (    0.00%)  774.60 (  -45.71%)  720.80 (  -35.59%)  777.80 (  -46.31%)  614.80 (  -15.65%)
      +/-                88.07 (    0.00%)   87.26 (    0.92%)   73.79 (   16.22%)  109.62 (  -24.47%)   82.29 (    6.56%)
      MMTests Statistics: duration
      User/Sys Time Running Test (seconds)         50.22     33.76     30.65     24.14    128.45
      Total Elapsed Time (seconds)               1113.73   1132.19   1029.45    759.49   1707.26
      
      Similar test but the USB stick is using ext4 instead of vfat. As
      ext4 does not use writepage for migration, the large stalls due to
      compaction when THP is enabled are not observed. Still, isolating
      PageReclaim pages on their own list helped completion time largely
      by reducing the number of pages scanned by direct reclaim although
      time spent in congestion_wait could also be a factor.
      
      Again, Andrea's series had far higher success rates for THP allocation
      at the cost of elapsed time. I didn't look too closely but a quick
      look at the vmstat figures tells me kswapd reclaimed 8 times more pages
      than the patch series and direct reclaim reclaimed roughly three times
      as many pages. It follows that if memory is aggressively reclaimed,
      there will be more available for THP.
      
      writebackCPFilevfat
                         3.1.0-vanilla         rc5-vanilla       freemore-v6r1        isolate-v6r1         andrea-v2r1
      System Time         1.76 (    0.00%)   29.10 (-1555.52%)   46.01 (-2517.18%)    4.79 ( -172.35%)   54.89 (-3022.53%)
      +/-                 0.14 (    0.00%)   25.61 (-18185.17%)    2.15 (-1434.83%)    6.60 (-4610.03%)    9.75 (-6863.76%)
      User Time           0.05 (    0.00%)    0.07 (  -45.83%)    0.05 (   -4.17%)    0.06 (  -29.17%)    0.06 (  -16.67%)
      +/-                 0.02 (    0.00%)    0.02 (   20.11%)    0.02 (   -3.14%)    0.01 (   31.58%)    0.01 (   47.41%)
      Elapsed Time     22520.79 (    0.00%) 1082.85 (   95.19%)   73.30 (   99.67%)   32.43 (   99.86%)  291.84 (   98.70%)
      +/-              7277.23 (    0.00%)  706.29 (   90.29%)   19.05 (   99.74%)   17.05 (   99.77%)  125.55 (   98.27%)
      THP Active         83.80 (    0.00%)   12.80 (   15.27%)   15.60 (   18.62%)   13.00 (   15.51%)    0.80 (    0.95%)
      +/-                66.81 (    0.00%)   20.19 (   30.22%)    5.92 (    8.86%)   15.06 (   22.54%)    1.17 (    1.75%)
      Fault Alloc       171.00 (    0.00%)   67.80 (   39.65%)   97.40 (   56.96%)  125.60 (   73.45%)  133.00 (   77.78%)
      +/-                82.91 (    0.00%)   30.69 (   37.02%)   53.91 (   65.02%)   55.05 (   66.40%)   21.19 (   25.56%)
      Fault Fallback    832.00 (    0.00%)  935.20 (  -12.40%)  906.00 (   -8.89%)  877.40 (   -5.46%)  870.20 (   -4.59%)
      +/-                82.91 (    0.00%)   30.69 (   62.98%)   54.01 (   34.86%)   55.05 (   33.60%)   20.91 (   74.78%)
      MMTests Statistics: duration
      User/Sys Time Running Test (seconds)       7229.81    928.42    704.52     80.68   1330.76
      Total Elapsed Time (seconds)             112849.04   5618.69    571.11    360.54   1664.28
      
      In this case, the test is reading/writing only from filesystems but as
      it's vfat, it's slow due to calling writepage during compaction. Little
      to observe really - the time to complete the test goes way down
      with the series applied and THP allocation success rates go up in
      comparison to 3.2-rc5.  The success rates are lower than 3.1.0 but
      the elapsed time for that kernel is abysmal so it is not really a
      sensible comparison.
      
      As before, Andrea's series allocates more THPs at the cost of overall
      performance.
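
The percentage columns in the tables above are not labelled. Judging purely from the figures, they appear to follow two conventions: rows where lower is better (the time rows and Fault Fallback) show the relative reduction against the 3.1.0-vanilla baseline, while THP Active and Fault Alloc show the value as a fraction of the baseline. The sketch below recomputes a few cells of the writebackCPFilevfat table under that assumption. The function names are made up for illustration and this is not the MMTests implementation.

```python
def reduction_pct(baseline, value):
    """Relative reduction versus the baseline; positive means lower (better)."""
    return (1.0 - value / baseline) * 100.0


def ratio_pct(baseline, value):
    """Value expressed as a percentage of the baseline."""
    return value / baseline * 100.0


# Means taken from the writebackCPFilevfat table (3.1.0-vanilla vs isolate-v6r1).
print(f"Elapsed Time   {reduction_pct(22520.79, 32.43):7.2f}%")
print(f"Fault Fallback {reduction_pct(832.00, 877.40):7.2f}%")
print(f"THP Active     {ratio_pct(83.80, 13.00):7.2f}%")
print(f"Fault Alloc    {ratio_pct(171.00, 125.60):7.2f}%")
```

Some cells, such as System Time, differ slightly from this calculation, presumably because the table prints rounded means.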
      
      writebackCPFileext4
                         3.1.0-vanilla         rc5-vanilla       freemore-v6r1        isolate-v6r1         andrea-v2r1
      System Time         1.51 (    0.00%)    1.77 (  -17.66%)    1.46 (    2.92%)    1.15 (   23.77%)    1.89 (  -25.63%)
      +/-                 0.27 (    0.00%)    0.67 ( -148.52%)    0.33 (  -22.76%)    0.30 (  -11.15%)    0.19 (   30.16%)
      User Time           0.03 (    0.00%)    0.04 (  -37.50%)    0.05 (  -62.50%)    0.07 ( -112.50%)    0.04 (  -18.75%)
      +/-                 0.01 (    0.00%)    0.02 ( -146.64%)    0.02 (  -97.91%)    0.02 (  -75.59%)    0.02 (  -63.30%)
      Elapsed Time      124.93 (    0.00%)  114.49 (    8.36%)   96.77 (   22.55%)   27.48 (   78.00%)  205.70 (  -64.65%)
      +/-                20.20 (    0.00%)   74.39 ( -268.34%)   59.88 ( -196.48%)    7.72 (   61.79%)   25.03 (  -23.95%)
      THP Active        161.80 (    0.00%)   83.60 (   51.67%)  141.20 (   87.27%)   84.60 (   52.29%)   82.60 (   51.05%)
      +/-                71.95 (    0.00%)   43.80 (   60.88%)   26.91 (   37.40%)   59.02 (   82.03%)   52.13 (   72.45%)
      Fault Alloc       471.40 (    0.00%)  228.60 (   48.49%)  282.20 (   59.86%)  225.20 (   47.77%)  388.40 (   82.39%)
      +/-                88.07 (    0.00%)   87.42 (   99.26%)   73.79 (   83.78%)  109.62 (  124.47%)   82.62 (   93.81%)
      Fault Fallback    531.60 (    0.00%)  774.60 (  -45.71%)  720.80 (  -35.59%)  777.80 (  -46.31%)  614.80 (  -15.65%)
      +/-                88.07 (    0.00%)   87.26 (    0.92%)   73.79 (   16.22%)  109.62 (  -24.47%)   82.29 (    6.56%)
      MMTests Statistics: duration
      User/Sys Time Running Test (seconds)         50.22     33.76     30.65     24.14    128.45
      Total Elapsed Time (seconds)               1113.73   1132.19   1029.45    759.49   1707.26
      
      Same type of story - elapsed times go down. In this case, allocation
      success rates are roughly the same. As before, Andrea's series has
      higher success rates but takes a lot longer.
      
      Overall the series does reduce latencies and while the tests are
      inherently racy as alloc competes with the cp processes, the variability
      was included. The THP allocation rates are not as high as they could
      be but that is because we would have to be more aggressive about
      reclaim and compaction impacting overall performance.
      
      This patch:
      
      Commit 39deaf85 ("mm: compaction: make isolate_lru_page() filter-aware")
      noted that compaction does not migrate dirty or writeback pages and that
      it was meaningless to pick the page and re-add it to the LRU list.
      
      What was missed during review is that asynchronous migration moves dirty
      pages if their ->migratepage callback is migrate_page() because these can
      be moved without blocking.  This potentially impacted hugepage allocation
      success rates by a factor depending on how many dirty pages are in the
      system.
      
      This patch partially reverts 39deaf85 to allow migration to isolate dirty
      pages again.  This increases how much compaction disrupts the LRU but that
      is addressed later in the series.
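
As a rough illustration of the gap described above, the toy model below compares a clean-only isolation filter with the rule that asynchronous migration can still move a dirty page when its ->migratepage callback is migrate_page(). It is a userspace sketch, not kernel code: the Page class, both function names, and the "some_fs_migratepage" callback are hypothetical, and skipping writeback pages in the async case is an assumption of this sketch.

```python
from dataclasses import dataclass


@dataclass
class Page:
    dirty: bool
    writeback: bool
    migratepage: str  # name of the ->migratepage callback (illustrative only)


def clean_only_filter(page):
    """Filter in the spirit of 39deaf85: skip dirty and writeback pages."""
    return not (page.dirty or page.writeback)


def async_can_migrate(page):
    """Assumed async rule: a dirty page is movable if migrate_page() handles it."""
    if page.writeback:  # assumption: async migration does not wait on writeback
        return False
    return (not page.dirty) or page.migratepage == "migrate_page"


pages = [
    Page(dirty=False, writeback=False, migratepage="migrate_page"),
    Page(dirty=True, writeback=False, migratepage="migrate_page"),
    Page(dirty=True, writeback=False, migratepage="some_fs_migratepage"),
    Page(dirty=False, writeback=True, migratepage="migrate_page"),
]

isolated_with_filter = [p for p in pages if clean_only_filter(p)]
migratable_async = [p for p in pages if async_can_migrate(p)]
print(len(isolated_with_filter), len(migratable_async))
```

In this model the clean-only filter leaves out one page that asynchronous migration could have moved, which is the kind of loss the partial revert is meant to recover.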
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Reviewed-by: Andrea Arcangeli <aarcange@redhat.com>
      Reviewed-by: Rik van Riel <riel@redhat.com>
      Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
      Cc: Dave Jones <davej@redhat.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Andy Isaacson <adi@hexapodia.org>
      Cc: Nai Xia <nai.xia@gmail.com>
      Cc: Johannes Weiner <jweiner@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      a77ebd33
  4. 11 Jan, 2012 1 commit
  5. 22 Dec, 2011 1 commit
  6. 01 Nov, 2011 4 commits
  7. 16 Jun, 2011 4 commits
  8. 23 Mar, 2011 4 commits
    • A
      mm: compaction: minimise the time IRQs are disabled while isolating pages for migration · b2eef8c0
      Andrea Arcangeli authored
      compaction_alloc() isolates pages for migration in isolate_migratepages.
      While it's scanning, IRQs are disabled on the mistaken assumption the
      scanning should be short.  Tests show this to be true for the most part
      but contention times on the LRU lock can be increased.  Before this patch,
      the IRQ disabled times for a simple test looked like
      
        Total sampled time IRQs off (not real total time): 5493
        Event shrink_inactive_list..shrink_zone                  1596 us count 1
        Event shrink_inactive_list..shrink_zone                  1530 us count 1
        Event shrink_inactive_list..shrink_zone                   956 us count 1
        Event shrink_inactive_list..shrink_zone                   541 us count 1
        Event shrink_inactive_list..shrink_zone                   531 us count 1
        Event split_huge_page..add_to_swap                        232 us count 1
        Event save_args..call_softirq                              36 us count 1
        Event save_args..call_softirq                              35 us count 2
        Event __wake_up..__wake_up                                  1 us count 1
      
      This patch reduces the worst-case IRQs-disabled latencies by releasing the
      lock every SWAP_CLUSTER_MAX pages that are scanned and releasing the CPU if
      necessary. The cost of this is that the process performing compaction will
      be slower but IRQs being disabled for too long a time has worse consequences
      as the following report shows:
      
        Total sampled time IRQs off (not real total time): 4367
        Event shrink_inactive_list..shrink_zone                   881 us count 1
        Event shrink_inactive_list..shrink_zone                   875 us count 1
        Event shrink_inactive_list..shrink_zone                   868 us count 1
        Event shrink_inactive_list..shrink_zone                   555 us count 1
        Event split_huge_page..add_to_swap                        495 us count 1
        Event compact_zone..compact_zone_order                    269 us count 1
        Event split_huge_page..add_to_swap                        266 us count 1
        Event shrink_inactive_list..shrink_zone                    85 us count 1
        Event save_args..call_softirq                              36 us count 2
        Event __wake_up..__wake_up                                  1 us count 1
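
The batching idea can be illustrated with a small userspace model that records how many pages are scanned during each lock hold. This is only a sketch under stated assumptions: scan_hold_lengths() is a hypothetical helper, the batch size of 32 is assumed to match SWAP_CLUSTER_MAX in kernels of this era, and rescheduling and lock contention checks are left out.

```python
SWAP_CLUSTER_MAX = 32  # assumed batch size for this sketch


def scan_hold_lengths(start_pfn, end_pfn, batch=SWAP_CLUSTER_MAX):
    """Return the number of pages scanned during each lock hold.

    A batch of 0 models the old behaviour of holding the lock for the
    whole scan.
    """
    holds = []
    held = 0
    for pfn in range(start_pfn, end_pfn):
        # At every batch boundary, drop the lock (re-enabling IRQs)
        # and take it again before continuing.
        if batch and (pfn + 1) % batch == 0 and held:
            holds.append(held)
            held = 0
        held += 1  # "scan" this page with the lock held
    if held:
        holds.append(held)
    return holds


print(scan_hold_lengths(0, 100, batch=0))  # one long hold
print(scan_hold_lengths(0, 100))           # several short holds
```

The total amount of work is unchanged; only the longest uninterrupted hold shrinks, which is the trade-off the message describes.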
      
      [akpm@linux-foundation.org: simplify with s/unlocked/locked/]
      Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
      Signed-off-by: Mel Gorman <mel@csn.ul.ie>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Arthur Marsh <arthur.marsh@internode.on.net>
      Cc: Clemens Ladisch <cladisch@googlemail.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Minchan Kim <minchan.kim@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      b2eef8c0
    • M
      mm: compaction: minimise the time IRQs are disabled while isolating free pages · 602605a4
      Mel Gorman authored
      compaction_alloc() isolates free pages to be used as migration targets.
      While it's scanning, IRQs are disabled on the mistaken assumption the
      scanning should be short.  Analysis showed that IRQs were in fact being
      disabled for substantial time.  A simple test was run using large
      anonymous mappings with transparent hugepage support enabled to trigger
      frequent compactions.  A monitor sampled what the worst IRQ-off latencies
      were and a post-processing tool found the following;
      
        Total sampled time IRQs off (not real total time): 22355
        Event compaction_alloc..compaction_alloc                 8409 us count 1
        Event compaction_alloc..compaction_alloc                 7341 us count 1
        Event compaction_alloc..compaction_alloc                 2463 us count 1
        Event compaction_alloc..compaction_alloc                 2054 us count 1
        Event shrink_inactive_list..shrink_zone                  1864 us count 1
        Event shrink_inactive_list..shrink_zone                    88 us count 1
        Event save_args..call_softirq                              36 us count 1
        Event save_args..call_softirq                              35 us count 2
        Event __make_request..__blk_run_queue                      24 us count 1
        Event __alloc_pages_nodemask..__alloc_pages_nodemask        6 us count 1
      
      i.e.  compaction is disabling IRQs for a prolonged period of time - 8ms in
      one instance.  The full report generated by the tool can be found at
      
       http://www.csn.ul.ie/~mel/postings/minfree-20110225/irqsoff-vanilla-micro.report
      
      This patch reduces the time IRQs are disabled by simply disabling IRQs at
      the last possible minute.  An updated IRQs-off summary report then looks
      like;
      
        Total sampled time IRQs off (not real total time): 5493
        Event shrink_inactive_list..shrink_zone                  1596 us count 1
        Event shrink_inactive_list..shrink_zone                  1530 us count 1
        Event shrink_inactive_list..shrink_zone                   956 us count 1
        Event shrink_inactive_list..shrink_zone                   541 us count 1
        Event shrink_inactive_list..shrink_zone                   531 us count 1
        Event split_huge_page..add_to_swap                        232 us count 1
        Event save_args..call_softirq                              36 us count 1
        Event save_args..call_softirq                              35 us count 2
        Event __wake_up..__wake_up                                  1 us count 1
      
      A full report is again available at
      
        http://www.csn.ul.ie/~mel/postings/minfree-20110225/irqsoff-minimiseirq-free-v1r4-micro.report
      
      As should be obvious, IRQ disabled latencies due to compaction are
      almost eliminated for this particular test.
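
One way to picture "disabling IRQs at the last possible minute" is a check, lock, recheck pattern: the cheap suitability test runs without the lock, and the lock is taken only around the isolation itself. The model below is a userspace sketch, not the kernel change; isolate_free_blocks(), the block dictionaries, and the use of threading.Lock in place of an IRQ-disabling spinlock are all assumptions made for illustration.

```python
import threading

zone_lock = threading.Lock()  # stands in for the IRQ-disabling zone lock


def isolate_free_blocks(blocks, is_suitable, isolate):
    """Scan blocks, taking the lock only around the actual isolation."""
    lock_acquisitions = 0
    isolated = []
    for block in blocks:
        # Cheap, racy check done without the lock (IRQs still enabled).
        if not is_suitable(block):
            continue
        with zone_lock:
            lock_acquisitions += 1
            # State may have changed before the lock was taken: recheck.
            if is_suitable(block):
                isolated.append(isolate(block))
    return isolated, lock_acquisitions


blocks = [{"id": i, "free": i % 4 == 0} for i in range(16)]
taken, acquisitions = isolate_free_blocks(
    blocks,
    is_suitable=lambda b: b["free"],
    isolate=lambda b: b["id"],
)
print(taken, acquisitions)
```

Of the 16 blocks scanned here, only the four that pass the unlocked check ever cause the lock to be taken.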
      
      [aarcange@redhat.com: Fix initialisation of isolated]
      Signed-off-by: Mel Gorman <mel@csn.ul.ie>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujisu.com>
      Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
      Acked-by: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Arthur Marsh <arthur.marsh@internode.on.net>
      Cc: Clemens Ladisch <cladisch@googlemail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      602605a4
    • M
      mm/compaction: check migrate_pages's return value instead of list_empty() · 9d502c1c
      Minchan Kim authored
      Since cf608ac1 ("mm: compaction: fix COMPACTPAGEFAILED counting"), many
      callers of migrate_pages() check its return value instead of
      list_empty().  This patch makes compaction's use of migrate_pages()
      consistent with the others.  It should not change existing behavior.
      Signed-off-by: Minchan Kim <minchan.kim@gmail.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Christoph Lameter <cl@linux.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      9d502c1c
    • A
      mm: compaction: prevent kswapd compacting memory to reduce CPU usage · d527caf2
      Andrea Arcangeli authored
      This patch reverts 5a03b051 ("thp: use compaction in kswapd for GFP_ATOMIC
      order > 0") due to reports stating that kswapd CPU usage was higher and
      IRQs were being disabled more frequently.  This was reported at
      http://www.spinics.net/linux/fedora/alsa-user/msg09885.html.
      
      Without this patch applied, CPU usage by kswapd hovers around the 20% mark
      according to the tester (Arthur Marsh:
      http://www.spinics.net/linux/fedora/alsa-user/msg09899.html).  With this
      patch applied, it's around 2%.
      
      The problem is not related to THP which specifies __GFP_NO_KSWAPD but is
      triggered by high-order allocations hitting the low watermark for their
      order and waking kswapd on kernels with CONFIG_COMPACTION set.  The most
      common trigger for this is network cards configured for jumbo frames but
      it's also possible it'll be triggered by fork-heavy workloads (order-1)
      and some wireless cards which depend on order-1 allocations.
      
      The symptoms for the user will be high CPU usage by kswapd in low-memory
      situations which could be confused with another writeback problem.  While
      a patch like 5a03b051 may be reintroduced in the future, this patch plays
      it safe for now and reverts it.
      
      [mel@csn.ul.ie: Beefed up the changelog]
      Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
      Signed-off-by: Mel Gorman <mel@csn.ul.ie>
      Reported-by: Arthur Marsh <arthur.marsh@internode.on.net>
      Tested-by: Arthur Marsh <arthur.marsh@internode.on.net>
      Cc: <stable@kernel.org>		[2.6.38.1]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      d527caf2
  9. 21 Jan, 2011 1 commit
  10. 14 Jan, 2011 8 commits
  11. 23 Dec, 2010 1 commit
  12. 10 Sep, 2010 1 commit
  13. 25 May, 2010 5 commits