1. 18 Feb, 2013 3 commits
  2. 09 Feb, 2013 1 commit
  3. 07 Feb, 2013 1 commit
    • jbd2: track request delay statistics · 9fff24aa
      Theodore Ts'o committed
      Track the delay between when we first request that the commit begin
      and when it actually begins, so we can see how much of a gap exists.
      In theory, this should just be the remaining scheduling quantum of
      the thread which requested the commit (assuming it was not a
      synchronous operation which triggered the commit request) plus
      scheduling overhead; however, it's possible that real-time processes
      might prevent the kjournald thread from executing.
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
  4. 17 Jan, 2013 1 commit
  5. 26 Dec, 2012 1 commit
  6. 19 Dec, 2012 1 commit
  7. 17 Dec, 2012 1 commit
  8. 12 Dec, 2012 1 commit
  9. 11 Dec, 2012 2 commits
    • mm: migrate: Add a tracepoint for migrate_pages · 7b2a2d4a
      Mel Gorman committed
      The pgmigrate_success and pgmigrate_fail vmstat counters tell the user
      about migration activity but not the type or the reason. This patch adds
      a tracepoint to identify the type of page migration and why the page is
      being migrated.
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Reviewed-by: Rik van Riel <riel@redhat.com>
    • Revert "revert "Revert "mm: remove __GFP_NO_KSWAPD""" and associated damage · caf49191
      Linus Torvalds committed
      This reverts commits a5091539 and
      d7c3b937.
      
      This is a revert of a revert of a revert.  In addition, it reverts the
      even older i915 change to stop using the __GFP_NO_KSWAPD flag due to the
      original commits in linux-next.
      
      It turns out that the original patch really was bogus, and that the
      original revert was the correct thing to do after all.  We thought we
      had fixed the problem, and then reverted the revert, but the problem
      really is fundamental: waking up kswapd simply isn't the right thing to
      do, and direct reclaim sometimes simply _is_ the right thing to do.
      
      When certain allocations fail, we simply should try some direct reclaim,
      and if that fails, fail the allocation.  That's the right thing to do
      for THP allocations, which can easily fail, and the GPU allocations want
      to do that too.
      
      So starting kswapd is sometimes simply wrong, and removing the flag that
      said "don't start kswapd" was a mistake.  Let's hope we never revisit
      this mistake again - and certainly not this many times ;)
      Acked-by: Mel Gorman <mgorman@suse.de>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  10. 01 Dec, 2012 1 commit
    • revert "Revert "mm: remove __GFP_NO_KSWAPD"" · a5091539
      Andrew Morton committed
      It appears that this patch was innocent, and we hope that "mm: avoid
      waking kswapd for THP allocations when compaction is deferred or
      contended" will fix the final kswapd-spinning cause.
      
      Cc: Zdenek Kabelac <zkabelac@redhat.com>
      Cc: Seth Jennings <sjenning@linux.vnet.ibm.com>
      Cc: Valdis Kletnieks <Valdis.Kletnieks@vt.edu>
      Cc: Jiri Slaby <jirislaby@gmail.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Robert Jennings <rcj@linux.vnet.ibm.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  11. 27 Nov, 2012 1 commit
    • Revert "mm: remove __GFP_NO_KSWAPD" · 82b212f4
      Mel Gorman committed
      With "mm: vmscan: scale number of pages reclaimed by reclaim/compaction
      based on failures" reverted, Zdenek Kabelac reported the following
      
        Hmm,  so it's just took longer to hit the problem and observe
        kswapd0 spinning on my CPU again - it's not as endless like before -
        but still it easily eats minutes - it helps to	turn off  Firefox
        or TB  (memory hungry apps) so kswapd0 stops soon - and restart
        those apps again.  (And I still have like >1GB of cached memory)
      
        kswapd0         R  running task        0    30      2 0x00000000
        Call Trace:
          preempt_schedule+0x42/0x60
          _raw_spin_unlock+0x55/0x60
          put_super+0x31/0x40
          drop_super+0x22/0x30
          prune_super+0x149/0x1b0
          shrink_slab+0xba/0x510
      
      The sysrq+m indicates the system has no swap so it'll never reclaim
      anonymous pages as part of reclaim/compaction.  That is one part of the
      problem but not the root cause as file-backed pages could also be
      reclaimed.
      
      The likely underlying problem is that kswapd is woken up or kept awake
      for each THP allocation request in the page allocator slow path.
      
      If compaction fails for the requesting process then compaction will be
      deferred for a time and direct reclaim is avoided.  However, if there
      is a storm of THP requests that are simply rejected, it will still be
      the case that kswapd is awake for a prolonged period of time as
      pgdat->kswapd_max_order is updated each time.  This is noticed by the
      main kswapd() loop and it will not call kswapd_try_to_sleep().  Instead
      it will loop, shrinking a small number of pages and calling
      shrink_slab() on each iteration.
      
      The temptation is to supply a patch that checks if kswapd was woken for
      THP and if so ignore pgdat->kswapd_max_order but it'll be a hack and not
      backed up by proper testing.  As 3.7 is very close to release and this
      is not a bug we should release with, a safer path is to revert "mm:
      remove __GFP_NO_KSWAPD" for now and revisit it with the view to ironing
      out the balance_pgdat() logic in general.
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Cc: Zdenek Kabelac <zkabelac@redhat.com>
      Cc: Seth Jennings <sjenning@linux.vnet.ibm.com>
      Cc: Valdis Kletnieks <Valdis.Kletnieks@vt.edu>
      Cc: Jiri Slaby <jirislaby@gmail.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Robert Jennings <rcj@linux.vnet.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  12. 17 Nov, 2012 1 commit
    • rcu: Add callback-free CPUs · 3fbfbf7a
      Paul E. McKenney committed
      RCU callback execution can add significant OS jitter and also can
      degrade both scheduling latency and, in asymmetric multiprocessors,
      energy efficiency.  This commit therefore adds the ability for selected
      CPUs ("rcu_nocbs=" boot parameter) to have their callbacks offloaded
      to kthreads.  If the "rcu_nocb_poll" boot parameter is also specified,
      these kthreads will do polling, removing the need for the offloaded
      CPUs to do wakeups.  At least one CPU must be doing normal callback
      processing: currently CPU 0 cannot be selected as a no-CBs CPU.
      In addition, attempts to offline the last normal-CBs CPU will fail.
      
      This feature was inspired by Jim Houston's and Joe Korty's JRCU, and
      this commit includes fixes to problems located by Fengguang Wu's
      kbuild test robot.
      
      [ paulmck: Added gfp.h include file as suggested by Fengguang Wu. ]
      Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
  13. 09 Nov, 2012 3 commits
  14. 01 Nov, 2012 1 commit
  15. 09 Oct, 2012 1 commit
  16. 03 Oct, 2012 1 commit
  17. 02 Oct, 2012 1 commit
  18. 21 Sep, 2012 1 commit
  19. 17 Aug, 2012 2 commits
  20. 01 Aug, 2012 1 commit
  21. 31 Jul, 2012 1 commit
  22. 15 Jul, 2012 1 commit
  23. 13 Jul, 2012 1 commit
    • workqueue: factor out worker_pool from global_cwq · bd7bdd43
      Tejun Heo committed
      Move worklist and all worker management fields from global_cwq into
      the new struct worker_pool.  worker_pool points back to the containing
      gcwq.  worker and cpu_workqueue_struct are updated to point to
      worker_pool instead of gcwq too.
      
      This change is mechanical and doesn't introduce any functional
      difference other than rearranging of fields and an added level of
      indirection in some places.  This is to prepare for multiple pools per
      gcwq.
      
      v2: Comment typo fixes as suggested by Namhyung.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Namhyung Kim <namhyung@kernel.org>
  24. 03 Jul, 2012 1 commit
  25. 28 Jun, 2012 1 commit
    • x86/flush_tlb: try flush_tlb_single one by one in flush_tlb_range · e7b52ffd
      Alex Shi committed
      x86 has no instruction-level support for flush_tlb_range. Currently
      flush_tlb_range is implemented by flushing the whole TLB. That is not
      the best solution for all scenarios. In fact, if we use 'invlpg' to
      flush just a few entries from the TLB, later accesses can still benefit
      from the TLB entries that remain.

      But the 'invlpg' instruction is expensive. Its execution time is
      comparable to rewriting cr3, and even slightly higher on SNB CPUs.

      So, on a CPU with 512 4KB TLB entries, the balance point is at:
      	(512 - X) * 100ns (assumed TLB refill cost) =
      		X (TLB entries flushed) * 100ns (assumed invlpg cost)

      Here, X is 256, that is, 1/2 of the 512 entries.

      But with the mysterious CPU prefetcher and page miss handler unit, the
      assumed TLB refill cost is far lower than 100ns for sequential access. And
      2 HT siblings in one core make memory access faster if they are
      accessing the same memory. So, in this patch, I only make the change when
      the number of target entries is less than 1/16 of all active TLB entries.
      Actually, I have no data supporting the '1/16' fraction, so any
      suggestions are welcome.
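
The break-even arithmetic and the 1/16 cutoff described above can be checked with a short sketch. The function names are hypothetical and the 100ns costs are the message's own assumed figures; this is an illustration of the reasoning, not the kernel implementation.

```python
from fractions import Fraction


def break_even_entries(tlb_entries, refill_ns, invlpg_ns):
    """Solve (tlb_entries - x) * refill_ns == x * invlpg_ns for x."""
    return Fraction(tlb_entries * refill_ns, refill_ns + invlpg_ns)


def use_single_flush(target_entries, active_entries, divisor=16):
    """Cutoff from the message: flush entry by entry only when the target
    range is less than 1/divisor of the active TLB entries (assumed form)."""
    return target_entries * divisor < active_entries


print(break_even_entries(512, 100, 100))  # 256
print(use_single_flush(31, 512))          # True
print(use_single_flush(32, 512))          # False
```

With equal assumed costs the break-even point is half the TLB; the patch's much lower 1/16 cutoff reflects the observation that real refill cost is well below the assumed 100ns.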
      
      As for hugetlb, I did not see any benefit in my benchmark, presumably
      due to the smaller page table and fewer active TLB entries, so it is
      not optimized for now.

      My micro benchmark shows that in ideal scenarios, read performance
      improves by 70 percent. In the worst scenario, read/write performance
      is similar to the unpatched 3.4-rc4 kernel.
      
      Here is the reading data on my 2P * 4cores *HT NHM EP machine, with THP
      'always':
      
      Multi-thread testing, where the '-t' parameter is the thread number:
                                with patch   unpatched 3.4-rc4
      ./mprotect -t 1           14ns         24ns
      ./mprotect -t 2           13ns         22ns
      ./mprotect -t 4           12ns         19ns
      ./mprotect -t 8           14ns         16ns
      ./mprotect -t 16          28ns         26ns
      ./mprotect -t 32          54ns         51ns
      ./mprotect -t 128         200ns        199ns

      Single process with sequential flushing and memory accessing:

                                            with patch   unpatched 3.4-rc4
      ./mprotect                            7ns          11ns
      ./mprotect -p 4096 -l 8 -n 10240      21ns         21ns
      
      [ hpa: http://lkml.kernel.org/r/1B4B44D9196EFF41AE41FDA404FC0A100BFF94@SHSMSX101.ccr.corp.intel.com
        has additional performance numbers. ]
      Signed-off-by: Alex Shi <alex.shi@intel.com>
      Link: http://lkml.kernel.org/r/1340845344-27557-3-git-send-email-alex.shi@intel.com
      Signed-off-by: H. Peter Anvin <hpa@zytor.com>
  26. 18 Jun, 2012 1 commit
  27. 14 Jun, 2012 1 commit
  28. 07 Jun, 2012 1 commit
  29. 30 May, 2012 4 commits
    • mm: vmscan: remove reclaim_mode_t · 23b9da55
      Mel Gorman committed
      There is little motivation for reclaim_mode_t once RECLAIM_MODE_[A]SYNC
      and lumpy reclaim have been removed.  This patch gets rid of
      reclaim_mode_t as well and improves the documentation about what
      reclaim/compaction is and when it is triggered.
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Acked-by: Rik van Riel <riel@redhat.com>
      Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Konstantin Khlebnikov <khlebnikov@openvz.org>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Ying Han <yinghan@google.com>
      Cc: Andy Whitcroft <apw@shadowen.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: vmscan: do not stall on writeback during memory compaction · 41ac1999
      Mel Gorman committed
      This patch stops reclaim/compaction entering sync reclaim as this was
      only intended for lumpy reclaim and an oversight.  Page migration has
      its own logic for stalling on writeback pages if necessary and memory
      compaction is already using it.
      
      Waiting on page writeback is bad for a number of reasons but the primary
      one is that waiting on writeback to a slow device like USB can take a
      considerable length of time.  Page reclaim instead uses
      wait_iff_congested() to throttle if too many dirty pages are being
      scanned.
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Acked-by: Rik van Riel <riel@redhat.com>
      Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Konstantin Khlebnikov <khlebnikov@openvz.org>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Ying Han <yinghan@google.com>
      Cc: Andy Whitcroft <apw@shadowen.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: vmscan: remove lumpy reclaim · c53919ad
      Mel Gorman committed
      This series removes lumpy reclaim and some stalling logic that was
      unintentionally being used by memory compaction.  The end result is that
      stalling on dirty pages during page reclaim now depends on
      wait_iff_congested().
      
      Four kernels were compared
      
        3.3.0     vanilla
        3.4.0-rc2 vanilla
        3.4.0-rc2 lumpyremove-v2 is patch one from this series
        3.4.0-rc2 nosync-v2r3 is the full series
      
      Removing lumpy reclaim saves almost 900 bytes of text whereas the full
      series removes 1200 bytes.
      
           text     data      bss       dec     hex  filename
        6740375  1927944  2260992  10929311  a6c49f  vmlinux-3.4.0-rc2-vanilla
        6739479  1927944  2260992  10928415  a6c11f  vmlinux-3.4.0-rc2-lumpyremove-v2
        6739159  1927944  2260992  10928095  a6bfdf  vmlinux-3.4.0-rc2-nosync-v2
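
The quoted savings can be re-derived from the size table. The sketch below is only a cross-check with hypothetical names: it recovers each text size as dec - data - bss, using the dec, data and bss columns as printed, and compares against the vanilla build.

```python
# data, bss and dec columns as printed in the size table (bytes)
DATA, BSS = 1927944, 2260992
DEC = {
    "vanilla": 10929311,
    "lumpyremove-v2": 10928415,
    "nosync-v2": 10928095,
}


def text_size(kernel):
    """Text size implied by dec = text + data + bss."""
    return DEC[kernel] - DATA - BSS


def text_saved(kernel):
    """Bytes of text saved relative to the vanilla build."""
    return text_size("vanilla") - text_size(kernel)


print(text_size("vanilla"))          # 6740375
print(text_saved("lumpyremove-v2"))  # 896
print(text_saved("nosync-v2"))       # 1216
```

These values agree with "almost 900 bytes" for the first patch and roughly 1200 bytes for the full series.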
      
      There are behaviour changes in the series and so tests were run with
      monitoring of ftrace events.  This disrupts results so the performance
      results are distorted but the new behaviour should be clearer.
      
      fs-mark running in a threaded configuration showed little of interest as
      it did not push reclaim aggressively
      
        FS-Mark Multi Threaded
                                   3.3.0-vanilla             rc2-vanilla        lumpyremove-v2r3             nosync-v2r3
        Files/s  min               3.20 ( 0.00%)           3.20 ( 0.00%)           3.20 ( 0.00%)           3.20 ( 0.00%)
        Files/s  mean              3.20 ( 0.00%)           3.20 ( 0.00%)           3.20 ( 0.00%)           3.20 ( 0.00%)
        Files/s  stddev            0.00 ( 0.00%)           0.00 ( 0.00%)           0.00 ( 0.00%)           0.00 ( 0.00%)
        Files/s  max               3.20 ( 0.00%)           3.20 ( 0.00%)           3.20 ( 0.00%)           3.20 ( 0.00%)
        Overhead min          508667.00 ( 0.00%)      521350.00 (-2.49%)      544292.00 (-7.00%)      547168.00 (-7.57%)
        Overhead mean         551185.00 ( 0.00%)     652690.73 (-18.42%)     991208.40 (-79.83%)      570130.53 (-3.44%)
        Overhead stddev        18200.69 ( 0.00%)   331958.29 (-1723.88%)  1579579.43 (-8578.68%)        9576.81 (47.38%)
        Overhead max          576775.00 ( 0.00%)   1846634.00 (-220.17%)  6901055.00 (-1096.49%)      585675.00 (-1.54%)
        MMTests Statistics: duration
        Sys Time Running Test (seconds)             309.90    300.95    307.33    298.95
        User+Sys Time Running Test (seconds)        319.32    309.67    315.69    307.51
        Total Elapsed Time (seconds)               1187.85   1193.09   1191.98   1193.73
      
        MMTests Statistics: vmstat
        Page Ins                                       80532       82212       81420       79480
        Page Outs                                  111434984   111456240   111437376   111582628
        Swap Ins                                           0           0           0           0
        Swap Outs                                          0           0           0           0
        Direct pages scanned                           44881       27889       27453       34843
        Kswapd pages scanned                        25841428    25860774    25861233    25843212
        Kswapd pages reclaimed                      25841393    25860741    25861199    25843179
        Direct pages reclaimed                         44881       27889       27453       34843
        Kswapd efficiency                                99%         99%         99%         99%
        Kswapd velocity                            21754.791   21675.460   21696.029   21649.127
        Direct efficiency                               100%        100%        100%        100%
        Direct velocity                               37.783      23.375      23.031      29.188
        Percentage direct scans                           0%          0%          0%          0%
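
The derived rows in the vmstat tables (efficiency, velocity, percentage direct scans) appear to follow from the raw counters and the elapsed time. The sketch below reproduces the 3.3.0-vanilla column of the fs-mark run. The formulas and the truncation to whole percentages are inferred from the printed numbers, not taken from the MMTests source, and the function names are hypothetical.

```python
def efficiency(reclaimed, scanned):
    """Pages reclaimed as a truncated percentage of pages scanned.
    Reported as 100 when nothing was scanned (assumed from the tables)."""
    if scanned == 0:
        return 100
    return reclaimed * 100 // scanned


def velocity(scanned, elapsed_seconds):
    """Pages scanned per second of total elapsed time."""
    return scanned / elapsed_seconds


def percentage_direct_scans(direct_scanned, kswapd_scanned):
    """Share of all scanning done by direct reclaim, truncated."""
    total = direct_scanned + kswapd_scanned
    return direct_scanned * 100 // total if total else 0


# fs-mark, 3.3.0-vanilla column
elapsed = 1187.85
print(efficiency(25841393, 25841428))            # 99
print(efficiency(44881, 44881))                  # 100
print(percentage_direct_scans(44881, 25841428))  # 0
print(round(velocity(25841428, elapsed), 3))
print(round(velocity(44881, elapsed), 3))
```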
      
      ftrace showed that there was no stalling on writeback or pages submitted
      for IO from reclaim context.
      
      postmark was similar and while it was more interesting, it also did not
      push reclaim heavily.
      
        POSTMARK
                                             3.3.0-vanilla       rc2-vanilla  lumpyremove-v2r3       nosync-v2r3
        Transactions per second:               16.00 ( 0.00%)    20.00 (25.00%)    18.00 (12.50%)    17.00 ( 6.25%)
        Data megabytes read per second:        18.80 ( 0.00%)    24.27 (29.10%)    22.26 (18.40%)    20.54 ( 9.26%)
        Data megabytes written per second:     35.83 ( 0.00%)    46.25 (29.08%)    42.42 (18.39%)    39.14 ( 9.24%)
        Files created alone per second:        28.00 ( 0.00%)    38.00 (35.71%)    34.00 (21.43%)    30.00 ( 7.14%)
        Files create/transact per second:       8.00 ( 0.00%)    10.00 (25.00%)     9.00 (12.50%)     8.00 ( 0.00%)
        Files deleted alone per second:       556.00 ( 0.00%)  1224.00 (120.14%)  3062.00 (450.72%)  6124.00 (1001.44%)
        Files delete/transact per second:       8.00 ( 0.00%)    10.00 (25.00%)     9.00 (12.50%)     8.00 ( 0.00%)
      
        MMTests Statistics: duration
        Sys Time Running Test (seconds)             113.34    107.99    109.73    108.72
        User+Sys Time Running Test (seconds)        145.51    139.81    143.32    143.55
        Total Elapsed Time (seconds)               1159.16    899.23    980.17   1062.27
      
        MMTests Statistics: vmstat
        Page Ins                                    13710192    13729032    13727944    13760136
        Page Outs                                   43071140    42987228    42733684    42931624
        Swap Ins                                           0           0           0           0
        Swap Outs                                          0           0           0           0
        Direct pages scanned                               0           0           0           0
        Kswapd pages scanned                         9941613     9937443     9939085     9929154
        Kswapd pages reclaimed                       9940926     9936751     9938397     9928465
        Direct pages reclaimed                             0           0           0           0
        Kswapd efficiency                                99%         99%         99%         99%
        Kswapd velocity                             8576.567   11051.058   10140.164    9347.109
        Direct efficiency                               100%        100%        100%        100%
        Direct velocity                                0.000       0.000       0.000       0.000
      
      It looks here as if the full series regresses performance, but as
      ftrace showed no usage of wait_iff_congested() or sync reclaim I am
      assuming it's a disruption due to monitoring.  Other data such as memory
      usage, page IO, swap IO all looked similar.
      
      Running a benchmark with a plain DD showed nothing very interesting.
      The full series stalled in wait_iff_congested() slightly less but stall
      times on vanilla kernels were marginal.
      
      Running a benchmark that hammered on file-backed mappings showed stalls
      due to congestion but not in sync writebacks
      
        MICRO
                                             3.3.0-vanilla       rc2-vanilla  lumpyremove-v2r3       nosync-v2r3
        MMTests Statistics: duration
        Sys Time Running Test (seconds)             308.13    294.50    298.75    299.53
        User+Sys Time Running Test (seconds)        330.45    316.28    318.93    320.79
        Total Elapsed Time (seconds)               1814.90   1833.88   1821.14   1832.91
      
        MMTests Statistics: vmstat
        Page Ins                                      108712      120708       97224      110344
        Page Outs                                  155514576   156017404   155813676   156193256
        Swap Ins                                           0           0           0           0
        Swap Outs                                          0           0           0           0
        Direct pages scanned                         2599253     1550480     2512822     2414760
        Kswapd pages scanned                        69742364    71150694    68839041    69692533
        Kswapd pages reclaimed                      34824488    34773341    34796602    34799396
        Direct pages reclaimed                         53693       94750       61792       75205
        Kswapd efficiency                                49%         48%         50%         49%
        Kswapd velocity                            38427.662   38797.901   37799.972   38022.889
        Direct efficiency                                 2%          6%          2%          3%
        Direct velocity                             1432.174     845.464    1379.807    1317.446
        Percentage direct scans                           3%          2%          3%          3%
        Page writes by reclaim                             0           0           0           0
        Page writes file                                   0           0           0           0
        Page writes anon                                   0           0           0           0
        Page reclaim immediate                             0           0           0        1218
        Page rescued immediate                             0           0           0           0
        Slabs scanned                                  15360       16384       13312       16384
        Direct inode steals                                0           0           0           0
        Kswapd inode steals                             4340        4327        1630        4323
      
        FTrace Reclaim Statistics: congestion_wait
        Direct number congest     waited                 0          0          0          0
        Direct time   congest     waited               0ms        0ms        0ms        0ms
        Direct full   congest     waited                 0          0          0          0
        Direct number conditional waited               900        870        754        789
        Direct time   conditional waited               0ms        0ms        0ms       20ms
        Direct full   conditional waited                 0          0          0          0
        KSwapd number congest     waited              2106       2308       2116       1915
        KSwapd time   congest     waited          139924ms   157832ms   125652ms   132516ms
        KSwapd full   congest     waited              1346       1530       1202       1278
        KSwapd number conditional waited             12922      16320      10943      14670
        KSwapd time   conditional waited               0ms        0ms        0ms        0ms
        KSwapd full   conditional waited                 0          0          0          0
      
      Reclaim statistics are not radically changed.  The stall times in kswapd
      are massive but it is clear that it is due to calls to congestion_wait()
      and that is almost certainly the call in balance_pgdat().  Otherwise
      stalls due to dirty pages are non-existent.
      
      I ran a benchmark that stressed high-order allocation.  This is very
      artificial load but was used in the past to evaluate lumpy reclaim and
      compaction.  Generally I look at allocation success rates and latency
      figures.
      
        STRESS-HIGHALLOC
                         3.3.0-vanilla       rc2-vanilla  lumpyremove-v2r3       nosync-v2r3
        Pass 1          81.00 ( 0.00%)    28.00 (-53.00%)    24.00 (-57.00%)    28.00 (-53.00%)
        Pass 2          82.00 ( 0.00%)    39.00 (-43.00%)    38.00 (-44.00%)    43.00 (-39.00%)
        while Rested    88.00 ( 0.00%)    87.00 (-1.00%)    88.00 ( 0.00%)    88.00 ( 0.00%)
      
        MMTests Statistics: duration
        Sys Time Running Test (seconds)             740.93    681.42    685.14    684.87
        User+Sys Time Running Test (seconds)       2922.65   3269.52   3281.35   3279.44
        Total Elapsed Time (seconds)               1161.73   1152.49   1159.55   1161.44
      
        MMTests Statistics: vmstat
        Page Ins                                     4486020     2807256     2855944     2876244
        Page Outs                                    7261600     7973688     7975320     7986120
        Swap Ins                                       31694           0           0           0
        Swap Outs                                      98179           0           0           0
        Direct pages scanned                           53494       57731       34406      113015
        Kswapd pages scanned                         6271173     1287481     1278174     1219095
        Kswapd pages reclaimed                       2029240     1281025     1260708     1201583
        Direct pages reclaimed                          1468       14564       16649       92456
        Kswapd efficiency                                32%         99%         98%         98%
        Kswapd velocity                             5398.133    1117.130    1102.302    1049.641
        Direct efficiency                                 2%         25%         48%         81%
        Direct velocity                               46.047      50.092      29.672      97.306
        Percentage direct scans                           0%          4%          2%          8%
        Page writes by reclaim                       1616049           0           0           0
        Page writes file                             1517870           0           0           0
        Page writes anon                               98179           0           0           0
        Page reclaim immediate                        103778       27339        9796       17831
        Page rescued immediate                             0           0           0           0
        Slabs scanned                                1096704      986112      980992      998400
        Direct inode steals                              223      215040      216736      247881
        Kswapd inode steals                           175331       61548       68444       63066
        Kswapd skipped wait                            21991           0           1           0
        THP fault alloc                                    1         135         125         134
        THP collapse alloc                               393         311         228         236
        THP splits                                        25          13           7           8
        THP fault fallback                                 0           0           0           0
        THP collapse fail                                  3           5           7           7
        Compaction stalls                                865        1270        1422        1518
        Compaction success                               370         401         353         383
        Compaction failures                              495         869        1069        1135
        Compaction pages moved                        870155     3828868     4036106     4423626
        Compaction move failure                        26429       23865       29742       27514
      
      Success rates are completely hosed for 3.4-rc2 which is almost certainly
      due to commit fe2c2a10 ("vmscan: reclaim at order 0 when compaction
      is enabled").  I expected this would happen for kswapd and impair
      allocation success rates (https://lkml.org/lkml/2012/1/25/166) but I did
      not anticipate this much of a difference: 80% less scanning, 37% less
      reclaim by kswapd.
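
The two percentages seem to come from the "Kswapd pages scanned" and "Kswapd pages reclaimed" rows of the STRESS-HIGHALLOC vmstat table, comparing 3.3.0-vanilla with rc2-vanilla. That pairing is an assumption based on the numbers matching; the helper name is hypothetical.

```python
def percent_less(before, after):
    """Relative reduction from before to after, in percent."""
    return (before - after) * 100 / before


# 3.3.0-vanilla versus rc2-vanilla, STRESS-HIGHALLOC vmstat rows
scanned = percent_less(6271173, 1287481)    # Kswapd pages scanned
reclaimed = percent_less(2029240, 1281025)  # Kswapd pages reclaimed

print(f"{scanned:.1f}% less scanning")
print(f"{reclaimed:.1f}% less reclaim")
```

Both results land close to the rounded 80% and 37% quoted in the message.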
      
      In comparison, reclaim/compaction is not aggressive and gives up easily
      which is the intended behaviour.  hugetlbfs uses __GFP_REPEAT and would
      be much more aggressive about reclaim/compaction than THP allocations
      are.  The stress test above is allocating like neither THP nor hugetlbfs
      but is much closer to THP.
      
      Mainline is now impaired in terms of high order allocation under heavy
      load although I do not know to what degree as I did not test with
      __GFP_REPEAT.  Keep this in mind for bugs related to hugepage pool
      resizing, THP allocation and high order atomic allocation failures from
      network devices.
      
      In terms of congestion throttling, I see the following for this test
      
        FTrace Reclaim Statistics: congestion_wait
        Direct number congest     waited                 3          0          0          0
        Direct time   congest     waited               0ms        0ms        0ms        0ms
        Direct full   congest     waited                 0          0          0          0
        Direct number conditional waited               957        512       1081       1075
        Direct time   conditional waited               0ms        0ms        0ms        0ms
        Direct full   conditional waited                 0          0          0          0
        KSwapd number congest     waited                36          4          3          5
        KSwapd time   congest     waited            3148ms      400ms      300ms      500ms
        KSwapd full   congest     waited                30          4          3          5
        KSwapd number conditional waited             88514        197        332        542
        KSwapd time   conditional waited            4980ms        0ms        0ms        0ms
        KSwapd full   conditional waited                49          0          0          0
      
      The "conditional waited" times are the most interesting as this is
      directly impacted by the number of dirty pages encountered during scan.
      As lumpy reclaim is no longer scanning contiguous ranges, it is finding
      fewer dirty pages.  This brings wait times from about 5 seconds to 0.
      kswapd itself is still calling congestion_wait() so it'll still stall but
      it's a lot less.
      
      In terms of the type of IO we were doing, I see this
      
        FTrace Reclaim Statistics: mm_vmscan_writepage
        Direct writes anon  sync                         0          0          0          0
        Direct writes anon  async                        0          0          0          0
        Direct writes file  sync                         0          0          0          0
        Direct writes file  async                        0          0          0          0
        Direct writes mixed sync                         0          0          0          0
        Direct writes mixed async                        0          0          0          0
        KSwapd writes anon  sync                         0          0          0          0
        KSwapd writes anon  async                    91682          0          0          0
        KSwapd writes file  sync                         0          0          0          0
        KSwapd writes file  async                   822629          0          0          0
        KSwapd writes mixed sync                         0          0          0          0
        KSwapd writes mixed async                        0          0          0          0
      
      In 3.2, kswapd was doing a bunch of async writes of pages but
      reclaim/compaction was never reaching a point where it was doing sync
      IO.  This does not guarantee that reclaim/compaction was not calling
      wait_on_page_writeback() but I would consider it unlikely.  It indicates
      that merging patches 2 and 3 to stop reclaim/compaction calling
      wait_on_page_writeback() should be safe.
      
      This patch:
      
      Lumpy reclaim had a purpose but in the mind of some, it was to kick the
      system so hard it thrashed.  For others the purpose was to complicate
      vmscan.c.  Over time it was given softer shoes and a nicer attitude but
      memory compaction needs to step up and replace it so this patch sends
      lumpy reclaim to the farm.
      
      The tracepoint format changes for isolating LRU pages with this patch
      applied.  Furthermore reclaim/compaction can no longer queue dirty pages
      in pageout() if the underlying BDI is congested.  Lumpy reclaim used
      this logic and reclaim/compaction was using it in error.
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Acked-by: Rik van Riel <riel@redhat.com>
      Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Konstantin Khlebnikov <khlebnikov@openvz.org>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Ying Han <yinghan@google.com>
      Cc: Andy Whitcroft <apw@shadowen.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      c53919ad
    • R
      mm: remove swap token code · e709ffd6
      Rik van Riel committed
      The swap token code no longer fits in with the current VM model.  It
      does not play well with cgroups or the better NUMA placement code in
      development, since we have only one swap token globally.
      
      It also has the potential to mess with scalability of the system, by
      increasing the number of non-reclaimable pages on the active and
      inactive anon LRU lists.
      
      Last but not least, the swap token code has been broken for a year
      without complaints, as reported by Konstantin Khlebnikov.  This suggests
      we no longer have much use for it.
      
      The days of sub-1G memory systems with heavy use of swap are over.  If
      we ever need thrashing reducing code in the future, we will have to
      implement something that does scale.
      Signed-off-by: Rik van Riel <riel@redhat.com>
      Cc: Konstantin Khlebnikov <khlebnikov@openvz.org>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Hugh Dickins <hughd@google.com>
      Acked-by: Bob Picco <bpicco@meloft.net>
      Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      e709ffd6
  30. 16 May, 2012 2 commits