1. 04 7月, 2013 40 次提交
    • C
      114d4b79
    • C
    • M
      fs: nfs: inform the VM about pages being committed or unstable · f919b196
      Mel Gorman 提交于
      VM page reclaim uses dirty and writeback page states to determine if
      flushers are cleaning pages too slowly and that page reclaim should
      stall waiting on flushers to catch up.  Page state in NFS is a bit more
      complex and a clean page can be unreclaimable due to being unstable
      which is effectively "dirty" from the perspective of the VM from reclaim
      context.  Similarly, if the inode is currently being committed then it's
      similar to being under writeback.
      
      This patch adds a is_dirty_writeback() handled for NFS that checks if a
      pages backing inode is being committed and should be accounted as
      writeback and if a page has private state indicating that it is
      effectively dirty.
      Signed-off-by: NMel Gorman <mgorman@suse.de>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Jiri Slaby <jslaby@suse.cz>
      Cc: Valdis Kletnieks <Valdis.Kletnieks@vt.edu>
      Cc: Zlatko Calusic <zcalusic@bitsync.net>
      Cc: dormando <dormando@rydia.net>
      Cc: Trond Myklebust <trond.myklebust@fys.uio.no>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      f919b196
    • M
      mm: vmscan: take page buffers dirty and locked state into account · b4597226
      Mel Gorman 提交于
      Page reclaim keeps track of dirty and under writeback pages and uses it
      to determine if wait_iff_congested() should stall or if kswapd should
      begin writing back pages.  This fails to account for buffer pages that
      can be under writeback but not PageWriteback which is the case for
      filesystems like ext3 ordered mode.  Furthermore, PageDirty buffer pages
      can have all the buffers clean and writepage does no IO so it should not
      be accounted as congested.
      
      This patch adds an address_space operation that filesystems may
      optionally use to check if a page is really dirty or really under
      writeback.  An implementation is provided for for buffer_heads is added
      and used for block operations and ext3 in ordered mode.  By default the
      page flags are obeyed.
      
      Credit goes to Jan Kara for identifying that the page flags alone are
      not sufficient for ext3 and sanity checking a number of ideas on how the
      problem could be addressed.
      Signed-off-by: NMel Gorman <mgorman@suse.de>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Jiri Slaby <jslaby@suse.cz>
      Cc: Valdis Kletnieks <Valdis.Kletnieks@vt.edu>
      Cc: Zlatko Calusic <zcalusic@bitsync.net>
      Cc: dormando <dormando@rydia.net>
      Cc: Trond Myklebust <trond.myklebust@fys.uio.no>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      b4597226
    • M
      mm: vmscan: treat pages marked for immediate reclaim as zone congestion · d04e8acd
      Mel Gorman 提交于
      Currently a zone will only be marked congested if the underlying BDI is
      congested but if dirty pages are spread across zones it is possible that
      an individual zone is full of dirty pages without being congested.  The
      impact is that zone gets scanned very quickly potentially reclaiming
      really clean pages.  This patch treats pages marked for immediate
      reclaim as congested for the purposes of marking a zone ZONE_CONGESTED
      and stalling in wait_iff_congested.
      Signed-off-by: NMel Gorman <mgorman@suse.de>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Jiri Slaby <jslaby@suse.cz>
      Cc: Valdis Kletnieks <Valdis.Kletnieks@vt.edu>
      Cc: Zlatko Calusic <zcalusic@bitsync.net>
      Cc: dormando <dormando@rydia.net>
      Cc: Trond Myklebust <trond.myklebust@fys.uio.no>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      d04e8acd
    • M
      mm: vmscan: move direct reclaim wait_iff_congested into shrink_list · 8e950282
      Mel Gorman 提交于
      shrink_inactive_list makes decisions on whether to stall based on the
      number of dirty pages encountered.  The wait_iff_congested() call in
      shrink_page_list does no such thing and it's arbitrary.
      
      This patch moves the decision on whether to set ZONE_CONGESTED and the
      wait_iff_congested call into shrink_page_list.  This keeps all the
      decisions on whether to stall or not in the one place.
      Signed-off-by: NMel Gorman <mgorman@suse.de>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Jiri Slaby <jslaby@suse.cz>
      Cc: Valdis Kletnieks <Valdis.Kletnieks@vt.edu>
      Cc: Zlatko Calusic <zcalusic@bitsync.net>
      Cc: dormando <dormando@rydia.net>
      Cc: Trond Myklebust <trond.myklebust@fys.uio.no>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      8e950282
    • M
      mm: vmscan: set zone flags before blocking · f7ab8db7
      Mel Gorman 提交于
      In shrink_page_list a decision may be made to stall and flag a zone as
      ZONE_WRITEBACK so that if a large number of unqueued dirty pages are
      encountered later then the reclaimer will stall.  Set ZONE_WRITEBACK
      before potentially going to sleep so it is noticed sooner.
      Signed-off-by: NMel Gorman <mgorman@suse.de>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Jiri Slaby <jslaby@suse.cz>
      Cc: Valdis Kletnieks <Valdis.Kletnieks@vt.edu>
      Cc: Zlatko Calusic <zcalusic@bitsync.net>
      Cc: dormando <dormando@rydia.net>
      Cc: Trond Myklebust <trond.myklebust@fys.uio.no>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      f7ab8db7
    • M
      mm: vmscan: stall page reclaim after a list of pages have been processed · b1a6f21e
      Mel Gorman 提交于
      Commit "mm: vmscan: Block kswapd if it is encountering pages under
      writeback" blocks page reclaim if it encounters pages under writeback
      marked for immediate reclaim.  It blocks while pages are still isolated
      from the LRU which is unnecessary.  This patch defers the blocking until
      after the isolated pages have been processed and tidies up some of the
      comments.
      Signed-off-by: NMel Gorman <mgorman@suse.de>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Jiri Slaby <jslaby@suse.cz>
      Cc: Valdis Kletnieks <Valdis.Kletnieks@vt.edu>
      Cc: Zlatko Calusic <zcalusic@bitsync.net>
      Cc: dormando <dormando@rydia.net>
      Cc: Trond Myklebust <trond.myklebust@fys.uio.no>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      b1a6f21e
    • M
      mm: vmscan: stall page reclaim and writeback pages based on dirty/writepage pages encountered · e2be15f6
      Mel Gorman 提交于
      Further testing of the "Reduce system disruption due to kswapd"
      discovered a few problems.  First and foremost, it's possible for pages
      under writeback to be freed which will lead to badness.  Second, as
      pages were not being swapped the file LRU was being scanned faster and
      clean file pages were being reclaimed.  In some cases this results in
      increased read IO to re-read data from disk.  Third, more pages were
      being written from kswapd context which can adversly affect IO
      performance.  Lastly, it was observed that PageDirty pages are not
      necessarily dirty on all filesystems (buffers can be clean while
      PageDirty is set and ->writepage generates no IO) and not all
      filesystems set PageWriteback when the page is being written (e.g.
      ext3).  This disconnect confuses the reclaim stalling logic.  This
      follow-up series is aimed at these problems.
      
      The tests were based on three kernels
      
      vanilla:	kernel 3.9 as that is what the current mmotm uses as a baseline
      mmotm-20130522	is mmotm as of 22nd May with "Reduce system disruption due to
      		kswapd" applied on top as per what should be in Andrew's tree
      		right now
      lessdisrupt-v7r10 is this follow-up series on top of the mmotm kernel
      
      The first test used memcached+memcachetest while some background IO was
      in progress as implemented by the parallel IO tests implement in MM
      Tests.  memcachetest benchmarks how many operations/second memcached can
      service.  It starts with no background IO on a freshly created ext4
      filesystem and then re-runs the test with larger amounts of IO in the
      background to roughly simulate a large copy in progress.  The
      expectation is that the IO should have little or no impact on
      memcachetest which is running entirely in memory.
      
      parallelio
                                                   3.9.0                       3.9.0                       3.9.0
                                                 vanilla          mm1-mmotm-20130522       mm1-lessdisrupt-v7r10
      Ops memcachetest-0M             23117.00 (  0.00%)          22780.00 ( -1.46%)          22763.00 ( -1.53%)
      Ops memcachetest-715M           23774.00 (  0.00%)          23299.00 ( -2.00%)          22934.00 ( -3.53%)
      Ops memcachetest-2385M           4208.00 (  0.00%)          24154.00 (474.00%)          23765.00 (464.76%)
      Ops memcachetest-4055M           4104.00 (  0.00%)          25130.00 (512.33%)          24614.00 (499.76%)
      Ops io-duration-0M                  0.00 (  0.00%)              0.00 (  0.00%)              0.00 (  0.00%)
      Ops io-duration-715M               12.00 (  0.00%)              7.00 ( 41.67%)              6.00 ( 50.00%)
      Ops io-duration-2385M             116.00 (  0.00%)             21.00 ( 81.90%)             21.00 ( 81.90%)
      Ops io-duration-4055M             160.00 (  0.00%)             36.00 ( 77.50%)             35.00 ( 78.12%)
      Ops swaptotal-0M                    0.00 (  0.00%)              0.00 (  0.00%)              0.00 (  0.00%)
      Ops swaptotal-715M             140138.00 (  0.00%)             18.00 ( 99.99%)             18.00 ( 99.99%)
      Ops swaptotal-2385M            385682.00 (  0.00%)              0.00 (  0.00%)              0.00 (  0.00%)
      Ops swaptotal-4055M            418029.00 (  0.00%)              0.00 (  0.00%)              0.00 (  0.00%)
      Ops swapin-0M                       0.00 (  0.00%)              0.00 (  0.00%)              0.00 (  0.00%)
      Ops swapin-715M                   144.00 (  0.00%)              0.00 (  0.00%)              0.00 (  0.00%)
      Ops swapin-2385M               134227.00 (  0.00%)              0.00 (  0.00%)              0.00 (  0.00%)
      Ops swapin-4055M               125618.00 (  0.00%)              0.00 (  0.00%)              0.00 (  0.00%)
      Ops minorfaults-0M            1536429.00 (  0.00%)        1531632.00 (  0.31%)        1533541.00 (  0.19%)
      Ops minorfaults-715M          1786996.00 (  0.00%)        1612148.00 (  9.78%)        1608832.00 (  9.97%)
      Ops minorfaults-2385M         1757952.00 (  0.00%)        1614874.00 (  8.14%)        1613541.00 (  8.21%)
      Ops minorfaults-4055M         1774460.00 (  0.00%)        1633400.00 (  7.95%)        1630881.00 (  8.09%)
      Ops majorfaults-0M                  1.00 (  0.00%)              0.00 (  0.00%)              0.00 (  0.00%)
      Ops majorfaults-715M              184.00 (  0.00%)            167.00 (  9.24%)            166.00 (  9.78%)
      Ops majorfaults-2385M           24444.00 (  0.00%)            155.00 ( 99.37%)             93.00 ( 99.62%)
      Ops majorfaults-4055M           21357.00 (  0.00%)            147.00 ( 99.31%)            134.00 ( 99.37%)
      
      memcachetest is the transactions/second reported by memcachetest. In
              the vanilla kernel note that performance drops from around
              23K/sec to just over 4K/second when there is 2385M of IO going
              on in the background. With current mmotm, there is no collapse
      	in performance and with this follow-up series there is little
      	change.
      
      swaptotal is the total amount of swap traffic. With mmotm and the follow-up
      	series, the total amount of swapping is much reduced.
      
                                       3.9.0       3.9.0       3.9.0
                                     vanillamm1-mmotm-20130522mm1-lessdisrupt-v7r10
      Minor Faults                  11160152    10706748    10622316
      Major Faults                     46305         755         678
      Swap Ins                        260249           0           0
      Swap Outs                       683860          18          18
      Direct pages scanned                 0         678        2520
      Kswapd pages scanned           6046108     8814900     1639279
      Kswapd pages reclaimed         1081954     1172267     1094635
      Direct pages reclaimed               0         566        2304
      Kswapd efficiency                  17%         13%         66%
      Kswapd velocity               5217.560    7618.953    1414.879
      Direct efficiency                 100%         83%         91%
      Direct velocity                  0.000       0.586       2.175
      Percentage direct scans             0%          0%          0%
      Zone normal velocity          5105.086    6824.681     671.158
      Zone dma32 velocity            112.473     794.858     745.896
      Zone dma velocity                0.000       0.000       0.000
      Page writes by reclaim     1929612.000 6861768.000   32821.000
      Page writes file               1245752     6861750       32803
      Page writes anon                683860          18          18
      Page reclaim immediate            7484          40         239
      Sector Reads                   1130320       93996       86900
      Sector Writes                 13508052    10823500    11804436
      Page rescued immediate               0           0           0
      Slabs scanned                    33536       27136       18560
      Direct inode steals                  0           0           0
      Kswapd inode steals               8641        1035           0
      Kswapd skipped wait                  0           0           0
      THP fault alloc                      8          37          33
      THP collapse alloc                 508         552         515
      THP splits                          24           1           1
      THP fault fallback                   0           0           0
      THP collapse fail                    0           0           0
      
      There are a number of observations to make here
      
      1. Swap outs are almost eliminated. Swap ins are 0 indicating that the
         pages swapped were really unused anonymous pages. Related to that,
         major faults are much reduced.
      
      2. kswapd efficiency was impacted by the initial series but with these
         follow-up patches, the efficiency is now at 66% indicating that far
         fewer pages were skipped during scanning due to dirty or writeback
         pages.
      
      3. kswapd velocity is reduced indicating that fewer pages are being scanned
         with the follow-up series as kswapd now stalls when the tail of the
         LRU queue is full of unqueued dirty pages. The stall gives flushers a
         chance to catch-up so kswapd can reclaim clean pages when it wakes
      
      4. In light of Zlatko's recent reports about zone scanning imbalances,
         mmtests now reports scanning velocity on a per-zone basis. With mainline,
         you can see that the scanning activity is dominated by the Normal
         zone with over 45 times more scanning in Normal than the DMA32 zone.
         With the series currently in mmotm, the ratio is slightly better but it
         is still the case that the bulk of scanning is in the highest zone. With
         this follow-up series, the ratio of scanning between the Normal and
         DMA32 zone is roughly equal.
      
      5. As Dave Chinner observed, the current patches in mmotm increased the
         number of pages written from kswapd context which is expected to adversly
         impact IO performance. With the follow-up patches, far fewer pages are
         written from kswapd context than the mainline kernel
      
      6. With the series in mmotm, fewer inodes were reclaimed by kswapd. With
         the follow-up series, there is less slab shrinking activity and no inodes
         were reclaimed.
      
      7. Note that "Sectors Read" is drastically reduced implying that the source
         data being used for the IO is not being aggressively discarded due to
         page reclaim skipping over dirty pages and reclaiming clean pages. Note
         that the reducion in reads could also be due to inode data not being
         re-read from disk after a slab shrink.
      
                             3.9.0       3.9.0       3.9.0
                           vanillamm1-mmotm-20130522mm1-lessdisrupt-v7r10
      Mean sda-avgqz        166.99       32.09       33.44
      Mean sda-await        853.64      192.76      185.43
      Mean sda-r_await        6.31        9.24        5.97
      Mean sda-w_await     2992.81      202.65      192.43
      Max  sda-avgqz       1409.91      718.75      698.98
      Max  sda-await       6665.74     3538.00     3124.23
      Max  sda-r_await       58.96      111.95       58.00
      Max  sda-w_await    28458.94     3977.29     3148.61
      
      In light of the changes in writes from reclaim context, the number of
      reads and Dave Chinner's concerns about IO performance I took a closer
      look at the IO stats for the test disk. Few observations
      
      1. The average queue size is reduced by the initial series and roughly
         the same with this follow up.
      
      2. Average wait times for writes are reduced and as the IO
         is completing faster it at least implies that the gain is because
         flushers are writing the files efficiently instead of page reclaim
         getting in the way.
      
      3. The reduction in maximum write latency is staggering. 28 seconds down
         to 3 seconds.
      
      Jan Kara asked how NFS is affected by all of this. Unstable pages can
      be taken into account as one of the patches in the series shows but it
      is still the case that filesystems with unusual handling of dirty or
      writeback could still be treated better.
      
      Tests like postmark, fsmark and largedd showed up nothing useful. On my test
      setup, pages are simply not being written back from reclaim context with or
      without the patches and there are no changes in performance. My test setup
      probably is just not strong enough network-wise to be really interesting.
      
      I ran a longer-lived memcached test with IO going to NFS instead of a local disk
      
      parallelio
                                                   3.9.0                       3.9.0                       3.9.0
                                                 vanilla          mm1-mmotm-20130522       mm1-lessdisrupt-v7r10
      Ops memcachetest-0M             23323.00 (  0.00%)          23241.00 ( -0.35%)          23321.00 ( -0.01%)
      Ops memcachetest-715M           25526.00 (  0.00%)          24763.00 ( -2.99%)          23242.00 ( -8.95%)
      Ops memcachetest-2385M           8814.00 (  0.00%)          26924.00 (205.47%)          23521.00 (166.86%)
      Ops memcachetest-4055M           5835.00 (  0.00%)          26827.00 (359.76%)          25560.00 (338.05%)
      Ops io-duration-0M                  0.00 (  0.00%)              0.00 (  0.00%)              0.00 (  0.00%)
      Ops io-duration-715M               65.00 (  0.00%)             71.00 ( -9.23%)             11.00 ( 83.08%)
      Ops io-duration-2385M             129.00 (  0.00%)             94.00 ( 27.13%)             53.00 ( 58.91%)
      Ops io-duration-4055M             301.00 (  0.00%)            100.00 ( 66.78%)            108.00 ( 64.12%)
      Ops swaptotal-0M                    0.00 (  0.00%)              0.00 (  0.00%)              0.00 (  0.00%)
      Ops swaptotal-715M              14394.00 (  0.00%)            949.00 ( 93.41%)             63.00 ( 99.56%)
      Ops swaptotal-2385M            401483.00 (  0.00%)          24437.00 ( 93.91%)          30118.00 ( 92.50%)
      Ops swaptotal-4055M            554123.00 (  0.00%)          35688.00 ( 93.56%)          63082.00 ( 88.62%)
      Ops swapin-0M                       0.00 (  0.00%)              0.00 (  0.00%)              0.00 (  0.00%)
      Ops swapin-715M                  4522.00 (  0.00%)            560.00 ( 87.62%)             63.00 ( 98.61%)
      Ops swapin-2385M               169861.00 (  0.00%)           5026.00 ( 97.04%)          13917.00 ( 91.81%)
      Ops swapin-4055M               192374.00 (  0.00%)          10056.00 ( 94.77%)          25729.00 ( 86.63%)
      Ops minorfaults-0M            1445969.00 (  0.00%)        1520878.00 ( -5.18%)        1454024.00 ( -0.56%)
      Ops minorfaults-715M          1557288.00 (  0.00%)        1528482.00 (  1.85%)        1535776.00 (  1.38%)
      Ops minorfaults-2385M         1692896.00 (  0.00%)        1570523.00 (  7.23%)        1559622.00 (  7.87%)
      Ops minorfaults-4055M         1654985.00 (  0.00%)        1581456.00 (  4.44%)        1596713.00 (  3.52%)
      Ops majorfaults-0M                  0.00 (  0.00%)              1.00 (-99.00%)              0.00 (  0.00%)
      Ops majorfaults-715M              763.00 (  0.00%)            265.00 ( 65.27%)             75.00 ( 90.17%)
      Ops majorfaults-2385M           23861.00 (  0.00%)            894.00 ( 96.25%)           2189.00 ( 90.83%)
      Ops majorfaults-4055M           27210.00 (  0.00%)           1569.00 ( 94.23%)           4088.00 ( 84.98%)
      
      1. Performance does not collapse due to IO which is good. IO is also completing
         faster. Note with mmotm, IO completes in a third of the time and faster again
         with this series applied
      
      2. Swapping is reduced, although not eliminated. The figures for the follow-up
         look bad but it does vary a bit as the stalling is not perfect for nfs
         or filesystems like ext3 with unusual handling of dirty and writeback
         pages
      
      3. There are swapins, particularly with larger amounts of IO indicating
         that active pages are being reclaimed. However, the number of much
         reduced.
      
                                       3.9.0       3.9.0       3.9.0
                                     vanillamm1-mmotm-20130522mm1-lessdisrupt-v7r10
      Minor Faults                  36339175    35025445    35219699
      Major Faults                    310964       27108       51887
      Swap Ins                       2176399      173069      333316
      Swap Outs                      3344050      357228      504824
      Direct pages scanned              8972       77283       43242
      Kswapd pages scanned          20899983     8939566    14772851
      Kswapd pages reclaimed         6193156     5172605     5231026
      Direct pages reclaimed            8450       73802       39514
      Kswapd efficiency                  29%         57%         35%
      Kswapd velocity               3929.743    1847.499    3058.840
      Direct efficiency                  94%         95%         91%
      Direct velocity                  1.687      15.972       8.954
      Percentage direct scans             0%          0%          0%
      Zone normal velocity          3721.907     939.103    2185.142
      Zone dma32 velocity            209.522     924.368     882.651
      Zone dma velocity                0.000       0.000       0.000
      Page writes by reclaim     4082185.000  526319.000  537114.000
      Page writes file                738135      169091       32290
      Page writes anon               3344050      357228      504824
      Page reclaim immediate            9524         170     5595843
      Sector Reads                   8909900      861192     1483680
      Sector Writes                 13428980     1488744     2076800
      Page rescued immediate               0           0           0
      Slabs scanned                    38016       31744       28672
      Direct inode steals                  0           0           0
      Kswapd inode steals                424           0           0
      Kswapd skipped wait                  0           0           0
      THP fault alloc                     14          15         119
      THP collapse alloc                1767        1569        1618
      THP splits                          30          29          25
      THP fault fallback                   0           0           0
      THP collapse fail                    8           5           0
      Compaction stalls                   17          41         100
      Compaction success                   7          31          95
      Compaction failures                 10          10           5
      Page migrate success              7083       22157       62217
      Page migrate failure                 0           0           0
      Compaction pages isolated        14847       48758      135830
      Compaction migrate scanned       18328       48398      138929
      Compaction free scanned        2000255      355827     1720269
      Compaction cost                      7          24          68
      
      I guess the main takeaway again is the much reduced page writes
      from reclaim context and reduced reads.
      
                             3.9.0       3.9.0       3.9.0
                           vanillamm1-mmotm-20130522mm1-lessdisrupt-v7r10
      Mean sda-avgqz         23.58        0.35        0.44
      Mean sda-await        133.47       15.72       15.46
      Mean sda-r_await        4.72        4.69        3.95
      Mean sda-w_await      507.69       28.40       33.68
      Max  sda-avgqz        680.60       12.25       23.14
      Max  sda-await       3958.89      221.83      286.22
      Max  sda-r_await       63.86       61.23       67.29
      Max  sda-w_await    11710.38      883.57     1767.28
      
      And as before, write wait times are much reduced.
      
      This patch:
      
      The patch "mm: vmscan: Have kswapd writeback pages based on dirty pages
      encountered, not priority" decides whether to writeback pages from reclaim
      context based on the number of dirty pages encountered.  This situation is
      flagged too easily and flushers are not given the chance to catch up
      resulting in more pages being written from reclaim context and potentially
      impacting IO performance.  The check for PageWriteback is also misplaced
      as it happens within a PageDirty check which is nonsense as the dirty may
      have been cleared for IO.  The accounting is updated very late and pages
      that are already under writeback, were reactivated, could not unmapped or
      could not be released are all missed.  Similarly, a page is considered
      congested for reasons other than being congested and pages that cannot be
      written out in the correct context are skipped.  Finally, it considers
      stalling and writing back filesystem pages due to encountering dirty
      anonymous pages at the tail of the LRU which is dumb.
      
      This patch causes kswapd to begin writing filesystem pages from reclaim
      context only if page reclaim found that all filesystem pages at the tail
      of the LRU were unqueued dirty pages.  Before it starts writing filesystem
      pages, it will stall to give flushers a chance to catch up.  The decision
      on whether wait_iff_congested is also now determined by dirty filesystem
      pages only.  Congested pages are based on whether the underlying BDI is
      congested regardless of the context of the reclaiming process.
      Signed-off-by: NMel Gorman <mgorman@suse.de>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Jiri Slaby <jslaby@suse.cz>
      Cc: Valdis Kletnieks <Valdis.Kletnieks@vt.edu>
      Cc: Zlatko Calusic <zcalusic@bitsync.net>
      Cc: dormando <dormando@rydia.net>
      Cc: Trond Myklebust <trond.myklebust@fys.uio.no>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      e2be15f6
    • M
      mm: vmscan: move logic from balance_pgdat() to kswapd_shrink_zone() · 7c954f6d
      Mel Gorman 提交于
      balance_pgdat() is very long and some of the logic can and should be
      internal to kswapd_shrink_zone().  Move it so the flow of
      balance_pgdat() is marginally easier to follow.
      Signed-off-by: NMel Gorman <mgorman@suse.de>
      Acked-by: NJohannes Weiner <hannes@cmpxchg.org>
      Reviewed-by: NMichal Hocko <mhocko@suse.cz>
      Acked-by: NRik van Riel <riel@redhat.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Jiri Slaby <jslaby@suse.cz>
      Cc: Valdis Kletnieks <Valdis.Kletnieks@vt.edu>
      Tested-by: NZlatko Calusic <zcalusic@bitsync.net>
      Cc: dormando <dormando@rydia.net>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      7c954f6d
    • M
      mm: vmscan: check if kswapd should writepage once per pgdat scan · b7ea3c41
      Mel Gorman 提交于
      Currently kswapd checks if it should start writepage as it shrinks each
      zone without taking into consideration if the zone is balanced or not.
      This is not wrong as such but it does not make much sense either.  This
      patch checks once per pgdat scan if kswapd should be writing pages.
      Signed-off-by: NMel Gorman <mgorman@suse.de>
      Reviewed-by: NMichal Hocko <mhocko@suse.cz>
      Acked-by: NRik van Riel <riel@redhat.com>
      Acked-by: NJohannes Weiner <hannes@cmpxchg.org>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Jiri Slaby <jslaby@suse.cz>
      Cc: Valdis Kletnieks <Valdis.Kletnieks@vt.edu>
      Tested-by: NZlatko Calusic <zcalusic@bitsync.net>
      Cc: dormando <dormando@rydia.net>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      b7ea3c41
    • M
      mm: vmscan: block kswapd if it is encountering pages under writeback · 283aba9f
      Mel Gorman 提交于
      Historically, kswapd used to congestion_wait() at higher priorities if
      it was not making forward progress.  This made no sense as the failure
      to make progress could be completely independent of IO.  It was later
      replaced by wait_iff_congested() and removed entirely by commit 258401a6
      (mm: don't wait on congested zones in balance_pgdat()) as it was
      duplicating logic in shrink_inactive_list().
      
      This is problematic.  If kswapd encounters many pages under writeback
      and it continues to scan until it reaches the high watermark then it
      will quickly skip over the pages under writeback and reclaim clean young
      pages or push applications out to swap.
      
      The use of wait_iff_congested() is not suited to kswapd as it will only
      stall if the underlying BDI is really congested or a direct reclaimer
      was unable to write to the underlying BDI.  kswapd bypasses the BDI
      congestion as it sets PF_SWAPWRITE but even if this was taken into
      account then it would cause direct reclaimers to stall on writeback
      which is not desirable.
      
      This patch sets a ZONE_WRITEBACK flag if direct reclaim or kswapd is
      encountering too many pages under writeback.  If this flag is set and
      kswapd encounters a PageReclaim page under writeback then it'll assume
      that the LRU lists are being recycled too quickly before IO can complete
      and block waiting for some IO to complete.
      Signed-off-by: NMel Gorman <mgorman@suse.de>
      Reviewed-by: NMichal Hocko <mhocko@suse.cz>
      Acked-by: NRik van Riel <riel@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Jiri Slaby <jslaby@suse.cz>
      Cc: Valdis Kletnieks <Valdis.Kletnieks@vt.edu>
      Tested-by: NZlatko Calusic <zcalusic@bitsync.net>
      Cc: dormando <dormando@rydia.net>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      283aba9f
    • M
      mm: vmscan: have kswapd writeback pages based on dirty pages encountered, not priority · d43006d5
      Mel Gorman 提交于
      Currently kswapd queues dirty pages for writeback if scanning at an
      elevated priority but the priority kswapd scans at is not related to the
      number of unqueued dirty encountered.  Since commit "mm: vmscan: Flatten
      kswapd priority loop", the priority is related to the size of the LRU
      and the zone watermark which is no indication as to whether kswapd
      should write pages or not.
      
      This patch tracks if an excessive number of unqueued dirty pages are
      being encountered at the end of the LRU.  If so, it indicates that dirty
      pages are being recycled before flusher threads can clean them and flags
      the zone so that kswapd will start writing pages until the zone is
      balanced.
      Signed-off-by: NMel Gorman <mgorman@suse.de>
      Acked-by: NJohannes Weiner <hannes@cmpxchg.org>
      Reviewed-by: NMichal Hocko <mhocko@suse.cz>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Jiri Slaby <jslaby@suse.cz>
      Cc: Valdis Kletnieks <Valdis.Kletnieks@vt.edu>
      Tested-by: NZlatko Calusic <zcalusic@bitsync.net>
      Cc: dormando <dormando@rydia.net>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      d43006d5
    • M
      mm: vmscan: do not allow kswapd to scan at maximum priority · 9aa41348
      Mel Gorman 提交于
      Page reclaim at priority 0 will scan the entire LRU as priority 0 is
      considered to be a near OOM condition.  Kswapd can reach priority 0
      quite easily if it is encountering a large number of pages it cannot
      reclaim such as pages under writeback.  When this happens, kswapd
      reclaims very aggressively even though there may be no real risk of
      allocation failure or OOM.
      
      This patch prevents kswapd reaching priority 0 and trying to reclaim the
      world.  Direct reclaimers will still reach priority 0 in the event of an
      OOM situation.
      Signed-off-by: NMel Gorman <mgorman@suse.de>
      Acked-by: NRik van Riel <riel@redhat.com>
      Acked-by: NJohannes Weiner <hannes@cmpxchg.org>
      Reviewed-by: NMichal Hocko <mhocko@suse.cz>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Jiri Slaby <jslaby@suse.cz>
      Cc: Valdis Kletnieks <Valdis.Kletnieks@vt.edu>
      Tested-by: NZlatko Calusic <zcalusic@bitsync.net>
      Cc: dormando <dormando@rydia.net>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      9aa41348
    • M
      mm: vmscan: decide whether to compact the pgdat based on reclaim progress · 2ab44f43
      Mel Gorman 提交于
      In the past, kswapd makes a decision on whether to compact memory after
      the pgdat was considered balanced.  This more or less worked but it is
      late to make such a decision and does not fit well now that kswapd makes
      a decision whether to exit the zone scanning loop depending on reclaim
      progress.
      
      This patch will compact a pgdat if at least the requested number of
      pages were reclaimed from unbalanced zones for a given priority.  If any
      zone is currently balanced, kswapd will not call compaction as it is
      expected the necessary pages are already available.
      Signed-off-by: NMel Gorman <mgorman@suse.de>
      Acked-by: NJohannes Weiner <hannes@cmpxchg.org>
      Reviewed-by: NMichal Hocko <mhocko@suse.cz>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Jiri Slaby <jslaby@suse.cz>
      Cc: Valdis Kletnieks <Valdis.Kletnieks@vt.edu>
      Tested-by: NZlatko Calusic <zcalusic@bitsync.net>
      Cc: dormando <dormando@rydia.net>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      2ab44f43
    • M
      mm: vmscan: flatten kswapd priority loop · b8e83b94
      Mel Gorman 提交于
      kswapd stops raising the scanning priority when at least
      SWAP_CLUSTER_MAX pages have been reclaimed or the pgdat is considered
      balanced.  It then rechecks if it needs to restart at DEF_PRIORITY and
      whether high-order reclaim needs to be reset.  This is not wrong per-se
      but it is confusing to follow and forcing kswapd to stay at DEF_PRIORITY
      may require several restarts before it has scanned enough pages to meet
      the high watermark even at 100% efficiency.  This patch irons out the
      logic a bit by controlling when priority is raised and removing the
      "goto loop_again".
      
      This patch has kswapd raise the scanning priority until it is scanning
      enough pages that it could meet the high watermark in one shrink of the
      LRU lists if it is able to reclaim at 100% efficiency.  It will not
      raise the scanning prioirty higher unless it is failing to reclaim any
      pages.
      
      To avoid infinite looping for high-order allocation requests kswapd will
      not reclaim for high-order allocations when it has reclaimed at least
      twice the number of pages as the allocation request.
      Signed-off-by: NMel Gorman <mgorman@suse.de>
      Acked-by: NJohannes Weiner <hannes@cmpxchg.org>
      Reviewed-by: NMichal Hocko <mhocko@suse.cz>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Jiri Slaby <jslaby@suse.cz>
      Cc: Valdis Kletnieks <Valdis.Kletnieks@vt.edu>
      Tested-by: NZlatko Calusic <zcalusic@bitsync.net>
      Cc: dormando <dormando@rydia.net>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      b8e83b94
    • M
      mm: vmscan: obey proportional scanning requirements for kswapd · e82e0561
      Mel Gorman 提交于
      Simplistically, the anon and file LRU lists are scanned proportionally
      depending on the value of vm.swappiness although there are other factors
      taken into account by get_scan_count().  The patch "mm: vmscan: Limit
      the number of pages kswapd reclaims" limits the number of pages kswapd
      reclaims but it breaks this proportional scanning and may evenly shrink
      anon/file LRUs regardless of vm.swappiness.
      
      This patch preserves the proportional scanning and reclaim.  It does
      mean that kswapd will reclaim more than requested but the number of
      pages will be related to the high watermark.
      
      [mhocko@suse.cz: Correct proportional reclaim for memcg and simplify]
      [kamezawa.hiroyu@jp.fujitsu.com: Recalculate scan based on target]
      [hannes@cmpxchg.org: Account for already scanned pages properly]
      Signed-off-by: NMel Gorman <mgorman@suse.de>
      Acked-by: NRik van Riel <riel@redhat.com>
      Reviewed-by: NMichal Hocko <mhocko@suse.cz>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: NKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Jiri Slaby <jslaby@suse.cz>
      Cc: Valdis Kletnieks <Valdis.Kletnieks@vt.edu>
      Tested-by: NZlatko Calusic <zcalusic@bitsync.net>
      Cc: dormando <dormando@rydia.net>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      e82e0561
    • M
      mm: vmscan: limit the number of pages kswapd reclaims at each priority · 75485363
      Mel Gorman 提交于
      This series does not fix all the current known problems with reclaim but
      it addresses one important swapping bug when there is background IO.
      
      Changelog since V3
       - Drop the slab shrink changes in light of Glaubers series and
         discussions highlighted that there were a number of potential
         problems with the patch.					(mel)
       - Rebased to 3.10-rc1
      
      Changelog since V2
       - Preserve ratio properly for proportional scanning		(kamezawa)
      
      Changelog since V1
       - Rename ZONE_DIRTY to ZONE_TAIL_LRU_DIRTY			(andi)
       - Reformat comment in shrink_page_list				(andi)
       - Clarify some comments					(dhillf)
       - Rework how the proportional scanning is preserved
       - Add PageReclaim check before kswapd starts writeback
       - Reset sc.nr_reclaimed on every full zone scan
      
      Kswapd and page reclaim behaviour has been screwy in one way or the
      other for a long time.  Very broadly speaking it worked in the far past
      because machines were limited in memory so it did not have that many
      pages to scan and it stalled congestion_wait() frequently to prevent it
      going completely nuts.  In recent times it has behaved very
      unsatisfactorily with some of the problems compounded by the removal of
      stall logic and the introduction of transparent hugepage support with
      high-order reclaims.
      
      There are many variations of bugs that are rooted in this area.  One
      example is reports of a large copy operations or backup causing the
      machine to grind to a halt or applications pushed to swap.  Sometimes in
      low memory situations a large percentage of memory suddenly gets
      reclaimed.  In other cases an application starts and kswapd hits 100%
      CPU usage for prolonged periods of time and so on.  There is now talk of
      introducing features like an extra free kbytes tunable to work around
      aspects of the problem instead of trying to deal with it.  It's
      compounded by the problem that it can be very workload and machine
      specific.
      
      This series aims at addressing some of the worst of these problems
      without attempting to fundmentally alter how page reclaim works.
      
      Patches 1-2 limits the number of pages kswapd reclaims while still obeying
      	the anon/file proportion of the LRUs it should be scanning.
      
      Patches 3-4 control how and when kswapd raises its scanning priority and
      	deletes the scanning restart logic which is tricky to follow.
      
      Patch 5 notes that it is too easy for kswapd to reach priority 0 when
      	scanning and then reclaim the world. Down with that sort of thing.
      
      Patch 6 notes that kswapd starts writeback based on scanning priority which
      	is not necessarily related to dirty pages. It will have kswapd
      	writeback pages if a number of unqueued dirty pages have been
      	recently encountered at the tail of the LRU.
      
      Patch 7 notes that sometimes kswapd should stall waiting on IO to complete
      	to reduce LRU churn and the likelihood that it'll reclaim young
      	clean pages or push applications to swap. It will cause kswapd
      	to block on IO if it detects that pages being reclaimed under
      	writeback are recycling through the LRU before the IO completes.
      
      Patchies 8-9 are cosmetic but balance_pgdat() is easier to follow after they
      	are applied.
      
      This was tested using memcached+memcachetest while some background IO
      was in progress as implemented by the parallel IO tests implement in MM
      Tests.
      
      memcachetest benchmarks how many operations/second memcached can service
      and it is run multiple times.  It starts with no background IO and then
      re-runs the test with larger amounts of IO in the background to roughly
      simulate a large copy in progress.  The expectation is that the IO
      should have little or no impact on memcachetest which is running
      entirely in memory.
      
                                              3.10.0-rc1                  3.10.0-rc1
                                                 vanilla            lessdisrupt-v4
      Ops memcachetest-0M             22155.00 (  0.00%)          22180.00 (  0.11%)
      Ops memcachetest-715M           22720.00 (  0.00%)          22355.00 ( -1.61%)
      Ops memcachetest-2385M           3939.00 (  0.00%)          23450.00 (495.33%)
      Ops memcachetest-4055M           3628.00 (  0.00%)          24341.00 (570.92%)
      Ops io-duration-0M                  0.00 (  0.00%)              0.00 (  0.00%)
      Ops io-duration-715M               12.00 (  0.00%)              7.00 ( 41.67%)
      Ops io-duration-2385M             118.00 (  0.00%)             21.00 ( 82.20%)
      Ops io-duration-4055M             162.00 (  0.00%)             36.00 ( 77.78%)
      Ops swaptotal-0M                    0.00 (  0.00%)              0.00 (  0.00%)
      Ops swaptotal-715M             140134.00 (  0.00%)             18.00 ( 99.99%)
      Ops swaptotal-2385M            392438.00 (  0.00%)              0.00 (  0.00%)
      Ops swaptotal-4055M            449037.00 (  0.00%)          27864.00 ( 93.79%)
      Ops swapin-0M                       0.00 (  0.00%)              0.00 (  0.00%)
      Ops swapin-715M                     0.00 (  0.00%)              0.00 (  0.00%)
      Ops swapin-2385M               148031.00 (  0.00%)              0.00 (  0.00%)
      Ops swapin-4055M               135109.00 (  0.00%)              0.00 (  0.00%)
      Ops minorfaults-0M            1529984.00 (  0.00%)        1530235.00 ( -0.02%)
      Ops minorfaults-715M          1794168.00 (  0.00%)        1613750.00 ( 10.06%)
      Ops minorfaults-2385M         1739813.00 (  0.00%)        1609396.00 (  7.50%)
      Ops minorfaults-4055M         1754460.00 (  0.00%)        1614810.00 (  7.96%)
      Ops majorfaults-0M                  0.00 (  0.00%)              0.00 (  0.00%)
      Ops majorfaults-715M              185.00 (  0.00%)            180.00 (  2.70%)
      Ops majorfaults-2385M           24472.00 (  0.00%)            101.00 ( 99.59%)
      Ops majorfaults-4055M           22302.00 (  0.00%)            229.00 ( 98.97%)
      
      Note how the vanilla kernels performance collapses when there is enough
      IO taking place in the background.  This drop in performance is part of
      what users complain of when they start backups.  Note how the swapin and
      major fault figures indicate that processes were being pushed to swap
      prematurely.  With the series applied, there is no noticable performance
      drop and while there is still some swap activity, it's tiny.
      
      20 iterations of this test were run in total and averaged.  Every 5
      iterations, additional IO was generated in the background using dd to
      measure how the workload was impacted.  The 0M, 715M, 2385M and 4055M
      subblock refer to the amount of IO going on in the background at each
      iteration.  So memcachetest-2385M is reporting how many
      transactions/second memcachetest recorded on average over 5 iterations
      while there was 2385M of IO going on in the ground.  There are six
      blocks of information reported here
      
      memcachetest is the transactions/second reported by memcachetest. In
      	the vanilla kernel note that performance drops from around
      	22K/sec to just under 4K/second when there is 2385M of IO going
      	on in the background. This is one type of performance collapse
      	users complain about if a large cp or backup starts in the
      	background
      
      io-duration refers to how long it takes for the background IO to
      	complete. It's showing that with the patched kernel that the IO
      	completes faster while not interfering with the memcache
      	workload
      
      swaptotal is the total amount of swap traffic. With the patched kernel,
      	the total amount of swapping is much reduced although it is
      	still not zero.
      
      swapin in this case is an indication as to whether we are swap trashing.
      	The closer the swapin/swapout ratio is to 1, the worse the
      	trashing is.  Note with the patched kernel that there is no swapin
      	activity indicating that all the pages swapped were really inactive
      	unused pages.
      
      minorfaults are just minor faults. An increased number of minor faults
      	can indicate that page reclaim is unmapping the pages but not
      	swapping them out before they are faulted back in. With the
      	patched kernel, there is only a small change in minor faults
      
      majorfaults are just major faults in the target workload and a high
      	number can indicate that a workload is being prematurely
      	swapped. With the patched kernel, major faults are much reduced. As
      	there are no swapin's recorded so it's not being swapped. The likely
      	explanation is that that libraries or configuration files used by
      	the workload during startup get paged out by the background IO.
      
      Overall with the series applied, there is no noticable performance drop
      due to background IO and while there is still some swap activity, it's
      tiny and the lack of swapins imply that the swapped pages were inactive
      and unused.
      
                                  3.10.0-rc1  3.10.0-rc1
                                     vanilla lessdisrupt-v4
      Page Ins                       1234608      101892
      Page Outs                     12446272    11810468
      Swap Ins                        283406           0
      Swap Outs                       698469       27882
      Direct pages scanned                 0      136480
      Kswapd pages scanned           6266537     5369364
      Kswapd pages reclaimed         1088989      930832
      Direct pages reclaimed               0      120901
      Kswapd efficiency                  17%         17%
      Kswapd velocity               5398.371    4635.115
      Direct efficiency                 100%         88%
      Direct velocity                  0.000     117.817
      Percentage direct scans             0%          2%
      Page writes by reclaim         1655843     4009929
      Page writes file                957374     3982047
      Page writes anon                698469       27882
      Page reclaim immediate            5245        1745
      Page rescued immediate               0           0
      Slabs scanned                    33664       25216
      Direct inode steals                  0           0
      Kswapd inode steals              19409         778
      Kswapd skipped wait                  0           0
      THP fault alloc                     35          30
      THP collapse alloc                 472         401
      THP splits                          27          22
      THP fault fallback                   0           0
      THP collapse fail                    0           1
      Compaction stalls                    0           4
      Compaction success                   0           0
      Compaction failures                  0           4
      Page migrate success                 0           0
      Page migrate failure                 0           0
      Compaction pages isolated            0           0
      Compaction migrate scanned           0           0
      Compaction free scanned              0           0
      Compaction cost                      0           0
      NUMA PTE updates                     0           0
      NUMA hint faults                     0           0
      NUMA hint local faults               0           0
      NUMA pages migrated                  0           0
      AutoNUMA cost                        0           0
      
      Unfortunately, note that there is a small amount of direct reclaim due to
      kswapd no longer reclaiming the world.  ftrace indicates that the direct
      reclaim stalls are mostly harmless with the vast bulk of the stalls
      incurred by dd
      
           23 tclsh-3367
           38 memcachetest-13733
           49 memcachetest-12443
           57 tee-3368
         1541 dd-13826
         1981 dd-12539
      
      A consequence of the direct reclaim for dd is that the processes for the
      IO workload may show a higher system CPU usage.  There is also a risk that
      kswapd not reclaiming the world may mean that it stays awake balancing
      zones, does not stall on the appropriate events and continually scans
      pages it cannot reclaim consuming CPU.  This will be visible as continued
      high CPU usage but in my own tests I only saw a single spike lasting less
      than a second and I did not observe any problems related to reclaim while
      running the series on my desktop.
      
      This patch:
      
      The number of pages kswapd can reclaim is bound by the number of pages it
      scans which is related to the size of the zone and the scanning priority.
      In many cases the priority remains low because it's reset every
      SWAP_CLUSTER_MAX reclaimed pages but in the event kswapd scans a large
      number of pages it cannot reclaim, it will raise the priority and
      potentially discard a large percentage of the zone as sc->nr_to_reclaim is
      ULONG_MAX.  The user-visible effect is a reclaim "spike" where a large
      percentage of memory is suddenly freed.  It would be bad enough if this
      was just unused memory but because of how anon/file pages are balanced it
      is possible that applications get pushed to swap unnecessarily.
      
      This patch limits the number of pages kswapd will reclaim to the high
      watermark.  Reclaim will still overshoot due to it not being a hard limit
      as shrink_lruvec() will ignore the sc.nr_to_reclaim at DEF_PRIORITY but it
      prevents kswapd reclaiming the world at higher priorities.  The number of
      pages it reclaims is not adjusted for high-order allocations as kswapd
      will reclaim excessively if it is to balance zones for high-order
      allocations.
      Signed-off-by: NMel Gorman <mgorman@suse.de>
      Reviewed-by: NRik van Riel <riel@redhat.com>
      Reviewed-by: NMichal Hocko <mhocko@suse.cz>
      Acked-by: NJohannes Weiner <hannes@cmpxchg.org>
      Acked-by: NKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Jiri Slaby <jslaby@suse.cz>
      Cc: Valdis Kletnieks <Valdis.Kletnieks@vt.edu>
      Tested-by: NZlatko Calusic <zcalusic@bitsync.net>
      Cc: dormando <dormando@rydia.net>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      75485363
    • C
      mm/page_alloc: don't re-init pageset in zone_pcp_update() · 169f6c19
      Cody P Schafer 提交于
      When memory hotplug is triggered, we call pageset_init() on
      per-cpu-pagesets which both contain pages and are in use, causing both the
      leakage of those pages and (potentially) bad behaviour if a page is
      allocated from a pageset while it is being cleared.
      
      Avoid this by factoring out pageset_set_high_and_batch() (which contains
      all needed logic too set a pageset's ->high and ->batch inrespective of
      system state) from zone_pageset_init() and using the new
      pageset_set_high_and_batch() instead of zone_pageset_init() in
      zone_pcp_update().
      Signed-off-by: NCody P Schafer <cody@linux.vnet.ibm.com>
      Cc: Valdis Kletnieks <Valdis.Kletnieks@vt.edu>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      169f6c19
    • C
      mm/page_alloc: rename setup_pagelist_highmark() to match naming of pageset_set_batch() · 3664033c
      Cody P Schafer 提交于
      Signed-off-by: NCody P Schafer <cody@linux.vnet.ibm.com>
      Cc: Gilad Ben-Yossef <gilad@benyossef.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@gmail.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Pekka Enberg <penberg@kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      3664033c
    • C
      mm/page_alloc: in zone_pcp_update(), uze zone_pageset_init() · 737af4c0
      Cody P Schafer 提交于
      Previously, zone_pcp_update() called pageset_set_batch() directly,
      essentially assuming that percpu_pagelist_fraction == 0.
      
      Correct this by calling zone_pageset_init(), which chooses the
      appropriate ->batch and ->high calculations.
      Signed-off-by: NCody P Schafer <cody@linux.vnet.ibm.com>
      Cc: Gilad Ben-Yossef <gilad@benyossef.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@gmail.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Pekka Enberg <penberg@kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      737af4c0
    • C
      mm/page_alloc: factor zone_pageset_init() out of setup_zone_pageset() · 56cef2b8
      Cody P Schafer 提交于
      Signed-off-by: NCody P Schafer <cody@linux.vnet.ibm.com>
      Cc: Gilad Ben-Yossef <gilad@benyossef.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@gmail.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Pekka Enberg <penberg@kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      56cef2b8
    • C
      mm/page_alloc: relocate comment to be directly above code it refers to. · dd1895e2
      Cody P Schafer 提交于
      Signed-off-by: NCody P Schafer <cody@linux.vnet.ibm.com>
      Cc: Gilad Ben-Yossef <gilad@benyossef.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@gmail.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Pekka Enberg <penberg@kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      dd1895e2
    • C
      mm/page_alloc: factor setup_pageset() into pageset_init() and pageset_set_batch() · 88c90dbc
      Cody P Schafer 提交于
      Signed-off-by: NCody P Schafer <cody@linux.vnet.ibm.com>
      Cc: Gilad Ben-Yossef <gilad@benyossef.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@gmail.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Pekka Enberg <penberg@kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      88c90dbc
    • C
      mm/page_alloc: when handling percpu_pagelist_fraction, don't unneedly recalulate high · 22a7f12b
      Cody P Schafer 提交于
      Simply moves calculation of the new 'high' value outside the
      for_each_possible_cpu() loop, as it does not depend on the cpu.
      Signed-off-by: NCody P Schafer <cody@linux.vnet.ibm.com>
      Cc: Gilad Ben-Yossef <gilad@benyossef.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@gmail.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Pekka Enberg <penberg@kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      22a7f12b
    • C
      mm/page_alloc: convert zone_pcp_update() to rely on memory barriers instead of stop_machine() · 0a647f38
      Cody P Schafer 提交于
      zone_pcp_update()'s goal is to adjust the ->high and ->mark members of a
      percpu pageset based on a zone's ->managed_pages.  We don't need to drain
      the entire percpu pageset just to modify these fields.
      
      This lets us avoid calling setup_pageset() (and the draining required to
      call it) and instead allows simply setting the fields' values (with some
      attention paid to memory barriers to prevent the relationship between
      ->batch and ->high from being thrown off).
      
      This does change the behavior of zone_pcp_update() as the percpu pagesets
      will not be drained when zone_pcp_update() is called (they will end up
      being shrunk, not completely drained, later when a 0-order page is freed
      in free_hot_cold_page()).
      Signed-off-by: NCody P Schafer <cody@linux.vnet.ibm.com>
      Cc: Gilad Ben-Yossef <gilad@benyossef.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@gmail.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Pekka Enberg <penberg@kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      0a647f38
    • C
      mm/page_alloc: protect pcp->batch accesses with ACCESS_ONCE · 998d39cb
      Cody P Schafer 提交于
      pcp->batch could change at any point, avoid relying on it being a stable
      value.
      Signed-off-by: NCody P Schafer <cody@linux.vnet.ibm.com>
      Cc: Gilad Ben-Yossef <gilad@benyossef.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@gmail.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Pekka Enberg <penberg@kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      998d39cb
    • C
      mm/page_alloc: insert memory barriers to allow async update of pcp batch and high · 8d7a8fa9
      Cody P Schafer 提交于
      Introduce pageset_update() to perform a safe transision from one set of
      pcp->{batch,high} to a new set using memory barriers.
      
      This ensures that batch is always set to a safe value (1) prior to
      updating high, and ensure that high is fully updated before setting the
      real value of batch.  It avoids ->batch ever rising above ->high.
      
      Suggested by Gilad Ben-Yossef in these threads:
      
      	https://lkml.org/lkml/2013/4/9/23
      	https://lkml.org/lkml/2013/4/10/49
      
      Also reproduces his proposed comment.
      Signed-off-by: NCody P Schafer <cody@linux.vnet.ibm.com>
      Reviewed-by: NGilad Ben-Yossef <gilad@benyossef.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@gmail.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Pekka Enberg <penberg@kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      8d7a8fa9
    • C
      mm/page_alloc: prevent concurrent updaters of pcp ->batch and ->high · c8e251fa
      Cody P Schafer 提交于
      Because we are going to rely upon a careful transision between old and new
      ->high and ->batch values using memory barriers and will remove
      stop_machine(), we need to prevent multiple updaters from interweaving
      their memory writes.
      
      Add a simple mutex to protect both update loops.
      Signed-off-by: NCody P Schafer <cody@linux.vnet.ibm.com>
      Cc: Gilad Ben-Yossef <gilad@benyossef.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@gmail.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Pekka Enberg <penberg@kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      c8e251fa
    • C
      mm/page_alloc: factor out setting of pcp->high and pcp->batch · 4008bab7
      Cody P Schafer 提交于
      "Problems" with the current code:
      
      1: there is a lack of synchronization in setting ->high and ->batch in
         percpu_pagelist_fraction_sysctl_handler()
      
      2: stop_machine() in zone_pcp_update() is unnecissary.
      
      3: zone_pcp_update() does not consider the case where
         percpu_pagelist_fraction is non-zero
      
      To fix:
      
      1: add memory barriers, a safe ->batch value, an update side mutex when
         updating ->high and ->batch, and use ACCESS_ONCE() for ->batch users
         that expect a stable value.
      
      2: avoid draining pages in zone_pcp_update(), rely upon the memory
         barriers added to fix #1
      
      3: factor out quite a few functions, and then call the appropriate one.
      
      Note that it results in a change to the behavior of zone_pcp_update(),
      which is used by memory_hotplug.  I'm rather certain that I've diserned
      (and preserved) the essential behavior (changing ->high and ->batch), and
      only eliminated unneeded actions (draining the per cpu pages), but this
      may not be the case.
      
      Further note that the draining of pages that previously took place in
      zone_pcp_update() occured after repeated draining when attempting to
      offline a page, and after the offline has "succeeded".  It appears that
      the draining was added to zone_pcp_update() to avoid refactoring
      setup_pageset() into 2 funtions.
      
      This patch:
      
      Creates pageset_set_batch() for use in setup_pageset().
      pageset_set_batch() imitates the functionality of
      setup_pagelist_highmark(), but uses the boot time
      (percpu_pagelist_fraction == 0) calculations for determining ->high based
      on ->batch.
      Signed-off-by: NCody P Schafer <cody@linux.vnet.ibm.com>
      Cc: Gilad Ben-Yossef <gilad@benyossef.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@gmail.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Pekka Enberg <penberg@kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      4008bab7
    • L
      uio: use vma_pages() to replace (vm_end - vm_start) >> PAGE_SHIFT · 52c2dad9
      Libin 提交于
      (*->vm_end - *->vm_start) >> PAGE_SHIFT operation is implemented
      as a inline funcion vma_pages() in linux/mm.h, so using it.
      Signed-off-by: NLibin <huawei.libin@huawei.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      52c2dad9
    • L
      ncpfs: use vma_pages() to replace (vm_end - vm_start) >> PAGE_SHIFT · ef9f515a
      Libin 提交于
      (*->vm_end - *->vm_start) >> PAGE_SHIFT operation is implemented
      as a inline funcion vma_pages() in linux/mm.h, so using it.
      Signed-off-by: NLibin <huawei.libin@huawei.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      ef9f515a
    • L
      mm: use vma_pages() to replace (vm_end - vm_start) >> PAGE_SHIFT · d6e93217
      Libin 提交于
      (*->vm_end - *->vm_start) >> PAGE_SHIFT operation is implemented
      as a inline funcion vma_pages() in linux/mm.h, so using it.
      Signed-off-by: NLibin <huawei.libin@huawei.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      d6e93217
    • M
      mm: remove compressed copy from zram in-memory · b430e9d1
      Minchan Kim 提交于
      Swap subsystem does lazy swap slot free with expecting the page would be
      swapped out again so we can avoid unnecessary write.
      
      But the problem in in-memory swap(ex, zram) is that it consumes memory
      space until vm_swap_full(ie, used half of all of swap device) condition
      meet.  It could be bad if we use multiple swap device, small in-memory
      swap and big storage swap or in-memory swap alone.
      
      This patch makes swap subsystem free swap slot as soon as swap-read is
      completed and make the swapcache page dirty so the page should be
      written out the swap device to reclaim it.  It means we never lose it.
      
      I tested this patch with kernel compile workload.
      
      1. before
      
         compile time : 9882.42
         zram max wasted space by fragmentation: 13471881 byte
         memory space consumed by zram: 174227456 byte
         the number of slot free notify: 206684
      
      2. after
      
         compile time : 9653.90
         zram max wasted space by fragmentation: 11805932 byte
         memory space consumed by zram: 154001408 byte
         the number of slot free notify: 426972
      
      [akpm@linux-foundation.org: tweak comment text]
      [artem.savkov@gmail.com: fix BUG due to non-swapcache pages in end_swap_bio_read()]
      [akpm@linux-foundation.org: invert unlikely() test, augment comment, 80-col cleanup]
      Signed-off-by: NDan Magenheimer <dan.magenheimer@oracle.com>
      Signed-off-by: NMinchan Kim <minchan@kernel.org>
      Signed-off-by: NArtem Savkov <artem.savkov@gmail.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Seth Jennings <sjenning@linux.vnet.ibm.com>
      Cc: Nitin Gupta <ngupta@vflare.org>
      Cc: Konrad Rzeszutek Wilk <konrad@darnok.org>
      Cc: Shaohua Li <shli@kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      b430e9d1
    • D
      mm, memcg: don't take task_lock in task_in_mem_cgroup · ffbdccf5
      David Rientjes 提交于
      For processes that have detached their mm's, task_in_mem_cgroup()
      unnecessarily takes task_lock() when rcu_read_lock() is all that is
      necessary to call mem_cgroup_from_task().
      
      While we're here, switch task_in_mem_cgroup() to return bool.
      Signed-off-by: NDavid Rientjes <rientjes@google.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: NMichal Hocko <mhocko@suse.cz>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      ffbdccf5
    • P
      pagemap: prepare to reuse constant bits with page-shift · 541c237c
      Pavel Emelyanov 提交于
      In order to reuse bits from pagemap entries gracefully, we leave the
      entries as is but on pagemap open emit a warning in dmesg, that bits
      55-60 are about to change in a couple of releases.  Next, if a user
      issues soft-dirty clear command via the clear_refs file (it was disabled
      before v3.9) we assume that he's aware of the new pagemap format, note
      that fact and report the bits in pagemap in the new manner.
      
      The "migration strategy" looks like this then:
      
      1. existing users are not affected -- they don't touch soft-dirty feature, thus
         see old bits in pagemap, but are warned and have time to fix themselves
      2. those who use soft-dirty know about new pagemap format
      3. some time soon we get rid of any signs of page-shift in pagemap as well as
         this trick with clear-soft-dirty affecting pagemap format.
      Signed-off-by: NPavel Emelyanov <xemul@parallels.com>
      Cc: Matt Mackall <mpm@selenic.com>
      Cc: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
      Cc: Glauber Costa <glommer@parallels.com>
      Cc: Marcelo Tosatti <mtosatti@redhat.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@gmail.com>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      541c237c
    • P
      mm: soft-dirty bits for user memory changes tracking · 0f8975ec
      Pavel Emelyanov 提交于
      The soft-dirty is a bit on a PTE which helps to track which pages a task
      writes to.  In order to do this tracking one should
      
        1. Clear soft-dirty bits from PTEs ("echo 4 > /proc/PID/clear_refs)
        2. Wait some time.
        3. Read soft-dirty bits (55'th in /proc/PID/pagemap2 entries)
      
      To do this tracking, the writable bit is cleared from PTEs when the
      soft-dirty bit is.  Thus, after this, when the task tries to modify a
      page at some virtual address the #PF occurs and the kernel sets the
      soft-dirty bit on the respective PTE.
      
      Note, that although all the task's address space is marked as r/o after
      the soft-dirty bits clear, the #PF-s that occur after that are processed
      fast.  This is so, since the pages are still mapped to physical memory,
      and thus all the kernel does is finds this fact out and puts back
      writable, dirty and soft-dirty bits on the PTE.
      
      Another thing to note, is that when mremap moves PTEs they are marked
      with soft-dirty as well, since from the user perspective mremap modifies
      the virtual memory at mremap's new address.
      Signed-off-by: NPavel Emelyanov <xemul@parallels.com>
      Cc: Matt Mackall <mpm@selenic.com>
      Cc: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
      Cc: Glauber Costa <glommer@parallels.com>
      Cc: Marcelo Tosatti <mtosatti@redhat.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@gmail.com>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      0f8975ec
    • P
      pagemap: introduce pagemap_entry_t without pmshift bits · 2b0a9f01
      Pavel Emelyanov 提交于
      These bits are always constant (== PAGE_SHIFT) and just occupy space in
      the entry.  Moreover, in next patch we will need to report one more bit
      in the pagemap, but all bits are already busy on it.
      
      That said, describe the pagemap entry that has 6 more free zero bits.
      Signed-off-by: NPavel Emelyanov <xemul@parallels.com>
      Cc: Matt Mackall <mpm@selenic.com>
      Cc: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
      Cc: Glauber Costa <glommer@parallels.com>
      Cc: Marcelo Tosatti <mtosatti@redhat.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@gmail.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      2b0a9f01
    • P
      clear_refs: introduce private struct for mm_walk · af9de7eb
      Pavel Emelyanov 提交于
      In the next patch the clear-refs-type will be required in
      clear_refs_pte_range funciton, so prepare the walk->private to carry
      this info.
      Signed-off-by: NPavel Emelyanov <xemul@parallels.com>
      Cc: Matt Mackall <mpm@selenic.com>
      Cc: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
      Cc: Glauber Costa <glommer@parallels.com>
      Cc: Marcelo Tosatti <mtosatti@redhat.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@gmail.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      af9de7eb
    • P
      clear_refs: sanitize accepted commands declaration · 040fa020
      Pavel Emelyanov 提交于
      This is the implementation of the soft-dirty bit concept that should
      help keep track of changes in user memory, which in turn is very-very
      required by the checkpoint-restore project (http://criu.org).
      
      To create a dump of an application(s) we save all the information about
      it to files, and the biggest part of such dump is the contents of tasks'
      memory.  However, there are usage scenarios where it's not required to
      get _all_ the task memory while creating a dump.  For example, when
      doing periodical dumps, it's only required to take full memory dump only
      at the first step and then take incremental changes of memory.  Another
      example is live migration.  We copy all the memory to the destination
      node without stopping all tasks, then stop them, check for what pages
      has changed, dump it and the rest of the state, then copy it to the
      destination node.  This decreases freeze time significantly.
      
      That said, some help from kernel to watch how processes modify the
      contents of their memory is required.
      
      The proposal is to track changes with the help of new soft-dirty bit
      this way:
      
      1. First do "echo 4 > /proc/$pid/clear_refs".
         At that point kernel clears the soft dirty _and_ the writable bits from all
         ptes of process $pid. From now on every write to any page will result in #pf
         and the subsequent call to pte_mkdirty/pmd_mkdirty, which in turn will set
         the soft dirty flag.
      
      2. Then read the /proc/$pid/pagemap2 and check the soft-dirty bit reported there
         (the 55'th one). If set, the respective pte was written to since last call
         to clear refs.
      
      The soft-dirty bit is the _PAGE_BIT_HIDDEN one.  Although it's used by
      kmemcheck, the latter one marks kernel pages with it, while the former
      bit is put on user pages so they do not conflict to each other.
      
      This patch:
      
      A new clear-refs type will be added in the next patch, so prepare
      code for that.
      
      [akpm@linux-foundation.org: don't assume that sizeof(enum clear_refs_types) == sizeof(int)]
      Signed-off-by: NPavel Emelyanov <xemul@parallels.com>
      Cc: Matt Mackall <mpm@selenic.com>
      Cc: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
      Cc: Glauber Costa <glommer@parallels.com>
      Cc: Marcelo Tosatti <mtosatti@redhat.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@gmail.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      040fa020