1. 06 Mar, 2019 40 commits
    • mm, compaction: do not consider a need to reschedule as contention · cf66f070
      Committed by Mel Gorman
      Scanning on large machines can take a considerable length of time and
      the task eventually needs to be rescheduled.  This is treated as an abort event
      but that's not appropriate as the attempt is likely to be retried after
      making numerous checks and taking another cycle through the page
      allocator.  This patch will check the need to reschedule if necessary
      but continue the scanning.
      
      The main benefit is reduced scanning when compaction is taking a long
      time or the machine is over-saturated.  It also avoids an unnecessary
      exit of compaction that ends up being retried by the page allocator in
      the outer loop.
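
As a rough illustration of the behaviour change described above, the following Python sketch models a scan loop in user space. It is not kernel code: `scan_pageblocks`, `need_resched`, `yield_cpu` and `abort_on_resched` are hypothetical names chosen for this example, and `yield_cpu` merely stands in for whatever the kernel does when it reschedules.

```python
def scan_pageblocks(blocks, need_resched, yield_cpu, abort_on_resched):
    """Return how many pageblocks were scanned.

    blocks: iterable of pageblock identifiers (hypothetical model).
    need_resched: callable returning True when the CPU should be given up.
    yield_cpu: callable that models giving up the CPU for a while.
    abort_on_resched: True models the old behaviour, False the patched one.
    """
    scanned = 0
    for _ in blocks:
        if need_resched():
            if abort_on_resched:
                break        # old: treated like contention, scan abandoned
            yield_cpu()      # new: reschedule, then continue scanning
        scanned += 1
    return scanned
```

With the old behaviour the caller sees a short scan and has to retry from the outer loop; with the new behaviour the same call finishes the whole range and only pays for the yields.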
      
                                           5.0.0-rc1              5.0.0-rc1
                                    synccached-v3r16        noresched-v3r17
      Amean     fault-both-1         0.00 (   0.00%)        0.00 *   0.00%*
      Amean     fault-both-3      2958.27 (   0.00%)     2965.68 (  -0.25%)
      Amean     fault-both-5      4091.90 (   0.00%)     3995.90 (   2.35%)
      Amean     fault-both-7      5803.05 (   0.00%)     5842.12 (  -0.67%)
      Amean     fault-both-12     9481.06 (   0.00%)     9550.87 (  -0.74%)
      Amean     fault-both-18    14141.51 (   0.00%)    13304.72 (   5.92%)
      Amean     fault-both-24    16438.00 (   0.00%)    14618.59 (  11.07%)
      Amean     fault-both-30    17531.72 (   0.00%)    16650.96 (   5.02%)
      Amean     fault-both-32    17101.96 (   0.00%)    17145.15 (  -0.25%)
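
The bracketed percentages in the latency tables appear to be the gain relative to the first (baseline) column, with a positive number meaning lower latency. A minimal sketch of that arithmetic, assuming this reading of the columns (the function name `latency_gain_pct` is made up for the example):

```python
def latency_gain_pct(baseline, value):
    """Percentage improvement of value over baseline when lower is better."""
    if baseline == 0:
        return 0.0  # the fault-both-1 rows report 0.00% for a zero baseline
    return (baseline - value) / baseline * 100.0
```

For the fault-both-24 row, `latency_gain_pct(16438.00, 14618.59)` rounds to 11.07, and for fault-both-3, `latency_gain_pct(2958.27, 2965.68)` rounds to -0.25, which agrees with the table.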
      
      Link: http://lkml.kernel.org/r/20190118175136.31341-18-mgorman@techsingularity.net
      Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Dan Carpenter <dan.carpenter@oracle.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: YueHaibing <yuehaibing@huawei.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm, compaction: rework compact_should_abort as compact_check_resched · cb810ad2
      Committed by Mel Gorman
      With incremental changes, compact_should_abort no longer makes any
      documented sense.  Rename to compact_check_resched and update the
      associated comments.  There is no benefit other than reducing redundant
      code and making the intent slightly clearer.  It could potentially be
      merged with earlier patches but it just makes the review slightly
      harder.
      
      Link: http://lkml.kernel.org/r/20190118175136.31341-17-mgorman@techsingularity.net
      Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Dan Carpenter <dan.carpenter@oracle.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: YueHaibing <yuehaibing@huawei.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm, compaction: keep cached migration PFNs synced for unusable pageblocks · 8854c55f
      Committed by Mel Gorman
      Migrate has separate cached PFNs for ASYNC and SYNC* migration on the
      basis that some migrations will fail in ASYNC mode.  However, if the
      cached PFNs match at the start of scanning and pageblocks are skipped
      due to having no isolation candidates, then the sync state does not
      matter.  This patch keeps matching cached PFNs in sync until a pageblock
      with isolation candidates is found.
      
      The actual benefit is marginal given that the sync scanner following the
      async scanner will often skip a number of pageblocks but it's useless
      work.  Any benefit depends heavily on whether the scanners restarted
      recently.
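
The idea of keeping the two cached positions together while nothing useful is found can be sketched as follows. This is a simplified user-space model under stated assumptions, not the kernel implementation: the two-element `cached` list, the `ASYNC`/`SYNC` indices and the `(end_pfn, has_candidates)` tuples are all invented for the illustration.

```python
ASYNC, SYNC = 0, 1  # indices into the cached-PFN pair (illustrative only)

def async_scan(cached, pageblocks):
    """Advance the async cached PFN over pageblocks in scan order.

    cached: two-element list [async_pfn, sync_pfn], updated in place.
    pageblocks: list of (end_pfn, has_candidates) tuples.
    Returns the end PFN of the first block with candidates, or None.
    """
    # Only keep the pair in step if they matched when scanning started.
    keep_synced = cached[ASYNC] == cached[SYNC]
    for end_pfn, has_candidates in pageblocks:
        cached[ASYNC] = end_pfn
        if has_candidates:
            # Async migration may fail here, so the sync position is left
            # at the start of this block for a later sync pass.
            return end_pfn
        if keep_synced:
            # Nothing to isolate in this block, so sync mode would skip
            # it as well.
            cached[SYNC] = cached[ASYNC]
    return None
```

Starting from `[0, 0]` with two empty blocks followed by one with candidates, the sync position ends up at the end of the second empty block rather than at 0, which is the useless work the patch avoids.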
      
      Link: http://lkml.kernel.org/r/20190118175136.31341-16-mgorman@techsingularity.net
      Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Dan Carpenter <dan.carpenter@oracle.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: YueHaibing <yuehaibing@huawei.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm, compaction: check early for huge pages encountered by the migration scanner · 9bebefd5
      Committed by Mel Gorman
      When scanning for sources or targets, PageCompound is checked for huge
      pages as they can be skipped quickly but it happens relatively late
      after a lot of setup and checking.  This patch short-cuts the check to
      make it earlier.  It might still change when the lock is acquired but
      this has less overhead overall.  The free scanner advances but the
      migration scanner does not.  Typically the free scanner encounters more
      movable blocks that change state over the lifetime of the system and
      also tends to scan more aggressively as it's actively filling its
      portion of the physical address space with data.  This could change in
      the future but for the moment, this worked better in practice and
      incurred fewer scan restarts.
      
      The impact on latency and allocation success rates is marginal but the
      free scan rates are reduced by 15% and system CPU usage is reduced by
      3.3%.  The 2-socket results are not materially different.
      
      Link: http://lkml.kernel.org/r/20190118175136.31341-15-mgorman@techsingularity.net
      Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Dan Carpenter <dan.carpenter@oracle.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: YueHaibing <yuehaibing@huawei.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm, compaction: finish pageblock scanning on contention · cb2dcaf0
      Committed by Mel Gorman
      Async migration aborts on spinlock contention but contention can be high
      when there are multiple compaction attempts and kswapd is active.  The
      consequence is that the migration scanners move forward uselessly while
      still contending on locks for longer while leaving suitable migration
      sources behind.
      
      This patch will acquire the lock but track when contention occurs.  When
      it does, scanning of the current pageblock is finished, as compaction may
      succeed for that block, and compaction then aborts.  This will have a
      variable impact on latency: in some cases useless scanning is avoided
      (reduces latency), but a lock will be contended (increases latency) or a
      single contended pageblock is scanned that would otherwise have been
      skipped (increases latency).
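
A minimal sketch of the "record contention, finish the block, then stop" pattern is shown below. It is a toy model, assuming pageblocks are plain lists of page ids and that contention is known in advance through the hypothetical `contended_at` set; the real scanner discovers contention when it takes the lock.

```python
def scan_until_contended(pageblocks, contended_at):
    """Scan pageblocks in order and return the list of pages scanned.

    pageblocks: list of lists of page ids (hypothetical model).
    contended_at: set of page ids at which taking the lock was contended.
    """
    scanned = []
    contended = False
    for block in pageblocks:
        for page in block:
            if page in contended_at:
                contended = True   # remember it, but still take the lock
            scanned.append(page)
        if contended:
            break                  # abort only at a pageblock boundary
    return scanned
```

The point of the design is that the abort decision is deferred to the block boundary, so a pageblock is never left half scanned.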
      
                                           5.0.0-rc1              5.0.0-rc1
                                      norescan-v3r16    finishcontend-v3r16
      Amean     fault-both-1         0.00 (   0.00%)        0.00 *   0.00%*
      Amean     fault-both-3      3002.07 (   0.00%)     3153.17 (  -5.03%)
      Amean     fault-both-5      4684.47 (   0.00%)     4280.52 (   8.62%)
      Amean     fault-both-7      6815.54 (   0.00%)     5811.50 *  14.73%*
      Amean     fault-both-12    10864.02 (   0.00%)     9276.85 (  14.61%)
      Amean     fault-both-18    12247.52 (   0.00%)    11032.67 (   9.92%)
      Amean     fault-both-24    15683.99 (   0.00%)    14285.70 (   8.92%)
      Amean     fault-both-30    18620.02 (   0.00%)    16293.76 *  12.49%*
      Amean     fault-both-32    19250.28 (   0.00%)    16721.02 *  13.14%*
      
                                      5.0.0-rc1              5.0.0-rc1
                                 norescan-v3r16    finishcontend-v3r16
      Percentage huge-1         0.00 (   0.00%)        0.00 (   0.00%)
      Percentage huge-3        95.00 (   0.00%)       96.82 (   1.92%)
      Percentage huge-5        94.22 (   0.00%)       95.40 (   1.26%)
      Percentage huge-7        92.35 (   0.00%)       95.92 (   3.86%)
      Percentage huge-12       91.90 (   0.00%)       96.73 (   5.25%)
      Percentage huge-18       89.58 (   0.00%)       96.77 (   8.03%)
      Percentage huge-24       90.03 (   0.00%)       96.05 (   6.69%)
      Percentage huge-30       89.14 (   0.00%)       96.81 (   8.60%)
      Percentage huge-32       90.58 (   0.00%)       97.41 (   7.54%)
      
      There is a variable impact that is mostly good on latency while allocation
      success rates are slightly higher.  System CPU usage is reduced by about
      10% but scan rate impact is mixed.
      
                                      5.0.0-rc1              5.0.0-rc1
                                 norescan-v3r16    finishcontend-v3r16
      Compaction migrate scanned       27997659               20148867
      Compaction free scanned         120782791              118324914
      
      Migration scan rates are reduced by 28% which is expected as a pageblock is
      used by the async scanner instead of skipped.  The impact on the free
      scanner is known to be variable.  Overall the primary justification for
      this patch is that completing scanning of a pageblock is very important
      for later patches.
      
      [yuehaibing@huawei.com: fix unused variable warning]
      Link: http://lkml.kernel.org/r/20190118175136.31341-14-mgorman@techsingularity.net
      Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: YueHaibing <yuehaibing@huawei.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Dan Carpenter <dan.carpenter@oracle.com>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm, compaction: avoid rescanning the same pageblock multiple times · 804d3121
      Committed by Mel Gorman
      Pageblocks are marked for skip when no pages are isolated after a scan.
      However, it's possible to hit corner cases where the migration scanner
      gets stuck near the boundary between the source and target scanner.  Due
      to pages being migrated in blocks of COMPACT_CLUSTER_MAX, pages that are
      migrated can be reallocated before the pageblock is complete.  The
      pageblock is not necessarily skipped so it can be rescanned multiple
      times.  Similarly, a pageblock with some dirty/writeback pages may fail
      to migrate and be rescanned until writeback completes which is wasteful.
      
      This patch tracks if a pageblock is being rescanned.  If so, then the
      entire pageblock will be migrated as one operation.  This narrows the
      race window during which pages can be reallocated during migration.
      Secondly, if there are pages that cannot be isolated then the pageblock
      will still be fully scanned and marked for skipping.  On the second
      rescan, the pageblock skip is set and the migration scanner makes
      progress.
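
The difference between a normal pass and a rescan can be modelled roughly as below. This is only a sketch under assumptions: a pageblock is a list of booleans saying whether each page can be isolated, `isolate_batch` is a hypothetical helper, and the default batch size of 32 is an assumed value for `COMPACT_CLUSTER_MAX`.

```python
COMPACT_CLUSTER_MAX = 32  # assumed batch size for this sketch

def isolate_batch(block_pages, start, rescan, batch=COMPACT_CLUSTER_MAX):
    """Run one isolation pass over a pageblock.

    block_pages: list of booleans, True if the page can be isolated.
    start: index within the block at which to resume.
    rescan: True if this pageblock is being scanned again.
    Returns (isolated_indices, next_index).
    """
    isolated = []
    i = start
    while i < len(block_pages):
        if block_pages[i]:
            isolated.append(i)
        i += 1
        if not rescan and len(isolated) >= batch:
            break  # normal case: stop early and migrate this batch
    # On a rescan the loop always reaches the end of the block, so the
    # whole block is handled as one operation.
    return isolated, i
```

When `next_index` equals the block length the block has been fully scanned, which is the condition under which it can be marked for skipping.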
      
                                           5.0.0-rc1              5.0.0-rc1
                                      findfree-v3r16         norescan-v3r16
      Amean     fault-both-1         0.00 (   0.00%)        0.00 *   0.00%*
      Amean     fault-both-3      3200.68 (   0.00%)     3002.07 (   6.21%)
      Amean     fault-both-5      4847.75 (   0.00%)     4684.47 (   3.37%)
      Amean     fault-both-7      6658.92 (   0.00%)     6815.54 (  -2.35%)
      Amean     fault-both-12    11077.62 (   0.00%)    10864.02 (   1.93%)
      Amean     fault-both-18    12403.97 (   0.00%)    12247.52 (   1.26%)
      Amean     fault-both-24    15607.10 (   0.00%)    15683.99 (  -0.49%)
      Amean     fault-both-30    18752.27 (   0.00%)    18620.02 (   0.71%)
      Amean     fault-both-32    21207.54 (   0.00%)    19250.28 *   9.23%*
      
                                      5.0.0-rc1              5.0.0-rc1
                                 findfree-v3r16         norescan-v3r16
      Percentage huge-3        96.86 (   0.00%)       95.00 (  -1.91%)
      Percentage huge-5        93.72 (   0.00%)       94.22 (   0.53%)
      Percentage huge-7        94.31 (   0.00%)       92.35 (  -2.08%)
      Percentage huge-12       92.66 (   0.00%)       91.90 (  -0.82%)
      Percentage huge-18       91.51 (   0.00%)       89.58 (  -2.11%)
      Percentage huge-24       90.50 (   0.00%)       90.03 (  -0.52%)
      Percentage huge-30       91.57 (   0.00%)       89.14 (  -2.65%)
      Percentage huge-32       91.00 (   0.00%)       90.58 (  -0.46%)
      
      The difference is negligible, but this was likely a run in which the
      specific corner case was not hit.  A previous run of the same patch based
      on an earlier iteration of the series showed large differences where
      migration rates could be halved when the corner case was hit.
      
      The specific corner case, in which migration scan rates go through the
      roof because a dirty/writeback pageblock is located at the boundary of the
      migration/free scanners, did not happen in this case.  When it does
      happen, the scan rates multiply by massive margins.
      
      Link: http://lkml.kernel.org/r/20190118175136.31341-13-mgorman@techsingularity.net
      Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Dan Carpenter <dan.carpenter@oracle.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: YueHaibing <yuehaibing@huawei.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm, compaction: use free lists to quickly locate a migration target · 5a811889
      Committed by Mel Gorman
      Similar to the migration scanner, this patch uses the free lists to
      quickly locate a migration target.  The search is different in that
      lower orders will be searched for a suitable high PFN if necessary but
      the search is still bounded.  This is justified on the grounds that the
      free scanner typically scans linearly much more than the migration
      scanner.
      
      If a free page is found, it is isolated and compaction continues if
      enough pages were isolated.  For SYNC* scanning, the full pageblock is
      scanned for any remaining free pages so that it can be marked for
      skipping in the near future.
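
A bounded search that falls back to lower orders might look something like the sketch below. It is an illustration only: `free_lists` is a plain dict from order to a list of PFNs, and `fast_find_target`, `preferred_pfn` and `limit` are hypothetical names, not the kernel's data structures or parameters.

```python
def fast_find_target(free_lists, start_order, preferred_pfn, limit):
    """Look for a free page high in the zone without a linear scan.

    free_lists: dict mapping order -> list of PFNs of free pages.
    start_order: highest order to search; lower orders are tried after it.
    preferred_pfn: a PFN at or above this value is accepted immediately.
    limit: maximum number of free-list entries to examine in total.
    Returns (pfn, order), the best fallback seen, or None.
    """
    best = None      # highest-PFN entry seen so far, used as a fallback
    examined = 0
    for order in range(start_order, -1, -1):
        for pfn in free_lists.get(order, []):
            if examined >= limit:
                return best          # the search is bounded
            examined += 1
            if pfn >= preferred_pfn:
                return (pfn, order)  # high enough in the zone
            if best is None or pfn > best[0]:
                best = (pfn, order)
    return best
```

If this returns `None`, a caller would fall back to the linear free scanner, which is the behaviour the commit message describes for a failed search.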
      
      1-socket thpfioscale
                                           5.0.0-rc1              5.0.0-rc1
                                       isolmig-v3r15         findfree-v3r16
      Amean     fault-both-3      3024.41 (   0.00%)     3200.68 (  -5.83%)
      Amean     fault-both-5      4749.30 (   0.00%)     4847.75 (  -2.07%)
      Amean     fault-both-7      6454.95 (   0.00%)     6658.92 (  -3.16%)
      Amean     fault-both-12    10324.83 (   0.00%)    11077.62 (  -7.29%)
      Amean     fault-both-18    12896.82 (   0.00%)    12403.97 (   3.82%)
      Amean     fault-both-24    13470.60 (   0.00%)    15607.10 * -15.86%*
      Amean     fault-both-30    17143.99 (   0.00%)    18752.27 (  -9.38%)
      Amean     fault-both-32    17743.91 (   0.00%)    21207.54 * -19.52%*
      
      The impact on latency is variable but the search is optimistic and
      sensitive to the exact system state.  Success rates are similar but the
      major impact is to the rate of scanning.
      
                                      5.0.0-rc1      5.0.0-rc1
                                  isolmig-v3r15 findfree-v3r16
      Compaction migrate scanned    25646769          29507205
      Compaction free scanned      201558184         100359571
      
      The free scan rates are reduced by 50%.  The 2-socket reductions for the
      free scanner are more dramatic which is a likely reflection that the
      machine has more memory.
      
      [dan.carpenter@oracle.com: fix static checker warning]
      [vbabka@suse.cz: correct number of pages scanned for lower orders]
      Link: http://lkml.kernel.org/r/20190118175136.31341-12-mgorman@techsingularity.net
      Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Dan Carpenter <dan.carpenter@oracle.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: YueHaibing <yuehaibing@huawei.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm, compaction: keep migration source private to a single compaction instance · e380bebe
      Committed by Mel Gorman
      Due to either a fast search of the free list or a linear scan, it is
      possible for multiple compaction instances to pick the same pageblock
      for migration.  This is lucky for one scanner and increased scanning for
      all the others.  It also allows a race between requests on which first
      allocates the resulting free block.
      
      This patch tests and updates the pageblock skip for the migration
      scanner carefully.  When isolating a block, it will check and skip if
      the block is already in use.  Once the zone lock is acquired, it will be
      rechecked so that only one scanner can set the pageblock skip for
      exclusive use.  Any scanner contending will continue with a linear scan.
      The skip bit is still set if no pages can be isolated in a range.  While
      this may result in redundant scanning, it avoids unnecessarily acquiring
      the zone lock when there are no suitable migration sources.
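
The check, lock, recheck sequence described above is a common pattern and can be sketched with an ordinary mutex. This is a user-space analogy, assuming a hypothetical `Pageblock` object with a boolean `skip` attribute and a `threading.Lock` standing in for the zone lock.

```python
import threading

class Pageblock:
    """Toy pageblock carrying only the skip hint."""
    def __init__(self):
        self.skip = False

def claim_pageblock(block, zone_lock):
    """Try to claim a pageblock for exclusive use by one scanner.

    Returns True if this caller set the skip bit, False if another
    scanner already owns the block.
    """
    if block.skip:            # cheap check without taking the lock
        return False
    with zone_lock:
        if block.skip:        # recheck: another scanner may have won
            return False
        block.skip = True
        return True
```

A scanner for which this returns `False` would simply move on with its linear scan, so at most one compaction instance works on a given block.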
      
      1-socket thpscale
                                           5.0.0-rc1              5.0.0-rc1
                                       findmig-v3r15          isolmig-v3r15
      Amean     fault-both-1         0.00 (   0.00%)        0.00 *   0.00%*
      Amean     fault-both-3      3390.40 (   0.00%)     3024.41 (  10.80%)
      Amean     fault-both-5      5082.28 (   0.00%)     4749.30 (   6.55%)
      Amean     fault-both-7      7012.51 (   0.00%)     6454.95 (   7.95%)
      Amean     fault-both-12    11346.63 (   0.00%)    10324.83 (   9.01%)
      Amean     fault-both-18    15324.19 (   0.00%)    12896.82 *  15.84%*
      Amean     fault-both-24    16088.50 (   0.00%)    13470.60 *  16.27%*
      Amean     fault-both-30    18723.42 (   0.00%)    17143.99 (   8.44%)
      Amean     fault-both-32    18612.01 (   0.00%)    17743.91 (   4.66%)
      
                                      5.0.0-rc1              5.0.0-rc1
                                  findmig-v3r15          isolmig-v3r15
      Percentage huge-3        89.83 (   0.00%)       92.96 (   3.48%)
      Percentage huge-5        91.96 (   0.00%)       93.26 (   1.41%)
      Percentage huge-7        92.85 (   0.00%)       93.63 (   0.84%)
      Percentage huge-12       92.74 (   0.00%)       92.80 (   0.07%)
      Percentage huge-18       91.71 (   0.00%)       91.62 (  -0.10%)
      Percentage huge-24       92.13 (   0.00%)       91.50 (  -0.69%)
      Percentage huge-30       93.79 (   0.00%)       92.73 (  -1.13%)
      Percentage huge-32       91.27 (   0.00%)       91.94 (   0.74%)
      
      This shows a reasonable reduction in latency as multiple compaction
      scanners do not operate on the same blocks with a similar allocation
      success rate.
      
      Compaction migrate scanned    41093126    25646769
      
      Migration scan rates are reduced by 38%.
      
      Link: http://lkml.kernel.org/r/20190118175136.31341-11-mgorman@techsingularity.net
      Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Dan Carpenter <dan.carpenter@oracle.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: YueHaibing <yuehaibing@huawei.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm, compaction: use free lists to quickly locate a migration source · 70b44595
      Committed by Mel Gorman
      The migration scanner is a linear scan of a zone with a potentially large
      search space.  Furthermore, many pageblocks are unusable such as those
      filled with reserved pages or partially filled with pages that cannot
      migrate.  These still get scanned in the common case of allocating a THP
      and the cost accumulates.
      
      The patch uses a partial search of the free lists to locate a migration
      source candidate that is marked as MOVABLE when allocating a THP.  It
      prefers picking a block with a larger number of free pages already on
      the basis that there are fewer pages to migrate to free the entire
      block.  The lowest PFN found during searches is tracked as the basis of
      the start for the linear search after the first search of the free list
      fails.  After the search, the free list is shuffled so that the next
      search will not encounter the same page.  If the search fails then the
      subsequent searches will be shorter and the linear scanner is used.
      
      If this search fails, or if the request is for a small or
      unmovable/reclaimable allocation then the linear scanner is still used.
      It is somewhat pointless to use the list search in those cases.  Small
      free pages must be used for the search and there is no guarantee that
      movable pages are located within that block that are contiguous.
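
To make the "partial search, then shuffle" step more concrete, here is a simplified sketch. It assumes the MOVABLE free list is a `collections.deque` of PFNs and a pageblock of 512 pages; `fast_find_source`, `min_pfn` and `limit` are hypothetical, and details such as order preference and lowest-PFN tracking are deliberately left out.

```python
from collections import deque

def fast_find_source(movable_free, min_pfn, limit, pageblock_pages=512):
    """Pick a migration source pageblock from a free list.

    movable_free: deque of PFNs of free pages on a MOVABLE free list.
    min_pfn: entries below this PFN are not considered.
    limit: maximum number of entries to examine.
    Returns the start PFN of the chosen pageblock, or None.
    """
    found = None
    for idx, pfn in enumerate(movable_free):
        if idx >= limit:
            break                     # partial search only
        if pfn >= min_pfn:
            found = idx
            break
    if found is None:
        return None                   # caller falls back to a linear scan
    pfn = movable_free[found]
    # Move everything examined so far to the tail so that the next search
    # does not encounter the same page first.
    movable_free.rotate(-(found + 1))
    return pfn - pfn % pageblock_pages
```

Rotating the list is one simple way to model the shuffle; two consecutive calls on the same deque return different pageblocks when more than one candidate is present.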
      
                                           5.0.0-rc1              5.0.0-rc1
                                       noboost-v3r10          findmig-v3r15
      Amean     fault-both-3      3771.41 (   0.00%)     3390.40 (  10.10%)
      Amean     fault-both-5      5409.05 (   0.00%)     5082.28 (   6.04%)
      Amean     fault-both-7      7040.74 (   0.00%)     7012.51 (   0.40%)
      Amean     fault-both-12    11887.35 (   0.00%)    11346.63 (   4.55%)
      Amean     fault-both-18    16718.19 (   0.00%)    15324.19 (   8.34%)
      Amean     fault-both-24    21157.19 (   0.00%)    16088.50 *  23.96%*
      Amean     fault-both-30    21175.92 (   0.00%)    18723.42 *  11.58%*
      Amean     fault-both-32    21339.03 (   0.00%)    18612.01 *  12.78%*
      
                                      5.0.0-rc1              5.0.0-rc1
                                  noboost-v3r10          findmig-v3r15
      Percentage huge-3        86.50 (   0.00%)       89.83 (   3.85%)
      Percentage huge-5        92.52 (   0.00%)       91.96 (  -0.61%)
      Percentage huge-7        92.44 (   0.00%)       92.85 (   0.44%)
      Percentage huge-12       92.98 (   0.00%)       92.74 (  -0.25%)
      Percentage huge-18       91.70 (   0.00%)       91.71 (   0.02%)
      Percentage huge-24       91.59 (   0.00%)       92.13 (   0.60%)
      Percentage huge-30       90.14 (   0.00%)       93.79 (   4.04%)
      Percentage huge-32       90.03 (   0.00%)       91.27 (   1.37%)
      
      This shows an improvement in allocation latencies with similar
      allocation success rates.  While not presented, there was a 31%
      reduction in migration scanning and an 8% reduction in system CPU usage.
      A 2-socket machine showed similar benefits.
      
      [mgorman@techsingularity.net: several fixes]
        Link: http://lkml.kernel.org/r/20190204120111.GL9565@techsingularity.net
      [vbabka@suse.cz: migrate block that was found-fast, some optimisations]
      Link: http://lkml.kernel.org/r/20190118175136.31341-10-mgorman@techsingularity.net
      Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
      Acked-by: Vlastimil Babka <Vbabka@suse.cz>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Dan Carpenter <dan.carpenter@oracle.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: YueHaibing <yuehaibing@huawei.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm, compaction: ignore the fragmentation avoidance boost for isolation and compaction · fd1444b2
      Committed by Mel Gorman
      When pageblocks get fragmented, watermarks are artificially boosted to
      reclaim pages to avoid further fragmentation events.  However,
      compaction is often either fragmentation-neutral or moving movable pages
      away from unmovable/reclaimable pages.  As the true watermarks are
      preserved, allow compaction to ignore the boost factor.
      
      The expected impact is very slight as the main benefit is that
      compaction is slightly more likely to succeed when the system has been
      fragmented very recently.  On both 1-socket and 2-socket machines for
      THP-intensive allocation during fragmentation the success rate was
      increased by less than 1% which is marginal.  However, detailed tracing
      indicated that failures of migration due to a premature ENOMEM triggered
      by watermark checks were eliminated.
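
The effect of ignoring the boost on a watermark check can be shown with a few lines of arithmetic. This is a sketch with made-up numbers; `watermark_ok` and its parameters are hypothetical and only model the idea that the boost is added on top of the true watermark.

```python
def watermark_ok(free_pages, watermark, boost, ignore_boost):
    """Return True if free_pages clears the (optionally boosted) watermark.

    watermark: the true watermark, which is always respected.
    boost: temporary increase applied after a fragmentation event.
    ignore_boost: True models compaction, which skips the boost.
    """
    effective = watermark if ignore_boost else watermark + boost
    return free_pages > effective
```

With 1100 free pages, a watermark of 1000 and a boost of 200, a boosted check fails while the unboosted check used for compaction passes, which is the kind of premature failure the commit message says was eliminated.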
      
      Link: http://lkml.kernel.org/r/20190118175136.31341-9-mgorman@techsingularity.net
      Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Dan Carpenter <dan.carpenter@oracle.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: YueHaibing <yuehaibing@huawei.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm, compaction: always finish scanning of a full pageblock · efe771c7
      Committed by Mel Gorman
      When compaction is finishing, it uses a flag to ensure the pageblock is
      complete but it makes sense to always complete migration of a pageblock.
      Minimally, skip information is based on a pageblock and partially
      scanned pageblocks may incur more scanning in the future.  The pageblock
      skip handling also becomes more strict later in the series and the hint
      is more useful if a complete pageblock was always scanned.
      
      This potentially impacts latency as more scanning is done but it's not a
      consistent win or loss as the scanning is not always a high percentage
      of the pageblock and sometimes it is offset by future reductions in
      scanning.  Hence, the results are not presented this time due to a
      misleading mix of gains/losses without any clear pattern.  However, full
      scanning of the pageblock is important for later patches.
      
      Link: http://lkml.kernel.org/r/20190118175136.31341-8-mgorman@techsingularity.net
      Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Dan Carpenter <dan.carpenter@oracle.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: YueHaibing <yuehaibing@huawei.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm, migrate: immediately fail migration of a page with no migration handler · 806031bb
      Committed by Mel Gorman
      Pages with no migration handler use a fallback handler which sometimes
      works and sometimes persistently retries.  A historical example was
      blockdev pages but there are others such as odd refcounting when
      page->private is used.  These are retried multiple times which is
      wasteful during compaction so this patch will fail migration faster
      unless the caller specifies MIGRATE_SYNC.
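
The fail-fast rule can be modelled as a choice between a retryable and a non-retryable error. The sketch below is a user-space illustration, not the kernel's migrate code: the mode constants, `fallback_result`, `migrate_with_retries` and the retry budget of 10 passes are assumptions made for the example.

```python
import errno

MIGRATE_ASYNC, MIGRATE_SYNC_LIGHT, MIGRATE_SYNC = range(3)  # illustrative

def fallback_result(mode, can_release_private):
    """Result of the fallback handler for a page with no migration handler."""
    if can_release_private:
        return 0
    # Only full sync migration asks the caller to try again.
    return -errno.EAGAIN if mode == MIGRATE_SYNC else -errno.EBUSY

def migrate_with_retries(mode, can_release_private, max_passes=10):
    """Retry while the handler reports EAGAIN; return (rc, attempts)."""
    attempts = 0
    rc = 0
    for _ in range(max_passes):
        attempts += 1
        rc = fallback_result(mode, can_release_private)
        if rc != -errno.EAGAIN:
            break    # success or a hard failure: no point retrying
    return rc, attempts
```

In this model an async caller gives up after a single attempt on a stubborn page, whereas a `MIGRATE_SYNC` caller uses its whole retry budget.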
      
      This is not expected to help THP allocation success rates but it did
      reduce latencies very slightly in some cases.
      
      1-socket thpfioscale
                                              4.20.0                 4.20.0
                                    noreserved-v2r15         failfast-v2r15
      Amean     fault-both-1         0.00 (   0.00%)        0.00 *   0.00%*
      Amean     fault-both-3      3839.67 (   0.00%)     3833.72 (   0.15%)
      Amean     fault-both-5      5177.47 (   0.00%)     4967.15 (   4.06%)
      Amean     fault-both-7      7245.03 (   0.00%)     7139.19 (   1.46%)
      Amean     fault-both-12    11534.89 (   0.00%)    11326.30 (   1.81%)
      Amean     fault-both-18    16241.10 (   0.00%)    16270.70 (  -0.18%)
      Amean     fault-both-24    19075.91 (   0.00%)    19839.65 (  -4.00%)
      Amean     fault-both-30    22712.11 (   0.00%)    21707.05 (   4.43%)
      Amean     fault-both-32    21692.92 (   0.00%)    21968.16 (  -1.27%)
      
      The 2-socket results are not materially different.  Scan rates are
      similar as expected.
      
      Link: http://lkml.kernel.org/r/20190118175136.31341-7-mgorman@techsingularity.net
      Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Dan Carpenter <dan.carpenter@oracle.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: YueHaibing <yuehaibing@huawei.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm, compaction: rename map_pages to split_map_pages · 4469ab98
      Committed by Mel Gorman
      It's non-obvious that high-order free pages are split into order-0 pages
      from the function name.  Fix it.
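
For readers unfamiliar with the term, splitting a high-order free page into order-0 pages just means treating a block of 2^order contiguous pages as individual pages. A tiny sketch, with `split_to_order0` being a hypothetical helper rather than the renamed kernel function:

```python
def split_to_order0(pfn, order):
    """Return the PFNs of the order-0 pages covered by one free page.

    pfn: first page frame number of the high-order free page.
    order: the page covers 2**order contiguous page frames.
    """
    return [pfn + i for i in range(1 << order)]
```

An order-3 page starting at PFN 1024 therefore becomes eight order-0 pages, 1024 through 1031.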
      
      Link: http://lkml.kernel.org/r/20190118175136.31341-6-mgorman@techsingularity.net
      Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Dan Carpenter <dan.carpenter@oracle.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: YueHaibing <yuehaibing@huawei.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm, compaction: remove unnecessary zone parameter in some instances · 40cacbcb
      Committed by Mel Gorman
      A zone parameter is passed into a number of top-level compaction
      functions despite the fact that it's already in compact_control.  This
      is harmless but it did need an audit to check if zone actually ever
      changes meaningfully.  This patch removes the parameter in a number of
      top-level functions.  The change could be much deeper but this was
      enough to briefly clarify the flow.
      
      No functional change.
      
      Link: http://lkml.kernel.org/r/20190118175136.31341-5-mgorman@techsingularity.net
      Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Dan Carpenter <dan.carpenter@oracle.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: YueHaibing <yuehaibing@huawei.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm, compaction: remove last_migrated_pfn from compact_control · 566e54e1
      Committed by Mel Gorman
      The last_migrated_pfn field is a bit dubious as to whether it really
      helps but either way, the information from it can be inferred without
      increasing the size of compact_control so remove the field.
      
      Link: http://lkml.kernel.org/r/20190118175136.31341-4-mgorman@techsingularity.net
      Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Dan Carpenter <dan.carpenter@oracle.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: YueHaibing <yuehaibing@huawei.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      566e54e1
    • M
      mm, compaction: rearrange compact_control · c5943b9c
      Mel Gorman committed on
      compact_control spans two cache lines with write-intensive lines on
      both.  Rearrange so the most write-intensive fields are in the same
      cache line.  This has a negligible impact on the overall performance of
      compaction and is more a tidying exercise than anything.
      
      Link: http://lkml.kernel.org/r/20190118175136.31341-3-mgorman@techsingularity.net
      Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Dan Carpenter <dan.carpenter@oracle.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: YueHaibing <yuehaibing@huawei.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      c5943b9c
    • M
      mm, compaction: shrink compact_control · c5fbd937
      Mel Gorman committed on
      Patch series "Increase success rates and reduce latency of compaction", v3.
      
      This series reduces scan rates and improves success rates of compaction,
      primarily by using the free lists to shorten scans, better controlling
      skip information and whether multiple scanners can target the same
      block, and capturing pageblocks before they are stolen by parallel requests.
      The series is based on mmotm from January 9th, 2019 with the previous
      compaction series reverted.
      
      I'm mostly using thpscale to measure the impact of the series.  The
      benchmark creates a large file, maps it, faults it, punches holes in the
      mapping so that the virtual address space is fragmented and then tries
      to allocate THP.  It re-executes for different numbers of threads.  From
      a fragmentation perspective, the workload is relatively benign but it
      does stress compaction.
      
      The overall impact on latencies for a 1-socket machine is
      
                                            baseline                patches
      Amean     fault-both-3      3832.09 (   0.00%)     2748.56 *  28.28%*
      Amean     fault-both-5      4933.06 (   0.00%)     4255.52 (  13.73%)
      Amean     fault-both-7      7017.75 (   0.00%)     6586.93 (   6.14%)
      Amean     fault-both-12    11610.51 (   0.00%)     9162.34 *  21.09%*
      Amean     fault-both-18    17055.85 (   0.00%)    11530.06 *  32.40%*
      Amean     fault-both-24    19306.27 (   0.00%)    17956.13 (   6.99%)
      Amean     fault-both-30    22516.49 (   0.00%)    15686.47 *  30.33%*
      Amean     fault-both-32    23442.93 (   0.00%)    16564.83 *  29.34%*
      
      The allocation success rates are much improved
      
                                       baseline                patches
      Percentage huge-3        85.99 (   0.00%)       97.96 (  13.92%)
      Percentage huge-5        88.27 (   0.00%)       96.87 (   9.74%)
      Percentage huge-7        85.87 (   0.00%)       94.53 (  10.09%)
      Percentage huge-12       82.38 (   0.00%)       98.44 (  19.49%)
      Percentage huge-18       83.29 (   0.00%)       99.14 (  19.04%)
      Percentage huge-24       81.41 (   0.00%)       97.35 (  19.57%)
      Percentage huge-30       80.98 (   0.00%)       98.05 (  21.08%)
      Percentage huge-32       80.53 (   0.00%)       97.06 (  20.53%)
      
      That's a nearly perfect allocation success rate.
      
      The biggest impact is on the scan rates
      
      Compaction migrate scanned    55893379    19341254
      Compaction free scanned      474739990    11903963
      
      The number of pages scanned for migration was reduced by 65% and the
      free scanner was reduced by 97.5%.  So much less work in exchange for
      lower latency and better success rates.
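      
      The two percentages follow directly from the counter rows above.  A
      minimal sketch of the arithmetic, where reduction_pct() is a
      hypothetical helper name used only for illustration:

```c
#include <assert.h>

/* Percentage by which `after` is lower than `before`. */
double reduction_pct(double before, double after)
{
	assert(before > 0.0);	/* avoid dividing by zero */
	return 100.0 * (before - after) / before;
}
```

      Feeding it the "migrate scanned" and "free scanned" rows gives
      roughly 65% and 97.5% respectively.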
      
      The series was also evaluated using a workload that heavily fragments
      memory, and the benefits there are also significant, although the
      results are not presented.
      
      It was commented that we should be rethinking scanning entirely and to a
      large extent I agree.  However, to achieve that you need a lot of this
      series in place first so it's best to make the linear scanners as good
      as possible before ripping them out.
      
      This patch (of 22):
      
      The isolate and migrate scanners should never isolate more than a
      pageblock of pages, so unsigned int is sufficient, saving 8 bytes on a
      64-bit build.
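      
      A minimal sketch of where the 8 bytes come from, assuming an LP64
      target (8-byte unsigned long, 4-byte unsigned int).  The two structs
      are hypothetical stand-ins that model only two counters; the field
      names are illustrative and this is not the real compact_control
      layout:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in: the two counters before the change. */
struct counters_before {
	unsigned long nr_freepages;	/* isolated free pages */
	unsigned long nr_migratepages;	/* isolated pages awaiting migration */
};

/* Hypothetical stand-in: the same counters narrowed to unsigned int. */
struct counters_after {
	unsigned int nr_freepages;
	unsigned int nr_migratepages;
};

static_assert(sizeof(struct counters_after) <= sizeof(struct counters_before),
	      "narrowing the counters must not grow the struct");

/* Bytes saved by narrowing both counters. */
size_t counters_saving(void)
{
	return sizeof(struct counters_before) - sizeof(struct counters_after);
}
```

      On an LP64 build counters_saving() evaluates to 2 * (8 - 4) = 8.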
      
      Link: http://lkml.kernel.org/r/20190118175136.31341-2-mgorman@techsingularity.net
      Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Dan Carpenter <dan.carpenter@oracle.com>
      Cc: YueHaibing <yuehaibing@huawei.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      c5fbd937
    • Z
      mm/filemap: pass inclusive 'end_byte' parameter to filemap_range_has_page · 35f12f0f
      zhengbin committed on
      The 'end_byte' parameter of filemap_range_has_page is required to be
      inclusive, so follow the rule.
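      
      A minimal sketch of what "inclusive" means for a byte range, assuming
      an I/O of count bytes starting at pos.  The helper names and the
      4096-byte page size are hypothetical, chosen only to show the
      off-by-one:

```c
#include <assert.h>
#include <stddef.h>

/*
 * Last byte touched by an I/O of `count` bytes starting at `pos`.
 * The range is [pos, pos + count - 1]; passing pos + count instead
 * would also cover the first byte of whatever follows.
 */
long long inclusive_end_byte(long long pos, size_t count)
{
	assert(count > 0);	/* an empty range has no last byte */
	return pos + (long long)count - 1;
}

/* Index of the page holding a byte offset, for a 4096-byte page size. */
long long page_index(long long byte)
{
	return byte / 4096;
}
```

      For a 4096-byte write at offset 0 the inclusive end is byte 4095,
      which lies in page 0, whereas the exclusive end 4096 already lies in
      page 1 and would make a range check look at an unrelated page.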
      
      Link: http://lkml.kernel.org/r/1548678679-18122-1-git-send-email-zhengbin13@huawei.com
      Fixes: 6be96d3a ("fs: return if direct I/O will trigger writeback")
      Signed-off-by: zhengbin <zhengbin13@huawei.com>
      Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Matthew Wilcox <willy@infradead.org>
      Acked-by: Christoph Hellwig <hch@lst.de>
      Cc: "Darrick J. Wong" <darrick.wong@oracle.com>
      Cc: Amir Goldstein <amir73il@gmail.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Hou Tao <houtao1@huawei.com>
      Cc: zhangyi (F) <yi.zhang@huawei.com>
      Cc: Jan Kara <jack@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      35f12f0f
    • A
      mm: shuffle GFP_* flags · d71e53ce
      Alexey Dobriyan committed on
      GFP_KERNEL is one of the most used constants, but on archs like arm with
      fixed-length instructions some constants are more equal than others.
      Constants with tightly packed bits can be injected directly into the
      instruction stream:
      
      	   0:   e3a00d33        mov     r0, #3264       ; 0xcc0
      
      Others require multiple instructions or even loading out of instruction
      stream:
      
      	   0:   e3a000c0        mov     r0, #192        ; 0xc0
      	   4:   e3400060        movt    r0, #96		; 0x60
      
      Shuffle GFP_* flags so that GFP_KERNEL/GFP_ATOMIC + __GFP_ZERO bits are
      close to each other.
      
      Savings on arm configs are ~0.1%.
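      
      To illustrate why tightly packed bits matter, here is a sketch that
      models only the classic 32-bit ARM data-processing immediate: an
      8-bit value rotated right by an even amount.  The function names are
      hypothetical, and other encodings (such as 16-bit movw immediates)
      are deliberately not modelled:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Rotate a 32-bit value left by n bits (n in 0..31). */
uint32_t rotl32(uint32_t v, unsigned int n)
{
	assert(n < 32);
	return n ? (v << n) | (v >> (32 - n)) : v;
}

/*
 * True if v can be written as an 8-bit value rotated right by an even
 * amount.  Rotating v left by the same amount undoes the rotation, so
 * v is encodable when some even left rotation fits in 8 bits.
 */
bool arm_imm_encodable(uint32_t v)
{
	for (unsigned int rot = 0; rot < 32; rot += 2)
		if (rotl32(v, rot) <= 0xffu)
			return true;
	return false;
}
```

      Under this model 0xcc0 (0x33 shifted left by 6) fits in one
      instruction, while 0x6000c0, the value built by the mov/movt pair
      above, has set bits too far apart to fit.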
      
      Link: http://lkml.kernel.org/r/20190109201838.GA9140@avx2
      Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
      Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      d71e53ce
    • Y
      mm: swap: add comment for swap_vma_readahead · e9f59873
      Yang Shi committed on
      swap_vma_readahead()'s comment is missing, just add it.
      
      Link: http://lkml.kernel.org/r/1546543673-108536-2-git-send-email-yang.shi@linux.alibaba.com
      Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>
      Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: Huang Ying <ying.huang@intel.com>
      Cc: Tim Chen <tim.c.chen@intel.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Daniel Jordan <daniel.m.jordan@oracle.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      e9f59873
    • Y
      mm: swap: check if swap backing device is congested or not · 8fd2e0b5
      Yang Shi committed on
      Swap readahead would read in a few pages regardless of whether the
      underlying device is busy.  This may incur a long wait if the device is
      congested, and it may also exacerbate the congestion.
      
      Use inode_read_congested() to check whether the underlying device is
      busy, as file page readahead does.  Get the inode from
      swap_info_struct.
      
      Although we could add inode information to swap_address_space
      (address_space->host), that may lead to unexpected side effects, i.e. it
      may break mapping_cap_account_dirty().  Using the inode from
      swap_info_struct seems simple and good enough.
      
      Only do the check in vma_cluster_readahead(), since
      swap_vma_readahead() is only used for non-rotational devices, which are
      much less likely to be congested than traditional HDDs.
      
      Although swap slots may be consecutive on a swap partition, they may
      still be fragmented on a swap file.  This check helps to reduce
      excessive stalls in that case.
      
      The test with page_fault1 of will-it-scale (sometimes tracing may just
      show runtest.py that is the wrapper script of page_fault1), which
      basically launches NR_CPU threads to generate 128MB anonymous pages for
      each thread, on my virtual machine with congested HDD shows long tail
      latency is reduced significantly.
      
      Without the patch
       page_fault1_thr-1490  [023]   129.311706: funcgraph_entry:      #57377.796 us |  do_swap_page();
       page_fault1_thr-1490  [023]   129.369103: funcgraph_entry:        5.642us   |  do_swap_page();
       page_fault1_thr-1490  [023]   129.369119: funcgraph_entry:      #1289.592 us |  do_swap_page();
       page_fault1_thr-1490  [023]   129.370411: funcgraph_entry:        4.957us   |  do_swap_page();
       page_fault1_thr-1490  [023]   129.370419: funcgraph_entry:        1.940us   |  do_swap_page();
       page_fault1_thr-1490  [023]   129.378847: funcgraph_entry:      #1411.385 us |  do_swap_page();
       page_fault1_thr-1490  [023]   129.380262: funcgraph_entry:        3.916us   |  do_swap_page();
       page_fault1_thr-1490  [023]   129.380275: funcgraph_entry:      #4287.751 us |  do_swap_page();
      
      With the patch
            runtest.py-1417  [020]   301.925911: funcgraph_entry:      #9870.146 us |  do_swap_page();
            runtest.py-1417  [020]   301.935785: funcgraph_entry:        9.802us   |  do_swap_page();
            runtest.py-1417  [020]   301.935799: funcgraph_entry:        3.551us   |  do_swap_page();
            runtest.py-1417  [020]   301.935806: funcgraph_entry:        2.142us   |  do_swap_page();
            runtest.py-1417  [020]   301.935853: funcgraph_entry:        6.938us   |  do_swap_page();
            runtest.py-1417  [020]   301.935864: funcgraph_entry:        3.765us   |  do_swap_page();
            runtest.py-1417  [020]   301.935871: funcgraph_entry:        3.600us   |  do_swap_page();
            runtest.py-1417  [020]   301.935878: funcgraph_entry:        7.202us   |  do_swap_page();
      
      [akpm@linux-foundation.org: code cleanup]
      [yang.shi@linux.alibaba.com: add comment]
        Link: http://lkml.kernel.org/r/bbc7bda7-62d0-df1a-23ef-d369e865bdca@linux.alibaba.com
      Link: http://lkml.kernel.org/r/1546543673-108536-1-git-send-email-yang.shi@linux.alibaba.com
      Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>
      Acked-by: Tim Chen <tim.c.chen@intel.com>
      Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: Huang Ying <ying.huang@intel.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Daniel Jordan <daniel.m.jordan@oracle.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      8fd2e0b5
    • M
      mm/filemap.c: remove redundant test from find_get_pages_contig · 14ef1fc7
      Matthew Wilcox committed on
      After we establish a reference on the page, we check that the pointer
      continues to be in the correct position in i_pages.  Checking
      page->index afterwards is unnecessary; if it were to change, then the
      pointer to it from the page cache would also move.  The check used to be
      done before grabbing a reference on the page which was racy (see commit
      9cbb4cb2 ("mm: find_get_pages_contig fixlet")), but nobody noticed
      that moving the check after grabbing the reference was redundant.
      
      Link: http://lkml.kernel.org/r/20190107200224.13260-1-willy@infradead.org
      Signed-off-by: Matthew Wilcox <willy@infradead.org>
      Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: Hugh Dickins <hughd@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      14ef1fc7
    • G
      mm/memcontrol.c: use struct_size() in kmalloc() · 67b8046f
      Gustavo A. R. Silva committed on
      One of the more common cases of allocation size calculations is finding
      the size of a structure that has a zero-sized array at the end, along
      with memory for some number of elements for that array.  For example:
      
        struct foo {
            int stuff;
            void *entry[];
        };
      
        instance = kmalloc(sizeof(struct foo) + sizeof(void *) * count, GFP_KERNEL);
      
      Instead of leaving these open-coded and prone to type mistakes, we can
      now use the new struct_size() helper:
      
        instance = kmalloc(struct_size(instance, entry, count), GFP_KERNEL);
      
      This code was detected with the help of Coccinelle.
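      
      For readers outside the kernel, the following userspace sketch shows
      the size calculation that such a helper guards.  It assumes the
      helper saturates to SIZE_MAX when the arithmetic would overflow;
      foo_size() and foo_alloc() are hypothetical names and malloc() stands
      in for kmalloc():

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

struct foo {
	int stuff;
	void *entry[];
};

/* The flexible array adds no storage of its own to sizeof(struct foo). */
static_assert(offsetof(struct foo, entry) <= sizeof(struct foo),
	      "trailing array must start within the struct header");

/* Bytes needed for a struct foo with `count` entries; SIZE_MAX on overflow. */
size_t foo_size(size_t count)
{
	size_t elem = sizeof(void *);

	if (count > (SIZE_MAX - sizeof(struct foo)) / elem)
		return SIZE_MAX;
	return sizeof(struct foo) + elem * count;
}

/* Allocate a struct foo with `count` entries, or NULL if the size overflows. */
struct foo *foo_alloc(size_t count)
{
	size_t bytes = foo_size(count);

	return bytes == SIZE_MAX ? NULL : malloc(bytes);
}
```

      The open-coded form silently wraps around for a huge count and
      allocates too little; the checked form turns that case into a failed
      allocation.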
      
      Link: http://lkml.kernel.org/r/20190104183726.GA6374@embeddedor
      Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      67b8046f
    • W
      mm: remove extra drain pages on pcp list · c52e7593
      Wei Yang committed on
      In the current implementation, there are two places that isolate a range
      of pages: __offline_pages() and alloc_contig_range().  During this
      procedure, pages on the pcp list are drained.
      
      Below is a brief call flow:
      
        __offline_pages()/alloc_contig_range()
            start_isolate_page_range()
                set_migratetype_isolate()
                    drain_all_pages()
            drain_all_pages()                 <--- A
      
      This snippet shows that the current logic isolates and drains the pcp
      list for each pageblock, and then drains the pcp list again for the
      whole range.
      
      start_isolate_page_range is responsible for isolating the given pfn
      range.  One part of that job is to make sure that also pages that are on
      the allocator pcp lists are properly isolated.  Otherwise they could be
      reused and the range wouldn't be completely isolated until the memory is
      freed back.  While there is no strict guarantee here because pages might
      get allocated at any time before drain_all_pages is called there doesn't
      seem to be any strong demand for such a guarantee.
      
      In any case, draining is already done at the isolation level and there
      is no need to do it again later by start_isolate_page_range callers
      (memory hotplug and CMA allocator currently).  Therefore remove
      pointless draining in existing callers to make the code more clear and
      functionally correct.
      
      [mhocko@suse.com: provide a clearer changelog for the last two paragraphs]
      Link: http://lkml.kernel.org/r/20190105233141.2329-1-richard.weiyang@gmail.com
      Signed-off-by: Wei Yang <richard.weiyang@gmail.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Acked-by: David Hildenbrand <david@redhat.com>
      Reviewed-by: Oscar Salvador <osalvador@suse.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      c52e7593
    • A
      arm64/mm: enable HugeTLB migration for contiguous bit HugeTLB pages · 5480280d
      Anshuman Khandual committed on
      Let arm64 subscribe to the previously added framework in which an
      architecture can report whether a given huge page size is supported for
      migration.  This just overrides the default function
      arch_hugetlb_migration_supported() and enables migration for all
      possible HugeTLB page sizes on arm64.
      
      With this, HugeTLB migration support on arm64 now covers all possible
      HugeTLB options.
      
                CONT PTE    PMD    CONT PMD    PUD
                --------    ---    --------    ---
        4K:        64K      2M        32M      1G
        16K:        2M     32M         1G
        64K:        2M    512M        16G
      
      Link: http://lkml.kernel.org/r/1545121450-1663-6-git-send-email-anshuman.khandual@arm.com
      Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com>
      Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Reviewed-by: Steve Capper <steve.capper@arm.com>
      Acked-by: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      5480280d
    • A
      arm64/mm: enable HugeTLB migration · 4a03a058
      Anshuman Khandual committed on
      Let arm64 subscribe to generic HugeTLB page migration framework.  Right
      now this only works on the following PMD and PUD level HugeTLB page
      sizes with various kernel base page size combinations.
      
               CONT PTE    PMD    CONT PMD    PUD
               --------    ---    --------    ---
        4K:         NA     2M         NA      1G
        16K:        NA    32M         NA
        64K:        NA   512M         NA
      
      Link: http://lkml.kernel.org/r/1545121450-1663-5-git-send-email-anshuman.khandual@arm.com
      Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com>
      Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Reviewed-by: Steve Capper <steve.capper@arm.com>
      Acked-by: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      4a03a058
    • A
      mm/hugetlb: enable arch specific huge page size support for migration · e693de18
      Anshuman Khandual committed on
      Architectures like arm64 have HugeTLB page sizes which are different
      from the generic sizes at the PMD, PUD and PGD levels and are implemented
      via contiguous bits.  At present these special size HugeTLB pages cannot
      be identified through macros like (PMD|PUD|PGDIR)_SHIFT and hence are
      chosen not to be migrated.
      
      Enabling migration support for these special HugeTLB page sizes along
      with the generic ones (PMD|PUD|PGD) would require identifying all of
      them on a given platform.  A platform specific hook can precisely
      enumerate all huge page sizes supported for migration.  Instead of
      comparing against standard huge page orders, let the
      hugepage_migration_supported() function call a platform hook,
      arch_hugetlb_migration_supported().  The default definition of the
      platform hook maintains the existing semantics, which check standard
      huge page orders.  But an architecture can choose to override the
      default and provide support for a comprehensive set of huge page sizes.
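      
      The default-plus-override idea can be sketched in plain C as follows.
      This is an illustration under stated assumptions, not the kernel
      code: the function names are hypothetical, the hook takes a bare
      shift rather than a kernel structure, and the shift values are
      illustrative ones for a 4K base page size (64K, 2M, 32M, 1G):

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative shifts for a 4K base page size; real values are per-arch. */
#define PMD_SHIFT	21	/* 2M */
#define PUD_SHIFT	30	/* 1G */
#define PGDIR_SHIFT	39
#define CONT_PTE_SHIFT	16	/* 64K, contiguous-bit size */
#define CONT_PMD_SHIFT	25	/* 32M, contiguous-bit size */

static_assert(CONT_PTE_SHIFT < PMD_SHIFT && PMD_SHIFT < CONT_PMD_SHIFT &&
	      CONT_PMD_SHIFT < PUD_SHIFT, "sizes must be ordered");

/* Default hook: only sizes that match a page-table level. */
bool default_migration_supported(unsigned int shift)
{
	return shift == PMD_SHIFT || shift == PUD_SHIFT || shift == PGDIR_SHIFT;
}

/* Hypothetical arch override that also accepts contiguous-bit sizes. */
bool arch_override_migration_supported(unsigned int shift)
{
	return default_migration_supported(shift) ||
	       shift == CONT_PTE_SHIFT || shift == CONT_PMD_SHIFT;
}
```

      With the default, a 64K or 32M huge page is rejected; with the
      override, the same sizes are accepted while the standard levels keep
      working.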
      
      Link: http://lkml.kernel.org/r/1545121450-1663-4-git-send-email-anshuman.khandual@arm.com
      Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com>
      Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Reviewed-by: Steve Capper <steve.capper@arm.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      e693de18
    • A
      mm/hugetlb: enable PUD level huge page migration · 9b553bf5
      Anshuman Khandual committed on
      Architectures like arm64 have PUD level HugeTLB pages for certain configs
      (1GB huge page is PUD based on ARM64_4K_PAGES base page size) that can
      be enabled for migration.  It can be achieved through checking for
      PUD_SHIFT order based HugeTLB pages during migration.
      
      Link: http://lkml.kernel.org/r/1545121450-1663-3-git-send-email-anshuman.khandual@arm.com
      Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com>
      Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Reviewed-by: Steve Capper <steve.capper@arm.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      9b553bf5
    • A
      mm/hugetlb: distinguish between migratability and movability · 7ed2c31d
      Anshuman Khandual committed on
      Patch series "arm64/mm: Enable HugeTLB migration", v4.
      
      This patch series enables HugeTLB migration support for all supported
      huge page sizes at all levels, including the contiguous bit implementation.
      The following HugeTLB migration support matrix has been enabled with this
      patch series.  All permutations have been tested except for the 16GB.
      
                 CONT PTE    PMD    CONT PMD    PUD
                 --------    ---    --------    ---
        4K:         64K     2M         32M     1G
        16K:         2M    32M          1G
        64K:         2M   512M         16G
      
      First the series adds migration support for PUD based huge pages.  It
      then adds a platform specific hook to query an architecture if a given
      huge page size is supported for migration while also providing a default
      fallback option preserving the existing semantics which just checks for
      (PMD|PUD|PGDIR)_SHIFT macros.  The last two patches enable HugeTLB
      migration on arm64 and subscribe to this new platform specific hook by
      defining an override.
      
      The second patch differentiates between movability and migratability
      aspects of huge pages and implements hugepage_movable_supported() which
      can then be used during allocation to decide whether to place the huge
      page in movable zone or not.
      
      This patch (of 5):
      
      During huge page allocation its migratability is checked to determine
      if it should be placed under movable zones with GFP_HIGHUSER_MOVABLE.
      But the movability aspect of the huge page could depend on other factors
      than just migratability.  Movability in itself is a distinct property
      which should not be tied with migratability alone.
      
      This differentiates these two and implements an enhanced movability check
      which also considers huge page size to determine if it is feasible to be
      placed under a movable zone.  At present it just checks for gigantic pages
      but going forward it can incorporate other enhanced checks.
      
      Link: http://lkml.kernel.org/r/1545121450-1663-2-git-send-email-anshuman.khandual@arm.com
      Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com>
      Reviewed-by: Steve Capper <steve.capper@arm.com>
      Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Suggested-by: Michal Hocko <mhocko@kernel.org>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      7ed2c31d
    • M
      mm: remove sysctl_extfrag_handler() · 6b7e5cad
      Matthew Wilcox committed on
      sysctl_extfrag_handler() neglects to propagate the return value from
      proc_dointvec_minmax() to its caller.  It's a wrapper that doesn't need
      to exist, so just use proc_dointvec_minmax() directly.
      
      Link: http://lkml.kernel.org/r/20190104032557.3056-1-willy@infradead.org
      Signed-off-by: Matthew Wilcox <willy@infradead.org>
      Reported-by: Aditya Pakki <pakki001@umn.edu>
      Acked-by: Mel Gorman <mgorman@techsingularity.net>
      Acked-by: Randy Dunlap <rdunlap@infradead.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      6b7e5cad
    • U
      selftests/vm: add script helper for CONFIG_TEST_VMALLOC_MODULE · a05ef00c
      Uladzislau Rezki (Sony) committed on
      Add the test script for the kernel test driver to analyse the vmalloc
      allocator for benchmarking and stressing purposes.  It is just a kernel
      module loader.  You can specify and pass different parameters in order
      to investigate allocation behaviour.  See "usage" output for more
      details.
      
      Also add basic vmalloc smoke test to the "run_vmtests" suite.
      
      Link: http://lkml.kernel.org/r/20190103142108.20744-4-urezki@gmail.com
      Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
      Reviewed-by: Shuah Khan <shuah@kernel.org>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Oleksiy Avramchenko <oleksiy.avramchenko@sonymobile.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      a05ef00c
    • U
      vmalloc: add test driver to analyse vmalloc allocator · 3f21a6b7
      Uladzislau Rezki (Sony) committed on
      This adds a new kernel module for analysis of the vmalloc allocator.  It is
      only enabled as a module.  There are two main uses for this module:
      performance evaluation and stressing of the vmalloc subsystem.
      
      It consists of several test cases.  As of now there are 8.  The module
      has five parameters we can specify to change its behaviour.
      
      1) run_test_mask - set of tests to be run
      
      id: 1,   name: fix_size_alloc_test
      id: 2,   name: full_fit_alloc_test
      id: 4,   name: long_busy_list_alloc_test
      id: 8,   name: random_size_alloc_test
      id: 16,  name: fix_align_alloc_test
      id: 32,  name: random_size_align_alloc_test
      id: 64,  name: align_shift_alloc_test
      id: 128, name: pcpu_alloc_test
      
      By default all tests are in the run test mask.  To select specific
      tests, pass a mask.  For example, for the first, second and fourth
      tests the mask value is 11.
      
      2) test_repeat_count - how many times each test should be repeated
      By default it is one time per test. It is possible to pass any number.
      The higher the value, the longer the test run takes.
      
      3) test_loop_count - internal test loop counter. By default it is set
      to 1000000.
      
      4) single_cpu_test - use one CPU to run the tests
      By default this parameter is set to false. It means that all online
      CPUs execute tests. By setting it to 1, the tests are executed by
      first online CPU only.
      
      5) sequential_test_order - run tests in sequential order
      By default this parameter is set to false. It means that before running
      tests the order is shuffled. It is possible to make it sequential, just
      set it to 1.
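      
      The run_test_mask value in 1) is the bitwise OR of the ids of the
      selected tests, and each id is a single bit.  A minimal sketch of the
      arithmetic, where test_bit_for() is a hypothetical helper name:

```c
#include <assert.h>

/* Id (single bit) of the n-th test in the list under 1), counting from 1. */
unsigned int test_bit_for(unsigned int nth)
{
	assert(nth >= 1 && nth <= 8);	/* the list has 8 tests */
	return 1u << (nth - 1);
}
```

      Selecting the first, second and fourth tests gives 1 | 2 | 8 = 11,
      and all eight tests together give 255.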
      
      Performance analysis:
      In order to evaluate the performance of vmalloc allocations, it usually
      makes sense to run the tests on only one CPU and in sequential order;
      the repeat count and the test mask can be varied.
      
      For example, to run all tests on one CPU and repeat each test 3 times,
      insert the module with the following parameters:
      
      single_cpu_test=1 sequential_test_order=1 test_repeat_count=3
      
      with the following output:
      
      <snip>
      Summary: fix_size_alloc_test passed: 3 failed: 0 repeat: 3 loops: 1000000 avg: 901177 usec
      Summary: full_fit_alloc_test passed: 3 failed: 0 repeat: 3 loops: 1000000 avg: 1039341 usec
      Summary: long_busy_list_alloc_test passed: 3 failed: 0 repeat: 3 loops: 1000000 avg: 11775763 usec
      Summary: random_size_alloc_test passed 3: failed: 0 repeat: 3 loops: 1000000 avg: 6081992 usec
      Summary: fix_align_alloc_test passed: 3 failed: 0 repeat: 3, loops: 1000000 avg: 2003712 usec
      Summary: random_size_align_alloc_test passed: 3 failed: 0 repeat: 3 loops: 1000000 avg: 2895689 usec
      Summary: align_shift_alloc_test passed: 0 failed: 3 repeat: 3 loops: 1000000 avg: 573 usec
      Summary: pcpu_alloc_test passed: 3 failed: 0 repeat: 3 loops: 1000000 avg: 95802 usec
      All test took CPU0=192945605995 cycles
      <snip>
      
      The align_shift_alloc_test is expected to fail.
      
      Stressing:
      In order to stress the vmalloc subsystem we run all available test cases
      on all available CPUs simultaneously. To avoid a constant behaviour
      pattern, the test cases array is shuffled by default to randomize the order
      of test execution.
      
      For example, to run all tests (default) on all online CPUs (default)
      in shuffled order (default) and repeat each test 30 times, the command
      would be:
      
      modprobe vmalloc_test test_repeat_count=30
      
      Expected results: the system stays alive, there are no BUG_ONs or kernel
      panics, the tests complete, and there are no memory leaks.
      
      [urezki@gmail.com: fix 32-bit builds]
        Link: http://lkml.kernel.org/r/20190106214839.ffvjvmrn52uqog7k@pc636
      [urezki@gmail.com: make CONFIG_TEST_VMALLOC depend on CONFIG_MMU]
        Link: http://lkml.kernel.org/r/20190219085441.s6bg2gpy4esny5vw@pc636
      Link: http://lkml.kernel.org/r/20190103142108.20744-3-urezki@gmail.com
      Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Oleksiy Avramchenko <oleksiy.avramchenko@sonymobile.com>
      Cc: Shuah Khan <shuah@kernel.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      3f21a6b7
    • U
      vmalloc: export __vmalloc_node_range for CONFIG_TEST_VMALLOC_MODULE · 153178ed
      Uladzislau Rezki (Sony) committed on
      Export the __vmalloc_node_range() function if CONFIG_TEST_VMALLOC_MODULE is
      enabled.  Some test cases in the vmalloc test suite module require and make
      use of that function.  Please note that it is not supposed to be used
      for other purposes.
      
      We need it only for performance analysis, stressing and stability check
      of vmalloc allocator.
      
      Link: http://lkml.kernel.org/r/20190103142108.20744-2-urezki@gmail.comSigned-off-by: NUladzislau Rezki (Sony) <urezki@gmail.com>
      Reviewed-by: NAndrew Morton <akpm@linux-foundation.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Shuah Khan <shuah@kernel.org>
      Cc: Oleksiy Avramchenko <oleksiy.avramchenko@sonymobile.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • R
      mm/vmalloc: pass VM_USERMAP flags directly to __vmalloc_node_range() · bc84c535
      Roman Penyaev authored
      vmalloc_user*() calls differ from normal vmalloc() only in that they set
      the VM_USERMAP flag for the area.  After the many changes vmalloc.c has
      seen over its history, it is now possible simply to pass VM_USERMAP
      directly to the __vmalloc_node_range() call instead of looking up the
      area (which takes time) after the allocation.
      
      Link: http://lkml.kernel.org/r/20190103145954.16942-4-rpenyaev@suse.de
      Signed-off-by: Roman Penyaev <rpenyaev@suse.de>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
      Cc: Joe Perches <joe@perches.com>
      Cc: "Luis R. Rodriguez" <mcgrof@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • R
      mm/vmalloc: do not call kmemleak_free() on not yet accounted memory · c67dc624
      Roman Penyaev authored
      __vmalloc_area_node() calls vfree() on the error path, which in turn calls
      kmemleak_free(), but the area is not yet accounted by kmemleak_vmalloc().
      
      Link: http://lkml.kernel.org/r/20190103145954.16942-3-rpenyaev@suse.de
      Signed-off-by: Roman Penyaev <rpenyaev@suse.de>
      Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
      Cc: Joe Perches <joe@perches.com>
      Cc: "Luis R. Rodriguez" <mcgrof@kernel.org>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • R
      mm/vmalloc: fix size check for remap_vmalloc_range_partial() · 401592d2
      Roman Penyaev authored
      When VM_NO_GUARD is not set, area->size includes the adjacent guard page,
      so for correct size checking get_vm_area_size() should be used instead of
      area->size.
      
      This fixes a possible kernel oops when userspace tries to mmap an area
      one page bigger than was allocated by the vmalloc_user() call: the size
      check inside remap_vmalloc_range_partial() also counts the non-existent
      guard page, so the check passes but vmalloc_to_page() returns NULL
      (the guard page does not physically exist).
      
      The following code pattern example should trigger an oops:
      
        static int oops_mmap(struct file *file, struct vm_area_struct *vma)
        {
              void *mem;
      
              mem = vmalloc_user(4096);
              BUG_ON(!mem);
              /* Do not care about mem leak */
      
              return remap_vmalloc_range(vma, mem, 0);
        }
      
      And userspace simply mmaps size + PAGE_SIZE:
      
        mmap(NULL, 8192, PROT_WRITE|PROT_READ, MAP_PRIVATE, fd, 0);
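      
      The arithmetic behind the failing check can be modelled in plain
      userspace C.  This is only a sketch under stated assumptions: the
      struct, the MODEL_* constants and the function names below are
      hypothetical stand-ins, not the kernel's definitions, and the mapping
      offset is taken to be zero.  It shows why an 8192-byte request passes a
      check against area->size but fails a check against the size without the
      guard page.

```c
#include <stdbool.h>

#define MODEL_PAGE_SIZE   4096UL /* assumed page size */
#define MODEL_VM_NO_GUARD 0x40UL /* stand-in flag value for this model */

/* Simplified stand-in for the vmalloc area: size includes the guard page. */
struct model_vm_area {
	unsigned long size;
	unsigned long flags;
};

/* Models the idea of get_vm_area_size(): usable bytes, guard page excluded. */
static unsigned long model_area_size(const struct model_vm_area *area)
{
	if (!(area->flags & MODEL_VM_NO_GUARD))
		return area->size - MODEL_PAGE_SIZE;
	return area->size;
}

/* Old check: compares the request against area->size, guard page included. */
static bool old_check_passes(const struct model_vm_area *area, unsigned long req)
{
	return req <= area->size;
}

/* Fixed check: compares the request against the usable size only. */
static bool new_check_passes(const struct model_vm_area *area, unsigned long req)
{
	return req <= model_area_size(area);
}

/* Area as a 4096-byte allocation would leave it: one data page plus one guard page. */
static struct model_vm_area model_vmalloc_user_4096(void)
{
	struct model_vm_area area = { .size = 2 * MODEL_PAGE_SIZE, .flags = 0 };
	return area;
}
```

      With the area returned by model_vmalloc_user_4096(), a request of 8192
      bytes is accepted by old_check_passes() and rejected by
      new_check_passes(), while a request of 4096 bytes is accepted by both.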
      
      Possible candidates for oops which do not have any explicit size
      checks:
      
         *** drivers/media/usb/stkwebcam/stk-webcam.c:
         v4l_stk_mmap[789]   ret = remap_vmalloc_range(vma, sbuf->buffer, 0);
      
      Or the following one:
      
         *** drivers/video/fbdev/core/fbmem.c
         static int
         fb_mmap(struct file *file, struct vm_area_struct * vma)
              ...
              res = fb->fb_mmap(info, vma);
      
      Where fb_mmap callback calls remap_vmalloc_range() directly without any
      explicit checks:
      
         *** drivers/video/fbdev/vfb.c
         static int vfb_mmap(struct fb_info *info,
                   struct vm_area_struct *vma)
         {
             return remap_vmalloc_range(vma, (void *)info->fix.smem_start, vma->vm_pgoff);
         }
      
      Link: http://lkml.kernel.org/r/20190103145954.16942-2-rpenyaev@suse.de
      Signed-off-by: Roman Penyaev <rpenyaev@suse.de>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
      Cc: Joe Perches <joe@perches.com>
      Cc: "Luis R. Rodriguez" <mcgrof@kernel.org>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • R
      mm/vmalloc.c: make vmalloc_32_user() align base kernel virtual address to SHMLBA · 5a82ac71
      Roman Penyaev authored
      This patch repeats the original one from David S Miller:
      
        2dca6999 ("mm, perf_event: Make vmalloc_user() align base kernel virtual address to SHMLBA")
      
      but for the missed vmalloc_32_user() case, which also requires correct
      alignment of the virtual address on the kernel side to avoid D-cache
      aliases.  Here is a bit of copy-paste from the original patch as a
      reminder of what this is all about:
      
        When a vmalloc'd area is mmap'd into userspace, some kind of
        co-ordination is necessary for this to work on platforms with cpu
        D-caches which can have aliases.
      
        Otherwise kernel side writes won't be seen properly in userspace and
        vice versa.
      
        If the kernel side mapping and the user side one have the same
        alignment, modulo SHMLBA, this can work as long as VM_SHARED is shared
        of VMA and for all current users this is true. VM_SHARED will force
        SHMLBA alignment of the user side mmap on platforms with D-cache
        aliasing matters.
      
        David S. Miller
      
      > What are the user-visible runtime effects of this change?
      
      In simple words: proper alignment avoids possible differences in the data
      seen through different virtual mappings: userspace and kernel in our case.
      I.e. userspace reads cache line A, kernel writes to cache line B.  Both
      cache lines correspond to the same physical memory (thus aliases).
      
      So this should fix data corruption on archs with VIVT and VIPT caches,
      e.g. armv6.  Personally I've never worked with these archs, I just
      spotted the strange difference in code: in one case we do the alignment,
      in the other we do not.  I have a strong feeling that David simply missed
      the vmalloc_32_user() case.
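      
      The "same alignment, modulo SHMLBA" condition can be written down as a
      small userspace sketch.  MODEL_SHMLBA is an assumed value (four 4 KiB
      pages) chosen only for illustration, and the function names are
      hypothetical; the real SHMLBA is architecture-specific.

```c
#include <stdbool.h>
#include <stdint.h>

#define MODEL_SHMLBA (4u * 4096u) /* assumed for illustration: 16 KiB */

/* Offset of an address within an SHMLBA-sized window (its "cache colour"). */
static uintptr_t cache_colour(uintptr_t addr)
{
	return addr & (MODEL_SHMLBA - 1);
}

/* Two mappings of the same memory avoid aliasing when their colours match. */
static bool mappings_alias_safely(uintptr_t kaddr, uintptr_t uaddr)
{
	return cache_colour(kaddr) == cache_colour(uaddr);
}

/* Round a base address up to the next SHMLBA boundary. */
static uintptr_t align_up_shmlba(uintptr_t addr)
{
	return (addr + MODEL_SHMLBA - 1) & ~(uintptr_t)(MODEL_SHMLBA - 1);
}
```

      In this model a kernel base address that is only page aligned, such as
      0xf0001000, has a different colour from an SHMLBA-aligned user address
      such as 0x40000000; after align_up_shmlba() the kernel address becomes
      0xf0004000 and the colours match.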
      
      >
      > Is a -stable backport needed?
      
      No, I do not think so.  The only user of vmalloc_32_user() is the
      virtual frame buffer device drivers/video/fbdev/vfb.c, which has in its
      description "The main use of this frame buffer device is testing and
      debugging the frame buffer subsystem.  Do NOT enable it for normal
      systems!".
      
      And it seems to me that vfb.c does not need 32-bit addressable pages
      (the vmalloc_32_user() case), because it is a virtual device and should
      not care about things like dma32 zones, etc.  It would probably be better
      to clean up the code, switch vfb.c from vmalloc_32_user() to
      vmalloc_user() and remove vmalloc_32_user() from vmalloc.c completely.
      But I'm not sure that this is worth doing; it is so minor that we can
      leave it as is.
      
      Link: http://lkml.kernel.org/r/20190108110944.23591-1-rpenyaev@suse.de
      Signed-off-by: Roman Penyaev <rpenyaev@suse.de>
      Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: David S. Miller <davem@davemloft.net>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • S
      memcg: localize memcg_kmem_enabled() check · 60cd4bcd
      Shakeel Butt authored
      Move the memcg_kmem_enabled() checks into the memcg kmem charge/uncharge
      functions, so that the users don't have to check that condition explicitly.
      
      This is purely a code cleanup patch without any functional change.  Only
      the order of checks in memcg_charge_slab() can potentially change,
      but functionally it will be the same.  This should not matter as
      memcg_charge_slab() is not in the hot path.
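      
      The shape of this cleanup can be sketched in userspace C.  Everything
      below is a hypothetical model (the flag, the counter and both functions
      are invented for illustration and are not the memcg API): the "enabled"
      test moves out of the call sites and into the charge function itself.

```c
#include <stdbool.h>

static bool model_kmem_enabled;  /* stand-in for the memcg_kmem_enabled() state */
static int model_charged_pages;  /* pages accounted so far in this model */

/* Before the cleanup: every caller had to test model_kmem_enabled first. */
static int model_charge_unchecked(int nr_pages)
{
	model_charged_pages += nr_pages;
	return 0;
}

/* After the cleanup: the check is localized, so callers just call this. */
static int model_charge(int nr_pages)
{
	if (!model_kmem_enabled)
		return 0; /* accounting disabled: nothing to charge */
	return model_charge_unchecked(nr_pages);
}
```

      Callers of model_charge() no longer need their own guard: with the flag
      clear the call is a no-op, and with the flag set the pages are
      accounted.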
      
      Link: http://lkml.kernel.org/r/20190103161203.162375-1-shakeelb@google.com
      Signed-off-by: Shakeel Butt <shakeelb@google.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: Roman Gushchin <guro@fb.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • W
      mm, slub: make the comment of put_cpu_partial() complete · 9234bae9
      Wei Yang authored
      There are two cases when put_cpu_partial() is invoked.
      
          * __slab_free
          * get_partial_node
      
      This patch just makes the comment cover these two cases.
      
      Link: http://lkml.kernel.org/r/20181025094437.18951-3-richard.weiyang@gmail.com
      Signed-off-by: Wei Yang <richard.weiyang@gmail.com>
      Acked-by: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • K
      mm: reuse only-pte-mapped KSM page in do_wp_page() · 52d1e606
      Kirill Tkhai authored
      Add an optimization for KSM pages, almost in the same way that we have
      for ordinary anonymous pages.  If there is a write fault on a page
      that is mapped by only one pte and is not in the swap cache, the
      page may be reused without copying its content.
      
      [ Note that we do not consider PageSwapCache() pages at least for now,
        since we don't want to complicate __get_ksm_page(), which has a nice
        optimization based on this (for the migration case). Currently it
        spins on PageSwapCache() pages, waiting until their counters are
        unfrozen (i.e., for the migration to finish). But we don't want
        to make it also spin on swap cache pages that we try to reuse,
        since the probability of reusing them is not very high. So, for now
        we do not consider PageSwapCache() pages at all. ]
      
      So in reuse_ksm_page() we check for 1) PageSwapCache() and 2)
      page_stable_node(), to skip a page which KSM is currently trying to
      link to the stable tree.  Then we do page_ref_freeze() to prevent KSM from
      merging one more page into the page we are reusing.  After that, nobody
      can refer to the page being reused: KSM skips !PageSwapCache() pages with
      zero refcount; and the protection against all other participants is
      the same as for reused ordinary anon pages: pte lock, page lock and
      mmap_sem.
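      
      The order of checks described above can be modelled with C11 atomics in
      userspace.  This is a sketch, not the kernel code: struct model_page and
      the model_* functions are hypothetical, and the freeze is modelled as a
      compare-and-swap of the reference count from the expected value to
      zero.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for the few page properties the checks look at. */
struct model_page {
	atomic_int refcount;
	bool in_swap_cache;   /* models PageSwapCache() */
	bool has_stable_node; /* models page_stable_node() != NULL */
};

/* Freeze: succeeds only if the count equals 'expected', then drops it to 0. */
static bool model_ref_freeze(struct model_page *page, int expected)
{
	return atomic_compare_exchange_strong(&page->refcount, &expected, 0);
}

/* Unfreeze: restore the reference count once the page has been taken over. */
static void model_ref_unfreeze(struct model_page *page, int count)
{
	atomic_store(&page->refcount, count);
}

/* Returns true when the page may be reused without copying its content. */
static bool model_reuse_ksm_page(struct model_page *page)
{
	/* Skip swap cache pages and pages not yet linked to the stable tree. */
	if (page->in_swap_cache || !page->has_stable_node)
		return false;
	/* Only a page with exactly one reference can be frozen here. */
	if (!model_ref_freeze(page, 1))
		return false;
	/* While frozen, the page would be re-pointed at the faulting mapping. */
	model_ref_unfreeze(page, 1);
	return true;
}
```

      In this model a page with a single reference is reused, while a page
      with a second reference, a swap cache page, or a page without a stable
      node is left for the normal copy path.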
      
      [akpm@linux-foundation.org: replace BUG_ON()s with WARN_ON()s]
      Link: http://lkml.kernel.org/r/154471491016.31352.1168978849911555609.stgit@localhost.localdomain
      Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
      Reviewed-by: Yang Shi <yang.shi@linux.alibaba.com>
      Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Christian Koenig <christian.koenig@amd.com>
      Cc: Claudio Imbrenda <imbrenda@linux.vnet.ibm.com>
      Cc: Rik van Riel <riel@surriel.com>
      Cc: Huang Ying <ying.huang@intel.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Kirill Tkhai <ktkhai@virtuozzo.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>