1. 16 Nov, 2017 40 commits
    • mm: remove __GFP_COLD · 453f85d4
      Mel Gorman committed
      As the page free path makes no distinction between cache hot and cold
      pages, there is no real useful ordering of pages in the free list that
      allocation requests can take advantage of.  Judging from the users of
      __GFP_COLD, it is likely that a number of them are the result of copying
      other sites instead of actually measuring the impact.  Remove the
      __GFP_COLD parameter which simplifies a number of paths in the page
      allocator.
      
      This is potentially controversial but bear in mind that the size of the
      per-cpu pagelists versus modern cache sizes means that the whole per-cpu
      list can often fit in the L3 cache.  Hence, there is only a potential
      benefit for microbenchmarks that alloc/free pages in a tight loop.  It's
      even worse when THP is taken into account which has little or no chance
      of getting a cache-hot page as the per-cpu list is bypassed and the
      zeroing of multiple pages will thrash the cache anyway.
      
      The truncate microbenchmarks are not shown as this patch affects the
      allocation path and not the free path.  A page fault microbenchmark was
      tested but it showed no significant difference which is not surprising
      given that the __GFP_COLD branches are a minuscule percentage of the
      fault path.
      
      Link: http://lkml.kernel.org/r/20171018075952.10627-9-mgorman@techsingularity.net
      Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: remove cold parameter from free_hot_cold_page* · 2d4894b5
      Mel Gorman committed
      Most callers of free_hot_cold_page claim the pages being released
      are cache hot.  The exception is the page reclaim paths where it is
      likely that enough pages will be freed in the near future that the
      per-cpu lists are going to be recycled and the cache hotness information
      is lost.  As no one really cares about the hotness of pages being
      released to the allocator, just ditch the parameter.
      
      The APIs are renamed to indicate that it's no longer about hot/cold
      pages.  It should also be less confusing as there are subtle differences
      between them.  __free_pages drops a reference and frees a page when the
      refcount reaches zero.  free_hot_cold_page handled pages whose refcount
      was already zero which is non-obvious from the name.  free_unref_page
      should be more obvious.
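
      The distinction between the two entry points can be illustrated
      with a small userspace model.  This is only a sketch: struct
      toy_page and the toy_* functions are hypothetical stand-ins and do
      not reproduce the kernel implementation.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy page: only the fields needed to show the two entry points. */
struct toy_page {
    int refcount;
    bool freed;
};

/* Models free_unref_page: the caller guarantees the last reference is gone. */
void toy_free_unref_page(struct toy_page *page)
{
    assert(page->refcount == 0);
    page->freed = true;
}

/* Models __free_pages: drops one reference, frees only if it was the last. */
void toy_free_pages(struct toy_page *page)
{
    assert(page->refcount > 0);
    if (--page->refcount == 0)
        toy_free_unref_page(page);
}
```

      With a page that starts at refcount 2, the first toy_free_pages()
      call only drops a reference; the second one reaches zero and hands
      the page to toy_free_unref_page().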
      
      No performance impact is expected as the overhead is marginal.  The
      parameter is removed simply because it is a bit stupid to have a useless
      parameter copied everywhere.
      
      [mgorman@techsingularity.net: add pages to head, not tail]
        Link: http://lkml.kernel.org/r/20171019154321.qtpzaeftoyyw4iey@techsingularity.net
      Link: http://lkml.kernel.org/r/20171018075952.10627-8-mgorman@techsingularity.net
      Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: remove cold parameter for release_pages · c6f92f9f
      Mel Gorman committed
      All callers of release_pages claim the pages being released are cache
      hot.  As no one cares about the hotness of pages being released to the
      allocator, just ditch the parameter.
      
      No performance impact is expected as the overhead is marginal.  The
      parameter is removed simply because it is a bit stupid to have a useless
      parameter copied everywhere.
      
      Link: http://lkml.kernel.org/r/20171018075952.10627-7-mgorman@techsingularity.net
      Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm, pagevec: remove cold parameter for pagevecs · 86679820
      Mel Gorman committed
      Every pagevec_init user claims the pages being released are hot even in
      cases where it is unlikely the pages are hot.  As no one cares about the
      hotness of pages being released to the allocator, just ditch the
      parameter.
      
      No performance impact is expected as the overhead is marginal.  The
      parameter is removed simply because it is a bit stupid to have a useless
      parameter copied everywhere.
      
      Link: http://lkml.kernel.org/r/20171018075952.10627-6-mgorman@techsingularity.net
      Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: only drain per-cpu pagevecs once per pagevec usage · d9ed0d08
      Mel Gorman committed
      When a pagevec is initialised on the stack, it is generally used
      multiple times over a range of pages, looking up entries and then
      releasing them.  On each pagevec_release, the per-cpu deferred LRU
      pagevecs are drained on the grounds the page being released may be on
      those queues and the pages may be cache hot.  In many cases only the
      first drain is necessary as it's unlikely that the range of pages being
      walked is racing against LRU addition.  Even if there is such a race,
      the impact is marginal whereas constantly redraining the LRU pagevecs
      has a cost.
      
      This patch ensures that pagevec is only drained once in a given
      lifecycle without increasing the cache footprint of the pagevec
      structure.  Only sparsetruncate tiny is shown here as large files have
      many exceptional entries and call pagecache_release less frequently.
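
      A minimal userspace sketch of the "drain once per lifecycle" idea
      follows.  The struct layout, the `drained` flag and the toy_*
      names are assumptions made for illustration, not the kernel's
      definitions; the point is only that a one-byte flag next to the
      small counter lets release skip every drain after the first.

```c
#include <assert.h>
#include <stdbool.h>

struct toy_pagevec {
    unsigned char nr;   /* number of entries currently held */
    bool drained;       /* hypothetical: per-CPU queues already drained? */
};

unsigned long drain_calls;  /* counts how often the expensive drain ran */

/* Stand-in for draining the per-CPU deferred LRU queues. */
void toy_lru_drain(void)
{
    drain_calls++;
}

/* Starting a new lifecycle re-arms the drain. */
void toy_pagevec_init(struct toy_pagevec *pvec)
{
    pvec->nr = 0;
    pvec->drained = false;
}

/* Only the first release in a lifecycle pays for the drain. */
void toy_pagevec_release(struct toy_pagevec *pvec)
{
    if (!pvec->drained) {
        toy_lru_drain();
        pvec->drained = true;
    }
    pvec->nr = 0;
}
```

      Releasing the same vector several times after one
      toy_pagevec_init() increments drain_calls once; a fresh init
      allows exactly one more drain.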
      
      sparsetruncate (tiny)
                                    4.14.0-rc4             4.14.0-rc4
                              batchshadow-v1r1          onedrain-v1r1
      Min          Time      141.00 (   0.00%)      141.00 (   0.00%)
      1st-qrtle    Time      142.00 (   0.00%)      142.00 (   0.00%)
      2nd-qrtle    Time      142.00 (   0.00%)      142.00 (   0.00%)
      3rd-qrtle    Time      143.00 (   0.00%)      143.00 (   0.00%)
      Max-90%      Time      144.00 (   0.00%)      144.00 (   0.00%)
      Max-95%      Time      146.00 (   0.00%)      145.00 (   0.68%)
      Max-99%      Time      198.00 (   0.00%)      194.00 (   2.02%)
      Max          Time      254.00 (   0.00%)      208.00 (  18.11%)
      Amean        Time      145.12 (   0.00%)      144.30 (   0.56%)
      Stddev       Time       12.74 (   0.00%)        9.62 (  24.49%)
      Coeff        Time        8.78 (   0.00%)        6.67 (  24.06%)
      Best99%Amean Time      144.29 (   0.00%)      143.82 (   0.32%)
      Best95%Amean Time      142.68 (   0.00%)      142.31 (   0.26%)
      Best90%Amean Time      142.52 (   0.00%)      142.19 (   0.24%)
      Best75%Amean Time      142.26 (   0.00%)      141.98 (   0.20%)
      Best50%Amean Time      141.90 (   0.00%)      141.71 (   0.13%)
      Best25%Amean Time      141.80 (   0.00%)      141.43 (   0.26%)
      
      The impact on bonnie is marginal and within the noise because a
      significant percentage of the file being truncated has been reclaimed
      and consists of shadow entries which reduce the hotness of the
      pagevec_release path.
      
      Link: http://lkml.kernel.org/r/20171018075952.10627-5-mgorman@techsingularity.net
      Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm, truncate: remove all exceptional entries from pagevec under one lock · f2187599
      Mel Gorman committed
      During truncate each entry in a pagevec is checked to see if it is an
      exceptional entry and if so, the shadow entry is cleaned up.  This is
      potentially expensive as multiple entries for a mapping each lock/unlock
      the tree lock.  This patch batches the operation such that any exceptional
      entries removed from a pagevec only acquire the mapping tree lock once.
      The corner case where this is more expensive is where there is only one
      exceptional entry but this is unlikely due to temporal locality and how
      it affects LRU ordering.  Note that for truncations of small files
      created recently, this patch should show no gain because it only batches
      the handling of exceptional entries.
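
      The batching described above can be modelled in userspace as
      follows.  This is a sketch under stated assumptions: struct
      toy_entry, the lock counter and the toy_* functions are
      hypothetical and only count lock round-trips; they are not the
      kernel's truncate code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct toy_entry {
    bool exceptional;   /* shadow entry rather than a real page */
    bool cleared;       /* set once the shadow entry is cleaned up */
};

unsigned long tree_lock_acquisitions;   /* counts lock round-trips */

void toy_tree_lock(void)   { tree_lock_acquisitions++; }
void toy_tree_unlock(void) { }

/* Clears every exceptional entry with at most one lock round-trip. */
size_t toy_clear_exceptional_batched(struct toy_entry *vec, size_t n)
{
    size_t first = 0, cleared = 0;

    /* Skip the lock entirely when the vector holds no exceptional entry. */
    while (first < n && !vec[first].exceptional)
        first++;
    if (first == n)
        return 0;

    toy_tree_lock();
    for (size_t i = first; i < n; i++) {
        if (vec[i].exceptional) {
            vec[i].cleared = true;
            cleared++;
        }
    }
    toy_tree_unlock();
    return cleared;
}
```

      A vector with three exceptional entries takes the lock once
      instead of three times, and a vector with none never takes it.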
      
      sparsetruncate (large)
                                    4.14.0-rc4             4.14.0-rc4
                               pickhelper-v1r1       batchshadow-v1r1
      Min          Time       38.00 (   0.00%)       27.00 (  28.95%)
      1st-qrtle    Time       40.00 (   0.00%)       28.00 (  30.00%)
      2nd-qrtle    Time       44.00 (   0.00%)       41.00 (   6.82%)
      3rd-qrtle    Time      146.00 (   0.00%)      147.00 (  -0.68%)
      Max-90%      Time      153.00 (   0.00%)      153.00 (   0.00%)
      Max-95%      Time      155.00 (   0.00%)      156.00 (  -0.65%)
      Max-99%      Time      181.00 (   0.00%)      171.00 (   5.52%)
      Amean        Time       93.04 (   0.00%)       88.43 (   4.96%)
      Best99%Amean Time       92.08 (   0.00%)       86.13 (   6.46%)
      Best95%Amean Time       89.19 (   0.00%)       83.13 (   6.80%)
      Best90%Amean Time       85.60 (   0.00%)       79.15 (   7.53%)
      Best75%Amean Time       72.95 (   0.00%)       65.09 (  10.78%)
      Best50%Amean Time       39.86 (   0.00%)       28.20 (  29.25%)
      Best25%Amean Time       39.44 (   0.00%)       27.70 (  29.77%)
      
      bonnie
                                            4.14.0-rc4             4.14.0-rc4
                                       pickhelper-v1r1       batchshadow-v1r1
      Hmean     SeqCreate ops         71.92 (   0.00%)       76.78 (   6.76%)
      Hmean     SeqCreate read        42.42 (   0.00%)       45.01 (   6.10%)
      Hmean     SeqCreate del      26519.88 (   0.00%)    27191.87 (   2.53%)
      Hmean     RandCreate ops        71.92 (   0.00%)       76.95 (   7.00%)
      Hmean     RandCreate read       44.44 (   0.00%)       49.23 (  10.78%)
      Hmean     RandCreate del     24948.62 (   0.00%)    24764.97 (  -0.74%)
      
      Truncation of a large number of files shows a substantial gain with 99%
      of files being truncated 6.46% faster.  bonnie shows a modest gain of
      2.53% for sequential-create deletes.
      
      [jack@suse.cz: fix truncate_exceptional_pvec_entries()]
        Link: http://lkml.kernel.org/r/20171108164226.26788-1-jack@suse.cz
      Link: http://lkml.kernel.org/r/20171018075952.10627-4-mgorman@techsingularity.net
      Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
      Signed-off-by: Jan Kara <jack@suse.cz>
      Reviewed-by: Jan Kara <jack@suse.cz>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm, truncate: do not check mapping for every page being truncated · c7df8ad2
      Mel Gorman committed
      During truncation, the mapping has already been checked for shmem and
      dax so it's known that workingset_update_node is required.
      
      This patch avoids the checks on mapping for each page being truncated.
      In all other cases, a lookup helper is used to determine if
      workingset_update_node() needs to be called.  The one danger is that the
      API is slightly harder to use as calling workingset_update_node directly
      without checking for dax or shmem mappings could lead to surprises.
      However, the API rarely needs to be used and hopefully the comment is
      enough to give people the hint.
      
      sparsetruncate (tiny)
                                    4.14.0-rc4             4.14.0-rc4
                                   oneirq-v1r1        pickhelper-v1r1
      Min          Time      141.00 (   0.00%)      140.00 (   0.71%)
      1st-qrtle    Time      142.00 (   0.00%)      141.00 (   0.70%)
      2nd-qrtle    Time      142.00 (   0.00%)      142.00 (   0.00%)
      3rd-qrtle    Time      143.00 (   0.00%)      143.00 (   0.00%)
      Max-90%      Time      144.00 (   0.00%)      144.00 (   0.00%)
      Max-95%      Time      147.00 (   0.00%)      145.00 (   1.36%)
      Max-99%      Time      195.00 (   0.00%)      191.00 (   2.05%)
      Max          Time      230.00 (   0.00%)      205.00 (  10.87%)
      Amean        Time      144.37 (   0.00%)      143.82 (   0.38%)
      Stddev       Time       10.44 (   0.00%)        9.00 (  13.74%)
      Coeff        Time        7.23 (   0.00%)        6.26 (  13.41%)
      Best99%Amean Time      143.72 (   0.00%)      143.34 (   0.26%)
      Best95%Amean Time      142.37 (   0.00%)      142.00 (   0.26%)
      Best90%Amean Time      142.19 (   0.00%)      141.85 (   0.24%)
      Best75%Amean Time      141.92 (   0.00%)      141.58 (   0.24%)
      Best50%Amean Time      141.69 (   0.00%)      141.31 (   0.27%)
      Best25%Amean Time      141.38 (   0.00%)      140.97 (   0.29%)
      
      As you'd expect, the gain is marginal but it can be detected.  The
      differences in bonnie are all within the noise which is not surprising
      given the impact on the microbenchmark.
      
      radix_tree_update_node_t is a callback for some radix operations that
      optionally passes in a private field.  The only user of the callback is
      workingset_update_node and as it no longer requires a mapping, the
      private field is removed.
      
      Link: http://lkml.kernel.org/r/20171018075952.10627-3-mgorman@techsingularity.net
      Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Reviewed-by: Jan Kara <jack@suse.cz>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm, page_alloc: enable/disable IRQs once when freeing a list of pages · 9cca35d4
      Mel Gorman committed
      Patch series "Follow-up for speed up page cache truncation", v2.
      
      This series is a follow-on for Jan Kara's "Speed up page cache
      truncation" series.  We both ended up looking at the same problem but
      saw different problems based on the same data.  This series builds upon
      his work.
      
      A variety of workloads were compared on four separate machines but each
      machine showed gains albeit at different levels.  Minimally, some of the
      differences are due to NUMA where truncating data from a remote node is
      slower than a local node.  The workloads checked were
      
      o sparse truncate microbenchmark, tiny
      o sparse truncate microbenchmark, large
      o reaim-io disk workfile
      o dbench4 (modified by mmtests to produce more stable results)
      o filebench varmail configuration for small memory size
      o bonnie, directory operations, working set size 2*RAM
      
      reaim-io, dbench and filebench all showed minor gains.  Truncation does
      not dominate those workloads but they were tested to ensure no other
      regressions.  They will not be reported further.
      
      The sparse truncate microbench was written by Jan.  It creates a number
      of files and then times how long it takes to truncate each one.  The
      "tiny" configuration creates a number of files that easily fit in
      memory and times how long it takes to truncate files with page cache.
      The large configuration uses enough files to have data that is twice the
      size of memory and so timings there include truncating page cache and
      working set shadow entries in the radix tree.
      
      Patches 1-4 are the most relevant parts of this series.  Patches 5-8 are
      optional as they are deleting code that is essentially useless but has a
      negligible performance impact.
      
      The changelogs have more information on performance but just for bonnie
      delete options, the main comparison is
      
      bonnie
                                            4.14.0-rc5             4.14.0-rc5             4.14.0-rc5
                                                jan-v2                vanilla                 mel-v2
      Hmean     SeqCreate ops         76.20 (   0.00%)       75.80 (  -0.53%)       76.80 (   0.79%)
      Hmean     SeqCreate read        85.00 (   0.00%)       85.00 (   0.00%)       85.00 (   0.00%)
      Hmean     SeqCreate del      13752.31 (   0.00%)    12090.23 ( -12.09%)    15304.84 (  11.29%)
      Hmean     RandCreate ops        76.00 (   0.00%)       75.60 (  -0.53%)       77.00 (   1.32%)
      Hmean     RandCreate read       96.80 (   0.00%)       96.80 (   0.00%)       97.00 (   0.21%)
      Hmean     RandCreate del     13233.75 (   0.00%)    11525.35 ( -12.91%)    14446.61 (   9.16%)
      
      Jan's series is the baseline and the vanilla kernel is 12% slower whereas
      this series on top gains another 11%.  This is from a different
      machine than the data in the changelogs but the detailed data was not
      collected as there was no substantial change in v2.
      
      This patch (of 8):
      
      Freeing a list of pages currently enables/disables IRQs for each page
      freed.  This patch splits freeing a list of pages into two operations --
      preparing the pages for freeing and the actual freeing.  This is a
      tradeoff - we're taking two passes of the list to free in exchange for
      avoiding multiple enable/disable of IRQs.
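
      The tradeoff can be seen in a small userspace model.  This is a
      sketch only: struct toy_page, the IRQ counter and the toy_*
      functions are hypothetical stand-ins that count disable/enable
      pairs; no real interrupts are involved.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct toy_page {
    bool prepared;  /* set by the prepare step */
    bool freed;     /* set by the actual free step */
};

unsigned long irq_toggles;  /* counts IRQ disable/enable pairs */

void toy_irq_disable(void) { irq_toggles++; }
void toy_irq_enable(void)  { }

/* Old shape: every page pays for its own IRQ disable/enable pair. */
void toy_free_list_per_page(struct toy_page *pages, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        pages[i].prepared = true;
        toy_irq_disable();
        pages[i].freed = true;
        toy_irq_enable();
    }
}

/* New shape: pass 1 prepares with IRQs on, pass 2 frees in one IRQ-off section. */
void toy_free_list_batched(struct toy_page *pages, size_t n)
{
    for (size_t i = 0; i < n; i++)
        pages[i].prepared = true;

    toy_irq_disable();
    for (size_t i = 0; i < n; i++) {
        assert(pages[i].prepared);
        pages[i].freed = true;
    }
    toy_irq_enable();
}
```

      For a list of eight pages the per-page variant toggles IRQs eight
      times, while the batched variant walks the list twice but toggles
      IRQs once.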
      
      sparsetruncate (tiny)
                                    4.14.0-rc4             4.14.0-rc4
                                 janbatch-v1r1            oneirq-v1r1
      Min          Time      149.00 (   0.00%)      141.00 (   5.37%)
      1st-qrtle    Time      150.00 (   0.00%)      142.00 (   5.33%)
      2nd-qrtle    Time      151.00 (   0.00%)      142.00 (   5.96%)
      3rd-qrtle    Time      151.00 (   0.00%)      143.00 (   5.30%)
      Max-90%      Time      153.00 (   0.00%)      144.00 (   5.88%)
      Max-95%      Time      155.00 (   0.00%)      147.00 (   5.16%)
      Max-99%      Time      201.00 (   0.00%)      195.00 (   2.99%)
      Max          Time      236.00 (   0.00%)      230.00 (   2.54%)
      Amean        Time      152.65 (   0.00%)      144.37 (   5.43%)
      Stddev       Time        9.78 (   0.00%)       10.44 (  -6.72%)
      Coeff        Time        6.41 (   0.00%)        7.23 ( -12.84%)
      Best99%Amean Time      152.07 (   0.00%)      143.72 (   5.50%)
      Best95%Amean Time      150.75 (   0.00%)      142.37 (   5.56%)
      Best90%Amean Time      150.59 (   0.00%)      142.19 (   5.58%)
      Best75%Amean Time      150.36 (   0.00%)      141.92 (   5.61%)
      Best50%Amean Time      150.04 (   0.00%)      141.69 (   5.56%)
      Best25%Amean Time      149.85 (   0.00%)      141.38 (   5.65%)
      
      With a tiny number of files, each file truncated has resident page cache
      and it shows that time to truncate improves by roughly 5-6% with some
      minor jitter.
      
                                            4.14.0-rc4             4.14.0-rc4
                                         janbatch-v1r1            oneirq-v1r1
      Hmean     SeqCreate ops         65.27 (   0.00%)       81.86 (  25.43%)
      Hmean     SeqCreate read        39.48 (   0.00%)       47.44 (  20.16%)
      Hmean     SeqCreate del      24963.95 (   0.00%)    26319.99 (   5.43%)
      Hmean     RandCreate ops        65.47 (   0.00%)       82.01 (  25.26%)
      Hmean     RandCreate read       42.04 (   0.00%)       51.75 (  23.09%)
      Hmean     RandCreate del     23377.66 (   0.00%)    23764.79 (   1.66%)
      
      As expected, there is a small gain for the delete operation.
      
      [mgorman@techsingularity.net: use page_private and set_page_private helpers]
        Link: http://lkml.kernel.org/r/20171018101547.mjycw7zreb66jzpa@techsingularity.net
      Link: http://lkml.kernel.org/r/20171018075952.10627-2-mgorman@techsingularity.net
      Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: batch radix tree operations when truncating pages · aa65c29c
      Jan Kara committed
      Currently we remove pages from the radix tree one by one.  To speed up
      page cache truncation, lock several pages at once and free them in one
      go.  This allows us to batch radix tree operations in a more efficient
      way and also save round-trips on mapping->tree_lock.  As a result we
      gain about 20% speed improvement in page cache truncation.
      
      Data from a simple benchmark timing 10000 truncates of 1024 pages (on
      ext4 on ramdisk but the filesystem is barely visible in the profiles).
      The range shows 1% and 95% percentiles of the measured times:
      
        4.14-rc2	4.14-rc2 + batched truncation
        248-256	209-219
        249-258	209-217
        248-255	211-239
        248-255	209-217
        247-256	210-218
      
      [jack@suse.cz: convert delete_from_page_cache_batch() to pagevec]
        Link: http://lkml.kernel.org/r/20171018111648.13714-1-jack@suse.cz
      [akpm@linux-foundation.org: move struct pagevec forward declaration to top-of-file]
      Link: http://lkml.kernel.org/r/20171010151937.26984-8-jack@suse.cz
      Signed-off-by: Jan Kara <jack@suse.cz>
      Acked-by: Mel Gorman <mgorman@suse.de>
      Reviewed-by: Andi Kleen <ak@linux.intel.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: factor out checks and accounting from __delete_from_page_cache() · 5ecc4d85
      Jan Kara committed
      Move checks and accounting updates from __delete_from_page_cache() into
      a separate function.  We will reuse it when batching page cache
      truncation operations.
      
      Link: http://lkml.kernel.org/r/20171010151937.26984-7-jack@suse.cz
      Signed-off-by: Jan Kara <jack@suse.cz>
      Acked-by: Mel Gorman <mgorman@suse.de>
      Reviewed-by: Andi Kleen <ak@linux.intel.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: move clearing of page->mapping to page_cache_tree_delete() · 2300638b
      Jan Kara committed
      Clearing of page->mapping makes sense in page_cache_tree_delete() as
      well and it will help us with batching things this way.
      
      Link: http://lkml.kernel.org/r/20171010151937.26984-6-jack@suse.cz
      Signed-off-by: Jan Kara <jack@suse.cz>
      Acked-by: Mel Gorman <mgorman@suse.de>
      Reviewed-by: Andi Kleen <ak@linux.intel.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: move accounting updates before page_cache_tree_delete() · 76253fbc
      Jan Kara committed
      Move updates of various counters before page_cache_tree_delete() call.
      It will be easier to batch things this way and there is no difference
      whether the counters get updated before or after removal from the radix
      tree.
      
      Link: http://lkml.kernel.org/r/20171010151937.26984-5-jack@suse.cz
      Signed-off-by: Jan Kara <jack@suse.cz>
      Acked-by: Mel Gorman <mgorman@suse.de>
      Reviewed-by: Andi Kleen <ak@linux.intel.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: factor out page cache page freeing into a separate function · 59c66c5f
      Jan Kara committed
      Factor out page freeing from delete_from_page_cache() into a separate
      function.  We will need to call the same when batching pagecache
      deletion operations.
      
      invalidate_complete_page2() and replace_page_cache_page() might want to
      call this function as well however they currently don't seem to handle
      THPs so it's unnecessary for them to take the hit of checking whether a
      page is THP or not.
      
      Link: http://lkml.kernel.org/r/20171010151937.26984-4-jack@suse.cz
      Signed-off-by: Jan Kara <jack@suse.cz>
      Acked-by: Mel Gorman <mgorman@suse.de>
      Reviewed-by: Andi Kleen <ak@linux.intel.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: refactor truncate_complete_page() · 9f4e41f4
      Jan Kara committed
      Move call of delete_from_page_cache() and page->mapping check out of
      truncate_complete_page() into the single caller - truncate_inode_page().
      Also move page_mapped() check into truncate_complete_page().  That way
      it will be easier to batch operations.
      
      Also rename truncate_complete_page() to truncate_cleanup_page().
      
      Link: http://lkml.kernel.org/r/20171010151937.26984-3-jack@suse.cz
      Signed-off-by: Jan Kara <jack@suse.cz>
      Acked-by: Mel Gorman <mgorman@suse.de>
      Reviewed-by: Andi Kleen <ak@linux.intel.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: speed up cancel_dirty_page() for clean pages · 736304f3
      Jan Kara committed
      Patch series "Speed up page cache truncation", v1.
      
      When rebasing our enterprise distro to a newer kernel (from 4.4 to 4.12)
      we have noticed a regression in bonnie++ benchmark when deleting files.
      Eventually we have tracked this down to a fact that page cache
      truncation got slower by about 10%.  There were both gains and losses in
      the above interval of kernels but we have been able to identify that
      commit 83929372 ("filemap: prepare find and delete operations for
      huge pages") caused about 10% regression on its own.
      
      After some investigation it didn't seem easily possible to fix the
      regression while maintaining the THP in page cache functionality so
      we've decided to optimize the page cache truncation path instead to make
      up for the change.  This series is a result of that effort.
      
      Patch 1 is an easy speedup of cancel_dirty_page().  Patches 2-6 refactor
      page cache truncation code so that it is easier to batch radix tree
      operations.  Patch 7 implements batching of deletes from the radix tree
      which more than makes up for the original regression.
      
      This patch (of 7):
      
      cancel_dirty_page() does quite some work even for clean pages (fetching
      of mapping, locking of memcg, atomic bit op on page flags) so it
      accounts for ~2.5% of cost of truncation of a clean page.  That is not
      much but still dumb for something we don't need at all.  Check whether a
      page is actually dirty and avoid any work if not.
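
      The shape of this change is a cheap test in front of an expensive
      path.  The userspace sketch below illustrates it; struct toy_page,
      the call counter and the toy_* functions are hypothetical, and the
      slow path merely stands in for the mapping lookup, memcg locking
      and atomic flag operation mentioned above.

```c
#include <assert.h>
#include <stdbool.h>

struct toy_page {
    bool dirty;
};

unsigned long slow_path_calls;  /* counts entries into the expensive path */

/* Stand-in for the expensive work needed to cancel a dirty page. */
void toy_cancel_dirty_slow(struct toy_page *page)
{
    slow_path_calls++;
    page->dirty = false;
}

/* Cheap flag test first; clean pages return without any further work. */
void toy_cancel_dirty(struct toy_page *page)
{
    if (page->dirty)
        toy_cancel_dirty_slow(page);
}
```

      Calling toy_cancel_dirty() on a clean page leaves slow_path_calls
      untouched; only a dirty page enters the slow path.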
      
      Link: http://lkml.kernel.org/r/20171010151937.26984-2-jack@suse.cz
      Signed-off-by: Jan Kara <jack@suse.cz>
      Acked-by: Mel Gorman <mgorman@suse.de>
      Reviewed-by: Andi Kleen <ak@linux.intel.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • drivers/block/zram/zram_drv.c: make zram_page_end_io() static · 384bc41f
      Colin Ian King committed
      zram_page_end_io() is local to the source and does not need to be in
      global scope, so make it static.
      
      Cleans up sparse warning:
      
        symbol 'zram_page_end_io' was not declared. Should it be static?
      
      Link: http://lkml.kernel.org/r/20171016173336.20320-1-colin.king@canonical.com
      Signed-off-by: Colin Ian King <colin.king@canonical.com>
      Reviewed-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/page-writeback.c: convert timers to use timer_setup() · 9823e51b
      Kees Cook committed
      In preparation for unconditionally passing the struct timer_list pointer
      to all timer callbacks, switch to using the new timer_setup() and
      from_timer() to pass the timer pointer explicitly.
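
      The pattern behind this conversion is that the callback receives
      the timer pointer and recovers its enclosing object from it.  The
      userspace sketch below mimics that idea with offsetof(); all toy_*
      names are hypothetical and this is not the kernel's timer API,
      whose timer_setup() and from_timer() have their own signatures.

```c
#include <assert.h>
#include <stddef.h>

/* Toy timer: the callback is handed the timer itself, not opaque data. */
struct toy_timer {
    void (*function)(struct toy_timer *);
};

/* An object that embeds its timer, as kernel structures commonly do. */
struct toy_device {
    int ticks;
    struct toy_timer timer;
};

/* Recover the enclosing object from a pointer to one of its members. */
#define toy_container_of(ptr, type, member) \
    ((type *)((char *)(ptr) - offsetof(type, member)))

void toy_timer_setup(struct toy_timer *t, void (*fn)(struct toy_timer *))
{
    t->function = fn;
}

/* The callback derives its context from the timer pointer it is given. */
void toy_device_timeout(struct toy_timer *t)
{
    struct toy_device *dev = toy_container_of(t, struct toy_device, timer);

    dev->ticks++;
}
```

      Firing the callback as dev.timer.function(&dev.timer) increments
      dev.ticks without any separately stored context pointer.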
      
      Link: http://lkml.kernel.org/r/20171016225913.GA99214@beast
      Signed-off-by: Kees Cook <keescook@chromium.org>
      Reviewed-by: Jan Kara <jack@suse.cz>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: Matthew Wilcox <mawilcox@microsoft.com>
      Cc: Jeff Layton <jlayton@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      9823e51b
    • L
      mm, soft_offline: improve hugepage soft offlining error log · b6b18aa8
      Laszlo Toth committed
      On a failed attempt, we get the following entry:

        soft offline: 0x3c0000: migration failed 1, type 17ffffc0008008 (uptodate|head)

      Make this more specific so that it is straightforward and follows the
      other error log formats in soft_offline_huge_page().
      
      Link: http://lkml.kernel.org/r/20171016171757.GA3018@ubuntu-desk-vm
      Signed-off-by: Laszlo Toth <laszlth@gmail.com>
      Acked-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      b6b18aa8
    • M
      userfaultfd: use mmgrab instead of open-coded increment of mm_count · 00bb31fa
      Mike Rapoport committed
      Link: http://lkml.kernel.org/r/1508132478-7738-1-git-send-email-rppt@linux.vnet.ibm.com
      Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: "Dr . David Alan Gilbert" <dgilbert@redhat.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      00bb31fa
    • A
      mm/page_alloc: make sure __rmqueue() etc are always inline · 85ccc8fa
      Aaron Lu committed
      __rmqueue(), __rmqueue_fallback(), __rmqueue_smallest() and
      __rmqueue_cma_fallback() are all in page allocator's hot path and better
      be finished as soon as possible.  One way to make them faster is by making
      them inline.  But as Andrew Morton and Andi Kleen pointed out:
      
        https://lkml.org/lkml/2017/10/10/1252
        https://lkml.org/lkml/2017/10/10/1279
      
      To make sure they are inlined, we should use __always_inline for them.
      
      With the will-it-scale/page_fault1/process benchmark, when using nr_cpu
      processes to stress buddy, the results for will-it-scale.processes with
      and without the patch are:
      
      On a 2-sockets Intel-Skylake machine:
      
         compiler          base        head
        gcc-4.4.7       6496131     6911823 +6.4%
        gcc-4.9.4       7225110     7731072 +7.0%
        gcc-5.4.1       7054224     7688146 +9.0%
        gcc-6.2.0       7059794     7651675 +8.4%
      
      On a 4-sockets Intel-Skylake machine:
      
         compiler          base        head
        gcc-4.4.7      13162890    13508193 +2.6%
        gcc-4.9.4      14997463    15484353 +3.2%
        gcc-5.4.1      14708711    15449805 +5.0%
        gcc-6.2.0      14574099    15349204 +5.3%
      
      The above 4 compilers are used because I've done the tests through
      Intel's Linux Kernel Performance(LKP) infrastructure and they are the
      available compilers there.
      
      The benefit is smaller on the 4-socket machine because the lock
      contention there (perf-profile/native_queued_spin_lock_slowpath=81%) is
      less severe than on the 2-socket machine (85%).
      
      What the benchmark does is: it forks nr_cpu processes and then each
      process does the following:
          1 mmap() 128M anonymous space;
          2 writes to each page there to trigger actual page allocation;
          3 munmap() it.
      in a loop.
      
        https://github.com/antonblanchard/will-it-scale/blob/master/tests/page_fault1.c
      
      Binary size wise, I have locally built them with different compilers:
      
      [aaron@aaronlu obj]$ size */*/mm/page_alloc.o
         text    data     bss     dec     hex filename
        37409    9904    8524   55837    da1d gcc-4.9.4/base/mm/page_alloc.o
        38273    9904    8524   56701    dd7d gcc-4.9.4/head/mm/page_alloc.o
        37465    9840    8428   55733    d9b5 gcc-5.5.0/base/mm/page_alloc.o
        38169    9840    8428   56437    dc75 gcc-5.5.0/head/mm/page_alloc.o
        37573    9840    8428   55841    da21 gcc-6.4.0/base/mm/page_alloc.o
        38261    9840    8428   56529    dcd1 gcc-6.4.0/head/mm/page_alloc.o
        36863    9840    8428   55131    d75b gcc-7.2.0/base/mm/page_alloc.o
        37711    9840    8428   55979    daab gcc-7.2.0/head/mm/page_alloc.o
      
      Text size increased about 800 bytes for mm/page_alloc.o.
      
      [aaron@aaronlu obj]$ size */*/vmlinux
         text    data     bss     dec       hex     filename
      10342757   5903208 17723392 33969357  20654cd gcc-4.9.4/base/vmlinux
      10342757   5903208 17723392 33969357  20654cd gcc-4.9.4/head/vmlinux
      10332448   5836608 17715200 33884256  2050860 gcc-5.5.0/base/vmlinux
      10332448   5836608 17715200 33884256  2050860 gcc-5.5.0/head/vmlinux
      10094546   5836696 17715200 33646442  201676a gcc-6.4.0/base/vmlinux
      10094546   5836696 17715200 33646442  201676a gcc-6.4.0/head/vmlinux
      10018775   5828732 17715200 33562707  2002053 gcc-7.2.0/base/vmlinux
      10018775   5828732 17715200 33562707  2002053 gcc-7.2.0/head/vmlinux
      
      Text size for vmlinux has no change though, probably due to function
      alignment.
      
      Link: http://lkml.kernel.org/r/20171013063111.GA26032@intel.com
      Signed-off-by: Aaron Lu <aaron.lu@intel.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: Huang Ying <ying.huang@intel.com>
      Cc: Tim Chen <tim.c.chen@linux.intel.com>
      Cc: Kemi Wang <kemi.wang@intel.com>
      Cc: Anshuman Khandual <khandual@linux.vnet.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      85ccc8fa
    • P
      sparc64: optimize struct page zeroing · 78c94366
      Pavel Tatashin committed
      Add an optimized mm_zero_struct_page() so that struct pages are zeroed
      without calling memset().  We do eight to ten regular stores, depending
      on the size of struct page.  The compiler optimizes out the conditions
      of the switch() statement.
      
      SPARC-M6 with 15T of memory, single thread performance:
      
                                     BASE            FIX  OPTIMIZED_FIX
              bootmem_init   28.440467985s   2.305674818s   2.305161615s
      free_area_init_nodes  202.845901673s 225.343084508s 172.556506560s
                            --------------------------------------------
      Total                 231.286369658s 227.648759326s 174.861668175s
      
      BASE:  current linux
      FIX:   This patch series without "optimized struct page zeroing"
      OPTIMIZED_FIX: This patch series including the current patch.
      
      bootmem_init() is where memory for struct pages is zeroed during
      allocation.  Note, about two seconds in this function is a fixed time:
      it does not increase as memory is increased.
      
      Link: http://lkml.kernel.org/r/20171013173214.27300-11-pasha.tatashin@oracle.com
      Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com>
      Reviewed-by: Steven Sistare <steven.sistare@oracle.com>
      Reviewed-by: Daniel Jordan <daniel.m.jordan@oracle.com>
      Reviewed-by: Bob Picco <bob.picco@oracle.com>
      Acked-by: David S. Miller <davem@davemloft.net>
      Cc: Alexander Potapenko <glider@google.com>
      Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
      Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Christian Borntraeger <borntraeger@de.ibm.com>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Mark Rutland <mark.rutland@arm.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Sam Ravnborg <sam@ravnborg.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Will Deacon <will.deacon@arm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      78c94366
    • P
      mm: stop zeroing memory during allocation in vmemmap · f7f99100
      Pavel Tatashin committed
      vmemmap_alloc_block() will no longer zero the block, so zero memory at
      its call sites for everything except struct pages.  Struct page memory
      is zero'd by struct page initialization.
      
      Replace allocators in sparse-vmemmap to use the non-zeroing version.
      So, we will get the performance improvement by zeroing the memory in
      parallel when struct pages are zeroed.
      
      Add struct page zeroing as a part of initialization of other fields in
      __init_single_page().
      
      This single thread performance collected on: Intel(R) Xeon(R) CPU E7-8895
      v3 @ 2.60GHz with 1T of memory (268400646 pages in 8 nodes):
      
                               BASE            FIX
      sparse_init     11.244671836s   0.007199623s
      zone_sizes_init  4.879775891s   8.355182299s
                        --------------------------
      Total           16.124447727s   8.362381922s
      
      sparse_init is where memory for struct pages is zeroed, and the zeroing
      part is moved later in this patch into __init_single_page(), which is
      called from zone_sizes_init().
      
      [akpm@linux-foundation.org: make vmemmap_alloc_block_zero() private to sparse-vmemmap.c]
      Link: http://lkml.kernel.org/r/20171013173214.27300-10-pasha.tatashin@oracle.com
      Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com>
      Reviewed-by: Steven Sistare <steven.sistare@oracle.com>
      Reviewed-by: Daniel Jordan <daniel.m.jordan@oracle.com>
      Reviewed-by: Bob Picco <bob.picco@oracle.com>
      Tested-by: Bob Picco <bob.picco@oracle.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Alexander Potapenko <glider@google.com>
      Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
      Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Christian Borntraeger <borntraeger@de.ibm.com>
      Cc: David S. Miller <davem@davemloft.net>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Mark Rutland <mark.rutland@arm.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Sam Ravnborg <sam@ravnborg.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Will Deacon <will.deacon@arm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      f7f99100
    • W
      arm64/mm/kasan: don't use vmemmap_populate() to initialize shadow · e17d8025
      Will Deacon committed
      The kasan shadow is currently mapped using vmemmap_populate() since that
      provides a semi-convenient way to map pages into init_top_pgt.  However,
      since that no longer zeroes the mapped pages, it is not suitable for
      kasan, which requires zeroed shadow memory.
      
      Add kasan_populate_shadow() interface and use it instead of
      vmemmap_populate().  Besides, this allows us to take advantage of
      gigantic pages and use them to populate the shadow, which should save us
      some memory wasted on page tables and reduce TLB pressure.
      
      Link: http://lkml.kernel.org/r/20171103185147.2688-3-pasha.tatashin@oracle.com
      Signed-off-by: Will Deacon <will.deacon@arm.com>
      Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com>
      Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
      Cc: Steven Sistare <steven.sistare@oracle.com>
      Cc: Daniel Jordan <daniel.m.jordan@oracle.com>
      Cc: Bob Picco <bob.picco@oracle.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Alexander Potapenko <glider@google.com>
      Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Christian Borntraeger <borntraeger@de.ibm.com>
      Cc: David S. Miller <davem@davemloft.net>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Mark Rutland <mark.rutland@arm.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Sam Ravnborg <sam@ravnborg.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      e17d8025
    • A
      x86/mm/kasan: don't use vmemmap_populate() to initialize shadow · d17a1d97
      Andrey Ryabinin committed
      The kasan shadow is currently mapped using vmemmap_populate() since that
      provides a semi-convenient way to map pages into init_top_pgt.  However,
      since that no longer zeroes the mapped pages, it is not suitable for
      kasan, which requires zeroed shadow memory.
      
      Add kasan_populate_shadow() interface and use it instead of
      vmemmap_populate().  Besides, this allows us to take advantage of
      gigantic pages and use them to populate the shadow, which should save us
      some memory wasted on page tables and reduce TLB pressure.
      
      Link: http://lkml.kernel.org/r/20171103185147.2688-2-pasha.tatashin@oracle.com
      Signed-off-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
      Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com>
      Cc: Steven Sistare <steven.sistare@oracle.com>
      Cc: Daniel Jordan <daniel.m.jordan@oracle.com>
      Cc: Bob Picco <bob.picco@oracle.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Alexander Potapenko <glider@google.com>
      Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Christian Borntraeger <borntraeger@de.ibm.com>
      Cc: David S. Miller <davem@davemloft.net>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Mark Rutland <mark.rutland@arm.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Sam Ravnborg <sam@ravnborg.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Will Deacon <will.deacon@arm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      d17a1d97
    • P
      mm: zero reserved and unavailable struct pages · a4a3ede2
      Pavel Tatashin committed
      Some memory is reserved but unavailable: not present in memblock.memory
      (because not backed by physical pages), but present in memblock.reserved.
      Such memory has backing struct pages, but they are not initialized by
      going through __init_single_page().
      
      In some cases these struct pages are accessed even if they do not
      contain any data.  One example is page_to_pfn(), which might access
      page->flags if this is where section information is stored
      (CONFIG_SPARSEMEM, SECTION_IN_PAGE_FLAGS).

      One example of such memory: trim_low_memory_range() unconditionally
      reserves from pfn 0, but e820__memblock_setup() might provide the
      existing memory from pfn 1 (e.g. under KVM).
      
      Since struct pages are zeroed in __init_single_page(), and not during
      allocation time, we must zero such struct pages explicitly.
      
      The patch involves adding a new memblock iterator:
      	for_each_resv_unavail_range(i, p_start, p_end)
      
      Which iterates through reserved && !memory lists, and we zero struct pages
      explicitly by calling mm_zero_struct_page().
      
      ===
      
      Here is a more detailed example of the problem that this patch addresses:

      Tested on qemu with the following arguments:
      
      	-enable-kvm -cpu kvm64 -m 512 -smp 2
      
      This patch reports that there are 98 unavailable pages.
      
      They are: pfn 0 and pfns in range [159, 255].
      
      Note that trim_low_memory_range() reserves only pfns in the range
      [0, 15]; it does not reserve the [159, 255] ones.

      e820__memblock_setup() reports to Linux that the following physical
      ranges are available:
          [  1,    158]
          [256, 130783]

      Notice that exactly the unavailable pfns are missing!

      Now, let's check what we have in zone 0: [1, 131039]

      pfn 0 is not part of the zone, but pfns [1, 158] are.
      
      However, the bigger problem we have if we do not initialize these struct
      pages is with memory hotplug, because that path operates at 2M
      boundaries (section_nr) and checks whether a 2M range of pages is hot
      removable.  It starts with the first pfn from the zone, rounds it down
      to a 2M boundary (struct pages are allocated at 2M boundaries when
      vmemmap is created), and checks if that section is hot removable.  In
      this case it starts with pfn 1 and rounds it down to pfn 0.  Later the
      pfn is converted to a struct page, and some fields are checked.  Now, if
      we do not zero struct pages, we get unpredictable results.

      In fact, when CONFIG_DEBUG_VM is enabled and we explicitly set all
      vmemmap memory to ones, the following panic is observed with a kernel
      tested without this patch applied:
      
        BUG: unable to handle kernel NULL pointer dereference at          (null)
        IP: is_pageblock_removable_nolock+0x35/0x90
        PGD 0 P4D 0
        Oops: 0000 [#1] PREEMPT
        ...
        task: ffff88001f4e2900 task.stack: ffffc90000314000
        RIP: 0010:is_pageblock_removable_nolock+0x35/0x90
        Call Trace:
         ? is_mem_section_removable+0x5a/0xd0
         show_mem_removable+0x6b/0xa0
         dev_attr_show+0x1b/0x50
         sysfs_kf_seq_show+0xa1/0x100
         kernfs_seq_show+0x22/0x30
         seq_read+0x1ac/0x3a0
         kernfs_fop_read+0x36/0x190
         ? security_file_permission+0x90/0xb0
         __vfs_read+0x16/0x30
         vfs_read+0x81/0x130
         SyS_read+0x44/0xa0
         entry_SYSCALL_64_fastpath+0x1f/0xbd
      
      Link: http://lkml.kernel.org/r/20171013173214.27300-7-pasha.tatashin@oracle.com
      Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com>
      Reviewed-by: Steven Sistare <steven.sistare@oracle.com>
      Reviewed-by: Daniel Jordan <daniel.m.jordan@oracle.com>
      Reviewed-by: Bob Picco <bob.picco@oracle.com>
      Tested-by: Bob Picco <bob.picco@oracle.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Alexander Potapenko <glider@google.com>
      Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
      Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Christian Borntraeger <borntraeger@de.ibm.com>
      Cc: David S. Miller <davem@davemloft.net>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Mark Rutland <mark.rutland@arm.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Sam Ravnborg <sam@ravnborg.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Will Deacon <will.deacon@arm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      a4a3ede2
    • P
      mm: define memblock_virt_alloc_try_nid_raw · ea1f5f37
      Pavel Tatashin committed
      * A new variant of memblock_virt_alloc_* allocations:
      memblock_virt_alloc_try_nid_raw()
          - Does not zero the allocated memory
          - Does not panic if request cannot be satisfied
      
      * optimize early system hash allocations
      
      Clients can call alloc_large_system_hash() with flag: HASH_ZERO to
      specify that memory that was allocated for system hash needs to be
      zeroed, otherwise the memory does not need to be zeroed, and client will
      initialize it.
      
      If memory does not need to be zero'd, call the new
      memblock_virt_alloc_raw() interface, and thus improve the boot
      performance.
      
      * debug for raw allocator

      When CONFIG_DEBUG_VM is enabled, this patch sets all the memory that is
      returned by memblock_virt_alloc_try_nid_raw() to ones to ensure that no
      places expect zeroed memory.
      
      Link: http://lkml.kernel.org/r/20171013173214.27300-6-pasha.tatashin@oracle.com
      Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com>
      Reviewed-by: Steven Sistare <steven.sistare@oracle.com>
      Reviewed-by: Daniel Jordan <daniel.m.jordan@oracle.com>
      Reviewed-by: Bob Picco <bob.picco@oracle.com>
      Tested-by: Bob Picco <bob.picco@oracle.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Alexander Potapenko <glider@google.com>
      Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
      Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Christian Borntraeger <borntraeger@de.ibm.com>
      Cc: David S. Miller <davem@davemloft.net>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Mark Rutland <mark.rutland@arm.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Sam Ravnborg <sam@ravnborg.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Will Deacon <will.deacon@arm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      ea1f5f37
    • P
      sparc64: simplify vmemmap_populate · df8ee578
      Pavel Tatashin committed
      Remove duplicated code by using the common functions
      vmemmap_pud_populate and vmemmap_pgd_populate.
      
      Link: http://lkml.kernel.org/r/20171013173214.27300-5-pasha.tatashin@oracle.com
      Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com>
      Reviewed-by: Steven Sistare <steven.sistare@oracle.com>
      Reviewed-by: Daniel Jordan <daniel.m.jordan@oracle.com>
      Reviewed-by: Bob Picco <bob.picco@oracle.com>
      Acked-by: David S. Miller <davem@davemloft.net>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Alexander Potapenko <glider@google.com>
      Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
      Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Christian Borntraeger <borntraeger@de.ibm.com>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Mark Rutland <mark.rutland@arm.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Sam Ravnborg <sam@ravnborg.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Will Deacon <will.deacon@arm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      df8ee578
    • P
      sparc64/mm: set fields in deferred pages · 2a20aa17
      Pavel Tatashin committed
      Without deferred struct page feature (CONFIG_DEFERRED_STRUCT_PAGE_INIT),
      flags and other fields in "struct page"es are never changed prior to
      first initializing struct pages by going through __init_single_page().
      
      With deferred struct page feature enabled there is a case where we set
      some fields prior to initializing:
      
      mem_init() {
           register_page_bootmem_info();
           free_all_bootmem();
           ...
      }
      
      When register_page_bootmem_info() is called, only non-deferred struct
      pages are initialized.  But this function goes through some reserved
      pages which might be among the deferred ones, and thus are not yet
      initialized.
      
      mem_init
      register_page_bootmem_info
      register_page_bootmem_info_node
       get_page_bootmem
        .. setting fields here ..
        such as: page->freelist = (void *)type;
      
      free_all_bootmem()
      free_low_memory_core_early()
       for_each_reserved_mem_region()
        reserve_bootmem_region()
         init_reserved_page() <- Only if this is deferred reserved page
          __init_single_pfn()
           __init_single_page()
            memset(0) <-- Lose the set fields here
      
      We end up with a similar issue as in the previous patch: currently we
      do not observe a problem because the memory is zeroed, but if flag
      asserts are changed we can start hitting issues.
      
      Also, because in this patch series we will stop zeroing struct page
      memory during allocation, we must make sure that struct pages are
      properly initialized prior to using them.
      
      The deferred-reserved pages are initialized in free_all_bootmem().
      Therefore, the fix is to switch the above calls.
      
      Link: http://lkml.kernel.org/r/20171013173214.27300-4-pasha.tatashin@oracle.com
      Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com>
      Reviewed-by: Steven Sistare <steven.sistare@oracle.com>
      Reviewed-by: Daniel Jordan <daniel.m.jordan@oracle.com>
      Reviewed-by: Bob Picco <bob.picco@oracle.com>
      Acked-by: David S. Miller <davem@davemloft.net>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Alexander Potapenko <glider@google.com>
      Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
      Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Christian Borntraeger <borntraeger@de.ibm.com>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Mark Rutland <mark.rutland@arm.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Sam Ravnborg <sam@ravnborg.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Will Deacon <will.deacon@arm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      2a20aa17
    • P
      x86/mm: set fields in deferred pages · 353b1e7b
      Pavel Tatashin committed
      Without deferred struct page feature (CONFIG_DEFERRED_STRUCT_PAGE_INIT),
      flags and other fields in "struct page"es are never changed prior to
      first initializing struct pages by going through __init_single_page().
      
      With deferred struct page feature enabled, however, we set fields in
      register_page_bootmem_info that are subsequently clobbered right after
      in free_all_bootmem:
      
              mem_init() {
                      register_page_bootmem_info();
                      free_all_bootmem();
                      ...
              }
      
      When register_page_bootmem_info() is called, only non-deferred struct
      pages are initialized.  But this function goes through some reserved
      pages which might be among the deferred ones, and thus are not yet
      initialized.
      
        mem_init
         register_page_bootmem_info
          register_page_bootmem_info_node
           get_page_bootmem
            .. setting fields here ..
            such as: page->freelist = (void *)type;
      
        free_all_bootmem()
         free_low_memory_core_early()
          for_each_reserved_mem_region()
           reserve_bootmem_region()
            init_reserved_page() <- Only if this is deferred reserved page
             __init_single_pfn()
              __init_single_page()
                  memset(0) <-- Lose the set fields here
      
      Currently we do not observe a problem because the memory is explicitly
      zeroed, but if flag asserts are changed we can start hitting issues.
      
      Also, because in this patch series we will stop zeroing struct page
      memory during allocation, we must make sure that struct pages are
      properly initialized prior to using them.
      
      The deferred-reserved pages are initialized in free_all_bootmem().
      Therefore, the fix is to switch the above calls.
      
      Link: http://lkml.kernel.org/r/20171013173214.27300-3-pasha.tatashin@oracle.com
      Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com>
      Reviewed-by: Steven Sistare <steven.sistare@oracle.com>
      Reviewed-by: Daniel Jordan <daniel.m.jordan@oracle.com>
      Reviewed-by: Bob Picco <bob.picco@oracle.com>
      Tested-by: Bob Picco <bob.picco@oracle.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Alexander Potapenko <glider@google.com>
      Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
      Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Christian Borntraeger <borntraeger@de.ibm.com>
      Cc: David S. Miller <davem@davemloft.net>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Mark Rutland <mark.rutland@arm.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Sam Ravnborg <sam@ravnborg.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Will Deacon <will.deacon@arm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      353b1e7b
    • P
      mm: deferred_init_memmap improvements · 2f47a91f
      Pavel Tatashin committed
      Patch series "complete deferred page initialization", v12.
      
      SMP machines can benefit from the DEFERRED_STRUCT_PAGE_INIT config
      option, which defers initializing struct pages until all cpus have been
      started so it can be done in parallel.
      
      However, this feature is sub-optimal, because the deferred page
      initialization code expects that the struct pages have already been
      zeroed, and the zeroing is done early in boot with a single thread only.
      Also, we access that memory and set flags before struct pages are
      initialized.  All of this is fixed in this patchset.
      
      In this work we do the following:
       - Never read a struct page before it has been initialized
       - Never set any fields in struct pages before they are initialized
       - Zero struct page at the beginning of struct page initialization
      
      ==========================================================================
      Performance improvements on x86 machine with 8 nodes:
      Intel(R) Xeon(R) CPU E7-8895 v3 @ 2.60GHz and 1T of memory:
                              TIME          SPEED UP
      base no deferred:       95.796233s
      fix no deferred:        79.978956s    19.77%
      
      base deferred:          77.254713s
      fix deferred:           55.050509s    40.34%
      ==========================================================================
      SPARC M6 3600 MHz with 15T of memory
                              TIME          SPEED UP
      base no deferred:       358.335727s
      fix no deferred:        302.320936s   18.52%
      
      base deferred:          237.534603s
      fix deferred:           182.103003s   30.44%
      ==========================================================================
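      The SPEED UP column appears to be the base time divided by the fixed
      time, minus one, expressed as a percentage; recomputing it matches the
      tables to within rounding.  A minimal sketch of that arithmetic (the
      function name is hypothetical and not part of the patch series):

```c
#include <assert.h>

/*
 * Assumed formula, inferred from the tables above:
 *   speed-up % = (base / fix - 1) * 100
 * i.e. how much longer the unpatched boot step took than the patched one.
 */
double speedup_percent(double base_seconds, double fix_seconds)
{
	assert(fix_seconds > 0.0);
	return (base_seconds / fix_seconds - 1.0) * 100.0;
}
```

      For example, speedup_percent(95.796233, 79.978956) reproduces the x86
      "no deferred" row.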
      Raw dmesg output with timestamps:
      x86 base no deferred:    https://hastebin.com/ofunepurit.scala
      x86 base deferred:       https://hastebin.com/ifazegeyas.scala
      x86 fix no deferred:     https://hastebin.com/pegocohevo.scala
      x86 fix deferred:        https://hastebin.com/ofupevikuk.scala
      sparc base no deferred:  https://hastebin.com/ibobeteken.go
      sparc base deferred:     https://hastebin.com/fariqimiyu.go
      sparc fix no deferred:   https://hastebin.com/muhegoheyi.go
      sparc fix deferred:      https://hastebin.com/xadinobutu.go
      
      This patch (of 11):
      
      deferred_init_memmap() is called when struct pages are initialized later
      in boot by slave CPUs.  This patch simplifies and optimizes this
      function, and also fixes a couple of issues (described below).
      
      The main change is that now we are iterating through free memblock areas
      instead of all configured memory.  Thus, we do not have to check if the
      struct page has already been initialized.
      
      =====
      In deferred_init_memmap() where all deferred struct pages are
      initialized we have a check like this:
      
        if (page->flags) {
      	VM_BUG_ON(page_zone(page) != zone);
      	goto free_range;
        }
      
      This way we are checking if the current deferred page has already been
      initialized.  It works because the memory for struct pages has been
      zeroed, and the only way flags can be non-zero is if the page already
      went through __init_single_page().  But once we change the current
      behavior and stop zeroing the memory in the memblock allocator, we
      cannot trust anything inside a struct page until it is initialized.
      This patch fixes this.
      
      The deferred_init_memmap() is re-written to loop through only free
      memory ranges provided by memblock.
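      The idea can be illustrated with a small userspace model.  This is only
      a sketch under stated assumptions: struct page_sketch, struct pfn_range
      and both functions are hypothetical stand-ins, not the kernel's struct
      page, memblock iterators or __init_single_page().  The point is that
      the loop visits only the free ranges it is handed and never reads a
      page's contents to decide whether it was already initialized.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-ins for illustration only. */
struct page_sketch {
	unsigned long flags;
	int zone_id;
};

struct pfn_range {
	unsigned long start_pfn;	/* inclusive */
	unsigned long end_pfn;		/* exclusive */
};

#define PG_SKETCH_INITIALIZED 0x1UL

/* Zero the page first, then set fields: nothing is read beforehand. */
void init_single_page_sketch(struct page_sketch *page, int zone_id)
{
	memset(page, 0, sizeof(*page));
	page->zone_id = zone_id;
	page->flags = PG_SKETCH_INITIALIZED;
}

/*
 * Initialize only the pages inside the given free ranges and return how
 * many were initialized.  Pages outside the ranges are never touched, so
 * their (possibly garbage) contents are never trusted.
 */
unsigned long deferred_init_sketch(struct page_sketch *memmap,
				   const struct pfn_range *free_ranges,
				   size_t nr_ranges, int zone_id)
{
	unsigned long nr_pages = 0;

	for (size_t i = 0; i < nr_ranges; i++) {
		const struct pfn_range *r = &free_ranges[i];

		assert(r->start_pfn <= r->end_pfn);
		for (unsigned long pfn = r->start_pfn; pfn < r->end_pfn; pfn++) {
			init_single_page_sketch(&memmap[pfn], zone_id);
			nr_pages++;
		}
	}
	return nr_pages;
}
```

      In this model the memmap may start out filled with arbitrary bytes,
      which mirrors the situation once the allocator stops zeroing it.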
      
      Note, this first issue is relevant only when the following change is
      merged:
      
      =====
      This patch fixes another existing issue on systems that have holes in
      zones, i.e. where CONFIG_HOLES_IN_ZONE is defined.
      
      In for_each_mem_pfn_range() we have code like this:
      
        if (!pfn_valid_within(pfn))
      	goto free_range;
      
      Note: 'page' is not set to NULL and is not incremented, but 'pfn'
      advances.  This means that if deferred struct pages are enabled on
      systems with this kind of hole, Linux would get memory corruption.  I
      have fixed this issue by defining a new macro that performs all the
      necessary operations when we free the current set of pages.
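      The invariant behind this fix can be sketched in userspace.  This is
      not the macro added by the patch; the types and the function below are
      hypothetical and only model a cached 'page' pointer that must stay in
      step with 'pfn'.  When a hole is skipped, the pointer is dropped so the
      next valid pfn recomputes it instead of reusing a stale one.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for struct page. */
struct page_sketch {
	unsigned long pfn;
	bool initialized;
};

/*
 * Walk [start_pfn, end_pfn) and initialize every pfn marked valid.
 * valid[pfn] == false models a hole inside the zone.
 * Returns the number of pages initialized.
 */
unsigned long init_range_with_holes(struct page_sketch *memmap,
				    const bool *valid,
				    unsigned long start_pfn,
				    unsigned long end_pfn)
{
	struct page_sketch *page = NULL;
	unsigned long nr = 0;

	for (unsigned long pfn = start_pfn; pfn < end_pfn; pfn++) {
		if (!valid[pfn]) {
			/* Hole: forget the cached pointer so it cannot lag. */
			page = NULL;
			continue;
		}
		if (page)
			page++;			/* contiguous with previous pfn */
		else
			page = &memmap[pfn];	/* first page after a hole */

		/* The pointer and the pfn must always agree. */
		assert(page == &memmap[pfn]);
		page->pfn = pfn;
		page->initialized = true;
		nr++;
	}
	return nr;
}
```

      Without the reset in the hole branch, 'page' would keep pointing at the
      entry before the hole while 'pfn' moves on, and the next write would
      land in the wrong entry.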
      
      [pasha.tatashin@oracle.com: buddy page accessed before initialized]
        Link: http://lkml.kernel.org/r/20171102170221.7401-2-pasha.tatashin@oracle.com
      Link: http://lkml.kernel.org/r/20171013173214.27300-2-pasha.tatashin@oracle.com
      Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com>
      Reviewed-by: Steven Sistare <steven.sistare@oracle.com>
      Reviewed-by: Daniel Jordan <daniel.m.jordan@oracle.com>
      Reviewed-by: Bob Picco <bob.picco@oracle.com>
      Tested-by: Bob Picco <bob.picco@oracle.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Christian Borntraeger <borntraeger@de.ibm.com>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: David S. Miller <davem@davemloft.net>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org>
      Cc: Mark Rutland <mark.rutland@arm.com>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Sam Ravnborg <sam@ravnborg.org>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Alexander Potapenko <glider@google.com>
      Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      2f47a91f
    • C
      mm/swap_state.c: declare a few variables as __read_mostly · 783cb68e
      Changbin Du authored
      These global variables are only set during initialization or rarely
      change, so declare them as __read_mostly.
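      As a rough userspace illustration of the idea, the sketch below groups
      rarely written globals into a dedicated section so that they are less
      likely to share cache lines with frequently written data.  The macro,
      the section name and the variables are all hypothetical; the kernel's
      own __read_mostly definition is architecture specific and is not
      reproduced here.

```c
#include <assert.h>

/*
 * Hypothetical userspace analogue of a "read mostly" annotation.
 * Only applied on ELF targets with GCC or Clang; elsewhere it expands
 * to nothing and the variables are ordinary globals.
 */
#if defined(__ELF__) && (defined(__GNUC__) || defined(__clang__))
#define read_mostly_sketch __attribute__((__section__(".data.read_mostly_sketch")))
#else
#define read_mostly_sketch
#endif

/* Set once at start-up (or changed rarely), read on hot paths. */
int readahead_enabled_sketch read_mostly_sketch = 1;
unsigned long readahead_limit_sketch read_mostly_sketch = 0;

/* Initialization-time writer. */
void sketch_init(unsigned long limit)
{
	assert(limit > 0);
	readahead_limit_sketch = limit;
}

/* Hot-path reader: only loads the annotated variables. */
int sketch_should_readahead(unsigned long nr_pages)
{
	return readahead_enabled_sketch && nr_pages < readahead_limit_sketch;
}
```

      The annotation changes placement only; reads and writes behave exactly
      as for any other global variable.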
      
      Link: http://lkml.kernel.org/r/1507802349-5554-1-git-send-email-changbin.du@intel.com
      Signed-off-by: Changbin Du <changbin.du@intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      783cb68e
    • L
      kmemcheck: rip it out · 4675ff05
      Levin, Alexander (Sasha Levin) authored
      Fix up makefiles, remove references, and git rm kmemcheck.
      
      Link: http://lkml.kernel.org/r/20171007030159.22241-4-alexander.levin@verizon.com
      Signed-off-by: Sasha Levin <alexander.levin@verizon.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Vegard Nossum <vegardno@ifi.uio.no>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Eric W. Biederman <ebiederm@xmission.com>
      Cc: Alexander Potapenko <glider@google.com>
      Cc: Tim Hansen <devtimhansen@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      4675ff05
    • L
      kmemcheck: remove whats left of NOTRACK flags · d8be7566
      Levin, Alexander (Sasha Levin) authored
      Now that kmemcheck is gone, we don't need the NOTRACK flags.
      
      Link: http://lkml.kernel.org/r/20171007030159.22241-5-alexander.levin@verizon.com
      Signed-off-by: Sasha Levin <alexander.levin@verizon.com>
      Cc: Alexander Potapenko <glider@google.com>
      Cc: Eric W. Biederman <ebiederm@xmission.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Tim Hansen <devtimhansen@gmail.com>
      Cc: Vegard Nossum <vegardno@ifi.uio.no>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      d8be7566
    • L
      kmemcheck: stop using GFP_NOTRACK and SLAB_NOTRACK · 75f296d9
      Levin, Alexander (Sasha Levin) authored
      Convert all allocations that used a NOTRACK flag to stop using it.
      
      Link: http://lkml.kernel.org/r/20171007030159.22241-3-alexander.levin@verizon.com
      Signed-off-by: Sasha Levin <alexander.levin@verizon.com>
      Cc: Alexander Potapenko <glider@google.com>
      Cc: Eric W. Biederman <ebiederm@xmission.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Tim Hansen <devtimhansen@gmail.com>
      Cc: Vegard Nossum <vegardno@ifi.uio.no>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      75f296d9
    • L
      kmemcheck: remove annotations · 49502766
      Levin, Alexander (Sasha Levin) authored
      Patch series "kmemcheck: kill kmemcheck", v2.
      
      As discussed at LSF/MM, kill kmemcheck.
      
      KASan is a replacement that is able to work without the limitations of
      kmemcheck (single CPU, slow).  KASan is already upstream.
      
      We are also not aware of any users of kmemcheck (or users who don't
      consider KASan as a suitable replacement).
      
      The only objection was that since KASAN wasn't supported by all GCC
      versions provided by distros at that time we should hold off for 2
      years, and try again.
      
      Now that 2 years have passed, and all distros provide gcc that supports
      KASAN, kill kmemcheck again for the very same reasons.
      
      This patch (of 4):
      
      Remove kmemcheck annotations, and calls to kmemcheck from the kernel.
      
      [alexander.levin@verizon.com: correctly remove kmemcheck call from dma_map_sg_attrs]
        Link: http://lkml.kernel.org/r/20171012192151.26531-1-alexander.levin@verizon.com
      Link: http://lkml.kernel.org/r/20171007030159.22241-2-alexander.levin@verizon.com
      Signed-off-by: Sasha Levin <alexander.levin@verizon.com>
      Cc: Alexander Potapenko <glider@google.com>
      Cc: Eric W. Biederman <ebiederm@xmission.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Tim Hansen <devtimhansen@gmail.com>
      Cc: Vegard Nossum <vegardno@ifi.uio.no>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      49502766
    • C
      mm/rmap.c: remove redundant variable cend · cdb07bde
      Colin Ian King authored
      Variable cend is set but never read, hence it is redundant and can be
      removed.
      
      Cleans up clang build warning: Value stored to 'cend' is never read
      
      Link: http://lkml.kernel.org/r/20171011174942.1372-1-colin.king@canonical.com
      Fixes: 369ea824 ("mm/rmap: update to new mmu_notifier semantic v2")
      Signed-off-by: Colin Ian King <colin.king@canonical.com>
      Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      cdb07bde
    • S
      fs, mm: account filp cache to kmemcg · f3f7c093
      Shakeel Butt authored
      Allocations from the filp cache can be directly triggered by userspace
      applications.  A buggy application can consume a significant amount of
      unaccounted system memory.  We have not noticed such buggy applications
      in production, but upon close inspection we found that a lot of
      machines spend a very significant amount of memory on these caches.
      
      One way to limit allocations from the filp cache is to set a
      system-level limit on the maximum number of open files.  However, this
      limit is shared between different users on the system, and one user can
      hog this resource.  To cater for that, we can charge filp to kmemcg,
      set the system-level maximum very high, and let each user's memory
      limit bound the number of files they can open, indirectly limiting
      their allocations from the filp cache.
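      The effect of charging per-file objects to a per-user memory limit can
      be modelled in a few lines.  This is a toy model under stated
      assumptions, not kernel code: struct memcg_sketch, the object size and
      both functions are hypothetical, and a real memory cgroup would attempt
      reclaim before failing an allocation.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical size of one file object, for illustration only. */
#define SKETCH_FILE_OBJ_BYTES 256UL

/* Toy model of a per-user memory control group. */
struct memcg_sketch {
	unsigned long limit_bytes;
	unsigned long usage_bytes;
	unsigned long open_files;
};

/* Charge one file object to the group; fail once the limit is reached. */
int sketch_open_file(struct memcg_sketch *cg)
{
	if (cg->usage_bytes + SKETCH_FILE_OBJ_BYTES > cg->limit_bytes)
		return -ENOMEM;
	cg->usage_bytes += SKETCH_FILE_OBJ_BYTES;
	cg->open_files++;
	return 0;
}

/* Uncharge one file object when it is closed. */
void sketch_close_file(struct memcg_sketch *cg)
{
	assert(cg->open_files > 0);
	cg->usage_bytes -= SKETCH_FILE_OBJ_BYTES;
	cg->open_files--;
}
```

      In this model one group exhausting its own limit leaves every other
      group's budget untouched, which is the isolation that a shared
      system-wide file limit cannot provide.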
      
      One side effect of this change is that it allows _sysctl() to return
      ENOMEM, which the man page of _sysctl() does not specify.  However, the
      man page also discourages using _sysctl() at all.
      
      Link: http://lkml.kernel.org/r/20171011190359.34926-1-shakeelb@google.com
      Signed-off-by: Shakeel Butt <shakeelb@google.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      f3f7c093
    • K
      mm: consolidate page table accounting · af5b0f6a
      Kirill A. Shutemov authored
      Currently, we account page tables separately for each page table level,
      but that's redundant -- we only make use of total memory allocated to
      page tables for oom_badness calculation.  We also provide the
      information to userspace, but it has dubious value there too.
      
      This patch switches page table accounting to a single counter.
      
      mm->pgtables_bytes is now used to account all page table levels.  We use
      bytes, because page table size for different levels of page table tree
      may be different.
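      A minimal userspace sketch of a single byte counter shared by all page
      table levels follows.  Everything here is an assumption made for
      illustration: struct mm_sketch, the helper names and the table sizes
      are hypothetical and do not reproduce the kernel's helpers.  Counting
      bytes rather than tables lets levels of different sizes, including
      tables smaller than a page, be summed without per-level bookkeeping.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

#define SKETCH_PAGE_SIZE 4096L

/* Hypothetical per-level table sizes; real sizes are arch specific. */
#define SKETCH_PTE_TABLE_BYTES 2048L	/* smaller than a page */
#define SKETCH_PMD_TABLE_BYTES 4096L
#define SKETCH_PUD_TABLE_BYTES 16384L

/* Hypothetical stand-in for the relevant part of mm_struct. */
struct mm_sketch {
	atomic_long pgtables_bytes;	/* all levels, in bytes */
};

void mm_sketch_init(struct mm_sketch *mm)
{
	atomic_init(&mm->pgtables_bytes, 0);
}

/* Positive bytes when a table is allocated, negative when it is freed. */
void mm_sketch_account(struct mm_sketch *mm, long bytes)
{
	long now = atomic_fetch_add(&mm->pgtables_bytes, bytes) + bytes;

	assert(now >= 0);
	(void)now;
}

long mm_sketch_pgtables_bytes(struct mm_sketch *mm)
{
	return atomic_load(&mm->pgtables_bytes);
}

/* Page table footprint in whole pages, e.g. as one term of an OOM score. */
long mm_sketch_pgtable_pages(struct mm_sketch *mm)
{
	return mm_sketch_pgtables_bytes(mm) / SKETCH_PAGE_SIZE;
}
```

      With these example sizes, two PTE tables, one PMD table and one PUD
      table add up to 24576 bytes, or six pages.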
      
      The change has a user-visible effect: VmPMD and VmPUD are no longer
      reported in /proc/[pid]/status.  Not sure if anybody uses them.  (As an
      alternative, we can always report 0 kB for them.)
      
      OOM-killer report is also slightly changed: we now report pgtables_bytes
      instead of nr_ptes, nr_pmd, nr_puds.
      
      Apart from reducing the number of per-mm counters, the benefit is that
      we now calculate oom_badness() more correctly for machines where page
      table size depends on the level or where page tables are smaller than
      a page.
      
      The only downside can be debuggability because we do not know which page
      table level could leak.  But I do not remember many bugs that would be
      caught by separate counters so I wouldn't lose sleep over this.
      
      [akpm@linux-foundation.org: fix mm/huge_memory.c]
      Link: http://lkml.kernel.org/r/20171006100651.44742-2-kirill.shutemov@linux.intel.com
      Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      [kirill.shutemov@linux.intel.com: fix build]
        Link: http://lkml.kernel.org/r/20171016150113.ikfxy3e7zzfvsr4w@black.fi.intel.com
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      af5b0f6a
    • K
      mm: introduce wrappers to access mm->nr_ptes · c4812909
      Kirill A. Shutemov authored
      Let's add wrappers for ->nr_ptes with the same interface as for nr_pmd
      and nr_pud.
      
      The patch also makes nr_ptes accounting dependent on CONFIG_MMU.  Page
      table accounting doesn't make sense if you don't have page tables.
      
      It's preparation for consolidation of page-table counters in mm_struct.
      
      Link: http://lkml.kernel.org/r/20171006100651.44742-1-kirill.shutemov@linux.intel.com
      Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      c4812909
    • K
      mm: account pud page tables · b4e98d9a
      Kirill A. Shutemov authored
      On a machine with 5-level paging support a process can allocate a
      significant amount of memory and stay unnoticed by the oom-killer and
      memory cgroup.  The trick is to allocate a lot of PUD page tables.  We
      don't account PUD page tables, only PMD and PTE.
      
      We already addressed the same issue for PMD page tables, see commit
      dc6c9a35 ("mm: account pmd page tables to the process").
      Introduction of 5-level paging brings the same issue for PUD page
      tables.
      
      The patch expands accounting to PUD level.
      
      [kirill.shutemov@linux.intel.com: s/pmd_t/pud_t/]
        Link: http://lkml.kernel.org/r/20171004074305.x35eh5u7ybbt5kar@black.fi.intel.com
      [heiko.carstens@de.ibm.com: s390/mm: fix pud table accounting]
        Link: http://lkml.kernel.org/r/20171103090551.18231-1-heiko.carstens@de.ibm.com
      Link: http://lkml.kernel.org/r/20171002080427.3320-1-kirill.shutemov@linux.intel.com
      Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
      Acked-by: Rik van Riel <riel@redhat.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      b4e98d9a