1. 02 Sep, 2020 2 commits
    • M
      mm, vmscan: do not special-case slab reclaim when watermarks are boosted · fb4da0ed
      Committed by Mel Gorman
      to #28825456
      
      commit 28360f398778d7623a5ff8a8e90958c0d925e120 upstream.
      
      Dave Chinner reported a problem pointing a finger at commit 1c30844d2dfe
      ("mm: reclaim small amounts of memory when an external fragmentation
      event occurs").
      
      The report is extensive:
      
        https://lore.kernel.org/linux-mm/20190807091858.2857-1-david@fromorbit.com/
      
      and it's worth recording the most relevant parts (colorful language and
      typos included).
      
      	When running a simple, steady state 4kB file creation test to
      	simulate extracting tarballs larger than memory full of small
      	files into the filesystem, I noticed that once memory fills up
      	the cache balance goes to hell.
      
      	The workload is creating one dirty cached inode for every dirty
      	page, both of which should require a single IO each to clean and
      	reclaim, and creation of inodes is throttled by the rate at which
      	dirty writeback runs at (via balance dirty pages). Hence the ingest
      	rate of new cached inodes and page cache pages is identical and
      	steady. As a result, memory reclaim should quickly find a steady
      	balance between page cache and inode caches.
      
      	The moment memory fills, the page cache is reclaimed at a much
      	faster rate than the inode cache, and evidence suggests that
      	the inode cache shrinker is not being called when large batches
      	of pages are being reclaimed. In roughly the same time period
      	that it takes to fill memory with 50% pages and 50% slab caches,
      	memory reclaim reduces the page cache down to just dirty pages
      	and slab caches fill the entirety of memory.
      
      	The LRU is largely full of dirty pages, and we're getting spikes
      	of random writeback from memory reclaim so it's all going to shit.
      	Behaviour never recovers, the page cache remains pinned at just
      	dirty pages, and nothing I could tune would make any difference.
      	vfs_cache_pressure makes no difference - I would set it so high
      	it should trim the entire inode caches in a single pass, yet it
      	didn't do anything. It was clear from tracing and live telemetry
      	that the shrinkers were pretty much not running except when
      	there was absolutely no memory free at all, and then they did
      	the minimum necessary to free memory to make progress.
      
      	So I went looking at the code, trying to find places where pages
      	got reclaimed and the shrinkers weren't called. There's only one
      	- kswapd doing boosted reclaim as per commit 1c30844d2dfe ("mm:
      	reclaim small amounts of memory when an external fragmentation
      	event occurs").
      
      The watermark boosting introduced by the commit is triggered in response
      to an allocation "fragmentation event".  The boosting was not intended
      to target THP specifically and triggers even if THP is disabled.
      However, with Dave's perfectly reasonable workload, fragmentation events
      can be very common given the ratio of slab to page cache allocations so
      boosting remains active for long periods of time.
      
      As high-order allocations might use compaction and compaction cannot
      move slab pages the decision was made in the commit to special-case
      kswapd when watermarks are boosted -- kswapd avoids reclaiming slab as
      reclaiming slab does not directly help compaction.
      
      As Dave notes, this decision means that slab can be artificially
      protected for long periods of time and messes up the balance with slab
      and page caches.
      
      Removing the special casing can still indirectly help avoid
      fragmentation by avoiding fragmentation-causing events due to slab
      allocation as pages from a slab pageblock will have some slab objects
      freed.  Furthermore, with the special casing, reclaim behaviour is
      unpredictable as kswapd sometimes examines slab and sometimes does not
      in a manner that is tricky to tune or analyse.
      
      This patch removes the special casing.  The downside is that this is not
      a universal performance win.  Some benchmarks that depend on the
      residency of data when rereading metadata may see a regression when slab
      reclaim is restored to its original behaviour.  Similarly, some
      benchmarks that only read-once or write-once may perform better when
      page reclaim is too aggressive.  The primary upside is that slab
      shrinker is less surprising (arguably more sane but that's a matter of
      opinion), behaves consistently regardless of the fragmentation state of
      the system and properly obeys VM sysctls.
      
      A fsmark benchmark configuration was constructed similar to what Dave
      reported and is codified by the mmtest configuration
      config-io-fsmark-small-file-stream.  It was evaluated on a 1-socket
      machine to avoid dealing with NUMA-related issues and the timing of
      reclaim.  The storage was an SSD Samsung Evo and a fresh trimmed XFS
      filesystem was used for the test data.
      
      This is not an exact replication of Dave's setup.  The configuration
      scales its parameters depending on the memory size of the SUT to behave
      similarly across machines.  The parameters mean the first sample
      reported by fs_mark is using 50% of RAM which will barely be throttled
      and look like a big outlier.  Dave used fake NUMA to have multiple
      kswapd instances which I didn't replicate.  Finally, the number of
      iterations differs from Dave's test as the target disk was not large
      enough.  While not identical, it should be representative.
      
        fsmark
                                           5.3.0-rc3              5.3.0-rc3
                                             vanilla          shrinker-v1r1
        Min       1-files/sec     4444.80 (   0.00%)     4765.60 (   7.22%)
        1st-qrtle 1-files/sec     5005.10 (   0.00%)     5091.70 (   1.73%)
        2nd-qrtle 1-files/sec     4917.80 (   0.00%)     4855.60 (  -1.26%)
        3rd-qrtle 1-files/sec     4667.40 (   0.00%)     4831.20 (   3.51%)
        Max-1     1-files/sec    11421.50 (   0.00%)     9999.30 ( -12.45%)
        Max-5     1-files/sec    11421.50 (   0.00%)     9999.30 ( -12.45%)
        Max-10    1-files/sec    11421.50 (   0.00%)     9999.30 ( -12.45%)
        Max-90    1-files/sec     4649.60 (   0.00%)     4780.70 (   2.82%)
        Max-95    1-files/sec     4491.00 (   0.00%)     4768.20 (   6.17%)
        Max-99    1-files/sec     4491.00 (   0.00%)     4768.20 (   6.17%)
        Max       1-files/sec    11421.50 (   0.00%)     9999.30 ( -12.45%)
        Hmean     1-files/sec     5004.75 (   0.00%)     5075.96 (   1.42%)
        Stddev    1-files/sec     1778.70 (   0.00%)     1369.66 (  23.00%)
        CoeffVar  1-files/sec       33.70 (   0.00%)       26.05 (  22.71%)
        BHmean-99 1-files/sec     5053.72 (   0.00%)     5101.52 (   0.95%)
        BHmean-95 1-files/sec     5053.72 (   0.00%)     5101.52 (   0.95%)
        BHmean-90 1-files/sec     5107.05 (   0.00%)     5131.41 (   0.48%)
        BHmean-75 1-files/sec     5208.45 (   0.00%)     5206.68 (  -0.03%)
        BHmean-50 1-files/sec     5405.53 (   0.00%)     5381.62 (  -0.44%)
        BHmean-25 1-files/sec     6179.75 (   0.00%)     6095.14 (  -1.37%)
      
                           5.3.0-rc3   5.3.0-rc3
                            vanilla shrinker-v1r1
        Duration User         501.82      497.29
        Duration System      4401.44     4424.08
        Duration Elapsed     8124.76     8358.05
      
      This shows a slight skew in the max result, which represents a large
      outlier; the 1st, 2nd and 3rd quartiles are similar, indicating that
      the bulk of the results show little difference.  Note that an earlier
      version of the fsmark configuration showed a regression but that
      included more samples taken while memory was still filling.
      
      Note that the elapsed time is higher.  Part of this is that the
      configuration included time to delete all the test files when the test
      completes -- the test automation handles the possibility of testing
      fsmark with multiple thread counts.  Without the patch, many of these
      objects would be memory resident which is part of what the patch is
      addressing.
      
      There are other important observations that justify the patch.
      
      1. With the vanilla kernel, the number of dirty pages in the system is
         very low for much of the test. With this patch, dirty pages are
         generally kept at 10%, which matches vm.dirty_background_ratio and
         is the normal, historically expected behaviour.
      
      2. With the vanilla kernel, the ratio of Slab/Pagecache is close to
         0.95 for much of the test i.e. Slab is being left alone and
         dominating memory consumption. With the patch applied, the ratio
         varies between 0.35 and 0.45 with the bulk of the measured ratios
         roughly half way between those values. This is a different balance to
         what Dave reported but it was at least consistent.
      
      3. Slabs are scanned throughout the entire test with the patch applied.
         The vanilla kernel has periods with no scan activity and then
         relatively massive spikes.
      
      4. Without the patch, kswapd scan rates are very variable. With the
         patch, the scan rates remain quite steady.
      
      5. Overall vmstats are closer to normal expectations
      
                                              5.3.0-rc3      5.3.0-rc3
                                                vanilla  shrinker-v1r1
          Ops Direct pages scanned             99388.00      328410.00
          Ops Kswapd pages scanned          45382917.00    33451026.00
          Ops Kswapd pages reclaimed        30869570.00    25239655.00
          Ops Direct pages reclaimed           74131.00        5830.00
          Ops Kswapd efficiency %                 68.02          75.45
          Ops Kswapd velocity                   5585.75        4002.25
          Ops Page reclaim immediate         1179721.00      430927.00
          Ops Slabs scanned                 62367361.00    73581394.00
          Ops Direct inode steals               2103.00        1002.00
          Ops Kswapd inode steals             570180.00     5183206.00
      
      	o Vanilla kernel is hitting direct reclaim more frequently,
      	  not very much in absolute terms but the fact the patch
      	  reduces it is interesting
      	o "Page reclaim immediate" in the vanilla kernel indicates
      	  dirty pages are being encountered at the tail of the LRU.
      	  This is generally bad and means in this case that the LRU
      	  is not long enough for dirty pages to be cleaned by the
      	  background flush in time. This is much reduced by the
      	  patch.
      	o With the patch, kswapd is reclaiming 10 times more slab
      	  pages than with the vanilla kernel. This is indicative
      	  of the watermark boosting over-protecting slab
      
      A more complete set of tests were run that were part of the basis for
      introducing boosting and while there are some differences, they are well
      within tolerances.
      
      Bottom line: special-casing kswapd to avoid slab reclaim makes behaviour
      unpredictable and can lead to abnormal results for normal workloads.
      
      This patch restores the expected behaviour that slab and page cache are
      balanced consistently for a workload with a steady allocation ratio of
      slab/pagecache pages.  It also means that workloads that favour the
      preservation of slab over pagecache can be tuned via
      vm.vfs_cache_pressure, whereas the vanilla kernel effectively ignores
      the parameter when boosting is active.
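
To illustrate why vm.vfs_cache_pressure only matters when the slab shrinkers actually run, here is a minimal user-space sketch of how the dentry and inode shrinkers are understood to scale their count of freeable objects by the sysctl (a percentage, default 100). The function name is hypothetical and this is a simplified model under that assumption, not the kernel code itself.

```c
#include <stdio.h>

/*
 * Scale the number of freeable cache objects by the
 * vfs_cache_pressure percentage. The division is split into
 * quotient and remainder so the multiplication does not overflow
 * for large object counts.
 */
unsigned long scale_by_cache_pressure(unsigned long freeable,
				      unsigned long vfs_cache_pressure)
{
	unsigned long quot = freeable / 100;
	unsigned long rem = freeable % 100;

	return quot * vfs_cache_pressure + (rem * vfs_cache_pressure) / 100;
}

/* Show what the shrinker would be told for one pressure setting. */
void print_scaled(unsigned long freeable, unsigned long pressure)
{
	printf("pressure=%lu: %lu of %lu objects reported as freeable\n",
	       pressure, scale_by_cache_pressure(freeable, pressure),
	       freeable);
}
```

If kswapd skips the shrinkers entirely, as it did during boosted reclaim before this patch, this scaling is never consulted, which matches the observation that raising the sysctl had no effect.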
      
      Link: http://lkml.kernel.org/r/20190808182946.GM2739@techsingularity.net
      Fixes: 1c30844d2dfe ("mm: reclaim small amounts of memory when an external fragmentation event occurs")
      Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: <stable@vger.kernel.org>	[5.0+]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Xu Yu <xuyu@linux.alibaba.com>
      Reviewed-by: Yang Shi <yang.shi@linux.alibaba.com>
    • M
      mm: reclaim small amounts of memory when an external fragmentation event occurs · 9bcadc70
      Committed by Mel Gorman
      to #28825456
      
      commit 1c30844d2dfe272d58c8fc000960b835d13aa2ac upstream.
      
      An external fragmentation event was previously described as
      
          When the page allocator fragments memory, it records the event using
          the mm_page_alloc_extfrag event. If the fallback_order is smaller
          than a pageblock order (order-9 on 64-bit x86) then it's considered
          an event that will cause external fragmentation issues in the future.
      
      The kernel reduces the probability of such events by increasing the
      watermark sizes by calling set_recommended_min_free_kbytes early in the
      lifetime of the system.  This works reasonably well in general but if
      there are enough sparsely populated pageblocks then the problem can still
      occur as enough memory is free overall and kswapd stays asleep.
      
      This patch introduces a watermark_boost_factor sysctl that allows a zone
      watermark to be temporarily boosted when an external-fragmentation-causing
      event occurs.  The boosting will stall allocations that would decrease
      free memory below the boosted low watermark and kswapd is woken if the
      calling context allows to reclaim an amount of memory relative to the size
      of the high watermark and the watermark_boost_factor until the boost is
      cleared.  When kswapd finishes, it wakes kcompactd at the pageblock order
      to clean some of the pageblocks that may have been affected by the
      fragmentation event.  kswapd avoids any writeback, slab shrinkage and swap
      from reclaim context during this operation to avoid excessive system
      disruption in the name of fragmentation avoidance.  Care is taken so that
      kswapd will do normal reclaim work if the system is really low on memory.
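
The boosting rule described above can be sketched in user-space C. The sketch assumes that watermark_boost_factor is expressed in units of 1/10000 of the high watermark, that each fragmentation event adds one pageblock's worth of pages to the boost, and that the boost is capped at the resulting maximum. The struct and function names are hypothetical stand-ins, not kernel symbols.

```c
#include <stdio.h>

/* Hypothetical stand-in for the per-zone state involved. */
struct zone_sketch {
	unsigned long high_wmark;	/* high watermark, in pages */
	unsigned long watermark_boost;	/* current temporary boost, in pages */
};

/* An event counts as fragmenting if the fallback is below pageblock order. */
int is_extfrag_event(unsigned int fallback_order, unsigned int pageblock_order)
{
	return fallback_order < pageblock_order;
}

/* Compute x * numer / denom without overflowing the intermediate product. */
unsigned long frac_of(unsigned long x, unsigned long numer,
		      unsigned long denom)
{
	return (x / denom) * numer + ((x % denom) * numer) / denom;
}

/*
 * Raise the zone's boost by one pageblock per event, capped at
 * high_wmark * boost_factor / 10000 (assumed unit). A factor of 0
 * disables boosting altogether.
 */
void boost_watermark_sketch(struct zone_sketch *z, unsigned long boost_factor,
			    unsigned long pageblock_nr_pages)
{
	unsigned long max_boost;

	if (boost_factor == 0)
		return;

	max_boost = frac_of(z->high_wmark, boost_factor, 10000);
	if (max_boost < pageblock_nr_pages)
		max_boost = pageblock_nr_pages;

	z->watermark_boost += pageblock_nr_pages;
	if (z->watermark_boost > max_boost)
		z->watermark_boost = max_boost;
}
```

With a high watermark of 1000 pages, a factor of 15000 and 512-page pageblocks, successive events raise the boost to 512, 1024 and then 1500 pages, where it stays until kswapd clears it.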
      
      This was evaluated using the same workloads as "mm, page_alloc: Spread
      allocations across zones before introducing fragmentation".
      
      1-socket Skylake machine
      config-global-dhp__workload_thpfioscale XFS (no special madvise)
      4 fio threads, 1 THP allocating thread
      --------------------------------------
      
      4.20-rc3 extfrag events < order 9:   804694
      4.20-rc3+patch:                      408912 (49% reduction)
      4.20-rc3+patch1-4:                    18421 (98% reduction)
      
                                         4.20.0-rc3             4.20.0-rc3
                                       lowzone-v5r8             boost-v5r8
      Amean     fault-base-1      653.58 (   0.00%)      652.71 (   0.13%)
      Amean     fault-huge-1        0.00 (   0.00%)      178.93 * -99.00%*
      
                                    4.20.0-rc3             4.20.0-rc3
                                  lowzone-v5r8             boost-v5r8
      Percentage huge-1        0.00 (   0.00%)        5.12 ( 100.00%)
      
      Note that external fragmentation causing events are massively reduced by
      this patch whether in comparison to the previous kernel or the vanilla
      kernel.  The fault latency for huge pages appears to be increased but that
      is only because THP allocations were successful with the patch applied.
      
      1-socket Skylake machine
      global-dhp__workload_thpfioscale-madvhugepage-xfs (MADV_HUGEPAGE)
      -----------------------------------------------------------------
      
      4.20-rc3 extfrag events < order 9:  291392
      4.20-rc3+patch:                     191187 (34% reduction)
      4.20-rc3+patch1-4:                   13464 (95% reduction)
      
      thpfioscale Fault Latencies
                                         4.20.0-rc3             4.20.0-rc3
                                       lowzone-v5r8             boost-v5r8
      Min       fault-base-1      912.00 (   0.00%)      905.00 (   0.77%)
      Min       fault-huge-1      127.00 (   0.00%)      135.00 (  -6.30%)
      Amean     fault-base-1     1467.55 (   0.00%)     1481.67 (  -0.96%)
      Amean     fault-huge-1     1127.11 (   0.00%)     1063.88 *   5.61%*
      
                                    4.20.0-rc3             4.20.0-rc3
                                  lowzone-v5r8             boost-v5r8
      Percentage huge-1       77.64 (   0.00%)       83.46 (   7.49%)
      
      As before, massive reduction in external fragmentation events, some jitter
      on latencies and an increase in THP allocation success rates.
      
      2-socket Haswell machine
      config-global-dhp__workload_thpfioscale XFS (no special madvise)
      4 fio threads, 5 THP allocating threads
      ----------------------------------------------------------------
      
      4.20-rc3 extfrag events < order 9:  215698
      4.20-rc3+patch:                     200210 (7% reduction)
      4.20-rc3+patch1-4:                   14263 (93% reduction)
      
                                         4.20.0-rc3             4.20.0-rc3
                                       lowzone-v5r8             boost-v5r8
      Amean     fault-base-5     1346.45 (   0.00%)     1306.87 (   2.94%)
      Amean     fault-huge-5     3418.60 (   0.00%)     1348.94 (  60.54%)
      
                                    4.20.0-rc3             4.20.0-rc3
                                  lowzone-v5r8             boost-v5r8
      Percentage huge-5        0.78 (   0.00%)        7.91 ( 910.64%)
      
      There is a 93% reduction in fragmentation causing events, there is a big
      reduction in the huge page fault latency and allocation success rate is
      higher.
      
      2-socket Haswell machine
      global-dhp__workload_thpfioscale-madvhugepage-xfs (MADV_HUGEPAGE)
      -----------------------------------------------------------------
      
      4.20-rc3 extfrag events < order 9: 166352
      4.20-rc3+patch:                    147463 (11% reduction)
      4.20-rc3+patch1-4:                  11095 (93% reduction)
      
      thpfioscale Fault Latencies
                                         4.20.0-rc3             4.20.0-rc3
                                       lowzone-v5r8             boost-v5r8
      Amean     fault-base-5     6217.43 (   0.00%)     7419.67 * -19.34%*
      Amean     fault-huge-5     3163.33 (   0.00%)     3263.80 (  -3.18%)
      
                                    4.20.0-rc3             4.20.0-rc3
                                  lowzone-v5r8             boost-v5r8
      Percentage huge-5       95.14 (   0.00%)       87.98 (  -7.53%)
      
      There is a large reduction in fragmentation events with some jitter around
      the latencies and success rates.  As before, the high THP allocation
      success rate does mean the system is under a lot of pressure.  However, as
      the fragmentation events are reduced, it would be expected that the
      long-term allocation success rate would be higher.
      
      Link: http://lkml.kernel.org/r/20181123114528.28802-5-mgorman@techsingularity.net
      Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Zi Yan <zi.yan@cs.rutgers.edu>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Xu Yu <xuyu@linux.alibaba.com>
      Reviewed-by: Yang Shi <yang.shi@linux.alibaba.com>
  2. 17 Apr, 2020 1 commit
  3. 16 Apr, 2020 5 commits
    • X
      alinux: mm, memcg: add memsli procfs switch interface · 892970b7
      Committed by Xu Yu
      to #26424368
      
      Since memsli also records latency histograms for swapout and swapin,
      which are NOT in the slow memory path, the overhead of memsli could
      be non-negligible in some specific scenarios.
      
      For example, in scenarios with frequent swapping out and in, memsli
      could introduce overhead of ~1% of total run time of the synthetic
      testcase.
      
      This adds a procfs interface for the memsli switch. The memsli feature
      is enabled by default, and you can now disable it with:
      
      $ echo 0 > /proc/memsli/enabled
      
      You can check the current memsli switch status with:
      
      $ cat /proc/memsli/enabled
      
      Note that disabling memsli at runtime will NOT clear the existing
      latency histograms. You still need to reset each latency histogram
      manually by echoing 0 into the corresponding cgroup control
      file(s).
      Signed-off-by: Xu Yu <xuyu@linux.alibaba.com>
      Reviewed-by: Yang Shi <yang.shi@linux.alibaba.com>
      Reviewed-by: Xunlei Pang <xlpang@linux.alibaba.com>
    • X
      alinux: mm, memcg: record latency of swapout and swapin in every memcg · ddfd4d5e
      Committed by Xu Yu
      to #26424368
      
      Probe and calculate the latency of global swapout, memcg swapout and
      swapin respectively, and then group them into the latency histogram in
      struct mem_cgroup.
      
      Note that the latency in each memcg is aggregated from all child memcgs.
      
      Usage:
      
      $ cat memory.direct_swapout_global_latency
      0-1ms:  98313
      1-5ms:  0
      5-10ms:         0
      10-100ms:       0
      100-500ms:      0
      500-1000ms:     0
      >=1000ms:       0
      total(ms):      52
      
      Each line is the count of global swapout within the appropriate latency
      range.
      
      To clear the latency histogram:
      
      $ echo 0 > memory.direct_swapout_global_latency
      $ cat memory.direct_swapout_global_latency
      0-1ms:  0
      1-5ms:  0
      5-10ms:         0
      10-100ms:       0
      100-500ms:      0
      500-1000ms:     0
      >=1000ms:       0
      total(ms):      0
      
      The usage of memory.direct_swapout_memcg_latency and
      memory.direct_swapin_latency is the same as
      memory.direct_swapout_global_latency.
      Signed-off-by: Xu Yu <xuyu@linux.alibaba.com>
      Reviewed-by: Yang Shi <yang.shi@linux.alibaba.com>
      Reviewed-by: Xunlei Pang <xlpang@linux.alibaba.com>
    • X
      alinux: mm, memcg: adjust the latency probe point for memcg direct reclaim · fe673ccf
      Committed by Xu Yu
      to #26424368
      
      Features other than memcg direct reclaim, such as the zombie memcg
      reaper and memcg kswapd, also invoke try_to_free_mem_cgroup_pages.
      Move the latency probe point for memcg direct reclaim from function
      try_to_free_mem_cgroup_pages to function try_charge so that memcg
      direct reclaim can be distinguished from those callers.
      Signed-off-by: Xu Yu <xuyu@linux.alibaba.com>
      Reviewed-by: Yang Shi <yang.shi@linux.alibaba.com>
      Reviewed-by: Xunlei Pang <xlpang@linux.alibaba.com>
    • X
      alinux: mm, memcg: rework memory latency histogram interfaces · 837e53ab
      Committed by Xu Yu
      to #26424368
      
      There is some duplicated code in the original implementation of the
      memory latency histogram, such as {x, y, z}_show and {x, y, z}_write,
      where x, y and z represent various types of memory latency.
      
      This reworks the common code of the memory latency histogram to make it
      easier to add more types of memory latency later.
      Signed-off-by: Xu Yu <xuyu@linux.alibaba.com>
      Reviewed-by: Yang Shi <yang.shi@linux.alibaba.com>
      Reviewed-by: Xunlei Pang <xlpang@linux.alibaba.com>
    • X
      alinux: mm, memcg: record latency of direct reclaim in every memcg · 83058e75
      Committed by Xu Yu
      to #26424368
      
      Probe and calculate the latency of global direct reclaim and memcg
      direct reclaim, respectively, and then group them into the latency
      histogram in struct mem_cgroup. In addition, the total latency is
      accumulated each time the histogram is updated.
      
      Note that the latency in each memcg is aggregated from all child memcgs.
      
      Usage:
      
      $ cat memory.direct_reclaim_global_latency
      0-1ms:  228
      1-5ms:  283
      5-10ms:         0
      10-100ms:       0
      100-500ms:      0
      500-1000ms:     0
      >=1000ms:       0
      total(ms):      539
      
      Each line is the count of global direct reclaim within the appropriate
      latency range.
      
      To clear the latency histogram:
      
      $ echo 0 > memory.direct_reclaim_global_latency
      $ cat memory.direct_reclaim_global_latency
      0-1ms:  0
      1-5ms:  0
      5-10ms:         0
      10-100ms:       0
      100-500ms:      0
      500-1000ms:     0
      >=1000ms:       0
      total(ms):      0
      
      The usage of memory.direct_reclaim_memcg_latency is the same as
      memory.direct_reclaim_global_latency.
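
The histogram shown above can be modelled with a small user-space sketch. The bucket boundaries are taken from the output; everything else is an assumption made for illustration: that each range excludes its upper bound, that latencies are tracked in nanoseconds and the total is printed in milliseconds, and all type and function names, which are hypothetical and not taken from the patch.

```c
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define LAT_NR_BUCKETS 7

/* Hypothetical per-memcg histogram: one counter per range plus a total. */
struct lat_hist {
	uint64_t count[LAT_NR_BUCKETS];
	uint64_t total_ns;
};

/* Exclusive upper bound of each bucket except the last, in nanoseconds. */
static const uint64_t lat_upper_ns[LAT_NR_BUCKETS - 1] = {
	1000000ULL, 5000000ULL, 10000000ULL,
	100000000ULL, 500000000ULL, 1000000000ULL,
};

static const char *const lat_label[LAT_NR_BUCKETS] = {
	"0-1ms", "1-5ms", "5-10ms", "10-100ms",
	"100-500ms", "500-1000ms", ">=1000ms",
};

/* Map a latency to the index of the range it falls in. */
int lat_bucket(uint64_t ns)
{
	int i;

	for (i = 0; i < LAT_NR_BUCKETS - 1; i++)
		if (ns < lat_upper_ns[i])
			return i;
	return LAT_NR_BUCKETS - 1;
}

/* Count one event and accumulate its latency into the total. */
void lat_record(struct lat_hist *h, uint64_t ns)
{
	h->count[lat_bucket(ns)]++;
	h->total_ns += ns;
}

/* Equivalent of writing 0 to the control file. */
void lat_reset(struct lat_hist *h)
{
	memset(h, 0, sizeof(*h));
}

/* Print the histogram in the same layout as the control file. */
void lat_print(const struct lat_hist *h, FILE *out)
{
	int i;

	for (i = 0; i < LAT_NR_BUCKETS; i++)
		fprintf(out, "%s:\t%" PRIu64 "\n", lat_label[i], h->count[i]);
	fprintf(out, "total(ms):\t%" PRIu64 "\n", h->total_ns / 1000000ULL);
}
```

Keeping the total alongside the per-range counts lets a reader estimate the mean latency without needing finer-grained buckets.
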
      Signed-off-by: Xu Yu <xuyu@linux.alibaba.com>
      Reviewed-by: Xunlei Pang <xlpang@linux.alibaba.com>
  4. 18 Mar, 2020 3 commits
    • M
      mm: introduce MADV_PAGEOUT · 23757dcc
      Committed by Minchan Kim
      commit 1a4e58cce84ee88129d5d49c064bd2852b481357 upstream
      
      When a process expects no accesses to a certain memory range for a long
      time, it could hint to the kernel that the pages can be reclaimed
      instantly while the data is preserved for future use.  This could reduce
      workingset eviction and so end up improving performance.
      
      This patch introduces the new MADV_PAGEOUT hint to the madvise(2) syscall.
      MADV_PAGEOUT can be used by a process to mark a memory range as not
      expected to be used for a long time so that the kernel reclaims *any LRU*
      pages instantly.  The hint can help the kernel decide which pages to
      evict proactively.
      
      A note: it intentionally does not apply the SWAP_CLUSTER_MAX LRU page
      isolation limit because it is automatically bounded by the PMD size.  If
      the PMD size (e.g., 256) causes trouble, we could fix it later by
      limiting it to SWAP_CLUSTER_MAX[1].
      
      - man-page material
      
      MADV_PAGEOUT (since Linux x.x)
      
      Do not expect access in the near future so pages in the specified
      regions could be reclaimed instantly regardless of memory pressure.
      Thus, access in the range after successful operation could cause
      major page fault but never lose the up-to-date contents unlike
      MADV_DONTNEED. Pages belonging to a shared mapping are only processed
      if a write access is allowed for the calling process.
      
      MADV_PAGEOUT cannot be applied to locked pages, Huge TLB pages, or
      VM_PFNMAP pages.
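
A minimal user-space sketch of the hint follows. It fills an anonymous mapping, asks the kernel to page it out, and then checks that the contents are still intact, which is the property that distinguishes MADV_PAGEOUT from MADV_DONTNEED. The fallback value 21 for MADV_PAGEOUT is an assumption for older libc headers, and the function name is illustrative. On kernels without the feature, or systems without swap, the call may fail or reclaim nothing, and the sketch only reports that.

```c
#define _DEFAULT_SOURCE
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

#ifndef MADV_PAGEOUT
#define MADV_PAGEOUT 21	/* assumed value; not defined by older headers */
#endif

/*
 * Returns 0 if the data written before the hint is still readable
 * afterwards, -1 on mapping failure or content mismatch. A failing
 * madvise() is reported but is not treated as a fatal error.
 */
int pageout_and_verify(size_t npages)
{
	long page_size = sysconf(_SC_PAGESIZE);
	size_t len = npages * (size_t)page_size;
	unsigned char *buf;
	size_t i;
	int ok = 1;

	buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
		   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (buf == MAP_FAILED)
		return -1;

	memset(buf, 0xA5, len);

	/* Hint that the range will not be used for a long time. */
	if (madvise(buf, len, MADV_PAGEOUT) != 0)
		fprintf(stderr, "madvise(MADV_PAGEOUT): %s\n",
			strerror(errno));

	/* Touching the range may cause major faults but must not lose data. */
	for (i = 0; i < len; i++) {
		if (buf[i] != 0xA5) {
			ok = 0;
			break;
		}
	}

	munmap(buf, len);
	return ok ? 0 : -1;
}
```

Whether pages are actually written to swap can be observed from outside the process, for example through the swap counters in /proc/vmstat.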
      
      [1] https://lore.kernel.org/lkml/20190710194719.GS29695@dhcp22.suse.cz/
      
      [minchan@kernel.org: clear PG_active on MADV_PAGEOUT]
        Link: http://lkml.kernel.org/r/20190802200643.GA181880@google.com
      [akpm@linux-foundation.org: resolve conflicts with hmm.git]
      Link: http://lkml.kernel.org/r/20190726023435.214162-5-minchan@kernel.org
      Signed-off-by: Minchan Kim <minchan@kernel.org>
      Reported-by: kbuild test robot <lkp@intel.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: James E.J. Bottomley <James.Bottomley@HansenPartnership.com>
      Cc: Richard Henderson <rth@twiddle.net>
      Cc: Ralf Baechle <ralf@linux-mips.org>
      Cc: Chris Zankel <chris@zankel.net>
      Cc: Daniel Colascione <dancol@google.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Hillf Danton <hdanton@sina.com>
      Cc: Joel Fernandes (Google) <joel@joelfernandes.org>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Oleksandr Natalenko <oleksandr@redhat.com>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Sonny Rao <sonnyrao@google.com>
      Cc: Suren Baghdasaryan <surenb@google.com>
      Cc: Tim Murray <timmurray@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Reviewed-by: Yang Shi <yang.shi@linux.alibaba.com>
      Signed-off-by: Xunlei Pang <xlpang@linux.alibaba.com>
    • M
      mm: change PAGEREF_RECLAIM_CLEAN with PAGE_REFRECLAIM · a0747c91
      Committed by Minchan Kim
      commit 8940b34a4e082ae11498ddae8432f2ac07685d1c upstream
      
      The local variable references in shrink_page_list defaults to
      PAGEREF_RECLAIM_CLEAN.  This prevents dirty pages from being reclaimed
      when CMA tries to migrate pages.  Strictly speaking, it is not needed
      because CMA already disallows writeout via .may_writepage = 0 in
      reclaim_clean_pages_from_list.
      
      Moreover, it would prevent anonymous pages from being swapped out even
      with force_reclaim = true in shrink_page_list in an upcoming patch.  So
      this patch changes the default value of references to PAGEREF_RECLAIM and
      renames force_reclaim to ignore_references to make it clearer.
      
      This is preparatory work for the next patch.
      
      Link: http://lkml.kernel.org/r/20190726023435.214162-3-minchan@kernel.org
      Signed-off-by: Minchan Kim <minchan@kernel.org>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Chris Zankel <chris@zankel.net>
      Cc: Daniel Colascione <dancol@google.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Hillf Danton <hdanton@sina.com>
      Cc: James E.J. Bottomley <James.Bottomley@HansenPartnership.com>
      Cc: Joel Fernandes (Google) <joel@joelfernandes.org>
      Cc: kbuild test robot <lkp@intel.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Oleksandr Natalenko <oleksandr@redhat.com>
      Cc: Ralf Baechle <ralf@linux-mips.org>
      Cc: Richard Henderson <rth@twiddle.net>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Sonny Rao <sonnyrao@google.com>
      Cc: Suren Baghdasaryan <surenb@google.com>
      Cc: Tim Murray <timmurray@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Reviewed-by: Yang Shi <yang.shi@linux.alibaba.com>
      Signed-off-by: Xunlei Pang <xlpang@linux.alibaba.com>
    • C
      alinux: doc: use unified official project name Cloud Kernel · a60721b9
      Committed by Caspar Zhang
      Cloud Kernel is the official name of our project. This patch unifies
      the project names used in docs and comments.
      Signed-off-by: Caspar Zhang <caspar@linux.alibaba.com>
      Acked-by: Joseph Qi <joseph.qi@linux.alibaba.com>
  5. 15 Jan, 2020 5 commits
  6. 27 Dec, 2019 2 commits
  7. 07 Aug, 2019 1 commit
  8. 28 Jul, 2019 1 commit
    • K
      mm: vmscan: scan anonymous pages on file refaults · c1d98b76
      Committed by Kuo-Hsin Yang
      commit 2c012a4ad1a2cd3fb5a0f9307b9d219f84eda1fa upstream.
      
      When file refaults are detected and there are many inactive file pages,
      the system never reclaims anonymous pages; file pages are dropped
      aggressively while there are still a lot of cold anonymous pages, and
      the system thrashes.  This issue impacts the performance of applications
      with large executables, e.g. chrome.
      
      With this patch, when file refault is detected, inactive_list_is_low()
      always returns true for file pages in get_scan_count() to enable
      scanning anonymous pages.
      
      The problem can be reproduced by the following test program.
      
      ---8<---
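      #include <fcntl.h>
      #include <stdio.h>
      #include <stdlib.h>
      #include <string.h>
      #include <sys/mman.h>
      #include <sys/stat.h>
      #include <sys/time.h>
      #include <unistd.h>
      
      void fallocate_file(const char *filename, off_t size)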
      {
      	struct stat st;
      	int fd;
      
      	if (!stat(filename, &st) && st.st_size >= size)
      		return;
      
      	fd = open(filename, O_WRONLY | O_CREAT, 0600);
      	if (fd < 0) {
      		perror("create file");
      		exit(1);
      	}
      	if (posix_fallocate(fd, 0, size)) {
      		perror("fallocate");
      		exit(1);
      	}
      	close(fd);
      }
      
      long *alloc_anon(long size)
      {
      	long *start = malloc(size);
      	memset(start, 1, size);
      	return start;
      }
      
      long access_file(const char *filename, long size, long rounds)
      {
      	int fd, i;
      	volatile char *start1, *end1, *start2;
      	const int page_size = getpagesize();
      	long sum = 0;
      
      	fd = open(filename, O_RDONLY);
      	if (fd == -1) {
      		perror("open");
      		exit(1);
      	}
      
      	/*
      	 * Some applications, e.g. chrome, use a lot of executable file
      	 * pages, map some of the pages with PROT_EXEC flag to simulate
      	 * the behavior.
      	 */
      	start1 = mmap(NULL, size / 2, PROT_READ | PROT_EXEC, MAP_SHARED,
      		      fd, 0);
      	if (start1 == MAP_FAILED) {
      		perror("mmap");
      		exit(1);
      	}
      	end1 = start1 + size / 2;
      
      	start2 = mmap(NULL, size / 2, PROT_READ, MAP_SHARED, fd, size / 2);
      	if (start2 == MAP_FAILED) {
      		perror("mmap");
      		exit(1);
      	}
      
      	for (i = 0; i < rounds; ++i) {
      		struct timeval before, after;
      		volatile char *ptr1 = start1, *ptr2 = start2;
      		gettimeofday(&before, NULL);
      		for (; ptr1 < end1; ptr1 += page_size, ptr2 += page_size)
      			sum += *ptr1 + *ptr2;
      		gettimeofday(&after, NULL);
      		printf("File access time, round %d: %f (sec)\n", i,
      		       (after.tv_sec - before.tv_sec) +
      		       (after.tv_usec - before.tv_usec) / 1000000.0);
      	}
      	return sum;
      }
      
      int main(int argc, char *argv[])
      {
      	const long MB = 1024 * 1024;
      	long anon_mb, file_mb, file_rounds;
      	const char filename[] = "large";
      	long *ret1;
      	long ret2;
      
      	if (argc != 4) {
      		printf("usage: thrash ANON_MB FILE_MB FILE_ROUNDS\n");
      		exit(0);
      	}
      	anon_mb = atoi(argv[1]);
      	file_mb = atoi(argv[2]);
      	file_rounds = atoi(argv[3]);
      
      	fallocate_file(filename, file_mb * MB);
      	printf("Allocate %ld MB anonymous pages\n", anon_mb);
      	ret1 = alloc_anon(anon_mb * MB);
      	printf("Access %ld MB file pages\n", file_mb);
      	ret2 = access_file(filename, file_mb * MB, file_rounds);
      	printf("Print result to prevent optimization: %ld\n",
      	       *ret1 + ret2);
      	return 0;
      }
      ---8<---
      
      Running the test program on a 2GB RAM VM with kernel 5.2.0-rc5, the program
      fills RAM with 2048 MB of anonymous memory, then accesses a 200 MB file
      10 times.  Without this patch, the file cache is dropped aggressively and
      every access to the file goes to disk.
      
        $ ./thrash 2048 200 10
        Allocate 2048 MB anonymous pages
        Access 200 MB file pages
        File access time, round 0: 2.489316 (sec)
        File access time, round 1: 2.581277 (sec)
        File access time, round 2: 2.487624 (sec)
        File access time, round 3: 2.449100 (sec)
        File access time, round 4: 2.420423 (sec)
        File access time, round 5: 2.343411 (sec)
        File access time, round 6: 2.454833 (sec)
        File access time, round 7: 2.483398 (sec)
        File access time, round 8: 2.572701 (sec)
        File access time, round 9: 2.493014 (sec)
      
      With this patch, these file pages can be cached.
      
        $ ./thrash 2048 200 10
        Allocate 2048 MB anonymous pages
        Access 200 MB file pages
        File access time, round 0: 2.475189 (sec)
        File access time, round 1: 2.440777 (sec)
        File access time, round 2: 2.411671 (sec)
        File access time, round 3: 1.955267 (sec)
        File access time, round 4: 0.029924 (sec)
        File access time, round 5: 0.000808 (sec)
        File access time, round 6: 0.000771 (sec)
        File access time, round 7: 0.000746 (sec)
        File access time, round 8: 0.000738 (sec)
        File access time, round 9: 0.000747 (sec)
      
      I checked the swap-out stats during the test [1]: 19006 pages were swapped
      out with this patch, 3418 pages without it.  There is more swap-out, but
      I think it's within a reasonable range when the file-backed data set
      doesn't fit into memory.
      
        $ ./thrash 2000 100 2100 5 1 # ANON_MB FILE_EXEC FILE_NOEXEC ROUNDS PROCESSES
        Allocate 2000 MB anonymous pages
        active_anon: 1613644, inactive_anon: 348656, active_file: 892, inactive_file: 1384 (kB)
        pswpout: 7972443, pgpgin: 478615246
        Access 100 MB executable file pages
        Access 2100 MB regular file pages
        File access time, round 0: 12.165, (sec)
        active_anon: 1433788, inactive_anon: 478116, active_file: 17896, inactive_file: 24328 (kB)
        File access time, round 1: 11.493, (sec)
        active_anon: 1430576, inactive_anon: 477144, active_file: 25440, inactive_file: 26172 (kB)
        File access time, round 2: 11.455, (sec)
        active_anon: 1427436, inactive_anon: 476060, active_file: 21112, inactive_file: 28808 (kB)
        File access time, round 3: 11.454, (sec)
        active_anon: 1420444, inactive_anon: 473632, active_file: 23216, inactive_file: 35036 (kB)
        File access time, round 4: 11.479, (sec)
        active_anon: 1413964, inactive_anon: 471460, active_file: 31728, inactive_file: 32224 (kB)
        pswpout: 7991449 (+ 19006), pgpgin: 489924366 (+ 11309120)
      
      With 4 processes accessing non-overlapping parts of a large file, 30316
      pages were swapped out with this patch and 5152 pages without it.  The
      swap-out number is small compared to pgpgin.
      
      [1]: https://github.com/vovo/testing/blob/master/mem_thrash.c
      
      Link: http://lkml.kernel.org/r/20190701081038.GA83398@google.com
      Fixes: e9868505 ("mm,vmscan: only evict file pages when we have plenty")
      Fixes: 7c5bd705 ("mm: memcg: only evict file pages when we have plenty")
      Signed-off-by: Kuo-Hsin Yang <vovoy@chromium.org>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Sonny Rao <sonnyrao@chromium.org>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: <stable@vger.kernel.org>	[4.12+]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      [backported to 4.14.y, 4.19.y, 5.1.y: adjust context]
      Signed-off-by: Kuo-Hsin Yang <vovoy@chromium.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      c1d98b76
  9. 10 Jul, 2019 1 commit
    • S
      mm/vmscan.c: prevent useless kswapd loops · 27ce6c26
      Shakeel Butt committed
      commit dffcac2cb88e4ec5906235d64a83d802580b119e upstream.
      
      In production we have noticed hard lockups on large machines running
      large jobs due to kswapd hoarding the lru lock within isolate_lru_pages()
      when sc->reclaim_idx is 0, which is a small zone.  The lru was a couple
      hundred GiB and the condition (page_zonenum(page) > sc->reclaim_idx) in
      isolate_lru_pages() was basically skipping GiBs of pages while holding
      the LRU spinlock with interrupts disabled.
      
      On further inspection, it seems like there are two issues:
      
      (1) If kswapd on the return from balance_pgdat() could not sleep (i.e.
          node is still unbalanced), the classzone_idx is unintentionally set
          to 0 and the whole reclaim cycle of kswapd will try to reclaim only
          the lowest and smallest zone while traversing the whole memory.
      
      (2) Fundamentally isolate_lru_pages() is really bad when the
          allocation has woken kswapd for a smaller zone on a very large machine
          running very large jobs.  It can hoard the LRU spinlock while skipping
          over 100s of GiBs of pages.
      
      This patch only fixes (1).  (2) needs a more fundamental solution.  To
      fix (1), in the kswapd context, if pgdat->kswapd_classzone_idx is
      invalid, use the classzone_idx of the previous kswapd loop; otherwise use
      the one the waker has requested.
      
      Link: http://lkml.kernel.org/r/20190701201847.251028-1-shakeelb@google.com
      Fixes: e716f2eb ("mm, vmscan: prevent kswapd sleeping prematurely due to mismatched classzone_idx")
      Signed-off-by: Shakeel Butt <shakeelb@google.com>
      Reviewed-by: Yang Shi <yang.shi@linux.alibaba.com>
      Acked-by: Mel Gorman <mgorman@techsingularity.net>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Hillf Danton <hdanton@sina.com>
      Cc: Roman Gushchin <guro@fb.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      27ce6c26
  10. 19 Jun, 2019 1 commit
    • M
      mm/vmscan.c: fix trying to reclaim unevictable LRU page · 54a20289
      Minchan Kim committed
      commit a58f2cef26e1ca44182c8b22f4f4395e702a5795 upstream.
      
      Wu Fangsuo reported the bug below.
      
      On the CMA allocation path, isolate_migratepages_range() could isolate
      unevictable LRU pages and reclaim_clean_pages_from_list() can try to
      reclaim them if they are clean file-backed pages.
      
        page:ffffffbf02f33b40 count:86 mapcount:84 mapping:ffffffc08fa7a810 index:0x24
        flags: 0x19040c(referenced|uptodate|arch_1|mappedtodisk|unevictable|mlocked)
        raw: 000000000019040c ffffffc08fa7a810 0000000000000024 0000005600000053
        raw: ffffffc009b05b20 ffffffc009b05b20 0000000000000000 ffffffc09bf3ee80
        page dumped because: VM_BUG_ON_PAGE(PageLRU(page) || PageUnevictable(page))
        page->mem_cgroup:ffffffc09bf3ee80
        ------------[ cut here ]------------
        kernel BUG at /home/build/farmland/adroid9.0/kernel/linux/mm/vmscan.c:1350!
        Internal error: Oops - BUG: 0 [#1] PREEMPT SMP
        Modules linked in:
        CPU: 0 PID: 7125 Comm: syz-executor Tainted: G S              4.14.81 #3
        Hardware name: ASR AQUILAC EVB (DT)
        task: ffffffc00a54cd00 task.stack: ffffffc009b00000
        PC is at shrink_page_list+0x1998/0x3240
        LR is at shrink_page_list+0x1998/0x3240
        pc : [<ffffff90083a2158>] lr : [<ffffff90083a2158>] pstate: 60400045
        sp : ffffffc009b05940
        ..
           shrink_page_list+0x1998/0x3240
           reclaim_clean_pages_from_list+0x3c0/0x4f0
           alloc_contig_range+0x3bc/0x650
           cma_alloc+0x214/0x668
           ion_cma_allocate+0x98/0x1d8
           ion_alloc+0x200/0x7e0
           ion_ioctl+0x18c/0x378
           do_vfs_ioctl+0x17c/0x1780
           SyS_ioctl+0xac/0xc0
      
      Wu found it's due to commit ad6b6704 ("mm: remove SWAP_MLOCK in
      ttu").  Before that commit, unevictable pages went to cull_mlocked, so
      the VM_BUG_ON_PAGE line could not be reached.
      
      To fix the issue, this patch filters out unevictable LRU pages in
      reclaim_clean_pages_from_list() on the CMA path.
      
      Link: http://lkml.kernel.org/r/20190524071114.74202-1-minchan@kernel.org
      Fixes: ad6b6704 ("mm: remove SWAP_MLOCK in ttu")
      Signed-off-by: Minchan Kim <minchan@kernel.org>
      Reported-by: Wu Fangsuo <fangsuowu@asrmicro.com>
      Debugged-by: Wu Fangsuo <fangsuowu@asrmicro.com>
      Tested-by: Wu Fangsuo <fangsuowu@asrmicro.com>
      Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Pankaj Suryawanshi <pankaj.suryawanshi@einfochips.com>
      Cc: <stable@vger.kernel.org>	[4.12+]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      54a20289
  11. 17 May, 2019 1 commit
  12. 20 Feb, 2019 1 commit
  13. 29 Dec, 2018 1 commit
  14. 06 Oct, 2018 1 commit
  15. 21 Sep, 2018 1 commit
    • R
      mm: slowly shrink slabs with a relatively small number of objects · 172b06c3
      Roman Gushchin committed
      9092c71b ("mm: use sc->priority for slab shrink targets") changed the
      way that the target slab pressure is calculated and made it
      priority-based:
      
          delta = freeable >> priority;
          delta *= 4;
          do_div(delta, shrinker->seeks);
      
      The problem is that at the default priority (which is 12) no pressure is
      applied at all if the number of potentially reclaimable objects is less
      than 4096 (1<<12).
      
      This causes the last objects on slab caches of no longer used cgroups to
      (almost) never get reclaimed.  It's obviously a waste of memory.
      
      It can be especially painful, if these stale objects are holding a
      reference to a dying cgroup.  Slab LRU lists are reparented on memcg
      offlining, but corresponding objects are still holding a reference to the
      dying cgroup.  If we don't scan these objects, the dying cgroup can't go
      away.  Most likely, the parent cgroup hasn't any directly charged objects,
      only remaining objects from dying children cgroups.  So it can easily hold
      a reference to hundreds of dying cgroups.
      
      If there are no big spikes in memory pressure, and new memory cgroups are
      created and destroyed periodically, this causes the number of dying
      cgroups to grow steadily, causing a slow-ish and hard-to-detect memory
      "leak".  It's not a real leak, as the memory can eventually be reclaimed,
      but in real life that might never happen.  I've seen hosts with a
      steadily climbing number of dying cgroups, which shows no sign of
      a decline in months, even though the host is loaded with a production
      workload.
      
      It is an obvious waste of memory, and to prevent it, let's apply minimal
      pressure even on small shrinker lists.  E.g.  if there are freeable
      objects, let's scan at least min(freeable, scan_batch) objects.
      
      This fix significantly improves the chance that a dying cgroup is
      reclaimed, and together with some previous patches stops the steady growth
      of the number of dying cgroups on some of our hosts.
      
      Link: http://lkml.kernel.org/r/20180905230759.12236-1-guro@fb.com
      Fixes: 9092c71b ("mm: use sc->priority for slab shrink targets")
      Signed-off-by: Roman Gushchin <guro@fb.com>
      Acked-by: Rik van Riel <riel@surriel.com>
      Cc: Josef Bacik <jbacik@fb.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      172b06c3
  16. 23 Aug, 2018 2 commits
  17. 18 Aug, 2018 9 commits
    • K
      mm: use special value SHRINKER_REGISTERING instead of list_empty() check · 7e010df5
      Kirill Tkhai committed
      The patch introduces a special value SHRINKER_REGISTERING to use instead
      of the list_empty() check to distinguish a registering shrinker from an
      unregistered shrinker.  Why do we need that at all?
      
      Shrinker registration is split in two parts.  The first one is
      prealloc_shrinker(), which allocates shrinker memory and reserves ID in
      shrinker_idr.  This function can fail.  The second is
      register_shrinker_prepared(), and it finalizes the registration.  This
      function actually makes shrinker available to be used from
      shrink_slab(), and it can't fail.
      
      One shrinker may be based on more than one LRU list.  So, we never
      clear the bit in memcg shrinker maps when (one of) the corresponding LRU
      lists becomes empty, since other LRU lists may not be empty.  See the
      superblock shrinker for example: it is based on two LRU lists,
      s_inode_lru and s_dentry_lru.  We do not want to clear the shrinker bit
      when there are no inodes in s_inode_lru, as s_dentry_lru may contain
      dentries.
      
      Instead, we use a special algorithm to detect shrinkers having no
      elements in any of their LRU lists, and this is done in
      shrink_slab_memcg().  See the comment in this function for the details.
      
      Also, in shrink_slab_memcg() we clear the shrinker bit in the map when we
      meet an unregistered shrinker (the bit is set, while there is no shrinker
      in the IDR).  Otherwise, we would have to do that at the moment of
      shrinker unregistration for all memcgs (and this looks worse, since
      iterating over all memcgs may take a long time).  This would also impose
      restrictions on shrinker unregistration order for its users: they would
      have to guarantee that no new elements are added after
      unregister_shrinker() (otherwise, a newly added element would set the
      bit).
      
      So, if we meet a set bit in the map and no shrinker in the IDR when we're
      iterating over the map in shrink_slab_memcg(), this means the
      corresponding shrinker is unregistered, and we must clear the bit.
      
      Another case is shrinker registration.  We want two things there:
      
      1) do_shrink_slab() can be called only for completely registered
         shrinkers;
      
      2) shrinker internal lists may be populated in any order relative to
         register_shrinker_prepared() (let's take the superblock as an
         example).  Both of:
      
        a)list_lru_add(&inode->i_sb->s_inode_lru, &inode->i_lru); [cpu0]
          memcg_set_shrinker_bit();                               [cpu0]
          ...
          register_shrinker_prepared();                           [cpu1]
      
        and
      
        b)register_shrinker_prepared();                           [cpu0]
          ...
          list_lru_add(&inode->i_sb->s_inode_lru, &inode->i_lru); [cpu1]
          memcg_set_shrinker_bit();                               [cpu1]
      
         are legitimate.  We don't want to impose a restriction here and
         force people to use only variant (b).  We don't want to force people to
         ensure that there are no elements in the LRU lists before the shrinker
         is completely registered.  Internal users of LRU lists and the shrinker
         code are two different subsystems, and they should stay self-contained
         with respect to each other.
      
      In case (a) we have the bit set before the shrinker is completely
      registered.  We don't want do_shrink_slab() to be called at this moment,
      so we have to detect such registering shrinkers.
      
      Before this patch, a list_empty() check (shrinker is not linked to the
      list) was used for that.  So, in (a) there could be a bit set, but we
      don't call do_shrink_slab() unless the shrinker is linked to the list.
      It's just an indicator; I just overloaded linking to the list.
      
      This was not the best solution, since it's better not to touch the
      shrinker memory from shrink_slab_memcg() before it's completely
      registered (this will also be useful in the future to make shrink_slab()
      completely lockless).
      
      So, this patch introduces a better way to detect a registering shrinker,
      which avoids dereferencing shrinker memory.  It's just a ~0UL
      value, which we insert into the IDR during ID allocation.  After the
      shrinker is ready to be used, we insert the actual shrinker pointer into
      the IDR, and it becomes available to shrink_slab_memcg().
      
      We can't use NULL instead of this new value for this purpose, as
      shrink_slab_memcg() already uses NULL to detect unregistered shrinkers,
      and we don't want the function to see NULL and clear the bit; otherwise
      (a) won't work.
      
      This is the only thing the patch does: a better way to detect a
      registering shrinker.  It changes nothing else.
      
      This also gives better assembly, but that is a minor aspect of the patch:
      
      Before:
        callq  <idr_find>
        mov    %rax,%r15
        test   %rax,%rax
        je     <shrink_slab_memcg+0x1d5>
        mov    0x20(%rax),%rax
        lea    0x20(%r15),%rdx
        cmp    %rax,%rdx
        je     <shrink_slab_memcg+0xbd>
        mov    0x8(%rsp),%edx
        mov    %r15,%rsi
        lea    0x10(%rsp),%rdi
        callq  <do_shrink_slab>
      
      After:
        callq  <idr_find>
        mov    %rax,%r15
        lea    -0x1(%rax),%rax
        cmp    $0xfffffffffffffffd,%rax
        ja     <shrink_slab_memcg+0x1cd>
        mov    0x8(%rsp),%edx
        mov    %r15,%rsi
        lea    0x10(%rsp),%rdi
        callq  ffffffff810cefd0 <do_shrink_slab>
      
      [ktkhai@virtuozzo.com: add #ifdef CONFIG_MEMCG_KMEM around idr_replace()]
        Link: http://lkml.kernel.org/r/758b8fec-7573-47eb-b26a-7b2847ae7b8c@virtuozzo.com
      Link: http://lkml.kernel.org/r/153355467546.11522.4518015068123480218.stgit@localhost.localdomain
      Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
      Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
      Cc: "Huang, Ying" <ying.huang@intel.com>
      Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Josef Bacik <jbacik@fb.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      7e010df5
    • K
      mm/vmscan.c: move check for SHRINKER_NUMA_AWARE to do_shrink_slab() · ac7fb3ad
      Kirill Tkhai committed
      In shrink_slab_memcg() we do not zero nid when the shrinker is not
      NUMA-aware.  This is not a real problem, since currently all memcg-aware
      shrinkers are NUMA-aware too (we have two: the super_block shrinker and
      the workingset shrinker), but something may change in the future.
      
      Link: http://lkml.kernel.org/r/153320759911.18959.8842396230157677671.stgit@localhost.localdomain
      Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
      Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
      Cc: "Huang, Ying" <ying.huang@intel.com>
      Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Josef Bacik <jbacik@fb.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      ac7fb3ad
    • K
      mm/vmscan.c: clear shrinker bit if there are no objects related to memcg · f90280d6
      Kirill Tkhai committed
      To avoid further unneeded calls of do_shrink_slab() for shrinkers which
      no longer have any charged objects in a memcg, their bits have to
      be cleared.
      
      This patch introduces a lockless mechanism to do that without races
      with a parallel list_lru_add().  After do_shrink_slab() returns
      SHRINK_EMPTY the first time, we clear the bit and call it once again.
      Then we restore the bit if the new return value is different.
      
      Note that a single smp_mb__after_atomic() in shrink_slab_memcg() covers
      two situations:
      
      1)list_lru_add()     shrink_slab_memcg
          list_add_tail()    for_each_set_bit() <--- read bit
                               do_shrink_slab() <--- missed list update (no barrier)
          <MB>                 <MB>
          set_bit()            do_shrink_slab() <--- seen list update
      
      This situation, when the first do_shrink_slab() sees the set bit but
      doesn't see the list update (i.e., a race with the first element being
      queued), is rare.  So we don't add <MB> before the first call of
      do_shrink_slab(), to avoid slowing down the generic case.  Also, the
      second call is needed anyway, as seen below in (2).
      
      2)list_lru_add()      shrink_slab_memcg()
          list_add_tail()     ...
          set_bit()           ...
        ...                   for_each_set_bit()
        do_shrink_slab()        do_shrink_slab()
          clear_bit()           ...
        ...                     ...
        list_lru_add()          ...
          list_add_tail()       clear_bit()
          <MB>                  <MB>
          set_bit()             do_shrink_slab()
      
      The barriers guarantee that the second do_shrink_slab() in the
      right-side task sees the list update if it really cleared the bit.  This
      case is drawn in the code comment.
      
      [Results/performance of the patchset]
      
      After the whole patchset is applied, the test below shows a significant
      increase in performance:
      
        $echo 1 > /sys/fs/cgroup/memory/memory.use_hierarchy
        $mkdir /sys/fs/cgroup/memory/ct
        $echo 4000M > /sys/fs/cgroup/memory/ct/memory.kmem.limit_in_bytes
        $for i in `seq 0 4000`; do mkdir /sys/fs/cgroup/memory/ct/$i;
            echo $$ > /sys/fs/cgroup/memory/ct/$i/cgroup.procs;
            mkdir -p s/$i; mount -t tmpfs $i s/$i;
            touch s/$i/file; done
      
      Then, 5 sequential calls of drop caches:
      
        $time echo 3 > /proc/sys/vm/drop_caches
      
      1)Before:
        0.00user 13.78system 0:13.78elapsed 99%CPU
        0.00user 5.59system 0:05.60elapsed 99%CPU
        0.00user 5.48system 0:05.48elapsed 99%CPU
        0.00user 8.35system 0:08.35elapsed 99%CPU
        0.00user 8.34system 0:08.35elapsed 99%CPU
      
      2)After
        0.00user 1.10system 0:01.10elapsed 99%CPU
        0.00user 0.00system 0:00.01elapsed 64%CPU
        0.00user 0.01system 0:00.01elapsed 82%CPU
        0.00user 0.00system 0:00.01elapsed 64%CPU
        0.00user 0.01system 0:00.01elapsed 82%CPU
      
      The results show that performance increases by at least 548 times.
      
      Shakeel Butt tested this patchset with fork-bomb on his configuration:
      
       > I created 255 memcgs, 255 ext4 mounts and made each memcg create a
       > file containing few KiBs on corresponding mount. Then in a separate
       > memcg of 200 MiB limit ran a fork-bomb.
       >
       > I ran the "perf record -ag -- sleep 60" and below are the results:
       >
       > Without the patch series:
       > Samples: 4M of event 'cycles', Event count (approx.): 3279403076005
       > +  36.40%            fb.sh  [kernel.kallsyms]    [k] shrink_slab
       > +  18.97%            fb.sh  [kernel.kallsyms]    [k] list_lru_count_one
       > +   6.75%            fb.sh  [kernel.kallsyms]    [k] super_cache_count
       > +   0.49%            fb.sh  [kernel.kallsyms]    [k] down_read_trylock
       > +   0.44%            fb.sh  [kernel.kallsyms]    [k] mem_cgroup_iter
       > +   0.27%            fb.sh  [kernel.kallsyms]    [k] up_read
       > +   0.21%            fb.sh  [kernel.kallsyms]    [k] osq_lock
       > +   0.13%            fb.sh  [kernel.kallsyms]    [k] shmem_unused_huge_count
       > +   0.08%            fb.sh  [kernel.kallsyms]    [k] shrink_node_memcg
       > +   0.08%            fb.sh  [kernel.kallsyms]    [k] shrink_node
       >
       > With the patch series:
       > Samples: 4M of event 'cycles', Event count (approx.): 2756866824946
       > +  47.49%            fb.sh  [kernel.kallsyms]    [k] down_read_trylock
       > +  30.72%            fb.sh  [kernel.kallsyms]    [k] up_read
       > +   9.51%            fb.sh  [kernel.kallsyms]    [k] mem_cgroup_iter
       > +   1.69%            fb.sh  [kernel.kallsyms]    [k] shrink_node_memcg
       > +   1.35%            fb.sh  [kernel.kallsyms]    [k] mem_cgroup_protected
       > +   1.05%            fb.sh  [kernel.kallsyms]    [k] queued_spin_lock_slowpath
       > +   0.85%            fb.sh  [kernel.kallsyms]    [k] _raw_spin_lock
       > +   0.78%            fb.sh  [kernel.kallsyms]    [k] lruvec_lru_size
       > +   0.57%            fb.sh  [kernel.kallsyms]    [k] shrink_node
       > +   0.54%            fb.sh  [kernel.kallsyms]    [k] queue_work_on
       > +   0.46%            fb.sh  [kernel.kallsyms]    [k] shrink_slab_memcg
      
      [ktkhai@virtuozzo.com: v9]
        Link: http://lkml.kernel.org/r/153112561772.4097.11011071937553113003.stgit@localhost.localdomain
      Link: http://lkml.kernel.org/r/153063070859.1818.11870882950920963480.stgit@localhost.localdomain
      Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
      Acked-by: Vladimir Davydov <vdavydov.dev@gmail.com>
      Tested-by: Shakeel Butt <shakeelb@google.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
      Cc: Chris Wilson <chris@chris-wilson.co.uk>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Guenter Roeck <linux@roeck-us.net>
      Cc: "Huang, Ying" <ying.huang@intel.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Josef Bacik <jbacik@fb.com>
      Cc: Li RongQing <lirongqing@baidu.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Matthias Kaehlcke <mka@chromium.org>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Philippe Ombredanne <pombredanne@nexb.com>
      Cc: Roman Gushchin <guro@fb.com>
      Cc: Sahitya Tummala <stummala@codeaurora.org>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Waiman Long <longman@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      f90280d6
    • K
      mm: add SHRINK_EMPTY shrinker methods return value · 9b996468
      Kirill Tkhai committed
      We need to distinguish the situations when a shrinker has a very small
      number of objects (see vfs_pressure_ratio() called from
      super_cache_count()), and when it has no objects at all.  Currently, in
      both of these cases, shrinker::count_objects() returns 0.
      
      The patch introduces a new SHRINK_EMPTY return value, which will be used
      for the "no objects at all" case.  It is mostly a refactoring, as
      SHRINK_EMPTY is replaced by 0 by all callers of do_shrink_slab() in this
      patch, and all the magic will happen in later patches.
      
      Link: http://lkml.kernel.org/r/153063069574.1818.11037751256699341813.stgit@localhost.localdomain
      Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
      Acked-by: Vladimir Davydov <vdavydov.dev@gmail.com>
      Tested-by: Shakeel Butt <shakeelb@google.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
      Cc: Chris Wilson <chris@chris-wilson.co.uk>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Guenter Roeck <linux@roeck-us.net>
      Cc: "Huang, Ying" <ying.huang@intel.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Josef Bacik <jbacik@fb.com>
      Cc: Li RongQing <lirongqing@baidu.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Matthias Kaehlcke <mka@chromium.org>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Philippe Ombredanne <pombredanne@nexb.com>
      Cc: Roman Gushchin <guro@fb.com>
      Cc: Sahitya Tummala <stummala@codeaurora.org>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Waiman Long <longman@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    •
      mm/vmscan.c: generalize shrink_slab() calls in shrink_node() · aeed1d32
      Vladimir Davydov committed
      The patch makes shrink_slab() be called for root_mem_cgroup in the same
      way as it's called for the rest of cgroups.  This simplifies the logic
      and improves the readability.
      
      [ktkhai@virtuozzo.com: wrote changelog]
      Link: http://lkml.kernel.org/r/153063068338.1818.11496084754797453962.stgit@localhost.localdomain
      Signed-off-by: Vladimir Davydov <vdavydov.dev@gmail.com>
      Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
      Tested-by: Shakeel Butt <shakeelb@google.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
      Cc: Chris Wilson <chris@chris-wilson.co.uk>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Guenter Roeck <linux@roeck-us.net>
      Cc: "Huang, Ying" <ying.huang@intel.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Josef Bacik <jbacik@fb.com>
      Cc: Li RongQing <lirongqing@baidu.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Matthias Kaehlcke <mka@chromium.org>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Philippe Ombredanne <pombredanne@nexb.com>
      Cc: Roman Gushchin <guro@fb.com>
      Cc: Sahitya Tummala <stummala@codeaurora.org>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Waiman Long <longman@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    •
      mm/vmscan.c: iterate only over charged shrinkers during memcg shrink_slab() · b0dedc49
      Kirill Tkhai committed
      Using the preparations made in previous patches, in case of memcg
      shrink, we may skip shrinkers that are not set in the memcg's shrinkers
      bitmap.  To do that, we separate iterations over memcg-aware and
      !memcg-aware shrinkers, and memcg-aware shrinkers are chosen via
      for_each_set_bit() from the bitmap.  On big nodes with many
      isolated environments, this gives a significant performance gain.  See
      the next patches for the details.
      
      Note that the patch does not handle empty memcg shrinkers, since we
      never clear a bitmap bit once it has been set.  Their shrinkers will
      be called again, with no objects shrunk as a result.  This functionality
      is provided by the next patches.
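      
      The bitmap-driven walk can be sketched in userspace C as follows.
      This is only an illustration under simplifying assumptions: the map
      is a single 64-bit word, and struct toy_memcg, toy_set_shrinker_bit()
      and toy_shrink_slab_memcg() are hypothetical names, not kernel
      symbols.  The explicit bit test stands in for for_each_set_bit().

```c
#include <assert.h>
#include <stddef.h>

#define TOY_MAX_SHRINKERS 64	/* hypothetical cap for this sketch */

/* Hypothetical per-memcg state: bit i is set once shrinker i has been
 * charged for this memcg.  Bits are never cleared in this sketch. */
struct toy_memcg {
	unsigned long long shrinker_map;
};

static void toy_set_shrinker_bit(struct toy_memcg *m, unsigned int id)
{
	assert(id < TOY_MAX_SHRINKERS);
	m->shrinker_map |= 1ULL << id;
}

/* Visit only the shrinkers whose bit is set.  The ids visited are stored
 * in visited[] (up to cap entries); the return value is the number of
 * shrinkers that would have been called. */
static size_t toy_shrink_slab_memcg(const struct toy_memcg *m,
				    unsigned int *visited, size_t cap)
{
	size_t calls = 0;

	for (unsigned int id = 0; id < TOY_MAX_SHRINKERS; id++) {
		if (!(m->shrinker_map & (1ULL << id)))
			continue;	/* never charged: skip entirely */
		if (calls < cap)
			visited[calls] = id;
		calls++;
	}
	return calls;
}
```

      A memcg with only bits 3 and 17 set results in two calls instead of
      one per registered shrinker.  Because bits stay set, a shrinker that
      has since become empty is still visited, as the note above describes.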
      
      [ktkhai@virtuozzo.com: v9]
        Link: http://lkml.kernel.org/r/153112558507.4097.12713813335683345488.stgit@localhost.localdomain
      Link: http://lkml.kernel.org/r/153063066653.1818.976035462801487910.stgit@localhost.localdomain
      Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
      Acked-by: Vladimir Davydov <vdavydov.dev@gmail.com>
      Tested-by: Shakeel Butt <shakeelb@google.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
      Cc: Chris Wilson <chris@chris-wilson.co.uk>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Guenter Roeck <linux@roeck-us.net>
      Cc: "Huang, Ying" <ying.huang@intel.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Josef Bacik <jbacik@fb.com>
      Cc: Li RongQing <lirongqing@baidu.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Matthias Kaehlcke <mka@chromium.org>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Philippe Ombredanne <pombredanne@nexb.com>
      Cc: Roman Gushchin <guro@fb.com>
      Cc: Sahitya Tummala <stummala@codeaurora.org>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Waiman Long <longman@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    •
      mm, memcg: assign memcg-aware shrinkers bitmap to memcg · 0a4465d3
      Kirill Tkhai committed
      Imagine a big node with many cpus, memory cgroups and containers.
      Suppose we have 200 containers, each with 10 mounts and 10 cgroups.
      No container task touches another container's mounts.  If there is
      intensive page writing and global reclaim happens, a writing task has to
      iterate over all memcgs to shrink slab before it is able to go to
      shrink_page_list().
      
      Iteration over all the memcg slabs is very expensive: the task has to
      visit 200 * 10 = 2000 shrinkers for every memcg, and since there are
      2000 memcgs, the total calls are 2000 * 2000 = 4000000.
      
      So, the shrinker makes 4 million do_shrink_slab() calls just to try to
      isolate SWAP_CLUSTER_MAX pages in one of the actively writing memcgs via
      shrink_page_list().  I've observed a node spending almost 100% of its
      time in the kernel, uselessly iterating over already shrunk slab.
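      
      The arithmetic above can be checked with a short C sketch.  The
      function names are hypothetical.  The second function rests on an
      assumption that is not a measurement from this patch: that each memcg
      only ever charges objects in the mounts of its own container.

```c
#include <assert.h>
#include <limits.h>

/* Calls made when every memcg walks every memcg-aware shrinker. */
static unsigned long toy_calls_without_bitmap(unsigned long shrinkers,
					      unsigned long memcgs)
{
	/* Guard against overflow of the product. */
	assert(memcgs == 0 || shrinkers <= ULONG_MAX / memcgs);
	return shrinkers * memcgs;
}

/* Calls made when each memcg walks only the shrinkers it has charged
 * (assumed here to be a fixed number per memcg). */
static unsigned long toy_calls_with_bitmap(unsigned long memcgs,
					   unsigned long charged_per_memcg)
{
	assert(memcgs == 0 || charged_per_memcg <= ULONG_MAX / memcgs);
	return memcgs * charged_per_memcg;
}
```

      For 200 containers with 10 mounts and 10 cgroups each, the first
      function is called with 2000 shrinkers and 2000 memcgs, reproducing
      the 4000000 figure.  Under the stated assumption, the second would be
      called with 2000 memcgs and 10 charged shrinkers each.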
      
      This patch adds a bitmap of memcg-aware shrinkers to memcg.  The size of
      the bitmap depends on bitmap_nr_ids, and during the memcg's life it is
      maintained to be large enough to fit bitmap_nr_ids shrinkers.  Every bit
      in the map corresponds to a shrinker id.
      
      The next patches will keep a bit set only for memcgs that are actually
      charged.  This will allow shrink_slab() to improve its performance
      significantly.  See the last patch for the numbers.
      
      [ktkhai@virtuozzo.com: v9]
        Link: http://lkml.kernel.org/r/153112549031.4097.3576147070498769979.stgit@localhost.localdomain
      [ktkhai@virtuozzo.com: add comment to mem_cgroup_css_online()]
        Link: http://lkml.kernel.org/r/521f9e5f-c436-b388-fe83-4dc870bfb489@virtuozzo.com
      Link: http://lkml.kernel.org/r/153063056619.1818.12550500883688681076.stgit@localhost.localdomain
      Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
      Acked-by: Vladimir Davydov <vdavydov.dev@gmail.com>
      Tested-by: Shakeel Butt <shakeelb@google.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
      Cc: Chris Wilson <chris@chris-wilson.co.uk>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Guenter Roeck <linux@roeck-us.net>
      Cc: "Huang, Ying" <ying.huang@intel.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Josef Bacik <jbacik@fb.com>
      Cc: Li RongQing <lirongqing@baidu.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Matthias Kaehlcke <mka@chromium.org>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Philippe Ombredanne <pombredanne@nexb.com>
      Cc: Roman Gushchin <guro@fb.com>
      Cc: Sahitya Tummala <stummala@codeaurora.org>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Waiman Long <longman@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    •
      mm: assign id to every memcg-aware shrinker · b4c2b231
      Kirill Tkhai committed
      Introduce the shrinker::id number, which is used to enumerate memcg-aware
      shrinkers.  The numbers start from 0, and the code tries to keep them
      as small as possible.
      
      This will be used to represent memcg-aware shrinkers in the memcg
      shrinkers map.
      
      Since all memcg-aware shrinkers are based on list_lru, which is
      per-memcg in case of CONFIG_MEMCG_KMEM only, the new functionality will
      be under this config option.
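      
      A lowest-free-id allocator of the kind described above can be
      sketched in userspace C.  This is an illustration only: the 64-id
      limit, struct toy_id_pool, toy_alloc_id() and toy_free_id() are
      assumptions of the sketch and do not reflect how the kernel actually
      stores the ids.

```c
#include <assert.h>

#define TOY_NR_IDS 64	/* hypothetical limit for this sketch */

/* Hypothetical pool: bit i is set while id i is in use. */
struct toy_id_pool {
	unsigned long long used;
};

/* Hand out the smallest id that is currently free, or -1 if none. */
static int toy_alloc_id(struct toy_id_pool *p)
{
	for (int id = 0; id < TOY_NR_IDS; id++) {
		if (!(p->used & (1ULL << id))) {
			p->used |= 1ULL << id;
			return id;
		}
	}
	return -1;
}

/* Release an id so that a later allocation can reuse it. */
static void toy_free_id(struct toy_id_pool *p, int id)
{
	assert(id >= 0 && id < TOY_NR_IDS);
	p->used &= ~(1ULL << id);
}
```

      Reusing the lowest free id keeps the largest id in use small, which
      in turn keeps any per-memcg map indexed by these ids short.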
      
      [ktkhai@virtuozzo.com: v9]
        Link: http://lkml.kernel.org/r/153112546435.4097.10607140323811756557.stgit@localhost.localdomain
      Link: http://lkml.kernel.org/r/153063054586.1818.6041047871606697364.stgit@localhost.localdomain
      Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
      Acked-by: Vladimir Davydov <vdavydov.dev@gmail.com>
      Tested-by: Shakeel Butt <shakeelb@google.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
      Cc: Chris Wilson <chris@chris-wilson.co.uk>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Guenter Roeck <linux@roeck-us.net>
      Cc: "Huang, Ying" <ying.huang@intel.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Josef Bacik <jbacik@fb.com>
      Cc: Li RongQing <lirongqing@baidu.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Matthias Kaehlcke <mka@chromium.org>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Philippe Ombredanne <pombredanne@nexb.com>
      Cc: Roman Gushchin <guro@fb.com>
      Cc: Sahitya Tummala <stummala@codeaurora.org>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Waiman Long <longman@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    •
      mm/vmscan.c: condense scan_control · bb451fdf
      Greg Thelen committed
      Use smaller scan_control fields for order, priority, and reclaim_idx.
      Convert fields from int => s8.  All easily fit within a byte:
      
       - allocation order range: 0..MAX_ORDER(64?)
       - priority range:         0..12(DEF_PRIORITY)
       - reclaim_idx range:      0..6(__MAX_NR_ZONES)
      
      Since 6538b8ea ("x86_64: expand kernel stack to 16K") x86_64 stack
      overflows are not an issue.  But it's inefficient to use ints.
      
      Use s8 (signed byte) rather than u8 to allow for loops like:
      	do {
      		...
      	} while (--sc.priority >= 0);
      
      Add BUILD_BUG_ON to verify that s8 is capable of storing max values.
      
      This reduces sizeof(struct scan_control):
       - 96 => 80 bytes (x86_64)
       - 68 => 56 bytes (i386)
      
      scan_control structure field order is changed to utilize padding.  After
      this patch there is 1 bit of scan_control padding.
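      
      The reason for choosing a signed byte can be seen in a small
      userspace sketch.  struct toy_scan_control is a hypothetical,
      heavily reduced stand-in for the real structure, int8_t plays the
      role of s8, and _Static_assert plays the role of BUILD_BUG_ON.

```c
#include <assert.h>
#include <stdint.h>

#define TOY_DEF_PRIORITY 12	/* priority range 0..12, per the text above */

/* Hypothetical reduced scan_control: the three small fields share one
 * word-sized slot instead of taking an int each. */
struct toy_scan_control {
	unsigned long nr_to_reclaim;
	int8_t order;
	int8_t priority;
	int8_t reclaim_idx;
};

/* Compile-time check that the largest priority fits in a signed byte. */
_Static_assert(TOY_DEF_PRIORITY <= INT8_MAX, "priority must fit in int8_t");

/* Count how many times the priority loop body runs.  With a signed
 * field the loop stops once priority drops to -1; with an unsigned
 * byte the decrement of 0 would wrap to 255 and the condition would
 * always be true. */
static int toy_count_passes(void)
{
	struct toy_scan_control sc = { .priority = TOY_DEF_PRIORITY };
	int passes = 0;

	do {
		passes++;
	} while (--sc.priority >= 0);

	return passes;
}
```

      The loop body runs once for each priority from 12 down to 0.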
      
      akpm: makes my vmscan.o's .text 572 bytes smaller as well.
      
      Link: http://lkml.kernel.org/r/20180530061212.84915-1-gthelen@google.com
      Signed-off-by: Greg Thelen <gthelen@google.com>
      Suggested-by: Matthew Wilcox <willy@infradead.org>
      Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  18. 08 Jun, 2018 2 commits