1. 05 Jan, 2022 2 commits
  2. 03 Jan, 2022 1 commit
  3. 10 Nov, 2021 1 commit
    • vfs: keep inodes with page cache off the inode shrinker LRU · 51b8c1fe
      Committed by Johannes Weiner
      Historically (pre-2.5), the inode shrinker used to reclaim only empty
      inodes and skip over those that still contained page cache.  This caused
      problems on highmem hosts: struct inode could fill lowmem zones
      before the cache was getting reclaimed in the highmem zones.
      
      To address this, the inode shrinker started to strip page cache to
      facilitate reclaiming lowmem.  However, this comes with its own set of
      problems: the shrinkers may drop actively used page cache just because
      the inodes are not currently open or dirty - think working with a large
      git tree.  It further doesn't respect cgroup memory protection settings
      and can cause priority inversions between containers.
      
      Nowadays, the page cache also holds non-resident info for evicted cache
      pages in order to detect refaults.  We've come to rely heavily on this
      data inside reclaim for protecting the cache workingset and driving swap
      behavior.  We also use it to quantify and report workload health through
      psi.  The latter in turn is used for fleet health monitoring, as well as
      driving automated memory sizing of workloads and containers, proactive
      reclaim and memory offloading schemes.
      
      The consequence of dropping page cache prematurely is that we're seeing
      subtle and not-so-subtle failures in all of the above-mentioned
      scenarios, with the workload generally entering unexpected thrashing
      states while losing the ability to reliably detect it.
      
      To fix this on non-highmem systems at least, going back to rotating
      inodes on the LRU isn't feasible.  We've tried (commit a76cf1a4
      ("mm: don't reclaim inodes with many attached pages")) and failed
      (commit 69056ee6 ("Revert "mm: don't reclaim inodes with many
      attached pages"")).
      
      The issue is mostly that shrinker pools attract pressure based on their
      size, and when objects get skipped the shrinkers remember this as
      deferred reclaim work.  This accumulates excessive pressure on the
      remaining inodes, and we can quickly eat into heavily used ones, or
      dirty ones that require IO to reclaim, when there potentially is plenty
      of cold, clean cache around still.
      
      Instead, this patch keeps populated inodes off the inode LRU in the
      first place - just like an open file or dirty state would.  An otherwise
      clean and unused inode then gets queued when the last cache entry
      disappears.  This solves the problem without reintroducing the reclaim
      issues, and generally is a bit more scalable than having to wade through
      potentially hundreds of thousands of busy inodes.
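
      The queueing rule described above can be illustrated with a toy
      model.  This is only a sketch, not kernel code: the Inode class,
      its fields and the helper names are hypothetical, and all locking
      is left out.

```python
from dataclasses import dataclass


@dataclass
class Inode:
    """Toy stand-in for struct inode; field names are illustrative only."""
    refcount: int = 0        # open references to the inode
    dirty: bool = False      # inode has dirty state
    cache_entries: int = 0   # page cache entries still attached
    on_lru: bool = False     # queued on the inode shrinker LRU


def lru_eligible(inode):
    """An inode may sit on the shrinker LRU only if nothing pins it."""
    return inode.refcount == 0 and not inode.dirty and inode.cache_entries == 0


def update_lru(inode):
    """Re-evaluate LRU membership after a state change."""
    inode.on_lru = lru_eligible(inode)


def delete_cache_entry(inode):
    """Drop one cache entry; an otherwise unused inode is queued once empty."""
    assert inode.cache_entries > 0
    inode.cache_entries -= 1
    update_lru(inode)


if __name__ == "__main__":
    ino = Inode(cache_entries=2)
    update_lru(ino)
    delete_cache_entry(ino)
    print(ino.on_lru)   # still populated, so still off the LRU
    delete_cache_entry(ino)
    print(ino.on_lru)   # last entry gone, inode is now queued
```

      In this model a populated inode is treated exactly like an open or
      dirty one: the shrinker never sees it until the last entry is gone.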
      
      Locking is a bit tricky because the locks protecting the inode state
      (i_lock) and the inode LRU (lru_list.lock) don't nest inside the
      irq-safe page cache lock (i_pages.xa_lock).  Page cache deletions are
      serialized through i_lock, taken before the i_pages lock, to make sure
      depopulated inodes are queued reliably.  Additions may race with
      deletions, but we'll check again in the shrinker.  If additions race
      with the shrinker itself, we're protected by the i_lock: if find_inode()
      or iput() win, the shrinker will bail on the elevated i_count or
      I_REFERENCED; if the shrinker wins and goes ahead with the inode, it
      will set I_FREEING and inhibit further igets(), which will cause the
      other side to create a new instance of the inode instead.
      
      Link: https://lkml.kernel.org/r/20210614211904.14420-4-hannes@cmpxchg.org
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Roman Gushchin <guro@fb.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Dave Chinner <david@fromorbit.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  4. 07 Nov, 2021 7 commits
    • mm/vmscan: throttle reclaim until some writeback completes if congested · 8cd7c588
      Committed by Mel Gorman
      Patch series "Remove dependency on congestion_wait in mm/", v5.
      
      This series removes all calls to congestion_wait in mm/ and deletes
      wait_iff_congested.  It's not a clever implementation but
      congestion_wait has been broken for a long time [1].
      
      Even if congestion throttling worked, it was never a great idea.  While
      excessive dirty/writeback pages at the tail of the LRU are one
      reason that reclaim may be slow, there is also the problem of
      reclaim failing for other reasons (elevated references, too many
      pages isolated, excessive LRU contention, etc).
      
      This series replaces the "congestion" throttling with 3 different types.
      
       - If there are too many dirty/writeback pages, sleep until a timeout or
         enough pages get cleaned
      
       - If too many pages are isolated, sleep until enough isolated pages are
         either reclaimed or put back on the LRU
      
       - If no progress is being made, direct reclaim tasks sleep until
         another task makes progress with acceptable efficiency.
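
      The three throttle types can be sketched as a small decision
      helper.  This is an illustrative sketch only: the reason names and
      timeouts are taken from the trace output quoted later in this
      message, while the boolean inputs, the order of the checks and the
      function name are assumptions.

```python
from enum import Enum


class ThrottleReason(Enum):
    WRITEBACK = "VMSCAN_THROTTLE_WRITEBACK"
    ISOLATED = "VMSCAN_THROTTLE_ISOLATED"
    NOPROGRESS = "VMSCAN_THROTTLE_NOPROGRESS"


# Timeouts in microseconds, as they appear in the trace output (usec_timeout=...).
TIMEOUT_USEC = {
    ThrottleReason.WRITEBACK: 100000,
    ThrottleReason.ISOLATED: 20000,
    ThrottleReason.NOPROGRESS: 500000,
}


def pick_throttle(too_many_dirty, too_many_isolated, made_progress):
    """Return the throttle reason for a reclaimer, or None if it may continue.

    The priority order of the checks is an assumption made for this sketch.
    """
    if too_many_dirty:
        return ThrottleReason.WRITEBACK
    if too_many_isolated:
        return ThrottleReason.ISOLATED
    if not made_progress:
        return ThrottleReason.NOPROGRESS
    return None


if __name__ == "__main__":
    reason = pick_throttle(False, True, False)
    print(reason.value, TIMEOUT_USEC[reason])
```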
      
      This was initially tested with a mix of workloads that used to trigger
      corner cases that no longer work.  A new test case was created called
      "stutterp" (pagereclaim-stutterp-noreaders in mmtests) using a freshly
      created XFS filesystem.  Note that it may be necessary to increase the
      timeout of ssh if executing remotely as ssh itself can get throttled and
      the connection may timeout.
      
      stutterp varies the number of "worker" processes from 4 up to NR_CPUS*4
      to check the impact as the number of direct reclaimers increases.  It has
      four types of worker.
      
       - One "anon latency" worker creates small mappings with mmap() and
         times how long it takes to fault the mapping reading it 4K at a time
      
       - X file writers, which are fio randomly writing X files where the total
         size of the files adds up to the allowed dirty_ratio. fio is allowed
         to run for a warmup period to allow some file-backed pages to
         accumulate. The duration of the warmup is based on the best-case
         linear write speed of the storage.
      
       - Y file readers, which are fio randomly reading small files
      
       - Z anon memory hogs which continually map (100-dirty_ratio)% of memory
      
       - Total estimated WSS = (100+dirty_ratio) percent of memory
      
      X+Y+Z+1 == NR_WORKERS varying from 4 up to NR_CPUS*4
      
      The intent is to maximise the total WSS with a mix of file and anon
      memory where some anonymous memory must be swapped and there is a high
      likelihood of dirty/writeback pages reaching the end of the LRU.
      
      The test can be configured to have no background readers to stress
      dirty/writeback pages.  The results below are based on having zero
      readers.
      
      The short summary of the results is that the series works and stalls
      until some event occurs but the timeouts may need adjustment.
      
      The test results are not broken down by patch as the series should be
      treated as one block that replaces a broken throttling mechanism with a
      working one.
      
      Finally, three machines were tested but I'm reporting the worst set of
      results.  The other two machines had much better latencies for example.
      
      First, the results of the "anon latency" worker
      
        stutterp
                                      5.15.0-rc1             5.15.0-rc1
                                         vanilla mm-reclaimcongest-v5r4
        Amean     mmap-4      31.4003 (   0.00%)   2661.0198 (-8374.52%)
        Amean     mmap-7      38.1641 (   0.00%)    149.2891 (-291.18%)
        Amean     mmap-12     60.0981 (   0.00%)    187.8105 (-212.51%)
        Amean     mmap-21    161.2699 (   0.00%)    213.9107 ( -32.64%)
        Amean     mmap-30    174.5589 (   0.00%)    377.7548 (-116.41%)
        Amean     mmap-48   8106.8160 (   0.00%)   1070.5616 (  86.79%)
        Stddev    mmap-4      41.3455 (   0.00%)  27573.9676 (-66591.66%)
        Stddev    mmap-7      53.5556 (   0.00%)   4608.5860 (-8505.23%)
        Stddev    mmap-12    171.3897 (   0.00%)   5559.4542 (-3143.75%)
        Stddev    mmap-21   1506.6752 (   0.00%)   5746.2507 (-281.39%)
        Stddev    mmap-30    557.5806 (   0.00%)   7678.1624 (-1277.05%)
        Stddev    mmap-48  61681.5718 (   0.00%)  14507.2830 (  76.48%)
        Max-90    mmap-4      31.4243 (   0.00%)     83.1457 (-164.59%)
        Max-90    mmap-7      41.0410 (   0.00%)     41.0720 (  -0.08%)
        Max-90    mmap-12     66.5255 (   0.00%)     53.9073 (  18.97%)
        Max-90    mmap-21    146.7479 (   0.00%)    105.9540 (  27.80%)
        Max-90    mmap-30    193.9513 (   0.00%)     64.3067 (  66.84%)
        Max-90    mmap-48    277.9137 (   0.00%)    591.0594 (-112.68%)
        Max       mmap-4    1913.8009 (   0.00%) 299623.9695 (-15555.96%)
        Max       mmap-7    2423.9665 (   0.00%) 204453.1708 (-8334.65%)
        Max       mmap-12   6845.6573 (   0.00%) 221090.3366 (-3129.64%)
        Max       mmap-21  56278.6508 (   0.00%) 213877.3496 (-280.03%)
        Max       mmap-30  19716.2990 (   0.00%) 216287.6229 (-997.00%)
        Max       mmap-48 477923.9400 (   0.00%) 245414.8238 (  48.65%)
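
      The percentages in parentheses appear to be the relative change
      against the vanilla column, with positive values meaning lower
      (better) latency.  A minimal sketch under that assumption, using
      a hypothetical helper name and two rows from the table:

```python
def relative_gain(vanilla, patched):
    """Percent improvement over vanilla; negative means the patched kernel is worse."""
    return (vanilla - patched) / vanilla * 100.0


if __name__ == "__main__":
    # Amean mmap-48: the table lists 86.79%
    print(f"{relative_gain(8106.8160, 1070.5616):.2f}")
    # Max-90 mmap-7: the table lists -0.08%
    print(f"{relative_gain(41.0410, 41.0720):.2f}")
```

      Because the inputs are already rounded to four decimals, recomputed
      values for some rows can differ from the table in the last digit.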
      
      For most thread counts, the time to mmap() is unfortunately increased.
      In earlier versions of the series, this was lower but a large number of
      throttling events were reaching their timeout increasing the amount of
      inefficient scanning of the LRU.  There is no prioritisation of reclaim
      tasks making progress based on each task's rate of page allocation versus
      progress of reclaim.  The variance is also impacted for high worker
      counts but in all cases, the differences in latency are not
      statistically significant due to very large maximum outliers.  Max-90
      shows that 90% of the stalls are comparable but the Max results show the
      massive outliers which are increased due to stalling.
      
      It is expected that this will be very machine dependent.  Due to the
      test design, reclaim is difficult so allocations stall and there are
      variances depending on whether THPs can be allocated or not.  The amount
      of memory will affect exactly how bad the corner cases are and how often
      they trigger.  The warmup period calculation is not ideal as it's based
      on linear writes whereas fio is randomly writing multiple files from
      multiple tasks so the start state of the test is variable.  For example,
      these are the latencies on a single-socket machine that had more memory
      
        Amean     mmap-4      42.2287 (   0.00%)     49.6838 * -17.65%*
        Amean     mmap-7     216.4326 (   0.00%)     47.4451 *  78.08%*
        Amean     mmap-12   2412.0588 (   0.00%)     51.7497 (  97.85%)
        Amean     mmap-21   5546.2548 (   0.00%)     51.8862 (  99.06%)
        Amean     mmap-30   1085.3121 (   0.00%)     72.1004 (  93.36%)
      
      The overall system CPU usage and elapsed time is as follows
      
                          5.15.0-rc3  5.15.0-rc3
                             vanilla mm-reclaimcongest-v5r4
        Duration User        6989.03      983.42
        Duration System      7308.12      799.68
        Duration Elapsed     2277.67     2092.98
      
      The patches reduce system CPU usage by 89% as the vanilla kernel is rarely
      stalling.
      
      The high-level /proc/vmstats show
      
                                             5.15.0-rc1     5.15.0-rc1
                                                vanilla mm-reclaimcongest-v5r2
        Ops Direct pages scanned          1056608451.00   503594991.00
        Ops Kswapd pages scanned           109795048.00   147289810.00
        Ops Kswapd pages reclaimed          63269243.00    31036005.00
        Ops Direct pages reclaimed          10803973.00     6328887.00
        Ops Kswapd efficiency %                   57.62          21.07
        Ops Kswapd velocity                    48204.98       57572.86
        Ops Direct efficiency %                    1.02           1.26
        Ops Direct velocity                   463898.83      196845.97
      
      Kswapd scanned less pages but the detailed pattern is different.  The
      vanilla kernel scans slowly over time whereas the patches exhibit
      burst patterns of scan activity.  Direct reclaim scanning is reduced by
      52% due to stalling.
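
      The efficiency rows in the table above are consistent with pages
      reclaimed divided by pages scanned.  A small sketch under that
      assumption (the helper name is hypothetical; the inputs are the
      table values):

```python
def efficiency(reclaimed, scanned):
    """Reclaim efficiency in percent: pages reclaimed per page scanned."""
    return reclaimed / scanned * 100.0


if __name__ == "__main__":
    rows = {
        "kswapd vanilla": (63269243, 109795048),
        "kswapd patched": (31036005, 147289810),
        "direct vanilla": (10803973, 1056608451),
        "direct patched": (6328887, 503594991),
    }
    for name, (reclaimed, scanned) in rows.items():
        print(f"{name}: {efficiency(reclaimed, scanned):.2f}%")
```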
      
      The pattern for stealing pages is also slightly different.  Both kernels
      exhibit spikes but the vanilla kernel when reclaiming shows pages being
      reclaimed over a period of time whereas the patches tend to reclaim in
      spikes.  The difference is that vanilla is not throttling and instead
      scanning constantly finding some pages over time whereas the patched
      kernel throttles and reclaims in spikes.
      
        Ops Percentage direct scans               90.59          77.37
      
      In the vanilla kernel, 90.59% of page scans came from direct reclaim
      whereas with the patches, 77.37% did, due to throttling.
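
      The "Percentage direct scans" figure matches direct scans as a
      share of all scanned pages (direct plus kswapd).  A minimal check
      under that assumption, with a hypothetical helper name and the
      scan counts from the vmstat table:

```python
def direct_scan_share(direct_scanned, kswapd_scanned):
    """Share of all scanned pages that were scanned by direct reclaim, in percent."""
    return direct_scanned / (direct_scanned + kswapd_scanned) * 100.0


if __name__ == "__main__":
    print(f"vanilla: {direct_scan_share(1056608451, 109795048):.2f}%")
    print(f"patched: {direct_scan_share(503594991, 147289810):.2f}%")
```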
      
        Ops Page writes by reclaim           2613590.00     1687131.00
      
      Page writes from reclaim context are reduced.
      
        Ops Page writes anon                 2932752.00     1917048.00
      
      And there is less swapping.
      
        Ops Page reclaim immediate         996248528.00   107664764.00
      
      The number of pages encountered at the tail of the LRU tagged for
      immediate reclaim but still dirty/writeback is reduced by 89%.
      
        Ops Slabs scanned                     164284.00      153608.00
      
      Slab scan activity is similar.
      
      ftrace was used to gather stall activity
      
        Vanilla
        -------
            1 writeback_wait_iff_congested: usec_timeout=100000 usec_delayed=16000
            2 writeback_wait_iff_congested: usec_timeout=100000 usec_delayed=12000
            8 writeback_wait_iff_congested: usec_timeout=100000 usec_delayed=8000
           29 writeback_wait_iff_congested: usec_timeout=100000 usec_delayed=4000
        82394 writeback_wait_iff_congested: usec_timeout=100000 usec_delayed=0
      
      The vast majority of wait_iff_congested calls do not stall at all.  What
      is likely happening is that cond_resched() reschedules the task for a
      short period when the BDI is not registering congestion (which it never
      will in this test setup).
      
            1 writeback_congestion_wait: usec_timeout=100000 usec_delayed=120000
            2 writeback_congestion_wait: usec_timeout=100000 usec_delayed=132000
            4 writeback_congestion_wait: usec_timeout=100000 usec_delayed=112000
          380 writeback_congestion_wait: usec_timeout=100000 usec_delayed=108000
          778 writeback_congestion_wait: usec_timeout=100000 usec_delayed=104000
      
      congestion_wait, if called, always exceeds the timeout as there is no
      trigger to wake it up.
      
      Bottom line: Vanilla will throttle but it's not effective.
      
      Patch series
      ------------
      
      Kswapd throttle activity was always due to scanning pages tagged for
      immediate reclaim at the tail of the LRU
      
            1 usec_timeout=100000 usect_delayed=72000 reason=VMSCAN_THROTTLE_WRITEBACK
            4 usec_timeout=100000 usect_delayed=20000 reason=VMSCAN_THROTTLE_WRITEBACK
            5 usec_timeout=100000 usect_delayed=12000 reason=VMSCAN_THROTTLE_WRITEBACK
            6 usec_timeout=100000 usect_delayed=16000 reason=VMSCAN_THROTTLE_WRITEBACK
           11 usec_timeout=100000 usect_delayed=100000 reason=VMSCAN_THROTTLE_WRITEBACK
           11 usec_timeout=100000 usect_delayed=8000 reason=VMSCAN_THROTTLE_WRITEBACK
           94 usec_timeout=100000 usect_delayed=0 reason=VMSCAN_THROTTLE_WRITEBACK
          112 usec_timeout=100000 usect_delayed=4000 reason=VMSCAN_THROTTLE_WRITEBACK
      
      The majority of events did not stall or stalled for a short period.
      Roughly 16% of stalls reached the timeout before expiry.  For direct
      reclaim, the number of times stalled for each reason was
      
         6624 reason=VMSCAN_THROTTLE_ISOLATED
        93246 reason=VMSCAN_THROTTLE_NOPROGRESS
        96934 reason=VMSCAN_THROTTLE_WRITEBACK
      
      The most common reason to stall was excessive pages tagged for
      immediate reclaim at the tail of the LRU, followed by a failure to make
      forward progress.  A relatively small number were due to too many pages
      isolated from the LRU by parallel threads.
      
      For VMSCAN_THROTTLE_ISOLATED, the breakdown of delays was
      
            9 usec_timeout=20000 usect_delayed=4000 reason=VMSCAN_THROTTLE_ISOLATED
           12 usec_timeout=20000 usect_delayed=16000 reason=VMSCAN_THROTTLE_ISOLATED
           83 usec_timeout=20000 usect_delayed=20000 reason=VMSCAN_THROTTLE_ISOLATED
         6520 usec_timeout=20000 usect_delayed=0 reason=VMSCAN_THROTTLE_ISOLATED
      
      Most did not stall at all.  A small number reached the timeout.
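
      Histograms like the one above can be summarised mechanically.  The
      following is a sketch, not the tooling used for this report: the
      regular expression and function name are assumptions, and the
      input is the VMSCAN_THROTTLE_ISOLATED breakdown copied from above.

```python
import re

# One histogram line: "<count> usec_timeout=<t> usect_delayed=<d> reason=<name>"
LINE = re.compile(
    r"\s*(\d+)\s+usec_timeout=(\d+)\s+usect_delayed=(\d+)\s+reason=(\w+)")


def summarise(text):
    """Return (total events, events with no delay, events that hit the timeout)."""
    total = no_stall = timed_out = 0
    for line in text.splitlines():
        m = LINE.match(line)
        if not m:
            continue
        count, timeout, delayed = int(m[1]), int(m[2]), int(m[3])
        total += count
        if delayed == 0:
            no_stall += count
        elif delayed >= timeout:
            timed_out += count
    return total, no_stall, timed_out


ISOLATED = """
      9 usec_timeout=20000 usect_delayed=4000 reason=VMSCAN_THROTTLE_ISOLATED
     12 usec_timeout=20000 usect_delayed=16000 reason=VMSCAN_THROTTLE_ISOLATED
     83 usec_timeout=20000 usect_delayed=20000 reason=VMSCAN_THROTTLE_ISOLATED
   6520 usec_timeout=20000 usect_delayed=0 reason=VMSCAN_THROTTLE_ISOLATED
"""

if __name__ == "__main__":
    total, no_stall, timed_out = summarise(ISOLATED)
    print(total, f"{100 * no_stall / total:.1f}%", f"{100 * timed_out / total:.1f}%")
```

      The total of 6624 events agrees with the per-reason count listed
      earlier for VMSCAN_THROTTLE_ISOLATED.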
      
      For VMSCAN_THROTTLE_NOPROGRESS, the breakdown of stalls was all over
      the map
      
            1 usec_timeout=500000 usect_delayed=324000 reason=VMSCAN_THROTTLE_NOPROGRESS
            1 usec_timeout=500000 usect_delayed=332000 reason=VMSCAN_THROTTLE_NOPROGRESS
            1 usec_timeout=500000 usect_delayed=348000 reason=VMSCAN_THROTTLE_NOPROGRESS
            1 usec_timeout=500000 usect_delayed=360000 reason=VMSCAN_THROTTLE_NOPROGRESS
            2 usec_timeout=500000 usect_delayed=228000 reason=VMSCAN_THROTTLE_NOPROGRESS
            2 usec_timeout=500000 usect_delayed=260000 reason=VMSCAN_THROTTLE_NOPROGRESS
            2 usec_timeout=500000 usect_delayed=340000 reason=VMSCAN_THROTTLE_NOPROGRESS
            2 usec_timeout=500000 usect_delayed=364000 reason=VMSCAN_THROTTLE_NOPROGRESS
            2 usec_timeout=500000 usect_delayed=372000 reason=VMSCAN_THROTTLE_NOPROGRESS
            2 usec_timeout=500000 usect_delayed=428000 reason=VMSCAN_THROTTLE_NOPROGRESS
            2 usec_timeout=500000 usect_delayed=460000 reason=VMSCAN_THROTTLE_NOPROGRESS
            2 usec_timeout=500000 usect_delayed=464000 reason=VMSCAN_THROTTLE_NOPROGRESS
            3 usec_timeout=500000 usect_delayed=244000 reason=VMSCAN_THROTTLE_NOPROGRESS
            3 usec_timeout=500000 usect_delayed=252000 reason=VMSCAN_THROTTLE_NOPROGRESS
            3 usec_timeout=500000 usect_delayed=272000 reason=VMSCAN_THROTTLE_NOPROGRESS
            4 usec_timeout=500000 usect_delayed=188000 reason=VMSCAN_THROTTLE_NOPROGRESS
            4 usec_timeout=500000 usect_delayed=268000 reason=VMSCAN_THROTTLE_NOPROGRESS
            4 usec_timeout=500000 usect_delayed=328000 reason=VMSCAN_THROTTLE_NOPROGRESS
            4 usec_timeout=500000 usect_delayed=380000 reason=VMSCAN_THROTTLE_NOPROGRESS
            4 usec_timeout=500000 usect_delayed=392000 reason=VMSCAN_THROTTLE_NOPROGRESS
            4 usec_timeout=500000 usect_delayed=432000 reason=VMSCAN_THROTTLE_NOPROGRESS
            5 usec_timeout=500000 usect_delayed=204000 reason=VMSCAN_THROTTLE_NOPROGRESS
            5 usec_timeout=500000 usect_delayed=220000 reason=VMSCAN_THROTTLE_NOPROGRESS
            5 usec_timeout=500000 usect_delayed=412000 reason=VMSCAN_THROTTLE_NOPROGRESS
            5 usec_timeout=500000 usect_delayed=436000 reason=VMSCAN_THROTTLE_NOPROGRESS
            6 usec_timeout=500000 usect_delayed=488000 reason=VMSCAN_THROTTLE_NOPROGRESS
            7 usec_timeout=500000 usect_delayed=212000 reason=VMSCAN_THROTTLE_NOPROGRESS
            7 usec_timeout=500000 usect_delayed=300000 reason=VMSCAN_THROTTLE_NOPROGRESS
            7 usec_timeout=500000 usect_delayed=316000 reason=VMSCAN_THROTTLE_NOPROGRESS
            7 usec_timeout=500000 usect_delayed=472000 reason=VMSCAN_THROTTLE_NOPROGRESS
            8 usec_timeout=500000 usect_delayed=248000 reason=VMSCAN_THROTTLE_NOPROGRESS
            8 usec_timeout=500000 usect_delayed=356000 reason=VMSCAN_THROTTLE_NOPROGRESS
            8 usec_timeout=500000 usect_delayed=456000 reason=VMSCAN_THROTTLE_NOPROGRESS
            9 usec_timeout=500000 usect_delayed=124000 reason=VMSCAN_THROTTLE_NOPROGRESS
            9 usec_timeout=500000 usect_delayed=376000 reason=VMSCAN_THROTTLE_NOPROGRESS
            9 usec_timeout=500000 usect_delayed=484000 reason=VMSCAN_THROTTLE_NOPROGRESS
           10 usec_timeout=500000 usect_delayed=172000 reason=VMSCAN_THROTTLE_NOPROGRESS
           10 usec_timeout=500000 usect_delayed=420000 reason=VMSCAN_THROTTLE_NOPROGRESS
           10 usec_timeout=500000 usect_delayed=452000 reason=VMSCAN_THROTTLE_NOPROGRESS
           11 usec_timeout=500000 usect_delayed=256000 reason=VMSCAN_THROTTLE_NOPROGRESS
           12 usec_timeout=500000 usect_delayed=112000 reason=VMSCAN_THROTTLE_NOPROGRESS
           12 usec_timeout=500000 usect_delayed=116000 reason=VMSCAN_THROTTLE_NOPROGRESS
           12 usec_timeout=500000 usect_delayed=144000 reason=VMSCAN_THROTTLE_NOPROGRESS
           12 usec_timeout=500000 usect_delayed=152000 reason=VMSCAN_THROTTLE_NOPROGRESS
           12 usec_timeout=500000 usect_delayed=264000 reason=VMSCAN_THROTTLE_NOPROGRESS
           12 usec_timeout=500000 usect_delayed=384000 reason=VMSCAN_THROTTLE_NOPROGRESS
           12 usec_timeout=500000 usect_delayed=424000 reason=VMSCAN_THROTTLE_NOPROGRESS
           12 usec_timeout=500000 usect_delayed=492000 reason=VMSCAN_THROTTLE_NOPROGRESS
           13 usec_timeout=500000 usect_delayed=184000 reason=VMSCAN_THROTTLE_NOPROGRESS
           13 usec_timeout=500000 usect_delayed=444000 reason=VMSCAN_THROTTLE_NOPROGRESS
           14 usec_timeout=500000 usect_delayed=308000 reason=VMSCAN_THROTTLE_NOPROGRESS
           14 usec_timeout=500000 usect_delayed=440000 reason=VMSCAN_THROTTLE_NOPROGRESS
           14 usec_timeout=500000 usect_delayed=476000 reason=VMSCAN_THROTTLE_NOPROGRESS
           16 usec_timeout=500000 usect_delayed=140000 reason=VMSCAN_THROTTLE_NOPROGRESS
           17 usec_timeout=500000 usect_delayed=232000 reason=VMSCAN_THROTTLE_NOPROGRESS
           17 usec_timeout=500000 usect_delayed=240000 reason=VMSCAN_THROTTLE_NOPROGRESS
           17 usec_timeout=500000 usect_delayed=280000 reason=VMSCAN_THROTTLE_NOPROGRESS
           18 usec_timeout=500000 usect_delayed=404000 reason=VMSCAN_THROTTLE_NOPROGRESS
           20 usec_timeout=500000 usect_delayed=148000 reason=VMSCAN_THROTTLE_NOPROGRESS
           20 usec_timeout=500000 usect_delayed=216000 reason=VMSCAN_THROTTLE_NOPROGRESS
           20 usec_timeout=500000 usect_delayed=468000 reason=VMSCAN_THROTTLE_NOPROGRESS
           21 usec_timeout=500000 usect_delayed=448000 reason=VMSCAN_THROTTLE_NOPROGRESS
           23 usec_timeout=500000 usect_delayed=168000 reason=VMSCAN_THROTTLE_NOPROGRESS
           23 usec_timeout=500000 usect_delayed=296000 reason=VMSCAN_THROTTLE_NOPROGRESS
           25 usec_timeout=500000 usect_delayed=132000 reason=VMSCAN_THROTTLE_NOPROGRESS
           25 usec_timeout=500000 usect_delayed=352000 reason=VMSCAN_THROTTLE_NOPROGRESS
           26 usec_timeout=500000 usect_delayed=180000 reason=VMSCAN_THROTTLE_NOPROGRESS
           27 usec_timeout=500000 usect_delayed=284000 reason=VMSCAN_THROTTLE_NOPROGRESS
           28 usec_timeout=500000 usect_delayed=164000 reason=VMSCAN_THROTTLE_NOPROGRESS
           29 usec_timeout=500000 usect_delayed=136000 reason=VMSCAN_THROTTLE_NOPROGRESS
           30 usec_timeout=500000 usect_delayed=200000 reason=VMSCAN_THROTTLE_NOPROGRESS
           30 usec_timeout=500000 usect_delayed=400000 reason=VMSCAN_THROTTLE_NOPROGRESS
           31 usec_timeout=500000 usect_delayed=196000 reason=VMSCAN_THROTTLE_NOPROGRESS
           32 usec_timeout=500000 usect_delayed=156000 reason=VMSCAN_THROTTLE_NOPROGRESS
           33 usec_timeout=500000 usect_delayed=224000 reason=VMSCAN_THROTTLE_NOPROGRESS
           35 usec_timeout=500000 usect_delayed=128000 reason=VMSCAN_THROTTLE_NOPROGRESS
           35 usec_timeout=500000 usect_delayed=176000 reason=VMSCAN_THROTTLE_NOPROGRESS
           36 usec_timeout=500000 usect_delayed=368000 reason=VMSCAN_THROTTLE_NOPROGRESS
           36 usec_timeout=500000 usect_delayed=496000 reason=VMSCAN_THROTTLE_NOPROGRESS
           37 usec_timeout=500000 usect_delayed=312000 reason=VMSCAN_THROTTLE_NOPROGRESS
           38 usec_timeout=500000 usect_delayed=304000 reason=VMSCAN_THROTTLE_NOPROGRESS
           40 usec_timeout=500000 usect_delayed=288000 reason=VMSCAN_THROTTLE_NOPROGRESS
           43 usec_timeout=500000 usect_delayed=408000 reason=VMSCAN_THROTTLE_NOPROGRESS
           55 usec_timeout=500000 usect_delayed=416000 reason=VMSCAN_THROTTLE_NOPROGRESS
           56 usec_timeout=500000 usect_delayed=76000 reason=VMSCAN_THROTTLE_NOPROGRESS
           58 usec_timeout=500000 usect_delayed=120000 reason=VMSCAN_THROTTLE_NOPROGRESS
           59 usec_timeout=500000 usect_delayed=208000 reason=VMSCAN_THROTTLE_NOPROGRESS
           61 usec_timeout=500000 usect_delayed=68000 reason=VMSCAN_THROTTLE_NOPROGRESS
           71 usec_timeout=500000 usect_delayed=192000 reason=VMSCAN_THROTTLE_NOPROGRESS
           71 usec_timeout=500000 usect_delayed=480000 reason=VMSCAN_THROTTLE_NOPROGRESS
           79 usec_timeout=500000 usect_delayed=60000 reason=VMSCAN_THROTTLE_NOPROGRESS
           82 usec_timeout=500000 usect_delayed=320000 reason=VMSCAN_THROTTLE_NOPROGRESS
           82 usec_timeout=500000 usect_delayed=92000 reason=VMSCAN_THROTTLE_NOPROGRESS
           85 usec_timeout=500000 usect_delayed=64000 reason=VMSCAN_THROTTLE_NOPROGRESS
           85 usec_timeout=500000 usect_delayed=80000 reason=VMSCAN_THROTTLE_NOPROGRESS
           88 usec_timeout=500000 usect_delayed=84000 reason=VMSCAN_THROTTLE_NOPROGRESS
           90 usec_timeout=500000 usect_delayed=160000 reason=VMSCAN_THROTTLE_NOPROGRESS
           90 usec_timeout=500000 usect_delayed=292000 reason=VMSCAN_THROTTLE_NOPROGRESS
           94 usec_timeout=500000 usect_delayed=56000 reason=VMSCAN_THROTTLE_NOPROGRESS
          118 usec_timeout=500000 usect_delayed=88000 reason=VMSCAN_THROTTLE_NOPROGRESS
          119 usec_timeout=500000 usect_delayed=72000 reason=VMSCAN_THROTTLE_NOPROGRESS
          126 usec_timeout=500000 usect_delayed=108000 reason=VMSCAN_THROTTLE_NOPROGRESS
          146 usec_timeout=500000 usect_delayed=52000 reason=VMSCAN_THROTTLE_NOPROGRESS
          148 usec_timeout=500000 usect_delayed=36000 reason=VMSCAN_THROTTLE_NOPROGRESS
          148 usec_timeout=500000 usect_delayed=48000 reason=VMSCAN_THROTTLE_NOPROGRESS
          159 usec_timeout=500000 usect_delayed=28000 reason=VMSCAN_THROTTLE_NOPROGRESS
          178 usec_timeout=500000 usect_delayed=44000 reason=VMSCAN_THROTTLE_NOPROGRESS
          183 usec_timeout=500000 usect_delayed=40000 reason=VMSCAN_THROTTLE_NOPROGRESS
          237 usec_timeout=500000 usect_delayed=100000 reason=VMSCAN_THROTTLE_NOPROGRESS
          266 usec_timeout=500000 usect_delayed=32000 reason=VMSCAN_THROTTLE_NOPROGRESS
          313 usec_timeout=500000 usect_delayed=24000 reason=VMSCAN_THROTTLE_NOPROGRESS
          347 usec_timeout=500000 usect_delayed=96000 reason=VMSCAN_THROTTLE_NOPROGRESS
          470 usec_timeout=500000 usect_delayed=20000 reason=VMSCAN_THROTTLE_NOPROGRESS
          559 usec_timeout=500000 usect_delayed=16000 reason=VMSCAN_THROTTLE_NOPROGRESS
          964 usec_timeout=500000 usect_delayed=12000 reason=VMSCAN_THROTTLE_NOPROGRESS
         2001 usec_timeout=500000 usect_delayed=104000 reason=VMSCAN_THROTTLE_NOPROGRESS
         2447 usec_timeout=500000 usect_delayed=8000 reason=VMSCAN_THROTTLE_NOPROGRESS
         7888 usec_timeout=500000 usect_delayed=4000 reason=VMSCAN_THROTTLE_NOPROGRESS
        22727 usec_timeout=500000 usect_delayed=0 reason=VMSCAN_THROTTLE_NOPROGRESS
        51305 usec_timeout=500000 usect_delayed=500000 reason=VMSCAN_THROTTLE_NOPROGRESS
      
      The full timeout is often hit but a large number also do not stall at
      all.  The remainder slept a little allowing other reclaim tasks to make
      progress.
      
      While this timeout could be further increased, it could also negatively
      impact worst-case behaviour when there is no prioritisation of what task
      should make progress.
      
      For VMSCAN_THROTTLE_WRITEBACK, the breakdown was
      
            1 usec_timeout=100000 usect_delayed=44000 reason=VMSCAN_THROTTLE_WRITEBACK
            2 usec_timeout=100000 usect_delayed=76000 reason=VMSCAN_THROTTLE_WRITEBACK
            3 usec_timeout=100000 usect_delayed=80000 reason=VMSCAN_THROTTLE_WRITEBACK
            5 usec_timeout=100000 usect_delayed=48000 reason=VMSCAN_THROTTLE_WRITEBACK
            5 usec_timeout=100000 usect_delayed=84000 reason=VMSCAN_THROTTLE_WRITEBACK
            6 usec_timeout=100000 usect_delayed=72000 reason=VMSCAN_THROTTLE_WRITEBACK
            7 usec_timeout=100000 usect_delayed=88000 reason=VMSCAN_THROTTLE_WRITEBACK
           11 usec_timeout=100000 usect_delayed=56000 reason=VMSCAN_THROTTLE_WRITEBACK
           12 usec_timeout=100000 usect_delayed=64000 reason=VMSCAN_THROTTLE_WRITEBACK
           16 usec_timeout=100000 usect_delayed=92000 reason=VMSCAN_THROTTLE_WRITEBACK
           24 usec_timeout=100000 usect_delayed=68000 reason=VMSCAN_THROTTLE_WRITEBACK
           28 usec_timeout=100000 usect_delayed=32000 reason=VMSCAN_THROTTLE_WRITEBACK
           30 usec_timeout=100000 usect_delayed=60000 reason=VMSCAN_THROTTLE_WRITEBACK
           30 usec_timeout=100000 usect_delayed=96000 reason=VMSCAN_THROTTLE_WRITEBACK
           32 usec_timeout=100000 usect_delayed=52000 reason=VMSCAN_THROTTLE_WRITEBACK
           42 usec_timeout=100000 usect_delayed=40000 reason=VMSCAN_THROTTLE_WRITEBACK
           77 usec_timeout=100000 usect_delayed=28000 reason=VMSCAN_THROTTLE_WRITEBACK
           99 usec_timeout=100000 usect_delayed=36000 reason=VMSCAN_THROTTLE_WRITEBACK
          137 usec_timeout=100000 usect_delayed=24000 reason=VMSCAN_THROTTLE_WRITEBACK
          190 usec_timeout=100000 usect_delayed=20000 reason=VMSCAN_THROTTLE_WRITEBACK
          339 usec_timeout=100000 usect_delayed=16000 reason=VMSCAN_THROTTLE_WRITEBACK
          518 usec_timeout=100000 usect_delayed=12000 reason=VMSCAN_THROTTLE_WRITEBACK
          852 usec_timeout=100000 usect_delayed=8000 reason=VMSCAN_THROTTLE_WRITEBACK
         3359 usec_timeout=100000 usect_delayed=4000 reason=VMSCAN_THROTTLE_WRITEBACK
         7147 usec_timeout=100000 usect_delayed=0 reason=VMSCAN_THROTTLE_WRITEBACK
        83962 usec_timeout=100000 usect_delayed=100000 reason=VMSCAN_THROTTLE_WRITEBACK
      
      The majority hit the timeout in direct reclaim context although a
      sizable number did not stall at all.  This is very different to kswapd
      where only a tiny percentage of stalls due to writeback reached the
      timeout.
      
      Bottom line, the throttling appears to work and the wakeup events may
      limit worst case stalls.  There might be some grounds for adjusting
      timeouts but it's likely futile as the worst-case scenarios depend on
      the workload, memory size and the speed of the storage.  A better
      approach to improve the series further would be to prioritise tasks
      based on their rate of allocation with the caveat that it may be very
      expensive to track.
      
      This patch (of 5):
      
      Page reclaim throttles on wait_iff_congested under the following
      conditions:
      
       - kswapd is encountering pages under writeback and marked for immediate
         reclaim implying that pages are cycling through the LRU faster than
         pages can be cleaned.
      
       - Direct reclaim will stall if all dirty pages are backed by congested
         inodes.
      
      wait_iff_congested is almost completely broken with few exceptions.
      This patch adds a new node-based workqueue and tracks the number of
      throttled tasks and pages written back since throttling started.  If
      enough pages belonging to the node are written back then the throttled
      tasks will wake early.  If not, the throttled tasks sleep until the
      timeout expires.
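
As a rough illustration of the accounting described above, here is a minimal single-threaded userspace model in C.  The struct, the function names and the wake threshold are all hypothetical and are not the kernel's; real code would sleep on a per-node wait queue with a timeout rather than return a flag to the caller.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical per-node throttle state; not the kernel's data structure. */
struct node_throttle {
	int nr_throttled;             /* tasks currently throttled on this node */
	unsigned long nr_written;     /* pages written back since throttling started */
	unsigned long wake_threshold; /* assumed page count that triggers an early wake */
};

/* A task is about to sleep; the first sleeper restarts the writeback count. */
static void throttle_begin(struct node_throttle *nt)
{
	if (nt->nr_throttled++ == 0)
		nt->nr_written = 0;
}

/* A task stopped sleeping, either woken early or because its timeout expired. */
static void throttle_end(struct node_throttle *nt)
{
	assert(nt->nr_throttled > 0);
	nt->nr_throttled--;
}

/* Writeback completed for nr_pages; returns true if sleepers should wake early. */
static bool account_writeback(struct node_throttle *nt, unsigned long nr_pages)
{
	if (nt->nr_throttled == 0)
		return false; /* nobody is throttled, so nothing needs tracking */
	nt->nr_written += nr_pages;
	return nt->nr_written >= nt->wake_threshold;
}
```

In this model a caller would invoke throttle_begin() before sleeping and throttle_end() after waking, while the writeback completion path calls account_writeback() and wakes the sleepers when it returns true.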
      
      [neilb@suse.de: Uninterruptible sleep and simpler wakeups]
      [hdanton@sina.com: Avoid race when reclaim starts]
      [vbabka@suse.cz: vmstat irq-safe api, clarifications]
      
      Link: https://lore.kernel.org/linux-mm/45d8b7a6-8548-65f5-cccf-9f451d4ae3d4@kernel.dk/ [1]
      Link: https://lkml.kernel.org/r/20211022144651.19914-1-mgorman@techsingularity.net
      Link: https://lkml.kernel.org/r/20211022144651.19914-2-mgorman@techsingularity.net
      Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: NeilBrown <neilb@suse.de>
      Cc: "Theodore Ts'o" <tytso@mit.edu>
      Cc: Andreas Dilger <adilger.kernel@dilger.ca>
      Cc: "Darrick J . Wong" <djwong@kernel.org>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Rik van Riel <riel@surriel.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      8cd7c588
    • Y
      mm: filemap: coding style cleanup for filemap_map_pmd() · e0f43fa5
      Yang Shi authored
      Patch series "Solve silent data loss caused by poisoned page cache (shmem/tmpfs)", v5.
      
      When discussing the patch that splits page cache THP in order to offline
      the poisoned page, Naoya mentioned there is a bigger problem [1] that
      prevents this from working since the page cache page will be truncated
      if uncorrectable errors happen.  Looking into this more deeply, it turns
      out that this approach (truncating the poisoned page) may incur silent
      data loss for all non-readonly filesystems if the page is dirty.  It may
      be worse for in-memory filesystems, e.g. shmem/tmpfs, since the data
      blocks are actually gone.
      
      To solve this problem we could keep the poisoned dirty page in the page
      cache and then notify users on any later access, e.g. page fault,
      read/write, etc.  Clean pages can be truncated as is since they can
      be reread from disk later on.
      
      The consequence is that filesystems may find a poisoned page and
      manipulate it as a healthy page, since filesystems do not check whether
      the page is poisoned in any of the relevant paths except page fault.  In
      general, we need to make filesystems aware of poisoned pages before
      we can keep the poisoned page in the page cache to solve the data
      loss problem.
      
      To make filesystems aware of poisoned pages we should consider:
      
       - The page should not be written back: clearing the dirty flag
         prevents writeback.
      
       - The page should not be dropped (it shows as a clean page) by drop
         caches or other callers: the refcount pin from hwpoison prevents
         invalidation (called by cache drop, inode cache shrinking, etc),
         but it doesn't avoid invalidation in the DIO path.
      
       - The page should be able to get truncated/hole punched/unlinked: it
         works as it is.
      
       - Notify users when the page is accessed, e.g. read/write, page fault
         and other paths (compression, encryption, etc).
      
      The scope of the last one is huge since almost all filesystems need to
      do it once a page is returned from a page cache lookup.  There are a
      couple of options:
      
       1. Check the hwpoison flag in every path, the most straightforward way.
      
       2. Return NULL for a poisoned page from page cache lookup.  Most
          callsites already check whether NULL is returned, so this should be
          the least work I think.  But the error handling in filesystems just
          returns -ENOMEM, and that error code will obviously confuse users.
      
       3. To improve #2, we could return an error pointer, e.g. ERR_PTR(-EIO),
          but this will involve a significant amount of code change as well
          since all the paths need to check whether the pointer is an ERR or
          not, just like option #1.
      
      I did prototypes for both #1 and #3, but it seems #3 may require more
      changes than #1.  For #3 ERR_PTR will be returned so all the callers
      need to check the return value otherwise invalid pointer may be
      dereferenced, but not all callers really care about the content of the
      page, for example, partial truncate which just sets the truncated range
      in one page to 0.  So for such paths it needs additional modification if
      ERR_PTR is returned.  And if the callers have their own way to handle
      the problematic pages we need to add a new FGP flag to tell FGP
      functions to return the pointer to the page.
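
To make the difference between options #1 and #3 concrete, here is a small userspace sketch in C.  Everything in it is hypothetical: struct mock_page, the mock_* helpers (stand-ins for the kernel's ERR_PTR()/IS_ERR()/PTR_ERR()) and the read_byte_* callers are illustrations only and are not the code in this series.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a page cache page. */
struct mock_page {
	bool hwpoison;
	char data[8];
};

/* Userspace imitations of the kernel's error-pointer helpers. */
#define MOCK_MAX_ERRNO 4095
static inline void *mock_err_ptr(long err) { return (void *)(intptr_t)err; }
static inline long mock_ptr_err(const void *p) { return (long)(intptr_t)p; }
static inline bool mock_is_err(const void *p)
{
	return (uintptr_t)p >= (uintptr_t)-MOCK_MAX_ERRNO;
}

/* Option #1: the lookup hands back the page and each caller checks the flag. */
static int read_byte_opt1(struct mock_page *page, size_t off, char *out)
{
	assert(off < sizeof(page->data));
	if (page->hwpoison)
		return -EIO;
	*out = page->data[off];
	return 0;
}

/* Option #3: the lookup itself returns an error pointer for a poisoned page. */
static struct mock_page *lookup_opt3(struct mock_page *page)
{
	return page->hwpoison ? mock_err_ptr(-EIO) : page;
}

/* Every caller must now test the pointer before dereferencing it. */
static int read_byte_opt3(struct mock_page *page, size_t off, char *out)
{
	struct mock_page *p = lookup_opt3(page);

	if (mock_is_err(p))
		return (int)mock_ptr_err(p);
	assert(off < sizeof(p->data));
	*out = p->data[off];
	return 0;
}
```

With option #3 a caller that only wants the page itself, such as a partial truncate, can no longer get it from the lookup, which is why the text above mentions needing an extra FGP flag.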
      
      It may happen very rarely, but once it happens the consequence (data
      corruption) could be very bad and it is very hard to debug.  This
      problem seems to have been briefly discussed before, but no action
      appears to have been taken at that time.  [2]
      
      As the investigation above shows, it needs a huge amount of work to
      solve the potential data loss for all filesystems.  But it is much
      easier for in-memory filesystems, and such filesystems actually suffer
      more than others since even the data blocks are gone due to truncating.
      So this patchset starts from shmem/tmpfs by taking option #1.
      
      TODO:
      * The unpoison has been broken since commit 0ed950d1 ("mm,hwpoison: make
        get_hwpoison_page() call get_any_page()"), and this patch series makes
        the refcount check for unpoisoning a shmem page fail.
      * Expand to other filesystems.  But I haven't heard feedback from filesystem
        developers yet.
      
      Patch breakdown:
      Patch #1: cleanup, required by patch #2
      Patch #2: fix THP with hwpoisoned subpage(s) PMD map bug
      Patch #3: coding style cleanup
      Patch #4: refactor and preparation.
      Patch #5: keep the poisoned page in page cache and handle such case for all
                the paths.
      Patch #6: the previous patches unblock page cache THP split, so this patch
                adds page cache THP split support.
      
      This patch (of 4):
      
      A minor cleanup to the indent.
      
      Link: https://lkml.kernel.org/r/20211020210755.23964-1-shy828301@gmail.com
      Link: https://lkml.kernel.org/r/20211020210755.23964-4-shy828301@gmail.com
      Signed-off-by: Yang Shi <shy828301@gmail.com>
      Reviewed-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: Peter Xu <peterx@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      e0f43fa5
    • Q
      mm: introduce pmd_install() helper · 03c4f204
      Qi Zheng authored
      Patch series "Do some code cleanups related to mm", v3.
      
      This patch (of 2):
      
      Currently the same few lines are repeated three times in the code.
      Deduplicate them with a newly introduced pmd_install() helper.
      
      Link: https://lkml.kernel.org/r/20210901102722.47686-1-zhengqi.arch@bytedance.com
      Link: https://lkml.kernel.org/r/20210901102722.47686-2-zhengqi.arch@bytedance.com
      Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
      Reviewed-by: David Hildenbrand <david@redhat.com>
      Reviewed-by: Muchun Song <songmuchun@bytedance.com>
      Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: Mika Penttila <mika.penttila@nextfour.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      03c4f204
    • J
      mm: move more expensive part of XA setup out of mapping check · f8ee8909
      Jens Axboe authored
      The fast path here is not needing any writeback, yet we spend time
      setting up the xarray lookup data upfront.  Move the part that actually
      needs to iterate the address space mapping into a separate helper,
      saving ~30% of the time here.
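
The general shape of this change can be sketched in userspace C as follows.  This is only a model under stated assumptions: struct mock_mapping, its nr_flagged summary counter and both functions are hypothetical, and a linear array scan stands in for the xarray iteration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for an address space mapping. */
struct mock_mapping {
	size_t nr_pages;
	const bool *needs_wb;    /* per page: dirty or under writeback */
	size_t nr_flagged;       /* cheap summary of how many pages are flagged */
	unsigned int slow_calls; /* counts how often the slow helper ran */
};

/* Slow path: walks the range.  All iteration setup cost lives here. */
static bool range_scan(struct mock_mapping *m, size_t start, size_t end)
{
	m->slow_calls++;
	for (size_t i = start; i <= end && i < m->nr_pages; i++)
		if (m->needs_wb[i])
			return true;
	return false;
}

/* Fast path: answer from the cheap summary before paying for any setup. */
static bool range_needs_writeback(struct mock_mapping *m, size_t start, size_t end)
{
	assert(start <= end);
	if (m->nr_flagged == 0)
		return false;
	return range_scan(m, start, end);
}
```

Keeping the iteration state inside the helper means the common "nothing to write back" case never touches it.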
      
      Link: https://lkml.kernel.org/r/49f67983-b802-8929-edab-d807f745c9ca@kernel.dk
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      Cc: Matthew Wilcox <willy@infradead.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      f8ee8909
    • M
      mm/filemap.c: remove bogus VM_BUG_ON · d417b49f
      Matthew Wilcox (Oracle) authored
      It is not safe to check page->index without holding the page lock.  It
      can be changed if the page is moved between the swap cache and the page
      cache for a shmem file, for example.  There is a VM_BUG_ON below which
      checks page->index is correct after taking the page lock.
      
      Link: https://lkml.kernel.org/r/20210818144932.940640-1-willy@infradead.org
      Fixes: 5c211ba2 ("mm: add and use find_lock_entries")
      Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
      Reported-by: <syzbot+c87be4f669d920c76330@syzkaller.appspotmail.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      d417b49f
    • J
      mm: don't read i_size of inode unless we need it · 61d0017e
      Jens Axboe authored
      We always go through i_size_read(), and we rarely end up needing it.
      Push the read down to where we need to check it, which avoids it for
      most cases.
      
      It looks like we can even remove this check entirely, which might be
      worth pursuing.  But at least this takes it out of the hot path.
      
      Link: https://lkml.kernel.org/r/6b67981f-57d4-c80e-bc07-6020aa601381@kernel.dk
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      Acked-by: Chris Mason <clm@fb.com>
      Cc: Josef Bacik <josef@redhat.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Pavel Begunkov <asml.silence@gmail.com>
      Cc: Jan Kara <jack@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      61d0017e
    • D
      mm: stop filemap_read() from grabbing a superfluous page · 8c8387ee
      David Howells authored
      Under some circumstances, filemap_read() will allocate sufficient pages
      to read to the end of the file, call readahead/readpages on them and
      copy the data over - and then it will allocate another page at the EOF
      and call readpage on that and then ignore it.  This is unnecessary and a
      waste of time and resources.
      
      filemap_read() *does* check for this, but only after it has already done
      the allocation and I/O.  Fix this by checking before calling
      filemap_get_pages() also.
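
A minimal userspace sketch of the idea, assuming a toy in-memory file: the EOF test runs before any page is fetched, so a read that ends exactly at EOF never touches a page beyond it.  The names mock_file, mock_get_page() and mock_read(), and the 4-byte page size, are hypothetical and do not mirror filemap_read() itself.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MOCK_PAGE_SIZE 4 /* tiny page size to keep the example readable */

/* Hypothetical in-memory file that counts how many pages were fetched. */
struct mock_file {
	const char *data;
	size_t size;
	unsigned int pages_fetched;
};

/* Stand-in for the expensive step: allocating a page and reading it in. */
static const char *mock_get_page(struct mock_file *f, size_t index)
{
	f->pages_fetched++;
	return f->data + index * MOCK_PAGE_SIZE;
}

/* Copies up to len bytes starting at pos; returns the number of bytes copied. */
static size_t mock_read(struct mock_file *f, size_t pos, char *buf, size_t len)
{
	size_t done = 0;

	assert(buf != NULL);
	while (done < len) {
		if (pos >= f->size)
			break; /* EOF check before fetching any page */

		const char *page = mock_get_page(f, pos / MOCK_PAGE_SIZE);
		size_t off = pos % MOCK_PAGE_SIZE;
		size_t n = MOCK_PAGE_SIZE - off;

		if (n > f->size - pos)
			n = f->size - pos; /* do not copy past EOF */
		if (n > len - done)
			n = len - done;    /* do not overrun the caller's buffer */
		memcpy(buf + done, page + off, n);
		done += n;
		pos += n;
	}
	return done;
}
```

If the EOF test were only made after mock_get_page(), a request longer than the file would fetch one extra page and then discard it.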
      
      Link: https://lkml.kernel.org/r/163472463105.3126792.7056099385135786492.stgit@warthog.procyon.org.uk
      Link: https://lore.kernel.org/r/160588481358.3465195.16552616179674485179.stgit@warthog.procyon.org.uk/
      Link: https://lore.kernel.org/r/163456863216.2614702.6384850026368833133.stgit@warthog.procyon.org.uk/
      Signed-off-by: David Howells <dhowells@redhat.com>
      Acked-by: Jeff Layton <jlayton@kernel.org>
      Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Kent Overstreet <kent.overstreet@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      8c8387ee
  5. 18 Oct, 2021 10 commits
  6. 27 Sep, 2021 16 commits
  7. 04 Sep, 2021 1 commit
  8. 23 Aug, 2021 1 commit
    • J
      fs: add a filemap_fdatawrite_wbc helper · 5a798493
      Josef Bacik authored
      Btrfs sometimes needs to flush dirty pages on a bunch of dirty inodes in
      order to reclaim metadata reservations.  Unfortunately most helpers in
      this area are too smart for us:
      
      1) The normal filemap_fdata* helpers only take range and sync modes, and
         don't give any indication of how much was written, so we can only
         flush full inodes, which isn't what we want in most cases.
      2) The normal writeback path requires us to have the s_umount sem held,
         but we can't unconditionally take it in this path because we could
         deadlock.
      3) The normal writeback path also skips inodes with I_SYNC set if we
         write with WB_SYNC_NONE.  This isn't the behavior we want under heavy
         ENOSPC pressure, we want to actually make sure the pages are under
         writeback before returning, and if another thread is in the middle of
         writing the file we may return before they're under writeback and
         miss our ordered extents and not properly wait for completion.
      4) sync_inode() uses the normal writeback path and has the same problem
         as #3.
      
      What we really want is to call do_writepages() with our wbc.  This way
      we can make sure that writeback is actually started on the pages, and we
      can control how many pages are written as a whole as we write many
      inodes using the same wbc.  Accomplish this with a new helper that does
      just that so we can use it for our ENOSPC flushing infrastructure.
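
The benefit of sharing one wbc across many inodes can be modelled in a few lines of userspace C.  This is a sketch under assumptions: struct mock_wbc, struct mock_inode and both functions are hypothetical stand-ins, with only the nr_to_write budget idea taken from the description above.

```c
#include <assert.h>

/* Hypothetical stand-in for a writeback control block. */
struct mock_wbc {
	long nr_to_write; /* remaining page budget, shared by all inodes */
};

/* Hypothetical inode with simple page counters. */
struct mock_inode {
	long nr_dirty;
	long nr_under_writeback;
};

/* Starts writeback on as many dirty pages as the shared budget allows. */
static long mock_writepages(struct mock_inode *inode, struct mock_wbc *wbc)
{
	long n = inode->nr_dirty;

	assert(wbc->nr_to_write >= 0);
	if (n > wbc->nr_to_write)
		n = wbc->nr_to_write;
	inode->nr_dirty -= n;
	inode->nr_under_writeback += n;
	wbc->nr_to_write -= n;
	return n;
}

/* Flushes inodes in order until the budget is used up; returns pages started. */
static long mock_flush_inodes(struct mock_inode *inodes, int count, long budget)
{
	struct mock_wbc wbc = { .nr_to_write = budget };

	for (int i = 0; i < count && wbc.nr_to_write > 0; i++)
		mock_writepages(&inodes[i], &wbc);
	return budget - wbc.nr_to_write;
}
```

Because the budget lives in the wbc rather than in each call, the caller learns how much was written in total and can stop partway through an inode instead of flushing it completely.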
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      5a798493
  9. 13 Jul, 2021 1 commit