1. 13 8月, 2020 1 次提交
    • J
      mm/vmscan: make active/inactive ratio as 1:1 for anon lru · ccc5dc67
      Joonsoo Kim 提交于
      Patch series "workingset protection/detection on the anonymous LRU list", v7.
      
      * PROBLEM
      In current implementation, newly created or swap-in anonymous page is
      started on the active list.  Growing the active list results in
      rebalancing active/inactive list so old pages on the active list are
      demoted to the inactive list.  Hence, hot page on the active list isn't
      protected at all.
      
      Following is an example of this situation.
      
      Assume that 50 hot pages on active list and system can contain total 100
      pages.  Numbers denote the number of pages on active/inactive list (active
      | inactive).  (h) stands for hot pages and (uo) stands for used-once
      pages.
      
      1. 50 hot pages on active list
      50(h) | 0
      
      2. workload: 50 newly created (used-once) pages
      50(uo) | 50(h)
      
      3. workload: another 50 newly created (used-once) pages
      50(uo) | 50(uo), swap-out 50(h)
      
      As we can see, hot pages are swapped-out and it would cause swap-in later.
      
      * SOLUTION
      Since this is what we want to avoid, this patchset implements workingset
      protection.  Like as the file LRU list, newly created or swap-in anonymous
      page is started on the inactive list.  Also, like as the file LRU list, if
      enough reference happens, the page will be promoted.  This simple
      modification changes the above example as following.
      
      1. 50 hot pages on active list
      50(h) | 0
      
      2. workload: 50 newly created (used-once) pages
      50(h) | 50(uo)
      
      3. workload: another 50 newly created (used-once) pages
      50(h) | 50(uo), swap-out 50(uo)
      
      hot pages remains in the active list. :)
      
      * EXPERIMENT
      I tested this scenario on my test bed and confirmed that this problem
      happens on current implementation. I also checked that it is fixed by
      this patchset.
      
      * SUBJECT
      workingset detection
      
      * PROBLEM
      Later part of the patchset implements the workingset detection for the
      anonymous LRU list.  There is a corner case that workingset protection
      could cause thrashing.  If we can avoid thrashing by workingset detection,
      we can get the better performance.
      
      Following is an example of thrashing due to the workingset protection.
      
      1. 50 hot pages on active list
      50(h) | 0
      
      2. workload: 50 newly created (will be hot) pages
      50(h) | 50(wh)
      
      3. workload: another 50 newly created (used-once) pages
      50(h) | 50(uo), swap-out 50(wh)
      
      4. workload: 50 (will be hot) pages
      50(h) | 50(wh), swap-in 50(wh)
      
      5. workload: another 50 newly created (used-once) pages
      50(h) | 50(uo), swap-out 50(wh)
      
      6. repeat 4, 5
      
      Without workingset detection, this kind of workload cannot be promoted and
      thrashing happens forever.
      
      * SOLUTION
      Therefore, this patchset implements workingset detection.  All the
      infrastructure for workingset detecion is already implemented, so there is
      not much work to do.  First, extend workingset detection code to deal with
      the anonymous LRU list.  Then, make swap cache handles the exceptional
      value for the shadow entry.  Lastly, install/retrieve the shadow value
      into/from the swap cache and check the refault distance.
      
      * EXPERIMENT
      I made a test program to imitates above scenario and confirmed that
      problem exists.  Then, I checked that this patchset fixes it.
      
      My test setup is a virtual machine with 8 cpus and 6100MB memory.  But,
      the amount of the memory that the test program can use is about 280 MB.
      This is because the system uses large ram-backed swap and large ramdisk to
      capture the trace.
      
      Test scenario is like as below.
      
      1. allocate cold memory (512MB)
      2. allocate hot-1 memory (96MB)
      3. activate hot-1 memory (96MB)
      4. allocate another hot-2 memory (96MB)
      5. access cold memory (128MB)
      6. access hot-2 memory (96MB)
      7. repeat 5, 6
      
      Since hot-1 memory (96MB) is on the active list, the inactive list can
      contains roughly 190MB pages.  hot-2 memory's re-access interval (96+128
      MB) is more 190MB, so it cannot be promoted without workingset detection
      and swap-in/out happens repeatedly.  With this patchset, workingset
      detection works and promotion happens.  Therefore, swap-in/out occurs
      less.
      
      Here is the result. (average of 5 runs)
      
      type swap-in swap-out
      base 863240 989945
      patch 681565 809273
      
      As we can see, patched kernel do less swap-in/out.
      
      * OVERALL TEST (ebizzy using modified random function)
      ebizzy is the test program that main thread allocates lots of memory and
      child threads access them randomly during the given times.  Swap-in will
      happen if allocated memory is larger than the system memory.
      
      The random function that represents the zipf distribution is used to make
      hot/cold memory.  Hot/cold ratio is controlled by the parameter.  If the
      parameter is high, hot memory is accessed much larger than cold one.  If
      the parameter is low, the number of access on each memory would be
      similar.  I uses various parameters in order to show the effect of
      patchset on various hot/cold ratio workload.
      
      My test setup is a virtual machine with 8 cpus, 1024 MB memory and 5120 MB
      ram swap.
      
      Result format is as following.
      
      param: 1-1024-0.1
      - 1 (number of thread)
      - 1024 (allocated memory size, MB)
      - 0.1 (zipf distribution alpha,
      0.1 works like as roughly uniform random,
      1.3 works like as small portion of memory is hot and the others are cold)
      
      pswpin: smaller is better
      std: standard deviation
      improvement: negative is better
      
      * single thread
                 param        pswpin       std       improvement
            base 1-1024.0-0.1 14101983.40   79441.19
            prot 1-1024.0-0.1 14065875.80  136413.01  (   -0.26 )
          detect 1-1024.0-0.1 13910435.60  100804.82  (   -1.36 )
            base 1-1024.0-0.7 7998368.80   43469.32
            prot 1-1024.0-0.7 7622245.80   88318.74  (   -4.70 )
          detect 1-1024.0-0.7 7618515.20   59742.07  (   -4.75 )
            base 1-1024.0-1.3 1017400.80   38756.30
            prot 1-1024.0-1.3  940464.60   29310.69  (   -7.56 )
          detect 1-1024.0-1.3  945511.40   24579.52  (   -7.07 )
            base 1-1280.0-0.1 22895541.40   50016.08
            prot 1-1280.0-0.1 22860305.40   51952.37  (   -0.15 )
          detect 1-1280.0-0.1 22705565.20   93380.35  (   -0.83 )
            base 1-1280.0-0.7 13717645.60   46250.65
            prot 1-1280.0-0.7 12935355.80   64754.43  (   -5.70 )
          detect 1-1280.0-0.7 13040232.00   63304.00  (   -4.94 )
            base 1-1280.0-1.3 1654251.40    4159.68
            prot 1-1280.0-1.3 1522680.60   33673.50  (   -7.95 )
          detect 1-1280.0-1.3 1599207.00   70327.89  (   -3.33 )
            base 1-1536.0-0.1 31621775.40   31156.28
            prot 1-1536.0-0.1 31540355.20   62241.36  (   -0.26 )
          detect 1-1536.0-0.1 31420056.00  123831.27  (   -0.64 )
            base 1-1536.0-0.7 19620760.60   60937.60
            prot 1-1536.0-0.7 18337839.60   56102.58  (   -6.54 )
          detect 1-1536.0-0.7 18599128.00   75289.48  (   -5.21 )
            base 1-1536.0-1.3 2378142.40   20994.43
            prot 1-1536.0-1.3 2166260.60   48455.46  (   -8.91 )
          detect 1-1536.0-1.3 2183762.20   16883.24  (   -8.17 )
            base 1-1792.0-0.1 40259714.80   90750.70
            prot 1-1792.0-0.1 40053917.20   64509.47  (   -0.51 )
          detect 1-1792.0-0.1 39949736.40  104989.64  (   -0.77 )
            base 1-1792.0-0.7 25704884.40   69429.68
            prot 1-1792.0-0.7 23937389.00   79945.60  (   -6.88 )
          detect 1-1792.0-0.7 24271902.00   35044.30  (   -5.57 )
            base 1-1792.0-1.3 3129497.00   32731.86
            prot 1-1792.0-1.3 2796994.40   19017.26  (  -10.62 )
          detect 1-1792.0-1.3 2886840.40   33938.82  (   -7.75 )
            base 1-2048.0-0.1 48746924.40   50863.88
            prot 1-2048.0-0.1 48631954.40   24537.30  (   -0.24 )
          detect 1-2048.0-0.1 48509419.80   27085.34  (   -0.49 )
            base 1-2048.0-0.7 32046424.40   78624.22
            prot 1-2048.0-0.7 29764182.20   86002.26  (   -7.12 )
          detect 1-2048.0-0.7 30250315.80  101282.14  (   -5.60 )
            base 1-2048.0-1.3 3916723.60   24048.55
            prot 1-2048.0-1.3 3490781.60   33292.61  (  -10.87 )
          detect 1-2048.0-1.3 3585002.20   44942.04  (   -8.47 )
      
      * multi thread
                 param        pswpin       std       improvement
            base 8-1024.0-0.1 16219822.60  329474.01
            prot 8-1024.0-0.1 15959494.00  654597.45  (   -1.61 )
          detect 8-1024.0-0.1 15773790.80  502275.25  (   -2.75 )
            base 8-1024.0-0.7 9174107.80  537619.33
            prot 8-1024.0-0.7 8571915.00  385230.08  (   -6.56 )
          detect 8-1024.0-0.7 8489484.20  364683.00  (   -7.46 )
            base 8-1024.0-1.3 1108495.60   83555.98
            prot 8-1024.0-1.3 1038906.20   63465.20  (   -6.28 )
          detect 8-1024.0-1.3  941817.80   32648.80  (  -15.04 )
            base 8-1280.0-0.1 25776114.20  450480.45
            prot 8-1280.0-0.1 25430847.00  465627.07  (   -1.34 )
          detect 8-1280.0-0.1 25282555.00  465666.55  (   -1.91 )
            base 8-1280.0-0.7 15218968.00  702007.69
            prot 8-1280.0-0.7 13957947.80  492643.86  (   -8.29 )
          detect 8-1280.0-0.7 14158331.20  238656.02  (   -6.97 )
            base 8-1280.0-1.3 1792482.80   30512.90
            prot 8-1280.0-1.3 1577686.40   34002.62  (  -11.98 )
          detect 8-1280.0-1.3 1556133.00   22944.79  (  -13.19 )
            base 8-1536.0-0.1 33923761.40  575455.85
            prot 8-1536.0-0.1 32715766.20  300633.51  (   -3.56 )
          detect 8-1536.0-0.1 33158477.40  117764.51  (   -2.26 )
            base 8-1536.0-0.7 20628907.80  303851.34
            prot 8-1536.0-0.7 19329511.20  341719.31  (   -6.30 )
          detect 8-1536.0-0.7 20013934.00  385358.66  (   -2.98 )
            base 8-1536.0-1.3 2588106.40  130769.20
            prot 8-1536.0-1.3 2275222.40   89637.06  (  -12.09 )
          detect 8-1536.0-1.3 2365008.40  124412.55  (   -8.62 )
            base 8-1792.0-0.1 43328279.20  946469.12
            prot 8-1792.0-0.1 41481980.80  525690.89  (   -4.26 )
          detect 8-1792.0-0.1 41713944.60  406798.93  (   -3.73 )
            base 8-1792.0-0.7 27155647.40  536253.57
            prot 8-1792.0-0.7 24989406.80  502734.52  (   -7.98 )
          detect 8-1792.0-0.7 25524806.40  263237.87  (   -6.01 )
            base 8-1792.0-1.3 3260372.80  137907.92
            prot 8-1792.0-1.3 2879187.80   63597.26  (  -11.69 )
          detect 8-1792.0-1.3 2892962.20   33229.13  (  -11.27 )
            base 8-2048.0-0.1 50583989.80  710121.48
            prot 8-2048.0-0.1 49599984.40  228782.42  (   -1.95 )
          detect 8-2048.0-0.1 50578596.00  660971.66  (   -0.01 )
            base 8-2048.0-0.7 33765479.60  812659.55
            prot 8-2048.0-0.7 30767021.20  462907.24  (   -8.88 )
          detect 8-2048.0-0.7 32213068.80  211884.24  (   -4.60 )
            base 8-2048.0-1.3 3941675.80   28436.45
            prot 8-2048.0-1.3 3538742.40   76856.08  (  -10.22 )
          detect 8-2048.0-1.3 3579397.80   58630.95  (   -9.19 )
      
      As we can see, all the cases show improvement.  Especially, test case with
      zipf distribution 1.3 show more improvements.  It means that if there is a
      hot/cold tendency in anon pages, this patchset works better.
      
      This patch (of 6):
      
      Current implementation of LRU management for anonymous page has some
      problems.  Most important one is that it doesn't protect the workingset,
      that is, pages on the active LRU list.  Although, this problem will be
      fixed in the following patchset, the preparation is required and this
      patch does it.
      
      What following patch does is to implement workingset protection.  After
      the following patchset, newly created or swap-in pages will start their
      lifetime on the inactive list.  If inactive list is too small, there is
      not enough chance to be referenced and the page cannot become the
      workingset.
      
      In order to provide the newly anonymous or swap-in pages enough chance to
      be referenced again, this patch makes active/inactive LRU ratio as 1:1.
      
      This is just a temporary measure.  Later patch in the series introduces
      workingset detection for anonymous LRU that will be used to better decide
      if pages should start on the active and inactive list.  Afterwards this
      patch is effectively reverted.
      Signed-off-by: NJoonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Acked-by: NJohannes Weiner <hannes@cmpxchg.org>
      Acked-by: NVlastimil Babka <vbabka@suse.cz>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Matthew Wilcox <willy@infradead.org>
      Link: http://lkml.kernel.org/r/1595490560-15117-1-git-send-email-iamjoonsoo.kim@lge.com
      Link: http://lkml.kernel.org/r/1595490560-15117-2-git-send-email-iamjoonsoo.kim@lge.comSigned-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      ccc5dc67
  2. 08 8月, 2020 7 次提交
  3. 26 6月, 2020 1 次提交
  4. 05 6月, 2020 1 次提交
  5. 04 6月, 2020 15 次提交
  6. 03 6月, 2020 1 次提交
    • N
      mm/writeback: replace PF_LESS_THROTTLE with PF_LOCAL_THROTTLE · a37b0715
      NeilBrown 提交于
      PF_LESS_THROTTLE exists for loop-back nfsd (and a similar need in the
      loop block driver and callers of prctl(PR_SET_IO_FLUSHER)), where a
      daemon needs to write to one bdi (the final bdi) in order to free up
      writes queued to another bdi (the client bdi).
      
      The daemon sets PF_LESS_THROTTLE and gets a larger allowance of dirty
      pages, so that it can still dirty pages after other processses have been
      throttled.  The purpose of this is to avoid deadlock that happen when
      the PF_LESS_THROTTLE process must write for any dirty pages to be freed,
      but it is being thottled and cannot write.
      
      This approach was designed when all threads were blocked equally,
      independently on which device they were writing to, or how fast it was.
      Since that time the writeback algorithm has changed substantially with
      different threads getting different allowances based on non-trivial
      heuristics.  This means the simple "add 25%" heuristic is no longer
      reliable.
      
      The important issue is not that the daemon needs a *larger* dirty page
      allowance, but that it needs a *private* dirty page allowance, so that
      dirty pages for the "client" bdi that it is helping to clear (the bdi
      for an NFS filesystem or loop block device etc) do not affect the
      throttling of the daemon writing to the "final" bdi.
      
      This patch changes the heuristic so that the task is not throttled when
      the bdi it is writing to has a dirty page count below below (or equal
      to) the free-run threshold for that bdi.  This ensures it will always be
      able to have some pages in flight, and so will not deadlock.
      
      In a steady-state, it is expected that PF_LOCAL_THROTTLE tasks might
      still be throttled by global threshold, but that is acceptable as it is
      only the deadlock state that is interesting for this flag.
      
      This approach of "only throttle when target bdi is busy" is consistent
      with the other use of PF_LESS_THROTTLE in current_may_throttle(), were
      it causes attention to be focussed only on the target bdi.
      
      So this patch
       - renames PF_LESS_THROTTLE to PF_LOCAL_THROTTLE,
       - removes the 25% bonus that that flag gives, and
       - If PF_LOCAL_THROTTLE is set, don't delay at all unless the
         global and the local free-run thresholds are exceeded.
      
      Note that previously realtime threads were treated the same as
      PF_LESS_THROTTLE threads.  This patch does *not* change the behvaiour
      for real-time threads, so it is now different from the behaviour of nfsd
      and loop tasks.  I don't know what is wanted for realtime.
      
      [akpm@linux-foundation.org: coding style fixes]
      Signed-off-by: NNeilBrown <neilb@suse.de>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Reviewed-by: NJan Kara <jack@suse.cz>
      Acked-by: Chuck Lever <chuck.lever@oracle.com>	[nfsd]
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Trond Myklebust <trond.myklebust@hammerspace.com>
      Link: http://lkml.kernel.org/r/87ftbf7gs3.fsf@notabene.neil.brown.nameSigned-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      a37b0715
  7. 08 5月, 2020 1 次提交
  8. 08 4月, 2020 1 次提交
  9. 03 4月, 2020 7 次提交
  10. 22 2月, 2020 1 次提交
    • G
      mm/vmscan.c: don't round up scan size for online memory cgroup · 76073c64
      Gavin Shan 提交于
      Commit 68600f62 ("mm: don't miss the last page because of round-off
      error") makes the scan size round up to @denominator regardless of the
      memory cgroup's state, online or offline.  This affects the overall
      reclaiming behavior: the corresponding LRU list is eligible for
      reclaiming only when its size logically right shifted by @sc->priority
      is bigger than zero in the former formula.
      
      For example, the inactive anonymous LRU list should have at least 0x4000
      pages to be eligible for reclaiming when we have 60/12 for
      swappiness/priority and without taking scan/rotation ratio into account.
      
      After the roundup is applied, the inactive anonymous LRU list becomes
      eligible for reclaiming when its size is bigger than or equal to 0x1000
      in the same condition.
      
          (0x4000 >> 12) * 60 / (60 + 140 + 1) = 1
          ((0x1000 >> 12) * 60) + 200) / (60 + 140 + 1) = 1
      
      aarch64 has 512MB huge page size when the base page size is 64KB.  The
      memory cgroup that has a huge page is always eligible for reclaiming in
      that case.
      
      The reclaiming is likely to stop after the huge page is reclaimed,
      meaing the further iteration on @sc->priority and the silbing and child
      memory cgroups will be skipped.  The overall behaviour has been changed.
      This fixes the issue by applying the roundup to offlined memory cgroups
      only, to give more preference to reclaim memory from offlined memory
      cgroup.  It sounds reasonable as those memory is unlikedly to be used by
      anyone.
      
      The issue was found by starting up 8 VMs on a Ampere Mustang machine,
      which has 8 CPUs and 16 GB memory.  Each VM is given with 2 vCPUs and
      2GB memory.  It took 264 seconds for all VMs to be completely up and
      784MB swap is consumed after that.  With this patch applied, it took 236
      seconds and 60MB swap to do same thing.  So there is 10% performance
      improvement for my case.  Note that KSM is disable while THP is enabled
      in the testing.
      
               total     used    free   shared  buff/cache   available
         Mem:  16196    10065    2049       16        4081        3749
         Swap:  8175      784    7391
               total     used    free   shared  buff/cache   available
         Mem:  16196    11324    3656       24        1215        2936
         Swap:  8175       60    8115
      
      Link: http://lkml.kernel.org/r/20200211024514.8730-1-gshan@redhat.com
      Fixes: 68600f62 ("mm: don't miss the last page because of round-off error")
      Signed-off-by: NGavin Shan <gshan@redhat.com>
      Acked-by: NRoman Gushchin <guro@fb.com>
      Cc: <stable@vger.kernel.org>	[4.20+]
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      76073c64
  11. 01 2月, 2020 3 次提交
  12. 18 12月, 2019 1 次提交