1. 04 Jun, 2020 7 commits
    • mm: balance LRU lists based on relative thrashing · 314b57fb
      Johannes Weiner committed
      Since the LRUs were split into anon and file lists, the VM has been
      balancing between page cache and anonymous pages based on per-list ratios
      of scanned vs.  rotated pages.  In most cases that tips page reclaim
      towards the list that is easier to reclaim and has the fewest actively
      used pages, but there are a few problems with it:
      
      1. Refaults and LRU rotations are weighted the same way, even though
         one costs IO and the other costs a bit of CPU.
      
      2. The less we scan an LRU list based on already observed rotations,
         the more we increase the sampling interval for new references, and
         rotations become even more likely on that list. This can enter a
         death spiral in which we stop looking at one list completely until
         the other one is all but annihilated by page reclaim.
      
      Since commit a528910e ("mm: thrash detection-based file cache sizing")
      we have refault detection for the page cache.  Along with swapin events,
      they are good indicators of when the file or anon list, respectively, is
      too small for its workingset and needs to grow.
      
      For example, if the page cache is thrashing, the cache pages need more
      time in memory, while there may be colder pages on the anonymous list.
      Likewise, if swapped pages are faulting back in, it indicates that we
      reclaim anonymous pages too aggressively and should back off.
      
      Replace LRU rotations with refaults and swapins as the basis for relative
      reclaim cost of the two LRUs.  This will have the VM target list balances
      that incur the least amount of IO on aggregate.
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Rik van Riel <riel@surriel.com>
      Link: http://lkml.kernel.org/r/20200520232525.798933-12-hannes@cmpxchg.org
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      314b57fb
    • mm: deactivations shouldn't bias the LRU balance · fbbb602e
      Johannes Weiner committed
      Operations like MADV_FREE, FADV_DONTNEED etc.  currently move any affected
      active pages to the inactive list to accelerate their reclaim (good) but
      also steer page reclaim toward that LRU type, or away from the other
      (bad).
      
      The reason why this is undesirable is that such operations are not part of
      the regular page aging cycle, and rather a fluke that doesn't say much
      about the remaining pages on that list; they might all be in heavy use,
      and once the chunk of easy victims has been purged, the VM continues to
      apply elevated pressure on those remaining hot pages.  The other LRU,
      meanwhile, might have easily reclaimable pages, and there was never a need
      to steer away from it in the first place.
      
      As the previous patch outlined, we should focus on recording actually
      observed cost to steer the balance rather than speculating about the
      potential value of one LRU list over the other.  In that spirit, leave
      explicitly deactivated pages to the LRU algorithm to pick up, and let
      rotations decide which list is the easiest to reclaim.
      
      [cai@lca.pw: fix set-but-not-used warning]
        Link: http://lkml.kernel.org/r/20200522133335.GA624@Qians-MacBook-Air.local
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Acked-by: Minchan Kim <minchan@kernel.org>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Rik van Riel <riel@surriel.com>
      Cc: Qian Cai <cai@lca.pw>
      Link: http://lkml.kernel.org/r/20200520232525.798933-10-hannes@cmpxchg.org
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      fbbb602e
    • mm: base LRU balancing on an explicit cost model · 1431d4d1
      Johannes Weiner committed
      Currently, scan pressure between the anon and file LRU lists is balanced
      based on a mixture of reclaim efficiency and a somewhat vague notion of
      "value" of having certain pages in memory over others.  That concept of
      value is problematic, because it has caused us to count any event that
      remotely makes one LRU list more or less preferable for reclaim, even
      when these events are not directly comparable and impose very different
      costs on the system.  One example is referenced file pages that we still
      deactivate and referenced anonymous pages that we actually rotate back to
      the head of the list.
      
      There is also conceptual overlap with the LRU algorithm itself.  By
      rotating recently used pages instead of reclaiming them, the algorithm
      already biases the applied scan pressure based on page value.  Thus, when
      rebalancing scan pressure due to rotations, we should think of reclaim
      cost, and leave assessing the page value to the LRU algorithm.
      
      Lastly, considering both value-increasing as well as value-decreasing
      events can sometimes cause the same type of event to be counted twice,
      i.e.  how rotating a page increases the LRU value, while reclaiming it
      successfully decreases the value.  In itself this will balance out fine,
      but it quietly skews the impact of events that are only recorded once.
      
      The abstract metric of "value", the murky relationship with the LRU
      algorithm, and accounting both negative and positive events make the
      current pressure balancing model hard to reason about and modify.
      
      This patch switches to a balancing model of accounting the concrete,
      actually observed cost of reclaiming one LRU over another.  For now, that
      cost includes pages that are scanned but rotated back to the list head.
      Subsequent patches will add consideration for IO caused by refaulting of
      recently evicted pages.
      
      Replace struct zone_reclaim_stat with two cost counters in the lruvec, and
      make everything that affects cost go through a new lru_note_cost()
      function.
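As a rough illustration of the bookkeeping described above, the following Python sketch models a lruvec that carries two cost counters and funnels every cost event through one function. The names `LruvecModel` and `note_cost` are hypothetical stand-ins chosen for this sketch; it is not the kernel's implementation and it leaves out any aging or decay of the counters.

```python
from dataclasses import dataclass


@dataclass
class LruvecModel:
    """Toy stand-in for a lruvec that tracks reclaim cost per LRU type."""
    anon_cost: int = 0
    file_cost: int = 0

    def note_cost(self, is_file: bool, nr_pages: int) -> None:
        """Single entry point for every event that makes reclaim costly."""
        if is_file:
            self.file_cost += nr_pages
        else:
            self.anon_cost += nr_pages


lruvec = LruvecModel()
lruvec.note_cost(is_file=True, nr_pages=32)   # 32 file pages rotated back
lruvec.note_cost(is_file=False, nr_pages=8)   # 8 anon pages rotated back
print(lruvec)
```

Keeping one entry point means a later change to what counts as cost (for example refault IO) only has to touch the callers, not the balancing code that reads the counters.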
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Rik van Riel <riel@surriel.com>
      Link: http://lkml.kernel.org/r/20200520232525.798933-9-hannes@cmpxchg.org
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      1431d4d1
    • mm: remove use-once cache bias from LRU balancing · 96824687
      Johannes Weiner committed
      When the splitlru patches divided page cache and swap-backed pages into
      separate LRU lists, the pressure balance between the lists was biased to
      account for the fact that streaming IO can cause memory pressure with a
      flood of pages that are used only once.  New page cache additions would
      tip the balance toward the file LRU, and repeat access would neutralize
      that bias again.  This ensured that page reclaim would always go for
      used-once cache first.
      
      Since e9868505 ("mm,vmscan: only evict file pages when we have
      plenty"), page reclaim generally skips over swap-backed memory entirely as
      long as there is used-once cache present, and will apply the LRU balancing
      when only repeatedly accessed cache pages are left - at which point the
      previous use-once bias will have been neutralized.  This makes the
      use-once cache balancing bias unnecessary.
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Minchan Kim <minchan@kernel.org>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Rik van Riel <riel@surriel.com>
      Link: http://lkml.kernel.org/r/20200520232525.798933-7-hannes@cmpxchg.org
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      96824687
    • mm: fold and remove lru_cache_add_anon() and lru_cache_add_file() · 6058eaec
      Johannes Weiner committed
      They're the same function, and for the purpose of all callers they are
      equivalent to lru_cache_add().
      
      [akpm@linux-foundation.org: fix it for local_lock changes]
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Rik van Riel <riel@surriel.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Minchan Kim <minchan@kernel.org>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Link: http://lkml.kernel.org/r/20200520232525.798933-5-hannes@cmpxchg.org
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      6058eaec
    • mm: fix LRU balancing effect of new transparent huge pages · 5df74196
      Johannes Weiner committed
      The reclaim code that balances between swapping and cache reclaim tries to
      predict likely reuse based on in-memory reference patterns alone.  This
      works in many cases, but when it fails it cannot detect when the cache is
      thrashing pathologically, or when we're in the middle of a swap storm.
      
      The high seek cost of rotational drives under which the algorithm evolved
      also meant that mistakes could quickly result in lockups from too
      aggressive swapping (which is predominantly random IO).  As a result, the
      balancing code has been tuned over time to a point where it mostly goes
      for page cache and defers swapping until the VM is under significant
      memory pressure.
      
      The resulting strategy doesn't make optimal caching decisions - where
      optimal is the least amount of IO required to execute the workload.
      
      The proliferation of fast random IO devices such as SSDs, in-memory
      compression such as zswap, and persistent memory technologies on the
      horizon, has made this undesirable behavior very noticeable: Even in the
      presence of large amounts of cold anonymous memory and a capable swap
      device, the VM refuses to even seriously scan these pages, and can leave
      the page cache thrashing needlessly.
      
      This series sets out to address this.  Since commit a528910e ("mm:
      thrash detection-based file cache sizing") we have exact tracking of
      refault IO - the ultimate cost of reclaiming the wrong pages.  This allows
      us to use an IO cost based balancing model that is more aggressive about
      scanning anonymous memory when the cache is thrashing, while being able to
      avoid unnecessary swap storms.
      
      These patches base the LRU balance on the rate of refaults on each list,
      times the relative IO cost between swap device and filesystem
      (swappiness), in order to optimize reclaim for least IO cost incurred.
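One way to picture that balancing rule is the Python sketch below. It assumes that scan pressure on a list is inversely proportional to the IO cost observed on it, weighted by swappiness on a 0..200 scale where 100 means swap and filesystem IO cost the same (the text above only states that 100 indicates equal cost; the 200 upper bound, the function name and the exact formula are assumptions of this sketch, not the series' code).

```python
def scan_fractions(anon_cost: float, file_cost: float, swappiness: int):
    """Return (anon_share, file_share) of scan pressure; the shares sum to 1.

    Assumed model: pressure = weight / observed cost, where the anon weight
    is swappiness and the file weight is 200 - swappiness.
    """
    anon_weight = swappiness
    file_weight = 200 - swappiness
    # +1 avoids division by zero before any cost has been observed.
    anon_pressure = anon_weight / (anon_cost + 1)
    file_pressure = file_weight / (file_cost + 1)
    total = anon_pressure + file_pressure
    return anon_pressure / total, file_pressure / total


# Equal cost on both lists, equal IO cost: pressure is split evenly.
print(scan_fractions(anon_cost=10, file_cost=10, swappiness=100))

# Cache is thrashing (many refaults), no swapins: pressure moves to anon.
print(scan_fractions(anon_cost=0, file_cost=999, swappiness=100))
```

In this model a thrashing cache with no swapins pushes nearly all pressure onto the anon list, and a burst of swapins pulls it back, which is the feedback loop the series describes.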
      
      	History
      
      I floated these changes in 2016.  At the time they were incomplete and
      full of workarounds due to a lack of infrastructure in the reclaim code:
      We didn't have PageWorkingset, we didn't have hierarchical cgroup
      statistics, and there were problems with the cgroup swap controller.  As swapping
      wasn't too high a priority then, the patches stalled out.  With all
      dependencies in place now, here we are again with much cleaner,
      feature-complete patches.
      
      I kept the acks for patches that stayed materially the same :-)
      
      Below is a series of test results that demonstrate certain problematic
      behavior of the current code, as well as showcase the new code's more
      predictable and appropriate balancing decisions.
      
      	Test #1: No convergence
      
      This test shows an edge case where the VM currently doesn't converge at
      all on a new file workingset with a stale anon/tmpfs set.
      
      The test sets up a cold anon set the size of 3/4 RAM, then tries to
      establish a new file set half the size of RAM (flat access pattern).
      
      The vanilla kernel refuses to even scan anon pages and never converges.
      The file set is perpetually served from the filesystem.
      
      The first test kernel is with the series up to the workingset patch
      applied.  This allows thrashing page cache to challenge the anonymous
      workingset.  The VM then scans the lists based on the current
      scanned/rotated balancing algorithm.  It converges on a stable state where
      all cold anon pages are pushed out and the fileset is served entirely from
      cache:
      
      			    noconverge/5.7-rc5-mm	noconverge/5.7-rc5-mm-workingset
      Scanned			417719308.00 (    +0.00%)		64091155.00 (   -84.66%)
      Reclaimed		417711094.00 (    +0.00%)		61640308.00 (   -85.24%)
      Reclaim efficiency %	      100.00 (    +0.00%)		      96.18 (    -3.78%)
      Scanned file		417719308.00 (    +0.00%)		59211118.00 (   -85.83%)
      Scanned anon			0.00 (    +0.00%)	         4880037.00 (          )
      Swapouts			0.00 (    +0.00%)	         2439957.00 (          )
      Swapins				0.00 (    +0.00%)		     257.00 (          )
      Refaults		415246605.00 (    +0.00%)		59183722.00 (   -85.75%)
      Restore refaults		0.00 (    +0.00%)	        54988252.00 (          )
      
      The second test kernel is with the full patch series applied, which
      replaces the scanned/rotated ratios with refault/swapin rate-based
      balancing.  It evicts the cold anon pages more aggressively in the
      presence of a thrashing cache and the absence of swapins, and so converges
      with about 60% of the IO and reclaim activity:
      
      			noconverge/5.7-rc5-mm-workingset	noconverge/5.7-rc5-mm-lrubalance
      Scanned				64091155.00 (    +0.00%)		37579741.00 (   -41.37%)
      Reclaimed			61640308.00 (    +0.00%)		35129293.00 (   -43.01%)
      Reclaim efficiency %		      96.18 (    +0.00%)		      93.48 (    -2.78%)
      Scanned file			59211118.00 (    +0.00%)		32708385.00 (   -44.76%)
      Scanned anon			 4880037.00 (    +0.00%)		 4871356.00 (    -0.18%)
      Swapouts			 2439957.00 (    +0.00%)		 2435565.00 (    -0.18%)
      Swapins				     257.00 (    +0.00%)		     262.00 (    +1.94%)
      Refaults			59183722.00 (    +0.00%)		32675667.00 (   -44.79%)
      Restore refaults		54988252.00 (    +0.00%)		28480430.00 (   -48.21%)
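The "about 60%" figure and the efficiency row can be re-derived from the table above. The short Python check below copies three rows from the table; the helper names are made up for this sketch, and reading "reclaim efficiency" as reclaimed divided by scanned is an inference from the numbers rather than something stated in the text.

```python
# Figures copied from the "noconverge" table above.
workingset = {"scanned": 64091155, "reclaimed": 61640308, "refaults": 59183722}
lrubalance = {"scanned": 37579741, "reclaimed": 35129293, "refaults": 32675667}


def efficiency_pct(run):
    """Reclaimed pages as a percentage of scanned pages."""
    return 100.0 * run["reclaimed"] / run["scanned"]


def relative_pct(new, old, key):
    """Value of `key` in the new run as a percentage of the old run."""
    return 100.0 * new[key] / old[key]


print(f"efficiency: {efficiency_pct(workingset):.2f}% -> "
      f"{efficiency_pct(lrubalance):.2f}%")
print(f"scanned:  {relative_pct(lrubalance, workingset, 'scanned'):.1f}% "
      "of the workingset-only kernel")
print(f"refaults: {relative_pct(lrubalance, workingset, 'refaults'):.1f}% "
      "of the workingset-only kernel")
```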
      
      We're triggering this case in host sideloading scenarios: When a host's
      primary workload is not saturating the machine (primary load is usually
      driven by user activity), we can optimistically sideload a batch job; if
      user activity picks up and the primary workload needs the whole host
      during this time, we freeze the sideload and rely on it getting pushed to
      swap.  Frequently that swapping doesn't happen and the completely inactive
      sideload simply stays resident while the expanding primary workload is
      struggling to gain ground.
      
      	Test #2: Kernel build
      
      This test is a kernel build that is slightly memory-restricted (make -j4
      inside a 400M cgroup).
      
      Despite the very aggressive swapping of cold anon pages in test #1, this
      test shows that the new kernel carefully balances swap against cache
      refaults when both the anon and the cache set are pressured.
      
      It shows the patched kernel to be slightly better at finding the coldest
      memory from the combined anon and file set to evict under pressure.  The
      result is lower aggregate reclaim and paging activity:
      
      				    5.7-rc5-mm	5.7-rc5-mm-lrubalance
      Real time		   210.60 (    +0.00%)	   210.97 (    +0.18%)
      User time		   745.42 (    +0.00%)	   746.48 (    +0.14%)
      System time		    69.78 (    +0.00%)	    69.79 (    +0.02%)
      Scanned file		354682.00 (    +0.00%)	293661.00 (   -17.20%)
      Scanned anon		465381.00 (    +0.00%)	378144.00 (   -18.75%)
      Swapouts		185920.00 (    +0.00%)	147801.00 (   -20.50%)
      Swapins			 34583.00 (    +0.00%)	 32491.00 (    -6.05%)
      Refaults		212664.00 (    +0.00%)	172409.00 (   -18.93%)
      Restore refaults	 48861.00 (    +0.00%)	 80091.00 (   +63.91%)
      Total paging IO		433167.00 (    +0.00%)	352701.00 (   -18.58%)
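The "Total paging IO" row is not defined in the text, but the numbers are consistent with it being the sum of swapouts, swapins and refaults. The Python check below makes that inference explicit using the rows from the table above; the variable names are arbitrary.

```python
# Rows copied from the kernel build table above: (5.7-rc5-mm, lrubalance).
rows = {
    "swapouts": (185920, 147801),
    "swapins": (34583, 32491),
    "refaults": (212664, 172409),
}

baseline_io = sum(before for before, _ in rows.values())
patched_io = sum(after for _, after in rows.values())
change_pct = 100.0 * (patched_io - baseline_io) / baseline_io

print(baseline_io, patched_io, f"{change_pct:+.2f}%")
```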
      
      	Test #3: Overload
      
      This next test is not about performance, but rather about the
      predictability of the algorithm.  The current balancing behavior doesn't
      always lead to comprehensible results, which makes performance analysis
      and parameter tuning (swappiness e.g.) very difficult.
      
      The test shows the balancing behavior under equivalent anon and file
      input.  Anon and file sets are created of equal size (3/4 RAM), have the
      same access patterns (a hot-cold gradient), and synchronized access rates.
      Swappiness is raised from the default of 60 to 100 to indicate equal IO
      cost between swap and cache.
      
      With the vanilla balancing code, anon scans make up around 9% of the total
      pages scanned, or a ~1:10 ratio.  This is a surprisingly skewed ratio, and
      it's an outcome that is hard to explain given the input parameters to the
      VM.
      
      The new balancing model targets a 1:2 balance: All else being equal,
      reclaiming a file page costs one page IO - the refault; reclaiming an anon
      page costs two IOs - the swapout and the swapin.  In the test we observe a
      ~1:3 balance.
      
      The scanned and paging IO numbers indicate that the anon LRU algorithm we
      have in place right now does a slightly worse job at picking the coldest
      pages compared to the file algorithm.  There is ongoing work to improve
      this, like Joonsoo's anon workingset patches; however, it's difficult to
      compare the two aging strategies when the balancing between them is
      behaving unintuitively.
      
      The slightly less efficient anon reclaim results in a deviation from the
      optimal 1:2 scan ratio we would like to see here - however, 1:3 is much
      closer to what we'd want to see in this test than the vanilla kernel's
      aging of 10+ cache pages for every anonymous one:
      
      			overload-100/5.7-rc5-mm-workingset	overload-100/5.7-rc5-mm-lrubalance-realfile
      Scanned				 533633725.00 (    +0.00%)			  595687785.00 (   +11.63%)
      Reclaimed			 494325440.00 (    +0.00%)			  518154380.00 (    +4.82%)
      Reclaim efficiency %			92.63 (    +0.00%)				 86.98 (    -6.03%)
      Scanned file			 484532894.00 (    +0.00%)			  456937722.00 (    -5.70%)
      Scanned anon			  49100831.00 (    +0.00%)			  138750063.00 (  +182.58%)
      Swapouts			   8096423.00 (    +0.00%)			   48982142.00 (  +504.98%)
      Swapins				  10027384.00 (    +0.00%)			   62325044.00 (  +521.55%)
      Refaults			 479819973.00 (    +0.00%)			  451309483.00 (    -5.94%)
      Restore refaults		 426422087.00 (    +0.00%)			  399914067.00 (    -6.22%)
      Total paging IO			 497943780.00 (    +0.00%)			  562616669.00 (   +12.99%)
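The ratios quoted in the discussion of this test can be recomputed from the scan rows of the table above. The Python sketch below does that and also spells out the 1:2 target from the cost argument (one IO per reclaimed file page, two per reclaimed anon page). The dictionary and function names are illustrative only.

```python
# Scan counts copied from the "overload-100" table above.
baseline = {"file": 484532894, "anon": 49100831}    # 5.7-rc5-mm-workingset
patched = {"file": 456937722, "anon": 138750063}    # 5.7-rc5-mm-lrubalance


def anon_share_pct(scans):
    """Anon scans as a percentage of all pages scanned."""
    return 100.0 * scans["anon"] / (scans["anon"] + scans["file"])


def file_per_anon(scans):
    """How many file pages are scanned for each anon page scanned."""
    return scans["file"] / scans["anon"]


# Cost argument from the text: reclaiming a file page costs one IO (the
# refault); reclaiming an anon page costs two (swapout plus swapin).
file_io, anon_io = 1, 2
target_file_per_anon = anon_io / file_io

print(f"baseline: {anon_share_pct(baseline):.1f}% anon, "
      f"1:{file_per_anon(baseline):.1f}")
print(f"patched:  {anon_share_pct(patched):.1f}% anon, "
      f"1:{file_per_anon(patched):.1f}")
print(f"target:   1:{target_file_per_anon:.0f}")
```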
      
      	Test #4: Parallel IO
      
      It's important to note that these patches only affect the situation where
      the kernel has to reclaim workingset memory, which is usually a
      transitionary period.  The vast majority of page reclaim occurring in a
      system is from trimming the ever-expanding page cache.
      
      These patches don't affect cache trimming behavior.  We never swap as long
      as we only have use-once cache moving through the file LRU; we only
      consider swapping when the cache is actively thrashing.
      
      The following test demonstrates this.  It has an anon workingset that
      takes up half of RAM and then writes a file that is twice the size of RAM
      out to disk.
      
      As the cache is funneled through the inactive file list, no anon pages are
      scanned (aside from apparently some background noise of 10 pages):
      
      					  5.7-rc5-mm		          5.7-rc5-mm-lrubalance
      Scanned			    10714722.00 (    +0.00%)		       10723445.00 (    +0.08%)
      Reclaimed		    10703596.00 (    +0.00%)		       10712166.00 (    +0.08%)
      Reclaim efficiency %		  99.90 (    +0.00%)			     99.89 (    -0.00%)
      Scanned file		    10714722.00 (    +0.00%)		       10723435.00 (    +0.08%)
      Scanned anon			   0.00 (    +0.00%)			     10.00 (          )
      Swapouts			   0.00 (    +0.00%)			      7.00 (          )
      Swapins				   0.00 (    +0.00%)			      0.00 (    +0.00%)
      Refaults			  92.00 (    +0.00%)			     41.00 (   -54.84%)
      Restore refaults		   0.00 (    +0.00%)			      0.00 (    +0.00%)
      Total paging IO			  92.00 (    +0.00%)			     48.00 (   -47.31%)
      
      This patch (of 14):
      
      Currently, THP are counted as single pages until they are split right
      before being swapped out.  However, at that point the VM is already in the
      middle of reclaim, and adjusting the LRU balance then is useless.
      
      Always account THP by the number of basepages, and remove the fixup from
      the splitting path.
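To make the accounting change concrete, here is a small Python sketch that counts an LRU entry by the number of base pages it covers. It assumes a huge page of order 9, i.e. 512 base pages, as with 4 KiB base pages and 2 MiB huge pages; the function names are hypothetical and the sketch only illustrates the arithmetic, not the kernel's data structures.

```python
def base_pages(is_thp: bool, thp_order: int = 9) -> int:
    """Number of base pages an LRU entry stands for.

    Assumption: a huge page of order 9 covers 2**9 = 512 base pages.
    """
    return 1 << thp_order if is_thp else 1


def lru_size_delta(entries):
    """Sum base pages over a batch of is_thp flags added to an LRU."""
    return sum(base_pages(is_thp) for is_thp in entries)


batch = [False, False, True]  # two small pages and one huge page
print(lru_size_delta(batch))
```

Counting the huge page as one entry would report 3 here; counting base pages reports 514, which is the weight the balance should see from the moment the page is added.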
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: Shakeel Butt <shakeelb@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Rik van Riel <riel@surriel.com>
      Reviewed-by: Shakeel Butt <shakeelb@google.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Minchan Kim <minchan@kernel.org>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Link: http://lkml.kernel.org/r/20200520232525.798933-1-hannes@cmpxchg.org
      Link: http://lkml.kernel.org/r/20200520232525.798933-2-hannes@cmpxchg.org
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      5df74196
    • mm: simplify calling a compound page destructor · ff45fc3c
      Matthew Wilcox (Oracle) committed
      None of the three callers of get_compound_page_dtor() want to know the
      value; they just want to call the function.  Replace it with
      destroy_compound_page() which calls the dtor for them.
      Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com>
      Reviewed-by: David Hildenbrand <david@redhat.com>
      Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Link: http://lkml.kernel.org/r/20200517105051.9352-1-willy@infradead.org
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      ff45fc3c
  2. 28 May, 2020 1 commit
    • mm/swap: Use local_lock for protection · b01b2141
      Ingo Molnar committed
      The various struct pagevec per CPU variables are protected by disabling
      either preemption or interrupts across the critical sections. Inside
      these sections spinlocks have to be acquired.
      
      These spinlocks are regular spinlock_t types which are converted to
      "sleeping" spinlocks on PREEMPT_RT enabled kernels. Obviously sleeping
      locks cannot be acquired in preemption or interrupt disabled sections.
      
      local locks provide a trivial way to substitute preempt and interrupt
      disable instances. On a non PREEMPT_RT enabled kernel local_lock() maps
      to preempt_disable() and local_lock_irq() to local_irq_disable().
      
      Create lru_rotate_pvecs containing the pagevec and the locallock.
      Create lru_pvecs containing the remaining pagevecs and the locallock.
      Add lru_add_drain_cpu_zone() which is used from compact_zone() to avoid
      exporting the pvec structure.
      
      Change the relevant call sites to acquire these locks instead of using
      preempt_disable() / get_cpu() / get_cpu_var() and local_irq_disable() /
      local_irq_save().
      
      There is neither a functional change nor a change in the generated
      binary code for non PREEMPT_RT enabled non-debug kernels.
      
      When lockdep is enabled local locks have lockdep maps embedded. These
      allow lockdep to validate the protections, i.e. inappropriate usage of a
      preemption-only protected section would result in a lockdep warning
      while the same problem would not be noticed with a plain
      preempt_disable() based protection.
      
      local locks also improve readability as they provide a named scope for
      the protections while preempt/interrupt disable are opaque and scopeless.
      
      Finally local locks allow PREEMPT_RT to substitute them with real
      locking primitives to ensure the correctness of operation in a fully
      preemptible kernel.
      
      [ bigeasy: Adapted to use local_lock ]
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      Acked-by: Peter Zijlstra <peterz@infradead.org>
      Link: https://lore.kernel.org/r/20200527201119.1692513-4-bigeasy@linutronix.de
      b01b2141
  3. 08 Apr, 2020 2 commits
    • mm: huge tmpfs: try to split_huge_page() when punching hole · 71725ed1
      Hugh Dickins committed
      Yang Shi writes:
      
      Currently, when truncating a shmem file, if the range is partly in a THP
      (start or end is in the middle of THP), the pages actually will just get
      cleared rather than being freed, unless the range covers the whole THP.
      Even though all the subpages are truncated (randomly or sequentially), the
      THP may still be kept in page cache.
      
      This might be fine for some usecases which prefer preserving THP, but
      balloon inflation is handled in base page size.  So when using shmem THP
      as memory backend, QEMU inflation actually doesn't work as expected since
      it doesn't free memory.  But the inflation usecase really needs to get the
      memory freed.  (Anonymous THP will also not get freed right away, but will
      be freed eventually when all subpages are unmapped: whereas shmem THP
      still stays in page cache.)
      
      Split THP right away when doing partial hole punch, and if split fails
      just clear the page so that read of the punched area will return zeroes.
      
      Hugh Dickins adds:
      
      Our earlier "team of pages" huge tmpfs implementation worked in the way
      that Yang Shi proposes; and we have been using this patch to continue to
      split the huge page when hole-punched or truncated, since converting over
      to the compound page implementation.  Although huge tmpfs gives out huge
      pages when available, if the user specifically asks to truncate or punch a
      hole (perhaps to free memory, perhaps to reduce the memcg charge), then
      the filesystem should do so as best it can, splitting the huge page.
      
      That is not always possible: any additional reference to the huge page
      prevents split_huge_page() from succeeding, so the result can be flaky.
      But in practice it works successfully enough that we've not seen any
      problem from that.
      
      Add shmem_punch_compound() to encapsulate the decision of when a split is
      needed, and doing the split if so.  Using this simplifies the flow in
      shmem_undo_range(); and the first (trylock) pass does not need to do any
      page clearing on failure, because the second pass will either succeed or
      do that clearing.  Following the example of zero_user_segment() when
      clearing a partial page, add flush_dcache_page() and set_page_dirty() when
      clearing a hole - though I'm not certain that either is needed.
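The decision that shmem_punch_compound() is described as encapsulating can be sketched in Python as follows. This is a simplified model under stated assumptions: `PageModel`, `HPAGE_NR = 512`, the exclusive `end` bound and the `try_split` callback are all inventions of the sketch, standing in for the real page structure and split_huge_page().

```python
from dataclasses import dataclass
from typing import Callable

HPAGE_NR = 512  # assumed number of base pages per huge page


@dataclass
class PageModel:
    index: int           # file offset of this page, in base pages
    is_compound: bool    # part of a huge page?
    is_head: bool        # head page of that huge page?


def punch_compound(page: PageModel, start: int, end: int,
                   try_split: Callable[[PageModel], bool]) -> bool:
    """Return True if the caller may remove the page from [start, end).

    False means the split failed and the caller has to fall back to
    clearing the punched part instead of freeing it.
    """
    if not page.is_compound:
        return True   # small page: nothing to split
    if page.is_head and page.index >= start and page.index + HPAGE_NR <= end:
        return True   # the whole huge page lies inside the hole
    return try_split(page)   # partial overlap: a split is needed


head = PageModel(index=0, is_compound=True, is_head=True)
print(punch_compound(head, 0, 512, lambda p: False))  # whole page punched
print(punch_compound(head, 0, 100, lambda p: False))  # partial, split fails
```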
      
      But: split_huge_page() would be sure to fail if shmem_undo_range()'s
      pagevec holds further references to the huge page.  The easiest way to fix
      that is for find_get_entries() to return early, as soon as it has put one
      compound head or tail into the pagevec.  At first this felt like a hack;
      but on examination, this convention better suits all its callers - or will
      do, if the slight one-page-per-pagevec slowdown in shmem_unlock_mapping()
      and shmem_seek_hole_data() is transformed into a 512-page-per-pagevec
      speedup by checking for compound pages there.
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: Yang Shi <yang.shi@linux.alibaba.com>
      Cc: Alexander Duyck <alexander.duyck@gmail.com>
      Cc: "Michael S. Tsirkin" <mst@redhat.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Link: http://lkml.kernel.org/r/alpine.LSU.2.11.2002261959020.10801@eggly.anvils
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      71725ed1
    • mm: code cleanup for MADV_FREE · 9de4f22a
      Huang Ying committed
      Some comments for MADV_FREE are revised and added to help people understand
      the MADV_FREE code, especially the page flag PG_swapbacked.  This makes
      page_is_file_cache() inconsistent with its comments, so the function
      is renamed to page_is_file_lru() to make them consistent again.  All these
      are put in one patch as one logical change.
      Suggested-by: David Hildenbrand <david@redhat.com>
      Suggested-by: Johannes Weiner <hannes@cmpxchg.org>
      Suggested-by: David Rientjes <rientjes@google.com>
      Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: David Rientjes <rientjes@google.com>
      Acked-by: Michal Hocko <mhocko@kernel.org>
      Acked-by: Pankaj Gupta <pankaj.gupta.linux@gmail.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Rik van Riel <riel@surriel.com>
      Link: http://lkml.kernel.org/r/20200317100342.2730705-1-ying.huang@intel.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      9de4f22a
  4. 03 Apr, 2020 2 commits
  5. 01 Feb, 2020 1 commit
    • mm: devmap: refactor 1-based refcounting for ZONE_DEVICE pages · 07d80269
      John Hubbard committed
      An upcoming patch changes and complicates the refcounting and especially
      the "put page" aspects of it.  In order to keep everything clean,
      refactor the devmap page release routines:
      
      * Rename put_devmap_managed_page() to page_is_devmap_managed(), and
        limit the functionality to "read only": return a bool, with no side
        effects.
      
      * Add a new routine, put_devmap_managed_page(), to handle decrementing
        the refcount for ZONE_DEVICE pages.
      
      * Change callers (just release_pages() and put_page()) to check
        page_is_devmap_managed() before calling the new
        put_devmap_managed_page() routine.  This is a performance point:
        put_page() is a hot path, so we need to avoid non-inline function calls
        where possible.
      
      * Rename __put_devmap_managed_page() to free_devmap_managed_page(), and
        limit the functionality to unconditionally freeing a devmap page.
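The division of labour listed above can be modelled in a few lines of Python. The function names mirror the ones in the commit message, but the `PageModel` class and the bodies are a toy model; in particular, treating a devmap page as free when its count drops to 1 is an assumption inferred from the "1-based refcounting" in the title.

```python
from dataclasses import dataclass


@dataclass
class PageModel:
    refcount: int
    devmap_managed: bool = False
    freed: bool = False


def page_is_devmap_managed(page: PageModel) -> bool:
    """Read-only check: returns a bool and has no side effects."""
    return page.devmap_managed


def free_devmap_managed_page(page: PageModel) -> None:
    """Unconditionally free a devmap page."""
    page.freed = True


def put_devmap_managed_page(page: PageModel) -> None:
    """Drop one reference; assumed 1-based, so the page is idle at 1."""
    page.refcount -= 1
    if page.refcount == 1:
        free_devmap_managed_page(page)


def put_page(page: PageModel) -> None:
    """Hot path: only devmap pages take the extra call."""
    if page_is_devmap_managed(page):
        put_devmap_managed_page(page)
        return
    page.refcount -= 1
    if page.refcount == 0:
        page.freed = True


dev = PageModel(refcount=2, devmap_managed=True)
put_page(dev)
print(dev)
```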
      
      This is originally based on a separate patch by Ira Weiny, which applied
      to an early version of the put_user_page() experiments.  Since then,
      Jérôme Glisse suggested the refactoring described above.
      
      Link: http://lkml.kernel.org/r/20200107224558.2362728-5-jhubbard@nvidia.com
      Signed-off-by: Ira Weiny <ira.weiny@intel.com>
      Signed-off-by: John Hubbard <jhubbard@nvidia.com>
      Suggested-by: Jérôme Glisse <jglisse@redhat.com>
      Reviewed-by: Dan Williams <dan.j.williams@intel.com>
      Reviewed-by: Jan Kara <jack@suse.cz>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Kirill A. Shutemov <kirill@shutemov.name>
      Cc: Alex Williamson <alex.williamson@redhat.com>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Cc: Björn Töpel <bjorn.topel@intel.com>
      Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
      Cc: Hans Verkuil <hverkuil-cisco@xs4all.nl>
      Cc: Jason Gunthorpe <jgg@mellanox.com>
      Cc: Jason Gunthorpe <jgg@ziepe.ca>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Leon Romanovsky <leonro@mellanox.com>
      Cc: Mauro Carvalho Chehab <mchehab@kernel.org>
      Cc: Mike Rapoport <rppt@linux.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  6. 01 Dec, 2019 2 commits
  7. 26 Sep, 2019 1 commit
    • M
      mm: introduce MADV_COLD · 9c276cc6
      Committed by Minchan Kim
      Patch series "Introduce MADV_COLD and MADV_PAGEOUT", v7.
      
      - Background
      
      The Android terminology used for forking a new process and starting an app
      from scratch is a cold start, while resuming an existing app is a hot
      start.  While we continually try to improve the performance of cold
      starts, hot starts will always be significantly less power hungry as well
      as faster so we are trying to make hot start more likely than cold start.
      
      To make hot starts more likely, Android userspace manages the order in
      which apps should be killed in a process called ActivityManagerService.
      ActivityManagerService tracks every Android app or service that the user
      could be interacting with at any time and translates that into a ranked
      list for lmkd (low memory killer daemon).  They are likely to be killed by
      lmkd if the system has to reclaim memory.  In that sense they are similar
      to entries in any other cache.  Those apps are kept alive for
      opportunistic performance improvements but those performance improvements
      will vary based on the memory requirements of individual workloads.
      
      - Problem
      
      Naturally, cached apps were the dominant consumers of memory on the system.
      However, they were not significant consumers of swap even though they are
      good candidates for swap.  On investigation, swapping out only begins
      once the low zone watermark is hit and kswapd wakes up, but the overall
      allocation rate in the system might trip lmkd thresholds and cause a
      cached process to be killed.  (We measured the performance of swapping out
      vs. zapping the memory by killing a process.  Unsurprisingly, zapping is
      10x faster even though we use zram, which is much faster than real
      storage.)  A kill from lmkd will therefore often satisfy the high zone
      watermark, resulting in very few pages actually being moved to swap.
      
      - Approach
      
      The approach we chose was to use a new interface to allow userspace to
      proactively reclaim entire processes by leveraging platform information.
      This allowed us to bypass the inaccuracy of the kernel’s LRUs for pages
      that are known to be cold from userspace and to avoid races with lmkd by
      reclaiming apps as soon as they entered the cached state.  Additionally,
      it gives the platform many opportunities to use the information it has
      to optimize memory efficiency.
      
      To achieve the goal, the patchset introduces two new options for madvise.
      One is MADV_COLD, which will deactivate activated pages, and the other is
      MADV_PAGEOUT, which will reclaim private pages instantly.  These new
      options complement MADV_DONTNEED and MADV_FREE by adding non-destructive
      ways to gain some free memory space.  MADV_PAGEOUT is similar to
      MADV_DONTNEED in that it hints to the kernel that the memory region is not
      currently needed and should be reclaimed immediately; MADV_COLD is similar
      to MADV_FREE in that it hints to the kernel that the memory region is not
      currently needed and should be reclaimed when memory pressure rises.
      
      This patch (of 5):
      
      When a process expects no accesses to a certain memory range, it could
      give a hint to the kernel that the pages can be reclaimed under memory
      pressure but that the data should be preserved for future use.  This could
      reduce workingset eviction and so end up increasing performance.
      
      This patch introduces the new MADV_COLD hint to the madvise(2) syscall.
      MADV_COLD can be used by a process to mark a memory range as not expected
      to be used in the near future.  The hint can help the kernel in deciding
      which pages to evict early during memory pressure.
      
      It works for all LRU pages, like MADV_[DONTNEED|FREE].  In other words, it
      moves
      
      	active file page -> inactive file LRU
      	active anon page -> inactive anon LRU
      
      Unlike MADV_FREE, it doesn't move active anonymous pages to the head of the
      inactive file LRU, because MADV_COLD has slightly different semantics.
      MADV_FREE means it's okay to discard the pages under memory pressure because
      the content of the page is *garbage*, so freeing such pages has almost zero
      overhead: we don't need to swap out, and a later access causes just a
      minor fault.  Thus, it makes sense to put those freeable pages on the
      inactive file LRU to compete with other used-once pages.  It makes sense
      from an implementation point of view, too, because the memory is no longer
      swap-backed until it is re-dirtied.  It even has the bonus that such pages
      can be reclaimed on a swapless system.  However, MADV_COLD doesn't mean
      garbage, so reclaiming those pages eventually requires swap-out/in, which
      is a bigger cost.  Since we have designed VM LRU aging based on a cost
      model, anonymous cold pages are better placed on the inactive anon LRU
      list, not the file LRU.  Furthermore, this helps to avoid unnecessary
      scanning if the system doesn't have a swap device.  Let's start the simpler
      way without adding complexity at this moment.  However, keep in mind the
      caveat that workloads with a lot of page cache are likely to ignore
      MADV_COLD on anonymous memory because we rarely age anonymous LRU lists.
      
      * man-page material
      
      MADV_COLD (since Linux x.x)
      
      Pages in the specified regions will be treated as less-recently-accessed
      compared to pages in the system with similar access frequencies.  In
      contrast to MADV_FREE, the contents of the region are preserved regardless
      of subsequent writes to pages.
      
      MADV_COLD cannot be applied to locked pages, Huge TLB pages, or VM_PFNMAP
      pages.
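
      From user space, the hint might be used along these lines.  This is a
      minimal sketch under stated assumptions: `mark_range_cold()` and
      `cold_demo()` are hypothetical helper names, the fallback definition of
      MADV_COLD as 20 is taken from the Linux uapi headers for libcs that do
      not yet expose it, and kernels without MADV_COLD are expected to reject
      the call with EINVAL.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>

#ifndef MADV_COLD
#define MADV_COLD 20  /* Linux uapi value; older libc headers may lack it */
#endif

/*
 * Hint that [addr, addr + len) is not expected to be used soon.
 * Returns 0 on success or a negative errno value, for example -EINVAL
 * on a kernel that does not know MADV_COLD.
 */
static int mark_range_cold(void *addr, size_t len)
{
    if (madvise(addr, len, MADV_COLD) == -1)
        return -errno;
    return 0;
}

/* Hypothetical demo: fill an anonymous mapping, mark it cold, read it back. */
static int cold_demo(void)
{
    const size_t len = 16 * 4096;
    unsigned char *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
                              MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (buf == MAP_FAILED)
        return -errno;

    memset(buf, 0xAB, len);
    int ret = mark_range_cold(buf, len);

    /* The hint is non-destructive: the contents must still be there. */
    assert(buf[0] == 0xAB && buf[len - 1] == 0xAB);

    munmap(buf, len);
    return ret;
}
```

      A caller that treats the hint as best effort can simply ignore a
      negative return value, since the data is preserved either way.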
      
      [akpm@linux-foundation.org: resolve conflicts with hmm.git]
      Link: http://lkml.kernel.org/r/20190726023435.214162-2-minchan@kernel.org
      Signed-off-by: Minchan Kim <minchan@kernel.org>
      Reported-by: kbuild test robot <lkp@intel.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: James E.J. Bottomley <James.Bottomley@HansenPartnership.com>
      Cc: Richard Henderson <rth@twiddle.net>
      Cc: Ralf Baechle <ralf@linux-mips.org>
      Cc: Chris Zankel <chris@zankel.net>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Daniel Colascione <dancol@google.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Hillf Danton <hdanton@sina.com>
      Cc: Joel Fernandes (Google) <joel@joelfernandes.org>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Oleksandr Natalenko <oleksandr@redhat.com>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Sonny Rao <sonnyrao@google.com>
      Cc: Suren Baghdasaryan <surenb@google.com>
      Cc: Tim Murray <timmurray@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  8. 25 Sep, 2019 2 commits
  9. 15 Jul, 2019 2 commits
  10. 03 Jul, 2019 1 commit
  11. 02 Jul, 2019 1 commit
  12. 21 May, 2019 1 commit
  13. 15 May, 2019 1 commit
  14. 06 Mar, 2019 1 commit
  15. 22 Feb, 2019 1 commit
  16. 05 Jan, 2019 1 commit
  17. 29 Dec, 2018 1 commit
  18. 13 Nov, 2018 1 commit
    • L
      mm: Replace spin_is_locked() with lockdep · 35f3aa39
      Committed by Lance Roy
      lockdep_assert_held() is better suited to checking locking requirements,
      since it only checks if the current thread holds the lock regardless of
      whether someone else does. This is also a step towards possibly removing
      spin_is_locked().
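
      The difference between the two checks can be illustrated with a toy
      lock that records its owner.  This is a user-space sketch only: the
      `toy_lock` type, its helpers, and the use of plain integers as thread
      ids are all assumptions for illustration and are not kernel interfaces.

```c
#include <assert.h>
#include <stdbool.h>

#define NO_OWNER (-1)

/* Toy lock that remembers who holds it (hypothetical, not a kernel type). */
struct toy_lock {
    int owner;  /* id of the holding thread, or NO_OWNER */
};

static void toy_lock_acquire(struct toy_lock *l, int self)
{
    assert(l->owner == NO_OWNER);  /* this model has no contention */
    l->owner = self;
}

static void toy_lock_release(struct toy_lock *l, int self)
{
    assert(l->owner == self);
    l->owner = NO_OWNER;
}

/* spin_is_locked()-style check: true if anyone at all holds the lock. */
static bool toy_is_locked(const struct toy_lock *l)
{
    return l->owner != NO_OWNER;
}

/* lockdep_assert_held()-style check: true only if the caller holds it. */
static bool toy_held_by(const struct toy_lock *l, int self)
{
    return l->owner == self;
}
```

      With thread 1 holding the lock, `toy_is_locked()` is true for every
      caller, so a locking bug in thread 2 would pass unnoticed, whereas
      `toy_held_by(&l, 2)` is false and would expose it.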
      Signed-off-by: Lance Roy <ldr709@gmail.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Yang Shi <yang.shi@linux.alibaba.com>
      Cc: Matthew Wilcox <mawilcox@microsoft.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: <linux-mm@kvack.org>
      Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
  19. 27 Oct, 2018 1 commit
  20. 21 Oct, 2018 1 commit
  21. 30 Sep, 2018 1 commit
    • M
      xarray: Replace exceptional entries · 3159f943
      Committed by Matthew Wilcox
      Introduce xarray value entries and tagged pointers to replace radix
      tree exceptional entries.  This is a slight change in encoding to allow
      the use of an extra bit (we can now store BITS_PER_LONG - 1 bits in a
      value entry).  It is also a change in emphasis; exceptional entries are
      intimidating and different.  As the comment explains, you can choose
      to store values or pointers in the xarray and they are both first-class
      citizens.
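
      The value-entry encoding can be sketched as follows: the integer is
      shifted left by one and the low bit is set, which is why one bit fewer
      than BITS_PER_LONG is available.  The `sketch_` helper names are
      illustrative and chosen here; this is a user-space model of the
      encoding, not the xarray header itself.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>
#include <stdint.h>

/* Encode an integer as a value entry: shift left, then set the low bit. */
static inline void *sketch_mk_value(unsigned long v)
{
    assert((long)v >= 0);  /* the top bit would be lost by the shift */
    return (void *)(uintptr_t)((v << 1) | 1);
}

/* A value entry is recognised by its low bit; aligned pointers have it clear. */
static inline bool sketch_is_value(const void *entry)
{
    return ((uintptr_t)entry & 1) != 0;
}

/* Decode a value entry back into the integer it carries. */
static inline unsigned long sketch_to_value(const void *entry)
{
    return (unsigned long)((uintptr_t)entry >> 1);
}
```

      Because ordinary object pointers are at least 2-byte aligned, their low
      bit is zero, so values and pointers can share the same slot without
      ambiguity.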
      Signed-off-by: Matthew Wilcox <willy@infradead.org>
      Reviewed-by: Josef Bacik <jbacik@fb.com>
  22. 22 May, 2018 1 commit
    • D
      mm: introduce MEMORY_DEVICE_FS_DAX and CONFIG_DEV_PAGEMAP_OPS · e7638488
      Committed by Dan Williams
      In preparation for fixing dax-dma-vs-unmap issues, filesystems need to
      be able to rely on the fact that they will get wakeups on dev_pagemap
      page-idle events. Introduce MEMORY_DEVICE_FS_DAX and
      generic_dax_page_free() as common indicator / infrastructure for dax
      filesystems to require. With this change there are no users of the
      MEMORY_DEVICE_HOST designation, so remove it.
      
      The HMM sub-system extended dev_pagemap to arrange a callback when a
      dev_pagemap managed page is freed. Since a dev_pagemap page is free /
      idle when its reference count is 1 it requires an additional branch to
      check the page-type at put_page() time. Given put_page() is a hot-path
      we do not want to incur that check if HMM is not in use, so a static
      branch is used to avoid that overhead when not necessary.
      
      Now, the FS_DAX implementation wants to reuse this mechanism for
      receiving dev_pagemap ->page_free() callbacks. Rework the HMM-specific
      static-key into a generic mechanism that either HMM or FS_DAX code paths
      can enable.
      
      For ARCH=um builds, and any other arch that lacks ZONE_DEVICE support,
      care must be taken to compile out the DEV_PAGEMAP_OPS infrastructure.
      However, we still need to support FS_DAX in the FS_DAX_LIMITED case
      implemented by the s390/dcssblk driver.
      
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Reported-by: kbuild test robot <lkp@intel.com>
      Reported-by: Thomas Meyer <thomas@m3y3r.de>
      Reported-by: Dave Jiang <dave.jiang@intel.com>
      Cc: "Jérôme Glisse" <jglisse@redhat.com>
      Reviewed-by: Jan Kara <jack@suse.cz>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Dan Williams <dan.j.williams@intel.com>
  23. 06 Apr, 2018 1 commit
  24. 22 Feb, 2018 2 commits
    • M
      mm/swap.c: make functions and their kernel-doc agree (again) · cb6f0f34
      Committed by Mike Rapoport
      There was a conflict between commit e02a9f04 ("mm/swap.c: make
      functions and their kernel-doc agree") and commit f144c390 ("mm:
      docs: fix parameter names mismatch"), which both tried to fix the mismatch
      between the pagevec_lookup_entries() parameter names and their description.
      
      Since nr_entries is a better name for the parameter, fix the description
      again.
      
      Link: http://lkml.kernel.org/r/1518116946-20947-1-git-send-email-rppt@linux.vnet.ibm.com
      Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
      Acked-by: Randy Dunlap <rdunlap@infradead.org>
      Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: Matthew Wilcox <willy@infradead.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • S
      mm, mlock, vmscan: no more skipping pagevecs · 9c4e6b1a
      Committed by Shakeel Butt
      When a thread mlocks an address space backed either by file pages which
      are currently not present in memory or swapped out anon pages (not in
      swapcache), a new page is allocated and added to the local pagevec
      (lru_add_pvec), I/O is triggered and the thread then sleeps on the page.
      On I/O completion, the thread can wake on a different CPU; the mlock
      syscall will then set the PageMlocked() bit of the page but will not be
      able to put that page on the unevictable LRU, as the page is on the pagevec
      of a different CPU.  Even on drain, that page will go to the evictable LRU
      because the PageMlocked() bit is not checked on pagevec drain.
      
      The page will eventually go to the right LRU on reclaim, but the LRU stats
      will remain skewed for a long time.
      
      This patch puts all pages, even unevictable ones, on the pagevecs, and on
      drain the pages are added to the correct LRUs by checking their
      evictability.  This resolves the issue of mlocked pages sitting on the
      pagevecs of other CPUs, because when those pagevecs are drained, the mlocked
      file pages will go to the unevictable LRU.  It also makes the race with
      munlock easier to resolve because the pagevec drains happen under the LRU
      lock.
      
      However, there is still one place that makes a page evictable and does a
      PageLRU check on that page without the LRU lock, and it needs special
      attention: TestClearPageMlocked() and isolate_lru_page() in
      clear_page_mlock().
      
      	#0: __pagevec_lru_add_fn	#1: clear_page_mlock
      
      	SetPageLRU()			if (!TestClearPageMlocked())
      					  return
      	smp_mb() // <--required
      					// inside does PageLRU
      	if (!PageMlocked())		if (isolate_lru_page())
      	  move to evictable LRU		  putback_lru_page()
      	else
      	  move to unevictable LRU
      
      In '#1', TestClearPageMlocked() provides full memory barrier semantics
      and thus the PageLRU check (inside isolate_lru_page) cannot be
      reordered before it.
      
      In '#0', without an explicit memory barrier, the PageMlocked() check can be
      reordered before SetPageLRU().  If that happens, '#0' can put a page on the
      unevictable LRU while '#1' has just cleared the Mlocked bit of that page
      but fails to isolate it, because the PageLRU check fails as '#0' still
      hasn't set the PageLRU bit of that page.  That page will be stranded on the
      unevictable LRU.
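
      The ordering requirement between '#0' and '#1' can be modelled with C11
      atomics.  This is an illustrative user-space sketch, not the kernel
      code: `struct toy_page` and both functions are hypothetical, a
      sequentially consistent fence stands in for smp_mb(), and the explicit
      fence in the '#1' side models the full barrier that
      TestClearPageMlocked() already implies in the kernel.  The plain `list`
      field would need the LRU lock under real concurrency.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

enum lru_list { LRU_NONE, LRU_EVICTABLE, LRU_UNEVICTABLE };

struct toy_page {
    atomic_bool lru;      /* models the PageLRU bit */
    atomic_bool mlocked;  /* models the PageMlocked bit */
    enum lru_list list;   /* which list the page sits on */
};

/* '#0': models __pagevec_lru_add_fn(). */
static void toy_lru_add(struct toy_page *p)
{
    assert(!atomic_load(&p->lru));  /* page must not be on an LRU yet */
    atomic_store_explicit(&p->lru, true, memory_order_relaxed);
    /* Full barrier: the mlocked load must not move above the store. */
    atomic_thread_fence(memory_order_seq_cst);
    if (!atomic_load_explicit(&p->mlocked, memory_order_relaxed))
        p->list = LRU_EVICTABLE;
    else
        p->list = LRU_UNEVICTABLE;
}

/* '#1': models clear_page_mlock(); returns true if the page was moved. */
static bool toy_clear_mlock(struct toy_page *p)
{
    /* Test-and-clear of the mlocked bit. */
    if (!atomic_exchange(&p->mlocked, false))
        return false;
    /* Models the barrier implied by TestClearPageMlocked(). */
    atomic_thread_fence(memory_order_seq_cst);
    if (atomic_load_explicit(&p->lru, memory_order_relaxed)) {
        /* isolate_lru_page() + putback_lru_page(): re-evaluate the list. */
        p->list = LRU_EVICTABLE;
        return true;
    }
    return false;
}
```

      With a fence on both sides, at least one side observes the other's
      write: either '#0' sees the cleared mlocked bit and picks the evictable
      list, or '#1' sees the LRU bit and moves the page itself.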
      
      There is one (good) side effect though.  Without this patch, the pages
      allocated for a System V shared memory segment are added to evictable LRUs
      even after shmctl(SHM_LOCK) on that segment.  This patch will correctly
      put such pages on the unevictable LRU.
      
      Link: http://lkml.kernel.org/r/20171121211241.18877-1-shakeelb@google.com
      Signed-off-by: Shakeel Butt <shakeelb@google.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Jérôme Glisse <jglisse@redhat.com>
      Cc: Huang Ying <ying.huang@intel.com>
      Cc: Tim Chen <tim.c.chen@linux.intel.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Balbir Singh <bsingharora@gmail.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Shaohua Li <shli@fb.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Nicholas Piggin <npiggin@gmail.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  25. 07 Feb, 2018 1 commit
  26. 01 Feb, 2018 2 commits
  27. 16 Nov, 2017 1 commit