1. 04 May, 2017 10 commits
    • M
      mm: introduce memalloc_nofs_{save,restore} API · 7dea19f9
      Committed by Michal Hocko
      GFP_NOFS context is used for the following 5 reasons currently:
      
       - to prevent deadlocks when the lock held by the allocation
         context would be needed during the memory reclaim
      
       - to prevent stack overflows during the reclaim because the
         allocation is performed from a deep context already
      
       - to prevent lockups when the allocation context depends on other
         reclaimers to make forward progress indirectly
      
       - just in case because this would be safe from the fs POV
      
       - silence lockdep false positives
      
      Unfortunately overuse of this allocation context brings some problems to
      the MM.  Memory reclaim is much weaker (especially during heavy FS
      metadata workloads), OOM killer cannot be invoked because the MM layer
      doesn't have enough information about how much memory is freeable by the
      FS layer.
      
      In many cases it is far from clear why the weaker context is even used
      and so it might be used unnecessarily.  We would like to get rid of
      those as much as possible.  One way to do that is to use the flag in
      scopes rather than isolated cases.  Such a scope is declared when really
      necessary, tracked per task and all the allocation requests from within
      the context will simply inherit the GFP_NOFS semantic.
      
      Not only is this easier to understand and maintain, because there are
      far fewer problematic contexts than specific allocation requests, but it
      also helps code paths where the FS layer interacts with other layers (e.g.
      crypto, security modules, MM etc...) and there is no easy way to convey
      the allocation context between the layers.
      
      Introduce memalloc_nofs_{save,restore} API to control the scope of
      GFP_NOFS allocation context.  This is basically copying
      memalloc_noio_{save,restore} API we have for other restricted allocation
      context GFP_NOIO.  The PF_MEMALLOC_NOFS flag already exists and it is
      just an alias for PF_FSTRANS which has been xfs specific until recently.
      There are no PF_FSTRANS users anymore, so let's just drop it.
      
      PF_MEMALLOC_NOFS is now checked in the MM layer and drops __GFP_FS
      implicitly same as PF_MEMALLOC_NOIO drops __GFP_IO.  memalloc_noio_flags
      is renamed to current_gfp_context because it now cares about both
      PF_MEMALLOC_NOFS and PF_MEMALLOC_NOIO contexts.  Xfs code paths preserve
      their semantic.  kmem_flags_convert() doesn't need to evaluate the flag
      anymore.
      
      This patch shouldn't introduce any functional changes.
      
      Let's hope that filesystems will drop direct GFP_NOFS (resp.  ~__GFP_FS)
      usage as much as possible and only use properly documented
      memalloc_nofs_{save,restore} checkpoints where they are appropriate.
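
The scope mechanism described above can be sketched as a small userspace model. This is only an illustration of the save/restore pattern and of how a per-task flag could mask allocation flags; it is not the kernel code. All `model_`/`MODEL_` names and the bit values are hypothetical, and the assumption that the NOIO scope also clears the FS bit is a choice made for this model.

```c
#include <assert.h>

/* Illustrative bit values only; the kernel's real definitions differ. */
#define MODEL_GFP_IO     0x1u
#define MODEL_GFP_FS     0x2u
#define MODEL_GFP_KERNEL (MODEL_GFP_IO | MODEL_GFP_FS)

#define MODEL_PF_MEMALLOC_NOIO 0x1u
#define MODEL_PF_MEMALLOC_NOFS 0x2u

/* Stand-in for the per-task flags word of the running task. */
static unsigned int model_task_flags;

/* Enter a NOFS scope and return the previous state of the NOFS bit. */
static unsigned int model_memalloc_nofs_save(void)
{
    unsigned int old = model_task_flags & MODEL_PF_MEMALLOC_NOFS;

    model_task_flags |= MODEL_PF_MEMALLOC_NOFS;
    return old;
}

/* Leave a NOFS scope, restoring whatever state the matching save saw. */
static void model_memalloc_nofs_restore(unsigned int old)
{
    model_task_flags = (model_task_flags & ~MODEL_PF_MEMALLOC_NOFS) | old;
}

/* Mask an allocation request according to the scopes the task is in. */
static unsigned int model_current_gfp_context(unsigned int gfp)
{
    if (model_task_flags & MODEL_PF_MEMALLOC_NOIO)
        gfp &= ~(MODEL_GFP_IO | MODEL_GFP_FS); /* assumption: NOIO is the stricter scope */
    else if (model_task_flags & MODEL_PF_MEMALLOC_NOFS)
        gfp &= ~MODEL_GFP_FS;
    return gfp;
}
```

Because the save function returns the previous state, scopes nest: an inner restore leaves the outer scope in force, and only the outermost restore re-enables the FS bit for later requests.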
      
      [akpm@linux-foundation.org: fix comment typo, reflow comment]
      Link: http://lkml.kernel.org/r/20170306131408.9828-5-mhocko@kernel.org
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Theodore Ts'o <tytso@mit.edu>
      Cc: Chris Mason <clm@fb.com>
      Cc: David Sterba <dsterba@suse.cz>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Brian Foster <bfoster@redhat.com>
      Cc: Darrick J. Wong <darrick.wong@oracle.com>
      Cc: Nikolay Borisov <nborisov@suse.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • S
      mm: reclaim MADV_FREE pages · 802a3a92
      Committed by Shaohua Li
      When memory pressure is high, we free MADV_FREE pages.  If the pages are
      not dirty in pte, the pages could be freed immediately.  Otherwise we
      can't reclaim them.  We put the pages back to anonymous LRU list (by
      setting SwapBacked flag) and the pages will be reclaimed in normal
      swapout way.
      
      We use normal page reclaim policy.  Since MADV_FREE pages are put into
      inactive file list, such pages and inactive file pages are reclaimed
      according to their age.  This is expected, because we don't want to
      reclaim too many MADV_FREE pages before used once pages.
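
The clean/dirty handling described above can be modeled in a few lines. This is a hedged sketch rather than the reclaim code itself: the struct, the enum, and every `model_` name are hypothetical, and the three booleans merely stand in for the page state the text refers to (anonymous, SwapBacked, dirty in the pte).

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the page state the commit message talks about. */
struct model_page {
    bool anon;        /* page is anonymous memory */
    bool swap_backed; /* SwapBacked flag; cleared for MADV_FREE pages */
    bool pte_dirty;   /* page was written after being marked lazily freeable */
};

enum model_reclaim_action {
    MODEL_NOT_LAZYFREE,     /* ordinary page, handled by normal reclaim */
    MODEL_FREE_NOW,         /* clean MADV_FREE page, freed immediately */
    MODEL_BACK_TO_ANON_LRU, /* dirty MADV_FREE page, goes back to swapout */
};

/* An anonymous page without SwapBacked is treated as MADV_FREE. */
static bool model_is_lazyfree(const struct model_page *p)
{
    return p->anon && !p->swap_backed;
}

static enum model_reclaim_action model_reclaim_lazyfree(struct model_page *p)
{
    if (!model_is_lazyfree(p))
        return MODEL_NOT_LAZYFREE;
    if (!p->pte_dirty)
        return MODEL_FREE_NOW;
    /* Dirty again: set SwapBacked so the normal swapout path reclaims it. */
    p->swap_backed = true;
    return MODEL_BACK_TO_ANON_LRU;
}
```

Once a dirty page has had SwapBacked set again, the model no longer classifies it as MADV_FREE, which matches the idea that it rejoins the ordinary anonymous pages.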
      
      Based on Minchan's original patch
      
      [minchan@kernel.org: clean up lazyfree page handling]
        Link: http://lkml.kernel.org/r/20170303025237.GB3503@bbox
      Link: http://lkml.kernel.org/r/14b8eb1d3f6bf6cc492833f183ac8c304e560484.1487965799.git.shli@fb.com
      Signed-off-by: Shaohua Li <shli@fb.com>
      Signed-off-by: Minchan Kim <minchan@kernel.org>
      Acked-by: Minchan Kim <minchan@kernel.org>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • S
      mm: delete unnecessary TTU_* flags · a128ca71
      Committed by Shaohua Li
      Patch series "mm: fix some MADV_FREE issues", v5.
      
      We are trying to use MADV_FREE in jemalloc.  Several issues are found.
      Without solving the issues, jemalloc can't use the MADV_FREE feature.
      
       - Doesn't support system without swap enabled. Because if swap is off,
         we can't or can't efficiently age anonymous pages. And since
         MADV_FREE pages are mixed with other anonymous pages, we can't
         reclaim MADV_FREE pages. In current implementation, MADV_FREE will
         fallback to MADV_DONTNEED without swap enabled. But in our
         environment, a lot of machines don't enable swap. This will prevent
         our setup using MADV_FREE.
      
       - Increases memory pressure. Page reclaim is biased toward reclaiming
         file pages over anonymous pages. This doesn't make sense for
         MADV_FREE pages, because those pages could be freed easily and
         refilled with a very slight penalty. Even if page reclaim weren't
         biased toward file pages, there would still be an issue, because
         MADV_FREE pages and other anonymous pages are mixed together. To
         reclaim a MADV_FREE page, we probably must scan a lot of other
         anonymous pages, which is inefficient. In our tests, we usually see
         OOM with MADV_FREE enabled and none without it.
      
       - Accounting. There are two accounting problems. We don't have a global
         accounting. If the system is abnormal, we don't know if it's a
         problem from MADV_FREE side. The other problem is RSS accounting.
         MADV_FREE pages are accounted as normal anon pages and reclaimed
         lazily, so application's RSS becomes bigger. This confuses our
         workloads. We have monitoring daemon running and if it finds
         applications' RSS becomes abnormal, the daemon will kill the
         applications even kernel can reclaim the memory easily.
      
      To address the first two issues, we can either put MADV_FREE pages
      into a separate LRU list (Minchan's previous patches and V1 patches), or
      put them into LRU_INACTIVE_FILE list (suggested by Johannes).  The
      patchset uses the second idea.  The reason is LRU_INACTIVE_FILE list is
      tiny nowadays and should be full of used once file pages.  So we can
      still efficiently reclaim MADV_FREE pages there without interference
      with other anon and active file pages.  Putting the pages into inactive
      file list also has an advantage which allows page reclaim to prioritize
      MADV_FREE pages and used once file pages.  MADV_FREE pages are put into
      the lru list with the SwapBacked flag cleared, so PageAnon(page) &&
      !PageSwapBacked(page) will indicate a MADV_FREE page.  These pages will
      be directly freed without pageout if they are clean, otherwise normal swap
      will reclaim them.
      
      For the third issue, the previous post adds global accounting and a
      separate RSS count for MADV_FREE pages.  The problem is we never get
      accurate accounting for MADV_FREE pages.  The pages are mapped to
      userspace, can be dirtied without notice from kernel side.  To get
      accurate accounting, we could write protect the page, but then there is
      extra page fault overhead, which people don't want to pay.  Jemalloc
      guys have concerns about the inaccurate accounting, so this post drops
      the accounting patches temporarily.  The info exported to
      /proc/pid/smaps for MADV_FREE pages is kept, which is the only place we
      can get accurate accounting right now.
      
      This patch (of 6):
      
      Johannes pointed out TTU_LZFREE is unnecessary.  It's true because we
      always have the flag set if we want to do an unmap.  For cases we don't
      do an unmap, the TTU_LZFREE part of code should never run.
      
      Also the TTU_UNMAP is unnecessary.  If no other flags set (for example,
      TTU_MIGRATION), an unmap is implied.
      
      The patch includes Johannes's cleanup and dead TTU_ACTION macro removal
      code
      
      Link: http://lkml.kernel.org/r/4be3ea1bc56b26fd98a54d0a6f70bec63f6d8980.1487965799.git.shli@fb.com
      Signed-off-by: Shaohua Li <shli@fb.com>
      Suggested-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Minchan Kim <minchan@kernel.org>
      Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • J
      Revert "mm, vmscan: account for skipped pages as a partial scan" · 3db65812
      Committed by Johannes Weiner
      This reverts commit d7f05528.
      
      Now that reclaimability of a node is no longer based on the ratio
      between pages scanned and theoretically reclaimable pages, we can remove
      accounting tricks for pages skipped due to zone constraints.
      
      Link: http://lkml.kernel.org/r/20170228214007.5621-9-hannes@cmpxchg.org
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Jia He <hejianet@gmail.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • J
      mm: delete NR_PAGES_SCANNED and pgdat_reclaimable() · c822f622
      Committed by Johannes Weiner
      NR_PAGES_SCANNED counts number of pages scanned since the last page free
      event in the allocator.  This was used primarily to measure the
      reclaimability of zones and nodes, and determine when reclaim should
      give up on them.  In that role, it has been replaced in the preceding
      patches by a different mechanism.
      
      Being implemented as an efficient vmstat counter, it was automatically
      exported to userspace as well.  It's however unlikely that anyone
      outside the kernel is using this counter in any meaningful way.
      
      Remove the counter and the unused pgdat_reclaimable().
      
      Link: http://lkml.kernel.org/r/20170228214007.5621-8-hannes@cmpxchg.org
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Jia He <hejianet@gmail.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • J
      mm: don't avoid high-priority reclaim on memcg limit reclaim · 688035f7
      Committed by Johannes Weiner
      Commit 246e87a9 ("memcg: fix get_scan_count() for small targets")
      sought to avoid high reclaim priorities for memcg by forcing it to scan
      a minimum amount of pages when lru_pages >> priority yielded nothing.
      This was done at a time when reclaim decisions like dirty throttling
      were tied to the priority level.
      
      Nowadays, the only meaningful thing still tied to priority dropping
      below DEF_PRIORITY - 2 is gating whether laptop_mode=1 is generally
      allowed to write.  But that is from an era where direct reclaim was
      still allowed to call ->writepage, and kswapd nowadays avoids writes
      until it's scanned every clean page in the system.  Potential changes to
      how quick sc->may_writepage could trigger are of little concern.
      
      Remove the force_scan stuff, as well as the ugly multi-pass target
      calculation that it necessitated.
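
The "lru_pages >> priority yielded nothing" case is plain shift arithmetic, which a short sketch can make concrete. The helper names below are hypothetical; the starting priority of 12 is an assumption of this model, chosen to match the kernel's DEF_PRIORITY. A numerically lower priority means a more aggressive scan.

```c
#include <assert.h>

/* Assumed starting priority for this model (least aggressive level). */
#define MODEL_DEF_PRIORITY 12

/* Scan target for one LRU list at a given priority level. */
static unsigned long model_scan_target(unsigned long lru_pages, int priority)
{
    return lru_pages >> priority;
}

/*
 * Walk from the least aggressive level toward 0 and return the first
 * priority at which the list contributes any pages to scan, or -1 if
 * the list is empty.
 */
static int model_first_effective_priority(unsigned long lru_pages)
{
    for (int prio = MODEL_DEF_PRIORITY; prio >= 0; prio--) {
        if (model_scan_target(lru_pages, prio) > 0)
            return prio;
    }
    return -1;
}
```

For a small list of 1000 pages the target stays at zero until priority 9, so without a forced minimum scan the reclaimer simply drops a few priority levels before it finds work, which is the behavior this change accepts.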
      
      Link: http://lkml.kernel.org/r/20170228214007.5621-7-hannes@cmpxchg.org
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Jia He <hejianet@gmail.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • J
      mm: don't avoid high-priority reclaim on unreclaimable nodes · a2d7f8e4
      Committed by Johannes Weiner
      Commit 246e87a9 ("memcg: fix get_scan_count() for small targets")
      sought to avoid high reclaim priorities for kswapd by forcing it to scan
      a minimum amount of pages when lru_pages >> priority yielded nothing.
      
      Commit b95a2f2d ("mm: vmscan: convert global reclaim to per-memcg
      LRU lists"), due to switching global reclaim to a round-robin scheme
      over all cgroups, had to restrict this forceful behavior to
      unreclaimable zones in order to prevent massive overreclaim with many
      cgroups.
      
      The latter patch effectively neutered the behavior completely for all
      but extreme memory pressure.  But in those situations we might as well
      drop the reclaimers to lower priority levels.  Remove the check.
      
      Link: http://lkml.kernel.org/r/20170228214007.5621-6-hannes@cmpxchg.org
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Jia He <hejianet@gmail.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • J
      mm: remove seemingly spurious reclaimability check from laptop_mode gating · 047d72c3
      Committed by Johannes Weiner
      Commit 1d82de61 ("mm, vmscan: make kswapd reclaim in terms of
      nodes") allowed laptop_mode=1 to start writing not just when the
      priority drops to DEF_PRIORITY - 2 but also when the node is
      unreclaimable.
      
      That appears to be a spurious change in this patch as I doubt the series
      was tested with laptop_mode, and neither is that particular change
      mentioned in the changelog.  Remove it, it's still recent.
      
      Link: http://lkml.kernel.org/r/20170228214007.5621-4-hannes@cmpxchg.org
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com>
      Acked-by: Mel Gorman <mgorman@techsingularity.net>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Jia He <hejianet@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • J
      mm: fix check for reclaimable pages in PF_MEMALLOC reclaim throttling · d450abd8
      Committed by Johannes Weiner
      PF_MEMALLOC direct reclaimers get throttled on a node when the sum of
      all free pages in each zone falls below half the min watermark.  During
      the summation, we want to exclude zones that don't have reclaimables.
      Checking the same pgdat over and over again doesn't make sense.
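
The summation described above can be sketched as follows. This is a simplified model under stated assumptions, not the kernel function: the `model_zone` struct and its fields are hypothetical, the reserve is taken as the sum of the per-zone min watermarks, and a node with no reserve at all is assumed not to throttle.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-zone numbers, all in pages. */
struct model_zone {
    unsigned long free_pages;
    unsigned long min_wmark;
    unsigned long reclaimable_pages;
};

/*
 * Return true if a direct reclaimer may proceed on this node, false if
 * it should be throttled. Zones with nothing reclaimable are left out
 * of both sums, and the check is done per zone rather than repeating a
 * node-wide test on every iteration.
 */
static bool model_allow_direct_reclaim(const struct model_zone *zones, size_t nr)
{
    unsigned long reserve = 0;
    unsigned long free_pages = 0;

    for (size_t i = 0; i < nr; i++) {
        if (zones[i].reclaimable_pages == 0)
            continue;
        reserve += zones[i].min_wmark;
        free_pages += zones[i].free_pages;
    }

    if (reserve == 0) /* assumption: nothing to protect, do not throttle */
        return true;

    return free_pages > reserve / 2;
}
```

In the model, a zone with plenty of free memory but nothing reclaimable no longer hides the shortage in a neighboring zone that still has reclaimable pages.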
      
      Fixes: 599d0c95 ("mm, vmscan: move LRU lists to node")
      Link: http://lkml.kernel.org/r/20170228214007.5621-3-hannes@cmpxchg.org
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Jia He <hejianet@gmail.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • J
      mm: fix 100% CPU kswapd busyloop on unreclaimable nodes · c73322d0
      Committed by Johannes Weiner
      Patch series "mm: kswapd spinning on unreclaimable nodes - fixes and
      cleanups".
      
      Jia reported a scenario in which the kswapd of a node indefinitely spins
      at 100% CPU usage.  We have seen similar cases at Facebook.
      
      The kernel's current method of judging its ability to reclaim a node (or
      whether to back off and sleep) is based on the amount of scanned pages
      in proportion to the amount of reclaimable pages.  In Jia's and our
      scenarios, there are no reclaimable pages in the node, however, and the
      condition for backing off is never met.  Kswapd busyloops in an attempt
      to restore the watermarks while having nothing to work with.
      
      This series reworks the definition of an unreclaimable node based not on
      scanning but on whether kswapd is able to actually reclaim pages in
      MAX_RECLAIM_RETRIES (16) consecutive runs.  This is the same criterion
      the page allocator uses for giving up on direct reclaim and invoking the
      OOM killer.  If it cannot free any pages, kswapd will go to sleep and
      leave further attempts to direct reclaim invocations, which will either
      make progress and re-enable kswapd, or invoke the OOM killer.
      
      Patch #1 fixes the immediate problem Jia reported, the remainder are
      smaller fixlets, cleanups, and overall phasing out of the old method.
      
      Patch #6 is the odd one out.  It's a nice cleanup to get_scan_count(),
      and directly related to #5, but in itself not relevant to the series.
      
      If the whole series is too ambitious for 4.11, I would consider the
      first three patches fixes, the rest cleanups.
      
      This patch (of 9):
      
      Jia He reports a problem with kswapd spinning at 100% CPU when
      requesting more hugepages than memory available in the system:
      
      $ echo 4000 >/proc/sys/vm/nr_hugepages
      
      top - 13:42:59 up  3:37,  1 user,  load average: 1.09, 1.03, 1.01
      Tasks:   1 total,   1 running,   0 sleeping,   0 stopped,   0 zombie
      %Cpu(s):  0.0 us, 12.5 sy,  0.0 ni, 85.5 id,  2.0 wa,  0.0 hi,  0.0 si,  0.0 st
      KiB Mem:  31371520 total, 30915136 used,   456384 free,      320 buffers
      KiB Swap:  6284224 total,   115712 used,  6168512 free.    48192 cached Mem
      
        PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
         76 root      20   0       0      0      0 R 100.0 0.000 217:17.29 kswapd3
      
      At that time, there are no reclaimable pages left in the node, but as
      kswapd fails to restore the high watermarks it refuses to go to sleep.
      
      Kswapd needs to back away from nodes that fail to balance.  Up until
      commit 1d82de61 ("mm, vmscan: make kswapd reclaim in terms of
      nodes") kswapd had such a mechanism.  It considered zones whose
      theoretically reclaimable pages it had reclaimed six times over as
      unreclaimable and backed away from them.  This guard was erroneously
      removed as the patch changed the definition of a balanced node.
      
      However, simply restoring this code wouldn't help in the case reported
      here: there *are* no reclaimable pages that could be scanned until the
      threshold is met.  Kswapd would stay awake anyway.
      
      Introduce a new and much simpler way of backing off.  If kswapd runs
      through MAX_RECLAIM_RETRIES (16) cycles without reclaiming a single
      page, make it back off from the node.  This is the same number of shots
      direct reclaim takes before declaring OOM.  Kswapd will go to sleep on
      that node until a direct reclaimer manages to reclaim some pages, thus
      proving the node reclaimable again.
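
The back-off rule can be modeled with a single failure counter per node. The sketch below is a userspace illustration only: the struct, the function names, and the choice to reset the counter on any successful reclaim are assumptions of this model, while the limit of 16 comes from the text above.

```c
#include <assert.h>
#include <stdbool.h>

#define MODEL_MAX_RECLAIM_RETRIES 16

/* Hypothetical per-node state: consecutive kswapd runs that freed nothing. */
struct model_node {
    int kswapd_failures;
};

/* Record the outcome of one kswapd balancing run on the node. */
static void model_note_kswapd_run(struct model_node *node, unsigned long nr_reclaimed)
{
    if (nr_reclaimed == 0)
        node->kswapd_failures++;
    else
        node->kswapd_failures = 0; /* assumption: any progress resets the count */
}

/* A direct reclaimer that frees pages proves the node reclaimable again. */
static void model_note_direct_reclaim(struct model_node *node, unsigned long nr_reclaimed)
{
    if (nr_reclaimed > 0)
        node->kswapd_failures = 0;
}

/* kswapd backs off once it has failed the maximum number of times in a row. */
static bool model_kswapd_should_back_off(const struct model_node *node)
{
    return node->kswapd_failures >= MODEL_MAX_RECLAIM_RETRIES;
}
```

Unlike a threshold based on pages scanned, this counter still advances when there is nothing to scan at all, which is the case in the report above.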
      
      [hannes@cmpxchg.org: check kswapd failure against the cumulative nr_reclaimed count]
        Link: http://lkml.kernel.org/r/20170306162410.GB2090@cmpxchg.org
      [shakeelb@google.com: fix condition for throttle_direct_reclaim]
        Link: http://lkml.kernel.org/r/20170314183228.20152-1-shakeelb@google.com
      Link: http://lkml.kernel.org/r/20170228214007.5621-2-hannes@cmpxchg.org
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: Shakeel Butt <shakeelb@google.com>
      Reported-by: Jia He <hejianet@gmail.com>
      Tested-by: Jia He <hejianet@gmail.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com>
      Acked-by: Minchan Kim <minchan@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  2. 02 Mar, 2017 1 commit
  3. 25 Feb, 2017 6 commits
    • M
      mm, vmscan: clear PGDAT_WRITEBACK when zone is balanced · c2f83143
      Committed by Mel Gorman
      Hillf Danton pointed out that since commit 1d82de61 ("mm, vmscan:
      make kswapd reclaim in terms of nodes"), PGDAT_WRITEBACK is no
      longer cleared.
      
      It was not noticed as triggering it requires pages under writeback to
      cycle twice through the LRU and before kswapd gets stalled.
      Historically, such issues tended to occur on small machines writing
      heavily to slow storage such as a USB stick.
      
      Once kswapd stalls, direct reclaim stalls may be higher but due to the
      fact that memory pressure is required, it would not be very noticeable.
      
      Michal Hocko suggested removing the flag entirely but the conservative
      fix is to restore the intended PGDAT_WRITEBACK behaviour and clear the
      flag when a suitable zone is balanced.
      
      Fixes: 1d82de61 ("mm, vmscan: make kswapd reclaim in terms of nodes")
      Link: http://lkml.kernel.org/r/20170203203222.gq7hk66yc36lpgtb@suse.de
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com>
      Cc: Minchan Kim <minchan.kim@gmail.com>
      Cc: Rik van Riel <riel@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • J
      mm: vmscan: move dirty pages out of the way until they're flushed · c55e8d03
      Committed by Johannes Weiner
      We noticed a performance regression when moving hadoop workloads from
      3.10 kernels to 4.0 and 4.6.  This is accompanied by increased pageout
      activity initiated by kswapd as well as frequent bursts of allocation
      stalls and direct reclaim scans.  Even lowering the dirty ratios to the
      equivalent of less than 1% of memory would not eliminate the issue,
      suggesting that dirty pages concentrate where the scanner is looking.
      
      This can be traced back to recent efforts of thrash avoidance.  Where
      3.10 would not detect refaulting pages and continuously supply clean
      cache to the inactive list, a thrashing workload on 4.0+ will detect and
      activate refaulting pages right away, distilling used-once pages on the
      inactive list much more effectively.  This is by design, and it makes
      sense for clean cache.  But for the most part our workload's cache
      faults are refaults and its use-once cache is from streaming writes.  We
      end up with most of the inactive list dirty, and we don't go after the
      active cache as long as we have use-once pages around.
      
      But waiting for writes to avoid reclaiming clean cache that *might*
      refault is a bad trade-off.  Even if the refaults happen, reads are
      faster than writes.  Before getting bogged down on writeback, reclaim
      should first look at *all* cache in the system, even active cache.
      
      To accomplish this, activate pages that are dirty or under writeback
      when they reach the end of the inactive LRU.  The pages are marked for
      immediate reclaim, meaning they'll get moved back to the inactive LRU
      tail as soon as they're written back and become reclaimable.  But in the
      meantime, by reducing the inactive list to only immediately reclaimable
      pages, we allow the scanner to deactivate and refill the inactive list
      with clean cache from the active list tail to guarantee forward
      progress.
      
      [hannes@cmpxchg.org: update comment]
        Link: http://lkml.kernel.org/r/20170202191957.22872-8-hannes@cmpxchg.org
      Link: http://lkml.kernel.org/r/20170123181641.23938-6-hannes@cmpxchg.org
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Minchan Kim <minchan@kernel.org>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Rik van Riel <riel@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • J
      mm: vmscan: only write dirty pages that the scanner has seen twice · 4eda4823
      Committed by Johannes Weiner
      Dirty pages can easily reach the end of the LRU while there are still
      clean pages to reclaim around.  Don't let kswapd write them back just
      because there are a lot of them.  It costs more CPU to find the clean
      pages, but that's almost certainly better than to disrupt writeback from
      the flushers with LRU-order single-page writes from reclaim.  And the
      flushers have been woken up by that point, so we spend IO capacity on
      flushing and CPU capacity on finding the clean cache.
      
      Only start writing dirty pages if they have cycled around the LRU twice
      now and STILL haven't been queued on the IO device.  It's possible that
      the dirty pages are so sparsely distributed across different bdis,
      inodes, memory cgroups, that the flushers take forever to get to the
      ones we want reclaimed.  Once we see them twice on the LRU, we know
      that's the quicker way to find them, so do LRU writeback.
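
The "seen twice" rule can be illustrated with a tiny state machine. This is a deliberately simplified sketch: the struct, the `seen_dirty_before` field, and the function names are hypothetical, and the model ignores everything else reclaim looks at (flusher state, writeback in progress, reclaimer type).

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical view of a page arriving at the tail of the inactive LRU. */
struct model_lru_page {
    bool dirty;
    bool seen_dirty_before; /* set the first time the scanner finds it dirty */
};

enum model_tail_action {
    MODEL_RECLAIM_CLEAN, /* clean page, reclaim it right away */
    MODEL_SKIP_AND_MARK, /* first dirty encounter, leave it to the flushers */
    MODEL_WRITE_BACK,    /* second dirty encounter, write it from reclaim */
};

static enum model_tail_action model_handle_lru_tail(struct model_lru_page *page)
{
    if (!page->dirty)
        return MODEL_RECLAIM_CLEAN;

    if (!page->seen_dirty_before) {
        /* Remember the encounter so the next pass can tell it cycled. */
        page->seen_dirty_before = true;
        return MODEL_SKIP_AND_MARK;
    }

    /* Still dirty after a full trip around the LRU: do LRU writeback. */
    return MODEL_WRITE_BACK;
}
```

A page that the flushers clean between the two encounters takes the first branch on its second visit and is reclaimed without any write from the scanner.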
      
      Link: http://lkml.kernel.org/r/20170123181641.23938-5-hannes@cmpxchg.org
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Minchan Kim <minchan@kernel.org>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Mel Gorman <mgorman@suse.de>
      Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com>
      Cc: Rik van Riel <riel@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • J
      mm: vmscan: remove old flusher wakeup from direct reclaim path · bbef9384
      Committed by Johannes Weiner
      Direct reclaim has been replaced by kswapd reclaim in pretty much all
      common memory pressure situations, so this code most likely doesn't
      accomplish the described effect anymore.  The previous patch wakes up
      flushers for all reclaimers when we encounter dirty pages at the tail
      end of the LRU.  Remove the crufty old direct reclaim invocation.
      
      Link: http://lkml.kernel.org/r/20170123181641.23938-4-hannes@cmpxchg.org
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Minchan Kim <minchan@kernel.org>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Rik van Riel <riel@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • J
      mm: vmscan: kick flushers when we encounter dirty pages on the LRU · 726d061f
      Committed by Johannes Weiner
      Memory pressure can put dirty pages at the end of the LRU without
      anybody running into dirty limits.  Don't start writing individual pages
      from kswapd while the flushers might be asleep.
      
      Unlike the old direct reclaim flusher wakeup (removed in the next patch)
      that flushes the number of pages just scanned, this patch wakes the
      flushers for all outstanding dirty pages.  That seemed to perform better
      in a synthetic test that pushes dirty pages to the end of the LRU and
      into reclaim, because we know LRU aging outstrips writeback already, and
      this way we give younger dirty pages a headstart rather than wait until
      reclaim runs into them as well.  It also means less plugging and risk of
      exhausting the struct request pool from reclaim.
      
      There is a concern that this will cause temporary files that used to get
      dirtied and truncated before writeback to now get written to disk under
      memory pressure.  If this turns out to be a real problem, we'll have to
      revisit this and tame the reclaim flusher wakeups.
      
      [hannes@cmpxchg.org: mention dirty expiration as a condition]
        Link: http://lkml.kernel.org/r/20170126174739.GA30636@cmpxchg.org
      Link: http://lkml.kernel.org/r/20170123181641.23938-3-hannes@cmpxchg.org
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Minchan Kim <minchan@kernel.org>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Mel Gorman <mgorman@suse.de>
      Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com>
      Cc: Rik van Riel <riel@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • J
      mm: vmscan: scan dirty pages even in laptop mode · 1276ad68
      Committed by Johannes Weiner
      Patch series "mm: vmscan: fix kswapd writeback regression".
      
      We noticed a regression on multiple hadoop workloads when moving from
      3.10 to 4.0 and 4.6, which involves kswapd getting tangled up in page
      writeout, causing direct reclaim herds that also don't make progress.
      
      I tracked it down to the thrash avoidance efforts after 3.10 that make
      the kernel better at keeping use-once cache and use-many cache sorted on
      the inactive and active list, with more aggressive protection of the
      active list as long as there is inactive cache.  Unfortunately, our
      workload's use-once cache is mostly from streaming writes.  Waiting for
      writes to avoid potential reloads in the future is not a good tradeoff.
      
      These patches do the following:
      
      1. Wake the flushers when kswapd sees a lump of dirty pages. It's
         possible to be below the dirty background limit and still have cache
         velocity push them through the LRU. So start a-flushin'.
      
      2. Let kswapd only write pages that have been rotated twice. This makes
         sure we really tried to get all the clean pages on the inactive list
         before resorting to horrible LRU-order writeback.
      
      3. Move rotating dirty pages off the inactive list. Instead of churning
         or waiting on page writeback, we'll go after clean active cache. This
         might lead to thrashing, but in this state memory demand outstrips IO
         speed anyway, and reads are faster than writes.
      
      Mel backported the series to 4.10-rc5 with one minor conflict and ran a
      couple of tests on it.  Mix of read/write random workload didn't show
      anything interesting.  Write-only database didn't show much difference
      in performance but there were slight reductions in IO -- probably in the
      noise.
      
      simoop did show big differences although not as big as Mel expected.
      This is Chris Mason's workload that simulates the VM activity of hadoop.
      Mel won't go through the full details but over the samples measured
      during an hour it reported
      
                                               4.10.0-rc5            4.10.0-rc5
                                                  vanilla         johannes-v1r1
      Amean    p50-Read             21346531.56 (  0.00%) 21697513.24 ( -1.64%)
      Amean    p95-Read             24700518.40 (  0.00%) 25743268.98 ( -4.22%)
      Amean    p99-Read             27959842.13 (  0.00%) 28963271.11 ( -3.59%)
      Amean    p50-Write                1138.04 (  0.00%)      989.82 ( 13.02%)
      Amean    p95-Write             1106643.48 (  0.00%)    12104.00 ( 98.91%)
      Amean    p99-Write             1569213.22 (  0.00%)    36343.38 ( 97.68%)
      Amean    p50-Allocation          85159.82 (  0.00%)    79120.70 (  7.09%)
      Amean    p95-Allocation         204222.58 (  0.00%)   129018.43 ( 36.82%)
      Amean    p99-Allocation         278070.04 (  0.00%)   183354.43 ( 34.06%)
      Amean    final-p50-Read       21266432.00 (  0.00%) 21921792.00 ( -3.08%)
      Amean    final-p95-Read       24870912.00 (  0.00%) 26116096.00 ( -5.01%)
      Amean    final-p99-Read       28147712.00 (  0.00%) 29523968.00 ( -4.89%)
      Amean    final-p50-Write          1130.00 (  0.00%)      977.00 ( 13.54%)
      Amean    final-p95-Write       1033216.00 (  0.00%)     2980.00 ( 99.71%)
      Amean    final-p99-Write       1517568.00 (  0.00%)    32672.00 ( 97.85%)
      Amean    final-p50-Allocation    86656.00 (  0.00%)    78464.00 (  9.45%)
      Amean    final-p95-Allocation   211712.00 (  0.00%)   116608.00 ( 44.92%)
      Amean    final-p99-Allocation   287232.00 (  0.00%)   168704.00 ( 41.27%)
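
      The percentages in parentheses appear to be the relative change against
      the vanilla column, with positive values meaning lower (better) latency.
      A minimal sketch of that arithmetic, assuming this interpretation; the
      function names are illustrative and not part of the benchmark tooling:

```c
#include <stdio.h>

/* Relative improvement of `patched` over `vanilla`, in percent.
 * Positive means the patched kernel reported a lower latency.
 * Assumption: this is how the table's parenthesised column is derived. */
double improvement_pct(double vanilla, double patched)
{
	return (vanilla - patched) / vanilla * 100.0;
}

/* Print one row in roughly the layout of the table above. */
void print_row(const char *name, double vanilla, double patched)
{
	printf("Amean    %-22s %14.2f (  0.00%%) %14.2f (%6.2f%%)\n",
	       name, vanilla, patched, improvement_pct(vanilla, patched));
}
```

      For example, print_row("p95-Write", 1106643.48, 12104.00) yields an
      improvement of about 98.91%, matching the p95-Write row.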
      
      The latencies are actually completely horrific in comparison to 4.4 (and
      4.10-rc5 is worse than 4.9 according to historical data for reasons Mel
      hasn't analysed yet).
      
      Still, 95% of write latency (p95-write) is halved by the series and
      allocation latency is way down.  Direct reclaim activity is one fifth of
      what it was according to vmstats.  Kswapd activity is higher but this is
      not necessarily surprising.  Kswapd efficiency is unchanged at 99% (99%
      of pages scanned were reclaimed) but direct reclaim efficiency went from
      77% to 99%.
      
      In the vanilla kernel, 627MB of data was written back from reclaim
      context.  With the series, no data was written back.  With or without
      the patch, pages are being immediately reclaimed after writeback
      completes.  However, with the patch, only 1/8th of the pages are
      reclaimed like this.
      
      This patch (of 5):
      
      We have an elaborate dirty/writeback throttling mechanism inside the
      reclaim scanner, but for that to work the pages have to go through
      shrink_page_list() and get counted for what they are.  Otherwise, we
      mess up the LRU order and don't match reclaim speed to writeback.
      
      Especially during deactivation, there is never a reason to skip dirty
      pages; nothing is even trying to write them out from there.  Don't mess
      up the LRU order for nothing, shuffle these pages along.
      
      Link: http://lkml.kernel.org/r/20170123181641.23938-2-hannes@cmpxchg.org
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Minchan Kim <minchan@kernel.org>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Mel Gorman <mgorman@suse.de>
      Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com>
      Cc: Rik van Riel <riel@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      1276ad68
  4. 23 Feb, 2017 10 commits
  5. 11 Jan, 2017 1 commit
    • M
      mm, memcg: fix the active list aging for lowmem requests when memcg is enabled · b4536f0c
      Michal Hocko committed
      Nils Holland and Klaus Ethgen have reported unexpected OOM killer
      invocations on 32b kernels starting with 4.8:
      
      	kworker/u4:5 invoked oom-killer: gfp_mask=0x2400840(GFP_NOFS|__GFP_NOFAIL), nodemask=0, order=0, oom_score_adj=0
      	kworker/u4:5 cpuset=/ mems_allowed=0
      	CPU: 1 PID: 2603 Comm: kworker/u4:5 Not tainted 4.9.0-gentoo #2
      	[...]
      	Mem-Info:
      	active_anon:58685 inactive_anon:90 isolated_anon:0
      	 active_file:274324 inactive_file:281962 isolated_file:0
      	 unevictable:0 dirty:649 writeback:0 unstable:0
      	 slab_reclaimable:40662 slab_unreclaimable:17754
      	 mapped:7382 shmem:202 pagetables:351 bounce:0
      	 free:206736 free_pcp:332 free_cma:0
      	Node 0 active_anon:234740kB inactive_anon:360kB active_file:1097296kB inactive_file:1127848kB unevictable:0kB isolated(anon):0kB isolated(file):0kB mapped:29528kB dirty:2596kB writeback:0kB shmem:0kB shmem_thp: 0kB shmem_pmdmapped: 184320kB anon_thp: 808kB writeback_tmp:0kB unstable:0kB pages_scanned:0 all_unreclaimable? no
      	DMA free:3952kB min:788kB low:984kB high:1180kB active_anon:0kB inactive_anon:0kB active_file:7316kB inactive_file:0kB unevictable:0kB writepending:96kB present:15992kB managed:15916kB mlocked:0kB slab_reclaimable:3200kB slab_unreclaimable:1408kB kernel_stack:0kB pagetables:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
      	lowmem_reserve[]: 0 813 3474 3474
      	Normal free:41332kB min:41368kB low:51708kB high:62048kB active_anon:0kB inactive_anon:0kB active_file:532748kB inactive_file:44kB unevictable:0kB writepending:24kB present:897016kB managed:836248kB mlocked:0kB slab_reclaimable:159448kB slab_unreclaimable:69608kB kernel_stack:1112kB pagetables:1404kB bounce:0kB free_pcp:528kB local_pcp:340kB free_cma:0kB
      	lowmem_reserve[]: 0 0 21292 21292
      	HighMem free:781660kB min:512kB low:34356kB high:68200kB active_anon:234740kB inactive_anon:360kB active_file:557232kB inactive_file:1127804kB unevictable:0kB writepending:2592kB present:2725384kB managed:2725384kB mlocked:0kB slab_reclaimable:0kB slab_unreclaimable:0kB kernel_stack:0kB pagetables:0kB bounce:0kB free_pcp:800kB local_pcp:608kB free_cma:0kB
      
      The OOM killer is clearly premature because there is still a lot
      of page cache in the zone Normal which should satisfy this lowmem
      request.  Further debugging has shown that the reclaim cannot make any
      forward progress because the page cache is hidden in the active list
      which doesn't get rotated because inactive_list_is_low is not memcg
      aware.
      
      The code simply subtracts per-zone highmem counters from the respective
      memcg's lru sizes which doesn't make any sense.  We can simply end up
      always seeing the resulting active and inactive counts 0 and return
      false.  This issue is not limited to 32b kernels but in practice the
      effect on systems without CONFIG_HIGHMEM would be much harder to notice
      because we do not invoke the OOM killer for allocation requests
      targeting < ZONE_NORMAL.
      
      Fix the issue by tracking per zone lru page counts in mem_cgroup_per_node
      and subtract per-memcg highmem counts when memcg is enabled.  Introduce
      helper lruvec_zone_lru_size which redirects to either zone counters or
      mem_cgroup_get_zone_lru_size when appropriate.
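
      The accounting mismatch can be illustrated with a small user-space
      model.  This is only a sketch under stated assumptions: the toy_lru
      structure, the zone enum and both helper functions are hypothetical
      and far simpler than the real lruvec code.  It shows why subtracting
      global per-zone counts from a memcg-local total tends to clamp to
      zero, while subtracting the memcg's own per-zone counts does not:

```c
/* Hypothetical zone indices for the model; not the kernel's enum. */
enum { TOY_ZONE_DMA, TOY_ZONE_NORMAL, TOY_ZONE_HIGHMEM, TOY_NR_ZONES };

/* Hypothetical, simplified per-memcg LRU bookkeeping. */
struct toy_lru {
	unsigned long total;                  /* pages on this memcg's LRU */
	unsigned long per_zone[TOY_NR_ZONES]; /* the same pages, by zone   */
};

/* Old behaviour (modelled): subtract the *global* count of every zone
 * above `highest_zone` from the memcg-local total, clamping at zero. */
unsigned long toy_eligible_old(const struct toy_lru *memcg,
			       const unsigned long *global_zone,
			       int highest_zone)
{
	unsigned long size = memcg->total;

	for (int z = highest_zone + 1; z < TOY_NR_ZONES; z++) {
		unsigned long sub = global_zone[z];
		size -= sub < size ? sub : size;
	}
	return size;
}

/* Fixed behaviour (modelled): subtract the memcg's own per-zone counts. */
unsigned long toy_eligible_new(const struct toy_lru *memcg, int highest_zone)
{
	unsigned long size = memcg->total;

	for (int z = highest_zone + 1; z < TOY_NR_ZONES; z++) {
		unsigned long sub = memcg->per_zone[z];
		size -= sub < size ? sub : size;
	}
	return size;
}
```

      With a memcg holding 1000 pages (600 of them in Normal) on a machine
      whose global highmem LRU holds 5000 pages, the old calculation reports
      0 eligible pages for a lowmem request and the new one reports 600.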
      
      We are losing empty LRU but non-zero lru size detection introduced by
      ca707239 ("mm: update_lru_size warn and reset bad lru_size") because
      of the inherent zone vs. node discrepancy.
      
      Fixes: f8d1a311 ("mm: consider whether to decivate based on eligible zones inactive ratio")
      Link: http://lkml.kernel.org/r/20170104100825.3729-1-mhocko@kernel.org
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Reported-by: Nils Holland <nholland@tisys.org>
      Tested-by: Nils Holland <nholland@tisys.org>
      Reported-by: Klaus Ethgen <Klaus@Ethgen.de>
      Acked-by: Minchan Kim <minchan@kernel.org>
      Acked-by: Mel Gorman <mgorman@suse.de>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Reviewed-by: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: <stable@vger.kernel.org>	[4.8+]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      b4536f0c
  6. 13 Dec, 2016 1 commit
  7. 03 Dec, 2016 1 commit
    • M
      mm, vmscan: add cond_resched() into shrink_node_memcg() · bd041733
      Michal Hocko committed
      Boris Zhmurov has reported RCU stalls during the kswapd reclaim:
      
        INFO: rcu_sched detected stalls on CPUs/tasks:
         23-...: (22 ticks this GP) idle=92f/140000000000000/0 softirq=2638404/2638404 fqs=23
         (detected by 4, t=6389 jiffies, g=786259, c=786258, q=42115)
        Task dump for CPU 23:
        kswapd1         R  running task        0   148      2 0x00000008
        Call Trace:
          shrink_node+0xd2/0x2f0
          kswapd+0x2cb/0x6a0
          mem_cgroup_shrink_node+0x160/0x160
          kthread+0xbd/0xe0
          __switch_to+0x1fa/0x5c0
          ret_from_fork+0x1f/0x40
          kthread_create_on_node+0x180/0x180
      
      A closer code inspection has shown that we might indeed miss all the
      scheduling points in the reclaim path if no pages can be isolated from
      the LRU list.  This is a pathological case but other reports from Donald
      Buczek have shown that we might indeed hit such a path:
      
              clusterd-989   [009] .... 118023.654491: mm_vmscan_direct_reclaim_end: nr_reclaimed=193
               kswapd1-86    [001] dN.. 118023.987475: mm_vmscan_lru_isolate: isolate_mode=0 classzone=0 order=0 nr_requested=32 nr_scanned=4239830 nr_taken=0 file=1
               kswapd1-86    [001] dN.. 118024.320968: mm_vmscan_lru_isolate: isolate_mode=0 classzone=0 order=0 nr_requested=32 nr_scanned=4239844 nr_taken=0 file=1
               kswapd1-86    [001] dN.. 118024.654375: mm_vmscan_lru_isolate: isolate_mode=0 classzone=0 order=0 nr_requested=32 nr_scanned=4239858 nr_taken=0 file=1
               kswapd1-86    [001] dN.. 118024.987036: mm_vmscan_lru_isolate: isolate_mode=0 classzone=0 order=0 nr_requested=32 nr_scanned=4239872 nr_taken=0 file=1
               kswapd1-86    [001] dN.. 118025.319651: mm_vmscan_lru_isolate: isolate_mode=0 classzone=0 order=0 nr_requested=32 nr_scanned=4239886 nr_taken=0 file=1
               kswapd1-86    [001] dN.. 118025.652248: mm_vmscan_lru_isolate: isolate_mode=0 classzone=0 order=0 nr_requested=32 nr_scanned=4239900 nr_taken=0 file=1
               kswapd1-86    [001] dN.. 118025.984870: mm_vmscan_lru_isolate: isolate_mode=0 classzone=0 order=0 nr_requested=32 nr_scanned=4239914 nr_taken=0 file=1
        [...]
               kswapd1-86    [001] dN.. 118084.274403: mm_vmscan_lru_isolate: isolate_mode=0 classzone=0 order=0 nr_requested=32 nr_scanned=4241133 nr_taken=0 file=1
      
      This is a minute-long snapshot which didn't take a single page from the
      LRU.  It is not entirely clear why only 1303 pages have been scanned
      during that time (maybe there was heavy IRQ activity interfering).
      
      In any case it looks like we can really hit long periods without
      scheduling on non-preemptive kernels, so an explicit cond_resched() in
      shrink_node_memcg(), independent of the reclaim operation, is due.
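
      The idea of an unconditional scheduling point can be sketched in user
      space.  This is an analogy only, not the kernel change: sched_yield()
      stands in for cond_resched(), and the isolate callback and function
      names are hypothetical.  The point is that the yield happens on every
      round, whether or not any page was taken:

```c
#include <sched.h>

/* Hypothetical stand-in for isolating pages from an LRU list; returns the
 * number of pages taken, which may stay at zero for a long time. */
typedef unsigned long (*toy_isolate_fn)(void *ctx);

/* Run `rounds` scan rounds.  Returns how many scheduling points were hit;
 * *taken_total receives the number of pages isolated overall. */
unsigned long toy_scan(toy_isolate_fn isolate, void *ctx,
		       unsigned long rounds, unsigned long *taken_total)
{
	unsigned long yields = 0;

	*taken_total = 0;
	for (unsigned long i = 0; i < rounds; i++) {
		/* Work that may or may not make progress. */
		*taken_total += isolate(ctx);

		/* Scheduling point that does not depend on progress
		 * (user-space analogue of an explicit cond_resched()). */
		sched_yield();
		yields++;
	}
	return yields;
}

/* Pathological case from the trace above: nothing can be isolated. */
unsigned long toy_isolate_nothing(void *ctx)
{
	(void)ctx;
	return 0;
}
```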
      
      Link: http://lkml.kernel.org/r/20161202095841.16648-1-mhocko@kernel.org
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Reported-by: Boris Zhmurov <bb@kernelpanic.ru>
      Tested-by: Boris Zhmurov <bb@kernelpanic.ru>
      Reported-by: Donald Buczek <buczek@molgen.mpg.de>
      Reported-by: "Christopher S. Aker" <caker@theshore.net>
      Reported-by: Paul Menzel <pmenzel@molgen.mpg.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      bd041733
  8. 10 Nov, 2016 1 commit
  9. 28 Oct, 2016 1 commit
    • J
      mm: memcontrol: do not recurse in direct reclaim · 89a28483
      Johannes Weiner committed
      On 4.0, we saw a stack corruption from a page fault entering direct
      memory cgroup reclaim, calling into btrfs_releasepage(), which then
      tried to allocate an extent and recursed back into a kmem charge ad
      nauseam:
      
        [...]
        btrfs_releasepage+0x2c/0x30
        try_to_release_page+0x32/0x50
        shrink_page_list+0x6da/0x7a0
        shrink_inactive_list+0x1e5/0x510
        shrink_lruvec+0x605/0x7f0
        shrink_zone+0xee/0x320
        do_try_to_free_pages+0x174/0x440
        try_to_free_mem_cgroup_pages+0xa7/0x130
        try_charge+0x17b/0x830
        memcg_charge_kmem+0x40/0x80
        new_slab+0x2d9/0x5a0
        __slab_alloc+0x2fd/0x44f
        kmem_cache_alloc+0x193/0x1e0
        alloc_extent_state+0x21/0xc0
        __clear_extent_bit+0x2b5/0x400
        try_release_extent_mapping+0x1a3/0x220
        __btrfs_releasepage+0x31/0x70
        btrfs_releasepage+0x2c/0x30
        try_to_release_page+0x32/0x50
        shrink_page_list+0x6da/0x7a0
        shrink_inactive_list+0x1e5/0x510
        shrink_lruvec+0x605/0x7f0
        shrink_zone+0xee/0x320
        do_try_to_free_pages+0x174/0x440
        try_to_free_mem_cgroup_pages+0xa7/0x130
        try_charge+0x17b/0x830
        mem_cgroup_try_charge+0x65/0x1c0
        handle_mm_fault+0x117f/0x1510
        __do_page_fault+0x177/0x420
        do_page_fault+0xc/0x10
        page_fault+0x22/0x30
      
      On later kernels, kmem charging is opt-in rather than opt-out, and that
      particular kmem allocation in btrfs_releasepage() is no longer being
      charged and won't recurse and overrun the stack anymore.
      
      But it's not impossible for an accounted allocation to happen from the
      memcg direct reclaim context, and we needed to reproduce this crash many
      times before we even got a useful stack trace out of it.
      
      Like other direct reclaimers, mark tasks in memcg reclaim PF_MEMALLOC to
      avoid recursing into any other form of direct reclaim.  Then let
      recursive charges from PF_MEMALLOC contexts bypass the cgroup limit.
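
      The two halves of the fix (mark the reclaiming task, then let marked
      tasks bypass the limit) can be sketched with a toy model.  Everything
      here is an assumption made for illustration: the toy_ types, the flag
      value and the reclaim behaviour are hypothetical and do not mirror
      the real try_charge() code:

```c
#include <stdbool.h>

#define TOY_PF_MEMALLOC 0x1u	/* hypothetical value, not the kernel's */

struct toy_task  { unsigned int flags; };
struct toy_memcg { unsigned long usage, limit; int reclaim_calls; };

static void toy_reclaim(struct toy_task *t, struct toy_memcg *cg);

/* Charge `nr` pages to the cgroup, reclaiming once if over the limit. */
bool toy_try_charge(struct toy_task *t, struct toy_memcg *cg, unsigned long nr)
{
	if (cg->usage + nr <= cg->limit) {
		cg->usage += nr;
		return true;
	}
	/* Already inside reclaim: bypass the limit instead of recursing. */
	if (t->flags & TOY_PF_MEMALLOC) {
		cg->usage += nr;
		return true;
	}
	toy_reclaim(t, cg);
	if (cg->usage + nr <= cg->limit) {
		cg->usage += nr;
		return true;
	}
	return false;
}

/* Model of memcg direct reclaim that itself needs a charged allocation,
 * like the extent state allocation in the trace above. */
static void toy_reclaim(struct toy_task *t, struct toy_memcg *cg)
{
	unsigned int old = t->flags & TOY_PF_MEMALLOC;

	t->flags |= TOY_PF_MEMALLOC;	/* mark the task as reclaiming */
	cg->reclaim_calls++;
	toy_try_charge(t, cg, 1);	/* nested charge: must not recurse */
	cg->usage = cg->usage > 4 ? cg->usage - 4 : 0;	/* pretend to free */
	t->flags = (t->flags & ~TOY_PF_MEMALLOC) | old;	/* restore flag */
}
```

      In this model a charge at the limit enters reclaim exactly once; the
      nested charge is accepted over the limit rather than starting a second
      reclaim pass on the same stack.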
      
      Link: http://lkml.kernel.org/r/20161025141050.GA13019@cmpxchg.org
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      89a28483
  10. 08 Oct, 2016 5 commits
    • A
      mm: use zonelist name instead of using hardcoded index · c9634cf0
      Aneesh Kumar K.V committed
      Use the existing enums instead of hardcoded index when looking at the
      zonelist.  This makes it more readable.  No functionality change by this
      patch.
      
      Link: http://lkml.kernel.org/r/1472227078-24852-1-git-send-email-aneesh.kumar@linux.vnet.ibm.com
      Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Reviewed-by: Anshuman Khandual <khandual@linux.vnet.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      c9634cf0
    • M
      mm, vmscan: get rid of throttle_vm_writeout · bf484383
      Michal Hocko committed
      throttle_vm_writeout() was introduced back in 2005 to fix OOMs caused by
      excessive pageout activity during the reclaim.  Too many pages could be
      put under writeback therefore LRUs would be full of unreclaimable pages
      until the IO completes and in turn the OOM killer could be invoked.
      
      There have been some important changes introduced since then in the
      reclaim path though.  Writers are throttled by balance_dirty_pages when
      initiating the buffered IO and later during the memory pressure, the
      direct reclaim is throttled by wait_iff_congested if the node is
      considered congested by dirty pages on LRUs and the underlying bdi is
      congested by the queued IO.  The kswapd is throttled as well if it
      encounters pages marked for immediate reclaim or under writeback which
      signals that there are too many pages under writeback already.
      Finally should_reclaim_retry does congestion_wait if the reclaim cannot
      make any progress and there are too many dirty/writeback pages.
      
      Another important aspect is that we do not issue any IO from the direct
      reclaim context anymore.  In a heavy parallel load this could queue a
      lot of IO which would be very scattered and thus inefficient which would
      just make the problem worse.
      
      These three mechanisms should throttle and keep the amount of IO in a
      steady state even under heavy IO and memory pressure so yet another
      throttling point doesn't really seem helpful.  Quite the contrary, Mikulas
      Patocka has reported that swap backed by dm-crypt doesn't work properly
      because the swapout IO cannot make sufficient progress as the writeout
      path depends on dm_crypt worker which has to allocate memory to perform
      the encryption.  In order to guarantee a forward progress it relies on
      the mempool allocator.  mempool_alloc(), however, prefers to use the
      underlying (usually page) allocator before it grabs objects from the
      pool.  Such an allocation can dive into the memory reclaim and
      consequently into throttle_vm_writeout.  If there are too many dirty
      pages or pages under writeback it will get throttled even though it is
      in fact a flusher to clear pending pages.
      
        kworker/u4:0    D ffff88003df7f438 10488     6      2	0x00000000
        Workqueue: kcryptd kcryptd_crypt [dm_crypt]
        Call Trace:
          schedule+0x3c/0x90
          schedule_timeout+0x1d8/0x360
          io_schedule_timeout+0xa4/0x110
          congestion_wait+0x86/0x1f0
          throttle_vm_writeout+0x44/0xd0
          shrink_zone_memcg+0x613/0x720
          shrink_zone+0xe0/0x300
          do_try_to_free_pages+0x1ad/0x450
          try_to_free_pages+0xef/0x300
          __alloc_pages_nodemask+0x879/0x1210
          alloc_pages_current+0xa1/0x1f0
          new_slab+0x2d7/0x6a0
          ___slab_alloc+0x3fb/0x5c0
          __slab_alloc+0x51/0x90
          kmem_cache_alloc+0x27b/0x310
          mempool_alloc_slab+0x1d/0x30
          mempool_alloc+0x91/0x230
          bio_alloc_bioset+0xbd/0x260
          kcryptd_crypt+0x114/0x3b0 [dm_crypt]
      
      Let's just drop throttle_vm_writeout altogether.  It is not very
      helpful anymore.
      
      I have tried to test a potential writeback IO runaway similar to the one
      described in the original patch which has introduced that [1].  Small
      virtual machine (512MB RAM, 4 CPUs, 2G of swap space and disk image on a
      rather slow NFS in a sync mode on the host) with 8 parallel writers each
      writing 1G worth of data.  As soon as the pagecache fills up and the
      direct reclaim hits then I start anon memory consumer in a loop
      (allocating 300M and exiting after populating it) in the background to
      make the memory pressure even stronger as well as to disrupt the steady
      state for the IO.  The direct reclaim is throttled because of the
      congestion as well as kswapd hitting congestion_wait due to nr_immediate
      but throttle_vm_writeout doesn't ever trigger the sleep throughout the
      test.  Dirty+writeback are close to nr_dirty_threshold with some
      fluctuations caused by the anon consumer.
      
      [1] https://www2.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.9-rc1/2.6.9-rc1-mm3/broken-out/vm-pageout-throttling.patch
      Link: http://lkml.kernel.org/r/1471171473-21418-1-git-send-email-mhocko@kernel.org
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Reported-by: Mikulas Patocka <mpatocka@redhat.com>
      Cc: Marcelo Tosatti <mtosatti@redhat.com>
      Cc: NeilBrown <neilb@suse.com>
      Cc: Ondrej Kozina <okozina@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      bf484383
    • V
      mm, vmscan: make compaction_ready() more accurate and readable · fdd4c614
      Vlastimil Babka committed
      The compaction_ready() is used during direct reclaim for costly order
      allocations to skip reclaim for zones where compaction should be
      attempted instead.  It's combining the standard compaction_suitable()
      check with its own watermark check based on high watermark with extra
      gap, and the result is confusing at best.
      
      This patch attempts to better structure and document the checks
      involved.  First, compaction_suitable() can determine that the
      allocation should either succeed already, or that compaction doesn't
      have enough free pages to proceed.  The third possibility is that
      compaction has enough free pages, but we still decide to reclaim first -
      unless we are already above the high watermark with gap.  This does not
      mean that the reclaim will actually reach this watermark during single
      attempt, this is rather an over-reclaim protection.  So document the
      code as such.  The check for compaction_deferred() is removed
      completely, as it in fact had no proper role here.
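
      The three outcomes described above can be sketched as a small decision
      function.  This is a simplified model, not the kernel function: the
      toy_ types are hypothetical, the watermark test is reduced to a plain
      comparison of free pages, and the gap of (2UL << order) pages is taken
      from the compact_gap() description elsewhere in this series:

```c
#include <stdbool.h>

/* Hypothetical result of a compaction suitability check. */
enum toy_compact_result {
	TOY_COMPACT_SKIPPED,	/* not enough free pages for compaction */
	TOY_COMPACT_CONTINUE,	/* compaction could run                  */
	TOY_COMPACT_SUCCESS	/* allocation should already succeed     */
};

/* Hypothetical, heavily simplified zone state. */
struct toy_zone {
	unsigned long free_pages;
	unsigned long high_wmark;
};

/* Returns true when reclaim should be skipped for this zone. */
bool toy_compaction_ready(const struct toy_zone *zone,
			  enum toy_compact_result suitable,
			  unsigned int order)
{
	/* Allocation should succeed already: do not reclaim. */
	if (suitable == TOY_COMPACT_SUCCESS)
		return true;

	/* Compaction cannot proceed yet: reclaim first. */
	if (suitable == TOY_COMPACT_SKIPPED)
		return false;

	/* Compaction could run, but keep reclaiming unless the zone is
	 * already above the high watermark plus the gap.  This acts as
	 * over-reclaim protection, not as a reclaim target. */
	unsigned long watermark = zone->high_wmark + (2UL << order);

	return zone->free_pages >= watermark;
}
```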
      
      The result after this patch is mainly less confusing code.  We also
      skip some over-reclaim in cases where the allocation should already
      succeed.
      
      Link: http://lkml.kernel.org/r/20160810091226.6709-12-vbabka@suse.cz
      Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
      Tested-by: Lorenzo Stoakes <lstoakes@gmail.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Rik van Riel <riel@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      fdd4c614
    • V
      mm, compaction: create compact_gap wrapper · 9861a62c
      Vlastimil Babka committed
      Compaction uses a watermark gap of (2UL << order) pages at various
      places and it's not immediately obvious why.  Abstract it through a
      compact_gap() wrapper to create a single place with a thorough
      explanation.
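
      A minimal sketch of such a wrapper, written as plain C so it can be
      compiled outside the kernel.  Only the expression (2UL << order) comes
      from the description above; the exact declaration and comment in the
      kernel may differ:

```c
/* Watermark gap used by compaction, in pages, for a given allocation
 * order.  2UL << order equals 2 * (1UL << order), i.e. twice the number
 * of pages in the requested high-order allocation. */
unsigned long compact_gap(unsigned int order)
{
	return 2UL << order;
}
```

      For an order-3 request (8 pages) the gap is 16 pages; for order 9 it
      is 1024 pages.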
      
      [vbabka@suse.cz: clarify the comment of compact_gap()]
       Link: http://lkml.kernel.org/r/7b6aed1f-fdf8-2063-9ff4-bbe4de712d37@suse.cz
      Link: http://lkml.kernel.org/r/20160810091226.6709-9-vbabka@suse.cz
      Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
      Tested-by: Lorenzo Stoakes <lstoakes@gmail.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Rik van Riel <riel@redhat.com>
      Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      9861a62c
    • V
      mm, compaction: rename COMPACT_PARTIAL to COMPACT_SUCCESS · cf378319
      Vlastimil Babka committed
      COMPACT_PARTIAL has historically meant that compaction returned after
      doing some work without fully compacting a zone.  It however didn't
      distinguish if compaction terminated because it succeeded in creating
      the requested high-order page.  This has changed recently and now we
      only return COMPACT_PARTIAL when compaction thinks it succeeded, or the
      high-order watermark check in compaction_suitable() passes and no
      compaction needs to be done.
      
      So at this point we can make the return value clearer by renaming it to
      COMPACT_SUCCESS.  The next patch will remove some redundant tests for
      success where compaction just returned COMPACT_SUCCESS.
      
      Link: http://lkml.kernel.org/r/20160810091226.6709-4-vbabka@suse.cz
      Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
      Tested-by: Lorenzo Stoakes <lstoakes@gmail.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Rik van Riel <riel@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      cf378319
  11. 25 Sep, 2016 1 commit
    • H
      mm: delete unnecessary and unsafe init_tlb_ubc() · b385d21f
      Hugh Dickins committed
      init_tlb_ubc() looked unnecessary to me: tlb_ubc is statically
      initialized with zeroes in the init_task, and copied from parent to
      child while it is quiescent in arch_dup_task_struct(); so I went to
      delete it.
      
      But inserted temporary debug WARN_ONs in place of init_tlb_ubc() to
      check that it was always empty at that point, and found them firing:
      because memcg reclaim can recurse into global reclaim (when allocating
      biosets for swapout in my case), and arrive back at the init_tlb_ubc()
      in shrink_node_memcg().
      
      Resetting tlb_ubc.flush_required at that point is wrong: if the upper
      level needs a deferred TLB flush, but the lower level turns out not to,
      we miss a TLB flush.  But fortunately, that's the only part of the
      protocol that does not nest: with the initialization removed, cpumask
      collects bits from upper and lower levels, and flushes TLB when needed.
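
      The lost flush can be reproduced with a toy model of the batching
      state.  This is a sketch under stated assumptions: the toy_ structure
      and helpers are hypothetical, the cpumask is reduced to one unsigned
      long, and reset_on_entry models the removed init_tlb_ubc() call:

```c
#include <stdbool.h>

/* Hypothetical stand-in for the per-task unmap batching state. */
struct toy_tlb_ubc {
	unsigned long cpumask;	/* CPUs that may hold stale TLB entries */
	bool flush_required;	/* a deferred flush is pending          */
};

/* Upper level: a page mapped on `cpu` was unmapped, flush deferred. */
void toy_set_flush_pending(struct toy_tlb_ubc *ubc, unsigned int cpu)
{
	ubc->cpumask |= 1UL << cpu;
	ubc->flush_required = true;
}

/* Nested (lower) reclaim level that happens to unmap nothing. */
void toy_nested_reclaim(struct toy_tlb_ubc *ubc, bool reset_on_entry)
{
	if (reset_on_entry)
		ubc->flush_required = false;	/* clobbers upper-level state */
}

/* Perform the deferred flush if one is pending; returns whether it ran. */
bool toy_try_flush(struct toy_tlb_ubc *ubc)
{
	if (!ubc->flush_required)
		return false;
	ubc->cpumask = 0;
	ubc->flush_required = false;
	return true;
}
```

      With reset_on_entry set, the upper level's pending flush is silently
      dropped; without it, the flag and mask accumulate across both levels
      and the flush still happens.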
      
      Fixes: 72b252ae ("mm: send one IPI per CPU to TLB flush all entries after unmapping pages")
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Acked-by: Mel Gorman <mgorman@techsingularity.net>
      Cc: stable@vger.kernel.org # 4.3+
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      b385d21f
  12. 02 Sep, 2016 1 commit
  13. 03 Aug, 2016 1 commit