1. 04 May 2017 (32 commits)
    • mm/sparse: refine usemap_size() a little · 60a7a88d
      Wei Yang authored
      The current implementation calculates usemap_size in two steps:
          * calculate number of bytes to cover these bits
          * calculate number of "unsigned long" to cover these bytes
      
      It would be clearer to:
          * calculate the number of "unsigned long" to cover these bits
          * multiply it by sizeof(unsigned long)
      
      This patch refines usemap_size() a little to make it easier to
      understand.
      
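      To make the difference concrete, here is a hedged sketch in plain C
      (the function names are invented for illustration; this is not the
      kernel code itself):
      
          #include <limits.h>
      
          /* old: bits -> bytes, then bytes -> "unsigned long" words */
          unsigned long usemap_size_old(unsigned long bits)
          {
                  unsigned long bytes = (bits + CHAR_BIT - 1) / CHAR_BIT;
      
                  return ((bytes + sizeof(unsigned long) - 1) /
                          sizeof(unsigned long)) * sizeof(unsigned long);
          }
      
          /* new: bits -> "unsigned long" words directly, then scale up */
          unsigned long usemap_size_new(unsigned long bits)
          {
                  unsigned long bits_per_long = sizeof(unsigned long) * CHAR_BIT;
      
                  return ((bits + bits_per_long - 1) / bits_per_long) *
                         sizeof(unsigned long);
          }
      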
      Link: http://lkml.kernel.org/r/20170310043713.96871-1-richard.weiyang@gmail.com
      Signed-off-by: Wei Yang <richard.weiyang@gmail.com>
      Cc: Tejun Heo <tj@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: page_alloc: __GFP_NOWARN shouldn't suppress stall warnings · 82251963
      Johannes Weiner authored
      __GFP_NOWARN, which is usually added to avoid warnings from callsites
      that expect to fail and have fallbacks, currently also suppresses
      allocation stall warnings.  These trigger when an allocation is stuck
      inside the allocator for 10 seconds or longer.
      
      But there is no class of allocations that can get legitimately stuck in
      the allocator for this long.  This always indicates a problem.
      
      Always emit stall warnings.  Restrict __GFP_NOWARN to alloc failures.
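
      A hedged sketch of the resulting slow-path logic (helper names and
      details are illustrative, not the exact kernel code):
      
          /* stall warning: no longer gated on __GFP_NOWARN */
          if (time_after(jiffies, alloc_start + stall_timeout))
                  pr_warn("page allocation stalls for %ums\n",
                          jiffies_to_msecs(jiffies - alloc_start));
      
          /* failure warning: still suppressible by the caller */
          if (!(gfp_mask & __GFP_NOWARN))
                  warn_alloc_failed(gfp_mask, order);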
      
      Link: http://lkml.kernel.org/r/20170125181150.GA16398@cmpxchg.org
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Minchan Kim <minchan@kernel.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm, vmscan: prevent kswapd sleeping prematurely due to mismatched classzone_idx · e716f2eb
      Mel Gorman authored
      kswapd is woken to reclaim a node based on a failed allocation request
      from any eligible zone.  Once reclaiming in balance_pgdat(), it will
      continue reclaiming until there is an eligible zone available for the
      zone it was woken for.  kswapd tracks what zone it was recently woken
      for in pgdat->kswapd_classzone_idx.  If it has not been woken recently,
      this zone will be 0.
      
      However, the decision on whether to sleep is made on
      kswapd_classzone_idx, which is 0 without a recent wakeup request, and
      that classzone does not account for lowmem reserves.  This allows
      kswapd to sleep when a small low zone such as ZONE_DMA is balanced
      for a GFP_DMA request even if a stream of allocations cannot use that
      zone.  While kswapd may be woken again shortly, there are two
      consequences -- the pgdat bits that control congestion are cleared
      prematurely, and direct reclaim is more likely because kswapd slept
      prematurely.
      
      This patch flips kswapd_classzone_idx to default to MAX_NR_ZONES (an
      invalid index) when there have been no recent wakeups.  If there are
      no wakeups, kswapd decides whether to sleep based on the highest
      possible zone available (MAX_NR_ZONES - 1).  It then becomes critical
      that the "pgdat balanced" decisions during reclaim and when deciding
      to sleep are the same.  If there is a mismatch, kswapd can stay awake
      continually trying to balance tiny zones.
      
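      A hedged sketch of the resulting helper (it follows the changelog's
      description; the in-tree code may differ in detail):
      
          static enum zone_type kswapd_classzone_idx(pg_data_t *pgdat,
                                                     enum zone_type classzone_idx)
          {
                  /* no recent wakeup recorded: use the caller's default,
                   * i.e. the highest possible zone */
                  if (pgdat->kswapd_classzone_idx == MAX_NR_ZONES)
                          return classzone_idx;
      
                  return pgdat->kswapd_classzone_idx;
          }
      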
      simoop was used to evaluate it again.  Two of the preparation patches
      regressed the workload, so they are included as the second set of
      results.  Otherwise this patch looks artificially excellent.
      
                                               4.11.0-rc1            4.11.0-rc1            4.11.0-rc1
                                                  vanilla              clear-v2          keepawake-v2
      Amean    p50-Read             21670074.18 (  0.00%) 19786774.76 (  8.69%) 22668332.52 ( -4.61%)
      Amean    p95-Read             25456267.64 (  0.00%) 24101956.27 (  5.32%) 26738688.00 ( -5.04%)
      Amean    p99-Read             29369064.73 (  0.00%) 27691872.71 (  5.71%) 30991404.52 ( -5.52%)
      Amean    p50-Write                1390.30 (  0.00%)     1011.91 ( 27.22%)      924.91 ( 33.47%)
      Amean    p95-Write              412901.57 (  0.00%)    34874.98 ( 91.55%)     1362.62 ( 99.67%)
      Amean    p99-Write             6668722.09 (  0.00%)   575449.60 ( 91.37%)    16854.04 ( 99.75%)
      Amean    p50-Allocation          78714.31 (  0.00%)    84246.26 ( -7.03%)    74729.74 (  5.06%)
      Amean    p95-Allocation         175533.51 (  0.00%)   400058.43 (-127.91%)   101609.74 ( 42.11%)
      Amean    p99-Allocation         247003.02 (  0.00%) 10905600.00 (-4315.17%)   125765.57 ( 49.08%)
      
      With this patch on top, write and allocation latencies are massively
      improved.  The read latencies are slightly impaired but it's worth
      noting that this is mostly due to the IO scheduler and not directly
      related to reclaim.  The vmstats are a bit of a mix, but the relevant
      ones are as follows:
      
                                  4.10.0-rc7       4.10.0-rc7       4.10.0-rc7
                              mmots-20170209      clear-v1r25  keepawake-v1r25
      Swap Ins                             0           0           0
      Swap Outs                            0         608           0
      Direct pages scanned           6910672     3132699     6357298
      Kswapd pages scanned          57036946    82488665    56986286
      Kswapd pages reclaimed        55993488    63474329    55939113
      Direct pages reclaimed         6905990     2964843     6352115
      Kswapd efficiency                  98%         76%         98%
      Kswapd velocity              12494.375   17597.507   12488.065
      Direct efficiency                  99%         94%         99%
      Direct velocity               1513.835     668.306    1393.148
      Page writes by reclaim           0.000 4410243.000       0.000
      Page writes file                     0     4409635           0
      Page writes anon                     0         608           0
      Page reclaim immediate         1036792    14175203     1042571
      
                                  4.11.0-rc1  4.11.0-rc1  4.11.0-rc1
                                     vanilla  clear-v2  keepawake-v2
      Swap Ins                             0          12           0
      Swap Outs                            0         838           0
      Direct pages scanned           6579706     3237270     6256811
      Kswapd pages scanned          61853702    79961486    54837791
      Kswapd pages reclaimed        60768764    60755788    53849586
      Direct pages reclaimed         6579055     2987453     6256151
      Kswapd efficiency                  98%         75%         98%
      Page writes by reclaim           0.000 4389496.000       0.000
      Page writes file                     0     4388658           0
      Page writes anon                     0         838           0
      Page reclaim immediate         1073573    14473009      982507
      
      Swap-outs are equivalent to baseline.
      
      Direct reclaim is reduced but not eliminated.  It's worth noting that
      there are two periods of direct reclaim for this workload.  The first
      is when it switches from preparing the files to the actual test
      itself.  It's a lot of file IO followed by a lot of allocs that
      reclaim heavily for a brief window.  While direct reclaim is lower
      with clear-v2, it is due to kswapd scanning aggressively and trying
      to reclaim the world, which is not the right thing to do.  With the
      patches applied, there is still direct reclaim, but it is confined to
      the phase change from "creating work files" to starting multiple
      threads that allocate anonymous memory faster than kswapd can reclaim
      it.
      
      Scanning/reclaim efficiency is restored by this patch.
      
      Page writes from reclaim context are back at 0 which is ideal.
      
      The number of pages immediately reclaimed after IO completes is
      slightly improved, but it is expected this will vary.
      
      On UMA, there is almost no change so this is not expected to be a
      universal win.
      
      [mgorman@suse.de: fix ->kswapd_classzone_idx initialization]
        Link: http://lkml.kernel.org/r/20170406174538.5msrznj6nt6qpbx5@suse.de
      Link: http://lkml.kernel.org/r/20170309075657.25121-4-mgorman@techsingularity.net
      Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Hillf Danton <hillf.zj@alibaba-inc.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Shantanu Goel <sgoel01@yahoo.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm, vmscan: only clear pgdat congested/dirty/writeback state when balanced · 631b6e08
      Mel Gorman authored
      A pgdat tracks if recent reclaim encountered too many dirty, writeback
      or congested pages.  The flags control whether kswapd writes pages back
      from reclaim context, tags pages for immediate reclaim when IO
      completes, whether processes block on wait_iff_congested and whether
      kswapd blocks when too many pages marked for immediate reclaim are
      encountered.
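
      A hedged sketch of the change (the flag names are the real pgdat
      flags; the placement is paraphrased from the changelog):
      
          /* in prepare_kswapd_sleep(), only once the node is balanced
           * and kswapd is actually about to sleep */
          clear_bit(PGDAT_CONGESTED, &pgdat->flags);
          clear_bit(PGDAT_DIRTY, &pgdat->flags);
          clear_bit(PGDAT_WRITEBACK, &pgdat->flags);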
      
      The state is cleared in a check function with side-effects.  With the
      patch "mm, vmscan: fix zone balance check in prepare_kswapd_sleep", the
      timing of when the bits get cleared changed.  Due to the way the check
      works, it'll clear the bits if ZONE_DMA is balanced for a GFP_DMA
      allocation because it does not account for lowmem reserves properly.
      
      For the simoop workload, kswapd is not stalling when it should due to
      the premature clearing, writing pages from reclaim context like crazy
      and generally being unhelpful.
      
      This patch resets the pgdat bits related to page reclaim only when
      kswapd is going to sleep.  The comparison with simoop is then
      
                                               4.11.0-rc1            4.11.0-rc1            4.11.0-rc1
                                                  vanilla           fixcheck-v2              clear-v2
      Amean    p50-Read             21670074.18 (  0.00%) 20464344.18 (  5.56%) 19786774.76 (  8.69%)
      Amean    p95-Read             25456267.64 (  0.00%) 25721423.64 ( -1.04%) 24101956.27 (  5.32%)
      Amean    p99-Read             29369064.73 (  0.00%) 30174230.76 ( -2.74%) 27691872.71 (  5.71%)
      Amean    p50-Write                1390.30 (  0.00%)     1395.28 ( -0.36%)     1011.91 ( 27.22%)
      Amean    p95-Write              412901.57 (  0.00%)    37737.74 ( 90.86%)    34874.98 ( 91.55%)
      Amean    p99-Write             6668722.09 (  0.00%)   666489.04 ( 90.01%)   575449.60 ( 91.37%)
      Amean    p50-Allocation          78714.31 (  0.00%)    86286.22 ( -9.62%)    84246.26 ( -7.03%)
      Amean    p95-Allocation         175533.51 (  0.00%)   351812.27 (-100.42%)   400058.43 (-127.91%)
      Amean    p99-Allocation         247003.02 (  0.00%)  6291171.56 (-2447.00%) 10905600.00 (-4315.17%)
      
      Read latency is improved, write latency is mostly improved, but
      allocation latency is regressed.  kswapd is still reclaiming
      inefficiently, pages are being written back from reclaim context, and
      there is a host of other issues.  However, given the change, it
      needed to be spelled out why the side-effect was moved.
      
      Link: http://lkml.kernel.org/r/20170309075657.25121-3-mgorman@techsingularity.net
      Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Hillf Danton <hillf.zj@alibaba-inc.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Shantanu Goel <sgoel01@yahoo.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm, vmscan: fix zone balance check in prepare_kswapd_sleep · 333b0a45
      Shantanu Goel authored
      Patch series "Reduce amount of time kswapd sleeps prematurely", v2.
      
      The series is unusual in that the first patch fixes one problem and
      introduces other issues that are noted in the changelog.  Patch 2 makes
      a minor modification that is worth considering on its own but leaves the
      kernel in a state where it behaves badly.  It's not until patch 3 that
      there is an improvement against baseline.
      
      This was mostly motivated by examining Chris Mason's "simoop"
      benchmark, which puts the VM under pressure similar to HADOOP.  It
      has been reported that the benchmark has regressed severely over the
      last several releases.  While I cannot reproduce all the same
      problems Chris experienced due to hardware limitations, there were a
      number of problems on a 2-socket machine with a single disk.
      
      simoop latencies
                                               4.11.0-rc1            4.11.0-rc1
                                                  vanilla          keepawake-v2
      Amean    p50-Read             21670074.18 (  0.00%) 22668332.52 ( -4.61%)
      Amean    p95-Read             25456267.64 (  0.00%) 26738688.00 ( -5.04%)
      Amean    p99-Read             29369064.73 (  0.00%) 30991404.52 ( -5.52%)
      Amean    p50-Write                1390.30 (  0.00%)      924.91 ( 33.47%)
      Amean    p95-Write              412901.57 (  0.00%)     1362.62 ( 99.67%)
      Amean    p99-Write             6668722.09 (  0.00%)    16854.04 ( 99.75%)
      Amean    p50-Allocation          78714.31 (  0.00%)    74729.74 (  5.06%)
      Amean    p95-Allocation         175533.51 (  0.00%)   101609.74 ( 42.11%)
      Amean    p99-Allocation         247003.02 (  0.00%)   125765.57 ( 49.08%)
      
      These are latencies.  Read/write are threads reading fixed-size
      random blocks from a simulated database.  The allocation latency is
      mmaping and faulting regions of memory.  The p50, p95 and p99 figures
      report the worst latencies for 50%, 95% and 99% of the samples
      respectively.
      
      For example, the report indicates that while the test was running,
      the worst-case latency for 99% of writes improved by 99.75%.  It's
      worth noting that on a UMA machine no difference in performance with
      simoop was observed, so mileage will vary.
      
      It's noted that there is a slight impact to read latencies, but it's
      mostly due to IO scheduler decisions and offset by the large
      reduction in other latencies.
      
      This patch (of 3):
      
      The check in prepare_kswapd_sleep needs to match the one in
      balance_pgdat, since the latter will return as soon as any one of the
      zones in the classzone is above the watermark.  This is especially
      important for higher-order allocations, since balance_pgdat will
      typically reset the order to zero, relying on compaction to create
      the higher-order pages.  Without this patch, prepare_kswapd_sleep
      fails to wake up kcompactd since the zone balance check fails.
      
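      A hedged sketch of the corrected check (paraphrasing the changelog;
      zone_balanced() stands in for the real watermark test):
      
          /* succeed if ANY eligible zone is balanced, matching what
           * balance_pgdat() does, instead of requiring all of them */
          for (i = 0; i <= classzone_idx; i++) {
                  struct zone *zone = pgdat->node_zones + i;
      
                  if (!managed_zone(zone))
                          continue;
                  if (zone_balanced(zone, order, classzone_idx))
                          return true;
          }
          return false;
      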
      It was first reported against 4.9.7 that kswapd is failing to wake up
      kcompactd due to a mismatch in the zone balance check between
      balance_pgdat() and prepare_kswapd_sleep().
      
      balance_pgdat() returns as soon as a single zone satisfies the
      allocation but prepare_kswapd_sleep() requires all zones to do the
      same.  This causes prepare_kswapd_sleep() to never succeed except in
      the order == 0 case and consequently, wakeup_kcompactd() is never
      called.  For the machine that originally motivated this patch, the
      state of compaction from /proc/vmstat looked this way after a day and
      a half of uptime:
      
      compact_migrate_scanned 240496
      compact_free_scanned 76238632
      compact_isolated 123472
      compact_stall 1791
      compact_fail 29
      compact_success 1762
      compact_daemon_wake 0
      
      After applying the patch and about 10 hours of uptime the state looks
      like this:
      
      compact_migrate_scanned 59927299
      compact_free_scanned 2021075136
      compact_isolated 640926
      compact_stall 4
      compact_fail 2
      compact_success 2
      compact_daemon_wake 5160
      
      Further notes from Mel, which motivated him to pick this patch up and
      resend it:
      
      It was observed for the simoop workload (pressures the VM similar to
      HADOOP) that kswapd was failing to keep ahead of direct reclaim.  The
      investigation noted that there was a need to rationalise kswapd
      decisions to reclaim with kswapd decisions to sleep.  With this patch on
      a 2-socket box, there was a 49% reduction in direct reclaim scanning.
      
      However, the impact otherwise is extremely negative.  Kswapd reclaim
      efficiency dropped from 98% to 76%.  simoop has three latency-related
      metrics for read, write and allocation (an anonymous mmap and fault).
      
                                               4.11.0-rc1            4.11.0-rc1
                                                  vanilla           fixcheck-v2
      Amean    p50-Read             21670074.18 (  0.00%) 20464344.18 (  5.56%)
      Amean    p95-Read             25456267.64 (  0.00%) 25721423.64 ( -1.04%)
      Amean    p99-Read             29369064.73 (  0.00%) 30174230.76 ( -2.74%)
      Amean    p50-Write                1390.30 (  0.00%)     1395.28 ( -0.36%)
      Amean    p95-Write              412901.57 (  0.00%)    37737.74 ( 90.86%)
      Amean    p99-Write             6668722.09 (  0.00%)   666489.04 ( 90.01%)
      Amean    p50-Allocation          78714.31 (  0.00%)    86286.22 ( -9.62%)
      Amean    p95-Allocation         175533.51 (  0.00%)   351812.27 (-100.42%)
      Amean    p99-Allocation         247003.02 (  0.00%)  6291171.56 (-2447.00%)
      
      Of greater concern is that the patch causes swapping, and page writes
      from kswapd context rose from 0 pages to 4189753 pages during the
      hour the workload ran.  By and large, the patch has very bad
      behaviour, but this is easily missed as the impact on a UMA machine
      is negligible.
      
      This patch is included with the data in case a bisection leads to this
      area.  This patch is also a pre-requisite for the rest of the series.
      
      Link: http://lkml.kernel.org/r/20170309075657.25121-2-mgorman@techsingularity.net
      Signed-off-by: Shantanu Goel <sgoel01@yahoo.com>
      Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
      Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: do not use double negation for testing page flags · 2948be5a
      Minchan Kim authored
      Following the discussion [1], it seems every PageFlags function
      returns bool at this point, so we don't need the double negation any
      more.  Although it's not a problem to keep it, it may confuse future
      users into using double negation for them too.
      
      Remove that possibility.
      
      [1] https://marc.info/?l=linux-kernel&m=148881578820434
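
      For illustration (a trivial sketch, not the kernel code):
      
          #include <stdbool.h>
      
          struct page;
          bool PageDirty(const struct page *page);    /* predicates return bool now */
      
          bool test_old(const struct page *page)
          {
                  return !!PageDirty(page);           /* redundant double negation */
          }
      
          bool test_new(const struct page *page)
          {
                  return PageDirty(page);             /* equivalent and clearer */
          }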
      
      Frankly speaking, I would like every PageFlags function to return
      bool instead of int; it would make things clearer.  AFAIR, Chen Gang
      had tried that, but I don't know why it was not merged at the time.
      
      http://lkml.kernel.org/r/1469336184-1904-1-git-send-email-chengang@emindsoft.com.cn
      
      Link: http://lkml.kernel.org/r/1488868597-32222-1-git-send-email-minchan@kernel.org
      Signed-off-by: Minchan Kim <minchan@kernel.org>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Kirill A. Shutemov <kirill@shutemov.name>
      Cc: Chen Gang <gang.chen.5i5j@gmail.com>
      Cc: Anshuman Khandual <khandual@linux.vnet.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: remove rodata_test_data export, add pr_fmt · 056b9d8a
      Kees Cook authored
      Since commit 3ad38ceb ("x86/mm: Remove CONFIG_DEBUG_NX_TEST"),
      nothing is using the exported rodata_test_data variable, so drop the
      export.
      
      This additionally updates the pr_fmt to avoid redundant strings and
      adjusts some whitespace.
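
      The pr_fmt convention looks like this (a generic sketch of the
      kernel idiom, not necessarily the exact hunk from this patch):
      
          /* must be defined before printk.h is pulled in */
          #define pr_fmt(fmt) "rodata_test: " fmt
      
          /* pr_info("test passed\n") now prints
           * "rodata_test: test passed" without repeating the prefix
           * in every format string */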
      
      Link: http://lkml.kernel.org/r/20170307005313.GA85809@beast
      Signed-off-by: Kees Cook <keescook@chromium.org>
      Cc: Jinbum Park <jinb.park7@gmail.com>
      Cc: Arjan van de Ven <arjan@linux.intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: tighten up the fault path a little · 9ab2594f
      Matthew Wilcox authored
      The round_up() macro generates a couple of unnecessary instructions
      in this usage:
      
          48cd:       49 8b 47 50             mov    0x50(%r15),%rax
          48d1:       48 83 e8 01             sub    $0x1,%rax
          48d5:       48 0d ff 0f 00 00       or     $0xfff,%rax
          48db:       48 83 c0 01             add    $0x1,%rax
          48df:       48 c1 f8 0c             sar    $0xc,%rax
          48e3:       48 39 c3                cmp    %rax,%rbx
          48e6:       72 2e                   jb     4916 <filemap_fault+0x96>
      
      If we change round_up() to ((x) + __round_mask(x, y)) & ~__round_mask(x, y)
      then GCC can see through it and remove the mask (because that would be
      dead code given the subsequent shift):
      
          48cd:       49 8b 47 50             mov    0x50(%r15),%rax
          48d1:       48 05 ff 0f 00 00       add    $0xfff,%rax
          48d7:       48 c1 e8 0c             shr    $0xc,%rax
          48db:       48 39 c3                cmp    %rax,%rbx
          48de:       72 2e                   jb     490e <filemap_fault+0x8e>
      
      But that's problematic because we'd evaluate 'y' twice.  Converting
      round_up into an inline function prevents it from being used in other
      definitions.  The easiest thing to do is just change these three usages
      of round_up to use DIV_ROUND_UP.  Also add an unlikely() because GCC's
      heuristic is wrong in this case.
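
      For reference, the idioms involved (essentially as defined in the
      kernel headers of that era):
      
          #define __round_mask(x, y)  ((__typeof__(x))((y) - 1))
          #define round_up(x, y)      ((((x) - 1) | __round_mask(x, y)) + 1)
          #define DIV_ROUND_UP(n, d)  (((n) + (d) - 1) / (d))
      
          /* old: round_up(size, PAGE_SIZE) >> PAGE_SHIFT
           * new: DIV_ROUND_UP(size, PAGE_SIZE)
           * The latter folds to a single add + shift for a
           * power-of-two divisor. */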
      
      Link: http://lkml.kernel.org/r/20170207192812.5281-1-willy@infradead.org
      Signed-off-by: Matthew Wilcox <mawilcox@microsoft.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: introduce memalloc_nofs_{save,restore} API · 7dea19f9
      Michal Hocko authored
      GFP_NOFS context is used for the following 5 reasons currently:
      
       - to prevent deadlocks when the lock held by the allocation context
         would be needed during memory reclaim
      
       - to prevent stack overflows during reclaim because the allocation
         is performed from an already deep context
      
       - to prevent lockups when the allocation context depends on other
         reclaimers to indirectly make forward progress
      
       - just in case, because it would be safe from the fs POV
      
       - to silence lockdep false positives
      
      Unfortunately, overuse of this allocation context brings some
      problems to the MM.  Memory reclaim is much weaker (especially during
      heavy FS metadata workloads), and the OOM killer cannot be invoked
      because the MM layer doesn't have enough information about how much
      memory is freeable by the FS layer.
      
      In many cases it is far from clear why the weaker context is even used
      and so it might be used unnecessarily.  We would like to get rid of
      those as much as possible.  One way to do that is to use the flag in
      scopes rather than in isolated cases.  Such a scope is declared when
      really necessary, is tracked per task, and all the allocation
      requests from within the context simply inherit the GFP_NOFS
      semantic.
      
      Not only is this easier to understand and maintain (because there are
      far fewer problematic contexts than specific allocation requests), it
      also helps code paths where the FS layer interacts with other layers
      (e.g.  crypto, security modules, MM etc...) and there is no easy way
      to convey the allocation context between the layers.
      
      Introduce the memalloc_nofs_{save,restore} API to control the scope
      of the GFP_NOFS allocation context.  This basically copies the
      memalloc_noio_{save,restore} API we have for the other restricted
      allocation context, GFP_NOIO.  The PF_MEMALLOC_NOFS flag already
      exists and is just an alias for PF_FSTRANS, which was xfs-specific
      until recently.  There are no PF_FSTRANS users anymore, so let's just
      drop it.
      
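      Typical usage of the new scope API (the save/restore pair is the
      interface this patch introduces; the allocation in between is an
      invented example):
      
          unsigned int nofs_flags = memalloc_nofs_save();
      
          /* every allocation in this scope implicitly drops __GFP_FS */
          ptr = kmalloc(size, GFP_KERNEL);
      
          memalloc_nofs_restore(nofs_flags);
      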
      PF_MEMALLOC_NOFS is now checked in the MM layer and drops __GFP_FS
      implicitly, the same way PF_MEMALLOC_NOIO drops __GFP_IO.
      memalloc_noio_flags is renamed to current_gfp_context because it now
      cares about both PF_MEMALLOC_NOFS and PF_MEMALLOC_NOIO contexts.  Xfs
      code paths preserve their semantics.  kmem_flags_convert() doesn't
      need to evaluate the flag anymore.
      
      This patch shouldn't introduce any functional changes.
      
      Let's hope that filesystems will drop direct GFP_NOFS (resp.
      ~__GFP_FS) usage as much as possible and only use properly documented
      memalloc_nofs_{save,restore} checkpoints where they are appropriate.
      
      [akpm@linux-foundation.org: fix comment typo, reflow comment]
      Link: http://lkml.kernel.org/r/20170306131408.9828-5-mhocko@kernel.org
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Theodore Ts'o <tytso@mit.edu>
      Cc: Chris Mason <clm@fb.com>
      Cc: David Sterba <dsterba@suse.cz>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Brian Foster <bfoster@redhat.com>
      Cc: Darrick J. Wong <darrick.wong@oracle.com>
      Cc: Nikolay Borisov <nborisov@suse.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm, vmstat: suppress pcp stats for unpopulated zones in zoneinfo · 7dfb8bf3
      David Rientjes authored
      After "mm, vmstat: print non-populated zones in zoneinfo",
      /proc/zoneinfo will show unpopulated zones.
      
      The per-cpu pageset statistics are not relevant for unpopulated zones
      and can be potentially lengthy, so suppress them when they are not
      interesting.
      
      This also moves the lowmem reserve protection information above the
      pcp stats, since it is relevant for all zones per
      vm.lowmem_reserve_ratio.
      
      Link: http://lkml.kernel.org/r/alpine.DEB.2.10.1703061400500.46428@chino.kir.corp.google.com
      Signed-off-by: David Rientjes <rientjes@google.com>
      Cc: Anshuman Khandual <khandual@linux.vnet.ibm.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm, vmstat: print non-populated zones in zoneinfo · b2bd8598
      David Rientjes authored
      Initscripts can use the information (protection levels) from
      /proc/zoneinfo to configure vm.lowmem_reserve_ratio at boot.
      
      vm.lowmem_reserve_ratio is an array of ratios for each configured zone
      on the system.  If a zone is not populated on an arch, /proc/zoneinfo
      suppresses its output.
      
      This results in there not being a 1:1 mapping between the set of zones
      emitted by /proc/zoneinfo and the zones configured by
      vm.lowmem_reserve_ratio.
      
      This patch shows statistics for non-populated zones in /proc/zoneinfo.
      The zones exist and hold a spot in the vm.lowmem_reserve_ratio array.
      Without this patch, it is not possible to determine which index in the
      array controls which zone if one or more zones on the system are not
      populated.
      
      Remaining users of walk_zones_in_node() are unchanged.  Files such as
      /proc/pagetypeinfo require certain zone data to be initialized properly
      for display, which is not done for unpopulated zones.
      
      Link: http://lkml.kernel.org/r/alpine.DEB.2.10.1703031451310.98023@chino.kir.corp.google.com
      Signed-off-by: David Rientjes <rientjes@google.com>
      Reviewed-by: Anshuman Khandual <khandual@linux.vnet.ibm.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: use is_migrate_isolate_page() to simplify the code · bbf9ce97
      Xishi Qiu authored
      Use is_migrate_isolate_page() to simplify the code, no functional
      changes.
      
      Link: http://lkml.kernel.org/r/58B94FB1.8020802@huawei.com
      Signed-off-by: Xishi Qiu <qiuxishi@huawei.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: use is_migrate_highatomic() to simplify the code · a6ffdc07
      Xishi Qiu authored
      Introduce two helpers, is_migrate_highatomic() and
      is_migrate_highatomic_page(), to simplify the code.  No functional
      changes.
      
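      A hedged sketch of the likely shape of the helpers (static inlines
      per the note below; the in-tree definitions may differ slightly):
      
          static inline bool is_migrate_highatomic(enum migratetype migratetype)
          {
                  return migratetype == MIGRATE_HIGHATOMIC;
          }
      
          static inline bool is_migrate_highatomic_page(struct page *page)
          {
                  return get_pageblock_migratetype(page) == MIGRATE_HIGHATOMIC;
          }
      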
      [akpm@linux-foundation.org: use static inlines rather than macros, per mhocko]
      Link: http://lkml.kernel.org/r/58B94F15.6060606@huawei.com
      Signed-off-by: Xishi Qiu <qiuxishi@huawei.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm, swap: Fix a race in free_swap_and_cache() · 322b8afe
      Huang Ying authored
      Before using the cluster lock in free_swap_and_cache(),
      swap_info_struct->lock was held while freeing the swap entry and
      acquiring the page lock, so the page swap count could not change when
      testing page information later.  But after switching to the cluster
      lock, the cluster lock (or swap_info_struct->lock) is held only while
      freeing the swap entry.  So before acquiring the page lock, the page
      swap count may be changed by another thread.  If the page swap count
      is not 0, we should not delete the page from the swap cache.  This is
      fixed by checking the page swap count again after acquiring the page
      lock.
      
      I found the race while reviewing the code, so I didn't trigger it via
      a test program.  If the race occurs for an anonymous page shared by
      multiple processes via fork, multiple pages will be allocated and
      swapped in from the swap device for the previously shared page.  That
      is, the user-visible runtime effect is that more memory will be used
      and the access latency for the page will be higher; that is, a
      performance regression.
      
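      A hedged sketch of the fix (paraphrasing the changelog; the in-tree
      code differs in detail):
      
          lock_page(page);
          /* recheck under the page lock: another thread may have taken a
           * swap reference between freeing the entry and locking the page */
          if (page_swapcount(page) == 0)
                  delete_from_swap_cache(page);
          unlock_page(page);
      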
      Link: http://lkml.kernel.org/r/20170301143905.12846-1-ying.huang@intel.com
      Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Shaohua Li <shli@kernel.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Tim Chen <tim.c.chen@intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: memcontrol: provide shmem statistics · 9a4caf1e
      Johannes Weiner authored
      Cgroups currently don't report how much shmem they use, which can be
      useful data to have, in particular since shmem is included in the
      cache/file item while being reclaimed like anonymous memory.
      
      Add a counter to track shmem pages during charging and uncharging.
      
      Link: http://lkml.kernel.org/r/20170221164343.32252-1-hannes@cmpxchg.org
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Reported-by: Chris Down <cdown@fb.com>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: enable MADV_FREE for swapless system · 93e06c7a
      Shaohua Li authored
      Now MADV_FREE pages can be easily reclaimed even on a swapless
      system.  We can safely enable MADV_FREE for all systems.
      
      Link: http://lkml.kernel.org/r/155648585589300bfae1d45078e7aebb3d988b87.1487965799.git.shli@fb.com
      Signed-off-by: Shaohua Li <shli@fb.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com>
      Acked-by: Minchan Kim <minchan@kernel.org>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: fix lazyfree BUG_ON check in try_to_unmap_one() · eb94a878
      Minchan Kim authored
      If a page is swapbacked, it should be in the swapcache in
      try_to_unmap_one's path.
      
      If a page is !swapbacked, it shouldn't be in the swapcache in
      try_to_unmap_one's path.
      
      Check both cases at once and, if the check fails, warn and return
      SWAP_FAIL.  Such a bug never means we should shut down the kernel.
      
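      A hedged sketch of the check (the exact in-tree form differs; see
      the note below about VM_WARN_ON_ONCE):
      
          /* a swap-backed anon page must be in the swapcache here, and a
           * !swap-backed (lazyfree) one must not be */
          if (PageSwapBacked(page) != PageSwapCache(page)) {
                  WARN_ON_ONCE(1);
                  return SWAP_FAIL;
          }
      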
      [minchan@kernel.org: do not use VM_WARN_ON_ONCE as an if condition]
        Link: http://lkml.kernel.org/r/20170309060226.GB854@bbox
      Link: http://lkml.kernel.org/r/20170307055551.GC29458@bbox
      Signed-off-by: Minchan Kim <minchan@kernel.org>
      Suggested-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Shaohua Li <shli@fb.com>
      Cc: Hillf Danton <hillf.zj@alibaba-inc.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: reclaim MADV_FREE pages · 802a3a92
      Shaohua Li authored
      When memory pressure is high, we free MADV_FREE pages.  If the pages
      are not dirty in the pte, they can be freed immediately.  Otherwise
      we can't reclaim them.  We put such pages back on the anonymous LRU
      list (by setting the SwapBacked flag) and they will be reclaimed via
      the normal swapout path.
      
      We use the normal page reclaim policy.  Since MADV_FREE pages are put
      on the inactive file list, such pages and inactive file pages are
      reclaimed according to their age.  This is expected, because we don't
      want to reclaim too many MADV_FREE pages before used-once pages.
      
      Based on Minchan's original patch.
      
      [minchan@kernel.org: clean up lazyfree page handling]
        Link: http://lkml.kernel.org/r/20170303025237.GB3503@bbox
      Link: http://lkml.kernel.org/r/14b8eb1d3f6bf6cc492833f183ac8c304e560484.1487965799.git.shli@fb.com
      Signed-off-by: Shaohua Li <shli@fb.com>
      Signed-off-by: Minchan Kim <minchan@kernel.org>
      Acked-by: Minchan Kim <minchan@kernel.org>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: move MADV_FREE pages into LRU_INACTIVE_FILE list · f7ad2a6c
      Shaohua Li authored
      madvise()'s MADV_FREE indicates pages are 'lazyfree'.  They are still
      anonymous pages, but they can be freed without pageout.  To
      distinguish them from normal anonymous pages, we clear their
      SwapBacked flag.
      
      Since MADV_FREE pages can be freed without pageout, they are much
      like used-once file pages.  For such pages, we'd like to reclaim them
      once there is memory pressure.  It might also be unfair to always
      reclaim MADV_FREE pages before used-once file pages, and we
      definitely want to reclaim these pages before other anonymous and
      file pages.
      
      To speed up MADV_FREE page reclaim, we put the pages on the
      LRU_INACTIVE_FILE list.  The rationale is that the LRU_INACTIVE_FILE
      list is tiny nowadays and should be full of used-once file pages.
      Reclaiming MADV_FREE pages will not interfere much with anonymous and
      active file pages, and the inactive file pages and MADV_FREE pages
      will be reclaimed according to their age, so we don't reclaim too
      many MADV_FREE pages either.  Putting the MADV_FREE pages on the
      LRU_INACTIVE_FILE list also means we can reclaim them without swap
      support.  This idea was suggested by Johannes.
      
      This patch doesn't move MADV_FREE pages to the LRU_INACTIVE_FILE list
      yet, to avoid bisect failures; the next patch will do that.
      
      The patch is based on Minchan's original patch.
      
      [akpm@linux-foundation.org: coding-style fixes]
      Link: http://lkml.kernel.org/r/2f87063c1e9354677b7618c647abde77b07561e5.1487965799.git.shli@fb.com
      Signed-off-by: Shaohua Li <shli@fb.com>
      Suggested-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Minchan Kim <minchan@kernel.org>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: don't assume anonymous pages have SwapBacked flag · d44d363f
      Shaohua Li authored
      There are a few places where the code assumes anonymous pages should
      have the SwapBacked flag set.  MADV_FREE pages are anonymous pages,
      but we are going to add them to the LRU_INACTIVE_FILE list and clear
      the SwapBacked flag for them.  The assumption doesn't hold any more,
      so fix those places.
      
      Link: http://lkml.kernel.org/r/3945232c0df3dd6c4ef001976f35a95f18dcb407.1487965799.git.shli@fb.com
      Signed-off-by: Shaohua Li <shli@fb.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: delete unnecessary TTU_* flags · a128ca71
      Shaohua Li authored
      Patch series "mm: fix some MADV_FREE issues", v5.
      
      We are trying to use MADV_FREE in jemalloc.  Several issues were
      found; without solving them, jemalloc can't use the MADV_FREE
      feature.
      
       - It doesn't support systems without swap enabled.  If swap is off,
         we can't (or can't efficiently) age anonymous pages.  And since
         MADV_FREE pages are mixed with other anonymous pages, we can't
         reclaim MADV_FREE pages.  In the current implementation, MADV_FREE
         falls back to MADV_DONTNEED without swap enabled.  But in our
         environment, a lot of machines don't enable swap.  This prevents
         our setup from using MADV_FREE.
      
       - It increases memory pressure.  Page reclaim biases file page
         reclaim against anonymous pages.  This doesn't make sense for
         MADV_FREE pages, because those pages could be freed easily and
         refilled with only a slight penalty.  Even if page reclaim didn't
         bias file pages, there would still be an issue, because MADV_FREE
         pages and other anonymous pages are mixed together.  To reclaim a
         MADV_FREE page, we probably must scan a lot of other anonymous
         pages, which is inefficient.  In our test, we usually see OOM with
         MADV_FREE enabled and nothing without it.
      
       - Accounting.  There are two accounting problems.  We don't have
         global accounting: if the system is behaving abnormally, we don't
         know whether the problem comes from the MADV_FREE side.  The other
         problem is RSS accounting.  MADV_FREE pages are accounted as
         normal anon pages and reclaimed lazily, so the application's RSS
         becomes bigger.  This confuses our workloads.  We have a
         monitoring daemon running, and if it finds an application's RSS
         becoming abnormal, the daemon will kill the application even
         though the kernel can reclaim the memory easily.
      
      To address the first two issues, we can either put MADV_FREE pages on
      a separate LRU list (Minchan's previous patches and the V1 patches),
      or put them on the LRU_INACTIVE_FILE list (suggested by Johannes).
      The patchset uses the second idea.  The reason is that the
      LRU_INACTIVE_FILE list is tiny nowadays and should be full of
      used-once file pages, so we can still efficiently reclaim MADV_FREE
      pages there without interference with other anon and active file
      pages.  Putting the pages on the inactive file list also has the
      advantage of allowing page reclaim to prioritize MADV_FREE pages and
      used-once file pages.  MADV_FREE pages are put on the LRU list with
      the SwapBacked flag cleared, so PageAnon(page) &&
      !PageSwapBacked(page) indicates a MADV_FREE page.  These pages are
      freed directly without pageout if they are clean; otherwise normal
      swap reclaims them.
      
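      In code terms, the lazyfree test is simply the following (a
      hypothetical helper named here for illustration):
      
          static inline bool page_is_lazyfree(struct page *page)
          {
                  return PageAnon(page) && !PageSwapBacked(page);
          }
      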
      For the third issue, the previous post added global accounting and a
      separate RSS count for MADV_FREE pages.  The problem is that we never
      get accurate accounting for MADV_FREE pages: the pages are mapped
      into userspace and can be dirtied without notice from the kernel
      side.  To get accurate accounting we could write-protect the page,
      but then there is extra page fault overhead, which people don't want
      to pay.  The jemalloc folks have concerns about the inaccurate
      accounting, so this post drops the accounting patches temporarily.
      The info exported to /proc/pid/smaps for MADV_FREE pages is kept, as
      that is the only place we can get accurate accounting right now.
      
      This patch (of 6):
      
      Johannes pointed out that TTU_LZFREE is unnecessary.  It's true
      because we always have the flag set if we want to do an unmap.  For
      the cases where we don't do an unmap, the TTU_LZFREE part of the code
      should never run.
      
      TTU_UNMAP is also unnecessary.  If no other flags are set (for
      example, TTU_MIGRATION), an unmap is implied.
      
      The patch includes Johannes's cleanup and the removal of the dead
      TTU_ACTION macro.
      
      Link: http://lkml.kernel.org/r/4be3ea1bc56b26fd98a54d0a6f70bec63f6d8980.1487965799.git.shli@fb.com
      Signed-off-by: Shaohua Li <shli@fb.com>
      Suggested-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Minchan Kim <minchan@kernel.org>
      Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/page-writeback.c: use setup_deferrable_timer · 0a372d09
      Geliang Tang authored
      Use setup_deferrable_timer() instead of init_timer_deferrable() to
      simplify the code.
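
      The simplification collapses an init-plus-assignment sequence into a
      single call (a hedged sketch with an invented timer and callback; the
      real hunk is in mm/page-writeback.c):
      
          /* before */
          init_timer_deferrable(&my_timer);
          my_timer.function = my_timer_fn;
          my_timer.data = (unsigned long)my_data;
      
          /* after */
          setup_deferrable_timer(&my_timer, my_timer_fn,
                                 (unsigned long)my_data);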
      
      Link: http://lkml.kernel.org/r/e8e3d4280a34facbc007346f31df833cec28801e.1488070291.git.geliangtang@gmail.com
      Signed-off-by: Geliang Tang <geliangtang@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: remove unnecessary back-off function when retrying page reclaim · 491d79ae
      Johannes Weiner authored
      The backoff mechanism is not needed.  If we have MAX_RECLAIM_RETRIES
      loops without progress, we'll OOM anyway; backing off might cut one or
      two iterations off that in the rare OOM case.  If we have intermittent
      success reclaiming a few pages, the backoff function gets reset also,
      and so is of little help in these scenarios.
      
      We might want a backoff function for when there IS progress, but not
      enough to be satisfactory.  But this isn't that.  Remove it.
      
      Link: http://lkml.kernel.org/r/20170228214007.5621-10-hannes@cmpxchg.org
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Jia He <hejianet@gmail.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • Revert "mm, vmscan: account for skipped pages as a partial scan" · 3db65812
      Johannes Weiner authored
      This reverts commit d7f05528.
      
      Now that reclaimability of a node is no longer based on the ratio
      between pages scanned and theoretically reclaimable pages, we can remove
      accounting tricks for pages skipped due to zone constraints.
      
      Link: http://lkml.kernel.org/r/20170228214007.5621-9-hannes@cmpxchg.org
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Jia He <hejianet@gmail.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: delete NR_PAGES_SCANNED and pgdat_reclaimable() · c822f622
      Johannes Weiner authored
      NR_PAGES_SCANNED counts number of pages scanned since the last page free
      event in the allocator.  This was used primarily to measure the
      reclaimability of zones and nodes, and determine when reclaim should
      give up on them.  In that role, it has been replaced in the preceding
      patches by a different mechanism.
      
      Being implemented as an efficient vmstat counter, it was automatically
      exported to userspace as well.  It's however unlikely that anyone
      outside the kernel is using this counter in any meaningful way.
      
      Remove the counter and the unused pgdat_reclaimable().
      
      Link: http://lkml.kernel.org/r/20170228214007.5621-8-hannes@cmpxchg.org
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Jia He <hejianet@gmail.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: don't avoid high-priority reclaim on memcg limit reclaim · 688035f7
      Johannes Weiner authored
      Commit 246e87a9 ("memcg: fix get_scan_count() for small targets")
      sought to avoid high reclaim priorities for memcg by forcing it to scan
      a minimum amount of pages when lru_pages >> priority yielded nothing.
      This was done at a time when reclaim decisions like dirty throttling
      were tied to the priority level.
      
      Nowadays, the only meaningful thing still tied to priority dropping
      below DEF_PRIORITY - 2 is gating whether laptop_mode=1 is generally
      allowed to write.  But that is from an era where direct reclaim was
      still allowed to call ->writepage, and kswapd nowadays avoids writes
      until it's scanned every clean page in the system.  Potential changes
      to how quickly sc->may_writepage could trigger are of little concern.
      
      Remove the force_scan stuff, as well as the ugly multi-pass target
      calculation that it necessitated.
      
      Link: http://lkml.kernel.org/r/20170228214007.5621-7-hannes@cmpxchg.org
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Jia He <hejianet@gmail.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: don't avoid high-priority reclaim on unreclaimable nodes · a2d7f8e4
      Johannes Weiner authored
      Commit 246e87a9 ("memcg: fix get_scan_count() for small targets")
      sought to avoid high reclaim priorities for kswapd by forcing it to scan
      a minimum amount of pages when lru_pages >> priority yielded nothing.
      
      Commit b95a2f2d ("mm: vmscan: convert global reclaim to per-memcg
      LRU lists"), due to switching global reclaim to a round-robin scheme
      over all cgroups, had to restrict this forceful behavior to
      unreclaimable zones in order to prevent massive overreclaim with many
      cgroups.
      
      The latter patch effectively neutered the behavior completely for all
      but extreme memory pressure.  But in those situations we might as well
      drop the reclaimers to lower priority levels.  Remove the check.
      
      Link: http://lkml.kernel.org/r/20170228214007.5621-6-hannes@cmpxchg.org
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Jia He <hejianet@gmail.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: remove unnecessary reclaimability check from NUMA balancing target · 15038d0d
      Johannes Weiner authored
      NUMA balancing already checks the watermarks of the target node to
      decide whether it's a suitable balancing target.  Whether the node is
      reclaimable or not is irrelevant when we don't intend to reclaim.
      
      Link: http://lkml.kernel.org/r/20170228214007.5621-5-hannes@cmpxchg.org
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Jia He <hejianet@gmail.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: remove seemingly spurious reclaimability check from laptop_mode gating · 047d72c3
      Johannes Weiner authored
      Commit 1d82de61 ("mm, vmscan: make kswapd reclaim in terms of
      nodes") allowed laptop_mode=1 to start writing not just when the
      priority drops to DEF_PRIORITY - 2 but also when the node is
      unreclaimable.
      
      That appears to be a spurious change in this patch, as I doubt the
      series was tested with laptop_mode, and that particular change is not
      mentioned in the changelog.  Remove it; it's still recent.
      
      Link: http://lkml.kernel.org/r/20170228214007.5621-4-hannes@cmpxchg.org
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com>
      Acked-by: Mel Gorman <mgorman@techsingularity.net>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Jia He <hejianet@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: fix check for reclaimable pages in PF_MEMALLOC reclaim throttling · d450abd8
      Johannes Weiner authored
      PF_MEMALLOC direct reclaimers get throttled on a node when the sum of
      all free pages in each zone falls below half the min watermark.
      During the summation, we want to exclude zones that don't have
      reclaimable pages.  Checking the same pgdat over and over again
      doesn't make sense.
      
      Fixes: 599d0c95 ("mm, vmscan: move LRU lists to node")
      Link: http://lkml.kernel.org/r/20170228214007.5621-3-hannes@cmpxchg.org
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Jia He <hejianet@gmail.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: fix 100% CPU kswapd busyloop on unreclaimable nodes · c73322d0
      Johannes Weiner authored
      Patch series "mm: kswapd spinning on unreclaimable nodes - fixes and
      cleanups".
      
      Jia reported a scenario in which the kswapd of a node indefinitely spins
      at 100% CPU usage.  We have seen similar cases at Facebook.
      
      The kernel's current method of judging its ability to reclaim a node (or
      whether to back off and sleep) is based on the amount of scanned pages
      in proportion to the amount of reclaimable pages.  In Jia's and our
      scenarios, there are no reclaimable pages in the node, however, and the
      condition for backing off is never met.  Kswapd busyloops in an attempt
      to restore the watermarks while having nothing to work with.
      
      This series reworks the definition of an unreclaimable node based not on
      scanning but on whether kswapd is able to actually reclaim pages in
      MAX_RECLAIM_RETRIES (16) consecutive runs.  This is the same criteria
      the page allocator uses for giving up on direct reclaim and invoking the
      OOM killer.  If it cannot free any pages, kswapd will go to sleep and
      leave further attempts to direct reclaim invocations, which will either
      make progress and re-enable kswapd, or invoke the OOM killer.
      
      Patch #1 fixes the immediate problem Jia reported, the remainder are
      smaller fixlets, cleanups, and overall phasing out of the old method.
      
      Patch #6 is the odd one out.  It's a nice cleanup to get_scan_count(),
      and directly related to #5, but in itself not relevant to the series.
      
      If the whole series is too ambitious for 4.11, I would consider the
      first three patches fixes, the rest cleanups.
      
      This patch (of 9):
      
      Jia He reports a problem with kswapd spinning at 100% CPU when
      requesting more hugepages than memory available in the system:
      
      $ echo 4000 >/proc/sys/vm/nr_hugepages
      
      top - 13:42:59 up  3:37,  1 user,  load average: 1.09, 1.03, 1.01
      Tasks:   1 total,   1 running,   0 sleeping,   0 stopped,   0 zombie
      %Cpu(s):  0.0 us, 12.5 sy,  0.0 ni, 85.5 id,  2.0 wa,  0.0 hi,  0.0 si,  0.0 st
      KiB Mem:  31371520 total, 30915136 used,   456384 free,      320 buffers
      KiB Swap:  6284224 total,   115712 used,  6168512 free.    48192 cached Mem
      
        PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
         76 root      20   0       0      0      0 R 100.0 0.000 217:17.29 kswapd3
      
      At that time, there are no reclaimable pages left in the node, but as
      kswapd fails to restore the high watermarks it refuses to go to sleep.
      
      Kswapd needs to back away from nodes that fail to balance.  Up until
      commit 1d82de61 ("mm, vmscan: make kswapd reclaim in terms of
      nodes") kswapd had such a mechanism.  It considered zones whose
      theoretically reclaimable pages it had reclaimed six times over as
      unreclaimable and backed away from them.  This guard was erroneously
      removed as the patch changed the definition of a balanced node.
      
      However, simply restoring this code wouldn't help in the case reported
      here: there *are* no reclaimable pages that could be scanned until the
      threshold is met.  Kswapd would stay awake anyway.
      
      Introduce a new and much simpler way of backing off.  If kswapd runs
      through MAX_RECLAIM_RETRIES (16) cycles without reclaiming a single
      page, make it back off from the node.  This is the same number of shots
      direct reclaim takes before declaring OOM.  Kswapd will go to sleep on
      that node until a direct reclaimer manages to reclaim some pages, thus
      proving the node reclaimable again.
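      
      A self-contained model of that backoff (illustrative only; the struct
      and function names here are assumptions, not the patch itself):
      
      	#include <stdbool.h>
      
      	#define MAX_RECLAIM_RETRIES 16	/* same count direct reclaim gets before OOM */
      
      	struct node_reclaim_state {
      		unsigned int kswapd_failures;	/* consecutive runs that freed nothing */
      	};
      
      	/* Called after every balancing run with the pages it managed to free. */
      	static void account_reclaim_run(struct node_reclaim_state *node,
      					unsigned long nr_reclaimed)
      	{
      		if (nr_reclaimed)
      			node->kswapd_failures = 0;	/* progress: node is reclaimable */
      		else
      			node->kswapd_failures++;
      	}
      
      	/*
      	 * kswapd goes to sleep once the node looks hopeless; a direct
      	 * reclaimer that later frees pages resets the counter through
      	 * account_reclaim_run() and thereby re-arms kswapd.
      	 */
      	static bool kswapd_should_back_off(const struct node_reclaim_state *node)
      	{
      		return node->kswapd_failures >= MAX_RECLAIM_RETRIES;
      	}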
      
      [hannes@cmpxchg.org: check kswapd failure against the cumulative nr_reclaimed count]
        Link: http://lkml.kernel.org/r/20170306162410.GB2090@cmpxchg.org
      [shakeelb@google.com: fix condition for throttle_direct_reclaim]
        Link: http://lkml.kernel.org/r/20170314183228.20152-1-shakeelb@google.com
      Link: http://lkml.kernel.org/r/20170228214007.5621-2-hannes@cmpxchg.org
      Signed-off-by: NJohannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: NShakeel Butt <shakeelb@google.com>
      Reported-by: NJia He <hejianet@gmail.com>
      Tested-by: NJia He <hejianet@gmail.com>
      Acked-by: NMichal Hocko <mhocko@suse.com>
      Acked-by: NHillf Danton <hillf.zj@alibaba-inc.com>
      Acked-by: NMinchan Kim <minchan@kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      c73322d0
    • G
      slab: avoid IPIs when creating kmem caches · a87c75fb
      Committed by Greg Thelen
      Each slab kmem cache has per-cpu array caches.  The array caches are
      created when the kmem_cache is created, either via kmem_cache_create() or
      lazily when the first object is allocated in the context of a kmem-enabled
      memcg.  Array caches are replaced by writing to /proc/slabinfo.
      
      Array caches are protected by holding slab_mutex or by disabling
      interrupts.  Array cache allocation and replacement is done by
      __do_tune_cpucache(), which holds slab_mutex and calls
      kick_all_cpus_sync() to interrupt all remote processors, confirming
      that no references to the old array caches remain.
      
      IPIs are needed when replacing array caches.  But when creating a new
      array cache, there's no need to send IPIs because there cannot be any
      references to the new cache.  Outside of memcg kmem accounting these
      IPIs occur at boot time, so they're not a problem.  But with memcg kmem
      accounting each container can create kmem caches, so the IPIs are
      wasteful.
      
      Avoid unnecessary IPIs when creating array caches.
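      
      The fix is small; a hedged sketch of the relevant hunk in
      __do_tune_cpucache() (the shape is inferred from the description
      above, not quoted from the diff):
      
      	/* Swap in the freshly allocated per-cpu array caches. */
      	prev = cachep->cpu_cache;
      	cachep->cpu_cache = cpu_cache;
      
      	/*
      	 * Without a previous cpu_cache, no remote CPU can hold a
      	 * reference to anything being replaced, so the IPI broadcast
      	 * can be skipped when the cache is first created.
      	 */
      	if (prev)
      		kick_all_cpus_sync();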
      
      A test which reports the IPI count when allocating slab objects in 10000 memcgs:
      
      	import os
      
      	def ipi_count():
      		with open("/proc/interrupts") as f:
      			for l in f:
      				if 'Function call interrupts' in l:
      					return int(l.split()[1])
      
      	def echo(val, path):
      		with open(path, "w") as f:
      			f.write(val)
      
      	n = 10000
      	os.chdir("/mnt/cgroup/memory")
      	pid = str(os.getpid())
      	a = ipi_count()
      	for i in range(n):
      		os.mkdir(str(i))
      		echo("1G\n", "%d/memory.limit_in_bytes" % i)
      		echo("1G\n", "%d/memory.kmem.limit_in_bytes" % i)
      		echo(pid, "%d/cgroup.procs" % i)
      		open("/tmp/x", "w").close()
      		os.unlink("/tmp/x")
      	b = ipi_count()
      	print "%d loops: %d => %d (+%d ipis)" % (n, a, b, b-a)
      	echo(pid, "cgroup.procs")
      	for i in range(n):
      		os.rmdir(str(i))
      
      patched:   10000 loops: 1069 => 1170 (+101 ipis)
      unpatched: 10000 loops: 1192 => 48933 (+47741 ipis)
      
      Link: http://lkml.kernel.org/r/20170416214544.109476-1-gthelen@google.com
      Signed-off-by: NGreg Thelen <gthelen@google.com>
      Acked-by: NJoonsoo Kim <iamjoonsoo.kim@lge.com>
      Acked-by: NDavid Rientjes <rientjes@google.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      a87c75fb
  2. 01 May, 2017 2 commits
    • H
      mm: remove AVR32 arch special handling in mm/Kconfig · 01e7bc24
      Committed by Hans-Christian Noren Egtvedt
      The AVR32 architecture has been removed from the Linux kernel sources,
      hence clean up the special handling in mm/Kconfig that set two
      quicklists by default for it.
      Signed-off-by: NHans-Christian Noren Egtvedt <egtvedt@samfundet.no>
      01e7bc24
    • D
      mm, zone_device: Replace {get, put}_zone_device_page() with a single reference to fix pmem crash · 71389703
      Committed by Dan Williams
      The x86 conversion to the generic GUP code included a small change which causes
      crashes and data corruption in the pmem code - not good.
      
      The root cause is that the /dev/pmem driver code implicitly relies on the x86
      get_user_pages() implementation doing a get_page() on the page refcount, because
      get_page() does a get_zone_device_page() which properly refcounts pmem's separate
      page struct arrays that are not present in the regular page structures.
      (The pmem driver does this because it can cover huge memory areas.)
      
      But the x86 conversion to the generic GUP code changed the get_page() to
      page_cache_get_speculative() which is faster but doesn't do the
      get_zone_device_page() call the pmem code relies on.
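      
      For reference, a sketch of the get_page() shape the pmem code relied
      on (simplified from the pre-conversion kernel source; treat the exact
      form as an assumption):
      
      	static inline void get_page(struct page *page)
      	{
      		page = compound_head(page);
      		VM_BUG_ON_PAGE(page_ref_count(page) <= 0, page);
      
      		page_ref_inc(page);
      		/* The side effect fast-GUP's speculative get skips: */
      		if (unlikely(is_zone_device_page(page)))
      			get_zone_device_page(page);	/* pin pmem's pagemap */
      	}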
      
      One way to solve the regression would be to change the generic GUP code to use
      get_page(), but that would slow things down a bit and punish other generic-GUP
      using architectures for an x86-ism they did not care about. (Arguably the pmem
      driver was probably not working reliably for them: but nvdimm is an Intel
      feature, so non-x86 exposure is probably still limited.)
      
      So restructure the pmem code's interface with the MM instead: get rid of the
      get/put_zone_device_page() distinction, integrate put_zone_device_page() into
      __put_page() and restructure the pmem completion-wait and teardown machinery:
      
      Kirill points out that the calls to {get,put}_dev_pagemap() can be
      removed from the mm fast path if we take a single get_dev_pagemap()
      reference to signify that the page is alive and use the final put of the
      page to drop that reference.
      
      This does require some care to make sure that any waits for the
      percpu_ref to drop to zero occur *after* devm_memremap_pages_release(),
      since it now maintains its own elevated reference.
      
      This speeds things up while also making the pmem refcounting more robust
      going forward.
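      
      A hedged sketch of the resulting final-put path (names follow the
      description above; this is not presented as the exact upstream diff):
      
      	/*
      	 * With one long-lived get_dev_pagemap() reference taken when the
      	 * pagemap is created, the fast paths no longer touch the pagemap
      	 * refcount -- only the final put of a ZONE_DEVICE page does.
      	 */
      	void __put_page(struct page *page)
      	{
      		if (is_zone_device_page(page)) {
      			/* Drop the reference this page's lifetime pinned. */
      			put_dev_pagemap(page->pgmap);
      			/*
      			 * The page belongs to the device that created the
      			 * pagemap; do not return it to the page allocator.
      			 * The range is torn down by the driver once the
      			 * percpu_ref drains to zero.
      			 */
      			return;
      		}
      
      		if (unlikely(PageCompound(page)))
      			__put_compound_page(page);
      		else
      			__put_single_page(page);
      	}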
      Suggested-by: NKirill Shutemov <kirill.shutemov@linux.intel.com>
      Tested-by: NKirill Shutemov <kirill.shutemov@linux.intel.com>
      Signed-off-by: NDan Williams <dan.j.williams@intel.com>
      Reviewed-by: NLogan Gunthorpe <logang@deltatee.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Josh Poimboeuf <jpoimboe@redhat.com>
      Cc: Jérôme Glisse <jglisse@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-mm@kvack.org
      Link: http://lkml.kernel.org/r/149339998297.24933.1129582806028305912.stgit@dwillia2-desk3.amr.corp.intel.com
      Signed-off-by: NIngo Molnar <mingo@kernel.org>
      71389703
  3. 23 Apr, 2017 1 commit
    • I
      Revert "x86/mm/gup: Switch GUP to the generic get_user_page_fast() implementation" · 6dd29b3d
      Committed by Ingo Molnar
      This reverts commit 2947ba05.
      
      Dan Williams reported dax-pmem kernel warnings with the following signature:
      
         WARNING: CPU: 8 PID: 245 at lib/percpu-refcount.c:155 percpu_ref_switch_to_atomic_rcu+0x1f5/0x200
         percpu ref (dax_pmem_percpu_release [dax_pmem]) <= 0 (0) after switching to atomic
      
      ... and bisected it to this commit, which suggests possible memory corruption
      caused by the x86 fast-GUP conversion.
      
      He also pointed out:
      
       "
        This is similar to the backtrace when we were not properly handling
        pud faults and was fixed with this commit: 220ced16 "mm: fix
        get_user_pages() vs device-dax pud mappings"
      
        I've found some missing _devmap checks in the generic
        get_user_pages_fast() path, but this does not fix the regression
        [...]
       "
      
      So given that there are known bugs, and a pretty robust-looking bisection
      points to this commit, suggesting that there are unknown bugs in the
      conversion as well, revert it for the time being - we'll re-try in v4.13.
      Reported-by: NDan Williams <dan.j.williams@intel.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: aneesh.kumar@linux.vnet.ibm.com
      Cc: dann.frazier@canonical.com
      Cc: dave.hansen@intel.com
      Cc: steve.capper@linaro.org
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: NIngo Molnar <mingo@kernel.org>
      6dd29b3d
  4. 22 Apr, 2017 2 commits
  5. 21 Apr, 2017 3 commits