1. 07 6月, 2014 2 次提交
    • M
      vmscan: memcg: always use swappiness of the reclaimed memcg · 688eb988
      Michal Hocko 提交于
      Memory reclaim always uses swappiness of the reclaim target memcg
      (origin of the memory pressure) or vm_swappiness for global memory
      reclaim.  This behavior was consistent (except for difference between
      global and hard limit reclaim) because swappiness was enforced to be
      consistent within each memcg hierarchy.
      
      After "mm: memcontrol: remove hierarchy restrictions for swappiness and
      oom_control" each memcg can have its own swappiness independent of
      hierarchical parents, though, so the consistency guarantee is gone.
      This can lead to an unexpected behavior.  Say that a group is explicitly
      configured to not swapout by memory.swappiness=0 but its memory gets
      swapped out anyway when the memory pressure comes from its parent with a
      It is also unexpected that the knob is meaningless without setting the
      hard limit which would trigger the reclaim and enforce the swappiness.
      There are setups where the hard limit is configured higher in the
      hierarchy by an administrator and children groups are under control of
      somebody else who is interested in the swapout behavior but not
      necessarily about the memory limit.
      
      From a semantic point of view swappiness is an attribute defining anon
      vs.
       file proportional scanning of LRU which is memcg specific (unlike
      charges which are propagated up the hierarchy) so it should be applied
      to the particular memcg's LRU regardless where the memory pressure comes
      from.
      
      This patch removes vmscan_swappiness() and stores the swappiness into
      the scan_control structure.  mem_cgroup_swappiness is then used to
      provide the correct value before shrink_lruvec is called.  The global
      vm_swappiness is used for the root memcg.
      
      [hughd@google.com: oopses immediately when booted with cgroup_disable=memory]
      Signed-off-by: NMichal Hocko <mhocko@suse.cz>
      Acked-by: NJohannes Weiner <hannes@cmpxchg.org>
      Cc: Tejun Heo <tj@kernel.org>
      Signed-off-by: NHugh Dickins <hughd@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      688eb988
    • J
      mm: vmscan: clear kswapd's special reclaim powers before exiting · 71abdc15
      Johannes Weiner 提交于
      When kswapd exits, it can end up taking locks that were previously held
      by allocating tasks while they waited for reclaim.  Lockdep currently
      warns about this:
      
      On Wed, May 28, 2014 at 06:06:34PM +0800, Gu Zheng wrote:
      >  inconsistent {RECLAIM_FS-ON-W} -> {IN-RECLAIM_FS-R} usage.
      >  kswapd2/1151 [HC0[0]:SC0[0]:HE1:SE1] takes:
      >   (&sig->group_rwsem){+++++?}, at: exit_signals+0x24/0x130
      >  {RECLAIM_FS-ON-W} state was registered at:
      >     mark_held_locks+0xb9/0x140
      >     lockdep_trace_alloc+0x7a/0xe0
      >     kmem_cache_alloc_trace+0x37/0x240
      >     flex_array_alloc+0x99/0x1a0
      >     cgroup_attach_task+0x63/0x430
      >     attach_task_by_pid+0x210/0x280
      >     cgroup_procs_write+0x16/0x20
      >     cgroup_file_write+0x120/0x2c0
      >     vfs_write+0xc0/0x1f0
      >     SyS_write+0x4c/0xa0
      >     tracesys+0xdd/0xe2
      >  irq event stamp: 49
      >  hardirqs last  enabled at (49):  _raw_spin_unlock_irqrestore+0x36/0x70
      >  hardirqs last disabled at (48):  _raw_spin_lock_irqsave+0x2b/0xa0
      >  softirqs last  enabled at (0):  copy_process.part.24+0x627/0x15f0
      >  softirqs last disabled at (0):            (null)
      >
      >  other info that might help us debug this:
      >   Possible unsafe locking scenario:
      >
      >         CPU0
      >         ----
      >    lock(&sig->group_rwsem);
      >    <Interrupt>
      >      lock(&sig->group_rwsem);
      >
      >   *** DEADLOCK ***
      >
      >  no locks held by kswapd2/1151.
      >
      >  stack backtrace:
      >  CPU: 30 PID: 1151 Comm: kswapd2 Not tainted 3.10.39+ #4
      >  Call Trace:
      >    dump_stack+0x19/0x1b
      >    print_usage_bug+0x1f7/0x208
      >    mark_lock+0x21d/0x2a0
      >    __lock_acquire+0x52a/0xb60
      >    lock_acquire+0xa2/0x140
      >    down_read+0x51/0xa0
      >    exit_signals+0x24/0x130
      >    do_exit+0xb5/0xa50
      >    kthread+0xdb/0x100
      >    ret_from_fork+0x7c/0xb0
      
      This is because the kswapd thread is still marked as a reclaimer at the
      time of exit.  But because it is exiting, nobody is actually waiting on
      it to make reclaim progress anymore, and it's nothing but a regular
      thread at this point.  Be tidy and strip it of all its powers
      (PF_MEMALLOC, PF_SWAPWRITE, PF_KSWAPD, and the lockdep reclaim state)
      before returning from the thread function.
      Signed-off-by: NJohannes Weiner <hannes@cmpxchg.org>
      Reported-by: NGu Zheng <guz.fnst@cn.fujitsu.com>
      Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
      Cc: Tang Chen <tangchen@cn.fujitsu.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      71abdc15
  2. 05 6月, 2014 9 次提交
    • M
      mm: vmscan: use proportional scanning during direct reclaim and full scan at DEF_PRIORITY · 1a501907
      Mel Gorman 提交于
      Commit "mm: vmscan: obey proportional scanning requirements for kswapd"
      ensured that file/anon lists were scanned proportionally for reclaim from
      kswapd but ignored it for direct reclaim.  The intent was to minimse
      direct reclaim latency but Yuanhan Liu pointer out that it substitutes one
      long stall for many small stalls and distorts aging for normal workloads
      like streaming readers/writers.  Hugh Dickins pointed out that a
      side-effect of the same commit was that when one LRU list dropped to zero
      that the entirety of the other list was shrunk leading to excessive
      reclaim in memcgs.  This patch scans the file/anon lists proportionally
      for direct reclaim to similarly age page whether reclaimed by kswapd or
      direct reclaim but takes care to abort reclaim if one LRU drops to zero
      after reclaiming the requested number of pages.
      
      Based on ext4 and using the Intel VM scalability test
      
                                                    3.15.0-rc5            3.15.0-rc5
                                                      shrinker            proportion
      Unit  lru-file-readonce    elapsed      5.3500 (  0.00%)      5.4200 ( -1.31%)
      Unit  lru-file-readonce time_range      0.2700 (  0.00%)      0.1400 ( 48.15%)
      Unit  lru-file-readonce time_stddv      0.1148 (  0.00%)      0.0536 ( 53.33%)
      Unit lru-file-readtwice    elapsed      8.1700 (  0.00%)      8.1700 (  0.00%)
      Unit lru-file-readtwice time_range      0.4300 (  0.00%)      0.2300 ( 46.51%)
      Unit lru-file-readtwice time_stddv      0.1650 (  0.00%)      0.0971 ( 41.16%)
      
      The test cases are running multiple dd instances reading sparse files. The results are within
      the noise for the small test machine. The impact of the patch is more noticable from the vmstats
      
                                  3.15.0-rc5  3.15.0-rc5
                                    shrinker  proportion
      Minor Faults                     35154       36784
      Major Faults                       611        1305
      Swap Ins                           394        1651
      Swap Outs                         4394        5891
      Allocation stalls               118616       44781
      Direct pages scanned           4935171     4602313
      Kswapd pages scanned          15921292    16258483
      Kswapd pages reclaimed        15913301    16248305
      Direct pages reclaimed         4933368     4601133
      Kswapd efficiency                  99%         99%
      Kswapd velocity             670088.047  682555.961
      Direct efficiency                  99%         99%
      Direct velocity             207709.217  193212.133
      Percentage direct scans            23%         22%
      Page writes by reclaim        4858.000    6232.000
      Page writes file                   464         341
      Page writes anon                  4394        5891
      
      Note that there are fewer allocation stalls even though the amount
      of direct reclaim scanning is very approximately the same.
      Signed-off-by: NMel Gorman <mgorman@suse.de>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Tim Chen <tim.c.chen@linux.intel.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Tested-by: NYuanhan Liu <yuanhan.liu@linux.intel.com>
      Cc: Bob Liu <bob.liu@oracle.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      1a501907
    • J
      mm/vmscan.c: use DIV_ROUND_UP for calculation of zone's balance_gap and correct comments. · 4be89a34
      Jianyu Zhan 提交于
      Currently, we use (zone->managed_pages + KSWAPD_ZONE_BALANCE_GAP_RATIO-1)
      / KSWAPD_ZONE_BALANCE_GAP_RATIO to avoid a zero gap value.  It's better to
      use DIV_ROUND_UP macro for neater code and clear meaning.
      
      Besides, the gap value is calculated against the per-zone "managed pages",
      not "present pages".  This patch also corrects the comment and do some
      rephrasing.
      Signed-off-by: NJianyu Zhan <nasa4836@gmail.com>
      Acked-by: NRik van Riel <riel@redhat.com>
      Acked-by: NRafael Aquini <aquini@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      4be89a34
    • M
      mm: page_alloc: convert hot/cold parameter and immediate callers to bool · b745bc85
      Mel Gorman 提交于
      cold is a bool, make it one.  Make the likely case the "if" part of the
      block instead of the else as according to the optimisation manual this is
      preferred.
      Signed-off-by: NMel Gorman <mgorman@suse.de>
      Acked-by: NRik van Riel <riel@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Theodore Ts'o <tytso@mit.edu>
      Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      b745bc85
    • D
      mm: shrinker: add nid to tracepoint output · df9024a8
      Dave Hansen 提交于
      Now that we are doing NUMA-aware shrinking, and can have shrinkers
      running in parallel, or working on individual nodes, it seems like we
      should also be sticking the node in the output.
      Signed-off-by: NDave Hansen <dave.hansen@linux.intel.com>
      Acked-by: NDave Chinner <david@fromorbit.com>
      Cc: Konstantin Khlebnikov <khlebnikov@openvz.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      df9024a8
    • D
      mm: shrinker trace points: fix negatives · 7fe70475
      Dave Hansen 提交于
      I was looking at a trace of the slab shrinkers (attachment in this comment):
      
      	https://bugs.freedesktop.org/show_bug.cgi?id=72742#c67
      
      and noticed that "total_scan" can go negative in some cases.  We
      used to dump out the "total_scan" variable directly, but some of
      the shrinker modifications along the way changed that.
      
      This patch just dumps it out directly, again.  It doesn't make
      any sense to derive it from new_nr and nr any more since there
      are now other shrinkers that can be running in parallel and
      mucking with those values.
      
      Here's an example of the negative numbers in the output:
      
      >          kswapd0-840   [000]   160.869398: mm_shrink_slab_end:   i915_gem_inactive_scan+0x0 0xffff8800037cbc68: unused scan count 10 new scan count 39 total_scan 29 last shrinker return val 256
      >          kswapd0-840   [000]   160.869618: mm_shrink_slab_end:   i915_gem_inactive_scan+0x0 0xffff8800037cbc68: unused scan count 39 new scan count 102 total_scan 63 last shrinker return val 256
      >          kswapd0-840   [000]   160.870031: mm_shrink_slab_end:   i915_gem_inactive_scan+0x0 0xffff8800037cbc68: unused scan count 102 new scan count 47 total_scan -55 last shrinker return val 768
      >          kswapd0-840   [000]   160.870464: mm_shrink_slab_end:   i915_gem_inactive_scan+0x0 0xffff8800037cbc68: unused scan count 47 new scan count 45 total_scan -2 last shrinker return val 768
      >          kswapd0-840   [000]   163.384144: mm_shrink_slab_end:   i915_gem_inactive_scan+0x0 0xffff8800037cbc68: unused scan count 45 new scan count 56 total_scan 11 last shrinker return val 0
      >          kswapd0-840   [000]   163.384297: mm_shrink_slab_end:   i915_gem_inactive_scan+0x0 0xffff8800037cbc68: unused scan count 56 new scan count 15 total_scan -41 last shrinker return val 256
      >          kswapd0-840   [000]   163.384414: mm_shrink_slab_end:   i915_gem_inactive_scan+0x0 0xffff8800037cbc68: unused scan count 15 new scan count 117 total_scan 102 last shrinker return val 0
      >          kswapd0-840   [000]   163.384657: mm_shrink_slab_end:   i915_gem_inactive_scan+0x0 0xffff8800037cbc68: unused scan count 117 new scan count 36 total_scan -81 last shrinker return val 512
      >          kswapd0-840   [000]   163.384880: mm_shrink_slab_end:   i915_gem_inactive_scan+0x0 0xffff8800037cbc68: unused scan count 36 new scan count 111 total_scan 75 last shrinker return val 256
      >          kswapd0-840   [000]   163.385256: mm_shrink_slab_end:   i915_gem_inactive_scan+0x0 0xffff8800037cbc68: unused scan count 111 new scan count 34 total_scan -77 last shrinker return val 768
      >          kswapd0-840   [000]   163.385598: mm_shrink_slab_end:   i915_gem_inactive_scan+0x0 0xffff8800037cbc68: unused scan count 34 new scan count 122 total_scan 88 last shrinker return val 512
      Signed-off-by: NDave Hansen <dave.hansen@linux.intel.com>
      Acked-by: NDave Chinner <david@fromorbit.com>
      Cc: Konstantin Khlebnikov <khlebnikov@openvz.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      7fe70475
    • N
      mm/vmscan.c: avoid throttling reclaim for loop-back nfsd threads · 399ba0b9
      NeilBrown 提交于
      When a loopback NFS mount is active and the backing device for the NFS
      mount becomes congested, that can impose throttling delays on the nfsd
      threads.
      
      These delays significantly reduce throughput and so the NFS mount remains
      congested.
      
      This results in a livelock and the reduced throughput persists.
      
      This livelock has been found in testing with the 'wait_iff_congested'
      call, and could possibly be caused by the 'congestion_wait' call.
      
      This livelock is similar to the deadlock which justified the introduction
      of PF_LESS_THROTTLE, and the same flag can be used to remove this
      livelock.
      
      To minimise the impact of the change, we still throttle nfsd when the
      filesystem it is writing to is congested, but not when some separate
      filesystem (e.g.  the NFS filesystem) is congested.
      Signed-off-by: NNeilBrown <neilb@suse.de>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      399ba0b9
    • M
      mm: vmscan: do not throttle based on pfmemalloc reserves if node has no ZONE_NORMAL · 675becce
      Mel Gorman 提交于
      throttle_direct_reclaim() is meant to trigger during swap-over-network
      during which the min watermark is treated as a pfmemalloc reserve.  It
      throttes on the first node in the zonelist but this is flawed.
      
      The user-visible impact is that a process running on CPU whose local
      memory node has no ZONE_NORMAL will stall for prolonged periods of time,
      possibly indefintely.  This is due to throttle_direct_reclaim thinking the
      pfmemalloc reserves are depleted when in fact they don't exist on that
      node.
      
      On a NUMA machine running a 32-bit kernel (I know) allocation requests
      from CPUs on node 1 would detect no pfmemalloc reserves and the process
      gets throttled.  This patch adjusts throttling of direct reclaim to
      throttle based on the first node in the zonelist that has a usable
      ZONE_NORMAL or lower zone.
      
      [akpm@linux-foundation.org: coding-style fixes]
      Signed-off-by: NMel Gorman <mgorman@suse.de>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      675becce
    • V
      mem-hotplug: implement get/put_online_mems · bfc8c901
      Vladimir Davydov 提交于
      kmem_cache_{create,destroy,shrink} need to get a stable value of
      cpu/node online mask, because they init/destroy/access per-cpu/node
      kmem_cache parts, which can be allocated or destroyed on cpu/mem
      hotplug.  To protect against cpu hotplug, these functions use
      {get,put}_online_cpus.  However, they do nothing to synchronize with
      memory hotplug - taking the slab_mutex does not eliminate the
      possibility of race as described in patch 2.
      
      What we need there is something like get_online_cpus, but for memory.
      We already have lock_memory_hotplug, which serves for the purpose, but
      it's a bit of a hammer right now, because it's backed by a mutex.  As a
      result, it imposes some limitations to locking order, which are not
      desirable, and can't be used just like get_online_cpus.  That's why in
      patch 1 I substitute it with get/put_online_mems, which work exactly
      like get/put_online_cpus except they block not cpu, but memory hotplug.
      
      [ v1 can be found at https://lkml.org/lkml/2014/4/6/68.  I NAK'ed it by
        myself, because it used an rw semaphore for get/put_online_mems,
        making them dead lock prune.  ]
      
      This patch (of 2):
      
      {un}lock_memory_hotplug, which is used to synchronize against memory
      hotplug, is currently backed by a mutex, which makes it a bit of a
      hammer - threads that only want to get a stable value of online nodes
      mask won't be able to proceed concurrently.  Also, it imposes some
      strong locking ordering rules on it, which narrows down the set of its
      usage scenarios.
      
      This patch introduces get/put_online_mems, which are the same as
      get/put_online_cpus, but for memory hotplug, i.e.  executing a code
      inside a get/put_online_mems section will guarantee a stable value of
      online nodes, present pages, etc.
      
      lock_memory_hotplug()/unlock_memory_hotplug() are removed altogether.
      Signed-off-by: NVladimir Davydov <vdavydov@parallels.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: Tang Chen <tangchen@cn.fujitsu.com>
      Cc: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
      Cc: Toshi Kani <toshi.kani@hp.com>
      Cc: Xishi Qiu <qiuxishi@huawei.com>
      Cc: Jiang Liu <liuj97@gmail.com>
      Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Wen Congyang <wency@cn.fujitsu.com>
      Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
      Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      bfc8c901
    • S
      mm: only force scan in reclaim when none of the LRUs are big enough. · 6f04f48d
      Suleiman Souhlal 提交于
      Prior to this change, we would decide whether to force scan a LRU during
      reclaim if that LRU itself was too small for the current priority.
      However, this can lead to the file LRU getting force scanned even if
      there are a lot of anonymous pages we can reclaim, leading to hot file
      pages getting needlessly reclaimed.
      
      To address this, we instead only force scan when none of the reclaimable
      LRUs are big enough.
      
      Gives huge improvements with zswap.  For example, when doing -j20 kernel
      build in a 500MB container with zswap enabled, runtime (in seconds) is
      greatly reduced:
      
      x without this change
      + with this change
          N           Min           Max        Median           Avg        Stddev
      x   5       700.997       790.076       763.928        754.05      39.59493
      +   5       141.634       197.899       155.706         161.9     21.270224
      Difference at 95.0% confidence
              -592.15 +/- 46.3521
              -78.5293% +/- 6.14709%
              (Student's t, pooled s = 31.7819)
      
      Should also give some improvements in regular (non-zswap) swap cases.
      
      Yes, hughd found significant speedup using regular swap, with several
      memcgs under pressure; and it should also be effective in the non-memcg
      case, whenever one or another zone LRU is forced too small.
      Signed-off-by: NSuleiman Souhlal <suleiman@google.com>
      Signed-off-by: NHugh Dickins <hughd@google.com>
      Cc: Suleiman Souhlal <suleiman@google.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Acked-by: NRik van Riel <riel@redhat.com>
      Acked-by: NRafael Aquini <aquini@redhat.com>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Yuanhan Liu <yuanhan.liu@linux.intel.com>
      Cc: Seth Jennings <sjennings@variantweb.net>
      Cc: Bob Liu <bob.liu@oracle.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Luigi Semenzato <semenzato@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      6f04f48d
  3. 07 5月, 2014 1 次提交
  4. 19 4月, 2014 1 次提交
  5. 09 4月, 2014 1 次提交
    • J
      mm: vmscan: do not swap anon pages just because free+file is low · 0bf1457f
      Johannes Weiner 提交于
      Page reclaim force-scans / swaps anonymous pages when file cache drops
      below the high watermark of a zone in order to prevent what little cache
      remains from thrashing.
      
      However, on bigger machines the high watermark value can be quite large
      and when the workload is dominated by a static anonymous/shmem set, the
      file set might just be a small window of used-once cache.  In such
      situations, the VM starts swapping heavily when instead it should be
      recycling the no longer used cache.
      
      This is a longer-standing problem, but it's more likely to trigger after
      commit 81c0a2bb ("mm: page_alloc: fair zone allocator policy")
      because file pages can no longer accumulate in a single zone and are
      dispersed into smaller fractions among the available zones.
      
      To resolve this, do not force scan anon when file pages are low but
      instead rely on the scan/rotation ratios to make the right prediction.
      Signed-off-by: NJohannes Weiner <hannes@cmpxchg.org>
      Acked-by: NRafael Aquini <aquini@redhat.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Suleiman Souhlal <suleiman@google.com>
      Cc: <stable@kernel.org>		[3.12+]
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      0bf1457f
  6. 08 4月, 2014 2 次提交
  7. 04 4月, 2014 6 次提交
    • J
      mm: thrash detection-based file cache sizing · a528910e
      Johannes Weiner 提交于
      The VM maintains cached filesystem pages on two types of lists.  One
      list holds the pages recently faulted into the cache, the other list
      holds pages that have been referenced repeatedly on that first list.
      The idea is to prefer reclaiming young pages over those that have shown
      to benefit from caching in the past.  We call the recently usedbut
      ultimately was not significantly better than a FIFO policy and still
      thrashed cache based on eviction speed, rather than actual demand for
      cache.
      
      This patch solves one half of the problem by decoupling the ability to
      detect working set changes from the inactive list size.  By maintaining
      a history of recently evicted file pages it can detect frequently used
      pages with an arbitrarily small inactive list size, and subsequently
      apply pressure on the active list based on actual demand for cache, not
      just overall eviction speed.
      
      Every zone maintains a counter that tracks inactive list aging speed.
      When a page is evicted, a snapshot of this counter is stored in the
      now-empty page cache radix tree slot.  On refault, the minimum access
      distance of the page can be assessed, to evaluate whether the page
      should be part of the active list or not.
      
      This fixes the VM's blindness towards working set changes in excess of
      the inactive list.  And it's the foundation to further improve the
      protection ability and reduce the minimum inactive list size of 50%.
      Signed-off-by: NJohannes Weiner <hannes@cmpxchg.org>
      Reviewed-by: NRik van Riel <riel@redhat.com>
      Reviewed-by: NMinchan Kim <minchan@kernel.org>
      Reviewed-by: NBob Liu <bob.liu@oracle.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Luigi Semenzato <semenzato@google.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Metin Doslu <metin@citusdata.com>
      Cc: Michel Lespinasse <walken@google.com>
      Cc: Ozgun Erdogan <ozgun@citusdata.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Roman Gushchin <klamm@yandex-team.ru>
      Cc: Ryan Mallon <rmallon@gmail.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      a528910e
    • J
      mm + fs: store shadow entries in page cache · 91b0abe3
      Johannes Weiner 提交于
      Reclaim will be leaving shadow entries in the page cache radix tree upon
      evicting the real page.  As those pages are found from the LRU, an
      iput() can lead to the inode being freed concurrently.  At this point,
      reclaim must no longer install shadow pages because the inode freeing
      code needs to ensure the page tree is really empty.
      
      Add an address_space flag, AS_EXITING, that the inode freeing code sets
      under the tree lock before doing the final truncate.  Reclaim will check
      for this flag before installing shadow pages.
      Signed-off-by: NJohannes Weiner <hannes@cmpxchg.org>
      Reviewed-by: NRik van Riel <riel@redhat.com>
      Reviewed-by: NMinchan Kim <minchan@kernel.org>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Bob Liu <bob.liu@oracle.com>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Luigi Semenzato <semenzato@google.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Metin Doslu <metin@citusdata.com>
      Cc: Michel Lespinasse <walken@google.com>
      Cc: Ozgun Erdogan <ozgun@citusdata.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Roman Gushchin <klamm@yandex-team.ru>
      Cc: Ryan Mallon <rmallon@gmail.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      91b0abe3
    • V
      mm: vmscan: shrink_slab: rename max_pass -> freeable · d5bc5fd3
      Vladimir Davydov 提交于
      The name `max_pass' is misleading, because this variable actually keeps
      the estimate number of freeable objects, not the maximal number of
      objects we can scan in this pass, which can be twice that.  Rename it to
      reflect its actual meaning.
      Signed-off-by: NVladimir Davydov <vdavydov@parallels.com>
      Acked-by: NDavid Rientjes <rientjes@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      d5bc5fd3
    • V
      mm: vmscan: remove shrink_control arg from do_try_to_free_pages() · 3115cd91
      Vladimir Davydov 提交于
      There is no need passing on a shrink_control struct from
      try_to_free_pages() and friends to do_try_to_free_pages() and then to
      shrink_zones(), because it is only used in shrink_zones() and the only
      field initialized on the top level is gfp_mask, which is always equal to
      scan_control.gfp_mask.  So let's move shrink_control initialization to
      shrink_zones().
      Signed-off-by: NVladimir Davydov <vdavydov@parallels.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Dave Chinner <dchinner@redhat.com>
      Cc: Glauber Costa <glommer@gmail.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      3115cd91
    • V
      mm: vmscan: move call to shrink_slab() to shrink_zones() · 65ec02cb
      Vladimir Davydov 提交于
      This reduces the indentation level of do_try_to_free_pages() and removes
      extra loop over all eligible zones counting the number of on-LRU pages.
      Signed-off-by: NVladimir Davydov <vdavydov@parallels.com>
      Reviewed-by: NGlauber Costa <glommer@gmail.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Dave Chinner <dchinner@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      65ec02cb
    • V
      mm: vmscan: respect NUMA policy mask when shrinking slab on direct reclaim · 99120b77
      Vladimir Davydov 提交于
      When direct reclaim is executed by a process bound to a set of NUMA
      nodes, we should scan only those nodes when possible, but currently we
      will scan kmem from all online nodes even if the kmem shrinker is NUMA
      aware.  That said, binding a process to a particular NUMA node won't
      prevent it from shrinking inode/dentry caches from other nodes, which is
      not good.  Fix this.
      Signed-off-by: NVladimir Davydov <vdavydov@parallels.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Dave Chinner <dchinner@redhat.com>
      Cc: Glauber Costa <glommer@gmail.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      99120b77
  8. 30 1月, 2014 1 次提交
  9. 24 1月, 2014 3 次提交
    • V
      mm: vmscan: call NUMA-unaware shrinkers irrespective of nodemask · ec97097b
      Vladimir Davydov 提交于
      If a shrinker is not NUMA-aware, shrink_slab() should call it exactly
      once with nid=0, but currently it is not true: if node 0 is not set in
      the nodemask or if it is not online, we will not call such shrinkers at
      all.  As a result some slabs will be left untouched under some
      circumstances.  Let us fix it.
      Signed-off-by: NVladimir Davydov <vdavydov@parallels.com>
      Reported-by: NDave Chinner <dchinner@redhat.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Glauber Costa <glommer@gmail.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      ec97097b
    • V
      mm: vmscan: shrink all slab objects if tight on memory · 0b1fb40a
      Vladimir Davydov 提交于
      When reclaiming kmem, we currently don't scan slabs that have less than
      batch_size objects (see shrink_slab_node()):
      
              while (total_scan >= batch_size) {
                      shrinkctl->nr_to_scan = batch_size;
                      shrinker->scan_objects(shrinker, shrinkctl);
                      total_scan -= batch_size;
              }
      
      If there are only a few shrinkers available, such a behavior won't cause
      any problems, because the batch_size is usually small, but if we have a
      lot of slab shrinkers, which is perfectly possible since FS shrinkers
      are now per-superblock, we can end up with hundreds of megabytes of
      practically unreclaimable kmem objects.  For instance, mounting a
      thousand of ext2 FS images with a hundred of files in each and iterating
      over all the files using du(1) will result in about 200 Mb of FS caches
      that cannot be dropped even with the aid of the vm.drop_caches sysctl!
      
      This problem was initially pointed out by Glauber Costa [*].  Glauber
      proposed to fix it by making the shrink_slab() always take at least one
      pass, to put it simply, turning the scan loop above to a do{}while()
      loop.  However, this proposal was rejected, because it could result in
      more aggressive and frequent slab shrinking even under low memory
      pressure when total_scan is naturally very small.
      
      This patch is a slightly modified version of Glauber's approach.
      Similarly to Glauber's patch, it makes shrink_slab() scan less than
      batch_size objects, but only if the total number of objects we want to
      scan (total_scan) is greater than the total number of objects available
      (max_pass).  Since total_scan is biased as half max_pass if the current
      delta change is small:
      
              if (delta < max_pass / 4)
                      total_scan = min(total_scan, max_pass / 2);
      
      this is only possible if we are scanning at high prio.  That said, this
      patch shouldn't change the vmscan behaviour if the memory pressure is
      low, but if we are tight on memory, we will do our best by trying to
      reclaim all available objects, which sounds reasonable.
      
      [*] http://www.spinics.net/lists/cgroups/msg06913.htmlSigned-off-by: NVladimir Davydov <vdavydov@parallels.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Dave Chinner <dchinner@redhat.com>
      Cc: Glauber Costa <glommer@gmail.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      0b1fb40a
    • S
      mm: dump page when hitting a VM_BUG_ON using VM_BUG_ON_PAGE · 309381fe
      Sasha Levin 提交于
      Most of the VM_BUG_ON assertions are performed on a page.  Usually, when
      one of these assertions fails we'll get a BUG_ON with a call stack and
      the registers.
      
      I've recently noticed based on the requests to add a small piece of code
      that dumps the page to various VM_BUG_ON sites that the page dump is
      quite useful to people debugging issues in mm.
      
      This patch adds a VM_BUG_ON_PAGE(cond, page) which beyond doing what
      VM_BUG_ON() does, also dumps the page before executing the actual
      BUG_ON.
      
      [akpm@linux-foundation.org: fix up includes]
      Signed-off-by: NSasha Levin <sasha.levin@oracle.com>
      Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      309381fe
  10. 17 10月, 2013 1 次提交
    • A
      mm/vmscan.c: don't forget to free shrinker->nr_deferred · ae393321
      Andrew Vagin 提交于
      This leak was added by commit 1d3d4437 ("vmscan: per-node deferred
      work").
      
      unreferenced object 0xffff88006ada3bd0 (size 8):
        comm "criu", pid 14781, jiffies 4295238251 (age 105.641s)
        hex dump (first 8 bytes):
          00 00 00 00 00 00 00 00                          ........
        backtrace:
          [<ffffffff8170caee>] kmemleak_alloc+0x5e/0xc0
          [<ffffffff811c0527>] __kmalloc+0x247/0x310
          [<ffffffff8117848c>] register_shrinker+0x3c/0xa0
          [<ffffffff811e115b>] sget+0x5ab/0x670
          [<ffffffff812532f4>] proc_mount+0x54/0x170
          [<ffffffff811e1893>] mount_fs+0x43/0x1b0
          [<ffffffff81202dd2>] vfs_kern_mount+0x72/0x110
          [<ffffffff81202e89>] kern_mount_data+0x19/0x30
          [<ffffffff812530a0>] pid_ns_prepare_proc+0x20/0x40
          [<ffffffff81083c56>] alloc_pid+0x466/0x4a0
          [<ffffffff8105aeda>] copy_process+0xc6a/0x1860
          [<ffffffff8105beab>] do_fork+0x8b/0x370
          [<ffffffff8105c1a6>] SyS_clone+0x16/0x20
          [<ffffffff8171f739>] stub_clone+0x69/0x90
          [<ffffffffffffffff>] 0xffffffffffffffff
      Signed-off-by: NAndrew Vagin <avagin@openvz.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Glauber Costa <glommer@openvz.org>
      Cc: Chuck Lever <chuck.lever@oracle.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      ae393321
  11. 01 10月, 2013 1 次提交
    • R
      mm: avoid reinserting isolated balloon pages into LRU lists · 117aad1e
      Rafael Aquini 提交于
      Isolated balloon pages can wrongly end up in LRU lists when
      migrate_pages() finishes its round without draining all the isolated
      page list.
      
      The same issue can happen when reclaim_clean_pages_from_list() tries to
      reclaim pages from an isolated page list, before migration, in the CMA
      path.  Such balloon page leak opens a race window against LRU lists
      shrinkers that leads us to the following kernel panic:
      
        BUG: unable to handle kernel NULL pointer dereference at 0000000000000028
        IP: [<ffffffff810c2625>] shrink_page_list+0x24e/0x897
        PGD 3cda2067 PUD 3d713067 PMD 0
        Oops: 0000 [#1] SMP
        CPU: 0 PID: 340 Comm: kswapd0 Not tainted 3.12.0-rc1-22626-g4367597 #87
        Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
        RIP: shrink_page_list+0x24e/0x897
        RSP: 0000:ffff88003da499b8  EFLAGS: 00010286
        RAX: 0000000000000000 RBX: ffff88003e82bd60 RCX: 00000000000657d5
        RDX: 0000000000000000 RSI: 000000000000031f RDI: ffff88003e82bd40
        RBP: ffff88003da49ab0 R08: 0000000000000001 R09: 0000000081121a45
        R10: ffffffff81121a45 R11: ffff88003c4a9a28 R12: ffff88003e82bd40
        R13: ffff88003da0e800 R14: 0000000000000001 R15: ffff88003da49d58
        FS:  0000000000000000(0000) GS:ffff88003fc00000(0000) knlGS:0000000000000000
        CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
        CR2: 00000000067d9000 CR3: 000000003ace5000 CR4: 00000000000407b0
        Call Trace:
          shrink_inactive_list+0x240/0x3de
          shrink_lruvec+0x3e0/0x566
          __shrink_zone+0x94/0x178
          shrink_zone+0x3a/0x82
          balance_pgdat+0x32a/0x4c2
          kswapd+0x2f0/0x372
          kthread+0xa2/0xaa
          ret_from_fork+0x7c/0xb0
        Code: 80 7d 8f 01 48 83 95 68 ff ff ff 00 4c 89 e7 e8 5a 7b 00 00 48 85 c0 49 89 c5 75 08 80 7d 8f 00 74 3e eb 31 48 8b 80 18 01 00 00 <48> 8b 74 0d 48 8b 78 30 be 02 00 00 00 ff d2 eb
        RIP  [<ffffffff810c2625>] shrink_page_list+0x24e/0x897
         RSP <ffff88003da499b8>
        CR2: 0000000000000028
        ---[ end trace 703d2451af6ffbfd ]---
        Kernel panic - not syncing: Fatal exception
      
      This patch fixes the issue, by assuring the proper tests are made at
      putback_movable_pages() & reclaim_clean_pages_from_list() to avoid
      isolated balloon pages being wrongly reinserted in LRU lists.
      
      [akpm@linux-foundation.org: clarify awkward comment text]
      Signed-off-by: NRafael Aquini <aquini@redhat.com>
      Reported-by: NLuiz Capitulino <lcapitulino@redhat.com>
      Tested-by: NLuiz Capitulino <lcapitulino@redhat.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      117aad1e
  12. 25 9月, 2013 5 次提交
  13. 13 9月, 2013 6 次提交
    • A
      memcg: trivial cleanups · f894ffa8
      Andrew Morton 提交于
      Clean up some mess made by the "Soft limit rework" series, and a few other
      things.
      
      Cc: Michal Hocko <mhocko@suse.cz>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      f894ffa8
    • M
      memcg, vmscan: do not fall into reclaim-all pass too quickly · e975de99
      Michal Hocko 提交于
      shrink_zone starts with soft reclaim pass first and then falls back to
      regular reclaim if nothing has been scanned.  This behavior is natural
      but there is a catch.  Memcg iterators, when used with the reclaim
      cookie, are designed to help to prevent from over reclaim by
      interleaving reclaimers (per node-zone-priority) so the tree walk might
      miss many (even all) nodes in the hierarchy e.g.  when there are direct
      reclaimers racing with each other or with kswapd in the global case or
      multiple allocators reaching the limit for the target reclaim case.  To
      make it even more complicated, targeted reclaim doesn't do the whole
      tree walk because it stops reclaiming once it reclaims sufficient pages.
      As a result groups over the limit might be missed, thus nothing is
      scanned, and reclaim would fall back to the reclaim all mode.
      
      This patch checks for the incomplete tree walk in shrink_zone.  If no
      group has been visited and the hierarchy is soft reclaimable then we
      must have missed some groups, in which case the __shrink_zone is called
      again.  This doesn't guarantee there will be some progress of course
      because the current reclaimer might be still racing with others but it
      would at least give a chance to start the walk without a big risk of
      reclaim latencies.
      Signed-off-by: NMichal Hocko <mhocko@suse.cz>
      Cc: Balbir Singh <bsingharora@gmail.com>
      Cc: Glauber Costa <glommer@openvz.org>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Michel Lespinasse <walken@google.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Ying Han <yinghan@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      e975de99
    • M
      memcg, vmscan: do not attempt soft limit reclaim if it would not scan anything · e839b6a1
      Michal Hocko 提交于
      mem_cgroup_should_soft_reclaim controls whether soft reclaim pass is
      done and it always says yes currently.  Memcg iterators are clever to
      skip nodes that are not soft reclaimable quite efficiently but
      mem_cgroup_should_soft_reclaim can be more clever and do not start the
      soft reclaim pass at all if it knows that nothing would be scanned
      anyway.
      
      In order to do that, simply reuse mem_cgroup_soft_reclaim_eligible for
      the target group of the reclaim and allow the pass only if the whole
      subtree wouldn't be skipped.
      Signed-off-by: NMichal Hocko <mhocko@suse.cz>
      Cc: Balbir Singh <bsingharora@gmail.com>
      Cc: Glauber Costa <glommer@openvz.org>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Michel Lespinasse <walken@google.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Ying Han <yinghan@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      e839b6a1
    • M
      memcg: enhance memcg iterator to support predicates · de57780d
      Michal Hocko 提交于
      The caller of the iterator might know that some nodes or even subtrees
      should be skipped but there is no way to tell iterators about that so the
      only choice left is to let iterators to visit each node and do the
      selection outside of the iterating code.  This, however, doesn't scale
      well with hierarchies with many groups where only few groups are
      interesting.
      
      This patch adds mem_cgroup_iter_cond variant of the iterator with a
      callback which gets called for every visited node.  There are three
      possible ways how the callback can influence the walk.  Either the node is
      visited, it is skipped but the tree walk continues down the tree or the
      whole subtree of the current group is skipped.
      
      [hughd@google.com: fix memcg-less page reclaim]
      Signed-off-by: NMichal Hocko <mhocko@suse.cz>
      Cc: Balbir Singh <bsingharora@gmail.com>
      Cc: Glauber Costa <glommer@openvz.org>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Michel Lespinasse <walken@google.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Ying Han <yinghan@google.com>
      Signed-off-by: NHugh Dickins <hughd@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      de57780d
    • M
      vmscan, memcg: do softlimit reclaim also for targeted reclaim · a5b7c87f
      Michal Hocko 提交于
      Soft reclaim has been done only for the global reclaim (both background
      and direct).  Since "memcg: integrate soft reclaim tighter with zone
      shrinking code" there is no reason for this limitation anymore as the soft
      limit reclaim doesn't use any special code paths and it is a part of the
      zone shrinking code which is used by both global and targeted reclaims.
      
      From the semantic point of view it is natural to consider soft limit
      before touching all groups in the hierarchy tree which is touching the
      hard limit because soft limit tells us where to push back when there is a
      memory pressure.  It is not important whether the pressure comes from the
      limit or imbalanced zones.
      
      This patch simply enables soft reclaim unconditionally in
      mem_cgroup_should_soft_reclaim so it is enabled for both global and
      targeted reclaim paths.  mem_cgroup_soft_reclaim_eligible needs to learn
      about the root of the reclaim to know where to stop checking soft limit
      state of parents up the hierarchy.  Say we have
      
      A (over soft limit)
       \
        B (below s.l., hit the hard limit)
       / \
      C   D (below s.l.)
      
      B is the source of the outside memory pressure now for D but we shouldn't
      soft reclaim it because it is behaving well under B subtree and we can
      still reclaim from C (pressumably it is over the limit).
      mem_cgroup_soft_reclaim_eligible should therefore stop climbing up the
      hierarchy at B (root of the memory pressure).
      Signed-off-by: NMichal Hocko <mhocko@suse.cz>
      Reviewed-by: NGlauber Costa <glommer@openvz.org>
      Reviewed-by: NTejun Heo <tj@kernel.org>
      Cc: Balbir Singh <bsingharora@gmail.com>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Michel Lespinasse <walken@google.com>
      Cc: Ying Han <yinghan@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      a5b7c87f
    • M
      memcg, vmscan: integrate soft reclaim tighter with zone shrinking code · 3b38722e
      Michal Hocko 提交于
      This patchset is sitting out of tree for quite some time without any
      objections.  I would be really happy if it made it into 3.12.  I do not
      want to push it too hard but I think this work is basically ready and
      waiting more doesn't help.
      
      The basic idea is quite simple.  Pull soft reclaim into shrink_zone in the
      first step and get rid of the previous soft reclaim infrastructure.
      shrink_zone is done in two passes now.  First it tries to do the soft
      limit reclaim and it falls back to reclaim-all mode if no group is over
      the limit or no pages have been scanned.  The second pass happens at the
      same priority so the only time we waste is the memcg tree walk which has
      been updated in the third step to have only negligible overhead.
      
      As a bonus we will get rid of a _lot_ of code by this and soft reclaim
      will not stand out like before when it wasn't integrated into the zone
      shrinking code and it reclaimed at priority 0 (the testing results show
      that some workloads suffers from such an aggressive reclaim).  The clean
      up is in a separate patch because I felt it would be easier to review that
      way.
      
      The second step is soft limit reclaim integration into targeted reclaim.
      It should be rather straight forward.  Soft limit has been used only for
      the global reclaim so far but it makes sense for any kind of pressure
      coming from up-the-hierarchy, including targeted reclaim.
      
      The third step (patches 4-8) addresses the tree walk overhead by enhancing
      memcg iterators to enable skipping whole subtrees and tracking number of
      over soft limit children at each level of the hierarchy.  This information
      is updated same way the old soft limit tree was updated (from
      memcg_check_events) so we shouldn't see an additional overhead.  In fact
      mem_cgroup_update_soft_limit is much simpler than tree manipulation done
      previously.
      
      __shrink_zone uses mem_cgroup_soft_reclaim_eligible as a predicate for
      mem_cgroup_iter so the decision whether a particular group should be
      visited is done at the iterator level which allows us to decide to skip
      the whole subtree as well (if there is no child in excess).  This reduces
      the tree walk overhead considerably.
      
      * TEST 1
      ========
      
      My primary test case was a parallel kernel build with 2 groups (make is
      running with -j8 with a distribution .config in a separate cgroup without
      any hard limit) on a 32 CPU machine booted with 1GB memory and both builds
      run taskset to Node 0 cpus.
      
      I was mostly interested in 2 setups.  Default - no soft limit set and -
      and 0 soft limit set to both groups.  The first one should tell us whether
      the rework regresses the default behavior while the second one should show
      us improvements in an extreme case where both workloads are always over
      the soft limit.
      
      /usr/bin/time -v has been used to collect the statistics and each
      configuration had 3 runs after fresh boot without any other load on the
      system.
      
      base is mmotm-2013-07-18-16-40
      rework all 8 patches applied on top of base
      
      * No-limit
      User
      no-limit/base: min: 651.92 max: 672.65 avg: 664.33 std: 8.01 runs: 6
      no-limit/rework: min: 657.34 [100.8%] max: 668.39 [99.4%] avg: 663.13 [99.8%] std: 3.61 runs: 6
      System
      no-limit/base: min: 69.33 max: 71.39 avg: 70.32 std: 0.79 runs: 6
      no-limit/rework: min: 69.12 [99.7%] max: 71.05 [99.5%] avg: 70.04 [99.6%] std: 0.59 runs: 6
      Elapsed
      no-limit/base: min: 398.27 max: 422.36 avg: 408.85 std: 7.74 runs: 6
      no-limit/rework: min: 386.36 [97.0%] max: 438.40 [103.8%] avg: 416.34 [101.8%] std: 18.85 runs: 6
      
      The results are within noise. Elapsed time has a bigger variance but the
      average looks good.
      
      * 0-limit
      User
      0-limit/base: min: 573.76 max: 605.63 avg: 585.73 std: 12.21 runs: 6
      0-limit/rework: min: 645.77 [112.6%] max: 666.25 [110.0%] avg: 656.97 [112.2%] std: 7.77 runs: 6
      System
      0-limit/base: min: 69.57 max: 71.13 avg: 70.29 std: 0.54 runs: 6
      0-limit/rework: min: 68.68 [98.7%] max: 71.40 [100.4%] avg: 69.91 [99.5%] std: 0.87 runs: 6
      Elapsed
      0-limit/base: min: 1306.14 max: 1550.17 avg: 1430.35 std: 90.86 runs: 6
      0-limit/rework: min: 404.06 [30.9%] max: 465.94 [30.1%] avg: 434.81 [30.4%] std: 22.68 runs: 6
      
      The improvement is really huge here (even bigger than with my previous
      testing and I suspect that this highly depends on the storage).  Page
      fault statistics tell us at least part of the story:
      
      Minor
      0-limit/base: min: 37180461.00 max: 37319986.00 avg: 37247470.00 std: 54772.71 runs: 6
      0-limit/rework: min: 36751685.00 [98.8%] max: 36805379.00 [98.6%] avg: 36774506.33 [98.7%] std: 17109.03 runs: 6
      Major
      0-limit/base: min: 170604.00 max: 221141.00 avg: 196081.83 std: 18217.01 runs: 6
      0-limit/rework: min: 2864.00 [1.7%] max: 10029.00 [4.5%] avg: 5627.33 [2.9%] std: 2252.71 runs: 6
      
      Same as with my previous testing Minor faults are more or less within
      noise but Major fault count is way bellow the base kernel.
      
      While this looks as a nice win it is fair to say that 0-limit
      configuration is quite artificial. So I was playing with 0-no-limit
      loads as well.
      
      * TEST 2
      ========
      
      The following results are from 2 groups configuration on a 16GB machine
      (single NUMA node).
      
      - A running stream IO (dd if=/dev/zero of=local.file bs=1024) with
        2*TotalMem with 0 soft limit.
      - B running a mem_eater which consumes TotalMem-1G without any limit. The
        mem_eater consumes the memory in 100 chunks with 1s nap after each
        mmap+poppulate so that both loads have chance to fight for the memory.
      
      The expected result is that B shouldn't be reclaimed and A shouldn't see
      a big dropdown in elapsed time.
      
      User
      base: min: 2.68 max: 2.89 avg: 2.76 std: 0.09 runs: 3
      rework: min: 3.27 [122.0%] max: 3.74 [129.4%] avg: 3.44 [124.6%] std: 0.21 runs: 3
      System
      base: min: 86.26 max: 88.29 avg: 87.28 std: 0.83 runs: 3
      rework: min: 81.05 [94.0%] max: 84.96 [96.2%] avg: 83.14 [95.3%] std: 1.61 runs: 3
      Elapsed
      base: min: 317.28 max: 332.39 avg: 325.84 std: 6.33 runs: 3
      rework: min: 281.53 [88.7%] max: 298.16 [89.7%] avg: 290.99 [89.3%] std: 6.98 runs: 3
      
      System time improved slightly as well as Elapsed. My previous testing
      has shown worse numbers but this again seem to depend on the storage
      speed.
      
      My theory is that the writeback doesn't catch up and prio-0 soft reclaim
      falls into wait on writeback page too often in the base kernel. The
      patched kernel doesn't do that because the soft reclaim is done from the
      kswapd/direct reclaim context. This can be seen on the following graph
      nicely. The A's group usage_in_bytes regurarly drops really low very often.
      
      All 3 runs
      http://labs.suse.cz/mhocko/soft_limit_rework/stream_io-vs-mem_eater/stream.png
      resp. a detail of the single run
      http://labs.suse.cz/mhocko/soft_limit_rework/stream_io-vs-mem_eater/stream-one-run.png
      
      mem_eater seems to be doing better as well. It gets to the full
      allocation size faster as can be seen on the following graph:
      http://labs.suse.cz/mhocko/soft_limit_rework/stream_io-vs-mem_eater/mem_eater-one-run.png
      
      /proc/meminfo collected during the test also shows that rework kernel
      hasn't swapped that much (well almost not at all):
      base: max: 123900 K avg: 56388.29 K
      rework: max: 300 K avg: 128.68 K
      
      kswapd and direct reclaim statistics are of no use unfortunatelly because
      soft reclaim is not accounted properly as the counters are hidden by
      global_reclaim() checks in the base kernel.
      
      * TEST 3
      ========
      
      Another test was the same configuration as TEST2 except the stream IO was
      replaced by a single kbuild (16 parallel jobs bound to Node0 cpus same as
      in TEST1) and mem_eater allocated TotalMem-200M so kbuild had only 200MB
      left.
      
      Kbuild did better with the rework kernel here as well:
      User
      base: min: 860.28 max: 872.86 avg: 868.03 std: 5.54 runs: 3
      rework: min: 880.81 [102.4%] max: 887.45 [101.7%] avg: 883.56 [101.8%] std: 2.83 runs: 3
      System
      base: min: 84.35 max: 85.06 avg: 84.79 std: 0.31 runs: 3
      rework: min: 85.62 [101.5%] max: 86.09 [101.2%] avg: 85.79 [101.2%] std: 0.21 runs: 3
      Elapsed
      base: min: 135.36 max: 243.30 avg: 182.47 std: 45.12 runs: 3
      rework: min: 110.46 [81.6%] max: 116.20 [47.8%] avg: 114.15 [62.6%] std: 2.61 runs: 3
      Minor
      base: min: 36635476.00 max: 36673365.00 avg: 36654812.00 std: 15478.03 runs: 3
      rework: min: 36639301.00 [100.0%] max: 36695541.00 [100.1%] avg: 36665511.00 [100.0%] std: 23118.23 runs: 3
      Major
      base: min: 14708.00 max: 53328.00 avg: 31379.00 std: 16202.24 runs: 3
      rework: min: 302.00 [2.1%] max: 414.00 [0.8%] avg: 366.33 [1.2%] std: 47.22 runs: 3
      
      Again we can see a significant improvement in Elapsed (it also seems to
      be more stable), there is a huge dropdown for the Major page faults and
      much more swapping:
      base: max: 583736 K avg: 112547.43 K
      rework: max: 4012 K avg: 124.36 K
      
      Graphs from all three runs show the variability of the kbuild quite
      nicely.  It even seems that it took longer after every run with the base
      kernel which would be quite surprising as the source tree for the build is
      removed and caches are dropped after each run so the build operates on a
      freshly extracted sources everytime.
      http://labs.suse.cz/mhocko/soft_limit_rework/stream_io-vs-mem_eater/kbuild-mem_eater.png
      
      My other testing shows that this is just a matter of timing and other runs
      behave differently the std for Elapsed time is similar ~50.  Example of
      other three runs:
      http://labs.suse.cz/mhocko/soft_limit_rework/stream_io-vs-mem_eater/kbuild-mem_eater2.png
      
      So to wrap this up.  The series is still doing good and improves the soft
      limit.
      
      The testing results for bunch of cgroups with both stream IO and kbuild
      loads can be found in "memcg: track children in soft limit excess to
      improve soft limit".
      
      This patch:
      
      Memcg soft reclaim has been traditionally triggered from the global
      reclaim paths before calling shrink_zone.  mem_cgroup_soft_limit_reclaim
      then picked up a group which exceeds the soft limit the most and reclaimed
      it with 0 priority to reclaim at least SWAP_CLUSTER_MAX pages.
      
      The infrastructure requires per-node-zone trees which hold over-limit
      groups and keep them up-to-date (via memcg_check_events) which is not cost
      free.  Although this overhead hasn't turned out to be a bottle neck the
      implementation is suboptimal because mem_cgroup_update_tree has no idea
      which zones consumed memory over the limit so we could easily end up
      having a group on a node-zone tree having only few pages from that
      node-zone.
      
      This patch doesn't try to fix node-zone trees management because it seems
      that integrating soft reclaim into zone shrinking sounds much easier and
      more appropriate for several reasons.  First of all 0 priority reclaim was
      a crude hack which might lead to big stalls if the group's LRUs are big
      and hard to reclaim (e.g.  a lot of dirty/writeback pages).  Soft reclaim
      should be applicable also to the targeted reclaim which is awkward right
      now without additional hacks.  Last but not least the whole infrastructure
      eats quite some code.
      
      After this patch shrink_zone is done in 2 passes.  First it tries to do
      the soft reclaim if appropriate (only for global reclaim for now to keep
      compatible with the original state) and fall back to ignoring soft limit
      if no group is eligible to soft reclaim or nothing has been scanned during
      the first pass.  Only groups which are over their soft limit or any of
      their parents up the hierarchy is over the limit are considered eligible
      during the first pass.
      
      Soft limit tree which is not necessary anymore will be removed in the
      follow up patch to make this patch smaller and easier to review.
      Signed-off-by: NMichal Hocko <mhocko@suse.cz>
      Reviewed-by: NGlauber Costa <glommer@openvz.org>
      Reviewed-by: NTejun Heo <tj@kernel.org>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Ying Han <yinghan@google.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Michel Lespinasse <walken@google.com>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Balbir Singh <bsingharora@gmail.com>
      Cc: Glauber Costa <glommer@gmail.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      3b38722e
  14. 12 9月, 2013 1 次提交
    • L
      mm: vmscan: fix do_try_to_free_pages() livelock · 6e543d57
      Lisa Du 提交于
      This patch is based on KOSAKI's work and I add a little more description,
      please refer https://lkml.org/lkml/2012/6/14/74.
      
      Currently, I found system can enter a state that there are lots of free
      pages in a zone but only order-0 and order-1 pages which means the zone is
      heavily fragmented, then high order allocation could make direct reclaim
      path's long stall(ex, 60 seconds) especially in no swap and no compaciton
      enviroment.  This problem happened on v3.4, but it seems issue still lives
      in current tree, the reason is do_try_to_free_pages enter live lock:
      
      kswapd will go to sleep if the zones have been fully scanned and are still
      not balanced.  As kswapd thinks there's little point trying all over again
      to avoid infinite loop.  Instead it changes order from high-order to
      0-order because kswapd think order-0 is the most important.  Look at
      73ce02e9 in detail.  If watermarks are ok, kswapd will go back to sleep
      and may leave zone->all_unreclaimable =3D 0.  It assume high-order users
      can still perform direct reclaim if they wish.
      
      Direct reclaim continue to reclaim for a high order which is not a
      COSTLY_ORDER without oom-killer until kswapd turn on
      zone->all_unreclaimble= .  This is because to avoid too early oom-kill.
      So it means direct_reclaim depends on kswapd to break this loop.
      
      In worst case, direct-reclaim may continue to page reclaim forever when
      kswapd sleeps forever until someone like watchdog detect and finally kill
      the process.  As described in:
      http://thread.gmane.org/gmane.linux.kernel.mm/103737
      
      We can't turn on zone->all_unreclaimable from direct reclaim path because
      direct reclaim path don't take any lock and this way is racy.  Thus this
      patch removes zone->all_unreclaimable field completely and recalculates
      zone reclaimable state every time.
      
      Note: we can't take the idea that direct-reclaim see zone->pages_scanned
      directly and kswapd continue to use zone->all_unreclaimable.  Because, it
      is racy.  commit 929bea7c (vmscan: all_unreclaimable() use
      zone->all_unreclaimable as a name) describes the detail.
      
      [akpm@linux-foundation.org: uninline zone_reclaimable_pages() and zone_reclaimable()]
      Cc: Aaditya Kumar <aaditya.kumar.30@gmail.com>
      Cc: Ying Han <yinghan@google.com>
      Cc: Nick Piggin <npiggin@gmail.com>
      Acked-by: NRik van Riel <riel@redhat.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Bob Liu <lliubbo@gmail.com>
      Cc: Neil Zhang <zhangwm@marvell.com>
      Cc: Russell King - ARM Linux <linux@arm.linux.org.uk>
      Reviewed-by: NMichal Hocko <mhocko@suse.cz>
      Acked-by: NMinchan Kim <minchan@kernel.org>
      Acked-by: NJohannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: NKOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Signed-off-by: NLisa Du <cldu@marvell.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      6e543d57