1. 04 Apr, 2014 3 commits
  2. 30 Jan, 2014 1 commit
  3. 24 Jan, 2014 3 commits
    • mm: vmscan: call NUMA-unaware shrinkers irrespective of nodemask · ec97097b
      Vladimir Davydov authored
      If a shrinker is not NUMA-aware, shrink_slab() should call it exactly
      once with nid=0, but currently it is not true: if node 0 is not set in
      the nodemask or if it is not online, we will not call such shrinkers at
      all.  As a result some slabs will be left untouched under some
      circumstances.  Let us fix it.
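
      A minimal sketch of the intended shrink_slab() behaviour, assuming the
      surrounding list walk and the shrink_slab_node() helper described by
      this series:

              list_for_each_entry(shrinker, &shrinker_list, list) {
                      if (!(shrinker->flags & SHRINKER_NUMA_AWARE)) {
                              /*
                               * NUMA-unaware shrinkers are called exactly once,
                               * with nid = 0, regardless of the nodemask.
                               */
                              shrinkctl->nid = 0;
                              freed += shrink_slab_node(shrinkctl, shrinker,
                                                        nr_pages_scanned, lru_pages);
                              continue;
                      }

                      for_each_node_mask(shrinkctl->nid, shrinkctl->nodes_to_scan)
                              if (node_online(shrinkctl->nid))
                                      freed += shrink_slab_node(shrinkctl, shrinker,
                                                                nr_pages_scanned,
                                                                lru_pages);
              }
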
      Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
      Reported-by: Dave Chinner <dchinner@redhat.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Glauber Costa <glommer@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      ec97097b
    • mm: vmscan: shrink all slab objects if tight on memory · 0b1fb40a
      Vladimir Davydov authored
      When reclaiming kmem, we currently don't scan slabs that have less than
      batch_size objects (see shrink_slab_node()):
      
              while (total_scan >= batch_size) {
                      shrinkctl->nr_to_scan = batch_size;
                      shrinker->scan_objects(shrinker, shrinkctl);
                      total_scan -= batch_size;
              }
      
      If there are only a few shrinkers available, such a behavior won't cause
      any problems, because the batch_size is usually small, but if we have a
      lot of slab shrinkers, which is perfectly possible since FS shrinkers
      are now per-superblock, we can end up with hundreds of megabytes of
      practically unreclaimable kmem objects.  For instance, mounting a
      thousand ext2 FS images with a hundred files in each and iterating
      over all the files using du(1) will result in about 200 MB of FS caches
      that cannot be dropped even with the aid of the vm.drop_caches sysctl!
      
      This problem was initially pointed out by Glauber Costa [*].  Glauber
      proposed to fix it by making the shrink_slab() always take at least one
      pass, to put it simply, turning the scan loop above to a do{}while()
      loop.  However, this proposal was rejected, because it could result in
      more aggressive and frequent slab shrinking even under low memory
      pressure when total_scan is naturally very small.
      
      This patch is a slightly modified version of Glauber's approach.
      Similarly to Glauber's patch, it lets shrink_slab() scan fewer than
      batch_size objects, but only if the total number of objects we want to
      scan (total_scan) is greater than the total number of objects available
      (max_pass).  Since total_scan is capped at half of max_pass when the
      current delta is small:
      
              if (delta < max_pass / 4)
                      total_scan = min(total_scan, max_pass / 2);
      
      this is only possible if we are scanning at high priority.  That said, this
      patch shouldn't change the vmscan behaviour if the memory pressure is
      low, but if we are tight on memory, we will do our best by trying to
      reclaim all available objects, which sounds reasonable.
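
      A sketch of the modified scan loop described above, using the variable
      names from the snippet quoted earlier (illustrative, not the exact
      final diff):

              /*
               * Also scan when total_scan >= max_pass, even if that is less
               * than batch_size, so that small caches can still be emptied
               * when we are tight on memory.
               */
              while (total_scan >= batch_size ||
                     total_scan >= max_pass) {
                      unsigned long nr_to_scan = min(batch_size, total_scan);

                      shrinkctl->nr_to_scan = nr_to_scan;
                      shrinker->scan_objects(shrinker, shrinkctl);
                      total_scan -= nr_to_scan;
              }
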
      
      [*] http://www.spinics.net/lists/cgroups/msg06913.html
      Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Dave Chinner <dchinner@redhat.com>
      Cc: Glauber Costa <glommer@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      0b1fb40a
    • mm: dump page when hitting a VM_BUG_ON using VM_BUG_ON_PAGE · 309381fe
      Sasha Levin authored
      Most of the VM_BUG_ON assertions are performed on a page.  Usually, when
      one of these assertions fails we'll get a BUG_ON with a call stack and
      the registers.
      
      Based on recent requests to add a small piece of code that dumps the
      page at various VM_BUG_ON sites, I've noticed that such a page dump is
      quite useful to people debugging issues in mm.
      
      This patch adds VM_BUG_ON_PAGE(cond, page) which, beyond doing what
      VM_BUG_ON() does, also dumps the page before executing the actual
      BUG_ON.
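
      A sketch of what such a macro can look like (the exact dump_page()
      signature used here is an assumption):

              #ifdef CONFIG_DEBUG_VM
              #define VM_BUG_ON_PAGE(cond, page)                              \
                      do {                                                    \
                              if (unlikely(cond)) {                           \
                                      dump_page(page);                        \
                                      BUG();                                  \
                              }                                               \
                      } while (0)
              #else
              #define VM_BUG_ON_PAGE(cond, page) BUILD_BUG_ON_INVALID(cond)
              #endif
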
      
      [akpm@linux-foundation.org: fix up includes]
      Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
      Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      309381fe
  4. 17 Oct, 2013 1 commit
    • mm/vmscan.c: don't forget to free shrinker->nr_deferred · ae393321
      Andrew Vagin authored
      This leak was added by commit 1d3d4437 ("vmscan: per-node deferred
      work").
      
      unreferenced object 0xffff88006ada3bd0 (size 8):
        comm "criu", pid 14781, jiffies 4295238251 (age 105.641s)
        hex dump (first 8 bytes):
          00 00 00 00 00 00 00 00                          ........
        backtrace:
          [<ffffffff8170caee>] kmemleak_alloc+0x5e/0xc0
          [<ffffffff811c0527>] __kmalloc+0x247/0x310
          [<ffffffff8117848c>] register_shrinker+0x3c/0xa0
          [<ffffffff811e115b>] sget+0x5ab/0x670
          [<ffffffff812532f4>] proc_mount+0x54/0x170
          [<ffffffff811e1893>] mount_fs+0x43/0x1b0
          [<ffffffff81202dd2>] vfs_kern_mount+0x72/0x110
          [<ffffffff81202e89>] kern_mount_data+0x19/0x30
          [<ffffffff812530a0>] pid_ns_prepare_proc+0x20/0x40
          [<ffffffff81083c56>] alloc_pid+0x466/0x4a0
          [<ffffffff8105aeda>] copy_process+0xc6a/0x1860
          [<ffffffff8105beab>] do_fork+0x8b/0x370
          [<ffffffff8105c1a6>] SyS_clone+0x16/0x20
          [<ffffffff8171f739>] stub_clone+0x69/0x90
          [<ffffffffffffffff>] 0xffffffffffffffff
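
      The trace suggests the nr_deferred array allocated in register_shrinker()
      is never freed; a sketch of the matching fix in unregister_shrinker()
      (exact placement assumed):

              void unregister_shrinker(struct shrinker *shrinker)
              {
                      down_write(&shrinker_rwsem);
                      list_del(&shrinker->list);
                      up_write(&shrinker_rwsem);
                      kfree(shrinker->nr_deferred);   /* the previously leaked allocation */
              }
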
      Signed-off-by: Andrew Vagin <avagin@openvz.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Glauber Costa <glommer@openvz.org>
      Cc: Chuck Lever <chuck.lever@oracle.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      ae393321
  5. 01 Oct, 2013 1 commit
    • mm: avoid reinserting isolated balloon pages into LRU lists · 117aad1e
      Rafael Aquini authored
      Isolated balloon pages can wrongly end up in LRU lists when
      migrate_pages() finishes its round without draining all the isolated
      page list.
      
      The same issue can happen when reclaim_clean_pages_from_list() tries to
      reclaim pages from an isolated page list, before migration, in the CMA
      path.  Such balloon page leak opens a race window against LRU lists
      shrinkers that leads us to the following kernel panic:
      
        BUG: unable to handle kernel NULL pointer dereference at 0000000000000028
        IP: [<ffffffff810c2625>] shrink_page_list+0x24e/0x897
        PGD 3cda2067 PUD 3d713067 PMD 0
        Oops: 0000 [#1] SMP
        CPU: 0 PID: 340 Comm: kswapd0 Not tainted 3.12.0-rc1-22626-g4367597 #87
        Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
        RIP: shrink_page_list+0x24e/0x897
        RSP: 0000:ffff88003da499b8  EFLAGS: 00010286
        RAX: 0000000000000000 RBX: ffff88003e82bd60 RCX: 00000000000657d5
        RDX: 0000000000000000 RSI: 000000000000031f RDI: ffff88003e82bd40
        RBP: ffff88003da49ab0 R08: 0000000000000001 R09: 0000000081121a45
        R10: ffffffff81121a45 R11: ffff88003c4a9a28 R12: ffff88003e82bd40
        R13: ffff88003da0e800 R14: 0000000000000001 R15: ffff88003da49d58
        FS:  0000000000000000(0000) GS:ffff88003fc00000(0000) knlGS:0000000000000000
        CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
        CR2: 00000000067d9000 CR3: 000000003ace5000 CR4: 00000000000407b0
        Call Trace:
          shrink_inactive_list+0x240/0x3de
          shrink_lruvec+0x3e0/0x566
          __shrink_zone+0x94/0x178
          shrink_zone+0x3a/0x82
          balance_pgdat+0x32a/0x4c2
          kswapd+0x2f0/0x372
          kthread+0xa2/0xaa
          ret_from_fork+0x7c/0xb0
        Code: 80 7d 8f 01 48 83 95 68 ff ff ff 00 4c 89 e7 e8 5a 7b 00 00 48 85 c0 49 89 c5 75 08 80 7d 8f 00 74 3e eb 31 48 8b 80 18 01 00 00 <48> 8b 74 0d 48 8b 78 30 be 02 00 00 00 ff d2 eb
        RIP  [<ffffffff810c2625>] shrink_page_list+0x24e/0x897
         RSP <ffff88003da499b8>
        CR2: 0000000000000028
        ---[ end trace 703d2451af6ffbfd ]---
        Kernel panic - not syncing: Fatal exception
      
      This patch fixes the issue by ensuring the proper tests are made in
      putback_movable_pages() and reclaim_clean_pages_from_list(), so that
      isolated balloon pages are not wrongly reinserted into LRU lists.
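
      A sketch of the kind of test implied above, using the helpers from
      balloon_compaction.h (exact placement in both functions is assumed):

              /* e.g. in putback_movable_pages() */
              list_for_each_entry_safe(page, next, page_list, lru) {
                      if (unlikely(isolated_balloon_page(page))) {
                              /*
                               * Isolated balloon pages must go back to the
                               * balloon list, never to an LRU list.
                               */
                              balloon_page_putback(page);
                              continue;
                      }
                      list_del(&page->lru);
                      putback_lru_page(page);
              }
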
      
      [akpm@linux-foundation.org: clarify awkward comment text]
      Signed-off-by: Rafael Aquini <aquini@redhat.com>
      Reported-by: Luiz Capitulino <lcapitulino@redhat.com>
      Tested-by: Luiz Capitulino <lcapitulino@redhat.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      117aad1e
  6. 25 Sep, 2013 5 commits
  7. 13 Sep, 2013 6 commits
    • memcg: trivial cleanups · f894ffa8
      Andrew Morton authored
      Clean up some mess made by the "Soft limit rework" series, and a few other
      things.
      
      Cc: Michal Hocko <mhocko@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      f894ffa8
    • memcg, vmscan: do not fall into reclaim-all pass too quickly · e975de99
      Michal Hocko authored
      shrink_zone starts with soft reclaim pass first and then falls back to
      regular reclaim if nothing has been scanned.  This behavior is natural
      but there is a catch.  Memcg iterators, when used with the reclaim
      cookie, are designed to help prevent over-reclaim by interleaving
      reclaimers (per node-zone-priority), so the tree walk might miss many
      (even all) nodes in the hierarchy, e.g.  when there are direct
      reclaimers racing with each other or with kswapd in the global case, or
      multiple allocators reaching the limit in the target reclaim case.  To
      make it even more complicated, targeted reclaim doesn't do the whole
      tree walk because it stops reclaiming once it reclaims sufficient pages.
      As a result groups over the limit might be missed, thus nothing is
      scanned, and reclaim would fall back to the reclaim all mode.
      
      This patch checks for the incomplete tree walk in shrink_zone.  If no
      group has been visited and the hierarchy is soft reclaimable then we
      must have missed some groups, in which case the __shrink_zone is called
      again.  This doesn't guarantee there will be some progress of course
      because the current reclaimer might be still racing with others but it
      would at least give a chance to start the walk without a big risk of
      reclaim latencies.
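
      A sketch of the check described above; the __shrink_zone() signature
      with a soft-reclaim flag follows the earlier patches in this series
      (details assumed):

              static void shrink_zone(struct zone *zone, struct scan_control *sc)
              {
                      unsigned long nr_scanned = sc->nr_scanned;
                      bool do_soft_reclaim = mem_cgroup_should_soft_reclaim(sc);

                      __shrink_zone(zone, sc, do_soft_reclaim);

                      /*
                       * The soft limit pass may have visited no groups at all
                       * (incomplete tree walk); if nothing was scanned, retry
                       * the walk without the soft limit restriction.
                       */
                      if (do_soft_reclaim && sc->nr_scanned == nr_scanned)
                              __shrink_zone(zone, sc, false);
              }
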
      Signed-off-by: Michal Hocko <mhocko@suse.cz>
      Cc: Balbir Singh <bsingharora@gmail.com>
      Cc: Glauber Costa <glommer@openvz.org>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Michel Lespinasse <walken@google.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Ying Han <yinghan@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      e975de99
    • memcg, vmscan: do not attempt soft limit reclaim if it would not scan anything · e839b6a1
      Michal Hocko authored
      mem_cgroup_should_soft_reclaim controls whether the soft reclaim pass is
      done, and it currently always says yes.  Memcg iterators are clever
      enough to skip nodes that are not soft reclaimable quite efficiently,
      but mem_cgroup_should_soft_reclaim can be smarter still and not start
      the soft reclaim pass at all if it knows that nothing would be scanned
      anyway.
      
      In order to do that, simply reuse mem_cgroup_soft_reclaim_eligible for
      the target group of the reclaim and allow the pass only if the whole
      subtree wouldn't be skipped.
      Signed-off-by: Michal Hocko <mhocko@suse.cz>
      Cc: Balbir Singh <bsingharora@gmail.com>
      Cc: Glauber Costa <glommer@openvz.org>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Michel Lespinasse <walken@google.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Ying Han <yinghan@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      e839b6a1
    • memcg: enhance memcg iterator to support predicates · de57780d
      Michal Hocko authored
      The caller of the iterator might know that some nodes or even subtrees
      should be skipped, but there is no way to tell the iterator about that,
      so the only choice left is to let the iterator visit each node and do
      the selection outside of the iterating code.  This, however, doesn't
      scale well for hierarchies with many groups where only a few groups are
      interesting.
      
      This patch adds a mem_cgroup_iter_cond variant of the iterator with a
      callback which gets called for every visited node.  There are three
      possible ways the callback can influence the walk: the node is visited,
      the node is skipped but the tree walk continues down its subtree, or the
      whole subtree of the current group is skipped.
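
      A sketch of the three outcomes and of a caller-side loop; the enum
      member names and the reclaim_from()/my_predicate() helpers are
      illustrative placeholders, not the exact identifiers:

              /* what the per-group callback can tell the iterator to do */
              enum mem_cgroup_filter_t {
                      VISIT,          /* visit the current group */
                      SKIP,           /* skip this group, continue down its subtree */
                      SKIP_TREE,      /* skip the whole subtree rooted here */
              };

              /* caller side: walk the hierarchy, letting the predicate prune it */
              memcg = mem_cgroup_iter_cond(root, NULL, &reclaim, my_predicate);
              while (memcg) {
                      reclaim_from(memcg);
                      memcg = mem_cgroup_iter_cond(root, memcg, &reclaim, my_predicate);
              }
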
      
      [hughd@google.com: fix memcg-less page reclaim]
      Signed-off-by: Michal Hocko <mhocko@suse.cz>
      Cc: Balbir Singh <bsingharora@gmail.com>
      Cc: Glauber Costa <glommer@openvz.org>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Michel Lespinasse <walken@google.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Ying Han <yinghan@google.com>
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      de57780d
    • vmscan, memcg: do softlimit reclaim also for targeted reclaim · a5b7c87f
      Michal Hocko authored
      Soft reclaim has been done only for the global reclaim (both background
      and direct).  Since "memcg: integrate soft reclaim tighter with zone
      shrinking code" there is no reason for this limitation anymore as the soft
      limit reclaim doesn't use any special code paths and it is a part of the
      zone shrinking code which is used by both global and targeted reclaims.
      
      From the semantic point of view it is natural to consider soft limit
      before touching all groups in the hierarchy tree which is touching the
      hard limit because soft limit tells us where to push back when there is a
      memory pressure.  It is not important whether the pressure comes from the
      limit or imbalanced zones.
      
      This patch simply enables soft reclaim unconditionally in
      mem_cgroup_should_soft_reclaim so it is enabled for both global and
      targeted reclaim paths.  mem_cgroup_soft_reclaim_eligible needs to learn
      about the root of the reclaim to know where to stop checking soft limit
      state of parents up the hierarchy.  Say we have
      
      A (over soft limit)
       \
        B (below s.l., hit the hard limit)
       / \
      C   D (below s.l.)
      
      B is now the source of the outside memory pressure for D, but we
      shouldn't soft reclaim D because it is behaving well under the B subtree
      and we can still reclaim from C (presumably it is over the limit).
      mem_cgroup_soft_reclaim_eligible should therefore stop climbing up the
      hierarchy at B (the root of the memory pressure).
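
      A sketch of how mem_cgroup_soft_reclaim_eligible can stop at the
      reclaim root; the parent_mem_cgroup() walk and the soft-limit excess
      check are assumptions based on the description above:

              bool mem_cgroup_soft_reclaim_eligible(struct mem_cgroup *memcg,
                                                    struct mem_cgroup *root)
              {
                      struct mem_cgroup *parent = memcg;

                      if (mem_cgroup_disabled())
                              return false;

                      /*
                       * Climb the hierarchy, but stop at the root of the
                       * current reclaim (B above): pressure caused by B's hard
                       * limit must not soft-reclaim D just because A, above B,
                       * is over its soft limit.
                       */
                      while (parent) {
                              if (res_counter_soft_limit_excess(&parent->res))
                                      return true;
                              if (parent == root)
                                      break;
                              parent = parent_mem_cgroup(parent);
                      }
                      return false;
              }
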
      Signed-off-by: Michal Hocko <mhocko@suse.cz>
      Reviewed-by: Glauber Costa <glommer@openvz.org>
      Reviewed-by: Tejun Heo <tj@kernel.org>
      Cc: Balbir Singh <bsingharora@gmail.com>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Michel Lespinasse <walken@google.com>
      Cc: Ying Han <yinghan@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      a5b7c87f
    • memcg, vmscan: integrate soft reclaim tighter with zone shrinking code · 3b38722e
      Michal Hocko authored
      This patchset is sitting out of tree for quite some time without any
      objections.  I would be really happy if it made it into 3.12.  I do not
      want to push it too hard but I think this work is basically ready and
      waiting more doesn't help.
      
      The basic idea is quite simple.  Pull soft reclaim into shrink_zone in the
      first step and get rid of the previous soft reclaim infrastructure.
      shrink_zone is done in two passes now.  First it tries to do the soft
      limit reclaim and it falls back to reclaim-all mode if no group is over
      the limit or no pages have been scanned.  The second pass happens at the
      same priority so the only time we waste is the memcg tree walk which has
      been updated in the third step to have only negligible overhead.
      
      As a bonus we will get rid of a _lot_ of code by this and soft reclaim
      will not stand out like before when it wasn't integrated into the zone
      shrinking code and it reclaimed at priority 0 (the testing results show
      that some workloads suffer from such an aggressive reclaim).  The cleanup
      is in a separate patch because I felt it would be easier to review that
      way.
      
      The second step is soft limit reclaim integration into targeted reclaim.
      It should be rather straightforward.  Soft limit has been used only for
      the global reclaim so far but it makes sense for any kind of pressure
      coming from up-the-hierarchy, including targeted reclaim.
      
      The third step (patches 4-8) addresses the tree walk overhead by enhancing
      memcg iterators to enable skipping whole subtrees and tracking number of
      over soft limit children at each level of the hierarchy.  This information
      is updated same way the old soft limit tree was updated (from
      memcg_check_events) so we shouldn't see an additional overhead.  In fact
      mem_cgroup_update_soft_limit is much simpler than tree manipulation done
      previously.
      
      __shrink_zone uses mem_cgroup_soft_reclaim_eligible as a predicate for
      mem_cgroup_iter so the decision whether a particular group should be
      visited is done at the iterator level which allows us to decide to skip
      the whole subtree as well (if there is no child in excess).  This reduces
      the tree walk overhead considerably.
      
      * TEST 1
      ========
      
      My primary test case was a parallel kernel build with 2 groups (make is
      running with -j8 with a distribution .config in a separate cgroup without
      any hard limit) on a 32 CPU machine booted with 1GB memory and both builds
      run taskset to Node 0 cpus.
      
      I was mostly interested in 2 setups.  Default - no soft limit set and -
      and 0 soft limit set to both groups.  The first one should tell us whether
      the rework regresses the default behavior while the second one should show
      us improvements in an extreme case where both workloads are always over
      the soft limit.
      
      /usr/bin/time -v has been used to collect the statistics and each
      configuration had 3 runs after fresh boot without any other load on the
      system.
      
      base is mmotm-2013-07-18-16-40
      rework all 8 patches applied on top of base
      
      * No-limit
      User
      no-limit/base: min: 651.92 max: 672.65 avg: 664.33 std: 8.01 runs: 6
      no-limit/rework: min: 657.34 [100.8%] max: 668.39 [99.4%] avg: 663.13 [99.8%] std: 3.61 runs: 6
      System
      no-limit/base: min: 69.33 max: 71.39 avg: 70.32 std: 0.79 runs: 6
      no-limit/rework: min: 69.12 [99.7%] max: 71.05 [99.5%] avg: 70.04 [99.6%] std: 0.59 runs: 6
      Elapsed
      no-limit/base: min: 398.27 max: 422.36 avg: 408.85 std: 7.74 runs: 6
      no-limit/rework: min: 386.36 [97.0%] max: 438.40 [103.8%] avg: 416.34 [101.8%] std: 18.85 runs: 6
      
      The results are within noise. Elapsed time has a bigger variance but the
      average looks good.
      
      * 0-limit
      User
      0-limit/base: min: 573.76 max: 605.63 avg: 585.73 std: 12.21 runs: 6
      0-limit/rework: min: 645.77 [112.6%] max: 666.25 [110.0%] avg: 656.97 [112.2%] std: 7.77 runs: 6
      System
      0-limit/base: min: 69.57 max: 71.13 avg: 70.29 std: 0.54 runs: 6
      0-limit/rework: min: 68.68 [98.7%] max: 71.40 [100.4%] avg: 69.91 [99.5%] std: 0.87 runs: 6
      Elapsed
      0-limit/base: min: 1306.14 max: 1550.17 avg: 1430.35 std: 90.86 runs: 6
      0-limit/rework: min: 404.06 [30.9%] max: 465.94 [30.1%] avg: 434.81 [30.4%] std: 22.68 runs: 6
      
      The improvement is really huge here (even bigger than with my previous
      testing and I suspect that this highly depends on the storage).  Page
      fault statistics tell us at least part of the story:
      
      Minor
      0-limit/base: min: 37180461.00 max: 37319986.00 avg: 37247470.00 std: 54772.71 runs: 6
      0-limit/rework: min: 36751685.00 [98.8%] max: 36805379.00 [98.6%] avg: 36774506.33 [98.7%] std: 17109.03 runs: 6
      Major
      0-limit/base: min: 170604.00 max: 221141.00 avg: 196081.83 std: 18217.01 runs: 6
      0-limit/rework: min: 2864.00 [1.7%] max: 10029.00 [4.5%] avg: 5627.33 [2.9%] std: 2252.71 runs: 6
      
      As with my previous testing, Minor faults are more or less within
      noise but the Major fault count is way below the base kernel.
      
      While this looks like a nice win, it is fair to say that the 0-limit
      configuration is quite artificial, so I was playing with 0-no-limit
      loads as well.
      
      * TEST 2
      ========
      
      The following results are from 2 groups configuration on a 16GB machine
      (single NUMA node).
      
      - A running stream IO (dd if=/dev/zero of=local.file bs=1024) with
        2*TotalMem with 0 soft limit.
      - B running a mem_eater which consumes TotalMem-1G without any limit. The
        mem_eater consumes the memory in 100 chunks with 1s nap after each
      mmap+populate so that both loads have a chance to fight for the memory.
      
      The expected result is that B shouldn't be reclaimed and A shouldn't see
      a big dropdown in elapsed time.
      
      User
      base: min: 2.68 max: 2.89 avg: 2.76 std: 0.09 runs: 3
      rework: min: 3.27 [122.0%] max: 3.74 [129.4%] avg: 3.44 [124.6%] std: 0.21 runs: 3
      System
      base: min: 86.26 max: 88.29 avg: 87.28 std: 0.83 runs: 3
      rework: min: 81.05 [94.0%] max: 84.96 [96.2%] avg: 83.14 [95.3%] std: 1.61 runs: 3
      Elapsed
      base: min: 317.28 max: 332.39 avg: 325.84 std: 6.33 runs: 3
      rework: min: 281.53 [88.7%] max: 298.16 [89.7%] avg: 290.99 [89.3%] std: 6.98 runs: 3
      
      System time improved slightly as well as Elapsed. My previous testing
      has shown worse numbers but this again seem to depend on the storage
      speed.
      
      My theory is that writeback doesn't catch up and prio-0 soft reclaim
      ends up waiting on writeback pages too often in the base kernel.  The
      patched kernel doesn't do that because the soft reclaim is done from the
      kswapd/direct reclaim context.  This can be seen nicely on the following
      graph: group A's usage_in_bytes regularly drops really low.
      
      All 3 runs
      http://labs.suse.cz/mhocko/soft_limit_rework/stream_io-vs-mem_eater/stream.png
      resp. a detail of the single run
      http://labs.suse.cz/mhocko/soft_limit_rework/stream_io-vs-mem_eater/stream-one-run.png
      
      mem_eater seems to be doing better as well. It gets to the full
      allocation size faster as can be seen on the following graph:
      http://labs.suse.cz/mhocko/soft_limit_rework/stream_io-vs-mem_eater/mem_eater-one-run.png
      
      /proc/meminfo collected during the test also shows that rework kernel
      hasn't swapped that much (well almost not at all):
      base: max: 123900 K avg: 56388.29 K
      rework: max: 300 K avg: 128.68 K
      
      kswapd and direct reclaim statistics are unfortunately of no use because
      soft reclaim is not accounted properly as the counters are hidden by
      global_reclaim() checks in the base kernel.
      
      * TEST 3
      ========
      
      Another test was the same configuration as TEST2 except the stream IO was
      replaced by a single kbuild (16 parallel jobs bound to Node0 cpus same as
      in TEST1) and mem_eater allocated TotalMem-200M so kbuild had only 200MB
      left.
      
      Kbuild did better with the rework kernel here as well:
      User
      base: min: 860.28 max: 872.86 avg: 868.03 std: 5.54 runs: 3
      rework: min: 880.81 [102.4%] max: 887.45 [101.7%] avg: 883.56 [101.8%] std: 2.83 runs: 3
      System
      base: min: 84.35 max: 85.06 avg: 84.79 std: 0.31 runs: 3
      rework: min: 85.62 [101.5%] max: 86.09 [101.2%] avg: 85.79 [101.2%] std: 0.21 runs: 3
      Elapsed
      base: min: 135.36 max: 243.30 avg: 182.47 std: 45.12 runs: 3
      rework: min: 110.46 [81.6%] max: 116.20 [47.8%] avg: 114.15 [62.6%] std: 2.61 runs: 3
      Minor
      base: min: 36635476.00 max: 36673365.00 avg: 36654812.00 std: 15478.03 runs: 3
      rework: min: 36639301.00 [100.0%] max: 36695541.00 [100.1%] avg: 36665511.00 [100.0%] std: 23118.23 runs: 3
      Major
      base: min: 14708.00 max: 53328.00 avg: 31379.00 std: 16202.24 runs: 3
      rework: min: 302.00 [2.1%] max: 414.00 [0.8%] avg: 366.33 [1.2%] std: 47.22 runs: 3
      
      Again we can see a significant improvement in Elapsed (it also seems to
      be more stable), there is a huge dropdown for the Major page faults and
      much more swapping:
      base: max: 583736 K avg: 112547.43 K
      rework: max: 4012 K avg: 124.36 K
      
      Graphs from all three runs show the variability of the kbuild quite
      nicely.  It even seems that it took longer after every run with the base
      kernel which would be quite surprising as the source tree for the build is
      removed and caches are dropped after each run, so the build operates on
      freshly extracted sources every time.
      http://labs.suse.cz/mhocko/soft_limit_rework/stream_io-vs-mem_eater/kbuild-mem_eater.png
      
      My other testing shows that this is just a matter of timing and other runs
      behave differently; the std for Elapsed time is similarly ~50.  Example of
      other three runs:
      http://labs.suse.cz/mhocko/soft_limit_rework/stream_io-vs-mem_eater/kbuild-mem_eater2.png
      
      So to wrap this up: the series is still doing well and improves the soft
      limit behaviour.
      
      The testing results for bunch of cgroups with both stream IO and kbuild
      loads can be found in "memcg: track children in soft limit excess to
      improve soft limit".
      
      This patch:
      
      Memcg soft reclaim has been traditionally triggered from the global
      reclaim paths before calling shrink_zone.  mem_cgroup_soft_limit_reclaim
      then picked up a group which exceeds the soft limit the most and reclaimed
      it with 0 priority to reclaim at least SWAP_CLUSTER_MAX pages.
      
      The infrastructure requires per-node-zone trees which hold over-limit
      groups and keep them up-to-date (via memcg_check_events) which is not cost
      free.  Although this overhead hasn't turned out to be a bottleneck, the
      implementation is suboptimal because mem_cgroup_update_tree has no idea
      which zones consumed memory over the limit so we could easily end up
      having a group on a node-zone tree having only few pages from that
      node-zone.
      
      This patch doesn't try to fix node-zone trees management because it seems
      that integrating soft reclaim into zone shrinking sounds much easier and
      more appropriate for several reasons.  First of all 0 priority reclaim was
      a crude hack which might lead to big stalls if the group's LRUs are big
      and hard to reclaim (e.g.  a lot of dirty/writeback pages).  Soft reclaim
      should be applicable also to the targeted reclaim which is awkward right
      now without additional hacks.  Last but not least the whole infrastructure
      eats quite some code.
      
      After this patch shrink_zone is done in 2 passes.  First it tries to do
      the soft reclaim if appropriate (only for global reclaim for now, to keep
      compatible with the original state) and falls back to ignoring the soft
      limit if no group is eligible for soft reclaim or nothing has been
      scanned during the first pass.  Only groups which are over their soft
      limit, or which have a parent up the hierarchy over the limit, are
      considered eligible during the first pass.
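
      A rough sketch of the eligibility filter inside the memcg walk during
      the soft limit pass (helper signatures are assumptions based on this
      description):

              static void __shrink_zone(struct zone *zone, struct scan_control *sc,
                                        bool soft_reclaim)
              {
                      struct mem_cgroup *root = sc->target_mem_cgroup;
                      struct mem_cgroup *memcg = mem_cgroup_iter(root, NULL, NULL);

                      do {
                              /* during the soft limit pass, skip non-eligible groups */
                              if (soft_reclaim &&
                                  !mem_cgroup_soft_reclaim_eligible(memcg, root)) {
                                      memcg = mem_cgroup_iter(root, memcg, NULL);
                                      continue;
                              }

                              shrink_lruvec(mem_cgroup_zone_lruvec(zone, memcg), sc);
                              memcg = mem_cgroup_iter(root, memcg, NULL);
                      } while (memcg);
              }
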
      
      The soft limit tree, which is not necessary anymore, will be removed in a
      follow-up patch to make this patch smaller and easier to review.
      Signed-off-by: Michal Hocko <mhocko@suse.cz>
      Reviewed-by: Glauber Costa <glommer@openvz.org>
      Reviewed-by: Tejun Heo <tj@kernel.org>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Ying Han <yinghan@google.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Michel Lespinasse <walken@google.com>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Balbir Singh <bsingharora@gmail.com>
      Cc: Glauber Costa <glommer@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      3b38722e
  8. 12 Sep, 2013 3 commits
    • mm: vmscan: fix do_try_to_free_pages() livelock · 6e543d57
      Lisa Du authored
      This patch is based on KOSAKI's work and I add a little more description;
      please refer to https://lkml.org/lkml/2012/6/14/74.
      
      I found that the system can enter a state where there are lots of free
      pages in a zone but only order-0 and order-1 pages, which means the zone
      is heavily fragmented; a high-order allocation can then cause long stalls
      (e.g. 60 seconds) in the direct reclaim path, especially in a no-swap and
      no-compaction environment.  This problem happened on v3.4, but the issue
      seems to still exist in the current tree.  The reason is that
      do_try_to_free_pages enters a livelock:
      
      kswapd will go to sleep if the zones have been fully scanned and are
      still not balanced, since kswapd thinks there's little point in trying
      all over again (to avoid an infinite loop).  Instead it changes the
      order from high-order to order-0, because kswapd thinks order-0 is the
      most important.  Look at 73ce02e9 for details.  If the watermarks are
      ok, kswapd will go back to sleep and may leave
      zone->all_unreclaimable = 0.  It assumes high-order users can still
      perform direct reclaim if they wish.
      
      Direct reclaim continues to reclaim for a high order which is not a
      COSTLY_ORDER, without invoking the oom-killer, until kswapd turns on
      zone->all_unreclaimable.  This is to avoid a too-early oom-kill.
      So direct reclaim depends on kswapd to break this loop.
      
      In the worst case, direct reclaim may continue page reclaim forever while
      kswapd sleeps forever, until something like a watchdog detects it and
      finally kills the process.  As described in:
      http://thread.gmane.org/gmane.linux.kernel.mm/103737
      
      We can't turn on zone->all_unreclaimable from the direct reclaim path
      because the direct reclaim path doesn't take any lock, so this would be
      racy.  Thus this patch removes the zone->all_unreclaimable field
      completely and recalculates the zone reclaimable state every time.
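
      A sketch of the recalculated reclaimable state; the 6x scan factor
      mirrors the heuristic previously used for all_unreclaimable (treat the
      details as assumptions):

              static unsigned long zone_reclaimable_pages(struct zone *zone)
              {
                      unsigned long nr;

                      nr = zone_page_state(zone, NR_ACTIVE_FILE) +
                           zone_page_state(zone, NR_INACTIVE_FILE);

                      if (get_nr_swap_pages() > 0)
                              nr += zone_page_state(zone, NR_ACTIVE_ANON) +
                                    zone_page_state(zone, NR_INACTIVE_ANON);

                      return nr;
              }

              static bool zone_reclaimable(struct zone *zone)
              {
                      /* give up on a zone only after scanning 6x its reclaimable pages */
                      return zone->pages_scanned < zone_reclaimable_pages(zone) * 6;
              }
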
      
      Note: we can't take the approach of having direct reclaim look at
      zone->pages_scanned directly while kswapd continues to use
      zone->all_unreclaimable, because that is racy.  Commit 929bea7c (vmscan:
      all_unreclaimable() use zone->all_unreclaimable as a name) describes the
      details.
      
      [akpm@linux-foundation.org: uninline zone_reclaimable_pages() and zone_reclaimable()]
      Cc: Aaditya Kumar <aaditya.kumar.30@gmail.com>
      Cc: Ying Han <yinghan@google.com>
      Cc: Nick Piggin <npiggin@gmail.com>
      Acked-by: Rik van Riel <riel@redhat.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Bob Liu <lliubbo@gmail.com>
      Cc: Neil Zhang <zhangwm@marvell.com>
      Cc: Russell King - ARM Linux <linux@arm.linux.org.uk>
      Reviewed-by: Michal Hocko <mhocko@suse.cz>
      Acked-by: Minchan Kim <minchan@kernel.org>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Signed-off-by: Lisa Du <cldu@marvell.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      6e543d57
    • mm: putback_lru_page: remove unnecessary call to page_lru_base_type() · 0ec3b74c
      Vlastimil Babka authored
      The goal of this patch series is to improve performance of munlock() of
      large mlocked memory areas on systems without THP.  This is motivated by
      reported very long times of crash recovery of processes with such areas,
      where munlock() can take several seconds.  See
      http://lwn.net/Articles/548108/
      
      The work was driven by a simple benchmark (to be included in mmtests) that
      mmaps() e.g.  56GB with MAP_LOCKED | MAP_POPULATE and measures the time of
      munlock().  Profiling was performed by attaching operf --pid to the
      process and sending a signal to trigger the munlock() part and then notifying
      back the monitoring wrapper to stop operf, so that only munlock() appears
      in the profile.
      
      The profiles have shown that CPU time is spent mostly on atomic operations
      and repeated per-page locking.  This series aims to reduce both, starting
      from simpler and moving to more complex changes.
      
      Patch 1 performs a simple cleanup in putback_lru_page() so that page lru base
      	type is not determined without being actually needed.
      
      Patch 2 removes an unnecessary call to lru_add_drain() which drains the per-cpu
      	pagevec after each munlocked page is put there.
      
      Patch 3 changes munlock_vma_range() to use an on-stack pagevec for isolating
      	multiple non-THP pages under a single lru_lock instead of locking and
      	processing each page separately.
      
      Patch 4 changes the NR_MLOCK accounting to be called only once per the pvec
      	introduced by previous patch.
      
      Patch 5 uses the introduced pagevec to batch also the work of putback_lru_page
      	when possible, bypassing the per-cpu pvec and associated overhead.
      
      Patch 6 removes a redundant get_page/put_page pair which saves costly atomic
      	operations.
      
      Patch 7 avoids calling follow_page_mask() on each individual page, and obtains
      	multiple page references under a single page table lock where possible.
      
      Measurements were made using 3.11-rc3 as a baseline.  The first set of
      measurements shows the possibly ideal conditions where batching should
      help the most.  All memory is allocated from a single NUMA node and THP is
      disabled.
      
      timedmunlock
                                  3.11-rc3              3.11-rc3              3.11-rc3              3.11-rc3              3.11-rc3              3.11-rc3              3.11-rc3              3.11-rc3
                                         0                     1                     2                     3                     4                     5                     6                     7
      Elapsed min           3.38 (  0.00%)        3.39 ( -0.13%)        3.00 ( 11.33%)        2.70 ( 20.20%)        2.67 ( 21.11%)        2.37 ( 29.88%)        2.20 ( 34.91%)        1.91 ( 43.59%)
      Elapsed mean          3.39 (  0.00%)        3.40 ( -0.23%)        3.01 ( 11.33%)        2.70 ( 20.26%)        2.67 ( 21.21%)        2.38 ( 29.88%)        2.21 ( 34.93%)        1.92 ( 43.46%)
      Elapsed stddev        0.01 (  0.00%)        0.01 (-43.09%)        0.01 ( 15.42%)        0.01 ( 23.42%)        0.00 ( 89.78%)        0.01 ( -7.15%)        0.00 ( 76.69%)        0.02 (-91.77%)
      Elapsed max           3.41 (  0.00%)        3.43 ( -0.52%)        3.03 ( 11.29%)        2.72 ( 20.16%)        2.67 ( 21.63%)        2.40 ( 29.50%)        2.21 ( 35.21%)        1.96 ( 42.39%)
      Elapsed range         0.03 (  0.00%)        0.04 (-51.16%)        0.02 (  6.27%)        0.02 ( 14.67%)        0.00 ( 88.90%)        0.03 (-19.18%)        0.01 ( 73.70%)        0.06 (-113.35%)
      
      The second set of measurements simulates the worst possible conditions for
      batching by using numactl --interleave, so that there is in fact only one
      page per pagevec.  Even in this case the series seems to improve
      performance thanks to reduced atomic operations and removal of
      lru_add_drain().
      
      timedmunlock
                                  3.11-rc3              3.11-rc3              3.11-rc3              3.11-rc3              3.11-rc3              3.11-rc3              3.11-rc3              3.11-rc3
                                         0                     1                     2                     3                     4                     5                     6                     7
      Elapsed min           4.00 (  0.00%)        4.04 ( -0.93%)        3.87 (  3.37%)        3.72 (  6.94%)        3.81 (  4.72%)        3.69 (  7.82%)        3.64 (  8.92%)        3.41 ( 14.81%)
      Elapsed mean          4.17 (  0.00%)        4.15 (  0.51%)        4.03 (  3.49%)        3.89 (  6.84%)        3.86 (  7.48%)        3.89 (  6.69%)        3.70 ( 11.27%)        3.48 ( 16.59%)
      Elapsed stddev        0.16 (  0.00%)        0.08 ( 50.76%)        0.10 ( 41.58%)        0.16 (  4.59%)        0.05 ( 72.38%)        0.19 (-12.91%)        0.05 ( 68.09%)        0.06 ( 66.03%)
      Elapsed max           4.34 (  0.00%)        4.32 (  0.56%)        4.19 (  3.62%)        4.12 (  5.15%)        3.91 (  9.88%)        4.12 (  5.25%)        3.80 ( 12.58%)        3.56 ( 18.08%)
      Elapsed range         0.34 (  0.00%)        0.28 ( 17.91%)        0.32 (  6.45%)        0.40 (-15.73%)        0.10 ( 70.06%)        0.43 (-24.84%)        0.15 ( 55.32%)        0.15 ( 56.16%)
      
      For completeness, a third set of measurements shows the situation where
      THP is enabled and allocations are again done on a single NUMA node.  Here
      munlock() is already very fast thanks to huge pages, and this series does
      not compromise that performance.  It seems that the removal of call to
      lru_add_drain() still helps a bit.
      
      timedmunlock
                                  3.11-rc3              3.11-rc3              3.11-rc3              3.11-rc3              3.11-rc3              3.11-rc3              3.11-rc3              3.11-rc3
                                         0                     1                     2                     3                     4                     5                     6                     7
      Elapsed min           0.01 (  0.00%)        0.01 ( -0.11%)        0.01 (  6.59%)        0.01 (  5.41%)        0.01 (  5.45%)        0.01 (  5.03%)        0.01 (  6.08%)        0.01 (  5.20%)
      Elapsed mean          0.01 (  0.00%)        0.01 ( -0.27%)        0.01 (  6.39%)        0.01 (  5.30%)        0.01 (  5.32%)        0.01 (  5.03%)        0.01 (  5.97%)        0.01 (  5.22%)
      Elapsed stddev        0.00 (  0.00%)        0.00 ( -9.59%)        0.00 ( 10.77%)        0.00 (  3.24%)        0.00 ( 24.42%)        0.00 ( 31.86%)        0.00 ( -7.46%)        0.00 (  6.11%)
      Elapsed max           0.01 (  0.00%)        0.01 ( -0.01%)        0.01 (  6.83%)        0.01 (  5.42%)        0.01 (  5.79%)        0.01 (  5.53%)        0.01 (  6.08%)        0.01 (  5.26%)
      Elapsed range         0.00 (  0.00%)        0.00 (  7.30%)        0.00 ( 24.38%)        0.00 (  6.10%)        0.00 ( 30.79%)        0.00 ( 42.52%)        0.00 (  6.11%)        0.00 ( 10.07%)
      
      This patch (of 7):
      
      In putback_lru_page(), since commit c53954a0 ("mm: remove lru parameter
      from __lru_cache_add and lru_cache_add_lru") it is no longer necessary to
      determine the lru list via page_lru_base_type().
      
      This patch replaces it with a simple flag, is_unevictable, which says
      that the page was put on the unevictable list.  This is the only
      information that matters in subsequent tests.
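
      An abridged sketch of the simplified flow (the real function's retry
      label and race handling are omitted):

              void putback_lru_page(struct page *page)
              {
                      bool is_unevictable;

                      ClearPageUnevictable(page);
                      if (page_evictable(page)) {
                              lru_cache_add(page);
                              is_unevictable = false;
                      } else {
                              add_page_to_unevictable_list(page);
                              is_unevictable = true;
                      }

                      /*
                       * The flag is all the later test needs: no
                       * page_lru_base_type() call to work out which list the
                       * page actually went to.
                       */
                      if (is_unevictable && page_evictable(page))
                              /* page became evictable meanwhile: retry (omitted) */ ;

                      put_page(page);
              }
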
      Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
      Reviewed-by: Jörn Engel <joern@logfs.org>
      Acked-by: Mel Gorman <mgorman@suse.de>
      Cc: Michel Lespinasse <walken@google.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      0ec3b74c
    • mm: vmscan: fix numa reclaim balance problem in kswapd · 892f795d
      Johannes Weiner authored
      The way the page allocator interacts with kswapd creates aging imbalances,
      where the amount of time a userspace page gets in memory under reclaim
      pressure is dependent on which zone and which node the allocator took the
      page frame from.
      
      #1 fixes missed kswapd wakeups on NUMA systems, which lead to some
         nodes falling behind for a full reclaim cycle relative to the other
         nodes in the system
      
      #3 fixes an interaction where kswapd and a continuous stream of page
         allocations keep the preferred zone of a task between the high and
         low watermark (allocations succeed + kswapd does not go to sleep)
         indefinitely, completely underutilizing the lower zones and
         thrashing on the preferred zone
      
      These patches are the aging fairness part of the thrash-detection based
      file LRU balancing.  Andrea recommended to submit them separately as they
      are bugfixes in their own right.
      
      The following test ran a foreground workload (memcachetest) with
      background IO of various sizes on a 4 node 8G system (similar results were
      observed with single-node 4G systems):
      
      parallelio
                                              BASE                   FAIRALLOC
      Ops memcachetest-0M              5170.00 (  0.00%)           5283.00 (  2.19%)
      Ops memcachetest-791M            4740.00 (  0.00%)           5293.00 ( 11.67%)
      Ops memcachetest-2639M           2551.00 (  0.00%)           4950.00 ( 94.04%)
      Ops memcachetest-4487M           2606.00 (  0.00%)           3922.00 ( 50.50%)
      Ops io-duration-0M                  0.00 (  0.00%)              0.00 (  0.00%)
      Ops io-duration-791M               55.00 (  0.00%)             18.00 ( 67.27%)
      Ops io-duration-2639M             235.00 (  0.00%)            103.00 ( 56.17%)
      Ops io-duration-4487M             278.00 (  0.00%)            173.00 ( 37.77%)
      Ops swaptotal-0M                    0.00 (  0.00%)              0.00 (  0.00%)
      Ops swaptotal-791M             245184.00 (  0.00%)              0.00 (  0.00%)
      Ops swaptotal-2639M            468069.00 (  0.00%)         108778.00 ( 76.76%)
      Ops swaptotal-4487M            452529.00 (  0.00%)          76623.00 ( 83.07%)
      Ops swapin-0M                       0.00 (  0.00%)              0.00 (  0.00%)
      Ops swapin-791M                108297.00 (  0.00%)              0.00 (  0.00%)
      Ops swapin-2639M               169537.00 (  0.00%)          50031.00 ( 70.49%)
      Ops swapin-4487M               167435.00 (  0.00%)          34178.00 ( 79.59%)
      Ops minorfaults-0M            1518666.00 (  0.00%)        1503993.00 (  0.97%)
      Ops minorfaults-791M          1676963.00 (  0.00%)        1520115.00 (  9.35%)
      Ops minorfaults-2639M         1606035.00 (  0.00%)        1799717.00 (-12.06%)
      Ops minorfaults-4487M         1612118.00 (  0.00%)        1583825.00 (  1.76%)
      Ops majorfaults-0M                  6.00 (  0.00%)              0.00 (  0.00%)
      Ops majorfaults-791M            13836.00 (  0.00%)             10.00 ( 99.93%)
      Ops majorfaults-2639M           22307.00 (  0.00%)           6490.00 ( 70.91%)
      Ops majorfaults-4487M           21631.00 (  0.00%)           4380.00 ( 79.75%)
      
                      BASE   FAIRALLOC
      User          287.78      460.97
      System       2151.67     3142.51
      Elapsed      9737.00     8879.34
      
                                        BASE   FAIRALLOC
      Minor Faults                  53721925    57188551
      Major Faults                    392195       15157
      Swap Ins                       2994854      112770
      Swap Outs                      4907092      134982
      Direct pages scanned                 0       41824
      Kswapd pages scanned          32975063     8128269
      Kswapd pages reclaimed         6323069     7093495
      Direct pages reclaimed               0       41824
      Kswapd efficiency                  19%         87%
      Kswapd velocity               3386.573     915.414
      Direct efficiency                 100%        100%
      Direct velocity                  0.000       4.710
      Percentage direct scans             0%          0%
      Zone normal velocity          2011.338     550.661
      Zone dma32 velocity           1365.623     369.221
      Zone dma velocity                9.612       0.242
      Page writes by reclaim    18732404.000  614807.000
      Page writes file              13825312      479825
      Page writes anon               4907092      134982
      Page reclaim immediate           85490        5647
      Sector Reads                  12080532      483244
      Sector Writes                 88740508    65438876
      Page rescued immediate               0           0
      Slabs scanned                    82560       12160
      Direct inode steals                  0           0
      Kswapd inode steals              24401       40013
      Kswapd skipped wait                  0           0
      THP fault alloc                      6           8
      THP collapse alloc                5481        5812
      THP splits                          75          22
      THP fault fallback                   0           0
      THP collapse fail                    0           0
      Compaction stalls                    0          54
      Compaction success                   0          45
      Compaction failures                  0           9
      Page migrate success            881492       82278
      Page migrate failure                 0           0
      Compaction pages isolated            0       60334
      Compaction migrate scanned           0       53505
      Compaction free scanned              0     1537605
      Compaction cost                    914          86
      NUMA PTE updates              46738231    41988419
      NUMA hint faults              31175564    24213387
      NUMA hint local faults        10427393     6411593
      NUMA pages migrated             881492       55344
      AutoNUMA cost                   156221      121361
      
      The overall runtime was reduced, throughput for both the foreground
      workload and the background IO improved, major faults, swapping and
      reclaim activity shrank significantly, and reclaim efficiency more than
      quadrupled.
      
      This patch:
      
      When the page allocator fails to get a page from all zones in its given
      zonelist, it wakes up the per-node kswapds for all zones that are at their
      low watermark.
      
      However, with a system under load the free pages in a zone can fluctuate
      enough that the allocation fails but the kswapd wakeup is also skipped
      while the zone is still really close to the low watermark.
      
      When one node misses a wakeup like this, it won't be aged before all the
      other node's zones are down to their low watermarks again.  And skipping a
      full aging cycle is an obvious fairness problem.
      
      Kswapd runs until the high watermarks are restored, so it should also be
      woken when the high watermarks are not met.  This ages nodes more equally
      and creates a safety margin for the page counter fluctuation.
      
      By using zone_balanced(), it will now check, in addition to the watermark,
      if compaction requires more order-0 pages to create a higher order page.
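
      A sketch of the wakeup check described above (the surrounding
      wakeup_kswapd() body is abridged):

              void wakeup_kswapd(struct zone *zone, int order, enum zone_type classzone_idx)
              {
                      pg_data_t *pgdat = zone->zone_pgdat;

                      /* ... population / cpuset checks omitted ... */

                      if (!waitqueue_active(&pgdat->kswapd_wait))
                              return;
                      /*
                       * Wake kswapd whenever the *high* watermark (plus whatever
                       * compaction needs) is not met, not only when the
                       * allocation already failed at the low watermark.
                       */
                      if (zone_balanced(zone, order, 0, 0))
                              return;

                      wake_up_interruptible(&pgdat->kswapd_wait);
              }
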
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Reviewed-by: Rik van Riel <riel@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Paul Bolle <paul.bollee@gmail.com>
      Tested-by: Zlatko Calusic <zcalusic@bitsync.net>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      892f795d
  9. 11 Sep, 2013 4 commits
    • shrinker: Kill old ->shrink API. · a0b02131
      Dave Chinner authored
      There are no more users of this API, so kill it dead, dead, dead and
      quietly bury the corpse in a shallow, unmarked grave in a dark forest deep
      in the hills...
      
      [glommer@openvz.org: added flowers to the grave]
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Signed-off-by: Glauber Costa <glommer@openvz.org>
      Reviewed-by: Greg Thelen <gthelen@google.com>
      Acked-by: Mel Gorman <mgorman@suse.de>
      Cc: "Theodore Ts'o" <tytso@mit.edu>
      Cc: Adrian Hunter <adrian.hunter@intel.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
      Cc: Arve Hjønnevåg <arve@android.com>
      Cc: Carlos Maiolino <cmaiolino@redhat.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Chuck Lever <chuck.lever@oracle.com>
      Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Gleb Natapov <gleb@redhat.com>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: J. Bruce Fields <bfields@redhat.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Jerome Glisse <jglisse@redhat.com>
      Cc: John Stultz <john.stultz@linaro.org>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Kent Overstreet <koverstreet@google.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Marcelo Tosatti <mtosatti@redhat.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Steven Whitehouse <swhiteho@redhat.com>
      Cc: Thomas Hellstrom <thellstrom@vmware.com>
      Cc: Trond Myklebust <Trond.Myklebust@netapp.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      a0b02131
    • vmscan: per-node deferred work · 1d3d4437
      Glauber Costa authored
      The list_lru infrastructure already keeps per-node LRU lists in its
      node-specific list_lru_node arrays and provides us with a per-node API,
      and the shrinkers are properly equipped with node information.  This
      means that we can now focus our shrinking effort on a single node, but
      the work that is deferred from one run to another is kept global in
      nr_in_batch.  Work can be deferred, for instance, during direct reclaim
      under a GFP_NOFS allocation, in which situation all the filesystem
      shrinkers will be prevented from running and will accumulate in
      nr_in_batch the amount of work they should have done, but could not.
      
      This creates an impedance problem: upon node pressure, deferred work
      will accumulate and end up being flushed on other nodes.  The problem we
      describe is particularly harmful on big machines, where many nodes can
      accumulate work at the same time, all adding to the global counter
      nr_in_batch.  As we accumulate more and more, we start asking the caches
      to flush ever bigger numbers of objects.  The result is that the caches
      are depleted and do not stabilize.  To achieve stable steady-state
      behavior, we need to tackle it differently.
      
      In this patch we keep the deferred count per-node, in the new array
      nr_deferred[] (the name is also a bit more descriptive) and will never
      accumulate that to other nodes.
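
      A sketch of the per-node deferred counter setup implied by the
      description (field and flag names are assumptions):

              int register_shrinker(struct shrinker *shrinker)
              {
                      size_t size = sizeof(*shrinker->nr_deferred);

                      /* NUMA-aware shrinkers get one deferred counter per node */
                      if (shrinker->flags & SHRINKER_NUMA_AWARE)
                              size *= nr_node_ids;

                      shrinker->nr_deferred = kzalloc(size, GFP_KERNEL);
                      if (!shrinker->nr_deferred)
                              return -ENOMEM;

                      down_write(&shrinker_rwsem);
                      list_add_tail(&shrinker->list, &shrinker_list);
                      up_write(&shrinker_rwsem);
                      return 0;
              }
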
      Signed-off-by: Glauber Costa <glommer@openvz.org>
      Cc: Dave Chinner <dchinner@redhat.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: "Theodore Ts'o" <tytso@mit.edu>
      Cc: Adrian Hunter <adrian.hunter@intel.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
      Cc: Arve Hjønnevåg <arve@android.com>
      Cc: Carlos Maiolino <cmaiolino@redhat.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Chuck Lever <chuck.lever@oracle.com>
      Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Gleb Natapov <gleb@redhat.com>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: J. Bruce Fields <bfields@redhat.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Jerome Glisse <jglisse@redhat.com>
      Cc: John Stultz <john.stultz@linaro.org>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Kent Overstreet <koverstreet@google.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Marcelo Tosatti <mtosatti@redhat.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Steven Whitehouse <swhiteho@redhat.com>
      Cc: Thomas Hellstrom <thellstrom@vmware.com>
      Cc: Trond Myklebust <Trond.Myklebust@netapp.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      1d3d4437
    • shrinker: add node awareness · 0ce3d744
      Dave Chinner authored
      Pass the node of the current zone being reclaimed to shrink_slab(),
      allowing the shrinker control nodemask to be set appropriately for
      node-aware shrinkers.
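
      A sketch of the caller side in the reclaim path (the nodes_to_scan
      field of struct shrink_control is assumed from this series):

              struct shrink_control shrink = {
                      .gfp_mask = sc->gfp_mask,
              };

              /* only ask shrinkers to scan the node whose zone is being reclaimed */
              nodes_clear(shrink.nodes_to_scan);
              node_set(zone_to_nid(zone), shrink.nodes_to_scan);

              shrink_slab(&shrink, sc->nr_scanned, lru_pages);
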
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Signed-off-by: Glauber Costa <glommer@openvz.org>
      Acked-by: Mel Gorman <mgorman@suse.de>
      Cc: "Theodore Ts'o" <tytso@mit.edu>
      Cc: Adrian Hunter <adrian.hunter@intel.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
      Cc: Arve Hjønnevåg <arve@android.com>
      Cc: Carlos Maiolino <cmaiolino@redhat.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Chuck Lever <chuck.lever@oracle.com>
      Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Gleb Natapov <gleb@redhat.com>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: J. Bruce Fields <bfields@redhat.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Jerome Glisse <jglisse@redhat.com>
      Cc: John Stultz <john.stultz@linaro.org>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Kent Overstreet <koverstreet@google.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Marcelo Tosatti <mtosatti@redhat.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Steven Whitehouse <swhiteho@redhat.com>
      Cc: Thomas Hellstrom <thellstrom@vmware.com>
      Cc: Trond Myklebust <Trond.Myklebust@netapp.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      0ce3d744
    • D
      mm: new shrinker API · 24f7c6b9
      Dave Chinner committed
      The current shrinker callout API uses a single shrinker call for
      multiple functions.  To determine the function, a special magical value is
      passed in a parameter to change the behaviour.  This complicates the
      implementation and return value specification for the different
      behaviours.
      
      Separate the two different behaviours into separate operations, one to
      return a count of freeable objects in the cache, and another to scan a
      certain number of objects in the cache for freeing.  In defining these new
      operations, ensure the return values and resultant behaviours are clearly
      defined and documented.
      
      Modify shrink_slab() to use the new API and implement the callouts for all
      the existing shrinkers.
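
      In outline, the split API is along these lines (see
      include/linux/shrinker.h in the tree for the authoritative definition;
      this is a simplified sketch):

              struct shrinker {
                      /* how many objects could be freed right now? */
                      unsigned long (*count_objects)(struct shrinker *,
                                                     struct shrink_control *sc);
                      /*
                       * Try to free sc->nr_to_scan objects; return the number
                       * actually freed, or SHRINK_STOP if no progress can be
                       * made (e.g. the caller cannot do the required IO).
                       */
                      unsigned long (*scan_objects)(struct shrinker *,
                                                    struct shrink_control *sc);

                      int seeks;      /* relative cost of recreating an object */
                      long batch;     /* objects to scan per scan_objects() call */
              };

      A cache then reports its size from count_objects() and walks its own LRU
      from scan_objects(), freeing at most sc->nr_to_scan entries per call.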
      Signed-off-by: NDave Chinner <dchinner@redhat.com>
      Signed-off-by: NGlauber Costa <glommer@parallels.com>
      Acked-by: NMel Gorman <mgorman@suse.de>
      Cc: "Theodore Ts'o" <tytso@mit.edu>
      Cc: Adrian Hunter <adrian.hunter@intel.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
      Cc: Arve Hjønnevåg <arve@android.com>
      Cc: Carlos Maiolino <cmaiolino@redhat.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Chuck Lever <chuck.lever@oracle.com>
      Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Gleb Natapov <gleb@redhat.com>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: J. Bruce Fields <bfields@redhat.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Jerome Glisse <jglisse@redhat.com>
      Cc: John Stultz <john.stultz@linaro.org>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Kent Overstreet <koverstreet@google.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Marcelo Tosatti <mtosatti@redhat.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Steven Whitehouse <swhiteho@redhat.com>
      Cc: Thomas Hellstrom <thellstrom@vmware.com>
      Cc: Trond Myklebust <Trond.Myklebust@netapp.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      24f7c6b9
  10. 10 Jul, 2013: 2 commits
    • M
      mm: vmscan: do not scale writeback pages when deciding whether to set ZONE_WRITEBACK · 918fc718
      Mel Gorman committed
      After the patch "mm: vmscan: Flatten kswapd priority loop" was merged
      the scanning priority of kswapd changed.
      
      The priority now rises until it is scanning enough pages to meet the
      high watermark.  shrink_inactive_list sets ZONE_WRITEBACK if a number of
      pages were encountered under writeback but this value is scaled based on
      the priority.  As kswapd frequently scans with a higher priority now it
      is relatively easy to set ZONE_WRITEBACK.  This patch removes the
      scaling and treats writeback pages similarly to how it treats unqueued
      dirty pages and congested pages.  The user-visible effect should be that
      kswapd will writeback fewer pages from reclaim context.
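
      The change amounts to replacing a priority-scaled threshold with a direct
      comparison, roughly as follows (exact thresholds per the patch itself):

              /* before: the flag got easier to set as priority rose */
              if (nr_writeback && nr_writeback >=
                              (nr_taken >> (DEF_PRIORITY - sc->priority)))
                      zone_set_flag(zone, ZONE_WRITEBACK);

              /* after: only when everything taken off the LRU was under writeback */
              if (nr_writeback && nr_writeback == nr_taken)
                      zone_set_flag(zone, ZONE_WRITEBACK);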
      Signed-off-by: NMel Gorman <mgorman@suse.de>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      918fc718
    • M
      mm: vmscan: do not continue scanning if reclaim was aborted for compaction · 5a1c9cbc
      Mel Gorman committed
      Direct reclaim is not aborting to allow compaction to go ahead properly.
      do_try_to_free_pages is told to abort reclaim, which it happily ignores,
      and instead raises priority until it reaches 0 and starts shrinking
      file/anon equally.  This patch corrects the situation by aborting reclaim
      when requested instead of raising priority.
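
      Schematically, the priority loop in do_try_to_free_pages() becomes
      something like this (names abbreviated from the actual code):

              do {
                      /* shrink_zones() reports whether enough memory has been
                       * freed for compaction to be able to run */
                      aborted_reclaim = shrink_zones(zonelist, sc);

                      /* ... writeback wakeup, reclaim accounting ... */

                      /* abort as requested instead of raising priority */
                      if (aborted_reclaim)
                              break;
              } while (--sc->priority >= 0);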
      Signed-off-by: NMel Gorman <mgorman@suse.de>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      5a1c9cbc
  11. 04 Jul, 2013: 11 commits
    • M
      mm: remove lru parameter from __lru_cache_add and lru_cache_add_lru · c53954a0
      Mel Gorman committed
      Similar to __pagevec_lru_add, this patch removes the LRU parameter from
      __lru_cache_add and lru_cache_add_lru as the caller does not control the
      exact LRU the page gets added to.  lru_cache_add_lru gets renamed to
      lru_cache_add as the name is silly without the lru parameter.  With the
      parameter removed, it is required that the caller indicate if they want
      the page added to the active or inactive list by setting or clearing
      PageActive respectively.
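
      For example, a caller that previously asked for the active file list now
      does roughly:

              /* old: lru_cache_add_lru(page, LRU_ACTIVE_FILE); */
              SetPageActive(page);
              lru_cache_add(page);

              /* inactive list: simply leave PageActive clear */
              lru_cache_add(page);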
      
      [akpm@linux-foundation.org: Suggested the patch]
      [gang.chen@asianux.com: fix used-uninitialized warning]
      Signed-off-by: NMel Gorman <mgorman@suse.de>
      Signed-off-by: NChen Gang <gang.chen@asianux.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Rik van Riel <riel@redhat.com>
      Acked-by: NJohannes Weiner <hannes@cmpxchg.org>
      Cc: Alexey Lyahkov <alexey.lyashkov@gmail.com>
      Cc: Andrew Perepechko <anserper@ya.ru>
      Cc: Robin Dong <sanbai@taobao.com>
      Cc: Theodore Tso <tytso@mit.edu>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Bernd Schubert <bernd.schubert@fastmail.fm>
      Cc: David Howells <dhowells@redhat.com>
      Cc: Trond Myklebust <Trond.Myklebust@netapp.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Rik van Riel <riel@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      c53954a0
    • M
      mm: vmscan: take page buffers dirty and locked state into account · b4597226
      Mel Gorman committed
      Page reclaim keeps track of dirty and under writeback pages and uses it
      to determine if wait_iff_congested() should stall or if kswapd should
      begin writing back pages.  This fails to account for buffer pages that
      can be under writeback but not PageWriteback which is the case for
      filesystems like ext3 ordered mode.  Furthermore, PageDirty buffer pages
      can have all the buffers clean and writepage does no IO so it should not
      be accounted as congested.
      
      This patch adds an address_space operation that filesystems may
      optionally use to check if a page is really dirty or really under
      writeback.  An implementation is provided for buffer_heads and is used
      for block operations and ext3 in ordered mode.  By default the
      page flags are obeyed.
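
      The hook is a per-mapping callback of roughly this shape, together with a
      helper for buffer_head based filesystems (a simplified sketch of the
      implementation described above):

              /* in struct address_space_operations */
              void (*is_dirty_writeback)(struct page *page,
                                         bool *dirty, bool *writeback);

              void buffer_check_dirty_writeback(struct page *page,
                                                bool *dirty, bool *writeback)
              {
                      struct buffer_head *head, *bh;

                      *dirty = false;
                      *writeback = PageWriteback(page);

                      if (!page_has_buffers(page))
                              return;

                      bh = head = page_buffers(page);
                      do {
                              if (buffer_locked(bh))
                                      *writeback = true;      /* buffer IO in flight */
                              if (buffer_dirty(bh))
                                      *dirty = true;          /* real dirty data */
                              bh = bh->b_this_page;
                      } while (bh != head);
              }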
      
      Credit goes to Jan Kara for identifying that the page flags alone are
      not sufficient for ext3 and sanity checking a number of ideas on how the
      problem could be addressed.
      Signed-off-by: NMel Gorman <mgorman@suse.de>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Jiri Slaby <jslaby@suse.cz>
      Cc: Valdis Kletnieks <Valdis.Kletnieks@vt.edu>
      Cc: Zlatko Calusic <zcalusic@bitsync.net>
      Cc: dormando <dormando@rydia.net>
      Cc: Trond Myklebust <trond.myklebust@fys.uio.no>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      b4597226
    • M
      mm: vmscan: treat pages marked for immediate reclaim as zone congestion · d04e8acd
      Mel Gorman committed
      Currently a zone will only be marked congested if the underlying BDI is
      congested but if dirty pages are spread across zones it is possible that
      an individual zone is full of dirty pages without being congested.  The
      impact is that the zone gets scanned very quickly, potentially reclaiming
      really clean pages.  This patch treats pages marked for immediate
      reclaim as congested for the purposes of marking a zone ZONE_CONGESTED
      and stalling in wait_iff_congested.
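
      Roughly, the congestion accounting in shrink_page_list() gains a second
      condition (a sketch; the surrounding bookkeeping is omitted):

              /* a page cycling back around before its IO completed is treated
               * the same as a page backed by a congested BDI */
              if ((mapping && bdi_write_congested(mapping->backing_dev_info)) ||
                  (PageWriteback(page) && PageReclaim(page)))
                      nr_congested++;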
      Signed-off-by: NMel Gorman <mgorman@suse.de>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Jiri Slaby <jslaby@suse.cz>
      Cc: Valdis Kletnieks <Valdis.Kletnieks@vt.edu>
      Cc: Zlatko Calusic <zcalusic@bitsync.net>
      Cc: dormando <dormando@rydia.net>
      Cc: Trond Myklebust <trond.myklebust@fys.uio.no>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      d04e8acd
    • M
      mm: vmscan: move direct reclaim wait_iff_congested into shrink_list · 8e950282
      Mel Gorman committed
      shrink_inactive_list makes decisions on whether to stall based on the
      number of dirty pages encountered.  The wait_iff_congested() call in
      shrink_page_list does no such thing and it's arbitrary.
      
      This patch moves the decision on whether to set ZONE_CONGESTED and the
      wait_iff_congested call into shrink_page_list.  This keeps all the
      decisions on whether to stall or not in the one place.
      Signed-off-by: NMel Gorman <mgorman@suse.de>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Jiri Slaby <jslaby@suse.cz>
      Cc: Valdis Kletnieks <Valdis.Kletnieks@vt.edu>
      Cc: Zlatko Calusic <zcalusic@bitsync.net>
      Cc: dormando <dormando@rydia.net>
      Cc: Trond Myklebust <trond.myklebust@fys.uio.no>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      8e950282
    • M
      mm: vmscan: set zone flags before blocking · f7ab8db7
      Mel Gorman committed
      In shrink_page_list a decision may be made to stall and flag a zone as
      ZONE_WRITEBACK so that if a large number of unqueued dirty pages are
      encountered later then the reclaimer will stall.  Set ZONE_WRITEBACK
      before potentially going to sleep so it is noticed sooner.
      Signed-off-by: NMel Gorman <mgorman@suse.de>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Jiri Slaby <jslaby@suse.cz>
      Cc: Valdis Kletnieks <Valdis.Kletnieks@vt.edu>
      Cc: Zlatko Calusic <zcalusic@bitsync.net>
      Cc: dormando <dormando@rydia.net>
      Cc: Trond Myklebust <trond.myklebust@fys.uio.no>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      f7ab8db7
    • M
      mm: vmscan: stall page reclaim after a list of pages have been processed · b1a6f21e
      Mel Gorman committed
      Commit "mm: vmscan: Block kswapd if it is encountering pages under
      writeback" blocks page reclaim if it encounters pages under writeback
      marked for immediate reclaim.  It blocks while pages are still isolated
      from the LRU which is unnecessary.  This patch defers the blocking until
      after the isolated pages have been processed and tidies up some of the
      comments.
      Signed-off-by: NMel Gorman <mgorman@suse.de>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Jiri Slaby <jslaby@suse.cz>
      Cc: Valdis Kletnieks <Valdis.Kletnieks@vt.edu>
      Cc: Zlatko Calusic <zcalusic@bitsync.net>
      Cc: dormando <dormando@rydia.net>
      Cc: Trond Myklebust <trond.myklebust@fys.uio.no>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      b1a6f21e
    • M
      mm: vmscan: stall page reclaim and writeback pages based on dirty/writepage pages encountered · e2be15f6
      Mel Gorman committed
      Further testing of the "Reduce system disruption due to kswapd"
      discovered a few problems.  First and foremost, it's possible for pages
      under writeback to be freed which will lead to badness.  Second, as
      pages were not being swapped the file LRU was being scanned faster and
      clean file pages were being reclaimed.  In some cases this results in
      increased read IO to re-read data from disk.  Third, more pages were
      being written from kswapd context which can adversely affect IO
      performance.  Lastly, it was observed that PageDirty pages are not
      necessarily dirty on all filesystems (buffers can be clean while
      PageDirty is set and ->writepage generates no IO) and not all
      filesystems set PageWriteback when the page is being written (e.g.
      ext3).  This disconnect confuses the reclaim stalling logic.  This
      follow-up series is aimed at these problems.
      
      The tests were based on three kernels
      
      vanilla:	kernel 3.9 as that is what the current mmotm uses as a baseline
      mmotm-20130522	is mmotm as of 22nd May with "Reduce system disruption due to
      		kswapd" applied on top as per what should be in Andrew's tree
      		right now
      lessdisrupt-v7r10 is this follow-up series on top of the mmotm kernel
      
      The first test used memcached+memcachetest while some background IO was
      in progress as implemented by the parallel IO tests implement in MM
      Tests.  memcachetest benchmarks how many operations/second memcached can
      service.  It starts with no background IO on a freshly created ext4
      filesystem and then re-runs the test with larger amounts of IO in the
      background to roughly simulate a large copy in progress.  The
      expectation is that the IO should have little or no impact on
      memcachetest which is running entirely in memory.
      
      parallelio
                                                   3.9.0                       3.9.0                       3.9.0
                                                 vanilla          mm1-mmotm-20130522       mm1-lessdisrupt-v7r10
      Ops memcachetest-0M             23117.00 (  0.00%)          22780.00 ( -1.46%)          22763.00 ( -1.53%)
      Ops memcachetest-715M           23774.00 (  0.00%)          23299.00 ( -2.00%)          22934.00 ( -3.53%)
      Ops memcachetest-2385M           4208.00 (  0.00%)          24154.00 (474.00%)          23765.00 (464.76%)
      Ops memcachetest-4055M           4104.00 (  0.00%)          25130.00 (512.33%)          24614.00 (499.76%)
      Ops io-duration-0M                  0.00 (  0.00%)              0.00 (  0.00%)              0.00 (  0.00%)
      Ops io-duration-715M               12.00 (  0.00%)              7.00 ( 41.67%)              6.00 ( 50.00%)
      Ops io-duration-2385M             116.00 (  0.00%)             21.00 ( 81.90%)             21.00 ( 81.90%)
      Ops io-duration-4055M             160.00 (  0.00%)             36.00 ( 77.50%)             35.00 ( 78.12%)
      Ops swaptotal-0M                    0.00 (  0.00%)              0.00 (  0.00%)              0.00 (  0.00%)
      Ops swaptotal-715M             140138.00 (  0.00%)             18.00 ( 99.99%)             18.00 ( 99.99%)
      Ops swaptotal-2385M            385682.00 (  0.00%)              0.00 (  0.00%)              0.00 (  0.00%)
      Ops swaptotal-4055M            418029.00 (  0.00%)              0.00 (  0.00%)              0.00 (  0.00%)
      Ops swapin-0M                       0.00 (  0.00%)              0.00 (  0.00%)              0.00 (  0.00%)
      Ops swapin-715M                   144.00 (  0.00%)              0.00 (  0.00%)              0.00 (  0.00%)
      Ops swapin-2385M               134227.00 (  0.00%)              0.00 (  0.00%)              0.00 (  0.00%)
      Ops swapin-4055M               125618.00 (  0.00%)              0.00 (  0.00%)              0.00 (  0.00%)
      Ops minorfaults-0M            1536429.00 (  0.00%)        1531632.00 (  0.31%)        1533541.00 (  0.19%)
      Ops minorfaults-715M          1786996.00 (  0.00%)        1612148.00 (  9.78%)        1608832.00 (  9.97%)
      Ops minorfaults-2385M         1757952.00 (  0.00%)        1614874.00 (  8.14%)        1613541.00 (  8.21%)
      Ops minorfaults-4055M         1774460.00 (  0.00%)        1633400.00 (  7.95%)        1630881.00 (  8.09%)
      Ops majorfaults-0M                  1.00 (  0.00%)              0.00 (  0.00%)              0.00 (  0.00%)
      Ops majorfaults-715M              184.00 (  0.00%)            167.00 (  9.24%)            166.00 (  9.78%)
      Ops majorfaults-2385M           24444.00 (  0.00%)            155.00 ( 99.37%)             93.00 ( 99.62%)
      Ops majorfaults-4055M           21357.00 (  0.00%)            147.00 ( 99.31%)            134.00 ( 99.37%)
      
      memcachetest is the transactions/second reported by memcachetest. In
              the vanilla kernel note that performance drops from around
              23K/sec to just over 4K/second when there is 2385M of IO going
              on in the background. With current mmotm, there is no collapse
      	in performance and with this follow-up series there is little
      	change.
      
      swaptotal is the total amount of swap traffic. With mmotm and the follow-up
      	series, the total amount of swapping is much reduced.
      
                                       3.9.0       3.9.0       3.9.0
                               vanilla   mm1-mmotm-20130522   mm1-lessdisrupt-v7r10
      Minor Faults                  11160152    10706748    10622316
      Major Faults                     46305         755         678
      Swap Ins                        260249           0           0
      Swap Outs                       683860          18          18
      Direct pages scanned                 0         678        2520
      Kswapd pages scanned           6046108     8814900     1639279
      Kswapd pages reclaimed         1081954     1172267     1094635
      Direct pages reclaimed               0         566        2304
      Kswapd efficiency                  17%         13%         66%
      Kswapd velocity               5217.560    7618.953    1414.879
      Direct efficiency                 100%         83%         91%
      Direct velocity                  0.000       0.586       2.175
      Percentage direct scans             0%          0%          0%
      Zone normal velocity          5105.086    6824.681     671.158
      Zone dma32 velocity            112.473     794.858     745.896
      Zone dma velocity                0.000       0.000       0.000
      Page writes by reclaim     1929612.000 6861768.000   32821.000
      Page writes file               1245752     6861750       32803
      Page writes anon                683860          18          18
      Page reclaim immediate            7484          40         239
      Sector Reads                   1130320       93996       86900
      Sector Writes                 13508052    10823500    11804436
      Page rescued immediate               0           0           0
      Slabs scanned                    33536       27136       18560
      Direct inode steals                  0           0           0
      Kswapd inode steals               8641        1035           0
      Kswapd skipped wait                  0           0           0
      THP fault alloc                      8          37          33
      THP collapse alloc                 508         552         515
      THP splits                          24           1           1
      THP fault fallback                   0           0           0
      THP collapse fail                    0           0           0
      
      There are a number of observations to make here
      
      1. Swap outs are almost eliminated. Swap ins are 0 indicating that the
         pages swapped were really unused anonymous pages. Related to that,
         major faults are much reduced.
      
      2. kswapd efficiency was impacted by the initial series but with these
         follow-up patches, the efficiency is now at 66% indicating that far
         fewer pages were skipped during scanning due to dirty or writeback
         pages.
      
      3. kswapd velocity is reduced indicating that fewer pages are being scanned
         with the follow-up series as kswapd now stalls when the tail of the
         LRU queue is full of unqueued dirty pages. The stall gives flushers a
         chance to catch-up so kswapd can reclaim clean pages when it wakes
      
      4. In light of Zlatko's recent reports about zone scanning imbalances,
         mmtests now reports scanning velocity on a per-zone basis. With mainline,
         you can see that the scanning activity is dominated by the Normal
         zone with over 45 times more scanning in Normal than the DMA32 zone.
         With the series currently in mmotm, the ratio is slightly better but it
         is still the case that the bulk of scanning is in the highest zone. With
         this follow-up series, the ratio of scanning between the Normal and
         DMA32 zone is roughly equal.
      
      5. As Dave Chinner observed, the current patches in mmotm increased the
         number of pages written from kswapd context which is expected to adversely
         impact IO performance. With the follow-up patches, far fewer pages are
         written from kswapd context than the mainline kernel
      
      6. With the series in mmotm, fewer inodes were reclaimed by kswapd. With
         the follow-up series, there is less slab shrinking activity and no inodes
         were reclaimed.
      
      7. Note that "Sectors Read" is drastically reduced implying that the source
         data being used for the IO is not being aggressively discarded due to
         page reclaim skipping over dirty pages and reclaiming clean pages. Note
         that the reduction in reads could also be due to inode data not being
         re-read from disk after a slab shrink.
      
                             3.9.0       3.9.0       3.9.0
                     vanilla   mm1-mmotm-20130522   mm1-lessdisrupt-v7r10
      Mean sda-avgqz        166.99       32.09       33.44
      Mean sda-await        853.64      192.76      185.43
      Mean sda-r_await        6.31        9.24        5.97
      Mean sda-w_await     2992.81      202.65      192.43
      Max  sda-avgqz       1409.91      718.75      698.98
      Max  sda-await       6665.74     3538.00     3124.23
      Max  sda-r_await       58.96      111.95       58.00
      Max  sda-w_await    28458.94     3977.29     3148.61
      
      In light of the changes in writes from reclaim context, the number of
      reads and Dave Chinner's concerns about IO performance I took a closer
      look at the IO stats for the test disk. Few observations
      
      1. The average queue size is reduced by the initial series and roughly
         the same with this follow up.
      
      2. Average wait times for writes are reduced and as the IO
         is completing faster it at least implies that the gain is because
         flushers are writing the files efficiently instead of page reclaim
         getting in the way.
      
      3. The reduction in maximum write latency is staggering. 28 seconds down
         to 3 seconds.
      
      Jan Kara asked how NFS is affected by all of this. Unstable pages can
      be taken into account as one of the patches in the series shows but it
      is still the case that filesystems with unusual handling of dirty or
      writeback pages could still be treated better.
      
      Tests like postmark, fsmark and largedd showed up nothing useful. On my test
      setup, pages are simply not being written back from reclaim context with or
      without the patches and there are no changes in performance. My test setup
      probably is just not strong enough network-wise to be really interesting.
      
      I ran a longer-lived memcached test with IO going to NFS instead of a local disk
      
      parallelio
                                                   3.9.0                       3.9.0                       3.9.0
                                                 vanilla          mm1-mmotm-20130522       mm1-lessdisrupt-v7r10
      Ops memcachetest-0M             23323.00 (  0.00%)          23241.00 ( -0.35%)          23321.00 ( -0.01%)
      Ops memcachetest-715M           25526.00 (  0.00%)          24763.00 ( -2.99%)          23242.00 ( -8.95%)
      Ops memcachetest-2385M           8814.00 (  0.00%)          26924.00 (205.47%)          23521.00 (166.86%)
      Ops memcachetest-4055M           5835.00 (  0.00%)          26827.00 (359.76%)          25560.00 (338.05%)
      Ops io-duration-0M                  0.00 (  0.00%)              0.00 (  0.00%)              0.00 (  0.00%)
      Ops io-duration-715M               65.00 (  0.00%)             71.00 ( -9.23%)             11.00 ( 83.08%)
      Ops io-duration-2385M             129.00 (  0.00%)             94.00 ( 27.13%)             53.00 ( 58.91%)
      Ops io-duration-4055M             301.00 (  0.00%)            100.00 ( 66.78%)            108.00 ( 64.12%)
      Ops swaptotal-0M                    0.00 (  0.00%)              0.00 (  0.00%)              0.00 (  0.00%)
      Ops swaptotal-715M              14394.00 (  0.00%)            949.00 ( 93.41%)             63.00 ( 99.56%)
      Ops swaptotal-2385M            401483.00 (  0.00%)          24437.00 ( 93.91%)          30118.00 ( 92.50%)
      Ops swaptotal-4055M            554123.00 (  0.00%)          35688.00 ( 93.56%)          63082.00 ( 88.62%)
      Ops swapin-0M                       0.00 (  0.00%)              0.00 (  0.00%)              0.00 (  0.00%)
      Ops swapin-715M                  4522.00 (  0.00%)            560.00 ( 87.62%)             63.00 ( 98.61%)
      Ops swapin-2385M               169861.00 (  0.00%)           5026.00 ( 97.04%)          13917.00 ( 91.81%)
      Ops swapin-4055M               192374.00 (  0.00%)          10056.00 ( 94.77%)          25729.00 ( 86.63%)
      Ops minorfaults-0M            1445969.00 (  0.00%)        1520878.00 ( -5.18%)        1454024.00 ( -0.56%)
      Ops minorfaults-715M          1557288.00 (  0.00%)        1528482.00 (  1.85%)        1535776.00 (  1.38%)
      Ops minorfaults-2385M         1692896.00 (  0.00%)        1570523.00 (  7.23%)        1559622.00 (  7.87%)
      Ops minorfaults-4055M         1654985.00 (  0.00%)        1581456.00 (  4.44%)        1596713.00 (  3.52%)
      Ops majorfaults-0M                  0.00 (  0.00%)              1.00 (-99.00%)              0.00 (  0.00%)
      Ops majorfaults-715M              763.00 (  0.00%)            265.00 ( 65.27%)             75.00 ( 90.17%)
      Ops majorfaults-2385M           23861.00 (  0.00%)            894.00 ( 96.25%)           2189.00 ( 90.83%)
      Ops majorfaults-4055M           27210.00 (  0.00%)           1569.00 ( 94.23%)           4088.00 ( 84.98%)
      
      1. Performance does not collapse due to IO which is good. IO is also completing
         faster. Note with mmotm, IO completes in a third of the time and faster again
         with this series applied
      
      2. Swapping is reduced, although not eliminated. The figures for the follow-up
         look bad but it does vary a bit as the stalling is not perfect for nfs
         or filesystems like ext3 with unusual handling of dirty and writeback
         pages
      
      3. There are swapins, particularly with larger amounts of IO indicating
         that active pages are being reclaimed.  However, the number is much
         reduced.
      
                                       3.9.0       3.9.0       3.9.0
                               vanilla   mm1-mmotm-20130522   mm1-lessdisrupt-v7r10
      Minor Faults                  36339175    35025445    35219699
      Major Faults                    310964       27108       51887
      Swap Ins                       2176399      173069      333316
      Swap Outs                      3344050      357228      504824
      Direct pages scanned              8972       77283       43242
      Kswapd pages scanned          20899983     8939566    14772851
      Kswapd pages reclaimed         6193156     5172605     5231026
      Direct pages reclaimed            8450       73802       39514
      Kswapd efficiency                  29%         57%         35%
      Kswapd velocity               3929.743    1847.499    3058.840
      Direct efficiency                  94%         95%         91%
      Direct velocity                  1.687      15.972       8.954
      Percentage direct scans             0%          0%          0%
      Zone normal velocity          3721.907     939.103    2185.142
      Zone dma32 velocity            209.522     924.368     882.651
      Zone dma velocity                0.000       0.000       0.000
      Page writes by reclaim     4082185.000  526319.000  537114.000
      Page writes file                738135      169091       32290
      Page writes anon               3344050      357228      504824
      Page reclaim immediate            9524         170     5595843
      Sector Reads                   8909900      861192     1483680
      Sector Writes                 13428980     1488744     2076800
      Page rescued immediate               0           0           0
      Slabs scanned                    38016       31744       28672
      Direct inode steals                  0           0           0
      Kswapd inode steals                424           0           0
      Kswapd skipped wait                  0           0           0
      THP fault alloc                     14          15         119
      THP collapse alloc                1767        1569        1618
      THP splits                          30          29          25
      THP fault fallback                   0           0           0
      THP collapse fail                    8           5           0
      Compaction stalls                   17          41         100
      Compaction success                   7          31          95
      Compaction failures                 10          10           5
      Page migrate success              7083       22157       62217
      Page migrate failure                 0           0           0
      Compaction pages isolated        14847       48758      135830
      Compaction migrate scanned       18328       48398      138929
      Compaction free scanned        2000255      355827     1720269
      Compaction cost                      7          24          68
      
      I guess the main takeaway again is the much reduced page writes
      from reclaim context and reduced reads.
      
                             3.9.0       3.9.0       3.9.0
                     vanilla   mm1-mmotm-20130522   mm1-lessdisrupt-v7r10
      Mean sda-avgqz         23.58        0.35        0.44
      Mean sda-await        133.47       15.72       15.46
      Mean sda-r_await        4.72        4.69        3.95
      Mean sda-w_await      507.69       28.40       33.68
      Max  sda-avgqz        680.60       12.25       23.14
      Max  sda-await       3958.89      221.83      286.22
      Max  sda-r_await       63.86       61.23       67.29
      Max  sda-w_await    11710.38      883.57     1767.28
      
      And as before, write wait times are much reduced.
      
      This patch:
      
      The patch "mm: vmscan: Have kswapd writeback pages based on dirty pages
      encountered, not priority" decides whether to writeback pages from reclaim
      context based on the number of dirty pages encountered.  This situation is
      flagged too easily and flushers are not given the chance to catch up
      resulting in more pages being written from reclaim context and potentially
      impacting IO performance.  The check for PageWriteback is also misplaced
      as it happens within a PageDirty check, which is nonsense as the dirty flag
      may have been cleared for IO.  The accounting is updated very late and pages
      that are already under writeback, were reactivated, could not be unmapped
      or could not be released are all missed.  Similarly, a page is considered
      congested for reasons other than the underlying BDI being congested, and
      pages that cannot be
      written out in the correct context are skipped.  Finally, it considers
      stalling and writing back filesystem pages due to encountering dirty
      anonymous pages at the tail of the LRU which is dumb.
      
      This patch causes kswapd to begin writing filesystem pages from reclaim
      context only if page reclaim found that all filesystem pages at the tail
      of the LRU were unqueued dirty pages.  Before it starts writing filesystem
      pages, it will stall to give flushers a chance to catch up.  The decision
      on whether to call wait_iff_congested() is also now determined by dirty
      filesystem pages only.  Congested pages are based on whether the underlying
      BDI is
      congested regardless of the context of the reclaiming process.
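
      Condensed, the kswapd side of shrink_inactive_list() now behaves along
      these lines (counter and flag names follow this series; details such as
      the global_reclaim() checks are omitted):

              if (current_is_kswapd() && nr_unqueued_dirty == nr_taken) {
                      /* every file page at the tail was dirty and unqueued:
                       * stall so the flusher threads can catch up ... */
                      congestion_wait(BLK_RW_ASYNC, HZ/10);

                      /* ... and only then allow kswapd to call ->writepage */
                      zone_set_flag(zone, ZONE_TAIL_LRU_DIRTY);
              }

              /* the zone is congested only if all dirty pages hit a congested BDI */
              if (nr_dirty && nr_dirty == nr_congested)
                      zone_set_flag(zone, ZONE_CONGESTED);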
      Signed-off-by: NMel Gorman <mgorman@suse.de>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Jiri Slaby <jslaby@suse.cz>
      Cc: Valdis Kletnieks <Valdis.Kletnieks@vt.edu>
      Cc: Zlatko Calusic <zcalusic@bitsync.net>
      Cc: dormando <dormando@rydia.net>
      Cc: Trond Myklebust <trond.myklebust@fys.uio.no>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      e2be15f6
    • M
      mm: vmscan: move logic from balance_pgdat() to kswapd_shrink_zone() · 7c954f6d
      Mel Gorman committed
      balance_pgdat() is very long and some of the logic can and should be
      internal to kswapd_shrink_zone().  Move it so the flow of
      balance_pgdat() is marginally easier to follow.
      Signed-off-by: NMel Gorman <mgorman@suse.de>
      Acked-by: NJohannes Weiner <hannes@cmpxchg.org>
      Reviewed-by: NMichal Hocko <mhocko@suse.cz>
      Acked-by: NRik van Riel <riel@redhat.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Jiri Slaby <jslaby@suse.cz>
      Cc: Valdis Kletnieks <Valdis.Kletnieks@vt.edu>
      Tested-by: NZlatko Calusic <zcalusic@bitsync.net>
      Cc: dormando <dormando@rydia.net>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      7c954f6d
    • M
      mm: vmscan: check if kswapd should writepage once per pgdat scan · b7ea3c41
      Mel Gorman committed
      Currently kswapd checks if it should start writepage as it shrinks each
      zone without taking into consideration if the zone is balanced or not.
      This is not wrong as such but it does not make much sense either.  This
      patch checks once per pgdat scan if kswapd should be writing pages.
      Signed-off-by: NMel Gorman <mgorman@suse.de>
      Reviewed-by: NMichal Hocko <mhocko@suse.cz>
      Acked-by: NRik van Riel <riel@redhat.com>
      Acked-by: NJohannes Weiner <hannes@cmpxchg.org>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Jiri Slaby <jslaby@suse.cz>
      Cc: Valdis Kletnieks <Valdis.Kletnieks@vt.edu>
      Tested-by: NZlatko Calusic <zcalusic@bitsync.net>
      Cc: dormando <dormando@rydia.net>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      b7ea3c41
    • M
      mm: vmscan: block kswapd if it is encountering pages under writeback · 283aba9f
      Mel Gorman committed
      Historically, kswapd used to congestion_wait() at higher priorities if
      it was not making forward progress.  This made no sense as the failure
      to make progress could be completely independent of IO.  It was later
      replaced by wait_iff_congested() and removed entirely by commit 258401a6
      (mm: don't wait on congested zones in balance_pgdat()) as it was
      duplicating logic in shrink_inactive_list().
      
      This is problematic.  If kswapd encounters many pages under writeback
      and it continues to scan until it reaches the high watermark then it
      will quickly skip over the pages under writeback and reclaim clean young
      pages or push applications out to swap.
      
      The use of wait_iff_congested() is not suited to kswapd as it will only
      stall if the underlying BDI is really congested or a direct reclaimer
      was unable to write to the underlying BDI.  kswapd bypasses the BDI
      congestion as it sets PF_SWAPWRITE but even if this was taken into
      account then it would cause direct reclaimers to stall on writeback
      which is not desirable.
      
      This patch sets a ZONE_WRITEBACK flag if direct reclaim or kswapd is
      encountering too many pages under writeback.  If this flag is set and
      kswapd encounters a PageReclaim page under writeback then it'll assume
      that the LRU lists are being recycled too quickly before IO can complete
      and block waiting for some IO to complete.
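
      Stripped down, the check added to shrink_page_list() for kswapd looks
      something like this (helper names per the series):

              if (PageWriteback(page)) {
                      if (current_is_kswapd() &&
                          PageReclaim(page) &&
                          zone_is_reclaim_writeback(zone)) {
                              /*
                               * The LRU is cycling faster than IO completes:
                               * block until this page's writeback finishes
                               * instead of skipping ever more pages.
                               */
                              wait_on_page_writeback(page);
                      } else {
                              /* as before: tag the page and move on */
                              SetPageReclaim(page);
                              nr_writeback++;
                              goto keep_locked;
                      }
              }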
      Signed-off-by: NMel Gorman <mgorman@suse.de>
      Reviewed-by: NMichal Hocko <mhocko@suse.cz>
      Acked-by: NRik van Riel <riel@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Jiri Slaby <jslaby@suse.cz>
      Cc: Valdis Kletnieks <Valdis.Kletnieks@vt.edu>
      Tested-by: NZlatko Calusic <zcalusic@bitsync.net>
      Cc: dormando <dormando@rydia.net>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      283aba9f
    • M
      mm: vmscan: have kswapd writeback pages based on dirty pages encountered, not priority · d43006d5
      Mel Gorman committed
      Currently kswapd queues dirty pages for writeback if scanning at an
      elevated priority but the priority kswapd scans at is not related to the
      number of unqueued dirty pages encountered.  Since commit "mm: vmscan: Flatten
      kswapd priority loop", the priority is related to the size of the LRU
      and the zone watermark which is no indication as to whether kswapd
      should write pages or not.
      
      This patch tracks if an excessive number of unqueued dirty pages are
      being encountered at the end of the LRU.  If so, it indicates that dirty
      pages are being recycled before flusher threads can clean them and flags
      the zone so that kswapd will start writing pages until the zone is
      balanced.
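
      In code terms the idea reduces to a check at the end of
      shrink_inactive_list() plus a gate in shrink_page_list(), roughly (flag
      and helper names per this series):

              /* flushers are falling behind: let kswapd write pages itself */
              if (nr_unqueued_dirty == nr_taken)
                      zone_set_flag(zone, ZONE_TAIL_LRU_DIRTY);

              /* in shrink_page_list(): only kswapd, and only for such zones,
               * may write back a dirty file page from reclaim context */
              if (page_is_file_cache(page) &&
                  (!current_is_kswapd() || !zone_is_reclaim_dirty(zone)))
                      goto keep_locked;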
      Signed-off-by: NMel Gorman <mgorman@suse.de>
      Acked-by: NJohannes Weiner <hannes@cmpxchg.org>
      Reviewed-by: NMichal Hocko <mhocko@suse.cz>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Jiri Slaby <jslaby@suse.cz>
      Cc: Valdis Kletnieks <Valdis.Kletnieks@vt.edu>
      Tested-by: NZlatko Calusic <zcalusic@bitsync.net>
      Cc: dormando <dormando@rydia.net>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      d43006d5