1. 26 8月, 2011 2 次提交
    • S
      vmscan: clear ZONE_CONGESTED for zone with good watermark · 439423f6
      Shaohua Li 提交于
      ZONE_CONGESTED is only cleared in kswapd, but pages can be freed in any
      task.  It's possible ZONE_CONGESTED isn't cleared in some cases:
      
       1. the zone is already balanced just entering balance_pgdat() for
          order-0 because concurrent tasks free memory.  In this case, later
          check will skip the zone as it's balanced so the flag isn't cleared.
      
       2. high order balance fallbacks to order-0.  quote from Mel: At the
          end of balance_pgdat(), kswapd uses the following logic;
      
      	If reclaiming at high order {
      		for each zone {
      			if all_unreclaimable
      				skip
      			if watermark is not met
      				order = 0
      				loop again
      
      			/* watermark is met */
      			clear congested
      		}
      	}
      
          i.e. it clears ZONE_CONGESTED if it the zone is balanced.  if not,
          it restarts balancing at order-0.  However, if the higher zones are
          balanced for order-0, kswapd will miss clearing ZONE_CONGESTED as
          that only happens after a zone is shrunk.  This can mean that
          wait_iff_congested() stalls unnecessarily.
      
      This patch makes kswapd clear ZONE_CONGESTED during its initial
      highmem->dma scan for zones that are already balanced.
      Signed-off-by: NShaohua Li <shaohua.li@intel.com>
      Acked-by: NMel Gorman <mgorman@suse.de>
      Reviewed-by: NMinchan Kim <minchan.kim@gmail.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      439423f6
    • S
      mm: fix a vmscan warning · f51bdd2e
      Shaohua Li 提交于
      I get the below warning:
      
        BUG: using smp_processor_id() in preemptible [00000000] code: bash/746
        caller is native_sched_clock+0x37/0x6e
        Pid: 746, comm: bash Tainted: G        W   3.0.0+ #254
        Call Trace:
         [<ffffffff813435c6>] debug_smp_processor_id+0xc2/0xdc
         [<ffffffff8104158d>] native_sched_clock+0x37/0x6e
         [<ffffffff81116219>] try_to_free_mem_cgroup_pages+0x7d/0x270
         [<ffffffff8114f1f8>] mem_cgroup_force_empty+0x24b/0x27a
         [<ffffffff8114ff21>] ? sys_close+0x38/0x138
         [<ffffffff8114ff21>] ? sys_close+0x38/0x138
         [<ffffffff8114f257>] mem_cgroup_force_empty_write+0x17/0x19
         [<ffffffff810c72fb>] cgroup_file_write+0xa8/0xba
         [<ffffffff811522d2>] vfs_write+0xb3/0x138
         [<ffffffff8115241a>] sys_write+0x4a/0x71
         [<ffffffff8114ffd9>] ? sys_close+0xf0/0x138
         [<ffffffff8176deab>] system_call_fastpath+0x16/0x1b
      
      sched_clock() can't be used with preempt enabled.  And we don't need
      fast approach to get clock here, so let's use ktime API.
      Signed-off-by: NShaohua Li <shaohua.li@intel.com>
      Acked-by: NKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Tested-by: NKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      f51bdd2e
  2. 27 7月, 2011 4 次提交
    • K
      memcg: add memory.vmscan_stat · 82f9d486
      KAMEZAWA Hiroyuki 提交于
      The commit log of 0ae5e89c ("memcg: count the soft_limit reclaim
      in...") says it adds scanning stats to memory.stat file.  But it doesn't
      because we considered we needed to make a concensus for such new APIs.
      
      This patch is a trial to add memory.scan_stat. This shows
        - the number of scanned pages(total, anon, file)
        - the number of rotated pages(total, anon, file)
        - the number of freed pages(total, anon, file)
        - the number of elaplsed time (including sleep/pause time)
      
        for both of direct/soft reclaim.
      
      The biggest difference with oringinal Ying's one is that this file
      can be reset by some write, as
      
        # echo 0 ...../memory.scan_stat
      
      Example of output is here. This is a result after make -j 6 kernel
      under 300M limit.
      
        [kamezawa@bluextal ~]$ cat /cgroup/memory/A/memory.scan_stat
        [kamezawa@bluextal ~]$ cat /cgroup/memory/A/memory.vmscan_stat
        scanned_pages_by_limit 9471864
        scanned_anon_pages_by_limit 6640629
        scanned_file_pages_by_limit 2831235
        rotated_pages_by_limit 4243974
        rotated_anon_pages_by_limit 3971968
        rotated_file_pages_by_limit 272006
        freed_pages_by_limit 2318492
        freed_anon_pages_by_limit 962052
        freed_file_pages_by_limit 1356440
        elapsed_ns_by_limit 351386416101
        scanned_pages_by_system 0
        scanned_anon_pages_by_system 0
        scanned_file_pages_by_system 0
        rotated_pages_by_system 0
        rotated_anon_pages_by_system 0
        rotated_file_pages_by_system 0
        freed_pages_by_system 0
        freed_anon_pages_by_system 0
        freed_file_pages_by_system 0
        elapsed_ns_by_system 0
        scanned_pages_by_limit_under_hierarchy 9471864
        scanned_anon_pages_by_limit_under_hierarchy 6640629
        scanned_file_pages_by_limit_under_hierarchy 2831235
        rotated_pages_by_limit_under_hierarchy 4243974
        rotated_anon_pages_by_limit_under_hierarchy 3971968
        rotated_file_pages_by_limit_under_hierarchy 272006
        freed_pages_by_limit_under_hierarchy 2318492
        freed_anon_pages_by_limit_under_hierarchy 962052
        freed_file_pages_by_limit_under_hierarchy 1356440
        elapsed_ns_by_limit_under_hierarchy 351386416101
        scanned_pages_by_system_under_hierarchy 0
        scanned_anon_pages_by_system_under_hierarchy 0
        scanned_file_pages_by_system_under_hierarchy 0
        rotated_pages_by_system_under_hierarchy 0
        rotated_anon_pages_by_system_under_hierarchy 0
        rotated_file_pages_by_system_under_hierarchy 0
        freed_pages_by_system_under_hierarchy 0
        freed_anon_pages_by_system_under_hierarchy 0
        freed_file_pages_by_system_under_hierarchy 0
        elapsed_ns_by_system_under_hierarchy 0
      
      total_xxxx is for hierarchy management.
      
      This will be useful for further memcg developments and need to be
      developped before we do some complicated rework on LRU/softlimit
      management.
      
      This patch adds a new struct memcg_scanrecord into scan_control struct.
      sc->nr_scanned at el is not designed for exporting information.  For
      example, nr_scanned is reset frequentrly and incremented +2 at scanning
      mapped pages.
      
      To avoid complexity, I added a new param in scan_control which is for
      exporting scanning score.
      Signed-off-by: NKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Ying Han <yinghan@google.com>
      Cc: Andrew Bresticker <abrestic@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      82f9d486
    • K
      memcg: fix vmscan count in small memcgs · 4508378b
      KAMEZAWA Hiroyuki 提交于
      Commit 246e87a9 ("memcg: fix get_scan_count() for small targets")
      fixes the memcg/kswapd behavior against small targets and prevent vmscan
      priority too high.
      
      But the implementation is too naive and adds another problem to small
      memcg.  It always force scan to 32 pages of file/anon and doesn't handle
      swappiness and other rotate_info.  It makes vmscan to scan anon LRU
      regardless of swappiness and make reclaim bad.  This patch fixes it by
      adjusting scanning count with regard to swappiness at el.
      
      At a test "cat 1G file under 300M limit." (swappiness=20)
       before patch
              scanned_pages_by_limit 360919
              scanned_anon_pages_by_limit 180469
              scanned_file_pages_by_limit 180450
              rotated_pages_by_limit 31
              rotated_anon_pages_by_limit 25
              rotated_file_pages_by_limit 6
              freed_pages_by_limit 180458
              freed_anon_pages_by_limit 19
              freed_file_pages_by_limit 180439
              elapsed_ns_by_limit 429758872
       after patch
              scanned_pages_by_limit 180674
              scanned_anon_pages_by_limit 24
              scanned_file_pages_by_limit 180650
              rotated_pages_by_limit 35
              rotated_anon_pages_by_limit 24
              rotated_file_pages_by_limit 11
              freed_pages_by_limit 180634
              freed_anon_pages_by_limit 0
              freed_file_pages_by_limit 180634
              elapsed_ns_by_limit 367119089
              scanned_pages_by_system 0
      
      the numbers of scanning anon are decreased(as expected), and elapsed time
      reduced. By this patch, small memcgs will work better.
      (*) Because the amount of file-cache is much bigger than anon,
          recalaim_stat's rotate-scan counter make scanning files more.
      Signed-off-by: NKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Ying Han <yinghan@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      4508378b
    • K
      memcg: consolidate memory cgroup lru stat functions · bb2a0de9
      KAMEZAWA Hiroyuki 提交于
      In mm/memcontrol.c, there are many lru stat functions as..
      
        mem_cgroup_zone_nr_lru_pages
        mem_cgroup_node_nr_file_lru_pages
        mem_cgroup_nr_file_lru_pages
        mem_cgroup_node_nr_anon_lru_pages
        mem_cgroup_nr_anon_lru_pages
        mem_cgroup_node_nr_unevictable_lru_pages
        mem_cgroup_nr_unevictable_lru_pages
        mem_cgroup_node_nr_lru_pages
        mem_cgroup_nr_lru_pages
        mem_cgroup_get_local_zonestat
      
      Some of them are under #ifdef MAX_NUMNODES >1 and others are not.
      This seems bad. This patch consolidates all functions into
      
        mem_cgroup_zone_nr_lru_pages()
        mem_cgroup_node_nr_lru_pages()
        mem_cgroup_nr_lru_pages()
      
      For these functions, "which LRU?" information is passed by a mask.
      
      example:
        mem_cgroup_nr_lru_pages(mem, BIT(LRU_ACTIVE_ANON))
      
      And I added some macro as ALL_LRU, ALL_LRU_FILE, ALL_LRU_ANON.
      
      example:
        mem_cgroup_nr_lru_pages(mem, ALL_LRU)
      
      BTW, considering layout of NUMA memory placement of counters, this patch seems
      to be better.
      
      Now, when we gather all LRU information, we scan in following orer
          for_each_lru -> for_each_node -> for_each_zone.
      
      This means we'll touch cache lines in different node in turn.
      
      After patch, we'll scan
          for_each_node -> for_each_zone -> for_each_lru(mask)
      
      Then, we'll gather information in the same cacheline at once.
      
      [akpm@linux-foundation.org: fix warnigns, build error]
      Signed-off-by: NKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      Cc: Balbir Singh <bsingharora@gmail.com>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Ying Han <yinghan@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      bb2a0de9
    • K
      memcg: export memory cgroup's swappiness with mem_cgroup_swappiness() · 1f4c025b
      KAMEZAWA Hiroyuki 提交于
      Each memory cgroup has a 'swappiness' value which can be accessed by
      get_swappiness(memcg).  The major user is try_to_free_mem_cgroup_pages()
      and swappiness is passed by argument.  It's propagated by scan_control.
      
      get_swappiness() is a static function but some planned updates will need
      to get swappiness from files other than memcontrol.c This patch exports
      get_swappiness() as mem_cgroup_swappiness().  With this, we can remove the
      argument of swapiness from try_to_free...  and drop swappiness from
      scan_control.  only memcg uses it.
      Signed-off-by: NKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      Cc: Balbir Singh <bsingharora@gmail.com>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Ying Han <yinghan@google.com>
      Cc: Shaohua Li <shaohua.li@intel.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      1f4c025b
  3. 20 7月, 2011 5 次提交
    • D
      vmscan: add customisable shrinker batch size · e9299f50
      Dave Chinner 提交于
      For shrinkers that have their own cond_resched* calls, having
      shrink_slab break the work down into small batches is not
      paticularly efficient. Add a custom batchsize field to the struct
      shrinker so that shrinkers can use a larger batch size if they
      desire.
      
      A value of zero (uninitialised) means "use the default", so
      behaviour is unchanged by this patch.
      Signed-off-by: NDave Chinner <dchinner@redhat.com>
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      e9299f50
    • D
      vmscan: reduce wind up shrinker->nr when shrinker can't do work · 3567b59a
      Dave Chinner 提交于
      When a shrinker returns -1 to shrink_slab() to indicate it cannot do
      any work given the current memory reclaim requirements, it adds the
      entire total_scan count to shrinker->nr. The idea ehind this is that
      whenteh shrinker is next called and can do work, it will do the work
      of the previously aborted shrinker call as well.
      
      However, if a filesystem is doing lots of allocation with GFP_NOFS
      set, then we get many, many more aborts from the shrinkers than we
      do successful calls. The result is that shrinker->nr winds up to
      it's maximum permissible value (twice the current cache size) and
      then when the next shrinker call that can do work is issued, it
      has enough scan count built up to free the entire cache twice over.
      
      This manifests itself in the cache going from full to empty in a
      matter of seconds, even when only a small part of the cache is
      needed to be emptied to free sufficient memory.
      
      Under metadata intensive workloads on ext4 and XFS, I'm seeing the
      VFS caches increase memory consumption up to 75% of memory (no page
      cache pressure) over a period of 30-60s, and then the shrinker
      empties them down to zero in the space of 2-3s. This cycle repeats
      over and over again, with the shrinker completely trashing the inode
      and dentry caches every minute or so the workload continues.
      
      This behaviour was made obvious by the shrink_slab tracepoints added
      earlier in the series, and made worse by the patch that corrected
      the concurrent accounting of shrinker->nr.
      
      To avoid this problem, stop repeated small increments of the total
      scan value from winding shrinker->nr up to a value that can cause
      the entire cache to be freed. We still need to allow it to wind up,
      so use the delta as the "large scan" threshold check - if the delta
      is more than a quarter of the entire cache size, then it is a large
      scan and allowed to cause lots of windup because we are clearly
      needing to free lots of memory.
      
      If it isn't a large scan then limit the total scan to half the size
      of the cache so that windup never increases to consume the whole
      cache. Reducing the total scan limit further does not allow enough
      wind-up to maintain the current levels of performance, whilst a
      higher threshold does not prevent the windup from freeing the entire
      cache under sustained workloads.
      Signed-off-by: NDave Chinner <dchinner@redhat.com>
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      3567b59a
    • D
      vmscan: shrinker->nr updates race and go wrong · acf92b48
      Dave Chinner 提交于
      shrink_slab() allows shrinkers to be called in parallel so the
      struct shrinker can be updated concurrently. It does not provide any
      exclusio for such updates, so we can get the shrinker->nr value
      increasing or decreasing incorrectly.
      
      As a result, when a shrinker repeatedly returns a value of -1 (e.g.
      a VFS shrinker called w/ GFP_NOFS), the shrinker->nr goes haywire,
      sometimes updating with the scan count that wasn't used, sometimes
      losing it altogether. Worse is when a shrinker does work and that
      update is lost due to racy updates, which means the shrinker will do
      the work again!
      
      Fix this by making the total_scan calculations independent of
      shrinker->nr, and making the shrinker->nr updates atomic w.r.t. to
      other updates via cmpxchg loops.
      Signed-off-by: NDave Chinner <dchinner@redhat.com>
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      acf92b48
    • D
      vmscan: add shrink_slab tracepoints · 09576073
      Dave Chinner 提交于
      It is impossible to understand what the shrinkers are actually doing
      without instrumenting the code, so add a some tracepoints to allow
      insight to be gained.
      Signed-off-by: NDave Chinner <dchinner@redhat.com>
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      09576073
    • S
      vmscan: fix a livelock in kswapd · 4746efde
      Shaohua Li 提交于
      I'm running a workload which triggers a lot of swap in a machine with 4
      nodes.  After I kill the workload, I found a kswapd livelock.  Sometimes
      kswapd3 or kswapd2 are keeping running and I can't access filesystem,
      but most memory is free.
      
      This looks like a regression since commit 08951e54 ("mm: vmscan:
      correct check for kswapd sleeping in sleeping_prematurely").
      
      Node 2 and 3 have only ZONE_NORMAL, but balance_pgdat() will return 0
      for classzone_idx.  The reason is end_zone in balance_pgdat() is 0 by
      default, if all zones have watermark ok, end_zone will keep 0.
      
      Later sleeping_prematurely() always returns true.  Because this is an
      order 3 wakeup, and if classzone_idx is 0, both balanced_pages and
      present_pages in pgdat_balanced() are 0.  We add a special case here.
      If a zone has no page, we think it's balanced.  This fixes the livelock.
      Signed-off-by: NShaohua Li <shaohua.li@intel.com>
      Acked-by: NMel Gorman <mgorman@suse.de>
      Cc: Minchan Kim <minchan.kim@gmail.com>
      Cc: <stable@kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      4746efde
  4. 09 7月, 2011 4 次提交
    • M
      mm: vmscan: only read new_classzone_idx from pgdat when reclaiming successfully · 215ddd66
      Mel Gorman 提交于
      During allocator-intensive workloads, kswapd will be woken frequently
      causing free memory to oscillate between the high and min watermark.  This
      is expected behaviour.  Unfortunately, if the highest zone is small, a
      problem occurs.
      
      When balance_pgdat() returns, it may be at a lower classzone_idx than it
      started because the highest zone was unreclaimable.  Before checking if it
      should go to sleep though, it checks pgdat->classzone_idx which when there
      is no other activity will be MAX_NR_ZONES-1.  It interprets this as it has
      been woken up while reclaiming, skips scheduling and reclaims again.  As
      there is no useful reclaim work to do, it enters into a loop of shrinking
      slab consuming loads of CPU until the highest zone becomes reclaimable for
      a long period of time.
      
      There are two problems here.  1) If the returned classzone or order is
      lower, it'll continue reclaiming without scheduling.  2) if the highest
      zone was marked unreclaimable but balance_pgdat() returns immediately at
      DEF_PRIORITY, the new lower classzone is not communicated back to kswapd()
      for sleeping.
      
      This patch does two things that are related.  If the end_zone is
      unreclaimable, this information is communicated back.  Second, if the
      classzone or order was reduced due to failing to reclaim, new information
      is not read from pgdat and instead an attempt is made to go to sleep.  Due
      to this, it is also necessary that pgdat->classzone_idx be initialised
      each time to pgdat->nr_zones - 1 to avoid re-reads being interpreted as
      wakeups.
      Signed-off-by: NMel Gorman <mgorman@suse.de>
      Reported-by: NPádraig Brady <P@draigBrady.com>
      Tested-by: NPádraig Brady <P@draigBrady.com>
      Tested-by: NAndrew Lutomirski <luto@mit.edu>
      Acked-by: NRik van Riel <riel@redhat.com>
      Cc: Minchan Kim <minchan.kim@gmail.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: <stable@kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      215ddd66
    • M
      mm: vmscan: evaluate the watermarks against the correct classzone · da175d06
      Mel Gorman 提交于
      When deciding if kswapd is sleeping prematurely, the classzone is taken
      into account but this is different to what balance_pgdat() and the
      allocator are doing.  Specifically, the DMA zone will be checked based on
      the classzone used when waking kswapd which could be for a GFP_KERNEL or
      GFP_HIGHMEM request.  The lowmem reserve limit kicks in, the watermark is
      not met and kswapd thinks it's sleeping prematurely keeping kswapd awake in
      error.
      Signed-off-by: NMel Gorman <mgorman@suse.de>
      Reported-by: NPádraig Brady <P@draigBrady.com>
      Tested-by: NPádraig Brady <P@draigBrady.com>
      Tested-by: NAndrew Lutomirski <luto@mit.edu>
      Acked-by: NRik van Riel <riel@redhat.com>
      Reviewed-by: NMinchan Kim <minchan.kim@gmail.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: <stable@kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      da175d06
    • M
      mm: vmscan: do not apply pressure to slab if we are not applying pressure to zone · d7868dae
      Mel Gorman 提交于
      During allocator-intensive workloads, kswapd will be woken frequently
      causing free memory to oscillate between the high and min watermark.  This
      is expected behaviour.
      
      When kswapd applies pressure to zones during node balancing, it checks if
      the zone is above a high+balance_gap threshold.  If it is, it does not
      apply pressure but it unconditionally shrinks slab on a global basis which
      is excessive.  In the event kswapd is being kept awake due to a high small
      unreclaimable zone, it skips zone shrinking but still calls shrink_slab().
      
      Once pressure has been applied, the check for zone being unreclaimable is
      being made before the check is made if all_unreclaimable should be set.
      This miss of unreclaimable can cause has_under_min_watermark_zone to be
      set due to an unreclaimable zone preventing kswapd backing off on
      congestion_wait().
      Signed-off-by: NMel Gorman <mgorman@suse.de>
      Reported-by: NPádraig Brady <P@draigBrady.com>
      Tested-by: NPádraig Brady <P@draigBrady.com>
      Tested-by: NAndrew Lutomirski <luto@mit.edu>
      Acked-by: NRik van Riel <riel@redhat.com>
      Reviewed-by: NMinchan Kim <minchan.kim@gmail.com>
      Reviewed-by: NKOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: <stable@kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      d7868dae
    • M
      mm: vmscan: correct check for kswapd sleeping in sleeping_prematurely · 08951e54
      Mel Gorman 提交于
      During allocator-intensive workloads, kswapd will be woken frequently
      causing free memory to oscillate between the high and min watermark.  This
      is expected behaviour.  Unfortunately, if the highest zone is small, a
      problem occurs.
      
      This seems to happen most with recent sandybridge laptops but it's
      probably a co-incidence as some of these laptops just happen to have a
      small Normal zone.  The reproduction case is almost always during copying
      large files that kswapd pegs at 100% CPU until the file is deleted or
      cache is dropped.
      
      The problem is mostly down to sleeping_prematurely() keeping kswapd awake
      when the highest zone is small and unreclaimable and compounded by the
      fact we shrink slabs even when not shrinking zones causing a lot of time
      to be spent in shrinkers and a lot of memory to be reclaimed.
      
      Patch 1 corrects sleeping_prematurely to check the zones matching
      	the classzone_idx instead of all zones.
      
      Patch 2 avoids shrinking slab when we are not shrinking a zone.
      
      Patch 3 notes that sleeping_prematurely is checking lower zones against
      	a high classzone which is not what allocators or balance_pgdat()
      	is doing leading to an artifical belief that kswapd should be
      	still awake.
      
      Patch 4 notes that when balance_pgdat() gives up on a high zone that the
      	decision is not communicated to sleeping_prematurely()
      
      This problem affects 2.6.38.8 for certain and is expected to affect 2.6.39
      and 3.0-rc4 as well.  If accepted, they need to go to -stable to be picked
      up by distros and this series is against 3.0-rc4.  I've cc'd people that
      reported similar problems recently to see if they still suffer from the
      problem and if this fixes it.
      
      This patch: correct the check for kswapd sleeping in sleeping_prematurely()
      
      During allocator-intensive workloads, kswapd will be woken frequently
      causing free memory to oscillate between the high and min watermark.  This
      is expected behaviour.
      
      A problem occurs if the highest zone is small.  balance_pgdat() only
      considers unreclaimable zones when priority is DEF_PRIORITY but
      sleeping_prematurely considers all zones.  It's possible for this sequence
      to occur
      
        1. kswapd wakes up and enters balance_pgdat()
        2. At DEF_PRIORITY, marks highest zone unreclaimable
        3. At DEF_PRIORITY-1, ignores highest zone setting end_zone
        4. At DEF_PRIORITY-1, calls shrink_slab freeing memory from
              highest zone, clearing all_unreclaimable. Highest zone
              is still unbalanced
        5. kswapd returns and calls sleeping_prematurely
        6. sleeping_prematurely looks at *all* zones, not just the ones
           being considered by balance_pgdat. The highest small zone
           has all_unreclaimable cleared but the zone is not
           balanced. all_zones_ok is false so kswapd stays awake
      
      This patch corrects the behaviour of sleeping_prematurely to check the
      zones balance_pgdat() checked.
      Signed-off-by: NMel Gorman <mgorman@suse.de>
      Reported-by: NPádraig Brady <P@draigBrady.com>
      Tested-by: NPádraig Brady <P@draigBrady.com>
      Tested-by: NAndrew Lutomirski <luto@mit.edu>
      Acked-by: NRik van Riel <riel@redhat.com>
      Reviewed-by: NMinchan Kim <minchan.kim@gmail.com>
      Reviewed-by: NKOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: <stable@kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      08951e54
  5. 28 6月, 2011 1 次提交
    • K
      memcg: fix direct softlimit reclaim to be called in limit path · ac34a1a3
      KAMEZAWA Hiroyuki 提交于
      Commit d149e3b2 ("memcg: add the soft_limit reclaim in global direct
      reclaim") adds a softlimit hook to shrink_zones().  By this, soft limit
      is called as
      
         try_to_free_pages()
             do_try_to_free_pages()
                 shrink_zones()
                     mem_cgroup_soft_limit_reclaim()
      
      Then, direct reclaim is memcg softlimit hint aware, now.
      
      But, the memory cgroup's "limit" path can call softlimit shrinker.
      
         try_to_free_mem_cgroup_pages()
             do_try_to_free_pages()
                 shrink_zones()
                     mem_cgroup_soft_limit_reclaim()
      
      This will cause a global reclaim when a memcg hits limit.
      
      This is bug. soft_limit_reclaim() should be called when
      scanning_global_lru(sc) == true.
      
      And the commit adds a variable "total_scanned" for counting softlimit
      scanned pages....it's not "total".  This patch removes the variable and
      update sc->nr_scanned instead of it.  This will affect shrink_slab()'s
      scan condition but, global LRU is scanned by softlimit and I think this
      change makes sense.
      
      TODO: avoid too much scanning of a zone when softlimit did enough work.
      Signed-off-by: NKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      Cc: Ying Han <yinghan@google.com>
      Cc: Michal Hocko <mhocko@suse.cz>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      ac34a1a3
  6. 16 6月, 2011 2 次提交
  7. 27 5月, 2011 5 次提交
    • Y
      memcg: rename mem_cgroup_zone_nr_pages() to mem_cgroup_zone_nr_lru_pages() · 1bac180b
      Ying Han 提交于
      The caller of the function has been renamed to zone_nr_lru_pages(), and
      this is just fixing up in the memcg code.  The current name is easily to
      be mis-read as zone's total number of pages.
      Signed-off-by: NYing Han <yinghan@google.com>
      Acked-by: NJohannes Weiner <hannes@cmpxchg.org>
      Acked-by: NKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Reviewed-by: NMinchan Kim <minchan.kim@gmail.com>
      Cc: Balbir Singh <balbir@in.ibm.com>
      Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      1bac180b
    • K
      memcg: fix get_scan_count() for small targets · 246e87a9
      KAMEZAWA Hiroyuki 提交于
      During memory reclaim we determine the number of pages to be scanned per
      zone as
      
      	(anon + file) >> priority.
      Assume
      	scan = (anon + file) >> priority.
      
      If scan < SWAP_CLUSTER_MAX, the scan will be skipped for this time and
      priority gets higher.  This has some problems.
      
        1. This increases priority as 1 without any scan.
           To do scan in this priority, amount of pages should be larger than 512M.
           If pages>>priority < SWAP_CLUSTER_MAX, it's recorded and scan will be
           batched, later. (But we lose 1 priority.)
           If memory size is below 16M, pages >> priority is 0 and no scan in
           DEF_PRIORITY forever.
      
        2. If zone->all_unreclaimabe==true, it's scanned only when priority==0.
           So, x86's ZONE_DMA will never be recoverred until the user of pages
           frees memory by itself.
      
        3. With memcg, the limit of memory can be small. When using small memcg,
           it gets priority < DEF_PRIORITY-2 very easily and need to call
           wait_iff_congested().
           For doing scan before priorty=9, 64MB of memory should be used.
      
      Then, this patch tries to scan SWAP_CLUSTER_MAX of pages in force...when
      
        1. the target is enough small.
        2. it's kswapd or memcg reclaim.
      
      Then we can avoid rapid priority drop and may be able to recover
      all_unreclaimable in a small zones.  And this patch removes nr_saved_scan.
       This will allow scanning in this priority even when pages >> priority is
      very small.
      Signed-off-by: NKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Acked-by: NYing Han <yinghan@google.com>
      Cc: Balbir Singh <balbir@in.ibm.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      246e87a9
    • Y
      memcg: reclaim memory from nodes in round-robin order · 889976db
      Ying Han 提交于
      Presently, memory cgroup's direct reclaim frees memory from the current
      node.  But this has some troubles.  Usually when a set of threads works in
      a cooperative way, they tend to operate on the same node.  So if they hit
      limits under memcg they will reclaim memory from themselves, damaging the
      active working set.
      
      For example, assume 2 node system which has Node 0 and Node 1 and a memcg
      which has 1G limit.  After some work, file cache remains and the usages
      are
      
         Node 0:  1M
         Node 1:  998M.
      
      and run an application on Node 0, it will eat its foot before freeing
      unnecessary file caches.
      
      This patch adds round-robin for NUMA and adds equal pressure to each node.
      When using cpuset's spread memory feature, this will work very well.
      
      But yes, a better algorithm is needed.
      
      [akpm@linux-foundation.org: comment editing]
      [kamezawa.hiroyu@jp.fujitsu.com: fix time comparisons]
      Signed-off-by: NYing Han <yinghan@google.com>
      Signed-off-by: NKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Balbir Singh <balbir@in.ibm.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      889976db
    • Y
      memcg: add the soft_limit reclaim in global direct reclaim. · d149e3b2
      Ying Han 提交于
      We recently added the change in global background reclaim which counts the
      return value of soft_limit reclaim.  Now this patch adds the similar logic
      on global direct reclaim.
      
      We should skip scanning global LRU on shrink_zone if soft_limit reclaim
      does enough work.  This is the first step where we start with counting the
      nr_scanned and nr_reclaimed from soft_limit reclaim into global
      scan_control.
      Signed-off-by: NYing Han <yinghan@google.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Minchan Kim <minchan.kim@gmail.com>
      Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      Cc: Balbir Singh <balbir@linux.vnet.ibm.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Michal Hocko <mhocko@suse.cz>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      d149e3b2
    • Y
      memcg: count the soft_limit reclaim in global background reclaim · 0ae5e89c
      Ying Han 提交于
      The global kswapd scans per-zone LRU and reclaims pages regardless of the
      cgroup. It breaks memory isolation since one cgroup can end up reclaiming
      pages from another cgroup. Instead we should rely on memcg-aware target
      reclaim including per-memcg kswapd and soft_limit hierarchical reclaim under
      memory pressure.
      
      In the global background reclaim, we do soft reclaim before scanning the
      per-zone LRU. However, the return value is ignored. This patch is the first
      step to skip shrink_zone() if soft_limit reclaim does enough work.
      
      This is part of the effort which tries to reduce reclaiming pages in global
      LRU in memcg. The per-memcg background reclaim patchset further enhances the
      per-cgroup targetting reclaim, which I should have V4 posted shortly.
      
      Try running multiple memory intensive workloads within seperate memcgs. Watch
      the counters of soft_steal in memory.stat.
      
        $ cat /dev/cgroup/A/memory.stat | grep 'soft'
        soft_steal 240000
        soft_scan 240000
        total_soft_steal 240000
        total_soft_scan 240000
      
      This patch:
      
      In the global background reclaim, we do soft reclaim before scanning the
      per-zone LRU.  However, the return value is ignored.
      
      We would like to skip shrink_zone() if soft_limit reclaim does enough
      work.  Also, we need to make the memory pressure balanced across per-memcg
      zones, like the logic vm-core.  This patch is the first step where we
      start with counting the nr_scanned and nr_reclaimed from soft_limit
      reclaim into the global scan_control.
      Signed-off-by: NYing Han <yinghan@google.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Minchan Kim <minchan.kim@gmail.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Balbir Singh <balbir@in.ibm.com>
      Acked-by: NDaisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      0ae5e89c
  8. 25 5月, 2011 5 次提交
    • Y
      vmscan: change shrinker API by passing shrink_control struct · 1495f230
      Ying Han 提交于
      Change each shrinker's API by consolidating the existing parameters into
      shrink_control struct.  This will simplify any further features added w/o
      touching each file of shrinker.
      
      [akpm@linux-foundation.org: fix build]
      [akpm@linux-foundation.org: fix warning]
      [kosaki.motohiro@jp.fujitsu.com: fix up new shrinker API]
      [akpm@linux-foundation.org: fix xfs warning]
      [akpm@linux-foundation.org: update gfs2]
      Signed-off-by: NYing Han <yinghan@google.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Minchan Kim <minchan.kim@gmail.com>
      Acked-by: NPavel Emelyanov <xemul@openvz.org>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Acked-by: NRik van Riel <riel@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Dave Hansen <dave@linux.vnet.ibm.com>
      Cc: Steven Whitehouse <swhiteho@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      1495f230
    • Y
      vmscan: change shrink_slab() interfaces by passing shrink_control · a09ed5e0
      Ying Han 提交于
      Consolidate the existing parameters to shrink_slab() into a new
      shrink_control struct.  This is needed later to pass the same struct to
      shrinkers.
      Signed-off-by: NYing Han <yinghan@google.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Minchan Kim <minchan.kim@gmail.com>
      Acked-by: NPavel Emelyanov <xemul@openvz.org>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Acked-by: NRik van Riel <riel@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Dave Hansen <dave@linux.vnet.ibm.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      a09ed5e0
    • K
      mm: strictly require elevated page refcount in isolate_lru_page() · 0c917313
      Konstantin Khlebnikov 提交于
      isolate_lru_page() must be called only with stable reference to the page,
      this is what is written in the comment above it, this is reasonable.
      
      current isolate_lru_page() users and its page extra reference sources:
      
       mm/huge_memory.c:
        __collapse_huge_page_isolate()	- reference from pte
      
       mm/memcontrol.c:
        mem_cgroup_move_parent()		- get_page_unless_zero()
        mem_cgroup_move_charge_pte_range()	- reference from pte
      
       mm/memory-failure.c:
        soft_offline_page()			- fixed, reference from get_any_page()
        delete_from_lru_cache() - reference from caller or get_page_unless_zero()
      	[ seems like there bug, because __memory_failure() can call
      	  page_action() for hpages tail, but it is ok for
      	  isolate_lru_page(), tail getted and not in lru]
      
       mm/memory_hotplug.c:
        do_migrate_range()			- fixed, get_page_unless_zero()
      
       mm/mempolicy.c:
        migrate_page_add()			- reference from pte
      
       mm/migrate.c:
        do_move_page_to_node_array()		- reference from follow_page()
      
       mlock.c:				- various external references
      
       mm/vmscan.c:
        putback_lru_page()			- reference from isolate_lru_page()
      
      It seems that all isolate_lru_page() users are ready now for this
      restriction.  So, let's replace redundant get_page_unless_zero() with
      get_page() and add page initial reference count check with VM_BUG_ON()
      Signed-off-by: NKonstantin Khlebnikov <khlebnikov@openvz.org>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Lee Schermerhorn <lee.schermerhorn@hp.com>
      Cc: Rik van Riel <riel@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      0c917313
    • M
      mm: vmscan: correctly check if reclaimer should schedule during shrink_slab · f06590bd
      Minchan Kim 提交于
      It has been reported on some laptops that kswapd is consuming large
      amounts of CPU and not being scheduled when SLUB is enabled during large
      amounts of file copying.  It is expected that this is due to kswapd
      missing every cond_resched() point because;
      
      shrink_page_list() calls cond_resched() if inactive pages were isolated
              which in turn may not happen if all_unreclaimable is set in
              shrink_zones(). If for whatver reason, all_unreclaimable is
              set on all zones, we can miss calling cond_resched().
      
      balance_pgdat() only calls cond_resched if the zones are not
              balanced. For a high-order allocation that is balanced, it
              checks order-0 again. During that window, order-0 might have
              become unbalanced so it loops again for order-0 and returns
              that it was reclaiming for order-0 to kswapd(). It can then
              find that a caller has rewoken kswapd for a high-order and
              re-enters balance_pgdat() without ever calling cond_resched().
      
      shrink_slab only calls cond_resched() if we are reclaiming slab
      	pages. If there are a large number of direct reclaimers, the
      	shrinker_rwsem can be contended and prevent kswapd calling
      	cond_resched().
      
      This patch modifies the shrink_slab() case.  If the semaphore is
      contended, the caller will still check cond_resched().  After each
      successful call into a shrinker, the check for cond_resched() remains in
      case one shrinker is particularly slow.
      
      [mgorman@suse.de: preserve call to cond_resched after each call into shrinker]
      Signed-off-by: NMel Gorman <mgorman@suse.de>
      Signed-off-by: NMinchan Kim <minchan.kim@gmail.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Wu Fengguang <fengguang.wu@intel.com>
      Cc: James Bottomley <James.Bottomley@HansenPartnership.com>
      Tested-by: NColin King <colin.king@canonical.com>
      Cc: Raghavendra D Prabhu <raghu.prabhu13@gmail.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Chris Mason <chris.mason@oracle.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: <stable@kernel.org>		[2.6.38+]
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      f06590bd
    • J
      mm: vmscan: correct use of pgdat_balanced in sleeping_prematurely · afc7e326
      Johannes Weiner 提交于
      There are a few reports of people experiencing hangs when copying large
      amounts of data with kswapd using a large amount of CPU which appear to be
      due to recent reclaim changes.  SLUB using high orders is the trigger but
      not the root cause as SLUB has been using high orders for a while.  The
      root cause was bugs introduced into reclaim which are addressed by the
      following two patches.
      
      Patch 1 corrects logic introduced by commit 1741c877 ("mm: kswapd:
              keep kswapd awake for high-order allocations until a percentage of
              the node is balanced") to allow kswapd to go to sleep when
              balanced for high orders.
      
      Patch 2 notes that it is possible for kswapd to miss every
              cond_resched() and updates shrink_slab() so it'll at least reach
              that scheduling point.
      
      Chris Wood reports that these two patches in isolation are sufficient to
      prevent the system hanging.  AFAIK, they should also resolve similar hangs
      experienced by James Bottomley.
      
      This patch:
      
      Johannes Weiner poined out that the logic in commit 1741c877 ("mm: kswapd:
      keep kswapd awake for high-order allocations until a percentage of the
      node is balanced") is backwards.  Instead of allowing kswapd to go to
      sleep when balancing for high order allocations, it keeps it kswapd
      running uselessly.
      Signed-off-by: NMel Gorman <mgorman@suse.de>
      Reviewed-by: NRik van Riel <riel@redhat.com>
      Signed-off-by: NJohannes Weiner <hannes@cmpxchg.org>
      Reviewed-by: NWu Fengguang <fengguang.wu@intel.com>
      Cc: James Bottomley <James.Bottomley@HansenPartnership.com>
      Tested-by: NColin King <colin.king@canonical.com>
      Cc: Raghavendra D Prabhu <raghu.prabhu13@gmail.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Chris Mason <chris.mason@oracle.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: Rik van Riel <riel@redhat.com>
      Reviewed-by: NMinchan Kim <minchan.kim@gmail.com>
      Reviewed-by: NWu Fengguang <fengguang.wu@intel.com>
      Cc: <stable@kernel.org>		[2.6.38+]
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      afc7e326
  9. 21 5月, 2011 1 次提交
    • L
      sanitize <linux/prefetch.h> usage · 268bb0ce
      Linus Torvalds 提交于
      Commit e66eed65 ("list: remove prefetching from regular list
      iterators") removed the include of prefetch.h from list.h, which
      uncovered several cases that had apparently relied on that rather
      obscure header file dependency.
      
      So this fixes things up a bit, using
      
         grep -L linux/prefetch.h $(git grep -l '[^a-z_]prefetchw*(' -- '*.[ch]')
         grep -L 'prefetchw*(' $(git grep -l 'linux/prefetch.h' -- '*.[ch]')
      
      to guide us in finding files that either need <linux/prefetch.h>
      inclusion, or have it despite not needing it.
      
      There are more of them around (mostly network drivers), but this gets
      many core ones.
      Reported-by: NStephen Rothwell <sfr@canb.auug.org.au>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      268bb0ce
  10. 18 5月, 2011 1 次提交
  11. 15 4月, 2011 1 次提交
    • K
      vmscan: all_unreclaimable() use zone->all_unreclaimable as a name · 929bea7c
      KOSAKI Motohiro 提交于
      all_unreclaimable check in direct reclaim has been introduced at 2.6.19
      by following commit.
      
      	2006 Sep 25; commit 408d8544; oom: use unreclaimable info
      
      And it went through strange history. firstly, following commit broke
      the logic unintentionally.
      
      	2008 Apr 29; commit a41f24ea; page allocator: smarter retry of
      				      costly-order allocations
      
      Two years later, I've found obvious meaningless code fragment and
      restored original intention by following commit.
      
      	2010 Jun 04; commit bb21c7ce; vmscan: fix do_try_to_free_pages()
      				      return value when priority==0
      
      But, the logic didn't works when 32bit highmem system goes hibernation
      and Minchan slightly changed the algorithm and fixed it .
      
      	2010 Sep 22: commit d1908362: vmscan: check all_unreclaimable
      				      in direct reclaim path
      
      But, recently, Andrey Vagin found the new corner case. Look,
      
      	struct zone {
      	  ..
      	        int                     all_unreclaimable;
      	  ..
      	        unsigned long           pages_scanned;
      	  ..
      	}
      
      zone->all_unreclaimable and zone->pages_scanned are neigher atomic
      variables nor protected by lock.  Therefore zones can become a state of
      zone->page_scanned=0 and zone->all_unreclaimable=1.  In this case, current
      all_unreclaimable() return false even though zone->all_unreclaimabe=1.
      
      This resulted in the kernel hanging up when executing a loop of the form
      
      1. fork
      2. mmap
      3. touch memory
      4. read memory
      5. munmmap
      
      as described in
      http://www.gossamer-threads.com/lists/linux/kernel/1348725#1348725
      
      Is this ignorable minor issue?  No.  Unfortunately, x86 has very small dma
      zone and it become zone->all_unreclamble=1 easily.  and if it become
      all_unreclaimable=1, it never restore all_unreclaimable=0.  Why?  if
      all_unreclaimable=1, vmscan only try DEF_PRIORITY reclaim and
      a-few-lru-pages>>DEF_PRIORITY always makes 0.  that mean no page scan at
      all!
      
      Eventually, oom-killer never works on such systems.  That said, we can't
      use zone->pages_scanned for this purpose.  This patch restore
      all_unreclaimable() use zone->all_unreclaimable as old.  and in addition,
      to add oom_killer_disabled check to avoid reintroduce the issue of commit
      d1908362 ("vmscan: check all_unreclaimable in direct reclaim path").
      Reported-by: NAndrey Vagin <avagin@openvz.org>
      Signed-off-by: NKOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Nick Piggin <npiggin@kernel.dk>
      Reviewed-by: NMinchan Kim <minchan.kim@gmail.com>
      Reviewed-by: NKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Acked-by: NDavid Rientjes <rientjes@google.com>
      Cc: <stable@kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      929bea7c
  12. 31 3月, 2011 1 次提交
  13. 23 3月, 2011 3 次提交
  14. 10 3月, 2011 1 次提交
  15. 26 2月, 2011 1 次提交
    • M
      mm: vmscan: stop reclaim/compaction earlier due to insufficient progress if !__GFP_REPEAT · 2876592f
      Mel Gorman 提交于
      should_continue_reclaim() for reclaim/compaction allows scanning to
      continue even if pages are not being reclaimed until the full list is
      scanned.  In terms of allocation success, this makes sense but potentially
      it introduces unwanted latency for high-order allocations such as
      transparent hugepages and network jumbo frames that would prefer to fail
      the allocation attempt and fallback to order-0 pages.  Worse, there is a
      potential that the full LRU scan will clear all the young bits, distort
      page aging information and potentially push pages into swap that would
      have otherwise remained resident.
      
      This patch will stop reclaim/compaction if no pages were reclaimed in the
      last SWAP_CLUSTER_MAX pages that were considered.  For allocations such as
      hugetlbfs that use __GFP_REPEAT and have fewer fallback options, the full
      LRU list may still be scanned.
      
      Order-0 allocation should not be affected because RECLAIM_MODE_COMPACTION
      is not set so the following avoids the gfp_mask being examined:
      
              if (!(sc->reclaim_mode & RECLAIM_MODE_COMPACTION))
                      return false;
      
      A tool was developed based on ftrace that tracked the latency of
      high-order allocations while transparent hugepage support was enabled and
      three benchmarks were run.  The "fix-infinite" figures are 2.6.38-rc4 with
      Johannes's patch "vmscan: fix zone shrinking exit when scan work is done"
      applied.
      
        STREAM Highorder Allocation Latency Statistics
                       fix-infinite     break-early
        1 :: Count            10298           10229
        1 :: Min             0.4560          0.4640
        1 :: Mean            1.0589          1.0183
        1 :: Max            14.5990         11.7510
        1 :: Stddev          0.5208          0.4719
        2 :: Count                2               1
        2 :: Min             1.8610          3.7240
        2 :: Mean            3.4325          3.7240
        2 :: Max             5.0040          3.7240
        2 :: Stddev          1.5715          0.0000
        9 :: Count           111696          111694
        9 :: Min             0.5230          0.4110
        9 :: Mean           10.5831         10.5718
        9 :: Max            38.4480         43.2900
        9 :: Stddev          1.1147          1.1325
      
      Mean time for order-1 allocations is reduced.  order-2 looks increased but
      with so few allocations, it's not particularly significant.  THP mean
      allocation latency is also reduced.  That said, allocation time varies so
      significantly that the reductions are within noise.
      
      Max allocation time is reduced by a significant amount for low-order
      allocations but reduced for THP allocations which presumably are now
      breaking before reclaim has done enough work.
      
        SysBench Highorder Allocation Latency Statistics
                       fix-infinite     break-early
        1 :: Count            15745           15677
        1 :: Min             0.4250          0.4550
        1 :: Mean            1.1023          1.0810
        1 :: Max            14.4590         10.8220
        1 :: Stddev          0.5117          0.5100
        2 :: Count                1               1
        2 :: Min             3.0040          2.1530
        2 :: Mean            3.0040          2.1530
        2 :: Max             3.0040          2.1530
        2 :: Stddev          0.0000          0.0000
        9 :: Count             2017            1931
        9 :: Min             0.4980          0.7480
        9 :: Mean           10.4717         10.3840
        9 :: Max            24.9460         26.2500
        9 :: Stddev          1.1726          1.1966
      
      Again, mean time for order-1 allocations is reduced while order-2
      allocations are too few to draw conclusions from.  The mean time for THP
      allocations is also slightly reduced albeit the reductions are within
      varianes.
      
      Once again, our maximum allocation time is significantly reduced for
      low-order allocations and slightly increased for THP allocations.
      
        Anon stream mmap reference Highorder Allocation Latency Statistics
        1 :: Count             1376            1790
        1 :: Min             0.4940          0.5010
        1 :: Mean            1.0289          0.9732
        1 :: Max             6.2670          4.2540
        1 :: Stddev          0.4142          0.2785
        2 :: Count                1               -
        2 :: Min             1.9060               -
        2 :: Mean            1.9060               -
        2 :: Max             1.9060               -
        2 :: Stddev          0.0000               -
        9 :: Count            11266           11257
        9 :: Min             0.4990          0.4940
        9 :: Mean        27250.4669      24256.1919
        9 :: Max      11439211.0000    6008885.0000
        9 :: Stddev     226427.4624     186298.1430
      
      This benchmark creates one thread per CPU which references an amount of
      anonymous memory 1.5 times the size of physical RAM.  This pounds swap
      quite heavily and is intended to exercise THP a bit.
      
      Mean allocation time for order-1 is reduced as before.  It's also reduced
      for THP allocations but the variations here are pretty massive due to
      swap.  As before, maximum allocation times are significantly reduced.
      
      Overall, the patch reduces the mean and maximum allocation latencies for
      the smaller high-order allocations.  This was with Slab configured so it
      would be expected to be more significant with Slub which uses these size
      allocations more aggressively.
      
      The mean allocation times for THP allocations are also slightly reduced.
      The maximum latency was slightly increased as predicted by the comments
      due to reclaim/compaction breaking early.  However, workloads care more
      about the latency of lower-order allocations than THP so it's an
      acceptable trade-off.
      Signed-off-by: NMel Gorman <mel@csn.ul.ie>
      Acked-by: NAndrea Arcangeli <aarcange@redhat.com>
      Acked-by: NJohannes Weiner <hannes@cmpxchg.org>
      Reviewed-by: NMinchan Kim <minchan.kim@gmail.com>
      Acked-by: NAndrea Arcangeli <aarcange@redhat.com>
      Acked-by: NRik van Riel <riel@redhat.com>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Kent Overstreet <kent.overstreet@gmail.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      2876592f
  16. 12 2月, 2011 1 次提交
  17. 26 1月, 2011 1 次提交
    • D
      mm: fix deferred congestion timeout if preferred zone is not allowed · f33261d7
      David Rientjes 提交于
      Before 0e093d99 ("writeback: do not sleep on the congestion queue if
      there are no congested BDIs or if significant congestion is not being
      encountered in the current zone"), preferred_zone was only used for NUMA
      statistics, to determine the zoneidx from which to allocate from given
      the type requested, and whether to utilize memory compaction.
      
      wait_iff_congested(), though, uses preferred_zone to determine if the
      congestion wait should be deferred because its dirty pages are backed by
      a congested bdi.  This incorrectly defers the timeout and busy loops in
      the page allocator with various cond_resched() calls if preferred_zone
      is not allowed in the current context, usually consuming 100% of a cpu.
      
      This patch ensures preferred_zone is an allowed zone in the fastpath
      depending on whether current is constrained by its cpuset or nodes in
      its mempolicy (when the nodemask passed is non-NULL).  This is correct
      since the fastpath allocation always passes ALLOC_CPUSET when trying to
      allocate memory.  In the slowpath, this patch resets preferred_zone to
      the first zone of the allowed type when the allocation is not
      constrained by current's cpuset, i.e.  it does not pass ALLOC_CPUSET.
      
      This patch also ensures preferred_zone is from the set of allowed nodes
      when called from within direct reclaim since allocations are always
      constrained by cpusets in this context (it is blockable).
      
      Both of these uses of cpuset_current_mems_allowed are protected by
      get_mems_allowed().
      Signed-off-by: NDavid Rientjes <rientjes@google.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Minchan Kim <minchan.kim@gmail.com>
      Cc: Wu Fengguang <fengguang.wu@intel.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Acked-by: NRik van Riel <riel@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      f33261d7
  18. 21 1月, 2011 1 次提交