1. 25 June 2015 (2 commits)
    • mm: rename RECLAIM_SWAP to RECLAIM_UNMAP · 95bbc0c7
      Committed by Zhihui Zhang
      The name SWAP implies that we are dealing with anonymous pages only.  In
      fact, the original patch that introduced the min_unmapped_ratio logic
      was to fix an issue related to file pages.  Rename it to RECLAIM_UNMAP
      to match what it actually does.
      
      Historically, commit a6dc60f8 ("vmscan: rename sc.may_swap to
      may_unmap") renamed .may_swap to .may_unmap, leaving RECLAIM_SWAP
      behind.  commit 2e2e4259 ("vmscan,memcg: reintroduce sc->may_swap")
      reintroduced .may_swap for memory controller.
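
      A minimal sketch of the renamed mode bits (the values mirror
      mm/vmscan.c of this era; the small helper below is purely illustrative
      and not part of the patch):

        #define RECLAIM_OFF     0
        #define RECLAIM_ZONE    (1 << 0)   /* Run shrink_inactive_list on the zone */
        #define RECLAIM_WRITE   (1 << 1)   /* Writeout pages during reclaim */
        #define RECLAIM_UNMAP   (1 << 2)   /* Unmap pages during reclaim (was RECLAIM_SWAP) */

        /* Illustrative only: how the mode bit feeds scan_control.may_unmap. */
        static inline int reclaim_may_unmap(int zone_reclaim_mode)
        {
                return !!(zone_reclaim_mode & RECLAIM_UNMAP);
        }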
      Signed-off-by: Zhihui Zhang <zzhsuny@gmail.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Michal Hocko <mhocko@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      95bbc0c7
    • mm: vmscan: do not throttle based on pfmemalloc reserves if node has no reclaimable pages · f012a84a
      Committed by Nishanth Aravamudan
      Based upon 675becce ("mm: vmscan: do not throttle based on pfmemalloc
      reserves if node has no ZONE_NORMAL") from Mel.
      
      We have a system with the following topology:
      
      # numactl -H
      available: 3 nodes (0,2-3)
      node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22
      23 24 25 26 27 28 29 30 31
      node 0 size: 28273 MB
      node 0 free: 27323 MB
      node 2 cpus:
      node 2 size: 16384 MB
      node 2 free: 0 MB
      node 3 cpus: 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47
      node 3 size: 30533 MB
      node 3 free: 13273 MB
      node distances:
      node   0   2   3
        0:  10  20  20
        2:  20  10  20
        3:  20  20  10
      
      Node 2 has no free memory, because:
      # cat /sys/devices/system/node/node2/hugepages/hugepages-16777216kB/nr_hugepages
      1
      
      This leads to the following zoneinfo:
      
      Node 2, zone      DMA
        pages free     0
              min      1840
              low      2300
              high     2760
              scanned  0
              spanned  262144
              present  262144
              managed  262144
      ...
        all_unreclaimable: 1
      
      If one then attempts to allocate some normal 16M hugepages via
      
      echo 37 > /proc/sys/vm/nr_hugepages
      
      The echo never returns and kswapd2 consumes CPU cycles.
      
      This is because throttle_direct_reclaim ends up calling
      wait_event(pfmemalloc_wait, pfmemalloc_watermark_ok...).
      pfmemalloc_watermark_ok() in turn checks all zones on the node for any
      reserves, and if there are, indicates whether the watermarks are ok by
      seeing if there are sufficient free pages.

      675becce already added a condition for memoryless nodes.  In this case,
      though, the node has memory, but it is all consumed (and not
      reclaimable).  Effectively, the result of this call to
      pfmemalloc_watermark_ok() is the same, so it seems like a reasonable
      additional condition.
      
      With this change, the aforementioned 16M hugepage allocation attempt
      succeeds and correctly round-robins between Nodes 0 and 3.
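
      A simplified sketch of the resulting check (not the verbatim upstream
      diff): zones with nothing reclaimable are now skipped the same way
      unpopulated zones already were, so they can neither satisfy nor block
      the throttling decision:

        static bool pfmemalloc_watermark_ok_sketch(pg_data_t *pgdat)
        {
                unsigned long pfmemalloc_reserve = 0;
                unsigned long free_pages = 0;
                struct zone *zone;
                int i;

                for (i = 0; i <= ZONE_NORMAL; i++) {
                        zone = &pgdat->node_zones[i];
                        /* skip zones whose pages are all pinned, e.g. in hugepages */
                        if (!populated_zone(zone) || !zone_reclaimable(zone))
                                continue;

                        pfmemalloc_reserve += min_wmark_pages(zone);
                        free_pages += zone_page_state(zone, NR_FREE_PAGES);
                }

                /* No qualifying zone at all: do not throttle. */
                if (!pfmemalloc_reserve)
                        return true;

                return free_pages > pfmemalloc_reserve / 2;
        }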
      Signed-off-by: Nishanth Aravamudan <nacc@linux.vnet.ibm.com>
      Reviewed-by: Michal Hocko <mhocko@suse.cz>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Anton Blanchard <anton@samba.org>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Dan Streetman <ddstreet@ieee.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      f012a84a
  2. 13 February 2015 (1 commit)
    • vmscan: per memory cgroup slab shrinkers · cb731d6c
      Committed by Vladimir Davydov
      This patch adds SHRINKER_MEMCG_AWARE flag.  If a shrinker has this flag
      set, it will be called per memory cgroup.  The memory cgroup to scan
      objects from is passed in shrink_control->memcg.  If the memory cgroup
      is NULL, a memcg aware shrinker is supposed to scan objects from the
      global list.  Unaware shrinkers are only called on global pressure with
      memcg=NULL.
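
      A hedged sketch of what a memcg-aware shrinker registration looks like
      with this flag; the my_cache_count()/my_cache_shrink() helpers are
      hypothetical stand-ins for a real cache:

        static unsigned long my_count(struct shrinker *s, struct shrink_control *sc)
        {
                /* sc->memcg == NULL means global pressure; otherwise count only
                 * the objects accounted to that memory cgroup. */
                return my_cache_count(sc->memcg, sc->nid);
        }

        static unsigned long my_scan(struct shrinker *s, struct shrink_control *sc)
        {
                return my_cache_shrink(sc->memcg, sc->nid, sc->nr_to_scan);
        }

        static struct shrinker my_shrinker = {
                .count_objects  = my_count,
                .scan_objects   = my_scan,
                .seeks          = DEFAULT_SEEKS,
                .flags          = SHRINKER_NUMA_AWARE | SHRINKER_MEMCG_AWARE,
        };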
      Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: Glauber Costa <glommer@gmail.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Tejun Heo <tj@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      cb731d6c
  3. 12 February 2015 (3 commits)
    • mm: memcontrol: default hierarchy interface for memory · 241994ed
      Committed by Johannes Weiner
      Introduce the basic control files to account, partition, and limit
      memory using cgroups in default hierarchy mode.
      
      This interface versioning allows us to address fundamental design
      issues in the existing memory cgroup interface, further explained
      below.  The old interface will be maintained indefinitely, but a
      clearer model and improved workload performance should encourage
      existing users to switch over to the new one eventually.
      
      The control files are thus:
      
        - memory.current shows the current consumption of the cgroup and its
          descendants, in bytes.
      
        - memory.low configures the lower end of the cgroup's expected
          memory consumption range.  The kernel considers memory below that
          boundary to be a reserve - the minimum that the workload needs in
          order to make forward progress - and generally avoids reclaiming
          it, unless there is an imminent risk of entering an OOM situation.
      
        - memory.high configures the upper end of the cgroup's expected
          memory consumption range.  A cgroup whose consumption grows beyond
          this threshold is forced into direct reclaim, to work off the
          excess and to throttle new allocations heavily, but is generally
          allowed to continue and the OOM killer is not invoked.
      
        - memory.max configures the hard maximum amount of memory that the
          cgroup is allowed to consume before the OOM killer is invoked.
      
        - memory.events shows event counters that indicate how often the
          cgroup was reclaimed while below memory.low, how often it was
          forced to reclaim excess beyond memory.high, how often it hit
          memory.max, and how often it entered OOM due to memory.max.  This
          allows users to identify configuration problems when observing a
          degradation in workload performance.  An overcommitted system will
          have an increased rate of low boundary breaches, whereas increased
          rates of high limit breaches, maximum hits, or even OOM situations
          will indicate internally overcommitted cgroups.
      
      For existing users of memory cgroups, the following deviations from
      the current interface are worth pointing out and explaining:
      
        - The original lower boundary, the soft limit, is defined as a limit
          that is per default unset.  As a result, the set of cgroups that
          global reclaim prefers is opt-in, rather than opt-out.  The costs
          for optimizing these mostly negative lookups are so high that the
          implementation, despite its enormous size, does not even provide
          the basic desirable behavior.  First off, the soft limit has no
          hierarchical meaning.  All configured groups are organized in a
          global rbtree and treated like equal peers, regardless of where they
          are located in the hierarchy.  This makes subtree delegation
          impossible.  Second, the soft limit reclaim pass is so aggressive
          that it not just introduces high allocation latencies into the
          system, but also impacts system performance due to overreclaim, to
          the point where the feature becomes self-defeating.
      
          The memory.low boundary on the other hand is a top-down allocated
          reserve.  A cgroup enjoys reclaim protection when it and all its
          ancestors are below their low boundaries, which makes delegation
          of subtrees possible.  Secondly, new cgroups have no reserve per
          default and in the common case most cgroups are eligible for the
          preferred reclaim pass.  This allows the new low boundary to be
          efficiently implemented with just a minor addition to the generic
          reclaim code, without the need for out-of-band data structures and
          reclaim passes.  Because the generic reclaim code considers all
          cgroups except for the ones running low in the preferred first
          reclaim pass, overreclaim of individual groups is eliminated as
          well, resulting in much better overall workload performance.
      
        - The original high boundary, the hard limit, is defined as a strict
          limit that can not budge, even if the OOM killer has to be called.
          But this generally goes against the goal of making the most out of
          the available memory.  The memory consumption of workloads varies
          during runtime, and that requires users to overcommit.  But doing
          that with a strict upper limit requires either a fairly accurate
          prediction of the working set size or adding slack to the limit.
          Since working set size estimation is hard and error prone, and
          getting it wrong results in OOM kills, most users tend to err on
          the side of a looser limit and end up wasting precious resources.
      
          The memory.high boundary on the other hand can be set much more
          conservatively.  When hit, it throttles allocations by forcing
          them into direct reclaim to work off the excess, but it never
          invokes the OOM killer.  As a result, a high boundary that is
          chosen too aggressively will not terminate the processes, but
          instead it will lead to gradual performance degradation.  The user
          can monitor this and make corrections until the minimal memory
          footprint that still gives acceptable performance is found.
      
          In extreme cases, with many concurrent allocations and a complete
          breakdown of reclaim progress within the group, the high boundary
          can be exceeded.  But even then it's mostly better to satisfy the
          allocation from the slack available in other groups or the rest of
          the system than killing the group.  Otherwise, memory.max is there
          to limit this type of spillover and ultimately contain buggy or
          even malicious applications.
      
        - The original control file names are unwieldy and inconsistent in
          many different ways.  For example, the upper boundary hit count is
          exported in the memory.failcnt file, but an OOM event count has to
          be manually counted by listening to memory.oom_control events, and
          lower boundary / soft limit events have to be counted by first
          setting a threshold for that value and then counting those events.
          Also, usage and limit files encode their units in the filename.
          That makes the filenames very long, even though this is not
          information that a user needs to be reminded of every time they
          type out those names.
      
          To address these naming issues, as well as to signal clearly that
          the new interface carries a new configuration model, the naming
          conventions in it necessarily differ from the old interface.
      
        - The original limit files indicate the state of an unset limit with
          a very high number, and a configured limit can be unset by echoing
          -1 into those files.  But that very high number is implementation
          and architecture dependent and not very descriptive.  And while -1
          can be understood as an underflow into the highest possible value,
          -2 or -10M etc. do not work, so it's not consistent.
      
          memory.low, memory.high, and memory.max will use the string
          "infinity" to indicate and set the highest possible value.
      
      [akpm@linux-foundation.org: use seq_puts() for basic strings]
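
      A hypothetical userspace sketch of driving these files; the cgroup path
      and the byte values are examples, not part of the patch (per the text
      above, writing "infinity" is how a limit is left or set unset in this
      version of the interface):

        #include <stdio.h>

        static int write_knob(const char *path, const char *value)
        {
                FILE *f = fopen(path, "w");

                if (!f)
                        return -1;
                fprintf(f, "%s\n", value);
                return fclose(f);
        }

        int main(void)
        {
                /* Protect ~512M, start reclaim/throttling above 1G, no hard cap. */
                write_knob("/sys/fs/cgroup/workload/memory.low",  "536870912");
                write_knob("/sys/fs/cgroup/workload/memory.high", "1073741824");
                write_knob("/sys/fs/cgroup/workload/memory.max",  "infinity");
                return 0;
        }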
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Michal Hocko <mhocko@suse.cz>
      Cc: Vladimir Davydov <vdavydov@parallels.com>
      Cc: Greg Thelen <gthelen@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      241994ed
    • vmscan: force scan offline memory cgroups · 90cbc250
      Committed by Vladimir Davydov
      Since commit b2052564 ("mm: memcontrol: continue cache reclaim from
      offlined groups") pages charged to a memory cgroup are not reparented when
      the cgroup is removed.  Instead, they are supposed to be reclaimed in a
      regular way, along with pages accounted to online memory cgroups.
      
      However, an lruvec of an offline memory cgroup will sooner or later get so
      small that it will be scanned only at low scan priorities (see
      get_scan_count()).  Therefore, if there are enough reclaimable pages in
      big lruvecs, pages accounted to offline memory cgroups will never be
      scanned at all, wasting memory.
      
      Fix this by unconditionally forcing scanning dead lruvecs from kswapd.
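
      A hedged sketch of the idea in get_scan_count(): kswapd force-scans a
      dead memcg's lruvec even at low scan priority (mem_cgroup_lruvec_online()
      names the "is this memcg still online" test; treat the exact helper and
      surrounding code as schematic, not the verbatim diff):

        if (current_is_kswapd()) {
                if (!zone_reclaimable(zone))
                        force_scan = true;
                /* Offline memcgs never grow, so never skip their lruvecs. */
                if (!mem_cgroup_lruvec_online(lruvec))
                        force_scan = true;
        }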
      
      [akpm@linux-foundation.org: fix build]
      Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
      Acked-by: Michal Hocko <mhocko@suse.cz>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Tejun Heo <tj@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      90cbc250
    • mm, vmscan: wake up all pfmemalloc-throttled processes at once · cfc51155
      Committed by Vlastimil Babka
      Kswapd in balance_pgdate() currently uses wake_up() on processes waiting
      in throttle_direct_reclaim(), which only wakes up a single process.  This
      might leave processes waiting for longer than necessary, until the check
      is reached in the next loop iteration.  Processes might also be left
      waiting if zone was fully balanced in single iteration.  Note that the
      comment in balance_pgdat() also says "Wake them", so waking up a single
      process does not seem intentional.
      
      Thus, replace wake_up() with wake_up_all().
      Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Vladimir Davydov <vdavydov@parallels.com>
      Acked-by: Rik van Riel <riel@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      cfc51155
  4. 27 January 2015 (1 commit)
    • mm/vmscan: fix highidx argument type · 17636faa
      Committed by Michael S. Tsirkin
      for_each_zone_zonelist_nodemask wants an enum zone_type argument, but is
      passed gfp_t:
      
        mm/vmscan.c:2658:9: warning: incorrect type in argument 2 (different base types)
        mm/vmscan.c:2658:9:    expected int enum zone_type [signed] highest_zoneidx
        mm/vmscan.c:2658:9:    got restricted gfp_t [usertype] gfp_mask
        mm/vmscan.c:2658:9:    expected int enum zone_type [signed] highest_zoneidx
        mm/vmscan.c:2658:9:    got restricted gfp_t [usertype] gfp_mask
      
      Convert the argument to the correct type.
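
      An illustrative before/after of the call in shrink_zones(); the walker
      wants an enum zone_type, which gfp_zone() derives from the gfp mask:

        /* before: restricted gfp_t passed where enum zone_type is expected */
        for_each_zone_zonelist_nodemask(zone, z, zonelist,
                                        gfp_mask, sc->nodemask)

        /* after */
        enum zone_type requested_highidx = gfp_zone(gfp_mask);

        for_each_zone_zonelist_nodemask(zone, z, zonelist,
                                        requested_highidx, sc->nodemask)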
      Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Vladimir Davydov <vdavydov@parallels.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Suleiman Souhlal <suleiman@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      17636faa
  5. 21 January 2015 (1 commit)
  6. 09 January 2015 (1 commit)
    • mm, vmscan: prevent kswapd livelock due to pfmemalloc-throttled process being killed · 9e5e3661
      Committed by Vlastimil Babka
      Charles Shirron and Paul Cassella from Cray Inc have reported kswapd
      stuck in a busy loop with nothing left to balance, but
      kswapd_try_to_sleep() failing to sleep.  Their analysis found the cause
      to be a combination of several factors:
      
      1. A process is waiting in throttle_direct_reclaim() on pgdat->pfmemalloc_wait
      
      2. The process has been killed (by OOM in this case), but has not yet been
         scheduled to remove itself from the waitqueue and die.
      
      3. kswapd checks for throttled processes in prepare_kswapd_sleep():
      
              if (waitqueue_active(&pgdat->pfmemalloc_wait)) {
                      wake_up(&pgdat->pfmemalloc_wait);
                      return false; // kswapd will not go to sleep
              }
      
         However, for a process that was already killed, wake_up() does not remove
         the process from the waitqueue, since try_to_wake_up() checks its state
         first and returns false when the process is no longer waiting.
      
      4. kswapd is running on the same CPU as the only CPU that the process is
         allowed to run on (through cpus_allowed, or possibly single-cpu system).
      
      5. CONFIG_PREEMPT_NONE=y kernel is used. If there's nothing to balance, kswapd
         encounters no voluntary preemption points and repeatedly fails
         prepare_kswapd_sleep(), blocking the process from running and removing
         itself from the waitqueue, which would let kswapd sleep.
      
      So, the source of the problem is that we prevent kswapd from going to
      sleep until there are processes waiting on the pfmemalloc_wait queue,
      and a process waiting on a queue is guaranteed to be removed from the
      queue only when it gets scheduled.  This was done to make sure that no
      process is left sleeping on pfmemalloc_wait when kswapd itself goes to
      sleep.
      
      However, it isn't necessary to postpone kswapd sleep until the
      pfmemalloc_wait queue actually empties.  To prevent processes from being
      left sleeping, it's actually enough to guarantee that all processes
      waiting on pfmemalloc_wait queue have been woken up by the time we put
      kswapd to sleep.
      
      This patch therefore fixes this issue by substituting 'wake_up' with
      'wake_up_all' and removing 'return false' in the code snippet from
      prepare_kswapd_sleep() above.  Note that if any process puts itself in
      the queue after this waitqueue_active() check, or after the wake up
      itself, it means that the process will also wake up kswapd - and since
      we are under prepare_to_wait(), the wake up won't be missed.  Also we
      update the comment in prepare_kswapd_sleep() to hopefully more clearly
      describe the races it is preventing.
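
      A sketch of the fixed check, for comparison with the snippet quoted
      above:

        if (waitqueue_active(&pgdat->pfmemalloc_wait))
                wake_up_all(&pgdat->pfmemalloc_wait);

        /* kswapd may now go to sleep: any task queuing itself after this
         * point wakes kswapd itself, and we are under prepare_to_wait(),
         * so that wakeup cannot be missed. */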
      
      Fixes: 5515061d ("mm: throttle direct reclaimers if PF_MEMALLOC reserves are low and swap is backed by network storage")
      Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Michal Hocko <mhocko@suse.cz>
      Acked-by: Rik van Riel <riel@redhat.com>
      Cc: <stable@vger.kernel.org>	[3.6+]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      9e5e3661
  7. 14 December 2014 (1 commit)
    • mm: vmscan: invoke slab shrinkers from shrink_zone() · 6b4f7799
      Committed by Johannes Weiner
      The slab shrinkers are currently invoked from the zonelist walkers in
      kswapd, direct reclaim, and zone reclaim, all of which roughly gauge the
      eligible LRU pages and assemble a nodemask to pass to NUMA-aware
      shrinkers, which then again have to walk over the nodemask.  This is
      redundant code, extra runtime work, and fairly inaccurate when it comes to
      the estimation of actually scannable LRU pages.  The code duplication will
      only get worse when making the shrinkers cgroup-aware and requiring them
      to have out-of-band cgroup hierarchy walks as well.
      
      Instead, invoke the shrinkers from shrink_zone(), which is where all
      reclaimers end up, to avoid this duplication.
      
      Take the count for eligible LRU pages out of get_scan_count(), which
      considers many more factors than just the availability of swap space, like
      zone_reclaimable_pages() currently does.  Accumulate the number over all
      visited lruvecs to get the per-zone value.
      
      Some nodes have multiple zones due to memory addressing restrictions.  To
      avoid putting too much pressure on the shrinkers, only invoke them once
      for each such node, using the class zone of the allocation as the pivot
      zone.
      
      For now, this integrates the slab shrinking better into the reclaim logic
      and gets rid of duplicative invocations from kswapd, direct reclaim, and
      zone reclaim.  It also prepares for cgroup-awareness, allowing
      memcg-capable shrinkers to be added at the lruvec level without much
      duplication of both code and runtime work.
      
      This changes kswapd behavior, which used to invoke the shrinkers for each
      zone, but with scan ratios gathered from the entire node, resulting in
      meaningless pressure quantities on multi-zone nodes.
      
      Zone reclaim behavior also changes.  It used to shrink slabs until the
      same amount of pages were shrunk as were reclaimed from the LRUs.  Now it
      merely invokes the shrinkers once with the zone's scan ratio, which makes
      the shrinkers go easier on caches that implement aging and would prefer
      feeding back pressure from recently used slab objects to unused LRU pages.
      
      [vdavydov@parallels.com: assure class zone is populated]
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Dave Chinner <david@fromorbit.com>
      Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      6b4f7799
  8. 11 December 2014 (3 commits)
    • mm, compaction: pass classzone_idx and alloc_flags to watermark checking · ebff3980
      Committed by Vlastimil Babka
      Compaction relies on zone watermark checks for decisions such as if it's
      worth to start compacting in compaction_suitable() or whether compaction
      should stop in compact_finished().  The watermark checks take
      classzone_idx and alloc_flags parameters, which are related to the memory
      allocation request.  But from the context of compaction they are currently
      passed as 0, including the direct compaction which is invoked to satisfy
      the allocation request, and could therefore know the proper values.
      
      The lack of proper values can lead to mismatch between decisions taken
      during compaction and decisions related to the allocation request.  Lack
      of proper classzone_idx value means that lowmem_reserve is not taken into
      account.  This has manifested (during recent changes to deferred
      compaction) when DMA zone was used as fallback for preferred Normal zone.
      compaction_suitable() without proper classzone_idx would think that the
      watermarks are already satisfied, but watermark check in
      get_page_from_freelist() would fail.  Because of this problem, deferring
      compaction has extra complexity that can be removed in the following
      patch.
      
      The issue (not confirmed in practice) with missing alloc_flags is opposite
      in nature.  For allocations that include ALLOC_HIGH, ALLOC_HARDER or
      ALLOC_CMA in alloc_flags (the last includes all MOVABLE allocations on
      CMA-enabled systems) the watermark checking in compaction with 0 passed
      will be stricter than in get_page_from_freelist().  In these cases
      compaction might be running for a longer time than is really needed.
      
      Another issue with compaction_suitable() is that the check for "does
      the zone need compaction at all?" comes only after the check "does the
      zone have enough free pages to succeed compaction".  The latter considers extra
      pages for migration and can therefore in some situations fail and return
      COMPACT_SKIPPED, although the high-order allocation would succeed and we
      should return COMPACT_PARTIAL.
      
      This patch fixes these problems by adding alloc_flags and classzone_idx to
      struct compact_control and related functions involved in direct compaction
      and watermark checking.  Where possible, all other callers of
      compaction_suitable() pass proper values where those are known.  This is
      currently limited to classzone_idx, which is sometimes known in kswapd
      context.  However, the direct reclaim callers should_continue_reclaim()
      and compaction_ready() do not currently know the proper values, so the
      coordination between reclaim and compaction may still not be as accurate
      as it could be.  This can be fixed later, if it's shown to be an issue.
      
      Additionally the checks in compaction_suitable() are reordered to address the
      second issue described above.
      
      The effect of this patch should be slightly better high-order allocation
      success rates and/or less compaction overhead, depending on the type of
      allocations and presence of CMA.  It allows simplifying deferred
      compaction code in a followup patch.
      
      When testing with stress-highalloc, there was some slight improvement
      (which might be just due to variance) in success rates of non-THP-like
      allocations.
      Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Michal Nazarewicz <mina86@mina86.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Christoph Lameter <cl@linux.com>
      Acked-by: Rik van Riel <riel@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      ebff3980
    • mm: vmscan: count only dirty pages as congested · 1da58ee2
      Committed by Jamie Liu
      shrink_page_list() counts all pages with a mapping, including clean pages,
      toward nr_congested if they're on a write-congested BDI.
      shrink_inactive_list() then sets ZONE_CONGESTED if nr_dirty ==
      nr_congested.  Fix this apples-to-oranges comparison by only counting
      pages for nr_congested if they count for nr_dirty.
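
      A hedged sketch of the adjusted accounting in shrink_page_list() (field
      and helper names as in kernels of this era; simplified, not the
      verbatim diff):

        if (dirty || writeback)
                nr_dirty++;

        /* Only pages that counted as dirty can also count as congested. */
        if (((dirty || writeback) && mapping &&
             bdi_write_congested(mapping->backing_dev_info)) ||
            (writeback && PageReclaim(page)))
                nr_congested++;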
      Signed-off-by: Jamie Liu <jamieliu@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: Hugh Dickins <hughd@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      1da58ee2
    • mm/vmscan.c: replace printk with pr_err · 8612c663
      Committed by Pintu Kumar
      This patch replaces printk(KERN_ERR ...) with pr_err() found under
      shrink_slab().  It also saves one line because of the formatting.
      Signed-off-by: Pintu Kumar <pintu.k@samsung.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      8612c663
  9. 27 October 2014 (1 commit)
    • cpuset: simplify cpuset_node_allowed API · 344736f2
      Committed by Vladimir Davydov
      The current cpuset API for checking if a zone/node is allowed to allocate
      from looks rather awkward. We have hardwall and softwall versions of
      cpuset_node_allowed with the softwall version doing literally the same
      as the hardwall version if __GFP_HARDWALL is passed to it in gfp flags.
      If it isn't, the softwall version may check the given node against the
      enclosing hardwall cpuset, which it needs to take the callback lock to
      do.
      
      Such a distinction was introduced by commit 02a0e53d ("cpuset:
      rework cpuset_zone_allowed api"). Before, we had the only version with
      the __GFP_HARDWALL flag determining its behavior. The purpose of the
      commit was to avoid sleep-in-atomic bugs when someone would mistakenly
      call the function without the __GFP_HARDWALL flag for an atomic
      allocation. The suffixes introduced were intended to make the callers
      think before using the function.
      
      However, since the callback lock was converted from mutex to spinlock by
      the previous patch, the softwall check function cannot sleep, and these
      precautions are no longer necessary.
      
      So let's simplify the API back to the single check.
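
      An illustrative caller after the simplification; the single check takes
      the gfp mask and lets __GFP_HARDWALL select the stricter behavior
      (shown schematically, assuming the unified helper name from the title):

        if (!cpuset_node_allowed(zone_to_nid(zone), gfp_mask))
                continue;       /* this node is off-limits for the allocation */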
      Suggested-by: David Rientjes <rientjes@google.com>
      Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
      Acked-by: Christoph Lameter <cl@linux.com>
      Acked-by: Zefan Li <lizefan@huawei.com>
      Signed-off-by: Tejun Heo <tj@kernel.org>
      344736f2
  10. 10 October 2014 (4 commits)
    • mm: memcontrol: fix transparent huge page allocations under pressure · b70a2a21
      Committed by Johannes Weiner
      In a memcg with even just moderate cache pressure, success rates for
      transparent huge page allocations drop to zero, wasting a lot of effort
      that the allocator puts into assembling these pages.
      
      The reason for this is that the memcg reclaim code was never designed for
      higher-order charges.  It reclaims in small batches until there is room
      for at least one page.  Huge page charges only succeed when these batches
      add up over a series of huge faults, which is unlikely under any
      significant load involving order-0 allocations in the group.
      
      Remove that loop on the memcg side in favor of passing the actual reclaim
      goal to direct reclaim, which is already set up and optimized to meet
      higher-order goals efficiently.
      
      This brings memcg's THP policy in line with the system policy: if the
      allocator painstakingly assembles a hugepage, memcg will at least make an
      honest effort to charge it.  As a result, transparent hugepage allocation
      rates amid cache activity are drastically improved:
      
                                            vanilla                 patched
      pgalloc                 4717530.80 (  +0.00%)   4451376.40 (  -5.64%)
      pgfault                  491370.60 (  +0.00%)    225477.40 ( -54.11%)
      pgmajfault                    2.00 (  +0.00%)         1.80 (  -6.67%)
      thp_fault_alloc               0.00 (  +0.00%)       531.60 (+100.00%)
      thp_fault_fallback          749.00 (  +0.00%)       217.40 ( -70.88%)
      
      [ Note: this may in turn increase memory consumption from internal
        fragmentation, which is an inherent risk of transparent hugepages.
        Some setups may have to adjust the memcg limits accordingly to
        accommodate this - or, if the machine is already packed to capacity,
        disable the transparent huge page feature. ]
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Reviewed-by: Vladimir Davydov <vdavydov@parallels.com>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Dave Hansen <dave@sr71.net>
      Cc: Greg Thelen <gthelen@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      b70a2a21
    • mm: clean up zone flags · 57054651
      Committed by Johannes Weiner
      Page reclaim tests zone_is_reclaim_dirty(), but the site that actually
      sets this state does zone_set_flag(zone, ZONE_TAIL_LRU_DIRTY), sending the
      reader through layers of indirection just to track down a simple bit.
      
      Remove all zone flag wrappers and just use bitops against zone->flags
      directly.  It's just as readable and the lines are barely any longer.
      
      Also rename ZONE_TAIL_LRU_DIRTY to ZONE_DIRTY to match ZONE_WRITEBACK, and
      remove the zone_flags_t typedef.
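
      A before/after sketch of the style change described above:

        /* before: a wrapper per flag */
        zone_set_flag(zone, ZONE_TAIL_LRU_DIRTY);
        if (zone_is_reclaim_dirty(zone))
                ...

        /* after: plain bitops on zone->flags, flag renamed to ZONE_DIRTY */
        set_bit(ZONE_DIRTY, &zone->flags);
        if (test_bit(ZONE_DIRTY, &zone->flags))
                ...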
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: David Rientjes <rientjes@google.com>
      Acked-by: Mel Gorman <mgorman@suse.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      57054651
    • mm: remove noisy remainder of the scan_unevictable interface · 1f13ae39
      Committed by Johannes Weiner
      The deprecation warnings for the scan_unevictable interface are
      triggered by scripts doing `sysctl -a | grep something else'.  This is
      annoying and not helpful.
      
      The interface has been defunct since 264e56d8 ("mm: disable user
      interface to manually rescue unevictable pages"), which was in 2011, and
      there haven't been any reports of usecases for it, only reports that the
      deprecation warnings are annoying.  It's unlikely that anybody is using
      this interface specifically at this point, so remove it.
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      1f13ae39
    • mm, compaction: defer each zone individually instead of preferred zone · 53853e2d
      Committed by Vlastimil Babka
      When direct sync compaction is often unsuccessful, it may become deferred
      for some time to avoid further useless attempts, both sync and async.
      Successful high-order allocations un-defer compaction, while further
      unsuccessful compaction attempts prolong the compaction deferred period.
      
      Currently the checking and setting deferred status is performed only on
      the preferred zone of the allocation that invoked direct compaction.  But
      compaction itself is attempted on all eligible zones in the zonelist, so
      the behavior is suboptimal and may lead both to scenarios where 1)
      compaction is attempted uselessly, or 2) where it's not attempted despite
      good chances of succeeding, as shown on the examples below:
      
      1) A direct compaction with Normal preferred zone failed and set
         deferred compaction for the Normal zone.  Another unrelated direct
         compaction with DMA32 as preferred zone will attempt to compact DMA32
         zone even though the first compaction attempt also included DMA32 zone.
      
         In another scenario, compaction with Normal preferred zone failed to
         compact Normal zone, but succeeded in the DMA32 zone, so it will not
         defer compaction.  In the next attempt, it will try Normal zone which
         will fail again, instead of skipping Normal zone and trying DMA32
         directly.
      
      2) Kswapd will balance DMA32 zone and reset defer status based on
         watermarks looking good.  A direct compaction with preferred Normal
         zone will skip compaction of all zones including DMA32 because Normal
         was still deferred.  The allocation might have succeeded in DMA32, but
         won't.
      
      This patch makes compaction deferring work on individual zone basis
      instead of preferred zone.  For each zone, it checks compaction_deferred()
      to decide if the zone should be skipped.  If watermarks fail after
      compacting the zone, defer_compaction() is called.  The zone where
      watermarks passed can still be deferred when the allocation attempt is
      unsuccessful.  When allocation is successful, compaction_defer_reset() is
      called for the zone containing the allocated page.  This approach should
      approximate calling defer_compaction() only on zones where compaction was
      attempted and did not yield allocated page.  There might be corner cases
      but that is inevitable as long as the decision to stop compacting does not
      guarantee that a page will be allocated.
      
      Due to a new COMPACT_DEFERRED return value, some functions relying
      implicitly on COMPACT_SKIPPED = 0 had to be updated, with comments made
      more accurate.  The did_some_progress output parameter of
      __alloc_pages_direct_compact() is removed completely, as the caller
      actually does not use it after compaction sets it - it is only considered
      when direct reclaim sets it.
      
      During testing on a two-node machine with a single very small Normal zone
      on node 1, this patch has improved success rates in the stress-highalloc
      mmtests benchmark.  The success rates here were previously made worse by commit
      3a025760 ("mm: page_alloc: spill to remote nodes before waking
      kswapd") as kswapd was no longer resetting often enough the deferred
      compaction for the Normal zone, and DMA32 zones on both nodes were thus
      not considered for compaction.  On different machine, success rates were
      improved with __GFP_NO_KSWAPD allocations.
      
      [akpm@linux-foundation.org: fix CONFIG_COMPACTION=n build]
      Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
      Acked-by: Minchan Kim <minchan@kernel.org>
      Reviewed-by: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
      Acked-by: Mel Gorman <mgorman@suse.de>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Michal Nazarewicz <mina86@mina86.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      53853e2d
  11. 09 August 2014 (2 commits)
    • mm: memcontrol: use page lists for uncharge batching · 747db954
      Committed by Johannes Weiner
      Pages are now uncharged at release time, and all sources of batched
      uncharges operate on lists of pages.  Directly use those lists, and
      get rid of the per-task batching state.
      
      This also batches statistics accounting, in addition to the res
      counter charges, to reduce IRQ-disabling and re-enabling.
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Michal Hocko <mhocko@suse.cz>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Vladimir Davydov <vdavydov@parallels.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Vladimir Davydov <vdavydov@parallels.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      747db954
    • mm: memcontrol: rewrite uncharge API · 0a31bc97
      Committed by Johannes Weiner
      The memcg uncharging code that is involved towards the end of a page's
      lifetime - truncation, reclaim, swapout, migration - is impressively
      complicated and fragile.
      
      Because anonymous and file pages were always charged before they had their
      page->mapping established, uncharges had to happen when the page type
      could still be known from the context; as in unmap for anonymous, page
      cache removal for file and shmem pages, and swap cache truncation for swap
      pages.  However, these operations happen well before the page is actually
      freed, and so a lot of synchronization is necessary:
      
      - Charging, uncharging, page migration, and charge migration all need
        to take a per-page bit spinlock as they could race with uncharging.
      
      - Swap cache truncation happens during both swap-in and swap-out, and
        possibly repeatedly before the page is actually freed.  This means
        that the memcg swapout code is called from many contexts that make
        no sense and it has to figure out the direction from page state to
        make sure memory and memory+swap are always correctly charged.
      
      - On page migration, the old page might be unmapped but then reused,
        so memcg code has to prevent untimely uncharging in that case.
        Because this code - which should be a simple charge transfer - is so
        special-cased, it is not reusable for replace_page_cache().
      
      But now that charged pages always have a page->mapping, introduce
      mem_cgroup_uncharge(), which is called after the final put_page(), when we
      know for sure that nobody is looking at the page anymore.
      
      For page migration, introduce mem_cgroup_migrate(), which is called after
      the migration is successful and the new page is fully rmapped.  Because
      the old page is no longer uncharged after migration, prevent double
      charges by decoupling the page's memcg association (PCG_USED and
      pc->mem_cgroup) from the page holding an actual charge.  The new bits
      PCG_MEM and PCG_MEMSW represent the respective charges and are transferred
      to the new page during migration.
      
      mem_cgroup_migrate() is suitable for replace_page_cache() as well,
      which gets rid of mem_cgroup_replace_page_cache().  However, care
      needs to be taken because both the source and the target page can
      already be charged and on the LRU when fuse is splicing: grab the page
      lock on the charge moving side to prevent changing pc->mem_cgroup of a
      page under migration.  Also, the lruvecs of both pages change as we
      uncharge the old and charge the new during migration, and putback may
      race with us, so grab the lru lock and isolate the pages iff on LRU to
      prevent races and ensure the pages are on the right lruvec afterward.
      
      Swap accounting is massively simplified: because the page is no longer
      uncharged as early as swap cache deletion, a new mem_cgroup_swapout() can
      transfer the page's memory+swap charge (PCG_MEMSW) to the swap entry
      before the final put_page() in page reclaim.
      
      Finally, page_cgroup changes are now protected by whatever protection the
      page itself offers: anonymous pages are charged under the page table lock,
      whereas page cache insertions, swapin, and migration hold the page lock.
      Uncharging happens under full exclusion with no outstanding references.
      Charging and uncharging also ensure that the page is off-LRU, which
      serializes against charge migration.  Remove the very costly page_cgroup
      lock and set pc->flags non-atomically.
      
      [mhocko@suse.cz: mem_cgroup_charge_statistics needs preempt_disable]
      [vdavydov@parallels.com: fix flags definition]
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Vladimir Davydov <vdavydov@parallels.com>
      Tested-by: Jet Chen <jet.chen@intel.com>
      Acked-by: Michal Hocko <mhocko@suse.cz>
      Tested-by: Felipe Balbi <balbi@ti.com>
      Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      0a31bc97
  12. 07 August 2014 (9 commits)
  13. 09 June 2014 (1 commit)
    • Don't trigger congestion wait on dirty-but-not-writeout pages · b738d764
      Committed by Linus Torvalds
      shrink_inactive_list() used to wait 0.1s to avoid congestion when all
      the pages that were isolated from the inactive list were dirty but not
      under active writeback.  That makes no real sense, and apparently causes
      major interactivity issues under some loads since 3.11.
      
      The ostensible reason for it was to wait for kswapd to start writing
      pages, but that seems questionable as well, since the congestion wait
      code seems to trigger for kswapd itself as well.  Also, the logic behind
      delaying anything when we haven't actually started writeback is not
      clear - it only delays actually starting that writeback.
      
      We'll still trigger the congestion waiting if
      
       (a) the process is kswapd, and we hit pages flagged for immediate
           reclaim
      
       (b) the process is not kswapd, and the zone backing dev writeback is
           actually congested.
      
      This probably needs to be revisited, but as it is this fixes a reported
      regression.
      Reported-by: Felipe Contreras <felipe.contreras@gmail.com>
      Pinpointed-by: Hillf Danton <dhillf@gmail.com>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      b738d764
  14. 07 June 2014 (3 commits)
    • mm: convert some level-less printks to pr_* · b1de0d13
      Committed by Mitchel Humpherys
      printk is meant to be used with an associated log level.  There are some
      instances of printk scattered around the mm code where the log level is
      missing.  Add a log level and adhere to suggestions by
      scripts/checkpatch.pl by moving to the pr_* macros.
      
      Also add the typical pr_fmt definition so that print statements can be
      easily traced back to the modules where they occur, correlated one with
      another, etc.  This will require the removal of some (now redundant)
      prefixes on a few print statements.
      Signed-off-by: Mitchel Humpherys <mitchelh@codeaurora.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      b1de0d13
    • vmscan: memcg: always use swappiness of the reclaimed memcg · 688eb988
      Committed by Michal Hocko
      Memory reclaim always uses swappiness of the reclaim target memcg
      (origin of the memory pressure) or vm_swappiness for global memory
      reclaim.  This behavior was consistent (except for difference between
      global and hard limit reclaim) because swappiness was enforced to be
      consistent within each memcg hierarchy.
      
      After "mm: memcontrol: remove hierarchy restrictions for swappiness and
      oom_control" each memcg can have its own swappiness independent of
      hierarchical parents, though, so the consistency guarantee is gone.
      This can lead to an unexpected behavior.  Say that a group is explicitly
      configured to not swapout by memory.swappiness=0 but its memory gets
      swapped out anyway when the memory pressure comes from its parent with a
      non-zero swappiness.  It is also unexpected that the knob is meaningless
      without setting the
      hard limit which would trigger the reclaim and enforce the swappiness.
      There are setups where the hard limit is configured higher in the
      hierarchy by an administrator and children groups are under control of
      somebody else who is interested in the swapout behavior but not
      necessarily about the memory limit.
      
      From a semantic point of view swappiness is an attribute defining anon
      vs. file proportional scanning of the LRU which is memcg specific
      (unlike charges which are propagated up the hierarchy) so it should be
      applied to the particular memcg's LRU regardless of where the memory
      pressure comes from.
      
      This patch removes vmscan_swappiness() and stores the swappiness into
      the scan_control structure.  mem_cgroup_swappiness is then used to
      provide the correct value before shrink_lruvec is called.  The global
      vm_swappiness is used for the root memcg.
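
      Schematically, the core of the change (the exact plumbing through
      scan_control/shrink_lruvec() follows the patch; only the key lookup is
      shown here):

        /* swappiness of the memcg being scanned, not of the pressure origin;
         * the root memcg falls back to the global vm_swappiness. */
        sc->swappiness = mem_cgroup_swappiness(memcg);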
      
      [hughd@google.com: oopses immediately when booted with cgroup_disable=memory]
      Signed-off-by: Michal Hocko <mhocko@suse.cz>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Tejun Heo <tj@kernel.org>
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      688eb988
    • mm: vmscan: clear kswapd's special reclaim powers before exiting · 71abdc15
      Committed by Johannes Weiner
      When kswapd exits, it can end up taking locks that were previously held
      by allocating tasks while they waited for reclaim.  Lockdep currently
      warns about this:
      
      On Wed, May 28, 2014 at 06:06:34PM +0800, Gu Zheng wrote:
      >  inconsistent {RECLAIM_FS-ON-W} -> {IN-RECLAIM_FS-R} usage.
      >  kswapd2/1151 [HC0[0]:SC0[0]:HE1:SE1] takes:
      >   (&sig->group_rwsem){+++++?}, at: exit_signals+0x24/0x130
      >  {RECLAIM_FS-ON-W} state was registered at:
      >     mark_held_locks+0xb9/0x140
      >     lockdep_trace_alloc+0x7a/0xe0
      >     kmem_cache_alloc_trace+0x37/0x240
      >     flex_array_alloc+0x99/0x1a0
      >     cgroup_attach_task+0x63/0x430
      >     attach_task_by_pid+0x210/0x280
      >     cgroup_procs_write+0x16/0x20
      >     cgroup_file_write+0x120/0x2c0
      >     vfs_write+0xc0/0x1f0
      >     SyS_write+0x4c/0xa0
      >     tracesys+0xdd/0xe2
      >  irq event stamp: 49
      >  hardirqs last  enabled at (49):  _raw_spin_unlock_irqrestore+0x36/0x70
      >  hardirqs last disabled at (48):  _raw_spin_lock_irqsave+0x2b/0xa0
      >  softirqs last  enabled at (0):  copy_process.part.24+0x627/0x15f0
      >  softirqs last disabled at (0):            (null)
      >
      >  other info that might help us debug this:
      >   Possible unsafe locking scenario:
      >
      >         CPU0
      >         ----
      >    lock(&sig->group_rwsem);
      >    <Interrupt>
      >      lock(&sig->group_rwsem);
      >
      >   *** DEADLOCK ***
      >
      >  no locks held by kswapd2/1151.
      >
      >  stack backtrace:
      >  CPU: 30 PID: 1151 Comm: kswapd2 Not tainted 3.10.39+ #4
      >  Call Trace:
      >    dump_stack+0x19/0x1b
      >    print_usage_bug+0x1f7/0x208
      >    mark_lock+0x21d/0x2a0
      >    __lock_acquire+0x52a/0xb60
      >    lock_acquire+0xa2/0x140
      >    down_read+0x51/0xa0
      >    exit_signals+0x24/0x130
      >    do_exit+0xb5/0xa50
      >    kthread+0xdb/0x100
      >    ret_from_fork+0x7c/0xb0
      
      This is because the kswapd thread is still marked as a reclaimer at the
      time of exit.  But because it is exiting, nobody is actually waiting on
      it to make reclaim progress anymore, and it's nothing but a regular
      thread at this point.  Be tidy and strip it of all its powers
      (PF_MEMALLOC, PF_SWAPWRITE, PF_KSWAPD, and the lockdep reclaim state)
      before returning from the thread function.
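
      A sketch of the exit path fix in kswapd(), dropping the powers listed
      above before the thread function returns (helper names as in kernels of
      this era; shown schematically):

        tsk->flags &= ~(PF_MEMALLOC | PF_SWAPWRITE | PF_KSWAPD);
        lockdep_clear_current_reclaim_state();

        return 0;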
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Reported-by: Gu Zheng <guz.fnst@cn.fujitsu.com>
      Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
      Cc: Tang Chen <tangchen@cn.fujitsu.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      71abdc15
  15. 05 June 2014 (7 commits)
    • mm: vmscan: use proportional scanning during direct reclaim and full scan at DEF_PRIORITY · 1a501907
      Committed by Mel Gorman
      Commit "mm: vmscan: obey proportional scanning requirements for kswapd"
      ensured that file/anon lists were scanned proportionally for reclaim from
      kswapd but ignored it for direct reclaim.  The intent was to minimise
      direct reclaim latency but Yuanhan Liu pointed out that it substitutes one
      long stall for many small stalls and distorts aging for normal workloads
      like streaming readers/writers.  Hugh Dickins pointed out that a
      side-effect of the same commit was that when one LRU list dropped to zero
      the entirety of the other list was shrunk, leading to excessive
      reclaim in memcgs.  This patch scans the file/anon lists proportionally
      for direct reclaim to similarly age pages whether reclaimed by kswapd or
      direct reclaim but takes care to abort reclaim if one LRU drops to zero
      after reclaiming the requested number of pages.
      
      Based on ext4 and using the Intel VM scalability test
      
                                                    3.15.0-rc5            3.15.0-rc5
                                                      shrinker            proportion
      Unit  lru-file-readonce    elapsed      5.3500 (  0.00%)      5.4200 ( -1.31%)
      Unit  lru-file-readonce time_range      0.2700 (  0.00%)      0.1400 ( 48.15%)
      Unit  lru-file-readonce time_stddv      0.1148 (  0.00%)      0.0536 ( 53.33%)
      Unit lru-file-readtwice    elapsed      8.1700 (  0.00%)      8.1700 (  0.00%)
      Unit lru-file-readtwice time_range      0.4300 (  0.00%)      0.2300 ( 46.51%)
      Unit lru-file-readtwice time_stddv      0.1650 (  0.00%)      0.0971 ( 41.16%)
      
      The test cases are running multiple dd instances reading sparse files.  The results are within
      the noise for the small test machine.  The impact of the patch is more noticeable from the vmstats
      
                                  3.15.0-rc5  3.15.0-rc5
                                    shrinker  proportion
      Minor Faults                     35154       36784
      Major Faults                       611        1305
      Swap Ins                           394        1651
      Swap Outs                         4394        5891
      Allocation stalls               118616       44781
      Direct pages scanned           4935171     4602313
      Kswapd pages scanned          15921292    16258483
      Kswapd pages reclaimed        15913301    16248305
      Direct pages reclaimed         4933368     4601133
      Kswapd efficiency                  99%         99%
      Kswapd velocity             670088.047  682555.961
      Direct efficiency                  99%         99%
      Direct velocity             207709.217  193212.133
      Percentage direct scans            23%         22%
      Page writes by reclaim        4858.000    6232.000
      Page writes file                   464         341
      Page writes anon                  4394        5891
      
      Note that there are fewer allocation stalls even though the amount
      of direct reclaim scanning is very approximately the same.
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Tim Chen <tim.c.chen@linux.intel.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Tested-by: Yuanhan Liu <yuanhan.liu@linux.intel.com>
      Cc: Bob Liu <bob.liu@oracle.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      1a501907
    • mm/vmscan.c: use DIV_ROUND_UP for calculation of zone's balance_gap and correct comments. · 4be89a34
      Committed by Jianyu Zhan
      Currently, we use (zone->managed_pages + KSWAPD_ZONE_BALANCE_GAP_RATIO-1)
      / KSWAPD_ZONE_BALANCE_GAP_RATIO to avoid a zero gap value.  It's better to
      use DIV_ROUND_UP macro for neater code and clear meaning.
      
      Besides, the gap value is calculated against the per-zone "managed pages",
      not "present pages".  This patch also corrects the comment and do some
      rephrasing.
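
      The change itself, restated (same arithmetic, clearer intent):

        /* before */
        balance_gap = (zone->managed_pages + KSWAPD_ZONE_BALANCE_GAP_RATIO - 1) /
                      KSWAPD_ZONE_BALANCE_GAP_RATIO;

        /* after */
        balance_gap = DIV_ROUND_UP(zone->managed_pages,
                                   KSWAPD_ZONE_BALANCE_GAP_RATIO);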
      Signed-off-by: Jianyu Zhan <nasa4836@gmail.com>
      Acked-by: Rik van Riel <riel@redhat.com>
      Acked-by: Rafael Aquini <aquini@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      4be89a34
    • M
      mm: page_alloc: convert hot/cold parameter and immediate callers to bool · b745bc85
      Committed by Mel Gorman
      cold is used as a bool, so make it one.  Also make the likely case the "if"
      part of the block instead of the else, since the optimisation manual says
      this form is preferred.
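
      A hedged standalone sketch of the shape of the change (the function and
      messages below are illustrative stand-ins, not the kernel's code): the flag
      becomes a bool and the common hot case takes the "if" branch.

        #include <stdbool.h>
        #include <stdio.h>

        /* toy stand-in for placing a freed page on a per-cpu list */
        static void free_hot_cold_sketch(int page, bool cold)
        {
                if (!cold)      /* likely case first */
                        printf("page %d -> head of pcp list (hot)\n", page);
                else
                        printf("page %d -> tail of pcp list (cold)\n", page);
        }

        int main(void)
        {
                free_hot_cold_sketch(1, false); /* hot free */
                free_hot_cold_sketch(2, true);  /* cold free, e.g. from reclaim */
                return 0;
        }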
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Acked-by: Rik van Riel <riel@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Theodore Ts'o <tytso@mit.edu>
      Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      b745bc85
    • D
      mm: shrinker: add nid to tracepoint output · df9024a8
      Committed by Dave Hansen
      Now that we are doing NUMA-aware shrinking, and can have shrinkers
      running in parallel, or working on individual nodes, it seems like we
      should also be sticking the node in the output.
      Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
      Acked-by: Dave Chinner <david@fromorbit.com>
      Cc: Konstantin Khlebnikov <khlebnikov@openvz.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      df9024a8
    • D
      mm: shrinker trace points: fix negatives · 7fe70475
      Committed by Dave Hansen
      I was looking at a trace of the slab shrinkers (attachment in this comment):
      
      	https://bugs.freedesktop.org/show_bug.cgi?id=72742#c67
      
      and noticed that "total_scan" can go negative in some cases.  We
      used to dump out the "total_scan" variable directly, but some of
      the shrinker modifications along the way changed that.
      
      This patch just dumps it out directly, again.  It doesn't make
      any sense to derive it from new_nr and nr any more since there
      are now other shrinkers that can be running in parallel and
      mucking with those values.
      
      Here's an example of the negative numbers in the output:
      
      >          kswapd0-840   [000]   160.869398: mm_shrink_slab_end:   i915_gem_inactive_scan+0x0 0xffff8800037cbc68: unused scan count 10 new scan count 39 total_scan 29 last shrinker return val 256
      >          kswapd0-840   [000]   160.869618: mm_shrink_slab_end:   i915_gem_inactive_scan+0x0 0xffff8800037cbc68: unused scan count 39 new scan count 102 total_scan 63 last shrinker return val 256
      >          kswapd0-840   [000]   160.870031: mm_shrink_slab_end:   i915_gem_inactive_scan+0x0 0xffff8800037cbc68: unused scan count 102 new scan count 47 total_scan -55 last shrinker return val 768
      >          kswapd0-840   [000]   160.870464: mm_shrink_slab_end:   i915_gem_inactive_scan+0x0 0xffff8800037cbc68: unused scan count 47 new scan count 45 total_scan -2 last shrinker return val 768
      >          kswapd0-840   [000]   163.384144: mm_shrink_slab_end:   i915_gem_inactive_scan+0x0 0xffff8800037cbc68: unused scan count 45 new scan count 56 total_scan 11 last shrinker return val 0
      >          kswapd0-840   [000]   163.384297: mm_shrink_slab_end:   i915_gem_inactive_scan+0x0 0xffff8800037cbc68: unused scan count 56 new scan count 15 total_scan -41 last shrinker return val 256
      >          kswapd0-840   [000]   163.384414: mm_shrink_slab_end:   i915_gem_inactive_scan+0x0 0xffff8800037cbc68: unused scan count 15 new scan count 117 total_scan 102 last shrinker return val 0
      >          kswapd0-840   [000]   163.384657: mm_shrink_slab_end:   i915_gem_inactive_scan+0x0 0xffff8800037cbc68: unused scan count 117 new scan count 36 total_scan -81 last shrinker return val 512
      >          kswapd0-840   [000]   163.384880: mm_shrink_slab_end:   i915_gem_inactive_scan+0x0 0xffff8800037cbc68: unused scan count 36 new scan count 111 total_scan 75 last shrinker return val 256
      >          kswapd0-840   [000]   163.385256: mm_shrink_slab_end:   i915_gem_inactive_scan+0x0 0xffff8800037cbc68: unused scan count 111 new scan count 34 total_scan -77 last shrinker return val 768
      >          kswapd0-840   [000]   163.385598: mm_shrink_slab_end:   i915_gem_inactive_scan+0x0 0xffff8800037cbc68: unused scan count 34 new scan count 122 total_scan 88 last shrinker return val 512
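
      A standalone arithmetic model (userspace only, not kernel code; the numbers
      are picked to mirror the third trace line above and the "actual" value is
      made up) of why deriving the count from the shared deferred counter can go
      negative once other shrinker calls run in parallel:

        #include <stdio.h>

        int main(void)
        {
                long nr_deferred = 102;         /* shared per-shrinker deferred counter */

                long nr = nr_deferred;          /* our read on entry: 102 */
                long total_scan = 64;           /* made up: what this call actually scanned */

                /* a concurrent shrinker call consumes deferred work meanwhile */
                nr_deferred = 47;

                long new_nr = nr_deferred;      /* our read on exit: 47 */

                printf("derived new_nr - nr = %ld (negative, misleading)\n", new_nr - nr);
                printf("actual total_scan   = %ld\n", total_scan);
                return 0;
        }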
      Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
      Acked-by: Dave Chinner <david@fromorbit.com>
      Cc: Konstantin Khlebnikov <khlebnikov@openvz.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      7fe70475
    • N
      mm/vmscan.c: avoid throttling reclaim for loop-back nfsd threads · 399ba0b9
      Committed by NeilBrown
      When a loopback NFS mount is active and the backing device for the NFS
      mount becomes congested, that can impose throttling delays on the nfsd
      threads.
      
      These delays significantly reduce throughput, so the NFS mount remains
      congested.

      The result is a livelock in which the reduced throughput persists.
      
      This livelock has been found in testing with the 'wait_iff_congested'
      call, and could possibly be caused by the 'congestion_wait' call.
      
      This livelock is similar to the deadlock which justified the introduction
      of PF_LESS_THROTTLE, and the same flag can be used to remove this
      livelock.
      
      To minimise the impact of the change, we still throttle nfsd when the
      filesystem it is writing to is congested, but not when some separate
      filesystem (e.g.  the NFS filesystem) is congested.
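
      A standalone model of the resulting rule (illustrative names and a
      simplification of the kernel-side check, which looks at PF_LESS_THROTTLE
      and congestion on the task's own backing device):

        #include <stdbool.h>
        #include <stdio.h>

        struct task_model {
                bool less_throttle;             /* models PF_LESS_THROTTLE */
                bool own_bdi_congested;         /* is the device it writes to congested? */
        };

        /* a PF_LESS_THROTTLE task is only throttled when its own device is congested */
        static bool may_throttle(const struct task_model *t)
        {
                return !t->less_throttle || t->own_bdi_congested;
        }

        int main(void)
        {
                struct task_model nfsd_loopback = { true,  false };
                struct task_model nfsd_busy_fs  = { true,  true  };
                struct task_model ordinary_task = { false, false };

                printf("loopback nfsd:        %d\n", may_throttle(&nfsd_loopback)); /* 0 */
                printf("nfsd, congested disk: %d\n", may_throttle(&nfsd_busy_fs));  /* 1 */
                printf("ordinary task:        %d\n", may_throttle(&ordinary_task)); /* 1 */
                return 0;
        }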
      Signed-off-by: NeilBrown <neilb@suse.de>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      399ba0b9
    • M
      mm: vmscan: do not throttle based on pfmemalloc reserves if node has no ZONE_NORMAL · 675becce
      Committed by Mel Gorman
      throttle_direct_reclaim() is meant to trigger during swap-over-network,
      during which the min watermark is treated as a pfmemalloc reserve.  It
      throttles on the first node in the zonelist, but this is flawed.
      
      The user-visible impact is that a process running on a CPU whose local
      memory node has no ZONE_NORMAL will stall for prolonged periods of time,
      possibly indefinitely.  This is due to throttle_direct_reclaim thinking the
      pfmemalloc reserves are depleted when in fact they don't exist on that
      node.
      
      On a NUMA machine running a 32-bit kernel (I know), allocation requests
      from CPUs on node 1 would detect no pfmemalloc reserves and the process
      would get throttled.  This patch adjusts throttling of direct reclaim to
      throttle based on the first node in the zonelist that has a usable
      ZONE_NORMAL or lower zone.
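
      A standalone model of the node-selection part of the fix (simplified,
      illustrative types; the real code walks the zonelist with the kernel's
      iterators and also consults the pfmemalloc watermark check):

        #include <stdio.h>

        enum zone_type { ZONE_DMA, ZONE_NORMAL, ZONE_HIGHMEM };

        struct zoneref_model { int node; enum zone_type type; };

        /* pick the first node whose zone is ZONE_NORMAL or lower */
        static int throttle_node(const struct zoneref_model *zl, int n)
        {
                for (int i = 0; i < n; i++) {
                        if (zl[i].type > ZONE_NORMAL)
                                continue;       /* pfmemalloc reserves live in lowmem */
                        return zl[i].node;
                }
                return -1;                      /* no usable zone: do not throttle */
        }

        int main(void)
        {
                /* node 1 offers only ZONE_HIGHMEM; node 0 holds the lowmem zones */
                struct zoneref_model zl[] = {
                        { 1, ZONE_HIGHMEM },
                        { 0, ZONE_NORMAL },
                        { 0, ZONE_DMA },
                };

                printf("throttle against node %d\n", throttle_node(zl, 3)); /* node 0 */
                return 0;
        }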
      
      [akpm@linux-foundation.org: coding-style fixes]
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      675becce