1. 07 8月, 2014 1 次提交
    • W
      mem-hotplug: improve zone_movable_is_highmem logic · 1a4dc5bc
      Wang Nan 提交于
      In original code, zone_movable_is_highmem() assumes ZONE_MOVABLE not
      highmem if CONFIG_HAVE_MEMBLOCK_NODE_MAP is not set.  In online_pages,
      it extracts pages from the previous zone before ZONE_MOVABLE.  Which is
      logically inconsistent:
      
      If HAVE_MEMBLOCK_NODE_MAP is turned off but HIGHMEM is on,
      zone_movable_is_highmem() makes movable zone not highmem, but
      online_pages() extracts pages from ZONE_HIGHMEM.
      
      This inconsistency doesn't cause real problem currently, because all
      architectures support online_pages also have HAVE_MEMBLOCK_NODE_MAP.
      However, fixing it makes code clear, and also helps futher coding.
      Signed-off-by: NWang Nan <wangnan0@huawei.com>
      Cc: Zhang Zhen <zhangzhen@huawei.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Jiang Liu <liuj97@gmail.com>
      Cc: Li Zefan <lizefan@huawei.com>
      Cc: Yinghai Lu <yinghai@kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      1a4dc5bc
  2. 05 6月, 2014 6 次提交
    • M
      mm: page_alloc: use unsigned int for order in more places · 7aeb09f9
      Mel Gorman 提交于
      X86 prefers the use of unsigned types for iterators and there is a
      tendency to mix whether a signed or unsigned type if used for page order.
      This converts a number of sites in mm/page_alloc.c to use unsigned int for
      order where possible.
      Signed-off-by: NMel Gorman <mgorman@suse.de>
      Acked-by: NRik van Riel <riel@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Theodore Ts'o <tytso@mit.edu>
      Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      7aeb09f9
    • M
      mm: page_alloc: reduce number of times page_to_pfn is called · dc4b0caf
      Mel Gorman 提交于
      In the free path we calculate page_to_pfn multiple times. Reduce that.
      Signed-off-by: NMel Gorman <mgorman@suse.de>
      Acked-by: NRik van Riel <riel@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: NVlastimil Babka <vbabka@suse.cz>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Theodore Ts'o <tytso@mit.edu>
      Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      dc4b0caf
    • M
      mm: page_alloc: use word-based accesses for get/set pageblock bitmaps · e58469ba
      Mel Gorman 提交于
      The test_bit operations in get/set pageblock flags are expensive.  This
      patch reads the bitmap on a word basis and use shifts and masks to isolate
      the bits of interest.  Similarly masks are used to set a local copy of the
      bitmap and then use cmpxchg to update the bitmap if there have been no
      other changes made in parallel.
      
      In a test running dd onto tmpfs the overhead of the pageblock-related
      functions went from 1.27% in profiles to 0.5%.
      
      In addition to the performance benefits, this patch closes races that are
      possible between:
      
      a) get_ and set_pageblock_migratetype(), where get_pageblock_migratetype()
         reads part of the bits before and other part of the bits after
         set_pageblock_migratetype() has updated them.
      
      b) set_pageblock_migratetype() and set_pageblock_skip(), where the non-atomic
         read-modify-update set bit operation in set_pageblock_skip() will cause
         lost updates to some bits changed in the set_pageblock_migratetype().
      
      Joonsoo Kim first reported the case a) via code inspection.  Vlastimil
      Babka's testing with a debug patch showed that either a) or b) occurs
      roughly once per mmtests' stress-highalloc benchmark (although not
      necessarily in the same pageblock).  Furthermore during development of
      unrelated compaction patches, it was observed that frequent calls to
      {start,undo}_isolate_page_range() the race occurs several thousands of
      times and has resulted in NULL pointer dereferences in move_freepages()
      and free_one_page() in places where free_list[migratetype] is
      manipulated by e.g.  list_move().  Further debugging confirmed that
      migratetype had invalid value of 6, causing out of bounds access to the
      free_list array.
      
      That confirmed that the race exist, although it may be extremely rare,
      and currently only fatal where page isolation is performed due to
      memory hot remove.  Races on pageblocks being updated by
      set_pageblock_migratetype(), where both old and new migratetype are
      lower MIGRATE_RESERVE, currently cannot result in an invalid value
      being observed, although theoretically they may still lead to
      unexpected creation or destruction of MIGRATE_RESERVE pageblocks.
      Furthermore, things could get suddenly worse when memory isolation is
      used more, or when new migratetypes are added.
      
      After this patch, the race has no longer been observed in testing.
      Signed-off-by: NMel Gorman <mgorman@suse.de>
      Acked-by: NVlastimil Babka <vbabka@suse.cz>
      Reported-by: NJoonsoo Kim <iamjoonsoo.kim@lge.com>
      Reported-and-tested-by: NVlastimil Babka <vbabka@suse.cz>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Theodore Ts'o <tytso@mit.edu>
      Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      e58469ba
    • D
      mm, compaction: add per-zone migration pfn cache for async compaction · 35979ef3
      David Rientjes 提交于
      Each zone has a cached migration scanner pfn for memory compaction so that
      subsequent calls to memory compaction can start where the previous call
      left off.
      
      Currently, the compaction migration scanner only updates the per-zone
      cached pfn when pageblocks were not skipped for async compaction.  This
      creates a dependency on calling sync compaction to avoid having subsequent
      calls to async compaction from scanning an enormous amount of non-MOVABLE
      pageblocks each time it is called.  On large machines, this could be
      potentially very expensive.
      
      This patch adds a per-zone cached migration scanner pfn only for async
      compaction.  It is updated everytime a pageblock has been scanned in its
      entirety and when no pages from it were successfully isolated.  The cached
      migration scanner pfn for sync compaction is updated only when called for
      sync compaction.
      Signed-off-by: NDavid Rientjes <rientjes@google.com>
      Acked-by: NVlastimil Babka <vbabka@suse.cz>
      Reviewed-by: NNaoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      35979ef3
    • V
      mem-hotplug: implement get/put_online_mems · bfc8c901
      Vladimir Davydov 提交于
      kmem_cache_{create,destroy,shrink} need to get a stable value of
      cpu/node online mask, because they init/destroy/access per-cpu/node
      kmem_cache parts, which can be allocated or destroyed on cpu/mem
      hotplug.  To protect against cpu hotplug, these functions use
      {get,put}_online_cpus.  However, they do nothing to synchronize with
      memory hotplug - taking the slab_mutex does not eliminate the
      possibility of race as described in patch 2.
      
      What we need there is something like get_online_cpus, but for memory.
      We already have lock_memory_hotplug, which serves for the purpose, but
      it's a bit of a hammer right now, because it's backed by a mutex.  As a
      result, it imposes some limitations to locking order, which are not
      desirable, and can't be used just like get_online_cpus.  That's why in
      patch 1 I substitute it with get/put_online_mems, which work exactly
      like get/put_online_cpus except they block not cpu, but memory hotplug.
      
      [ v1 can be found at https://lkml.org/lkml/2014/4/6/68.  I NAK'ed it by
        myself, because it used an rw semaphore for get/put_online_mems,
        making them dead lock prune.  ]
      
      This patch (of 2):
      
      {un}lock_memory_hotplug, which is used to synchronize against memory
      hotplug, is currently backed by a mutex, which makes it a bit of a
      hammer - threads that only want to get a stable value of online nodes
      mask won't be able to proceed concurrently.  Also, it imposes some
      strong locking ordering rules on it, which narrows down the set of its
      usage scenarios.
      
      This patch introduces get/put_online_mems, which are the same as
      get/put_online_cpus, but for memory hotplug, i.e.  executing a code
      inside a get/put_online_mems section will guarantee a stable value of
      online nodes, present pages, etc.
      
      lock_memory_hotplug()/unlock_memory_hotplug() are removed altogether.
      Signed-off-by: NVladimir Davydov <vdavydov@parallels.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: Tang Chen <tangchen@cn.fujitsu.com>
      Cc: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
      Cc: Toshi Kani <toshi.kani@hp.com>
      Cc: Xishi Qiu <qiuxishi@huawei.com>
      Cc: Jiang Liu <liuj97@gmail.com>
      Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Wen Congyang <wency@cn.fujitsu.com>
      Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
      Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      bfc8c901
    • M
      mm: page_alloc: do not cache reclaim distances · 5f7a75ac
      Mel Gorman 提交于
      pgdat->reclaim_nodes tracks if a remote node is allowed to be reclaimed
      by zone_reclaim due to its distance.  As it is expected that
      zone_reclaim_mode will be rarely enabled it is unreasonable for all
      machines to take a penalty.  Fortunately, the zone_reclaim_mode() path
      is already slow and it is the path that takes the hit.
      Signed-off-by: NMel Gorman <mgorman@suse.de>
      Acked-by: NJohannes Weiner <hannes@cmpxchg.org>
      Reviewed-by: NZhang Yanfei <zhangyanfei@cn.fujitsu.com>
      Acked-by: NMichal Hocko <mhocko@suse.cz>
      Reviewed-by: NChristoph Lameter <cl@linux.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      5f7a75ac
  3. 04 4月, 2014 2 次提交
    • J
      mm: keep page cache radix tree nodes in check · 449dd698
      Johannes Weiner 提交于
      Previously, page cache radix tree nodes were freed after reclaim emptied
      out their page pointers.  But now reclaim stores shadow entries in their
      place, which are only reclaimed when the inodes themselves are
      reclaimed.  This is problematic for bigger files that are still in use
      after they have a significant amount of their cache reclaimed, without
      any of those pages actually refaulting.  The shadow entries will just
      sit there and waste memory.  In the worst case, the shadow entries will
      accumulate until the machine runs out of memory.
      
      To get this under control, the VM will track radix tree nodes
      exclusively containing shadow entries on a per-NUMA node list.  Per-NUMA
      rather than global because we expect the radix tree nodes themselves to
      be allocated node-locally and we want to reduce cross-node references of
      otherwise independent cache workloads.  A simple shrinker will then
      reclaim these nodes on memory pressure.
      
      A few things need to be stored in the radix tree node to implement the
      shadow node LRU and allow tree deletions coming from the list:
      
      1. There is no index available that would describe the reverse path
         from the node up to the tree root, which is needed to perform a
         deletion.  To solve this, encode in each node its offset inside the
         parent.  This can be stored in the unused upper bits of the same
         member that stores the node's height at no extra space cost.
      
      2. The number of shadow entries needs to be counted in addition to the
         regular entries, to quickly detect when the node is ready to go to
         the shadow node LRU list.  The current entry count is an unsigned
         int but the maximum number of entries is 64, so a shadow counter
         can easily be stored in the unused upper bits.
      
      3. Tree modification needs tree lock and tree root, which are located
         in the address space, so store an address_space backpointer in the
         node.  The parent pointer of the node is in a union with the 2-word
         rcu_head, so the backpointer comes at no extra cost as well.
      
      4. The node needs to be linked to an LRU list, which requires a list
         head inside the node.  This does increase the size of the node, but
         it does not change the number of objects that fit into a slab page.
      
      [akpm@linux-foundation.org: export the right function]
      Signed-off-by: NJohannes Weiner <hannes@cmpxchg.org>
      Reviewed-by: NRik van Riel <riel@redhat.com>
      Reviewed-by: NMinchan Kim <minchan@kernel.org>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Bob Liu <bob.liu@oracle.com>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Luigi Semenzato <semenzato@google.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Metin Doslu <metin@citusdata.com>
      Cc: Michel Lespinasse <walken@google.com>
      Cc: Ozgun Erdogan <ozgun@citusdata.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Roman Gushchin <klamm@yandex-team.ru>
      Cc: Ryan Mallon <rmallon@gmail.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      449dd698
    • J
      mm: thrash detection-based file cache sizing · a528910e
      Johannes Weiner 提交于
      The VM maintains cached filesystem pages on two types of lists.  One
      list holds the pages recently faulted into the cache, the other list
      holds pages that have been referenced repeatedly on that first list.
      The idea is to prefer reclaiming young pages over those that have shown
      to benefit from caching in the past.  We call the recently usedbut
      ultimately was not significantly better than a FIFO policy and still
      thrashed cache based on eviction speed, rather than actual demand for
      cache.
      
      This patch solves one half of the problem by decoupling the ability to
      detect working set changes from the inactive list size.  By maintaining
      a history of recently evicted file pages it can detect frequently used
      pages with an arbitrarily small inactive list size, and subsequently
      apply pressure on the active list based on actual demand for cache, not
      just overall eviction speed.
      
      Every zone maintains a counter that tracks inactive list aging speed.
      When a page is evicted, a snapshot of this counter is stored in the
      now-empty page cache radix tree slot.  On refault, the minimum access
      distance of the page can be assessed, to evaluate whether the page
      should be part of the active list or not.
      
      This fixes the VM's blindness towards working set changes in excess of
      the inactive list.  And it's the foundation to further improve the
      protection ability and reduce the minimum inactive list size of 50%.
      Signed-off-by: NJohannes Weiner <hannes@cmpxchg.org>
      Reviewed-by: NRik van Riel <riel@redhat.com>
      Reviewed-by: NMinchan Kim <minchan@kernel.org>
      Reviewed-by: NBob Liu <bob.liu@oracle.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Luigi Semenzato <semenzato@google.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Metin Doslu <metin@citusdata.com>
      Cc: Michel Lespinasse <walken@google.com>
      Cc: Ozgun Erdogan <ozgun@citusdata.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Roman Gushchin <klamm@yandex-team.ru>
      Cc: Ryan Mallon <rmallon@gmail.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      a528910e
  4. 11 3月, 2014 1 次提交
    • J
      mm: fix GFP_THISNODE callers and clarify · e97ca8e5
      Johannes Weiner 提交于
      GFP_THISNODE is for callers that implement their own clever fallback to
      remote nodes.  It restricts the allocation to the specified node and
      does not invoke reclaim, assuming that the caller will take care of it
      when the fallback fails, e.g.  through a subsequent allocation request
      without GFP_THISNODE set.
      
      However, many current GFP_THISNODE users only want the node exclusive
      aspect of the flag, without actually implementing their own fallback or
      triggering reclaim if necessary.  This results in things like page
      migration failing prematurely even when there is easily reclaimable
      memory available, unless kswapd happens to be running already or a
      concurrent allocation attempt triggers the necessary reclaim.
      
      Convert all callsites that don't implement their own fallback strategy
      to __GFP_THISNODE.  This restricts the allocation a single node too, but
      at the same time allows the allocator to enter the slowpath, wake
      kswapd, and invoke direct reclaim if necessary, to make the allocation
      happen when memory is full.
      Signed-off-by: NJohannes Weiner <hannes@cmpxchg.org>
      Acked-by: NRik van Riel <riel@redhat.com>
      Cc: Jan Stancek <jstancek@redhat.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      e97ca8e5
  5. 22 1月, 2014 2 次提交
  6. 12 9月, 2013 2 次提交
    • L
      mm: vmscan: fix do_try_to_free_pages() livelock · 6e543d57
      Lisa Du 提交于
      This patch is based on KOSAKI's work and I add a little more description,
      please refer https://lkml.org/lkml/2012/6/14/74.
      
      Currently, I found system can enter a state that there are lots of free
      pages in a zone but only order-0 and order-1 pages which means the zone is
      heavily fragmented, then high order allocation could make direct reclaim
      path's long stall(ex, 60 seconds) especially in no swap and no compaciton
      enviroment.  This problem happened on v3.4, but it seems issue still lives
      in current tree, the reason is do_try_to_free_pages enter live lock:
      
      kswapd will go to sleep if the zones have been fully scanned and are still
      not balanced.  As kswapd thinks there's little point trying all over again
      to avoid infinite loop.  Instead it changes order from high-order to
      0-order because kswapd think order-0 is the most important.  Look at
      73ce02e9 in detail.  If watermarks are ok, kswapd will go back to sleep
      and may leave zone->all_unreclaimable =3D 0.  It assume high-order users
      can still perform direct reclaim if they wish.
      
      Direct reclaim continue to reclaim for a high order which is not a
      COSTLY_ORDER without oom-killer until kswapd turn on
      zone->all_unreclaimble= .  This is because to avoid too early oom-kill.
      So it means direct_reclaim depends on kswapd to break this loop.
      
      In worst case, direct-reclaim may continue to page reclaim forever when
      kswapd sleeps forever until someone like watchdog detect and finally kill
      the process.  As described in:
      http://thread.gmane.org/gmane.linux.kernel.mm/103737
      
      We can't turn on zone->all_unreclaimable from direct reclaim path because
      direct reclaim path don't take any lock and this way is racy.  Thus this
      patch removes zone->all_unreclaimable field completely and recalculates
      zone reclaimable state every time.
      
      Note: we can't take the idea that direct-reclaim see zone->pages_scanned
      directly and kswapd continue to use zone->all_unreclaimable.  Because, it
      is racy.  commit 929bea7c (vmscan: all_unreclaimable() use
      zone->all_unreclaimable as a name) describes the detail.
      
      [akpm@linux-foundation.org: uninline zone_reclaimable_pages() and zone_reclaimable()]
      Cc: Aaditya Kumar <aaditya.kumar.30@gmail.com>
      Cc: Ying Han <yinghan@google.com>
      Cc: Nick Piggin <npiggin@gmail.com>
      Acked-by: NRik van Riel <riel@redhat.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Bob Liu <lliubbo@gmail.com>
      Cc: Neil Zhang <zhangwm@marvell.com>
      Cc: Russell King - ARM Linux <linux@arm.linux.org.uk>
      Reviewed-by: NMichal Hocko <mhocko@suse.cz>
      Acked-by: NMinchan Kim <minchan@kernel.org>
      Acked-by: NJohannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: NKOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Signed-off-by: NLisa Du <cldu@marvell.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      6e543d57
    • J
      mm: page_alloc: fair zone allocator policy · 81c0a2bb
      Johannes Weiner 提交于
      Each zone that holds userspace pages of one workload must be aged at a
      speed proportional to the zone size.  Otherwise, the time an individual
      page gets to stay in memory depends on the zone it happened to be
      allocated in.  Asymmetry in the zone aging creates rather unpredictable
      aging behavior and results in the wrong pages being reclaimed, activated
      etc.
      
      But exactly this happens right now because of the way the page allocator
      and kswapd interact.  The page allocator uses per-node lists of all zones
      in the system, ordered by preference, when allocating a new page.  When
      the first iteration does not yield any results, kswapd is woken up and the
      allocator retries.  Due to the way kswapd reclaims zones below the high
      watermark while a zone can be allocated from when it is above the low
      watermark, the allocator may keep kswapd running while kswapd reclaim
      ensures that the page allocator can keep allocating from the first zone in
      the zonelist for extended periods of time.  Meanwhile the other zones
      rarely see new allocations and thus get aged much slower in comparison.
      
      The result is that the occasional page placed in lower zones gets
      relatively more time in memory, even gets promoted to the active list
      after its peers have long been evicted.  Meanwhile, the bulk of the
      working set may be thrashing on the preferred zone even though there may
      be significant amounts of memory available in the lower zones.
      
      Even the most basic test -- repeatedly reading a file slightly bigger than
      memory -- shows how broken the zone aging is.  In this scenario, no single
      page should be able stay in memory long enough to get referenced twice and
      activated, but activation happens in spades:
      
        $ grep active_file /proc/zoneinfo
            nr_inactive_file 0
            nr_active_file 0
            nr_inactive_file 0
            nr_active_file 8
            nr_inactive_file 1582
            nr_active_file 11994
        $ cat data data data data >/dev/null
        $ grep active_file /proc/zoneinfo
            nr_inactive_file 0
            nr_active_file 70
            nr_inactive_file 258753
            nr_active_file 443214
            nr_inactive_file 149793
            nr_active_file 12021
      
      Fix this with a very simple round robin allocator.  Each zone is allowed a
      batch of allocations that is proportional to the zone's size, after which
      it is treated as full.  The batch counters are reset when all zones have
      been tried and the allocator enters the slowpath and kicks off kswapd
      reclaim.  Allocation and reclaim is now fairly spread out to all
      available/allowable zones:
      
        $ grep active_file /proc/zoneinfo
            nr_inactive_file 0
            nr_active_file 0
            nr_inactive_file 174
            nr_active_file 4865
            nr_inactive_file 53
            nr_active_file 860
        $ cat data data data data >/dev/null
        $ grep active_file /proc/zoneinfo
            nr_inactive_file 0
            nr_active_file 0
            nr_inactive_file 666622
            nr_active_file 4988
            nr_inactive_file 190969
            nr_active_file 937
      
      When zone_reclaim_mode is enabled, allocations will now spread out to all
      zones on the local node, not just the first preferred zone (which on a 4G
      node might be a tiny Normal zone).
      Signed-off-by: NJohannes Weiner <hannes@cmpxchg.org>
      Acked-by: NMel Gorman <mgorman@suse.de>
      Reviewed-by: NRik van Riel <riel@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Paul Bolle <paul.bollee@gmail.com>
      Cc: Zlatko Calusic <zcalusic@bitsync.net>
      Tested-by: NKevin Hilman <khilman@linaro.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      81c0a2bb
  7. 10 7月, 2013 1 次提交
  8. 04 7月, 2013 6 次提交
  9. 27 3月, 2013 1 次提交
  10. 23 3月, 2013 1 次提交
    • R
      mm: zone_end_pfn is too small · f9228b20
      Russ Anderson 提交于
      Booting with 32 TBytes memory hits BUG at mm/page_alloc.c:552! (output
      below).
      
      The key hint is "page 4294967296 outside zone".
      4294967296 = 0x100000000 (bit 32 is set).
      
      The problem is in include/linux/mmzone.h:
      
        530 static inline unsigned zone_end_pfn(const struct zone *zone)
        531 {
        532         return zone->zone_start_pfn + zone->spanned_pages;
        533 }
      
      zone_end_pfn is "unsigned" (32 bits).  Changing it to "unsigned long"
      (64 bits) fixes the problem.
      
      zone_end_pfn() was added recently in commit 108bcc96 ("mm: add & use
      zone_end_pfn() and zone_spans_pfn()")
      
      Output from the failure.
      
        No AGP bridge found
        page 4294967296 outside zone [ 4294967296 - 4327469056 ]
        ------------[ cut here ]------------
        kernel BUG at mm/page_alloc.c:552!
        invalid opcode: 0000 [#1] SMP
        Modules linked in:
        CPU 0
        Pid: 0, comm: swapper Not tainted 3.9.0-rc2.dtp+ #10
        RIP: free_one_page+0x382/0x430
        Process swapper (pid: 0, threadinfo ffffffff81942000, task ffffffff81955420)
        Call Trace:
          __free_pages_ok+0x96/0xb0
          __free_pages+0x25/0x50
          __free_pages_bootmem+0x8a/0x8c
          __free_memory_core+0xea/0x131
          free_low_memory_core_early+0x4a/0x98
          free_all_bootmem+0x45/0x47
          mem_init+0x7b/0x14c
          start_kernel+0x216/0x433
          x86_64_start_reservations+0x2a/0x2c
          x86_64_start_kernel+0x144/0x153
        Code: 89 f1 ba 01 00 00 00 31 f6 d3 e2 4c 89 ef e8 66 a4 01 00 e9 2c fe ff ff 0f 0b eb fe 0f 0b 66 66 2e 0f 1f 84 00 00 00 00 00 eb f3 <0f> 0b eb fe 0f 0b 0f 1f 84 00 00 00 00 00 eb f6 0f 0b eb fe 49
      Signed-off-by: NRuss Anderson <rja@sgi.com>
      Reported-by: NGeorge Beshers <gbeshers@sgi.com>
      Acked-by: NHedi Berriche <hedi@sgi.com>
      Cc: Cody P Schafer <cody@linux.vnet.ibm.com>
      Cc: Michal Hocko <mhocko@suse.cz>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      f9228b20
  11. 24 2月, 2013 5 次提交
  12. 05 1月, 2013 1 次提交
  13. 13 12月, 2012 1 次提交
    • J
      mm: introduce new field "managed_pages" to struct zone · 9feedc9d
      Jiang Liu 提交于
      Currently a zone's present_pages is calcuated as below, which is
      inaccurate and may cause trouble to memory hotplug.
      
      	spanned_pages - absent_pages - memmap_pages - dma_reserve.
      
      During fixing bugs caused by inaccurate zone->present_pages, we found
      zone->present_pages has been abused.  The field zone->present_pages may
      have different meanings in different contexts:
      
      1) pages existing in a zone.
      2) pages managed by the buddy system.
      
      For more discussions about the issue, please refer to:
        http://lkml.org/lkml/2012/11/5/866
        https://patchwork.kernel.org/patch/1346751/
      
      This patchset tries to introduce a new field named "managed_pages" to
      struct zone, which counts "pages managed by the buddy system".  And revert
      zone->present_pages to count "physical pages existing in a zone", which
      also keep in consistence with pgdat->node_present_pages.
      
      We will set an initial value for zone->managed_pages in function
      free_area_init_core() and will adjust it later if the initial value is
      inaccurate.
      
      For DMA/normal zones, the initial value is set to:
      
      	(spanned_pages - absent_pages - memmap_pages - dma_reserve)
      
      Later zone->managed_pages will be adjusted to the accurate value when the
      bootmem allocator frees all free pages to the buddy system in function
      free_all_bootmem_node() and free_all_bootmem().
      
      The bootmem allocator doesn't touch highmem pages, so highmem zones'
      managed_pages is set to the accurate value "spanned_pages - absent_pages"
      in function free_area_init_core() and won't be updated anymore.
      
      This patch also adds a new field "managed_pages" to /proc/zoneinfo
      and sysrq showmem.
      
      [akpm@linux-foundation.org: small comment tweaks]
      Signed-off-by: NJiang Liu <jiang.liu@huawei.com>
      Cc: Wen Congyang <wency@cn.fujitsu.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Maciej Rutecki <maciej.rutecki@gmail.com>
      Tested-by: NChris Clayton <chris2553@googlemail.com>
      Cc: "Rafael J . Wysocki" <rjw@sisk.pl>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Jianguo Wu <wujianguo@huawei.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      9feedc9d
  14. 12 12月, 2012 1 次提交
  15. 11 12月, 2012 1 次提交
  16. 17 11月, 2012 1 次提交
    • H
      memcg: fix hotplugged memory zone oops · bea8c150
      Hugh Dickins 提交于
      When MEMCG is configured on (even when it's disabled by boot option),
      when adding or removing a page to/from its lru list, the zone pointer
      used for stats updates is nowadays taken from the struct lruvec.  (On
      many configurations, calculating zone from page is slower.)
      
      But we have no code to update all the lruvecs (per zone, per memcg) when
      a memory node is hotadded.  Here's an extract from the oops which
      results when running numactl to bind a program to a newly onlined node:
      
        BUG: unable to handle kernel NULL pointer dereference at 0000000000000f60
        IP:  __mod_zone_page_state+0x9/0x60
        Pid: 1219, comm: numactl Not tainted 3.6.0-rc5+ #180 Bochs Bochs
        Process numactl (pid: 1219, threadinfo ffff880039abc000, task ffff8800383c4ce0)
        Call Trace:
          __pagevec_lru_add_fn+0xdf/0x140
          pagevec_lru_move_fn+0xb1/0x100
          __pagevec_lru_add+0x1c/0x30
          lru_add_drain_cpu+0xa3/0x130
          lru_add_drain+0x2f/0x40
         ...
      
      The natural solution might be to use a memcg callback whenever memory is
      hotadded; but that solution has not been scoped out, and it happens that
      we do have an easy location at which to update lruvec->zone.  The lruvec
      pointer is discovered either by mem_cgroup_zone_lruvec() or by
      mem_cgroup_page_lruvec(), and both of those do know the right zone.
      
      So check and set lruvec->zone in those; and remove the inadequate
      attempt to set lruvec->zone from lruvec_init(), which is called before
      NODE_DATA(node) has been allocated in such cases.
      
      Ah, there was one exceptionr.  For no particularly good reason,
      mem_cgroup_force_empty_list() has its own code for deciding lruvec.
      Change it to use the standard mem_cgroup_zone_lruvec() and
      mem_cgroup_get_lru_size() too.  In fact it was already safe against such
      an oops (the lru lists in danger could only be empty), but we're better
      proofed against future changes this way.
      
      I've marked this for stable (3.6) since we introduced the problem in 3.5
      (now closed to stable); but I have no idea if this is the only fix
      needed to get memory hotadd working with memcg in 3.6, and received no
      answer when I enquired twice before.
      Reported-by: NTang Chen <tangchen@cn.fujitsu.com>
      Signed-off-by: NHugh Dickins <hughd@google.com>
      Acked-by: NJohannes Weiner <hannes@cmpxchg.org>
      Acked-by: NKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Konstantin Khlebnikov <khlebnikov@openvz.org>
      Cc: Wen Congyang <wency@cn.fujitsu.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      bea8c150
  17. 09 10月, 2012 7 次提交
    • M
      CMA: migrate mlocked pages · e46a2879
      Minchan Kim 提交于
      Presently CMA cannot migrate mlocked pages so it ends up failing to allocate
      contiguous memory space.
      
      This patch makes mlocked pages be migrated out.  Of course, it can affect
      realtime processes but in CMA usecase, contiguous memory allocation failing
      is far worse than access latency to an mlocked page being variable while
      CMA is running.  If someone wants to make the system realtime, he shouldn't
      enable CMA because stalls can still happen at random times.
      
      [akpm@linux-foundation.org: tweak comment text, per Mel]
      Signed-off-by: NMinchan Kim <minchan@kernel.org>
      Acked-by: NMel Gorman <mgorman@suse.de>
      Cc: Michal Nazarewicz <mina86@mina86.com>
      Cc: Bartlomiej Zolnierkiewicz <b.zolnierkie@samsung.com>
      Cc: Marek Szyprowski <m.szyprowski@samsung.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      e46a2879
    • D
      mm, numa: reclaim from all nodes within reclaim distance · 957f822a
      David Rientjes 提交于
      RECLAIM_DISTANCE represents the distance between nodes at which it is
      deemed too costly to allocate from; it's preferred to try to reclaim from
      a local zone before falling back to allocating on a remote node with such
      a distance.
      
      To do this, zone_reclaim_mode is set if the distance between any two
      nodes on the system is greather than this distance.  This, however, ends
      up causing the page allocator to reclaim from every zone regardless of
      its affinity.
      
      What we really want is to reclaim only from zones that are closer than
      RECLAIM_DISTANCE.  This patch adds a nodemask to each node that
      represents the set of nodes that are within this distance.  During the
      zone iteration, if the bit for a zone's node is set for the local node,
      then reclaim is attempted; otherwise, the zone is skipped.
      
      [akpm@linux-foundation.org: fix CONFIG_NUMA=n build]
      Signed-off-by: NDavid Rientjes <rientjes@google.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      957f822a
    • M
      mm: compaction: clear PG_migrate_skip based on compaction and reclaim activity · 62997027
      Mel Gorman 提交于
      Compaction caches if a pageblock was scanned and no pages were isolated so
      that the pageblocks can be skipped in the future to reduce scanning.  This
      information is not cleared by the page allocator based on activity due to
      the impact it would have to the page allocator fast paths.  Hence there is
      a requirement that something clear the cache or pageblocks will be skipped
      forever.  Currently the cache is cleared if there were a number of recent
      allocation failures and it has not been cleared within the last 5 seconds.
      Time-based decisions like this are terrible as they have no relationship
      to VM activity and is basically a big hammer.
      
      Unfortunately, accurate heuristics would add cost to some hot paths so
      this patch implements a rough heuristic.  There are two cases where the
      cache is cleared.
      
      1. If a !kswapd process completes a compaction cycle (migrate and free
         scanner meet), the zone is marked compact_blockskip_flush. When kswapd
         goes to sleep, it will clear the cache. This is expected to be the
         common case where the cache is cleared. It does not really matter if
         kswapd happens to be asleep or going to sleep when the flag is set as
         it will be woken on the next allocation request.
      
      2. If there have been multiple failures recently and compaction just
         finished being deferred then a process will clear the cache and start a
         full scan.  This situation happens if there are multiple high-order
         allocation requests under heavy memory pressure.
      
      The clearing of the PG_migrate_skip bits and other scans is inherently
      racy but the race is harmless.  For allocations that can fail such as THP,
      they will simply fail.  For requests that cannot fail, they will retry the
      allocation.  Tests indicated that scanning rates were roughly similar to
      when the time-based heuristic was used and the allocation success rates
      were similar.
      Signed-off-by: NMel Gorman <mgorman@suse.de>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Richard Davies <richard@arachsys.com>
      Cc: Shaohua Li <shli@kernel.org>
      Cc: Avi Kivity <avi@redhat.com>
      Cc: Rafael Aquini <aquini@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      62997027
    • M
      mm: compaction: Restart compaction from near where it left off · c89511ab
      Mel Gorman 提交于
      This is almost entirely based on Rik's previous patches and discussions
      with him about how this might be implemented.
      
      Order > 0 compaction stops when enough free pages of the correct page
      order have been coalesced.  When doing subsequent higher order
      allocations, it is possible for compaction to be invoked many times.
      
      However, the compaction code always starts out looking for things to
      compact at the start of the zone, and for free pages to compact things to
      at the end of the zone.
      
      This can cause quadratic behaviour, with isolate_freepages starting at the
      end of the zone each time, even though previous invocations of the
      compaction code already filled up all free memory on that end of the zone.
       This can cause isolate_freepages to take enormous amounts of CPU with
      certain workloads on larger memory systems.
      
      This patch caches where the migration and free scanner should start from
      on subsequent compaction invocations using the pageblock-skip information.
       When compaction starts it begins from the cached restart points and will
      update the cached restart points until a page is isolated or a pageblock
      is skipped that would have been scanned by synchronous compaction.
      Signed-off-by: NMel Gorman <mgorman@suse.de>
      Acked-by: NRik van Riel <riel@redhat.com>
      Cc: Richard Davies <richard@arachsys.com>
      Cc: Shaohua Li <shli@kernel.org>
      Cc: Avi Kivity <avi@redhat.com>
      Acked-by: NRafael Aquini <aquini@redhat.com>
      Cc: Fengguang Wu <fengguang.wu@intel.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      c89511ab
    • M
      mm: compaction: cache if a pageblock was scanned and no pages were isolated · bb13ffeb
      Mel Gorman 提交于
      When compaction was implemented it was known that scanning could
      potentially be excessive.  The ideal was that a counter be maintained for
      each pageblock but maintaining this information would incur a severe
      penalty due to a shared writable cache line.  It has reached the point
      where the scanning costs are a serious problem, particularly on
      long-lived systems where a large process starts and allocates a large
      number of THPs at the same time.
      
      Instead of using a shared counter, this patch adds another bit to the
      pageblock flags called PG_migrate_skip.  If a pageblock is scanned by
      either migrate or free scanner and 0 pages were isolated, the pageblock is
      marked to be skipped in the future.  When scanning, this bit is checked
      before any scanning takes place and the block skipped if set.
      
      The main difficulty with a patch like this is "when to ignore the cached
      information?" If it's ignored too often, the scanning rates will still be
      excessive.  If the information is too stale then allocations will fail
      that might have otherwise succeeded.  In this patch
      
      o CMA always ignores the information
      o If the migrate and free scanner meet then the cached information will
        be discarded if it's at least 5 seconds since the last time the cache
        was discarded
      o If there are a large number of allocation failures, discard the cache.
      
      The time-based heuristic is very clumsy but there are few choices for a
      better event.  Depending solely on multiple allocation failures still
      allows excessive scanning when THP allocations are failing in quick
      succession due to memory pressure.  Waiting until memory pressure is
      relieved would cause compaction to continually fail instead of using
      reclaim/compaction to try allocate the page.  The time-based mechanism is
      clumsy but a better option is not obvious.
      Signed-off-by: NMel Gorman <mgorman@suse.de>
      Acked-by: NRik van Riel <riel@redhat.com>
      Cc: Richard Davies <richard@arachsys.com>
      Cc: Shaohua Li <shli@kernel.org>
      Cc: Avi Kivity <avi@redhat.com>
      Acked-by: NRafael Aquini <aquini@redhat.com>
      Cc: Fengguang Wu <fengguang.wu@intel.com>
      Cc: Michal Nazarewicz <mina86@mina86.com>
      Cc: Bartlomiej Zolnierkiewicz <b.zolnierkie@samsung.com>
      Cc: Kyungmin Park <kyungmin.park@samsung.com>
      Cc: Mark Brown <broonie@opensource.wolfsonmicro.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      bb13ffeb
    • M
      revert "mm: have order > 0 compaction start off where it left" · 753341a4
      Mel Gorman 提交于
      This reverts commit 7db8889a ("mm: have order > 0 compaction start
      off where it left") and commit de74f1cc ("mm: have order > 0 compaction
      start near a pageblock with free pages").  These patches were a good
      idea and tests confirmed that they massively reduced the amount of
      scanning but the implementation is complex and tricky to understand.  A
      later patch will cache what pageblocks should be skipped and
      reimplements the concept of compact_cached_free_pfn on top for both
      migration and free scanners.
      Signed-off-by: NMel Gorman <mgorman@suse.de>
      Acked-by: NRik van Riel <riel@redhat.com>
      Cc: Richard Davies <richard@arachsys.com>
      Cc: Shaohua Li <shli@kernel.org>
      Cc: Avi Kivity <avi@redhat.com>
      Acked-by: NRafael Aquini <aquini@redhat.com>
      Acked-by: NMinchan Kim <minchan@kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      753341a4
    • B
      cma: count free CMA pages · d1ce749a
      Bartlomiej Zolnierkiewicz 提交于
      Add NR_FREE_CMA_PAGES counter to be later used for checking watermark in
      __zone_watermark_ok().  For simplicity and to avoid #ifdef hell make this
      counter always available (not only when CONFIG_CMA=y).
      
      [akpm@linux-foundation.org: use conventional migratetype naming]
      Signed-off-by: NBartlomiej Zolnierkiewicz <b.zolnierkie@samsung.com>
      Signed-off-by: NKyungmin Park <kyungmin.park@samsung.com>
      Cc: Marek Szyprowski <m.szyprowski@samsung.com>
      Cc: Michal Nazarewicz <mina86@mina86.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Hugh Dickins <hughd@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      d1ce749a