1. 29 July 2016 (7 commits)
  2. 27 July 2016 (6 commits)
    • J
      mm/page_alloc: introduce post allocation processing on page allocator · 46f24fd8
      Authored by Joonsoo Kim
      This patch is motivated by Hugh's and Vlastimil's concern [1].
      
      There are two ways to get a free page from the allocator.  One is the
      normal memory allocation API and the other is __isolate_free_page(),
      which is used internally for compaction and pageblock isolation.  The
      latter is rather tricky since it doesn't do the whole post-allocation
      processing done by the normal API.
      
      One problem I already know of is that a poisoned page is not checked
      when it is allocated by __isolate_free_page().  There may be more.
      
      We could add more debug logic for allocated pages in the future, and
      this separation would cause further problems.  I'd like to fix this
      situation now.  The solution is simple: this patch factors out the
      common logic for newly allocated pages and uses it at all sites, which
      solves the problem.
      
      [1] http://marc.info/?i=alpine.LSU.2.11.1604270029350.7066%40eggly.anvils%3E
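      
      As a rough sketch of the idea (the post_alloc_hook() name and the exact
      set of checks are assumptions for illustration, not the final patch),
      the shared logic could look like:
      
      	static void post_alloc_hook(struct page *page, unsigned int order,
      				    gfp_t gfp_flags)
      	{
      		/* common post-allocation processing for both allocation paths */
      		set_page_private(page, 0);
      		set_page_refcounted(page);
      
      		arch_alloc_page(page, order);
      		kernel_poison_pages(page, 1 << order, 1);	/* catch poisoned pages */
      		kasan_alloc_pages(page, order);
      		set_page_owner(page, order, gfp_flags);
      	}
      
      Both the normal allocator path and the __isolate_free_page() users would
      then go through the same helper, so debug checks added later apply to
      both.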
      
      [iamjoonsoo.kim@lge.com: mm-page_alloc-introduce-post-allocation-processing-on-page-allocator-v3]
        Link: http://lkml.kernel.org/r/1464230275-25791-7-git-send-email-iamjoonsoo.kim@lge.com
        Link: http://lkml.kernel.org/r/1466150259-27727-9-git-send-email-iamjoonsoo.kim@lge.com
      Link: http://lkml.kernel.org/r/1464230275-25791-7-git-send-email-iamjoonsoo.kim@lge.com
      Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Alexander Potapenko <glider@google.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      46f24fd8
    • J
      mm/page_owner: initialize page owner without holding the zone lock · 83358ece
      Authored by Joonsoo Kim
      It's not necessary to initialize page_owner while holding the zone lock;
      doing so only adds contention on the zone lock.  That is not a big
      problem since page_owner is just a debug feature, but it is still an
      improvement, so do it.  This is also a preparation step for using
      stackdepot in the page owner feature: stackdepot allocates new pages
      when it has no reserved space left, and holding the zone lock in that
      case would cause a deadlock.
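      
      A minimal sketch of the change, assuming the existing set_page_owner()
      helper and a compaction-style split path (details are illustrative):
      
      	spin_lock_irqsave(&zone->lock, flags);
      	isolated = __isolate_free_page(page, order);
      	spin_unlock_irqrestore(&zone->lock, flags);
      
      	/* debug initialization now happens outside the locked section */
      	if (isolated)
      		set_page_owner(page, order, __GFP_MOVABLE);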
      
      Link: http://lkml.kernel.org/r/1464230275-25791-2-git-send-email-iamjoonsoo.kim@lge.com
      Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Alexander Potapenko <glider@google.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      83358ece
    • J
      mm/compaction: split freepages without holding the zone lock · 66c64223
      Authored by Joonsoo Kim
      We don't need to split freepages while holding the zone lock.  Doing so
      only adds contention on the zone lock, which is not desirable.
      
      [rientjes@google.com: if __isolate_free_page() fails, avoid adding to freelist so we don't call map_pages() with it]
        Link: http://lkml.kernel.org/r/alpine.DEB.2.10.1606211447001.43430@chino.kir.corp.google.com
      Link: http://lkml.kernel.org/r/1464230275-25791-1-git-send-email-iamjoonsoo.kim@lge.com
      Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Alexander Potapenko <glider@google.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Signed-off-by: David Rientjes <rientjes@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      66c64223
    • M
      zsmalloc: introduce zspage structure · 3783689a
      Authored by Minchan Kim
      We have squeezed the zspage metadata into the first page's descriptor,
      so to get metadata for a subpage we must first look up its first page.
      That makes it hard to implement page migration in zsmalloc, because any
      place that derives the first page from a subpage can race with
      migration of that first page; in other words, the first page it got
      could be stale.  I tried several approaches to prevent this, but they
      complicated the code, so I finally concluded to separate the metadata
      from the first page.  Of course, this consumes more memory: 16 bytes
      per zspage on 32-bit at the moment.  That means we lose 1% in the
      *worst case* (40B/4096B), which I think is not bad as the cost of
      maintainability.
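      
      A purely illustrative shape of such a separate metadata object (field
      names and sizes here are assumptions, not the exact structure from the
      patch):
      
      	struct zspage {
      		unsigned int class:9;		/* size class index */
      		unsigned int fullness:2;	/* fullness group */
      		unsigned int inuse;		/* objects currently allocated */
      		unsigned int freeobj;		/* index of the first free object */
      		struct page *first_page;	/* head of the page chain */
      		struct list_head list;		/* link in the class fullness list */
      	};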
      
      Link: http://lkml.kernel.org/r/1464736881-24886-9-git-send-email-minchan@kernel.org
      Signed-off-by: Minchan Kim <minchan@kernel.org>
      Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      3783689a
    • M
      mm: balloon: use general non-lru movable page feature · b1123ea6
      Authored by Minchan Kim
      Now that the VM can migrate non-lru movable pages, the balloon driver
      no longer needs custom migration hooks in migrate.c and compaction.c.
      
      Instead, this patch implements the page->mapping->a_ops->
      {isolate|migrate|putback} functions.
      
      With that, we can remove the ballooning hooks from the generic
      migration functions and keep balloon compaction simple.
      
      [akpm@linux-foundation.org: compaction.h requires that the includer first include node.h]
      Link: http://lkml.kernel.org/r/1464736881-24886-4-git-send-email-minchan@kernel.org
      Signed-off-by: Gioh Kim <gi-oh.kim@profitbricks.com>
      Signed-off-by: Minchan Kim <minchan@kernel.org>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Rafael Aquini <aquini@redhat.com>
      Cc: Konstantin Khlebnikov <koct9i@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      b1123ea6
    • M
      mm: migrate: support non-lru movable page migration · bda807d4
      Authored by Minchan Kim
      Until now we have allowed migration only for LRU pages, and that was
      enough for making high-order pages.  But recently, embedded systems
      (e.g., webOS, Android) use lots of non-movable pages (e.g., zram, GPU
      memory), so we have seen several reports of trouble with small
      high-order allocations.  There have been several efforts to fix the
      problem (e.g., enhancing the compaction algorithm, SLUB fallback to
      order-0 pages, reserved memory, vmalloc and so on), but if there are
      lots of non-movable pages in the system, those solutions are void in
      the long run.
      
      So, this patch adds a facility to turn non-movable pages into movable
      ones.  For that, it adds migration-related functions to
      address_space_operations as well as some page flags.
      
      If a driver wants to make its own pages movable, it should define three
      functions, which are function pointers in struct
      address_space_operations.
      
      1. bool (*isolate_page) (struct page *page, isolate_mode_t mode);
      
      The VM expects the driver's isolate_page function to return *true* if
      the driver isolates the page successfully.  On returning true, the VM
      marks the page PG_isolated so that concurrent isolation attempts on
      other CPUs skip the page.  If the driver cannot isolate the page, it
      should return *false*.
      
      Once a page is successfully isolated, the VM uses the page.lru fields,
      so the driver shouldn't expect the values in those fields to be
      preserved.
      
      2. int (*migratepage) (struct address_space *mapping,
      		struct page *newpage, struct page *oldpage, enum migrate_mode);
      
      After isolation, the VM calls the driver's migratepage with the
      isolated page.  The job of migratepage is to move the contents of the
      old page to the new page and to set up the struct page fields of
      newpage.  Keep in mind that you should tell the VM the old page is no
      longer movable via __ClearPageMovable() under page_lock if you migrated
      the old page successfully and return 0.  If the driver cannot migrate
      the page at the moment, it can return -EAGAIN.  On -EAGAIN, the VM will
      retry page migration after a short time, because it interprets -EAGAIN
      as "temporary migration failure".  On any error other than -EAGAIN, the
      VM gives up on migrating the page without retrying this time.
      
      The driver shouldn't touch the page.lru field, which the VM uses, in
      these functions.
      
      3. void (*putback_page)(struct page *);
      
      If migration fails on an isolated page, the VM should return the
      isolated page to the driver, so it calls the driver's putback_page with
      the page whose migration failed.  In this function, the driver should
      put the isolated page back into its own data structure.
      
      4. non-lru movable page flags
      
      There are two page flags for supporting non-lru movable page.
      
      * PG_movable
      
      The driver should use the function below to make a page movable under
      page_lock.
      
      	void __SetPageMovable(struct page *page, struct address_space *mapping)
      
      It takes an address_space argument to register the migration family of
      functions that will be called by the VM.  Strictly speaking, PG_movable
      is not a real flag of struct page.  Rather, the VM reuses the lower
      bits of page->mapping to represent it.
      
      	#define PAGE_MAPPING_MOVABLE 0x2
      	page->mapping = page->mapping | PAGE_MAPPING_MOVABLE;
      
      so the driver shouldn't access page->mapping directly.  Instead, the
      driver should use page_mapping, which masks off the low two bits of
      page->mapping, so it can get the right struct address_space.
      
      For testing whether a page is non-lru movable, the VM provides the
      __PageMovable function.  However, it doesn't guarantee identification
      of non-lru movable pages, because the page->mapping field is unified
      with other variables in struct page.  As well, if the driver releases
      the page after isolation by the VM, page->mapping doesn't have a stable
      value although it has PAGE_MAPPING_MOVABLE set (look at
      __ClearPageMovable).  But __PageMovable is a cheap way to tell whether
      a page is LRU or non-lru movable once the page has been isolated,
      because LRU pages can never have PAGE_MAPPING_MOVABLE in page->mapping.
      It is also good for just peeking to test for non-lru movable pages
      before the more expensive check with lock_page during pfn scanning to
      select a victim.
      
      To identify non-lru movable pages reliably, the VM provides the
      PageMovable function.  Unlike __PageMovable, PageMovable validates
      page->mapping and mapping->a_ops->isolate_page under lock_page.  The
      lock_page prevents page->mapping from being destroyed suddenly.
      
      A driver using __SetPageMovable should clear the flag via
      __ClearPageMovable under page_lock before releasing the page.
      
      * PG_isolated
      
      To prevent concurrent isolation among several CPUs, the VM marks an
      isolated page as PG_isolated under lock_page.  So if a CPU encounters a
      PG_isolated non-lru movable page, it can skip it.  The driver doesn't
      need to manipulate the flag because the VM will set and clear it
      automatically.  Keep in mind that if the driver sees a PG_isolated
      page, it means the page has been isolated by the VM, so it shouldn't
      touch the page.lru field.  PG_isolated is aliased with the PG_reclaim
      flag, so the driver shouldn't use that flag for its own purposes.
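      
      Putting the pieces above together, a driver implementation could look
      roughly like the sketch below.  The my_dev_* helpers and the locking of
      the driver's own data are hypothetical; only the callback signatures
      and the __SetPageMovable()/__ClearPageMovable() calls follow the
      description above.
      
      	static bool my_dev_isolate_page(struct page *page, isolate_mode_t mode)
      	{
      		/* detach the page from the driver's own lists; VM takes page.lru */
      		return my_dev_try_detach(page);		/* true on success */
      	}
      
      	static int my_dev_migratepage(struct address_space *mapping,
      			struct page *newpage, struct page *oldpage,
      			enum migrate_mode mode)
      	{
      		if (!my_dev_copy_page(newpage, oldpage))
      			return -EAGAIN;			/* temporary failure, VM retries */
      		__ClearPageMovable(oldpage);		/* old page no longer movable */
      		return 0;
      	}
      
      	static void my_dev_putback_page(struct page *page)
      	{
      		my_dev_reattach(page);			/* back into the driver's lists */
      	}
      
      	static const struct address_space_operations my_dev_aops = {
      		.isolate_page	= my_dev_isolate_page,
      		.migratepage	= my_dev_migratepage,
      		.putback_page	= my_dev_putback_page,
      	};
      
      	/* when the driver allocates a page it wants to be movable: */
      	lock_page(page);
      	__SetPageMovable(page, mapping);	/* mapping->a_ops == &my_dev_aops */
      	unlock_page(page);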
      
      [opensource.ganesh@gmail.com: mm/compaction: remove local variable is_lru]
        Link: http://lkml.kernel.org/r/20160618014841.GA7422@leo-test
      Link: http://lkml.kernel.org/r/1464736881-24886-3-git-send-email-minchan@kernel.org
      Signed-off-by: Gioh Kim <gi-oh.kim@profitbricks.com>
      Signed-off-by: Minchan Kim <minchan@kernel.org>
      Signed-off-by: Ganesh Mahendran <opensource.ganesh@gmail.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Rafael Aquini <aquini@redhat.com>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: John Einar Reitan <john.reitan@foss.arm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      bda807d4
  3. 15 July 2016 (1 commit)
  4. 25 June 2016 (1 commit)
    • D
      mm, compaction: abort free scanner if split fails · a4f04f2c
      Authored by David Rientjes
      If the memory compaction free scanner cannot successfully split a free
      page (only possible due to per-zone low watermark), terminate the free
      scanner rather than continuing to scan memory needlessly.  If the
      watermark is insufficient for a free page of order <= cc->order, then
      terminate the scanner since all future splits will also likely fail.
      
      This prevents the compaction freeing scanner from scanning all memory on
      very large zones (very noticeable for zones > 128GB, for instance) when
      all splits will likely fail while holding zone->lock.
      
      compaction_alloc() iterating a 128GB zone has been benchmarked to take
      over 400ms on some systems whereas any free page isolated and ready to
      be split ends up failing in split_free_page() because of the low
      watermark check and thus the iteration continues.
      
      The next time compaction occurs, the freeing scanner will likely start
      at the end of the zone again since no success was made previously and we
      get the same lengthy iteration until the zone is brought above the low
      watermark.  All thp page faults can take >400ms in such a state without
      this fix.
      
      Link: http://lkml.kernel.org/r/alpine.DEB.2.10.1606211820350.97086@chino.kir.corp.google.com
      Signed-off-by: David Rientjes <rientjes@google.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      a4f04f2c
  5. 21 May 2016 (6 commits)
    • C
      mm/compaction.c: fix zoneindex in kcompactd() · 6cd9dc3e
      Authored by Chen Feng
      While testing kcompactd on my platform (3G of memory, DMA zone only), I
      found that kcompactd never woke up.  The zone index has already had 1
      subtracted earlier, so the traversal here should use <=.
      
      It fixes a regression where kswapd could previously compact, but
      kcompactd could not.  It is not a crash fix, though.
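      
      A minimal sketch of the fix (the surrounding kcompactd_do_work()-style
      loop is illustrative; only the <= bound is the point):
      
      	/* was: zoneid < classzone_idx, which skipped the highest eligible zone */
      	for (zoneid = 0; zoneid <= classzone_idx; zoneid++) {
      		struct zone *zone = &pgdat->node_zones[zoneid];
      
      		if (!populated_zone(zone))
      			continue;
      		if (compaction_deferred(zone, order))
      			continue;
      		/* ... compact this zone ... */
      	}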
      
      [akpm@linux-foundation.org: fix kcompactd_do_work() as well, per Hugh]
      Link: http://lkml.kernel.org/r/1463659121-84124-1-git-send-email-puck.chen@hisilicon.com
      Fixes: accf6242 ("mm, kswapd: replace kswapd compaction with waking up kcompactd")
      Signed-off-by: Chen Feng <puck.chen@hisilicon.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Zhuangluan Su <suzhuangluan@hisilicon.com>
      Cc: Yiping Xu <xuyiping@hisilicon.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      6cd9dc3e
    • M
      mm, oom, compaction: prevent from should_compact_retry looping for ever for costly orders · 86a294a8
      Authored by Michal Hocko
      "mm: consider compaction feedback also for costly allocation" has
      removed the upper bound for the reclaim/compaction retries based on the
      number of reclaimed pages for costly orders.  While this is desirable,
      the patch missed a bad interaction between reclaim, compaction and the
      retry logic.  Direct reclaim tries to get zones over the min watermark,
      while compaction backs off and returns COMPACT_SKIPPED when all zones
      are below the low watermark plus a 1<<order gap.  If we are getting
      really close to OOM, then __compaction_suitable can keep returning
      COMPACT_SKIPPED for a high-order request (e.g.  hugetlb order-9) while
      reclaim is not able to release enough pages to get us over the low
      watermark.  Reclaim is still able to make some progress (usually
      thrashing over the few remaining pages), so we are not able to break
      out of the loop.
      
      I have seen this happening with the same test described in "mm: consider
      compaction feedback also for costly allocation" on a swapless system.
      The original problem got resolved by "vmscan: consider classzone_idx in
      compaction_ready", but it shows how things might go wrong when we
      approach the OOM event horizon.
      
      The reason why compaction requires being over the low rather than the
      min watermark is not clear to me.  This check has been there essentially
      since 56de7263 ("mm: compaction: direct compact when a high-order
      allocation fails").  It is clearly an implementation detail, though, and
      we shouldn't pull it into the generic retry logic; instead we should be
      able to cope with such an eventuality.  The only place in
      should_compact_retry where we retry without any upper bound is the
      compaction_withdrawn() case.
      
      Introduce a compaction_zonelist_suitable function which checks the given
      zonelist and returns true only if there is at least one zone which would
      unblock __compaction_suitable if more memory got reclaimed.  In this
      implementation it checks __compaction_suitable with NR_FREE_PAGES plus
      part of the reclaimable memory as the target for the watermark check.
      The reclaimable memory is reduced linearly by the allocation order.  The
      idea is that we do not want to reclaim all the remaining memory for a
      single allocation request just to unblock __compaction_suitable, which
      doesn't guarantee we will make further progress.
      
      The new helper is then used if compaction_withdrawn() feedback was
      provided, so we do not retry if there is no outlook for further
      progress.  !costly requests shouldn't be affected much - e.g.  order-2
      pages would require at least 64kB on the reclaimable LRUs while order-9
      would need at least 32M, which should be enough to not lock up.
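      
      A rough per-zone sketch of the check (the wmark_target argument to
      __compaction_suitable() and the helper names are assumptions used for
      illustration):
      
      	unsigned long available;
      
      	available = zone_page_state_snapshot(zone, NR_FREE_PAGES);
      	/* only part of the reclaimable memory counts, scaled down by order */
      	available += zone_reclaimable_pages(zone) >> order;
      
      	if (__compaction_suitable(zone, order, alloc_flags, classzone_idx,
      				  available) == COMPACT_CONTINUE)
      		return true;	/* reclaim could still unblock this zone */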
      
      [vbabka@suse.cz: fix classzone_idx vs. high_zoneidx usage in compaction_zonelist_suitable]
      [akpm@linux-foundation.org: fix it for Mel's mm-page_alloc-remove-field-from-alloc_context.patch]
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Joonsoo Kim <js1304@gmail.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Cc: Vladimir Davydov <vdavydov@virtuozzo.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      86a294a8
    • M
      mm, compaction: distinguish between full and partial COMPACT_COMPLETE · c8f7de0b
      Authored by Michal Hocko
      COMPACT_COMPLETE now means that the compaction and free scanners met.
      This is not very useful information for somebody who just wants to use
      this feedback and make decisions based on it.  The current caller might
      be a poor guy who just happened to scan a tiny portion of the zone, and
      that could be the reason no suitable pages were compacted.  Make sure we
      distinguish full and partial zone walks.
      
      Consumers should treat COMPACT_PARTIAL_SKIPPED as a potential success
      and be optimistic in retrying.
      
      The existing users of COMPACT_COMPLETE are conservatively changed to use
      COMPACT_PARTIAL_SKIPPED as well, but some of them should probably be
      reconsidered so that they defer compaction only for COMPACT_COMPLETE
      under the new semantics.
      
      This patch shouldn't introduce any functional changes.
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Joonsoo Kim <js1304@gmail.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Cc: Vladimir Davydov <vdavydov@virtuozzo.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      c8f7de0b
    • M
      mm, compaction: distinguish COMPACT_DEFERRED from COMPACT_SKIPPED · 1d4746d3
      Authored by Michal Hocko
      try_to_compact_pages() can currently return COMPACT_SKIPPED even when
      compaction is deferred for some zone, just because zone DMA is skipped
      in 99% of cases due to watermark checks.  This makes COMPACT_DEFERRED
      basically unusable for the page allocator as a feedback mechanism.
      
      Make sure we distinguish those two states properly and switch their
      ordering in the enum.  This means COMPACT_SKIPPED will be returned only
      when all eligible zones are skipped.
      
      As a result, COMPACT_DEFERRED handling for THP in __alloc_pages_slowpath
      will be more precise and we will bail out rather than reclaim.
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Joonsoo Kim <js1304@gmail.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Cc: Vladimir Davydov <vdavydov@virtuozzo.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      1d4746d3
    • M
      mm, compaction: cover all compaction mode in compact_zone · c46649de
      Authored by Michal Hocko
      The compiler is complaining after "mm, compaction: change COMPACT_
      constants into enum"
      
        mm/compaction.c: In function `compact_zone':
        mm/compaction.c:1350:2: warning: enumeration value `COMPACT_DEFERRED' not handled in switch [-Wswitch]
          switch (ret) {
          ^
        mm/compaction.c:1350:2: warning: enumeration value `COMPACT_COMPLETE' not handled in switch [-Wswitch]
        mm/compaction.c:1350:2: warning: enumeration value `COMPACT_NO_SUITABLE_PAGE' not handled in switch [-Wswitch]
        mm/compaction.c:1350:2: warning: enumeration value `COMPACT_NOT_SUITABLE_ZONE' not handled in switch [-Wswitch]
        mm/compaction.c:1350:2: warning: enumeration value `COMPACT_CONTENDED' not handled in switch [-Wswitch]
      
      compaction_suitable is allowed to return only COMPACT_PARTIAL,
      COMPACT_SKIPPED and COMPACT_CONTINUE so other cases are simply
      impossible.  Put a VM_BUG_ON to catch an impossible return value.
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Joonsoo Kim <js1304@gmail.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Cc: Vladimir Davydov <vdavydov@virtuozzo.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      c46649de
    • M
      mm, compaction: change COMPACT_ constants into enum · ea7ab982
      Authored by Michal Hocko
      Compaction code is doing weird dances between COMPACT_FOO -> int ->
      unsigned long, but there doesn't seem to be any reason for that.  None
      of the functions which return or use one of those constants expect any
      other value, so it really makes sense to define an enum for them and
      make it clear that no other values are expected.
      
      This is a pure cleanup and shouldn't introduce any functional changes.
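      
      A hedged sketch of what such an enum can look like (the exact set and
      ordering of values here is illustrative; later patches in this series
      adjust both):
      
      	enum compact_result {
      		COMPACT_SKIPPED,	/* compaction not suitable or not attempted */
      		COMPACT_DEFERRED,	/* compaction deferred due to recent failures */
      		COMPACT_CONTINUE,	/* keep compacting, more pageblocks to scan */
      		COMPACT_PARTIAL,	/* a page of the requested order was freed */
      		COMPACT_COMPLETE,	/* scanners met, the whole zone was scanned */
      	};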
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Joonsoo Kim <js1304@gmail.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Cc: Vladimir Davydov <vdavydov@virtuozzo.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      ea7ab982
  6. 20 May 2016 (5 commits)
    • M
      mm, page_alloc: remove field from alloc_context · 93ea9964
      Authored by Mel Gorman
      The classzone_idx can be inferred from preferred_zoneref so remove the
      unnecessary field and save stack space.
      Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Jesper Dangaard Brouer <brouer@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      93ea9964
    • M
      mm, page_alloc: convert alloc_flags to unsigned · c603844b
      Authored by Mel Gorman
      alloc_flags is a bitmask of flags but it is signed which does not
      necessarily generate the best code depending on the compiler.  Even
      without an impact, it makes more sense that this be unsigned.
      Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Jesper Dangaard Brouer <brouer@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      c603844b
    • V
      mm, compaction: skip blocks where isolation fails in async direct compaction · fdd048e1
      Authored by Vlastimil Babka
      The goal of direct compaction is to quickly make a high-order page
      available for the pending allocation.  Within an aligned block of pages
      of desired order, a single allocated page that cannot be isolated for
      migration means that the block cannot fully merge to a buddy page that
      would satisfy the allocation request.  Therefore we can reduce the
      allocation stall by skipping the rest of the block immediately on
      isolation failure.  For async compaction, this also means a higher
      chance of succeeding until it detects contention.
      
      We shouldn't, however, completely sacrifice the second objective of
      compaction, which is to reduce overall long-term memory fragmentation.
      As a compromise, perform the eager skipping only in direct async
      compaction, while sync compaction (including kcompactd) remains
      thorough.
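      
      A simplified sketch of the eager skip inside the migration scanner (the
      condition and the block-size arithmetic are illustrative, not the exact
      patch):
      
      	/* in isolate_migratepages_block(), after failing to isolate a page */
      	if (cc->direct_compaction && cc->mode == MIGRATE_ASYNC) {
      		/* give up on the rest of this cc->order aligned block */
      		low_pfn = ALIGN(low_pfn + 1, 1UL << cc->order) - 1;
      		continue;	/* the loop increment moves past the block */
      	}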
      
      Testing was done using stress-highalloc from mmtests, configured for
      order-4 GFP_KERNEL allocations:
      
                                       4.6-rc1               4.6-rc1
                                        before                 after
        Success 1 Min         24.00 (  0.00%)       27.00 (-12.50%)
        Success 1 Mean        30.20 (  0.00%)       31.60 ( -4.64%)
        Success 1 Max         37.00 (  0.00%)       35.00 (  5.41%)
        Success 2 Min         42.00 (  0.00%)       32.00 ( 23.81%)
        Success 2 Mean        44.00 (  0.00%)       44.80 ( -1.82%)
        Success 2 Max         48.00 (  0.00%)       52.00 ( -8.33%)
        Success 3 Min         91.00 (  0.00%)       92.00 ( -1.10%)
        Success 3 Mean        92.20 (  0.00%)       92.80 ( -0.65%)
        Success 3 Max         94.00 (  0.00%)       93.00 (  1.06%)
      
      We can see that success rates are unaffected by the skipping.
      
                      4.6-rc1     4.6-rc1
                       before       after
        User         2587.42     2566.53
        System        482.89      471.20
        Elapsed      1395.68     1382.00
      
      Times are not such a useful metric for this benchmark, as the main
      portion is the interfering kernel builds, but the results do hint at
      reduced system times.
      
                                            4.6-rc1     4.6-rc1
                                             before       after
        Direct pages scanned                163614      159608
        Kswapd pages scanned               2070139     2078790
        Kswapd pages reclaimed             2061707     2069757
        Direct pages reclaimed              163354      159505
      
      Reduced direct reclaim was unintended, but could be explained by more
      successful first attempt at (async) direct compaction, which is
      attempted before the first reclaim attempt in __alloc_pages_slowpath().
      
        Compaction stalls                    33052       39853
        Compaction success                   12121       19773
        Compaction failures                  20931       20079
      
      Compaction is indeed more successful, and thus less likely to get
      deferred, so there are also more direct compaction stalls.
      
        Page migrate success               3781876     3326819
        Page migrate failure                 45817       41774
        Compaction pages isolated          7868232     6941457
        Compaction migrate scanned       168160492   127269354
        Compaction migrate prescanned            0           0
        Compaction free scanned         2522142582  2326342620
        Compaction free direct alloc             0           0
        Compaction free dir. all. miss           0           0
        Compaction cost                       5252        4476
      
      The patch reduces migration scanned pages by 25% thanks to the eager
      skipping.
      
      [hughd@google.com: prevent nr_isolated_* from going negative]
      Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      fdd048e1
    • V
      mm, compaction: reduce spurious pcplist drains · a34753d2
      Authored by Vlastimil Babka
      Compaction drains the local pcplists each time migration scanner moves
      away from a cc->order aligned block where it isolated pages for
      migration, so that the pages freed by migrations can merge into higher
      orders.
      
      The detection is currently coarser than it could be.  The
      cc->last_migrated_pfn variable should track the lowest pfn that was
      isolated for migration.  But it is set to the pfn where
      isolate_migratepages_block() starts scanning, which is typically the
      first pfn of the pageblock.  There, the scanner might fail to isolate
      several order-aligned blocks, and then isolate COMPACT_CLUSTER_MAX in
      another block.  This would cause the pcplists drain to be performed,
      although the scanner didn't yet finish the block where it isolated from.
      
      This patch thus makes cc->last_migrated_pfn handling more accurate by
      setting it to the pfn of an actually isolated page in
      isolate_migratepages_block().  Although practical effects of this patch
      are likely low, it arguably makes the intent of the code more obvious.
      Also the next patch will make async direct compaction skip blocks more
      aggressively, and draining pcplists due to skipped blocks is wasteful.
      Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      a34753d2
    • V
      mm, compaction: wrap calculating first and last pfn of pageblock · 06b6640a
      Authored by Vlastimil Babka
      Compaction code has accumulated numerous instances of manual
      calculations of the first (inclusive) and last (exclusive) pfn of a
      pageblock (or a smaller block of given order), given a pfn within the
      pageblock.
      
      Wrap these calculations by introducing pageblock_start_pfn(pfn) and
      pageblock_end_pfn(pfn) macros.
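      
      A minimal sketch of what the wrappers can look like (assuming the
      existing pageblock_nr_pages constant; the exact definitions may differ):
      
      	#define pageblock_start_pfn(pfn)	round_down(pfn, pageblock_nr_pages)
      	#define pageblock_end_pfn(pfn)		ALIGN((pfn) + 1, pageblock_nr_pages)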
      
      [vbabka@suse.cz: fix crash in get_pfnblock_flags_mask() from isolate_freepages():]
      Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Hugh Dickins <hughd@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      06b6640a
  7. 06 May 2016 (2 commits)
    • V
      mm: fix kcompactd hang during memory offlining · 172400c6
      Authored by Vlastimil Babka
      Assume memory47 is the last online block left in node1.  This will hang:
      
        # echo offline > /sys/devices/system/node/node1/memory47/state
      
      After a couple of minutes, the following pops up in dmesg:
      
        INFO: task bash:957 blocked for more than 120 seconds.
               Not tainted 4.6.0-rc6+ #6
        "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
        bash            D ffff8800b7adbaf8     0   957    951 0x00000000
        Call Trace:
          schedule+0x35/0x80
          schedule_timeout+0x1ac/0x270
          wait_for_completion+0xe1/0x120
          kthread_stop+0x4f/0x110
          kcompactd_stop+0x26/0x40
          __offline_pages.constprop.28+0x7e6/0x840
          offline_pages+0x11/0x20
          memory_block_action+0x73/0x1d0
          memory_subsys_offline+0x47/0x60
          device_offline+0x86/0xb0
          store_mem_state+0xda/0xf0
          dev_attr_store+0x18/0x30
          sysfs_kf_write+0x37/0x40
          kernfs_fop_write+0x11d/0x170
          __vfs_write+0x37/0x120
          vfs_write+0xa9/0x1a0
          SyS_write+0x55/0xc0
          entry_SYSCALL_64_fastpath+0x1a/0xa4
      
      kcompactd is waiting for kcompactd_max_order > 0 when it's woken up to
      actually exit.  Check kthread_should_stop() to break out of the wait.
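      
      A minimal sketch of the fix, assuming the wait uses a
      kcompactd_work_requested()-style predicate (the helper name is
      illustrative; pgdat->kcompactd_max_order is from the description above):
      
      	/* wake up not only for pending work but also when being stopped */
      	static bool kcompactd_work_requested(pg_data_t *pgdat)
      	{
      		return pgdat->kcompactd_max_order > 0 || kthread_should_stop();
      	}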
      
      Fixes: 698b1b30 ("mm, compaction: introduce kcompactd").
      Reported-by: Reza Arbab <arbab@linux.vnet.ibm.com>
      Tested-by: Reza Arbab <arbab@linux.vnet.ibm.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Hugh Dickins <hughd@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      172400c6
    • H
      mm, cma: prevent nr_isolated_* counters from going negative · 14af4a5e
      Authored by Hugh Dickins
      /proc/sys/vm/stat_refresh warns that nr_isolated_anon and
      nr_isolated_file go increasingly negative under compaction, which would
      add delay when there should be none, or no delay when there should be
      some.  The bug in compaction was due to a recent mmotm patch, but a much
      older instance of the bug was also noticed in
      isolate_migratepages_range(), which is used for CMA and gigantic
      hugepage allocations.
      
      The bug is caused by putback_movable_pages() in an error path
      decrementing the isolated counters without them having previously been
      incremented by acct_isolated().  Fix isolate_migratepages_range() by
      removing the error-path putback, thus reaching acct_isolated() with
      migratepages still isolated, and leaving the putback to the caller, like
      most other places do.
      
      Fixes: edc2ca61 ("mm, compaction: move pageblock checks up from isolate_migratepages_range()")
      [vbabka@suse.cz: expanded the changelog]
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
      Acked-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      14af4a5e
  8. 18 March 2016 (2 commits)
    • V
      mm, kswapd: replace kswapd compaction with waking up kcompactd · accf6242
      Authored by Vlastimil Babka
      Similarly to direct reclaim/compaction, kswapd attempts to combine
      reclaim and compaction to attempt making memory allocation of given
      order available.
      
      The details differ from direct reclaim e.g. in having high watermark as
      a goal.  The code involved in kswapd's reclaim/compaction decisions has
      evolved to be quite complex.
      
      Testing reveals that it doesn't actually work in at least one scenario,
      and closer inspection suggests that it could be greatly simplified
      without compromising on the goal (make a high-order page available) or
      efficiency (don't reclaim too much).  The simplification relies on
      doing all compaction in kcompactd, which is simply woken up when the
      high watermarks are reached by kswapd's reclaim.
      
      The scenario where kswapd compaction doesn't work was found with mmtests
      test stress-highalloc configured to attempt order-9 allocations without
      direct reclaim, just waking up kswapd.  There was no compaction attempt
      from kswapd during the whole test.  Some added instrumentation shows
      what happens:
      
       - balance_pgdat() sets end_zone to Normal, as it's not balanced
       - reclaim is attempted on DMA zone, which sets nr_attempted to 99, but
         it cannot reclaim anything, so sc.nr_reclaimed is 0
       - for zones DMA32 and Normal, kswapd_shrink_zone uses testorder=0, so
         it merely checks if high watermarks were reached for base pages.
         This is true, so no reclaim is attempted.  For DMA, testorder=0
         wasn't used, as compaction_suitable() returned COMPACT_SKIPPED
       - even though the pgdat_needs_compaction flag wasn't set to false, no
         compaction happens due to the condition sc.nr_reclaimed >
         nr_attempted being false (as 0 < 99)
       - priority-- due to nr_reclaimed being 0, repeat until priority reaches
         0 pgdat_balanced() is false as only the small zone DMA appears
         balanced (curiously in that check, watermark appears OK and
         compaction_suitable() returns COMPACT_PARTIAL, because a lower
         classzone_idx is used there)
      
      Now, even if it was decided that reclaim shouldn't be attempted on the
      DMA zone, the scenario would be the same, as (sc.nr_reclaimed=0 >
      nr_attempted=0) is also false.  The condition really should use >= as
      the comment suggests.  Then there is a mismatch in the check for setting
      pgdat_needs_compaction to false using low watermark, while the rest uses
      high watermark, and who knows what other subtlety.  Hopefully this
      demonstrates that this is unsustainable.
      
      Luckily we can simplify this a lot.  The reclaim/compaction decisions
      make sense for direct reclaim scenario, but in kswapd, our primary goal
      is to reach high watermark in order-0 pages.  Afterwards we can attempt
      compaction just once.  Unlike direct reclaim, we don't reclaim extra
      pages (over the high watermark), the current code already disallows it
      for good reasons.
      
      After this patch, we simply wake up kcompactd to process the pgdat,
      after we have either succeeded or failed to reach the high watermarks in
      kswapd, which goes to sleep.  We pass kswapd's order and classzone_idx,
      so kcompactd can apply the same criteria to determine which zones are
      worth compacting.  Note that we use the classzone_idx from
      wakeup_kswapd(), not balanced_classzone_idx which can include higher
      zones that kswapd tried to balance too, but didn't consider them in
      pgdat_balanced().
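      
      As a rough sketch of the handover (wakeup_kcompactd() follows the
      description above; the per-node fields and the merging of concurrent
      requests are assumptions for illustration):
      
      	void wakeup_kcompactd(pg_data_t *pgdat, int order, int classzone_idx)
      	{
      		if (!order)
      			return;
      
      		/* remember the pending request; kcompactd consumes it on wakeup */
      		if (pgdat->kcompactd_max_order < order)
      			pgdat->kcompactd_max_order = order;
      		pgdat->kcompactd_classzone_idx = classzone_idx;
      
      		if (waitqueue_active(&pgdat->kcompactd_wait))
      			wake_up_interruptible(&pgdat->kcompactd_wait);
      	}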
      
      Since kswapd now cannot create high-order pages itself, we need to
      adjust how it determines the zones to be balanced.  The key element here
      is adding a "highorder" parameter to zone_balanced, which, when set to
      false, makes it consider only order-0 watermark instead of the desired
      higher order (this was done previously by kswapd_shrink_zone(), but not
      elsewhere).  This false is passed for example in pgdat_balanced().
      Importantly, wakeup_kswapd() uses true to make sure kswapd and thus
      kcompactd are woken up for a high-order allocation failure.
      
      The last thing is to decide what to do with pageblock_skip bitmap
      handling.  Compaction maintains a pageblock_skip bitmap to record
      pageblocks where isolation recently failed.  This bitmap can be reset by
      three ways:
      
      1) direct compaction is restarting after going through the full deferred cycle
      
      2) kswapd goes to sleep, and some other direct compaction has previously
         finished scanning the whole zone and set zone->compact_blockskip_flush.
         Note that a successful direct compaction clears this flag.
      
      3) compaction was invoked manually via trigger in /proc
      
      The case 2) is somewhat fuzzy to begin with, but after introducing
      kcompactd we should update it.  The check for direct compaction in 1),
      and to set the flush flag in 2) use current_is_kswapd(), which doesn't
      work for kcompactd.  Thus, this patch adds bool direct_compaction to
      compact_control to use in 2).  For the case 1) we remove the check
      completely - unlike the former kswapd compaction, kcompactd does use the
      deferred compaction functionality, so flushing tied to restarting from
      deferred compaction makes sense here.
      
      Note that when kswapd goes to sleep, kcompactd is woken up, so it will
      see the flushed pageblock_skip bits.  This is different from when the
      former kswapd compaction observed the bits and I believe it makes more
      sense.  Kcompactd can afford to be more thorough than a direct
      compaction trying to limit allocation latency, or kswapd whose primary
      goal is to reclaim.
      
      For testing, I used stress-highalloc configured to do order-9
      allocations with GFP_NOWAIT|__GFP_HIGH|__GFP_COMP, so they relied just
      on kswapd/kcompactd reclaim/compaction (the interfering kernel builds in
      phases 1 and 2 work as usual):
      
      stress-highalloc
                              4.5-rc1+before          4.5-rc1+after
                                   -nodirect              -nodirect
      Success 1 Min          1.00 (  0.00%)         5.00 (-66.67%)
      Success 1 Mean         1.40 (  0.00%)         6.20 (-55.00%)
      Success 1 Max          2.00 (  0.00%)         7.00 (-16.67%)
      Success 2 Min          1.00 (  0.00%)         5.00 (-66.67%)
      Success 2 Mean         1.80 (  0.00%)         6.40 (-52.38%)
      Success 2 Max          3.00 (  0.00%)         7.00 (-16.67%)
      Success 3 Min         34.00 (  0.00%)        62.00 (  1.59%)
      Success 3 Mean        41.80 (  0.00%)        63.80 (  1.24%)
      Success 3 Max         53.00 (  0.00%)        65.00 (  2.99%)
      
      User                          3166.67        3181.09
      System                        1153.37        1158.25
      Elapsed                       1768.53        1799.37
      
                                  4.5-rc1+before   4.5-rc1+after
                                       -nodirect    -nodirect
      Direct pages scanned                32938        32797
      Kswapd pages scanned              2183166      2202613
      Kswapd pages reclaimed            2152359      2143524
      Direct pages reclaimed              32735        32545
      Percentage direct scans                1%           1%
      THP fault alloc                       579          612
      THP collapse alloc                    304          316
      THP splits                              0            0
      THP fault fallback                    793          778
      THP collapse fail                      11           16
      Compaction stalls                    1013         1007
      Compaction success                     92           67
      Compaction failures                   920          939
      Page migrate success               238457       721374
      Page migrate failure                23021        23469
      Compaction pages isolated          504695      1479924
      Compaction migrate scanned         661390      8812554
      Compaction free scanned          13476658     84327916
      Compaction cost                       262          838
      
      After this patch we see improvements in allocation success rate
      (especially for phase 3) along with increased compaction activity.  The
      compaction stalls (direct compaction) in the interfering kernel builds
      (probably THP's) also decreased somewhat thanks to kcompactd activity,
      yet THP alloc successes improved a bit.
      
      Note that elapsed and user time isn't so useful for this benchmark,
      because of the background interference being unpredictable.  It's just
      to quickly spot some major unexpected differences.  System time is
      somewhat more useful and that didn't increase.
      
      Also (after adjusting mmtests' ftrace monitor):
      
      Time kswapd awake               2547781     2269241
      Time kcompactd awake                  0      119253
      Time direct compacting           939937      557649
      Time kswapd compacting                0           0
      Time kcompactd compacting             0      119099
      
      The decrease of overall time spent compacting appears not to match the
      increased compaction stats.  I suspect the tasks get rescheduled, and
      since the ftrace monitor doesn't see that, the reported time is wall
      time, not CPU time.  But arguably direct compactors care about overall
      latency anyway, whether they are busy compacting or waiting for CPU.
      And that latency seems to have almost halved.
      
      It's also interesting how much time kswapd spent awake just going
      through all the priorities and failing to even try compacting, over and
      over.
      
      We can also configure stress-highalloc to perform both direct
      reclaim/compaction and wakeup kswapd/kcompactd, by using
      GFP_KERNEL|__GFP_HIGH|__GFP_COMP:
      
      stress-highalloc
                              4.5-rc1+before         4.5-rc1+after
                                     -direct               -direct
      Success 1 Min          4.00 (  0.00%)        9.00 (-50.00%)
      Success 1 Mean         8.00 (  0.00%)       10.00 (-19.05%)
      Success 1 Max         12.00 (  0.00%)       11.00 ( 15.38%)
      Success 2 Min          4.00 (  0.00%)        9.00 (-50.00%)
      Success 2 Mean         8.20 (  0.00%)       10.00 (-16.28%)
      Success 2 Max         13.00 (  0.00%)       11.00 (  8.33%)
      Success 3 Min         75.00 (  0.00%)       74.00 (  1.33%)
      Success 3 Mean        75.60 (  0.00%)       75.20 (  0.53%)
      Success 3 Max         77.00 (  0.00%)       76.00 (  0.00%)
      
      User                          3344.73       3246.04
      System                        1194.24       1172.29
      Elapsed                       1838.04       1836.76
      
                                  4.5-rc1+before  4.5-rc1+after
                                         -direct     -direct
      Direct pages scanned               125146      120966
      Kswapd pages scanned              2119757     2135012
      Kswapd pages reclaimed            2073183     2108388
      Direct pages reclaimed             124909      120577
      Percentage direct scans                5%          5%
      THP fault alloc                       599         652
      THP collapse alloc                    323         354
      THP splits                              0           0
      THP fault fallback                    806         793
      THP collapse fail                      17          16
      Compaction stalls                    2457        2025
      Compaction success                    906         518
      Compaction failures                  1551        1507
      Page migrate success              2031423     2360608
      Page migrate failure                32845       40852
      Compaction pages isolated         4129761     4802025
      Compaction migrate scanned       11996712    21750613
      Compaction free scanned         214970969   344372001
      Compaction cost                      2271        2694
      
      In this scenario, this patch doesn't change the overall success rate as
      direct compaction already tries all it can.  There's however significant
      reduction in direct compaction stalls (that is, the number of
      allocations that went into direct compaction).  The number of successes
      (i.e.  direct compaction stalls that ended up with successful
      allocation) is reduced by the same number.  This means the offload to
      kcompactd is working as expected, and direct compaction is reduced
      either due to detecting contention, or compaction deferred by kcompactd.
      In the previous version of this patchset there was some apparent
      reduction of success rate, but the changes in this version (such as
      using sync compaction only), new baseline kernel, and/or averaging
      results from 5 executions (my bet), made this go away.
      
      Ftrace-based stats seem to roughly agree:
      
      Time kswapd awake               2532984     2326824
      Time kcompactd awake                  0      257916
      Time direct compacting           864839      735130
      Time kswapd compacting                0           0
      Time kcompactd compacting             0      257585
      Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      accf6242
    • V
      mm, compaction: introduce kcompactd · 698b1b30
      Authored by Vlastimil Babka
      Memory compaction can be currently performed in several contexts:
      
       - kswapd balancing a zone after a high-order allocation failure
       - direct compaction to satisfy a high-order allocation, including THP
         page fault attempts
       - khugepaged trying to collapse a hugepage
       - manually from /proc
      
      The purpose of compaction is two-fold.  The obvious purpose is to
      satisfy a (pending or future) high-order allocation, and is easy to
      evaluate.  The other purpose is to keep overall memory fragmentation
      low and help the anti-fragmentation mechanism.  Success with respect to
      the latter purpose is harder to evaluate.
      
      The current situation wrt the purposes has a few drawbacks:
      
       - compaction is invoked only when a high-order page or hugepage is not
         available (or manually).  This might be too late for the purposes of
         keeping memory fragmentation low.
       - direct compaction increases latency of allocations.  Again, it would
         be better if compaction was performed asynchronously to keep
         fragmentation low, before the allocation itself comes.
       - (a special case of the previous) the cost of compaction during THP
         page faults can easily offset the benefits of THP.
       - kswapd compaction appears to be complex, fragile and not working in
         some scenarios.  It could also end up compacting for a high-order
         allocation request when it should be reclaiming memory for a later
         order-0 request.
      
      To improve the situation, we should be able to benefit from an
      equivalent of kswapd, but for compaction - i.e. a background thread
      which responds to fragmentation and the need for high-order allocations
      (including hugepages) somewhat proactively.
      
      One possibility is to extend the responsibilities of kswapd, which could
      however complicate its design too much.  It should be better to let
      kswapd handle reclaim, as order-0 allocations are often more critical
      than high-order ones.
      
      Another possibility is to extend khugepaged, but this kthread is a
      single instance and tied to THP configs.
      
      This patch goes with the option of a new set of per-node kthreads called
      kcompactd, and lays the foundations, without introducing any new
      tunables.  The lifecycle mimics kswapd kthreads, including the memory
      hotplug hooks.
      
      For compaction, kcompactd uses the standard compaction_suitable() and
      compact_finished() criteria and the deferred compaction functionality.
      Unlike direct compaction, it uses only sync compaction, as there's no
      allocation latency to minimize.
      
      This patch doesn't yet add a call to wakeup_kcompactd.  The kswapd
      compact/reclaim loop for high-order pages will be replaced by waking up
      kcompactd in the next patch with the description of what's wrong with
      the old approach.
      
      Waking up of the kcompactd threads is also tied to kswapd activity and
      follows these rules:
       - we don't want to affect any fastpaths, so wake up kcompactd only from
         the slowpath, as it's done for kswapd
       - if kswapd is doing reclaim, it's more important than compaction, so
         don't invoke kcompactd until kswapd goes to sleep
       - the target order used for kswapd is passed to kcompactd
      
      Possible future uses for kcompactd include the ability to wake it up on
      demand in special situations, such as when hugepages are not available
      (currently not done due to __GFP_NO_KSWAPD) or when a fragmentation
      event (i.e.  __rmqueue_fallback()) occurs.  It's also possible to
      perform periodic compaction with kcompactd.
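      
      An illustrative shape of the kcompactd main loop (helper names such as
      kcompactd_work_requested() and kcompactd_do_work() are assumptions here,
      mirroring the kswapd life cycle described above):
      
      	static int kcompactd(void *p)
      	{
      		pg_data_t *pgdat = (pg_data_t *)p;
      
      		set_freezable();
      		while (!kthread_should_stop()) {
      			wait_event_freezable(pgdat->kcompactd_wait,
      					     kcompactd_work_requested(pgdat));
      			kcompactd_do_work(pgdat);	/* sync compaction, zone by zone */
      		}
      		return 0;
      	}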
      
      [arnd@arndb.de: fix build errors with kcompactd]
      [paul.gortmaker@windriver.com: don't use modular references for non modular code]
       Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
       Signed-off-by: Arnd Bergmann <arnd@arndb.de>
       Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
      Cc: Hugh Dickins <hughd@google.com>
       Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
       Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      698b1b30
   9. 16 Mar, 2016 3 commits
    • J
      mm/compaction: speed up pageblock_pfn_to_page() when zone is contiguous · 7cf91a98
       Authored by Joonsoo Kim
       There is a report of a performance drop during hugepage allocation in
       which half of the cpu time is spent in pageblock_pfn_to_page() during
       compaction [1].
      
       In that workload, compaction is triggered to make a hugepage, but most
       pageblocks are unavailable for compaction due to the pageblock type and
       skip bit, so compaction usually fails.  The most costly operation in this
       case is finding a valid pageblock while scanning the whole zone range.
       To check whether a pageblock is valid to compact, a valid pfn within the
       pageblock is required, and we can obtain it by calling
       pageblock_pfn_to_page().  This function checks whether the pageblock is
       in a single zone and returns a valid pfn if possible.  The problem is
       that we need to perform this check every time before scanning a
       pageblock, even if we re-visit it, and this turns out to be very
       expensive in this workload.
      
       Although we have no way to skip this pageblock check on a system where
       holes exist at arbitrary positions, we can use a cached value for zone
       contiguity and just do pfn_to_page() on a system where no hole exists.
       This optimization considerably speeds up the above workload.
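
       The akpm fix-up below refers to a zone->contiguous flag, so the fast path
       can be sketched roughly as follows (the __pageblock_pfn_to_page() name
       stands in for the original, uncached check and is an assumption for the
       sketch):

           static inline struct page *pageblock_pfn_to_page(unsigned long start_pfn,
                               unsigned long end_pfn, struct zone *zone)
           {
               /* Fast path: the zone has no holes, any pfn in range is valid. */
               if (zone->contiguous)
                   return pfn_to_page(start_pfn);

               /* Slow path: walk the pageblock, checking pfn validity and zone. */
               return __pageblock_pfn_to_page(start_pfn, end_pfn, zone);
           }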
      
      Before vs After
        Max: 1096 MB/s vs 1325 MB/s
         Min: 635 MB/s vs 1015 MB/s
         Avg: 899 MB/s vs 1194 MB/s
      
      Avg is improved by roughly 30% [2].
      
      [1]: http://www.spinics.net/lists/linux-mm/msg97378.html
      [2]: https://lkml.org/lkml/2015/12/9/23
      
      [akpm@linux-foundation.org: don't forget to restore zone->contiguous on error path, per Vlastimil]
       Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
       Reported-by: Aaron Lu <aaron.lu@intel.com>
       Acked-by: Vlastimil Babka <vbabka@suse.cz>
       Tested-by: Aaron Lu <aaron.lu@intel.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
       Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
       Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      7cf91a98
    • J
      mm/compaction: pass only pageblock aligned range to pageblock_pfn_to_page · e1409c32
       Authored by Joonsoo Kim
       pageblock_pfn_to_page() is used to check that there is a valid pfn and
       that all pages in the pageblock are in a single zone.  If there is a hole
       in the pageblock, passing an arbitrary position to
       pageblock_pfn_to_page() could cause the whole pageblock scan to be
       skipped, instead of just the hole page.  For deterministic behaviour,
       it's better to always pass a pageblock-aligned range to
       pageblock_pfn_to_page().  It will also help the further optimization of
       pageblock_pfn_to_page() in the following patch.
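
       As an illustration, a caller inside the compaction scanners would compute
       the aligned bounds explicitly before the call, along these lines
       (variable names are illustrative):

           /* Inside the scanner loop, for the pageblock containing pfn. */
           block_start_pfn = pfn & ~(pageblock_nr_pages - 1);
           block_end_pfn = min(block_start_pfn + pageblock_nr_pages,
                               zone_end_pfn(zone));

           page = pageblock_pfn_to_page(block_start_pfn, block_end_pfn, zone);
           if (!page)
               continue;   /* pageblock has a hole or spans zones; skip it */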
       Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Aaron Lu <aaron.lu@intel.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Rik van Riel <riel@redhat.com>
       Acked-by: Vlastimil Babka <vbabka@suse.cz>
       Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
       Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      e1409c32
    • J
      mm/compaction: fix invalid free_pfn and compact_cached_free_pfn · 623446e4
       Authored by Joonsoo Kim
       free_pfn and compact_cached_free_pfn are the pointers that remember the
       restart position of the freepage scanner.  When they are reset or
       invalid, we set them to zone_end_pfn because the freepage scanner works
       in the reverse direction.  But, because the zone range is defined as
       [zone_start_pfn, zone_end_pfn), zone_end_pfn is not valid to access.
       Therefore, we should not store it in free_pfn and
       compact_cached_free_pfn.  Instead, we need to store zone_end_pfn - 1 in
       them.  There is one more thing we should consider.  The freepage scanner
       scans in reverse by pageblock units.  If free_pfn and
       compact_cached_free_pfn are set to the middle of a pageblock, the scanner
       regards that situation as if it had already scanned the front part of the
       pageblock, so we lose the opportunity to scan there.  To fix this up,
       this patch does a round_down() to guarantee that the reset position is
       pageblock aligned.
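
       As an illustration, the reset in compact_zone() could then look roughly
       like this (the surrounding code and the start_pfn/end_pfn names are
       assumptions for the sketch):

           /* Reset to the last usable pfn, rounded down to a pageblock boundary. */
           cc->free_pfn = zone->compact_cached_free_pfn;
           if (cc->free_pfn < start_pfn || cc->free_pfn >= end_pfn) {
               cc->free_pfn = round_down(end_pfn - 1, pageblock_nr_pages);
               zone->compact_cached_free_pfn = cc->free_pfn;
           }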
      
       Note that, thanks to the current pageblock_pfn_to_page() implementation,
       no actual access to zone_end_pfn happens so far.  But the following patch
       will change pageblock_pfn_to_page(), so this fix is needed from now on.
       Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
       Acked-by: David Rientjes <rientjes@google.com>
       Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Aaron Lu <aaron.lu@intel.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Rik van Riel <riel@redhat.com>
       Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
       Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      623446e4
  10. 15 Jan, 2016 2 commits
  11. 06 Nov, 2015 3 commits
  12. 09 Sep, 2015 2 commits
    • J
      mm/compaction: correct to flush migrated pages if pageblock skip happens · 1a16718c
       Authored by Joonsoo Kim
       We cache isolate_start_pfn before entering isolate_migratepages().  If a
       pageblock is skipped in isolate_migratepages() for whatever reason,
       cc->migrate_pfn can be far from isolate_start_pfn, hence we flush pages
       that were freed.  For example, the following scenario is possible:
      
      - assume order-9 compaction, pageblock order is 9
       - isolate_start_pfn is 0x200
      - isolate_migratepages()
        - skip a number of pageblocks
        - start to isolate from pfn 0x600
        - cc->migrate_pfn = 0x620
        - return
      - last_migrated_pfn is set to 0x200
      - check flushing condition
        - current_block_start is set to 0x600
        - last_migrated_pfn < current_block_start then do useless flush
      
       This wrong flush doesn't help performance or the success rate, so this
       patch tries to fix it.  One simple way to know the exact position where
       we start to isolate migratable pages is to cache it in
       isolate_migratepages() before entering the actual isolation.  This patch
       implements that and fixes the problem.
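
       As an illustration of the idea (the exact placement and conditions are
       assumptions, not a quote of the patch), isolate_migratepages() would
       record the real isolation start once pages have been isolated, and
       compact_zone() would use that value in its flush check:

           /* In isolate_migratepages(), after scanning a pageblock: */
           if (cc->nr_migratepages && !cc->last_migrated_pfn)
               cc->last_migrated_pfn = isolate_start_pfn;

           /*
            * compact_zone() then compares cc->last_migrated_pfn against the
            * start of the pageblock the migration scanner currently sits in,
            * and only drains the freed pages once the scanner has really
            * moved past that pageblock.
            */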
       Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
       Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Minchan Kim <minchan@kernel.org>
       Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
       Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      1a16718c
    • V
      mm, compaction: skip compound pages by order in free scanner · 9fcd6d2e
       Authored by Vlastimil Babka
      The compaction free scanner is looking for PageBuddy() pages and
      skipping all others.  For large compound pages such as THP or hugetlbfs,
      we can save a lot of iterations if we skip them at once using their
      compound_order().  This is generally unsafe and we can read a bogus
      value of order due to a race, but if we are careful, the only danger is
      skipping too much.
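
       As an illustration, the skip in the free scanner might look roughly like
       the following (the surrounding loop, cursor variable and isolate_fail
       label are assumptions for the sketch):

           if (PageCompound(page)) {
               unsigned int comp_order = compound_order(page);

               /*
                * compound_order() can be read racily and return garbage; only
                * trust in-range values, so the worst case is skipping too
                * much, never a correctness problem.
                */
               if (likely(comp_order < MAX_ORDER)) {
                   blockpfn += (1UL << comp_order) - 1;
                   cursor += (1UL << comp_order) - 1;
               }
               goto isolate_fail;
           }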
      
       When tested with stress-highalloc from mmtests on a 4GB system with 1GB
       hugetlbfs pages, the vmstat compact_free_scanned count decreased by at
       least 15%.
       Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Mel Gorman <mgorman@suse.de>
       Acked-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
       Acked-by: Michal Nazarewicz <mina86@mina86.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
       Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
       Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      9fcd6d2e