1. 12 Sep 2013, 40 commits
    • mm: vmscan: fix do_try_to_free_pages() livelock · 6e543d57
      Committed by Lisa Du
      This patch is based on KOSAKI's work and I added a little more description;
      please refer to https://lkml.org/lkml/2012/6/14/74.
      
      Currently, I found the system can enter a state where a zone has lots of
      free pages but only order-0 and order-1 ones, which means the zone is
      heavily fragmented; a high-order allocation can then stall the direct
      reclaim path for a long time (e.g., 60 seconds), especially in a no-swap,
      no-compaction environment.  This problem happened on v3.4, but the issue
      seems to still live in the current tree; the reason is that
      do_try_to_free_pages() enters a livelock:
      
      kswapd will go to sleep if the zones have been fully scanned and are still
      not balanced, as kswapd thinks there's little point in trying all over
      again and it wants to avoid an infinite loop.  Instead it changes the order
      from high-order to 0-order, because kswapd thinks order-0 is the most
      important.  See commit 73ce02e9 for detail.  If the watermarks are OK,
      kswapd will go back to sleep and may leave zone->all_unreclaimable = 0.
      It assumes high-order users can still perform direct reclaim if they wish.
      
      Direct reclaim continues to reclaim for a high order which is not a
      COSTLY_ORDER, without invoking the oom-killer, until kswapd turns on
      zone->all_unreclaimable.  This is to avoid a too-early oom-kill.
      So direct reclaim depends on kswapd to break this loop.
      
      In the worst case, direct reclaim may continue page reclaim forever while
      kswapd sleeps forever, until something like a watchdog detects it and
      finally kills the process.  As described in:
      http://thread.gmane.org/gmane.linux.kernel.mm/103737
      
      We can't turn on zone->all_unreclaimable from the direct reclaim path,
      because the direct reclaim path doesn't take any lock and that way would be
      racy.  Thus this patch removes the zone->all_unreclaimable field completely
      and recalculates the zone's reclaimable state every time.
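      Roughly, "recalculate every time" means a check of this shape (a sketch
      based on how the helpers look after being uninlined into mm/vmscan.c, see
      the note below; the scanned-to-reclaimable factor is from memory and
      should be verified against the tree):

        /* Sketch: how "reclaimable" can be recomputed on the fly instead of
         * relying on a cached zone->all_unreclaimable flag. */
        static unsigned long zone_reclaimable_pages(struct zone *zone)
        {
                int nr;

                nr = zone_page_state(zone, NR_ACTIVE_FILE) +
                     zone_page_state(zone, NR_INACTIVE_FILE);

                if (get_nr_swap_pages() > 0)
                        nr += zone_page_state(zone, NR_ACTIVE_ANON) +
                              zone_page_state(zone, NR_INACTIVE_ANON);

                return nr;
        }

        static bool zone_reclaimable(struct zone *zone)
        {
                /* Treated as unreclaimable once ~6x the reclaimable pages have
                 * been scanned without progress. */
                return zone->pages_scanned < zone_reclaimable_pages(zone) * 6;
        }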
      
      Note: we can't take the approach where direct reclaim looks at
      zone->pages_scanned directly while kswapd continues to use
      zone->all_unreclaimable, because that is racy.  Commit 929bea7c ("vmscan:
      all_unreclaimable() use zone->all_unreclaimable as a name") describes the
      detail.
      
      [akpm@linux-foundation.org: uninline zone_reclaimable_pages() and zone_reclaimable()]
      Cc: Aaditya Kumar <aaditya.kumar.30@gmail.com>
      Cc: Ying Han <yinghan@google.com>
      Cc: Nick Piggin <npiggin@gmail.com>
      Acked-by: Rik van Riel <riel@redhat.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Bob Liu <lliubbo@gmail.com>
      Cc: Neil Zhang <zhangwm@marvell.com>
      Cc: Russell King - ARM Linux <linux@arm.linux.org.uk>
      Reviewed-by: Michal Hocko <mhocko@suse.cz>
      Acked-by: Minchan Kim <minchan@kernel.org>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Signed-off-by: Lisa Du <cldu@marvell.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      6e543d57
    • mm: munlock: manual pte walk in fast path instead of follow_page_mask() · 7a8010cd
      Committed by Vlastimil Babka
      Currently munlock_vma_pages_range() calls follow_page_mask() to obtain
      each individual struct page.  This entails repeated full page table
      translations and the page table lock being taken for each page separately.
      
      This patch avoids the costly follow_page_mask() where possible, by
      iterating over ptes within a single pmd under a single page table lock.
      The first pte is obtained by get_locked_pte() for the non-THP page acquired
      by the initial follow_page_mask().  The rest of the on-stack pagevec for
      munlock is filled up using the pte walk, as long as pte_present() and
      vm_normal_page() are sufficient to obtain the struct page.
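      A rough sketch of the walk under a single lock (a simplified illustration,
      not the patch's actual __munlock_pagevec_fill(); the function name is mine,
      and the zone checks and THP bail-out details of the real code are omitted):

        /* Hypothetical helper: fill the pagevec by walking ptes of one pmd
         * under a single page table lock instead of one follow_page_mask()
         * per page.  The caller keeps [start, end) within one pmd and has
         * already handled the page at 'start'. */
        static unsigned long munlock_fill_pvec_by_pte_walk(struct pagevec *pvec,
                        struct vm_area_struct *vma, unsigned long start,
                        unsigned long end)
        {
                spinlock_t *ptl;
                pte_t *pte = get_locked_pte(vma->vm_mm, start, &ptl);

                while (start + PAGE_SIZE < end) {
                        struct page *page;

                        pte++;
                        start += PAGE_SIZE;
                        if (!pte_present(*pte))
                                break;
                        page = vm_normal_page(vma, start, *pte);
                        if (!page || PageTransCompound(page))
                                break;
                        if (!pagevec_add(pvec, page))   /* pagevec is full */
                                break;
                }
                pte_unmap_unlock(pte, ptl);
                return start;
        }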
      
      After this patch, a 14% speedup was measured for munlocking a 56GB large
      memory area with THP disabled.
      Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Jörn Engel <joern@logfs.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Michel Lespinasse <walken@google.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      7a8010cd
    • mm: munlock: remove redundant get_page/put_page pair on the fast path · 5b40998a
      Committed by Vlastimil Babka
      The performance of the fast path in munlock_vma_range() can be further
      improved by avoiding atomic ops of a redundant get_page()/put_page() pair.
      
      When calling get_page() during page isolation, we already have the pin
      from follow_page_mask().  This pin will then be returned by
      __pagevec_lru_add(), after which we no longer reference the pages.
      
      After this patch, an 8% speedup was measured for munlocking a 56GB large
      memory area with THP disabled.
      Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
      Reviewed-by: Jörn Engel <joern@logfs.org>
      Acked-by: Mel Gorman <mgorman@suse.de>
      Cc: Michel Lespinasse <walken@google.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      5b40998a
    • mm: munlock: bypass per-cpu pvec for putback_lru_page · 56afe477
      Committed by Vlastimil Babka
      After introducing batching by pagevecs into munlock_vma_range(), we can
      further improve performance by bypassing the copying into per-cpu pagevec
      and the get_page/put_page pair associated with that.  Instead we perform
      LRU putback directly from our pagevec.  However, this is possible only for
      single-mapped pages that are evictable after munlock.  Unevictable pages
      require rechecking after being put on the unevictable list, so for those we
      fall back to putback_lru_page(), which handles that.
      
      After this patch, a 13% speedup was measured for munlocking a 56GB large
      memory area with THP disabled.
      
      [akpm@linux-foundation.org: clarify comment]
      Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
      Reviewed-by: Jörn Engel <joern@logfs.org>
      Acked-by: Mel Gorman <mgorman@suse.de>
      Cc: Michel Lespinasse <walken@google.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      56afe477
    • mm: munlock: batch NR_MLOCK zone state updates · 1ebb7cc6
      Committed by Vlastimil Babka
      Depending on the previous patch, which introduced batched isolation in
      munlock_vma_range(), we can also batch the updates of the NR_MLOCK page
      stats.  After the whole pagevec is processed for page isolation, the stats
      are updated only once with the number of successful isolations.  There were
      however no measurable performance gains.
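      The shape of the change is roughly the following (a sketch;
      __munlock_isolate() stands in for the real isolation helper and is not the
      patch's actual function name):

        /* Sketch: accumulate one delta per pagevec instead of touching the
         * NR_MLOCK counter once per page. */
        static void munlock_pagevec_stats_sketch(struct zone *zone,
                                                 struct pagevec *pvec)
        {
                int i, delta_munlocked = 0;

                spin_lock_irq(&zone->lru_lock);
                for (i = 0; i < pagevec_count(pvec); i++) {
                        struct page *page = pvec->pages[i];

                        if (TestClearPageMlocked(page) && __munlock_isolate(page))
                                delta_munlocked++;
                }
                /* Interrupts are disabled here, so the cheaper __ variant is
                 * safe to use. */
                __mod_zone_page_state(zone, NR_MLOCK, -delta_munlocked);
                spin_unlock_irq(&zone->lru_lock);
        }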
      Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
      Reviewed-by: Jörn Engel <joern@logfs.org>
      Acked-by: Mel Gorman <mgorman@suse.de>
      Cc: Michel Lespinasse <walken@google.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      1ebb7cc6
    • mm: munlock: batch non-THP page isolation and munlock+putback using pagevec · 7225522b
      Committed by Vlastimil Babka
      Currently, munlock_vma_range() calls munlock_vma_page on each page in a
      loop, which results in repeated taking and releasing of the lru_lock
      spinlock for isolating pages one by one.  This patch batches the munlock
      operations using an on-stack pagevec, so that isolation is done under
      single lru_lock.  For THP pages, the old behavior is preserved as they
      might be split while putting them into the pagevec.  After this patch, a
      9% speedup was measured for munlocking a 56GB large memory area with THP
      disabled.
      
      A new function __munlock_pagevec() is introduced that takes a pagevec and:
      1) clears PageMlocked and isolates all pages under the lru_lock; zone page
      stats can also be updated using the variant which assumes disabled
      interrupts;
      2) finishes the munlock and LRU putback on all pages under their lock_page.
      Note that previously, lock_page also covered the PageMlocked clearing and
      page isolation, but it is not needed for those operations.
      Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
      Reviewed-by: Jörn Engel <joern@logfs.org>
      Acked-by: Mel Gorman <mgorman@suse.de>
      Cc: Michel Lespinasse <walken@google.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      7225522b
    • mm: munlock: remove unnecessary call to lru_add_drain() · 586a32ac
      Committed by Vlastimil Babka
      In munlock_vma_range(), lru_add_drain() is currently called in a loop
      before each munlock_vma_page() call.
      
      This is suboptimal for performance when munlocking many pages.  The benefit
      of the per-cpu pagevec for batching the LRU putback is lost, since the
      pagevec only holds at most one page from the previous loop's iteration.
      
      The lru_add_drain() call also does not serve any purpose for correctness
      - it does not even drain the pagevecs of all CPUs.  The munlock code
      already expects and handles situations where a page cannot be isolated from
      the LRU (e.g.  because it is on some per-cpu pagevec).
      
      The history of the (uncommented) call also suggests that it appeared there
      as an oversight rather than intentionally.  Before commit ff6a6da6 ("mm:
      accelerate munlock() treatment of THP pages") the call happened only once
      upon entering the function.  That commit moved the call into the while
      loop.  So while the other changes in the commit improved munlock
      performance for THP pages, it introduced the above-mentioned suboptimal
      per-cpu pagevec usage.
      
      Further back in history, before commit 408e82b7 ("mm: munlock use
      follow_page"), munlock_vma_pages_range() was just a wrapper around
      __mlock_vma_pages_range which performed both mlock and munlock depending
      on a flag.  However, before ba470de4 ("mmap: handle mlocked pages during
      map, remap, unmap") the function handled only mlock, not munlock.  The
      lru_add_drain call thus comes from the implementation in commit b291f000
      ("mlock: mlocked pages are unevictable") and was intended only for
      mlocking, not munlocking.  The original intention of draining the LRU
      pagevec at mlock time was to ensure the pages were on the LRU before the
      lock operation so that they could be placed on the unevictable list
      immediately.  There is very little motivation to do the same in the
      munlock path, particularly for every single page.
      
      This patch therefore removes the call completely.  After removing the
      call, a 10% speedup was measured for munlock() of a 56GB large memory area
      with THP disabled.
      Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
      Reviewed-by: Jörn Engel <joern@logfs.org>
      Acked-by: Mel Gorman <mgorman@suse.de>
      Cc: Michel Lespinasse <walken@google.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      586a32ac
    • mm: putback_lru_page: remove unnecessary call to page_lru_base_type() · 0ec3b74c
      Committed by Vlastimil Babka
      The goal of this patch series is to improve performance of munlock() of
      large mlocked memory areas on systems without THP.  This is motivated by
      reported very long times of crash recovery of processes with such areas,
      where munlock() can take several seconds.  See
      http://lwn.net/Articles/548108/
      
      The work was driven by a simple benchmark (to be included in mmtests) that
      mmap()s e.g.  56GB with MAP_LOCKED | MAP_POPULATE and measures the time of
      munlock().  Profiling was performed by attaching operf --pid to the
      process, sending a signal to trigger the munlock() part, and then notifying
      the monitoring wrapper back to stop operf, so that only munlock() appears
      in the profile.
      
      The profiles have shown that CPU time is spent mostly in atomic operations
      and repeated locking of single pages.  This series aims to reduce both,
      starting from simpler to more complex changes.
      
      Patch 1 performs a simple cleanup in putback_lru_page() so that the page
        lru base type is not determined unless it is actually needed.
      
      Patch 2 removes an unnecessary call to lru_add_drain() which drains the per-cpu
      	pagevec after each munlocked page is put there.
      
      Patch 3 changes munlock_vma_range() to use an on-stack pagevec for isolating
      	multiple non-THP pages under a single lru_lock instead of locking and
      	processing each page separately.
      
      Patch 4 changes the NR_MLOCK accounting to be called only once per pvec
        introduced by the previous patch.
      
      Patch 5 uses the introduced pagevec to batch also the work of putback_lru_page
      	when possible, bypassing the per-cpu pvec and associated overhead.
      
      Patch 6 removes a redundant get_page/put_page pair which saves costly atomic
      	operations.
      
      Patch 7 avoids calling follow_page_mask() on each individual page, and obtains
      	multiple page references under a single page table lock where possible.
      
      Measurements were made using 3.11-rc3 as a baseline.  The first set of
      measurements shows the possibly ideal conditions where batching should
      help the most.  All memory is allocated from a single NUMA node and THP is
      disabled.
      
      timedmunlock
                                  3.11-rc3              3.11-rc3              3.11-rc3              3.11-rc3              3.11-rc3              3.11-rc3              3.11-rc3              3.11-rc3
                                         0                     1                     2                     3                     4                     5                     6                     7
      Elapsed min           3.38 (  0.00%)        3.39 ( -0.13%)        3.00 ( 11.33%)        2.70 ( 20.20%)        2.67 ( 21.11%)        2.37 ( 29.88%)        2.20 ( 34.91%)        1.91 ( 43.59%)
      Elapsed mean          3.39 (  0.00%)        3.40 ( -0.23%)        3.01 ( 11.33%)        2.70 ( 20.26%)        2.67 ( 21.21%)        2.38 ( 29.88%)        2.21 ( 34.93%)        1.92 ( 43.46%)
      Elapsed stddev        0.01 (  0.00%)        0.01 (-43.09%)        0.01 ( 15.42%)        0.01 ( 23.42%)        0.00 ( 89.78%)        0.01 ( -7.15%)        0.00 ( 76.69%)        0.02 (-91.77%)
      Elapsed max           3.41 (  0.00%)        3.43 ( -0.52%)        3.03 ( 11.29%)        2.72 ( 20.16%)        2.67 ( 21.63%)        2.40 ( 29.50%)        2.21 ( 35.21%)        1.96 ( 42.39%)
      Elapsed range         0.03 (  0.00%)        0.04 (-51.16%)        0.02 (  6.27%)        0.02 ( 14.67%)        0.00 ( 88.90%)        0.03 (-19.18%)        0.01 ( 73.70%)        0.06 (-113.35%)
      
      The second set of measurements simulates the worst possible conditions for
      batching by using numactl --interleave, so that there is in fact only one
      page per pagevec.  Even in this case the series seems to improve
      performance thanks to reduced atomic operations and removal of
      lru_add_drain().
      
      timedmunlock
                                  3.11-rc3              3.11-rc3              3.11-rc3              3.11-rc3              3.11-rc3              3.11-rc3              3.11-rc3              3.11-rc3
                                         0                     1                     2                     3                     4                     5                     6                     7
      Elapsed min           4.00 (  0.00%)        4.04 ( -0.93%)        3.87 (  3.37%)        3.72 (  6.94%)        3.81 (  4.72%)        3.69 (  7.82%)        3.64 (  8.92%)        3.41 ( 14.81%)
      Elapsed mean          4.17 (  0.00%)        4.15 (  0.51%)        4.03 (  3.49%)        3.89 (  6.84%)        3.86 (  7.48%)        3.89 (  6.69%)        3.70 ( 11.27%)        3.48 ( 16.59%)
      Elapsed stddev        0.16 (  0.00%)        0.08 ( 50.76%)        0.10 ( 41.58%)        0.16 (  4.59%)        0.05 ( 72.38%)        0.19 (-12.91%)        0.05 ( 68.09%)        0.06 ( 66.03%)
      Elapsed max           4.34 (  0.00%)        4.32 (  0.56%)        4.19 (  3.62%)        4.12 (  5.15%)        3.91 (  9.88%)        4.12 (  5.25%)        3.80 ( 12.58%)        3.56 ( 18.08%)
      Elapsed range         0.34 (  0.00%)        0.28 ( 17.91%)        0.32 (  6.45%)        0.40 (-15.73%)        0.10 ( 70.06%)        0.43 (-24.84%)        0.15 ( 55.32%)        0.15 ( 56.16%)
      
      For completeness, a third set of measurements shows the situation where
      THP is enabled and allocations are again done on a single NUMA node.  Here
      munlock() is already very fast thanks to huge pages, and this series does
      not compromise that performance.  It seems that the removal of call to
      lru_add_drain() still helps a bit.
      
      timedmunlock
                                  3.11-rc3              3.11-rc3              3.11-rc3              3.11-rc3              3.11-rc3              3.11-rc3              3.11-rc3              3.11-rc3
                                         0                     1                     2                     3                     4                     5                     6                     7
      Elapsed min           0.01 (  0.00%)        0.01 ( -0.11%)        0.01 (  6.59%)        0.01 (  5.41%)        0.01 (  5.45%)        0.01 (  5.03%)        0.01 (  6.08%)        0.01 (  5.20%)
      Elapsed mean          0.01 (  0.00%)        0.01 ( -0.27%)        0.01 (  6.39%)        0.01 (  5.30%)        0.01 (  5.32%)        0.01 (  5.03%)        0.01 (  5.97%)        0.01 (  5.22%)
      Elapsed stddev        0.00 (  0.00%)        0.00 ( -9.59%)        0.00 ( 10.77%)        0.00 (  3.24%)        0.00 ( 24.42%)        0.00 ( 31.86%)        0.00 ( -7.46%)        0.00 (  6.11%)
      Elapsed max           0.01 (  0.00%)        0.01 ( -0.01%)        0.01 (  6.83%)        0.01 (  5.42%)        0.01 (  5.79%)        0.01 (  5.53%)        0.01 (  6.08%)        0.01 (  5.26%)
      Elapsed range         0.00 (  0.00%)        0.00 (  7.30%)        0.00 ( 24.38%)        0.00 (  6.10%)        0.00 ( 30.79%)        0.00 ( 42.52%)        0.00 (  6.11%)        0.00 ( 10.07%)
      
      This patch (of 7):
      
      In putback_lru_page(), since commit c53954a0 ("mm: remove lru parameter
      from __lru_cache_add and lru_cache_add_lru") it is no longer necessary to
      determine the lru list via page_lru_base_type().
      
      This patch replaces it with a simple flag, is_unevictable, which says that
      the page was put on the unevictable list.  This is the only information
      that matters in subsequent tests.
      Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
      Reviewed-by: Jörn Engel <joern@logfs.org>
      Acked-by: Mel Gorman <mgorman@suse.de>
      Cc: Michel Lespinasse <walken@google.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      0ec3b74c
    • mm: track vma changes with VM_SOFTDIRTY bit · d9104d1c
      Committed by Cyrill Gorcunov
      Pavel reported that if a vma area gets unmapped and then mapped (or
      expanded) in-place, the soft dirty tracker won't be able to recognize this
      situation since it works on the pte level, and ptes get zapped on unmap,
      losing the soft dirty bit of course.
      
      So to resolve this situation we need to track actions at the vma level,
      which is where the VM_SOFTDIRTY flag comes in.  When a new vma area is
      created (or an old one is expanded) we set this bit, and keep it there
      until the application calls for clearing the soft dirty bit.
      
      Thus when a user space application tracks memory changes, it can now detect
      whether a vma area has been renewed.
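      Conceptually the tracking points look roughly like this (a sketch of the
      idea, not the literal diff; report_soft_dirty() is a made-up placeholder
      for the pagemap/smaps reporting side):

        /* On vma creation or in-place expansion (e.g. in mmap_region()):
         * mark the whole vma, since freshly created ptes carry no history. */
        vma->vm_flags |= VM_SOFTDIRTY;

        /* When userspace writes "4" (soft-dirty) to /proc/<pid>/clear_refs,
         * drop the vma-level bit together with the per-pte soft-dirty bits. */
        vma->vm_flags &= ~VM_SOFTDIRTY;

        /* Reporting side: a set vma-level flag means "treat everything in
         * this vma as potentially changed". */
        if (vma->vm_flags & VM_SOFTDIRTY)
                report_soft_dirty(vma);         /* placeholder */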
      Reported-by: Pavel Emelyanov <xemul@parallels.com>
      Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Matt Mackall <mpm@selenic.com>
      Cc: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
      Cc: Marcelo Tosatti <mtosatti@redhat.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@gmail.com>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Rob Landley <rob@landley.net>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      d9104d1c
    • mm: page_alloc: fix comment get_page_from_freelist · 3b11f0aa
      Committed by SeungHun Lee
      cpuset_zone_allowed was changed to cpuset_zone_allowed_softwall and the
      comment was moved to __cpuset_node_allowed_softwall, so fix this comment.
      Signed-off-by: SeungHun Lee <waydi1@gmail.com>
      Acked-by: David Rientjes <rientjes@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      3b11f0aa
    • writeback: fix occasional slow sync(1) · 47df3dde
      Committed by Jan Kara
      In case when system contains no dirty pages, wakeup_flusher_threads() will
      submit WB_SYNC_NONE writeback for 0 pages so wb_writeback() exits
      immediately without doing anything, even though there are dirty inodes in
      the system.  Thus sync(1) will write all the dirty inodes from a
      WB_SYNC_ALL writeback pass which is slow.
      
      Fix the problem by using get_nr_dirty_pages() in wakeup_flusher_threads()
      instead of calculating number of dirty pages manually.  That function also
      takes number of dirty inodes into account.
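      The shape of the fix in wakeup_flusher_threads() is roughly the following
      (a sketch from memory, not the exact hunk):

        /* In wakeup_flusher_threads(), before distributing work to the bdis:
         * a caller-supplied 0 used to be replaced by the number of dirty
         * pages, which can itself be 0 even when dirty inodes exist;
         * get_nr_dirty_pages() also accounts for dirty inodes. */
        if (nr_pages == 0)
                nr_pages = get_nr_dirty_pages();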
      Signed-off-by: Jan Kara <jack@suse.cz>
      Reported-by: Paul Taysom <taysom@chromium.org>
      Cc: Wu Fengguang <fengguang.wu@intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      47df3dde
    • mm: fix aio performance regression for database caused by THP · 7cb2ef56
      Committed by Khalid Aziz
      I am working with a tool that simulates oracle database I/O workload.
      This tool (orion to be specific -
      <http://docs.oracle.com/cd/E11882_01/server.112/e16638/iodesign.htm#autoId24>)
      allocates hugetlbfs pages using shmget() with SHM_HUGETLB flag.  It then
      does aio into these pages from flash disks using various common block
      sizes used by database.  I am looking at performance with two of the most
      common block sizes - 1M and 64K.  aio performance with these two block
      sizes plunged after Transparent HugePages was introduced in the kernel.
      Here are performance numbers:
      
      		pre-THP		2.6.39		3.11-rc5
      1M read		8384 MB/s	5629 MB/s	6501 MB/s
      64K read	7867 MB/s	4576 MB/s	4251 MB/s
      
      I have narrowed the performance impact down to the overheads introduced by
      THP in __get_page_tail() and put_compound_page() routines.  perf top shows
      >40% of cycles being spent in these two routines.  Every time direct I/O
      to hugetlbfs pages starts, kernel calls get_page() to grab a reference to
      the pages and calls put_page() when I/O completes to put the reference
      away.  THP introduced significant amount of locking overhead to get_page()
      and put_page() when dealing with compound pages because hugepages can be
      split underneath get_page() and put_page().  It added this overhead
      irrespective of whether it is dealing with hugetlbfs pages or transparent
      hugepages.  This resulted in 20%-45% drop in aio performance when using
      hugetlbfs pages.
      
      Since hugetlbfs pages can not be split, there is no reason to go through
      all the locking overhead for these pages from what I can see.  I added
      code to __get_page_tail() and put_compound_page() to bypass all the
      locking code when working with hugetlbfs pages.  This improved performance
      significantly.  Performance numbers with this patch:
      
      		pre-THP		3.11-rc5	3.11-rc5 + Patch
      1M read		8384 MB/s	6501 MB/s	8371 MB/s
      64K read	7867 MB/s	4251 MB/s	6510 MB/s
      
      Performance with 64K read is still lower than what it was before THP, but
      still a 53% improvement.  It does mean there is more work to be done but I
      will take a 53% improvement for now.
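      The bypass amounts to checking for hugetlbfs pages early and skipping the
      THP splitting protocol; a sketch of the idea (the helper below is
      illustrative, not the patch's literal code):

        /* Illustrative fast path: hugetlbfs pages can never be split, so a
         * plain reference on the head page suffices - no compound_lock
         * dance as required for THP. */
        static inline bool hugetlbfs_get_page_fast(struct page *page)
        {
                if (!PageHuge(page))
                        return false;   /* let the THP-aware slow path run */

                atomic_inc(&compound_head(page)->_count);
                return true;
        }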
      
      Please take a look at the following patch and let me know if it looks
      reasonable.
      
      [akpm@linux-foundation.org: tweak comments]
      Signed-off-by: Khalid Aziz <khalid.aziz@oracle.com>
      Cc: Pravin B Shelar <pshelar@nicira.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      7cb2ef56
    • mm: compaction: do not compact pgdat for order-0 · 3a7200af
      Committed by Mel Gorman
      If kswapd was reclaiming for a high order and resets it to 0 due to
      fragmentation it will still call compact_pgdat.  For the most part, this
      will fail a compaction_suitable() test and not compact but it is
      unnecessarily sloppy.  It could be fixed in the caller but fix it in the
      API instead.
      
      [dhillf@gmail.com: pointed out that it was a potential problem]
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Cc: Hillf Danton <dhillf@gmail.com>
      Acked-by: Minchan Kim <minchan@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      3a7200af
    • kmemcg: don't allocate extra memory for root memcg_cache_params · 90c7a79c
      Committed by Andrey Vagin
      The memcg_cache_params structure contains the common part and a union,
      which represents two different types of data: one for root caches and
      another for child caches.
      
      The size of the child data is fixed.  The size of the memcg_caches array is
      calculated at runtime.
      
      Currently the size of memcg_cache_params for root caches is calculated
      incorrectly, because it includes the size of parameters for child caches.
      
      ssize_t size = memcg_caches_array_size(num_groups);
      size *= sizeof(void *);
      
      size += sizeof(struct memcg_cache_params);
      
      v2: Fix a typo in calculations
      Signed-off-by: Andrey Vagin <avagin@openvz.org>
      Cc: Glauber Costa <glommer@openvz.org>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Balbir Singh <bsingharora@gmail.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      90c7a79c
    • memblock, numa: binary search node id · e76b63f8
      Committed by Yinghai Lu
      Currently, early_pfn_to_nid() on arches that support memblock goes over
      memblock.memory one entry at a time, so it takes too many tries near the
      end.
      
      We can use the existing memblock_search() to find the node id for a given
      pfn; that could save some time on bigger systems whose memblock.memory
      array has many entries.
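      The idea, modelled in plain userspace C (an illustration of the binary
      search over sorted ranges, not the kernel implementation; names are mine):

        #include <stdio.h>

        /* memblock.memory is a sorted array of ranges, so the owning entry
         * for a pfn can be found by binary search instead of a linear scan. */
        struct region { unsigned long base_pfn, end_pfn; int nid; };

        static int pfn_to_nid(const struct region *r, int cnt, unsigned long pfn)
        {
                int lo = 0, hi = cnt - 1;

                while (lo <= hi) {
                        int mid = lo + (hi - lo) / 2;

                        if (pfn < r[mid].base_pfn)
                                hi = mid - 1;
                        else if (pfn >= r[mid].end_pfn)
                                lo = mid + 1;
                        else
                                return r[mid].nid;
                }
                return -1;      /* not covered by any region */
        }

        int main(void)
        {
                struct region map[] = {
                        { 0x0000, 0x8000, 0 },
                        { 0x8000, 0x10000, 1 },
                };

                printf("pfn 0x9000 -> node %d\n", pfn_to_nid(map, 2, 0x9000));
                return 0;
        }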
      
      Here are the timing differences for several machines.  In each case with
      the patch less time was spent in __early_pfn_to_nid().
      
                              3.11-rc5        with patch      difference (%)
                              --------        ----------      --------------
      UV1: 256 nodes  9TB:     411.66          402.47         -9.19 (2.23%)
      UV2: 255 nodes 16TB:    1141.02         1138.12         -2.90 (0.25%)
      UV2:  64 nodes  2TB:     128.15          126.53         -1.62 (1.26%)
      UV2:  32 nodes  2TB:     121.87          121.07         -0.80 (0.66%)
                              Time in seconds.
      Signed-off-by: Yinghai Lu <yinghai@kernel.org>
      Cc: Tejun Heo <tj@kernel.org>
      Acked-by: Russ Anderson <rja@sgi.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      e76b63f8
    • mbind: add BUG_ON(!vma) in new_vma_page() · 0bf598d8
      Committed by Naoya Horiguchi
      new_vma_page() is called only by page migration invoked from do_mbind(),
      where the pages to be migrated are queued into a pagelist by
      queue_pages_range().  queue_pages_range() confirms that a queued page
      belongs to some vma, so the !vma case is not supposed to happen.  This
      patch adds a BUG_ON() to catch this unexpected case.
      Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      0bf598d8
    • mm/mempolicy: rename check_*range to queue_pages_*range · 98094945
      Committed by Naoya Horiguchi
      The function check_range() (and its family) is not well-named, because it
      does not only check something, but also moves pages from list to list to do
      page migration for them.  So queue_pages_*range is a more desirable name.
      Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Wanpeng Li <liwanp@linux.vnet.ibm.com>
      Cc: Hillf Danton <dhillf@gmail.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      98094945
    • mm: prepare to remove /proc/sys/vm/hugepages_treat_as_movable · 86cdb465
      Committed by Naoya Horiguchi
      Now hugepage migration is enabled, although restricted to pmd-based
      hugepages for now (due to lack of testing).  So we should allocate
      migratable hugepages from ZONE_MOVABLE if possible.
      
      This patch makes the GFP flags used for hugepage allocation dependent on
      migration support, not only on the value of hugepages_treat_as_movable.  It
      provides no change in behavior for architectures which do not support
      hugepage migration.
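      The gist is roughly the following (a sketch; hugepage_migration_support()
      is introduced later in this series, and the exact helper shape should be
      verified against the patch):

        /* Sketch: choose the allocation mask per hstate.  Migratable
         * hugepages may come from ZONE_MOVABLE; non-migratable ones must not,
         * or they would pin supposedly movable memory. */
        static inline gfp_t htlb_alloc_mask(struct hstate *h)
        {
                if (hugepages_treat_as_movable || hugepage_migration_support(h))
                        return GFP_HIGHUSER_MOVABLE;
                else
                        return GFP_HIGHUSER;
        }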
      Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Acked-by: Andi Kleen <ak@linux.intel.com>
      Reviewed-by: Wanpeng Li <liwanp@linux.vnet.ibm.com>
      Cc: Hillf Danton <dhillf@gmail.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      86cdb465
    • mm: migrate: check movability of hugepage in unmap_and_move_huge_page() · 83467efb
      Committed by Naoya Horiguchi
      Currently hugepage migration works well only for pmd-based hugepages
      (mainly due to lack of testing), so we had better not enable migration of
      other levels of hugepages until we are ready for it.
      
      Some users of hugepage migration (mbind, move_pages, and migrate_pages) do
      page table walk and check pud/pmd_huge() there, so they are safe.  But the
      other users (softoffline and memory hotremove) don't do this, so without
      this patch they can try to migrate unexpected types of hugepages.
      
      To prevent this, we introduce hugepage_migration_support() as an
      architecture-dependent check of whether hugepages are implemented on a pmd
      basis or not.  On some architectures multiple sizes of hugepages are
      available, so hugepage_migration_support() also checks the hugepage size.
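      The check boils down to something like this (a sketch;
      arch_supports_pmd_hugepages() is a stand-in for the real per-architecture
      hook, not the name used in the patch):

        /* Sketch: only pmd-sized hugepages, on architectures that implement
         * huge pages at the pmd level, are treated as migratable for now. */
        static inline int hugepage_migration_support(struct hstate *h)
        {
                return arch_supports_pmd_hugepages() &&
                       huge_page_shift(h) == PMD_SHIFT;
        }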
      Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Hillf Danton <dhillf@gmail.com>
      Cc: Wanpeng Li <liwanp@linux.vnet.ibm.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      83467efb
    • mm: memory-hotplug: enable memory hotplug to handle hugepage · c8721bbb
      Committed by Naoya Horiguchi
      Until now we can't offline memory blocks which contain hugepages because a
      hugepage is considered as an unmovable page.  But now with this patch
      series, a hugepage has become movable, so by using hugepage migration we
      can offline such memory blocks.
      
      What's different from other users of hugepage migration is that we need to
      decompose all the hugepages inside the target memory block into free buddy
      pages after hugepage migration, because otherwise free hugepages remaining
      in the memory block interfere with the memory offlining.  For this reason
      we introduce the new functions dissolve_free_huge_page() and
      dissolve_free_huge_pages(), sketched below.
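      A rough sketch of the dissolve step (simplified: the real code derives the
      step size from the smallest configured hstate and handles the pool
      accounting inside dissolve_free_huge_page(); the _sketch name is mine):

        /* Sketch: walk the offlined pfn range in huge-page-sized steps and
         * turn any free hugepage found back into ordinary free buddy pages. */
        static void dissolve_free_huge_pages_sketch(unsigned long start_pfn,
                                                    unsigned long end_pfn)
        {
                struct hstate *h = &default_hstate;
                unsigned long step = 1UL << huge_page_order(h);
                unsigned long pfn;

                for (pfn = start_pfn; pfn < end_pfn; pfn += step)
                        dissolve_free_huge_page(pfn_to_page(pfn));
        }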
      
      Other than that, what this patch does is straightforward: it adds the
      hugepage migration code, that is, it adds hugepage handling to the
      functions which scan over pfns and collect hugepages to be migrated, and
      adds a hugepage allocation function to alloc_migrate_target().
      
      As for larger hugepages (1GB for x86_64), it's not easy to do hotremove
      over them because they are larger than a memory block.  So for now we
      simply leave that case to fail as it is.
      
      [yongjun_wei@trendmicro.com.cn: remove duplicated include]
      Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Acked-by: Andi Kleen <ak@linux.intel.com>
      Cc: Hillf Danton <dhillf@gmail.com>
      Cc: Wanpeng Li <liwanp@linux.vnet.ibm.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
      Signed-off-by: Wei Yongjun <yongjun_wei@trendmicro.com.cn>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      c8721bbb
    • mm: migrate: remove VM_HUGETLB from vma flag check in vma_migratable() · 71ea2efb
      Committed by Naoya Horiguchi
      Enable hugepage migration from migrate_pages(2), move_pages(2), and
      mbind(2).
      Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Acked-by: Hillf Danton <dhillf@gmail.com>
      Acked-by: Andi Kleen <ak@linux.intel.com>
      Reviewed-by: Wanpeng Li <liwanp@linux.vnet.ibm.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      71ea2efb
    • mm: mbind: add hugepage migration code to mbind() · 74060e4d
      Committed by Naoya Horiguchi
      Extend do_mbind() to handle vma with VM_HUGETLB set.  We will be able to
      migrate hugepage with mbind(2) after applying the enablement patch which
      comes later in this series.
      Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Acked-by: Andi Kleen <ak@linux.intel.com>
      Reviewed-by: Wanpeng Li <liwanp@linux.vnet.ibm.com>
      Acked-by: Hillf Danton <dhillf@gmail.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      74060e4d
    • mm: migrate: add hugepage migration code to move_pages() · e632a938
      Committed by Naoya Horiguchi
      Extend move_pages() to handle vma with VM_HUGETLB set.  We will be able to
      migrate hugepage with move_pages(2) after applying the enablement patch
      which comes later in this series.
      
      We avoid taking a refcount on tail pages of a hugepage, because unlike thp,
      a hugepage is not split and we need not care about races with splitting.
      
      Also, migration of larger (1GB for x86_64) hugepages is not enabled.
      Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Acked-by: Andi Kleen <ak@linux.intel.com>
      Reviewed-by: Wanpeng Li <liwanp@linux.vnet.ibm.com>
      Cc: Hillf Danton <dhillf@gmail.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      e632a938
    • migrate: add hugepage migration code to migrate_pages() · e2d8cf40
      Committed by Naoya Horiguchi
      Extend check_range() to handle vma with VM_HUGETLB set.  We will be able
      to migrate hugepage with migrate_pages(2) after applying the enablement
      patch which comes later in this series.
      
      Note that for larger hugepages (covered by pud entries, 1GB for x86_64 for
      example), we simply skip them for now.
      
      Note that using pmd_huge/pud_huge assumes that hugepages are pointed to by
      pmd/pud.  This is not true on some architectures implementing hugepages
      with other mechanisms, like ia64, but it's OK because pmd_huge/pud_huge
      simply return 0 on such arches and the page walker simply ignores such
      hugepages.
      Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Acked-by: Andi Kleen <ak@linux.intel.com>
      Reviewed-by: Wanpeng Li <liwanp@linux.vnet.ibm.com>
      Acked-by: Hillf Danton <dhillf@gmail.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      e2d8cf40
    • mm: soft-offline: use migrate_pages() instead of migrate_huge_page() · b8ec1cee
      Committed by Naoya Horiguchi
      Currently migrate_huge_page() takes a pointer to a hugepage to be migrated
      as an argument, instead of taking a pointer to the list of hugepages to be
      migrated.  This behavior was introduced in commit 189ebff2 ("hugetlb:
      simplify migrate_huge_page()"), and was OK because until now hugepage
      migration is enabled only for soft-offlining which migrates only one
      hugepage in a single call.
      
      But the situation will change in the later patches in this series, which
      enable other users of page migration to support hugepage migration.  They
      can kick off migration for both normal pages and hugepages in a single
      call, so we need to go back to the original implementation which uses
      linked lists to collect the hugepages to be migrated.
      
      With this patch, soft_offline_huge_page() switches to use migrate_pages(),
      and migrate_huge_page() is not used any more.  So let's remove it.
      Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Acked-by: Andi Kleen <ak@linux.intel.com>
      Reviewed-by: Wanpeng Li <liwanp@linux.vnet.ibm.com>
      Acked-by: Hillf Danton <dhillf@gmail.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      b8ec1cee
    • mm: migrate: make core migration code aware of hugepage · 31caf665
      Committed by Naoya Horiguchi
      Currently hugepage migration is available only for soft offlining, but
      it's also useful for some other users of page migration (clearly because
      users of hugepage can enjoy the benefit of mempolicy and memory hotplug.)
      So this patchset tries to extend such users to support hugepage migration.
      
      The target of this patchset is to enable hugepage migration for NUMA
      related system calls (migrate_pages(2), move_pages(2), and mbind(2)), and
      memory hotplug.
      
      This patchset does not add hugepage migration for memory compaction,
      because users of memory compaction mainly expect to construct thp by
      arranging raw pages, and there's little or no need to compact hugepages.
      CMA, another user of page migration, can have benefit from hugepage
      migration, but is not enabled to support it for now (just because of lack
      of testing and expertise in CMA.)
      
      Hugepage migration of non pmd-based hugepage (for example 1GB hugepage in
      x86_64, or hugepages in architectures like ia64) is not enabled for now
      (again, because of lack of testing.)
      
      As for how these are achieved, I extended the API (migrate_pages()) to
      handle hugepages (with patches 1 and 2) and adjusted the code of each
      caller to check and collect movable hugepages (with patches 3-7).  The
      remaining 2 patches are kind of miscellaneous ones to avoid unexpected
      behavior.  Patch 8 is about making sure that we only migrate pmd-based
      hugepages.  And patch 9 is about choosing an appropriate zone for hugepage
      allocation.
      
      My testing is mainly functional: simply kicking hugepage migration via
      each entry point and confirming that the migration is done correctly.  The
      test code is available here:
      
        git://github.com/Naoya-Horiguchi/test_hugepage_migration_extension.git
      
      And I always run libhugetlbfs test when changing hugetlbfs's code.  With
      this patchset, no regression was found in the test.
      
      This patch (of 9):
      
      Before enabling each user of page migration to support hugepage,
      this patch enables the list of pages for migration to link not only
      LRU pages, but also hugepages. As a result, putback_movable_pages()
      and migrate_pages() can handle both of LRU pages and hugepages.
      Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Acked-by: Andi Kleen <ak@linux.intel.com>
      Reviewed-by: Wanpeng Li <liwanp@linux.vnet.ibm.com>
      Acked-by: Hillf Danton <dhillf@gmail.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      31caf665
    • mm, hugetlb: return a reserved page to a reserved pool if failed · 07443a85
      Committed by Joonsoo Kim
      If we fail with a reserved page, just calling put_page() is not sufficient,
      because put_page() invokes free_huge_page() as its last step, and that
      doesn't know whether the page came from a reserved pool or not, so it
      doesn't do anything related to the reserve count.  This makes the reserve
      count lower than we need, because the reserve count was already decreased
      in dequeue_huge_page_vma().  This patch fixes this situation.
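      The fix is roughly of this shape (a sketch from memory; the flag choice
      and the exact placement should be verified against the patch):

        /* Error path (sketch): the page still owns a reservation that
         * free_huge_page() must credit back, so mark it before dropping it. */
        SetPagePrivate(page);
        put_page(page);

        /* In free_huge_page() (sketch): */
        int restore_reserve = PagePrivate(page);

        ClearPagePrivate(page);
        if (restore_reserve)
                h->resv_huge_pages++;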
      Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Aneesh Kumar <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Davidlohr Bueso <davidlohr@hp.com>
      Cc: David Gibson <david@gibson.dropbear.id.au>
      Cc: Wanpeng Li <liwanp@linux.vnet.ibm.com>
      Cc: Hillf Danton <dhillf@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      07443a85
    • mm, hugetlb: grab a page_table_lock after page_cache_release · 8312034f
      Committed by Joonsoo Kim
      We don't need to hold the page_table_lock when we try to release a page,
      so defer grabbing the page_table_lock.
      Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Aneesh Kumar <aneesh.kumar@linux.vnet.ibm.com>
      Reviewed-by: Davidlohr Bueso <davidlohr@hp.com>
      Cc: David Gibson <david@gibson.dropbear.id.au>
      Cc: Wanpeng Li <liwanp@linux.vnet.ibm.com>
      Cc: Hillf Danton <dhillf@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      8312034f
    • mm, hugetlb: remove useless check about mapping type · 5944d011
      Committed by Joonsoo Kim
      is_vma_resv_set(vma, HPAGE_RESV_OWNER) implies that this mapping is
      private, so we don't need to check whether this mapping is shared or
      not.
      
      This patch is just for clean-up.
      Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Aneesh Kumar <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Reviewed-by: Davidlohr Bueso <davidlohr@hp.com>
      Cc: David Gibson <david@gibson.dropbear.id.au>
      Cc: Wanpeng Li <liwanp@linux.vnet.ibm.com>
      Cc: Hillf Danton <dhillf@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      5944d011
    • mm, hugetlb: fix subpool accounting handling · 8bb3f12e
      Committed by Joonsoo Kim
      If we allocate a hugepage with avoid_reserve, we don't dequeue a reserved
      one.  So we should also check the subpool counter in the avoid_reserve
      case.  This patch implements that.
      Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Aneesh Kumar <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Davidlohr Bueso <davidlohr@hp.com>
      Cc: David Gibson <david@gibson.dropbear.id.au>
      Cc: Wanpeng Li <liwanp@linux.vnet.ibm.com>
      Cc: Hillf Danton <dhillf@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      8bb3f12e
    • mm, hugetlb: change variable name reservations to resv · f522c3ac
      Committed by Joonsoo Kim
      'reservations' is a rather long name for a variable, and we use 'resv_map'
      to represent 'struct resv_map' elsewhere.  To reduce confusion and
      unreadability, change it.
      Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Reviewed-by: Aneesh Kumar <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Reviewed-by: Davidlohr Bueso <davidlohr@hp.com>
      Cc: David Gibson <david@gibson.dropbear.id.au>
      Cc: Wanpeng Li <liwanp@linux.vnet.ibm.com>
      Cc: Hillf Danton <dhillf@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      f522c3ac
    • mm, hugetlb: protect reserved pages when soft offlining a hugepage · 4ef91848
      Committed by Joonsoo Kim
      Don't use the reserve pool when soft offlining a hugepage.  Check that we
      have free pages outside the reserve pool before we dequeue the huge page.
      Otherwise, we can steal someone else's reserved page.
      Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Reviewed-by: Aneesh Kumar <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Reviewed-by: Davidlohr Bueso <davidlohr@hp.com>
      Cc: David Gibson <david@gibson.dropbear.id.au>
      Cc: Wanpeng Li <liwanp@linux.vnet.ibm.com>
      Cc: Hillf Danton <dhillf@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      4ef91848
    • mm/hotplug: remove stop_machine() from try_offline_node() · 0f1cfe9d
      Committed by Toshi Kani
      lock_device_hotplug() serializes hotplug & online/offline operations.  The
      lock is held in common sysfs online/offline interfaces and ACPI hotplug
      code paths.
      
      And here are the code paths:
      
      - CPU & Mem online/offline via sysfs online
      	store_online()->lock_device_hotplug()
      
      - Mem online via sysfs state:
      	store_mem_state()->lock_device_hotplug()
      
      - ACPI CPU & Mem hot-add:
      	acpi_scan_bus_device_check()->lock_device_hotplug()
      
      - ACPI CPU & Mem hot-delete:
      	acpi_scan_hot_remove()->lock_device_hotplug()
      
      try_offline_node() off-lines a node if all memory sections and cpus are
      removed on the node.  It is called from acpi_processor_remove() and
      acpi_memory_remove_memory()->remove_memory() paths, both of which are in
      the ACPI hotplug code.
      
      try_offline_node() calls stop_machine() to stop all cpus while checking
      all cpu status with the assumption that the caller is not protected from
      CPU hotplug or CPU online/offline operations.  However, the caller is
      always serialized with lock_device_hotplug().  Also, the code needs to be
      properly serialized with a lock, not by stopping all cpus at a random
      place with stop_machine().
      
      This patch removes the use of stop_machine() in try_offline_node() and
      adds comments to try_offline_node() and remove_memory() that
      lock_device_hotplug() is required.
      Signed-off-by: Toshi Kani <toshi.kani@hp.com>
      Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Tang Chen <tangchen@cn.fujitsu.com>
      Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
      Cc: Wanpeng Li <liwanp@linux.vnet.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      0f1cfe9d
    • mm/hotplug: verify hotplug memory range · 27356f54
      Committed by Toshi Kani
      add_memory() and remove_memory() can only handle a memory range aligned
      with section.  There are problems when an unaligned range is added and
      then deleted as follows:
      
       - add_memory() with an unaligned range succeeds, but __add_pages()
         called from add_memory() adds a whole section of pages even though
         a given memory range is less than the section size.
       - remove_memory() to the added unaligned range hits BUG_ON() in
         __remove_pages().
      
      This patch changes add_memory() and remove_memory() to check at the
      beginning whether a given memory range is aligned with a section.  As a
      result, add_memory() fails with -EINVAL when a given range is unaligned,
      and does not add such a memory range.  This prevents remove_memory() from
      being called with an unaligned range as well.  Note that remove_memory()
      has to use BUG_ON() since this function cannot fail.
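      The check amounts to roughly the following (a sketch; I recall the helper
      being called check_hotplug_memory_range(), but verify the exact shape
      against the patch):

        /* Sketch: reject hotplug ranges that are not section-aligned. */
        static int check_hotplug_memory_range(u64 start, u64 size)
        {
                u64 section_bytes = (u64)PAGES_PER_SECTION << PAGE_SHIFT;

                if (!size || (start & (section_bytes - 1)) ||
                    (size & (section_bytes - 1))) {
                        pr_err("section-unaligned hotplug range: start %#llx, size %#llx\n",
                               start, size);
                        return -EINVAL;
                }
                return 0;
        }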
      
      [akpm@linux-foundation.org: avoid printk warnings]
      Signed-off-by: Toshi Kani <toshi.kani@hp.com>
      Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Reviewed-by: Tang Chen <tangchen@cn.fujitsu.com>
      Reviewed-by: Wanpeng Li <liwanp@linux.vnet.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      27356f54
    • hugepage: mention libhugetlbfs in doc · 15610c86
      Committed by Davidlohr Bueso
      Explicitly mention/recommend using the libhugetlbfs test cases when
      changing related kernel code.  Developers that are unaware of the project
      can easily miss this and introduce potential regressions that may or may
      not be caught by community review.
      
      Also do some cleanups that make the document visually easier to view at a
      first glance.
      Signed-off-by: Davidlohr Bueso <davidlohr@hp.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      15610c86
    • readahead: make context readahead more conservative · 2cad4018
      Committed by Fengguang Wu
      This helps performance on moderately dense random reads on SSD.
      
      Transaction-Per-Second numbers provided by Taobao:
      
      		QPS	case
      		-------------------------------------------------------
      		7536	disable context readahead totally
      w/ patch:	7129	slower size rampup and start RA on the 3rd read
      		6717	slower size rampup
      w/o patch:	5581	unmodified context readahead
      
      Before, readahead would be started whenever page N+1 was read and page N
      happened to have been read recently.  After the patch, we only start
      readahead when *three* random reads happen to access pages N, N+1, N+2.
      The probability of this happening is extremely low for pure random reads,
      unless they are very dense, which actually deserves some readahead.
      
      Also start with a smaller readahead window.  The impact on interleaved
      sequential reads should be small, because for a long-running stream the
      small readahead window ramp-up phase is negligible.
      
      The context readahead actually benefits clustered random reads on HDD
      whose seek cost is pretty high.  However as SSD is increasingly used for
      random read workloads it's better for the context readahead to concentrate
      on interleaved sequential reads.
      
      Another SSD rand read test from Miao
      
              # file size:        2GB
              # read IO amount: 625MB
              sysbench --test=fileio          \
                      --max-requests=10000    \
                      --num-threads=1         \
                      --file-num=1            \
                      --file-block-size=64K   \
                      --file-test-mode=rndrd  \
                      --file-fsync-freq=0     \
                      --file-fsync-end=off    run
      
      shows the performance of btrfs grows up from 69MB/s to 121MB/s, ext4 from
      104MB/s to 121MB/s.
      Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
      Tested-by: Tao Ma <tm@tao.ma>
      Tested-by: Miao Xie <miaox@cn.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      2cad4018
    • mm: use zone_is_initialized() instead of if(zone->wait_table) · 139c2d75
      Committed by Xishi Qiu
      Use "zone_is_initialized()" instead of "if (zone->wait_table)".
      Simplify the code, no functional change.
      Signed-off-by: Xishi Qiu <qiuxishi@huawei.com>
      Cc: Cody P Schafer <cody@linux.vnet.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      139c2d75
    • mm: use zone_is_empty() instead of if(zone->spanned_pages) · 8080fc03
      Committed by Xishi Qiu
      Use "zone_is_empty()" instead of "if (zone->spanned_pages)".
      Simplify the code, no functional change.
      Signed-off-by: Xishi Qiu <qiuxishi@huawei.com>
      Cc: Cody P Schafer <cody@linux.vnet.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      8080fc03
    • mm: use zone_end_pfn() instead of zone_start_pfn+spanned_pages · c33bc315
      Committed by Xishi Qiu
      Use "zone_end_pfn()" instead of "zone->zone_start_pfn + zone->spanned_pages".
      Simplify the code, no functional change.
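      For reference, the helper is essentially just the open-coded expression it
      replaces:

        static inline unsigned long zone_end_pfn(const struct zone *zone)
        {
                return zone->zone_start_pfn + zone->spanned_pages;
        }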
      
      [akpm@linux-foundation.org: fix build]
      Signed-off-by: Xishi Qiu <qiuxishi@huawei.com>
      Cc: Cody P Schafer <cody@linux.vnet.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      c33bc315
    • lib/genalloc.c: fix overflow of ending address of memory chunk · 674470d9
      Committed by Joonyoung Shim
      In struct gen_pool_chunk, end_addr is meant to be the end address of the
      memory chunk (inclusive), but in the implementation it is treated as the
      start address plus the size of the chunk (exclusive), so it points to one
      past the last address instead of the correct ending address.
      
      Such an exclusive ending address overflows for a chunk that includes the
      last address of the memory map, e.g.  when the starting address is
      0xFFF00000 and the size is 0x100000 on a 32-bit machine, the computed
      ending address will be 0x100000000.
      
      Use the correct ending address, i.e.  starting address + size - 1.
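      A userspace illustration of the off-by-one (just the arithmetic, not the
      kernel code):

        #include <stdio.h>
        #include <stdint.h>

        /* With a 32-bit address type, an exclusive end (start + size) wraps
         * to 0 for a chunk that ends at the top of the address space, while
         * an inclusive end (start + size - 1) stays representable. */
        int main(void)
        {
                uint32_t start = 0xFFF00000u;
                uint32_t size  = 0x00100000u;

                uint32_t end_exclusive = start + size;          /* wraps to 0 */
                uint32_t end_inclusive = start + size - 1;      /* 0xFFFFFFFF */

                printf("exclusive end: 0x%08x\n", end_exclusive);
                printf("inclusive end: 0x%08x\n", end_inclusive);
                return 0;
        }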
      
      [akpm@linux-foundation.org: add comment to struct gen_pool_chunk:end_addr]
      Signed-off-by: Joonyoung Shim <jy0922.shim@samsung.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      674470d9