1. 04 7月, 2013 4 次提交
    • M
      mm: remove lru parameter from __pagevec_lru_add and remove parts of pagevec API · a0b8cab3
      Mel Gorman 提交于
      Now that the LRU to add a page to is decided at LRU-add time, remove the
      misleading lru parameter from __pagevec_lru_add.  A consequence of this
      is that the pagevec_lru_add_file, pagevec_lru_add_anon and similar
      helpers are misleading as the caller no longer has direct control over
      what LRU the page is added to.  Unused helpers are removed by this patch
      and existing users of pagevec_lru_add_file() are converted to use
      lru_cache_add_file() directly and use the per-cpu pagevecs instead of
      creating their own pagevec.
      Signed-off-by: NMel Gorman <mgorman@suse.de>
      Reviewed-by: NJan Kara <jack@suse.cz>
      Reviewed-by: NRik van Riel <riel@redhat.com>
      Acked-by: NJohannes Weiner <hannes@cmpxchg.org>
      Cc: Alexey Lyahkov <alexey.lyashkov@gmail.com>
      Cc: Andrew Perepechko <anserper@ya.ru>
      Cc: Robin Dong <sanbai@taobao.com>
      Cc: Theodore Tso <tytso@mit.edu>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Bernd Schubert <bernd.schubert@fastmail.fm>
      Cc: David Howells <dhowells@redhat.com>
      Cc: Trond Myklebust <Trond.Myklebust@netapp.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      a0b8cab3
    • M
      mm: activate !PageLRU pages on mark_page_accessed if page is on local pagevec · 059285a2
      Mel Gorman 提交于
      If a page is on a pagevec then it is !PageLRU and mark_page_accessed()
      may fail to move a page to the active list as expected.  Now that the
      LRU is selected at LRU drain time, mark pages PageActive if they are on
      the local pagevec so it gets moved to the correct list at LRU drain
      time.  Using a debugging patch it was found that for a simple git
      checkout based workload that pages were never added to the active file
      list in practice but with this patch applied they are.
      
      				before   after
      LRU Add Active File                  0      750583
      LRU Add Active Anon            2640587     2702818
      LRU Add Inactive File          8833662     8068353
      LRU Add Inactive Anon              207         200
      
      Note that only pages on the local pagevec are considered on purpose.  A
      !PageLRU page could be in the process of being released, reclaimed,
      migrated or on a remote pagevec that is currently being drained.
      Marking it PageActive is vunerable to races where PageLRU and Active
      bits are checked at the wrong time.  Page reclaim will trigger
      VM_BUG_ONs but depending on when the race hits, it could also free a
      PageActive page to the page allocator and trigger a bad_page warning.
      Similarly a potential race exists between a per-cpu drain on a pagevec
      list and an activation on a remote CPU.
      
      				lru_add_drain_cpu
      				__pagevec_lru_add
      				  lru = page_lru(page);
      mark_page_accessed
        if (PageLRU(page))
          activate_page
        else
          SetPageActive
      				  SetPageLRU(page);
      				  add_page_to_lru_list(page, lruvec, lru);
      
      In this case a PageActive page is added to the inactivate list and later
      the inactive/active stats will get skewed.  While the PageActive checks
      in vmscan could be removed and potentially dealt with, a skew in the
      statistics would be very difficult to detect.  Hence this patch deals
      just with the common case where a page being marked accessed has just
      been added to the local pagevec.
      Signed-off-by: NMel Gorman <mgorman@suse.de>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Rik van Riel <riel@redhat.com>
      Acked-by: NJohannes Weiner <hannes@cmpxchg.org>
      Cc: Alexey Lyahkov <alexey.lyashkov@gmail.com>
      Cc: Andrew Perepechko <anserper@ya.ru>
      Cc: Robin Dong <sanbai@taobao.com>
      Cc: Theodore Tso <tytso@mit.edu>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Bernd Schubert <bernd.schubert@fastmail.fm>
      Cc: David Howells <dhowells@redhat.com>
      Cc: Trond Myklebust <Trond.Myklebust@netapp.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      059285a2
    • M
      mm: pagevec: defer deciding which LRU to add a page to until pagevec drain time · 13f7f789
      Mel Gorman 提交于
      mark_page_accessed() cannot activate an inactive page that is located on
      an inactive LRU pagevec.  Hints from filesystems may be ignored as a
      result.  In preparation for fixing that problem, this patch removes the
      per-LRU pagevecs and leaves just one pagevec.  The final LRU the page is
      added to is deferred until the pagevec is drained.
      
      This means that fewer pagevecs are available and potentially there is
      greater contention on the LRU lock.  However, this only applies in the
      case where there is an almost perfect mix of file, anon, active and
      inactive pages being added to the LRU.  In practice I expect that we are
      adding stream of pages of a particular time and that the changes in
      contention will barely be measurable.
      Signed-off-by: NMel Gorman <mgorman@suse.de>
      Acked-by: NRik van Riel <riel@redhat.com>
      Cc: Jan Kara <jack@suse.cz>
      Acked-by: NJohannes Weiner <hannes@cmpxchg.org>
      Cc: Alexey Lyahkov <alexey.lyashkov@gmail.com>
      Cc: Andrew Perepechko <anserper@ya.ru>
      Cc: Robin Dong <sanbai@taobao.com>
      Cc: Theodore Tso <tytso@mit.edu>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Bernd Schubert <bernd.schubert@fastmail.fm>
      Cc: David Howells <dhowells@redhat.com>
      Cc: Trond Myklebust <Trond.Myklebust@netapp.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      13f7f789
    • M
      mm: add tracepoints for LRU activation and insertions · c6286c98
      Mel Gorman 提交于
      Andrew Perepechko reported a problem whereby pages are being prematurely
      evicted as the mark_page_accessed() hint is ignored for pages that are
      currently on a pagevec --
      http://www.spinics.net/lists/linux-ext4/msg37340.html .
      
      Alexey Lyahkov and Robin Dong have also reported problems recently that
      could be due to hot pages reaching the end of the inactive list too
      quickly and be reclaimed.
      
      Rather than addressing this on a per-filesystem basis, this series aims
      to fix the mark_page_accessed() interface by deferring what LRU a page
      is added to pagevec drain time and allowing mark_page_accessed() to call
      SetPageActive on a pagevec page.
      
      Patch 1 adds two tracepoints for LRU page activation and insertion. Using
      	these processes it's possible to build a model of pages in the
      	LRU that can be processed offline.
      
      Patch 2 defers making the decision on what LRU to add a page to until when
      	the pagevec is drained.
      
      Patch 3 searches the local pagevec for pages to mark PageActive on
      	mark_page_accessed. The changelog explains why only the local
      	pagevec is examined.
      
      Patches 4 and 5 tidy up the API.
      
      postmark, a dd-based test and fs-mark both single and threaded mode were
      run but none of them showed any performance degradation or gain as a
      result of the patch.
      
      Using patch 1, I built a *very* basic model of the LRU to examine
      offline what the average age of different page types on the LRU were in
      milliseconds.  Of course, capturing the trace distorts the test as it's
      written to local disk but it does not matter for the purposes of this
      test.  The average age of pages in milliseconds were
      
      				    vanilla deferdrain
      Average age mapped anon:               1454       1250
      Average age mapped file:             127841     155552
      Average age unmapped anon:               85        235
      Average age unmapped file:            73633      38884
      Average age unmapped buffers:         74054     116155
      
      The LRU activity was mostly files which you'd expect for a dd-based
      workload.  Note that the average age of buffer pages is increased by the
      series and it is expected this is due to the fact that the buffer pages
      are now getting added to the active list when drained from the pagevecs.
      Note that the average age of the unmapped file data is decreased as they
      are still added to the inactive list and are reclaimed before the
      buffers.
      
      There is no guarantee this is a universal win for all workloads and it
      would be nice if the filesystem people gave some thought as to whether
      this decision is generally a win or a loss.
      
      This patch:
      
      Using these tracepoints it is possible to model LRU activity and the
      average residency of pages of different types.  This can be used to
      debug problems related to premature reclaim of pages of particular
      types.
      Signed-off-by: NMel Gorman <mgorman@suse.de>
      Reviewed-by: NRik van Riel <riel@redhat.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Alexey Lyahkov <alexey.lyashkov@gmail.com>
      Cc: Andrew Perepechko <anserper@ya.ru>
      Cc: Robin Dong <sanbai@taobao.com>
      Cc: Theodore Tso <tytso@mit.edu>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Bernd Schubert <bernd.schubert@fastmail.fm>
      Cc: David Howells <dhowells@redhat.com>
      Cc: Trond Myklebust <Trond.Myklebust@netapp.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      c6286c98
  2. 08 5月, 2013 1 次提交
  3. 30 4月, 2013 1 次提交
  4. 24 2月, 2013 1 次提交
  5. 09 10月, 2012 2 次提交
    • H
      mm: remove vma arg from page_evictable · 39b5f29a
      Hugh Dickins 提交于
      page_evictable(page, vma) is an irritant: almost all its callers pass
      NULL for vma.  Remove the vma arg and use mlocked_vma_newpage(vma, page)
      explicitly in the couple of places it's needed.  But in those places we
      don't even need page_evictable() itself!  They're dealing with a freshly
      allocated anonymous page, which has no "mapping" and cannot be mlocked yet.
      Signed-off-by: NHugh Dickins <hughd@google.com>
      Acked-by: NMel Gorman <mel@csn.ul.ie>
      Cc: Rik van Riel <riel@redhat.com>
      Acked-by: NJohannes Weiner <hannes@cmpxchg.org>
      Cc: Michel Lespinasse <walken@google.com>
      Cc: Ying Han <yinghan@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      39b5f29a
    • R
      mm: fix nonuniform page status when writing new file with small buffer · d741c9cd
      Robin Dong 提交于
      When writing a new file with 2048 bytes buffer, such as write(fd, buffer,
      2048), it will call generic_perform_write() twice for every page:
      
      	write_begin
      	mark_page_accessed(page)
      	write_end
      
      	write_begin
      	mark_page_accessed(page)
      	write_end
      
      Pages 1-13 will be added to lru-pvecs in write_begin() and will *NOT* be
      added to active_list even they have be accessed twice because they are not
      PageLRU(page).  But when page 14th comes, all pages in lru-pvecs will be
      moved to inactive_list (by __lru_cache_add() ) in first write_begin(), now
      page 14th *is* PageLRU(page).  And after second write_end() only page 14th
      will be in active_list.
      
      In Hadoop environment, we do comes to this situation: after writing a
      file, we find out that only 14th, 28th, 42th...  page are in active_list
      and others in inactive_list.  Now kswapd works, shrinks the inactive_list,
      the file only have 14th, 28th...pages in memory, the readahead request
      size will be broken to only 52k (13*4k), system's performance falls
      dramatically.
      
      This problem can also replay by below steps (the machine has 8G memory):
      
      	1. dd if=/dev/zero of=/test/file.out bs=1024 count=1048576
      	2. cat another 7.5G file to /dev/null
      	3. vmtouch -m 1G -v /test/file.out, it will show:
      
      	/test/file.out
      	[oooooooooooooooooooOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOO] 187847/262144
      
      	the 'o' means same pages are in memory but same are not.
      
      The solution for this problem is simple: the 14th page should be added to
      lru_add_pvecs before mark_page_accessed() just as other pages.
      
      [akpm@linux-foundation.org: tweak comment]
      [akpm@linux-foundation.org: grab better comment from the v3 patch]
      Signed-off-by: NRobin Dong <sanbai@taobao.com>
      Reviewed-by: NMinchan Kim <minchan@kernel.org>
      Cc: KOSAKI Motohiro <kosaki.motohiro@gmail.com>
      Reviewed-by: NJohannes Weiner <hannes@cmpxchg.org>
      Cc: Wu Fengguang <fengguang.wu@intel.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      d741c9cd
  6. 01 8月, 2012 2 次提交
    • M
      mm: add support for direct_IO to highmem pages · 5a178119
      Mel Gorman 提交于
      The patch "mm: add support for a filesystem to activate swap files and use
      direct_IO for writing swap pages" added support for using direct_IO to
      write swap pages but it is insufficient for highmem pages.
      
      To support highmem pages, this patch kmaps() the page before calling the
      direct_IO() handler.  As direct_IO deals with virtual addresses an
      additional helper is necessary for get_kernel_pages() to lookup the struct
      page for a kmap virtual address.
      Signed-off-by: NMel Gorman <mgorman@suse.de>
      Acked-by: NRik van Riel <riel@redhat.com>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: David S. Miller <davem@davemloft.net>
      Cc: Eric B Munson <emunson@mgebm.net>
      Cc: Eric Paris <eparis@redhat.com>
      Cc: James Morris <jmorris@namei.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Mike Christie <michaelc@cs.wisc.edu>
      Cc: Neil Brown <neilb@suse.de>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Sebastian Andrzej Siewior <sebastian@breakpoint.cc>
      Cc: Trond Myklebust <Trond.Myklebust@netapp.com>
      Cc: Xiaotian Feng <dfeng@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      5a178119
    • M
      mm: add get_kernel_page[s] for pinning of kernel addresses for I/O · 18022c5d
      Mel Gorman 提交于
      This patch adds two new APIs get_kernel_pages() and get_kernel_page() that
      may be used to pin a vector of kernel addresses for IO.  The initial user
      is expected to be NFS for allowing pages to be written to swap using
      aops->direct_IO().  Strictly speaking, swap-over-NFS only needs to pin one
      page for IO but it makes sense to express the API in terms of a vector and
      add a helper for pinning single pages.
      Signed-off-by: NMel Gorman <mgorman@suse.de>
      Reviewed-by: NRik van Riel <riel@redhat.com>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: David S. Miller <davem@davemloft.net>
      Cc: Eric B Munson <emunson@mgebm.net>
      Cc: Eric Paris <eparis@redhat.com>
      Cc: James Morris <jmorris@namei.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Mike Christie <michaelc@cs.wisc.edu>
      Cc: Neil Brown <neilb@suse.de>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Sebastian Andrzej Siewior <sebastian@breakpoint.cc>
      Cc: Trond Myklebust <Trond.Myklebust@netapp.com>
      Cc: Xiaotian Feng <dfeng@redhat.com>
      Cc: Mark Salter <msalter@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      18022c5d
  7. 30 5月, 2012 3 次提交
  8. 22 3月, 2012 1 次提交
  9. 06 3月, 2012 1 次提交
    • H
      memcg: fix GPF when cgroup removal races with last exit · 7512102c
      Hugh Dickins 提交于
      When moving tasks from old memcg (with move_charge_at_immigrate on new
      memcg), followed by removal of old memcg, hit General Protection Fault in
      mem_cgroup_lru_del_list() (called from release_pages called from
      free_pages_and_swap_cache from tlb_flush_mmu from tlb_finish_mmu from
      exit_mmap from mmput from exit_mm from do_exit).
      
      Somewhat reproducible, takes a few hours: the old struct mem_cgroup has
      been freed and poisoned by SLAB_DEBUG, but mem_cgroup_lru_del_list() is
      still trying to update its stats, and take page off lru before freeing.
      
      A task, or a charge, or a page on lru: each secures a memcg against
      removal.  In this case, the last task has been moved out of the old memcg,
      and it is exiting: anonymous pages are uncharged one by one from the
      memcg, as they are zapped from its pagetables, so the charge gets down to
      0; but the pages themselves are queued in an mmu_gather for freeing.
      
      Most of those pages will be on lru (and force_empty is careful to
      lru_add_drain_all, to add pages from pagevec to lru first), but not
      necessarily all: perhaps some have been isolated for page reclaim, perhaps
      some isolated for other reasons.  So, force_empty may find no task, no
      charge and no page on lru, and let the removal proceed.
      
      There would still be no problem if these pages were immediately freed; but
      typically (and the put_page_testzero protocol demands it) they have to be
      added back to lru before they are found freeable, then removed from lru
      and freed.  We don't see the issue when adding, because the
      mem_cgroup_iter() loops keep their own reference to the memcg being
      scanned; but when it comes to mem_cgroup_lru_del_list().
      
      I believe this was not an issue in v3.2: there, PageCgroupAcctLRU and
      PageCgroupUsed flags were used (like a trick with mirrors) to deflect view
      of pc->mem_cgroup to the stable root_mem_cgroup when neither set.
      38c5d72f ("memcg: simplify LRU handling by new rule") mercifully
      removed those convolutions, but left this General Protection Fault.
      
      But it's surprisingly easy to restore the old behaviour: just check
      PageCgroupUsed in mem_cgroup_lru_add_list() (which decides on which lruvec
      to add), and reset pc to root_mem_cgroup if page is uncharged.  A risky
      change?  just going back to how it worked before; testing, and an audit of
      uses of pc->mem_cgroup, show no problem.
      
      And there's a nice bonus: with mem_cgroup_lru_add_list() itself making
      sure that an uncharged page goes to root lru, mem_cgroup_reset_owner() no
      longer has any purpose, and we can safely revert 4e5f01c2 ("memcg:
      clear pc->mem_cgroup if necessary").
      
      Calling update_page_reclaim_stat() after add_page_to_lru_list() in swap.c
      is not strictly necessary: the lru_lock there, with RCU before memcg
      structures are freed, makes mem_cgroup_get_reclaim_stat_from_page safe
      without that; but it seems cleaner to rely on one dependency less.
      Signed-off-by: NHugh Dickins <hughd@google.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Konstantin Khlebnikov <khlebnikov@openvz.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      7512102c
  10. 09 2月, 2012 1 次提交
  11. 13 1月, 2012 8 次提交
  12. 11 1月, 2012 1 次提交
  13. 03 11月, 2011 1 次提交
    • A
      mm: thp: tail page refcounting fix · 70b50f94
      Andrea Arcangeli 提交于
      Michel while working on the working set estimation code, noticed that
      calling get_page_unless_zero() on a random pfn_to_page(random_pfn)
      wasn't safe, if the pfn ended up being a tail page of a transparent
      hugepage under splitting by __split_huge_page_refcount().
      
      He then found the problem could also theoretically materialize with
      page_cache_get_speculative() during the speculative radix tree lookups
      that uses get_page_unless_zero() in SMP if the radix tree page is freed
      and reallocated and get_user_pages is called on it before
      page_cache_get_speculative has a chance to call get_page_unless_zero().
      
      So the best way to fix the problem is to keep page_tail->_count zero at
      all times.  This will guarantee that get_page_unless_zero() can never
      succeed on any tail page.  page_tail->_mapcount is guaranteed zero and
      is unused for all tail pages of a compound page, so we can simply
      account the tail page references there and transfer them to
      tail_page->_count in __split_huge_page_refcount() (in addition to the
      head_page->_mapcount).
      
      While debugging this s/_count/_mapcount/ change I also noticed get_page is
      called by direct-io.c on pages returned by get_user_pages.  That wasn't
      entirely safe because the two atomic_inc in get_page weren't atomic.  As
      opposed to other get_user_page users like secondary-MMU page fault to
      establish the shadow pagetables would never call any superflous get_page
      after get_user_page returns.  It's safer to make get_page universally safe
      for tail pages and to use get_page_foll() within follow_page (inside
      get_user_pages()).  get_page_foll() is safe to do the refcounting for tail
      pages without taking any locks because it is run within PT lock protected
      critical sections (PT lock for pte and page_table_lock for
      pmd_trans_huge).
      
      The standard get_page() as invoked by direct-io instead will now take
      the compound_lock but still only for tail pages.  The direct-io paths
      are usually I/O bound and the compound_lock is per THP so very
      finegrined, so there's no risk of scalability issues with it.  A simple
      direct-io benchmarks with all lockdep prove locking and spinlock
      debugging infrastructure enabled shows identical performance and no
      overhead.  So it's worth it.  Ideally direct-io should stop calling
      get_page() on pages returned by get_user_pages().  The spinlock in
      get_page() is already optimized away for no-THP builds but doing
      get_page() on tail pages returned by GUP is generally a rare operation
      and usually only run in I/O paths.
      
      This new refcounting on page_tail->_mapcount in addition to avoiding new
      RCU critical sections will also allow the working set estimation code to
      work without any further complexity associated to the tail page
      refcounting with THP.
      Signed-off-by: NAndrea Arcangeli <aarcange@redhat.com>
      Reported-by: NMichel Lespinasse <walken@google.com>
      Reviewed-by: NMichel Lespinasse <walken@google.com>
      Reviewed-by: NMinchan Kim <minchan.kim@gmail.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Johannes Weiner <jweiner@redhat.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: David Gibson <david@gibson.dropbear.id.au>
      Cc: <stable@kernel.org>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      70b50f94
  14. 31 10月, 2011 1 次提交
  15. 25 5月, 2011 2 次提交
    • S
      mm: batch activate_page() to reduce lock contention · eb709b0d
      Shaohua Li 提交于
      The zone->lru_lock is heavily contented in workload where activate_page()
      is frequently used.  We could do batch activate_page() to reduce the lock
      contention.  The batched pages will be added into zone list when the pool
      is full or page reclaim is trying to drain them.
      
      For example, in a 4 socket 64 CPU system, create a sparse file and 64
      processes, processes shared map to the file.  Each process read access the
      whole file and then exit.  The process exit will do unmap_vmas() and cause
      a lot of activate_page() call.  In such workload, we saw about 58% total
      time reduction with below patch.  Other workloads with a lot of
      activate_page also benefits a lot too.
      
      Andrew Morton suggested activate_page() and putback_lru_pages() should
      follow the same path to active pages, but this is hard to implement (see
      commit 7a608572 ("Revert "mm: batch activate_page() to reduce lock
      contention")).  On the other hand, do we really need putback_lru_pages()
      to follow the same path?  I tested several FIO/FFSB benchmark (about 20
      scripts for each benchmark) in 3 machines here from 2 sockets to 4
      sockets.  My test doesn't show anything significant with/without below
      patch (there is slight difference but mostly some noise which we found
      even without below patch before).  Below patch basically returns to the
      same as my first post.
      
      I tested some microbenchmarks:
        case-anon-cow-rand-mt         0.58%
        case-anon-cow-rand           -3.30%
        case-anon-cow-seq-mt         -0.51%
        case-anon-cow-seq            -5.68%
        case-anon-r-rand-mt           0.23%
        case-anon-r-rand              0.81%
        case-anon-r-seq-mt           -0.71%
        case-anon-r-seq              -1.99%
        case-anon-rx-rand-mt          2.11%
        case-anon-rx-seq-mt           3.46%
        case-anon-w-rand-mt          -0.03%
        case-anon-w-rand             -0.50%
        case-anon-w-seq-mt           -1.08%
        case-anon-w-seq              -0.12%
        case-anon-wx-rand-mt         -5.02%
        case-anon-wx-seq-mt          -1.43%
        case-fork                     1.65%
        case-fork-sleep              -0.07%
        case-fork-withmem             1.39%
        case-hugetlb                 -0.59%
        case-lru-file-mmap-read-mt   -0.54%
        case-lru-file-mmap-read       0.61%
        case-lru-file-mmap-read-rand -2.24%
        case-lru-file-readonce       -0.64%
        case-lru-file-readtwice     -11.69%
        case-lru-memcg               -1.35%
        case-mmap-pread-rand-mt       1.88%
        case-mmap-pread-rand        -15.26%
        case-mmap-pread-seq-mt        0.89%
        case-mmap-pread-seq         -69.72%
        case-mmap-xread-rand-mt       0.71%
        case-mmap-xread-seq-mt        0.38%
      
      The most significent are:
        case-lru-file-readtwice     -11.69%
        case-mmap-pread-rand        -15.26%
        case-mmap-pread-seq         -69.72%
      
      which use activate_page a lot.  others are basically variations because
      each run has slightly difference.
      
      In UP case, 'size mm/swap.o'
      before the two patches:
         text    data     bss     dec     hex filename
         6466     896       4    7366    1cc6 mm/swap.o
      after the two patches:
         text    data     bss     dec     hex filename
         6343     896       4    7243    1c4b mm/swap.o
      Signed-off-by: NShaohua Li <shaohua.li@intel.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Hiroyuki Kamezawa <kamezawa.hiroyuki@gmail.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: Minchan Kim <minchan.kim@gmail.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Hugh Dickins <hughd@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      eb709b0d
    • M
      mm: filter unevictable page out in deactivate_page() · 821ed6bb
      Minchan Kim 提交于
      It's pointless that deactive_page's operates on unevictable pages.  This
      patch removes unnecessary overhead which might be a bit problem in case
      that there are many unevictable page in system(ex, mprotect workload)
      
      [akpm@linux-foundation.org: tidy up comment]
      Reviewed-by: NKOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Signed-off-by: NMinchan Kim <minchan.kim@gmail.com>
      Reviewed-by: Rik van Riel<riel@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      821ed6bb
  16. 12 5月, 2011 1 次提交
  17. 23 3月, 2011 4 次提交
  18. 18 1月, 2011 2 次提交
    • L
      Revert "mm: simplify code of swap.c" · 83896fb5
      Linus Torvalds 提交于
      This reverts commit d8505dee.
      
      Chris Mason ended up chasing down some page allocation errors and pages
      stuck waiting on the IO scheduler, and was able to narrow it down to two
      commits: commit 744ed144 ("mm: batch activate_page() to reduce lock
      contention") and d8505dee ("mm: simplify code of swap.c").
      
      This reverts the second one.
      Reported-and-debugged-by: NChris Mason <chris.mason@oracle.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Jens Axboe <jaxboe@fusionio.com>
      Cc: linux-mm <linux-mm@kvack.org>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Shaohua Li <shaohua.li@intel.com>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      83896fb5
    • L
      Revert "mm: batch activate_page() to reduce lock contention" · 7a608572
      Linus Torvalds 提交于
      This reverts commit 744ed144.
      
      Chris Mason ended up chasing down some page allocation errors and pages
      stuck waiting on the IO scheduler, and was able to narrow it down to two
      commits: commit 744ed144 ("mm: batch activate_page() to reduce lock
      contention") and d8505dee ("mm: simplify code of swap.c").
      
      This reverts the first of them.
      Reported-and-debugged-by: NChris Mason <chris.mason@oracle.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Jens Axboe <jaxboe@fusionio.com>
      Cc: linux-mm <linux-mm@kvack.org>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Shaohua Li <shaohua.li@intel.com>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      7a608572
  19. 14 1月, 2011 3 次提交
    • S
      mm: batch activate_page() to reduce lock contention · 744ed144
      Shaohua Li 提交于
      The zone->lru_lock is heavily contented in workload where activate_page()
      is frequently used.  We could do batch activate_page() to reduce the lock
      contention.  The batched pages will be added into zone list when the pool
      is full or page reclaim is trying to drain them.
      
      For example, in a 4 socket 64 CPU system, create a sparse file and 64
      processes, processes shared map to the file.  Each process read access the
      whole file and then exit.  The process exit will do unmap_vmas() and cause
      a lot of activate_page() call.  In such workload, we saw about 58% total
      time reduction with below patch.  Other workloads with a lot of
      activate_page also benefits a lot too.
      
      I tested some microbenchmarks:
      case-anon-cow-rand-mt		0.58%
      case-anon-cow-rand		-3.30%
      case-anon-cow-seq-mt		-0.51%
      case-anon-cow-seq		-5.68%
      case-anon-r-rand-mt		0.23%
      case-anon-r-rand		0.81%
      case-anon-r-seq-mt		-0.71%
      case-anon-r-seq			-1.99%
      case-anon-rx-rand-mt		2.11%
      case-anon-rx-seq-mt		3.46%
      case-anon-w-rand-mt		-0.03%
      case-anon-w-rand		-0.50%
      case-anon-w-seq-mt		-1.08%
      case-anon-w-seq			-0.12%
      case-anon-wx-rand-mt		-5.02%
      case-anon-wx-seq-mt		-1.43%
      case-fork			1.65%
      case-fork-sleep			-0.07%
      case-fork-withmem		1.39%
      case-hugetlb			-0.59%
      case-lru-file-mmap-read-mt	-0.54%
      case-lru-file-mmap-read		0.61%
      case-lru-file-mmap-read-rand	-2.24%
      case-lru-file-readonce		-0.64%
      case-lru-file-readtwice		-11.69%
      case-lru-memcg			-1.35%
      case-mmap-pread-rand-mt		1.88%
      case-mmap-pread-rand		-15.26%
      case-mmap-pread-seq-mt		0.89%
      case-mmap-pread-seq		-69.72%
      case-mmap-xread-rand-mt		0.71%
      case-mmap-xread-seq-mt		0.38%
      
      The most significent are:
      case-lru-file-readtwice		-11.69%
      case-mmap-pread-rand		-15.26%
      case-mmap-pread-seq		-69.72%
      
      which use activate_page a lot.  others are basically variations because
      each run has slightly difference.
      
      [akpm@linux-foundation.org: coding-style fixes]
      Signed-off-by: NShaohua Li <shaohua.li@intel.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: Minchan Kim <minchan.kim@gmail.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Rik van Riel <riel@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      744ed144
    • S
      mm: simplify code of swap.c · d8505dee
      Shaohua Li 提交于
      Clean up code and remove duplicate code.  Next patch will use
      pagevec_lru_move_fn introduced here too.
      Signed-off-by: NShaohua Li <shaohua.li@intel.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: Minchan Kim <minchan.kim@gmail.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Rik van Riel <riel@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      d8505dee
    • A
      thp: transparent hugepage core · 71e3aac0
      Andrea Arcangeli 提交于
      Lately I've been working to make KVM use hugepages transparently without
      the usual restrictions of hugetlbfs.  Some of the restrictions I'd like to
      see removed:
      
      1) hugepages have to be swappable or the guest physical memory remains
         locked in RAM and can't be paged out to swap
      
      2) if a hugepage allocation fails, regular pages should be allocated
         instead and mixed in the same vma without any failure and without
         userland noticing
      
      3) if some task quits and more hugepages become available in the
         buddy, guest physical memory backed by regular pages should be
         relocated on hugepages automatically in regions under
         madvise(MADV_HUGEPAGE) (ideally event driven by waking up the
         kernel deamon if the order=HPAGE_PMD_SHIFT-PAGE_SHIFT list becomes
         not null)
      
      4) avoidance of reservation and maximization of use of hugepages whenever
         possible. Reservation (needed to avoid runtime fatal faliures) may be ok for
         1 machine with 1 database with 1 database cache with 1 database cache size
         known at boot time. It's definitely not feasible with a virtualization
         hypervisor usage like RHEV-H that runs an unknown number of virtual machines
         with an unknown size of each virtual machine with an unknown amount of
         pagecache that could be potentially useful in the host for guest not using
         O_DIRECT (aka cache=off).
      
      hugepages in the virtualization hypervisor (and also in the guest!) are
      much more important than in a regular host not using virtualization,
      becasue with NPT/EPT they decrease the tlb-miss cacheline accesses from 24
      to 19 in case only the hypervisor uses transparent hugepages, and they
      decrease the tlb-miss cacheline accesses from 19 to 15 in case both the
      linux hypervisor and the linux guest both uses this patch (though the
      guest will limit the addition speedup to anonymous regions only for
      now...).  Even more important is that the tlb miss handler is much slower
      on a NPT/EPT guest than for a regular shadow paging or no-virtualization
      scenario.  So maximizing the amount of virtual memory cached by the TLB
      pays off significantly more with NPT/EPT than without (even if there would
      be no significant speedup in the tlb-miss runtime).
      
      The first (and more tedious) part of this work requires allowing the VM to
      handle anonymous hugepages mixed with regular pages transparently on
      regular anonymous vmas.  This is what this patch tries to achieve in the
      least intrusive possible way.  We want hugepages and hugetlb to be used in
      a way so that all applications can benefit without changes (as usual we
      leverage the KVM virtualization design: by improving the Linux VM at
      large, KVM gets the performance boost too).
      
      The most important design choice is: always fallback to 4k allocation if
      the hugepage allocation fails!  This is the _very_ opposite of some large
      pagecache patches that failed with -EIO back then if a 64k (or similar)
      allocation failed...
      
      Second important decision (to reduce the impact of the feature on the
      existing pagetable handling code) is that at any time we can split an
      hugepage into 512 regular pages and it has to be done with an operation
      that can't fail.  This way the reliability of the swapping isn't decreased
      (no need to allocate memory when we are short on memory to swap) and it's
      trivial to plug a split_huge_page* one-liner where needed without
      polluting the VM.  Over time we can teach mprotect, mremap and friends to
      handle pmd_trans_huge natively without calling split_huge_page*.  The fact
      it can't fail isn't just for swap: if split_huge_page would return -ENOMEM
      (instead of the current void) we'd need to rollback the mprotect from the
      middle of it (ideally including undoing the split_vma) which would be a
      big change and in the very wrong direction (it'd likely be simpler not to
      call split_huge_page at all and to teach mprotect and friends to handle
      hugepages instead of rolling them back from the middle).  In short the
      very value of split_huge_page is that it can't fail.
      
      The collapsing and madvise(MADV_HUGEPAGE) part will remain separated and
      incremental and it'll just be an "harmless" addition later if this initial
      part is agreed upon.  It also should be noted that locking-wise replacing
      regular pages with hugepages is going to be very easy if compared to what
      I'm doing below in split_huge_page, as it will only happen when
      page_count(page) matches page_mapcount(page) if we can take the PG_lock
      and mmap_sem in write mode.  collapse_huge_page will be a "best effort"
      that (unlike split_huge_page) can fail at the minimal sign of trouble and
      we can try again later.  collapse_huge_page will be similar to how KSM
      works and the madvise(MADV_HUGEPAGE) will work similar to
      madvise(MADV_MERGEABLE).
      
      The default I like is that transparent hugepages are used at page fault
      time.  This can be changed with
      /sys/kernel/mm/transparent_hugepage/enabled.  The control knob can be set
      to three values "always", "madvise", "never" which mean respectively that
      hugepages are always used, or only inside madvise(MADV_HUGEPAGE) regions,
      or never used.  /sys/kernel/mm/transparent_hugepage/defrag instead
      controls if the hugepage allocation should defrag memory aggressively
      "always", only inside "madvise" regions, or "never".
      
      The pmd_trans_splitting/pmd_trans_huge locking is very solid.  The
      put_page (from get_user_page users that can't use mmu notifier like
      O_DIRECT) that runs against a __split_huge_page_refcount instead was a
      pain to serialize in a way that would result always in a coherent page
      count for both tail and head.  I think my locking solution with a
      compound_lock taken only after the page_first is valid and is still a
      PageHead should be safe but it surely needs review from SMP race point of
      view.  In short there is no current existing way to serialize the O_DIRECT
      final put_page against split_huge_page_refcount so I had to invent a new
      one (O_DIRECT loses knowledge on the mapping status by the time gup_fast
      returns so...).  And I didn't want to impact all gup/gup_fast users for
      now, maybe if we change the gup interface substantially we can avoid this
      locking, I admit I didn't think too much about it because changing the gup
      unpinning interface would be invasive.
      
      If we ignored O_DIRECT we could stick to the existing compound refcounting
      code, by simply adding a get_user_pages_fast_flags(foll_flags) where KVM
      (and any other mmu notifier user) would call it without FOLL_GET (and if
      FOLL_GET isn't set we'd just BUG_ON if nobody registered itself in the
      current task mmu notifier list yet).  But O_DIRECT is fundamental for
      decent performance of virtualized I/O on fast storage so we can't avoid it
      to solve the race of put_page against split_huge_page_refcount to achieve
      a complete hugepage feature for KVM.
      
      Swap and oom works fine (well just like with regular pages ;).  MMU
      notifier is handled transparently too, with the exception of the young bit
      on the pmd, that didn't have a range check but I think KVM will be fine
      because the whole point of hugepages is that EPT/NPT will also use a huge
      pmd when they notice gup returns pages with PageCompound set, so they
      won't care of a range and there's just the pmd young bit to check in that
      case.
      
      NOTE: in some cases if the L2 cache is small, this may slowdown and waste
      memory during COWs because 4M of memory are accessed in a single fault
      instead of 8k (the payoff is that after COW the program can run faster).
      So we might want to switch the copy_huge_page (and clear_huge_page too) to
      not temporal stores.  I also extensively researched ways to avoid this
      cache trashing with a full prefault logic that would cow in 8k/16k/32k/64k
      up to 1M (I can send those patches that fully implemented prefault) but I
      concluded they're not worth it and they add an huge additional complexity
      and they remove all tlb benefits until the full hugepage has been faulted
      in, to save a little bit of memory and some cache during app startup, but
      they still don't improve substantially the cache-trashing during startup
      if the prefault happens in >4k chunks.  One reason is that those 4k pte
      entries copied are still mapped on a perfectly cache-colored hugepage, so
      the trashing is the worst one can generate in those copies (cow of 4k page
      copies aren't so well colored so they trashes less, but again this results
      in software running faster after the page fault).  Those prefault patches
      allowed things like a pte where post-cow pages were local 4k regular anon
      pages and the not-yet-cowed pte entries were pointing in the middle of
      some hugepage mapped read-only.  If it doesn't payoff substantially with
      todays hardware it will payoff even less in the future with larger l2
      caches, and the prefault logic would blot the VM a lot.  If one is
      emebdded transparent_hugepage can be disabled during boot with sysfs or
      with the boot commandline parameter transparent_hugepage=0 (or
      transparent_hugepage=2 to restrict hugepages inside madvise regions) that
      will ensure not a single hugepage is allocated at boot time.  It is simple
      enough to just disable transparent hugepage globally and let transparent
      hugepages be allocated selectively by applications in the MADV_HUGEPAGE
      region (both at page fault time, and if enabled with the
      collapse_huge_page too through the kernel daemon).
      
      This patch supports only hugepages mapped in the pmd, archs that have
      smaller hugepages will not fit in this patch alone.  Also some archs like
      power have certain tlb limits that prevents mixing different page size in
      the same regions so they will not fit in this framework that requires
      "graceful fallback" to basic PAGE_SIZE in case of physical memory
      fragmentation.  hugetlbfs remains a perfect fit for those because its
      software limits happen to match the hardware limits.  hugetlbfs also
      remains a perfect fit for hugepage sizes like 1GByte that cannot be hoped
      to be found not fragmented after a certain system uptime and that would be
      very expensive to defragment with relocation, so requiring reservation.
      hugetlbfs is the "reservation way", the point of transparent hugepages is
      not to have any reservation at all and maximizing the use of cache and
      hugepages at all times automatically.
      
      Some performance result:
      
      vmx andrea # LD_PRELOAD=/usr/lib64/libhugetlbfs.so HUGETLB_MORECORE=yes HUGETLB_PATH=/mnt/huge/ ./largep
      ages3
      memset page fault 1566023
      memset tlb miss 453854
      memset second tlb miss 453321
      random access tlb miss 41635
      random access second tlb miss 41658
      vmx andrea # LD_PRELOAD=/usr/lib64/libhugetlbfs.so HUGETLB_MORECORE=yes HUGETLB_PATH=/mnt/huge/ ./largepages3
      memset page fault 1566471
      memset tlb miss 453375
      memset second tlb miss 453320
      random access tlb miss 41636
      random access second tlb miss 41637
      vmx andrea # ./largepages3
      memset page fault 1566642
      memset tlb miss 453417
      memset second tlb miss 453313
      random access tlb miss 41630
      random access second tlb miss 41647
      vmx andrea # ./largepages3
      memset page fault 1566872
      memset tlb miss 453418
      memset second tlb miss 453315
      random access tlb miss 41618
      random access second tlb miss 41659
      vmx andrea # echo 0 > /proc/sys/vm/transparent_hugepage
      vmx andrea # ./largepages3
      memset page fault 2182476
      memset tlb miss 460305
      memset second tlb miss 460179
      random access tlb miss 44483
      random access second tlb miss 44186
      vmx andrea # ./largepages3
      memset page fault 2182791
      memset tlb miss 460742
      memset second tlb miss 459962
      random access tlb miss 43981
      random access second tlb miss 43988
      
      ============
      #include <stdio.h>
      #include <stdlib.h>
      #include <string.h>
      #include <sys/time.h>
      
      #define SIZE (3UL*1024*1024*1024)
      
      int main()
      {
      	char *p = malloc(SIZE), *p2;
      	struct timeval before, after;
      
      	gettimeofday(&before, NULL);
      	memset(p, 0, SIZE);
      	gettimeofday(&after, NULL);
      	printf("memset page fault %Lu\n",
      	       (after.tv_sec-before.tv_sec)*1000000UL +
      	       after.tv_usec-before.tv_usec);
      
      	gettimeofday(&before, NULL);
      	memset(p, 0, SIZE);
      	gettimeofday(&after, NULL);
      	printf("memset tlb miss %Lu\n",
      	       (after.tv_sec-before.tv_sec)*1000000UL +
      	       after.tv_usec-before.tv_usec);
      
      	gettimeofday(&before, NULL);
      	memset(p, 0, SIZE);
      	gettimeofday(&after, NULL);
      	printf("memset second tlb miss %Lu\n",
      	       (after.tv_sec-before.tv_sec)*1000000UL +
      	       after.tv_usec-before.tv_usec);
      
      	gettimeofday(&before, NULL);
      	for (p2 = p; p2 < p+SIZE; p2 += 4096)
      		*p2 = 0;
      	gettimeofday(&after, NULL);
      	printf("random access tlb miss %Lu\n",
      	       (after.tv_sec-before.tv_sec)*1000000UL +
      	       after.tv_usec-before.tv_usec);
      
      	gettimeofday(&before, NULL);
      	for (p2 = p; p2 < p+SIZE; p2 += 4096)
      		*p2 = 0;
      	gettimeofday(&after, NULL);
      	printf("random access second tlb miss %Lu\n",
      	       (after.tv_sec-before.tv_sec)*1000000UL +
      	       after.tv_usec-before.tv_usec);
      
      	return 0;
      }
      ============
      Signed-off-by: NAndrea Arcangeli <aarcange@redhat.com>
      Acked-by: NRik van Riel <riel@redhat.com>
      Signed-off-by: NJohannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      71e3aac0