1. 04 Apr, 2014 1 commit
    • J
      mm + fs: prepare for non-page entries in page cache radix trees · 0cd6144a
      Johannes Weiner authored
      shmem mappings already contain exceptional entries where swap slot
      information is remembered.
      
      To be able to store eviction information for regular page cache, prepare
      every site dealing with the radix trees directly to handle entries other
      than pages.
      
      The common lookup functions will filter out non-page entries and return
      NULL for page cache holes, just as before.  But provide a raw version of
      the API which returns non-page entries as well, and switch shmem over to
      use it.
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Reviewed-by: Rik van Riel <riel@redhat.com>
      Reviewed-by: Minchan Kim <minchan@kernel.org>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Bob Liu <bob.liu@oracle.com>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Luigi Semenzato <semenzato@google.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Metin Doslu <metin@citusdata.com>
      Cc: Michel Lespinasse <walken@google.com>
      Cc: Ozgun Erdogan <ozgun@citusdata.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Roman Gushchin <klamm@yandex-team.ru>
      Cc: Ryan Mallon <rmallon@gmail.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      0cd6144a
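      The split between a filtering lookup and a raw lookup described in
      0cd6144a can be sketched in userspace as follows.  This is only a toy
      model: a flat slot array stands in for the radix tree, and every name
      (ent_page, ent_mapping, ent_find_entry, ent_find_page) as well as the
      choice of tag bit is an assumption made for illustration, not the
      kernel's API.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical tag bit marking a slot that holds something other than a page. */
#define ENT_EXCEPTIONAL_BIT ((uintptr_t)2)
#define ENT_SLOTS 8

/* Stand-in for struct page; int alignment keeps the two low pointer bits clear. */
struct ent_page { int id; };

/* Stand-in for the page cache tree: a flat array of slots. */
struct ent_mapping { void *slots[ENT_SLOTS]; };

static int ent_is_exceptional(const void *entry)
{
    return ((uintptr_t)entry & ENT_EXCEPTIONAL_BIT) != 0;
}

/* Encode a small payload (e.g. a swap slot or eviction info) as a non-page entry. */
static void *ent_make_exceptional(uintptr_t payload)
{
    return (void *)((payload << 2) | ENT_EXCEPTIONAL_BIT);
}

static uintptr_t ent_exceptional_payload(const void *entry)
{
    assert(ent_is_exceptional(entry));
    return (uintptr_t)entry >> 2;
}

/* Raw lookup: returns whatever the slot holds, page or non-page entry. */
static void *ent_find_entry(const struct ent_mapping *m, size_t index)
{
    return index < ENT_SLOTS ? m->slots[index] : NULL;
}

/* Common lookup: non-page entries look like holes, so callers see NULL as before. */
static struct ent_page *ent_find_page(const struct ent_mapping *m, size_t index)
{
    void *entry = ent_find_entry(m, index);

    if (entry == NULL || ent_is_exceptional(entry))
        return NULL;
    return entry;
}
```

      Callers that only care about pages keep using the filtering lookup; a
      shmem-like user that needs the stored payload would call the raw lookup
      and test the tag bit itself.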
  2. 04 Mar, 2014 1 commit
    • D
      mm: close PageTail race · 668f9abb
      David Rientjes authored
      Commit bf6bddf1 ("mm: introduce compaction and migration for
      ballooned pages") introduces page_count(page) into memory compaction
      which dereferences page->first_page if PageTail(page).
      
      This results in a very rare NULL pointer dereference on the
      aforementioned page_count(page).  Indeed, anything that does
      compound_head(), including page_count() is susceptible to racing with
      prep_compound_page() and seeing a NULL or dangling page->first_page
      pointer.
      
      This patch uses Andrea's implementation of compound_trans_head() that
      deals with such a race and makes it the default compound_head()
      implementation.  This includes a read memory barrier that ensures that
      if PageTail(head) is true that we return a head page that is neither
      NULL nor dangling.  The patch then adds a store memory barrier to
      prep_compound_page() to ensure page->first_page is set.
      
      This is the safest way to ensure we see the head page that we are
      expecting, PageTail(page) is already in the unlikely() path and the
      memory barriers are unfortunately required.
      
      Hugetlbfs is the exception, we don't enforce a store memory barrier
      during init since no race is possible.
      Signed-off-by: David Rientjes <rientjes@google.com>
      Cc: Holger Kiehl <Holger.Kiehl@dwd.de>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Rafael Aquini <aquini@redhat.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      668f9abb
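      The store/read barrier pairing that 668f9abb relies on can be
      approximated in userspace with C11 atomics.  This is a sketch under
      stated assumptions: cp_page, cp_prep_tail and cp_compound_head are
      hypothetical names, the release and acquire fences stand in for the
      kernel's store and read memory barriers, and the re-check the kernel
      needs against a concurrent huge page split is left out.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for struct page; only the two fields the race involves. */
struct cp_page {
    atomic_bool tail;                   /* models PageTail() */
    struct cp_page *_Atomic first_page; /* models page->first_page */
};

static void cp_page_init(struct cp_page *page)
{
    atomic_init(&page->tail, false);
    atomic_init(&page->first_page, (struct cp_page *)NULL);
}

/* Writer side: publish the head pointer before the tail flag becomes visible. */
static void cp_prep_tail(struct cp_page *tail, struct cp_page *head)
{
    assert(head != NULL);
    atomic_store_explicit(&tail->first_page, head, memory_order_relaxed);
    atomic_thread_fence(memory_order_release); /* stand-in for the store barrier */
    atomic_store_explicit(&tail->tail, true, memory_order_relaxed);
}

/* Reader side: once the tail flag is seen, the head pointer must be valid. */
static struct cp_page *cp_compound_head(struct cp_page *page)
{
    struct cp_page *head;

    if (!atomic_load_explicit(&page->tail, memory_order_relaxed))
        return page;
    atomic_thread_fence(memory_order_acquire); /* stand-in for the read barrier */
    head = atomic_load_explicit(&page->first_page, memory_order_relaxed);
    assert(head != NULL);
    return head;
}
```

      Without the pairing, a reader could observe the tail flag while still
      seeing the old (NULL or stale) head pointer, which is the dereference
      the commit message describes.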
  3. 24 Jan, 2014 1 commit
  4. 22 Jan, 2014 4 commits
    • A
      mm/swap.c: reorganize put_compound_page() · 26296ad2
      Andrew Morton authored
      Tweak it to save a tab stop, make code layout slightly less nutty.
      Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Khalid Aziz <khalid.aziz@oracle.com>
      Cc: Pravin Shelar <pshelar@nicira.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Ben Hutchings <bhutchings@solarflare.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Johannes Weiner <jweiner@redhat.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      26296ad2
    • A
      mm: hugetlbfs: use __compound_tail_refcounted in __get_page_tail too · 3bfcd13e
      Andrea Arcangeli authored
      Also remove hugetlb.h which isn't needed anymore as PageHeadHuge is
      handled in mm.h.
      Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Khalid Aziz <khalid.aziz@oracle.com>
      Cc: Pravin Shelar <pshelar@nicira.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Ben Hutchings <bhutchings@solarflare.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Johannes Weiner <jweiner@redhat.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      3bfcd13e
    • A
      mm: tail page refcounting optimization for slab and hugetlbfs · 44518d2b
      Andrea Arcangeli authored
      This skips the _mapcount mangling for slab and hugetlbfs pages.
      
      The main trouble in doing this is to guarantee that PageSlab and
      PageHeadHuge remain constant for all get_page/put_page run on the tail
      of slab or hugetlbfs compound pages.  Otherwise if they're set during
      get_page but not set during put_page, the _mapcount of the tail page
      would underflow.
      
      PageHeadHuge will remain true until the compound page is released and
      enters the buddy allocator, so it is not at risk of changing even if the
      tail page is the last reference left on the page.
      
      PG_slab instead is cleared before the slab frees the head page with
      put_page, so if the tail pin is released after the slab freed the page,
      we would have a problem.  But in the slab case the tail pin cannot be
      the last reference left on the page.  This is because the slab code is
      free to reuse the compound page after a kfree/kmem_cache_free without
      having to check if there's any tail pin left.  In turn all tail pins
      must be always released while the head is still pinned by the slab code
      and so we know PG_slab will be still set too.
      Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
      Reviewed-by: Khalid Aziz <khalid.aziz@oracle.com>
      Cc: Pravin Shelar <pshelar@nicira.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Ben Hutchings <bhutchings@solarflare.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Johannes Weiner <jweiner@redhat.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      44518d2b
    • A
      mm: hugetlbfs: move the put/get_page slab and hugetlbfs optimization in a faster path · ebf360f9
      Andrea Arcangeli authored
      We don't actually need a reference on the head page in the slab and
      hugetlbfs paths, as long as we add a smp_rmb() which should be faster
      than get_page_unless_zero.
      
      [akpm@linux-foundation.org: fix typo in comment]
      Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Khalid Aziz <khalid.aziz@oracle.com>
      Cc: Pravin Shelar <pshelar@nicira.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Ben Hutchings <bhutchings@solarflare.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Johannes Weiner <jweiner@redhat.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      ebf360f9
  5. 22 Nov, 2013 1 commit
    • A
      mm: hugetlbfs: fix hugetlbfs optimization · 27c73ae7
      Andrea Arcangeli authored
      Commit 7cb2ef56 ("mm: fix aio performance regression for database
      caused by THP") can cause dereference of a dangling pointer if
      split_huge_page runs during PageHuge() if there are updates to the
      tail_page->private field.
      
      Also it is repeating compound_head twice for hugetlbfs and it is running
      compound_head+compound_trans_head for THP when a single one is needed in
      both cases.
      
      The new code within the PageSlab() check doesn't need to verify that the
      THP page size is never bigger than the smallest hugetlbfs page size, to
      avoid memory corruption.
      
      A longstanding theoretical race condition was found while fixing the
      above (see the change right after the skip_unlock label, that is
      relevant for the compound_lock path too).
      
      By re-establishing the _mapcount tail refcounting for all compound
      pages, this also fixes the below problem:
      
        echo 0 >/sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages
      
        BUG: Bad page state in process bash  pfn:59a01
        page:ffffea000139b038 count:0 mapcount:10 mapping:          (null) index:0x0
        page flags: 0x1c00000000008000(tail)
        Modules linked in:
        CPU: 6 PID: 2018 Comm: bash Not tainted 3.12.0+ #25
        Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
        Call Trace:
          dump_stack+0x55/0x76
          bad_page+0xd5/0x130
          free_pages_prepare+0x213/0x280
          __free_pages+0x36/0x80
          update_and_free_page+0xc1/0xd0
          free_pool_huge_page+0xc2/0xe0
          set_max_huge_pages.part.58+0x14c/0x220
          nr_hugepages_store_common.isra.60+0xd0/0xf0
          nr_hugepages_store+0x13/0x20
          kobj_attr_store+0xf/0x20
          sysfs_write_file+0x189/0x1e0
          vfs_write+0xc5/0x1f0
          SyS_write+0x55/0xb0
          system_call_fastpath+0x16/0x1b
      Signed-off-by: Khalid Aziz <khalid.aziz@oracle.com>
      Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
      Tested-by: Khalid Aziz <khalid.aziz@oracle.com>
      Cc: Pravin Shelar <pshelar@nicira.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Ben Hutchings <bhutchings@solarflare.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Johannes Weiner <jweiner@redhat.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      27c73ae7
  6. 08 Nov, 2013 1 commit
  7. 13 Sep, 2013 1 commit
  8. 12 Sep, 2013 1 commit
    • K
      mm: fix aio performance regression for database caused by THP · 7cb2ef56
      Khalid Aziz authored
      I am working with a tool that simulates Oracle database I/O workload.
      This tool (orion to be specific -
      <http://docs.oracle.com/cd/E11882_01/server.112/e16638/iodesign.htm#autoId24>)
      allocates hugetlbfs pages using shmget() with SHM_HUGETLB flag.  It then
      does aio into these pages from flash disks using various common block
      sizes used by database.  I am looking at performance with two of the most
      common block sizes - 1M and 64K.  aio performance with these two block
      sizes plunged after Transparent HugePages was introduced in the kernel.
      Here are performance numbers:
      
      		pre-THP		2.6.39		3.11-rc5
      1M read		8384 MB/s	5629 MB/s	6501 MB/s
      64K read	7867 MB/s	4576 MB/s	4251 MB/s
      
      I have narrowed the performance impact down to the overheads introduced by
      THP in __get_page_tail() and put_compound_page() routines.  perf top shows
      >40% of cycles being spent in these two routines.  Every time direct I/O
      to hugetlbfs pages starts, the kernel calls get_page() to grab a reference to
      the pages and calls put_page() when I/O completes to drop the reference.
      THP introduced a significant amount of locking overhead to get_page()
      and put_page() when dealing with compound pages because hugepages can be
      split underneath get_page() and put_page().  It added this overhead
      irrespective of whether it is dealing with hugetlbfs pages or transparent
      hugepages.  This resulted in a 20%-45% drop in aio performance when using
      hugetlbfs pages.
      
      Since hugetlbfs pages can not be split, there is no reason to go through
      all the locking overhead for these pages from what I can see.  I added
      code to __get_page_tail() and put_compound_page() to bypass all the
      locking code when working with hugetlbfs pages.  This improved performance
      significantly.  Performance numbers with this patch:
      
      		pre-THP		3.11-rc5	3.11-rc5 + Patch
      1M read		8384 MB/s	6501 MB/s	8371 MB/s
      64K read	7867 MB/s	4251 MB/s	6510 MB/s
      
      Performance with 64K read is still lower than what it was before THP, but
      still a 53% improvement.  It does mean there is more work to be done but I
      will take a 53% improvement for now.
      
      Please take a look at the following patch and let me know if it looks
      reasonable.
      
      [akpm@linux-foundation.org: tweak comments]
      Signed-off-by: Khalid Aziz <khalid.aziz@oracle.com>
      Cc: Pravin B Shelar <pshelar@nicira.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      7cb2ef56
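      The percentages quoted in 7cb2ef56 follow from the throughput tables
      in that message.  A small helper (percent_change is a name introduced
      here purely for illustration) makes the arithmetic explicit: going from
      4251 MB/s to 6510 MB/s is a gain of 2259/4251, roughly 53%, while the
      drops from the pre-THP numbers to 3.11-rc5 come out at roughly 22% for
      1M reads and 46% for 64K reads.

```c
#include <assert.h>

/* Relative change from `before` to `after`, in percent (negative means a drop). */
static double percent_change(double before, double after)
{
    assert(before > 0.0);
    return (after - before) / before * 100.0;
}
```

      For example, percent_change(4251.0, 6510.0) gives the 64K read
      improvement with the patch, and percent_change(7867.0, 4251.0) gives
      the 64K read regression it is recovering from.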
  9. 01 Aug, 2013 2 commits
    • K
      thp, mm: avoid PageUnevictable on active/inactive lru lists · e180cf80
      Kirill A. Shutemov authored
      active/inactive lru lists can contain unevictable pages (i.e.  ramfs pages
      that have been placed on the LRU lists when first allocated), but these
      pages must not have PageUnevictable set - otherwise shrink_[in]active_list
      goes crazy:
      
      kernel BUG at /home/space/kas/git/public/linux-next/mm/vmscan.c:1122!
      
      1090 static unsigned long isolate_lru_pages(unsigned long nr_to_scan,
      1091                 struct lruvec *lruvec, struct list_head *dst,
      1092                 unsigned long *nr_scanned, struct scan_control *sc,
      1093                 isolate_mode_t mode, enum lru_list lru)
      1094 {
      ...
      1108                 switch (__isolate_lru_page(page, mode)) {
      1109                 case 0:
      ...
      1116                 case -EBUSY:
      ...
      1121                 default:
      1122                         BUG();
      1123                 }
      1124         }
      ...
      1130 }
      
      __isolate_lru_page() returns -EINVAL for PageUnevictable(page), which
      falls through to the default case and hits the BUG() shown above.
      
      For lru_add_page_tail(), it means we should not set PageUnevictable()
      for tail pages unless we're sure that it will go to LRU_UNEVICTABLE.
      Let's just copy PG_active and PG_unevictable from head page in
      __split_huge_page_refcount(), it will simplify lru_add_page_tail().
      
      This will fix one more bug in lru_add_page_tail(): if
      page_evictable(page_tail) is false and PageLRU(page) is true, page_tail
      will go to the same lru as page, but nobody cares to sync page_tail
      active/inactive state with page.  So we can end up with inactive page on
      active lru.  The patch will fix it as well since we copy PG_active from
      head page.
      Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Acked-by: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      e180cf80
    • N
      mm/swap.c: clear PageActive before adding pages onto unevictable list · ef2a2cbd
      Naoya Horiguchi authored
      As a result of commit 13f7f789 ("mm: pagevec: defer deciding which
      LRU to add a page to until pagevec drain time"), pages on unevictable
      lists can have both of PageActive and PageUnevictable set.  This is not
      only confusing, but also corrupts page migration and
      shrink_[in]active_list.
      
      This patch fixes the problem by adding ClearPageActive before adding
      pages into unevictable list.  It also cleans up VM_BUG_ONs.
      Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      ef2a2cbd
  10. 04 Jul, 2013 5 commits
    • M
      mm: remove lru parameter from __lru_cache_add and lru_cache_add_lru · c53954a0
      Mel Gorman authored
      Similar to __pagevec_lru_add, this patch removes the LRU parameter from
      __lru_cache_add and lru_cache_add_lru as the caller does not control the
      exact LRU the page gets added to.  lru_cache_add_lru gets renamed to
      lru_cache_add as the name is silly without the lru parameter.  With the
      parameter removed, it is required that the caller indicate if they want
      the page added to the active or inactive list by setting or clearing
      PageActive respectively.
      
      [akpm@linux-foundation.org: Suggested the patch]
      [gang.chen@asianux.com: fix used-uninitialized warning]
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Signed-off-by: Chen Gang <gang.chen@asianux.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Rik van Riel <riel@redhat.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Alexey Lyahkov <alexey.lyashkov@gmail.com>
      Cc: Andrew Perepechko <anserper@ya.ru>
      Cc: Robin Dong <sanbai@taobao.com>
      Cc: Theodore Tso <tytso@mit.edu>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Bernd Schubert <bernd.schubert@fastmail.fm>
      Cc: David Howells <dhowells@redhat.com>
      Cc: Trond Myklebust <Trond.Myklebust@netapp.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Rik van Riel <riel@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      c53954a0
    • M
      mm: remove lru parameter from __pagevec_lru_add and remove parts of pagevec API · a0b8cab3
      Mel Gorman authored
      Now that the LRU to add a page to is decided at LRU-add time, remove the
      misleading lru parameter from __pagevec_lru_add.  A consequence of this
      is that the pagevec_lru_add_file, pagevec_lru_add_anon and similar
      helpers are misleading as the caller no longer has direct control over
      what LRU the page is added to.  Unused helpers are removed by this patch
      and existing users of pagevec_lru_add_file() are converted to use
      lru_cache_add_file() directly and use the per-cpu pagevecs instead of
      creating their own pagevec.
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Reviewed-by: Jan Kara <jack@suse.cz>
      Reviewed-by: Rik van Riel <riel@redhat.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Alexey Lyahkov <alexey.lyashkov@gmail.com>
      Cc: Andrew Perepechko <anserper@ya.ru>
      Cc: Robin Dong <sanbai@taobao.com>
      Cc: Theodore Tso <tytso@mit.edu>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Bernd Schubert <bernd.schubert@fastmail.fm>
      Cc: David Howells <dhowells@redhat.com>
      Cc: Trond Myklebust <Trond.Myklebust@netapp.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      a0b8cab3
    • M
      mm: activate !PageLRU pages on mark_page_accessed if page is on local pagevec · 059285a2
      Mel Gorman authored
      If a page is on a pagevec then it is !PageLRU and mark_page_accessed()
      may fail to move a page to the active list as expected.  Now that the
      LRU is selected at LRU drain time, mark pages PageActive if they are on
      the local pagevec so it gets moved to the correct list at LRU drain
      time.  Using a debugging patch it was found that for a simple git
      checkout based workload that pages were never added to the active file
      list in practice but with this patch applied they are.
      
      				before   after
      LRU Add Active File                  0      750583
      LRU Add Active Anon            2640587     2702818
      LRU Add Inactive File          8833662     8068353
      LRU Add Inactive Anon              207         200
      
      Note that only pages on the local pagevec are considered on purpose.  A
      !PageLRU page could be in the process of being released, reclaimed,
      migrated or on a remote pagevec that is currently being drained.
      Marking it PageActive is vulnerable to races where PageLRU and Active
      bits are checked at the wrong time.  Page reclaim will trigger
      VM_BUG_ONs but depending on when the race hits, it could also free a
      PageActive page to the page allocator and trigger a bad_page warning.
      Similarly a potential race exists between a per-cpu drain on a pagevec
      list and an activation on a remote CPU.
      
      				lru_add_drain_cpu
      				__pagevec_lru_add
      				  lru = page_lru(page);
      mark_page_accessed
        if (PageLRU(page))
          activate_page
        else
          SetPageActive
      				  SetPageLRU(page);
      				  add_page_to_lru_list(page, lruvec, lru);
      
      In this case a PageActive page is added to the inactive list and later
      the inactive/active stats will get skewed.  While the PageActive checks
      in vmscan could be removed and potentially dealt with, a skew in the
      statistics would be very difficult to detect.  Hence this patch deals
      just with the common case where a page being marked accessed has just
      been added to the local pagevec.
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Rik van Riel <riel@redhat.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Alexey Lyahkov <alexey.lyashkov@gmail.com>
      Cc: Andrew Perepechko <anserper@ya.ru>
      Cc: Robin Dong <sanbai@taobao.com>
      Cc: Theodore Tso <tytso@mit.edu>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Bernd Schubert <bernd.schubert@fastmail.fm>
      Cc: David Howells <dhowells@redhat.com>
      Cc: Trond Myklebust <Trond.Myklebust@netapp.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      059285a2
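      The behaviour described in 059285a2 can be modelled in a few lines of
      userspace C.  This is a toy sketch, not the kernel code: pv_page,
      pv_pagevec and the pv_* functions are hypothetical names, the flags are
      plain booleans, and there is no locking or per-CPU handling.  It shows
      a page that is still on the local pagevec being marked active on its
      second access, with the target list chosen only when the pagevec is
      drained.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define PV_SIZE 14 /* pagevec capacity used in this model */

enum pv_lru { PV_LRU_NONE, PV_LRU_INACTIVE, PV_LRU_ACTIVE };

/* Hypothetical stand-in for struct page with boolean flags. */
struct pv_page {
    bool on_lru, active, referenced;
    enum pv_lru list;
};

struct pv_pagevec {
    size_t nr;
    struct pv_page *pages[PV_SIZE];
};

static void pv_add(struct pv_pagevec *pvec, struct pv_page *page)
{
    assert(pvec->nr < PV_SIZE);
    pvec->pages[pvec->nr++] = page;
}

/* Drain: the LRU list is picked here, from the flags the page has by now. */
static void pv_drain(struct pv_pagevec *pvec)
{
    for (size_t i = 0; i < pvec->nr; i++) {
        struct pv_page *page = pvec->pages[i];

        page->list = page->active ? PV_LRU_ACTIVE : PV_LRU_INACTIVE;
        page->on_lru = true;
    }
    pvec->nr = 0;
}

static bool pv_on_pagevec(const struct pv_pagevec *pvec, const struct pv_page *page)
{
    for (size_t i = 0; i < pvec->nr; i++)
        if (pvec->pages[i] == page)
            return true;
    return false;
}

/* Second access activates: directly if on an LRU, via the flag if on the local pagevec. */
static void pv_mark_page_accessed(struct pv_pagevec *local, struct pv_page *page)
{
    if (!page->active && page->referenced) {
        if (page->on_lru) {
            page->active = true;
            page->list = PV_LRU_ACTIVE;
        } else if (pv_on_pagevec(local, page)) {
            page->active = true; /* takes effect at drain time */
        }
        page->referenced = false;
    } else if (!page->referenced) {
        page->referenced = true;
    }
}
```

      A page that is neither on an LRU nor on the local pagevec is left
      alone, which mirrors the message's point that only the local pagevec is
      considered.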
    • M
      mm: pagevec: defer deciding which LRU to add a page to until pagevec drain time · 13f7f789
      Mel Gorman authored
      mark_page_accessed() cannot activate an inactive page that is located on
      an inactive LRU pagevec.  Hints from filesystems may be ignored as a
      result.  In preparation for fixing that problem, this patch removes the
      per-LRU pagevecs and leaves just one pagevec.  The final LRU the page is
      added to is deferred until the pagevec is drained.
      
      This means that fewer pagevecs are available and potentially there is
      greater contention on the LRU lock.  However, this only applies in the
      case where there is an almost perfect mix of file, anon, active and
      inactive pages being added to the LRU.  In practice I expect that we are
      adding a stream of pages of a particular type and that the changes in
      contention will barely be measurable.
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Acked-by: Rik van Riel <riel@redhat.com>
      Cc: Jan Kara <jack@suse.cz>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Alexey Lyahkov <alexey.lyashkov@gmail.com>
      Cc: Andrew Perepechko <anserper@ya.ru>
      Cc: Robin Dong <sanbai@taobao.com>
      Cc: Theodore Tso <tytso@mit.edu>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Bernd Schubert <bernd.schubert@fastmail.fm>
      Cc: David Howells <dhowells@redhat.com>
      Cc: Trond Myklebust <Trond.Myklebust@netapp.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      13f7f789
    • M
      mm: add tracepoints for LRU activation and insertions · c6286c98
      Mel Gorman authored
      Andrew Perepechko reported a problem whereby pages are being prematurely
      evicted as the mark_page_accessed() hint is ignored for pages that are
      currently on a pagevec --
      http://www.spinics.net/lists/linux-ext4/msg37340.html .
      
      Alexey Lyahkov and Robin Dong have also reported problems recently that
      could be due to hot pages reaching the end of the inactive list too
      quickly and be reclaimed.
      
      Rather than addressing this on a per-filesystem basis, this series aims
      to fix the mark_page_accessed() interface by deferring the decision on
      which LRU a page is added to until pagevec drain time and allowing
      mark_page_accessed() to call SetPageActive on a pagevec page.
      
      Patch 1 adds two tracepoints for LRU page activation and insertion. Using
      	these tracepoints it's possible to build a model of pages in the
      	LRU that can be processed offline.
      
      Patch 2 defers making the decision on what LRU to add a page to until when
      	the pagevec is drained.
      
      Patch 3 searches the local pagevec for pages to mark PageActive on
      	mark_page_accessed. The changelog explains why only the local
      	pagevec is examined.
      
      Patches 4 and 5 tidy up the API.
      
      postmark, a dd-based test and fs-mark both single and threaded mode were
      run but none of them showed any performance degradation or gain as a
      result of the patch.
      
      Using patch 1, I built a *very* basic model of the LRU to examine
      offline what the average age of different page types on the LRU were in
      milliseconds.  Of course, capturing the trace distorts the test as it's
      written to local disk but it does not matter for the purposes of this
      test.  The average age of pages in milliseconds were
      
      				    vanilla deferdrain
      Average age mapped anon:               1454       1250
      Average age mapped file:             127841     155552
      Average age unmapped anon:               85        235
      Average age unmapped file:            73633      38884
      Average age unmapped buffers:         74054     116155
      
      The LRU activity was mostly files which you'd expect for a dd-based
      workload.  Note that the average age of buffer pages is increased by the
      series and it is expected this is due to the fact that the buffer pages
      are now getting added to the active list when drained from the pagevecs.
      Note that the average age of the unmapped file data is decreased as they
      are still added to the inactive list and are reclaimed before the
      buffers.
      
      There is no guarantee this is a universal win for all workloads and it
      would be nice if the filesystem people gave some thought as to whether
      this decision is generally a win or a loss.
      
      This patch:
      
      Using these tracepoints it is possible to model LRU activity and the
      average residency of pages of different types.  This can be used to
      debug problems related to premature reclaim of pages of particular
      types.
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Reviewed-by: Rik van Riel <riel@redhat.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Alexey Lyahkov <alexey.lyashkov@gmail.com>
      Cc: Andrew Perepechko <anserper@ya.ru>
      Cc: Robin Dong <sanbai@taobao.com>
      Cc: Theodore Tso <tytso@mit.edu>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Bernd Schubert <bernd.schubert@fastmail.fm>
      Cc: David Howells <dhowells@redhat.com>
      Cc: Trond Myklebust <Trond.Myklebust@netapp.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      c6286c98
  11. 08 May, 2013 1 commit
  12. 30 Apr, 2013 1 commit
  13. 24 Feb, 2013 1 commit
  14. 09 Oct, 2012 2 commits
    • H
      mm: remove vma arg from page_evictable · 39b5f29a
      Hugh Dickins authored
      page_evictable(page, vma) is an irritant: almost all its callers pass
      NULL for vma.  Remove the vma arg and use mlocked_vma_newpage(vma, page)
      explicitly in the couple of places it's needed.  But in those places we
      don't even need page_evictable() itself!  They're dealing with a freshly
      allocated anonymous page, which has no "mapping" and cannot be mlocked yet.
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Acked-by: Mel Gorman <mel@csn.ul.ie>
      Cc: Rik van Riel <riel@redhat.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michel Lespinasse <walken@google.com>
      Cc: Ying Han <yinghan@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      39b5f29a
    • R
      mm: fix nonuniform page status when writing new file with small buffer · d741c9cd
      Robin Dong authored
      When writing a new file with a 2048-byte buffer, such as write(fd, buffer,
      2048), it will call generic_perform_write() twice for every page:
      
      	write_begin
      	mark_page_accessed(page)
      	write_end
      
      	write_begin
      	mark_page_accessed(page)
      	write_end
      
      Pages 1-13 will be added to lru-pvecs in write_begin() and will *NOT* be
      added to active_list even though they have been accessed twice, because
      they are not PageLRU(page).  But when the 14th page comes, all pages in
      lru-pvecs will be moved to inactive_list (by __lru_cache_add()) in the first
      write_begin(), so the 14th page *is* PageLRU(page).  After the second
      write_end(), only the 14th page will be in active_list.
      
      In a Hadoop environment, we do run into this situation: after writing a
      file, we find that only the 14th, 28th, 42nd... pages are in active_list
      and the others are in inactive_list.  When kswapd runs and shrinks the
      inactive_list, only the 14th, 28th... pages of the file remain in memory,
      readahead requests are broken up into only 52k (13*4k), and system
      performance falls dramatically.
      
      This problem can also be reproduced with the steps below (the machine has 8G of memory):
      
      	1. dd if=/dev/zero of=/test/file.out bs=1024 count=1048576
      	2. cat another 7.5G file to /dev/null
      	3. vmtouch -m 1G -v /test/file.out, it will show:
      
      	/test/file.out
      	[oooooooooooooooooooOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOO] 187847/262144
      
      	the 'o' means some pages are in memory but some are not.
      
      The solution for this problem is simple: the 14th page should be added to
      lru_add_pvecs before mark_page_accessed() just as other pages.
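      
      The batching effect described above can be sketched in userspace C.  This
      is a toy model, not kernel code: the sim_* names, the three-flag page and
      the drain_first switch are all assumptions made for illustration, and the
      "fixed" ordering (drain a full batch before queueing the new page) is only
      one reading of the solution stated above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define PAGEVEC_SIZE 14	/* batch size implied by "pages 1-13 ... 14th page" */

/* Toy stand-in for struct page: only the flags this scenario depends on. */
struct sim_page {
	bool lru;		/* PageLRU: page has reached an LRU list */
	bool referenced;	/* PageReferenced */
	bool active;		/* PageActive: page is on active_list */
};

/* Toy stand-in for the per-CPU batch of pages waiting to join the LRU. */
struct sim_pagevec {
	size_t nr;
	struct sim_page *pages[PAGEVEC_SIZE];
};

/* Move every queued page onto the (inactive) LRU and empty the batch. */
static void sim_drain(struct sim_pagevec *pvec)
{
	for (size_t i = 0; i < pvec->nr; i++)
		pvec->pages[i]->lru = true;
	pvec->nr = 0;
}

/*
 * Queue a new page.  drain_first == false models the behaviour described
 * in the commit message (add, then drain when the batch becomes full);
 * drain_first == true is the assumed fix (drain a full batch, then add).
 */
static void sim_lru_cache_add(struct sim_pagevec *pvec, struct sim_page *page,
			      bool drain_first)
{
	if (drain_first && pvec->nr == PAGEVEC_SIZE)
		sim_drain(pvec);
	assert(pvec->nr < PAGEVEC_SIZE);
	pvec->pages[pvec->nr++] = page;
	if (!drain_first && pvec->nr == PAGEVEC_SIZE)
		sim_drain(pvec);
}

/* Second access activates a page only if it is already PageLRU. */
static void sim_mark_page_accessed(struct sim_page *page)
{
	if (!page->active && page->referenced && page->lru) {
		page->active = true;
		page->referenced = false;
	} else if (!page->referenced) {
		page->referenced = true;
	}
}

/*
 * Write nr_pages pages with two half-page writes each.  Returns how many
 * pages ended up active; active_idx (optional, room for nr_pages entries)
 * receives their 1-based indices.
 */
static size_t sim_write_file(struct sim_page *pages, size_t nr_pages,
			     bool drain_first, size_t *active_idx)
{
	struct sim_pagevec pvec = { 0 };
	size_t nr_active = 0;

	for (size_t i = 0; i < nr_pages; i++) {
		pages[i] = (struct sim_page){ false, false, false };
		sim_lru_cache_add(&pvec, &pages[i], drain_first);
		sim_mark_page_accessed(&pages[i]);	/* first 2048-byte write */
		sim_mark_page_accessed(&pages[i]);	/* second 2048-byte write */
	}
	for (size_t i = 0; i < nr_pages; i++) {
		if (!pages[i].active)
			continue;
		if (active_idx)
			active_idx[nr_active] = i + 1;
		nr_active++;
	}
	return nr_active;
}
```

      Calling sim_write_file() on 42 pages with drain_first set to false marks
      only pages 14, 28 and 42 active in this model; with drain_first set to
      true no page is activated by its own two writes, so all pages are treated
      alike.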
      
      [akpm@linux-foundation.org: tweak comment]
      [akpm@linux-foundation.org: grab better comment from the v3 patch]
      Signed-off-by: Robin Dong <sanbai@taobao.com>
      Reviewed-by: Minchan Kim <minchan@kernel.org>
      Cc: KOSAKI Motohiro <kosaki.motohiro@gmail.com>
      Reviewed-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Wu Fengguang <fengguang.wu@intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      d741c9cd
  15. 01 Aug, 2012 2 commits
    • M
      mm: add support for direct_IO to highmem pages · 5a178119
      Mel Gorman committed
      The patch "mm: add support for a filesystem to activate swap files and use
      direct_IO for writing swap pages" added support for using direct_IO to
      write swap pages but it is insufficient for highmem pages.
      
      To support highmem pages, this patch kmap()s the page before calling the
      direct_IO() handler.  As direct_IO deals with virtual addresses, an
      additional helper is necessary for get_kernel_pages() to look up the
      struct page for a kmap virtual address.
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Acked-by: Rik van Riel <riel@redhat.com>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: David S. Miller <davem@davemloft.net>
      Cc: Eric B Munson <emunson@mgebm.net>
      Cc: Eric Paris <eparis@redhat.com>
      Cc: James Morris <jmorris@namei.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Mike Christie <michaelc@cs.wisc.edu>
      Cc: Neil Brown <neilb@suse.de>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Sebastian Andrzej Siewior <sebastian@breakpoint.cc>
      Cc: Trond Myklebust <Trond.Myklebust@netapp.com>
      Cc: Xiaotian Feng <dfeng@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      5a178119
    • M
      mm: add get_kernel_page[s] for pinning of kernel addresses for I/O · 18022c5d
      Mel Gorman committed
      This patch adds two new APIs get_kernel_pages() and get_kernel_page() that
      may be used to pin a vector of kernel addresses for IO.  The initial user
      is expected to be NFS for allowing pages to be written to swap using
      aops->direct_IO().  Strictly speaking, swap-over-NFS only needs to pin one
      page for IO but it makes sense to express the API in terms of a vector and
      add a helper for pinning single pages.
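      
      The vector-plus-helper shape described above can be sketched in userspace
      C.  Everything named sim_* here is hypothetical: the flat sim_memory
      array, the refcount-only page and the address lookup stand in for kernel
      structures, and the signatures are simplified rather than copied from the
      patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SIM_PAGE_SIZE 4096u
#define SIM_NR_PAGES  8u

/* Toy page descriptor: pinning is modelled as a reference count. */
struct sim_page {
	int refcount;
};

/* One segment of the address vector (base address plus length). */
struct sim_kvec {
	void *base;
	size_t len;
};

/* Flat "kernel memory" and its one-descriptor-per-page map. */
static unsigned char sim_memory[SIM_NR_PAGES * SIM_PAGE_SIZE];
static struct sim_page sim_mem_map[SIM_NR_PAGES];

/* Hypothetical lookup from an address to the page that backs it. */
static struct sim_page *sim_addr_to_page(const void *addr)
{
	uintptr_t off = (uintptr_t)addr - (uintptr_t)sim_memory;

	assert(off < sizeof(sim_memory));
	return &sim_mem_map[off / SIM_PAGE_SIZE];
}

/*
 * Vector form: pin the page behind each segment and store it in pages[].
 * Returns the number of pages pinned; stops early at the first segment
 * that is not exactly one page long.
 */
static int sim_get_kernel_pages(const struct sim_kvec *kiov, int nr_segs,
				struct sim_page **pages)
{
	int seg;

	for (seg = 0; seg < nr_segs; seg++) {
		if (kiov[seg].len != SIM_PAGE_SIZE)
			return seg;
		pages[seg] = sim_addr_to_page(kiov[seg].base);
		pages[seg]->refcount++;
	}
	return seg;
}

/* Single-page helper: a one-element vector passed to the vector form. */
static int sim_get_kernel_page(void *addr, struct sim_page **pages)
{
	const struct sim_kvec kiov = { .base = addr, .len = SIM_PAGE_SIZE };

	return sim_get_kernel_pages(&kiov, 1, pages);
}
```

      Keeping the single-page variant as a thin wrapper means only the vector
      form holds the validation and pinning logic, which is the trade-off the
      paragraph above argues for.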
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Reviewed-by: Rik van Riel <riel@redhat.com>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: David S. Miller <davem@davemloft.net>
      Cc: Eric B Munson <emunson@mgebm.net>
      Cc: Eric Paris <eparis@redhat.com>
      Cc: James Morris <jmorris@namei.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Mike Christie <michaelc@cs.wisc.edu>
      Cc: Neil Brown <neilb@suse.de>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Sebastian Andrzej Siewior <sebastian@breakpoint.cc>
      Cc: Trond Myklebust <Trond.Myklebust@netapp.com>
      Cc: Xiaotian Feng <dfeng@redhat.com>
      Cc: Mark Salter <msalter@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      18022c5d
  16. 30 May, 2012 3 commits
  17. 22 Mar, 2012 1 commit
  18. 06 Mar, 2012 1 commit
    • H
      memcg: fix GPF when cgroup removal races with last exit · 7512102c
      Hugh Dickins committed
      When moving tasks from old memcg (with move_charge_at_immigrate on new
      memcg), followed by removal of old memcg, hit General Protection Fault in
      mem_cgroup_lru_del_list() (called from release_pages called from
      free_pages_and_swap_cache from tlb_flush_mmu from tlb_finish_mmu from
      exit_mmap from mmput from exit_mm from do_exit).
      
      Somewhat reproducible, takes a few hours: the old struct mem_cgroup has
      been freed and poisoned by SLAB_DEBUG, but mem_cgroup_lru_del_list() is
      still trying to update its stats, and take page off lru before freeing.
      
      A task, or a charge, or a page on lru: each secures a memcg against
      removal.  In this case, the last task has been moved out of the old memcg,
      and it is exiting: anonymous pages are uncharged one by one from the
      memcg, as they are zapped from its pagetables, so the charge gets down to
      0; but the pages themselves are queued in an mmu_gather for freeing.
      
      Most of those pages will be on lru (and force_empty is careful to
      lru_add_drain_all, to add pages from pagevec to lru first), but not
      necessarily all: perhaps some have been isolated for page reclaim, perhaps
      some isolated for other reasons.  So, force_empty may find no task, no
      charge and no page on lru, and let the removal proceed.
      
      There would still be no problem if these pages were immediately freed; but
      typically (and the put_page_testzero protocol demands it) they have to be
      added back to lru before they are found freeable, then removed from lru
      and freed.  We don't see the issue when adding, because the
      mem_cgroup_iter() loops keep their own reference to the memcg being
      scanned; but when it comes to mem_cgroup_lru_del_list().
      
      I believe this was not an issue in v3.2: there, PageCgroupAcctLRU and
      PageCgroupUsed flags were used (like a trick with mirrors) to deflect view
      of pc->mem_cgroup to the stable root_mem_cgroup when neither set.
      38c5d72f ("memcg: simplify LRU handling by new rule") mercifully
      removed those convolutions, but left this General Protection Fault.
      
      But it's surprisingly easy to restore the old behaviour: just check
      PageCgroupUsed in mem_cgroup_lru_add_list() (which decides on which lruvec
      to add), and reset pc to root_mem_cgroup if page is uncharged.  A risky
      change?  just going back to how it worked before; testing, and an audit of
      uses of pc->mem_cgroup, show no problem.
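      
      As a rough userspace sketch of that check (the sim_* types and helpers
      are assumptions for illustration, with a plain counter standing in for
      the per-memcg lru statistics): an uncharged page is redirected to the
      root group when it is added, so the later delete never touches the old
      group.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy memory cgroup: a name and a count of pages on its lru. */
struct sim_memcg {
	const char *name;
	long lru_size;
};

/* Toy page_cgroup: the "used" (charged) flag and the owning group. */
struct sim_page_cgroup {
	bool used;
	struct sim_memcg *memcg;
};

/* The root group is never removed, so it is always safe to point at. */
static struct sim_memcg sim_root_memcg = { "root", 0 };

/*
 * Add side: decide which group's lru the page joins.  An uncharged page
 * does nothing to keep its former group alive, so switch it to root
 * before accounting it.
 */
static struct sim_memcg *sim_lru_add_list(struct sim_page_cgroup *pc)
{
	if (!pc->used && pc->memcg != &sim_root_memcg)
		pc->memcg = &sim_root_memcg;
	pc->memcg->lru_size++;
	return pc->memcg;
}

/*
 * Delete side: update the stats of whichever group the add side chose.
 * In this model that is never a group the page is no longer charged to.
 */
static struct sim_memcg *sim_lru_del_list(struct sim_page_cgroup *pc)
{
	assert(pc->memcg->lru_size > 0);
	pc->memcg->lru_size--;
	return pc->memcg;
}
```

      With this in place, a page that was uncharged before being put back on
      the lru is accounted to sim_root_memcg on both add and delete, so freeing
      the old group in between is harmless in the model.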
      
      And there's a nice bonus: with mem_cgroup_lru_add_list() itself making
      sure that an uncharged page goes to root lru, mem_cgroup_reset_owner() no
      longer has any purpose, and we can safely revert 4e5f01c2 ("memcg:
      clear pc->mem_cgroup if necessary").
      
      Calling update_page_reclaim_stat() after add_page_to_lru_list() in swap.c
      is not strictly necessary: the lru_lock there, with RCU before memcg
      structures are freed, makes mem_cgroup_get_reclaim_stat_from_page safe
      without that; but it seems cleaner to rely on one dependency less.
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Konstantin Khlebnikov <khlebnikov@openvz.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      7512102c
  19. 09 Feb, 2012 1 commit
  20. 13 Jan, 2012 8 commits
  21. 11 Jan, 2012 1 commit