1. 06 Mar, 2012 1 commit
    • mm: thp: fix BUG on mm->nr_ptes · 1c641e84
      Andrea Arcangeli committed
      Dave Jones reports a few Fedora users hitting the BUG_ON(mm->nr_ptes...)
      in exit_mmap() recently.
      
      Quoting Hugh's discovery and explanation of the SMP race condition:
      
        "mm->nr_ptes had unusual locking: down_read mmap_sem plus
         page_table_lock when incrementing, down_write mmap_sem (or mm_users
         0) when decrementing; whereas THP is careful to increment and
         decrement it under page_table_lock.
      
         Now most of those paths in THP also hold mmap_sem for read or write
         (with appropriate checks on mm_users), but two do not: when
         split_huge_page() is called by hwpoison_user_mappings(), and when
         called by add_to_swap().
      
         It's conceivable that the latter case is responsible for the
         exit_mmap() BUG_ON mm->nr_ptes that has been reported on Fedora."
      
      The simplest way to fix it without having to alter the locking is to make
      split_huge_page() a noop in nr_ptes terms, by counting the preallocated
      pagetables that exist for every mapped hugepage.  It was an arbitrary
      choice not to count them, and neither way is wrong or right, because
      they are not used but they're still allocated.
      Reported-by: Dave Jones <davej@redhat.com>
      Reported-by: Hugh Dickins <hughd@google.com>
      Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
      Acked-by: Hugh Dickins <hughd@google.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Josh Boyer <jwboyer@redhat.com>
      Cc: <stable@vger.kernel.org>	[3.0.x, 3.1.x, 3.2.x]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
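The accounting idea in this commit can be sketched as a small userspace model. This is only an illustration under stated assumptions: `struct mm_model` and the helper names are hypothetical and are not the kernel's code. It shows that once the preallocated pagetable is counted at map time, a split leaves `nr_ptes` unchanged.

```c
#include <assert.h>

/* Hypothetical userspace model; names echo the commit text, not kernel code. */
struct mm_model {
	long nr_ptes;		/* pagetables accounted to this mm */
	long huge_mappings;	/* hugepages currently mapped by a huge pmd */
};

/* Mapping a hugepage preallocates one pagetable; count it right away. */
static void map_hugepage(struct mm_model *mm)
{
	mm->huge_mappings++;
	mm->nr_ptes++;
}

/* A split only starts using the already counted pagetable. */
static void split_hugepage(struct mm_model *mm)
{
	assert(mm->huge_mappings > 0);
	mm->huge_mappings--;	/* nr_ptes is deliberately left untouched */
}

/* Unmapping an unsplit hugepage frees its preallocated pagetable. */
static void unmap_hugepage(struct mm_model *mm)
{
	assert(mm->huge_mappings > 0);
	mm->huge_mappings--;
	mm->nr_ptes--;
}

/* Freeing a regular pagetable, for example the one left behind by a split. */
static void free_pagetable(struct mm_model *mm)
{
	assert(mm->nr_ptes > 0);
	mm->nr_ptes--;
}
```

In this model the counter reaches zero at teardown whether or not a split happened in between, which is the property the exit_mmap() check relies on.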
  2. 09 Feb, 2012 1 commit
  3. 13 Jan, 2012 6 commits
  4. 09 Dec, 2011 1 commit
  5. 03 Nov, 2011 1 commit
    • mm: thp: tail page refcounting fix · 70b50f94
      Andrea Arcangeli committed
      Michel, while working on the working set estimation code, noticed that
      calling get_page_unless_zero() on a random pfn_to_page(random_pfn)
      wasn't safe if the pfn ended up being a tail page of a transparent
      hugepage under splitting by __split_huge_page_refcount().
      
      He then found the problem could also theoretically materialize with
      page_cache_get_speculative() during the speculative radix tree lookups
      that use get_page_unless_zero() in SMP if the radix tree page is freed
      and reallocated and get_user_pages is called on it before
      page_cache_get_speculative has a chance to call get_page_unless_zero().
      
      So the best way to fix the problem is to keep page_tail->_count zero at
      all times.  This will guarantee that get_page_unless_zero() can never
      succeed on any tail page.  page_tail->_mapcount is guaranteed zero and
      is unused for all tail pages of a compound page, so we can simply
      account the tail page references there and transfer them to
      tail_page->_count in __split_huge_page_refcount() (in addition to the
      head_page->_mapcount).
      
      While debugging this s/_count/_mapcount/ change I also noticed get_page is
      called by direct-io.c on pages returned by get_user_pages.  That wasn't
      entirely safe because the two atomic_inc in get_page weren't atomic.  By
      contrast, other get_user_pages users, like the secondary-MMU page fault
      that establishes the shadow pagetables, would never call any superfluous
      get_page after get_user_pages returns.  It's safer to make get_page
      universally safe for tail pages and to use get_page_foll() within
      follow_page (inside get_user_pages()).  get_page_foll() is safe to do the
      refcounting for tail pages without taking any locks because it is run
      within PT lock protected critical sections (PT lock for pte and
      page_table_lock for pmd_trans_huge).
      
      The standard get_page() as invoked by direct-io instead will now take
      the compound_lock but still only for tail pages.  The direct-io paths
      are usually I/O bound and the compound_lock is per THP so very
      fine-grained, so there's no risk of scalability issues with it.  A simple
      direct-io benchmark with all lockdep, prove-locking and spinlock
      debugging infrastructure enabled shows identical performance and no
      overhead.  So it's worth it.  Ideally direct-io should stop calling
      get_page() on pages returned by get_user_pages().  The spinlock in
      get_page() is already optimized away for no-THP builds but doing
      get_page() on tail pages returned by GUP is generally a rare operation
      and usually only run in I/O paths.
      
      This new refcounting on page_tail->_mapcount, in addition to avoiding new
      RCU critical sections, will also allow the working set estimation code to
      work without any further complexity associated with the tail page
      refcounting with THP.
      Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
      Reported-by: Michel Lespinasse <walken@google.com>
      Reviewed-by: Michel Lespinasse <walken@google.com>
      Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Johannes Weiner <jweiner@redhat.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: David Gibson <david@gibson.dropbear.id.au>
      Cc: <stable@kernel.org>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
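The refcounting scheme described in this commit can be modelled in a few lines of single-threaded C. This is a simplified sketch, not the kernel implementation: `struct page_model`, the `model_` helpers and the `tail_pins` field (standing in for the pins kept in page_tail->_mapcount) are hypothetical, the model ignores locking, and it leaves out the head page's mapcount contribution at split time.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical single-threaded model; field names echo the commit text. */
struct page_model {
	int count;	/* stands in for page->_count */
	int tail_pins;	/* stands in for pins kept in page_tail->_mapcount */
};

/* Speculative lookup: succeeds only if the page is already referenced. */
static bool model_get_page_unless_zero(struct page_model *page)
{
	if (page->count == 0)
		return false;
	page->count++;
	return true;
}

/* Pin a tail page: reference the head, record the pin on the tail. */
static void model_get_tail_page(struct page_model *head,
				struct page_model *tail)
{
	assert(tail->count == 0);	/* tail count stays zero before a split */
	head->count++;
	tail->tail_pins++;
}

/* At split time the recorded pins move from the head to the tail's count. */
static void model_split_transfer_pins(struct page_model *head,
				      struct page_model *tail)
{
	assert(head->count > tail->tail_pins);
	head->count -= tail->tail_pins;
	tail->count += tail->tail_pins;
	tail->tail_pins = 0;
}
```

Because the tail's count never leaves zero while the page is part of a hugepage, the speculative lookup in this model cannot succeed on a tail page until the split has handed the pins over.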
  6. 01 Nov, 2011 4 commits
    • mm/huge_memory: fix typo when updating mmu cache · 35d8c7ad
      Hillf Danton committed
      There are three calls to update_mmu_cache() in the file, and the one in
      collapse_huge_page() has a typo in its last parameter, which is
      corrected based on the other two calls.
      
      Because of how update_mmu_cache() is defined on x86, currently the only
      arch that implements THP, the change has no visible effect today, but it
      could save a minute or two of effort for archs that are likely to
      support THP in the future.
      Signed-off-by: Hillf Danton <dhillf@gmail.com>
      Cc: Johannes Weiner <jweiner@redhat.com>
      Reviewed-by: Andrea Arcangeli <aarcange@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/huge_memory: fix copying user highpage · 0089e485
      Hillf Danton committed
      The THP copy-on-write handler falls back to regular-sized pages for a huge
      page replacement upon allocation failure or if THP has been individually
      disabled in the target VMA.  The loop responsible for copying page-sized
      chunks accidentally uses multiples of PAGE_SHIFT instead of PAGE_SIZE as
      the virtual address arg for copy_user_highpage().
      Signed-off-by: Hillf Danton <dhillf@gmail.com>
      Acked-by: Johannes Weiner <jweiner@redhat.com>
      Reviewed-by: Andrea Arcangeli <aarcange@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
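The PAGE_SHIFT versus PAGE_SIZE mix-up is easy to see with concrete numbers. The sketch below is illustrative only: the `MODEL_` constants assume 4K pages and 512 subpages per hugepage (the x86 values), and both helper functions are hypothetical, not taken from the kernel.

```c
#include <assert.h>

/* Assumed x86 values: 4K base pages, 512 of them per 2M hugepage. */
#define MODEL_PAGE_SHIFT	12
#define MODEL_PAGE_SIZE		(1UL << MODEL_PAGE_SHIFT)
#define MODEL_HPAGE_PMD_NR	512

/* Correct: subpage i starts i whole pages after the hugepage address. */
static unsigned long subpage_addr(unsigned long haddr, int i)
{
	assert(i >= 0 && i < MODEL_HPAGE_PMD_NR);
	return haddr + MODEL_PAGE_SIZE * i;
}

/* Buggy variant: steps by the shift (12 bytes) instead of the page size. */
static unsigned long subpage_addr_buggy(unsigned long haddr, int i)
{
	assert(i >= 0 && i < MODEL_HPAGE_PMD_NR);
	return haddr + MODEL_PAGE_SHIFT * i;
}
```

For a hugepage at 0x200000, the correct helper yields 0x201000 for subpage 1, while the buggy one yields 0x20000c, an address that is still inside subpage 0.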
    • mm/huge_memory.c: quiet sparse noise · 2f1da642
      H Hartley Sweeten committed
      Quiet the sparse noise:
      
      warning: symbol 'khugepaged_scan' was not declared. Should it be static?
      warning: context imbalance in 'khugepaged_scan_mm_slot' - unexpected unlock
      Signed-off-by: H Hartley Sweeten <hsweeten@visionengravers.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Johannes Weiner <jweiner@redhat.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • thp: mremap support and TLB optimization · 37a1c49a
      Andrea Arcangeli committed
      This adds THP support to mremap (decreases the number of split_huge_page()
      calls).
      
      Here are also some benchmarks with a proggy like this:
      
      ===
      #define _GNU_SOURCE
      #include <sys/mman.h>
      #include <stdlib.h>
      #include <stdio.h>
      #include <string.h>
      #include <sys/time.h>
      
      #define SIZE (5UL*1024*1024*1024)
      
      int main()
      {
      	static struct timeval oldstamp, newstamp;
      	long diffsec;
      	char *p, *p2, *p3, *p4;
      	if (posix_memalign((void **)&p, 2*1024*1024, SIZE))
      		perror("memalign"), exit(1);
      	if (posix_memalign((void **)&p2, 2*1024*1024, SIZE))
      		perror("memalign"), exit(1);
      	if (posix_memalign((void **)&p3, 2*1024*1024, 4096))
      		perror("memalign"), exit(1);
      
      	memset(p, 0xff, SIZE);
      	memset(p2, 0xff, SIZE);
      	memset(p3, 0x77, 4096);
      	gettimeofday(&oldstamp, NULL);
      	p4 = mremap(p, SIZE, SIZE, MREMAP_FIXED|MREMAP_MAYMOVE, p3);
      	gettimeofday(&newstamp, NULL);
      	diffsec = newstamp.tv_sec - oldstamp.tv_sec;
      	diffsec = newstamp.tv_usec - oldstamp.tv_usec + 1000000 * diffsec;
      	printf("usec %ld\n", diffsec);
      	/* mremap() reports failure through its return value, p4 */
      	if (p4 == MAP_FAILED || p4 != p3)
      		perror("mremap"), exit(1);
      	if (memcmp(p4, p2, SIZE))
      		printf("mremap bug\n"), exit(1);
      	printf("ok\n");
      
      	return 0;
      }
      ===
      
      THP on
      
       Performance counter stats for './largepage13' (3 runs):
      
                69195836 dTLB-loads                 ( +-   3.546% )  (scaled from 50.30%)
                   60708 dTLB-load-misses           ( +-  11.776% )  (scaled from 52.62%)
               676266476 dTLB-stores                ( +-   5.654% )  (scaled from 69.54%)
                   29856 dTLB-store-misses          ( +-   4.081% )  (scaled from 89.22%)
              1055848782 iTLB-loads                 ( +-   4.526% )  (scaled from 80.18%)
                    8689 iTLB-load-misses           ( +-   2.987% )  (scaled from 58.20%)
      
              7.314454164  seconds time elapsed   ( +-   0.023% )
      
      THP off
      
       Performance counter stats for './largepage13' (3 runs):
      
              1967379311 dTLB-loads                 ( +-   0.506% )  (scaled from 60.59%)
                 9238687 dTLB-load-misses           ( +-  22.547% )  (scaled from 61.87%)
              2014239444 dTLB-stores                ( +-   0.692% )  (scaled from 60.40%)
                 3312335 dTLB-store-misses          ( +-   7.304% )  (scaled from 67.60%)
              6764372065 iTLB-loads                 ( +-   0.925% )  (scaled from 79.00%)
                    8202 iTLB-load-misses           ( +-   0.475% )  (scaled from 70.55%)
      
              9.693655243  seconds time elapsed   ( +-   0.069% )
      
      grep thp /proc/vmstat
      thp_fault_alloc 35849
      thp_fault_fallback 0
      thp_collapse_alloc 3
      thp_collapse_alloc_failed 0
      thp_split 0
      
      thp_split 0 confirms no thp split despite plenty of hugepages allocated.
      
      Measuring only the mremap time (so excluding the three long memsets
      and the final long memcmp that accesses 10GB of memory):
      
      THP on
      
      usec 14824
      usec 14862
      usec 14859
      
      THP off
      
      usec 256416
      usec 255981
      usec 255847
      
      With an older kernel without the mremap optimizations (the below patch
      optimizes the non THP version too).
      
      THP on
      
      usec 392107
      usec 390237
      usec 404124
      
      THP off
      
      usec 444294
      usec 445237
      usec 445820
      
      I guess with a threaded program that sends more IPI on large SMP it'd
      create an even larger difference.
      
      All debug options are off except DEBUG_VM to avoid skewing the
      results.
      
      The only constraint for a native 2M mremap like the one above is that
      both the source and destination addresses must be 2M aligned, or the
      hugepmd can't be moved without a split, but that is a hardware limitation.
      
      [akpm@linux-foundation.org: coding-style nitpicking]
      Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
      Acked-by: Johannes Weiner <jweiner@redhat.com>
      Acked-by: Mel Gorman <mgorman@suse.de>
      Acked-by: Rik van Riel <riel@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
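The 2M alignment requirement stated at the end of this commit message can be expressed as a small predicate. This is a hedged sketch: `MODEL_HPAGE_PMD_SIZE` and `can_move_whole_pmd()` are hypothetical names, and the extra length check (at least one full 2M extent left to move) is an assumption added for the example rather than something the message spells out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed hugepage size: 2M, as in the commit message. */
#define MODEL_HPAGE_PMD_SIZE	(2UL * 1024 * 1024)

static bool is_huge_aligned(uintptr_t addr)
{
	return (addr & (MODEL_HPAGE_PMD_SIZE - 1)) == 0;
}

/*
 * A huge pmd can be moved without a split only if both the source and
 * the destination are 2M aligned and a full 2M extent remains.
 */
static bool can_move_whole_pmd(uintptr_t old_addr, uintptr_t new_addr,
			       size_t len)
{
	assert(len > 0);
	return is_huge_aligned(old_addr) && is_huge_aligned(new_addr) &&
	       len >= MODEL_HPAGE_PMD_SIZE;
}
```

In the benchmark program both regions come from posix_memalign() with a 2M alignment, so under this model every 2M extent of the move qualifies.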
  7. 26 Jul, 2011 1 commit
  8. 16 Jun, 2011 1 commit
  9. 25 May, 2011 2 commits
  10. 29 Apr, 2011 1 commit
    • mm: thp: fix /dev/zero MAP_PRIVATE and vm_flags cleanups · 78f11a25
      Andrea Arcangeli committed
      The huge_memory.c THP page fault was allowed to run if vm_ops was null
      (which would succeed for /dev/zero MAP_PRIVATE, as the f_op->mmap wouldn't
      set up a special vma->vm_ops and it would fall back to regular anonymous
      memory) but other THP logic wasn't fully activated for vmas with vm_file
      not NULL (/dev/zero has a not NULL vma->vm_file).
      
      So this removes the vm_file checks so that /dev/zero also can safely use
      THP (the other, albeit safer, approach to fix this bug would have been to
      prevent the THP initial page fault from running if vm_file was set).
      
      After removing the vm_file checks, this also makes huge_memory.c stricter
      in khugepaged for the DEBUG_VM=y case.  It doesn't replace the vm_file
      check with an is_pfn_mapping check (but it keeps checking for VM_PFNMAP
      under VM_BUG_ON) because for an is_cow_mapping() mapping VM_PFNMAP should
      only be allowed to exist before the first page fault, and in turn when
      vma->anon_vma is null (so preventing khugepaged registration).  So I tend
      to think the previous comment saying if vm_file was set, VM_PFNMAP might
      have been set and we could still be registered in khugepaged (despite
      anon_vma having to be non-NULL to be registered in khugepaged) was too
      paranoid.  The is_linear_pfn_mapping check is also I think superfluous (as
      described by comment) but under DEBUG_VM it is safe to stay.
      
      Addresses https://bugzilla.kernel.org/show_bug.cgi?id=33682
      Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
      Reported-by: Caspar Zhang <bugs@casparzhang.com>
      Acked-by: Mel Gorman <mel@csn.ul.ie>
      Acked-by: Rik van Riel <riel@redhat.com>
      Cc: <stable@kernel.org>		[2.6.38.x]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
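For readers unfamiliar with the mapping type this commit is about, here is a minimal userspace sketch of a /dev/zero MAP_PRIVATE mapping on Linux. The helper name `map_dev_zero_private()` is hypothetical. Whether the kernel actually backs such a mapping with hugepages depends on the kernel version and the THP settings, so the madvise() call is only a hint and its result is ignored.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/* Map len bytes of /dev/zero privately; returns NULL on failure. */
static char *map_dev_zero_private(size_t len)
{
	char *p;
	int fd;

	assert(len > 0);
	fd = open("/dev/zero", O_RDWR);
	if (fd < 0)
		return NULL;

	/* The vma gets a non-NULL vm_file but behaves like anonymous memory. */
	p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_PRIVATE, fd, 0);
	close(fd);	/* the mapping keeps its own reference to the file */
	if (p == MAP_FAILED)
		return NULL;

#ifdef MADV_HUGEPAGE
	/* Only a hint; it may fail when THP is not available. */
	(void)madvise(p, len, MADV_HUGEPAGE);
#endif
	return p;
}
```

A caller would write to the returned region to trigger page faults and release it with munmap() when done.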
  11. 15 Apr, 2011 2 commits
  12. 23 Mar, 2011 1 commit
  13. 14 Mar, 2011 1 commit
  14. 05 Mar, 2011 3 commits
  15. 16 Feb, 2011 1 commit
  16. 12 Feb, 2011 1 commit
    • memcg: fix leak of accounting at failure path of hugepage collapsing · 678ff896
      KAMEZAWA Hiroyuki committed
      mem_cgroup_uncharge_page() should be called in all failure cases after
      mem_cgroup_charge_newpage() is called in huge_memory.c::collapse_huge_page().
      
       [ 4209.076861] BUG: Bad page state in process khugepaged  pfn:1e9800
       [ 4209.077601] page:ffffea0006b14000 count:0 mapcount:0 mapping:          (null) index:0x2800
       [ 4209.078674] page flags: 0x40000000004000(head)
       [ 4209.079294] pc:ffff880214a30000 pc->flags:2146246697418756 pc->mem_cgroup:ffffc9000177a000
       [ 4209.082177] (/A)
       [ 4209.082500] Pid: 31, comm: khugepaged Not tainted 2.6.38-rc3-mm1 #1
       [ 4209.083412] Call Trace:
       [ 4209.083678]  [<ffffffff810f4454>] ? bad_page+0xe4/0x140
       [ 4209.084240]  [<ffffffff810f53e6>] ? free_pages_prepare+0xd6/0x120
       [ 4209.084837]  [<ffffffff8155621d>] ? rwsem_down_failed_common+0xbd/0x150
       [ 4209.085509]  [<ffffffff810f5462>] ? __free_pages_ok+0x32/0xe0
       [ 4209.086110]  [<ffffffff810f552b>] ? free_compound_page+0x1b/0x20
       [ 4209.086699]  [<ffffffff810fad6c>] ? __put_compound_page+0x1c/0x30
       [ 4209.087333]  [<ffffffff810fae1d>] ? put_compound_page+0x4d/0x200
       [ 4209.087935]  [<ffffffff810fb015>] ? put_page+0x45/0x50
       [ 4209.097361]  [<ffffffff8113f779>] ? khugepaged+0x9e9/0x1430
       [ 4209.098364]  [<ffffffff8107c870>] ? autoremove_wake_function+0x0/0x40
       [ 4209.099121]  [<ffffffff8113ed90>] ? khugepaged+0x0/0x1430
       [ 4209.099780]  [<ffffffff8107c236>] ? kthread+0x96/0xa0
       [ 4209.100452]  [<ffffffff8100dda4>] ? kernel_thread_helper+0x4/0x10
       [ 4209.101214]  [<ffffffff8107c1a0>] ? kthread+0x0/0xa0
       [ 4209.101842]  [<ffffffff8100dda0>] ? kernel_thread_helper+0x0/0x10
      Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Acked-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      Reviewed-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
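The rule in this commit, uncharge on every failure path after a successful charge, is the usual goto-cleanup pattern. The sketch below models it in userspace with a plain counter. All names (`struct charge_model`, `collapse_model()`, the three numbered steps) are hypothetical and only stand in for the real charge and collapse logic.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for memcg accounting: outstanding charges. */
struct charge_model {
	int charged;
};

static bool charge_newpage(struct charge_model *cg)
{
	cg->charged++;
	return true;	/* the model never fails to charge */
}

static void uncharge_page(struct charge_model *cg)
{
	assert(cg->charged > 0);
	cg->charged--;
}

/*
 * failing_step selects which of three later steps fails; pass -1 for a
 * fully successful collapse.
 */
static bool collapse_model(struct charge_model *cg, int failing_step)
{
	if (!charge_newpage(cg))
		return false;

	for (int step = 0; step < 3; step++) {
		if (step == failing_step)
			goto out_uncharge;
	}
	return true;	/* success: the charge stays with the new hugepage */

out_uncharge:
	uncharge_page(cg);	/* every failure after the charge undoes it */
	return false;
}
```

Funnelling all failures through a single label keeps the charge count balanced no matter which step bails out.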
  17. 03 Feb, 2011 1 commit
    • thp: fix the wrong reported address of hwpoisoned hugepages · a6d30ddd
      Jin Dongming committed
      When a tail page of a THP is poisoned, the head page is poisoned too,
      and the wrong address, the address of the head page, is always sent
      with SIGBUS.
      
      So when the poisoned page is used by a guest OS running on KVM, after
      qemu translates the address (hva->gpa), the wrong process in the guest
      OS is killed by SIGBUS.
      
      What we expect is that the process using the poisoned tail page is
      killed in the guest OS, not the process using the healthy head page.
      
      Since it is not good to poison a healthy page, avoid poisoning any page
      other than the one that is really poisoned.
        (While we poison all pages in a huge page in the hugetlb case,
         we can do this for THP thanks to split_huge_page().)
      
      Here we fix two parts:
        1. Isolate only the poisoned page to make sure
           the reported address is the address of the poisoned page.
        2. Make the poisoned page work like a poisoned regular page.
      
      [akpm@linux-foundation.org: fix spello in comment]
      Signed-off-by: Jin Dongming <jin.dongming@np.css.fujitsu.com>
      Reviewed-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
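The difference between the reported and the intended address can be illustrated with simple arithmetic. This is an assumption-laden sketch: the `HWP_` constants assume 4K base pages inside a 2M hugepage, and `poisoned_vaddr()` is a hypothetical helper, not the kernel's reporting code.

```c
#include <assert.h>

/* Assumed layout: 4K base pages, 512 of them in one 2M hugepage. */
#define HWP_PAGE_SHIFT		12
#define HWP_SUBPAGES_PER_HUGE	512

/*
 * Virtual address of the subpage that is actually poisoned, given the
 * address where the hugepage is mapped and the subpage index.
 */
static unsigned long poisoned_vaddr(unsigned long huge_vaddr,
				    unsigned int subpage_index)
{
	assert(subpage_index < HWP_SUBPAGES_PER_HUGE);
	return huge_vaddr + ((unsigned long)subpage_index << HWP_PAGE_SHIFT);
}
```

Reporting `huge_vaddr` itself, as happened before the fix, corresponds to index 0 and points at the healthy head page whenever the poisoned page is a tail page.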
  18. 21 Jan, 2011 2 commits
  19. 14 Jan, 2011 9 commits