1. 21 December 2012 (1 commit)
  2. 17 December 2012 (1 commit)
    • mm: fix kernel BUG at huge_memory.c:1474! · a4f1de17
      Committed by Hugh Dickins
      Andrea's autonuma-benchmark numa01 hits kernel BUG at huge_memory.c:1474!
      in change_huge_pmd called from change_protection from change_prot_numa
      from task_numa_work.
      
      That BUG, introduced in the huge zero page commit cad7f613 ("thp:
      change_huge_pmd(): make sure we don't try to make a page writable")
      was trying to verify that newprot never adds write permission to an
      anonymous huge page; but Automatic NUMA Balancing's 4b10e7d5 ("mm:
      mempolicy: Implement change_prot_numa() in terms of change_protection()")
      adds a new prot_numa path into change_huge_pmd(), which makes no use of
      the newprot provided, and may retain the write bit in the pmd.
      
      Just move the BUG_ON(pmd_write(entry)) up into the !prot_numa block.
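      A minimal sketch of the fixed control flow (names as in change_huge_pmd();
      illustrative, not the verbatim upstream hunk):

      	entry = pmdp_get_and_clear(mm, addr, pmd);
      	if (!prot_numa) {
      		entry = pmd_modify(entry, newprot);
      		/* only this branch uses newprot, so only it can assert */
      		BUG_ON(pmd_write(entry));
      	} else if (!pmd_numa(entry)) {
      		/* prot_numa path: may legitimately keep the write bit */
      		entry = pmd_mknuma(entry);
      	}
      	set_pmd_at(mm, addr, pmd, entry);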
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  3. 13 December 2012 (13 commits)
    • thp: avoid race on multiple parallel page faults to the same page · 3ea41e62
      Committed by Kirill A. Shutemov
      The pmd value is stable only with mm->page_table_lock held. After taking
      the lock we must check that nobody modified the pmd before changing it.
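      The pattern is roughly (a sketch, not the full upstream diff):

      	spin_lock(&mm->page_table_lock);
      	if (unlikely(!pmd_same(*pmd, orig_pmd))) {
      		/* somebody changed the pmd while we were unlocked: back out */
      		spin_unlock(&mm->page_table_lock);
      		goto out;
      	}
      	/* pmd verified unchanged: safe to modify it now */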
      Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Jiri Slaby <jslaby@suse.cz>
      Cc: David Rientjes <rientjes@google.com>
      Reviewed-by: Bob Liu <lliubbo@gmail.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • thp: introduce sysfs knob to disable huge zero page · 79da5407
      Committed by Kirill A. Shutemov
      By default the kernel tries to use the huge zero page on read page
      faults. The huge zero page can be disabled by writing 0, or re-enabled
      by writing 1:
      
      echo 0 >/sys/kernel/mm/transparent_hugepage/use_zero_page
      echo 1 >/sys/kernel/mm/transparent_hugepage/use_zero_page
      Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: "H. Peter Anvin" <hpa@linux.intel.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • thp, vmstat: implement HZP_ALLOC and HZP_ALLOC_FAILED events · d8a8e1f0
      Committed by Kirill A. Shutemov
      hzp_alloc is incremented every time a huge zero page is successfully
      allocated. It includes allocations that were dropped due to a race with
      another allocation. Note that it does not count every map of the huge
      zero page, only its allocation.

      hzp_alloc_failed is incremented if the kernel fails to allocate a huge
      zero page and falls back to using small pages.
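      In the allocation path this amounts to something like the following
      (sketch; the event names are taken from this changelog and may not match
      the identifiers that finally landed upstream):

      	zero_page = alloc_pages((GFP_TRANSHUGE | __GFP_ZERO) & ~__GFP_MOVABLE,
      				HPAGE_PMD_ORDER);
      	if (!zero_page) {
      		count_vm_event(HZP_ALLOC_FAILED);
      		return false;
      	}
      	count_vm_event(HZP_ALLOC);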
      Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: "H. Peter Anvin" <hpa@linux.intel.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • thp: implement refcounting for huge zero page · 97ae1749
      Committed by Kirill A. Shutemov
      H. Peter Anvin doesn't like a huge zero page that sticks in memory
      forever after the first allocation. Here's an implementation of lockless
      refcounting for the huge zero page.
      
      We have two basic primitives: {get,put}_huge_zero_page(). They
      manipulate the reference counter.

      If the counter is 0, get_huge_zero_page() allocates a new huge page and
      takes two references: one for the caller and one for the shrinker. We
      free the page only in the shrinker callback, if the counter is 1 (only
      the shrinker holds a reference).

      put_huge_zero_page() only decrements the counter. The counter is never
      zero in put_huge_zero_page() since the shrinker holds one reference.
      
      Freeing the huge zero page in the shrinker callback helps to avoid
      frequent allocate/free cycles.

      Refcounting has a cost. On a 4-socket machine I observe a ~1% slowdown
      on parallel (40 processes) read page faulting compared to lazy huge page
      allocation. I think that's pretty reasonable for a synthetic benchmark.
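      A condensed sketch of the scheme (per the description above; the shrinker
      wiring and preemption details are omitted):

      	static atomic_t huge_zero_refcount;
      	static unsigned long huge_zero_pfn __read_mostly;

      	static unsigned long get_huge_zero_page(void)
      	{
      		struct page *zero_page;
      retry:
      		/* fast path: page exists, just take a reference */
      		if (likely(atomic_inc_not_zero(&huge_zero_refcount)))
      			return huge_zero_pfn;

      		zero_page = alloc_pages((GFP_TRANSHUGE | __GFP_ZERO) & ~__GFP_MOVABLE,
      					HPAGE_PMD_ORDER);
      		if (!zero_page)
      			return 0;
      		/* lost the race: someone installed a page meanwhile */
      		if (cmpxchg(&huge_zero_pfn, 0, page_to_pfn(zero_page))) {
      			__free_page(zero_page);
      			goto retry;
      		}
      		/* one reference for the caller, one for the shrinker */
      		atomic_set(&huge_zero_refcount, 2);
      		return huge_zero_pfn;
      	}

      	static void put_huge_zero_page(void)
      	{
      		/* the shrinker holds one reference: never reaches zero here */
      		BUG_ON(atomic_dec_and_test(&huge_zero_refcount));
      	}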
      
      [lliubbo@gmail.com: fix mismerge]
      Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: "H. Peter Anvin" <hpa@linux.intel.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: Bob Liu <lliubbo@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • thp: lazy huge zero page allocation · 78ca0e67
      Committed by Kirill A. Shutemov
      Instead of allocating the huge zero page in hugepage_init(), we can
      postpone allocation until the first huge zero page mapping. This saves
      memory if THP is not in use.

      cmpxchg() is used to avoid a race on huge_zero_pfn initialization.
      Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: "H. Peter Anvin" <hpa@linux.intel.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • thp: setup huge zero page on non-write page fault · 80371957
      Committed by Kirill A. Shutemov
      All code paths seem covered. Now we can map the huge zero page on read
      page faults.

      We set it up in do_huge_pmd_anonymous_page() if the area around the
      fault address is suitable for THP and we got a read page fault.

      If we fail to set up the huge zero page (ENOMEM) we fall back to
      handle_pte_fault() as we normally do in THP.
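      The relevant branch looks approximately like this (sketch;
      set_huge_zero_page() is the helper added earlier in the series):

      	/* in do_huge_pmd_anonymous_page() */
      	if (!(flags & FAULT_FLAG_WRITE)) {
      		pgtable_t pgtable;

      		pgtable = pte_alloc_one(mm, haddr);
      		if (unlikely(!pgtable))
      			goto out;	/* fall back to handle_pte_fault() */
      		spin_lock(&mm->page_table_lock);
      		set_huge_zero_page(pgtable, mm, vma, haddr, pmd);
      		spin_unlock(&mm->page_table_lock);
      		return 0;
      	}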
      Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: "H. Peter Anvin" <hpa@linux.intel.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • thp: implement splitting pmd for huge zero page · c5a647d0
      Committed by Kirill A. Shutemov
      We can't split the huge zero page itself (and it's a bug if we try), but
      we can split the pmd which points to it.

      On splitting the pmd we create a page table with all ptes set to the
      normal zero page.
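      In sketch form (modelled on the description above; the pagetable
      deposit/withdraw details follow the usual THP conventions):

      	static void __split_huge_zero_page_pmd(struct vm_area_struct *vma,
      			unsigned long haddr, pmd_t *pmd)
      	{
      		struct mm_struct *mm = vma->vm_mm;
      		pgtable_t pgtable;
      		pmd_t _pmd;
      		int i;

      		pmdp_clear_flush(vma, haddr, pmd);	/* leave pmd empty */
      		pgtable = pgtable_trans_huge_withdraw(mm);
      		pmd_populate(mm, &_pmd, pgtable);

      		for (i = 0; i < HPAGE_PMD_NR; i++, haddr += PAGE_SIZE) {
      			pte_t *pte, entry;
      			/* every pte points at the normal (4k) zero page */
      			entry = pfn_pte(my_zero_pfn(haddr), vma->vm_page_prot);
      			entry = pte_mkspecial(entry);
      			pte = pte_offset_map(&_pmd, haddr);
      			set_pte_at(mm, haddr, pte, entry);
      			pte_unmap(pte);
      		}
      		smp_wmb();	/* make ptes visible before the pmd */
      		pmd_populate(mm, pmd, pgtable);
      	}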
      
      [akpm@linux-foundation.org: fix build error]
      Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: "H. Peter Anvin" <hpa@linux.intel.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • thp: change split_huge_page_pmd() interface · e180377f
      Committed by Kirill A. Shutemov
      Pass the vma instead of the mm, and add an address parameter.

      In most cases we already have the vma on the stack. We provide
      split_huge_page_pmd_mm() for the few cases when we have the mm but not
      the vma.

      This change is preparation for the huge zero pmd splitting
      implementation.
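      The interface change in outline (a sketch of the declarations; the
      wrapper macro mirrors the old one):

      	/* old: split_huge_page_pmd(mm, pmd) */
      	void __split_huge_page_pmd(struct vm_area_struct *vma,
      			unsigned long address, pmd_t *pmd);

      	#define split_huge_page_pmd(__vma, __address, __pmd)		\
      		do {							\
      			pmd_t *____pmd = (__pmd);			\
      			if (unlikely(pmd_trans_huge(*____pmd)))		\
      				__split_huge_page_pmd(__vma, __address,	\
      						      ____pmd);		\
      		} while (0)

      	/* for callers that have an mm but no vma */
      	void split_huge_page_pmd_mm(struct mm_struct *mm,
      			unsigned long address, pmd_t *pmd);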
      Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: "H. Peter Anvin" <hpa@linux.intel.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • thp: change_huge_pmd(): make sure we don't try to make a page writable · cad7f613
      Committed by Kirill A. Shutemov
      The mprotect core never tries to make a page writable using
      change_huge_pmd(). Let's add an assert that this assumption holds. It's
      important to be sure we will not make the huge zero page writable. (As
      the entry dated 17 December above shows, the later prot_numa path broke
      this assumption and the BUG_ON had to be moved.)
      Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: "H. Peter Anvin" <hpa@linux.intel.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • thp: do_huge_pmd_wp_page(): handle huge zero page · 93b4796d
      Committed by Kirill A. Shutemov
      On write access to the huge zero page we allocate a new huge page and
      clear it.

      If that allocation fails (ENOMEM), graceful fallback: we create a new
      pmd table and set the pte around the fault address to a newly allocated
      normal (4k) page. All other ptes in the pmd are set to the normal zero
      page.
      Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: "H. Peter Anvin" <hpa@linux.intel.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Acked-by: David Rientjes <rientjes@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • thp: copy_huge_pmd(): copy huge zero page · fc9fe822
      Committed by Kirill A. Shutemov
      It's easy to copy the huge zero page: just set the destination pmd to
      the huge zero page.

      It's safe to copy the huge zero page since we have none yet :-p
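      In copy_huge_pmd() the zero-page case then reduces to roughly (sketch):

      	/* after the source pmd has been sampled under the lock */
      	if (is_huge_zero_pmd(pmd)) {
      		set_huge_zero_page(pgtable, dst_mm, vma, addr, dst_pmd);
      		ret = 0;
      		goto out_unlock;
      	}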
      
      [rientjes@google.com: fix comment]
      Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: "H. Peter Anvin" <hpa@linux.intel.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Signed-off-by: David Rientjes <rientjes@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • thp: zap_huge_pmd(): zap huge zero pmd · 479f0abb
      Committed by Kirill A. Shutemov
      We don't have a mapped page to zap in the huge zero page case. Let's
      just clear the pmd and remove it from the TLB.
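      Schematically (a sketch of the zap_huge_pmd() branch, not the exact
      upstream hunk):

      	if (is_huge_zero_pmd(orig_pmd)) {
      		/* no struct page to account or free */
      		tlb->mm->nr_ptes--;
      		spin_unlock(&tlb->mm->page_table_lock);
      	} else {
      		page = pmd_page(orig_pmd);
      		/* ...usual THP accounting and page freeing... */
      	}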
      Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: "H. Peter Anvin" <hpa@linux.intel.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Acked-by: David Rientjes <rientjes@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • thp: huge zero page: basic preparation · 4a6c1297
      Committed by Kirill A. Shutemov
      During testing I noticed a big (up to 2.5x) memory consumption overhead
      on some workloads (e.g. ft.A from NPB) if THP is enabled.

      The main reason for the big difference is the lack of a zero page in the
      THP case. We have to allocate a real page on read page fault.
      
      A program to demonstrate the issue:
      #include <assert.h>
      #include <stdlib.h>
      #include <unistd.h>
      
      #define MB 1024*1024
      
      int main(int argc, char **argv)
      {
              char *p;
              int i;
      
              posix_memalign((void **)&p, 2 * MB, 200 * MB);
              for (i = 0; i < 200 * MB; i+= 4096)
                      assert(p[i] == 0);
              pause();
              return 0;
      }
      
      With thp-never RSS is about 400k, but with thp-always it's 200M. After
      the patchset thp-always RSS is 400k too.
      
      Design overview.
      
      Huge zero page (hzp) is a non-movable huge page (2M on x86-64) filled
      with zeros. The way we allocate it changes over the course of the
      patchset:
      
      - [01/10] simplest way: hzp allocated on boot time in hugepage_init();
      - [09/10] lazy allocation on first use;
      - [10/10] lockless refcounting + shrinker-reclaimable hzp;
      
      We set it up in do_huge_pmd_anonymous_page() if the area around the
      fault address is suitable for THP and we've got a read page fault. If we
      fail to set up hzp (ENOMEM) we fall back to handle_pte_fault() as we
      normally do in THP.

      On a wp fault to hzp we allocate real memory for the huge page and clear
      it. If ENOMEM, graceful fallback: we create a new pmd table and set the
      pte around the fault address to a newly allocated normal (4k) page. All
      other ptes in the pmd are set to the normal zero page.

      We cannot split hzp (and it's a bug if we try), but we can split the pmd
      which points to it. On splitting the pmd we create a table with all ptes
      set to the normal zero page.
      
      ===
      
      At hpa's request I've tried an alternative approach for the hzp
      implementation (see the virtual huge zero page patchset): a pmd table
      with all entries set to the zero page. That way should be more cache
      friendly, but it increases TLB pressure.

      The problem with the virtual huge zero page: it requires per-arch
      enabling. We need a way to mark that a pmd table has all ptes set to the
      zero page.
      
      Some numbers comparing the two implementations (on a 4-socket
      Westmere-EX):

      Microbenchmark1
      ==============
      
      test:
              posix_memalign((void **)&p, 2 * MB, 8 * GB);
              for (i = 0; i < 100; i++) {
                      assert(memcmp(p, p + 4*GB, 4*GB) == 0);
                      asm volatile ("": : :"memory");
              }
      
      hzp:
       Performance counter stats for './test_memcmp' (5 runs):
      
            32356.272845 task-clock                #    0.998 CPUs utilized            ( +-  0.13% )
                      40 context-switches          #    0.001 K/sec                    ( +-  0.94% )
                       0 CPU-migrations            #    0.000 K/sec
                   4,218 page-faults               #    0.130 K/sec                    ( +-  0.00% )
          76,712,481,765 cycles                    #    2.371 GHz                      ( +-  0.13% ) [83.31%]
          36,279,577,636 stalled-cycles-frontend   #   47.29% frontend cycles idle     ( +-  0.28% ) [83.35%]
           1,684,049,110 stalled-cycles-backend    #    2.20% backend  cycles idle     ( +-  2.96% ) [66.67%]
         134,355,715,816 instructions              #    1.75  insns per cycle
                                                   #    0.27  stalled cycles per insn  ( +-  0.10% ) [83.35%]
          13,526,169,702 branches                  #  418.039 M/sec                    ( +-  0.10% ) [83.31%]
               1,058,230 branch-misses             #    0.01% of all branches          ( +-  0.91% ) [83.36%]
      
            32.413866442 seconds time elapsed                                          ( +-  0.13% )
      
      vhzp:
       Performance counter stats for './test_memcmp' (5 runs):
      
            30327.183829 task-clock                #    0.998 CPUs utilized            ( +-  0.13% )
                      38 context-switches          #    0.001 K/sec                    ( +-  1.53% )
                       0 CPU-migrations            #    0.000 K/sec
                   4,218 page-faults               #    0.139 K/sec                    ( +-  0.01% )
          71,964,773,660 cycles                    #    2.373 GHz                      ( +-  0.13% ) [83.35%]
          31,191,284,231 stalled-cycles-frontend   #   43.34% frontend cycles idle     ( +-  0.40% ) [83.32%]
             773,484,474 stalled-cycles-backend    #    1.07% backend  cycles idle     ( +-  6.61% ) [66.67%]
         134,982,215,437 instructions              #    1.88  insns per cycle
                                                   #    0.23  stalled cycles per insn  ( +-  0.11% ) [83.32%]
          13,509,150,683 branches                  #  445.447 M/sec                    ( +-  0.11% ) [83.34%]
               1,017,667 branch-misses             #    0.01% of all branches          ( +-  1.07% ) [83.32%]
      
            30.381324695 seconds time elapsed                                          ( +-  0.13% )
      
      Microbenchmark2
      ==============
      
      test:
              posix_memalign((void **)&p, 2 * MB, 8 * GB);
              for (i = 0; i < 1000; i++) {
                      char *_p = p;
                      while (_p < p+4*GB) {
                              assert(*_p == *(_p+4*GB));
                              _p += 4096;
                              asm volatile ("": : :"memory");
                      }
              }
      
      hzp:
       Performance counter stats for 'taskset -c 0 ./test_memcmp2' (5 runs):
      
             3505.727639 task-clock                #    0.998 CPUs utilized            ( +-  0.26% )
                       9 context-switches          #    0.003 K/sec                    ( +-  4.97% )
                   4,384 page-faults               #    0.001 M/sec                    ( +-  0.00% )
           8,318,482,466 cycles                    #    2.373 GHz                      ( +-  0.26% ) [33.31%]
           5,134,318,786 stalled-cycles-frontend   #   61.72% frontend cycles idle     ( +-  0.42% ) [33.32%]
           2,193,266,208 stalled-cycles-backend    #   26.37% backend  cycles idle     ( +-  5.51% ) [33.33%]
           9,494,670,537 instructions              #    1.14  insns per cycle
                                                   #    0.54  stalled cycles per insn  ( +-  0.13% ) [41.68%]
           2,108,522,738 branches                  #  601.451 M/sec                    ( +-  0.09% ) [41.68%]
                 158,746 branch-misses             #    0.01% of all branches          ( +-  1.60% ) [41.71%]
            3,168,102,115 L1-dcache-loads           #  903.693 M/sec                    ( +-  0.11% ) [41.70%]
            1,048,710,998 L1-dcache-misses          #   33.10% of all L1-dcache hits    ( +-  0.11% ) [41.72%]
            1,047,699,685 LLC-load                  #  298.854 M/sec                    ( +-  0.03% ) [33.38%]
                    2,287 LLC-misses                #    0.00% of all LL-cache hits     ( +-  8.27% ) [33.37%]
            3,166,187,367 dTLB-loads                #  903.147 M/sec                    ( +-  0.02% ) [33.35%]
                4,266,538 dTLB-misses               #    0.13% of all dTLB cache hits   ( +-  0.03% ) [33.33%]
      
             3.513339813 seconds time elapsed                                          ( +-  0.26% )
      
      vhzp:
       Performance counter stats for 'taskset -c 0 ./test_memcmp2' (5 runs):
      
            27313.891128 task-clock                #    0.998 CPUs utilized            ( +-  0.24% )
                      62 context-switches          #    0.002 K/sec                    ( +-  0.61% )
                   4,384 page-faults               #    0.160 K/sec                    ( +-  0.01% )
          64,747,374,606 cycles                    #    2.370 GHz                      ( +-  0.24% ) [33.33%]
          61,341,580,278 stalled-cycles-frontend   #   94.74% frontend cycles idle     ( +-  0.26% ) [33.33%]
          56,702,237,511 stalled-cycles-backend    #   87.57% backend  cycles idle     ( +-  0.07% ) [33.33%]
          10,033,724,846 instructions              #    0.15  insns per cycle
                                                   #    6.11  stalled cycles per insn  ( +-  0.09% ) [41.65%]
           2,190,424,932 branches                  #   80.195 M/sec                    ( +-  0.12% ) [41.66%]
               1,028,630 branch-misses             #    0.05% of all branches          ( +-  1.50% ) [41.66%]
            3,302,006,540 L1-dcache-loads           #  120.891 M/sec                    ( +-  0.11% ) [41.68%]
              271,374,358 L1-dcache-misses          #    8.22% of all L1-dcache hits    ( +-  0.04% ) [41.66%]
               20,385,476 LLC-load                  #    0.746 M/sec                    ( +-  1.64% ) [33.34%]
                   76,754 LLC-misses                #    0.38% of all LL-cache hits     ( +-  2.35% ) [33.34%]
            3,309,927,290 dTLB-loads                #  121.181 M/sec                    ( +-  0.03% ) [33.34%]
            2,098,967,427 dTLB-misses               #   63.41% of all dTLB cache hits   ( +-  0.03% ) [33.34%]
      
            27.364448741 seconds time elapsed                                          ( +-  0.24% )
      
      ===
      
      I personally prefer the implementation present in this patchset. It
      doesn't touch arch-specific code.
      
      This patch:
      
      Huge zero page (hzp) is a non-movable huge page (2M on x86-64) filled with
      zeros.
      
      For now let's allocate the page in hugepage_init(). We'll switch to lazy
      allocation later.

      We are not going to map the huge zero page until we can handle it
      properly on all code paths.

      The is_huge_zero_{pfn,pmd}() functions will be used by the following
      patches to check whether a pfn/pmd is the huge zero page.
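      The helpers are trivial (sketch):

      	static inline bool is_huge_zero_pfn(unsigned long pfn)
      	{
      		return pfn == huge_zero_pfn;
      	}

      	static inline bool is_huge_zero_pmd(pmd_t pmd)
      	{
      		return is_huge_zero_pfn(pmd_pfn(pmd));
      	}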
      Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: "H. Peter Anvin" <hpa@linux.intel.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Acked-by: David Rientjes <rientjes@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  4. 12 December 2012 (5 commits)
  5. 11 December 2012 (12 commits)
    • mm/rmap, migration: Make rmap_walk_anon() and try_to_unmap_anon() more scalable · 4fc3f1d6
      Committed by Ingo Molnar
      rmap_walk_anon() and try_to_unmap_anon() appear to be too careful about
      locking the anon vma: while they need protection against anon vma list
      modifications, they do not need exclusive access to the list itself.
      
      Transforming this exclusive lock to a read-locked rwsem removes
      a global lock from the hot path of page-migration intense
      threaded workloads which can cause pathological performance like
      this:
      
          96.43%        process 0  [kernel.kallsyms]  [k] perf_trace_sched_switch
                        |
                        --- perf_trace_sched_switch
                            __schedule
                            schedule
                            schedule_preempt_disabled
                            __mutex_lock_common.isra.6
                            __mutex_lock_slowpath
                            mutex_lock
                           |
                           |--50.61%-- rmap_walk
                           |          move_to_new_page
                           |          migrate_pages
                           |          migrate_misplaced_page
                           |          __do_numa_page.isra.69
                           |          handle_pte_fault
                           |          handle_mm_fault
                           |          __do_page_fault
                           |          do_page_fault
                           |          page_fault
                           |          __memset_sse2
                           |          |
                           |           --100.00%-- worker_thread
                           |                     |
                           |                      --100.00%-- start_thread
                           |
                            --49.39%-- page_lock_anon_vma
                                      try_to_unmap_anon
                                      try_to_unmap
                                      migrate_pages
                                      migrate_misplaced_page
                                      __do_numa_page.isra.69
                                      handle_pte_fault
                                      handle_mm_fault
                                      __do_page_fault
                                      do_page_fault
                                      page_fault
                                      __memset_sse2
                                      |
                                       --100.00%-- worker_thread
                                                 start_thread
      
      With this change applied the profile is now nicely flat
      and there's no anon-vma related scheduling/blocking.
      
      Rename anon_vma_[un]lock() => anon_vma_[un]lock_write(), to make it
      clearer that it is an exclusive write lock in that case (suggested by
      Rik van Riel).
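      The resulting helpers look roughly like this (sketch; the rwsem
      conversion itself is the next entry below):

      	static inline void anon_vma_lock_write(struct anon_vma *anon_vma)
      	{
      		down_write(&anon_vma->root->rwsem);
      	}

      	static inline void anon_vma_lock_read(struct anon_vma *anon_vma)
      	{
      		/* rmap walkers only need to exclude list modification */
      		down_read(&anon_vma->root->rwsem);
      	}

      	static inline void anon_vma_unlock_read(struct anon_vma *anon_vma)
      	{
      		up_read(&anon_vma->root->rwsem);
      	}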
      Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Paul Turner <pjt@google.com>
      Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Hugh Dickins <hughd@google.com>
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      Signed-off-by: Mel Gorman <mgorman@suse.de>
    • mm/rmap: Convert the struct anon_vma::mutex to an rwsem · 5a505085
      Committed by Ingo Molnar
      Convert the struct anon_vma::mutex to an rwsem, which will help in
      solving a page-migration scalability problem. (Addressed in a separate
      patch.)

      The conversion is simple and straightforward: in every case where we
      used mutex_lock() we now down_write().
      Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
      Reviewed-by: Rik van Riel <riel@redhat.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Paul Turner <pjt@google.com>
      Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Hugh Dickins <hughd@google.com>
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      Signed-off-by: Mel Gorman <mgorman@suse.de>
    • mm: numa: Add THP migration for the NUMA working set scanning fault case. · b32967ff
      Committed by Mel Gorman
      Note: This is very heavily based on a patch from Peter Zijlstra with
      fixes from Ingo Molnar, Hugh Dickins and Johannes Weiner. That patch put
      a lot of migration logic into mm/huge_memory.c where it does not belong.
      This version tries to share some of the migration logic with
      migrate_misplaced_page. However, it should be noted that now migrate.c
      is doing more with the pagetable manipulation than is preferred. The end
      result is barely recognisable, so as before, the signed-offs had to be
      removed but will be re-added if the original authors are ok with it.
      
      Add THP migration for the NUMA working set scanning fault case.
      
      It uses the page lock to serialize. No migration pte dance is
      necessary because the pte is already unmapped when we decide
      to migrate.
      
      [dhillf@gmail.com: Fix memory leak on isolation failure]
      [dhillf@gmail.com: Fix transfer of last_nid information]
      Signed-off-by: Mel Gorman <mgorman@suse.de>
    • mm: sched: Adapt the scanning rate if a NUMA hinting fault does not migrate · b8593bfd
      Committed by Mel Gorman
      The PTE scanning rate and fault rates are two of the biggest sources of
      system CPU overhead with automatic NUMA placement.  Ideally a proper policy
      would detect if a workload was properly placed, schedule and adjust the
      PTE scanning rate accordingly. We do not track the necessary information
      to do that but we at least know if we migrated or not.
      
      This patch scans slower if a page was not migrated as the result of a
      NUMA hinting fault, up to sysctl_numa_balancing_scan_period_max, which
      is now higher than the previous default. Once every minute it resets the
      scanner in case of phase changes.
      
      This is hilariously crude and the numbers are arbitrary. Workloads will
      converge quite slowly in comparison to what a proper policy should be able
      to do. On the plus side, we will chew up less CPU for workloads that have
      no need for automatic balancing.
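      Schematically (a sketch of the adaptation; the exact arithmetic upstream
      may differ):

      	/* in task_numa_fault(): scan slower when nothing was migrated */
      	if (!migrated)
      		p->numa_scan_period = min(sysctl_numa_balancing_scan_period_max,
      					  p->numa_scan_period + 10);

      	/* elsewhere, roughly once a minute, the period is reset toward the
      	 * minimum so that phase changes are picked up again */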
      Signed-off-by: Mel Gorman <mgorman@suse.de>
    • mm: numa: split_huge_page: Transfer last_nid on tail page · 5aa80374
      Committed by Hillf Danton
      Pass last_nid from head page to tail page.
      Signed-off-by: Hillf Danton <dhillf@gmail.com>
      Signed-off-by: Mel Gorman <mgorman@suse.de>
    • mm: numa: Add pte updates, hinting and migration stats · 03c5a6e1
      Committed by Mel Gorman
      It is tricky to quantify the basic cost of automatic NUMA placement in a
      meaningful manner. This patch adds some vmstats that can be used as part
      of a basic costing model.
      
      u    = basic unit = sizeof(void *)
      Ca   = cost of struct page access = sizeof(struct page) / u
      Cpte = Cost PTE access = Ca
      Cupdate = Cost PTE update = (2 * Cpte) + (2 * Wlock)
      	where Cpte is incurred twice for a read and a write and Wlock
      	is a constant representing the cost of taking or releasing a
      	lock
      Cnumahint = Cost of a minor page fault = some high constant e.g. 1000
      Cpagerw = Cost to read or write a full page = Ca + PAGE_SIZE/u
      Ci = Cost of page isolation = Ca + Wi
      	where Wi is a constant that should reflect the approximate cost
      	of the locking operation
      Cpagecopy = Cpagerw + (Cpagerw * Wnuma) + Ci + (Ci * Wnuma)
      	where Wnuma is the approximate NUMA factor. 1 is local. 1.2
      	would imply that remote accesses are 20% more expensive
      
      Balancing cost = Cpte * numa_pte_updates +
      		Cnumahint * numa_hint_faults +
      		Ci * numa_pages_migrated +
      		Cpagecopy * numa_pages_migrated
      
      Note that numa_pages_migrated is used as a measure of how many pages
      were isolated even though it would miss pages that failed to migrate. A
      vmstat counter could have been added for it but the isolation cost is
      pretty marginal in comparison to the overall cost so it seemed overkill.
      
      The ideal way to measure automatic placement benefit would be to count
      the number of remote accesses versus local accesses and do something
      like

      	benefit = (remote_accesses_before - remote_accesses_after) * Wnuma

      but the information is not readily available. As a workload converges,
      the expectation would be that the number of remote numa hints would
      reduce to 0.
      
      	convergence = numa_hint_faults_local / numa_hint_faults
      		where this is measured for the last N number of
      		numa hints recorded. When the workload is fully
      		converged the value is 1.
      
      This can measure if the placement policy is converging and how fast it is
      doing it.
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Acked-by: Rik van Riel <riel@redhat.com>
    • mm: numa: Add fault driven placement and migration · cbee9f88
      Committed by Peter Zijlstra
      NOTE: This patch is based on "sched, numa, mm: Add fault driven
      	placement and migration policy" but as it throws away all the policy
      	to just leave a basic foundation I had to drop the signed-offs-by.
      
      This patch creates a bare-bones method for setting PTEs pte_numa in the
      context of the scheduler; when such a PTE faults later, the page will be
      faulted onto the node the CPU is running on. In itself this does nothing
      useful, but any placement policy will fundamentally depend on receiving
      hints on placement from fault context and doing something intelligent
      about it.
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Acked-by: Rik van Riel <riel@redhat.com>
    • mm: mempolicy: Implement change_prot_numa() in terms of change_protection() · 4b10e7d5
      Committed by Mel Gorman
      This patch converts change_prot_numa() to use change_protection(). As
      pte_numa and friends check the PTE bits directly, it is necessary for
      change_protection() to use pmd_mknuma(). Hence the required
      modifications to change_protection() are a little clumsy, but the end
      result is that most of the numa page table helpers are just one or two
      instructions.
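      The converted helper ends up roughly as (sketch; the trailing 0 and 1
      arguments select dirty-accountable off and prot_numa on):

      	unsigned long change_prot_numa(struct vm_area_struct *vma,
      			unsigned long addr, unsigned long end)
      	{
      		int nr_updated;

      		BUILD_BUG_ON(_PAGE_NUMA != _PAGE_PROTNONE);
      		nr_updated = change_protection(vma, addr, end,
      					       vma->vm_page_prot, 0, 1);
      		return nr_updated;
      	}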
      Signed-off-by: Mel Gorman <mgorman@suse.de>
    • mm: mempolicy: Use _PAGE_NUMA to migrate pages · 4daae3b4
      Committed by Mel Gorman
      Note: Based on "mm/mpol: Use special PROT_NONE to migrate pages" but
      	sufficiently different that the signed-off-bys were dropped
      
      Combine our previous _PAGE_NUMA, mpol_misplaced and migrate_misplaced_page()
      pieces into an effective migrate on fault scheme.
      
      Note that (on x86) we rely on PROT_NONE pages being !present and avoid
      the TLB flush from try_to_unmap(TTU_MIGRATION). This greatly improves the
      page-migration performance.
      Based-on-work-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Signed-off-by: Mel Gorman <mgorman@suse.de>
    • mm: numa: Create basic numa page hinting infrastructure · d10e63f2
      Committed by Mel Gorman
      Note: This patch started as "mm/mpol: Create special PROT_NONE
      	infrastructure" and preserves the basic idea but steals *very*
      	heavily from "autonuma: numa hinting page faults entry points" for
      	the actual fault handlers without the migration parts.	The end
      	result is barely recognisable as either patch so all Signed-off
      	and Reviewed-bys are dropped. If Peter, Ingo and Andrea are ok with
      	this version, I will re-add the signed-offs-by to reflect the history.
      
      In order to facilitate a lazy, fault-driven migration of pages, create a
      special transient PAGE_NUMA variant; we can then use the 'spurious'
      protection faults to drive our migrations from.
      
      The meaning of PAGE_NUMA depends on the architecture, but on x86 it is
      effectively PROT_NONE. Actual PROT_NONE mappings will not generate these
      NUMA faults because the page fault code checks the permission on the VMA
      (and will throw a segmentation fault on actual PROT_NONE mappings)
      before it ever calls handle_mm_fault.
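      The fault-handler entry point is a small dispatch (sketch):

      	/* in handle_pte_fault() */
      	if (pte_numa(entry))
      		return do_numa_page(mm, vma, address, entry, pte, pmd);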
      
      [dhillf@gmail.com: Fix typo]
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Reviewed-by: Rik van Riel <riel@redhat.com>
    • mm: numa: split_huge_page: transfer the NUMA type from the pmd to the pte · 1ba6e0b5
      Committed by Andrea Arcangeli
      When we split a transparent hugepage, transfer the NUMA type from the
      pmd to the pte if needed.
      Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Reviewed-by: Rik van Riel <riel@redhat.com>
    • mm: Check if PTE is already allocated during page fault · 4fd01770
      Committed by Mel Gorman
      With transparent hugepage support, handle_mm_fault() has to be careful
      that a normal PMD has been established before handling a PTE fault. To
      achieve this, it used __pte_alloc() directly instead of pte_alloc_map(),
      as pte_alloc_map() is unsafe to run against a huge PMD. pte_offset_map()
      is called once it is known the PMD is safe.

      pte_alloc_map() is smart enough to check whether a PTE is already
      present before calling __pte_alloc(), but this check was lost. As a
      consequence, PTEs may be allocated unnecessarily and the page table lock
      taken. This useless PTE does get cleaned up, but it's a performance hit
      which is visible in page_test from aim9.
      
      This patch simply re-adds the check normally done by pte_alloc_map() to
      check whether the PTE needs to be allocated before taking the page table
      lock. The effect is noticeable in page_test from aim9.
      
       AIM9
                       2.6.38-vanilla 2.6.38-checkptenone
       creat-clo      446.10 ( 0.00%)   424.47 (-5.10%)
       page_test       38.10 ( 0.00%)    42.04 ( 9.37%)
       brk_test        52.45 ( 0.00%)    51.57 (-1.71%)
       exec_test      382.00 ( 0.00%)   456.90 (16.39%)
       fork_test       60.11 ( 0.00%)    67.79 (11.34%)
       MMTests Statistics: duration
       Total Elapsed Time (seconds)                611.90    612.22
      
      (While this affects 2.6.38, it is a performance rather than a
      functional bug and normally outside the rules -stable. While the big
      performance differences are to a microbench, the difference in fork
      and exec performance may be significant enough that -stable wants to
      consider the patch)
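      The re-added check amounts to (sketch):

      	/* in handle_mm_fault(), before mapping the pte */
      	if (unlikely(pmd_none(*pmd)) &&
      	    unlikely(__pte_alloc(mm, vma, pmd, address)))
      		return VM_FAULT_OOM;
      	/* the pmd is now known to be populated and non-huge */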
      Reported-by: Raz Ben Yehuda <raziebe@gmail.com>
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Rik van Riel <riel@redhat.com>
      [ Picked this up from the AutoNUMA tree to help
        it upstream and to allow apples-to-apples
        performance comparisons. ]
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
  6. 15 October 2012 (1 commit)
  7. 09 October 2012 (7 commits)
    • mm: thp: Use more portable PMD clearing sequence in zap_huge_pmd(). · f5c8ad47
      Committed by David Miller
      Invalidation sequences are handled in various ways on various
      architectures.
      
      One way, which sparc64 uses, is to let the set_*_at() functions accumulate
      pending flushes into a per-cpu array.  Then the flush_tlb_range() et al.
      calls process the pending TLB flushes.
      
      In this regime, the __tlb_remove_*tlb_entry() implementations are
      essentially NOPs.
      
      The canonical PTE zap in mm/memory.c is:
      
      			ptent = ptep_get_and_clear_full(mm, addr, pte,
      							tlb->fullmm);
      			tlb_remove_tlb_entry(tlb, pte, addr);
      
      With a subsequent tlb_flush_mmu() if needed.
      
      Mirror this in the THP PMD zapping using:
      
      		orig_pmd = pmdp_get_and_clear(tlb->mm, addr, pmd);
      		page = pmd_page(orig_pmd);
      		tlb_remove_pmd_tlb_entry(tlb, pmd, addr);
      
      This properly accommodates TLB flush mechanisms like the one described
      above.
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Gerald Schaefer <gerald.schaefer@de.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: Add and use update_mmu_cache_pmd() in transparent huge page code. · b113da65
      Committed by David Miller
      The transparent huge page code passes a PMD pointer in as the third
      argument of update_mmu_cache(), which expects a PTE pointer.
      
      This never got noticed because X86 implements update_mmu_cache() as a
      macro and thus we don't get any type checking, and X86 is the only
      architecture which supports transparent huge pages currently.
      
      Before other architectures can support transparent huge pages properly we
      need to add a new interface which will take a PMD pointer as the third
      argument rather than a PTE pointer.
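      The new interface in sketch form (x86 can keep it a no-op):

      	/* per-arch declaration: pmd pointer instead of pte pointer */
      	void update_mmu_cache_pmd(struct vm_area_struct *vma,
      				  unsigned long address, pmd_t *pmd);

      	/* on x86: */
      	#define update_mmu_cache_pmd(vma, vaddr, pmd) do { } while (0)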
      
      [akpm@linux-foundation.org: implement update_mmu_cache_pmd() for s390]
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Gerald Schaefer <gerald.schaefer@de.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm, thp: fix mapped pages avoiding unevictable list on mlock · b676b293
      Committed by David Rientjes
      When a transparent hugepage is mapped and it is included in an mlock()
      range, follow_page() incorrectly avoids setting the page's mlock bit and
      moving it to the unevictable lru.
      
      This is evident if you try to mlock(), munlock(), and then mlock() a
      range again.  Currently:
      
      	#define MAP_SIZE	(4 << 30)	/* 4GB */
      
      	void *ptr = mmap(NULL, MAP_SIZE, PROT_READ | PROT_WRITE,
      			 MAP_PRIVATE | MAP_ANONYMOUS, 0, 0);
      	mlock(ptr, MAP_SIZE);
      
      		$ grep -E "Unevictable|Inactive\(anon" /proc/meminfo
      		Inactive(anon):     6304 kB
      		Unevictable:     4213924 kB
      
      	munlock(ptr, MAP_SIZE);
      
      		Inactive(anon):  4186252 kB
      		Unevictable:       19652 kB
      
      	mlock(ptr, MAP_SIZE);
      
      		Inactive(anon):  4198556 kB
      		Unevictable:       21684 kB
      
      Notice that less than 2MB was added to the unevictable list; this is
      because these pages in the range are not transparent hugepages since the
      4GB range was allocated with mmap() and has no specific alignment.  If
      posix_memalign() were used instead, unevictable would not have grown at
      all on the second mlock().
      
      The fix is to call mlock_vma_page() so that the mlock bit is set and the
      page is added to the unevictable list.  With this patch:
      
      	mlock(ptr, MAP_SIZE);
      
      		Inactive(anon):     4056 kB
      		Unevictable:     4213940 kB
      
      	munlock(ptr, MAP_SIZE);
      
      		Inactive(anon):  4198268 kB
      		Unevictable:       19636 kB
      
      	mlock(ptr, MAP_SIZE);
      
      		Inactive(anon):     4008 kB
      		Unevictable:     4213940 kB
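      In follow_page()'s THP path the fix is approximately (sketch):

      	/* in follow_trans_huge_pmd() */
      	if ((flags & FOLL_MLOCK) && (vma->vm_flags & VM_LOCKED)) {
      		if (page->mapping && trylock_page(page)) {
      			lru_add_drain(); /* push cached pages to the LRU */
      			/* recheck after the page was locked */
      			if (page->mapping)
      				mlock_vma_page(page);
      			unlock_page(page);
      		}
      	}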
      Signed-off-by: David Rientjes <rientjes@google.com>
      Acked-by: Hugh Dickins <hughd@google.com>
      Reviewed-by: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michel Lespinasse <walken@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: move all mmu notifier invocations to be done outside the PT lock · 2ec74c3e
      Committed by Sagi Grimberg
      In order to allow sleeping during mmu notifier calls, we need to avoid
      invoking them under the page table spinlock.  This patch solves the
      problem by calling invalidate_page notification after releasing the lock
      (but before freeing the page itself), or by wrapping the page invalidation
      with calls to invalidate_range_begin and invalidate_range_end.
      
      To prevent accidental changes to the invalidate_range_end arguments after
      the call to invalidate_range_begin, the patch introduces a convention of
      saving the arguments in consistently named locals:
      
      	unsigned long mmun_start;	/* For mmu_notifiers */
      	unsigned long mmun_end;	/* For mmu_notifiers */
      
      	...
      
      	mmun_start = ...
      	mmun_end = ...
      	mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end);
      
      	...
      
      	mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
      
      The patch changes code to use this convention for all calls to
      mmu_notifier_invalidate_range_start/end, except those where the calls are
      close enough so that anyone who glances at the code can see the values
      aren't changing.
      
      This patchset is a preliminary step towards on-demand paging design to be
      added to the RDMA stack.
      
      Why do we want on-demand paging for Infiniband?
      
        Applications register memory with an RDMA adapter using system calls,
        and subsequently post IO operations that refer to the corresponding
        virtual addresses directly to HW.  Until now, this was achieved by
        pinning the memory during the registration calls.  The goal of on demand
        paging is to avoid pinning the pages of registered memory regions (MRs).
        This will allow users the same flexibility they get when swapping any
        other part of their process's address space. Instead of requiring the
        entire MR to fit in physical memory, we can allow the MR to be larger,
        and only fit the current working set in physical memory.
      
      Why should anyone care?  What problems are users currently experiencing?
      
        This can make programming with RDMA much simpler.  Today, developers
        that are working with more data than their RAM can hold need either to
        deregister and reregister memory regions throughout their process's
        life, or keep a single memory region and copy the data to it.  On demand
        paging will allow these developers to register a single MR at the
        beginning of their process's life, and let the operating system manage
        which pages needs to be fetched at a given time.  In the future, we
        might be able to provide a single memory access key for each process
        that would provide the entire process's address as one large memory
        region, and the developers wouldn't need to register memory regions at
        all.
      
      Is there any prospect that any other subsystems will utilise these
      infrastructural changes?  If so, which and how, etc?
      
        As for other subsystems, I understand that XPMEM wanted to sleep in
        MMU notifiers, as Christoph Lameter wrote at
        http://lkml.indiana.edu/hypermail/linux/kernel/0802.1/0460.html and
        perhaps Andrea knows about other use cases.
      
        Scheduling in mmu notifications is required since we need to sync the
        hardware with the secondary page tables change.  A TLB flush of an IO
        device is inherently slower than a CPU TLB flush, so our design works by
        sending the invalidation request to the device, and waiting for an
        interrupt before exiting the mmu notifier handler.
      
      Avi said:
      
        kvm may be a buyer.  kvm::mmu_lock, which serializes guest page
        faults, also protects long operations such as destroying large ranges.
        It would be good to convert it into a spinlock, but as it is used inside
        mmu notifiers, this cannot be done.
      
        (there are alternatives, such as keeping the spinlock and using a
        generation counter to do the teardown in O(1), which is what the "may"
        is doing up there).
      
      [akpm@linux-foundation.org: possible speed tweak in hugetlb_cow(), cleanups]
      Signed-off-by: Andrea Arcangeli <andrea@qumranet.com>
      Signed-off-by: Sagi Grimberg <sagig@mellanox.com>
      Signed-off-by: Haggai Eran <haggaie@mellanox.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
      Cc: Or Gerlitz <ogerlitz@mellanox.com>
      Cc: Haggai Eran <haggaie@mellanox.com>
      Cc: Shachar Raindel <raindel@mellanox.com>
      Cc: Liran Liss <liranl@mellanox.com>
      Cc: Christoph Lameter <cl@linux-foundation.org>
      Cc: Avi Kivity <avi@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: thp: fix the update_mmu_cache() last argument passing in mm/huge_memory.c · eab1eef9
      Committed by Catalin Marinas
      update_mmu_cache() takes a pointer (to pte_t by default) as the last
      argument, but huge_memory.c passes a pmd_t value. The patch changes the
      argument to a pmd_t * pointer.
      Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
      Signed-off-by: Steve Capper <steve.capper@arm.com>
      Signed-off-by: Will Deacon <will.deacon@arm.com>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Reviewed-by: Kirill A. Shutemov <kirill@shutemov.name>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Gerald Schaefer <gerald.schaefer@de.ibm.com>
      Reviewed-by: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Chris Metcalf <cmetcalf@tilera.com>
      Cc: Ralf Baechle <ralf@linux-mips.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • thp: khugepaged_prealloc_page() forgot to reset the page alloc indicator · e3b4126c
      Committed by Xiao Guangrong
      If NUMA is enabled, the indicator is not reset if the previous page
      request failed, causing us to trigger the BUG_ON() in
      khugepaged_alloc_page().
      Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Michel Lespinasse <walken@google.com>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm rmap: remove vma_address check for address inside vma · 86c2ad19
      Committed by Michel Lespinasse
      In file and anon rmap, we use interval trees to find potentially relevant
      vmas and then call vma_address() to find the virtual address the given
      page might be found at in these vmas.  vma_address() used to include a
      check that the returned address falls within the limits of the vma, but
      this check isn't necessary now that we always use interval trees in rmap:
      the interval tree just doesn't return any vmas which this check would find
      to be irrelevant.  As a result, we can replace the use of -EFAULT error
      code (which then needed to be checked in every call site) with a
      VM_BUG_ON().
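      After the change vma_address() reduces to roughly (sketch):

      	static inline unsigned long
      	vma_address(struct page *page, struct vm_area_struct *vma)
      	{
      		unsigned long address = __vma_address(page, vma);

      		/* the rmap interval trees guarantee this */
      		VM_BUG_ON(address < vma->vm_start || address >= vma->vm_end);
      		return address;
      	}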
      Signed-off-by: Michel Lespinasse <walken@google.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Daniel Santos <daniel.santos@pobox.com>
      Cc: Hugh Dickins <hughd@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>