1. 13 December 2012 (19 commits)
    • oom: use N_MEMORY instead N_HIGH_MEMORY · bd3a66c1
      Authored by Lai Jiangshan
      N_HIGH_MEMORY stands for nodes that have normal or high memory.
      N_MEMORY stands for nodes that have any memory.

      The code here needs to handle the nodes which have memory, so we
      should use N_MEMORY instead.
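      A hedged sketch of the pattern this patch applies (do_something() is
      a placeholder, not a real kernel function; for_each_node_state() is
      the stock nodemask iterator):

              int nid;

              /* before: visits only nodes with normal or high memory */
              for_each_node_state(nid, N_HIGH_MEMORY)
                      do_something(nid);

              /* after: visits every node that has any memory at all */
              for_each_node_state(nid, N_MEMORY)
                      do_something(nid);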
      Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
      Acked-by: Hillf Danton <dhillf@gmail.com>
      Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Lin Feng <linfeng@cn.fujitsu.com>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • memcontrol: use N_MEMORY instead N_HIGH_MEMORY · 31aaea4a
      Authored by Lai Jiangshan
      N_HIGH_MEMORY stands for nodes that have normal or high memory.
      N_MEMORY stands for nodes that have any memory.

      The code here needs to handle the nodes which have memory, so we
      should use N_MEMORY instead.
      Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
      Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Hillf Danton <dhillf@gmail.com>
      Cc: Lin Feng <linfeng@cn.fujitsu.com>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: use migrate_prep() instead of migrate_prep_local() · be49a6e1
      Authored by Marek Szyprowski
      __alloc_contig_migrate_range() should use all possible ways to get all
      the pages migrated from the given memory range, so pruning per-cpu lru
      lists for all CPUs is required, regardless of the cost of such an
      operation.  Otherwise some pages which got stuck on a per-cpu lru list
      might be missed by the migration procedure, causing the contiguous
      allocation to fail.
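      For context, the two helpers differ only in how widely they drain the
      per-cpu lru lists; roughly, as in mm/migrate.c of this era:

              int migrate_prep(void)
              {
                      /* drain the per-cpu lru lists on all CPUs */
                      lru_add_drain_all();
                      return 0;
              }

              int migrate_prep_local(void)
              {
                      /* drain only the calling CPU's lru lists */
                      lru_add_drain();
                      return 0;
              }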
      Reported-by: SeongHwan Yoon <sunghwan.yun@samsung.com>
      Signed-off-by: Marek Szyprowski <m.szyprowski@samsung.com>
      Signed-off-by: Kyungmin Park <kyungmin.park@samsung.com>
      Acked-by: Michal Nazarewicz <mina86@mina86.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: compaction: Fix compiler warning · c8bf2d8b
      Authored by Thierry Reding
      compact_capture_page() is only used if compaction is enabled, so it
      should be moved into the corresponding #ifdef.
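      A sketch of the fix (function body elided): keeping the helper inside
      the same conditional block as its only caller silences the
      unused-function warning whenever compaction is disabled.

              #ifdef CONFIG_COMPACTION
              static struct page *compact_capture_page(struct compact_control *cc)
              {
                      /* ... */
              }
              #endif /* CONFIG_COMPACTION */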
      Signed-off-by: Thierry Reding <thierry.reding@avionic-design.de>
      Acked-by: Mel Gorman <mgorman@suse.de>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • thp: avoid race on multiple parallel page faults to the same page · 3ea41e62
      Authored by Kirill A. Shutemov
      The pmd value is stable only with mm->page_table_lock held.  After
      taking the lock we need to check that nobody has modified the pmd
      before changing it.
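      A hedged sketch of the lock-and-recheck pattern the fix applies
      (haddr and entry stand in for the fault address and the new pmd
      value; error handling elided):

              spin_lock(&mm->page_table_lock);
              if (!pmd_none(*pmd)) {
                      /* somebody populated the pmd while we slept: back off */
                      spin_unlock(&mm->page_table_lock);
                      goto out;
              }
              set_pmd_at(mm, haddr, pmd, entry);
              spin_unlock(&mm->page_table_lock);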
      Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Jiri Slaby <jslaby@suse.cz>
      Cc: David Rientjes <rientjes@google.com>
      Reviewed-by: Bob Liu <lliubbo@gmail.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • thp: introduce sysfs knob to disable huge zero page · 79da5407
      Authored by Kirill A. Shutemov
      By default the kernel tries to use the huge zero page on read page
      faults.  It's possible to disable the huge zero page by writing 0, or
      enable it back by writing 1:

      echo 0 >/sys/kernel/mm/transparent_hugepage/use_zero_page
      echo 1 >/sys/kernel/mm/transparent_hugepage/use_zero_page
      Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: "H. Peter Anvin" <hpa@linux.intel.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • thp, vmstat: implement HZP_ALLOC and HZP_ALLOC_FAILED events · d8a8e1f0
      Authored by Kirill A. Shutemov
      hzp_alloc is incremented every time a huge zero page is successfully
      	allocated.  It includes allocations which were dropped due to a
      	race with another allocation.  Note, it doesn't count every map
      	of the huge zero page, only its allocation.

      hzp_alloc_failed is incremented if the kernel fails to allocate a
      	huge zero page and falls back to using small pages.
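      A minimal sketch of how such counters are typically bumped, assuming
      the standard count_vm_event() helper and the event names above
      (zero_page is a placeholder for the allocation result):

              if (zero_page)
                      count_vm_event(HZP_ALLOC);
              else
                      count_vm_event(HZP_ALLOC_FAILED);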
      Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: "H. Peter Anvin" <hpa@linux.intel.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • thp: implement refcounting for huge zero page · 97ae1749
      Authored by Kirill A. Shutemov
      H. Peter Anvin doesn't like a huge zero page which sticks in memory
      forever after the first allocation.  Here's an implementation of
      lockless refcounting for the huge zero page.

      We have two basic primitives: {get,put}_huge_zero_page().  They
      manipulate a reference counter.

      If the counter is 0, get_huge_zero_page() allocates a new huge page
      and takes two references: one for the caller and one for the shrinker.
      We free the page only in the shrinker callback if the counter is 1
      (only the shrinker holds a reference).

      put_huge_zero_page() only decrements the counter.  The counter is
      never zero in put_huge_zero_page() since the shrinker holds one
      reference.

      Freeing the huge zero page in the shrinker callback helps to avoid
      frequent allocate-free cycles.

      Refcounting has a cost.  On a 4-socket machine I observe a ~1%
      slowdown on parallel (40 processes) read page faulting compared to
      lazy huge page allocation.  I think that's pretty reasonable for a
      synthetic benchmark.
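      A simplified, hedged sketch of the scheme described above
      (alloc_zeroed_huge_page() is a placeholder for the real allocation;
      the shrinker side is omitted):

              static struct page *huge_zero_page;
              static atomic_t huge_zero_refcount = ATOMIC_INIT(0);

              static struct page *get_huge_zero_page(void)
              {
                      /* fast path: the page already exists, take a ref */
                      if (atomic_inc_not_zero(&huge_zero_refcount))
                              return huge_zero_page;

                      /* slow path: allocate and publish with two refs,
                       * one for the caller and one for the shrinker */
                      huge_zero_page = alloc_zeroed_huge_page();
                      atomic_set(&huge_zero_refcount, 2);
                      return huge_zero_page;
              }

              static void put_huge_zero_page(void)
              {
                      /* never drops to zero here: the shrinker always
                       * holds the last reference */
                      BUG_ON(atomic_dec_and_test(&huge_zero_refcount));
              }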
      
      [lliubbo@gmail.com: fix mismerge]
      Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: "H. Peter Anvin" <hpa@linux.intel.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: Bob Liu <lliubbo@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • thp: lazy huge zero page allocation · 78ca0e67
      Authored by Kirill A. Shutemov
      Instead of allocating the huge zero page in hugepage_init() we can
      postpone it until the first huge zero page map.  This saves memory if
      THP is not in use.

      cmpxchg() is used to avoid a race on huge_zero_pfn initialization.
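      A hedged sketch of the publish-once pattern: whichever thread wins
      the cmpxchg() installs its page, and the losers free their copy.

              static unsigned long huge_zero_pfn __read_mostly;

              static unsigned long get_huge_zero_pfn(void)
              {
                      struct page *page;

                      if (likely(huge_zero_pfn))
                              return huge_zero_pfn;

                      page = alloc_pages(GFP_TRANSHUGE | __GFP_ZERO,
                                         HPAGE_PMD_ORDER);
                      if (!page)
                              return 0;
                      /* lost the race: drop ours, use the winner's page */
                      if (cmpxchg(&huge_zero_pfn, 0, page_to_pfn(page)))
                              __free_pages(page, HPAGE_PMD_ORDER);
                      return huge_zero_pfn;
              }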
      Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: "H. Peter Anvin" <hpa@linux.intel.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • thp: setup huge zero page on non-write page fault · 80371957
      Authored by Kirill A. Shutemov
      All code paths seem covered.  Now we can map the huge zero page on
      read page faults.

      We set it up in do_huge_pmd_anonymous_page() if the area around the
      fault address is suitable for THP and we've got a read page fault.

      If we fail to set up the huge zero page (ENOMEM) we fall back to
      handle_pte_fault() as we normally do in THP.
      Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: "H. Peter Anvin" <hpa@linux.intel.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • thp: implement splitting pmd for huge zero page · c5a647d0
      Authored by Kirill A. Shutemov
      We can't split the huge zero page itself (and it's a bug if we try),
      but we can split the pmd which points to it.

      On splitting the pmd we create a page table with all ptes set to the
      normal zero page.
      
      [akpm@linux-foundation.org: fix build error]
      Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: "H. Peter Anvin" <hpa@linux.intel.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • thp: change split_huge_page_pmd() interface · e180377f
      Authored by Kirill A. Shutemov
      Pass the vma instead of the mm and add an address parameter.

      In most cases we already have the vma on the stack.  We provide
      split_huge_page_pmd_mm() for the few cases when we have the mm but
      not the vma.

      This change is preparation for the huge zero pmd splitting
      implementation.
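      Expressed as signatures (hedged from the description above):

              /* before */
              void split_huge_page_pmd(struct mm_struct *mm, pmd_t *pmd);

              /* after */
              void split_huge_page_pmd(struct vm_area_struct *vma,
                                       unsigned long address, pmd_t *pmd);
              void split_huge_page_pmd_mm(struct mm_struct *mm,
                                          unsigned long address, pmd_t *pmd);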
      Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: "H. Peter Anvin" <hpa@linux.intel.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • thp: change_huge_pmd(): make sure we don't try to make a page writable · cad7f613
      Authored by Kirill A. Shutemov
      The mprotect core never tries to make a page writable using
      change_huge_pmd().  Let's add an assert that this assumption holds.
      It's important to be sure we will not make the huge zero page
      writable.
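      A hedged sketch of the assertion, placed where change_huge_pmd()
      builds the new pmd value:

              entry = pmd_modify(entry, newprot);
              /* mprotect must never create write permission here */
              BUG_ON(pmd_write(entry));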
      Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: "H. Peter Anvin" <hpa@linux.intel.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • thp: do_huge_pmd_wp_page(): handle huge zero page · 93b4796d
      Authored by Kirill A. Shutemov
      On write access to the huge zero page we allocate a new huge page and
      clear it.

      If ENOMEM, graceful fallback: we create a new pmd table and set the
      pte around the fault address to a newly allocated normal (4k) page.
      All other ptes in the pmd are set to the normal zero page.
      Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: "H. Peter Anvin" <hpa@linux.intel.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Acked-by: David Rientjes <rientjes@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • thp: copy_huge_pmd(): copy huge zero page · fc9fe822
      Authored by Kirill A. Shutemov
      It's easy to copy the huge zero page: just set the destination pmd to
      the huge zero page.

      It's safe to copy the huge zero page since we have none yet :-p
      
      [rientjes@google.com: fix comment]
      Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: "H. Peter Anvin" <hpa@linux.intel.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Signed-off-by: David Rientjes <rientjes@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • thp: zap_huge_pmd(): zap huge zero pmd · 479f0abb
      Authored by Kirill A. Shutemov
      We don't have a mapped page to zap in the huge zero page case.  Let's
      just clear the pmd and remove it from the tlb.
      Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: "H. Peter Anvin" <hpa@linux.intel.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Acked-by: David Rientjes <rientjes@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • thp: huge zero page: basic preparation · 4a6c1297
      Authored by Kirill A. Shutemov
      During testing I noticed big (up to 2.5 times) memory consumption
      overhead on some workloads (e.g. ft.A from NPB) if THP is enabled.

      The main reason for that big difference is the lack of a zero page in
      the THP case.  We have to allocate a real page on read page fault.
      
      A program to demonstrate the issue:
      #include <assert.h>
      #include <stdlib.h>
      #include <unistd.h>

      #define MB (1024 * 1024)

      int main(int argc, char **argv)
      {
              char *p;
              int i;

              /* check the allocation instead of ignoring its result */
              assert(posix_memalign((void **)&p, 2 * MB, 200 * MB) == 0);
              for (i = 0; i < 200 * MB; i += 4096)
                      assert(p[i] == 0);
              pause();
              return 0;
      }
      
      With thp-never RSS is about 400k, but with thp-always it's 200M.
      After the patchset thp-always RSS is 400k too.
      
      Design overview.
      
      Huge zero page (hzp) is a non-movable huge page (2M on x86-64) filled
      with zeros.  The way we allocate it changes over the patchset:

      - [01/10] simplest way: hzp allocated at boot time in hugepage_init();
      - [09/10] lazy allocation on first use;
      - [10/10] lockless refcounting + shrinker-reclaimable hzp.
      
      We set it up in do_huge_pmd_anonymous_page() if the area around the
      fault address is suitable for THP and we've got a read page fault.  If
      we fail to set up the hzp (ENOMEM) we fall back to handle_pte_fault()
      as we normally do in THP.
      
      On a wp fault to the hzp we allocate real memory for the huge page
      and clear it.  If ENOMEM, graceful fallback: we create a new pmd table
      and set the pte around the fault address to a newly allocated normal
      (4k) page.  All other ptes in the pmd are set to the normal zero page.
      
      We cannot split the hzp (and it's a bug if we try), but we can split
      the pmd which points to it.  On splitting the pmd we create a table
      with all ptes set to the normal zero page.
      
      ===
      
      By hpa's request I've tried an alternative approach for the hzp
      implementation (see the virtual huge zero page patchset): a pmd table
      with all entries set to the zero page.  This way should be more cache
      friendly, but it increases TLB pressure.

      The problem with the virtual huge zero page: it requires per-arch
      enabling.  We need a way to mark that a pmd table has all ptes set to
      the zero page.
      
      Some numbers comparing the two implementations (on a 4-socket
      Westmere-EX):
      
      Microbenchmark1
      ===============
      
      test:
              posix_memalign((void **)&p, 2 * MB, 8 * GB);
              for (i = 0; i < 100; i++) {
                      assert(memcmp(p, p + 4*GB, 4*GB) == 0);
                      asm volatile ("": : :"memory");
              }
      
      hzp:
       Performance counter stats for './test_memcmp' (5 runs):
      
            32356.272845 task-clock                #    0.998 CPUs utilized            ( +-  0.13% )
                      40 context-switches          #    0.001 K/sec                    ( +-  0.94% )
                       0 CPU-migrations            #    0.000 K/sec
                   4,218 page-faults               #    0.130 K/sec                    ( +-  0.00% )
          76,712,481,765 cycles                    #    2.371 GHz                      ( +-  0.13% ) [83.31%]
          36,279,577,636 stalled-cycles-frontend   #   47.29% frontend cycles idle     ( +-  0.28% ) [83.35%]
           1,684,049,110 stalled-cycles-backend    #    2.20% backend  cycles idle     ( +-  2.96% ) [66.67%]
         134,355,715,816 instructions              #    1.75  insns per cycle
                                                   #    0.27  stalled cycles per insn  ( +-  0.10% ) [83.35%]
          13,526,169,702 branches                  #  418.039 M/sec                    ( +-  0.10% ) [83.31%]
               1,058,230 branch-misses             #    0.01% of all branches          ( +-  0.91% ) [83.36%]
      
            32.413866442 seconds time elapsed                                          ( +-  0.13% )
      
      vhzp:
       Performance counter stats for './test_memcmp' (5 runs):
      
            30327.183829 task-clock                #    0.998 CPUs utilized            ( +-  0.13% )
                      38 context-switches          #    0.001 K/sec                    ( +-  1.53% )
                       0 CPU-migrations            #    0.000 K/sec
                   4,218 page-faults               #    0.139 K/sec                    ( +-  0.01% )
          71,964,773,660 cycles                    #    2.373 GHz                      ( +-  0.13% ) [83.35%]
          31,191,284,231 stalled-cycles-frontend   #   43.34% frontend cycles idle     ( +-  0.40% ) [83.32%]
             773,484,474 stalled-cycles-backend    #    1.07% backend  cycles idle     ( +-  6.61% ) [66.67%]
         134,982,215,437 instructions              #    1.88  insns per cycle
                                                   #    0.23  stalled cycles per insn  ( +-  0.11% ) [83.32%]
          13,509,150,683 branches                  #  445.447 M/sec                    ( +-  0.11% ) [83.34%]
               1,017,667 branch-misses             #    0.01% of all branches          ( +-  1.07% ) [83.32%]
      
            30.381324695 seconds time elapsed                                          ( +-  0.13% )
      
      Microbenchmark2
      ===============
      
      test:
              posix_memalign((void **)&p, 2 * MB, 8 * GB);
              for (i = 0; i < 1000; i++) {
                      char *_p = p;
                      while (_p < p+4*GB) {
                              assert(*_p == *(_p+4*GB));
                              _p += 4096;
                              asm volatile ("": : :"memory");
                      }
              }
      
      hzp:
       Performance counter stats for 'taskset -c 0 ./test_memcmp2' (5 runs):
      
             3505.727639 task-clock                #    0.998 CPUs utilized            ( +-  0.26% )
                       9 context-switches          #    0.003 K/sec                    ( +-  4.97% )
                   4,384 page-faults               #    0.001 M/sec                    ( +-  0.00% )
           8,318,482,466 cycles                    #    2.373 GHz                      ( +-  0.26% ) [33.31%]
           5,134,318,786 stalled-cycles-frontend   #   61.72% frontend cycles idle     ( +-  0.42% ) [33.32%]
           2,193,266,208 stalled-cycles-backend    #   26.37% backend  cycles idle     ( +-  5.51% ) [33.33%]
           9,494,670,537 instructions              #    1.14  insns per cycle
                                                   #    0.54  stalled cycles per insn  ( +-  0.13% ) [41.68%]
           2,108,522,738 branches                  #  601.451 M/sec                    ( +-  0.09% ) [41.68%]
                 158,746 branch-misses             #    0.01% of all branches          ( +-  1.60% ) [41.71%]
           3,168,102,115 L1-dcache-loads           #  903.693 M/sec                    ( +-  0.11% ) [41.70%]
           1,048,710,998 L1-dcache-misses          #   33.10% of all L1-dcache hits    ( +-  0.11% ) [41.72%]
           1,047,699,685 LLC-load                  #  298.854 M/sec                    ( +-  0.03% ) [33.38%]
                   2,287 LLC-misses                #    0.00% of all LL-cache hits     ( +-  8.27% ) [33.37%]
           3,166,187,367 dTLB-loads                #  903.147 M/sec                    ( +-  0.02% ) [33.35%]
               4,266,538 dTLB-misses               #    0.13% of all dTLB cache hits   ( +-  0.03% ) [33.33%]
      
             3.513339813 seconds time elapsed                                          ( +-  0.26% )
      
      vhzp:
       Performance counter stats for 'taskset -c 0 ./test_memcmp2' (5 runs):
      
            27313.891128 task-clock                #    0.998 CPUs utilized            ( +-  0.24% )
                      62 context-switches          #    0.002 K/sec                    ( +-  0.61% )
                   4,384 page-faults               #    0.160 K/sec                    ( +-  0.01% )
          64,747,374,606 cycles                    #    2.370 GHz                      ( +-  0.24% ) [33.33%]
          61,341,580,278 stalled-cycles-frontend   #   94.74% frontend cycles idle     ( +-  0.26% ) [33.33%]
          56,702,237,511 stalled-cycles-backend    #   87.57% backend  cycles idle     ( +-  0.07% ) [33.33%]
          10,033,724,846 instructions              #    0.15  insns per cycle
                                                   #    6.11  stalled cycles per insn  ( +-  0.09% ) [41.65%]
           2,190,424,932 branches                  #   80.195 M/sec                    ( +-  0.12% ) [41.66%]
               1,028,630 branch-misses             #    0.05% of all branches          ( +-  1.50% ) [41.66%]
           3,302,006,540 L1-dcache-loads           #  120.891 M/sec                    ( +-  0.11% ) [41.68%]
             271,374,358 L1-dcache-misses          #    8.22% of all L1-dcache hits    ( +-  0.04% ) [41.66%]
              20,385,476 LLC-load                  #    0.746 M/sec                    ( +-  1.64% ) [33.34%]
                  76,754 LLC-misses                #    0.38% of all LL-cache hits     ( +-  2.35% ) [33.34%]
           3,309,927,290 dTLB-loads                #  121.181 M/sec                    ( +-  0.03% ) [33.34%]
           2,098,967,427 dTLB-misses               #   63.41% of all dTLB cache hits   ( +-  0.03% ) [33.34%]
      
            27.364448741 seconds time elapsed                                          ( +-  0.24% )
      
      ===
      
      I personally prefer the implementation present in this patchset.  It
      doesn't touch arch-specific code.
      
      This patch:
      
      Huge zero page (hzp) is a non-movable huge page (2M on x86-64) filled
      with zeros.

      For now let's allocate the page in hugepage_init().  We'll switch to
      lazy allocation later.
      
      We are not going to map the huge zero page until we can handle it properly
      on all code paths.
      
      The is_huge_zero_{pfn,pmd}() functions will be used by the following
      patches to check whether a pfn/pmd is the huge zero page.
      Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: "H. Peter Anvin" <hpa@linux.intel.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Acked-by: David Rientjes <rientjes@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • bootmem: remove alloc_arch_preferred_bootmem() · 3f7dfe24
      Authored by Joonsoo Kim
      The name of this function is not suitable; removing the function and
      open-coding it into each call site makes the code more understandable.

      Additionally, we shouldn't allocate from bootmem when
      slab_is_available(), so directly return kmalloc()'s return value.
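      A hedged sketch of the open-coded check at each call site (size and
      the bootmem fallback are placeholders):

              if (slab_is_available())
                      return kzalloc(size, GFP_NOWAIT);
              /* ... otherwise fall through to the bootmem allocator ... */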
      Signed-off-by: Joonsoo Kim <js1304@gmail.com>
      Cc: Haavard Skinnemoen <hskinnemoen@gmail.com>
      Cc: Hans-Christian Egtvedt <egtvedt@samfundet.no>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • bootmem: remove not implemented function call, bootmem_arch_preferred_node() · 2d7a6956
      Authored by Joonsoo Kim
      There is no implementation of bootmem_arch_preferred_node() and a call to
      this function will cause a compilation error.  So remove it.
      Signed-off-by: Joonsoo Kim <js1304@gmail.com>
      Cc: Haavard Skinnemoen <hskinnemoen@gmail.com>
      Cc: Hans-Christian Egtvedt <egtvedt@samfundet.no>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  2. 12 December 2012 (21 commits)
    • memory_hotplug: ensure every online node has NORMAL memory · 74d42d8f
      Authored by Lai Jiangshan
      The old memory hotplug code and the new online/movable mode may leave
      an online node without any normal memory, but memory management
      behaves badly when we have online nodes that lack normal memory.  For
      example, a task bound to such a node may fail every kernel allocation
      and thus be unable to create a new task or any other kernel object.

      So disable onlining of nodes without normal memory here; we will
      enable it once we are prepared.
      Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
      Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
      Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
      Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
      Cc: Jiang Liu <jiang.liu@huawei.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Minchan Kim <minchan.kim@gmail.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Yinghai Lu <yinghai@kernel.org>
      Cc: Rusty Russell <rusty@rustcorp.com.au>
      Cc: Greg KH <greg@kroah.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • memory_hotplug: handle empty zone when online_movable/online_kernel · e455a9b9
      Authored by Lai Jiangshan
      Make online_movable/online_kernel able to empty a zone, or to move
      memory into an empty zone.
      Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
      Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
      Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
      Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
      Cc: Jiang Liu <jiang.liu@huawei.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Minchan Kim <minchan.kim@gmail.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Yinghai Lu <yinghai@kernel.org>
      Cc: Rusty Russell <rusty@rustcorp.com.au>
      Cc: Greg KH <greg@kroah.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm, memory-hotplug: dynamic configure movable memory and portion memory · 511c2aba
      Authored by Lai Jiangshan
      Add online_movable and online_kernel for logical memory hotplug.
      This is the dynamic version of "movablecore" & "kernelcore".

      We have the same reasons to introduce it as for "movablecore" &
      "kernelcore", but it is dynamic/run-time:

      o We can configure memory as kernelcore or movablecore after boot.

        If the userspace workload increases and we need more hugepages, we
        can use "online_movable" to add memory and let the system use more
        THP (transparent huge pages); vice versa when the kernel workload
        increases.

        This also helps virtualization dynamically configure host/guest
        memory and reduce wasted memory.

        Memory capacity on demand.

      o When a new node is physically onlined after boot, we need to use
        "online_movable" or "online_kernel" to configure/partition it as we
        expect when we logically online it.

        This configuration also helps physical memory migration.

      o All the benefits of the existing "movablecore" & "kernelcore".

      o Preparation for movable-node, which is very important for power
        saving, hardware partitioning and high-availability systems
        (hardware fault management).

      (Note, we don't introduce movable-node here.)

      Behavior:
      When a memory block/section is onlined by "online_movable", the
      kernel will have no direct references to pages of that memory block,
      so we can remove that memory at any time when needed.

      When it is onlined by "online_kernel", the kernel can use it.
      When it is onlined by "online", the zone type doesn't change.

      Current constraints:
      Only a memory block which is adjacent to ZONE_MOVABLE can be onlined
      from ZONE_NORMAL to ZONE_MOVABLE.
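      Usage goes through the existing memory-block state files in sysfs;
      the block number below is purely illustrative:

      echo online_movable >/sys/devices/system/memory/memory8/state
      echo online_kernel >/sys/devices/system/memory/memory8/state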
      
      [akpm@linux-foundation.org: use min_t, cleanups]
      Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
      Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
      Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
      Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
      Cc: Jiang Liu <jiang.liu@huawei.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Minchan Kim <minchan.kim@gmail.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Yinghai Lu <yinghai@kernel.org>
      Cc: Rusty Russell <rusty@rustcorp.com.au>
      Cc: Greg KH <greg@kroah.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • bootmem: fix wrong call parameter for free_bootmem() · 81df9bff
      Authored by Joonsoo Kim
      It is strange that alloc_bootmem() returns a virtual address while
      free_bootmem() requires a physical address.  In any case,
      free_bootmem()'s first parameter should be a physical address.

      There are some call sites that pass free_bootmem() a virtual address,
      so fix them.
      
      [akpm@linux-foundation.org: improve free_bootmem() and free_bootmem_late() documentation]
      Signed-off-by: Joonsoo Kim <js1304@gmail.com>
      Cc: Haavard Skinnemoen <hskinnemoen@gmail.com>
      Cc: Hans-Christian Egtvedt <egtvedt@samfundet.no>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: cma: remove watermark hacks · bc357f43
      Authored by Marek Szyprowski
      Commits 2139cbe6 ("cma: fix counting of isolated pages") and
      d95ea5d1 ("cma: fix watermark checking") introduced a reliable
      method of free page accounting when memory is being allocated from
      CMA regions, so the workaround introduced earlier by commit 49f223a9
      ("mm: trigger page reclaim in alloc_contig_range() to stabilise
      watermarks") can finally be removed.
      Signed-off-by: Marek Szyprowski <m.szyprowski@samsung.com>
      Cc: Kyungmin Park <kyungmin.park@samsung.com>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Acked-by: Michal Nazarewicz <mina86@mina86.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Bartlomiej Zolnierkiewicz <b.zolnierkie@samsung.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: cma: skip watermarks check for already isolated blocks in split_free_page() · 2e30abd1
      Authored by Marek Szyprowski
      Since commit 2139cbe6 ("cma: fix counting of isolated pages"), free
      pages in isolated pageblocks are not accounted to the NR_FREE_PAGES
      counter, so the watermark check is not required when operating on a
      free page in an isolated pageblock.
      Signed-off-by: Marek Szyprowski <m.szyprowski@samsung.com>
      Cc: Kyungmin Park <kyungmin.park@samsung.com>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Acked-by: Michal Nazarewicz <mina86@mina86.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Bartlomiej Zolnierkiewicz <b.zolnierkie@samsung.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm, oom: fix race when specifying a thread as the oom origin · e1e12d2f
      Authored by David Rientjes
      test_set_oom_score_adj() and compare_swap_oom_score_adj() are used to
      specify that current should be killed first if an oom condition occurs in
      between the two calls.
      
      The usage is
      
      	short oom_score_adj = test_set_oom_score_adj(OOM_SCORE_ADJ_MAX);
      	...
      	compare_swap_oom_score_adj(OOM_SCORE_ADJ_MAX, oom_score_adj);
      
      to store the thread's oom_score_adj, temporarily change it to the maximum
      score possible, and then restore the old value if it is still the same.
      
      This happens to still be racy, however, if the user writes
      OOM_SCORE_ADJ_MAX to /proc/pid/oom_score_adj in between the two calls.
      The compare_swap_oom_score_adj() will then incorrectly reset the old value
      prior to the write of OOM_SCORE_ADJ_MAX.
      
      To fix this, introduce a new oom_flags_t member in struct signal_struct
      that will be used for per-thread oom killer flags.  KSM and swapoff can
      now use a bit in this member to specify that threads should be killed
      first in oom conditions without playing around with oom_score_adj.
      
      This also allows the correct oom_score_adj to always be shown when reading
      /proc/pid/oom_score.
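      Roughly, the helpers described above might look like this (a hedged
      sketch, not the verbatim kernel code):

              static inline void set_current_oom_origin(void)
              {
                      current->signal->oom_flags |= OOM_FLAG_ORIGIN;
              }

              static inline void clear_current_oom_origin(void)
              {
                      current->signal->oom_flags &= ~OOM_FLAG_ORIGIN;
              }

              static inline bool oom_task_origin(const struct task_struct *p)
              {
                      return !!(p->signal->oom_flags & OOM_FLAG_ORIGIN);
              }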
      Signed-off-by: David Rientjes <rientjes@google.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Reviewed-by: Michal Hocko <mhocko@suse.cz>
      Cc: Anton Vorontsov <anton.vorontsov@linaro.org>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm, oom: change type of oom_score_adj to short · a9c58b90
      Authored by David Rientjes
      The maximum oom_score_adj is 1000 and the minimum oom_score_adj is -1000,
      so this range can be represented by the signed short type with no
      functional change.  The extra space this frees up in struct signal_struct
      will be used for per-thread oom kill flags in the next patch.
      Signed-off-by: David Rientjes <rientjes@google.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Reviewed-by: Michal Hocko <mhocko@suse.cz>
      Cc: Anton Vorontsov <anton.vorontsov@linaro.org>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm, mempolicy: remove duplicate code · 212a0a6f
      Authored by David Rientjes
      Remove some duplicate code and simplify alloc_pages_vma().  No functional
      change.
      Signed-off-by: David Rientjes <rientjes@google.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/vmscan.c: try_to_freeze() returns boolean · 6f6313d4
      Authored by Jeff Liu
      kswapd()->try_to_freeze() is defined to return a boolean, so it's better
      to use a bool to hold its return value.
      Signed-off-by: Jie Liu <jeff.liu@oracle.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: introduce putback_movable_pages() · 5733c7d1
      Authored by Rafael Aquini
      The patch "mm: introduce compaction and migration for virtio
      ballooned pages" hacks around putback_lru_pages() in order to allow
      ballooned pages to be re-inserted on the balloon page list as if a
      ballooned page were an LRU page.

      As ballooned pages are not legitimate LRU pages, this patch introduces
      putback_movable_pages() to properly cope with cases where the isolated
      pageset contains both ballooned pages and LRU pages, thus fixing the
      mentioned inelegant hack around putback_lru_pages().
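      A hedged sketch of the distinction putback_movable_pages() draws
      (statistics bookkeeping elided; page/next are the loop cursors):

              list_for_each_entry_safe(page, next, list, lru) {
                      list_del(&page->lru);
                      if (unlikely(balloon_page_movable(page)))
                              balloon_page_putback(page);
                      else
                              putback_lru_page(page);
              }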
      Signed-off-by: Rafael Aquini <aquini@redhat.com>
      Cc: Rusty Russell <rusty@rustcorp.com.au>
      Cc: "Michael S. Tsirkin" <mst@redhat.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: introduce compaction and migration for ballooned pages · bf6bddf1
      Authored by Rafael Aquini
      Memory fragmentation introduced by ballooning might significantly
      reduce the number of 2MB contiguous memory blocks that can be used
      within a guest, thus imposing performance penalties associated with
      the reduced number of transparent huge pages that could be used by the
      guest workload.

      This patch introduces the helper functions as well as the necessary
      changes to teach the compaction and migration bits how to cope with
      pages which are part of a guest memory balloon, in order to make them
      movable by memory compaction procedures.
      Signed-off-by: Rafael Aquini <aquini@redhat.com>
      Acked-by: Mel Gorman <mel@csn.ul.ie>
      Cc: Rusty Russell <rusty@rustcorp.com.au>
      Cc: "Michael S. Tsirkin" <mst@redhat.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: introduce a common interface for balloon pages mobility · 18468d93
      Authored by Rafael Aquini
      Memory fragmentation introduced by ballooning might significantly
      reduce the number of 2MB contiguous memory blocks that can be used
      within a guest, thus imposing performance penalties associated with
      the reduced number of transparent huge pages that could be used by the
      guest workload.

      This patch introduces a common interface to help a balloon driver make
      its page set movable for compaction, thus allowing the system to
      better leverage compaction efforts on memory defragmentation.
      
      [akpm@linux-foundation.org: use PAGE_FLAGS_CHECK_AT_PREP, s/__balloon_page_flags/page_flags_cleared/, small cleanups]
      [rientjes@google.com: allow balloon compaction for any system with memory compaction enabled, which is the defconfig]
      Signed-off-by: Rafael Aquini <aquini@redhat.com>
      Acked-by: Mel Gorman <mel@csn.ul.ie>
      Cc: Rusty Russell <rusty@rustcorp.com.au>
      Cc: "Michael S. Tsirkin" <mst@redhat.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Signed-off-by: David Rientjes <rientjes@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: adjust address_space_operations.migratepage() return code · 78bd5209
      Authored by Rafael Aquini
      Memory fragmentation introduced by ballooning might significantly
      reduce the number of 2MB contiguous memory blocks that can be used
      within a guest, thus imposing performance penalties associated with
      the reduced number of transparent huge pages that could be used by the
      guest workload.

      This patch set follows the main idea discussed at the 2012 LSF/MM
      session "Ballooning for transparent huge pages"
      (http://lwn.net/Articles/490114/) to introduce the required changes to
      the virtio_balloon driver, as well as the changes to the core
      compaction & migration bits, in order to make those subsystems aware
      of ballooned pages and allow memory balloon pages to become movable
      within a guest, thus avoiding the aforementioned fragmentation issue.
      
      Following are numbers that show the benefit of this patch by allowing
      compaction to be more effective on memory-ballooned guests.

      Results for the STRESS-HIGHALLOC benchmark, from Mel Gorman's mmtests
      suite, running on a 4GB RAM KVM guest which was ballooning 512MB RAM
      in 64MB chunks, every minute (inflating/deflating), while the test was
      running:
      
      ===BEGIN stress-highalloc
      
      STRESS-HIGHALLOC
                       highalloc-3.7     highalloc-3.7
                           rc4-clean         rc4-patch
      Pass 1          55.00 ( 0.00%)    62.00 ( 7.00%)
      Pass 2          54.00 ( 0.00%)    62.00 ( 8.00%)
      while Rested    75.00 ( 0.00%)    80.00 ( 5.00%)
      
      MMTests Statistics: duration
                       3.7         3.7
                 rc4-clean   rc4-patch
      User         1207.59     1207.46
      System       1300.55     1299.61
      Elapsed      2273.72     2157.06
      
      MMTests Statistics: vmstat
                                      3.7         3.7
                                rc4-clean   rc4-patch
      Page Ins                    3581516     2374368
      Page Outs                  11148692    10410332
      Swap Ins                         80          47
      Swap Outs                      3641         476
      Direct pages scanned          37978       33826
      Kswapd pages scanned        1828245     1342869
      Kswapd pages reclaimed      1710236     1304099
      Direct pages reclaimed        32207       31005
      Kswapd efficiency               93%         97%
      Kswapd velocity             804.077     622.546
      Direct efficiency               84%         91%
      Direct velocity              16.703      15.682
      Percentage direct scans          2%          2%
      Page writes by reclaim        79252        9704
      Page writes file              75611        9228
      Page writes anon               3641         476
      Page reclaim immediate        16764       11014
      Page rescued immediate            0           0
      Slabs scanned               2171904     2152448
      Direct inode steals             385        2261
      Kswapd inode steals          659137      609670
      Kswapd skipped wait               1          69
      THP fault alloc                 546         631
      THP collapse alloc              361         339
      THP splits                      259         263
      THP fault fallback               98          50
      THP collapse fail                20          17
      Compaction stalls               747         499
      Compaction success              244         145
      Compaction failures             503         354
      Compaction pages moved       370888      474837
      Compaction move failure       77378       65259
      
      ===END stress-highalloc
      
      This patch:

      Introduce MIGRATEPAGE_SUCCESS as the default return code for
      address_space_operations.migratepage() and document the expected
      return code for the same method in failure cases.
      Signed-off-by: Rafael Aquini <aquini@redhat.com>
      Cc: Rusty Russell <rusty@rustcorp.com.au>
      Cc: "Michael S. Tsirkin" <mst@redhat.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: vm_unmapped_area() lookup function · db4fbfb9
      Authored by Michel Lespinasse
      Implement vm_unmapped_area() using the rb_subtree_gap and
      highest_vm_end information to look up suitable virtual address space
      gaps.
      
      struct vm_unmapped_area_info is used to define the desired allocation
      request:
       - lowest or highest possible address matching the remaining constraints
       - desired gap length
       - low/high address limits that the gap must fit into
       - alignment mask and offset
      
      Also update the generic arch_get_unmapped_area[_topdown] functions to make
      use of vm_unmapped_area() instead of implementing a brute force search.
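      A sketch of a typical caller, using the defaults of the generic
      bottom-up search (len and the limits come from the caller):

              struct vm_unmapped_area_info info;

              info.flags = 0;     /* VM_UNMAPPED_AREA_TOPDOWN for top-down */
              info.length = len;
              info.low_limit = TASK_UNMAPPED_BASE;
              info.high_limit = TASK_SIZE;
              info.align_mask = 0;
              info.align_offset = 0;
              return vm_unmapped_area(&info);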
      
      [akpm@linux-foundation.org: checkpatch fixes]
      Signed-off-by: Michel Lespinasse <walken@google.com>
      Reviewed-by: Rik van Riel <riel@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Russell King <linux@arm.linux.org.uk>
      Cc: Ralf Baechle <ralf@linux-mips.org>
      Cc: Paul Mundt <lethal@linux-sh.org>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Chris Metcalf <cmetcalf@tilera.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: check rb_subtree_gap correctness · 5a0768f6
      Authored by Michel Lespinasse
      When CONFIG_DEBUG_VM_RB is enabled, check that rb_subtree_gap is correctly
      set for every vma and that mm->highest_vm_end is also correct.
      
      Also add an explicit 'bug' variable to track if browse_rb() detected any
      invalid condition.
      
      [akpm@linux-foundation.org: repair innovative coding-style inventions]
      Signed-off-by: Michel Lespinasse <walken@google.com>
      Reviewed-by: Rik van Riel <riel@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Russell King <linux@arm.linux.org.uk>
      Cc: Ralf Baechle <ralf@linux-mips.org>
      Cc: Paul Mundt <lethal@linux-sh.org>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Chris Metcalf <cmetcalf@tilera.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: augment vma rbtree with rb_subtree_gap · d3737187
      Authored by Michel Lespinasse
      Define vma->rb_subtree_gap as the largest gap between any vma in the
      subtree rooted at that vma and its predecessor.  Or, for a recursive
      definition, vma->rb_subtree_gap is the max of:
      
       - vma->vm_start - vma->vm_prev->vm_end
       - the rb_subtree_gap fields of the vmas pointed to by vma->rb.rb_left
         and vma->rb.rb_right
      
      This will allow get_unmapped_area_* to find a free area of the right
      size in O(log(N)) time, instead of potentially having to do a linear
      walk across all the VMAs.
      
      Also define mm->highest_vm_end as the vm_end field of the highest vma,
      so that we can easily check if the following gap is suitable.
      
      This does have the potential to make unmapping VMAs more expensive,
      especially for processes with very large numbers of VMAs, where the VMA
      rbtree can grow quite deep.
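      A hedged sketch of the recursive computation described above (the
      real helper lives in mm/mmap.c):

              static long vma_compute_subtree_gap(struct vm_area_struct *vma)
              {
                      unsigned long max, subtree_gap;

                      max = vma->vm_start;
                      if (vma->vm_prev)
                              max -= vma->vm_prev->vm_end;
                      if (vma->vm_rb.rb_left) {
                              subtree_gap = rb_entry(vma->vm_rb.rb_left,
                                      struct vm_area_struct, vm_rb)->rb_subtree_gap;
                              if (subtree_gap > max)
                                      max = subtree_gap;
                      }
                      if (vma->vm_rb.rb_right) {
                              subtree_gap = rb_entry(vma->vm_rb.rb_right,
                                      struct vm_area_struct, vm_rb)->rb_subtree_gap;
                              if (subtree_gap > max)
                                      max = subtree_gap;
                      }
                      return max;
              }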
      Signed-off-by: Michel Lespinasse <walken@google.com>
      Reviewed-by: Rik van Riel <riel@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Russell King <linux@arm.linux.org.uk>
      Cc: Ralf Baechle <ralf@linux-mips.org>
      Cc: Paul Mundt <lethal@linux-sh.org>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Chris Metcalf <cmetcalf@tilera.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: support more pagesizes for MAP_HUGETLB/SHM_HUGETLB · 42d7395f
      Authored by Andi Kleen
      There was some desire in large applications using MAP_HUGETLB or
      SHM_HUGETLB to use 1GB huge pages on some mappings, and stay with 2MB on
      others.  This is useful together with NUMA policy: use 2MB interleaving
      on some mappings, but 1GB on local mappings.
      
      This patch extends the IPC/SHM syscall interfaces slightly to allow
      specifying the page size.
      
      It borrows some upper bits in the existing flag arguments and allows
      encoding the log of the desired page size in addition to the *_HUGETLB
      flag.  When 0 is specified the default size is used, this makes the
      change fully compatible.
      
      Extending the internal hugetlb code to handle this is straightforward.
      Instead of a single mount it just keeps an array of them and selects the
      right mount based on the specified page size.  When no page size is
      specified it uses the mount of the default page size.
      
      The change is not visible in /proc/mounts because internal mounts don't
      appear there.  It also has very little overhead: the additional mounts
      just consume a super block, but not more memory when not used.
      
      I also exported the new flags to the user headers (they were previously
      under __KERNEL__).  Right now only symbols for x86 and some other
      architecture for 1GB and 2MB are defined.  The interface should already
      work for all other architectures though.  Only architectures that define
      multiple hugetlb sizes actually need it (that is currently x86, tile,
      powerpc).  However tile and powerpc have user-configurable hugetlb
      sizes, so it's not easy to add defines.  A program on those
      architectures would need to query sysfs and use the appropriate log2.
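      From the caller's side the encoding looks roughly like this, assuming
      the MAP_HUGE_SHIFT value (26) from the uapi headers; 21 and 30 are
      log2 of 2MB and 1GB respectively:

              #define MAP_HUGE_SHIFT 26

              p = mmap(NULL, size, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB |
                       (30 << MAP_HUGE_SHIFT),   /* use 1GB pages */
                       -1, 0);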
      
      [akpm@linux-foundation.org: cleanups]
      [rientjes@google.com: fix build]
      [akpm@linux-foundation.org: checkpatch fixes]
      Signed-off-by: Andi Kleen <ak@linux.intel.com>
      Cc: Michael Kerrisk <mtk.manpages@gmail.com>
      Acked-by: Rik van Riel <riel@redhat.com>
      Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Hillf Danton <dhillf@gmail.com>
      Signed-off-by: David Rientjes <rientjes@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: hwpoison: fix action_result() to print out dirty/clean · ff604cf6
      Authored by Naoya Horiguchi
      action_result() fails to print out "dirty" even if an error occurred
      on a dirty pagecache page, because by the time we check PageDirty in
      action_result() it has already been cleared by page isolation, even if
      it was dirty before error handling.  This can break some applications
      that monitor this message, so it should be fixed.

      There are several callers of action_result() besides page_action(),
      but none of them handle LRU pages; they deal with free pages or kernel
      pages, so we don't have to consider dirtiness for them.
      
      Note that PG_dirty can be set outside page locks, as described in
      commit 6746aff7 ("HWPOISON: shmem: call set_page_dirty() with locked
      page"), so this patch does not completely close the race window; it
      just narrows it.
      Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Reviewed-by: Andi Kleen <ak@linux.intel.com>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: "Jun'ichi Nomura" <j-nomura@ce.jp.nec.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • dmapool: make DMAPOOL_DEBUG detect corruption of free marker · 5de55b26
      Authored by Matthieu Castet
      This can help to catch the case where hardware is writing after dma free.
      
      [akpm@linux-foundation.org: tidy code, fix comment, use sizeof(page->offset), use pr_err()]
      Signed-off-by: Matthieu Castet <matthieu.castet@parrot.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm, oom: allow exiting threads to have access to memory reserves · 9ff4868e
      Authored by David Rientjes
      Exiting threads, those with PF_EXITING set, can pagefault and require
      memory before they can make forward progress.  This happens, for instance,
      when a process must fault task->robust_list, a userspace structure, before
      detaching its memory.
      
      These threads also aren't guaranteed to get access to memory reserves
      unless oom killed or killed from userspace.  The oom killer won't grant
      memory reserves if other threads are also exiting other than current and
      stalling at the same point.  This prevents needlessly killing processes
      when others are already exiting.
      
      Instead of special casing all the possible situations between PF_EXITING
      getting set and a thread detaching its mm where it may allocate memory,
      which probably wouldn't get updated when a change is made to the exit
      path, the solution is to give all exiting threads access to memory
      reserves if they call the oom killer.  This allows them to quickly
      allocate, detach their mm, and free the memory it represents.
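      A hedged sketch of the kind of check this adds to the oom-killer
      entry path:

              if (fatal_signal_pending(current) ||
                  current->flags & PF_EXITING) {
                      /* grant reserves and let the task exit quickly
                       * instead of selecting a victim to kill */
                      set_thread_flag(TIF_MEMDIE);
                      return;
              }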
      
      Summary of Luigi's bug report:
      
      : He had an oom condition where threads were faulting on task->robust_list
      : and repeatedly called the oom killer but it would defer killing a thread
      : because it saw other PF_EXITING threads.  This can happen anytime we need
      : to allocate memory after setting PF_EXITING and before detaching our mm;
      : if there are other threads in the same state then the oom killer won't do
      : anything unless one of them happens to be killed from userspace.
      :
      : So instead of only deferring for PF_EXITING and !task->robust_list, it's
      : better to just give them access to memory reserves to prevent a potential
      : livelock so that any other faults that may be introduced in the future in
      : the exit path don't cause the same problem (and hopefully we don't allow
      : too many of those!).
      Signed-off-by: David Rientjes <rientjes@google.com>
      Acked-by: Minchan Kim <minchan@kernel.org>
      Tested-by: Luigi Semenzato <semenzato@google.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>