1. 13 Dec 2012 (11 commits)
    • thp, vmstat: implement HZP_ALLOC and HZP_ALLOC_FAILED events · d8a8e1f0
      Committed by Kirill A. Shutemov
      hzp_alloc is incremented every time a huge zero page is successfully
      allocated. It includes allocations which were dropped due to a race
      with another allocation. Note, it doesn't count every map of the huge
      zero page, only its allocation.
      
      hzp_alloc_failed is incremented if the kernel fails to allocate the
      huge zero page and falls back to using small pages.
      Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: "H. Peter Anvin" <hpa@linux.intel.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
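
      For illustration, a minimal sketch of how such a vmstat event pair is
      typically wired up: new entries in the vm_event_item enum, bumped with
      count_vm_event() at the allocation site. This is not the literal diff;
      the helper name hzp_alloc_sketch and the GFP flags are assumptions.

      /* include/linux/vm_event_item.h (sketch) */
      enum vm_event_item {
              /* existing events precede these */
              HZP_ALLOC,              /* huge zero page successfully allocated */
              HZP_ALLOC_FAILED,       /* allocation failed; fall back to small pages */
              NR_VM_EVENT_ITEMS
      };

      /* mm/huge_memory.c (sketch): count at the allocation site. */
      static struct page *hzp_alloc_sketch(void)
      {
              struct page *zero_page = alloc_pages(GFP_KERNEL | __GFP_ZERO,
                                                   HPAGE_PMD_ORDER);
              if (!zero_page) {
                      count_vm_event(HZP_ALLOC_FAILED);
                      return NULL;
              }
              count_vm_event(HZP_ALLOC);      /* counts allocations, not maps */
              return zero_page;
      }
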
    • thp: implement refcounting for huge zero page · 97ae1749
      Committed by Kirill A. Shutemov
      H. Peter Anvin doesn't like a huge zero page which sticks in memory
      forever after the first allocation.  Here's an implementation of
      lockless refcounting for the huge zero page (sketched after this
      entry).

      We have two basic primitives: {get,put}_huge_zero_page(). They
      manipulate the reference counter.

      If the counter is 0, get_huge_zero_page() allocates a new huge page
      and takes two references: one for the caller and one for the shrinker.
      We free the page only in the shrinker callback, and only if the
      counter is 1 (i.e. only the shrinker holds a reference).

      put_huge_zero_page() only decrements the counter.  The counter is
      never zero in put_huge_zero_page() since the shrinker holds a
      reference.

      Freeing the huge zero page in the shrinker callback helps to avoid
      frequent allocate-free cycles.

      Refcounting has a cost.  On a 4-socket machine I observe a ~1%
      slowdown on parallel (40 processes) read page faulting compared to
      lazy huge page allocation.  I think that's pretty reasonable for a
      synthetic benchmark.
      
      [lliubbo@gmail.com: fix mismerge]
      Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: "H. Peter Anvin" <hpa@linux.intel.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: Bob Liu <lliubbo@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
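
      For illustration, a condensed sketch of the scheme described above.
      It is simplified: GFP flags, preemption handling and shrinker
      registration are pared down, and while {get,put}_huge_zero_page() are
      named by the commit, the shrinker helper name is an assumption.

      static unsigned long huge_zero_pfn __read_mostly;
      static atomic_t huge_zero_refcount;

      static unsigned long get_huge_zero_page(void)
      {
              struct page *zero_page;
      retry:
              /* Fast path: counter is non-zero, just take another reference. */
              if (likely(atomic_inc_not_zero(&huge_zero_refcount)))
                      return huge_zero_pfn;

              zero_page = alloc_pages(GFP_KERNEL | __GFP_ZERO, HPAGE_PMD_ORDER);
              if (!zero_page)
                      return 0;
              /* Publish the pfn; if a racing allocator won, drop ours and retry. */
              if (cmpxchg(&huge_zero_pfn, 0, page_to_pfn(zero_page))) {
                      __free_pages(zero_page, HPAGE_PMD_ORDER);
                      goto retry;
              }
              /* Two references: one for the caller, one held for the shrinker. */
              atomic_set(&huge_zero_refcount, 2);
              return huge_zero_pfn;
      }

      static void put_huge_zero_page(void)
      {
              /* Never reaches zero here: the shrinker holds the last reference. */
              BUG_ON(atomic_dec_and_test(&huge_zero_refcount));
      }

      /* Shrinker side (sketch): free only when holding the sole reference. */
      static void shrink_huge_zero_page_sketch(void)
      {
              if (atomic_cmpxchg(&huge_zero_refcount, 1, 0) == 1) {
                      unsigned long pfn = xchg(&huge_zero_pfn, 0);
                      __free_pages(pfn_to_page(pfn), HPAGE_PMD_ORDER);
              }
      }
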
    • thp: lazy huge zero page allocation · 78ca0e67
      Committed by Kirill A. Shutemov
      Instead of allocating the huge zero page in hugepage_init() we can
      postpone it until the first huge zero page map. This saves memory if
      THP is not in use.

      cmpxchg() is used to avoid a race on huge_zero_pfn initialization
      (see the sketch after this entry).
      Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: "H. Peter Anvin" <hpa@linux.intel.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
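
      A minimal sketch of the lazy, race-free initialization this commit
      describes (the refcounting from the previous entry is omitted; the
      helper name and GFP flags are assumptions):

      static unsigned long huge_zero_pfn __read_mostly;

      static bool init_huge_zero_pfn_sketch(void)
      {
              struct page *zero_page;

              zero_page = alloc_pages(GFP_KERNEL | __GFP_ZERO, HPAGE_PMD_ORDER);
              if (!zero_page)
                      return false;
              /* Another thread may have initialized it first: keep theirs. */
              if (cmpxchg(&huge_zero_pfn, 0, page_to_pfn(zero_page)))
                      __free_pages(zero_page, HPAGE_PMD_ORDER);
              return true;
      }
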
    • thp: setup huge zero page on non-write page fault · 80371957
      Committed by Kirill A. Shutemov
      All code paths seem to be covered. Now we can map the huge zero page
      on read page faults.

      We set it up in do_huge_pmd_anonymous_page() if the area around the
      fault address is suitable for THP and we've got a read page fault.

      If we fail to set up the huge zero page (ENOMEM) we fall back to
      handle_pte_fault() as we normally do in THP.
      Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: "H. Peter Anvin" <hpa@linux.intel.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
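
      A condensed sketch of mapping the huge zero page on a read fault
      (assumed helper name; pgtable deposit, locking and accounting are
      elided):

      static void set_huge_zero_page_sketch(struct mm_struct *mm,
                      struct vm_area_struct *vma, unsigned long haddr,
                      pmd_t *pmd)
      {
              pmd_t entry;

              entry = pfn_pmd(huge_zero_pfn, vma->vm_page_prot);
              entry = pmd_wrprotect(entry);   /* never writable; wp fault does COW */
              entry = pmd_mkhuge(entry);
              set_pmd_at(mm, haddr, pmd, entry);
      }
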
    • thp: implement splitting pmd for huge zero page · c5a647d0
      Committed by Kirill A. Shutemov
      We can't split the huge zero page itself (and it's a bug if we try),
      but we can split the pmd which points to it.

      On splitting the pmd we create a page table with all ptes set to the
      normal zero page.
      
      [akpm@linux-foundation.org: fix build error]
      Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: "H. Peter Anvin" <hpa@linux.intel.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
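
      A condensed sketch of the split (assumed helper name; locking and the
      deposited-pgtable bookkeeping are elided):

      static void split_huge_zero_page_pmd_sketch(struct vm_area_struct *vma,
                      unsigned long haddr, pmd_t *pmd, pgtable_t pgtable)
      {
              struct mm_struct *mm = vma->vm_mm;
              pmd_t _pmd;
              int i;

              pmdp_clear_flush(vma, haddr, pmd);      /* leave the pmd empty */
              pmd_populate(mm, &_pmd, pgtable);

              for (i = 0; i < HPAGE_PMD_NR; i++, haddr += PAGE_SIZE) {
                      pte_t *pte, entry;
                      /* every pte points at the normal (4k) zero page */
                      entry = pfn_pte(my_zero_pfn(haddr), vma->vm_page_prot);
                      entry = pte_mkspecial(entry);
                      pte = pte_offset_map(&_pmd, haddr);
                      set_pte_at(mm, haddr, pte, entry);
                      pte_unmap(pte);
              }
              smp_wmb();      /* make the ptes visible before the pmd */
              pmd_populate(mm, pmd, pgtable);
      }
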
    • thp: change split_huge_page_pmd() interface · e180377f
      Committed by Kirill A. Shutemov
      Pass the vma instead of the mm, and add an address parameter.

      In most cases we already have the vma on the stack. We provide
      split_huge_page_pmd_mm() for the few cases when we have an mm but no
      vma.

      This change is preparation for the huge zero pmd splitting
      implementation.
      Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: "H. Peter Anvin" <hpa@linux.intel.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
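
      The new shape of the interface, as a sketch (the _mm variant
      resolving the vma via find_vma() follows the description above):

      /* Callers normally have the vma at hand: */
      void split_huge_page_pmd(struct vm_area_struct *vma,
                               unsigned long address, pmd_t *pmd);

      /* For the few callers that only have an mm, look the vma up first: */
      void split_huge_page_pmd_mm(struct mm_struct *mm, unsigned long address,
                                  pmd_t *pmd)
      {
              struct vm_area_struct *vma = find_vma(mm, address);

              BUG_ON(vma == NULL);
              split_huge_page_pmd(vma, address, pmd);
      }
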
    • thp: change_huge_pmd(): make sure we don't try to make a page writable · cad7f613
      Committed by Kirill A. Shutemov
      The mprotect core never tries to make a page writable using
      change_huge_pmd().  Let's add an assert that this assumption is true.
      It's important to be sure we will not make the huge zero page
      writable.
      Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: "H. Peter Anvin" <hpa@linux.intel.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
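
      The added check boils down to one assertion after the protection bits
      are rewritten; a sketch of the relevant fragment inside
      change_huge_pmd() (surrounding locking assumed):

      entry = pmdp_get_and_clear(mm, addr, pmd);
      entry = pmd_modify(entry, newprot);
      /* mprotect never upgrades a pmd to writable here; in particular
         the huge zero page must stay read-only */
      BUG_ON(pmd_write(entry));
      set_pmd_at(mm, addr, pmd, entry);
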
    • thp: do_huge_pmd_wp_page(): handle huge zero page · 93b4796d
      Committed by Kirill A. Shutemov
      On write access to the huge zero page we allocate a new huge page and
      clear it.

      If ENOMEM, graceful fallback: we create a new pmd table and set the
      pte around the fault address to a newly allocated normal (4k) page.
      All other ptes in the pmd are set to the normal zero page.
      Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: "H. Peter Anvin" <hpa@linux.intel.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Acked-by: David Rientjes <rientjes@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
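
      A condensed sketch of the write-fault branch (allocation flags,
      locking and mmu-notifier calls are elided; the fallback helper name
      and exact signatures are assumptions):

      /* inside do_huge_pmd_wp_page() (sketch): */
      if (is_huge_zero_pmd(orig_pmd)) {
              struct page *new_page;

              new_page = alloc_pages_vma(GFP_TRANSHUGE, HPAGE_PMD_ORDER,
                                         vma, haddr, numa_node_id());
              if (unlikely(!new_page))
                      /* ENOMEM: build a pte table where the faulting pte
                         gets a fresh 4k page and the rest point at the
                         normal zero page */
                      return do_huge_pmd_wp_zero_page_fallback_sketch(mm, vma,
                                      address, pmd, orig_pmd, haddr);
              clear_huge_page(new_page, haddr, HPAGE_PMD_NR);
              /* then map new_page writable in place of the zero pmd */
      }
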
    • thp: copy_huge_pmd(): copy huge zero page · fc9fe822
      Committed by Kirill A. Shutemov
      It's easy to copy the huge zero page: just set the destination pmd to
      the huge zero page.

      It's safe to copy the huge zero page since we have none yet :-p
      
      [rientjes@google.com: fix comment]
      Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: "H. Peter Anvin" <hpa@linux.intel.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Signed-off-by: David Rientjes <rientjes@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
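
      The fork-time branch is tiny; a sketch, reusing the
      set_huge_zero_page_sketch() helper from the read-fault entry above
      (surrounding locking elided):

      /* inside copy_huge_pmd(), with the source pmd already validated: */
      if (is_huge_zero_pmd(pmd)) {
              /* no data to copy: point the child's pmd at the same
                 huge zero page */
              set_huge_zero_page_sketch(dst_mm, vma, addr, dst_pmd);
              ret = 0;
              goto out_unlock;
      }
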
    • thp: zap_huge_pmd(): zap huge zero pmd · 479f0abb
      Committed by Kirill A. Shutemov
      We don't have a mapped page to zap in the huge zero page case.  Let's
      just clear the pmd and remove it from the TLB.
      Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: "H. Peter Anvin" <hpa@linux.intel.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Acked-by: David Rientjes <rientjes@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
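
      A condensed sketch of the zero-page branch in zap_huge_pmd()
      (pgtable bookkeeping and locking are elided):

      orig_pmd = pmdp_get_and_clear(tlb->mm, addr, pmd);
      tlb_remove_pmd_tlb_entry(tlb, pmd, addr);
      if (is_huge_zero_pmd(orig_pmd)) {
              /* nothing mapped that we own: clearing the pmd and flushing
                 the TLB entry is all that is needed */
      } else {
              struct page *page = pmd_page(orig_pmd);
              page_remove_rmap(page);
              tlb_remove_page(tlb, page);
      }
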
    • thp: huge zero page: basic preparation · 4a6c1297
      Committed by Kirill A. Shutemov
      During testing I noticed big (up to 2.5 times) memory consumption
      overhead on some workloads (e.g. ft.A from NPB) if THP is enabled.

      The main reason for the big difference is the lack of a zero page in
      the THP case.  We have to allocate a real page on every read page
      fault.
      
      A program to demonstrate the issue:
      #include <assert.h>
      #include <stdlib.h>
      #include <unistd.h>

      #define MB (1024*1024)

      int main(int argc, char **argv)
      {
              char *p;
              int i;

              /* Reserve 200 MB, aligned to the 2 MB huge page size. */
              posix_memalign((void **)&p, 2 * MB, 200 * MB);
              /* Touch one byte per 4k page: read faults only. */
              for (i = 0; i < 200 * MB; i += 4096)
                      assert(p[i] == 0);
              pause();
              return 0;
      }
      
      With thp-never RSS is about 400k, but with thp-always it's 200M.
      After the patchset, thp-always RSS is 400k too.
      
      Design overview.
      
      The huge zero page (hzp) is a non-movable huge page (2M on x86-64)
      filled with zeros.  How we allocate it changes over the course of the
      patchset:

      - [01/10] simplest way: hzp allocated at boot time in hugepage_init();
      - [09/10] lazy allocation on first use;
      - [10/10] lockless refcounting + shrinker-reclaimable hzp;
      
      We set it up in do_huge_pmd_anonymous_page() if the area around the
      fault address is suitable for THP and we've got a read page fault.
      If we fail to set up the hzp (ENOMEM) we fall back to
      handle_pte_fault() as we normally do in THP.

      On a wp fault to the hzp we allocate real memory for the huge page
      and clear it.  If ENOMEM, graceful fallback: we create a new pmd
      table and set the pte around the fault address to a newly allocated
      normal (4k) page.  All other ptes in the pmd are set to the normal
      zero page.
      
      We cannot split the hzp (and it's a bug if we try), but we can split
      the pmd which points to it.  On splitting the pmd we create a table
      with all ptes set to the normal zero page.
      
      ===
      
      At hpa's request I've tried an alternative approach for the hzp
      implementation (see the "Virtual huge zero page" patchset): a pmd
      table with all entries set to the zero page.  That way should be more
      cache friendly, but it increases TLB pressure.

      The problem with the virtual huge zero page: it requires per-arch
      enabling.  We need a way to mark that a pmd table has all ptes set to
      the zero page.
      
      Some numbers comparing the two implementations (on a 4-socket
      Westmere-EX):
      
      Microbenchmark1
      ===============
      
      test:
              posix_memalign((void **)&p, 2 * MB, 8 * GB);
              for (i = 0; i < 100; i++) {
                      assert(memcmp(p, p + 4*GB, 4*GB) == 0);
                      asm volatile ("": : :"memory");
              }
      
      hzp:
       Performance counter stats for './test_memcmp' (5 runs):
      
            32356.272845 task-clock                #    0.998 CPUs utilized            ( +-  0.13% )
                      40 context-switches          #    0.001 K/sec                    ( +-  0.94% )
                       0 CPU-migrations            #    0.000 K/sec
                   4,218 page-faults               #    0.130 K/sec                    ( +-  0.00% )
          76,712,481,765 cycles                    #    2.371 GHz                      ( +-  0.13% ) [83.31%]
          36,279,577,636 stalled-cycles-frontend   #   47.29% frontend cycles idle     ( +-  0.28% ) [83.35%]
           1,684,049,110 stalled-cycles-backend    #    2.20% backend  cycles idle     ( +-  2.96% ) [66.67%]
         134,355,715,816 instructions              #    1.75  insns per cycle
                                                   #    0.27  stalled cycles per insn  ( +-  0.10% ) [83.35%]
          13,526,169,702 branches                  #  418.039 M/sec                    ( +-  0.10% ) [83.31%]
               1,058,230 branch-misses             #    0.01% of all branches          ( +-  0.91% ) [83.36%]
      
            32.413866442 seconds time elapsed                                          ( +-  0.13% )
      
      vhzp:
       Performance counter stats for './test_memcmp' (5 runs):
      
            30327.183829 task-clock                #    0.998 CPUs utilized            ( +-  0.13% )
                      38 context-switches          #    0.001 K/sec                    ( +-  1.53% )
                       0 CPU-migrations            #    0.000 K/sec
                   4,218 page-faults               #    0.139 K/sec                    ( +-  0.01% )
          71,964,773,660 cycles                    #    2.373 GHz                      ( +-  0.13% ) [83.35%]
          31,191,284,231 stalled-cycles-frontend   #   43.34% frontend cycles idle     ( +-  0.40% ) [83.32%]
             773,484,474 stalled-cycles-backend    #    1.07% backend  cycles idle     ( +-  6.61% ) [66.67%]
         134,982,215,437 instructions              #    1.88  insns per cycle
                                                   #    0.23  stalled cycles per insn  ( +-  0.11% ) [83.32%]
          13,509,150,683 branches                  #  445.447 M/sec                    ( +-  0.11% ) [83.34%]
               1,017,667 branch-misses             #    0.01% of all branches          ( +-  1.07% ) [83.32%]
      
            30.381324695 seconds time elapsed                                          ( +-  0.13% )
      
      Microbenchmark2
      ===============
      
      test:
              posix_memalign((void **)&p, 2 * MB, 8 * GB);
              for (i = 0; i < 1000; i++) {
                      char *_p = p;
                      while (_p < p+4*GB) {
                              assert(*_p == *(_p+4*GB));
                              _p += 4096;
                              asm volatile ("": : :"memory");
                      }
              }
      
      hzp:
       Performance counter stats for 'taskset -c 0 ./test_memcmp2' (5 runs):
      
             3505.727639 task-clock                #    0.998 CPUs utilized            ( +-  0.26% )
                       9 context-switches          #    0.003 K/sec                    ( +-  4.97% )
                   4,384 page-faults               #    0.001 M/sec                    ( +-  0.00% )
           8,318,482,466 cycles                    #    2.373 GHz                      ( +-  0.26% ) [33.31%]
           5,134,318,786 stalled-cycles-frontend   #   61.72% frontend cycles idle     ( +-  0.42% ) [33.32%]
           2,193,266,208 stalled-cycles-backend    #   26.37% backend  cycles idle     ( +-  5.51% ) [33.33%]
           9,494,670,537 instructions              #    1.14  insns per cycle
                                                   #    0.54  stalled cycles per insn  ( +-  0.13% ) [41.68%]
           2,108,522,738 branches                  #  601.451 M/sec                    ( +-  0.09% ) [41.68%]
                 158,746 branch-misses             #    0.01% of all branches          ( +-  1.60% ) [41.71%]
           3,168,102,115 L1-dcache-loads           #  903.693 M/sec                    ( +-  0.11% ) [41.70%]
           1,048,710,998 L1-dcache-misses          #   33.10% of all L1-dcache hits    ( +-  0.11% ) [41.72%]
           1,047,699,685 LLC-load                  #  298.854 M/sec                    ( +-  0.03% ) [33.38%]
                   2,287 LLC-misses                #    0.00% of all LL-cache hits     ( +-  8.27% ) [33.37%]
           3,166,187,367 dTLB-loads                #  903.147 M/sec                    ( +-  0.02% ) [33.35%]
               4,266,538 dTLB-misses               #    0.13% of all dTLB cache hits   ( +-  0.03% ) [33.33%]
      
             3.513339813 seconds time elapsed                                          ( +-  0.26% )
      
      vhzp:
       Performance counter stats for 'taskset -c 0 ./test_memcmp2' (5 runs):
      
            27313.891128 task-clock                #    0.998 CPUs utilized            ( +-  0.24% )
                      62 context-switches          #    0.002 K/sec                    ( +-  0.61% )
                   4,384 page-faults               #    0.160 K/sec                    ( +-  0.01% )
          64,747,374,606 cycles                    #    2.370 GHz                      ( +-  0.24% ) [33.33%]
          61,341,580,278 stalled-cycles-frontend   #   94.74% frontend cycles idle     ( +-  0.26% ) [33.33%]
          56,702,237,511 stalled-cycles-backend    #   87.57% backend  cycles idle     ( +-  0.07% ) [33.33%]
          10,033,724,846 instructions              #    0.15  insns per cycle
                                                   #    6.11  stalled cycles per insn  ( +-  0.09% ) [41.65%]
           2,190,424,932 branches                  #   80.195 M/sec                    ( +-  0.12% ) [41.66%]
               1,028,630 branch-misses             #    0.05% of all branches          ( +-  1.50% ) [41.66%]
           3,302,006,540 L1-dcache-loads           #  120.891 M/sec                    ( +-  0.11% ) [41.68%]
             271,374,358 L1-dcache-misses          #    8.22% of all L1-dcache hits    ( +-  0.04% ) [41.66%]
              20,385,476 LLC-load                  #    0.746 M/sec                    ( +-  1.64% ) [33.34%]
                  76,754 LLC-misses                #    0.38% of all LL-cache hits     ( +-  2.35% ) [33.34%]
           3,309,927,290 dTLB-loads                #  121.181 M/sec                    ( +-  0.03% ) [33.34%]
           2,098,967,427 dTLB-misses               #   63.41% of all dTLB cache hits   ( +-  0.03% ) [33.34%]
      
            27.364448741 seconds time elapsed                                          ( +-  0.24% )
      
      ===
      
      I personally prefer the implementation in this patchset: it doesn't
      touch arch-specific code.
      
      This patch:
      
      The huge zero page (hzp) is a non-movable huge page (2M on x86-64)
      filled with zeros.

      For now let's allocate the page in hugepage_init().  We'll switch to
      lazy allocation later.

      We are not going to map the huge zero page until we can handle it
      properly on all code paths.

      The is_huge_zero_{pfn,pmd}() functions will be used by the following
      patches to check whether a pfn/pmd is the huge zero page (sketched
      after this entry).
      Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: "H. Peter Anvin" <hpa@linux.intel.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Acked-by: David Rientjes <rientjes@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
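
      The helpers this patch introduces are essentially pfn comparisons; a
      sketch consistent with the description above (boot-time allocation in
      hugepage_init() fills huge_zero_pfn):

      static unsigned long huge_zero_pfn __read_mostly;

      static inline bool is_huge_zero_pfn(unsigned long pfn)
      {
              return huge_zero_pfn && pfn == huge_zero_pfn;
      }

      static inline bool is_huge_zero_pmd(pmd_t pmd)
      {
              return is_huge_zero_pfn(pmd_pfn(pmd));
      }
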
  2. 12 Dec 2012 (5 commits)
  3. 15 Oct 2012 (1 commit)
  4. 09 Oct 2012 (23 commits)