1. 26 3月, 2015 2 次提交
    • M
      mm: numa: preserve PTE write permissions across a NUMA hinting fault · b191f9b1
      Mel Gorman 提交于
      Protecting a PTE to trap a NUMA hinting fault clears the writable bit
      and further faults are needed after trapping a NUMA hinting fault to set
      the writable bit again.  This patch preserves the writable bit when
      trapping NUMA hinting faults.  The impact is obvious from the number of
      minor faults trapped during the basis balancing benchmark and the system
      CPU usage;
      
        autonumabench
                                                   4.0.0-rc4             4.0.0-rc4
                                                    baseline              preserve
        Time System-NUMA01                  107.13 (  0.00%)      103.13 (  3.73%)
        Time System-NUMA01_THEADLOCAL       131.87 (  0.00%)       83.30 ( 36.83%)
        Time System-NUMA02                    8.95 (  0.00%)       10.72 (-19.78%)
        Time System-NUMA02_SMT                4.57 (  0.00%)        3.99 ( 12.69%)
        Time Elapsed-NUMA01                 515.78 (  0.00%)      517.26 ( -0.29%)
        Time Elapsed-NUMA01_THEADLOCAL      384.10 (  0.00%)      384.31 ( -0.05%)
        Time Elapsed-NUMA02                  48.86 (  0.00%)       48.78 (  0.16%)
        Time Elapsed-NUMA02_SMT              47.98 (  0.00%)       48.12 ( -0.29%)
      
                     4.0.0-rc4   4.0.0-rc4
                      baseline    preserve
        User          44383.95    43971.89
        System          252.61      201.24
        Elapsed         998.68     1000.94
      
        Minor Faults   2597249     1981230
        Major Faults       365         364
      
      There is a similar drop in system CPU usage using Dave Chinner's xfsrepair
      workload
      
                                            4.0.0-rc4             4.0.0-rc4
                                             baseline              preserve
        Amean    real-xfsrepair      454.14 (  0.00%)      442.36 (  2.60%)
        Amean    syst-xfsrepair      277.20 (  0.00%)      204.68 ( 26.16%)
      
      The patch looks hacky but the alternatives looked worse.  The tidest was
      to rewalk the page tables after a hinting fault but it was more complex
      than this approach and the performance was worse.  It's not generally
      safe to just mark the page writable during the fault if it's a write
      fault as it may have been read-only for COW so that approach was
      discarded.
      Signed-off-by: NMel Gorman <mgorman@suse.de>
      Reported-by: NDave Chinner <david@fromorbit.com>
      Tested-by: NDave Chinner <david@fromorbit.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Aneesh Kumar <aneesh.kumar@linux.vnet.ibm.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      b191f9b1
    • M
      mm: numa: group related processes based on VMA flags instead of page table flags · bea66fbd
      Mel Gorman 提交于
      These are three follow-on patches based on the xfsrepair workload Dave
      Chinner reported was problematic in 4.0-rc1 due to changes in page table
      management -- https://lkml.org/lkml/2015/3/1/226.
      
      Much of the problem was reduced by commit 53da3bc2 ("mm: fix up numa
      read-only thread grouping logic") and commit ba68bc01 ("mm: thp:
      Return the correct value for change_huge_pmd").  It was known that the
      performance in 3.19 was still better even if is far less safe.  This
      series aims to restore the performance without compromising on safety.
      
      For the test of this mail, I'm comparing 3.19 against 4.0-rc4 and the
      three patches applied on top
      
        autonumabench
                                                      3.19.0             4.0.0-rc4             4.0.0-rc4             4.0.0-rc4             4.0.0-rc4
                                                     vanilla               vanilla          vmwrite-v5r8         preserve-v5r8         slowscan-v5r8
        Time System-NUMA01                  124.00 (  0.00%)      161.86 (-30.53%)      107.13 ( 13.60%)      103.13 ( 16.83%)      145.01 (-16.94%)
        Time System-NUMA01_THEADLOCAL       115.54 (  0.00%)      107.64 (  6.84%)      131.87 (-14.13%)       83.30 ( 27.90%)       92.35 ( 20.07%)
        Time System-NUMA02                    9.35 (  0.00%)       10.44 (-11.66%)        8.95 (  4.28%)       10.72 (-14.65%)        8.16 ( 12.73%)
        Time System-NUMA02_SMT                3.87 (  0.00%)        4.63 (-19.64%)        4.57 (-18.09%)        3.99 ( -3.10%)        3.36 ( 13.18%)
        Time Elapsed-NUMA01                 570.06 (  0.00%)      567.82 (  0.39%)      515.78 (  9.52%)      517.26 (  9.26%)      543.80 (  4.61%)
        Time Elapsed-NUMA01_THEADLOCAL      393.69 (  0.00%)      384.83 (  2.25%)      384.10 (  2.44%)      384.31 (  2.38%)      380.73 (  3.29%)
        Time Elapsed-NUMA02                  49.09 (  0.00%)       49.33 ( -0.49%)       48.86 (  0.47%)       48.78 (  0.63%)       50.94 ( -3.77%)
        Time Elapsed-NUMA02_SMT              47.51 (  0.00%)       47.15 (  0.76%)       47.98 ( -0.99%)       48.12 ( -1.28%)       49.56 ( -4.31%)
      
                      3.19.0   4.0.0-rc4   4.0.0-rc4   4.0.0-rc4   4.0.0-rc4
                     vanilla     vanillavmwrite-v5r8preserve-v5r8slowscan-v5r8
        User        46334.60    46391.94    44383.95    43971.89    44372.12
        System        252.84      284.66      252.61      201.24      249.00
        Elapsed      1062.14     1050.96      998.68     1000.94     1026.78
      
      Overall the system CPU usage is comparable and the test is naturally a
      bit variable.  The slowing of the scanner hurts numa01 but on this
      machine it is an adverse workload and patches that dramatically help it
      often hurt absolutely everything else.
      
      Due to patch 2, the fault activity is interesting
      
                                        3.19.0   4.0.0-rc4   4.0.0-rc4   4.0.0-rc4   4.0.0-rc4
                                       vanilla     vanillavmwrite-v5r8preserve-v5r8slowscan-v5r8
        Minor Faults                   2097811     2656646     2597249     1981230     1636841
        Major Faults                       362         450         365         364         365
      
      Note the impact preserving the write bit across protection updates and
      fault reduces faults.
      
        NUMA alloc hit                 1229008     1217015     1191660     1178322     1199681
        NUMA alloc miss                      0           0           0           0           0
        NUMA interleave hit                  0           0           0           0           0
        NUMA alloc local               1228514     1216317     1190871     1177448     1199021
        NUMA base PTE updates        245706197   240041607   238195516   244704842   115012800
        NUMA huge PMD updates           479530      468448      464868      477573      224487
        NUMA page range updates      491225557   479886983   476207932   489222218   229950144
        NUMA hint faults                659753      656503      641678      656926      294842
        NUMA hint local faults          381604      373963      360478      337585      186249
        NUMA hint local percent             57          56          56          51          63
        NUMA pages migrated            5412140     6374899     6266530     5277468     5755096
        AutoNUMA cost                    5121%       5083%       4994%       5097%       2388%
      
      Here the impact of slowing the PTE scanner on migratrion failures is
      obvious as "NUMA base PTE updates" and "NUMA huge PMD updates" are
      massively reduced even though the headline performance is very similar.
      
      As xfsrepair was the reported workload here is the impact of the series
      on it.
      
        xfsrepair
                                               3.19.0             4.0.0-rc4             4.0.0-rc4             4.0.0-rc4             4.0.0-rc4
                                              vanilla               vanilla          vmwrite-v5r8         preserve-v5r8         slowscan-v5r8
        Min      real-fsmark        1183.29 (  0.00%)     1165.73 (  1.48%)     1152.78 (  2.58%)     1153.64 (  2.51%)     1177.62 (  0.48%)
        Min      syst-fsmark        4107.85 (  0.00%)     4027.75 (  1.95%)     3986.74 (  2.95%)     3979.16 (  3.13%)     4048.76 (  1.44%)
        Min      real-xfsrepair      441.51 (  0.00%)      463.96 ( -5.08%)      449.50 ( -1.81%)      440.08 (  0.32%)      439.87 (  0.37%)
        Min      syst-xfsrepair      195.76 (  0.00%)      278.47 (-42.25%)      262.34 (-34.01%)      203.70 ( -4.06%)      143.64 ( 26.62%)
        Amean    real-fsmark        1188.30 (  0.00%)     1177.34 (  0.92%)     1157.97 (  2.55%)     1158.21 (  2.53%)     1182.22 (  0.51%)
        Amean    syst-fsmark        4111.37 (  0.00%)     4055.70 (  1.35%)     3987.19 (  3.02%)     3998.72 (  2.74%)     4061.69 (  1.21%)
        Amean    real-xfsrepair      450.88 (  0.00%)      468.32 ( -3.87%)      454.14 ( -0.72%)      442.36 (  1.89%)      440.59 (  2.28%)
        Amean    syst-xfsrepair      199.66 (  0.00%)      290.60 (-45.55%)      277.20 (-38.84%)      204.68 ( -2.51%)      150.55 ( 24.60%)
        Stddev   real-fsmark           4.12 (  0.00%)       10.82 (-162.29%)       4.14 ( -0.28%)        5.98 (-45.05%)        4.60 (-11.53%)
        Stddev   syst-fsmark           2.63 (  0.00%)       20.32 (-671.82%)       0.37 ( 85.89%)       16.47 (-525.59%)      15.05 (-471.79%)
        Stddev   real-xfsrepair        6.87 (  0.00%)        4.55 ( 33.75%)        3.46 ( 49.58%)        1.78 ( 74.12%)        0.52 ( 92.50%)
        Stddev   syst-xfsrepair        3.02 (  0.00%)       10.30 (-241.37%)      13.17 (-336.37%)       0.71 ( 76.63%)        5.00 (-65.61%)
        CoeffVar real-fsmark           0.35 (  0.00%)        0.92 (-164.73%)       0.36 ( -2.91%)        0.52 (-48.82%)        0.39 (-12.10%)
        CoeffVar syst-fsmark           0.06 (  0.00%)        0.50 (-682.41%)       0.01 ( 85.45%)        0.41 (-543.22%)       0.37 (-478.78%)
        CoeffVar real-xfsrepair        1.52 (  0.00%)        0.97 ( 36.21%)        0.76 ( 49.94%)        0.40 ( 73.62%)        0.12 ( 92.33%)
        CoeffVar syst-xfsrepair        1.51 (  0.00%)        3.54 (-134.54%)       4.75 (-214.31%)       0.34 ( 77.20%)        3.32 (-119.63%)
        Max      real-fsmark        1193.39 (  0.00%)     1191.77 (  0.14%)     1162.90 (  2.55%)     1166.66 (  2.24%)     1188.50 (  0.41%)
        Max      syst-fsmark        4114.18 (  0.00%)     4075.45 (  0.94%)     3987.65 (  3.08%)     4019.45 (  2.30%)     4082.80 (  0.76%)
        Max      real-xfsrepair      457.80 (  0.00%)      474.60 ( -3.67%)      457.82 ( -0.00%)      444.42 (  2.92%)      441.03 (  3.66%)
        Max      syst-xfsrepair      203.11 (  0.00%)      303.65 (-49.50%)      294.35 (-44.92%)      205.33 ( -1.09%)      155.28 ( 23.55%)
      
      The really relevant lines as syst-xfsrepair which is the system CPU
      usage when running xfsrepair.  Note that on my machine the overhead was
      45% higher on 4.0-rc4 which may be part of what Dave is seeing.  Once we
      preserve the write bit across faults, it's only 2.51% higher on average.
      With the full series applied, system CPU usage is 24.6% lower on
      average.
      
      Again, the impact of preserving the write bit on minor faults is obvious
      and the impact of slowing scanning after migration failures is obvious
      on the PTE updates.  Note also that the number of pages migrated is much
      reduced even though the headline performance is comparable.
      
                                        3.19.0   4.0.0-rc4   4.0.0-rc4   4.0.0-rc4   4.0.0-rc4
                                       vanilla     vanillavmwrite-v5r8preserve-v5r8slowscan-v5r8
        Minor Faults                 153466827   254507978   249163829   153501373   105737890
        Major Faults                       610         702         690         649         724
        NUMA base PTE updates        217735049   210756527   217729596   216937111   144344993
        NUMA huge PMD updates           129294       85044      106921      127246       79887
        NUMA pages migrated           21938995    29705270    28594162    22687324    16258075
      
                              3.19.0   4.0.0-rc4   4.0.0-rc4   4.0.0-rc4   4.0.0-rc4
                             vanilla     vanillavmwrite-v5r8preserve-v5r8slowscan-v5r8
        Mean sdb-avgqusz       13.47        2.54        2.55        2.47        2.49
        Mean sdb-avgrqsz      202.32      140.22      139.50      139.02      138.12
        Mean sdb-await         25.92        5.09        5.33        5.02        5.22
        Mean sdb-r_await        4.71        0.19        0.83        0.51        0.11
        Mean sdb-w_await      104.13        5.21        5.38        5.05        5.32
        Mean sdb-svctm          0.59        0.13        0.14        0.13        0.14
        Mean sdb-rrqm           0.16        0.00        0.00        0.00        0.00
        Mean sdb-wrqm           3.59     1799.43     1826.84     1812.21     1785.67
        Max  sdb-avgqusz      111.06       12.13       14.05       11.66       15.60
        Max  sdb-avgrqsz      255.60      190.34      190.01      187.33      191.78
        Max  sdb-await        168.24       39.28       49.22       44.64       65.62
        Max  sdb-r_await      660.00       52.00      280.00       76.00       12.00
        Max  sdb-w_await     7804.00       39.28       49.22       44.64       65.62
        Max  sdb-svctm          4.00        2.82        2.86        1.98        2.84
        Max  sdb-rrqm           8.30        0.00        0.00        0.00        0.00
        Max  sdb-wrqm          34.20     5372.80     5278.60     5386.60     5546.15
      
      FWIW, I also checked SPECjbb in different configurations but it's
      similar observations -- minor faults lower, PTE update activity lower
      and performance is roughly comparable against 3.19.
      
      This patch (of 3):
      
      Threads that share writable data within pages are grouped together as
      related tasks.  This decision is based on whether the PTE is marked
      dirty which is subject to timing races between the PTE scanner update
      and when the application writes the page.  If the page is file-backed,
      then background flushes and sync also affect placement.  This is
      unpredictable behaviour which is impossible to reason about so this
      patch makes grouping decisions based on the VMA flags.
      Signed-off-by: NMel Gorman <mgorman@suse.de>
      Reported-by: NDave Chinner <david@fromorbit.com>
      Tested-by: NDave Chinner <david@fromorbit.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Aneesh Kumar <aneesh.kumar@linux.vnet.ibm.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      bea66fbd
  2. 13 3月, 2015 1 次提交
  3. 12 3月, 2015 1 次提交
    • L
      mm: fix up numa read-only thread grouping logic · 53da3bc2
      Linus Torvalds 提交于
      Dave Chinner reported that commit 4d942466 ("mm: convert
      p[te|md]_mknonnuma and remaining page table manipulations") slowed down
      his xfsrepair test enormously.  In particular, it was using more system
      time due to extra TLB flushing.
      
      The ultimate reason turns out to be how the change to use the regular
      page table accessor functions broke the NUMA grouping logic.  The old
      special mknuma/mknonnuma code accessed the page table present bit and
      the magic NUMA bit directly, while the new code just changes the page
      protections using PROT_NONE and the regular vma protections.
      
      That sounds equivalent, and from a fault standpoint it really is, but a
      subtle side effect is that the *other* protection bits of the page table
      entries also change.  And the code to decide how to group the NUMA
      entries together used the writable bit to decide whether a particular
      page was likely to be shared read-only or not.
      
      And with the change to make the NUMA handling use the regular permission
      setting functions, that writable bit was basically always cleared for
      private mappings due to COW.  So even if the page actually ends up being
      written to in the end, the NUMA balancing would act as if it was always
      shared RO.
      
      This code is a heuristic anyway, so the fix - at least for now - is to
      instead check whether the page is dirty rather than writable.  The bit
      doesn't change with protection changes.
      
      NOTE! This also adds a FIXME comment to revisit this issue,
      
      Not only should we probably re-visit the whole "is this a shared
      read-only page" heuristic (we might want to take the vma permissions
      into account and base this more on those than the per-page ones, and
      also look at whether the particular access that triggers it is a write
      or not), but the whole COW issue shows that we should think about the
      NUMA fault handling some more.
      
      For example, maybe we should do the early-COW thing that a regular fault
      does.  Or maybe we should accept that while using the same bits as
      PROTNONE was a good thing (and got rid of the specual NUMA bit), we
      might still want to just preseve the other protection bits across NUMA
      faulting.
      
      Those are bigger questions, left for later.  This just fixes up the
      heuristic so that it at least approximates working again.  More analysis
      and work needed.
      Reported-by: NDave Chinner <david@fromorbit.com>
      Tested-by: NMel Gorman <mgorman@suse.de>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Aneesh Kumar <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Ingo Molnar <mingo@kernel.org>,
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      53da3bc2
  4. 13 2月, 2015 6 次提交
    • M
      mm: numa: avoid unnecessary TLB flushes when setting NUMA hinting entries · 10c1045f
      Mel Gorman 提交于
      If a PTE or PMD is already marked NUMA when scanning to mark entries for
      NUMA hinting then it is not necessary to update the entry and incur a TLB
      flush penalty.  Avoid the avoidhead where possible.
      Signed-off-by: NMel Gorman <mgorman@suse.de>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Dave Jones <davej@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Kirill Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Sasha Levin <sasha.levin@oracle.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      10c1045f
    • M
      mm: numa: add paranoid check around pte_protnone_numa · c0e7cad9
      Mel Gorman 提交于
      pte_protnone_numa is only safe to use after VMA checks for PROT_NONE are
      complete.  Treating a real PROT_NONE PTE as a NUMA hinting fault is going
      to result in strangeness so add a check for it.  BUG_ON looks like
      overkill but if this is hit then it's a serious bug that could result in
      corruption so do not even try recovering.  It would have been more
      comprehensive to check VMA flags in pte_protnone_numa but it would have
      made the API ugly just for a debugging check.
      Signed-off-by: NMel Gorman <mgorman@suse.de>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Dave Jones <davej@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Kirill Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Sasha Levin <sasha.levin@oracle.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      c0e7cad9
    • M
      mm: numa: do not trap faults on the huge zero page · e944fd67
      Mel Gorman 提交于
      Faults on the huge zero page are pointless and there is a BUG_ON to catch
      them during fault time.  This patch reintroduces a check that avoids
      marking the zero page PAGE_NONE.
      Signed-off-by: NMel Gorman <mgorman@suse.de>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Dave Jones <davej@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Kirill Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Sasha Levin <sasha.levin@oracle.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      e944fd67
    • M
      mm: convert p[te|md]_mknonnuma and remaining page table manipulations · 4d942466
      Mel Gorman 提交于
      With PROT_NONE, the traditional page table manipulation functions are
      sufficient.
      
      [andre.przywara@arm.com: fix compiler warning in pmdp_invalidate()]
      [akpm@linux-foundation.org: fix build with STRICT_MM_TYPECHECKS]
      Signed-off-by: NMel Gorman <mgorman@suse.de>
      Acked-by: NLinus Torvalds <torvalds@linux-foundation.org>
      Acked-by: NAneesh Kumar <aneesh.kumar@linux.vnet.ibm.com>
      Tested-by: NSasha Levin <sasha.levin@oracle.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Dave Jones <davej@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Kirill Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Rik van Riel <riel@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      4d942466
    • M
      mm: convert p[te|md]_numa users to p[te|md]_protnone_numa · 8a0516ed
      Mel Gorman 提交于
      Convert existing users of pte_numa and friends to the new helper.  Note
      that the kernel is broken after this patch is applied until the other page
      table modifiers are also altered.  This patch layout is to make review
      easier.
      Signed-off-by: NMel Gorman <mgorman@suse.de>
      Acked-by: NLinus Torvalds <torvalds@linux-foundation.org>
      Acked-by: NAneesh Kumar <aneesh.kumar@linux.vnet.ibm.com>
      Acked-by: NBenjamin Herrenschmidt <benh@kernel.crashing.org>
      Tested-by: NSasha Levin <sasha.levin@oracle.com>
      Cc: Dave Jones <davej@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Kirill Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Sasha Levin <sasha.levin@oracle.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      8a0516ed
    • M
      mm: numa: do not dereference pmd outside of the lock during NUMA hinting fault · 5d833062
      Mel Gorman 提交于
      Automatic NUMA balancing depends on being able to protect PTEs to trap a
      fault and gather reference locality information.  Very broadly speaking
      it would mark PTEs as not present and use another bit to distinguish
      between NUMA hinting faults and other types of faults.  It was
      universally loved by everybody and caused no problems whatsoever.  That
      last sentence might be a lie.
      
      This series is very heavily based on patches from Linus and Aneesh to
      replace the existing PTE/PMD NUMA helper functions with normal change
      protections.  I did alter and add parts of it but I consider them
      relatively minor contributions.  At their suggestion, acked-bys are in
      there but I've no problem converting them to Signed-off-by if requested.
      
      AFAIK, this has received no testing on ppc64 and I'm depending on Aneesh
      for that.  I tested trinity under kvm-tool and passed and ran a few
      other basic tests.  At the time of writing, only the short-lived tests
      have completed but testing of V2 indicated that long-term testing had no
      surprises.  In most cases I'm leaving out detail as it's not that
      interesting.
      
      specjbb single JVM: There was negligible performance difference in the
      	benchmark itself for short runs. However, system activity is
      	higher and interrupts are much higher over time -- possibly TLB
      	flushes. Migrations are also higher. Overall, this is more overhead
      	but considering the problems faced with the old approach I think
      	we just have to suck it up and find another way of reducing the
      	overhead.
      
      specjbb multi JVM: Negligible performance difference to the actual benchmark
      	but like the single JVM case, the system overhead is noticeably
      	higher.  Again, interrupts are a major factor.
      
      autonumabench: This was all over the place and about all that can be
      	reasonably concluded is that it's different but not necessarily
      	better or worse.
      
      autonumabench
                                           3.18.0-rc5            3.18.0-rc5
                                       mmotm-20141119         protnone-v3r3
      User    NUMA01               32380.24 (  0.00%)    21642.92 ( 33.16%)
      User    NUMA01_THEADLOCAL    22481.02 (  0.00%)    22283.22 (  0.88%)
      User    NUMA02                3137.00 (  0.00%)     3116.54 (  0.65%)
      User    NUMA02_SMT            1614.03 (  0.00%)     1543.53 (  4.37%)
      System  NUMA01                 322.97 (  0.00%)     1465.89 (-353.88%)
      System  NUMA01_THEADLOCAL       91.87 (  0.00%)       49.32 ( 46.32%)
      System  NUMA02                  37.83 (  0.00%)       14.61 ( 61.38%)
      System  NUMA02_SMT               7.36 (  0.00%)        7.45 ( -1.22%)
      Elapsed NUMA01                 716.63 (  0.00%)      599.29 ( 16.37%)
      Elapsed NUMA01_THEADLOCAL      553.98 (  0.00%)      539.94 (  2.53%)
      Elapsed NUMA02                  83.85 (  0.00%)       83.04 (  0.97%)
      Elapsed NUMA02_SMT              86.57 (  0.00%)       79.15 (  8.57%)
      CPU     NUMA01                4563.00 (  0.00%)     3855.00 ( 15.52%)
      CPU     NUMA01_THEADLOCAL     4074.00 (  0.00%)     4136.00 ( -1.52%)
      CPU     NUMA02                3785.00 (  0.00%)     3770.00 (  0.40%)
      CPU     NUMA02_SMT            1872.00 (  0.00%)     1959.00 ( -4.65%)
      
      System CPU usage of NUMA01 is worse but it's an adverse workload on this
      machine so I'm reluctant to conclude that it's a problem that matters.  On
      the other workloads that are sensible on this machine, system CPU usage is
      great.  Overall time to complete the benchmark is comparable
      
                3.18.0-rc5  3.18.0-rc5
              mmotm-20141119protnone-v3r3
      User        59612.50    48586.44
      System        460.22     1537.45
      Elapsed      1442.20     1304.29
      
      NUMA alloc hit                 5075182     5743353
      NUMA alloc miss                      0           0
      NUMA interleave hit                  0           0
      NUMA alloc local               5075174     5743339
      NUMA base PTE updates        637061448   443106883
      NUMA huge PMD updates          1243434      864747
      NUMA page range updates     1273699656   885857347
      NUMA hint faults               1658116     1214277
      NUMA hint local faults          959487      754113
      NUMA hint local percent             57          62
      NUMA pages migrated            5467056    61676398
      
      The NUMA pages migrated look terrible but when I looked at a graph of the
      activity over time I see that the massive spike in migration activity was
      during NUMA01.  This correlates with high system CPU usage and could be
      simply down to bad luck but any modifications that affect that workload
      would be related to scan rates and migrations, not the protection
      mechanism.  For all other workloads, migration activity was comparable.
      
      Overall, headline performance figures are comparable but the overhead is
      higher, mostly in interrupts.  To some extent, higher overhead from this
      approach was anticipated but not to this degree.  It's going to be
      necessary to reduce this again with a separate series in the future.  It's
      still worth going ahead with this series though as it's likely to avoid
      constant headaches with Xen and is probably easier to maintain.
      
      This patch (of 10):
      
      A transhuge NUMA hinting fault may find the page is migrating and should
      wait until migration completes.  The check is race-prone because the pmd
      is deferenced outside of the page lock and while the race is tiny, it'll
      be larger if the PMD is cleared while marking PMDs for hinting fault.
      This patch closes the race.
      Signed-off-by: NMel Gorman <mgorman@suse.de>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Dave Jones <davej@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Kirill Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Sasha Levin <sasha.levin@oracle.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      5d833062
  5. 12 2月, 2015 4 次提交
  6. 11 12月, 2014 1 次提交
  7. 30 10月, 2014 2 次提交
    • D
      mm, thp: fix collapsing of hugepages on madvise · 6d50e60c
      David Rientjes 提交于
      If an anonymous mapping is not allowed to fault thp memory and then
      madvise(MADV_HUGEPAGE) is used after fault, khugepaged will never
      collapse this memory into thp memory.
      
      This occurs because the madvise(2) handler for thp, hugepage_madvise(),
      clears VM_NOHUGEPAGE on the stack and it isn't stored in vma->vm_flags
      until the final action of madvise_behavior().  This causes the
      khugepaged_enter_vma_merge() to be a no-op in hugepage_madvise() when
      the vma had previously had VM_NOHUGEPAGE set.
      
      Fix this by passing the correct vma flags to the khugepaged mm slot
      handler.  There's no chance khugepaged can run on this vma until after
      madvise_behavior() returns since we hold mm->mmap_sem.
      
      It would be possible to clear VM_NOHUGEPAGE directly from vma->vm_flags
      in hugepage_advise(), but I didn't want to introduce special case
      behavior into madvise_behavior().  I think it's best to just let it
      always set vma->vm_flags itself.
      Signed-off-by: NDavid Rientjes <rientjes@google.com>
      Reported-by: NSuleiman Souhlal <suleiman@google.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      6d50e60c
    • Y
      mm: free compound page with correct order · 5ddacbe9
      Yu Zhao 提交于
      Compound page should be freed by put_page() or free_pages() with correct
      order.  Not doing so will cause tail pages leaked.
      
      The compound order can be obtained by compound_order() or use
      HPAGE_PMD_ORDER in our case.  Some people would argue the latter is
      faster but I prefer the former which is more general.
      
      This bug was observed not just on our servers (the worst case we saw is
      11G leaked on a 48G machine) but also on our workstations running Ubuntu
      based distro.
      
        $ cat /proc/vmstat  | grep thp_zero_page_alloc
        thp_zero_page_alloc 55
        thp_zero_page_alloc_failed 0
      
      This means there is (thp_zero_page_alloc - 1) * (2M - 4K) memory leaked.
      
      Fixes: 97ae1749 ("thp: implement refcounting for huge zero page")
      Signed-off-by: NYu Zhao <yuzhao@google.com>
      Acked-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Bob Liu <lliubbo@gmail.com>
      Cc: <stable@vger.kernel.org>	[3.8+]
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      5ddacbe9
  8. 27 10月, 2014 2 次提交
  9. 10 10月, 2014 3 次提交
  10. 03 10月, 2014 1 次提交
    • M
      mm: numa: Do not mark PTEs pte_numa when splitting huge pages · abc40bd2
      Mel Gorman 提交于
      This patch reverts 1ba6e0b5 ("mm: numa: split_huge_page: transfer the
      NUMA type from the pmd to the pte"). If a huge page is being split due
      a protection change and the tail will be in a PROT_NONE vma then NUMA
      hinting PTEs are temporarily created in the protected VMA.
      
       VM_RW|VM_PROTNONE
      |-----------------|
            ^
            split here
      
      In the specific case above, it should get fixed up by change_pte_range()
      but there is a window of opportunity for weirdness to happen. Similarly,
      if a huge page is shrunk and split during a protection update but before
      pmd_numa is cleared then a pte_numa can be left behind.
      
      Instead of adding complexity trying to deal with the case, this patch
      will not mark PTEs NUMA when splitting a huge page. NUMA hinting faults
      will not be triggered which is marginal in comparison to the complexity
      in dealing with the corner cases during THP split.
      
      Cc: stable@vger.kernel.org
      Signed-off-by: NMel Gorman <mgorman@suse.de>
      Acked-by: NRik van Riel <riel@redhat.com>
      Acked-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      abc40bd2
  11. 09 8月, 2014 1 次提交
    • J
      mm: memcontrol: rewrite charge API · 00501b53
      Johannes Weiner 提交于
      These patches rework memcg charge lifetime to integrate more naturally
      with the lifetime of user pages.  This drastically simplifies the code and
      reduces charging and uncharging overhead.  The most expensive part of
      charging and uncharging is the page_cgroup bit spinlock, which is removed
      entirely after this series.
      
      Here are the top-10 profile entries of a stress test that reads a 128G
      sparse file on a freshly booted box, without even a dedicated cgroup (i.e.
       executing in the root memcg).  Before:
      
          15.36%              cat  [kernel.kallsyms]   [k] copy_user_generic_string
          13.31%              cat  [kernel.kallsyms]   [k] memset
          11.48%              cat  [kernel.kallsyms]   [k] do_mpage_readpage
           4.23%              cat  [kernel.kallsyms]   [k] get_page_from_freelist
           2.38%              cat  [kernel.kallsyms]   [k] put_page
           2.32%              cat  [kernel.kallsyms]   [k] __mem_cgroup_commit_charge
           2.18%          kswapd0  [kernel.kallsyms]   [k] __mem_cgroup_uncharge_common
           1.92%          kswapd0  [kernel.kallsyms]   [k] shrink_page_list
           1.86%              cat  [kernel.kallsyms]   [k] __radix_tree_lookup
           1.62%              cat  [kernel.kallsyms]   [k] __pagevec_lru_add_fn
      
      After:
      
          15.67%           cat  [kernel.kallsyms]   [k] copy_user_generic_string
          13.48%           cat  [kernel.kallsyms]   [k] memset
          11.42%           cat  [kernel.kallsyms]   [k] do_mpage_readpage
           3.98%           cat  [kernel.kallsyms]   [k] get_page_from_freelist
           2.46%           cat  [kernel.kallsyms]   [k] put_page
           2.13%       kswapd0  [kernel.kallsyms]   [k] shrink_page_list
           1.88%           cat  [kernel.kallsyms]   [k] __radix_tree_lookup
           1.67%           cat  [kernel.kallsyms]   [k] __pagevec_lru_add_fn
           1.39%       kswapd0  [kernel.kallsyms]   [k] free_pcppages_bulk
           1.30%           cat  [kernel.kallsyms]   [k] kfree
      
      As you can see, the memcg footprint has shrunk quite a bit.
      
         text    data     bss     dec     hex filename
        37970    9892     400   48262    bc86 mm/memcontrol.o.old
        35239    9892     400   45531    b1db mm/memcontrol.o
      
      This patch (of 4):
      
      The memcg charge API charges pages before they are rmapped - i.e.  have an
      actual "type" - and so every callsite needs its own set of charge and
      uncharge functions to know what type is being operated on.  Worse,
      uncharge has to happen from a context that is still type-specific, rather
      than at the end of the page's lifetime with exclusive access, and so
      requires a lot of synchronization.
      
      Rewrite the charge API to provide a generic set of try_charge(),
      commit_charge() and cancel_charge() transaction operations, much like
      what's currently done for swap-in:
      
        mem_cgroup_try_charge() attempts to reserve a charge, reclaiming
        pages from the memcg if necessary.
      
        mem_cgroup_commit_charge() commits the page to the charge once it
        has a valid page->mapping and PageAnon() reliably tells the type.
      
        mem_cgroup_cancel_charge() aborts the transaction.
      
      This reduces the charge API and enables subsequent patches to
      drastically simplify uncharging.
      
      As pages need to be committed after rmap is established but before they
      are added to the LRU, page_add_new_anon_rmap() must stop doing LRU
      additions again.  Revive lru_cache_add_active_or_unevictable().
      
      [hughd@google.com: fix shmem_unuse]
      [hughd@google.com: Add comments on the private use of -EAGAIN]
      Signed-off-by: NJohannes Weiner <hannes@cmpxchg.org>
      Acked-by: NMichal Hocko <mhocko@suse.cz>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Vladimir Davydov <vdavydov@parallels.com>
      Signed-off-by: NHugh Dickins <hughd@google.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      00501b53
  12. 07 8月, 2014 4 次提交
  13. 24 6月, 2014 2 次提交
    • H
      mm: let mm_find_pmd fix buggy race with THP fault · f72e7dcd
      Hugh Dickins 提交于
      Trinity has reported:
      
          BUG: unable to handle kernel NULL pointer dereference at 0000000000000018
          IP: __lock_acquire (kernel/locking/lockdep.c:3070 (discriminator 1))
          CPU: 6 PID: 16173 Comm: trinity-c364 Tainted: G        W
                                  3.15.0-rc1-next-20140415-sasha-00020-gaa90d09 #398
          lock_acquire (arch/x86/include/asm/current.h:14
                        kernel/locking/lockdep.c:3602)
          _raw_spin_lock (include/linux/spinlock_api_smp.h:143
                          kernel/locking/spinlock.c:151)
          remove_migration_pte (mm/migrate.c:137)
          rmap_walk (mm/rmap.c:1628 mm/rmap.c:1699)
          remove_migration_ptes (mm/migrate.c:224)
          migrate_pages (mm/migrate.c:922 mm/migrate.c:960 mm/migrate.c:1126)
          migrate_misplaced_page (mm/migrate.c:1733)
          __handle_mm_fault (mm/memory.c:3762 mm/memory.c:3812 mm/memory.c:3925)
          handle_mm_fault (mm/memory.c:3948)
          __get_user_pages (mm/memory.c:1851)
          __mlock_vma_pages_range (mm/mlock.c:255)
          __mm_populate (mm/mlock.c:711)
          SyS_mlockall (include/linux/mm.h:1799 mm/mlock.c:817 mm/mlock.c:791)
      
      I believe this comes about because, whereas collapsing and splitting THP
      functions take anon_vma lock in write mode (which excludes concurrent
      rmap walks), faulting THP functions (write protection and misplaced
      NUMA) do not - and mostly they do not need to.
      
      But they do use a pmdp_clear_flush(), set_pmd_at() sequence which, for
      an instant (indeed, for a long instant, given the inter-CPU TLB flush in
      there), leaves *pmd neither present not trans_huge.
      
      Which can confuse a concurrent rmap walk, as when removing migration
      ptes, seen in the dumped trace.  Although that rmap walk has a 4k page
      to insert, anon_vmas containing THPs are in no way segregated from
      4k-page anon_vmas, so the 4k-intent mm_find_pmd() does need to cope with
      that instant when a trans_huge pmd is temporarily absent.
      
      I don't think we need strengthen the locking at the THP end: it's easily
      handled with an ACCESS_ONCE() before testing both conditions.
      
      And since mm_find_pmd() had only one caller who wanted a THP rather than
      a pmd, let's slightly repurpose it to fail when it hits a THP or
      non-present pmd, and open code split_huge_page_address() again.
      Signed-off-by: NHugh Dickins <hughd@google.com>
      Reported-by: NSasha Levin <sasha.levin@oracle.com>
      Acked-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Konstantin Khlebnikov <koct9i@gmail.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Bob Liu <bob.liu@oracle.com>
      Cc: Christoph Lameter <cl@gentwo.org>
      Cc: Dave Jones <davej@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      f72e7dcd
    • H
      mm: thp: fix DEBUG_PAGEALLOC oops in copy_page_rep() · 5338a937
      Hugh Dickins 提交于
      Trinity has for over a year been reporting a CONFIG_DEBUG_PAGEALLOC oops
      in copy_page_rep() called from copy_user_huge_page() called from
      do_huge_pmd_wp_page().
      
      I believe this is a DEBUG_PAGEALLOC false positive, due to the source
      page being split, and a tail page freed, while copy is in progress; and
      not a problem without DEBUG_PAGEALLOC, since the pmd_same() check will
      prevent a miscopy from being made visible.
      
      Fix by adding get_user_huge_page() and put_user_huge_page(): reducing to
      the usual get_page() and put_page() on head page in the usual config;
      but get and put references to all of the tail pages when
      DEBUG_PAGEALLOC.
      
      [akpm@linux-foundation.org: coding-style fixes]
      Signed-off-by: NHugh Dickins <hughd@google.com>
      Acked-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Reported-by: NSasha Levin <sasha.levin@oracle.com>
      Tested-by: NSasha Levin <sasha.levin@oracle.com>
      Cc: Dave Jones <davej@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      5338a937
  14. 05 6月, 2014 2 次提交
  15. 19 4月, 2014 1 次提交
    • K
      thp: close race between split and zap huge pages · b5a8cad3
      Kirill A. Shutemov 提交于
      Sasha Levin has reported two THP BUGs[1][2].  I believe both of them
      have the same root cause.  Let's look to them one by one.
      
      The first bug[1] is "kernel BUG at mm/huge_memory.c:1829!".  It's
      BUG_ON(mapcount != page_mapcount(page)) in __split_huge_page().  From my
      testing I see that page_mapcount() is higher than mapcount here.
      
      I think it happens due to race between zap_huge_pmd() and
      page_check_address_pmd().  page_check_address_pmd() misses PMD which is
      under zap:
      
      	CPU0						CPU1
      						zap_huge_pmd()
      						  pmdp_get_and_clear()
      __split_huge_page()
        anon_vma_interval_tree_foreach()
          __split_huge_page_splitting()
            page_check_address_pmd()
              mm_find_pmd()
      	  /*
      	   * We check if PMD present without taking ptl: no
      	   * serialization against zap_huge_pmd(). We miss this PMD,
      	   * it's not accounted to 'mapcount' in __split_huge_page().
      	   */
      	  pmd_present(pmd) == 0
      
        BUG_ON(mapcount != page_mapcount(page)) // CRASH!!!
      
      						  page_remove_rmap(page)
      						    atomic_add_negative(-1, &page->_mapcount)
      
      The second bug[2] is "kernel BUG at mm/huge_memory.c:1371!".
      It's VM_BUG_ON_PAGE(!PageHead(page), page) in zap_huge_pmd().
      
      This happens in similar way:
      
      	CPU0						CPU1
      						zap_huge_pmd()
      						  pmdp_get_and_clear()
      						  page_remove_rmap(page)
      						    atomic_add_negative(-1, &page->_mapcount)
      __split_huge_page()
        anon_vma_interval_tree_foreach()
          __split_huge_page_splitting()
            page_check_address_pmd()
              mm_find_pmd()
      	  pmd_present(pmd) == 0	/* The same comment as above */
        /*
         * No crash this time since we already decremented page->_mapcount in
         * zap_huge_pmd().
         */
        BUG_ON(mapcount != page_mapcount(page))
      
        /*
         * We split the compound page here into small pages without
         * serialization against zap_huge_pmd()
         */
        __split_huge_page_refcount()
      						VM_BUG_ON_PAGE(!PageHead(page), page); // CRASH!!!
      
      So my understanding the problem is pmd_present() check in mm_find_pmd()
      without taking page table lock.
      
      The bug was introduced by me commit with commit 117b0791. Sorry for
      that. :(
      
      Let's open code mm_find_pmd() in page_check_address_pmd() and do the
      check under page table lock.
      
      Note that __page_check_address() does the same for PTE entires
      if sync != 0.
      
      I've stress tested split and zap code paths for 36+ hours by now and
      don't see crashes with the patch applied. Before it took <20 min to
      trigger the first bug and few hours for second one (if we ignore
      first).
      
      [1] https://lkml.kernel.org/g/<53440991.9090001@oracle.com>
      [2] https://lkml.kernel.org/g/<5310C56C.60709@oracle.com>
      Signed-off-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Reported-by: NSasha Levin <sasha.levin@oracle.com>
      Tested-by: NSasha Levin <sasha.levin@oracle.com>
      Cc: Bob Liu <lliubbo@gmail.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Michel Lespinasse <walken@google.com>
      Cc: Dave Jones <davej@redhat.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: <stable@vger.kernel.org>	[3.13+]
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      b5a8cad3
  16. 18 4月, 2014 1 次提交
  17. 08 4月, 2014 2 次提交
    • M
      memcg: rename high level charging functions · d715ae08
      Michal Hocko 提交于
      mem_cgroup_newpage_charge is used only for charging anonymous memory so
      it is better to rename it to mem_cgroup_charge_anon.
      
      mem_cgroup_cache_charge is used for file backed memory so rename it to
      mem_cgroup_charge_file.
      Signed-off-by: NMichal Hocko <mhocko@suse.cz>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      d715ae08
    • A
      mm: revert "thp: make MADV_HUGEPAGE check for mm->def_flags" · 1e1836e8
      Alex Thorlton 提交于
      The main motivation behind this patch is to provide a way to disable THP
      for jobs where the code cannot be modified, and using a malloc hook with
      madvise is not an option (i.e.  statically allocated data).  This patch
      allows us to do just that, without affecting other jobs running on the
      system.
      
      We need to do this sort of thing for jobs where THP hurts performance,
      due to the possibility of increased remote memory accesses that can be
      created by situations such as the following:
      
      When you touch 1 byte of an untouched, contiguous 2MB chunk, a THP will
      be handed out, and the THP will be stuck on whatever node the chunk was
      originally referenced from.  If many remote nodes need to do work on
      that same chunk, they'll be making remote accesses.
      
      With THP disabled, 4K pages can be handed out to separate nodes as
      they're needed, greatly reducing the amount of remote accesses to
      memory.
      
      This patch is based on some of my work combined with some
      suggestions/patches given by Oleg Nesterov.  The main goal here is to
      add a prctl switch to allow us to disable to THP on a per mm_struct
      basis.
      
      Here's a bit of test data with the new patch in place...
      
      First with the flag unset:
      
        # perf stat -a ./prctl_wrapper_mmv3 0 ./thp_pthread -C 0 -m 0 -c 512 -b 256g
        Setting thp_disabled for this task...
        thp_disable: 0
        Set thp_disabled state to 0
        Process pid = 18027
      
                                                                                                                             PF/
                                        MAX        MIN                                  TOTCPU/      TOT_PF/   TOT_PF/     WSEC/
        TYPE:               CPUS       WALL       WALL        SYS     USER     TOTCPU       CPU     WALL_SEC   SYS_SEC       CPU   NODES
         512      1.120      0.060      0.000    0.110      0.110     0.000    28571428864 -9223372036854775808  55803572      23
      
         Performance counter stats for './prctl_wrapper_mmv3_hack 0 ./thp_pthread -C 0 -m 0 -c 512 -b 256g':
      
          273719072.841402 task-clock                #  641.026 CPUs utilized           [100.00%]
                 1,008,986 context-switches          #    0.000 M/sec                   [100.00%]
                     7,717 CPU-migrations            #    0.000 M/sec                   [100.00%]
                 1,698,932 page-faults               #    0.000 M/sec
        355,222,544,890,379 cycles                   #    1.298 GHz                     [100.00%]
        536,445,412,234,588 stalled-cycles-frontend  #  151.02% frontend cycles idle    [100.00%]
        409,110,531,310,223 stalled-cycles-backend   #  115.17% backend  cycles idle    [100.00%]
        148,286,797,266,411 instructions             #    0.42  insns per cycle
                                                     #    3.62  stalled cycles per insn [100.00%]
        27,061,793,159,503 branches                  #   98.867 M/sec                   [100.00%]
             1,188,655,196 branch-misses             #    0.00% of all branches
      
             427.001706337 seconds time elapsed
      
      Now with the flag set:
      
        # perf stat -a ./prctl_wrapper_mmv3 1 ./thp_pthread -C 0 -m 0 -c 512 -b 256g
        Setting thp_disabled for this task...
        thp_disable: 1
        Set thp_disabled state to 1
        Process pid = 144957
      
                                                                                                                             PF/
                                        MAX        MIN                                  TOTCPU/      TOT_PF/   TOT_PF/     WSEC/
        TYPE:               CPUS       WALL       WALL        SYS     USER     TOTCPU       CPU     WALL_SEC   SYS_SEC       CPU   NODES
         512      0.620      0.260      0.250    0.320      0.570     0.001    51612901376 128000000000 100806448      23
      
         Performance counter stats for './prctl_wrapper_mmv3_hack 1 ./thp_pthread -C 0 -m 0 -c 512 -b 256g':
      
          138789390.540183 task-clock                #  641.959 CPUs utilized           [100.00%]
                   534,205 context-switches          #    0.000 M/sec                   [100.00%]
                     4,595 CPU-migrations            #    0.000 M/sec                   [100.00%]
                63,133,119 page-faults               #    0.000 M/sec
        147,977,747,269,768 cycles                   #    1.066 GHz                     [100.00%]
        200,524,196,493,108 stalled-cycles-frontend  #  135.51% frontend cycles idle    [100.00%]
        105,175,163,716,388 stalled-cycles-backend   #   71.07% backend  cycles idle    [100.00%]
        180,916,213,503,160 instructions             #    1.22  insns per cycle
                                                     #    1.11  stalled cycles per insn [100.00%]
        26,999,511,005,868 branches                  #  194.536 M/sec                   [100.00%]
               714,066,351 branch-misses             #    0.00% of all branches
      
             216.196778807 seconds time elapsed
      
      As with previous versions of the patch, We're getting about a 2x
      performance increase here.  Here's a link to the test case I used, along
      with the little wrapper to activate the flag:
      
        http://oss.sgi.com/projects/memtests/thp_pthread_mmprctlv3.tar.gz
      
      This patch (of 3):
      
      Revert commit 8e72033f and add in code to fix up any issues caused
      by the revert.
      
      The revert is necessary because hugepage_madvise would return -EINVAL
      when VM_NOHUGEPAGE is set, which will break subsequent chunks of this
      patch set.
      
      Here's a snip of an e-mail from Gerald detailing the original purpose of
      this code, and providing justification for the revert:
      
        "The intent of commit 8e72033f was to guard against any future
         programming errors that may result in an madvice(MADV_HUGEPAGE) on
         guest mappings, which would crash the kernel.
      
         Martin suggested adding the bit to arch/s390/mm/pgtable.c, if
         8e72033f was to be reverted, because that check will also prevent
         a kernel crash in the case described above, it will now send a
         SIGSEGV instead.
      
         This would now also allow to do the madvise on other parts, if
         needed, so it is a more flexible approach.  One could also say that
         it would have been better to do it this way right from the
         beginning..."
      Signed-off-by: NAlex Thorlton <athorlton@sgi.com>
      Suggested-by: NOleg Nesterov <oleg@redhat.com>
      Tested-by: NChristian Borntraeger <borntraeger@de.ibm.com>
      Cc: Gerald Schaefer <gerald.schaefer@de.ibm.com>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      1e1836e8
  18. 04 4月, 2014 1 次提交
  19. 04 3月, 2014 1 次提交
    • V
      mm: include VM_MIXEDMAP flag in the VM_SPECIAL list to avoid m(un)locking · 9050d7eb
      Vlastimil Babka 提交于
      Daniel Borkmann reported a VM_BUG_ON assertion failing:
      
        ------------[ cut here ]------------
        kernel BUG at mm/mlock.c:528!
        invalid opcode: 0000 [#1] SMP
        Modules linked in: ccm arc4 iwldvm [...]
         video
        CPU: 3 PID: 2266 Comm: netsniff-ng Not tainted 3.14.0-rc2+ #8
        Hardware name: LENOVO 2429BP3/2429BP3, BIOS G4ET37WW (1.12 ) 05/29/2012
        task: ffff8801f87f9820 ti: ffff88002cb44000 task.ti: ffff88002cb44000
        RIP: 0010:[<ffffffff81171ad0>]  [<ffffffff81171ad0>] munlock_vma_pages_range+0x2e0/0x2f0
        Call Trace:
          do_munmap+0x18f/0x3b0
          vm_munmap+0x41/0x60
          SyS_munmap+0x22/0x30
          system_call_fastpath+0x1a/0x1f
        RIP   munlock_vma_pages_range+0x2e0/0x2f0
        ---[ end trace a0088dcf07ae10f2 ]---
      
      because munlock_vma_pages_range() thinks it's unexpectedly in the middle
      of a THP page.  This can be reproduced with default config since 3.11
      kernels.  A reproducer can be found in the kernel's selftest directory
      for networking by running ./psock_tpacket.
      
      The problem is that an order=2 compound page (allocated by
      alloc_one_pg_vec_page() is part of the munlocked VM_MIXEDMAP vma (mapped
      by packet_mmap()) and mistaken for a THP page and assumed to be order=9.
      
      The checks for THP in munlock came with commit ff6a6da6 ("mm:
      accelerate munlock() treatment of THP pages"), i.e.  since 3.9, but did
      not trigger a bug.  It just makes munlock_vma_pages_range() skip such
      compound pages until the next 512-pages-aligned page, when it encounters
      a head page.  This is however not a problem for vma's where mlocking has
      no effect anyway, but it can distort the accounting.
      
      Since commit 7225522b ("mm: munlock: batch non-THP page isolation
      and munlock+putback using pagevec") this can trigger a VM_BUG_ON in
      PageTransHuge() check.
      
      This patch fixes the issue by adding VM_MIXEDMAP flag to VM_SPECIAL, a
      list of flags that make vma's non-mlockable and non-mergeable.  The
      reasoning is that VM_MIXEDMAP vma's are similar to VM_PFNMAP, which is
      already on the VM_SPECIAL list, and both are intended for non-LRU pages
      where mlocking makes no sense anyway.  Related Lkml discussion can be
      found in [2].
      
       [1] tools/testing/selftests/net/psock_tpacket
       [2] https://lkml.org/lkml/2014/1/10/427Signed-off-by: NVlastimil Babka <vbabka@suse.cz>
      Signed-off-by: NDaniel Borkmann <dborkman@redhat.com>
      Reported-by: NDaniel Borkmann <dborkman@redhat.com>
      Tested-by: NDaniel Borkmann <dborkman@redhat.com>
      Cc: Thomas Hellstrom <thellstrom@vmware.com>
      Cc: John David Anglin <dave.anglin@bell.net>
      Cc: HATAYAMA Daisuke <d.hatayama@jp.fujitsu.com>
      Cc: Konstantin Khlebnikov <khlebnikov@openvz.org>
      Cc: Carsten Otte <cotte@de.ibm.com>
      Cc: Jared Hulbert <jaredeh@gmail.com>
      Tested-by: NHannes Frederic Sowa <hannes@stressinduktion.org>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Acked-by: NRik van Riel <riel@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: <stable@vger.kernel.org> [3.11.x+]
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      9050d7eb
  20. 26 2月, 2014 1 次提交
    • K
      mm, thp: fix infinite loop on memcg OOM · 9845cbbd
      Kirill A. Shutemov 提交于
      Masayoshi Mizuma reported a bug with the hang of an application under
      the memcg limit.  It happens on write-protection fault to huge zero page
      
      If we successfully allocate a huge page to replace zero page but hit the
      memcg limit we need to split the zero page with split_huge_page_pmd()
      and fallback to small pages.
      
      The other part of the problem is that VM_FAULT_OOM has special meaning
      in do_huge_pmd_wp_page() context.  __handle_mm_fault() expects the page
      to be split if it sees VM_FAULT_OOM and it will will retry page fault
      handling.  This causes an infinite loop if the page was not split.
      
      do_huge_pmd_wp_zero_page_fallback() can return VM_FAULT_OOM if it failed
      to allocate one small page, so fallback to small pages will not help.
      
      The solution for this part is to replace VM_FAULT_OOM with
      VM_FAULT_FALLBACK is fallback required.
      Signed-off-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Reported-by: NMasayoshi Mizuma <m.mizuma@jp.fujitsu.com>
      Reviewed-by: NMichal Hocko <mhocko@suse.cz>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      9845cbbd
  21. 17 2月, 2014 1 次提交
    • A
      mm: Use ptep/pmdp_set_numa() for updating _PAGE_NUMA bit · 56eecdb9
      Aneesh Kumar K.V 提交于
      Archs like ppc64 doesn't do tlb flush in set_pte/pmd functions when using
      a hash table MMU for various reasons (the flush is handled as part of
      the PTE modification when necessary).
      
      ppc64 thus doesn't implement flush_tlb_range for hash based MMUs.
      
      Additionally ppc64 require the tlb flushing to be batched within ptl locks.
      
      The reason to do that is to ensure that the hash page table is in sync with
      linux page table.
      
      We track the hpte index in linux pte and if we clear them without flushing
      hash and drop the ptl lock, we can have another cpu update the pte and can
      end up with duplicate entry in the hash table, which is fatal.
      
      We also want to keep set_pte_at simpler by not requiring them to do hash
      flush for performance reason. We do that by assuming that set_pte_at() is
      never *ever* called on a PTE that is already valid.
      
      This was the case until the NUMA code went in which broke that assumption.
      
      Fix that by introducing a new pair of helpers to set _PAGE_NUMA in a
      way similar to ptep/pmdp_set_wrprotect(), with a generic implementation
      using set_pte_at() and a powerpc specific one using the appropriate
      mechanism needed to keep the hash table in sync.
      Acked-by: NMel Gorman <mgorman@suse.de>
      Reviewed-by: NRik van Riel <riel@redhat.com>
      Signed-off-by: NAneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Signed-off-by: NBenjamin Herrenschmidt <benh@kernel.crashing.org>
      56eecdb9