1. 09 9月, 2015 2 次提交
  2. 05 9月, 2015 2 次提交
  3. 10 7月, 2015 1 次提交
  4. 25 6月, 2015 1 次提交
    • M
      mm, memcg: Try charging a page before setting page up to date · eb3c24f3
      Mel Gorman 提交于
      Historically memcg overhead was high even if memcg was unused.  This has
      improved a lot but it still showed up in a profile summary as being a
      problem.
      
      /usr/src/linux-4.0-vanilla/mm/memcontrol.c                           6.6441   395842
        mem_cgroup_try_charge                                                        2.950%   175781
        __mem_cgroup_count_vm_event                                                  1.431%    85239
        mem_cgroup_page_lruvec                                                       0.456%    27156
        mem_cgroup_commit_charge                                                     0.392%    23342
        uncharge_list                                                                0.323%    19256
        mem_cgroup_update_lru_size                                                   0.278%    16538
        memcg_check_events                                                           0.216%    12858
        mem_cgroup_charge_statistics.isra.22                                         0.188%    11172
        try_charge                                                                   0.150%     8928
        commit_charge                                                                0.141%     8388
        get_mem_cgroup_from_mm                                                       0.121%     7184
      
      That is showing that 6.64% of system CPU cycles were in memcontrol.c and
      dominated by mem_cgroup_try_charge.  The annotation shows that the bulk
      of the cost was checking PageSwapCache which is expected to be cache hot
      but is very expensive.  The problem appears to be that __SetPageUptodate
      is called just before the check which is a write barrier.  It is
      required to make sure struct page and page data is written before the
      PTE is updated and the data visible to userspace.  memcg charging does
      not require or need the barrier but gets unfairly hit with the cost so
      this patch attempts the charging before the barrier.  Aside from the
      accidental cost to memcg there is the added benefit that the barrier is
      avoided if the page cannot be charged.  When applied the relevant
      profile summary is as follows.
      
      /usr/src/linux-4.0-chargefirst-v2r1/mm/memcontrol.c                  3.7907   223277
        __mem_cgroup_count_vm_event                                                  1.143%    67312
        mem_cgroup_page_lruvec                                                       0.465%    27403
        mem_cgroup_commit_charge                                                     0.381%    22452
        uncharge_list                                                                0.332%    19543
        mem_cgroup_update_lru_size                                                   0.284%    16704
        get_mem_cgroup_from_mm                                                       0.271%    15952
        mem_cgroup_try_charge                                                        0.237%    13982
        memcg_check_events                                                           0.222%    13058
        mem_cgroup_charge_statistics.isra.22                                         0.185%    10920
        commit_charge                                                                0.140%     8235
        try_charge                                                                   0.131%     7716
      
      That brings the overhead down to 3.79% and leaves the memcg fault
      accounting to the root cgroup but it's an improvement.  The difference
      in headline performance of the page fault microbench is marginal as
      memcg is such a small component of it.
      
      pft faults
                                             4.0.0                  4.0.0
                                           vanilla            chargefirst
      Hmean    faults/cpu-1 1443258.1051 (  0.00%) 1509075.7561 (  4.56%)
      Hmean    faults/cpu-3 1340385.9270 (  0.00%) 1339160.7113 ( -0.09%)
      Hmean    faults/cpu-5  875599.0222 (  0.00%)  874174.1255 ( -0.16%)
      Hmean    faults/cpu-7  601146.6726 (  0.00%)  601370.9977 (  0.04%)
      Hmean    faults/cpu-8  510728.2754 (  0.00%)  510598.8214 ( -0.03%)
      Hmean    faults/sec-1 1432084.7845 (  0.00%) 1497935.5274 (  4.60%)
      Hmean    faults/sec-3 3943818.1437 (  0.00%) 3941920.1520 ( -0.05%)
      Hmean    faults/sec-5 3877573.5867 (  0.00%) 3869385.7553 ( -0.21%)
      Hmean    faults/sec-7 3991832.0418 (  0.00%) 3992181.4189 (  0.01%)
      Hmean    faults/sec-8 3987189.8167 (  0.00%) 3986452.2204 ( -0.02%)
      
      It's only visible at single threaded. The overhead is there for higher
      threads but other factors dominate.
      Signed-off-by: NMel Gorman <mgorman@suse.de>
      Acked-by: NMichal Hocko <mhocko@suse.cz>
      Acked-by: NJohannes Weiner <hannes@cmpxchg.org>
      Cc: Tejun Heo <tj@kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      eb3c24f3
  5. 24 6月, 2015 1 次提交
  6. 19 5月, 2015 1 次提交
    • D
      sched/preempt, mm/fault: Trigger might_sleep() in might_fault() with disabled pagefaults · 9ec23531
      David Hildenbrand 提交于
      Commit 662bbcb2 ("mm, sched: Allow uaccess in atomic with
      pagefault_disable()") removed might_sleep() checks for all user access
      code (that uses might_fault()).
      
      The reason was to disable wrong "sleep in atomic" warnings in the
      following scenario:
      
          pagefault_disable()
          rc = copy_to_user(...)
          pagefault_enable()
      
      Which is valid, as pagefault_disable() increments the preempt counter
      and therefore disables the pagefault handler. copy_to_user() will not
      sleep and return an error code if a page is not available.
      
      However, as all might_sleep() checks are removed,
      CONFIG_DEBUG_ATOMIC_SLEEP would no longer detect the following scenario:
      
          spin_lock(&lock);
          rc = copy_to_user(...)
          spin_unlock(&lock)
      
      If the kernel is compiled with preemption turned on, preempt_disable()
      will make in_atomic() detect disabled preemption. The fault handler would
      correctly never sleep on user access.
      However, with preemption turned off, preempt_disable() is usually a NOP
      (with !CONFIG_PREEMPT_COUNT), therefore in_atomic() will not be able to
      detect disabled preemption nor disabled pagefaults. The fault handler
      could sleep.
      We really want to enable CONFIG_DEBUG_ATOMIC_SLEEP checks for user access
      functions again, otherwise we can end up with horrible deadlocks.
      
      Root of all evil is that pagefault_disable() acts almost as
      preempt_disable(), depending on preemption being turned on/off.
      
      As we now have pagefault_disabled(), we can use it to distinguish
      whether user acces functions might sleep.
      
      Convert might_fault() into a makro that calls __might_fault(), to
      allow proper file + line messages in case of a might_sleep() warning.
      Reviewed-and-tested-by: NThomas Gleixner <tglx@linutronix.de>
      Signed-off-by: NDavid Hildenbrand <dahi@linux.vnet.ibm.com>
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: David.Laight@ACULAB.COM
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: airlied@linux.ie
      Cc: akpm@linux-foundation.org
      Cc: benh@kernel.crashing.org
      Cc: bigeasy@linutronix.de
      Cc: borntraeger@de.ibm.com
      Cc: daniel.vetter@intel.com
      Cc: heiko.carstens@de.ibm.com
      Cc: herbert@gondor.apana.org.au
      Cc: hocko@suse.cz
      Cc: hughd@google.com
      Cc: mst@redhat.com
      Cc: paulus@samba.org
      Cc: ralf@linux-mips.org
      Cc: schwidefsky@de.ibm.com
      Cc: yang.shi@windriver.com
      Link: http://lkml.kernel.org/r/1431359540-32227-3-git-send-email-dahi@linux.vnet.ibm.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
      9ec23531
  7. 16 4月, 2015 3 次提交
    • B
      mm: new pfn_mkwrite same as page_mkwrite for VM_PFNMAP · dd906184
      Boaz Harrosh 提交于
      This will allow FS that uses VM_PFNMAP | VM_MIXEDMAP (no page structs) to
      get notified when access is a write to a read-only PFN.
      
      This can happen if we mmap() a file then first mmap-read from it to
      page-in a read-only PFN, than we mmap-write to the same page.
      
      We need this functionality to fix a DAX bug, where in the scenario above
      we fail to set ctime/mtime though we modified the file.  An xfstest is
      attached to this patchset that shows the failure and the fix.  (A DAX
      patch will follow)
      
      This functionality is extra important for us, because upon dirtying of a
      pmem page we also want to RDMA the page to a remote cluster node.
      
      We define a new pfn_mkwrite and do not reuse page_mkwrite because
        1 - The name ;-)
        2 - But mainly because it would take a very long and tedious
            audit of all page_mkwrite functions of VM_MIXEDMAP/VM_PFNMAP
            users. To make sure they do not now CRASH. For example current
            DAX code (which this is for) would crash.
            If we would want to reuse page_mkwrite, We will need to first
            patch all users, so to not-crash-on-no-page. Then enable this
            patch. But even if I did that I would not sleep so well at night.
            Adding a new vector is the safest thing to do, and is not that
            expensive. an extra pointer at a static function vector per driver.
            Also the new vector is better for performance, because else we
            Will call all current Kernel vectors, so to:
              check-ha-no-page-do-nothing and return.
      
      No need to call it from do_shared_fault because do_wp_page is called to
      change pte permissions anyway.
      Signed-off-by: NYigal Korman <yigal@plexistor.com>
      Signed-off-by: NBoaz Harrosh <boaz@plexistor.com>
      Acked-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Matthew Wilcox <matthew.r.wilcox@intel.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Dave Chinner <david@fromorbit.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      dd906184
    • K
      mm/memory: also print a_ops->readpage in print_bad_pte() · 2682582a
      Konstantin Khlebnikov 提交于
      A lot of filesystems use generic_file_mmap() and filemap_fault(),
      f_op->mmap and vm_ops->fault aren't enough to identify filesystem.
      
      This prints file name, vm_ops->fault, f_op->mmap and a_ops->readpage
      (which is almost always implemented and filesystem-specific).
      
      Example:
      
      [   23.676410] BUG: Bad page map in process sh  pte:1b7e6025 pmd:19bbd067
      [   23.676887] page:ffffea00006df980 count:4 mapcount:1 mapping:ffff8800196426c0 index:0x97
      [   23.677481] flags: 0x10000000000000c(referenced|uptodate)
      [   23.677896] page dumped because: bad pte
      [   23.678205] addr:00007f52fcb17000 vm_flags:00000075 anon_vma:          (null) mapping:ffff8800196426c0 index:97
      [   23.678922] file:libc-2.19.so fault:filemap_fault mmap:generic_file_readonly_mmap readpage:v9fs_vfs_readpage
      
      [akpm@linux-foundation.org: use pr_alert, per Kirill]
      Signed-off-by: NKonstantin Khlebnikov <khlebnikov@yandex-team.ru>
      Cc: Sasha Levin <sasha.levin@oracle.com>
      Acked-by: NKirill A. Shutemov <kirill@shutemov.name>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      2682582a
    • J
      mm: remove rest of ACCESS_ONCE() usages · 4db0c3c2
      Jason Low 提交于
      We converted some of the usages of ACCESS_ONCE to READ_ONCE in the mm/
      tree since it doesn't work reliably on non-scalar types.
      
      This patch removes the rest of the usages of ACCESS_ONCE, and use the new
      READ_ONCE API for the read accesses.  This makes things cleaner, instead
      of using separate/multiple sets of APIs.
      Signed-off-by: NJason Low <jason.low2@hp.com>
      Acked-by: NMichal Hocko <mhocko@suse.cz>
      Acked-by: NDavidlohr Bueso <dave@stgolabs.net>
      Acked-by: NRik van Riel <riel@redhat.com>
      Reviewed-by: NChristian Borntraeger <borntraeger@de.ibm.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      4db0c3c2
  8. 15 4月, 2015 4 次提交
  9. 26 3月, 2015 3 次提交
    • M
      mm: numa: slow PTE scan rate if migration failures occur · 074c2381
      Mel Gorman 提交于
      Dave Chinner reported the following on https://lkml.org/lkml/2015/3/1/226
      
        Across the board the 4.0-rc1 numbers are much slower, and the degradation
        is far worse when using the large memory footprint configs. Perf points
        straight at the cause - this is from 4.0-rc1 on the "-o bhash=101073" config:
      
         -   56.07%    56.07%  [kernel]            [k] default_send_IPI_mask_sequence_phys
            - default_send_IPI_mask_sequence_phys
               - 99.99% physflat_send_IPI_mask
                  - 99.37% native_send_call_func_ipi
                       smp_call_function_many
                     - native_flush_tlb_others
                        - 99.85% flush_tlb_page
                             ptep_clear_flush
                             try_to_unmap_one
                             rmap_walk
                             try_to_unmap
                             migrate_pages
                             migrate_misplaced_page
                           - handle_mm_fault
                              - 99.73% __do_page_fault
                                   trace_do_page_fault
                                   do_async_page_fault
                                 + async_page_fault
                    0.63% native_send_call_func_single_ipi
                       generic_exec_single
                       smp_call_function_single
      
      This is showing excessive migration activity even though excessive
      migrations are meant to get throttled.  Normally, the scan rate is tuned
      on a per-task basis depending on the locality of faults.  However, if
      migrations fail for any reason then the PTE scanner may scan faster if
      the faults continue to be remote.  This means there is higher system CPU
      overhead and fault trapping at exactly the time we know that migrations
      cannot happen.  This patch tracks when migration failures occur and
      slows the PTE scanner.
      Signed-off-by: NMel Gorman <mgorman@suse.de>
      Reported-by: NDave Chinner <david@fromorbit.com>
      Tested-by: NDave Chinner <david@fromorbit.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Aneesh Kumar <aneesh.kumar@linux.vnet.ibm.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      074c2381
    • M
      mm: numa: preserve PTE write permissions across a NUMA hinting fault · b191f9b1
      Mel Gorman 提交于
      Protecting a PTE to trap a NUMA hinting fault clears the writable bit
      and further faults are needed after trapping a NUMA hinting fault to set
      the writable bit again.  This patch preserves the writable bit when
      trapping NUMA hinting faults.  The impact is obvious from the number of
      minor faults trapped during the basis balancing benchmark and the system
      CPU usage;
      
        autonumabench
                                                   4.0.0-rc4             4.0.0-rc4
                                                    baseline              preserve
        Time System-NUMA01                  107.13 (  0.00%)      103.13 (  3.73%)
        Time System-NUMA01_THEADLOCAL       131.87 (  0.00%)       83.30 ( 36.83%)
        Time System-NUMA02                    8.95 (  0.00%)       10.72 (-19.78%)
        Time System-NUMA02_SMT                4.57 (  0.00%)        3.99 ( 12.69%)
        Time Elapsed-NUMA01                 515.78 (  0.00%)      517.26 ( -0.29%)
        Time Elapsed-NUMA01_THEADLOCAL      384.10 (  0.00%)      384.31 ( -0.05%)
        Time Elapsed-NUMA02                  48.86 (  0.00%)       48.78 (  0.16%)
        Time Elapsed-NUMA02_SMT              47.98 (  0.00%)       48.12 ( -0.29%)
      
                     4.0.0-rc4   4.0.0-rc4
                      baseline    preserve
        User          44383.95    43971.89
        System          252.61      201.24
        Elapsed         998.68     1000.94
      
        Minor Faults   2597249     1981230
        Major Faults       365         364
      
      There is a similar drop in system CPU usage using Dave Chinner's xfsrepair
      workload
      
                                            4.0.0-rc4             4.0.0-rc4
                                             baseline              preserve
        Amean    real-xfsrepair      454.14 (  0.00%)      442.36 (  2.60%)
        Amean    syst-xfsrepair      277.20 (  0.00%)      204.68 ( 26.16%)
      
      The patch looks hacky but the alternatives looked worse.  The tidest was
      to rewalk the page tables after a hinting fault but it was more complex
      than this approach and the performance was worse.  It's not generally
      safe to just mark the page writable during the fault if it's a write
      fault as it may have been read-only for COW so that approach was
      discarded.
      Signed-off-by: NMel Gorman <mgorman@suse.de>
      Reported-by: NDave Chinner <david@fromorbit.com>
      Tested-by: NDave Chinner <david@fromorbit.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Aneesh Kumar <aneesh.kumar@linux.vnet.ibm.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      b191f9b1
    • M
      mm: numa: group related processes based on VMA flags instead of page table flags · bea66fbd
      Mel Gorman 提交于
      These are three follow-on patches based on the xfsrepair workload Dave
      Chinner reported was problematic in 4.0-rc1 due to changes in page table
      management -- https://lkml.org/lkml/2015/3/1/226.
      
      Much of the problem was reduced by commit 53da3bc2 ("mm: fix up numa
      read-only thread grouping logic") and commit ba68bc01 ("mm: thp:
      Return the correct value for change_huge_pmd").  It was known that the
      performance in 3.19 was still better even if is far less safe.  This
      series aims to restore the performance without compromising on safety.
      
      For the test of this mail, I'm comparing 3.19 against 4.0-rc4 and the
      three patches applied on top
      
        autonumabench
                                                      3.19.0             4.0.0-rc4             4.0.0-rc4             4.0.0-rc4             4.0.0-rc4
                                                     vanilla               vanilla          vmwrite-v5r8         preserve-v5r8         slowscan-v5r8
        Time System-NUMA01                  124.00 (  0.00%)      161.86 (-30.53%)      107.13 ( 13.60%)      103.13 ( 16.83%)      145.01 (-16.94%)
        Time System-NUMA01_THEADLOCAL       115.54 (  0.00%)      107.64 (  6.84%)      131.87 (-14.13%)       83.30 ( 27.90%)       92.35 ( 20.07%)
        Time System-NUMA02                    9.35 (  0.00%)       10.44 (-11.66%)        8.95 (  4.28%)       10.72 (-14.65%)        8.16 ( 12.73%)
        Time System-NUMA02_SMT                3.87 (  0.00%)        4.63 (-19.64%)        4.57 (-18.09%)        3.99 ( -3.10%)        3.36 ( 13.18%)
        Time Elapsed-NUMA01                 570.06 (  0.00%)      567.82 (  0.39%)      515.78 (  9.52%)      517.26 (  9.26%)      543.80 (  4.61%)
        Time Elapsed-NUMA01_THEADLOCAL      393.69 (  0.00%)      384.83 (  2.25%)      384.10 (  2.44%)      384.31 (  2.38%)      380.73 (  3.29%)
        Time Elapsed-NUMA02                  49.09 (  0.00%)       49.33 ( -0.49%)       48.86 (  0.47%)       48.78 (  0.63%)       50.94 ( -3.77%)
        Time Elapsed-NUMA02_SMT              47.51 (  0.00%)       47.15 (  0.76%)       47.98 ( -0.99%)       48.12 ( -1.28%)       49.56 ( -4.31%)
      
                      3.19.0   4.0.0-rc4   4.0.0-rc4   4.0.0-rc4   4.0.0-rc4
                     vanilla     vanillavmwrite-v5r8preserve-v5r8slowscan-v5r8
        User        46334.60    46391.94    44383.95    43971.89    44372.12
        System        252.84      284.66      252.61      201.24      249.00
        Elapsed      1062.14     1050.96      998.68     1000.94     1026.78
      
      Overall the system CPU usage is comparable and the test is naturally a
      bit variable.  The slowing of the scanner hurts numa01 but on this
      machine it is an adverse workload and patches that dramatically help it
      often hurt absolutely everything else.
      
      Due to patch 2, the fault activity is interesting
      
                                        3.19.0   4.0.0-rc4   4.0.0-rc4   4.0.0-rc4   4.0.0-rc4
                                       vanilla     vanillavmwrite-v5r8preserve-v5r8slowscan-v5r8
        Minor Faults                   2097811     2656646     2597249     1981230     1636841
        Major Faults                       362         450         365         364         365
      
      Note the impact preserving the write bit across protection updates and
      fault reduces faults.
      
        NUMA alloc hit                 1229008     1217015     1191660     1178322     1199681
        NUMA alloc miss                      0           0           0           0           0
        NUMA interleave hit                  0           0           0           0           0
        NUMA alloc local               1228514     1216317     1190871     1177448     1199021
        NUMA base PTE updates        245706197   240041607   238195516   244704842   115012800
        NUMA huge PMD updates           479530      468448      464868      477573      224487
        NUMA page range updates      491225557   479886983   476207932   489222218   229950144
        NUMA hint faults                659753      656503      641678      656926      294842
        NUMA hint local faults          381604      373963      360478      337585      186249
        NUMA hint local percent             57          56          56          51          63
        NUMA pages migrated            5412140     6374899     6266530     5277468     5755096
        AutoNUMA cost                    5121%       5083%       4994%       5097%       2388%
      
      Here the impact of slowing the PTE scanner on migratrion failures is
      obvious as "NUMA base PTE updates" and "NUMA huge PMD updates" are
      massively reduced even though the headline performance is very similar.
      
      As xfsrepair was the reported workload here is the impact of the series
      on it.
      
        xfsrepair
                                               3.19.0             4.0.0-rc4             4.0.0-rc4             4.0.0-rc4             4.0.0-rc4
                                              vanilla               vanilla          vmwrite-v5r8         preserve-v5r8         slowscan-v5r8
        Min      real-fsmark        1183.29 (  0.00%)     1165.73 (  1.48%)     1152.78 (  2.58%)     1153.64 (  2.51%)     1177.62 (  0.48%)
        Min      syst-fsmark        4107.85 (  0.00%)     4027.75 (  1.95%)     3986.74 (  2.95%)     3979.16 (  3.13%)     4048.76 (  1.44%)
        Min      real-xfsrepair      441.51 (  0.00%)      463.96 ( -5.08%)      449.50 ( -1.81%)      440.08 (  0.32%)      439.87 (  0.37%)
        Min      syst-xfsrepair      195.76 (  0.00%)      278.47 (-42.25%)      262.34 (-34.01%)      203.70 ( -4.06%)      143.64 ( 26.62%)
        Amean    real-fsmark        1188.30 (  0.00%)     1177.34 (  0.92%)     1157.97 (  2.55%)     1158.21 (  2.53%)     1182.22 (  0.51%)
        Amean    syst-fsmark        4111.37 (  0.00%)     4055.70 (  1.35%)     3987.19 (  3.02%)     3998.72 (  2.74%)     4061.69 (  1.21%)
        Amean    real-xfsrepair      450.88 (  0.00%)      468.32 ( -3.87%)      454.14 ( -0.72%)      442.36 (  1.89%)      440.59 (  2.28%)
        Amean    syst-xfsrepair      199.66 (  0.00%)      290.60 (-45.55%)      277.20 (-38.84%)      204.68 ( -2.51%)      150.55 ( 24.60%)
        Stddev   real-fsmark           4.12 (  0.00%)       10.82 (-162.29%)       4.14 ( -0.28%)        5.98 (-45.05%)        4.60 (-11.53%)
        Stddev   syst-fsmark           2.63 (  0.00%)       20.32 (-671.82%)       0.37 ( 85.89%)       16.47 (-525.59%)      15.05 (-471.79%)
        Stddev   real-xfsrepair        6.87 (  0.00%)        4.55 ( 33.75%)        3.46 ( 49.58%)        1.78 ( 74.12%)        0.52 ( 92.50%)
        Stddev   syst-xfsrepair        3.02 (  0.00%)       10.30 (-241.37%)      13.17 (-336.37%)       0.71 ( 76.63%)        5.00 (-65.61%)
        CoeffVar real-fsmark           0.35 (  0.00%)        0.92 (-164.73%)       0.36 ( -2.91%)        0.52 (-48.82%)        0.39 (-12.10%)
        CoeffVar syst-fsmark           0.06 (  0.00%)        0.50 (-682.41%)       0.01 ( 85.45%)        0.41 (-543.22%)       0.37 (-478.78%)
        CoeffVar real-xfsrepair        1.52 (  0.00%)        0.97 ( 36.21%)        0.76 ( 49.94%)        0.40 ( 73.62%)        0.12 ( 92.33%)
        CoeffVar syst-xfsrepair        1.51 (  0.00%)        3.54 (-134.54%)       4.75 (-214.31%)       0.34 ( 77.20%)        3.32 (-119.63%)
        Max      real-fsmark        1193.39 (  0.00%)     1191.77 (  0.14%)     1162.90 (  2.55%)     1166.66 (  2.24%)     1188.50 (  0.41%)
        Max      syst-fsmark        4114.18 (  0.00%)     4075.45 (  0.94%)     3987.65 (  3.08%)     4019.45 (  2.30%)     4082.80 (  0.76%)
        Max      real-xfsrepair      457.80 (  0.00%)      474.60 ( -3.67%)      457.82 ( -0.00%)      444.42 (  2.92%)      441.03 (  3.66%)
        Max      syst-xfsrepair      203.11 (  0.00%)      303.65 (-49.50%)      294.35 (-44.92%)      205.33 ( -1.09%)      155.28 ( 23.55%)
      
      The really relevant lines as syst-xfsrepair which is the system CPU
      usage when running xfsrepair.  Note that on my machine the overhead was
      45% higher on 4.0-rc4 which may be part of what Dave is seeing.  Once we
      preserve the write bit across faults, it's only 2.51% higher on average.
      With the full series applied, system CPU usage is 24.6% lower on
      average.
      
      Again, the impact of preserving the write bit on minor faults is obvious
      and the impact of slowing scanning after migration failures is obvious
      on the PTE updates.  Note also that the number of pages migrated is much
      reduced even though the headline performance is comparable.
      
                                        3.19.0   4.0.0-rc4   4.0.0-rc4   4.0.0-rc4   4.0.0-rc4
                                       vanilla     vanillavmwrite-v5r8preserve-v5r8slowscan-v5r8
        Minor Faults                 153466827   254507978   249163829   153501373   105737890
        Major Faults                       610         702         690         649         724
        NUMA base PTE updates        217735049   210756527   217729596   216937111   144344993
        NUMA huge PMD updates           129294       85044      106921      127246       79887
        NUMA pages migrated           21938995    29705270    28594162    22687324    16258075
      
                              3.19.0   4.0.0-rc4   4.0.0-rc4   4.0.0-rc4   4.0.0-rc4
                             vanilla     vanillavmwrite-v5r8preserve-v5r8slowscan-v5r8
        Mean sdb-avgqusz       13.47        2.54        2.55        2.47        2.49
        Mean sdb-avgrqsz      202.32      140.22      139.50      139.02      138.12
        Mean sdb-await         25.92        5.09        5.33        5.02        5.22
        Mean sdb-r_await        4.71        0.19        0.83        0.51        0.11
        Mean sdb-w_await      104.13        5.21        5.38        5.05        5.32
        Mean sdb-svctm          0.59        0.13        0.14        0.13        0.14
        Mean sdb-rrqm           0.16        0.00        0.00        0.00        0.00
        Mean sdb-wrqm           3.59     1799.43     1826.84     1812.21     1785.67
        Max  sdb-avgqusz      111.06       12.13       14.05       11.66       15.60
        Max  sdb-avgrqsz      255.60      190.34      190.01      187.33      191.78
        Max  sdb-await        168.24       39.28       49.22       44.64       65.62
        Max  sdb-r_await      660.00       52.00      280.00       76.00       12.00
        Max  sdb-w_await     7804.00       39.28       49.22       44.64       65.62
        Max  sdb-svctm          4.00        2.82        2.86        1.98        2.84
        Max  sdb-rrqm           8.30        0.00        0.00        0.00        0.00
        Max  sdb-wrqm          34.20     5372.80     5278.60     5386.60     5546.15
      
      FWIW, I also checked SPECjbb in different configurations but it's
      similar observations -- minor faults lower, PTE update activity lower
      and performance is roughly comparable against 3.19.
      
      This patch (of 3):
      
      Threads that share writable data within pages are grouped together as
      related tasks.  This decision is based on whether the PTE is marked
      dirty which is subject to timing races between the PTE scanner update
      and when the application writes the page.  If the page is file-backed,
      then background flushes and sync also affect placement.  This is
      unpredictable behaviour which is impossible to reason about so this
      patch makes grouping decisions based on the VMA flags.
      Signed-off-by: NMel Gorman <mgorman@suse.de>
      Reported-by: NDave Chinner <david@fromorbit.com>
      Tested-by: NDave Chinner <david@fromorbit.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Aneesh Kumar <aneesh.kumar@linux.vnet.ibm.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      bea66fbd
  10. 12 3月, 2015 1 次提交
    • L
      mm: fix up numa read-only thread grouping logic · 53da3bc2
      Linus Torvalds 提交于
      Dave Chinner reported that commit 4d942466 ("mm: convert
      p[te|md]_mknonnuma and remaining page table manipulations") slowed down
      his xfsrepair test enormously.  In particular, it was using more system
      time due to extra TLB flushing.
      
      The ultimate reason turns out to be how the change to use the regular
      page table accessor functions broke the NUMA grouping logic.  The old
      special mknuma/mknonnuma code accessed the page table present bit and
      the magic NUMA bit directly, while the new code just changes the page
      protections using PROT_NONE and the regular vma protections.
      
      That sounds equivalent, and from a fault standpoint it really is, but a
      subtle side effect is that the *other* protection bits of the page table
      entries also change.  And the code to decide how to group the NUMA
      entries together used the writable bit to decide whether a particular
      page was likely to be shared read-only or not.
      
      And with the change to make the NUMA handling use the regular permission
      setting functions, that writable bit was basically always cleared for
      private mappings due to COW.  So even if the page actually ends up being
      written to in the end, the NUMA balancing would act as if it was always
      shared RO.
      
      This code is a heuristic anyway, so the fix - at least for now - is to
      instead check whether the page is dirty rather than writable.  The bit
      doesn't change with protection changes.
      
      NOTE! This also adds a FIXME comment to revisit this issue,
      
      Not only should we probably re-visit the whole "is this a shared
      read-only page" heuristic (we might want to take the vma permissions
      into account and base this more on those than the per-page ones, and
      also look at whether the particular access that triggers it is a write
      or not), but the whole COW issue shows that we should think about the
      NUMA fault handling some more.
      
      For example, maybe we should do the early-COW thing that a regular fault
      does.  Or maybe we should accept that while using the same bits as
      PROTNONE was a good thing (and got rid of the specual NUMA bit), we
      might still want to just preseve the other protection bits across NUMA
      faulting.
      
      Those are bigger questions, left for later.  This just fixes up the
      heuristic so that it at least approximates working again.  More analysis
      and work needed.
      Reported-by: NDave Chinner <david@fromorbit.com>
      Tested-by: NMel Gorman <mgorman@suse.de>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Aneesh Kumar <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Ingo Molnar <mingo@kernel.org>,
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      53da3bc2
  11. 17 2月, 2015 2 次提交
    • M
      mm: allow page fault handlers to perform the COW · 2e4cdab0
      Matthew Wilcox 提交于
      Currently COW of an XIP file is done by first bringing in a read-only
      mapping, then retrying the fault and copying the page.  It is much more
      efficient to tell the fault handler that a COW is being attempted (by
      passing in the pre-allocated page in the vm_fault structure), and allow
      the handler to perform the COW operation itself.
      
      The handler cannot insert the page itself if there is already a read-only
      mapping at that address, so allow the handler to return VM_FAULT_LOCKED
      and set the fault_page to be NULL.  This indicates to the MM code that the
      i_mmap_lock is held instead of the page lock.
      Signed-off-by: NMatthew Wilcox <matthew.r.wilcox@intel.com>
      Acked-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Andreas Dilger <andreas.dilger@intel.com>
      Cc: Boaz Harrosh <boaz@plexistor.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
      Cc: Randy Dunlap <rdunlap@infradead.org>
      Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
      Cc: Theodore Ts'o <tytso@mit.edu>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      2e4cdab0
    • M
      mm: fix XIP fault vs truncate race · 283307c7
      Matthew Wilcox 提交于
      DAX is a replacement for the variation of XIP currently supported by the
      ext2 filesystem.  We have three different things in the tree called 'XIP',
      and the new focus is on access to data rather than executables, so a name
      change was in order.  DAX stands for Direct Access.  The X is for
      eXciting.
      
      The new focus on data access has resulted in more careful attention to
      races that exist in the current XIP code, but are not hit by the use-case
      that it was designed for.  XIP's architecture worked fine for ext2, but
      DAX is architected to work with modern filsystems such as ext4 and XFS.
      DAX is not intended for use with btrfs; the value that btrfs adds relies
      on manipulating data and writing data to different locations, while DAX's
      value is for write-in-place and keeping the kernel from touching the data.
      
      DAX was developed in order to support NV-DIMMs, but it's become clear that
      its usefuless extends beyond NV-DIMMs and there are several potential
      customers including the tracing machinery.  Other people want to place the
      kernel log in an area of memory, as long as they have a BIOS that does not
      clear DRAM on reboot.
      
      Patch 1 is a bug fix, probably worth including in 3.18.
      
      Patches 2 & 3 are infrastructure for DAX.
      
      Patches 4-8 replace the XIP code with its DAX equivalents, transforming
      ext2 to use the DAX code as we go.  Note that patch 10 is the
      Documentation patch.
      
      Patches 9-15 clean up after the XIP code, removing the infrastructure
      that is no longer needed and renaming various XIP things to DAX.
      Most of these patches were added after Jan found things he didn't
      like in an earlier version of the ext4 patch ... that had been copied
      from ext2.  So ext2 i being transformed to do things the same way that
      ext4 will later.  The ability to mount ext2 filesystems with the 'xip'
      option is retained, although the 'dax' option is now preferred.
      
      Patch 16 adds some DAX infrastructure to support ext4.
      
      Patch 17 adds DAX support to ext4.  It is broadly similar to ext2's DAX
      support, but it is more efficient than ext4's due to its support for
      unwritten extents.
      
      Patch 18 is another cleanup patch renaming XIP to DAX.
      
      My thanks to Mathieu Desnoyers for his reviews of the v11 patchset.  Most
      of the changes below were based on his feedback.
      
      This patch (of 18):
      
      Pagecache faults recheck i_size after taking the page lock to ensure that
      the fault didn't race against a truncate.  We don't have a page to lock in
      the XIP case, so use i_mmap_lock_read() instead.  It is locked in the
      truncate path in unmap_mapping_range() after updating i_size.  So while we
      hold it in the fault path, we are guaranteed that either i_size has
      already been updated in the truncate path, or that the truncate will
      subsequently call zap_page_range_single() and so remove the mapping we
      have just inserted.
      
      There is a window of time in which i_size has been reduced and the thread
      has a mapping to a page which will be removed from the file, but this is
      harmless as the page will not be allocated to a different purpose before
      the thread's access to it is revoked.
      
      [akpm@linux-foundation.org: switch to i_mmap_lock_read(), add comment in unmap_single_vma()]
      Signed-off-by: NMatthew Wilcox <matthew.r.wilcox@intel.com>
      Reviewed-by: NJan Kara <jack@suse.cz>
      Acked-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Reviewed-by: NMathieu Desnoyers <mathieu.desnoyers@efficios.com>
      Cc: Andreas Dilger <andreas.dilger@intel.com>
      Cc: Boaz Harrosh <boaz@plexistor.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Randy Dunlap <rdunlap@infradead.org>
      Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
      Cc: Theodore Ts'o <tytso@mit.edu>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      283307c7
  12. 13 2月, 2015 5 次提交
  13. 12 2月, 2015 1 次提交
    • K
      mm: account pmd page tables to the process · dc6c9a35
      Kirill A. Shutemov 提交于
      Dave noticed that unprivileged process can allocate significant amount of
      memory -- >500 MiB on x86_64 -- and stay unnoticed by oom-killer and
      memory cgroup.  The trick is to allocate a lot of PMD page tables.  Linux
      kernel doesn't account PMD tables to the process, only PTE.
      
      The use-cases below use few tricks to allocate a lot of PMD page tables
      while keeping VmRSS and VmPTE low.  oom_score for the process will be 0.
      
      	#include <errno.h>
      	#include <stdio.h>
      	#include <stdlib.h>
      	#include <unistd.h>
      	#include <sys/mman.h>
      	#include <sys/prctl.h>
      
      	#define PUD_SIZE (1UL << 30)
      	#define PMD_SIZE (1UL << 21)
      
      	#define NR_PUD 130000
      
      	int main(void)
      	{
      		char *addr = NULL;
      		unsigned long i;
      
      		prctl(PR_SET_THP_DISABLE);
      		for (i = 0; i < NR_PUD ; i++) {
      			addr = mmap(addr + PUD_SIZE, PUD_SIZE, PROT_WRITE|PROT_READ,
      					MAP_ANONYMOUS|MAP_PRIVATE, -1, 0);
      			if (addr == MAP_FAILED) {
      				perror("mmap");
      				break;
      			}
      			*addr = 'x';
      			munmap(addr, PMD_SIZE);
      			mmap(addr, PMD_SIZE, PROT_WRITE|PROT_READ,
      					MAP_ANONYMOUS|MAP_PRIVATE|MAP_FIXED, -1, 0);
      			if (addr == MAP_FAILED)
      				perror("re-mmap"), exit(1);
      		}
      		printf("PID %d consumed %lu KiB in PMD page tables\n",
      				getpid(), i * 4096 >> 10);
      		return pause();
      	}
      
      The patch addresses the issue by account PMD tables to the process the
      same way we account PTE.
      
      The main place where PMD tables is accounted is __pmd_alloc() and
      free_pmd_range(). But there're few corner cases:
      
       - HugeTLB can share PMD page tables. The patch handles by accounting
         the table to all processes who share it.
      
       - x86 PAE pre-allocates few PMD tables on fork.
      
       - Architectures with FIRST_USER_ADDRESS > 0. We need to adjust sanity
         check on exit(2).
      
      Accounting only happens on configuration where PMD page table's level is
      present (PMD is not folded).  As with nr_ptes we use per-mm counter.  The
      counter value is used to calculate baseline for badness score by
      oom-killer.
      Signed-off-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Reported-by: NDave Hansen <dave.hansen@linux.intel.com>
      Cc: Hugh Dickins <hughd@google.com>
      Reviewed-by: NCyrill Gorcunov <gorcunov@openvz.org>
      Cc: Pavel Emelyanov <xemul@openvz.org>
      Cc: David Rientjes <rientjes@google.com>
      Tested-by: NSedat Dilek <sedat.dilek@gmail.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      dc6c9a35
  14. 11 2月, 2015 5 次提交
  15. 30 1月, 2015 1 次提交
    • L
      vm: make stack guard page errors return VM_FAULT_SIGSEGV rather than SIGBUS · 9c145c56
      Linus Torvalds 提交于
      The stack guard page error case has long incorrectly caused a SIGBUS
      rather than a SIGSEGV, but nobody actually noticed until commit
      fee7e49d ("mm: propagate error from stack expansion even for guard
      page") because that error case was never actually triggered in any
      normal situations.
      
      Now that we actually report the error, people noticed the wrong signal
      that resulted.  So far, only the test suite of libsigsegv seems to have
      actually cared, but there are real applications that use libsigsegv, so
      let's not wait for any of those to break.
      Reported-and-tested-by: NTakashi Iwai <tiwai@suse.de>
      Tested-by: NJan Engelhardt <jengelh@inai.de>
      Acked-by: Heiko Carstens <heiko.carstens@de.ibm.com> # "s390 still compiles and boots"
      Cc: linux-arch@vger.kernel.org
      Cc: stable@vger.kernel.org
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      9c145c56
  16. 28 1月, 2015 1 次提交
    • D
      mm: provide a find_special_page vma operation · 667a0a06
      David Vrabel 提交于
      The optional find_special_page VMA operation is used to lookup the
      pages backing a VMA.  This is useful in cases where the normal
      mechanisms for finding the page don't work.  This is only called if
      the PTE is special.
      
      One use case is a Xen PV guest mapping foreign pages into userspace.
      
      In a Xen PV guest, the PTEs contain MFNs so get_user_pages() (for
      example) must do an MFN to PFN (M2P) lookup before it can get the
      page.  For foreign pages (those owned by another guest) the M2P lookup
      returns the PFN as seen by the foreign guest (which would be
      completely the wrong page for the local guest).
      
      This cannot be fixed up improving the M2P lookup since one MFN may be
      mapped onto two or more pages so getting the right page is impossible
      given just the MFN.
      Signed-off-by: NDavid Vrabel <david.vrabel@citrix.com>
      Acked-by: NAndrew Morton <akpm@linux-foundation.org>
      667a0a06
  17. 13 1月, 2015 1 次提交
    • W
      mm: mmu_gather: use tlb->end != 0 only for TLB invalidation · 721c21c1
      Will Deacon 提交于
      When batching up address ranges for TLB invalidation, we check tlb->end
      != 0 to indicate that some pages have actually been unmapped.
      
      As of commit f045bbb9 ("mmu_gather: fix over-eager
      tlb_flush_mmu_free() calling"), we use the same check for freeing these
      pages in order to avoid a performance regression where we call
      free_pages_and_swap_cache even when no pages are actually queued up.
      
      Unfortunately, the range could have been reset (tlb->end = 0) by
      tlb_end_vma, which has been shown to cause memory leaks on arm64.
      Furthermore, investigation into these leaks revealed that the fullmm
      case on task exit no longer invalidates the TLB, by virtue of tlb->end
       == 0 (in 3.18, need_flush would have been set).
      
      This patch resolves the problem by reverting commit f045bbb9, using
      instead tlb->local.nr as the predicate for page freeing in
      tlb_flush_mmu_free and ensuring that tlb->end is initialised to a
      non-zero value in the fullmm case.
      Tested-by: NMark Langsdorf <mlangsdo@redhat.com>
      Tested-by: NDave Hansen <dave@sr71.net>
      Signed-off-by: NWill Deacon <will.deacon@arm.com>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      721c21c1
  18. 09 1月, 2015 1 次提交
    • J
      mm: protect set_page_dirty() from ongoing truncation · 2d6d7f98
      Johannes Weiner 提交于
      Tejun, while reviewing the code, spotted the following race condition
      between the dirtying and truncation of a page:
      
      __set_page_dirty_nobuffers()       __delete_from_page_cache()
        if (TestSetPageDirty(page))
                                           page->mapping = NULL
      				     if (PageDirty())
      				       dec_zone_page_state(page, NR_FILE_DIRTY);
      				       dec_bdi_stat(mapping->backing_dev_info, BDI_RECLAIMABLE);
          if (page->mapping)
            account_page_dirtied(page)
              __inc_zone_page_state(page, NR_FILE_DIRTY);
      	__inc_bdi_stat(mapping->backing_dev_info, BDI_RECLAIMABLE);
      
      which results in an imbalance of NR_FILE_DIRTY and BDI_RECLAIMABLE.
      
      Dirtiers usually lock out truncation, either by holding the page lock
      directly, or in case of zap_pte_range(), by pinning the mapcount with
      the page table lock held.  The notable exception to this rule, though,
      is do_wp_page(), for which this race exists.  However, do_wp_page()
      already waits for a locked page to unlock before setting the dirty bit,
      in order to prevent a race where clear_page_dirty() misses the page bit
      in the presence of dirty ptes.  Upgrade that wait to a fully locked
      set_page_dirty() to also cover the situation explained above.
      
      Afterwards, the code in set_page_dirty() dealing with a truncation race
      is no longer needed.  Remove it.
      Reported-by: NTejun Heo <tj@kernel.org>
      Signed-off-by: NJohannes Weiner <hannes@cmpxchg.org>
      Acked-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Reviewed-by: NJan Kara <jack@suse.cz>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      2d6d7f98
  19. 07 1月, 2015 1 次提交
    • L
      mm: propagate error from stack expansion even for guard page · fee7e49d
      Linus Torvalds 提交于
      Jay Foad reports that the address sanitizer test (asan) sometimes gets
      confused by a stack pointer that ends up being outside the stack vma
      that is reported by /proc/maps.
      
      This happens due to an interaction between RLIMIT_STACK and the guard
      page: when we do the guard page check, we ignore the potential error
      from the stack expansion, which effectively results in a missing guard
      page, since the expected stack expansion won't have been done.
      
      And since /proc/maps explicitly ignores the guard page (commit
      d7824370: "mm: fix up some user-visible effects of the stack guard
      page"), the stack pointer ends up being outside the reported stack area.
      
      This is the minimal patch: it just propagates the error.  It also
      effectively makes the guard page part of the stack limit, which in turn
      measn that the actual real stack is one page less than the stack limit.
      
      Let's see if anybody notices.  We could teach acct_stack_growth() to
      allow an extra page for a grow-up/grow-down stack in the rlimit test,
      but I don't want to add more complexity if it isn't needed.
      Reported-and-tested-by: NJay Foad <jay.foad@gmail.com>
      Cc: stable@vger.kernel.org
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      fee7e49d
  20. 23 12月, 2014 1 次提交
  21. 19 12月, 2014 1 次提交
  22. 18 12月, 2014 1 次提交