1. 07 11月, 2015 4 次提交
    • K
      mm: make compound_head() robust · 1d798ca3
      Kirill A. Shutemov 提交于
      Hugh has pointed that compound_head() call can be unsafe in some
      context. There's one example:
      
      	CPU0					CPU1
      
      isolate_migratepages_block()
        page_count()
          compound_head()
            !!PageTail() == true
      					put_page()
      					  tail->first_page = NULL
            head = tail->first_page
      					alloc_pages(__GFP_COMP)
      					   prep_compound_page()
      					     tail->first_page = head
      					     __SetPageTail(p);
            !!PageTail() == true
          <head == NULL dereferencing>
      
      The race is pure theoretical. I don't it's possible to trigger it in
      practice. But who knows.
      
      We can fix the race by changing how encode PageTail() and compound_head()
      within struct page to be able to update them in one shot.
      
      The patch introduces page->compound_head into third double word block in
      front of compound_dtor and compound_order. Bit 0 encodes PageTail() and
      the rest bits are pointer to head page if bit zero is set.
      
      The patch moves page->pmd_huge_pte out of word, just in case if an
      architecture defines pgtable_t into something what can have the bit 0
      set.
      
      hugetlb_cgroup uses page->lru.next in the second tail page to store
      pointer struct hugetlb_cgroup. The patch switch it to use page->private
      in the second tail page instead. The space is free since ->first_page is
      removed from the union.
      
      The patch also opens possibility to remove HUGETLB_CGROUP_MIN_ORDER
      limitation, since there's now space in first tail page to store struct
      hugetlb_cgroup pointer. But that's out of scope of the patch.
      
      That means page->compound_head shares storage space with:
      
       - page->lru.next;
       - page->next;
       - page->rcu_head.next;
      
      That's too long list to be absolutely sure, but looks like nobody uses
      bit 0 of the word.
      
      page->rcu_head.next guaranteed[1] to have bit 0 clean as long as we use
      call_rcu(), call_rcu_bh(), call_rcu_sched(), or call_srcu(). But future
      call_rcu_lazy() is not allowed as it makes use of the bit and we can
      get false positive PageTail().
      
      [1] http://lkml.kernel.org/g/20150827163634.GD4029@linux.vnet.ibm.comSigned-off-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Acked-by: NMichal Hocko <mhocko@suse.com>
      Reviewed-by: NAndrea Arcangeli <aarcange@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Acked-by: NPaul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      1d798ca3
    • A
      thp: remove unused vma parameter from khugepaged_alloc_page · d6669d68
      Aaron Tomlin 提交于
      The "vma" parameter to khugepaged_alloc_page() is unused.  It has to
      remain unused or the drop read lock 'map_sem' optimisation introduce by
      commit 8b164568 ("mm, THP: don't hold mmap_sem in khugepaged when
      allocating THP") wouldn't be safe.  So let's remove it.
      Signed-off-by: NAaron Tomlin <atomlin@redhat.com>
      Acked-by: NVlastimil Babka <vbabka@suse.cz>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      d6669d68
    • M
      mm, page_alloc: remove MIGRATE_RESERVE · 974a786e
      Mel Gorman 提交于
      MIGRATE_RESERVE preserves an old property of the buddy allocator that
      existed prior to fragmentation avoidance -- min_free_kbytes worth of pages
      tended to remain contiguous until the only alternative was to fail the
      allocation.  At the time it was discovered that high-order atomic
      allocations relied on this property so MIGRATE_RESERVE was introduced.  A
      later patch will introduce an alternative MIGRATE_HIGHATOMIC so this patch
      deletes MIGRATE_RESERVE and supporting code so it'll be easier to review.
      Note that this patch in isolation may look like a false regression if
      someone was bisecting high-order atomic allocation failures.
      Signed-off-by: NMel Gorman <mgorman@techsingularity.net>
      Acked-by: NVlastimil Babka <vbabka@suse.cz>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Vitaly Wool <vitalywool@gmail.com>
      Cc: Rik van Riel <riel@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      974a786e
    • M
      mm, page_alloc: rename __GFP_WAIT to __GFP_RECLAIM · 71baba4b
      Mel Gorman 提交于
      __GFP_WAIT was used to signal that the caller was in atomic context and
      could not sleep.  Now it is possible to distinguish between true atomic
      context and callers that are not willing to sleep.  The latter should
      clear __GFP_DIRECT_RECLAIM so kswapd will still wake.  As clearing
      __GFP_WAIT behaves differently, there is a risk that people will clear the
      wrong flags.  This patch renames __GFP_WAIT to __GFP_RECLAIM to clearly
      indicate what it does -- setting it allows all reclaim activity, clearing
      them prevents it.
      
      [akpm@linux-foundation.org: fix build]
      [akpm@linux-foundation.org: coding-style fixes]
      Signed-off-by: NMel Gorman <mgorman@techsingularity.net>
      Acked-by: NMichal Hocko <mhocko@suse.com>
      Acked-by: NVlastimil Babka <vbabka@suse.cz>
      Acked-by: NJohannes Weiner <hannes@cmpxchg.org>
      Cc: Christoph Lameter <cl@linux.com>
      Acked-by: NDavid Rientjes <rientjes@google.com>
      Cc: Vitaly Wool <vitalywool@gmail.com>
      Cc: Rik van Riel <riel@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      71baba4b
  2. 06 11月, 2015 1 次提交
    • E
      mm: introduce VM_LOCKONFAULT · de60f5f1
      Eric B Munson 提交于
      The cost of faulting in all memory to be locked can be very high when
      working with large mappings.  If only portions of the mapping will be used
      this can incur a high penalty for locking.
      
      For the example of a large file, this is the usage pattern for a large
      statical language model (probably applies to other statical or graphical
      models as well).  For the security example, any application transacting in
      data that cannot be swapped out (credit card data, medical records, etc).
      
      This patch introduces the ability to request that pages are not
      pre-faulted, but are placed on the unevictable LRU when they are finally
      faulted in.  The VM_LOCKONFAULT flag will be used together with VM_LOCKED
      and has no effect when set without VM_LOCKED.  Setting the VM_LOCKONFAULT
      flag for a VMA will cause pages faulted into that VMA to be added to the
      unevictable LRU when they are faulted or if they are already present, but
      will not cause any missing pages to be faulted in.
      
      Exposing this new lock state means that we cannot overload the meaning of
      the FOLL_POPULATE flag any longer.  Prior to this patch it was used to
      mean that the VMA for a fault was locked.  This means we need the new
      FOLL_MLOCK flag to communicate the locked state of a VMA.  FOLL_POPULATE
      will now only control if the VMA should be populated and in the case of
      VM_LOCKONFAULT, it will not be set.
      Signed-off-by: NEric B Munson <emunson@akamai.com>
      Acked-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Acked-by: NVlastimil Babka <vbabka@suse.cz>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Geert Uytterhoeven <geert@linux-m68k.org>
      Cc: Guenter Roeck <linux@roeck-us.net>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Michael Kerrisk <mtk.manpages@gmail.com>
      Cc: Ralf Baechle <ralf@linux-mips.org>
      Cc: Shuah Khan <shuahkh@osg.samsung.com>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      de60f5f1
  3. 23 10月, 2015 1 次提交
  4. 17 10月, 2015 1 次提交
  5. 11 9月, 2015 1 次提交
    • V
      mm: introduce idle page tracking · 33c3fc71
      Vladimir Davydov 提交于
      Knowing the portion of memory that is not used by a certain application or
      memory cgroup (idle memory) can be useful for partitioning the system
      efficiently, e.g.  by setting memory cgroup limits appropriately.
      Currently, the only means to estimate the amount of idle memory provided
      by the kernel is /proc/PID/{clear_refs,smaps}: the user can clear the
      access bit for all pages mapped to a particular process by writing 1 to
      clear_refs, wait for some time, and then count smaps:Referenced.  However,
      this method has two serious shortcomings:
      
       - it does not count unmapped file pages
       - it affects the reclaimer logic
      
      To overcome these drawbacks, this patch introduces two new page flags,
      Idle and Young, and a new sysfs file, /sys/kernel/mm/page_idle/bitmap.
      A page's Idle flag can only be set from userspace by setting bit in
      /sys/kernel/mm/page_idle/bitmap at the offset corresponding to the page,
      and it is cleared whenever the page is accessed either through page tables
      (it is cleared in page_referenced() in this case) or using the read(2)
      system call (mark_page_accessed()). Thus by setting the Idle flag for
      pages of a particular workload, which can be found e.g.  by reading
      /proc/PID/pagemap, waiting for some time to let the workload access its
      working set, and then reading the bitmap file, one can estimate the amount
      of pages that are not used by the workload.
      
      The Young page flag is used to avoid interference with the memory
      reclaimer.  A page's Young flag is set whenever the Access bit of a page
      table entry pointing to the page is cleared by writing to the bitmap file.
      If page_referenced() is called on a Young page, it will add 1 to its
      return value, therefore concealing the fact that the Access bit was
      cleared.
      
      Note, since there is no room for extra page flags on 32 bit, this feature
      uses extended page flags when compiled on 32 bit.
      
      [akpm@linux-foundation.org: fix build]
      [akpm@linux-foundation.org: kpageidle requires an MMU]
      [akpm@linux-foundation.org: decouple from page-flags rework]
      Signed-off-by: NVladimir Davydov <vdavydov@parallels.com>
      Reviewed-by: NAndres Lagar-Cavilla <andreslc@google.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: Michel Lespinasse <walken@google.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Pavel Emelyanov <xemul@parallels.com>
      Cc: Cyrill Gorcunov <gorcunov@openvz.org>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      33c3fc71
  6. 09 9月, 2015 12 次提交
  7. 05 9月, 2015 3 次提交
    • A
      userfaultfd: propagate the full address in THP faults · 230c92a8
      Andrea Arcangeli 提交于
      The THP faults were not propagating the original fault address.  The
      latest version of the API with uffd.arg.pagefault.address is supposed to
      propagate the full address through THP faults.
      
      This was not a kernel crashing bug and it wouldn't risk to corrupt user
      memory, but it would cause a SIGBUS failure because the wrong page was
      being copied.
      
      For various reasons this wasn't easily reproducible in the qemu workload,
      but the strestest exposed the problem immediately.
      Signed-off-by: NAndrea Arcangeli <aarcange@redhat.com>
      Cc: Pavel Emelyanov <xemul@parallels.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      230c92a8
    • A
      userfaultfd: prevent khugepaged to merge if userfaultfd is armed · c1294d05
      Andrea Arcangeli 提交于
      If userfaultfd is armed on a certain vma we can't "fill" the holes with
      zeroes or we'll break the userland on demand paging.  The holes if the
      userfault is armed, are really missing information (not zeroes) that the
      userland has to load from network or elsewhere.
      
      The same issue happens for wrprotected ptes that we can't just convert
      into a single writable pmd_trans_huge.
      
      We could however in theory still merge across zeropages if only
      VM_UFFD_MISSING is set (so if VM_UFFD_WP is not set)...  that could be
      slightly improved but it'd be much more complex code for a tiny corner
      case.
      Signed-off-by: NAndrea Arcangeli <aarcange@redhat.com>
      Acked-by: NPavel Emelyanov <xemul@parallels.com>
      Cc: Sanidhya Kashyap <sanidhya.gatech@gmail.com>
      Cc: zhang.zhanghailiang@huawei.com
      Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
      Cc: Andres Lagar-Cavilla <andreslc@google.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Peter Feiner <pfeiner@google.com>
      Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: "Huangpeng (Peter)" <peter.huangpeng@huawei.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      c1294d05
    • A
      userfaultfd: call handle_userfault() for userfaultfd_missing() faults · 6b251fc9
      Andrea Arcangeli 提交于
      This is where the page faults must be modified to call
      handle_userfault() if userfaultfd_missing() is true (so if the
      vma->vm_flags had VM_UFFD_MISSING set).
      
      handle_userfault() then takes care of blocking the page fault and
      delivering it to userland.
      
      The fault flags must also be passed as parameter so the "read|write"
      kind of fault can be passed to userland.
      Signed-off-by: NAndrea Arcangeli <aarcange@redhat.com>
      Acked-by: NPavel Emelyanov <xemul@parallels.com>
      Cc: Sanidhya Kashyap <sanidhya.gatech@gmail.com>
      Cc: zhang.zhanghailiang@huawei.com
      Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
      Cc: Andres Lagar-Cavilla <andreslc@google.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Peter Feiner <pfeiner@google.com>
      Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: "Huangpeng (Peter)" <peter.huangpeng@huawei.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      6b251fc9
  8. 07 8月, 2015 1 次提交
    • N
      mm: check __PG_HWPOISON separately from PAGE_FLAGS_CHECK_AT_* · f4c18e6f
      Naoya Horiguchi 提交于
      The race condition addressed in commit add05cec ("mm: soft-offline:
      don't free target page in successful page migration") was not closed
      completely, because that can happen not only for soft-offline, but also
      for hard-offline.  Consider that a slab page is about to be freed into
      buddy pool, and then an uncorrected memory error hits the page just
      after entering __free_one_page(), then VM_BUG_ON_PAGE(page->flags &
      PAGE_FLAGS_CHECK_AT_PREP) is triggered, despite the fact that it's not
      necessary because the data on the affected page is not consumed.
      
      To solve it, this patch drops __PG_HWPOISON from page flag checks at
      allocation/free time.  I think it's justified because __PG_HWPOISON
      flags is defined to prevent the page from being reused, and setting it
      outside the page's alloc-free cycle is a designed behavior (not a bug.)
      
      For recent months, I was annoyed about BUG_ON when soft-offlined page
      remains on lru cache list for a while, which is avoided by calling
      put_page() instead of putback_lru_page() in page migration's success
      path.  This means that this patch reverts a major change from commit
      add05cec about the new refcounting rule of soft-offlined pages, so
      "reuse window" revives.  This will be closed by a subsequent patch.
      Signed-off-by: NNaoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: Dean Nelson <dnelson@redhat.com>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      f4c18e6f
  9. 25 6月, 2015 3 次提交
  10. 16 4月, 2015 5 次提交
  11. 15 4月, 2015 3 次提交
  12. 26 3月, 2015 4 次提交
    • M
      mm: numa: mark huge PTEs young when clearing NUMA hinting faults · b7b04004
      Mel Gorman 提交于
      Base PTEs are marked young when the NUMA hinting information is cleared
      but the same does not happen for huge pages which this patch addresses.
      
      Note that migrated pages are not marked young as the base page migration
      code does not assume that migrated pages have been referenced.  This
      could be addressed but beyond the scope of this series which is aimed at
      Dave Chinners shrink workload that is unlikely to be affected by this
      issue.
      Signed-off-by: NMel Gorman <mgorman@suse.de>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Aneesh Kumar <aneesh.kumar@linux.vnet.ibm.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      b7b04004
    • M
      mm: numa: slow PTE scan rate if migration failures occur · 074c2381
      Mel Gorman 提交于
      Dave Chinner reported the following on https://lkml.org/lkml/2015/3/1/226
      
        Across the board the 4.0-rc1 numbers are much slower, and the degradation
        is far worse when using the large memory footprint configs. Perf points
        straight at the cause - this is from 4.0-rc1 on the "-o bhash=101073" config:
      
         -   56.07%    56.07%  [kernel]            [k] default_send_IPI_mask_sequence_phys
            - default_send_IPI_mask_sequence_phys
               - 99.99% physflat_send_IPI_mask
                  - 99.37% native_send_call_func_ipi
                       smp_call_function_many
                     - native_flush_tlb_others
                        - 99.85% flush_tlb_page
                             ptep_clear_flush
                             try_to_unmap_one
                             rmap_walk
                             try_to_unmap
                             migrate_pages
                             migrate_misplaced_page
                           - handle_mm_fault
                              - 99.73% __do_page_fault
                                   trace_do_page_fault
                                   do_async_page_fault
                                 + async_page_fault
                    0.63% native_send_call_func_single_ipi
                       generic_exec_single
                       smp_call_function_single
      
      This is showing excessive migration activity even though excessive
      migrations are meant to get throttled.  Normally, the scan rate is tuned
      on a per-task basis depending on the locality of faults.  However, if
      migrations fail for any reason then the PTE scanner may scan faster if
      the faults continue to be remote.  This means there is higher system CPU
      overhead and fault trapping at exactly the time we know that migrations
      cannot happen.  This patch tracks when migration failures occur and
      slows the PTE scanner.
      Signed-off-by: NMel Gorman <mgorman@suse.de>
      Reported-by: NDave Chinner <david@fromorbit.com>
      Tested-by: NDave Chinner <david@fromorbit.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Aneesh Kumar <aneesh.kumar@linux.vnet.ibm.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      074c2381
    • M
      mm: numa: preserve PTE write permissions across a NUMA hinting fault · b191f9b1
      Mel Gorman 提交于
      Protecting a PTE to trap a NUMA hinting fault clears the writable bit
      and further faults are needed after trapping a NUMA hinting fault to set
      the writable bit again.  This patch preserves the writable bit when
      trapping NUMA hinting faults.  The impact is obvious from the number of
      minor faults trapped during the basis balancing benchmark and the system
      CPU usage;
      
        autonumabench
                                                   4.0.0-rc4             4.0.0-rc4
                                                    baseline              preserve
        Time System-NUMA01                  107.13 (  0.00%)      103.13 (  3.73%)
        Time System-NUMA01_THEADLOCAL       131.87 (  0.00%)       83.30 ( 36.83%)
        Time System-NUMA02                    8.95 (  0.00%)       10.72 (-19.78%)
        Time System-NUMA02_SMT                4.57 (  0.00%)        3.99 ( 12.69%)
        Time Elapsed-NUMA01                 515.78 (  0.00%)      517.26 ( -0.29%)
        Time Elapsed-NUMA01_THEADLOCAL      384.10 (  0.00%)      384.31 ( -0.05%)
        Time Elapsed-NUMA02                  48.86 (  0.00%)       48.78 (  0.16%)
        Time Elapsed-NUMA02_SMT              47.98 (  0.00%)       48.12 ( -0.29%)
      
                     4.0.0-rc4   4.0.0-rc4
                      baseline    preserve
        User          44383.95    43971.89
        System          252.61      201.24
        Elapsed         998.68     1000.94
      
        Minor Faults   2597249     1981230
        Major Faults       365         364
      
      There is a similar drop in system CPU usage using Dave Chinner's xfsrepair
      workload
      
                                            4.0.0-rc4             4.0.0-rc4
                                             baseline              preserve
        Amean    real-xfsrepair      454.14 (  0.00%)      442.36 (  2.60%)
        Amean    syst-xfsrepair      277.20 (  0.00%)      204.68 ( 26.16%)
      
      The patch looks hacky but the alternatives looked worse.  The tidest was
      to rewalk the page tables after a hinting fault but it was more complex
      than this approach and the performance was worse.  It's not generally
      safe to just mark the page writable during the fault if it's a write
      fault as it may have been read-only for COW so that approach was
      discarded.
      Signed-off-by: NMel Gorman <mgorman@suse.de>
      Reported-by: NDave Chinner <david@fromorbit.com>
      Tested-by: NDave Chinner <david@fromorbit.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Aneesh Kumar <aneesh.kumar@linux.vnet.ibm.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      b191f9b1
    • M
      mm: numa: group related processes based on VMA flags instead of page table flags · bea66fbd
      Mel Gorman 提交于
      These are three follow-on patches based on the xfsrepair workload Dave
      Chinner reported was problematic in 4.0-rc1 due to changes in page table
      management -- https://lkml.org/lkml/2015/3/1/226.
      
      Much of the problem was reduced by commit 53da3bc2 ("mm: fix up numa
      read-only thread grouping logic") and commit ba68bc01 ("mm: thp:
      Return the correct value for change_huge_pmd").  It was known that the
      performance in 3.19 was still better even if is far less safe.  This
      series aims to restore the performance without compromising on safety.
      
      For the test of this mail, I'm comparing 3.19 against 4.0-rc4 and the
      three patches applied on top
      
        autonumabench
                                                      3.19.0             4.0.0-rc4             4.0.0-rc4             4.0.0-rc4             4.0.0-rc4
                                                     vanilla               vanilla          vmwrite-v5r8         preserve-v5r8         slowscan-v5r8
        Time System-NUMA01                  124.00 (  0.00%)      161.86 (-30.53%)      107.13 ( 13.60%)      103.13 ( 16.83%)      145.01 (-16.94%)
        Time System-NUMA01_THEADLOCAL       115.54 (  0.00%)      107.64 (  6.84%)      131.87 (-14.13%)       83.30 ( 27.90%)       92.35 ( 20.07%)
        Time System-NUMA02                    9.35 (  0.00%)       10.44 (-11.66%)        8.95 (  4.28%)       10.72 (-14.65%)        8.16 ( 12.73%)
        Time System-NUMA02_SMT                3.87 (  0.00%)        4.63 (-19.64%)        4.57 (-18.09%)        3.99 ( -3.10%)        3.36 ( 13.18%)
        Time Elapsed-NUMA01                 570.06 (  0.00%)      567.82 (  0.39%)      515.78 (  9.52%)      517.26 (  9.26%)      543.80 (  4.61%)
        Time Elapsed-NUMA01_THEADLOCAL      393.69 (  0.00%)      384.83 (  2.25%)      384.10 (  2.44%)      384.31 (  2.38%)      380.73 (  3.29%)
        Time Elapsed-NUMA02                  49.09 (  0.00%)       49.33 ( -0.49%)       48.86 (  0.47%)       48.78 (  0.63%)       50.94 ( -3.77%)
        Time Elapsed-NUMA02_SMT              47.51 (  0.00%)       47.15 (  0.76%)       47.98 ( -0.99%)       48.12 ( -1.28%)       49.56 ( -4.31%)
      
                      3.19.0   4.0.0-rc4   4.0.0-rc4   4.0.0-rc4   4.0.0-rc4
                     vanilla     vanillavmwrite-v5r8preserve-v5r8slowscan-v5r8
        User        46334.60    46391.94    44383.95    43971.89    44372.12
        System        252.84      284.66      252.61      201.24      249.00
        Elapsed      1062.14     1050.96      998.68     1000.94     1026.78
      
      Overall the system CPU usage is comparable and the test is naturally a
      bit variable.  The slowing of the scanner hurts numa01 but on this
      machine it is an adverse workload and patches that dramatically help it
      often hurt absolutely everything else.
      
      Due to patch 2, the fault activity is interesting
      
                                        3.19.0   4.0.0-rc4   4.0.0-rc4   4.0.0-rc4   4.0.0-rc4
                                       vanilla     vanillavmwrite-v5r8preserve-v5r8slowscan-v5r8
        Minor Faults                   2097811     2656646     2597249     1981230     1636841
        Major Faults                       362         450         365         364         365
      
      Note the impact preserving the write bit across protection updates and
      fault reduces faults.
      
        NUMA alloc hit                 1229008     1217015     1191660     1178322     1199681
        NUMA alloc miss                      0           0           0           0           0
        NUMA interleave hit                  0           0           0           0           0
        NUMA alloc local               1228514     1216317     1190871     1177448     1199021
        NUMA base PTE updates        245706197   240041607   238195516   244704842   115012800
        NUMA huge PMD updates           479530      468448      464868      477573      224487
        NUMA page range updates      491225557   479886983   476207932   489222218   229950144
        NUMA hint faults                659753      656503      641678      656926      294842
        NUMA hint local faults          381604      373963      360478      337585      186249
        NUMA hint local percent             57          56          56          51          63
        NUMA pages migrated            5412140     6374899     6266530     5277468     5755096
        AutoNUMA cost                    5121%       5083%       4994%       5097%       2388%
      
      Here the impact of slowing the PTE scanner on migratrion failures is
      obvious as "NUMA base PTE updates" and "NUMA huge PMD updates" are
      massively reduced even though the headline performance is very similar.
      
      As xfsrepair was the reported workload here is the impact of the series
      on it.
      
        xfsrepair
                                               3.19.0             4.0.0-rc4             4.0.0-rc4             4.0.0-rc4             4.0.0-rc4
                                              vanilla               vanilla          vmwrite-v5r8         preserve-v5r8         slowscan-v5r8
        Min      real-fsmark        1183.29 (  0.00%)     1165.73 (  1.48%)     1152.78 (  2.58%)     1153.64 (  2.51%)     1177.62 (  0.48%)
        Min      syst-fsmark        4107.85 (  0.00%)     4027.75 (  1.95%)     3986.74 (  2.95%)     3979.16 (  3.13%)     4048.76 (  1.44%)
        Min      real-xfsrepair      441.51 (  0.00%)      463.96 ( -5.08%)      449.50 ( -1.81%)      440.08 (  0.32%)      439.87 (  0.37%)
        Min      syst-xfsrepair      195.76 (  0.00%)      278.47 (-42.25%)      262.34 (-34.01%)      203.70 ( -4.06%)      143.64 ( 26.62%)
        Amean    real-fsmark        1188.30 (  0.00%)     1177.34 (  0.92%)     1157.97 (  2.55%)     1158.21 (  2.53%)     1182.22 (  0.51%)
        Amean    syst-fsmark        4111.37 (  0.00%)     4055.70 (  1.35%)     3987.19 (  3.02%)     3998.72 (  2.74%)     4061.69 (  1.21%)
        Amean    real-xfsrepair      450.88 (  0.00%)      468.32 ( -3.87%)      454.14 ( -0.72%)      442.36 (  1.89%)      440.59 (  2.28%)
        Amean    syst-xfsrepair      199.66 (  0.00%)      290.60 (-45.55%)      277.20 (-38.84%)      204.68 ( -2.51%)      150.55 ( 24.60%)
        Stddev   real-fsmark           4.12 (  0.00%)       10.82 (-162.29%)       4.14 ( -0.28%)        5.98 (-45.05%)        4.60 (-11.53%)
        Stddev   syst-fsmark           2.63 (  0.00%)       20.32 (-671.82%)       0.37 ( 85.89%)       16.47 (-525.59%)      15.05 (-471.79%)
        Stddev   real-xfsrepair        6.87 (  0.00%)        4.55 ( 33.75%)        3.46 ( 49.58%)        1.78 ( 74.12%)        0.52 ( 92.50%)
        Stddev   syst-xfsrepair        3.02 (  0.00%)       10.30 (-241.37%)      13.17 (-336.37%)       0.71 ( 76.63%)        5.00 (-65.61%)
        CoeffVar real-fsmark           0.35 (  0.00%)        0.92 (-164.73%)       0.36 ( -2.91%)        0.52 (-48.82%)        0.39 (-12.10%)
        CoeffVar syst-fsmark           0.06 (  0.00%)        0.50 (-682.41%)       0.01 ( 85.45%)        0.41 (-543.22%)       0.37 (-478.78%)
        CoeffVar real-xfsrepair        1.52 (  0.00%)        0.97 ( 36.21%)        0.76 ( 49.94%)        0.40 ( 73.62%)        0.12 ( 92.33%)
        CoeffVar syst-xfsrepair        1.51 (  0.00%)        3.54 (-134.54%)       4.75 (-214.31%)       0.34 ( 77.20%)        3.32 (-119.63%)
        Max      real-fsmark        1193.39 (  0.00%)     1191.77 (  0.14%)     1162.90 (  2.55%)     1166.66 (  2.24%)     1188.50 (  0.41%)
        Max      syst-fsmark        4114.18 (  0.00%)     4075.45 (  0.94%)     3987.65 (  3.08%)     4019.45 (  2.30%)     4082.80 (  0.76%)
        Max      real-xfsrepair      457.80 (  0.00%)      474.60 ( -3.67%)      457.82 ( -0.00%)      444.42 (  2.92%)      441.03 (  3.66%)
        Max      syst-xfsrepair      203.11 (  0.00%)      303.65 (-49.50%)      294.35 (-44.92%)      205.33 ( -1.09%)      155.28 ( 23.55%)
      
      The really relevant lines as syst-xfsrepair which is the system CPU
      usage when running xfsrepair.  Note that on my machine the overhead was
      45% higher on 4.0-rc4 which may be part of what Dave is seeing.  Once we
      preserve the write bit across faults, it's only 2.51% higher on average.
      With the full series applied, system CPU usage is 24.6% lower on
      average.
      
      Again, the impact of preserving the write bit on minor faults is obvious
      and the impact of slowing scanning after migration failures is obvious
      on the PTE updates.  Note also that the number of pages migrated is much
      reduced even though the headline performance is comparable.
      
                                        3.19.0   4.0.0-rc4   4.0.0-rc4   4.0.0-rc4   4.0.0-rc4
                                       vanilla     vanillavmwrite-v5r8preserve-v5r8slowscan-v5r8
        Minor Faults                 153466827   254507978   249163829   153501373   105737890
        Major Faults                       610         702         690         649         724
        NUMA base PTE updates        217735049   210756527   217729596   216937111   144344993
        NUMA huge PMD updates           129294       85044      106921      127246       79887
        NUMA pages migrated           21938995    29705270    28594162    22687324    16258075
      
                              3.19.0   4.0.0-rc4   4.0.0-rc4   4.0.0-rc4   4.0.0-rc4
                             vanilla     vanillavmwrite-v5r8preserve-v5r8slowscan-v5r8
        Mean sdb-avgqusz       13.47        2.54        2.55        2.47        2.49
        Mean sdb-avgrqsz      202.32      140.22      139.50      139.02      138.12
        Mean sdb-await         25.92        5.09        5.33        5.02        5.22
        Mean sdb-r_await        4.71        0.19        0.83        0.51        0.11
        Mean sdb-w_await      104.13        5.21        5.38        5.05        5.32
        Mean sdb-svctm          0.59        0.13        0.14        0.13        0.14
        Mean sdb-rrqm           0.16        0.00        0.00        0.00        0.00
        Mean sdb-wrqm           3.59     1799.43     1826.84     1812.21     1785.67
        Max  sdb-avgqusz      111.06       12.13       14.05       11.66       15.60
        Max  sdb-avgrqsz      255.60      190.34      190.01      187.33      191.78
        Max  sdb-await        168.24       39.28       49.22       44.64       65.62
        Max  sdb-r_await      660.00       52.00      280.00       76.00       12.00
        Max  sdb-w_await     7804.00       39.28       49.22       44.64       65.62
        Max  sdb-svctm          4.00        2.82        2.86        1.98        2.84
        Max  sdb-rrqm           8.30        0.00        0.00        0.00        0.00
        Max  sdb-wrqm          34.20     5372.80     5278.60     5386.60     5546.15
      
      FWIW, I also checked SPECjbb in different configurations but it's
      similar observations -- minor faults lower, PTE update activity lower
      and performance is roughly comparable against 3.19.
      
      This patch (of 3):
      
      Threads that share writable data within pages are grouped together as
      related tasks.  This decision is based on whether the PTE is marked
      dirty which is subject to timing races between the PTE scanner update
      and when the application writes the page.  If the page is file-backed,
      then background flushes and sync also affect placement.  This is
      unpredictable behaviour which is impossible to reason about so this
      patch makes grouping decisions based on the VMA flags.
      Signed-off-by: NMel Gorman <mgorman@suse.de>
      Reported-by: NDave Chinner <david@fromorbit.com>
      Tested-by: NDave Chinner <david@fromorbit.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Aneesh Kumar <aneesh.kumar@linux.vnet.ibm.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      bea66fbd
  13. 13 3月, 2015 1 次提交