1. 19 Jul 2019 (2 commits)
  2. 06 Jul 2019 (1 commit)
    • Revert "mm: page cache: store only head pages in i_pages" · 69bf4b6b
      Authored by Linus Torvalds
      This reverts commit 5fd4ca2d.
      
      Mikhail Gavrilov reports that it causes the VM_BUG_ON_PAGE() in
      __delete_from_swap_cache() to trigger:
      
         page:ffffd6d34dff0000 refcount:1 mapcount:1 mapping:ffff97812323a689 index:0xfecec363
         anon
         flags: 0x17fffe00080034(uptodate|lru|active|swapbacked)
         raw: 0017fffe00080034 ffffd6d34c67c508 ffffd6d3504b8d48 ffff97812323a689
         raw: 00000000fecec363 0000000000000000 0000000100000000 ffff978433ace000
         page dumped because: VM_BUG_ON_PAGE(entry != page)
         page->mem_cgroup:ffff978433ace000
         ------------[ cut here ]------------
         kernel BUG at mm/swap_state.c:170!
         invalid opcode: 0000 [#1] SMP NOPTI
         CPU: 1 PID: 221 Comm: kswapd0 Not tainted 5.2.0-0.rc2.git0.1.fc31.x86_64 #1
         Hardware name: System manufacturer System Product Name/ROG STRIX X470-I GAMING, BIOS 2202 04/11/2019
         RIP: 0010:__delete_from_swap_cache+0x20d/0x240
         Code: 30 65 48 33 04 25 28 00 00 00 75 4a 48 83 c4 38 5b 5d 41 5c 41 5d 41 5e 41 5f c3 48 c7 c6 2f dc 0f 8a 48 89 c7 e8 93 1b fd ff <0f> 0b 48 c7 c6 a8 74 0f 8a e8 85 1b fd ff 0f 0b 48 c7 c6 a8 7d 0f
         RSP: 0018:ffffa982036e7980 EFLAGS: 00010046
         RAX: 0000000000000021 RBX: 0000000000000040 RCX: 0000000000000006
         RDX: 0000000000000000 RSI: 0000000000000086 RDI: ffff97843d657900
         RBP: 0000000000000001 R08: ffffa982036e7835 R09: 0000000000000535
         R10: ffff97845e21a46c R11: ffffa982036e7835 R12: ffff978426387120
         R13: 0000000000000000 R14: ffffd6d34dff0040 R15: ffffd6d34dff0000
         FS:  0000000000000000(0000) GS:ffff97843d640000(0000) knlGS:0000000000000000
         CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
         CR2: 00002cba88ef5000 CR3: 000000078a97c000 CR4: 00000000003406e0
         Call Trace:
          delete_from_swap_cache+0x46/0xa0
          try_to_free_swap+0xbc/0x110
          swap_writepage+0x13/0x70
          pageout.isra.0+0x13c/0x350
          shrink_page_list+0xc14/0xdf0
          shrink_inactive_list+0x1e5/0x3c0
          shrink_node_memcg+0x202/0x760
          shrink_node+0xe0/0x470
          balance_pgdat+0x2d1/0x510
          kswapd+0x220/0x420
          kthread+0xfb/0x130
          ret_from_fork+0x22/0x40
      
      and it's not immediately obvious why it happens.  It's too late in the
      rc cycle to do anything but revert for now.
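
      Roughly, the deletion path that trips this check looks like the
      following hedged sketch (simplified from mm/swap_state.c of that era;
      locking asserts and statistics accounting trimmed).  Every i_pages slot
      covering the compound page is expected to still point at the page being
      deleted, and VM_BUG_ON_PAGE() fires when a stored entry disagrees:

      	void __delete_from_swap_cache(struct page *page, swp_entry_t entry)
      	{
      		struct address_space *address_space = swap_address_space(entry);
      		int i, nr = hpage_nr_pages(page);
      		XA_STATE(xas, &address_space->i_pages, swp_offset(entry));

      		for (i = 0; i < nr; i++) {
      			void *old = xas_store(&xas, NULL);

      			/* the "entry != page" assertion quoted above */
      			VM_BUG_ON_PAGE(old != page, old);
      			set_page_private(page + i, 0);
      			xas_next(&xas);
      		}
      		ClearPageSwapCache(page);
      		address_space->nrpages -= nr;
      		__mod_node_page_state(page_pgdat(page), NR_FILE_PAGES, -nr);
      	}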
      
      Link: https://lore.kernel.org/lkml/CABXGCsN9mYmBD-4GaaeW_NrDu+FDXLzr_6x+XNxfmFV6QkYCDg@mail.gmail.com/
      Reported-and-bisected-by: Mikhail Gavrilov <mikhail.v.gavrilov@gmail.com>
      Suggested-by: Jan Kara <jack@suse.cz>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Kirill Shutemov <kirill@shutemov.name>
      Cc: William Kucharski <william.kucharski@oracle.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  3. 19 Jun 2019 (1 commit)
  4. 15 May 2019 (5 commits)
  5. 06 Apr 2019 (1 commit)
  6. 03 Apr 2019 (1 commit)
  7. 06 Mar 2019 (5 commits)
  8. 05 Jan 2019 (1 commit)
    • mm: treewide: remove unused address argument from pte_alloc functions · 4cf58924
      Authored by Joel Fernandes (Google)
      Patch series "Add support for fast mremap".
      
      This series speeds up the mremap(2) syscall by copying page tables at
      the PMD level even for non-THP systems.  There is concern that the extra
      'address' argument that mremap passes to pte_alloc may do something
      subtle and architecture-specific in the future that could make the
      scheme not work.  We also find that there is no point in passing the
      'address' to pte_alloc, since it is unused.  This patch therefore
      removes the argument tree-wide, resulting in a nice negative diff as
      well, while ensuring along the way that the enabled architectures do not
      do anything funky with the 'address' argument that would go unnoticed by
      the optimization.
      
      Build and boot tested on x86-64.  Build tested on arm64.  The config
      enablement patch for arm64 will be posted in the future after more
      testing.
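
      As a concrete illustration (a hedged sketch; the function bodies and GFP
      details vary per architecture, the prototypes below follow the generic
      pattern), the conversion simply drops the trailing argument:

      	/* Before: 'address' is accepted but never used. */
      	pte_t *pte_alloc_one_kernel(struct mm_struct *mm, unsigned long address)
      	{
      		return (pte_t *)__get_free_page(GFP_KERNEL | __GFP_ZERO);
      	}

      	/* After: the unused argument is removed tree-wide. */
      	pte_t *pte_alloc_one_kernel(struct mm_struct *mm)
      	{
      		return (pte_t *)__get_free_page(GFP_KERNEL | __GFP_ZERO);
      	}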
      
      The changes were obtained by applying the following Coccinelle script
      (thanks to Julia for answering all Coccinelle questions!).
      The following fix-ups were done manually:
      * Removal of the address argument from pte_fragment_alloc
      * Removal of pte_alloc_one_fast definitions from m68k and microblaze.
      
      // Options: --include-headers --no-includes
      // Note: I split the 'identifier fn' line, so if you are manually
      // running it, please unsplit it so it runs for you.
      
      virtual patch
      
      @pte_alloc_func_def depends on patch exists@
      identifier E2;
      identifier fn =~
      "^(__pte_alloc|pte_alloc_one|pte_alloc|__pte_alloc_kernel|pte_alloc_one_kernel)$";
      type T2;
      @@
      
       fn(...
      - , T2 E2
       )
       { ... }
      
      @pte_alloc_func_proto_noarg depends on patch exists@
      type T1, T2, T3, T4;
      identifier fn =~ "^(__pte_alloc|pte_alloc_one|pte_alloc|__pte_alloc_kernel|pte_alloc_one_kernel)$";
      @@
      
      (
      - T3 fn(T1, T2);
      + T3 fn(T1);
      |
      - T3 fn(T1, T2, T4);
      + T3 fn(T1, T2);
      )
      
      @pte_alloc_func_proto depends on patch exists@
      identifier E1, E2, E4;
      type T1, T2, T3, T4;
      identifier fn =~
      "^(__pte_alloc|pte_alloc_one|pte_alloc|__pte_alloc_kernel|pte_alloc_one_kernel)$";
      @@
      
      (
      - T3 fn(T1 E1, T2 E2);
      + T3 fn(T1 E1);
      |
      - T3 fn(T1 E1, T2 E2, T4 E4);
      + T3 fn(T1 E1, T2 E2);
      )
      
      @pte_alloc_func_call depends on patch exists@
      expression E2;
      identifier fn =~
      "^(__pte_alloc|pte_alloc_one|pte_alloc|__pte_alloc_kernel|pte_alloc_one_kernel)$";
      @@
      
       fn(...
      -,  E2
       )
      
      @pte_alloc_macro depends on patch exists@
      identifier fn =~
      "^(__pte_alloc|pte_alloc_one|pte_alloc|__pte_alloc_kernel|pte_alloc_one_kernel)$";
      identifier a, b, c;
      expression e;
      position p;
      @@
      
      (
      - #define fn(a, b, c) e
      + #define fn(a, b) e
      |
      - #define fn(a, b) e
      + #define fn(a) e
      )
      
      Link: http://lkml.kernel.org/r/20181108181201.88826-2-joelaf@google.com
      Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
      Suggested-by: Kirill A. Shutemov <kirill@shutemov.name>
      Acked-by: Kirill A. Shutemov <kirill@shutemov.name>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Julia Lawall <Julia.Lawall@lip6.fr>
      Cc: Kirill A. Shutemov <kirill@shutemov.name>
      Cc: William Kucharski <william.kucharski@oracle.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  9. 29 Dec 2018 (4 commits)
  10. 22 Dec 2018 (1 commit)
  11. 09 Dec 2018 (1 commit)
    • Revert "mm, thp: consolidate THP gfp handling into alloc_hugepage_direct_gfpmask" · 356ff8a9
      Authored by David Rientjes
      This reverts commit 89c83fb5.
      
      This should have been done as part of 2f0799a0 ("mm, thp: restore
      node-local hugepage allocations").  The movement of the thp allocation
      policy from alloc_pages_vma() to alloc_hugepage_direct_gfpmask() was
      intended to only set __GFP_THISNODE for mempolicies that are not
      MPOL_BIND whereas the revert could set this regardless of mempolicy.
      
      While the check for MPOL_BIND between alloc_hugepage_direct_gfpmask()
      and alloc_pages_vma() was racy, that has since been removed since the
      revert.  What is left is the possibility of using __GFP_THISNODE in
      policy_node() when it is unexpected, because the special handling for
      hugepages in alloc_pages_vma() was removed as part of the consolidation.
      
      Secondly, prior to 89c83fb5, alloc_pages_vma() implemented a somewhat
      different policy for hugepage allocations, which were allocated through
      alloc_hugepage_vma(): if the allocating process's node is in the set of
      allowed nodes, allocate with __GFP_THISNODE for that node (for
      MPOL_PREFERRED, use that node with __GFP_THISNODE instead).  This was
      changed for shmem_alloc_hugepage() in 89c83fb5 to allow fallback to
      other nodes, as it did for new_page() in mm/mempolicy.c, which is
      functionally different behavior and removes the requirement to only
      allocate hugepages locally.
      
      So this commit does a full revert of 89c83fb5 instead of the partial
      revert that was done in 2f0799a0.  The result is the same thp
      allocation policy for 4.20 that was in 4.19.
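
      The node-local hugepage policy being restored can be sketched roughly as
      follows (a simplified illustration of the behavior described above, with
      a hypothetical helper name, not the literal alloc_pages_vma() /
      alloc_hugepage_vma() code):

      	static struct page *thp_alloc_local(gfp_t gfp, int order,
      					    struct mempolicy *pol)
      	{
      		int nid = numa_node_id();	/* the allocating process's node */

      		/* For MPOL_PREFERRED, target the preferred node instead. */
      		if (pol->mode == MPOL_PREFERRED && !(pol->flags & MPOL_F_LOCAL))
      			nid = pol->v.preferred_node;

      		/* Pin the hugepage attempt to that node; no remote fallback. */
      		return __alloc_pages_node(nid, gfp | __GFP_THISNODE, order);
      	}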
      
      Fixes: 89c83fb5 ("mm, thp: consolidate THP gfp handling into alloc_hugepage_direct_gfpmask")
      Fixes: 2f0799a0 ("mm, thp: restore node-local hugepage allocations")
      Signed-off-by: David Rientjes <rientjes@google.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  12. 06 Dec 2018 (1 commit)
    • mm, thp: restore node-local hugepage allocations · 2f0799a0
      Authored by David Rientjes
      This is a full revert of ac5b2c18 ("mm: thp: relax __GFP_THISNODE for
      MADV_HUGEPAGE mappings") and a partial revert of 89c83fb5 ("mm, thp:
      consolidate THP gfp handling into alloc_hugepage_direct_gfpmask").
      
      By not setting __GFP_THISNODE, applications can allocate remote hugepages
      when the local node is fragmented or low on memory when either the thp
      defrag setting is "always" or the vma has been madvised with
      MADV_HUGEPAGE.
      
      Remote access to hugepages often has much higher latency than local
      pages of the native page size.  On Haswell, ac5b2c18 was shown to cause
      a 13.9% access regression for binaries that remap their text segment to
      be backed by transparent hugepages.
      
      The intent of ac5b2c18 is to address an issue where a local node is
      low on memory or fragmented such that a hugepage cannot be allocated.  In
      every scenario where this was described as a fix, there is abundant and
      unfragmented remote memory available to allocate from, even with a greater
      access latency.
      
      If remote memory is also low or fragmented, not setting __GFP_THISNODE was
      also measured on Haswell to have a 40% regression in allocation latency.
      
      Restore __GFP_THISNODE for thp allocations.
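
      A hedged sketch of when a fault qualifies for this node-local treatment
      (hypothetical helper name; the real checks live in the defrag-mode
      handling of alloc_hugepage_direct_gfpmask()):

      	static bool thp_fault_wants_local_node(struct vm_area_struct *vma)
      	{
      		/* defrag set to "always", or the VMA madvised MADV_HUGEPAGE */
      		return test_bit(TRANSPARENT_HUGEPAGE_DEFRAG_DIRECT_FLAG,
      				&transparent_hugepage_flags) ||
      		       (vma->vm_flags & VM_HUGEPAGE);
      	}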
      
      Fixes: ac5b2c18 ("mm: thp: relax __GFP_THISNODE for MADV_HUGEPAGE mappings")
      Fixes: 89c83fb5 ("mm, thp: consolidate THP gfp handling into alloc_hugepage_direct_gfpmask")
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: David Rientjes <rientjes@google.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  13. 01 Dec 2018 (3 commits)
  14. 04 Nov 2018 (1 commit)
    • mm, thp: consolidate THP gfp handling into alloc_hugepage_direct_gfpmask · 89c83fb5
      Authored by Michal Hocko
      THP allocation mode is quite complex and it depends on the defrag mode.
      This complexity is currently largely hidden in
      alloc_hugepage_direct_gfpmask.  The NUMA special casing (namely
      __GFP_THISNODE) is however independent and currently placed in
      alloc_pages_vma.  This both adds an unnecessary branch to all vma-based
      page allocation requests and makes the code unnecessarily complex.  Not
      to mention that e.g. shmem THP used to do the node reclaiming
      unconditionally regardless of the defrag mode until recently.  This was
      not only unexpected behavior, it was also hardly a good default, and I
      strongly suspect it was just a side effect of the code sharing rather
      than a deliberate decision, which suggests that such a layering is
      wrong.
      
      Get rid of the thp special casing from alloc_pages_vma and move the
      logic to alloc_hugepage_direct_gfpmask. __GFP_THISNODE is applied to the
      resulting gfp mask only when the direct reclaim is not requested and
      when there is no explicit numa binding to preserve the current logic.
      
      Please note that there's also a slight difference wrt MPOL_BIND now. The
      previous code would avoid using __GFP_THISNODE if the local node was
      outside of policy_nodemask(). After this patch __GFP_THISNODE is avoided
      for all MPOL_BIND policies. So there's a difference that if local node
      is actually allowed by the bind policy's nodemask, previously
      __GFP_THISNODE would be added, but now it won't be. From the behavior
      POV this is still correct because the policy nodemask is used.
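
      The rule this boils down to can be sketched as follows (a hedged,
      simplified illustration with a hypothetical helper name, not the full
      alloc_hugepage_direct_gfpmask()):

      	static gfp_t thp_apply_thisnode(gfp_t gfp, struct mempolicy *pol)
      	{
      		/*
      		 * Stay on the local node only for optimistic attempts (no
      		 * direct reclaim requested) and only without an explicit
      		 * NUMA binding.
      		 */
      		if (!(gfp & __GFP_DIRECT_RECLAIM) && pol->mode != MPOL_BIND)
      			gfp |= __GFP_THISNODE;
      		return gfp;
      	}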
      
      Link: http://lkml.kernel.org/r/20180925120326.24392-3-mhocko@kernel.org
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Alex Williamson <alex.williamson@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Stefan Priebe - Profihost AG <s.priebe@profihost.ag>
      Cc: Zi Yan <zi.yan@cs.rutgers.edu>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  15. 27 Oct 2018 (3 commits)
    • mm: thp: fix mmu_notifier in migrate_misplaced_transhuge_page() · 7066f0f9
      Authored by Andrea Arcangeli
      change_huge_pmd() after arming the numa/protnone pmd doesn't flush the TLB
      right away.  do_huge_pmd_numa_page() flushes the TLB before calling
      migrate_misplaced_transhuge_page().  By the time do_huge_pmd_numa_page()
      runs some CPU could still access the page through the TLB.
      
      change_huge_pmd() before arming the numa/protnone transhuge pmd calls
      mmu_notifier_invalidate_range_start().  So there's no need of
      mmu_notifier_invalidate_range_start()/mmu_notifier_invalidate_range_only_end()
      sequence in migrate_misplaced_transhuge_page() too, because by the time
      migrate_misplaced_transhuge_page() runs, the pmd mapping has already been
      invalidated in the secondary MMUs.  It has to be, because if a secondary
      MMU could still write to the page, migrate_page_copy() would lose data.
      
      However an explicit mmu_notifier_invalidate_range() is needed before
      migrate_misplaced_transhuge_page() starts copying the data of the
      transhuge page or the below can happen for MMU notifier users sharing the
      primary MMU pagetables and only implementing ->invalidate_range:
      
      CPU0		CPU1		GPU sharing linux pagetables using
                                      only ->invalidate_range
      -----------	------------	---------
      				GPU secondary MMU writes to the page
      				mapped by the transhuge pmd
      change_pmd_range()
      mmu..._range_start()
      ->invalidate_range_start() noop
      change_huge_pmd()
      set_pmd_at(numa/protnone)
      pmd_unlock()
      		do_huge_pmd_numa_page()
      		CPU TLB flush globally (1)
      		CPU cannot write to page
      		migrate_misplaced_transhuge_page()
      				GPU writes to the page...
      		migrate_page_copy()
      				...GPU stops writing to the page
      CPU TLB flush (2)
      mmu..._range_end() (3)
      ->invalidate_range_end() noop
      ->invalidate_range()
      				GPU secondary MMU is invalidated
      				and cannot write to the page anymore
      				(too late)
      
      Just like we need a CPU TLB flush (1) because the TLB flush (2) arrives
      too late, we also need a mmu_notifier_invalidate_range() before calling
      migrate_misplaced_transhuge_page(), because the ->invalidate_range() in
      (3) also arrives too late.
      
      This requirement is the result of the lazy optimization in
      change_huge_pmd() that releases the pmd_lock without first flushing the
      TLB and without first calling mmu_notifier_invalidate_range().
      
      Even converting the removed mmu_notifier_invalidate_range_only_end() into
      a mmu_notifier_invalidate_range_end() would not have been enough to fix
      this, because it runs after migrate_page_copy().
      
      After the hugepage data copy is done, migrate_misplaced_transhuge_page()
      can proceed and call set_pmd_at() without having to flush the TLB or any
      secondary MMUs, because the secondary MMU invalidate, just like the CPU
      TLB flush, has to happen before migrate_page_copy() is called, or it
      would be a bug in the first place (and it was for drivers using
      ->invalidate_range()).
      
      KVM is unaffected because it doesn't implement ->invalidate_range().
      
      The standard PAGE_SIZEd migrate_misplaced_page is less accelerated and
      uses the generic migrate_pages which transitions the pte from
      numa/protnone to a migration entry in try_to_unmap_one() and flushes TLBs
      and all mmu notifiers there before copying the page.
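
      The ordering argued for above amounts to the following hedged sketch
      (hypothetical helper name; locking, error handling and the actual
      migration bookkeeping are omitted):

      	static void thp_migrate_prepare_copy(struct vm_area_struct *vma,
      					     struct mm_struct *mm,
      					     unsigned long start,
      					     struct page *page,
      					     struct page *new_page)
      	{
      		unsigned long end = start + HPAGE_PMD_SIZE;

      		flush_tlb_range(vma, start, end);		/* CPU TLB flush (1) */
      		mmu_notifier_invalidate_range(mm, start, end);	/* secondary MMUs, before the copy */
      		migrate_page_copy(new_page, page);		/* the data copy is now safe */
      	}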
      
      Link: http://lkml.kernel.org/r/20181013002430.698-3-aarcange@redhat.com
      Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
      Acked-by: Mel Gorman <mgorman@suse.de>
      Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Reviewed-by: Aaron Tomlin <atomlin@redhat.com>
      Cc: Jerome Glisse <jglisse@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/gup: cache dev_pagemap while pinning pages · df06b37f
      Authored by Keith Busch
      Getting pages from ZONE_DEVICE memory needs to check the backing
      device's liveness, which is tracked in the device's dev_pagemap
      metadata.  This metadata is stored in a radix tree, and looking it up
      adds measurable software overhead.
      
      This patch avoids repeating this relatively costly operation when
      dev_pagemap is used by caching the last dev_pagemap while getting user
      pages.  The gup_benchmark kernel self test reports this reduces time to
      get user pages to as low as 1/3 of the previous time.
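
      The caching pattern this describes looks roughly like the following
      hedged sketch (hypothetical helper; the real code sits in the gup fast
      path): get_dev_pagemap() accepts the previously returned pgmap and
      reuses it when the pfn still falls inside it, skipping the radix tree
      lookup.

      	static int pin_devmap_pfns(unsigned long *pfns, struct page **pages, int nr)
      	{
      		struct dev_pagemap *pgmap = NULL;
      		int i;

      		for (i = 0; i < nr; i++) {
      			pgmap = get_dev_pagemap(pfns[i], pgmap);
      			if (unlikely(!pgmap))
      				break;			/* backing device not live */
      			pages[i] = pfn_to_page(pfns[i]);
      			get_page(pages[i]);
      		}
      		if (pgmap)
      			put_dev_pagemap(pgmap);
      		return i;
      	}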
      
      Link: http://lkml.kernel.org/r/20181012173040.15669-1-keith.busch@intel.com
      Signed-off-by: Keith Busch <keith.busch@intel.com>
      Reviewed-by: Dan Williams <dan.j.williams@intel.com>
      Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: workingset: tell cache transitions from workingset thrashing · 1899ad18
      Authored by Johannes Weiner
      Refaults happen during transitions between workingsets as well as in-place
      thrashing.  Knowing the difference between the two has a range of
      applications, including measuring the impact of memory shortage on the
      system performance, as well as the ability to smarter balance pressure
      between the filesystem cache and the swap-backed workingset.
      
      During workingset transitions, inactive cache refaults and pushes out
      established active cache.  When that active cache isn't stale, however,
      and also ends up refaulting, that's bona fide thrashing.
      
      Introduce a new page flag that tells on eviction whether the page has been
      active or not in its lifetime.  This bit is then stored in the shadow
      entry, to classify refaults as transitioning or thrashing.
      
      How many page->flags does this leave us with on 32-bit?
      
      	20 bits are always page flags
      
      	21 if you have an MMU
      
      	23 with the zone bits for DMA, Normal, HighMem, Movable
      
      	29 with the sparsemem section bits
      
      	30 if PAE is enabled
      
      	31 with this patch.
      
      So on 32-bit PAE, that leaves 1 bit for distinguishing two NUMA nodes.  If
      that's not enough, the system can switch to discontigmem and re-gain the 6
      or 7 sparsemem section bits.
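
      A hedged sketch of how the bit ends up in the shadow entry on eviction
      (simplified from mm/workingset.c as in current kernels; the real packing
      also truncates the eviction counter to the available bits):

      	static void *pack_shadow(int memcgid, pg_data_t *pgdat,
      				 unsigned long eviction, bool workingset)
      	{
      		eviction = (eviction << MEM_CGROUP_ID_SHIFT) | memcgid;
      		eviction = (eviction << NODES_SHIFT) | pgdat->node_id;
      		eviction = (eviction << 1) | workingset;	/* the new flag */

      		return xa_mk_value(eviction);
      	}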
      
      Link: http://lkml.kernel.org/r/20180828172258.3185-3-hannes@cmpxchg.org
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Tested-by: Daniel Drake <drake@endlessm.com>
      Tested-by: Suren Baghdasaryan <surenb@google.com>
      Cc: Christopher Lameter <cl@linux.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Johannes Weiner <jweiner@fb.com>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Enderborg <peter.enderborg@sony.com>
      Cc: Randy Dunlap <rdunlap@infradead.org>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Vinayak Menon <vinmenon@codeaurora.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  16. 21 Oct 2018 (1 commit)
  17. 18 Oct 2018 (1 commit)
  18. 13 Oct 2018 (1 commit)
  19. 06 Oct 2018 (1 commit)
    • mm, thp: fix mlocking THP page with migration enabled · e125fe40
      Authored by Kirill A. Shutemov
      A transparent huge page is represented by a single entry on an LRU list.
      Therefore, we can only make unevictable an entire compound page, not
      individual subpages.
      
      If a user tries to mlock() part of a huge page, we want the rest of the
      page to be reclaimable.
      
      We handle this by keeping PTE-mapped huge pages on normal LRU lists: the
      PMD on the border of a VM_LOCKED VMA will be split into a PTE table.
      
      Introduction of THP migration breaks[1] the rules around mlocking THP
      pages.  If we have a single PMD mapping of the page in a mlocked VMA,
      the page gets mlocked, regardless of PTE mappings of the page.
      
      For tmpfs/shmem it's easy to fix by checking PageDoubleMap() in
      remove_migration_pmd().
      
      Anon THP pages can only be shared between processes via fork().  A
      mlocked page can only be shared if the parent mlocked it before forking;
      otherwise CoW will be triggered on mlock().
      
      For Anon-THP, we can fix the issue by munlocking the page on removing PTE
      migration entry for the page.  PTEs for the page will always come after
      mlocked PMD: rmap walks VMAs from oldest to newest.
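
      A hedged sketch of the two checks this describes (not the literal diff;
      the changes land in the migration-entry removal paths):

      	/* remove_migration_pmd(): only mlock the restored PMD mapping when
      	 * the page is not also PTE-mapped. */
      	if ((vma->vm_flags & VM_LOCKED) && !PageDoubleMap(new))
      		mlock_vma_page(new);

      	/* remove_migration_pte(): a compound page whose PTE migration entry
      	 * is being removed must not stay mlocked, because PTE-mapped THPs
      	 * are kept on the regular LRU lists. */
      	if (PageTransHuge(new) && PageMlocked(new))
      		clear_page_mlock(new);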
      
      Test-case:
      
      	#include <unistd.h>
      	#include <sys/mman.h>
      	#include <sys/wait.h>
      	#include <linux/mempolicy.h>
      	#include <numaif.h>
      
      	int main(void)
      	{
      		unsigned long nodemask = 4;
      		void *addr;

      		addr = mmap((void *)0x20000000UL, 2UL << 20, PROT_READ | PROT_WRITE,
      			    MAP_PRIVATE | MAP_ANONYMOUS | MAP_LOCKED, -1, 0);

      		if (fork()) {
      			wait(NULL);
      			return 0;
      		}

      		mlock(addr, 4UL << 10);
      		mbind(addr, 2UL << 20, MPOL_PREFERRED | MPOL_F_RELATIVE_NODES,
      		      &nodemask, 4, MPOL_MF_MOVE);

      		return 0;
      	}
      
      [1] https://lkml.kernel.org/r/CAOMGZ=G52R-30rZvhGxEbkTw7rLLwBGadVYeo--iizcD3upL3A@mail.gmail.com
      
      Link: http://lkml.kernel.org/r/20180917133816.43995-1-kirill.shutemov@linux.intel.com
      Fixes: 616b8371 ("mm: thp: enable thp migration in generic path")
      Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Reported-by: Vegard Nossum <vegard.nossum@oracle.com>
      Reviewed-by: Zi Yan <zi.yan@cs.rutgers.edu>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: <stable@vger.kernel.org>	[4.14+]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
  20. 05 Sep 2018 (1 commit)
  21. 24 Aug 2018 (1 commit)
  22. 18 Aug 2018 (3 commits)
    • mm, huge page: copy target sub-page last when copy huge page · c9f4cd71
      Authored by Huang Ying
      Huge pages help to reduce the TLB miss rate, but they have a higher
      cache footprint, which can sometimes cause issues.  For example, when
      copying a huge page on the x86_64 platform, the cache footprint is 4M.
      But on a Xeon E5 v3 2699 CPU, there are 18 cores, 36 threads, and only
      45M LLC (last level cache).  That is, on average, there is 2.5M LLC for
      each core and 1.25M LLC for each thread.
      
      If the cache contention is heavy when copying the huge page, and we copy
      the huge page from the beginning to the end, it is possible that the
      beginning of the huge page is evicted from the cache by the time we
      finish copying the end of the huge page.  And it is possible for the
      application to access the beginning of the huge page right after the
      copy.
      
      In c79b57e4 ("mm: hugetlb: clear target sub-page last when clearing
      huge page"), to keep the cache lines of the target subpage hot, the
      order in which the subpages of the huge page are cleared in
      clear_huge_page() was changed to clear the subpage furthest from the
      target subpage first, and the target subpage last.  A similar ordering
      helps huge page copying too, and that is what this patch implements.
      Because the ordering algorithm has been put into a separate function,
      the implementation is quite simple.
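
      The idea can be sketched as follows (a hedged, hypothetical illustration
      of the copy order; the real code reuses the ordering helper shared with
      clear_huge_page()):

      	static void copy_huge_page_target_last(struct page *dst, struct page *src,
      					       unsigned long addr_hint,
      					       unsigned int pages_per_huge_page,
      					       struct vm_area_struct *vma)
      	{
      		unsigned long addr = addr_hint &
      			~(((unsigned long)pages_per_huge_page << PAGE_SHIFT) - 1);
      		unsigned int target = (addr_hint - addr) >> PAGE_SHIFT;
      		unsigned int left = 0, right = pages_per_huge_page - 1;

      		/* Walk inwards from both ends, always copying the subpage
      		 * furthest from the target, so the target is copied last and
      		 * its cache lines stay hot. */
      		while (left < right) {
      			unsigned int far = (target - left >= right - target) ?
      					   left++ : right--;

      			copy_user_highpage(dst + far, src + far,
      					   addr + far * PAGE_SIZE, vma);
      			cond_resched();
      		}
      		copy_user_highpage(dst + target, src + target,
      				   addr + target * PAGE_SIZE, vma);
      	}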
      
      The patch is a generic optimization which should benefit quite a few
      workloads, not a specific use case.  To demonstrate the performance
      benefit of the patch, we tested it with vm-scalability running on
      transparent huge pages.
      
      With this patch, the throughput increases ~16.6% in the vm-scalability
      anon-cow-seq test case with 36 processes on a 2-socket Xeon E5 v3 2699
      system (36 cores, 72 threads).  The test case sets
      /sys/kernel/mm/transparent_hugepage/enabled to always, mmap()s a big
      anonymous memory area and populates it, then forks 36 child processes,
      each of which writes to the anonymous memory area from beginning to end,
      causing copy-on-write.  For each child process, the other child
      processes can be seen as other workloads which generate heavy cache
      pressure.  At the same time, the IPC (instructions per cycle) increased
      from 0.63 to 0.78, and the time spent in user space was reduced by
      ~7.2%.
      
      Link: http://lkml.kernel.org/r/20180524005851.4079-3-ying.huang@intel.com
      Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
      Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Andi Kleen <andi.kleen@intel.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Shaohua Li <shli@fb.com>
      Cc: Christopher Lameter <cl@linux.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • thp: use mm_counter_file() to determine which rss counter to update · fadae295
      Authored by Yang Shi
      Since commit eca56ff9 ("mm, shmem: add internal shmem resident
      memory accounting"), MM_SHMEMPAGES is added to separate the shmem
      accounting from regular files.  So, all shmem pages should be accounted
      to MM_SHMEMPAGES instead of MM_FILEPAGES.
      
      And, normal 4K shmem pages have been accounted to MM_SHMEMPAGES, so
      shmem thp pages should not be treated differently.  Account them to
      MM_SHMEMPAGES via mm_counter_file(), since shmem pages are swap backed,
      to keep consistent with normal 4K shmem pages.
      
      This will not change the rss counter of processes since shmem pages are
      still a part of it.
      
      The /proc/pid/status and /proc/pid/statm counters will however be more
      accurate wrt shmem usage, as originally intended.  And, as eca56ff9
      ("mm, shmem: add internal shmem resident memory accounting") mentioned,
      the oom report could also show a more accurate "shmem-rss".
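
      The routing this relies on is visible in mm_counter_file() itself (from
      include/linux/mm.h), followed by a hedged sketch of a call site for a
      PMD-mapped file/shmem THP (hypothetical wrapper name; the real call
      sites are in mm/huge_memory.c and mm/memory.c):

      	static inline int mm_counter_file(struct page *page)
      	{
      		if (PageSwapBacked(page))
      			return MM_SHMEMPAGES;	/* shmem/tmpfs, including THP */
      		return MM_FILEPAGES;
      	}

      	static void account_file_thp(struct vm_area_struct *vma, struct page *page)
      	{
      		add_mm_counter(vma->vm_mm, mm_counter_file(page), HPAGE_PMD_NR);
      	}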
      
      Link: http://lkml.kernel.org/r/1529442518-17398-1-git-send-email-yang.shi@linux.alibaba.com
      Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • dax: remove VM_MIXEDMAP for fsdax and device dax · e1fb4a08
      Authored by Dave Jiang
      This patch is reworked from an earlier patch that Dan has posted:
      https://patchwork.kernel.org/patch/10131727/
      
      VM_MIXEDMAP is used by dax to tell mm paths like vm_normal_page() that
      the memory page it is dealing with is not typical memory from the linear
      map.  The get_user_pages_fast() path, since it does not resolve the vma,
      already uses {pte,pmd}_devmap() as a stand-in for VM_MIXEDMAP, so we use
      that as a VM_MIXEDMAP replacement in some locations.  In the cases where
      there is no pte to consult, we fall back to using vma_is_dax() to detect
      the VM_MIXEDMAP special case.
      
      Now that we have explicit driver pfn_t-flag opt-in/opt-out for
      get_user_pages() support for DAX we can stop setting VM_MIXEDMAP.  This
      also means we no longer need to worry about safely manipulating vm_flags
      in a future where we support dynamically changing the dax mode of a
      file.
      
      DAX should also now be supported with madvise_behavior(), vma_merge(),
      and copy_page_range().
      
      This patch has been tested against the ndctl unit tests.  It has also
      been tested against xfstests commit 625515d using fake pmem created by
      memmap, and no additional issues have been observed.
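
      The substitution described above can be sketched as follows (a hedged,
      hypothetical helper, not the literal diff): decide from the pte's devmap
      bit when a pte is available, and fall back to vma_is_dax() when it is
      not.

      	static bool is_dax_mapping(struct vm_area_struct *vma, pte_t *ptep)
      	{
      		if (ptep)
      			return pte_devmap(*ptep);
      		return vma_is_dax(vma);
      	}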
      
      Link: http://lkml.kernel.org/r/152847720311.55924.16999195879201817653.stgit@djiang5-desk3.ch.intel.com
      Signed-off-by: Dave Jiang <dave.jiang@intel.com>
      Acked-by: Dan Williams <dan.j.williams@intel.com>
      Cc: Jan Kara <jack@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>