1. 04 Jun 2020, 3 commits
    • mm/vmscan.c: change prototype for shrink_page_list · 730ec8c0
      Maninder Singh committed
      commit 3c710c1a ("mm, vmscan extract shrink_page_list reclaim counters
      into a struct") changed the data type used by the function, so change the
      return type of the function and its caller accordingly.
      Signed-off-by: Vaneet Narang <v.narang@samsung.com>
      Signed-off-by: Maninder Singh <maninder1.s@samsung.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Amit Sahrawat <a.sahrawat@samsung.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Link: http://lkml.kernel.org/r/1588168259-25604-1-git-send-email-maninder1.s@samsung.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      730ec8c0
    • mm/page_alloc: integrate classzone_idx and high_zoneidx · 97a225e6
      Joonsoo Kim committed
      classzone_idx is now just a different name for high_zoneidx.  So, integrate
      them, and add a comment to struct alloc_context in order to reduce future
      confusion about the meaning of this variable.

      The accessor ac_classzone_idx() is also removed since it isn't needed
      after the integration.

      In addition to the integration, this patch also renames high_zoneidx to
      highest_zoneidx since that name conveys its meaning more precisely.
      Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Baoquan He <bhe@redhat.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Acked-by: David Rientjes <rientjes@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Ye Xiaolong <xiaolong.ye@intel.com>
      Link: http://lkml.kernel.org/r/1587095923-7515-3-git-send-email-iamjoonsoo.kim@lge.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      97a225e6
    • mm/page_alloc: use ac->high_zoneidx for classzone_idx · 3334a45e
      Joonsoo Kim committed
      Patch series "integrate classzone_idx and high_zoneidx", v5.
      
      This patchset is a follow-up to the problem reported and discussed two
      years ago [1, 2].  The problem it solves is related to classzone_idx on
      NUMA systems.  A problem occurs when lowmem reserve protection exists for
      some zones on a node that do not exist on other nodes.

      This problem was reported two years ago, and, at that time, the solution
      got general agreement [2], but it was not upstreamed.
      
      [1]: http://lkml.kernel.org/r/20180102063528.GG30397@yexl-desktop
      [2]: http://lkml.kernel.org/r/1525408246-14768-1-git-send-email-iamjoonsoo.kim@lge.com
      
      This patch (of 2):
      
      Currently, we use classzone_idx to calculate the lowmem reserve protection
      for an allocation request.  This classzone_idx causes a problem on NUMA
      systems when lowmem reserve protection exists for some zones on a node
      that do not exist on other nodes.

      Before further explanation, I should first clarify how the classzone_idx
      and the high_zoneidx are computed.
      
      - ac->high_zoneidx is computed via the arcane gfp_zone(gfp_mask) and
        represents the index of the highest zone the allocation can use
      
      - classzone_idx was supposed to be the index of the highest zone on the
        local node that the allocation can use, that is actually available in
        the system
      
      Consider the following example.  Node 0 has 4 populated zones,
      DMA/DMA32/NORMAL/MOVABLE.  Node 1 has 1 populated zone, NORMAL.  Some
      zones, such as MOVABLE, don't exist on node 1, and this makes the
      following difference.
      
      Assume that there is an allocation request whose gfp_zone(gfp_mask) is the
      MOVABLE zone.  Then, its high_zoneidx is 3.  If this allocation is
      initiated on node 0, its classzone_idx is 3 since the actually
      available/usable zone on the local node (node 0) is MOVABLE.  If this
      allocation is initiated on node 1, its classzone_idx is 2 since the
      actually available/usable zone on the local node (node 1) is NORMAL.

      You can see that the classzone_idx of the allocation requests differs
      according to their starting node, even though their high_zoneidx is the same.
      
      Consider these two allocation requests further.  If they are processed
      locally, there is no problem.  However, if the allocation initiated on
      node 1 is processed remotely, in this example at the NORMAL zone on node 0
      due to memory shortage, a problem occurs.  Their different classzone_idx
      values lead to different lowmem reserves and then different min
      watermarks.  See the following example.
      
      root@ubuntu:/sys/devices/system/memory# cat /proc/zoneinfo
      Node 0, zone      DMA
        per-node stats
      ...
        pages free     3965
              min      5
              low      8
              high     11
              spanned  4095
              present  3998
              managed  3977
              protection: (0, 2961, 4928, 5440)
      ...
      Node 0, zone    DMA32
        pages free     757955
              min      1129
              low      1887
              high     2645
              spanned  1044480
              present  782303
              managed  758116
              protection: (0, 0, 1967, 2479)
      ...
      Node 0, zone   Normal
        pages free     459806
              min      750
              low      1253
              high     1756
              spanned  524288
              present  524288
              managed  503620
              protection: (0, 0, 0, 4096)
      ...
      Node 0, zone  Movable
        pages free     130759
              min      195
              low      326
              high     457
              spanned  1966079
              present  131072
              managed  131072
              protection: (0, 0, 0, 0)
      ...
      Node 1, zone      DMA
        pages free     0
              min      0
              low      0
              high     0
              spanned  0
              present  0
              managed  0
              protection: (0, 0, 1006, 1006)
      Node 1, zone    DMA32
        pages free     0
              min      0
              low      0
              high     0
              spanned  0
              present  0
              managed  0
              protection: (0, 0, 1006, 1006)
      Node 1, zone   Normal
        per-node stats
      ...
        pages free     233277
              min      383
              low      640
              high     897
              spanned  262144
              present  262144
              managed  257744
              protection: (0, 0, 0, 0)
      ...
      Node 1, zone  Movable
        pages free     0
              min      0
              low      0
              high     0
              spanned  262144
              present  0
              managed  0
              protection: (0, 0, 0, 0)
      
      - static min watermark for the NORMAL zone on node 0 is 750.
      
      - lowmem reserve for the request with classzone idx 3 at the NORMAL on
        node 0 is 4096.
      
      - lowmem reserve for the request with classzone idx 2 at the NORMAL on
        node 0 is 0.
      
      So, overall min watermark is:
      allocation initiated on node 0 (classzone_idx 3): 750 + 4096 = 4846
      allocation initiated on node 1 (classzone_idx 2): 750 + 0 = 750
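
      To make the arithmetic above concrete, here is a simplified sketch (not the
      exact mm/page_alloc.c code) of how the watermark check adds the zone's
      lowmem_reserve[] entry selected by classzone_idx:

          static bool watermark_ok_sketch(struct zone *z, unsigned long mark,
                                          int classzone_idx, long free_pages)
          {
                  long min = mark;

                  /*
                   * The lowmem reserve protects this zone from requests that
                   * could also have used a higher zone; which entry applies is
                   * chosen by the request's classzone_idx.
                   */
                  if (free_pages <= min + z->lowmem_reserve[classzone_idx])
                          return false;

                  return true;
          }

      With classzone_idx 3 the reserve term for the NORMAL zone on node 0 is 4096
      pages and with classzone_idx 2 it is 0, which reproduces the 4846 vs 750
      figures above.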
      
      The allocation initiated on node 1 will have some precedence over the
      allocation initiated on node 0 because the min watermark of the former is
      lower.  So, the allocation initiated on node 1 could succeed on node 0
      when the allocation initiated on node 0 could not, and this could cause
      too many numa_miss allocations.  Performance could then be degraded.
      
      Recently, there was a regression report about this problem against the CMA
      patches, since CMA memory is placed in ZONE_MOVABLE by those patches.  I
      checked that the problem disappears with this fix, which uses high_zoneidx
      for classzone_idx.
      
      http://lkml.kernel.org/r/20180102063528.GG30397@yexl-desktop
      
      Using high_zoneidx for classzone_idx is a more consistent approach than
      the previous one because the system's memory layout doesn't affect it at
      all.  With this patch, both classzone_idx values in the example above will
      be 3, so both requests will have the same min watermark.
      
      allocation initiated on node 0: 750 + 4096 = 4846
      allocation initiated on node 1: 750 + 4096 = 4846
      
      One could wonder whether there is a side effect: the allocation initiated
      on node 1 might face a higher bar when the allocation is handled locally,
      since its classzone_idx could be higher than before.  It will not happen
      because a zone without managed pages doesn't contribute to lowmem_reserve
      at all.
      Reported-by: Ye Xiaolong <xiaolong.ye@intel.com>
      Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Tested-by: Ye Xiaolong <xiaolong.ye@intel.com>
      Reviewed-by: Baoquan He <bhe@redhat.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Acked-by: David Rientjes <rientjes@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Link: http://lkml.kernel.org/r/1587095923-7515-1-git-send-email-iamjoonsoo.kim@lge.com
      Link: http://lkml.kernel.org/r/1587095923-7515-2-git-send-email-iamjoonsoo.kim@lge.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      3334a45e
  2. 03 Jun 2020, 2 commits
  3. 08 Apr 2020, 1 commit
  4. 03 Apr 2020, 4 commits
    • mm,compaction,cma: add alloc_contig flag to compact_control · b06eda09
      Rik van Riel committed
      Patch series "fix THP migration for CMA allocations", v2.
      
      Transparent huge pages are allocated with __GFP_MOVABLE, and can end up in
      CMA memory blocks.  Transparent huge pages also have most of the
      infrastructure in place to allow migration.
      
      However, a few pieces were missing, causing THP migration to fail when
      attempting to use CMA to allocate 1GB hugepages.
      
      With these patches in place, THP migration from CMA blocks seems to work,
      both for anonymous THPs and for tmpfs/shmem THPs.
      
      This patch (of 2):
      
      Add information to struct compact_control to indicate that the allocator
      would really like to clear out this specific part of memory, as used, for
      example, by CMA.
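
      A hedged sketch of the shape of this change (the neighbouring members
      shown are illustrative, not necessarily mainline's exact layout):

          /*
           * Sketch: compact_control gains a flag telling the scanners that this
           * range really must be cleared out, e.g. for alloc_contig_range()/CMA.
           */
          struct compact_control {
                  /* ... existing scanner state elided ... */
                  bool whole_zone;        /* Scan the whole zone */
                  bool contended;         /* Signal lock contention */
                  bool alloc_contig;      /* alloc_contig_range allocation */
          };
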
      Signed-off-by: Rik van Riel <riel@surriel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Zi Yan <ziy@nvidia.com>
      Cc: Joonsoo Kim <js1304@gmail.com>
      Link: http://lkml.kernel.org/r/20200227213238.1298752-1-riel@surriel.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      b06eda09
    • mm, pagealloc: micro-optimisation: save two branches on hot page allocation path · 736838e9
      Mateusz Nosek committed
      This patch makes ALLOC_KSWAPD equal to __GFP_KSWAPD_RECLAIM (cast to int).
      
      Thanks to that, code like:
      
          if (gfp_mask & __GFP_KSWAPD_RECLAIM)
      	    alloc_flags |= ALLOC_KSWAPD;
      
      can be changed to:
      
          alloc_flags |= (__force int) (gfp_mask & __GFP_KSWAPD_RECLAIM);
      
      Thanks to this, one branch fewer is generated in the assembly.

      In the case of the ALLOC_KSWAPD flag, two branches are saved: the first in
      code that always executes at the beginning of page allocation, and the
      second in a loop in the page allocator slowpath.
      Signed-off-by: Mateusz Nosek <mateusznosek0@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Acked-by: Mel Gorman <mgorman@techsingularity.net>
      Link: http://lkml.kernel.org/r/20200304162118.14784-1-mateusznosek0@gmail.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      736838e9
    • mm: allow VM_FAULT_RETRY for multiple times · 4064b982
      Peter Xu committed
      The idea comes from a discussion between Linus and Andrea [1].
      
      Before this patch we only allowed a page fault to be retried once.  We
      achieved this by clearing the FAULT_FLAG_ALLOW_RETRY flag when calling
      handle_mm_fault() the second time.  This was mainly used to avoid
      unexpected starvation of the system by looping forever to handle the page
      fault on a single page.  However, that should hardly happen; after all,
      every code path that returns VM_FAULT_RETRY first waits for a condition
      (during which time we should possibly yield the cpu) to happen before
      VM_FAULT_RETRY is really returned.
      
      This patch removes the restriction by keeping the FAULT_FLAG_ALLOW_RETRY
      flag when we receive VM_FAULT_RETRY.  It means that the page fault handler
      can now retry the page fault multiple times if necessary, without the need
      to generate another page fault event.  Meanwhile we still keep the
      FAULT_FLAG_TRIED flag so the page fault handler can still identify whether
      a page fault is the first attempt or not.
      
      Then we'll have these combinations of fault flags (only considering the
      ALLOW_RETRY flag and the TRIED flag):

        - ALLOW_RETRY and !TRIED:  the page fault is allowed to retry, and
                                   this is the first try

        - ALLOW_RETRY and TRIED:   the page fault is allowed to retry, and
                                   this is not the first try

        - !ALLOW_RETRY and !TRIED: the page fault is not allowed to retry
                                   at all

        - !ALLOW_RETRY and TRIED:  this is forbidden and should never be used
      
      In existing code we have multiple places that take special care of the
      first condition above by checking against (fault_flags &
      FAULT_FLAG_ALLOW_RETRY).  This patch introduces a simple helper to detect
      the first retry of a page fault by checking against both (fault_flags &
      FAULT_FLAG_ALLOW_RETRY) and !(fault_flags & FAULT_FLAG_TRIED), because now
      even the second try will have ALLOW_RETRY set, and then uses that helper
      in all existing special paths.  One example is in __lock_page_or_retry():
      now we'll drop the mmap_sem only on the first attempt of the page fault
      and keep it in follow-up retries, so the old locking behaviour is retained.
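
      A minimal sketch of such a helper, implementing exactly the two-flag check
      described above (the name and the header it lands in are assumptions here):

          static inline bool fault_flag_allow_retry_first(unsigned int flags)
          {
                  /* first attempt: retry is allowed and we have not tried yet */
                  return (flags & FAULT_FLAG_ALLOW_RETRY) &&
                         !(flags & FAULT_FLAG_TRIED);
          }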
      
      This will be a nice enhancement for the current code [2] and at the same
      time supporting material for the future userfaultfd-writeprotect work,
      since in that work there will always be an explicit userfault writeprotect
      retry for protected pages, and if that cannot resolve the page fault
      (e.g., when userfaultfd-writeprotect is used in conjunction with swapped
      pages) then we'll possibly need a 3rd retry of the page fault.  It might
      also benefit other potential users who will have similar requirements,
      such as userfault write-protection.
      
      GUP code is not touched yet and will be covered in a follow-up patch.
      
      Please read the thread below for more information.
      
      [1] https://lore.kernel.org/lkml/20171102193644.GB22686@redhat.com/
      [2] https://lore.kernel.org/lkml/20181230154648.GB9832@redhat.com/
      Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
      Suggested-by: Andrea Arcangeli <aarcange@redhat.com>
      Signed-off-by: Peter Xu <peterx@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Tested-by: Brian Geffon <bgeffon@google.com>
      Cc: Bobby Powers <bobbypowers@gmail.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Denis Plotnikov <dplotnikov@virtuozzo.com>
      Cc: "Dr . David Alan Gilbert" <dgilbert@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Jerome Glisse <jglisse@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: "Kirill A . Shutemov" <kirill@shutemov.name>
      Cc: Martin Cracauer <cracauer@cons.org>
      Cc: Marty McFadden <mcfadden8@llnl.gov>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Maya Gokhale <gokhale2@llnl.gov>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
      Cc: Pavel Emelyanov <xemul@openvz.org>
      Link: http://lkml.kernel.org/r/20200220160246.9790-1-peterx@redhat.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      4064b982
    • mm: swap: make page_evictable() inline · 1eb6234e
      Yang Shi committed
      When backporting commit 9c4e6b1a ("mm, mlock, vmscan: no more skipping
      pagevecs") to our 4.9 kernel, our test bench noticed a roughly 10% drop
      with a couple of vm-scalability test cases (lru-file-readonce,
      lru-file-readtwice and lru-file-mmap-read).  I didn't see that much of a
      drop on my VM (32c-64g-2nodes).  It might be caused by the test
      configuration, which is 32c-256g with NUMA disabled and the tests run in
      the root memcg, so the tests actually stress only one inactive and one
      active lru.  That is not very usual in a modern production environment.
      
      That commit made two major changes:
      1. Call page_evictable()
      2. Use smp_mb to make the setting of PG_lru visible
      
      It looks like they contribute most of the overhead.  page_evictable() is a
      function which incurs a function prologue and epilogue, and it used to be
      called by the page reclaim path only.  However, lru add is a very hot
      path, so it is better to make it inline.  It also calls page_mapping(),
      which is not inlined either, but the disassembly shows it doesn't do push
      and pop operations, and it is not very straightforward to inline it.
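
      For reference, a sketch of the inlined helper as described (close to what
      the patch moves into a header; treat the exact body and location as
      assumptions):

          static inline bool page_evictable(struct page *page)
          {
                  bool ret;

                  /* Prevent address_space of inode and swap cache from being freed */
                  rcu_read_lock();
                  ret = !mapping_unevictable(page_mapping(page)) &&
                        !PageMlocked(page);
                  rcu_read_unlock();
                  return ret;
          }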
      
      Other than this, smp_mb() is not necessary for x86 since SetPageLRU is an
      atomic operation which already enforces a memory barrier; it is replaced
      with smp_mb__after_atomic() in the following patch.
      
      With the two fixes applied, the tests regain around 5% on that test bench
      and return to normal on my VM.  Since the test bench configuration is not
      that usual, and I also saw around a 6% improvement on the latest upstream,
      this sounds good enough IMHO.
      
      The below is test data (lru-file-readtwice throughput) against the v5.6-rc4:
      	mainline	w/ inline fix
                150MB            154MB
      
      With this patch the throughput goes up by 2.67%.  The data with
      smp_mb__after_atomic() is shown in the following patch.
      
      Shakeel Butt ran the test below:

      On a real machine, with 'dd' limited to a single node and reading a 100
      GiB sparse file (less than a single node).  Only a single instance was run
      so as not to cause lru lock contention.  The cmdline used was "dd
      if=file-100GiB of=/dev/null bs=4k".  The cmd was run 10 times with
      drop_caches in between, and the time it took was measured.
      
      Without patch: 56.64143 +- 0.672 sec
      
      With patches: 56.10 +- 0.21 sec
      
      [akpm@linux-foundation.org: move page_evictable() to internal.h]
      Fixes: 9c4e6b1a ("mm, mlock, vmscan: no more skipping pagevecs")
      Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Tested-by: Shakeel Butt <shakeelb@google.com>
      Reviewed-by: Shakeel Butt <shakeelb@google.com>
      Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Link: http://lkml.kernel.org/r/1584500541-46817-1-git-send-email-yang.shi@linux.alibaba.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      1eb6234e
  5. 02 Dec 2019, 1 commit
  6. 01 Dec 2019, 3 commits
  7. 26 Sep 2019, 1 commit
    • mm: introduce MADV_COLD · 9c276cc6
      Minchan Kim committed
      Patch series "Introduce MADV_COLD and MADV_PAGEOUT", v7.
      
      - Background
      
      The Android terminology used for forking a new process and starting an app
      from scratch is a cold start, while resuming an existing app is a hot
      start.  While we continually try to improve the performance of cold
      starts, hot starts will always be significantly less power hungry as well
      as faster so we are trying to make hot start more likely than cold start.
      
      To increase hot starts, Android userspace manages the order in which apps
      should be killed in a process called ActivityManagerService.
      ActivityManagerService tracks every Android app or service that the user
      could be interacting with at any time and translates that into a ranked
      list for lmkd (the low memory killer daemon).  They are likely to be
      killed by lmkd if the system has to reclaim memory.  In that sense they
      are similar to entries in any other cache.  Those apps are kept alive for
      opportunistic performance improvements, but those performance improvements
      will vary based on the memory requirements of individual workloads.
      
      - Problem
      
      Naturally, cached apps were dominant consumers of memory on the system.
      However, they were not significant consumers of swap even though they are
      good candidates for swap.  Under investigation, swapping out only begins
      once the low zone watermark is hit and kswapd wakes up, but the overall
      allocation rate in the system might trip lmkd thresholds and cause a
      cached process to be killed (we measured the performance of swapping out
      vs. zapping the memory by killing a process; unsurprisingly, zapping is
      10x faster even though we use zram, which is much faster than real
      storage), so kills from lmkd will often satisfy the high zone watermark,
      resulting in very few pages actually being moved to swap.
      
      - Approach
      
      The approach we chose was to use a new interface to allow userspace to
      proactively reclaim entire processes by leveraging platform information.
      This allowed us to bypass the inaccuracy of the kernel's LRUs for pages
      that are known to be cold from userspace and to avoid races with lmkd by
      reclaiming apps as soon as they entered the cached state.  Additionally,
      it gives the platform many chances to use the information it has to
      optimize memory efficiency.
      
      To achieve the goal, the patchset introduces two new options for madvise.
      One is MADV_COLD, which will deactivate activated pages, and the other is
      MADV_PAGEOUT, which will reclaim private pages instantly.  These new
      options complement MADV_DONTNEED and MADV_FREE by adding non-destructive
      ways to gain some free memory space.  MADV_PAGEOUT is similar to
      MADV_DONTNEED in that it hints to the kernel that the memory region is not
      currently needed and should be reclaimed immediately; MADV_COLD is similar
      to MADV_FREE in that it hints to the kernel that the memory region is not
      currently needed and should be reclaimed when memory pressure rises.
      
      This patch (of 5):
      
      When a process expects no accesses to a certain memory range, it can give
      a hint to the kernel that the pages can be reclaimed when memory pressure
      happens, but the data should be preserved for future use.  This can reduce
      workingset eviction and so ends up increasing performance.

      This patch introduces the new MADV_COLD hint to the madvise(2) syscall.
      MADV_COLD can be used by a process to mark a memory range as not expected
      to be used in the near future.  The hint can help the kernel in deciding
      which pages to evict early during memory pressure.
      
      It works for every LRU page, like MADV_[DONTNEED|FREE]. IOW, it moves

      	active file page -> inactive file LRU
      	active anon page -> inactive anon LRU
      
      Unlike MADV_FREE, it doesn't move active anonymous pages to the inactive
      file LRU's head because MADV_COLD has slightly different semantics.
      MADV_FREE means it's okay to discard the page under memory pressure
      because the content of the page is *garbage*, so freeing such pages has
      almost zero overhead: we don't need to swap them out, and a later access
      causes only a minor fault.  Thus, it makes sense to put those freeable
      pages on the inactive file LRU to compete with other used-once pages.  It
      also makes sense from an implementation point of view because the page is
      no longer swap-backed memory until it is re-dirtied.  It could even give a
      bonus by letting such pages be reclaimed on a swapless system.  However,
      MADV_COLD doesn't mean the content is garbage, so reclaiming those pages
      requires swap-out and later swap-in, which is a bigger cost.  Since we
      have designed VM LRU aging based on a cost model, anonymous cold pages are
      better positioned on the inactive anon LRU list, not the file LRU.
      Furthermore, it helps to avoid unnecessary scanning if the system doesn't
      have a swap device.  Let's start with the simpler way without adding
      complexity at this moment.  However, keep in mind the caveat that
      workloads with a lot of page cache are likely to effectively ignore
      MADV_COLD on anonymous memory because we rarely age anonymous LRU lists.
      
      * man-page material
      
      MADV_COLD (since Linux x.x)
      
      Pages in the specified regions will be treated as less-recently-accessed
      compared to pages in the system with similar access frequencies.  In
      contrast to MADV_FREE, the contents of the region are preserved regardless
      of subsequent writes to pages.
      
      MADV_COLD cannot be applied to locked pages, Huge TLB pages, or VM_PFNMAP
      pages.
      
      [akpm@linux-foundation.org: resolve conflicts with hmm.git]
      Link: http://lkml.kernel.org/r/20190726023435.214162-2-minchan@kernel.org
      Signed-off-by: Minchan Kim <minchan@kernel.org>
      Reported-by: kbuild test robot <lkp@intel.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: James E.J. Bottomley <James.Bottomley@HansenPartnership.com>
      Cc: Richard Henderson <rth@twiddle.net>
      Cc: Ralf Baechle <ralf@linux-mips.org>
      Cc: Chris Zankel <chris@zankel.net>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Daniel Colascione <dancol@google.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Hillf Danton <hdanton@sina.com>
      Cc: Joel Fernandes (Google) <joel@joelfernandes.org>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Oleksandr Natalenko <oleksandr@redhat.com>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Sonny Rao <sonnyrao@google.com>
      Cc: Suren Baghdasaryan <surenb@google.com>
      Cc: Tim Murray <timmurray@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      9c276cc6
  8. 31 May 2019, 1 commit
  9. 06 Mar 2019, 9 commits
    • mm, compaction: capture a page under direct compaction · 5e1f0f09
      Mel Gorman committed
      Compaction is inherently race-prone as a suitable page freed during
      compaction can be allocated by any parallel task.  This patch uses a
      capture_control structure to isolate a page immediately when it is freed
      by a direct compactor in the slow path of the page allocator.  The
      intent is to avoid redundant scanning.
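
      A hedged sketch of the structure implied by the description (both members
      are assumptions based on the text above, not a copy of mainline):

          /*
           * Passed down from the allocating task so that a page freed by the
           * direct compactor can be handed straight back to it instead of being
           * released to the free lists and scanned for again.
           */
          struct capture_control {
                  struct compact_control *cc;
                  struct page *page;      /* the captured page, if any */
          };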
      
                                           5.0.0-rc1              5.0.0-rc1
                                     selective-v3r17          capture-v3r19
      Amean     fault-both-1         0.00 (   0.00%)        0.00 *   0.00%*
      Amean     fault-both-3      2582.11 (   0.00%)     2563.68 (   0.71%)
      Amean     fault-both-5      4500.26 (   0.00%)     4233.52 (   5.93%)
      Amean     fault-both-7      5819.53 (   0.00%)     6333.65 (  -8.83%)
      Amean     fault-both-12     9321.18 (   0.00%)     9759.38 (  -4.70%)
      Amean     fault-both-18     9782.76 (   0.00%)    10338.76 (  -5.68%)
      Amean     fault-both-24    15272.81 (   0.00%)    13379.55 *  12.40%*
      Amean     fault-both-30    15121.34 (   0.00%)    16158.25 (  -6.86%)
      Amean     fault-both-32    18466.67 (   0.00%)    18971.21 (  -2.73%)
      
      Latency is only moderately affected, but the devil is in the details.  A
      closer examination indicates that base page fault latency is reduced but
      the latency of huge pages is increased as it takes greater care to
      succeed.  Part of the "problem" is that allocation success rates are close
      to 100% even when under pressure, and compaction gets harder:
      
                                      5.0.0-rc1              5.0.0-rc1
                                selective-v3r17          capture-v3r19
      Percentage huge-3        96.70 (   0.00%)       98.23 (   1.58%)
      Percentage huge-5        96.99 (   0.00%)       95.30 (  -1.75%)
      Percentage huge-7        94.19 (   0.00%)       97.24 (   3.24%)
      Percentage huge-12       94.95 (   0.00%)       97.35 (   2.53%)
      Percentage huge-18       96.74 (   0.00%)       97.30 (   0.58%)
      Percentage huge-24       97.07 (   0.00%)       97.55 (   0.50%)
      Percentage huge-30       95.69 (   0.00%)       98.50 (   2.95%)
      Percentage huge-32       96.70 (   0.00%)       99.27 (   2.65%)
      
      And scan rates are reduced as expected by 6% for the migration scanner
      and 29% for the free scanner indicating that there is less redundant
      work.
      
      Compaction migrate scanned    20815362    19573286
      Compaction free scanned       16352612    11510663
      
      [mgorman@techsingularity.net: remove redundant check]
        Link: http://lkml.kernel.org/r/20190201143853.GH9565@techsingularity.net
      Link: http://lkml.kernel.org/r/20190118175136.31341-23-mgorman@techsingularity.net
      Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Dan Carpenter <dan.carpenter@oracle.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: YueHaibing <yuehaibing@huawei.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      5e1f0f09
    • mm, compaction: round-robin the order while searching the free lists for a target · dbe2d4e4
      Mel Gorman committed
      As compaction proceeds and creates high-order blocks, the free list
      search gets less efficient as the larger blocks are used as compaction
      targets.  Eventually, the larger blocks will be behind the migration
      scanner for partially migrated pageblocks and the search fails.  This
      patch round-robins which orders are searched so that larger blocks can be
      ignored and smaller blocks that can be used as migration targets are found.
      
      The overall impact was small on 1-socket but it avoids corner cases
      where the migration/free scanners meet prematurely or situations where
      many of the pageblocks encountered by the free scanner are almost full
      instead of being properly packed.  Previous testing had indicated that
      without this patch there were occasional large spikes in the free
      scanner.
      
      [dan.carpenter@oracle.com: fix static checker warning]
      Link: http://lkml.kernel.org/r/20190118175136.31341-20-mgorman@techsingularity.net
      Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: YueHaibing <yuehaibing@huawei.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      dbe2d4e4
    • mm, compaction: avoid rescanning the same pageblock multiple times · 804d3121
      Mel Gorman committed
      Pageblocks are marked for skip when no pages are isolated after a scan.
      However, it's possible to hit corner cases where the migration scanner
      gets stuck near the boundary between the source and target scanner.  Due
      to pages being migrated in blocks of COMPACT_CLUSTER_MAX, pages that are
      migrated can be reallocated before the pageblock is complete.  The
      pageblock is not necessarily skipped so it can be rescanned multiple
      times.  Similarly, a pageblock with some dirty/writeback pages may fail
      to migrate and be rescanned until writeback completes which is wasteful.
      
      This patch tracks if a pageblock is being rescanned.  If so, then the
      entire pageblock will be migrated as one operation.  This narrows the
      race window during which pages can be reallocated during migration.
      Secondly, if there are pages that cannot be isolated then the pageblock
      will still be fully scanned and marked for skipping.  On the second
      rescan, the pageblock skip is set and the migration scanner makes
      progress.
      
                                           5.0.0-rc1              5.0.0-rc1
                                      findfree-v3r16         norescan-v3r16
      Amean     fault-both-1         0.00 (   0.00%)        0.00 *   0.00%*
      Amean     fault-both-3      3200.68 (   0.00%)     3002.07 (   6.21%)
      Amean     fault-both-5      4847.75 (   0.00%)     4684.47 (   3.37%)
      Amean     fault-both-7      6658.92 (   0.00%)     6815.54 (  -2.35%)
      Amean     fault-both-12    11077.62 (   0.00%)    10864.02 (   1.93%)
      Amean     fault-both-18    12403.97 (   0.00%)    12247.52 (   1.26%)
      Amean     fault-both-24    15607.10 (   0.00%)    15683.99 (  -0.49%)
      Amean     fault-both-30    18752.27 (   0.00%)    18620.02 (   0.71%)
      Amean     fault-both-32    21207.54 (   0.00%)    19250.28 *   9.23%*
      
                                      5.0.0-rc1              5.0.0-rc1
                                 findfree-v3r16         norescan-v3r16
      Percentage huge-3        96.86 (   0.00%)       95.00 (  -1.91%)
      Percentage huge-5        93.72 (   0.00%)       94.22 (   0.53%)
      Percentage huge-7        94.31 (   0.00%)       92.35 (  -2.08%)
      Percentage huge-12       92.66 (   0.00%)       91.90 (  -0.82%)
      Percentage huge-18       91.51 (   0.00%)       89.58 (  -2.11%)
      Percentage huge-24       90.50 (   0.00%)       90.03 (  -0.52%)
      Percentage huge-30       91.57 (   0.00%)       89.14 (  -2.65%)
      Percentage huge-32       91.00 (   0.00%)       90.58 (  -0.46%)
      
      Negligible difference but this was likely a case when the specific
      corner case was not hit.  A previous run of the same patch based on an
      earlier iteration of the series showed large differences where migration
      rates could be halved when the corner case was hit.
      
      The specific corner case where migration scan rates go through the roof,
      caused by a dirty/writeback pageblock located at the boundary of the
      migration/free scanners, did not happen in this case.  When it does
      happen, the scan rates are multiplied by massive margins.
      
      Link: http://lkml.kernel.org/r/20190118175136.31341-13-mgorman@techsingularity.net
      Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Dan Carpenter <dan.carpenter@oracle.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: YueHaibing <yuehaibing@huawei.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      804d3121
    • mm, compaction: use free lists to quickly locate a migration source · 70b44595
      Mel Gorman committed
      The migration scanner is a linear scan of a zone with a potentially large
      search space.  Furthermore, many pageblocks are unusable such as those
      filled with reserved pages or partially filled with pages that cannot
      migrate.  These still get scanned in the common case of allocating a THP
      and the cost accumulates.
      
      The patch uses a partial search of the free lists to locate a migration
      source candidate that is marked as MOVABLE when allocating a THP.  It
      prefers picking a block with a larger number of free pages already on
      the basis that there are fewer pages to migrate to free the entire
      block.  The lowest PFN found during searches is tracked as the basis of
      the start for the linear search after the first search of the free list
      fails.  After the search, the free list is shuffled so that the next
      search will not encounter the same page.  If the search fails then the
      subsequent searches will be shorter and the linear scanner is used.
      
      If this search fails, or if the request is for a small or
      unmovable/reclaimable allocation then the linear scanner is still used.
      It is somewhat pointless to use the list search in those cases.  Small
      free pages must be used for the search and there is no guarantee that
      contiguous movable pages are located within that block.
      
                                           5.0.0-rc1              5.0.0-rc1
                                       noboost-v3r10          findmig-v3r15
      Amean     fault-both-3      3771.41 (   0.00%)     3390.40 (  10.10%)
      Amean     fault-both-5      5409.05 (   0.00%)     5082.28 (   6.04%)
      Amean     fault-both-7      7040.74 (   0.00%)     7012.51 (   0.40%)
      Amean     fault-both-12    11887.35 (   0.00%)    11346.63 (   4.55%)
      Amean     fault-both-18    16718.19 (   0.00%)    15324.19 (   8.34%)
      Amean     fault-both-24    21157.19 (   0.00%)    16088.50 *  23.96%*
      Amean     fault-both-30    21175.92 (   0.00%)    18723.42 *  11.58%*
      Amean     fault-both-32    21339.03 (   0.00%)    18612.01 *  12.78%*
      
                                      5.0.0-rc1              5.0.0-rc1
                                  noboost-v3r10          findmig-v3r15
      Percentage huge-3        86.50 (   0.00%)       89.83 (   3.85%)
      Percentage huge-5        92.52 (   0.00%)       91.96 (  -0.61%)
      Percentage huge-7        92.44 (   0.00%)       92.85 (   0.44%)
      Percentage huge-12       92.98 (   0.00%)       92.74 (  -0.25%)
      Percentage huge-18       91.70 (   0.00%)       91.71 (   0.02%)
      Percentage huge-24       91.59 (   0.00%)       92.13 (   0.60%)
      Percentage huge-30       90.14 (   0.00%)       93.79 (   4.04%)
      Percentage huge-32       90.03 (   0.00%)       91.27 (   1.37%)
      
      This shows an improvement in allocation latencies with similar
      allocation success rates.  While not presented, there was a 31%
      reduction in migration scanning and an 8% reduction in system CPU usage.
      A 2-socket machine showed similar benefits.
      
      [mgorman@techsingularity.net: several fixes]
        Link: http://lkml.kernel.org/r/20190204120111.GL9565@techsingularity.net
      [vbabka@suse.cz: migrate block that was found-fast, some optimisations]
      Link: http://lkml.kernel.org/r/20190118175136.31341-10-mgorman@techsingularity.net
      Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
      Acked-by: Vlastimil Babka <Vbabka@suse.cz>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Dan Carpenter <dan.carpenter@oracle.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: YueHaibing <yuehaibing@huawei.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      70b44595
    • mm, compaction: always finish scanning of a full pageblock · efe771c7
      Mel Gorman committed
      When compaction is finishing, it uses a flag to ensure the pageblock is
      complete but it makes sense to always complete migration of a pageblock.
      Minimally, skip information is based on a pageblock and partially
      scanned pageblocks may incur more scanning in the future.  The pageblock
      skip handling also becomes more strict later in the series and the hint
      is more useful if a complete pageblock was always scanned.
      
      This potentially impacts latency as more scanning is done, but it's not a
      consistent win or loss as the scanning is not always a high percentage
      of the pageblock and sometimes it is offset by future reductions in
      scanning.  Hence, the results are not presented this time due to a
      misleading mix of gains/losses without any clear pattern.  However, full
      scanning of the pageblock is important for later patches.
      
      Link: http://lkml.kernel.org/r/20190118175136.31341-8-mgorman@techsingularity.net
      Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Dan Carpenter <dan.carpenter@oracle.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: YueHaibing <yuehaibing@huawei.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      efe771c7
    • mm, compaction: remove last_migrated_pfn from compact_control · 566e54e1
      Mel Gorman committed
      The last_migrated_pfn field is a bit dubious as to whether it really
      helps but either way, the information from it can be inferred without
      increasing the size of compact_control so remove the field.
      
      Link: http://lkml.kernel.org/r/20190118175136.31341-4-mgorman@techsingularity.net
      Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Dan Carpenter <dan.carpenter@oracle.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: YueHaibing <yuehaibing@huawei.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      566e54e1
    • mm, compaction: rearrange compact_control · c5943b9c
      Mel Gorman committed
      compact_control spans two cache lines with write-intensive lines on
      both.  Rearrange so the most write-intensive fields are in the same
      cache line.  This has a negligible impact on the overall performance of
      compaction and is more a tidying exercise than anything.
      
      Link: http://lkml.kernel.org/r/20190118175136.31341-3-mgorman@techsingularity.net
      Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Dan Carpenter <dan.carpenter@oracle.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: YueHaibing <yuehaibing@huawei.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      c5943b9c
    • mm, compaction: shrink compact_control · c5fbd937
      Mel Gorman committed
      Patch series "Increase success rates and reduce latency of compaction", v3.
      
      This series reduces scan rates and increases success rates of compaction,
      primarily by using the free lists to shorten scans, better controlling
      skip information and whether multiple scanners can target the same
      block, and capturing pageblocks before they are stolen by parallel requests.
      The series is based on mmotm from January 9th, 2019 with the previous
      compaction series reverted.
      
      I'm mostly using thpscale to measure the impact of the series.  The
      benchmark creates a large file, maps it, faults it, punches holes in the
      mapping so that the virtual address space is fragmented and then tries
      to allocate THP.  It re-executes for different numbers of threads.  From
      a fragmentation perspective, the workload is relatively benign but it
      does stress compaction.
      
      The overall impact on latencies for a 1-socket machine is
      
      				      baseline		      patches
      Amean     fault-both-3      3832.09 (   0.00%)     2748.56 *  28.28%*
      Amean     fault-both-5      4933.06 (   0.00%)     4255.52 (  13.73%)
      Amean     fault-both-7      7017.75 (   0.00%)     6586.93 (   6.14%)
      Amean     fault-both-12    11610.51 (   0.00%)     9162.34 *  21.09%*
      Amean     fault-both-18    17055.85 (   0.00%)    11530.06 *  32.40%*
      Amean     fault-both-24    19306.27 (   0.00%)    17956.13 (   6.99%)
      Amean     fault-both-30    22516.49 (   0.00%)    15686.47 *  30.33%*
      Amean     fault-both-32    23442.93 (   0.00%)    16564.83 *  29.34%*
      
      The allocation success rates are much improved
      
      			 	 baseline		 patches
      Percentage huge-3        85.99 (   0.00%)       97.96 (  13.92%)
      Percentage huge-5        88.27 (   0.00%)       96.87 (   9.74%)
      Percentage huge-7        85.87 (   0.00%)       94.53 (  10.09%)
      Percentage huge-12       82.38 (   0.00%)       98.44 (  19.49%)
      Percentage huge-18       83.29 (   0.00%)       99.14 (  19.04%)
      Percentage huge-24       81.41 (   0.00%)       97.35 (  19.57%)
      Percentage huge-30       80.98 (   0.00%)       98.05 (  21.08%)
      Percentage huge-32       80.53 (   0.00%)       97.06 (  20.53%)
      
      That's a nearly perfect allocation success rate.
      
      The biggest impact is on the scan rates
      
      Compaction migrate scanned    55893379    19341254
      Compaction free scanned      474739990    11903963
      
      The number of pages scanned for migration was reduced by 65% and the
      free scanner was reduced by 97.5%.  So much less work in exchange for
      lower latency and better success rates.
      
      The series was also evaluated using a workload that heavily fragments
      memory but the benefits there are also significant, albeit not
      presented.
      
      It was commented that we should be rethinking scanning entirely and to a
      large extent I agree.  However, to achieve that you need a lot of this
      series in place first, so it's best to make the linear scanners as good
      as possible before ripping them out.
      
      This patch (of 22):
      
      The isolate and migrate scanners should never isolate more than a
      pageblock of pages, so unsigned int is sufficient, saving 8 bytes on a
      64-bit build.
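
      A sketch of the kind of change this implies (the field names and their
      neighbours are assumptions about the struct layout):

          struct compact_control {
                  struct list_head freepages;     /* List of free pages to migrate to */
                  struct list_head migratepages;  /* List of pages being migrated */
                  unsigned int nr_freepages;      /* was unsigned long */
                  unsigned int nr_migratepages;   /* was unsigned long */
                  /* ... */
          };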
      
      Link: http://lkml.kernel.org/r/20190118175136.31341-2-mgorman@techsingularity.net
      Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Dan Carpenter <dan.carpenter@oracle.com>
      Cc: YueHaibing <yuehaibing@huawei.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      c5fbd937
    • mm/page_alloc.c: memory hotplug: free pages as higher order · a9cd410a
      Arun KS committed
      When pages are freed at a higher order, the time the buddy allocator
      spends coalescing pages can be reduced.  With a section size of 256MB, the
      hot-add latency of a single section improves from 50-60 ms to less than
      1 ms, hence improving the hot-add latency by 60 times.  External providers
      of the online callback are modified to align with the change.
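
      A hedged sketch of the interface change being described; __free_pages_core
      is referenced in the notes below, but the callback's exact signature and
      the omitted accounting are assumptions:

          /*
           * The online callback now takes an order, so a hot-added section can
           * be handed to the buddy allocator in chunks rather than page by page.
           */
          typedef void (*online_page_callback_t)(struct page *page,
                                                 unsigned int order);

          static void generic_online_page(struct page *page, unsigned int order)
          {
                  __free_pages_core(page, order); /* free 2^order pages at once */
                  /* totalram/zone managed-page accounting elided in this sketch */
          }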
      
      [arunks@codeaurora.org: v11]
        Link: http://lkml.kernel.org/r/1547792588-18032-1-git-send-email-arunks@codeaurora.org
      [akpm@linux-foundation.org: remove unused local, per Arun]
      [akpm@linux-foundation.org: avoid return of void-returning __free_pages_core(), per Oscar]
      [akpm@linux-foundation.org: fix it for mm-convert-totalram_pages-and-totalhigh_pages-variables-to-atomic.patch]
      [arunks@codeaurora.org: v8]
        Link: http://lkml.kernel.org/r/1547032395-24582-1-git-send-email-arunks@codeaurora.org
      [arunks@codeaurora.org: v9]
        Link: http://lkml.kernel.org/r/1547098543-26452-1-git-send-email-arunks@codeaurora.org
      Link: http://lkml.kernel.org/r/1538727006-5727-1-git-send-email-arunks@codeaurora.org
      Signed-off-by: Arun KS <arunks@codeaurora.org>
      Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Reviewed-by: Oscar Salvador <osalvador@suse.de>
      Reviewed-by: Alexander Duyck <alexander.h.duyck@linux.intel.com>
      Cc: K. Y. Srinivasan <kys@microsoft.com>
      Cc: Haiyang Zhang <haiyangz@microsoft.com>
      Cc: Stephen Hemminger <sthemmin@microsoft.com>
      Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
      Cc: Juergen Gross <jgross@suse.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Mathieu Malaterre <malat@debian.org>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Souptick Joarder <jrdr.linux@gmail.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Aaron Lu <aaron.lu@intel.com>
      Cc: Srivatsa Vaddagiri <vatsa@codeaurora.org>
      Cc: Vinayak Menon <vinmenon@codeaurora.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      a9cd410a
  10. 29 Dec 2018, 3 commits
    • mm: use alloc_flags to record if kswapd can wake · 0a79cdad
      Mel Gorman committed
      This is a preparation patch that copies the GFP flag __GFP_KSWAPD_RECLAIM
      into alloc_flags.  It is a preparation patch only; it avoids having to
      pass gfp_mask through a long callchain in a future patch.
      
      Note that the setting in the fast path happens in alloc_flags_nofragment()
      and it may be claimed that this has nothing to do with ALLOC_NO_FRAGMENT.
      That's true in this patch but is not true later so it's done now for
      easier review to show where the flag needs to be recorded.
      
      No functional change.
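
      A sketch of where the flag is recorded in the fast path, per the paragraph
      above (trimmed; the ALLOC_NOFRAGMENT handling that gives the function its
      name is elided):

          static inline unsigned int
          alloc_flags_nofragment(struct zone *zone, gfp_t gfp_mask)
          {
                  unsigned int alloc_flags = 0;

                  /*
                   * Record whether kswapd may be woken for this allocation so
                   * later stages can test alloc_flags instead of carrying
                   * gfp_mask down a long callchain.
                   */
                  if (gfp_mask & __GFP_KSWAPD_RECLAIM)
                          alloc_flags |= ALLOC_KSWAPD;

                  return alloc_flags;
          }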
      
      [mgorman@techsingularity.net: ALLOC_KSWAPD flag needs to be applied in the !CONFIG_ZONE_DMA32 case]
        Link: http://lkml.kernel.org/r/20181126143503.GO23260@techsingularity.net
      Link: http://lkml.kernel.org/r/20181123114528.28802-4-mgorman@techsingularity.net
      Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
      Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Zi Yan <zi.yan@cs.rutgers.edu>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      0a79cdad
    • mm, page_alloc: spread allocations across zones before introducing fragmentation · 6bb15450
      Mel Gorman committed
      Patch series "Fragmentation avoidance improvements", v5.
      
      It has been noted before that fragmentation avoidance (aka
      anti-fragmentation) is not perfect. Given sufficient time or an adverse
      workload, memory gets fragmented and the long-term success of high-order
      allocations degrades. This series defines an adverse workload, a definition
      of external fragmentation events (including serious ones), and a set of
      patches that reduces the level of those fragmentation events.
      
      The details of the workload and the consequences are described in more
      detail in the changelogs. However, from patch 1, this is a high-level
      summary of the adverse workload. The exact details are found in the
      mmtests implementation.
      
      The broad details of the workload are as follows;
      
      1. Create an XFS filesystem (not specified in the configuration but done
         as part of the testing for this patch)
      2. Start 4 fio threads that write a number of 64K files inefficiently.
         Inefficiently means that files are created on first access and not
         created in advance (fio parameter create_on_open=1) and fallocate
         is not used (fallocate=none). With multiple IO issuers this creates
         a mix of slab and page cache allocations over time. The total size
         of the files is 150% physical memory so that the slabs and page cache
         pages get mixed
      3. Warm up a number of fio read-only threads accessing the same files
         created in step 2. This part runs for the same length of time it
         took to create the files. It'll fault back in old data and further
         interleave slab and page cache allocations. As it's now low on
         memory due to step 2, fragmentation occurs as pageblocks get
         stolen.
      4. While step 3 is still running, start a process that tries to allocate
         75% of memory as huge pages with a number of threads. The number of
         threads is based on a (NR_CPUS_SOCKET - NR_FIO_THREADS)/4 to avoid THP
         threads contending with fio, any other threads or forcing cross-NUMA
         scheduling. Note that the test has not been used on a machine with less
         than 8 cores. The benchmark records whether huge pages were allocated
         and what the fault latency was in microseconds
      5. Measure the number of events potentially causing external fragmentation,
         the fault latency and the huge page allocation success rate.
      6. Cleanup
      
      Overall the series reduces external fragmentation-causing events by over 94%
      on 1- and 2-socket machines, which in turn impacts high-order allocation
      success rates over the long term. There are differences in latencies and
      high-order allocation success rates. Latencies are a mixed bag as they
      are vulnerable to exact system state and whether allocations succeeded
      so they are treated as a secondary metric.
      
      Patch 1 uses lower zones if they are populated and have free memory
      	instead of fragmenting a higher zone. It's special cased to
      	handle a Normal->DMA32 fallback with the reasons explained
      	in the changelog.
      
      Patch 2-4 boosts watermarks temporarily when an external fragmentation
      	event occurs. kswapd wakes to reclaim a small amount of old memory
      	and then wakes kcompactd on completion to recover the system
      	slightly. This introduces some overhead in the slowpath. The level
      	of boosting can be tuned or disabled depending on the tolerance
      	for fragmentation vs allocation latency.
      
      Patch 5 stalls some movable allocation requests to let kswapd from patch 4
      	make some progress. The duration of the stalls is very low but it
      	is possible to tune the system to avoid fragmentation events if
      	larger stalls can be tolerated.
      
      The bulk of the improvement in fragmentation avoidance is from patches
      1-4 but patch 5 can deal with a rare corner case and provides the option
      of tuning a system for THP allocation success rates in exchange for
      some stalls to control fragmentation.
      
      This patch (of 5):
      
      The page allocator zone lists are iterated based on the watermarks of each
      zone which does not take anti-fragmentation into account.  On x86, node 0
      may have multiple zones while other nodes have one zone.  A consequence is
      that tasks running on node 0 may fragment ZONE_NORMAL even though
      ZONE_DMA32 has plenty of free memory.  This patch special cases the
      allocator fast path such that it'll try an allocation from a lower local
      zone before fragmenting a higher zone.  In this case, stealing of
      pageblocks or orders larger than a pageblock is still allowed in the fast
      path as they are uninteresting from a fragmentation point of view.
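      
      The idea can be sketched as follows; this is a simplified model of the
      zonelist walk rather than the code added by the patch, and all names in
      it are made up for illustration:
      
        /*
         * Simplified model only: struct, field and helper names are made up
         * and do not match the kernel code.  The zonelist is walked from the
         * highest usable zone down; a local zone that would have to steal a
         * pageblock is remembered but skipped until lower local zones have
         * been tried.
         */
        #include <stdbool.h>
        #include <stddef.h>
        
        struct zone_stub {
                long free_pages;
                long watermark;
                bool local;              /* zone is on the allocating node */
                bool would_fragment;     /* request would steal a pageblock */
        };
        
        static struct zone_stub *pick_zone(struct zone_stub **zonelist, size_t nr)
        {
                struct zone_stub *deferred = NULL;
        
                for (size_t i = 0; i < nr; i++) {   /* highest zone first */
                        struct zone_stub *z = zonelist[i];
        
                        if (z->free_pages < z->watermark)
                                continue;           /* zone too low on memory */
                        if (z->local && z->would_fragment) {
                                if (!deferred)
                                        deferred = z;
                                continue;           /* try a lower zone first */
                        }
                        return z;                   /* no fragmentation needed */
                }
                /* fragmenting the higher zone is the last resort */
                return deferred;
        }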
      
      This was evaluated using a benchmark designed to fragment memory before
      attempting THP allocations.  It's implemented in mmtests as the following
      configurations:
      
      configs/config-global-dhp__workload_thpfioscale
      configs/config-global-dhp__workload_thpfioscale-defrag
      configs/config-global-dhp__workload_thpfioscale-madvhugepage
      
      e.g. from mmtests
      ./run-mmtests.sh --run-monitor --config configs/config-global-dhp__workload_thpfioscale test-run-1
      
      The broad details of the workload are as follows:
      
      1. Create an XFS filesystem (not specified in the configuration but done
         as part of the testing for this patch).
      2. Start 4 fio threads that write a number of 64K files inefficiently.
         Inefficiently means that files are created on first access and not
         created in advance (fio parameter create_on_open=1) and fallocate
         is not used (fallocate=none). With multiple IO issuers this creates
         a mix of slab and page cache allocations over time. The total size
         of the files is 150% physical memory so that the slabs and page cache
         pages get mixed.
      3. Warm up a number of fio read-only processes accessing the same files
         created in step 2. This part runs for the same length of time it
         took to create the files. It'll refault old data and further
         interleave slab and page cache allocations. As it's now low on
         memory due to step 2, fragmentation occurs as pageblocks get
         stolen.
      4. While step 3 is still running, start a process that tries to allocate
         75% of memory as huge pages with a number of threads. The number of
         threads is based on (NR_CPUS_SOCKET - NR_FIO_THREADS)/4 to avoid THP
         threads contending with fio, any other threads or forcing cross-NUMA
         scheduling. Note that the test has not been used on a machine with less
         than 8 cores. The benchmark records whether huge pages were allocated
         and what the fault latency was in microseconds.
      5. Measure the number of events potentially causing external fragmentation,
         the fault latency and the huge page allocation success rate.
      6. Cleanup the test files.
      
      Note that due to the use of IO and page cache, this benchmark is not
      suitable for running on large machines where the time to fragment memory
      may be excessive.  Also note that while this is one mix that generates
      fragmentation, it is not the only such mix.  Workloads that are more
      slab-intensive, or configurations where SLUB is used with high-order
      pages, may yield different results.
      
      When the page allocator fragments memory, it records the event using the
      mm_page_alloc_extfrag ftrace event.  If the fallback_order is smaller than
      a pageblock order (order-9 on 64-bit x86) then it's considered to be an
      "external fragmentation event" that may cause issues in the future.
      Hence, the primary metric here is the number of external fragmentation
      events that occur with order < 9.  The secondary metrics are allocation
      latency and huge page allocation success rates, but note that differences
      in latencies and in the success rate can themselves affect the number of
      external fragmentation events, which is why they are treated as secondary.
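      
      Expressed as code, the classification behind the primary metric is simply
      the following; the helper is illustrative, and pageblock_order is passed
      in explicitly to keep the sketch self-contained:
      
        #include <stdbool.h>
        
        /*
         * An mm_page_alloc_extfrag event counts as an "external fragmentation
         * event" when the fallback that stole the pages is smaller than a
         * pageblock (order-9 on x86-64 with 4K pages and 2M pageblocks).
         */
        static inline bool is_extfrag_event(unsigned int fallback_order,
                                            unsigned int pageblock_order)
        {
                return fallback_order < pageblock_order;
        }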
      
      1-socket Skylake machine
      config-global-dhp__workload_thpfioscale XFS (no special madvise)
      4 fio threads, 1 THP allocating thread
      --------------------------------------
      
      4.20-rc3 extfrag events < order 9:   804694
      4.20-rc3+patch:                      408912 (49% reduction)
      
      thpfioscale Fault Latencies
                                         4.20.0-rc3             4.20.0-rc3
                                            vanilla           lowzone-v5r8
      Amean     fault-base-1      662.92 (   0.00%)      653.58 *   1.41%*
      Amean     fault-huge-1        0.00 (   0.00%)        0.00 (   0.00%)
      
                                    4.20.0-rc3             4.20.0-rc3
                                       vanilla           lowzone-v5r8
      Percentage huge-1        0.00 (   0.00%)        0.00 (   0.00%)
      
      Fault latencies are slightly reduced while allocation success rates remain
      at zero as this configuration does not make any special effort to allocate
      THP and fio is heavily active at the time and either filling memory or
      keeping pages resident.  However, a 49% reduction of serious fragmentation
      events reduces the chances of external fragmentation being a problem in
      the future.
      
      Vlastimil asked during review for a breakdown of the allocation types
      that are falling back.
      
      vanilla
         3816 MIGRATE_UNMOVABLE
       800845 MIGRATE_MOVABLE
           33 MIGRATE_RECLAIMABLE
      
      patch
          735 MIGRATE_UNMOVABLE
       408135 MIGRATE_MOVABLE
           42 MIGRATE_RECLAIMABLE
      
      The majority of the fallbacks are due to movable allocations.  This is
      consistent for the workload throughout the series, so the breakdown will
      not be presented again.
      
      Movable fallbacks are sometimes considered "ok" to fallback because they
      can be migrated.  The problem is that they can fill an
      unmovable/reclaimable pageblock, causing those allocations to fall back
      later and polluting pageblocks with pages that cannot move.  If there is a
      movable fallback, it is pretty much guaranteed to affect an
      unmovable/reclaimable pageblock and while it might not be enough to
      actually cause an unmovable/reclaimable fallback in the future, we cannot
      know that in advance so the patch takes the only option available to it.
      Hence, it's important to control them.  This point is also consistent
      throughout the series and will not be repeated.
      
      1-socket Skylake machine
      global-dhp__workload_thpfioscale-madvhugepage-xfs (MADV_HUGEPAGE)
      -----------------------------------------------------------------
      
      4.20-rc3 extfrag events < order 9:  291392
      4.20-rc3+patch:                     191187 (34% reduction)
      
      thpfioscale Fault Latencies
                                         4.20.0-rc3             4.20.0-rc3
                                            vanilla           lowzone-v5r8
      Amean     fault-base-1     1495.14 (   0.00%)     1467.55 (   1.85%)
      Amean     fault-huge-1     1098.48 (   0.00%)     1127.11 (  -2.61%)
      
      thpfioscale Percentage Faults Huge
                                    4.20.0-rc3             4.20.0-rc3
                                       vanilla           lowzone-v5r8
      Percentage huge-1       78.57 (   0.00%)       77.64 (  -1.18%)
      
      Fragmentation events were reduced quite a bit although this is known
      to be a little variable. The latencies and allocation success rates
      are similar but they were already quite high.
      
      2-socket Haswell machine
      config-global-dhp__workload_thpfioscale XFS (no special madvise)
      4 fio threads, 5 THP allocating threads
      ----------------------------------------------------------------
      
      4.20-rc3 extfrag events < order 9:  215698
      4.20-rc3+patch:                     200210 (7% reduction)
      
      thpfioscale Fault Latencies
                                         4.20.0-rc3             4.20.0-rc3
                                            vanilla           lowzone-v5r8
      Amean     fault-base-5     1350.05 (   0.00%)     1346.45 (   0.27%)
      Amean     fault-huge-5     4181.01 (   0.00%)     3418.60 (  18.24%)
      
                                    4.20.0-rc3             4.20.0-rc3
                                       vanilla           lowzone-v5r8
      Percentage huge-5        1.15 (   0.00%)        0.78 ( -31.88%)
      
      The reduction of external fragmentation events is slight and this is
      partially due to the removal of __GFP_THISNODE in commit ac5b2c18
      ("mm: thp: relax __GFP_THISNODE for MADV_HUGEPAGE mappings") as THP
      allocations can now spill over to remote nodes instead of fragmenting
      local memory.
      
      2-socket Haswell machine
      global-dhp__workload_thpfioscale-madvhugepage-xfs (MADV_HUGEPAGE)
      -----------------------------------------------------------------
      
      4.20-rc3 extfrag events < order 9: 166352
      4.20-rc3+patch:                    147463 (11% reduction)
      
      thpfioscale Fault Latencies
                                         4.20.0-rc3             4.20.0-rc3
                                            vanilla           lowzone-v5r8
      Amean     fault-base-5     6138.97 (   0.00%)     6217.43 (  -1.28%)
      Amean     fault-huge-5     2294.28 (   0.00%)     3163.33 * -37.88%*
      
      thpfioscale Percentage Faults Huge
                                    4.20.0-rc3             4.20.0-rc3
                                       vanilla           lowzone-v5r8
      Percentage huge-5       96.82 (   0.00%)       95.14 (  -1.74%)
      
      There was a slight reduction in external fragmentation events although the
      latencies were higher.  The allocation success rate is high enough that
      the system is under heavy pressure, with quite a lot of parallel reclaim
      and compaction activity.  There is also a certain degree of luck in whether
      processes start on node 0 or not for this patch, but the relevance is
      reduced later in the series.
      
      Overall, the patch reduces the number of external fragmentation causing
      events so the success of THP over long periods of time would be improved
      for this adverse workload.
      
      Link: http://lkml.kernel.org/r/20181123114528.28802-2-mgorman@techsingularity.net
      Signed-off-by: NMel Gorman <mgorman@techsingularity.net>
      Acked-by: NVlastimil Babka <vbabka@suse.cz>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Zi Yan <zi.yan@cs.rutgers.edu>
      Cc: Michal Hocko <mhocko@kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      6bb15450
    • W
      vmscan: return NODE_RECLAIM_NOSCAN in node_reclaim() when CONFIG_NUMA is n · 8b09549c
      Authored by Wei Yang
      Commit fa5e084e ("vmscan: do not unconditionally treat zones that
      fail zone_reclaim() as full") changed the return value of
      node_reclaim().  After that commit, the original return value of 0 means
      NODE_RECLAIM_SOME.
      
      However, the return value of node_reclaim() when CONFIG_NUMA is n was not
      changed, which leads to zone_watermark_ok() being called again.
      
      This patch fixes the return value by adjusting to NODE_RECLAIM_NOSCAN.
      Since node_reclaim() is only called in page_alloc.c, move it to
      mm/internal.h.
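      
      The !CONFIG_NUMA stub then looks roughly like the following, sketched
      from the description above rather than quoted from the patch:
      
        #ifndef CONFIG_NUMA
        /*
         * Without NUMA there is nothing to reclaim from other nodes, so report
         * that no scan was attempted instead of returning 0 (which now means
         * NODE_RECLAIM_SOME).
         */
        static inline int node_reclaim(struct pglist_data *pgdat, gfp_t mask,
                                       unsigned int order)
        {
                return NODE_RECLAIM_NOSCAN;
        }
        #endif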
      
      Link: http://lkml.kernel.org/r/20181113080436.22078-1-richard.weiyang@gmail.com
      Signed-off-by: NWei Yang <richard.weiyang@gmail.com>
      Acked-by: NMichal Hocko <mhocko@suse.com>
      Reviewed-by: NMatthew Wilcox <willy@infradead.org>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      8b09549c
  11. 31 October 2018, 1 commit
    • M
      memblock: rename __free_pages_bootmem to memblock_free_pages · 7c2ee349
      Authored by Mike Rapoport
      The conversion is done using
      
      sed -i 's@__free_pages_bootmem@memblock_free_pages@' \
          $(git grep -l __free_pages_bootmem)
      
      Link: http://lkml.kernel.org/r/1536927045-23536-27-git-send-email-rppt@linux.vnet.ibm.com
      Signed-off-by: NMike Rapoport <rppt@linux.vnet.ibm.com>
      Acked-by: NMichal Hocko <mhocko@suse.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Chris Zankel <chris@zankel.net>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Geert Uytterhoeven <geert@linux-m68k.org>
      Cc: Greentime Hu <green.hu@gmail.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Guan Xuetao <gxt@pku.edu.cn>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: "James E.J. Bottomley" <jejb@parisc-linux.org>
      Cc: Jonas Bonn <jonas@southpole.se>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Ley Foon Tan <lftan@altera.com>
      Cc: Mark Salter <msalter@redhat.com>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Matt Turner <mattst88@gmail.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Michal Simek <monstr@monstr.eu>
      Cc: Palmer Dabbelt <palmer@sifive.com>
      Cc: Paul Burton <paul.burton@mips.com>
      Cc: Richard Kuo <rkuo@codeaurora.org>
      Cc: Richard Weinberger <richard@nod.at>
      Cc: Rich Felker <dalias@libc.org>
      Cc: Russell King <linux@armlinux.org.uk>
      Cc: Serge Semin <fancer.lancer@gmail.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: Vineet Gupta <vgupta@synopsys.com>
      Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      7c2ee349
  12. 24 August 2018, 1 commit
  13. 23 August 2018, 1 commit
  14. 02 June 2018, 1 commit
  15. 25 May 2018, 1 commit
    • J
      Revert "mm/cma: manage the memory of the CMA area by using the ZONE_MOVABLE" · d883c6cf
      Authored by Joonsoo Kim
      This reverts the following commits that change CMA design in MM.
      
       3d2054ad ("ARM: CMA: avoid double mapping to the CMA area if CONFIG_HIGHMEM=y")
      
       1d47a3ec ("mm/cma: remove ALLOC_CMA")
      
       bad8c6c0 ("mm/cma: manage the memory of the CMA area by using the ZONE_MOVABLE")
      
      Ville reported the following error on i386.
      
        Inode-cache hash table entries: 65536 (order: 6, 262144 bytes)
        microcode: microcode updated early to revision 0x4, date = 2013-06-28
        Initializing CPU#0
        Initializing HighMem for node 0 (000377fe:00118000)
        Initializing Movable for node 0 (00000001:00118000)
        BUG: Bad page state in process swapper  pfn:377fe
        page:f53effc0 count:0 mapcount:-127 mapping:00000000 index:0x0
        flags: 0x80000000()
        raw: 80000000 00000000 00000000 ffffff80 00000000 00000100 00000200 00000001
        page dumped because: nonzero mapcount
        Modules linked in:
        CPU: 0 PID: 0 Comm: swapper Not tainted 4.17.0-rc5-elk+ #145
        Hardware name: Dell Inc. Latitude E5410/03VXMC, BIOS A15 07/11/2013
        Call Trace:
         dump_stack+0x60/0x96
         bad_page+0x9a/0x100
         free_pages_check_bad+0x3f/0x60
         free_pcppages_bulk+0x29d/0x5b0
         free_unref_page_commit+0x84/0xb0
         free_unref_page+0x3e/0x70
         __free_pages+0x1d/0x20
         free_highmem_page+0x19/0x40
         add_highpages_with_active_regions+0xab/0xeb
         set_highmem_pages_init+0x66/0x73
         mem_init+0x1b/0x1d7
         start_kernel+0x17a/0x363
         i386_start_kernel+0x95/0x99
         startup_32_smp+0x164/0x168
      
      The reason for this error is that the span of ZONE_MOVABLE is extended
      to the whole node span for future CMA initialization, and normal memory
      is wrongly freed here.  I submitted a fix and it seems to work, but
      another problem happened.
      
      It's too late in the cycle to fix that further problem, so I decided to
      revert the series.
      Reported-by: NVille Syrjälä <ville.syrjala@linux.intel.com>
      Acked-by: NLaura Abbott <labbott@redhat.com>
      Acked-by: NMichal Hocko <mhocko@suse.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NJoonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      d883c6cf
  16. 12 April 2018, 4 commits
    • J
      mm/cma: remove ALLOC_CMA · 1d47a3ec
      Authored by Joonsoo Kim
      Now, all reserved pages for the CMA region belong to ZONE_MOVABLE, and
      that zone only serves requests with GFP_HIGHMEM && GFP_MOVABLE.
      
      Therefore, we don't need to maintain ALLOC_CMA at all.
      
      Link: http://lkml.kernel.org/r/1512114786-5085-3-git-send-email-iamjoonsoo.kim@lge.com
      Signed-off-by: NJoonsoo Kim <iamjoonsoo.kim@lge.com>
      Reviewed-by: NAneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Tested-by: NTony Lindgren <tony@atomide.com>
      Acked-by: NVlastimil Babka <vbabka@suse.cz>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Laura Abbott <lauraa@codeaurora.org>
      Cc: Marek Szyprowski <m.szyprowski@samsung.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Michal Nazarewicz <mina86@mina86.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Russell King <linux@armlinux.org.uk>
      Cc: Will Deacon <will.deacon@arm.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      1d47a3ec
    • J
      mm/cma: manage the memory of the CMA area by using the ZONE_MOVABLE · bad8c6c0
      Authored by Joonsoo Kim
      Patch series "mm/cma: manage the memory of the CMA area by using the
      ZONE_MOVABLE", v2.
      
      0. History
      
      This patchset is the follow-up of the discussion about the "Introduce
      ZONE_CMA (v7)" [1].  Please reference it if more information is needed.
      
      1. What does this patch do?
      
      This patch changes how the memory of the CMA area is managed in the MM
      subsystem.  Currently the memory of the CMA area is managed by the zone
      its pfn belongs to.  However, this approach has some problems since the
      MM subsystem doesn't have enough logic to handle the situation where
      memories with different characteristics are in a single zone.  To solve
      this issue, this patch tries to manage all the memory of the CMA area by
      using the MOVABLE zone.  From the MM subsystem's point of view, the
      characteristics of the memory in the MOVABLE zone and the memory of the
      CMA area are the same, so managing the memory of the CMA area through
      the MOVABLE zone does not cause any problem.
      
      2. Motivation
      
      There are some problems with the current approach; see below.  Although
      these problems are not inherent and could be fixed without this
      conceptual change, doing so would require adding many hooks in various
      code paths, which would be intrusive to core MM and really error-prone.
      Therefore, I try to solve them with this new approach.  The problems of
      the current implementation are as follows.
      
      o CMA memory utilization
      
      First, the following is the freepage calculation logic in MM.
      
       - For movable allocation: freepage = total freepage
       - For unmovable allocation: freepage = total freepage - CMA freepage
      
      Freepages in the CMA area are used only after the normal freepages in
      the zone to which the memory of the CMA area belongs are exhausted.  At
      that moment, the number of normal freepages is zero, so
      
       - For movable allocation: freepage = total freepage = CMA freepage
       - For unmovable allocation: freepage = 0
      
      If an unmovable allocation comes at this moment, the allocation request
      fails to pass the watermark check and reclaim is started.  After reclaim,
      normal freepages exist again, so the freepages in the CMA area are still
      not used.
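      
      The underlying check behaves roughly as in the following sketch; the
      function and parameter names are illustrative simplifications of the
      watermark calculation described above:
      
        #include <stdbool.h>
        
        /*
         * Simplified model: for allocations that cannot use CMA memory, the
         * free CMA pages are subtracted before the watermark comparison, so a
         * zone whose remaining free memory is mostly CMA looks "full" to
         * unmovable allocations and triggers reclaim instead.
         */
        static bool watermark_ok(long free_pages, long free_cma_pages,
                                 long watermark, bool alloc_can_use_cma)
        {
                if (!alloc_can_use_cma)
                        free_pages -= free_cma_pages;
        
                return free_pages > watermark;
        }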
      
      FYI, there is another attempt [2] trying to solve this problem in lkml.
      And, as far as I know, Qualcomm also has an out-of-tree solution for this
      problem.
      
      Useless reclaim:
      
      There is no logic to distinguish CMA pages in the reclaim path.  Hence,
      a CMA page is reclaimed even if the system only needs a page that is
      usable for kernel allocations.
      
      Atomic allocation failure:
      
      This is also related to the fallback allocation policy for the memory of
      the CMA area.  Consider the situation where the number of normal
      freepages is *zero* because a bunch of movable allocation requests have
      come in.  kswapd would not be woken up due to the following freepage
      calculation logic.
      
      - For movable allocation: freepage = total freepage = CMA freepage
      
      If an atomic unmovable allocation request comes at this moment, it would
      fail due to the following logic.
      
      - For unmovable allocation: freepage = total freepage - CMA freepage = 0
      
      It was reported by Aneesh [3].
      
      Useless compaction:
      
      A usual high-order allocation request is an unmovable allocation
      request, so it cannot be served from the memory of the CMA area.  During
      compaction, the migration scanner tries to migrate pages in the CMA area
      and create high-order pages there.  As mentioned above, those pages
      cannot be used for unmovable allocation requests, so the work is simply
      wasted.
      
      3. Current approach and new approach
      
      The current approach is that the memory of the CMA area is managed by
      the zone its pfn belongs to.  However, this memory needs to be
      distinguishable since it has a strong limitation, so it is marked as
      MIGRATE_CMA in the pageblock flags and handled specially.  However, as
      mentioned in section 2, the MM subsystem doesn't have enough logic to
      deal with this special pageblock, so many problems arise.
      
      The new approach is that the memory of the CMA area is managed by the
      MOVABLE zone.  MM already has enough logic to deal with special zones
      such as HIGHMEM and MOVABLE, so managing the memory of the CMA area with
      the MOVABLE zone naturally works well: the constraint that the memory of
      the CMA area must always be migratable is the same as the constraint on
      the MOVABLE zone.
      
      There is one side-effect on the usability of the memory of the CMA
      area.  The MOVABLE zone is only used for requests with GFP_HIGHMEM &&
      GFP_MOVABLE, so the memory of the CMA area is now also only available to
      requests with these gfp flags.  Before this patchset, a request with
      just GFP_MOVABLE could use it.  IMO, this is not a big issue since most
      GFP_MOVABLE requests also have the GFP_HIGHMEM flag, for example file
      cache pages and anonymous pages.  However, file cache pages for blockdev
      files are an exception: requests for them carry no GFP_HIGHMEM flag.
      There are pros and cons to this exception.  In my experience, blockdev
      file cache pages are one of the top reasons that cma_alloc() fails
      temporarily, so we get a stronger guarantee of cma_alloc() success by
      discarding this case.
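      
      Expressed as a check, the usability change is roughly the following;
      both helpers are illustrative, not kernel functions:
      
        #include <stdbool.h>
        
        /* Which requests can be served from the memory of the CMA area. */
        static inline bool cma_usable_before(bool movable, bool highmem)
        {
                return movable;               /* MIGRATE_CMA pageblocks */
        }
        
        static inline bool cma_usable_after(bool movable, bool highmem)
        {
                return movable && highmem;    /* ZONE_MOVABLE requirement */
        }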
      
      Note that there is no change from the admin's point of view since this
      patchset is just an internal implementation change in the MM subsystem.
      The one minor difference for admins is that the memory stats for the CMA
      area are now reported under the MOVABLE zone.  That's all.
      
      4. Result
      
      Following is the experimental result related to utilization problem.
      
      8 CPUs, 1024 MB, VIRTUAL MACHINE
      make -j16
      
      <Before>
        CMA area:               0 MB            512 MB
        Elapsed-time:           92.4		186.5
        pswpin:                 82		18647
        pswpout:                160		69839
      
      <After>
        CMA area:               0 MB            512 MB
        Elapsed-time:           93.1		93.4
        pswpin:                 84		46
        pswpout:                183		92
      
      akpm: "kernel test robot" reported a 26% improvement in
      vm-scalability.throughput:
      http://lkml.kernel.org/r/20180330012721.GA3845@yexl-desktop
      
      [1]: lkml.kernel.org/r/1491880640-9944-1-git-send-email-iamjoonsoo.kim@lge.com
      [2]: https://lkml.org/lkml/2014/10/15/623
      [3]: http://www.spinics.net/lists/linux-mm/msg100562.html
      
      Link: http://lkml.kernel.org/r/1512114786-5085-2-git-send-email-iamjoonsoo.kim@lge.com
      Signed-off-by: NJoonsoo Kim <iamjoonsoo.kim@lge.com>
      Reviewed-by: NAneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Tested-by: NTony Lindgren <tony@atomide.com>
      Acked-by: NVlastimil Babka <vbabka@suse.cz>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Laura Abbott <lauraa@codeaurora.org>
      Cc: Marek Szyprowski <m.szyprowski@samsung.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Michal Nazarewicz <mina86@mina86.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Russell King <linux@armlinux.org.uk>
      Cc: Will Deacon <will.deacon@arm.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      bad8c6c0
    • M
      mm, migrate: remove reason argument from new_page_t · 666feb21
      Authored by Michal Hocko
      No allocation callback is using this argument anymore.  new_page_node
      used to use this parameter to convey the node_id, or respectively the
      migration error, up to the move_pages code (do_move_page_to_node_array).
      The error status never
      made it into the final status field and we have a better way to
      communicate node id to the status field now.  All other allocation
      callbacks simply ignored the argument so we can drop it finally.
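      
      For reference, the shape of the allocation callback before and after
      this change is roughly the following; this is sketched from the
      description above, not copied from the tree:
      
        struct page;
        
        /* Before: the extra int** argument carried the node id / migration
         * error, but no remaining callback made use of it. */
        typedef struct page *new_page_t_before(struct page *page,
                                               unsigned long private,
                                               int **reason);
        
        /* After: the unused argument is dropped. */
        typedef struct page *new_page_t(struct page *page,
                                        unsigned long private);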
      
      [mhocko@suse.com: fix migration callback]
        Link: http://lkml.kernel.org/r/20180105085259.GH2801@dhcp22.suse.cz
      [akpm@linux-foundation.org: fix alloc_misplaced_dst_page()]
      [mhocko@kernel.org: fix build]
        Link: http://lkml.kernel.org/r/20180103091134.GB11319@dhcp22.suse.cz
      Link: http://lkml.kernel.org/r/20180103082555.14592-3-mhocko@kernel.org
      Signed-off-by: NMichal Hocko <mhocko@suse.com>
      Reviewed-by: NZi Yan <zi.yan@cs.rutgers.edu>
      Cc: Andrea Reale <ar@linux.vnet.ibm.com>
      Cc: Anshuman Khandual <khandual@linux.vnet.ibm.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      666feb21
    • M
      mm, numa: rework do_pages_move · a49bd4d7
      Authored by Michal Hocko
      Patch series "unclutter thp migration"
      
      Motivation:
      
      THP migration is hacked into the generic migration with rather
      surprising semantic.  The migration allocation callback is supposed to
      check whether the THP can be migrated at once and if that is not the
      case then it allocates a simple page to migrate.  unmap_and_move then
      fixes that up by splitting the THP into small pages while moving the
      head page to the newly allocated order-0 page.  Remaining pages are
      moved to the LRU list by split_huge_page.  The same happens if the THP
      allocation fails.  This is really ugly and error prone [2].
      
      I also believe that split_huge_page to the LRU lists is inherently wrong
      because all tail pages are not migrated.  Some callers will just work
      around that by retrying (e.g.  memory hotplug).  There are other pfn
      walkers which are simply broken though, e.g. madvise_inject_error will
      migrate the head page and then advance the next pfn by the huge page size.
      do_move_page_to_node_array and queue_pages_range (migrate_pages, mbind)
      will simply split the THP before migration if THP migration is not
      supported and then fall back to single page migration, but they don't
      handle tail pages if the THP migration path is not able to allocate a
      fresh THP, so we end up with ENOMEM and fail the whole migration, which
      is questionable behavior.  Page compaction doesn't try to migrate large
      pages so it should be immune.
      
      The first patch reworks do_pages_move, which relies on a very ugly
      calling convention where the return status is pushed to the migration
      path via a private pointer.  It uses pre-allocated fixed-size batching to
      achieve that.  We simply cannot do the same if a THP is to be split
      during the migration path, which is done in patch 3.  Patch 2 is a
      follow-up cleanup which removes the mentioned return-status calling
      convention ugliness.
      
      On a side note:
      
      There are some semantic issues I have encountered on the way when
      working on patch 1 but I am not addressing them here.  E.g. trying to
      move THP tail pages will result in either success or EBUSY (the latter
      being more likely once we isolate the head from the LRU list).  Hugetlb
      reports EACCES on tail pages.  Some errors are reported via the status
      parameter but migration failures are not even though the original
      `reason' argument suggests there was an intention to do so.  From a
      quick look into git history this never worked.  I have tried to keep the
      semantic unchanged.
      
      Then there is a relatively minor thing that the page isolation might
      fail because of pages not being on the LRU - e.g. because they are
      sitting on the per-cpu LRU caches.  Easily fixable.
      
      This patch (of 3):
      
      do_pages_move is supposed to move user defined memory (an array of
      addresses) to the user defined numa nodes (an array of nodes one for
      each address).  The user-provided status array then contains the
      resulting numa node for each address or an error.  The semantics of this
      function are a little bit confusing because only some errors are reported
      back.  Notably, a migrate_pages error is only reported via the return
      value.  This
      patch doesn't try to address these semantic nuances but rather change
      the underlying implementation.
      
      Currently we are processing user input (which can be really large) in
      batches which are stored to a temporarily allocated page.  Each address
      is resolved to its struct page and stored to page_to_node structure
      along with the requested target numa node.  The array of these
      structures is then conveyed down the page migration path via private
      argument.  new_page_node then finds the corresponding structure and
      allocates the proper target page.
      
      What is the problem with the current implementation and why change it?
      Apart from being quite ugly, it also doesn't cope with unexpected
      pages showing up on the migration list inside migrate_pages path.  That
      doesn't happen currently but the follow up patch would like to make the
      thp migration code more clear and that would need to split a THP into
      the list for some cases.
      
      How does the new implementation work? Well, instead of batching into a
      fixed size array we simply batch all pages that should be migrated to
      the same node and isolate all of them into a linked list which doesn't
      require any additional storage.  This should work reasonably well
      because page migration usually migrates larger ranges of memory to a
      specific node.  So the common case should work equally well as the
      current implementation.  Even if somebody constructs an input where the
      target numa nodes would be interleaved we shouldn't see a large
      performance impact because page migration alone doesn't really benefit
      from batching.  mmap_sem batching for the lookup is quite questionable
      and isolate_lru_page which would benefit from batching is not using it
      even in the current implementation.
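      
      A self-contained toy model of the batching idea follows; all names and
      types are illustrative, and the real code isolates the pages onto a list
      and migrates each batch rather than printing it:
      
        #include <stddef.h>
        #include <stdio.h>
        
        static void migrate_batch(size_t first, size_t count, int node)
        {
                /* stands in for isolating the pages and migrating the list */
                printf("migrate %zu page(s) starting at index %zu to node %d\n",
                       count, first, node);
        }
        
        /* Collect consecutive addresses that target the same node into one
         * batch and "migrate" each batch in a single call. */
        static void do_pages_move_model(const int *nodes, size_t nr)
        {
                size_t batch_start = 0;
        
                for (size_t i = 1; i <= nr; i++) {
                        /* flush when the target node changes or input ends */
                        if (i == nr || nodes[i] != nodes[batch_start]) {
                                migrate_batch(batch_start, i - batch_start,
                                              nodes[batch_start]);
                                batch_start = i;
                        }
                }
        }
        
        int main(void)
        {
                int nodes[] = { 0, 0, 0, 1, 1, 0 };
        
                do_pages_move_model(nodes, sizeof(nodes) / sizeof(nodes[0]));
                return 0;
        }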
      
      Link: http://lkml.kernel.org/r/20180103082555.14592-2-mhocko@kernel.org
      Signed-off-by: NMichal Hocko <mhocko@suse.com>
      Acked-by: NKirill A. Shutemov <kirill@shutemov.name>
      Reviewed-by: NAndrew Morton <akpm@linux-foundation.org>
      Cc: Anshuman Khandual <khandual@linux.vnet.ibm.com>
      Cc: Zi Yan <zi.yan@cs.rutgers.edu>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Andrea Reale <ar@linux.vnet.ibm.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      a49bd4d7
  17. 30 November 2017, 1 commit
  18. 28 November 2017, 1 commit
    • K
      mm, thp: Do not make pmd/pud dirty without a reason · 152e93af
      Authored by Kirill A. Shutemov
      Currently we make page table entries dirty all the time regardless of
      access type and don't even consider if the mapping is write-protected.
      The reasoning is that we don't really need dirty tracking on THP and
      making the entry dirty upfront may save some time on first write to the
      page.
      
      Unfortunately, such approach may result in false-positive
      can_follow_write_pmd() for huge zero page or read-only shmem file.
      
      Let's make the page dirty only if we are about to write to the page
      anyway (as we do for small pages).
      
      I've restructured the code to make the entry dirty inside
      maybe_p[mu]d_mkwrite().  It also takes into account whether the vma is
      write-protected.
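      
      The intended behaviour can be sketched as follows; the struct and
      function names are illustrative stand-ins, not the kernel helpers:
      
        /* Toy model: the entry is made writable when the vma permits writes,
         * but it is marked dirty only when this fault is actually a write,
         * mirroring the handling of small pages. */
        struct entry_stub {
                int writable;
                int dirty;
        };
        
        static void set_huge_entry_bits(struct entry_stub *entry,
                                        int vma_may_write, int write_fault)
        {
                if (vma_may_write)
                        entry->writable = 1;
                if (vma_may_write && write_fault)
                        entry->dirty = 1;   /* dirty only when about to write */
        }
      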
      Signed-off-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Acked-by: NMichal Hocko <mhocko@suse.com>
      Cc: Hugh Dickins <hughd@google.com>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      152e93af
  19. 18 November 2017, 1 commit