1. 06 March 2019, 2 commits
    • mm, compaction: shrink compact_control · c5fbd937
      Committed by Mel Gorman
      Patch series "Increase success rates and reduce latency of compaction", v3.
      
      This series reduces scan rates and increases success rates of compaction,
      primarily by using the free lists to shorten scans, better controlling
      skip information and whether multiple scanners can target the same
      block, and capturing pageblocks before they are stolen by parallel requests.
      The series is based on mmotm from January 9th, 2019 with the previous
      compaction series reverted.
      
      I'm mostly using thpscale to measure the impact of the series.  The
      benchmark creates a large file, maps it, faults it, punches holes in the
      mapping so that the virtual address space is fragmented and then tries
      to allocate THP.  It re-executes for different numbers of threads.  From
      a fragmentation perspective, the workload is relatively benign but it
      does stress compaction.
      
      The overall impact on latencies for a 1-socket machine is
      
      				      baseline		      patches
      Amean     fault-both-3      3832.09 (   0.00%)     2748.56 *  28.28%*
      Amean     fault-both-5      4933.06 (   0.00%)     4255.52 (  13.73%)
      Amean     fault-both-7      7017.75 (   0.00%)     6586.93 (   6.14%)
      Amean     fault-both-12    11610.51 (   0.00%)     9162.34 *  21.09%*
      Amean     fault-both-18    17055.85 (   0.00%)    11530.06 *  32.40%*
      Amean     fault-both-24    19306.27 (   0.00%)    17956.13 (   6.99%)
      Amean     fault-both-30    22516.49 (   0.00%)    15686.47 *  30.33%*
      Amean     fault-both-32    23442.93 (   0.00%)    16564.83 *  29.34%*
      
      The allocation success rates are much improved
      
      			 	 baseline		 patches
      Percentage huge-3        85.99 (   0.00%)       97.96 (  13.92%)
      Percentage huge-5        88.27 (   0.00%)       96.87 (   9.74%)
      Percentage huge-7        85.87 (   0.00%)       94.53 (  10.09%)
      Percentage huge-12       82.38 (   0.00%)       98.44 (  19.49%)
      Percentage huge-18       83.29 (   0.00%)       99.14 (  19.04%)
      Percentage huge-24       81.41 (   0.00%)       97.35 (  19.57%)
      Percentage huge-30       80.98 (   0.00%)       98.05 (  21.08%)
      Percentage huge-32       80.53 (   0.00%)       97.06 (  20.53%)
      
      That's a nearly perfect allocation success rate.
      
      The biggest impact is on the scan rates
      
      Compaction migrate scanned    55893379    19341254
      Compaction free scanned      474739990    11903963
      
      The number of pages scanned for migration was reduced by 65% and the
      free scanner was reduced by 97.5%.  So much less work in exchange for
      lower latency and better success rates.
      
      The series was also evaluated using a workload that heavily fragments
      memory but the benefits there are also significant, albeit not
      presented.
      
      It was commented that we should be rethinking scanning entirely and to a
      large extent I agree.  However, to achieve that you need a lot of this
      series in place first, so it's best to make the linear scanners as good
      as possible before ripping them out.
      
      This patch (of 22):
      
      The isolate and migrate scanners should never isolate more than a
      pageblock of pages, so unsigned int is sufficient, saving 8 bytes on a
      64-bit build.
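
      For illustration, a minimal sketch of the kind of change (the real
      struct compact_control has many more fields; only the counter types
      matter here):

      /*
       * Sketch only: the isolation counters never exceed a pageblock's
       * worth of pages, so 32 bits are plenty.
       */
      struct compact_control_sketch {
              struct list_head freepages;     /* pages isolated by the free scanner */
              struct list_head migratepages;  /* pages isolated by the migrate scanner */
              unsigned int nr_freepages;      /* was unsigned long */
              unsigned int nr_migratepages;   /* was unsigned long */
      };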
      
      Link: http://lkml.kernel.org/r/20190118175136.31341-2-mgorman@techsingularity.net
      Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Dan Carpenter <dan.carpenter@oracle.com>
      Cc: YueHaibing <yuehaibing@huawei.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      c5fbd937
    • mm/page_alloc.c: memory hotplug: free pages as higher order · a9cd410a
      Committed by Arun KS
      When the freeing of pages is done at a higher order, the time spent
      coalescing pages in the buddy allocator can be reduced.  With a section
      size of 256MB, the hot-add latency of a single section improves from
      50-60 ms to less than 1 ms, roughly a 60x improvement.  External
      providers of the online callback are modified to align with the change.
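
      As a rough sketch of the idea, assuming a section-aligned range (the
      helper name below is hypothetical; __free_pages_core() is the routine
      mentioned in the fix-up notes below):

      /* Illustrative only: hand the buddy allocator whole high-order chunks. */
      static void sketch_online_pages_range(unsigned long start_pfn,
                                            unsigned long nr_pages)
      {
              const unsigned int order = MAX_ORDER - 1;
              unsigned long pfn;

              for (pfn = start_pfn; pfn < start_pfn + nr_pages; pfn += 1UL << order)
                      __free_pages_core(pfn_to_page(pfn), order);     /* one buddy insertion per chunk */
      }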
      
      [arunks@codeaurora.org: v11]
        Link: http://lkml.kernel.org/r/1547792588-18032-1-git-send-email-arunks@codeaurora.org
      [akpm@linux-foundation.org: remove unused local, per Arun]
      [akpm@linux-foundation.org: avoid return of void-returning __free_pages_core(), per Oscar]
      [akpm@linux-foundation.org: fix it for mm-convert-totalram_pages-and-totalhigh_pages-variables-to-atomic.patch]
      [arunks@codeaurora.org: v8]
        Link: http://lkml.kernel.org/r/1547032395-24582-1-git-send-email-arunks@codeaurora.org
      [arunks@codeaurora.org: v9]
        Link: http://lkml.kernel.org/r/1547098543-26452-1-git-send-email-arunks@codeaurora.org
      Link: http://lkml.kernel.org/r/1538727006-5727-1-git-send-email-arunks@codeaurora.org
      Signed-off-by: Arun KS <arunks@codeaurora.org>
      Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Reviewed-by: Oscar Salvador <osalvador@suse.de>
      Reviewed-by: Alexander Duyck <alexander.h.duyck@linux.intel.com>
      Cc: K. Y. Srinivasan <kys@microsoft.com>
      Cc: Haiyang Zhang <haiyangz@microsoft.com>
      Cc: Stephen Hemminger <sthemmin@microsoft.com>
      Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
      Cc: Juergen Gross <jgross@suse.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Mathieu Malaterre <malat@debian.org>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Souptick Joarder <jrdr.linux@gmail.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Aaron Lu <aaron.lu@intel.com>
      Cc: Srivatsa Vaddagiri <vatsa@codeaurora.org>
      Cc: Vinayak Menon <vinmenon@codeaurora.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      a9cd410a
  2. 29 December 2018, 3 commits
    • mm: use alloc_flags to record if kswapd can wake · 0a79cdad
      Committed by Mel Gorman
      This is a preparation patch that copies the GFP flag __GFP_KSWAPD_RECLAIM
      into alloc_flags.  It is a preparation-only change that avoids having to
      pass gfp_mask through a long callchain in a future patch.
      
      Note that the setting in the fast path happens in alloc_flags_nofragment()
      and it may be claimed that this has nothing to do with ALLOC_NO_FRAGMENT.
      That's true in this patch but is not true later so it's done now for
      easier review to show where the flag needs to be recorded.
      
      No functional change.
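
      The essence of the change is tiny; a hedged sketch (ALLOC_KSWAPD is the
      flag named in the fix-up note below, the surrounding code is omitted):

              /* sketch: mirror __GFP_KSWAPD_RECLAIM into the allocator-internal flags */
              if (gfp_mask & __GFP_KSWAPD_RECLAIM)
                      alloc_flags |= ALLOC_KSWAPD;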
      
      [mgorman@techsingularity.net: ALLOC_KSWAPD flag needs to be applied in the !CONFIG_ZONE_DMA32 case]
        Link: http://lkml.kernel.org/r/20181126143503.GO23260@techsingularity.net
      Link: http://lkml.kernel.org/r/20181123114528.28802-4-mgorman@techsingularity.net
      Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
      Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Zi Yan <zi.yan@cs.rutgers.edu>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      0a79cdad
    • mm, page_alloc: spread allocations across zones before introducing fragmentation · 6bb15450
      Committed by Mel Gorman
      Patch series "Fragmentation avoidance improvements", v5.
      
      It has been noted before that fragmentation avoidance (aka
      anti-fragmentation) is not perfect. Given sufficient time or an adverse
      workload, memory gets fragmented and the long-term success of high-order
      allocations degrades. This series defines an adverse workload and a
      definition of external fragmentation events (including serious ones), and
      then reduces the level of those fragmentation events.
      
      The details of the workload and the consequences are described in more
      detail in the changelogs. However, from patch 1, this is a high-level
      summary of the adverse workload. The exact details are found in the
      mmtests implementation.
      
      The broad details of the workload are as follows;
      
      1. Create an XFS filesystem (not specified in the configuration but done
         as part of the testing for this patch)
      2. Start 4 fio threads that write a number of 64K files inefficiently.
         Inefficiently means that files are created on first access and not
         created in advance (fio parameter create_on_open=1) and fallocate
         is not used (fallocate=none). With multiple IO issuers this creates
         a mix of slab and page cache allocations over time. The total size
         of the files is 150% physical memory so that the slabs and page cache
         pages get mixed
      3. Warm up a number of fio read-only threads accessing the same files
         created in step 2. This part runs for the same length of time it
         took to create the files. It'll fault back in old data and further
         interleave slab and page cache allocations. As it's now low on
         memory due to step 2, fragmentation occurs as pageblocks get
         stolen.
      4. While step 3 is still running, start a process that tries to allocate
         75% of memory as huge pages with a number of threads. The number of
         threads is based on a (NR_CPUS_SOCKET - NR_FIO_THREADS)/4 to avoid THP
         threads contending with fio, any other threads or forcing cross-NUMA
         scheduling. Note that the test has not been used on a machine with less
         than 8 cores. The benchmark records whether huge pages were allocated
         and what the fault latency was in microseconds
      5. Measure the number of events potentially causing external fragmentation,
         the fault latency and the huge page allocation success rate.
      6. Cleanup
      
      Overall the series reduces external fragmentation causing events by over 94%
      on 1 and 2 socket machines, which in turn impacts high-order allocation
      success rates over the long term. There are differences in latencies and
      high-order allocation success rates. Latencies are a mixed bag as they
      are vulnerable to exact system state and whether allocations succeeded
      so they are treated as a secondary metric.
      
      Patch 1 uses lower zones if they are populated and have free memory
      	instead of fragmenting a higher zone. It's special cased to
      	handle a Normal->DMA32 fallback with the reasons explained
      	in the changelog.
      
      Patch 2-4 boosts watermarks temporarily when an external fragmentation
      	event occurs. kswapd wakes to reclaim a small amount of old memory
      	and then wakes kcompactd on completion to recover the system
      	slightly. This introduces some overhead in the slowpath. The level
      	of boosting can be tuned or disabled depending on the tolerance
      	for fragmentation vs allocation latency.
      
      Patch 5 stalls some movable allocation requests to let kswapd from patch 4
      	make some progress. The duration of the stalls is very low but it
      	is possible to tune the system to avoid fragmentation events if
      	larger stalls can be tolerated.
      
      The bulk of the improvement in fragmentation avoidance is from patches
      1-4 but patch 5 can deal with a rare corner case and provides the option
      of tuning a system for THP allocation success rates in exchange for
      some stalls to control fragmentation.
      
      This patch (of 5):
      
      The page allocator zone lists are iterated based on the watermarks of each
      zone which does not take anti-fragmentation into account.  On x86, node 0
      may have multiple zones while other nodes have one zone.  A consequence is
      that tasks running on node 0 may fragment ZONE_NORMAL even though
      ZONE_DMA32 has plenty of free memory.  This patch special cases the
      allocator fast path such that it'll try an allocation from a lower local
      zone before fragmenting a higher zone.  In this case, stealing of
      pageblocks or orders larger than a pageblock are still allowed in the fast
      path as they are uninteresting from a fragmentation point of view.
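
      A hedged sketch of the shape of that fast-path special case;
      alloc_flags_nofragment() is the helper named elsewhere in this log, but
      the body here is simplified and the exact flag spelling is an assumption:

      static inline unsigned int alloc_flags_nofragment(struct zone *zone)
      {
              if (zone_idx(zone) != ZONE_NORMAL)
                      return 0;
      #ifdef CONFIG_ZONE_DMA32
              /* Prefer a populated DMA32 zone below Normal over fragmenting Normal. */
              if (populated_zone(zone - 1))
                      return ALLOC_NOFRAGMENT;
      #endif
              return 0;
      }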
      
      This was evaluated using a benchmark designed to fragment memory before
      attempting THP allocations.  It's implemented in mmtests as the following
      configurations
      
      configs/config-global-dhp__workload_thpfioscale
      configs/config-global-dhp__workload_thpfioscale-defrag
      configs/config-global-dhp__workload_thpfioscale-madvhugepage
      
      e.g. from mmtests
      ./run-mmtests.sh --run-monitor --config configs/config-global-dhp__workload_thpfioscale test-run-1
      
      The broad details of the workload are as follows;
      
      1. Create an XFS filesystem (not specified in the configuration but done
         as part of the testing for this patch).
      2. Start 4 fio threads that write a number of 64K files inefficiently.
         Inefficiently means that files are created on first access and not
         created in advance (fio parameter create_on_open=1) and fallocate
         is not used (fallocate=none). With multiple IO issuers this creates
         a mix of slab and page cache allocations over time. The total size
         of the files is 150% physical memory so that the slabs and page cache
         pages get mixed.
      3. Warm up a number of fio read-only processes accessing the same files
         created in step 2. This part runs for the same length of time it
         took to create the files. It'll refault old data and further
         interleave slab and page cache allocations. As it's now low on
         memory due to step 2, fragmentation occurs as pageblocks get
         stolen.
      4. While step 3 is still running, start a process that tries to allocate
         75% of memory as huge pages with a number of threads. The number of
         threads is based on a (NR_CPUS_SOCKET - NR_FIO_THREADS)/4 to avoid THP
         threads contending with fio, any other threads or forcing cross-NUMA
         scheduling. Note that the test has not been used on a machine with less
         than 8 cores. The benchmark records whether huge pages were allocated
         and what the fault latency was in microseconds.
      5. Measure the number of events potentially causing external fragmentation,
         the fault latency and the huge page allocation success rate.
      6. Cleanup the test files.
      
      Note that due to the use of IO and page cache, this benchmark is not
      suitable for running on large machines where the time to fragment memory
      may be excessive.  Also note that while this is one mix that generates
      fragmentation, it is not the only mix that generates fragmentation.
      Differences in workload that are more slab-intensive or whether SLUB is
      used with high-order pages may yield different results.
      
      When the page allocator fragments memory, it records the event using the
      mm_page_alloc_extfrag ftrace event.  If the fallback_order is smaller than
      a pageblock order (order-9 on 64-bit x86) then it's considered to be an
      "external fragmentation event" that may cause issues in the future.
      Hence, the primary metric here is the number of external fragmentation
      events that occur with order < 9.  The secondary metrics are allocation
      latency and huge page allocation success rates, but note that differences
      in latencies and in the success rate can also affect the number of
      external fragmentation events, which is why they are secondary metrics.
      
      1-socket Skylake machine
      config-global-dhp__workload_thpfioscale XFS (no special madvise)
      4 fio threads, 1 THP allocating thread
      --------------------------------------
      
      4.20-rc3 extfrag events < order 9:   804694
      4.20-rc3+patch:                      408912 (49% reduction)
      
      thpfioscale Fault Latencies
                                         4.20.0-rc3             4.20.0-rc3
                                            vanilla           lowzone-v5r8
      Amean     fault-base-1      662.92 (   0.00%)      653.58 *   1.41%*
      Amean     fault-huge-1        0.00 (   0.00%)        0.00 (   0.00%)
      
                                    4.20.0-rc3             4.20.0-rc3
                                       vanilla           lowzone-v5r8
      Percentage huge-1        0.00 (   0.00%)        0.00 (   0.00%)
      
      Fault latencies are slightly reduced while allocation success rates remain
      at zero as this configuration does not make any special effort to allocate
      THP and fio is heavily active at the time and either filling memory or
      keeping pages resident.  However, a 49% reduction of serious fragmentation
      events reduces the chances of external fragmentation being a problem in
      the future.
      
      Vlastimil asked during review for a breakdown of the allocation types
      that are falling back.
      
      vanilla
         3816 MIGRATE_UNMOVABLE
       800845 MIGRATE_MOVABLE
           33 MIGRATE_UNRECLAIMABLE
      
      patch
          735 MIGRATE_UNMOVABLE
       408135 MIGRATE_MOVABLE
           42 MIGRATE_UNRECLAIMABLE
      
      The majority of the fallbacks are due to movable allocations, and this is
      consistent for the workload throughout the series, so the breakdown will
      not be presented again.
      
      Movable fallbacks are sometimes considered "ok" to fallback because they
      can be migrated.  The problem is that they can fill an
      unmovable/reclaimable pageblock causing those allocations to fallback
      later and polluting pageblocks with pages that cannot move.  If there is a
      movable fallback, it is pretty much guaranteed to affect an
      unmovable/reclaimable pageblock and while it might not be enough to
      actually cause an unmovable/reclaimable fallback in the future, we cannot
      know that in advance so the patch takes the only option available to it.
      Hence, it's important to control them.  This point is also consistent
      throughout the series and will not be repeated.
      
      1-socket Skylake machine
      global-dhp__workload_thpfioscale-madvhugepage-xfs (MADV_HUGEPAGE)
      -----------------------------------------------------------------
      
      4.20-rc3 extfrag events < order 9:  291392
      4.20-rc3+patch:                     191187 (34% reduction)
      
      thpfioscale Fault Latencies
                                         4.20.0-rc3             4.20.0-rc3
                                            vanilla           lowzone-v5r8
      Amean     fault-base-1     1495.14 (   0.00%)     1467.55 (   1.85%)
      Amean     fault-huge-1     1098.48 (   0.00%)     1127.11 (  -2.61%)
      
      thpfioscale Percentage Faults Huge
                                    4.20.0-rc3             4.20.0-rc3
                                       vanilla           lowzone-v5r8
      Percentage huge-1       78.57 (   0.00%)       77.64 (  -1.18%)
      
      Fragmentation events were reduced quite a bit although this is known
      to be a little variable. The latencies and allocation success rates
      are similar but they were already quite high.
      
      2-socket Haswell machine
      config-global-dhp__workload_thpfioscale XFS (no special madvise)
      4 fio threads, 5 THP allocating threads
      ----------------------------------------------------------------
      
      4.20-rc3 extfrag events < order 9:  215698
      4.20-rc3+patch:                     200210 (7% reduction)
      
      thpfioscale Fault Latencies
                                         4.20.0-rc3             4.20.0-rc3
                                            vanilla           lowzone-v5r8
      Amean     fault-base-5     1350.05 (   0.00%)     1346.45 (   0.27%)
      Amean     fault-huge-5     4181.01 (   0.00%)     3418.60 (  18.24%)
      
                                    4.20.0-rc3             4.20.0-rc3
                                       vanilla           lowzone-v5r8
      Percentage huge-5        1.15 (   0.00%)        0.78 ( -31.88%)
      
      The reduction of external fragmentation events is slight and this is
      partially due to the removal of __GFP_THISNODE in commit ac5b2c18
      ("mm: thp: relax __GFP_THISNODE for MADV_HUGEPAGE mappings") as THP
      allocations can now spill over to remote nodes instead of fragmenting
      local memory.
      
      2-socket Haswell machine
      global-dhp__workload_thpfioscale-madvhugepage-xfs (MADV_HUGEPAGE)
      -----------------------------------------------------------------
      
      4.20-rc3 extfrag events < order 9: 166352
      4.20-rc3+patch:                    147463 (11% reduction)
      
      thpfioscale Fault Latencies
                                         4.20.0-rc3             4.20.0-rc3
                                            vanilla           lowzone-v5r8
      Amean     fault-base-5     6138.97 (   0.00%)     6217.43 (  -1.28%)
      Amean     fault-huge-5     2294.28 (   0.00%)     3163.33 * -37.88%*
      
      thpfioscale Percentage Faults Huge
                                    4.20.0-rc3             4.20.0-rc3
                                       vanilla           lowzone-v5r8
      Percentage huge-5       96.82 (   0.00%)       95.14 (  -1.74%)
      
      There was a slight reduction in external fragmentation events although the
      latencies were higher.  The allocation success rate is high enough that
      the system is struggling and there is quite a lot of parallel reclaim and
      compaction activity.  There is also a certain degree of luck on whether
      processes start on node 0 or not for this patch but the relevance is
      reduced later in the series.
      
      Overall, the patch reduces the number of external fragmentation causing
      events so the success of THP over long periods of time would be improved
      for this adverse workload.
      
      Link: http://lkml.kernel.org/r/20181123114528.28802-2-mgorman@techsingularity.net
      Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Zi Yan <zi.yan@cs.rutgers.edu>
      Cc: Michal Hocko <mhocko@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      6bb15450
    • vmscan: return NODE_RECLAIM_NOSCAN in node_reclaim() when CONFIG_NUMA is n · 8b09549c
      Committed by Wei Yang
      Commit fa5e084e ("vmscan: do not unconditionally treat zones that
      fail zone_reclaim() as full") changed the return value of
      node_reclaim().  The original return value 0 means NODE_RECLAIM_SOME
      after this commit.
      
      However, the return value of node_reclaim() when CONFIG_NUMA is n was not
      changed.  This leads to zone_watermark_ok() being called again.
      
      This patch fixes the return value by adjusting it to NODE_RECLAIM_NOSCAN.
      Since node_reclaim() is only called in page_alloc.c, move it to
      mm/internal.h.
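
      A sketch of the corrected !CONFIG_NUMA stub (simplified; the exact
      prototype may differ slightly):

      #ifdef CONFIG_NUMA
      extern int node_reclaim(struct pglist_data *pgdat, gfp_t mask,
                              unsigned int order);
      #else
      static inline int node_reclaim(struct pglist_data *pgdat, gfp_t mask,
                                     unsigned int order)
      {
              return NODE_RECLAIM_NOSCAN;     /* was 0, i.e. NODE_RECLAIM_SOME */
      }
      #endif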
      
      Link: http://lkml.kernel.org/r/20181113080436.22078-1-richard.weiyang@gmail.com
      Signed-off-by: Wei Yang <richard.weiyang@gmail.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Reviewed-by: Matthew Wilcox <willy@infradead.org>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      8b09549c
  3. 31 October 2018, 1 commit
    • memblock: rename __free_pages_bootmem to memblock_free_pages · 7c2ee349
      Committed by Mike Rapoport
      The conversion is done using
      
      sed -i 's@__free_pages_bootmem@memblock_free_pages@' \
          $(git grep -l __free_pages_bootmem)
      
      Link: http://lkml.kernel.org/r/1536927045-23536-27-git-send-email-rppt@linux.vnet.ibm.com
      Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Chris Zankel <chris@zankel.net>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Geert Uytterhoeven <geert@linux-m68k.org>
      Cc: Greentime Hu <green.hu@gmail.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Guan Xuetao <gxt@pku.edu.cn>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: "James E.J. Bottomley" <jejb@parisc-linux.org>
      Cc: Jonas Bonn <jonas@southpole.se>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Ley Foon Tan <lftan@altera.com>
      Cc: Mark Salter <msalter@redhat.com>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Matt Turner <mattst88@gmail.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Michal Simek <monstr@monstr.eu>
      Cc: Palmer Dabbelt <palmer@sifive.com>
      Cc: Paul Burton <paul.burton@mips.com>
      Cc: Richard Kuo <rkuo@codeaurora.org>
      Cc: Richard Weinberger <richard@nod.at>
      Cc: Rich Felker <dalias@libc.org>
      Cc: Russell King <linux@armlinux.org.uk>
      Cc: Serge Semin <fancer.lancer@gmail.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: Vineet Gupta <vgupta@synopsys.com>
      Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      7c2ee349
  4. 24 August 2018, 1 commit
  5. 23 August 2018, 1 commit
  6. 02 June 2018, 1 commit
  7. 25 May 2018, 1 commit
    • Revert "mm/cma: manage the memory of the CMA area by using the ZONE_MOVABLE" · d883c6cf
      Committed by Joonsoo Kim
      This reverts the following commits that change CMA design in MM.
      
       3d2054ad ("ARM: CMA: avoid double mapping to the CMA area if CONFIG_HIGHMEM=y")
      
       1d47a3ec ("mm/cma: remove ALLOC_CMA")
      
       bad8c6c0 ("mm/cma: manage the memory of the CMA area by using the ZONE_MOVABLE")
      
      Ville reported the following error on i386.
      
        Inode-cache hash table entries: 65536 (order: 6, 262144 bytes)
        microcode: microcode updated early to revision 0x4, date = 2013-06-28
        Initializing CPU#0
        Initializing HighMem for node 0 (000377fe:00118000)
        Initializing Movable for node 0 (00000001:00118000)
        BUG: Bad page state in process swapper  pfn:377fe
        page:f53effc0 count:0 mapcount:-127 mapping:00000000 index:0x0
        flags: 0x80000000()
        raw: 80000000 00000000 00000000 ffffff80 00000000 00000100 00000200 00000001
        page dumped because: nonzero mapcount
        Modules linked in:
        CPU: 0 PID: 0 Comm: swapper Not tainted 4.17.0-rc5-elk+ #145
        Hardware name: Dell Inc. Latitude E5410/03VXMC, BIOS A15 07/11/2013
        Call Trace:
         dump_stack+0x60/0x96
         bad_page+0x9a/0x100
         free_pages_check_bad+0x3f/0x60
         free_pcppages_bulk+0x29d/0x5b0
         free_unref_page_commit+0x84/0xb0
         free_unref_page+0x3e/0x70
         __free_pages+0x1d/0x20
         free_highmem_page+0x19/0x40
         add_highpages_with_active_regions+0xab/0xeb
         set_highmem_pages_init+0x66/0x73
         mem_init+0x1b/0x1d7
         start_kernel+0x17a/0x363
         i386_start_kernel+0x95/0x99
         startup_32_smp+0x164/0x168
      
      The reason for this error is that the span of ZONE_MOVABLE is extended
      to the whole node span for future CMA initialization, and normal memory
      is wrongly freed here.  I submitted a fix and it seems to work, but
      another problem then appeared.

      It's too late in the cycle to fix that second problem, so I decided to
      revert the series.
      Reported-by: Ville Syrjälä <ville.syrjala@linux.intel.com>
      Acked-by: Laura Abbott <labbott@redhat.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      d883c6cf
  8. 12 April 2018, 4 commits
    • mm/cma: remove ALLOC_CMA · 1d47a3ec
      Committed by Joonsoo Kim
      Now all reserved pages for the CMA region belong to ZONE_MOVABLE, and
      they only serve requests with GFP_HIGHMEM && GFP_MOVABLE.
      
      Therefore, we don't need to maintain ALLOC_CMA at all.
      
      Link: http://lkml.kernel.org/r/1512114786-5085-3-git-send-email-iamjoonsoo.kim@lge.com
      Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Tested-by: Tony Lindgren <tony@atomide.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Laura Abbott <lauraa@codeaurora.org>
      Cc: Marek Szyprowski <m.szyprowski@samsung.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Michal Nazarewicz <mina86@mina86.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Russell King <linux@armlinux.org.uk>
      Cc: Will Deacon <will.deacon@arm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      1d47a3ec
    • mm/cma: manage the memory of the CMA area by using the ZONE_MOVABLE · bad8c6c0
      Committed by Joonsoo Kim
      Patch series "mm/cma: manage the memory of the CMA area by using the
      ZONE_MOVABLE", v2.
      
      0. History
      
      This patchset is the follow-up of the discussion about the "Introduce
      ZONE_CMA (v7)" [1].  Please reference it if more information is needed.
      
      1. What does this patch do?
      
      This patch changes the way the memory of the CMA area is managed in the
      MM subsystem.  Currently the memory of the CMA area is managed by the
      zone that its pfns belong to.  However, this approach has some problems
      because the MM subsystem doesn't have enough logic to handle a situation
      where memories with different characteristics are in a single zone.  To
      solve this issue, this patch manages all the memory of the CMA area by
      using the MOVABLE zone.  From the MM subsystem's point of view, the
      characteristics of the memory in the MOVABLE zone and of the memory in
      the CMA area are the same, so managing the memory of the CMA area via
      the MOVABLE zone does not cause any problem.
      
      2. Motivation
      
      There are some problems with the current approach, described below.
      Although these problems are not inherent and could be fixed without this
      conceptual change, doing so would require adding many hooks in various
      code paths, would be intrusive to core MM and would be really
      error-prone.  Therefore, I try to solve them with this new approach.
      The problems of the current implementation are as follows.
      
      o CMA memory utilization
      
      First, the following is the freepage calculation logic in MM.
      
       - For movable allocation: freepage = total freepage
       - For unmovable allocation: freepage = total freepage - CMA freepage
      
      Freepages in the CMA area are used only after the normal freepages in
      the zone that the CMA memory belongs to are exhausted.  At that moment
      the number of normal freepages is zero, so
      
       - For movable allocation: freepage = total freepage = CMA freepage
       - For unmovable allocation: freepage = 0
      
      If an unmovable allocation comes at this moment, the allocation request
      fails to pass the watermark check and reclaim is started.  After
      reclaim, normal freepages exist again, so the freepages in the CMA area
      are still not used.
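
      A hedged sketch of the watermark logic that produces this behaviour,
      simplified from __zone_watermark_ok(); variable names are approximate:

              /* sketch: unmovable requests may not count CMA freepages */
              long free_pages = zone_page_state(z, NR_FREE_PAGES);

      #ifdef CONFIG_CMA
              if (!(alloc_flags & ALLOC_CMA))
                      free_pages -= zone_page_state(z, NR_FREE_CMA_PAGES);
      #endif
              if (free_pages <= min + z->lowmem_reserve[classzone_idx])
                      return false;   /* fails the check, so reclaim is started */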
      
      FYI, there is another attempt [2] trying to solve this problem in lkml.
      And, as far as I know, Qualcomm also has an out-of-tree solution for
      this problem.
      
      Useless reclaim:
      
      There is no logic to distinguish CMA pages in the reclaim path.  Hence,
      a CMA page is reclaimed even if the system just needs a page that is
      usable for kernel allocations.
      
      Atomic allocation failure:
      
      This is also related to the fallback allocation policy for the memory of
      the CMA area.  Consider a situation where the number of normal
      freepages is *zero* because a bunch of movable allocation requests
      have come in.  Kswapd would not be woken up due to the following
      freepage calculation logic.
      
      - For movable allocation: freepage = total freepage = CMA freepage
      
      If an atomic unmovable allocation request comes at this moment, it
      fails due to the following logic.
      
      - For unmovable allocation: freepage = total freepage - CMA freepage = 0
      
      It was reported by Aneesh [3].
      
      Useless compaction:
      
      The usual high-order allocation request is an unmovable allocation
      request, and it cannot be served from the memory of the CMA area.  In
      compaction, the migration scanner tries to migrate pages in the CMA area
      and make high-order pages there.  As mentioned above, these cannot be
      used for unmovable allocation requests, so the work is simply wasted.
      
      3. Current approach and new approach
      
      The current approach is that the memory of the CMA area is managed by
      the zone that its pfns belong to.  However, this memory should be
      distinguishable since it has a strong limitation.  So, it is marked as
      MIGRATE_CMA in the pageblock flags and handled specially.  However, as
      mentioned in section 2, the MM subsystem doesn't have enough logic to
      deal with this special pageblock, so many problems arise.

      The new approach is that the memory of the CMA area is managed by the
      MOVABLE zone.  MM already has enough logic to deal with special zones
      like HIGHMEM and MOVABLE.  So, managing the memory of the CMA area via
      the MOVABLE zone just naturally works well, because the constraint on
      the memory of the CMA area, namely that it must always be migratable,
      is the same as the constraint for the MOVABLE zone.
      
      There is one side-effect for the usability of the memory of the CMA
      area.  The use of the MOVABLE zone is only allowed for requests with
      GFP_HIGHMEM && GFP_MOVABLE, so the memory of the CMA area is now also
      only available to requests with these gfp flags.  Before this patchset,
      a request with just GFP_MOVABLE could use it.  IMO, it would not be a
      big issue since most GFP_MOVABLE requests also have the GFP_HIGHMEM
      flag, for example file cache pages and anonymous pages.  However, file
      cache pages for blockdev files are an exception: requests for them have
      no GFP_HIGHMEM flag.  There are pros and cons to this exception.  In my
      experience, blockdev file cache pages are one of the top reasons that
      cause cma_alloc() to fail temporarily.  So, we get a stronger guarantee
      of cma_alloc() success by discarding this case.

      Note that there is no change from the admin's point of view since this
      patchset is just an internal implementation change in the MM subsystem.
      The one minor difference for the admin is that the memory stats for the
      CMA area will be printed under the MOVABLE zone.  That's all.
      
      4. Result
      
      Following is the experimental result related to utilization problem.
      
      8 CPUs, 1024 MB, VIRTUAL MACHINE
      make -j16
      
      <Before>
        CMA area:               0 MB            512 MB
        Elapsed-time:           92.4		186.5
        pswpin:                 82		18647
        pswpout:                160		69839
      
      <After>
        CMA        :            0 MB            512 MB
        Elapsed-time:           93.1		93.4
        pswpin:                 84		46
        pswpout:                183		92
      
      akpm: "kernel test robot" reported a 26% improvement in
      vm-scalability.throughput:
      http://lkml.kernel.org/r/20180330012721.GA3845@yexl-desktop
      
      [1]: lkml.kernel.org/r/1491880640-9944-1-git-send-email-iamjoonsoo.kim@lge.com
      [2]: https://lkml.org/lkml/2014/10/15/623
      [3]: http://www.spinics.net/lists/linux-mm/msg100562.html
      
      Link: http://lkml.kernel.org/r/1512114786-5085-2-git-send-email-iamjoonsoo.kim@lge.com
      Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Tested-by: Tony Lindgren <tony@atomide.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Laura Abbott <lauraa@codeaurora.org>
      Cc: Marek Szyprowski <m.szyprowski@samsung.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Michal Nazarewicz <mina86@mina86.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Russell King <linux@armlinux.org.uk>
      Cc: Will Deacon <will.deacon@arm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      bad8c6c0
    • mm, migrate: remove reason argument from new_page_t · 666feb21
      Committed by Michal Hocko
      No allocation callback is using this argument anymore.  new_page_node
      used to use this parameter to convey the node_id, respectively the
      migration error, up to the move_pages code (do_move_page_to_node_array).
      The error status never made it into the final status field and we have
      a better way to communicate the node id to the status field now.  All
      other allocation callbacks simply ignored the argument, so we can
      finally drop it.
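
      For reference, a hedged before/after sketch of the callback type; the
      argument name is approximate:

      /* before: every callback had to accept an unused result pointer */
      typedef struct page *new_page_t(struct page *page, unsigned long private,
                                      int **reason);

      /* after this patch: the unused argument is dropped */
      typedef struct page *new_page_t(struct page *page, unsigned long private);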
      
      [mhocko@suse.com: fix migration callback]
        Link: http://lkml.kernel.org/r/20180105085259.GH2801@dhcp22.suse.cz
      [akpm@linux-foundation.org: fix alloc_misplaced_dst_page()]
      [mhocko@kernel.org: fix build]
        Link: http://lkml.kernel.org/r/20180103091134.GB11319@dhcp22.suse.cz
      Link: http://lkml.kernel.org/r/20180103082555.14592-3-mhocko@kernel.org
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Reviewed-by: Zi Yan <zi.yan@cs.rutgers.edu>
      Cc: Andrea Reale <ar@linux.vnet.ibm.com>
      Cc: Anshuman Khandual <khandual@linux.vnet.ibm.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      666feb21
    • mm, numa: rework do_pages_move · a49bd4d7
      Committed by Michal Hocko
      Patch series "unclutter thp migration"
      
      Motivation:
      
      THP migration is hacked into the generic migration code with rather
      surprising semantics.  The migration allocation callback is supposed to
      check whether the THP can be migrated at once and if that is not the
      case then it allocates a simple page to migrate.  unmap_and_move then
      fixes that up by splitting the THP into small pages while moving the
      head page to the newly allocated order-0 page.  Remaining pages are
      moved to the LRU list by split_huge_page.  The same happens if the THP
      allocation fails.  This is really ugly and error prone [2].
      
      I also believe that split_huge_page to the LRU lists is inherently wrong
      because all tail pages are not migrated.  Some callers will just work
      around that by retrying (e.g.  memory hotplug).  There are other pfn
      walkers which are simply broken though.  e.g. madvise_inject_error will
      migrate the head page and then advance the next pfn by the huge page
      size.  do_move_page_to_node_array and queue_pages_range (migrate_pages,
      mbind) will simply split the THP before migration if THP migration is
      not supported, then fall back to single page migration, but they do not
      handle tail pages if the THP migration path is not able to allocate a
      fresh THP, so we end up with ENOMEM and fail the whole migration, which
      is questionable behavior.  Page compaction doesn't try to migrate large
      pages so it should be immune.
      
      The first patch reworks do_pages_move, which relies on a very ugly
      calling convention where the return status is pushed to the migration
      path via a private pointer.  It uses pre-allocated fixed-size batching to
      achieve that.  We simply cannot do the same if a THP is to be split
      during the migration path which is done in the patch 3.  Patch 2 is
      follow up cleanup which removes the mentioned return status calling
      convention ugliness.
      
      On a side note:
      
      There are some semantic issues I have encountered on the way when
      working on patch 1 but I am not addressing them here.  E.g. trying to
      move THP tail pages will result in either success or EBUSY (the latter
      one more likely once we isolate the head from the LRU list).  Hugetlb
      reports EACCESS on tail pages.  Some errors are reported via status
      parameter but migration failures are not even though the original
      `reason' argument suggests there was an intention to do so.  From a
      quick look into git history this never worked.  I have tried to keep the
      semantic unchanged.
      
      Then there is a relatively minor thing that the page isolation might
      fail because of pages not being on the LRU - e.g. because they are
      sitting on the per-cpu LRU caches.  Easily fixable.
      
      This patch (of 3):
      
      do_pages_move is supposed to move user defined memory (an array of
      addresses) to the user defined numa nodes (an array of nodes one for
      each address).  The user provided status array then contains resulting
      numa node for each address or an error.  The semantics of this function
      are a little bit confusing because only some errors are reported back.
      Notably the migrate_pages error is only reported via the return value.  This
      patch doesn't try to address these semantic nuances but rather change
      the underlying implementation.
      
      Currently we are processing user input (which can be really large) in
      batches which are stored to a temporarily allocated page.  Each address
      is resolved to its struct page and stored to page_to_node structure
      along with the requested target numa node.  The array of these
      structures is then conveyed down the page migration path via private
      argument.  new_page_node then finds the corresponding structure and
      allocates the proper target page.
      
      What is the problem with the current implementation and why change it?
      Apart from being quite ugly, it also doesn't cope with unexpected
      pages showing up on the migration list inside migrate_pages path.  That
      doesn't happen currently but the follow up patch would like to make the
      thp migration code more clear and that would need to split a THP into
      the list for some cases.
      
      How does the new implementation work? Well, instead of batching into a
      fixed size array we simply batch all pages that should be migrated to
      the same node and isolate all of them into a linked list which doesn't
      require any additional storage.  This should work reasonably well
      because page migration usually migrates larger ranges of memory to a
      specific node.  So the common case should work equally well as the
      current implementation.  Even if somebody constructs an input where the
      target numa nodes would be interleaved we shouldn't see a large
      performance impact because page migration alone doesn't really benefit
      from batching.  mmap_sem batching for the lookup is quite questionable
      and isolate_lru_page which would benefit from batching is not using it
      even in the current implementation.
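
      A hedged sketch of the batching strategy described above;
      isolate_to_pagelist() and migrate_batch_to_node() are hypothetical
      stand-ins for the helpers the patch actually introduces:

      static int sketch_do_pages_move(struct mm_struct *mm, unsigned long nr_pages,
                                      const unsigned long *addrs, const int *nodes)
      {
              LIST_HEAD(pagelist);
              int current_node = NUMA_NO_NODE;
              unsigned long i;
              int err = 0;

              for (i = 0; i < nr_pages && !err; i++) {
                      if (current_node != NUMA_NO_NODE && nodes[i] != current_node) {
                              /* target node changed: migrate the batch collected so far */
                              err = migrate_batch_to_node(mm, &pagelist, current_node);
                              current_node = NUMA_NO_NODE;
                      }
                      /* isolate onto a plain linked list; no fixed-size array needed */
                      if (!err && isolate_to_pagelist(mm, addrs[i], &pagelist))
                              current_node = nodes[i];
              }
              if (!list_empty(&pagelist))
                      err = migrate_batch_to_node(mm, &pagelist, current_node);
              return err;
      }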
      
      Link: http://lkml.kernel.org/r/20180103082555.14592-2-mhocko@kernel.org
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Kirill A. Shutemov <kirill@shutemov.name>
      Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: Anshuman Khandual <khandual@linux.vnet.ibm.com>
      Cc: Zi Yan <zi.yan@cs.rutgers.edu>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Andrea Reale <ar@linux.vnet.ibm.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      a49bd4d7
  9. 30 November 2017, 1 commit
  10. 28 November 2017, 1 commit
    • mm, thp: Do not make pmd/pud dirty without a reason · 152e93af
      Committed by Kirill A. Shutemov
      Currently we make page table entries dirty all the time regardless of
      access type and don't even consider if the mapping is write-protected.
      The reasoning is that we don't really need dirty tracking on THP and
      making the entry dirty upfront may save some time on first write to the
      page.
      
      Unfortunately, such an approach may result in a false-positive
      can_follow_write_pmd() for the huge zero page or a read-only shmem file.

      Let's only make the page dirty if we are about to write to the page
      anyway (as we do for small pages).

      I've restructured the code to make the entry dirty inside
      maybe_p[mu]d_mkwrite().  It also takes into account whether the vma is
      write-protected.
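
      A hedged sketch of what the restructured helper might look like; the
      extra parameter and exact prototype are assumptions based on the
      description above:

      /* sketch: only dirty a writable mapping, and only when asked to */
      static pmd_t maybe_pmd_mkwrite(pmd_t pmd, struct vm_area_struct *vma,
                                     bool dirty)
      {
              if (likely(vma->vm_flags & VM_WRITE)) {
                      pmd = pmd_mkwrite(pmd);
                      if (dirty)
                              pmd = pmd_mkdirty(pmd);
              }
              return pmd;
      }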
      Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Hugh Dickins <hughd@google.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      152e93af
  11. 18 November 2017, 1 commit
  12. 07 September 2017, 2 commits
    • mm, oom: do not rely on TIF_MEMDIE for memory reserves access · cd04ae1e
      Committed by Michal Hocko
      For ages we have been relying on TIF_MEMDIE thread flag to mark OOM
      victims and then, among other things, to give these threads full access
      to memory reserves.  There are a few shortcomings of this
      implementation, though.
      
      First of all and the most serious one is that the full access to memory
      reserves is quite dangerous because we leave no safety room for the
      system to operate and potentially do last emergency steps to move on.
      
      Secondly this flag is per task_struct while the OOM killer operates on
      mm_struct granularity so all processes sharing the given mm are killed.
      Giving the full access to all these task_structs could lead to a quick
      memory reserves depletion.  We have tried to reduce this risk by giving
      TIF_MEMDIE only to the main thread and the currently allocating task but
      that doesn't really solve this problem while it surely opens up room
      for corner cases - e.g.  GFP_NO{FS,IO} requests might loop inside the
      allocator without access to memory reserves because a particular thread
      was not the group leader.
      
      Now that we have the oom reaper and that all oom victims are reapable
      after 1b51e65e ("oom, oom_reaper: allow to reap mm shared by the
      kthreads") we can be more conservative and grant only partial access to
      memory reserves because there are reasonable chances of the parallel
      memory freeing.  We still want some access to reserves because we do not
      want other consumers to eat up the victim's freed memory.  oom victims
      will still contend with __GFP_HIGH users but those shouldn't be so
      aggressive to starve oom victims completely.
      
      Introduce ALLOC_OOM flag and give all tsk_is_oom_victim tasks access to
      half of the reserves.  This makes the access to reserves independent of
      which task has passed through mark_oom_victim.  Also drop any usage
      of TIF_MEMDIE from the page allocator proper and replace it by
      tsk_is_oom_victim as well which will make page_alloc.c completely
      TIF_MEMDIE free finally.
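
      A hedged sketch of the resulting watermark adjustment, simplified from
      the allocator's watermark check; the surrounding variables are assumed
      context:

              /* sketch: OOM victims get deeper, but not unlimited, reserve access */
              if (alloc_flags & ALLOC_HIGH)
                      min -= min / 2;

              if (unlikely(alloc_harder)) {
                      if (alloc_flags & ALLOC_OOM)    /* tsk_is_oom_victim() callers */
                              min -= min / 2;         /* half of the reserves, not all */
                      else
                              min -= min / 4;
              }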
      
      CONFIG_MMU=n doesn't have oom reaper so let's stick to the original
      ALLOC_NO_WATERMARKS approach.
      
      There is a demand to make the oom killer memcg aware which will imply
      many tasks killed at once.  This change will allow such a usecase
      without worrying about complete memory reserves depletion.
      
      Link: http://lkml.kernel.org/r/20170810075019.28998-2-mhocko@kernel.org
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Mel Gorman <mgorman@techsingularity.net>
      Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Roman Gushchin <guro@fb.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      cd04ae1e
    • mm, memory_hotplug: drop zone from build_all_zonelists · 72675e13
      Committed by Michal Hocko
      build_all_zonelists gets a zone parameter to initialize zone's pagesets.
      There is only a single user which gives a non-NULL zone parameter and
      that one doesn't really need the rest of the build_all_zonelists (see
      commit 6dcd73d7 ("memory-hotplug: allocate zone's pcp before
      onlining pages")).
      
      Therefore remove setup_zone_pageset from build_all_zonelists and call it
      from its only user directly.  This will also remove a pointless
      zonelists rebuilding, which is always good.
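
      A hedged sketch of the resulting call site in the memory hotplug online
      path (simplified; variable names are approximate):

              /* sketch: the online path initializes the zone's pcp lists itself */
              if (!populated_zone(zone)) {
                      need_zonelists_rebuild = 1;
                      setup_zone_pageset(zone);       /* previously done inside build_all_zonelists() */
              }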
      
      Link: http://lkml.kernel.org/r/20170721143915.14161-5-mhocko@kernel.org
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Joonsoo Kim <js1304@gmail.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Shaohua Li <shaohua.li@intel.com>
      Cc: Toshi Kani <toshi.kani@hpe.com>
      Cc: Wen Congyang <wency@cn.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      72675e13
  13. 03 August 2017, 1 commit
    • mm, mprotect: flush TLB if potentially racing with a parallel reclaim leaving stale TLB entries · 3ea27719
      Committed by Mel Gorman
      Nadav Amit identified a theoretical race between page reclaim and
      mprotect due to TLB flushes being batched outside of the PTL being held.
      
      He described the race as follows:
      
              CPU0                            CPU1
              ----                            ----
                                              user accesses memory using RW PTE
                                              [PTE now cached in TLB]
              try_to_unmap_one()
              ==> ptep_get_and_clear()
              ==> set_tlb_ubc_flush_pending()
                                              mprotect(addr, PROT_READ)
                                              ==> change_pte_range()
                                              ==> [ PTE non-present - no flush ]
      
                                              user writes using cached RW PTE
              ...
      
              try_to_unmap_flush()
      
      The same type of race exists for reads when protecting for PROT_NONE and
      also exists for operations that can leave an old TLB entry behind such
      as munmap, mremap and madvise.
      
      For some operations like mprotect, it's not necessarily a data integrity
      issue but it is a correctness issue as there is a window where an
      mprotect that limits access still allows access.  For munmap, it's
      potentially a data integrity issue although the race is massive as an
      munmap, mmap and return to userspace must all complete between the
      window when reclaim drops the PTL and flushes the TLB.  However, it's
      theoretically possible, so handle this issue by flushing the mm if
      reclaim is potentially currently batching TLB flushes.
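
      A hedged sketch of the shape of the fix; the mm_struct field name is an
      assumption, the point being that affected operations first drain any
      flush that reclaim has batched but not yet issued:

      /* sketch: called by mprotect/munmap/madvise paths before relying on the TLB state */
      void flush_tlb_batched_pending(struct mm_struct *mm)
      {
              if (mm->tlb_flush_batched) {
                      flush_tlb_mm(mm);               /* flush stale entries left by reclaim */
                      mm->tlb_flush_batched = false;
              }
      }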
      
      Other instances where a flush is required for a present pte should be ok
      as either the page lock is held preventing parallel reclaim or a page
      reference count is elevated preventing a parallel free leading to
      corruption.  In the case of page_mkclean there isn't an obvious path
      that userspace could take advantage of without using the operations that
      are guarded by this patch.  Other users such as gup, when racing with
      reclaim, look just at PTEs.  huge page variants should be ok as they
      don't race with reclaim.  mincore only looks at PTEs.  userfault also
      should be ok as if a parallel reclaim takes place, it will either fault
      the page back in or read some of the data before the flush occurs
      triggering a fault.
      
      Note that a variant of this patch was acked by Andy Lutomirski but this
      was for the x86 parts on top of his PCID work which didn't make the 4.13
      merge window as expected.  His ack is dropped from this version and
      there will be a follow-on patch on top of PCID that will include his
      ack.
      
      [akpm@linux-foundation.org: tweak comments]
      [akpm@linux-foundation.org: fix spello]
      Link: http://lkml.kernel.org/r/20170717155523.emckq2esjro6hf3z@suse.de
      Reported-by: Nadav Amit <nadav.amit@gmail.com>
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: <stable@vger.kernel.org>	[v4.4+]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      3ea27719
  14. 13 July 2017, 1 commit
    • mm, tree wide: replace __GFP_REPEAT by __GFP_RETRY_MAYFAIL with more useful semantic · dcda9b04
      Committed by Michal Hocko
      __GFP_REPEAT was designed to allow retry-but-eventually-fail semantic to
      the page allocator.  This has been true but only for allocation
      requests larger than PAGE_ALLOC_COSTLY_ORDER.  It has always been
      ignored for smaller sizes.  This is a bit unfortunate because there is
      no way to express the same semantic for those requests and they are
      considered too important to fail so they might end up looping in the
      page allocator for ever, similarly to GFP_NOFAIL requests.
      
      Now that the whole tree has been cleaned up and accidental or misled
      usage of __GFP_REPEAT flag has been removed for !costly requests we can
      give the original flag a better name and more importantly a more useful
      semantic.  Let's rename it to __GFP_RETRY_MAYFAIL which tells the user
      that the allocator would try really hard but there is no promise of a
      success.  This will work independent of the order and overrides the
      default allocator behavior.  Page allocator users have several levels of
      guarantee vs.  cost options (take GFP_KERNEL as an example)
      
       - GFP_KERNEL & ~__GFP_RECLAIM - optimistic allocation without _any_
         attempt to free memory at all. The most lightweight mode, which
         doesn't even kick background reclaim. Should be used carefully
         because it might deplete the memory and the next user might hit the
         more aggressive reclaim.

       - GFP_KERNEL & ~__GFP_DIRECT_RECLAIM (or GFP_NOWAIT) - optimistic
         allocation without any attempt to free memory from the current
         context, but it can wake kswapd to reclaim memory if the zone is
         below the low watermark. Can be used from atomic contexts or when
         the request is a performance optimization and there is another
         fallback for a slow path.

       - (GFP_KERNEL|__GFP_HIGH) & ~__GFP_DIRECT_RECLAIM (aka GFP_ATOMIC) -
         non-sleeping allocation with an expensive fallback so it can access
         some portion of memory reserves. Usually used from interrupt/bh
         context with an expensive slow path fallback.
      
       - GFP_KERNEL - both background and direct reclaim are allowed and the
         _default_ page allocator behavior is used. That means that !costly
         allocation requests are basically nofail but there is no guarantee of
         that behavior so failures have to be checked properly by callers
         (e.g. OOM killer victim is allowed to fail currently).
      
       - GFP_KERNEL | __GFP_NORETRY - overrides the default allocator behavior
         and all allocation requests fail early rather than cause disruptive
         reclaim (one round of reclaim in this implementation). The OOM killer
         is not invoked.
      
       - GFP_KERNEL | __GFP_RETRY_MAYFAIL - overrides the default allocator
         behavior and all allocation requests try really hard. The request
         will fail if the reclaim cannot make any progress. The OOM killer
         won't be triggered.
      
       - GFP_KERNEL | __GFP_NOFAIL - overrides the default allocator behavior
         and all allocation requests will loop endlessly until they succeed.
         This might be really dangerous especially for larger orders.
      
      Existing users of __GFP_REPEAT are changed to __GFP_RETRY_MAYFAIL
      because that was already their semantic.  No new users are added.
      __alloc_pages_slowpath is changed to bail out for __GFP_RETRY_MAYFAIL if
      there is no progress and we have already passed the OOM point.

      This means that all the reclaim opportunities have been exhausted except
      the most disruptive one (the OOM killer), and a user-defined fallback
      behavior is more sensible than to keep retrying in the page allocator.
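
      As a rough illustration (not part of the original patch; the helper
      name is made up), a caller wanting the "try hard but accept failure"
      behavior with its own fallback might look like this:

          #include <linux/slab.h>
          #include <linux/vmalloc.h>

          /*
           * Try hard for a physically contiguous buffer, but accept failure
           * and fall back to vmalloc() instead of looping in the page
           * allocator or invoking the OOM killer.
           */
          static void *example_alloc_big(size_t size)
          {
                  void *buf;

                  buf = kmalloc(size, GFP_KERNEL | __GFP_RETRY_MAYFAIL);
                  if (!buf)
                          buf = vmalloc(size);    /* user-defined fallback */
                  return buf;
          }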
      
      [akpm@linux-foundation.org: fix arch/sparc/kernel/mdesc.c]
      [mhocko@suse.com: semantic fix]
        Link: http://lkml.kernel.org/r/20170626123847.GM11534@dhcp22.suse.cz
      [mhocko@kernel.org: address other thing spotted by Vlastimil]
        Link: http://lkml.kernel.org/r/20170626124233.GN11534@dhcp22.suse.cz
      Link: http://lkml.kernel.org/r/20170623085345.11304-3-mhocko@kernel.org
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Alex Belits <alex.belits@cavium.com>
      Cc: Chris Wilson <chris@chris-wilson.co.uk>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Darrick J. Wong <darrick.wong@oracle.com>
      Cc: David Daney <david.daney@cavium.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: NeilBrown <neilb@suse.com>
      Cc: Ralf Baechle <ralf@linux-mips.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      dcda9b04
  15. 09 May 2017, 3 commits
    • V
      mm, compaction: finish whole pageblock to reduce fragmentation · baf6a9a1
      Committed by Vlastimil Babka
      The main goal of direct compaction is to form a high-order page for
      allocation, but it should also help against long-term fragmentation when
      possible.
      
      Most lower-than-pageblock-order compactions are for non-movable
      allocations, which means that if we compact in a movable pageblock and
      terminate as soon as we create the high-order page, it's unlikely that
      the fallback heuristics will claim the whole block.  Instead there might
      be a single unmovable page in a pageblock full of movable pages, and the
      next unmovable allocation might pick another pageblock and increase
      long-term fragmentation.
      
      To help against such scenarios, this patch changes the termination
      criteria for compaction so that the current pageblock is finished even
      though the high-order page already exists.  Note that it might be
      possible that the high-order page formed elsewhere in the zone due to
      parallel activity, but this patch doesn't try to detect that.
      
      This is only done with sync compaction, because async compaction is
      limited to pageblocks of the same migratetype, where it cannot result in
      a migratetype fallback.  (Async compaction also eagerly skips
      order-aligned blocks where isolation fails, which is against the goal of
      migrating away as much of the pageblock as possible.)
      
      As a result of this patch, long-term memory fragmentation should be
      reduced.
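
      A sketch of the changed termination rule (illustrative only; the helper
      and its parameters are simplified, not the upstream code):

          /*
           * finish_block_end_pfn is the end of the pageblock that was being
           * migrated when a suitable high-order page first appeared; 0 means
           * no such page has been seen yet.
           */
          static bool example_should_finish(unsigned long migrate_pfn,
                                            unsigned long finish_block_end_pfn,
                                            bool is_sync)
          {
                  if (!finish_block_end_pfn)
                          return false;           /* no high-order page yet */
                  if (!is_sync)
                          return true;            /* async compaction stops at once */
                  /* sync compaction keeps migrating up to the pageblock boundary */
                  return migrate_pfn >= finish_block_end_pfn;
          }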
      
      In testing based on a 4.9 kernel with stress-highalloc from mmtests
      configured for order-4 GFP_KERNEL allocations, this patch reduced the
      number of unmovable allocations falling back to movable pageblocks by
      20%.
      
      Link: http://lkml.kernel.org/r/20170307131545.28577-9-vbabka@suse.cz
      Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
      Acked-by: Mel Gorman <mgorman@techsingularity.net>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      baf6a9a1
    • V
      mm, compaction: add migratetype to compact_control · d39773a0
      Committed by Vlastimil Babka
      Preparation patch.  We are going to need migratetype at lower layers
      than compact_zone() and compact_finished().
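
      A minimal sketch of the idea (the struct name and the init helper are
      illustrative; only gfpflags_to_migratetype() is an existing API):

          /* compact_control additionally carries the request's migratetype */
          struct example_compact_control {
                  /* ... existing scanner state ... */
                  int migratetype;        /* derived from the allocation's gfp mask */
          };

          static void example_init_cc(struct example_compact_control *cc,
                                      gfp_t gfp_mask)
          {
                  cc->migratetype = gfpflags_to_migratetype(gfp_mask);
          }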
      
      Link: http://lkml.kernel.org/r/20170307131545.28577-7-vbabka@suse.cz
      Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
      Acked-by: Mel Gorman <mgorman@techsingularity.net>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      d39773a0
    • V
      mm, compaction: reorder fields in struct compact_control · f25ba6dc
      Committed by Vlastimil Babka
      Patch series "try to reduce fragmenting fallbacks", v3.
      
      Last year, Johannes Weiner reported a regression in page mobility
      grouping [1] and, while the exact cause was not found, I've come up with
      some ways to improve it by reducing the number of allocations falling
      back to a different migratetype and causing permanent fragmentation.
      
      The series was tested with mmtests stress-highalloc modified to do
      GFP_KERNEL order-4 allocations, on 4.9 with "mm, vmscan: fix zone
      balance check in prepare_kswapd_sleep" (without that, kcompactd indeed
      wasn't woken up) on a UMA machine with 4GB memory.  There were 5 repeats
      of each run, as the extfrag stats are quite volatile (note the stats
      below are sums, not averages, as it was less perl hacking for me).
      
      Success rates are the same, already high due to the low allocation order
      used, so I'm not including them.
      
      Compaction stats:
      (the patches are stacked, and I haven't measured the non-functional-changes
      patches separately)
      
                                           patch 1     patch 2     patch 3     patch 4     patch 7     patch 8
        Compaction stalls                    22449       24680       24846       19765       22059       17480
        Compaction success                   12971       14836       14608       10475       11632        8757
        Compaction failures                   9477        9843       10238        9290       10426        8722
        Page migrate success               3109022     3370438     3312164     1695105     1608435     2111379
        Page migrate failure                911588     1149065     1028264     1112675     1077251     1026367
        Compaction pages isolated          7242983     8015530     7782467     4629063     4402787     5377665
        Compaction migrate scanned       980838938   987367943   957690188   917647238   947155598  1018922197
        Compaction free scanned          557926893   598946443   602236894   594024490   541169699   763651731
        Compaction cost                      10243       10578       10304        8286        8398        9440
      
      Compaction stats are mostly within noise until patch 4, which decreases
      the number of compactions, and migrations.  Part of that could be due to
      more pageblocks marked as unmovable, and async compaction skipping
      those.  This changes a bit with patch 7, but not so much.  Patch 8
      increases free scanner stats and migrations, which comes from the
      changed termination criteria.  Interestingly, the number of compactions
      decreases - probably because the fully compacted pageblock satisfies
      multiple subsequent allocations, so the extra work amortizes.
      
      Next comes the extfrag tracepoint, where "fragmenting" means that an
      allocation had to fallback to a pageblock of another migratetype which
      wasn't fully free (which is almost all of the fallbacks).  I have
      locally added another tracepoint for "Page steal" into
      steal_suitable_fallback() which triggers in situations where we are
      allowed to do move_freepages_block().  If we decide to also do
      set_pageblock_migratetype(), it's "Pages steal with pageblock", with a
      breakdown of which allocation migratetype we are stealing for and from
      which fallback migratetype.  The last part, "due to counting", comes from
      patch 4 and counts the events where the counting of movable pages
      allowed us to change pageblock's migratetype, while the number of free
      pages alone wouldn't be enough to cross the threshold.
      
                                                             patch 1     patch 2     patch 3     patch 4     patch 7     patch 8
        Page alloc extfrag event                            10155066     8522968    10164959    15622080    13727068    13140319
        Extfrag fragmenting                                 10149231     8517025    10159040    15616925    13721391    13134792
        Extfrag fragmenting for unmovable                     159504      168500      184177       97835       70625       56948
        Extfrag fragmenting unmovable placed with movable     153613      163549      172693       91740       64099       50917
        Extfrag fragmenting unmovable placed with reclaim.      5891        4951       11484        6095        6526        6031
        Extfrag fragmenting for reclaimable                     4738        4829        6345        4822        5640        5378
        Extfrag fragmenting reclaimable placed with movable     1836        1902        1851        1579        1739        1760
        Extfrag fragmenting reclaimable placed with unmov.      2902        2927        4494        3243        3901        3618
        Extfrag fragmenting for movable                      9984989     8343696     9968518    15514268    13645126    13072466
        Pages steal                                           179954      192291      210880      123254       94545       81486
        Pages steal with pageblock                             22153       18943       20154       33562       29969       33444
        Pages steal with pageblock for unmovable               14350       12858       13256       20660       19003       20852
        Pages steal with pageblock for unmovable from mov.     12812       11402       11683       19072       17467       19298
        Pages steal with pageblock for unmovable from recl.     1538        1456        1573        1588        1536        1554
        Pages steal with pageblock for movable                  7114        5489        5965       11787       10012       11493
        Pages steal with pageblock for movable from unmov.      6885        5291        5541       11179        9525       10885
        Pages steal with pageblock for movable from recl.        229         198         424         608         487         608
        Pages steal with pageblock for reclaimable               689         596         933        1115         954        1099
        Pages steal with pageblock for reclaimable from unmov.   273         219         537         658         547         667
        Pages steal with pageblock for reclaimable from mov.     416         377         396         457         407         432
        Pages steal with pageblock due to counting                                                 11834       10075        7530
        ... for unmovable                                                                           8993        7381        4616
        ... for movable                                                                             2792        2653        2851
        ... for reclaimable                                                                           49          41          63
      
      What we can see is that "Extfrag fragmenting for unmovable" and "...
      placed with movable" drop with almost every patch, which is good as we
      are polluting fewer movable pageblocks with unmovable pages.
      
      The most significant change is patch 4 with movable page counting.  On
      the other hand it increases "Extfrag fragmenting for movable" by 50%.
      "Pages steal" drops though, so these movable allocation fallbacks find
      only small free pages and are not allowed to steal whole pageblocks
      back.  "Pages steal with pageblock" rises, because the patch increases
      the chances of pageblock migratetype changes happening.  This affects
      all migratetypes.
      
      The summary is that patch 4 is not a clear win wrt these stats, but I
      believe that the tradeoff it makes is a good one.  There's less
      pollution of movable pageblocks by unmovable allocations.  There's less
      stealing between pageblocks, and the steals that remain have a higher
      chance of also changing the migratetype of the pageblock itself, so the
      pageblock migratetype should more faithfully reflect the migratetype of
      the pages within it.  The increase of movable allocations falling back
      to unmovable pageblocks might look dramatic, but those allocations can
      be migrated by compaction when needed, and other patches in the series
      (7-9) improve that aspect.
      
      Patches 7 and 8 continue the trend of reduced unmovable fallbacks and
      also reduce the impact on movable fallbacks from patch 4.
      
      [1] https://www.spinics.net/lists/linux-mm/msg114237.html
      
      This patch (of 8):
      
      Currently there are (mostly by accident) no holes in struct
      compact_control on x86_64, but we are going to add more bool flags, so
      place them all together at the end of the structure.  While at it, just
      order all fields from largest to smallest.
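
      A generic illustration of why the ordering matters on a 64-bit ABI
      (the struct names are made up, the field names are borrowed from
      compact_control; pahole can be used to inspect the real layout):

          struct bad_layout {
                  bool ignore_skip_hint;          /* 1 byte + 7 bytes padding */
                  unsigned long free_pfn;         /* 8 bytes */
                  bool direct_compaction;         /* 1 byte + 7 bytes padding */
                  unsigned long migrate_pfn;      /* 8 bytes */
          };                                      /* 32 bytes total */

          struct good_layout {
                  unsigned long free_pfn;         /* 8 bytes */
                  unsigned long migrate_pfn;      /* 8 bytes */
                  bool ignore_skip_hint;          /* 1 byte */
                  bool direct_compaction;         /* 1 byte + 6 bytes tail padding */
          };                                      /* 24 bytes total */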
      
      Link: http://lkml.kernel.org/r/20170307131545.28577-2-vbabka@suse.cz
      Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
      Acked-by: Mel Gorman <mgorman@techsingularity.net>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      f25ba6dc
  16. 04 May 2017, 3 commits
    • X
      mm: use is_migrate_highatomic() to simplify the code · a6ffdc07
      Committed by Xishi Qiu
      Introduce two helpers, is_migrate_highatomic() and is_migrate_highatomic_page().
      
      Simplify the code, no functional changes.
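
      A sketch of what such static-inline helpers look like (close to, but
      not necessarily identical to, the committed version):

          static inline bool is_migrate_highatomic(enum migratetype migratetype)
          {
                  return migratetype == MIGRATE_HIGHATOMIC;
          }

          static inline bool is_migrate_highatomic_page(struct page *page)
          {
                  return get_pageblock_migratetype(page) == MIGRATE_HIGHATOMIC;
          }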
      
      [akpm@linux-foundation.org: use static inlines rather than macros, per mhocko]
      Link: http://lkml.kernel.org/r/58B94F15.6060606@huawei.com
      Signed-off-by: Xishi Qiu <qiuxishi@huawei.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      a6ffdc07
    • J
      mm: delete NR_PAGES_SCANNED and pgdat_reclaimable() · c822f622
      Committed by Johannes Weiner
      NR_PAGES_SCANNED counts number of pages scanned since the last page free
      event in the allocator.  This was used primarily to measure the
      reclaimability of zones and nodes, and determine when reclaim should
      give up on them.  In that role, it has been replaced in the preceding
      patches by a different mechanism.
      
      Being implemented as an efficient vmstat counter, it was automatically
      exported to userspace as well.  It's however unlikely that anyone
      outside the kernel is using this counter in any meaningful way.
      
      Remove the counter and the unused pgdat_reclaimable().
      
      Link: http://lkml.kernel.org/r/20170228214007.5621-8-hannes@cmpxchg.org
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Jia He <hejianet@gmail.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      c822f622
    • J
      mm: fix 100% CPU kswapd busyloop on unreclaimable nodes · c73322d0
      Committed by Johannes Weiner
      Patch series "mm: kswapd spinning on unreclaimable nodes - fixes and
      cleanups".
      
      Jia reported a scenario in which the kswapd of a node indefinitely spins
      at 100% CPU usage.  We have seen similar cases at Facebook.
      
      The kernel's current method of judging its ability to reclaim a node (or
      whether to back off and sleep) is based on the amount of scanned pages
      in proportion to the amount of reclaimable pages.  In Jia's and our
      scenarios, there are no reclaimable pages in the node, however, and the
      condition for backing off is never met.  Kswapd busyloops in an attempt
      to restore the watermarks while having nothing to work with.
      
      This series reworks the definition of an unreclaimable node based not on
      scanning but on whether kswapd is able to actually reclaim pages in
      MAX_RECLAIM_RETRIES (16) consecutive runs.  This is the same criteria
      the page allocator uses for giving up on direct reclaim and invoking the
      OOM killer.  If it cannot free any pages, kswapd will go to sleep and
      leave further attempts to direct reclaim invocations, which will either
      make progress and re-enable kswapd, or invoke the OOM killer.
      
      Patch #1 fixes the immediate problem Jia reported, the remainder are
      smaller fixlets, cleanups, and overall phasing out of the old method.
      
      Patch #6 is the odd one out.  It's a nice cleanup to get_scan_count(),
      and directly related to #5, but in itself not relevant to the series.
      
      If the whole series is too ambitious for 4.11, I would consider the
      first three patches fixes, the rest cleanups.
      
      This patch (of 9):
      
      Jia He reports a problem with kswapd spinning at 100% CPU when
      requesting more hugepages than memory available in the system:
      
      $ echo 4000 >/proc/sys/vm/nr_hugepages
      
      top - 13:42:59 up  3:37,  1 user,  load average: 1.09, 1.03, 1.01
      Tasks:   1 total,   1 running,   0 sleeping,   0 stopped,   0 zombie
      %Cpu(s):  0.0 us, 12.5 sy,  0.0 ni, 85.5 id,  2.0 wa,  0.0 hi,  0.0 si,  0.0 st
      KiB Mem:  31371520 total, 30915136 used,   456384 free,      320 buffers
      KiB Swap:  6284224 total,   115712 used,  6168512 free.    48192 cached Mem
      
        PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
         76 root      20   0       0      0      0 R 100.0 0.000 217:17.29 kswapd3
      
      At that time, there are no reclaimable pages left in the node, but as
      kswapd fails to restore the high watermarks it refuses to go to sleep.
      
      Kswapd needs to back away from nodes that fail to balance.  Up until
      commit 1d82de61 ("mm, vmscan: make kswapd reclaim in terms of
      nodes") kswapd had such a mechanism.  It considered zones whose
      theoretically reclaimable pages it had reclaimed six times over as
      unreclaimable and backed away from them.  This guard was erroneously
      removed as the patch changed the definition of a balanced node.
      
      However, simply restoring this code wouldn't help in the case reported
      here: there *are* no reclaimable pages that could be scanned until the
      threshold is met.  Kswapd would stay awake anyway.
      
      Introduce a new and much simpler way of backing off.  If kswapd runs
      through MAX_RECLAIM_RETRIES (16) cycles without reclaiming a single
      page, make it back off from the node.  This is the same number of shots
      direct reclaim takes before declaring OOM.  Kswapd will go to sleep on
      that node until a direct reclaimer manages to reclaim some pages, thus
      proving the node reclaimable again.
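
      A simplified sketch of the back-off rule (the failure counter is the
      one introduced by this series; the helper itself is illustrative):

          static bool example_kswapd_should_back_off(pg_data_t *pgdat,
                                                     unsigned long nr_reclaimed)
          {
                  if (nr_reclaimed)
                          pgdat->kswapd_failures = 0;     /* progress resets the count */
                  else
                          pgdat->kswapd_failures++;       /* another fruitless run */

                  /* same number of shots direct reclaim gets before declaring OOM */
                  return pgdat->kswapd_failures >= MAX_RECLAIM_RETRIES;
          }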
      
      [hannes@cmpxchg.org: check kswapd failure against the cumulative nr_reclaimed count]
        Link: http://lkml.kernel.org/r/20170306162410.GB2090@cmpxchg.org
      [shakeelb@google.com: fix condition for throttle_direct_reclaim]
        Link: http://lkml.kernel.org/r/20170314183228.20152-1-shakeelb@google.com
      Link: http://lkml.kernel.org/r/20170228214007.5621-2-hannes@cmpxchg.org
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: Shakeel Butt <shakeelb@google.com>
      Reported-by: Jia He <hejianet@gmail.com>
      Tested-by: Jia He <hejianet@gmail.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com>
      Acked-by: Minchan Kim <minchan@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      c73322d0
  17. 08 Apr 2017, 1 commit
  18. 25 Feb 2017, 1 commit
  19. 23 Feb 2017, 3 commits
  20. 26 Dec 2016, 1 commit
    • N
      mm: add PageWaiters indicating tasks are waiting for a page bit · 62906027
      Committed by Nicholas Piggin
      Add a new page flag, PageWaiters, to indicate the page waitqueue has
      tasks waiting. This can be tested rather than testing waitqueue_active
      which requires another cacheline load.
      
      This bit is always set when the page has tasks on page_waitqueue(page),
      and is set and cleared under the waitqueue lock. It may be set when
      there are no tasks on the waitqueue, which will cause a harmless extra
      wakeup check that then clears the bit.
      
      The generic bit-waitqueue infrastructure is no longer used for pages.
      Instead, waitqueues are used directly with a custom key type. The
      generic code was not flexible enough to have PageWaiters manipulation
      under the waitqueue lock (which simplifies concurrency).
      
      This improves the performance of page lock intensive microbenchmarks by
      2-3%.
      
      Putting two bits in the same word opens the opportunity to remove the
      memory barrier between clearing the lock bit and testing the waiters
      bit, after some work on the arch primitives (e.g., ensuring memory
      operand widths match and cover both bits).
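
      A simplified sketch of the fast path this enables (the wake helper is
      a placeholder for the real slow-path function; upstream can further
      combine the clear and the waiters test as noted above):

          void example_unlock_page(struct page *page)
          {
                  clear_bit_unlock(PG_locked, &page->flags);
                  /* order the lock-bit clear against the waiters test */
                  smp_mb__after_atomic();
                  if (PageWaiters(page))
                          example_wake_waiters(page, PG_locked);  /* slow path only */
          }
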
      Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Bob Peterson <rpeterso@redhat.com>
      Cc: Steven Whitehouse <swhiteho@redhat.com>
      Cc: Andrew Lutomirski <luto@kernel.org>
      Cc: Andreas Gruenbacher <agruenba@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      62906027
  21. 15 Dec 2016, 2 commits
  22. 08 Oct 2016, 2 commits
    • V
      mm, compaction: make full priority ignore pageblock suitability · 9f7e3387
      Committed by Vlastimil Babka
      Several people have reported premature OOMs for order-2 allocations
      (stack) due to OOM rework in 4.7.  In the scenario (parallel kernel
      build and dd writing to two drives) many pageblocks get marked as
      Unmovable and compaction free scanner struggles to isolate free pages.
      Joonsoo Kim pointed out that the free scanner skips pageblocks that are
      not movable to prevent filling them and forcing non-movable allocations
      to fall back to other pageblocks.  Such a heuristic makes sense to help
      prevent long-term fragmentation, but premature OOMs are the relatively
      more urgent problem.  As a compromise, this patch disables the heuristic
      only for the ultimate compaction priority.
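
      An illustrative sketch of the free scanner check with the heuristic
      disabled at the highest priority (simplified; the flag name mirrors
      the idea rather than the exact upstream field):

          static bool example_suitable_free_target(struct page *page,
                                                   bool ignore_block_suitable)
          {
                  if (ignore_block_suitable)
                          return true;    /* ultimate priority: any free block will do */

                  /*
                   * Default heuristic: only pull free pages from movable
                   * pageblocks (upstream also accepts CMA blocks).
                   */
                  return get_pageblock_migratetype(page) == MIGRATE_MOVABLE;
          }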
      
      Link: http://lkml.kernel.org/r/20160906135258.18335-5-vbabka@suse.cz
      Reported-by: Ralf-Peter Rohbeck <Ralf-Peter.Rohbeck@quantum.com>
      Reported-by: Arkadiusz Miskiewicz <a.miskiewicz@gmail.com>
      Reported-by: Olaf Hering <olaf@aepfle.de>
      Suggested-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      9f7e3387
    • V
      mm, compaction: make whole_zone flag ignore cached scanner positions · 06ed2998
      Committed by Vlastimil Babka
      Patch series "make direct compaction more deterministic")
      
      This is mostly a followup to Michal's oom detection rework, which
      highlighted the need for direct compaction to provide better feedback in
      reclaim/compaction loop, so that it can reliably recognize when
      compaction cannot make further progress, and allocation should invoke
      OOM killer or fail.  We've discussed this at LSF/MM [1] where I proposed
      expanding the async/sync migration mode used in compaction to more
      general "priorities".  This patchset adds one new priority that just
      overrides all the heuristics and makes compaction fully scan all zones.
      I don't currently think that we need more fine-grained priorities, but
      we'll see.  Other than that there's some smaller fixes and cleanups,
      mainly related to the THP-specific hacks.
      
      I've tested this with stress-highalloc in GFP_KERNEL order-4 and
      THP-like order-9 scenarios.  There's some improvement for compaction
      stats for the order-4, which is likely due to the better watermarks
      handling.  In the previous version I reported mostly noise wrt
      compaction stats, and decreased direct reclaim - now the reclaim is
      without difference.  I believe this is due to the less aggressive
      compaction priority increase in patch 6.
      
      "before" is a mmotm tree prior to 4.7 release plus the first part of the
      series that was sent and merged separately
      
                                          before        after
      order-4:
      
      Compaction stalls                    27216       30759
      Compaction success                   19598       25475
      Compaction failures                   7617        5283
      Page migrate success                370510      464919
      Page migrate failure                 25712       27987
      Compaction pages isolated           849601     1041581
      Compaction migrate scanned       143146541   101084990
      Compaction free scanned          208355124   144863510
      Compaction cost                       1403        1210
      
      order-9:
      
      Compaction stalls                     7311        7401
      Compaction success                    1634        1683
      Compaction failures                   5677        5718
      Page migrate success                194657      183988
      Page migrate failure                  4753        4170
      Compaction pages isolated           498790      456130
      Compaction migrate scanned          565371      524174
      Compaction free scanned            4230296     4250744
      Compaction cost                        215         203
      
      [1] https://lwn.net/Articles/684611/
      
      This patch (of 11):
      
      A recent patch has added a whole_zone flag that compaction sets when
      scanning starts from the zone boundary, in order to report that the zone
      has been fully scanned in one attempt.  For allocations that want to try
      really hard or cannot fail, we will want to introduce a mode where
      scanning the whole zone is guaranteed regardless of the cached positions.

      This patch reuses the whole_zone flag so that, if it is already passed
      in as true to compaction, the cached scanner positions are ignored.
      Employing this flag during the reclaim/compaction loop will be done in
      the next patch.  This patch, however, converts compaction invoked from
      userspace via procfs to use this flag.  Before this patch, the cached
      positions were first reset to zone boundaries and then read back from
      struct zone, so there was a window where a parallel compaction could
      replace the reset values, making the manual compaction less effective.
      Using the flag instead of performing the reset is more robust.
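
      A simplified sketch of the scanner start-up logic described above
      (illustrative; the real code also rounds to pageblock boundaries):

          static void example_init_scanners(struct zone *zone,
                                            struct compact_control *cc,
                                            bool sync)
          {
                  if (cc->whole_zone) {
                          /* full scan requested: ignore the cached positions */
                          cc->migrate_pfn = zone->zone_start_pfn;
                          cc->free_pfn = zone_end_pfn(zone) - 1;
                  } else {
                          /* normal case: resume from the cached scanner positions */
                          cc->migrate_pfn = zone->compact_cached_migrate_pfn[sync];
                          cc->free_pfn = zone->compact_cached_free_pfn;
                  }
          }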
      
      [akpm@linux-foundation.org: coding-style fixes]
      Link: http://lkml.kernel.org/r/20160810091226.6709-2-vbabka@suse.cz
      Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
      Tested-by: Lorenzo Stoakes <lstoakes@gmail.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Rik van Riel <riel@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      06ed2998
  23. 29 Jul 2016, 3 commits