1. 15 January 2022 (1 commit)
    • mm_zone: add function to check if managed dma zone exists · 62b31070
      Committed by Baoquan He
      Patch series "Handle warning of allocation failure on DMA zone w/o
      managed pages", v4.
      
      **Problem observed:
      On x86_64, when crash is triggered and entering into kdump kernel, page
      allocation failure can always be seen.
      
       ---------------------------------
       DMA: preallocated 128 KiB GFP_KERNEL pool for atomic allocations
       swapper/0: page allocation failure: order:5, mode:0xcc1(GFP_KERNEL|GFP_DMA), nodemask=(null),cpuset=/,mems_allowed=0
       CPU: 0 PID: 1 Comm: swapper/0
       Call Trace:
        dump_stack+0x7f/0xa1
        warn_alloc.cold+0x72/0xd6
        ......
        __alloc_pages+0x24d/0x2c0
        ......
        dma_atomic_pool_init+0xdb/0x176
        do_one_initcall+0x67/0x320
        ? rcu_read_lock_sched_held+0x3f/0x80
        kernel_init_freeable+0x290/0x2dc
        ? rest_init+0x24f/0x24f
        kernel_init+0xa/0x111
        ret_from_fork+0x22/0x30
       Mem-Info:
       ------------------------------------
      
      **Root cause:
      The current kernel assumes that the DMA zone has managed pages and
      tries to request pages from it if CONFIG_ZONE_DMA is enabled, but this
      is not always true.  E.g. in the kdump kernel on x86_64, only the low
      1M is present and it is locked down at a very early stage of boot, so
      the low 1M is never added to the buddy allocator and never becomes
      managed pages of the DMA zone.  In that case, any page allocation
      request from the DMA zone will fail.
      
      **Investigation:
      This failure has happened since the commits below were merged into
      Linus's tree:
        1a6a9044 x86/setup: Remove CONFIG_X86_RESERVE_LOW and reservelow= options
        23721c8e x86/crash: Remove crash_reserve_low_1M()
        f1d4d47c x86/setup: Always reserve the first 1M of RAM
        7c321eb2 x86/kdump: Remove the backup region handling
        6f599d84 x86/kdump: Always reserve the low 1M when the crashkernel option is specified
      
      Before those commits, on x86_64 the low 640K area was reused by the
      kdump kernel: the content of the low 640K was copied into a backup
      region for dumping before jumping into the kdump kernel.  Then, except
      for the firmware-reserved regions within [0, 640K], the remaining area
      was added to the buddy allocator and became available managed pages of
      the DMA zone.

      However, after the above commits, in the kdump kernel on x86_64 the
      low 1M is reserved by memblock but never released to the buddy
      allocator, so any later page allocation requested from the DMA zone
      will fail.
      
      Initially, the low 1M had to be locked down when crashkernel was
      reserved because AMD SME encrypts memory, making the old backup-region
      mechanism impossible when switching into the kdump kernel.

      Later, BIOSes were also observed corrupting memory under 1M.  To solve
      this, commit f1d4d47c always reserves the entire low 1M once the real
      mode trampoline has been allocated.

      Besides, an Intel engineer recently mentioned that TDX (Trust Domain
      Extensions), which is under development in the kernel, also needs to
      lock down the low 1M.  So we can't simply revert the above commits to
      fix the page allocation failure from the DMA zone, as someone
      suggested.
      
      **Solution:
      Currently, only the DMA atomic pool and dma-kmalloc initialize and
      request page allocations with GFP_DMA during boot.

      So only initialize the DMA atomic pool when the DMA zone has available
      managed pages; otherwise skip the initialization.
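      A minimal sketch of that guard as it could look in kernel/dma/pool.c
      (the pool and helper names below reflect my reading of that file and
      of this log rather than the later patches themselves, so treat them
      as assumptions):

        /* Sketch: create the GFP_DMA pool only when ZONE_DMA has managed pages. */
        static int __init dma_atomic_pool_init(void)
        {
                int ret = 0;

                atomic_pool_kernel = __dma_atomic_pool_init(atomic_pool_size,
                                                            GFP_KERNEL);
                if (!atomic_pool_kernel)
                        ret = -ENOMEM;

                if (IS_ENABLED(CONFIG_ZONE_DMA) && has_managed_dma()) {
                        atomic_pool_dma = __dma_atomic_pool_init(atomic_pool_size,
                                                                 GFP_DMA);
                        if (!atomic_pool_dma)
                                ret = -ENOMEM;
                }

                return ret;
        }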
      
      For dma-kmalloc(), for the time being, mute the warning of allocation
      failure when pages are requested from a DMA zone that has no managed
      pages.  Meanwhile, change callers to use the dma_alloc_xx/dma_map_xx
      APIs in place of kmalloc(GFP_DMA), or drop GFP_DMA from kmalloc()
      calls where it is not necessary.  Christoph is posting patches to fix
      the callers under drivers/scsi/.  Eventually dma-kmalloc() can be
      removed entirely, as people have suggested.
      
      This patch (of 3):
      
      Some places in the current kernel assume that the DMA zone must have
      managed pages if CONFIG_ZONE_DMA is enabled, but this is not always
      true.  E.g. in the kdump kernel on x86_64, only the low 1M is present
      and it is locked down at a very early stage of boot, so there are no
      managed pages at all in the DMA zone.  In that case, any page
      allocation request from the DMA zone will fail.
      
      Add the function has_managed_dma() and the relevant helpers to check
      whether there is a DMA zone with managed pages.  It will be used in
      later patches.
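      As a rough sketch of what such a helper can look like (assuming the
      existing managed_zone() and for_each_online_pgdat() primitives; the
      exact body may differ from the patch):

        #ifdef CONFIG_ZONE_DMA
        bool has_managed_dma(void)
        {
                struct pglist_data *pgdat;

                for_each_online_pgdat(pgdat) {
                        struct zone *zone = &pgdat->node_zones[ZONE_DMA];

                        /* True only when the zone has pages in the buddy allocator. */
                        if (managed_zone(zone))
                                return true;
                }
                return false;
        }
        #endif /* CONFIG_ZONE_DMA */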
      
      Link: https://lkml.kernel.org/r/20211223094435.248523-1-bhe@redhat.com
      Link: https://lkml.kernel.org/r/20211223094435.248523-2-bhe@redhat.com
      Fixes: 6f599d84 ("x86/kdump: Always reserve the low 1M when the crashkernel option is specified")
      Signed-off-by: Baoquan He <bhe@redhat.com>
      Reviewed-by: David Hildenbrand <david@redhat.com>
      Acked-by: John Donnelly <john.p.donnelly@oracle.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Hyeonggon Yoo <42.hyeyoo@gmail.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: David Laight <David.Laight@ACULAB.COM>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Marek Szyprowski <m.szyprowski@samsung.com>
      Cc: Robin Murphy <robin.murphy@arm.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      62b31070
  2. 01 January 2022 (1 commit)
  3. 07 November 2021 (5 commits)
    • mm/vmscan: throttle reclaim when no progress is being made · 69392a40
      Committed by Mel Gorman
      Memcg reclaim throttles on congestion if no reclaim progress is made.
      This makes little sense; the lack of progress might be due to
      writeback or a host of other factors.
      
      For !memcg reclaim, it's messy.  Direct reclaim is primarily throttled
      in the page allocator if it is failing to make progress.  Kswapd
      throttles if too many pages are under writeback and marked for
      immediate reclaim.
      
      This patch explicitly throttles if reclaim is failing to make progress.
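      A minimal sketch of the shape this takes in mm/vmscan.c, assuming the
      reclaim_throttle() helper and the VMSCAN_THROTTLE_NOPROGRESS reason
      introduced earlier in this series (the wrapper name and exact
      conditions are assumptions):

        static void consider_reclaim_throttle(pg_data_t *pgdat,
                                              struct scan_control *sc)
        {
                /* kswapd has its own writeback-based throttling. */
                if (current_is_kswapd())
                        return;

                /* Any progress at all means no throttling is needed. */
                if (sc->nr_reclaimed)
                        return;

                /* Sleep until another task makes progress or a timeout fires. */
                reclaim_throttle(pgdat, VMSCAN_THROTTLE_NOPROGRESS);
        }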
      
      [vbabka@suse.cz: Remove redundant code]
      
      Link: https://lkml.kernel.org/r/20211022144651.19914-4-mgorman@techsingularity.net
      Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Andreas Dilger <adilger.kernel@dilger.ca>
      Cc: "Darrick J . Wong" <djwong@kernel.org>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: NeilBrown <neilb@suse.de>
      Cc: Rik van Riel <riel@surriel.com>
      Cc: "Theodore Ts'o" <tytso@mit.edu>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      69392a40
    • mm/vmscan: throttle reclaim and compaction when too many pages are isolated · d818fca1
      Committed by Mel Gorman
      Page reclaim throttles on congestion if too many parallel reclaim
      instances have isolated too many pages.  This makes no sense;
      excessive parallelisation has nothing to do with writeback or
      congestion.
      
      This patch creates an additional wait queue to sleep on when too many
      pages are isolated.  The throttled tasks are woken when the number of
      isolated pages is reduced or a timeout occurs.  There may be some
      false positive wakeups for GFP_NOIO/GFP_NOFS callers but the tasks
      will throttle again if necessary.
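      Sketched below is the idea as it might sit in the reclaim path;
      too_many_isolated() already exists, while reclaim_throttle() and the
      VMSCAN_THROTTLE_ISOLATED reason come from this series (the wrapper
      name is hypothetical):

        /* Hypothetical wrapper around the check in shrink_inactive_list(). */
        static bool throttle_if_over_isolated(pg_data_t *pgdat, int file,
                                              struct scan_control *sc)
        {
                if (!too_many_isolated(pgdat, file, sc))
                        return false;

                /* Let dying tasks bail out instead of sleeping. */
                if (fatal_signal_pending(current))
                        return false;

                /* Woken when isolated pages drop or the timeout expires. */
                reclaim_throttle(pgdat, VMSCAN_THROTTLE_ISOLATED);
                return true;
        }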
      
      [shy828301@gmail.com: Wake up from compaction context]
      [vbabka@suse.cz: Account number of throttled tasks only for writeback]
      
      Link: https://lkml.kernel.org/r/20211022144651.19914-3-mgorman@techsingularity.net
      Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Andreas Dilger <adilger.kernel@dilger.ca>
      Cc: "Darrick J . Wong" <djwong@kernel.org>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: NeilBrown <neilb@suse.de>
      Cc: Rik van Riel <riel@surriel.com>
      Cc: "Theodore Ts'o" <tytso@mit.edu>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      d818fca1
    • mm/vmscan: throttle reclaim until some writeback completes if congested · 8cd7c588
      Committed by Mel Gorman
      Patch series "Remove dependency on congestion_wait in mm/", v5.
      
      This series removes all calls to congestion_wait in mm/ and deletes
      wait_iff_congested.  It's not a clever implementation but
      congestion_wait has been broken for a long time [1].
      
      Even if congestion throttling worked, it was never a great idea.
      While excessive dirty/writeback pages at the tail of the LRU is one
      reason reclaim may be slow, there is also the problem of too many
      pages being isolated, and of reclaim failing for other reasons
      (elevated references, excessive LRU contention, etc.).
      
      This series replaces the "congestion" throttling with 3 different types.
      
       - If there are too many dirty/writeback pages, sleep until a timeout or
         enough pages get cleaned
      
       - If too many pages are isolated, sleep until enough isolated pages are
         either reclaimed or put back on the LRU
      
       - If no progress is being made, direct reclaim tasks sleep until
         another task makes progress with acceptable efficiency.
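      The three cases above map naturally onto distinct throttling reasons;
      a sketch of the enum (the three reason names appear in the trace
      output further down, the NR_ terminator is an assumption):

        enum vmscan_throttle_state {
                VMSCAN_THROTTLE_WRITEBACK,      /* dirty/writeback at LRU tail */
                VMSCAN_THROTTLE_ISOLATED,       /* too many pages isolated */
                VMSCAN_THROTTLE_NOPROGRESS,     /* reclaim made no progress */
                NR_VMSCAN_THROTTLE,
        };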
      
      This was initially tested with a mix of workloads that used to trigger
      corner cases that no longer work.  A new test case was created called
      "stutterp" (pagereclaim-stutterp-noreaders in mmtests) using a freshly
      created XFS filesystem.  Note that it may be necessary to increase the
      timeout of ssh if executing remotely as ssh itself can get throttled and
      the connection may timeout.
      
      stutterp varies the number of "worker" processes from 4 up to NR_CPUS*4
      to check the impact as the number of direct reclaimers increase.  It has
      four types of worker.
      
       - One "anon latency" worker creates small mappings with mmap() and
         times how long it takes to fault the mapping reading it 4K at a time
      
       - X file writers which is fio randomly writing X files where the total
         size of the files add up to the allowed dirty_ratio. fio is allowed
         to run for a warmup period to allow some file-backed pages to
         accumulate. The duration of the warmup is based on the best-case
         linear write speed of the storage.
      
       - Y file readers which is fio randomly reading small files
      
       - Z anon memory hogs which continually map (100-dirty_ratio)% of memory
      
       - Total estimated WSS = (100+dirty_ratio) percentage of memory
      
      X+Y+Z+1 == NR_WORKERS varying from 4 up to NR_CPUS*4
      
      The intent is to maximise the total WSS with a mix of file and anon
      memory where some anonymous memory must be swapped and there is a high
      likelihood of dirty/writeback pages reaching the end of the LRU.
      
      The test can be configured to have no background readers to stress
      dirty/writeback pages.  The results below are based on having zero
      readers.
      
      The short summary of the results is that the series works and stalls
      until some event occurs but the timeouts may need adjustment.
      
      The test results are not broken down by patch as the series should be
      treated as one block that replaces a broken throttling mechanism with a
      working one.
      
      Finally, three machines were tested but I'm reporting the worst set of
      results.  The other two machines had much better latencies for example.
      
      First the results of the "anon latency" latency
      
        stutterp
                                      5.15.0-rc1             5.15.0-rc1
                                         vanilla mm-reclaimcongest-v5r4
        Amean     mmap-4      31.4003 (   0.00%)   2661.0198 (-8374.52%)
        Amean     mmap-7      38.1641 (   0.00%)    149.2891 (-291.18%)
        Amean     mmap-12     60.0981 (   0.00%)    187.8105 (-212.51%)
        Amean     mmap-21    161.2699 (   0.00%)    213.9107 ( -32.64%)
        Amean     mmap-30    174.5589 (   0.00%)    377.7548 (-116.41%)
        Amean     mmap-48   8106.8160 (   0.00%)   1070.5616 (  86.79%)
        Stddev    mmap-4      41.3455 (   0.00%)  27573.9676 (-66591.66%)
        Stddev    mmap-7      53.5556 (   0.00%)   4608.5860 (-8505.23%)
        Stddev    mmap-12    171.3897 (   0.00%)   5559.4542 (-3143.75%)
        Stddev    mmap-21   1506.6752 (   0.00%)   5746.2507 (-281.39%)
        Stddev    mmap-30    557.5806 (   0.00%)   7678.1624 (-1277.05%)
        Stddev    mmap-48  61681.5718 (   0.00%)  14507.2830 (  76.48%)
        Max-90    mmap-4      31.4243 (   0.00%)     83.1457 (-164.59%)
        Max-90    mmap-7      41.0410 (   0.00%)     41.0720 (  -0.08%)
        Max-90    mmap-12     66.5255 (   0.00%)     53.9073 (  18.97%)
        Max-90    mmap-21    146.7479 (   0.00%)    105.9540 (  27.80%)
        Max-90    mmap-30    193.9513 (   0.00%)     64.3067 (  66.84%)
        Max-90    mmap-48    277.9137 (   0.00%)    591.0594 (-112.68%)
        Max       mmap-4    1913.8009 (   0.00%) 299623.9695 (-15555.96%)
        Max       mmap-7    2423.9665 (   0.00%) 204453.1708 (-8334.65%)
        Max       mmap-12   6845.6573 (   0.00%) 221090.3366 (-3129.64%)
        Max       mmap-21  56278.6508 (   0.00%) 213877.3496 (-280.03%)
        Max       mmap-30  19716.2990 (   0.00%) 216287.6229 (-997.00%)
        Max       mmap-48 477923.9400 (   0.00%) 245414.8238 (  48.65%)
      
      For most thread counts, the time to mmap() is unfortunately increased.
      In earlier versions of the series, this was lower but a large number of
      throttling events were reaching their timeout increasing the amount of
      inefficient scanning of the LRU.  There is no prioritisation of reclaim
      tasks making progress based on each task's rate of page allocation
      versus progress of reclaim.  The variance is also impacted for high
      worker counts but in all cases, the differences in latency are not
      statistically significant due to very large maximum outliers.  Max-90
      shows that 90% of the stalls are comparable but the Max results show
      the massive outliers which are increased due to stalling.
      
      It is expected that this will be very machine dependent.  Due to the
      test design, reclaim is difficult so allocations stall and there are
      variances depending on whether THPs can be allocated or not.  The amount
      of memory will affect exactly how bad the corner cases are and how often
      they trigger.  The warmup period calculation is not ideal as it's based
      on linear writes whereas fio is randomly writing multiple files from
      multiple tasks, so the start state of the test is variable.  For example,
      these are the latencies on a single-socket machine that had more memory
      
        Amean     mmap-4      42.2287 (   0.00%)     49.6838 * -17.65%*
        Amean     mmap-7     216.4326 (   0.00%)     47.4451 *  78.08%*
        Amean     mmap-12   2412.0588 (   0.00%)     51.7497 (  97.85%)
        Amean     mmap-21   5546.2548 (   0.00%)     51.8862 (  99.06%)
        Amean     mmap-30   1085.3121 (   0.00%)     72.1004 (  93.36%)
      
      The overall system CPU usage and elapsed time is as follows
      
                          5.15.0-rc3  5.15.0-rc3
                             vanilla mm-reclaimcongest-v5r4
        Duration User        6989.03      983.42
        Duration System      7308.12      799.68
        Duration Elapsed     2277.67     2092.98
      
      The patches reduce system CPU usage by 89%; the vanilla kernel rarely
      stalls, so it keeps scanning instead.
      
      The high-level /proc/vmstats show
      
                                             5.15.0-rc1     5.15.0-rc1
                                                vanilla mm-reclaimcongest-v5r2
        Ops Direct pages scanned          1056608451.00   503594991.00
        Ops Kswapd pages scanned           109795048.00   147289810.00
        Ops Kswapd pages reclaimed          63269243.00    31036005.00
        Ops Direct pages reclaimed          10803973.00     6328887.00
        Ops Kswapd efficiency %                   57.62          21.07
        Ops Kswapd velocity                    48204.98       57572.86
        Ops Direct efficiency %                    1.02           1.26
        Ops Direct velocity                   463898.83      196845.97
      
      Kswapd scanned fewer pages but the detailed pattern is different.  The
      vanilla kernel scans slowly over time whereas the patches exhibit
      bursts of scan activity.  Direct reclaim scanning is reduced by 52%
      due to stalling.
      
      The pattern for stealing pages is also slightly different.  Both
      kernels exhibit spikes, but the vanilla kernel reclaims pages steadily
      over a period of time whereas the patched kernel reclaims in spikes.
      The difference is that vanilla is not throttling and instead scans
      constantly, finding some pages over time, whereas the patched kernel
      throttles and then reclaims in spikes.
      
        Ops Percentage direct scans               90.59          77.37
      
      With vanilla, 90.59% of pages scanned were scanned by direct reclaim,
      whereas with the patches 77.37% were, due to throttling.
      
        Ops Page writes by reclaim           2613590.00     1687131.00
      
      Page writes from reclaim context are reduced.
      
        Ops Page writes anon                 2932752.00     1917048.00
      
      And there is less swapping.
      
        Ops Page reclaim immediate         996248528.00   107664764.00
      
      The number of pages encountered at the tail of the LRU tagged for
      immediate reclaim but still dirty/writeback is reduced by 89%.
      
        Ops Slabs scanned                     164284.00      153608.00
      
      Slab scan activity is similar.
      
      ftrace was used to gather stall activity
      
        Vanilla
        -------
            1 writeback_wait_iff_congested: usec_timeout=100000 usec_delayed=16000
            2 writeback_wait_iff_congested: usec_timeout=100000 usec_delayed=12000
            8 writeback_wait_iff_congested: usec_timeout=100000 usec_delayed=8000
           29 writeback_wait_iff_congested: usec_timeout=100000 usec_delayed=4000
        82394 writeback_wait_iff_congested: usec_timeout=100000 usec_delayed=0
      
      The vast majority of wait_iff_congested calls do not stall at all.
      What is likely happening is that cond_resched() reschedules the task
      for a short period when the BDI is not registering congestion (which
      it never will in this test setup).
      
            1 writeback_congestion_wait: usec_timeout=100000 usec_delayed=120000
            2 writeback_congestion_wait: usec_timeout=100000 usec_delayed=132000
            4 writeback_congestion_wait: usec_timeout=100000 usec_delayed=112000
          380 writeback_congestion_wait: usec_timeout=100000 usec_delayed=108000
          778 writeback_congestion_wait: usec_timeout=100000 usec_delayed=104000
      
      congestion_wait, if called, always exceeds the timeout as there is no
      trigger to wake it up.
      
      Bottom line: Vanilla will throttle but it's not effective.
      
      Patch series
      ------------
      
      Kswapd throttle activity was always due to scanning pages tagged for
      immediate reclaim at the tail of the LRU
      
            1 usec_timeout=100000 usect_delayed=72000 reason=VMSCAN_THROTTLE_WRITEBACK
            4 usec_timeout=100000 usect_delayed=20000 reason=VMSCAN_THROTTLE_WRITEBACK
            5 usec_timeout=100000 usect_delayed=12000 reason=VMSCAN_THROTTLE_WRITEBACK
            6 usec_timeout=100000 usect_delayed=16000 reason=VMSCAN_THROTTLE_WRITEBACK
           11 usec_timeout=100000 usect_delayed=100000 reason=VMSCAN_THROTTLE_WRITEBACK
           11 usec_timeout=100000 usect_delayed=8000 reason=VMSCAN_THROTTLE_WRITEBACK
           94 usec_timeout=100000 usect_delayed=0 reason=VMSCAN_THROTTLE_WRITEBACK
          112 usec_timeout=100000 usect_delayed=4000 reason=VMSCAN_THROTTLE_WRITEBACK
      
      The majority of events did not stall or stalled for a short period.
      Roughly 16% of stalls slept for the full timeout.  For direct reclaim,
      the number of times stalled for each reason was as follows:
      
         6624 reason=VMSCAN_THROTTLE_ISOLATED
        93246 reason=VMSCAN_THROTTLE_NOPROGRESS
        96934 reason=VMSCAN_THROTTLE_WRITEBACK
      
      The most common reason to stall was excessive pages tagged for
      immediate reclaim at the tail of the LRU, followed by a failure to
      make forward progress.  A relatively small number were due to too many
      pages isolated from the LRU by parallel threads.
      
      For VMSCAN_THROTTLE_ISOLATED, the breakdown of delays was
      
            9 usec_timeout=20000 usect_delayed=4000 reason=VMSCAN_THROTTLE_ISOLATED
           12 usec_timeout=20000 usect_delayed=16000 reason=VMSCAN_THROTTLE_ISOLATED
           83 usec_timeout=20000 usect_delayed=20000 reason=VMSCAN_THROTTLE_ISOLATED
         6520 usec_timeout=20000 usect_delayed=0 reason=VMSCAN_THROTTLE_ISOLATED
      
      Most did not stall at all.  A small number reached the timeout.
      
      For VMSCAN_THROTTLE_NOPROGRESS, the breakdown of stalls were all over
      the map
      
            1 usec_timeout=500000 usect_delayed=324000 reason=VMSCAN_THROTTLE_NOPROGRESS
            1 usec_timeout=500000 usect_delayed=332000 reason=VMSCAN_THROTTLE_NOPROGRESS
            1 usec_timeout=500000 usect_delayed=348000 reason=VMSCAN_THROTTLE_NOPROGRESS
            1 usec_timeout=500000 usect_delayed=360000 reason=VMSCAN_THROTTLE_NOPROGRESS
            2 usec_timeout=500000 usect_delayed=228000 reason=VMSCAN_THROTTLE_NOPROGRESS
            2 usec_timeout=500000 usect_delayed=260000 reason=VMSCAN_THROTTLE_NOPROGRESS
            2 usec_timeout=500000 usect_delayed=340000 reason=VMSCAN_THROTTLE_NOPROGRESS
            2 usec_timeout=500000 usect_delayed=364000 reason=VMSCAN_THROTTLE_NOPROGRESS
            2 usec_timeout=500000 usect_delayed=372000 reason=VMSCAN_THROTTLE_NOPROGRESS
            2 usec_timeout=500000 usect_delayed=428000 reason=VMSCAN_THROTTLE_NOPROGRESS
            2 usec_timeout=500000 usect_delayed=460000 reason=VMSCAN_THROTTLE_NOPROGRESS
            2 usec_timeout=500000 usect_delayed=464000 reason=VMSCAN_THROTTLE_NOPROGRESS
            3 usec_timeout=500000 usect_delayed=244000 reason=VMSCAN_THROTTLE_NOPROGRESS
            3 usec_timeout=500000 usect_delayed=252000 reason=VMSCAN_THROTTLE_NOPROGRESS
            3 usec_timeout=500000 usect_delayed=272000 reason=VMSCAN_THROTTLE_NOPROGRESS
            4 usec_timeout=500000 usect_delayed=188000 reason=VMSCAN_THROTTLE_NOPROGRESS
            4 usec_timeout=500000 usect_delayed=268000 reason=VMSCAN_THROTTLE_NOPROGRESS
            4 usec_timeout=500000 usect_delayed=328000 reason=VMSCAN_THROTTLE_NOPROGRESS
            4 usec_timeout=500000 usect_delayed=380000 reason=VMSCAN_THROTTLE_NOPROGRESS
            4 usec_timeout=500000 usect_delayed=392000 reason=VMSCAN_THROTTLE_NOPROGRESS
            4 usec_timeout=500000 usect_delayed=432000 reason=VMSCAN_THROTTLE_NOPROGRESS
            5 usec_timeout=500000 usect_delayed=204000 reason=VMSCAN_THROTTLE_NOPROGRESS
            5 usec_timeout=500000 usect_delayed=220000 reason=VMSCAN_THROTTLE_NOPROGRESS
            5 usec_timeout=500000 usect_delayed=412000 reason=VMSCAN_THROTTLE_NOPROGRESS
            5 usec_timeout=500000 usect_delayed=436000 reason=VMSCAN_THROTTLE_NOPROGRESS
            6 usec_timeout=500000 usect_delayed=488000 reason=VMSCAN_THROTTLE_NOPROGRESS
            7 usec_timeout=500000 usect_delayed=212000 reason=VMSCAN_THROTTLE_NOPROGRESS
            7 usec_timeout=500000 usect_delayed=300000 reason=VMSCAN_THROTTLE_NOPROGRESS
            7 usec_timeout=500000 usect_delayed=316000 reason=VMSCAN_THROTTLE_NOPROGRESS
            7 usec_timeout=500000 usect_delayed=472000 reason=VMSCAN_THROTTLE_NOPROGRESS
            8 usec_timeout=500000 usect_delayed=248000 reason=VMSCAN_THROTTLE_NOPROGRESS
            8 usec_timeout=500000 usect_delayed=356000 reason=VMSCAN_THROTTLE_NOPROGRESS
            8 usec_timeout=500000 usect_delayed=456000 reason=VMSCAN_THROTTLE_NOPROGRESS
            9 usec_timeout=500000 usect_delayed=124000 reason=VMSCAN_THROTTLE_NOPROGRESS
            9 usec_timeout=500000 usect_delayed=376000 reason=VMSCAN_THROTTLE_NOPROGRESS
            9 usec_timeout=500000 usect_delayed=484000 reason=VMSCAN_THROTTLE_NOPROGRESS
           10 usec_timeout=500000 usect_delayed=172000 reason=VMSCAN_THROTTLE_NOPROGRESS
           10 usec_timeout=500000 usect_delayed=420000 reason=VMSCAN_THROTTLE_NOPROGRESS
           10 usec_timeout=500000 usect_delayed=452000 reason=VMSCAN_THROTTLE_NOPROGRESS
           11 usec_timeout=500000 usect_delayed=256000 reason=VMSCAN_THROTTLE_NOPROGRESS
           12 usec_timeout=500000 usect_delayed=112000 reason=VMSCAN_THROTTLE_NOPROGRESS
           12 usec_timeout=500000 usect_delayed=116000 reason=VMSCAN_THROTTLE_NOPROGRESS
           12 usec_timeout=500000 usect_delayed=144000 reason=VMSCAN_THROTTLE_NOPROGRESS
           12 usec_timeout=500000 usect_delayed=152000 reason=VMSCAN_THROTTLE_NOPROGRESS
           12 usec_timeout=500000 usect_delayed=264000 reason=VMSCAN_THROTTLE_NOPROGRESS
           12 usec_timeout=500000 usect_delayed=384000 reason=VMSCAN_THROTTLE_NOPROGRESS
           12 usec_timeout=500000 usect_delayed=424000 reason=VMSCAN_THROTTLE_NOPROGRESS
           12 usec_timeout=500000 usect_delayed=492000 reason=VMSCAN_THROTTLE_NOPROGRESS
           13 usec_timeout=500000 usect_delayed=184000 reason=VMSCAN_THROTTLE_NOPROGRESS
           13 usec_timeout=500000 usect_delayed=444000 reason=VMSCAN_THROTTLE_NOPROGRESS
           14 usec_timeout=500000 usect_delayed=308000 reason=VMSCAN_THROTTLE_NOPROGRESS
           14 usec_timeout=500000 usect_delayed=440000 reason=VMSCAN_THROTTLE_NOPROGRESS
           14 usec_timeout=500000 usect_delayed=476000 reason=VMSCAN_THROTTLE_NOPROGRESS
           16 usec_timeout=500000 usect_delayed=140000 reason=VMSCAN_THROTTLE_NOPROGRESS
           17 usec_timeout=500000 usect_delayed=232000 reason=VMSCAN_THROTTLE_NOPROGRESS
           17 usec_timeout=500000 usect_delayed=240000 reason=VMSCAN_THROTTLE_NOPROGRESS
           17 usec_timeout=500000 usect_delayed=280000 reason=VMSCAN_THROTTLE_NOPROGRESS
           18 usec_timeout=500000 usect_delayed=404000 reason=VMSCAN_THROTTLE_NOPROGRESS
           20 usec_timeout=500000 usect_delayed=148000 reason=VMSCAN_THROTTLE_NOPROGRESS
           20 usec_timeout=500000 usect_delayed=216000 reason=VMSCAN_THROTTLE_NOPROGRESS
           20 usec_timeout=500000 usect_delayed=468000 reason=VMSCAN_THROTTLE_NOPROGRESS
           21 usec_timeout=500000 usect_delayed=448000 reason=VMSCAN_THROTTLE_NOPROGRESS
           23 usec_timeout=500000 usect_delayed=168000 reason=VMSCAN_THROTTLE_NOPROGRESS
           23 usec_timeout=500000 usect_delayed=296000 reason=VMSCAN_THROTTLE_NOPROGRESS
           25 usec_timeout=500000 usect_delayed=132000 reason=VMSCAN_THROTTLE_NOPROGRESS
           25 usec_timeout=500000 usect_delayed=352000 reason=VMSCAN_THROTTLE_NOPROGRESS
           26 usec_timeout=500000 usect_delayed=180000 reason=VMSCAN_THROTTLE_NOPROGRESS
           27 usec_timeout=500000 usect_delayed=284000 reason=VMSCAN_THROTTLE_NOPROGRESS
           28 usec_timeout=500000 usect_delayed=164000 reason=VMSCAN_THROTTLE_NOPROGRESS
           29 usec_timeout=500000 usect_delayed=136000 reason=VMSCAN_THROTTLE_NOPROGRESS
           30 usec_timeout=500000 usect_delayed=200000 reason=VMSCAN_THROTTLE_NOPROGRESS
           30 usec_timeout=500000 usect_delayed=400000 reason=VMSCAN_THROTTLE_NOPROGRESS
           31 usec_timeout=500000 usect_delayed=196000 reason=VMSCAN_THROTTLE_NOPROGRESS
           32 usec_timeout=500000 usect_delayed=156000 reason=VMSCAN_THROTTLE_NOPROGRESS
           33 usec_timeout=500000 usect_delayed=224000 reason=VMSCAN_THROTTLE_NOPROGRESS
           35 usec_timeout=500000 usect_delayed=128000 reason=VMSCAN_THROTTLE_NOPROGRESS
           35 usec_timeout=500000 usect_delayed=176000 reason=VMSCAN_THROTTLE_NOPROGRESS
           36 usec_timeout=500000 usect_delayed=368000 reason=VMSCAN_THROTTLE_NOPROGRESS
           36 usec_timeout=500000 usect_delayed=496000 reason=VMSCAN_THROTTLE_NOPROGRESS
           37 usec_timeout=500000 usect_delayed=312000 reason=VMSCAN_THROTTLE_NOPROGRESS
           38 usec_timeout=500000 usect_delayed=304000 reason=VMSCAN_THROTTLE_NOPROGRESS
           40 usec_timeout=500000 usect_delayed=288000 reason=VMSCAN_THROTTLE_NOPROGRESS
           43 usec_timeout=500000 usect_delayed=408000 reason=VMSCAN_THROTTLE_NOPROGRESS
           55 usec_timeout=500000 usect_delayed=416000 reason=VMSCAN_THROTTLE_NOPROGRESS
           56 usec_timeout=500000 usect_delayed=76000 reason=VMSCAN_THROTTLE_NOPROGRESS
           58 usec_timeout=500000 usect_delayed=120000 reason=VMSCAN_THROTTLE_NOPROGRESS
           59 usec_timeout=500000 usect_delayed=208000 reason=VMSCAN_THROTTLE_NOPROGRESS
           61 usec_timeout=500000 usect_delayed=68000 reason=VMSCAN_THROTTLE_NOPROGRESS
           71 usec_timeout=500000 usect_delayed=192000 reason=VMSCAN_THROTTLE_NOPROGRESS
           71 usec_timeout=500000 usect_delayed=480000 reason=VMSCAN_THROTTLE_NOPROGRESS
           79 usec_timeout=500000 usect_delayed=60000 reason=VMSCAN_THROTTLE_NOPROGRESS
           82 usec_timeout=500000 usect_delayed=320000 reason=VMSCAN_THROTTLE_NOPROGRESS
           82 usec_timeout=500000 usect_delayed=92000 reason=VMSCAN_THROTTLE_NOPROGRESS
           85 usec_timeout=500000 usect_delayed=64000 reason=VMSCAN_THROTTLE_NOPROGRESS
           85 usec_timeout=500000 usect_delayed=80000 reason=VMSCAN_THROTTLE_NOPROGRESS
           88 usec_timeout=500000 usect_delayed=84000 reason=VMSCAN_THROTTLE_NOPROGRESS
           90 usec_timeout=500000 usect_delayed=160000 reason=VMSCAN_THROTTLE_NOPROGRESS
           90 usec_timeout=500000 usect_delayed=292000 reason=VMSCAN_THROTTLE_NOPROGRESS
           94 usec_timeout=500000 usect_delayed=56000 reason=VMSCAN_THROTTLE_NOPROGRESS
          118 usec_timeout=500000 usect_delayed=88000 reason=VMSCAN_THROTTLE_NOPROGRESS
          119 usec_timeout=500000 usect_delayed=72000 reason=VMSCAN_THROTTLE_NOPROGRESS
          126 usec_timeout=500000 usect_delayed=108000 reason=VMSCAN_THROTTLE_NOPROGRESS
          146 usec_timeout=500000 usect_delayed=52000 reason=VMSCAN_THROTTLE_NOPROGRESS
          148 usec_timeout=500000 usect_delayed=36000 reason=VMSCAN_THROTTLE_NOPROGRESS
          148 usec_timeout=500000 usect_delayed=48000 reason=VMSCAN_THROTTLE_NOPROGRESS
          159 usec_timeout=500000 usect_delayed=28000 reason=VMSCAN_THROTTLE_NOPROGRESS
          178 usec_timeout=500000 usect_delayed=44000 reason=VMSCAN_THROTTLE_NOPROGRESS
          183 usec_timeout=500000 usect_delayed=40000 reason=VMSCAN_THROTTLE_NOPROGRESS
          237 usec_timeout=500000 usect_delayed=100000 reason=VMSCAN_THROTTLE_NOPROGRESS
          266 usec_timeout=500000 usect_delayed=32000 reason=VMSCAN_THROTTLE_NOPROGRESS
          313 usec_timeout=500000 usect_delayed=24000 reason=VMSCAN_THROTTLE_NOPROGRESS
          347 usec_timeout=500000 usect_delayed=96000 reason=VMSCAN_THROTTLE_NOPROGRESS
          470 usec_timeout=500000 usect_delayed=20000 reason=VMSCAN_THROTTLE_NOPROGRESS
          559 usec_timeout=500000 usect_delayed=16000 reason=VMSCAN_THROTTLE_NOPROGRESS
          964 usec_timeout=500000 usect_delayed=12000 reason=VMSCAN_THROTTLE_NOPROGRESS
         2001 usec_timeout=500000 usect_delayed=104000 reason=VMSCAN_THROTTLE_NOPROGRESS
         2447 usec_timeout=500000 usect_delayed=8000 reason=VMSCAN_THROTTLE_NOPROGRESS
         7888 usec_timeout=500000 usect_delayed=4000 reason=VMSCAN_THROTTLE_NOPROGRESS
        22727 usec_timeout=500000 usect_delayed=0 reason=VMSCAN_THROTTLE_NOPROGRESS
        51305 usec_timeout=500000 usect_delayed=500000 reason=VMSCAN_THROTTLE_NOPROGRESS
      
      The full timeout is often hit but a large number also do not stall at
      all.  The remainder slept a little allowing other reclaim tasks to make
      progress.
      
      While this timeout could be further increased, it could also negatively
      impact worst-case behaviour when there is no prioritisation of what task
      should make progress.
      
      For VMSCAN_THROTTLE_WRITEBACK, the breakdown was
      
            1 usec_timeout=100000 usect_delayed=44000 reason=VMSCAN_THROTTLE_WRITEBACK
            2 usec_timeout=100000 usect_delayed=76000 reason=VMSCAN_THROTTLE_WRITEBACK
            3 usec_timeout=100000 usect_delayed=80000 reason=VMSCAN_THROTTLE_WRITEBACK
            5 usec_timeout=100000 usect_delayed=48000 reason=VMSCAN_THROTTLE_WRITEBACK
            5 usec_timeout=100000 usect_delayed=84000 reason=VMSCAN_THROTTLE_WRITEBACK
            6 usec_timeout=100000 usect_delayed=72000 reason=VMSCAN_THROTTLE_WRITEBACK
            7 usec_timeout=100000 usect_delayed=88000 reason=VMSCAN_THROTTLE_WRITEBACK
           11 usec_timeout=100000 usect_delayed=56000 reason=VMSCAN_THROTTLE_WRITEBACK
           12 usec_timeout=100000 usect_delayed=64000 reason=VMSCAN_THROTTLE_WRITEBACK
           16 usec_timeout=100000 usect_delayed=92000 reason=VMSCAN_THROTTLE_WRITEBACK
           24 usec_timeout=100000 usect_delayed=68000 reason=VMSCAN_THROTTLE_WRITEBACK
           28 usec_timeout=100000 usect_delayed=32000 reason=VMSCAN_THROTTLE_WRITEBACK
           30 usec_timeout=100000 usect_delayed=60000 reason=VMSCAN_THROTTLE_WRITEBACK
           30 usec_timeout=100000 usect_delayed=96000 reason=VMSCAN_THROTTLE_WRITEBACK
           32 usec_timeout=100000 usect_delayed=52000 reason=VMSCAN_THROTTLE_WRITEBACK
           42 usec_timeout=100000 usect_delayed=40000 reason=VMSCAN_THROTTLE_WRITEBACK
           77 usec_timeout=100000 usect_delayed=28000 reason=VMSCAN_THROTTLE_WRITEBACK
           99 usec_timeout=100000 usect_delayed=36000 reason=VMSCAN_THROTTLE_WRITEBACK
          137 usec_timeout=100000 usect_delayed=24000 reason=VMSCAN_THROTTLE_WRITEBACK
          190 usec_timeout=100000 usect_delayed=20000 reason=VMSCAN_THROTTLE_WRITEBACK
          339 usec_timeout=100000 usect_delayed=16000 reason=VMSCAN_THROTTLE_WRITEBACK
          518 usec_timeout=100000 usect_delayed=12000 reason=VMSCAN_THROTTLE_WRITEBACK
          852 usec_timeout=100000 usect_delayed=8000 reason=VMSCAN_THROTTLE_WRITEBACK
         3359 usec_timeout=100000 usect_delayed=4000 reason=VMSCAN_THROTTLE_WRITEBACK
         7147 usec_timeout=100000 usect_delayed=0 reason=VMSCAN_THROTTLE_WRITEBACK
        83962 usec_timeout=100000 usect_delayed=100000 reason=VMSCAN_THROTTLE_WRITEBACK
      
      The majority hit the timeout in direct reclaim context although a
      sizable number did not stall at all.  This is very different to kswapd
      where only a tiny percentage of stalls due to writeback reached the
      timeout.
      
      Bottom line, the throttling appears to work and the wakeup events may
      limit worst case stalls.  There might be some grounds for adjusting
      timeouts but it's likely futile as the worst-case scenarios depend on
      the workload, memory size and the speed of the storage.  A better
      approach to improve the series further would be to prioritise tasks
      based on their rate of allocation with the caveat that it may be very
      expensive to track.
      
      This patch (of 5):
      
      Page reclaim throttles on wait_iff_congested under the following
      conditions:
      
       - kswapd is encountering pages under writeback and marked for immediate
         reclaim implying that pages are cycling through the LRU faster than
         pages can be cleaned.
      
       - Direct reclaim will stall if all dirty pages are backed by congested
         inodes.
      
      wait_iff_congested is almost completely broken, with few exceptions.
      This patch adds a new node-based wait queue and tracks the number of
      throttled tasks and the pages written back since throttling started.
      If enough pages belonging to the node are written back then the
      throttled tasks will wake early.  If not, the throttled tasks sleep
      until the timeout expires.
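      A sketch of the throttling side of that mechanism; field and stat
      names such as reclaim_wait, nr_writeback_throttled, nr_reclaim_start
      and NR_THROTTLED_WRITTEN follow this description but are assumptions,
      not the literal diff:

        void reclaim_throttle(pg_data_t *pgdat, enum vmscan_throttle_state reason)
        {
                wait_queue_head_t *wqh = &pgdat->reclaim_wait[reason];
                DEFINE_WAIT(wait);

                /* Remember how much writeback had completed when the first
                 * task on this node started throttling. */
                if (atomic_inc_return(&pgdat->nr_writeback_throttled) == 1)
                        WRITE_ONCE(pgdat->nr_reclaim_start,
                                   node_page_state(pgdat, NR_THROTTLED_WRITTEN));

                /* Uninterruptible sleep; the writeback completion path wakes
                 * the queue early once enough node pages have been cleaned. */
                prepare_to_wait(wqh, &wait, TASK_UNINTERRUPTIBLE);
                schedule_timeout(HZ / 10);
                finish_wait(wqh, &wait);

                atomic_dec(&pgdat->nr_writeback_throttled);
        }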
      
      [neilb@suse.de: Uninterruptible sleep and simpler wakeups]
      [hdanton@sina.com: Avoid race when reclaim starts]
      [vbabka@suse.cz: vmstat irq-safe api, clarifications]
      
      Link: https://lore.kernel.org/linux-mm/45d8b7a6-8548-65f5-cccf-9f451d4ae3d4@kernel.dk/ [1]
      Link: https://lkml.kernel.org/r/20211022144651.19914-1-mgorman@techsingularity.net
      Link: https://lkml.kernel.org/r/20211022144651.19914-2-mgorman@techsingularity.net
      Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: NeilBrown <neilb@suse.de>
      Cc: "Theodore Ts'o" <tytso@mit.edu>
      Cc: Andreas Dilger <adilger.kernel@dilger.ca>
      Cc: "Darrick J . Wong" <djwong@kernel.org>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Rik van Riel <riel@surriel.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      8cd7c588
    • mm/page_alloc: detect allocation forbidden by cpuset and bail out early · 8ca1b5a4
      Committed by Feng Tang
      There was a report that starting an Ubuntu container in docker while
      using cpuset to bind it to movable nodes (nodes that only have a
      movable zone, such as hotpluggable nodes or Persistent Memory nodes in
      normal usage) fails due to a memory allocation failure; the OOM killer
      then gets involved and many other innocent processes get killed.
      
      It can be reproduced with command:
      
          $ docker run -it --rm --cpuset-mems 4 ubuntu:latest bash -c "grep Mems_allowed /proc/self/status"
      
      (where node 4 is a movable node)
      
        runc:[2:INIT] invoked oom-killer: gfp_mask=0x500cc2(GFP_HIGHUSER|__GFP_ACCOUNT), order=0, oom_score_adj=0
        CPU: 8 PID: 8291 Comm: runc:[2:INIT] Tainted: G        W I E     5.8.2-0.g71b519a-default #1 openSUSE Tumbleweed (unreleased)
        Hardware name: Dell Inc. PowerEdge R640/0PHYDR, BIOS 2.6.4 04/09/2020
        Call Trace:
         dump_stack+0x6b/0x88
         dump_header+0x4a/0x1e2
         oom_kill_process.cold+0xb/0x10
         out_of_memory.part.0+0xaf/0x230
         out_of_memory+0x3d/0x80
         __alloc_pages_slowpath.constprop.0+0x954/0xa20
         __alloc_pages_nodemask+0x2d3/0x300
         pipe_write+0x322/0x590
         new_sync_write+0x196/0x1b0
         vfs_write+0x1c3/0x1f0
         ksys_write+0xa7/0xe0
         do_syscall_64+0x52/0xd0
         entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
        Mem-Info:
        active_anon:392832 inactive_anon:182 isolated_anon:0
         active_file:68130 inactive_file:151527 isolated_file:0
         unevictable:2701 dirty:0 writeback:7
         slab_reclaimable:51418 slab_unreclaimable:116300
         mapped:45825 shmem:735 pagetables:2540 bounce:0
         free:159849484 free_pcp:73 free_cma:0
        Node 4 active_anon:1448kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB mapped:0kB dirty:0kB writeback:0kB shmem:0kB shmem_thp: 0kB shmem_pmdmapped: 0kB anon_thp: 0kB writeback_tmp:0kB all_unreclaimable? no
        Node 4 Movable free:130021408kB min:9140kB low:139160kB high:269180kB reserved_highatomic:0KB active_anon:1448kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB writepending:0kB present:130023424kB managed:130023424kB mlocked:0kB kernel_stack:0kB pagetables:0kB bounce:0kB free_pcp:292kB local_pcp:84kB free_cma:0kB
        lowmem_reserve[]: 0 0 0 0 0
        Node 4 Movable: 1*4kB (M) 0*8kB 0*16kB 1*32kB (M) 0*64kB 0*128kB 1*256kB (M) 1*512kB (M) 1*1024kB (M) 0*2048kB 31743*4096kB (M) = 130021156kB
      
        oom-kill:constraint=CONSTRAINT_CPUSET,nodemask=(null),cpuset=docker-9976a269caec812c134fa317f27487ee36e1129beba7278a463dd53e5fb9997b.scope,mems_allowed=4,global_oom,task_memcg=/system.slice/containerd.service,task=containerd,pid=4100,uid=0
        Out of memory: Killed process 4100 (containerd) total-vm:4077036kB, anon-rss:51184kB, file-rss:26016kB, shmem-rss:0kB, UID:0 pgtables:676kB oom_score_adj:0
        oom_reaper: reaped process 8248 (docker), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
        oom_reaper: reaped process 2054 (node_exporter), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
        oom_reaper: reaped process 1452 (systemd-journal), now anon-rss:0kB, file-rss:8564kB, shmem-rss:4kB
        oom_reaper: reaped process 2146 (munin-node), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
        oom_reaper: reaped process 8291 (runc:[2:INIT]), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
      
      The reason is that in this case the target cpuset nodes only have a
      movable zone, while creating an OS in docker sometimes needs to
      allocate memory in non-movable zones (DMA/DMA32/Normal), e.g. with
      GFP_HIGHUSER, and the cpuset limit forbids the allocation.
      Out-of-memory killing is then involved even though the normal nodes
      and movable nodes both have plenty of free memory.
      
      The OOM killer cannot help to resolve the situation as there is no
      usable memory for the request within the cpuset scope.  The only
      reasonable measure to take is to fail the allocation right away and
      have the caller deal with it.

      So add a check for cases like this in the allocation slowpath and bail
      out early, returning NULL for the allocation.
      
      As page allocation is one of the hottest paths in the kernel and this
      check would hurt all users with a sane cpuset configuration, add a
      static branch and detect the abnormal config during cpuset
      memory-binding setup, so that the extra check cost in page allocation
      is not paid by everyone.
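      A sketch of what the check and the static branch can look like (the
      helper below and the key name cpusets_insane_config_key are
      illustrative assumptions):

        /* Hypothetical helper called from __alloc_pages_slowpath(). */
        static inline bool cpuset_movable_only_alloc_forbidden(gfp_t gfp_mask,
                                                struct alloc_context *ac)
        {
                /* Static branch: zero cost unless cpuset setup detected a
                 * binding to movable-only nodes. */
                if (!static_branch_unlikely(&cpusets_insane_config_key))
                        return false;

                /* Only hardwalled requests are confined to the cpuset. */
                if (!(gfp_mask & __GFP_HARDWALL))
                        return false;

                /* The request cannot use ZONE_MOVABLE, the only zone the
                 * bound nodes have, so it can never succeed. */
                return ac->highest_zoneidx < ZONE_MOVABLE;
        }

      When such a check returns true, the slowpath bails out and returns
      NULL instead of entering the OOM path.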
      
      [thanks to Michal Hocko and David Rientjes for suggesting not handling
       it inside OOM code, adding the cpuset check, refining comments]
      
      Link: https://lkml.kernel.org/r/1632481657-68112-1-git-send-email-feng.tang@intel.com
      Signed-off-by: Feng Tang <feng.tang@intel.com>
      Suggested-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Zefan Li <lizefan.x@bytedance.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      8ca1b5a4
    • f1dc0db2
  4. 09 September 2021 (2 commits)
    • mm: track present early pages per zone · 4b097002
      Committed by David Hildenbrand
      Patch series "mm/memory_hotplug: "auto-movable" online policy and memory groups", v3.
      
      I. Goal
      
      The goal of this series is improving in-kernel auto-online support.  It
      tackles the fundamental problems that:
      
       1) We can create zone imbalances when onlining all memory blindly to
          ZONE_MOVABLE, in the worst case crashing the system. We have to know
          upfront how much memory we are going to hotplug such that we can
          safely enable auto-onlining of all hotplugged memory to ZONE_MOVABLE
          via "online_movable". This is far from practical and only applicable in
          limited setups -- like inside VMs under the RHV/oVirt hypervisor which
          will never hotplug more than 3 times the boot memory (and the
          limitation is only in place due to the Linux limitation).
      
       2) We see more setups that implement dynamic VM resizing, hot(un)plugging
          memory to resize VM memory. In these setups, we might hotplug a lot of
          memory, but it might happen in various small steps in both directions
          (e.g., 2 GiB -> 8 GiB -> 4 GiB -> 16 GiB ...). virtio-mem is the
          primary driver of this upstream right now, performing such dynamic
          resizing NUMA-aware via multiple virtio-mem devices.
      
          Onlining all hotplugged memory to ZONE_NORMAL means we basically have
          no hotunplug guarantees. Onlining all to ZONE_MOVABLE means we can
          easily run into zone imbalances when growing a VM. We want a mixture,
          and we want as much memory as reasonable/configured in ZONE_MOVABLE.
          Details regarding zone imbalances can be found at [1].
      
       3) Memory devices consist of 1..X memory block devices, however, the
          kernel doesn't really track the relationship. Consequently, also user
          space has no idea. We want to make per-device decisions.
      
          As one example, for memory hotunplug it doesn't make sense to use a
          mixture of zones within a single DIMM: we want all MOVABLE if
          possible, otherwise all !MOVABLE, because any !MOVABLE part will easily
          block the whole DIMM from getting hotunplugged.
      
          As another example, virtio-mem operates on individual units that span
          1..X memory blocks. Similar to a DIMM, we want a unit to either be all
          MOVABLE or !MOVABLE. A "unit" can be thought of like a DIMM, however,
          all units of a virtio-mem device logically belong together and are
          managed (added/removed) by a single driver. We want as much memory of
          a virtio-mem device to be MOVABLE as possible.
      
       4) We want memory onlining to be done right from the kernel while adding
          memory, not triggered by user space via udev rules; for example, this
          is required for fast memory hotplug for drivers that add individual
          memory blocks, like virtio-mem. We want a way to configure a policy in
          the kernel and avoid implementing advanced policies in user space.
      
      The auto-onlining support we have in the kernel is not sufficient.  All we
      have is a) online everything MOVABLE (online_movable) b) online everything
      !MOVABLE (online_kernel) c) keep zones contiguous (online).  This series
      allows configuring c) to mean instead "online movable if possible
      according to the configuration, driven by a maximum MOVABLE:KERNEL ratio"
      -- a new onlining policy.
      
      II. Approach
      
      This series does 3 things:
      
       1) Introduces the "auto-movable" online policy that initially operates on
          individual memory blocks only. It uses a maximum MOVABLE:KERNEL ratio
          to make a decision whether a memory block will be onlined to
          ZONE_MOVABLE or not. However, in the basic form, hotplugged KERNEL
          memory does not allow for more MOVABLE memory (details in the
          patches). CMA memory is treated like MOVABLE memory.
      
       2) Introduces static (e.g., DIMM) and dynamic (e.g., virtio-mem) memory
          groups and uses group information to make decisions in the
          "auto-movable" online policy across memory blocks of a single memory
          device (modeled as memory group). More details can be found in patch
          #3 or in the DIMM example below.
      
       3) Maximizes ZONE_MOVABLE memory within dynamic memory groups, by
          allowing ZONE_NORMAL memory within a dynamic memory group to allow for
          more ZONE_MOVABLE memory within the same memory group. The target use
          case is dynamic VM resizing using virtio-mem. See the virtio-mem
          example below.
      
      I remember that the basic idea of using a ratio to implement a policy in
      the kernel was once mentioned by Vitaly Kuznetsov, but I might be wrong (I
      lost the pointer to that discussion).
      
      For me, the main use case is using it along with virtio-mem (and DIMMs /
      ppc64 dlpar where necessary) for dynamic resizing of VMs, increasing the
      amount of memory we can hotunplug reliably again if we might eventually
      hotplug a lot of memory to a VM.
      
      III. Target Usage
      
      The target usage will be:
      
       1) Linux boots with "mhp_default_online_type=offline"
      
       2) User space (e.g., systemd unit) configures memory onlining (according
          to a config file and system properties), for example:
          * Setting memory_hotplug.online_policy=auto-movable
          * Setting memory_hotplug.auto_movable_ratio=301
          * Setting memory_hotplug.auto_movable_numa_aware=true
      
       3) User space enables auto onlining via "echo online >
          /sys/devices/system/memory/auto_online_blocks"
      
       4) User space triggers manual onlining of all already-offline memory
          blocks (go over offline memory blocks and set them to "online")
      
      IV. Example
      
      For DIMMs, hotplugging 4 GiB DIMMs to a 4 GiB VM with a configured ratio of
      301% results in the following layout:
      	Memory block 0-15:    DMA32   (early)
      	Memory block 32-47:   Normal  (early)
      	Memory block 48-79:   Movable (DIMM 0)
      	Memory block 80-111:  Movable (DIMM 1)
      	Memory block 112-143: Movable (DIMM 2)
      	Memory block 144-175: Normal  (DIMM 3)
      	Memory block 176-207: Normal  (DIMM 4)
      	... all Normal
      	(-> hotplugged Normal memory does not allow for more Movable memory)
      
      For virtio-mem, using a simple, single virtio-mem device with a 4 GiB VM
      will result in the following layout:
      	Memory block 0-15:    DMA32   (early)
      	Memory block 32-47:   Normal  (early)
      	Memory block 48-143:  Movable (virtio-mem, first 12 GiB)
      	Memory block 144:     Normal  (virtio-mem, next 128 MiB)
      	Memory block 145-147: Movable (virtio-mem, next 384 MiB)
      	Memory block 148:     Normal  (virtio-mem, next 128 MiB)
      	Memory block 149-151: Movable (virtio-mem, next 384 MiB)
      	... Normal/Movable mixture as above
      	(-> hotplugged Normal memory allows for more Movable memory within
      	    the same device)
      
      Which gives us maximum flexibility when dynamically growing/shrinking a
      VM in smaller steps.
      
      V. Doc Update
      
      I'll update the memory-hotplug.rst documentation once the overhaul [1]
      is upstream. Until then, details can be found in patch #2.
      
      VI. Future Work
      
       1) Use memory groups for ppc64 dlpar
       2) Being able to specify a portion of (early) kernel memory that will be
          excluded from the ratio. Like "128 MiB globally/per node" are excluded.
      
          This might be helpful when starting VMs with extremely small memory
          footprint (e.g., 128 MiB) and hotplugging memory later -- not wanting
          the first hotplugged units getting onlined to ZONE_MOVABLE. One
          alternative would be a trigger to not consider ZONE_DMA memory
          in the ratio. We'll have to see if this is really required.
       3) Indicate to user space that MOVABLE might be a bad idea -- especially
          relevant when memory ballooning without support for balloon compaction
          is active.
      
      This patch (of 9):
      
      For implementing a new memory onlining policy, which determines when to
      online memory blocks to ZONE_MOVABLE semi-automatically, we need the
      number of present early (boot) pages -- present pages excluding hotplugged
      pages.  Let's track these pages per zone.
      
      Pass a page instead of the zone to adjust_present_page_count(), similar
      to adjust_managed_page_count(), and derive the zone from the page.
      
      It's worth noting that a memory block to be offlined/onlined is either
      completely "early" or "not early".  add_memory() and friends can only add
      complete memory blocks and we only online/offline complete (individual)
      memory blocks.
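      A sketch of the reworked helper under those assumptions (SPARSEMEM's
      early_section() test and the new zone->present_early_pages counter):

        void adjust_present_page_count(struct page *page, long nr_pages)
        {
                struct zone *zone = page_zone(page);
                bool early = early_section(__pfn_to_section(page_to_pfn(page)));

                /* Boot memory is counted separately from hotplugged memory. */
                if (early)
                        zone->present_early_pages += nr_pages;
                zone->present_pages += nr_pages;
                zone->zone_pgdat->node_present_pages += nr_pages;
        }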
      
      Link: https://lkml.kernel.org/r/20210806124715.17090-1-david@redhat.com
      Link: https://lkml.kernel.org/r/20210806124715.17090-2-david@redhat.com
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
      Cc: "Michael S. Tsirkin" <mst@redhat.com>
      Cc: Jason Wang <jasowang@redhat.com>
      Cc: Marek Kedzierski <mkedzier@redhat.com>
      Cc: Hui Zhu <teawater@gmail.com>
      Cc: Pankaj Gupta <pankaj.gupta.linux@gmail.com>
      Cc: Wei Yang <richard.weiyang@linux.alibaba.com>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Anshuman Khandual <anshuman.khandual@arm.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Mike Rapoport <rppt@kernel.org>
      Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
      Cc: Len Brown <lenb@kernel.org>
      Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      4b097002
    • mm: remove pfn_valid_within() and CONFIG_HOLES_IN_ZONE · 859a85dd
      Committed by Mike Rapoport
      Patch series "mm: remove pfn_valid_within() and CONFIG_HOLES_IN_ZONE".
      
      After recent updates to freeing unused parts of the memory map, no
      architecture can have holes in the memory map within a pageblock.  This
      makes pfn_valid_within() check and CONFIG_HOLES_IN_ZONE configuration
      option redundant.
      
      The first patch removes them both in a mechanical way and the second
      patch simplifies memory_hotplug::test_pages_in_a_zone(), which had
      pfn_valid_within() surrounded by more logic than a simple if.
      
      This patch (of 2):
      
      After recent changes in the freeing of unused parts of the memory map
      and the rework of pfn_valid() on arm and arm64, there are no
      architectures that can have holes in the memory map within a
      pageblock, so nothing can enable CONFIG_HOLES_IN_ZONE, which guards
      the non-trivial implementation of pfn_valid_within().
      
      With that, pfn_valid_within() is always hardwired to 1 and can be
      completely removed.
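      For reference, the construct being removed looked roughly like this in
      include/linux/mmzone.h:

        #ifdef CONFIG_HOLES_IN_ZONE
        /* Memmap holes may exist inside a pageblock: really check the pfn. */
        #define pfn_valid_within(pfn) pfn_valid(pfn)
        #else
        /* No holes within a pageblock: the check compiles away to 1. */
        #define pfn_valid_within(pfn) (1)
        #endif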
      
      Remove calls to pfn_valid_within() and CONFIG_HOLES_IN_ZONE.
      
      Link: https://lkml.kernel.org/r/20210713080035.7464-1-rppt@kernel.org
      Link: https://lkml.kernel.org/r/20210713080035.7464-2-rppt@kernel.org
      Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
      Acked-by: David Hildenbrand <david@redhat.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: "Rafael J. Wysocki" <rafael@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      859a85dd
  5. 04 September 2021 (3 commits)
  6. 02 July 2021 (2 commits)
  7. 01 July 2021 (2 commits)
  8. 30 June 2021 (13 commits)
    • mm/page_alloc: allow high-order pages to be stored on the per-cpu lists · 44042b44
      Committed by Mel Gorman
      The per-cpu page allocator (PCP) only stores order-0 pages.  This
      means that all THP and "cheap" high-order allocations, including
      SLUB's, contend on the zone->lock.  This patch extends the PCP
      allocator to store THP and "cheap" high-order pages.  Note that
      struct per_cpu_pages increases in size to 256 bytes (4 cache lines)
      on x86-64.
      
      Note that this is not necessarily a universal performance win because of
      how it is implemented.  High-order pages can cause pcp->high to be
      exceeded prematurely for lower-orders so for example, a large number of
      THP pages being freed could release order-0 pages from the PCP lists.
      Hence, much depends on the allocation/free pattern as observed by a single
      CPU to determine if caching helps or hurts a particular workload.
      
      That said, basic performance testing passed.  The following is a netperf
      UDP_STREAM test which hits the relevant patches as some of the network
      allocations are high-order.
      
      netperf-udp
                                       5.13.0-rc2             5.13.0-rc2
                                 mm-pcpburst-v3r4   mm-pcphighorder-v1r7
      Hmean     send-64         261.46 (   0.00%)      266.30 *   1.85%*
      Hmean     send-128        516.35 (   0.00%)      536.78 *   3.96%*
      Hmean     send-256       1014.13 (   0.00%)     1034.63 *   2.02%*
      Hmean     send-1024      3907.65 (   0.00%)     4046.11 *   3.54%*
      Hmean     send-2048      7492.93 (   0.00%)     7754.85 *   3.50%*
      Hmean     send-3312     11410.04 (   0.00%)    11772.32 *   3.18%*
      Hmean     send-4096     13521.95 (   0.00%)    13912.34 *   2.89%*
      Hmean     send-8192     21660.50 (   0.00%)    22730.72 *   4.94%*
      Hmean     send-16384    31902.32 (   0.00%)    32637.50 *   2.30%*
      
      Functionally, a patch like this is necessary to make bulk allocation of
      high-order pages work with similar performance to order-0 bulk
      allocations.  The bulk allocator is not updated in this series as it would
      have to be determined by bulk allocation users how they want to track the
      order of pages allocated with the bulk allocator.
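      
      A minimal sketch of the idea, assuming the per-cpu lists are bucketed by
      (migratetype, order); the names and the cached order limit below are
      illustrative rather than the exact fields added by the patch:
      
        /* One free list per (migratetype, order) bucket instead of order-0 only. */
        #define PCP_MAX_CACHED_ORDER	3	/* assumption: only "cheap" orders */
        #define NR_PCP_LISTS	(MIGRATE_PCPTYPES * (PCP_MAX_CACHED_ORDER + 1))
      
        struct per_cpu_pages {
        	int count;		/* pages on all lists */
        	int high;		/* high watermark, emptying needed */
        	int batch;		/* chunk size for buddy add/remove */
        	struct list_head lists[NR_PCP_LISTS];
        };
      
        /* Map an (order, migratetype) pair to its bucket. */
        static inline unsigned int order_to_pindex(int migratetype, unsigned int order)
        {
        	return (MIGRATE_PCPTYPES * order) + migratetype;
        }
      
      A free of a cached order lands on the matching bucket and only spills into
      the buddy lists, taking zone->lock, once pcp->high is exceeded.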
      
      Link: https://lkml.kernel.org/r/20210611135753.GC30378@techsingularity.net
      Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Zi Yan <ziy@nvidia.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Jesper Dangaard Brouer <brouer@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      44042b44
    • M
      mm: replace CONFIG_FLAT_NODE_MEM_MAP with CONFIG_FLATMEM · 43b02ba9
      Committed by Mike Rapoport
      After removal of the DISCONTIGMEM memory model the FLAT_NODE_MEM_MAP
      configuration option is equivalent to FLATMEM.
      
      Drop CONFIG_FLAT_NODE_MEM_MAP and use CONFIG_FLATMEM instead.
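      
      The conversion itself is mechanical; for example, the guard around the flat
      memory map pointer in pg_data_t changes along these lines (illustrative
      hunk, not a verbatim quote from the patch):
      
        -#ifdef CONFIG_FLAT_NODE_MEM_MAP
        +#ifdef CONFIG_FLATMEM
         	struct page *node_mem_map;
         #endif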
      
      Link: https://lkml.kernel.org/r/20210608091316.3622-10-rppt@kernel.org
      Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
      Acked-by: Arnd Bergmann <arnd@arndb.de>
      Acked-by: David Hildenbrand <david@redhat.com>
      Cc: Geert Uytterhoeven <geert@linux-m68k.org>
      Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Matt Turner <mattst88@gmail.com>
      Cc: Richard Henderson <rth@twiddle.net>
      Cc: Vineet Gupta <vgupta@synopsys.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      43b02ba9
    • M
      mm: replace CONFIG_NEED_MULTIPLE_NODES with CONFIG_NUMA · a9ee6cf5
      Committed by Mike Rapoport
      After removal of DISCONTIGMEM the NEED_MULTIPLE_NODES and NUMA
      configuration options are equivalent.
      
      Drop CONFIG_NEED_MULTIPLE_NODES and use CONFIG_NUMA instead.
      
      Done with
      
      	$ sed -i 's/CONFIG_NEED_MULTIPLE_NODES/CONFIG_NUMA/' \
      		$(git grep -wl CONFIG_NEED_MULTIPLE_NODES)
      	$ sed -i 's/NEED_MULTIPLE_NODES/NUMA/' \
      		$(git grep -wl NEED_MULTIPLE_NODES)
      
      with manual tweaks afterwards.
      
      [rppt@linux.ibm.com: fix arm boot crash]
        Link: https://lkml.kernel.org/r/YMj9vHhHOiCVN4BF@linux.ibm.com
      
      Link: https://lkml.kernel.org/r/20210608091316.3622-9-rppt@kernel.org
      Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
      Acked-by: Arnd Bergmann <arnd@arndb.de>
      Acked-by: David Hildenbrand <david@redhat.com>
      Cc: Geert Uytterhoeven <geert@linux-m68k.org>
      Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Matt Turner <mattst88@gmail.com>
      Cc: Richard Henderson <rth@twiddle.net>
      Cc: Vineet Gupta <vgupta@synopsys.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      a9ee6cf5
    • M
      mm: remove CONFIG_DISCONTIGMEM · bb1c50d3
      Committed by Mike Rapoport
      There are no architectures that support DISCONTIGMEM left.
      
      Remove the configuration option and the dead code it was guarding in the
      generic memory management code.
      
      Link: https://lkml.kernel.org/r/20210608091316.3622-6-rppt@kernel.org
      Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
      Acked-by: Arnd Bergmann <arnd@arndb.de>
      Acked-by: David Hildenbrand <david@redhat.com>
      Cc: Geert Uytterhoeven <geert@linux-m68k.org>
      Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Matt Turner <mattst88@gmail.com>
      Cc: Richard Henderson <rth@twiddle.net>
      Cc: Vineet Gupta <vgupta@synopsys.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      bb1c50d3
    • D
      mm: drop SECTION_SHIFT in code comments · 777c00f5
      Committed by Dong Aisheng
      The kernel code actually uses SECTIONS_SHIFT, so the code comment (which
      refers to SECTION_SHIFT) is strictly incorrect.  And since commit
      bbeae5b0 ("mm: move page flags layout to separate header"), the
      SECTIONS_SHIFT definition has been moved to
      include/linux/page-flags-layout.h.  Since the code itself looks quite
      straightforward, instead of moving the code comment to the new place as
      well, we simply remove it.
      
      This also fixed a checkpatch complain derived from the original code:
      WARNING: please, no space before tabs
      + * SECTIONS_SHIFT    ^I^I#bits space required to store a section #$
      
      Link: https://lkml.kernel.org/r/20210531091908.1738465-2-aisheng.dong@nxp.com
      Signed-off-by: Dong Aisheng <aisheng.dong@nxp.com>
      Suggested-by: Yu Zhao <yuzhao@google.com>
      Reviewed-by: Yu Zhao <yuzhao@google.com>
      Cc: Andrey Konovalov <andreyknvl@gmail.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Kees Cook <keescook@chromium.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      777c00f5
    • M
      mm/page_alloc: introduce vm.percpu_pagelist_high_fraction · 74f44822
      Committed by Mel Gorman
      This introduces a new sysctl vm.percpu_pagelist_high_fraction.  It is
      similar to the old vm.percpu_pagelist_fraction.  The old sysctl increased
      both pcp->batch and pcp->high with the higher pcp->high potentially
      reducing zone->lock contention.  However, the higher pcp->batch value also
      potentially increased allocation latency while the PCP was refilled.  This
      sysctl only adjusts pcp->high so that zone->lock contention is potentially
      reduced but allocation latency during a PCP refill remains the same.
      
        # grep -E "high:|batch" /proc/zoneinfo | tail -2
                    high:  649
                    batch: 63
      
        # sysctl vm.percpu_pagelist_high_fraction=8
        # grep -E "high:|batch" /proc/zoneinfo | tail -2
                    high:  35071
                    batch: 63
      
        # sysctl vm.percpu_pagelist_high_fraction=64
                    high:  4383
                    batch: 63
      
        # sysctl vm.percpu_pagelist_high_fraction=0
                    high:  649
                    batch: 63
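      
      Roughly, the fraction divides the zone's managed pages and the result is
      then shared between the CPUs local to the zone; a simplified model of the
      calculation (not the exact kernel code, default_pcp_high() is a
      hypothetical stand-in for the fraction==0 case):
      
        static int pcp_high_estimate(struct zone *zone, int fraction, int nr_local_cpus)
        {
        	unsigned long total;
      
        	if (!fraction)				/* 0: fall back to default sizing */
        		return default_pcp_high(zone);	/* hypothetical helper */
      
        	total = zone_managed_pages(zone) / fraction;
        	return total / max(1, nr_local_cpus);	/* share per local CPU */
        }
      
      This is why fraction=8 above yields a much larger pcp->high than
      fraction=64: a smaller divisor leaves a bigger share of the zone per CPU.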
      
      [mgorman@techsingularity.net: fix documentation]
        Link: https://lkml.kernel.org/r/20210528151010.GQ30378@techsingularity.net
      
      Link: https://lkml.kernel.org/r/20210525080119.5455-7-mgorman@techsingularity.net
      Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
      Acked-by: Dave Hansen <dave.hansen@linux.intel.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Hillf Danton <hdanton@sina.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      74f44822
    • M
      mm/page_alloc: limit the number of pages on PCP lists when reclaim is active · c49c2c47
      Committed by Mel Gorman
      When kswapd is active then direct reclaim is potentially active.  In
      either case, it is possible that a zone would be balanced if pages were
      not trapped on PCP lists.  Instead of draining remote pages, simply limit
      the size of the PCP lists while kswapd is active.
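      
      A sketch of the clamp, assuming a zone flag that tracks whether reclaim is
      running (names are close to, but not necessarily identical to, the patch):
      
        static int nr_pcp_high(struct per_cpu_pages *pcp, struct zone *zone)
        {
        	int high = READ_ONCE(pcp->high);
      
        	if (unlikely(!high))
        		return 0;
      
        	/* While reclaim is active, keep only a few batches per CPU. */
        	if (test_bit(ZONE_RECLAIM_ACTIVE, &zone->flags))
        		return min(READ_ONCE(pcp->batch) << 2, high);
      
        	return high;
        }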
      
      Link: https://lkml.kernel.org/r/20210525080119.5455-6-mgorman@techsingularity.net
      Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Hillf Danton <hdanton@sina.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      c49c2c47
    • M
      mm/page_alloc: scale the number of pages that are batch freed · 3b12e7e9
      Committed by Mel Gorman
      When a task is freeing a large number of order-0 pages, it may acquire the
      zone->lock multiple times freeing pages in batches.  This may
      unnecessarily contend on the zone lock when freeing very large number of
      pages.  This patch adapts the size of the batch based on the recent
      pattern to scale the batch size for subsequent frees.
      
      As the machines I used were not large enough to illustrate a problem, a
      debugging patch shows patterns like the following (slightly edited for
      clarity)
      
      Baseline vanilla kernel
        time-unmap-14426   [...] free_pcppages_bulk: free   63 count  378 high  378
        time-unmap-14426   [...] free_pcppages_bulk: free   63 count  378 high  378
        time-unmap-14426   [...] free_pcppages_bulk: free   63 count  378 high  378
        time-unmap-14426   [...] free_pcppages_bulk: free   63 count  378 high  378
        time-unmap-14426   [...] free_pcppages_bulk: free   63 count  378 high  378
      
      With patches
        time-unmap-7724    [...] free_pcppages_bulk: free  126 count  814 high  814
        time-unmap-7724    [...] free_pcppages_bulk: free  252 count  814 high  814
        time-unmap-7724    [...] free_pcppages_bulk: free  504 count  814 high  814
        time-unmap-7724    [...] free_pcppages_bulk: free  751 count  814 high  814
        time-unmap-7724    [...] free_pcppages_bulk: free  751 count  814 high  814
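      
      The scaling itself is simple; a hedged sketch of its shape (the free_factor
      field name and the exact bounds are assumptions):
      
        static int nr_pcp_free(struct per_cpu_pages *pcp, int high, int batch)
        {
        	int min_nr_free = batch;
        	int max_nr_free = high - batch;
      
        	/* Double the drain size for every consecutive large free. */
        	batch <<= pcp->free_factor;
        	if (batch < max_nr_free)
        		pcp->free_factor++;
      
        	return clamp(batch, min_nr_free, max_nr_free);
        }
      
      This matches the trace above: the per-drain free count doubles from the
      base batch (126, 252, 504) until it is clamped at high - batch = 751, and
      the factor decays again once the CPU goes back to allocating.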
      
      Link: https://lkml.kernel.org/r/20210525080119.5455-5-mgorman@techsingularity.net
      Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
      Acked-by: Dave Hansen <dave.hansen@linux.intel.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Hillf Danton <hdanton@sina.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      3b12e7e9
    • M
      mm/page_alloc: delete vm.percpu_pagelist_fraction · bbbecb35
      Committed by Mel Gorman
      Patch series "Calculate pcp->high based on zone sizes and active CPUs", v2.
      
      The per-cpu page allocator (PCP) is meant to reduce contention on the zone
      lock but the sizing of batch and high is archaic and takes neither the
      zone size nor the number of CPUs local to a zone into account.  With larger
      zones and more CPUs per node, the contention is getting worse.
      Furthermore, the fact that vm.percpu_pagelist_fraction adjusts both batch
      and high values means that the sysctl can reduce zone lock contention but
      also increase allocation latencies.
      
      This series disassociates pcp->high from pcp->batch and then scales
      pcp->high based on the size of the local zone with limited impact to
      reclaim and accounting for active CPUs but leaves pcp->batch static.  It
      also adapts the number of pages that can be on the pcp list based on
      recent freeing patterns.
      
      The motivation is partially to adjust to larger memory sizes but is also
      driven by the fact that large batches of page freeing via release_pages()
      often show zone lock contention as a major part of the problem.  Another
      motivation is a bug report based on an older kernel where a multi-terabyte
      process can take several minutes to exit.  A workaround was to use
      vm.percpu_pagelist_fraction to increase the pcp->high value but testing
      indicated that a production workload could not use the same values because
      of an increase in allocation latencies.  Unfortunately, I cannot reproduce
      this test case myself as the multi-terabyte machines are in active use, but
      this series should alleviate the problem.
      
      The series aims to address both and partially acts as a pre-requisite.
      The PCP currently works only with order-0 pages, which is useless for SLUB
      (when using high orders) and THP (unconditionally).  To store high-order
      pages on the PCP, the pcp->high values need to be increased first.
      
      This patch (of 6):
      
      The vm.percpu_pagelist_fraction is used to increase the batch and high
      limits for the per-cpu page allocator (PCP).  The intent behind the sysctl
      is to reduce zone lock acquisition when allocating/freeing pages but it
      has a problem.  While it can decrease contention, it can also increase
      latency on the allocation side due to unreasonably large batch sizes.
      This leads to games where an administrator adjusts
      percpu_pagelist_fraction on the fly to work around contention and
      allocation latency problems.
      
      This series aims to alleviate the problems with zone lock contention while
      avoiding the allocation-side latency problems.  For the purposes of
      review, it's easier to remove this sysctl now and reintroduce a similar
      sysctl later in the series that deals only with pcp->high.
      
      Link: https://lkml.kernel.org/r/20210525080119.5455-1-mgorman@techsingularity.net
      Link: https://lkml.kernel.org/r/20210525080119.5455-2-mgorman@techsingularity.net
      Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
      Acked-by: Dave Hansen <dave.hansen@linux.intel.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Hillf Danton <hdanton@sina.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      bbbecb35
    • M
      mm/vmstat: convert NUMA statistics to basic NUMA counters · f19298b9
      Committed by Mel Gorman
      NUMA statistics are maintained on the zone level for hits, misses, foreign
      etc but nothing relies on them being perfectly accurate for functional
      correctness.  The counters are used by userspace to get a general overview
      of a workload's NUMA behaviour but the page allocator incurs a high cost to
      maintain perfect accuracy similar to what is required for a vmstat like
      NR_FREE_PAGES.  There even is a sysctl vm.numa_stat to allow userspace to
      turn off the collection of NUMA statistics like NUMA_HIT.
      
      This patch converts NUMA_HIT and friends to be NUMA events with similar
      accuracy to VM events.  There is a possibility that slight errors will be
      introduced but the overall trend as seen by userspace will be similar.
      The counters are no longer updated from vmstat_refresh context as it is
      unnecessary overhead for counters that may never be read by userspace.
      Note that counters could be maintained at the node level to save space but
      it would have a user-visible impact due to /proc/zoneinfo.
      
      [lkp@intel.com: Fix misplaced closing brace for !CONFIG_NUMA]
      
      Link: https://lkml.kernel.org/r/20210512095458.30632-4-mgorman@techsingularity.net
      Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Chuck Lever <chuck.lever@oracle.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Jesper Dangaard Brouer <brouer@redhat.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      f19298b9
    • M
      mm/page_alloc: convert per-cpu list protection to local_lock · dbbee9d5
      Committed by Mel Gorman
      There is a lack of clarity about what exactly
      local_irq_save/local_irq_restore protects in page_alloc.c.  It conflates
      the protection of per-cpu page allocation structures with per-cpu vmstat
      deltas.
      
      This patch protects the PCP structure using local_lock which for most
      configurations is identical to IRQ enabling/disabling.  The scope of the
      lock is still wider than it should be but this is decreased later.
      
      It is possible for the local_lock to be embedded safely within struct
      per_cpu_pages but it adds complexity to free_unref_page_list.
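      
      The pattern is the usual local_lock one; a minimal sketch, assuming the
      lock lives in a small static pagesets structure and that the per-cpu PCP
      pointer hangs off the zone (field names as assumed here, not verbatim):
      
        struct pagesets {
        	local_lock_t lock;
        };
        static DEFINE_PER_CPU(struct pagesets, pagesets) = {
        	.lock = INIT_LOCAL_LOCK(lock),
        };
      
        static void pcp_fast_path(struct zone *zone)
        {
        	struct per_cpu_pages *pcp;
        	unsigned long flags;
      
        	local_lock_irqsave(&pagesets.lock, flags);
        	pcp = this_cpu_ptr(zone->per_cpu_pageset);
        	/* ... add/remove pages on the pcp lists ... */
        	local_unlock_irqrestore(&pagesets.lock, flags);
        }
      
      On !PREEMPT_RT this compiles down to local_irq_save/restore as before; on
      PREEMPT_RT it becomes a per-CPU spinlock plus migration disabling, which is
      exactly the separation the series is after.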
      
      [akpm@linux-foundation.org: coding style fixes]
      [mgorman@techsingularity.net: work around a pahole limitation with zero-sized struct pagesets]
        Link: https://lkml.kernel.org/r/20210526080741.GW30378@techsingularity.net
      [lkp@intel.com: Make pagesets static]
      
      Link: https://lkml.kernel.org/r/20210512095458.30632-3-mgorman@techsingularity.net
      Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Chuck Lever <chuck.lever@oracle.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Jesper Dangaard Brouer <brouer@redhat.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      dbbee9d5
    • M
      mm/page_alloc: split per cpu page lists and zone stats · 28f836b6
      Committed by Mel Gorman
      The PCP (per-cpu page allocator in page_alloc.c) shares locking
      requirements with vmstat and the zone lock which is inconvenient and
      causes some issues.  For example, the PCP list and vmstat share the same
      per-cpu space meaning that it's possible that vmstat updates dirty cache
      lines holding per-cpu lists across CPUs unless padding is used.  Second,
      PREEMPT_RT does not want to disable IRQs for too long in the page
      allocator.
      
      This series splits the locking requirements and uses locks types more
      suitable for PREEMPT_RT, reduces the time when special locking is required
      for stats and reduces the time when IRQs need to be disabled on
      !PREEMPT_RT kernels.
      
      Why local_lock?  PREEMPT_RT considers the following sequence to be unsafe
      as documented in Documentation/locking/locktypes.rst
      
         local_irq_disable();
         spin_lock(&lock);
      
      The pcp allocator has this sequence for rmqueue_pcplist (local_irq_save)
      -> __rmqueue_pcplist -> rmqueue_bulk (spin_lock).  While it's possible to
      separate this out, it generally means there are points where we enable
      IRQs and reenable them again immediately.  To prevent a migration and the
      per-cpu pointer going stale, migrate_disable is also needed.  That is a
      custom lock that is similar to, but worse than, local_lock.  Furthermore, on
      PREEMPT_RT, it's undesirable to leave IRQs disabled for too long.  By
      converting to local_lock which disables migration on PREEMPT_RT, the
      locking requirements can be separated and start moving the protections for
      PCP, stats and the zone lock to PREEMPT_RT-safe equivalent locking.  As a
      bonus, local_lock also means that PROVE_LOCKING does something useful.
      
      After that, it's obvious that zone_statistics incurs too much overhead and
      leaves IRQs disabled for longer than necessary on !PREEMPT_RT kernels.
      zone_statistics uses perfectly accurate counters requiring IRQs be
      disabled for parallel RMW sequences when inaccurate ones like vm_events
      would do.  The series makes the NUMA statistics (NUMA_HIT and friends)
      inaccurate counters that then require no special protection on
      !PREEMPT_RT.
      
      The bulk page allocator can then do stat updates in bulk with IRQs enabled
      which should improve the efficiency.  Technically, this could have been
      done without the local_lock and vmstat conversion work and the order
      simply reflects the timing of when different series were implemented.
      
      Finally, there are places where we conflate IRQs being disabled for the
      PCP with the IRQ-safe zone spinlock.  The remainder of the series reduces
      the scope of what is protected by disabled IRQs on !PREEMPT_RT kernels.
      By the end of the series, page_alloc.c does not call local_irq_save so the
      locking scope is a bit clearer.  The one exception is that modifying
      NR_FREE_PAGES still happens in places where it's known the IRQs are
      disabled as it's harmless for PREEMPT_RT and would be expensive to split
      the locking there.
      
      No performance data is included because despite the overhead of the stats,
      it's within the noise for most workloads on !PREEMPT_RT.  However, Jesper
      Dangaard Brouer ran a page allocation microbenchmark on a E5-1650 v4 @
      3.60GHz CPU on the first version of this series.  Focusing on the array
      variant of the bulk page allocator reveals the following.
      
      (CPU: Intel(R) Xeon(R) CPU E5-1650 v4 @ 3.60GHz)
      ARRAY variant: time_bulk_page_alloc_free_array: step=bulk size
      
               Baseline        Patched
       1       56.383          54.225 (+3.83%)
       2       40.047          35.492 (+11.38%)
       3       37.339          32.643 (+12.58%)
       4       35.578          30.992 (+12.89%)
       8       33.592          29.606 (+11.87%)
       16      32.362          28.532 (+11.85%)
       32      31.476          27.728 (+11.91%)
       64      30.633          27.252 (+11.04%)
       128     30.596          27.090 (+11.46%)
      
      While this is a positive outcome, the series is more likely to be
      interesting to the RT people in terms of getting parts of the PREEMPT_RT
      tree into mainline.
      
      This patch (of 9):
      
      The per-cpu page allocator lists and the per-cpu vmstat deltas are stored
      in the same struct per_cpu_pages even though vmstats have no direct impact
      on the per-cpu page lists.  This is inconsistent because the vmstats for a
      node are stored on a dedicated structure.  The bigger issue is that the
      per_cpu_pages structure is not cache-aligned and stat updates either cache
      conflict with adjacent per-cpu lists incurring a runtime cost or padding
      is required incurring a memory cost.
      
      This patch splits the per-cpu pagelists and the vmstat deltas into
      separate structures.  It's mostly a mechanical conversion but some
      variable renaming is done to clearly distinguish the per-cpu pages
      structure (pcp) from the vmstats (pzstats).
      
      Superficially, this appears to increase the size of the per_cpu_pages
      structure but the movement of expire fills a structure hole so there is no
      impact overall.
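      
      Schematically, the split looks like this (field lists trimmed, exact
      members differ):
      
        /* Per-cpu page lists only: hot, touched on every allocation/free. */
        struct per_cpu_pages {
        	int count;
        	int high;
        	int batch;
        	struct list_head lists[MIGRATE_PCPTYPES];
        };
      
        /* Per-cpu zone stat deltas only: updated much less often. */
        struct per_cpu_zonestat {
        	s8 vm_stat_diff[NR_VM_ZONE_STAT_ITEMS];
        	s8 stat_threshold;
        };
      
      The zone then carries two separate per-cpu pointers (pcp and pzstats in the
      new naming), so stat updates no longer dirty the cache lines that hold the
      page lists.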
      
      [mgorman@techsingularity.net: make it W=1 cleaner]
        Link: https://lkml.kernel.org/r/20210514144622.GA3735@techsingularity.net
      [mgorman@techsingularity.net: make it W=1 even cleaner]
        Link: https://lkml.kernel.org/r/20210516140705.GB3735@techsingularity.net
      [lkp@intel.com: check struct per_cpu_zonestat has a non-zero size]
      [vbabka@suse.cz: Init zone->per_cpu_zonestats properly]
      
      Link: https://lkml.kernel.org/r/20210512095458.30632-1-mgorman@techsingularity.net
      Link: https://lkml.kernel.org/r/20210512095458.30632-2-mgorman@techsingularity.net
      Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Chuck Lever <chuck.lever@oracle.com>
      Cc: Jesper Dangaard Brouer <brouer@redhat.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      28f836b6
    • M
      mm/mmzone.h: simplify is_highmem_idx() · b19bd1c9
      Committed by Mike Rapoport
      There is a lot of historical ifdefery in is_highmem_idx() and its helper
      zone_movable_is_highmem() that was required because of two different paths
      for nodes and zones initialization that were selected at compile time.
      
      Until commit 3f08a302 ("mm: remove CONFIG_HAVE_MEMBLOCK_NODE_MAP
      option") the movable_zone variable was only available for configurations
      that had CONFIG_HAVE_MEMBLOCK_NODE_MAP enabled so the test in
      zone_movable_is_highmem() used that variable only for such configurations.
      For other configurations the test checked if the index of ZONE_MOVABLE
      was greater by 1 than the index of ZONE_HIGHMEM and, if so, the movable
      zone was considered a highmem zone.  Needless to say, ZONE_MOVABLE - 1
      equals ZONE_HIGHMEM by definition when CONFIG_HIGHMEM=y.
      
      Commit 3f08a302 ("mm: remove CONFIG_HAVE_MEMBLOCK_NODE_MAP option")
      made movable_zone variable always available.  Since this variable is set
      to ZONE_HIGHMEM if CONFIG_HIGHMEM is enabled and highmem zone is
      populated, it is enough to check whether
      
      	zone_idx == ZONE_MOVABLE && movable_zone == ZONE_HIGHMEM
      
      to test if zone index points to a highmem zone.
      
      Remove zone_movable_is_highmem() that is not used anywhere except
      is_highmem_idx() and use the test above in is_highmem_idx() instead.
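      
      With movable_zone always available, the helper collapses to roughly the
      following (sketch of the simplified check):
      
        static inline bool is_highmem_idx(enum zone_type idx)
        {
        #ifdef CONFIG_HIGHMEM
        	return (idx == ZONE_HIGHMEM ||
        		(idx == ZONE_MOVABLE && movable_zone == ZONE_HIGHMEM));
        #else
        	return false;
        #endif
        }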
      
      Link: https://lkml.kernel.org/r/20210426141927.1314326-3-rppt@kernel.org
      Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
      Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Matthew Wilcox <willy@infradead.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      b19bd1c9
  9. 07 May 2021, 1 commit
  10. 06 May 2021, 3 commits
    • O
      mm,memory_hotplug: allocate memmap from the added memory range · a08a2ae3
      Committed by Oscar Salvador
      Physical memory hotadd has to allocate a memmap (struct page array) for
      the newly added memory section.  Currently, alloc_pages_node() is used
      for those allocations.
      
      This has some disadvantages:
       a) existing memory is consumed for that purpose
          (eg: ~2MB per 128MB memory section on x86_64)
          This can even lead to extreme cases where the system goes OOM because
          the physically hotplugged memory depletes the available memory before
          it is onlined.
       b) if the whole node is movable then we have off-node struct pages
          which have performance drawbacks.
       c) There might be no PMD_ALIGNED chunks, so the memmap array gets
          populated with base pages.
      
      This can be improved when CONFIG_SPARSEMEM_VMEMMAP is enabled.
      
      Vmemmap page tables can map arbitrary memory.  That means that we can
      reserve a part of the physically hotadded memory to back vmemmap page
      tables.  This implementation uses the beginning of the hotplugged memory
      for that purpose.
      
      There are some non-obvious things to consider, though.
      
      Vmemmap pages are allocated/freed during the memory hotplug events
      (add_memory_resource(), try_remove_memory()) when the memory is
      added/removed.  This means that the reserved physical range is not
      online although it is used.  The most obvious side effect is that
      pfn_to_online_page() returns NULL for those pfns.  The current design
      expects that this should be OK as the hotplugged memory is considered
      garbage until it is onlined.  For example, hibernation wouldn't save the
      content of those vmemmap pages into the image, so it wouldn't be restored
      on resume, but this should be OK as there is no real content to recover
      anyway while the metadata is reachable from other data structures (e.g.
      vmemmap page tables).
      
      The reserved space is therefore (de)initialized during the {on,off}line
      events (mhp_{de}init_memmap_on_memory).  That is done by extracting page
      allocator independent initialization from the regular onlining path.
      The primary reason to handle the reserved space outside of
      {on,off}line_pages is to make each initialization specific to the
      purpose rather than special case them in a single function.
      
      As per above, the functions that are introduced are:
      
       - mhp_init_memmap_on_memory:
         Initializes vmemmap pages by calling move_pfn_range_to_zone(), calls
         kasan_add_zero_shadow(), and onlines as many sections as vmemmap pages
         fully span.
      
       - mhp_deinit_memmap_on_memory:
         Offlines as many sections as vmemmap pages fully span, removes the
         range from the zone by remove_pfn_range_from_zone(), and calls
         kasan_remove_zero_shadow() for the range.
      
      The new function memory_block_online() calls mhp_init_memmap_on_memory()
      before doing the actual online_pages().  Should online_pages() fail, we
      clean up by calling mhp_deinit_memmap_on_memory().  Adjusting
      present_pages is done at the end once we know that online_pages()
      succeeded.
      
      On offline, memory_block_offline() needs to unaccount vmemmap pages from
      present_pages() before calling offline_pages().  This is necessary because
      offline_pages() tears down some structures based on whether the
      node or the zone becomes empty.  If offline_pages() fails, we account back
      vmemmap pages.  If it succeeds, we call mhp_deinit_memmap_on_memory().
      
      Hot-remove:
      
       We need to be careful when removing memory, as adding and
       removing memory needs to be done with the same granularity.
       To check that this assumption is not violated, we check the
       memory range we want to remove and if a) any memory block has
       vmemmap pages and b) the range spans more than a single memory
       block, we scream out loud and refuse to proceed.
      
       If all is good and the range was using memmap on memory (aka vmemmap pages),
       we construct an altmap structure so free_hugepage_table does the right
       thing and calls vmem_altmap_free instead of free_pagetable.
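      
      Putting the description above together, the online path is ordered roughly
      as follows (simplified sketch: the pfn/zone setup is elided and
      adjust_vmemmap_present_pages() stands in for whatever helper does the
      final present_pages accounting):
      
        static int memory_block_online(struct memory_block *mem)
        {
        	int ret;
      
        	/* 1) Initialize the self-hosted vmemmap pages first. */
        	ret = mhp_init_memmap_on_memory(start_pfn, nr_vmemmap_pages, zone);
        	if (ret)
        		return ret;
      
        	/* 2) Online the rest of the block, skipping the vmemmap range. */
        	ret = online_pages(start_pfn + nr_vmemmap_pages,
        			   nr_pages - nr_vmemmap_pages, zone);
        	if (ret) {
        		mhp_deinit_memmap_on_memory(start_pfn, nr_vmemmap_pages);
        		return ret;
        	}
      
        	/* 3) Only now account the vmemmap pages as present. */
        	adjust_vmemmap_present_pages(zone, nr_vmemmap_pages);
        	return 0;
        }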
      
      Link: https://lkml.kernel.org/r/20210421102701.25051-5-osalvador@suse.de
      Signed-off-by: Oscar Salvador <osalvador@suse.de>
      Reviewed-by: David Hildenbrand <david@redhat.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Anshuman Khandual <anshuman.khandual@arm.com>
      Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      a08a2ae3
    • P
      mm/gup: migrate pinned pages out of movable zone · d1e153fe
      Committed by Pavel Tatashin
      We should not pin pages in ZONE_MOVABLE.  Currently, only movable CMA
      pages are excluded from pinning.  Generalize the function that migrates
      CMA pages to migrate all movable pages.  Use is_pinnable_page() to check
      which pages need to be migrated.
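      
      A sketch of the predicate, under the assumption that it simply excludes
      ZONE_MOVABLE and CMA pages (the in-tree helper may differ in detail):
      
        static inline bool is_pinnable_page(struct page *page)
        {
        	return !(zone_idx(page_zone(page)) == ZONE_MOVABLE ||
        		 is_migrate_cma_page(page));
        }
      
      Long-term GUP pins then migrate any page for which this returns false
      before taking the pin, instead of special-casing CMA only.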
      
      Link: https://lkml.kernel.org/r/20210215161349.246722-10-pasha.tatashin@soleen.com
      Signed-off-by: Pavel Tatashin <pasha.tatashin@soleen.com>
      Reviewed-by: John Hubbard <jhubbard@nvidia.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Ira Weiny <ira.weiny@intel.com>
      Cc: James Morris <jmorris@namei.org>
      Cc: Jason Gunthorpe <jgg@nvidia.com>
      Cc: Jason Gunthorpe <jgg@ziepe.ca>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Sasha Levin <sashal@kernel.org>
      Cc: Steven Rostedt (VMware) <rostedt@goodmis.org>
      Cc: Tyler Hicks <tyhicks@linux.microsoft.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      d1e153fe
    • P
      mm/gup: do not migrate zero page · 9afaf30f
      Committed by Pavel Tatashin
      On some platforms ZERO_PAGE(0) might end up in a movable zone.  Do not
      migrate the zero page in gup during longterm pinning, as migration of the
      zero page is not allowed.
      
      For example, in x86 QEMU with 16G of memory and kernelcore=5G parameter, I
      see the following:
      
      Boot#1: zero_pfn  0x48a8d zero_pfn zone: ZONE_DMA32
      Boot#2: zero_pfn 0x20168d zero_pfn zone: ZONE_MOVABLE
      
      On x86, empty_zero_page is declared in .bss and depending on the loader
      may end up in different physical locations during boots.
      
      Also, move the is_zero_pfn() and my_zero_pfn() functions under CONFIG_MMU,
      because the zero_pfn they use is declared in memory.c, which is compiled
      only with CONFIG_MMU.
      
      Link: https://lkml.kernel.org/r/20210215161349.246722-9-pasha.tatashin@soleen.com
      Signed-off-by: Pavel Tatashin <pasha.tatashin@soleen.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Ira Weiny <ira.weiny@intel.com>
      Cc: James Morris <jmorris@namei.org>
      Cc: Jason Gunthorpe <jgg@nvidia.com>
      Cc: Jason Gunthorpe <jgg@ziepe.ca>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Sasha Levin <sashal@kernel.org>
      Cc: Steven Rostedt (VMware) <rostedt@goodmis.org>
      Cc: Tyler Hicks <tyhicks@linux.microsoft.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      9afaf30f
  11. 01 May 2021, 1 commit
  12. 27 Feb 2021, 2 commits
  13. 25 Feb 2021, 4 commits
    • Y
      mm/vmscan.c: make lruvec_lru_size() static · 2091339d
      Committed by Yu Zhao
      All other references to the function were removed after
      commit b910718a ("mm: vmscan: detect file thrashing at the reclaim
      root").
      
      Link: https://lore.kernel.org/linux-mm/20201207220949.830352-11-yuzhao@google.com/
      Link: https://lkml.kernel.org/r/20210122220600.906146-11-yuzhao@google.com
      Signed-off-by: Yu Zhao <yuzhao@google.com>
      Reviewed-by: Alex Shi <alex.shi@linux.alibaba.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Roman Gushchin <guro@fb.com>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      2091339d
    • S
      mm: memcg: add swapcache stat for memcg v2 · b6038942
      Committed by Shakeel Butt
      This patch adds swapcache stat for the cgroup v2.  The swapcache
      represents the memory that is accounted against both the memory and the
      swap limit of the cgroup.  The main motivation behind exposing the
      swapcache stat is for enabling users to gracefully migrate from cgroup
      v1's memsw counter to cgroup v2's memory and swap counters.
      
      Cgroup v1's memsw limit allows users to limit the memory+swap usage of a
      workload but without control on the exact proportion of memory and swap.
      Cgroup v2 provides separate limits for memory and swap which enables more
      control on the exact usage of memory and swap individually for the
      workload.
      
      With some minor subtleties, v1's memsw limit can be replaced with the
      sum of v2's memory and swap limits.  However, an alternative for memsw
      usage is not yet available in cgroup v2.  Exposing per-cgroup swapcache
      stat enables that alternative.  Adding the memory usage and swap usage and
      subtracting the swapcache will approximate the memsw usage.  This will
      help in the transparent migration of the workloads depending on memsw
      usage and limit to v2' memory and swap counters.
      
      The reasons these applications are still interested in this approximate
      memsw usage are: (1) these applications are not really interested in two
      separate memory and swap usage metrics.  A single usage metric is simpler
      for them to use and reason about.
      
      (2) The memsw usage metric hides the underlying system's swap setup from
      the applications.  Applications with multiple instances running in a
      datacenter with heterogeneous systems (some have swap and some don't) will
      keep seeing a consistent view of their usage.
      
      [akpm@linux-foundation.org: fix CONFIG_SWAP=n build]
      
      Link: https://lkml.kernel.org/r/20210108155813.2914586-3-shakeelb@google.com
      Signed-off-by: Shakeel Butt <shakeelb@google.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Reviewed-by: Roman Gushchin <guro@fb.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Muchun Song <songmuchun@bytedance.com>
      Cc: Yang Shi <shy828301@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      b6038942
    • M
      mm: memcontrol: convert NR_FILE_PMDMAPPED account to pages · 380780e7
      Committed by Muchun Song
      Currently we use struct per_cpu_nodestat to cache the vmstat counters,
      which leads to inaccurate statistics especially THP vmstat counters.  In
      the systems with hundreds of processors it can be GBs of memory.  For
      example, for a 96 CPUs system, the threshold is the maximum number of 125.
      And the per cpu counters can cache 23.4375 GB in total.
      
      A THP page is already a form of batched addition (it adds 512 pages' worth
      of memory in one go), so skipping the per-cpu batching seems sensible.
      Every THP stat update then overflows the per-cpu counter and resorts to an
      atomic global update, but this makes the statistics more accurate for the
      THP vmstat counters.
      
      So we convert the NR_FILE_PMDMAPPED account to pages.  This patch is
      consistent with 8f182270 ("mm/swap.c: flush lru pvecs on compound page
      arrival").  Doing this also can make the unit of vmstat counters more
      unified.  Finally, the unit of the vmstat counters are pages, kB and
      bytes.  The B/KB suffix can tell us that the unit is bytes or kB.  The
      rest which is without suffix are pages.
      
      Link: https://lkml.kernel.org/r/20201228164110.2838-7-songmuchun@bytedance.com
      Signed-off-by: Muchun Song <songmuchun@bytedance.com>
      Cc: Alexey Dobriyan <adobriyan@gmail.com>
      Cc: Feng Tang <feng.tang@intel.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: NeilBrown <neilb@suse.de>
      Cc: Pankaj Gupta <pankaj.gupta@cloud.ionos.com>
      Cc: Rafael. J. Wysocki <rafael@kernel.org>
      Cc: Randy Dunlap <rdunlap@infradead.org>
      Cc: Roman Gushchin <guro@fb.com>
      Cc: Sami Tolvanen <samitolvanen@google.com>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      380780e7
    • M
      mm: memcontrol: convert NR_SHMEM_PMDMAPPED account to pages · a1528e21
      Committed by Muchun Song
      Currently we use struct per_cpu_nodestat to cache the vmstat counters,
      which leads to inaccurate statistics especially THP vmstat counters.  In
      the systems with hundreds of processors it can be GBs of memory.  For
      example, for a 96 CPUs system, the threshold is the maximum number of 125.
      And the per cpu counters can cache 23.4375 GB in total.
      
      A THP page is already a form of batched addition (it adds 512 pages' worth
      of memory in one go), so skipping the per-cpu batching seems sensible.
      Every THP stat update then overflows the per-cpu counter and resorts to an
      atomic global update, but this makes the statistics more accurate for the
      THP vmstat counters.
      
      So we convert the NR_SHMEM_PMDMAPPED account to pages.  This patch is
      consistent with 8f182270 ("mm/swap.c: flush lru pvecs on compound page
      arrival").  Doing this also can make the unit of vmstat counters more
      unified.  Finally, the unit of the vmstat counters are pages, kB and
      bytes.  The B/KB suffix can tell us that the unit is bytes or kB.  The
      rest which is without suffix are pages.
      
      Link: https://lkml.kernel.org/r/20201228164110.2838-6-songmuchun@bytedance.com
      Signed-off-by: Muchun Song <songmuchun@bytedance.com>
      Cc: Alexey Dobriyan <adobriyan@gmail.com>
      Cc: Feng Tang <feng.tang@intel.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: NeilBrown <neilb@suse.de>
      Cc: Pankaj Gupta <pankaj.gupta@cloud.ionos.com>
      Cc: Rafael. J. Wysocki <rafael@kernel.org>
      Cc: Randy Dunlap <rdunlap@infradead.org>
      Cc: Roman Gushchin <guro@fb.com>
      Cc: Sami Tolvanen <samitolvanen@google.com>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      a1528e21