1. 15 January 2022, 8 commits
    • mm/page_alloc.c: do not warn allocation failure on zone DMA if no managed pages · c4dc63f0
      Authored by Baoquan He
      In the kdump kernel on x86_64, a page allocation failure is observed:
      
       kworker/u2:2: page allocation failure: order:0, mode:0xcc1(GFP_KERNEL|GFP_DMA), nodemask=(null),cpuset=/,mems_allowed=0
       CPU: 0 PID: 55 Comm: kworker/u2:2 Not tainted 5.16.0-rc4+ #5
       Hardware name: AMD Dinar/Dinar, BIOS RDN1505B 06/05/2013
       Workqueue: events_unbound async_run_entry_fn
       Call Trace:
        <TASK>
        dump_stack_lvl+0x48/0x5e
        warn_alloc.cold+0x72/0xd6
        __alloc_pages_slowpath.constprop.0+0xc69/0xcd0
        __alloc_pages+0x1df/0x210
        new_slab+0x389/0x4d0
        ___slab_alloc+0x58f/0x770
        __slab_alloc.constprop.0+0x4a/0x80
        kmem_cache_alloc_trace+0x24b/0x2c0
        sr_probe+0x1db/0x620
        ......
        device_add+0x405/0x920
        ......
        __scsi_add_device+0xe5/0x100
        ata_scsi_scan_host+0x97/0x1d0
        async_run_entry_fn+0x30/0x130
        process_one_work+0x1e8/0x3c0
        worker_thread+0x50/0x3b0
        ? rescuer_thread+0x350/0x350
        kthread+0x16b/0x190
        ? set_kthread_struct+0x40/0x40
        ret_from_fork+0x22/0x30
        </TASK>
       Mem-Info:
       ......
      
      The above failure happened when calling kmalloc() to allocate a buffer
      with GFP_DMA.  It requests a slab page from the DMA zone while that zone
      has no managed pages at all.
      
       sr_probe()
       --> get_capabilities()
           --> buffer = kmalloc(512, GFP_KERNEL | GFP_DMA);
      
      In the current kernel, dma-kmalloc caches are created as long as
      CONFIG_ZONE_DMA is enabled.  However, the kdump kernel of x86_64 has no
      managed pages in the DMA zone since commit 6f599d84 ("x86/kdump: Always
      reserve the low 1M when the crashkernel option is specified").  The
      failure can always be reproduced.

      For now, mute the allocation failure warning when requesting pages from
      the DMA zone while it has no managed pages.
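
      The merged change can be sketched as one extra bail-out condition in
      warn_alloc() (a sketch based on this description; has_managed_dma() is
      the helper introduced by this series, see 62b31070 below, and the
      surrounding context may differ):

        -       if ((gfp_mask & __GFP_NOWARN) || !__ratelimit(&nopage_rs))
        +       if ((gfp_mask & __GFP_NOWARN) ||
        +           !__ratelimit(&nopage_rs) ||
        +           ((gfp_mask & __GFP_DMA) && !has_managed_dma()))
                        return;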
      
      [akpm@linux-foundation.org: fix warning]
      
      Link: https://lkml.kernel.org/r/20211223094435.248523-4-bhe@redhat.com
      Fixes: 6f599d84 ("x86/kdump: Always reserve the low 1M when the crashkernel option is specified")
      Signed-off-by: Baoquan He <bhe@redhat.com>
      Acked-by: John Donnelly <john.p.donnelly@oracle.com>
      Reviewed-by: Hyeonggon Yoo <42.hyeyoo@gmail.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: David Laight <David.Laight@ACULAB.COM>
      Cc: Marek Szyprowski <m.szyprowski@samsung.com>
      Cc: Robin Murphy <robin.murphy@arm.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm_zone: add function to check if managed dma zone exists · 62b31070
      Authored by Baoquan He
      Patch series "Handle warning of allocation failure on DMA zone w/o
      managed pages", v4.
      
      **Problem observed:
      On x86_64, when a crash is triggered and the kdump kernel is entered, a
      page allocation failure can always be seen.
      
       ---------------------------------
       DMA: preallocated 128 KiB GFP_KERNEL pool for atomic allocations
       swapper/0: page allocation failure: order:5, mode:0xcc1(GFP_KERNEL|GFP_DMA), nodemask=(null),cpuset=/,mems_allowed=0
       CPU: 0 PID: 1 Comm: swapper/0
       Call Trace:
        dump_stack+0x7f/0xa1
        warn_alloc.cold+0x72/0xd6
        ......
        __alloc_pages+0x24d/0x2c0
        ......
        dma_atomic_pool_init+0xdb/0x176
        do_one_initcall+0x67/0x320
        ? rcu_read_lock_sched_held+0x3f/0x80
        kernel_init_freeable+0x290/0x2dc
        ? rest_init+0x24f/0x24f
        kernel_init+0xa/0x111
        ret_from_fork+0x22/0x30
       Mem-Info:
       ------------------------------------
      
      **Root cause:
      The current kernel assumes that the DMA zone must have managed pages and
      tries to request pages from it if CONFIG_ZONE_DMA is enabled, but this
      is not always true.  E.g. in the kdump kernel of x86_64, only the low 1M
      is present and it is locked down at a very early stage of boot, so the
      low 1M is never added into the buddy allocator to become managed pages
      of the DMA zone.  This exception will always cause a page allocation
      failure if a page is requested from the DMA zone.
      
      **Investigation:
      This failure has happened since the commits below were merged into
      Linus's tree:
        1a6a9044 x86/setup: Remove CONFIG_X86_RESERVE_LOW and reservelow= options
        23721c8e x86/crash: Remove crash_reserve_low_1M()
        f1d4d47c x86/setup: Always reserve the first 1M of RAM
        7c321eb2 x86/kdump: Remove the backup region handling
        6f599d84 x86/kdump: Always reserve the low 1M when the crashkernel option is specified
      
      Before these commits, on x86_64 the low 640K area was reused by the
      kdump kernel: its contents were copied into a backup region for dumping
      before jumping into the kdump kernel.  Then, except for the firmware
      reserved regions in [0, 640K], the remaining area was added into the
      buddy allocator to become available managed pages of the DMA zone.
      
      However, after the above commits were applied, the low 1M in the kdump
      kernel of x86_64 is reserved by memblock but never released to the buddy
      allocator, so any later page allocation requested from the DMA zone will
      fail.
      
      Initially, if crashkernel is reserved, the low 1M needs to be locked
      down because AMD SME encrypts memory, making the old backup region
      mechanism impossible when switching into the kdump kernel.
      
      Later, it was also observed that some BIOSes corrupt memory under 1M.
      To solve this, commit f1d4d47c always reserves the entire low 1M region
      after the real mode trampoline is allocated.
      
      Besides, an Intel engineer recently mentioned that TDX (Trust Domain
      Extensions), which is under development in the kernel, also needs to
      lock down the low 1M.  So we can't simply revert the above commits to
      fix the page allocation failure from the DMA zone, as some have
      suggested.
      
      **Solution:
      Currently, only the DMA atomic pool and dma-kmalloc initialize and
      request page allocation with GFP_DMA during bootup.

      So only initialize the DMA atomic pool when the DMA zone has available
      managed pages; otherwise just skip the initialization.

      For dma-kmalloc(), for the time being, mute the warning of allocation
      failure when requesting pages from the DMA zone while it has no managed
      pages.  Meanwhile, change code to use the dma_alloc_xx/dma_map_xx APIs
      to replace kmalloc(GFP_DMA), or do not use GFP_DMA when calling
      kmalloc() if not necessary.  Christoph is posting patches to fix those
      under drivers/scsi/.  Finally, we can remove dma-kmalloc() altogether,
      as people have suggested.
      
      This patch (of 3):
      
      In some places, the current kernel assumes that the DMA zone must have
      managed pages if CONFIG_ZONE_DMA is enabled, but this is not always
      true.  E.g. in the kdump kernel of x86_64, only the low 1M is present
      and it is locked down at a very early stage of boot, so there are no
      managed pages at all in the DMA zone.  This exception will always cause
      a page allocation failure if a page is requested from the DMA zone.
      
      Add the function has_managed_dma() and the relevant helper functions to
      check whether there is a DMA zone with managed pages.  It will be used
      in later patches.
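
      A sketch of the helper consistent with this description
      (for_each_online_pgdat() and managed_zone() are existing mm helpers;
      the merged version may differ in detail):

        bool has_managed_dma(void)
        {
        #ifdef CONFIG_ZONE_DMA
                struct pglist_data *pgdat;

                for_each_online_pgdat(pgdat) {
                        struct zone *zone = &pgdat->node_zones[ZONE_DMA];

                        /* true only if the DMA zone has pages in the buddy */
                        if (managed_zone(zone))
                                return true;
                }
        #endif
                return false;
        }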
      
      Link: https://lkml.kernel.org/r/20211223094435.248523-1-bhe@redhat.com
      Link: https://lkml.kernel.org/r/20211223094435.248523-2-bhe@redhat.com
      Fixes: 6f599d84 ("x86/kdump: Always reserve the low 1M when the crashkernel option is specified")
      Signed-off-by: Baoquan He <bhe@redhat.com>
      Reviewed-by: David Hildenbrand <david@redhat.com>
      Acked-by: John Donnelly <john.p.donnelly@oracle.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Hyeonggon Yoo <42.hyeyoo@gmail.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: David Laight <David.Laight@ACULAB.COM>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Marek Szyprowski <m.szyprowski@samsung.com>
      Cc: Robin Murphy <robin.murphy@arm.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/page_alloc.c: modify the comment section for alloc_contig_pages() · eaab8e75
      Authored by Anshuman Khandual
      Clarify that the alloc_contig_pages() allocated range will always be
      aligned to the requested nr_pages.
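
      For illustration, a hypothetical caller relying on the clarified
      guarantee (alloc_contig_pages() is the existing interface; the call
      site itself is made up):

        /* Request 512 contiguous pages; per the clarified comment, the
         * start pfn of the returned range is aligned to 512 pages. */
        struct page *page = alloc_contig_pages(512, GFP_KERNEL,
                                               numa_node_id(), NULL);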
      
      Link: https://lkml.kernel.org/r/1639545478-12160-1-git-send-email-anshuman.khandual@arm.com
      Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com>
      Cc: David Hildenbrand <david@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: page_alloc: fix building error on -Werror=array-compare · ca831f29
      Authored by Xiongwei Song
      Arthur Marsh reported that we would hit the error below when building
      the kernel with gcc-12:
      
        CC      mm/page_alloc.o
        mm/page_alloc.c: In function `mem_init_print_info':
        mm/page_alloc.c:8173:27: error: comparison between two arrays [-Werror=array-compare]
         8173 |                 if (start <= pos && pos < end && size > adj) \
              |
      
      In C++20, comparison between two arrays is deprecated, so gcc-12 warns
      about such comparisons (-Warray-compare).
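
      The fix can be sketched as comparing decayed element addresses instead
      of the array names themselves in the adj_init_size() macro used by
      mem_init_print_info() (a sketch from the error context; the merged
      hunk may differ):

        -       if (start <= pos && pos < end && size > adj) \
        +       if (&start[0] <= &pos[0] && &pos[0] < &end[0] && size > adj) \
                        size -= adj;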
      
      Link: https://lkml.kernel.org/r/20211125130928.32465-1-sxwjean@me.com
      Signed-off-by: Xiongwei Song <sxwjean@gmail.com>
      Reported-by: Arthur Marsh <arthur.marsh@internode.on.net>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: page table check · df4e817b
      Authored by Pasha Tatashin
      Check user page table entries at the time they are added and removed.
      
      This allows memory corruption issues related to double mapping to be
      caught synchronously.
      
      When a pte for an anonymous page is added into a page table, we verify
      that the pte does not already point to a file-backed page; and vice
      versa, if a file-backed page is being added, we verify that the page
      does not have an anonymous mapping.

      We also enforce that the only sharing allowed for anonymous pages is
      read-only sharing (i.e. COW after fork); all other sharing must be for
      file pages.
      
      Page table check makes it possible to protect against and debug cases
      where "struct page" metadata became corrupted for some reason, for
      example when the refcount or mapcount become invalid.
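
      A condensed sketch of the rule, assuming per-page anon_map_count and
      file_map_count tracking counters as described above (names and the
      exact set of checks are illustrative, not the verbatim patch):

        static void page_table_check_set(struct page_table_check *ptc,
                                         bool anon, bool rw)
        {
                if (anon) {
                        /* an anon mapping forbids existing file mappings */
                        BUG_ON(atomic_read(&ptc->file_map_count));
                        /* writable anon pages must be exclusive; only
                         * read-only sharing (cow after fork) is allowed */
                        BUG_ON(atomic_inc_return(&ptc->anon_map_count) > 1 && rw);
                } else {
                        /* a file mapping forbids existing anon mappings */
                        BUG_ON(atomic_read(&ptc->anon_map_count));
                        atomic_inc(&ptc->file_map_count);
                }
        }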
      
      Link: https://lkml.kernel.org/r/20211221154650.1047963-4-pasha.tatashin@soleen.com
      Signed-off-by: Pasha Tatashin <pasha.tatashin@soleen.com>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Frederic Weisbecker <frederic@kernel.org>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Jiri Slaby <jirislaby@kernel.org>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Masahiro Yamada <masahiroy@kernel.org>
      Cc: Mike Rapoport <rppt@kernel.org>
      Cc: Muchun Song <songmuchun@bytedance.com>
      Cc: Paul Turner <pjt@google.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Sami Tolvanen <samitolvanen@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Wei Xu <weixugc@google.com>
      Cc: Will Deacon <will@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/memremap: add ZONE_DEVICE support for compound pages · c4386bd8
      Authored by Joao Martins
      Add a new @vmemmap_shift property for struct dev_pagemap which specifies
      that a devmap is composed of a set of compound pages of order
      @vmemmap_shift, instead of base pages.  When a compound page devmap is
      requested, all but the first page are initialised as tail pages instead
      of order-0 pages.
      
      For certain ZONE_DEVICE users like device-dax which have a fixed page
      size, this creates an opportunity to optimize GUP and GUP-fast walkers,
      treating it the same way as THP or hugetlb pages.
      
      Additionally, commit 7118fc29 ("hugetlb: address ref count racing in
      prep_compound_gigantic_page") removed set_page_count() because setting
      the page ref count to zero was redundant.  devmap pages don't come from
      the page allocator, though, and only the head page refcount is used for
      compound pages, hence initialize the tail page count to zero.
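
      A sketch of the initialization loop implied by this description
      (pgmap_vmemmap_nr() is assumed to expand to 1UL << pgmap->vmemmap_shift,
      and memmap_init_compound() is an assumed helper name; the merged code
      may differ in detail):

        /* In memmap_init_zone_device(): step one compound page at a time,
         * initializing the head page and then its tails. */
        pfns_per_compound = pgmap_vmemmap_nr(pgmap);
        for (pfn = start_pfn; pfn < end_pfn; pfn += pfns_per_compound) {
                struct page *page = pfn_to_page(pfn);

                __init_zone_device_page(page, pfn, zone_idx, nid, pgmap);

                if (pfns_per_compound == 1)
                        continue;

                /* tails are initialized as compound tail pages with a
                 * refcount of zero, as explained above */
                memmap_init_compound(page, pfn, zone_idx, nid, pgmap,
                                     pfns_per_compound);
        }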
      
      Link: https://lkml.kernel.org/r/20211202204422.26777-5-joao.m.martins@oracle.com
      Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
      Reviewed-by: Dan Williams <dan.j.williams@intel.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Dave Jiang <dave.jiang@intel.com>
      Cc: Jane Chu <jane.chu@oracle.com>
      Cc: Jason Gunthorpe <jgg@nvidia.com>
      Cc: Jason Gunthorpe <jgg@ziepe.ca>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Muchun Song <songmuchun@bytedance.com>
      Cc: Naoya Horiguchi <naoya.horiguchi@nec.com>
      Cc: Vishal Verma <vishal.l.verma@intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/page_alloc: refactor memmap_init_zone_device() page init · 46487e00
      Authored by Joao Martins
      Move the struct page init to a helper function __init_zone_device_page().
      
      This is in preparation for sharing the storage for compound page
      metadata.
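
      A sketch of the extracted helper (the body is condensed from what
      memmap_init_zone_device() already does per page; the merged version
      may differ):

        static void __ref __init_zone_device_page(struct page *page,
                                                  unsigned long pfn,
                                                  unsigned long zone_idx,
                                                  int nid,
                                                  struct dev_pagemap *pgmap)
        {
                __init_single_page(page, pfn, zone_idx, nid);

                /* ZONE_DEVICE pages carry their pgmap and stay reserved
                 * until the device owner makes them usable. */
                page->pgmap = pgmap;
                page->zone_device_data = NULL;
                __SetPageReserved(page);
        }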
      
      Link: https://lkml.kernel.org/r/20211202204422.26777-4-joao.m.martins@oracle.com
      Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
      Reviewed-by: Dan Williams <dan.j.williams@intel.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Dave Jiang <dave.jiang@intel.com>
      Cc: Jane Chu <jane.chu@oracle.com>
      Cc: Jason Gunthorpe <jgg@nvidia.com>
      Cc: Jason Gunthorpe <jgg@ziepe.ca>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Muchun Song <songmuchun@bytedance.com>
      Cc: Naoya Horiguchi <naoya.horiguchi@nec.com>
      Cc: Vishal Verma <vishal.l.verma@intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/page_alloc: split prep_compound_page into head and tail subparts · 5b24eeef
      Authored by Joao Martins
      Patch series "mm, device-dax: Introduce compound pages in devmap", v7.
      
      This series converts device-dax to use compound pages, and moves away
      from the 'struct page per basepage on PMD/PUD' that is done today.
      
      Doing so
       1) unlocks a few noticeable improvements on unpin_user_pages() and
          makes the device-dax+altmap case 4x faster in pinning (numbers
          below and in the last patch)
       2) as mentioned in various other threads, it's one important step
          towards cleaning up ZONE_DEVICE refcounting.
      
      I've split the compound pages on devmap part from the rest based on
      recent discussions on pending and future devmap work [5][6].  There is
      consensus that device-dax should be using compound pages to represent
      its PMD/PUDs just like HugeTLB and THP, and that leads to less
      specialization of the dax parts.  I will pursue the rest of the work in
      parallel once this part is merged, in particular the GUP-{slow,fast}
      improvements [7] and the tail struct page deduplication memory savings
      part [8].
      
      To summarize what the series does:
      
      Patch 1: Prepare hwpoisoning to work with dax compound pages.
      
      Patches 2-3: Split the current utility function of prep_compound_page()
      into head and tail and use those two helpers where appropriate to take
      advantage of caches being warm after __init_single_page().  This is used
      when initializing zone device when we bring up device-dax namespaces.
      
      Patches 4-10: Add devmap support for compound pages in device-dax.
      memmap_init_zone_device() initializes its metadata as compound pages,
      and a new devmap property known as vmemmap_shift is introduced which
      outlines how the vmemmap is structured (defaulting to base pages as done
      today).  The property essentially describes the page order of the
      metadata.  While at it, do a few cleanups in device-dax in patches 5-9.
      Finally, device-dax sets the devmap @vmemmap_shift to a value based on
      its own @align property.  @vmemmap_shift returns 0 by default (which is
      today's case of base pages in devmap, like fsdax or the others) and the
      usage of compound devmap is optional.  Starting with device-dax (*not*
      fsdax) we enable it by default.  There are a few pinning improvements,
      particularly in the unpinning case and with altmap, as well as
      unpin_user_page_range_dirty_lock() being just as effective as
      THP/hugetlb[0] pages.
      
          $ gup_test -f /dev/dax1.0 -m 16384 -r 10 -S -a -n 512 -w
          (pin_user_pages_fast 2M pages) put:~71 ms -> put:~22 ms
          [altmap]
          (pin_user_pages_fast 2M pages) get:~524ms put:~525 ms -> get: ~127ms put:~71ms
      
          $ gup_test -f /dev/dax1.0 -m 129022 -r 10 -S -a -n 512 -w
          (pin_user_pages_fast 2M pages) put:~513 ms -> put:~188 ms
          [altmap with -m 127004]
          (pin_user_pages_fast 2M pages) get:~4.1 secs put:~4.12 secs -> get:~1sec put:~563ms
      
      Tested on x86 with 1TB+ of pmem (alongside registering it with RDMA,
      with and without altmap), together with gup_test selftests using dynamic
      dax regions and static dax regions, coupled with ndctl unit tests for
      dynamic dax devices that exercise all of this.  Note: for dynamic dax
      regions I had to revert commit 8aa83e63 ("x86/setup: Call
      early_reserve_memory() earlier"); it is a known issue that this commit
      broke efi_fake_mem=.
      
      This patch (of 11):
      
      Split the utility function prep_compound_page() into head and tail
      counterparts, and use them accordingly.
      
      This is in preparation for sharing the storage for compound page
      metadata.
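
      A sketch of the split, using the helper names from the title (bodies
      condensed from what prep_compound_page() already does; the merged
      version may differ):

        static void prep_compound_head(struct page *page, unsigned int order)
        {
                set_compound_page_dtor(page, COMPOUND_PAGE_DTOR);
                set_compound_order(page, order);
                atomic_set(compound_mapcount_ptr(page), -1);
        }

        static void prep_compound_tail(struct page *head, int tail_idx)
        {
                struct page *p = head + tail_idx;

                p->mapping = TAIL_MAPPING;
                set_compound_head(p, head);
        }

        void prep_compound_page(struct page *page, unsigned int order)
        {
                int i;

                __SetPageHead(page);
                for (i = 1; i < (1 << order); i++)
                        prep_compound_tail(page, i);

                prep_compound_head(page, order);
        }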
      
      Link: https://lkml.kernel.org/r/20211202204422.26777-1-joao.m.martins@oracle.com
      Link: https://lkml.kernel.org/r/20211202204422.26777-3-joao.m.martins@oracle.com
      Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
      Acked-by: Mike Kravetz <mike.kravetz@oracle.com>
      Reviewed-by: Dan Williams <dan.j.williams@intel.com>
      Reviewed-by: Muchun Song <songmuchun@bytedance.com>
      Cc: Vishal Verma <vishal.l.verma@intel.com>
      Cc: Dave Jiang <dave.jiang@intel.com>
      Cc: Naoya Horiguchi <naoya.horiguchi@nec.com>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Jason Gunthorpe <jgg@ziepe.ca>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: Jane Chu <jane.chu@oracle.com>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Jason Gunthorpe <jgg@nvidia.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  2. 07 November 2021, 17 commits
    • mm/page_alloc: remove the throttling logic from the page allocator · 132b0d21
      Authored by Mel Gorman
      The page allocator stalls based on the number of pages that are waiting
      for writeback to start, but this should now be redundant.
      shrink_inactive_list() will wake flusher threads if the LRU tail
      consists of unqueued dirty pages, so the flusher should be active.  If
      it fails to make progress because pages under writeback are not being
      completed quickly, then it should stall on VMSCAN_THROTTLE_WRITEBACK.
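
      The removed logic looked roughly like the following check in
      should_reclaim_retry() (a sketch; the exact removed hunk may differ):

        /* Removed: stall the allocator when more than half of the
         * reclaimable pages are waiting for writeback to start; vmscan
         * now throttles on VMSCAN_THROTTLE_WRITEBACK instead. */
        if (!did_some_progress) {
                unsigned long write_pending;

                write_pending = zone_page_state_snapshot(zone,
                                                NR_ZONE_WRITE_PENDING);

                if (2 * write_pending > reclaimable) {
                        congestion_wait(BLK_RW_ASYNC, HZ/10);
                        return true;
                }
        }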
      
      Link: https://lkml.kernel.org/r/20211022144651.19914-6-mgorman@techsingularity.net
      Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Andreas Dilger <adilger.kernel@dilger.ca>
      Cc: "Darrick J . Wong" <djwong@kernel.org>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: NeilBrown <neilb@suse.de>
      Cc: Rik van Riel <riel@surriel.com>
      Cc: "Theodore Ts'o" <tytso@mit.edu>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/vmscan: throttle reclaim until some writeback completes if congested · 8cd7c588
      Authored by Mel Gorman
      Patch series "Remove dependency on congestion_wait in mm/", v5.
      
      This series removes all calls to congestion_wait in mm/ and deletes
      wait_iff_congested.  It's not a clever implementation, but
      congestion_wait has been broken for a long time [1].
      
      Even if congestion throttling worked, it was never a great idea.  While
      excessive dirty/writeback pages at the tail of the LRU is one reason
      reclaim may be slow, there is also the problem of too many pages being
      isolated and reclaim failing for other reasons (elevated references,
      too many pages isolated, excessive LRU contention etc).
      
      This series replaces the "congestion" throttling with 3 different types.
      
       - If there are too many dirty/writeback pages, sleep until a timeout or
         enough pages get cleaned
      
       - If too many pages are isolated, sleep until enough isolated pages are
         either reclaimed or put back on the LRU
      
       - If no progress is being made, direct reclaim tasks sleep until
         another task makes progress with acceptable efficiency.
      
      This was initially tested with a mix of workloads that used to trigger
      corner cases that no longer work.  A new test case was created called
      "stutterp" (pagereclaim-stutterp-noreaders in mmtests) using a freshly
      created XFS filesystem.  Note that it may be necessary to increase the
      timeout of ssh if executing remotely as ssh itself can get throttled and
      the connection may timeout.
      
      stutterp varies the number of "worker" processes from 4 up to NR_CPUS*4
      to check the impact as the number of direct reclaimers increase.  It has
      four types of worker.
      
       - One "anon latency" worker creates small mappings with mmap() and
         times how long it takes to fault the mapping reading it 4K at a time
      
       - X file writers which is fio randomly writing X files where the total
         size of the files add up to the allowed dirty_ratio. fio is allowed
         to run for a warmup period to allow some file-backed pages to
         accumulate. The duration of the warmup is based on the best-case
         linear write speed of the storage.
      
       - Y file readers which is fio randomly reading small files
      
       - Z anon memory hogs which continually map (100-dirty_ratio)% of memory
      
       - Total estimated WSS = (100+dirty_ratio) percentage of memory
      
      X+Y+Z+1 == NR_WORKERS varying from 4 up to NR_CPUS*4
      
      The intent is to maximise the total WSS with a mix of file and anon
      memory where some anonymous memory must be swapped and there is a high
      likelihood of dirty/writeback pages reaching the end of the LRU.
      
      The test can be configured to have no background readers to stress
      dirty/writeback pages.  The results below are based on having zero
      readers.
      
      The short summary of the results is that the series works and stalls
      until some event occurs but the timeouts may need adjustment.
      
      The test results are not broken down by patch as the series should be
      treated as one block that replaces a broken throttling mechanism with a
      working one.
      
      Finally, three machines were tested but I'm reporting the worst set of
      results.  The other two machines had much better latencies for example.
      
      First, the results of the "anon latency" worker
      
        stutterp
                                      5.15.0-rc1             5.15.0-rc1
                                         vanilla mm-reclaimcongest-v5r4
        Amean     mmap-4      31.4003 (   0.00%)   2661.0198 (-8374.52%)
        Amean     mmap-7      38.1641 (   0.00%)    149.2891 (-291.18%)
        Amean     mmap-12     60.0981 (   0.00%)    187.8105 (-212.51%)
        Amean     mmap-21    161.2699 (   0.00%)    213.9107 ( -32.64%)
        Amean     mmap-30    174.5589 (   0.00%)    377.7548 (-116.41%)
        Amean     mmap-48   8106.8160 (   0.00%)   1070.5616 (  86.79%)
        Stddev    mmap-4      41.3455 (   0.00%)  27573.9676 (-66591.66%)
        Stddev    mmap-7      53.5556 (   0.00%)   4608.5860 (-8505.23%)
        Stddev    mmap-12    171.3897 (   0.00%)   5559.4542 (-3143.75%)
        Stddev    mmap-21   1506.6752 (   0.00%)   5746.2507 (-281.39%)
        Stddev    mmap-30    557.5806 (   0.00%)   7678.1624 (-1277.05%)
        Stddev    mmap-48  61681.5718 (   0.00%)  14507.2830 (  76.48%)
        Max-90    mmap-4      31.4243 (   0.00%)     83.1457 (-164.59%)
        Max-90    mmap-7      41.0410 (   0.00%)     41.0720 (  -0.08%)
        Max-90    mmap-12     66.5255 (   0.00%)     53.9073 (  18.97%)
        Max-90    mmap-21    146.7479 (   0.00%)    105.9540 (  27.80%)
        Max-90    mmap-30    193.9513 (   0.00%)     64.3067 (  66.84%)
        Max-90    mmap-48    277.9137 (   0.00%)    591.0594 (-112.68%)
        Max       mmap-4    1913.8009 (   0.00%) 299623.9695 (-15555.96%)
        Max       mmap-7    2423.9665 (   0.00%) 204453.1708 (-8334.65%)
        Max       mmap-12   6845.6573 (   0.00%) 221090.3366 (-3129.64%)
        Max       mmap-21  56278.6508 (   0.00%) 213877.3496 (-280.03%)
        Max       mmap-30  19716.2990 (   0.00%) 216287.6229 (-997.00%)
        Max       mmap-48 477923.9400 (   0.00%) 245414.8238 (  48.65%)
      
      For most thread counts, the time to mmap() is unfortunately increased.
      In earlier versions of the series this was lower, but a large number of
      throttling events were reaching their timeout, increasing the amount of
      inefficient scanning of the LRU.  There is no prioritisation of reclaim
      tasks making progress based on each task's rate of page allocation
      versus progress of reclaim.  The variance is also impacted for high
      worker counts, but in all cases the differences in latency are not
      statistically significant due to very large maximum outliers.  Max-90
      shows that 90% of the stalls are comparable, but the Max results show
      the massive outliers which are increased due to stalling.
      
      It is expected that this will be very machine dependent.  Due to the
      test design, reclaim is difficult, so allocations stall and there are
      variances depending on whether THPs can be allocated or not.  The amount
      of memory will affect exactly how bad the corner cases are and how often
      they trigger.  The warmup period calculation is not ideal as it's based
      on linear writes, whereas fio is randomly writing multiple files from
      multiple tasks, so the start state of the test is variable.  For
      example, these are the latencies on a single-socket machine that had
      more memory:
      
        Amean     mmap-4      42.2287 (   0.00%)     49.6838 * -17.65%*
        Amean     mmap-7     216.4326 (   0.00%)     47.4451 *  78.08%*
        Amean     mmap-12   2412.0588 (   0.00%)     51.7497 (  97.85%)
        Amean     mmap-21   5546.2548 (   0.00%)     51.8862 (  99.06%)
        Amean     mmap-30   1085.3121 (   0.00%)     72.1004 (  93.36%)
      
      The overall system CPU usage and elapsed time are as follows
      
                          5.15.0-rc3  5.15.0-rc3
                             vanilla mm-reclaimcongest-v5r4
        Duration User        6989.03      983.42
        Duration System      7308.12      799.68
        Duration Elapsed     2277.67     2092.98
      
      The patches reduce system CPU usage by 89% as the vanilla kernel is rarely
      stalling.
      
      The high-level /proc/vmstat stats show
      
                                             5.15.0-rc1     5.15.0-rc1
                                                vanilla mm-reclaimcongest-v5r2
        Ops Direct pages scanned          1056608451.00   503594991.00
        Ops Kswapd pages scanned           109795048.00   147289810.00
        Ops Kswapd pages reclaimed          63269243.00    31036005.00
        Ops Direct pages reclaimed          10803973.00     6328887.00
        Ops Kswapd efficiency %                   57.62          21.07
        Ops Kswapd velocity                    48204.98       57572.86
        Ops Direct efficiency %                    1.02           1.26
        Ops Direct velocity                   463898.83      196845.97
      
      Kswapd scanned fewer pages, but the detailed pattern is different.  The
      vanilla kernel scans slowly over time, whereas the patched kernel
      exhibits bursts of scan activity.  Direct reclaim scanning is reduced by
      52% due to stalling.
      
      The pattern for stealing pages is also slightly different.  Both kernels
      exhibit spikes, but the vanilla kernel when reclaiming shows pages being
      reclaimed over a period of time, whereas the patched kernel tends to
      reclaim in spikes.  The difference is that vanilla is not throttling and
      instead constantly scanning, finding some pages over time, whereas the
      patched kernel throttles and reclaims in spikes.
      
        Ops Percentage direct scans               90.59          77.37
      
      For direct reclaim, vanilla scanned 90.59% of pages, whereas with the
      patches 77.37% were direct reclaim, due to throttling.
      
        Ops Page writes by reclaim           2613590.00     1687131.00
      
      Page writes from reclaim context are reduced.
      
        Ops Page writes anon                 2932752.00     1917048.00
      
      And there is less swapping.
      
        Ops Page reclaim immediate         996248528.00   107664764.00
      
      The number of pages encountered at the tail of the LRU tagged for
      immediate reclaim but still dirty/writeback is reduced by 89%.
      
        Ops Slabs scanned                     164284.00      153608.00
      
      Slab scan activity is similar.
      
      ftrace was used to gather stall activity
      
        Vanilla
        -------
            1 writeback_wait_iff_congested: usec_timeout=100000 usec_delayed=16000
            2 writeback_wait_iff_congested: usec_timeout=100000 usec_delayed=12000
            8 writeback_wait_iff_congested: usec_timeout=100000 usec_delayed=8000
           29 writeback_wait_iff_congested: usec_timeout=100000 usec_delayed=4000
        82394 writeback_wait_iff_congested: usec_timeout=100000 usec_delayed=0
      
      The vast majority of wait_iff_congested calls do not stall at all.  What
      is likely happening is that cond_resched() reschedules the task for a
      short period when the BDI is not registering congestion (which it never
      will in this test setup).
      
            1 writeback_congestion_wait: usec_timeout=100000 usec_delayed=120000
            2 writeback_congestion_wait: usec_timeout=100000 usec_delayed=132000
            4 writeback_congestion_wait: usec_timeout=100000 usec_delayed=112000
          380 writeback_congestion_wait: usec_timeout=100000 usec_delayed=108000
          778 writeback_congestion_wait: usec_timeout=100000 usec_delayed=104000
      
      congestion_wait, if called, always exceeds the timeout as there is no
      trigger to wake it up.
      
      Bottom line: Vanilla will throttle but it's not effective.
      
      Patch series
      ------------
      
      Kswapd throttle activity was always due to scanning pages tagged for
      immediate reclaim at the tail of the LRU
      
            1 usec_timeout=100000 usect_delayed=72000 reason=VMSCAN_THROTTLE_WRITEBACK
            4 usec_timeout=100000 usect_delayed=20000 reason=VMSCAN_THROTTLE_WRITEBACK
            5 usec_timeout=100000 usect_delayed=12000 reason=VMSCAN_THROTTLE_WRITEBACK
            6 usec_timeout=100000 usect_delayed=16000 reason=VMSCAN_THROTTLE_WRITEBACK
           11 usec_timeout=100000 usect_delayed=100000 reason=VMSCAN_THROTTLE_WRITEBACK
           11 usec_timeout=100000 usect_delayed=8000 reason=VMSCAN_THROTTLE_WRITEBACK
           94 usec_timeout=100000 usect_delayed=0 reason=VMSCAN_THROTTLE_WRITEBACK
          112 usec_timeout=100000 usect_delayed=4000 reason=VMSCAN_THROTTLE_WRITEBACK
      
      The majority of events did not stall or stalled for a short period.
      Roughly 16% of stalls reached the timeout before expiry.  For direct
      reclaim, the number of times stalled for each reason were
      
         6624 reason=VMSCAN_THROTTLE_ISOLATED
        93246 reason=VMSCAN_THROTTLE_NOPROGRESS
        96934 reason=VMSCAN_THROTTLE_WRITEBACK
      
      The most common reason to stall was excessive pages tagged for immediate
      reclaim at the tail of the LRU, followed by a failure to make forward
      progress.  A relatively small number were due to too many pages isolated
      from the LRU by parallel threads.
      
      For VMSCAN_THROTTLE_ISOLATED, the breakdown of delays was
      
            9 usec_timeout=20000 usect_delayed=4000 reason=VMSCAN_THROTTLE_ISOLATED
           12 usec_timeout=20000 usect_delayed=16000 reason=VMSCAN_THROTTLE_ISOLATED
           83 usec_timeout=20000 usect_delayed=20000 reason=VMSCAN_THROTTLE_ISOLATED
         6520 usec_timeout=20000 usect_delayed=0 reason=VMSCAN_THROTTLE_ISOLATED
      
      Most did not stall at all.  A small number reached the timeout.
      
      For VMSCAN_THROTTLE_NOPROGRESS, the breakdown of stalls was all over
      the map
      
            1 usec_timeout=500000 usect_delayed=324000 reason=VMSCAN_THROTTLE_NOPROGRESS
            1 usec_timeout=500000 usect_delayed=332000 reason=VMSCAN_THROTTLE_NOPROGRESS
            1 usec_timeout=500000 usect_delayed=348000 reason=VMSCAN_THROTTLE_NOPROGRESS
            1 usec_timeout=500000 usect_delayed=360000 reason=VMSCAN_THROTTLE_NOPROGRESS
            2 usec_timeout=500000 usect_delayed=228000 reason=VMSCAN_THROTTLE_NOPROGRESS
            2 usec_timeout=500000 usect_delayed=260000 reason=VMSCAN_THROTTLE_NOPROGRESS
            2 usec_timeout=500000 usect_delayed=340000 reason=VMSCAN_THROTTLE_NOPROGRESS
            2 usec_timeout=500000 usect_delayed=364000 reason=VMSCAN_THROTTLE_NOPROGRESS
            2 usec_timeout=500000 usect_delayed=372000 reason=VMSCAN_THROTTLE_NOPROGRESS
            2 usec_timeout=500000 usect_delayed=428000 reason=VMSCAN_THROTTLE_NOPROGRESS
            2 usec_timeout=500000 usect_delayed=460000 reason=VMSCAN_THROTTLE_NOPROGRESS
            2 usec_timeout=500000 usect_delayed=464000 reason=VMSCAN_THROTTLE_NOPROGRESS
            3 usec_timeout=500000 usect_delayed=244000 reason=VMSCAN_THROTTLE_NOPROGRESS
            3 usec_timeout=500000 usect_delayed=252000 reason=VMSCAN_THROTTLE_NOPROGRESS
            3 usec_timeout=500000 usect_delayed=272000 reason=VMSCAN_THROTTLE_NOPROGRESS
            4 usec_timeout=500000 usect_delayed=188000 reason=VMSCAN_THROTTLE_NOPROGRESS
            4 usec_timeout=500000 usect_delayed=268000 reason=VMSCAN_THROTTLE_NOPROGRESS
            4 usec_timeout=500000 usect_delayed=328000 reason=VMSCAN_THROTTLE_NOPROGRESS
            4 usec_timeout=500000 usect_delayed=380000 reason=VMSCAN_THROTTLE_NOPROGRESS
            4 usec_timeout=500000 usect_delayed=392000 reason=VMSCAN_THROTTLE_NOPROGRESS
            4 usec_timeout=500000 usect_delayed=432000 reason=VMSCAN_THROTTLE_NOPROGRESS
            5 usec_timeout=500000 usect_delayed=204000 reason=VMSCAN_THROTTLE_NOPROGRESS
            5 usec_timeout=500000 usect_delayed=220000 reason=VMSCAN_THROTTLE_NOPROGRESS
            5 usec_timeout=500000 usect_delayed=412000 reason=VMSCAN_THROTTLE_NOPROGRESS
            5 usec_timeout=500000 usect_delayed=436000 reason=VMSCAN_THROTTLE_NOPROGRESS
            6 usec_timeout=500000 usect_delayed=488000 reason=VMSCAN_THROTTLE_NOPROGRESS
            7 usec_timeout=500000 usect_delayed=212000 reason=VMSCAN_THROTTLE_NOPROGRESS
            7 usec_timeout=500000 usect_delayed=300000 reason=VMSCAN_THROTTLE_NOPROGRESS
            7 usec_timeout=500000 usect_delayed=316000 reason=VMSCAN_THROTTLE_NOPROGRESS
            7 usec_timeout=500000 usect_delayed=472000 reason=VMSCAN_THROTTLE_NOPROGRESS
            8 usec_timeout=500000 usect_delayed=248000 reason=VMSCAN_THROTTLE_NOPROGRESS
            8 usec_timeout=500000 usect_delayed=356000 reason=VMSCAN_THROTTLE_NOPROGRESS
            8 usec_timeout=500000 usect_delayed=456000 reason=VMSCAN_THROTTLE_NOPROGRESS
            9 usec_timeout=500000 usect_delayed=124000 reason=VMSCAN_THROTTLE_NOPROGRESS
            9 usec_timeout=500000 usect_delayed=376000 reason=VMSCAN_THROTTLE_NOPROGRESS
            9 usec_timeout=500000 usect_delayed=484000 reason=VMSCAN_THROTTLE_NOPROGRESS
           10 usec_timeout=500000 usect_delayed=172000 reason=VMSCAN_THROTTLE_NOPROGRESS
           10 usec_timeout=500000 usect_delayed=420000 reason=VMSCAN_THROTTLE_NOPROGRESS
           10 usec_timeout=500000 usect_delayed=452000 reason=VMSCAN_THROTTLE_NOPROGRESS
           11 usec_timeout=500000 usect_delayed=256000 reason=VMSCAN_THROTTLE_NOPROGRESS
           12 usec_timeout=500000 usect_delayed=112000 reason=VMSCAN_THROTTLE_NOPROGRESS
           12 usec_timeout=500000 usect_delayed=116000 reason=VMSCAN_THROTTLE_NOPROGRESS
           12 usec_timeout=500000 usect_delayed=144000 reason=VMSCAN_THROTTLE_NOPROGRESS
           12 usec_timeout=500000 usect_delayed=152000 reason=VMSCAN_THROTTLE_NOPROGRESS
           12 usec_timeout=500000 usect_delayed=264000 reason=VMSCAN_THROTTLE_NOPROGRESS
           12 usec_timeout=500000 usect_delayed=384000 reason=VMSCAN_THROTTLE_NOPROGRESS
           12 usec_timeout=500000 usect_delayed=424000 reason=VMSCAN_THROTTLE_NOPROGRESS
           12 usec_timeout=500000 usect_delayed=492000 reason=VMSCAN_THROTTLE_NOPROGRESS
           13 usec_timeout=500000 usect_delayed=184000 reason=VMSCAN_THROTTLE_NOPROGRESS
           13 usec_timeout=500000 usect_delayed=444000 reason=VMSCAN_THROTTLE_NOPROGRESS
           14 usec_timeout=500000 usect_delayed=308000 reason=VMSCAN_THROTTLE_NOPROGRESS
           14 usec_timeout=500000 usect_delayed=440000 reason=VMSCAN_THROTTLE_NOPROGRESS
           14 usec_timeout=500000 usect_delayed=476000 reason=VMSCAN_THROTTLE_NOPROGRESS
           16 usec_timeout=500000 usect_delayed=140000 reason=VMSCAN_THROTTLE_NOPROGRESS
           17 usec_timeout=500000 usect_delayed=232000 reason=VMSCAN_THROTTLE_NOPROGRESS
           17 usec_timeout=500000 usect_delayed=240000 reason=VMSCAN_THROTTLE_NOPROGRESS
           17 usec_timeout=500000 usect_delayed=280000 reason=VMSCAN_THROTTLE_NOPROGRESS
           18 usec_timeout=500000 usect_delayed=404000 reason=VMSCAN_THROTTLE_NOPROGRESS
           20 usec_timeout=500000 usect_delayed=148000 reason=VMSCAN_THROTTLE_NOPROGRESS
           20 usec_timeout=500000 usect_delayed=216000 reason=VMSCAN_THROTTLE_NOPROGRESS
           20 usec_timeout=500000 usect_delayed=468000 reason=VMSCAN_THROTTLE_NOPROGRESS
           21 usec_timeout=500000 usect_delayed=448000 reason=VMSCAN_THROTTLE_NOPROGRESS
           23 usec_timeout=500000 usect_delayed=168000 reason=VMSCAN_THROTTLE_NOPROGRESS
           23 usec_timeout=500000 usect_delayed=296000 reason=VMSCAN_THROTTLE_NOPROGRESS
           25 usec_timeout=500000 usect_delayed=132000 reason=VMSCAN_THROTTLE_NOPROGRESS
           25 usec_timeout=500000 usect_delayed=352000 reason=VMSCAN_THROTTLE_NOPROGRESS
           26 usec_timeout=500000 usect_delayed=180000 reason=VMSCAN_THROTTLE_NOPROGRESS
           27 usec_timeout=500000 usect_delayed=284000 reason=VMSCAN_THROTTLE_NOPROGRESS
           28 usec_timeout=500000 usect_delayed=164000 reason=VMSCAN_THROTTLE_NOPROGRESS
           29 usec_timeout=500000 usect_delayed=136000 reason=VMSCAN_THROTTLE_NOPROGRESS
           30 usec_timeout=500000 usect_delayed=200000 reason=VMSCAN_THROTTLE_NOPROGRESS
           30 usec_timeout=500000 usect_delayed=400000 reason=VMSCAN_THROTTLE_NOPROGRESS
           31 usec_timeout=500000 usect_delayed=196000 reason=VMSCAN_THROTTLE_NOPROGRESS
           32 usec_timeout=500000 usect_delayed=156000 reason=VMSCAN_THROTTLE_NOPROGRESS
           33 usec_timeout=500000 usect_delayed=224000 reason=VMSCAN_THROTTLE_NOPROGRESS
           35 usec_timeout=500000 usect_delayed=128000 reason=VMSCAN_THROTTLE_NOPROGRESS
           35 usec_timeout=500000 usect_delayed=176000 reason=VMSCAN_THROTTLE_NOPROGRESS
           36 usec_timeout=500000 usect_delayed=368000 reason=VMSCAN_THROTTLE_NOPROGRESS
           36 usec_timeout=500000 usect_delayed=496000 reason=VMSCAN_THROTTLE_NOPROGRESS
           37 usec_timeout=500000 usect_delayed=312000 reason=VMSCAN_THROTTLE_NOPROGRESS
           38 usec_timeout=500000 usect_delayed=304000 reason=VMSCAN_THROTTLE_NOPROGRESS
           40 usec_timeout=500000 usect_delayed=288000 reason=VMSCAN_THROTTLE_NOPROGRESS
           43 usec_timeout=500000 usect_delayed=408000 reason=VMSCAN_THROTTLE_NOPROGRESS
           55 usec_timeout=500000 usect_delayed=416000 reason=VMSCAN_THROTTLE_NOPROGRESS
           56 usec_timeout=500000 usect_delayed=76000 reason=VMSCAN_THROTTLE_NOPROGRESS
           58 usec_timeout=500000 usect_delayed=120000 reason=VMSCAN_THROTTLE_NOPROGRESS
           59 usec_timeout=500000 usect_delayed=208000 reason=VMSCAN_THROTTLE_NOPROGRESS
           61 usec_timeout=500000 usect_delayed=68000 reason=VMSCAN_THROTTLE_NOPROGRESS
           71 usec_timeout=500000 usect_delayed=192000 reason=VMSCAN_THROTTLE_NOPROGRESS
           71 usec_timeout=500000 usect_delayed=480000 reason=VMSCAN_THROTTLE_NOPROGRESS
           79 usec_timeout=500000 usect_delayed=60000 reason=VMSCAN_THROTTLE_NOPROGRESS
           82 usec_timeout=500000 usect_delayed=320000 reason=VMSCAN_THROTTLE_NOPROGRESS
           82 usec_timeout=500000 usect_delayed=92000 reason=VMSCAN_THROTTLE_NOPROGRESS
           85 usec_timeout=500000 usect_delayed=64000 reason=VMSCAN_THROTTLE_NOPROGRESS
           85 usec_timeout=500000 usect_delayed=80000 reason=VMSCAN_THROTTLE_NOPROGRESS
           88 usec_timeout=500000 usect_delayed=84000 reason=VMSCAN_THROTTLE_NOPROGRESS
           90 usec_timeout=500000 usect_delayed=160000 reason=VMSCAN_THROTTLE_NOPROGRESS
           90 usec_timeout=500000 usect_delayed=292000 reason=VMSCAN_THROTTLE_NOPROGRESS
           94 usec_timeout=500000 usect_delayed=56000 reason=VMSCAN_THROTTLE_NOPROGRESS
          118 usec_timeout=500000 usect_delayed=88000 reason=VMSCAN_THROTTLE_NOPROGRESS
          119 usec_timeout=500000 usect_delayed=72000 reason=VMSCAN_THROTTLE_NOPROGRESS
          126 usec_timeout=500000 usect_delayed=108000 reason=VMSCAN_THROTTLE_NOPROGRESS
          146 usec_timeout=500000 usect_delayed=52000 reason=VMSCAN_THROTTLE_NOPROGRESS
          148 usec_timeout=500000 usect_delayed=36000 reason=VMSCAN_THROTTLE_NOPROGRESS
          148 usec_timeout=500000 usect_delayed=48000 reason=VMSCAN_THROTTLE_NOPROGRESS
          159 usec_timeout=500000 usect_delayed=28000 reason=VMSCAN_THROTTLE_NOPROGRESS
          178 usec_timeout=500000 usect_delayed=44000 reason=VMSCAN_THROTTLE_NOPROGRESS
          183 usec_timeout=500000 usect_delayed=40000 reason=VMSCAN_THROTTLE_NOPROGRESS
          237 usec_timeout=500000 usect_delayed=100000 reason=VMSCAN_THROTTLE_NOPROGRESS
          266 usec_timeout=500000 usect_delayed=32000 reason=VMSCAN_THROTTLE_NOPROGRESS
          313 usec_timeout=500000 usect_delayed=24000 reason=VMSCAN_THROTTLE_NOPROGRESS
          347 usec_timeout=500000 usect_delayed=96000 reason=VMSCAN_THROTTLE_NOPROGRESS
          470 usec_timeout=500000 usect_delayed=20000 reason=VMSCAN_THROTTLE_NOPROGRESS
          559 usec_timeout=500000 usect_delayed=16000 reason=VMSCAN_THROTTLE_NOPROGRESS
          964 usec_timeout=500000 usect_delayed=12000 reason=VMSCAN_THROTTLE_NOPROGRESS
         2001 usec_timeout=500000 usect_delayed=104000 reason=VMSCAN_THROTTLE_NOPROGRESS
         2447 usec_timeout=500000 usect_delayed=8000 reason=VMSCAN_THROTTLE_NOPROGRESS
         7888 usec_timeout=500000 usect_delayed=4000 reason=VMSCAN_THROTTLE_NOPROGRESS
        22727 usec_timeout=500000 usect_delayed=0 reason=VMSCAN_THROTTLE_NOPROGRESS
        51305 usec_timeout=500000 usect_delayed=500000 reason=VMSCAN_THROTTLE_NOPROGRESS
      
      The full timeout is often hit but a large number also do not stall at
      all.  The remainder slept a little allowing other reclaim tasks to make
      progress.
      
      While this timeout could be further increased, it could also negatively
      impact worst-case behaviour when there is no prioritisation of what task
      should make progress.
      
      For VMSCAN_THROTTLE_WRITEBACK, the breakdown was
      
            1 usec_timeout=100000 usect_delayed=44000 reason=VMSCAN_THROTTLE_WRITEBACK
            2 usec_timeout=100000 usect_delayed=76000 reason=VMSCAN_THROTTLE_WRITEBACK
            3 usec_timeout=100000 usect_delayed=80000 reason=VMSCAN_THROTTLE_WRITEBACK
            5 usec_timeout=100000 usect_delayed=48000 reason=VMSCAN_THROTTLE_WRITEBACK
            5 usec_timeout=100000 usect_delayed=84000 reason=VMSCAN_THROTTLE_WRITEBACK
            6 usec_timeout=100000 usect_delayed=72000 reason=VMSCAN_THROTTLE_WRITEBACK
            7 usec_timeout=100000 usect_delayed=88000 reason=VMSCAN_THROTTLE_WRITEBACK
           11 usec_timeout=100000 usect_delayed=56000 reason=VMSCAN_THROTTLE_WRITEBACK
           12 usec_timeout=100000 usect_delayed=64000 reason=VMSCAN_THROTTLE_WRITEBACK
           16 usec_timeout=100000 usect_delayed=92000 reason=VMSCAN_THROTTLE_WRITEBACK
           24 usec_timeout=100000 usect_delayed=68000 reason=VMSCAN_THROTTLE_WRITEBACK
           28 usec_timeout=100000 usect_delayed=32000 reason=VMSCAN_THROTTLE_WRITEBACK
           30 usec_timeout=100000 usect_delayed=60000 reason=VMSCAN_THROTTLE_WRITEBACK
           30 usec_timeout=100000 usect_delayed=96000 reason=VMSCAN_THROTTLE_WRITEBACK
           32 usec_timeout=100000 usect_delayed=52000 reason=VMSCAN_THROTTLE_WRITEBACK
           42 usec_timeout=100000 usect_delayed=40000 reason=VMSCAN_THROTTLE_WRITEBACK
           77 usec_timeout=100000 usect_delayed=28000 reason=VMSCAN_THROTTLE_WRITEBACK
           99 usec_timeout=100000 usect_delayed=36000 reason=VMSCAN_THROTTLE_WRITEBACK
          137 usec_timeout=100000 usect_delayed=24000 reason=VMSCAN_THROTTLE_WRITEBACK
          190 usec_timeout=100000 usect_delayed=20000 reason=VMSCAN_THROTTLE_WRITEBACK
          339 usec_timeout=100000 usect_delayed=16000 reason=VMSCAN_THROTTLE_WRITEBACK
          518 usec_timeout=100000 usect_delayed=12000 reason=VMSCAN_THROTTLE_WRITEBACK
          852 usec_timeout=100000 usect_delayed=8000 reason=VMSCAN_THROTTLE_WRITEBACK
         3359 usec_timeout=100000 usect_delayed=4000 reason=VMSCAN_THROTTLE_WRITEBACK
         7147 usec_timeout=100000 usect_delayed=0 reason=VMSCAN_THROTTLE_WRITEBACK
        83962 usec_timeout=100000 usect_delayed=100000 reason=VMSCAN_THROTTLE_WRITEBACK
      
      The majority hit the timeout in direct reclaim context although a
      sizable number did not stall at all.  This is very different to kswapd
      where only a tiny percentage of stalls due to writeback reached the
      timeout.
      
      Bottom line, the throttling appears to work and the wakeup events may
      limit worst case stalls.  There might be some grounds for adjusting
      timeouts but it's likely futile as the worst-case scenarios depend on
      the workload, memory size and the speed of the storage.  A better
      approach to improve the series further would be to prioritise tasks
      based on their rate of allocation with the caveat that it may be very
      expensive to track.
      
      This patch (of 5):
      
      Page reclaim throttles on wait_iff_congested under the following
      conditions:
      
       - kswapd is encountering pages under writeback and marked for immediate
         reclaim implying that pages are cycling through the LRU faster than
         pages can be cleaned.
      
       - Direct reclaim will stall if all dirty pages are backed by congested
         inodes.
      
      wait_iff_congested is almost completely broken with few exceptions.
      This patch adds a new node-based waitqueue and tracks the number of
      throttled tasks and pages written back since throttling started.  If
      enough pages belonging to the node are written back then the throttled
      tasks will wake early.  If not, the throttled tasks sleep until the
      timeout expires.
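
      A condensed sketch of the mechanism (field and function names follow
      the description above; the merged code differs in detail, and the
      writeback completion path does the accounting and early wakeup):

        void reclaim_throttle(pg_data_t *pgdat, enum vmscan_throttle_state reason)
        {
                wait_queue_head_t *wqh = &pgdat->reclaim_wait;
                DEFINE_WAIT(wait);

                /* remember how much was written back when throttling began
                 * so the writeback path can wake this queue early */
                if (atomic_inc_return(&pgdat->nr_writeback_throttled) == 1)
                        WRITE_ONCE(pgdat->nr_reclaim_start,
                                   node_page_state(pgdat, NR_THROTTLED_WRITTEN));

                prepare_to_wait(wqh, &wait, TASK_UNINTERRUPTIBLE);
                schedule_timeout(HZ/10);
                finish_wait(wqh, &wait);

                atomic_dec(&pgdat->nr_writeback_throttled);
        }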
      
      [neilb@suse.de: Uninterruptible sleep and simpler wakeups]
      [hdanton@sina.com: Avoid race when reclaim starts]
      [vbabka@suse.cz: vmstat irq-safe api, clarifications]
      
      Link: https://lore.kernel.org/linux-mm/45d8b7a6-8548-65f5-cccf-9f451d4ae3d4@kernel.dk/ [1]
      Link: https://lkml.kernel.org/r/20211022144651.19914-1-mgorman@techsingularity.net
      Link: https://lkml.kernel.org/r/20211022144651.19914-2-mgorman@techsingularity.net
      Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: NeilBrown <neilb@suse.de>
      Cc: "Theodore Ts'o" <tytso@mit.edu>
      Cc: Andreas Dilger <adilger.kernel@dilger.ca>
      Cc: "Darrick J . Wong" <djwong@kernel.org>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Rik van Riel <riel@surriel.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: khugepaged: recalculate min_free_kbytes after stopping khugepaged · bd3400ea
      Authored by Liangcai Fan
      When transparent huge pages are initialized, min_free_kbytes is
      calculated according to what khugepaged expects.

      So when transparent huge pages get disabled, min_free_kbytes should be
      recalculated instead of keeping the higher value set by khugepaged.
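
      A sketch of the intended flow, assuming the recalculation is hooked
      into the point where khugepaged is stopped (helper name and placement
      per this description, not necessarily the verbatim patch):

        /* in start_stop_khugepaged(), when THP gets disabled: */
        if (!khugepaged_enabled() && khugepaged_thread) {
                kthread_stop(khugepaged_thread);
                khugepaged_thread = NULL;
                /* undo set_recommended_min_free_kbytes(): redo the stock
                 * watermark calculation in mm/page_alloc.c */
                recalculate_min_free_kbytes();
        }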
      
      Link: https://lkml.kernel.org/r/1633937809-16558-1-git-send-email-liangcaifan19@gmail.com
      Signed-off-by: Liangcai Fan <liangcaifan19@gmail.com>
      Signed-off-by: Chunyan Zhang <zhang.lyra@gmail.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/page_alloc: use clamp() to simplify code · 59d336bd
      Authored by Wang ShaoBo
      This patch uses clamp() to simplify code in init_per_zone_wmark_min().
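
      The simplification, sketched as a before/after in
      init_per_zone_wmark_min() (the 128 and 262144 bounds are the ones
      already used there):

        -       min_free_kbytes = new_min_free_kbytes;
        -       if (min_free_kbytes < 128)
        -               min_free_kbytes = 128;
        -       if (min_free_kbytes > 262144)
        -               min_free_kbytes = 262144;
        +       min_free_kbytes = clamp(new_min_free_kbytes, 128, 262144);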
      
      Link: https://lkml.kernel.org/r/20211021034830.1049150-1-bobo.shaobowang@huawei.com
      Signed-off-by: Wang ShaoBo <bobo.shaobowang@huawei.com>
      Reviewed-by: David Hildenbrand <david@redhat.com>
      Cc: Wei Yongjun <weiyongjun1@huawei.com>
      Cc: Li Bin <huawei.libin@huawei.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: page_alloc: use migrate_disable() in drain_local_pages_wq() · 9c25cbfc
      Authored by Sebastian Andrzej Siewior
      drain_local_pages_wq() disables preemption to avoid CPU migration during
      CPU hotplug and can't use cpus_read_lock().
      
      Using migrate_disable() works here, too.  The scheduler won't take the
      CPU offline until the task has left the migrate-disable section.  The
      problem with disabled preemption here is that drain_local_pages()
      acquires locks which are turned into sleeping locks on PREEMPT_RT and
      can't be acquired with preemption disabled.
      
      Use migrate_disable() in drain_local_pages_wq().
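
      The change, sketched on the existing pcpu_drain worker (context
      condensed; the merged hunk may differ slightly):

        static void drain_local_pages_wq(struct work_struct *work)
        {
                struct pcpu_drain *drain;

                drain = container_of(work, struct pcpu_drain, work);
                migrate_disable();      /* was: preempt_disable() */
                drain_local_pages(drain->zone);
                migrate_enable();       /* was: preempt_enable() */
        }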
      
      Link: https://lkml.kernel.org/r/20211015210933.viw6rjvo64qtqxn4@linutronix.de
      Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/page_alloc.c: show watermark_boost of zone in zoneinfo · a6ea8b5b
      Authored by Liangcai Fan
      min/low/high_wmark_pages(z) is defined as
      
        (z->_watermark[WMARK_MIN/LOW/HIGH] + z->watermark_boost)
      
      If kswapd is frequently woken up due to an increase of
      min/low/high_wmark_pages, printing watermark_boost can quickly show
      whether watermark_boost or _watermark[WMARK_MIN/LOW/HIGH] caused
      min/low/high_wmark_pages to increase.
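
      A sketch of the /proc/zoneinfo change (an assumed hunk in
      zoneinfo_show_print() in mm/vmstat.c; the surrounding format strings
      are condensed):

        seq_printf(m,
                   "\n  pages free     %lu"
                   "\n        boost    %lu"
                   "\n        min      %lu"
                   "\n        low      %lu"
                   "\n        high     %lu",
                   zone_page_state(zone, NR_FREE_PAGES),
                   zone->watermark_boost,
                   min_wmark_pages(zone),
                   low_wmark_pages(zone),
                   high_wmark_pages(zone));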
      
      Link: https://lkml.kernel.org/r/1632472566-12246-1-git-send-email-liangcaifan19@gmail.com
      Signed-off-by: Liangcai Fan <liangcaifan19@gmail.com>
      Cc: Chunyan Zhang <zhang.lyra@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/page_alloc: detect allocation forbidden by cpuset and bail out early · 8ca1b5a4
      Authored by Feng Tang
      There was a report that starting an Ubuntu container in docker while
      using cpuset to bind it to movable nodes (a node that only has a movable
      zone, like a node for hotplug or a Persistent Memory node in normal
      usage) fails due to a memory allocation failure, after which the OOM
      killer gets involved and many other innocent processes get killed.
      
      It can be reproduced with command:
      
          $ docker run -it --rm --cpuset-mems 4 ubuntu:latest bash -c "grep Mems_allowed /proc/self/status"
      
      (where node 4 is a movable node)
      
        runc:[2:INIT] invoked oom-killer: gfp_mask=0x500cc2(GFP_HIGHUSER|__GFP_ACCOUNT), order=0, oom_score_adj=0
        CPU: 8 PID: 8291 Comm: runc:[2:INIT] Tainted: G        W I E     5.8.2-0.g71b519a-default #1 openSUSE Tumbleweed (unreleased)
        Hardware name: Dell Inc. PowerEdge R640/0PHYDR, BIOS 2.6.4 04/09/2020
        Call Trace:
         dump_stack+0x6b/0x88
         dump_header+0x4a/0x1e2
         oom_kill_process.cold+0xb/0x10
         out_of_memory.part.0+0xaf/0x230
         out_of_memory+0x3d/0x80
         __alloc_pages_slowpath.constprop.0+0x954/0xa20
         __alloc_pages_nodemask+0x2d3/0x300
         pipe_write+0x322/0x590
         new_sync_write+0x196/0x1b0
         vfs_write+0x1c3/0x1f0
         ksys_write+0xa7/0xe0
         do_syscall_64+0x52/0xd0
         entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
        Mem-Info:
        active_anon:392832 inactive_anon:182 isolated_anon:0
         active_file:68130 inactive_file:151527 isolated_file:0
         unevictable:2701 dirty:0 writeback:7
         slab_reclaimable:51418 slab_unreclaimable:116300
         mapped:45825 shmem:735 pagetables:2540 bounce:0
         free:159849484 free_pcp:73 free_cma:0
        Node 4 active_anon:1448kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB mapped:0kB dirty:0kB writeback:0kB shmem:0kB shmem_thp: 0kB shmem_pmdmapped: 0kB anon_thp: 0kB writeback_tmp:0kB all_unreclaimable? no
        Node 4 Movable free:130021408kB min:9140kB low:139160kB high:269180kB reserved_highatomic:0KB active_anon:1448kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB writepending:0kB present:130023424kB managed:130023424kB mlocked:0kB kernel_stack:0kB pagetables:0kB bounce:0kB free_pcp:292kB local_pcp:84kB free_cma:0kB
        lowmem_reserve[]: 0 0 0 0 0
        Node 4 Movable: 1*4kB (M) 0*8kB 0*16kB 1*32kB (M) 0*64kB 0*128kB 1*256kB (M) 1*512kB (M) 1*1024kB (M) 0*2048kB 31743*4096kB (M) = 130021156kB
      
        oom-kill:constraint=CONSTRAINT_CPUSET,nodemask=(null),cpuset=docker-9976a269caec812c134fa317f27487ee36e1129beba7278a463dd53e5fb9997b.scope,mems_allowed=4,global_oom,task_memcg=/system.slice/containerd.service,task=containerd,pid=4100,uid=0
        Out of memory: Killed process 4100 (containerd) total-vm:4077036kB, anon-rss:51184kB, file-rss:26016kB, shmem-rss:0kB, UID:0 pgtables:676kB oom_score_adj:0
        oom_reaper: reaped process 8248 (docker), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
        oom_reaper: reaped process 2054 (node_exporter), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
        oom_reaper: reaped process 1452 (systemd-journal), now anon-rss:0kB, file-rss:8564kB, shmem-rss:4kB
        oom_reaper: reaped process 2146 (munin-node), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
        oom_reaper: reaped process 8291 (runc:[2:INIT]), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
      
      The reason is that in this case the target cpuset nodes only have a
      movable zone, while booting an OS inside docker sometimes needs to
      allocate memory in non-movable zones (dma/dma32/normal), e.g. with
      GFP_HIGHUSER.  The cpuset limit forbids such allocations, so
      out-of-memory killing kicks in even though both normal and movable
      nodes have plenty of free memory.
      
      The OOM killer cannot help resolve the situation, as there is no usable
      memory for the request within the cpuset scope.  The only reasonable
      measure to take is to fail the allocation right away and have the
      caller deal with it.
      
      So add a check for cases like this in the allocation slowpath and bail
      out early, returning NULL for the allocation.

      As page allocation is one of the hottest paths in the kernel, such a
      check would hurt all users with a sane cpuset configuration.  So add a
      static branch, and detect the abnormal config during cpuset memory
      binding setup, so that the cost of the extra check in page allocation
      is not paid by everyone.
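
      A hedged sketch of the early bail-out in __alloc_pages_slowpath();
      cpusets_insane_config() names the static-branch gate after the helper
      this patch introduces, but treat the exact shape as an approximation:

        if (cpusets_insane_config() && (gfp_mask & __GFP_HARDWALL)) {
                /*
                 * No zone in the allowed nodes can satisfy a hardwalled
                 * !MOVABLE request; fail fast rather than OOM-kill.
                 */
                struct zoneref *z = first_zones_zonelist(ac->zonelist,
                                        ac->highest_zoneidx,
                                        &cpuset_current_mems_allowed);
                if (!z->zone)
                        goto nopage;
        }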
      
      [thanks to Michal Hocko and David Rientjes for suggesting not handling
       it inside OOM code, adding the cpuset check, and refining comments]
      
      Link: https://lkml.kernel.org/r/1632481657-68112-1-git-send-email-feng.tang@intel.com
      Signed-off-by: NFeng Tang <feng.tang@intel.com>
      Suggested-by: NMichal Hocko <mhocko@suse.com>
      Acked-by: NMichal Hocko <mhocko@suse.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Zefan Li <lizefan.x@bytedance.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      8ca1b5a4
    • E
      mm/page_alloc.c: do not acquire zone lock in is_free_buddy_page() · 8446b59b
      Committed by Eric Dumazet
      Grabbing the zone lock in is_free_buddy_page() gives a false sense of
      safety, and has potential performance implications when the zone is
      experiencing lock contention.
      
      In any case, if a caller needs a stable result, it should grab zone lock
      before calling this function.
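
      For callers that do need a stable answer, the pattern looks roughly
      like this (a hedged sketch, not code from the patch):

        struct zone *zone = page_zone(page);
        unsigned long flags;
        bool free;

        /* Take the zone lock ourselves if the result must be stable. */
        spin_lock_irqsave(&zone->lock, flags);
        free = is_free_buddy_page(page);
        spin_unlock_irqrestore(&zone->lock, flags);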
      
      Link: https://lkml.kernel.org/r/20210922152833.4023972-1-eric.dumazet@gmail.com
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Acked-by: NHugh Dickins <hughd@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      8446b59b
    • G
      mm: move node_reclaim_distance to fix NUMA without SMP · 61bb6cd2
      Committed by Geert Uytterhoeven
      Patch series "Fix NUMA without SMP".
      
      SuperH is the only architecture which still supports NUMA without SMP,
      for good reasons (various memories scattered around the address space,
      each with varying latencies).
      
      This series fixes two build errors due to variables and functions used
      by the NUMA code being provided by SMP-only source files or sections.
      
      This patch (of 2):
      
      If CONFIG_NUMA=y, but CONFIG_SMP=n (e.g. sh/migor_defconfig):
      
          sh4-linux-gnu-ld: mm/page_alloc.o: in function `get_page_from_freelist':
          page_alloc.c:(.text+0x2c24): undefined reference to `node_reclaim_distance'
      
      Fix this by moving the declaration of node_reclaim_distance from an
      SMP-only source file to a generic one.
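
      Assuming the definition previously lived in an SMP-only scheduler
      source file, the move amounts to something like this (a sketch, not
      the exact hunk):

        /* mm/page_alloc.c: built whenever CONFIG_NUMA=y, SMP or not */
        #ifdef CONFIG_NUMA
        int node_reclaim_distance __read_mostly = RECLAIM_DISTANCE;
        #endif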
      
      Link: https://lkml.kernel.org/r/cover.1631781495.git.geert+renesas@glider.be
      Link: https://lkml.kernel.org/r/6432666a648dde85635341e6c918cee97c97d264.1631781495.git.geert+renesas@glider.be
      Fixes: a55c7454 ("sched/topology: Improve load balancing on AMD EPYC systems")
      Signed-off-by: NGeert Uytterhoeven <geert+renesas@glider.be>
      Suggested-by: NMatt Fleming <matt@codeblueprint.co.uk>
      Acked-by: NMel Gorman <mgorman@suse.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Juri Lelli <juri.lelli@redhat.com>
      Cc: Vincent Guittot <vincent.guittot@linaro.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Yoshinori Sato <ysato@users.osdn.me>
      Cc: Rich Felker <dalias@libc.org>
      Cc: Gon Solo <gonsolo@gmail.com>
      Cc: Geert Uytterhoeven <geert+renesas@glider.be>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      61bb6cd2
    • K
      mm/page_alloc: use accumulated load when building node fallback list · 54d032ce
      Committed by Krupa Ramakrishnan
      In build_zonelists(), when the fallback list is built for the nodes,
      the node load gets reinitialized during each iteration.  This results
      in nodes with the same distance occupying the same slot in different
      node fallback lists rather than appearing in the intended round-robin
      manner, so one node gets picked for allocation more often than other
      nodes at the same distance.
      
      As an example, consider a 4 node system with the following distance
      matrix.
      
        Node 0  1  2  3
        ----------------
        0    10 12 32 32
        1    12 10 32 32
        2    32 32 10 12
        3    32 32 12 10
      
      For this case, the node fallback list gets built like this:
      
        Node  Fallback list
        ---------------------
        0     0 1 2 3
        1     1 0 3 2
        2     2 3 0 1
        3     3 2 0 1 <-- Unexpected fallback order
      
      In the fallback lists for nodes 2 and 3, nodes 0 and 1 appear in the
      same order, which results in more allocations being satisfied from node
      0 than from node 1.
      
      The effect of this on remote memory bandwidth as seen by stream
      benchmark is shown below:
      
        Case 1: Bandwidth from cores on nodes 2 & 3 to memory on nodes 0 & 1
      	(numactl -m 0,1 ./stream_lowOverhead ... --cores <from 2, 3>)
        Case 2: Bandwidth from cores on nodes 0 & 1 to memory on nodes 2 & 3
      	(numactl -m 2,3 ./stream_lowOverhead ... --cores <from 0, 1>)
      
        ----------------------------------------
      		BANDWIDTH (MB/s)
            TEST	Case 1		Case 2
        ----------------------------------------
            COPY	57479.6		110791.8
           SCALE	55372.9		105685.9
             ADD	50460.6		96734.2
          TRIADD	50397.6		97119.1
        ----------------------------------------
      
      The bandwidth drop in Case 1 occurs because most of the allocations get
      satisfied by node 0 as it appears first in the fallback order for both
      nodes 2 and 3.
      
      This can be fixed by accumulating the node load in build_zonelists()
      rather than reinitializing it during each iteration.  With this, nodes
      with the same distance are correctly assigned in a round-robin manner.
      
      In fact this was how it was originally until commit f0c0b2b8
      ("change zonelist order: zonelist order selection logic") dropped the
      load accumulation and resorted to initializing the load during each
      iteration.
      
      While zonelist ordering was removed by commit c9bff3ee ("mm,
      page_alloc: rip out ZONELIST_ORDER_ZONE"), the change to the node load
      accumulation in build_zonelists() remained.  So essentially this patch
      reverts to the original accumulated node load logic.
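
      The essence of the fix is a one-character change in the
      build_zonelists() loop; the surrounding context below is paraphrased,
      so treat the exact shape as an approximation:

        while ((node = find_next_best_node(local_node, &used_mask)) >= 0) {
                /*
                 * Penalize the first node of each distance group so that
                 * equal-distance nodes rotate round-robin across the
                 * fallback lists of different local nodes.
                 */
                if (node_distance(local_node, node) !=
                    node_distance(local_node, prev_node))
                        node_load[node] += load;        /* was: = load */

                node_order[nr_nodes++] = node;
                prev_node = node;
                load--;
        }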
      
      After this fix, the fallback order gets built like this:
      
        Node Fallback list
        ------------------
        0    0 1 2 3
        1    1 0 3 2
        2    2 3 0 1
        3    3 2 1 0 <-- Note the change here
      
      The bandwidth in Case 1 improves and matches Case 2 as shown below.
      
        ----------------------------------------
      		BANDWIDTH (MB/s)
            TEST	Case 1		Case 2
        ----------------------------------------
            COPY	110438.9	110107.2
           SCALE	105930.5	105817.5
             ADD	97005.1		96159.8
          TRIADD	97441.5		96757.1
        ----------------------------------------
      
      The correctness of the fallback list generation has been verified for
      the above node configuration, where node 3 starts as a memory-less node
      and comes online only during memory hotplug.
      
      [bharata@amd.com: Added changelog, review, test validation]
      
      Link: https://lkml.kernel.org/r/20210830121603.1081-3-bharata@amd.com
      Fixes: f0c0b2b8 ("change zonelist order: zonelist order selection logic")
      Signed-off-by: NKrupa Ramakrishnan <krupa.ramakrishnan@amd.com>
      Co-developed-by: NSadagopan Srinivasan <Sadagopan.Srinivasan@amd.com>
      Signed-off-by: NSadagopan Srinivasan <Sadagopan.Srinivasan@amd.com>
      Signed-off-by: NBharata B Rao <bharata@amd.com>
      Acked-by: NMel Gorman <mgorman@suse.de>
      Reviewed-by: NAnshuman Khandual <anshuman.khandual@arm.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Lee Schermerhorn <lee.schermerhorn@hp.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      54d032ce
    • B
      mm/page_alloc: print node fallback order · 6cf25392
      Committed by Bharata B Rao
      Patch series "Fix NUMA nodes fallback list ordering".
      
      For a NUMA system that has multiple nodes at the same distance from
      other nodes, the fallback list generation prefers the same node order
      for them instead of round-robin, thereby penalizing one node over the
      others.  This series fixes it.
      
      More description of the problem and the fix is present in the patch
      description.
      
      This patch (of 2):
      
      Print an informational message about the allocation fallback order for
      each NUMA node during boot.
      
      No functional changes here.  This makes it easier to illustrate the
      problem in the node fallback list generation, which the next patch
      fixes.
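
      The message presumably takes a form along these lines (a hypothetical
      sketch; node_order and nr_nodes are borrowed from build_zonelists()):

        int i;

        pr_info("Fallback order for Node %d: ", local_node);
        for (i = 0; i < nr_nodes; i++)
                pr_cont("%d ", node_order[i]);
        pr_cont("\n");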
      
      Link: https://lkml.kernel.org/r/20210830121603.1081-1-bharata@amd.com
      Link: https://lkml.kernel.org/r/20210830121603.1081-2-bharata@amd.com
      Signed-off-by: NBharata B Rao <bharata@amd.com>
      Acked-by: NMel Gorman <mgorman@suse.de>
      Reviewed-by: NAnshuman Khandual <anshuman.khandual@arm.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Lee Schermerhorn <lee.schermerhorn@hp.com>
      Cc: Krupa Ramakrishnan <krupa.ramakrishnan@amd.com>
      Cc: Sadagopan Srinivasan <Sadagopan.Srinivasan@amd.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      6cf25392
    • M
      mm/page_alloc.c: avoid allocating highmem pages via alloc_pages_exact[_nid] · ba7f1b9e
      Committed by Miaohe Lin
      Don't use alloc_pages_exact[_nid]() with __GFP_HIGHMEM, because
      page_address() cannot represent highmem pages without kmap().  Newly
      allocated pages would leak, as page_address() returns NULL for highmem
      pages here.  It only works today because no caller currently passes
      __GFP_HIGHMEM.
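
      A plausible form of the guard, modeled on the existing __GFP_COMP
      check in alloc_pages_exact() (a sketch, not the verbatim diff):

        void *alloc_pages_exact(size_t size, gfp_t gfp_mask)
        {
                unsigned int order = get_order(size);
                unsigned long addr;

                /* page_address() cannot map highmem without kmap(). */
                if (WARN_ON_ONCE(gfp_mask & (__GFP_COMP | __GFP_HIGHMEM)))
                        gfp_mask &= ~(__GFP_COMP | __GFP_HIGHMEM);

                addr = __get_free_pages(gfp_mask, order);
                return make_alloc_exact(addr, order, size);
        }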
      
      Link: https://lkml.kernel.org/r/20210902121242.41607-6-linmiaohe@huawei.com
      Signed-off-by: NMiaohe Lin <linmiaohe@huawei.com>
      Reviewed-by: NDavid Hildenbrand <david@redhat.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      ba7f1b9e
    • M
      mm/page_alloc.c: use helper function zone_spans_pfn() · 86fb05b9
      Committed by Miaohe Lin
      Use the helper function zone_spans_pfn() to check whether a pfn is
      within a zone, simplifying the code slightly.
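
      For reference, the helper lives in include/linux/mmzone.h and reads
      roughly as:

        static inline bool zone_spans_pfn(const struct zone *zone,
                                          unsigned long pfn)
        {
                return zone->zone_start_pfn <= pfn && pfn < zone_end_pfn(zone);
        }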
      
      Link: https://lkml.kernel.org/r/20210902121242.41607-5-linmiaohe@huawei.com
      Signed-off-by: NMiaohe Lin <linmiaohe@huawei.com>
      Acked-by: NMel Gorman <mgorman@techsingularity.net>
      Reviewed-by: NDavid Hildenbrand <david@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      86fb05b9
    • M
      mm/page_alloc.c: fix obsolete comment in free_pcppages_bulk() · 7cba630b
      Committed by Miaohe Lin
      The second and third paragraphs, about "all pages pinned" and
      pages_scanned, are obsolete.  And there are now PAGE_ALLOC_COSTLY_ORDER
      + 1 + NR_PCP_THP orders on the pcp lists, so the same-order assumption
      no longer holds.
      
      Link: https://lkml.kernel.org/r/20210902121242.41607-4-linmiaohe@huawei.com
      Signed-off-by: NMiaohe Lin <linmiaohe@huawei.com>
      Acked-by: NMel Gorman <mgorman@techsingularity.net>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      7cba630b
    • M
      mm/page_alloc.c: simplify the code by using macro K() · ff7ed9e4
      Committed by Miaohe Lin
      Use the helper macro K() to convert page counts to the corresponding
      size in KiB.  Minor readability improvement.
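
      K() is the long-standing mm convenience macro for printing page counts
      in kibibytes:

        #define K(x) ((x) << (PAGE_SHIFT - 10))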
      
      Link: https://lkml.kernel.org/r/20210902121242.41607-3-linmiaohe@huawei.com
      Signed-off-by: NMiaohe Lin <linmiaohe@huawei.com>
      Acked-by: NMel Gorman <mgorman@techsingularity.net>
      Reviewed-by: NDavid Hildenbrand <david@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      ff7ed9e4
    • M
      mm/page_alloc.c: remove meaningless VM_BUG_ON() in pindex_to_order() · ea808b4e
      Committed by Miaohe Lin
      Patch series "Cleanups and fixup for page_alloc", v2.
      
      This series contains cleanups to remove a meaningless VM_BUG_ON(), use
      helpers to simplify the code, and remove an obsolete comment.  We also
      avoid allocating highmem pages via alloc_pages_exact[_nid].  More
      details can be found in the respective changelogs.
      
      This patch (of 5):
      
      It's meaningless to VM_BUG_ON() order != pageblock_order just after
      setting order to pageblock_order.  Remove it.
      
      Link: https://lkml.kernel.org/r/20210902121242.41607-1-linmiaohe@huawei.com
      Link: https://lkml.kernel.org/r/20210902121242.41607-2-linmiaohe@huawei.com
      Signed-off-by: NMiaohe Lin <linmiaohe@huawei.com>
      Acked-by: NMel Gorman <mgorman@techsingularity.net>
      Reviewed-by: NDavid Hildenbrand <david@redhat.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      ea808b4e
    • E
      mm/large system hash: avoid possible NULL deref in alloc_large_system_hash · 084f7e23
      Committed by Eric Dumazet
      If __vmalloc() returns NULL, is_vm_area_hugepages(NULL) will fault if
      CONFIG_HAVE_ARCH_HUGE_VMALLOC=y.
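
      The fix presumably guards the call in alloc_large_system_hash() along
      these lines (a sketch, not a verbatim quote of the hunk):

        table = __vmalloc(size, gfp_flags);
        virt = true;
        if (table)
                huge = is_vm_area_hugepages(table);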
      
      Link: https://lkml.kernel.org/r/20210915212530.2321545-1-eric.dumazet@gmail.com
      Fixes: 121e6f32 ("mm/vmalloc: hugepage vmalloc mappings")
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Reviewed-by: NAndrew Morton <akpm@linux-foundation.org>
      Cc: Nicholas Piggin <npiggin@gmail.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      084f7e23
  3. 29 Oct, 2021 2 commits
  4. 18 Oct, 2021 1 commit
  5. 27 Sep, 2021 1 commit
  6. 09 Sep, 2021 3 commits
    • M
      mm/page_alloc.c: avoid accessing uninitialized pcp page migratetype · 053cfda1
      Committed by Miaohe Lin
      If a page is not successfully prepared for a free_unref_page call, its
      pcp migratetype is left unset.  We would then read rubbish from
      get_pcppage_migratetype() and might list_del(&page->lru) again after
      the page has already been deleted from the list, leading to complaints
      about data corruption.
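
      Assuming the problem sits in free_unref_page_list(), the fix plausibly
      drops the page from the list as soon as preparation fails (a hedged
      sketch, not the exact hunk):

        list_for_each_entry_safe(page, next, list, lru) {
                unsigned long pfn = page_to_pfn(page);

                if (!free_unref_page_prepare(page, pfn, 0)) {
                        /* migratetype is unset; don't free this page */
                        list_del(&page->lru);
                        continue;
                }
                /* ... queue the page on its pcp list as before ... */
        }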
      
      Link: https://lkml.kernel.org/r/20210902115447.57050-1-linmiaohe@huawei.com
      Fixes: df1acc85 ("mm/page_alloc: avoid conflating IRQs disabled with zone->lock")
      Signed-off-by: NMiaohe Lin <linmiaohe@huawei.com>
      Acked-by: NMel Gorman <mgorman@techsingularity.net>
      Acked-by: NVlastimil Babka <vbabka@suse.cz>
      Reviewed-by: NDavid Hildenbrand <david@redhat.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      053cfda1
    • D
      mm: track present early pages per zone · 4b097002
      Committed by David Hildenbrand
      Patch series "mm/memory_hotplug: "auto-movable" online policy and memory groups", v3.
      
      I. Goal
      
      The goal of this series is improving in-kernel auto-online support.  It
      tackles the fundamental problems that:
      
       1) We can create zone imbalances when onlining all memory blindly to
          ZONE_MOVABLE, in the worst case crashing the system. We have to know
          upfront how much memory we are going to hotplug such that we can
          safely enable auto-onlining of all hotplugged memory to ZONE_MOVABLE
          via "online_movable". This is far from practical and only applicable in
          limited setups -- like inside VMs under the RHV/oVirt hypervisor which
          will never hotplug more than 3 times the boot memory (and the
          limitation is only in place due to the Linux limitation).
      
       2) We see more setups that implement dynamic VM resizing, hot(un)plugging
          memory to resize VM memory. In these setups, we might hotplug a lot of
          memory, but it might happen in various small steps in both directions
          (e.g., 2 GiB -> 8 GiB -> 4 GiB -> 16 GiB ...). virtio-mem is the
          primary driver of this upstream right now, performing such dynamic
          resizing NUMA-aware via multiple virtio-mem devices.
      
          Onlining all hotplugged memory to ZONE_NORMAL means we basically have
          no hotunplug guarantees. Onlining all to ZONE_MOVABLE means we can
          easily run into zone imbalances when growing a VM. We want a mixture,
          and we want as much memory as reasonable/configured in ZONE_MOVABLE.
          Details regarding zone imbalances can be found at [1].
      
       3) Memory devices consist of 1..X memory block devices; however, the
          kernel doesn't really track the relationship.  Consequently, user
          space has no idea either.  We want to make per-device decisions.
      
          As one example, for memory hotunplug it doesn't make sense to use a
          mixture of zones within a single DIMM: we want all MOVABLE if
          possible, otherwise all !MOVABLE, because any !MOVABLE part will easily
          block the whole DIMM from getting hotunplugged.
      
          As another example, virtio-mem operates on individual units that span
          1..X memory blocks. Similar to a DIMM, we want a unit to either be all
          MOVABLE or !MOVABLE. A "unit" can be thought of like a DIMM, however,
          all units of a virtio-mem device logically belong together and are
          managed (added/removed) by a single driver. We want as much memory of
          a virtio-mem device to be MOVABLE as possible.
      
       4) We want memory onlining to be done right from the kernel while adding
          memory, not triggered by user space via udev rules; for example, this
          is required for fast memory hotplug for drivers that add individual
          memory blocks, like virtio-mem. We want a way to configure a policy in
          the kernel and avoid implementing advanced policies in user space.
      
      The auto-onlining support we have in the kernel is not sufficient.  All we
      have is a) online everything MOVABLE (online_movable) b) online everything
      !MOVABLE (online_kernel) c) keep zones contiguous (online).  This series
      allows configuring c) to mean instead "online movable if possible
      according to the configuration, driven by a maximum MOVABLE:KERNEL ratio"
      -- a new onlining policy.
      
      II. Approach
      
      This series does 3 things:
      
       1) Introduces the "auto-movable" online policy that initially operates on
          individual memory blocks only. It uses a maximum MOVABLE:KERNEL ratio
          to make a decision whether a memory block will be onlined to
          ZONE_MOVABLE or not. However, in the basic form, hotplugged KERNEL
          memory does not allow for more MOVABLE memory (details in the
          patches). CMA memory is treated like MOVABLE memory.
      
       2) Introduces static (e.g., DIMM) and dynamic (e.g., virtio-mem) memory
          groups and uses group information to make decisions in the
          "auto-movable" online policy across memory blocks of a single memory
          device (modeled as memory group). More details can be found in patch
          #3 or in the DIMM example below.
      
       3) Maximizes ZONE_MOVABLE memory within dynamic memory groups, by
          allowing ZONE_NORMAL memory within a dynamic memory group to allow for
          more ZONE_MOVABLE memory within the same memory group. The target use
          case is dynamic VM resizing using virtio-mem. See the virtio-mem
          example below.
      
      I remember that the basic idea of using a ratio to implement a policy in
      the kernel was once mentioned by Vitaly Kuznetsov, but I might be wrong (I
      lost the pointer to that discussion).
      
      For me, the main use case is using it along with virtio-mem (and DIMMs /
      ppc64 dlpar where necessary) for dynamic resizing of VMs, increasing the
      amount of memory we can hotunplug reliably again if we might eventually
      hotplug a lot of memory to a VM.
      
      III. Target Usage
      
      The target usage will be:
      
       1) Linux boots with "mhp_default_online_type=offline"
      
       2) User space (e.g., systemd unit) configures memory onlining (according
          to a config file and system properties), for example:
          * Setting memory_hotplug.online_policy=auto-movable
          * Setting memory_hotplug.auto_movable_ratio=301
          * Setting memory_hotplug.auto_movable_numa_aware=true
      
       3) User space enables auto onlining via "echo online >
          /sys/devices/system/memory/auto_online_blocks"
      
       4) User space triggers manual onlining of all already-offline memory
          blocks (go over offline memory blocks and set them to "online")
      
      IV. Example
      
      For DIMMs, hotplugging 4 GiB DIMMs to a 4 GiB VM with a configured ratio of
      301% results in the following layout:
      	Memory block 0-15:    DMA32   (early)
      	Memory block 32-47:   Normal  (early)
      	Memory block 48-79:   Movable (DIMM 0)
      	Memory block 80-111:  Movable (DIMM 1)
      	Memory block 112-143: Movable (DIMM 2)
      	Memory block 144-175: Normal  (DIMM 3)
      	Memory block 176-207: Normal  (DIMM 4)
      	... all Normal
      	(-> hotplugged Normal memory does not allow for more Movable memory)
      
      For virtio-mem, using a simple, single virtio-mem device with a 4 GiB VM
      will result in the following layout:
      	Memory block 0-15:    DMA32   (early)
      	Memory block 32-47:   Normal  (early)
      	Memory block 48-143:  Movable (virtio-mem, first 12 GiB)
      	Memory block 144:     Normal  (virtio-mem, next 128 MiB)
      	Memory block 145-147: Movable (virtio-mem, next 384 MiB)
      	Memory block 148:     Normal  (virtio-mem, next 128 MiB)
      	Memory block 149-151: Movable (virtio-mem, next 384 MiB)
      	... Normal/Movable mixture as above
      	(-> hotplugged Normal memory allows for more Movable memory within
      	    the same device)
      
      Which gives us maximum flexibility when dynamically growing/shrinking a
      VM in smaller steps.
      
      V. Doc Update
      
      I'll update the memory-hotplug.rst documentation once the overhaul [1] is
      upstream.  Until then, details can be found in patch #2.
      
      VI. Future Work
      
       1) Use memory groups for ppc64 dlpar
       2) Being able to specify a portion of (early) kernel memory that will be
          excluded from the ratio, e.g., "128 MiB globally/per node".

          This might be helpful when starting VMs with an extremely small memory
          footprint (e.g., 128 MiB) and hotplugging memory later -- not wanting
          the first hotplugged units to get onlined to ZONE_MOVABLE. One
          alternative would be a trigger to not consider ZONE_DMA memory
          in the ratio. We'll have to see if this is really required.
       3) Indicate to user space that MOVABLE might be a bad idea -- especially
          relevant when memory ballooning without support for balloon compaction
          is active.
      
      This patch (of 9):
      
      For implementing a new memory onlining policy, which determines when to
      online memory blocks to ZONE_MOVABLE semi-automatically, we need the
      number of present early (boot) pages -- present pages excluding hotplugged
      pages.  Let's track these pages per zone.
      
      Pass a page instead of the zone to adjust_present_page_count(), similar
      to adjust_managed_page_count(), and derive the zone from the page.
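
      A hedged sketch of the reworked helper as introduced here (field and
      helper names follow the series but may not match the final code
      exactly):

        void adjust_present_page_count(struct page *page, long nr_pages)
        {
                struct zone *zone = page_zone(page);

                /*
                 * Memory blocks are onlined/offlined as a whole, so the
                 * whole block is either early (boot) or hotplugged.
                 */
                if (early_section(__pfn_to_section(page_to_pfn(page))))
                        zone->present_early_pages += nr_pages;
                zone->present_pages += nr_pages;
                zone->zone_pgdat->node_present_pages += nr_pages;
        }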
      
      It's worth noting that a memory block to be offlined/onlined is either
      completely "early" or "not early".  add_memory() and friends can only add
      complete memory blocks and we only online/offline complete (individual)
      memory blocks.
      
      Link: https://lkml.kernel.org/r/20210806124715.17090-1-david@redhat.com
      Link: https://lkml.kernel.org/r/20210806124715.17090-2-david@redhat.com
      Signed-off-by: NDavid Hildenbrand <david@redhat.com>
      Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
      Cc: "Michael S. Tsirkin" <mst@redhat.com>
      Cc: Jason Wang <jasowang@redhat.com>
      Cc: Marek Kedzierski <mkedzier@redhat.com>
      Cc: Hui Zhu <teawater@gmail.com>
      Cc: Pankaj Gupta <pankaj.gupta.linux@gmail.com>
      Cc: Wei Yang <richard.weiyang@linux.alibaba.com>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Anshuman Khandual <anshuman.khandual@arm.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Mike Rapoport <rppt@kernel.org>
      Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
      Cc: Len Brown <lenb@kernel.org>
      Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      4b097002
    • M
      mm: remove pfn_valid_within() and CONFIG_HOLES_IN_ZONE · 859a85dd
      Committed by Mike Rapoport
      Patch series "mm: remove pfn_valid_within() and CONFIG_HOLES_IN_ZONE".
      
      After recent updates to freeing unused parts of the memory map, no
      architecture can have holes in the memory map within a pageblock.  This
      makes the pfn_valid_within() check and the CONFIG_HOLES_IN_ZONE
      configuration option redundant.
      
      The first patch removes them both in a mechanical way, and the second
      patch simplifies memory_hotplug::test_pages_in_a_zone(), which had
      pfn_valid_within() surrounded by more logic than a simple if.
      
      This patch (of 2):
      
      After recent changes in freeing of the unused parts of the memory map
      and the rework of pfn_valid() on arm and arm64, there are no
      architectures that can have holes in the memory map within a pageblock,
      so nothing can enable CONFIG_HOLES_IN_ZONE, which guards the
      non-trivial implementation of pfn_valid_within().
      
      With that, pfn_valid_within() is always hardwired to 1 and can be
      completely removed.
      
      Remove calls to pfn_valid_within() and CONFIG_HOLES_IN_ZONE.
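
      For context, the helper being removed was defined in
      include/linux/mmzone.h roughly as below, which is why it collapses to
      a constant once nothing selects CONFIG_HOLES_IN_ZONE:

        #ifdef CONFIG_HOLES_IN_ZONE
        #define pfn_valid_within(pfn) pfn_valid(pfn)
        #else
        #define pfn_valid_within(pfn) (1)
        #endif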
      
      Link: https://lkml.kernel.org/r/20210713080035.7464-1-rppt@kernel.org
      Link: https://lkml.kernel.org/r/20210713080035.7464-2-rppt@kernel.org
      Signed-off-by: NMike Rapoport <rppt@linux.ibm.com>
      Acked-by: NDavid Hildenbrand <david@redhat.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: "Rafael J. Wysocki" <rafael@kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      859a85dd
  7. 04 Sep, 2021 8 commits