1. 01 7月, 2021 1 次提交
    • M
      mm: memory_hotplug: factor out bootmem core functions to bootmem_info.c · 426e5c42
      Muchun Song 提交于
      Patch series "Free some vmemmap pages of HugeTLB page", v23.
      
      This patch series will free some vmemmap pages(struct page structures)
      associated with each HugeTLB page when preallocated to save memory.
      
      In order to reduce the difficulty of the first version of code review.  In
      this version, we disable PMD/huge page mapping of vmemmap if this feature
      was enabled.  This acutely eliminates a bunch of the complex code doing
      page table manipulation.  When this patch series is solid, we cam add the
      code of vmemmap page table manipulation in the future.
      
      The struct page structures (page structs) are used to describe a physical
      page frame.  By default, there is an one-to-one mapping from a page frame
      to it's corresponding page struct.
      
      The HugeTLB pages consist of multiple base page size pages and is
      supported by many architectures.  See hugetlbpage.rst in the Documentation
      directory for more details.  On the x86 architecture, HugeTLB pages of
      size 2MB and 1GB are currently supported.  Since the base page size on x86
      is 4KB, a 2MB HugeTLB page consists of 512 base pages and a 1GB HugeTLB
      page consists of 4096 base pages.  For each base page, there is a
      corresponding page struct.
      
      Within the HugeTLB subsystem, only the first 4 page structs are used to
      contain unique information about a HugeTLB page.  HUGETLB_CGROUP_MIN_ORDER
      provides this upper limit.  The only 'useful' information in the remaining
      page structs is the compound_head field, and this field is the same for
      all tail pages.
      
      By removing redundant page structs for HugeTLB pages, memory can returned
      to the buddy allocator for other uses.
      
      When the system boot up, every 2M HugeTLB has 512 struct page structs which
      size is 8 pages(sizeof(struct page) * 512 / PAGE_SIZE).
      
          HugeTLB                  struct pages(8 pages)         page frame(8 pages)
       +-----------+ ---virt_to_page---> +-----------+   mapping to   +-----------+
       |           |                     |     0     | -------------> |     0     |
       |           |                     +-----------+                +-----------+
       |           |                     |     1     | -------------> |     1     |
       |           |                     +-----------+                +-----------+
       |           |                     |     2     | -------------> |     2     |
       |           |                     +-----------+                +-----------+
       |           |                     |     3     | -------------> |     3     |
       |           |                     +-----------+                +-----------+
       |           |                     |     4     | -------------> |     4     |
       |    2MB    |                     +-----------+                +-----------+
       |           |                     |     5     | -------------> |     5     |
       |           |                     +-----------+                +-----------+
       |           |                     |     6     | -------------> |     6     |
       |           |                     +-----------+                +-----------+
       |           |                     |     7     | -------------> |     7     |
       |           |                     +-----------+                +-----------+
       |           |
       |           |
       |           |
       +-----------+
      
      The value of page->compound_head is the same for all tail pages.  The
      first page of page structs (page 0) associated with the HugeTLB page
      contains the 4 page structs necessary to describe the HugeTLB.  The only
      use of the remaining pages of page structs (page 1 to page 7) is to point
      to page->compound_head.  Therefore, we can remap pages 2 to 7 to page 1.
      Only 2 pages of page structs will be used for each HugeTLB page.  This
      will allow us to free the remaining 6 pages to the buddy allocator.
      
      Here is how things look after remapping.
      
          HugeTLB                  struct pages(8 pages)         page frame(8 pages)
       +-----------+ ---virt_to_page---> +-----------+   mapping to   +-----------+
       |           |                     |     0     | -------------> |     0     |
       |           |                     +-----------+                +-----------+
       |           |                     |     1     | -------------> |     1     |
       |           |                     +-----------+                +-----------+
       |           |                     |     2     | ----------------^ ^ ^ ^ ^ ^
       |           |                     +-----------+                   | | | | |
       |           |                     |     3     | ------------------+ | | | |
       |           |                     +-----------+                     | | | |
       |           |                     |     4     | --------------------+ | | |
       |    2MB    |                     +-----------+                       | | |
       |           |                     |     5     | ----------------------+ | |
       |           |                     +-----------+                         | |
       |           |                     |     6     | ------------------------+ |
       |           |                     +-----------+                           |
       |           |                     |     7     | --------------------------+
       |           |                     +-----------+
       |           |
       |           |
       |           |
       +-----------+
      
      When a HugeTLB is freed to the buddy system, we should allocate 6 pages
      for vmemmap pages and restore the previous mapping relationship.
      
      Apart from 2MB HugeTLB page, we also have 1GB HugeTLB page.  It is similar
      to the 2MB HugeTLB page.  We also can use this approach to free the
      vmemmap pages.
      
      In this case, for the 1GB HugeTLB page, we can save 4094 pages.  This is a
      very substantial gain.  On our server, run some SPDK/QEMU applications
      which will use 1024GB HugeTLB page.  With this feature enabled, we can
      save ~16GB (1G hugepage)/~12GB (2MB hugepage) memory.
      
      Because there are vmemmap page tables reconstruction on the
      freeing/allocating path, it increases some overhead.  Here are some
      overhead analysis.
      
      1) Allocating 10240 2MB HugeTLB pages.
      
         a) With this patch series applied:
         # time echo 10240 > /proc/sys/vm/nr_hugepages
      
         real     0m0.166s
         user     0m0.000s
         sys      0m0.166s
      
         # bpftrace -e 'kprobe:alloc_fresh_huge_page { @start[tid] = nsecs; }
           kretprobe:alloc_fresh_huge_page /@start[tid]/ { @latency = hist(nsecs -
           @start[tid]); delete(@start[tid]); }'
         Attaching 2 probes...
      
         @latency:
         [8K, 16K)           5476 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
         [16K, 32K)          4760 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@       |
         [32K, 64K)             4 |                                                    |
      
         b) Without this patch series:
         # time echo 10240 > /proc/sys/vm/nr_hugepages
      
         real     0m0.067s
         user     0m0.000s
         sys      0m0.067s
      
         # bpftrace -e 'kprobe:alloc_fresh_huge_page { @start[tid] = nsecs; }
           kretprobe:alloc_fresh_huge_page /@start[tid]/ { @latency = hist(nsecs -
           @start[tid]); delete(@start[tid]); }'
         Attaching 2 probes...
      
         @latency:
         [4K, 8K)           10147 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
         [8K, 16K)             93 |                                                    |
      
         Summarize: this feature is about ~2x slower than before.
      
      2) Freeing 10240 2MB HugeTLB pages.
      
         a) With this patch series applied:
         # time echo 0 > /proc/sys/vm/nr_hugepages
      
         real     0m0.213s
         user     0m0.000s
         sys      0m0.213s
      
         # bpftrace -e 'kprobe:free_pool_huge_page { @start[tid] = nsecs; }
           kretprobe:free_pool_huge_page /@start[tid]/ { @latency = hist(nsecs -
           @start[tid]); delete(@start[tid]); }'
         Attaching 2 probes...
      
         @latency:
         [8K, 16K)              6 |                                                    |
         [16K, 32K)         10227 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
         [32K, 64K)             7 |                                                    |
      
         b) Without this patch series:
         # time echo 0 > /proc/sys/vm/nr_hugepages
      
         real     0m0.081s
         user     0m0.000s
         sys      0m0.081s
      
         # bpftrace -e 'kprobe:free_pool_huge_page { @start[tid] = nsecs; }
           kretprobe:free_pool_huge_page /@start[tid]/ { @latency = hist(nsecs -
           @start[tid]); delete(@start[tid]); }'
         Attaching 2 probes...
      
         @latency:
         [4K, 8K)            6805 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
         [8K, 16K)           3427 |@@@@@@@@@@@@@@@@@@@@@@@@@@                          |
         [16K, 32K)             8 |                                                    |
      
         Summary: The overhead of __free_hugepage is about ~2-3x slower than before.
      
      Although the overhead has increased, the overhead is not significant.
      Like Mike said, "However, remember that the majority of use cases create
      HugeTLB pages at or shortly after boot time and add them to the pool.  So,
      additional overhead is at pool creation time.  There is no change to
      'normal run time' operations of getting a page from or returning a page to
      the pool (think page fault/unmap)".
      
      Despite the overhead and in addition to the memory gains from this series.
      The following data is obtained by Joao Martins.  Very thanks to his
      effort.
      
      There's an additional benefit which is page (un)pinners will see an improvement
      and Joao presumes because there are fewer memmap pages and thus the tail/head
      pages are staying in cache more often.
      
      Out of the box Joao saw (when comparing linux-next against linux-next +
      this series) with gup_test and pinning a 16G HugeTLB file (with 1G pages):
      
      	get_user_pages(): ~32k -> ~9k
      	unpin_user_pages(): ~75k -> ~70k
      
      Usually any tight loop fetching compound_head(), or reading tail pages
      data (e.g.  compound_head) benefit a lot.  There's some unpinning
      inefficiencies Joao was fixing[2], but with that in added it shows even
      more:
      
      	unpin_user_pages(): ~27k -> ~3.8k
      
      [1] https://lore.kernel.org/linux-mm/20210409205254.242291-1-mike.kravetz@oracle.com/
      [2] https://lore.kernel.org/linux-mm/20210204202500.26474-1-joao.m.martins@oracle.com/
      
      This patch (of 9):
      
      Move bootmem info registration common API to individual bootmem_info.c.
      And we will use {get,put}_page_bootmem() to initialize the page for the
      vmemmap pages or free the vmemmap pages to buddy in the later patch.  So
      move them out of CONFIG_MEMORY_HOTPLUG_SPARSE.  This is just code movement
      without any functional change.
      
      Link: https://lkml.kernel.org/r/20210510030027.56044-1-songmuchun@bytedance.com
      Link: https://lkml.kernel.org/r/20210510030027.56044-2-songmuchun@bytedance.comSigned-off-by: NMuchun Song <songmuchun@bytedance.com>
      Acked-by: NMike Kravetz <mike.kravetz@oracle.com>
      Reviewed-by: NOscar Salvador <osalvador@suse.de>
      Reviewed-by: NDavid Hildenbrand <david@redhat.com>
      Reviewed-by: NMiaohe Lin <linmiaohe@huawei.com>
      Tested-by: NChen Huang <chenhuang5@huawei.com>
      Tested-by: NBodeddula Balasubramaniam <bodeddub@amazon.com>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: x86@kernel.org
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Paul E. McKenney <paulmck@kernel.org>
      Cc: Pawan Gupta <pawan.kumar.gupta@linux.intel.com>
      Cc: Randy Dunlap <rdunlap@infradead.org>
      Cc: Oliver Neukum <oneukum@suse.com>
      Cc: Anshuman Khandual <anshuman.khandual@arm.com>
      Cc: Joerg Roedel <jroedel@suse.de>
      Cc: Mina Almasry <almasrymina@google.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Barry Song <song.bao.hua@hisilicon.com>
      Cc: HORIGUCHI NAOYA <naoya.horiguchi@nec.com>
      Cc: Joao Martins <joao.m.martins@oracle.com>
      Cc: Xiongchun Duan <duanxiongchun@bytedance.com>
      Cc: Balbir Singh <bsingharora@gmail.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      426e5c42
  2. 30 6月, 2021 1 次提交
  3. 06 5月, 2021 6 次提交
  4. 10 3月, 2021 1 次提交
  5. 27 2月, 2021 7 次提交
  6. 25 2月, 2021 2 次提交
  7. 30 12月, 2020 1 次提交
    • B
      mm: memmap defer init doesn't work as expected · dc2da7b4
      Baoquan He 提交于
      VMware observed a performance regression during memmap init on their
      platform, and bisected to commit 73a6e474 ("mm: memmap_init:
      iterate over memblock regions rather that check each PFN") causing it.
      
      Before the commit:
      
        [0.033176] Normal zone: 1445888 pages used for memmap
        [0.033176] Normal zone: 89391104 pages, LIFO batch:63
        [0.035851] ACPI: PM-Timer IO Port: 0x448
      
      With commit
      
        [0.026874] Normal zone: 1445888 pages used for memmap
        [0.026875] Normal zone: 89391104 pages, LIFO batch:63
        [2.028450] ACPI: PM-Timer IO Port: 0x448
      
      The root cause is the current memmap defer init doesn't work as expected.
      
      Before, memmap_init_zone() was used to do memmap init of one whole zone,
      to initialize all low zones of one numa node, but defer memmap init of
      the last zone in that numa node.  However, since commit 73a6e474,
      function memmap_init() is adapted to iterater over memblock regions
      inside one zone, then call memmap_init_zone() to do memmap init for each
      region.
      
      E.g, on VMware's system, the memory layout is as below, there are two
      memory regions in node 2.  The current code will mistakenly initialize the
      whole 1st region [mem 0xab00000000-0xfcffffffff], then do memmap defer to
      iniatialize only one memmory section on the 2nd region [mem
      0x10000000000-0x1033fffffff].  In fact, we only expect to see that there's
      only one memory section's memmap initialized.  That's why more time is
      costed at the time.
      
      [    0.008842] ACPI: SRAT: Node 0 PXM 0 [mem 0x00000000-0x0009ffff]
      [    0.008842] ACPI: SRAT: Node 0 PXM 0 [mem 0x00100000-0xbfffffff]
      [    0.008843] ACPI: SRAT: Node 0 PXM 0 [mem 0x100000000-0x55ffffffff]
      [    0.008844] ACPI: SRAT: Node 1 PXM 1 [mem 0x5600000000-0xaaffffffff]
      [    0.008844] ACPI: SRAT: Node 2 PXM 2 [mem 0xab00000000-0xfcffffffff]
      [    0.008845] ACPI: SRAT: Node 2 PXM 2 [mem 0x10000000000-0x1033fffffff]
      
      Now, let's add a parameter 'zone_end_pfn' to memmap_init_zone() to pass
      down the real zone end pfn so that defer_init() can use it to judge
      whether defer need be taken in zone wide.
      
      Link: https://lkml.kernel.org/r/20201223080811.16211-1-bhe@redhat.com
      Link: https://lkml.kernel.org/r/20201223080811.16211-2-bhe@redhat.com
      Fixes: commit 73a6e474 ("mm: memmap_init: iterate over memblock regions rather that check each PFN")
      Signed-off-by: NBaoquan He <bhe@redhat.com>
      Reported-by: NRahul Gopakumar <gopakumarr@vmware.com>
      Reviewed-by: NMike Rapoport <rppt@linux.ibm.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      dc2da7b4
  8. 19 12月, 2020 1 次提交
    • D
      mm/memory_hotplug: extend offline_and_remove_memory() to handle more than one memory block · 8dc4bb58
      David Hildenbrand 提交于
      virtio-mem soon wants to use offline_and_remove_memory() memory that
      exceeds a single Linux memory block (memory_block_size_bytes()). Let's
      remove that restriction.
      
      Let's remember the old state and try to restore that if anything goes
      wrong. While re-onlining can, in general, fail, it's highly unlikely to
      happen (usually only when a notifier fails to allocate memory, and these
      are rather rare).
      
      This will be used by virtio-mem to offline+remove memory ranges that are
      bigger than a single memory block - for example, with a device block
      size of 1 GiB (e.g., gigantic pages in the hypervisor) and a Linux memory
      block size of 128MB.
      
      While we could compress the state into 2 bit, using 8 bit is much
      easier.
      
      This handling is similar, but different to acpi_scan_try_to_offline():
      
      a) We don't try to offline twice. I am not sure if this CONFIG_MEMCG
      optimization is still relevant - it should only apply to ZONE_NORMAL
      (where we have no guarantees). If relevant, we can always add it.
      
      b) acpi_scan_try_to_offline() simply onlines all memory in case
      something goes wrong. It doesn't restore previous online type. Let's do
      that, so we won't overwrite what e.g., user space configured.
      Reviewed-by: NWei Yang <richard.weiyang@linux.alibaba.com>
      Cc: "Michael S. Tsirkin" <mst@redhat.com>
      Cc: Jason Wang <jasowang@redhat.com>
      Cc: Pankaj Gupta <pankaj.gupta.linux@gmail.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: Wei Yang <richard.weiyang@linux.alibaba.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NDavid Hildenbrand <david@redhat.com>
      Link: https://lore.kernel.org/r/20201112133815.13332-28-david@redhat.comSigned-off-by: NMichael S. Tsirkin <mst@redhat.com>
      Acked-by: NAndrew Morton <akpm@linux-foundation.org>
      8dc4bb58
  9. 16 12月, 2020 5 次提交
    • L
      mm/memory_hotplug: quieting offline operation · 7c33023a
      Laurent Dufour 提交于
      On PowerPC, when dymically removing memory from a system we can see in the
      console a lot of messages like this:
      
      [  186.575389] Offlined Pages 4096
      
      This message is displayed on each LMB (256MB) removed, which means that we
      removing 1TB of memory, this message is displayed 4096 times.
      
      Moving it to DEBUG to not flood the console.
      
      Link: https://lkml.kernel.org/r/20201211150157.91399-1-ldufour@linux.ibm.comSigned-off-by: NLaurent Dufour <ldufour@linux.ibm.com>
      Reviewed-by: NAndrew Morton <akpm@linux-foundation.org>
      Reviewed-by: NDavid Hildenbrand <david@redhat.com>
      Acked-by: NMichal Hocko <mhocko@suse.com>
      Cc: Nathan Lynch <nathanl@linux.ibm.com>
      Cc: Scott Cheloha <cheloha@linux.ibm.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      7c33023a
    • V
      mm, page_alloc: disable pcplists during memory offline · ec6e8c7e
      Vlastimil Babka 提交于
      Memory offlining relies on page isolation to guarantee a forward progress
      because pages cannot be reused while they are isolated.  But the page
      isolation itself doesn't prevent from races while freed pages are stored
      on pcp lists and thus can be reused.  This can be worked around by
      repeated draining of pcplists, as done by commit 96831826
      ("mm/memory_hotplug: drain per-cpu pages again during memory offline").
      
      David and Michal would prefer that this race was closed in a way that
      callers of page isolation who need stronger guarantees don't need to
      repeatedly drain.  David suggested disabling pcplists usage completely
      during page isolation, instead of repeatedly draining them.
      
      To achieve this without adding special cases in alloc/free fastpath, we
      can use the same approach as boot pagesets - when pcp->high is 0, any
      pcplist addition will be immediately flushed.
      
      The race can thus be closed by setting pcp->high to 0 and draining
      pcplists once, before calling start_isolate_page_range().  The draining
      will serialize after processes that already disabled interrupts and read
      the old value of pcp->high in free_unref_page_commit(), and processes that
      have not yet disabled interrupts, will observe pcp->high == 0 when they
      are rescheduled, and skip pcplists.  This guarantees no stray pages on
      pcplists in zones where isolation happens.
      
      This patch thus adds zone_pcp_disable() and zone_pcp_enable() functions
      that page isolation users can call before start_isolate_page_range() and
      after unisolating (or offlining) the isolated pages.
      
      Also, drain_all_pages() is optimized to only execute on cpus where
      pcplists are not empty.  The check can however race with a free to pcplist
      that has not yet increased the pcp->count from 0 to 1.  Thus make the
      drain optionally skip the racy check and drain on all cpus, and use this
      option in zone_pcp_disable().
      
      As we have to avoid external updates to high and batch while pcplists are
      disabled, we take pcp_batch_high_lock in zone_pcp_disable() and release it
      in zone_pcp_enable().  This also synchronizes multiple users of
      zone_pcp_disable()/enable().
      
      Currently the only user of this functionality is offline_pages().
      
      [vbabka@suse.cz: add comment, per David]
        Link: https://lkml.kernel.org/r/527480ef-ed72-e1c1-52a0-1c5b0113df45@suse.cz
      
      Link: https://lkml.kernel.org/r/20201111092812.11329-8-vbabka@suse.czSigned-off-by: NVlastimil Babka <vbabka@suse.cz>
      Suggested-by: NDavid Hildenbrand <david@redhat.com>
      Suggested-by: NMichal Hocko <mhocko@suse.com>
      Reviewed-by: NOscar Salvador <osalvador@suse.de>
      Reviewed-by: NDavid Hildenbrand <david@redhat.com>
      Acked-by: NMichal Hocko <mhocko@suse.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      ec6e8c7e
    • V
      mm, page_alloc: move draining pcplists to page isolation users · 7612921f
      Vlastimil Babka 提交于
      Currently, pcplists are drained during set_migratetype_isolate() which
      means once per pageblock processed start_isolate_page_range().  This is
      somewhat wasteful.  Moreover, the callers might need different guarantees,
      and the draining is currently prone to races and does not guarantee that
      no page from isolated pageblock will end up on the pcplist after the
      drain.
      
      Better guarantees are added by later patches and require explicit actions
      by page isolation users that need them.  Thus it makes sense to move the
      current imperfect draining to the callers also as a preparation step.
      
      Link: https://lkml.kernel.org/r/20201111092812.11329-7-vbabka@suse.czSuggested-by: NDavid Hildenbrand <david@redhat.com>
      Suggested-by: NPavel Tatashin <pasha.tatashin@soleen.com>
      Signed-off-by: NVlastimil Babka <vbabka@suse.cz>
      Reviewed-by: NDavid Hildenbrand <david@redhat.com>
      Reviewed-by: NOscar Salvador <osalvador@suse.de>
      Acked-by: NMichal Hocko <mhocko@suse.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      7612921f
    • M
      mm: introduce debug_pagealloc_{map,unmap}_pages() helpers · 77bc7fd6
      Mike Rapoport 提交于
      Patch series "arch, mm: improve robustness of direct map manipulation", v7.
      
      During recent discussion about KVM protected memory, David raised a
      concern about usage of __kernel_map_pages() outside of DEBUG_PAGEALLOC
      scope [1].
      
      Indeed, for architectures that define CONFIG_ARCH_HAS_SET_DIRECT_MAP it is
      possible that __kernel_map_pages() would fail, but since this function is
      void, the failure will go unnoticed.
      
      Moreover, there's lack of consistency of __kernel_map_pages() semantics
      across architectures as some guard this function with #ifdef
      DEBUG_PAGEALLOC, some refuse to update the direct map if page allocation
      debugging is disabled at run time and some allow modifying the direct map
      regardless of DEBUG_PAGEALLOC settings.
      
      This set straightens this out by restoring dependency of
      __kernel_map_pages() on DEBUG_PAGEALLOC and updating the call sites
      accordingly.
      
      Since currently the only user of __kernel_map_pages() outside
      DEBUG_PAGEALLOC is hibernation, it is updated to make direct map accesses
      there more explicit.
      
      [1] https://lore.kernel.org/lkml/2759b4bf-e1e3-d006-7d86-78a40348269d@redhat.com
      
      This patch (of 4):
      
      When CONFIG_DEBUG_PAGEALLOC is enabled, it unmaps pages from the kernel
      direct mapping after free_pages().  The pages than need to be mapped back
      before they could be used.  Theese mapping operations use
      __kernel_map_pages() guarded with with debug_pagealloc_enabled().
      
      The only place that calls __kernel_map_pages() without checking whether
      DEBUG_PAGEALLOC is enabled is the hibernation code that presumes
      availability of this function when ARCH_HAS_SET_DIRECT_MAP is set.  Still,
      on arm64, __kernel_map_pages() will bail out when DEBUG_PAGEALLOC is not
      enabled but set_direct_map_invalid_noflush() may render some pages not
      present in the direct map and hibernation code won't be able to save such
      pages.
      
      To make page allocation debugging and hibernation interaction more robust,
      the dependency on DEBUG_PAGEALLOC or ARCH_HAS_SET_DIRECT_MAP has to be
      made more explicit.
      
      Start with combining the guard condition and the call to
      __kernel_map_pages() into debug_pagealloc_map_pages() and
      debug_pagealloc_unmap_pages() functions to emphasize that
      __kernel_map_pages() should not be called without DEBUG_PAGEALLOC and use
      these new functions to map/unmap pages when page allocation debugging is
      enabled.
      
      Link: https://lkml.kernel.org/r/20201109192128.960-1-rppt@kernel.org
      Link: https://lkml.kernel.org/r/20201109192128.960-2-rppt@kernel.orgSigned-off-by: NMike Rapoport <rppt@linux.ibm.com>
      Reviewed-by: NDavid Hildenbrand <david@redhat.com>
      Acked-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Acked-by: NVlastimil Babka <vbabka@suse.cz>
      Cc: Albert Ou <aou@eecs.berkeley.edu>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Christian Borntraeger <borntraeger@de.ibm.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: "Edgecombe, Rick P" <rick.p.edgecombe@intel.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Heiko Carstens <hca@linux.ibm.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Len Brown <len.brown@intel.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Palmer Dabbelt <palmer@dabbelt.com>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Paul Walmsley <paul.walmsley@sifive.com>
      Cc: Pavel Machek <pavel@ucw.cz>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vasily Gorbik <gor@linux.ibm.com>
      Cc: Will Deacon <will@kernel.org>
      Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      77bc7fd6
    • S
      mm/rmap: always do TTU_IGNORE_ACCESS · 013339df
      Shakeel Butt 提交于
      Since commit 369ea824 ("mm/rmap: update to new mmu_notifier semantic
      v2"), the code to check the secondary MMU's page table access bit is
      broken for !(TTU_IGNORE_ACCESS) because the page is unmapped from the
      secondary MMU's page table before the check.  More specifically for those
      secondary MMUs which unmap the memory in
      mmu_notifier_invalidate_range_start() like kvm.
      
      However memory reclaim is the only user of !(TTU_IGNORE_ACCESS) or the
      absence of TTU_IGNORE_ACCESS and it explicitly performs the page table
      access check before trying to unmap the page.  So, at worst the reclaim
      will miss accesses in a very short window if we remove page table access
      check in unmapping code.
      
      There is an unintented consequence of !(TTU_IGNORE_ACCESS) for the memcg
      reclaim.  From memcg reclaim the page_referenced() only account the
      accesses from the processes which are in the same memcg of the target page
      but the unmapping code is considering accesses from all the processes, so,
      decreasing the effectiveness of memcg reclaim.
      
      The simplest solution is to always assume TTU_IGNORE_ACCESS in unmapping
      code.
      
      Link: https://lkml.kernel.org/r/20201104231928.1494083-1-shakeelb@google.com
      Fixes: 369ea824 ("mm/rmap: update to new mmu_notifier semantic v2")
      Signed-off-by: NShakeel Butt <shakeelb@google.com>
      Acked-by: NJohannes Weiner <hannes@cmpxchg.org>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Jerome Glisse <jglisse@redhat.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      013339df
  10. 23 11月, 2020 1 次提交
  11. 19 10月, 2020 1 次提交
  12. 17 10月, 2020 13 次提交
    • D
      mm/memory_hotplug: update comment regarding zone shuffling · b86c5fc4
      David Hildenbrand 提交于
      As we no longer shuffle via generic_online_page() and when undoing
      isolation, we can simplify the comment.
      
      We now effectively shuffle only once (properly) when onlining new memory.
      Signed-off-by: NDavid Hildenbrand <david@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Reviewed-by: NWei Yang <richard.weiyang@linux.alibaba.com>
      Acked-by: NMichal Hocko <mhocko@suse.com>
      Cc: Alexander Duyck <alexander.h.duyck@linux.intel.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Wei Yang <richard.weiyang@linux.alibaba.com>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: Mike Rapoport <rppt@kernel.org>
      Cc: Pankaj Gupta <pankaj.gupta.linux@gmail.com>
      Cc: Haiyang Zhang <haiyangz@microsoft.com>
      Cc: "K. Y. Srinivasan" <kys@microsoft.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Scott Cheloha <cheloha@linux.ibm.com>
      Cc: Stephen Hemminger <sthemmin@microsoft.com>
      Cc: Wei Liu <wei.liu@kernel.org>
      Link: https://lkml.kernel.org/r/20201005121534.15649-6-david@redhat.comSigned-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      b86c5fc4
    • L
      mm: don't panic when links can't be created in sysfs · 90c7eaeb
      Laurent Dufour 提交于
      At boot time, or when doing memory hot-add operations, if the links in
      sysfs can't be created, the system is still able to run, so just report
      the error in the kernel log rather than BUG_ON and potentially make system
      unusable because the callpath can be called with locks held.
      
      Since the number of memory blocks managed could be high, the messages are
      rate limited.
      
      As a consequence, link_mem_sections() has no status to report anymore.
      Signed-off-by: NLaurent Dufour <ldufour@linux.ibm.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Reviewed-by: NOscar Salvador <osalvador@suse.de>
      Acked-by: NMichal Hocko <mhocko@suse.com>
      Acked-by: NDavid Hildenbrand <david@redhat.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Fenghua Yu <fenghua.yu@intel.com>
      Cc: Nathan Lynch <nathanl@linux.ibm.com>
      Cc: "Rafael J . Wysocki" <rafael@kernel.org>
      Cc: Scott Cheloha <cheloha@linux.ibm.com>
      Cc: Tony Luck <tony.luck@intel.com>
      Link: https://lkml.kernel.org/r/20200915094143.79181-4-ldufour@linux.ibm.comSigned-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      90c7eaeb
    • D
      kernel/resource: make iomem_resource implicit in release_mem_region_adjustable() · cb8e3c8b
      David Hildenbrand 提交于
      "mem" in the name already indicates the root, similar to
      release_mem_region() and devm_request_mem_region().  Make it implicit.
      The only single caller always passes iomem_resource, other parents are not
      applicable.
      Suggested-by: NWei Yang <richard.weiyang@linux.alibaba.com>
      Signed-off-by: NDavid Hildenbrand <david@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Reviewed-by: NWei Yang <richard.weiyang@linux.alibaba.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Jason Gunthorpe <jgg@ziepe.ca>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Ard Biesheuvel <ardb@kernel.org>
      Cc: Pankaj Gupta <pankaj.gupta.linux@gmail.com>
      Cc: Baoquan He <bhe@redhat.com>
      Link: https://lkml.kernel.org/r/20200916073041.10355-1-david@redhat.comSigned-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      cb8e3c8b
    • D
      mm/memory_hotplug: MEMHP_MERGE_RESOURCE to specify merging of System RAM resources · 9ca6551e
      David Hildenbrand 提交于
      Some add_memory*() users add memory in small, contiguous memory blocks.
      Examples include virtio-mem, hyper-v balloon, and the XEN balloon.
      
      This can quickly result in a lot of memory resources, whereby the actual
      resource boundaries are not of interest (e.g., it might be relevant for
      DIMMs, exposed via /proc/iomem to user space).  We really want to merge
      added resources in this scenario where possible.
      
      Let's provide a flag (MEMHP_MERGE_RESOURCE) to specify that a resource
      either created within add_memory*() or passed via add_memory_resource()
      shall be marked mergeable and merged with applicable siblings.
      
      To implement that, we need a kernel/resource interface to mark selected
      System RAM resources mergeable (IORESOURCE_SYSRAM_MERGEABLE) and trigger
      merging.
      
      Note: We really want to merge after the whole operation succeeded, not
      directly when adding a resource to the resource tree (it would break
      add_memory_resource() and require splitting resources again when the
      operation failed - e.g., due to -ENOMEM).
      Signed-off-by: NDavid Hildenbrand <david@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Reviewed-by: NPankaj Gupta <pankaj.gupta.linux@gmail.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Jason Gunthorpe <jgg@ziepe.ca>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Ard Biesheuvel <ardb@kernel.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: "K. Y. Srinivasan" <kys@microsoft.com>
      Cc: Haiyang Zhang <haiyangz@microsoft.com>
      Cc: Stephen Hemminger <sthemmin@microsoft.com>
      Cc: Wei Liu <wei.liu@kernel.org>
      Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
      Cc: Juergen Gross <jgross@suse.com>
      Cc: Stefano Stabellini <sstabellini@kernel.org>
      Cc: Roger Pau Monné <roger.pau@citrix.com>
      Cc: Julien Grall <julien@xen.org>
      Cc: Baoquan He <bhe@redhat.com>
      Cc: Wei Yang <richardw.yang@linux.intel.com>
      Cc: Anton Blanchard <anton@ozlabs.org>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Christian Borntraeger <borntraeger@de.ibm.com>
      Cc: Dave Jiang <dave.jiang@intel.com>
      Cc: Eric Biederman <ebiederm@xmission.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Heiko Carstens <hca@linux.ibm.com>
      Cc: Jason Wang <jasowang@redhat.com>
      Cc: Len Brown <lenb@kernel.org>
      Cc: Leonardo Bras <leobras.c@gmail.com>
      Cc: Libor Pechacek <lpechacek@suse.cz>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: "Michael S. Tsirkin" <mst@redhat.com>
      Cc: Nathan Lynch <nathanl@linux.ibm.com>
      Cc: "Oliver O'Halloran" <oohall@gmail.com>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Pingfan Liu <kernelfans@gmail.com>
      Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
      Cc: Vasily Gorbik <gor@linux.ibm.com>
      Cc: Vishal Verma <vishal.l.verma@intel.com>
      Link: https://lkml.kernel.org/r/20200911103459.10306-6-david@redhat.comSigned-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      9ca6551e
    • D
      mm/memory_hotplug: prepare passing flags to add_memory() and friends · b6117199
      David Hildenbrand 提交于
      We soon want to pass flags, e.g., to mark added System RAM resources.
      mergeable.  Prepare for that.
      
      This patch is based on a similar patch by Oscar Salvador:
      
      https://lkml.kernel.org/r/20190625075227.15193-3-osalvador@suse.deSigned-off-by: NDavid Hildenbrand <david@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Juergen Gross <jgross@suse.com> # Xen related part
      Reviewed-by: NPankaj Gupta <pankaj.gupta.linux@gmail.com>
      Acked-by: NWei Liu <wei.liu@kernel.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Jason Gunthorpe <jgg@ziepe.ca>
      Cc: Baoquan He <bhe@redhat.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
      Cc: Len Brown <lenb@kernel.org>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Vishal Verma <vishal.l.verma@intel.com>
      Cc: Dave Jiang <dave.jiang@intel.com>
      Cc: "K. Y. Srinivasan" <kys@microsoft.com>
      Cc: Haiyang Zhang <haiyangz@microsoft.com>
      Cc: Stephen Hemminger <sthemmin@microsoft.com>
      Cc: Wei Liu <wei.liu@kernel.org>
      Cc: Heiko Carstens <hca@linux.ibm.com>
      Cc: Vasily Gorbik <gor@linux.ibm.com>
      Cc: Christian Borntraeger <borntraeger@de.ibm.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: "Michael S. Tsirkin" <mst@redhat.com>
      Cc: Jason Wang <jasowang@redhat.com>
      Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
      Cc: Stefano Stabellini <sstabellini@kernel.org>
      Cc: "Oliver O'Halloran" <oohall@gmail.com>
      Cc: Pingfan Liu <kernelfans@gmail.com>
      Cc: Nathan Lynch <nathanl@linux.ibm.com>
      Cc: Libor Pechacek <lpechacek@suse.cz>
      Cc: Anton Blanchard <anton@ozlabs.org>
      Cc: Leonardo Bras <leobras.c@gmail.com>
      Cc: Ard Biesheuvel <ardb@kernel.org>
      Cc: Eric Biederman <ebiederm@xmission.com>
      Cc: Julien Grall <julien@xen.org>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Roger Pau Monné <roger.pau@citrix.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Wei Yang <richardw.yang@linux.intel.com>
      Link: https://lkml.kernel.org/r/20200911103459.10306-5-david@redhat.comSigned-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      b6117199
    • D
      kernel/resource: move and rename IORESOURCE_MEM_DRIVER_MANAGED · 7cf603d1
      David Hildenbrand 提交于
      IORESOURCE_MEM_DRIVER_MANAGED currently uses an unused PnP bit, which is
      always set to 0 by hardware.  This is far from beautiful (and confusing),
      and the bit only applies to SYSRAM.  So let's move it out of the
      bus-specific (PnP) defined bits.
      
      We'll add another SYSRAM specific bit soon.  If we ever need more bits for
      other purposes, we can steal some from "desc", or reshuffle/regroup what
      we have.
      Signed-off-by: NDavid Hildenbrand <david@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Jason Gunthorpe <jgg@ziepe.ca>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Ard Biesheuvel <ardb@kernel.org>
      Cc: Pankaj Gupta <pankaj.gupta.linux@gmail.com>
      Cc: Baoquan He <bhe@redhat.com>
      Cc: Wei Yang <richardw.yang@linux.intel.com>
      Cc: Eric Biederman <ebiederm@xmission.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Anton Blanchard <anton@ozlabs.org>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
      Cc: Christian Borntraeger <borntraeger@de.ibm.com>
      Cc: Dave Jiang <dave.jiang@intel.com>
      Cc: Haiyang Zhang <haiyangz@microsoft.com>
      Cc: Heiko Carstens <hca@linux.ibm.com>
      Cc: Jason Wang <jasowang@redhat.com>
      Cc: Juergen Gross <jgross@suse.com>
      Cc: Julien Grall <julien@xen.org>
      Cc: "K. Y. Srinivasan" <kys@microsoft.com>
      Cc: Len Brown <lenb@kernel.org>
      Cc: Leonardo Bras <leobras.c@gmail.com>
      Cc: Libor Pechacek <lpechacek@suse.cz>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: "Michael S. Tsirkin" <mst@redhat.com>
      Cc: Nathan Lynch <nathanl@linux.ibm.com>
      Cc: "Oliver O'Halloran" <oohall@gmail.com>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Pingfan Liu <kernelfans@gmail.com>
      Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
      Cc: Roger Pau Monné <roger.pau@citrix.com>
      Cc: Stefano Stabellini <sstabellini@kernel.org>
      Cc: Stephen Hemminger <sthemmin@microsoft.com>
      Cc: Vasily Gorbik <gor@linux.ibm.com>
      Cc: Vishal Verma <vishal.l.verma@intel.com>
      Cc: Wei Liu <wei.liu@kernel.org>
      Link: https://lkml.kernel.org/r/20200911103459.10306-3-david@redhat.comSigned-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      7cf603d1
    • D
      kernel/resource: make release_mem_region_adjustable() never fail · ec62d04e
      David Hildenbrand 提交于
      Patch series "selective merging of system ram resources", v4.
      
      Some add_memory*() users add memory in small, contiguous memory blocks.
      Examples include virtio-mem, hyper-v balloon, and the XEN balloon.
      
      This can quickly result in a lot of memory resources, whereby the actual
      resource boundaries are not of interest (e.g., it might be relevant for
      DIMMs, exposed via /proc/iomem to user space).  We really want to merge
      added resources in this scenario where possible.
      
      Resources are effectively stored in a list-based tree.  Having a lot of
      resources not only wastes memory, it also makes traversing that tree more
      expensive, and makes /proc/iomem explode in size (e.g., requiring
      kexec-tools to manually merge resources when creating a kdump header.  The
      current kexec-tools resource count limit does not allow for more than
      ~100GB of memory with a memory block size of 128MB on x86-64).
      
      Let's allow to selectively merge system ram resources by specifying a new
      flag for add_memory*().  Patch #5 contains a /proc/iomem example.  Only
      tested with virtio-mem.
      
      This patch (of 8):
      
      Let's make sure splitting a resource on memory hotunplug will never fail.
      This will become more relevant once we merge selected System RAM resources
      - then, we'll trigger that case more often on memory hotunplug.
      
      In general, this function is already unlikely to fail.  When we remove
      memory, we free up quite a lot of metadata (memmap, page tables, memory
      block device, etc.).  The only reason it could really fail would be when
      injecting allocation errors.
      
      All other error cases inside release_mem_region_adjustable() seem to be
      sanity checks if the function would be abused in different context - let's
      add WARN_ON_ONCE() in these cases so we can catch them.
      
      [natechancellor@gmail.com: fix use of ternary condition in release_mem_region_adjustable]
        Link: https://lkml.kernel.org/r/20200922060748.2452056-1-natechancellor@gmail.com
        Link: https://github.com/ClangBuiltLinux/linux/issues/1159Signed-off-by: NDavid Hildenbrand <david@redhat.com>
      Signed-off-by: NNathan Chancellor <natechancellor@gmail.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Jason Gunthorpe <jgg@ziepe.ca>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Ard Biesheuvel <ardb@kernel.org>
      Cc: Pankaj Gupta <pankaj.gupta.linux@gmail.com>
      Cc: Baoquan He <bhe@redhat.com>
      Cc: Wei Yang <richardw.yang@linux.intel.com>
      Cc: Anton Blanchard <anton@ozlabs.org>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
      Cc: Christian Borntraeger <borntraeger@de.ibm.com>
      Cc: Dave Jiang <dave.jiang@intel.com>
      Cc: Eric Biederman <ebiederm@xmission.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Haiyang Zhang <haiyangz@microsoft.com>
      Cc: Heiko Carstens <hca@linux.ibm.com>
      Cc: Jason Wang <jasowang@redhat.com>
      Cc: Juergen Gross <jgross@suse.com>
      Cc: Julien Grall <julien@xen.org>
      Cc: "K. Y. Srinivasan" <kys@microsoft.com>
      Cc: Len Brown <lenb@kernel.org>
      Cc: Leonardo Bras <leobras.c@gmail.com>
      Cc: Libor Pechacek <lpechacek@suse.cz>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: "Michael S. Tsirkin" <mst@redhat.com>
      Cc: Nathan Lynch <nathanl@linux.ibm.com>
      Cc: "Oliver O'Halloran" <oohall@gmail.com>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Pingfan Liu <kernelfans@gmail.com>
      Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
      Cc: Roger Pau Monn <roger.pau@citrix.com>
      Cc: Stefano Stabellini <sstabellini@kernel.org>
      Cc: Stephen Hemminger <sthemmin@microsoft.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vasily Gorbik <gor@linux.ibm.com>
      Cc: Vishal Verma <vishal.l.verma@intel.com>
      Cc: Wei Liu <wei.liu@kernel.org>
      Link: https://lkml.kernel.org/r/20200911103459.10306-2-david@redhat.comSigned-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      ec62d04e
    • D
      mm/memory_hotplug: mark pageblocks MIGRATE_ISOLATE while onlining memory · b30c5927
      David Hildenbrand 提交于
      Currently, it can happen that pages are allocated (and freed) via the
      buddy before we finished basic memory onlining.
      
      For example, pages are exposed to the buddy and can be allocated before we
      actually mark the sections online.  Allocated pages could suddenly fail
      pfn_to_online_page() checks.  We had similar issues with pcp handling,
      when pages are allocated+freed before we reach zone_pcp_update() in
      online_pages() [1].
      
      Instead, mark all pageblocks MIGRATE_ISOLATE, such that allocations are
      impossible.  Once done with the heavy lifting, use
      undo_isolate_page_range() to move the pages to the MIGRATE_MOVABLE
      freelist, marking them ready for allocation.  Similar to offline_pages(),
      we have to manually adjust zone->nr_isolate_pageblock.
      
      [1] https://lkml.kernel.org/r/1597150703-19003-1-git-send-email-charante@codeaurora.orgSigned-off-by: NDavid Hildenbrand <david@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Reviewed-by: NOscar Salvador <osalvador@suse.de>
      Acked-by: NMichal Hocko <mhocko@suse.com>
      Cc: Wei Yang <richard.weiyang@linux.alibaba.com>
      Cc: Baoquan He <bhe@redhat.com>
      Cc: Pankaj Gupta <pankaj.gupta.linux@gmail.com>
      Cc: Charan Teja Reddy <charante@codeaurora.org>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Fenghua Yu <fenghua.yu@intel.com>
      Cc: Logan Gunthorpe <logang@deltatee.com>
      Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Michel Lespinasse <walken@google.com>
      Cc: Mike Rapoport <rppt@kernel.org>
      Cc: Tony Luck <tony.luck@intel.com>
      Link: https://lkml.kernel.org/r/20200819175957.28465-11-david@redhat.comSigned-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      b30c5927
    • D
      mm: pass migratetype into memmap_init_zone() and move_pfn_range_to_zone() · d882c006
      David Hildenbrand 提交于
      On the memory onlining path, we want to start with MIGRATE_ISOLATE, to
      un-isolate the pages after memory onlining is complete.  Let's allow
      passing in the migratetype.
      Signed-off-by: NDavid Hildenbrand <david@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Reviewed-by: NOscar Salvador <osalvador@suse.de>
      Acked-by: NMichal Hocko <mhocko@suse.com>
      Cc: Wei Yang <richard.weiyang@linux.alibaba.com>
      Cc: Baoquan He <bhe@redhat.com>
      Cc: Pankaj Gupta <pankaj.gupta.linux@gmail.com>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: Fenghua Yu <fenghua.yu@intel.com>
      Cc: Logan Gunthorpe <logang@deltatee.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Mike Rapoport <rppt@kernel.org>
      Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
      Cc: Michel Lespinasse <walken@google.com>
      Cc: Charan Teja Reddy <charante@codeaurora.org>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Link: https://lkml.kernel.org/r/20200819175957.28465-10-david@redhat.comSigned-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      d882c006
    • D
      mm/memory_hotplug: simplify page onlining · aac65321
      David Hildenbrand 提交于
      We don't allow to offline memory with holes, all boot memory is online,
      and all hotplugged memory cannot have holes.
      
      We can now simplify onlining of pages.  As we only allow to online/offline
      full sections and sections always span full MAX_ORDER_NR_PAGES, we can
      just process MAX_ORDER - 1 pages without further special handling.
      
      The number of onlined pages simply corresponds to the number of pages we
      were requested to online.
      
      While at it, refine the comment regarding the callback not exposing all
      pages to the buddy.
      Signed-off-by: NDavid Hildenbrand <david@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Reviewed-by: NOscar Salvador <osalvador@suse.de>
      Acked-by: NMichal Hocko <mhocko@suse.com>
      Cc: Wei Yang <richard.weiyang@linux.alibaba.com>
      Cc: Baoquan He <bhe@redhat.com>
      Cc: Pankaj Gupta <pankaj.gupta.linux@gmail.com>
      Cc: Charan Teja Reddy <charante@codeaurora.org>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Fenghua Yu <fenghua.yu@intel.com>
      Cc: Logan Gunthorpe <logang@deltatee.com>
      Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Michel Lespinasse <walken@google.com>
      Cc: Mike Rapoport <rppt@kernel.org>
      Cc: Tony Luck <tony.luck@intel.com>
      Link: https://lkml.kernel.org/r/20200819175957.28465-8-david@redhat.comSigned-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      aac65321
    • D
      mm/page_isolation: simplify return value of start_isolate_page_range() · 3fa0c7c7
      David Hildenbrand 提交于
      Callers no longer need the number of isolated pageblocks.  Let's simplify.
      Signed-off-by: NDavid Hildenbrand <david@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Reviewed-by: NOscar Salvador <osalvador@suse.de>
      Acked-by: NMichal Hocko <mhocko@suse.com>
      Cc: Wei Yang <richard.weiyang@linux.alibaba.com>
      Cc: Baoquan He <bhe@redhat.com>
      Cc: Pankaj Gupta <pankaj.gupta.linux@gmail.com>
      Cc: Charan Teja Reddy <charante@codeaurora.org>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Fenghua Yu <fenghua.yu@intel.com>
      Cc: Logan Gunthorpe <logang@deltatee.com>
      Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Michel Lespinasse <walken@google.com>
      Cc: Mike Rapoport <rppt@kernel.org>
      Cc: Tony Luck <tony.luck@intel.com>
      Link: https://lkml.kernel.org/r/20200819175957.28465-7-david@redhat.comSigned-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      3fa0c7c7
    • D
      mm/memory_hotplug: drop nr_isolate_pageblock in offline_pages() · ea15153c
      David Hildenbrand 提交于
      We make sure that we cannot have any memory holes right at the beginning
      of offline_pages() and we only support to online/offline full sections.
      Both, sections and pageblocks are a power of two in size, and sections
      always span full pageblocks.
      
      We can directly calculate the number of isolated pageblocks from nr_pages.
      Signed-off-by: NDavid Hildenbrand <david@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Reviewed-by: NOscar Salvador <osalvador@suse.de>
      Acked-by: NMichal Hocko <mhocko@suse.com>
      Cc: Wei Yang <richard.weiyang@linux.alibaba.com>
      Cc: Baoquan He <bhe@redhat.com>
      Cc: Pankaj Gupta <pankaj.gupta.linux@gmail.com>
      Cc: Charan Teja Reddy <charante@codeaurora.org>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Fenghua Yu <fenghua.yu@intel.com>
      Cc: Logan Gunthorpe <logang@deltatee.com>
      Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Michel Lespinasse <walken@google.com>
      Cc: Mike Rapoport <rppt@kernel.org>
      Cc: Tony Luck <tony.luck@intel.com>
      Link: https://lkml.kernel.org/r/20200819175957.28465-6-david@redhat.comSigned-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      ea15153c
    • D
      mm/memory_hotplug: simplify page offlining · 0a1a9a00
      David Hildenbrand 提交于
      We make sure that we cannot have any memory holes right at the beginning
      of offline_pages().  We no longer need walk_system_ram_range() and can
      call test_pages_isolated() and __offline_isolated_pages() directly.
      
      offlined_pages always corresponds to nr_pages, so we can simplify that.
      
      [akpm@linux-foundation.org: patch conflict resolution]
      Signed-off-by: NDavid Hildenbrand <david@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Reviewed-by: NOscar Salvador <osalvador@suse.de>
      Acked-by: NMichal Hocko <mhocko@suse.com>
      Cc: Wei Yang <richard.weiyang@linux.alibaba.com>
      Cc: Baoquan He <bhe@redhat.com>
      Cc: Pankaj Gupta <pankaj.gupta.linux@gmail.com>
      Cc: Charan Teja Reddy <charante@codeaurora.org>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Fenghua Yu <fenghua.yu@intel.com>
      Cc: Logan Gunthorpe <logang@deltatee.com>
      Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Michel Lespinasse <walken@google.com>
      Cc: Mike Rapoport <rppt@kernel.org>
      Cc: Tony Luck <tony.luck@intel.com>
      Link: https://lkml.kernel.org/r/20200819175957.28465-4-david@redhat.comSigned-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      0a1a9a00