1. 14 May 2022 (1 commit)
    • mm: hugetlb_vmemmap: add hugetlb_optimize_vmemmap sysctl · 78f39084
      Authored by Muchun Song
      We must add hugetlb_free_vmemmap=on (or "off") to the boot cmdline and
      reboot the server to enable or disable the feature of optimizing vmemmap
      pages associated with HugeTLB pages.  However, rebooting usually takes a
      long time.  So add a sysctl to enable or disable the feature at runtime
      without rebooting.  Why do we need this?  There are three use cases.
      
      1) The feature of minimizing overhead of struct page associated with
         each HugeTLB is disabled by default without passing
         "hugetlb_free_vmemmap=on" to the boot cmdline.  When we (ByteDance)
         deliver the servers to the users who want to enable this feature, they
         have to configure the grub (change boot cmdline) and reboot the
         servers, whereas rebooting usually takes a long time (we have thousands
         of servers).  It's a very bad experience for the users.  So we need an
         approach to enable this feature without rebooting.  This is a use case in
         our practical environment.
      
      2) In some use cases, HugeTLB pages are allocated 'on the fly' instead
         of being pulled from the HugeTLB pool; those workloads would be
         affected with this feature enabled.  Such workloads can be identified
         by the characteristic that they never explicitly allocate huge pages
         with 'nr_hugepages' but only set 'nr_overcommit_hugepages' and then
         let the pages be allocated from the buddy allocator at fault time.
         We can confirm this is a real use case from commit 099730d6.  For
         those workloads, the page fault time could be ~2x slower than before.
         We suspect those users would want to disable this feature if the
         system has enabled it before and they don't think the memory savings
         benefit is enough to make up for the performance drop.
      
      3) A workload that wants vmemmap pages to be optimized and a workload
         that wants to set 'nr_overcommit_hugepages' without paying the extra
         overhead at fault time when the overcommitted pages are allocated
         from the buddy allocator may be deployed on the same server.  The
         user could enable this feature, set 'nr_hugepages' and
         'nr_overcommit_hugepages', and then disable the feature (see the
         sketch below).  In this case, the overcommitted HugeTLB pages will
         not encounter the extra overhead at fault time.
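
      A minimal sketch of that workflow, assuming the knob added by this
      patch is exposed as vm.hugetlb_optimize_vmemmap (1 enables the
      optimization, 0 disables it):

         # Enable vmemmap optimization at runtime -- no reboot needed.
         sysctl -w vm.hugetlb_optimize_vmemmap=1
         # Pages reserved into the pool now have their vmemmap optimized.
         echo 1024 > /proc/sys/vm/nr_hugepages
         # Disable the feature before allowing overcommit, so pages
         # allocated from the buddy allocator at fault time skip the
         # extra overhead.
         sysctl -w vm.hugetlb_optimize_vmemmap=0
         echo 1024 > /proc/sys/vm/nr_overcommit_hugepages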
      
      Link: https://lkml.kernel.org/r/20220512041142.39501-5-songmuchun@bytedance.com
      Signed-off-by: Muchun Song <songmuchun@bytedance.com>
      Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Luis Chamberlain <mcgrof@kernel.org>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Iurii Zaikin <yzaikin@google.com>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Masahiro Yamada <masahiroy@kernel.org>
      Cc: Xiongchun Duan <duanxiongchun@bytedance.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
  2. 29 April 2022 (1 commit)
    • mm/sparse-vmemmap: add a pgmap argument to section activation · e3246d8f
      Authored by Joao Martins
      Patch series "sparse-vmemmap: memory savings for compound devmaps (device-dax)", v9.
      
      This series minimizes 'struct page' overhead by pursuing a similar
      approach as Muchun Song's series "Free some vmemmap pages of hugetlb page"
      (now merged since v5.14), but applied to devmap with @vmemmap_shift
      (device-dax).
      
      The original idea of vmemmap deduplication (already used in HugeTLB) is
      to reuse/deduplicate tail page vmemmap areas, particularly the area which
      only describes tail pages.  So a vmemmap page describes 64 struct pages,
      and the first page for a given ZONE_DEVICE vmemmap would contain the head
      page and 63 tail pages.  The second vmemmap page would contain only tail
      pages, and that's what gets reused across the rest of the
      subsection/section.  The bigger the page size, the bigger the savings
      (2M hpage -> save 6 vmemmap pages; 1G hpage -> save 4094 vmemmap pages).
      
      This is done for PMEM /specifically only/ on device-dax configured
      namespaces, not fsdax.  In other words, a devmap with a @vmemmap_shift.
      
      In terms of savings, per 1Tb of memory, the struct page cost would go down
      with compound devmap:
      
      * with 2M pages we lose 4G instead of 16G (0.39% instead of 1.5% of
        total memory)
      
      * with 1G pages we lose 40MB instead of 16G (0.0014% instead of 1.5% of
        total memory)
      
      The series is mostly summed up by patch 4, and to summarize what the
      series does:
      
      Patches 1 - 3: Minor cleanups in preparation for patch 4.  Move the very
      nice docs of hugetlb_vmemmap.c into a Documentation/vm/ entry.
      
      Patch 4: Patch 4 is the one that takes care of the struct page savings
      (also referred to here as tail-page/vmemmap deduplication).  Much like
      Muchun's series, we reuse the second PTE tail page vmemmap areas across a
      given @vmemmap_shift.  One important difference, though, is that contrary
      to the hugetlbfs series, there's no vmemmap for the area yet, because we
      are late-populating it as opposed to remapping a system-RAM range.  IOW,
      there is no freeing of pages of already-initialized vmemmap like in the
      hugetlbfs case, which greatly simplifies the logic (besides not being
      arch-specific).  The altmap case is unchanged and still goes via
      vmemmap_populate().  Also adjust the newly added docs to the device-dax
      case.
      
      [Note that device-dax is still a little behind HugeTLB in terms of
      savings.  I have an additional simple patch that reuses the head vmemmap
      page too, as a follow-up.  That will double the savings and further
      speed up namespace initialization.]
      
      Patch 5: Initialize fewer struct pages depending on the page size, for
      DRAM-backed struct pages -- because fewer pages are unique and most are
      tail pages (with a bigger vmemmap_shift).
      
          NVDIMM namespace bootstrap improves from ~268-358 ms to
          ~80-110 ms / <1 ms on 128G NVDIMMs with 2M and 1G respectively.  And
          struct page needed capacity will be 3.8x / 1071x smaller for 2M and
          1G respectively.  Tested on x86 with 1.5Tb of pmem (including pinning,
          and RDMA registration/deregistration scalability with 2M MRs)
      
      
      This patch (of 5):
      
      In support of using compound pages for devmap mappings, plumb the pgmap
      down to the vmemmap_populate implementation.  Note that while altmap is
      retrievable from pgmap the memory hotplug code passes altmap without
      pgmap[*], so both need to be independently plumbed.
      
      So in addition to @altmap, pass @pgmap to sparse section populate
      functions namely:
      
      	sparse_add_section
      	  section_activate
      	    populate_section_memmap
      	      __populate_section_memmap
      
      Passing @pgmap allows __populate_section_memmap() both to fetch the
      vmemmap_shift with which the memmap metadata is created, and to let
      sparse-vmemmap fetch the pgmap ranges to correlate to a given section and
      pick whether to just reuse tail pages from past onlined sections.
      
      While at it, fix the kdoc for @altmap for sparse_add_section().
      
      [*] https://lore.kernel.org/linux-mm/20210319092635.6214-1-osalvador@suse.de/
      
      Link: https://lkml.kernel.org/r/20220420155310.9712-1-joao.m.martins@oracle.com
      Link: https://lkml.kernel.org/r/20220420155310.9712-2-joao.m.martins@oracle.com
      Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
      Reviewed-by: Dan Williams <dan.j.williams@intel.com>
      Reviewed-by: Muchun Song <songmuchun@bytedance.com>
      Cc: Vishal Verma <vishal.l.verma@intel.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Jason Gunthorpe <jgg@ziepe.ca>
      Cc: Jane Chu <jane.chu@oracle.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
  3. 23 March 2022 (5 commits)
    • drivers/base/memory: determine and store zone for single-zone memory blocks · 395f6081
      Authored by David Hildenbrand
      test_pages_in_a_zone() is just another nasty PFN walker that can easily
      stumble over ZONE_DEVICE memory ranges falling into the same memory block
      as ordinary system RAM: the memmap of parts of these ranges might possibly
      be uninitialized.  In fact, we observed (on an older kernel) with UBSAN:
      
        UBSAN: Undefined behaviour in ./include/linux/mm.h:1133:50
        index 7 is out of range for type 'zone [5]'
        CPU: 121 PID: 35603 Comm: read_all Kdump: loaded Tainted: [...]
        Hardware name: Dell Inc. PowerEdge R7425/08V001, BIOS 1.12.2 11/15/2019
        Call Trace:
         dump_stack+0x9a/0xf0
         ubsan_epilogue+0x9/0x7a
         __ubsan_handle_out_of_bounds+0x13a/0x181
         test_pages_in_a_zone+0x3c4/0x500
         show_valid_zones+0x1fa/0x380
         dev_attr_show+0x43/0xb0
         sysfs_kf_seq_show+0x1c5/0x440
         seq_read+0x49d/0x1190
         vfs_read+0xff/0x300
         ksys_read+0xb8/0x170
         do_syscall_64+0xa5/0x4b0
         entry_SYSCALL_64_after_hwframe+0x6a/0xdf
        RIP: 0033:0x7f01f4439b52
      
      We seem to stumble over a memmap that contains a garbage zone id.  While
      we could try inserting pfn_to_online_page() calls, it will just make
      memory offlining slower, because we use test_pages_in_a_zone() to make
      sure we're offlining pages that all belong to the same zone.
      
      Let's just get rid of this PFN walker and determine the single zone of a
      memory block -- if any -- for early memory blocks during boot.  For memory
      onlining, we know the single zone already.  Let's avoid any additional
      memmap scanning and just rely on the zone information available during
      boot.
      
      For memory hot(un)plug, we only really care about memory blocks that:
      * span a single zone (and, thereby, a single node)
      * are completely System RAM (IOW, no holes, no ZONE_DEVICE)
      If one of these conditions is not met, we reject memory offlining.
      Hotplugged memory blocks (starting out offline) always meet both
      conditions.
      
      There are three scenarios to handle:
      
      (1) Memory hot(un)plug
      
      A memory block with zone == NULL cannot be offlined, corresponding to
      our previous test_pages_in_a_zone() check.
      
      After successful memory onlining/offlining, we simply set the zone
      accordingly.
      * Memory onlining: set the zone we just used for onlining
      * Memory offlining: set zone = NULL
      
      So a hotplugged memory block starts with zone = NULL. Once memory
      onlining is done, we set the proper zone.
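
      For illustration, these transitions are driven through the existing
      memory block "state" interface in sysfs (block index 32 is just an
      example):

         # Offline a memory block: its stored zone becomes NULL again.
         echo offline > /sys/devices/system/memory/memory32/state
         # Online it to ZONE_MOVABLE: the stored zone is set accordingly.
         echo online_movable > /sys/devices/system/memory/memory32/state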
      
      (2) Boot memory with !CONFIG_NUMA
      
      We know that there is just a single pgdat, so we simply scan all zones
      of that pgdat for an intersection with our memory block PFN range when
      adding the memory block. If more than one zone intersects (e.g., DMA and
      DMA32 on x86 for the first memory block) we set zone = NULL and
      consequently mimic what test_pages_in_a_zone() used to do.
      
      (3) Boot memory with CONFIG_NUMA
      
      At the point in time we create the memory block devices during boot, we
      don't know yet which nodes *actually* span a memory block. While we could
      scan all zones of all nodes for intersections, overlapping nodes complicate
      the situation and scanning all nodes is possibly expensive. But that
      problem has already been solved by the code that sets the node of a memory
      block and creates the link in the sysfs --
      do_register_memory_block_under_node().
      
      So, we hook into the code that sets the node id for a memory block. If
      we already have a different node id set for the memory block, we know
      that multiple nodes *actually* have PFNs falling into our memory block:
      we set zone = NULL and consequently mimic what test_pages_in_a_zone() used
      to do. If there is no node id set, we do the same as (2) for the given
      node.
      
      Note that the call order in driver_init() is:
      -> memory_dev_init(): create memory block devices
      -> node_dev_init(): link memory block devices to the node and set the
      		    node id
      
      So in summary, we detect if there is a single zone responsible for this
      memory block and we consequently store the zone in that case in the
      memory block, updating it during memory onlining/offlining.
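
      The stored zone is visible through the same sysfs attribute whose
      handler (show_valid_zones) appeared in the splat above; for an online
      block it can now be reported without any PFN walk (block index assumed
      for illustration):

         # Prints e.g. "Normal" or "Movable" for an online memory block.
         cat /sys/devices/system/memory/memory32/valid_zones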
      
      Link: https://lkml.kernel.org/r/20220210184359.235565-3-david@redhat.com
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Reported-by: Rafael Parra <rparrazo@redhat.com>
      Reviewed-by: Oscar Salvador <osalvador@suse.de>
      Cc: "Rafael J. Wysocki" <rafael@kernel.org>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Rafael Parra <rparrazo@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm, memory_hotplug: reorganize new pgdat initialization · 70b5b46a
      Authored by Michal Hocko
      When a !node_online node is brought up, it needs hotplug-specific
      initialization, because the node could either be uninitialized yet or
      have been recycled after a previous hotremove.  hotadd_init_pgdat is
      responsible for that.
      
      Internal pgdat state is initialized at two places currently
      	- hotadd_init_pgdat
      	- free_area_init_core_hotplug
      
      There is no real clear cut for what should go where, but this patch
      chooses to move the whole internal state initialization into
      free_area_init_core_hotplug.  hotadd_init_pgdat is still responsible
      for pulling all the parts together - most notably for initializing
      zonelists, because those depend on the overall topology.
      
      This patch doesn't introduce any functional change.
      
      Link: https://lkml.kernel.org/r/20220127085305.20890-5-mhocko@kernel.org
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Rafael Aquini <raquini@redhat.com>
      Acked-by: David Hildenbrand <david@redhat.com>
      Reviewed-by: Oscar Salvador <osalvador@suse.de>
      Cc: Alexey Makhalov <amakhalov@vmware.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Dennis Zhou <dennis@kernel.org>
      Cc: Eric Dumazet <eric.dumazet@gmail.com>
      Cc: Mike Rapoport <rppt@linux.ibm.com>
      Cc: Nico Pache <npache@redhat.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Wei Yang <richard.weiyang@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm, memory_hotplug: drop arch_free_nodedata · 390511e1
      Authored by Michal Hocko
      Prior to "mm: handle uninitialized numa nodes gracefully" memory hotplug
      used to allocate pgdat when memory has been added to a node
      (hotadd_init_pgdat) arch_free_nodedata has been only used in the failure
      path because once the pgdat is exported (to be visible by NODA_DATA(nid))
      it cannot really be freed because there is no synchronization available
      for that.
      
      pgdat is now allocated for each possible node, so memory hotplug
      doesn't ever need to use arch_free_nodedata; drop it.
      
      This patch doesn't introduce any functional change.
      
      Link: https://lkml.kernel.org/r/20220127085305.20890-4-mhocko@kernel.org
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Rafael Aquini <raquini@redhat.com>
      Acked-by: David Hildenbrand <david@redhat.com>
      Acked-by: Mike Rapoport <rppt@linux.ibm.com>
      Reviewed-by: Oscar Salvador <osalvador@suse.de>
      Cc: Alexey Makhalov <amakhalov@vmware.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Dennis Zhou <dennis@kernel.org>
      Cc: Eric Dumazet <eric.dumazet@gmail.com>
      Cc: Nico Pache <npache@redhat.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Wei Yang <richard.weiyang@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: handle uninitialized numa nodes gracefully · 09f49dca
      Authored by Michal Hocko
      We have had several reports [1][2][3] that page allocator blows up when an
      allocation from a possible node is requested.  The underlying reason is
      that NODE_DATA for the specific node is not allocated.
      
      NUMA specific initialization is arch specific and it can vary a lot.  E.g.
      x86 tries to initialize all nodes that have some cpu affinity (see
      init_cpu_to_node) but this can be insufficient because the node might be
      cpuless for example.
      
      One way to address this problem would be to check for !node_online nodes
      when trying to get a zonelist and silently fall back to another node.
      That is unfortunately adding a branch into allocator hot path and it
      doesn't handle any other potential NODE_DATA users.
      
      This patch takes a different approach (following the lead of [3]) and
      preallocates pgdat for all possible nodes in arch-independent code -
      free_area_init.  All uninitialized nodes are treated as memoryless nodes.
      node_state of the node is not changed, because that would lead to other
      side effects - e.g., the sysfs representation of such a node - and from
      past discussions [4] it is known that some tools might have problems
      digesting that.
      
      Newly allocated pgdat only gets a minimal initialization and the rest of
      the work is expected to be done by the memory hotplug - hotadd_new_pgdat
      (renamed to hotadd_init_pgdat).
      
      generic_alloc_nodedata is changed to use the memblock allocator because
      neither page nor slab allocators are available at the stage when all
      pgdats are allocated.  Hotplug doesn't allocate pgdat anymore so we can
      use the early boot allocator.  The only arch specific implementation is
      ia64 and that is changed to use the early allocator as well.
      
      [1] http://lkml.kernel.org/r/20211101201312.11589-1-amakhalov@vmware.com
      [2] http://lkml.kernel.org/r/20211207224013.880775-1-npache@redhat.com
      [3] http://lkml.kernel.org/r/20190114082416.30939-1-mhocko@kernel.org
      [4] http://lkml.kernel.org/r/20200428093836.27190-1-srikar@linux.vnet.ibm.com
      
      [akpm@linux-foundation.org: replace comment, per Mike]
      
      Link: https://lkml.kernel.org/r/Yfe7RBeLCijnWBON@dhcp22.suse.cz
      Reported-by: Alexey Makhalov <amakhalov@vmware.com>
      Tested-by: Alexey Makhalov <amakhalov@vmware.com>
      Reported-by: Nico Pache <npache@redhat.com>
      Acked-by: Rafael Aquini <raquini@redhat.com>
      Tested-by: Rafael Aquini <raquini@redhat.com>
      Acked-by: David Hildenbrand <david@redhat.com>
      Reviewed-by: Oscar Salvador <osalvador@suse.de>
      Acked-by: Mike Rapoport <rppt@linux.ibm.com>
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Dennis Zhou <dennis@kernel.org>
      Cc: Eric Dumazet <eric.dumazet@gmail.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Wei Yang <richard.weiyang@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm, memory_hotplug: make arch_alloc_nodedata independent on CONFIG_MEMORY_HOTPLUG · e930d999
      Authored by Michal Hocko
      Patch series "mm, memory_hotplug: handle unitialized numa node gracefully".
      
      The core of the fix is patch 2 which also links existing bug reports.  The
      high level goal is to have all possible numa nodes have their pgdat
      allocated and initialized so
      
      	for_each_possible_node(nid)
      		NODE_DATA(nid)
      
      will never return garbage.  This has proven to be a problem in several
      places when an offline numa node is used for an allocation just to realize
      that node_data and therefore allocation fallback zonelists are not
      initialized and such an allocation request blows up.
      
      There were attempts to address that by checking node_online in several
      places including the page allocator.  This patchset approaches the problem
      from a different perspective and instead of special casing, which just
      adds a runtime overhead, it allocates pglist_data for each possible node.
      This can add some memory overhead for platforms with a high number of
      possible nodes if they do not contain any memory.  This should be a rather
      rare configuration though.
      
      How to test this?  David has provided an excellent howto:
      http://lkml.kernel.org/r/6e5ebc19-890c-b6dd-1924-9f25c441010d@redhat.com
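
      A quick, purely illustrative check for whether a system has such
      possible-but-uninitialized nodes at all (the linked howto covers the
      full QEMU-based reproduction):

         # Nodes the kernel considers possible vs. those actually online;
         # any difference is a candidate for the NODE_DATA() blow-ups
         # addressed by this series.
         cat /sys/devices/system/node/possible
         cat /sys/devices/system/node/online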
      
      Patches 1 and 3-6 are mostly cleanups.  The patchset has been reviewed by
      Rafael (thanks!) and the core fix tested by Rafael and Alexey (thanks to
      both).  David has tested as per instructions above and hasn't found any
      fallouts in the memory hotplug scenarios.
      
      This patch (of 6):
      
      This is a preparatory patch and it doesn't introduce any functional
      change.  It merely pulls out arch_alloc_nodedata (and co) outside of
      CONFIG_MEMORY_HOTPLUG because the following patch will need to call this
      from the generic MM code.
      
      Link: https://lkml.kernel.org/r/20220127085305.20890-1-mhocko@kernel.org
      Link: https://lkml.kernel.org/r/20220127085305.20890-2-mhocko@kernel.org
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Rafael Aquini <raquini@redhat.com>
      Acked-by: David Hildenbrand <david@redhat.com>
      Acked-by: Mike Rapoport <rppt@linux.ibm.com>
      Reviewed-by: Oscar Salvador <osalvador@suse.de>
      Reviewed-by: Wei Yang <richard.weiyang@gmail.com>
      Cc: Alexey Makhalov <amakhalov@vmware.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Dennis Zhou <dennis@kernel.org>
      Cc: Eric Dumazet <eric.dumazet@gmail.com>
      Cc: Nico Pache <npache@redhat.com>
      Cc: Tejun Heo <tj@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  4. 07 November 2021 (1 commit)
  5. 09 September 2021 (7 commits)
    • mm/memory_hotplug: memory group aware "auto-movable" online policy · 445fcf7c
      Authored by David Hildenbrand
      Use memory groups to improve our "auto-movable" onlining policy:
      
      1. For static memory groups (e.g., a DIMM), online a memory block MOVABLE
         only if all other memory blocks in the group are either MOVABLE or could
         be onlined MOVABLE. A DIMM will either be MOVABLE or not, not a mixture.
      
      2) For dynamic memory groups (e.g., a virtio-mem device), online a
         memory block MOVABLE only if all other memory blocks inside the
         current unit are either MOVABLE or could be onlined MOVABLE. For a
         virtio-mem device with a device block size of 512 MiB, all 128 MiB
         memory blocks within a 512 MiB unit will either be MOVABLE or not,
         not a mixture.
      
      We have to pass the memory group to zone_for_pfn_range() to take the
      memory group into account.
      
      Note: for now, there seems to be no compelling reason to make this
      behavior configurable.
      
      Link: https://lkml.kernel.org/r/20210806124715.17090-9-david@redhat.com
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Cc: Anshuman Khandual <anshuman.khandual@arm.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Hui Zhu <teawater@gmail.com>
      Cc: Jason Wang <jasowang@redhat.com>
      Cc: Len Brown <lenb@kernel.org>
      Cc: Marek Kedzierski <mkedzier@redhat.com>
      Cc: "Michael S. Tsirkin" <mst@redhat.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Mike Rapoport <rppt@kernel.org>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: Pankaj Gupta <pankaj.gupta.linux@gmail.com>
      Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
      Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
      Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
      Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Wei Yang <richard.weiyang@linux.alibaba.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/memory_hotplug: track present pages in memory groups · 836809ec
      Authored by David Hildenbrand
      Let's track all present pages in each memory group.  Especially, track
      memory present in ZONE_MOVABLE and memory present in one of the kernel
      zones (which really only is ZONE_NORMAL right now as memory groups only
      apply to hotplugged memory) separately within a memory group, to prepare
      for making smart auto-online decisions for individual memory blocks within
      a memory group based on group statistics.
      
      Link: https://lkml.kernel.org/r/20210806124715.17090-5-david@redhat.com
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Cc: Anshuman Khandual <anshuman.khandual@arm.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Hui Zhu <teawater@gmail.com>
      Cc: Jason Wang <jasowang@redhat.com>
      Cc: Len Brown <lenb@kernel.org>
      Cc: Marek Kedzierski <mkedzier@redhat.com>
      Cc: "Michael S. Tsirkin" <mst@redhat.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Mike Rapoport <rppt@kernel.org>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: Pankaj Gupta <pankaj.gupta.linux@gmail.com>
      Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
      Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
      Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
      Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Wei Yang <richard.weiyang@linux.alibaba.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • drivers/base/memory: introduce "memory groups" to logically group memory blocks · 028fc57a
      Authored by David Hildenbrand
      In our "auto-movable" memory onlining policy, we want to make decisions
      across memory blocks of a single memory device.  Examples of memory
      devices include ACPI memory devices (in the simplest case a single DIMM)
      and virtio-mem.  For now, we don't have a connection between a single
      memory block device and the real memory device.  Each memory device
      consists of 1..X memory block devices.
      
      Let's logically group memory blocks belonging to the same memory device in
      "memory groups".  Memory groups can span multiple physical ranges and a
      memory group itself does not contain any information regarding physical
      ranges, only properties (e.g., "max_pages") necessary for improved memory
      onlining.
      
      Introduce two memory group types:
      
      1) Static memory group: E.g., a single ACPI memory device, consisting
         of 1..X memory resources.  A memory group consists of 1..Y memory
         blocks.  The whole group is added/removed in one go.  If any part
         cannot get offlined, the whole group cannot be removed.
      
      2) Dynamic memory group: E.g., a single virtio-mem device.  Memory is
         dynamically added/removed in a fixed granularity, called a "unit",
         consisting of 1..X memory blocks.  A unit is added/removed in one go.
         If any part of a unit cannot get offlined, the whole unit cannot be
         removed.
      
      In case of 1) we usually want either all memory managed by ZONE_MOVABLE or
      none.  In case of 2) we usually want to have as many units as possible
      managed by ZONE_MOVABLE.  We want a single unit to be of the same type.
      
      For now, memory groups are an internal concept that is not exposed to user
      space; we might want to change that in the future, though.
      
      add_memory() users can specify a mgid instead of a nid when passing the
      MHP_NID_IS_MGID flag.
      
      Link: https://lkml.kernel.org/r/20210806124715.17090-4-david@redhat.com
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Cc: Anshuman Khandual <anshuman.khandual@arm.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Hui Zhu <teawater@gmail.com>
      Cc: Jason Wang <jasowang@redhat.com>
      Cc: Len Brown <lenb@kernel.org>
      Cc: Marek Kedzierski <mkedzier@redhat.com>
      Cc: "Michael S. Tsirkin" <mst@redhat.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Mike Rapoport <rppt@kernel.org>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: Pankaj Gupta <pankaj.gupta.linux@gmail.com>
      Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
      Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
      Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
      Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Wei Yang <richard.weiyang@linux.alibaba.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: track present early pages per zone · 4b097002
      Authored by David Hildenbrand
      Patch series "mm/memory_hotplug: "auto-movable" online policy and memory groups", v3.
      
      I. Goal
      
      The goal of this series is improving in-kernel auto-online support.  It
      tackles the fundamental problems that:
      
       1) We can create zone imbalances when onlining all memory blindly to
          ZONE_MOVABLE, in the worst case crashing the system. We have to know
          upfront how much memory we are going to hotplug such that we can
          safely enable auto-onlining of all hotplugged memory to ZONE_MOVABLE
          via "online_movable". This is far from practical and only applicable in
          limited setups -- like inside VMs under the RHV/oVirt hypervisor which
          will never hotplug more than 3 times the boot memory (and that
          limitation is only in place due to this Linux limitation).
      
       2) We see more setups that implement dynamic VM resizing, hot(un)plugging
          memory to resize VM memory. In these setups, we might hotplug a lot of
          memory, but it might happen in various small steps in both directions
          (e.g., 2 GiB -> 8 GiB -> 4 GiB -> 16 GiB ...). virtio-mem is the
          primary driver of this upstream right now, performing such dynamic
          resizing NUMA-aware via multiple virtio-mem devices.
      
          Onlining all hotplugged memory to ZONE_NORMAL means we basically have
          no hotunplug guarantees. Onlining all to ZONE_MOVABLE means we can
          easily run into zone imbalances when growing a VM. We want a mixture,
          and we want as much memory as reasonable/configured in ZONE_MOVABLE.
          Details regarding zone imbalances can be found at [1].
      
       3) Memory devices consist of 1..X memory block devices, however, the
          kernel doesn't really track the relationship. Consequently, also user
          space has no idea. We want to make per-device decisions.
      
          As one example, for memory hotunplug it doesn't make sense to use a
          mixture of zones within a single DIMM: we want all MOVABLE if
          possible, otherwise all !MOVABLE, because any !MOVABLE part will easily
          block the whole DIMM from getting hotunplugged.
      
          As another example, virtio-mem operates on individual units that span
          1..X memory blocks. Similar to a DIMM, we want a unit to either be all
          MOVABLE or !MOVABLE. A "unit" can be thought of like a DIMM, however,
          all units of a virtio-mem device logically belong together and are
          managed (added/removed) by a single driver. We want as much memory of
          a virtio-mem device to be MOVABLE as possible.
      
       4) We want memory onlining to be done right from the kernel while adding
          memory, not triggered by user space via udev rules; for example, this
          is required for fast memory hotplug for drivers that add individual
          memory blocks, like virtio-mem. We want a way to configure a policy in
          the kernel and avoid implementing advanced policies in user space.
      
      The auto-onlining support we have in the kernel is not sufficient.  All we
      have is a) online everything MOVABLE (online_movable) b) online everything
      !MOVABLE (online_kernel) c) keep zones contiguous (online).  This series
      allows configuring c) to mean instead "online movable if possible
      according to the configuration, driven by a maximum MOVABLE:KERNEL ratio"
      -- a new onlining policy.
      
      II. Approach
      
      This series does 3 things:
      
       1) Introduces the "auto-movable" online policy that initially operates on
          individual memory blocks only. It uses a maximum MOVABLE:KERNEL ratio
          to make a decision whether a memory block will be onlined to
          ZONE_MOVABLE or not. However, in the basic form, hotplugged KERNEL
          memory does not allow for more MOVABLE memory (details in the
          patches). CMA memory is treated like MOVABLE memory.
      
       2) Introduces static (e.g., DIMM) and dynamic (e.g., virtio-mem) memory
          groups and uses group information to make decisions in the
          "auto-movable" online policy across memory blocks of a single memory
          device (modeled as memory group). More details can be found in patch
          #3 or in the DIMM example below.
      
       3) Maximizes ZONE_MOVABLE memory within dynamic memory groups, by
          allowing ZONE_NORMAL memory within a dynamic memory group to allow for
          more ZONE_MOVABLE memory within the same memory group. The target use
          case is dynamic VM resizing using virtio-mem. See the virtio-mem
          example below.
      
      I remember that the basic idea of using a ratio to implement a policy in
      the kernel was once mentioned by Vitaly Kuznetsov, but I might be wrong (I
      lost the pointer to that discussion).
      
      For me, the main use case is using it along with virtio-mem (and DIMMs /
      ppc64 dlpar where necessary) for dynamic resizing of VMs, increasing the
      amount of memory we can hotunplug reliably again if we might eventually
      hotplug a lot of memory to a VM.
      
      III. Target Usage
      
      The target usage will be:
      
       1) Linux boots with "mhp_default_online_type=offline"
      
       2) User space (e.g., systemd unit) configures memory onlining (according
          to a config file and system properties), for example:
          * Setting memory_hotplug.online_policy=auto-movable
          * Setting memory_hotplug.auto_movable_ratio=301
          * Setting memory_hotplug.auto_movable_numa_aware=true
      
       3) User space enables auto onlining via "echo online >
          /sys/devices/system/memory/auto_online_blocks"

       4) User space triggers manual onlining of all already-offline memory
          blocks (go over offline memory blocks and set them to "online");
          a sketch of these steps follows this list.
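
      A minimal sketch of steps 2)-4) as shell commands, assuming the
      module parameters are runtime-writable under
      /sys/module/memory_hotplug/parameters/ (otherwise they can be set on
      the kernel command line as in step 2):

         # 2) Configure the auto-movable onlining policy and ratio.
         echo auto-movable > /sys/module/memory_hotplug/parameters/online_policy
         echo 301 > /sys/module/memory_hotplug/parameters/auto_movable_ratio
         echo 1 > /sys/module/memory_hotplug/parameters/auto_movable_numa_aware
         # 3) Let the kernel auto-online memory hotplugged from now on.
         echo online > /sys/devices/system/memory/auto_online_blocks
         # 4) Online all memory blocks that are still offline.
         for state in /sys/devices/system/memory/memory*/state; do
                 grep -qx offline "$state" && echo online > "$state"
         done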
      
      IV. Example
      
      For DIMMs, hotplugging 4 GiB DIMMs to a 4 GiB VM with a configured ratio of
      301% results in the following layout:
      	Memory block 0-15:    DMA32   (early)
      	Memory block 32-47:   Normal  (early)
      	Memory block 48-79:   Movable (DIMM 0)
      	Memory block 80-111:  Movable (DIMM 1)
      	Memory block 112-143: Movable (DIMM 2)
      	Memory block 144-175: Normal  (DIMM 3)
      	Memory block 176-207: Normal  (DIMM 4)
      	... all Normal
      	(-> hotplugged Normal memory does not allow for more Movable memory)
      
      For virtio-mem, using a simple, single virtio-mem device with a 4 GiB VM
      will result in the following layout:
      	Memory block 0-15:    DMA32   (early)
      	Memory block 32-47:   Normal  (early)
      	Memory block 48-143:  Movable (virtio-mem, first 12 GiB)
      	Memory block 144:     Normal  (virtio-mem, next 128 MiB)
      	Memory block 145-147: Movable (virtio-mem, next 384 MiB)
      	Memory block 148:     Normal  (virtio-mem, next 128 MiB)
      	Memory block 149-151: Movable (virtio-mem, next 384 MiB)
      	... Normal/Movable mixture as above
      	(-> hotplugged Normal memory allows for more Movable memory within
      	    the same device)
      
      Which gives us maximum flexibility when dynamically growing/shrinking a
      VM in smaller steps.
      
      V. Doc Update
      
      I'll update the memory-hotplug.rst documentation once the overhaul [1] is
      upstream. Until then, details can be found in patch #2.
      
      VI. Future Work
      
       1) Use memory groups for ppc64 dlpar
       2) Being able to specify a portion of (early) kernel memory that will be
          excluded from the ratio, e.g., "128 MiB globally/per node".
      
          This might be helpful when starting VMs with extremely small memory
          footprint (e.g., 128 MiB) and hotplugging memory later -- not wanting
          the first hotplugged units getting onlined to ZONE_MOVABLE. One
          alternative would be a trigger to not consider ZONE_DMA memory
          in the ratio. We'll have to see if this is really required.
       3) Indicate to user space that MOVABLE might be a bad idea -- especially
          relevant when memory ballooning without support for balloon compaction
          is active.
      
      This patch (of 9):
      
      For implementing a new memory onlining policy, which determines when to
      online memory blocks to ZONE_MOVABLE semi-automatically, we need the
      number of present early (boot) pages -- present pages excluding hotplugged
      pages.  Let's track these pages per zone.
      
      Pass a page instead of the zone to adjust_present_page_count(), similar
      to adjust_managed_page_count(), and derive the zone from the page.
      
      It's worth noting that a memory block to be offlined/onlined is either
      completely "early" or "not early".  add_memory() and friends can only add
      complete memory blocks and we only online/offline complete (individual)
      memory blocks.
      
      Link: https://lkml.kernel.org/r/20210806124715.17090-1-david@redhat.com
      Link: https://lkml.kernel.org/r/20210806124715.17090-2-david@redhat.com
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
      Cc: "Michael S. Tsirkin" <mst@redhat.com>
      Cc: Jason Wang <jasowang@redhat.com>
      Cc: Marek Kedzierski <mkedzier@redhat.com>
      Cc: Hui Zhu <teawater@gmail.com>
      Cc: Pankaj Gupta <pankaj.gupta.linux@gmail.com>
      Cc: Wei Yang <richard.weiyang@linux.alibaba.com>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Anshuman Khandual <anshuman.khandual@arm.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Mike Rapoport <rppt@kernel.org>
      Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
      Cc: Len Brown <lenb@kernel.org>
      Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/memory_hotplug: remove nid parameter from remove_memory() and friends · e1c158e4
      Authored by David Hildenbrand
      There is only a single user remaining.  We can simply look up the nid,
      only used for node offlining purposes, when walking our memory blocks.
      We don't expect to remove multi-nid ranges; and if we ever did, we most
      probably wouldn't care about removing multi-nid ranges that actually
      result in empty nodes.
      
      If ever required, we can detect the "multi-nid" scenario and simply try
      offlining all online nodes.
      
      Link: https://lkml.kernel.org/r/20210712124052.26491-4-david@redhat.com
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Acked-by: Michael Ellerman <mpe@ellerman.id.au> (powerpc)
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
      Cc: Len Brown <lenb@kernel.org>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Vishal Verma <vishal.l.verma@intel.com>
      Cc: Dave Jiang <dave.jiang@intel.com>
      Cc: "Michael S. Tsirkin" <mst@redhat.com>
      Cc: Jason Wang <jasowang@redhat.com>
      Cc: Nathan Lynch <nathanl@linux.ibm.com>
      Cc: Laurent Dufour <ldufour@linux.ibm.com>
      Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com>
      Cc: Scott Cheloha <cheloha@linux.ibm.com>
      Cc: Anton Blanchard <anton@ozlabs.org>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Anshuman Khandual <anshuman.khandual@arm.com>
      Cc: Ard Biesheuvel <ardb@kernel.org>
      Cc: Baoquan He <bhe@redhat.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Christian Borntraeger <borntraeger@de.ibm.com>
      Cc: Christophe Leroy <christophe.leroy@c-s.fr>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Heiko Carstens <hca@linux.ibm.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Jia He <justin.he@arm.com>
      Cc: Joe Perches <joe@perches.com>
      Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Michel Lespinasse <michel@lespinasse.org>
      Cc: Mike Rapoport <rppt@kernel.org>
      Cc: Nicholas Piggin <npiggin@gmail.com>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: Pankaj Gupta <pankaj.gupta@ionos.com>
      Cc: Pankaj Gupta <pankaj.gupta.linux@gmail.com>
      Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Pierre Morel <pmorel@linux.ibm.com>
      Cc: "Rafael J. Wysocki" <rafael.j.wysocki@intel.com>
      Cc: Rich Felker <dalias@libc.org>
      Cc: Sergei Trofimovich <slyfox@gentoo.org>
      Cc: Thiago Jung Bauermann <bauerman@linux.ibm.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vasily Gorbik <gor@linux.ibm.com>
      Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Wei Yang <richard.weiyang@linux.alibaba.com>
      Cc: Will Deacon <will@kernel.org>
      Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/memory_hotplug: remove nid parameter from arch_remove_memory() · 65a2aa5f
      Authored by David Hildenbrand
      The parameter is unused, let's remove it.
      
      Link: https://lkml.kernel.org/r/20210712124052.26491-3-david@redhat.com
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Acked-by: Catalin Marinas <catalin.marinas@arm.com>
      Acked-by: Michael Ellerman <mpe@ellerman.id.au> [powerpc]
      Acked-by: Heiko Carstens <hca@linux.ibm.com>	[s390]
      Reviewed-by: Pankaj Gupta <pankaj.gupta@ionos.com>
      Reviewed-by: Oscar Salvador <osalvador@suse.de>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Will Deacon <will@kernel.org>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Heiko Carstens <hca@linux.ibm.com>
      Cc: Vasily Gorbik <gor@linux.ibm.com>
      Cc: Christian Borntraeger <borntraeger@de.ibm.com>
      Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
      Cc: Rich Felker <dalias@libc.org>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Anshuman Khandual <anshuman.khandual@arm.com>
      Cc: Ard Biesheuvel <ardb@kernel.org>
      Cc: Mike Rapoport <rppt@kernel.org>
      Cc: Nicholas Piggin <npiggin@gmail.com>
      Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
      Cc: Baoquan He <bhe@redhat.com>
      Cc: Laurent Dufour <ldufour@linux.ibm.com>
      Cc: Sergei Trofimovich <slyfox@gentoo.org>
      Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
      Cc: Michel Lespinasse <michel@lespinasse.org>
      Cc: Christophe Leroy <christophe.leroy@c-s.fr>
      Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com>
      Cc: Thiago Jung Bauermann <bauerman@linux.ibm.com>
      Cc: Joe Perches <joe@perches.com>
      Cc: Pierre Morel <pmorel@linux.ibm.com>
      Cc: Jia He <justin.he@arm.com>
      Cc: Anton Blanchard <anton@ozlabs.org>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Dave Jiang <dave.jiang@intel.com>
      Cc: Jason Wang <jasowang@redhat.com>
      Cc: Len Brown <lenb@kernel.org>
      Cc: "Michael S. Tsirkin" <mst@redhat.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Nathan Lynch <nathanl@linux.ibm.com>
      Cc: Pankaj Gupta <pankaj.gupta.linux@gmail.com>
      Cc: "Rafael J. Wysocki" <rafael.j.wysocki@intel.com>
      Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
      Cc: Scott Cheloha <cheloha@linux.ibm.com>
      Cc: Vishal Verma <vishal.l.verma@intel.com>
      Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Wei Yang <richard.weiyang@linux.alibaba.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/memory_hotplug: use "unsigned long" for PFN in zone_for_pfn_range() · 7cf209ba
      Authored by David Hildenbrand
      Patch series "mm/memory_hotplug: preparatory patches for new online policy and memory"
      
      These are all cleanups and one fix previously sent as part of [1]:
      [PATCH v1 00/12] mm/memory_hotplug: "auto-movable" online policy and memory
      groups.
      
      These patches make sense even without the other series, therefore I pulled
      them out to make the other series easier to digest.
      
      [1] https://lkml.kernel.org/r/20210607195430.48228-1-david@redhat.com
      
      This patch (of 4):
      
      Checkpatch complained on a follow-up patch that we are using "unsigned"
      here, which defaults to "unsigned int"; checkpatch is correct.
      
      As we will search for a fitting zone using the (truncated, hence wrong)
      pfn, we might end up onlining memory to one of the special kernel zones,
      such as ZONE_DMA, which can end badly as the onlined memory does not
      satisfy properties of these zones.
      
      Use "unsigned long" instead, just as we do in other places when handling
      PFNs.  This can bite us once we have physical addresses in the range of
      multiple TB.
      
      Link: https://lkml.kernel.org/r/20210712124052.26491-2-david@redhat.com
      Fixes: e5e68930 ("mm, memory_hotplug: display allowed zones in the preferred ordering")
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Reviewed-by: Pankaj Gupta <pankaj.gupta@ionos.com>
      Reviewed-by: Muchun Song <songmuchun@bytedance.com>
      Reviewed-by: Oscar Salvador <osalvador@suse.de>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
      Cc: "Michael S. Tsirkin" <mst@redhat.com>
      Cc: Jason Wang <jasowang@redhat.com>
      Cc: Pankaj Gupta <pankaj.gupta.linux@gmail.com>
      Cc: Wei Yang <richard.weiyang@linux.alibaba.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Anshuman Khandual <anshuman.khandual@arm.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Mike Rapoport <rppt@kernel.org>
      Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
      Cc: Len Brown <lenb@kernel.org>
      Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
      Cc: Heiko Carstens <hca@linux.ibm.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: virtualization@lists.linux-foundation.org
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com>
      Cc: Anton Blanchard <anton@ozlabs.org>
      Cc: Ard Biesheuvel <ardb@kernel.org>
      Cc: Baoquan He <bhe@redhat.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Christian Borntraeger <borntraeger@de.ibm.com>
      Cc: Christophe Leroy <christophe.leroy@c-s.fr>
      Cc: Dave Jiang <dave.jiang@intel.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Jia He <justin.he@arm.com>
      Cc: Joe Perches <joe@perches.com>
      Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
      Cc: Laurent Dufour <ldufour@linux.ibm.com>
      Cc: Michel Lespinasse <michel@lespinasse.org>
      Cc: Nathan Lynch <nathanl@linux.ibm.com>
      Cc: Nicholas Piggin <npiggin@gmail.com>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Pierre Morel <pmorel@linux.ibm.com>
      Cc: "Rafael J. Wysocki" <rafael.j.wysocki@intel.com>
      Cc: Rich Felker <dalias@libc.org>
      Cc: Scott Cheloha <cheloha@linux.ibm.com>
      Cc: Sergei Trofimovich <slyfox@gentoo.org>
      Cc: Thiago Jung Bauermann <bauerman@linux.ibm.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vasily Gorbik <gor@linux.ibm.com>
      Cc: Vishal Verma <vishal.l.verma@intel.com>
      Cc: Will Deacon <will@kernel.org>
      Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  6. 01 July 2021 (1 commit)
    • mm: memory_hotplug: factor out bootmem core functions to bootmem_info.c · 426e5c42
      Authored by Muchun Song
      Patch series "Free some vmemmap pages of HugeTLB page", v23.
      
      This patch series will free some vmemmap pages (struct page structures)
      associated with each HugeTLB page when preallocated, to save memory.
      
      In order to reduce the difficulty of code review for the first version,
      this version disables the PMD/huge page mapping of vmemmap if this
      feature is enabled.  This actually eliminates a bunch of complex code
      doing page table manipulation.  When this patch series is solid, we can
      add the code for vmemmap page table manipulation in the future.
      
      The struct page structures (page structs) are used to describe a
      physical page frame.  By default, there is a one-to-one mapping from a
      page frame to its corresponding page struct.
      
      HugeTLB pages consist of multiple base page size pages and are
      supported by many architectures.  See hugetlbpage.rst in the Documentation
      directory for more details.  On the x86 architecture, HugeTLB pages of
      size 2MB and 1GB are currently supported.  Since the base page size on x86
      is 4KB, a 2MB HugeTLB page consists of 512 base pages and a 1GB HugeTLB
      page consists of 4096 base pages.  For each base page, there is a
      corresponding page struct.
      
      Within the HugeTLB subsystem, only the first 4 page structs are used to
      contain unique information about a HugeTLB page.  HUGETLB_CGROUP_MIN_ORDER
      provides this upper limit.  The only 'useful' information in the remaining
      page structs is the compound_head field, and this field is the same for
      all tail pages.
      
      By removing redundant page structs for HugeTLB pages, memory can be
      returned to the buddy allocator for other uses.
      
      When the system boots up, every 2M HugeTLB page has 512 struct page
      structs, which occupy 8 pages (sizeof(struct page) * 512 / PAGE_SIZE).
      
          HugeTLB                  struct pages(8 pages)         page frame(8 pages)
       +-----------+ ---virt_to_page---> +-----------+   mapping to   +-----------+
       |           |                     |     0     | -------------> |     0     |
       |           |                     +-----------+                +-----------+
       |           |                     |     1     | -------------> |     1     |
       |           |                     +-----------+                +-----------+
       |           |                     |     2     | -------------> |     2     |
       |           |                     +-----------+                +-----------+
       |           |                     |     3     | -------------> |     3     |
       |           |                     +-----------+                +-----------+
       |           |                     |     4     | -------------> |     4     |
       |    2MB    |                     +-----------+                +-----------+
       |           |                     |     5     | -------------> |     5     |
       |           |                     +-----------+                +-----------+
       |           |                     |     6     | -------------> |     6     |
       |           |                     +-----------+                +-----------+
       |           |                     |     7     | -------------> |     7     |
       |           |                     +-----------+                +-----------+
       |           |
       |           |
       |           |
       +-----------+
      
      The value of page->compound_head is the same for all tail pages.  The
      first page of page structs (page 0) associated with the HugeTLB page
      contains the 4 page structs necessary to describe the HugeTLB.  The only
      use of the remaining pages of page structs (page 1 to page 7) is to point
      to page->compound_head.  Therefore, we can remap pages 2 to 7 to page 1.
      Only 2 pages of page structs will be used for each HugeTLB page.  This
      will allow us to free the remaining 6 pages to the buddy allocator.
      
      Here is how things look after remapping.
      
          HugeTLB                  struct pages(8 pages)         page frame(8 pages)
       +-----------+ ---virt_to_page---> +-----------+   mapping to   +-----------+
       |           |                     |     0     | -------------> |     0     |
       |           |                     +-----------+                +-----------+
       |           |                     |     1     | -------------> |     1     |
       |           |                     +-----------+                +-----------+
       |           |                     |     2     | ----------------^ ^ ^ ^ ^ ^
       |           |                     +-----------+                   | | | | |
       |           |                     |     3     | ------------------+ | | | |
       |           |                     +-----------+                     | | | |
       |           |                     |     4     | --------------------+ | | |
       |    2MB    |                     +-----------+                       | | |
       |           |                     |     5     | ----------------------+ | |
       |           |                     +-----------+                         | |
       |           |                     |     6     | ------------------------+ |
       |           |                     +-----------+                           |
       |           |                     |     7     | --------------------------+
       |           |                     +-----------+
       |           |
       |           |
       |           |
       +-----------+
      
      When a HugeTLB is freed to the buddy system, we should allocate 6 pages
      for vmemmap pages and restore the previous mapping relationship.
      
      Apart from the 2MB HugeTLB page, we also have the 1GB HugeTLB page.  It is
      handled similarly, and we can use the same approach to free its vmemmap
      pages.
      
      In this case, for each 1GB HugeTLB page, we can save 4094 pages.  This is a
      very substantial gain.  Our servers run some SPDK/QEMU applications which
      use 1024GB of HugeTLB pages.  With this feature enabled, we can save ~16GB
      (with 1GB hugepages) or ~12GB (with 2MB hugepages) of memory.
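
      The arithmetic behind these numbers can be reproduced with a small
      user-space sketch (illustrative only, not kernel code; it assumes 4KB base
      pages and a 64-byte struct page as on x86_64):

        #include <stdio.h>

        #define PAGE_SIZE        4096UL
        #define STRUCT_PAGE_SIZE 64UL /* assumed sizeof(struct page) */

        /* vmemmap pages needed to describe one HugeTLB page */
        static unsigned long vmemmap_pages(unsigned long hugepage_size)
        {
                return hugepage_size / PAGE_SIZE * STRUCT_PAGE_SIZE / PAGE_SIZE;
        }

        int main(void)
        {
                /* 2MB: 512 struct pages -> 8 vmemmap pages, 6 freed */
                printf("2MB: %lu freed\n", vmemmap_pages(2UL << 20) - 2);
                /* 1GB: 262144 struct pages -> 4096 vmemmap pages, 4094 freed */
                printf("1GB: %lu freed\n", vmemmap_pages(1UL << 30) - 2);
                return 0;
        }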
      
      Because the vmemmap page tables are reconstructed on the freeing/allocating
      path, this adds some overhead.  Here is some overhead analysis.
      
      1) Allocating 10240 2MB HugeTLB pages.
      
         a) With this patch series applied:
         # time echo 10240 > /proc/sys/vm/nr_hugepages
      
         real     0m0.166s
         user     0m0.000s
         sys      0m0.166s
      
         # bpftrace -e 'kprobe:alloc_fresh_huge_page { @start[tid] = nsecs; }
           kretprobe:alloc_fresh_huge_page /@start[tid]/ { @latency = hist(nsecs -
           @start[tid]); delete(@start[tid]); }'
         Attaching 2 probes...
      
         @latency:
         [8K, 16K)           5476 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
         [16K, 32K)          4760 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@       |
         [32K, 64K)             4 |                                                    |
      
         b) Without this patch series:
         # time echo 10240 > /proc/sys/vm/nr_hugepages
      
         real     0m0.067s
         user     0m0.000s
         sys      0m0.067s
      
         # bpftrace -e 'kprobe:alloc_fresh_huge_page { @start[tid] = nsecs; }
           kretprobe:alloc_fresh_huge_page /@start[tid]/ { @latency = hist(nsecs -
           @start[tid]); delete(@start[tid]); }'
         Attaching 2 probes...
      
         @latency:
         [4K, 8K)           10147 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
         [8K, 16K)             93 |                                                    |
      
         Summary: with this feature, allocation is about ~2x slower than before.
      
      2) Freeing 10240 2MB HugeTLB pages.
      
         a) With this patch series applied:
         # time echo 0 > /proc/sys/vm/nr_hugepages
      
         real     0m0.213s
         user     0m0.000s
         sys      0m0.213s
      
         # bpftrace -e 'kprobe:free_pool_huge_page { @start[tid] = nsecs; }
           kretprobe:free_pool_huge_page /@start[tid]/ { @latency = hist(nsecs -
           @start[tid]); delete(@start[tid]); }'
         Attaching 2 probes...
      
         @latency:
         [8K, 16K)              6 |                                                    |
         [16K, 32K)         10227 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
         [32K, 64K)             7 |                                                    |
      
         b) Without this patch series:
         # time echo 0 > /proc/sys/vm/nr_hugepages
      
         real     0m0.081s
         user     0m0.000s
         sys      0m0.081s
      
         # bpftrace -e 'kprobe:free_pool_huge_page { @start[tid] = nsecs; }
           kretprobe:free_pool_huge_page /@start[tid]/ { @latency = hist(nsecs -
           @start[tid]); delete(@start[tid]); }'
         Attaching 2 probes...
      
         @latency:
         [4K, 8K)            6805 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
         [8K, 16K)           3427 |@@@@@@@@@@@@@@@@@@@@@@@@@@                          |
         [16K, 32K)             8 |                                                    |
      
         Summary: __free_hugepage() is about ~2-3x slower than before.
      
      Although the overhead has increased, the overhead is not significant.
      Like Mike said, "However, remember that the majority of use cases create
      HugeTLB pages at or shortly after boot time and add them to the pool.  So,
      additional overhead is at pool creation time.  There is no change to
      'normal run time' operations of getting a page from or returning a page to
      the pool (think page fault/unmap)".
      
      Beyond the memory gains, this series has an additional benefit despite the
      overhead: page (un)pinners see an improvement, presumably because there are
      fewer memmap pages and thus the tail/head pages stay in cache more often.
      The following data was obtained by Joao Martins; many thanks for his
      effort.
      
      Out of the box Joao saw (when comparing linux-next against linux-next +
      this series) with gup_test and pinning a 16G HugeTLB file (with 1G pages):
      
      	get_user_pages(): ~32k -> ~9k
      	unpin_user_pages(): ~75k -> ~70k
      
      Usually any tight loop fetching compound_head(), or reading tail page data
      (e.g.  compound_head), benefits a lot.  There are some unpinning
      inefficiencies Joao was fixing [2]; with those fixes added, the improvement
      is even larger:
      
      	unpin_user_pages(): ~27k -> ~3.8k
      
      [1] https://lore.kernel.org/linux-mm/20210409205254.242291-1-mike.kravetz@oracle.com/
      [2] https://lore.kernel.org/linux-mm/20210204202500.26474-1-joao.m.martins@oracle.com/
      
      This patch (of 9):
      
      Move the common bootmem info registration API into its own file,
      bootmem_info.c.  A later patch will use {get,put}_page_bootmem() to
      initialize the page structs for the vmemmap pages, or to free the vmemmap
      pages back to the buddy allocator, so also move them out of
      CONFIG_MEMORY_HOTPLUG_SPARSE.  This is just code movement without any
      functional change.
      
      Link: https://lkml.kernel.org/r/20210510030027.56044-1-songmuchun@bytedance.com
      Link: https://lkml.kernel.org/r/20210510030027.56044-2-songmuchun@bytedance.com
      Signed-off-by: NMuchun Song <songmuchun@bytedance.com>
      Acked-by: NMike Kravetz <mike.kravetz@oracle.com>
      Reviewed-by: NOscar Salvador <osalvador@suse.de>
      Reviewed-by: NDavid Hildenbrand <david@redhat.com>
      Reviewed-by: NMiaohe Lin <linmiaohe@huawei.com>
      Tested-by: NChen Huang <chenhuang5@huawei.com>
      Tested-by: NBodeddula Balasubramaniam <bodeddub@amazon.com>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: x86@kernel.org
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Paul E. McKenney <paulmck@kernel.org>
      Cc: Pawan Gupta <pawan.kumar.gupta@linux.intel.com>
      Cc: Randy Dunlap <rdunlap@infradead.org>
      Cc: Oliver Neukum <oneukum@suse.com>
      Cc: Anshuman Khandual <anshuman.khandual@arm.com>
      Cc: Joerg Roedel <jroedel@suse.de>
      Cc: Mina Almasry <almasrymina@google.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Barry Song <song.bao.hua@hisilicon.com>
      Cc: HORIGUCHI NAOYA <naoya.horiguchi@nec.com>
      Cc: Joao Martins <joao.m.martins@oracle.com>
      Cc: Xiongchun Duan <duanxiongchun@bytedance.com>
      Cc: Balbir Singh <bsingharora@gmail.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      426e5c42
  7. 06 May, 2021 1 commit
    • O
      mm,memory_hotplug: allocate memmap from the added memory range · a08a2ae3
      Authored by Oscar Salvador
      Physical memory hotadd has to allocate a memmap (struct page array) for
      the newly added memory section.  Currently, alloc_pages_node() is used
      for those allocations.
      
      This has some disadvantages:
       a) existing memory is consumed for that purpose
          (eg: ~2MB per 128MB memory section on x86_64)
          This can even lead to extreme cases where system goes OOM because
          the physically hotplugged memory depletes the available memory before
          it is onlined.
       b) if the whole node is movable then we have off-node struct pages
          which has performance drawbacks.
       c) there might be no PMD_ALIGNED chunks, so the memmap array gets
          populated with base pages.
      
      This can be improved when CONFIG_SPARSEMEM_VMEMMAP is enabled.
      
      Vmemmap page tables can map arbitrary memory.  That means that we can
      reserve a part of the physically hotadded memory to back vmemmap page
      tables.  This implementation uses the beginning of the hotplugged memory
      for that purpose.
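
      Conceptually, this can be expressed with an altmap that says "the first
      pages of this range back its own memmap".  A rough sketch (field names
      follow struct vmem_altmap in <linux/memremap.h>; the exact setup in the
      code may differ):

        /* Sketch only: reserve the head of the hot-added range for the memmap. */
        struct vmem_altmap mhp_altmap = {
                .base_pfn = PHYS_PFN(start), /* first pfn of the hot-added range */
                .free     = PHYS_PFN(size),  /* pages usable for memmap allocation */
        };

        params.altmap = &mhp_altmap;         /* handed down to arch_add_memory() */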
      
      There are some non-obvious things to consider though.
      
      Vmemmap pages are allocated/freed during the memory hotplug events
      (add_memory_resource(), try_remove_memory()) when the memory is
      added/removed.  This means that the reserved physical range is not
      online although it is used.  The most obvious side effect is that
      pfn_to_online_page() returns NULL for those pfns.  The current design
      expects that this should be OK as the hotplugged memory is considered
      garbage until it is onlined.  For example, hibernation wouldn't save the
      content of those vmemmaps into the image, so it wouldn't be restored on
      resume, but this should be OK as there is no real content to recover anyway
      while metadata is reachable from other data structures (e.g.  vmemmap
      page tables).
      
      The reserved space is therefore (de)initialized during the {on,off}line
      events (mhp_{de}init_memmap_on_memory).  That is done by extracting page
      allocator independent initialization from the regular onlining path.
      The primary reason to handle the reserved space outside of
      {on,off}line_pages is to make each initialization specific to the
      purpose rather than special case them in a single function.
      
      As per above, the functions that are introduced are:
      
       - mhp_init_memmap_on_memory:
         Initializes vmemmap pages by calling move_pfn_range_to_zone(), calls
         kasan_add_zero_shadow(), and onlines as many sections as vmemmap pages
         fully span.
      
       - mhp_deinit_memmap_on_memory:
         Offlines as many sections as vmemmap pages fully span, removes the
         range from the zone by remove_pfn_range_from_zone(), and calls
         kasan_remove_zero_shadow() for the range.
      
      The new function memory_block_online() calls mhp_init_memmap_on_memory()
      before doing the actual online_pages().  Should online_pages() fail, we
      clean up by calling mhp_deinit_memmap_on_memory().  Adjusting of
      present_pages is done at the end once we know that online_pages()
      succeeded.
      
      On offline, memory_block_offline() needs to unaccount vmemmap pages from
      present_pages before calling offline_pages().  This is necessary because
      offline_pages() tears down some structures based on the fact whether the
      node or the zone become empty.  If offline_pages() fails, we account back
      vmemmap pages.  If it succeeds, we call mhp_deinit_memmap_on_memory().
      
      Hot-remove:
      
       We need to be careful when removing memory, as adding and
       removing memory needs to be done with the same granularity.
       To check that this assumption is not violated, we check the
       memory range we want to remove and if a) any memory block has
       vmemmap pages and b) the range spans more than a single memory
       block, we scream out loud and refuse to proceed.
      
       If all is good and the range was using memmap on memory (aka vmemmap pages),
       we construct an altmap structure so free_hugepage_table does the right
       thing and calls vmem_altmap_free instead of free_pagetable.
      
      Link: https://lkml.kernel.org/r/20210421102701.25051-5-osalvador@suse.de
      Signed-off-by: NOscar Salvador <osalvador@suse.de>
      Reviewed-by: NDavid Hildenbrand <david@redhat.com>
      Acked-by: NMichal Hocko <mhocko@suse.com>
      Cc: Anshuman Khandual <anshuman.khandual@arm.com>
      Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      a08a2ae3
  8. 27 Feb, 2021 4 commits
  9. 23 Nov, 2020 1 commit
  10. 19 Nov, 2020 1 commit
  11. 17 Oct, 2020 5 commits
    • D
      mm/memory_hotplug: MEMHP_MERGE_RESOURCE to specify merging of System RAM resources · 9ca6551e
      Authored by David Hildenbrand
      Some add_memory*() users add memory in small, contiguous memory blocks.
      Examples include virtio-mem, hyper-v balloon, and the XEN balloon.
      
      This can quickly result in a lot of memory resources, whereby the actual
      resource boundaries are not of interest (e.g., it might be relevant for
      DIMMs, exposed via /proc/iomem to user space).  We really want to merge
      added resources in this scenario where possible.
      
      Let's provide a flag (MEMHP_MERGE_RESOURCE) to specify that a resource
      either created within add_memory*() or passed via add_memory_resource()
      shall be marked mergeable and merged with applicable siblings.
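
      As a sketch, a driver hot-adding many small blocks could then request
      merging like this (assuming the add_memory() signature with an mhp_t flags
      argument from this same series; nid/addr are placeholders):

        /* Sketch: hot-add one memory block and mark its resource mergeable. */
        rc = add_memory(nid, addr, memory_block_size_bytes(),
                        MEMHP_MERGE_RESOURCE);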
      
      To implement that, we need a kernel/resource interface to mark selected
      System RAM resources mergeable (IORESOURCE_SYSRAM_MERGEABLE) and trigger
      merging.
      
      Note: We really want to merge after the whole operation succeeded, not
      directly when adding a resource to the resource tree (it would break
      add_memory_resource() and require splitting resources again when the
      operation failed - e.g., due to -ENOMEM).
      Signed-off-by: NDavid Hildenbrand <david@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Reviewed-by: NPankaj Gupta <pankaj.gupta.linux@gmail.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Jason Gunthorpe <jgg@ziepe.ca>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Ard Biesheuvel <ardb@kernel.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: "K. Y. Srinivasan" <kys@microsoft.com>
      Cc: Haiyang Zhang <haiyangz@microsoft.com>
      Cc: Stephen Hemminger <sthemmin@microsoft.com>
      Cc: Wei Liu <wei.liu@kernel.org>
      Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
      Cc: Juergen Gross <jgross@suse.com>
      Cc: Stefano Stabellini <sstabellini@kernel.org>
      Cc: Roger Pau Monné <roger.pau@citrix.com>
      Cc: Julien Grall <julien@xen.org>
      Cc: Baoquan He <bhe@redhat.com>
      Cc: Wei Yang <richardw.yang@linux.intel.com>
      Cc: Anton Blanchard <anton@ozlabs.org>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Christian Borntraeger <borntraeger@de.ibm.com>
      Cc: Dave Jiang <dave.jiang@intel.com>
      Cc: Eric Biederman <ebiederm@xmission.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Heiko Carstens <hca@linux.ibm.com>
      Cc: Jason Wang <jasowang@redhat.com>
      Cc: Len Brown <lenb@kernel.org>
      Cc: Leonardo Bras <leobras.c@gmail.com>
      Cc: Libor Pechacek <lpechacek@suse.cz>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: "Michael S. Tsirkin" <mst@redhat.com>
      Cc: Nathan Lynch <nathanl@linux.ibm.com>
      Cc: "Oliver O'Halloran" <oohall@gmail.com>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Pingfan Liu <kernelfans@gmail.com>
      Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
      Cc: Vasily Gorbik <gor@linux.ibm.com>
      Cc: Vishal Verma <vishal.l.verma@intel.com>
      Link: https://lkml.kernel.org/r/20200911103459.10306-6-david@redhat.com
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      9ca6551e
    • D
      mm/memory_hotplug: prepare passing flags to add_memory() and friends · b6117199
      Authored by David Hildenbrand
      We soon want to pass flags, e.g., to mark added System RAM resources
      mergeable.  Prepare for that.
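
      A sketch of the resulting prototypes (the mhp_t spelling is taken from
      this series; details may differ):

        typedef int __bitwise mhp_t;

        int add_memory(int nid, u64 start, u64 size, mhp_t mhp_flags);
        int add_memory_resource(int nid, struct resource *resource,
                                mhp_t mhp_flags);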
      
      This patch is based on a similar patch by Oscar Salvador:
      
      https://lkml.kernel.org/r/20190625075227.15193-3-osalvador@suse.de
      Signed-off-by: NDavid Hildenbrand <david@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Juergen Gross <jgross@suse.com> # Xen related part
      Reviewed-by: NPankaj Gupta <pankaj.gupta.linux@gmail.com>
      Acked-by: NWei Liu <wei.liu@kernel.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Jason Gunthorpe <jgg@ziepe.ca>
      Cc: Baoquan He <bhe@redhat.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
      Cc: Len Brown <lenb@kernel.org>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Vishal Verma <vishal.l.verma@intel.com>
      Cc: Dave Jiang <dave.jiang@intel.com>
      Cc: "K. Y. Srinivasan" <kys@microsoft.com>
      Cc: Haiyang Zhang <haiyangz@microsoft.com>
      Cc: Stephen Hemminger <sthemmin@microsoft.com>
      Cc: Wei Liu <wei.liu@kernel.org>
      Cc: Heiko Carstens <hca@linux.ibm.com>
      Cc: Vasily Gorbik <gor@linux.ibm.com>
      Cc: Christian Borntraeger <borntraeger@de.ibm.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: "Michael S. Tsirkin" <mst@redhat.com>
      Cc: Jason Wang <jasowang@redhat.com>
      Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
      Cc: Stefano Stabellini <sstabellini@kernel.org>
      Cc: "Oliver O'Halloran" <oohall@gmail.com>
      Cc: Pingfan Liu <kernelfans@gmail.com>
      Cc: Nathan Lynch <nathanl@linux.ibm.com>
      Cc: Libor Pechacek <lpechacek@suse.cz>
      Cc: Anton Blanchard <anton@ozlabs.org>
      Cc: Leonardo Bras <leobras.c@gmail.com>
      Cc: Ard Biesheuvel <ardb@kernel.org>
      Cc: Eric Biederman <ebiederm@xmission.com>
      Cc: Julien Grall <julien@xen.org>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Roger Pau Monné <roger.pau@citrix.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Wei Yang <richardw.yang@linux.intel.com>
      Link: https://lkml.kernel.org/r/20200911103459.10306-5-david@redhat.comSigned-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      b6117199
    • D
      mm/memory_hotplug: guard more declarations by CONFIG_MEMORY_HOTPLUG · 3a0aaefe
      Authored by David Hildenbrand
      We soon want to pass flags via a new type to add_memory() and friends.
      That revealed that we currently don't guard some declarations by
      CONFIG_MEMORY_HOTPLUG.
      
      While some definitions could be moved to different places, let's keep it
      minimal for now and use CONFIG_MEMORY_HOTPLUG for all functions only
      compiled with CONFIG_MEMORY_HOTPLUG.
      
      Wrap sparse_decode_mem_map() into CONFIG_MEMORY_HOTPLUG, it's only called
      from CONFIG_MEMORY_HOTPLUG code.
      
      While at it, remove allow_online_pfn_range(), which is no longer around,
      and mhp_notimplemented(), which is unused.
      Signed-off-by: NDavid Hildenbrand <david@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Pankaj Gupta <pankaj.gupta.linux@gmail.com>
      Cc: Baoquan He <bhe@redhat.com>
      Cc: Wei Yang <richardw.yang@linux.intel.com>
      Cc: Anton Blanchard <anton@ozlabs.org>
      Cc: Ard Biesheuvel <ardb@kernel.org>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
      Cc: Christian Borntraeger <borntraeger@de.ibm.com>
      Cc: Dave Jiang <dave.jiang@intel.com>
      Cc: Eric Biederman <ebiederm@xmission.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Haiyang Zhang <haiyangz@microsoft.com>
      Cc: Heiko Carstens <hca@linux.ibm.com>
      Cc: Jason Gunthorpe <jgg@ziepe.ca>
      Cc: Jason Wang <jasowang@redhat.com>
      Cc: Juergen Gross <jgross@suse.com>
      Cc: Julien Grall <julien@xen.org>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: "K. Y. Srinivasan" <kys@microsoft.com>
      Cc: Len Brown <lenb@kernel.org>
      Cc: Leonardo Bras <leobras.c@gmail.com>
      Cc: Libor Pechacek <lpechacek@suse.cz>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: "Michael S. Tsirkin" <mst@redhat.com>
      Cc: Nathan Lynch <nathanl@linux.ibm.com>
      Cc: "Oliver O'Halloran" <oohall@gmail.com>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Pingfan Liu <kernelfans@gmail.com>
      Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
      Cc: Roger Pau Monné <roger.pau@citrix.com>
      Cc: Stefano Stabellini <sstabellini@kernel.org>
      Cc: Stephen Hemminger <sthemmin@microsoft.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vasily Gorbik <gor@linux.ibm.com>
      Cc: Vishal Verma <vishal.l.verma@intel.com>
      Cc: Wei Liu <wei.liu@kernel.org>
      Link: https://lkml.kernel.org/r/20200911103459.10306-4-david@redhat.com
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      3a0aaefe
    • D
      mm: pass migratetype into memmap_init_zone() and move_pfn_range_to_zone() · d882c006
      Authored by David Hildenbrand
      On the memory onlining path, we want to start with MIGRATE_ISOLATE, to
      un-isolate the pages after memory onlining is complete.  Let's allow
      passing in the migratetype.
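
      A simplified sketch of the intended onlining flow (function names from the
      page isolation code; the real path does more work in between):

        /* Sketch: online the range as isolated, un-isolate when done. */
        move_pfn_range_to_zone(zone, pfn, nr_pages, NULL, MIGRATE_ISOLATE);
        /* ... initialize, account and activate the pages ... */
        undo_isolate_page_range(pfn, pfn + nr_pages, MIGRATE_MOVABLE);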
      Signed-off-by: NDavid Hildenbrand <david@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Reviewed-by: NOscar Salvador <osalvador@suse.de>
      Acked-by: NMichal Hocko <mhocko@suse.com>
      Cc: Wei Yang <richard.weiyang@linux.alibaba.com>
      Cc: Baoquan He <bhe@redhat.com>
      Cc: Pankaj Gupta <pankaj.gupta.linux@gmail.com>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: Fenghua Yu <fenghua.yu@intel.com>
      Cc: Logan Gunthorpe <logang@deltatee.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Mike Rapoport <rppt@kernel.org>
      Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
      Cc: Michel Lespinasse <walken@google.com>
      Cc: Charan Teja Reddy <charante@codeaurora.org>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Link: https://lkml.kernel.org/r/20200819175957.28465-10-david@redhat.com
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      d882c006
    • D
      mm/page_alloc: simplify __offline_isolated_pages() · 257bea71
      Authored by David Hildenbrand
      offline_pages() is the only user.  __offline_isolated_pages() never gets
      called with ranges that contain memory holes and we no longer care about
      the return value.  Drop the return value handling and all pfn_valid()
      checks.
      
      Update the documentation.
      Signed-off-by: NDavid Hildenbrand <david@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Reviewed-by: NOscar Salvador <osalvador@suse.de>
      Acked-by: NMichal Hocko <mhocko@suse.com>
      Cc: Wei Yang <richard.weiyang@linux.alibaba.com>
      Cc: Baoquan He <bhe@redhat.com>
      Cc: Pankaj Gupta <pankaj.gupta.linux@gmail.com>
      Cc: Charan Teja Reddy <charante@codeaurora.org>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Fenghua Yu <fenghua.yu@intel.com>
      Cc: Logan Gunthorpe <logang@deltatee.com>
      Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Michel Lespinasse <walken@google.com>
      Cc: Mike Rapoport <rppt@kernel.org>
      Cc: Tony Luck <tony.luck@intel.com>
      Link: https://lkml.kernel.org/r/20200819175957.28465-5-david@redhat.com
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      257bea71
  12. 14 Oct, 2020 1 commit
    • D
      mm/memory_hotplug: introduce default phys_to_target_node() implementation · a035b6bf
      Authored by Dan Williams
      In preparation to set a fallback value for dev_dax->target_node, introduce
      generic fallback helpers for phys_to_target_node()
      
      A generic implementation based on node-data or memblock was proposed, but
      as noted by Mike:
      
          "Here again, I would prefer to add a weak default for
           phys_to_target_node() because the "generic" implementation is not really
           generic.
      
           The fallback to reserved ranges is x86 specific because on x86 most of
           the reserved areas are not in memblock.memory. AFAIK, no other
           architecture does this."
      
      The info message in the generic memory_add_physaddr_to_nid()
      implementation is fixed up to properly reflect that
      memory_add_physaddr_to_nid() communicates "online" node info and
      phys_to_target_node() indicates "target / to-be-onlined" node info.
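
      A minimal sketch of such a weak default (the exact message and return
      value in the patch may differ):

        /* Sketch: arch code can override this weak fallback. */
        int __weak phys_to_target_node(u64 start)
        {
                pr_info_once("Unknown target node for memory at 0x%llx, assuming node 0\n",
                             start);
                return 0;
        }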
      
      [akpm@linux-foundation.org: fix CONFIG_MEMORY_HOTPLUG=n build]
        Link: https://lkml.kernel.org/r/202008252130.7YrHIyMI%25lkp@intel.com
      Signed-off-by: NDan Williams <dan.j.williams@intel.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Mike Rapoport <rppt@linux.ibm.com>
      Cc: Jia He <justin.he@arm.com>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Ard Biesheuvel <ardb@kernel.org>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Ben Skeggs <bskeggs@redhat.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Brice Goglin <Brice.Goglin@inria.fr>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Daniel Vetter <daniel@ffwll.ch>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Dave Jiang <dave.jiang@intel.com>
      Cc: David Airlie <airlied@linux.ie>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Ira Weiny <ira.weiny@intel.com>
      Cc: Jason Gunthorpe <jgg@mellanox.com>
      Cc: Jeff Moyer <jmoyer@redhat.com>
      Cc: Joao Martins <joao.m.martins@oracle.com>
      Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Paul Mackerras <paulus@ozlabs.org>
      Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Tom Lendacky <thomas.lendacky@amd.com>
      Cc: Vishal Verma <vishal.l.verma@intel.com>
      Cc: Wei Yang <richard.weiyang@linux.alibaba.com>
      Cc: Will Deacon <will@kernel.org>
      Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org>
      Cc: Bjorn Helgaas <bhelgaas@google.com>
      Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
      Cc: Hulk Robot <hulkci@huawei.com>
      Cc: Jason Yan <yanaijie@huawei.com>
      Cc: "Jérôme Glisse" <jglisse@redhat.com>
      Cc: Juergen Gross <jgross@suse.com>
      Cc: kernel test robot <lkp@intel.com>
      Cc: Randy Dunlap <rdunlap@infradead.org>
      Cc: Stefano Stabellini <sstabellini@kernel.org>
      Cc: Vivek Goyal <vgoyal@redhat.com>
      Link: https://lkml.kernel.org/r/159643097768.4062302.3135192588966888630.stgit@dwillia2-desk3.amr.corp.intel.com
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      a035b6bf
  13. 05 Jun, 2020 3 commits
    • D
      mm/memory_hotplug: introduce add_memory_driver_managed() · 7b7b2721
      Authored by David Hildenbrand
      Patch series "mm/memory_hotplug: Interface to add driver-managed system
      ram", v4.
      
      kexec (via kexec_load()) currently cannot properly handle memory added
      via dax/kmem, and will have similar issues with virtio-mem.  kexec-tools
      will currently add all memory to the fixed-up initial firmware memmap.  In
      case of dax/kmem, this means that - in contrast to a proper reboot - how
      that persistent memory will be used can no longer be configured by the
      kexec'd kernel.  In case of virtio-mem it will be harmful, because that
      memory might contain inaccessible pieces that require coordination with
      the hypervisor first.
      
      In both cases, we want to let the driver in the kexec'd kernel handle
      detecting and adding the memory, like during an ordinary reboot.
      Introduce add_memory_driver_managed().  More on the semantics is in patch
      #1.
      
      In the future, we might want to make this behavior configurable for
      dax/kmem - either by configuring it in the kernel (which would then also
      allow to configure kexec_file_load()) or in kexec-tools by also adding
      "System RAM (kmem)" memory from /proc/iomem to the fixed-up initial
      firmware memmap.
      
      More on the motivation can be found in [1] and [2].
      
      [1] https://lkml.kernel.org/r/20200429160803.109056-1-david@redhat.com
      [2] https://lkml.kernel.org/r/20200430102908.10107-1-david@redhat.com
      
      This patch (of 3):
      
      Some device drivers rely on memory they managed to not get added to the
      initial (firmware) memmap as system RAM - so it's not used as initial
      system RAM by the kernel and the driver is under control.  While this is
      the case during cold boot and after a reboot, kexec is not aware of that
      and might add such memory to the initial (firmware) memmap of the kexec
      kernel.  We need ways to teach the kernel and user space that this system
      RAM is different.
      
      For example, dax/kmem allows deciding at runtime if persistent memory is
      to be used as system RAM.  Another future user is virtio-mem, which has to
      coordinate with its hypervisor to deal with inaccessible parts within
      memory resources.
      
      We want to let users in the kernel (esp. kexec) but also user space
      (esp. kexec-tools) know that this memory has different semantics and
      needs to be handled differently:
      1. Don't create entries in /sys/firmware/memmap/
      2. Name the memory resource "System RAM ($DRIVER)" (exposed via
         /proc/iomem) ($DRIVER might be "kmem", "virtio_mem").
      3. Flag the memory resource IORESOURCE_MEM_DRIVER_MANAGED
      
      /sys/firmware/memmap/ [1] represents the "raw firmware-provided memory
      map" because "on most architectures that firmware-provided memory map is
      modified afterwards by the kernel itself".  The primary user is kexec on
      x86-64.  Since commit d96ae530 ("memory-hotplug: create
      /sys/firmware/memmap entry for new memory"), we add all hotplugged memory
      to that firmware memmap - which makes perfect sense for traditional memory
      hotplug on x86-64, where real HW will also add hotplugged DIMMs to the
      firmware memmap.  We replicate what the "raw firmware-provided memory map"
      looks like after hot(un)plug.
      
      To keep things simple, let the user provide the full resource name instead
      of only the driver name - this way, we don't have to manually
      allocate/craft strings for memory resources.  Also use the resource name
      to make decisions, to avoid passing additional flags.  In case the name
      isn't "System RAM", it's special.
      
      We don't have to worry about firmware_map_remove() on the removal path.
      If there is no entry, it will simply return with -EINVAL.
      
      We'll adapt dax/kmem in a follow-up patch.
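
      As a usage sketch (the dax/kmem adaptation lands in the follow-up; names
      like numa_node/range_start/range_len are placeholders):

        /* Sketch: register driver-managed memory with a descriptive name. */
        rc = add_memory_driver_managed(numa_node, range_start, range_len,
                                       "System RAM (kmem)");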
      
      [1] https://www.kernel.org/doc/Documentation/ABI/testing/sysfs-firmware-memmap
      Signed-off-by: NDavid Hildenbrand <david@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Acked-by: NPankaj Gupta <pankaj.gupta.linux@gmail.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Pankaj Gupta <pankaj.gupta.linux@gmail.com>
      Cc: Wei Yang <richard.weiyang@gmail.com>
      Cc: Baoquan He <bhe@redhat.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Eric Biederman <ebiederm@xmission.com>
      Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Link: http://lkml.kernel.org/r/20200508084217.9160-1-david@redhat.com
      Link: http://lkml.kernel.org/r/20200508084217.9160-3-david@redhat.com
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      7b7b2721
    • D
      mm/memory_hotplug: remove is_mem_section_removable() · 04f3465c
      Authored by David Hildenbrand
      Fortunately, all users of is_mem_section_removable() are gone.  Get rid of
      it, including some now unnecessary functions.
      Signed-off-by: NDavid Hildenbrand <david@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Reviewed-by: NWei Yang <richard.weiyang@gmail.com>
      Reviewed-by: NBaoquan He <bhe@redhat.com>
      Acked-by: NMichal Hocko <mhocko@suse.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Oscar Salvador <osalvador@suse.de>
      Link: http://lkml.kernel.org/r/20200407135416.24093-3-david@redhat.com
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      04f3465c
    • D
      mm/memory_hotplug: Introduce offline_and_remove_memory() · 08b3acd7
      Authored by David Hildenbrand
      virtio-mem wants to offline and remove a memory block once it unplugged
      all subblocks (e.g., using alloc_contig_range()). Let's provide
      an interface to do that from a driver. virtio-mem already supports
      offlining partially unplugged memory blocks. Offlining a fully unplugged
      memory block will not require migrating any pages. All unplugged
      subblocks are PageOffline() and have a reference count of 0 - so
      offlining code will simply skip them.
      
      All we need is an interface to offline and remove the memory from kernel
      module context, where we don't have access to the memory block devices
      (esp. find_memory_block() and device_offline()) and the device hotplug
      lock.
      
      To keep things simple, only allow working on a single memory block.
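
      A sketch of the intended driver-side usage, one memory block at a time
      (nid/addr are placeholders):

        /* Sketch: offline and remove one fully unplugged memory block. */
        rc = offline_and_remove_memory(nid, addr, memory_block_size_bytes());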
      Acked-by: NMichal Hocko <mhocko@suse.com>
      Tested-by: NPankaj Gupta <pankaj.gupta.linux@gmail.com>
      Acked-by: NAndrew Morton <akpm@linux-foundation.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Oscar Salvador <osalvador@suse.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
      Cc: Wei Yang <richard.weiyang@gmail.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Qian Cai <cai@lca.pw>
      Signed-off-by: NDavid Hildenbrand <david@redhat.com>
      Link: https://lore.kernel.org/r/20200507140139.17083-9-david@redhat.com
      Signed-off-by: NMichael S. Tsirkin <mst@redhat.com>
      08b3acd7
  14. 11 Apr, 2020 3 commits
    • L
      mm/memory_hotplug: add pgprot_t to mhp_params · bfeb022f
      Authored by Logan Gunthorpe
      devm_memremap_pages() is currently used by the PCI P2PDMA code to create
      struct page mappings for IO memory.  At present, these mappings are
      created with PAGE_KERNEL which implies setting the PAT bits to be WB.
      However, on x86, an mtrr register will typically override this and force
      the cache type to be UC-.  In the case firmware doesn't set this
      register it is effectively WB and will typically result in a machine
      check exception when it's accessed.
      
      Other arches are not currently likely to function correctly seeing they
      don't have any MTRR registers to fall back on.
      
      To solve this, provide a way to specify the pgprot value explicitly to
      arch_add_memory().
      
      Of the arches that support MEMORY_HOTPLUG: x86_64, and arm64 need a
      simple change to pass the pgprot_t down to their respective functions
      which set up the page tables.  For x86_32, set the page tables
      explicitly using _set_memory_prot() (seeing they are already mapped).
      
      For ia64, s390 and sh, reject anything but PAGE_KERNEL settings -- this
      should be fine, for now, seeing these architectures don't support
      ZONE_DEVICE.
      
      A check in __add_pages() is also added to ensure the pgprot parameter
      was set for all arches.
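
      A sketch of how a caller could request an uncached mapping through the new
      parameter (illustrative; P2PDMA actually goes through memremap_pages()):

        /* Sketch: request an uncached (UC) mapping for IO memory. */
        struct mhp_params params = {
                .pgprot = pgprot_noncached(PAGE_KERNEL),
        };

        rc = arch_add_memory(nid, start, size, &params);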
      Signed-off-by: NLogan Gunthorpe <logang@deltatee.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Acked-by: NDavid Hildenbrand <david@redhat.com>
      Acked-by: NMichal Hocko <mhocko@suse.com>
      Acked-by: NDan Williams <dan.j.williams@intel.com>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Eric Badger <ebadger@gigaio.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Jason Gunthorpe <jgg@ziepe.ca>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Will Deacon <will@kernel.org>
      Link: http://lkml.kernel.org/r/20200306170846.9333-7-logang@deltatee.com
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      bfeb022f
    • L
      mm/memory_hotplug: rename mhp_restrictions to mhp_params · f5637d3b
      Authored by Logan Gunthorpe
      The mhp_restrictions struct really doesn't specify anything resembling a
      restriction anymore so rename it to be mhp_params as it is a list of
      extended parameters.
      Signed-off-by: NLogan Gunthorpe <logang@deltatee.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Reviewed-by: NDavid Hildenbrand <david@redhat.com>
      Reviewed-by: NDan Williams <dan.j.williams@intel.com>
      Acked-by: NMichal Hocko <mhocko@suse.com>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Eric Badger <ebadger@gigaio.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Jason Gunthorpe <jgg@ziepe.ca>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Will Deacon <will@kernel.org>
      Link: http://lkml.kernel.org/r/20200306170846.9333-3-logang@deltatee.com
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      f5637d3b
    • L
      mm/memory_hotplug: drop the flags field from struct mhp_restrictions · 96c6b598
      Authored by Logan Gunthorpe
      Patch series "Allow setting caching mode in arch_add_memory() for
      P2PDMA", v4.
      
      Currently, the page tables created using memremap_pages() are always
      created with the PAGE_KERNEL caching mode.  However, the P2PDMA code is
      creating pages for PCI BAR memory which should never be accessed through
      the cache and instead use either WC or UC.  This still works in most
      cases, on x86, because the MTRR registers typically override the caching
      settings in the page tables for all of the IO memory to be UC-.
      However, this tends not to work so well on other arches or some rare x86
      machines that have firmware which does not setup the MTRR registers in
      this way.
      
      Instead of this, this series proposes a change to arch_add_memory() to
      take the pgprot required by the mapping which allows us to explicitly
      set pagetable entries for P2PDMA memory to UC.
      
      This change is pretty routine for most of the arches: x86_64, arm64 and
      powerpc simply need to thread the pgprot through to where the page
      tables are setup.  x86_32 unfortunately sets up the page tables at boot
      so must use _set_memory_prot() to change their caching mode.  ia64, s390
      and sh don't appear to have an easy way to change the page tables so,
      for now at least, we just return -EINVAL on such mappings and thus they
      will not support P2PDMA memory until the work for this is done.  This
      should be fine as they don't yet support ZONE_DEVICE.
      
      This patch (of 7):
      
      This variable is not used anywhere and should therefore be removed from
      the structure.
      Signed-off-by: NLogan Gunthorpe <logang@deltatee.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Reviewed-by: NDavid Hildenbrand <david@redhat.com>
      Reviewed-by: NDan Williams <dan.j.williams@intel.com>
      Acked-by: NMichal Hocko <mhocko@suse.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Will Deacon <will@kernel.org>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Eric Badger <ebadger@gigaio.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Jason Gunthorpe <jgg@ziepe.ca>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Paul Mackerras <paulus@samba.org>
      Link: http://lkml.kernel.org/r/20200306170846.9333-2-logang@deltatee.com
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      96c6b598
  15. 08 Apr, 2020 4 commits
    • D
      mm/memory_hotplug: allow to specify a default online_type · 5f47adf7
      Authored by David Hildenbrand
      For now, distributions implement advanced udev rules to essentially
      - Don't online any hotplugged memory (s390x)
      - Online all memory to ZONE_NORMAL (e.g., most virt environments like
        hyperv)
      - Online all memory to ZONE_MOVABLE in case the zone imbalance is taken
        care of (e.g., bare metal, special virt environments)
      
      In summary: All memory is usually onlined the same way, however, the
      kernel always has to ask user space to come up with the same answer.
      E.g., Hyper-V always waits for a memory block to get onlined before
      continuing, otherwise it might end up adding memory faster than
      onlining it, which can result in strange OOM situations.  This waiting
      slows down adding a larger amount of memory.
      
      Let's allow specifying a default online_type, not just "online" and
      "offline".  This allows distributions to configure the default online_type
      when booting up and be done with it.
      
      We can now specify "offline", "online", "online_movable" and
      "online_kernel" via
      - "memhp_default_state=" on the kernel cmdline
      - /sys/devices/system/memory/auto_online_blocks
      just like we are able to specify for a single memory block via
      /sys/devices/system/memory/memoryX/state
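
      For example (mirroring the interfaces above):

        # default all hotplugged memory to ZONE_MOVABLE via sysfs
        echo online_movable > /sys/devices/system/memory/auto_online_blocks
        # or equivalently on the kernel cmdline
        memhp_default_state=online_movable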
      Signed-off-by: NDavid Hildenbrand <david@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Reviewed-by: NWei Yang <richard.weiyang@gmail.com>
      Reviewed-by: NBaoquan He <bhe@redhat.com>
      Acked-by: NMichal Hocko <mhocko@suse.com>
      Acked-by: NPankaj Gupta <pankaj.gupta.linux@gmail.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: "Rafael J. Wysocki" <rafael@kernel.org>
      Cc: Wei Yang <richard.weiyang@gmail.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Eduardo Habkost <ehabkost@redhat.com>
      Cc: Haiyang Zhang <haiyangz@microsoft.com>
      Cc: Igor Mammedov <imammedo@redhat.com>
      Cc: "K. Y. Srinivasan" <kys@microsoft.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Stephen Hemminger <sthemmin@microsoft.com>
      Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
      Cc: Wei Liu <wei.liu@kernel.org>
      Cc: Yumei Huang <yuhuang@redhat.com>
      Link: http://lkml.kernel.org/r/20200317104942.11178-9-david@redhat.com
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      5f47adf7
    • D
      mm/memory_hotplug: convert memhp_auto_online to store an online_type · 862919e5
      Authored by David Hildenbrand
      ...  and rename it to memhp_default_online_type.  This is a preparation
      for more detailed default online behavior.
      Signed-off-by: NDavid Hildenbrand <david@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Reviewed-by: NWei Yang <richard.weiyang@gmail.com>
      Reviewed-by: NBaoquan He <bhe@redhat.com>
      Acked-by: NMichal Hocko <mhocko@suse.com>
      Acked-by: NPankaj Gupta <pankaj.gupta.linux@gmail.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: "Rafael J. Wysocki" <rafael@kernel.org>
      Cc: Wei Yang <richard.weiyang@gmail.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Eduardo Habkost <ehabkost@redhat.com>
      Cc: Haiyang Zhang <haiyangz@microsoft.com>
      Cc: Igor Mammedov <imammedo@redhat.com>
      Cc: "K. Y. Srinivasan" <kys@microsoft.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Stephen Hemminger <sthemmin@microsoft.com>
      Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
      Cc: Wei Liu <wei.liu@kernel.org>
      Cc: Yumei Huang <yuhuang@redhat.com>
      Link: http://lkml.kernel.org/r/20200317104942.11178-8-david@redhat.com
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      862919e5
    • D
      drivers/base/memory: map MMOP_OFFLINE to 0 · efc978ad
      Authored by David Hildenbrand
      Historically, we used the value -1.  Just treat 0 as the special case now.
      Clarify a comment (which was wrong: when we come via device_online() the
      first time, the online_type would have been 0 / MEM_ONLINE).  The default
      is now always MMOP_OFFLINE.  This removes the last user of the manual
      "-1", which didn't use the enum value.
      
      This is a preparation to use the online_type as an array index.
      Signed-off-by: NDavid Hildenbrand <david@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Reviewed-by: NWei Yang <richard.weiyang@gmail.com>
      Reviewed-by: NBaoquan He <bhe@redhat.com>
      Acked-by: NMichal Hocko <mhocko@suse.com>
      Acked-by: NPankaj Gupta <pankaj.gupta.linux@gmail.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: "Rafael J. Wysocki" <rafael@kernel.org>
      Cc: Wei Yang <richard.weiyang@gmail.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Eduardo Habkost <ehabkost@redhat.com>
      Cc: Haiyang Zhang <haiyangz@microsoft.com>
      Cc: Igor Mammedov <imammedo@redhat.com>
      Cc: "K. Y. Srinivasan" <kys@microsoft.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Stephen Hemminger <sthemmin@microsoft.com>
      Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
      Cc: Wei Liu <wei.liu@kernel.org>
      Cc: Yumei Huang <yuhuang@redhat.com>
      Link: http://lkml.kernel.org/r/20200317104942.11178-3-david@redhat.com
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      efc978ad
    • D
      drivers/base/memory: rename MMOP_ONLINE_KEEP to MMOP_ONLINE · 956f8b44
      Authored by David Hildenbrand
      Patch series "mm/memory_hotplug: allow to specify a default online_type", v3.
      
      Distributions nowadays use udev rules ([1] [2]) to specify if and how to
      online hotplugged memory.  The rules seem to get more complex with many
      special cases.  Due to the various special cases,
      CONFIG_MEMORY_HOTPLUG_DEFAULT_ONLINE cannot be used.  All memory hotplug
      is handled via udev rules.
      
      Every time we hotplug memory, the udev rule will come to the same
      conclusion.  Especially Hyper-V (but also soon virtio-mem) add a lot of
      memory in separate memory blocks and wait for memory to get onlined by
      user space before continuing to add more memory blocks (to not add memory
      faster than it is getting onlined).  This of course slows down the whole
      memory hotplug process.
      
      To make the job of distributions easier and to avoid udev rules that get
      more and more complicated, let's extend the mechanism provided by
      - /sys/devices/system/memory/auto_online_blocks
      - "memhp_default_state=" on the kernel cmdline
      to be able to specify also "online_movable" as well as "online_kernel"
      
      === Example /usr/libexec/config-memhotplug ===
      
      #!/bin/bash
      
      VIRT=`systemd-detect-virt --vm`
      ARCH=`uname -p`
      
      sense_virtio_mem() {
        if [ -d "/sys/bus/virtio/drivers/virtio_mem/" ]; then
          DEVICES=`find /sys/bus/virtio/drivers/virtio_mem/ -maxdepth 1 -type l | wc -l`
          if [ $DEVICES != "0" ]; then
              return 0
          fi
        fi
        return 1
      }
      
      if [ ! -e "/sys/devices/system/memory/auto_online_blocks" ]; then
        echo "Memory hotplug configuration support missing in the kernel"
        exit 1
      fi
      
      if grep "memhp_default_state=" /proc/cmdline > /dev/null; then
        echo "Memory hotplug configuration overridden in kernel cmdline (memhp_default_state=)"
        exit 1
      fi
      
      if [ $VIRT == "microsoft" ]; then
        echo "Detected Hyper-V on $ARCH"
        # Hyper-V wants all memory in ZONE_NORMAL
        ONLINE_TYPE="online_kernel"
      elif sense_virtio_mem; then
        echo "Detected virtio-mem on $ARCH"
        # virtio-mem wants all memory in ZONE_NORMAL
        ONLINE_TYPE="online_kernel"
      elif [ $ARCH == "s390x" ] || [ $ARCH == "s390" ]; then
        echo "Detected $ARCH"
        # standby memory should not be onlined automatically
        ONLINE_TYPE="offline"
      elif [ $ARCH == "ppc64" ] || [ $ARCH == "ppc64le" ]; then
        echo "Detected" $ARCH
        # PPC64 onlines all hotplugged memory right from the kernel
        ONLINE_TYPE="offline"
      elif [ $VIRT == "none" ]; then
        echo "Detected bare-metal on $ARCH"
        # Bare metal users expect hotplugged memory to be unpluggable. We assume
        # that ZONE imbalances on such enterprise servers cannot happen and that
        # this is properly documented
        ONLINE_TYPE="online_movable"
      else
        # TODO: Hypervisors that want to unplug DIMMs and can guarantee that ZONE
        # imbalances won't happen
        echo "Detected $VIRT on $ARCH"
        # Usually, ballooning is used in virtual environments, so memory should go to
        # ZONE_NORMAL. However, sometimes "movable_node" is relevant.
        ONLINE_TYPE="online"
      fi
      
      echo "Selected online_type:" $ONLINE_TYPE
      
      # Configure what to do with memory that will be hotplugged in the future
      echo $ONLINE_TYPE 2>/dev/null > /sys/devices/system/memory/auto_online_blocks
      if [ $? != "0" ]; then
        echo "Memory hotplug cannot be configured (e.g., old kernel or missing permissions)"
        # A backup udev rule should handle old kernels if necessary
        exit 1
      fi
      
      # Process all already plugged blocks (e.g., DIMMs, but also Hyper-V or virtio-mem)
      if [ $ONLINE_TYPE != "offline" ]; then
        for MEMORY in /sys/devices/system/memory/memory*; do
          STATE=`cat $MEMORY/state`
          if [ $STATE == "offline" ]; then
              echo $ONLINE_TYPE > $MEMORY/state
          fi
        done
      fi
      
      === Example /usr/lib/systemd/system/config-memhotplug.service ===
      
      [Unit]
      Description=Configure memory hotplug behavior
      DefaultDependencies=no
      Conflicts=shutdown.target
      Before=sysinit.target shutdown.target
      After=systemd-modules-load.service
      ConditionPathExists=|/sys/devices/system/memory/auto_online_blocks
      
      [Service]
      ExecStart=/usr/libexec/config-memhotplug
      Type=oneshot
      TimeoutSec=0
      RemainAfterExit=yes
      
      [Install]
      WantedBy=sysinit.target
      
      === Example modification to the 40-redhat.rules [2] ===
      
      : diff --git a/40-redhat.rules b/40-redhat.rules-new
      : index 2c690e5..168fd03 100644
      : --- a/40-redhat.rules
      : +++ b/40-redhat.rules-new
      : @@ -6,6 +6,9 @@ SUBSYSTEM=="cpu", ACTION=="add", TEST=="online", ATTR{online}=="0", ATTR{online}
      :  # Memory hotadd request
      :  SUBSYSTEM!="memory", GOTO="memory_hotplug_end"
      :  ACTION!="add", GOTO="memory_hotplug_end"
      : +# memory hotplug behavior configured
      : +PROGRAM=="grep online /sys/devices/system/memory/auto_online_blocks", GOTO="memory_hotplug_end"
      : +
      :  PROGRAM="/bin/uname -p", RESULT=="s390*", GOTO="memory_hotplug_end"
      :
      :  ENV{.state}="online"
      
      ===
      
      [1] https://github.com/lnykryn/systemd-rhel/pull/281
      [2] https://github.com/lnykryn/systemd-rhel/blob/staging/rules/40-redhat.rules
      
      This patch (of 8):
      
      The name is misleading and it's not really clear what is "kept".  Let's
      just name it like the online_type name we expose to user space ("online").
      
      Add some documentation to the types.
      Signed-off-by: NDavid Hildenbrand <david@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Reviewed-by: NWei Yang <richard.weiyang@gmail.com>
      Reviewed-by: NBaoquan He <bhe@redhat.com>
      Acked-by: NPankaj Gupta <pankaj.gupta.linux@gmail.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: "Rafael J. Wysocki" <rafael@kernel.org>
      Cc: Wei Yang <richard.weiyang@gmail.com>
      Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
      Cc: Yumei Huang <yuhuang@redhat.com>
      Cc: Igor Mammedov <imammedo@redhat.com>
      Cc: Eduardo Habkost <ehabkost@redhat.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Haiyang Zhang <haiyangz@microsoft.com>
      Cc: K. Y. Srinivasan <kys@microsoft.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au> (powerpc)
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Stephen Hemminger <sthemmin@microsoft.com>
      Cc: Wei Liu <wei.liu@kernel.org>
      Link: http://lkml.kernel.org/r/20200319131221.14044-1-david@redhat.com
      Link: http://lkml.kernel.org/r/20200317104942.11178-2-david@redhat.com
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      956f8b44
  16. 04 Feb, 2020 1 commit