1. 23 Mar, 2022 3 commits
    • D
      drivers/base/memory: determine and store zone for single-zone memory blocks · 395f6081
      David Hildenbrand committed
      test_pages_in_a_zone() is just another nasty PFN walker that can easily
      stumble over ZONE_DEVICE memory ranges falling into the same memory block
      as ordinary system RAM: the memmap of parts of these ranges might possibly
      be uninitialized.  In fact, we observed (on an older kernel) with UBSAN:
      
        UBSAN: Undefined behaviour in ./include/linux/mm.h:1133:50
        index 7 is out of range for type 'zone [5]'
        CPU: 121 PID: 35603 Comm: read_all Kdump: loaded Tainted: [...]
        Hardware name: Dell Inc. PowerEdge R7425/08V001, BIOS 1.12.2 11/15/2019
        Call Trace:
         dump_stack+0x9a/0xf0
         ubsan_epilogue+0x9/0x7a
         __ubsan_handle_out_of_bounds+0x13a/0x181
         test_pages_in_a_zone+0x3c4/0x500
         show_valid_zones+0x1fa/0x380
         dev_attr_show+0x43/0xb0
         sysfs_kf_seq_show+0x1c5/0x440
         seq_read+0x49d/0x1190
         vfs_read+0xff/0x300
         ksys_read+0xb8/0x170
         do_syscall_64+0xa5/0x4b0
         entry_SYSCALL_64_after_hwframe+0x6a/0xdf
        RIP: 0033:0x7f01f4439b52
      
      We seem to stumble over a memmap that contains a garbage zone id.  While
      we could try inserting pfn_to_online_page() calls, it will just make
      memory offlining slower, because we use test_pages_in_a_zone() to make
      sure we're offlining pages that all belong to the same zone.
      
      Let's just get rid of this PFN walker and determine the single zone of a
      memory block -- if any -- for early memory blocks during boot.  For memory
      onlining, we know the single zone already.  Let's avoid any additional
      memmap scanning and just rely on the zone information available during
      boot.
      
      For memory hot(un)plug, we only really care about memory blocks that:
      * span a single zone (and, thereby, a single node)
      * are completely System RAM (IOW, no holes, no ZONE_DEVICE)
      If one of these conditions is not met, we reject memory offlining.
      Hotplugged memory blocks (starting out offline) always meet both
      conditions.
      
      There are three scenarios to handle:
      
      (1) Memory hot(un)plug
      
      A memory block with zone == NULL cannot be offlined, corresponding to
      our previous test_pages_in_a_zone() check.
      
      After successful memory onlining/offlining, we simply set the zone
      accordingly.
      * Memory onlining: set the zone we just used for onlining
      * Memory offlining: set zone = NULL
      
      So a hotplugged memory block starts with zone = NULL. Once memory
      onlining is done, we set the proper zone.
      
      (2) Boot memory with !CONFIG_NUMA
      
      We know that there is just a single pgdat, so we simply scan all zones
      of that pgdat for an intersection with our memory block PFN range when
      adding the memory block. If more than one zone intersects (e.g., DMA and
      DMA32 on x86 for the first memory block) we set zone = NULL and
      consequently mimic what test_pages_in_a_zone() used to do.
      
      (3) Boot memory with CONFIG_NUMA
      
      At the point in time we create the memory block devices during boot, we
      don't know yet which nodes *actually* span a memory block. While we could
      scan all zones of all nodes for intersections, overlapping nodes complicate
      the situation and scanning all nodes is possibly expensive. But that
      problem has already been solved by the code that sets the node of a memory
      block and creates the link in the sysfs --
      do_register_memory_block_under_node().
      
      So, we hook into the code that sets the node id for a memory block. If
      we already have a different node id set for the memory block, we know
      that multiple nodes *actually* have PFNs falling into our memory block:
      we set zone = NULL and consequently mimic what test_pages_in_a_zone() used
      to do. If there is no node id set, we do the same as (2) for the given
      node.
      
      Note that the call order in driver_init() is:
      -> memory_dev_init(): create memory block devices
      -> node_dev_init(): link memory block devices to the node and set the
      		    node id
      
      So in summary, we detect if there is a single zone responsible for this
      memory block and we consequently store the zone in that case in the
      memory block, updating it during memory onlining/offlining.
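The boot-time scenarios (2) and (3) and the offlining check can be modeled in a few lines. This is a minimal Python sketch of the idea, not the kernel code: the `Zone` and `MemoryBlock` classes, `single_zone_for_range()` and `set_node()` are names invented for this illustration, and PFN ranges are treated as half-open intervals.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Zone:
    name: str
    start_pfn: int
    end_pfn: int  # exclusive

    def intersects(self, start_pfn, end_pfn):
        return self.start_pfn < end_pfn and start_pfn < self.end_pfn


def single_zone_for_range(zones, start_pfn, end_pfn):
    """Return the only zone intersecting [start_pfn, end_pfn), else None."""
    hits = [z for z in zones if z.intersects(start_pfn, end_pfn)]
    return hits[0] if len(hits) == 1 else None


@dataclass
class MemoryBlock:
    start_pfn: int
    end_pfn: int
    nid: Optional[int] = None    # no node id set yet
    zone: Optional[Zone] = None  # None means: offlining is rejected

    def set_node(self, nid, node_zones):
        """Hypothetical hook, called once per node that has PFNs in this block."""
        if self.nid is None:
            # First node seen: same zone scan as in the !CONFIG_NUMA case.
            self.nid = nid
            self.zone = single_zone_for_range(
                node_zones, self.start_pfn, self.end_pfn)
        elif self.nid != nid:
            # A second node also has PFNs in this block: no single zone.
            self.zone = None

    def can_offline(self):
        return self.zone is not None
```

A block that intersects both DMA and DMA32 ends up with `zone is None` and is never offlined, which mirrors what the removed PFN walker used to conclude.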
      
      Link: https://lkml.kernel.org/r/20220210184359.235565-3-david@redhat.com
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Reported-by: Rafael Parra <rparrazo@redhat.com>
      Reviewed-by: Oscar Salvador <osalvador@suse.de>
      Cc: "Rafael J. Wysocki" <rafael@kernel.org>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Rafael Parra <rparrazo@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • D
      drivers/base/memory: add memory block to memory group after registration succeeded · 7ea0d2d7
      David Hildenbrand committed
      If register_memory() fails, we freed the memory block but already added
      the memory block to the group list, not good.  Let's defer adding the
      block to the memory group to after registering the memory block device.
      
      We do handle it properly during unregister_memory(), but that's not
      called when the registration fails.
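The ordering fix can be illustrated with a small Python model. It is only a sketch: `add_memory_block()`, the `register` callback and the plain list standing in for the group list are hypothetical names, not the kernel's data structures.

```python
class RegistrationError(Exception):
    """Stands in for a failing register_memory() call in this sketch."""


def add_memory_block(block_id, group_blocks, register):
    """Register first; touch the group list only after registration succeeded."""
    block = {"id": block_id}
    register(block)             # may raise: the group list is still untouched
    group_blocks.append(block)  # deferred until the device is registered
    return block
```

Because the append happens last, a failed registration needs no cleanup of the group list.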
      
      Link: https://lkml.kernel.org/r/20220128144540.153902-1-david@redhat.com
      Fixes: 028fc57a ("drivers/base/memory: introduce "memory groups" to logically group memory blocks")
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Reviewed-by: Oscar Salvador <osalvador@suse.de>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: "Rafael J. Wysocki" <rafael@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • L
      mm/hwpoison: avoid the impact of hwpoison_filter() return value on mce handler · d1fe111f
      luofei committed
      When a hwpoison page meets the filter conditions, this should not be
      reported to the mce handler as successful memory_failure() processing.
      Instead, memory_failure() should return a distinct value; otherwise the
      mce handler assumes that the error page has been identified and
      isolated, which may lead to calling set_mce_nospec() to change page
      attributes, etc.
      
      Here memory_failure() returns -EOPNOTSUPP to indicate that the error
      event was filtered.  The mce handler should not take any action in this
      situation, and the hwpoison injector should treat it as correct.
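A small Python model of the three callers may make the contract clearer. It is a sketch under assumptions: the set of filtered PFNs, `mce_handler()` and `injector_ok()` are invented for illustration, and a Python set stands in for the effect of set_mce_nospec().

```python
import errno


def memory_failure(pfn, filtered_pfns):
    """Return 0 when the page was handled, -EOPNOTSUPP when it was filtered."""
    if pfn in filtered_pfns:
        return -errno.EOPNOTSUPP
    return 0


def mce_handler(pfn, filtered_pfns, nospec_pfns):
    ret = memory_failure(pfn, filtered_pfns)
    if ret == 0:
        nospec_pfns.add(pfn)  # stands in for set_mce_nospec()
    return ret                # filtered events cause no further action


def injector_ok(ret):
    """The injector treats a filtered event as a correct outcome."""
    return ret in (0, -errno.EOPNOTSUPP)
```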
      
      Link: https://lkml.kernel.org/r/20220223082135.2769649-1-luofei@unicloud.com
      Signed-off-by: luofei <luofei@unicloud.com>
      Acked-by: Borislav Petkov <bp@suse.de>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Naoya Horiguchi <naoya.horiguchi@nec.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Tony Luck <tony.luck@intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  2. 09 Sep, 2021 5 commits
    • D
      mm/memory_hotplug: improved dynamic memory group aware "auto-movable" online policy · 3fcebf90
      David Hildenbrand committed
      Currently, the "auto-movable" online policy does not allow for hotplugged
      KERNEL (ZONE_NORMAL) memory to increase the amount of MOVABLE memory we
      can have, primarily, because there is no coordination across memory
      devices and we don't want to create zone-imbalances accidentally when
      unplugging memory.
      
      However, within a single memory device it's different.  Let's allow for
      KERNEL memory within a dynamic memory group to allow for more MOVABLE
      within the same memory group.  The only thing we have to take care of is
      that the managing driver avoids zone imbalances by unplugging MOVABLE
      memory first, otherwise there can be corner cases where unplug of memory
      could result in (accidental) zone imbalances.
      
      virtio-mem is the only user of dynamic memory groups and recently added
      support for prioritizing unplug of ZONE_MOVABLE over ZONE_NORMAL, so we
      don't need a new toggle to enable it for dynamic memory groups.
      
      We limit this handling to dynamic memory groups, because:
      
      * We want to keep the runtime overhead for collecting stats when
        onlining a single memory block small.  We tend to have only a handful of
        dynamic memory groups, but we can have quite some static memory groups
        (e.g., 256 DIMMs).
      
      * It doesn't make too much sense for static memory groups, as we try
        onlining all applicable memory blocks either completely to ZONE_MOVABLE
        or not.  In ordinary operation, we won't have a mixture of zones within
        a static memory group.
      
      When adding memory to a dynamic memory group, we'll first online memory to
      ZONE_MOVABLE as long as early KERNEL memory allows for it.  Then, we'll
      online the next unit(s) to ZONE_NORMAL, until we can online the next
      unit(s) to ZONE_MOVABLE.
      
      For a simple virtio-mem device with a MOVABLE:KERNEL ratio of 3:1, it will
      result in a layout like:
      
        [M][M][M][M][M][M][M][M][N][M][M][M][N][M][M][M]...
        ^ movable memory due to early kernel memory
      			   ^ allows for more movable memory ...
      			      ^-----^ ... here
      				       ^ allows for more movable memory ...
      				          ^-----^ ... here
      
      While the created layout is sub-optimal when it comes to contiguous zones,
      it gives us the maximum flexibility when dynamically growing/shrinking a
      device; we can grow small VMs really big in small steps, and still shrink
      reliably to e.g., 1/4 of the maximum VM size in this example, removing
      full memory blocks along with meta data more reliably.
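The layout above can be reproduced with a simplified Python model of the ratio check. This is a sketch, not the kernel's implementation: it ignores units, NUMA and other groups, the 128 MiB block size is an assumption, and the 352 MiB of early kernel memory is a made-up figure chosen so that eight blocks fit before the first Normal block.

```python
BLOCK_MIB = 128  # memory block size assumed by this sketch


def simulate_dynamic_group(early_kernel_mib, ratio_percent, n_blocks):
    """Online n_blocks one by one; return a string of 'M' and 'N'."""
    kernel_mib = early_kernel_mib  # grows with Normal blocks of this group
    movable_mib = 0
    layout = []
    for _ in range(n_blocks):
        limit_mib = ratio_percent * kernel_mib // 100
        if movable_mib + BLOCK_MIB <= limit_mib:
            movable_mib += BLOCK_MIB
            layout.append("M")
        else:
            # Normal memory inside the group raises the limit for later blocks.
            kernel_mib += BLOCK_MIB
            layout.append("N")
    return "".join(layout)
```

With `simulate_dynamic_group(352, 300, 16)` each Normal block buys room for three more Movable blocks, which is the 3:1 pattern shown in the diagram.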
      
      Mark dynamic memory groups in the xarray such that we can efficiently
      iterate over them when collecting stats.  In usual setups, we have one
      virtio-mem device per NUMA node, and usually only a small number of NUMA
      nodes.
      
      Note: for now, there seems to be no compelling reason to make this
      behavior configurable.
      
      Link: https://lkml.kernel.org/r/20210806124715.17090-10-david@redhat.com
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Cc: Anshuman Khandual <anshuman.khandual@arm.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Hui Zhu <teawater@gmail.com>
      Cc: Jason Wang <jasowang@redhat.com>
      Cc: Len Brown <lenb@kernel.org>
      Cc: Marek Kedzierski <mkedzier@redhat.com>
      Cc: "Michael S. Tsirkin" <mst@redhat.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Mike Rapoport <rppt@kernel.org>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: Pankaj Gupta <pankaj.gupta.linux@gmail.com>
      Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
      Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
      Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
      Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Wei Yang <richard.weiyang@linux.alibaba.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • D
      mm/memory_hotplug: memory group aware "auto-movable" online policy · 445fcf7c
      David Hildenbrand committed
      Use memory groups to improve our "auto-movable" onlining policy:
      
      1. For static memory groups (e.g., a DIMM), online a memory block MOVABLE
         only if all other memory blocks in the group are either MOVABLE or could
         be onlined MOVABLE. A DIMM will either be MOVABLE or not, not a mixture.
      
      2. For dynamic memory groups (e.g., a virtio-mem device), online a
         memory block MOVABLE only if all other memory blocks inside the
         current unit are either MOVABLE or could be onlined MOVABLE. For a
         virtio-mem device with a device block size of 512 MiB, all 128 MiB
         memory blocks within a 512 MiB unit will either be MOVABLE or not, not
         a mixture.
      
      We have to pass the memory group to zone_for_pfn_range() to take the
      memory group into account.
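The all-or-nothing rule for a static group or a dynamic unit could be sketched as follows. This is a simplified Python model, not the code behind zone_for_pfn_range(): `zone_for_block()`, the state strings and the MiB-based accounting are assumptions made for illustration.

```python
BLOCK_MIB = 128  # memory block size assumed by this sketch


def zone_for_block(other_states, movable_mib, kernel_mib, ratio_percent):
    """Pick the zone for one block that is about to be onlined.

    other_states lists the other blocks of the same static group or dynamic
    unit, each being "movable", "kernel" or "offline".
    """
    if "kernel" in other_states:
        return "Normal"  # never create a mixture within the group/unit
    # This block plus all still-offline siblings must fit into the ratio.
    pending_mib = (1 + other_states.count("offline")) * BLOCK_MIB
    limit_mib = ratio_percent * kernel_mib // 100
    return "Movable" if movable_mib + pending_mib <= limit_mib else "Normal"
```

Note that a block is refused as soon as the whole remaining unit would not fit, even if the single block on its own would.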
      
      Note: for now, there seems to be no compelling reason to make this
      behavior configurable.
      
      Link: https://lkml.kernel.org/r/20210806124715.17090-9-david@redhat.com
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Cc: Anshuman Khandual <anshuman.khandual@arm.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Hui Zhu <teawater@gmail.com>
      Cc: Jason Wang <jasowang@redhat.com>
      Cc: Len Brown <lenb@kernel.org>
      Cc: Marek Kedzierski <mkedzier@redhat.com>
      Cc: "Michael S. Tsirkin" <mst@redhat.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Mike Rapoport <rppt@kernel.org>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: Pankaj Gupta <pankaj.gupta.linux@gmail.com>
      Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
      Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
      Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
      Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Wei Yang <richard.weiyang@linux.alibaba.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • D
      mm/memory_hotplug: track present pages in memory groups · 836809ec
      David Hildenbrand committed
      Let's track all present pages in each memory group.  Especially, track
      memory present in ZONE_MOVABLE and memory present in one of the kernel
      zones (which really only is ZONE_NORMAL right now as memory groups only
      apply to hotplugged memory) separately within a memory group, to prepare
      for making smart auto-online decisions for individual memory blocks within
      a memory group based on group statistics.
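As a rough illustration of the per-group bookkeeping, here is a Python sketch. The class name, the field names and the `adjust()` helper are assumptions for this example rather than the kernel's definitions.

```python
from dataclasses import dataclass


@dataclass
class MemoryGroupStats:
    present_movable_pages: int = 0  # pages of the group in ZONE_MOVABLE
    present_kernel_pages: int = 0   # pages of the group in a kernel zone

    def adjust(self, zone_name, nr_pages):
        """nr_pages is positive when onlining and negative when offlining."""
        if zone_name == "Movable":
            self.present_movable_pages += nr_pages
        else:
            self.present_kernel_pages += nr_pages
```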
      
      Link: https://lkml.kernel.org/r/20210806124715.17090-5-david@redhat.com
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Cc: Anshuman Khandual <anshuman.khandual@arm.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Hui Zhu <teawater@gmail.com>
      Cc: Jason Wang <jasowang@redhat.com>
      Cc: Len Brown <lenb@kernel.org>
      Cc: Marek Kedzierski <mkedzier@redhat.com>
      Cc: "Michael S. Tsirkin" <mst@redhat.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Mike Rapoport <rppt@kernel.org>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: Pankaj Gupta <pankaj.gupta.linux@gmail.com>
      Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
      Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
      Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
      Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Wei Yang <richard.weiyang@linux.alibaba.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • D
      drivers/base/memory: introduce "memory groups" to logically group memory blocks · 028fc57a
      David Hildenbrand committed
      In our "auto-movable" memory onlining policy, we want to make decisions
      across memory blocks of a single memory device.  Examples of memory
      devices include ACPI memory devices (in the simplest case a single DIMM)
      and virtio-mem.  For now, we don't have a connection between a single
      memory block device and the real memory device.  Each memory device
      consists of 1..X memory block devices.
      
      Let's logically group memory blocks belonging to the same memory device in
      "memory groups".  Memory groups can span multiple physical ranges and a
      memory group itself does not contain any information regarding physical
      ranges, only properties (e.g., "max_pages") necessary for improved memory
      onlining.
      
      Introduce two memory group types:
      
      1) Static memory group: E.g., a single ACPI memory device, consisting
         of 1..X memory resources.  A memory group consists of 1..Y memory
         blocks.  The whole group is added/removed in one go.  If any part
         cannot get offlined, the whole group cannot be removed.
      
      2) Dynamic memory group: E.g., a single virtio-mem device.  Memory is
         dynamically added/removed in a fixed granularity, called a "unit",
         consisting of 1..X memory blocks.  A unit is added/removed in one go.
         If any part of a unit cannot get offlined, the whole unit cannot be
         removed.
      
      In case of 1) we usually want either all memory managed by ZONE_MOVABLE or
      none.  In case of 2) we usually want to have as many units as possible
      managed by ZONE_MOVABLE.  We want a single unit to be of the same type.
      
      For now, memory groups are an internal concept that is not exposed to user
      space; we might want to change that in the future, though.
      
      add_memory() users can specify a mgid instead of a nid when passing the
      MHP_NID_IS_MGID flag.
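The two group types and the mgid-instead-of-nid convention can be modeled like this. It is a Python sketch only: the registry class, its method names and the numeric value of the flag are placeholders for illustration and are not taken from the kernel sources.

```python
from dataclasses import dataclass
from itertools import count

MHP_NID_IS_MGID = 1 << 0  # placeholder value chosen for this sketch


@dataclass
class MemoryGroup:
    mgid: int
    nid: int
    is_dynamic: bool
    max_pages: int = 0   # static groups: upper bound on pages in the group
    unit_pages: int = 0  # dynamic groups: pages per unit


class MemoryGroupRegistry:
    def __init__(self):
        self._next_id = count()
        self._groups = {}

    def _register(self, **fields):
        group = MemoryGroup(mgid=next(self._next_id), **fields)
        self._groups[group.mgid] = group
        return group.mgid

    def register_static(self, nid, max_pages):
        return self._register(nid=nid, is_dynamic=False, max_pages=max_pages)

    def register_dynamic(self, nid, unit_pages):
        return self._register(nid=nid, is_dynamic=True, unit_pages=unit_pages)

    def resolve(self, nid_or_mgid, flags=0):
        """Return (nid, group) as an add_memory()-like caller would see it."""
        if flags & MHP_NID_IS_MGID:
            group = self._groups[nid_or_mgid]
            return group.nid, group
        return nid_or_mgid, None
```

A group deliberately stores no physical ranges here, only the properties needed for onlining decisions, matching the description above.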
      
      Link: https://lkml.kernel.org/r/20210806124715.17090-4-david@redhat.com
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Cc: Anshuman Khandual <anshuman.khandual@arm.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Hui Zhu <teawater@gmail.com>
      Cc: Jason Wang <jasowang@redhat.com>
      Cc: Len Brown <lenb@kernel.org>
      Cc: Marek Kedzierski <mkedzier@redhat.com>
      Cc: "Michael S. Tsirkin" <mst@redhat.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Mike Rapoport <rppt@kernel.org>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: Pankaj Gupta <pankaj.gupta.linux@gmail.com>
      Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
      Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
      Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
      Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Wei Yang <richard.weiyang@linux.alibaba.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • D
      mm: track present early pages per zone · 4b097002
      David Hildenbrand committed
      Patch series "mm/memory_hotplug: "auto-movable" online policy and memory groups", v3.
      
      I. Goal
      
      The goal of this series is improving in-kernel auto-online support.  It
      tackles the fundamental problems that:
      
       1) We can create zone imbalances when onlining all memory blindly to
          ZONE_MOVABLE, in the worst case crashing the system. We have to know
          upfront how much memory we are going to hotplug such that we can
          safely enable auto-onlining of all hotplugged memory to ZONE_MOVABLE
          via "online_movable". This is far from practical and only applicable in
          limited setups -- like inside VMs under the RHV/oVirt hypervisor which
          will never hotplug more than 3 times the boot memory (and the
          limitation is only in place due to the Linux limitation).
      
       2) We see more setups that implement dynamic VM resizing, hot(un)plugging
          memory to resize VM memory. In these setups, we might hotplug a lot of
          memory, but it might happen in various small steps in both directions
          (e.g., 2 GiB -> 8 GiB -> 4 GiB -> 16 GiB ...). virtio-mem is the
          primary driver of this upstream right now, performing such dynamic
          resizing NUMA-aware via multiple virtio-mem devices.
      
          Onlining all hotplugged memory to ZONE_NORMAL means we basically have
          no hotunplug guarantees. Onlining all to ZONE_MOVABLE means we can
          easily run into zone imbalances when growing a VM. We want a mixture,
          and we want as much memory as reasonable/configured in ZONE_MOVABLE.
          Details regarding zone imbalances can be found at [1].
      
       3) Memory devices consist of 1..X memory block devices, however, the
          kernel doesn't really track the relationship. Consequently, also user
          space has no idea. We want to make per-device decisions.
      
          As one example, for memory hotunplug it doesn't make sense to use a
          mixture of zones within a single DIMM: we want all MOVABLE if
          possible, otherwise all !MOVABLE, because any !MOVABLE part will easily
          block the whole DIMM from getting hotunplugged.
      
          As another example, virtio-mem operates on individual units that span
          1..X memory blocks. Similar to a DIMM, we want a unit to either be all
          MOVABLE or !MOVABLE. A "unit" can be thought of like a DIMM, however,
          all units of a virtio-mem device logically belong together and are
          managed (added/removed) by a single driver. We want as much memory of
          a virtio-mem device to be MOVABLE as possible.
      
       4) We want memory onlining to be done right from the kernel while adding
          memory, not triggered by user space via udev rules; for example, this
          is required for fast memory hotplug for drivers that add individual
          memory blocks, like virtio-mem. We want a way to configure a policy in
          the kernel and avoid implementing advanced policies in user space.
      
      The auto-onlining support we have in the kernel is not sufficient.  All we
      have is a) online everything MOVABLE (online_movable) b) online everything
      !MOVABLE (online_kernel) c) keep zones contiguous (online).  This series
      allows configuring c) to mean instead "online movable if possible
      according to the configuration, driven by a maximum MOVABLE:KERNEL ratio"
      -- a new onlining policy.
      
      II. Approach
      
      This series does 3 things:
      
       1) Introduces the "auto-movable" online policy that initially operates on
          individual memory blocks only. It uses a maximum MOVABLE:KERNEL ratio
          to make a decision whether a memory block will be onlined to
          ZONE_MOVABLE or not. However, in the basic form, hotplugged KERNEL
          memory does not allow for more MOVABLE memory (details in the
          patches). CMA memory is treated like MOVABLE memory.
      
       2) Introduces static (e.g., DIMM) and dynamic (e.g., virtio-mem) memory
          groups and uses group information to make decisions in the
          "auto-movable" online policy across memory blocks of a single memory
          device (modeled as memory group). More details can be found in patch
          #3 or in the DIMM example below.
      
       3) Maximizes ZONE_MOVABLE memory within dynamic memory groups, by
          allowing ZONE_NORMAL memory within a dynamic memory group to allow for
          more ZONE_MOVABLE memory within the same memory group. The target use
          case is dynamic VM resizing using virtio-mem. See the virtio-mem
          example below.
      
      I remember that the basic idea of using a ratio to implement a policy in
      the kernel was once mentioned by Vitaly Kuznetsov, but I might be wrong (I
      lost the pointer to that discussion).
      
      For me, the main use case is using it along with virtio-mem (and DIMMs /
      ppc64 dlpar where necessary) for dynamic resizing of VMs, increasing the
      amount of memory we can hotunplug reliably again if we might eventually
      hotplug a lot of memory to a VM.
      
      III. Target Usage
      
      The target usage will be:
      
       1) Linux boots with "mhp_default_online_type=offline"
      
       2) User space (e.g., systemd unit) configures memory onlining (according
          to a config file and system properties), for example:
          * Setting memory_hotplug.online_policy=auto-movable
          * Setting memory_hotplug.auto_movable_ratio=301
          * Setting memory_hotplug.auto_movable_numa_aware=true
      
       3) User space enables auto onlining via "echo online >
          /sys/devices/system/memory/auto_online_blocks"
      
       4) User space triggers manual onlining of all already-offline memory
          blocks (go over offline memory blocks and set them to "online")
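Steps 2) to 4) might look like the following from user space. This is a hedged Python sketch: it assumes the three parameters are writable under `/sys/module/memory_hotplug/parameters/` and that each memory block exposes a `state` file, and it writes "1" for the boolean parameter. The `sysfs_root` argument is an addition of this sketch so that the function can be tried against a scratch directory.

```python
from pathlib import Path


def configure_auto_movable(sysfs_root="/sys", ratio=301, numa_aware=True):
    """Apply the onlining policy, enable auto-onlining, online leftovers."""
    root = Path(sysfs_root)

    # Step 2: configure the policy (parameter location is an assumption).
    params = root / "module" / "memory_hotplug" / "parameters"
    (params / "online_policy").write_text("auto-movable\n")
    (params / "auto_movable_ratio").write_text(f"{ratio}\n")
    (params / "auto_movable_numa_aware").write_text("1\n" if numa_aware else "0\n")

    # Step 3: enable auto-onlining for memory hotplugged from now on.
    memory = root / "devices" / "system" / "memory"
    (memory / "auto_online_blocks").write_text("online\n")

    # Step 4: online blocks that were added before the policy was in place.
    onlined = []
    for state in sorted(memory.glob("memory*/state")):
        if state.read_text().strip() == "offline":
            state.write_text("online\n")
            onlined.append(state.parent.name)
    return onlined
```

On a real system this needs root privileges, and onlining an individual block can still fail, which this sketch does not handle.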
      
      IV. Example
      
      For DIMMs, hotplugging 4 GiB DIMMs to a 4 GiB VM with a configured ratio of
      301% results in the following layout:
      	Memory block 0-15:    DMA32   (early)
      	Memory block 32-47:   Normal  (early)
      	Memory block 48-79:   Movable (DIMM 0)
      	Memory block 80-111:  Movable (DIMM 1)
      	Memory block 112-143: Movable (DIMM 2)
      	Memory block 144-175: Normal  (DIMM 3)
      	Memory block 176-207: Normal  (DIMM 4)
      	... all Normal
      	(-> hotplugged Normal memory does not allow for more Movable memory)
      
      For virtio-mem, using a simple, single virtio-mem device with a 4 GiB VM
      will result in the following layout:
      	Memory block 0-15:    DMA32   (early)
      	Memory block 32-47:   Normal  (early)
      	Memory block 48-143:  Movable (virtio-mem, first 12 GiB)
      	Memory block 144:     Normal  (virtio-mem, next 128 MiB)
      	Memory block 145-147: Movable (virtio-mem, next 384 MiB)
      	Memory block 148:     Normal  (virtio-mem, next 128 MiB)
      	Memory block 149-151: Movable (virtio-mem, next 384 MiB)
      	... Normal/Movable mixture as above
      	(-> hotplugged Normal memory allows for more Movable memory within
      	    the same device)
      
      Which gives us maximum flexibility when dynamically growing/shrinking a
      VM in smaller steps.
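The DIMM case can be checked with a few lines of arithmetic. The sketch below is a simplified Python model, not kernel code: it assumes 128 MiB memory blocks (so 4 GiB equals 32 blocks), counts in whole blocks, and treats each DIMM as an all-or-nothing static group whose Normal memory never raises the limit.

```python
def online_dimms(early_kernel_blocks, ratio_percent, dimm_blocks, n_dimms):
    """Return the zone chosen for each hotplugged DIMM, in hotplug order."""
    # Only early kernel memory counts: the limit never grows for DIMMs.
    limit_blocks = ratio_percent * early_kernel_blocks // 100
    movable_blocks = 0
    zones = []
    for _ in range(n_dimms):
        if movable_blocks + dimm_blocks <= limit_blocks:
            movable_blocks += dimm_blocks
            zones.append("Movable")
        else:
            zones.append("Normal")
    return zones
```

For a 4 GiB VM and a 301% ratio the limit is 301 * 32 // 100 = 96 blocks, so `online_dimms(32, 301, 32, 5)` places three DIMMs in Movable and every later DIMM in Normal.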
      
      V. Doc Update
      
      I'll update the memory-hotplug.rst documentation, once the overhaul [1] is
      upstream. Until then, details can be found in patch #2.
      
      VI. Future Work
      
       1) Use memory groups for ppc64 dlpar
       2) Being able to specify a portion of (early) kernel memory that will be
          excluded from the ratio. Like "128 MiB globally/per node" are excluded.
      
          This might be helpful when starting VMs with extremely small memory
          footprint (e.g., 128 MiB) and hotplugging memory later -- not wanting
          the first hotplugged units getting onlined to ZONE_MOVABLE. One
          alternative would be a trigger to not consider ZONE_DMA memory
          in the ratio. We'll have to see if this is really required.
       3) Indicate to user space that MOVABLE might be a bad idea -- especially
          relevant when memory ballooning without support for balloon compaction
          is active.
      
      This patch (of 9):
      
      For implementing a new memory onlining policy, which determines when to
      online memory blocks to ZONE_MOVABLE semi-automatically, we need the
      number of present early (boot) pages -- present pages excluding hotplugged
      pages.  Let's track these pages per zone.
      
      Pass a page instead of the zone to adjust_present_page_count(), similar to
      adjust_managed_page_count() and derive the zone from the page.
      
      It's worth noting that a memory block to be offlined/onlined is either
      completely "early" or "not early".  add_memory() and friends can only add
      complete memory blocks and we only online/offline complete (individual)
      memory blocks.
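The accounting described here could be modeled as below. This is a Python sketch with invented `Zone` and `Page` classes: in particular, the `early` flag on a page stands in for however the kernel tells boot memory from hotplugged memory.

```python
from dataclasses import dataclass


@dataclass
class Zone:
    name: str
    present_pages: int = 0
    present_early_pages: int = 0  # present pages excluding hotplugged ones


@dataclass
class Page:
    zone: Zone
    early: bool  # True if the page belongs to boot memory


def adjust_present_page_count(page, nr_pages):
    """Derive the zone from the page, then adjust both counters as needed."""
    zone = page.zone
    if page.early:
        zone.present_early_pages += nr_pages
    zone.present_pages += nr_pages
```

Since a memory block is either completely early or not, checking a single page of the block is enough in this model.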
      
      Link: https://lkml.kernel.org/r/20210806124715.17090-1-david@redhat.com
      Link: https://lkml.kernel.org/r/20210806124715.17090-2-david@redhat.com
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
      Cc: "Michael S. Tsirkin" <mst@redhat.com>
      Cc: Jason Wang <jasowang@redhat.com>
      Cc: Marek Kedzierski <mkedzier@redhat.com>
      Cc: Hui Zhu <teawater@gmail.com>
      Cc: Pankaj Gupta <pankaj.gupta.linux@gmail.com>
      Cc: Wei Yang <richard.weiyang@linux.alibaba.com>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Anshuman Khandual <anshuman.khandual@arm.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Mike Rapoport <rppt@kernel.org>
      Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
      Cc: Len Brown <lenb@kernel.org>
      Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  3. 04 Sep, 2021 1 commit
  4. 05 Jun, 2021 1 commit
    • D
      drivers/base/memory: fix trying offlining memory blocks with memory holes on aarch64 · 92813053
      David Hildenbrand committed
      offline_pages() properly checks for memory holes and bails out.
      However, we do a page_zone(pfn_to_page(start_pfn)) before calling
      offline_pages() when offlining a memory block.
      
      We should not unconditionally call page_zone(pfn_to_page(start_pfn)) on
      aarch64 in offlining code, otherwise we can trigger a BUG when hitting a
      memory hole:
      
         kernel BUG at include/linux/mm.h:1383!
         Internal error: Oops - BUG: 0 [#1] SMP
         Modules linked in: loop processor efivarfs ip_tables x_tables ext4 mbcache jbd2 dm_mod igb nvme i2c_algo_bit mlx5_core i2c_core nvme_core firmware_class
         CPU: 13 PID: 1694 Comm: ranbug Not tainted 5.12.0-next-20210524+ #4
         Hardware name: MiTAC RAPTOR EV-883832-X3-0001/RAPTOR, BIOS 1.6 06/28/2020
         pstate: 60000005 (nZCv daif -PAN -UAO -TCO BTYPE=--)
         pc : memory_subsys_offline+0x1f8/0x250
         lr : memory_subsys_offline+0x1f8/0x250
         Call trace:
           memory_subsys_offline+0x1f8/0x250
           device_offline+0x154/0x1d8
           online_store+0xa4/0x118
           dev_attr_store+0x44/0x78
           sysfs_kf_write+0xe8/0x138
           kernfs_fop_write_iter+0x26c/0x3d0
           new_sync_write+0x2bc/0x4f8
           vfs_write+0x718/0xc88
           ksys_write+0xf8/0x1e0
           __arm64_sys_write+0x74/0xa8
           invoke_syscall.constprop.0+0x78/0x1e8
           do_el0_svc+0xe4/0x298
           el0_svc+0x20/0x30
           el0_sync_handler+0xb0/0xb8
           el0_sync+0x178/0x180
         Kernel panic - not syncing: Oops - BUG: Fatal exception
         SMP: stopping secondary CPUs
         Kernel Offset: disabled
         CPU features: 0x00000251,20000846
         Memory Limit: none
      
      If nr_vmemmap_pages is set, we know that we are dealing with hotplugged
      memory that doesn't have any holes.  So call
      page_zone(pfn_to_page(start_pfn)) only when really necessary -- when
      nr_vmemmap_pages is set and we actually adjust the present pages.
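The guard described above can be modelled in userspace as a minimal sketch. All types and helpers below (zone_model, memory_block_model, model_page_zone, model_offline_pages) are hypothetical stand-ins, not the kernel's definitions; the point is only that the zone lookup happens solely when nr_vmemmap_pages is non-zero.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified stand-ins for the kernel structures. */
struct zone_model {
	long present_pages;
};

struct memory_block_model {
	unsigned long nr_pages;
	unsigned long nr_vmemmap_pages;	/* 0 for memory that may contain holes */
	bool has_hole;			/* models a memory hole in the block */
	struct zone_model *zone;
};

/* Counts simulated page_zone(pfn_to_page(start_pfn)) lookups. */
int model_zone_lookups;

static struct zone_model *model_page_zone(struct memory_block_model *mem)
{
	model_zone_lookups++;
	/* In the kernel, this lookup is what hit the BUG on a memory hole. */
	assert(!mem->has_hole);
	return mem->zone;
}

/* Stand-in for offline_pages(): it detects holes itself and bails out. */
static int model_offline_pages(struct memory_block_model *mem)
{
	return mem->has_hole ? -1 : 0;
}

int model_memory_block_offline(struct memory_block_model *mem)
{
	struct zone_model *zone = NULL;
	int ret;

	assert(mem != NULL && mem->nr_vmemmap_pages <= mem->nr_pages);

	/* Only look up the zone when vmemmap pages must be unaccounted. */
	if (mem->nr_vmemmap_pages) {
		zone = model_page_zone(mem);
		zone->present_pages -= (long)mem->nr_vmemmap_pages;
	}

	ret = model_offline_pages(mem);
	if (ret && zone)
		zone->present_pages += (long)mem->nr_vmemmap_pages; /* account back */
	return ret;
}
```

With a block that has a hole and no vmemmap pages, the model fails cleanly through model_offline_pages() without ever performing the zone lookup.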
      
      Link: https://lkml.kernel.org/r/20210526075226.5572-1-david@redhat.com
      Fixes: a08a2ae3 ("mm,memory_hotplug: allocate memmap from the added memory range")
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Reported-by: Qian Cai (QUIC) <quic_qiancai@quicinc.com>
      Reviewed-by: Oscar Salvador <osalvador@suse.de>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Anshuman Khandual <anshuman.khandual@arm.com>
      Cc: Mike Rapoport <rppt@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      92813053
  5. 04 Jun, 2021 1 commit
  6. 06 May, 2021 2 commits
    • O
      mm,memory_hotplug: allocate memmap from the added memory range · a08a2ae3
      Oscar Salvador authored
      Physical memory hotadd has to allocate a memmap (struct page array) for
      the newly added memory section.  Currently, alloc_pages_node() is used
      for those allocations.
      
      This has some disadvantages:
       a) existing memory is consumed for that purpose
          (e.g., ~2MB per 128MB memory section on x86_64).
          This can even lead to extreme cases where the system goes OOM because
          the physically hotplugged memory depletes the available memory before
          it is onlined.
       b) if the whole node is movable, then we have off-node struct pages,
          which has performance drawbacks.
       c) there might be no PMD_ALIGNED chunks, so the memmap array gets
          populated with base pages.
      
      This can be improved when CONFIG_SPARSEMEM_VMEMMAP is enabled.
      
      Vmemmap page tables can map arbitrary memory.  That means that we can
      reserve a part of the physically hotadded memory to back vmemmap page
      tables.  This implementation uses the beginning of the hotplugged memory
      for that purpose.
      
      There are some non-obvious things to consider though.
      
      Vmemmap pages are allocated/freed during the memory hotplug events
      (add_memory_resource(), try_remove_memory()) when the memory is
      added/removed.  This means that the reserved physical range is not
      online although it is used.  The most obvious side effect is that
      pfn_to_online_page() returns NULL for those pfns.  The current design
      expects that this should be OK as the hotplugged memory is considered
      garbage until it is onlined.  For example, hibernation wouldn't save the
      content of those vmemmaps into the image, so it wouldn't be restored on
      resume, but this should be OK as there is no real content to recover
      anyway, while metadata is reachable from other data structures (e.g.,
      vmemmap page tables).
      
      The reserved space is therefore (de)initialized during the {on,off}line
      events (mhp_{de}init_memmap_on_memory).  That is done by extracting page
      allocator independent initialization from the regular onlining path.
      The primary reason to handle the reserved space outside of
      {on,off}line_pages is to make each initialization specific to the
      purpose rather than special case them in a single function.
      
      As per above, the functions that are introduced are:
      
       - mhp_init_memmap_on_memory:
         Initializes vmemmap pages by calling move_pfn_range_to_zone(), calls
         kasan_add_zero_shadow(), and onlines as many sections as vmemmap pages
         fully span.
      
       - mhp_deinit_memmap_on_memory:
         Offlines as many sections as vmemmap pages fully span, removes the
         range from the zone by remove_pfn_range_from_zone(), and calls
         kasan_remove_zero_shadow() for the range.
      
      The new function memory_block_online() calls mhp_init_memmap_on_memory()
      before doing the actual online_pages().  Should online_pages() fail, we
      clean up by calling mhp_deinit_memmap_on_memory().  Adjusting of
      present_pages is done at the end once we know that online_pages()
      succeeded.
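The online path described above can be sketched as a small userspace model. The struct and the model_* helpers are hypothetical stand-ins for memory_block, mhp_init_memmap_on_memory(), mhp_deinit_memmap_on_memory() and online_pages(); only the ordering (init, online, roll back on failure, account last) is taken from the text.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of one memory block and the zone counter it feeds. */
struct online_model {
	unsigned long nr_pages;
	unsigned long nr_vmemmap_pages;
	long zone_present_pages;
	bool vmemmap_initialized;
	bool fail_online_pages;		/* knob to simulate online_pages() failing */
};

static int model_mhp_init_memmap_on_memory(struct online_model *m)
{
	m->vmemmap_initialized = true;
	return 0;
}

static void model_mhp_deinit_memmap_on_memory(struct online_model *m)
{
	m->vmemmap_initialized = false;
}

/* Stand-in for online_pages(): onlines everything except the vmemmap pages. */
static int model_online_pages(struct online_model *m)
{
	if (m->fail_online_pages)
		return -1;
	m->zone_present_pages += (long)(m->nr_pages - m->nr_vmemmap_pages);
	return 0;
}

int model_memory_block_online(struct online_model *m)
{
	int ret;

	assert(m->nr_vmemmap_pages <= m->nr_pages);

	/* Step 1: initialize the reserved vmemmap range, if there is one. */
	if (m->nr_vmemmap_pages) {
		ret = model_mhp_init_memmap_on_memory(m);
		if (ret)
			return ret;
	}

	/* Step 2: online the remaining pages; undo step 1 on failure. */
	ret = model_online_pages(m);
	if (ret) {
		if (m->nr_vmemmap_pages)
			model_mhp_deinit_memmap_on_memory(m);
		return ret;
	}

	/* Step 3: account the vmemmap pages only once onlining succeeded. */
	m->zone_present_pages += (long)m->nr_vmemmap_pages;
	return 0;
}
```

Accounting last keeps the failure path simple: if onlining fails, the counter has not been touched and only the vmemmap initialization needs to be undone.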
      
      On offline, memory_block_offline() needs to unaccount vmemmap pages from
      present_pages before calling offline_pages().  This is necessary because
      offline_pages() tears down some structures depending on whether the
      node or the zone becomes empty.  If offline_pages() fails, we account
      the vmemmap pages back.  If it succeeds, we call
      mhp_deinit_memmap_on_memory().
      
      Hot-remove:
      
       We need to be careful when removing memory, as adding and
       removing memory needs to be done with the same granularity.
       To check that this assumption is not violated, we check the
       memory range we want to remove and if a) any memory block has
       vmemmap pages and b) the range spans more than a single memory
       block, we scream out loud and refuse to proceed.
      
       If all is good and the range was using memmap on memory (aka vmemmap pages),
       we construct an altmap structure so free_hugepage_table does the right
       thing and calls vmem_altmap_free instead of free_pagetable.
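The granularity check for hot-remove can be illustrated with a short sketch. The block_model type and the function name are hypothetical; the rule itself (refuse when any block in the range uses vmemmap pages and the range spans more than one block) is the one stated above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-block state: only the field relevant to the check. */
struct block_model {
	unsigned long nr_vmemmap_pages;
};

/* Returns true when removing blocks[0..nr_blocks) may proceed. */
bool model_removal_granularity_ok(const struct block_model *blocks,
				  size_t nr_blocks)
{
	size_t i;

	assert(blocks != NULL && nr_blocks > 0);

	/* A single block is always removed with the granularity it was added. */
	if (nr_blocks == 1)
		return true;

	/* Multi-block range: refuse if any block carries its own memmap. */
	for (i = 0; i < nr_blocks; i++) {
		if (blocks[i].nr_vmemmap_pages)
			return false;
	}
	return true;
}
```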
      
      Link: https://lkml.kernel.org/r/20210421102701.25051-5-osalvador@suse.de
      Signed-off-by: Oscar Salvador <osalvador@suse.de>
      Reviewed-by: David Hildenbrand <david@redhat.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Anshuman Khandual <anshuman.khandual@arm.com>
      Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      a08a2ae3
    • O
      drivers/base/memory: introduce memory_block_{online,offline} · 8736cc2d
      Oscar Salvador authored
      Patch series "Allocate memmap from hotadded memory (per device)", v10.
      
      The primary goal of this patchset is to reduce memory overhead of the
      hot-added memory (at least for SPARSEMEM_VMEMMAP memory model).  The
      current way we use to populate memmap (struct page array) has three main
      drawbacks:
      
      a) it consumes additional memory until the hotadded memory itself is
         onlined,
      
      b) memmap might end up on a different NUMA node, which is especially
         true for the movable_node configuration, and
      
      c) due to fragmentation, we might end up populating memmap with base
         pages.
      
      One way to mitigate all these issues is to simply allocate memmap array
      (which is the largest memory footprint of the physical memory hotplug)
      from the hot-added memory itself.  SPARSEMEM_VMEMMAP memory model allows
      us to map any pfn range so the memory doesn't need to be online to be
      usable for the array.  See patch 4 for more details.  This feature is
      only usable when CONFIG_SPARSEMEM_VMEMMAP is set.
      
      [Overall design]:
      
      Implementation-wise, we reuse the vmem_altmap infrastructure to override
      the default allocator used by vmemmap_populate.  The memory_block structure
      gains a new field called nr_vmemmap_pages, which accounts for the number of
      vmemmap pages used by that memory_block.  E.g., on x86_64, that is 512
      vmemmap pages on small memory blocks and 4096 on large memory blocks (1GB).
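The 512 and 4096 figures follow from simple arithmetic, sketched below. Two values are assumptions not stated in the text: 4 KiB base pages and a 64-byte struct page (the usual x86_64 case), and a "small" memory block is taken to be 128MB. The function name is hypothetical.

```c
#include <assert.h>

#define MODEL_PAGE_SIZE		4096UL	/* assumption: 4 KiB base pages */
#define MODEL_STRUCT_PAGE_SIZE	64UL	/* assumption: sizeof(struct page) */

/* Number of base pages needed to hold the memmap of one memory block. */
unsigned long model_nr_vmemmap_pages(unsigned long block_bytes)
{
	unsigned long nr_pages, memmap_bytes;

	assert(block_bytes % MODEL_PAGE_SIZE == 0);

	nr_pages = block_bytes / MODEL_PAGE_SIZE;	/* pages described */
	memmap_bytes = nr_pages * MODEL_STRUCT_PAGE_SIZE; /* struct page array */

	/* Round up to whole pages. */
	return (memmap_bytes + MODEL_PAGE_SIZE - 1) / MODEL_PAGE_SIZE;
}
```

For a 128MB block this gives 32768 struct pages, i.e. 2MB of memmap or 512 pages; for a 1GB block it gives 16MB of memmap or 4096 pages.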
      
      We also introduce two new functions: memory_block_{online,offline}.  These
      functions take care of initializing/uninitializing vmemmap pages prior to
      calling {online,offline}_pages, so the latter functions can remain totally
      untouched.
      
      More details can be found in the respective changelogs.
      
      This patch (of 8):
      
      This is a preparatory patch that introduces two new functions:
      memory_block_online() and memory_block_offline().
      
      For now, these functions will only call online_pages() and offline_pages()
      respectively, but they will be later in charge of preparing the vmemmap
      pages, carrying out the initialization and proper accounting of such
      pages.
      
      Since memory_block struct contains all the information, pass this struct
      down the chain till the end functions.
      
      Link: https://lkml.kernel.org/r/20210421102701.25051-1-osalvador@suse.de
      Link: https://lkml.kernel.org/r/20210421102701.25051-2-osalvador@suse.de
      Signed-off-by: Oscar Salvador <osalvador@suse.de>
      Reviewed-by: David Hildenbrand <david@redhat.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Anshuman Khandual <anshuman.khandual@arm.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      8736cc2d
  7. 27 Feb, 2021 2 commits
  8. 17 Oct, 2020 1 commit
    • D
      mm/memory_hotplug: prepare passing flags to add_memory() and friends · b6117199
      David Hildenbrand authored
      We soon want to pass flags, e.g., to mark added System RAM resources
      mergeable.  Prepare for that.
      
      This patch is based on a similar patch by Oscar Salvador:
      
      https://lkml.kernel.org/r/20190625075227.15193-3-osalvador@suse.de
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Juergen Gross <jgross@suse.com> # Xen related part
      Reviewed-by: Pankaj Gupta <pankaj.gupta.linux@gmail.com>
      Acked-by: Wei Liu <wei.liu@kernel.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Jason Gunthorpe <jgg@ziepe.ca>
      Cc: Baoquan He <bhe@redhat.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
      Cc: Len Brown <lenb@kernel.org>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Vishal Verma <vishal.l.verma@intel.com>
      Cc: Dave Jiang <dave.jiang@intel.com>
      Cc: "K. Y. Srinivasan" <kys@microsoft.com>
      Cc: Haiyang Zhang <haiyangz@microsoft.com>
      Cc: Stephen Hemminger <sthemmin@microsoft.com>
      Cc: Wei Liu <wei.liu@kernel.org>
      Cc: Heiko Carstens <hca@linux.ibm.com>
      Cc: Vasily Gorbik <gor@linux.ibm.com>
      Cc: Christian Borntraeger <borntraeger@de.ibm.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: "Michael S. Tsirkin" <mst@redhat.com>
      Cc: Jason Wang <jasowang@redhat.com>
      Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
      Cc: Stefano Stabellini <sstabellini@kernel.org>
      Cc: "Oliver O'Halloran" <oohall@gmail.com>
      Cc: Pingfan Liu <kernelfans@gmail.com>
      Cc: Nathan Lynch <nathanl@linux.ibm.com>
      Cc: Libor Pechacek <lpechacek@suse.cz>
      Cc: Anton Blanchard <anton@ozlabs.org>
      Cc: Leonardo Bras <leobras.c@gmail.com>
      Cc: Ard Biesheuvel <ardb@kernel.org>
      Cc: Eric Biederman <ebiederm@xmission.com>
      Cc: Julien Grall <julien@xen.org>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Roger Pau Monné <roger.pau@citrix.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Wei Yang <richardw.yang@linux.intel.com>
      Link: https://lkml.kernel.org/r/20200911103459.10306-5-david@redhat.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      b6117199
  9. 02 Oct, 2020 3 commits
    • J
      drivers core: Miscellaneous changes for sysfs_emit · 948b3edb
      Joe Perches authored
      Change additional instances that could use sysfs_emit and sysfs_emit_at
      that the coccinelle script could not convert.
      
      o macros creating show functions with ## concatenation
      o unbound sprintf uses with buf+len for start of output to sysfs_emit_at
      o returns with ?: tests and sprintf to sysfs_emit
      o sysfs output with struct class * not struct device * arguments
      
      Miscellanea:
      
      o remove unnecessary initializations around these changes
      o consistently use int len for return length of show functions
      o use octal permissions and not S_<FOO>
      o rename a few show function names so DEVICE_ATTR_<FOO> can be used
      o use DEVICE_ATTR_ADMIN_RO where appropriate
      o consistently use const char *output for strings
      o checkpatch/style neatening
      Signed-off-by: Joe Perches <joe@perches.com>
      Link: https://lore.kernel.org/r/8bc24444fe2049a9b2de6127389b57edfdfe324d.1600285923.git.joe@perches.com
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      948b3edb
    • J
      drivers core: Remove strcat uses around sysfs_emit and neaten · 973c3911
      Joe Perches authored
      strcat is no longer necessary for sysfs_emit and sysfs_emit_at uses.
      
      Convert the strcat uses to sysfs_emit calls and neaten other block
      uses of direct returns to use an intermediate const char *.
      Signed-off-by: Joe Perches <joe@perches.com>
      Link: https://lore.kernel.org/r/5d606519698ce4c8f1203a2b35797d8254c6050a.1600285923.git.joe@perches.com
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      973c3911
    • J
      drivers core: Use sysfs_emit and sysfs_emit_at for show(device *...) functions · aa838896
      Joe Perches authored
      Convert the various sprintf family calls in sysfs device show functions
      to sysfs_emit and sysfs_emit_at for PAGE_SIZE buffer safety.
      
      Done with:
      
      $ spatch -sp-file sysfs_emit_dev.cocci --in-place --max-width=80 .
      
      And cocci script:
      
      $ cat sysfs_emit_dev.cocci
      @@
      identifier d_show;
      identifier dev, attr, buf;
      @@
      
      ssize_t d_show(struct device *dev, struct device_attribute *attr, char *buf)
      {
      	<...
      	return
      -	sprintf(buf,
      +	sysfs_emit(buf,
      	...);
      	...>
      }
      
      @@
      identifier d_show;
      identifier dev, attr, buf;
      @@
      
      ssize_t d_show(struct device *dev, struct device_attribute *attr, char *buf)
      {
      	<...
      	return
      -	snprintf(buf, PAGE_SIZE,
      +	sysfs_emit(buf,
      	...);
      	...>
      }
      
      @@
      identifier d_show;
      identifier dev, attr, buf;
      @@
      
      ssize_t d_show(struct device *dev, struct device_attribute *attr, char *buf)
      {
      	<...
      	return
      -	scnprintf(buf, PAGE_SIZE,
      +	sysfs_emit(buf,
      	...);
      	...>
      }
      
      @@
      identifier d_show;
      identifier dev, attr, buf;
      expression chr;
      @@
      
      ssize_t d_show(struct device *dev, struct device_attribute *attr, char *buf)
      {
      	<...
      	return
      -	strcpy(buf, chr);
      +	sysfs_emit(buf, chr);
      	...>
      }
      
      @@
      identifier d_show;
      identifier dev, attr, buf;
      identifier len;
      @@
      
      ssize_t d_show(struct device *dev, struct device_attribute *attr, char *buf)
      {
      	<...
      	len =
      -	sprintf(buf,
      +	sysfs_emit(buf,
      	...);
      	...>
      	return len;
      }
      
      @@
      identifier d_show;
      identifier dev, attr, buf;
      identifier len;
      @@
      
      ssize_t d_show(struct device *dev, struct device_attribute *attr, char *buf)
      {
      	<...
      	len =
      -	snprintf(buf, PAGE_SIZE,
      +	sysfs_emit(buf,
      	...);
      	...>
      	return len;
      }
      
      @@
      identifier d_show;
      identifier dev, attr, buf;
      identifier len;
      @@
      
      ssize_t d_show(struct device *dev, struct device_attribute *attr, char *buf)
      {
      	<...
      	len =
      -	scnprintf(buf, PAGE_SIZE,
      +	sysfs_emit(buf,
      	...);
      	...>
      	return len;
      }
      
      @@
      identifier d_show;
      identifier dev, attr, buf;
      identifier len;
      @@
      
      ssize_t d_show(struct device *dev, struct device_attribute *attr, char *buf)
      {
      	<...
      -	len += scnprintf(buf + len, PAGE_SIZE - len,
      +	len += sysfs_emit_at(buf, len,
      	...);
      	...>
      	return len;
      }
      
      @@
      identifier d_show;
      identifier dev, attr, buf;
      expression chr;
      @@
      
      ssize_t d_show(struct device *dev, struct device_attribute *attr, char *buf)
      {
      	...
      -	strcpy(buf, chr);
      -	return strlen(buf);
      +	return sysfs_emit(buf, chr);
      }
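To illustrate why the conversions above give PAGE_SIZE buffer safety, here is a userspace sketch of a bounded, offset-aware emit helper. model_sysfs_emit_at and MODEL_PAGE_SIZE are hypothetical names; this is not the kernel implementation, which may perform additional sanity checks on the buffer. The sketch only shows the bounding idea: never write past the end of a page-sized buffer, and return the number of characters actually stored.

```c
#include <assert.h>
#include <stdarg.h>
#include <stdio.h>

#define MODEL_PAGE_SIZE 4096	/* assumption: sysfs buffers are one 4 KiB page */

/* Append formatted output at offset 'at'; returns characters written. */
int model_sysfs_emit_at(char *buf, int at, const char *fmt, ...)
{
	va_list args;
	int room, len;

	assert(buf != NULL && fmt != NULL);

	/* An offset outside the page leaves no room to write anything. */
	if (at < 0 || at >= MODEL_PAGE_SIZE)
		return 0;

	room = MODEL_PAGE_SIZE - at;
	va_start(args, fmt);
	len = vsnprintf(buf + at, (size_t)room, fmt, args);
	va_end(args);

	if (len < 0)
		return 0;
	/* vsnprintf reports the would-be length; clamp to what was stored. */
	if (len >= room)
		len = room - 1;
	return len;
}
```

A show function in this model accumulates output with len += model_sysfs_emit_at(buf, len, ...) and returns len, mirroring the len += sysfs_emit_at(buf, len, ...) pattern produced by the script.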
      Signed-off-by: Joe Perches <joe@perches.com>
      Link: https://lore.kernel.org/r/3d033c33056d88bbe34d4ddb62afd05ee166ab9a.1600285923.git.joe@perches.com
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      aa838896
  10. 10 Jul, 2020 2 commits
  11. 04 Jun, 2020 1 commit
    • S
      drivers/base/memory.c: cache memory blocks in xarray to accelerate lookup · 4fb6eabf
      Scott Cheloha authored
      Searching for a particular memory block by id is an O(n) operation because
      each memory block's underlying device is kept in an unsorted linked list
      on the subsystem bus.
      
      We can cut the lookup cost to O(log n) if we cache each memory block
      in an xarray.  This time complexity improvement is significant on
      systems with many memory blocks.  For example:
      
      1. A 128GB POWER9 VM with 256MB memblocks has 512 blocks.  With this
         change memory_dev_init() completes ~12ms faster and walk_memory_blocks()
         completes ~12ms faster.
      
      Before:
      [    0.005042] memory_dev_init: adding memory blocks
      [    0.021591] memory_dev_init: added memory blocks
      [    0.022699] walk_memory_blocks: walking memory blocks
      [    0.038730] walk_memory_blocks: walked memory blocks 0-511
      
      After:
      [    0.005057] memory_dev_init: adding memory blocks
      [    0.009415] memory_dev_init: added memory blocks
      [    0.010519] walk_memory_blocks: walking memory blocks
      [    0.014135] walk_memory_blocks: walked memory blocks 0-511
      
      2. A 256GB POWER9 LPAR with 256MB memblocks has 1024 blocks.  With
         this change memory_dev_init() completes ~88ms faster and
         walk_memory_blocks() completes ~87ms faster.
      
      Before:
      [    0.252246] memory_dev_init: adding memory blocks
      [    0.395469] memory_dev_init: added memory blocks
      [    0.409413] walk_memory_blocks: walking memory blocks
      [    0.433028] walk_memory_blocks: walked memory blocks 0-511
      [    0.433094] walk_memory_blocks: walking memory blocks
      [    0.500244] walk_memory_blocks: walked memory blocks 131072-131583
      
      After:
      [    0.245063] memory_dev_init: adding memory blocks
      [    0.299539] memory_dev_init: added memory blocks
      [    0.313609] walk_memory_blocks: walking memory blocks
      [    0.315287] walk_memory_blocks: walked memory blocks 0-511
      [    0.315349] walk_memory_blocks: walking memory blocks
      [    0.316988] walk_memory_blocks: walked memory blocks 131072-131583
      
      3. A 32TB POWER9 LPAR with 256MB memblocks has 131072 blocks.  With
         this change we complete memory_dev_init() ~37 minutes faster and
         walk_memory_blocks() at least ~30 minutes faster.  The exact timing
         for walk_memory_blocks() is missing, though I observed that the
         soft lockups in walk_memory_blocks() disappeared with the change,
         suggesting that lower bound.
      
      Before:
      [   13.703907] memory_dev_init: adding blocks
      [ 2287.406099] memory_dev_init: added all blocks
      [ 2347.494986] [c000000014c5bb60] [c000000000869af4] walk_memory_blocks+0x94/0x160
      [ 2527.625378] [c000000014c5bb60] [c000000000869af4] walk_memory_blocks+0x94/0x160
      [ 2707.761977] [c000000014c5bb60] [c000000000869af4] walk_memory_blocks+0x94/0x160
      [ 2887.899975] [c000000014c5bb60] [c000000000869af4] walk_memory_blocks+0x94/0x160
      [ 3068.028318] [c000000014c5bb60] [c000000000869af4] walk_memory_blocks+0x94/0x160
      [ 3248.158764] [c000000014c5bb60] [c000000000869af4] walk_memory_blocks+0x94/0x160
      [ 3428.287296] [c000000014c5bb60] [c000000000869af4] walk_memory_blocks+0x94/0x160
      [ 3608.425357] [c000000014c5bb60] [c000000000869af4] walk_memory_blocks+0x94/0x160
      [ 3788.554572] [c000000014c5bb60] [c000000000869af4] walk_memory_blocks+0x94/0x160
      [ 3968.695071] [c000000014c5bb60] [c000000000869af4] walk_memory_blocks+0x94/0x160
      [ 4148.823970] [c000000014c5bb60] [c000000000869af4] walk_memory_blocks+0x94/0x160
      
      After:
      [   13.696898] memory_dev_init: adding blocks
      [   15.660035] memory_dev_init: added all blocks
      (the walk_memory_blocks traces disappear)
      
      There should be no significant negative impact for machines with few
      memory blocks.  A sparse xarray has a small footprint and an O(log n)
      lookup is negligibly slower than an O(n) lookup for only the smallest
      number of memory blocks.
      
      1. A 16GB x86 machine with 128MB memblocks has 132 blocks.  With this
         change memory_dev_init() completes ~300us faster and walk_memory_blocks()
         completes no faster or slower.  The improvement is pretty close to noise.
      
      Before:
      [    0.224752] memory_dev_init: adding memory blocks
      [    0.227116] memory_dev_init: added memory blocks
      [    0.227183] walk_memory_blocks: walking memory blocks
      [    0.227183] walk_memory_blocks: walked memory blocks 0-131
      
      After:
      [    0.224911] memory_dev_init: adding memory blocks
      [    0.226935] memory_dev_init: added memory blocks
      [    0.227089] walk_memory_blocks: walking memory blocks
      [    0.227089] walk_memory_blocks: walked memory blocks 0-131
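The lookup idea behind the change can be sketched with a fixed-fanout radix tree keyed by block id. This is a deliberately simplified stand-in, not the kernel's xarray: the fanout, depth and function names are all assumptions made for illustration. A lookup walks a bounded number of levels instead of scanning a list of every block.

```c
#include <assert.h>
#include <stdlib.h>

#define FANOUT_BITS	6
#define FANOUT		(1u << FANOUT_BITS)
#define LEVELS		3	/* covers ids below 64^3 = 262144 */
#define MAX_ID		(1ul << (FANOUT_BITS * LEVELS))

struct radix_node {
	void *slots[FANOUT];	/* child nodes, or values at the last level */
};

/* Extract the slot index used for 'id' at the given tree level. */
static unsigned int slot_index(unsigned long id, int level)
{
	int shift = FANOUT_BITS * (LEVELS - 1 - level);

	return (unsigned int)((id >> shift) & (FANOUT - 1));
}

/* Store 'value' under 'id'; returns 0, or -1 if allocation fails. */
int radix_store(struct radix_node *root, unsigned long id, void *value)
{
	struct radix_node *node = root;
	int level;

	assert(root != NULL && id < MAX_ID);

	for (level = 0; level < LEVELS - 1; level++) {
		unsigned int idx = slot_index(id, level);

		if (!node->slots[idx]) {
			node->slots[idx] = calloc(1, sizeof(struct radix_node));
			if (!node->slots[idx])
				return -1;
		}
		node = node->slots[idx];
	}
	node->slots[slot_index(id, LEVELS - 1)] = value;
	return 0;
}

/* Look up 'id' in at most LEVELS steps; NULL if nothing is stored. */
void *radix_load(const struct radix_node *root, unsigned long id)
{
	const struct radix_node *node = root;
	int level;

	if (id >= MAX_ID)
		return NULL;
	for (level = 0; level < LEVELS - 1; level++) {
		node = node->slots[slot_index(id, level)];
		if (!node)
			return NULL;
	}
	return node->slots[slot_index(id, LEVELS - 1)];
}
```

Sparse id ranges such as 0-511 and 131072-131583 from the logs above only allocate the interior nodes they touch, which is why a tree of this kind stays small when few blocks exist. The sketch never frees its nodes; a real implementation would.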
      
      [david@redhat.com: document the locking]
        Link: http://lkml.kernel.org/r/bc21eec6-7251-4c91-2f57-9a0671f8d414@redhat.com
      Signed-off-by: Scott Cheloha <cheloha@linux.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Acked-by: David Hildenbrand <david@redhat.com>
      Acked-by: Nathan Lynch <nathanl@linux.ibm.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Rafael J. Wysocki <rafael@kernel.org>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Rick Lindsley <ricklind@linux.vnet.ibm.com>
      Cc: Scott Cheloha <cheloha@linux.ibm.com>
      Link: http://lkml.kernel.org/r/20200121231028.13699-1-cheloha@linux.ibm.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      4fb6eabf
  12. 08 Apr, 2020 7 commits
    • D
      mm/memory_hotplug: allow to specify a default online_type · 5f47adf7
      David Hildenbrand authored
      For now, distributions implement advanced udev rules to essentially
      - Don't online any hotplugged memory (s390x)
      - Online all memory to ZONE_NORMAL (e.g., most virt environments like
        hyperv)
      - Online all memory to ZONE_MOVABLE in case the zone imbalance is taken
        care of (e.g., bare metal, special virt environments)
      
      In summary: All memory is usually onlined the same way, however, the
      kernel always has to ask user space to come up with the same answer.
      E.g., Hyper-V always waits for a memory block to get onlined before
      continuing, otherwise it might end up adding memory faster than
      onlining it, which can result in strange OOM situations.  This waiting
      slows down adding of a bigger amount of memory.
      
      Let's allow to specify a default online_type, not just "online" and
      "offline".  This allows distributions to configure the default online_type
      when booting up and be done with it.
      
      We can now specify "offline", "online", "online_movable" and
      "online_kernel" via
      - "memhp_default_state=" on the kernel cmdline
      - /sys/devices/system/memory/auto_online_blocks
      just like we are able to specify for a single memory block via
      /sys/devices/system/memory/memoryX/state
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Wei Yang <richard.weiyang@gmail.com>
      Reviewed-by: Baoquan He <bhe@redhat.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Pankaj Gupta <pankaj.gupta.linux@gmail.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: "Rafael J. Wysocki" <rafael@kernel.org>
      Cc: Wei Yang <richard.weiyang@gmail.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Eduardo Habkost <ehabkost@redhat.com>
      Cc: Haiyang Zhang <haiyangz@microsoft.com>
      Cc: Igor Mammedov <imammedo@redhat.com>
      Cc: "K. Y. Srinivasan" <kys@microsoft.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Stephen Hemminger <sthemmin@microsoft.com>
      Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
      Cc: Wei Liu <wei.liu@kernel.org>
      Cc: Yumei Huang <yuhuang@redhat.com>
      Link: http://lkml.kernel.org/r/20200317104942.11178-9-david@redhat.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      5f47adf7
    • D
      mm/memory_hotplug: convert memhp_auto_online to store an online_type · 862919e5
      David Hildenbrand authored
      ...  and rename it to memhp_default_online_type.  This is a preparation
      for more detailed default online behavior.
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Wei Yang <richard.weiyang@gmail.com>
      Reviewed-by: Baoquan He <bhe@redhat.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Pankaj Gupta <pankaj.gupta.linux@gmail.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: "Rafael J. Wysocki" <rafael@kernel.org>
      Cc: Wei Yang <richard.weiyang@gmail.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Eduardo Habkost <ehabkost@redhat.com>
      Cc: Haiyang Zhang <haiyangz@microsoft.com>
      Cc: Igor Mammedov <imammedo@redhat.com>
      Cc: "K. Y. Srinivasan" <kys@microsoft.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Stephen Hemminger <sthemmin@microsoft.com>
      Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
      Cc: Wei Liu <wei.liu@kernel.org>
      Cc: Yumei Huang <yuhuang@redhat.com>
      Link: http://lkml.kernel.org/r/20200317104942.11178-8-david@redhat.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      862919e5
    • D
      drivers/base/memory: store mapping between MMOP_* and string in an array · 4dc8207b
      David Hildenbrand authored
      Let's use a simple array which we can reuse soon.  While at it, move the
      string->mmop conversion out of the device hotplug lock.
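A minimal userspace sketch of such a mapping is shown below. The MODEL_MMOP_* names, the array and both helpers are hypothetical stand-ins rather than the kernel's definitions. The sketch assumes the offline value is 0 so that the enum can index the array directly, and it tolerates one trailing newline because sysfs writes usually carry one.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical mirror of the online types; offline is deliberately 0. */
enum {
	MODEL_MMOP_OFFLINE = 0,
	MODEL_MMOP_ONLINE,
	MODEL_MMOP_ONLINE_KERNEL,
	MODEL_MMOP_ONLINE_MOVABLE,
};

/* One table serves both directions: enum -> string and string -> enum. */
const char *const model_online_type_to_str[] = {
	[MODEL_MMOP_OFFLINE]		= "offline",
	[MODEL_MMOP_ONLINE]		= "online",
	[MODEL_MMOP_ONLINE_KERNEL]	= "online_kernel",
	[MODEL_MMOP_ONLINE_MOVABLE]	= "online_movable",
};

#define MODEL_NR_TYPES \
	(sizeof(model_online_type_to_str) / sizeof(model_online_type_to_str[0]))

/* Compare 'input' with 'name', ignoring one trailing newline in 'input'. */
static int model_streq(const char *input, const char *name)
{
	size_t n = strlen(name);

	if (strncmp(input, name, n) != 0)
		return 0;
	return input[n] == '\0' || (input[n] == '\n' && input[n + 1] == '\0');
}

/* Returns the matching online type, or -1 for an unknown string. */
int model_online_type_from_str(const char *str)
{
	size_t i;

	assert(str != NULL);

	for (i = 0; i < MODEL_NR_TYPES; i++) {
		if (model_streq(str, model_online_type_to_str[i]))
			return (int)i;
	}
	return -1;
}
```

Because parsing needs nothing but the string and the constant table, it can run before any lock is taken, which is the kind of reordering the message above describes.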
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Wei Yang <richard.weiyang@gmail.com>
      Reviewed-by: Baoquan He <bhe@redhat.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Pankaj Gupta <pankaj.gupta.linux@gmail.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: "Rafael J. Wysocki" <rafael@kernel.org>
      Cc: Wei Yang <richard.weiyang@gmail.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Eduardo Habkost <ehabkost@redhat.com>
      Cc: Haiyang Zhang <haiyangz@microsoft.com>
      Cc: Igor Mammedov <imammedo@redhat.com>
      Cc: "K. Y. Srinivasan" <kys@microsoft.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Stephen Hemminger <sthemmin@microsoft.com>
      Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
      Cc: Wei Liu <wei.liu@kernel.org>
      Cc: Yumei Huang <yuhuang@redhat.com>
      Link: http://lkml.kernel.org/r/20200317104942.11178-4-david@redhat.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      4dc8207b
    • D
      drivers/base/memory: map MMOP_OFFLINE to 0 · efc978ad
      David Hildenbrand authored
      Historically, we used the value -1.  Just treat 0 as the special case now.
      Clarify a comment (which was wrong, when we come via device_online() the
      first time, the online_type would have been 0 / MEM_ONLINE).  The default
      is now always MMOP_OFFLINE.  This removes the last user of the manual
      "-1", which didn't use the enum value.
      
      This is a preparation to use the online_type as an array index.
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Wei Yang <richard.weiyang@gmail.com>
      Reviewed-by: Baoquan He <bhe@redhat.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Pankaj Gupta <pankaj.gupta.linux@gmail.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: "Rafael J. Wysocki" <rafael@kernel.org>
      Cc: Wei Yang <richard.weiyang@gmail.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Eduardo Habkost <ehabkost@redhat.com>
      Cc: Haiyang Zhang <haiyangz@microsoft.com>
      Cc: Igor Mammedov <imammedo@redhat.com>
      Cc: "K. Y. Srinivasan" <kys@microsoft.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Stephen Hemminger <sthemmin@microsoft.com>
      Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
      Cc: Wei Liu <wei.liu@kernel.org>
      Cc: Yumei Huang <yuhuang@redhat.com>
      Link: http://lkml.kernel.org/r/20200317104942.11178-3-david@redhat.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      efc978ad
    • D
      drivers/base/memory: rename MMOP_ONLINE_KEEP to MMOP_ONLINE · 956f8b44
      David Hildenbrand authored
      Patch series "mm/memory_hotplug: allow to specify a default online_type", v3.
      
      Distributions nowadays use udev rules ([1] [2]) to specify if and how to
      online hotplugged memory.  The rules seem to get more complex with many
      special cases.  Due to the various special cases,
      CONFIG_MEMORY_HOTPLUG_DEFAULT_ONLINE cannot be used.  All memory hotplug
      is handled via udev rules.
      
      Every time we hotplug memory, the udev rule will come to the same
      conclusion.  Especially Hyper-V (but also soon virtio-mem) add a lot of
      memory in separate memory blocks and wait for memory to get onlined by
      user space before continuing to add more memory blocks (to not add memory
      faster than it is getting onlined).  This of course slows down the whole
      memory hotplug process.
      
      To make the job of distributions easier and to avoid udev rules that get
      more and more complicated, let's extend the mechanism provided by
      - /sys/devices/system/memory/auto_online_blocks
      - "memhp_default_state=" on the kernel cmdline
      to be able to specify also "online_movable" as well as "online_kernel"
      
      === Example /usr/libexec/config-memhotplug ===
      
      #!/bin/bash
      
      VIRT=`systemd-detect-virt --vm`
      ARCH=`uname -p`
      
      sense_virtio_mem() {
        if [ -d "/sys/bus/virtio/drivers/virtio_mem/" ]; then
          DEVICES=`find /sys/bus/virtio/drivers/virtio_mem/ -maxdepth 1 -type l | wc -l`
          if [ $DEVICES != "0" ]; then
              return 0
          fi
        fi
        return 1
      }
      
      if [ ! -e "/sys/devices/system/memory/auto_online_blocks" ]; then
        echo "Memory hotplug configuration support missing in the kernel"
        exit 1
      fi
      
      if grep "memhp_default_state=" /proc/cmdline > /dev/null; then
        echo "Memory hotplug configuration overridden in kernel cmdline (memhp_default_state=)"
        exit 1
      fi
      
      if [ $VIRT == "microsoft" ]; then
        echo "Detected Hyper-V on $ARCH"
        # Hyper-V wants all memory in ZONE_NORMAL
        ONLINE_TYPE="online_kernel"
      elif sense_virtio_mem; then
        echo "Detected virtio-mem on $ARCH"
        # virtio-mem wants all memory in ZONE_NORMAL
        ONLINE_TYPE="online_kernel"
      elif [ $ARCH == "s390x" ] || [ $ARCH == "s390" ]; then
        echo "Detected $ARCH"
        # standby memory should not be onlined automatically
        ONLINE_TYPE="offline"
      elif [ $ARCH == "ppc64" ] || [ $ARCH == "ppc64le" ]; then
        echo "Detected" $ARCH
        # PPC64 onlines all hotplugged memory right from the kernel
        ONLINE_TYPE="offline"
      elif [ $VIRT == "none" ]; then
        echo "Detected bare-metal on $ARCH"
        # Bare metal users expect hotplugged memory to be unpluggable. We assume
        # that ZONE imbalances on such enterprise servers cannot happen and that
        # this is properly documented
        ONLINE_TYPE="online_movable"
      else
        # TODO: Hypervisors that want to unplug DIMMs and can guarantee that ZONE
        # imbalances won't happen
        echo "Detected $VIRT on $ARCH"
        # Usually, ballooning is used in virtual environments, so memory should go to
        # ZONE_NORMAL. However, sometimes "movable_node" is relevant.
        ONLINE_TYPE="online"
      fi
      
      echo "Selected online_type:" $ONLINE_TYPE
      
      # Configure what to do with memory that will be hotplugged in the future
      echo $ONLINE_TYPE 2>/dev/null > /sys/devices/system/memory/auto_online_blocks
      if [ $? != "0" ]; then
        echo "Memory hotplug cannot be configured (e.g., old kernel or missing permissions)"
        # A backup udev rule should handle old kernels if necessary
        exit 1
      fi
      
      # Process all already plugged blocks (e.g., DIMMs, but also Hyper-V or virtio-mem)
      if [ $ONLINE_TYPE != "offline" ]; then
        for MEMORY in /sys/devices/system/memory/memory*; do
          STATE=`cat $MEMORY/state`
          if [ $STATE == "offline" ]; then
              echo $ONLINE_TYPE > $MEMORY/state
          fi
        done
      fi
      
      === Example /usr/lib/systemd/system/config-memhotplug.service ===
      
      [Unit]
      Description=Configure memory hotplug behavior
      DefaultDependencies=no
      Conflicts=shutdown.target
      Before=sysinit.target shutdown.target
      After=systemd-modules-load.service
      ConditionPathExists=|/sys/devices/system/memory/auto_online_blocks
      
      [Service]
      ExecStart=/usr/libexec/config-memhotplug
      Type=oneshot
      TimeoutSec=0
      RemainAfterExit=yes
      
      [Install]
      WantedBy=sysinit.target
      
      === Example modification to the 40-redhat.rules [2] ===
      
      : diff --git a/40-redhat.rules b/40-redhat.rules-new
      : index 2c690e5..168fd03 100644
      : --- a/40-redhat.rules
      : +++ b/40-redhat.rules-new
      : @@ -6,6 +6,9 @@ SUBSYSTEM=="cpu", ACTION=="add", TEST=="online", ATTR{online}=="0", ATTR{online}
      :  # Memory hotadd request
      :  SUBSYSTEM!="memory", GOTO="memory_hotplug_end"
      :  ACTION!="add", GOTO="memory_hotplug_end"
      : +# memory hotplug behavior configured
      : +PROGRAM=="grep online /sys/devices/system/memory/auto_online_blocks", GOTO="memory_hotplug_end"
      : +
      :  PROGRAM="/bin/uname -p", RESULT=="s390*", GOTO="memory_hotplug_end"
      :
      :  ENV{.state}="online"
      
      ===
      
      [1] https://github.com/lnykryn/systemd-rhel/pull/281
      [2] https://github.com/lnykryn/systemd-rhel/blob/staging/rules/40-redhat.rules
      
      This patch (of 8):
      
      The name is misleading and it's not really clear what is "kept".  Let's
      just name it like the online_type name we expose to user space ("online").
      
      Add some documentation to the types.
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Wei Yang <richard.weiyang@gmail.com>
      Reviewed-by: Baoquan He <bhe@redhat.com>
      Acked-by: Pankaj Gupta <pankaj.gupta.linux@gmail.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: "Rafael J. Wysocki" <rafael@kernel.org>
      Cc: Wei Yang <richard.weiyang@gmail.com>
      Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
      Cc: Yumei Huang <yuhuang@redhat.com>
      Cc: Igor Mammedov <imammedo@redhat.com>
      Cc: Eduardo Habkost <ehabkost@redhat.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Haiyang Zhang <haiyangz@microsoft.com>
      Cc: K. Y. Srinivasan <kys@microsoft.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au> (powerpc)
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Stephen Hemminger <sthemmin@microsoft.com>
      Cc: Wei Liu <wei.liu@kernel.org>
      Link: http://lkml.kernel.org/r/20200319131221.14044-1-david@redhat.com
      Link: http://lkml.kernel.org/r/20200317104942.11178-2-david@redhat.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      956f8b44
    • D
      drivers/base/memory.c: drop pages_correctly_probed() · fada9ae3
      David Hildenbrand committed
      pages_correctly_probed() is a leftover from ancient times.  It dates back
      to commit 3947be19 ("[PATCH] memory hotplug: sysfs and add/remove
      functions"), where Pg_reserved checks were added as a safety net:
      
      	/*
      	 * The probe routines leave the pages reserved, just
      	 * as the bootmem code does.  Make sure they're still
      	 * that way.
      	 */
      
      The checks were refactored quite a bit over the years, especially in
      commit b77eab70 ("mm/memory_hotplug: optimize probe routine"), where
      checks for present, valid, and online sections were added.
      
      Hotplugged memory is added via add_memory(), which will create the full
      memmap for the hotplugged memory, and mark all sections valid and present.
      
      Only full memory blocks are onlined/offlined, so we also cannot have an
      inconsistency in that regard (especially, memory blocks with some sections
      being online and some being offline).
      
      1. Boot memory always starts online.  Since commit c5e79ef5
         ("mm/memory_hotplug.c: don't allow to online/offline memory blocks with
         holes") we disallow offlining any memory with holes.  Therefore, we
         never online memory with holes.  Present and validity checks are
         superfluous.
      
      2. Only complete memory blocks are onlined/offlined (and especially,
         the state - online or offline - is stored for whole memory blocks).
         Besides the core, only arch/powerpc/platforms/powernv/memtrace.c
         manually calls offline_pages() and fiddles with memory block states.
         But it also only offlines complete memory blocks.
      
      3. To make any of these conditions trigger, something would have to be
         terribly messed up in the core.  (e.g., online/offline only some
         sections of a memory block).
      
      4. Memory unplug properly makes sure that all sysfs attributes were
         removed (and therefore, that all threads left the sysfs handlers).  We
         don't have to worry about zombie devices at this point.
      
      5. The valid_section_nr(section_nr) check is actually dead code, as it
         would never have been reached due to the WARN_ON_ONCE(!pfn_valid(pfn)).
      
      No wonder we haven't seen any of these errors in a long time (or even
      ever, according to my search).  Let's just get rid of them.  Now, all
      checks that could hinder onlining and offlining are completely
      contained in online_pages()/offline_pages().
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: "Rafael J. Wysocki" <rafael@kernel.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
      Cc: Anshuman Khandual <anshuman.khandual@arm.com>
      Link: http://lkml.kernel.org/r/20200127110424.5757-3-david@redhat.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      fada9ae3
    • D
      drivers/base/memory.c: drop section_count · 68c3a6ac
      David Hildenbrand committed
      Patch series "mm: drop superfluous section checks when onlining/offlining".
      
      Let's drop some superfluous section checks on the onlining/offlining path.
      
      This patch (of 3):
      
      Since commit c5e79ef5 ("mm/memory_hotplug.c: don't allow to
      online/offline memory blocks with holes") we have a generic check in
      offline_pages() that disallows offlining memory blocks with holes.
      
      Memory blocks with missing sections are just another variant of this type
      of block.  We can stop checking (and especially storing) present
      sections.  A proper error message is now printed why offlining failed.
      
      section_count was initially introduced in commit 07681215 ("Driver
      core: Add section count to memory_block struct") in order to detect when
      it is okay to remove a memory block.  It was used in commit 26bbe7ef
      ("drivers/base/memory.c: prohibit offlining of memory blocks with missing
      sections") to disallow offlining memory blocks with missing sections.  As
      we refactored creation/removal of memory devices and have a proper check
      for holes in place, we can drop the section_count.
      
      This also removes a leftover comment regarding the mem_sysfs_mutex, which
      was removed in commit 848e19ad ("drivers/base/memory.c: drop the
      mem_sysfs_mutex").
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: "Rafael J. Wysocki" <rafael@kernel.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
      Cc: Anshuman Khandual <anshuman.khandual@arm.com>
      Link: http://lkml.kernel.org/r/20200127110424.5757-2-david@redhat.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      68c3a6ac
  13. 30 Mar, 2020 1 commit
    • D
      drivers/base/memory.c: indicate all memory blocks as removable · 53cdc1cb
      David Hildenbrand committed
      We see multiple issues with the implementation/interface to compute
      whether a memory block can be offlined (exposed via
      /sys/devices/system/memory/memoryX/removable) and would like to simplify
      it (remove the implementation).
      
      1. It runs basically lockless. While this might be good for performance,
         we see possible races with memory offlining that will require at
         least some sort of locking to fix.
      
      2. Nowadays, more false positives are possible. No arch-specific checks
         are performed that validate if memory offlining will not be denied
         right away (and such a check will require locking). For example, arm64
         won't allow offlining any memory block that was added during boot -
         which will imply a very high error rate. Other archs have other
         constraints.
      
      3. The interface is inherently racy. E.g., if a memory block is detected
         to be removable (and was not a false positive at that time), there is
         still no guarantee that offlining will actually succeed. So any
         caller already has to deal with false positives.
      
      4. It is unclear which performance benefit this interface actually
         provides. The introducing commit 5c755e9f ("memory-hotplug: add
         sysfs removable attribute for hotplug memory remove") mentioned
      
      	"A user-level agent must be able to identify which sections
      	 of memory are likely to be removable before attempting the
      	 potentially expensive operation."
      
         However, no actual performance comparison was included.
      
      Known users:
      
       - lsmem: Will group memory blocks based on the "removable" property. [1]
      
       - chmem: Indirect user. It has a RANGE mode where one can specify
                removable ranges identified via lsmem to be offlined. However,
                it also has a "SIZE" mode, which allows a sysadmin to skip the
                manual "identify removable blocks" step. [2]
      
       - powerpc-utils: Uses the "removable" attribute to skip some memory
                blocks right away when trying to find some to offline+remove.
                However, with ballooning enabled, it already skips this
                information completely (because it once resulted in many false
                negatives). Therefore, the implementation can deal with false
                positives properly already. [3]
      
      According to Nathan Fontenot, DLPAR on powerpc is nowadays no longer
      driven from userspace via the drmgr command (powerpc-utils).  Nowadays
      it's managed in the kernel - including onlining/offlining of memory
      blocks - triggered by drmgr writing to /sys/kernel/dlpar.  So the
      affected legacy userspace handling is only active on old kernels.  Only
      very old versions of drmgr on a new kernel (unlikely) might execute
      slower - totally acceptable.
      
      With CONFIG_MEMORY_HOTREMOVE, always indicating "removable" should not
      break any user space tool.  We implement a very bad heuristic now.
      Without CONFIG_MEMORY_HOTREMOVE we cannot offline anything, so report
      "not removable" as before.
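
      Because the attribute then no longer predicts anything, a user space tool
      can simply attempt to offline a block and check the result.  A minimal
      sketch, assuming the usual memoryX/state file; the helper name
      try_offline_block is hypothetical, and the block directory is passed as an
      argument so the function can be exercised on a scratch directory:

```shell
#!/bin/bash
# Hypothetical helper (not taken from any of the tools listed above): try to
# offline one memory block and report the outcome via the exit status.
# The "removable" file is deliberately not consulted.
try_offline_block() {
  local block="$1"

  # No state file means this is not a memory block directory.
  [ -e "$block/state" ] || return 1

  # Nothing to do if the block is already offline.
  [ "$(cat "$block/state")" = "offline" ] && return 0

  # The write itself fails if the kernel refuses to offline the block.
  echo offline 2>/dev/null > "$block/state" || return 1

  # Confirm the state that is reported afterwards.
  [ "$(cat "$block/state")" = "offline" ]
}
```

      A caller would loop over candidate blocks and move on to the next one
      whenever the helper returns non-zero.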
      
      Original discussion can be found in [4] ("[PATCH RFC v1] mm:
      is_mem_section_removable() overhaul").
      
      Other users of is_mem_section_removable() will be removed next, so that
      we can remove is_mem_section_removable() completely.
      
      [1] http://man7.org/linux/man-pages/man1/lsmem.1.html
      [2] http://man7.org/linux/man-pages/man8/chmem.8.html
      [3] https://github.com/ibm-power-utilities/powerpc-utils
      [4] https://lkml.kernel.org/r/20200117105759.27905-1-david@redhat.com
      
      Also, this patch probably fixes a crash reported by Steve.
      http://lkml.kernel.org/r/CAPcyv4jpdaNvJ67SkjyUJLBnBnXXQv686BiVW042g03FUmWLXw@mail.gmail.com
      Reported-by: "Scargall, Steve" <steve.scargall@intel.com>
      Suggested-by: Michal Hocko <mhocko@kernel.org>
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Nathan Fontenot <ndfont@gmail.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: "Rafael J. Wysocki" <rafael@kernel.org>
      Cc: Badari Pulavarty <pbadari@us.ibm.com>
      Cc: Robert Jennings <rcj@linux.vnet.ibm.com>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Karel Zak <kzak@redhat.com>
      Cc: <stable@vger.kernel.org>
      Link: http://lkml.kernel.org/r/20200128093542.6908-1-david@redhat.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      53cdc1cb
  14. 04 Feb, 2020 1 commit
  15. 01 Feb, 2020 2 commits
  16. 02 Dec, 2019 2 commits
    • D
      drivers/base/memory.c: drop the mem_sysfs_mutex · 848e19ad
      David Hildenbrand committed
      The mem_sysfs_mutex isn't really helpful.  Also, it's not really clear
      what the mutex protects at all.
      
      The device lists of the memory subsystem are protected separately.  We
      don't need that mutex when looking up, creating, or removing
      independent devices.  find_memory_block_by_id() will perform locking on
      its own and grab a reference of the returned device.
      
      At the time memory_dev_init() is called, we cannot have concurrent
      hot(un)plug operations yet - we're still fairly early during boot.  We
      don't need any locking.
      
      The creation/removal of memory block devices should be protected on a
      higher level - especially using the device hotplug lock to avoid
      documented issues (see Documentation/core-api/memory-hotplug.rst) - or
      if that is reworked, using similar locking.
      
      Protecting in the context of these functions only doesn't really make
      sense.  Especially, if we would have a situation where the same memory
      blocks are created/deleted at the same time, something is going horribly
      wrong (imagine adding/removing a DIMM at the same time from two
      call paths) - after the functions succeeded something else in the
      callers would blow up (e.g., create_memory_block_devices() succeeded but
      there are no memory block devices anymore).
      
      All relevant call paths (except when adding memory early during boot via
      ACPI, which is now documented) hold the device hotplug lock when adding
      memory, and when removing memory.  Let's document that instead.
      
      Add a simple safety net to create_memory_block_devices() in case we
      would actually remove memory blocks while adding them, so we'll never
      dereference a NULL pointer.  Simplify memory_dev_init() now that the
      lock is gone.
      
      Link: http://lkml.kernel.org/r/20190925082621.4927-1-david@redhat.com
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: "Rafael J. Wysocki" <rafael@kernel.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Oscar Salvador <osalvador@suse.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      848e19ad
    • N
      mm, soft-offline: convert parameter to pfn · feec24a6
      Naoya Horiguchi committed
      Currently soft_offline_page() receives struct page, and its sibling
      memory_failure() receives pfn.  This discrepancy looks weird and makes
      precheck on pfn validity tricky.  So let's align them.
      
      Link: http://lkml.kernel.org/r/20191016234706.GA5493@www9186uo.sakura.ne.jp
      Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Acked-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Oscar Salvador <osalvador@suse.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      feec24a6
  17. 16 Nov, 2019 1 commit
    • D
      mm/memory_hotplug: fix try_offline_node() · 2c91f8fc
      David Hildenbrand committed
      try_offline_node() is pretty much broken right now:
      
       - The node span is updated when onlining memory, not when adding it. We
         ignore memory that was never onlined. Bad.
      
       - We touch possible garbage memmaps. The pfn_to_nid(pfn) can easily
         trigger a kernel panic. Bad for memory that is offline but also bad
         for subsection hotadd with ZONE_DEVICE, whereby the memmap of the
         first PFN of a section might contain garbage.
      
       - Sections belonging to mixed nodes are not properly considered.
      
      As memory blocks might belong to multiple nodes, we would have to walk
      all pageblocks (or at least subsections) within present sections.
      However, we don't have a way to identify whether a memmap that is not
      online was initialized (relevant for ZONE_DEVICE).  This makes things
      more complicated.
      
      Luckily, we can piggyback on the node span and the nid stored in memory
      blocks.  Currently, the node span is grown when calling
      move_pfn_range_to_zone() - e.g., when onlining memory, and shrunk when
      removing memory, before calling try_offline_node().  Sysfs links are
      created via link_mem_sections(), e.g., during boot or when adding
      memory.
      
      If the node still spans memory or if any memory block belongs to the
      nid, we don't set the node offline.  As memory blocks that span multiple
      nodes cannot get offlined, the nid stored in memory blocks is reliable
      enough (for such online memory blocks, the node still spans the memory).
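
      From user space, the second half of this condition (whether any memory
      block still belongs to the nid) can be approximated through the sysfs
      links mentioned above.  This is only an analogy to the kernel-side check,
      not the implementation in this patch; the helper name
      node_has_memory_blocks is hypothetical, and the sketch assumes that a node
      directory contains one memoryN link per memory block linked to that node:

```shell
#!/bin/bash
# Hypothetical helper: succeed if the given node directory (for example
# /sys/devices/system/node/node1) still contains at least one memoryN entry.
node_has_memory_blocks() {
  local node_dir="$1" entry

  for entry in "$node_dir"/memory[0-9]*; do
    # An unmatched glob stays literal, so check that the entry really exists.
    # The -L test also accepts links whose target cannot be resolved.
    if [ -e "$entry" ] || [ -L "$entry" ]; then
      return 0
    fi
  done

  return 1
}
```

      A node for which this helper fails has no linked memory blocks left, which
      mirrors one of the two conditions under which the node may go offline.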
      
      Introduce for_each_memory_block() to efficiently walk all memory blocks.
      
      Note: We will soon stop shrinking the ZONE_DEVICE zone and the node span
      when removing ZONE_DEVICE memory to fix similar issues (access of
      garbage memmaps) - until we have a reliable way to identify whether
      these memmaps were properly initialized.  This implies later, that once
      a node had ZONE_DEVICE memory, we won't be able to set a node offline -
      which should be acceptable.
      
      Since commit f1dd2cd1 ("mm, memory_hotplug: do not associate
      hotadded memory to zones until online") memory that is added is not
      associated with a zone/node (memmap not initialized).  The introducing
      commit 60a5a19e ("memory-hotplug: remove sysfs file of node")
      already missed that we could have multiple nodes for a section and that
      the zone/node span is updated when onlining pages, not when adding them.
      
      I tested this by hotplugging two DIMMs to a memory-less and cpu-less
      NUMA node.  The node is properly onlined when adding the DIMMs.  When
      removing the DIMMs, the node is properly offlined.
      
      Masayoshi Mizuma reported:
      
      : Without this patch, memory hotplug fails as panic:
      :
      :  BUG: kernel NULL pointer dereference, address: 0000000000000000
      :  ...
      :  Call Trace:
      :   remove_memory_block_devices+0x81/0xc0
      :   try_remove_memory+0xb4/0x130
      :   __remove_memory+0xa/0x20
      :   acpi_memory_device_remove+0x84/0x100
      :   acpi_bus_trim+0x57/0x90
      :   acpi_bus_trim+0x2e/0x90
      :   acpi_device_hotplug+0x2b2/0x4d0
      :   acpi_hotplug_work_fn+0x1a/0x30
      :   process_one_work+0x171/0x380
      :   worker_thread+0x49/0x3f0
      :   kthread+0xf8/0x130
      :   ret_from_fork+0x35/0x40
      
      [david@redhat.com: v3]
        Link: http://lkml.kernel.org/r/20191102120221.7553-1-david@redhat.com
      Link: http://lkml.kernel.org/r/20191028105458.28320-1-david@redhat.com
      Fixes: 60a5a19e ("memory-hotplug: remove sysfs file of node")
      Fixes: f1dd2cd1 ("mm, memory_hotplug: do not associate hotadded memory to zones until online") # visible after d0dc12e8
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Tested-by: Masayoshi Mizuma <m.mizuma@jp.fujitsu.com>
      Cc: Tang Chen <tangchen@cn.fujitsu.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: "Rafael J. Wysocki" <rafael@kernel.org>
      Cc: Keith Busch <keith.busch@intel.com>
      Cc: Jiri Olsa <jolsa@kernel.org>
      Cc: "Peter Zijlstra (Intel)" <peterz@infradead.org>
      Cc: Jani Nikula <jani.nikula@intel.com>
      Cc: Nayna Jain <nayna@linux.ibm.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      2c91f8fc
  18. 19 Oct, 2019 1 commit
  19. 25 Sep, 2019 3 commits