1. 07 Nov, 2021 2 commits
  2. 29 Oct, 2021 2 commits
  3. 09 Sep, 2021 3 commits
    • M
      mm/page_alloc.c: avoid accessing uninitialized pcp page migratetype · 053cfda1
      Miaohe Lin committed
      If it's not prepared to free unref page, the pcp page migratetype is
      unset.  Thus we will get rubbish from get_pcppage_migratetype() and
      might list_del(&page->lru) again after it's already deleted from the list
      leading to grumble about data corruption.
      
      Link: https://lkml.kernel.org/r/20210902115447.57050-1-linmiaohe@huawei.com
      Fixes: df1acc85 ("mm/page_alloc: avoid conflating IRQs disabled with zone->lock")
      Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
      Acked-by: Mel Gorman <mgorman@techsingularity.net>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Reviewed-by: David Hildenbrand <david@redhat.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • D
      mm: track present early pages per zone · 4b097002
      David Hildenbrand committed
      Patch series "mm/memory_hotplug: "auto-movable" online policy and memory groups", v3.
      
      I. Goal
      
      The goal of this series is improving in-kernel auto-online support.  It
      tackles the fundamental problems that:
      
       1) We can create zone imbalances when onlining all memory blindly to
          ZONE_MOVABLE, in the worst case crashing the system. We have to know
          upfront how much memory we are going to hotplug such that we can
          safely enable auto-onlining of all hotplugged memory to ZONE_MOVABLE
          via "online_movable". This is far from practical and only applicable in
          limited setups -- like inside VMs under the RHV/oVirt hypervisor which
          will never hotplug more than 3 times the boot memory (and the
          limitation is only in place due to the Linux limitation).
      
       2) We see more setups that implement dynamic VM resizing, hot(un)plugging
          memory to resize VM memory. In these setups, we might hotplug a lot of
          memory, but it might happen in various small steps in both directions
          (e.g., 2 GiB -> 8 GiB -> 4 GiB -> 16 GiB ...). virtio-mem is the
          primary driver of this upstream right now, performing such dynamic
          resizing NUMA-aware via multiple virtio-mem devices.
      
          Onlining all hotplugged memory to ZONE_NORMAL means we basically have
          no hotunplug guarantees. Onlining all to ZONE_MOVABLE means we can
          easily run into zone imbalances when growing a VM. We want a mixture,
          and we want as much memory as reasonable/configured in ZONE_MOVABLE.
          Details regarding zone imbalances can be found at [1].
      
       3) Memory devices consist of 1..X memory block devices, however, the
          kernel doesn't really track the relationship. Consequently, also user
          space has no idea. We want to make per-device decisions.
      
          As one example, for memory hotunplug it doesn't make sense to use a
          mixture of zones within a single DIMM: we want all MOVABLE if
          possible, otherwise all !MOVABLE, because any !MOVABLE part will easily
          block the whole DIMM from getting hotunplugged.
      
          As another example, virtio-mem operates on individual units that span
          1..X memory blocks. Similar to a DIMM, we want a unit to either be all
          MOVABLE or !MOVABLE. A "unit" can be thought of like a DIMM, however,
          all units of a virtio-mem device logically belong together and are
          managed (added/removed) by a single driver. We want as much memory of
          a virtio-mem device to be MOVABLE as possible.
      
       4) We want memory onlining to be done right from the kernel while adding
          memory, not triggered by user space via udev rules; for example, this
          is required for fast memory hotplug for drivers that add individual
          memory blocks, like virtio-mem. We want a way to configure a policy in
          the kernel and avoid implementing advanced policies in user space.
      
      The auto-onlining support we have in the kernel is not sufficient.  All we
      have is a) online everything MOVABLE (online_movable) b) online everything
      !MOVABLE (online_kernel) c) keep zones contiguous (online).  This series
      allows configuring c) to mean instead "online movable if possible
      according to the configuration, driven by a maximum MOVABLE:KERNEL ratio"
      -- a new onlining policy.
      
      II. Approach
      
      This series does 3 things:
      
       1) Introduces the "auto-movable" online policy that initially operates on
          individual memory blocks only. It uses a maximum MOVABLE:KERNEL ratio
          to make a decision whether a memory block will be onlined to
          ZONE_MOVABLE or not. However, in the basic form, hotplugged KERNEL
          memory does not allow for more MOVABLE memory (details in the
          patches). CMA memory is treated like MOVABLE memory.
      
       2) Introduces static (e.g., DIMM) and dynamic (e.g., virtio-mem) memory
          groups and uses group information to make decisions in the
          "auto-movable" online policy across memory blocks of a single memory
          device (modeled as memory group). More details can be found in patch
          #3 or in the DIMM example below.
      
       3) Maximizes ZONE_MOVABLE memory within dynamic memory groups, by
          allowing ZONE_NORMAL memory within a dynamic memory group to allow for
          more ZONE_MOVABLE memory within the same memory group. The target use
          case is dynamic VM resizing using virtio-mem. See the virtio-mem
          example below.
      
      I remember that the basic idea of using a ratio to implement a policy in
      the kernel was once mentioned by Vitaly Kuznetsov, but I might be wrong (I
      lost the pointer to that discussion).
      
      For me, the main use case is using it along with virtio-mem (and DIMMs /
      ppc64 dlpar where necessary) for dynamic resizing of VMs, increasing the
      amount of memory we can hotunplug reliably again if we might eventually
      hotplug a lot of memory to a VM.
      
      III. Target Usage
      
      The target usage will be:
      
       1) Linux boots with "mhp_default_online_type=offline"
      
       2) User space (e.g., systemd unit) configures memory onlining (according
          to a config file and system properties), for example:
          * Setting memory_hotplug.online_policy=auto-movable
          * Setting memory_hotplug.auto_movable_ratio=301
          * Setting memory_hotplug.auto_movable_numa_aware=true
      
       3) User space enables auto onlining via "echo online >
          /sys/devices/system/memory/auto_online_blocks"
      
       4) User space triggers manual onlining of all already-offline memory
          blocks (go over offline memory blocks and set them to "online")
      
      IV. Example
      
      For DIMMs, hotplugging 4 GiB DIMMs to a 4 GiB VM with a configured ratio of
      301% results in the following layout:
      	Memory block 0-15:    DMA32   (early)
      	Memory block 32-47:   Normal  (early)
      	Memory block 48-79:   Movable (DIMM 0)
      	Memory block 80-111:  Movable (DIMM 1)
      	Memory block 112-143: Movable (DIMM 2)
      	Memory block 144-175: Normal  (DIMM 3)
      	Memory block 176-207: Normal  (DIMM 4)
      	... all Normal
      	(-> hotplugged Normal memory does not allow for more Movable memory)
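The DIMM layout above can be reproduced with a small user-space model. This is only a sketch under stated assumptions, not the kernel implementation: memory blocks are assumed to be 128 MiB, and `dimm_stats`, `fits_movable_ratio()` and `online_dimm()` are hypothetical names introduced for illustration. Each DIMM is decided as a whole, and Normal memory hotplugged earlier is deliberately not counted as KERNEL memory.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical bookkeeping, counted in 128 MiB memory blocks (assumed size). */
struct dimm_stats {
	uint64_t kernel_early_blocks; /* boot (early) KERNEL memory only */
	uint64_t movable_blocks;      /* memory already onlined to ZONE_MOVABLE */
};

/*
 * Would onlining new_blocks to ZONE_MOVABLE keep MOVABLE:KERNEL within
 * ratio_percent? Hotplugged KERNEL memory is not part of the right-hand
 * side, so it never allows for more MOVABLE memory in this basic form.
 */
static bool fits_movable_ratio(const struct dimm_stats *s,
			       uint64_t new_blocks, unsigned int ratio_percent)
{
	return (s->movable_blocks + new_blocks) * 100 <=
	       s->kernel_early_blocks * ratio_percent;
}

/* Online one DIMM as a unit; returns true if it went to ZONE_MOVABLE. */
static bool online_dimm(struct dimm_stats *s, uint64_t dimm_blocks,
			unsigned int ratio_percent)
{
	assert(dimm_blocks > 0);
	if (!fits_movable_ratio(s, dimm_blocks, ratio_percent))
		return false; /* the whole DIMM goes to ZONE_NORMAL */
	s->movable_blocks += dimm_blocks;
	return true;
}
```

With 32 early blocks (4 GiB) and a ratio of 301, three DIMMs of 32 blocks each fit (9600 <= 9632), while the fourth would not (12800 > 9632), which is why DIMM 3 and all later DIMMs end up Normal.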
      
      For virtio-mem, using a simple, single virtio-mem device with a 4 GiB VM
      will result in the following layout:
      	Memory block 0-15:    DMA32   (early)
      	Memory block 32-47:   Normal  (early)
      	Memory block 48-143:  Movable (virtio-mem, first 12 GiB)
      	Memory block 144:     Normal  (virtio-mem, next 128 MiB)
      	Memory block 145-147: Movable (virtio-mem, next 384 MiB)
      	Memory block 148:     Normal  (virtio-mem, next 128 MiB)
      	Memory block 149-151: Movable (virtio-mem, next 384 MiB)
      	... Normal/Movable mixture as above
      	(-> hotplugged Normal memory allows for more Movable memory within
      	    the same device)
      
      Which gives us maximum flexibility when dynamically growing/shrinking a
      VM in smaller steps.
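The virtio-mem layout can be modeled in the same spirit. Again this is a hedged sketch rather than the kernel code: `vmem_stats` and `online_virtio_mem_block()` are hypothetical names, blocks are assumed to be 128 MiB, and the decision is made per memory block. The difference from the DIMM case is that a block onlined to Normal within the device is added to the KERNEL side and therefore allows for more MOVABLE blocks later.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-device bookkeeping, counted in memory blocks. */
struct vmem_stats {
	uint64_t kernel_blocks;  /* early KERNEL blocks plus Normal blocks of this device */
	uint64_t movable_blocks; /* blocks onlined to ZONE_MOVABLE */
};

/* Online a single block of a dynamic group; returns true if it went MOVABLE. */
static bool online_virtio_mem_block(struct vmem_stats *s,
				    unsigned int ratio_percent)
{
	assert(ratio_percent > 0);
	if ((s->movable_blocks + 1) * 100 <= s->kernel_blocks * ratio_percent) {
		s->movable_blocks++;
		return true;
	}
	/* Ratio exceeded: online to Normal, which raises the MOVABLE budget. */
	s->kernel_blocks++;
	return false;
}
```

Starting from 32 early blocks and a ratio of 301, the first 96 hotplugged blocks go MOVABLE, the next one goes Normal, and from then on the model alternates one Normal block with three MOVABLE blocks, matching the listing above.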
      
      V. Doc Update
      
      I'll update the memory-hotplug.rst documentation, once the overhaul [1] is
      upstream. Until then, details can be found in patch #2.
      
      VI. Future Work
      
       1) Use memory groups for ppc64 dlpar
       2) Being able to specify a portion of (early) kernel memory that will be
          excluded from the ratio. Like "128 MiB globally/per node" are excluded.
      
          This might be helpful when starting VMs with extremely small memory
          footprint (e.g., 128 MiB) and hotplugging memory later -- not wanting
          the first hotplugged units getting onlined to ZONE_MOVABLE. One
          alternative would be a trigger to not consider ZONE_DMA memory
          in the ratio. We'll have to see if this is really required.
       3) Indicate to user space that MOVABLE might be a bad idea -- especially
          relevant when memory ballooning without support for balloon compaction
          is active.
      
      This patch (of 9):
      
      For implementing a new memory onlining policy, which determines when to
      online memory blocks to ZONE_MOVABLE semi-automatically, we need the
      number of present early (boot) pages -- present pages excluding hotplugged
      pages.  Let's track these pages per zone.
      
      Pass a page instead of the zone to adjust_present_page_count(), similar to
      adjust_managed_page_count(), and derive the zone from the page.
      
      It's worth noting that a memory block to be offlined/onlined is either
      completely "early" or "not early".  add_memory() and friends can only add
      complete memory blocks and we only online/offline complete (individual)
      memory blocks.
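The accounting described above can be illustrated with a simplified user-space model. The types `zone_model` and `page_model`, the `early` flag, and the function name are hypothetical stand-ins chosen for this sketch. The real kernel function derives the zone and the "early" property from the page itself, which is not reproduced here.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, simplified stand-in for the per-zone counters. */
struct zone_model {
	unsigned long present_pages;       /* all present pages */
	unsigned long present_early_pages; /* present pages that existed at boot */
};

/* Hypothetical stand-in for a page; the kernel looks both fields up instead. */
struct page_model {
	struct zone_model *zone;
	bool early; /* true if the page's memory block was present at boot */
};

/*
 * Adjust the counters of the zone the page belongs to. A memory block is
 * either completely early or completely hotplugged, so one flag per call
 * is enough. A negative nr_pages models offlining.
 */
static void adjust_present_page_count_model(struct page_model *page,
					    long nr_pages)
{
	struct zone_model *zone = page->zone;

	assert(zone);
	if (page->early)
		zone->present_early_pages += nr_pages;
	zone->present_pages += nr_pages;
}
```

Onlining a hotplugged block raises only `present_pages`, so the early count remains available as a stable basis for a MOVABLE:KERNEL ratio.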
      
      Link: https://lkml.kernel.org/r/20210806124715.17090-1-david@redhat.com
      Link: https://lkml.kernel.org/r/20210806124715.17090-2-david@redhat.com
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
      Cc: "Michael S. Tsirkin" <mst@redhat.com>
      Cc: Jason Wang <jasowang@redhat.com>
      Cc: Marek Kedzierski <mkedzier@redhat.com>
      Cc: Hui Zhu <teawater@gmail.com>
      Cc: Pankaj Gupta <pankaj.gupta.linux@gmail.com>
      Cc: Wei Yang <richard.weiyang@linux.alibaba.com>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Anshuman Khandual <anshuman.khandual@arm.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Mike Rapoport <rppt@kernel.org>
      Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
      Cc: Len Brown <lenb@kernel.org>
      Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • M
      mm: remove pfn_valid_within() and CONFIG_HOLES_IN_ZONE · 859a85dd
      Mike Rapoport committed
      Patch series "mm: remove pfn_valid_within() and CONFIG_HOLES_IN_ZONE".
      
      After recent updates to freeing unused parts of the memory map, no
      architecture can have holes in the memory map within a pageblock.  This
      makes pfn_valid_within() check and CONFIG_HOLES_IN_ZONE configuration
      option redundant.
      
      The first patch removes them both in a mechanical way and the second patch
      simplifies memory_hotplug::test_pages_in_a_zone(), which had
      pfn_valid_within() surrounded by more logic than a simple if.
      
      This patch (of 2):
      
      After recent changes in freeing of the unused parts of the memory map and
      rework of pfn_valid() in arm and arm64 there are no architectures that can
      have holes in the memory map within a pageblock and so nothing can enable
      CONFIG_HOLES_IN_ZONE which guards non trivial implementation of
      pfn_valid_within().
      
      With that, pfn_valid_within() is always hardwired to 1 and can be
      completely removed.
      
      Remove calls to pfn_valid_within() and CONFIG_HOLES_IN_ZONE.
      
      Link: https://lkml.kernel.org/r/20210713080035.7464-1-rppt@kernel.org
      Link: https://lkml.kernel.org/r/20210713080035.7464-2-rppt@kernel.org
      Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
      Acked-by: David Hildenbrand <david@redhat.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: "Rafael J. Wysocki" <rafael@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  4. 04 Sep, 2021 9 commits
  5. 21 Aug, 2021 1 commit
  6. 24 Jul, 2021 1 commit
  7. 16 Jul, 2021 4 commits
  8. 11 Jul, 2021 1 commit
  9. 02 Jul, 2021 2 commits
  10. 01 Jul, 2021 1 commit
    • M
      hugetlb: address ref count racing in prep_compound_gigantic_page · 7118fc29
      Mike Kravetz committed
      In [1], Jann Horn points out a possible race between
      prep_compound_gigantic_page and __page_cache_add_speculative.  The root
      cause of the possible race is prep_compound_gigantic_page unconditionally
      setting the ref count of pages to zero.  It does this because
      prep_compound_gigantic_page is handed a 'group' of pages from an allocator
      and needs to convert that group of pages to a compound page.  The ref
      count of each page in this 'group' is one as set by the allocator.
      However, the ref count of compound page tail pages must be zero.
      
      The potential race comes about when ref counted pages are returned from
      the allocator.  When this happens, other mm code could also take a
      reference on the page.  __page_cache_add_speculative is one such example.
      Therefore, prep_compound_gigantic_page can not just set the ref count of
      pages to zero as it does today.  Doing so would lose the reference taken
      by any other code.  This would lead to BUGs in code checking ref counts
      and could possibly even lead to memory corruption.
      
      There are two possible ways to address this issue.
      
      1) Make all allocators of gigantic groups of pages be able to return a
         properly constructed compound page.
      
      2) Make prep_compound_gigantic_page be more careful when constructing a
         compound page.
      
      This patch takes approach 2.
      
      In prep_compound_gigantic_page, use cmpxchg to only set ref count to zero
      if it is one.  If the cmpxchg fails, call synchronize_rcu() in the hope
      that the extra ref count will be dropped during an RCU grace period.  This
      is not a performance critical code path and the wait should be
      acceptable.  If the ref count is still inflated after the grace period,
      then undo any modifications made and return an error.
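The "set to zero only if it is one, otherwise undo" idea can be sketched in user space with C11 atomics. This is an illustrative model and not the kernel patch: `gpage_model`, `ref_freeze_from_one()` and `prep_gigantic_group_model()` are hypothetical names, and the retry after synchronize_rcu() is left out because it has no direct user-space equivalent here.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a page with only a reference count. */
struct gpage_model {
	atomic_int refcount;
};

/* Set the ref count to 0 only if it is exactly 1 (compare-and-exchange). */
static bool ref_freeze_from_one(struct gpage_model *p)
{
	int expected = 1;

	return atomic_compare_exchange_strong(&p->refcount, &expected, 0);
}

/*
 * Turn a group of pages into a model "compound page": pages[0] is the head
 * and keeps its reference, all tail pages must end up with ref count 0.
 * Returns false and restores already-frozen tails if any tail page has an
 * unexpected extra reference.
 */
static bool prep_gigantic_group_model(struct gpage_model *pages, size_t nr)
{
	assert(nr > 0);
	for (size_t i = 1; i < nr; i++) {
		if (!ref_freeze_from_one(&pages[i])) {
			/* Undo: give the already-frozen tails their reference back. */
			for (size_t j = 1; j < i; j++)
				atomic_store(&pages[j].refcount, 1);
			return false;
		}
	}
	return true;
}
```

On failure the group is left as the allocator handed it out, so a caller can free it and, where possible, retry the allocation.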
      
      Currently prep_compound_gigantic_page is type void and does not return
      errors.  Modify the two callers to check for and handle error returns.  On
      error, the caller must free the 'group' of pages as they can not be used
      to form a gigantic page.  After freeing pages, the runtime caller
      (alloc_fresh_huge_page) will retry the allocation once.  Boot time
      allocations can not be retried.
      
      The routine prep_compound_page also unconditionally sets the ref count of
      compound page tail pages to zero.  However, in this case the buddy
      allocator is constructing a compound page from freshly allocated pages.
      The ref count on those freshly allocated pages is already zero, so the
      set_page_count(p, 0) is unnecessary and could lead to confusion.  Just
      remove it.
      
      [1] https://lore.kernel.org/linux-mm/CAG48ez23q0Jy9cuVnwAe7t_fdhMk2S7N5Hdi-GLcCeq5bsfLxw@mail.gmail.com/
      
      Link: https://lkml.kernel.org/r/20210622021423.154662-3-mike.kravetz@oracle.com
      Fixes: 58a84aa9 ("thp: set compound tail page _count to zero")
      Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
      Reported-by: Jann Horn <jannh@google.com>
      Cc: Youquan Song <youquan.song@intel.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: "Kirill A . Shutemov" <kirill@shutemov.name>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Muchun Song <songmuchun@bytedance.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  11. 30 Jun, 2021 14 commits