1. 09 Jan, 2023 1 commit
    • A
      mm: Always release pages to the buddy allocator in memblock_free_late(). · 115d9d77
      Aaron Thompson committed
      If CONFIG_DEFERRED_STRUCT_PAGE_INIT is enabled, memblock_free_pages()
      only releases pages to the buddy allocator if they are not in the
      deferred range. This is correct for free pages (as defined by
      for_each_free_mem_pfn_range_in_zone()) because free pages in the
      deferred range will be initialized and released as part of the deferred
      init process. memblock_free_pages() is called by memblock_free_late(),
      which is used to free reserved ranges after memblock_free_all() has
      run. All pages in reserved ranges have been initialized at that point,
      and accordingly, those pages are not touched by the deferred init
      process. This means that currently, if the pages that
      memblock_free_late() intends to release are in the deferred range, they
      will never be released to the buddy allocator. They will forever be
      reserved.
      
      In addition, memblock_free_pages() calls kmsan_memblock_free_pages(),
      which is also correct for free pages but is not correct for reserved
      pages. KMSAN metadata for reserved pages is initialized by
      kmsan_init_shadow(), which runs shortly before memblock_free_all().
      
      For both of these reasons, memblock_free_pages() should only be called
      for free pages, and memblock_free_late() should call __free_pages_core()
      directly instead.
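
The skip described above can be illustrated with a small user-space toy model. This is only a sketch: the `toy_*` names, the 16-page zone, and the cut-off at pfn 8 are all hypothetical and are not kernel APIs. It contrasts a free path that skips the deferred range with one that releases reserved pages unconditionally.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy model only: none of these names exist in the kernel. */
enum { TOY_NPAGES = 16, TOY_FIRST_DEFERRED_PFN = 8 };

struct toy_zone {
    bool in_buddy[TOY_NPAGES]; /* true once a page was handed to the allocator */
};

/* Models the deferred-range check: pfns at or above the cut-off are deferred. */
static bool toy_is_deferred(size_t pfn)
{
    return pfn >= TOY_FIRST_DEFERRED_PFN;
}

/* Old behaviour: the late free path skips pages in the deferred range. */
static void toy_free_late_old(struct toy_zone *z, size_t start, size_t end)
{
    for (size_t pfn = start; pfn < end; pfn++) {
        assert(pfn < TOY_NPAGES);
        if (toy_is_deferred(pfn))
            continue; /* left for deferred init, which never touches reserved pages */
        z->in_buddy[pfn] = true;
    }
}

/* Patched behaviour: reserved pages are already initialised, so release them all. */
static void toy_free_late_new(struct toy_zone *z, size_t start, size_t end)
{
    for (size_t pfn = start; pfn < end; pfn++) {
        assert(pfn < TOY_NPAGES);
        z->in_buddy[pfn] = true;
    }
}

/* Counts how many pages ended up in the toy allocator. */
static size_t toy_count_released(const struct toy_zone *z)
{
    size_t n = 0;
    for (size_t pfn = 0; pfn < TOY_NPAGES; pfn++)
        n += z->in_buddy[pfn] ? 1 : 0;
    return n;
}
```

Freeing the reserved range [6, 12) through the old path releases only pfns 6 and 7, while the patched path releases all six pages.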
      
      One case where this issue can occur in the wild is EFI boot on
      x86_64. The x86 EFI code reserves all EFI boot services memory ranges
      via memblock_reserve() and frees them later via memblock_free_late()
      (efi_reserve_boot_services() and efi_free_boot_services(),
      respectively). If any of those ranges happens to fall within the
      deferred init range, the pages will not be released and that memory will
      be unavailable.
      
      For example, on an Amazon EC2 t3.micro VM (1 GB) booting via EFI:
      
      v6.2-rc2:
        # grep -E 'Node|spanned|present|managed' /proc/zoneinfo
        Node 0, zone      DMA
                spanned  4095
                present  3999
                managed  3840
        Node 0, zone    DMA32
                spanned  246652
                present  245868
                managed  178867
      
      v6.2-rc2 + patch:
        # grep -E 'Node|spanned|present|managed' /proc/zoneinfo
        Node 0, zone      DMA
                spanned  4095
                present  3999
                managed  3840
        Node 0, zone    DMA32
                spanned  246652
                present  245868
                managed  222816   # +43,949 pages
      
      Fixes: 3a80a7fa ("mm: meminit: initialise a subset of struct pages if CONFIG_DEFERRED_STRUCT_PAGE_INIT is set")
      Signed-off-by: Aaron Thompson <dev@aaront.org>
      Link: https://lore.kernel.org/r/01010185892de53e-e379acfb-7044-4b24-b30a-e2657c1ba989-000000@us-west-2.amazonses.com
      Signed-off-by: Mike Rapoport (IBM) <rppt@kernel.org>
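
To put the DMA32 `managed` figures from this commit in perspective, the difference can be converted to a size. This is a minimal sketch that assumes 4 KiB base pages (the usual x86_64 page size, which the commit message does not state); the helper names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: 4 KiB base pages; not stated in the commit message. */
#define TOY_PAGE_SIZE 4096ULL

/* Pages gained between two "managed" readings from /proc/zoneinfo. */
static uint64_t toy_pages_recovered(uint64_t managed_before, uint64_t managed_after)
{
    assert(managed_after >= managed_before);
    return managed_after - managed_before;
}

/* Converts a page count to KiB under the page-size assumption above. */
static uint64_t toy_pages_to_kib(uint64_t pages)
{
    return pages * TOY_PAGE_SIZE / 1024;
}
```

With the values above, 222816 - 178867 gives 43949 pages, or 175796 KiB (about 171.7 MiB) under the 4 KiB assumption, a sizeable share of a 1 GB VM.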
  2. 04 Jan, 2023 1 commit
  3. 04 Oct, 2022 2 commits
  4. 30 Jul, 2022 1 commit
    • Z
      memblock,arm64: expand the static memblock memory table · 450d0e74
      Zhou Guanghui committed
      On a system (Huawei Ascend ARM64 SoC) using HBM, when a multi-bit ECC
      error occurs, the BIOS marks the corresponding area (for example, 2 MB)
      as unusable.  On the next restart, these areas are either not reported
      or reported as EFI_UNUSABLE_MEMORY.  Both cases increase the number of
      memblocks, and the EFI_UNUSABLE_MEMORY case increases it more.
      
      For example, if the EFI_UNUSABLE_MEMORY type is reported:
      ...
      memory[0x92]    [0x0000200834a00000-0x0000200835bfffff], 0x0000000001200000 bytes on node 7 flags: 0x0
      memory[0x93]    [0x0000200835c00000-0x0000200835dfffff], 0x0000000000200000 bytes on node 7 flags: 0x4
      memory[0x94]    [0x0000200835e00000-0x00002008367fffff], 0x0000000000a00000 bytes on node 7 flags: 0x0
      memory[0x95]    [0x0000200836800000-0x00002008369fffff], 0x0000000000200000 bytes on node 7 flags: 0x4
      memory[0x96]    [0x0000200836a00000-0x0000200837bfffff], 0x0000000001200000 bytes on node 7 flags: 0x0
      memory[0x97]    [0x0000200837c00000-0x0000200837dfffff], 0x0000000000200000 bytes on node 7 flags: 0x4
      memory[0x98]    [0x0000200837e00000-0x000020087fffffff], 0x0000000048200000 bytes on node 7 flags: 0x0
      memory[0x99]    [0x0000200880000000-0x0000200bcfffffff], 0x0000000350000000 bytes on node 6 flags: 0x0
      memory[0x9a]    [0x0000200bd0000000-0x0000200bd01fffff], 0x0000000000200000 bytes on node 6 flags: 0x4
      memory[0x9b]    [0x0000200bd0200000-0x0000200bd07fffff], 0x0000000000600000 bytes on node 6 flags: 0x0
      memory[0x9c]    [0x0000200bd0800000-0x0000200bd09fffff], 0x0000000000200000 bytes on node 6 flags: 0x4
      memory[0x9d]    [0x0000200bd0a00000-0x0000200fcfffffff], 0x00000003ff600000 bytes on node 6 flags: 0x0
      memory[0x9e]    [0x0000200fd0000000-0x0000200fd01fffff], 0x0000000000200000 bytes on node 6 flags: 0x4
      memory[0x9f]    [0x0000200fd0200000-0x0000200fffffffff], 0x000000002fe00000 bytes on node 6 flags: 0x0
      ...
      
      The EFI memory map is parsed to construct the memblock arrays before the
      memblock arrays can be resized.  As a result, memory regions beyond
      INIT_MEMBLOCK_REGIONS are lost.
      
      Add a new macro INIT_MEMBLOCK_MEMORY_REGIONS to replace
      INIT_MEMBLOCK_REGIONS to define the size of the static memblock.memory
      array.
      
      Allow overriding memblock.memory array size with architecture defined
      INIT_MEMBLOCK_MEMORY_REGIONS and make arm64 set
      INIT_MEMBLOCK_MEMORY_REGIONS to 1024 when CONFIG_EFI is enabled.
      
      Link: https://lkml.kernel.org/r/20220615102742.96450-1-zhouguanghui1@huawei.com
      Signed-off-by: Zhou Guanghui <zhouguanghui1@huawei.com>
      Acked-by: Mike Rapoport <rppt@linux.ibm.com>
      Tested-by: Darren Hart <darren@os.amperecomputing.com>
      Acked-by: Will Deacon <will@kernel.org>		[arm64]
      Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com>
      Cc: Xu Qiang <xuqiang36@huawei.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
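
The override mechanism this commit describes follows a common C pattern: a generic default that an architecture header may pre-empt. The sketch below is illustrative only; the default of 128 is an assumed value, and `toy_region`, `toy_memory_regions`, and `toy_memory_table_capacity()` are hypothetical stand-ins rather than the kernel's definitions.

```c
#include <assert.h>

/*
 * Hypothetical stand-in for an architecture header. Uncommenting the
 * line below would model an architecture choosing a larger table, such
 * as the 1024 entries the commit message mentions for arm64 with EFI.
 */
/* #define INIT_MEMBLOCK_MEMORY_REGIONS 1024 */

/* Assumed generic default, used here only for illustration. */
#define INIT_MEMBLOCK_REGIONS 128

/* Fall back to the generic default when nothing overrides the macro. */
#ifndef INIT_MEMBLOCK_MEMORY_REGIONS
#define INIT_MEMBLOCK_MEMORY_REGIONS INIT_MEMBLOCK_REGIONS
#endif

/* Toy region descriptor; not the kernel's struct memblock_region. */
struct toy_region {
    unsigned long long base;
    unsigned long long size;
};

/* Static table sized at compile time by the (possibly overridden) macro. */
static struct toy_region toy_memory_regions[INIT_MEMBLOCK_MEMORY_REGIONS];

/* Number of entries the static table can hold before any resize. */
static unsigned long toy_memory_table_capacity(void)
{
    return sizeof(toy_memory_regions) / sizeof(toy_memory_regions[0]);
}
```

Because the table is sized at compile time, the override has to be visible before the generic definition is reached, which is why the pattern relies on `#ifndef` rather than a runtime setting.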
  5. 30 Jun, 2022 1 commit
    • J
      memblock: avoid some repeat when add new range · 28e1a8f4
      Jinyu Tang committed
      The worst case is that the new memory range overlaps all existing
      regions, which requires type->cnt + 1 empty struct memblock_region slots in
      the type->regions array.
      So if type->cnt + 1 + type->cnt is less than type->max, we can insert
      regions directly rather than calculate the needed amount before the
      insertion.
      And because of the merge operation at the end of the function, type->cnt
      will increase slowly in many cases.
      
      This change avoids an unnecessary repeated traversal of the memblock
      ranges in many cases when adding a new memory range.
      Signed-off-by: Jinyu Tang <tjytimi@163.com>
      [rppt: massaged comment and changelog text]
      Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
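
The capacity check described in this commit can be sketched as a small predicate. This is a toy illustration: `struct toy_type` and `toy_can_insert_directly()` are hypothetical names, and the strict less-than comparison follows the wording of the commit message, so the exact comparison in the kernel source may differ.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy stand-in for the region table bookkeeping; not the kernel struct. */
struct toy_type {
    unsigned long cnt; /* regions currently in use */
    unsigned long max; /* slots allocated in the regions array */
};

/*
 * Worst case: the new range overlaps every existing region, so it is
 * split into cnt + 1 pieces (before the first region, between each
 * pair of neighbours, and after the last one). Those pieces sit next
 * to the cnt existing regions, hence cnt + 1 + cnt slots in total.
 */
static bool toy_can_insert_directly(const struct toy_type *t)
{
    return t->cnt + 1 + t->cnt < t->max;
}
```

When the predicate holds, a single pass that inserts as it goes cannot overflow the array, so the separate counting pass is unnecessary.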
  6. 17 Jun, 2022 1 commit
  7. 15 Jun, 2022 2 commits
  8. 21 Feb, 2022 1 commit
  9. 20 Feb, 2022 1 commit
  10. 08 Nov, 2021 1 commit
    • Q
      arm64: Track no early_pgtable_alloc() for kmemleak · c6975d7c
      Qian Cai committed
      After switching the page size from 64KB to 4KB on several arm64 servers
      here, kmemleak starts to run out of early memory pool due to a huge
      number of those early_pgtable_alloc() calls:
      
        kmemleak_alloc_phys()
        memblock_alloc_range_nid()
        memblock_phys_alloc_range()
        early_pgtable_alloc()
        init_pmd()
        alloc_init_pud()
        __create_pgd_mapping()
        __map_memblock()
        paging_init()
        setup_arch()
        start_kernel()
      
      Increasing the default value of DEBUG_KMEMLEAK_MEM_POOL_SIZE by 4 times
      won't be enough for a server with 200GB+ memory. There isn't much point
      in checking for memory leaks in those early page tables, and those
      early memory mappings should not reference other memory. Hence, there
      are no kmemleak false positives, and we can safely skip tracking those
      early allocations in kmemleak like we did in the commit fed84c78
      ("mm/memblock.c: skip kmemleak for kasan_init()") without needing to
      introduce complications to automatically scale the value depending on
      the runtime memory size etc. After the patch, the default value of
      DEBUG_KMEMLEAK_MEM_POOL_SIZE becomes sufficient again.
      Signed-off-by: Qian Cai <quic_qiancai@quicinc.com>
      Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
      Reviewed-by: Mike Rapoport <rppt@linux.ibm.com>
      Link: https://lore.kernel.org/r/20211105150509.7826-1-quic_qiancai@quicinc.com
      Signed-off-by: Will Deacon <will@kernel.org>
  11. 07 Nov, 2021 5 commits
    • D
      memblock: add MEMBLOCK_DRIVER_MANAGED to mimic IORESOURCE_SYSRAM_DRIVER_MANAGED · f7892d8e
      David Hildenbrand committed
      Let's add a flag that corresponds to IORESOURCE_SYSRAM_DRIVER_MANAGED,
      indicating that we're dealing with a memory region that is never
      indicated in the firmware-provided memory map, but always detected and
      added by a driver.
      
      Similar to MEMBLOCK_HOTPLUG, most infrastructure has to treat such
      memory regions like ordinary MEMBLOCK_NONE memory regions -- for
      example, when selecting memory regions to add to the vmcore for dumping
      in the crashkernel via for_each_mem_range().
      
      However, especially kexec_file is not supposed to select such memblocks
      via for_each_free_mem_range() / for_each_free_mem_range_reverse() to
      place kexec images, similar to how we handle
      IORESOURCE_SYSRAM_DRIVER_MANAGED without CONFIG_ARCH_KEEP_MEMBLOCK.
      
      We'll make sure that memory hotplug code sets the flag where applicable
      (IORESOURCE_SYSRAM_DRIVER_MANAGED) next.  This prepares architectures
      that need CONFIG_ARCH_KEEP_MEMBLOCK, such as arm64, for virtio-mem
      support.
      
      Note that kexec *must not* indicate this memory to the second kernel and
      *must not* place kexec-images on this memory.  Let's add a comment to
      kexec_walk_memblock(), documenting how we handle MEMBLOCK_DRIVER_MANAGED
      now just like using IORESOURCE_SYSRAM_DRIVER_MANAGED in
      locate_mem_hole_callback() for kexec_walk_resources().
      
      Also note that MEMBLOCK_HOTPLUG cannot be reused due to different
      semantics:
      	MEMBLOCK_HOTPLUG: memory is indicated as "System RAM" in the
      	firmware-provided memory map and added to the system early during
      	boot; kexec *has to* indicate this memory to the second kernel and
      	can place kexec-images on this memory. After memory hotunplug,
      	kexec has to be re-armed. We mostly ignore this flag when
      	"movable_node" is not set on the kernel command line, because
      	then we're told to not care about hotunpluggability of such
      	memory regions.
      
      	MEMBLOCK_DRIVER_MANAGED: memory is not indicated as "System RAM" in
      	the firmware-provided memory map; this memory is always detected
      	and added to the system by a driver; memory might not actually be
      	physically hotunpluggable. kexec *must not* indicate this memory to
      	the second kernel and *must not* place kexec-images on this memory.
      
      Link: https://lkml.kernel.org/r/20211004093605.5830-5-david@redhat.com
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Reviewed-by: Mike Rapoport <rppt@linux.ibm.com>
      Cc: "Aneesh Kumar K . V" <aneesh.kumar@linux.ibm.com>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Christian Borntraeger <borntraeger@de.ibm.com>
      Cc: Eric Biederman <ebiederm@xmission.com>
      Cc: Geert Uytterhoeven <geert@linux-m68k.org>
      Cc: Heiko Carstens <hca@linux.ibm.com>
      Cc: Huacai Chen <chenhuacai@kernel.org>
      Cc: Jianyong Wu <Jianyong.Wu@arm.com>
      Cc: Jiaxun Yang <jiaxun.yang@flygoat.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: Shahab Vahedi <shahab@synopsys.com>
      Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
      Cc: Vasily Gorbik <gor@linux.ibm.com>
      Cc: Vineet Gupta <vgupta@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • D
      memblock: allow to specify flags with memblock_add_node() · 952eea9b
      David Hildenbrand committed
      We want to specify flags when hotplugging memory.  Let's prepare to pass
      flags to memblock_add_node() by adjusting all existing users.
      
      Note that when hotplugging memory the system is already up and running
      and we might have concurrent memblock users: for example, while we're
      hotplugging memory, kexec_file code might search for suitable memory
      regions to place kexec images.  It's important to add the memory
      directly to memblock via a single call with the right flags, instead of
      adding the memory first and applying flags later: otherwise, concurrent
      memblock users might temporarily stumble over memblocks with wrong
      flags, which will be important in a follow-up patch that introduces a
      new flag to properly handle add_memory_driver_managed().
      
      Link: https://lkml.kernel.org/r/20211004093605.5830-4-david@redhat.com
      Acked-by: Geert Uytterhoeven <geert@linux-m68k.org>
      Acked-by: Heiko Carstens <hca@linux.ibm.com>
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Acked-by: Shahab Vahedi <shahab@synopsys.com>	[arch/arc]
      Reviewed-by: Mike Rapoport <rppt@linux.ibm.com>
      Cc: "Aneesh Kumar K . V" <aneesh.kumar@linux.ibm.com>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Christian Borntraeger <borntraeger@de.ibm.com>
      Cc: Eric Biederman <ebiederm@xmission.com>
      Cc: Huacai Chen <chenhuacai@kernel.org>
      Cc: Jianyong Wu <Jianyong.Wu@arm.com>
      Cc: Jiaxun Yang <jiaxun.yang@flygoat.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
      Cc: Vasily Gorbik <gor@linux.ibm.com>
      Cc: Vineet Gupta <vgupta@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • M
      memblock: use memblock_free for freeing virtual pointers · 4421cca0
      Mike Rapoport committed
      Rename memblock_free_ptr() to memblock_free() and use memblock_free()
      when freeing a virtual pointer so that memblock_free() will be a
      counterpart of memblock_alloc().
      
      The callers are updated with the below semantic patch and manual
      addition of (void *) casting to pointers that are represented by
      unsigned long variables.
      
          @@
          identifier vaddr;
          expression size;
          @@
          (
          - memblock_phys_free(__pa(vaddr), size);
          + memblock_free(vaddr, size);
          |
          - memblock_free_ptr(vaddr, size);
          + memblock_free(vaddr, size);
          )
      
      [sfr@canb.auug.org.au: fixup]
        Link: https://lkml.kernel.org/r/20211018192940.3d1d532f@canb.auug.org.au
      
      Link: https://lkml.kernel.org/r/20210930185031.18648-7-rppt@kernel.org
      Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
      Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
      Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
      Cc: Juergen Gross <jgross@suse.com>
      Cc: Shahab Vahedi <Shahab.Vahedi@synopsys.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • M
      memblock: rename memblock_free to memblock_phys_free · 3ecc6834
      Mike Rapoport committed
      Since memblock_free() operates on a physical range, make its name
      reflect it and rename it to memblock_phys_free(), so it will be a
      logical counterpart to memblock_phys_alloc().
      
      The callers are updated with the below semantic patch:
      
          @@
          expression addr;
          expression size;
          @@
          - memblock_free(addr, size);
          + memblock_phys_free(addr, size);
      
      Link: https://lkml.kernel.org/r/20210930185031.18648-6-rppt@kernel.org
      Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
      Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
      Cc: Juergen Gross <jgross@suse.com>
      Cc: Shahab Vahedi <Shahab.Vahedi@synopsys.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • M
      memblock: stop aliasing __memblock_free_late with memblock_free_late · 621d9739
      Mike Rapoport committed
      memblock_free_late() is a NOP wrapper for __memblock_free_late(), so
      there is no point in keeping this indirection.
      
      Drop the wrapper and rename __memblock_free_late() to
      memblock_free_late().
      
      Link: https://lkml.kernel.org/r/20210930185031.18648-5-rppt@kernel.org
      Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
      Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
      Cc: Juergen Gross <jgross@suse.com>
      Cc: Shahab Vahedi <Shahab.Vahedi@synopsys.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  12. 22 Oct, 2021 2 commits
    • M
      memblock: exclude MEMBLOCK_NOMAP regions from kmemleak · 658aafc8
      Mike Rapoport committed
      Vladimir Zapolskiy reports:
      
      Commit a7259df7 ("memblock: make memblock_find_in_range method
      private") invokes a kernel panic while running kmemleak on OF platforms
      with nomap regions:
      
        Unable to handle kernel paging request at virtual address fff000021e00000
        [...]
          scan_block+0x64/0x170
          scan_gray_list+0xe8/0x17c
          kmemleak_scan+0x270/0x514
          kmemleak_write+0x34c/0x4ac
      
      The memory allocated from memblock is registered with kmemleak, but if
      it is marked MEMBLOCK_NOMAP it won't have linear map entries so an
      attempt to scan such areas will fault.
      
      Ideally, memblock_mark_nomap() would inform kmemleak to ignore
      MEMBLOCK_NOMAP memory, but it can be called before kmemleak interfaces
      operating on physical addresses can use __va() conversion.
      
      Make sure that functions that mark allocated memory as MEMBLOCK_NOMAP
      take care of informing kmemleak to ignore such memory.
      
      Link: https://lore.kernel.org/all/8ade5174-b143-d621-8c8e-dc6a1898c6fb@linaro.org
      Link: https://lore.kernel.org/all/c30ff0a2-d196-c50d-22f0-bd50696b1205@quicinc.com
      Fixes: a7259df7 ("memblock: make memblock_find_in_range method private")
      Reported-by: Vladimir Zapolskiy <vladimir.zapolskiy@linaro.org>
      Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
      Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
      Tested-by: Vladimir Zapolskiy <vladimir.zapolskiy@linaro.org>
      Tested-by: Qian Cai <quic_qiancai@quicinc.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • M
      Revert "memblock: exclude NOMAP regions from kmemleak" · 6c9a5455
      Mike Rapoport committed
      Commit 6e44bd6d ("memblock: exclude NOMAP regions from kmemleak")
      breaks boot on EFI systems with kmemleak and VM_DEBUG enabled:
      
        efi: Processing EFI memory map:
        efi:   0x000090000000-0x000091ffffff [Conventional|   |  |  |  |  |  |  |  |  |   |WB|WT|WC|UC]
        efi:   0x000092000000-0x0000928fffff [Runtime Data|RUN|  |  |  |  |  |  |  |  |   |WB|WT|WC|UC]
        ------------[ cut here ]------------
        kernel BUG at mm/kmemleak.c:1140!
        Internal error: Oops - BUG: 0 [#1] SMP
        Modules linked in:
        CPU: 0 PID: 0 Comm: swapper Not tainted 5.15.0-rc6-next-20211019+ #104
        pstate: 600000c5 (nZCv daIF -PAN -UAO -TCO -DIT -SSBS BTYPE=--)
        pc : kmemleak_free_part_phys+0x64/0x8c
        lr : kmemleak_free_part_phys+0x38/0x8c
        sp : ffff800011eafbc0
        x29: ffff800011eafbc0 x28: 1fffff7fffb41c0d x27: fffffbfffda0e068
        x26: 0000000092000000 x25: 1ffff000023d5f94 x24: ffff800011ed84d0
        x23: ffff800011ed84c0 x22: ffff800011ed83d8 x21: 0000000000900000
        x20: ffff800011782000 x19: 0000000092000000 x18: ffff800011ee0730
        x17: 0000000000000000 x16: 0000000000000000 x15: 1ffff0000233252c
        x14: ffff800019a905a0 x13: 0000000000000001 x12: ffff7000023d5ed7
        x11: 1ffff000023d5ed6 x10: ffff7000023d5ed6 x9 : dfff800000000000
        x8 : ffff800011eaf6b7 x7 : 0000000000000001 x6 : ffff800011eaf6b0
        x5 : 00008ffffdc2a12a x4 : ffff7000023d5ed7 x3 : 1ffff000023dbf99
        x2 : 1ffff000022f0463 x1 : 0000000000000000 x0 : ffffffffffffffff
        Call trace:
         kmemleak_free_part_phys+0x64/0x8c
         memblock_mark_nomap+0x5c/0x78
         reserve_regions+0x294/0x33c
         efi_init+0x2d0/0x490
         setup_arch+0x80/0x138
         start_kernel+0xa0/0x3ec
         __primary_switched+0xc0/0xc8
        Code: 34000041 97d526e7 f9418e80 36000040 (d4210000)
        random: get_random_bytes called from print_oops_end_marker+0x34/0x80 with crng_init=0
        ---[ end trace 0000000000000000 ]---
      
      The crash happens because kmemleak_free_part_phys() tries to use __va()
      before memstart_addr is initialized and this triggers a VM_BUG_ON() in
      arch/arm64/include/asm/memory.h:
      
      Revert 6e44bd6d ("memblock: exclude NOMAP regions from kmemleak"),
      the issue it is fixing will be fixed differently.
      Reported-by: Qian Cai <quic_qiancai@quicinc.com>
      Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
      Acked-by: Catalin Marinas <catalin.marinas@arm.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  13. 19 Oct, 2021 1 commit
  14. 13 Oct, 2021 1 commit
  15. 15 Sep, 2021 1 commit
    • L
      memblock: introduce saner 'memblock_free_ptr()' interface · 77e02cf5
      Linus Torvalds committed
      The boot-time allocation interface for memblock is a mess, with
      'memblock_alloc()' returning a virtual pointer, but then you are
      supposed to free it with 'memblock_free()' that takes a _physical_
      address.
      
      Not only is that all kinds of strange and illogical, but it actually
      causes bugs, when people then use it like a normal allocation function,
      and it fails spectacularly on a NULL pointer:
      
         https://lore.kernel.org/all/20210912140820.GD25450@xsang-OptiPlex-9020/
      
      or just random memory corruption if the debug checks don't catch it:
      
         https://lore.kernel.org/all/61ab2d0c-3313-aaab-514c-e15b7aa054a0@suse.cz/
      
      I really don't want to apply patches that treat the symptoms, when the
      fundamental cause is this horribly confusing interface.
      
      I started out looking at just automating a sane replacement sequence,
      but because of this mix of virtual and physical addresses, and because
      people have used the "__pa()" macro that can take either a regular
      kernel pointer, or just the raw "unsigned long" address, it's all quite
      messy.
      
      So this just introduces a new saner interface for freeing a virtual
      address that was allocated using 'memblock_alloc()', and that was kept
      as a regular kernel pointer.  And then it converts a couple of users
      that are obvious and easy to test, including the 'xbc_nodes' case in
      lib/bootconfig.c that caused problems.
      Reported-by: kernel test robot <oliver.sang@intel.com>
      Fixes: 40caa127 ("init: bootconfig: Remove all bootconfig data when the init memory is removed")
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Mike Rapoport <rppt@kernel.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Masami Hiramatsu <mhiramat@kernel.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  16. 04 Sep, 2021 2 commits
  17. 11 Aug, 2021 2 commits
  18. 24 Jul, 2021 1 commit
    • M
      memblock: make for_each_mem_range() traverse MEMBLOCK_HOTPLUG regions · 79e482e9
      Mike Rapoport committed
      Commit b10d6bca ("arch, drivers: replace for_each_membock() with
      for_each_mem_range()") didn't take into account that when the
      movable_node parameter is present on the kernel command line,
      for_each_mem_range() would skip ranges marked with MEMBLOCK_HOTPLUG.
      
      The page table setup code in POWER uses for_each_mem_range() to create
      the linear mapping of the physical memory and since the regions marked
      as MEMBLOCK_HOTPLUG are skipped, they never make it to the linear map.
      
      A later access to the memory in those ranges will fail:
      
        BUG: Unable to handle kernel data access on write at 0xc000000400000000
        Faulting instruction address: 0xc00000000008a3c0
        Oops: Kernel access of bad area, sig: 11 [#1]
        LE PAGE_SIZE=64K MMU=Radix SMP NR_CPUS=2048 NUMA pSeries
        Modules linked in:
        CPU: 0 PID: 53 Comm: kworker/u2:0 Not tainted 5.13.0 #7
        NIP:  c00000000008a3c0 LR: c0000000003c1ed8 CTR: 0000000000000040
        REGS: c000000008a57770 TRAP: 0300   Not tainted  (5.13.0)
        MSR:  8000000002009033 <SF,VEC,EE,ME,IR,DR,RI,LE>  CR: 84222202  XER: 20040000
        CFAR: c0000000003c1ed4 DAR: c000000400000000 DSISR: 42000000 IRQMASK: 0
        GPR00: c0000000003c1ed8 c000000008a57a10 c0000000019da700 c000000400000000
        GPR04: 0000000000000280 0000000000000180 0000000000000400 0000000000000200
        GPR08: 0000000000000100 0000000000000080 0000000000000040 0000000000000300
        GPR12: 0000000000000380 c000000001bc0000 c0000000001660c8 c000000006337e00
        GPR16: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
        GPR20: 0000000040000000 0000000020000000 c000000001a81990 c000000008c30000
        GPR24: c000000008c20000 c000000001a81998 000fffffffff0000 c000000001a819a0
        GPR28: c000000001a81908 c00c000001000000 c000000008c40000 c000000008a64680
        NIP clear_user_page+0x50/0x80
        LR __handle_mm_fault+0xc88/0x1910
        Call Trace:
          __handle_mm_fault+0xc44/0x1910 (unreliable)
          handle_mm_fault+0x130/0x2a0
          __get_user_pages+0x248/0x610
          __get_user_pages_remote+0x12c/0x3e0
          get_arg_page+0x54/0xf0
          copy_string_kernel+0x11c/0x210
          kernel_execve+0x16c/0x220
          call_usermodehelper_exec_async+0x1b0/0x2f0
          ret_from_kernel_thread+0x5c/0x70
        Instruction dump:
        79280fa4 79271764 79261f24 794ae8e2 7ca94214 7d683a14 7c893a14 7d893050
        7d4903a6 60000000 60000000 60000000 <7c001fec> 7c091fec 7c081fec 7c051fec
        ---[ end trace 490b8c67e6075e09 ]---
      
      Making for_each_mem_range() include MEMBLOCK_HOTPLUG regions in the
      traversal fixes this issue.
      
      Link: https://bugzilla.redhat.com/show_bug.cgi?id=1976100
      Link: https://lkml.kernel.org/r/20210712071132.20902-1-rppt@kernel.org
      Fixes: b10d6bca ("arch, drivers: replace for_each_membock() with for_each_mem_range()")
      Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
      Tested-by: Greg Kurz <groug@kaod.org>
      Reviewed-by: David Hildenbrand <david@redhat.com>
      Cc: <stable@vger.kernel.org>	[5.10+]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
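
The effect of skipping hotplug-flagged regions while building a linear map can be shown with a toy walker. This is a sketch under stated assumptions: `toy_region`, `TOY_HOTPLUG`, and `toy_mapped_bytes()` are hypothetical, and the `skip_hotplug` argument only stands in for the old behaviour with `movable_node` on the command line.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy flags; these only mirror the idea of a hotplug marker on a region. */
enum toy_flags { TOY_NONE = 0x0, TOY_HOTPLUG = 0x1 };

/* Toy region descriptor; not the kernel's struct memblock_region. */
struct toy_region {
    unsigned long long base;
    unsigned long long size;
    unsigned int flags;
};

/*
 * Sums the bytes a linear-map builder would cover. With skip_hotplug
 * set, hotplug-flagged regions are left out, which models the
 * behaviour the commit fixes; with it clear, every region is covered.
 */
static unsigned long long toy_mapped_bytes(const struct toy_region *regions,
                                           size_t count, bool skip_hotplug)
{
    unsigned long long total = 0;

    for (size_t i = 0; i < count; i++) {
        if (skip_hotplug && (regions[i].flags & TOY_HOTPLUG))
            continue; /* this region never reaches the linear map */
        total += regions[i].size;
    }
    return total;
}
```

Any later access to a region that was left out of the total has no mapping behind it, which matches the kind of fault shown in the trace above.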
  19. 01 Jul, 2021 1 commit
  20. 30 Jun, 2021 4 commits
  21. 06 Feb, 2021 1 commit
    • R
      memblock: do not start bottom-up allocations with kernel_end · 2dcb3964
      Roman Gushchin committed
      With kaslr the kernel image is placed at a random place, so starting the
      bottom-up allocation with the kernel_end can result in an allocation
      failure and a warning like this one:
      
        hugetlb_cma: reserve 2048 MiB, up to 2048 MiB per node
        ------------[ cut here ]------------
        memblock: bottom-up allocation failed, memory hotremove may be affected
        WARNING: CPU: 0 PID: 0 at mm/memblock.c:332 memblock_find_in_range_node+0x178/0x25a
        Modules linked in:
        CPU: 0 PID: 0 Comm: swapper Not tainted 5.10.0+ #1169
        Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.14.0-1.fc33 04/01/2014
        RIP: 0010:memblock_find_in_range_node+0x178/0x25a
        Code: e9 6d ff ff ff 48 85 c0 0f 85 da 00 00 00 80 3d 9b 35 df 00 00 75 15 48 c7 c7 c0 75 59 88 c6 05 8b 35 df 00 01 e8 25 8a fa ff <0f> 0b 48 c7 44 24 20 ff ff ff ff 44 89 e6 44 89 ea 48 c7 c1 70 5c
        RSP: 0000:ffffffff88803d18 EFLAGS: 00010086 ORIG_RAX: 0000000000000000
        RAX: 0000000000000000 RBX: 0000000240000000 RCX: 00000000ffffdfff
        RDX: 00000000ffffdfff RSI: 00000000ffffffea RDI: 0000000000000046
        RBP: 0000000100000000 R08: ffffffff88922788 R09: 0000000000009ffb
        R10: 00000000ffffe000 R11: 3fffffffffffffff R12: 0000000000000000
        R13: 0000000000000000 R14: 0000000080000000 R15: 00000001fb42c000
        FS:  0000000000000000(0000) GS:ffffffff88f71000(0000) knlGS:0000000000000000
        CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
        CR2: ffffa080fb401000 CR3: 00000001fa80a000 CR4: 00000000000406b0
        Call Trace:
          memblock_alloc_range_nid+0x8d/0x11e
          cma_declare_contiguous_nid+0x2c4/0x38c
          hugetlb_cma_reserve+0xdc/0x128
          flush_tlb_one_kernel+0xc/0x20
          native_set_fixmap+0x82/0xd0
          flat_get_apic_id+0x5/0x10
          register_lapic_address+0x8e/0x97
          setup_arch+0x8a5/0xc3f
          start_kernel+0x66/0x547
          load_ucode_bsp+0x4c/0xcd
          secondary_startup_64_no_verify+0xb0/0xbb
        random: get_random_bytes called from __warn+0xab/0x110 with crng_init=0
        ---[ end trace f151227d0b39be70 ]---
      
      At the same time, the kernel image is protected with memblock_reserve(),
      so we can just start searching at PAGE_SIZE.  In this case the bottom-up
      allocation has the same chance of success as a top-down allocation, so
      there is no reason to fall back in the case of a failure.  Altogether, this
      simplifies the logic.
      
      Link: https://lkml.kernel.org/r/20201217201214.3414100-2-guro@fb.com
      Fixes: 8fabc623 ("powerpc: Ensure that swiotlb buffer is allocated from low memory")
      Signed-off-by: Roman Gushchin <guro@fb.com>
      Reviewed-by: Mike Rapoport <rppt@linux.ibm.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Rik van Riel <riel@surriel.com>
      Cc: Wonhyuk Yang <vvghjk1234@gmail.com>
      Cc: Thiago Jung Bauermann <bauerman@linux.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  22. 21 Jan, 2021 1 commit
  23. 14 Jan, 2021 1 commit
  24. 16 Dec, 2020 2 commits
  25. 16 Nov, 2020 1 commit
    • F
      mm: memblock: add more debug logs · b5cf2d6c
      Faiyaz Mohammed committed
      It is useful to know the exact caller of memblock_phys_alloc_range() to
      track early memory reservations during development.
      
      Currently, when memblock debugging is enabled, the allocations done with
      memblock_phys_alloc_range() are only reported at memblock_reserve():
      
      [    0.000000] memblock_reserve: [0x000000023fc6b000-0x000000023fc6bfff] memblock_alloc_range_nid+0xc0/0x188
      
      Add memblock_dbg() to memblock_phys_alloc_range() to get details about
      its usage.
      
      For example:
      
      [    0.000000] memblock_phys_alloc_range: 4096 bytes align=0x1000 from=0x0000000000000000 max_addr=0x0000000000000000 early_pgtable_alloc+0x24/0x178
      [    0.000000] memblock_reserve: [0x000000023fc6b000-0x000000023fc6bfff] memblock_alloc_range_nid+0xc0/0x188
      Signed-off-by: Faiyaz Mohammed <faiyazm@codeaurora.org>
      Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
  26. 15 Oct, 2020 2 commits