1. 08 4月, 2020 8 次提交
  2. 06 3月, 2020 1 次提交
    • V
      mm, hotplug: fix page online with DEBUG_PAGEALLOC compiled but not enabled · c87cbc1f
      Vlastimil Babka 提交于
      Commit cd02cf1a ("mm/hotplug: fix an imbalance with DEBUG_PAGEALLOC")
      fixed memory hotplug with debug_pagealloc enabled, where onlining a page
      goes through page freeing, which removes the direct mapping.  Some arches
      don't like when the page is not mapped in the first place, so
      generic_online_page() maps it first.  This is somewhat wasteful, but
      better than special casing page freeing fast paths.
      
      The commit however missed that DEBUG_PAGEALLOC configured doesn't mean
      it's actually enabled.  One has to test debug_pagealloc_enabled() since
      031bc574 ("mm/debug-pagealloc: make debug-pagealloc boottime
      configurable"), or alternatively debug_pagealloc_enabled_static() since
      8e57f8ac ("mm, debug_pagealloc: don't rely on static keys too early"),
      but this is not done.
      
      As a result, a s390 kernel with DEBUG_PAGEALLOC configured but not enabled
      will crash:
      
      Unable to handle kernel pointer dereference in virtual kernel address space
      Failing address: 0000000000000000 TEID: 0000000000000483
      Fault in home space mode while using kernel ASCE.
      AS:0000001ece13400b R2:000003fff7fd000b R3:000003fff7fcc007 S:000003fff7fd7000 P:000000000000013d
      Oops: 0004 ilc:2 [#1] SMP
      CPU: 1 PID: 26015 Comm: chmem Kdump: loaded Tainted: GX 5.3.18-5-default #1 SLE15-SP2 (unreleased)
      Krnl PSW : 0704e00180000000 0000001ecd281b9e (__kernel_map_pages+0x166/0x188)
      R:0 T:1 IO:1 EX:1 Key:0 M:1 W:0 P:0 AS:3 CC:2 PM:0 RI:0 EA:3
      Krnl GPRS: 0000000000000000 0000000000000800 0000400b00000000 0000000000000100
      0000000000000001 0000000000000000 0000000000000002 0000000000000100
      0000001ece139230 0000001ecdd98d40 0000400b00000100 0000000000000000
      000003ffa17e4000 001fffe0114f7d08 0000001ecd4d93ea 001fffe0114f7b20
      Krnl Code: 0000001ecd281b8e: ec17ffff00d8 ahik %r1,%r7,-1
      0000001ecd281b94: ec111dbc0355 risbg %r1,%r1,29,188,3
      >0000001ecd281b9e: 94fb5006 ni 6(%r5),251
      0000001ecd281ba2: 41505008 la %r5,8(%r5)
      0000001ecd281ba6: ec51fffc6064 cgrj %r5,%r1,6,1ecd281b9e
      0000001ecd281bac: 1a07 ar %r0,%r7
      0000001ecd281bae: ec03ff584076 crj %r0,%r3,4,1ecd281a5e
      Call Trace:
      [<0000001ecd281b9e>] __kernel_map_pages+0x166/0x188
      [<0000001ecd4d9516>] online_pages_range+0xf6/0x128
      [<0000001ecd2a8186>] walk_system_ram_range+0x7e/0xd8
      [<0000001ecda28aae>] online_pages+0x2fe/0x3f0
      [<0000001ecd7d02a6>] memory_subsys_online+0x8e/0xc0
      [<0000001ecd7add42>] device_online+0x5a/0xc8
      [<0000001ecd7d0430>] state_store+0x88/0x118
      [<0000001ecd5b9f62>] kernfs_fop_write+0xc2/0x200
      [<0000001ecd5064b6>] vfs_write+0x176/0x1e0
      [<0000001ecd50676a>] ksys_write+0xa2/0x100
      [<0000001ecda315d4>] system_call+0xd8/0x2c8
      
      Fix this by checking debug_pagealloc_enabled_static() before calling
      kernel_map_pages(). Backports for kernel before 5.5 should use
      debug_pagealloc_enabled() instead. Also add comments.
      
      Fixes: cd02cf1a ("mm/hotplug: fix an imbalance with DEBUG_PAGEALLOC")
      Reported-by: NGerald Schaefer <gerald.schaefer@de.ibm.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NVlastimil Babka <vbabka@suse.cz>
      Reviewed-by: NDavid Hildenbrand <david@redhat.com>
      Cc: <stable@vger.kernel.org>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Qian Cai <cai@lca.pw>
      Link: http://lkml.kernel.org/r/20200224094651.18257-1-vbabka@suse.czSigned-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      c87cbc1f
  3. 04 2月, 2020 6 次提交
  4. 01 2月, 2020 3 次提交
    • D
      mm/memory_hotplug: pass in nid to online_pages() · bd5c2344
      David Hildenbrand 提交于
      Patch series "mm/memory_hotplug: pass in nid to online_pages()".
      
      Simplify onlining code and get rid of find_memory_block().  Pass in the
      nid from the memory block we are trying to online directly, instead of
      manually looking it up.
      
      This patch (of 2):
      
      No need to lookup the memory block, we can directly pass in the nid.
      
      Link: http://lkml.kernel.org/r/20200113113354.6341-2-david@redhat.comSigned-off-by: NDavid Hildenbrand <david@redhat.com>
      Reviewed-by: NAndrew Morton <akpm@linux-foundation.org>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: "Rafael J. Wysocki" <rafael@kernel.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: Anshuman Khandual <anshuman.khandual@arm.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      bd5c2344
    • D
      mm: remove "count" parameter from has_unmovable_pages() · fe4c86c9
      David Hildenbrand 提交于
      Now that the memory isolate notifier is gone, the parameter is always 0.
      Drop it and cleanup has_unmovable_pages().
      
      Link: http://lkml.kernel.org/r/20191114131911.11783-3-david@redhat.comSigned-off-by: NDavid Hildenbrand <david@redhat.com>
      Acked-by: NMichal Hocko <mhocko@suse.com>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: Anshuman Khandual <anshuman.khandual@arm.com>
      Cc: Qian Cai <cai@lca.pw>
      Cc: Pingfan Liu <kernelfans@gmail.com>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
      Cc: Wei Yang <richardw.yang@linux.intel.com>
      Cc: Alexander Duyck <alexander.h.duyck@linux.intel.com>
      Cc: Alexander Potapenko <glider@google.com>
      Cc: Arun KS <arunks@codeaurora.org>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      fe4c86c9
    • D
      mm/memory_hotplug: fix remove_memory() lockdep splat · f1037ec0
      Dan Williams 提交于
      The daxctl unit test for the dax_kmem driver currently triggers the
      (false positive) lockdep splat below.  It results from the fact that
      remove_memory_block_devices() is invoked under the mem_hotplug_lock()
      causing lockdep entanglements with cpu_hotplug_lock() and sysfs (kernfs
      active state tracking).  It is a false positive because the sysfs
      attribute path triggering the memory remove is not the same attribute
      path associated with memory-block device.
      
      sysfs_break_active_protection() is not applicable since there is no real
      deadlock conflict, instead move memory-block device removal outside the
      lock.  The mem_hotplug_lock() is not needed to synchronize the
      memory-block device removal vs the page online state, that is already
      handled by lock_device_hotplug().  Specifically, lock_device_hotplug()
      is sufficient to allow try_remove_memory() to check the offline state of
      the memblocks and be assured that any in progress online attempts are
      flushed / blocked by kernfs_drain() / attribute removal.
      
      The add_memory() path safely creates memblock devices under the
      mem_hotplug_lock().  There is no kernfs active state synchronization in
      the memblock device_register() path, so nothing to fix there.
      
      This change is only possible thanks to the recent change that refactored
      memory block device removal out of arch_remove_memory() (commit
      4c4b7f9b "mm/memory_hotplug: remove memory block devices before
      arch_remove_memory()"), and David's due diligence tracking down the
      guarantees afforded by kernfs_drain().  Not flagged for -stable since
      this only impacts ongoing development and lockdep validation, not a
      runtime issue.
      
          ======================================================
          WARNING: possible circular locking dependency detected
          5.5.0-rc3+ #230 Tainted: G           OE
          ------------------------------------------------------
          lt-daxctl/6459 is trying to acquire lock:
          ffff99c7f0003510 (kn->count#241){++++}, at: kernfs_remove_by_name_ns+0x41/0x80
      
          but task is already holding lock:
          ffffffffa76a5450 (mem_hotplug_lock.rw_sem){++++}, at: percpu_down_write+0x20/0xe0
      
          which lock already depends on the new lock.
      
          the existing dependency chain (in reverse order) is:
      
          -> #2 (mem_hotplug_lock.rw_sem){++++}:
                 __lock_acquire+0x39c/0x790
                 lock_acquire+0xa2/0x1b0
                 get_online_mems+0x3e/0xb0
                 kmem_cache_create_usercopy+0x2e/0x260
                 kmem_cache_create+0x12/0x20
                 ptlock_cache_init+0x20/0x28
                 start_kernel+0x243/0x547
                 secondary_startup_64+0xb6/0xc0
      
          -> #1 (cpu_hotplug_lock.rw_sem){++++}:
                 __lock_acquire+0x39c/0x790
                 lock_acquire+0xa2/0x1b0
                 cpus_read_lock+0x3e/0xb0
                 online_pages+0x37/0x300
                 memory_subsys_online+0x17d/0x1c0
                 device_online+0x60/0x80
                 state_store+0x65/0xd0
                 kernfs_fop_write+0xcf/0x1c0
                 vfs_write+0xdb/0x1d0
                 ksys_write+0x65/0xe0
                 do_syscall_64+0x5c/0xa0
                 entry_SYSCALL_64_after_hwframe+0x49/0xbe
      
          -> #0 (kn->count#241){++++}:
                 check_prev_add+0x98/0xa40
                 validate_chain+0x576/0x860
                 __lock_acquire+0x39c/0x790
                 lock_acquire+0xa2/0x1b0
                 __kernfs_remove+0x25f/0x2e0
                 kernfs_remove_by_name_ns+0x41/0x80
                 remove_files.isra.0+0x30/0x70
                 sysfs_remove_group+0x3d/0x80
                 sysfs_remove_groups+0x29/0x40
                 device_remove_attrs+0x39/0x70
                 device_del+0x16a/0x3f0
                 device_unregister+0x16/0x60
                 remove_memory_block_devices+0x82/0xb0
                 try_remove_memory+0xb5/0x130
                 remove_memory+0x26/0x40
                 dev_dax_kmem_remove+0x44/0x6a [kmem]
                 device_release_driver_internal+0xe4/0x1c0
                 unbind_store+0xef/0x120
                 kernfs_fop_write+0xcf/0x1c0
                 vfs_write+0xdb/0x1d0
                 ksys_write+0x65/0xe0
                 do_syscall_64+0x5c/0xa0
                 entry_SYSCALL_64_after_hwframe+0x49/0xbe
      
          other info that might help us debug this:
      
          Chain exists of:
            kn->count#241 --> cpu_hotplug_lock.rw_sem --> mem_hotplug_lock.rw_sem
      
           Possible unsafe locking scenario:
      
                 CPU0                    CPU1
                 ----                    ----
            lock(mem_hotplug_lock.rw_sem);
                                         lock(cpu_hotplug_lock.rw_sem);
                                         lock(mem_hotplug_lock.rw_sem);
            lock(kn->count#241);
      
           *** DEADLOCK ***
      
      No fixes tag as this has been a long standing issue that predated the
      addition of kernfs lockdep annotations.
      
      Link: http://lkml.kernel.org/r/157991441887.2763922.4770790047389427325.stgit@dwillia2-desk3.amr.corp.intel.comSigned-off-by: NDan Williams <dan.j.williams@intel.com>
      Acked-by: NMichal Hocko <mhocko@suse.com>
      Reviewed-by: NDavid Hildenbrand <david@redhat.com>
      Cc: Vishal Verma <vishal.l.verma@intel.com>
      Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      f1037ec0
  5. 05 1月, 2020 1 次提交
    • D
      mm/memory_hotplug: shrink zones when offlining memory · feee6b29
      David Hildenbrand 提交于
      We currently try to shrink a single zone when removing memory.  We use
      the zone of the first page of the memory we are removing.  If that
      memmap was never initialized (e.g., memory was never onlined), we will
      read garbage and can trigger kernel BUGs (due to a stale pointer):
      
          BUG: unable to handle page fault for address: 000000000000353d
          #PF: supervisor write access in kernel mode
          #PF: error_code(0x0002) - not-present page
          PGD 0 P4D 0
          Oops: 0002 [#1] SMP PTI
          CPU: 1 PID: 7 Comm: kworker/u8:0 Not tainted 5.3.0-rc5-next-20190820+ #317
          Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.1-0-ga5cab58e9a3f-prebuilt.qemu.4
          Workqueue: kacpi_hotplug acpi_hotplug_work_fn
          RIP: 0010:clear_zone_contiguous+0x5/0x10
          Code: 48 89 c6 48 89 c3 e8 2a fe ff ff 48 85 c0 75 cf 5b 5d c3 c6 85 fd 05 00 00 01 5b 5d c3 0f 1f 840
          RSP: 0018:ffffad2400043c98 EFLAGS: 00010246
          RAX: 0000000000000000 RBX: 0000000200000000 RCX: 0000000000000000
          RDX: 0000000000200000 RSI: 0000000000140000 RDI: 0000000000002f40
          RBP: 0000000140000000 R08: 0000000000000000 R09: 0000000000000001
          R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000140000
          R13: 0000000000140000 R14: 0000000000002f40 R15: ffff9e3e7aff3680
          FS:  0000000000000000(0000) GS:ffff9e3e7bb00000(0000) knlGS:0000000000000000
          CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
          CR2: 000000000000353d CR3: 0000000058610000 CR4: 00000000000006e0
          DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
          DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
          Call Trace:
           __remove_pages+0x4b/0x640
           arch_remove_memory+0x63/0x8d
           try_remove_memory+0xdb/0x130
           __remove_memory+0xa/0x11
           acpi_memory_device_remove+0x70/0x100
           acpi_bus_trim+0x55/0x90
           acpi_device_hotplug+0x227/0x3a0
           acpi_hotplug_work_fn+0x1a/0x30
           process_one_work+0x221/0x550
           worker_thread+0x50/0x3b0
           kthread+0x105/0x140
           ret_from_fork+0x3a/0x50
          Modules linked in:
          CR2: 000000000000353d
      
      Instead, shrink the zones when offlining memory or when onlining failed.
      Introduce and use remove_pfn_range_from_zone(() for that.  We now
      properly shrink the zones, even if we have DIMMs whereby
      
       - Some memory blocks fall into no zone (never onlined)
      
       - Some memory blocks fall into multiple zones (offlined+re-onlined)
      
       - Multiple memory blocks that fall into different zones
      
      Drop the zone parameter (with a potential dubious value) from
      __remove_pages() and __remove_section().
      
      Link: http://lkml.kernel.org/r/20191006085646.5768-6-david@redhat.com
      Fixes: f1dd2cd1 ("mm, memory_hotplug: do not associate hotadded memory to zones until online")	[visible after d0dc12e8]
      Signed-off-by: NDavid Hildenbrand <david@redhat.com>
      Reviewed-by: NOscar Salvador <osalvador@suse.de>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
      Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com>
      Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Logan Gunthorpe <logang@deltatee.com>
      Cc: <stable@vger.kernel.org>	[5.0+]
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      feee6b29
  6. 02 12月, 2019 8 次提交
  7. 23 11月, 2019 1 次提交
    • D
      mm/memory_hotplug: don't access uninitialized memmaps in shrink_zone_span() · 7ce700bf
      David Hildenbrand 提交于
      Let's limit shrinking to !ZONE_DEVICE so we can fix the current code.
      We should never try to touch the memmap of offline sections where we
      could have uninitialized memmaps and could trigger BUGs when calling
      page_to_nid() on poisoned pages.
      
      There is no reliable way to distinguish an uninitialized memmap from an
      initialized memmap that belongs to ZONE_DEVICE, as we don't have
      anything like SECTION_IS_ONLINE we can use similar to
      pfn_to_online_section() for !ZONE_DEVICE memory.
      
      E.g., set_zone_contiguous() similarly relies on pfn_to_online_section()
      and will therefore never set a ZONE_DEVICE zone consecutive.  Stopping
      to shrink the ZONE_DEVICE therefore results in no observable changes,
      besides /proc/zoneinfo indicating different boundaries - something we
      can totally live with.
      
      Before commit d0dc12e8 ("mm/memory_hotplug: optimize memory
      hotplug"), the memmap was initialized with 0 and the node with the right
      value.  So the zone might be wrong but not garbage.  After that commit,
      both the zone and the node will be garbage when touching uninitialized
      memmaps.
      
      Toshiki reported a BUG (race between delayed initialization of
      ZONE_DEVICE memmaps without holding the memory hotplug lock and
      concurrent zone shrinking).
      
        https://lkml.org/lkml/2019/11/14/1040
      
      "Iteration of create and destroy namespace causes the panic as below:
      
            kernel BUG at mm/page_alloc.c:535!
            CPU: 7 PID: 2766 Comm: ndctl Not tainted 5.4.0-rc4 #6
            Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.11.0-0-g63451fca13-prebuilt.qemu-project.org 04/01/2014
            RIP: 0010:set_pfnblock_flags_mask+0x95/0xf0
            Call Trace:
             memmap_init_zone_device+0x165/0x17c
             memremap_pages+0x4c1/0x540
             devm_memremap_pages+0x1d/0x60
             pmem_attach_disk+0x16b/0x600 [nd_pmem]
             nvdimm_bus_probe+0x69/0x1c0
             really_probe+0x1c2/0x3e0
             driver_probe_device+0xb4/0x100
             device_driver_attach+0x4f/0x60
             bind_store+0xc9/0x110
             kernfs_fop_write+0x116/0x190
             vfs_write+0xa5/0x1a0
             ksys_write+0x59/0xd0
             do_syscall_64+0x5b/0x180
             entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
        While creating a namespace and initializing memmap, if you destroy the
        namespace and shrink the zone, it will initialize the memmap outside
        the zone and trigger VM_BUG_ON_PAGE(!zone_spans_pfn(page_zone(page),
        pfn), page) in set_pfnblock_flags_mask()."
      
      This BUG is also mitigated by this commit, where we for now stop to
      shrink the ZONE_DEVICE zone until we can do it in a safe and clean way.
      
      Link: http://lkml.kernel.org/r/20191006085646.5768-5-david@redhat.com
      Fixes: f1dd2cd1 ("mm, memory_hotplug: do not associate hotadded memory to zones until online")	[visible after d0dc12e8]
      Signed-off-by: NDavid Hildenbrand <david@redhat.com>
      Reported-by: NAneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Reported-by: NToshiki Fukasawa <t-fukasawa@vx.jp.nec.com>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Alexander Duyck <alexander.h.duyck@linux.intel.com>
      Cc: Alexander Potapenko <glider@google.com>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Anshuman Khandual <anshuman.khandual@arm.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Christian Borntraeger <borntraeger@de.ibm.com>
      Cc: Christophe Leroy <christophe.leroy@c-s.fr>
      Cc: Damian Tometzki <damian.tometzki@gmail.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Fenghua Yu <fenghua.yu@intel.com>
      Cc: Gerald Schaefer <gerald.schaefer@de.ibm.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Halil Pasic <pasic@linux.ibm.com>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Ira Weiny <ira.weiny@intel.com>
      Cc: Jason Gunthorpe <jgg@ziepe.ca>
      Cc: Jun Yao <yaojun8558363@gmail.com>
      Cc: Logan Gunthorpe <logang@deltatee.com>
      Cc: Mark Rutland <mark.rutland@arm.com>
      Cc: Masahiro Yamada <yamada.masahiro@socionext.com>
      Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Mike Rapoport <rppt@linux.ibm.com>
      Cc: Pankaj Gupta <pagupta@redhat.com>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Pavel Tatashin <pavel.tatashin@microsoft.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Qian Cai <cai@lca.pw>
      Cc: Rich Felker <dalias@libc.org>
      Cc: Robin Murphy <robin.murphy@arm.com>
      Cc: Steve Capper <steve.capper@arm.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Tom Lendacky <thomas.lendacky@amd.com>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: Vasily Gorbik <gor@linux.ibm.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Wei Yang <richard.weiyang@gmail.com>
      Cc: Wei Yang <richardw.yang@linux.intel.com>
      Cc: Will Deacon <will@kernel.org>
      Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
      Cc: Yu Zhao <yuzhao@google.com>
      Cc: <stable@vger.kernel.org>	[4.13+]
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      7ce700bf
  8. 16 11月, 2019 1 次提交
    • D
      mm/memory_hotplug: fix try_offline_node() · 2c91f8fc
      David Hildenbrand 提交于
      try_offline_node() is pretty much broken right now:
      
       - The node span is updated when onlining memory, not when adding it. We
         ignore memory that was mever onlined. Bad.
      
       - We touch possible garbage memmaps. The pfn_to_nid(pfn) can easily
         trigger a kernel panic. Bad for memory that is offline but also bad
         for subsection hotadd with ZONE_DEVICE, whereby the memmap of the
         first PFN of a section might contain garbage.
      
       - Sections belonging to mixed nodes are not properly considered.
      
      As memory blocks might belong to multiple nodes, we would have to walk
      all pageblocks (or at least subsections) within present sections.
      However, we don't have a way to identify whether a memmap that is not
      online was initialized (relevant for ZONE_DEVICE).  This makes things
      more complicated.
      
      Luckily, we can piggy pack on the node span and the nid stored in memory
      blocks.  Currently, the node span is grown when calling
      move_pfn_range_to_zone() - e.g., when onlining memory, and shrunk when
      removing memory, before calling try_offline_node().  Sysfs links are
      created via link_mem_sections(), e.g., during boot or when adding
      memory.
      
      If the node still spans memory or if any memory block belongs to the
      nid, we don't set the node offline.  As memory blocks that span multiple
      nodes cannot get offlined, the nid stored in memory blocks is reliable
      enough (for such online memory blocks, the node still spans the memory).
      
      Introduce for_each_memory_block() to efficiently walk all memory blocks.
      
      Note: We will soon stop shrinking the ZONE_DEVICE zone and the node span
      when removing ZONE_DEVICE memory to fix similar issues (access of
      garbage memmaps) - until we have a reliable way to identify whether
      these memmaps were properly initialized.  This implies later, that once
      a node had ZONE_DEVICE memory, we won't be able to set a node offline -
      which should be acceptable.
      
      Since commit f1dd2cd1 ("mm, memory_hotplug: do not associate
      hotadded memory to zones until online") memory that is added is not
      assoziated with a zone/node (memmap not initialized).  The introducing
      commit 60a5a19e ("memory-hotplug: remove sysfs file of node")
      already missed that we could have multiple nodes for a section and that
      the zone/node span is updated when onlining pages, not when adding them.
      
      I tested this by hotplugging two DIMMs to a memory-less and cpu-less
      NUMA node.  The node is properly onlined when adding the DIMMs.  When
      removing the DIMMs, the node is properly offlined.
      
      Masayoshi Mizuma reported:
      
      : Without this patch, memory hotplug fails as panic:
      :
      :  BUG: kernel NULL pointer dereference, address: 0000000000000000
      :  ...
      :  Call Trace:
      :   remove_memory_block_devices+0x81/0xc0
      :   try_remove_memory+0xb4/0x130
      :   __remove_memory+0xa/0x20
      :   acpi_memory_device_remove+0x84/0x100
      :   acpi_bus_trim+0x57/0x90
      :   acpi_bus_trim+0x2e/0x90
      :   acpi_device_hotplug+0x2b2/0x4d0
      :   acpi_hotplug_work_fn+0x1a/0x30
      :   process_one_work+0x171/0x380
      :   worker_thread+0x49/0x3f0
      :   kthread+0xf8/0x130
      :   ret_from_fork+0x35/0x40
      
      [david@redhat.com: v3]
        Link: http://lkml.kernel.org/r/20191102120221.7553-1-david@redhat.com
      Link: http://lkml.kernel.org/r/20191028105458.28320-1-david@redhat.com
      Fixes: 60a5a19e ("memory-hotplug: remove sysfs file of node")
      Fixes: f1dd2cd1 ("mm, memory_hotplug: do not associate hotadded memory to zones until online") # visiable after d0dc12e8Signed-off-by: NDavid Hildenbrand <david@redhat.com>
      Tested-by: NMasayoshi Mizuma <m.mizuma@jp.fujitsu.com>
      Cc: Tang Chen <tangchen@cn.fujitsu.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: "Rafael J. Wysocki" <rafael@kernel.org>
      Cc: Keith Busch <keith.busch@intel.com>
      Cc: Jiri Olsa <jolsa@kernel.org>
      Cc: "Peter Zijlstra (Intel)" <peterz@infradead.org>
      Cc: Jani Nikula <jani.nikula@intel.com>
      Cc: Nayna Jain <nayna@linux.ibm.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      2c91f8fc
  9. 07 11月, 2019 1 次提交
    • D
      mm/memory_hotplug: fix updating the node span · 656d5711
      David Hildenbrand 提交于
      We recently started updating the node span based on the zone span to
      avoid touching uninitialized memmaps.
      
      Currently, we will always detect the node span to start at 0, meaning a
      node can easily span too many pages.  pgdat_is_empty() will still work
      correctly if all zones span no pages.  We should skip over all zones
      without spanned pages and properly handle the first detected zone that
      spans pages.
      
      Unfortunately, in contrast to the zone span (/proc/zoneinfo), the node
      span cannot easily be inspected and tested.  The node span gives no real
      guarantees when an architecture supports memory hotplug, meaning it can
      easily contain holes or span pages of different nodes.
      
      The node span is not really used after init on architectures that
      support memory hotplug.
      
      E.g., we use it in mm/memory_hotplug.c:try_offline_node() and in
      mm/kmemleak.c:kmemleak_scan().  These users seem to be fine.
      
      Link: http://lkml.kernel.org/r/20191027222714.5313-1-david@redhat.com
      Fixes: 00d6c019 ("mm/memory_hotplug: don't access uninitialized memmaps in shrink_pgdat_span()")
      Signed-off-by: NDavid Hildenbrand <david@redhat.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      656d5711
  10. 19 10月, 2019 1 次提交
    • D
      mm/memory_hotplug: don't access uninitialized memmaps in shrink_pgdat_span() · 00d6c019
      David Hildenbrand 提交于
      We might use the nid of memmaps that were never initialized.  For
      example, if the memmap was poisoned, we will crash the kernel in
      pfn_to_nid() right now.  Let's use the calculated boundaries of the
      separate zones instead.  This now also avoids having to iterate over a
      whole bunch of subsections again, after shrinking one zone.
      
      Before commit d0dc12e8 ("mm/memory_hotplug: optimize memory
      hotplug"), the memmap was initialized to 0 and the node was set to the
      right value.  After that commit, the node might be garbage.
      
      We'll have to fix shrink_zone_span() next.
      
      Link: http://lkml.kernel.org/r/20191006085646.5768-4-david@redhat.com
      Fixes: f1dd2cd1 ("mm, memory_hotplug: do not associate hotadded memory to zones until online")	[d0dc12e8]
      Signed-off-by: NDavid Hildenbrand <david@redhat.com>
      Reported-by: NAneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Wei Yang <richardw.yang@linux.intel.com>
      Cc: Alexander Duyck <alexander.h.duyck@linux.intel.com>
      Cc: Alexander Potapenko <glider@google.com>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Anshuman Khandual <anshuman.khandual@arm.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Christian Borntraeger <borntraeger@de.ibm.com>
      Cc: Christophe Leroy <christophe.leroy@c-s.fr>
      Cc: Damian Tometzki <damian.tometzki@gmail.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Fenghua Yu <fenghua.yu@intel.com>
      Cc: Gerald Schaefer <gerald.schaefer@de.ibm.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Halil Pasic <pasic@linux.ibm.com>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Ira Weiny <ira.weiny@intel.com>
      Cc: Jason Gunthorpe <jgg@ziepe.ca>
      Cc: Jun Yao <yaojun8558363@gmail.com>
      Cc: Logan Gunthorpe <logang@deltatee.com>
      Cc: Mark Rutland <mark.rutland@arm.com>
      Cc: Masahiro Yamada <yamada.masahiro@socionext.com>
      Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Mike Rapoport <rppt@linux.ibm.com>
      Cc: Pankaj Gupta <pagupta@redhat.com>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Pavel Tatashin <pavel.tatashin@microsoft.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Qian Cai <cai@lca.pw>
      Cc: Rich Felker <dalias@libc.org>
      Cc: Robin Murphy <robin.murphy@arm.com>
      Cc: Steve Capper <steve.capper@arm.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Tom Lendacky <thomas.lendacky@amd.com>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: Vasily Gorbik <gor@linux.ibm.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Wei Yang <richard.weiyang@gmail.com>
      Cc: Will Deacon <will@kernel.org>
      Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
      Cc: Yu Zhao <yuzhao@google.com>
      Cc: <stable@vger.kernel.org>	[4.13+]
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      00d6c019
  11. 25 9月, 2019 9 次提交