1. 13 8月, 2020 4 次提交
    • J
      mm/migrate: introduce a standard migration target allocation function · 19fc7bed
      Joonsoo Kim 提交于
      There are some similar functions for migration target allocation.  Since
      there is no fundamental difference, it's better to keep just one rather
      than keeping all variants.  This patch implements base migration target
      allocation function.  In the following patches, variants will be converted
      to use this function.
      
      Changes should be mechanical, but, unfortunately, there are some
      differences.  First, some callers' nodemask is assgined to NULL since NULL
      nodemask will be considered as all available nodes, that is,
      &node_states[N_MEMORY].  Second, for hugetlb page allocation, gfp_mask is
      redefined as regular hugetlb allocation gfp_mask plus __GFP_THISNODE if
      user provided gfp_mask has it.  This is because future caller of this
      function requires to set this node constaint.  Lastly, if provided nodeid
      is NUMA_NO_NODE, nodeid is set up to the node where migration source
      lives.  It helps to remove simple wrappers for setting up the nodeid.
      
      Note that PageHighmem() call in previous function is changed to open-code
      "is_highmem_idx()" since it provides more readability.
      
      [akpm@linux-foundation.org: tweak patch title, per Vlastimil]
      [akpm@linux-foundation.org: fix typo in comment]
      Signed-off-by: NJoonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Acked-by: NVlastimil Babka <vbabka@suse.cz>
      Acked-by: NMichal Hocko <mhocko@suse.com>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Roman Gushchin <guro@fb.com>
      Link: http://lkml.kernel.org/r/1594622517-20681-6-git-send-email-iamjoonsoo.kim@lge.comSigned-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      19fc7bed
    • C
      mm, memory_hotplug: update pcp lists everytime onlining a memory block · de1193f0
      Charan Teja Reddy 提交于
      When onlining a first memory block in a zone, pcp lists are not updated
      thus pcp struct will have the default setting of ->high = 0,->batch = 1.
      
      This means till the second memory block in a zone(if it have) is onlined
      the pcp lists of this zone will not contain any pages because pcp's
      ->count is always greater than ->high thus free_pcppages_bulk() is called
      to free batch size(=1) pages every time system wants to add a page to the
      pcp list through free_unref_page().
      
      To put this in a word, system is not using benefits offered by the pcp
      lists when there is a single onlineable memory block in a zone.  Correct
      this by always updating the pcp lists when memory block is onlined.
      
      Fixes: 1f522509 ("mem-hotplug: avoid multiple zones sharing same boot strapping boot_pageset")
      Signed-off-by: NCharan Teja Reddy <charante@codeaurora.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Reviewed-by: NDavid Hildenbrand <david@redhat.com>
      Acked-by: NVlastimil Babka <vbabka@suse.cz>
      Acked-by: NMichal Hocko <mhocko@suse.com>
      Cc: Vinayak Menon <vinmenon@codeaurora.org>
      Link: http://lkml.kernel.org/r/1596372896-15336-1-git-send-email-charante@codeaurora.orgSigned-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      de1193f0
    • J
      mm/memory_hotplug: fix unpaired mem_hotplug_begin/done · b4223a51
      Jia He 提交于
      When check_memblock_offlined_cb() returns failed rc(e.g. the memblock is
      online at that time), mem_hotplug_begin/done is unpaired in such case.
      
      Therefore a warning:
       Call Trace:
        percpu_up_write+0x33/0x40
        try_remove_memory+0x66/0x120
        ? _cond_resched+0x19/0x30
        remove_memory+0x2b/0x40
        dev_dax_kmem_remove+0x36/0x72 [kmem]
        device_release_driver_internal+0xf0/0x1c0
        device_release_driver+0x12/0x20
        bus_remove_device+0xe1/0x150
        device_del+0x17b/0x3e0
        unregister_dev_dax+0x29/0x60
        devm_action_release+0x15/0x20
        release_nodes+0x19a/0x1e0
        devres_release_all+0x3f/0x50
        device_release_driver_internal+0x100/0x1c0
        driver_detach+0x4c/0x8f
        bus_remove_driver+0x5c/0xd0
        driver_unregister+0x31/0x50
        dax_pmem_exit+0x10/0xfe0 [dax_pmem]
      
      Fixes: f1037ec0 ("mm/memory_hotplug: fix remove_memory() lockdep splat")
      Signed-off-by: NJia He <justin.he@arm.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Reviewed-by: NDavid Hildenbrand <david@redhat.com>
      Acked-by: NMichal Hocko <mhocko@suse.com>
      Acked-by: NDan Williams <dan.j.williams@intel.com>
      Cc: <stable@vger.kernel.org>	[5.6+]
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Baoquan He <bhe@redhat.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Chuhong Yuan <hslester96@gmail.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Dave Jiang <dave.jiang@intel.com>
      Cc: Fenghua Yu <fenghua.yu@intel.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Jonathan Cameron <Jonathan.Cameron@Huawei.com>
      Cc: Kaly Xin <Kaly.Xin@arm.com>
      Cc: Logan Gunthorpe <logang@deltatee.com>
      Cc: Masahiro Yamada <masahiroy@kernel.org>
      Cc: Mike Rapoport <rppt@linux.ibm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Rich Felker <dalias@libc.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: Vishal Verma <vishal.l.verma@intel.com>
      Cc: Will Deacon <will@kernel.org>
      Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
      Link: http://lkml.kernel.org/r/20200710031619.18762-3-justin.he@arm.comSigned-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      b4223a51
    • J
      mm/memory_hotplug: introduce default dummy memory_add_physaddr_to_nid() · d622ecec
      Jia He 提交于
      This is to introduce a general dummy helper.  memory_add_physaddr_to_nid()
      is a fallback option to get the nid in case NUMA_NO_NID is detected.
      
      After this patch, arm64/sh/s390 can simply use the general dummy version.
      PowerPC/x86/ia64 will still use their specific version.
      
      This is the preparation to set a fallback value for dev_dax->target_node.
      Signed-off-by: NJia He <justin.he@arm.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Reviewed-by: NDavid Hildenbrand <david@redhat.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Will Deacon <will@kernel.org>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: Fenghua Yu <fenghua.yu@intel.com>
      Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
      Cc: Rich Felker <dalias@libc.org>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Vishal Verma <vishal.l.verma@intel.com>
      Cc: Dave Jiang <dave.jiang@intel.com>
      Cc: Baoquan He <bhe@redhat.com>
      Cc: Chuhong Yuan <hslester96@gmail.com>
      Cc: Mike Rapoport <rppt@linux.ibm.com>
      Cc: Logan Gunthorpe <logang@deltatee.com>
      Cc: Masahiro Yamada <masahiroy@kernel.org>
      Cc: Jonathan Cameron <Jonathan.Cameron@Huawei.com>
      Cc: Kaly Xin <Kaly.Xin@arm.com>
      Link: http://lkml.kernel.org/r/20200710031619.18762-2-justin.he@arm.comSigned-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      d622ecec
  2. 08 8月, 2020 2 次提交
  3. 26 6月, 2020 1 次提交
    • B
      mm/memory_hotplug.c: fix false softlockup during pfn range removal · b7e3debd
      Ben Widawsky 提交于
      When working with very large nodes, poisoning the struct pages (for which
      there will be very many) can take a very long time.  If the system is
      using voluntary preemptions, the software watchdog will not be able to
      detect forward progress.  This patch addresses this issue by offering to
      give up time like __remove_pages() does.  This behavior was introduced in
      v5.6 with: commit d33695b1 ("mm/memory_hotplug: poison memmap in
      remove_pfn_range_from_zone()")
      
      Alternately, init_page_poison could do this cond_resched(), but it seems
      to me that the caller of init_page_poison() is what actually knows whether
      or not it should relax its own priority.
      
      Based on Dan's notes, I think this is perfectly safe: commit f931ab47
      ("mm: fix devm_memremap_pages crash, use mem_hotplug_{begin, done}")
      
      Aside from fixing the lockup, it is also a friendlier thing to do on lower
      core systems that might wipe out large chunks of hotplug memory (probably
      not a very common case).
      
      Fixes this kind of splat:
      
        watchdog: BUG: soft lockup - CPU#46 stuck for 22s! [daxctl:9922]
        irq event stamp: 138450
        hardirqs last  enabled at (138449): [<ffffffffa1001f26>] trace_hardirqs_on_thunk+0x1a/0x1c
        hardirqs last disabled at (138450): [<ffffffffa1001f42>] trace_hardirqs_off_thunk+0x1a/0x1c
        softirqs last  enabled at (138448): [<ffffffffa1e00347>] __do_softirq+0x347/0x456
        softirqs last disabled at (138443): [<ffffffffa10c416d>] irq_exit+0x7d/0xb0
        CPU: 46 PID: 9922 Comm: daxctl Not tainted 5.7.0-BEN-14238-g373c6049b336 #30
        Hardware name: Intel Corporation PURLEY/PURLEY, BIOS PLYXCRB1.86B.0578.D07.1902280810 02/28/2019
        RIP: 0010:memset_erms+0x9/0x10
        Code: c1 e9 03 40 0f b6 f6 48 b8 01 01 01 01 01 01 01 01 48 0f af c6 f3 48 ab 89 d1 f3 aa 4c 89 c8 c3 90 49 89 f9 40 88 f0 48 89 d1 <f3> aa 4c 89 c8 c3 90 49 89 fa 40 0f b6 ce 48 b8 01 01 01 01 01 01
        Call Trace:
         remove_pfn_range_from_zone+0x3a/0x380
         memunmap_pages+0x17f/0x280
         release_nodes+0x22a/0x260
         __device_release_driver+0x172/0x220
         device_driver_detach+0x3e/0xa0
         unbind_store+0x113/0x130
         kernfs_fop_write+0xdc/0x1c0
         vfs_write+0xde/0x1d0
         ksys_write+0x58/0xd0
         do_syscall_64+0x5a/0x120
         entry_SYSCALL_64_after_hwframe+0x49/0xb3
        Built 2 zonelists, mobility grouping on.  Total pages: 49050381
        Policy zone: Normal
        Built 3 zonelists, mobility grouping on.  Total pages: 49312525
        Policy zone: Normal
      
      David said: "It really only is an issue for devmem.  Ordinary
      hotplugged system memory is not affected (onlined/offlined in memory
      block granularity)."
      
      Link: http://lkml.kernel.org/r/20200619231213.1160351-1-ben.widawsky@intel.com
      Fixes: commit d33695b1 ("mm/memory_hotplug: poison memmap in remove_pfn_range_from_zone()")
      Signed-off-by: NBen Widawsky <ben.widawsky@intel.com>
      Reported-by: N"Scargall, Steve" <steve.scargall@intel.com>
      Reported-by: NBen Widawsky <ben.widawsky@intel.com>
      Acked-by: NDavid Hildenbrand <david@redhat.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Vishal Verma <vishal.l.verma@intel.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      b7e3debd
  4. 05 6月, 2020 8 次提交
    • E
    • D
      mm/memory_hotplug: introduce add_memory_driver_managed() · 7b7b2721
      David Hildenbrand 提交于
      Patch series "mm/memory_hotplug: Interface to add driver-managed system
      ram", v4.
      
      kexec (via kexec_load()) can currently not properly handle memory added
      via dax/kmem, and will have similar issues with virtio-mem.  kexec-tools
      will currently add all memory to the fixed-up initial firmware memmap.  In
      case of dax/kmem, this means that - in contrast to a proper reboot - how
      that persistent memory will be used can no longer be configured by the
      kexec'd kernel.  In case of virtio-mem it will be harmful, because that
      memory might contain inaccessible pieces that require coordination with
      hypervisor first.
      
      In both cases, we want to let the driver in the kexec'd kernel handle
      detecting and adding the memory, like during an ordinary reboot.
      Introduce add_memory_driver_managed().  More on the samentics are in patch
      #1.
      
      In the future, we might want to make this behavior configurable for
      dax/kmem- either by configuring it in the kernel (which would then also
      allow to configure kexec_file_load()) or in kexec-tools by also adding
      "System RAM (kmem)" memory from /proc/iomem to the fixed-up initial
      firmware memmap.
      
      More on the motivation can be found in [1] and [2].
      
      [1] https://lkml.kernel.org/r/20200429160803.109056-1-david@redhat.com
      [2] https://lkml.kernel.org/r/20200430102908.10107-1-david@redhat.com
      
      This patch (of 3):
      
      Some device drivers rely on memory they managed to not get added to the
      initial (firmware) memmap as system RAM - so it's not used as initial
      system RAM by the kernel and the driver is under control.  While this is
      the case during cold boot and after a reboot, kexec is not aware of that
      and might add such memory to the initial (firmware) memmap of the kexec
      kernel.  We need ways to teach kernel and userspace that this system ram
      is different.
      
      For example, dax/kmem allows to decide at runtime if persistent memory is
      to be used as system ram.  Another future user is virtio-mem, which has to
      coordinate with its hypervisor to deal with inaccessible parts within
      memory resources.
      
      We want to let users in the kernel (esp. kexec) but also user space
      (esp. kexec-tools) know that this memory has different semantics and
      needs to be handled differently:
      1. Don't create entries in /sys/firmware/memmap/
      2. Name the memory resource "System RAM ($DRIVER)" (exposed via
         /proc/iomem) ($DRIVER might be "kmem", "virtio_mem").
      3. Flag the memory resource IORESOURCE_MEM_DRIVER_MANAGED
      
      /sys/firmware/memmap/ [1] represents the "raw firmware-provided memory
      map" because "on most architectures that firmware-provided memory map is
      modified afterwards by the kernel itself".  The primary user is kexec on
      x86-64.  Since commit d96ae530 ("memory-hotplug: create
      /sys/firmware/memmap entry for new memory"), we add all hotplugged memory
      to that firmware memmap - which makes perfect sense for traditional memory
      hotplug on x86-64, where real HW will also add hotplugged DIMMs to the
      firmware memmap.  We replicate what the "raw firmware-provided memory map"
      looks like after hot(un)plug.
      
      To keep things simple, let the user provide the full resource name instead
      of only the driver name - this way, we don't have to manually
      allocate/craft strings for memory resources.  Also use the resource name
      to make decisions, to avoid passing additional flags.  In case the name
      isn't "System RAM", it's special.
      
      We don't have to worry about firmware_map_remove() on the removal path.
      If there is no entry, it will simply return with -EINVAL.
      
      We'll adapt dax/kmem in a follow-up patch.
      
      [1] https://www.kernel.org/doc/Documentation/ABI/testing/sysfs-firmware-memmapSigned-off-by: NDavid Hildenbrand <david@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Acked-by: NPankaj Gupta <pankaj.gupta.linux@gmail.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Pankaj Gupta <pankaj.gupta.linux@gmail.com>
      Cc: Wei Yang <richard.weiyang@gmail.com>
      Cc: Baoquan He <bhe@redhat.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Eric Biederman <ebiederm@xmission.com>
      Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Link: http://lkml.kernel.org/r/20200508084217.9160-1-david@redhat.com
      Link: http://lkml.kernel.org/r/20200508084217.9160-3-david@redhat.comSigned-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      7b7b2721
    • D
      mm/memory_hotplug: handle memblocks only with CONFIG_ARCH_KEEP_MEMBLOCK · 52219aea
      David Hildenbrand 提交于
      The comment in add_memory_resource() is stale: hotadd_new_pgdat() will no
      longer call get_pfn_range_for_nid(), as a hotadded pgdat will simply span
      no pages at all, until memory is moved to the zone/node via
      move_pfn_range_to_zone() - e.g., when onlining memory blocks.
      
      The only archs that care about memblocks for hotplugged memory (either for
      iterating over all system RAM or testing for memory validity) are arm64,
      s390x, and powerpc - due to CONFIG_ARCH_KEEP_MEMBLOCK.  Without
      CONFIG_ARCH_KEEP_MEMBLOCK, we can simply stop messing with memblocks.
      Signed-off-by: NDavid Hildenbrand <david@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Acked-by: NMike Rapoport <rppt@linux.ibm.com>
      Acked-by: NMichal Hocko <mhocko@suse.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Baoquan He <bhe@redhat.com>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: Pankaj Gupta <pankaj.gupta.linux@gmail.com>
      Cc: Mike Rapoport <rppt@linux.ibm.com>
      Cc: Anshuman Khandual <anshuman.khandual@arm.com>
      Link: http://lkml.kernel.org/r/20200422155353.25381-3-david@redhat.comSigned-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      52219aea
    • D
      mm/memory_hotplug: set node_start_pfn of hotadded pgdat to 0 · c68ab18c
      David Hildenbrand 提交于
      Patch series "mm/memory_hotplug: handle memblocks only with
      CONFIG_ARCH_KEEP_MEMBLOCK", v1.
      
      A hotadded node/pgdat will span no pages at all, until memory is moved to
      the zone/node via move_pfn_range_to_zone() -> resize_pgdat_range - e.g.,
      when onlining memory blocks.  We don't have to initialize the
      node_start_pfn to the memory we are adding.
      
      This patch (of 2):
      
      Especially, there is an inconsistency:
       - Hotplugging memory to a memory-less node with cpus: node_start_pf ==  0
       - Offlining and removing last memory from a node: node_start_pfn == 0
       - Hotplugging memory to a memory-less node without cpus: node_start_pfn != 0
      
      As soon as memory is onlined, node_start_pfn is overwritten with the
      actual start.  E.g., when adding two DIMMs but only onlining one of both,
      only that DIMM (with online memory blocks) is spanned by the node.
      
      Currently, the validity of node_start_pfn really is linked to
      node_spanned_pages != 0.  With node_spanned_pages == 0 (e.g., before
      onlining memory), it has no meaning.
      
      So let's stop setting node_start_pfn, just to be overwritten via
      move_pfn_range_to_zone().  This avoids confusion when looking at the code,
      wondering which magic will be performed with the node_start_pfn in this
      function, when hotadding a pgdat.
      Signed-off-by: NDavid Hildenbrand <david@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Acked-by: NPankaj Gupta <pankaj.gupta.linux@gmail.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Baoquan He <bhe@redhat.com>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: Pankaj Gupta <pankaj.gupta.linux@gmail.com>
      Cc: Anshuman Khandual <anshuman.khandual@arm.com>
      Cc: Mike Rapoport <rppt@linux.ibm.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Link: http://lkml.kernel.org/r/20200422155353.25381-1-david@redhat.com
      Link: http://lkml.kernel.org/r/20200422155353.25381-2-david@redhat.comSigned-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      c68ab18c
    • D
      mm/memory_hotplug: remove is_mem_section_removable() · 04f3465c
      David Hildenbrand 提交于
      Fortunately, all users of is_mem_section_removable() are gone.  Get rid of
      it, including some now unnecessary functions.
      Signed-off-by: NDavid Hildenbrand <david@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Reviewed-by: NWei Yang <richard.weiyang@gmail.com>
      Reviewed-by: NBaoquan He <bhe@redhat.com>
      Acked-by: NMichal Hocko <mhocko@suse.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Oscar Salvador <osalvador@suse.de>
      Link: http://lkml.kernel.org/r/20200407135416.24093-3-david@redhat.comSigned-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      04f3465c
    • V
      mm/memory_hotplug: refrain from adding memory into an impossible node · fa6d9ec7
      Vishal Verma 提交于
      A misbehaving qemu created a situation where the ACPI SRAT table
      advertised one fewer proximity domains than intended.  The NFIT table did
      describe all the expected proximity domains.  This caused the device dax
      driver to assign an impossible target_node to the device, and when
      hotplugged as system memory, this would fail with the following signature:
      
         BUG: kernel NULL pointer dereference, address: 0000000000000088
         #PF: supervisor read access in kernel mode
         #PF: error_code(0x0000) - not-present page
         PGD 80000001767d4067 P4D 80000001767d4067 PUD 10e0c4067 PMD 0
         Oops: 0000 [#1] SMP PTI
         CPU: 4 PID: 22737 Comm: kswapd3 Tainted: G           O      5.6.0-rc5 #9
         Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
         RIP: 0010:prepare_kswapd_sleep+0x7c/0xc0
         Code: 89 df e8 87 fd ff ff 89 c2 31 c0 84 d2 74 e6 0f 1f 44 00 00 48 8b 05 fb af 7a 01 48 63 93 88 1d 01 00 48 8b 84 d0 20 0f 00 00 <48> 3b 98 88 00 00 00 75 28 f0 80 a0 80 00 00 00 fe f0 80 a3 38 20
         RSP: 0018:ffffc900017a3e78 EFLAGS: 00010202
         RAX: 0000000000000000 RBX: ffff8881209e0000 RCX: 0000000000000000
         RDX: 0000000000000003 RSI: 0000000000000000 RDI: ffff8881209e0e80
         RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000008000
         R10: 0000000000000000 R11: 0000000000000003 R12: 0000000000000003
         R13: 0000000000000003 R14: 0000000000000000 R15: ffffc900017a3ec8
         FS:  0000000000000000(0000) GS:ffff888318c00000(0000) knlGS:0000000000000000
         CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
         CR2: 0000000000000088 CR3: 0000000120b50002 CR4: 00000000001606e0
         Call Trace:
          kswapd+0x103/0x520
          kthread+0x120/0x140
          ret_from_fork+0x3a/0x50
      
      Add a check in the add_memory path to fail if the node to which we are
      adding memory is in the node_possible_map
      Signed-off-by: NVishal Verma <vishal.l.verma@intel.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Acked-by: NDavid Hildenbrand <david@redhat.com>
      Acked-by: NMichal Hocko <mhocko@suse.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Link: http://lkml.kernel.org/r/20200416225438.15208-1-vishal.l.verma@intel.comSigned-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      fa6d9ec7
    • D
      mm/memory_hotplug: Introduce offline_and_remove_memory() · 08b3acd7
      David Hildenbrand 提交于
      virtio-mem wants to offline and remove a memory block once it unplugged
      all subblocks (e.g., using alloc_contig_range()). Let's provide
      an interface to do that from a driver. virtio-mem already supports to
      offline partially unplugged memory blocks. Offlining a fully unplugged
      memory block will not require to migrate any pages. All unplugged
      subblocks are PageOffline() and have a reference count of 0 - so
      offlining code will simply skip them.
      
      All we need is an interface to offline and remove the memory from kernel
      module context, where we don't have access to the memory block devices
      (esp. find_memory_block() and device_offline()) and the device hotplug
      lock.
      
      To keep things simple, allow to only work on a single memory block.
      Acked-by: NMichal Hocko <mhocko@suse.com>
      Tested-by: NPankaj Gupta <pankaj.gupta.linux@gmail.com>
      Acked-by: NAndrew Morton <akpm@linux-foundation.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Oscar Salvador <osalvador@suse.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
      Cc: Wei Yang <richard.weiyang@gmail.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Qian Cai <cai@lca.pw>
      Signed-off-by: NDavid Hildenbrand <david@redhat.com>
      Link: https://lore.kernel.org/r/20200507140139.17083-9-david@redhat.comSigned-off-by: NMichael S. Tsirkin <mst@redhat.com>
      08b3acd7
    • D
      mm: Allow to offline unmovable PageOffline() pages via MEM_GOING_OFFLINE · aa218795
      David Hildenbrand 提交于
      virtio-mem wants to allow to offline memory blocks of which some parts
      were unplugged (allocated via alloc_contig_range()), especially, to later
      offline and remove completely unplugged memory blocks. The important part
      is that PageOffline() has to remain set until the section is offline, so
      these pages will never get accessed (e.g., when dumping). The pages should
      not be handed back to the buddy (which would require clearing PageOffline()
      and result in issues if offlining fails and the pages are suddenly in the
      buddy).
      
      Let's allow to do that by allowing to isolate any PageOffline() page
      when offlining. This way, we can reach the memory hotplug notifier
      MEM_GOING_OFFLINE, where the driver can signal that he is fine with
      offlining this page by dropping its reference count. PageOffline() pages
      with a reference count of 0 can then be skipped when offlining the
      pages (like if they were free, however they are not in the buddy).
      
      Anybody who uses PageOffline() pages and does not agree to offline them
      (e.g., Hyper-V balloon, XEN balloon, VMWare balloon for 2MB pages) will not
      decrement the reference count and make offlining fail when trying to
      migrate such an unmovable page. So there should be no observable change.
      Same applies to balloon compaction users (movable PageOffline() pages), the
      pages will simply be migrated.
      
      Note 1: If offlining fails, a driver has to increment the reference
      	count again in MEM_CANCEL_OFFLINE.
      
      Note 2: A driver that makes use of this has to be aware that re-onlining
      	the memory block has to be handled by hooking into onlining code
      	(online_page_callback_t), resetting the page PageOffline() and
      	not giving them to the buddy.
      Reviewed-by: NAlexander Duyck <alexander.h.duyck@linux.intel.com>
      Acked-by: NMichal Hocko <mhocko@suse.com>
      Tested-by: NPankaj Gupta <pankaj.gupta.linux@gmail.com>
      Acked-by: NAndrew Morton <akpm@linux-foundation.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Juergen Gross <jgross@suse.com>
      Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      Cc: Pavel Tatashin <pavel.tatashin@microsoft.com>
      Cc: Alexander Duyck <alexander.h.duyck@linux.intel.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Anthony Yznaga <anthony.yznaga@oracle.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Mike Rapoport <rppt@linux.ibm.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Anshuman Khandual <anshuman.khandual@arm.com>
      Cc: Qian Cai <cai@lca.pw>
      Cc: Pingfan Liu <kernelfans@gmail.com>
      Signed-off-by: NDavid Hildenbrand <david@redhat.com>
      Link: https://lore.kernel.org/r/20200507140139.17083-7-david@redhat.comSigned-off-by: NMichael S. Tsirkin <mst@redhat.com>
      aa218795
  5. 04 6月, 2020 2 次提交
    • J
      mm/page_alloc: integrate classzone_idx and high_zoneidx · 97a225e6
      Joonsoo Kim 提交于
      classzone_idx is just different name for high_zoneidx now.  So, integrate
      them and add some comment to struct alloc_context in order to reduce
      future confusion about the meaning of this variable.
      
      The accessor, ac_classzone_idx() is also removed since it isn't needed
      after integration.
      
      In addition to integration, this patch also renames high_zoneidx to
      highest_zoneidx since it represents more precise meaning.
      Signed-off-by: NJoonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Reviewed-by: NBaoquan He <bhe@redhat.com>
      Acked-by: NVlastimil Babka <vbabka@suse.cz>
      Acked-by: NDavid Rientjes <rientjes@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Ye Xiaolong <xiaolong.ye@intel.com>
      Link: http://lkml.kernel.org/r/1587095923-7515-3-git-send-email-iamjoonsoo.kim@lge.comSigned-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      97a225e6
    • M
      mm: remove CONFIG_HAVE_MEMBLOCK_NODE_MAP option · 3f08a302
      Mike Rapoport 提交于
      CONFIG_HAVE_MEMBLOCK_NODE_MAP is used to differentiate initialization of
      nodes and zones structures between the systems that have region to node
      mapping in memblock and those that don't.
      
      Currently all the NUMA architectures enable this option and for the
      non-NUMA systems we can presume that all the memory belongs to node 0 and
      therefore the compile time configuration option is not required.
      
      The remaining few architectures that use DISCONTIGMEM without NUMA are
      easily updated to use memblock_add_node() instead of memblock_add() and
      thus have proper correspondence of memblock regions to NUMA nodes.
      
      Still, free_area_init_node() must have a backward compatible version
      because its semantics with and without CONFIG_HAVE_MEMBLOCK_NODE_MAP is
      different.  Once all the architectures will use the new semantics, the
      entire compatibility layer can be dropped.
      
      To avoid addition of extra run time memory to store node id for
      architectures that keep memblock but have only a single node, the node id
      field of the memblock_region is guarded by CONFIG_NEED_MULTIPLE_NODES and
      the corresponding accessors presume that in those cases it is always 0.
      Signed-off-by: NMike Rapoport <rppt@linux.ibm.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Tested-by: Hoan Tran <hoan@os.amperecomputing.com>	[arm64]
      Acked-by: Catalin Marinas <catalin.marinas@arm.com>	[arm64]
      Cc: Baoquan He <bhe@redhat.com>
      Cc: Brian Cain <bcain@codeaurora.org>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Geert Uytterhoeven <geert@linux-m68k.org>
      Cc: Greentime Hu <green.hu@gmail.com>
      Cc: Greg Ungerer <gerg@linux-m68k.org>
      Cc: Guan Xuetao <gxt@pku.edu.cn>
      Cc: Guo Ren <guoren@kernel.org>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Helge Deller <deller@gmx.de>
      Cc: "James E.J. Bottomley" <James.Bottomley@HansenPartnership.com>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Ley Foon Tan <ley.foon.tan@intel.com>
      Cc: Mark Salter <msalter@redhat.com>
      Cc: Matt Turner <mattst88@gmail.com>
      Cc: Max Filippov <jcmvbkbc@gmail.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Michal Simek <monstr@monstr.eu>
      Cc: Nick Hu <nickhu@andestech.com>
      Cc: Paul Walmsley <paul.walmsley@sifive.com>
      Cc: Richard Weinberger <richard@nod.at>
      Cc: Rich Felker <dalias@libc.org>
      Cc: Russell King <linux@armlinux.org.uk>
      Cc: Stafford Horne <shorne@gmail.com>
      Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: Vineet Gupta <vgupta@synopsys.com>
      Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
      Link: http://lkml.kernel.org/r/20200412194859.12663-4-rppt@kernel.orgSigned-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      3f08a302
  6. 11 4月, 2020 2 次提交
    • L
      mm/memory_hotplug: add pgprot_t to mhp_params · bfeb022f
      Logan Gunthorpe 提交于
      devm_memremap_pages() is currently used by the PCI P2PDMA code to create
      struct page mappings for IO memory.  At present, these mappings are
      created with PAGE_KERNEL which implies setting the PAT bits to be WB.
      However, on x86, an mtrr register will typically override this and force
      the cache type to be UC-.  In the case firmware doesn't set this
      register it is effectively WB and will typically result in a machine
      check exception when it's accessed.
      
      Other arches are not currently likely to function correctly seeing they
      don't have any MTRR registers to fall back on.
      
      To solve this, provide a way to specify the pgprot value explicitly to
      arch_add_memory().
      
      Of the arches that support MEMORY_HOTPLUG: x86_64, and arm64 need a
      simple change to pass the pgprot_t down to their respective functions
      which set up the page tables.  For x86_32, set the page tables
      explicitly using _set_memory_prot() (seeing they are already mapped).
      
      For ia64, s390 and sh, reject anything but PAGE_KERNEL settings -- this
      should be fine, for now, seeing these architectures don't support
      ZONE_DEVICE.
      
      A check in __add_pages() is also added to ensure the pgprot parameter
      was set for all arches.
      Signed-off-by: NLogan Gunthorpe <logang@deltatee.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Acked-by: NDavid Hildenbrand <david@redhat.com>
      Acked-by: NMichal Hocko <mhocko@suse.com>
      Acked-by: NDan Williams <dan.j.williams@intel.com>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Eric Badger <ebadger@gigaio.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Jason Gunthorpe <jgg@ziepe.ca>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Will Deacon <will@kernel.org>
      Link: http://lkml.kernel.org/r/20200306170846.9333-7-logang@deltatee.comSigned-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      bfeb022f
    • L
      mm/memory_hotplug: rename mhp_restrictions to mhp_params · f5637d3b
      Logan Gunthorpe 提交于
      The mhp_restrictions struct really doesn't specify anything resembling a
      restriction anymore so rename it to be mhp_params as it is a list of
      extended parameters.
      Signed-off-by: NLogan Gunthorpe <logang@deltatee.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Reviewed-by: NDavid Hildenbrand <david@redhat.com>
      Reviewed-by: NDan Williams <dan.j.williams@intel.com>
      Acked-by: NMichal Hocko <mhocko@suse.com>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Eric Badger <ebadger@gigaio.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Jason Gunthorpe <jgg@ziepe.ca>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Will Deacon <will@kernel.org>
      Link: http://lkml.kernel.org/r/20200306170846.9333-3-logang@deltatee.comSigned-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      f5637d3b
  7. 08 4月, 2020 8 次提交
  8. 06 3月, 2020 1 次提交
    • V
      mm, hotplug: fix page online with DEBUG_PAGEALLOC compiled but not enabled · c87cbc1f
      Vlastimil Babka 提交于
      Commit cd02cf1a ("mm/hotplug: fix an imbalance with DEBUG_PAGEALLOC")
      fixed memory hotplug with debug_pagealloc enabled, where onlining a page
      goes through page freeing, which removes the direct mapping.  Some arches
      don't like when the page is not mapped in the first place, so
      generic_online_page() maps it first.  This is somewhat wasteful, but
      better than special casing page freeing fast paths.
      
      The commit however missed that DEBUG_PAGEALLOC configured doesn't mean
      it's actually enabled.  One has to test debug_pagealloc_enabled() since
      031bc574 ("mm/debug-pagealloc: make debug-pagealloc boottime
      configurable"), or alternatively debug_pagealloc_enabled_static() since
      8e57f8ac ("mm, debug_pagealloc: don't rely on static keys too early"),
      but this is not done.
      
      As a result, a s390 kernel with DEBUG_PAGEALLOC configured but not enabled
      will crash:
      
      Unable to handle kernel pointer dereference in virtual kernel address space
      Failing address: 0000000000000000 TEID: 0000000000000483
      Fault in home space mode while using kernel ASCE.
      AS:0000001ece13400b R2:000003fff7fd000b R3:000003fff7fcc007 S:000003fff7fd7000 P:000000000000013d
      Oops: 0004 ilc:2 [#1] SMP
      CPU: 1 PID: 26015 Comm: chmem Kdump: loaded Tainted: GX 5.3.18-5-default #1 SLE15-SP2 (unreleased)
      Krnl PSW : 0704e00180000000 0000001ecd281b9e (__kernel_map_pages+0x166/0x188)
      R:0 T:1 IO:1 EX:1 Key:0 M:1 W:0 P:0 AS:3 CC:2 PM:0 RI:0 EA:3
      Krnl GPRS: 0000000000000000 0000000000000800 0000400b00000000 0000000000000100
      0000000000000001 0000000000000000 0000000000000002 0000000000000100
      0000001ece139230 0000001ecdd98d40 0000400b00000100 0000000000000000
      000003ffa17e4000 001fffe0114f7d08 0000001ecd4d93ea 001fffe0114f7b20
      Krnl Code: 0000001ecd281b8e: ec17ffff00d8 ahik %r1,%r7,-1
      0000001ecd281b94: ec111dbc0355 risbg %r1,%r1,29,188,3
      >0000001ecd281b9e: 94fb5006 ni 6(%r5),251
      0000001ecd281ba2: 41505008 la %r5,8(%r5)
      0000001ecd281ba6: ec51fffc6064 cgrj %r5,%r1,6,1ecd281b9e
      0000001ecd281bac: 1a07 ar %r0,%r7
      0000001ecd281bae: ec03ff584076 crj %r0,%r3,4,1ecd281a5e
      Call Trace:
      [<0000001ecd281b9e>] __kernel_map_pages+0x166/0x188
      [<0000001ecd4d9516>] online_pages_range+0xf6/0x128
      [<0000001ecd2a8186>] walk_system_ram_range+0x7e/0xd8
      [<0000001ecda28aae>] online_pages+0x2fe/0x3f0
      [<0000001ecd7d02a6>] memory_subsys_online+0x8e/0xc0
      [<0000001ecd7add42>] device_online+0x5a/0xc8
      [<0000001ecd7d0430>] state_store+0x88/0x118
      [<0000001ecd5b9f62>] kernfs_fop_write+0xc2/0x200
      [<0000001ecd5064b6>] vfs_write+0x176/0x1e0
      [<0000001ecd50676a>] ksys_write+0xa2/0x100
      [<0000001ecda315d4>] system_call+0xd8/0x2c8
      
      Fix this by checking debug_pagealloc_enabled_static() before calling
      kernel_map_pages(). Backports for kernel before 5.5 should use
      debug_pagealloc_enabled() instead. Also add comments.
      
      Fixes: cd02cf1a ("mm/hotplug: fix an imbalance with DEBUG_PAGEALLOC")
      Reported-by: NGerald Schaefer <gerald.schaefer@de.ibm.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NVlastimil Babka <vbabka@suse.cz>
      Reviewed-by: NDavid Hildenbrand <david@redhat.com>
      Cc: <stable@vger.kernel.org>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Qian Cai <cai@lca.pw>
      Link: http://lkml.kernel.org/r/20200224094651.18257-1-vbabka@suse.czSigned-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      c87cbc1f
  9. 04 2月, 2020 6 次提交
  10. 01 2月, 2020 3 次提交
    • D
      mm/memory_hotplug: pass in nid to online_pages() · bd5c2344
      David Hildenbrand 提交于
      Patch series "mm/memory_hotplug: pass in nid to online_pages()".
      
      Simplify onlining code and get rid of find_memory_block().  Pass in the
      nid from the memory block we are trying to online directly, instead of
      manually looking it up.
      
      This patch (of 2):
      
      No need to lookup the memory block, we can directly pass in the nid.
      
      Link: http://lkml.kernel.org/r/20200113113354.6341-2-david@redhat.comSigned-off-by: NDavid Hildenbrand <david@redhat.com>
      Reviewed-by: NAndrew Morton <akpm@linux-foundation.org>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: "Rafael J. Wysocki" <rafael@kernel.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: Anshuman Khandual <anshuman.khandual@arm.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      bd5c2344
    • D
      mm: remove "count" parameter from has_unmovable_pages() · fe4c86c9
      David Hildenbrand 提交于
      Now that the memory isolate notifier is gone, the parameter is always 0.
      Drop it and cleanup has_unmovable_pages().
      
      Link: http://lkml.kernel.org/r/20191114131911.11783-3-david@redhat.comSigned-off-by: NDavid Hildenbrand <david@redhat.com>
      Acked-by: NMichal Hocko <mhocko@suse.com>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: Anshuman Khandual <anshuman.khandual@arm.com>
      Cc: Qian Cai <cai@lca.pw>
      Cc: Pingfan Liu <kernelfans@gmail.com>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
      Cc: Wei Yang <richardw.yang@linux.intel.com>
      Cc: Alexander Duyck <alexander.h.duyck@linux.intel.com>
      Cc: Alexander Potapenko <glider@google.com>
      Cc: Arun KS <arunks@codeaurora.org>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      fe4c86c9
    • D
      mm/memory_hotplug: fix remove_memory() lockdep splat · f1037ec0
      Dan Williams 提交于
      The daxctl unit test for the dax_kmem driver currently triggers the
      (false positive) lockdep splat below.  It results from the fact that
      remove_memory_block_devices() is invoked under the mem_hotplug_lock()
      causing lockdep entanglements with cpu_hotplug_lock() and sysfs (kernfs
      active state tracking).  It is a false positive because the sysfs
      attribute path triggering the memory remove is not the same attribute
      path associated with memory-block device.
      
      sysfs_break_active_protection() is not applicable since there is no real
      deadlock conflict, instead move memory-block device removal outside the
      lock.  The mem_hotplug_lock() is not needed to synchronize the
      memory-block device removal vs the page online state, that is already
      handled by lock_device_hotplug().  Specifically, lock_device_hotplug()
      is sufficient to allow try_remove_memory() to check the offline state of
      the memblocks and be assured that any in progress online attempts are
      flushed / blocked by kernfs_drain() / attribute removal.
      
      The add_memory() path safely creates memblock devices under the
      mem_hotplug_lock().  There is no kernfs active state synchronization in
      the memblock device_register() path, so nothing to fix there.
      
      This change is only possible thanks to the recent change that refactored
      memory block device removal out of arch_remove_memory() (commit
      4c4b7f9b "mm/memory_hotplug: remove memory block devices before
      arch_remove_memory()"), and David's due diligence tracking down the
      guarantees afforded by kernfs_drain().  Not flagged for -stable since
      this only impacts ongoing development and lockdep validation, not a
      runtime issue.
      
          ======================================================
          WARNING: possible circular locking dependency detected
          5.5.0-rc3+ #230 Tainted: G           OE
          ------------------------------------------------------
          lt-daxctl/6459 is trying to acquire lock:
          ffff99c7f0003510 (kn->count#241){++++}, at: kernfs_remove_by_name_ns+0x41/0x80
      
          but task is already holding lock:
          ffffffffa76a5450 (mem_hotplug_lock.rw_sem){++++}, at: percpu_down_write+0x20/0xe0
      
          which lock already depends on the new lock.
      
          the existing dependency chain (in reverse order) is:
      
          -> #2 (mem_hotplug_lock.rw_sem){++++}:
                 __lock_acquire+0x39c/0x790
                 lock_acquire+0xa2/0x1b0
                 get_online_mems+0x3e/0xb0
                 kmem_cache_create_usercopy+0x2e/0x260
                 kmem_cache_create+0x12/0x20
                 ptlock_cache_init+0x20/0x28
                 start_kernel+0x243/0x547
                 secondary_startup_64+0xb6/0xc0
      
          -> #1 (cpu_hotplug_lock.rw_sem){++++}:
                 __lock_acquire+0x39c/0x790
                 lock_acquire+0xa2/0x1b0
                 cpus_read_lock+0x3e/0xb0
                 online_pages+0x37/0x300
                 memory_subsys_online+0x17d/0x1c0
                 device_online+0x60/0x80
                 state_store+0x65/0xd0
                 kernfs_fop_write+0xcf/0x1c0
                 vfs_write+0xdb/0x1d0
                 ksys_write+0x65/0xe0
                 do_syscall_64+0x5c/0xa0
                 entry_SYSCALL_64_after_hwframe+0x49/0xbe
      
          -> #0 (kn->count#241){++++}:
                 check_prev_add+0x98/0xa40
                 validate_chain+0x576/0x860
                 __lock_acquire+0x39c/0x790
                 lock_acquire+0xa2/0x1b0
                 __kernfs_remove+0x25f/0x2e0
                 kernfs_remove_by_name_ns+0x41/0x80
                 remove_files.isra.0+0x30/0x70
                 sysfs_remove_group+0x3d/0x80
                 sysfs_remove_groups+0x29/0x40
                 device_remove_attrs+0x39/0x70
                 device_del+0x16a/0x3f0
                 device_unregister+0x16/0x60
                 remove_memory_block_devices+0x82/0xb0
                 try_remove_memory+0xb5/0x130
                 remove_memory+0x26/0x40
                 dev_dax_kmem_remove+0x44/0x6a [kmem]
                 device_release_driver_internal+0xe4/0x1c0
                 unbind_store+0xef/0x120
                 kernfs_fop_write+0xcf/0x1c0
                 vfs_write+0xdb/0x1d0
                 ksys_write+0x65/0xe0
                 do_syscall_64+0x5c/0xa0
                 entry_SYSCALL_64_after_hwframe+0x49/0xbe
      
          other info that might help us debug this:
      
          Chain exists of:
            kn->count#241 --> cpu_hotplug_lock.rw_sem --> mem_hotplug_lock.rw_sem
      
           Possible unsafe locking scenario:
      
                 CPU0                    CPU1
                 ----                    ----
            lock(mem_hotplug_lock.rw_sem);
                                         lock(cpu_hotplug_lock.rw_sem);
                                         lock(mem_hotplug_lock.rw_sem);
            lock(kn->count#241);
      
           *** DEADLOCK ***
      
      No fixes tag as this has been a long standing issue that predated the
      addition of kernfs lockdep annotations.
      
      Link: http://lkml.kernel.org/r/157991441887.2763922.4770790047389427325.stgit@dwillia2-desk3.amr.corp.intel.comSigned-off-by: NDan Williams <dan.j.williams@intel.com>
      Acked-by: NMichal Hocko <mhocko@suse.com>
      Reviewed-by: NDavid Hildenbrand <david@redhat.com>
      Cc: Vishal Verma <vishal.l.verma@intel.com>
      Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      f1037ec0
  11. 05 1月, 2020 1 次提交
    • D
      mm/memory_hotplug: shrink zones when offlining memory · feee6b29
      David Hildenbrand 提交于
      We currently try to shrink a single zone when removing memory.  We use
      the zone of the first page of the memory we are removing.  If that
      memmap was never initialized (e.g., memory was never onlined), we will
      read garbage and can trigger kernel BUGs (due to a stale pointer):
      
          BUG: unable to handle page fault for address: 000000000000353d
          #PF: supervisor write access in kernel mode
          #PF: error_code(0x0002) - not-present page
          PGD 0 P4D 0
          Oops: 0002 [#1] SMP PTI
          CPU: 1 PID: 7 Comm: kworker/u8:0 Not tainted 5.3.0-rc5-next-20190820+ #317
          Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.1-0-ga5cab58e9a3f-prebuilt.qemu.4
          Workqueue: kacpi_hotplug acpi_hotplug_work_fn
          RIP: 0010:clear_zone_contiguous+0x5/0x10
          Code: 48 89 c6 48 89 c3 e8 2a fe ff ff 48 85 c0 75 cf 5b 5d c3 c6 85 fd 05 00 00 01 5b 5d c3 0f 1f 840
          RSP: 0018:ffffad2400043c98 EFLAGS: 00010246
          RAX: 0000000000000000 RBX: 0000000200000000 RCX: 0000000000000000
          RDX: 0000000000200000 RSI: 0000000000140000 RDI: 0000000000002f40
          RBP: 0000000140000000 R08: 0000000000000000 R09: 0000000000000001
          R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000140000
          R13: 0000000000140000 R14: 0000000000002f40 R15: ffff9e3e7aff3680
          FS:  0000000000000000(0000) GS:ffff9e3e7bb00000(0000) knlGS:0000000000000000
          CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
          CR2: 000000000000353d CR3: 0000000058610000 CR4: 00000000000006e0
          DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
          DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
          Call Trace:
           __remove_pages+0x4b/0x640
           arch_remove_memory+0x63/0x8d
           try_remove_memory+0xdb/0x130
           __remove_memory+0xa/0x11
           acpi_memory_device_remove+0x70/0x100
           acpi_bus_trim+0x55/0x90
           acpi_device_hotplug+0x227/0x3a0
           acpi_hotplug_work_fn+0x1a/0x30
           process_one_work+0x221/0x550
           worker_thread+0x50/0x3b0
           kthread+0x105/0x140
           ret_from_fork+0x3a/0x50
          Modules linked in:
          CR2: 000000000000353d
      
      Instead, shrink the zones when offlining memory or when onlining failed.
      Introduce and use remove_pfn_range_from_zone(() for that.  We now
      properly shrink the zones, even if we have DIMMs whereby
      
       - Some memory blocks fall into no zone (never onlined)
      
       - Some memory blocks fall into multiple zones (offlined+re-onlined)
      
       - Multiple memory blocks that fall into different zones
      
      Drop the zone parameter (with a potential dubious value) from
      __remove_pages() and __remove_section().
      
      Link: http://lkml.kernel.org/r/20191006085646.5768-6-david@redhat.com
      Fixes: f1dd2cd1 ("mm, memory_hotplug: do not associate hotadded memory to zones until online")	[visible after d0dc12e8]
      Signed-off-by: NDavid Hildenbrand <david@redhat.com>
      Reviewed-by: NOscar Salvador <osalvador@suse.de>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
      Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com>
      Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Logan Gunthorpe <logang@deltatee.com>
      Cc: <stable@vger.kernel.org>	[5.0+]
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      feee6b29
  12. 02 12月, 2019 2 次提交