1. 29 12月, 2018 6 次提交
    • D
      mm/memory_hotplug: drop "online" parameter from add_memory_resource() · f29d8e9c
      David Hildenbrand 提交于
      Userspace should always be in charge of how to online memory and if memory
      should be onlined automatically in the kernel.  Let's drop the parameter
      to overwrite this - XEN passes memhp_auto_online, just like add_memory(),
      so we can directly use that instead internally.
      
      Link: http://lkml.kernel.org/r/20181123123740.27652-1-david@redhat.comSigned-off-by: NDavid Hildenbrand <david@redhat.com>
      Acked-by: NMichal Hocko <mhocko@suse.com>
      Reviewed-by: NOscar Salvador <osalvador@suse.de>
      Acked-by: NJuergen Gross <jgross@suse.com>
      Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
      Cc: Stefano Stabellini <sstabellini@kernel.org>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Pavel Tatashin <pasha.tatashin@oracle.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Arun KS <arunks@codeaurora.org>
      Cc: Mathieu Malaterre <malat@debian.org>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      f29d8e9c
    • M
      mm, memory_hotplug: do not clear numa_node association after hot_remove · 46a3679b
      Michal Hocko 提交于
      Per-cpu numa_node provides a default node for each possible cpu.  The
      association gets initialized during the boot when the architecture
      specific code explores cpu->NUMA affinity.  When the whole NUMA node is
      removed though we are clearing this association
      
      try_offline_node
        check_and_unmap_cpu_on_node
          unmap_cpu_on_node
            numa_clear_node
              numa_set_node(cpu, NUMA_NO_NODE)
      
      This means that whoever calls cpu_to_node for a cpu associated with such a
      node will get NUMA_NO_NODE.  This is problematic for two reasons.  First
      it is fragile because __alloc_pages_node would simply blow up on an
      out-of-bound access.  We have encountered this when loading kvm module
      
        BUG: unable to handle kernel paging request at 00000000000021c0
        IP: __alloc_pages_nodemask+0x93/0xb70
        PGD 800000ffe853e067 PUD 7336bbc067 PMD 0
        Oops: 0000 [#1] SMP
        [...]
        CPU: 88 PID: 1223749 Comm: modprobe Tainted: G        W          4.4.156-94.64-default #1
        RIP: __alloc_pages_nodemask+0x93/0xb70
        RSP: 0018:ffff887354493b40  EFLAGS: 00010202
        RAX: 00000000000021c0 RBX: 0000000000000000 RCX: 0000000000000000
        RDX: 0000000000000000 RSI: 0000000000000002 RDI: 00000000014000c0
        RBP: 00000000014000c0 R08: ffffffffffffffff R09: 0000000000000000
        R10: ffff88fffc89e790 R11: 0000000000014000 R12: 0000000000000101
        R13: ffffffffa0772cd4 R14: ffffffffa0769ac0 R15: 0000000000000000
        FS:  00007fdf2f2f1700(0000) GS:ffff88fffc880000(0000) knlGS:0000000000000000
        CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
        CR2: 00000000000021c0 CR3: 00000077205ee000 CR4: 0000000000360670
        DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
        DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
        Call Trace:
          alloc_vmcs_cpu+0x3d/0x90 [kvm_intel]
          hardware_setup+0x781/0x849 [kvm_intel]
          kvm_arch_hardware_setup+0x28/0x190 [kvm]
          kvm_init+0x7c/0x2d0 [kvm]
          vmx_init+0x1e/0x32c [kvm_intel]
          do_one_initcall+0xca/0x1f0
          do_init_module+0x5a/0x1d7
          load_module+0x1393/0x1c90
          SYSC_finit_module+0x70/0xa0
          entry_SYSCALL_64_fastpath+0x1e/0xb7
        DWARF2 unwinder stuck at entry_SYSCALL_64_fastpath+0x1e/0xb7
      
      on an older kernel but the code is basically the same in the current Linus
      tree as well.  alloc_vmcs_cpu could use alloc_pages_nodemask which would
      recognize NUMA_NO_NODE and use alloc_pages_node which would translate it
      to numa_mem_id but that is wrong as well because it would use a cpu
      affinity of the local CPU which might be quite far from the original node.
      It is also reasonable to expect that cpu_to_node will provide a sane
      value and there might be many more callers like that.
      
      The second problem is that __register_one_node relies on cpu_to_node to
      properly associate cpus back to the node when it is onlined.  We do not
      want to lose that link as there is no arch independent way to get it from
      the early boot time AFAICS.
      
      Drop the whole check_and_unmap_cpu_on_node machinery and keep the
      association to fix both issues.  The NODE_DATA(nid) is not deallocated so
      it will stay in place and if anybody wants to allocate from that node then
      a fallback node will be used.
      
      Thanks to Vlastimil Babka for his live system debugging skills that helped
      debugging the issue.
      
      Link: http://lkml.kernel.org/r/20181108100413.966-1-mhocko@kernel.org
      Fixes: e13fe869 ("cpu-hotplug,memory-hotplug: clear cpu_to_node() when offlining the node")
      Signed-off-by: NMichal Hocko <mhocko@suse.com>
      Debugged-by: NVlastimil Babka <vbabka@suse.cz>
      Reported-by: NMiroslav Benes <mbenes@suse.cz>
      Acked-by: NAnshuman Khandual <anshuman.khandual@arm.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      46a3679b
    • M
      mm: only report isolation failures when offlining memory · d381c547
      Michal Hocko 提交于
      Heiko has complained that his log is swamped by warnings from
      has_unmovable_pages
      
      [   20.536664] page dumped because: has_unmovable_pages
      [   20.536792] page:000003d081ff4080 count:1 mapcount:0 mapping:000000008ff88600 index:0x0 compound_mapcount: 0
      [   20.536794] flags: 0x3fffe0000010200(slab|head)
      [   20.536795] raw: 03fffe0000010200 0000000000000100 0000000000000200 000000008ff88600
      [   20.536796] raw: 0000000000000000 0020004100000000 ffffffff00000001 0000000000000000
      [   20.536797] page dumped because: has_unmovable_pages
      [   20.536814] page:000003d0823b0000 count:1 mapcount:0 mapping:0000000000000000 index:0x0
      [   20.536815] flags: 0x7fffe0000000000()
      [   20.536817] raw: 07fffe0000000000 0000000000000100 0000000000000200 0000000000000000
      [   20.536818] raw: 0000000000000000 0000000000000000 ffffffff00000001 0000000000000000
      
      which are not triggered by the memory hotplug but rather CMA allocator.
      The original idea behind dumping the page state for all call paths was
      that these messages will be helpful debugging failures.  From the above it
      seems that this is not the case for the CMA path because we are lacking
      much more context.  E.g the second reported page might be a CMA allocated
      page.  It is still interesting to see a slab page in the CMA area but it
      is hard to tell whether this is bug from the above output alone.
      
      Address this issue by dumping the page state only on request.  Both
      start_isolate_page_range and has_unmovable_pages already have an argument
      to ignore hwpoison pages so make this argument more generic and turn it
      into flags and allow callers to combine non-default modes into a mask.
      While we are at it, has_unmovable_pages call from
      is_pageblock_removable_nolock (sysfs removable file) is questionable to
      report the failure so drop it from there as well.
      
      Link: http://lkml.kernel.org/r/20181218092802.31429-1-mhocko@kernel.orgSigned-off-by: NMichal Hocko <mhocko@suse.com>
      Reported-by: NHeiko Carstens <heiko.carstens@de.ibm.com>
      Reviewed-by: NOscar Salvador <osalvador@suse.de>
      Cc: Anshuman Khandual <anshuman.khandual@arm.com>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      d381c547
    • M
      mm, memory_hotplug: be more verbose for memory offline failures · 2932c8b0
      Michal Hocko 提交于
      There is only very limited information printed when the memory offlining
      fails:
      
      [ 1984.506184] rac1 kernel: memory offlining [mem 0x82600000000-0x8267fffffff] failed due to signal backoff
      
      This tells us that the failure is triggered by the userspace intervention
      but it doesn't tell us much more about the underlying reason.  It might be
      that the page migration failes repeatedly and the userspace timeout
      expires and send a signal or it might be some of the earlier steps
      (isolation, memory notifier) takes too long.
      
      If the migration failes then it would be really helpful to see which page
      that and its state.  The same applies to the isolation phase.  If we fail
      to isolate a page from the allocator then knowing the state of the page
      would be helpful as well.
      
      Dump the page state that fails to get isolated or migrated.  This will
      tell us more about the failure and what to focus on during debugging.
      
      [akpm@linux-foundation.org: add missing printk arg]
      [mhocko@suse.com: tweak dump_page() `reason' text]
        Link: http://lkml.kernel.org/r/20181116083020.20260-6-mhocko@kernel.org
      Link: http://lkml.kernel.org/r/20181107101830.17405-6-mhocko@kernel.orgSigned-off-by: NMichal Hocko <mhocko@suse.com>
      Reviewed-by: NAndrew Morton <akpm@linux-foundation.org>
      Reviewed-by: NOscar Salvador <osalvador@suse.de>
      Reviewed-by: NAnshuman Khandual <anshuman.khandual@arm.com>
      Cc: Baoquan He <bhe@redhat.com>
      Cc: Oscar Salvador <OSalvador@suse.com>
      Cc: William Kucharski <william.kucharski@oracle.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      2932c8b0
    • M
      mm, memory_hotplug: print reason for the offlining failure · 79605093
      Michal Hocko 提交于
      The memory offlining failure reporting is inconsistent and insufficient.
      Some error paths simply do not report the failure to the log at all.  When
      we do report there are no details about the reason of the failure and
      there are several of them which makes memory offlining failures hard to
      debug.
      
      Make sure that the
      	memory offlining [mem %#010llx-%#010llx] failed
      message is printed for all failures and also provide a short textual
      reason for the failure e.g.
      
      [ 1984.506184] rac1 kernel: memory offlining [mem 0x82600000000-0x8267fffffff] failed due to signal backoff
      
      this tells us that the offlining has failed because of a signal pending
      aka user intervention.
      
      [akpm@linux-foundation.org: tweak messages a bit]
      Link: http://lkml.kernel.org/r/20181107101830.17405-5-mhocko@kernel.orgSigned-off-by: NMichal Hocko <mhocko@suse.com>
      Reviewed-by: NAndrew Morton <akpm@linux-foundation.org>
      Reviewed-by: NOscar Salvador <osalvador@suse.de>
      Reviewed-by: NAnshuman Khandual <anshuman.khandual@arm.com>
      Cc: Baoquan He <bhe@redhat.com>
      Cc: Oscar Salvador <OSalvador@suse.com>
      Cc: William Kucharski <william.kucharski@oracle.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      79605093
    • M
      mm, memory_hotplug: drop pointless block alignment checks from __offline_pages · 6cc2baf6
      Michal Hocko 提交于
      This function is never called from a context which would provide
      misaligned pfn range so drop the pointless check.
      
      Link: http://lkml.kernel.org/r/20181107101830.17405-4-mhocko@kernel.orgSigned-off-by: NMichal Hocko <mhocko@suse.com>
      Reviewed-by: NAndrew Morton <akpm@linux-foundation.org>
      Reviewed-by: NOscar Salvador <osalvador@suse.de>
      Reviewed-by: NAnshuman Khandual <anshuman.khandual@arm.com>
      Cc: Baoquan He <bhe@redhat.com>
      Cc: Oscar Salvador <OSalvador@suse.com>
      Cc: William Kucharski <william.kucharski@oracle.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      6cc2baf6
  2. 04 11月, 2018 1 次提交
    • M
      memory_hotplug: cond_resched in __remove_pages · dd33ad7b
      Michal Hocko 提交于
      We have received a bug report that unbinding a large pmem (>1TB) can
      result in a soft lockup:
      
        NMI watchdog: BUG: soft lockup - CPU#9 stuck for 23s! [ndctl:4365]
        [...]
        Supported: Yes
        CPU: 9 PID: 4365 Comm: ndctl Not tainted 4.12.14-94.40-default #1 SLE12-SP4
        Hardware name: Intel Corporation S2600WFD/S2600WFD, BIOS SE5C620.86B.01.00.0833.051120182255 05/11/2018
        task: ffff9cce7d4410c0 task.stack: ffffbe9eb1bc4000
        RIP: 0010:__put_page+0x62/0x80
        Call Trace:
         devm_memremap_pages_release+0x152/0x260
         release_nodes+0x18d/0x1d0
         device_release_driver_internal+0x160/0x210
         unbind_store+0xb3/0xe0
         kernfs_fop_write+0x102/0x180
         __vfs_write+0x26/0x150
         vfs_write+0xad/0x1a0
         SyS_write+0x42/0x90
         do_syscall_64+0x74/0x150
         entry_SYSCALL_64_after_hwframe+0x3d/0xa2
        RIP: 0033:0x7fd13166b3d0
      
      It has been reported on an older (4.12) kernel but the current upstream
      code doesn't cond_resched in the hot remove code at all and the given
      range to remove might be really large.  Fix the issue by calling
      cond_resched once per memory section.
      
      Link: http://lkml.kernel.org/r/20181031125840.23982-1-mhocko@kernel.orgSigned-off-by: NMichal Hocko <mhocko@suse.com>
      Acked-by: NJohannes Thumshirn <jthumshirn@suse.de>
      Cc: Dan Williams <dan.j.williams@gmail.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      dd33ad7b
  3. 31 10月, 2018 4 次提交
    • D
      mm/memory_hotplug: fix online/offline_pages called w.o. mem_hotplug_lock · 381eab4a
      David Hildenbrand 提交于
      There seem to be some problems as result of 30467e0b ("mm, hotplug:
      fix concurrent memory hot-add deadlock"), which tried to fix a possible
      lock inversion reported and discussed in [1] due to the two locks
      	a) device_lock()
      	b) mem_hotplug_lock
      
      While add_memory() first takes b), followed by a) during
      bus_probe_device(), onlining of memory from user space first took a),
      followed by b), exposing a possible deadlock.
      
      In [1], and it was decided to not make use of device_hotplug_lock, but
      rather to enforce a locking order.
      
      The problems I spotted related to this:
      
      1. Memory block device attributes: While .state first calls
         mem_hotplug_begin() and the calls device_online() - which takes
         device_lock() - .online does no longer call mem_hotplug_begin(), so
         effectively calls online_pages() without mem_hotplug_lock.
      
      2. device_online() should be called under device_hotplug_lock, however
         onlining memory during add_memory() does not take care of that.
      
      In addition, I think there is also something wrong about the locking in
      
      3. arch/powerpc/platforms/powernv/memtrace.c calls offline_pages()
         without locks. This was introduced after 30467e0b. And skimming over
         the code, I assume it could need some more care in regards to locking
         (e.g. device_online() called without device_hotplug_lock. This will
         be addressed in the following patches.
      
      Now that we hold the device_hotplug_lock when
      - adding memory (e.g. via add_memory()/add_memory_resource())
      - removing memory (e.g. via remove_memory())
      - device_online()/device_offline()
      
      We can move mem_hotplug_lock usage back into
      online_pages()/offline_pages().
      
      Why is mem_hotplug_lock still needed? Essentially to make
      get_online_mems()/put_online_mems() be very fast (relying on
      device_hotplug_lock would be very slow), and to serialize against
      addition of memory that does not create memory block devices (hmm).
      
      [1] http://driverdev.linuxdriverproject.org/pipermail/ driverdev-devel/
          2015-February/065324.html
      
      This patch is partly based on a patch by Vitaly Kuznetsov.
      
      Link: http://lkml.kernel.org/r/20180925091457.28651-4-david@redhat.comSigned-off-by: NDavid Hildenbrand <david@redhat.com>
      Reviewed-by: NPavel Tatashin <pavel.tatashin@microsoft.com>
      Reviewed-by: NRashmica Gupta <rashmica.g@gmail.com>
      Reviewed-by: NOscar Salvador <osalvador@suse.de>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
      Cc: Len Brown <lenb@kernel.org>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: "K. Y. Srinivasan" <kys@microsoft.com>
      Cc: Haiyang Zhang <haiyangz@microsoft.com>
      Cc: Stephen Hemminger <sthemmin@microsoft.com>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
      Cc: Juergen Gross <jgross@suse.com>
      Cc: Rashmica Gupta <rashmica.g@gmail.com>
      Cc: Michael Neuling <mikey@neuling.org>
      Cc: Balbir Singh <bsingharora@gmail.com>
      Cc: Kate Stewart <kstewart@linuxfoundation.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Philippe Ombredanne <pombredanne@nexb.com>
      Cc: Pavel Tatashin <pavel.tatashin@microsoft.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: YASUAKI ISHIMATSU <yasu.isimatu@gmail.com>
      Cc: Mathieu Malaterre <malat@debian.org>
      Cc: John Allen <jallen@linux.vnet.ibm.com>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Nathan Fontenot <nfont@linux.vnet.ibm.com>
      Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      381eab4a
    • D
      mm/memory_hotplug: make add_memory() take the device_hotplug_lock · 8df1d0e4
      David Hildenbrand 提交于
      add_memory() currently does not take the device_hotplug_lock, however
      is aleady called under the lock from
      	arch/powerpc/platforms/pseries/hotplug-memory.c
      	drivers/acpi/acpi_memhotplug.c
      to synchronize against CPU hot-remove and similar.
      
      In general, we should hold the device_hotplug_lock when adding memory to
      synchronize against online/offline request (e.g.  from user space) - which
      already resulted in lock inversions due to device_lock() and
      mem_hotplug_lock - see 30467e0b ("mm, hotplug: fix concurrent memory
      hot-add deadlock").  add_memory()/add_memory_resource() will create memory
      block devices, so this really feels like the right thing to do.
      
      Holding the device_hotplug_lock makes sure that a memory block device
      can really only be accessed (e.g. via .online/.state) from user space,
      once the memory has been fully added to the system.
      
      The lock is not held yet in
      	drivers/xen/balloon.c
      	arch/powerpc/platforms/powernv/memtrace.c
      	drivers/s390/char/sclp_cmd.c
      	drivers/hv/hv_balloon.c
      So, let's either use the locked variants or take the lock.
      
      Don't export add_memory_resource(), as it once was exported to be used by
      XEN, which is never built as a module.  If somebody requires it, we also
      have to export a locked variant (as device_hotplug_lock is never
      exported).
      
      Link: http://lkml.kernel.org/r/20180925091457.28651-3-david@redhat.comSigned-off-by: NDavid Hildenbrand <david@redhat.com>
      Reviewed-by: NPavel Tatashin <pavel.tatashin@microsoft.com>
      Reviewed-by: NRafael J. Wysocki <rafael.j.wysocki@intel.com>
      Reviewed-by: NRashmica Gupta <rashmica.g@gmail.com>
      Reviewed-by: NOscar Salvador <osalvador@suse.de>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
      Cc: Len Brown <lenb@kernel.org>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
      Cc: Juergen Gross <jgross@suse.com>
      Cc: Nathan Fontenot <nfont@linux.vnet.ibm.com>
      Cc: John Allen <jallen@linux.vnet.ibm.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Mathieu Malaterre <malat@debian.org>
      Cc: Pavel Tatashin <pavel.tatashin@microsoft.com>
      Cc: YASUAKI ISHIMATSU <yasu.isimatu@gmail.com>
      Cc: Balbir Singh <bsingharora@gmail.com>
      Cc: Haiyang Zhang <haiyangz@microsoft.com>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Kate Stewart <kstewart@linuxfoundation.org>
      Cc: "K. Y. Srinivasan" <kys@microsoft.com>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Michael Neuling <mikey@neuling.org>
      Cc: Philippe Ombredanne <pombredanne@nexb.com>
      Cc: Stephen Hemminger <sthemmin@microsoft.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      8df1d0e4
    • D
      mm/memory_hotplug: make remove_memory() take the device_hotplug_lock · d15e5926
      David Hildenbrand 提交于
      Patch series "mm: online/offline_pages called w.o. mem_hotplug_lock", v3.
      
      Reading through the code and studying how mem_hotplug_lock is to be used,
      I noticed that there are two places where we can end up calling
      device_online()/device_offline() - online_pages()/offline_pages() without
      the mem_hotplug_lock.  And there are other places where we call
      device_online()/device_offline() without the device_hotplug_lock.
      
      While e.g.
      	echo "online" > /sys/devices/system/memory/memory9/state
      is fine, e.g.
      	echo 1 > /sys/devices/system/memory/memory9/online
      Will not take the mem_hotplug_lock. However the device_lock() and
      device_hotplug_lock.
      
      E.g.  via memory_probe_store(), we can end up calling
      add_memory()->online_pages() without the device_hotplug_lock.  So we can
      have concurrent callers in online_pages().  We e.g.  touch in
      online_pages() basically unprotected zone->present_pages then.
      
      Looks like there is a longer history to that (see Patch #2 for details),
      and fixing it to work the way it was intended is not really possible.  We
      would e.g.  have to take the mem_hotplug_lock in device/base/core.c, which
      sounds wrong.
      
      Summary: We had a lock inversion on mem_hotplug_lock and device_lock().
      More details can be found in patch 3 and patch 6.
      
      I propose the general rules (documentation added in patch 6):
      
      1. add_memory/add_memory_resource() must only be called with
         device_hotplug_lock.
      2. remove_memory() must only be called with device_hotplug_lock. This is
         already documented and holds for all callers.
      3. device_online()/device_offline() must only be called with
         device_hotplug_lock. This is already documented and true for now in core
         code. Other callers (related to memory hotplug) have to be fixed up.
      4. mem_hotplug_lock is taken inside of add_memory/remove_memory/
         online_pages/offline_pages.
      
      To me, this looks way cleaner than what we have right now (and easier to
      verify).  And looking at the documentation of remove_memory, using
      lock_device_hotplug also for add_memory() feels natural.
      
      This patch (of 6):
      
      remove_memory() is exported right now but requires the
      device_hotplug_lock, which is not exported.  So let's provide a variant
      that takes the lock and only export that one.
      
      The lock is already held in
      	arch/powerpc/platforms/pseries/hotplug-memory.c
      	drivers/acpi/acpi_memhotplug.c
      	arch/powerpc/platforms/powernv/memtrace.c
      
      Apart from that, there are not other users in the tree.
      
      Link: http://lkml.kernel.org/r/20180925091457.28651-2-david@redhat.comSigned-off-by: NDavid Hildenbrand <david@redhat.com>
      Reviewed-by: NPavel Tatashin <pavel.tatashin@microsoft.com>
      Reviewed-by: NRafael J. Wysocki <rafael.j.wysocki@intel.com>
      Reviewed-by: NRashmica Gupta <rashmica.g@gmail.com>
      Reviewed-by: NOscar Salvador <osalvador@suse.de>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
      Cc: Len Brown <lenb@kernel.org>
      Cc: Rashmica Gupta <rashmica.g@gmail.com>
      Cc: Michael Neuling <mikey@neuling.org>
      Cc: Balbir Singh <bsingharora@gmail.com>
      Cc: Nathan Fontenot <nfont@linux.vnet.ibm.com>
      Cc: John Allen <jallen@linux.vnet.ibm.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: YASUAKI ISHIMATSU <yasu.isimatu@gmail.com>
      Cc: Mathieu Malaterre <malat@debian.org>
      Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
      Cc: Haiyang Zhang <haiyangz@microsoft.com>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Juergen Gross <jgross@suse.com>
      Cc: Kate Stewart <kstewart@linuxfoundation.org>
      Cc: "K. Y. Srinivasan" <kys@microsoft.com>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Philippe Ombredanne <pombredanne@nexb.com>
      Cc: Stephen Hemminger <sthemmin@microsoft.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      d15e5926
    • M
      mm: remove include/linux/bootmem.h · 57c8a661
      Mike Rapoport 提交于
      Move remaining definitions and declarations from include/linux/bootmem.h
      into include/linux/memblock.h and remove the redundant header.
      
      The includes were replaced with the semantic patch below and then
      semi-automated removal of duplicated '#include <linux/memblock.h>
      
      @@
      @@
      - #include <linux/bootmem.h>
      + #include <linux/memblock.h>
      
      [sfr@canb.auug.org.au: dma-direct: fix up for the removal of linux/bootmem.h]
        Link: http://lkml.kernel.org/r/20181002185342.133d1680@canb.auug.org.au
      [sfr@canb.auug.org.au: powerpc: fix up for removal of linux/bootmem.h]
        Link: http://lkml.kernel.org/r/20181005161406.73ef8727@canb.auug.org.au
      [sfr@canb.auug.org.au: x86/kaslr, ACPI/NUMA: fix for linux/bootmem.h removal]
        Link: http://lkml.kernel.org/r/20181008190341.5e396491@canb.auug.org.au
      Link: http://lkml.kernel.org/r/1536927045-23536-30-git-send-email-rppt@linux.vnet.ibm.comSigned-off-by: NMike Rapoport <rppt@linux.vnet.ibm.com>
      Signed-off-by: NStephen Rothwell <sfr@canb.auug.org.au>
      Acked-by: NMichal Hocko <mhocko@suse.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Chris Zankel <chris@zankel.net>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Geert Uytterhoeven <geert@linux-m68k.org>
      Cc: Greentime Hu <green.hu@gmail.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Guan Xuetao <gxt@pku.edu.cn>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: "James E.J. Bottomley" <jejb@parisc-linux.org>
      Cc: Jonas Bonn <jonas@southpole.se>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Ley Foon Tan <lftan@altera.com>
      Cc: Mark Salter <msalter@redhat.com>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Matt Turner <mattst88@gmail.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Michal Simek <monstr@monstr.eu>
      Cc: Palmer Dabbelt <palmer@sifive.com>
      Cc: Paul Burton <paul.burton@mips.com>
      Cc: Richard Kuo <rkuo@codeaurora.org>
      Cc: Richard Weinberger <richard@nod.at>
      Cc: Rich Felker <dalias@libc.org>
      Cc: Russell King <linux@armlinux.org.uk>
      Cc: Serge Semin <fancer.lancer@gmail.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: Vineet Gupta <vgupta@synopsys.com>
      Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      57c8a661
  4. 27 10月, 2018 4 次提交
  5. 05 9月, 2018 1 次提交
  6. 23 8月, 2018 1 次提交
    • O
      mm/page_alloc: Introduce free_area_init_core_hotplug · 03e85f9d
      Oscar Salvador 提交于
      Currently, whenever a new node is created/re-used from the memhotplug
      path, we call free_area_init_node()->free_area_init_core().  But there is
      some code that we do not really need to run when we are coming from such
      path.
      
      free_area_init_core() performs the following actions:
      
      1) Initializes pgdat internals, such as spinlock, waitqueues and more.
      2) Account # nr_all_pages and # nr_kernel_pages. These values are used later on
         when creating hash tables.
      3) Account number of managed_pages per zone, substracting dma_reserved and
         memmap pages.
      4) Initializes some fields of the zone structure data
      5) Calls init_currently_empty_zone to initialize all the freelists
      6) Calls memmap_init to initialize all pages belonging to certain zone
      
      When called from memhotplug path, free_area_init_core() only performs
      actions #1 and #4.
      
      Action #2 is pointless as the zones do not have any pages since either the
      node was freed, or we are re-using it, eitherway all zones belonging to
      this node should have 0 pages.  For the same reason, action #3 results
      always in manages_pages being 0.
      
      Action #5 and #6 are performed later on when onlining the pages:
       online_pages()->move_pfn_range_to_zone()->init_currently_empty_zone()
       online_pages()->move_pfn_range_to_zone()->memmap_init_zone()
      
      This patch does two things:
      
      First, moves the node/zone initializtion to their own function, so it
      allows us to create a small version of free_area_init_core, where we only
      perform:
      
      1) Initialization of pgdat internals, such as spinlock, waitqueues and more
      4) Initialization of some fields of the zone structure data
      
      These two functions are: pgdat_init_internals() and zone_init_internals().
      
      The second thing this patch does, is to introduce
      free_area_init_core_hotplug(), the memhotplug version of
      free_area_init_core():
      
      Currently, we call free_area_init_node() from the memhotplug path.  In
      there, we set some pgdat's fields, and call calculate_node_totalpages().
      calculate_node_totalpages() calculates the # of pages the node has.
      
      Since the node is either new, or we are re-using it, the zones belonging
      to this node should not have any pages, so there is no point to calculate
      this now.
      
      Actually, we re-set these values to 0 later on with the calls to:
      
      reset_node_managed_pages()
      reset_node_present_pages()
      
      The # of pages per node and the # of pages per zone will be calculated when
      onlining the pages:
      
      online_pages()->move_pfn_range()->move_pfn_range_to_zone()->resize_zone_range()
      online_pages()->move_pfn_range()->move_pfn_range_to_zone()->resize_pgdat_range()
      
      Also, since free_area_init_core/free_area_init_node will now only get called during early init, let us replace
      __paginginit with __init, so their code gets freed up.
      
      [osalvador@techadventures.net: fix section usage]
        Link: http://lkml.kernel.org/r/20180731101752.GA473@techadventures.net
      [osalvador@suse.de: v6]
        Link: http://lkml.kernel.org/r/20180801122348.21588-6-osalvador@techadventures.net
      Link: http://lkml.kernel.org/r/20180730101757.28058-5-osalvador@techadventures.netSigned-off-by: NOscar Salvador <osalvador@suse.de>
      Reviewed-by: NPavel Tatashin <pasha.tatashin@oracle.com>
      Acked-by: NMichal Hocko <mhocko@suse.com>
      Acked-by: NVlastimil Babka <vbabka@suse.cz>
      Cc: Pasha Tatashin <Pavel.Tatashin@microsoft.com>
      Cc: Aaron Lu <aaron.lu@intel.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      03e85f9d
  7. 18 8月, 2018 3 次提交
  8. 08 6月, 2018 1 次提交
  9. 26 5月, 2018 1 次提交
    • J
      mm/memory_hotplug: fix leftover use of struct page during hotplug · a2155861
      Jonathan Cameron 提交于
      The case of a new numa node got missed in avoiding using the node info
      from page_struct during hotplug.  In this path we have a call to
      register_mem_sect_under_node (which allows us to specify it is hotplug
      so don't change the node), via link_mem_sections which unfortunately
      does not.
      
      Fix is to pass check_nid through link_mem_sections as well and disable
      it in the new numa node path.
      
      Note the bug only 'sometimes' manifests depending on what happens to be
      in the struct page structures - there are lots of them and it only needs
      to match one of them.
      
      The result of the bug is that (with a new memory only node) we never
      successfully call register_mem_sect_under_node so don't get the memory
      associated with the node in sysfs and meminfo for the node doesn't
      report it.
      
      It came up whilst testing some arm64 hotplug patches, but appears to be
      universal.  Whilst I'm triggering it by removing then reinserting memory
      to a node with no other elements (thus making the node disappear then
      appear again), it appears it would happen on hotplugging memory where
      there was none before and it doesn't seem to be related the arm64
      patches.
      
      These patches call __add_pages (where most of the issue was fixed by
      Pavel's patch).  If there is a node at the time of the __add_pages call
      then all is well as it calls register_mem_sect_under_node from there
      with check_nid set to false.  Without a node that function returns
      having not done the sysfs related stuff as there is no node to use.
      This is expected but it is the resulting path that fails...
      
      Exact path to the problem is as follows:
      
       mm/memory_hotplug.c: add_memory_resource()
      
         The node is not online so we enter the 'if (new_node)' twice, on the
         second such block there is a call to link_mem_sections which calls
         into
      
        drivers/node.c: link_mem_sections() which calls
      
        drivers/node.c: register_mem_sect_under_node() which calls
           get_nid_for_pfn and keeps trying until the output of that matches
           the expected node (passed all the way down from
           add_memory_resource)
      
      It is effectively the same fix as the one referred to in the fixes tag
      just in the code path for a new node where the comments point out we
      have to rerun the link creation because it will have failed in
      register_new_memory (as there was no node at the time).  (actually that
      comment is wrong now as we don't have register_new_memory any more it
      got renamed to hotplug_memory_register in Pavel's patch).
      
      Link: http://lkml.kernel.org/r/20180504085311.1240-1-Jonathan.Cameron@huawei.com
      Fixes: fc44f7f9 ("mm/memory_hotplug: don't read nid from struct page during hotplug")
      Signed-off-by: NJonathan Cameron <Jonathan.Cameron@huawei.com>
      Reviewed-by: NPavel Tatashin <pasha.tatashin@oracle.com>
      Acked-by: NMichal Hocko <mhocko@suse.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      a2155861
  10. 12 4月, 2018 2 次提交
    • M
      mm: unclutter THP migration · 94723aaf
      Michal Hocko 提交于
      THP migration is hacked into the generic migration with rather
      surprising semantic.  The migration allocation callback is supposed to
      check whether the THP can be migrated at once and if that is not the
      case then it allocates a simple page to migrate.  unmap_and_move then
      fixes that up by spliting the THP into small pages while moving the head
      page to the newly allocated order-0 page.  Remaning pages are moved to
      the LRU list by split_huge_page.  The same happens if the THP allocation
      fails.  This is really ugly and error prone [1].
      
      I also believe that split_huge_page to the LRU lists is inherently wrong
      because all tail pages are not migrated.  Some callers will just work
      around that by retrying (e.g.  memory hotplug).  There are other pfn
      walkers which are simply broken though.  e.g. madvise_inject_error will
      migrate head and then advances next pfn by the huge page size.
      do_move_page_to_node_array, queue_pages_range (migrate_pages, mbind),
      will simply split the THP before migration if the THP migration is not
      supported then falls back to single page migration but it doesn't handle
      tail pages if the THP migration path is not able to allocate a fresh THP
      so we end up with ENOMEM and fail the whole migration which is a
      questionable behavior.  Page compaction doesn't try to migrate large
      pages so it should be immune.
      
      This patch tries to unclutter the situation by moving the special THP
      handling up to the migrate_pages layer where it actually belongs.  We
      simply split the THP page into the existing list if unmap_and_move fails
      with ENOMEM and retry.  So we will _always_ migrate all THP subpages and
      specific migrate_pages users do not have to deal with this case in a
      special way.
      
      [1] http://lkml.kernel.org/r/20171121021855.50525-1-zi.yan@sent.com
      
      Link: http://lkml.kernel.org/r/20180103082555.14592-4-mhocko@kernel.orgSigned-off-by: NMichal Hocko <mhocko@suse.com>
      Acked-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Reviewed-by: NZi Yan <zi.yan@cs.rutgers.edu>
      Cc: Andrea Reale <ar@linux.vnet.ibm.com>
      Cc: Anshuman Khandual <khandual@linux.vnet.ibm.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      94723aaf
    • M
      mm, migrate: remove reason argument from new_page_t · 666feb21
      Michal Hocko 提交于
      No allocation callback is using this argument anymore.  new_page_node
      used to use this parameter to convey node_id resp.  migration error up
      to move_pages code (do_move_page_to_node_array).  The error status never
      made it into the final status field and we have a better way to
      communicate node id to the status field now.  All other allocation
      callbacks simply ignored the argument so we can drop it finally.
      
      [mhocko@suse.com: fix migration callback]
        Link: http://lkml.kernel.org/r/20180105085259.GH2801@dhcp22.suse.cz
      [akpm@linux-foundation.org: fix alloc_misplaced_dst_page()]
      [mhocko@kernel.org: fix build]
        Link: http://lkml.kernel.org/r/20180103091134.GB11319@dhcp22.suse.cz
      Link: http://lkml.kernel.org/r/20180103082555.14592-3-mhocko@kernel.orgSigned-off-by: NMichal Hocko <mhocko@suse.com>
      Reviewed-by: NZi Yan <zi.yan@cs.rutgers.edu>
      Cc: Andrea Reale <ar@linux.vnet.ibm.com>
      Cc: Anshuman Khandual <khandual@linux.vnet.ibm.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      666feb21
  11. 06 4月, 2018 4 次提交
    • M
    • P
      mm/memory_hotplug: optimize memory hotplug · d0dc12e8
      Pavel Tatashin 提交于
      During memory hotplugging we traverse struct pages three times:
      
      1. memset(0) in sparse_add_one_section()
      2. loop in __add_section() to set do: set_page_node(page, nid); and
         SetPageReserved(page);
      3. loop in memmap_init_zone() to call __init_single_pfn()
      
      This patch removes the first two loops, and leaves only loop 3.  All
      struct pages are initialized in one place, the same as it is done during
      boot.
      
      The benefits:
      
       - We improve memory hotplug performance because we are not evicting the
         cache several times and also reduce loop branching overhead.
      
       - Remove condition from hotpath in __init_single_pfn(), that was added
         in order to fix the problem that was reported by Bharata in the above
         email thread, thus also improve performance during normal boot.
      
       - Make memory hotplug more similar to the boot memory initialization
         path because we zero and initialize struct pages only in one
         function.
      
       - Simplifies memory hotplug struct page initialization code, and thus
         enables future improvements, such as multi-threading the
         initialization of struct pages in order to improve hotplug
         performance even further on larger machines.
      
      [pasha.tatashin@oracle.com: v5]
        Link: http://lkml.kernel.org/r/20180228030308.1116-7-pasha.tatashin@oracle.com
      Link: http://lkml.kernel.org/r/20180215165920.8570-7-pasha.tatashin@oracle.comSigned-off-by: NPavel Tatashin <pasha.tatashin@oracle.com>
      Reviewed-by: NIngo Molnar <mingo@kernel.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Baoquan He <bhe@redhat.com>
      Cc: Bharata B Rao <bharata@linux.vnet.ibm.com>
      Cc: Daniel Jordan <daniel.m.jordan@oracle.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Steven Sistare <steven.sistare@oracle.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      d0dc12e8
    • P
      mm/memory_hotplug: don't read nid from struct page during hotplug · fc44f7f9
      Pavel Tatashin 提交于
      During memory hotplugging the probe routine will leave struct pages
      uninitialized, the same as it is currently done during boot.  Therefore,
      we do not want to access the inside of struct pages before
      __init_single_page() is called during onlining.
      
      Because during hotplug we know that pages in one memory block belong to
      the same numa node, we can skip the checking.  We should keep checking
      for the boot case.
      
      [pasha.tatashin@oracle.com: s/register_new_memory()/hotplug_memory_register()]
        Link: http://lkml.kernel.org/r/20180228030308.1116-6-pasha.tatashin@oracle.com
      Link: http://lkml.kernel.org/r/20180215165920.8570-6-pasha.tatashin@oracle.comSigned-off-by: NPavel Tatashin <pasha.tatashin@oracle.com>
      Acked-by: NMichal Hocko <mhocko@suse.com>
      Reviewed-by: NIngo Molnar <mingo@kernel.org>
      Cc: Baoquan He <bhe@redhat.com>
      Cc: Bharata B Rao <bharata@linux.vnet.ibm.com>
      Cc: Daniel Jordan <daniel.m.jordan@oracle.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Steven Sistare <steven.sistare@oracle.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      fc44f7f9
    • P
      mm/memory_hotplug: enforce block size aligned range check · ba325585
      Pavel Tatashin 提交于
      Patch series "optimize memory hotplug", v3.
      
      This patchset:
      
       - Improves hotplug performance by eliminating a number of struct page
         traverses during memory hotplug.
      
       - Fixes some issues with hotplugging, where boundaries were not
         properly checked. And on x86 block size was not properly aligned with
         end of memory
      
       - Also, potentially improves boot performance by eliminating condition
         from __init_single_page().
      
       - Adds robustness by verifying that that struct pages are correctly
         poisoned when flags are accessed.
      
      The following experiments were performed on Xeon(R) CPU E7-8895 v3 @
      2.60GHz with 1T RAM:
      
      booting in qemu with 960G of memory, time to initialize struct pages:
      
      no-kvm:
      	TRY1		TRY2
      BEFORE:	39.433668	39.39705
      AFTER:	36.903781	36.989329
      
      with-kvm:
      BEFORE:	10.977447	11.103164
      AFTER:	10.929072	10.751885
      
      Hotplug 896G memory:
      no-kvm:
      	TRY1		TRY2
      BEFORE: 848.740000	846.910000
      AFTER:  783.070000	786.560000
      
      with-kvm:
      	TRY1		TRY2
      BEFORE: 34.410000	33.57
      AFTER:	29.810000	29.580000
      
      This patch (of 6):
      
      Start qemu with the following arguments:
      
        -m 64G,slots=2,maxmem=66G -object memory-backend-ram,id=mem1,size=2G
      
      Which: boots machine with 64G, and adds a device mem1 with 2G which can
      be hotplugged later.
      
      Also make sure that config has the following turned on:
        CONFIG_MEMORY_HOTPLUG
        CONFIG_MEMORY_HOTPLUG_DEFAULT_ONLINE
        CONFIG_ACPI_HOTPLUG_MEMORY
      
      Using the qemu monitor hotplug the memory (make sure config has (qemu)
      device_add pc-dimm,id=dimm1,memdev=mem1
      
      The operation will fail with the following trace:
      
          WARNING: CPU: 0 PID: 91 at drivers/base/memory.c:205
          pages_correctly_reserved+0xe6/0x110
          Modules linked in:
          CPU: 0 PID: 91 Comm: systemd-udevd Not tainted 4.16.0-rc1_pt_master #29
          Hardware name: QEMU Standard PC (i440FX + PIIX, 1996),
          BIOS rel-1.11.0-0-g63451fca13-prebuilt.qemu-project.org 04/01/2014
          RIP: 0010:pages_correctly_reserved+0xe6/0x110
          Call Trace:
           memory_subsys_online+0x44/0xa0
           device_online+0x51/0x80
           store_mem_state+0x5e/0xe0
           kernfs_fop_write+0xfa/0x170
           __vfs_write+0x2e/0x150
           vfs_write+0xa8/0x1a0
           SyS_write+0x4d/0xb0
           do_syscall_64+0x5d/0x110
           entry_SYSCALL_64_after_hwframe+0x21/0x86
          ---[ end trace 6203bc4f1a5d30e8 ]---
      
      The problem is detected in: drivers/base/memory.c
      
         static bool pages_correctly_reserved(unsigned long start_pfn)
         205                 if (WARN_ON_ONCE(!pfn_valid(pfn)))
      
      This function loops through every section in the newly added memory
      block and verifies that the first pfn is valid, meaning section exists,
      has mapping (struct page array), and is online.
      
      The block size on x86 is usually 128M, but when machine is booted with
      more than 64G of memory, the block size is changed to 2G: $ cat
      /sys/devices/system/memory/block_size_bytes 80000000
      
      or
      
         $ dmesg | grep "block size"
         [    0.086469] x86/mm: Memory block size: 2048MB
      
      During memory hotplug, and hotremove we verify that the range is section
      size aligned, but we actually must verify that it is block size aligned,
      because that is the proper unit for hotplug operations.  See:
      Documentation/memory-hotplug.txt
      
      So, when the start_pfn of newly added memory is not block size aligned,
      we can get a memory block that has only part of it with properly
      populated sections.
      
      In our case the start_pfn starts from the last_pfn (end of physical
      memory).
      
         $ dmesg | grep last_pfn
         [    0.000000] e820: last_pfn = 0x1040000 max_arch_pfn = 0x400000000
      
      0x1040000 == 65G, and so is not 2G aligned!
      
      The fix is to enforce that memory that is hotplugged and hotremoved is
      block size aligned.
      
      With this fix, running the above sequence yield to the following result:
      
         (qemu) device_add pc-dimm,id=dimm1,memdev=mem1
         Block size [0x80000000] unaligned hotplug range: start 0x1040000000,
         							size 0x80000000
         acpi PNP0C80:00: add_memory failed
         acpi PNP0C80:00: acpi_memory_enable_device() error
         acpi PNP0C80:00: Enumeration failure
      
      Link: http://lkml.kernel.org/r/20180213193159.14606-2-pasha.tatashin@oracle.comSigned-off-by: NPavel Tatashin <pasha.tatashin@oracle.com>
      Reviewed-by: NIngo Molnar <mingo@kernel.org>
      Acked-by: NMichal Hocko <mhocko@suse.com>
      Cc: Baoquan He <bhe@redhat.com>
      Cc: Bharata B Rao <bharata@linux.vnet.ibm.com>
      Cc: Daniel Jordan <daniel.m.jordan@oracle.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Steven Sistare <steven.sistare@oracle.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      ba325585
  12. 01 2月, 2018 3 次提交
  13. 09 1月, 2018 6 次提交
  14. 16 11月, 2017 3 次提交
    • F
    • M
      mm, memory_hotplug: remove timeout from __offline_memory · ecde0f3e
      Michal Hocko 提交于
      We have a hardcoded 120s timeout after which the memory offline fails
      basically since the hot remove has been introduced.  This is essentially
      a policy implemented in the kernel.  Moreover there is no way to adjust
      the timeout and so we are sometimes facing memory offline failures if
      the system is under a heavy memory pressure or very intensive CPU
      workload on large machines.
      
      It is not very clear what purpose the timeout actually serves.  The
      offline operation is interruptible by a signal so if userspace wants
      some timeout based termination this can be done trivially by sending a
      signal.
      
      If there is a strong usecase to do this from the kernel then we should
      do it properly and have a it tunable from the userspace with the timeout
      disabled by default along with the explanation who uses it and for what
      purporse.
      
      Link: http://lkml.kernel.org/r/20170918070834.13083-3-mhocko@kernel.orgSigned-off-by: NMichal Hocko <mhocko@suse.com>
      Acked-by: NVlastimil Babka <vbabka@suse.cz>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Reza Arbab <arbab@linux.vnet.ibm.com>
      Cc: Yasuaki Ishimatsu <yasu.isimatu@gmail.com>
      Cc: Xishi Qiu <qiuxishi@huawei.com>
      Cc: Igor Mammedov <imammedo@redhat.com>
      Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      ecde0f3e
    • M
      mm, memory_hotplug: do not fail offlining too early · 72b39cfc
      Michal Hocko 提交于
      Patch series "mm, memory_hotplug: redefine memory offline retry logic", v2.
      
      While testing memory hotplug on a large 4TB machine we have noticed that
      memory offlining is just too eager to fail.  The primary reason is that
      the retry logic is just too easy to give up.  We have 4 ways out of the
      offline
      
      	- we have a permanent failure (isolation or memory notifiers fail,
      	  or hugetlb pages cannot be dropped)
      	- userspace sends a signal
      	- a hardcoded 120s timeout expires
      	- page migration fails 5 times
      
      This is way too convoluted and it doesn't scale very well.  We have seen
      both temporary migration failures as well as 120s being triggered.
      After removing those restrictions we were able to pass stress testing
      during memory hot remove without any other negative side effects
      observed.  Therefore I suggest dropping both hard coded policies.  I
      couldn't have found any specific reason for them in the changelog.  I
      neither didn't get any response [1] from Kamezawa.  If we need some
      upper bound - e.g.  timeout based - then we should have a proper and
      user defined policy for that.  In any case there should be a clear use
      case when introducing it.
      
      This patch (of 2):
      
      Memory offlining can fail too eagerly under heavy memory pressure.
      
        page:ffffea22a646bd00 count:255 mapcount:252 mapping:ffff88ff926c9f38 index:0x3
        flags: 0x9855fe40010048(uptodate|active|mappedtodisk)
        page dumped because: isolation failed
        page->mem_cgroup:ffff8801cd662000
        memory offlining [mem 0x18b580000000-0x18b5ffffffff] failed
      
      Isolation has failed here because the page is not on LRU.  Most probably
      because it was on the pcp LRU cache or it has been removed from the LRU
      already but it hasn't been freed yet.  In both cases the page doesn't
      look non-migrable so retrying more makes sense.
      
      __offline_pages seems rather cluttered when it comes to the retry logic.
      We have 5 retries at maximum and a timeout.  We could argue whether the
      timeout makes sense but failing just because of a race when somebody
      isoltes a page from LRU or puts it on a pcp LRU lists is just wrong.  It
      only takes it to race with a process which unmaps some pages and remove
      them from the LRU list and we can fail the whole offline because of
      something that is a temporary condition and actually not harmful for the
      offline.
      
      Please note that unmovable pages should be already excluded during
      start_isolate_page_range.  We could argue that has_unmovable_pages is
      racy and MIGRATE_MOVABLE check doesn't provide any hard guarantee either
      but kernel zones (aka < ZONE_MOVABLE) will very likely detect unmovable
      pages in most cases and movable zone shouldn't contain unmovable pages
      at all.  Some of those pages might be pinned but not for ever because
      that would be a bug on its own.  In any case the context is still
      interruptible and so the userspace can easily bail out when the
      operation takes too long.  This is certainly better behavior than a
      hardcoded retry loop which is racy.
      
      Fix this by removing the max retry count and only rely on the timeout
      resp. interruption by a signal from the userspace.  Also retry rather
      than fail when check_pages_isolated sees some !free pages because those
      could be a result of the race as well.
      
      Link: http://lkml.kernel.org/r/20170918070834.13083-2-mhocko@kernel.orgSigned-off-by: NMichal Hocko <mhocko@suse.com>
      Acked-by: NVlastimil Babka <vbabka@suse.cz>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Reza Arbab <arbab@linux.vnet.ibm.com>
      Cc: Yasuaki Ishimatsu <yasu.isimatu@gmail.com>
      Cc: Xishi Qiu <qiuxishi@huawei.com>
      Cc: Igor Mammedov <imammedo@redhat.com>
      Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      72b39cfc