1. 06 Mar, 2019 1 commit
  2. 22 Feb, 2019 1 commit
    • mm, memory_hotplug: fix off-by-one in is_pageblock_removable · 891cb2a7
      Michal Hocko authored
      Rong Chen has reported the following boot crash:
      
          PGD 0 P4D 0
          Oops: 0000 [#1] PREEMPT SMP PTI
          CPU: 1 PID: 239 Comm: udevd Not tainted 5.0.0-rc4-00149-gefad4e47 #1
          Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1 04/01/2014
          RIP: 0010:page_mapping+0x12/0x80
          Code: 5d c3 48 89 df e8 0e ad 02 00 85 c0 75 da 89 e8 5b 5d c3 0f 1f 44 00 00 53 48 89 fb 48 8b 43 08 48 8d 50 ff a8 01 48 0f 45 da <48> 8b 53 08 48 8d 42 ff 83 e2 01 48 0f 44 c3 48 83 38 ff 74 2f 48
          RSP: 0018:ffff88801fa87cd8 EFLAGS: 00010202
          RAX: ffffffffffffffff RBX: fffffffffffffffe RCX: 000000000000000a
          RDX: fffffffffffffffe RSI: ffffffff820b9a20 RDI: ffff88801e5c0000
          RBP: 6db6db6db6db6db7 R08: ffff88801e8bb000 R09: 0000000001b64d13
          R10: ffff88801fa87cf8 R11: 0000000000000001 R12: ffff88801e640000
          R13: ffffffff820b9a20 R14: ffff88801f145258 R15: 0000000000000001
          FS:  00007fb2079817c0(0000) GS:ffff88801dd00000(0000) knlGS:0000000000000000
          CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
          CR2: 0000000000000006 CR3: 000000001fa82000 CR4: 00000000000006a0
          Call Trace:
           __dump_page+0x14/0x2c0
           is_mem_section_removable+0x24c/0x2c0
           removable_show+0x87/0xa0
           dev_attr_show+0x25/0x60
           sysfs_kf_seq_show+0xba/0x110
           seq_read+0x196/0x3f0
           __vfs_read+0x34/0x180
           vfs_read+0xa0/0x150
           ksys_read+0x44/0xb0
           do_syscall_64+0x5e/0x4a0
           entry_SYSCALL_64_after_hwframe+0x49/0xbe
      
      and bisected it down to commit efad4e47 ("mm, memory_hotplug:
      is_mem_section_removable do not pass the end of a zone").
      
      The reason for the crash is that the mapping is garbage for a poisoned
      (uninitialized) page.  This shouldn't happen, as all pages within the
      zone's boundaries should be initialized.

      Later debugging revealed that the actual problem is an off-by-one when
      evaluating the end_page.  'start_pfn + nr_pages' resp. 'zone_end_pfn'
      refers to a pfn after the range and as such it might belong to a
      different memory section.

      This, along with CONFIG_SPARSEMEM, then makes the loop condition
      completely bogus because pointer arithmetic doesn't work for pages
      from two different sections in that memory model.

      Fix the issue by reworking is_pageblock_removable to be pfn based and
      to only use struct page where necessary.  This makes the code slightly
      easier to follow and removes the problematic pointer arithmetic
      completely.
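      As an illustration, a minimal sketch of the pfn-based walk described
      above (simplified; helper names follow mm/memory_hotplug.c but details
      may differ from the actual patch):

          bool is_mem_section_removable(unsigned long start_pfn,
                                        unsigned long nr_pages)
          {
                  unsigned long pfn, end_pfn;

                  /* Clamp to the zone, but never convert end_pfn to a page. */
                  end_pfn = min(start_pfn + nr_pages,
                                zone_end_pfn(page_zone(pfn_to_page(start_pfn))));

                  /* Walk pageblocks by pfn; pfn_to_page() is done per block,
                   * so no pointer arithmetic crosses a sparse section. */
                  for (pfn = start_pfn; pfn < end_pfn;
                       pfn = next_active_pageblock(pfn)) {
                          if (!is_pageblock_removable_nolock(pfn))
                                  return false;
                  }
                  return true;
          }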
      
      Link: http://lkml.kernel.org/r/20190218181544.14616-1-mhocko@kernel.org
      Fixes: efad4e47 ("mm, memory_hotplug: is_mem_section_removable do not pass the end of a zone")
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Reported-by: <rong.a.chen@intel.com>
      Tested-by: <rong.a.chen@intel.com>
      Acked-by: Mike Rapoport <rppt@linux.ibm.com>
      Reviewed-by: Oscar Salvador <osalvador@suse.de>
      Cc: Matthew Wilcox <willy@infradead.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      891cb2a7
  3. 02 Feb, 2019 5 commits
  4. 29 Dec, 2018 12 commits
    • mm, memory_hotplug: deobfuscate migration part of offlining · bb8965bd
      Michal Hocko authored
      Memory migration might fail during offlining and we keep retrying in that
      case.  This is currently obfuscated by a goto retry loop.  The code is hard
      to follow and as a result it is even suboptimal because each retry round
      scans the full range from start_pfn even though we have successfully
      scanned/migrated the [start_pfn, pfn] range already.  This is all only because
      a check_pages_isolated failure has to rescan the full range again.

      De-obfuscate the migration retry loop by promoting it to a real for loop.
      In fact remove the goto altogether by making it a proper double loop
      (yeah, gotos are nasty in this specific case).  In the end we will get
      slightly more optimal code which is also more readable.
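      As an illustration, a hedged sketch of the resulting double-loop shape
      (simplified, not the literal patch; the helpers are the ones already
      used in __offline_pages):

          do {
                  /* Inner loop keeps walking forward from where it stopped. */
                  for (pfn = start_pfn; pfn;) {
                          if (signal_pending(current))
                                  return -EINTR;
                          cond_resched();
                          pfn = scan_movable_pages(pfn, end_pfn);
                          if (pfn)
                                  do_migrate_range(pfn, end_pfn);
                  }
                  /* Only a failed isolation check restarts the whole range. */
                  ret = dissolve_free_huge_pages(start_pfn, end_pfn);
                  if (!ret)
                          ret = check_pages_isolated(start_pfn, end_pfn);
          } while (ret);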
      
      [akpm@linux-foundation.org: reflow comments to 80 cols]
      Link: http://lkml.kernel.org/r/20181211142741.2607-3-mhocko@kernel.org
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Reviewed-by: David Hildenbrand <david@redhat.com>
      Reviewed-by: Oscar Salvador <osalvador@suse.de>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
      Cc: William Kucharski <william.kucharski@oracle.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      bb8965bd
    • mm, memory_hotplug: try to migrate full pfn range · a85009c3
      Michal Hocko authored
      Patch series "few memory offlining enhancements".
      
      I have been chasing memory offlining not making progress recently.  On the
      way I have noticed a few weird decisions in the code.  The migration itself
      is restricted without a reasonable justification and the retry loop around
      the migration is quite messy.  This is addressed by patch 1 and patch 2.

      Patch 3 targets the faultaround code, which has been a hot
      candidate for the initial issue reported upstream [2] and which I am
      debugging internally.  It turned out not to be the main contributor in the
      end but I believe we should address it regardless.  See the patch
      description for more details.
      
      [1] http://lkml.kernel.org/r/20181120134323.13007-1-mhocko@kernel.org
      [2] http://lkml.kernel.org/r/20181114070909.GB2653@MiWiFi-R3L-srv
      
      This patch (of 3):
      
      do_migrate_range has been limiting the number of pages to migrate to 256
      for some reason which is not documented.  Even if the limit made some
      sense back then when it was introduced, it doesn't really serve a good
      purpose these days.  If the range contains huge pages then we break out of
      the loop too early and needlessly go through LRU and pcp cache draining and
      scan_movable_pages again, which is quite suboptimal.

      The only reason to limit the number of pages I can think of is to reduce
      the potential time to react to a fatal signal.  But even then the number
      of pages is a questionable metric because even a single page migration
      might block in a non-killable state (e.g.  __unmap_and_move).

      Remove the limit and offline the full requested range (this is one
      memblock worth of pages with the current code).  Should we ever get a
      report that offlining takes too long to react to a fatal signal then we
      should rather fix the core migration to use killable waits and bail out
      on a signal.
      
      Link: http://lkml.kernel.org/r/20181211142741.2607-1-mhocko@kernel.org
      Link: http://lkml.kernel.org/r/20181211142741.2607-2-mhocko@kernel.org
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Reviewed-by: David Hildenbrand <david@redhat.com>
      Reviewed-by: Pavel Tatashin <pasha.tatashin@soleen.com>
      Reviewed-by: Oscar Salvador <osalvador@suse.de>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: William Kucharski <william.kucharski@oracle.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      a85009c3
    • hwpoison, memory_hotplug: allow hwpoisoned pages to be offlined · b15c8726
      Michal Hocko authored
      We have received a bug report that an injected MCE about faulty memory
      prevents memory offlining from succeeding on a 4.4-based kernel.  The
      underlying reason was that the HWPoison page has an elevated reference
      count and the migration keeps failing.  There are two problems with that.
      First of all it is dubious to migrate the poisoned page because we know
      that accessing that memory may fail.  Secondly it doesn't make any sense
      to migrate potentially broken content and preserve the memory corruption
      in a new location.

      Oscar has found out that 4.4 and the current upstream kernels behave
      slightly differently with his simple testcase
      
      ===
      
      /* Userspace headers and a PAGE_ALIGN helper are assumed here; the
       * original report did not show them. */
      #include <fcntl.h>
      #include <stdio.h>
      #include <stdlib.h>
      #include <unistd.h>
      #include <sys/mman.h>

      #define PAGE_ALIGN(x) (((x) + 4095UL) & ~4095UL)

      int main(void)
      {
              int ret;
              int i;
              int fd;
              char *array = malloc(4096);
              char *array_locked = malloc(4096);
      
              fd = open("/tmp/data", O_RDONLY);
              read(fd, array, 4095);
      
              for (i = 0; i < 4096; i++)
                      array_locked[i] = 'd';
      
              ret = mlock((void *)PAGE_ALIGN((unsigned long)array_locked), sizeof(array_locked));
              if (ret)
                      perror("mlock");
      
              sleep (20);
      
              ret = madvise((void *)PAGE_ALIGN((unsigned long)array_locked), 4096, MADV_HWPOISON);
              if (ret)
                      perror("madvise");
      
              for (i = 0; i < 4096; i++)
                      array_locked[i] = 'd';
      
              return 0;
      }
      ===
      
      + offline this memory.
      
      In 4.4 kernels he saw the hwpoisoned page being returned back to the LRU
      list:
      kernel:  [<ffffffff81019ac9>] dump_trace+0x59/0x340
      kernel:  [<ffffffff81019e9a>] show_stack_log_lvl+0xea/0x170
      kernel:  [<ffffffff8101ac71>] show_stack+0x21/0x40
      kernel:  [<ffffffff8132bb90>] dump_stack+0x5c/0x7c
      kernel:  [<ffffffff810815a1>] warn_slowpath_common+0x81/0xb0
      kernel:  [<ffffffff811a275c>] __pagevec_lru_add_fn+0x14c/0x160
      kernel:  [<ffffffff811a2eed>] pagevec_lru_move_fn+0xad/0x100
      kernel:  [<ffffffff811a334c>] __lru_cache_add+0x6c/0xb0
      kernel:  [<ffffffff81195236>] add_to_page_cache_lru+0x46/0x70
      kernel:  [<ffffffffa02b4373>] extent_readpages+0xc3/0x1a0 [btrfs]
      kernel:  [<ffffffff811a16d7>] __do_page_cache_readahead+0x177/0x200
      kernel:  [<ffffffff811a18c8>] ondemand_readahead+0x168/0x2a0
      kernel:  [<ffffffff8119673f>] generic_file_read_iter+0x41f/0x660
      kernel:  [<ffffffff8120e50d>] __vfs_read+0xcd/0x140
      kernel:  [<ffffffff8120e9ea>] vfs_read+0x7a/0x120
      kernel:  [<ffffffff8121404b>] kernel_read+0x3b/0x50
      kernel:  [<ffffffff81215c80>] do_execveat_common.isra.29+0x490/0x6f0
      kernel:  [<ffffffff81215f08>] do_execve+0x28/0x30
      kernel:  [<ffffffff81095ddb>] call_usermodehelper_exec_async+0xfb/0x130
      kernel:  [<ffffffff8161c045>] ret_from_fork+0x55/0x80
      
      And the latter confuses the hotremove path because an LRU page is
      attempted to be migrated and that fails due to an elevated reference
      count.  It is quite possible that the reuse of the HWPoisoned page is some
      kind of fixed race condition but I am not really sure about that.

      With the upstream kernel the failure is slightly different.  The page
      doesn't seem to have the LRU bit set but isolate_movable_page simply fails
      and do_migrate_range simply puts all the isolated pages back to LRU and
      therefore no progress is made and scan_movable_pages finds the same set of
      pages over and over again.
      
      Fix both cases by explicitly checking HWPoisoned pages before we even try
      to get a reference on the page, and try to unmap it if it is still mapped.
      As explained by Naoya:

      : Hwpoison code never unmapped those for no big reason because
      : Ksm pages never dominate memory, so we simply didn't have strong
      : motivation to save the pages.

      Also put a WARN_ON(PageLRU) in case there is a race and we can hit LRU
      HWPoison pages, which shouldn't happen but I couldn't convince myself about
      that.  Naoya has noted the following:
      
      : Theoretically no such guarantee, because try_to_unmap() doesn't have a
      : guarantee of success and then memory_failure() returns immediately
      : when hwpoison_user_mappings fails.
      : Or the following code (comes after hwpoison_user_mappings block) also implies
      : that the target page can still have PageLRU flag.
      :
      :         /*
      :          * Torn down by someone else?
      :          */
      :         if (PageLRU(p) && !PageSwapCache(p) && p->mapping == NULL) {
      :                 action_result(pfn, MF_MSG_TRUNCATED_LRU, MF_IGNORED);
      :                 res = -EBUSY;
      :                 goto out;
      :         }
      :
      : So I think it's OK to keep "if (WARN_ON(PageLRU(page)))" block in
      : current version of your patch.
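      As an illustration, a hedged sketch of the check described above as it
      would sit in do_migrate_range() (simplified; flags and helpers may
      differ from the actual patch):

          if (PageHWPoison(page)) {
                  /* Poisoned pages must not be migrated; their content is
                   * broken and their refcount is elevated anyway. */
                  if (WARN_ON(PageLRU(page)))
                          isolate_lru_page(page);
                  if (page_mapped(page))
                          try_to_unmap(page,
                                       TTU_IGNORE_MLOCK | TTU_IGNORE_ACCESS);
                  continue;
          }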
      
      Link: http://lkml.kernel.org/r/20181206120135.14079-1-mhocko@kernel.org
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Reviewed-by: Oscar Salvador <osalvador@suse.com>
      Debugged-by: Oscar Salvador <osalvador@suse.com>
      Tested-by: Oscar Salvador <osalvador@suse.com>
      Acked-by: David Hildenbrand <david@redhat.com>
      Acked-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      b15c8726
    • mm, hotplug: move init_currently_empty_zone() under zone_span_lock protection · fa004ab7
      Wei Yang authored
      During online_pages phase, pgdat->nr_zones will be updated in case this
      zone is empty.
      
      Currently the online_pages phase is protected by the global locks
      (device_hotplug_lock and mem_hotplug_lock), which ensures there is
      no contention during the update of nr_zones.

      These global locks introduce scalability issues (especially the second
      one), which slow down code relying on get_online_mems().  This is also a
      preparation for not having to rely on get_online_mems() but instead some
      more fine grained locks.

      The patch moves init_currently_empty_zone under both zone_span_writelock
      and pgdat_resize_lock because both the pgdat state (nr_zones) and the
      zone's start_pfn are changed.  Also this patch changes the documentation
      of node_size_lock to include the protection of nr_zones.
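      As an illustration, a hedged sketch of the intended lock nesting in the
      online path (simplified; based on move_pfn_range_to_zone, details may
      differ):

          unsigned long flags;

          pgdat_resize_lock(pgdat, &flags);
          zone_span_writelock(zone);
          if (zone_is_empty(zone))
                  init_currently_empty_zone(zone, start_pfn, nr_pages);
          resize_zone_range(zone, start_pfn, nr_pages);
          zone_span_writeunlock(zone);
          resize_pgdat_range(pgdat, start_pfn, nr_pages);
          pgdat_resize_unlock(pgdat, &flags);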
      
      Link: http://lkml.kernel.org/r/20181203205016.14123-1-richard.weiyang@gmail.com
      Signed-off-by: Wei Yang <richard.weiyang@gmail.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Reviewed-by: Oscar Salvador <osalvador@suse.de>
      Cc: David Hildenbrand <david@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      fa004ab7
    • mm, sparse: pass nid instead of pgdat to sparse_add_one_section() · 4e0d2e7e
      Wei Yang authored
      Since the only information needed in sparse_add_one_section() is the node
      id, which is used to allocate memory on the proper node, it is not
      necessary to pass the whole pgdat.

      This patch changes the prototype of sparse_add_one_section() to pass the
      node id directly.  This avoids the misleading impression that
      sparse_add_one_section() would touch the pgdat.
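      As an illustration, the prototype change roughly looks like this (a
      sketch; the altmap parameter is assumed from the surrounding code):

          /* before: the pgdat was only used to derive the node id */
          int sparse_add_one_section(struct pglist_data *pgdat,
                                     unsigned long start_pfn,
                                     struct vmem_altmap *altmap);

          /* after: pass the node id directly */
          int sparse_add_one_section(int nid, unsigned long start_pfn,
                                     struct vmem_altmap *altmap);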
      
      Link: http://lkml.kernel.org/r/20181204085657.20472-2-richard.weiyang@gmail.com
      Signed-off-by: Wei Yang <richard.weiyang@gmail.com>
      Reviewed-by: David Hildenbrand <david@redhat.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Oscar Salvador <osalvador@suse.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      4e0d2e7e
    • mm, memory_hotplug: add nid parameter to arch_remove_memory · 2c2a5af6
      Oscar Salvador authored
      Patch series "Do not touch pages in hot-remove path", v2.
      
      This patchset aims for two things:
      
       1) A better definition about offline and hot-remove stage
       2) Solving bugs where we can access non-initialized pages
          during hot-remove operations [2] [3].
      
      This is achieved by moving all page/zone handling to the offline
      stage, so we do not need to access pages when hot-removing memory.
      
      [1] https://patchwork.kernel.org/cover/10691415/
      [2] https://patchwork.kernel.org/patch/10547445/
      [3] https://www.spinics.net/lists/linux-mm/msg161316.html
      
      This patch (of 5):
      
      This is a preparation for the follow-up patches.  The idea of passing
      the nid is that it will allow us to get rid of the zone parameter
      afterwards.
      
      Link: http://lkml.kernel.org/r/20181127162005.15833-2-osalvador@suse.de
      Signed-off-by: Oscar Salvador <osalvador@suse.de>
      Reviewed-by: David Hildenbrand <david@redhat.com>
      Reviewed-by: Pavel Tatashin <pasha.tatashin@soleen.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Jerome Glisse <jglisse@redhat.com>
      Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com>
      Cc: "Rafael J. Wysocki" <rafael@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      2c2a5af6
    • mm/memory_hotplug: drop "online" parameter from add_memory_resource() · f29d8e9c
      David Hildenbrand authored
      Userspace should always be in charge of how to online memory and of
      whether memory should be onlined automatically in the kernel.  Let's drop
      the parameter used to override this - XEN passes memhp_auto_online, just
      like add_memory(), so we can directly use that internally instead.
      
      Link: http://lkml.kernel.org/r/20181123123740.27652-1-david@redhat.com
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Reviewed-by: Oscar Salvador <osalvador@suse.de>
      Acked-by: Juergen Gross <jgross@suse.com>
      Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
      Cc: Stefano Stabellini <sstabellini@kernel.org>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Pavel Tatashin <pasha.tatashin@oracle.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Arun KS <arunks@codeaurora.org>
      Cc: Mathieu Malaterre <malat@debian.org>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      f29d8e9c
    • mm, memory_hotplug: do not clear numa_node association after hot_remove · 46a3679b
      Michal Hocko authored
      Per-cpu numa_node provides a default node for each possible cpu.  The
      association gets initialized during boot when the architecture specific
      code explores cpu->NUMA affinity.  When the whole NUMA node is removed,
      though, we clear this association:
      
      try_offline_node
        check_and_unmap_cpu_on_node
          unmap_cpu_on_node
            numa_clear_node
              numa_set_node(cpu, NUMA_NO_NODE)
      
      This means that whoever calls cpu_to_node for a cpu associated with such a
      node will get NUMA_NO_NODE.  This is problematic for two reasons.  First
      it is fragile because __alloc_pages_node would simply blow up on an
      out-of-bounds access.  We have encountered this when loading the kvm module:
      
        BUG: unable to handle kernel paging request at 00000000000021c0
        IP: __alloc_pages_nodemask+0x93/0xb70
        PGD 800000ffe853e067 PUD 7336bbc067 PMD 0
        Oops: 0000 [#1] SMP
        [...]
        CPU: 88 PID: 1223749 Comm: modprobe Tainted: G        W          4.4.156-94.64-default #1
        RIP: __alloc_pages_nodemask+0x93/0xb70
        RSP: 0018:ffff887354493b40  EFLAGS: 00010202
        RAX: 00000000000021c0 RBX: 0000000000000000 RCX: 0000000000000000
        RDX: 0000000000000000 RSI: 0000000000000002 RDI: 00000000014000c0
        RBP: 00000000014000c0 R08: ffffffffffffffff R09: 0000000000000000
        R10: ffff88fffc89e790 R11: 0000000000014000 R12: 0000000000000101
        R13: ffffffffa0772cd4 R14: ffffffffa0769ac0 R15: 0000000000000000
        FS:  00007fdf2f2f1700(0000) GS:ffff88fffc880000(0000) knlGS:0000000000000000
        CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
        CR2: 00000000000021c0 CR3: 00000077205ee000 CR4: 0000000000360670
        DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
        DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
        Call Trace:
          alloc_vmcs_cpu+0x3d/0x90 [kvm_intel]
          hardware_setup+0x781/0x849 [kvm_intel]
          kvm_arch_hardware_setup+0x28/0x190 [kvm]
          kvm_init+0x7c/0x2d0 [kvm]
          vmx_init+0x1e/0x32c [kvm_intel]
          do_one_initcall+0xca/0x1f0
          do_init_module+0x5a/0x1d7
          load_module+0x1393/0x1c90
          SYSC_finit_module+0x70/0xa0
          entry_SYSCALL_64_fastpath+0x1e/0xb7
        DWARF2 unwinder stuck at entry_SYSCALL_64_fastpath+0x1e/0xb7
      
      on an older kernel, but the code is basically the same in the current
      Linus tree as well.  alloc_vmcs_cpu could use alloc_pages_nodemask, which
      would recognize NUMA_NO_NODE and use alloc_pages_node, which would
      translate it to numa_mem_id, but that is wrong as well because it would
      use the cpu affinity of the local CPU, which might be quite far from the
      original node.  It is also reasonable to expect that cpu_to_node will
      provide a sane value and there might be many more callers like that.
      
      The second problem is that __register_one_node relies on cpu_to_node to
      properly associate cpus back to the node when it is onlined.  We do not
      want to lose that link as there is no arch independent way to get it from
      the early boot time AFAICS.
      
      Drop the whole check_and_unmap_cpu_on_node machinery and keep the
      association to fix both issues.  The NODE_DATA(nid) is not deallocated so
      it will stay in place and if anybody wants to allocate from that node then
      a fallback node will be used.
      
      Thanks to Vlastimil Babka for his live system debugging skills that helped
      debug the issue.
      
      Link: http://lkml.kernel.org/r/20181108100413.966-1-mhocko@kernel.org
      Fixes: e13fe869 ("cpu-hotplug,memory-hotplug: clear cpu_to_node() when offlining the node")
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Debugged-by: Vlastimil Babka <vbabka@suse.cz>
      Reported-by: Miroslav Benes <mbenes@suse.cz>
      Acked-by: Anshuman Khandual <anshuman.khandual@arm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      46a3679b
    • mm: only report isolation failures when offlining memory · d381c547
      Michal Hocko authored
      Heiko has complained that his log is swamped by warnings from
      has_unmovable_pages
      
      [   20.536664] page dumped because: has_unmovable_pages
      [   20.536792] page:000003d081ff4080 count:1 mapcount:0 mapping:000000008ff88600 index:0x0 compound_mapcount: 0
      [   20.536794] flags: 0x3fffe0000010200(slab|head)
      [   20.536795] raw: 03fffe0000010200 0000000000000100 0000000000000200 000000008ff88600
      [   20.536796] raw: 0000000000000000 0020004100000000 ffffffff00000001 0000000000000000
      [   20.536797] page dumped because: has_unmovable_pages
      [   20.536814] page:000003d0823b0000 count:1 mapcount:0 mapping:0000000000000000 index:0x0
      [   20.536815] flags: 0x7fffe0000000000()
      [   20.536817] raw: 07fffe0000000000 0000000000000100 0000000000000200 0000000000000000
      [   20.536818] raw: 0000000000000000 0000000000000000 ffffffff00000001 0000000000000000
      
      which are not triggered by memory hotplug but rather by the CMA allocator.
      The original idea behind dumping the page state for all call paths was
      that these messages would be helpful when debugging failures.  From the
      above it seems that this is not the case for the CMA path because we are
      lacking much more context.  E.g. the second reported page might be a CMA
      allocated page.  It is still interesting to see a slab page in the CMA
      area but it is hard to tell whether this is a bug from the above output
      alone.
      
      Address this issue by dumping the page state only on request.  Both
      start_isolate_page_range and has_unmovable_pages already have an argument
      to ignore hwpoison pages, so make this argument more generic, turn it
      into flags, and allow callers to combine non-default modes into a mask.
      While we are at it, the has_unmovable_pages call from
      is_pageblock_removable_nolock (the sysfs removable file) is a questionable
      place to report the failure, so drop the reporting from there as well.
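      As an illustration, a hedged sketch of the flags idea (the names below
      follow the description but may differ from the actual patch):

          /* page-isolation flags combined by callers into a mask */
          #define SKIP_HWPOISON   0x1     /* tolerate hwpoisoned pages */
          #define REPORT_FAILURE  0x2     /* dump_page() what blocks isolation */

          bool has_unmovable_pages(struct zone *zone, struct page *page,
                                   int count, int migratetype, int flags);

          /* memory offlining asks for both; CMA passes neither and stays quiet */
          start_isolate_page_range(start_pfn, end_pfn, MIGRATE_MOVABLE,
                                   SKIP_HWPOISON | REPORT_FAILURE);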
      
      Link: http://lkml.kernel.org/r/20181218092802.31429-1-mhocko@kernel.org
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Reported-by: Heiko Carstens <heiko.carstens@de.ibm.com>
      Reviewed-by: Oscar Salvador <osalvador@suse.de>
      Cc: Anshuman Khandual <anshuman.khandual@arm.com>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      d381c547
    • mm, memory_hotplug: be more verbose for memory offline failures · 2932c8b0
      Michal Hocko authored
      There is only very limited information printed when the memory offlining
      fails:
      
      [ 1984.506184] rac1 kernel: memory offlining [mem 0x82600000000-0x8267fffffff] failed due to signal backoff
      
      This tells us that the failure is triggered by userspace intervention
      but it doesn't tell us much more about the underlying reason.  It might be
      that the page migration fails repeatedly and the userspace timeout
      expires and sends a signal, or it might be that some of the earlier steps
      (isolation, memory notifier) take too long.

      If the migration fails then it would be really helpful to see which page
      failed and its state.  The same applies to the isolation phase.  If we fail
      to isolate a page from the allocator then knowing the state of the page
      would be helpful as well.
      
      Dump the page state that fails to get isolated or migrated.  This will
      tell us more about the failure and what to focus on during debugging.
      
      [akpm@linux-foundation.org: add missing printk arg]
      [mhocko@suse.com: tweak dump_page() `reason' text]
        Link: http://lkml.kernel.org/r/20181116083020.20260-6-mhocko@kernel.org
      Link: http://lkml.kernel.org/r/20181107101830.17405-6-mhocko@kernel.org
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Oscar Salvador <osalvador@suse.de>
      Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com>
      Cc: Baoquan He <bhe@redhat.com>
      Cc: Oscar Salvador <OSalvador@suse.com>
      Cc: William Kucharski <william.kucharski@oracle.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      2932c8b0
    • mm, memory_hotplug: print reason for the offlining failure · 79605093
      Michal Hocko authored
      The memory offlining failure reporting is inconsistent and insufficient.
      Some error paths simply do not report the failure to the log at all.  When
      we do report, there are no details about the reason for the failure, and
      there are several possible reasons, which makes memory offlining failures
      hard to debug.
      
      Make sure that the
      	memory offlining [mem %#010llx-%#010llx] failed
      message is printed for all failures and also provide a short textual
      reason for the failure e.g.
      
      [ 1984.506184] rac1 kernel: memory offlining [mem 0x82600000000-0x8267fffffff] failed due to signal backoff
      
      This tells us that the offlining has failed because of a pending signal,
      i.e. user intervention.
      
      [akpm@linux-foundation.org: tweak messages a bit]
      Link: http://lkml.kernel.org/r/20181107101830.17405-5-mhocko@kernel.org
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Oscar Salvador <osalvador@suse.de>
      Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com>
      Cc: Baoquan He <bhe@redhat.com>
      Cc: Oscar Salvador <OSalvador@suse.com>
      Cc: William Kucharski <william.kucharski@oracle.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      79605093
    • mm, memory_hotplug: drop pointless block alignment checks from __offline_pages · 6cc2baf6
      Michal Hocko authored
      This function is never called from a context which would provide a
      misaligned pfn range, so drop the pointless check.
      
      Link: http://lkml.kernel.org/r/20181107101830.17405-4-mhocko@kernel.org
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Oscar Salvador <osalvador@suse.de>
      Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com>
      Cc: Baoquan He <bhe@redhat.com>
      Cc: Oscar Salvador <OSalvador@suse.com>
      Cc: William Kucharski <william.kucharski@oracle.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      6cc2baf6
  5. 04 Nov, 2018 1 commit
    • memory_hotplug: cond_resched in __remove_pages · dd33ad7b
      Michal Hocko authored
      We have received a bug report that unbinding a large pmem (>1TB) can
      result in a soft lockup:
      
        NMI watchdog: BUG: soft lockup - CPU#9 stuck for 23s! [ndctl:4365]
        [...]
        Supported: Yes
        CPU: 9 PID: 4365 Comm: ndctl Not tainted 4.12.14-94.40-default #1 SLE12-SP4
        Hardware name: Intel Corporation S2600WFD/S2600WFD, BIOS SE5C620.86B.01.00.0833.051120182255 05/11/2018
        task: ffff9cce7d4410c0 task.stack: ffffbe9eb1bc4000
        RIP: 0010:__put_page+0x62/0x80
        Call Trace:
         devm_memremap_pages_release+0x152/0x260
         release_nodes+0x18d/0x1d0
         device_release_driver_internal+0x160/0x210
         unbind_store+0xb3/0xe0
         kernfs_fop_write+0x102/0x180
         __vfs_write+0x26/0x150
         vfs_write+0xad/0x1a0
         SyS_write+0x42/0x90
         do_syscall_64+0x74/0x150
         entry_SYSCALL_64_after_hwframe+0x3d/0xa2
        RIP: 0033:0x7fd13166b3d0
      
      It has been reported on an older (4.12) kernel but the current upstream
      code doesn't cond_resched in the hot remove code at all and the given
      range to remove might be really large.  Fix the issue by calling
      cond_resched once per memory section.
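      As an illustration, a hedged sketch of the fixed loop in __remove_pages()
      (simplified; the surrounding error handling is omitted):

          for (i = 0; i < sections_to_remove; i++) {
                  unsigned long pfn = phys_start_pfn + i * PAGES_PER_SECTION;

                  /* one rescheduling point per memory section keeps the
                   * non-preemptible work bounded even for a huge range */
                  cond_resched();
                  ret = __remove_section(zone, __pfn_to_section(pfn),
                                         map_offset, altmap);
                  map_offset = 0;
                  if (ret)
                          break;
          }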
      
      Link: http://lkml.kernel.org/r/20181031125840.23982-1-mhocko@kernel.org
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Johannes Thumshirn <jthumshirn@suse.de>
      Cc: Dan Williams <dan.j.williams@gmail.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      dd33ad7b
  6. 31 Oct, 2018 4 commits
    • mm/memory_hotplug: fix online/offline_pages called w.o. mem_hotplug_lock · 381eab4a
      David Hildenbrand authored
      There seem to be some problems as a result of 30467e0b ("mm, hotplug:
      fix concurrent memory hot-add deadlock"), which tried to fix a possible
      lock inversion reported and discussed in [1] due to the two locks
      	a) device_lock()
      	b) mem_hotplug_lock

      While add_memory() first takes b), followed by a) during
      bus_probe_device(), onlining of memory from user space first took a),
      followed by b), exposing a possible deadlock.

      In [1] it was decided not to make use of device_hotplug_lock, but
      rather to enforce a locking order.
      
      The problems I spotted related to this:
      
      1. Memory block device attributes: While .state first calls
         mem_hotplug_begin() and then calls device_online() - which takes
         device_lock() - .online no longer calls mem_hotplug_begin(), so it
         effectively calls online_pages() without mem_hotplug_lock.

      2. device_online() should be called under device_hotplug_lock, however
         onlining memory during add_memory() does not take care of that.

      In addition, I think there is also something wrong about the locking in

      3. arch/powerpc/platforms/powernv/memtrace.c calls offline_pages()
         without locks. This was introduced after 30467e0b. And skimming over
         the code, I assume it could need some more care in regards to locking
         (e.g. device_online() called without device_hotplug_lock). This will
         be addressed in the following patches.
      
      Now that we hold the device_hotplug_lock when
      - adding memory (e.g. via add_memory()/add_memory_resource())
      - removing memory (e.g. via remove_memory())
      - device_online()/device_offline()
      
      We can move mem_hotplug_lock usage back into
      online_pages()/offline_pages().
      
      Why is mem_hotplug_lock still needed? Essentially to make
      get_online_mems()/put_online_mems() be very fast (relying on
      device_hotplug_lock would be very slow), and to serialize against
      addition of memory that does not create memory block devices (hmm).
      
      [1] http://driverdev.linuxdriverproject.org/pipermail/driverdev-devel/2015-February/065324.html
      
      This patch is partly based on a patch by Vitaly Kuznetsov.
      
      Link: http://lkml.kernel.org/r/20180925091457.28651-4-david@redhat.com
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Reviewed-by: Pavel Tatashin <pavel.tatashin@microsoft.com>
      Reviewed-by: Rashmica Gupta <rashmica.g@gmail.com>
      Reviewed-by: Oscar Salvador <osalvador@suse.de>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
      Cc: Len Brown <lenb@kernel.org>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: "K. Y. Srinivasan" <kys@microsoft.com>
      Cc: Haiyang Zhang <haiyangz@microsoft.com>
      Cc: Stephen Hemminger <sthemmin@microsoft.com>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
      Cc: Juergen Gross <jgross@suse.com>
      Cc: Rashmica Gupta <rashmica.g@gmail.com>
      Cc: Michael Neuling <mikey@neuling.org>
      Cc: Balbir Singh <bsingharora@gmail.com>
      Cc: Kate Stewart <kstewart@linuxfoundation.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Philippe Ombredanne <pombredanne@nexb.com>
      Cc: Pavel Tatashin <pavel.tatashin@microsoft.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: YASUAKI ISHIMATSU <yasu.isimatu@gmail.com>
      Cc: Mathieu Malaterre <malat@debian.org>
      Cc: John Allen <jallen@linux.vnet.ibm.com>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Nathan Fontenot <nfont@linux.vnet.ibm.com>
      Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      381eab4a
    • mm/memory_hotplug: make add_memory() take the device_hotplug_lock · 8df1d0e4
      David Hildenbrand authored
      add_memory() currently does not take the device_hotplug_lock, however
      it is already called under the lock from
      	arch/powerpc/platforms/pseries/hotplug-memory.c
      	drivers/acpi/acpi_memhotplug.c
      to synchronize against CPU hot-remove and similar.
      
      In general, we should hold the device_hotplug_lock when adding memory to
      synchronize against online/offline requests (e.g. from user space) - which
      already resulted in lock inversions due to device_lock() and
      mem_hotplug_lock - see 30467e0b ("mm, hotplug: fix concurrent memory
      hot-add deadlock").  add_memory()/add_memory_resource() will create memory
      block devices, so this really feels like the right thing to do.
      
      Holding the device_hotplug_lock makes sure that a memory block device
      can really only be accessed (e.g. via .online/.state) from user space,
      once the memory has been fully added to the system.
      
      The lock is not held yet in
      	drivers/xen/balloon.c
      	arch/powerpc/platforms/powernv/memtrace.c
      	drivers/s390/char/sclp_cmd.c
      	drivers/hv/hv_balloon.c
      So, let's either use the locked variants or take the lock.
      
      Don't export add_memory_resource(), as it once was exported to be used by
      XEN, which is never built as a module.  If somebody requires it, we also
      have to export a locked variant (as device_hotplug_lock is never
      exported).
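      As an illustration, a hedged sketch of the locked wrapper this describes
      (the __add_memory name for the unlocked variant is an assumption):

          /* requires device_hotplug_lock to be held by the caller */
          int __ref __add_memory(int nid, u64 start, u64 size);

          int add_memory(int nid, u64 start, u64 size)
          {
                  int rc;

                  lock_device_hotplug();
                  rc = __add_memory(nid, start, size);
                  unlock_device_hotplug();

                  return rc;
          }
          EXPORT_SYMBOL_GPL(add_memory);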
      
      Link: http://lkml.kernel.org/r/20180925091457.28651-3-david@redhat.com
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Reviewed-by: Pavel Tatashin <pavel.tatashin@microsoft.com>
      Reviewed-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
      Reviewed-by: Rashmica Gupta <rashmica.g@gmail.com>
      Reviewed-by: Oscar Salvador <osalvador@suse.de>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
      Cc: Len Brown <lenb@kernel.org>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
      Cc: Juergen Gross <jgross@suse.com>
      Cc: Nathan Fontenot <nfont@linux.vnet.ibm.com>
      Cc: John Allen <jallen@linux.vnet.ibm.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Mathieu Malaterre <malat@debian.org>
      Cc: Pavel Tatashin <pavel.tatashin@microsoft.com>
      Cc: YASUAKI ISHIMATSU <yasu.isimatu@gmail.com>
      Cc: Balbir Singh <bsingharora@gmail.com>
      Cc: Haiyang Zhang <haiyangz@microsoft.com>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Kate Stewart <kstewart@linuxfoundation.org>
      Cc: "K. Y. Srinivasan" <kys@microsoft.com>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Michael Neuling <mikey@neuling.org>
      Cc: Philippe Ombredanne <pombredanne@nexb.com>
      Cc: Stephen Hemminger <sthemmin@microsoft.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      8df1d0e4
    • mm/memory_hotplug: make remove_memory() take the device_hotplug_lock · d15e5926
      David Hildenbrand authored
      Patch series "mm: online/offline_pages called w.o. mem_hotplug_lock", v3.
      
      Reading through the code and studying how mem_hotplug_lock is to be used,
      I noticed that there are two places where we can end up calling
      device_online()/device_offline() - online_pages()/offline_pages() without
      the mem_hotplug_lock.  And there are other places where we call
      device_online()/device_offline() without the device_hotplug_lock.
      
      While e.g.
      	echo "online" > /sys/devices/system/memory/memory9/state
      is fine, e.g.
      	echo 1 > /sys/devices/system/memory/memory9/online
      will not take the mem_hotplug_lock, only the device_lock() and
      device_hotplug_lock.
      
      E.g.  via memory_probe_store(), we can end up calling
      add_memory()->online_pages() without the device_hotplug_lock.  So we can
      have concurrent callers in online_pages().  We then e.g.  touch
      zone->present_pages in online_pages() basically unprotected.
      
      Looks like there is a longer history to that (see Patch #2 for details),
      and fixing it to work the way it was intended is not really possible.  We
      would e.g.  have to take the mem_hotplug_lock in device/base/core.c, which
      sounds wrong.
      
      Summary: We had a lock inversion on mem_hotplug_lock and device_lock().
      More details can be found in patch 3 and patch 6.
      
      I propose the general rules (documentation added in patch 6):
      
      1. add_memory/add_memory_resource() must only be called with
         device_hotplug_lock.
      2. remove_memory() must only be called with device_hotplug_lock. This is
         already documented and holds for all callers.
      3. device_online()/device_offline() must only be called with
         device_hotplug_lock. This is already documented and true for now in core
         code. Other callers (related to memory hotplug) have to be fixed up.
      4. mem_hotplug_lock is taken inside of add_memory/remove_memory/
         online_pages/offline_pages.
      
      To me, this looks way cleaner than what we have right now (and easier to
      verify).  And looking at the documentation of remove_memory, using
      lock_device_hotplug also for add_memory() feels natural.
      
      This patch (of 6):
      
      remove_memory() is exported right now but requires the
      device_hotplug_lock, which is not exported.  So let's provide a variant
      that takes the lock and only export that one.
      
      The lock is already held in
      	arch/powerpc/platforms/pseries/hotplug-memory.c
      	drivers/acpi/acpi_memhotplug.c
      	arch/powerpc/platforms/powernv/memtrace.c
      
      Apart from that, there are no other users in the tree.
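      As an illustration, a hedged sketch of the exported locked variant (the
      __remove_memory name for the lock-held variant is an assumption):

          /* requires device_hotplug_lock to be held by the caller */
          void __ref __remove_memory(int nid, u64 start, u64 size);

          void remove_memory(int nid, u64 start, u64 size)
          {
                  lock_device_hotplug();
                  __remove_memory(nid, start, size);
                  unlock_device_hotplug();
          }
          EXPORT_SYMBOL_GPL(remove_memory);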
      
      Link: http://lkml.kernel.org/r/20180925091457.28651-2-david@redhat.com
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Reviewed-by: Pavel Tatashin <pavel.tatashin@microsoft.com>
      Reviewed-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
      Reviewed-by: Rashmica Gupta <rashmica.g@gmail.com>
      Reviewed-by: Oscar Salvador <osalvador@suse.de>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
      Cc: Len Brown <lenb@kernel.org>
      Cc: Rashmica Gupta <rashmica.g@gmail.com>
      Cc: Michael Neuling <mikey@neuling.org>
      Cc: Balbir Singh <bsingharora@gmail.com>
      Cc: Nathan Fontenot <nfont@linux.vnet.ibm.com>
      Cc: John Allen <jallen@linux.vnet.ibm.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: YASUAKI ISHIMATSU <yasu.isimatu@gmail.com>
      Cc: Mathieu Malaterre <malat@debian.org>
      Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
      Cc: Haiyang Zhang <haiyangz@microsoft.com>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Juergen Gross <jgross@suse.com>
      Cc: Kate Stewart <kstewart@linuxfoundation.org>
      Cc: "K. Y. Srinivasan" <kys@microsoft.com>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Philippe Ombredanne <pombredanne@nexb.com>
      Cc: Stephen Hemminger <sthemmin@microsoft.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      d15e5926
    • mm: remove include/linux/bootmem.h · 57c8a661
      Mike Rapoport authored
      Move remaining definitions and declarations from include/linux/bootmem.h
      into include/linux/memblock.h and remove the redundant header.
      
      The includes were replaced with the semantic patch below and then a
      semi-automated removal of duplicated '#include <linux/memblock.h>' lines.
      
      @@
      @@
      - #include <linux/bootmem.h>
      + #include <linux/memblock.h>
      
      [sfr@canb.auug.org.au: dma-direct: fix up for the removal of linux/bootmem.h]
        Link: http://lkml.kernel.org/r/20181002185342.133d1680@canb.auug.org.au
      [sfr@canb.auug.org.au: powerpc: fix up for removal of linux/bootmem.h]
        Link: http://lkml.kernel.org/r/20181005161406.73ef8727@canb.auug.org.au
      [sfr@canb.auug.org.au: x86/kaslr, ACPI/NUMA: fix for linux/bootmem.h removal]
        Link: http://lkml.kernel.org/r/20181008190341.5e396491@canb.auug.org.au
      Link: http://lkml.kernel.org/r/1536927045-23536-30-git-send-email-rppt@linux.vnet.ibm.com
      Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
      Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Chris Zankel <chris@zankel.net>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Geert Uytterhoeven <geert@linux-m68k.org>
      Cc: Greentime Hu <green.hu@gmail.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Guan Xuetao <gxt@pku.edu.cn>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: "James E.J. Bottomley" <jejb@parisc-linux.org>
      Cc: Jonas Bonn <jonas@southpole.se>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Ley Foon Tan <lftan@altera.com>
      Cc: Mark Salter <msalter@redhat.com>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Matt Turner <mattst88@gmail.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Michal Simek <monstr@monstr.eu>
      Cc: Palmer Dabbelt <palmer@sifive.com>
      Cc: Paul Burton <paul.burton@mips.com>
      Cc: Richard Kuo <rkuo@codeaurora.org>
      Cc: Richard Weinberger <richard@nod.at>
      Cc: Rich Felker <dalias@libc.org>
      Cc: Russell King <linux@armlinux.org.uk>
      Cc: Serge Semin <fancer.lancer@gmail.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: Vineet Gupta <vgupta@synopsys.com>
      Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      57c8a661
  7. 27 Oct, 2018 4 commits
  8. 05 Sep, 2018 1 commit
  9. 23 Aug, 2018 1 commit
    • mm/page_alloc: Introduce free_area_init_core_hotplug · 03e85f9d
      Oscar Salvador authored
      Currently, whenever a new node is created/re-used from the memhotplug
      path, we call free_area_init_node()->free_area_init_core().  But there is
      some code that we do not really need to run when we are coming from such
      a path.
      
      free_area_init_core() performs the following actions:
      
      1) Initializes pgdat internals, such as spinlock, waitqueues and more.
      2) Account # nr_all_pages and # nr_kernel_pages. These values are used later on
         when creating hash tables.
      3) Account number of managed_pages per zone, subtracting dma_reserved and
         memmap pages.
      4) Initializes some fields of the zone structure data
      5) Calls init_currently_empty_zone to initialize all the freelists
      6) Calls memmap_init to initialize all pages belonging to certain zone
      
      When called from memhotplug path, free_area_init_core() only performs
      actions #1 and #4.
      
      Action #2 is pointless as the zones do not have any pages since either the
      node was freed, or we are re-using it; either way all zones belonging to
      this node should have 0 pages.  For the same reason, action #3 always
      results in managed_pages being 0.
      
      Action #5 and #6 are performed later on when onlining the pages:
       online_pages()->move_pfn_range_to_zone()->init_currently_empty_zone()
       online_pages()->move_pfn_range_to_zone()->memmap_init_zone()
      
      This patch does two things:
      
      First, it moves the node/zone initialization into their own functions,
      which allows us to create a small version of free_area_init_core, where we
      only perform:
      
      1) Initialization of pgdat internals, such as spinlock, waitqueues and more
      4) Initialization of some fields of the zone structure data
      
      These two functions are: pgdat_init_internals() and zone_init_internals().
      
      The second thing this patch does, is to introduce
      free_area_init_core_hotplug(), the memhotplug version of
      free_area_init_core():
      
      Currently, we call free_area_init_node() from the memhotplug path.  In
      there, we set some pgdat's fields, and call calculate_node_totalpages().
      calculate_node_totalpages() calculates the # of pages the node has.
      
      Since the node is either new, or we are re-using it, the zones belonging
      to this node should not have any pages, so there is no point in
      calculating this now.
      
      Actually, we re-set these values to 0 later on with the calls to:
      
      reset_node_managed_pages()
      reset_node_present_pages()
      
      The # of pages per node and the # of pages per zone will be calculated when
      onlining the pages:
      
      online_pages()->move_pfn_range()->move_pfn_range_to_zone()->resize_zone_range()
      online_pages()->move_pfn_range()->move_pfn_range_to_zone()->resize_pgdat_range()
      
      Also, since free_area_init_core/free_area_init_node will now only get called during early init, let us replace
      __paginginit with __init, so their code gets freed up.
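      As an illustration, a hedged sketch of the memhotplug variant this
      introduces (simplified; based on the helper names above, details may
      differ from the actual patch):

          void __ref free_area_init_core_hotplug(int nid)
          {
                  pg_data_t *pgdat = NODE_DATA(nid);
                  enum zone_type z;

                  /* only actions #1 and #4 from the list above */
                  pgdat_init_internals(pgdat);
                  for (z = 0; z < MAX_NR_ZONES; z++)
                          zone_init_internals(&pgdat->node_zones[z], z, nid, 0);
          }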
      
      [osalvador@techadventures.net: fix section usage]
        Link: http://lkml.kernel.org/r/20180731101752.GA473@techadventures.net
      [osalvador@suse.de: v6]
        Link: http://lkml.kernel.org/r/20180801122348.21588-6-osalvador@techadventures.net
      Link: http://lkml.kernel.org/r/20180730101757.28058-5-osalvador@techadventures.net
      Signed-off-by: Oscar Salvador <osalvador@suse.de>
      Reviewed-by: Pavel Tatashin <pasha.tatashin@oracle.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Pasha Tatashin <Pavel.Tatashin@microsoft.com>
      Cc: Aaron Lu <aaron.lu@intel.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      03e85f9d
  10. 18 Aug, 2018 3 commits
  11. 08 Jun, 2018 1 commit
  12. 26 May, 2018 1 commit
    • mm/memory_hotplug: fix leftover use of struct page during hotplug · a2155861
      Jonathan Cameron authored
      The case of a new NUMA node got missed in the work to avoid using the node
      info from struct page during hotplug.  In this path we have a call to
      register_mem_sect_under_node (which allows us to specify that it is
      hotplug, so don't change the node), via link_mem_sections, which
      unfortunately does not.
      
      The fix is to pass check_nid through link_mem_sections as well and to
      disable it in the new NUMA node path.
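      As an illustration, a hedged sketch of the idea (the exact signature of
      link_mem_sections is an assumption):

          /* callers say whether the nid read from struct page can be trusted */
          int link_mem_sections(int nid, unsigned long start_pfn,
                                unsigned long nr_pages, bool check_nid);

          /* add_memory_resource(), new-node path: the memmap for this node is
           * not initialized yet, so do not cross-check the nid */
          link_mem_sections(nid, start_pfn, nr_pages, false);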
      
      Note the bug only 'sometimes' manifests depending on what happens to be
      in the struct page structures - there are lots of them and it only needs
      to match one of them.
      
      The result of the bug is that (with a new memory-only node) we never
      successfully call register_mem_sect_under_node, so we don't get the memory
      associated with the node in sysfs, and meminfo for the node doesn't
      report it.
      
      It came up whilst testing some arm64 hotplug patches, but appears to be
      universal.  Whilst I'm triggering it by removing then reinserting memory
      to a node with no other elements (thus making the node disappear then
      appear again), it appears it would happen on hotplugging memory where
      there was none before, and it doesn't seem to be related to the arm64
      patches.
      
      These patches call __add_pages (where most of the issue was fixed by
      Pavel's patch).  If there is a node at the time of the __add_pages call
      then all is well as it calls register_mem_sect_under_node from there
      with check_nid set to false.  Without a node that function returns
      having not done the sysfs related stuff as there is no node to use.
      This is expected but it is the resulting path that fails...
      
      Exact path to the problem is as follows:
      
       mm/memory_hotplug.c: add_memory_resource()
      
         The node is not online so we enter the 'if (new_node)' twice, on the
         second such block there is a call to link_mem_sections which calls
         into
      
        drivers/node.c: link_mem_sections() which calls
      
        drivers/node.c: register_mem_sect_under_node() which calls
           get_nid_for_pfn and keeps trying until the output of that matches
           the expected node (passed all the way down from
           add_memory_resource)
      
      It is effectively the same fix as the one referred to in the fixes tag
      just in the code path for a new node where the comments point out we
      have to rerun the link creation because it will have failed in
      register_new_memory (as there was no node at the time).  (Actually that
      comment is wrong now, as we don't have register_new_memory any more; it
      got renamed to hotplug_memory_register in Pavel's patch.)
      
      Link: http://lkml.kernel.org/r/20180504085311.1240-1-Jonathan.Cameron@huawei.com
      Fixes: fc44f7f9 ("mm/memory_hotplug: don't read nid from struct page during hotplug")
      Signed-off-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
      Reviewed-by: Pavel Tatashin <pasha.tatashin@oracle.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      a2155861
  13. 12 Apr, 2018 2 commits
    • mm: unclutter THP migration · 94723aaf
      Michal Hocko authored
      THP migration is hacked into the generic migration with rather
      surprising semantics.  The migration allocation callback is supposed to
      check whether the THP can be migrated at once and if that is not the
      case then it allocates a simple page to migrate.  unmap_and_move then
      fixes that up by splitting the THP into small pages while moving the head
      page to the newly allocated order-0 page.  Remaining pages are moved to
      the LRU list by split_huge_page.  The same happens if the THP allocation
      fails.  This is really ugly and error prone [1].
      
      I also believe that split_huge_page to the LRU lists is inherently wrong
      because the tail pages are not migrated.  Some callers will just work
      around that by retrying (e.g.  memory hotplug).  There are other pfn
      walkers which are simply broken though.  E.g. madvise_inject_error will
      migrate the head page and then advance the next pfn by the huge page size.
      do_move_page_to_node_array and queue_pages_range (migrate_pages, mbind)
      will simply split the THP before migration if THP migration is not
      supported and then fall back to single page migration, but they do not
      handle tail pages if the THP migration path is not able to allocate a
      fresh THP, so we end up with ENOMEM and fail the whole migration, which is
      questionable behavior.  Page compaction doesn't try to migrate large
      pages so it should be immune.
      
      This patch tries to unclutter the situation by moving the special THP
      handling up to the migrate_pages layer where it actually belongs.  We
      simply split the THP page into the existing list if unmap_and_move fails
      with ENOMEM and retry.  So we will _always_ migrate all THP subpages and
      specific migrate_pages users do not have to deal with this case in a
      special way.
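      As an illustration, a hedged sketch of the retry step this moves into
      migrate_pages() (simplified; error paths and surrounding loop omitted):

          if (rc == -ENOMEM && PageTransHuge(page)) {
                  /* Split the THP back into the source list and retry the
                   * resulting base pages instead of failing the whole set. */
                  lock_page(page);
                  rc = split_huge_page_to_list(page, from);
                  unlock_page(page);
                  if (!rc) {
                          list_safe_reset_next(page, page2, lru);
                          goto retry;
                  }
          }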
      
      [1] http://lkml.kernel.org/r/20171121021855.50525-1-zi.yan@sent.com
      
      Link: http://lkml.kernel.org/r/20180103082555.14592-4-mhocko@kernel.org
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Reviewed-by: Zi Yan <zi.yan@cs.rutgers.edu>
      Cc: Andrea Reale <ar@linux.vnet.ibm.com>
      Cc: Anshuman Khandual <khandual@linux.vnet.ibm.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      94723aaf
    • mm, migrate: remove reason argument from new_page_t · 666feb21
      Michal Hocko authored
      No allocation callback is using this argument anymore.  new_page_node
      used to use this parameter to convey the node_id resp. the migration error
      up to the move_pages code (do_move_page_to_node_array).  The error status
      never made it into the final status field and we have a better way to
      communicate the node id to the status field now.  All other allocation
      callbacks simply ignored the argument so we can finally drop it.
      
      [mhocko@suse.com: fix migration callback]
        Link: http://lkml.kernel.org/r/20180105085259.GH2801@dhcp22.suse.cz
      [akpm@linux-foundation.org: fix alloc_misplaced_dst_page()]
      [mhocko@kernel.org: fix build]
        Link: http://lkml.kernel.org/r/20180103091134.GB11319@dhcp22.suse.cz
      Link: http://lkml.kernel.org/r/20180103082555.14592-3-mhocko@kernel.org
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Reviewed-by: Zi Yan <zi.yan@cs.rutgers.edu>
      Cc: Andrea Reale <ar@linux.vnet.ibm.com>
      Cc: Anshuman Khandual <khandual@linux.vnet.ibm.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      666feb21
  14. 06 Apr, 2018 3 commits