1. 06 March 2019, 5 commits
    • mm/hotplug: fix an imbalance with DEBUG_PAGEALLOC · cd02cf1a
      Committed by Qian Cai
      When onlining a memory block with DEBUG_PAGEALLOC, the kernel unmaps
      the pages in the block as they are freed.  However, those pages were
      never mapped when the block was offlined earlier.  As a result,
      onlining triggers the panic below on ppc64le, which checks whether a
      page is mapped before unmapping it.  The imbalance exists on all
      arches where a double unmap could happen.  Therefore, let the kernel
      map those pages in generic_online_page() before they are freed into
      the page allocator for the first time, where the page count will be
      set to one.
      
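      A minimal sketch of the resulting fix (hedged: the exact
      generic_online_page() prototype differs across kernel versions; this
      assumes the order-aware variant from this release cycle):

        static int generic_online_page(struct page *page, unsigned int order)
        {
                /* Map the range back first, so the unmap in the free path
                 * is balanced even for memory that was never mapped. */
                kernel_map_pages(page, 1 << order, 1);
                __free_pages_core(page, order);
                totalram_pages_add(1UL << order);
                return 0;
        }
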
      On the other hand, everything works fine during boot because, at
      least on IBM POWER8, the boot path does:
      
      early_setup
        early_init_mmu
          hash__early_init_mmu
            htab_initialize [1]
              htab_bolt_mapping [2]
      
      where it effectively maps all memblock regions just like
      kernel_map_linear_page(), so the later mem_init() ->
      memblock_free_all() will unmap them just fine without any imbalance.
      Other arches without this imbalance check still unmap them at most
      once.
      
      [1]
      for_each_memblock(memory, reg) {
              base = (unsigned long)__va(reg->base);
              size = reg->size;
      
              DBG("creating mapping for region: %lx..%lx (prot: %lx)\n",
                      base, size, prot);
      
              BUG_ON(htab_bolt_mapping(base, base + size, __pa(base),
                      prot, mmu_linear_psize, mmu_kernel_ssize));
      }
      
      [2] linear_map_hash_slots[paddr >> PAGE_SHIFT] = ret | 0x80;
          kernel BUG at arch/powerpc/mm/hash_utils_64.c:1815!
          Oops: Exception in kernel mode, sig: 5 [#1]
          LE SMP NR_CPUS=256 DEBUG_PAGEALLOC NUMA pSeries
          CPU: 2 PID: 4298 Comm: bash Not tainted 5.0.0-rc7+ #15
          NIP:  c000000000062670 LR: c00000000006265c CTR: 0000000000000000
          REGS: c0000005bf8a75b0 TRAP: 0700   Not tainted  (5.0.0-rc7+)
          MSR:  800000000282b033 <SF,VEC,VSX,EE,FP,ME,IR,DR,RI,LE>  CR: 28422842
          XER: 00000000
          CFAR: c000000000804f44 IRQMASK: 1
          NIP [c000000000062670] __kernel_map_pages+0x2e0/0x4f0
          LR [c00000000006265c] __kernel_map_pages+0x2cc/0x4f0
          Call Trace:
             __kernel_map_pages+0x2cc/0x4f0
             free_unref_page_prepare+0x2f0/0x4d0
             free_unref_page+0x44/0x90
             __online_page_free+0x84/0x110
             online_pages_range+0xc0/0x150
             walk_system_ram_range+0xc8/0x120
             online_pages+0x280/0x5a0
             memory_subsys_online+0x1b4/0x270
             device_online+0xc0/0xf0
             state_store+0xc0/0x180
             dev_attr_store+0x3c/0x60
             sysfs_kf_write+0x70/0xb0
             kernfs_fop_write+0x10c/0x250
             __vfs_write+0x48/0x240
             vfs_write+0xd8/0x210
             ksys_write+0x70/0x120
             system_call+0x5c/0x70
      
      Link: http://lkml.kernel.org/r/20190301220814.97339-1-cai@lca.pw
      Signed-off-by: Qian Cai <cai@lca.pw>
      Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Michael Ellerman <mpe@ellerman.id.au>	[powerpc]
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Souptick Joarder <jrdr.linux@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm,memory_hotplug: explicitly pass the head to isolate_huge_page · daf3538a
      Committed by Oscar Salvador
      isolate_huge_page() expects to be passed the head of a hugetlb page:
      
        bool isolate_huge_page(...)
        {
      	...
      	VM_BUG_ON_PAGE(!PageHead(page), page);
      	...
        }
      
      While I really cannot think of any situation where we would end up
      with a non-head page on our hands in do_migrate_range(), let us make
      sure the code is as sane as possible by explicitly passing the head.
      Since we already have the pointer, it takes no extra effort.
      
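      A sketch of the resulting hunk in do_migrate_range() (hedged: the pfn
      advance is reconstructed from memory, not quoted from the patch):

        if (PageHuge(page)) {
                struct page *head = compound_head(page);

                /* skip over the whole hugetlb page ... */
                pfn = page_to_pfn(head) + (1 << compound_order(head)) - 1;
                /* ... and pass the head, as isolate_huge_page() expects */
                isolate_huge_page(head, &source);
                continue;
        }
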
      Link: http://lkml.kernel.org/r/20190208090604.975-1-osalvador@suse.de
      Signed-off-by: Oscar Salvador <osalvador@suse.de>
      Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: David Hildenbrand <david@redhat.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Anthony Yznaga <anthony.yznaga@oracle.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: remove extra drain pages on pcp list · c52e7593
      Committed by Wei Yang
      In the current implementation, there are two places that isolate a
      range of pages: __offline_pages() and alloc_contig_range().  During
      this procedure, each will drain pages on the pcp list.
      
      Below is a brief call flow:
      
        __offline_pages()/alloc_contig_range()
            start_isolate_page_range()
                set_migratetype_isolate()
                    drain_all_pages()
            drain_all_pages()                 <--- A
      
      This snippet shows that the current logic is to isolate and drain the
      pcp list for each pageblock, and then drain the pcp list again for the
      whole range.
      
      start_isolate_page_range is responsible for isolating the given pfn
      range.  One part of that job is to make sure that pages sitting on the
      allocator pcp lists are also properly isolated.  Otherwise they could
      be reused and the range wouldn't be completely isolated until the
      memory is freed back.  While there is no strict guarantee here,
      because pages might get allocated at any time before drain_all_pages
      is called, there doesn't seem to be any strong demand for such a
      guarantee.
      
      In any case, draining is already done at the isolation level and
      there is no need to do it again later in the start_isolate_page_range
      callers (currently memory hotplug and the CMA allocator).  Therefore,
      remove the pointless draining from the existing callers to make the
      code clearer and functionally correct.
      
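      A sketch of the resulting caller side (hedged: the isolation flags
      shown are the ones used by __offline_pages() around this time):

        ret = start_isolate_page_range(start_pfn, end_pfn, MIGRATE_MOVABLE,
                                       SKIP_HWPOISON | REPORT_FAILURE);
        if (ret)
                return ret;
        /*
         * No drain_all_pages(zone) here any more:
         * set_migratetype_isolate() already drained the pcp lists for
         * every pageblock it isolated.
         */
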
      [mhocko@suse.com: provide a clearer changelog for the last two paragraphs]
      Link: http://lkml.kernel.org/r/20190105233141.2329-1-richard.weiyang@gmail.com
      Signed-off-by: Wei Yang <richard.weiyang@gmail.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Acked-by: David Hildenbrand <david@redhat.com>
      Reviewed-by: Oscar Salvador <osalvador@suse.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: replace all open encodings for NUMA_NO_NODE · 98fa15f3
      Committed by Anshuman Khandual
      Patch series "Replace all open encodings for NUMA_NO_NODE", v3.
      
      All these places for replacement were found by running the following
      grep patterns on the entire kernel code.  Please let me know if this
      might have missed some instances.  This might also have replaced some
      false positives.  I will appreciate suggestions, inputs and review.
      
      1. git grep "nid == -1"
      2. git grep "node == -1"
      3. git grep "nid = -1"
      4. git grep "node = -1"
      
      This patch (of 2):
      
      At present there are multiple places where an invalid node number is
      encoded as -1.  Even though it is implicitly understood, it is always
      better to have a macro for it.  Replace these open encodings of an
      invalid node number with the global macro NUMA_NO_NODE.  This helps
      remove NUMA-related assumptions like 'invalid node' from various
      places, redirecting them to a common definition.
      
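      The shape of the replacement, with hypothetical helper names used for
      illustration only:

        #include <linux/numa.h>          /* NUMA_NO_NODE */

        /* Before: the invalid node is an open-coded -1. */
        static int pick_node_before(int nid)
        {
                return (nid == -1) ? numa_mem_id() : nid;
        }

        /* After: the invalid node is spelled out. */
        static int pick_node_after(int nid)
        {
                return (nid == NUMA_NO_NODE) ? numa_mem_id() : nid;
        }
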
      Link: http://lkml.kernel.org/r/1545127933-10711-2-git-send-email-anshuman.khandual@arm.com
      Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com>
      Reviewed-by: David Hildenbrand <david@redhat.com>
      Acked-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>	[ixgbe]
      Acked-by: Jens Axboe <axboe@kernel.dk>			[mtip32xx]
      Acked-by: Vinod Koul <vkoul@kernel.org>			[dmaengine.c]
      Acked-by: Michael Ellerman <mpe@ellerman.id.au>		[powerpc]
      Acked-by: Doug Ledford <dledford@redhat.com>		[drivers/infiniband]
      Cc: Joseph Qi <jiangqi903@gmail.com>
      Cc: Hans Verkuil <hverkuil@xs4all.nl>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/page_alloc.c: memory hotplug: free pages as higher order · a9cd410a
      Committed by Arun KS
      When pages are freed at a higher order, the time the buddy allocator
      spends coalescing them can be reduced.  With a section size of 256MB,
      the hot-add latency of a single section improves from 50-60 ms to
      less than 1 ms, i.e. roughly 60 times faster.  Modify the external
      providers of the online callback to align with the change.
      
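      A sketch of the core of the change (hedged: reconstructed from the
      patch description rather than quoted):

        static int online_pages_blocks(unsigned long start, unsigned long nr_pages)
        {
                unsigned long end = start + nr_pages;
                int order, onlined_pages = 0;

                while (start < end) {
                        /* Free the largest naturally aligned chunk at once
                         * instead of going page by page. */
                        order = min(MAX_ORDER - 1,
                                    get_order(PFN_PHYS(end) - PFN_PHYS(start)));
                        (*online_page_callback)(pfn_to_page(start), order);

                        onlined_pages += (1UL << order);
                        start += (1UL << order);
                }
                return onlined_pages;
        }
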
      [arunks@codeaurora.org: v11]
        Link: http://lkml.kernel.org/r/1547792588-18032-1-git-send-email-arunks@codeaurora.org
      [akpm@linux-foundation.org: remove unused local, per Arun]
      [akpm@linux-foundation.org: avoid return of void-returning __free_pages_core(), per Oscar]
      [akpm@linux-foundation.org: fix it for mm-convert-totalram_pages-and-totalhigh_pages-variables-to-atomic.patch]
      [arunks@codeaurora.org: v8]
        Link: http://lkml.kernel.org/r/1547032395-24582-1-git-send-email-arunks@codeaurora.org
      [arunks@codeaurora.org: v9]
        Link: http://lkml.kernel.org/r/1547098543-26452-1-git-send-email-arunks@codeaurora.org
      Link: http://lkml.kernel.org/r/1538727006-5727-1-git-send-email-arunks@codeaurora.org
      Signed-off-by: Arun KS <arunks@codeaurora.org>
      Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Reviewed-by: Oscar Salvador <osalvador@suse.de>
      Reviewed-by: Alexander Duyck <alexander.h.duyck@linux.intel.com>
      Cc: K. Y. Srinivasan <kys@microsoft.com>
      Cc: Haiyang Zhang <haiyangz@microsoft.com>
      Cc: Stephen Hemminger <sthemmin@microsoft.com>
      Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
      Cc: Juergen Gross <jgross@suse.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Mathieu Malaterre <malat@debian.org>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Souptick Joarder <jrdr.linux@gmail.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Aaron Lu <aaron.lu@intel.com>
      Cc: Srivatsa Vaddagiri <vatsa@codeaurora.org>
      Cc: Vinayak Menon <vinmenon@codeaurora.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  2. 01 March 2019, 2 commits
    • mm/memory-hotplug: Allow memory resources to be children · 2794129e
      Committed by Dave Hansen
      The mm/resource.c code is used to manage the physical address
      space.  The current resource configuration can be viewed in
      /proc/iomem.  An example of this is at the bottom of this
      description.
      
      The nvdimm subsystem "owns" the physical address resources which
      map to persistent memory and has resources inserted for them as
      "Persistent Memory".  The best way to repurpose this for volatile
      use is to leave the existing resource in place, but add a "System
      RAM" resource underneath it. This clearly communicates the
      ownership relationship of this memory.
      
      The request_resource_conflict() API only deals with the
      top-level resources.  Replace it with __request_region() which
      will search for !IORESOURCE_BUSY areas lower in the resource
      tree than the top level.
      
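      A sketch of the replacement in register_memory_resource() (hedged;
      error handling abbreviated):

        struct resource *res;

        /* Unlike request_resource_conflict(), this searches for a
         * !IORESOURCE_BUSY slot anywhere in the tree, so "System RAM"
         * can nest under "Persistent Memory". */
        res = __request_region(&iomem_resource, start, size, "System RAM",
                               IORESOURCE_SYSTEM_RAM | IORESOURCE_BUSY);
        if (!res)
                return ERR_PTR(-EEXIST);
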
      We *could* also simply truncate the existing top-level
      "Persistent Memory" resource and take over the released address
      space.  But, this means that if we ever decide to hot-unplug the
      "RAM" and give it back, we need to recreate the original setup,
      which may mean going back to the BIOS tables.
      
      This should have no real effect on the existing collision
      detection because the areas that truly conflict should be marked
      IORESOURCE_BUSY.
      
      00000000-00000fff : Reserved
      00001000-0009fbff : System RAM
      0009fc00-0009ffff : Reserved
      000a0000-000bffff : PCI Bus 0000:00
      000c0000-000c97ff : Video ROM
      000c9800-000ca5ff : Adapter ROM
      000f0000-000fffff : Reserved
        000f0000-000fffff : System ROM
      00100000-9fffffff : System RAM
        01000000-01e071d0 : Kernel code
        01e071d1-027dfdff : Kernel data
        02dc6000-0305dfff : Kernel bss
      a0000000-afffffff : Persistent Memory (legacy)
        a0000000-a7ffffff : System RAM
      b0000000-bffdffff : System RAM
      bffe0000-bfffffff : Reserved
      c0000000-febfffff : PCI Bus 0000:00
      Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
      Reviewed-by: Dan Williams <dan.j.williams@intel.com>
      Reviewed-by: Vishal Verma <vishal.l.verma@intel.com>
      Cc: Dave Jiang <dave.jiang@intel.com>
      Cc: Ross Zwisler <zwisler@kernel.org>
      Cc: Vishal Verma <vishal.l.verma@intel.com>
      Cc: Tom Lendacky <thomas.lendacky@amd.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: linux-nvdimm@lists.01.org
      Cc: linux-kernel@vger.kernel.org
      Cc: linux-mm@kvack.org
      Cc: Huang Ying <ying.huang@intel.com>
      Cc: Fengguang Wu <fengguang.wu@intel.com>
      Cc: Borislav Petkov <bp@suse.de>
      Cc: Bjorn Helgaas <bhelgaas@google.com>
      Cc: Yaowei Bai <baiyaowei@cmss.chinamobile.com>
      Cc: Takashi Iwai <tiwai@suse.de>
      Cc: Jerome Glisse <jglisse@redhat.com>
      Cc: Keith Busch <keith.busch@intel.com>
      Signed-off-by: Dan Williams <dan.j.williams@intel.com>
    • mm/resource: Move HMM pr_debug() deeper into resource code · b926b7f3
      Committed by Dave Hansen
      HMM consumes physical address space for its own use, even
      though nothing is mapped or accessible there.  It uses a
      special resource description (IORES_DESC_DEVICE_PRIVATE_MEMORY)
      to uniquely identify these areas.
      
      When HMM consumes address space, it makes a best guess about
      what to consume.  However, it is possible that a future memory
      or device hotplug can collide with the reserved area.  In the
      case of these conflicts, there is an error message in
      register_memory_resource().
      
      Later patches in this series move register_memory_resource()
      from using request_resource_conflict() to __request_region().
      Unfortunately, __request_region() does not return the conflict
      like the previous function did, which makes it impossible to
      check for IORES_DESC_DEVICE_PRIVATE_MEMORY in a conflicting
      resource.
      
      Instead of warning in register_memory_resource(), move the
      check into the core resource code itself (__request_region())
      where the conflicting resource _is_ available.  This has the
      added bonus of producing a warning in case of HMM conflicts
      with devices *or* RAM address space, as opposed to the RAM-
      only warnings that were there previously.
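
      A sketch of the check at its new home inside __request_region()
      (hedged; the exact message wording is illustrative):

        if (conflict->desc == IORES_DESC_DEVICE_PRIVATE_MEMORY)
                pr_warn("Unaddressable device %s %pR conflicts with %pR\n",
                        conflict->name, conflict, res);
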
      Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
      Reviewed-by: Jerome Glisse <jglisse@redhat.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Dave Jiang <dave.jiang@intel.com>
      Cc: Ross Zwisler <zwisler@kernel.org>
      Cc: Vishal Verma <vishal.l.verma@intel.com>
      Cc: Tom Lendacky <thomas.lendacky@amd.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: linux-nvdimm@lists.01.org
      Cc: linux-kernel@vger.kernel.org
      Cc: linux-mm@kvack.org
      Cc: Huang Ying <ying.huang@intel.com>
      Cc: Fengguang Wu <fengguang.wu@intel.com>
      Cc: Keith Busch <keith.busch@intel.com>
      Signed-off-by: Dan Williams <dan.j.williams@intel.com>
  3. 22 February 2019, 1 commit
    • mm, memory_hotplug: fix off-by-one in is_pageblock_removable · 891cb2a7
      Committed by Michal Hocko
      Rong Chen has reported the following boot crash:
      
          PGD 0 P4D 0
          Oops: 0000 [#1] PREEMPT SMP PTI
          CPU: 1 PID: 239 Comm: udevd Not tainted 5.0.0-rc4-00149-gefad4e47 #1
          Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1 04/01/2014
          RIP: 0010:page_mapping+0x12/0x80
          Code: 5d c3 48 89 df e8 0e ad 02 00 85 c0 75 da 89 e8 5b 5d c3 0f 1f 44 00 00 53 48 89 fb 48 8b 43 08 48 8d 50 ff a8 01 48 0f 45 da <48> 8b 53 08 48 8d 42 ff 83 e2 01 48 0f 44 c3 48 83 38 ff 74 2f 48
          RSP: 0018:ffff88801fa87cd8 EFLAGS: 00010202
          RAX: ffffffffffffffff RBX: fffffffffffffffe RCX: 000000000000000a
          RDX: fffffffffffffffe RSI: ffffffff820b9a20 RDI: ffff88801e5c0000
          RBP: 6db6db6db6db6db7 R08: ffff88801e8bb000 R09: 0000000001b64d13
          R10: ffff88801fa87cf8 R11: 0000000000000001 R12: ffff88801e640000
          R13: ffffffff820b9a20 R14: ffff88801f145258 R15: 0000000000000001
          FS:  00007fb2079817c0(0000) GS:ffff88801dd00000(0000) knlGS:0000000000000000
          CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
          CR2: 0000000000000006 CR3: 000000001fa82000 CR4: 00000000000006a0
          Call Trace:
           __dump_page+0x14/0x2c0
           is_mem_section_removable+0x24c/0x2c0
           removable_show+0x87/0xa0
           dev_attr_show+0x25/0x60
           sysfs_kf_seq_show+0xba/0x110
           seq_read+0x196/0x3f0
           __vfs_read+0x34/0x180
           vfs_read+0xa0/0x150
           ksys_read+0x44/0xb0
           do_syscall_64+0x5e/0x4a0
           entry_SYSCALL_64_after_hwframe+0x49/0xbe
      
      and bisected it down to commit efad4e47 ("mm, memory_hotplug:
      is_mem_section_removable do not pass the end of a zone").
      
      The reason for the crash is that the mapping is garbage for a
      poisoned (uninitialized) page.  This shouldn't happen, as all pages
      within the zone's boundaries should be initialized.
      
      Later debugging revealed that the actual problem is an off-by-one
      when evaluating the end_page.  'start_pfn + nr_pages', resp.
      'zone_end_pfn', refers to a pfn one past the range, and as such it
      might belong to a different memory section.
      
      This, along with CONFIG_SPARSEMEM, makes the loop condition
      completely bogus, because pointer arithmetic doesn't work for pages
      from two different sections in that memory model.
      
      Fix the issue by reworking is_pageblock_removable to be pfn based
      and to use struct page only where necessary.  This makes the code
      slightly easier to follow and removes the problematic pointer
      arithmetic completely.
      
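      A sketch of the reworked, pfn-based walk (hedged; simplified from the
      actual patch):

        bool is_mem_section_removable(unsigned long start_pfn,
                                      unsigned long nr_pages)
        {
                unsigned long end_pfn, pfn;

                /* Clamp to the zone end so no pfn past it is evaluated. */
                end_pfn = min(start_pfn + nr_pages,
                              zone_end_pfn(page_zone(pfn_to_page(start_pfn))));

                /* Walk pfns, not struct page pointers, so crossing a
                 * sparsemem section boundary is safe. */
                for (pfn = start_pfn; pfn < end_pfn;
                     pfn = round_up(pfn + 1, pageblock_nr_pages)) {
                        if (!is_pageblock_removable_nolock(pfn))
                                return false;
                        cond_resched();
                }
                return true;
        }
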
      Link: http://lkml.kernel.org/r/20190218181544.14616-1-mhocko@kernel.org
      Fixes: efad4e47 ("mm, memory_hotplug: is_mem_section_removable do not pass the end of a zone")
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Reported-by: <rong.a.chen@intel.com>
      Tested-by: <rong.a.chen@intel.com>
      Acked-by: Mike Rapoport <rppt@linux.ibm.com>
      Reviewed-by: Oscar Salvador <osalvador@suse.de>
      Cc: Matthew Wilcox <willy@infradead.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  4. 18 February 2019, 1 commit
  5. 02 February 2019, 5 commits
  6. 29 December 2018, 12 commits
    • mm, memory_hotplug: deobfuscate migration part of offlining · bb8965bd
      Committed by Michal Hocko
      Memory migration might fail during offlining, and we keep retrying in
      that case.  This is currently obfuscated by a goto retry loop.  The
      code is hard to follow and, as a result, even suboptimal, because each
      retry round scans the full range from start_pfn even though we have
      already successfully scanned/migrated the [start_pfn, pfn] range.
      This is all only because a check_pages_isolated failure has to rescan
      the full range again.
      
      De-obfuscate the migration retry loop by promoting it to a real for
      loop.  In fact, remove the goto altogether by making it a proper
      double loop (yeah, gotos are nasty in this specific case).  In the end
      we get slightly more optimal code which is also more readable.
      
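      A sketch of the resulting double loop (hedged; the return conventions
      are from memory, not quoted from the patch):

        do {
                for (pfn = start_pfn; pfn;) {
                        if (signal_pending(current))
                                goto failed_removal_isolated;
                        cond_resched();

                        pfn = scan_movable_pages(pfn, end_pfn);
                        if (pfn)        /* something movable is left */
                                do_migrate_range(pfn, end_pfn);
                }
                /* Only a failed isolation check forces a full rescan. */
                offlined_pages = check_pages_isolated(start_pfn, end_pfn);
        } while (offlined_pages < 0);
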
      [akpm@linux-foundation.org: reflow comments to 80 cols]
      Link: http://lkml.kernel.org/r/20181211142741.2607-3-mhocko@kernel.org
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Reviewed-by: David Hildenbrand <david@redhat.com>
      Reviewed-by: Oscar Salvador <osalvador@suse.de>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
      Cc: William Kucharski <william.kucharski@oracle.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm, memory_hotplug: try to migrate full pfn range · a85009c3
      Committed by Michal Hocko
      Patch series "few memory offlining enhancements".
      
      I have been chasing a case of memory offlining not making progress
      recently.  Along the way I noticed a few weird decisions in the code.
      The migration itself is restricted without a reasonable justification,
      and the retry loop around the migration is quite messy.  This is
      addressed by patches 1 and 2.
      
      Patch 3 targets the faultaround code, which has been a hot candidate
      for the initial issue reported upstream [2] and which I am debugging
      internally.  It turned out not to be the main contributor in the end,
      but I believe we should address it regardless.  See the patch
      description for more details.
      
      [1] http://lkml.kernel.org/r/20181120134323.13007-1-mhocko@kernel.org
      [2] http://lkml.kernel.org/r/20181114070909.GB2653@MiWiFi-R3L-srv
      
      This patch (of 3):
      
      do_migrate_range has been limiting the number of pages to migrate to
      256 for some reason which is not documented.  Even if the limit made
      some sense back when it was introduced, it doesn't really serve a
      good purpose these days.  If the range contains huge pages, then we
      break out of the loop too early and go through LRU and pcp cache
      draining and scan_movable_pages again, which is quite suboptimal.
      
      The only reason to limit the number of pages that I can think of is
      to reduce the potential time to react to a fatal signal.  But even
      then the number of pages is a questionable metric, because even a
      single page migration might block in a non-killable state
      (e.g.  __unmap_and_move).
      
      Remove the limit and offline the full requested range (this is one
      memblock worth of pages with the current code).  Should we ever get a
      report that offlining takes too long to react to a fatal signal, then
      we should rather fix the core migration to use killable waits and
      bail out on a signal.
      
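      In code terms the change boils down to dropping the cap from the main
      loop of do_migrate_range() (a sketch; the old constant's name is from
      memory):

        /* Before: bail out of the isolation loop after an arbitrary
         * NR_OFFLINE_AT_ONCE_PAGES (256) pages. */
        for (pfn = start_pfn; pfn < end_pfn && move_pages > 0; pfn++) {
                /* ... isolate the page at pfn onto &source ... */
                move_pages--;
        }

        /* After: cover the full requested range in one go. */
        for (pfn = start_pfn; pfn < end_pfn; pfn++) {
                /* ... isolate the page at pfn onto &source ... */
        }
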
      Link: http://lkml.kernel.org/r/20181211142741.2607-1-mhocko@kernel.org
      Link: http://lkml.kernel.org/r/20181211142741.2607-2-mhocko@kernel.org
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Reviewed-by: David Hildenbrand <david@redhat.com>
      Reviewed-by: Pavel Tatashin <pasha.tatashin@soleen.com>
      Reviewed-by: Oscar Salvador <osalvador@suse.de>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: William Kucharski <william.kucharski@oracle.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • hwpoison, memory_hotplug: allow hwpoisoned pages to be offlined · b15c8726
      Committed by Michal Hocko
      We have received a bug report that an injected MCE about faulty
      memory prevents memory offlining from succeeding on a 4.4-based
      kernel.  The underlying reason was that the HWPoison page has an
      elevated reference count and the migration keeps failing.  There are
      two problems with that.  First of all, it is dubious to migrate the
      poisoned page, because we know that accessing that memory can fail.
      Secondly, it doesn't make any sense to migrate potentially broken
      content and preserve the memory corruption over to a new location.
      
      Oscar has found out that 4.4 and the current upstream kernels behave
      slightly differently with his simple test case:
      
      ===
      
      #define _GNU_SOURCE             /* MADV_HWPOISON via <sys/mman.h> */
      #include <stdio.h>
      #include <stdlib.h>
      #include <fcntl.h>
      #include <unistd.h>
      #include <sys/mman.h>
      
      /* PAGE_ALIGN is a kernel macro; define it for a standalone build. */
      #define PAGE_SIZE 4096UL
      #define PAGE_ALIGN(x) (((x) + PAGE_SIZE - 1) & ~(PAGE_SIZE - 1))
      
      int main(void)
      {
              int ret;
              int i;
              int fd;
              char *array = malloc(4096);
              char *array_locked = malloc(4096);
      
              fd = open("/tmp/data", O_RDONLY);
              read(fd, array, 4095);
      
              for (i = 0; i < 4096; i++)
                      array_locked[i] = 'd';
      
              ret = mlock((void *)PAGE_ALIGN((unsigned long)array_locked), sizeof(array_locked));
              if (ret)
                      perror("mlock");
      
              sleep (20);
      
              ret = madvise((void *)PAGE_ALIGN((unsigned long)array_locked), 4096, MADV_HWPOISON);
              if (ret)
                      perror("madvise");
      
              for (i = 0; i < 4096; i++)
                      array_locked[i] = 'd';
      
              return 0;
      }
      ===
      
      + offline this memory.
      
      In 4.4 kernels he saw the hwpoisoned page returned back to the LRU
      list:
      kernel:  [<ffffffff81019ac9>] dump_trace+0x59/0x340
      kernel:  [<ffffffff81019e9a>] show_stack_log_lvl+0xea/0x170
      kernel:  [<ffffffff8101ac71>] show_stack+0x21/0x40
      kernel:  [<ffffffff8132bb90>] dump_stack+0x5c/0x7c
      kernel:  [<ffffffff810815a1>] warn_slowpath_common+0x81/0xb0
      kernel:  [<ffffffff811a275c>] __pagevec_lru_add_fn+0x14c/0x160
      kernel:  [<ffffffff811a2eed>] pagevec_lru_move_fn+0xad/0x100
      kernel:  [<ffffffff811a334c>] __lru_cache_add+0x6c/0xb0
      kernel:  [<ffffffff81195236>] add_to_page_cache_lru+0x46/0x70
      kernel:  [<ffffffffa02b4373>] extent_readpages+0xc3/0x1a0 [btrfs]
      kernel:  [<ffffffff811a16d7>] __do_page_cache_readahead+0x177/0x200
      kernel:  [<ffffffff811a18c8>] ondemand_readahead+0x168/0x2a0
      kernel:  [<ffffffff8119673f>] generic_file_read_iter+0x41f/0x660
      kernel:  [<ffffffff8120e50d>] __vfs_read+0xcd/0x140
      kernel:  [<ffffffff8120e9ea>] vfs_read+0x7a/0x120
      kernel:  [<ffffffff8121404b>] kernel_read+0x3b/0x50
      kernel:  [<ffffffff81215c80>] do_execveat_common.isra.29+0x490/0x6f0
      kernel:  [<ffffffff81215f08>] do_execve+0x28/0x30
      kernel:  [<ffffffff81095ddb>] call_usermodehelper_exec_async+0xfb/0x130
      kernel:  [<ffffffff8161c045>] ret_from_fork+0x55/0x80
      
      And the latter confuses the hotremove path, because an LRU page is
      attempted to be migrated and that fails due to an elevated reference
      count.  It is quite possible that the reuse of the HWPoisoned page is
      some kind of fixed race condition, but I am not really sure about
      that.
      
      With the upstream kernel the failure is slightly different.  The page
      doesn't seem to have the LRU bit set, but isolate_movable_page simply
      fails, and do_migrate_range puts all the isolated pages back to the
      LRU; therefore no progress is made and scan_movable_pages finds the
      same set of pages over and over again.
      
      Fix both cases by explicitly checking for HWPoisoned pages before we
      even try to get a reference on the page, and by trying to unmap such
      a page if it is still mapped.  As explained by Naoya:
      
      : Hwpoison code never unmapped those for no big reason because
      : Ksm pages never dominate memory, so we simply didn't have strong
      : motivation to save the pages.
      
      Also put a WARN_ON(PageLRU) in case there is a race and we can hit
      LRU HWPoison pages, which shouldn't happen, but I couldn't convince
      myself of that.  Naoya has noted the following:
      
      : Theoretically no such guarantee, because try_to_unmap() doesn't have a
      : guarantee of success and then memory_failure() returns immediately
      : when hwpoison_user_mappings fails.
      : Or the following code (comes after the hwpoison_user_mappings block) also
      : implies that the target page can still have the PageLRU flag.
      :
      :         /*
      :          * Torn down by someone else?
      :          */
      :         if (PageLRU(p) && !PageSwapCache(p) && p->mapping == NULL) {
      :                 action_result(pfn, MF_MSG_TRUNCATED_LRU, MF_IGNORED);
      :                 res = -EBUSY;
      :                 goto out;
      :         }
      :
      : So I think it's OK to keep the "if (WARN_ON(PageLRU(page)))" block in
      : the current version of your patch.
      
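      The added check in do_migrate_range(), roughly (hedged; flags and
      helpers as in kernels of this era):

        if (PageHWPoison(page)) {
                /* Poisoned pages are not migrated: their refcount is
                 * elevated and their content is not worth preserving. */
                if (WARN_ON(PageLRU(page)))
                        isolate_lru_page(page);
                if (page_mapped(page))
                        try_to_unmap(page,
                                     TTU_IGNORE_MLOCK | TTU_IGNORE_ACCESS);
                continue;
        }
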
      Link: http://lkml.kernel.org/r/20181206120135.14079-1-mhocko@kernel.org
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Reviewed-by: Oscar Salvador <osalvador@suse.com>
      Debugged-by: Oscar Salvador <osalvador@suse.com>
      Tested-by: Oscar Salvador <osalvador@suse.com>
      Acked-by: David Hildenbrand <david@redhat.com>
      Acked-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm, hotplug: move init_currently_empty_zone() under zone_span_lock protection · fa004ab7
      Committed by Wei Yang
      During the online_pages phase, pgdat->nr_zones will be updated in
      case the zone is empty.
      
      Currently the online_pages phase is protected by the global locks
      (device_hotplug_lock and mem_hotplug_lock), which ensures there is
      no contention during the update of nr_zones.
      
      These global locks introduce scalability issues (especially the
      second one), which slow down code relying on get_online_mems().
      This is also a preparation for not having to rely on
      get_online_mems(), but instead on some more fine-grained locks.
      
      The patch moves init_currently_empty_zone under both
      zone_span_writelock and pgdat_resize_lock, because both the pgdat
      state (nr_zones) and the zone's start_pfn are changed.  This patch
      also changes the documentation of node_size_lock to include the
      protection of nr_zones.
      
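      A sketch of the resulting locking in move_pfn_range_to_zone()
      (hedged; abbreviated):

        pgdat_resize_lock(pgdat, &flags);
        zone_span_writelock(zone);
        if (zone_is_empty(zone))
                init_currently_empty_zone(zone, start_pfn, nr_pages);
        resize_zone_range(zone, start_pfn, nr_pages);
        zone_span_writeunlock(zone);
        resize_pgdat_range(pgdat, start_pfn, nr_pages);
        pgdat_resize_unlock(pgdat, &flags);
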
      Link: http://lkml.kernel.org/r/20181203205016.14123-1-richard.weiyang@gmail.com
      Signed-off-by: Wei Yang <richard.weiyang@gmail.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Reviewed-by: Oscar Salvador <osalvador@suse.de>
      Cc: David Hildenbrand <david@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm, sparse: pass nid instead of pgdat to sparse_add_one_section() · 4e0d2e7e
      Committed by Wei Yang
      Since the only information needed in sparse_add_one_section() is the
      node id, used to allocate memory on the proper node, it is not
      necessary to pass the whole pgdat.
      
      This patch changes the prototype of sparse_add_one_section() to pass
      the node id directly.  This is intended to avoid the misleading
      impression that sparse_add_one_section() would touch the pgdat.
      
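      The prototype change, in short (hedged: the parameter list beyond the
      pgdat-to-nid swap is from memory):

        /* Before */
        int sparse_add_one_section(struct pglist_data *pgdat,
                                   unsigned long start_pfn,
                                   struct vmem_altmap *altmap);
        /* After */
        int sparse_add_one_section(int nid, unsigned long start_pfn,
                                   struct vmem_altmap *altmap);
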
      Link: http://lkml.kernel.org/r/20181204085657.20472-2-richard.weiyang@gmail.com
      Signed-off-by: Wei Yang <richard.weiyang@gmail.com>
      Reviewed-by: David Hildenbrand <david@redhat.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Oscar Salvador <osalvador@suse.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm, memory_hotplug: add nid parameter to arch_remove_memory · 2c2a5af6
      Committed by Oscar Salvador
      Patch series "Do not touch pages in hot-remove path", v2.
      
      This patchset aims for two things:
      
       1) A better definition about offline and hot-remove stage
       2) Solving bugs where we can access non-initialized pages
          during hot-remove operations [2] [3].
      
      This is achieved by moving all page/zone handling to the offline
      stage, so we do not need to access pages when hot-removing memory.
      
      [1] https://patchwork.kernel.org/cover/10691415/
      [2] https://patchwork.kernel.org/patch/10547445/
      [3] https://www.spinics.net/lists/linux-mm/msg161316.html
      
      This patch (of 5):
      
      This is a preparation for the follow-up patches.  The idea of passing
      the nid is that it will allow us to get rid of the zone parameter
      afterwards.
      
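      The resulting signature (hedged: parameters other than the new nid
      are from memory):

        int arch_remove_memory(int nid, u64 start, u64 size,
                               struct vmem_altmap *altmap);
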
      Link: http://lkml.kernel.org/r/20181127162005.15833-2-osalvador@suse.de
      Signed-off-by: Oscar Salvador <osalvador@suse.de>
      Reviewed-by: David Hildenbrand <david@redhat.com>
      Reviewed-by: Pavel Tatashin <pasha.tatashin@soleen.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Jerome Glisse <jglisse@redhat.com>
      Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com>
      Cc: "Rafael J. Wysocki" <rafael@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/memory_hotplug: drop "online" parameter from add_memory_resource() · f29d8e9c
      Committed by David Hildenbrand
      Userspace should always be in charge of how to online memory and of
      whether memory should be onlined automatically in the kernel.  Let's
      drop the parameter that overrides this - XEN passes memhp_auto_online,
      just like add_memory(), so we can directly use that internally
      instead.
      
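      The prototype change, in short (hedged sketch):

        /* Before */
        int add_memory_resource(int nid, struct resource *resource,
                                bool online);
        /* After: the kernel-wide memhp_auto_online policy is used instead */
        int add_memory_resource(int nid, struct resource *resource);
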
      Link: http://lkml.kernel.org/r/20181123123740.27652-1-david@redhat.com
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Reviewed-by: Oscar Salvador <osalvador@suse.de>
      Acked-by: Juergen Gross <jgross@suse.com>
      Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
      Cc: Stefano Stabellini <sstabellini@kernel.org>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Pavel Tatashin <pasha.tatashin@oracle.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Arun KS <arunks@codeaurora.org>
      Cc: Mathieu Malaterre <malat@debian.org>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm, memory_hotplug: do not clear numa_node association after hot_remove · 46a3679b
      Committed by Michal Hocko
      The per-cpu numa_node provides a default node for each possible cpu.
      The association gets initialized during boot when the
      architecture-specific code explores cpu->NUMA affinity.  When the
      whole NUMA node is removed, though, we clear this association:
      
      try_offline_node
        check_and_unmap_cpu_on_node
          unmap_cpu_on_node
            numa_clear_node
              numa_set_node(cpu, NUMA_NO_NODE)
      
      This means that whoever calls cpu_to_node for a cpu associated with
      such a node will get NUMA_NO_NODE.  This is problematic for two
      reasons.  First, it is fragile because __alloc_pages_node would simply
      blow up on an out-of-bounds access.  We have encountered this when
      loading the kvm module:
      
        BUG: unable to handle kernel paging request at 00000000000021c0
        IP: __alloc_pages_nodemask+0x93/0xb70
        PGD 800000ffe853e067 PUD 7336bbc067 PMD 0
        Oops: 0000 [#1] SMP
        [...]
        CPU: 88 PID: 1223749 Comm: modprobe Tainted: G        W          4.4.156-94.64-default #1
        RIP: __alloc_pages_nodemask+0x93/0xb70
        RSP: 0018:ffff887354493b40  EFLAGS: 00010202
        RAX: 00000000000021c0 RBX: 0000000000000000 RCX: 0000000000000000
        RDX: 0000000000000000 RSI: 0000000000000002 RDI: 00000000014000c0
        RBP: 00000000014000c0 R08: ffffffffffffffff R09: 0000000000000000
        R10: ffff88fffc89e790 R11: 0000000000014000 R12: 0000000000000101
        R13: ffffffffa0772cd4 R14: ffffffffa0769ac0 R15: 0000000000000000
        FS:  00007fdf2f2f1700(0000) GS:ffff88fffc880000(0000) knlGS:0000000000000000
        CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
        CR2: 00000000000021c0 CR3: 00000077205ee000 CR4: 0000000000360670
        DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
        DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
        Call Trace:
          alloc_vmcs_cpu+0x3d/0x90 [kvm_intel]
          hardware_setup+0x781/0x849 [kvm_intel]
          kvm_arch_hardware_setup+0x28/0x190 [kvm]
          kvm_init+0x7c/0x2d0 [kvm]
          vmx_init+0x1e/0x32c [kvm_intel]
          do_one_initcall+0xca/0x1f0
          do_init_module+0x5a/0x1d7
          load_module+0x1393/0x1c90
          SYSC_finit_module+0x70/0xa0
          entry_SYSCALL_64_fastpath+0x1e/0xb7
        DWARF2 unwinder stuck at entry_SYSCALL_64_fastpath+0x1e/0xb7
      
      on an older kernel, but the code is basically the same in the current
      Linus tree as well.  alloc_vmcs_cpu could use alloc_pages_nodemask,
      which would recognize NUMA_NO_NODE and use alloc_pages_node, which
      would translate it to numa_mem_id, but that is wrong as well, because
      it would use the cpu affinity of the local CPU, which might be quite
      far from the original node.  It is also reasonable to expect that
      cpu_to_node will provide a sane value, and there might be many more
      callers like that.
      
      The second problem is that __register_one_node relies on cpu_to_node to
      properly associate cpus back to the node when it is onlined.  We do not
      want to lose that link as there is no arch independent way to get it from
      the early boot time AFAICS.
      
      Drop the whole check_and_unmap_cpu_on_node machinery and keep the
      association to fix both issues.  The NODE_DATA(nid) is not deallocated so
      it will stay in place and if anybody wants to allocate from that node then
      a fallback node will be used.
      
      Thanks to Vlastimil Babka for his live system debugging skills that helped
      debugging the issue.
      
      Link: http://lkml.kernel.org/r/20181108100413.966-1-mhocko@kernel.org
      Fixes: e13fe869 ("cpu-hotplug,memory-hotplug: clear cpu_to_node() when offlining the node")
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Debugged-by: Vlastimil Babka <vbabka@suse.cz>
      Reported-by: Miroslav Benes <mbenes@suse.cz>
      Acked-by: Anshuman Khandual <anshuman.khandual@arm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: only report isolation failures when offlining memory · d381c547
      Committed by Michal Hocko
      Heiko has complained that his log is swamped by warnings from
      has_unmovable_pages
      
      [   20.536664] page dumped because: has_unmovable_pages
      [   20.536792] page:000003d081ff4080 count:1 mapcount:0 mapping:000000008ff88600 index:0x0 compound_mapcount: 0
      [   20.536794] flags: 0x3fffe0000010200(slab|head)
      [   20.536795] raw: 03fffe0000010200 0000000000000100 0000000000000200 000000008ff88600
      [   20.536796] raw: 0000000000000000 0020004100000000 ffffffff00000001 0000000000000000
      [   20.536797] page dumped because: has_unmovable_pages
      [   20.536814] page:000003d0823b0000 count:1 mapcount:0 mapping:0000000000000000 index:0x0
      [   20.536815] flags: 0x7fffe0000000000()
      [   20.536817] raw: 07fffe0000000000 0000000000000100 0000000000000200 0000000000000000
      [   20.536818] raw: 0000000000000000 0000000000000000 ffffffff00000001 0000000000000000
      
      which are not triggered by memory hotplug but rather by the CMA
      allocator.  The original idea behind dumping the page state for all
      call paths was that these messages would be helpful for debugging
      failures.  From the above it seems that this is not the case for the
      CMA path, because we are lacking much more context.  E.g. the second
      reported page might be a CMA-allocated page.  It is still interesting
      to see a slab page in the CMA area, but it is hard to tell whether
      this is a bug from the above output alone.
      
      Address this issue by dumping the page state only on request.  Both
      start_isolate_page_range and has_unmovable_pages already have an
      argument to ignore hwpoison pages, so make this argument more
      generic, turn it into flags, and allow callers to combine non-default
      modes into a mask.  While we are at it, reporting the failure from
      the has_unmovable_pages call in is_pageblock_removable_nolock (the
      sysfs removable file) is questionable, so drop it from there as well.
      
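      A sketch of the resulting interface (hedged; reconstructed from the
      description):

        /* include/linux/page-isolation.h */
        #define SKIP_HWPOISON   0x1     /* was the lone bool argument */
        #define REPORT_FAILURE  0x2     /* dump_page() only when set */

        bool has_unmovable_pages(struct zone *zone, struct page *page,
                                 int count, int migratetype, int flags);
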
      Link: http://lkml.kernel.org/r/20181218092802.31429-1-mhocko@kernel.org
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Reported-by: Heiko Carstens <heiko.carstens@de.ibm.com>
      Reviewed-by: Oscar Salvador <osalvador@suse.de>
      Cc: Anshuman Khandual <anshuman.khandual@arm.com>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm, memory_hotplug: be more verbose for memory offline failures · 2932c8b0
      Committed by Michal Hocko
      There is only very limited information printed when the memory offlining
      fails:
      
      [ 1984.506184] rac1 kernel: memory offlining [mem 0x82600000000-0x8267fffffff] failed due to signal backoff
      
      This tells us that the failure is triggered by userspace
      intervention, but it doesn't tell us much more about the underlying
      reason.  It might be that the page migration fails repeatedly and the
      userspace timeout expires and sends a signal, or it might be that
      some of the earlier steps (isolation, memory notifier) take too long.
      
      If the migration fails, then it would be really helpful to see which
      page that is, and its state.  The same applies to the isolation
      phase.  If we fail to isolate a page from the allocator, then knowing
      the state of the page would be helpful as well.
      
      Dump the page state that fails to get isolated or migrated.  This will
      tell us more about the failure and what to focus on during debugging.
      
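      A sketch of the migration side (hedged; the reason strings are
      illustrative):

        ret = migrate_pages(&source, new_node_page, NULL, 0,
                            MIGRATE_SYNC, MR_MEMORY_HOTPLUG);
        if (ret) {
                list_for_each_entry(page, &source, lru) {
                        pr_warn("migrating pfn %lx failed ret:%d\n",
                                page_to_pfn(page), ret);
                        dump_page(page, "migration failure");
                }
                putback_movable_pages(&source);
        }
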
      [akpm@linux-foundation.org: add missing printk arg]
      [mhocko@suse.com: tweak dump_page() `reason' text]
        Link: http://lkml.kernel.org/r/20181116083020.20260-6-mhocko@kernel.org
      Link: http://lkml.kernel.org/r/20181107101830.17405-6-mhocko@kernel.org
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Oscar Salvador <osalvador@suse.de>
      Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com>
      Cc: Baoquan He <bhe@redhat.com>
      Cc: Oscar Salvador <OSalvador@suse.com>
      Cc: William Kucharski <william.kucharski@oracle.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm, memory_hotplug: print reason for the offlining failure · 79605093
      Committed by Michal Hocko
      The memory offlining failure reporting is inconsistent and
      insufficient.  Some error paths simply do not report the failure to
      the log at all.  When we do report, there are no details about the
      reason for the failure, and there are several possible reasons, which
      makes memory offlining failures hard to debug.
      
      Make sure that the
      	memory offlining [mem %#010llx-%#010llx] failed
      message is printed for all failures and also provide a short textual
      reason for the failure e.g.
      
      [ 1984.506184] rac1 kernel: memory offlining [mem 0x82600000000-0x8267fffffff] failed due to signal backoff
      
      This tells us that the offlining has failed because of a pending
      signal, i.e. user intervention.
      
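      A sketch of the mechanism (hedged; label and variable names are
      illustrative):

        const char *reason = NULL;

        if (signal_pending(current)) {
                ret = -EINTR;
                reason = "signal backoff";
                goto failed_removal;
        }
        /* ... every other error path sets ret and reason similarly ... */

        failed_removal:
        pr_info("memory offlining [mem %#010llx-%#010llx] failed due to %s\n",
                (u64)start_pfn << PAGE_SHIFT,
                ((u64)end_pfn << PAGE_SHIFT) - 1, reason);
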
      [akpm@linux-foundation.org: tweak messages a bit]
      Link: http://lkml.kernel.org/r/20181107101830.17405-5-mhocko@kernel.org
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Oscar Salvador <osalvador@suse.de>
      Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com>
      Cc: Baoquan He <bhe@redhat.com>
      Cc: Oscar Salvador <OSalvador@suse.com>
      Cc: William Kucharski <william.kucharski@oracle.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm, memory_hotplug: drop pointless block alignment checks from __offline_pages · 6cc2baf6
      Committed by Michal Hocko
      This function is never called from a context which would provide a
      misaligned pfn range, so drop the pointless check.
      
      Link: http://lkml.kernel.org/r/20181107101830.17405-4-mhocko@kernel.org
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Oscar Salvador <osalvador@suse.de>
      Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com>
      Cc: Baoquan He <bhe@redhat.com>
      Cc: Oscar Salvador <OSalvador@suse.com>
      Cc: William Kucharski <william.kucharski@oracle.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  7. 04 November 2018, 1 commit
    • memory_hotplug: cond_resched in __remove_pages · dd33ad7b
      Committed by Michal Hocko
      We have received a bug report that unbinding a large pmem (>1TB) can
      result in a soft lockup:
      
        NMI watchdog: BUG: soft lockup - CPU#9 stuck for 23s! [ndctl:4365]
        [...]
        Supported: Yes
        CPU: 9 PID: 4365 Comm: ndctl Not tainted 4.12.14-94.40-default #1 SLE12-SP4
        Hardware name: Intel Corporation S2600WFD/S2600WFD, BIOS SE5C620.86B.01.00.0833.051120182255 05/11/2018
        task: ffff9cce7d4410c0 task.stack: ffffbe9eb1bc4000
        RIP: 0010:__put_page+0x62/0x80
        Call Trace:
         devm_memremap_pages_release+0x152/0x260
         release_nodes+0x18d/0x1d0
         device_release_driver_internal+0x160/0x210
         unbind_store+0xb3/0xe0
         kernfs_fop_write+0x102/0x180
         __vfs_write+0x26/0x150
         vfs_write+0xad/0x1a0
         SyS_write+0x42/0x90
         do_syscall_64+0x74/0x150
         entry_SYSCALL_64_after_hwframe+0x3d/0xa2
        RIP: 0033:0x7fd13166b3d0
      
      It has been reported on an older (4.12) kernel, but the current
      upstream code doesn't call cond_resched in the hot-remove code at
      all, and the given range to remove might be really large.  Fix the
      issue by calling cond_resched once per memory section.
      
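      A sketch of the fix in __remove_pages() (hedged; abbreviated):

        for (i = 0; i < sections_to_remove; i++) {
                unsigned long pfn = phys_start_pfn + i * PAGES_PER_SECTION;

                cond_resched();         /* the fix: yield once per section */
                ret = __remove_section(zone, __pfn_to_section(pfn),
                                       map_offset, altmap);
                map_offset = 0;
                if (ret)
                        break;
        }
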
      Link: http://lkml.kernel.org/r/20181031125840.23982-1-mhocko@kernel.org
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Johannes Thumshirn <jthumshirn@suse.de>
      Cc: Dan Williams <dan.j.williams@gmail.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  8. 31 October 2018, 4 commits
    • mm/memory_hotplug: fix online/offline_pages called w.o. mem_hotplug_lock · 381eab4a
      Committed by David Hildenbrand
      There seem to be some problems as a result of 30467e0b ("mm,
      hotplug: fix concurrent memory hot-add deadlock"), which tried to fix
      a possible lock inversion reported and discussed in [1] due to the
      two locks
      	a) device_lock()
      	b) mem_hotplug_lock
      
      While add_memory() first takes b), followed by a) during
      bus_probe_device(), onlining of memory from user space first took a),
      followed by b), exposing a possible deadlock.
      
      In [1] it was decided not to make use of device_hotplug_lock, but
      rather to enforce a locking order.
      
      The problems I spotted related to this:
      
      1. Memory block device attributes: While .state first calls
         mem_hotplug_begin() and then calls device_online() - which takes
         device_lock() - .online no longer calls mem_hotplug_begin(), so it
         effectively calls online_pages() without mem_hotplug_lock.
      
      2. device_online() should be called under device_hotplug_lock, however
         onlining memory during add_memory() does not take care of that.
      
      In addition, I think there is also something wrong about the locking in
      
      3. arch/powerpc/platforms/powernv/memtrace.c calls offline_pages()
         without locks. This was introduced after 30467e0b. And skimming
         over the code, I assume it could need some more care in regards to
         locking (e.g. device_online() called without device_hotplug_lock).
         This will be addressed in the following patches.
      
      Now that we hold the device_hotplug_lock when
      - adding memory (e.g. via add_memory()/add_memory_resource())
      - removing memory (e.g. via remove_memory())
      - device_online()/device_offline()
      
      We can move mem_hotplug_lock usage back into
      online_pages()/offline_pages().
      
      Why is mem_hotplug_lock still needed? Essentially to make
      get_online_mems()/put_online_mems() be very fast (relying on
      device_hotplug_lock would be very slow), and to serialize against
      addition of memory that does not create memory block devices (hmm).
      
      [1] http://driverdev.linuxdriverproject.org/pipermail/driverdev-devel/2015-February/065324.html
      
      This patch is partly based on a patch by Vitaly Kuznetsov.
      
      Link: http://lkml.kernel.org/r/20180925091457.28651-4-david@redhat.com
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Reviewed-by: Pavel Tatashin <pavel.tatashin@microsoft.com>
      Reviewed-by: Rashmica Gupta <rashmica.g@gmail.com>
      Reviewed-by: Oscar Salvador <osalvador@suse.de>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
      Cc: Len Brown <lenb@kernel.org>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: "K. Y. Srinivasan" <kys@microsoft.com>
      Cc: Haiyang Zhang <haiyangz@microsoft.com>
      Cc: Stephen Hemminger <sthemmin@microsoft.com>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
      Cc: Juergen Gross <jgross@suse.com>
      Cc: Rashmica Gupta <rashmica.g@gmail.com>
      Cc: Michael Neuling <mikey@neuling.org>
      Cc: Balbir Singh <bsingharora@gmail.com>
      Cc: Kate Stewart <kstewart@linuxfoundation.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Philippe Ombredanne <pombredanne@nexb.com>
      Cc: Pavel Tatashin <pavel.tatashin@microsoft.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: YASUAKI ISHIMATSU <yasu.isimatu@gmail.com>
      Cc: Mathieu Malaterre <malat@debian.org>
      Cc: John Allen <jallen@linux.vnet.ibm.com>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Nathan Fontenot <nfont@linux.vnet.ibm.com>
      Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/memory_hotplug: make add_memory() take the device_hotplug_lock · 8df1d0e4
      Committed by David Hildenbrand
      add_memory() currently does not take the device_hotplug_lock;
      however, it is already called under the lock from
      	arch/powerpc/platforms/pseries/hotplug-memory.c
      	drivers/acpi/acpi_memhotplug.c
      to synchronize against CPU hot-remove and similar.
      
      In general, we should hold the device_hotplug_lock when adding memory to
      synchronize against online/offline requests (e.g. from user space) - which
      already resulted in lock inversions due to device_lock() and
      mem_hotplug_lock - see 30467e0b ("mm, hotplug: fix concurrent memory
      hot-add deadlock").  add_memory()/add_memory_resource() will create memory
      block devices, so this really feels like the right thing to do.
      
      Holding the device_hotplug_lock makes sure that a memory block device
      can really only be accessed (e.g. via .online/.state) from user space,
      once the memory has been fully added to the system.
      
      The lock is not held yet in
      	drivers/xen/balloon.c
      	arch/powerpc/platforms/powernv/memtrace.c
      	drivers/s390/char/sclp_cmd.c
      	drivers/hv/hv_balloon.c
      So, let's either use the locked variants or take the lock.
      
      Don't export add_memory_resource(), as it once was exported to be used by
      XEN, which is never built as a module.  If somebody requires it, we also
      have to export a locked variant (as device_hotplug_lock is never
      exported).
      
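      A sketch of the resulting split (hedged):

        /* For callers that already hold device_hotplug_lock. */
        int __add_memory(int nid, u64 start, u64 size);

        /* The exported variant takes the lock itself. */
        int add_memory(int nid, u64 start, u64 size)
        {
                int rc;

                lock_device_hotplug();
                rc = __add_memory(nid, start, size);
                unlock_device_hotplug();

                return rc;
        }
        EXPORT_SYMBOL_GPL(add_memory);
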
      Link: http://lkml.kernel.org/r/20180925091457.28651-3-david@redhat.com
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Reviewed-by: Pavel Tatashin <pavel.tatashin@microsoft.com>
      Reviewed-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
      Reviewed-by: Rashmica Gupta <rashmica.g@gmail.com>
      Reviewed-by: Oscar Salvador <osalvador@suse.de>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
      Cc: Len Brown <lenb@kernel.org>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
      Cc: Juergen Gross <jgross@suse.com>
      Cc: Nathan Fontenot <nfont@linux.vnet.ibm.com>
      Cc: John Allen <jallen@linux.vnet.ibm.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Mathieu Malaterre <malat@debian.org>
      Cc: Pavel Tatashin <pavel.tatashin@microsoft.com>
      Cc: YASUAKI ISHIMATSU <yasu.isimatu@gmail.com>
      Cc: Balbir Singh <bsingharora@gmail.com>
      Cc: Haiyang Zhang <haiyangz@microsoft.com>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Kate Stewart <kstewart@linuxfoundation.org>
      Cc: "K. Y. Srinivasan" <kys@microsoft.com>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Michael Neuling <mikey@neuling.org>
      Cc: Philippe Ombredanne <pombredanne@nexb.com>
      Cc: Stephen Hemminger <sthemmin@microsoft.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/memory_hotplug: make remove_memory() take the device_hotplug_lock · d15e5926
      Committed by David Hildenbrand
      Patch series "mm: online/offline_pages called w.o. mem_hotplug_lock", v3.
      
      Reading through the code and studying how mem_hotplug_lock is to be used,
      I noticed that there are two places where we can end up calling
      device_online()/device_offline() -> online_pages()/offline_pages() without
      the mem_hotplug_lock.  And there are other places where we call
      device_online()/device_offline() without the device_hotplug_lock.
      
      While e.g.
      	echo "online" > /sys/devices/system/memory/memory9/state
      is fine, e.g.
      	echo 1 > /sys/devices/system/memory/memory9/online
      will not take the mem_hotplug_lock; it takes only the device_lock() and
      device_hotplug_lock.
      
      E.g.  via memory_probe_store(), we can end up calling
      add_memory()->online_pages() without the device_hotplug_lock.  So we can
      have concurrent callers in online_pages(), which then touch e.g.
      zone->present_pages essentially unprotected.
      
      There is a longer history to this (see patch #2 for details), and fixing
      it to work the way it was originally intended is not really possible: we
      would e.g.  have to take the mem_hotplug_lock in drivers/base/core.c,
      which sounds wrong.
      
      Summary: We had a lock inversion on mem_hotplug_lock and device_lock().
      More details can be found in patch 3 and patch 6.
      
      I propose the following general rules (documentation added in patch 6);
      a lock-nesting sketch follows the list:
      
      1. add_memory()/add_memory_resource() must only be called with the
         device_hotplug_lock held.
      2. remove_memory() must only be called with the device_hotplug_lock held.
         This is already documented and holds for all callers.
      3. device_online()/device_offline() must only be called with the
         device_hotplug_lock held. This is already documented and currently true
         for core code. Other callers (related to memory hotplug) have to be
         fixed up.
      4. mem_hotplug_lock is taken inside of add_memory()/remove_memory()/
         online_pages()/offline_pages().
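      
      Taken together, rules 1-4 amount to roughly the following nesting (a
      sketch only; mem_hotplug_begin()/mem_hotplug_done() are the existing
      helpers that take mem_hotplug_lock, and the double-underscore names
      denote the variants that expect device_hotplug_lock to be held):
      
      	lock_device_hotplug();           /* rules 1-3: held by the caller */
      	__add_memory(nid, start, size);  /* or __remove_memory(), device_online() */
      	  -> mem_hotplug_begin();        /* rule 4: taken internally */
      	  -> ...
      	  -> mem_hotplug_done();
      	unlock_device_hotplug();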
      
      To me, this looks way cleaner than what we have right now (and is easier
      to verify).  And looking at the documentation of remove_memory(), using
      lock_device_hotplug() also for add_memory() feels natural.
      
      This patch (of 6):
      
      remove_memory() is exported right now but requires the
      device_hotplug_lock, which is not exported.  So let's provide a variant
      that takes the lock and only export that one.
      
      The lock is already held in
      	arch/powerpc/platforms/pseries/hotplug-memory.c
      	drivers/acpi/acpi_memhotplug.c
      	arch/powerpc/platforms/powernv/memtrace.c
      
      Apart from that, there are no other users in the tree.
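      
      The exported wrapper then has roughly this shape (a sketch following
      the series' naming, where the double-underscore variant expects
      device_hotplug_lock to be held):
      
      	/* Requires device_hotplug_lock to be held by the caller. */
      	void __ref __remove_memory(int nid, u64 start, u64 size)
      	{
      		...
      	}
      
      	/* Exported variant that takes the lock itself. */
      	void remove_memory(int nid, u64 start, u64 size)
      	{
      		lock_device_hotplug();
      		__remove_memory(nid, start, size);
      		unlock_device_hotplug();
      	}
      	EXPORT_SYMBOL_GPL(remove_memory);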
      
      Link: http://lkml.kernel.org/r/20180925091457.28651-2-david@redhat.com
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Reviewed-by: Pavel Tatashin <pavel.tatashin@microsoft.com>
      Reviewed-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
      Reviewed-by: Rashmica Gupta <rashmica.g@gmail.com>
      Reviewed-by: Oscar Salvador <osalvador@suse.de>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
      Cc: Len Brown <lenb@kernel.org>
      Cc: Rashmica Gupta <rashmica.g@gmail.com>
      Cc: Michael Neuling <mikey@neuling.org>
      Cc: Balbir Singh <bsingharora@gmail.com>
      Cc: Nathan Fontenot <nfont@linux.vnet.ibm.com>
      Cc: John Allen <jallen@linux.vnet.ibm.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: YASUAKI ISHIMATSU <yasu.isimatu@gmail.com>
      Cc: Mathieu Malaterre <malat@debian.org>
      Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
      Cc: Haiyang Zhang <haiyangz@microsoft.com>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Juergen Gross <jgross@suse.com>
      Cc: Kate Stewart <kstewart@linuxfoundation.org>
      Cc: "K. Y. Srinivasan" <kys@microsoft.com>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Philippe Ombredanne <pombredanne@nexb.com>
      Cc: Stephen Hemminger <sthemmin@microsoft.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • M
      mm: remove include/linux/bootmem.h · 57c8a661
      Authored by Mike Rapoport
      Move remaining definitions and declarations from include/linux/bootmem.h
      into include/linux/memblock.h and remove the redundant header.
      
      The includes were replaced with the semantic patch below, followed by
      semi-automated removal of duplicated '#include <linux/memblock.h>' lines.
      
      @@
      @@
      - #include <linux/bootmem.h>
      + #include <linux/memblock.h>
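      
      Such a semantic patch can be applied with Coccinelle's spatch tool,
      e.g. "spatch --sp-file bootmem.cocci --in-place --dir ." (the
      bootmem.cocci file name is only illustrative).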
      
      [sfr@canb.auug.org.au: dma-direct: fix up for the removal of linux/bootmem.h]
        Link: http://lkml.kernel.org/r/20181002185342.133d1680@canb.auug.org.au
      [sfr@canb.auug.org.au: powerpc: fix up for removal of linux/bootmem.h]
        Link: http://lkml.kernel.org/r/20181005161406.73ef8727@canb.auug.org.au
      [sfr@canb.auug.org.au: x86/kaslr, ACPI/NUMA: fix for linux/bootmem.h removal]
        Link: http://lkml.kernel.org/r/20181008190341.5e396491@canb.auug.org.au
      Link: http://lkml.kernel.org/r/1536927045-23536-30-git-send-email-rppt@linux.vnet.ibm.com
      Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
      Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Chris Zankel <chris@zankel.net>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Geert Uytterhoeven <geert@linux-m68k.org>
      Cc: Greentime Hu <green.hu@gmail.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Guan Xuetao <gxt@pku.edu.cn>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: "James E.J. Bottomley" <jejb@parisc-linux.org>
      Cc: Jonas Bonn <jonas@southpole.se>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Ley Foon Tan <lftan@altera.com>
      Cc: Mark Salter <msalter@redhat.com>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Matt Turner <mattst88@gmail.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Michal Simek <monstr@monstr.eu>
      Cc: Palmer Dabbelt <palmer@sifive.com>
      Cc: Paul Burton <paul.burton@mips.com>
      Cc: Richard Kuo <rkuo@codeaurora.org>
      Cc: Richard Weinberger <richard@nod.at>
      Cc: Rich Felker <dalias@libc.org>
      Cc: Russell King <linux@armlinux.org.uk>
      Cc: Serge Semin <fancer.lancer@gmail.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: Vineet Gupta <vgupta@synopsys.com>
      Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  9. 27 Oct 2018: 4 commits
  10. 05 Sep 2018: 1 commit
  11. 23 Aug 2018: 1 commit
    • O
      mm/page_alloc: Introduce free_area_init_core_hotplug · 03e85f9d
      Authored by Oscar Salvador
      Currently, whenever a new node is created/re-used from the memhotplug
      path, we call free_area_init_node()->free_area_init_core().  But there is
      some code that we do not really need to run when we are coming from that
      path.
      
      free_area_init_core() performs the following actions:
      
      1) Initializes pgdat internals, such as the spinlock, waitqueues and more.
      2) Accounts nr_all_pages and nr_kernel_pages. These values are used later
         on when creating the hash tables.
      3) Accounts the number of managed_pages per zone, subtracting dma_reserved
         and memmap pages.
      4) Initializes some fields of the zone structure.
      5) Calls init_currently_empty_zone() to initialize all the freelists.
      6) Calls memmap_init() to initialize all pages belonging to a given zone.
      
      When called from the memhotplug path, free_area_init_core() only needs to
      perform actions #1 and #4.
      
      Action #2 is pointless, as the zones do not have any pages: either the
      node was freed, or we are re-using it, and either way all zones belonging
      to this node should have 0 pages.  For the same reason, action #3 always
      results in managed_pages being 0.
      
      Actions #5 and #6 are performed later on, when onlining the pages:
       online_pages()->move_pfn_range_to_zone()->init_currently_empty_zone()
       online_pages()->move_pfn_range_to_zone()->memmap_init_zone()
      
      This patch does two things:
      
      First, it moves the node/zone initialization into functions of their own,
      which allows us to create a small version of free_area_init_core() where
      we only perform:
      
      1) Initialization of pgdat internals, such as the spinlock, waitqueues and
         more.
      4) Initialization of some fields of the zone structure.
      
      These two functions are: pgdat_init_internals() and zone_init_internals().
      
      The second thing this patch does is introduce
      free_area_init_core_hotplug(), the memhotplug version of
      free_area_init_core():
      
      Currently, we call free_area_init_node() from the memhotplug path.  In
      there, we set some of pgdat's fields and call calculate_node_totalpages(),
      which calculates the number of pages the node has.
      
      Since the node is either new, or we are re-using it, the zones belonging
      to this node should not have any pages, so there is no point in
      calculating this now.
      
      Actually, we re-set these values to 0 later on with the calls to:
      
      reset_node_managed_pages()
      reset_node_present_pages()
      
      The number of pages per node and per zone will be calculated when onlining
      the pages:
      
      online_pages()->move_pfn_range()->move_pfn_range_to_zone()->resize_zone_range()
      online_pages()->move_pfn_range()->move_pfn_range_to_zone()->resize_pgdat_range()
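      
      Putting the pieces together, the hotplug variant then reduces to roughly
      the following (a sketch based on the description above; the exact
      zone_init_internals() signature in the final patch may differ):
      
      	void __ref free_area_init_core_hotplug(int nid)
      	{
      		enum zone_type z;
      		pg_data_t *pgdat = NODE_DATA(nid);
      
      		/* Action #1: spinlock, waitqueues and friends. */
      		pgdat_init_internals(pgdat);
      
      		/* Action #4: zone fields; all zones have 0 pages here. */
      		for (z = 0; z < MAX_NR_ZONES; z++)
      			zone_init_internals(&pgdat->node_zones[z], z, nid, 0);
      	}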
      
      Also, since free_area_init_core()/free_area_init_node() will now only be
      called during early init, replace __paginginit with __init so that their
      code gets freed up after boot.
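      
      The annotation change itself is mechanical, e.g. (illustrative
      signature):
      
      	-static void __paginginit free_area_init_core(struct pglist_data *pgdat)
      	+static void __init free_area_init_core(struct pglist_data *pgdat)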
      
      [osalvador@techadventures.net: fix section usage]
        Link: http://lkml.kernel.org/r/20180731101752.GA473@techadventures.net
      [osalvador@suse.de: v6]
        Link: http://lkml.kernel.org/r/20180801122348.21588-6-osalvador@techadventures.net
      Link: http://lkml.kernel.org/r/20180730101757.28058-5-osalvador@techadventures.net
      Signed-off-by: Oscar Salvador <osalvador@suse.de>
      Reviewed-by: Pavel Tatashin <pasha.tatashin@oracle.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Pasha Tatashin <Pavel.Tatashin@microsoft.com>
      Cc: Aaron Lu <aaron.lu@intel.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  12. 18 Aug 2018: 3 commits