1. 20 Sep, 2016 1 commit
  2. 12 Aug, 2016 1 commit
    • R
      mm/memory_hotplug.c: initialize per_cpu_nodestats for hotadded pgdats · 5830169f
      Reza Arbab committed
      The following oops occurs after a pgdat is hotadded:
      
        Unable to handle kernel paging request for data at address 0x00c30001
        Faulting instruction address: 0xc00000000022f8f4
        Oops: Kernel access of bad area, sig: 11 [#1]
        SMP NR_CPUS=2048 NUMA pSeries
        Modules linked in: ip6t_rpfilter ip6t_REJECT nf_reject_ipv6 ipt_REJECT nf_reject_ipv4 xt_conntrack ebtable_nat ebtable_broute bridge stp llc ebtable_filter ebtables ip6table_nat nf_conntrack_ipv6 nf_defrag_ipv6 nf_nat_ipv6 ip6table_mangle ip6table_security ip6table_raw ip6table_filter ip6_tables iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack iptable_mangle iptable_security iptable_raw iptable_filter nls_utf8 isofs sg virtio_balloon uio_pdrv_genirq uio ip_tables xfs libcrc32c sr_mod cdrom sd_mod virtio_net ibmvscsi scsi_transport_srp virtio_pci virtio_ring virtio dm_mirror dm_region_hash dm_log dm_mod
        CPU: 0 PID: 0 Comm: swapper/0 Tainted: G        W 4.8.0-rc1-device #110
        task: c000000000ef3080 task.stack: c000000000f6c000
        NIP: c00000000022f8f4 LR: c00000000022f948 CTR: 0000000000000000
        REGS: c000000000f6fa50 TRAP: 0300   Tainted: G        W (4.8.0-rc1-device)
        MSR: 800000010280b033 <SF,VEC,VSX,EE,FP,ME,IR,DR,RI,LE,TM[E]>  CR: 84002028  XER: 20000000
        CFAR: d000000001d2013c DAR: 0000000000c30001 DSISR: 40000000 SOFTE: 0
        NIP refresh_cpu_vm_stats+0x1a4/0x2f0
        LR refresh_cpu_vm_stats+0x1f8/0x2f0
        Call Trace:
          refresh_cpu_vm_stats+0x1f8/0x2f0 (unreliable)
      
      Add per_cpu_nodestats initialization to the hotplug codepath.
      
      Link: http://lkml.kernel.org/r/1470931473-7090-1-git-send-email-arbab@linux.vnet.ibm.com
      Signed-off-by: Reza Arbab <arbab@linux.vnet.ibm.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Paul Mackerras <paulus@ozlabs.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      5830169f
  3. 29 Jul, 2016 3 commits
  4. 27 Jul, 2016 2 commits
  5. 28 May, 2016 2 commits
  6. 20 May, 2016 3 commits
  7. 18 Mar, 2016 5 commits
    • J
      mm: coalesce split strings · 756a025f
      Joe Perches committed
      Kernel style prefers a single string over split strings when the string is
      'user-visible'.
      
      Miscellanea:
      
       - Add a missing newline
       - Realign arguments
      Signed-off-by: Joe Perches <joe@perches.com>
      Acked-by: Tejun Heo <tj@kernel.org>	[percpu]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      756a025f
    • C
      mm, memory hotplug: print debug message in the proper way for online_pages · e33e33b4
      Chen Yucong committed
      online_pages() simply returns an error value if
      memory_notify(MEM_GOING_ONLINE, &arg) returns a value that is not what we
      want for successfully onlining target pages.  This patch aims to print
      more failure information in online_pages(), as offline_pages() does.
      
      This patch also converts printk(KERN_<LEVEL>) to pr_<level>(), and changes
      __offline_pages() so that it no longer prints failure information with
      KERN_INFO, according to David Rientjes's suggestion[1].
      
      [1] https://lkml.org/lkml/2016/2/24/1094
      Signed-off-by: Chen Yucong <slaoub@gmail.com>
      Acked-by: David Rientjes <rientjes@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      e33e33b4
    • J
      mm: introduce page reference manipulation functions · fe896d18
      Joonsoo Kim committed
      The success of CMA allocation largely depends on the success of
      migration, and a key factor in migration is the page reference count.
      Until now, the page reference has been manipulated by calling atomic
      functions directly, so we cannot track who manipulates it and where.
      That makes it hard to find the actual reason for a CMA allocation
      failure.  CMA allocation should be guaranteed to succeed, so finding the
      offending place is really important.
      
      In this patch, call sites where the page reference is manipulated are
      converted to the newly introduced wrapper functions.  This is a
      preparation step for adding a tracepoint to each page reference
      manipulation function.  With this facility, we can easily find the reason
      for a CMA allocation failure.  There is no functional change in this
      patch.
      
      In addition, this patch also converts reference read sites.  This will
      help a second step that renames page._count to something else and
      prevents later attempts to access it directly (suggested by Andrew).
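
The wrapper idea described above can be sketched in userspace C. This is a minimal model, not the kernel code: the struct, the `ref_trace_events` counter standing in for a tracepoint, and the use of C11 atomics are all assumptions made for illustration. The helper names follow the kernel's `page_ref_*()` naming, but the bodies here are hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for struct page; only the reference count is modeled. */
struct page_sketch {
    atomic_int refcount;
};

/*
 * Hypothetical counter standing in for a tracepoint. It is not thread-safe
 * and exists only to show that every modification passes one hook point.
 */
static int ref_trace_events;

/* Read site: callers no longer touch the counter field directly. */
static inline int page_ref_count(struct page_sketch *page)
{
    return atomic_load(&page->refcount);
}

static inline void page_ref_inc(struct page_sketch *page)
{
    atomic_fetch_add(&page->refcount, 1);
    ref_trace_events++;
}

/* Returns true when the count drops to zero (the caller held the last reference). */
static inline bool page_ref_dec_and_test(struct page_sketch *page)
{
    bool zero = atomic_fetch_sub(&page->refcount, 1) == 1;

    ref_trace_events++;
    return zero;
}
```

Because every increment and decrement goes through a wrapper, a later change can attach tracing in one place instead of at each call site, and the underlying field can be renamed without touching callers.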
      Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Acked-by: Michal Nazarewicz <mina86@mina86.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Sergey Senozhatsky <sergey.senozhatsky.work@gmail.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      fe896d18
    • V
      mm, memory hotplug: small cleanup in online_pages() · e888ca35
      Vlastimil Babka committed
      We can reuse the nid we've determined instead of repeated pfn_to_nid()
      usages.  Also zone_to_nid() should be a bit cheaper in general than
      pfn_to_nid().
      Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      e888ca35
    • V
      mm, compaction: introduce kcompactd · 698b1b30
      Vlastimil Babka committed
      Memory compaction can be currently performed in several contexts:
      
       - kswapd balancing a zone after a high-order allocation failure
       - direct compaction to satisfy a high-order allocation, including THP
         page fault attempts
       - khugepaged trying to collapse a hugepage
       - manually from /proc
      
      The purpose of compaction is two-fold.  The obvious purpose is to
      satisfy a (pending or future) high-order allocation, and is easy to
      evaluate.  The other purpose is to keep overall memory fragmentation low
      and help the anti-fragmentation mechanism.  The success wrt the latter
      purpose is more difficult to evaluate.
      
      The current situation wrt the purposes has a few drawbacks:
      
       - compaction is invoked only when a high-order page or hugepage is not
         available (or manually).  This might be too late for the purposes of
         keeping memory fragmentation low.
       - direct compaction increases latency of allocations.  Again, it would
         be better if compaction was performed asynchronously to keep
         fragmentation low, before the allocation itself comes.
       - (a special case of the previous) the cost of compaction during THP
         page faults can easily offset the benefits of THP.
       - kswapd compaction appears to be complex, fragile and not working in
         some scenarios.  It could also end up compacting for a high-order
         allocation request when it should be reclaiming memory for a later
         order-0 request.
      
      To improve the situation, we should be able to benefit from an
      equivalent of kswapd, but for compaction - i.e. a background thread
      which responds to fragmentation and the need for high-order allocations
      (including hugepages) somewhat proactively.
      
      One possibility is to extend the responsibilities of kswapd, which could
      however complicate its design too much.  It should be better to let
      kswapd handle reclaim, as order-0 allocations are often more critical
      than high-order ones.
      
      Another possibility is to extend khugepaged, but this kthread is a
      single instance and tied to THP configs.
      
      This patch goes with the option of a new set of per-node kthreads called
      kcompactd, and lays the foundations, without introducing any new
      tunables.  The lifecycle mimics kswapd kthreads, including the memory
      hotplug hooks.
      
      For compaction, kcompactd uses the standard compaction_suitable() and
      compact_finished() criteria and the deferred compaction functionality.
      Unlike direct compaction, it uses only sync compaction, as there's no
      allocation latency to minimize.
      
      This patch doesn't yet add a call to wakeup_kcompactd.  The kswapd
      compact/reclaim loop for high-order pages will be replaced by waking up
      kcompactd in the next patch with the description of what's wrong with
      the old approach.
      
      Waking up of the kcompactd threads is also tied to kswapd activity and
      follows these rules:
       - we don't want to affect any fastpaths, so wake up kcompactd only from
         the slowpath, as it's done for kswapd
       - if kswapd is doing reclaim, it's more important than compaction, so
         don't invoke kcompactd until kswapd goes to sleep
       - the target order used for kswapd is passed to kcompactd
      
      Possible future uses for kcompactd include the ability to wake up
      kcompactd on demand in special situations, such as when hugepages are
      not available (currently not done due to __GFP_NO_KSWAPD) or when a
      fragmentation event (i.e.  __rmqueue_fallback()) occurs.  It's also
      possible to perform periodic compaction with kcompactd.
      
      [arnd@arndb.de: fix build errors with kcompactd]
      [paul.gortmaker@windriver.com: don't use modular references for non modular code]
      Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: Arnd Bergmann <arnd@arndb.de>
      Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
      Cc: Hugh Dickins <hughd@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      698b1b30
  8. 16 Mar, 2016 2 commits
    • J
      mm/compaction: speed up pageblock_pfn_to_page() when zone is contiguous · 7cf91a98
      Joonsoo Kim committed
      There is a report of a performance drop caused by hugepage allocation,
      in which half of the CPU time is spent in pageblock_pfn_to_page() during
      compaction [1].
      
      In that workload, compaction is triggered to make hugepages, but most
      pageblocks are unavailable for compaction due to pageblock type and the
      skip bit, so compaction usually fails.  The most costly operation in this
      case is finding a valid pageblock while scanning the whole zone range.
      To check whether a pageblock is valid to compact, a valid pfn within the
      pageblock is required, and we can obtain it by calling
      pageblock_pfn_to_page().  This function checks whether the pageblock is
      in a single zone and returns a valid pfn if possible.  The problem is
      that we need to check this every time before scanning a pageblock, even
      when we revisit it, and this turns out to be very expensive in this
      workload.
      
      Although we have no way to skip this pageblock check on systems where
      holes exist at arbitrary positions, we can use a cached value for zone
      contiguity and just do pfn_to_page() on systems where no hole exists.
      This optimization considerably speeds up the above workload.
      
      Before vs After
        Max: 1096 MB/s vs 1325 MB/s
        Min: 635 MB/s vs 1015 MB/s
        Avg: 899 MB/s vs 1194 MB/s
      
      Avg is improved by roughly 30% [2].
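
The cached-contiguity idea described above can be sketched as follows. This is a simplified userspace model under stated assumptions: `struct zone_sketch`, the `pfn_valid` array, the `slow_checks` counter and the 512-page pageblock size are hypothetical, and the helpers return a boolean rather than a struct page pointer as the kernel function does.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define PAGEBLOCK_NR_PAGES 512UL    /* assumed pageblock size, in pages */

/* Hypothetical zone model: a pfn range plus a per-pfn validity array. */
struct zone_sketch {
    unsigned long start_pfn;
    unsigned long end_pfn;          /* exclusive */
    const bool *pfn_valid;          /* indexed by pfn - start_pfn */
    bool contiguous;                /* cached: no holes anywhere in the zone */
    unsigned long slow_checks;      /* how many times the full check ran */
};

/* Full check: the block lies inside the zone and its first and last pfn are valid. */
static bool block_ok_slow(struct zone_sketch *z, unsigned long start,
                          unsigned long end)
{
    z->slow_checks++;
    if (start < z->start_pfn || end > z->end_pfn)
        return false;
    return z->pfn_valid[start - z->start_pfn] &&
           z->pfn_valid[end - 1 - z->start_pfn];
}

/* Scan every pageblock once and cache whether the zone has no holes. */
static void set_zone_contiguous(struct zone_sketch *z)
{
    z->contiguous = false;
    for (unsigned long pfn = z->start_pfn; pfn < z->end_pfn;
         pfn += PAGEBLOCK_NR_PAGES) {
        unsigned long end = pfn + PAGEBLOCK_NR_PAGES;

        if (end > z->end_pfn)
            end = z->end_pfn;
        if (!block_ok_slow(z, pfn, end))
            return;
    }
    z->contiguous = true;
}

/* Fast path: skip the per-block check when the cached flag says "no holes". */
static bool pageblock_usable(struct zone_sketch *z, unsigned long start,
                             unsigned long end)
{
    if (z->contiguous)
        return true;    /* assumes the caller stays inside the zone */
    return block_ok_slow(z, start, end);
}
```

The flag has to be recomputed (or cleared) whenever the zone's layout changes, which is why the bracketed note in the commit message mentions restoring `zone->contiguous` on an error path.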
      
      [1]: http://www.spinics.net/lists/linux-mm/msg97378.html
      [2]: https://lkml.org/lkml/2015/12/9/23
      
      [akpm@linux-foundation.org: don't forget to restore zone->contiguous on error path, per Vlastimil]
      Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Reported-by: Aaron Lu <aaron.lu@intel.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Tested-by: Aaron Lu <aaron.lu@intel.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      7cf91a98
    • V
      memory-hotplug: add automatic onlining policy for the newly added memory · 31bc3858
      Vitaly Kuznetsov committed
      Currently, all newly added memory blocks remain in the 'offline' state
      unless someone onlines them.  Some Linux distributions carry special udev
      rules like:
      
        SUBSYSTEM=="memory", ACTION=="add", ATTR{state}=="offline", ATTR{state}="online"
      
      to make this happen automatically.  This is not a great solution for
      virtual machines where memory hotplug is used to address high memory
      pressure situations, as such onlining is slow and a userspace process
      doing this (udev) has a chance of being killed by the OOM killer, since
      it will probably need to allocate some memory.
      
      Introduce default policy for the newly added memory blocks in
      /sys/devices/system/memory/auto_online_blocks file with two possible
      values: "offline" which preserves the current behavior and "online"
      which causes all newly added memory blocks to go online as soon as
      they're added.  The default is "offline".
      Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
      Reviewed-by: Daniel Kiper <daniel.kiper@oracle.com>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Daniel Kiper <daniel.kiper@oracle.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Tang Chen <tangchen@cn.fujitsu.com>
      Cc: David Vrabel <david.vrabel@citrix.com>
      Acked-by: David Rientjes <rientjes@google.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Xishi Qiu <qiuxishi@huawei.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: "K. Y. Srinivasan" <kys@microsoft.com>
      Cc: Igor Mammedov <imammedo@redhat.com>
      Cc: Kay Sievers <kay@vrfy.org>
      Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      31bc3858
  9. 30 Jan, 2016 1 commit
    • T
      xen, mm: Set IORESOURCE_SYSTEM_RAM to System RAM · 782b8664
      Toshi Kani committed
      Set IORESOURCE_SYSTEM_RAM in struct resource.flags of "System
      RAM" entries.
      Signed-off-by: Toshi Kani <toshi.kani@hpe.com>
      Signed-off-by: Borislav Petkov <bp@suse.de>
      Acked-by: David Vrabel <david.vrabel@citrix.com> # xen
      Cc: Andrew Banman <abanman@sgi.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: Gu Zheng <guz.fnst@cn.fujitsu.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Luis R. Rodriguez <mcgrof@suse.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Tang Chen <tangchen@cn.fujitsu.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Toshi Kani <toshi.kani@hp.com>
      Cc: linux-arch@vger.kernel.org
      Cc: linux-mm <linux-mm@kvack.org>
      Cc: xen-devel@lists.xenproject.org
      Link: http://lkml.kernel.org/r/1453841853-11383-9-git-send-email-bp@alien8.de
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      782b8664
  10. 16 Jan, 2016 1 commit
  11. 15 Jan, 2016 1 commit
  12. 30 Dec, 2015 1 commit
    • A
      mm/memory_hotplug.c: check for missing sections in test_pages_in_a_zone() · 5f0f2887
      Andrew Banman committed
      test_pages_in_a_zone() does not account for the possibility of missing
      sections in the given pfn range.  pfn_valid_within always returns 1 when
      CONFIG_HOLES_IN_ZONE is not set, allowing invalid pfns from missing
      sections to pass the test, leading to a kernel oops.
      
      Wrap an additional pfn loop with PAGES_PER_SECTION granularity to check
      for missing sections before proceeding into the zone-check code.
      
      This also prevents a crash from offlining memory devices with missing
      sections.  Despite this, it may be a good idea to keep the related patch
      '[PATCH 3/3] drivers: memory: prohibit offlining of memory blocks with
      missing sections' because missing sections in a memory block may lead to
      other problems not covered by the scope of this fix.
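
The section-granularity loop described above might look roughly like the sketch below. It is a userspace model with assumed values: the `present` table, the function name and the section size of 32768 pages are hypothetical. The text does not say whether a missing section is skipped or rejected; this sketch skips it so that no pfn inside a missing section is ever examined.

```c
#include <assert.h>
#include <stdbool.h>

#define PAGES_PER_SECTION 32768UL   /* assumed: 128 MiB sections, 4 KiB pages */

/*
 * Count the sections overlapping [start_pfn, end_pfn) that would reach the
 * per-page zone check. present[n] says whether section n exists.
 */
static unsigned long sections_to_check(const bool *present,
                                       unsigned long nr_sections,
                                       unsigned long start_pfn,
                                       unsigned long end_pfn)
{
    unsigned long checked = 0;
    /* Align down so a range starting mid-section still covers that section. */
    unsigned long pfn = start_pfn - (start_pfn % PAGES_PER_SECTION);

    for (; pfn < end_pfn; pfn += PAGES_PER_SECTION) {
        unsigned long nr = pfn / PAGES_PER_SECTION;

        if (nr >= nr_sections || !present[nr])
            continue;   /* never touch the pfns of a missing section */
        checked++;      /* the zone check for this section would run here */
    }
    return checked;
}
```

With sections 0 and 2 present and section 1 missing, a range covering all three would run the zone check for two sections only.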
      Signed-off-by: Andrew Banman <abanman@sgi.com>
      Acked-by: Alex Thorlton <athorlton@sgi.com>
      Cc: Russ Anderson <rja@sgi.com>
      Cc: Alex Thorlton <athorlton@sgi.com>
      Cc: Yinghai Lu <yinghai@kernel.org>
      Cc: Greg KH <greg@kroah.com>
      Cc: Seth Jennings <sjennings@variantweb.net>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      5f0f2887
  13. 06 Nov, 2015 1 commit
  14. 23 Oct, 2015 1 commit
  15. 05 Sep, 2015 1 commit
    • T
      memory-hotplug: add hot-added memory ranges to memblock before allocate node_data for a node. · 7f36e3e5
      Tang Chen committed
      Commit f9126ab9 ("memory-hotplug: fix wrong edge when hot add a new
      node") hot-added memory range to memblock, after creating pgdat for new
      node.
      
      But there is a problem:
      
        add_memory()
        |--> hotadd_new_pgdat()
             |--> free_area_init_node()
                  |--> get_pfn_range_for_nid()
                       |--> find start_pfn and end_pfn in memblock
        |--> ......
        |--> memblock_add_node(start, size, nid)    --------    Here, just too late.
      
      get_pfn_range_for_nid() will find that start_pfn and end_pfn are both 0.
      As a result, when adding memory, dmesg will give the following wrong
      message.
      
        Initmem setup node 5 [mem 0x0000000000000000-0xffffffffffffffff]
        On node 5 totalpages: 0
        Built 5 zonelists in Node order, mobility grouping on.  Total pages: 32588823
        Policy zone: Normal
        init_memory_mapping: [mem 0x60000000000-0x607ffffffff]
      
      The solution is simple, just add the memory range to memblock a little
      earlier, before hotadd_new_pgdat().
      
      [akpm@linux-foundation.org: coding-style fixes]
      Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com>
      Cc: Xishi Qiu <qiuxishi@huawei.com>
      Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
      Cc: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Taku Izumi <izumi.taku@jp.fujitsu.com>
      Cc: Gu Zheng <guz.fnst@cn.fujitsu.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: David Rientjes <rientjes@google.com>
      Cc: <stable@vger.kernel.org>	[4.2.x]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      7f36e3e5
  16. 28 Aug, 2015 1 commit
    • D
      mm: ZONE_DEVICE for "device memory" · 033fbae9
      Dan Williams committed
      While pmem is usable as a block device or via DAX mappings to userspace
      there are several usage scenarios that can not target pmem due to its
      lack of struct page coverage. In preparation for "hot plugging" pmem
      into the vmemmap add ZONE_DEVICE as a new zone to tag these pages
      separately from the ones that are subject to standard page allocations.
      Importantly "device memory" can be removed at will by userspace
      unbinding the driver of the device.
      
      Having a separate zone prevents allocation and otherwise marks these
      pages that are distinct from typical uniform memory.  Device memory has
      different lifetime and performance characteristics than RAM.  However,
      since we have run out of ZONES_SHIFT bits this functionality currently
      depends on sacrificing ZONE_DMA.
      
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Jerome Glisse <j.glisse@gmail.com>
      [hch: various simplifications in the arch interface]
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Dan Williams <dan.j.williams@intel.com>
      033fbae9
  17. 15 Aug, 2015 1 commit
  18. 07 Aug, 2015 1 commit
  19. 25 Jun, 2015 1 commit
    • Z
      mm/memory hotplug: print the last vmemmap region at the end of hot add memory · c435a390
      Zhu Guihua committed
      When hot-adding two nodes in succession, we found that the vmemmap region
      info is a bit messed up.  The last region of node 2 is printed when node 3
      is hot-added, like the following:
      
        Initmem setup node 2 [mem 0x0000000000000000-0xffffffffffffffff]
         On node 2 totalpages: 0
         Built 2 zonelists in Node order, mobility grouping on.  Total pages: 16090539
         Policy zone: Normal
         init_memory_mapping: [mem 0x40000000000-0x407ffffffff]
          [mem 0x40000000000-0x407ffffffff] page 1G
          [ffffea1000000000-ffffea10001fffff] PMD -> [ffff8a077d800000-ffff8a077d9fffff] on node 2
          [ffffea1000200000-ffffea10003fffff] PMD -> [ffff8a077de00000-ffff8a077dffffff] on node 2
        ...
          [ffffea101f600000-ffffea101f9fffff] PMD -> [ffff8a074ac00000-ffff8a074affffff] on node 2
          [ffffea101fa00000-ffffea101fdfffff] PMD -> [ffff8a074a800000-ffff8a074abfffff] on node 2
        Initmem setup node 3 [mem 0x0000000000000000-0xffffffffffffffff]
         On node 3 totalpages: 0
         Built 3 zonelists in Node order, mobility grouping on.  Total pages: 16090539
         Policy zone: Normal
         init_memory_mapping: [mem 0x60000000000-0x607ffffffff]
          [mem 0x60000000000-0x607ffffffff] page 1G
          [ffffea101fe00000-ffffea101fffffff] PMD -> [ffff8a074a400000-ffff8a074a5fffff] on node 2 <=== node 2 ???
          [ffffea1800000000-ffffea18001fffff] PMD -> [ffff8a074a600000-ffff8a074a7fffff] on node 3
          [ffffea1800200000-ffffea18005fffff] PMD -> [ffff8a074a000000-ffff8a074a3fffff] on node 3
          [ffffea1800600000-ffffea18009fffff] PMD -> [ffff8a0749c00000-ffff8a0749ffffff] on node 3
        ...
      
      The cause is that the last region was missed at the end of memory
      hot-add, and p_start, p_end and node_start were not reset, so when memory
      is hot-added to a new node, the code considers the blocks not contiguous
      and prints the previous one.  So we print the last vmemmap region at the
      end of memory hot-add to avoid the confusion.
      Signed-off-by: Zhu Guihua <zhugh.fnst@cn.fujitsu.com>
      Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      c435a390
  20. 11 Jun, 2015 1 commit
    • G
      mm/memory_hotplug.c: set zone->wait_table to null after freeing it · 85bd8399
      Gu Zheng committed
      Izumi found the following oops when hot re-adding a node:
      
          BUG: unable to handle kernel paging request at ffffc90008963690
          IP: __wake_up_bit+0x20/0x70
          Oops: 0000 [#1] SMP
          CPU: 68 PID: 1237 Comm: rs:main Q:Reg Not tainted 4.1.0-rc5 #80
          Hardware name: FUJITSU PRIMEQUEST2800E/SB, BIOS PRIMEQUEST 2000 Series BIOS Version 1.87 04/28/2015
          task: ffff880838df8000 ti: ffff880017b94000 task.ti: ffff880017b94000
          RIP: 0010:[<ffffffff810dff80>]  [<ffffffff810dff80>] __wake_up_bit+0x20/0x70
          RSP: 0018:ffff880017b97be8  EFLAGS: 00010246
          RAX: ffffc90008963690 RBX: 00000000003c0000 RCX: 000000000000a4c9
          RDX: 0000000000000000 RSI: ffffea101bffd500 RDI: ffffc90008963648
          RBP: ffff880017b97c08 R08: 0000000002000020 R09: 0000000000000000
          R10: 0000000000000000 R11: 0000000000000000 R12: ffff8a0797c73800
          R13: ffffea101bffd500 R14: 0000000000000001 R15: 00000000003c0000
          FS:  00007fcc7ffff700(0000) GS:ffff880874800000(0000) knlGS:0000000000000000
          CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
          CR2: ffffc90008963690 CR3: 0000000836761000 CR4: 00000000001407e0
          Call Trace:
            unlock_page+0x6d/0x70
            generic_write_end+0x53/0xb0
            xfs_vm_write_end+0x29/0x80 [xfs]
            generic_perform_write+0x10a/0x1e0
            xfs_file_buffered_aio_write+0x14d/0x3e0 [xfs]
            xfs_file_write_iter+0x79/0x120 [xfs]
            __vfs_write+0xd4/0x110
            vfs_write+0xac/0x1c0
            SyS_write+0x58/0xd0
            system_call_fastpath+0x12/0x76
          Code: 5d c3 66 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 55 48 89 e5 48 83 ec 20 65 48 8b 04 25 28 00 00 00 48 89 45 f8 31 c0 48 8d 47 48 <48> 39 47 48 48 c7 45 e8 00 00 00 00 48 c7 45 f0 00 00 00 00 48
          RIP  [<ffffffff810dff80>] __wake_up_bit+0x20/0x70
           RSP <ffff880017b97be8>
          CR2: ffffc90008963690
      
      Reproduce method (re-add a node):
        Hot-add nodeA --> remove nodeA --> hot-add nodeA (panic)
      
      This looks like a use-after-free problem, and the root cause is that
      zone->wait_table was not set to *NULL* after being freed in
      try_offline_node.
      
      When hot re-adding a node, we reuse its pgdat, and so also the zone
      structs.  When adding pages to the target zone, the zone is initialized
      first (including the wait_table) if it is not already initialized.
      Whether a zone is initialized is judged by zone->wait_table:
      
      	static inline bool zone_is_initialized(struct zone *zone)
      	{
      		return !!zone->wait_table;
      	}
      
      so if we do not set zone->wait_table to *NULL* after freeing it, the
      memory hotplug routine will skip the initialization of the new zone when
      the node is hot re-added, and the wait_table still points to the freed
      memory.  We then access an invalid address when trying to wake up the
      waiters after the I/O operation on the page is done, as in the oops
      above.
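
The failure mode and the fix can be modeled in a few lines of userspace C. Everything here apart from the `zone_is_initialized()` test quoted above is a hypothetical stand-in: `struct zone_model`, `zone_init_if_needed()`, `zone_teardown()` and the `init_count` field exist only for this sketch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical zone model: the wait table pointer doubles as the "initialized" flag. */
struct zone_model {
    int *wait_table;
    unsigned long init_count;   /* how many times initialization really ran */
};

static bool zone_is_initialized(const struct zone_model *zone)
{
    return zone->wait_table != NULL;
}

/* Hot-add path: initialize only if the zone does not look initialized yet. */
static int zone_init_if_needed(struct zone_model *zone, size_t entries)
{
    if (zone_is_initialized(zone))
        return 0;   /* trusts the pointer; a stale one would be reused here */

    zone->wait_table = calloc(entries, sizeof(*zone->wait_table));
    if (!zone->wait_table)
        return -1;
    zone->init_count++;
    return 0;
}

/* Offline path: free the table and reset the pointer. */
static void zone_teardown(struct zone_model *zone)
{
    free(zone->wait_table);
    zone->wait_table = NULL;    /* the fix: without this, re-add skips the init */
}
```

If the assignment to NULL in `zone_teardown()` is removed, a second `zone_init_if_needed()` call returns early and leaves `wait_table` pointing at freed memory, which corresponds to the re-add sequence in the reproduce method.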
      Signed-off-by: Gu Zheng <guz.fnst@cn.fujitsu.com>
      Reported-by: Taku Izumi <izumi.taku@jp.fujitsu.com>
      Reviewed-by: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Tang Chen <tangchen@cn.fujitsu.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      85bd8399
  21. 16 Apr, 2015 1 commit
  22. 15 Apr, 2015 2 commits
    • D
      mm, hotplug: fix concurrent memory hot-add deadlock · 30467e0b
      David Rientjes committed
      There's a deadlock when concurrently hot-adding memory through the probe
      interface and switching a memory block from offline to online.
      
      When hot-adding memory via the probe interface, add_memory() first takes
      mem_hotplug_begin() and then device_lock() is later taken when registering
      the newly initialized memory block.  This creates a lock dependency of (1)
      mem_hotplug.lock (2) dev->mutex.
      
      When switching a memory block from offline to online, dev->mutex is first
      grabbed in device_online() when the write(2) transitions an existing
      memory block from offline to online, and then online_pages() will take
      mem_hotplug_begin().
      
      This creates a lock inversion between mem_hotplug.lock and dev->mutex.
      Vitaly reports that this deadlock can happen when kworker handling a probe
      event races with systemd-udevd switching a memory block's state.
      
      This patch requires the state transition to take mem_hotplug_begin()
      before dev->mutex.  Hot-adding memory via the probe interface creates a
      memory block while holding mem_hotplug_begin(), there is no way to take
      dev->mutex first in this case.
      
      online_pages() and offline_pages() are only called when transitioning
      memory block state.  We now require that mem_hotplug_begin() is taken
      before calling them -- this requires exporting the mem_hotplug_begin() and
      mem_hotplug_done() to generic code.  In all hot-add and hot-remove cases,
      mem_hotplug_begin() is done prior to device_online().  This is all that is
      needed to avoid the deadlock.
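
The ordering rule described above (mem_hotplug lock first, then dev->mutex) can be illustrated with a small single-threaded order checker. This is only a model: it does not take real locks, and the rank values, `struct lock_tracker` and the three path functions are hypothetical names chosen for the sketch.

```c
#include <assert.h>

/* A lower rank must be taken first. The ranks label the two locks in the text. */
enum lock_rank { RANK_MEM_HOTPLUG = 1, RANK_DEV_MUTEX = 2 };

struct lock_tracker {
    unsigned held;          /* bitmask of ranks currently held */
    unsigned violations;    /* acquisitions that broke the ordering rule */
};

static void track_acquire(struct lock_tracker *t, enum lock_rank rank)
{
    /* Any already-held lock with a higher rank means the order is inverted. */
    if (t->held >> (rank + 1))
        t->violations++;
    t->held |= 1u << rank;
}

static void track_release(struct lock_tracker *t, enum lock_rank rank)
{
    t->held &= ~(1u << rank);
}

/* Hot-add via the probe interface: mem_hotplug lock, then the device mutex. */
static void hot_add_path(struct lock_tracker *t)
{
    track_acquire(t, RANK_MEM_HOTPLUG);
    track_acquire(t, RANK_DEV_MUTEX);
    track_release(t, RANK_DEV_MUTEX);
    track_release(t, RANK_MEM_HOTPLUG);
}

/* Onlining after the fix: same order as the hot-add path. */
static void online_path_fixed(struct lock_tracker *t)
{
    track_acquire(t, RANK_MEM_HOTPLUG);
    track_acquire(t, RANK_DEV_MUTEX);
    track_release(t, RANK_DEV_MUTEX);
    track_release(t, RANK_MEM_HOTPLUG);
}

/* Onlining before the fix: device mutex first, then mem_hotplug (inverted). */
static void online_path_before_fix(struct lock_tracker *t)
{
    track_acquire(t, RANK_DEV_MUTEX);
    track_acquire(t, RANK_MEM_HOTPLUG);
    track_release(t, RANK_MEM_HOTPLUG);
    track_release(t, RANK_DEV_MUTEX);
}
```

Running the hot-add path and the fixed online path records no violation, while the pre-fix online path records one, which is the inversion that lets two threads each hold one lock and wait for the other.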
      Signed-off-by: David Rientjes <rientjes@google.com>
      Reported-by: Vitaly Kuznetsov <vkuznets@redhat.com>
      Tested-by: Vitaly Kuznetsov <vkuznets@redhat.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
      Cc: "K. Y. Srinivasan" <kys@microsoft.com>
      Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
      Cc: Tang Chen <tangchen@cn.fujitsu.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Zhang Zhen <zhenzhang.zhang@huawei.com>
      Cc: Vladimir Davydov <vdavydov@parallels.com>
      Cc: Wang Nan <wangnan0@huawei.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      30467e0b
    • S
      memory hotplug: use macro to switch between section and pfn · 19c07d5e
      Sheng Yong committed
      Use macro section_nr_to_pfn() to switch between section and pfn, instead
      of open-coding it.  No semantic changes.
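
A minimal sketch of the conversion follows. The shift values are assumptions for illustration (4 KiB pages and 128 MiB sections); real values depend on the architecture and configuration. The two helper functions are hypothetical and only contrast the open-coded form with the macro form.

```c
#include <assert.h>

/* Assumed values for illustration: 4 KiB pages, 128 MiB sections. */
#define PAGE_SHIFT        12
#define SECTION_SIZE_BITS 27
#define PFN_SECTION_SHIFT (SECTION_SIZE_BITS - PAGE_SHIFT)

#define pfn_to_section_nr(pfn) ((pfn) >> PFN_SECTION_SHIFT)
#define section_nr_to_pfn(sec) ((sec) << PFN_SECTION_SHIFT)

/* Open-coded form: the shift is spelled out at the call site. */
static unsigned long first_pfn_open_coded(unsigned long section_nr)
{
    return section_nr << PFN_SECTION_SHIFT;
}

/* Macro form: same arithmetic, but the intent is named. */
static unsigned long first_pfn_macro(unsigned long section_nr)
{
    return section_nr_to_pfn(section_nr);
}
```

Both helpers compute the same value, which is consistent with the commit's statement that the change has no semantic effect.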
      Signed-off-by: Sheng Yong <shengyong1@huawei.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      19c07d5e
  23. 26 Mar, 2015 1 commit
    • G
      mm/memory hotplug: postpone the reset of obsolete pgdat · b0dc3a34
      Gu Zheng committed
      Qiu Xishi reported the following BUG when testing hot-add/hot-remove node under
      stress condition:
      
        BUG: unable to handle kernel paging request at 0000000000025f60
        IP: next_online_pgdat+0x1/0x50
        PGD 0
        Oops: 0000 [#1] SMP
        ACPI: Device does not support D3cold
        Modules linked in: fuse nls_iso8859_1 nls_cp437 vfat fat loop dm_mod coretemp mperf crc32c_intel ghash_clmulni_intel aesni_intel ablk_helper cryptd lrw gf128mul glue_helper aes_x86_64 pcspkr microcode igb dca i2c_algo_bit ipv6 megaraid_sas iTCO_wdt i2c_i801 i2c_core iTCO_vendor_support tg3 sg hwmon ptp lpc_ich pps_core mfd_core acpi_pad rtc_cmos button ext3 jbd mbcache sd_mod crc_t10dif scsi_dh_alua scsi_dh_rdac scsi_dh_hp_sw scsi_dh_emc scsi_dh ahci libahci libata scsi_mod [last unloaded: rasf]
        CPU: 23 PID: 238 Comm: kworker/23:1 Tainted: G           O 3.10.15-5885-euler0302 #1
        Hardware name: HUAWEI TECHNOLOGIES CO.,LTD. Huawei N1/Huawei N1, BIOS V100R001 03/02/2015
        Workqueue: events vmstat_update
        task: ffffa800d32c0000 ti: ffffa800d32ae000 task.ti: ffffa800d32ae000
        RIP: 0010: next_online_pgdat+0x1/0x50
        RSP: 0018:ffffa800d32afce8  EFLAGS: 00010286
        RAX: 0000000000001440 RBX: ffffffff81da53b8 RCX: 0000000000000082
        RDX: 0000000000000000 RSI: 0000000000000082 RDI: 0000000000000000
        RBP: ffffa800d32afd28 R08: ffffffff81c93bfc R09: ffffffff81cbdc96
        R10: 00000000000040ec R11: 00000000000000a0 R12: ffffa800fffb3440
        R13: ffffa800d32afd38 R14: 0000000000000017 R15: ffffa800e6616800
        FS:  0000000000000000(0000) GS:ffffa800e6600000(0000) knlGS:0000000000000000
        CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
        CR2: 0000000000025f60 CR3: 0000000001a0b000 CR4: 00000000001407e0
        DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
        DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
        Call Trace:
          refresh_cpu_vm_stats+0xd0/0x140
          vmstat_update+0x11/0x50
          process_one_work+0x194/0x3d0
          worker_thread+0x12b/0x410
          kthread+0xc6/0xd0
          ret_from_fork+0x7c/0xb0
      
      The cause is the "memset(pgdat, 0, sizeof(*pgdat))" at the end of
      try_offline_node(), which zeroes the entire pgdat.  Because the pgdat is
      accessed lock-free, users that are still using it, such as the
      vmstat_update routine, will then panic.
      
      process A:				offline node XX:
      
      vmstat_update()
         refresh_cpu_vm_stats()
           for_each_populated_zone()
             find online node XX
           cond_resched()
      					offline cpu and memory, then try_offline_node()
      					node_set_offline(nid), and memset(pgdat, 0, sizeof(*pgdat))
             zone = next_zone(zone)
               pg_data_t *pgdat = zone->zone_pgdat;  // here pgdat is NULL now
                 next_online_pgdat(pgdat)
                   next_online_node(pgdat->node_id);  // NULL pointer access
      
      So the solution is to postpone the reset of the obsolete pgdat from
      try_offline_node() to hotadd_new_pgdat(), and to reset only
      pgdat->nr_zones and pgdat->classzone_idx to 0 instead of zeroing the whole
      structure with memset, which avoids breaking pointer information in pgdat.
      Signed-off-by: Gu Zheng <guz.fnst@cn.fujitsu.com>
      Reported-by: Xishi Qiu <qiuxishi@huawei.com>
      Suggested-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
      Cc: Taku Izumi <izumi.taku@jp.fujitsu.com>
      Cc: Tang Chen <tangchen@cn.fujitsu.com>
      Cc: Xie XiuQi <xiexiuqi@huawei.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      b0dc3a34
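      The difference between the two reset strategies can be illustrated with
      a simplified userspace model.  This is only a sketch: struct mock_pgdat
      and both helper functions are hypothetical stand-ins, not the kernel's
      pg_data_t or the actual patch.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical, heavily reduced stand-in for pg_data_t. */
struct mock_pgdat {
	int node_id;
	int nr_zones;
	int classzone_idx;
	/* Stands in for pointer data that lock-free readers may still follow. */
	struct mock_pgdat *self;
};

/* Old behaviour: wipe the whole structure, including pointer data. */
static void reset_with_memset(struct mock_pgdat *pgdat)
{
	memset(pgdat, 0, sizeof(*pgdat));
}

/* Behaviour described in the commit: reset only the two counters. */
static void reset_fields_only(struct mock_pgdat *pgdat)
{
	pgdat->nr_zones = 0;
	pgdat->classzone_idx = 0;
}
```

      With the field-only reset, a reader that raced with node offlining
      still sees a valid node_id and valid pointers, whereas after the
      memset every field it dereferences is zero.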
  24. 11 Dec, 2014 2 commits
    • V
      mm, memory_hotplug/failure: drain single zone pcplists · c0554329
      Committed by Vlastimil Babka
      Memory hotplug and failure mechanisms have several places where pcplists
      are drained so that pages are returned to the buddy allocator and can be
      e.g. prepared for offlining.  Since this is always done in the context
      of a single zone, we can reduce the pcplists drain to that zone, which
      is now possible.
      
      The change should make memory offlining due to hotremove or failure
      faster and should no longer disturb unrelated pcplists.
      Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
      Cc: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
      Cc: Xishi Qiu <qiuxishi@huawei.com>
      Cc: Vladimir Davydov <vdavydov@parallels.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Michal Nazarewicz <mina86@mina86.com>
      Cc: Marek Szyprowski <m.szyprowski@samsung.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      c0554329
    • V
      mm: introduce single zone pcplists drain · 93481ff0
      Committed by Vlastimil Babka
      The functions for draining per-cpu pages back to buddy allocators
      currently always operate on all zones.  There are however several cases
      where the drain is only needed in the context of a single zone, and
      spilling other pcplists is a waste of time both due to the extra
      spilling and later refilling.
      
      This patch introduces a new zone pointer parameter to drain_all_pages()
      and changes the dummy parameter of drain_local_pages() to be also a zone
      pointer.  When NULL is passed, the functions operate on all zones as
      usual.  Passing a specific zone pointer reduces the work to the single
      zone.
      
      All callers are updated to pass the NULL pointer in this patch.
      Conversion to single zone (where appropriate) is done in further
      patches.
      Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
      Cc: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
      Cc: Xishi Qiu <qiuxishi@huawei.com>
      Cc: Vladimir Davydov <vdavydov@parallels.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Michal Nazarewicz <mina86@mina86.com>
      Cc: Marek Szyprowski <m.szyprowski@samsung.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      93481ff0
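      The calling convention described above (NULL means all zones, a
      specific pointer means just that zone) can be sketched with a small
      userspace model.  The struct, the array and the function names here are
      hypothetical; the real drain_all_pages() works on per-cpu lists and is
      considerably more involved.

```c
#include <assert.h>
#include <stddef.h>

#define MOCK_NR_ZONES 3

/* Hypothetical stand-in for a zone with a per-cpu page cache. */
struct mock_zone {
	int pcp_count;   /* pages sitting on the per-cpu list */
	int free_count;  /* pages held by the buddy allocator */
};

static struct mock_zone mock_zones[MOCK_NR_ZONES];

/* Return every cached page of one zone to its free pool. */
static void mock_drain_zone(struct mock_zone *zone)
{
	zone->free_count += zone->pcp_count;
	zone->pcp_count = 0;
}

/*
 * Model of the new interface: a NULL zone drains every zone, a non-NULL
 * zone restricts the work to that single zone.
 */
static void mock_drain_all_pages(struct mock_zone *zone)
{
	size_t i;

	if (zone) {
		mock_drain_zone(zone);
		return;
	}
	for (i = 0; i < MOCK_NR_ZONES; i++)
		mock_drain_zone(&mock_zones[i]);
}
```

      Keeping NULL as the "all zones" value lets existing callers keep their
      behaviour while individual call sites are converted one by one.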
  25. 14 Nov, 2014 2 commits
  26. 30 Oct, 2014 1 commit
    • Y
      memory-hotplug: clear pgdat which is allocated by bootmem in try_offline_node() · 35dca71c
      Committed by Yasuaki Ishimatsu
      When hot adding the same memory after hot removal, the following
      messages are shown:
      
        WARNING: CPU: 20 PID: 6 at mm/page_alloc.c:4968 free_area_init_node+0x3fe/0x426()
        ...
        Call Trace:
          dump_stack+0x46/0x58
          warn_slowpath_common+0x81/0xa0
          warn_slowpath_null+0x1a/0x20
          free_area_init_node+0x3fe/0x426
          hotadd_new_pgdat+0x90/0x110
          add_memory+0xd4/0x200
          acpi_memory_device_add+0x1aa/0x289
          acpi_bus_attach+0xfd/0x204
          acpi_bus_attach+0x178/0x204
          acpi_bus_scan+0x6a/0x90
          acpi_device_hotplug+0xe8/0x418
          acpi_hotplug_work_fn+0x1f/0x2b
          process_one_work+0x14e/0x3f0
          worker_thread+0x11b/0x510
          kthread+0xe1/0x100
          ret_from_fork+0x7c/0xb0
      
      The detailed explanation is as follows:
      
      When hot removing memory, pgdat is set to 0 in try_offline_node().  But
      if the pgdat is allocated by bootmem allocator, the clearing step is
      skipped.
      
      And when hot adding the same memory, the uninitialized pgdat is reused.
      But free_area_init_node() checks whether pgdat is set to zero.  As a
      result, free_area_init_node() hits WARN_ON().
      
      This patch also clears a pgdat that was allocated by the bootmem
      allocator in try_offline_node().
      Signed-off-by: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
      Cc: Zhang Zhen <zhenzhang.zhang@huawei.com>
      Cc: Wang Nan <wangnan0@huawei.com>
      Cc: Tang Chen <tangchen@cn.fujitsu.com>
      Reviewed-by: Toshi Kani <toshi.kani@hp.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      35dca71c