1. 03 Jun 2017 (1 commit)
    • mm: consider memblock reservations for deferred memory initialization sizing · 864b9a39
      Committed by Michal Hocko
      We have seen an early OOM killer invocation on ppc64 systems with
      crashkernel=4096M:
      
      	kthreadd invoked oom-killer: gfp_mask=0x16040c0(GFP_KERNEL|__GFP_COMP|__GFP_NOTRACK), nodemask=7, order=0, oom_score_adj=0
      	kthreadd cpuset=/ mems_allowed=7
      	CPU: 0 PID: 2 Comm: kthreadd Not tainted 4.4.68-1.gd7fe927-default #1
      	Call Trace:
      	  dump_stack+0xb0/0xf0 (unreliable)
      	  dump_header+0xb0/0x258
      	  out_of_memory+0x5f0/0x640
      	  __alloc_pages_nodemask+0xa8c/0xc80
      	  kmem_getpages+0x84/0x1a0
      	  fallback_alloc+0x2a4/0x320
      	  kmem_cache_alloc_node+0xc0/0x2e0
      	  copy_process.isra.25+0x260/0x1b30
      	  _do_fork+0x94/0x470
      	  kernel_thread+0x48/0x60
      	  kthreadd+0x264/0x330
      	  ret_from_kernel_thread+0x5c/0xa4
      
      	Mem-Info:
      	active_anon:0 inactive_anon:0 isolated_anon:0
      	 active_file:0 inactive_file:0 isolated_file:0
      	 unevictable:0 dirty:0 writeback:0 unstable:0
      	 slab_reclaimable:5 slab_unreclaimable:73
      	 mapped:0 shmem:0 pagetables:0 bounce:0
      	 free:0 free_pcp:0 free_cma:0
      	Node 7 DMA free:0kB min:0kB low:0kB high:0kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:52428800kB managed:110016kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:320kB slab_unreclaimable:4672kB kernel_stack:1152kB pagetables:0kB unstable:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? yes
      	lowmem_reserve[]: 0 0 0 0
      	Node 7 DMA: 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB 0*8192kB 0*16384kB = 0kB
      	0 total pagecache pages
      	0 pages in swap cache
      	Swap cache stats: add 0, delete 0, find 0/0
      	Free swap  = 0kB
      	Total swap = 0kB
      	819200 pages RAM
      	0 pages HighMem/MovableOnly
      	817481 pages reserved
      	0 pages cma reserved
      	0 pages hwpoisoned
      
      The reason is that the managed memory is too low (only 110MB) while the
      rest of the 50GB is still waiting for deferred initialization to be
      done.  update_defer_init estimates the initial memory to initialize to
      at least 2GB, but it doesn't consider any memory allocated in that
      range.  In this particular case we've had
      
      	Reserving 4096MB of memory at 128MB for crashkernel (System RAM: 51200MB)
      
      so the low 2GB is mostly depleted.
      
      Fix this by considering memblock allocations in the initial static
      initialization estimation.  Move the max_initialise setup to
      reset_deferred_meminit and implement a simple memblock_reserved_memory
      helper which iterates all reserved blocks and sums the size of those
      that start below the given address.  The cumulative size is then added
      on top of the initial estimation.  This is still not ideal because
      reset_deferred_meminit doesn't consider holes, so a reservation might
      lie above the initial estimation, which we ignore, but let's keep the
      logic simple until we really need to handle more complicated cases.
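
      A minimal sketch of the kind of helper described above (the name,
      placement and exact cut-off handling are illustrative, not the literal
      patch):

      	#include <linux/memblock.h>

      	/* Sum the size of all memblock reservations that start below @limit,
      	 * so the deferred-init estimate can be grown by the amount of memory
      	 * already taken out of the low range (e.g. by crashkernel=). */
      	static phys_addr_t __init reserved_below(phys_addr_t limit)
      	{
      		phys_addr_t total = 0;
      		unsigned long i;

      		for (i = 0; i < memblock.reserved.cnt; i++) {
      			struct memblock_region *r = &memblock.reserved.regions[i];

      			if (r->base < limit)
      				total += r->size;
      		}
      		return total;
      	}

      reset_deferred_meminit can then add this sum on top of its static floor
      so that a large early reservation such as crashkernel= no longer eats
      the whole pre-initialized range.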
      
      Fixes: 3a80a7fa ("mm: meminit: initialise a subset of struct pages if CONFIG_DEFERRED_STRUCT_PAGE_INIT is set")
      Link: http://lkml.kernel.org/r/20170531104010.GI27783@dhcp22.suse.cz
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Mel Gorman <mgorman@suse.de>
      Tested-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
      Cc: <stable@vger.kernel.org>	[4.2+]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      864b9a39
  2. 06 Apr 2017 (2 commits)
  3. 10 Mar 2017 (1 commit)
  4. 25 Feb 2017 (3 commits)
  5. 23 Feb 2017 (4 commits)
  6. 12 Oct 2016 (1 commit)
  7. 08 Oct 2016 (1 commit)
  8. 05 Aug 2016 (2 commits)
  9. 29 Jul 2016 (3 commits)
  10. 27 Jul 2016 (1 commit)
  11. 21 May 2016 (2 commits)
  12. 18 Mar 2016 (1 commit)
  13. 16 Mar 2016 (1 commit)
  14. 06 Feb 2016 (1 commit)
  15. 15 Jan 2016 (3 commits)
  16. 10 Dec 2015 (1 commit)
  17. 06 Nov 2015 (1 commit)
  18. 09 Sep 2015 (6 commits)
    • mm/memblock.c: fix typos in comments · c1153931
      Committed by Alexander Kuleshov
      s/succees/success/
      Signed-off-by: Alexander Kuleshov <kuleshovmail@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      c1153931
    • mm/memblock.c: rename local variable of memblock_type to 'type' · 567d117b
      Committed by Alexander Kuleshov
      Since commit e3239ff9 ("memblock: Rename memblock_region to
      memblock_type and memblock_property to memblock_region"), local
      variables of type memblock_type have been named 'type'.  This commit
      renames the remaining local variables of type memblock_type the same
      way, for consistency.
      Signed-off-by: Alexander Kuleshov <kuleshovmail@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      567d117b
    • mem-hotplug: handle node hole when initializing numa_meminfo. · 95cf82ec
      Committed by Tang Chen
      When parsing SRAT, all memory ranges are added into numa_meminfo.  In
      numa_init(), before entering numa_cleanup_meminfo(), all possible memory
      ranges are in numa_meminfo.  And numa_cleanup_meminfo() removes all
      ranges over max_pfn or empty.
      
      But this only works if the nodes are contiguous.  Let's have a look at
      the following example:
      
      We have an SRAT like this:
      SRAT: Node 0 PXM 0 [mem 0x00000000-0x5fffffff]
      SRAT: Node 0 PXM 0 [mem 0x100000000-0x1ffffffffff]
      SRAT: Node 1 PXM 1 [mem 0x20000000000-0x3ffffffffff]
      SRAT: Node 4 PXM 2 [mem 0x40000000000-0x5ffffffffff] hotplug
      SRAT: Node 5 PXM 3 [mem 0x60000000000-0x7ffffffffff] hotplug
      SRAT: Node 2 PXM 4 [mem 0x80000000000-0x9ffffffffff] hotplug
      SRAT: Node 3 PXM 5 [mem 0xa0000000000-0xbffffffffff] hotplug
      SRAT: Node 6 PXM 6 [mem 0xc0000000000-0xdffffffffff] hotplug
      SRAT: Node 7 PXM 7 [mem 0xe0000000000-0xfffffffffff] hotplug
      
      On boot, only node 0,1,2,3 exist.
      
      And the numa_meminfo will look like this:
      numa_meminfo.nr_blks = 9
      1. on node 0: [0, 60000000]
      2. on node 0: [100000000, 20000000000]
      3. on node 1: [20000000000, 40000000000]
      4. on node 4: [40000000000, 60000000000]
      5. on node 5: [60000000000, 80000000000]
      6. on node 2: [80000000000, a0000000000]
      7. on node 3: [a0000000000, a0800000000]
      8. on node 6: [c0000000000, a0800000000]
      9. on node 7: [e0000000000, a0800000000]
      
      And numa_cleanup_meminfo() will merge 1 and 2, and remove 8 and 9 because
      the end address is over max_pfn, which is a0800000000.  But 4 and 5 are
      not removed because their end addresses are less than max_pfn.  Yet in
      fact, nodes 4 and 5 don't exist.
      
      In a word, numa_cleanup_meminfo() is not able to handle holes between nodes.
      
      Since memory ranges in nodes 4 and 5 are in numa_meminfo, in
      numa_register_memblks() nodes 4 and 5 will mistakenly be set online.
      
      If you run lscpu, it will show:
      NUMA node0 CPU(s):     0-14,128-142
      NUMA node1 CPU(s):     15-29,143-157
      NUMA node2 CPU(s):
      NUMA node3 CPU(s):
      NUMA node4 CPU(s):     62-76,190-204
      NUMA node5 CPU(s):     78-92,206-220
      
      In this patch, we use memblock_overlaps_region() to check if ranges in
      numa_meminfo overlap with ranges in memblock.memory.  Since
      memblock.memory contains all available memory at boot time, if they
      overlap, the ranges exist.  If not, remove them from numa_meminfo.
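
      A minimal sketch of that check, assuming it sits in the trimming loop of
      numa_cleanup_meminfo() (the loop and identifiers follow that function;
      this is illustrative, not the literal diff):

      	/* Drop any numa_meminfo entry that does not intersect memory that is
      	 * actually present in memblock at boot; such an entry came from SRAT
      	 * but describes a node that is not populated. */
      	for (i = 0; i < mi->nr_blks; i++) {
      		struct numa_memblk *bi = &mi->blk[i];

      		if (!memblock_overlaps_region(&memblock.memory,
      					      bi->start, bi->end - bi->start))
      			numa_remove_memblk_from(i--, mi);
      	}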
      
      After this patch, lscpu will show:
      NUMA node0 CPU(s):     0-14,128-142
      NUMA node1 CPU(s):     15-29,143-157
      NUMA node4 CPU(s):     62-76,190-204
      NUMA node5 CPU(s):     78-92,206-220
      Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com>
      Reviewed-by: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Luiz Capitulino <lcapitulino@redhat.com>
      Cc: Xishi Qiu <qiuxishi@huawei.com>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: Vladimir Murzin <vladimir.murzin@arm.com>
      Cc: Fabian Frederick <fabf@skynet.be>
      Cc: Alexander Kuleshov <kuleshovmail@gmail.com>
      Cc: Baoquan He <bhe@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      95cf82ec
    • mm/memblock.c: make memblock_overlaps_region() return bool. · c5c5c9d1
      Committed by Tang Chen
      memblock_overlaps_region() checks if the given memblock region
      intersects a region in memblock.  If so, it returns the index of the
      intersected region.
      
      But its only caller is memblock_is_region_reserved(), and it returns 0
      if false, non-zero if true.
      
      Both of these should return bool.
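
      A minimal sketch of the bool-returning form described above (the helper
      name and the open-coded interval test are illustrative simplifications):

      	/* Return true if [base, base + size) intersects any region of @type. */
      	static bool regions_overlap(struct memblock_type *type,
      				    phys_addr_t base, phys_addr_t size)
      	{
      		unsigned long i;

      		for (i = 0; i < type->cnt; i++) {
      			struct memblock_region *r = &type->regions[i];

      			if (base < r->base + r->size && r->base < base + size)
      				return true;
      		}
      		return false;
      	}

      memblock_is_region_reserved() can then simply return the result of this
      check against memblock.reserved.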
      Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
      Cc: Luiz Capitulino <lcapitulino@redhat.com>
      Cc: Xishi Qiu <qiuxishi@huawei.com>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: Vladimir Murzin <vladimir.murzin@arm.com>
      Cc: Fabian Frederick <fabf@skynet.be>
      Cc: Alexander Kuleshov <kuleshovmail@gmail.com>
      Cc: Baoquan He <bhe@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      c5c5c9d1
    • mm/memblock.c: WARN_ON when flags differs from overlap region · 4fcab5f4
      Committed by Wei Yang
      Each memblock_region has flags to indicate the type of this range.  For
      the overlap case, memblock_add_range() inserts the lower part and leaves
      the upper part as indicated in the overlapped region.

      If the flags of the new range differ from those of the overlapped region,
      the information recorded is not correct.

      This patch adds a WARN_ON when the flags of the new range differ from
      those of the overlapped region.
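
      A sketch of roughly where such a check sits in memblock_add_range()'s
      overlap handling (a fragment; the identifiers follow that function and
      this is illustrative, not the literal diff):

      	/* The new range starts below the region it overlaps: only
      	 * [base, rbase) is inserted with the new flags, so warn if they
      	 * disagree with the region that keeps covering the upper part. */
      	if (rbase > base) {
      		WARN_ON(flags != rgn->flags);
      		nr_new++;
      		if (insert)
      			memblock_insert_region(type, idx++, base,
      					       rbase - base, nid, flags);
      	}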
      Signed-off-by: Wei Yang <weiyang@linux.vnet.ibm.com>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      4fcab5f4
  19. 05 Sep 2015 (1 commit)
  20. 01 Jul 2015 (2 commits)
    • mm: page_alloc: pass PFN to __free_pages_bootmem · d70ddd7a
      Committed by Mel Gorman
      __free_pages_bootmem prepares a page for release to the buddy allocator
      and assumes that the struct page is initialised.  Parallel initialisation
      of struct pages defers initialisation, and __free_pages_bootmem can be
      called for struct pages that cannot yet be mapped back to their PFN.
      This patch passes the PFN to __free_pages_bootmem with no other
      functional change.
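
      A sketch of the resulting interface, assuming the extra parameter is
      simply the page's PFN, which callers already have at hand:

      	/* Callers pass the PFN explicitly instead of deriving it from the
      	 * struct page, which later in the series may not be initialised yet. */
      	void __init __free_pages_bootmem(struct page *page, unsigned long pfn,
      					 unsigned int order);

      	/* e.g. a caller freeing a naturally aligned block: */
      	__free_pages_bootmem(pfn_to_page(start_pfn), start_pfn, order);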
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Tested-by: Nate Zimmer <nzimmer@sgi.com>
      Tested-by: Waiman Long <waiman.long@hp.com>
      Tested-by: Daniel J Blueman <daniel@numascale.com>
      Acked-by: Pekka Enberg <penberg@kernel.org>
      Cc: Robin Holt <robinmholt@gmail.com>
      Cc: Nate Zimmer <nzimmer@sgi.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Waiman Long <waiman.long@hp.com>
      Cc: Scott Norton <scott.norton@hp.com>
      Cc: "Luck, Tony" <tony.luck@intel.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      d70ddd7a
    • memblock: introduce a for_each_reserved_mem_region iterator · 8e7a7f86
      Committed by Robin Holt
      Struct page initialisation had been identified as one of the reasons why
      large machines take a long time to boot. Patches were posted a long time ago
      to defer initialisation until they were first used.  This was rejected on
      the grounds it should not be necessary to hurt the fast paths. This series
      reuses much of the work from that time but defers the initialisation of
      memory to kswapd so that one thread per node initialises memory local to
      that node.
      
      After applying the series and setting the appropriate Kconfig variable I
      see this in the boot log on a 64G machine
      
      [    7.383764] kswapd 0 initialised deferred memory in 188ms
      [    7.404253] kswapd 1 initialised deferred memory in 208ms
      [    7.411044] kswapd 3 initialised deferred memory in 216ms
      [    7.411551] kswapd 2 initialised deferred memory in 216ms
      
      On a 1TB machine, I see
      
      [    8.406511] kswapd 3 initialised deferred memory in 1116ms
      [    8.428518] kswapd 1 initialised deferred memory in 1140ms
      [    8.435977] kswapd 0 initialised deferred memory in 1148ms
      [    8.437416] kswapd 2 initialised deferred memory in 1148ms
      
      Once booted the machine appears to work as normal. Boot times were measured
      from the time shutdown was called until ssh was available again.  In the
      64G case, the boot time savings are negligible. On the 1TB machine, the
      savings were 16 seconds.
      
      Nate Zimmer said:
      
      : On an older 8 TB box with lots and lots of cpus the boot time, as
      : measure from grub to login prompt, the boot time improved from 1484
      : seconds to exactly 1000 seconds.
      
      Waiman Long said:
      
      : I ran a bootup timing test on a 12-TB 16-socket IvyBridge-EX system.  From
      : grub menu to ssh login, the bootup time was 453s before the patch and 265s
      : after the patch - a saving of 188s (42%).
      
      Daniel Blueman said:
      
      : On a 7TB, 1728-core NumaConnect system with 108 NUMA nodes, we're seeing
      : stock 4.0 boot in 7136s.  This drops to 2159s, or a 70% reduction with
      : this patchset.  Non-temporal PMD init (https://lkml.org/lkml/2015/4/23/350)
      : drops this to 1045s.
      
      This patch (of 13):
      
      As part of initializing struct pages in 2MiB chunks, we noticed that at
      the end of free_all_bootmem(), there was nothing which had forced the
      reserved/allocated 4KiB pages to be initialized.
      
      This helper function will be used for that expansion.
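
      A minimal usage sketch, assuming the iterator keeps the (index, start,
      end) shape it has when introduced; init_reserved_range() is a placeholder
      for whatever the later patches in the series do with each range:

      	u64 i;
      	phys_addr_t start, end;

      	/* Walk every memblock reservation and make sure the struct pages
      	 * backing it get initialised, even though these pages are never
      	 * handed to the buddy allocator. */
      	for_each_reserved_mem_region(i, &start, &end)
      		init_reserved_range(start, end);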
      Signed-off-by: Robin Holt <holt@sgi.com>
      Signed-off-by: Nate Zimmer <nzimmer@sgi.com>
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Tested-by: Nate Zimmer <nzimmer@sgi.com>
      Tested-by: Waiman Long <waiman.long@hp.com>
      Tested-by: Daniel J Blueman <daniel@numascale.com>
      Acked-by: Pekka Enberg <penberg@kernel.org>
      Cc: Robin Holt <robinmholt@gmail.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Waiman Long <waiman.long@hp.com>
      Cc: Scott Norton <scott.norton@hp.com>
      Cc: "Luck, Tony" <tony.luck@intel.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      8e7a7f86
  21. 25 Jun 2015 (2 commits)
    • mm/memblock: allocate boot time data structures from mirrored memory · a3f5bafc
      Committed by Tony Luck
      Try to allocate all boot time kernel data structures from mirrored
      memory.
      
      If we run out of mirrored memory print warnings, but fall back to using
      non-mirrored memory to make sure that we still boot.
      
      By number of bytes, most of what we allocate at boot time is the page
      structures.  64 bytes per 4K page on x86_64 ...  or about 1.5% of total
      system memory.  For workloads where the bulk of memory is allocated to
      applications this may represent a useful improvement to system
      availability since 1.5% of total memory might be a third of the memory
      allocated to the kernel.
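
      (For scale: 64 bytes of struct page per 4096-byte page is 64/4096, or
      about 1.56%, consistent with the ~1.5% figure above.)

      A sketch of the fallback policy described above; alloc_prefer_mirror() is
      a made-up wrapper name, and it assumes a choose_memblock_flags()-style
      helper that returns MEMBLOCK_MIRROR when mirrored memory is present:

      	/* Try mirrored memory first; if none is left, warn and retry without
      	 * the restriction so the system still boots. */
      	static phys_addr_t __init alloc_prefer_mirror(phys_addr_t size,
      						      phys_addr_t align,
      						      phys_addr_t start,
      						      phys_addr_t end, int nid)
      	{
      		ulong flags = choose_memblock_flags();
      		phys_addr_t found;

      	again:
      		found = memblock_find_in_range_node(size, align, start, end,
      						    nid, flags);
      		if (!found && (flags & MEMBLOCK_MIRROR)) {
      			pr_warn("Could not allocate %pap bytes of mirrored memory\n",
      				&size);
      			flags &= ~MEMBLOCK_MIRROR;
      			goto again;
      		}
      		return found;
      	}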
      Signed-off-by: Tony Luck <tony.luck@intel.com>
      Cc: Xishi Qiu <qiuxishi@huawei.com>
      Cc: Hanjun Guo <guohanjun@huawei.com>
      Cc: Xiexiuqi <xiexiuqi@huawei.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Yinghai Lu <yinghai@kernel.org>
      Cc: Naoya Horiguchi <nao.horiguchi@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      a3f5bafc
    • mm/memblock: add extra "flags" to memblock to allow selection of memory based on attribute · fc6daaf9
      Committed by Tony Luck
      Some high end Intel Xeon systems report uncorrectable memory errors as a
      recoverable machine check.  Linux has included code for some time to
      process these and just signal the affected processes (or even recover
      completely if the error was in a read only page that can be replaced by
      reading from disk).
      
      But we have no recovery path for errors encountered during kernel code
      execution.  Except for some very specific cases, we are unlikely to ever
      be able to recover.

      Enter memory mirroring.  Actually the 3rd generation of memory mirroring.
      
      Gen1: All memory is mirrored
      	Pro: No s/w enabling - h/w just gets good data from other side of the
      	     mirror
      	Con: Halves effective memory capacity available to OS/applications
      
      Gen2: Partial memory mirror - just mirror memory behind some memory controllers
      	Pro: Keep more of the capacity
      	Con: Nightmare to enable. Have to choose between allocating from
      	     mirrored memory for safety vs. NUMA local memory for performance
      
      Gen3: Address range partial memory mirror - some mirror on each memory
            controller
      	Pro: Can tune the amount of mirror and keep NUMA performance
      	Con: I have to write memory management code to implement
      
      The current plan is just to use mirrored memory for kernel allocations.
      This has been broken into two phases:
      
      1) This patch series - find the mirrored memory, use it for boot time
         allocations
      
      2) Wade into mm/page_alloc.c and define a ZONE_MIRROR to pick up the
         unused mirrored memory from mm/memblock.c and only give it out to
         select kernel allocations (this is still being scoped because
         page_alloc.c is scary).
      
      This patch (of 3):
      
      Add extra "flags" to memblock to allow selection of memory based on
      attribute.  No functional changes.
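
      A sketch of the interface shape this implies, assuming the region flags
      live in a small enum and that the free-range iterators take the flags
      right after the node id (MEMBLOCK_MIRROR itself only arrives with the
      follow-up patches):

      	/* Each memblock_region carries a flags word describing its
      	 * attributes; iterators grow a flags argument so callers can
      	 * restrict a walk to regions with a given attribute. */
      	enum {
      		MEMBLOCK_NONE	 = 0x0,	/* no special request */
      		MEMBLOCK_HOTPLUG = 0x1,	/* hotpluggable region */
      	};

      	/* e.g. walking free ranges with no attribute restriction: */
      	u64 i;
      	phys_addr_t start, end;

      	for_each_free_mem_range(i, NUMA_NO_NODE, MEMBLOCK_NONE,
      				&start, &end, NULL)
      		; /* use [start, end) */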
      Signed-off-by: Tony Luck <tony.luck@intel.com>
      Cc: Xishi Qiu <qiuxishi@huawei.com>
      Cc: Hanjun Guo <guohanjun@huawei.com>
      Cc: Xiexiuqi <xiexiuqi@huawei.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Yinghai Lu <yinghai@kernel.org>
      Cc: Naoya Horiguchi <nao.horiguchi@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      fc6daaf9