1. 09 9月, 2015 4 次提交
    • A
      mm/memblock.c: rename local variable of memblock_type to 'type' · 567d117b
      Alexander Kuleshov 提交于
      Since commit e3239ff9 ("memblock: Rename memblock_region to
      memblock_type and memblock_property to memblock_region"), all local
      variables of the membock_type type were renamed to 'type'.  This commit
      renames all remaining local variables with the memblock_type type to the
      same view.
      Signed-off-by: NAlexander Kuleshov <kuleshovmail@gmail.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      567d117b
    • T
      mem-hotplug: handle node hole when initializing numa_meminfo. · 95cf82ec
      Tang Chen 提交于
      When parsing SRAT, all memory ranges are added into numa_meminfo.  In
      numa_init(), before entering numa_cleanup_meminfo(), all possible memory
      ranges are in numa_meminfo.  And numa_cleanup_meminfo() removes all
      ranges over max_pfn or empty.
      
      But, this only works if the nodes are continuous.  Let's have a look at
      the following example:
      
      We have an SRAT like this:
      SRAT: Node 0 PXM 0 [mem 0x00000000-0x5fffffff]
      SRAT: Node 0 PXM 0 [mem 0x100000000-0x1ffffffffff]
      SRAT: Node 1 PXM 1 [mem 0x20000000000-0x3ffffffffff]
      SRAT: Node 4 PXM 2 [mem 0x40000000000-0x5ffffffffff] hotplug
      SRAT: Node 5 PXM 3 [mem 0x60000000000-0x7ffffffffff] hotplug
      SRAT: Node 2 PXM 4 [mem 0x80000000000-0x9ffffffffff] hotplug
      SRAT: Node 3 PXM 5 [mem 0xa0000000000-0xbffffffffff] hotplug
      SRAT: Node 6 PXM 6 [mem 0xc0000000000-0xdffffffffff] hotplug
      SRAT: Node 7 PXM 7 [mem 0xe0000000000-0xfffffffffff] hotplug
      
      On boot, only node 0,1,2,3 exist.
      
      And the numa_meminfo will look like this:
      numa_meminfo.nr_blks = 9
      1. on node 0: [0, 60000000]
      2. on node 0: [100000000, 20000000000]
      3. on node 1: [20000000000, 40000000000]
      4. on node 4: [40000000000, 60000000000]
      5. on node 5: [60000000000, 80000000000]
      6. on node 2: [80000000000, a0000000000]
      7. on node 3: [a0000000000, a0800000000]
      8. on node 6: [c0000000000, a0800000000]
      9. on node 7: [e0000000000, a0800000000]
      
      And numa_cleanup_meminfo() will merge 1 and 2, and remove 8,9 because the
      end address is over max_pfn, which is a0800000000.  But 4 and 5 are not
      removed because their end addresses are less then max_pfn.  But in fact,
      node 4 and 5 don't exist.
      
      In a word, numa_cleanup_meminfo() is not able to handle holes between nodes.
      
      Since memory ranges in node 4 and 5 are in numa_meminfo, in
      numa_register_memblks(), node 4 and 5 will be mistakenly set to online.
      
      If you run lscpu, it will show:
      NUMA node0 CPU(s):     0-14,128-142
      NUMA node1 CPU(s):     15-29,143-157
      NUMA node2 CPU(s):
      NUMA node3 CPU(s):
      NUMA node4 CPU(s):     62-76,190-204
      NUMA node5 CPU(s):     78-92,206-220
      
      In this patch, we use memblock_overlaps_region() to check if ranges in
      numa_meminfo overlap with ranges in memory_block.  Since memory_block
      contains all available memory at boot time, if they overlap, it means the
      ranges exist.  If not, then remove them from numa_meminfo.
      
      After this patch, lscpu will show:
      NUMA node0 CPU(s):     0-14,128-142
      NUMA node1 CPU(s):     15-29,143-157
      NUMA node4 CPU(s):     62-76,190-204
      NUMA node5 CPU(s):     78-92,206-220
      Signed-off-by: NTang Chen <tangchen@cn.fujitsu.com>
      Reviewed-by: NYasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Luiz Capitulino <lcapitulino@redhat.com>
      Cc: Xishi Qiu <qiuxishi@huawei.com>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: Vladimir Murzin <vladimir.murzin@arm.com>
      Cc: Fabian Frederick <fabf@skynet.be>
      Cc: Alexander Kuleshov <kuleshovmail@gmail.com>
      Cc: Baoquan He <bhe@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      95cf82ec
    • T
      mm/memblock.c: make memblock_overlaps_region() return bool. · c5c5c9d1
      Tang Chen 提交于
      memblock_overlaps_region() checks if the given memblock region
      intersects a region in memblock.  If so, it returns the index of the
      intersected region.
      
      But its only caller is memblock_is_region_reserved(), and it returns 0
      if false, non-zero if true.
      
      Both of these should return bool.
      Signed-off-by: NTang Chen <tangchen@cn.fujitsu.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
      Cc: Luiz Capitulino <lcapitulino@redhat.com>
      Cc: Xishi Qiu <qiuxishi@huawei.com>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: Vladimir Murzin <vladimir.murzin@arm.com>
      Cc: Fabian Frederick <fabf@skynet.be>
      Cc: Alexander Kuleshov <kuleshovmail@gmail.com>
      Cc: Baoquan He <bhe@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      c5c5c9d1
    • W
      mm/memblock.c: WARN_ON when flags differs from overlap region · 4fcab5f4
      Wei Yang 提交于
      Each memblock_region has flags to indicates the type of this range. For
      the overlap case, memblock_add_range() inserts the lower part and leave the
      upper part as indicated in the overlapped region.
      
      If the flags of the new range differs from the overlapped region, the
      information recorded is not correct.
      
      This patch adds a WARN_ON when the flags of the new range differs from the
      overlapped region.
      Signed-off-by: NWei Yang <weiyang@linux.vnet.ibm.com>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      4fcab5f4
  2. 05 9月, 2015 1 次提交
  3. 01 7月, 2015 2 次提交
    • M
      mm: page_alloc: pass PFN to __free_pages_bootmem · d70ddd7a
      Mel Gorman 提交于
      __free_pages_bootmem prepares a page for release to the buddy allocator
      and assumes that the struct page is initialised.  Parallel initialisation
      of struct pages defers initialisation and __free_pages_bootmem can be
      called for struct pages that cannot yet map struct page to PFN.  This
      patch passes PFN to __free_pages_bootmem with no other functional change.
      Signed-off-by: NMel Gorman <mgorman@suse.de>
      Tested-by: NNate Zimmer <nzimmer@sgi.com>
      Tested-by: NWaiman Long <waiman.long@hp.com>
      Tested-by: NDaniel J Blueman <daniel@numascale.com>
      Acked-by: NPekka Enberg <penberg@kernel.org>
      Cc: Robin Holt <robinmholt@gmail.com>
      Cc: Nate Zimmer <nzimmer@sgi.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Waiman Long <waiman.long@hp.com>
      Cc: Scott Norton <scott.norton@hp.com>
      Cc: "Luck, Tony" <tony.luck@intel.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      d70ddd7a
    • R
      memblock: introduce a for_each_reserved_mem_region iterator · 8e7a7f86
      Robin Holt 提交于
      Struct page initialisation had been identified as one of the reasons why
      large machines take a long time to boot. Patches were posted a long time ago
      to defer initialisation until they were first used.  This was rejected on
      the grounds it should not be necessary to hurt the fast paths. This series
      reuses much of the work from that time but defers the initialisation of
      memory to kswapd so that one thread per node initialises memory local to
      that node.
      
      After applying the series and setting the appropriate Kconfig variable I
      see this in the boot log on a 64G machine
      
      [    7.383764] kswapd 0 initialised deferred memory in 188ms
      [    7.404253] kswapd 1 initialised deferred memory in 208ms
      [    7.411044] kswapd 3 initialised deferred memory in 216ms
      [    7.411551] kswapd 2 initialised deferred memory in 216ms
      
      On a 1TB machine, I see
      
      [    8.406511] kswapd 3 initialised deferred memory in 1116ms
      [    8.428518] kswapd 1 initialised deferred memory in 1140ms
      [    8.435977] kswapd 0 initialised deferred memory in 1148ms
      [    8.437416] kswapd 2 initialised deferred memory in 1148ms
      
      Once booted the machine appears to work as normal. Boot times were measured
      from the time shutdown was called until ssh was available again.  In the
      64G case, the boot time savings are negligible. On the 1TB machine, the
      savings were 16 seconds.
      
      Nate Zimmer said:
      
      : On an older 8 TB box with lots and lots of cpus the boot time, as
      : measure from grub to login prompt, the boot time improved from 1484
      : seconds to exactly 1000 seconds.
      
      Waiman Long said:
      
      : I ran a bootup timing test on a 12-TB 16-socket IvyBridge-EX system.  From
      : grub menu to ssh login, the bootup time was 453s before the patch and 265s
      : after the patch - a saving of 188s (42%).
      
      Daniel Blueman said:
      
      : On a 7TB, 1728-core NumaConnect system with 108 NUMA nodes, we're seeing
      : stock 4.0 boot in 7136s.  This drops to 2159s, or a 70% reduction with
      : this patchset.  Non-temporal PMD init (https://lkml.org/lkml/2015/4/23/350)
      : drops this to 1045s.
      
      This patch (of 13):
      
      As part of initializing struct page's in 2MiB chunks, we noticed that at
      the end of free_all_bootmem(), there was nothing which had forced the
      reserved/allocated 4KiB pages to be initialized.
      
      This helper function will be used for that expansion.
      Signed-off-by: NRobin Holt <holt@sgi.com>
      Signed-off-by: NNate Zimmer <nzimmer@sgi.com>
      Signed-off-by: NMel Gorman <mgorman@suse.de>
      Tested-by: NNate Zimmer <nzimmer@sgi.com>
      Tested-by: NWaiman Long <waiman.long@hp.com>
      Tested-by: NDaniel J Blueman <daniel@numascale.com>
      Acked-by: NPekka Enberg <penberg@kernel.org>
      Cc: Robin Holt <robinmholt@gmail.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Waiman Long <waiman.long@hp.com>
      Cc: Scott Norton <scott.norton@hp.com>
      Cc: "Luck, Tony" <tony.luck@intel.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      8e7a7f86
  4. 25 6月, 2015 2 次提交
    • T
      mm/memblock: allocate boot time data structures from mirrored memory · a3f5bafc
      Tony Luck 提交于
      Try to allocate all boot time kernel data structures from mirrored
      memory.
      
      If we run out of mirrored memory print warnings, but fall back to using
      non-mirrored memory to make sure that we still boot.
      
      By number of bytes, most of what we allocate at boot time is the page
      structures.  64 bytes per 4K page on x86_64 ...  or about 1.5% of total
      system memory.  For workloads where the bulk of memory is allocated to
      applications this may represent a useful improvement to system
      availability since 1.5% of total memory might be a third of the memory
      allocated to the kernel.
      Signed-off-by: NTony Luck <tony.luck@intel.com>
      Cc: Xishi Qiu <qiuxishi@huawei.com>
      Cc: Hanjun Guo <guohanjun@huawei.com>
      Cc: Xiexiuqi <xiexiuqi@huawei.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Yinghai Lu <yinghai@kernel.org>
      Cc: Naoya Horiguchi <nao.horiguchi@gmail.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      a3f5bafc
    • T
      mm/memblock: add extra "flags" to memblock to allow selection of memory based on attribute · fc6daaf9
      Tony Luck 提交于
      Some high end Intel Xeon systems report uncorrectable memory errors as a
      recoverable machine check.  Linux has included code for some time to
      process these and just signal the affected processes (or even recover
      completely if the error was in a read only page that can be replaced by
      reading from disk).
      
      But we have no recovery path for errors encountered during kernel code
      execution.  Except for some very specific cases were are unlikely to ever
      be able to recover.
      
      Enter memory mirroring. Actually 3rd generation of memory mirroing.
      
      Gen1: All memory is mirrored
      	Pro: No s/w enabling - h/w just gets good data from other side of the
      	     mirror
      	Con: Halves effective memory capacity available to OS/applications
      
      Gen2: Partial memory mirror - just mirror memory begind some memory controllers
      	Pro: Keep more of the capacity
      	Con: Nightmare to enable. Have to choose between allocating from
      	     mirrored memory for safety vs. NUMA local memory for performance
      
      Gen3: Address range partial memory mirror - some mirror on each memory
            controller
      	Pro: Can tune the amount of mirror and keep NUMA performance
      	Con: I have to write memory management code to implement
      
      The current plan is just to use mirrored memory for kernel allocations.
      This has been broken into two phases:
      
      1) This patch series - find the mirrored memory, use it for boot time
         allocations
      
      2) Wade into mm/page_alloc.c and define a ZONE_MIRROR to pick up the
         unused mirrored memory from mm/memblock.c and only give it out to
         select kernel allocations (this is still being scoped because
         page_alloc.c is scary).
      
      This patch (of 3):
      
      Add extra "flags" to memblock to allow selection of memory based on
      attribute.  No functional changes
      Signed-off-by: NTony Luck <tony.luck@intel.com>
      Cc: Xishi Qiu <qiuxishi@huawei.com>
      Cc: Hanjun Guo <guohanjun@huawei.com>
      Cc: Xiexiuqi <xiexiuqi@huawei.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Yinghai Lu <yinghai@kernel.org>
      Cc: Naoya Horiguchi <nao.horiguchi@gmail.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      fc6daaf9
  5. 16 4月, 2015 1 次提交
  6. 15 4月, 2015 1 次提交
  7. 14 12月, 2014 1 次提交
    • T
      mm/memblock.c: refactor functions to set/clear MEMBLOCK_HOTPLUG · 4308ce17
      Tony Luck 提交于
      There is a lot of duplication in the rubric around actually setting or
      clearing a mem region flag.  Create a new helper function to do this and
      reduce each of memblock_mark_hotplug() and memblock_clear_hotplug() to a
      single line.
      
      This will be useful if someone were to add a new mem region flag - which
      I hope to be doing some day soon. But it looks like a plausible cleanup
      even without that - so I'd like to get it out of the way now.
      Signed-off-by: NTony Luck <tony.luck@intel.com>
      Cc: Santosh Shilimkar <santosh.shilimkar@ti.com>
      Cc: Tang Chen <tangchen@cn.fujitsu.com>
      Cc: Grygorii Strashko <grygorii.strashko@ti.com>
      Cc: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
      Cc: Philipp Hachtmann <phacht@linux.vnet.ibm.com>
      Cc: Yinghai Lu <yinghai@kernel.org>
      Cc: Emil Medve <Emilian.Medve@freescale.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      4308ce17
  8. 11 9月, 2014 1 次提交
    • X
      mem-hotplug: let memblock skip the hotpluggable memory regions in __next_mem_range() · 0a313a99
      Xishi Qiu 提交于
      Let memblock skip the hotpluggable memory regions in __next_mem_range(),
      it is used to to prevent memblock from allocating hotpluggable memory
      for the kernel at early time. The code is the same as __next_mem_range_rev().
      
      Clear hotpluggable flag before releasing free pages to the buddy
      allocator.  If we don't clear hotpluggable flag in
      free_low_memory_core_early(), the memory which marked hotpluggable flag
      will not free to buddy allocator.  Because __next_mem_range() will skip
      them.
      
      free_low_memory_core_early
      	for_each_free_mem_range
      		for_each_mem_range
      			__next_mem_range
      
      [akpm@linux-foundation.org: fix warning]
      Signed-off-by: NXishi Qiu <qiuxishi@huawei.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Tang Chen <tangchen@cn.fujitsu.com>
      Cc: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
      Cc: Wen Congyang <wency@cn.fujitsu.com>
      Cc: "Rafael J. Wysocki" <rjw@sisk.pl>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Wu Fengguang <fengguang.wu@intel.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      0a313a99
  9. 30 8月, 2014 1 次提交
  10. 07 6月, 2014 1 次提交
  11. 05 6月, 2014 2 次提交
  12. 20 5月, 2014 2 次提交
    • P
      mm/memblock: add physical memory list · 70210ed9
      Philipp Hachtmann 提交于
      Add the physmem list to the memblock structure. This list only exists
      if HAVE_MEMBLOCK_PHYS_MAP is selected and contains the unmodified
      list of physically available memory. It differs from the memblock
      memory list as it always contains all memory ranges even if the
      memory has been restricted, e.g. by use of the mem= kernel parameter.
      Signed-off-by: NPhilipp Hachtmann <phacht@linux.vnet.ibm.com>
      Signed-off-by: NMartin Schwidefsky <schwidefsky@de.ibm.com>
      70210ed9
    • P
      mm/memblock: Do some refactoring, enhance API · f1af9d3a
      Philipp Hachtmann 提交于
      Refactor the memblock code and extend the memblock API to make it
      more flexible. With the extended API it is simple to define and
      work with additional memory lists.
      
      The static functions memblock_add_region and __memblock_remove are
      renamed to memblock_add_range and meblock_remove_range and added to
      the memblock API.
      
      The __next_free_mem_range and __next_free_mem_range_rev functions
      are replaced with calls to the more generic list walkers
      __next_mem_range and __next_mem_range_rev.
      
      To walk an arbitrary memory list two new macros for_each_mem_range
      and for_each_mem_range_rev are added. These new macros are used
      to define for_each_free_mem_range and for_each_free_mem_range_reverse.
      Signed-off-by: NPhilipp Hachtmann <phacht@linux.vnet.ibm.com>
      Signed-off-by: NMartin Schwidefsky <schwidefsky@de.ibm.com>
      f1af9d3a
  13. 08 4月, 2014 2 次提交
  14. 12 3月, 2014 1 次提交
  15. 30 1月, 2014 1 次提交
  16. 28 1月, 2014 1 次提交
  17. 24 1月, 2014 2 次提交
  18. 22 1月, 2014 11 次提交
    • G
      mm/memblock: use WARN_ONCE when MAX_NUMNODES passed as input parameter · 560dca27
      Grygorii Strashko 提交于
      Check nid parameter and produce warning if it has deprecated
      MAX_NUMNODES value.  Also re-assign NUMA_NO_NODE value to the nid
      parameter in this case.
      
      These will help to identify the wrong API usage (the caller) and make
      code simpler.
      Signed-off-by: NGrygorii Strashko <grygorii.strashko@ti.com>
      Signed-off-by: NSantosh Shilimkar <santosh.shilimkar@ti.com>
      Cc: Yinghai Lu <yinghai@kernel.org>
      Cc: Tejun Heo <tj@kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      560dca27
    • S
      mm/memblock: add memblock memory allocation apis · 26f09e9b
      Santosh Shilimkar 提交于
      Introduce memblock memory allocation APIs which allow to support PAE or
      LPAE extension on 32 bits archs where the physical memory start address
      can be beyond 4GB.  In such cases, existing bootmem APIs which operate
      on 32 bit addresses won't work and needs memblock layer which operates
      on 64 bit addresses.
      
      So we add equivalent APIs so that we can replace usage of bootmem with
      memblock interfaces.  Architectures already converted to NO_BOOTMEM use
      these new memblock interfaces.  The architectures which are still not
      converted to NO_BOOTMEM continue to function as is because we still
      maintain the fal lback option of bootmem back-end supporting these new
      interfaces.  So no functional change as such.
      
      In long run, once all the architectures moves to NO_BOOTMEM, we can get
      rid of bootmem layer completely.  This is one step to remove the core
      code dependency with bootmem and also gives path for architectures to
      move away from bootmem.
      
      The proposed interface will became active if both CONFIG_HAVE_MEMBLOCK
      and CONFIG_NO_BOOTMEM are specified by arch.  In case
      !CONFIG_NO_BOOTMEM, the memblock() wrappers will fallback to the
      existing bootmem apis so that arch's not converted to NO_BOOTMEM
      continue to work as is.
      
      The meaning of MEMBLOCK_ALLOC_ACCESSIBLE and MEMBLOCK_ALLOC_ANYWHERE
      is kept same.
      
      [akpm@linux-foundation.org: s/depricated/deprecated/]
      Signed-off-by: NGrygorii Strashko <grygorii.strashko@ti.com>
      Signed-off-by: NSantosh Shilimkar <santosh.shilimkar@ti.com>
      Cc: Yinghai Lu <yinghai@kernel.org>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: "Rafael J. Wysocki" <rjw@sisk.pl>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Christoph Lameter <cl@linux-foundation.org>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Paul Walmsley <paul@pwsan.com>
      Cc: Pavel Machek <pavel@ucw.cz>
      Cc: Russell King <linux@arm.linux.org.uk>
      Cc: Tony Lindgren <tony@atomide.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      26f09e9b
    • G
      mm/memblock: switch to use NUMA_NO_NODE instead of MAX_NUMNODES · b1154233
      Grygorii Strashko 提交于
      It's recommended to use NUMA_NO_NODE everywhere to select "process any
      node" behavior or to indicate that "no node id specified".
      
      Hence, update __next_free_mem_range*() API's to accept both NUMA_NO_NODE
      and MAX_NUMNODES, but emit warning once on MAX_NUMNODES, and correct
      corresponding API's documentation to describe new behavior.  Also,
      update other memblock/nobootmem APIs where MAX_NUMNODES is used
      dirrectly.
      
      The change was suggested by Tejun Heo.
      Signed-off-by: NGrygorii Strashko <grygorii.strashko@ti.com>
      Signed-off-by: NSantosh Shilimkar <santosh.shilimkar@ti.com>
      Cc: Yinghai Lu <yinghai@kernel.org>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: "Rafael J. Wysocki" <rjw@sisk.pl>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Christoph Lameter <cl@linux-foundation.org>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Paul Walmsley <paul@pwsan.com>
      Cc: Pavel Machek <pavel@ucw.cz>
      Cc: Russell King <linux@arm.linux.org.uk>
      Cc: Tony Lindgren <tony@atomide.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      b1154233
    • G
      mm/memblock: reorder parameters of memblock_find_in_range_node · 87029ee9
      Grygorii Strashko 提交于
      Reorder parameters of memblock_find_in_range_node to be consistent with
      other memblock APIs.
      
      The change was suggested by Tejun Heo <tj@kernel.org>.
      Signed-off-by: NGrygorii Strashko <grygorii.strashko@ti.com>
      Signed-off-by: NSantosh Shilimkar <santosh.shilimkar@ti.com>
      Cc: Yinghai Lu <yinghai@kernel.org>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: "Rafael J. Wysocki" <rjw@sisk.pl>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Christoph Lameter <cl@linux-foundation.org>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Paul Walmsley <paul@pwsan.com>
      Cc: Pavel Machek <pavel@ucw.cz>
      Cc: Russell King <linux@arm.linux.org.uk>
      Cc: Tony Lindgren <tony@atomide.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      87029ee9
    • G
      mm/memblock: drop WARN and use SMP_CACHE_BYTES as a default alignment · 79f40fab
      Grygorii Strashko 提交于
      Don't produce warning and interpret 0 as "default align" equal to
      SMP_CACHE_BYTES in case if caller of memblock_alloc_base_nid() doesn't
      specify alignment for the block (align == 0).
      
      This is done in preparation of introducing common memblock alloc interface
      to make code behavior consistent.  More details are in below thread :
      
      	https://lkml.org/lkml/2013/10/13/117.
      Signed-off-by: NGrygorii Strashko <grygorii.strashko@ti.com>
      Signed-off-by: NSantosh Shilimkar <santosh.shilimkar@ti.com>
      Cc: Yinghai Lu <yinghai@kernel.org>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: "Rafael J. Wysocki" <rjw@sisk.pl>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Christoph Lameter <cl@linux-foundation.org>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Paul Walmsley <paul@pwsan.com>
      Cc: Pavel Machek <pavel@ucw.cz>
      Cc: Russell King <linux@arm.linux.org.uk>
      Cc: Tony Lindgren <tony@atomide.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      79f40fab
    • G
      mm/memblock: debug: don't free reserved array if !ARCH_DISCARD_MEMBLOCK · fd615c4e
      Grygorii Strashko 提交于
      Now the Nobootmem allocator will always try to free memory allocated for
      reserved memory regions (free_low_memory_core_early()) without taking
      into to account current memblock debugging configuration
      (CONFIG_ARCH_DISCARD_MEMBLOCK and CONFIG_DEBUG_FS state).
      
      As result if:
      
       - CONFIG_DEBUG_FS defined
       - CONFIG_ARCH_DISCARD_MEMBLOCK not defined;
       - reserved memory regions array have been resized during boot
      
      then:
      
       - memory allocated for reserved memory regions array will be freed to
         buddy allocator;
       - debug_fs entry "sys/kernel/debug/memblock/reserved" will show garbage
         instead of state of memory reservations.  like:
         0: 0x98393bc0..0x9a393bbf
         1: 0xff120000..0xff11ffff
         2: 0x00000000..0xffffffff
      
      Hence, do not free memory allocated for reserved memory regions if
      defined(CONFIG_DEBUG_FS) && !defined(CONFIG_ARCH_DISCARD_MEMBLOCK).
      Signed-off-by: NGrygorii Strashko <grygorii.strashko@ti.com>
      Signed-off-by: NSantosh Shilimkar <santosh.shilimkar@ti.com>
      Reviewed-by: NTejun Heo <tj@kernel.org>
      Cc: Yinghai Lu <yinghai@kernel.org>
      Cc: "Rafael J. Wysocki" <rjw@sisk.pl>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Christoph Lameter <cl@linux-foundation.org>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Paul Walmsley <paul@pwsan.com>
      Cc: Pavel Machek <pavel@ucw.cz>
      Cc: Russell King <linux@arm.linux.org.uk>
      Cc: Tony Lindgren <tony@atomide.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      fd615c4e
    • T
      memblock, mem_hotplug: make memblock skip hotpluggable regions if needed · 55ac590c
      Tang Chen 提交于
      Linux kernel cannot migrate pages used by the kernel.  As a result,
      hotpluggable memory used by the kernel won't be able to be hot-removed.
      To solve this problem, the basic idea is to prevent memblock from
      allocating hotpluggable memory for the kernel at early time, and arrange
      all hotpluggable memory in ACPI SRAT(System Resource Affinity Table) as
      ZONE_MOVABLE when initializing zones.
      
      In the previous patches, we have marked hotpluggable memory regions with
      MEMBLOCK_HOTPLUG flag in memblock.memory.
      
      In this patch, we make memblock skip these hotpluggable memory regions
      in the default top-down allocation function if movable_node boot option
      is specified.
      
      [akpm@linux-foundation.org: coding-style fixes]
      Signed-off-by: NTang Chen <tangchen@cn.fujitsu.com>
      Signed-off-by: NZhang Yanfei <zhangyanfei@cn.fujitsu.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: "Rafael J . Wysocki" <rjw@sisk.pl>
      Cc: Chen Tang <imtangchen@gmail.com>
      Cc: Gong Chen <gong.chen@linux.intel.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Jiang Liu <jiang.liu@huawei.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
      Cc: Larry Woodman <lwoodman@redhat.com>
      Cc: Len Brown <lenb@kernel.org>
      Cc: Liu Jiang <jiang.liu@huawei.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Michal Nazarewicz <mina86@mina86.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Prarit Bhargava <prarit@redhat.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Taku Izumi <izumi.taku@jp.fujitsu.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Thomas Renninger <trenn@suse.de>
      Cc: Toshi Kani <toshi.kani@hp.com>
      Cc: Vasilis Liaskovitis <vasilis.liaskovitis@profitbricks.com>
      Cc: Wanpeng Li <liwanp@linux.vnet.ibm.com>
      Cc: Wen Congyang <wency@cn.fujitsu.com>
      Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
      Cc: Yinghai Lu <yinghai@kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      55ac590c
    • T
      memblock: make memblock_set_node() support different memblock_type · e7e8de59
      Tang Chen 提交于
      [sfr@canb.auug.org.au: fix powerpc build]
      Signed-off-by: NTang Chen <tangchen@cn.fujitsu.com>
      Reviewed-by: NZhang Yanfei <zhangyanfei@cn.fujitsu.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: "Rafael J . Wysocki" <rjw@sisk.pl>
      Cc: Chen Tang <imtangchen@gmail.com>
      Cc: Gong Chen <gong.chen@linux.intel.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Jiang Liu <jiang.liu@huawei.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
      Cc: Larry Woodman <lwoodman@redhat.com>
      Cc: Len Brown <lenb@kernel.org>
      Cc: Liu Jiang <jiang.liu@huawei.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Michal Nazarewicz <mina86@mina86.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Prarit Bhargava <prarit@redhat.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Taku Izumi <izumi.taku@jp.fujitsu.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Thomas Renninger <trenn@suse.de>
      Cc: Toshi Kani <toshi.kani@hp.com>
      Cc: Vasilis Liaskovitis <vasilis.liaskovitis@profitbricks.com>
      Cc: Wanpeng Li <liwanp@linux.vnet.ibm.com>
      Cc: Wen Congyang <wency@cn.fujitsu.com>
      Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
      Cc: Yinghai Lu <yinghai@kernel.org>
      Signed-off-by: NStephen Rothwell <sfr@canb.auug.org.au>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      e7e8de59
    • T
      memblock, mem_hotplug: introduce MEMBLOCK_HOTPLUG flag to mark hotpluggable regions · 66b16edf
      Tang Chen 提交于
      In find_hotpluggable_memory, once we find out a memory region which is
      hotpluggable, we want to mark them in memblock.memory.  So that we could
      control memblock allocator not to allocte hotpluggable memory for the
      kernel later.
      
      To achieve this goal, we introduce MEMBLOCK_HOTPLUG flag to indicate the
      hotpluggable memory regions in memblock and a function
      memblock_mark_hotplug() to mark hotpluggable memory if we find one.
      
      [akpm@linux-foundation.org: coding-style fixes]
      Signed-off-by: NTang Chen <tangchen@cn.fujitsu.com>
      Reviewed-by: NZhang Yanfei <zhangyanfei@cn.fujitsu.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: "Rafael J . Wysocki" <rjw@sisk.pl>
      Cc: Chen Tang <imtangchen@gmail.com>
      Cc: Gong Chen <gong.chen@linux.intel.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Jiang Liu <jiang.liu@huawei.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
      Cc: Larry Woodman <lwoodman@redhat.com>
      Cc: Len Brown <lenb@kernel.org>
      Cc: Liu Jiang <jiang.liu@huawei.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Michal Nazarewicz <mina86@mina86.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Prarit Bhargava <prarit@redhat.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Taku Izumi <izumi.taku@jp.fujitsu.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Thomas Renninger <trenn@suse.de>
      Cc: Toshi Kani <toshi.kani@hp.com>
      Cc: Vasilis Liaskovitis <vasilis.liaskovitis@profitbricks.com>
      Cc: Wanpeng Li <liwanp@linux.vnet.ibm.com>
      Cc: Wen Congyang <wency@cn.fujitsu.com>
      Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
      Cc: Yinghai Lu <yinghai@kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      66b16edf
    • T
      memblock, numa: introduce flags field into memblock · 66a20757
      Tang Chen 提交于
      There is no flag in memblock to describe what type the memory is.
      Sometimes, we may use memblock to reserve some memory for special usage.
      And we want to know what kind of memory it is.  So we need a way to
      
      In hotplug environment, we want to reserve hotpluggable memory so the
      kernel won't be able to use it.  And when the system is up, we have to
      free these hotpluggable memory to buddy.  So we need to mark these
      memory first.
      
      In order to do so, we need to mark out these special memory in memblock.
      In this patch, we introduce a new "flags" member into memblock_region:
      
         struct memblock_region {
                 phys_addr_t base;
                 phys_addr_t size;
                 unsigned long flags;		/* This is new. */
         #ifdef CONFIG_HAVE_MEMBLOCK_NODE_MAP
                 int nid;
         #endif
         };
      
      This patch does the following things:
      1) Add "flags" member to memblock_region.
      2) Modify the following APIs' prototype:
      	memblock_add_region()
      	memblock_insert_region()
      3) Add memblock_reserve_region() to support reserve memory with flags, and keep
         memblock_reserve()'s prototype unmodified.
      4) Modify other APIs to support flags, but keep their prototype unmodified.
      
      The idea is from Wen Congyang <wency@cn.fujitsu.com> and Liu Jiang <jiang.liu@huawei.com>.
      Suggested-by: NWen Congyang <wency@cn.fujitsu.com>
      Suggested-by: NLiu Jiang <jiang.liu@huawei.com>
      Signed-off-by: NTang Chen <tangchen@cn.fujitsu.com>
      Reviewed-by: NZhang Yanfei <zhangyanfei@cn.fujitsu.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: "Rafael J . Wysocki" <rjw@sisk.pl>
      Cc: Chen Tang <imtangchen@gmail.com>
      Cc: Gong Chen <gong.chen@linux.intel.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Jiang Liu <jiang.liu@huawei.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
      Cc: Larry Woodman <lwoodman@redhat.com>
      Cc: Len Brown <lenb@kernel.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Michal Nazarewicz <mina86@mina86.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Prarit Bhargava <prarit@redhat.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Taku Izumi <izumi.taku@jp.fujitsu.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Thomas Renninger <trenn@suse.de>
      Cc: Toshi Kani <toshi.kani@hp.com>
      Cc: Vasilis Liaskovitis <vasilis.liaskovitis@profitbricks.com>
      Cc: Wanpeng Li <liwanp@linux.vnet.ibm.com>
      Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
      Cc: Yinghai Lu <yinghai@kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      66a20757
    • G
      mm/memblock: debug: correct displaying of upper memory boundary · 931d13f5
      Grygorii Strashko 提交于
      Current memblock APIs don't work on 32 PAE or LPAE extension arches
      where the physical memory start address beyond 4GB.  The problem was
      discussed here [3] where Tejun, Yinghai(thanks) proposed a way forward
      with memblock interfaces.  Based on the proposal, this series adds
      necessary memblock interfaces and convert the core kernel code to use
      them.  Architectures already converted to NO_BOOTMEM use these new
      interfaces and other which still uses bootmem, these new interfaces just
      fallback to exiting bootmem APIs.
      
      So no functional change in behavior.  In long run, once all the
      architectures moves to NO_BOOTMEM, we can get rid of bootmem layer
      completely.  This is one step to remove the core code dependency with
      bootmem and also gives path for architectures to move away from bootmem.
      
      Testing is done on ARM architecture with 32 bit ARM LAPE machines with
      normal as well sparse(faked) memory model.
      
      This patch (of 23):
      
      When debugging is enabled (cmdline has "memblock=debug") the memblock
      will display upper memory boundary per each allocated/freed memory range
      wrongly.  For example:
      
       memblock_reserve: [0x0000009e7e8000-0x0000009e7ed000] _memblock_early_alloc_try_nid_nopanic+0xfc/0x12c
      
      The 0x0000009e7ed000 is displayed instead of 0x0000009e7ecfff
      
      Hence, correct this by changing formula used to calculate upper memory
      boundary to (u64)base + size - 1 instead of (u64)base + size everywhere
      in the debug messages.
      Signed-off-by: NGrygorii Strashko <grygorii.strashko@ti.com>
      Signed-off-by: NSantosh Shilimkar <santosh.shilimkar@ti.com>
      Cc: Yinghai Lu <yinghai@kernel.org>
      Acked-by: NTejun Heo <tj@kernel.org>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Russell King <linux@arm.linux.org.uk>
      Cc: "Rafael J. Wysocki" <rjw@sisk.pl>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Christoph Lameter <cl@linux-foundation.org>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Paul Walmsley <paul@pwsan.com>
      Cc: Pavel Machek <pavel@ucw.cz>
      Cc: Tony Lindgren <tony@atomide.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      931d13f5
  19. 13 11月, 2013 2 次提交
    • T
      mm/memblock.c: introduce bottom-up allocation mode · 79442ed1
      Tang Chen 提交于
      The Linux kernel cannot migrate pages used by the kernel.  As a result,
      kernel pages cannot be hot-removed.  So we cannot allocate hotpluggable
      memory for the kernel.
      
      ACPI SRAT (System Resource Affinity Table) contains the memory hotplug
      info.  But before SRAT is parsed, memblock has already started to allocate
      memory for the kernel.  So we need to prevent memblock from doing this.
      
      In a memory hotplug system, any numa node the kernel resides in should be
      unhotpluggable.  And for a modern server, each node could have at least
      16GB memory.  So memory around the kernel image is highly likely
      unhotpluggable.
      
      So the basic idea is: Allocate memory from the end of the kernel image and
      to the higher memory.  Since memory allocation before SRAT is parsed won't
      be too much, it could highly likely be in the same node with kernel image.
      
      The current memblock can only allocate memory top-down.  So this patch
      introduces a new bottom-up allocation mode to allocate memory bottom-up.
      And later when we use this allocation direction to allocate memory, we
      will limit the start address above the kernel.
      Signed-off-by: NTang Chen <tangchen@cn.fujitsu.com>
      Signed-off-by: NZhang Yanfei <zhangyanfei@cn.fujitsu.com>
      Acked-by: NToshi Kani <toshi.kani@hp.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Wanpeng Li <liwanp@linux.vnet.ibm.com>
      Cc: Thomas Renninger <trenn@suse.de>
      Cc: Yinghai Lu <yinghai@kernel.org>
      Cc: Jiang Liu <jiang.liu@huawei.com>
      Cc: Wen Congyang <wency@cn.fujitsu.com>
      Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
      Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
      Cc: Taku Izumi <izumi.taku@jp.fujitsu.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Michal Nazarewicz <mina86@mina86.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      79442ed1
    • T
      mm/memblock.c: factor out of top-down allocation · 1402899e
      Tang Chen 提交于
      [Problem]
      
      The current Linux cannot migrate pages used by the kernel because of the
      kernel direct mapping.  In Linux kernel space, va = pa + PAGE_OFFSET.
      When the pa is changed, we cannot simply update the pagetable and keep the
      va unmodified.  So the kernel pages are not migratable.
      
      There are also some other issues will cause the kernel pages not
      migratable.  For example, the physical address may be cached somewhere and
      will be used.  It is not to update all the caches.
      
      When doing memory hotplug in Linux, we first migrate all the pages in one
      memory device somewhere else, and then remove the device.  But if pages
      are used by the kernel, they are not migratable.  As a result, memory used
      by the kernel cannot be hot-removed.
      
      Modifying the kernel direct mapping mechanism is too difficult to do.  And
      it may cause the kernel performance down and unstable.  So we use the
      following way to do memory hotplug.
      
      [What we are doing]
      
      In Linux, memory in one numa node is divided into several zones.  One of
      the zones is ZONE_MOVABLE, which the kernel won't use.
      
      In order to implement memory hotplug in Linux, we are going to arrange all
      hotpluggable memory in ZONE_MOVABLE so that the kernel won't use these
      memory.  To do this, we need ACPI's help.
      
      In ACPI, SRAT(System Resource Affinity Table) contains NUMA info.  The
      memory affinities in SRAT record every memory range in the system, and
      also, flags specifying if the memory range is hotpluggable.  (Please refer
      to ACPI spec 5.0 5.2.16)
      
      With the help of SRAT, we have to do the following two things to achieve our
      goal:
      
      1. When doing memory hot-add, allow the users arranging hotpluggable as
         ZONE_MOVABLE.
         (This has been done by the MOVABLE_NODE functionality in Linux.)
      
      2. when the system is booting, prevent bootmem allocator from allocating
         hotpluggable memory for the kernel before the memory initialization
         finishes.
      
      The problem 2 is the key problem we are going to solve. But before solving it,
      we need some preparation. Please see below.
      
      [Preparation]
      
      Bootloader has to load the kernel image into memory.  And this memory must
      be unhotpluggable.  We cannot prevent this anyway.  So in a memory hotplug
      system, we can assume any node the kernel resides in is not hotpluggable.
      
      Before SRAT is parsed, we don't know which memory ranges are hotpluggable.
       But memblock has already started to work.  In the current kernel,
      memblock allocates the following memory before SRAT is parsed:
      
      setup_arch()
       |->memblock_x86_fill()            /* memblock is ready */
       |......
       |->early_reserve_e820_mpc_new()   /* allocate memory under 1MB */
       |->reserve_real_mode()            /* allocate memory under 1MB */
       |->init_mem_mapping()             /* allocate page tables, about 2MB to map 1GB memory */
       |->dma_contiguous_reserve()       /* specified by user, should be low */
       |->setup_log_buf()                /* specified by user, several mega bytes */
       |->relocate_initrd()              /* could be large, but will be freed after boot, should reorder */
       |->acpi_initrd_override()         /* several mega bytes */
       |->reserve_crashkernel()          /* could be large, should reorder */
       |......
       |->initmem_init()                 /* Parse SRAT */
      
      According to Tejun's advice, before SRAT is parsed, we should try our best
      to allocate memory near the kernel image.  Since the whole node the kernel
      resides in won't be hotpluggable, and for a modern server, a node may have
      at least 16GB memory, allocating several mega bytes memory around the
      kernel image won't cross to hotpluggable memory.
      
      [About this patchset]
      
      So this patchset is the preparation for the problem 2 that we want to
      solve.  It does the following:
      
      1. Make memblock be able to allocate memory bottom up.
         1) Keep all the memblock APIs' prototype unmodified.
         2) When the direction is bottom up, keep the start address greater than the
            end of kernel image.
      
      2. Improve init_mem_mapping() to support allocate page tables in
         bottom up direction.
      
      3. Introduce "movable_node" boot option to enable and disable this
         functionality.
      
      This patch (of 6):
      
      Create a new function __memblock_find_range_top_down to factor out of
      top-down allocation from memblock_find_in_range_node.  This is a
      preparation because we will introduce a new bottom-up allocation mode in
      the following patch.
      Signed-off-by: NTang Chen <tangchen@cn.fujitsu.com>
      Signed-off-by: NZhang Yanfei <zhangyanfei@cn.fujitsu.com>
      Acked-by: NTejun Heo <tj@kernel.org>
      Acked-by: NToshi Kani <toshi.kani@hp.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Wanpeng Li <liwanp@linux.vnet.ibm.com>
      Cc: Thomas Renninger <trenn@suse.de>
      Cc: Yinghai Lu <yinghai@kernel.org>
      Cc: Jiang Liu <jiang.liu@huawei.com>
      Cc: Wen Congyang <wency@cn.fujitsu.com>
      Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
      Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
      Cc: Taku Izumi <izumi.taku@jp.fujitsu.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Michal Nazarewicz <mina86@mina86.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      1402899e
  20. 12 9月, 2013 1 次提交
    • Y
      memblock, numa: binary search node id · e76b63f8
      Yinghai Lu 提交于
      Current early_pfn_to_nid() on arch that support memblock go over
      memblock.memory one by one, so will take too many try near the end.
      
      We can use existing memblock_search to find the node id for given pfn,
      that could save some time on bigger system that have many entries
      memblock.memory array.
      
      Here are the timing differences for several machines.  In each case with
      the patch less time was spent in __early_pfn_to_nid().
      
                              3.11-rc5        with patch      difference (%)
                              --------        ----------      --------------
      UV1: 256 nodes  9TB:     411.66          402.47         -9.19 (2.23%)
      UV2: 255 nodes 16TB:    1141.02         1138.12         -2.90 (0.25%)
      UV2:  64 nodes  2TB:     128.15          126.53         -1.62 (1.26%)
      UV2:  32 nodes  2TB:     121.87          121.07         -0.80 (0.66%)
                              Time in seconds.
      Signed-off-by: NYinghai Lu <yinghai@kernel.org>
      Cc: Tejun Heo <tj@kernel.org>
      Acked-by: NRuss Anderson <rja@sgi.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      e76b63f8