1. 17 Feb, 2011 3 commits
    • T
      x86-64, NUMA: Unify the rest of memblk registration · fd0435d8
      Tejun Heo committed
      Move the remaining memblk registration logic from acpi_scan_nodes() to
      numa_register_memblks() and initmem_init().
      
      This applies nodes_cover_memory() sanity check, memory node sorting
      and node_online() checking, which were only applied to acpi, to all
      init methods.
      
      As all memblk registration is moved to common code, active range
      clearing is moved to initmem_init() too and removed from bad_srat().
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Yinghai Lu <yinghai@kernel.org>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: Cyrill Gorcunov <gorcunov@gmail.com>
      Cc: Shaohui Zheng <shaohui.zheng@intel.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: H. Peter Anvin <hpa@linux.intel.com>
      fd0435d8
    • T
      x86-64, NUMA: Unify use of memblk in all init methods · 43a662f0
      Tejun Heo committed
      Make both amd and dummy use numa_add_memblk() to describe the detected
      memory blocks.  This allows initmem_init() to call
      numa_register_memblks() regardless of init method in use.  Drop custom
      memory registration code from amd and dummy.
      
      After this change, memblk merge/cleanup in numa_register_memblks() is
      applied to all init methods.
      
      As this makes compute_hash_shift() and numa_register_memblks() used
      only inside numa_64.c, make them static.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Yinghai Lu <yinghai@kernel.org>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: Cyrill Gorcunov <gorcunov@gmail.com>
      Cc: Shaohui Zheng <shaohui.zheng@intel.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: H. Peter Anvin <hpa@linux.intel.com>
      43a662f0
    • T
      x86-64, NUMA: Factor out memblk handling into numa_{add|register}_memblk() · ef396ec9
      Tejun Heo committed
      Factor out memblk handling from srat_64.c into two functions in
      numa_64.c.  This patch doesn't introduce any behavior change.  The
      next patch will make all init methods use these functions.
      
      - v2: Fixed build failure on 32bit due to misplaced NR_NODE_MEMBLKS.
            Reported by Ingo.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Yinghai Lu <yinghai@kernel.org>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: Cyrill Gorcunov <gorcunov@gmail.com>
      Cc: Shaohui Zheng <shaohui.zheng@intel.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: H. Peter Anvin <hpa@linux.intel.com>
      ef396ec9
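The split between adding detected blocks and later merging them can be illustrated with a small user-space model. This is only a sketch under stated assumptions, not the kernel code: `toy_add_memblk()`, `toy_merge_memblks()`, `struct toy_memblk_table` and the two capacity constants are hypothetical names invented for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_MAX_MEMBLKS 16 /* hypothetical capacity, stands in for NR_NODE_MEMBLKS */
#define TOY_MAX_NODES    8 /* hypothetical, stands in for MAX_NUMNODES */

struct toy_memblk { uint64_t start, end; int nid; }; /* covers [start, end) */

struct toy_memblk_table {
    size_t nr;
    struct toy_memblk blk[TOY_MAX_MEMBLKS];
};

/* Record one detected block. Returns 0 on success, -1 on bad input or overflow. */
static int toy_add_memblk(struct toy_memblk_table *t, int nid,
                          uint64_t start, uint64_t end)
{
    assert(t != NULL);
    if (start == end)
        return 0; /* empty block: nothing to record */
    if (start > end || nid < 0 || nid >= TOY_MAX_NODES)
        return -1;
    if (t->nr >= TOY_MAX_MEMBLKS)
        return -1;
    t->blk[t->nr].start = start;
    t->blk[t->nr].end = end;
    t->blk[t->nr].nid = nid;
    t->nr++;
    return 0;
}

/* Merge blocks of the same node that touch or overlap. Returns blocks left. */
static size_t toy_merge_memblks(struct toy_memblk_table *t)
{
    size_t i, j;

    assert(t != NULL);
    for (i = 0; i < t->nr; i++) {
        for (j = i + 1; j < t->nr; ) {
            struct toy_memblk *a = &t->blk[i], *b = &t->blk[j];

            if (a->nid == b->nid && a->start <= b->end && b->start <= a->end) {
                if (b->start < a->start)
                    a->start = b->start;
                if (b->end > a->end)
                    a->end = b->end;
                t->blk[j] = t->blk[--t->nr]; /* drop b by moving the last entry in */
                j = i + 1;                   /* a grew, so rescan the rest */
            } else {
                j++;
            }
        }
    }
    return t->nr;
}
```

Every init method would call the add function for each range it detects, and the common code would run the merge step once afterwards, which is why the cleanup applies no matter which method found the memory.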
  2. 16 Feb, 2011 8 commits
    • T
      x86-64, NUMA: Kill {acpi|amd}_get_nodes() · 19095548
      Tejun Heo committed
      With common numa_nodes[], common code in numa_64.c can access it
      directly.  Copy directly and kill {acpi|amd}_get_nodes().
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Yinghai Lu <yinghai@kernel.org>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: Cyrill Gorcunov <gorcunov@gmail.com>
      Cc: Shaohui Zheng <shaohui.zheng@intel.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: H. Peter Anvin <hpa@linux.intel.com>
      19095548
    • T
      x86-64, NUMA: Use common numa_nodes[] · 206e4208
      Tejun Heo committed
      ACPI and amd are using separate nodes[] arrays.  Add numa_nodes[] and
      use it in all NUMA init methods.  cutoff_node() cleanup is moved
      from srat_64.c to numa_64.c and applied in initmem_init() regardless
      of init methods.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Yinghai Lu <yinghai@kernel.org>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: Cyrill Gorcunov <gorcunov@gmail.com>
      Cc: Shaohui Zheng <shaohui.zheng@intel.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: H. Peter Anvin <hpa@linux.intel.com>
      206e4208
    • T
      x86-64, NUMA: Use common {cpu|mem}_nodes_parsed · ec8cf29b
      Tejun Heo committed
      ACPI and amd are using separate nodes_parsed masks.  Add
      {cpu|mem}_nodes_parsed and use them in all NUMA init methods.
      Initialization of the masks and building node_possible_map are now
      handled commonly by initmem_init().
      
      dummy_numa_init() is updated to set node 0 on both masks.  While at
      it, move the info messages from scan to init.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Yinghai Lu <yinghai@kernel.org>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: Cyrill Gorcunov <gorcunov@gmail.com>
      Cc: Shaohui Zheng <shaohui.zheng@intel.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: H. Peter Anvin <hpa@linux.intel.com>
      ec8cf29b
    • T
      x86-64, NUMA: Restructure initmem_init() · ffe77a46
      Tejun Heo committed
      Reorganize initmem_init() such that,
      
      * Different NUMA init methods are iterated in a consistent way.
      
      * Each iteration re-initializes all the parameters and different
        method can be tried after a failure.
      
      * Dummy init is handled the same as other methods.
      
      Apart from how retry after failure is handled, this patch doesn't change the
      behavior.  The call sequences are kept equivalent across the
      conversion.
      
      After the change, bad_srat() doesn't need to clear apic to node
      mapping or worry about numa_off.  Simplified accordingly.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Yinghai Lu <yinghai@kernel.org>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: Cyrill Gorcunov <gorcunov@gmail.com>
      Cc: Shaohui Zheng <shaohui.zheng@intel.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: H. Peter Anvin <hpa@linux.intel.com>
      ffe77a46
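The iteration scheme described in this commit (try each init method in turn, reset all shared state before each attempt, treat the dummy method like any other) can be sketched in user-space C. This is an illustration only, not the kernel's initmem_init(): `struct toy_numa_state`, `toy_try_methods()` and the two stand-in methods are hypothetical. The 0 / -errno return convention follows the description in commit 940fed2e.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

#define TOY_NR_APICIDS 4 /* hypothetical table size */

struct toy_numa_state {
    int nr_nodes;
    int apicid_to_node[TOY_NR_APICIDS];
};

typedef int (*toy_numa_init_fn)(struct toy_numa_state *st);

/* Bring the shared state back to a known baseline before each attempt. */
static void toy_reset_state(struct toy_numa_state *st)
{
    size_t i;

    st->nr_nodes = 0;
    for (i = 0; i < TOY_NR_APICIDS; i++)
        st->apicid_to_node[i] = -1;
}

/* Returns the index of the first method that succeeds, or -ENODEV. */
static int toy_try_methods(struct toy_numa_state *st,
                           const toy_numa_init_fn *methods, size_t n)
{
    size_t i;

    assert(st != NULL && methods != NULL);
    for (i = 0; i < n; i++) {
        toy_reset_state(st); /* a failed attempt must not leak state */
        if (methods[i](st) == 0 && st->nr_nodes > 0)
            return (int)i;
    }
    return -ENODEV;
}

/* Stand-in for a firmware-based method that fails after touching state. */
static int toy_broken_firmware_init(struct toy_numa_state *st)
{
    st->apicid_to_node[0] = 3;
    return -EINVAL;
}

/* Stand-in for the dummy method: one node that covers everything. */
static int toy_dummy_init(struct toy_numa_state *st)
{
    st->nr_nodes = 1;
    return 0;
}
```

Because the reset happens inside the loop, a method that fails halfway cannot leave stale mappings behind for the next one, which matches the commit's note that bad_srat() no longer has to clean up after itself.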
    • T
      x86, NUMA: Move *_numa_init() invocations into initmem_init() · d8fc3afc
      Tejun Heo committed
      There's no reason for these to live in setup_arch().  Move them inside
      initmem_init().
      
      - v2: x86-32 initmem_init() wasn't updated, breaking 32bit builds.
        Fixed.  Found by Ankita.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Ankita Garg <ankita@in.ibm.com>
      Cc: Yinghai Lu <yinghai@kernel.org>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: Cyrill Gorcunov <gorcunov@gmail.com>
      Cc: Shaohui Zheng <shaohui.zheng@intel.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: H. Peter Anvin <hpa@linux.intel.com>
      d8fc3afc
    • T
      x86-64, NUMA: Unify {acpi|amd}_{numa_init|scan_nodes}() arguments and return values · 940fed2e
      Tejun Heo committed
      The functions used during NUMA initialization - *_numa_init() and
      *_scan_nodes() - have different arguments and return values.  Unify
      them such that they all take no argument and return 0 on success and
      -errno on failure.  This is in preparation for further NUMA init
      cleanups.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Yinghai Lu <yinghai@kernel.org>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: Cyrill Gorcunov <gorcunov@gmail.com>
      Cc: Shaohui Zheng <shaohui.zheng@intel.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: H. Peter Anvin <hpa@linux.intel.com>
      940fed2e
    • T
      x86, NUMA: Drop @start/last_pfn from initmem_init() · 86ef4dbf
      Tejun Heo committed
      initmem_init() extensively accesses and modifies global data
      structures and the parameters aren't even followed depending on which
      path is being used.  Drop @start/last_pfn and let it deal with
      @max_pfn directly.  This is in preparation for further NUMA init
      cleanups.
      
      - v2: x86-32 initmem_init() wasn't updated, breaking 32bit builds.
        Fixed.  Found by Yinghai.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Yinghai Lu <yinghai@kernel.org>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: Cyrill Gorcunov <gorcunov@gmail.com>
      Cc: Shaohui Zheng <shaohui.zheng@intel.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: H. Peter Anvin <hpa@linux.intel.com>
      86ef4dbf
    • T
      x86-64, NUMA: Make dummy node initialization path similar to non-dummy ones · 7d36b7bc
      Tejun Heo committed
      Dummy node initialization in initmem_init() didn't initialize apicid
      to node mapping and set cpu to node mapping directly by calling
      numa_set_node(), which is different from non-dummy init paths.
      
      Update it such that they behave similarly.  Initialize apicid to node
      mapping and call numa_init_array().  The actual cpu to node mapping is
      handled by init_cpu_to_node() later.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Yinghai Lu <yinghai@kernel.org>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: Cyrill Gorcunov <gorcunov@gmail.com>
      Cc: Shaohui Zheng <shaohui.zheng@intel.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: H. Peter Anvin <hpa@linux.intel.com>
      7d36b7bc
  3. 14 Feb, 2011 1 commit
    • D
      x86, numa: Add error handling for bad cpu-to-node mappings · 14392fd3
      David Rientjes committed
      CONFIG_DEBUG_PER_CPU_MAPS may return NUMA_NO_NODE when an
      early_cpu_to_node() mapping hasn't been initialized.  In such a
      case, it emits a warning and continues without an issue but
      callers may try to use the return value to index into an array.
      
      We can catch those errors and fail silently since a warning has
      already been emitted.  No current user of numa_add_cpu()
      requires this error checking to avoid a crash, but it's better
      to be proactive in case a future user happens to have a bug and
      a user tries to diagnose it with CONFIG_DEBUG_PER_CPU_MAPS.
      Reported-by: Jesper Juhl <jj@chaosbits.net>
      Signed-off-by: David Rientjes <rientjes@google.com>
      Cc: Tejun Heo <tj@kernel.org>
      LKML-Reference: <alpine.DEB.2.00.1102071407250.7812@chino.kir.corp.google.com>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      14392fd3
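The defensive check this commit describes amounts to refusing to index an array with NUMA_NO_NODE. A minimal user-space sketch follows; the tables, the `toy_` names and the int return value are assumptions made for the example and do not reflect the kernel function's actual signature.

```c
#include <assert.h>

#define NUMA_NO_NODE  (-1)
#define TOY_NR_CPUS   4 /* hypothetical sizes */
#define TOY_MAX_NODES 2

/* CPUs 2 and 3 have no mapping yet, as during early boot. */
static int toy_cpu_to_node[TOY_NR_CPUS] = { 0, 1, NUMA_NO_NODE, NUMA_NO_NODE };
static unsigned long toy_node_to_cpumask[TOY_MAX_NODES];

/* Returns 0 if the cpu was added to its node's mask, -1 if it was skipped. */
static int toy_numa_add_cpu(int cpu)
{
    int node;

    if (cpu < 0 || cpu >= TOY_NR_CPUS)
        return -1;
    node = toy_cpu_to_node[cpu];
    if (node == NUMA_NO_NODE)
        return -1; /* fail quietly: the lookup is assumed to have warned already */
    assert(node >= 0 && node < TOY_MAX_NODES);
    toy_node_to_cpumask[node] |= 1UL << cpu;
    return 0;
}
```

Without the early return, an unmapped cpu would index the mask array at position -1, which is the kind of out-of-bounds access the commit message warns about.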
  4. 28 Jan, 2011 4 commits
    • T
      x86: Unify NUMA initialization between 32 and 64bit · 8db78cc4
      Tejun Heo committed
      Now that everything else is unified, NUMA initialization can be
      unified too.
      
      * numa_init_array() and init_cpu_to_node() are moved from
        numa_64 to numa.
      
      * numa_32::initmem_init() is updated to call numa_init_array()
        and setup_arch() to call init_cpu_to_node() on 32bit too.
      
      * x86_cpu_to_node_map is now initialized to NUMA_NO_NODE on
        32bit too. This is safe now as numa_init_array() will initialize
        it early during boot.
      
      This makes NUMA mapping fully initialized before
      setup_per_cpu_areas() on 32bit too and thus ensures that the first
      percpu chunk, which contains all the static variables and some of
      the dynamic area, is allocated with NUMA affinity correctly considered.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: yinghai@kernel.org
      Cc: brgerst@gmail.com
      Cc: gorcunov@gmail.com
      Cc: shaohui.zheng@intel.com
      Cc: rientjes@google.com
      LKML-Reference: <1295789862-25482-17-git-send-email-tj@kernel.org>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      Reported-by: Eric Dumazet <eric.dumazet@gmail.com>
      Reviewed-by: Pekka Enberg <penberg@kernel.org>
      8db78cc4
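The commit relies on numa_init_array() to give every cpu a usable node before the per-cpu areas are set up. The sketch below assumes a round-robin policy over the online nodes for cpus that still have no mapping; that policy is an assumption of this example (the commit message does not spell it out), and all `toy_` names are hypothetical.

```c
#include <assert.h>

#define NUMA_NO_NODE (-1)

/* Next online node after 'from', wrapping around; NUMA_NO_NODE if none. */
static int toy_next_online(const int *online, int max_nodes, int from)
{
    int step;

    for (step = 1; step <= max_nodes; step++) {
        int nid = (from + step) % max_nodes;

        if (online[nid])
            return nid;
    }
    return NUMA_NO_NODE;
}

/* Give every cpu that has no node yet one of the online nodes in turn. */
static void toy_numa_init_array(int *cpu_to_node, int nr_cpus,
                                const int *online, int max_nodes)
{
    int cpu, rr;

    assert(max_nodes > 0);
    rr = toy_next_online(online, max_nodes, -1); /* first online node */
    if (rr == NUMA_NO_NODE)
        return;
    for (cpu = 0; cpu < nr_cpus; cpu++) {
        if (cpu_to_node[cpu] != NUMA_NO_NODE)
            continue; /* keep mappings that were detected earlier */
        cpu_to_node[cpu] = rr;
        rr = toy_next_online(online, max_nodes, rr);
    }
}
```

Starting from NUMA_NO_NODE instead of 0 is what makes "not yet assigned" distinguishable from "assigned to node 0", which is why the default value can only change once this initialization runs early on 32bit as well.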
    • T
      x86: Unify node_to_cpumask_map handling between 32 and 64bit · de2d9445
      Tejun Heo committed
      x86_32 has been managing node_to_cpumask_map explicitly from
      map_cpu_to_node() and friends in a rather ugly way.  With
      previous changes, it's now possible to share the code with
      64bit.
      
      * When CONFIG_NUMA_EMU is disabled, numa_add/remove_cpu() are
        implemented in numa.c and shared by 32 and 64bit.  CONFIG_NUMA_EMU
        versions still live in numa_64.c.
      
        NUMA_EMU's dependency on 64bit is planned to be removed and the
        above should go away together.
      
      * identify_cpu() now calls numa_add_cpu() for 32bit too.  This
        makes the explicit mask management from map_cpu_to_node() unnecessary.
      
      * The whole x86_32 specific map_cpu_to_node() chunk is no longer
        necessary.  Dropped.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Reviewed-by: Pekka Enberg <penberg@kernel.org>
      Cc: eric.dumazet@gmail.com
      Cc: yinghai@kernel.org
      Cc: brgerst@gmail.com
      Cc: gorcunov@gmail.com
      Cc: shaohui.zheng@intel.com
      Cc: rientjes@google.com
      LKML-Reference: <1295789862-25482-16-git-send-email-tj@kernel.org>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Shaohui Zheng <shaohui.zheng@intel.com>
      de2d9445
    • T
      x86: Unify CPU -> NUMA node mapping between 32 and 64bit · 645a7919
      Tejun Heo committed
      Unlike 64bit, 32bit has been using its own cpu_to_node_map[] for
      CPU -> NUMA node mapping.  Replace it with early_percpu variable
      x86_cpu_to_node_map and share the mapping code with 64bit.
      
      * USE_PERCPU_NUMA_NODE_ID is now enabled for 32bit too.
      
      * x86_cpu_to_node_map and numa_set/clear_node() are moved from
        numa_64 to numa.  For now, on 32bit, x86_cpu_to_node_map is initialized
        with 0 instead of NUMA_NO_NODE.  This is to avoid introducing unexpected
        behavior change and will be updated once init path is unified.
      
      * srat_detect_node() is now enabled for x86_32 too.  It calls
        numa_set_node() and initializes the mapping making explicit
        cpu_to_node_map[] updates from map/unmap_cpu_to_node() unnecessary.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: eric.dumazet@gmail.com
      Cc: yinghai@kernel.org
      Cc: brgerst@gmail.com
      Cc: gorcunov@gmail.com
      Cc: penberg@kernel.org
      Cc: shaohui.zheng@intel.com
      Cc: rientjes@google.com
      LKML-Reference: <1295789862-25482-15-git-send-email-tj@kernel.org>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      Cc: David Rientjes <rientjes@google.com>
      645a7919
    • T
      x86: Unify cpu/apicid <-> NUMA node mapping between 32 and 64bit · bbc9e2f4
      Tejun Heo committed
      The mapping between cpu/apicid and node is done via
      apicid_to_node[] on 64bit and apicid_2_node[] +
      apic->x86_32_numa_cpu_node() on 32bit. This difference makes it
      difficult to further unify 32 and 64bit NUMA handling.
      
      This patch unifies it by replacing both apicid_to_node[] and
      apicid_2_node[] with __apicid_to_node[] array, which is accessed
      by two accessors - set_apicid_to_node() and numa_cpu_node().  On
      64bit, numa_cpu_node() always consults __apicid_to_node[]
      directly while 32bit goes through apic->numa_cpu_node() method
      to allow apic implementations to override it.
      
      srat_detect_node() for amd cpus contains workaround for broken
      NUMA configuration which assumes relationship between APIC ID,
      HT node ID and NUMA topology.  Leave it to access
      __apicid_to_node[] directly as mapping through CPU might result
      in undesirable behavior change.  The comment is reformatted and
      updated to note the ugliness.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Reviewed-by: Pekka Enberg <penberg@kernel.org>
      Cc: eric.dumazet@gmail.com
      Cc: yinghai@kernel.org
      Cc: brgerst@gmail.com
      Cc: gorcunov@gmail.com
      Cc: shaohui.zheng@intel.com
      Cc: rientjes@google.com
      LKML-Reference: <1295789862-25482-14-git-send-email-tj@kernel.org>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      Cc: David Rientjes <rientjes@google.com>
      bbc9e2f4
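The accessor pattern described here (one private table, written through a setter and read through a per-cpu lookup) can be sketched as follows. This is a user-space illustration that mirrors only the 64bit behavior described above, where the lookup consults the table directly; the table sizes, the cpu-to-apicid table and all `toy_` names are hypothetical.

```c
#include <assert.h>

#define NUMA_NO_NODE    (-1)
#define TOY_MAX_APICIDS 8 /* hypothetical sizes */
#define TOY_NR_CPUS     4

/* Private table: callers are expected to go through the accessors below. */
static int toy_apicid_to_node[TOY_MAX_APICIDS];
static const int toy_cpu_to_apicid[TOY_NR_CPUS] = { 0, 2, 4, 6 };

/* Mark every apicid as unmapped. */
static void toy_reset_apicid_map(void)
{
    int i;

    for (i = 0; i < TOY_MAX_APICIDS; i++)
        toy_apicid_to_node[i] = NUMA_NO_NODE;
}

/* Setter: out-of-range apicids are ignored. */
static void toy_set_apicid_to_node(int apicid, int node)
{
    assert(node >= NUMA_NO_NODE);
    if (apicid < 0 || apicid >= TOY_MAX_APICIDS)
        return;
    toy_apicid_to_node[apicid] = node;
}

/* Getter: translate cpu -> apicid -> node. */
static int toy_numa_cpu_node(int cpu)
{
    int apicid;

    if (cpu < 0 || cpu >= TOY_NR_CPUS)
        return NUMA_NO_NODE;
    apicid = toy_cpu_to_apicid[cpu];
    return toy_apicid_to_node[apicid];
}
```

Keeping the table behind two accessors is what lets the 32bit side route the lookup through an apic-specific method without touching any callers.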
  5. 19 Jan, 2011 1 commit
  6. 07 Jan, 2011 1 commit
    • D
      x86, numa: Fix CONFIG_DEBUG_PER_CPU_MAPS without NUMA emulation · d906f0eb
      David Rientjes committed
      "x86, numa: Fake node-to-cpumask for NUMA emulation" broke the
      build when CONFIG_DEBUG_PER_CPU_MAPS is set and CONFIG_NUMA_EMU
      is not.  This is because it is possible to map a cpu to multiple
      nodes when NUMA emulation is used; the patch required a physical
      node address table to find those nodes that was only available
      when CONFIG_NUMA_EMU was enabled.
      
      This extracts the common debug functionality to its own function
      for CONFIG_DEBUG_PER_CPU_MAPS and uses it regardless of whether
      CONFIG_NUMA_EMU is set or not.
      
      NUMA emulation will now iterate over the set of possible nodes
      for each cpu and call the new debug function whereas only the
      cpu's node will be used without NUMA emulation enabled.
      Reported-by: Ingo Molnar <mingo@elte.hu>
      Signed-off-by: David Rientjes <rientjes@google.com>
      Acked-by: Yinghai Lu <yinghai@kernel.org>
      LKML-Reference: <alpine.DEB.2.00.1012301053590.12995@chino.kir.corp.google.com>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      d906f0eb
  7. 30 Dec, 2010 2 commits
    • Y
      x86-64, numa: Put pgtable to local node memory · 1411e0ec
      Yinghai Lu committed
      Introduce init_memory_mapping_high() and use it on 64bit.
      
      It walks every memory segment above 4G and creates the page tables
      for each memory range within that range itself.
      
      Before this patch, all page tables were on one node.
      
      With this patch, one RED-PEN is killed.
      
      Debug output for an 8-socket system after the patch:
      [    0.000000] initial memory mapped : 0 - 20000000
      [    0.000000] init_memory_mapping: [0x00000000000000-0x0000007f74ffff]
      [    0.000000]  0000000000 - 007f600000 page 2M
      [    0.000000]  007f600000 - 007f750000 page 4k
      [    0.000000] kernel direct mapping tables up to 7f750000 @ [0x7f74c000-0x7f74ffff]
      [    0.000000] RAMDISK: 7bc84000 - 7f745000
      ....
      [    0.000000] Adding active range (0, 0x10, 0x95) 0 entries of 3200 used
      [    0.000000] Adding active range (0, 0x100, 0x7f750) 1 entries of 3200 used
      [    0.000000] Adding active range (0, 0x100000, 0x1080000) 2 entries of 3200 used
      [    0.000000] Adding active range (1, 0x1080000, 0x2080000) 3 entries of 3200 used
      [    0.000000] Adding active range (2, 0x2080000, 0x3080000) 4 entries of 3200 used
      [    0.000000] Adding active range (3, 0x3080000, 0x4080000) 5 entries of 3200 used
      [    0.000000] Adding active range (4, 0x4080000, 0x5080000) 6 entries of 3200 used
      [    0.000000] Adding active range (5, 0x5080000, 0x6080000) 7 entries of 3200 used
      [    0.000000] Adding active range (6, 0x6080000, 0x7080000) 8 entries of 3200 used
      [    0.000000] Adding active range (7, 0x7080000, 0x8080000) 9 entries of 3200 used
      [    0.000000] init_memory_mapping: [0x00000100000000-0x0000107fffffff]
      [    0.000000]  0100000000 - 1080000000 page 2M
      [    0.000000] kernel direct mapping tables up to 1080000000 @ [0x107ffbd000-0x107fffffff]
      [    0.000000]     memblock_x86_reserve_range: [0x107ffc2000-0x107fffffff]          PGTABLE
      [    0.000000] init_memory_mapping: [0x00001080000000-0x0000207fffffff]
      [    0.000000]  1080000000 - 2080000000 page 2M
      [    0.000000] kernel direct mapping tables up to 2080000000 @ [0x207ff7d000-0x207fffffff]
      [    0.000000]     memblock_x86_reserve_range: [0x207ffc0000-0x207fffffff]          PGTABLE
      [    0.000000] init_memory_mapping: [0x00002080000000-0x0000307fffffff]
      [    0.000000]  2080000000 - 3080000000 page 2M
      [    0.000000] kernel direct mapping tables up to 3080000000 @ [0x307ff3d000-0x307fffffff]
      [    0.000000]     memblock_x86_reserve_range: [0x307ffc0000-0x307fffffff]          PGTABLE
      [    0.000000] init_memory_mapping: [0x00003080000000-0x0000407fffffff]
      [    0.000000]  3080000000 - 4080000000 page 2M
      [    0.000000] kernel direct mapping tables up to 4080000000 @ [0x407fefd000-0x407fffffff]
      [    0.000000]     memblock_x86_reserve_range: [0x407ffc0000-0x407fffffff]          PGTABLE
      [    0.000000] init_memory_mapping: [0x00004080000000-0x0000507fffffff]
      [    0.000000]  4080000000 - 5080000000 page 2M
      [    0.000000] kernel direct mapping tables up to 5080000000 @ [0x507febd000-0x507fffffff]
      [    0.000000]     memblock_x86_reserve_range: [0x507ffc0000-0x507fffffff]          PGTABLE
      [    0.000000] init_memory_mapping: [0x00005080000000-0x0000607fffffff]
      [    0.000000]  5080000000 - 6080000000 page 2M
      [    0.000000] kernel direct mapping tables up to 6080000000 @ [0x607fe7d000-0x607fffffff]
      [    0.000000]     memblock_x86_reserve_range: [0x607ffc0000-0x607fffffff]          PGTABLE
      [    0.000000] init_memory_mapping: [0x00006080000000-0x0000707fffffff]
      [    0.000000]  6080000000 - 7080000000 page 2M
      [    0.000000] kernel direct mapping tables up to 7080000000 @ [0x707fe3d000-0x707fffffff]
      [    0.000000]     memblock_x86_reserve_range: [0x707ffc0000-0x707fffffff]          PGTABLE
      [    0.000000] init_memory_mapping: [0x00007080000000-0x0000807fffffff]
      [    0.000000]  7080000000 - 8080000000 page 2M
      [    0.000000] kernel direct mapping tables up to 8080000000 @ [0x807fdfc000-0x807fffffff]
      [    0.000000]     memblock_x86_reserve_range: [0x807ffbf000-0x807fffffff]          PGTABLE
      [    0.000000] Initmem setup node 0 [0000000000000000-000000107fffffff]
      [    0.000000]   NODE_DATA [0x0000107ffbd000-0x0000107ffc1fff]
      [    0.000000] Initmem setup node 1 [0000001080000000-000000207fffffff]
      [    0.000000]   NODE_DATA [0x0000207ffbb000-0x0000207ffbffff]
      [    0.000000] Initmem setup node 2 [0000002080000000-000000307fffffff]
      [    0.000000]   NODE_DATA [0x0000307ffbb000-0x0000307ffbffff]
      [    0.000000] Initmem setup node 3 [0000003080000000-000000407fffffff]
      [    0.000000]   NODE_DATA [0x0000407ffbb000-0x0000407ffbffff]
      [    0.000000] Initmem setup node 4 [0000004080000000-000000507fffffff]
      [    0.000000]   NODE_DATA [0x0000507ffbb000-0x0000507ffbffff]
      [    0.000000] Initmem setup node 5 [0000005080000000-000000607fffffff]
      [    0.000000]   NODE_DATA [0x0000607ffbb000-0x0000607ffbffff]
      [    0.000000] Initmem setup node 6 [0000006080000000-000000707fffffff]
      [    0.000000]   NODE_DATA [0x0000707ffbb000-0x0000707ffbffff]
      [    0.000000] Initmem setup node 7 [0000007080000000-000000807fffffff]
      [    0.000000]   NODE_DATA [0x0000807ffba000-0x0000807ffbefff]
      Signed-off-by: Yinghai Lu <yinghai@kernel.org>
      LKML-Reference: <4D1933D1.9020609@kernel.org>
      Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
      1411e0ec
    • Y
      x86-64, numa: Allocate memnodemap under max_pfn_mapped · dbef7b56
      Yinghai Lu committed
      We need to access it right away, so make sure that it is mapped already.
      
      This prepares for putting page tables on the local node; the nodemap is used before that.
      Signed-off-by: Yinghai Lu <yinghai@kernel.org>
      LKML-Reference: <4D1933C8.7060105@kernel.org>
      Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
      dbef7b56
  8. 24 Dec, 2010 3 commits
    • D
      x86, numa: Fix cpu to node mapping for sparse node ids · a387e95a
      David Rientjes committed
      NUMA boot code assumes that physical node ids start at 0, but the DIMMs
      that the apic id represents may not be reachable.  If this is the case,
      node 0 is never online and cpus never end up getting appropriately
      assigned to a node.  This causes the cpumask of all online nodes to be
      empty and machines crash with kernel code assuming online nodes have
      valid cpus.
      
      The fix is to appropriately map all the address ranges for physical nodes
      and ensure the cpu to node mapping function checks all possible nodes (up
      to MAX_NUMNODES) instead of simply checking nodes 0-N, where N is the
      number of physical nodes, for valid address ranges.
      
      This requires no longer "compressing" the address ranges of nodes in the
      physical node map from 0-N, but rather leave indices in physnodes[] to
      represent the actual node id of the physical node.  Accordingly, the
      topology exported by both amd_get_nodes() and acpi_get_nodes() no longer
      must return the number of nodes to iterate through; all such iterations
      will now be to MAX_NUMNODES.
      
      This change also passes the end address of system RAM (which may be
      different from normal operation if mem= is specified on the command line)
      before the physnodes[] array is populated.  ACPI parsed nodes are
      truncated to fit within the address range that respects the mem=
      boundaries and even some physical nodes may become unreachable in such
      cases.
      
      When NUMA emulation does succeed, any apicid to node mappings that exist
      for unreachable nodes are given default values so that proximity domains
      can still be assigned.  This is important for node_distance() to
      function as desired.
      Signed-off-by: David Rientjes <rientjes@google.com>
      LKML-Reference: <alpine.DEB.2.00.1012221702090.3701@chino.kir.corp.google.com>
      Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
      a387e95a
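The core of this fix is that a lookup has to walk every possible node id instead of assuming the populated ids are 0..N-1. A user-space sketch of that idea follows; `TOY_MAX_NUMNODES`, `struct toy_physnode` and `toy_find_node()` are hypothetical names, and the table is indexed by the real node id as the commit describes for physnodes[].

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_MAX_NUMNODES 8 /* hypothetical, stands in for MAX_NUMNODES */

/* Indexed by the real node id; start == end marks an unused slot. */
struct toy_physnode { uint64_t start, end; };

/* Returns the id of the node whose range contains addr, or -1 if none. */
static int toy_find_node(const struct toy_physnode *pn, uint64_t addr)
{
    int nid;

    assert(pn != NULL);
    /* Walk every possible id: populated ids may be sparse. */
    for (nid = 0; nid < TOY_MAX_NUMNODES; nid++) {
        if (pn[nid].start == pn[nid].end)
            continue; /* no memory recorded for this id */
        if (addr >= pn[nid].start && addr < pn[nid].end)
            return nid;
    }
    return -1;
}
```

A loop bounded by the count of populated nodes would stop after two iterations on a machine that only has nodes 2 and 5, and would never find either of them.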
    • D
      x86, numa: Fake node-to-cpumask for NUMA emulation · c1c3443c
      David Rientjes committed
      It's necessary to fake the node-to-cpumask mapping so that an emulated
      node ID returns a cpumask that includes all cpus that have affinity to
      the memory it represents.
      
      This is a little intrusive because it requires knowledge of the physical
      topology of the system.  setup_physnodes() gives us that information, but
      since NUMA emulation ends up altering the physnodes array, it's necessary
      to reset it before cpus are brought online.
      
      Accordingly, the physnodes array is moved out of init.data and into
      cpuinit.data since it will be needed on cpuup callbacks.
      
      This works regardless of whether numa=fake is used on the command line,
      or the setup of the fake node succeeds or fails.  The physnodes array
      always contains the physical topology of the machine if CONFIG_NUMA_EMU
      is enabled and can be used to setup the correct node-to-cpumask mappings
      in all cases since setup_physnodes() is called whenever the array needs
      to be repopulated with the correct data.
      
      To fake the actual mappings, numa_add_cpu() and numa_remove_cpu() are
      rewritten for CONFIG_NUMA_EMU so that we first find the physical node to
      which each cpu has local affinity, then iterate through all online nodes
      to find the emulated nodes that have local affinity to that physical
      node, and then finally map the cpu to each of those emulated nodes.
      Signed-off-by: David Rientjes <rientjes@google.com>
      LKML-Reference: <alpine.DEB.2.00.1012221701520.3701@chino.kir.corp.google.com>
      Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
      c1c3443c
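The mapping step in the last paragraph of this commit (take the cpu's physical node, then set the cpu in the mask of every emulated node carved out of it) can be sketched in user-space C. All names here are hypothetical, and a plain unsigned long stands in for a cpumask, so the sketch only works for cpu numbers below the width of that type.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct toy_range { uint64_t start, end; }; /* covers [start, end) */

/*
 * Set 'cpu' in the mask of every emulated node whose start address lies
 * inside the physical node that the cpu is local to.
 */
static void toy_numa_add_cpu(int cpu, const struct toy_range *phys,
                             const struct toy_range *emu, size_t nr_emu,
                             unsigned long *emu_cpumask)
{
    size_t i;

    assert(phys != NULL && emu != NULL && emu_cpumask != NULL);
    if (phys->start == phys->end)
        return; /* the physical node has no memory */
    for (i = 0; i < nr_emu; i++) {
        if (emu[i].start >= phys->start && emu[i].start < phys->end)
            emu_cpumask[i] |= 1UL << cpu;
    }
}
```

With two physical nodes split into four emulated nodes, a cpu on the first physical node ends up in the masks of the first two emulated nodes only, so the reported topology still reflects the real affinity.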
    • D
      x86, numa: Fake apicid and pxm mappings for NUMA emulation · f51bf307
      David Rientjes committed
      This patch adds the equivalent of acpi_fake_nodes() for AMD Northbridge
      platforms.  The goal is to fake the apicid-to-node mappings for NUMA
      emulation so the physical topology of the machine is correctly maintained
      within the kernel.
      
      This change also fakes proximity domains for both ACPI and k8 code so the
      physical distance between emulated nodes is maintained via
      node_distance().  This exports the correct distances via
      /sys/devices/system/node/.../distance based on the underlying topology.
      
      A new helper function, fake_physnodes(), is introduced to correctly
      invoke the correct NUMA code to fake these two mappings based on the
      system type.  If there is no underlying NUMA configuration, all cpus are
      mapped to node 0 for local distance.
      
      Since acpi_fake_nodes() is no longer called with CONFIG_ACPI_NUMA, its
      prototype can be removed from the header file for such a configuration.
      Signed-off-by: David Rientjes <rientjes@google.com>
      LKML-Reference: <alpine.DEB.2.00.1012221701360.3701@chino.kir.corp.google.com>
      Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
      f51bf307
  9. 18 Nov, 2010 1 commit
  10. 29 Oct, 2010 1 commit
  11. 21 Sep, 2010 1 commit
  12. 28 Aug, 2010 3 commits
    • Y
      x86: Remove old bootmem code · 774ea0bc
      Yinghai Lu committed
      Requested by Ingo, Thomas and HPA.
      
      The old bootmem code is no longer necessary, and the transition is
      complete.  Remove it.
      Signed-off-by: Yinghai Lu <yinghai@kernel.org>
      Signed-off-by: H. Peter Anvin <hpa@zytor.com>
      774ea0bc
    • Y
      x86, memblock: Replace e820_/_early string with memblock_ · a9ce6bc1
      Yinghai Lu committed
      1. Include linux/memblock.h directly, so that references to e820.h can be reduced later.
      2. This patch was mainly generated with sed scripts.
      
      -v2: Use MEMBLOCK_ERROR instead of -1ULL or -1UL
      Signed-off-by: Yinghai Lu <yinghai@kernel.org>
      Signed-off-by: H. Peter Anvin <hpa@zytor.com>
      a9ce6bc1
    • Y
      x86: Use memblock to replace early_res · 72d7c3b3
      Yinghai Lu committed
      1. replace find_e820_area with memblock_find_in_range
      2. replace reserve_early with memblock_x86_reserve_range
      3. replace free_early with memblock_x86_free_range.
      4. NO_BOOTMEM will switch to use memblock too.
      5. Use _e820/_early wrappers in this patch; a following patch will
         replace them all.
      6. Because memblock_x86_free_range() supports partial free, some special-case handling can be removed.
      7. Need to make sure that memblock_find_in_range() is called after memblock_x86_fill(),
         so move some calls later in setup.c::setup_arch()
         -- corruption_check and mptable_update
      
      -v2: Move reserve_brk() earlier, before fill_memblock_area(), to avoid
          overlap between brk and memblock_find_in_range().  That could happen
          when there are more than 128 RAM entries in the E820 tables, because
          memblock_x86_fill() could use memblock_find_in_range() to find a new place for
          the memblock.memory.region array.
          We also don't need to use extend_brk() after fill_memblock_area(),
          so move reserve_brk() before fill_memblock_area().
      -v3: Move find_smp_config earlier,
          to make sure memblock_find_in_range() does not pick a wrong place if the BIOS
          doesn't put the mptable in the right place.
      -v4: Treat RESERVED_KERN as RAM in memblock.memory; those ranges are
          already in memblock.reserved.
          Use __NOT_KEEP_MEMBLOCK to make sure memblock related code can be freed later.
      -v5: The generic __memblock_find_in_range() goes from high to low, and on 32bit
          active_region includes high pages, so the limit needs to be
          replaced with memblock.default_alloc_limit, aka get_max_mapped()
      -v6: Use current_limit instead
      -v7: Check with MEMBLOCK_ERROR instead of -1ULL or -1L
      -v8: Set memblock_can_resize early to handle EFI with more RAM entries
      -v9: Update after kmemleak changes in mainline
      Suggested-by: David S. Miller <davem@davemloft.net>
      Suggested-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Suggested-by: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Yinghai Lu <yinghai@kernel.org>
      Signed-off-by: H. Peter Anvin <hpa@zytor.com>
      72d7c3b3
  13. 28 May, 2010 1 commit
  14. 16 Feb, 2010 3 commits
    • D
      x86, numa: Remove configurable node size support for numa emulation · ca2107c9
      David Rientjes committed
      Now that numa=fake=<size>[MG] is implemented, it is possible to remove
      configurable node size support.  The command-line parsing was already
      broken (numa=fake=*128, for example, would not work) and since fake nodes
      are now interleaved over physical nodes, this support is no longer
      required.
      Signed-off-by: David Rientjes <rientjes@google.com>
      LKML-Reference: <alpine.DEB.2.00.1002151343080.26927@chino.kir.corp.google.com>
      Signed-off-by: H. Peter Anvin <hpa@zytor.com>
      ca2107c9
    • D
      x86, numa: Add fixed node size option for numa emulation · 8df5bb34
      David Rientjes committed
      numa=fake=N specifies the number of fake nodes, N, to partition the
      system into and then allocates them by interleaving over physical nodes.
      This requires knowledge of the system capacity when attempting to
      allocate nodes of a certain size: either very large nodes to benchmark
      scalability of code that operates on individual nodes, or very small
      nodes to find bugs in the VM.
      
      This patch introduces numa=fake=<size>[MG] so it is possible to specify
      the size of each node to allocate.  When used, nodes of the size
      specified will be allocated and interleaved over the set of physical
      nodes.
      
      FAKE_NODE_MIN_SIZE was also moved to the more-appropriate
      include/asm/numa_64.h.
      Signed-off-by: David Rientjes <rientjes@google.com>
      LKML-Reference: <alpine.DEB.2.00.1002151342510.26927@chino.kir.corp.google.com>
      Signed-off-by: H. Peter Anvin <hpa@zytor.com>
      8df5bb34
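
The two forms of the option described above, numa=fake=N and numa=fake=<size>[MG], could be told apart roughly as follows. This is a minimal userspace sketch, not the kernel's parser: the function name parse_numa_fake() and its return convention are hypothetical, and it assumes only the uppercase M and G suffixes named in the commit message.

```c
#include <ctype.h>
#include <stdint.h>
#include <stdlib.h>

/*
 * Hypothetical helper: classify the argument that follows "numa=fake=".
 *
 * Returns  1 and stores a byte count in *size for "<size>M" or "<size>G",
 * returns  0 and stores a node count in *count for a plain number "N",
 * returns -1 for anything else (for example "*128").
 */
int parse_numa_fake(const char *arg, uint64_t *size, unsigned long *count)
{
	char *end;
	unsigned long long val;

	/* Reject empty strings, signs, and leading punctuation up front. */
	if (!isdigit((unsigned char)arg[0]))
		return -1;

	val = strtoull(arg, &end, 10);
	if (val == 0)
		return -1;

	/* Fixed node size: a single M or G suffix and nothing after it. */
	if ((*end == 'M' || *end == 'G') && end[1] == '\0') {
		*size = (uint64_t)val << (*end == 'M' ? 20 : 30);
		return 1;
	}

	/* Plain number: the count of fake nodes to create. */
	if (*end == '\0') {
		*count = (unsigned long)val;
		return 0;
	}

	return -1;
}
```

With this sketch, "512M" and "2G" are treated as fixed node sizes, "8" as a node count, and the old "*128" syntax is rejected.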
    • D
      x86, numa: Fix numa emulation calculation of big nodes · 68fd111e
      David Rientjes committed
      numa=fake=N uses split_nodes_interleave() to partition the system into N
      fake nodes.  Each node size must be a multiple of
      FAKE_NODE_MIN_SIZE, otherwise it is possible to get strange alignments.
      Because of this, the remaining memory from each node when rounded to
      FAKE_NODE_MIN_SIZE is consolidated into a number of "big nodes" that are
      bigger than the rest.
      
      The calculation of the number of big nodes is incorrect since it is using
      a logical AND operator when it should be multiplying the rounded-off
      portion of each node with N.
      Signed-off-by: David Rientjes <rientjes@google.com>
      LKML-Reference: <alpine.DEB.2.00.1002151342230.26927@chino.kir.corp.google.com>
      Signed-off-by: H. Peter Anvin <hpa@zytor.com>
      68fd111e
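
The big-node arithmetic described above can be sketched in a few lines. This is an illustration under stated assumptions, not the kernel source: the 64 MB value for FAKE_NODE_MIN_SIZE, the struct fake_split type, and both function names are assumptions made for this example, and `total` stands for the usable memory being split.

```c
#include <stdint.h>

/* Assumed granule for this sketch; the real value lives in the kernel headers. */
#define FAKE_NODE_MIN_SIZE	((uint64_t)64 << 20)
#define FAKE_NODE_MIN_HASH_MASK	(~(FAKE_NODE_MIN_SIZE - 1))

/* Hypothetical result type: the rounded node size and the number of big nodes. */
struct fake_split {
	uint64_t size;	/* per-node size, rounded down to FAKE_NODE_MIN_SIZE */
	uint64_t big;	/* nodes that can each take one extra granule */
};

/* Corrected calculation: multiply the per-node remainder by the node count. */
struct fake_split split_fake_size(uint64_t total, unsigned int nr_nodes)
{
	struct fake_split s;
	uint64_t raw = total / nr_nodes;
	uint64_t rem = raw & ~FAKE_NODE_MIN_HASH_MASK;	/* rounded-off part per node */

	s.big = (rem * nr_nodes) / FAKE_NODE_MIN_SIZE;
	s.size = raw & FAKE_NODE_MIN_HASH_MASK;
	return s;
}

/* The mistaken form: AND with the node count instead of multiplying by it. */
uint64_t buggy_big_nodes(uint64_t total, unsigned int nr_nodes)
{
	uint64_t rem = (total / nr_nodes) & ~FAKE_NODE_MIN_HASH_MASK;

	return (rem & nr_nodes) / FAKE_NODE_MIN_SIZE;
}
```

For example, splitting 704 MB into 4 nodes gives 176 MB per node before rounding. Each node rounds down to 128 MB and leaves 48 MB over, so 4 x 48 MB = 192 MB can be consolidated into three extra 64 MB granules. The AND form finds none of them.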
  15. 13 Feb, 2010 1 commit
  16. 11 Feb, 2010 2 commits
    • Y
      x86: Make early_node_mem get mem > 4 GB if possible · cef625ee
      Yinghai Lu committed
      So that we can put pgdata for the node high, and the sparse
      vmmap can later get the sections it needs.
      
      With this patch, RAM below 4 GB is not used for the sparse vmmap.
      
      Before this patch, just before swiotlb tries to get bootmem, we get:
      [    0.000000] nid=1 start=0 end=2080000 aligned=1
      [    0.000000]   free [10 - 96]
      [    0.000000]   free [b12 - 1000]
      [    0.000000]   free [359f - 38a3]
      [    0.000000]   free [38b5 - 3a00]
      [    0.000000]   free [41e01 - 42000]
      [    0.000000]   free [73dde - 73e00]
      [    0.000000]   free [73fdd - 74000]
      [    0.000000]   free [741dd - 74200]
      [    0.000000]   free [743dd - 74400]
      [    0.000000]   free [745dd - 74600]
      [    0.000000]   free [747dd - 74800]
      [    0.000000]   free [749dd - 74a00]
      [    0.000000]   free [74bdd - 74c00]
      [    0.000000]   free [74ddd - 74e00]
      [    0.000000]   free [74fdd - 75000]
      [    0.000000]   free [751dd - 75200]
      [    0.000000]   free [753dd - 75400]
      [    0.000000]   free [755dd - 75600]
      [    0.000000]   free [757dd - 75800]
      [    0.000000]   free [759dd - 75a00]
      [    0.000000]   free [75bdd - 7bf5f]
      [    0.000000]   free [7f730 - 7f750]
      [    0.000000]   free [100000 - 2080000]
      [    0.000000]   total free 1f87170
      [   93.301474] Placing 64MB software IO TLB between ffff880075bdd000 - ffff880079bdd000
      [   93.311814] software IO TLB at phys 0x75bdd000 - 0x79bdd000
      
      With this patch, just before swiotlb tries to get bootmem, we get:
      [    0.000000] nid=1 start=0 end=2080000 aligned=1
      [    0.000000]   free [a - 96]
      [    0.000000]   free [702 - 1000]
      [    0.000000]   free [359f - 3600]
      [    0.000000]   free [37de - 3800]
      [    0.000000]   free [39dd - 3a00]
      [    0.000000]   free [3bdd - 3c00]
      [    0.000000]   free [3ddd - 3e00]
      [    0.000000]   free [3fdd - 4000]
      [    0.000000]   free [41dd - 4200]
      [    0.000000]   free [43dd - 4400]
      [    0.000000]   free [45dd - 4600]
      [    0.000000]   free [47dd - 4800]
      [    0.000000]   free [49dd - 4a00]
      [    0.000000]   free [4bdd - 4c00]
      [    0.000000]   free [4ddd - 4e00]
      [    0.000000]   free [4fdd - 5000]
      [    0.000000]   free [51dd - 5200]
      [    0.000000]   free [53dd - 5400]
      [    0.000000]   free [55dd - 7bf5f]
      [    0.000000]   free [7f730 - 7f750]
      [    0.000000]   free [100428 - 100600]
      [    0.000000]   free [13ea01 - 13ec00]
      [    0.000000]   free [170800 - 2080000]
      [    0.000000]   total free 1f87170
      
      [   92.689485] PCI-DMA: Using software bounce buffering for IO (SWIOTLB)
      [   92.699799] Placing 64MB software IO TLB between ffff8800055dd000 - ffff8800095dd000
      [   92.710916] software IO TLB at phys 0x55dd000 - 0x95dd000
      
      So there is enough space left below 4 GB, aka pfn 0x100000.
      Signed-off-by: Yinghai Lu <yinghai@kernel.org>
      LKML-Reference: <1265793639-15071-15-git-send-email-yinghai@kernel.org>
      Signed-off-by: H. Peter Anvin <hpa@zytor.com>
      cef625ee
    • Y
      x86: Call early_res_to_bootmem one time · 1842f90c
      Yinghai Lu committed
      Simplify setup_node_mem: don't use bootmem from other node, instead
      just find_e820_area in early_node_mem.
      
      This keeps the boundary between early_res and boot mem more clear, and
      lets us only call early_res_to_bootmem() one time instead of for all
      nodes.
      Signed-off-by: Yinghai Lu <yinghai@kernel.org>
      LKML-Reference: <1265793639-15071-12-git-send-email-yinghai@kernel.org>
      Signed-off-by: H. Peter Anvin <hpa@zytor.com>
      1842f90c
  17. 23 Nov, 2009 2 commits
    • Y
      x86, numa: Use near(er) online node instead of roundrobin for NUMA · d9c2d5ac
      Yinghai Lu committed
      CPU to node mapping is set via the following sequence:
      
       1. numa_init_array(): Set up roundrobin from cpu to online node
      
       2. init_cpu_to_node(): Set the mapping according to apicid_to_node[],
          which comes from the SRAT. Only nodes that are online are
          handled; cpus on a node without ram (aka not online) are
          left on roundrobin.

       3. Later, srat_detect_node() is called for Intel/AMD and uses
          the first online node or a nearby node.
      
      Problem is that setup_per_cpu_areas() is not called between 2 and 3,
      the per_cpu for cpu on node with ram is on different node, and could
      put that on node with two hops away.
      
      So try to optimize this and add find_near_online_node() and call
      init_cpu_to_node().
      Signed-off-by: Yinghai Lu <yinghai@kernel.org>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Rusty Russell <rusty@rustcorp.com.au>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      LKML-Reference: <4B07A739.3030104@kernel.org>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      d9c2d5ac
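
The idea behind find_near_online_node() named above is to pick, for a cpu whose own node is not online, the online node at the smallest distance. A minimal userspace sketch follows; the distance table, the online[] array, and the MAX_NODES limit are stand-ins invented for this example rather than the kernel's data structures.

```c
#include <limits.h>

#define MAX_NODES 4	/* assumed small limit for the sketch */

/*
 * Return the online node with the smallest distance to `node`,
 * or -1 if no node is online. dist[a][b] is the distance from a to b,
 * and online[n] is non-zero when node n has memory and is online.
 */
int find_near_online_node(int node, int dist[MAX_NODES][MAX_NODES],
			  const int online[MAX_NODES])
{
	int n;
	int min_val = INT_MAX;
	int best_node = -1;

	for (n = 0; n < MAX_NODES; n++) {
		if (!online[n])
			continue;
		/* Keep the first node seen at the smallest distance. */
		if (dist[node][n] < min_val) {
			min_val = dist[node][n];
			best_node = n;
		}
	}

	return best_node;
}
```

With a table where node 2 is offline and sits one hop from node 3 but two hops from nodes 0 and 1, a cpu on node 2 would be mapped to node 3 rather than to whichever node roundrobin happened to select.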
    • Y
      x86, numa, bootmem: Only free bootmem on NUMA failure path · 021428ad
      Yinghai Lu committed
      In the NUMA bootmem setup failure path we freed nodedata_phys
      incorrectly.
      Signed-off-by: Yinghai Lu <yinghai@kernel.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Rusty Russell <rusty@rustcorp.com.au>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      LKML-Reference: <4B07A739.3030104@kernel.org>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      021428ad
  18. 13 Oct, 2009 2 commits
    • D
      x86: Interleave emulated nodes over physical nodes · adc19389
      David Rientjes committed
      Add interleaved NUMA emulation support
      
      This patch interleaves emulated nodes over the system's physical
      nodes. This is required for interleave optimizations since
      mempolicies, for example, operate by iterating over a nodemask and
      act without knowledge of node distances.  It can also be used for
      testing memory latencies and NUMA bugs in the kernel.
      
      There're a couple of ways to do this:
      
       - divide the number of emulated nodes by the number of physical
         nodes and allocate the result on each physical node, or
      
       - allocate each successive emulated node on a different physical
         node until all memory is exhausted.
      
      The disadvantage of the first option is, depending on the asymmetry
      in node capacities of each physical node, emulated nodes may
      substantially differ in size on a particular physical node compared
      to another.
      
      The disadvantage of the second option is, also depending on the
      asymmetry in node capacities of each physical node, there may be
      more emulated nodes allocated on a single physical node than another.
      
      This patch implements the second option; we accept the
      possibility that we may have slightly more emulated nodes on a
      particular physical node compared to another in exchange for
      avoiding node size asymmetry.
      
       [ Note that "node capacity" of a physical node is not only a
         function of its addressable range, but also is affected by
         subtracting out the amount of reserved memory over that range.
         NUMA emulation only deals with available, non-reserved memory
         quantities. ]
      
      We ensure there is at least a minimal amount of available memory
      allocated to each node.  We also make sure that at least this
      amount of available memory is available in ZONE_DMA32 for any node
      that includes both ZONE_DMA32 and ZONE_NORMAL.
      
      This patch also cleans the emulation code up by no longer passing
      the statically allocated struct bootnode array among the various
      functions. This init.data array is not allocated on the stack since
      it may be very large and thus it may be accessed at file scope.
      
      The WARN_ON() for nodes_cover_memory() when faking proximity
      domains is removed since it relies on successive nodes always
      having greater start addresses than previous nodes; with
      interleaving this is no longer always true.
      Signed-off-by: David Rientjes <rientjes@google.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Andreas Herrmann <andreas.herrmann3@amd.com>
      Cc: Yinghai Lu <yinghai@kernel.org>
      Cc: Balbir Singh <balbir@linux.vnet.ibm.com>
      Cc: Ankita Garg <ankita@in.ibm.com>
      Cc: Len Brown <len.brown@intel.com>
      LKML-Reference: <alpine.DEB.1.00.0909251519150.14754@chino.kir.corp.google.com>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      adc19389
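
The second option described above, placing each successive emulated node on a different physical node until memory runs out, can be sketched as a round-robin loop. This is a simplified illustration, not the kernel's split_nodes_interleave(): the function name, the MAX_PHYS limit, and the flat capacity array are assumptions, and it ignores reserved memory and the ZONE_DMA32 minimum mentioned in the message.

```c
#include <stdint.h>

#define MAX_PHYS 8	/* assumed upper bound on physical nodes for the sketch */

/*
 * Place up to nr_emu emulated nodes of `size` bytes (or any consistent unit)
 * by visiting the physical nodes in turn. counts[i] receives the number of
 * emulated nodes that landed on physical node i.
 * Returns how many emulated nodes were placed in total.
 */
int interleave_emulated_nodes(const uint64_t *capacity, int nr_phys,
			      uint64_t size, int nr_emu, int *counts)
{
	uint64_t left[MAX_PHYS];
	int placed = 0;
	int progress;
	int i;

	if (nr_phys <= 0 || nr_phys > MAX_PHYS || size == 0)
		return 0;

	for (i = 0; i < nr_phys; i++) {
		left[i] = capacity[i];
		counts[i] = 0;
	}

	while (placed < nr_emu) {
		progress = 0;
		for (i = 0; i < nr_phys && placed < nr_emu; i++) {
			/* Skip physical nodes that cannot fit another node. */
			if (left[i] < size)
				continue;
			left[i] -= size;
			counts[i]++;
			placed++;
			progress = 1;
		}
		/* Stop once every physical node is exhausted. */
		if (!progress)
			break;
	}

	return placed;
}
```

With physical capacities of 1024 and 512 units and emulated nodes of 256 units, six nodes fit: four on the first physical node and two on the second. All emulated nodes have the same size, and the asymmetry shows up in the per-node counts instead, which is the trade-off the commit message describes.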
    • D
      x86: Export srat physical topology · 8716273c
      David Rientjes committed
      This is the counterpart to "x86: export k8 physical topology" for
      SRAT. It is not as invasive because the acpi code already separates
      node setup into detection and registration steps, with the
      exception of registering e820 active regions in
      acpi_numa_memory_affinity_init().  This is now moved to
      acpi_scan_nodes() if NUMA emulation is disabled or deferred.
      
      acpi_numa_init() now returns a value which specifies whether an
      underlying SRAT was located.  If so, that topology can be used by
      the emulation code to interleave emulated nodes over physical nodes
      or to register the nodes for ACPI.
      
      acpi_get_nodes() may now be used to export the srat physical
      topology of the machine for NUMA emulation.
      Signed-off-by: David Rientjes <rientjes@google.com>
      Cc: Andreas Herrmann <andreas.herrmann3@amd.com>
      Cc: Yinghai Lu <yinghai@kernel.org>
      Cc: Balbir Singh <balbir@linux.vnet.ibm.com>
      Cc: Ankita Garg <ankita@in.ibm.com>
      Cc: Len Brown <len.brown@intel.com>
      LKML-Reference: <alpine.DEB.1.00.0909251518580.14754@chino.kir.corp.google.com>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      8716273c