1. 13 11月, 2013 1 次提交
    • T
      mem-hotplug: introduce movable_node boot option · c5320926
      Tang Chen 提交于
      The hot-Pluggable field in SRAT specifies which memory is hotpluggable.
      As we mentioned before, if hotpluggable memory is used by the kernel, it
      cannot be hot-removed.  So memory hotplug users may want to set all
      hotpluggable memory in ZONE_MOVABLE so that the kernel won't use it.
      
      Memory hotplug users may also set a node as movable node, which has
      ZONE_MOVABLE only, so that the whole node can be hot-removed.
      
      But the kernel cannot use memory in ZONE_MOVABLE.  By doing this, the
      kernel cannot use memory in movable nodes.  This will cause NUMA
      performance down.  And other users may be unhappy.
      
      So we need a way to allow users to enable and disable this functionality.
      In this patch, we introduce movable_node boot option to allow users to
      choose to not to consume hotpluggable memory at early boot time and later
      we can set it as ZONE_MOVABLE.
      
      To achieve this, the movable_node boot option will control the memblock
      allocation direction.  That said, after memblock is ready, before SRAT is
      parsed, we should allocate memory near the kernel image as we explained in
      the previous patches.  So if movable_node boot option is set, the kernel
      does the following:
      
      1. After memblock is ready, make memblock allocate memory bottom up.
      2. After SRAT is parsed, make memblock behave as default, allocate memory
         top down.
      
      Users can specify "movable_node" in kernel commandline to enable this
      functionality.  For those who don't use memory hotplug or who don't want
      to lose their NUMA performance, just don't specify anything.  The kernel
      will work as before.
      Signed-off-by: NTang Chen <tangchen@cn.fujitsu.com>
      Signed-off-by: NZhang Yanfei <zhangyanfei@cn.fujitsu.com>
      Suggested-by: NKamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Suggested-by: NIngo Molnar <mingo@kernel.org>
      Acked-by: NTejun Heo <tj@kernel.org>
      Acked-by: NToshi Kani <toshi.kani@hp.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Wanpeng Li <liwanp@linux.vnet.ibm.com>
      Cc: Thomas Renninger <trenn@suse.de>
      Cc: Yinghai Lu <yinghai@kernel.org>
      Cc: Jiang Liu <jiang.liu@huawei.com>
      Cc: Wen Congyang <wency@cn.fujitsu.com>
      Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
      Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
      Cc: Taku Izumi <izumi.taku@jp.fujitsu.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Michal Nazarewicz <mina86@mina86.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      c5320926
  2. 15 7月, 2013 1 次提交
    • P
      x86: delete __cpuinit usage from all x86 files · 148f9bb8
      Paul Gortmaker 提交于
      The __cpuinit type of throwaway sections might have made sense
      some time ago when RAM was more constrained, but now the savings
      do not offset the cost and complications.  For example, the fix in
      commit 5e427ec2 ("x86: Fix bit corruption at CPU resume time")
      is a good example of the nasty type of bugs that can be created
      with improper use of the various __init prefixes.
      
      After a discussion on LKML[1] it was decided that cpuinit should go
      the way of devinit and be phased out.  Once all the users are gone,
      we can then finally remove the macros themselves from linux/init.h.
      
      Note that some harmless section mismatch warnings may result, since
      notify_cpu_starting() and cpu_up() are arch independent (kernel/cpu.c)
      are flagged as __cpuinit  -- so if we remove the __cpuinit from
      arch specific callers, we will also get section mismatch warnings.
      As an intermediate step, we intend to turn the linux/init.h cpuinit
      content into no-ops as early as possible, since that will get rid
      of these warnings.  In any case, they are temporary and harmless.
      
      This removes all the arch/x86 uses of the __cpuinit macros from
      all C files.  x86 only had the one __CPUINIT used in assembly files,
      and it wasn't paired off with a .previous or a __FINIT, so we can
      delete it directly w/o any corresponding additional change there.
      
      [1] https://lkml.org/lkml/2013/5/20/589
      
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: x86@kernel.org
      Acked-by: NIngo Molnar <mingo@kernel.org>
      Acked-by: NThomas Gleixner <tglx@linutronix.de>
      Acked-by: NH. Peter Anvin <hpa@linux.intel.com>
      Signed-off-by: NPaul Gortmaker <paul.gortmaker@windriver.com>
      148f9bb8
  3. 30 4月, 2013 1 次提交
  4. 03 3月, 2013 1 次提交
    • Y
      x86, ACPI, mm: Revert movablemem_map support · 20e6926d
      Yinghai Lu 提交于
      Tim found:
      
        WARNING: at arch/x86/kernel/smpboot.c:324 topology_sane.isra.2+0x6f/0x80()
        Hardware name: S2600CP
        sched: CPU #1's llc-sibling CPU #0 is not on the same node! [node: 1 != 0]. Ignoring dependency.
        smpboot: Booting Node   1, Processors  #1
        Modules linked in:
        Pid: 0, comm: swapper/1 Not tainted 3.9.0-0-generic #1
        Call Trace:
          set_cpu_sibling_map+0x279/0x449
          start_secondary+0x11d/0x1e5
      
      Don Morris reproduced on a HP z620 workstation, and bisected it to
      commit e8d19552 ("acpi, memory-hotplug: parse SRAT before memblock
      is ready")
      
      It turns out movable_map has some problems, and it breaks several things
      
      1. numa_init is called several times, NOT just for srat. so those
      	nodes_clear(numa_nodes_parsed)
      	memset(&numa_meminfo, 0, sizeof(numa_meminfo))
         can not be just removed.  Need to consider sequence is: numaq, srat, amd, dummy.
         and make fall back path working.
      
      2. simply split acpi_numa_init to early_parse_srat.
         a. that early_parse_srat is NOT called for ia64, so you break ia64.
         b.  for (i = 0; i < MAX_LOCAL_APIC; i++)
      	     set_apicid_to_node(i, NUMA_NO_NODE)
           still left in numa_init. So it will just clear result from early_parse_srat.
           it should be moved before that....
         c.  it breaks ACPI_TABLE_OVERIDE...as the acpi table scan is moved
             early before override from INITRD is settled.
      
      3. that patch TITLE is total misleading, there is NO x86 in the title,
         but it changes critical x86 code. It caused x86 guys did not
         pay attention to find the problem early. Those patches really should
         be routed via tip/x86/mm.
      
      4. after that commit, following range can not use movable ram:
        a. real_mode code.... well..funny, legacy Node0 [0,1M) could be hot-removed?
        b. initrd... it will be freed after booting, so it could be on movable...
        c. crashkernel for kdump...: looks like we can not put kdump kernel above 4G
      	anymore.
        d. init_mem_mapping: can not put page table high anymore.
        e. initmem_init: vmemmap can not be high local node anymore. That is
           not good.
      
      If node is hotplugable, the mem related range like page table and
      vmemmap could be on the that node without problem and should be on that
      node.
      
      We have workaround patch that could fix some problems, but some can not
      be fixed.
      
      So just remove that offending commit and related ones including:
      
       f7210e6c ("mm/memblock.c: use CONFIG_HAVE_MEMBLOCK_NODE_MAP to
          protect movablecore_map in memblock_overlaps_region().")
      
       01a178a9 ("acpi, memory-hotplug: support getting hotplug info from
          SRAT")
      
       27168d38 ("acpi, memory-hotplug: extend movablemem_map ranges to
          the end of node")
      
       e8d19552 ("acpi, memory-hotplug: parse SRAT before memblock is
          ready")
      
       fb06bc8e ("page_alloc: bootmem limit with movablecore_map")
      
       42f47e27 ("page_alloc: make movablemem_map have higher priority")
      
       6981ec31 ("page_alloc: introduce zone_movable_limit[] to keep
          movable limit for nodes")
      
       34b71f1e ("page_alloc: add movable_memmap kernel parameter")
      
       4d59a751 ("x86: get pg_data_t's memory from other node")
      
      Later we should have patches that will make sure kernel put page table
      and vmemmap on local node ram instead of push them down to node0.  Also
      need to find way to put other kernel used ram to local node ram.
      Reported-by: NTim Gardner <tim.gardner@canonical.com>
      Reported-by: NDon Morris <don.morris@hp.com>
      Bisected-by: NDon Morris <don.morris@hp.com>
      Tested-by: NDon Morris <don.morris@hp.com>
      Signed-off-by: NYinghai Lu <yinghai@kernel.org>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: Thomas Renninger <trenn@suse.de>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Tang Chen <tangchen@cn.fujitsu.com>
      Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      20e6926d
  5. 24 2月, 2013 5 次提交
    • W
      x86/mm/numa: Don't check if node is NUMA_NO_NODE · 942670d0
      Wen Congyang 提交于
      If we aren't debugging per_cpu maps, the cpu's node is stored in
      per_cpu variable numa_node.  If `node' is NUMA_NO_NODE, it means
      the caller wants to clear the cpu's node.  So we should also
      call set_cpu_numa_node() in this case.
      Signed-off-by: NWen Congyang <wency@cn.fujitsu.com>
      Cc: Len Brown <len.brown@intel.com>
      Cc: Pavel Machek <pavel@ucw.cz>
      Cc: "Rafael J. Wysocki" <rjw@sisk.pl>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NIngo Molnar <mingo@kernel.org>
      942670d0
    • T
      acpi, memory-hotplug: parse SRAT before memblock is ready · e8d19552
      Tang Chen 提交于
      On linux, the pages used by kernel could not be migrated.  As a result,
      if a memory range is used by kernel, it cannot be hot-removed.  So if we
      want to hot-remove memory, we should prevent kernel from using it.
      
      The way now used to prevent this is specify a memory range by
      movablemem_map boot option and set it as ZONE_MOVABLE.
      
      But when the system is booting, memblock will allocate memory, and
      reserve the memory for kernel.  And before we parse SRAT, and know the
      node memory ranges, memblock is working.  And it may allocate memory in
      ranges to be set as ZONE_MOVABLE.  This memory can be used by kernel,
      and never be freed.
      
      So, let's parse SRAT before memblock is called first.  And it is early
      enough.
      
      The first call of memblock_find_in_range_node() is in:
      
        setup_arch()
          |-->setup_real_mode()
      
      so, this patch add a function early_parse_srat() to parse SRAT, and call
      it before setup_real_mode() is called.
      
      NOTE:
      
      1) early_parse_srat() is called before numa_init(), and has initialized
         numa_meminfo.  So DO NOT clear numa_nodes_parsed in numa_init() and DO
         NOT zero numa_meminfo in numa_init(), otherwise we will lose memory
         numa info.
      
      2) I don't know why using count of memory affinities parsed from SRAT
         as a return value in original acpi_numa_init().  So I add a static
         variable srat_mem_cnt to remember this count and use it as the return
         value of the new acpi_numa_init()
      
      [mhocko@suse.cz: parse SRAT before memblock is ready fix]
      Signed-off-by: NTang Chen <tangchen@cn.fujitsu.com>
      Reviewed-by: NWen Congyang <wency@cn.fujitsu.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Jiang Liu <jiang.liu@huawei.com>
      Cc: Jianguo Wu <wujianguo@huawei.com>
      Cc: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
      Cc: Wu Jianguo <wujianguo@huawei.com>
      Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Len Brown <lenb@kernel.org>
      Cc: "Brown, Len" <len.brown@intel.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      e8d19552
    • Y
      x86: get pg_data_t's memory from other node · 4d59a751
      Yasuaki Ishimatsu 提交于
      During the implementation of SRAT support, we met a problem.  In
      setup_arch(), we have the following call series:
      
       1) memblock is ready;
       2) some functions use memblock to allocate memory;
       3) parse ACPI tables, such as SRAT.
      
      Before 3), we don't know which memory is hotpluggable, and as a result,
      we cannot prevent memblock from allocating hotpluggable memory.  So, in
      2), there could be some hotpluggable memory allocated by memblock.
      
      Now, we are trying to parse SRAT earlier, before memblock is ready.  But
      I think we need more investigation on this topic.  So in this v5, I
      dropped all the SRAT support, and v5 is just the same as v3, and it is
      based on 3.8-rc3.
      
      As we planned, we will support getting info from SRAT without users'
      participation at last.  And we will post another patch-set to do so.
      
      And also, I think for now, we can add this boot option as the first step
      of supporting movable node.  Since Linux cannot migrate the direct
      mapped pages, the only way for now is to limit the whole node containing
      only movable memory.
      
      Using SRAT is one way.  But even if we can use SRAT, users still need an
      interface to enable/disable this functionality if they don't want to
      loose their NUMA performance.  So I think, a user interface is always
      needed.
      
      For now, users can disable this functionality by not specifying the boot
      option.  Later, we will post SRAT support, and add another option value
      "movablecore_map=acpi" to using SRAT.
      
      This patch:
      
      If system can create movable node which all memory of the node is
      allocated as ZONE_MOVABLE, setup_node_data() cannot allocate memory for
      the node's pg_data_t.  So, use memblock_alloc_try_nid() instead of
      memblock_alloc_nid() to retry when the first allocation fails.
      Signed-off-by: NYasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
      Signed-off-by: NLai Jiangshan <laijs@cn.fujitsu.com>
      Signed-off-by: NTang Chen <tangchen@cn.fujitsu.com>
      Signed-off-by: NJiang Liu <jiang.liu@huawei.com>
      Cc: Wu Jianguo <wujianguo@huawei.com>
      Cc: Wen Congyang <wency@cn.fujitsu.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      4d59a751
    • W
      cpu-hotplug,memory-hotplug: clear cpu_to_node() when offlining the node · e13fe869
      Wen Congyang 提交于
      When the node is offlined, there is no memory/cpu on the node.  If a
      sleep task runs on a cpu of this node, it will be migrated to the cpu on
      the other node.  So we can clear cpu-to-node mapping.
      
      [akpm@linux-foundation.org: numa_clear_node() and numa_set_node() can no longer be __cpuinit]
      Signed-off-by: NWen Congyang <wency@cn.fujitsu.com>
      Signed-off-by: NTang Chen <tangchen@cn.fujitsu.com>
      Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Jiang Liu <liuj97@gmail.com>
      Cc: Minchan Kim <minchan.kim@gmail.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      e13fe869
    • W
      cpu_hotplug: clear apicid to node when the cpu is hotremoved · c4c60524
      Wen Congyang 提交于
      When a cpu is hotpluged, we call acpi_map_cpu2node() in
      _acpi_map_lsapic() to store the cpu's node and apicid's node.  But we
      don't clear the cpu's node in acpi_unmap_lsapic() when this cpu is
      hotremoved.  If the node is also hotremoved, we will get the following
      messages:
      
        kernel BUG at include/linux/gfp.h:329!
        invalid opcode: 0000 [#1] SMP
        Modules linked in: ebtable_nat ebtables ipt_MASQUERADE iptable_nat nf_nat xt_CHECKSUM iptable_mangle bridge stp llc sunrpc ipt_REJECT nf_conntrack_ipv4 nf_defrag_ipv4 iptable_filter ip_tables ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 xt_state nf_conntrack ip6table_filter ip6_tables binfmt_misc dm_mirror dm_region_hash dm_log dm_mod vhost_net macvtap macvlan tun uinput iTCO_wdt iTCO_vendor_support coretemp kvm_intel kvm crc32c_intel microcode pcspkr i2c_i801 i2c_core lpc_ich mfd_core ioatdma e1000e i7core_edac edac_core sg acpi_memhotplug igb dca sd_mod crc_t10dif megaraid_sas mptsas mptscsih mptbase scsi_transport_sas scsi_mod
        Pid: 3126, comm: init Not tainted 3.6.0-rc3-tangchen-hostbridge+ #13 FUJITSU-SV PRIMEQUEST 1800E/SB
        RIP: 0010:[<ffffffff811bc3fd>]  [<ffffffff811bc3fd>] allocate_slab+0x28d/0x300
        RSP: 0018:ffff88078a049cf8  EFLAGS: 00010246
        RAX: 0000000000000000 RBX: 0000000000000001 RCX: 0000000000000000
        RDX: 0000000000000001 RSI: 0000000000000001 RDI: 0000000000000246
        RBP: ffff88078a049d38 R08: 00000000000040d0 R09: 0000000000000001
        R10: 0000000000000000 R11: 0000000000000b5f R12: 00000000000052d0
        R13: ffff8807c1417300 R14: 0000000000030038 R15: 0000000000000003
        FS:  00007fa9b1b44700(0000) GS:ffff8807c3800000(0000) knlGS:0000000000000000
        CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
        CR2: 00007fa9b09acca0 CR3: 000000078b855000 CR4: 00000000000007e0
        DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
        DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
        Process init (pid: 3126, threadinfo ffff88078a048000, task ffff8807bb6f2650)
        Call Trace:
          new_slab+0x30/0x1b0
          __slab_alloc+0x358/0x4c0
          kmem_cache_alloc_node_trace+0xb4/0x1e0
          alloc_fair_sched_group+0xd0/0x1b0
          sched_create_group+0x3e/0x110
          sched_autogroup_create_attach+0x4d/0x180
          sys_setsid+0xd4/0xf0
          system_call_fastpath+0x16/0x1b
        Code: 89 c4 e9 73 fe ff ff 31 c0 89 de 48 c7 c7 45 de 9e 81 44 89 45 c8 e8 22 05 4b 00 85 db 44 8b 45 c8 0f 89 4f ff ff ff 0f 0b eb fe <0f> 0b 90 eb fd 0f 0b eb fe 89 de 48 c7 c7 45 de 9e 81 31 c0 44
        RIP  [<ffffffff811bc3fd>] allocate_slab+0x28d/0x300
         RSP <ffff88078a049cf8>
        ---[ end trace adf84c90f3fea3e5 ]---
      
      The reason is that the cpu's node is not NUMA_NO_NODE, we will call
      alloc_pages_exact_node() to alloc memory on the node, but the node is
      offlined.
      
      If the node is onlined, we still need cpu's node.  For example: a task
      on the cpu is sleeped when the cpu is hotremoved.  We will choose
      another cpu to run this task when it is waked up.  If we know the cpu's
      node, we will choose the cpu on the same node first.  So we should clear
      cpu-to-node mapping when the node is offlined.
      
      This patch only clears apicid-to-node mapping when the cpu is
      hotremoved.
      
      [akpm@linux-foundation.org: fix section error]
      Signed-off-by: NWen Congyang <wency@cn.fujitsu.com>
      Signed-off-by: NTang Chen <tangchen@cn.fujitsu.com>
      Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Jiang Liu <liuj97@gmail.com>
      Cc: Minchan Kim <minchan.kim@gmail.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      c4c60524
  6. 01 2月, 2013 2 次提交
    • H
      x86-32, mm: Remove reference to alloc_remap() · 07f4207a
      H. Peter Anvin 提交于
      We have removed the remap allocator for x86-32, and x86-64 never had
      it (and doesn't need it).  Remove residual reference to it.
      Reported-by: NYinghai Lu <yinghai@kernel.org>
      Signed-off-by: NH. Peter Anvin <hpa@linux.intel.com>
      Cc: Dave Hansen <dave@linux.vnet.ibm.com>
      Cc: <stable@vger.kernel.org>
      Link: http://lkml.kernel.org/r/CAE9FiQVn6_QZi3fNQ-JHYiR-7jeDJ5hT0SyT_%2BzVvfOj=PzF3w@mail.gmail.com
      07f4207a
    • D
      x86-32, mm: Rip out x86_32 NUMA remapping code · f03574f2
      Dave Hansen 提交于
      This code was an optimization for 32-bit NUMA systems.
      
      It has probably been the cause of a number of subtle bugs over
      the years, although the conditions to excite them would have
      been hard to trigger.  Essentially, we remap part of the kernel
      linear mapping area, and then sometimes part of that area gets
      freed back in to the bootmem allocator.  If those pages get
      used by kernel data structures (say mem_map[] or a dentry),
      there's no big deal.  But, if anyone ever tried to use the
      linear mapping for these pages _and_ cared about their physical
      address, bad things happen.
      
      For instance, say you passed __GFP_ZERO to the page allocator
      and then happened to get handed one of these pages, it zero the
      remapped page, but it would make a pte to the _old_ page.
      There are probably a hundred other ways that it could screw
      with things.
      
      We don't need to hang on to performance optimizations for
      these old boxes any more.  All my 32-bit NUMA systems are long
      dead and buried, and I probably had access to more than most
      people.
      
      This code is causing real things to break today:
      
      	https://lkml.org/lkml/2013/1/9/376
      
      I looked in to actually fixing this, but it requires surgery
      to way too much brittle code, as well as stuff like
      per_cpu_ptr_to_phys().
      
      [ hpa: Cc: this for -stable, since it is a memory corruption issue.
        However, an alternative is to simply mark NUMA as depends BROKEN
        rather than EXPERIMENTAL in the X86_32 subclause... ]
      
      Link: http://lkml.kernel.org/r/20130131005616.1C79F411@kernel.stglabs.ibm.comSigned-off-by: NH. Peter Anvin <hpa@linux.intel.com>
      Cc: <stable@vger.kernel.org>
      f03574f2
  7. 31 1月, 2013 1 次提交
  8. 26 1月, 2013 1 次提交
    • D
      x86, mm: Make DEBUG_VIRTUAL work earlier in boot · a25b9316
      Dave Hansen 提交于
      The KVM code has some repeated bugs in it around use of __pa() on
      per-cpu data.  Those data are not in an area on which using
      __pa() is valid.  However, they are also called early enough in
      boot that __vmalloc_start_set is not set, and thus the
      CONFIG_DEBUG_VIRTUAL debugging does not catch them.
      
      This adds a check to also verify __pa() calls against max_low_pfn,
      which we can use earler in boot than is_vmalloc_addr().  However,
      if we are super-early in boot, max_low_pfn=0 and this will trip
      on every call, so also make sure that max_low_pfn is set before
      we try to use it.
      
      With this patch applied, CONFIG_DEBUG_VIRTUAL will actually
      catch the bug I was chasing (and fix later in this series).
      
      I'd love to find a generic way so that any __pa() call on percpu
      areas could do a BUG_ON(), but there don't appear to be any nice
      and easy ways to check if an address is a percpu one.  Anybody
      have ideas on a way to do this?
      Signed-off-by: NDave Hansen <dave@linux.vnet.ibm.com>
      Link: http://lkml.kernel.org/r/20130122212430.F46F8159@kernel.stglabs.ibm.comSigned-off-by: NH. Peter Anvin <hpa@linux.intel.com>
      a25b9316
  9. 30 5月, 2012 1 次提交
    • B
      x86: print physical addresses consistently with other parts of kernel · 365811d6
      Bjorn Helgaas 提交于
      Print physical address info in a style consistent with the %pR style used
      elsewhere in the kernel.  For example:
      
          -found SMP MP-table at [ffff8800000fce90] fce90
          +found SMP MP-table at [mem 0x000fce90-0x000fce9f] mapped at [ffff8800000fce90]
          -initial memory mapped : 0 - 20000000
          +initial memory mapped: [mem 0x00000000-0x1fffffff]
          -Base memory trampoline at [ffff88000009c000] 9c000 size 8192
          +Base memory trampoline [mem 0x0009c000-0x0009dfff] mapped at [ffff88000009c000]
          -SRAT: Node 0 PXM 0 0-80000000
          +SRAT: Node 0 PXM 0 [mem 0x00000000-0x7fffffff]
      Signed-off-by: NBjorn Helgaas <bhelgaas@google.com>
      Cc: Yinghai Lu <yinghai@kernel.org>
      Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      365811d6
  10. 13 1月, 2012 1 次提交
  11. 09 12月, 2011 1 次提交
  12. 15 7月, 2011 4 次提交
  13. 14 7月, 2011 1 次提交
  14. 13 7月, 2011 1 次提交
    • T
      x86, numa: Implement pfn -> nid mapping granularity check · 1e01979c
      Tejun Heo 提交于
      SPARSEMEM w/o VMEMMAP and DISCONTIGMEM, both used only on 32bit, use
      sections array to map pfn to nid which is limited in granularity.  If
      NUMA nodes are laid out such that the mapping cannot be accurate, boot
      will fail triggering BUG_ON() in mminit_verify_page_links().
      
      On 32bit, it's 512MiB w/ PAE and SPARSEMEM.  This seems to have been
      granular enough until commit 2706a0bf (x86, NUMA: Enable
      CONFIG_AMD_NUMA on 32bit too).  Apparently, there is a machine which
      aligns NUMA nodes to 128MiB and has only AMD NUMA but not SRAT.  This
      led to the following BUG_ON().
      
       On node 0 totalpages: 2096615
         DMA zone: 32 pages used for memmap
         DMA zone: 0 pages reserved
         DMA zone: 3927 pages, LIFO batch:0
         Normal zone: 1740 pages used for memmap
         Normal zone: 220978 pages, LIFO batch:31
         HighMem zone: 16405 pages used for memmap
         HighMem zone: 1853533 pages, LIFO batch:31
       BUG: Int 6: CR2   (null)
            EDI   (null)  ESI 00000002  EBP 00000002  ESP c1543ecc
            EBX f2400000  EDX 00000006  ECX   (null)  EAX 00000001
            err   (null)  EIP c16209aa   CS 00000060  flg 00010002
       Stack: f2400000 00220000 f7200800 c1620613 00220000 01000000 04400000 00238000
                (null) f7200000 00000002 f7200b58 f7200800 c1620929 000375fe   (null)
              f7200b80 c16395f0 00200a02 f7200a80   (null) 000375fe 00000002   (null)
       Pid: 0, comm: swapper Not tainted 2.6.39-rc5-00181-g2706a0bf #17
       Call Trace:
        [<c136b1e5>] ? early_fault+0x2e/0x2e
        [<c16209aa>] ? mminit_verify_page_links+0x12/0x42
        [<c1620613>] ? memmap_init_zone+0xaf/0x10c
        [<c1620929>] ? free_area_init_node+0x2b9/0x2e3
        [<c1607e99>] ? free_area_init_nodes+0x3f2/0x451
        [<c1601d80>] ? paging_init+0x112/0x118
        [<c15f578d>] ? setup_arch+0x791/0x82f
        [<c15f43d9>] ? start_kernel+0x6a/0x257
      
      This patch implements node_map_pfn_alignment() which determines
      maximum internode alignment and update numa_register_memblks() to
      reject NUMA configuration if alignment exceeds the pfn -> nid mapping
      granularity of the memory model as determined by PAGES_PER_SECTION.
      
      This makes the problematic machine boot w/ flatmem by rejecting the
      NUMA config and provides protection against crazy NUMA configurations.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Link: http://lkml.kernel.org/r/20110712074534.GB2872@htj.dyndns.org
      LKML-Reference: <20110628174613.GP478@escobedo.osrc.amd.com>
      Reported-and-Tested-by: NHans Rosenfeld <hans.rosenfeld@amd.com>
      Cc: Conny Seidel <conny.seidel@amd.com>
      Signed-off-by: NH. Peter Anvin <hpa@linux.intel.com>
      1e01979c
  15. 02 5月, 2011 10 次提交
    • Y
      x86, NUMA: Trim numa meminfo with max_pfn in a separate loop · e5a10c1b
      Yinghai Lu 提交于
      During testing 32bit numa unifying code from tj, found one system with
      more than 64g fails to use numa.  It turns out we do not trim numa
      meminfo correctly against max_pfn in case start address of a node is
      higher than 64GiB.  Bug fix made it to tip tree.
      
      This patch moves the checking and trimming to a separate loop.  So we
      don't need to compare low/high in following merge loops.  It makes the
      code more readable.
      
      Also it makes the node merge printouts less strange.  On a 512GiB numa
      system with 32bit,
      
      before:
      > NUMA: Node 0 [0,a0000) + [100000,80000000) -> [0,80000000)
      > NUMA: Node 0 [0,80000000) + [100000000,1080000000) -> [0,1000000000)
      
      after:
      > NUMA: Node 0 [0,a0000) + [100000,80000000) -> [0,80000000)
      > NUMA: Node 0 [0,80000000) + [100000000,1000000000) -> [0,1000000000)
      Signed-off-by: NYinghai Lu <yinghai@kernel.org>
      [Updated patch description and comment slightly.]
      Signed-off-by: NTejun Heo <tj@kernel.org>
      e5a10c1b
    • Y
      x86, NUMA: Rename setup_node_bootmem() to setup_node_data() · a56bca80
      Yinghai Lu 提交于
      After using memblock to replace bootmem, that function only sets up
      node_data now.
      
      Change the name to reflect what it actually does.
      
      tj: Minor adjustment to the patch description.
      Signed-off-by: NYinghai Lu <yinghai@kernel.org>
      Signed-off-by: NTejun Heo <tj@kernel.org>
      a56bca80
    • T
      x86, NUMA: Make numa_init_array() static · 752d4f37
      Tejun Heo 提交于
      numa_init_array() no longer has users outside of numa.c.  Make it
      static.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Yinghai Lu <yinghai@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      752d4f37
    • T
      x86, NUMA: Make 32bit use common NUMA init path · bd6709a9
      Tejun Heo 提交于
      With both _numa_init() methods converted and the rest of init code
      adjusted, numa_32.c now can switch from the 32bit only init code to
      the common one in numa.c.
      
      * Shim get_memcfg_*()'s are dropped and initmem_init() calls
        x86_numa_init(), which is updated to handle NUMAQ.
      
      * All boilerplate operations including node range limiting, pgdat
        alloc/init are handled by numa_init().  32bit only implementation is
        removed.
      
      * 32bit numa_add_memblk(), numa_set_distance() and
        memory_add_physaddr_to_nid() removed and common versions in
        numa_32.c enabled for 32bit.
      
      This change causes the following behavior changes.
      
      * NODE_DATA()->node_start_pfn/node_spanned_pages properly initialized
        for 32bit too.
      
      * Much more sanity checks and configuration cleanups.
      
      * Proper handling of node distances.
      
      * The same NUMA init messages as 64bit.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Yinghai Lu <yinghai@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      bd6709a9
    • T
      x86, NUMA: Initialize and use remap allocator from setup_node_bootmem() · 7888e96b
      Tejun Heo 提交于
      setup_node_bootmem() is taken from 64bit and doesn't use remap
      allocator.  It's about to be shared with 32bit so add support for it.
      If NODE_DATA is remapped, it's noted in the debug message and node
      locality check is skipped as the __pa() of the remapped address
      doesn't reflect the actual physical address.
      
      On 64bit, remap allocator becomes noop and doesn't affect the
      behavior.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Yinghai Lu <yinghai@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      7888e96b
    • T
      x86, NUMA: Remove long 64bit assumption from numa.c · 38f3e1ca
      Tejun Heo 提交于
      Code moved from numa_64.c has assumption that long is 64bit in several
      places.  This patch removes the assumption by using {s|u}64_t
      explicity, using PFN_PHYS() for page number -> addr conversions and
      adjusting printf formats.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Yinghai Lu <yinghai@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      38f3e1ca
    • T
      x86, NUMA: Enable build of generic NUMA init code on 32bit · 744baba0
      Tejun Heo 提交于
      Generic NUMA init code was moved to numa.c from numa_64.c but is still
      guaraded by CONFIG_X86_64.  This patch removes the compile guard and
      enables compiling on 32bit.
      
      * numa_add_memblk() and numa_set_distance() clash with the shim
        implementation in numa_32.c and are left out.
      
      * memory_add_physaddr_to_nid() clashes with 32bit implementation and
        is left out.
      
      * MAX_DMA_PFN definition in dma.h moved out of !CONFIG_X86_32.
      
      * node_data definition in numa_32.c removed in favor of the one in
        numa.c.
      
      There are places where ulong is assumed to be 64bit.  The next patch
      will fix them up.  Note that although the code is compiled it isn't
      used yet and this patch doesn't cause any functional change.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Yinghai Lu <yinghai@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      744baba0
    • T
      x86, NUMA: Move NUMA init logic from numa_64.c to numa.c · a4106eae
      Tejun Heo 提交于
      Move the generic 64bit NUMA init machinery from numa_64.c to numa.c.
      
      * node_data[], numa_mem_info and numa_distance
      * numa_add_memblk[_to](), numa_remove_memblk[_from]()
      * numa_set_distance() and friends
      * numa_init() and all the numa_meminfo handling helpers called from it
      * dummy_numa_init()
      * memory_add_physaddr_to_nid()
      
      A new function x86_numa_init() is added and the content of
      numa_64.c::initmem_init() is moved into it.  initmem_init() now simply
      calls x86_numa_init().
      
      Constants and numa_off declaration are moved from numa_{32|64}.h to
      numa.h.
      
      This is code reorganization and doesn't involve any functional change.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Yinghai Lu <yinghai@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      a4106eae
    • T
      x86, NUMA: Move numa_nodes_parsed to numa.[hc] · e6df595b
      Tejun Heo 提交于
      Move numa_nodes_parsed from numa_64.[hc] to numa.[hc] to prepare for
      NUMA init path unification.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Yinghai Lu <yinghai@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      e6df595b
    • T
      x86, NUMA: Unify 32/64bit numa_cpu_node() implementation · 6bd26273
      Tejun Heo 提交于
      Currently, the only meaningful user of apic->x86_32_numa_cpu_node() is
      NUMAQ which returns valid mapping only after CPU is initialized during
      SMP bringup; thus, the previous patch to set apicid -> node in
      setup_local_APIC() makes __apicid_to_node[] always contain the correct
      mapping whether custom apic->x86_32_numa_cpu_node() is used or not.
      
      So, there is no reason to keep separate 32bit implementation.  We can
      always consult __apicid_to_node[].  Move 64bit implementation from
      numa_64.c to numa.c and remove 32bit implementation from numa_32.c.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Yinghai Lu <yinghai@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      6bd26273
  16. 21 4月, 2011 1 次提交
  17. 14 2月, 2011 1 次提交
    • D
      x86, numa: Add error handling for bad cpu-to-node mappings · 14392fd3
      David Rientjes 提交于
      CONFIG_DEBUG_PER_CPU_MAPS may return NUMA_NO_NODE when an
      early_cpu_to_node() mapping hasn't been initialized.  In such a
      case, it emits a warning and continues without an issue but
      callers may try to use the return value to index into an array.
      
      We can catch those errors and fail silently since a warning has
      already been emitted.  No current user of numa_add_cpu()
      requires this error checking to avoid a crash, but it's better
      to be proactive in case a future user happens to have a bug and
      a user tries to diagnose it with CONFIG_DEBUG_PER_CPU_MAPS.
      Reported-by: NJesper Juhl <jj@chaosbits.net>
      Signed-off-by: NDavid Rientjes <rientjes@google.com>
      Cc: Tejun Heo <tj@kernel.org>
      LKML-Reference: <alpine.DEB.2.00.1102071407250.7812@chino.kir.corp.google.com>
      Signed-off-by: NIngo Molnar <mingo@elte.hu>
      14392fd3
  18. 28 1月, 2011 4 次提交
    • T
      x86: Unify NUMA initialization between 32 and 64bit · 8db78cc4
      Tejun Heo 提交于
      Now that everything else is unified, NUMA initialization can be
      unified too.
      
      * numa_init_array() and init_cpu_to_node() are moved from
        numa_64 to numa.
      
      * numa_32::initmem_init() is updated to call numa_init_array()
        and setup_arch() to call init_cpu_to_node() on 32bit too.
      
      * x86_cpu_to_node_map is now initialized to NUMA_NO_NODE on
        32bit too. This is safe now as numa_init_array() will initialize
        it early during boot.
      
      This makes NUMA mapping fully initialized before
      setup_per_cpu_areas() on 32bit too and thus makes the first
      percpu chunk which contains all the static variables and some of
      dynamic area allocated with NUMA affinity correctly considered.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Cc: yinghai@kernel.org
      Cc: brgerst@gmail.com
      Cc: gorcunov@gmail.com
      Cc: shaohui.zheng@intel.com
      Cc: rientjes@google.com
      LKML-Reference: <1295789862-25482-17-git-send-email-tj@kernel.org>
      Signed-off-by: NIngo Molnar <mingo@elte.hu>
      Reported-by: NEric Dumazet <eric.dumazet@gmail.com>
      Reviewed-by: NPekka Enberg <penberg@kernel.org>
      8db78cc4
    • T
      x86: Unify node_to_cpumask_map handling between 32 and 64bit · de2d9445
      Tejun Heo 提交于
      x86_32 has been managing node_to_cpumask_map explicitly from
      map_cpu_to_node() and friends in a rather ugly way.  With
      previous changes, it's now possible to share the code with
      64bit.
      
      * When CONFIG_NUMA_EMU is disabled, numa_add/remove_cpu() are
        implemented in numa.c and shared by 32 and 64bit.  CONFIG_NUMA_EMU
        versions still live in numa_64.c.
      
        NUMA_EMU's dependency on 64bit is planned to be removed and the
        above should go away together.
      
      * identify_cpu() now calls numa_add_cpu() for 32bit too.  This
        makes the explicit mask management from map_cpu_to_node() unnecessary.
      
      * The whole x86_32 specific map_cpu_to_node() chunk is no longer
        necessary.  Dropped.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Reviewed-by: NPekka Enberg <penberg@kernel.org>
      Cc: eric.dumazet@gmail.com
      Cc: yinghai@kernel.org
      Cc: brgerst@gmail.com
      Cc: gorcunov@gmail.com
      Cc: shaohui.zheng@intel.com
      Cc: rientjes@google.com
      LKML-Reference: <1295789862-25482-16-git-send-email-tj@kernel.org>
      Signed-off-by: NIngo Molnar <mingo@elte.hu>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Shaohui Zheng <shaohui.zheng@intel.com>
      de2d9445
    • T
      x86: Unify CPU -> NUMA node mapping between 32 and 64bit · 645a7919
      Tejun Heo 提交于
      Unlike 64bit, 32bit has been using its own cpu_to_node_map[] for
      CPU -> NUMA node mapping.  Replace it with early_percpu variable
      x86_cpu_to_node_map and share the mapping code with 64bit.
      
      * USE_PERCPU_NUMA_NODE_ID is now enabled for 32bit too.
      
      * x86_cpu_to_node_map and numa_set/clear_node() are moved from
        numa_64 to numa.  For now, on 32bit, x86_cpu_to_node_map is initialized
        with 0 instead of NUMA_NO_NODE.  This is to avoid introducing unexpected
        behavior change and will be updated once init path is unified.
      
      * srat_detect_node() is now enabled for x86_32 too.  It calls
        numa_set_node() and initializes the mapping making explicit
        cpu_to_node_map[] updates from map/unmap_cpu_to_node() unnecessary.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Cc: eric.dumazet@gmail.com
      Cc: yinghai@kernel.org
      Cc: brgerst@gmail.com
      Cc: gorcunov@gmail.com
      Cc: penberg@kernel.org
      Cc: shaohui.zheng@intel.com
      Cc: rientjes@google.com
      LKML-Reference: <1295789862-25482-15-git-send-email-tj@kernel.org>
      Signed-off-by: NIngo Molnar <mingo@elte.hu>
      Cc: David Rientjes <rientjes@google.com>
      645a7919
    • T
      x86: Unify cpu/apicid <-> NUMA node mapping between 32 and 64bit · bbc9e2f4
      Tejun Heo 提交于
      The mapping between cpu/apicid and node is done via
      apicid_to_node[] on 64bit and apicid_2_node[] +
      apic->x86_32_numa_cpu_node() on 32bit. This difference makes it
      difficult to further unify 32 and 64bit NUMA handling.
      
      This patch unifies it by replacing both apicid_to_node[] and
      apicid_2_node[] with __apicid_to_node[] array, which is accessed
      by two accessors - set_apicid_to_node() and numa_cpu_node().  On
      64bit, numa_cpu_node() always consults __apicid_to_node[]
      directly while 32bit goes through apic->numa_cpu_node() method
      to allow apic implementations to override it.
      
      srat_detect_node() for amd cpus contains workaround for broken
      NUMA configuration which assumes relationship between APIC ID,
      HT node ID and NUMA topology.  Leave it to access
      __apicid_to_node[] directly as mapping through CPU might result
      in undesirable behavior change.  The comment is reformatted and
      updated to note the ugliness.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Reviewed-by: NPekka Enberg <penberg@kernel.org>
      Cc: eric.dumazet@gmail.com
      Cc: yinghai@kernel.org
      Cc: brgerst@gmail.com
      Cc: gorcunov@gmail.com
      Cc: shaohui.zheng@intel.com
      Cc: rientjes@google.com
      LKML-Reference: <1295789862-25482-14-git-send-email-tj@kernel.org>
      Signed-off-by: NIngo Molnar <mingo@elte.hu>
      Cc: David Rientjes <rientjes@google.com>
      bbc9e2f4
  19. 19 1月, 2011 1 次提交
  20. 31 5月, 2010 1 次提交