1. 30 1月, 2008 5 次提交
    • C
      x86: 64-bit, make sparsemem vmemmap the only memory model · b263295d
      Christoph Lameter 提交于
      Use sparsemem as the only memory model for UP, SMP and NUMA.  Measurements
      indicate that DISCONTIGMEM has a higher overhead than sparsemem.  And
      FLATMEMs benefits are minimal.  So I think its best to simply standardize
      on sparsemem.
      
      Results of page allocator tests (test can be had via git from slab git
      tree branch tests)
      
      Measurements in cycle counts. 1000 allocations were performed and then the
      average cycle count was calculated.
      
      Order	FlatMem	Discontig	SparseMem
      0	  639	  665		  641
      1	  567	  647		  593
      2	  679	  774		  692
      3	  763	  967		  781
      4	  961	 1501		  962
      5	 1356	 2344		 1392
      6	 2224	 3982		 2336
      7	 4869	 7225		 5074
      8	12500	14048		12732
      9	27926	28223		28165
      10	58578	58714		58682
      
      (Note that FlatMem is an SMP config and the rest NUMA configurations)
      
      Memory use:
      
      SMP Sparsemem
      -------------
      
      Kernel size:
      
         text    data     bss     dec     hex filename
      3849268  397739 1264856 5511863  541ab7 vmlinux
      
                   total       used       free     shared    buffers     cached
      Mem:       8242252      41164    8201088          0        352      11512
      -/+ buffers/cache:      29300    8212952
      Swap:      9775512          0    9775512
      
      SMP Flatmem
      -----------
      
      Kernel size:
      
         text    data     bss     dec     hex filename
      3844612  397739 1264536 5506887  540747 vmlinux
      
      So 4.5k growth in text size vs. FLATMEM.
      
                   total       used       free     shared    buffers     cached
      Mem:       8244052      40544    8203508          0        352      11484
      -/+ buffers/cache:      28708    8215344
      
      2k growth in overall memory use after boot.
      
      NUMA discontig:
      
         text    data     bss     dec     hex filename
      3888124  470659 1276504 5635287  55fcd7 vmlinux
      
                   total       used       free     shared    buffers     cached
      Mem:       8256256      56908    8199348          0        352      11496
      -/+ buffers/cache:      45060    8211196
      Swap:      9775512          0    9775512
      
      NUMA sparse:
      
         text    data     bss     dec     hex filename
      3896428  470659 1276824 5643911  561e87 vmlinux
      
      8k text growth. Given that we fully inline virt_to_page and friends now
      that is rather good.
      
                   total       used       free     shared    buffers     cached
      Mem:       8264720      57240    8207480          0        352      11516
      -/+ buffers/cache:      45372    8219348
      Swap:      9775512          0    9775512
      
      The total available memory is increased by 8k.
      
      This patch makes sparsemem the default and removes discontig and
      flatmem support from x86.
      
      [ akpm@linux-foundation.org: allnoconfig build fix ]
      Acked-by: NAndi Kleen <ak@suse.de>
      Signed-off-by: NChristoph Lameter <clameter@sgi.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NIngo Molnar <mingo@elte.hu>
      Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
      b263295d
    • T
      x86: fixup numa 64 namespace · 7462894a
      Thomas Gleixner 提交于
      Using a variable name, which is the same as a macro name is not
      really smart. Change the variable names and fixup all users.
      Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
      Signed-off-by: NIngo Molnar <mingo@elte.hu>
      7462894a
    • T
      x86: cleanup numa_64.c · e3cfe529
      Thomas Gleixner 提交于
      Clean it up before applying more patches.
      Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
      Signed-off-by: NIngo Molnar <mingo@elte.hu>
      e3cfe529
    • T
      x86: nuke a ton of unused exports · 3abf024d
      Thomas Gleixner 提交于
      Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
      Signed-off-by: NIngo Molnar <mingo@elte.hu>
      3abf024d
    • T
      x86: move k8 related declarations · c9ff0342
      Thomas Gleixner 提交于
      Move k8 related declarations to k8.h and fix numa_64.c
      Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
      Signed-off-by: NIngo Molnar <mingo@elte.hu>
      c9ff0342
  2. 20 10月, 2007 1 次提交
    • M
      x86: convert cpu_to_apicid to be a per cpu variable · 71fff5e6
      Mike Travis 提交于
      This patch converts the x86_cpu_to_apicid array to be a per cpu
      variable. This saves sizeof(apicid) * NR unused cpus.  Access is mostly
      from startup and CPU HOTPLUG functions.
      
      MP_processor_info() is one of the functions that require access to the
      x86_cpu_to_apicid array before the per_cpu data area is setup.  For this
      case, a pointer to the __initdata array is initialized in setup_arch()
      and removed in smp_prepare_cpus() after the per_cpu data area is
      initialized.
      
      A second change is included to change the initial array value of ARCH
      i386 from 0xff to BAD_APICID to be consistent with ARCH x86_64.
      Signed-off-by: NMike Travis <travis@sgi.com>
      Cc: Andi Kleen <ak@suse.de>
      Cc: Christoph Lameter <clameter@sgi.com>
      Cc: "Siddha, Suresh B" <suresh.b.siddha@intel.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NIngo Molnar <mingo@elte.hu>
      Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
      71fff5e6
  3. 18 10月, 2007 2 次提交
    • M
      x86: fix cpu_to_node references · 98c9e27a
      Mike Travis 提交于
      In x86_64 and i386 architectures most arrays that are sized using
      NR_CPUS lay in local memory on node 0.  Not only will most (99%?) of the
      systems not use all the slots in these arrays, particularly when NR_CPUS
      is increased to accommodate future very high cpu count systems, but a
      number of cache lines are passed unnecessarily on the system bus when
      these arrays are referenced by cpus on other nodes.
      
      Typically, the values in these arrays are referenced by the cpu
      accessing it's own values, though when passing IPI interrupts, the cpu
      does access the data relevant to the targeted cpu/node.  Of course, if
      the referencing cpu is not on node 0, then the reference will still
      require cross node exchanges of cache lines.  A common use of this is
      for an interrupt service routine to pass the interrupt to other cpus
      local to that node.
      
      Ideally, all the elements in these arrays should be moved to the per_cpu
      data area.  In some cases (such as x86_cpu_to_apicid) the array is
      referenced before the per_cpu data areas are setup.  In this case, a
      static array is declared in the __initdata area and initialized by the
      booting cpu (BSP).  The values are then moved to the per_cpu area after
      it is initialized and the original static array is freed with the rest
      of the __initdata.
      
      This patch:
      
      Fix four instances where cpu_to_node is referenced by array instead of
      via the cpu_to_node macro.  This is preparation to moving it to the
      per_cpu data area.
      Signed-off-by: NMike Travis <travis@sgi.com>
      Cc: Andi Kleen <ak@suse.de>
      Cc: Christoph Lameter <clameter@sgi.com>
      Cc: "Siddha, Suresh B" <suresh.b.siddha@intel.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NIngo Molnar <mingo@elte.hu>
      Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
      98c9e27a
    • Y
      x86: 0 -> NULL, for arch/x86_64 · 83e83d54
      Yoann Padioleau 提交于
      When comparing a pointer, it's clearer to compare it to NULL than to 0.
      
      [ tglx: arch/x86 adaptation ]
      Signed-off-by: NYoann Padioleau <padator@wanadoo.fr>
      Signed-off-by: NAndi Kleen <ak@suse.de>
      Cc: ak@suse.de
      Cc: discuss@x86-64.org
      Cc: akpm@linux-foundation.org
      Signed-off-by: NIngo Molnar <mingo@elte.hu>
      Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
      83e83d54
  4. 11 10月, 2007 2 次提交
  5. 22 7月, 2007 3 次提交
  6. 03 5月, 2007 4 次提交
    • S
      [PATCH] x86-64: set node_possible_map at runtime - try 2 · e3f1caee
      Suresh Siddha 提交于
      Set the node_possible_map at runtime on x86_64.  On a non NUMA system,
      num_possible_nodes() will now say '1'.
      Signed-off-by: NSuresh Siddha <suresh.b.siddha@intel.com>
      Signed-off-by: NAndi Kleen <ak@suse.de>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: Eric Dumazet <dada1@cosmosbay.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Christoph Lameter <clameter@engr.sgi.com>
      e3f1caee
    • D
      [PATCH] x86-64: fixed size remaining fake nodes · 382591d5
      David Rientjes 提交于
      Extends the numa=fake x86_64 command-line option to split the remaining system
      memory into nodes of fixed size.  Any leftover memory is allocated to a final
      node unless the command-line ends with a comma.
      
      For example:
        numa=fake=2*512,*128	gives two 512M nodes and the remaining system
      			memory is split into nodes of 128M each.
      
      This is beneficial for systems where the exact size of RAM is unknown or not
      necessarily relevant, but the size of the remaining nodes to be allocated is
      known based on their capacity for resource management.
      
      Cc: Andi Kleen <ak@suse.de>
      Signed-off-by: NDavid Rientjes <rientjes@google.com>
      Signed-off-by: NAndi Kleen <ak@suse.de>
      Cc: Paul Jackson <pj@sgi.com>
      Cc: Christoph Lameter <clameter@engr.sgi.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      382591d5
    • D
      [PATCH] x86-64: split remaining fake nodes equally · 14694d73
      David Rientjes 提交于
      Extends the numa=fake x86_64 command-line option to split the remaining
      system memory into equal-sized nodes.
      
      For example:
      numa=fake=2*512,4*	gives two 512M nodes and the remaining system
      			memory is split into four approximately equal
      			chunks.
      
      This is beneficial for systems where the exact size of RAM is unknown or not
      necessarily relevant, but the granularity with which nodes shall be allocated
      is known.
      
      Cc: Andi Kleen <ak@suse.de>
      Signed-off-by: NDavid Rientjes <rientjes@google.com>
      Signed-off-by: NAndi Kleen <ak@suse.de>
      Cc: Paul Jackson <pj@sgi.com>
      Cc: Christoph Lameter <clameter@engr.sgi.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      14694d73
    • D
      [PATCH] x86-64: configurable fake numa node sizes · 8b8ca80e
      David Rientjes 提交于
      Extends the numa=fake x86_64 command-line option to allow for configurable
      node sizes.  These nodes can be used in conjunction with cpusets for coarse
      memory resource management.
      
      The old command-line option is still supported:
        numa=fake=32	gives 32 fake NUMA nodes, ignoring the NUMA setup of the
      		actual machine.
      
      But now you may configure your system for the node sizes of your choice:
        numa=fake=2*512,1024,2*256
      		gives two 512M nodes, one 1024M node, two 256M nodes, and
      		the rest of system memory to a sixth node.
      
      The existing hash function is maintained to support the various node sizes
      that are possible with this implementation.
      
      Each node of the same size receives roughly the same amount of available
      pages, regardless of any reserved memory with its address range.  The total
      available pages on the system is calculated and divided by the number of equal
      nodes to allocate.  These nodes are then dynamically allocated and their
      borders extended until such time as their number of available pages reaches
      the required size.
      
      Configurable node sizes are recommended when used in conjunction with cpusets
      for memory control because it eliminates the overhead associated with scanning
      the zonelists of many smaller full nodes on page_alloc().
      
      Cc: Andi Kleen <ak@suse.de>
      Signed-off-by: NDavid Rientjes <rientjes@google.com>
      Signed-off-by: NAndi Kleen <ak@suse.de>
      Cc: Paul Jackson <pj@sgi.com>
      Cc: Christoph Lameter <clameter@engr.sgi.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      8b8ca80e
  7. 13 2月, 2007 4 次提交
  8. 12 10月, 2006 1 次提交
    • M
      [PATCH] mm: use symbolic names instead of indices for zone initialisation · 6391af17
      Mel Gorman 提交于
      Arch-independent zone-sizing is using indices instead of symbolic names to
      offset within an array related to zones (max_zone_pfns).  The unintended
      impact is that ZONE_DMA and ZONE_NORMAL is initialised on powerpc instead
      of ZONE_DMA and ZONE_HIGHMEM when CONFIG_HIGHMEM is set.  As a result, the
      the machine fails to boot but will boot with CONFIG_HIGHMEM turned off.
      
      The following patch properly initialises the max_zone_pfns[] array and uses
      symbolic names instead of indices in each architecture using
      arch-independent zone-sizing.  Two users have successfully booted their
      powerpcs with it (one an ibook G4).  It has also been boot tested on x86,
      x86_64, ppc64 and ia64.  Please merge for 2.6.19-rc2.
      
      Credit to Benjamin Herrenschmidt for identifying the bug and rolling the
      first fix.  Additional credit to Johannes Berg and Andreas Schwab for
      reporting the problem and testing on powerpc.
      Signed-off-by: NMel Gorman <mel@csn.ul.ie>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Signed-off-by: NAndrew Morton <akpm@osdl.org>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      6391af17
  9. 27 9月, 2006 1 次提交
  10. 26 9月, 2006 2 次提交
  11. 23 4月, 2006 1 次提交
  12. 10 4月, 2006 2 次提交
    • A
      [PATCH] x86_64: Handle empty PXMs that only contain hotplug memory · a8062231
      Andi Kleen 提交于
      The node setup code would try to allocate the node metadata in the node
      itself, but that fails if there is no memory in there.
      
      This can happen with memory hotplug when the hotplug area defines an so
      far empty node.
      
      Now use bootmem to try to allocate the mem_map in other nodes.
      
      And if it fails don't panic, but just ignore the node.
      
      To make this work I added a new __alloc_bootmem_nopanic function that
      does what its name implies.
      
      TBD should try to use nearby nodes here.  Currently we just use any.
      It's hard to do it better because bootmem doesn't have proper fallback
      lists yet.
      Signed-off-by: NAndi Kleen <ak@suse.de>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      a8062231
    • A
      [PATCH] x86_64: Reserve SRAT hotadd memory on x86-64 · 68a3a7fe
      Andi Kleen 提交于
      From: Keith Mannthey, Andi Kleen
      
      Implement memory hotadd without sparsemem. The memory in the SRAT
      hotadd area is just preserved instead and can be activated later.
      
      There are a few restrictions:
      - Only one continuous hotadd area allowed per node
      
      The main problem is dealing with the many buggy SRAT tables
      that are out there. The strategy here is to reject anything
      suspicious.
      
      Originally from Keith Mannthey, with several hacks and changes by AK
      and also contributions from Andrew Morton
      
      [ TBD: Problems pointed out by KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>:
      
       1) Goto's rebuild_zonelist patch will not work if CONFIG_MEMORY_HOTPLUG=n.
      
          Rebuilding zonelist is necessary when the system has just memory <
          4G at boot, and hot add memory > 4G.  because x86_64 has DMA32,
          ZONE_NORAML is not included into zonelist at boot time if system
          doesn't have memory >4G at boot.
      
          [AK: should just force the higher zones at boot time when SRAT tells us]
      
       2) zone and node's spanned_pages and present_pages are not incremented.
          They should be.
      
          For example, our server (ia64/Fujitsu PrimeQuest) can equip memory
          from 4G to 1T(maybe 2T in future), and SRAT will *always* say we have
          possible 1T +memory.  (Microsoft requires "write all possible memory
          in SRAT") When we reserve memmap for possible 1T memory, Linux will
          not work well in +minimum 4G configuraion ;)
      
          [AK: needs limiting to 5-10% of max memory]
       ]
      Signed-off-by: NAndi Kleen <ak@suse.de>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      68a3a7fe
  13. 28 3月, 2006 1 次提交
  14. 26 3月, 2006 3 次提交
  15. 16 2月, 2006 1 次提交
  16. 12 1月, 2006 5 次提交
  17. 13 12月, 2005 1 次提交
  18. 15 11月, 2005 1 次提交