1. 13 2月, 2009 1 次提交
    • D
      powerpc/mm: Fix numa reserve bootmem page selection · 06eccea6
      Dave Hansen 提交于
      Fix the powerpc NUMA reserve bootmem page selection logic.
      
      commit 8f64e1f2 (powerpc: Reserve
      in bootmem lmb reserved regions that cross NUMA nodes) changed
      the logic for how the powerpc LMB reserved regions were converted
      to bootmen reserved regions.  As the folowing discussion reports,
      the new logic was not correct.
      
      mark_reserved_regions_for_nid() goes through each LMB on the
      system that specifies a reserved area.  It searches for
      active regions that intersect with that LMB and are on the
      specified node.  It attempts to bootmem-reserve only the area
      where the active region and the reserved LMB intersect.  We
      can not reserve things on other nodes as they may not have
      bootmem structures allocated, yet.
      
      We base the size of the bootmem reservation on two possible
      things.  Normally, we just make the reservation start and
      stop exactly at the start and end of the LMB.
      
      However, the LMB reservations are not aware of NUMA nodes and
      on occasion a single LMB may cross into several adjacent
      active regions.  Those may even be on different NUMA nodes
      and will require separate calls to the bootmem reserve
      functions.  So, the bootmem reservation must be trimmed to
      fit inside the current active region.
      
      That's all fine and dandy, but we trim the reservation
      in a page-aligned fashion.  That's bad because we start the
      reservation at a non-page-aligned address: physbase.
      
      The reservation may only span 2 bytes, but that those bytes
      may span two pfns and cause a reserve_size of 2*PAGE_SIZE.
      
      Take the case where you reserve 0x2 bytes at 0x0fff and
      where the active region ends at 0x1000.  You'll jump into
      that if() statment, but node_ar.end_pfn=0x1 and
      start_pfn=0x0.  You'll end up with a reserve_size=0x1000,
      and then call
      
        reserve_bootmem_node(node, physbase=0xfff, size=0x1000);
      
      0x1000 may not be on the same node as 0xfff.  Oops.
      
      In almost all the vm code, end_<anything> is not inclusive.
      If you have an end_pfn of 0x1234, page 0x1234 is not
      included in the range.  Using PFN_UP instead of the
      (>> >> PAGE_SHIFT) will make this consistent with the other VM
      code.
      
      We also need to do math for the reserved size with physbase
      instead of start_pfn.  node_ar.end_pfn << PAGE_SHIFT is
      *precisely* the end of the node.  However,
      (start_pfn << PAGE_SHIFT) is *NOT* precisely the beginning
      of the reserved area.  That is, of course, physbase.
      If we don't use physbase here, the reserve_size can be
      made too large.
      
      From: Dave Hansen <dave@linux.vnet.ibm.com>
      Tested-by: Geoff Levand <geoffrey.levand@am.sony.com>  Tested on PS3.
      Signed-off-by: NBenjamin Herrenschmidt <benh@kernel.crashing.org>
      06eccea6
  2. 08 1月, 2009 4 次提交
  3. 16 12月, 2008 1 次提交
    • D
      powerpc: Fix bootmem reservation on uninitialized node · a4c74ddd
      Dave Hansen 提交于
      careful_allocation() was calling into the bootmem allocator for
      nodes which had not been fully initialized and caused a previous
      bug:  http://patchwork.ozlabs.org/patch/10528/  So, I merged a
      few broken out loops in do_init_bootmem() to fix it.  That changed
      the code ordering.
      
      I think this bug is triggered by having reserved areas for a node
      which are spanned by another node's contents.  In the
      mark_reserved_regions_for_nid() code, we attempt to reserve the
      area for a node before we have allocated the NODE_DATA() for that
      nid.  We do this since I reordered that loop.  I suck.
      
      This is causing crashes at bootup on some systems, as reported
      by Jon Tollefson.
      
      This may only present on some systems that have 16GB pages
      reserved.  But, it can probably happen on any system that is
      trying to reserve large swaths of memory that happen to span other
      nodes' contents.
      
      This commit ensures that we do not touch bootmem for any node which
      has not been initialized, and also removes a compile warning about
      an unused variable.
      Signed-off-by: NDave Hansen <dave@linux.vnet.ibm.com>
      Signed-off-by: NPaul Mackerras <paulus@samba.org>
      a4c74ddd
  4. 01 12月, 2008 1 次提交
    • D
      powerpc: Fix boot freeze on machine with empty memory node · 4a618669
      Dave Hansen 提交于
      I got a bug report about a distro kernel not booting on a particular
      machine.  It would freeze during boot:
      
      > ...
      > Could not find start_pfn for node 1
      > [boot]0015 Setup Done
      > Built 2 zonelists in Node order, mobility grouping on.  Total pages: 123783
      > Policy zone: DMA
      > Kernel command line:
      > [boot]0020 XICS Init
      > [boot]0021 XICS Done
      > PID hash table entries: 4096 (order: 12, 32768 bytes)
      > clocksource: timebase mult[7d0000] shift[22] registered
      > Console: colour dummy device 80x25
      > console handover: boot [udbg0] -> real [hvc0]
      > Dentry cache hash table entries: 1048576 (order: 7, 8388608 bytes)
      > Inode-cache hash table entries: 524288 (order: 6, 4194304 bytes)
      > freeing bootmem node 0
      
      I've reproduced this on 2.6.27.7.  It is caused by commit
      8f64e1f2 ("powerpc: Reserve in bootmem
      lmb reserved regions that cross NUMA nodes").
      
      The problem is that Jon took a loop which was (in pseudocode):
      
      	for_each_node(nid)
      		NODE_DATA(nid) = careful_alloc(nid);
      		setup_bootmem(nid);
      		reserve_node_bootmem(nid);
      
      and broke it up into:
      
      	for_each_node(nid)
      		NODE_DATA(nid) = careful_alloc(nid);
      		setup_bootmem(nid);
      	for_each_node(nid)
      		reserve_node_bootmem(nid);
      
      The issue comes in when the 'careful_alloc()' is called on a node with
      no memory.  It falls back to using bootmem from a previously-initialized
      node.  But, bootmem has not yet been reserved when Jon's patch is
      applied.  It gives back bogus memory (0xc000000000000000) and pukes
      later in boot.
      
      The following patch collapses the loop back together.  It also breaks
      the mark_reserved_regions_for_nid() code out into a function and adds
      some comments.  I think a huge part of introducing this bug is because
      for loop was too long and hard to read.
      
      The actual bug fix here is the:
      
      +		if (end_pfn <= node->node_start_pfn ||
      +		    start_pfn >= node_end_pfn)
      +			continue;
      Signed-off-by: NDave Hansen <dave@linux.vnet.ibm.com>
      Signed-off-by: NPaul Mackerras <paulus@samba.org>
      4a618669
  5. 21 10月, 2008 2 次提交
  6. 10 10月, 2008 1 次提交
    • J
      powerpc: Reserve in bootmem lmb reserved regions that cross NUMA nodes · 8f64e1f2
      Jon Tollefson 提交于
      If there are multiple reserved memory blocks via lmb_reserve() that are
      contiguous addresses and on different NUMA nodes we are losing track of which
      address ranges to reserve in bootmem on which node.  I discovered this
      when I recently got to try 16GB huge pages on a system with more then 2 nodes.
      
      When scanning the device tree in early boot we call lmb_reserve() with
      the addresses of the 16G pages that we find so that the memory doesn't
      get used for something else.  For example the addresses for the pages
      could be 4000000000, 4400000000, 4800000000, 4C00000000, etc - 8 pages,
      one on each of eight nodes.  In the lmb after all the pages have been
      reserved it will look something like the following:
      
      lmb_dump_all:
          memory.cnt            = 0x2
          memory.size           = 0x3e80000000
          memory.region[0x0].base       = 0x0
                            .size     = 0x1e80000000
          memory.region[0x1].base       = 0x4000000000
                            .size     = 0x2000000000
          reserved.cnt          = 0x5
          reserved.size         = 0x3e80000000
          reserved.region[0x0].base       = 0x0
                            .size     = 0x7b5000
          reserved.region[0x1].base       = 0x2a00000
                            .size     = 0x78c000
          reserved.region[0x2].base       = 0x328c000
                            .size     = 0x43000
          reserved.region[0x3].base       = 0xf4e8000
                            .size     = 0xb18000
          reserved.region[0x4].base       = 0x4000000000
                            .size     = 0x2000000000
      
      The reserved.region[0x4] contains the 16G pages.  In
      arch/powerpc/mm/num.c: do_init_bootmem() we loop through each of the
      node numbers looking for the reserved regions that belong to the
      particular node.  It is not able to identify region 0x4 as being a part
      of each of the 8 nodes.  It is assuming that a reserved region is only
      on a single node.
      
      This patch takes out the reserved region loop from inside
      the loop that goes over each node.  It looks up the active region containing
      the start of the reserved region.  If it extends past that active region then
      it adjusts the size and gets the next active region containing it.
      Signed-off-by: NJon Tollefson <kniht@linux.vnet.ibm.com>
      Signed-off-by: NBenjamin Herrenschmidt <benh@kernel.crashing.org>
      8f64e1f2
  7. 16 9月, 2008 1 次提交
    • C
      powerpc: Add support for dynamic reconfiguration memory in kexec/kdump kernels · cf00085d
      Chandru 提交于
      Kdump kernel needs to use only those memory regions that it is allowed
      to use (crashkernel, rtas, tce, etc.).  Each of these regions have
      their own sizes and are currently added under 'linux,usable-memory'
      property under each memory@xxx node of the device tree.
      
      The ibm,dynamic-memory property of ibm,dynamic-reconfiguration-memory
      node (on POWER6) now stores in it the representation for most of the
      logical memory blocks with the size of each memory block being a
      constant (lmb_size).  If one or more or part of the above mentioned
      regions lie under one of the lmb from ibm,dynamic-memory property,
      there is a need to identify those regions within the given lmb.
      
      This makes the kernel recognize a new 'linux,drconf-usable-memory'
      property added by kexec-tools.  Each entry in this property is of the
      form of a count followed by that many (base, size) pairs for the above
      mentioned regions.  The number of cells in the count value is given by
      the #size-cells property of the root node.
      Signed-off-by: NChandru Siddalingappa <chandru@in.ibm.com>
      Signed-off-by: NPaul Mackerras <paulus@samba.org>
      cf00085d
  8. 25 7月, 2008 1 次提交
  9. 03 7月, 2008 2 次提交
  10. 24 4月, 2008 1 次提交
  11. 14 2月, 2008 1 次提交
  12. 08 2月, 2008 1 次提交
    • B
      Introduce flags for reserve_bootmem() · 72a7fe39
      Bernhard Walle 提交于
      This patchset adds a flags variable to reserve_bootmem() and uses the
      BOOTMEM_EXCLUSIVE flag in crashkernel reservation code to detect collisions
      between crashkernel area and already used memory.
      
      This patch:
      
      Change the reserve_bootmem() function to accept a new flag BOOTMEM_EXCLUSIVE.
      If that flag is set, the function returns with -EBUSY if the memory already
      has been reserved in the past.  This is to avoid conflicts.
      
      Because that code runs before SMP initialisation, there's no race condition
      inside reserve_bootmem_core().
      
      [akpm@linux-foundation.org: coding-style fixes]
      [akpm@linux-foundation.org: fix powerpc build]
      Signed-off-by: NBernhard Walle <bwalle@suse.de>
      Cc: <linux-arch@vger.kernel.org>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Vivek Goyal <vgoyal@in.ibm.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      72a7fe39
  13. 07 2月, 2008 1 次提交
    • B
      [POWERPC] Fake NUMA emulation for PowerPC · 1daa6d08
      Balbir Singh 提交于
      Here's a dumb simple implementation of fake NUMA nodes for PowerPC.
      Fake NUMA nodes can be specified using the following command line
      option
      
      numa=fake=<node range>
      
      node range is of the format <range1>,<range2>,...<rangeN>
      
      Each of the rangeX parameters is passed using memparse().  I find the
      patch useful for fake NUMA emulation on my simple PowerPC machine.
      I've tested it on a numa box with the following arguments
      
      numa=fake=512M
      numa=fake=512M,768M
      numa=fake=256M,512M mem=512M
      numa=fake=1G mem=768M
      numa=fake=
      without any numa= argument
      
      The other side-effect introduced by this patch is that; in the case
      where we don't have NUMA information, we now set a node online after
      adding each LMB.  This node could very well be node 0, but in the case
      that we enable fake NUMA nodes, when we cross node boundaries, we need
      to set the new node online.
      Signed-off-by: NBalbir Singh <balbir@linux.vnet.ibm.com>
      Signed-off-by: NPaul Mackerras <paulus@samba.org>
      1daa6d08
  14. 26 1月, 2008 1 次提交
    • P
      Revert "[POWERPC] Fake NUMA emulation for PowerPC" · 55852bed
      Paul Mackerras 提交于
      This reverts commit 5c3f5892,
      basically because it changes behaviour even when no fake NUMA
      information is specified on the kernel command line.
      
      Firstly, it changes the nid, thus destroying the real NUMA
      information.  Secondly, it also changes behaviour in that if a node
      ends up with no memory in it because of the memory limit, we used to
      set it online and now we don't.
      
      Also, in the non-NUMA case with no fake NUMA information, we do
      node_set_online once for each LMB now, whereas previously we only did
      it once.  I don't know if that is actually a problem, but it does seem
      unnecessary.
      Signed-off-by: NPaul Mackerras <paulus@samba.org>
      55852bed
  15. 20 12月, 2007 1 次提交
    • B
      [POWERPC] Fake NUMA emulation for PowerPC · 5c3f5892
      Balbir Singh 提交于
      Here's a dumb simple implementation of fake NUMA nodes for PowerPC.
      Fake NUMA nodes can be specified using the following command line option
      
      numa=fake=<node range>
      
      node range is of the format <range1>,<range2>,...<rangeN>
      
      Each of the rangeX parameters is passed using memparse().  I find this
      useful for fake NUMA emulation on my simple PowerPC machine.  I've
      tested it on a non-numa box with the following arguments:
      
      numa=fake=1G
      numa=fake=1G,2G
      name=fake=1G,512M,2G
      numa=fake=1500M,2800M mem=3500M
      numa=fake=1G mem=512M
      numa=fake=1G mem=1G
      Signed-off-by: NBalbir Singh <balbir@linux.vnet.ibm.com>
      Acked-by: NOlof Johansson <olof@lixom.net>
      Signed-off-by: NPaul Mackerras <paulus@samba.org>
      5c3f5892
  16. 03 8月, 2007 1 次提交
  17. 10 5月, 2007 1 次提交
    • R
      Add suspend-related notifications for CPU hotplug · 8bb78442
      Rafael J. Wysocki 提交于
      Since nonboot CPUs are now disabled after tasks and devices have been
      frozen and the CPU hotplug infrastructure is used for this purpose, we need
      special CPU hotplug notifications that will help the CPU-hotplug-aware
      subsystems distinguish normal CPU hotplug events from CPU hotplug events
      related to a system-wide suspend or resume operation in progress.  This
      patch introduces such notifications and causes them to be used during
      suspend and resume transitions.  It also changes all of the
      CPU-hotplug-aware subsystems to take these notifications into consideration
      (for now they are handled in the same way as the corresponding "normal"
      ones).
      
      [oleg@tv-sign.ru: cleanups]
      Signed-off-by: NRafael J. Wysocki <rjw@sisk.pl>
      Cc: Gautham R Shenoy <ego@in.ibm.com>
      Cc: Pavel Machek <pavel@ucw.cz>
      Signed-off-by: NOleg Nesterov <oleg@tv-sign.ru>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      8bb78442
  18. 13 4月, 2007 3 次提交
  19. 18 2月, 2007 1 次提交
  20. 11 12月, 2006 1 次提交
    • P
      [POWERPC] Support ibm,dynamic-reconfiguration-memory nodes · 0204568a
      Paul Mackerras 提交于
      For PAPR partitions with large amounts of memory, the firmware has an
      alternative, more compact representation for the information about the
      memory in the partition and its NUMA associativity information.  This
      adds the code to the kernel to parse this alternative representation.
      
      The other part of this patch is telling the firmware that we can
      handle the alternative representation.  There is however a subtlety
      here, because the firmware will invoke a reboot if the memory
      representation we request is different from the representation that
      firmware is currently using.  This is because firmware can't change
      the representation on the fly.  Further, some firmware versions used
      on POWER5+ machines have a bug where this reboot leaves the machine
      with an altered value of load-base, which will prevent any kernel
      booting until it is reset to the normal value (0x4000).  Because of
      this bug, we do NOT set fake_elf.rpanote.new_mem_def = 1, and thus we
      do not request the new representation on POWER5+ and earlier machines.
      We do request the new representation on POWER6, which uses the
      ibm,client-architecture-support call.
      Signed-off-by: NPaul Mackerras <paulus@samba.org>
      0204568a
  21. 12 10月, 2006 1 次提交
    • M
      [PATCH] mm: use symbolic names instead of indices for zone initialisation · 6391af17
      Mel Gorman 提交于
      Arch-independent zone-sizing is using indices instead of symbolic names to
      offset within an array related to zones (max_zone_pfns).  The unintended
      impact is that ZONE_DMA and ZONE_NORMAL is initialised on powerpc instead
      of ZONE_DMA and ZONE_HIGHMEM when CONFIG_HIGHMEM is set.  As a result, the
      the machine fails to boot but will boot with CONFIG_HIGHMEM turned off.
      
      The following patch properly initialises the max_zone_pfns[] array and uses
      symbolic names instead of indices in each architecture using
      arch-independent zone-sizing.  Two users have successfully booted their
      powerpcs with it (one an ibook G4).  It has also been boot tested on x86,
      x86_64, ppc64 and ia64.  Please merge for 2.6.19-rc2.
      
      Credit to Benjamin Herrenschmidt for identifying the bug and rolling the
      first fix.  Additional credit to Johannes Berg and Andreas Schwab for
      reporting the problem and testing on powerpc.
      Signed-off-by: NMel Gorman <mel@csn.ul.ie>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Signed-off-by: NAndrew Morton <akpm@osdl.org>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      6391af17
  22. 27 9月, 2006 1 次提交
  23. 31 7月, 2006 1 次提交
  24. 28 6月, 2006 1 次提交
  25. 02 5月, 2006 1 次提交
  26. 22 4月, 2006 1 次提交
  27. 27 3月, 2006 1 次提交
  28. 22 3月, 2006 6 次提交