1. 23 Feb 2009, 6 commits
  2. 13 Feb 2009, 3 commits
    • powerpc/mm: Fix numa reserve bootmem page selection · 06eccea6
      Committed by Dave Hansen
      Fix the powerpc NUMA reserve bootmem page selection logic.
      
      commit 8f64e1f2 (powerpc: Reserve
      in bootmem lmb reserved regions that cross NUMA nodes) changed
      the logic for how the powerpc LMB reserved regions were converted
      to bootmem reserved regions.  As the following discussion reports,
      the new logic was not correct.
      
      mark_reserved_regions_for_nid() goes through each LMB on the
      system that specifies a reserved area.  It searches for
      active regions that intersect with that LMB and are on the
      specified node.  It attempts to bootmem-reserve only the area
      where the active region and the reserved LMB intersect.  We
      cannot reserve things on other nodes as they may not have
      bootmem structures allocated yet.
      
      We base the size of the bootmem reservation on two possible
      things.  Normally, we just make the reservation start and
      stop exactly at the start and end of the LMB.
      
      However, the LMB reservations are not aware of NUMA nodes and
      on occasion a single LMB may cross into several adjacent
      active regions.  Those may even be on different NUMA nodes
      and will require separate calls to the bootmem reserve
      functions.  So, the bootmem reservation must be trimmed to
      fit inside the current active region.
      
      That's all fine and dandy, but we trim the reservation
      in a page-aligned fashion.  That's bad because we start the
      reservation at a non-page-aligned address: physbase.
      
      The reservation may only span 2 bytes, but those bytes
      may span two pfns and cause a reserve_size of 2*PAGE_SIZE.
      
      Take the case where you reserve 0x2 bytes at 0x0fff and
      where the active region ends at 0x1000.  You'll jump into
      that if() statement, but node_ar.end_pfn=0x1 and
      start_pfn=0x0.  You'll end up with a reserve_size=0x1000,
      and then call
      
        reserve_bootmem_node(node, physbase=0xfff, size=0x1000);
      
      0x1000 may not be on the same node as 0xfff.  Oops.
      
      In almost all the vm code, end_<anything> is not inclusive.
      If you have an end_pfn of 0x1234, page 0x1234 is not
      included in the range.  Using PFN_UP instead of the
      (>> PAGE_SHIFT) will make this consistent with the other VM
      code.
      
      We also need to do math for the reserved size with physbase
      instead of start_pfn.  node_ar.end_pfn << PAGE_SHIFT is
      *precisely* the end of the node.  However,
      (start_pfn << PAGE_SHIFT) is *NOT* precisely the beginning
      of the reserved area.  That is, of course, physbase.
      If we don't use physbase here, the reserve_size can be
      made too large.
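      
      A hedged sketch of the trimming described above (variable names
      follow the commit text; this is not a drop-in replacement for the
      real mark_reserved_regions_for_nid()):
      
        /*
         * Compute an exclusive end_pfn with PFN_UP() and measure the
         * trimmed size from physbase, not from the page-aligned start_pfn.
         */
        start_pfn = physbase >> PAGE_SHIFT;
        end_pfn = PFN_UP(physbase + size);   /* exclusive, like other VM code */
        reserve_size = size;

        if (end_pfn > node_ar.end_pfn)
                reserve_size = (node_ar.end_pfn << PAGE_SHIFT) - physbase;

        reserve_bootmem_node(NODE_DATA(nid), physbase, reserve_size,
                             BOOTMEM_DEFAULT);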
      
      From: Dave Hansen <dave@linux.vnet.ibm.com>
      Tested-by: Geoff Levand <geoffrey.levand@am.sony.com>  Tested on PS3.
      Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
    • powerpc/fsl-booke: Fix compile warning · 96a8bac5
      Committed by Kumar Gala
      arch/powerpc/mm/fsl_booke_mmu.c: In function 'adjust_total_lowmem':
      arch/powerpc/mm/fsl_booke_mmu.c:221: warning: format '%ld' expects type 'long int', but argument 3 has type 'phys_addr_t'
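      
      Since phys_addr_t may be 32 or 64 bits wide depending on the
      configuration, the usual fix for this kind of warning is to cast to
      unsigned long long and print with %llu.  A minimal sketch, with
      illustrative variable names:
      
        phys_addr_t residual = total_lowmem - __max_low_memory;

        /* phys_addr_t can be u32 or u64, so cast explicitly for printk */
        pr_info("Residual lowmem: %llu bytes\n",
                (unsigned long long)residual);
      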
      Signed-off-by: Kumar Gala <galak@kernel.crashing.org>
    • powerpc/fsl-booke: Add new ISA 2.06 page sizes and MAS defines · d66c82ea
      Committed by Kumar Gala
      The Power ISA 2.06 added power of two page sizes to the embedded MMU
      architecture.  It is done in such a way as to be code compatible with the
      existing HW.  Minor code changes were made to support both power of two
      and power of four page sizes.  Also added some new MAS bits and macros
      that are defined as part of the 2.06 ISA.  Renamed some things to use
      the 'Book-3e' concept to convey the new MMU that is based on the
      Freescale Book-E MMU programming model.
      
      Note: it is still invalid to try to use a page size that isn't supported
      by the CPU.
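      
      As a rough illustration of the difference (the helper and its name are
      made up for this sketch, not taken from the patch): the classic
      Freescale Book-E TSIZE field encodes sizes as powers of four
      (size = 4^TSIZE KB), while the ISA 2.06 encoding uses powers of two
      (size = 2^TSIZE KB).
      
        #include <linux/log2.h>

        /* Hypothetical helper: page size in bytes -> MAS1 TSIZE value. */
        static unsigned int size_to_tsize(unsigned long size, bool isa_206)
        {
                unsigned int kb_shift = ilog2(size >> 10); /* log2 of size in KB */

                /* ISA 2.06: 2^TSIZE KB; classic Book-E: 4^TSIZE KB */
                return isa_206 ? kb_shift : kb_shift / 2;
        }
      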
      Signed-off-by: Kumar Gala <galak@kernel.crashing.org>
  3. 11 Feb 2009, 5 commits
    • powerpc/mm: Fix _PAGE_COHERENT support on classic ppc32 HW · f99fb8a2
      Committed by Kumar Gala
      The following commit:
      
      commit 64b3d0e8
      Author: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Date:   Thu Dec 18 19:13:51 2008 +0000
      
          powerpc/mm: Rework usage of _PAGE_COHERENT/NO_CACHE/GUARDED
      
      broke setting of the _PAGE_COHERENT bit in the PPC HW PTE.  Since we now
      actually set _PAGE_COHERENT in the Linux PTE we shouldn't be clearing it
      out before we propagate it to the PPC HW PTE.
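      
      In C terms the idea is roughly the following (a hedged sketch; the
      actual fix is in the assembly path that builds the hash PTE, and the
      variable names here are illustrative):
      
        /* Keep _PAGE_COHERENT (the M bit) along with the other cache
         * attribute bits when deriving the HW PTE from the Linux PTE,
         * instead of masking it away. */
        hw_flags = linux_pte & (_PAGE_WRITETHRU | _PAGE_NO_CACHE |
                                _PAGE_COHERENT | _PAGE_GUARDED);
      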
      Reported-by: Martyn Welch <martyn.welch@gefanuc.com>
      Signed-off-by: Kumar Gala <galak@kernel.crashing.org>
      Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
    • powerpc/mm: Rework I$/D$ coherency (v3) · 8d30c14c
      Committed by Benjamin Herrenschmidt
      This patch reworks the way we do I and D cache coherency on PowerPC.
      
      The "old" way was split in 3 different parts depending on the processor type:
      
         - Hash with per-page exec support (64-bit and >= POWER4 only) does it
      at hashing time, by preventing exec on unclean pages and cleaning pages
      on exec faults.
      
         - Everything without per-page exec support (32-bit hash, 8xx, and
      64-bit < POWER4) does it for all pages going to user space in update_mmu_cache().
      
         - Embedded with per-page exec support does it from do_page_fault() on
      exec faults, in a way similar to what the hash code does.
      
      That leads to confusion, and bugs. For example, the method using update_mmu_cache()
      is racy on SMP where another processor can see the new PTE and hash it in before
      we have cleaned the cache, and then blow up trying to execute. This is hard to hit but
      I think it has bitten us in the past.
      
      Also, it's inefficient for embedded where we always end up having to do at least
      one more page fault.
      
      This reworks the whole thing by moving the cache sync into two main call sites,
      though we keep different behaviours depending on the HW capability. The call
      sites are set_pte_at() which is now made out of line, and ptep_set_access_flags()
      which joins the former in pgtable.c
      
      The base idea for Embedded with per-page exec support is that we now do the
      flush at set_pte_at() time when coming from an exec fault, which allows us
      to avoid the double fault problem completely (we can even improve the situation
      more by implementing TLB preload in update_mmu_cache() but that's for later).
      
      If for some reason we didn't do it there and we try to execute, we'll hit
      the page fault, which will do a minor fault, which will hit ptep_set_access_flags()
      to do things like update _PAGE_ACCESSED or _PAGE_DIRTY if needed; we now make
      that path also perform the I/D cache sync for exec faults. This second path
      is the catch-all for anything that wasn't cleaned at set_pte_at() time.
      
      For cpus without per-page exec support, we always do the sync at set_pte_at(),
      thus guaranteeing that when the PTE is visible to other processors, the cache
      is clean.
      
      For the 64-bit hash with per-page exec support case, we keep the old mechanism
      for now. I'll look into changing it later, once I've reworked a bit how we
      use _PAGE_EXEC.
      
      This is also a first step towards adding _PAGE_EXEC support for embedded platforms.
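      
      A hedged sketch of the kind of coherency helper these call sites now
      share (the real patch is larger and differs in detail); PG_arch_1 is
      the usual "cache clean" marker on powerpc:
      
        static pte_t do_icache_coherency(pte_t pte)
        {
                unsigned long pfn = pte_pfn(pte);
                struct page *page;

                if (!pfn_valid(pfn))
                        return pte;
                page = pfn_to_page(pfn);

                /* Clean the page once, then remember that in PG_arch_1. */
                if (!PageReserved(page) &&
                    !test_bit(PG_arch_1, &page->flags)) {
                        flush_dcache_icache_page(page);
                        set_bit(PG_arch_1, &page->flags);
                }
                return pte;
        }
      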
      Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
    • powerpc/mm: Move 64-bit unmapped_area to top of address space · 91b0f5ec
      Committed by Anton Blanchard
      We currently place mmaps just below the stack on 32bit, but leave them
      in the middle of the address space on 64bit:
      
      00100000-00120000 r-xp 00100000 00:00 0                    [vdso]
      10000000-10010000 r-xp 00000000 08:06 179534               /tmp/sleep
      10010000-10020000 rw-p 00000000 08:06 179534               /tmp/sleep
      10020000-10130000 rw-p 10020000 00:00 0                    [heap]
      40000000000-40000030000 r-xp 00000000 08:06 440743         /lib64/ld-2.9.so
      40000030000-40000040000 rw-p 00020000 08:06 440743         /lib64/ld-2.9.so
      40000050000-400001f0000 r-xp 00000000 08:06 440671         /lib64/libc-2.9.so
      400001f0000-40000200000 r--p 00190000 08:06 440671         /lib64/libc-2.9.so
      40000200000-40000220000 rw-p 001a0000 08:06 440671         /lib64/libc-2.9.so
      40000220000-40008230000 rw-p 40000220000 00:00 0
      fffffbc0000-fffffd10000 rw-p fffffeb0000 00:00 0           [stack]
      
      Right now it isn't an issue, but at some stage we will run into mmap or
      hugetlb allocation issues. Using the same layout as 32bit gives us
      some breathing room. This matches what x86-64 is doing too.
      
      00100000-00103000 r-xp 00100000 00:00 0                    [vdso]
      10000000-10001000 r-xp 00000000 08:06 554894               /tmp/test
      10010000-10011000 r--p 00000000 08:06 554894               /tmp/test
      10011000-10012000 rw-p 00001000 08:06 554894               /tmp/test
      10012000-10113000 rw-p 10012000 00:00 0                    [heap]
      fffefdf7000-ffff7df8000 rw-p fffefdf7000 00:00 0
      ffff7df8000-ffff7f97000 r-xp 00000000 08:06 130591         /lib64/libc-2.9.so
      ffff7f97000-ffff7fa6000 ---p 0019f000 08:06 130591         /lib64/libc-2.9.so
      ffff7fa6000-ffff7faa000 r--p 0019e000 08:06 130591         /lib64/libc-2.9.so
      ffff7faa000-ffff7fc0000 rw-p 001a2000 08:06 130591         /lib64/libc-2.9.so
      ffff7fc0000-ffff7fc4000 rw-p ffff7fc0000 00:00 0
      ffff7fc4000-ffff7fec000 r-xp 00000000 08:06 130663         /lib64/ld-2.9.so
      ffff7fee000-ffff7ff0000 rw-p ffff7fee000 00:00 0
      ffff7ffa000-ffff7ffb000 rw-p ffff7ffa000 00:00 0
      ffff7ffb000-ffff7ffc000 r--p 00027000 08:06 130663         /lib64/ld-2.9.so
      ffff7ffc000-ffff7fff000 rw-p 00028000 08:06 130663         /lib64/ld-2.9.so
      ffff7fff000-ffff8000000 rw-p ffff7fff000 00:00 0
      fffffc59000-fffffc6e000 rw-p ffffffeb000 00:00 0           [stack]
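      
      The selection between the two layouts follows the usual top-down
      pattern; a hedged, simplified sketch (helper names assumed, and the
      real arch_pick_mmap_layout() sets a few more fields):
      
        void arch_pick_mmap_layout(struct mm_struct *mm)
        {
                if (mmap_is_legacy()) {
                        /* legacy: grow mmaps up from a fixed base */
                        mm->mmap_base = TASK_UNMAPPED_BASE;
                        mm->get_unmapped_area = arch_get_unmapped_area;
                } else {
                        /* new default: place mmaps just below the stack */
                        mm->mmap_base = mmap_base();
                        mm->get_unmapped_area = arch_get_unmapped_area_topdown;
                }
        }
      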
      Signed-off-by: Anton Blanchard <anton@samba.org>
      Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
    • powerpc/numa: Remove redundant find_cpu_node() · 8b16cd23
      Committed by Milton Miller
      Use of_get_cpu_node, which is a superset of numa.c's find_cpu_node and
      lives in a less restrictive section (.text vs .cpuinit).
      Signed-off-by: Milton Miller <miltonm@bga.com>
      Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
    • powerpc/numa: Avoid possible reference beyond prop. length in find_min_common_depth() · 20fcefe5
      Committed by Milton Miller
      find_min_common_depth() was checking the property length incorrectly.
      The length is in bytes, not cells, and the code reads the property's second entry.
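      
      A hedged sketch of a bytes-versus-cells length check (the real
      function reads the "ibm,associativity-reference-points" property and
      differs in detail):
      
        static int min_common_depth_from(struct device_node *root)
        {
                const unsigned int *ref_points;
                int len;

                ref_points = of_get_property(root,
                                "ibm,associativity-reference-points", &len);

                /* len is in bytes: make sure at least two 4-byte cells are
                 * present before dereferencing the second entry. */
                if (!ref_points || len < 2 * (int)sizeof(unsigned int))
                        return -1;

                return ref_points[1];
        }
      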
      Signed-off-by: Milton Miller <miltonm@bga.com>
      Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
  4. 10 Feb 2009, 1 commit
  5. 29 Jan 2009, 3 commits
    • powerpc/fsl-booke: Make CAM entries used for lowmem configurable · 96051465
      Committed by Trent Piepho
      On booke processors, the code that maps low memory only uses up to three
      CAM entries, even though there are sixteen and nothing else uses them.
      
      Make this number configurable in the advanced options menu along with max
      low memory size.  If one wants 1 GB of lowmem, then it's typically
      necessary to have four CAM entries.
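      
      A hedged sketch of how the new knob is consumed from C (the Kconfig
      symbol name here is an assumption, not necessarily the one the patch
      adds):
      
        /* Number of TLB1/CAM entries to dedicate to the lowmem mapping. */
        #define NUM_CAMS CONFIG_LOWMEM_CAM_NUM

        static unsigned long cam[NUM_CAMS];
      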
      Signed-off-by: Trent Piepho <tpiepho@freescale.com>
      Signed-off-by: Kumar Gala <galak@kernel.crashing.org>
    • powerpc/fsl-booke: Allow larger CAM sizes than 256 MB · c8f3570b
      Committed by Trent Piepho
      The code that maps kernel low memory would only use page sizes up to 256
      MB.  On E500v2 pages up to 4 GB are supported.
      
      However, a page must be aligned to a multiple of the page's size, i.e.
      256 MB pages must be aligned to a 256 MB boundary.  This was enforced by a
      requirement that the physical and virtual addresses of the start of lowmem
      be aligned to 256 MB.  Clearly requiring 1GB or 4GB alignment to allow
      pages of that size isn't acceptable.
      
      To solve this, I simply have adjust_total_lowmem() take alignment into
      account when it decides what size pages to use.  Give it PAGE_OFFSET =
      0x7000_0000, PHYSICAL_START = 0x3000_0000, and 2GB of RAM, and it will map
      pages like this:
      PA 0x3000_0000 VA 0x7000_0000 Size 256 MB
      PA 0x4000_0000 VA 0x8000_0000 Size 1 GB
      PA 0x8000_0000 VA 0xC000_0000 Size 256 MB
      PA 0x9000_0000 VA 0xD000_0000 Size 256 MB
      PA 0xA000_0000 VA 0xE000_0000 Size 256 MB
      
      Because the lowmem mapping code now takes alignment into account,
      PHYSICAL_ALIGN can be lowered from 256 MB to 64 MB.  Even lower might be
      possible.  The lowmem code will work down to 4 kB but it's possible some of
      the boot code will fail before then.  Poor alignment will force small pages
      to be used, which combined with the limited number of TLB1 pages available,
      will result in very little memory getting mapped.  So alignments less than
      64 MB probably aren't very useful anyway.
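      
      A hedged sketch of the alignment-aware size choice (variable names
      are illustrative, not the exact adjust_total_lowmem() code):
      
        /* The largest usable mapping is limited both by the alignment that
         * the current virtual and physical addresses share (both nonzero
         * here, since virt starts at PAGE_OFFSET) and by how much lowmem
         * is left to map. */
        unsigned long max_align = 1UL << __ffs(virt | phys);
        unsigned long cam_sz = min(max_align, ram_left);

        /* ... then round cam_sz down to a supported TLB1 page size ... */
      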
      Signed-off-by: Trent Piepho <tpiepho@freescale.com>
      Signed-off-by: Kumar Gala <galak@kernel.crashing.org>
    • powerpc/fsl-booke: Remove code duplication in lowmem mapping · f88747e7
      Committed by Trent Piepho
      The code to map lowmem uses three CAM aka TLB[1] entries to cover it.  The
      size of each is stored in three globals named __cam0, __cam1, and __cam2.
      All the code that uses them is duplicated three times, once for each of
      the three variables.
      
      We have these things called arrays and loops....
      
      Once converted to use an array, it will be easier to make the number of
      CAMs configurable.
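      
      A hedged sketch of the array-and-loop form (helper names are
      hypothetical, NUM_CAMS is the configurable entry count, and the real
      code tracks more per-entry state):
      
        static unsigned long cam[NUM_CAMS];  /* replaces __cam0/__cam1/__cam2 */
        int i;

        for (i = 0; i < NUM_CAMS && ram_left; i++) {
                cam[i] = calc_cam_sz(ram_left, virt, phys); /* hypothetical */
                map_cam(i, virt, phys, cam[i]);             /* hypothetical */
                virt += cam[i];
                phys += cam[i];
                ram_left -= cam[i];
        }
      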
      Signed-off-by: Trent Piepho <tpiepho@freescale.com>
      Signed-off-by: Kumar Gala <galak@kernel.crashing.org>
  6. 28 Jan 2009, 1 commit
  7. 16 Jan 2009, 1 commit
  8. 13 Jan 2009, 1 commit
  9. 08 Jan 2009, 9 commits
  10. 07 Jan 2009, 2 commits
    • mm: show node to memory section relationship with symlinks in sysfs · c04fc586
      Committed by Gary Hade
      Show node to memory section relationship with symlinks in sysfs
      
      Add /sys/devices/system/node/nodeX/memoryY symlinks for all
      the memory sections located on nodeX.  For example:
      /sys/devices/system/node/node1/memory135 -> ../../memory/memory135
      indicates that memory section 135 resides on node1.
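      
      A hedged sketch of how such a link can be created with the
      driver-core API (structure layouts of this era assumed; the patch's
      actual helpers differ):
      
        /* Link /sys/devices/system/node/nodeX/memoryY to the memory
         * section's own sysfs directory. */
        static int link_mem_block_to_node(struct node *node,
                                          struct memory_block *mem)
        {
                return sysfs_create_link(&node->sysdev.kobj,
                                         &mem->sysdev.kobj,
                                         kobject_name(&mem->sysdev.kobj));
        }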
      
      Also revises documentation to cover this change and updates
      Documentation/ABI/testing/sysfs-devices-memory to include descriptions
      of memory hotremove files 'phys_device', 'phys_index', and 'state'
      that were previously not described there.
      
      In addition to it always being a good policy to provide users with
      the maximum possible amount of physical location information for
      resources that can be hot-added and/or hot-removed, the following
      are some (but likely not all) of the user benefits provided by
      this change.
      Immediate:
        - Provides information needed to determine the specific node
          on which a defective DIMM is located.  This will reduce system
          downtime when the node or defective DIMM is swapped out.
        - Prevents unintended onlining of a memory section that was
          previously offlined due to a defective DIMM.  This could happen
          during node hot-add when the user or node hot-add assist script
          onlines _all_ offlined sections due to user or script inability
          to identify the specific memory sections located on the hot-added
          node.  The consequences of reintroducing the defective memory
          could be ugly.
        - Provides information needed to vary the amount and distribution
          of memory on specific nodes for testing or debugging purposes.
      Future:
        - Will provide information needed to identify the memory
          sections that need to be offlined prior to physical removal
          of a specific node.
      
      Symlink creation during boot was tested on 2-node x86_64, 2-node
      ppc64, and 2-node ia64 systems.  Symlink creation during physical
      memory hot-add tested on a 2-node x86_64 system.
      Signed-off-by: Gary Hade <garyhade@us.ibm.com>
      Signed-off-by: Badari Pulavarty <pbadari@us.ibm.com>
      Acked-by: Ingo Molnar <mingo@elte.hu>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: report the MMU pagesize in /proc/pid/smaps · 3340289d
      Committed by Mel Gorman
      The KernelPageSize entry in /proc/pid/smaps is the pagesize used by the
      kernel to back a VMA.  This matches the size used by the MMU in the
      majority of cases.  However, one counter-example occurs on PPC64 kernels
      whereby a kernel using 64K as a base pagesize may still use 4K pages for
      the MMU on older processors.  To distinguish the two, this patch reports
      MMUPageSize as the pagesize used by the MMU in /proc/pid/smaps.
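      
      A hedged sketch of the resulting output lines in the smaps show
      routine (helper names assumed: a vma_mmu_pagesize() hook that an
      architecture like powerpc can override):
      
        seq_printf(m, "KernelPageSize: %8lu kB\n",
                   vma_kernel_pagesize(vma) >> 10);
        seq_printf(m, "MMUPageSize:    %8lu kB\n",
                   vma_mmu_pagesize(vma) >> 10);
      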
      Signed-off-by: Mel Gorman <mel@csn.ul.ie>
      Cc: "KOSAKI Motohiro" <kosaki.motohiro@jp.fujitsu.com>
      Cc: Alexey Dobriyan <adobriyan@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  11. 29 Dec 2008, 1 commit
  12. 23 Dec 2008, 2 commits
  13. 21 Dec 2008, 5 commits