1. 11 February 2009, 4 commits
    • powerpc/mm: Rework I$/D$ coherency (v3) · 8d30c14c
      Committed by Benjamin Herrenschmidt
      This patch reworks the way we do I and D cache coherency on PowerPC.
      
      The "old" way was split in 3 different parts depending on the processor type:
      
         - Hash with per-page exec support (64-bit and >= POWER4 only) does it
      at hashing time, by preventing exec on unclean pages and cleaning pages
      on exec faults.
      
         - Everything without per-page exec support (32-bit hash, 8xx, and
      64-bit < POWER4) does it for all pages going to user space in update_mmu_cache().
      
         - Embedded with per-page exec support does it from do_page_fault() on
      exec faults, in a way similar to what the hash code does.
      
      That leads to confusion and bugs. For example, the method using update_mmu_cache()
      is racy on SMP: another processor can see the new PTE and hash it in before
      we have cleaned the cache, and then blow up trying to execute. This is hard to hit,
      but I think it has bitten us in the past.
      
      Also, it's inefficient for embedded where we always end up having to do at least
      one more page fault.
      
      This reworks the whole thing by moving the cache sync into two main call sites,
      though we keep different behaviours depending on the HW capability. The call
      sites are set_pte_at() which is now made out of line, and ptep_set_access_flags()
      which joins the former in pgtable.c
      
      The base idea for Embedded with per-page exec support is that we now do the
      flush at set_pte_at() time when coming from an exec fault, which allows us
      to avoid the double fault problem completely (we can even improve the situation
      more by implementing TLB preload in update_mmu_cache() but that's for later).
      
      If for some reason we didn't do it there and we then try to execute, we'll hit
      the page fault, which will do a minor fault, which will hit ptep_set_access_flags()
      to do things like update _PAGE_ACCESSED or _PAGE_DIRTY if needed; we now make
      that path also perform the I/D cache sync for exec faults. This second path
      is the catch-all for things that weren't cleaned at set_pte_at() time.
      
      For cpus without per-page exec support, we always do the sync at set_pte_at(),
      thus guaranteeing that when the PTE is visible to other processors, the cache
      is clean (a minimal sketch of this idea follows this entry).
      
      For the 64-bit hash with per-page exec support case, we keep the old mechanism
      for now. I'll look into changing it later, once I've reworked a bit how we
      use _PAGE_EXEC.
      
      This is also a first step toward adding _PAGE_EXEC support for embedded platforms.
      Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      8d30c14c
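      For illustration, here is a minimal sketch of the set_pte_at() idea for the
      "no per-page exec support" case. It assumes the existing powerpc helper
      flush_dcache_icache_page() and the PG_arch_1 page flag used to remember that
      a page's cache is already clean; this is not the actual arch/powerpc/mm/pgtable.c
      code, only the shape of it:

      #include <linux/mm.h>
      #include <asm/cacheflush.h>

      void set_pte_at(struct mm_struct *mm, unsigned long addr,
                      pte_t *ptep, pte_t pte)
      {
              /* Only pages backed by real RAM need the I$/D$ sync. */
              if (pte_present(pte) && pfn_valid(pte_pfn(pte))) {
                      struct page *pg = pfn_to_page(pte_pfn(pte));

                      /* PG_arch_1 remembers that this page is already coherent. */
                      if (!test_bit(PG_arch_1, &pg->flags)) {
                              flush_dcache_icache_page(pg); /* writeback D$, invalidate I$ */
                              set_bit(PG_arch_1, &pg->flags);
                      }
              }
              *ptep = pte;    /* the real code uses the arch PTE store helpers and barriers */
      }

      Because the sync happens before the PTE store, by the time another processor
      can see (and hash in) the new PTE, the cache is already clean.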
    • powerpc/mm: Move 64-bit unmapped_area to top of address space · 91b0f5ec
      Committed by Anton Blanchard
      We currently place mmaps just below the stack on 32bit, but leave them
      in the middle of the address space on 64bit:
      
      00100000-00120000 r-xp 00100000 00:00 0                    [vdso]
      10000000-10010000 r-xp 00000000 08:06 179534               /tmp/sleep
      10010000-10020000 rw-p 00000000 08:06 179534               /tmp/sleep
      10020000-10130000 rw-p 10020000 00:00 0                    [heap]
      40000000000-40000030000 r-xp 00000000 08:06 440743         /lib64/ld-2.9.so
      40000030000-40000040000 rw-p 00020000 08:06 440743         /lib64/ld-2.9.so
      40000050000-400001f0000 r-xp 00000000 08:06 440671         /lib64/libc-2.9.so
      400001f0000-40000200000 r--p 00190000 08:06 440671         /lib64/libc-2.9.so
      40000200000-40000220000 rw-p 001a0000 08:06 440671         /lib64/libc-2.9.so
      40000220000-40008230000 rw-p 40000220000 00:00 0
      fffffbc0000-fffffd10000 rw-p fffffeb0000 00:00 0           [stack]
      
      Right now it isn't an issue, but at some stage we will run into mmap or
      hugetlb allocation issues. Using the same layout as 32bit gives us
      some breathing room. This matches what x86-64 is doing too.
      
      00100000-00103000 r-xp 00100000 00:00 0                    [vdso]
      10000000-10001000 r-xp 00000000 08:06 554894               /tmp/test
      10010000-10011000 r--p 00000000 08:06 554894               /tmp/test
      10011000-10012000 rw-p 00001000 08:06 554894               /tmp/test
      10012000-10113000 rw-p 10012000 00:00 0                    [heap]
      fffefdf7000-ffff7df8000 rw-p fffefdf7000 00:00 0
      ffff7df8000-ffff7f97000 r-xp 00000000 08:06 130591         /lib64/libc-2.9.so
      ffff7f97000-ffff7fa6000 ---p 0019f000 08:06 130591         /lib64/libc-2.9.so
      ffff7fa6000-ffff7faa000 r--p 0019e000 08:06 130591         /lib64/libc-2.9.so
      ffff7faa000-ffff7fc0000 rw-p 001a2000 08:06 130591         /lib64/libc-2.9.so
      ffff7fc0000-ffff7fc4000 rw-p ffff7fc0000 00:00 0
      ffff7fc4000-ffff7fec000 r-xp 00000000 08:06 130663         /lib64/ld-2.9.so
      ffff7fee000-ffff7ff0000 rw-p ffff7fee000 00:00 0
      ffff7ffa000-ffff7ffb000 rw-p ffff7ffa000 00:00 0
      ffff7ffb000-ffff7ffc000 r--p 00027000 08:06 130663         /lib64/ld-2.9.so
      ffff7ffc000-ffff7fff000 rw-p 00028000 08:06 130663         /lib64/ld-2.9.so
      ffff7fff000-ffff8000000 rw-p ffff7fff000 00:00 0
      fffffc59000-fffffc6e000 rw-p ffffffeb000 00:00 0           [stack]
      Signed-off-by: Anton Blanchard <anton@samba.org>
      Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      91b0f5ec
    • powerpc/numa: Remove redundant find_cpu_node() · 8b16cd23
      Committed by Milton Miller
      Use of_get_cpu_node(), which is a superset of numa.c's find_cpu_node() and
      lives in a less restrictive section (text vs cpuinit).
      Signed-off-by: Milton Miller <miltonm@bga.com>
      Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      8b16cd23
    • powerpc/numa: Avoid possible reference beyond prop. length in find_min_common_depth() · 20fcefe5
      Committed by Milton Miller
      find_min_common_depth() was checking the property length incorrectly:
      the length is reported in bytes, not cells, and the code reads the second
      entry (a sketch of the corrected check follows this entry).
      Signed-off-by: Milton Miller <miltonm@bga.com>
      Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      20fcefe5
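      As an illustration of the kind of check involved (an assumption-laden sketch,
      not the actual diff): of_get_property() reports the property length in bytes,
      so safely reading the second cell needs at least two cells' worth of bytes.
      The property name matches what numa.c reads; the surrounding function is
      illustrative only.

      #include <linux/of.h>

      static int example_min_common_depth(struct device_node *root)
      {
              int len;
              const unsigned int *refs =
                      of_get_property(root, "ibm,associativity-reference-points", &len);

              /* len is in bytes, not cells: two cells need 2 * sizeof(unsigned int) */
              if (!refs || len < (int)(2 * sizeof(unsigned int)))
                      return -1;

              return refs[1]; /* the depth comes from the second entry */
      }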
  2. 10 February 2009, 1 commit
  3. 29 January 2009, 3 commits
    • powerpc/fsl-booke: Make CAM entries used for lowmem configurable · 96051465
      Committed by Trent Piepho
      On booke processors, the code that maps low memory only uses up to three
      CAM entries, even though there are sixteen and nothing else uses them.
      
      Make this number configurable in the advanced options menu along with max
      low memory size.  If one wants 1 GB of lowmem, then it's typically
      necessary to have four CAM entries.
      Signed-off-by: Trent Piepho <tpiepho@freescale.com>
      Signed-off-by: Kumar Gala <galak@kernel.crashing.org>
      96051465
    • powerpc/fsl-booke: Allow larger CAM sizes than 256 MB · c8f3570b
      Committed by Trent Piepho
      The code that maps kernel low memory would only use page sizes up to 256
      MB.  On E500v2 pages up to 4 GB are supported.
      
      However, a page must be aligned to a multiple of the page's size.  I.e.
      256 MB pages must be aligned to a 256 MB boundary.  This was enforced by a
      requirement that the physical and virtual addresses of the start of lowmem
      be aligned to 256 MB.  Clearly, requiring 1 GB or 4 GB alignment just to allow
      pages of that size isn't acceptable.
      
      To solve this, I simply have adjust_total_lowmem() take alignment into
      account when it decides what size pages to use (a sketch of this sizing
      logic follows this entry).  Give it PAGE_OFFSET = 0x7000_0000, PHYSICAL_START
      = 0x3000_0000, and 2GB of RAM, and it will map pages like this:
      PA 0x3000_0000 VA 0x7000_0000 Size 256 MB
      PA 0x4000_0000 VA 0x8000_0000 Size 1 GB
      PA 0x8000_0000 VA 0xC000_0000 Size 256 MB
      PA 0x9000_0000 VA 0xD000_0000 Size 256 MB
      PA 0xA000_0000 VA 0xE000_0000 Size 256 MB
      
      Because the lowmem mapping code now takes alignment into account,
      PHYSICAL_ALIGN can be lowered from 256 MB to 64 MB.  Even lower might be
      possible.  The lowmem code will work down to 4 kB but it's possible some of
      the boot code will fail before then.  Poor alignment will force small pages
      to be used, which, combined with the limited number of TLB1 pages available,
      will result in very little memory getting mapped.  So alignments less than
      64 MB probably aren't very useful anyway.
      Signed-off-by: Trent Piepho <tpiepho@freescale.com>
      Signed-off-by: Kumar Gala <galak@kernel.crashing.org>
      c8f3570b
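      A minimal sketch of the alignment-aware sizing idea (an illustrative helper,
      not the actual adjust_total_lowmem() code): each CAM entry must be naturally
      aligned, so the usable size is bounded by the lowest set bit of the current
      virtual and physical addresses, by the remaining memory, and by the hardware
      maximum, rounded down to a supported power-of-4 size.

      #include <linux/kernel.h>
      #include <linux/log2.h>

      /* Assumes remaining > 0; max_cam is the largest size the hardware supports. */
      static unsigned long cam_size_for(unsigned long virt, unsigned long phys,
                                        unsigned long remaining, unsigned long max_cam)
      {
              unsigned long combined = virt | phys;
              /* natural alignment of the current position = its lowest set bit */
              unsigned long align = combined ? (combined & -combined) : max_cam;
              unsigned long size = min(remaining, min(align, max_cam));

              /* e500 CAM entries come in power-of-4 sizes (4 kB ... 4 GB) */
              return 1UL << (ilog2(size) & ~1);
      }

      With the example above (VA 0x7000_0000, PA 0x3000_0000, 2 GB of RAM) and a
      1 GB per-entry cap, walking the memory with this helper yields the same
      256 MB, 1 GB, 256 MB, 256 MB, 256 MB sequence.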
    • powerpc/fsl-booke: Remove code duplication in lowmem mapping · f88747e7
      Committed by Trent Piepho
      The code to map lowmem uses three CAM (aka TLB[1]) entries to cover it.  The
      size of each is stored in three globals named __cam0, __cam1, and __cam2.
      All the code that uses them is duplicated three times for each of the three
      variables.
      
      We have these things called arrays and loops....
      
      Once converted to use an array, it will be easier to make the number of
      CAMs configurable (a sketch of the resulting shape follows this entry).
      Signed-off-by: Trent Piepho <tpiepho@freescale.com>
      Signed-off-by: Kumar Gala <galak@kernel.crashing.org>
      f88747e7
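      A sketch of the shape this refactor takes (identifiers such as
      map_cam_entry() are placeholders, not the real fsl_booke_mmu.c names):

      #include <linux/types.h>

      #define NUM_CAMS 3      /* made configurable by a later patch */

      static unsigned long cam[NUM_CAMS];     /* replaces __cam0, __cam1, __cam2 */

      /* hypothetical helper that programs one CAM/TLB1 entry */
      static void map_cam_entry(int index, unsigned long virt, phys_addr_t phys,
                                unsigned long size);

      static unsigned long map_lowmem_cams(unsigned long virt, phys_addr_t phys)
      {
              unsigned long amount_mapped = 0;
              int i;

              /* one loop instead of three copy-pasted blocks */
              for (i = 0; i < NUM_CAMS && cam[i]; i++) {
                      map_cam_entry(i, virt + amount_mapped,
                                    phys + amount_mapped, cam[i]);
                      amount_mapped += cam[i];
              }
              return amount_mapped;
      }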
  4. 28 January 2009, 1 commit
  5. 16 January 2009, 1 commit
  6. 13 January 2009, 1 commit
  7. 08 January 2009, 9 commits
  8. 07 January 2009, 2 commits
    • mm: show node to memory section relationship with symlinks in sysfs · c04fc586
      Committed by Gary Hade
      Show node to memory section relationship with symlinks in sysfs
      
      Add /sys/devices/system/node/nodeX/memoryY symlinks for all
      the memory sections located on nodeX.  For example:
      /sys/devices/system/node/node1/memory135 -> ../../memory/memory135
      indicates that memory section 135 resides on node1 (a sketch of how such a
      link is created follows this entry).
      
      Also revises documentation to cover this change as well as updating
      Documentation/ABI/testing/sysfs-devices-memory to include descriptions
      of memory hotremove files 'phys_device', 'phys_index', and 'state'
      that were previously not described there.
      
      In addition to it always being a good policy to provide users with
      the maximum possible amount of physical location information for
      resources that can be hot-added and/or hot-removed, the following
      are some (but likely not all) of the user benefits provided by
      this change.
      Immediate:
        - Provides information needed to determine the specific node
          on which a defective DIMM is located.  This will reduce system
          downtime when the node or defective DIMM is swapped out.
        - Prevents unintended onlining of a memory section that was
          previously offlined due to a defective DIMM.  This could happen
          during node hot-add when the user or node hot-add assist script
          onlines _all_ offlined sections due to user or script inability
          to identify the specific memory sections located on the hot-added
          node.  The consequences of reintroducing the defective memory
          could be ugly.
        - Provides information needed to vary the amount and distribution
          of memory on specific nodes for testing or debugging purposes.
      Future:
        - Will provide information needed to identify the memory
          sections that need to be offlined prior to physical removal
          of a specific node.
      
      Symlink creation during boot was tested on 2-node x86_64, 2-node
      ppc64, and 2-node ia64 systems.  Symlink creation during physical
      memory hot-add was tested on a 2-node x86_64 system.
      Signed-off-by: Gary Hade <garyhade@us.ibm.com>
      Signed-off-by: Badari Pulavarty <pbadari@us.ibm.com>
      Acked-by: Ingo Molnar <mingo@elte.hu>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      c04fc586
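      A sketch of how such a symlink is typically created (the helper name below
      is an assumption; the real patch adds functions along the lines of
      register_mem_sect_under_node() in drivers/base/node.c):

      #include <linux/kobject.h>
      #include <linux/sysfs.h>

      /* creates /sys/devices/system/node/nodeX/memoryY -> ../../memory/memoryY */
      static int link_mem_section_to_node(struct kobject *node_kobj,
                                          struct kobject *mem_kobj)
      {
              return sysfs_create_link(node_kobj, mem_kobj, kobject_name(mem_kobj));
      }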
    • mm: report the MMU pagesize in /proc/pid/smaps · 3340289d
      Committed by Mel Gorman
      The KernelPageSize entry in /proc/pid/smaps is the pagesize used by the
      kernel to back a VMA.  This matches the size used by the MMU in the
      majority of cases.  However, one counter-example occurs on PPC64 kernels,
      whereby a kernel using 64K as a base pagesize may still use 4K pages for
      the MMU on older processors.  To distinguish the two, this patch reports
      MMUPageSize as the pagesize used by the MMU in /proc/pid/smaps (a sketch
      of the new hook follows this entry).
      Signed-off-by: Mel Gorman <mel@csn.ul.ie>
      Cc: "KOSAKI Motohiro" <kosaki.motohiro@jp.fujitsu.com>
      Cc: Alexey Dobriyan <adobriyan@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      3340289d
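      A sketch of the generic side of the change, as I understand it: the new
      vma_mmu_pagesize() hook gets a weak default that falls back to the kernel
      page size, and arch code (ppc64 here) overrides it with the size the MMU
      really uses.

      #include <linux/mm.h>
      #include <linux/hugetlb.h>

      /* Generic fallback: report the same size the kernel uses to back the VMA. */
      unsigned long __attribute__((weak)) vma_mmu_pagesize(struct vm_area_struct *vma)
      {
              return vma_kernel_pagesize(vma);
      }

      /proc/pid/smaps then prints this value as MMUPageSize next to KernelPageSize.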
  9. 29 December 2008, 1 commit
  10. 23 December 2008, 2 commits
  11. 21 December 2008, 9 commits
  12. 16 December 2008, 4 commits
    • powerpc/mm: Remove flush_HPTE() · f63837f0
      Committed by Benjamin Herrenschmidt
      The function flush_HPTE() is used in only one place, the implementation
      of DEBUG_PAGEALLOC on ppc32.
      
      It's actually a dup of flush_tlb_page(), though it's -slightly- more
      efficient on hash-based processors.  We remove it and replace it with a
      direct call to the hash flush code on those processors, and with
      flush_tlb_page() for everybody else.
      Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Signed-off-by: Paul Mackerras <paulus@samba.org>
      f63837f0
    • powerpc/mm: Rename tlb_32.c and tlb_64.c to tlb_hash32.c and tlb_hash64.c · e41e811a
      Committed by Benjamin Herrenschmidt
      This renames the files to clarify the fact that they are used by
      the hash based family of CPUs (the 603 being an exception in that
      family but is still handled by that code).
      
      This paves the way for the new tlb_nohash.c coming via a subsequent
      commit.
      Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Acked-by: Kumar Gala <galak@kernel.crashing.org>
      Signed-off-by: Paul Mackerras <paulus@samba.org>
      e41e811a
    • powerpc: Fix bootmem reservation on uninitialized node · a4c74ddd
      Committed by Dave Hansen
      careful_allocation() was calling into the bootmem allocator for
      nodes which had not been fully initialized, which caused a previous
      bug:  http://patchwork.ozlabs.org/patch/10528/  So, I merged a
      few broken-out loops in do_init_bootmem() to fix it.  That changed
      the code ordering.
      
      I think this bug is triggered by having reserved areas for a node
      which are spanned by another node's contents.  In the
      mark_reserved_regions_for_nid() code, we attempt to reserve the
      area for a node before we have allocated the NODE_DATA() for that
      nid.  We do this since I reordered that loop.  I suck.
      
      This is causing crashes at bootup on some systems, as reported
      by Jon Tollefson.
      
      This may only show up on some systems that have 16GB pages
      reserved.  But it can probably happen on any system that is
      trying to reserve large swaths of memory that happen to span other
      nodes' contents.
      
      This commit ensures that we do not touch bootmem for any node which
      has not been initialized, and also removes a compile warning about
      an unused variable.
      Signed-off-by: Dave Hansen <dave@linux.vnet.ibm.com>
      Signed-off-by: Paul Mackerras <paulus@samba.org>
      a4c74ddd
    • powerpc: Check for valid hugepage size in hugetlb_get_unmapped_area · 48f797de
      Committed by Brian King
      It looks like most of the hugetlb code is doing the correct thing if
      hugepages are not supported, but the mmap code is not.  If we get into
      the mmap code when hugepages are not supported, such as in an LPAR
      which is running Active Memory Sharing, we can oops the kernel.  This
      fixes the oops being seen in this path (a sketch of the guard follows this entry).
      
      oops: Kernel access of bad area, sig: 11 [#1]
      SMP NR_CPUS=1024 NUMA pSeries
      Modules linked in: nfs(N) lockd(N) nfs_acl(N) sunrpc(N) ipv6(N) fuse(N) loop(N)
      dm_mod(N) sg(N) ibmveth(N) sd_mod(N) crc_t10dif(N) ibmvscsic(N)
      scsi_transport_srp(N) scsi_tgt(N) scsi_mod(N)
      Supported: No
      NIP: c000000000038d60 LR: c00000000003945c CTR: c0000000000393f0
      REGS: c000000077e7b830 TRAP: 0300   Tainted: G
      (2.6.27.5-bz50170-2-ppc64)
      MSR: 8000000000009032 <EE,ME,IR,DR>  CR: 44000448  XER: 20000001
      DAR: c000002000af90a8, DSISR: 0000000040000000
      TASK = c00000007c1b8600[4019] 'hugemmap01' THREAD: c000000077e78000 CPU: 6
      GPR00: 0000001fffffffe0 c000000077e7bab0 c0000000009a4e78 0000000000000000
      GPR04: 0000000000010000 0000000000000001 00000000ffffffff 0000000000000001
      GPR08: 0000000000000000 c000000000af90c8 0000000000000001 0000000000000000
      GPR12: 000000000000003f c000000000a73880 0000000000000000 0000000000000000
      GPR16: 0000000000000000 0000000000000000 0000000000000000 0000000000010000
      GPR20: 0000000000000000 0000000000000003 0000000000010000 0000000000000001
      GPR24: 0000000000000003 0000000000000000 0000000000000001 ffffffffffffffb5
      GPR28: c000000077ca2e80 0000000000000000 c00000000092af78 0000000000010000
      NIP [c000000000038d60] .slice_get_unmapped_area+0x6c/0x4e0
      LR [c00000000003945c] .hugetlb_get_unmapped_area+0x6c/0x80
      Call Trace:
      [c000000077e7bbc0] [c00000000003945c] .hugetlb_get_unmapped_area+0x6c/0x80
      [c000000077e7bc30] [c000000000107e30] .get_unmapped_area+0x64/0xd8
      [c000000077e7bcb0] [c00000000010b140] .do_mmap_pgoff+0x140/0x420
      [c000000077e7bd80] [c00000000000bf5c] .sys_mmap+0xc4/0x140
      [c000000077e7be30] [c0000000000086b4] syscall_exit+0x0/0x40
      Instruction dump:
      fac1ffb0 fae1ffb8 fb01ffc0 fb21ffc8 fb41ffd0 fb61ffd8 fb81ffe0 fbc1fff0
      fbe1fff8 f821fef1 f8c10158 f8e10160 <7d49002e> f9010168 e92d01b0 eb4902b0
      Signed-off-by: Brian King <brking@linux.vnet.ibm.com>
      Signed-off-by: Paul Mackerras <paulus@samba.org>
      48f797de
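      A sketch of the kind of guard involved (illustrative only; the exact check
      and how the result feeds into the slice allocator in the real patch may
      differ): verify the huge page size maps to something the powerpc MMU code
      actually knows about before going any further.

      #include <linux/fs.h>
      #include <linux/hugetlb.h>
      #include <asm/mmu.h>    /* powerpc: shift_to_mmu_psize() */

      /* Returns true when the file's huge page size maps to a valid MMU page size. */
      static bool hugepage_size_supported(struct file *file)
      {
              struct hstate *h = hstate_file(file);

              /* shift_to_mmu_psize() returns a negative value for unknown sizes */
              return shift_to_mmu_psize(huge_page_shift(h)) >= 0;
      }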
  13. 03 December 2008, 2 commits