1. 01 Apr 2008 (3 commits)
  2. 24 Mar 2008 (1 commit)
    •
      [POWERPC] Don't use 64k pages for ioremap on pSeries · cfe666b1
      Authored by Paul Mackerras
      On pSeries, the hypervisor doesn't let us map in the eHEA ethernet
      adapter using 64k pages, and thus the ehea driver will fail if 64k
      pages are configured.  This works around the problem by always
      using 4k pages for ioremap on pSeries (but not on other platforms).
      A better fix would be to check whether the partition could ever
      have an eHEA adapter, and only force 4k pages if it could, but this
      will do for 2.6.25.
      
      This is based on an earlier patch by Tony Breeds.
      Signed-off-by: Paul Mackerras <paulus@samba.org>
  3. 20 Mar 2008 (1 commit)
  4. 13 Mar 2008 (1 commit)
    •
      [POWERPC] Fix large hash table allocation on Cell blades · 31bf1119
      Authored by Michael Ellerman
      My recent hack to allocate the hash table under 1GB on cell was poorly
      tested, *cough*. It turns out on blades with large amounts of memory we
      fail to allocate the hash table at all. This is because RTAS has been
      instantiated just below 768MB, and 0-x MB are used by the kernel,
      leaving no areas that are both large enough and also naturally-aligned.
      
      For the cell IOMMU hack the page tables must be under 2GB, so use that
      as the limit instead. This has been tested on real hardware and boots
      happily.
      Signed-off-by: Michael Ellerman <michael@ellerman.id.au>
      Signed-off-by: Paul Mackerras <paulus@samba.org>
  5. 26 Feb 2008 (1 commit)
  6. 14 Feb 2008 (1 commit)
  7. 09 Feb 2008 (1 commit)
    •
      CONFIG_HIGHPTE vs. sub-page page tables. · 2f569afd
      Authored by Martin Schwidefsky
      Background: I've implemented 1K/2K page tables for s390.  These sub-page
      page tables are required to properly support the s390 virtualization
      instruction with KVM.  The SIE instruction requires that the page tables
      have 256 page table entries (pte) followed by 256 page status table entries
      (pgste).  The pgstes are only required if the process is using the SIE
      instruction.  The pgstes are updated by the hardware and by the hypervisor
      for a number of reasons, one of them is dirty and reference bit tracking.
      To avoid wasting memory the standard pte table allocation should return
      1K/2K (31/64 bit) and 2K/4K if the process is using SIE.
      
      Problem: Page size on s390 is 4K, page table size is 1K or 2K.  That means
      the s390 version of pte_alloc_one cannot return a pointer to a struct
      page.  Trouble is that with the CONFIG_HIGHPTE feature on x86 pte_alloc_one
      cannot return a pointer to a pte either, since that would require more than
      32 bits for the return value of pte_alloc_one (and the pte * would not be
      accessible since it's not kmapped).
      
      Solution: The only solution I found to this dilemma is a new typedef: a
      pgtable_t.  For s390 pgtable_t will be a (pte *) - to be introduced with a
      later patch.  For everybody else it will be a (struct page *).  The
      additional problem with the initialization of the ptl lock and the
      NR_PAGETABLE accounting is solved with a constructor pgtable_page_ctor and
      a destructor pgtable_page_dtor.  The page table allocation and free
      functions need to call these two whenever a page table page is allocated or
      freed.  pmd_populate will get a pgtable_t instead of a struct page pointer.
       To get the pgtable_t back from a pmd entry that has been installed with
      pmd_populate a new function pmd_pgtable is added.  It replaces the pmd_page
      call in free_pte_range and apply_to_pte_range.
      Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: <linux-arch@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  8. 08 Feb 2008 (3 commits)
  9. 07 Feb 2008 (1 commit)
    •
      [POWERPC] Fake NUMA emulation for PowerPC · 1daa6d08
      Authored by Balbir Singh
      Here's a dumb simple implementation of fake NUMA nodes for PowerPC.
      Fake NUMA nodes can be specified using the following command line
      option
      
      numa=fake=<node range>
      
      node range is of the format <range1>,<range2>,...<rangeN>
      
      Each of the rangeX parameters is passed using memparse().  I find the
      patch useful for fake NUMA emulation on my simple PowerPC machine.
      I've tested it on a numa box with the following arguments:
      
      numa=fake=512M
      numa=fake=512M,768M
      numa=fake=256M,512M mem=512M
      numa=fake=1G mem=768M
      numa=fake=
      without any numa= argument
      
      The other side-effect introduced by this patch is that, in the case
      where we don't have NUMA information, we now set a node online after
      adding each LMB.  This node could very well be node 0, but in the case
      that we enable fake NUMA nodes, when we cross node boundaries, we need
      to set the new node online.
      Signed-off-by: Balbir Singh <balbir@linux.vnet.ibm.com>
      Signed-off-by: Paul Mackerras <paulus@samba.org>
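      Each range such as 512M or 1G above is parsed with memparse(). A minimal userspace sketch of that suffix handling follows; memparse_sketch() is a hypothetical stand-in (the real kernel helper also understands further suffixes and is what actually parses each rangeX parameter).

```c
#include <assert.h>
#include <stdlib.h>

/* Minimal sketch of memparse()-style parsing for numa=fake=<range>,...:
 * a number with an optional K/M/G binary suffix. */
static unsigned long long memparse_sketch(const char *s, char **retptr)
{
    unsigned long long val = strtoull(s, retptr, 0);

    switch (**retptr) {
    case 'G': case 'g':
        val <<= 10;            /* fall through: G = 1024 M */
    case 'M': case 'm':
        val <<= 10;            /* fall through: M = 1024 K */
    case 'K': case 'k':
        val <<= 10;            /* K = 1024 bytes */
        (*retptr)++;
        break;
    }
    return val;
}
```

      With this, "512M,768M" would be consumed one range at a time, advancing past each suffix and the comma separator.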
  10. 06 Feb 2008 (2 commits)
  11. 31 Jan 2008 (1 commit)
  12. 26 Jan 2008 (1 commit)
    •
      Revert "[POWERPC] Fake NUMA emulation for PowerPC" · 55852bed
      Authored by Paul Mackerras
      This reverts commit 5c3f5892,
      basically because it changes behaviour even when no fake NUMA
      information is specified on the kernel command line.
      
      Firstly, it changes the nid, thus destroying the real NUMA
      information.  Secondly, it also changes behaviour in that if a node
      ends up with no memory in it because of the memory limit, we used to
      set it online and now we don't.
      
      Also, in the non-NUMA case with no fake NUMA information, we do
      node_set_online once for each LMB now, whereas previously we only did
      it once.  I don't know if that is actually a problem, but it does seem
      unnecessary.
      Signed-off-by: Paul Mackerras <paulus@samba.org>
  13. 25 Jan 2008 (1 commit)
  14. 24 Jan 2008 (3 commits)
    •
      [POWERPC] 85xx: Respect KERNELBASE, PAGE_OFFSET, and PHYSICAL_START on e500 · e8b63761
      Authored by Dale Farnsworth
      The e500 MMU init code previously assumed KERNELBASE always equaled
      PAGE_OFFSET and PHYSICAL_START was 0.  Removing that assumption is
      useful for kdump support as well as asymmetric multicore.
      
      For the initial kdump support the secondary kernel will run at 32M
      but needs access to all of memory, so we bump the initial TLB up to
      64M.  This also matches the forthcoming ePAPR spec.
      Signed-off-by: Dale Farnsworth <dale@farnsworth.org>
      Signed-off-by: Kumar Gala <galak@kernel.crashing.org>
    •
      [POWERPC] Fix handling of memreserve if the range lands in highmem · f98eeb4e
      Authored by Kumar Gala
      There were several issues if a memreserve range existed and happened
      to be in highmem:
      
      * The bootmem allocator is only aware of lowmem so calling
        reserve_bootmem with a highmem address would cause a BUG_ON
      * All highmem pages were provided to the buddy allocator
      
      Added an lmb_is_reserved() API that we now use to determine if a highmem
      page should continue to be PageReserved or be provided to the buddy
      allocator.
      
      Also, we previously misreported the number of pages reserved, since all
      highmem pages are initially marked reserved and we clear the
      PageReserved flag as we "free" up the highmem pages.
      Signed-off-by: Kumar Gala <galak@kernel.crashing.org>
    •
      [POWERPC] Provide a way to protect 4k subpages when using 64k pages · fa28237c
      Authored by Paul Mackerras
      Using 64k pages on 64-bit PowerPC systems makes life difficult for
      emulators that are trying to emulate an ISA, such as x86, which use a
      smaller page size, since the emulator can no longer use the MMU and
      the normal system calls for controlling page protections.  Of course,
      the emulator can emulate the MMU by checking and possibly remapping
      the address for each memory access in software, but that is pretty
      slow.
      
      This provides a facility for such programs to control the access
      permissions on individual 4k sub-pages of 64k pages.  The idea is
      that the emulator supplies an array of protection masks to apply to a
      specified range of virtual addresses.  These masks are applied at the
      level where hardware PTEs are inserted into the hardware page table
      based on the Linux PTEs, so the Linux PTEs are not affected.  Note
      that this new mechanism does not allow any access that would otherwise
      be prohibited; it can only prohibit accesses that would otherwise be
      allowed.  This new facility is only available on 64-bit PowerPC and
      only when the kernel is configured for 64k pages.
      
      The masks are supplied using a new subpage_prot system call, which
      takes a starting virtual address and length, and a pointer to an array
      of protection masks in memory.  The array has a 32-bit word per 64k
      page to be protected; each 32-bit word consists of 16 2-bit fields,
      for which 0 allows any access (that is otherwise allowed), 1 prevents
      write accesses, and 2 or 3 prevent any access.
      
      Implicit in this is that the regions of the address space that are
      protected are switched to use 4k hardware pages rather than 64k
      hardware pages (on machines with hardware 64k page support).  In fact
      the whole process is switched to use 4k hardware pages when the
      subpage_prot system call is used, but this could be improved in future
      to switch only the affected segments.
      
      The subpage protection bits are stored in a 3 level tree akin to the
      page table tree.  The top level of this tree is stored in a structure
      that is appended to the top level of the page table tree, i.e., the
      pgd array.  Since it will often only be 32-bit addresses (below 4GB)
      that are protected, the pointers to the first four bottom level pages
      are also stored in this structure (each bottom level page contains the
      protection bits for 1GB of address space), so the protection bits for
      addresses below 4GB can be accessed with one fewer load than those
      for higher addresses.
      Signed-off-by: Paul Mackerras <paulus@samba.org>
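      The per-word encoding described above (16 two-bit fields per 32-bit word, 0 = any otherwise-allowed access, 1 = no writes, 2 or 3 = no access) can be illustrated with a small packing helper. set_subpage_prot() and its low-bits-first field ordering are assumptions for illustration, not the kernel's documented bit layout.

```c
#include <assert.h>
#include <stdint.h>

/* Build one 32-bit subpage_prot mask word: one 2-bit field per 4k
 * subpage of a 64k page.  Field order here is illustrative. */
static uint32_t set_subpage_prot(uint32_t word, unsigned subpage, unsigned prot)
{
    uint32_t shift = 2 * (subpage & 15);        /* 16 subpages per word */
    word &= ~((uint32_t)3 << shift);            /* clear the old 2-bit field */
    word |= (uint32_t)(prot & 3) << shift;      /* install the new value */
    return word;
}
```

      An emulator would fill an array of such words, one per 64k page in the range, and hand it to the subpage_prot system call along with the start address and length.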
  15. 17 Jan 2008 (1 commit)
  16. 15 Jan 2008 (1 commit)
    •
      [POWERPC] Fix boot failure on POWER6 · dfbe0d3b
      Authored by Paul Mackerras
      Commit 473980a9 added a call to clear
      the SLB shadow buffer before registering it.  Unfortunately this means
      that we clear out the entries that slb_initialize has previously set in
      there.  On POWER6, the hypervisor uses the SLB shadow buffer when doing
      partition switches, and that means that after the next partition switch,
      each non-boot CPU has no SLB entries to map the kernel text and data,
      which causes it to crash.
      
      This fixes it by reverting most of 473980a9 and instead clearing the
      3rd entry explicitly in slb_initialize.  This fixes the problem that
      473980a9 was trying to solve, but without breaking POWER6.
      Signed-off-by: Paul Mackerras <paulus@samba.org>
  17. 11 Jan 2008 (1 commit)
  18. 20 Dec 2007 (1 commit)
    •
      [POWERPC] Fake NUMA emulation for PowerPC · 5c3f5892
      Authored by Balbir Singh
      Here's a dumb simple implementation of fake NUMA nodes for PowerPC.
      Fake NUMA nodes can be specified using the following command line option
      
      numa=fake=<node range>
      
      node range is of the format <range1>,<range2>,...<rangeN>
      
      Each of the rangeX parameters is passed using memparse().  I find this
      useful for fake NUMA emulation on my simple PowerPC machine.  I've
      tested it on a non-numa box with the following arguments:
      
      numa=fake=1G
      numa=fake=1G,2G
      numa=fake=1G,512M,2G
      numa=fake=1500M,2800M mem=3500M
      numa=fake=1G mem=512M
      numa=fake=1G mem=1G
      Signed-off-by: Balbir Singh <balbir@linux.vnet.ibm.com>
      Acked-by: Olof Johansson <olof@lixom.net>
      Signed-off-by: Paul Mackerras <paulus@samba.org>
  19. 11 Dec 2007 (1 commit)
  20. 03 Dec 2007 (1 commit)
  21. 20 Nov 2007 (2 commits)
  22. 13 Nov 2007 (2 commits)
    •
      [POWERPC] Silence an annoying boot message · 6548d83a
      Authored by Stephen Rothwell
      vmemmap_populate will printk (with KERN_WARNING) for a lot of pages
      if CONFIG_SPARSEMEM_VMEMMAP is enabled (at least it does on iSeries).
      Use pr_debug for it instead.
      
      Replace the only other use of DBG in this file with pr_debug as well.
      Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
      Acked-by: Olof Johansson <olof@lixom.net>
      Signed-off-by: Paul Mackerras <paulus@samba.org>
    •
      [POWERPC] Fix CONFIG_SMP=n build error on ppc64 · 9bafbb0c
      Authored by Olof Johansson
      The patch "KVM: fix !SMP build error" changed the way smp_call_function()
      actually uses the passed-in function names on non-SMP builds.  Previously
      it was never caught that the function passed in was never actually
      defined.
      
      This causes a build error on ppc64_defconfig + CONFIG_SMP=n:
      
      arch/powerpc/mm/tlb_64.c: In function 'pgtable_free_now':
      arch/powerpc/mm/tlb_64.c:71: error: 'pte_free_smp_sync' undeclared (first use in this function)
      arch/powerpc/mm/tlb_64.c:71: error: (Each undeclared identifier is reported only once
      arch/powerpc/mm/tlb_64.c:71: error: for each function it appears in.)
      
      So we need to define it even if CONFIG_SMP is off. Either that or ifdef
      out the smp_call_function() call, but that's ugly.
      Signed-off-by: Olof Johansson <olof@lixom.net>
      Signed-off-by: Paul Mackerras <paulus@samba.org>
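      The fix pattern can be sketched in isolation: a helper passed to smp_call_function() must be defined even when CONFIG_SMP is off, because the uniprocessor stub still references it by name. CONFIG_SMP, the stub smp_call_function(), and the empty pte_free_smp_sync() body below are stand-ins for the kernel's versions.

```c
#include <assert.h>
#include <stddef.h>

/* Defined unconditionally (not under #ifdef CONFIG_SMP), so the
 * uniprocessor build still compiles when the stub names it. */
static void pte_free_smp_sync(void *arg)
{
    (void)arg;                  /* nothing to sync in this mock */
}

#ifndef CONFIG_SMP
/* Minimal stand-in for the kernel's uniprocessor smp_call_function() stub. */
static int smp_call_function(void (*func)(void *), void *info, int wait)
{
    (void)func; (void)info; (void)wait;
    return 0;                   /* no other CPUs to call */
}
#endif

static int pgtable_free_now(void)
{
    /* with CONFIG_SMP this would run pte_free_smp_sync on all other CPUs */
    return smp_call_function(pte_free_smp_sync, NULL, 1);
}
```

      The alternative the commit rejects (wrapping the smp_call_function() call itself in #ifdefs at every call site) would scatter the same condition through the callers.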
  23. 08 Nov 2007 (2 commits)
  24. 01 Nov 2007 (3 commits)
    •
      [POWERPC] ppc405: Fix arithmetic rollover bug when memory size under 16M · bd942ba3
      Authored by Grant Likely
      mmu_mapin_ram() loops over total_lowmem to set up page tables.  However, if
      total_lowmem is less than 16M, the subtraction rolls over and results in
      a number just under 4G (because total_lowmem is an unsigned value).
      
      This patch rejigs the loop from countup to countdown to eliminate the
      bug.
      
      Special thanks to Magnus Hjorth who wrote the original patch to fix this
      bug.  This patch improves on his by making the loop code simpler (which
      also eliminates the possibility of another rollover at the high end)
      and also applies the change to arch/powerpc.
      Signed-off-by: Grant Likely <grant.likely@secretlab.ca>
      Signed-off-by: Josh Boyer <jwboyer@linux.vnet.ibm.com>
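      The rollover is plain unsigned arithmetic and easy to reproduce in a standalone sketch; 32-bit values are used to match the "just under 4G" behaviour described, and the function name is illustrative, not the kernel's.

```c
#include <assert.h>

/* Subtracting a larger mapping size from an unsigned total wraps
 * around to a value just under 4G instead of going negative, which
 * is what made the count-up loop misbehave below 16M. */
static unsigned int remaining_after(unsigned int total_lowmem,
                                    unsigned int mapped)
{
    return total_lowmem - mapped;   /* wraps when mapped > total_lowmem */
}
```

      Counting down instead, so the loop variable never drops below zero before the loop exits, removes the wrap at both ends of the range.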
    •
      [POWERPC] 4xx: Deal with 44x virtually tagged icache · b98ac05d
      Authored by Benjamin Herrenschmidt
      The 44x family has an interesting "feature", which is a virtually
      tagged instruction cache (yuck!).  So far, we haven't dealt with
      it properly, which means we've been mostly lucky or people didn't
      report the problems, unless people have been running custom patches
      in their distro...
      
      This is an attempt at fixing it properly. I chose to do it by
      setting a global flag whenever we change a PTE that was previously
      marked executable, and flush the entire instruction cache upon
      return to user space when that happens.
      
      This is a bit heavy-handed, but it's hard to do more fine-grained
      flushes, as the icbi instruction on those processors, for some very
      strange reason (since the cache is virtually mapped), still requires
      a valid TLB entry for reading in the target address space, which
      isn't something I want to deal with.
      Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Signed-off-by: Josh Boyer <jwboyer@linux.vnet.ibm.com>
    •
      [POWERPC] 4xx: Fix 4xx flush_tlb_page() · e701d269
      Authored by Benjamin Herrenschmidt
      On 4xx CPUs, the current implementation of flush_tlb_page() uses
      a low level _tlbie() assembly function that only works for the
      current PID. Thus, invalidations caused by, for example, a COW
      fault triggered by get_user_pages() from a different context will
      not work properly, causing among other things, gdb breakpoints
      to fail.
      
      This patch adds a "pid" argument to _tlbie() on 4xx processors,
      and uses it to flush entries in the right context. FSL BookE
      also gets the argument but it seems they don't need it (their
      tlbivax form ignores the PID when invalidating according to the
      document I have).
      Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Acked-by: Kumar Gala <galak@kernel.crashing.org>
      Signed-off-by: Josh Boyer <jwboyer@linux.vnet.ibm.com>
  25. 29 Oct 2007 (1 commit)
    •
      [POWERPC] powerpc: Fix demotion of segments to 4K pages · f6ab0b92
      Authored by Benjamin Herrenschmidt
      When demoting a process to use 4K HW pages (instead of 64K), which
      happens under various circumstances such as doing cache inhibited
      mappings on machines that do not support 64K CI pages, the assembly
      hash code calls back into the C function flush_hash_page().  This
      function's prototype was recently changed to accommodate 1T segments,
      but the assembly call site was not updated, causing applications that
      do demotion to hang.  In addition, when updating the per-CPU PACA for
      the new sizes, we didn't properly update the slice "map", thus causing
      the SLB miss code to re-insert segments for the wrong size.
      
      This fixes both and adds a warning comment next to the C
      implementation to try to avoid problems next time someone changes it.
      Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Signed-off-by: Paul Mackerras <paulus@samba.org>
  26. 20 Oct 2007 (1 commit)
    •
      pid namespaces: define is_global_init() and is_container_init() · b460cbc5
      Authored by Serge E. Hallyn
      is_init() is an ambiguous name for the pid==1 check.  Split it into
      is_global_init() and is_container_init().
      
      A cgroup init has its tsk->pid == 1.
      
      A global init also has its tsk->pid == 1, and its active pid namespace
      is the init_pid_ns.  But rather than check the active pid namespace,
      compare the task structure with 'init_pid_ns.child_reaper', which is
      initialized during boot to the /sbin/init process and never changes.
      
      Changelog:
      
      	2.6.22-rc4-mm2-pidns1:
      	- Use 'init_pid_ns.child_reaper' to determine if a given task is the
      	  global init (/sbin/init) process. This would improve performance
      	  and remove dependence on the task_pid().
      
      	2.6.21-mm2-pidns2:
      
      	- [Sukadev Bhattiprolu] Changed is_container_init() calls in {powerpc,
      	  ppc,avr32}/traps.c for the _exception() call to is_global_init().
      	  This way, we kill only the cgroup if the cgroup's init has a
      	  bug rather than force a kernel panic.
      
      [akpm@linux-foundation.org: fix comment]
      [sukadev@us.ibm.com: Use is_global_init() in arch/m32r/mm/fault.c]
      [bunk@stusta.de: kernel/pid.c: remove unused exports]
      [sukadev@us.ibm.com: Fix capability.c to work with threaded init]
      Signed-off-by: Serge E. Hallyn <serue@us.ibm.com>
      Signed-off-by: Sukadev Bhattiprolu <sukadev@us.ibm.com>
      Acked-by: Pavel Emelianov <xemul@openvz.org>
      Cc: Eric W. Biederman <ebiederm@xmission.com>
      Cc: Cedric Le Goater <clg@fr.ibm.com>
      Cc: Dave Hansen <haveblue@us.ibm.com>
      Cc: Herbert Poetzel <herbert@13thfloor.at>
      Cc: Kirill Korotaev <dev@sw.ru>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
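      The child_reaper comparison the commit describes can be mocked in userspace: pid == 1 is ambiguous because both a container's init and the global init carry it, while the task-pointer comparison is unambiguous. The struct definitions below are minimal stand-ins, not the kernel's layouts.

```c
#include <assert.h>

/* Minimal mock: compare the task itself against init_pid_ns.child_reaper,
 * which the kernel sets at boot to /sbin/init and never changes. */
struct task_struct { int pid; };
struct pid_namespace { struct task_struct *child_reaper; };

static struct pid_namespace init_pid_ns;

static int is_global_init(struct task_struct *tsk)
{
    return tsk == init_pid_ns.child_reaper;
}
```

      Two distinct tasks can both report pid 1, which is exactly why the pointer comparison, not the pid check, identifies the global init.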
  27. 17 Oct 2007 (2 commits)