1. 30 10月, 2009 4 次提交
    • D
      powerpc/mm: Cleanup initialization of hugepages on powerpc · d1837cba
      David Gibson 提交于
      This patch simplifies the logic used to initialize hugepages on
      powerpc.  The somewhat oddly named set_huge_psize() is renamed to
      add_huge_page_size() and now does all necessary verification of
      whether it's given a valid hugepage sizes (instead of just some) and
      instantiates the generic hstate structure (but no more).
      
      hugetlbpage_init() now steps through the available pagesizes, checks
      if they're valid for hugepages by calling add_huge_page_size() and
      initializes the kmem_caches for the hugepage pagetables.  This means
      we can now eliminate the mmu_huge_psizes array, since we no longer
      need to pass the sizing information for the pagetable caches from
      set_huge_psize() into hugetlbpage_init()
      
      Determination of the default huge page size is also moved from the
      hash code into the general hugepage code.
      Signed-off-by: NDavid Gibson <dwg@au1.ibm.com>
      Signed-off-by: NBenjamin Herrenschmidt <benh@kernel.crashing.org>
      d1837cba
    • D
      powerpc/mm: Allow more flexible layouts for hugepage pagetables · a4fe3ce7
      David Gibson 提交于
      Currently each available hugepage size uses a slightly different
      pagetable layout: that is, the bottem level table of pointers to
      hugepages is a different size, and may branch off from the normal page
      tables at a different level.  Every hugepage aware path that needs to
      walk the pagetables must therefore look up the hugepage size from the
      slice info first, and work out the correct way to walk the pagetables
      accordingly.  Future hardware is likely to add more possible hugepage
      sizes, more layout options and more mess.
      
      This patch, therefore reworks the handling of hugepage pagetables to
      reduce this complexity.  In the new scheme, instead of having to
      consult the slice mask, pagetable walking code can check a flag in the
      PGD/PUD/PMD entries to see where to branch off to hugepage pagetables,
      and the entry also contains the information (eseentially hugepage
      shift) necessary to then interpret that table without recourse to the
      slice mask.  This scheme can be extended neatly to handle multiple
      levels of self-describing "special" hugepage pagetables, although for
      now we assume only one level exists.
      
      This approach means that only the pagetable allocation path needs to
      know how the pagetables should be set out.  All other (hugepage)
      pagetable walking paths can just interpret the structure as they go.
      
      There already was a flag bit in PGD/PUD/PMD entries for hugepage
      directory pointers, but it was only used for debug.  We alter that
      flag bit to instead be a 0 in the MSB to indicate a hugepage pagetable
      pointer (normally it would be 1 since the pointer lies in the linear
      mapping).  This means that asm pagetable walking can test for (and
      punt on) hugepage pointers with the same test that checks for
      unpopulated page directory entries (beq becomes bge), since hugepage
      pointers will always be positive, and normal pointers always negative.
      
      While we're at it, we get rid of the confusing (and grep defeating)
      #defining of hugepte_shift to be the same thing as mmu_huge_psizes.
      Signed-off-by: NDavid Gibson <dwg@au1.ibm.com>
      Signed-off-by: NBenjamin Herrenschmidt <benh@kernel.crashing.org>
      a4fe3ce7
    • D
      powerpc/mm: Cleanup management of kmem_caches for pagetables · a0668cdc
      David Gibson 提交于
      Currently we have a fair bit of rather fiddly code to manage the
      various kmem_caches used to store page tables of various levels.  We
      generally have two caches holding some combination of PGD, PUD and PMD
      tables, plus several more for the special hugepage pagetables.
      
      This patch cleans this all up by taking a different approach.  Rather
      than the caches being designated as for PUDs or for hugeptes for 16M
      pages, the caches are simply allocated to be a specific size.  Thus
      sharing of caches between different types/levels of pagetables happens
      naturally.  The pagetable size, where needed, is passed around encoded
      in the same way as {PGD,PUD,PMD}_INDEX_SIZE; that is n where the
      pagetable contains 2^n pointers.
      Signed-off-by: NDavid Gibson <dwg@au1.ibm.com>
      Signed-off-by: NBenjamin Herrenschmidt <benh@kernel.crashing.org>
      a0668cdc
    • D
      powerpc/mm: Make hpte_need_flush() correctly mask for multiple page sizes · f71dc176
      David Gibson 提交于
      Currently, hpte_need_flush() only correctly flushes the given address
      for normal pages.  Callers for hugepages are required to mask the
      address themselves.
      
      But hpte_need_flush() already looks up the page sizes for its own
      reasons, so this is a rather silly imposition on the callers.  This
      patch alters it to mask based on the pagesize it has looked up itself,
      and removes the awkward masking code in the hugepage caller.
      Signed-off-by: NDavid Gibson <dwg@au1.ibm.com>
      Signed-off-by: NBenjamin Herrenschmidt <benh@kernel.crashing.org>
      f71dc176
  2. 14 10月, 2009 1 次提交
    • B
      powerpc/mm: Fix hang accessing top of vmalloc space · 8d8997f3
      Benjamin Herrenschmidt 提交于
      On pSeries, we always force the IO space to be mapped using 4K
      pages even with a 64K base page size to cope with some limitations
      in the HV interface to some devices.
      
      However, the SLB miss handler code to discriminate between vmalloc
      and ioremap space uses a CPU feature section such that the code
      is nop'ed out when the processor support large pages non-cachable
      mappings.
      
      Thus, we end up always using the ioremap page size for vmalloc
      segments on such processors, causing a discrepency between the
      segment and the hash table, and thus a hang continously hashing
      the page.
      
      It works for the first segment of the vmalloc space since that
      segment is "bolted" in by C code correctly, and thankfully we
      almost never use the vmalloc space beyond the first segment,
      but the new percpu code made the bug happen.
      
      This fixes it by removing the feature section from the assembly,
      we now always do the comparison between vmalloc and ioremap.
      
      Signed-off-by; Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Signed-off-by: NBenjamin Herrenschmidt <benh@kernel.crashing.org>
      8d8997f3
  3. 24 9月, 2009 2 次提交
  4. 23 9月, 2009 4 次提交
    • K
      kcore: use registerd physmem information · 3089aa1b
      KAMEZAWA Hiroyuki 提交于
      For /proc/kcore, each arch registers its memory range by kclist_add().
      In usual,
      
      	- range of physical memory
      	- range of vmalloc area
      	- text, etc...
      
      are registered but "range of physical memory" has some troubles.  It
      doesn't updated at memory hotplug and it tend to include unnecessary
      memory holes.  Now, /proc/iomem (kernel/resource.c) includes required
      physical memory range information and it's properly updated at memory
      hotplug.  Then, it's good to avoid using its own code(duplicating
      information) and to rebuild kclist for physical memory based on
      /proc/iomem.
      Signed-off-by: NKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Signed-off-by: NJiri Slaby <jirislaby@gmail.com>
      Cc: Ralf Baechle <ralf@linux-mips.org>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: WANG Cong <xiyou.wangcong@gmail.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      3089aa1b
    • K
      walk system ram range · 908eedc6
      KAMEZAWA Hiroyuki 提交于
      Originally, walk_memory_resource() was introduced to traverse all memory
      of "System RAM" for detecting memory hotplug/unplug range.  For doing so,
      flags of IORESOUCE_MEM|IORESOURCE_BUSY was used and this was enough for
      memory hotplug.
      
      But for using other purpose, /proc/kcore, this may includes some firmware
      area marked as IORESOURCE_BUSY | IORESOUCE_MEM.  This patch makes the
      check strict to find out busy "System RAM".
      
      Note: PPC64 keeps their own walk_memory_resouce(), which walk through
      ppc64's lmb informaton.  Because old kclist_add() is called per lmb, this
      patch makes no difference in behavior, finally.
      
      And this patch removes CONFIG_MEMORY_HOTPLUG check from this function.
      Because pfn_valid() just show "there is memmap or not* and cannot be used
      for "there is physical memory or not", this function is useful in generic
      to scan physical memory range.
      Signed-off-by: NKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Ralf Baechle <ralf@linux-mips.org>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: WANG Cong <xiyou.wangcong@gmail.com>
      Cc: Américo Wang <xiyou.wangcong@gmail.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Roland Dreier <rolandd@cisco.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      908eedc6
    • K
      kcore: register vmalloc area in generic way · a0614da8
      KAMEZAWA Hiroyuki 提交于
      For /proc/kcore, vmalloc areas are registered per arch.  But, all of them
      registers same range of [VMALLOC_START...VMALLOC_END) This patch unifies
      them.  By this.  archs which have no kclist_add() hooks can see vmalloc
      area correctly.
      Signed-off-by: NKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Ralf Baechle <ralf@linux-mips.org>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: WANG Cong <xiyou.wangcong@gmail.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      a0614da8
    • K
      kcore: add kclist types · c30bb2a2
      KAMEZAWA Hiroyuki 提交于
      Presently, kclist_add() only eats start address and size as its arguments.
      Considering to make kclist dynamically reconfigulable, it's necessary to
      know which kclists are for System RAM and which are not.
      
      This patch add kclist types as
        KCORE_RAM
        KCORE_VMALLOC
        KCORE_TEXT
        KCORE_OTHER
      
      This "type" is used in a patch following this for detecting KCORE_RAM.
      Signed-off-by: NKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Ralf Baechle <ralf@linux-mips.org>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: WANG Cong <xiyou.wangcong@gmail.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      c30bb2a2
  5. 22 9月, 2009 1 次提交
  6. 21 9月, 2009 1 次提交
    • I
      perf: Do the big rename: Performance Counters -> Performance Events · cdd6c482
      Ingo Molnar 提交于
      Bye-bye Performance Counters, welcome Performance Events!
      
      In the past few months the perfcounters subsystem has grown out its
      initial role of counting hardware events, and has become (and is
      becoming) a much broader generic event enumeration, reporting, logging,
      monitoring, analysis facility.
      
      Naming its core object 'perf_counter' and naming the subsystem
      'perfcounters' has become more and more of a misnomer. With pending
      code like hw-breakpoints support the 'counter' name is less and
      less appropriate.
      
      All in one, we've decided to rename the subsystem to 'performance
      events' and to propagate this rename through all fields, variables
      and API names. (in an ABI compatible fashion)
      
      The word 'event' is also a bit shorter than 'counter' - which makes
      it slightly more convenient to write/handle as well.
      
      Thanks goes to Stephane Eranian who first observed this misnomer and
      suggested a rename.
      
      User-space tooling and ABI compatibility is not affected - this patch
      should be function-invariant. (Also, defconfigs were not touched to
      keep the size down.)
      
      This patch has been generated via the following script:
      
        FILES=$(find * -type f | grep -vE 'oprofile|[^K]config')
      
        sed -i \
          -e 's/PERF_EVENT_/PERF_RECORD_/g' \
          -e 's/PERF_COUNTER/PERF_EVENT/g' \
          -e 's/perf_counter/perf_event/g' \
          -e 's/nb_counters/nb_events/g' \
          -e 's/swcounter/swevent/g' \
          -e 's/tpcounter_event/tp_event/g' \
          $FILES
      
        for N in $(find . -name perf_counter.[ch]); do
          M=$(echo $N | sed 's/perf_counter/perf_event/g')
          mv $N $M
        done
      
        FILES=$(find . -name perf_event.*)
      
        sed -i \
          -e 's/COUNTER_MASK/REG_MASK/g' \
          -e 's/COUNTER/EVENT/g' \
          -e 's/\<event\>/event_id/g' \
          -e 's/counter/event/g' \
          -e 's/Counter/Event/g' \
          $FILES
      
      ... to keep it as correct as possible. This script can also be
      used by anyone who has pending perfcounters patches - it converts
      a Linux kernel tree over to the new naming. We tried to time this
      change to the point in time where the amount of pending patches
      is the smallest: the end of the merge window.
      
      Namespace clashes were fixed up in a preparatory patch - and some
      stylistic fallout will be fixed up in a subsequent patch.
      
      ( NOTE: 'counters' are still the proper terminology when we deal
        with hardware registers - and these sed scripts are a bit
        over-eager in renaming them. I've undone some of that, but
        in case there's something left where 'counter' would be
        better than 'event' we can undo that on an individual basis
        instead of touching an otherwise nicely automated patch. )
      Suggested-by: NStephane Eranian <eranian@google.com>
      Acked-by: NPeter Zijlstra <a.p.zijlstra@chello.nl>
      Acked-by: NPaul Mackerras <paulus@samba.org>
      Reviewed-by: NArjan van de Ven <arjan@linux.intel.com>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: David Howells <dhowells@redhat.com>
      Cc: Kyle McMartin <kyle@mcmartin.ca>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: <linux-arch@vger.kernel.org>
      LKML-Reference: <new-submission>
      Signed-off-by: NIngo Molnar <mingo@elte.hu>
      cdd6c482
  7. 02 9月, 2009 1 次提交
    • B
      powerpc/pseries: Fix to handle slb resize across migration · 46db2f86
      Brian King 提交于
      The SLB can change sizes across a live migration, which was not
      being handled, resulting in possible machine crashes during
      migration if migrating to a machine which has a smaller max SLB
      size than the source machine. Fix this by first reducing the
      SLB size to the minimum possible value, which is 32, prior to
      migration. Then during the device tree update which occurs after
      migration, we make the call to ensure the SLB gets updated. Also
      add the slb_size to the lparcfg output so that the migration
      tools can check to make sure the kernel has this capability
      before allowing migration in scenarios where the SLB size will change.
      
      BenH: Fixed #include <asm/mmu-hash64.h> -> <asm/mmu.h> to avoid
            breaking ppc32 build
      Signed-off-by: NBrian King <brking@linux.vnet.ibm.com>
      Signed-off-by: NBenjamin Herrenschmidt <benh@kernel.crashing.org>
      46db2f86
  8. 28 8月, 2009 1 次提交
  9. 27 8月, 2009 1 次提交
    • B
      powerpc/mm: Cleanup handling of execute permission · ea3cc330
      Benjamin Herrenschmidt 提交于
      This is an attempt at cleaning up a bit the way we handle execute
      permission on powerpc. _PAGE_HWEXEC is gone, _PAGE_EXEC is now only
      defined by CPUs that can do something with it, and the myriad of
      #ifdef's in the I$/D$ coherency code is reduced to 2 cases that
      hopefully should cover everything.
      
      The logic on BookE is a little bit different than what it was though
      not by much. Since now, _PAGE_EXEC will be set by the generic code
      for executable pages, we need to filter out if they are unclean and
      recover it. However, I don't expect the code to be more bloated than
      it already was in that area due to that change.
      
      I could boast that this brings proper enforcing of per-page execute
      permissions to all BookE and 40x but in fact, we've had that now for
      some time as a side effect of my previous rework in that area (and
      I didn't even know it :-) We would only enable execute permission if
      the page was cache clean and we would only cache clean it if we took
      and exec fault. Since we now enforce that the later only work if
      VM_EXEC is part of the VMA flags, we de-fact already enforce per-page
      execute permissions... Unless I missed something
      Signed-off-by: NBenjamin Herrenschmidt <benh@kernel.crashing.org>
      ea3cc330
  10. 25 8月, 2009 1 次提交
  11. 20 8月, 2009 15 次提交
  12. 18 8月, 2009 1 次提交
    • P
      powerpc: Allow perf_counters to access user memory at interrupt time · 9c1e1052
      Paul Mackerras 提交于
      This provides a mechanism to allow the perf_counters code to access
      user memory in a PMU interrupt routine.  Such an access can cause
      various kinds of interrupt: SLB miss, MMU hash table miss, segment
      table miss, or TLB miss, depending on the processor.  This commit
      only deals with 64-bit classic/server processors, which use an MMU
      hash table.  32-bit processors are already able to access user memory
      at interrupt time.  Since we don't soft-disable on 32-bit, we avoid
      the possibility of reentering hash_page or the TLB miss handlers,
      since they run with interrupts disabled.
      
      On 64-bit processors, an SLB miss interrupt on a user address will
      update the slb_cache and slb_cache_ptr fields in the paca.  This is
      OK except in the case where a PMU interrupt occurs in switch_slb,
      which also accesses those fields.  To prevent this, we hard-disable
      interrupts in switch_slb.  Interrupts are already soft-disabled at
      this point, and will get hard-enabled when they get soft-enabled
      later.
      
      This also reworks slb_flush_and_rebolt: to avoid hard-disabling twice,
      and to make sure that it clears the slb_cache_ptr when called from
      other callers than switch_slb, the existing routine is renamed to
      __slb_flush_and_rebolt, which is called by switch_slb and the new
      version of slb_flush_and_rebolt.
      
      Similarly, switch_stab (used on POWER3 and RS64 processors) gets a
      hard_irq_disable() to protect the per-cpu variables used there and
      in ste_allocate.
      
      If a MMU hashtable miss interrupt occurs, normally we would call
      hash_page to look up the Linux PTE for the address and create a HPTE.
      However, hash_page is fairly complex and takes some locks, so to
      avoid the possibility of deadlock, we check the preemption count
      to see if we are in a (pseudo-)NMI handler, and if so, we don't call
      hash_page but instead treat it like a bad access that will get
      reported up through the exception table mechanism.  An interrupt
      whose handler runs even though the interrupt occurred when
      soft-disabled (such as the PMU interrupt) is considered a pseudo-NMI
      handler, which should use nmi_enter()/nmi_exit() rather than
      irq_enter()/irq_exit().
      Acked-by: NBenjamin Herrenschmidt <benh@kernel.crashing.org>
      Signed-off-by: NPaul Mackerras <paulus@samba.org>
      9c1e1052
  13. 30 7月, 2009 1 次提交
    • K
      powerpc/mm: Fix SMP issue with MMU context handling code · 5156ddce
      Kumar Gala 提交于
      In switch_mmu_context() if we call steal_context_smp() to get a context
      to use we shouldn't fall through and than call steal_context_up().  Doing
      so can be problematic in that the 'mm' that steal_context_up() ends up
      using will not get marked dirty in the stale_map[] for other CPUs that
      might have used that mm.  Thus we could end up with stale TLB entries in
      the other CPUs that can cause all kinda of havoc.
      Signed-off-by: NKumar Gala <galak@kernel.crashing.org>
      5156ddce
  14. 28 7月, 2009 1 次提交
    • B
      mm: Pass virtual address to [__]p{te,ud,md}_free_tlb() · 9e1b32ca
      Benjamin Herrenschmidt 提交于
      mm: Pass virtual address to [__]p{te,ud,md}_free_tlb()
      
      Upcoming paches to support the new 64-bit "BookE" powerpc architecture
      will need to have the virtual address corresponding to PTE page when
      freeing it, due to the way the HW table walker works.
      
      Basically, the TLB can be loaded with "large" pages that cover the whole
      virtual space (well, sort-of, half of it actually) represented by a PTE
      page, and which contain an "indirect" bit indicating that this TLB entry
      RPN points to an array of PTEs from which the TLB can then create direct
      entries. Thus, in order to invalidate those when PTE pages are deleted,
      we need the virtual address to pass to tlbilx or tlbivax instructions.
      
      The old trick of sticking it somewhere in the PTE page struct page sucks
      too much, the address is almost readily available in all call sites and
      almost everybody implemets these as macros, so we may as well add the
      argument everywhere. I added it to the pmd and pud variants for consistency.
      Signed-off-by: NBenjamin Herrenschmidt <benh@kernel.crashing.org>
      Acked-by: David Howells <dhowells@redhat.com> [MN10300 & FRV]
      Acked-by: NNick Piggin <npiggin@suse.de>
      Acked-by: Martin Schwidefsky <schwidefsky@de.ibm.com> [s390]
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      9e1b32ca
  15. 08 7月, 2009 5 次提交