1. 05 11月, 2009 10 次提交
  2. 30 10月, 2009 10 次提交
    • M
      powerpc: Enable sparse irq_descs on powerpc · cd015707
      Michael Ellerman 提交于
      Defining CONFIG_SPARSE_IRQ enables generic code that gets rid of the
      static irq_desc array, and replaces it with an array of pointers to
      irq_descs.
      
      It also allows node local allocation of irq_descs, however we
      currently don't have the information available to do that, so we just
      allocate them on all on node 0.
      Signed-off-by: NMichael Ellerman <michael@ellerman.id.au>
      Signed-off-by: NBenjamin Herrenschmidt <benh@kernel.crashing.org>
      cd015707
    • T
      powerpc/nvram_64: Remove unused code · ae7dd020
      Thomas Gleixner 提交于
      nvram_find_partition() has no user. The call site was removed in the
      arch/powerpc move, but the function stayed. Remove it.
      Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: linuxppc-dev@ozlabs.org
      Signed-off-by: NBenjamin Herrenschmidt <benh@kernel.crashing.org>
      ae7dd020
    • M
      powerpc: Fix potential compile error irqs_disabled_flags · 0682d6c1
      Michael Neuling 提交于
      irqs_disabled_flags is #defined in linux/irqflags.h when
      CONFIG_TRACE_IRQFLAGS_SUPPORT is enabled.  64 and 32 bit always have
      CONFIG_TRACE_IRQFLAGS_SUPPORT enabled so just remove
      irqs_disabled_flags.
      
      This fixes the case when someone needs to include both linux/irqflags.h
      and asm/hw_irq.h.
      Signed-off-by: NMichael Neuling <mikey@neuling.org>
      Signed-off-by: NBenjamin Herrenschmidt <benh@kernel.crashing.org>
      0682d6c1
    • D
      powerpc/mm: Bring hugepage PTE accessor functions back into sync with normal accessors · 0895ecda
      David Gibson 提交于
      The hugepage arch code provides a number of hook functions/macros
      which mirror the functionality of various normal page pte access
      functions.  Various changes in the normal page accessors (in
      particular BenH's recent changes to the handling of lazy icache
      flushing and PAGE_EXEC) have caused the hugepage versions to get out
      of sync with the originals.  In some cases, this is a bug, at least on
      some MMU types.
      
      One of the reasons that some hooks were not identical to the normal
      page versions, is that the fact we're dealing with a hugepage needed
      to be passed down do use the correct dcache-icache flush function.
      This patch makes the main flush_dcache_icache_page() function hugepage
      aware (by checking for the PageCompound flag).  That in turn means we
      can make set_huge_pte_at() just a call to set_pte_at() bringing it
      back into sync.  As a bonus, this lets us remove the
      hash_huge_page_do_lazy_icache() function, replacing it with a call to
      the hash_page_do_lazy_icache() function it was based on.
      
      Some other hugepage pte access hooks - huge_ptep_get_and_clear() and
      huge_ptep_clear_flush() - are not so easily unified, but this patch at
      least brings them back into sync with the current versions of the
      corresponding normal page functions.
      Signed-off-by: NDavid Gibson <dwg@au1.ibm.com>
      Signed-off-by: NBenjamin Herrenschmidt <benh@kernel.crashing.org>
      0895ecda
    • D
      powerpc/mm: Split hash MMU specific hugepage code into a new file · 883a3e52
      David Gibson 提交于
      This patch separates the parts of hugetlbpage.c which are inherently
      specific to the hash MMU into a new hugelbpage-hash64.c file.
      Signed-off-by: NDavid Gibson <dwg@au1.ibm.com>
      Signed-off-by: NBenjamin Herrenschmidt <benh@kernel.crashing.org>
      883a3e52
    • D
      powerpc/mm: Cleanup initialization of hugepages on powerpc · d1837cba
      David Gibson 提交于
      This patch simplifies the logic used to initialize hugepages on
      powerpc.  The somewhat oddly named set_huge_psize() is renamed to
      add_huge_page_size() and now does all necessary verification of
      whether it's given a valid hugepage sizes (instead of just some) and
      instantiates the generic hstate structure (but no more).
      
      hugetlbpage_init() now steps through the available pagesizes, checks
      if they're valid for hugepages by calling add_huge_page_size() and
      initializes the kmem_caches for the hugepage pagetables.  This means
      we can now eliminate the mmu_huge_psizes array, since we no longer
      need to pass the sizing information for the pagetable caches from
      set_huge_psize() into hugetlbpage_init()
      
      Determination of the default huge page size is also moved from the
      hash code into the general hugepage code.
      Signed-off-by: NDavid Gibson <dwg@au1.ibm.com>
      Signed-off-by: NBenjamin Herrenschmidt <benh@kernel.crashing.org>
      d1837cba
    • D
      powerpc/mm: Allow more flexible layouts for hugepage pagetables · a4fe3ce7
      David Gibson 提交于
      Currently each available hugepage size uses a slightly different
      pagetable layout: that is, the bottem level table of pointers to
      hugepages is a different size, and may branch off from the normal page
      tables at a different level.  Every hugepage aware path that needs to
      walk the pagetables must therefore look up the hugepage size from the
      slice info first, and work out the correct way to walk the pagetables
      accordingly.  Future hardware is likely to add more possible hugepage
      sizes, more layout options and more mess.
      
      This patch, therefore reworks the handling of hugepage pagetables to
      reduce this complexity.  In the new scheme, instead of having to
      consult the slice mask, pagetable walking code can check a flag in the
      PGD/PUD/PMD entries to see where to branch off to hugepage pagetables,
      and the entry also contains the information (eseentially hugepage
      shift) necessary to then interpret that table without recourse to the
      slice mask.  This scheme can be extended neatly to handle multiple
      levels of self-describing "special" hugepage pagetables, although for
      now we assume only one level exists.
      
      This approach means that only the pagetable allocation path needs to
      know how the pagetables should be set out.  All other (hugepage)
      pagetable walking paths can just interpret the structure as they go.
      
      There already was a flag bit in PGD/PUD/PMD entries for hugepage
      directory pointers, but it was only used for debug.  We alter that
      flag bit to instead be a 0 in the MSB to indicate a hugepage pagetable
      pointer (normally it would be 1 since the pointer lies in the linear
      mapping).  This means that asm pagetable walking can test for (and
      punt on) hugepage pointers with the same test that checks for
      unpopulated page directory entries (beq becomes bge), since hugepage
      pointers will always be positive, and normal pointers always negative.
      
      While we're at it, we get rid of the confusing (and grep defeating)
      #defining of hugepte_shift to be the same thing as mmu_huge_psizes.
      Signed-off-by: NDavid Gibson <dwg@au1.ibm.com>
      Signed-off-by: NBenjamin Herrenschmidt <benh@kernel.crashing.org>
      a4fe3ce7
    • D
      powerpc/mm: Cleanup management of kmem_caches for pagetables · a0668cdc
      David Gibson 提交于
      Currently we have a fair bit of rather fiddly code to manage the
      various kmem_caches used to store page tables of various levels.  We
      generally have two caches holding some combination of PGD, PUD and PMD
      tables, plus several more for the special hugepage pagetables.
      
      This patch cleans this all up by taking a different approach.  Rather
      than the caches being designated as for PUDs or for hugeptes for 16M
      pages, the caches are simply allocated to be a specific size.  Thus
      sharing of caches between different types/levels of pagetables happens
      naturally.  The pagetable size, where needed, is passed around encoded
      in the same way as {PGD,PUD,PMD}_INDEX_SIZE; that is n where the
      pagetable contains 2^n pointers.
      Signed-off-by: NDavid Gibson <dwg@au1.ibm.com>
      Signed-off-by: NBenjamin Herrenschmidt <benh@kernel.crashing.org>
      a0668cdc
    • M
      powerpc: Remove get_irq_desc() · 6cff46f4
      Michael Ellerman 提交于
      get_irq_desc() is a powerpc-specific version of irq_to_desc(). That
      is reason enough to remove it, but it also doesn't know about sparse
      irq_desc support which irq_to_desc() does (when we enable it).
      Signed-off-by: NMichael Ellerman <michael@ellerman.id.au>
      Acked-by: NGrant Likely <grant.likely@secretlab.ca>
      Signed-off-by: NBenjamin Herrenschmidt <benh@kernel.crashing.org>
      6cff46f4
    • M
      powerpc: Make NR_IRQS a CONFIG option · 551b81f2
      Michael Ellerman 提交于
      The irq_desc array consumes quite a lot of space, and for systems
      that don't need or can't have 512 irqs it's just wasted space.
      
      The first 16 are reserved for ISA, so the minimum of 32 is really
      16 - and no one has asked for more than 512 so leave that as the
      maximum.
      Signed-off-by: NMichael Ellerman <michael@ellerman.id.au>
      Signed-off-by: NBenjamin Herrenschmidt <benh@kernel.crashing.org>
      551b81f2
  3. 14 10月, 2009 1 次提交
    • A
      powerpc: Fix hypervisor TLB batching · b6dcde5c
      Anton Blanchard 提交于
      Profiling of a page fault scalability microbenchmark shows flush_hash_range
      is not calling the batch hpte invalidate hcall (H_BULK_REMOVE).
      
      It turns out we have a duplicate firmware feature for hcall-bulk and the
      current setup code stops after finding the first match. This meant we never
      batch and always do individual invalidates.
      
      The patch below removes the duplicate and shifts FW_FEATURE_CMO to close
      the gap. With the patch applied the single threaded page fault rate improves
      from 217169 to 238755 per second on a POWER5 test box, a 10% improvement.
      Signed-off-by: NAnton Blanchard <anton@samba.org>
      Signed-off-by: NBenjamin Herrenschmidt <benh@kernel.crashing.org>
      b6dcde5c
  4. 24 9月, 2009 9 次提交
  5. 22 9月, 2009 2 次提交
    • A
      mm: add MAP_HUGETLB for mmaping pseudo-anonymous huge page regions · 90f72aa5
      Arnd Bergmann 提交于
      Add a flag for mmap that will be used to request a huge page region that
      will look like anonymous memory to user space.  This is accomplished by
      using a file on the internal vfsmount.  MAP_HUGETLB is a modifier of
      MAP_ANONYMOUS and so must be specified with it.  The region will behave
      the same as a MAP_ANONYMOUS region using small pages.
      
      The patch also adds the MAP_STACK flag, which was previously defined only
      on some architectures but not on others.  Since MAP_STACK is meant to be a
      hint only, architectures can define it without assigning a specific
      meaning to it.
      Signed-off-by: NArnd Bergmann <arnd@arndb.de>
      Cc: Eric B Munson <ebmunson@us.ibm.com>
      Cc: Hugh Dickins <hugh.dickins@tiscali.co.uk>
      Cc: David Rientjes <rientjes@google.com>
      Cc: <linux-arch@vger.kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      90f72aa5
    • P
      perf_event, powerpc: Fix compilation after big perf_counter rename · a8f90e90
      Paul Mackerras 提交于
      This fixes two places in the powerpc perf_event (perf_counter) code
      where 'list_entry' needs to be changed to 'group_entry', but were
      missed in commit 65abc865 ("perf_counter: Rename list_entry ->
      group_entry, counter_list -> group_list").
      
      This also changes 'event' back to 'counter' in a couple of
      contexts:
      
      * Field and function names that deal with the limited-function
        counters: it's really the hardware counters whose function is
        limited, not the events that they count.  Hence:
      
        MAX_LIMITED_HWEVENTS -> MAX_LIMITED_HWCOUNTERS
        limited_event -> limited_counter
        freeze/thaw_limited_events -> freeze/thaw_limited_counters
      
      * The machine-specific PMU description struct (struct power_pmu): this
        renames 'n_event' back to 'n_counter' since it really describes how
        many hardware counters the machine has.  (Renaming this back avoids
        a compile error in each of the machine-specific PMU back-ends where
        they initialize their power_pmu struct.)
      Signed-off-by: NPaul Mackerras <paulus@samba.org>
      Cc: linuxppc-dev@ozlabs.org
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      LKML-Reference: <19128.4280.813369.589704@cargo.ozlabs.ibm.com>
      Signed-off-by: NIngo Molnar <mingo@elte.hu>
      a8f90e90
  6. 21 9月, 2009 2 次提交
    • I
      perf: Tidy up after the big rename · 57c0c15b
      Ingo Molnar 提交于
       - provide compatibility Kconfig entry for existing PERF_COUNTERS .config's
      
       - provide courtesy copy of old perf_counter.h, for user-space projects
      
       - small indentation fixups
      
       - fix up MAINTAINERS
      
       - fix small x86 printout fallout
      
       - fix up small PowerPC comment fallout (use 'counter' as in register)
      Reviewed-by: NArjan van de Ven <arjan@linux.intel.com>
      Acked-by: NPeter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      LKML-Reference: <new-submission>
      Signed-off-by: NIngo Molnar <mingo@elte.hu>
      57c0c15b
    • I
      perf: Do the big rename: Performance Counters -> Performance Events · cdd6c482
      Ingo Molnar 提交于
      Bye-bye Performance Counters, welcome Performance Events!
      
      In the past few months the perfcounters subsystem has grown out its
      initial role of counting hardware events, and has become (and is
      becoming) a much broader generic event enumeration, reporting, logging,
      monitoring, analysis facility.
      
      Naming its core object 'perf_counter' and naming the subsystem
      'perfcounters' has become more and more of a misnomer. With pending
      code like hw-breakpoints support the 'counter' name is less and
      less appropriate.
      
      All in one, we've decided to rename the subsystem to 'performance
      events' and to propagate this rename through all fields, variables
      and API names. (in an ABI compatible fashion)
      
      The word 'event' is also a bit shorter than 'counter' - which makes
      it slightly more convenient to write/handle as well.
      
      Thanks goes to Stephane Eranian who first observed this misnomer and
      suggested a rename.
      
      User-space tooling and ABI compatibility is not affected - this patch
      should be function-invariant. (Also, defconfigs were not touched to
      keep the size down.)
      
      This patch has been generated via the following script:
      
        FILES=$(find * -type f | grep -vE 'oprofile|[^K]config')
      
        sed -i \
          -e 's/PERF_EVENT_/PERF_RECORD_/g' \
          -e 's/PERF_COUNTER/PERF_EVENT/g' \
          -e 's/perf_counter/perf_event/g' \
          -e 's/nb_counters/nb_events/g' \
          -e 's/swcounter/swevent/g' \
          -e 's/tpcounter_event/tp_event/g' \
          $FILES
      
        for N in $(find . -name perf_counter.[ch]); do
          M=$(echo $N | sed 's/perf_counter/perf_event/g')
          mv $N $M
        done
      
        FILES=$(find . -name perf_event.*)
      
        sed -i \
          -e 's/COUNTER_MASK/REG_MASK/g' \
          -e 's/COUNTER/EVENT/g' \
          -e 's/\<event\>/event_id/g' \
          -e 's/counter/event/g' \
          -e 's/Counter/Event/g' \
          $FILES
      
      ... to keep it as correct as possible. This script can also be
      used by anyone who has pending perfcounters patches - it converts
      a Linux kernel tree over to the new naming. We tried to time this
      change to the point in time where the amount of pending patches
      is the smallest: the end of the merge window.
      
      Namespace clashes were fixed up in a preparatory patch - and some
      stylistic fallout will be fixed up in a subsequent patch.
      
      ( NOTE: 'counters' are still the proper terminology when we deal
        with hardware registers - and these sed scripts are a bit
        over-eager in renaming them. I've undone some of that, but
        in case there's something left where 'counter' would be
        better than 'event' we can undo that on an individual basis
        instead of touching an otherwise nicely automated patch. )
      Suggested-by: NStephane Eranian <eranian@google.com>
      Acked-by: NPeter Zijlstra <a.p.zijlstra@chello.nl>
      Acked-by: NPaul Mackerras <paulus@samba.org>
      Reviewed-by: NArjan van de Ven <arjan@linux.intel.com>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: David Howells <dhowells@redhat.com>
      Cc: Kyle McMartin <kyle@mcmartin.ca>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: <linux-arch@vger.kernel.org>
      LKML-Reference: <new-submission>
      Signed-off-by: NIngo Molnar <mingo@elte.hu>
      cdd6c482
  7. 16 9月, 2009 1 次提交
    • P
      sched: Disable wakeup balancing · 182a85f8
      Peter Zijlstra 提交于
      Sysbench thinks SD_BALANCE_WAKE is too agressive and kbuild doesn't
      really mind too much, SD_BALANCE_NEWIDLE picks up most of the
      slack.
      
      On a dual socket, quad core, dual thread nehalem system:
      
      sysbench (--num_threads=16):
      
       SD_BALANCE_WAKE-: 13982 tx/s
       SD_BALANCE_WAKE+: 15688 tx/s
      
      kbuild (-j16):
      
       SD_BALANCE_WAKE-: 47.648295846  seconds time elapsed   ( +-   0.312% )
       SD_BALANCE_WAKE+: 47.608607360  seconds time elapsed   ( +-   0.026% )
      
      (same within noise)
      Signed-off-by: NPeter Zijlstra <a.p.zijlstra@chello.nl>
      LKML-Reference: <new-submission>
      Signed-off-by: NIngo Molnar <mingo@elte.hu>
      182a85f8
  8. 15 9月, 2009 3 次提交
    • M
      sched: Improve latencies and throughput · 0ec9fab3
      Mike Galbraith 提交于
      Make the idle balancer more agressive, to improve a
      x264 encoding workload provided by Jason Garrett-Glaser:
      
       NEXT_BUDDY NO_LB_BIAS
       encoded 600 frames, 252.82 fps, 22096.60 kb/s
       encoded 600 frames, 250.69 fps, 22096.60 kb/s
       encoded 600 frames, 245.76 fps, 22096.60 kb/s
      
       NO_NEXT_BUDDY LB_BIAS
       encoded 600 frames, 344.44 fps, 22096.60 kb/s
       encoded 600 frames, 346.66 fps, 22096.60 kb/s
       encoded 600 frames, 352.59 fps, 22096.60 kb/s
      
       NO_NEXT_BUDDY NO_LB_BIAS
       encoded 600 frames, 425.75 fps, 22096.60 kb/s
       encoded 600 frames, 425.45 fps, 22096.60 kb/s
       encoded 600 frames, 422.49 fps, 22096.60 kb/s
      
      Peter pointed out that this is better done via newidle_idx,
      not via LB_BIAS, newidle balancing should look for where
      there is load _now_, not where there was load 2 ticks ago.
      
      Worst-case latencies are improved as well as no buddies
      means less vruntime spread. (as per prior lkml discussions)
      
      This change improves kbuild-peak parallelism as well.
      Reported-by: NJason Garrett-Glaser <darkshikari@gmail.com>
      Signed-off-by: NMike Galbraith <efault@gmx.de>
      Signed-off-by: NPeter Zijlstra <a.p.zijlstra@chello.nl>
      LKML-Reference: <1253011667.9128.16.camel@marge.simson.net>
      Signed-off-by: NIngo Molnar <mingo@elte.hu>
      0ec9fab3
    • P
      sched: Tweak wake_idx · 78e7ed53
      Peter Zijlstra 提交于
      When merging select_task_rq_fair() and sched_balance_self() we lost
      the use of wake_idx, restore that and set them to 0 to make wake
      balancing more aggressive.
      Signed-off-by: NPeter Zijlstra <a.p.zijlstra@chello.nl>
      LKML-Reference: <new-submission>
      Signed-off-by: NIngo Molnar <mingo@elte.hu>
      78e7ed53
    • P
      sched: Merge select_task_rq_fair() and sched_balance_self() · c88d5910
      Peter Zijlstra 提交于
      The problem with wake_idle() is that is doesn't respect things like
      cpu_power, which means it doesn't deal well with SMT nor the recent
      RT interaction.
      
      To cure this, it needs to do what sched_balance_self() does, which
      leads to the possibility of merging select_task_rq_fair() and
      sched_balance_self().
      
      Modify sched_balance_self() to:
      
        - update_shares() when walking up the domain tree,
          (it only called it for the top domain, but it should
           have done this anyway), which allows us to remove
          this ugly bit from try_to_wake_up().
      
        - do wake_affine() on the smallest domain that contains
          both this (the waking) and the prev (the wakee) cpu for
          WAKE invocations.
      
      Then use the top-down balance steps it had to replace wake_idle().
      
      This leads to the dissapearance of SD_WAKE_BALANCE and
      SD_WAKE_IDLE_FAR, with SD_WAKE_IDLE replaced with SD_BALANCE_WAKE.
      
      SD_WAKE_AFFINE needs SD_BALANCE_WAKE to be effective.
      
      Touch all topology bits to replace the old with new SD flags --
      platforms might need re-tuning, enabling SD_BALANCE_WAKE
      conditionally on a NUMA distance seems like a good additional
      feature, magny-core and small nehalem systems would want this
      enabled, systems with slow interconnects would not.
      Signed-off-by: NPeter Zijlstra <a.p.zijlstra@chello.nl>
      LKML-Reference: <new-submission>
      Signed-off-by: NIngo Molnar <mingo@elte.hu>
      c88d5910
  9. 11 9月, 2009 2 次提交
    • M
      powerpc/nvram: Enable use Generic NVRAM driver for different size chips · d331d830
      Martyn Welch 提交于
      Remove the reliance on a staticly defined NVRAM size, allowing
      platforms to support NVRAMs with sizes differing from the standard.
      
      A fall back value is provided for platforms not supporting this extension.
      Signed-off-by: NMartyn Welch <martyn.welch@gefanuc.com>
      Signed-off-by: NBenjamin Herrenschmidt <benh@kernel.crashing.org>
      d331d830
    • P
      powerpc: Fix bug where perf_counters breaks oprofile · a6dbf93a
      Paul Mackerras 提交于
      Currently there is a bug where if you use oprofile on a pSeries
      machine, then use perf_counters, then use oprofile again, oprofile
      will not work correctly; it will lose the PMU configuration the next
      time the hypervisor does a partition context switch, and thereafter
      won't count anything.
      
      Maynard Johnson identified the sequence causing the problem:
      - oprofile setup calls ppc_enable_pmcs(), which calls
        pseries_lpar_enable_pmcs, which tells the hypervisor that we want
        to use the PMU, and sets the "PMU in use" flag in the lppaca.
        This flag tells the hypervisor whether it needs to save and restore
        the PMU config.
      - The perf_counter code sets and clears the "PMU in use" flag directly
        as it context-switches the PMU between tasks, and leaves it clear
        when it finishes.
      - oprofile setup, called for a new oprofile run, calls ppc_enable_pmcs,
        which does nothing because it has already been called.  In particular
        it doesn't set the "PMU in use" flag.
      
      This fixes the problem by arranging for ppc_enable_pmcs to always set
      the "PMU in use" flag.  It makes the perf_counter code call
      ppc_enable_pmcs also rather than calling the lower-level function
      directly, and removes the setting of the "PMU in use" flag from
      pseries_lpar_enable_pmcs, since that is now done in its caller.
      
      This also removes the declaration of pasemi_enable_pmcs because it
      isn't defined anywhere.
      Reported-by: NMaynard Johnson <mpjohn@us.ibm.com>
      Signed-off-by: NPaul Mackerras <paulus@samba.org>
      Cc: <stable@kernel.org)
      Signed-off-by: NBenjamin Herrenschmidt <benh@kernel.crashing.org>
      a6dbf93a