1. 08 Dec 2011 (1 commit)
    • powerpc: Provide a way for KVM to indicate that NV GPR values are lost · 2fde6d20
      Committed by Paul Mackerras
      This fixes a problem where a CPU thread coming out of nap mode can
      think it has valid values in the nonvolatile GPRs (r14 - r31) as saved
      away in power7_idle, but in fact the values have been trashed because
      the thread was used for KVM in the mean time.  The result is that the
      thread crashes because code that called power7_idle (e.g.,
      pnv_smp_cpu_kill_self()) goes to use values in registers that have
      been trashed.
      
      The bit field in SRR1 that tells whether state was lost only reflects
      the most recent nap, which may not have been the nap instruction in
      power7_idle.  So we need an extra PACA field to indicate that state
      has been lost even if SRR1 indicates that the most recent nap didn't
      lose state.  We clear this field when saving the state in power7_idle,
      we set it to a non-zero value when we use the thread for KVM, and we
      test it in power7_wakeup_noloss.
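
      A minimal sketch of the flow described above; the helper names and
      the exact SRR1 test are illustrative, only the idea of the extra
      PACA field is from this patch:

        /* power7_idle: nonvolatile GPRs just saved, state currently valid */
        local_paca->nap_state_lost = 0;

        /* KVM secondary-thread code: this thread's state will be trashed */
        local_paca->nap_state_lost = 1;

        /* power7_wakeup_noloss: trust the registers only if *both* SRR1
         * and the PACA flag agree that nothing was lost */
        if (!srr1_state_lost(mfspr(SPRN_SRR1)) && !local_paca->nap_state_lost)
                return_with_gprs_intact();      /* hypothetical fast path */
        else
                take_full_restore_path();       /* hypothetical slow path */
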
      Signed-off-by: Paul Mackerras <paulus@samba.org>
      Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
  2. 20 Sep 2011 (1 commit)
  3. 12 Jul 2011 (2 commits)
    • KVM: PPC: Add support for Book3S processors in hypervisor mode · de56a948
      Committed by Paul Mackerras
      This adds support for KVM running on 64-bit Book 3S processors,
      specifically POWER7, in hypervisor mode.  Using hypervisor mode means
      that the guest can use the processor's supervisor mode.  That means
      that the guest can execute privileged instructions and access privileged
      registers itself without trapping to the host.  This gives excellent
      performance, but does mean that KVM cannot emulate a processor
      architecture other than the one that the hardware implements.
      
      This code assumes that the guest is running paravirtualized using the
      PAPR (Power Architecture Platform Requirements) interface, which is the
      interface that IBM's PowerVM hypervisor uses.  That means that existing
      Linux distributions that run on IBM pSeries machines will also run
      under KVM without modification.  In order to communicate the PAPR
      hypercalls to qemu, this adds a new KVM_EXIT_PAPR_HCALL exit code
      to include/linux/kvm.h.
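
      A rough sketch of how a userspace VMM might consume the new exit
      code; do_hcall() is a hypothetical dispatcher, not part of this
      patch:

        void handle_exit(struct kvm_run *run)
        {
                switch (run->exit_reason) {
                case KVM_EXIT_PAPR_HCALL:
                        /* hypercall number and arguments arrive in the
                         * mmap'ed kvm_run area; the return value goes
                         * back the same way */
                        run->papr_hcall.ret = do_hcall(run->papr_hcall.nr,
                                                       run->papr_hcall.args);
                        break;
                }
        }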
      
      Currently the choice between book3s_hv support and book3s_pr support
      (i.e. the existing code, which runs the guest in user mode) has to be
      made at kernel configuration time, so a given kernel binary can only
      do one or the other.
      
      This new book3s_hv code doesn't support MMIO emulation at present.
      Since we are running paravirtualized guests, this isn't a serious
      restriction.
      
      With the guest running in supervisor mode, most exceptions go straight
      to the guest.  We will never get data or instruction storage or segment
      interrupts, alignment interrupts, decrementer interrupts, program
      interrupts, single-step interrupts, etc., coming to the hypervisor from
      the guest.  Therefore this introduces a new KVMTEST_NONHV macro for the
      exception entry path so that we don't have to do the KVM test on entry
      to those exception handlers.
      
      We do however get hypervisor decrementer, hypervisor data storage,
      hypervisor instruction storage, and hypervisor emulation assist
      interrupts, so we have to handle those.
      
      In hypervisor mode, real-mode accesses can access all of RAM, not just
      a limited amount.  Therefore we put all the guest state in the vcpu.arch
      and use the shadow_vcpu in the PACA only for temporary scratch space.
      We allocate the vcpu with kzalloc rather than vzalloc, and we don't use
      anything in the kvmppc_vcpu_book3s struct, so we don't allocate it.
      We don't have a shared page with the guest, but we still need a
      kvm_vcpu_arch_shared struct to store the values of various registers,
      so we include one in the vcpu_arch struct.
      
      The POWER7 processor has a restriction that all threads in a core have
      to be in the same partition.  MMU-on kernel code counts as a partition
      (partition 0), so we have to do a partition switch on every entry to and
      exit from the guest.  At present we require the host and guest to run
      in single-thread mode because of this hardware restriction.
      
      This code allocates a hashed page table for the guest and initializes
      it with HPTEs for the guest's Virtual Real Memory Area (VRMA).  We
      require that the guest memory is allocated using 16MB huge pages, in
      order to simplify the low-level memory management.  This also means that
      we can get away without tracking paging activity in the host for now,
      since huge pages can't be paged or swapped.
      
      This also adds a few new exports needed by the book3s_hv code.
      Signed-off-by: Paul Mackerras <paulus@samba.org>
      Signed-off-by: Alexander Graf <agraf@suse.de>
    • KVM: PPC: Split host-state fields out of kvmppc_book3s_shadow_vcpu · 3c42bf8a
      Committed by Paul Mackerras
      There are several fields in struct kvmppc_book3s_shadow_vcpu that
      temporarily store bits of host state while a guest is running,
      rather than anything relating to the particular guest or vcpu.
      This splits them out into a new kvmppc_host_state structure and
      modifies the definitions in asm-offsets.c to suit.
      
      On 32-bit, we have a kvmppc_host_state structure inside the
      kvmppc_book3s_shadow_vcpu since the assembly code needs to be able
      to get to them both with one pointer.  On 64-bit they are separate
      fields in the PACA.  This means that on 64-bit we don't need to
      copy the kvmppc_host_state in and out on vcpu load/unload, and
      in future will mean that the book3s_hv code doesn't need a
      shadow_vcpu struct in the PACA at all.  That does mean that we
      have to be careful not to rely on any values persisting in the
      hstate field of the paca across any point where we could block
      or get preempted.
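
      A condensed sketch of the resulting layout (fields abbreviated; the
      real structures carry more saved state than shown here):

        struct kvmppc_host_state {
                ulong host_r1;          /* host stack pointer */
                ulong host_r2;          /* host TOC pointer */
                ulong vmhandler;        /* host interrupt handler address */
                ulong scratch0;
                ulong scratch1;
                u8 in_guest;            /* between guest entry and exit? */
        };

        /* 32-bit: embedded, so one pointer reaches both structures */
        struct kvmppc_book3s_shadow_vcpu {
                /* ... guest-side fields ... */
                struct kvmppc_host_state hstate;
        };

        /* 64-bit: the PACA instead gains its own hstate field */
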
      Signed-off-by: Paul Mackerras <paulus@samba.org>
      Signed-off-by: Alexander Graf <agraf@suse.de>
  4. 29 Jun 2011 (1 commit)
    • powerpc/book3e-64: use a separate TLB handler when linear map is bolted · f67f4ef5
      Committed by Scott Wood
      On MMUs such as FSL where we can guarantee the entire linear mapping is
      bolted, we don't need to worry about linear TLB misses.  If on top of
      that we do a full table walk, we get rid of all recursive TLB faults, and
      can dispense with some state saving.  This gains a few percent on
      TLB-miss-heavy workloads, and around 50% on a benchmark that had a high
      rate of virtual page table faults under the normal handler.
      
      While touching the EX_TLB layout, remove EX_TLB_MMUCR0, EX_TLB_SRR0, and
      EX_TLB_SRR1 as they're not used.
      
      [BenH: Fixed build with 64K pages (wsp config)]
      Signed-off-by: Scott Wood <scottwood@freescale.com>
      Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
  5. 04 May 2011 (1 commit)
    • powerpc: Save Come-From Address Register (CFAR) in exception frame · 48404f2e
      Committed by Paul Mackerras
      Recent 64-bit server processors (POWER6 and POWER7) have a "Come-From
      Address Register" (CFAR), that records the address of the most recent
      branch or rfid (return from interrupt) instruction for debugging purposes.
      
      This saves the value of the CFAR in the exception entry code and stores
      it in the exception frame.  We also make xmon print the CFAR value in
      its register dump code.
      
      Rather than extend the pt_regs struct at this time, we steal the orig_gpr3
      field, which is only used for system calls, and use it for the CFAR value
      for all exceptions/interrupts other than system calls.  This means we
      don't save the CFAR on system calls, which is not a great problem since
      system calls tend not to happen unexpectedly, and also avoids adding the
      overhead of reading the CFAR to the system call entry path.
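
      In sketch form, a consumer such as xmon can then do something like
      the following; regs_cfar() is a hypothetical helper (the save itself
      happens in the assembly exception prologs):

        static unsigned long regs_cfar(struct pt_regs *regs)
        {
                /* orig_gpr3 holds the CFAR only for non-syscall traps */
                if (TRAP(regs) == 0xc00)        /* 0xc00 = system call */
                        return 0;
                return regs->orig_gpr3;
        }
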
      Signed-off-by: Paul Mackerras <paulus@samba.org>
      Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
  6. 27 Apr 2011 (1 commit)
  7. 20 Apr 2011 (1 commit)
  8. 19 Oct 2010 (1 commit)
    • irq_work: Add generic hardirq context callbacks · e360adbe
      Committed by Peter Zijlstra
      Provide a mechanism that allows running code in IRQ context. It is
      most useful for NMI code that needs to interact with the rest of the
      system -- like wakeup a task to drain buffers.
      
      Perf currently has such a mechanism, so extract that and provide it as
      a generic feature, independent of perf so that others may also
      benefit.
      
      The IRQ context callback is generated through self-IPIs where
      possible, or on architectures like powerpc the decrementer (the
      built-in timer facility) is set to generate an interrupt immediately.
      
      Architectures that don't have anything like this have to make do with a
      callback from the timer tick. These architectures can call
      irq_work_run() at the tail of any IRQ handlers that might enqueue such
      work (like the perf IRQ handler) to avoid undue latencies in
      processing the work.
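
      A minimal usage sketch of the new interface (the callback runs in
      hardirq context shortly after irq_work_queue() returns):

        #include <linux/irq_work.h>

        static struct task_struct *drain_task;  /* hypothetical consumer */
        static struct irq_work drain_work;

        static void drain_cb(struct irq_work *work)
        {
                wake_up_process(drain_task);    /* fine here, unsafe in NMI */
        }

        static int __init drain_setup(void)
        {
                init_irq_work(&drain_work, drain_cb);
                return 0;
        }

        static void some_nmi_handler(void)
        {
                /* defer the wakeup; the self-IPI (or decrementer) will
                 * run drain_cb() as soon as interrupts allow */
                irq_work_queue(&drain_work);
        }
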
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Acked-by: Kyle McMartin <kyle@mcmartin.ca>
      Acked-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
      [ various fixes ]
      Signed-off-by: Huang Ying <ying.huang@intel.com>
      LKML-Reference: <1287036094.7768.291.camel@yhuang-dev>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
  9. 02 Sep 2010 (1 commit)
    • powerpc: Account time using timebase rather than PURR · cf9efce0
      Committed by Paul Mackerras
      Currently, when CONFIG_VIRT_CPU_ACCOUNTING is enabled, we use the
      PURR register for measuring the user and system time used by
      processes, as well as other related times such as hardirq and
      softirq times.  This turns out to be quite confusing for users
      because it means that a program will often be measured as taking
      less time when run on a multi-threaded processor (SMT2 or SMT4 mode)
      than it does when run on a single-threaded processor (ST mode), even
      though the program takes longer to finish.  The discrepancy is
      accounted for as stolen time, which is also confusing, particularly
      when there are no other partitions running.
      
      This changes the accounting to use the timebase instead, meaning that
      the reported user and system times are the actual number of real-time
      seconds that the program was executing on the processor thread,
      regardless of which SMT mode the processor is in.  Thus a program will
      generally show greater user and system times when run on a
      multi-threaded processor than on a single-threaded processor.
      
      On pSeries systems on POWER5 or later processors, we measure the
      stolen time (time when this partition wasn't running) using the
      hypervisor dispatch trace log.  We check for new entries in the
      log on every entry from user mode and on every transition from
      kernel process context to soft or hard IRQ context (i.e. when
      account_system_vtime() gets called).  So that we can correctly
      distinguish time stolen from user time and time stolen from system
      time, without having to check the log on every exit to user mode,
      we store separate timestamps for exit to user mode and entry from
      user mode.
      
      On systems that have a SPURR (POWER6 and POWER7), we read the SPURR
      in account_system_vtime() (as before), and then apportion the SPURR
      ticks since the last time we read it between scaled user time and
      scaled system time according to the relative proportions of user
      time and system time over the same interval.  This avoids having to
      read the SPURR on every kernel entry and exit.  On systems that have
      PURR but not SPURR (i.e., POWER5), we do the same using the PURR
      rather than the SPURR.
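
      The apportioning amounts to the following arithmetic (variable names
      illustrative):

        /* deltas since the last snapshot, each in its own tick units */
        u64 spurr_delta = spurr_now - spurr_prev;       /* PURR on POWER5 */
        u64 user_delta  = tb_utime_now - tb_utime_prev;
        u64 sys_delta   = tb_stime_now - tb_stime_prev;

        /* split the scaled ticks in the same user:system proportion as
         * the timebase-accounted time over the interval */
        scaled_utime = spurr_delta * user_delta / (user_delta + sys_delta);
        scaled_stime = spurr_delta - scaled_utime;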
      
      This disables the DTL user interface in /sys/debug/kernel/powerpc/dtl
      for now since it conflicts with the use of the dispatch trace log
      by the time accounting code.
      Signed-off-by: Paul Mackerras <paulus@samba.org>
      Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
  10. 31 Jul 2010 (1 commit)
  11. 21 May 2010 (1 commit)
    • powerpc/kexec: Fix race in kexec shutdown · 1fc711f7
      Committed by Michael Neuling
      In kexec_prepare_cpus, the primary CPU IPIs the secondary CPUs to
      kexec_smp_down().  kexec_smp_down() calls kexec_smp_wait() which sets
      the hw_cpu_id() to -1.  The primary does this while leaving IRQs on
      which means the primary can take a timer interrupt which can lead to
      the IPIing one of the secondary CPUs (say, for a scheduler re-balance)
      but since the secondary CPU now has a hw_cpu_id = -1, we IPI CPU
      -1... Kaboom!
      
      We are hitting this case regularly on POWER7 machines.
      
      There is also a second race, where the primary will tear down the MMU
      mappings before knowing the secondaries have entered real mode.
      
      Also, the secondaries are clearing out any pending IPIs before
      guaranteeing that no more will be received.
      
      This changes kexec_prepare_cpus() so that we turn off IRQs in the
      primary CPU much earlier.  It adds a paca flag to say that the
      secondaries have entered the kexec_smp_down() IPI and turned off IRQs,
      rather than overloading hw_cpu_id with -1.  This new paca flag is
      again used to indicate when the secondaries have entered real mode.
      
      It also ensures that all CPUs have their IRQs off before we clear out
      any pending IPI requests (in kexec_cpu_down()) to ensure there are no
      trailing IPIs left unacknowledged.
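
      A sketch of the resulting handshake (flag and state names
      illustrative):

        /* secondary CPU, inside the kexec_smp_down() IPI handler */
        hard_irq_disable();
        get_paca()->kexec_state = KEXEC_STATE_IRQS_OFF;
        /* ... clear pending IPIs, switch to real mode ... */
        get_paca()->kexec_state = KEXEC_STATE_REAL_MODE;

        /* primary CPU: only tear down the MMU mappings once every
         * secondary has reported in */
        for_each_online_cpu(i)
                while (paca[i].kexec_state < KEXEC_STATE_REAL_MODE)
                        barrier();
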
      Signed-off-by: Michael Neuling <mikey@neuling.org>
      Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
  12. 17 May 2010 (3 commits)
  13. 09 Mar 2010 (1 commit)
    • powerpc: Dynamically allocate pacas · 1426d5a3
      Committed by Michael Ellerman
      On 64-bit kernels we currently have a 512 byte struct paca_struct for
      each cpu (usually just called "the paca"). Currently they are statically
      allocated, which means a kernel built for a large number of cpus will
      waste a lot of space if it's booted on a machine with few cpus.
      
      We can avoid that by only allocating the number of pacas we need at
      boot. However this is complicated by the fact that we need to access
      the paca before we know how many cpus there are in the system.
      
      The solution is to dynamically allocate enough space for NR_CPUS pacas,
      but then later in boot when we know how many cpus we have, we free any
      unused pacas.
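
      A sketch of the boot-time sequence; the 2010 kernel used the early
      LMB allocator, and the names below are illustrative:

        /* early boot: the cpu count is unknown, size for the worst case */
        paca = early_alloc(sizeof(struct paca_struct) * NR_CPUS);
        initialise_pacas(NR_CPUS);

        /* later, once the device tree tells us how many cpus exist */
        free_unused_pacas();    /* returns pacas for cpus >= nr_cpu_ids */
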
      Signed-off-by: Michael Ellerman <michael@ellerman.id.au>
      Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
  14. 01 Mar 2010 (1 commit)
    • KVM: PPC: Use PACA backed shadow vcpu · 7e57cba0
      Committed by Alexander Graf
      We're being horribly racy right now. All the entry and exit code hijacks
      random fields from the PACA that could easily be used by different code in
      case we get interrupted, for example by a #MC or even page fault.
      
      After discussing this with Ben, we figured it's best to reserve some more
      space in the PACA and just shove off some vcpu state to there.
      
      That way we can drastically improve the readability of the code, make it
      less racy and less complex.
      Signed-off-by: Alexander Graf <agraf@suse.de>
      Signed-off-by: Avi Kivity <avi@redhat.com>
  15. 05 Nov 2009 (1 commit)
  16. 21 Sep 2009 (2 commits)
    • perf: Tidy up after the big rename · 57c0c15b
      Committed by Ingo Molnar
       - provide compatibility Kconfig entry for existing PERF_COUNTERS .config's
      
       - provide courtesy copy of old perf_counter.h, for user-space projects
      
       - small indentation fixups
      
       - fix up MAINTAINERS
      
       - fix small x86 printout fallout
      
       - fix up small PowerPC comment fallout (use 'counter' as in register)
      Reviewed-by: Arjan van de Ven <arjan@linux.intel.com>
      Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      LKML-Reference: <new-submission>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
    • perf: Do the big rename: Performance Counters -> Performance Events · cdd6c482
      Committed by Ingo Molnar
      Bye-bye Performance Counters, welcome Performance Events!
      
      In the past few months the perfcounters subsystem has grown out its
      initial role of counting hardware events, and has become (and is
      becoming) a much broader generic event enumeration, reporting, logging,
      monitoring, analysis facility.
      
      Naming its core object 'perf_counter' and naming the subsystem
      'perfcounters' has become more and more of a misnomer. With pending
      code like hw-breakpoints support the 'counter' name is less and
      less appropriate.
      
      All in one, we've decided to rename the subsystem to 'performance
      events' and to propagate this rename through all fields, variables
      and API names. (in an ABI compatible fashion)
      
      The word 'event' is also a bit shorter than 'counter' - which makes
      it slightly more convenient to write/handle as well.
      
      Thanks goes to Stephane Eranian who first observed this misnomer and
      suggested a rename.
      
      User-space tooling and ABI compatibility is not affected - this patch
      should be function-invariant. (Also, defconfigs were not touched to
      keep the size down.)
      
      This patch has been generated via the following script:
      
        FILES=$(find * -type f | grep -vE 'oprofile|[^K]config')
      
        sed -i \
          -e 's/PERF_EVENT_/PERF_RECORD_/g' \
          -e 's/PERF_COUNTER/PERF_EVENT/g' \
          -e 's/perf_counter/perf_event/g' \
          -e 's/nb_counters/nb_events/g' \
          -e 's/swcounter/swevent/g' \
          -e 's/tpcounter_event/tp_event/g' \
          $FILES
      
        for N in $(find . -name perf_counter.[ch]); do
          M=$(echo $N | sed 's/perf_counter/perf_event/g')
          mv $N $M
        done
      
        FILES=$(find . -name perf_event.*)
      
        sed -i \
          -e 's/COUNTER_MASK/REG_MASK/g' \
          -e 's/COUNTER/EVENT/g' \
          -e 's/\<event\>/event_id/g' \
          -e 's/counter/event/g' \
          -e 's/Counter/Event/g' \
          $FILES
      
      ... to keep it as correct as possible. This script can also be
      used by anyone who has pending perfcounters patches - it converts
      a Linux kernel tree over to the new naming. We tried to time this
      change to the point in time where the amount of pending patches
      is the smallest: the end of the merge window.
      
      Namespace clashes were fixed up in a preparatory patch - and some
      stylistic fallout will be fixed up in a subsequent patch.
      
      ( NOTE: 'counters' are still the proper terminology when we deal
        with hardware registers - and these sed scripts are a bit
        over-eager in renaming them. I've undone some of that, but
        in case there's something left where 'counter' would be
        better than 'event' we can undo that on an individual basis
        instead of touching an otherwise nicely automated patch. )
      Suggested-by: Stephane Eranian <eranian@google.com>
      Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Acked-by: Paul Mackerras <paulus@samba.org>
      Reviewed-by: Arjan van de Ven <arjan@linux.intel.com>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: David Howells <dhowells@redhat.com>
      Cc: Kyle McMartin <kyle@mcmartin.ca>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: <linux-arch@vger.kernel.org>
      LKML-Reference: <new-submission>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
  17. 20 Aug 2009 (1 commit)
  18. 09 Jun 2009 (1 commit)
  19. 09 Jan 2009 (1 commit)
    • powerpc: Provide a way to defer perf counter work until interrupts are enabled · 93a6d3ce
      Committed by Paul Mackerras
      Because 64-bit powerpc uses lazy (soft) interrupt disabling, it is
      possible for a performance monitor exception to come in when the
      kernel thinks interrupts are disabled (i.e. when they are
      soft-disabled but hard-enabled).  In such a situation the performance
      monitor exception handler might have some processing to do (such as
      process wakeups) which can't be done in what is effectively an NMI
      handler.
      
      This provides a way to defer that work until interrupts get enabled,
      either in raw_local_irq_restore() or by returning from an interrupt
      handler to code that had interrupts enabled.  We have a per-processor
      flag that indicates that there is work pending to do when interrupts
      subsequently get re-enabled.  This flag is checked in the interrupt
      return path and in raw_local_irq_restore(), and if it is set,
      perf_counter_do_pending() is called to do the pending work.
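
      In sketch form, the check on the enable path looks like this
      (simplified):

        void raw_local_irq_restore(unsigned long en)
        {
                get_paca()->soft_enabled = en;
                if (!en)
                        return;

                /* flag set by the performance monitor exception handler */
                if (get_paca()->perf_counter_pending) {
                        get_paca()->perf_counter_pending = 0;
                        perf_counter_do_pending();      /* wakeups etc. */
                }

                /* ... then hard-enable if an interrupt arrived while we
                 * were soft-disabled ... */
        }
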
      Signed-off-by: Paul Mackerras <paulus@samba.org>
  20. 16 Sep 2008 (1 commit)
    • powerpc: Make it possible to move the interrupt handlers away from the kernel · 1f6a93e4
      Committed by Paul Mackerras
      This changes the way that the exception prologs transfer control to
      the handlers in 64-bit kernels with the aim of making it possible to
      have the prologs separate from the main body of the kernel.  Now,
      instead of computing the address of the handler by taking the top
      32 bits of the paca address (to get the 0xc0000000........ part) and
      ORing in something in the bottom 16 bits, we get the base address of
      the kernel by doing a load from the paca and adding an offset.
      
      This also replaces an mfmsr and an ori to compute the MSR value for
      the handler with a load from the paca.  That makes it unnecessary to
      have a separate version of EXCEPTION_PROLOG_PSERIES that forces 64-bit
      mode.
      
      We can no longer use direct branches in the exception prolog code,
      which means that the SLB miss handlers can't branch directly to
      .slb_miss_realmode any more.  Instead we have to compute the address
      and do an indirect branch.  This is conditional on CONFIG_RELOCATABLE;
      for non-relocatable kernels we use a direct branch as before.  (A later
      change will allow CONFIG_RELOCATABLE to be set on 64-bit powerpc.)
      
      Since the secondary CPUs on pSeries start execution in the first 0x100
      bytes of real memory and then have to get to wherever the kernel is,
      we can't use a direct branch to get there.  Instead this changes
      __secondary_hold_spinloop from a flag to a function pointer.  When it
      is set to a non-NULL value, the secondary CPUs jump to the function
      pointed to by that value.
      
      Finally this eliminates one code difference between 32-bit and 64-bit
      by making __secondary_hold be the text address of the secondary CPU
      spinloop rather than a function descriptor for it.
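
      The spinloop change, expressed in C terms (illustrative; the real
      code is in assembly):

        /* before: poll a flag, then branch to a fixed kernel address */
        while (!__secondary_hold_spinloop)
                ;       /* spin */

        /* after: poll a function pointer and jump wherever it says */
        while (!__secondary_hold_spinloop)
                ;       /* spin */
        ((void (*)(void))__secondary_hold_spinloop)();
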
      Signed-off-by: Paul Mackerras <paulus@samba.org>
  21. 04 Aug 2008 (1 commit)
  22. 24 Apr 2008 (1 commit)
  23. 15 Apr 2008 (1 commit)
  24. 07 Feb 2008 (1 commit)
    • taskstats scaled time cleanup · 06b8e878
      Committed by Michael Neuling
      This moves the ability to scale cputime into generic code.  This allows us
      to fix the issue in kernel/timer.c (noticed by Balbir) where we could only
      add an unscaled value to the scaled utime/stime.
      
      This adds a cputime_to_scaled function.  As before, the POWERPC version
      does the scaling based on the last SPURR/PURR ratio calculated.  The
      generic and s390 (only other arch to implement asm/cputime.h) versions are
      both NOPs.
      
      Also moves the SPURR and PURR snapshots closer.
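
      In sketch form; the powerpc scaling factor is whatever SPURR/PURR
      ratio was last observed, and the names below are illustrative:

        /* generic and s390: scaling is the identity */
        #define cputime_to_scaled(ct)   (ct)

        /* powerpc: apply the ratio from the last snapshot */
        static inline cputime_t cputime_to_scaled(cputime_t ct)
        {
                return ct * last_spurr_delta / last_purr_delta;
        }
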
      Signed-off-by: Michael Neuling <mikey@neuling.org>
      Cc: Jay Lan <jlan@engr.sgi.com>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  25. 20 Oct 2007 (1 commit)
  26. 19 Oct 2007 (1 commit)
    • powerpc: add scaled time accounting · 4603ac18
      Committed by Michael Neuling
      This adds POWERPC specific hooks for scaled time accounting.
      
      POWER6 includes a SPURR register.  The SPURR is based off the PURR register
      but is scaled based on CPU frequency and issue rates.  This gives a more
      accurate account of the instructions used per task.  The PURR and timebase
      will be constant relative to the wall clock, irrespective of the CPU
      frequency.
      
      This implementation reads the SPURR register in account_system_vtime,
      which is only called on context switch and on hard and soft irq entry
      and exit.  The percentage of user and system time is then estimated
      using the ratio of these as accounted by the PURR.  If the SPURR is
      not present, the PURR is read instead.
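
      In sketch form, the read site amounts to this (illustrative):

        u64 now = cpu_has_feature(CPU_FTR_SPURR) ? mfspr(SPRN_SPURR)
                                                 : mfspr(SPRN_PURR);
        u64 delta_scaled = now - prev_snapshot;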
      
      An earlier implementation of this patch read the SPURR whenever the PURR
      was read, which included the system call entry and exit path.
      Unfortunately this showed a performance regression on lmbench runs, so was
      re-implemented.
      
      I've included the lmbench results here when run bare metal on POWER6.  1st
      column is the unpatched results.  2nd column is the results using the below
      patch and the 3rd is the % diff of these results from the base.  4th and
      5th columns are the results and % difference from the base using the older
      patch (SPURR read in syscall entry/exit path).
      
                                    Base        Scaled-Acct     SPURR-in-syscall
                                   Result      Result  % diff    Result % diff
      Simple syscall:              0.3086      0.3086  0.0000    0.3452 11.8600
      Simple read:                 0.4591      0.4671  1.7425    0.5044 9.86713
      Simple write:                0.4364      0.4366  0.0458    0.4731 8.40971
      Simple stat:                 2.0055      2.0295  1.1967    2.0669 3.06158
      Simple fstat:                0.5962      0.5876  -1.442    0.6368 6.80979
      Simple open/close:           3.1283      3.1009  -0.875    3.2088 2.57328
      Select on 10 fd's:           0.8554      0.8457  -1.133    0.8667 1.32101
      Select on 100 fd's:          3.5292      3.6329  2.9383    3.6664 3.88756
      Select on 250 fd's:          7.9097      8.1881  3.5197    8.2242 3.97613
      Select on 500 fd's:          15.2659     15.836  3.7357    15.873 3.97814
      Select on 10 tcp fd's:       0.9576      0.9416  -1.670    0.9752 1.83792
      Select on 100 tcp fd's:      7.248       7.2254  -0.311    7.2685 0.28283
      Select on 250 tcp fd's:      17.7742     17.707  -0.375    17.749 -0.1406
      Select on 500 tcp fd's:      35.4258     35.25   -0.496    35.286 -0.3929
      Signal handler installation: 0.6131      0.6075  -0.913    0.647  5.52927
      Signal handler overhead:     2.0919      2.1078  0.7600    2.1831 4.35967
      Protection fault:            0.7345      0.7478  1.8107    0.8031 9.33968
      Pipe latency:                33.006      16.398  -50.31    33.475 1.42368
      AF_UNIX sock stream latency: 14.5093     30.910  113.03    30.715 111.692
      Process fork+exit:           219.8       222.8   1.3648    229.37 4.35623
      Process fork+execve:         876.14      873.28  -0.32     868.66 -0.8533
      Process fork+/bin/sh -c:     2830        2876.5  1.6431    2958   4.52296
      File /var/tmp/XXX write bw:  1193497     1195536 0.1708    118657 -0.5799
      Pagefaults on /var/tmp/XXX:  3.1272      3.2117  2.7020    3.2521 3.99398
      
      Also, kernel compile times show no difference with this patch applied.
      
      [pbadari@us.ibm.com: Avoid unnecessary PURR reading]
      Signed-off-by: Michael Neuling <mikey@neuling.org>
      Cc: Balbir Singh <balbir@in.ibm.com>
      Cc: Jay Lan <jlan@engr.sgi.com>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Signed-off-by: Badari Pulavarty <pbadari@us.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  27. 03 Oct 2007 (1 commit)
    • [POWERPC] ppc64: support CONFIG_DEBUG_PREEMPT · 048c8bc9
      Committed by Hugh Dickins
      Add CONFIG_DEBUG_PREEMPT support to ppc64: it was useful for testing
      get_paca() preemption.  Cheat a little, just use debug_smp_processor_id()
      in the debug version of get_paca(): it contains all the right checks and
      reporting, though get_paca() doesn't really use smp_processor_id().
      
      Use local_paca for what might have been called __raw_get_paca().
      Silence harmless warnings from io.h and lparcfg.c with local_paca -
      it is okay for iseries_lparcfg_data to be referencing shared_proc
      with preemption enabled: all cpus should show the same value for
      shared_proc.
      
      Why do other architectures need TRACE_IRQFLAGS_SUPPORT for DEBUG_PREEMPT?
      I don't know, ppc64 appears to get along fine without it.
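
      The cheat amounts to roughly this:

        #ifdef CONFIG_DEBUG_PREEMPT
        /* all the right checks and reporting, even though we don't
         * actually need the cpu number itself */
        #define get_paca()      ((void) debug_smp_processor_id(), local_paca)
        #else
        #define get_paca()      local_paca
        #endif
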
      Signed-off-by: Hugh Dickins <hugh@veritas.com>
      Signed-off-by: Paul Mackerras <paulus@samba.org>
  28. 09 May 2007 (1 commit)
    • [POWERPC] Introduce address space "slices" · d0f13e3c
      Committed by Benjamin Herrenschmidt
      The basic issue is to be able to do what hugetlbfs does but with
      different page sizes for some other special filesystems; more
      specifically, my need is:
      
       - Huge pages
      
       - SPE local store mappings using 64K pages on a 4K base page size
      kernel on Cell
      
       - Some special 4K segments in 64K-page kernels for mapping a dodgy
      type of powerpc-specific infiniband hardware that requires 4K MMU
      mappings for various reasons I won't explain here.
      
      The main issues are:
      
       - To maintain/keep track of the page size per "segment" (as we can
      only have one page size per segment on powerpc, which are 256MB
      divisions of the address space).
      
       - To make sure special mappings stay within their allotted
      "segments" (including MAP_FIXED crap)
      
       - To make sure everybody else doesn't mmap/brk/grow_stack into a
      "segment" that is used for a special mapping
      
      Some of the necessary mechanisms to handle that were present in the
      hugetlbfs code, but mostly in ways not suitable for anything else.
      
      The patch relies on some changes to the generic get_unmapped_area()
      that just got merged.  It still hijacks hugetlb callbacks here or
      there as the generic code hasn't been entirely cleaned up yet but
      that shouldn't be a problem.
      
      So what is a slice ?  Well, I re-used the mechanism used formerly by our
      hugetlbfs implementation which divides the address space in
      "meta-segments" which I called "slices".  The division is done using
      256MB slices below 4G, and 1T slices above.  Thus the address space is
      divided currently into 16 "low" slices and 16 "high" slices.  (Special
      case: high slice 0 is the area between 4G and 1T).
      
      Doing so simplifies significantly the tracking of segments and avoids
      having to keep track of all the 256MB segments in the address space.
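
      A sketch of the address-to-slice mapping this implies (the helper is
      hypothetical; the shift values follow the 256MB/1TB split):

        #define SLICE_LOW_SHIFT         28      /* 256MB low slices */
        #define SLICE_HIGH_SHIFT        40      /* 1TB high slices */

        static unsigned int addr_to_slice(unsigned long addr)
        {
                if (addr < (1UL << 32))         /* 16 low slices below 4G */
                        return addr >> SLICE_LOW_SHIFT;
                /* high slices; slice 0 covers the 4G-1T range */
                return addr >> SLICE_HIGH_SHIFT;
        }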
      
      While I used the "concepts" of hugetlbfs, I mostly re-implemented
      everything in a more generic way and "ported" hugetlbfs to it.
      
      Slices can have an associated page size, which is encoded in the mmu
      context and used by the SLB miss handler to set the segment sizes.  The
      hash code currently doesn't care, it has a specific check for hugepages,
      though I might add a mechanism to provide per-slice hash mapping
      functions in the future.
      
      The slice code provide a pair of "generic" get_unmapped_area() (bottomup
      and topdown) functions that should work with any slice size.  There is
      some trickiness here so I would appreciate people to have a look at the
      implementation of these and let me know if I got something wrong.
      Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Signed-off-by: Paul Mackerras <paulus@samba.org>
  29. 24 Apr 2007 (1 commit)
  30. 21 Mar 2007 (1 commit)
  31. 16 Oct 2006 (1 commit)
    • [POWERPC] Lazy interrupt disabling for 64-bit machines · d04c56f7
      Committed by Paul Mackerras
      This implements a lazy strategy for disabling interrupts.  This means
      that local_irq_disable() et al. just clear the 'interrupts are
      enabled' flag in the paca.  If an interrupt comes along, the interrupt
      entry code notices that interrupts are supposed to be disabled, and
      clears the EE bit in SRR1, clears the 'interrupts are hard-enabled'
      flag in the paca, and returns.  This means that interrupts only
      actually get disabled in the processor when an interrupt comes along.
      
      When interrupts are enabled by local_irq_enable() et al., the code
      sets the interrupts-enabled flag in the paca, and then checks whether
      interrupts got hard-disabled.  If so, it also sets the EE bit in the
      MSR to hard-enable the interrupts.
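
      In sketch form (simplified; the interrupt-entry half lives in the
      assembly exception prologs):

        static inline void local_irq_disable(void)
        {
                get_paca()->soft_enabled = 0;   /* just record the intent */
        }

        void local_irq_enable(void)
        {
                get_paca()->soft_enabled = 1;
                if (!get_paca()->hard_enabled)  /* an interrupt came in */
                        __hard_irq_enable();    /* set MSR[EE] again */
        }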
      
      This has the potential to improve performance, and also makes it
      easier to make a kernel that can boot on iSeries and on other 64-bit
      machines, since this lazy-disable strategy is very similar to the
      soft-disable strategy that iSeries already uses.
      
      This version renames paca->proc_enabled to paca->soft_enabled, and
      changes a couple of soft-disables in the kexec code to hard-disables,
      which should fix the crash that Michael Ellerman saw.  This doesn't
      yet use a reserved CR field for the soft_enabled and hard_enabled
      flags.  This applies on top of Stephen Rothwell's patches to make it
      possible to build a combined iSeries/other kernel.
      Signed-off-by: Paul Mackerras <paulus@samba.org>
  32. 13 Sep 2006 (1 commit)
    • [POWERPC] Fix MMIO ops to provide expected barrier behaviour · f007cacf
      Committed by Paul Mackerras
      This changes the writeX family of functions to have a sync instruction
      before the MMIO store rather than after, because the generally expected
      behaviour is that the device receiving the MMIO store can be guaranteed
      to see the effects of any preceding writes to normal memory.
      
      To preserve ordering between writeX and readX, and to preserve ordering
      between preceding stores and the readX, the readX family of functions
      have had a sync added before the load.
      
      Although writeX followed by spin_unlock is not officially guaranteed
      to keep the writeX inside the spin-locked region unless an mmiowb()
      is used, there are currently drivers that depend on the previous
      behaviour on powerpc, which was that the mmiowb wasn't actually required.
      Therefore we have a per-cpu flag that is set by writeX, cleared by
      __raw_spin_lock and mmiowb, and tested by __raw_spin_unlock.  If it is
      set, __raw_spin_unlock does a sync and clears it.
      
      This changes both 32-bit and 64-bit readX/writeX.  32-bit already has a
      sync in __raw_spin_unlock (since lwsync doesn't exist on 32-bit), and thus
      doesn't need the per-cpu flag.
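
      A 64-bit sketch of the flag protocol (simplified, and the flag name
      is illustrative):

        static inline void writel(u32 val, volatile void __iomem *addr)
        {
                __asm__ __volatile__("sync" : : : "memory");
                out_le32(addr, val);
                get_paca()->io_sync = 1;        /* a writeX is outstanding */
        }

        static inline void __raw_spin_unlock(raw_spinlock_t *lock)
        {
                if (get_paca()->io_sync) {      /* keep MMIO inside lock */
                        mb();
                        get_paca()->io_sync = 0;
                }
                lock->slock = 0;
        }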
      
      Tested on G5 (PPC970) and POWER5.
      Signed-off-by: Paul Mackerras <paulus@samba.org>
  33. 08 Aug 2006 (1 commit)
  34. 15 Jun 2006 (1 commit)
    • powerpc: Use 64k pages without needing cache-inhibited large pages · bf72aeba
      Committed by Paul Mackerras
      Some POWER5+ machines can do 64k hardware pages for normal memory but
      not for cache-inhibited pages.  This patch lets us use 64k hardware
      pages for most user processes on such machines (assuming the kernel
      has been configured with CONFIG_PPC_64K_PAGES=y).  User processes
      start out using 64k pages and get switched to 4k pages if they use any
      non-cacheable mappings.
      
      With this, we use 64k pages for the vmalloc region and 4k pages for
      the imalloc region.  If anything creates a non-cacheable mapping in
      the vmalloc region, the vmalloc region will get switched to 4k pages.
      I don't know of any driver other than the DRM that would do this,
      though, and these machines don't have AGP.
      
      When a region gets switched from 64k pages to 4k pages, we do not have
      to clear out all the 64k HPTEs from the hash table immediately.  We
      use the _PAGE_COMBO bit in the Linux PTE to indicate whether the page
      was hashed in as a 64k page or a set of 4k pages.  If hash_page is
      trying to insert a 4k page for a Linux PTE and it sees that it has
      already been inserted as a 64k page, it first invalidates the 64k HPTE
      before inserting the 4k HPTE.  The hash invalidation routines also use
      the _PAGE_COMBO bit, to determine whether to look for a 64k HPTE or a
      set of 4k HPTEs to remove.  With those two changes, we can tolerate a
      mix of 4k and 64k HPTEs in the hash table, and they will all get
      removed when the address space is torn down.
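
      The demotion check in hash_page, in sketch form (simplified, with a
      hypothetical helper for the 64k invalidation):

        if ((pte & _PAGE_HASHPTE) && !(pte & _PAGE_COMBO)) {
                /* this PTE went into the hash as a single 64k page;
                 * evict that HPTE before inserting 4k HPTEs for it */
                invalidate_64k_hpte(ea, pte);
                pte |= _PAGE_COMBO;     /* now tracked as 4k HPTEs */
        }
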
      Signed-off-by: Paul Mackerras <paulus@samba.org>
  35. 12 Jun 2006 (1 commit)
  36. 26 Apr 2006 (1 commit)