1. 20 Aug 2009, 2 commits
  2. 18 Aug 2009, 1 commit
    • powerpc: Allow perf_counters to access user memory at interrupt time · 9c1e1052
      Committed by Paul Mackerras
      This provides a mechanism to allow the perf_counters code to access
      user memory in a PMU interrupt routine.  Such an access can cause
      various kinds of interrupt: SLB miss, MMU hash table miss, segment
      table miss, or TLB miss, depending on the processor.  This commit
      only deals with 64-bit classic/server processors, which use an MMU
      hash table.  32-bit processors are already able to access user memory
      at interrupt time.  Since we don't soft-disable on 32-bit, we avoid
      the possibility of reentering hash_page or the TLB miss handlers,
      since they run with interrupts disabled.
      
      On 64-bit processors, an SLB miss interrupt on a user address will
      update the slb_cache and slb_cache_ptr fields in the paca.  This is
      OK except in the case where a PMU interrupt occurs in switch_slb,
      which also accesses those fields.  To prevent this, we hard-disable
      interrupts in switch_slb.  Interrupts are already soft-disabled at
      this point, and will get hard-enabled when they get soft-enabled
      later.
      
      This also reworks slb_flush_and_rebolt: to avoid hard-disabling twice,
      and to make sure that it clears the slb_cache_ptr when called from
      callers other than switch_slb, the existing routine is renamed to
      __slb_flush_and_rebolt, which is called by switch_slb and by the new
      version of slb_flush_and_rebolt.
      
      Similarly, switch_stab (used on POWER3 and RS64 processors) gets a
      hard_irq_disable() to protect the per-cpu variables used there and
      in ste_allocate.
      
      If an MMU hash table miss interrupt occurs, normally we would call
      hash_page to look up the Linux PTE for the address and create an HPTE.
      However, hash_page is fairly complex and takes some locks, so to
      avoid the possibility of deadlock, we check the preemption count
      to see if we are in a (pseudo-)NMI handler, and if so, we don't call
      hash_page but instead treat it like a bad access that will get
      reported up through the exception table mechanism.  The handler for
      an interrupt that runs even though the interrupt occurred while
      soft-disabled (such as the PMU interrupt) is considered a pseudo-NMI
      handler, which should use nmi_enter()/nmi_exit() rather than
      irq_enter()/irq_exit().  (A sketch of this check follows the entry.)
      Acked-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Signed-off-by: Paul Mackerras <paulus@samba.org>
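      A minimal sketch of the check described above, assuming the generic in_nmi()
      test from linux/hardirq.h; the helper name is invented and this is not the
      literal diff from commit 9c1e1052:

        #include <linux/hardirq.h>      /* in_nmi(), nmi_enter(), nmi_exit() */

        /*
         * If the hash-miss fault was taken inside a pseudo-NMI handler (one
         * running under nmi_enter(), such as the PMU interrupt taken while
         * soft-disabled), we must not call hash_page(): it takes locks and
         * could deadlock.  Bailing out here makes the access get reported
         * through the exception table fixup, like a bad user access.
         */
        static inline int may_call_hash_page(void)     /* hypothetical helper */
        {
                return !in_nmi();
        }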
  3. 09 Jun 2009, 1 commit
  4. 24 Mar 2009, 1 commit
  5. 11 Mar 2009, 1 commit
  6. 09 Jan 2009, 1 commit
    • powerpc: Provide a way to defer perf counter work until interrupts are enabled · 93a6d3ce
      Committed by Paul Mackerras
      Because 64-bit powerpc uses lazy (soft) interrupt disabling, it is
      possible for a performance monitor exception to come in when the
      kernel thinks interrupts are disabled (i.e. when they are
      soft-disabled but hard-enabled).  In such a situation the performance
      monitor exception handler might have some processing to do (such as
      process wakeups) which can't be done in what is effectively an NMI
      handler.
      
      This provides a way to defer that work until interrupts get enabled,
      either in raw_local_irq_restore() or by returning from an interrupt
      handler to code that had interrupts enabled.  We have a per-processor
      flag that indicates that there is work pending to do when interrupts
      subsequently get re-enabled.  This flag is checked in the interrupt
      return path and in raw_local_irq_restore(), and if it is set,
      perf_counter_do_pending() is called to do the pending work.
      Signed-off-by: Paul Mackerras <paulus@samba.org>
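      A minimal sketch of the deferral pattern described above; the per-CPU flag
      and the function names are illustrative rather than the exact symbols the
      commit adds:

        #include <linux/percpu.h>

        extern void perf_counter_do_pending(void);

        /* Illustrative per-processor "work pending" flag. */
        static DEFINE_PER_CPU(int, perf_work_pending);

        void set_perf_counter_pending(void)        /* from the PMU pseudo-NMI */
        {
                __get_cpu_var(perf_work_pending) = 1;
        }

        /* Called on the interrupt-return path and from raw_local_irq_restore()
         * once interrupts are really being re-enabled. */
        void check_perf_counter_pending(void)
        {
                if (__get_cpu_var(perf_work_pending)) {
                        __get_cpu_var(perf_work_pending) = 0;
                        perf_counter_do_pending();  /* the deferred wakeups etc. */
                }
        }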
  7. 08 Jan 2009, 1 commit
  8. 31 Dec 2008, 4 commits
    • KVM: ppc: Implement in-kernel exit timing statistics · 73e75b41
      Committed by Hollis Blanchard
      Existing KVM statistics are either just counters (kvm_stat) reported for
      KVM generally, or trace-based approaches like kvm_trace.
      For KVM on powerpc we needed to track the timings of the different exit
      types.  While this could be achieved by parsing data created with a kvm_trace
      extension, that adds too much overhead (at least on embedded PowerPC), slowing
      down the workloads we wanted to measure.

      Therefore this patch adds an in-kernel exit timing statistic to the powerpc kvm
      code.  This statistic is available per vm & vcpu under the kvm debugfs directory.
      As this statistic adds a small but non-zero overhead, it can be enabled via a
      .config entry and is off by default.

      Since this patch touches all the powerpc kvm_stat code anyway, that code is now
      merged and simplified together with the exit timing statistic code (and still
      works with exit timing disabled in .config).  (A sketch of the accounting idea
      follows this entry.)
      Signed-off-by: Christian Ehrhardt <ehrhardt@linux.vnet.ibm.com>
      Signed-off-by: Hollis Blanchard <hollisb@us.ibm.com>
      Signed-off-by: Avi Kivity <avi@redhat.com>
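      A minimal sketch of the per-exit-type accounting idea, with illustrative
      names; the real code keys accumulators like this by exit reason and exposes
      them through debugfs:

        #include <linux/types.h>

        /* One accumulator per exit type: hit count and total timebase ticks. */
        struct exit_timing {
                u64 count;
                u64 sum_tb;
        };

        /* tb_enter was sampled when the exit was taken, tb_exit when the host
         * finished handling it. */
        static void account_exit(struct exit_timing *t, u64 tb_enter, u64 tb_exit)
        {
                t->count++;
                t->sum_tb += tb_exit - tb_enter;
        }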
    • KVM: ppc: directly insert shadow mappings into the hardware TLB · 7924bd41
      Committed by Hollis Blanchard
      Formerly, we maintained a per-vcpu shadow TLB and on every entry to the
      guest would load this array into the hardware TLB. This consumed 1280 bytes of
      memory (64 entries of 16 bytes plus a struct page pointer each), and also
      required some assembly to loop over the array on every entry.
      
      Instead of saving a copy in memory, we can just store shadow mappings directly
      into the hardware TLB, accepting that the host kernel will clobber these as
      part of the normal 440 TLB round robin. When we do that we need less than half
      the memory, and we have decreased the exit handling time for all guest exits,
      at the cost of an increased number of TLB misses because the host overwrites some
      guest entries.
      
      These savings will be increased on processors with larger TLBs or which
      implement intelligent flush instructions like tlbivax (which will avoid the
      need to walk arrays in software).
      
      In addition to the memory savings and the code simplification, we have a greater
      chance of leaving other host userspace mappings in the TLB, instead of forcing all
      subsequent tasks to re-fault all their mappings.
      Signed-off-by: Hollis Blanchard <hollisb@us.ibm.com>
      Signed-off-by: Avi Kivity <avi@redhat.com>
    • KVM: ppc: create struct kvm_vcpu_44x and introduce container_of() accessor · db93f574
      Committed by Hollis Blanchard
      This patch doesn't yet move all 44x-specific data into the new structure, but
      is the first step down that path. In the future we may also want to create a
      struct kvm_vcpu_booke.
      
      Based on a patch from Liu Yu <yu.liu@freescale.com>.
      Signed-off-by: Hollis Blanchard <hollisb@us.ibm.com>
      Signed-off-by: Avi Kivity <avi@redhat.com>
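      The accessor pattern named in the title looks roughly like this; a minimal
      sketch with the 44x-specific fields elided:

        #include <linux/kvm_host.h>

        struct kvm_vcpu_44x {
                /* 44x-specific state (e.g. shadow TLB bookkeeping) lives here */
                struct kvm_vcpu vcpu;   /* generic vcpu embedded in the arch struct */
        };

        /* Recover the containing arch struct from a generic struct kvm_vcpu *. */
        static inline struct kvm_vcpu_44x *to_44x(struct kvm_vcpu *vcpu)
        {
                return container_of(vcpu, struct kvm_vcpu_44x, vcpu);
        }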
    • KVM: ppc: Rename "struct tlbe" to "struct kvmppc_44x_tlbe" · 0f55dc48
      Committed by Hollis Blanchard
      This will ease ports to other cores.
      
      Also remove unused "struct kvm_tlb" while we're at it.
      Signed-off-by: Hollis Blanchard <hollisb@us.ibm.com>
      Signed-off-by: Avi Kivity <avi@redhat.com>
  9. 29 Dec 2008, 1 commit
  10. 21 Dec 2008, 1 commit
  11. 06 Nov 2008, 1 commit
    • powerpc: Improve resolution of VDSO clock_gettime · 597bc5c0
      Committed by Paul Mackerras
      Currently the clock_gettime implementation in the VDSO produces a
      result with microsecond resolution for the cases that are handled
      without a system call, i.e. CLOCK_REALTIME and CLOCK_MONOTONIC.  The
      nanoseconds field of the result is obtained by computing a
      microseconds value and multiplying by 1000.
      
      This changes the code in the VDSO to do the computation for
      clock_gettime with nanosecond resolution.  That means that the
      resolution of the result will ultimately depend on the timebase
      frequency.
      
      Because the timestamp in the VDSO datapage (stamp_xsec, the real time
      corresponding to the timebase count in tb_orig_stamp) is in units of
      2^-20 seconds, it doesn't have sufficient resolution for computing a
      result with nanosecond resolution.  Therefore this adds a copy of
      xtime to the VDSO datapage and updates it in update_gtod() along with
      the other time-related fields.
      Signed-off-by: Paul Mackerras <paulus@samba.org>
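      A user-space-style sketch of the nanosecond computation described above;
      the field names are illustrative, not the actual VDSO datapage layout:

        #include <stdint.h>

        struct vdso_snap {
                uint64_t tb_orig_stamp;    /* timebase count at last update_gtod() */
                uint64_t tb_to_ns_scale;   /* ~ (10^9 << shift) / timebase frequency */
                uint32_t shift;
                uint64_t stamp_sec;        /* xtime seconds at tb_orig_stamp */
                uint64_t stamp_nsec;       /* xtime nanoseconds at tb_orig_stamp */
        };

        /* Convert a timebase reading straight to sec/nsec with no microsecond
         * intermediate, so resolution is limited only by the timebase frequency. */
        static void tb_to_timespec(const struct vdso_snap *s, uint64_t tb_now,
                                   uint64_t *sec, uint64_t *nsec)
        {
                uint64_t ns = s->stamp_nsec +
                        (((tb_now - s->tb_orig_stamp) * s->tb_to_ns_scale) >> s->shift);

                *sec  = s->stamp_sec + ns / 1000000000ULL;
                *nsec = ns % 1000000000ULL;
        }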
  12. 15 Oct 2008, 3 commits
  13. 25 Sep 2008, 1 commit
    • POWERPC: Allow 32-bit hashed pgtable code to support 36-bit physical · 4ee7084e
      Committed by Becky Bruce
      This rearranges a bit of code, and adds support for
      36-bit physical addressing for configs that use a
      hashed page table.  The 36b physical support is not
      enabled by default on any config - it must be
      explicitly enabled via the config system.
      
      This patch *only* expands the page table code to accommodate
      large physical addresses on 32-bit systems and enables the
      PHYS_64BIT config option for 86xx.  It does *not*
      allow you to boot a board with more than about 3.5GB of
      RAM - for that, SWIOTLB support is also required (and
      coming soon).
      Signed-off-by: Becky Bruce <becky.bruce@freescale.com>
      Signed-off-by: Kumar Gala <galak@kernel.crashing.org>
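      The widening involved can be sketched as follows, assuming the PHYS_64BIT
      option named above; illustrative, not the exact type definitions from the
      patch:

        /* On 32-bit, a 36-bit physical address no longer fits in unsigned long,
         * so a wider type is used when the option is enabled. */
        #ifdef CONFIG_PHYS_64BIT
        typedef unsigned long long phys_addr_sketch_t;
        #else
        typedef unsigned long phys_addr_sketch_t;
        #endif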
  14. 16 Sep 2008, 1 commit
    • powerpc: Make it possible to move the interrupt handlers away from the kernel · 1f6a93e4
      Committed by Paul Mackerras
      This changes the way that the exception prologs transfer control to
      the handlers in 64-bit kernels with the aim of making it possible to
      have the prologs separate from the main body of the kernel.  Now,
      instead of computing the address of the handler by taking the top
      32 bits of the paca address (to get the 0xc0000000........ part) and
      ORing in something in the bottom 16 bits, we get the base address of
      the kernel by doing a load from the paca and adding an offset.
      
      This also replaces an mfmsr and an ori to compute the MSR value for
      the handler with a load from the paca.  That makes it unnecessary to
      have a separate version of EXCEPTION_PROLOG_PSERIES that forces 64-bit
      mode.
      
      We can no longer use direct branches in the exception prolog code,
      which means that the SLB miss handlers can't branch directly to
      .slb_miss_realmode any more.  Instead we have to compute the address
      and do an indirect branch.  This is conditional on CONFIG_RELOCATABLE;
      for non-relocatable kernels we use a direct branch as before.  (A later
      change will allow CONFIG_RELOCATABLE to be set on 64-bit powerpc.)
      
      Since the secondary CPUs on pSeries start execution in the first 0x100
      bytes of real memory and then have to get to wherever the kernel is,
      we can't use a direct branch to get there.  Instead this changes
      __secondary_hold_spinloop from a flag to a function pointer.  When it
      is set to a non-NULL value, the secondary CPUs jump to the function
      pointed to by that value.
      
      Finally this eliminates one code difference between 32-bit and 64-bit
      by making __secondary_hold be the text address of the secondary CPU
      spinloop rather than a function descriptor for it.
      Signed-off-by: Paul Mackerras <paulus@samba.org>
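      The __secondary_hold_spinloop change described above (a flag becomes an
      entry-point pointer) can be sketched in C as follows; the real code is
      split between C and head_64.S, and these names are illustrative:

        /* Published by the boot CPU once it knows where the kernel ended up;
         * NULL means "keep spinning". */
        static void (* volatile secondary_hold_entry)(void);

        /* Secondary CPUs park here after starting in the first 0x100 bytes of
         * real memory; since the kernel may have been relocated, they cannot
         * use a direct branch and instead jump through the pointer. */
        static void secondary_hold_wait(void)
        {
                while (secondary_hold_entry == NULL)
                        ;       /* spin until the boot CPU sets the entry point */
                secondary_hold_entry();
        }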
  15. 01 Jul 2008, 1 commit
    • powerpc: Introduce VSX thread_struct and CONFIG_VSX · c6e6771b
      Committed by Michael Neuling
      The layout of the new VSR registers and how they overlap on top of the
      legacy FPR and VR registers is:
      
                         VSR doubleword 0               VSR doubleword 1
                ----------------------------------------------------------------
        VSR[0]  |             FPR[0]            |                              |
                ----------------------------------------------------------------
        VSR[1]  |             FPR[1]            |                              |
                ----------------------------------------------------------------
                |              ...              |                              |
                |              ...              |                              |
                ----------------------------------------------------------------
        VSR[30] |             FPR[30]           |                              |
                ----------------------------------------------------------------
        VSR[31] |             FPR[31]           |                              |
                ----------------------------------------------------------------
        VSR[32] |                             VR[0]                            |
                ----------------------------------------------------------------
        VSR[33] |                             VR[1]                            |
                ----------------------------------------------------------------
                |                              ...                             |
                |                              ...                             |
                ----------------------------------------------------------------
        VSR[62] |                             VR[30]                           |
                ----------------------------------------------------------------
        VSR[63] |                             VR[31]                           |
                ----------------------------------------------------------------
      
      VSX has 64 128-bit registers.  The first 32 regs overlap with the FP
      registers and hence extend them with an additional 64 bits.  The
      second 32 regs overlap with the VMX registers.
      
      This commit introduces the thread_struct changes required to reflect
      this register layout.  Ptrace and signals code is updated so that the
      floating point registers are correctly accessed from the thread_struct
      when CONFIG_VSX is enabled.
      Signed-off-by: Michael Neuling <mikey@neuling.org>
      Signed-off-by: Paul Mackerras <paulus@samba.org>
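      One way to picture the thread_struct side of this overlap is the sketch
      below; a minimal illustration assuming a doubled-width FPR array, not the
      literal layout from the commit:

        /* fpr[i][0] is the classic FPR i (VSR i doubleword 0); fpr[i][1] holds
         * VSR i doubleword 1 and only exists when VSX is configured.  VSR 32-63
         * alias the existing VMX/VR registers. */
        #ifdef CONFIG_VSX
        #define TS_FPRWIDTH 2
        #else
        #define TS_FPRWIDTH 1
        #endif

        struct fp_vsx_state_sketch {
                double fpr[32][TS_FPRWIDTH];
        };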
  16. 03 Jun 2008, 1 commit
    • [POWERPC] 40x/Book-E: Save/restore volatile exception registers · fca622c5
      Committed by Kumar Gala
      On machines with more than one exception level any system register that
      might be modified by the "normal" exception level needs to be saved and
      restored on taking a higher-level exception.  We are already saving
      and restoring ESR and DEAR.
      
      For critical level add SRR0/1.
      For debug level add CSRR0/1 and SRR0/1.
      For machine check level add DSRR0/1, CSRR0/1, and SRR0/1.
      
      On FSL Book-E parts we always save/restore the MAS registers for critical,
      debug, and machine check level exceptions.  On 44x we always save/restore
      the MMUCR.
      
      Additionally, we save and restore the ksp_limit since we have to adjust it
      for each exception level.
      Signed-off-by: Kumar Gala <galak@kernel.crashing.org>
      Acked-by: Paul Mackerras <paulus@samba.org>
  17. 29 Apr 2008, 2 commits
  18. 27 Apr 2008, 1 commit
  19. 24 Apr 2008, 1 commit
  20. 15 Apr 2008, 1 commit
  21. 26 Mar 2008, 1 commit
  22. 08 Feb 2008, 1 commit
  23. 06 Feb 2008, 1 commit
  24. 07 Dec 2007, 1 commit
  25. 20 Nov 2007, 1 commit
  26. 19 Oct 2007, 1 commit
    • powerpc: add scaled time accounting · 4603ac18
      Committed by Michael Neuling
      This adds POWERPC specific hooks for scaled time accounting.
      
      POWER6 includes a SPURR register.  The SPURR is based off the PURR register
      but is scaled based on CPU frequency and issue rates.  This gives a more
      accurate account of the instructions used per task.  The PURR and timebase
      will be constant relative to the wall clock, irrespective of the CPU
      frequency.
      
      This implementation reads the SPURR register in account_system_vtime, which
      is only called on context switch and on hard and soft irq entry and exit.
      The scaled user and system time is then estimated by applying the ratio of
      SPURR to PURR to the time accounted by the PURR.  If the SPURR is not
      present, the PURR is read instead.
      
      An earlier implementation of this patch read the SPURR whenever the PURR
      was read, which included the system call entry and exit path.
      Unfortunately this showed a performance regression on lmbench runs, so it
      was re-implemented.
      
      I've included the lmbench results here when run bare-metal on POWER6.  The 1st
      column is the unpatched results, the 2nd column is the results using the below
      patch, and the 3rd is the % diff of these results from the base.  The 4th and
      5th columns are the results and % difference from the base using the older
      patch (SPURR read in the syscall entry/exit path).
      
                                    Base        Scaled-Acct     SPURR-in-syscall
                                   Result      Result  % diff    Result % diff
      Simple syscall:              0.3086      0.3086  0.0000    0.3452 11.8600
      Simple read:                 0.4591      0.4671  1.7425    0.5044 9.86713
      Simple write:                0.4364      0.4366  0.0458    0.4731 8.40971
      Simple stat:                 2.0055      2.0295  1.1967    2.0669 3.06158
      Simple fstat:                0.5962      0.5876  -1.442    0.6368 6.80979
      Simple open/close:           3.1283      3.1009  -0.875    3.2088 2.57328
      Select on 10 fd's:           0.8554      0.8457  -1.133    0.8667 1.32101
      Select on 100 fd's:          3.5292      3.6329  2.9383    3.6664 3.88756
      Select on 250 fd's:          7.9097      8.1881  3.5197    8.2242 3.97613
      Select on 500 fd's:          15.2659     15.836  3.7357    15.873 3.97814
      Select on 10 tcp fd's:       0.9576      0.9416  -1.670    0.9752 1.83792
      Select on 100 tcp fd's:      7.248       7.2254  -0.311    7.2685 0.28283
      Select on 250 tcp fd's:      17.7742     17.707  -0.375    17.749 -0.1406
      Select on 500 tcp fd's:      35.4258     35.25   -0.496    35.286 -0.3929
      Signal handler installation: 0.6131      0.6075  -0.913    0.647  5.52927
      Signal handler overhead:     2.0919      2.1078  0.7600    2.1831 4.35967
      Protection fault:            0.7345      0.7478  1.8107    0.8031 9.33968
      Pipe latency:                33.006      16.398  -50.31    33.475 1.42368
      AF_UNIX sock stream latency: 14.5093     30.910  113.03    30.715 111.692
      Process fork+exit:           219.8       222.8   1.3648    229.37 4.35623
      Process fork+execve:         876.14      873.28  -0.32     868.66 -0.8533
      Process fork+/bin/sh -c:     2830        2876.5  1.6431    2958   4.52296
      File /var/tmp/XXX write bw:  1193497     1195536 0.1708    118657 -0.5799
      Pagefaults on /var/tmp/XXX:  3.1272      3.2117  2.7020    3.2521 3.99398
      
      Also, kernel compile times show no difference with this patch applied.
      
      [pbadari@us.ibm.com: Avoid unnecessary PURR reading]
      Signed-off-by: Michael Neuling <mikey@neuling.org>
      Cc: Balbir Singh <balbir@in.ibm.com>
      Cc: Jay Lan <jlan@engr.sgi.com>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Signed-off-by: Badari Pulavarty <pbadari@us.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
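      The PURR/SPURR delta computation described above can be sketched as
      follows; helper and field names are invented, and this is not the code
      from arch/powerpc/kernel/time.c:

        #include <linux/types.h>
        #include <asm/reg.h>    /* mfspr(), SPRN_PURR, SPRN_SPURR */

        /* Per-CPU snapshot of the last PURR/SPURR readings (illustrative). */
        struct vtime_snap {
                u64 purr;
                u64 spurr;
        };

        /* Compute the raw and frequency-scaled CPU-time deltas since the last
         * snapshot.  Without a SPURR, the PURR value is used for both, so the
         * scaled time degenerates to ordinary PURR accounting. */
        static void vtime_deltas(struct vtime_snap *snap, int have_spurr,
                                 u64 *delta, u64 *delta_scaled)
        {
                u64 purr  = mfspr(SPRN_PURR);
                u64 spurr = have_spurr ? mfspr(SPRN_SPURR) : purr;

                *delta        = purr  - snap->purr;
                *delta_scaled = spurr - snap->spurr;

                snap->purr  = purr;
                snap->spurr = spurr;
        }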
  27. 19 Sep 2007, 1 commit
  28. 22 Aug 2007, 1 commit
  29. 10 May 2007, 1 commit
    • rename thread_info to stack · f7e4217b
      Committed by Roman Zippel
      This finally renames the thread_info field in task structure to stack, so that
      the assumptions about this field are gone and archs have more freedom about
      placing the thread_info structure.
      
      Non-broken archs which have a proper thread pointer can access both the
      current thread_info and the task structure via a single pointer.
      
      It'll allow for a few more cleanups of the fork code, from which e.g.  ia64
      could benefit.
      Signed-off-by: Roman Zippel <zippel@linux-m68k.org>
      [akpm@linux-foundation.org: build fix]
      Cc: Richard Henderson <rth@twiddle.net>
      Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
      Cc: Russell King <rmk@arm.linux.org.uk>
      Cc: Ian Molton <spyro@f2s.com>
      Cc: Haavard Skinnemoen <hskinnemoen@atmel.com>
      Cc: Mikael Starvik <starvik@axis.com>
      Cc: David Howells <dhowells@redhat.com>
      Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
      Cc: "Luck, Tony" <tony.luck@intel.com>
      Cc: Hirokazu Takata <takata@linux-m32r.org>
      Cc: Geert Uytterhoeven <geert@linux-m68k.org>
      Cc: Roman Zippel <zippel@linux-m68k.org>
      Cc: Greg Ungerer <gerg@uclinux.org>
      Cc: Ralf Baechle <ralf@linux-mips.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Paul Mundt <lethal@linux-sh.org>
      Cc: Kazumoto Kojima <kkojima@rr.iij4u.or.jp>
      Cc: Richard Curnow <rc@rc0.org.uk>
      Cc: William Lee Irwin III <wli@holomorphy.com>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Jeff Dike <jdike@addtoit.com>
      Cc: Paolo 'Blaisorblade' Giarrusso <blaisorblade@yahoo.it>
      Cc: Miles Bader <uclinux-v850@lsi.nec.co.jp>
      Cc: Andi Kleen <ak@muc.de>
      Cc: Chris Zankel <chris@zankel.net>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
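      After the rename, generic code reaches the thread_info through an accessor
      along these lines; a sketch of the resulting pattern rather than a quote
      of the patch:

        /* With the field renamed from tsk->thread_info to tsk->stack, generic
         * code recovers the thread_info with a cast, and archs with a proper
         * thread pointer are free to place thread_info somewhere else. */
        #define task_thread_info(task)  ((struct thread_info *)(task)->stack)
        #define task_stack_page(task)   ((task)->stack)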
  30. 09 May 2007, 1 commit
    • [POWERPC] Introduce address space "slices" · d0f13e3c
      Committed by Benjamin Herrenschmidt
      The basic issue is to be able to do what hugetlbfs does but with
      different page sizes for some other special filesystems; more
      specifically, my need is:
      
       - Huge pages
      
       - SPE local store mappings using 64K pages on a 4K base page size
      kernel on Cell
      
       - Some special 4K segments in 64K-page kernels for mapping a dodgy
      type of powerpc-specific infiniband hardware that requires 4K MMU
      mappings for various reasons I won't explain here.
      
      The main issues are:
      
       - To maintain/keep track of the page size per "segment" (as we can
      only have one page size per segment on powerpc, which are 256MB
      divisions of the address space).
      
       - To make sure special mappings stay within their allotted
      "segments" (including MAP_FIXED crap)
      
       - To make sure everybody else doesn't mmap/brk/grow_stack into a
      "segment" that is used for a special mapping
      
      Some of the necessary mechanisms to handle that were present in the
      hugetlbfs code, but mostly in ways not suitable for anything else.
      
      The patch relies on some changes to the generic get_unmapped_area()
      that just got merged.  It still hijacks hugetlb callbacks here or
      there as the generic code hasn't been entirely cleaned up yet but
      that shouldn't be a problem.
      
      So what is a slice?  Well, I re-used the mechanism used formerly by our
      hugetlbfs implementation which divides the address space in
      "meta-segments" which I called "slices".  The division is done using
      256MB slices below 4G, and 1T slices above.  Thus the address space is
      divided currently into 16 "low" slices and 16 "high" slices.  (Special
      case: high slice 0 is the area between 4G and 1T).
      
      Doing so simplifies significantly the tracking of segments and avoids
      having to keep track of all the 256MB segments in the address space.
      
      While I used the "concepts" of hugetlbfs, I mostly re-implemented
      everything in a more generic way and "ported" hugetlbfs to it.
      
      Slices can have an associated page size, which is encoded in the mmu
      context and used by the SLB miss handler to set the segment sizes.  The
      hash code currently doesn't care, it has a specific check for hugepages,
      though I might add a mechanism to provide per-slice hash mapping
      functions in the future.
      
      The slice code provides a pair of "generic" get_unmapped_area() (bottom-up
      and top-down) functions that should work with any slice size.  There is
      some trickiness here, so I would appreciate people having a look at the
      implementation of these and letting me know if I got something wrong.  (A
      sketch of the slice indexing follows this entry.)
      Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Signed-off-by: Paul Mackerras <paulus@samba.org>
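      The address-to-slice mapping described above (16 low 256MB slices below
      4G, 16 1T high slices above) can be sketched as follows; the constants
      and helper name are illustrative:

        #define SLICE_LOW_SHIFT   28    /* 256MB = 2^28 */
        #define SLICE_HIGH_SHIFT  40    /* 1TB   = 2^40 */

        static inline unsigned int addr_to_slice(unsigned long addr)
        {
                if (addr < (1UL << 32))         /* below 4G: low slices 0..15 */
                        return addr >> SLICE_LOW_SHIFT;
                /* above 4G: high slices; note high slice 0 covers 4G up to 1T */
                return addr >> SLICE_HIGH_SHIFT;
        }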
  31. 07 May 2007, 1 commit
  32. 24 Apr 2007, 1 commit
  33. 22 Mar 2007, 1 commit