1. 03 November 2014 (1 commit)
    • powerpc: Replace __get_cpu_var uses · 69111bac
      Authored by Christoph Lameter
      This still has not been merged and now powerpc is the only arch that does
      not have this change. Sorry about missing linuxppc-dev before.
      
      V1->V2
        - Fix up to work against 3.18-rc1
      
      __get_cpu_var() is used for multiple purposes in the kernel source. One of
      them is address calculation via the form &__get_cpu_var(x).  This calculates
      the address for the instance of the percpu variable of the current processor
      based on an offset.
      
      Other use cases are for storing and retrieving data from the current
      processor's percpu area.  __get_cpu_var() can be used as an lvalue when
      writing data or on the right side of an assignment.
      
      __get_cpu_var() is defined as:
      
      #define __get_cpu_var(var) (*this_cpu_ptr(&(var)))
      
      __get_cpu_var() only ever performs an address calculation. However, store
      and retrieve operations could use a segment prefix (or a global register on
      other platforms) to avoid the address calculation.
      
      this_cpu_write() and this_cpu_read() can directly take an offset into a
      percpu area and use optimized assembly code to read and write per cpu
      variables.
      
      This patch converts __get_cpu_var into either an explicit address
      calculation using this_cpu_ptr() or into a use of this_cpu operations that
      use the offset.  This avoids the address calculation, and fewer registers
      are needed in the generated code.
      
      At the end of the patch set all uses of __get_cpu_var have been removed so
      the macro is removed too.
      
      The patch set includes passes over all arches as well. Once these operations
      are used throughout, specialized macros can be defined in non-x86 arches as
      well, in order to optimize per cpu access, e.g. by using a global register
      that may be set to the per cpu base.
      
      Transformations done to __get_cpu_var()
      
      1. Determine the address of the percpu instance of the current processor.
      
      	DEFINE_PER_CPU(int, y);
      	int *x = &__get_cpu_var(y);
      
          Converts to
      
      	int *x = this_cpu_ptr(&y);
      
      2. Same as #1 but this time an array structure is involved.
      
      	DEFINE_PER_CPU(int, y[20]);
      	int *x = __get_cpu_var(y);
      
          Converts to
      
      	int *x = this_cpu_ptr(y);
      
      3. Retrieve the content of the current processor's instance of a per cpu
      variable.
      
      	DEFINE_PER_CPU(int, y);
      	int x = __get_cpu_var(y);
      
         Converts to
      
      	int x = __this_cpu_read(y);
      
      4. Retrieve the content of a percpu struct
      
      	DEFINE_PER_CPU(struct mystruct, y);
      	struct mystruct x = __get_cpu_var(y);
      
         Converts to
      
      	memcpy(&x, this_cpu_ptr(&y), sizeof(x));
      
      5. Assignment to a per cpu variable
      
      	DEFINE_PER_CPU(int, y);
      	__get_cpu_var(y) = x;
      
         Converts to
      
      	__this_cpu_write(y, x);
      
      6. Increment/decrement etc. of a per cpu variable
      
      	DEFINE_PER_CPU(int, y);
      	__get_cpu_var(y)++
      
         Converts to
      
      	__this_cpu_inc(y)
      
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Signed-off-by: Christoph Lameter <cl@linux.com>
      [mpe: Fix build errors caused by set/or_softirq_pending(), and rework
            assignment in __set_breakpoint() to use memcpy().]
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
  2. 25 September 2014 (2 commits)
  3. 27 August 2014 (2 commits)
    • Revert "powerpc: Replace __get_cpu_var uses" · 23f66e2d
      Authored by Tejun Heo
      This reverts commit 5828f666 due to
      build failure after merging with pending powerpc changes.
      
      Link: http://lkml.kernel.org/g/20140827142243.6277eaff@canb.auug.org.au
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
      Cc: Christoph Lameter <cl@linux-foundation.org>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
    • powerpc: Replace __get_cpu_var uses · 5828f666
      Authored by Christoph Lameter
      __get_cpu_var() is used for multiple purposes in the kernel source. One of
      them is address calculation via the form &__get_cpu_var(x).  This calculates
      the address for the instance of the percpu variable of the current processor
      based on an offset.
      
      Other use cases are for storing and retrieving data from the current
      processor's percpu area.  __get_cpu_var() can be used as an lvalue when
      writing data or on the right side of an assignment.
      
      __get_cpu_var() is defined as:
      
      #define __get_cpu_var(var) (*this_cpu_ptr(&(var)))
      
      __get_cpu_var() only ever performs an address calculation. However, store
      and retrieve operations could use a segment prefix (or a global register on
      other platforms) to avoid the address calculation.
      
      this_cpu_write() and this_cpu_read() can directly take an offset into a
      percpu area and use optimized assembly code to read and write per cpu
      variables.
      
      This patch converts __get_cpu_var into either an explicit address
      calculation using this_cpu_ptr() or into a use of this_cpu operations that
      use the offset.  This avoids the address calculation, and fewer registers
      are needed in the generated code.
      
      At the end of the patch set all uses of __get_cpu_var have been removed so
      the macro is removed too.
      
      The patch set includes passes over all arches as well. Once these operations
      are used throughout, specialized macros can be defined in non-x86 arches as
      well, in order to optimize per cpu access, e.g. by using a global register
      that may be set to the per cpu base.
      
      Transformations done to __get_cpu_var()
      
      1. Determine the address of the percpu instance of the current processor.
      
      	DEFINE_PER_CPU(int, y);
      	int *x = &__get_cpu_var(y);
      
          Converts to
      
      	int *x = this_cpu_ptr(&y);
      
      2. Same as #1 but this time an array structure is involved.
      
      	DEFINE_PER_CPU(int, y[20]);
      	int *x = __get_cpu_var(y);
      
          Converts to
      
      	int *x = this_cpu_ptr(y);
      
      3. Retrieve the content of the current processor's instance of a per cpu
      variable.
      
      	DEFINE_PER_CPU(int, y);
      	int x = __get_cpu_var(y);
      
         Converts to
      
      	int x = __this_cpu_read(y);
      
      4. Retrieve the content of a percpu struct
      
      	DEFINE_PER_CPU(struct mystruct, y);
      	struct mystruct x = __get_cpu_var(y);
      
         Converts to
      
      	memcpy(&x, this_cpu_ptr(&y), sizeof(x));
      
      5. Assignment to a per cpu variable
      
      	DEFINE_PER_CPU(int, y);
      	__get_cpu_var(y) = x;
      
         Converts to
      
      	__this_cpu_write(y, x);
      
      6. Increment/decrement etc. of a per cpu variable
      
      	DEFINE_PER_CPU(int, y);
      	__get_cpu_var(y)++
      
         Converts to
      
      	__this_cpu_inc(y)
      
      tj: Folded a fix patch.
          http://lkml.kernel.org/g/alpine.DEB.2.11.1408172143020.9652@gentwo.org
      
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Signed-off-by: Christoph Lameter <cl@linux.com>
      Signed-off-by: Tejun Heo <tj@kernel.org>
  4. 24 July 2014 (1 commit)
  5. 11 June 2014 (1 commit)
  6. 12 May 2014 (1 commit)
    • powerpc: irq work racing with timer interrupt can result in timer interrupt hang · 8050936c
      Authored by Anton Blanchard
      I am seeing an issue where a CPU running perf eventually hangs.
      Traces show timer interrupts happening every 4 seconds even
      when a userspace task is running on the CPU. /proc/timer_list
      also shows pending hrtimers have not run in over an hour,
      including the scheduler.
      
      Looking closer, decrementers_next_tb is getting set to
      0xffffffffffffffff, and at that point we will never take
      a timer interrupt again.
      
      In __timer_interrupt() we set decrementers_next_tb to
      0xffffffffffffffff and rely on ->event_handler to update it:
      
              *next_tb = ~(u64)0;
              if (evt->event_handler)
                      evt->event_handler(evt);
      
      In this case ->event_handler is hrtimer_interrupt. This will eventually
      call back through the clockevents code with the next event to be
      programmed:
      
      static int decrementer_set_next_event(unsigned long evt,
                                            struct clock_event_device *dev)
      {
              /* Don't adjust the decrementer if some irq work is pending */
              if (test_irq_work_pending())
                      return 0;
              __get_cpu_var(decrementers_next_tb) = get_tb_or_rtc() + evt;
      
      If irq work came in between these two points, we will return
      before updating decrementers_next_tb and we never process a timer
      interrupt again.
      
      This looks to have been introduced by 0215f7d8 (powerpc: Fix races
      with irq_work). Fix it by removing the early exit and relying on
      code later on in the function to force an early decrementer:
      
             /* We may have raced with new irq work */
             if (test_irq_work_pending())
                     set_dec(1);
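      
      Putting the two quoted snippets together, the fixed function plausibly
      looks like this (a sketch reconstructed from the excerpts above, not the
      verbatim patch):
      
      static int decrementer_set_next_event(unsigned long evt,
                                            struct clock_event_device *dev)
      {
              /* Early bail-out removed: always publish the next deadline */
              __get_cpu_var(decrementers_next_tb) = get_tb_or_rtc() + evt;
              set_dec(evt);
      
              /* We may have raced with new irq work */
              if (test_irq_work_pending())
                      set_dec(1);
      
              return 0;
      }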
      Signed-off-by: Anton Blanchard <anton@samba.org>
      Cc: stable@vger.kernel.org # 3.14+
      Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
  7. 05 March 2014 (3 commits)
  8. 15 January 2014 (1 commit)
    • powerpc: Fix races with irq_work · 0215f7d8
      Authored by Benjamin Herrenschmidt
      If we set irq_work on a processor and then, before that irq work has had
      a chance to be processed, change the decrementer value, we can seriously
      delay the handling of that irq_work.
      
      Fix it by checking for pending irq work in a few places: before changing
      the decrementer in decrementer_set_next_event(), and after changing it,
      both in that function and in timer_interrupt(), as sketched below.
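      
      Combined with the snippets quoted in the later fix above (commit
      8050936c), the guarded function plausibly looked like this (a sketch,
      not the verbatim patch):
      
      static int decrementer_set_next_event(unsigned long evt,
                                            struct clock_event_device *dev)
      {
              /* Don't adjust the decrementer if some irq work is pending */
              if (test_irq_work_pending())
                      return 0;
      
              __get_cpu_var(decrementers_next_tb) = get_tb_or_rtc() + evt;
              set_dec(evt);
      
              /* We may have raced with new irq work */
              if (test_irq_work_pending())
                      set_dec(1);
      
              return 0;
      }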
      Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
  9. 02 December 2013 (1 commit)
  10. 21 November 2013 (1 commit)
    • powerpc/pseries: Duplicate dtl entries sometimes sent to userspace · 84b07386
      Authored by Anton Blanchard
      When reading from the dispatch trace log (dtl) userspace interface, I
      sometimes see duplicate entries. One example:
      
      # hexdump -C dtl.out
      
      00000000  07 04 00 0c 00 00 48 44  00 00 00 00 00 00 00 00
      00000010  00 0c a0 b4 16 83 6d 68  00 00 00 00 00 00 00 00
      00000020  00 00 00 00 10 00 13 50  80 00 00 00 00 00 d0 32
      
      00000030  07 04 00 0c 00 00 48 44  00 00 00 00 00 00 00 00
      00000040  00 0c a0 b4 16 83 6d 68  00 00 00 00 00 00 00 00
      00000050  00 00 00 00 10 00 13 50  80 00 00 00 00 00 d0 32
      
      The problem is in scan_dispatch_log() where we call dtl_consumer()
      but bail out before incrementing the index.
      
      To fix this I moved dtl_consumer() after the timebase comparison.
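      
      A simplified sketch of the reordering (loop details elided; the helper
      names here are hypothetical stand-ins for the real field accesses):
      
      while (i < dtl_buffer_idx()) {                  /* hypothetical */
              struct dtl_entry *dtl = &dtl_buf[i % N_DISPATCH_LOG];
      
              /* Before the fix, dtl_consumer(dtl, i) was called here, so
               * bailing out below re-sent the same entry on the next read. */
              if (dtl_entry_timebase(dtl) > stop_tb)  /* hypothetical */
                      break;
      
              /* After the fix: consume only once we know i will advance */
              dtl_consumer(dtl, i);
              ++i;
      }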
      Signed-off-by: Anton Blanchard <anton@samba.org>
      Cc: stable@vger.kernel.org
      Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
  11. 14 August 2013 (2 commits)
  12. 15 July 2013 (1 commit)
  13. 01 July 2013 (1 commit)
    • powerpc: Delete __cpuinit usage from all users · 061d19f2
      Authored by Paul Gortmaker
      The __cpuinit type of throwaway sections might have made sense
      some time ago when RAM was more constrained, but now the savings
      do not offset the cost and complications.  For example, the fix in
      commit 5e427ec2 ("x86: Fix bit corruption at CPU resume time")
      is a good example of the nasty type of bugs that can be created
      with improper use of the various __init prefixes.
      
      After a discussion on LKML[1] it was decided that cpuinit should go
      the way of devinit and be phased out.  Once all the users are gone,
      we can then finally remove the macros themselves from linux/init.h.
      
      This removes all the powerpc uses of the __cpuinit macros.  There
      are no __CPUINIT users in assembly files in powerpc.
      
      [1] https://lkml.org/lkml/2013/5/20/589
      
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Josh Boyer <jwboyer@gmail.com>
      Cc: Matt Porter <mporter@kernel.crashing.org>
      Cc: Kumar Gala <galak@kernel.crashing.org>
      Cc: linuxppc-dev@lists.ozlabs.org
      Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
      Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
  14. 18 April 2013 (1 commit)
  15. 29 January 2013 (1 commit)
  16. 28 January 2013 (2 commits)
    • kvm: Prepare to add generic guest entry/exit callbacks · c11f11fc
      Authored by Frederic Weisbecker
      Do some ground preparatory work before adding guest_enter()
      and guest_exit() context tracking callbacks. Those will
      be later used to read the guest cputime safely when we
      run in full dynticks mode.
      Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Gleb Natapov <gleb@redhat.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Li Zhong <zhong@linux.vnet.ibm.com>
      Cc: Marcelo Tosatti <mtosatti@redhat.com>
      Cc: Namhyung Kim <namhyung.kim@lge.com>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
    • cputime: Generic on-demand virtual cputime accounting · abf917cd
      Authored by Frederic Weisbecker
      If we want to stop the tick beyond the idle loop, we need to be
      able to account the cputime without using the tick.
      
      Virtual based cputime accounting solves that problem by
      hooking into kernel/user boundaries.
      
      However, implementing CONFIG_VIRT_CPU_ACCOUNTING requires
      low level hooks and involves more overhead. But we already
      have a generic context tracking subsystem that is required
      for RCU by archs which plan to shut down the tick
      outside idle.
      
      This patch implements a generic virtual based cputime
      accounting that relies on these generic kernel/user hooks.
      
      There are some upsides of doing this:
      
      - This requires no arch code to implement CONFIG_VIRT_CPU_ACCOUNTING
      if context tracking is already built (already necessary for RCU in full
      tickless mode).
      
      - We can rely on the generic context tracking subsystem to dynamically
      (de)activate the hooks, so that we can switch anytime between virtual
      and tick based accounting. This way we don't have the overhead
      of the virtual accounting when the tick is running periodically.
      
      And one downside:
      
      - There is probably more overhead than a native virtual based cputime
      accounting. But this relies on hooks that are already set anyway.
      Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Li Zhong <zhong@linux.vnet.ibm.com>
      Cc: Namhyung Kim <namhyung.kim@lge.com>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
  17. 16 January 2013 (1 commit)
    • NTP: Add a CONFIG_RTC_SYSTOHC configuration · 023f333a
      Authored by Jason Gunthorpe
      The purpose of this option is to allow ARM/etc systems that rely on the
      class RTC subsystem to have the same kind of automatic NTP based
      synchronization that we have on PC platforms. Today ARM does not
      implement update_persistent_clock and makes extensive use of the class
      RTC system.
      
      When enabled, CONFIG_RTC_SYSTOHC will provide a generic
      rtc_update_persistent_clock that stores the current time in the RTC and
      is intended to complement the existing CONFIG_RTC_HCTOSYS option that
      loads the RTC at boot.
      
      Like with RTC_HCTOSYS the platform's update_persistent_clock is used
      first, if it works. Platforms with mixed class RTC and non-RTC drivers
      need to return ENODEV when class RTC should be used. Such an update for
      PPC is included in this patch.
      
      Long term, implementations of update_persistent_clock should migrate to
      proper class RTC drivers and use CONFIG_RTC_SYSTOHC instead.
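      
      A minimal sketch of the mixed-driver pattern described above;
      machine_has_direct_rtc() and direct_rtc_set_time() are hypothetical
      stand-ins for a platform's own logic:
      
      int update_persistent_clock(struct timespec now)
      {
              /* If this clock is served by a class RTC driver, defer to
               * CONFIG_RTC_SYSTOHC by returning an error. */
              if (!machine_has_direct_rtc())          /* hypothetical */
                      return -ENODEV;
      
              return direct_rtc_set_time(now);        /* hypothetical */
      }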
      
      Tested on ARM kirkwood and PPC405
      Signed-off-by: Jason Gunthorpe <jgunthorpe@obsidianresearch.com>
      Signed-off-by: John Stultz <john.stultz@linaro.org>
  18. 03 January 2013 (1 commit)
    • powerpc/vdso: Remove redundant locking in update_vsyscall_tz() · ce73ec6d
      Authored by Shan Hai
      The locking in update_vsyscall_tz() is unnecessary because the vdso code
      copies the data unprotected in __kernel_gettimeofday(). It is also
      erroneous: updating tb_update_count is not atomic, which introduces a
      hard to reproduce race condition between update_vsyscall() and
      update_vsyscall_tz() that causes a user space process to loop forever
      in vdso code.
      
      The following patch removes the locking from update_vsyscall_tz().
      
      The below scenario describes the race condition,
      x==0	Boot CPU			other CPU
      	proc_P: x==0
      	    timer interrupt
      		update_vsyscall
      x==1		    x++;sync		settimeofday
      					    update_vsyscall_tz
      x==2						x++;sync
      x==3		    sync;x++
      						sync;x++
      	proc_P: x==3 (loops until x becomes even)
      
      This happens because the ++ operator is implemented as three instructions
      on powerpc and is therefore not atomic.
      
      A similar change was made for x86 in commit 6c260d58
      ("x86: vdso: Remove bogus locking in update_vsyscall_tz")
      Signed-off-by: Shan Hai <shan.hai@windriver.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
  19. 20 November 2012 (1 commit)
    • vtime: Warn if irqs aren't disabled on system time accounting APIs · 1b2852b1
      Authored by Frederic Weisbecker
      System time accounting APIs such as vtime_account_system() and
      vtime_account_idle() need to be irqsafe. Current callers include
      irq entry, exit and kvm, all of which have been checked against that
      requirement. Now it's better to back that up with an automatic check,
      in case we grow further callers or have missed something.
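      
      The guard presumably takes this shape (a sketch; exactly which functions
      received it is assumed from the surrounding commits):
      
      void vtime_account_system(struct task_struct *tsk)
      {
              /* These APIs must run with irqs disabled */
              WARN_ON_ONCE(!irqs_disabled());
      
              /* ... existing system time accounting ... */
      }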
      Suggested-by: Steven Rostedt <rostedt@goodmis.org>
      Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: Fenghua Yu <fenghua.yu@intel.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
  20. 19 November 2012 (3 commits)
    • vtime: Consolidate a bit the ctx switch code · e3942ba0
      Authored by Frederic Weisbecker
      On ia64 and powerpc, the vtime context switch consists only of
      flushing pending system and user time, plus a bit of arch
      housekeeping.
      
      Consolidate that into a generic implementation. s390 is
      a special case because pending user and system time accounting
      there is hard to dissociate, so it keeps its own implementation.
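      
      A sketch of what the consolidated generic helper could look like under
      that description (arch_vtime_task_switch() is an assumed name for the
      per-arch housekeeping hook):
      
      void vtime_task_switch(struct task_struct *prev)
      {
              /* Flush the outgoing task's pending system and user time */
              vtime_account(prev);
      
              /* Then let the arch do its remaining housekeeping */
              arch_vtime_task_switch(prev);
      }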
      Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
      Reviewed-by: Steven Rostedt <rostedt@goodmis.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: Fenghua Yu <fenghua.yu@intel.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
    • vtime: Explicitly account pending user time on process tick · bcebdf84
      Authored by Frederic Weisbecker
      All vtime implementations just flush the user time on process
      tick. Consolidate that in generic code by calling a user time
      accounting helper. This avoids an indirect call in ia64 and prepares for
      also consolidating the vtime context switch code.
      Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
      Reviewed-by: Steven Rostedt <rostedt@goodmis.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: Fenghua Yu <fenghua.yu@intel.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
    • vtime: Remove the underscore prefix invasion · fd25b4c2
      Authored by Frederic Weisbecker
      Prepending irq-unsafe vtime APIs with underscores was actually
      a bad idea: the result is a big mess in an API namespace that
      is only waiting to be further extended. Also, these helpers
      are always called from irq-safe callers, except in kvm. Just
      provide a vtime_account_system_irqsafe() for this specific
      case, so that we can remove the underscore prefix from the
      other vtime functions, as in the sketch below.
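      
      A sketch of the single irqsafe wrapper kept for kvm, assuming it simply
      brackets the irq-unsafe helper:
      
      void vtime_account_system_irqsafe(struct task_struct *tsk)
      {
              unsigned long flags;
      
              /* kvm may call in with irqs enabled; make it safe here */
              local_irq_save(flags);
              vtime_account_system(tsk);
              local_irq_restore(flags);
      }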
      Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
      Reviewed-by: Steven Rostedt <rostedt@goodmis.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: Fenghua Yu <fenghua.yu@intel.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
  21. 30 October 2012 (1 commit)
    • vtime: Make vtime_account_system() irqsafe · 11113334
      Authored by Frederic Weisbecker
      vtime_account_system() currently has only one caller, vtime_account(),
      which is irq safe.
      
      Now we are going to call it from other places like kvm where
      irqs are not always disabled by the time we account the cputime.
      
      So let's make it irqsafe. The arch implementation part is now
      prefixed with "__".
      
      vtime_account_idle() arch implementation is prefixed accordingly
      to stay consistent.
      Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: Fenghua Yu <fenghua.yu@intel.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
  22. 25 September 2012 (4 commits)
    • vtime: Consolidate system/idle context detection · a7e1a9e3
      Authored by Frederic Weisbecker
      Move the code that finds out to which context we account the
      cputime into the generic layer.
      
      Archs that consider the whole time spent in the idle task as idle
      time (ia64, powerpc) can rely on the generic vtime_account()
      and implement vtime_account_system() and vtime_account_idle(),
      letting the generic code decide when to call which API.
      
      Archs that have their own meaning of idle time, such as s390
      that only considers the time spent in CPU low power mode as idle
      time, can just override vtime_account().
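      
      Under that description, the generic decision logic plausibly looks like
      this (a sketch; archs such as s390 would override the whole function):
      
      void vtime_account(struct task_struct *tsk)
      {
              unsigned long flags;
      
              local_irq_save(flags);
      
              /* Time in the idle task counts as idle, unless we are
               * actually servicing an interrupt. */
              if (in_interrupt() || !is_idle_task(tsk))
                      vtime_account_system(tsk);
              else
                      vtime_account_idle(tsk);
      
              local_irq_restore(flags);
      }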
      Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: Fenghua Yu <fenghua.yu@intel.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
    • cputime: Use a proper subsystem naming for vtime related APIs · bf9fae9f
      Authored by Frederic Weisbecker
      Use a naming based on vtime as a prefix for virtual based
      cputime accounting APIs:
      
      - account_system_vtime() -> vtime_account()
      - account_switch_vtime() -> vtime_task_switch()
      
      It makes it easier to allow for further declension such
      as vtime_account_system(), vtime_account_idle(), ... if we
      want to find out the context we account to from generic code.
      
      This also makes it clearer which subsystem these APIs
      refer to.
      Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: Fenghua Yu <fenghua.yu@intel.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
    • time: Convert CONFIG_GENERIC_TIME_VSYSCALL to CONFIG_GENERIC_TIME_VSYSCALL_OLD · 70639421
      Authored by John Stultz
      To help migrate architectures over to the new update_vsyscall method,
      redefine CONFIG_GENERIC_TIME_VSYSCALL as CONFIG_GENERIC_TIME_VSYSCALL_OLD.
      
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Paul Turner <pjt@google.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Richard Cochran <richardcochran@gmail.com>
      Cc: Prarit Bhargava <prarit@redhat.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: John Stultz <john.stultz@linaro.org>
    • time: Move update_vsyscall definitions to timekeeper_internal.h · 189374ae
      Authored by John Stultz
      Since users will need to include timekeeper_internal.h, move
      update_vsyscall definitions to timekeeper_internal.h.
      
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Paul Turner <pjt@google.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Richard Cochran <richardcochran@gmail.com>
      Cc: Prarit Bhargava <prarit@redhat.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: John Stultz <john.stultz@linaro.org>
  23. 17 September 2012 (1 commit)
    • powerpc/trace: Fix interrupt tracepoints vs. RCU · e72bbbab
      Authored by Li Zhong
      There are a few tracepoints in the interrupt code path that fire before
      irq_enter() or after irq_exit(), such as
      trace_irq_entry()/trace_irq_exit() in do_IRQ() and
      trace_timer_interrupt_entry()/trace_timer_interrupt_exit() in
      timer_interrupt().
      
      If the interrupt arrives in idle, then because a tracepoint contains an
      RCU read-side critical section, we can see the following suspicious RCU
      usage reported:
      
      [  145.127743] ===============================
      [  145.127747] [ INFO: suspicious RCU usage. ]
      [  145.127752] 3.6.0-rc3+ #1 Not tainted
      [  145.127755] -------------------------------
      [  145.127759] /root/.workdir/linux/arch/powerpc/include/asm/trace.h:33
      suspicious rcu_dereference_check() usage!
      [  145.127765]
      [  145.127765] other info that might help us debug this:
      [  145.127765]
      [  145.127771]
      [  145.127771] RCU used illegally from idle CPU!
      [  145.127771] rcu_scheduler_active = 1, debug_locks = 0
      [  145.127777] RCU used illegally from extended quiescent state!
      [  145.127781] no locks held by swapper/0/0.
      [  145.127785]
      [  145.127785] stack backtrace:
      [  145.127789] Call Trace:
      [  145.127796] [c00000000108b530] [c000000000013c40] .show_stack
      +0x70/0x1c0 (unreliable)
      [  145.127806] [c00000000108b5e0]
      [c0000000000f59d8] .lockdep_rcu_suspicious+0x118/0x150
      [  145.127813] [c00000000108b680] [c00000000000fc58] .do_IRQ+0x498/0x500
      [  145.127820] [c00000000108b750] [c000000000003950]
      hardware_interrupt_common+0x150/0x180
      [  145.127828] --- Exception: 501 at .plpar_hcall_norets+0x84/0xd4
      [  145.127828]     LR = .check_and_cede_processor+0x38/0x70
      [  145.127836] [c00000000108bab0] [c0000000000665dc] .shared_cede_loop
      +0x5c/0x100
      [  145.127844] [c00000000108bb70] [c000000000588ab0] .cpuidle_enter
      +0x30/0x50
      [  145.127850] [c00000000108bbe0]
      [c000000000588b0c] .cpuidle_enter_state+0x3c/0xb0
      [  145.127857] [c00000000108bc60] [c000000000589730] .cpuidle_idle_call
      +0x150/0x6c0
      [  145.127863] [c00000000108bd30] [c000000000058440] .pSeries_idle
      +0x10/0x40
      [  145.127870] [c00000000108bda0] [c00000000001683c] .cpu_idle
      +0x18c/0x2d0
      [  145.127876] [c00000000108be60] [c00000000000b434] .rest_init
      +0x124/0x1b0
      [  145.127884] [c00000000108bef0] [c0000000009d0d28] .start_kernel
      +0x568/0x588
      [  145.127890] [c00000000108bf90] [c000000000009660] .start_here_common
      +0x20/0x40
      
      This is because RCU usage in interrupt context must happen inside the
      region marked by rcu_irq_enter()/rcu_irq_exit(), which are called from
      irq_enter()/irq_exit() respectively.
      
      Move the tracepoints inside the irq_enter()/irq_exit() region to avoid
      the report, as sketched below.
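      
      A simplified sketch of the resulting ordering in do_IRQ() (dispatch
      details elided):
      
      void do_IRQ(struct pt_regs *regs)
      {
              struct pt_regs *old_regs = set_irq_regs(regs);
      
              irq_enter();            /* calls rcu_irq_enter(): RCU watching */
              trace_irq_entry(regs);  /* tracepoint now inside the RCU region */
      
              /* ... find and handle the interrupt ... */
      
              trace_irq_exit(regs);
              irq_exit();             /* calls rcu_irq_exit() */
              set_irq_regs(old_regs);
      }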
      Signed-off-by: Li Zhong <zhong@linux.vnet.ibm.com>
      Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
  24. 05 September 2012 (1 commit)
    • powerpc: Give hypervisor decrementer interrupts their own handler · dabe859e
      Authored by Paul Mackerras
      At the moment the handler for hypervisor decrementer interrupts is
      the same as for decrementer interrupts, i.e. timer_interrupt().
      This is bogus; if we ever do get a hypervisor decrementer interrupt
      it won't have anything to do with the next timer event.  In fact
      the only time we get hypervisor decrementer interrupts is when one
      is left pending on exit from a KVM guest.
      
      When we get a hypervisor decrementer interrupt we don't need to do
      anything special to clear it, since they are edge-triggered on the
      transition of HDEC from 0 to -1.  Thus this adds an empty handler
      function for them.  We don't need to have them masked when interrupts
      are soft-disabled, so we use STD_EXCEPTION_HV instead of
      MASKABLE_EXCEPTION_HV.
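      
      The handler itself can therefore be empty; a sketch (the handler name is
      assumed):
      
      void hdec_interrupt(struct pt_regs *regs)
      {
              /* Nothing to clear: HDEC interrupts are edge-triggered on the
               * 0 -> -1 transition of HDEC, so simply ignore the event. */
      }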
      Signed-off-by: Paul Mackerras <paulus@samba.org>
      Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
  25. 20 August 2012 (1 commit)
    • cputime: Consolidate vtime handling on context switch · baa36046
      Authored by Frederic Weisbecker
      The archs that implement virtual cputime accounting all
      flush the cputime of a task when it gets descheduled
      and sometimes set up some ground initialization for the
      next task to account its cputime.
      
      These archs all put their own hooks in their context
      switch callbacks and handle the off-case themselves.
      
      Consolidate this by creating a new account_switch_vtime()
      callback, called from generic code right after a context switch,
      that these archs must implement to flush the prev task's
      cputime and initialize the next task's cputime-related state.
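      
      The generic call site is then a single hook right after the switch
      (a sketch; the exact location is assumed):
      
      /* In the generic context switch tail, e.g. finish_task_switch() */
      account_switch_vtime(prev);     /* arch hook: flush prev, prime next */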
      Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
      Acked-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: Fenghua Yu <fenghua.yu@intel.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
  26. 08 June 2012 (1 commit)
    • powerpc/time: Sanity check of decrementer expiration is necessary · 860aed25
      Authored by Paul Mackerras
      This reverts 68568add ("powerpc/time: Remove unnecessary sanity check
      of decrementer expiration").  We do need to check whether we have reached
      the expiration time of the next event, because we sometimes get an early
      decrementer interrupt, most notably when we set the decrementer to 1 in
      arch_irq_work_raise().  The effect of not having the sanity check is that
      if timer_interrupt() gets called early, we leave the decrementer set to
      its maximum value, which means we then don't get any more decrementer
      interrupts for about 4 seconds (or longer, depending on timebase
      frequency).  I saw these pauses as a consequence of getting a stray
      hypervisor decrementer interrupt left over from exiting a KVM guest.
      
      This isn't quite a straight revert because of changes to the surrounding
      code, but it restores the same algorithm as was previously used.
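      
      The restored check matches the shape quoted in the newer commits earlier
      on this page (a sketch, assuming the same variable names):
      
      /* In timer_interrupt() */
      now = get_tb_or_rtc();
      if (now >= *next_tb) {
              *next_tb = ~(u64)0;
              if (evt->event_handler)
                      evt->event_handler(evt);
      } else {
              /* Early interrupt: re-arm for the real deadline instead of
               * leaving the decrementer at its maximum value. */
              now = *next_tb - now;
              if (now <= DECREMENTER_MAX)
                      set_dec((int)now);
      }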
      
      Cc: stable@vger.kernel.org
      Acked-by: Anton Blanchard <anton@samba.org>
      Acked-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Signed-off-by: Paul Mackerras <paulus@samba.org>
  27. 06 May 2012 (1 commit)
    • KVM: PPC: Use clockevent multiplier and shifter for decrementer · 6e35994d
      Authored by Bharat Bhushan
      The time for which the hrtimer is started for decrementer emulation is
      calculated using tb_ticks_per_usec, while the hrtimer path uses the
      clockevent for DEC reprogramming (if needed), which calculates timebase
      ticks using the multiplier and shifter mechanism implemented within the
      clockevent layer.
      
      It was observed that this conversion (timebase->time->timebase) is not
      correct, because the two mechanisms are not consistent.
      In our setup it adds 2% jitter.
      
      With this patch, the clockevent multiplier and shifter mechanism is used
      when starting the hrtimer for decrementer emulation, as sketched below.
      Now the jitter is < 0.5%.
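      
      A sketch of the consistent conversion, going from DEC (timebase) ticks to
      nanoseconds with the same mult/shift pair the clockevent layer uses in
      the opposite direction (the field and struct names are assumed):
      
      /* ns -> ticks in the clockevent layer is ticks = (ns * mult) >> shift,
       * so invert it here and the round trip cancels out:
       * ns = (ticks << shift) / mult. */
      u64 dec_time = vcpu->arch.dec;
      
      dec_time <<= decrementer_clockevent.shift;
      do_div(dec_time, decrementer_clockevent.mult);
      hrtimer_start(&vcpu->arch.dec_timer,
                    ktime_set(0, dec_time), HRTIMER_MODE_REL);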
      Signed-off-by: Bharat Bhushan <bharat.bhushan@freescale.com>
      Signed-off-by: Alexander Graf <agraf@suse.de>
  28. 21 March 2012 (1 commit)
  29. 09 March 2012 (1 commit)
    • powerpc: Rework lazy-interrupt handling · 7230c564
      Authored by Benjamin Herrenschmidt
      The current implementation of lazy interrupts handling has some
      issues that this tries to address.
      
      We don't do the various workarounds we need to do when re-enabling
      interrupts in some cases such as when returning from an interrupt
      and thus we may still lose or get delayed decrementer or doorbell
      interrupts.
      
      The current scheme also makes it much harder to handle the external
      "edge" interrupts provided by some BookE processors when using the
      EPR facility (External Proxy) and the Freescale Hypervisor.
      
      Additionally, we tend to keep interrupts hard disabled in a number
      of cases, such as decrementer interrupts, external interrupts, or
      when a masked decrementer interrupt is pending. This is sub-optimal.
      
      This is an attempt at fixing it all in one go by reworking the way
      we do the lazy interrupt disabling from the ground up.
      
      The base idea is to replace the "hard_enabled" field with an
      "irq_happened" field in which we store a bit mask of which interrupts
      occurred while soft-disabled.
      
      When re-enabling, either via arch_local_irq_restore() or when returning
      from an interrupt, we can now decide what to do by testing bits in that
      field.
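      
      A minimal sketch of the idea (the bit names and helpers are assumed; the
      real replay logic is considerably more involved):
      
      /* Bits recorded in the PACA's irq_happened while soft-disabled */
      #define PACA_IRQ_HARD_DIS       0x01
      #define PACA_IRQ_DBELL          0x02
      #define PACA_IRQ_EE             0x04
      #define PACA_IRQ_DEC            0x08
      
      void arch_local_irq_restore(unsigned long en)
      {
              set_soft_enabled(en);
              if (!en || !get_paca()->irq_happened)
                      return;         /* nothing was missed: fast path */
      
              /* Otherwise replay whatever the bits say happened, either by
               * reusing the exception frame or via an asm trampoline. */
      }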
      
      We then implement replaying of the missed interrupts either by
      re-using the existing exception frame (in exception exit case) or via
      the creation of a new one from an assembly trampoline (in the
      arch_local_irq_enable case).
      
      This removes the need to play with the decrementer to try to create
      fake interrupts, among others.
      
      In addition, this adds a few refinements:
      
       - We no longer  hard disable decrementer interrupts that occur
      while soft-disabled. We now simply bump the decrementer back to max
      (on BookS) or leave it stopped (on BookE) and continue with hard interrupts
      enabled, which means that we'll potentially get better sample quality from
      performance monitor interrupts.
      
       - Timer, decrementer and doorbell interrupts now hard-enable
      shortly after removing the source of the interrupt, which means
      they no longer run entirely hard disabled. Again, this will improve
      perf sample quality.
      
       - On Book3E 64-bit, we now make the performance monitor interrupt
      act as an NMI like Book3S (the necessary C code for that to work
      appear to already be present in the FSL perf code, notably calling
      nmi_enter instead of irq_enter). (This also fixes a bug where BookE
      perfmon interrupts could clobber r14 ... oops)
      
       - We could make "masked" decrementer interrupts act as NMIs when doing
      timer-based perf sampling to improve the sample quality.
      
      Signed-off-by-yet: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      ---
      
      v2:
      
      - Add hard-enable to decrementer, timer and doorbells
      - Fix CR clobber in masked irq handling on BookE
      - Make embedded perf interrupt act as an NMI
      - Add a PACA_HAPPENED_EE_EDGE for use by FSL if they want
        to retrigger an interrupt without preventing hard-enable
      
      v3:
      
       - Fix or vs. ori bug on Book3E
       - Fix enabling of interrupts for some exceptions on Book3E
      
      v4:
      
       - Fix resend of doorbells on return from interrupt on Book3E
      
      v5:
      
       - Rebased on top of my latest series, which involves some significant
      rework of some aspects of the patch.
      
      v6:
       - 32-bit compile fix
       - more compile fixes with various .config combos
       - factor out the asm code to soft-disable interrupts
       - remove the C wrapper around preempt_schedule_irq
      
      v7:
       - Fix a bug with hard irq state tracking on native power7