1. 01 March 2013, 1 commit
    • irq: Don't re-enable interrupts at the end of irq_exit · 4cd5d111
      Authored by Frederic Weisbecker
      Commit 74eed016
      "irq: Ensure irq_exit() code runs with interrupts disabled"
      restores the interrupt flags at the end of irq_exit() for archs
      that don't define __ARCH_IRQ_EXIT_IRQS_DISABLED.

      However, always returning from irq_exit() with interrupts
      disabled should not be a problem for these archs. Prior to
      that commit, this was already happening anytime we processed
      pending softirqs anyway.
      Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      4cd5d111
  2. 22 February 2013, 3 commits
    • irq: Remove IRQ_EXIT_OFFSET workaround · 4d4c4e24
      Authored by Frederic Weisbecker
      The IRQ_EXIT_OFFSET trick was used to make sure the irq
      doesn't get preempted after we subtract the HARDIRQ_OFFSET
      until we are entirely done with any code in irq_exit().

      This workaround was necessary because some archs may call
      irq_exit() with irqs enabled, and there is still some code
      at the end of this function that is not covered by the
      HARDIRQ_OFFSET but wants to stay non-preemptible.

      Now that irqs are always disabled in irq_exit(), the whole code
      is guaranteed not to be preempted. We can thus remove this hack.
      Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      4d4c4e24
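      A minimal userspace model of the offset trick removed here, for
      illustration only; the constants and helper names below are made up,
      not the kernel's. The point is that subtracting slightly less than
      what irq_enter() added leaves one count in place, so the tail of
      irq_exit() stays non-preemptible until the final decrement.

      #include <assert.h>
      #include <stdio.h>

      #define HARDIRQ_OFFSET_MODEL   0x10000   /* one hardirq nesting level */
      #define IRQ_EXIT_OFFSET_MODEL  (HARDIRQ_OFFSET_MODEL - 1)

      static int preempt_count_model;          /* stand-in for preempt_count() */

      static int preemptible_model(void)
      {
              return preempt_count_model == 0;
      }

      int main(void)
      {
              preempt_count_model += HARDIRQ_OFFSET_MODEL;   /* irq_enter() */

              /* old-style irq_exit(): drop the hardirq level, keep one count */
              preempt_count_model -= IRQ_EXIT_OFFSET_MODEL;
              assert(!preemptible_model());    /* tail of irq_exit() is safe */
              printf("tail of irq_exit(): preemptible=%d\n", preemptible_model());

              preempt_count_model -= 1;        /* final one-count drop */
              assert(preemptible_model());
              printf("after irq_exit():   preemptible=%d\n", preemptible_model());
              return 0;
      }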
    • irq: Sanitize invoke_softirq · facd8b80
      Authored by Thomas Gleixner
      With the irq protection in irq_exit(), we can remove the #ifdeffery
      and the bh_disable/enable dance in invoke_softirq().
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Linus Torvalds <torvalds@linuxfoundation.org>
      Link: http://lkml.kernel.org/r/alpine.LFD.2.02.1302202155320.22263@ionos
      facd8b80
    • irq: Ensure irq_exit() code runs with interrupts disabled · 74eed016
      Authored by Thomas Gleixner
      We have already had a few problems with code called from irq_exit()
      when interrupted by a nesting interrupt. This can happen on
      architectures which do not define __ARCH_IRQ_EXIT_IRQS_DISABLED.

      __ARCH_IRQ_EXIT_IRQS_DISABLED should go away and we want to make it
      mandatory to call irq_exit() with interrupts disabled.

      As a temporary protection, disable interrupts for those architectures
      which do not define __ARCH_IRQ_EXIT_IRQS_DISABLED and add a WARN_ONCE
      when an architecture which defines __ARCH_IRQ_EXIT_IRQS_DISABLED calls
      irq_exit() with interrupts enabled.
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Linus Torvalds <torvalds@linuxfoundation.org>
      Link: http://lkml.kernel.org/r/alpine.LFD.2.02.1302202155320.22263@ionos
      74eed016
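      A rough userspace model of the temporary protection described above;
      every helper and flag here is a stand-in, not the real kernel primitive.

      #include <stdbool.h>
      #include <stdio.h>

      /* models whether the arch guarantees irqs are off in irq_exit() */
      static const bool arch_irq_exit_irqs_disabled_model = false;

      static bool irqs_enabled_model = true;   /* models the CPU irq flag */

      static void warn_once_model(bool cond, const char *msg)
      {
              static bool warned;

              if (cond && !warned) {
                      warned = true;
                      fprintf(stderr, "WARNING (once): %s\n", msg);
              }
      }

      static void irq_exit_model(void)
      {
              if (arch_irq_exit_irqs_disabled_model)
                      warn_once_model(irqs_enabled_model,
                                      "irq_exit() entered with irqs enabled");
              else
                      irqs_enabled_model = false;   /* temporary protection */

              /* ... softirq processing, nohz/RCU irq-exit work, irqs off ... */
              printf("irq_exit body runs with irqs %s\n",
                     irqs_enabled_model ? "ENABLED (bug)" : "disabled");
      }

      int main(void)
      {
              irq_exit_model();
              return 0;
      }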
  3. 28 January 2013, 1 commit
    • cputime: Safely read cputime of full dynticks CPUs · 6a61671b
      Authored by Frederic Weisbecker
      While remotely reading the cputime of a task running on a
      full dynticks CPU, the values stored in the utime/stime fields
      of struct task_struct may be stale. These values may date from
      the last kernel <-> user transition snapshot, and we need to add
      the tickless time spent since that snapshot.

      To fix this, flush the cputime of the dynticks CPUs on
      kernel <-> user transition and record the time / context
      where we did this. Then, on top of this snapshot and the current
      time, perform the fixup on the reader side from the task_times()
      accessors.
      Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Li Zhong <zhong@linux.vnet.ibm.com>
      Cc: Namhyung Kim <namhyung.kim@lge.com>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      [fixed kvm module related build errors]
      Signed-off-by: Sedat Dilek <sedat.dilek@gmail.com>
      6a61671b
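      A simplified, single-threaded model of the snapshot-plus-fixup idea;
      the struct fields and function names are invented, and the real kernel
      additionally needs a seqcount-style protocol so remote readers see a
      consistent pair of values.

      #include <stdio.h>

      struct vtime_model {
              unsigned long long stime_ns;   /* accumulated (possibly stale) time */
              unsigned long long snap_ns;    /* timestamp of the last flush       */
              unsigned long long now_ns;     /* fake clock for the example        */
      };

      /* writer side: flush at each kernel <-> user transition on the dynticks CPU */
      static void vtime_flush_model(struct vtime_model *v)
      {
              v->stime_ns += v->now_ns - v->snap_ns;
              v->snap_ns = v->now_ns;
      }

      /* reader side: task_times()-style accessor adding the tickless delta */
      static unsigned long long vtime_read_model(const struct vtime_model *v)
      {
              return v->stime_ns + (v->now_ns - v->snap_ns);
      }

      int main(void)
      {
              struct vtime_model v = { 0, 0, 0 };

              v.now_ns = 1000;
              vtime_flush_model(&v);          /* transition: 1000 ns accounted */
              v.now_ns = 5000;                /* 4000 tickless ns since then   */
              printf("stale field: %llu ns, fixed-up read: %llu ns\n",
                     v.stime_ns, vtime_read_model(&v));
              return 0;
      }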
  4. 11 January 2013, 1 commit
    • softirq: reduce latencies · c10d7367
      Authored by Eric Dumazet
      In various network workloads, __do_softirq() latencies can be up
      to 20 ms if HZ=1000, and 200 ms if HZ=100.

      This is because we iterate 10 times in the softirq dispatcher,
      and some actions can consume a lot of cycles.

      This patch changes the fallback-to-ksoftirqd condition to:

      - A time limit of 2 ms.
      - need_resched() being set on the current task

      When one of these conditions is met, we wake up ksoftirqd for further
      softirq processing if we still have pending softirqs.

      Using need_resched() as the only condition can trigger RCU stalls,
      as we can keep BH disabled for too long.

      I ran several benchmarks and got no significant difference in
      throughput, but a very significant reduction of latencies (one order
      of magnitude):

      In the following benchmark, 200 antagonist "netperf -t TCP_RR" sessions
      are started in the background, using all available cpus.

      Then we start one "netperf -t TCP_RR", bound to the cpu handling the
      NIC IRQ (hard+soft).
      
      Before patch :
      
      # netperf -H 7.7.7.84 -t TCP_RR -T2,2 -- -k
      RT_LATENCY,MIN_LATENCY,MAX_LATENCY,P50_LATENCY,P90_LATENCY,P99_LATENCY,MEAN_LATENCY,STDDEV_LATENCY
      MIGRATED TCP REQUEST/RESPONSE TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET
      to 7.7.7.84 () port 0 AF_INET : first burst 0 : cpu bind
      RT_LATENCY=550110.424
      MIN_LATENCY=146858
      MAX_LATENCY=997109
      P50_LATENCY=305000
      P90_LATENCY=550000
      P99_LATENCY=710000
      MEAN_LATENCY=376989.12
      STDDEV_LATENCY=184046.92
      
      After patch :
      
      # netperf -H 7.7.7.84 -t TCP_RR -T2,2 -- -k
      RT_LATENCY,MIN_LATENCY,MAX_LATENCY,P50_LATENCY,P90_LATENCY,P99_LATENCY,MEAN_LATENCY,STDDEV_LATENCY
      MIGRATED TCP REQUEST/RESPONSE TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET
      to 7.7.7.84 () port 0 AF_INET : first burst 0 : cpu bind
      RT_LATENCY=40545.492
      MIN_LATENCY=9834
      MAX_LATENCY=78366
      P50_LATENCY=33583
      P90_LATENCY=59000
      P99_LATENCY=69000
      MEAN_LATENCY=38364.67
      STDDEV_LATENCY=12865.26
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Cc: David Miller <davem@davemloft.net>
      Cc: Tom Herbert <therbert@google.com>
      Cc: Ben Hutchings <bhutchings@solarflare.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      c10d7367
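      A userspace model of the new bail-out condition with invented helper
      names; it only illustrates the 2 ms budget / need_resched() check, not
      the real dispatcher.

      #include <stdbool.h>
      #include <stdio.h>
      #include <time.h>

      #define BUDGET_NS (2 * 1000 * 1000LL)    /* 2 ms, as in the commit */

      static long long now_ns(void)
      {
              struct timespec ts;

              clock_gettime(CLOCK_MONOTONIC, &ts);
              return ts.tv_sec * 1000000000LL + ts.tv_nsec;
      }

      static bool softirq_pending_model = true;   /* pretend work never ends */
      static bool need_resched_model;             /* would be need_resched() */

      static void handle_pending_softirqs_model(void) { /* run one batch */ }

      int main(void)
      {
              long long deadline = now_ns() + BUDGET_NS;
              long rounds = 0;

              while (softirq_pending_model) {
                      handle_pending_softirqs_model();
                      rounds++;
                      if (now_ns() > deadline || need_resched_model) {
                              /* in-kernel this is where ksoftirqd is woken */
                              printf("deferring to ksoftirqd after %ld rounds\n",
                                     rounds);
                              break;
                      }
              }
              return 0;
      }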
  5. 30 October 2012, 1 commit
    • cputime: Specialize irq vtime hooks · fa5058f3
      Authored by Frederic Weisbecker
      With CONFIG_VIRT_CPU_ACCOUNTING, when vtime_account()
      is called on irq entry/exit, we perform a check on the
      context: if we are interrupting the idle task we
      account the pending cputime to idle, otherwise we account
      it to system time or its sub-areas: tsk->stime, hardirq time,
      softirq time, ...

      However this check for idle only matters on hardirq entry
      and softirq entry:

      * A hardirq may directly interrupt the idle task, in which case
      we need to flush the pending CPU time to idle.

      * The idle task may be directly interrupted by a softirq if
      it calls local_bh_enable(). There is probably no such call
      in any idle task, but we need to cover every case. Ksoftirqd
      is not concerned because the idle time is flushed on context
      switch, and softirqs run at the end of a hardirq already have the
      idle time flushed by the hardirq entry.

      In the other cases we always account to system/irq time:

      * On hardirq exit we account the time to hardirq time.
      * On softirq exit we account the time to softirq time.

      To optimize this and avoid the indirect call to vtime_account()
      and the checks it performs, specialize the vtime irq APIs and
      only perform the check on irq entry. Irq exit can directly call
      vtime_account_system().

      CONFIG_IRQ_TIME_ACCOUNTING behaviour doesn't change and directly
      maps to its own vtime_account() implementation. One may want
      to take advantage of the new APIs to optimize irq time accounting
      as well in the future.
      Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
      fa5058f3
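      A userspace sketch of the specialization, using made-up function names:
      the idle check happens only in the entry hook, while the exit hook
      accounts unconditionally to system time.

      #include <stdbool.h>
      #include <stdio.h>

      static unsigned long long idle_ns, system_ns;

      static void account_idle_model(unsigned long long d)   { idle_ns += d; }
      static void account_system_model(unsigned long long d) { system_ns += d; }

      /* entry hook: the only place that still needs the idle check */
      static void vtime_irq_enter_model(bool interrupted_idle,
                                        unsigned long long pending_ns)
      {
              if (interrupted_idle)
                      account_idle_model(pending_ns);
              else
                      account_system_model(pending_ns);
      }

      /* exit hook: time spent in the irq is always system time */
      static void vtime_irq_exit_model(unsigned long long irq_ns)
      {
              account_system_model(irq_ns);
      }

      int main(void)
      {
              vtime_irq_enter_model(true, 700);   /* hardirq hit the idle task */
              vtime_irq_exit_model(50);           /* time spent in the handler */
              printf("idle=%llu ns, system=%llu ns\n", idle_ns, system_ns);
              return 0;
      }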
  6. 25 September 2012, 1 commit
    • cputime: Use a proper subsystem naming for vtime related APIs · bf9fae9f
      Authored by Frederic Weisbecker
      Use a naming based on vtime as a prefix for the virtual-cputime
      accounting APIs:

      - account_system_vtime() -> vtime_account()
      - account_switch_vtime() -> vtime_task_switch()

      This makes it easier to allow for further variants such
      as vtime_account_system(), vtime_account_idle(), ... if we
      want to find out from generic code which context we account to.

      This also makes it clearer which subsystem these APIs
      belong to.
      Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: Fenghua Yu <fenghua.yu@intel.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      bf9fae9f
  7. 13 August 2012, 1 commit
  8. 01 August 2012, 1 commit
    • mm: allow PF_MEMALLOC from softirq context · 907aed48
      Authored by Mel Gorman
      This is needed to allow network softirq packet processing to make use of
      PF_MEMALLOC.
      
      Currently softirq context cannot use PF_MEMALLOC because it is not
      associated with a task, and therefore has no task flags to fiddle with;
      thus the gfp-to-alloc-flags mapping ignores the task flags when in
      interrupt (hard or soft) context.

      Allowing softirqs to make use of PF_MEMALLOC therefore requires some
      trickery.  This patch borrows the task flags from whatever process happens
      to be preempted by the softirq.  It then modifies the gfp-to-alloc-flags
      mapping to not exclude task flags in softirq context, and modifies the
      softirq code to save, clear and restore the PF_MEMALLOC flag.

      The save and clear ensure the preempted task's PF_MEMALLOC flag doesn't
      leak into the softirq.  The restore ensures a softirq's PF_MEMALLOC flag
      cannot leak back into the preempted process.  This should be safe for
      the following reasons:
      
      Softirqs can run on multiple CPUs, but the same task should not be
      	executing the same softirq code. Neither should the softirq
      	handler be preempted by any other softirq handler, so the flags
      	should not leak to an unrelated softirq.

      Softirqs re-enable hardware interrupts in __do_softirq(), so they can
      	be preempted by hardware interrupts and PF_MEMALLOC is inherited
      	by the hard IRQ. However, this is similar to a process in
      	reclaim being preempted by a hardirq. While PF_MEMALLOC is
      	set, gfp_to_alloc_flags() distinguishes between hard and
      	soft irqs and avoids giving a hardirq the ALLOC_NO_WATERMARKS
      	flag.

      If the softirq is deferred to ksoftirqd, then its flags may be used
              instead of a normal task's, but as the softirq cannot be preempted,
              the PF_MEMALLOC flag does not leak to other code by accident.
      
      [davem@davemloft.net: Document why PF_MEMALLOC is safe]
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Cc: David Miller <davem@davemloft.net>
      Cc: Neil Brown <neilb@suse.de>
      Cc: Mike Christie <michaelc@cs.wisc.edu>
      Cc: Eric B Munson <emunson@mgebm.net>
      Cc: Eric Dumazet <eric.dumazet@gmail.com>
      Cc: Sebastian Andrzej Siewior <sebastian@breakpoint.cc>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Christoph Lameter <cl@linux.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      907aed48
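      A userspace model of the save/clear/restore dance; the flag bit and
      helpers are invented, this is not the kernel implementation.

      #include <stdio.h>

      #define PF_MEMALLOC_MODEL 0x1u

      struct task_model {
              unsigned int flags;
      };

      static void run_softirq_actions_model(struct task_model *curr)
      {
              /* softirq work may set the flag for its own allocations */
              curr->flags |= PF_MEMALLOC_MODEL;
      }

      static void do_softirq_model(struct task_model *preempted)
      {
              /* save and clear: the preempted task's flag must not leak in */
              unsigned int saved = preempted->flags & PF_MEMALLOC_MODEL;

              preempted->flags &= ~PF_MEMALLOC_MODEL;
              run_softirq_actions_model(preempted);

              /* restore: the softirq's flag must not leak back out */
              preempted->flags = (preempted->flags & ~PF_MEMALLOC_MODEL) | saved;
      }

      int main(void)
      {
              struct task_model task = { .flags = 0 };   /* not in reclaim */

              do_softirq_model(&task);
              printf("PF_MEMALLOC after softirq: %u (expected 0)\n",
                     task.flags & PF_MEMALLOC_MODEL);
              return 0;
      }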
  9. 06 March 2012, 1 commit
  10. 01 March 2012, 2 commits
  11. 15 February 2012, 1 commit
    • timer: Fix bad idle check on irq entry · 0a8a2e78
      Authored by Frederic Weisbecker
      idle_cpu() is called on irq entry to guess whether we need to call
      tick_check_idle(). This way we can catch up with jiffies if the tick
      was stopped, stop accounting idle time during the interrupt and
      maintain the sched clock if it is unstable.

      But if we are going to exit the idle loop to schedule a new task (ie:
      if we have a task in the runqueue or a remotely enqueued ttwu to
      perform), the idle_cpu() check will return 0, so we miss the
      call to tick_check_idle() for all interrupts happening before we
      schedule the new task.

      As a result these interrupts and the softirqs coming along may deal
      with stale jiffies values and bad sched clock values, and won't subtract
      their time from the idle time accounting.

      Fix this by using is_idle_task() instead, which strictly checks that
      we are running the idle task, without caring about the fact that we
      are going to schedule a task soon.
      Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: John Stultz <john.stultz@linaro.org>
      Cc: Ingo Molnar <mingo@elte.hu>
      Link: http://lkml.kernel.org/r/1327427984-23282-3-git-send-email-fweisbec@gmail.com
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      0a8a2e78
  12. 03 February 2012, 1 commit
  13. 12 December 2011, 2 commits
    • rcu: Fix early call to rcu_idle_enter() · 416eb33c
      Authored by Frederic Weisbecker
      On the irq exit path, tick_nohz_irq_exit()
      may raise a softirq, which leads to the wakeup
      path and select_task_rq_fair(), which makes use of RCU
      to iterate the domains.

      This is an illegal use of RCU because we may be in an RCU
      extended quiescent state if we interrupted an RCU-idle
      window in the idle loop:
      
      [  132.978883] ===============================
      [  132.978883] [ INFO: suspicious RCU usage. ]
      [  132.978883] -------------------------------
      [  132.978883] kernel/sched_fair.c:1707 suspicious rcu_dereference_check() usage!
      [  132.978883]
      [  132.978883] other info that might help us debug this:
      [  132.978883]
      [  132.978883]
      [  132.978883] rcu_scheduler_active = 1, debug_locks = 0
      [  132.978883] RCU used illegally from extended quiescent state!
      [  132.978883] 2 locks held by swapper/0:
      [  132.978883]  #0:  (&p->pi_lock){-.-.-.}, at: [<ffffffff8105a729>] try_to_wake_up+0x39/0x2f0
      [  132.978883]  #1:  (rcu_read_lock){.+.+..}, at: [<ffffffff8105556a>] select_task_rq_fair+0x6a/0xec0
      [  132.978883]
      [  132.978883] stack backtrace:
      [  132.978883] Pid: 0, comm: swapper Tainted: G        W   3.0.0+ #178
      [  132.978883] Call Trace:
      [  132.978883]  <IRQ>  [<ffffffff810a01f6>] lockdep_rcu_suspicious+0xe6/0x100
      [  132.978883]  [<ffffffff81055c49>] select_task_rq_fair+0x749/0xec0
      [  132.978883]  [<ffffffff8105556a>] ? select_task_rq_fair+0x6a/0xec0
      [  132.978883]  [<ffffffff812fe494>] ? do_raw_spin_lock+0x54/0x150
      [  132.978883]  [<ffffffff810a1f2d>] ? trace_hardirqs_on+0xd/0x10
      [  132.978883]  [<ffffffff8105a7c3>] try_to_wake_up+0xd3/0x2f0
      [  132.978883]  [<ffffffff81094f98>] ? ktime_get+0x68/0xf0
      [  132.978883]  [<ffffffff8105aa35>] wake_up_process+0x15/0x20
      [  132.978883]  [<ffffffff81069dd5>] raise_softirq_irqoff+0x65/0x110
      [  132.978883]  [<ffffffff8108eb65>] __hrtimer_start_range_ns+0x415/0x5a0
      [  132.978883]  [<ffffffff812fe3ee>] ? do_raw_spin_unlock+0x5e/0xb0
      [  132.978883]  [<ffffffff8108ed08>] hrtimer_start+0x18/0x20
      [  132.978883]  [<ffffffff8109c9c3>] tick_nohz_stop_sched_tick+0x393/0x450
      [  132.978883]  [<ffffffff810694f2>] irq_exit+0xd2/0x100
      [  132.978883]  [<ffffffff81829e96>] do_IRQ+0x66/0xe0
      [  132.978883]  [<ffffffff81820d53>] common_interrupt+0x13/0x13
      [  132.978883]  <EOI>  [<ffffffff8103434b>] ? native_safe_halt+0xb/0x10
      [  132.978883]  [<ffffffff810a1f2d>] ? trace_hardirqs_on+0xd/0x10
      [  132.978883]  [<ffffffff810144ea>] default_idle+0xba/0x370
      [  132.978883]  [<ffffffff810147fe>] amd_e400_idle+0x5e/0x130
      [  132.978883]  [<ffffffff8100a9f6>] cpu_idle+0xb6/0x120
      [  132.978883]  [<ffffffff817f217f>] rest_init+0xef/0x150
      [  132.978883]  [<ffffffff817f20e2>] ? rest_init+0x52/0x150
      [  132.978883]  [<ffffffff81ed9cf3>] start_kernel+0x3da/0x3e5
      [  132.978883]  [<ffffffff81ed9346>] x86_64_start_reservations+0x131/0x135
      [  132.978883]  [<ffffffff81ed944d>] x86_64_start_kernel+0x103/0x112
      
      Fix this by calling rcu_idle_enter() after tick_nohz_irq_exit().
      Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Reviewed-by: Josh Triplett <josh@joshtriplett.org>
      416eb33c
    • nohz: Separate out irq exit and idle loop dyntick logic · 280f0677
      Authored by Frederic Weisbecker
      The tick_nohz_stop_sched_tick() function, which tries to delay
      the next timer tick as long as possible, can be called from two
      places:

      - From the idle loop, to start the dynticks idle mode
      - From interrupt exit, if we have interrupted the dynticks
      idle mode, so that we reprogram the next tick event in
      case the irq changed some internal state that requires this
      action.

      There are only a few minor differences between the two cases
      handled by that function, driven by the ts->inidle
      cpu variable and the inidle parameter. The whole guarantees
      that we only update the dyntick mode on irq exit if we actually
      interrupted the dyntick idle mode, and that we enter the RCU
      extended quiescent state from idle loop entry only.

      Split this function into:

      - tick_nohz_idle_enter(), which sets ts->inidle to 1, enters
      dynticks idle mode unconditionally if it can, and enters the RCU
      extended quiescent state.

      - tick_nohz_irq_exit(), which only updates the dynticks idle mode
      when ts->inidle is set (ie: if tick_nohz_idle_enter() has been called).

      To maintain symmetry, tick_nohz_restart_sched_tick() has been renamed
      to tick_nohz_idle_exit().

      This simplifies the code and micro-optimizes the irq exit path (no need
      for local_irq_save there). It also prepares for the split between
      the dynticks and RCU extended quiescent state logic. We'll need this
      split to further fix illegal uses of RCU in extended quiescent states
      in the idle loop.
      Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Mike Frysinger <vapier@gentoo.org>
      Cc: Guan Xuetao <gxt@mprc.pku.edu.cn>
      Cc: David Miller <davem@davemloft.net>
      Cc: Chris Metcalf <cmetcalf@tilera.com>
      Cc: Hans-Christian Egtvedt <hans-christian.egtvedt@atmel.com>
      Cc: Ralf Baechle <ralf@linux-mips.org>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Russell King <linux@arm.linux.org.uk>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Paul Mundt <lethal@linux-sh.org>
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Reviewed-by: Josh Triplett <josh@joshtriplett.org>
      280f0677
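      A userspace model of the split; the function names mirror the commit
      but the bodies are stand-ins, showing how the ts->inidle flag gates
      the irq-exit path.

      #include <stdbool.h>
      #include <stdio.h>

      struct tick_sched_model {
              bool inidle;          /* set by idle_enter, cleared by idle_exit */
              bool tick_stopped;
      };

      static void tick_nohz_idle_enter_model(struct tick_sched_model *ts)
      {
              ts->inidle = true;
              ts->tick_stopped = true;   /* + enter RCU extended quiescent state */
      }

      static void tick_nohz_irq_exit_model(struct tick_sched_model *ts)
      {
              if (!ts->inidle)
                      return;            /* not in dynticks idle: nothing to do */
              ts->tick_stopped = true;   /* re-evaluate / reprogram the next tick */
      }

      static void tick_nohz_idle_exit_model(struct tick_sched_model *ts)
      {
              ts->inidle = false;
              ts->tick_stopped = false;  /* restart the periodic tick */
      }

      int main(void)
      {
              struct tick_sched_model ts = { false, false };

              tick_nohz_irq_exit_model(&ts);   /* irq outside idle: no effect    */
              tick_nohz_idle_enter_model(&ts);
              tick_nohz_irq_exit_model(&ts);   /* irq interrupting dynticks idle */
              tick_nohz_idle_exit_model(&ts);
              printf("tick stopped at the end: %d\n", ts.tick_stopped);
              return 0;
      }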
  14. 31 October 2011, 1 commit
  15. 21 July 2011, 1 commit
    • softirq,rcu: Inform RCU of irq_exit() activity · ec433f0c
      Authored by Peter Zijlstra
      The rcu_read_unlock_special() function relies on in_irq() to exclude
      scheduler activity from interrupt level.  This fails because irq_exit()
      can invoke the scheduler after clearing the preempt_count() bits that
      in_irq() uses to determine that it is at interrupt level.  This situation
      can result in failures as follows:
      
       $task			IRQ		SoftIRQ
      
       rcu_read_lock()
      
       /* do stuff */
      
       <preempt> |= UNLOCK_BLOCKED
      
       rcu_read_unlock()
         --t->rcu_read_lock_nesting
      
      			irq_enter();
      			/* do stuff, don't use RCU */
      			irq_exit();
      			  sub_preempt_count(IRQ_EXIT_OFFSET);
      			  invoke_softirq()
      
      					ttwu();
      					  spin_lock_irq(&pi->lock)
      					  rcu_read_lock();
      					  /* do stuff */
      					  rcu_read_unlock();
      					    rcu_read_unlock_special()
      					      rcu_report_exp_rnp()
      					        ttwu()
      					          spin_lock_irq(&pi->lock) /* deadlock */
      
         rcu_read_unlock_special(t);
      
      Ed can trigger this easily because invoke_softirq() immediately
      does a ttwu() of ksoftirqd/# instead of doing the in-place softirq stuff
      first, but even without that the above happens.
      
      Cure this by also excluding softirqs from the
      rcu_read_unlock_special() handler and ensuring the force_irqthreads
      ksoftirqd/# wakeup is done from full softirq context.
      
      [ Alternatively, delaying the ->rcu_read_lock_nesting decrement
        until after the special handling would make the thing more robust
        in the face of interrupts as well.  And there is a separate patch
        for that. ]
      
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Reported-and-tested-by: Ed Tomlinson <edt@aei.ca>
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      ec433f0c
  16. 15 June 2011, 1 commit
    • rcu: Use softirq to address performance regression · 09223371
      Authored by Shaohua Li
      Commit a26ac245 (rcu: move TREE_RCU from softirq to kthread)
      introduced a performance regression. In an AIM7 test, this commit
      degraded performance by about 40%.

      The commit runs rcu callbacks in a kthread instead of softirq. We
      observed a high rate of context switches caused by this. Our test
      system has 64 CPUs and HZ is 1000, so we saw more than 64k context
      switches per second, caused by RCU's per-CPU kthread.  A trace showed
      that most of the time the RCU per-CPU kthread doesn't actually handle
      any callbacks, but instead just does a very small amount of work
      handling grace periods. This means that RCU's per-CPU kthreads are
      making the scheduler do quite a bit of work in order to allow a very
      small amount of RCU-related processing to be done.
      
      Alex Shi's analysis determined that this slowdown is due to lock
      contention within the scheduler.  Unfortunately, as Peter Zijlstra points
      out, the scheduler's real-time semantics require global action, which
      means that this contention is inherent in real-time scheduling.  (Yes,
      perhaps someone will come up with a workaround -- otherwise, -rt is not
      going to do well on large SMP systems -- but this patch will work around
      this issue in the meantime.  And "the meantime" might well be forever.)
      
      This patch therefore re-introduces softirq processing to RCU, but only
      for core RCU work.  RCU callbacks are still executed in kthread context,
      so that only a small amount of RCU work runs in softirq context in the
      common case.  This should minimize ksoftirqd execution, allowing us to
      skip boosting of ksoftirqd for CONFIG_RCU_BOOST=y kernels.
      Signed-off-by: Shaohua Li <shaohua.li@intel.com>
      Tested-by: "Alex,Shi" <alex.shi@intel.com>
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      09223371
  17. 06 May 2011, 1 commit
  18. 31 March 2011, 1 commit
  19. 23 March 2011, 1 commit
  20. 26 February 2011, 1 commit
    • genirq: Provide forced interrupt threading · 8d32a307
      Authored by Thomas Gleixner
      Add a commandline parameter "threadirqs" which forces all interrupts
      except those marked IRQF_NO_THREAD to run threaded. That's mostly a
      debug option to allow retrieving better debug data from crashing
      interrupt handlers. If "threadirqs" is not enabled on the kernel
      command line, then there is no impact on the interrupt hotpath.

      Architecture code needs to select CONFIG_IRQ_FORCED_THREADING after
      marking the interrupts which can't be threaded IRQF_NO_THREAD. All
      interrupts which have IRQF_TIMER set are implicitly marked
      IRQF_NO_THREAD. All PER_CPU interrupts are excluded as well.

      Forced threading of hard interrupts also forces all soft interrupt
      handling into thread context.

      When enabled it might slow things down a bit, but for debugging
      problems in interrupt code it's a reasonable penalty as it does not
      immediately crash and burn the machine when an interrupt handler is
      buggy.
      
      Some test results on a Core2Duo machine:
      
      Cache cold run of:
       # time git grep irq_desc
      
            non-threaded       threaded
       real 1m18.741s          1m19.061s
       user 0m1.874s           0m1.757s
       sys  0m5.843s           0m5.427s
      
       # iperf -c server
      non-threaded
      [  3]  0.0-10.0 sec  1.09 GBytes   933 Mbits/sec
      [  3]  0.0-10.0 sec  1.09 GBytes   934 Mbits/sec
      [  3]  0.0-10.0 sec  1.09 GBytes   933 Mbits/sec
      threaded
      [  3]  0.0-10.0 sec  1.09 GBytes   939 Mbits/sec
      [  3]  0.0-10.0 sec  1.09 GBytes   934 Mbits/sec
      [  3]  0.0-10.0 sec  1.09 GBytes   937 Mbits/sec
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      LKML-Reference: <20110223234956.772668648@linutronix.de>
      8d32a307
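      A hypothetical kernel-module sketch (not part of this commit) of how a
      driver keeps its handler in hard irq context under forced threading:
      pass IRQF_NO_THREAD when requesting the interrupt. The IRQ number and
      names are placeholders; a real driver gets the IRQ from its bus or
      device description, and booting with "threadirqs" then threads every
      handler except ones flagged like this.

      #include <linux/init.h>
      #include <linux/interrupt.h>
      #include <linux/module.h>

      #define DEMO_IRQ 42   /* placeholder IRQ line for the example */

      static irqreturn_t demo_handler(int irq, void *dev_id)
      {
              /* must stay short and non-sleeping: always hard irq context */
              return IRQ_HANDLED;
      }

      static int __init demo_init(void)
      {
              /* IRQF_NO_THREAD keeps this handler unthreaded even when the
               * "threadirqs" command line option is set. */
              return request_irq(DEMO_IRQ, demo_handler, IRQF_NO_THREAD,
                                 "demo-no-thread", NULL);
      }

      static void __exit demo_exit(void)
      {
              free_irq(DEMO_IRQ, NULL);
      }

      module_init(demo_init);
      module_exit(demo_exit);
      MODULE_LICENSE("GPL");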
  21. 09 February 2011, 1 commit
  22. 26 January 2011, 1 commit
  23. 14 January 2011, 1 commit
    • kernel: clean up USE_GENERIC_SMP_HELPERS · 351f8f8e
      Authored by Amerigo Wang
      For an arch which needs USE_GENERIC_SMP_HELPERS, have it select
      USE_GENERIC_SMP_HELPERS rather than leaving the choice to the user,
      since such arches don't provide their own implementations.

      Also, move on_each_cpu() to kernel/smp.c; it is strange to put it in
      kernel/softirq.c.

      For arches which don't use USE_GENERIC_SMP_HELPERS, e.g. blackfin,
      only on_each_cpu() is compiled.
      Signed-off-by: Amerigo Wang <amwang@redhat.com>
      Cc: David Howells <dhowells@redhat.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Yinghai Lu <yinghai@kernel.org>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Randy Dunlap <randy.dunlap@oracle.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      351f8f8e
  24. 07 January 2011, 1 commit
  25. 17 December 2010, 1 commit
  26. 23 October 2010, 1 commit
  27. 21 October 2010, 1 commit
    • tracing: Cleanup the convoluted softirq tracepoints · f4bc6bb2
      Authored by Thomas Gleixner
      With the addition of trace_softirq_raise() the softirq tracepoint got
      even more convoluted. Why the tracepoints take two pointers to assign
      an integer is beyond my comprehension.

      But adding an extra case which treats the first pointer as an unsigned
      long when the second pointer is NULL, including the back-and-forth
      type casting, is just horrible.

      Convert the softirq tracepoints to take a single unsigned int argument
      for the softirq vector number and fix the call sites.
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      LKML-Reference: <alpine.LFD.2.00.1010191428560.6815@localhost6.localdomain6>
      Acked-by: Peter Zijlstra <peterz@infradead.org>
      Acked-by: mathieu.desnoyers@efficios.com
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      f4bc6bb2
  28. 19 October 2010, 3 commits
    • sched: Call tick_check_idle before __irq_enter · d267f87f
      Authored by Venkatesh Pallipadi
      When the CPU is idle, on the first interrupt irq_enter() calls
      tick_check_idle() to note the interruption from idle. But there is a
      problem if this call is done after __irq_enter, as all routines in
      __irq_enter may find stale time because tick_check_idle has not yet
      been done.

      Specifically, this affects trace calls in __irq_enter when they use
      the global clock, and also the account_system_vtime change in this
      patch, as it wants to use sched_clock_cpu() to do proper irq timing.

      But tick_check_idle was moved after __irq_enter intentionally, to
      prevent the problem of unneeded ksoftirqd wakeups, by commit ee5f80a9:

          irq: call __irq_enter() before calling the tick_idle_check
          Impact: avoid spurious ksoftirqd wakeups

      Moving tick_check_idle() before __irq_enter and wrapping it with
      local_bh_disable/enable solves both problems.
      Fixed-by: Yong Zhang <yong.zhang0@gmail.com>
      Signed-off-by: Venkatesh Pallipadi <venki@google.com>
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      LKML-Reference: <1286237003-12406-9-git-send-email-venki@google.com>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      d267f87f
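      A userspace model of the resulting ordering; every function here is a
      print-only stub named after the kernel primitive it stands in for.

      #include <stdbool.h>
      #include <stdio.h>

      static void local_bh_disable_model(void)
      {
              puts("local_bh_disable()");
      }

      static void local_bh_enable_model(void)
      {
              puts("local_bh_enable() (without running softirqs)");
      }

      static void tick_check_idle_model(void)
      {
              puts("tick_check_idle(): catch up jiffies / sched clock");
      }

      static void __irq_enter_model(void)
      {
              puts("__irq_enter(): hardirq accounting starts");
      }

      static void irq_enter_model(bool was_idle)
      {
              if (was_idle) {
                      /* bh-disabled so a raised softirq can't wake ksoftirqd */
                      local_bh_disable_model();
                      tick_check_idle_model();   /* before __irq_enter: no stale time */
                      local_bh_enable_model();
              }
              __irq_enter_model();
      }

      int main(void)
      {
              irq_enter_model(true);
              return 0;
      }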
    • sched: Add a PF flag for ksoftirqd identification · 6cdd5199
      Authored by Venkatesh Pallipadi
      To account softirq time cleanly in the scheduler, we need to identify
      whether a softirq is invoked in ksoftirqd context or as a softirq at
      the tail of a hardirq. Add PF_KSOFTIRQD for that purpose.

      As all PF flag bits are currently taken, create space by moving one of
      the infrequently used bits (PF_THREAD_BOUND) down in task_struct to sit
      along with some other state fields.
      Signed-off-by: Venkatesh Pallipadi <venki@google.com>
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      LKML-Reference: <1286237003-12406-4-git-send-email-venki@google.com>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      6cdd5199
    • sched: Fix softirq time accounting · 75e1056f
      Authored by Venkatesh Pallipadi
      Peter Zijlstra found a bug in the way softirq time is accounted in
      VIRT_CPU_ACCOUNTING in this thread:

         http://lkml.indiana.edu/hypermail//linux/kernel/1009.2/01366.html

      The problem is that softirq processing uses local_bh_disable
      internally. There is no way, later in the flow, to differentiate
      between whether a softirq is being processed or bh has merely been
      disabled. So a hardirq arriving while bh is disabled results in time
      being wrongly accounted as softirq.

      Looking at the code a bit more, the problem exists in
      !VIRT_CPU_ACCOUNTING as well, since account_system_time() in normal
      tick-based accounting also uses softirq_count, which will be set even
      when we are not in softirq but bh is disabled.

      Peter also suggested the solution of using 2*SOFTIRQ_OFFSET as the irq
      count for local_bh_{disable,enable} and using just SOFTIRQ_OFFSET while
      processing softirqs. The patch below does that and adds an API,
      in_serving_softirq(), which returns whether we are currently processing
      a softirq or not.

      It also changes one of the usages of softirq_count in
      net/sched/cls_cgroup.c to in_serving_softirq.

      It looks like many usages of in_softirq really want in_serving_softirq.
      Those changes can be made individually on a case-by-case basis.
      Signed-off-by: Venkatesh Pallipadi <venki@google.com>
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      LKML-Reference: <1286237003-12406-2-git-send-email-venki@google.com>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      75e1056f
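      A userspace model of the counting scheme; the constants are
      illustrative rather than the kernel's exact bit layout. Disabling bh
      adds 2*SOFTIRQ_OFFSET while actually serving a softirq adds
      SOFTIRQ_OFFSET, so the low softirq bit tells the two states apart.

      #include <stdio.h>

      #define SOFTIRQ_OFFSET_MODEL          0x100u
      #define SOFTIRQ_DISABLE_OFFSET_MODEL  (2 * SOFTIRQ_OFFSET_MODEL)
      #define SOFTIRQ_MASK_MODEL            0xff00u

      static unsigned int preempt_count_model;

      static unsigned int softirq_count_model(void)
      {
              return preempt_count_model & SOFTIRQ_MASK_MODEL;
      }

      static int in_softirq_model(void)           /* serving OR bh disabled */
      {
              return softirq_count_model() != 0;
      }

      static int in_serving_softirq_model(void)   /* actually processing one */
      {
              return (softirq_count_model() & SOFTIRQ_OFFSET_MODEL) != 0;
      }

      int main(void)
      {
              preempt_count_model += SOFTIRQ_DISABLE_OFFSET_MODEL;  /* local_bh_disable() */
              printf("bh disabled:     in_softirq=%d in_serving_softirq=%d\n",
                     in_softirq_model(), in_serving_softirq_model());

              preempt_count_model += SOFTIRQ_OFFSET_MODEL;          /* __do_softirq() */
              printf("serving softirq: in_softirq=%d in_serving_softirq=%d\n",
                     in_softirq_model(), in_serving_softirq_model());
              return 0;
      }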
  29. 12 October 2010, 2 commits
  30. 22 September 2010, 1 commit
  31. 05 June 2010, 1 commit
  32. 28 May 2010, 1 commit
  33. 11 May 2010, 1 commit
    • rcu: refactor RCU's context-switch handling · 25502a6c
      Authored by Paul E. McKenney
      The addition of preemptible RCU to treercu resulted in a bit of
      confusion and inefficiency surrounding the handling of context switches
      for RCU-sched and for RCU-preempt.  For RCU-sched, a context switch
      is a quiescent state, pure and simple, just like it always has been.
      For RCU-preempt, a context switch is in no way a quiescent state, but
      special handling is required when a task blocks in an RCU read-side
      critical section.
      
      However, the callout from the scheduler and the outer loop in ksoftirqd
      still call something named rcu_sched_qs(), whose name is no longer
      accurate.  Furthermore, when rcu_check_callbacks() notes an RCU-sched
      quiescent state, it ends up unnecessarily (though harmlessly, aside
      from the performance hit) enqueuing the current task if it happens to
      be running in an RCU-preempt read-side critical section.  This not only
      increases the maximum latency of scheduler_tick(), it also needlessly
      increases the overhead of the next outermost rcu_read_unlock() invocation.
      
      This patch addresses this situation by separating the notion of RCU's
      context-switch handling from that of RCU-sched's quiescent states.
      The context-switch handling is covered by rcu_note_context_switch() in
      general and by rcu_preempt_note_context_switch() for preemptible RCU.
      This permits rcu_sched_qs() to handle quiescent states and only quiescent
      states.  It also reduces the maximum latency of scheduler_tick(), though
      probably by much less than a microsecond.  Finally, it means that tasks
      within preemptible-RCU read-side critical sections avoid incurring the
      overhead of queuing unless there really is a context switch.
      Suggested-by: Lai Jiangshan <laijs@cn.fujitsu.com>
      Acked-by: Lai Jiangshan <laijs@cn.fujitsu.com>
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Peter Zijlstra <peterz@infradead.org>
      25502a6c