1. 12 Dec, 2011: 1 commit
    • rcu: Track idleness independent of idle tasks · 9b2e4f18
      Paul E. McKenney committed
      Earlier versions of RCU used the scheduling-clock tick to detect idleness
      by checking for the idle task, but handled idleness differently for
      CONFIG_NO_HZ=y.  But there are now a number of uses of RCU read-side
      critical sections in the idle task, for example, for tracing.  A more
      fine-grained detection of idleness is therefore required.
      
      This commit presses the old dyntick-idle code into full-time service,
      so that rcu_idle_enter(), previously known as rcu_enter_nohz(), is
      always invoked at the beginning of an idle loop iteration.  Similarly,
      rcu_idle_exit(), previously known as rcu_exit_nohz(), is always invoked
      at the end of an idle-loop iteration.  This allows the idle task to
      use RCU everywhere except between consecutive rcu_idle_enter() and
      rcu_idle_exit() calls, in turn allowing architecture maintainers to
      specify exactly where in the idle loop RCU may be used.
      
      Because some of the userspace upcall uses can result in what looks
      to RCU like half of an interrupt, it is not possible to expect that
      the irq_enter() and irq_exit() hooks will give exact counts.  This
      patch therefore expands the ->dynticks_nesting counter to 64 bits
      and uses two separate bitfields to count process/idle transitions
      and interrupt entry/exit transitions.  It is presumed that userspace
      upcalls do not happen in the idle loop or from usermode execution
      (though usermode might do a system call that results in an upcall).
      The counter is hard-reset on each process/idle transition, which
      keeps interrupt entry/exit errors from accumulating.  Overflow
      is avoided by the 64-bitness of the ->dynticks_nesting counter.
      
      This commit also adds warnings if a non-idle task asks RCU to enter
      idle state (these checks will need some adjustment before applying
      Frederic's OS-jitter patches, http://lkml.org/lkml/2011/10/7/246).
      In addition, validation of ->dynticks and ->dynticks_nesting is added.
      Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Reviewed-by: Josh Triplett <josh@joshtriplett.org>
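      Below is a minimal user-space sketch of the counting scheme described above. The
      `model_*` names are hypothetical. The choice of bit 40 as the boundary between
      the process-level field and the irq field is also an assumption for illustration,
      not the kernel's definition. The sketch shows why the hard reset matters: an
      unmatched irq_exit() from an upcall leaves the count off by one only until
      the next process/idle transition.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical layout: process-level nesting counts in units of 2^40,
 * irq nesting uses the low bits of the same 64-bit counter. */
#define MODEL_TASK_NESTING (INT64_C(1) << 40)

struct model_dynticks {
	int64_t nesting; /* 64 bits, so irq miscounts cannot realistically overflow */
};

/* Idle-loop entry: hard reset, discarding any accumulated irq miscount. */
static void model_idle_enter(struct model_dynticks *d)
{
	d->nesting = 0;
}

/* Idle-loop exit: hard reset to exactly one level of process nesting. */
static void model_idle_exit(struct model_dynticks *d)
{
	d->nesting = MODEL_TASK_NESTING;
}

static void model_irq_enter(struct model_dynticks *d)
{
	d->nesting++;
}

static void model_irq_exit(struct model_dynticks *d)
{
	d->nesting--;
}

/* RCU may ignore this CPU only while nothing at all is nested. */
static bool model_rcu_is_idle(const struct model_dynticks *d)
{
	return d->nesting == 0;
}
```

      An unmatched model_irq_exit() while non-idle leaves the counter nonzero, so
      RCU still treats the CPU as busy. The next model_idle_enter() then wipes the error.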
  2. 29 Sep, 2011: 1 commit
  3. 08 Sep, 2011: 3 commits
  4. 01 Feb, 2011: 1 commit
  5. 20 Jan, 2011: 1 commit
    • hrtimers: Notify hrtimer users of switches to NOHZ mode · 2d0640b4
      Stephen Boyd committed
      When NOHZ=y and high res timers are disabled (via cmdline or
      Kconfig) tick_nohz_switch_to_nohz() will notify the user about
      switching into NOHZ mode. Nothing is printed for the case where
      HIGH_RES_TIMERS=y. Fix this for the HIGH_RES_TIMERS=y case by
      duplicating the printk from the low res NOHZ path in the high
      res NOHZ path.
      
      This confused me since I was thinking 'dmesg | grep -i NOHZ' would
      tell me if NOHZ was enabled, but if I have hrtimers there is
      nothing.
      Signed-off-by: Stephen Boyd <sboyd@codeaurora.org>
      Acked-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      LKML-Reference: <1295419594-13085-1-git-send-email-sboyd@codeaurora.org>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
  6. 03 Aug, 2010: 1 commit
  7. 17 Jul, 2010: 1 commit
  8. 01 Jul, 2010: 1 commit
    • sched: Cure nr_iowait_cpu() users · 8c215bd3
      Peter Zijlstra committed
      Commit 0224cf4c (sched: Intoduce get_cpu_iowait_time_us())
      broke things by not making sure preemption was indeed disabled
      by the callers of nr_iowait_cpu() which took the iowait value of
      the current cpu.
      
      This resulted in a heap of preempt warnings. Cure this by making
      nr_iowait_cpu() take a cpu number and fix up the callers to pass
      in the right number.
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Arjan van de Ven <arjan@infradead.org>
      Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Cc: Rafael J. Wysocki <rjw@sisk.pl>
      Cc: Maxim Levitsky <maximlevitsky@gmail.com>
      Cc: Len Brown <len.brown@intel.com>
      Cc: Pavel Machek <pavel@ucw.cz>
      Cc: Jiri Slaby <jslaby@suse.cz>
      Cc: linux-pm@lists.linux-foundation.org
      LKML-Reference: <1277968037.1868.120.camel@laptop>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
  9. 18 Jun, 2010: 1 commit
  10. 09 Jun, 2010: 1 commit
    • sched: Change nohz idle load balancing logic to push model · 83cd4fe2
      Venkatesh Pallipadi committed
      In the new push model, all idle CPUs do go into nohz mode. There is
      still the concept of an idle load balancer (performing the load balancing
      on behalf of all the idle CPUs in the system). A busy CPU kicks the nohz
      balancer when any of the nohz CPUs need idle load balancing.
      The kicked CPU does the idle load balancing on behalf of all idle CPUs
      instead of the normal idle balance.
      
      This addresses the below two problems with the current nohz ilb logic:
      * the idle load balancer continued to have periodic ticks during idle and
        woke up frequently, even though it did not have any rebalancing to do on
        behalf of any of the idle CPUs.
      * On x86 and CPUs that have APIC timer stoppage on idle CPUs, this
        periodic wakeup can result in a periodic additional interrupt on a CPU
        doing the timer broadcast.
      
      Also, we currently migrate the unpinned timers from an idle cpu to the cpu
      doing idle load balancing (when all the cpus in the system are idle,
      there is no idle load balancing cpu and timers get added to the same idle cpu
      where the request was made, so the existing optimization works only on a
      semi-idle system).
      
      And in a semi-idle system, we no longer have periodic ticks on the idle load
      balancer CPU. Using that cpu will add more delays to the timers than intended
      (as that cpu's timer base may not be up to date with respect to jiffies etc.).
      This was causing mysterious slowdowns during boot etc.
      
      For now, in the semi idle case, use the nearest busy cpu for migrating timers
      from an idle cpu.  This is good for power-savings anyway.
      Signed-off-by: Venkatesh Pallipadi <venki@google.com>
      Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      LKML-Reference: <1274486981.2840.46.camel@sbs-t61.sc.intel.com>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
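      Below is a sketch of the timer-target choice for the semi-idle case, with
      `model_timer_target()` as a hypothetical name. For illustration only, it assumes
      CPUs sit on a line and that "nearest" means the smallest index distance. A real
      implementation would more likely follow the scheduler's topology.

```c
#include <stdbool.h>
#include <stdlib.h>

/* Return the CPU that should host a timer armed on this_cpu.
 * busy[i] tells whether CPU i is currently non-idle. */
static int model_timer_target(int this_cpu, const bool *busy, int nr_cpus)
{
	int best = this_cpu;
	int best_dist = -1;

	if (busy[this_cpu])
		return this_cpu; /* a busy CPU keeps its own timers */

	for (int cpu = 0; cpu < nr_cpus; cpu++) {
		if (!busy[cpu])
			continue;
		int dist = abs(cpu - this_cpu);
		if (best_dist < 0 || dist < best_dist) {
			best = cpu;
			best_dist = dist;
		}
	}
	/* If every CPU is idle, best is still this_cpu: keep the timer local. */
	return best;
}
```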
  11. 10 May, 2010: 6 commits
  12. 12 Mar, 2010: 1 commit
    • sched: Rate-limit nohz · 39c0cbe2
      Mike Galbraith committed
      Entering nohz code on every micro-idle is costing ~10% throughput for netperf
      TCP_RR when scheduling cross-cpu.  Rate limiting entry fixes this, but raises
      ticks a bit.  On my Q6600, an idle box goes from ~85 interrupts/sec to 128.
      
      The higher the context switch rate, the more nohz entry costs.  With this patch
      and some cycle recovery patches in my tree, max cross-cpu context switch rate is
      improved by ~16%, a large portion of which is this rate limiting.
      Signed-off-by: Mike Galbraith <efault@gmx.de>
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      LKML-Reference: <1268301003.6785.28.camel@marge.simson.net>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
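      Below is a user-space sketch of one way to rate-limit nohz entry. The tick is
      stopped only if the previous idle entry on this CPU was at least a minimum gap
      ago. The half-tick threshold, the HZ value and the `model_*` names are
      assumptions, not necessarily what the patch uses.

```c
#include <stdbool.h>
#include <stdint.h>

#define MODEL_HZ 1000
#define MODEL_NSEC_PER_SEC 1000000000ULL
/* Hypothetical threshold: half a tick period. */
#define MODEL_NOHZ_MIN_GAP_NS ((MODEL_NSEC_PER_SEC / MODEL_HZ) / 2)

struct model_cpu {
	uint64_t last_idle_entry_ns; /* timestamp of the previous idle entry */
};

/* Returns true if the tick may be stopped on this idle entry. */
static bool model_nohz_allowed(struct model_cpu *c, uint64_t now_ns)
{
	uint64_t gap = now_ns - c->last_idle_entry_ns;

	c->last_idle_entry_ns = now_ns; /* every entry refreshes the stamp */
	return gap >= MODEL_NOHZ_MIN_GAP_NS;
}
```

      A CPU that goes idle many times per tick, as in a cross-cpu TCP_RR
      ping-pong, keeps its tick under this scheme. A CPU that idles rarely
      still enters nohz.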
  13. 14 Nov, 2009: 3 commits
    • nohz: Track last do_timer() cpu · 27185016
      Thomas Gleixner committed
      The previous patch which limits the sleep time to the maximum
      deferment time of the time keeping clocksource has some limitations on
      SMP machines: if all CPUs are idle then for all CPUs the maximum sleep
      time is limited.
      
      Solve this by keeping track of which cpu had the do_timer() duty
      assigned last and limit the sleep time only for this cpu.
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      LKML-Reference: <new-submission>
      Cc: Jon Hunter <jon-hunter@ti.com>
      Cc: John Stultz <johnstul@us.ibm.com>
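      Below is a sketch of the per-CPU decision. Only the CPU that last held the
      do_timer() duty has its idle sleep clamped to the clocksource's maximum
      deferment. The function name and parameters are hypothetical.

```c
#include <stdint.h>

/* Clamp the idle sleep only on the CPU that last had the do_timer() duty;
 * every other CPU may sleep as long as its timers allow. */
static uint64_t model_idle_sleep_ns(int cpu, int last_do_timer_cpu,
				    uint64_t requested_ns, uint64_t max_deferment_ns)
{
	if (cpu == last_do_timer_cpu && requested_ns > max_deferment_ns)
		return max_deferment_ns;
	return requested_ns;
}
```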
    • nohz: Prevent clocksource wrapping during idle · 98962465
      Jon Hunter committed
      The dynamic tick allows the kernel to sleep for periods longer than a
      single tick, but it does not limit the sleep time currently. In the
      worst case the kernel could sleep longer than the wrap around time of
      the time keeping clock source which would result in losing track of
      time.
      
      Prevent this by limiting it to the safe maximum sleep time of the
      current time keeping clock source. The value is calculated when the
      clock source is registered.
      
      [ tglx: simplified the code a bit and massaged the commit msg ]
      Signed-off-by: Jon Hunter <jon-hunter@ti.com>
      Cc: John Stultz <johnstul@us.ibm.com>
      LKML-Reference: <1250617512-23567-2-git-send-email-jon-hunter@ti.com>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
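      Below is a sketch of computing a safe maximum sleep time once, at registration.
      It starts from the clocksource's counter mask and its usual mult/shift
      cycles-to-nanoseconds scaling. The 12.5% safety margin and the `model_*` names
      are assumptions for illustration.

```c
#include <stdint.h>

/* Usual clocksource scaling: ns = (cycles * mult) >> shift. */
static uint64_t model_cyc2ns(uint64_t cycles, uint32_t mult, uint32_t shift)
{
	return (cycles * mult) >> shift;
}

/* Longest idle period that cannot let the counter wrap unnoticed.
 * Computed once, when the clocksource is registered. */
static uint64_t model_max_deferment_ns(uint64_t mask, uint32_t mult, uint32_t shift)
{
	uint64_t max_cycles = mask;
	uint64_t ns;

	if (mult == 0)
		return 0; /* not a usable clocksource */
	/* Keep cycles * mult from overflowing 64 bits. */
	if (max_cycles > UINT64_MAX / mult)
		max_cycles = UINT64_MAX / mult;
	ns = model_cyc2ns(max_cycles, mult, shift);
	return ns - (ns >> 3); /* assumed safety margin of 1/8 (12.5%) */
}

/* Clamp a requested idle sleep to the safe maximum. */
static uint64_t model_clamp_sleep(uint64_t requested_ns, uint64_t max_deferment_ns)
{
	return requested_ns < max_deferment_ns ? requested_ns : max_deferment_ns;
}
```

      Take a 32-bit counter at 1 MHz (mult = 1000, shift = 0). It wraps after
      about 4295 s, and the sketch allows at most about 3758 s of sleep.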
    • nohz: Type cast printk argument · 529eaccd
      Thomas Gleixner committed
      On some archs local_softirq_pending() has a data type of unsigned long;
      on others it is unsigned int. Type cast it to (unsigned int) in the
      printk to avoid the compiler warning.
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      LKML-Reference: <new-submission>
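      The fix pattern is sketched below in user space. It uses a hypothetical
      stand-in for local_softirq_pending() that returns unsigned long, and an
      assumed message text. Casting to unsigned int makes the argument match
      the %x conversion whatever the underlying type is.

```c
#include <stddef.h>
#include <stdio.h>

/* Hypothetical stand-in: on this "arch" the pending mask is unsigned long. */
static unsigned long model_local_softirq_pending(void)
{
	return 0x2a;
}

/* Format the warning; the cast makes the argument match %02x whatever
 * integer type the architecture's helper returns. */
static int model_format_pending(char *buf, size_t len)
{
	return snprintf(buf, len, "NOHZ: local_softirq_pending %02x",
			(unsigned int)model_local_softirq_pending());
}
```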
  14. 05 Nov, 2009: 2 commits
    • nohz: Introduce arch_needs_cpu · 3c5d92a0
      Martin Schwidefsky committed
      Allow the architecture to request a normal jiffy tick when the system
      goes idle and tick_nohz_stop_sched_tick is called. On s390 the hook is
      used to prevent the system going fully idle if there has been an
      interrupt other than a clock comparator interrupt since the last wakeup.
      
      On s390 the HiperSockets response time for 1 connection ping-pong goes
      down from 42 to 34 microseconds. The CPU cost decreases by 27%.
      Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
      LKML-Reference: <20090929122533.402715150@de.ibm.com>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
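      Below is a sketch of how such a hook can veto a long sleep. It has a generic
      default that never asks for the tick and an s390-style variant that does ask
      if another interrupt arrived since the last wakeup. All names are hypothetical.
      Returning a delta of one jiffy stands in for keeping the normal tick.

```c
#include <stdbool.h>

/* Hypothetical per-CPU state an s390-like architecture might track. */
struct model_arch_cpu {
	bool non_clock_irq_since_wakeup;
};

/* Generic default: the architecture never vetoes a long sleep. */
static bool model_arch_needs_cpu_default(const struct model_arch_cpu *c)
{
	(void)c;
	return false;
}

/* s390-style hook: keep ticking if some other interrupt arrived
 * since the last wakeup. */
static bool model_arch_needs_cpu_s390(const struct model_arch_cpu *c)
{
	return c->non_clock_irq_since_wakeup;
}

/* Jiffies until the next tick; 1 means "keep the normal periodic tick". */
static unsigned long model_next_tick_delta(bool arch_needs_cpu,
					   unsigned long next_timer_delta)
{
	return arch_needs_cpu ? 1 : next_timer_delta;
}
```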
    • nohz: Reuse ktime in sub-functions of tick_check_idle. · eed3b9cf
      Martin Schwidefsky committed
      On a system with NOHZ=y, tick_check_idle calls tick_nohz_stop_idle and
      tick_nohz_update_jiffies. Given the right conditions (ts->idle_active
      and/or ts->tick_stopped) both functions get a time stamp with ktime_get.
      The same time stamp can be reused if both functions require one.
      
      On s390 this change has the additional benefit that gcc inlines the
      tick_nohz_stop_idle function into tick_check_idle. The number of
      instructions to execute tick_check_idle drops from 225 to 144
      (without the ktime_get optimization it is 367 vs 215 instructions).
      
      before:
      
       0)               |  tick_check_idle() {
       0)               |    tick_nohz_stop_idle() {
       0)               |      ktime_get() {
       0)               |        read_tod_clock() {
       0)   0.601 us    |        }
       0)   1.765 us    |      }
       0)   3.047 us    |    }
       0)               |    ktime_get() {
       0)               |      read_tod_clock() {
       0)   0.570 us    |      }
       0)   1.727 us    |    }
       0)               |    tick_do_update_jiffies64() {
       0)   0.609 us    |    }
       0)   8.055 us    |  }
      
      after:
      
       0)               |  tick_check_idle() {
       0)               |    ktime_get() {
       0)               |      read_tod_clock() {
       0)   0.617 us    |      }
       0)   1.773 us    |    }
       0)               |    tick_do_update_jiffies64() {
       0)   0.593 us    |    }
       0)   4.477 us    |  }
      Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: john stultz <johnstul@us.ibm.com>
      LKML-Reference: <20090929122533.206589318@de.ibm.com>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
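      Below is a user-space sketch of the "read the clock once" refactoring. It uses
      hypothetical `model_*` helpers and a counting stand-in for ktime_get(). When
      both conditions hold, the clock is read a single time and the same value is
      handed to both sub-functions.

```c
#include <stdbool.h>
#include <stdint.h>

static unsigned model_clock_reads; /* counts (expensive) clock reads */

/* Hypothetical clock: each read has a cost, so count them. */
static int64_t model_ktime_get(void)
{
	model_clock_reads++;
	return 1000 * (int64_t)model_clock_reads;
}

struct model_ts {
	bool idle_active;
	bool tick_stopped;
	int64_t idle_exit;   /* when idle accounting stopped */
	int64_t last_update; /* when jiffies were last brought up to date */
};

static void model_stop_idle(struct model_ts *ts, int64_t now)
{
	ts->idle_active = false;
	ts->idle_exit = now;
}

static void model_update_jiffies(struct model_ts *ts, int64_t now)
{
	ts->last_update = now;
}

/* Read the clock at most once and pass the same value to both helpers. */
static void model_check_idle(struct model_ts *ts)
{
	int64_t now = 0;

	if (ts->idle_active || ts->tick_stopped)
		now = model_ktime_get();
	if (ts->idle_active)
		model_stop_idle(ts, now);
	if (ts->tick_stopped)
		model_update_jiffies(ts, now);
}
```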
  15. 07 Oct, 2009: 1 commit
    • NOHZ: update idle state also when NOHZ is inactive · fdc6f192
      Eero Nurkkala committed
      Commit f2e21c96 had unfortunate side effects with cpufreq governors
      on some systems.
      
      If the system did not switch into NOHZ mode, ts->inidle is not set when
      tick_nohz_stop_sched_tick() is called from the idle routine. Therefore
      all subsequent calls from irq_exit() to tick_nohz_stop_sched_tick()
      fail to call tick_nohz_start_idle(). This results in bogus idle
      accounting information, which is passed to cpufreq governors.
      
      Set the inidle flag regardless of the NOHZ active state to keep
      the idle time accounting correct in any case.
      
      [ tglx: Added comment and tweaked the changelog ]
      Reported-by: Steven Noonan <steven@uplinklabs.net>
      Signed-off-by: Eero Nurkkala <ext-eero.nurkkala@nokia.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Venkatesh Pallipadi <venkatesh.pallipadi@intel.com>
      Cc: Greg KH <greg@kroah.com>
      Cc: Steven Noonan <steven@uplinklabs.net>
      Cc: stable@kernel.org
      LKML-Reference: <1254907901.30157.93.camel@eenurkka-desktop>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
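      Below is a simplified model of the fix, with hypothetical names and state. The
      idle routine sets the in-idle flag before checking whether NOHZ is active, so
      later calls from irq_exit() still restart idle-time accounting.

```c
#include <stdbool.h>

struct model_tick_sched {
	bool nohz_active;  /* did the system switch into NOHZ mode? */
	bool inidle;       /* are we inside the idle loop? */
	bool idle_active;  /* is idle time currently being accounted? */
	bool tick_stopped;
};

/* Called from the idle loop (from_idle = true) and from irq_exit() (false). */
static void model_stop_sched_tick(struct model_tick_sched *ts, bool from_idle)
{
	if (from_idle)
		ts->inidle = true; /* set regardless of NOHZ mode */
	if (!ts->inidle)
		return;
	ts->idle_active = true; /* (re)start idle-time accounting */
	if (!ts->nohz_active)
		return; /* keep the periodic tick, accounting is still correct */
	ts->tick_stopped = true;
}
```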
  16. 27 May, 2009: 1 commit
  17. 13 May, 2009: 1 commit
  18. 15 Jan, 2009: 1 commit
  19. 31 Dec, 2008: 2 commits
    • [PATCH] idle cputime accounting · 79741dd3
      Martin Schwidefsky committed
      The cpu time spent by the idle process actually doing something is
      currently accounted as idle time. This is plain wrong; the architectures
      that support VIRT_CPU_ACCOUNTING=y can do better: distinguish between the
      time spent doing nothing and the time spent by idle doing work. The first
      is accounted with account_idle_time and the second with account_system_time.
      The architectures that use the account_xxx_time interface directly and not
      the account_xxx_ticks interface now need to do the check for the idle
      process in their arch code. In particular to improve the system vs true
      idle time accounting the arch code needs to measure the true idle time
      instead of just testing for the idle process.
      To improve the tick based accounting as well we would need an architecture
      primitive that can tell us if the pt_regs of the interrupted context
      points to the magic instruction that halts the cpu.
      
      In addition, idle time is no longer added to the stime of the idle process.
      This field now contains the system time of the idle process as it should
      be. On systems without VIRT_CPU_ACCOUNTING this will always be zero as
      every tick that occurs while idle is running will be accounted as idle
      time.
      
      This patch contains the necessary common code changes to be able to
      distinguish idle system time and true idle time. The architectures with
      support for VIRT_CPU_ACCOUNTING need some changes to exploit this.
      Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
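      Below is a sketch contrasting the two accounting paths described above. The
      names are hypothetical, and `halted` stands for whatever the architecture
      measures as true idle time.

```c
#include <stdbool.h>
#include <stdint.h>

struct model_cpustat {
	uint64_t user, system, idle;
};

/* Tick-based path: a tick cannot tell a halted CPU from a working idle
 * task, so every kernel-mode tick that hits the idle task is idle time. */
static void model_account_tick(struct model_cpustat *s, bool user_mode,
			       bool idle_task, uint64_t tick)
{
	if (user_mode)
		s->user += tick;
	else if (idle_task)
		s->idle += tick;
	else
		s->system += tick;
}

/* VIRT_CPU_ACCOUNTING-style path: the arch measured how long the CPU was
 * really halted; the rest of the idle task's time is system time. */
static void model_account_idle_task_vtime(struct model_cpustat *s,
					  uint64_t halted, uint64_t total)
{
	if (halted > total)
		halted = total;
	s->idle += halted;
	s->system += total - halted;
}
```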
    • [PATCH] fix scaled & unscaled cputime accounting · 457533a7
      Martin Schwidefsky committed
      The utimescaled / stimescaled fields in the task structure and the
      global cpustat should be set on all architectures. On s390 the calls
      to account_user_time_scaled and account_system_time_scaled have never
      been added. In addition, system time that is accounted as guest time
      to the user time of a process is accounted to the scaled system time
      instead of the scaled user time.
      To fix the bugs and to prevent future forgetfulness this patch merges
      account_system_time_scaled into account_system_time and
      account_user_time_scaled into account_user_time.
      
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: Jeremy Fitzhardinge <jeremy@xensource.com>
      Cc: Chris Wright <chrisw@sous-sol.org>
      Cc: Michael Neuling <mikey@neuling.org>
      Acked-by: Paul Mackerras <paulus@samba.org>
      Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
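      Below is a sketch of the merged interface. One call takes both the raw and the
      scaled cputime, so neither can be forgotten, and guest time lands on the user
      side for both. The names and the `guest` flag are hypothetical simplifications.

```c
#include <stdbool.h>
#include <stdint.h>

struct model_task {
	uint64_t utime, utimescaled;
	uint64_t stime, stimescaled;
};

/* One entry point updates raw and scaled counters together. */
static void model_account_system_time(struct model_task *p, bool guest,
				      uint64_t cputime, uint64_t cputime_scaled)
{
	if (guest) {
		/* Guest time is user time, in both raw and scaled form. */
		p->utime += cputime;
		p->utimescaled += cputime_scaled;
		return;
	}
	p->stime += cputime;
	p->stimescaled += cputime_scaled;
}
```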
  20. 12 Dec, 2008: 2 commits
    • nohz: suppress needless timer reprogramming · 00147449
      Woodruff, Richard committed
      In my device I get many interrupts from a high speed USB device in a very
      short period of time.  The system spends a lot of time reprogramming the
      hardware timer which is in a slower timing domain as compared to the CPU. 
      This results in the CPU spending a huge amount of time waiting for the
      timer posting to be done.  All of this reprogramming is useless as the
      wake up time has not changed.
      
      As measured using ETM trace this drops my reprogramming penalty from
      almost 60% CPU load down to 15% during high interrupt rate.  I can send
      traces to show this.
      
      Suppress setting of duplicate timer event when timer already stopped. 
      Timer programming can be very costly and can result in long cpu stall/wait
      times.
      
      [akpm@linux-foundation.org: coding-style fixes]
      [tglx@linutronix.de: move the check to the right place and avoid raising
      		     the softirq for nothing]
      Signed-off-by: Richard Woodruff <r-woodruff2@ti.com>
      Cc: johnstul@us.ibm.com
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
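      Below is a sketch of the duplicate-write check, with hypothetical names and a
      counter standing in for the slow hardware write. If the tick is already stopped
      and the wakeup time is unchanged, the device is not touched.

```c
#include <stdbool.h>
#include <stdint.h>

struct model_clock_event_dev {
	int64_t next_event; /* currently programmed expiry */
	unsigned programs;  /* number of (slow) hardware writes */
};

/* Stand-in for a costly write to a timer in a slower clock domain. */
static void model_program_event(struct model_clock_event_dev *dev, int64_t expires)
{
	dev->next_event = expires;
	dev->programs++;
}

static void model_stop_tick(struct model_clock_event_dev *dev, bool tick_stopped,
			    int64_t expires)
{
	/* Tick already stopped and wakeup time unchanged: skip the write. */
	if (tick_stopped && expires == dev->next_event)
		return;
	model_program_event(dev, expires);
}
```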
    • nohz: no softirq pending warnings for offline cpus · fa116ea3
      Heiko Carstens committed
      Impact: remove false positive warning
      
      After a cpu was taken down during cpu hotplug (read: disabled for interrupts)
      it still might have pending softirqs. However take_cpu_down makes sure
      that the idle task will run next instead of ksoftirqd on the taken down cpu.
      The idle task will call tick_nohz_stop_sched_tick which might warn about
      pending softirqs just before the cpu kills itself completely.
      
      However the pending softirqs on the dead cpu aren't a problem because they
      will be moved to an online cpu during CPU_DEAD handling.
      
      So make sure we warn only for online cpus.
      Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
  21. 25 Nov, 2008: 2 commits
    • hrtimer: removing all ur callback modes · ca109491
      Peter Zijlstra committed
      Impact: cleanup, move all hrtimer processing into hardirq context
      
      This is an attempt at removing some of the hrtimer complexity by
      reducing the number of callback modes to 1.
      
      This means that all hrtimer callback functions will be run from HARD-irq
      context.
      
      I went through all the 30 odd hrtimer callback functions in the kernel
      and saw only one that I'm not quite sure of, which is the one in
      net/can/bcm.c - hence I'm CC-ing the folks responsible for that code.
      
      Furthermore, the hrtimer core now calls callbacks directly with IRQs
      disabled in case you try to enqueue an expired timer. If this timer is a
      periodic timer (which should use hrtimer_forward() to advance its time)
      then it might be possible to end up in an infinite recursive loop due to the
      fact that hrtimer_forward() doesn't round up to the next timer
      granularity, and therefore keeps on calling the callback - obviously
      this needs a fix.
      
      Aside from that, this seems to compile and actually boot on my dual core
      test box - although I'm sure there are some bugs in it, me not hitting any
      makes me certain :-)
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
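      Below is a user-space sketch of the forwarding arithmetic that hrtimer_forward()
      performs. It advances the expiry by whole intervals until the expiry lies
      strictly after now, and reports how many intervals were added. The names are
      hypothetical. The `resolution` floor is an assumption, included to show the
      kind of rounding up to timer granularity that the message says is missing.

```c
#include <stdint.h>

/* Move *expires forward in steps of interval until it is strictly after now.
 * Returns the number of intervals added (0 if the timer has not expired). */
static uint64_t model_hrtimer_forward(int64_t *expires, int64_t now,
				      int64_t interval, int64_t resolution)
{
	uint64_t orun = 1;
	int64_t delta = now - *expires;

	if (delta < 0)
		return 0; /* not expired yet: nothing to do */

	if (interval < resolution)
		interval = resolution; /* assumed floor: never step below granularity */
	if (interval <= 0)
		return 0; /* hypothetical guard against a zero or negative period */

	if (delta >= interval) {
		/* Skip all whole periods that have already passed in one step;
		 * expires is then still <= now, so one more period follows. */
		orun = (uint64_t)(delta / interval);
		*expires += interval * (int64_t)orun;
		orun++;
	}
	*expires += interval;
	return orun;
}
```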
    • sched: convert nohz_cpu_mask to cpumask_var_t. · 6a7b3dc3
      Rusty Russell committed
      Impact: (future) size reduction for large NR_CPUS.
      
      Dynamically allocating cpumasks (when CONFIG_CPUMASK_OFFSTACK) saves
      space for small nr_cpu_ids but big CONFIG_NR_CPUS.  cpumask_var_t
      is just a struct cpumask for !CONFIG_CPUMASK_OFFSTACK.
      Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
  22. 11 Nov, 2008: 1 commit
    • nohz: disable tick_nohz_kick_tick() for now · ae99286b
      Thomas Gleixner committed
      Impact: nohz powersavings and wakeup regression
      
      commit fb02fbc1 (NOHZ: restart tick device from irq_enter()) causes a
      serious wakeup regression.
      
      While the patch is correct, it does not take into account that spurious
      wakeups happen on x86. A fix for this issue is available, but we just
      revert to the .27 behaviour and let long running softirqs screw
      themselves.
      
      Disable it for now.
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
  23. 22 Oct, 2008: 1 commit
    • NOHZ: fix thinko in the timer restart code path · c4bd822e
      Thomas Gleixner committed
      commit fb02fbc1 (NOHZ: restart tick device from irq_enter())
      
      solves the problem of stale jiffies when long running softirqs happen
      in a long idle sleep period, but it has a major thinko in it:
      
      When the interrupt which came in _is_ the timer interrupt which should
      expire ts->sched_timer then we cancel and rearm the timer _before_ it
      gets expired in hrtimer_interrupt() to the next period. That means the
      callback function is not called. This game can go on forever :(
      
      Prevent this by making sure to only rearm the timer when the expiry
      time is more than one tick_period away. Otherwise keep it running as
      it is either already expired or will expire at the right point to
      update jiffies.
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Tested-by: Venkatesch Pallipadi <venkatesh.pallipadi@intel.com>
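      Below is a sketch of the corrected condition, with hypothetical names: rearm
      only when the programmed expiry is more than one tick_period in the future.

```c
#include <stdbool.h>
#include <stdint.h>

/* Decide whether irq_enter() should cancel and rearm the tick timer. */
static bool model_should_rearm(int64_t now_ns, int64_t expires_ns,
			       int64_t tick_period_ns)
{
	/* At or within one tick period (or already past): leave the timer
	 * alone so hrtimer_interrupt() still runs its callback. */
	return expires_ns - now_ns > tick_period_ns;
}
```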
  24. 18 Oct, 2008: 3 commits
    • NOHZ: restart tick device from irq_enter() · fb02fbc1
      Thomas Gleixner committed
      We did not restart the tick device from irq_enter() to avoid double
      reprogramming and extra events in the immediate return-to-idle case.
      
      But long lasting softirqs can lead to a situation where jiffies become
      stale:
      
      idle()
        tick stopped (reprogrammed to next pending timer)
        halt()
         interrupt
           jiffies updated from irq_enter()
           interrupt handler
           softirq function 1 runs 20ms
           softirq function 2 arms a 10ms timer with a stale jiffies value
           jiffies updated from irq_exit()
           timer wheel has now an already expired timer
           (the one added in function 2)
           timer fires and timer softirq runs
      
      This was discovered when debugging a timer problem which happened only
      when the ath5k driver is active. The debugging proved that there is a
      softirq function running for more than 20ms, which is a bug by itself.
      
      To solve this we restart the tick timer right from irq_enter(), but do
      not go through the other functions which are necessary to return from
      idle when need_resched() is set.
      Reported-by: Elias Oltmanns <eo@nebensachen.de>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Tested-by: Elias Oltmanns <eo@nebensachen.de>
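      The timeline in this message can be replayed in a small user-space model.
      The model assumes HZ = 1000 (one jiffy per millisecond) and collapses the
      periodic tick into a single jiffies update at the end of each softirq. All
      names are hypothetical. With the tick left stopped, the timer armed by softirq
      function 2 has already expired at irq_exit(). Restarting the tick in
      irq_enter() keeps jiffies current.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model: HZ = 1000, so one jiffy equals one millisecond. */
struct model_clock {
	uint64_t now_ms;   /* real time */
	uint64_t jiffies;  /* only advanced by the tick or explicit updates */
	bool tick_running;
};

static void model_update_jiffies(struct model_clock *c)
{
	c->jiffies = c->now_ms;
}

/* Let time pass; jiffies only keeps up while the tick is running. */
static void model_run(struct model_clock *c, uint64_t ms)
{
	c->now_ms += ms;
	if (c->tick_running)
		model_update_jiffies(c);
}

static uint64_t model_arm_timer(const struct model_clock *c, uint64_t delay_ms)
{
	return c->jiffies + delay_ms;
}

static bool model_timer_expired(const struct model_clock *c, uint64_t expires)
{
	return c->jiffies >= expires;
}

/* One interrupt out of idle: softirq 1 runs 20 ms, softirq 2 arms a
 * 10 ms timer. Returns true if that timer is already expired at irq_exit(). */
static bool model_interrupt_from_idle(struct model_clock *c, bool restart_tick)
{
	model_update_jiffies(c);              /* irq_enter() */
	c->tick_running = restart_tick;
	model_run(c, 20);                     /* softirq function 1 */
	uint64_t expires = model_arm_timer(c, 10); /* softirq function 2 */
	model_update_jiffies(c);              /* irq_exit() */
	return model_timer_expired(c, expires);
}
```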
    • NOHZ: split tick_nohz_restart_sched_tick() · c34bec5a
      Thomas Gleixner committed
      Split out the clock event device reprogramming. Preparatory
      patch.
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
    • NOHZ: unify the nohz function calls in irq_enter() · 719254fa
      Thomas Gleixner committed
      We have two separate nohz function calls in irq_enter() for no good
      reason. Just call a single NOHZ function from irq_enter() and call
      the bits in the tick code.
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
  25. 10 Oct, 2008: 1 commit