1. 06 5月, 2011 15 次提交
    • P
      rcu: add grace-period age and more kthread state to tracing · 15ba0ba8
      Paul E. McKenney 提交于
      This commit adds the age in jiffies of the current grace period along
      with the duration in jiffies of the longest grace period since boot
      to the rcu/rcugp debugfs file.  It also adds an additional "O" state
      to kthread tracing to differentiate between the kthread waiting due to
      having nothing to do on the one hand and waiting due to being on the
      wrong CPU on the other hand.
      Signed-off-by: NPaul E. McKenney <paul.mckenney@linaro.org>
      Signed-off-by: NPaul E. McKenney <paulmck@linux.vnet.ibm.com>
      15ba0ba8
    • P
      rcu: fix tracing bug thinko on boost-balk attribution · a9f4793d
      Paul E. McKenney 提交于
      The rcu_initiate_boost_trace() function mis-attributed refusals to
      initiate RCU priority boosting that were in fact due to its not yet
      being time to boost.  This patch fixes the faulty comparison.
      Signed-off-by: NPaul E. McKenney <paul.mckenney@linaro.org>
      Signed-off-by: NPaul E. McKenney <paulmck@linux.vnet.ibm.com>
      a9f4793d
    • P
      rcu: make rcutorture version numbers available through debugfs · 4a298656
      Paul E. McKenney 提交于
      It is not possible to accurately correlate rcutorture output with that
      of debugfs.  This patch therefore adds a debugfs file that prints out
      the rcutorture version number, permitting easy correlation.
      Signed-off-by: NPaul E. McKenney <paul.mckenney@linaro.org>
      Signed-off-by: NPaul E. McKenney <paulmck@linux.vnet.ibm.com>
      Reviewed-by: NJosh Triplett <josh@joshtriplett.org>
      4a298656
    • P
      rcu: add tracing for RCU's kthread run states. · d71df90e
      Paul E. McKenney 提交于
      Add tracing to help debugging situations when RCU's kthreads are not
      running but are supposed to be.
      Signed-off-by: NPaul E. McKenney <paul.mckenney@linaro.org>
      Signed-off-by: NPaul E. McKenney <paulmck@linux.vnet.ibm.com>
      Reviewed-by: NJosh Triplett <josh@joshtriplett.org>
      d71df90e
    • P
      rcu: add callback-queue information to rcudata output · 0ac3d136
      Paul E. McKenney 提交于
      This commit adds an indication of the state of the callback queue using
      a string of four characters following the "ql=" integer queue length.
      The first character is "N" if there are callbacks that have been
      queued that are not yet ready to be handled by the next grace period, or
      "." otherwise.  The second character is "R" if there are callbacks queued
      that are ready to be handled by the next grace period, or "." otherwise.
      The third character is "W" if there are callbacks waiting for the current
      grace period, or "." otherwise.  Finally, the fourth character is "D"
      if there are callbacks that have been handled by a prior grace period
      and are waiting to be invoked, or ".".
      
      Note that callbacks that are in the process of being invoked are
      not shown.  These callbacks would have been removed from the rcu_data
      structure's list by rcu_do_batch() prior to being executed.  (These
      callbacks are also not reflected in the "ql=" total, FWIW.)
      
      Also, document the new callback-queue trace information.
      Signed-off-by: NPaul E. McKenney <paul.mckenney@linaro.org>
      Signed-off-by: NPaul E. McKenney <paulmck@linux.vnet.ibm.com>
      Reviewed-by: NJosh Triplett <josh@joshtriplett.org>
      0ac3d136
    • P
      rcu: Add boosting to TREE_PREEMPT_RCU tracing · 0ea1f2eb
      Paul E. McKenney 提交于
      Includes total number of tasks boosted, number boosted on behalf of each
      of normal and expedited grace periods, and statistics on attempts to
      initiate boosting that failed for various reasons.
      Signed-off-by: NPaul E. McKenney <paul.mckenney@linaro.org>
      Signed-off-by: NPaul E. McKenney <paulmck@linux.vnet.ibm.com>
      Reviewed-by: NJosh Triplett <josh@joshtriplett.org>
      0ea1f2eb
    • P
      rcu: eliminate unused boosting statistics · 67b98dba
      Paul E. McKenney 提交于
      The n_rcu_torture_boost_allocerror and n_rcu_torture_boost_afferror
      statistics are not actually incremented anymore, so eliminate them.
      Signed-off-by: NPaul E. McKenney <paul.mckenney@linaro.org>
      Signed-off-by: NPaul E. McKenney <paulmck@linux.vnet.ibm.com>
      Reviewed-by: NJosh Triplett <josh@joshtriplett.org>
      67b98dba
    • P
      rcu: avoid hammering sched with yet another bound RT kthread · 3acf4a9a
      Paul E. McKenney 提交于
      The scheduler does not appear to take kindly to having multiple
      real-time threads bound to a CPU that is going offline.  So this
      commit is a temporary hack-around to avoid that happening.
      Signed-off-by: NPaul E. McKenney <paul.mckenney@linaro.org>
      Signed-off-by: NPaul E. McKenney <paulmck@linux.vnet.ibm.com>
      3acf4a9a
    • P
      rcu: put per-CPU kthread at non-RT priority during CPU hotplug operations · e3995a25
      Paul E. McKenney 提交于
      If you are doing CPU hotplug operations, it is best not to have
      CPU-bound realtime tasks running CPU-bound on the outgoing CPU.
      So this commit makes per-CPU kthreads run at non-realtime priority
      during that time.
      Signed-off-by: NPaul E. McKenney <paul.mckenney@linaro.org>
      Signed-off-by: NPaul E. McKenney <paulmck@linux.vnet.ibm.com>
      Reviewed-by: NJosh Triplett <josh@joshtriplett.org>
      e3995a25
    • P
      rcu: Force per-rcu_node kthreads off of the outgoing CPU · 0f962a5e
      Paul E. McKenney 提交于
      The scheduler has had some heartburn in the past when too many real-time
      kthreads were affinitied to the outgoing CPU.  So, this commit lightens
      the load by forcing the per-rcu_node and the boost kthreads off of the
      outgoing CPU.  Note that RCU's per-CPU kthread remains on the outgoing
      CPU until the bitter end, as it must in order to preserve correctness.
      
      Also avoid disabling hardirqs across calls to set_cpus_allowed_ptr(),
      given that this function can block.
      Signed-off-by: NPaul E. McKenney <paul.mckenney@linaro.org>
      Signed-off-by: NPaul E. McKenney <paulmck@linux.vnet.ibm.com>
      0f962a5e
    • P
      rcu: priority boosting for TREE_PREEMPT_RCU · 27f4d280
      Paul E. McKenney 提交于
      Add priority boosting for TREE_PREEMPT_RCU, similar to that for
      TINY_PREEMPT_RCU.  This is enabled by the default-off RCU_BOOST
      kernel parameter.  The priority to which to boost preempted
      RCU readers is controlled by the RCU_BOOST_PRIO kernel parameter
      (defaulting to real-time priority 1) and the time to wait before
      boosting the readers who are blocking a given grace period is
      controlled by the RCU_BOOST_DELAY kernel parameter (defaulting to
      500 milliseconds).
      Signed-off-by: NPaul E. McKenney <paul.mckenney@linaro.org>
      Signed-off-by: NPaul E. McKenney <paulmck@linux.vnet.ibm.com>
      Reviewed-by: NJosh Triplett <josh@joshtriplett.org>
      27f4d280
    • P
      rcu: move TREE_RCU from softirq to kthread · a26ac245
      Paul E. McKenney 提交于
      If RCU priority boosting is to be meaningful, callback invocation must
      be boosted in addition to preempted RCU readers.  Otherwise, in presence
      of CPU real-time threads, the grace period ends, but the callbacks don't
      get invoked.  If the callbacks don't get invoked, the associated memory
      doesn't get freed, so the system is still subject to OOM.
      
      But it is not reasonable to priority-boost RCU_SOFTIRQ, so this commit
      moves the callback invocations to a kthread, which can be boosted easily.
      
      Also add comments and properly synchronized all accesses to
      rcu_cpu_kthread_task, as suggested by Lai Jiangshan.
      Signed-off-by: NPaul E. McKenney <paul.mckenney@linaro.org>
      Signed-off-by: NPaul E. McKenney <paulmck@linux.vnet.ibm.com>
      Reviewed-by: NJosh Triplett <josh@joshtriplett.org>
      a26ac245
    • P
      rcu: merge TREE_PREEPT_RCU blocked_tasks[] lists · 12f5f524
      Paul E. McKenney 提交于
      Combine the current TREE_PREEMPT_RCU ->blocked_tasks[] lists in the
      rcu_node structure into a single ->blkd_tasks list with ->gp_tasks
      and ->exp_tasks tail pointers.  This is in preparation for RCU priority
      boosting, which will add a third dimension to the combinatorial explosion
      in the ->blocked_tasks[] case, but simply a third pointer in the new
      ->blkd_tasks case.
      
      Also update documentation to reflect blocked_tasks[] merge
      Signed-off-by: NPaul E. McKenney <paul.mckenney@linaro.org>
      Signed-off-by: NPaul E. McKenney <paulmck@linux.vnet.ibm.com>
      Reviewed-by: NJosh Triplett <josh@joshtriplett.org>
      12f5f524
    • P
      rcu: Decrease memory-barrier usage based on semi-formal proof · e59fb312
      Paul E. McKenney 提交于
      Commit d09b62df fixed grace-period synchronization, but left some smp_mb()
      invocations in rcu_process_callbacks() that are no longer needed, but
      sheer paranoia prevented them from being removed.  This commit removes
      them and provides a proof of correctness in their absence.  It also adds
      a memory barrier to rcu_report_qs_rsp() immediately before the update to
      rsp->completed in order to handle the theoretical possibility that the
      compiler or CPU might move massive quantities of code into a lock-based
      critical section.  This also proves that the sheer paranoia was not
      entirely unjustified, at least from a theoretical point of view.
      
      In addition, the old dyntick-idle synchronization depended on the fact
      that grace periods were many milliseconds in duration, so that it could
      be assumed that no dyntick-idle CPU could reorder a memory reference
      across an entire grace period.  Unfortunately for this design, the
      addition of expedited grace periods breaks this assumption, which has
      the unfortunate side-effect of requiring atomic operations in the
      functions that track dyntick-idle state for RCU.  (There is some hope
      that the algorithms used in user-level RCU might be applied here, but
      some work is required to handle the NMIs that user-space applications
      can happily ignore.  For the short term, better safe than sorry.)
      
      This proof assumes that neither compiler nor CPU will allow a lock
      acquisition and release to be reordered, as doing so can result in
      deadlock.  The proof is as follows:
      
      1.	A given CPU declares a quiescent state under the protection of
      	its leaf rcu_node's lock.
      
      2.	If there is more than one level of rcu_node hierarchy, the
      	last CPU to declare a quiescent state will also acquire the
      	->lock of the next rcu_node up in the hierarchy,  but only
      	after releasing the lower level's lock.  The acquisition of this
      	lock clearly cannot occur prior to the acquisition of the leaf
      	node's lock.
      
      3.	Step 2 repeats until we reach the root rcu_node structure.
      	Please note again that only one lock is held at a time through
      	this process.  The acquisition of the root rcu_node's ->lock
      	must occur after the release of that of the leaf rcu_node.
      
      4.	At this point, we set the ->completed field in the rcu_state
      	structure in rcu_report_qs_rsp().  However, if the rcu_node
      	hierarchy contains only one rcu_node, then in theory the code
      	preceding the quiescent state could leak into the critical
      	section.  We therefore precede the update of ->completed with a
      	memory barrier.  All CPUs will therefore agree that any updates
      	preceding any report of a quiescent state will have happened
      	before the update of ->completed.
      
      5.	Regardless of whether a new grace period is needed, rcu_start_gp()
      	will propagate the new value of ->completed to all of the leaf
      	rcu_node structures, under the protection of each rcu_node's ->lock.
      	If a new grace period is needed immediately, this propagation
      	will occur in the same critical section that ->completed was
      	set in, but courtesy of the memory barrier in #4 above, is still
      	seen to follow any pre-quiescent-state activity.
      
      6.	When a given CPU invokes __rcu_process_gp_end(), it becomes
      	aware of the end of the old grace period and therefore makes
      	any RCU callbacks that were waiting on that grace period eligible
      	for invocation.
      
      	If this CPU is the same one that detected the end of the grace
      	period, and if there is but a single rcu_node in the hierarchy,
      	we will still be in the single critical section.  In this case,
      	the memory barrier in step #4 guarantees that all callbacks will
      	be seen to execute after each CPU's quiescent state.
      
      	On the other hand, if this is a different CPU, it will acquire
      	the leaf rcu_node's ->lock, and will again be serialized after
      	each CPU's quiescent state for the old grace period.
      
      On the strength of this proof, this commit therefore removes the memory
      barriers from rcu_process_callbacks() and adds one to rcu_report_qs_rsp().
      The effect is to reduce the number of memory barriers by one and to
      reduce the frequency of execution from about once per scheduling tick
      per CPU to once per grace period.
      Signed-off-by: NPaul E. McKenney <paulmck@linux.vnet.ibm.com>
      Reviewed-by: NJosh Triplett <josh@joshtriplett.org>
      e59fb312
    • P
      rcu: Remove conditional compilation for RCU CPU stall warnings · a00e0d71
      Paul E. McKenney 提交于
      The RCU CPU stall warnings can now be controlled using the
      rcu_cpu_stall_suppress boot-time parameter or via the same parameter
      from sysfs.  There is therefore no longer any reason to have
      kernel config parameters for this feature.  This commit therefore
      removes the RCU_CPU_STALL_DETECTOR and RCU_CPU_STALL_DETECTOR_RUNNABLE
      kernel config parameters.  The RCU_CPU_STALL_TIMEOUT parameter remains
      to allow the timeout to be tuned and the RCU_CPU_STALL_VERBOSE parameter
      remains to allow task-stall information to be suppressed if desired.
      Signed-off-by: NPaul E. McKenney <paulmck@linux.vnet.ibm.com>
      Reviewed-by: NJosh Triplett <josh@joshtriplett.org>
      a00e0d71
  2. 03 5月, 2011 1 次提交
  3. 30 4月, 2011 1 次提交
  4. 29 4月, 2011 2 次提交
    • T
      hrtimer: Initialize CLOCK_ID to HRTIMER_BASE table statically · ce31332d
      Thomas Gleixner 提交于
      Sedat and Bruno reported RCU stalls which turned out to be caused by
      the following;
      
      sched_init() calls init_rt_bandwidth() which calls hrtimer_init()
      _BEFORE_ hrtimers_init() is called. While not entirely correct this
      worked because hrtimer_init() only accessed statically initialized
      data (hrtimer_bases.clock_base[CLOCK_MONOTONIC])
      
      Commit e06383db (hrtimers: extend hrtimer base code to handle more
      then 2 clockids) added an indirection to the hrtimer_bases.clock_base
      lookup to avoid gap handling in the hot path. The table which is used
      for the translataion from CLOCK_ID to HRTIMER_BASE index is
      initialized at runtime in hrtimers_init(). So the early call of the
      scheduler code translates CLOCK_MONOTONIC to HRTIMER_BASE_REALTIME.
      
      Thus the rt_bandwith timer ends up on CLOCK_REALTIME. If the timer is
      armed and the wall clock time is set (e.g. ntpdate in the early boot
      process - which also gives the problem deterministic behaviour
      i.e. magic recovery after N hours), then the timer ends up with an
      expiry time far into the future. That breaks the RT throttler
      mechanism as rt runtime is accumulated and never cleared, so the rt
      throttler detects a false cpu hog condition and blocks all RT tasks
      until the timer finally expires. That in turn stalls the RCU thread of
      TINYRCU which leads to an huge amount of RCU callbacks piling up.
      
      Make the translation table statically initialized, so we are back to
      the status of <= 2.6.39.
      Reported-and-tested-by: NSedat Dilek <sedat.dilek@gmail.com>
      Reported-by: NBruno Prémont <bonbons@linux-vserver.org>
      Cc: John stultz <johnstul@us.ibm.com>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Link: http://lkml.kernel.org/r/%3Calpine.LFD.2.02.1104282353140.3005%40ionos%3EReviewed-by: NIngo Molnar <mingo@elte.hu>
      Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
      ce31332d
    • H
      kernel/watchdog.c: disable nmi perf event in the error path of enabling watchdog · 1409f141
      Hillf Danton 提交于
      In corner cases where softlockup watchdog is not setup successfully, the
      relevant nmi perf event for hardlockup watchdog could be disabled, then
      the status of the underlying hardware remains unchanged.
      
      Also, if the kthread doesn't start then the hrtimer won't run and the
      hardlockup detector will falsely fire.
      Signed-off-by: NHillf Danton <dhillf@gmail.com>
      Signed-off-by: NDon Zickus <dzickus@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      1409f141
  5. 21 4月, 2011 1 次提交
  6. 20 4月, 2011 1 次提交
  7. 19 4月, 2011 2 次提交
    • R
      PM: Fix error code paths executed after failing syscore_suspend() · 2ca6f62f
      Rafael J. Wysocki 提交于
      If syscore_suspend() fails in suspend_enter(), create_image() or
      resume_target_kernel(), it is necessary to call sysdev_resume(),
      because sysdev_suspend() has been called already and succeeded
      and we are going to abort the transition.
      Signed-off-by: NRafael J. Wysocki <rjw@sisk.pl>
      Acked-by: NGreg Kroah-Hartman <gregkh@suse.de>
      2ca6f62f
    • L
      next_pidmap: fix overflow condition · c78193e9
      Linus Torvalds 提交于
      next_pidmap() just quietly accepted whatever 'last' pid that was passed
      in, which is not all that safe when one of the users is /proc.
      
      Admittedly the proc code should do some sanity checking on the range
      (and that will be the next commit), but that doesn't mean that the
      helper functions should just do that pidmap pointer arithmetic without
      checking the range of its arguments.
      
      So clamp 'last' to PID_MAX_LIMIT.  The fact that we then do "last+1"
      doesn't really matter, the for-loop does check against the end of the
      pidmap array properly (it's only the actual pointer arithmetic overflow
      case we need to worry about, and going one bit beyond isn't going to
      overflow).
      
      [ Use PID_MAX_LIMIT rather than pid_max as per Eric Biederman ]
      Reported-by: NTavis Ormandy <taviso@cmpxchg8b.com>
      Analyzed-by: NRobert Święcki <robert@swiecki.net>
      Cc: Eric W. Biederman <ebiederm@xmission.com>
      Cc: Pavel Emelyanov <xemul@openvz.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      c78193e9
  8. 18 4月, 2011 1 次提交
  9. 16 4月, 2011 2 次提交
    • J
      block: make unplug timer trace event correspond to the schedule() unplug · 49cac01e
      Jens Axboe 提交于
      It's a pretty close match to what we had before - the timer triggering
      would mean that nobody unplugged the plug in due time, in the new
      scheme this matches very closely what the schedule() unplug now is.
      It's essentially the difference between an explicit unplug (IO unplug)
      or an implicit unplug (timer unplug, we scheduled with pending IO
      queued).
      Signed-off-by: NJens Axboe <jaxboe@fusionio.com>
      49cac01e
    • J
      block: let io_schedule() flush the plug inline · a237c1c5
      Jens Axboe 提交于
      Linus correctly observes that the most important dispatch cases
      are now done from kblockd, this isn't ideal for latency reasons.
      The original reason for switching dispatches out-of-line was to
      avoid too deep a stack, so by _only_ letting the "accidental"
      flush directly in schedule() be guarded by offload to kblockd,
      we should be able to get the best of both worlds.
      
      So add a blk_schedule_flush_plug() that offloads to kblockd,
      and only use that from the schedule() path.
      Signed-off-by: NJens Axboe <jaxboe@fusionio.com>
      a237c1c5
  10. 15 4月, 2011 1 次提交
  11. 13 4月, 2011 1 次提交
  12. 12 4月, 2011 4 次提交
  13. 11 4月, 2011 3 次提交
    • K
      sched: Fix erroneous all_pinned logic · b30aef17
      Ken Chen 提交于
      The scheduler load balancer has specific code to deal with cases of
      unbalanced system due to lots of unmovable tasks (for example because of
      hard CPU affinity). In those situation, it excludes the busiest CPU that
      has pinned tasks for load balance consideration such that it can perform
      second 2nd load balance pass on the rest of the system.
      
      This all works as designed if there is only one cgroup in the system.
      
      However, when we have multiple cgroups, this logic has false positives and
      triggers multiple load balance passes despite there are actually no pinned
      tasks at all.
      
      The reason it has false positives is that the all pinned logic is deep in
      the lowest function of can_migrate_task() and is too low level:
      
      load_balance_fair() iterates each task group and calls balance_tasks() to
      migrate target load. Along the way, balance_tasks() will also set a
      all_pinned variable. Given that task-groups are iterated, this all_pinned
      variable is essentially the status of last group in the scanning process.
      Task group can have number of reasons that no load being migrated, none
      due to cpu affinity. However, this status bit is being propagated back up
      to the higher level load_balance(), which incorrectly think that no tasks
      were moved.  It kick off the all pinned logic and start multiple passes
      attempt to move load onto puller CPU.
      
      To fix this, move the all_pinned aggregation up at the iterator level.
      This ensures that the status is aggregated over all task-groups, not just
      last one in the list.
      Signed-off-by: NKen Chen <kenchen@google.com>
      Cc: stable@kernel.org
      Signed-off-by: NPeter Zijlstra <a.p.zijlstra@chello.nl>
      Link: http://lkml.kernel.org/r/BANLkTi=ernzNawaR5tJZEsV_QVnfxqXmsQ@mail.gmail.comSigned-off-by: NIngo Molnar <mingo@elte.hu>
      b30aef17
    • K
      sched: Fix sched-domain avg_load calculation · b0432d8f
      Ken Chen 提交于
      In function find_busiest_group(), the sched-domain avg_load isn't
      calculated at all if there is a group imbalance within the domain. This
      will cause erroneous imbalance calculation.
      
      The reason is that calculate_imbalance() sees sds->avg_load = 0 and it
      will dump entire sds->max_load into imbalance variable, which is used
      later on to migrate entire load from busiest CPU to the puller CPU.
      
      This has two really bad effect:
      
      1. stampede of task migration, and they won't be able to break out
         of the bad state because of positive feedback loop: large load
         delta -> heavier load migration -> larger imbalance and the cycle
         goes on.
      
      2. severe imbalance in CPU queue depth.  This causes really long
         scheduling latency blip which affects badly on application that
         has tight latency requirement.
      
      The fix is to have kernel calculate domain avg_load in both cases. This
      will ensure that imbalance calculation is always sensible and the target
      is usually half way between busiest and puller CPU.
      Signed-off-by: NKen Chen <kenchen@google.com>
      Signed-off-by: NPeter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: <stable@kernel.org>
      Link: http://lkml.kernel.org/r/20110408002322.3A0D812217F@elm.corp.google.comSigned-off-by: NIngo Molnar <mingo@elte.hu>
      b0432d8f
    • S
      perf_event: Fix cgrp event scheduling bug in perf_enable_on_exec() · e566b76e
      Stephane Eranian 提交于
      There is a bug in perf_event_enable_on_exec() when cgroup events are
      active on a CPU: the cgroup events may be scheduled twice causing event
      state corruptions which eventually may lead to kernel panics.
      
      The reason is that the function needs to first schedule out the cgroup
      events, just like for the per-thread events. The cgroup event are
      scheduled back in automatically from the perf_event_context_sched_in()
      function.
      
      The patch also adds a WARN_ON_ONCE() is perf_cgroup_switch() to catch any
      bogus state.
      Signed-off-by: NStephane Eranian <eranian@google.com>
      Signed-off-by: NPeter Zijlstra <a.p.zijlstra@chello.nl>
      Link: http://lkml.kernel.org/r/20110406005454.GA1062@quadSigned-off-by: NIngo Molnar <mingo@elte.hu>
      e566b76e
  14. 09 4月, 2011 1 次提交
  15. 05 4月, 2011 3 次提交
  16. 04 4月, 2011 1 次提交