1. 23 Sep, 2012 19 commits
    • P
      rcu: Adjust for unconditional ->completed assignment · 25d30cf4
      Paul E. McKenney authored
      Now that the rcu_node structures' ->completed fields are unconditionally
      assigned at grace-period cleanup time, they should already have the
      correct value for the new grace period at grace-period initialization
      time.  This commit therefore inserts a WARN_ON_ONCE() to verify this
      invariant.
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Reviewed-by: Josh Triplett <josh@joshtriplett.org>
      25d30cf4
    • P
      rcu: Add random PROVE_RCU_DELAY to grace-period initialization · 661a85dc
      Paul E. McKenney authored
      Preemption greatly raised the probability of certain types of race
      conditions, so this commit adds an anti-heisenbug to greatly increase
      the collision cross section, also known as the probability of occurrence.
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Reviewed-by: Josh Triplett <josh@joshtriplett.org>
      661a85dc
    • P
      rcu: Fix day-zero grace-period initialization/cleanup race · 5d4b8659
      Paul E. McKenney authored
      The current approach to grace-period initialization is vulnerable to
      extremely low-probability races.  These races stem from the fact that
      the old grace period is marked completed on the same traversal through
      the rcu_node structure that is marking the start of the new grace period.
      This means that some rcu_node structures will believe that the old grace
      period is still in effect at the same time that other rcu_node structures
      believe that the new grace period has already started.
      
      These sorts of disagreements can result in too-short grace periods,
      as shown in the following scenario:
      
      1.	CPU 0 completes a grace period, but needs an additional
      	grace period, so starts initializing one, initializing all
      	the non-leaf rcu_node structures and the first leaf rcu_node
      	structure.  Because CPU 0 is both completing the old grace
      	period and starting a new one, it marks the completion of
      	the old grace period and the start of the new grace period
      	in a single traversal of the rcu_node structures.
      
      	Therefore, CPUs corresponding to the first rcu_node structure
      	can become aware that the prior grace period has completed, but
      	CPUs corresponding to the other rcu_node structures will see
      	this same prior grace period as still being in progress.
      
      2.	CPU 1 passes through a quiescent state, and therefore informs
      	the RCU core.  Because its leaf rcu_node structure has already
      	been initialized, this CPU's quiescent state is applied to the
      	new (and only partially initialized) grace period.
      
      3.	CPU 1 enters an RCU read-side critical section and acquires
      	a reference to data item A.  Note that this CPU believes that
      	its critical section started after the beginning of the new
      	grace period, and therefore will not block this new grace period.
      
      4.	CPU 16 exits dyntick-idle mode.  Because it was in dyntick-idle
      	mode, other CPUs informed the RCU core of its extended quiescent
      	state for the past several grace periods.  This means that CPU 16
      	is not yet aware that these past grace periods have ended.  Assume
      	that CPU 16 corresponds to the second leaf rcu_node structure --
      	which has not yet been made aware of the new grace period.
      
      5.	CPU 16 removes data item A from its enclosing data structure
      	and passes it to call_rcu(), which queues a callback in the
      	RCU_NEXT_TAIL segment of the callback queue.
      
      6.	CPU 16 enters the RCU core, possibly because it has taken a
      	scheduling-clock interrupt, or alternatively because it has
      	more than 10,000 callbacks queued.  It notes that the second
      	most recent grace period has completed (recall that because it
      	corresponds to the second as-yet-uninitialized rcu_node structure,
      	it cannot yet become aware that the most recent grace period has
      	completed), and therefore advances its callbacks.  The callback
      	for data item A is therefore in the RCU_NEXT_READY_TAIL segment
      	of the callback queue.
      
      7.	CPU 0 completes initialization of the remaining leaf rcu_node
      	structures for the new grace period, including the structure
      	corresponding to CPU 16.
      
      8.	CPU 16 again enters the RCU core, again possibly because it has
      	taken a scheduling-clock interrupt, or alternatively because
      	it now has more than 10,000 callbacks queued.  It notes that
      	the most recent grace period has ended, and therefore advances
      	its callbacks.  The callback for data item A is therefore in
      	the RCU_WAIT_TAIL segment of the callback queue.
      
      9.	All CPUs other than CPU 1 pass through quiescent states.  Because
      	CPU 1 already passed through its quiescent state, the new grace
      	period completes.  Note that CPU 1 is still in its RCU read-side
      	critical section, still referencing data item A.
      
      10.	Suppose that CPU 2 was the last CPU to pass through a quiescent
      	state for the new grace period, and suppose further that CPU 2
      	did not have any callbacks queued, therefore not needing an
      	additional grace period.  CPU 2 therefore traverses all of the
      	rcu_node structures, marking the new grace period as completed,
      	but does not initialize a new grace period.
      
      11.	CPU 16 yet again enters the RCU core, yet again possibly because
      	it has taken a scheduling-clock interrupt, or alternatively
      	because it now has more than 10,000 callbacks queued.	It notes
      	that the new grace period has ended, and therefore advances
      	its callbacks.	The callback for data item A is therefore in
      	the RCU_DONE_TAIL segment of the callback queue.  This means
      	that this callback is now considered ready to be invoked.
      
      12.	CPU 16 invokes the callback, freeing data item A while CPU 1
      	is still referencing it.
      
      This scenario represents a day-zero bug for TREE_RCU.  This commit
      therefore ensures that the old grace period is marked completed in
      all leaf rcu_node structures before a new grace period is marked
      started in any of them.
      
      That said, it would have been insanely difficult to force this race to
      happen before the grace-period initialization process was preemptible.
      Therefore, this commit is not a candidate for -stable.
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Reviewed-by: Josh Triplett <josh@joshtriplett.org>
      
      Conflicts:
      
      	kernel/rcutree.c
      5d4b8659
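The ordering that this commit enforces can be modeled in userspace. What follows is a minimal sketch, not the kernel's code: `struct node`, `gp_cleanup()`, `gp_init()`, and `nodes_disagree()` are hypothetical names, and the rcu_node tree is flattened into an array. It assumes that the ->gpnum and ->completed counters are the only state that matters, and it shows that finishing the cleanup pass before starting the initialization pass leaves no window in which one node regards the old grace period as running while another has already started the new one.

```c
#include <assert.h>
#include <stddef.h>

#define NUM_NODES 4	/* hypothetical: the tree flattened into an array */

/* Toy stand-in for rcu_node: only the two grace-period counters. */
struct node {
	unsigned long gpnum;		/* most recent grace period started */
	unsigned long completed;	/* most recent grace period completed */
};

/* Pass 1: mark the old grace period completed in every node. */
static void gp_cleanup(struct node *nodes, size_t n, unsigned long gp)
{
	for (size_t i = 0; i < n; i++)
		nodes[i].completed = gp;
}

/* Pass 2: only now announce the new grace period in the first n nodes. */
static void gp_init(struct node *nodes, size_t n, unsigned long new_gp)
{
	for (size_t i = 0; i < n; i++) {
		/* Invariant: the cleanup pass has already reached this node. */
		assert(nodes[i].completed == new_gp - 1);
		nodes[i].gpnum = new_gp;
	}
}

/*
 * Returns 1 if some node still sees old_gp as running while another
 * node has already started a later grace period, otherwise 0.
 */
static int nodes_disagree(const struct node *nodes, size_t n,
			  unsigned long old_gp)
{
	int old_running = 0, new_started = 0;

	for (size_t i = 0; i < n; i++) {
		if (nodes[i].completed < old_gp)
			old_running = 1;
		if (nodes[i].gpnum > old_gp)
			new_started = 1;
	}
	return old_running && new_started;
}
```

In this model, a partially completed single-pass traversal (as in step 1 of the scenario) makes `nodes_disagree()` return 1, whereas a partially completed `gp_init()` that follows a full `gp_cleanup()` does not.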
    • P
      rcu: Make rcutree module parameters visible in sysfs · 7e5c2dfb
      Paul E. McKenney authored
      The module parameters blimit, qhimark, and qlomark (and more
      recently, rcu_fanout_leaf) have permission masks of zero, so
      that their values are not visible from sysfs.  This is unnecessary
      and inconvenient to administrators who might like an easy way to
      see what these values are on a running system.  This commit therefore
      sets their permission masks to 0444, allowing them to be read but
      not written.
      Reported-by: Rusty Russell <rusty@ozlabs.org>
      Reported-by: Josh Triplett <josh@joshtriplett.org>
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Reviewed-by: Josh Triplett <josh@joshtriplett.org>
      7e5c2dfb
    • P
      rcu: Control grace-period duration from sysfs · d40011f6
      Paul E. McKenney authored
      Although almost everyone is well-served by the defaults, some uses of RCU
      benefit from shorter grace periods, while others benefit more from the
      greater efficiency provided by longer grace periods.  Situations requiring
      a large number of grace periods to elapse (and wireshark startup has
      been called out as an example of this) are helped by lower-latency
      grace periods.  Furthermore, in some embedded applications, people are
      willing to accept a small degradation in update efficiency (due to there
      being more of the shorter grace-period operations) in order to gain the
      lower latency.
      
      In contrast, those few systems with thousands of CPUs need longer grace
      periods because the CPU overhead of a grace period rises roughly
      linearly with the number of CPUs.  Such systems normally do not make
      much use of facilities that require large numbers of grace periods to
      elapse, so this is a good tradeoff.
      
      Therefore, this commit allows the durations to be controlled from sysfs.
      There are two sysfs parameters, one named "jiffies_till_first_fqs" that
      specifies the delay in jiffies from the end of grace-period initialization
      until the first attempt to force quiescent states, and the other named
      "jiffies_till_next_fqs" that specifies the delay (again in jiffies)
      between subsequent attempts to force quiescent states.  They both default
      to three jiffies, which is compatible with the old hard-coded behavior.
      
      At some future time, it may be possible to automatically increase the
      grace-period length with the number of CPUs, but we do not yet have
      sufficient data to do a good job.  Preliminary data indicates that we
      should add an additional jiffy to each of the delays for every 200 CPUs
      in the system, but more experimentation is needed.  For now, the number
      of systems with more than 1,000 CPUs is small enough that this can be
      relegated to boot-time hand tuning.
      Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Reviewed-by: Josh Triplett <josh@joshtriplett.org>
      d40011f6
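The preliminary scaling rule mentioned in this commit message (one additional jiffy per 200 CPUs on top of the three-jiffy default) can be written down as simple arithmetic. This is only a sketch of that rule for hand tuning: `suggested_fqs_jiffies()` is a hypothetical helper, not something the commit adds, and the commit message itself says more experimentation is needed before trusting the numbers.

```c
#include <assert.h>

#define DEFAULT_FQS_JIFFIES	3	/* default stated in the commit message */
#define CPUS_PER_EXTRA_JIFFY	200	/* preliminary figure, not a tuned value */

/*
 * Hypothetical helper: candidate value for jiffies_till_first_fqs or
 * jiffies_till_next_fqs on a system with ncpus CPUs.
 */
static unsigned long suggested_fqs_jiffies(unsigned int ncpus)
{
	assert(ncpus >= 1);	/* a system has at least one CPU */
	return DEFAULT_FQS_JIFFIES + ncpus / CPUS_PER_EXTRA_JIFFY;
}
```

Under this rule, systems with fewer than 200 CPUs keep the default of three jiffies, and a 4096-CPU system would be given 23.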
    • P
      rcu: Prevent force_quiescent_state() memory contention · 394f2769
      Paul E. McKenney authored
      Large systems running RCU_FAST_NO_HZ kernels see extreme memory
      contention on the rcu_state structure's ->fqslock field.  This
      can be avoided by disabling RCU_FAST_NO_HZ, either at compile time
      or at boot time (via the nohz kernel boot parameter), but large
      systems will no doubt become sensitive to energy consumption.
      This commit therefore uses a combining-tree approach to spread the
      memory contention across new cache lines in the leaf rcu_node structures.
      This can be thought of as a tournament lock that has only a try-lock
      acquisition primitive.
      
      The effect on small systems is minimal, because such systems have
      an rcu_node "tree" consisting of a single node.  In addition, this
      functionality is not used on fastpaths.
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Reviewed-by: Josh Triplett <josh@joshtriplett.org>
      394f2769
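The "tournament lock that has only a try-lock acquisition primitive" can be sketched in userspace with C11 atomics. The code below is an illustration under assumptions, not the kernel implementation: `struct fnode`, `node_trylock()`, `node_unlock()`, and `funnel_trylock()` are hypothetical names, and each node carries a single flag in place of a real spinlock. A caller climbs from its leaf toward the root, and gives up as soon as any try-lock fails, on the theory that whoever holds that lock is already on the way to doing the work.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy tree node: one try-lock per node plus a link toward the root. */
struct fnode {
	atomic_bool locked;
	struct fnode *parent;	/* NULL at the root */
};

/* Returns true if the lock was free and is now held by the caller. */
static bool node_trylock(struct fnode *n)
{
	return !atomic_exchange_explicit(&n->locked, true,
					 memory_order_acquire);
}

static void node_unlock(struct fnode *n)
{
	atomic_store_explicit(&n->locked, false, memory_order_release);
}

/*
 * Climb from a leaf toward the root, releasing each level once the
 * next attempt has been made.  Returns the root, still locked, on
 * success, or NULL if a try-lock on the path failed.
 */
static struct fnode *funnel_trylock(struct fnode *leaf)
{
	struct fnode *held = NULL;

	assert(leaf != NULL);
	for (struct fnode *n = leaf; n != NULL; n = n->parent) {
		bool got = node_trylock(n);

		if (held != NULL)
			node_unlock(held);	/* drop the level below */
		if (!got)
			return NULL;		/* lost the tournament */
		held = n;
	}
	return held;	/* the root; caller must node_unlock() it */
}
```

Because contenders from different leaves mostly collide at their own leaf or an interior node, few of them ever touch the root's cache line, which is the effect the commit message describes.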
    • P
      rcu: Adjust debugfs tracing for kthread-based quiescent-state forcing · 4605c014
      Paul E. McKenney authored
      Moving quiescent-state forcing into a kthread dispenses with the need
      for the ->n_rp_need_fqs field, so this commit removes it.
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Reviewed-by: Josh Triplett <josh@joshtriplett.org>
      4605c014
    • P
      rcu: Allow RCU quiescent-state forcing to be preempted · b4be093f
      Paul E. McKenney authored
      RCU quiescent-state forcing is currently carried out without preemption
      points, which can result in excessive latency spikes on large systems
      (many hundreds or thousands of CPUs).  This patch therefore inserts
      a voluntary preemption point into force_qs_rnp(), which should greatly
      reduce the magnitude of these spikes.
      Reported-by: Mike Galbraith <mgalbraith@suse.de>
      Reported-by: Dimitri Sivanich <sivanich@sgi.com>
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Reviewed-by: Josh Triplett <josh@joshtriplett.org>
      b4be093f
    • P
      rcu: Move quiescent-state forcing into kthread · 4cdfc175
      Paul E. McKenney authored
      As the first step towards allowing quiescent-state forcing to be
      preemptible, this commit moves RCU quiescent-state forcing into the
      same kthread that is now used to initialize and clean up after grace
      periods.  This is yet another step towards keeping scheduling
      latency down to a dull roar.
      
      Updated to change from raw_spin_lock_irqsave() to raw_spin_lock_irq()
      and to remove the now-unused rcu_state structure fields as suggested by
      Peter Zijlstra.
      Reported-by: Mike Galbraith <mgalbraith@suse.de>
      Reported-by: Dimitri Sivanich <sivanich@sgi.com>
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      4cdfc175
    • D
      rcu: Segregate rcu_state fields to improve cache locality · b402b73b
      Dimitri Sivanich authored
      The fields in the rcu_state structure that are protected by the
      root rcu_node structure's ->lock can share a cache line with the
      fields protected by ->onofflock.  This can result in excessive
      memory contention on large systems, so this commit applies
      ____cacheline_internodealigned_in_smp to the ->onofflock field in
      order to segregate them.
      Signed-off-by: Dimitri Sivanich <sivanich@sgi.com>
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Tested-by: Dimitri Sivanich <sivanich@sgi.com>
      Reviewed-by: Josh Triplett <josh@joshtriplett.org>
      b402b73b
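As a rough userspace illustration of the technique, the sketch below uses C11 `alignas` to push one field onto its own cache line. The structure, its field names, and the 64-byte line size are assumptions made for the example; the kernel's `____cacheline_internodealigned_in_smp` chooses its alignment per architecture and configuration.

```c
#include <assert.h>
#include <stdalign.h>
#include <stddef.h>

#define CACHE_LINE 64	/* assumed line size; real values vary by machine */

/* Toy stand-in for rcu_state, reduced to two differently protected fields. */
struct toy_state {
	/* Stands in for the fields guarded by the root node's ->lock. */
	int root_protected;
	/* Forced to start a new cache line, away from the field above. */
	alignas(CACHE_LINE) int onofflock;
};

/* Compile-time check that the second field really starts a cache line. */
static_assert(offsetof(struct toy_state, onofflock) % CACHE_LINE == 0,
	      "onofflock must start on a cache-line boundary");
```

The cost is padding: in this model the structure grows to two full cache lines, which is the usual trade of memory for reduced false sharing.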
    • P
      rcu: Provide OOM handler to motivate lazy RCU callbacks · b626c1b6
      Paul E. McKenney authored
      In kernels built with CONFIG_RCU_FAST_NO_HZ=y, CPUs can accumulate a
      large number of lazy callbacks, which as the name implies will be slow
      to be invoked.  This can be a problem on small-memory systems, where the
      default 6-second sleep for CPUs having only lazy RCU callbacks could well
      be fatal.  This commit therefore installs an OOM handler that ensures that
      every CPU with lazy callbacks has at least one non-lazy callback, in turn
      ensuring timely advancement for these callbacks.
      
      Updated to fix bug that disabled OOM killing, noted by Lai Jiangshan.
      
      Updated to push the for_each_rcu_flavor() loop into rcu_oom_notify_cpu(),
      thus reducing the number of IPIs, as suggested by Steven Rostedt.  Also
      to make the for_each_online_cpu() loop be preemptible.  (Later, it might
      be good to use smp_call_function(), as suggested by Peter Zijlstra.)
      Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Tested-by: Sasha Levin <levinsasha928@gmail.com>
      Reviewed-by: Josh Triplett <josh@joshtriplett.org>
      b626c1b6
    • P
      rcu: Prevent offline CPUs from executing RCU core code · bfa00b4c
      Paul E. McKenney authored
      Earlier versions of RCU invoked the RCU core from the CPU_DYING notifier
      in order to note a quiescent state for the outgoing CPU.  Because the
      CPU is marked "offline" during the execution of the CPU_DYING notifiers,
      the RCU core had to tolerate being invoked from an offline CPU.  However,
      commit b1420f1c (Make rcu_barrier() less disruptive) left only tracing
      code in the CPU_DYING notifier, so the RCU core need no longer execute
      on offline CPUs.  This commit therefore enforces this restriction.
      Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Reviewed-by: Josh Triplett <josh@joshtriplett.org>
      bfa00b4c
    • P
      rcu: Break up rcu_gp_kthread() into subfunctions · 7fdefc10
      Paul E. McKenney authored
      The rcu_gp_kthread() function is too large and furthermore needs to
      have the force_quiescent_state() code pulled in.  This commit therefore
      breaks up rcu_gp_kthread() into rcu_gp_init() and rcu_gp_cleanup().
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Reviewed-by: Josh Triplett <josh@joshtriplett.org>
      7fdefc10
    • P
      rcu: Allow RCU grace-period cleanup to be preempted · c856bafa
      Paul E. McKenney authored
      RCU grace-period cleanup is currently carried out with interrupts
      disabled, which can result in excessive latency spikes on large systems
      (many hundreds or thousands of CPUs).  This patch therefore makes the
      RCU grace-period cleanup be preemptible, including voluntary preemption
      points, which should eliminate those latency spikes.  Similar spikes from
      forcing of quiescent states will be dealt with similarly by later patches.
      
      Updated to replace uses of spin_lock_irqsave() with spin_lock_irq(), as
      suggested by Peter Zijlstra.
      Reported-by: Mike Galbraith <mgalbraith@suse.de>
      Reported-by: Dimitri Sivanich <sivanich@sgi.com>
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Reviewed-by: Josh Triplett <josh@joshtriplett.org>
      c856bafa
    • P
      rcu: Move RCU grace-period cleanup into kthread · cabc49c1
      Paul E. McKenney authored
      As a first step towards allowing grace-period cleanup to be preemptible,
      this commit moves the RCU grace-period cleanup into the same kthread
      that is now used to initialize grace periods.  This is needed to keep
      scheduling latency down to a dull roar.
      
      [ paulmck: Get rid of stray spin_lock_irqsave() calls. ]
      Reported-by: Mike Galbraith <mgalbraith@suse.de>
      Reported-by: Dimitri Sivanich <sivanich@sgi.com>
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Reviewed-by: Josh Triplett <josh@joshtriplett.org>
      cabc49c1
    • P
      rcu: Allow RCU grace-period initialization to be preempted · 755609a9
      Paul E. McKenney authored
      RCU grace-period initialization is currently carried out with interrupts
      disabled, which can result in 200-microsecond latency spikes on systems
      on which RCU has been configured for 4096 CPUs.  This patch therefore
      makes the RCU grace-period initialization be preemptible, which should
      eliminate those latency spikes.  Similar spikes from grace-period cleanup
      and the forcing of quiescent states will be dealt with similarly by later
      patches.
      Reported-by: Mike Galbraith <mgalbraith@suse.de>
      Reported-by: Dimitri Sivanich <sivanich@sgi.com>
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Reviewed-by: Josh Triplett <josh@joshtriplett.org>
      755609a9
    • P
      rcu: Prevent initialization-time quiescent-state race · 79bce672
      Paul E. McKenney authored
      The next step in reducing RCU's grace-period initialization latency on
      large systems will make this initialization preemptible.  Unfortunately,
      making the grace-period initialization subject to interrupts (let alone
      preemption) exposes the following race on systems whose rcu_node tree
      contains more than one node:
      
      1.	CPU 31 starts initializing the grace period, including the
          	first leaf rcu_node structures, and is then preempted.
      
      2.	CPU 0 refers to the first leaf rcu_node structure, and notes
          	that a new grace period has started.  It passes through a
          	quiescent state shortly thereafter, and informs the RCU core
          	of this rite of passage.
      
      3.	CPU 0 enters an RCU read-side critical section, acquiring
          	a pointer to an RCU-protected data item.
      
      4.	CPU 31 takes an interrupt whose handler removes the data item
      	referenced by CPU 0 from the data structure, and registers an
      	RCU callback in order to free it.
      
      5.	CPU 31 resumes initializing the grace period, including its
          	own rcu_node structure.  It invokes rcu_start_gp_per_cpu(),
          	which advances all callbacks, including the one registered
          	in #4 above, to be handled by the current grace period.
      
      6.	The remaining CPUs pass through quiescent states and inform
          	the RCU core, but CPU 0 remains in its RCU read-side critical
          	section, still referencing the now-removed data item.
      
      7.	The grace period completes and all the callbacks are invoked,
          	including the one that frees the data item that CPU 0 is still
          	referencing.  Oops!!!
      
      One way to avoid this race is to remove grace-period acceleration from
      rcu_start_gp_per_cpu().  Now, the only reason for this acceleration was
      to allow CPUs bringing RCU out of idle state to have their callbacks
      invoked after only one grace period, rather than the two grace periods
      that would otherwise be required.  But this acceleration does not
      work when RCU grace-period initialization is moved to a kthread because
      the CPU posting the callback is no longer necessarily the CPU that is
      initializing the resulting grace period.
      
      This commit therefore removes this now-pointless (and soon to be dangerous)
      grace-period acceleration, thus avoiding the above race.
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      79bce672
    • P
      rcu: Move RCU grace-period initialization into a kthread · b3dbec76
      Paul E. McKenney authored
      As the first step towards allowing grace-period initialization to be
      preemptible, this commit moves the RCU grace-period initialization
      into its own kthread.  This is needed to keep large-system scheduling
      latency at reasonable levels.
      
      Also change raw_spin_lock_irqsave() to raw_spin_lock_irq() as suggested
      by Peter Zijlstra in review comments.
      Reported-by: Mike Galbraith <mgalbraith@suse.de>
      Reported-by: Dimitri Sivanich <sivanich@sgi.com>
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Reviewed-by: Josh Triplett <josh@joshtriplett.org>
      b3dbec76
    • P
      rcu: Fix day-one dyntick-idle stall-warning bug · a10d206e
      Paul E. McKenney authored
      Each grace period is supposed to have at least one callback waiting
      for that grace period to complete.  However, if CONFIG_NO_HZ=n, an
      extra callback-free grace period is no big problem -- it will chew up
      a tiny bit of CPU time, but it will complete normally.  In contrast,
      CONFIG_NO_HZ=y kernels have the potential for all the CPUs to go to
      sleep indefinitely, in turn indefinitely delaying completion of the
      callback-free grace period.  Given that nothing is waiting on this grace
      period, this is also not a problem.
      
      That is, unless RCU CPU stall warnings are also enabled, as they are
      in recent kernels.  In this case, if a CPU wakes up after at least one
      minute of inactivity, an RCU CPU stall warning will result.  The reason
      that no one noticed until quite recently is that most systems have enough
      OS noise that they will never remain absolutely idle for a full minute.
      But there are some embedded systems with cut-down userspace configurations
      that consistently get into this situation.
      
      All this begs the question of exactly how a callback-free grace period
      gets started in the first place.  This can happen due to the fact that
      CPUs do not necessarily agree on which grace period is in progress.
      If a CPU still believes that the grace period that just completed is
      still ongoing, it will believe that it has callbacks that need to wait for
      another grace period, never mind the fact that the grace period that they
      were waiting for just completed.  This CPU can therefore erroneously
      decide to start a new grace period.  Note that this can happen in
      TREE_RCU and TREE_PREEMPT_RCU even on a single-CPU system:  Deadlock
      considerations mean that the CPU that detected the end of the grace
      period is not necessarily officially informed of this fact for some time.
      
      Once this CPU notices that the earlier grace period completed, it will
      invoke its callbacks.  It then won't have any callbacks left.  If no
      other CPU has any callbacks, we now have a callback-free grace period.
      
      This commit therefore makes CPUs check more carefully before starting a
      new grace period.  This new check relies on an array of tail pointers
      into each CPU's list of callbacks.  If the CPU is up to date on which
      grace periods have completed, it checks to see if any callbacks follow
      the RCU_DONE_TAIL segment, otherwise it checks to see if any callbacks
      follow the RCU_WAIT_TAIL segment.  The reason that this works is that
      the RCU_WAIT_TAIL segment will be promoted to the RCU_DONE_TAIL segment
      as soon as the CPU is officially notified that the old grace period
      has ended.
      
      This change is to cpu_needs_another_gp(), which is called in a number
      of places.  The only one that really matters is in rcu_start_gp(), where
      the root rcu_node structure's ->lock is held, which prevents any
      other CPU from starting or completing a grace period, so that the
      comparison that determines whether the CPU is missing the completion
      of a grace period is stable.
      Reported-by: Becky Bruce <bgillbruce@gmail.com>
      Reported-by: Subodh Nijsure <snijsure@grid-net.com>
      Reported-by: Paul Walmsley <paul@pwsan.com>
      Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Tested-by: Paul Walmsley <paul@pwsan.com>  # OMAP3730, OMAP4430
      Cc: stable@vger.kernel.org
      a10d206e
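The tail-pointer check described in this commit message can be modeled with a small segmented list. This is a sketch under assumptions rather than the kernel's code: `struct cpu_cbs` and the `cbs_*()` helpers are hypothetical, the real function takes different arguments, and any other conditions the kernel tests are left out. Each `tail[i]` points at the `next` pointer that ends segment i, so `*tail[i]` is the first callback after that segment, or NULL if there is none.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum { DONE_TAIL, WAIT_TAIL, NEXT_READY_TAIL, NEXT_TAIL, NUM_TAILS };

struct cb {
	struct cb *next;
};

/* Toy per-CPU callback list, split into four segments by tail pointers. */
struct cpu_cbs {
	struct cb *head;
	struct cb **tail[NUM_TAILS];
	unsigned long completed;	/* last grace period this CPU saw end */
};

static void cbs_init(struct cpu_cbs *c, unsigned long completed)
{
	c->head = NULL;
	for (int i = 0; i < NUM_TAILS; i++)
		c->tail[i] = &c->head;	/* every segment starts out empty */
	c->completed = completed;
}

/* Queue a new callback at the end of the NEXT segment. */
static void cbs_enqueue(struct cpu_cbs *c, struct cb *cb)
{
	assert(cb != NULL);
	cb->next = NULL;
	*c->tail[NEXT_TAIL] = cb;
	c->tail[NEXT_TAIL] = &cb->next;
}

/* One advancement step: each segment is absorbed by the one before it. */
static void cbs_advance(struct cpu_cbs *c)
{
	for (int i = DONE_TAIL; i < NEXT_TAIL; i++)
		c->tail[i] = c->tail[i + 1];
}

/*
 * Does this CPU have callbacks that still need a grace period?  If the
 * CPU has not yet noticed the latest completion, its WAIT segment is
 * effectively done already, so only callbacks after WAIT count.
 */
static bool cpu_needs_another_gp(const struct cpu_cbs *c,
				 unsigned long global_completed)
{
	int seg = (global_completed != c->completed) ? WAIT_TAIL : DONE_TAIL;

	return *c->tail[seg] != NULL;
}
```

With a single callback sitting in the WAIT segment, an up-to-date CPU reports that it needs a grace period, while a CPU that has merely missed the completion of the grace period that callback was waiting for does not, which is the case the commit sets out to stop from starting a callback-free grace period.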
  2. 02 Sep, 2012 1 commit
    • J
      time: Move ktime_t overflow checking into timespec_valid_strict · cee58483
      John Stultz authored
      Andreas Bombe reported that the ktime_t overflow checking added to
      timespec_valid in commit 4e8b1452 ("time: Improve sanity checking of
      timekeeping inputs") was causing problems with X.org because it caused
      timeouts larger than KTIME_MAX to be invalid.
      
      Previously, these large timeouts would be clamped to KTIME_MAX and would
      never expire, which is valid.
      
      This patch splits the ktime_t overflow checking into a new
      timespec_valid_strict function, and converts the timekeeping code's
      internal checking to use this stricter function.
      Reported-and-tested-by: Andreas Bombe <aeb@debian.org>
      Cc: Zhouping Liu <zliu@redhat.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Prarit Bhargava <prarit@redhat.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: stable@vger.kernel.org
      Signed-off-by: John Stultz <john.stultz@linaro.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      cee58483
  3. 22 Aug, 2012 6 commits
  4. 21 Aug, 2012 1 commit
  5. 19 Aug, 2012 1 commit
  6. 18 Aug, 2012 1 commit
  7. 15 Aug, 2012 4 commits
  8. 14 Aug, 2012 5 commits
    • M
      sched: Fix migration thread runtime bogosity · 8f618968
      Mike Galbraith authored
      Make stop scheduler class do the same accounting as other classes.
      
      Migration threads can be caught in the act while doing exec balancing,
      leading to the below due to use of unmaintained ->se.exec_start.  The
      load that triggered this particular instance was an apparently out of
      control heavily threaded application that does system monitoring in
      what equated to an exec bomb, with one of the VERY frequently migrated
      tasks being ps.
      
      %CPU   PID USER     CMD
      99.3    45 root     [migration/10]
      97.7    53 root     [migration/12]
      97.0    57 root     [migration/13]
      90.1    49 root     [migration/11]
      89.6    65 root     [migration/15]
      88.7    17 root     [migration/3]
      80.4    37 root     [migration/8]
      78.1    41 root     [migration/9]
      44.2    13 root     [migration/2]
      Signed-off-by: Mike Galbraith <mgalbraith@suse.de>
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Link: http://lkml.kernel.org/r/1344051854.6739.19.camel@marge.simpson.net
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      8f618968
    • M
      sched,rt: fix isolated CPUs leaving root_task_group indefinitely throttled · e221d028
      Mike Galbraith authored
      Root task group bandwidth replenishment must service all CPUs, regardless of
      where the timer was last started, and regardless of the isolation mechanism,
      lest 'Quoth the Raven, "Nevermore"' become rt scheduling policy.
      Signed-off-by: Mike Galbraith <efault@gmx.de>
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Link: http://lkml.kernel.org/r/1344326558.6968.25.camel@marge.simpson.net
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      e221d028
    • M
      sched,cgroup: Fix up task_groups list · 35cf4e50
      Mike Galbraith authored
      With multiple instances of task_groups, for_each_rt_rq() is a noop,
      no task groups having been added to the rt.c list instance.  This
      renders __enable/disable_runtime() and print_rt_stats() noop, the
      user (non) visible effect being that rt task groups are missing in
      /proc/sched_debug.
      Signed-off-by: Mike Galbraith <efault@gmx.de>
      Cc: stable@kernel.org # v3.3+
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Link: http://lkml.kernel.org/r/1344308413.6846.7.camel@marge.simpson.net
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      35cf4e50
    • S
      sched: fix divide by zero at {thread_group,task}_times · bea6832c
      Stanislaw Gruszka authored
      On architectures where cputime_t is a 64-bit type, it is possible to
      trigger a divide by zero on the do_div(temp, (__force u32) total) line
      if total is a nonzero number whose lower 32 bits are all zero.  Removing
      the cast is not a good solution, since some do_div() implementations
      cast to u32 internally.
      
      This problem can be triggered in practice on very long lived processes:
      
        PID: 2331   TASK: ffff880472814b00  CPU: 2   COMMAND: "oraagent.bin"
         #0 [ffff880472a51b70] machine_kexec at ffffffff8103214b
         #1 [ffff880472a51bd0] crash_kexec at ffffffff810b91c2
         #2 [ffff880472a51ca0] oops_end at ffffffff814f0b00
         #3 [ffff880472a51cd0] die at ffffffff8100f26b
         #4 [ffff880472a51d00] do_trap at ffffffff814f03f4
         #5 [ffff880472a51d60] do_divide_error at ffffffff8100cfff
         #6 [ffff880472a51e00] divide_error at ffffffff8100be7b
            [exception RIP: thread_group_times+0x56]
            RIP: ffffffff81056a16  RSP: ffff880472a51eb8  RFLAGS: 00010046
            RAX: bc3572c9fe12d194  RBX: ffff880874150800  RCX: 0000000110266fad
            RDX: 0000000000000000  RSI: ffff880472a51eb8  RDI: 001038ae7d9633dc
            RBP: ffff880472a51ef8   R8: 00000000b10a3a64   R9: ffff880874150800
            R10: 00007fcba27ab680  R11: 0000000000000202  R12: ffff880472a51f08
            R13: ffff880472a51f10  R14: 0000000000000000  R15: 0000000000000007
            ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
         #7 [ffff880472a51f00] do_sys_times at ffffffff8108845d
         #8 [ffff880472a51f40] sys_times at ffffffff81088524
         #9 [ffff880472a51f80] system_call_fastpath at ffffffff8100b0f2
            RIP: 0000003808caac3a  RSP: 00007fcba27ab6d8  RFLAGS: 00000202
            RAX: 0000000000000064  RBX: ffffffff8100b0f2  RCX: 0000000000000000
            RDX: 00007fcba27ab6e0  RSI: 000000000076d58e  RDI: 00007fcba27ab6e0
            RBP: 00007fcba27ab700   R8: 0000000000000020   R9: 000000000000091b
            R10: 00007fcba27ab680  R11: 0000000000000202  R12: 00007fff9ca41940
            R13: 0000000000000000  R14: 00007fcba27ac9c0  R15: 00007fff9ca41940
            ORIG_RAX: 0000000000000064  CS: 0033  SS: 002b
      
      Cc: stable@vger.kernel.org
      Signed-off-by: Stanislaw Gruszka <sgruszka@redhat.com>
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Link: http://lkml.kernel.org/r/20120808092714.GA3580@redhat.com
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      bea6832c
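The failure mode in this commit message comes down to a 64-bit divisor being truncated to 32 bits. The sketch below reproduces only that arithmetic in userspace; it does not show the fix, which the commit message does not describe. The helper names are hypothetical, and the `assert()` in `div_by_low32()` stands in for the hardware divide-error trap seen in the backtrace.

```c
#include <assert.h>
#include <stdint.h>

/* Models a cast such as "(__force u32) total": keeps the low 32 bits. */
static uint32_t truncated_divisor(uint64_t total)
{
	return (uint32_t)total;
}

/*
 * Returns 1 if total is nonzero but its truncated form is zero, that
 * is, if dividing by the truncated value would trap.
 */
static int would_divide_by_zero(uint64_t total)
{
	return total != 0 && truncated_divisor(total) == 0;
}

/* Divide by the low 32 bits of total, refusing a zero divisor. */
static uint64_t div_by_low32(uint64_t n, uint64_t total)
{
	uint32_t d = truncated_divisor(total);

	assert(d != 0);	/* the real code would take a divide error here */
	return n / d;
}
```

Any total that is an exact multiple of 2^32 triggers the problem, which is why it shows up only once accumulated CPU time has grown large, as on the long-lived process in the report.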
    • P
      sched, cgroup: Reduce rq->lock hold times for large cgroup hierarchies · a35b6466
      Peter Zijlstra authored
      Peter Portante reported that for large cgroup hierarchies (and/or on
      large CPU counts) we get immense lock contention on rq->lock and stuff
      stops working properly.
      
      His workload was a ton of processes, each in their own cgroup,
      everybody idling except for a sporadic wakeup once every so often.
      
      It was found that:
      
        schedule()
          idle_balance()
            load_balance()
              local_irq_save()
              double_rq_lock()
              update_h_load()
                walk_tg_tree(tg_load_down)
                  tg_load_down()
      
      Results in an entire cgroup hierarchy walk under rq->lock for every
      new-idle balance and since new-idle balance isn't throttled this
      results in a lot of work while holding the rq->lock.
      
      This patch does two things.  First, it removes the work from under
      rq->lock, based on the good principle of race and pray, which is widely
      employed in the load-balancer as a whole.  Second, it throttles the
      update_h_load() calculation to at most once per jiffy.
      
      I considered excluding update_h_load() for new-idle balance
      altogether, but purely relying on regular balance passes to update
      this data might not work out under some rare circumstances where the
      new-idle busiest isn't the regular busiest for a while (unlikely, but
      a nightmare to debug if someone hits it and suffers).
      
      Cc: pjt@google.com
      Cc: Larry Woodman <lwoodman@redhat.com>
      Cc: Mike Galbraith <efault@gmx.de>
      Reported-by: Peter Portante <pportant@redhat.com>
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Link: http://lkml.kernel.org/n/tip-aaarrzfpnaam7pqrekofu8a6@git.kernel.org
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      a35b6466
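The throttling half of this change follows a common pattern: remember the tick at which the expensive work last ran and skip it if the tick has not changed. The sketch below shows that pattern in userspace; `struct toy_rq`, `expensive_walk()`, and `update_throttled()` are hypothetical names, and the tick counter is passed in as a parameter rather than read from a global such as jiffies.

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-in for a runqueue, reduced to the throttling state. */
struct toy_rq {
	unsigned long last_update;	/* tick of the latest recomputation */
	unsigned long nr_walks;		/* how many expensive walks ran */
};

/* Hypothetical stand-in for the hierarchy walk; here it only counts. */
static void expensive_walk(struct toy_rq *rq)
{
	rq->nr_walks++;
}

/*
 * Run the expensive walk at most once per tick.  Returns 1 if the walk
 * ran, 0 if it was skipped.  Note that a freshly zeroed structure skips
 * the walk during tick 0, a corner case this sketch does not handle.
 */
static int update_throttled(struct toy_rq *rq, unsigned long now)
{
	assert(rq != NULL);
	if (rq->last_update == now)
		return 0;
	rq->last_update = now;
	expensive_walk(rq);
	return 1;
}
```

However many new-idle balance attempts land within one tick, only the first pays for the walk; the rest reuse slightly stale data, which is the trade the commit message accepts.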
  9. 13 Aug, 2012 1 commit
    • J
      printk: Fix calculation of length used to discard records · e3756477
      Jeff Mahoney authored
      While tracking down a weird buffer overflow issue in a program that
      looked to be sane, I started double checking the length returned by
      syslog(SYSLOG_ACTION_READ_ALL, ...) to make sure it wasn't overflowing
      the buffer.
      
      Sure enough, it was.  I saw this in strace:
      
        11339 syslog(SYSLOG_ACTION_READ_ALL, "<5>[244017.708129] REISERFS (dev"..., 8192) = 8279
      
      It turns out that the loops that calculate how much space the entries
      will take when they're copied don't include the newlines and prefixes
      that will be included in the final output since prev flags is passed as
      zero.
      
      This patch properly accounts for it and fixes the overflow.
      
      CC: stable@kernel.org
      Signed-off-by: Jeff Mahoney <jeffm@suse.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      e3756477
  10. 09 Aug, 2012 1 commit