  1. 23 Jul, 2015 2 commits
  2. 19 Jun, 2015 1 commit
    • timer: Reduce timer migration overhead if disabled · bc7a34b8
      Committed by Thomas Gleixner
      Eric reported that the timer_migration sysctl is not really nice
      performance wise as it needs to check at every timer insertion whether
      the feature is enabled or not. Further the check does not live in the
      timer code, so we have an extra function call which checks an extra
      cache line to figure out that it is disabled.
      
      We can do better and store that information in the per cpu (hr)timer
      bases. I pondered to use a static key, but that's a nightmare to
      update from the nohz code and the timer base cache line is hot anyway
      when we select a timer base.
      
      The old logic enabled the timer migration unconditionally if
      CONFIG_NO_HZ was set even if nohz was disabled on the kernel command
      line.
      
      With this modification, we start off with migration disabled. The user
      visible sysctl is still set to enabled. If the kernel switches to NOHZ,
      migration is enabled, unless the user disabled it via the sysctl
      prior to the switch. If nohz=off is on the kernel command line,
      migration stays disabled no matter what.
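
The caching scheme described above can be sketched in user-space C. This is only an illustration under stated assumptions: `timer_base_sketch`, `update_migration()`, `select_target_cpu()` and `NR_CPUS_SKETCH` are hypothetical names, not the kernel's, and the real per-CPU bases carry far more state.

```c
#include <assert.h>
#include <stdbool.h>

#define NR_CPUS_SKETCH 4  /* hypothetical CPU count for the sketch */

/* Hypothetical stand-in for a per-CPU (hr)timer base. */
struct timer_base_sketch {
	bool migration_enabled;  /* cached flag, read on the hot path */
};

static struct timer_base_sketch bases[NR_CPUS_SKETCH];
static bool sysctl_migration = true;  /* user-visible knob, defaults to enabled */
static bool nohz_active;              /* stays false until the switch to NOHZ */

/* Slow path: recompute the cached flag in every per-CPU base. */
static void update_migration(void)
{
	bool on = sysctl_migration && nohz_active;

	for (int cpu = 0; cpu < NR_CPUS_SKETCH; cpu++)
		bases[cpu].migration_enabled = on;
}

/* Hot path: a single load from the base that is being selected anyway. */
static int select_target_cpu(int this_cpu, int nohz_target)
{
	assert(this_cpu >= 0 && this_cpu < NR_CPUS_SKETCH);
	if (!bases[this_cpu].migration_enabled)
		return this_cpu;
	return nohz_target;
}
```

With `nohz_active` left false (the `nohz=off` case), `update_migration()` never turns the flag on, regardless of the sysctl value.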
      
      Before:
        47.76%  hog       [.] main
        14.84%  [kernel]  [k] _raw_spin_lock_irqsave
         9.55%  [kernel]  [k] _raw_spin_unlock_irqrestore
         6.71%  [kernel]  [k] mod_timer
         6.24%  [kernel]  [k] lock_timer_base.isra.38
         3.76%  [kernel]  [k] detach_if_pending
         3.71%  [kernel]  [k] del_timer
         2.50%  [kernel]  [k] internal_add_timer
         1.51%  [kernel]  [k] get_nohz_timer_target
         1.28%  [kernel]  [k] __internal_add_timer
         0.78%  [kernel]  [k] timerfn
         0.48%  [kernel]  [k] wake_up_nohz_cpu
      
      After:
        48.10%  hog       [.] main
        15.25%  [kernel]  [k] _raw_spin_lock_irqsave
         9.76%  [kernel]  [k] _raw_spin_unlock_irqrestore
         6.50%  [kernel]  [k] mod_timer
         6.44%  [kernel]  [k] lock_timer_base.isra.38
         3.87%  [kernel]  [k] detach_if_pending
         3.80%  [kernel]  [k] del_timer
         2.67%  [kernel]  [k] internal_add_timer
         1.33%  [kernel]  [k] __internal_add_timer
         0.73%  [kernel]  [k] timerfn
         0.54%  [kernel]  [k] wake_up_nohz_cpu
      Reported-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Paul McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Viresh Kumar <viresh.kumar@linaro.org>
      Cc: John Stultz <john.stultz@linaro.org>
      Cc: Joonwoo Park <joonwoop@codeaurora.org>
      Cc: Wenbo Wang <wenbo.wang@memblaze.com>
      Link: http://lkml.kernel.org/r/20150526224512.127050787@linutronix.de
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
  3. 28 May, 2015 12 commits
  4. 22 Apr, 2015 1 commit
  5. 13 Mar, 2015 3 commits
    • rcu: Process offlining and onlining only at grace-period start · 0aa04b05
      Committed by Paul E. McKenney
      Races between CPU hotplug and grace periods can be difficult to resolve,
      so the ->onoff_mutex is used to exclude the two events.  Unfortunately,
      this means that it is impossible for an outgoing CPU to perform the
      last bits of its offlining from its last pass through the idle loop,
      because sleeplocks cannot be acquired in that context.
      
      This commit avoids these problems by buffering online and offline events
      in a new ->qsmaskinitnext field in the leaf rcu_node structures.  When a
      grace period starts, the events accumulated in this mask are applied to
      the ->qsmaskinit field, and, if needed, up the rcu_node tree.  The special
      case of all CPUs corresponding to a given leaf rcu_node structure being
      offline while there are still elements in that structure's ->blkd_tasks
      list is handled using a new ->wait_blkd_tasks field.  In this case,
      propagating the offline bits up the tree is deferred until the beginning
      of the grace period after all of the tasks have exited their RCU read-side
      critical sections and removed themselves from the list, at which point
      the ->wait_blkd_tasks flag is cleared.  If one of that leaf rcu_node
      structure's CPUs comes back online before the list empties, then the
      ->wait_blkd_tasks flag is simply cleared.
      
      This of course means that RCU's notion of which CPUs are offline can be
      out of date.  This is OK because RCU need only wait on CPUs that were
      online at the time that the grace period started.  In addition, RCU's
      force-quiescent-state actions will handle the case where a CPU goes
      offline after the grace period starts.
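
A much-simplified sketch of the buffering described above, assuming a single leaf node. The field names follow the commit message, but `leaf_sketch`, `nr_blkd_tasks` (standing in for the ->blkd_tasks list), `gp_start()` and the `tree_action` values are hypothetical and omit all locking.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, lock-free stand-in for a leaf rcu_node structure. */
struct leaf_sketch {
	unsigned long qsmaskinit;      /* CPUs each new grace period waits on */
	unsigned long qsmaskinitnext;  /* buffered online/offline events */
	int nr_blkd_tasks;             /* stand-in for the ->blkd_tasks list */
	bool wait_blkd_tasks;
};

enum tree_action { TREE_NONE, TREE_SET_BIT, TREE_CLEAR_BIT };

/* Hotplug events only touch the buffer, never the live mask. */
static void cpu_online(struct leaf_sketch *rnp, int cpu)
{
	rnp->qsmaskinitnext |= 1UL << cpu;
}

static void cpu_offline(struct leaf_sketch *rnp, int cpu)
{
	rnp->qsmaskinitnext &= ~(1UL << cpu);
}

/* Grace-period start: apply buffered events, report what to do up the tree. */
static enum tree_action gp_start(struct leaf_sketch *rnp)
{
	unsigned long oldmask = rnp->qsmaskinit;
	enum tree_action act = TREE_NONE;

	rnp->qsmaskinit = rnp->qsmaskinitnext;

	if (!oldmask && rnp->qsmaskinit) {
		act = TREE_SET_BIT;                 /* first CPU came online */
	} else if (oldmask && !rnp->qsmaskinit) {
		if (rnp->nr_blkd_tasks > 0)
			rnp->wait_blkd_tasks = true;    /* defer propagation */
		else
			act = TREE_CLEAR_BIT;           /* last CPU went offline */
	}

	if (rnp->wait_blkd_tasks) {
		if (rnp->qsmaskinit) {
			rnp->wait_blkd_tasks = false;   /* a CPU came back online */
		} else if (rnp->nr_blkd_tasks == 0) {
			rnp->wait_blkd_tasks = false;   /* list drained */
			act = TREE_CLEAR_BIT;
		}
	}
	return act;
}
```

In this sketch, setting a bit that was never cleared (the CPU-came-back case) is treated as harmless, which is an assumption of the illustration.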
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
    • rcu: Move rcu_report_unblock_qs_rnp() to common code · cc99a310
      Committed by Paul E. McKenney
      The rcu_report_unblock_qs_rnp() function is invoked when the
      last task blocking the current grace period exits its outermost
      RCU read-side critical section.  Previously, this was called only
      from rcu_read_unlock_special(), and was therefore defined only when
      CONFIG_RCU_PREEMPT=y.  However, this function will be invoked even when
      CONFIG_RCU_PREEMPT=n once CPU-hotplug operations are processed only at
      the beginnings of RCU grace periods.  The reason for this change is that
      the last task on a given leaf rcu_node structure's ->blkd_tasks list
      might well exit its RCU read-side critical section between the time that
      recent CPU-hotplug operations were applied and when the new grace period
      was initialized.  This situation could result in RCU waiting forever on
      that leaf rcu_node structure, because if all that structure's CPUs were
      already offline, there would be no quiescent-state events to drive that
      structure's part of the grace period.
      
      This commit therefore moves rcu_report_unblock_qs_rnp() to common code
      that is built unconditionally so that the quiescent-state-forcing code
      can clean up after this situation, avoiding the grace-period stall.
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
    • rcu: Rework preemptible expedited bitmask handling · 8eb74b2b
      Committed by Paul E. McKenney
      Currently, the rcu_node tree ->expmask bitmasks are initially set to
      reflect the online CPUs.  This is pointless, because only the CPUs
      preempted within RCU read-side critical sections by the preceding
      synchronize_sched_expedited() need to be tracked.  This commit therefore
      instead sets up these bitmasks based on the state of the ->blkd_tasks
      lists.
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
  6. 12 Mar, 2015 2 commits
  7. 04 Mar, 2015 4 commits
  8. 27 Feb, 2015 3 commits
  9. 14 Feb, 2015 1 commit
  10. 12 Feb, 2015 1 commit
    • rcu: Clear need_qs flag to prevent splat · c0135d07
      Committed by Paul E. McKenney
      If the scheduling-clock interrupt sets the current task's need_qs flag,
      but the current CPU passes through a quiescent state in the meantime,
      then rcu_preempt_qs() will fail to clear the need_qs flag, which can fool
      RCU into thinking that additional rcu_read_unlock_special() processing
      is needed.  This commit therefore clears the need_qs flag before checking
      for additional processing.
      
      For this problem to occur, we need rcu_preempt_data.passed_quiesce equal
      to true and current->rcu_read_unlock_special.b.need_qs also equal to true.
      This condition can occur as follows:
      
      1.	CPU 0 is aware of the current preemptible RCU grace period,
      	but has not yet passed through a quiescent state.  Among other
      	things, this means that rcu_preempt_data.passed_quiesce is false.
      
      2.	Task A running on CPU 0 enters a preemptible RCU read-side
      	critical section.
      
      3.	CPU 0 takes a scheduling-clock interrupt, which notices the
      	RCU read-side critical section and the need for a quiescent state,
      	and thus sets current->rcu_read_unlock_special.b.need_qs to true.
      
      4.	Task A is preempted, enters the scheduler, eventually invoking
      	rcu_preempt_note_context_switch() which in turn invokes
      	rcu_preempt_qs().
      
      	Because rcu_preempt_data.passed_quiesce is false,
      	control enters the body of the "if" statement, which sets
      	rcu_preempt_data.passed_quiesce to true.
      
      5.	At this point, CPU 0 takes an interrupt.  The interrupt
      	handler contains an RCU read-side critical section, and
      	the rcu_read_unlock() notes that current->rcu_read_unlock_special
      	is nonzero, and thus invokes rcu_read_unlock_special().
      
      6.	Once in rcu_read_unlock_special(), the fact that
      	current->rcu_read_unlock_special.b.need_qs is true becomes
      	apparent, so rcu_read_unlock_special() invokes rcu_preempt_qs().
      	Recursively, given that we interrupted out of that same
      	function in the preceding step.
      
      7.	Because rcu_preempt_data.passed_quiesce is now true,
      	rcu_preempt_qs() does nothing, and simply returns.
      
      8.	Upon return to rcu_read_unlock_special(), it is noted that
      	current->rcu_read_unlock_special is still nonzero (because
      	the interrupted rcu_preempt_qs() had not yet gotten around
      	to clearing current->rcu_read_unlock_special.b.need_qs).
      
      9.	Execution proceeds to the WARN_ON_ONCE(), which notes that
      	we are in an interrupt handler and thus duly splats.
      
      The solution, as noted above, is to make rcu_read_unlock_special()
      clear out current->rcu_read_unlock_special.b.need_qs after calling
      rcu_preempt_qs().  The interrupted rcu_preempt_qs() will clear it again,
      but this is harmless.  The worst that happens is that we clobber another
      attempt to set this field, but this is not a problem because we just
      got done reporting a quiescent state.
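
The nine steps above can be replayed in a small user-space model. This is a sketch under stated assumptions: `qs_state_sketch`, `preempt_qs_sketch()` and `unlock_special_in_irq()` are hypothetical stand-ins, and the `apply_fix` parameter exists only to compare behavior with and without the extra clear.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of the two flags involved in the race. */
struct qs_state_sketch {
	bool passed_quiesce;  /* models rcu_preempt_data.passed_quiesce */
	bool need_qs;         /* models ...rcu_read_unlock_special.b.need_qs */
};

/* Models rcu_preempt_qs(): two stores with a window between them. */
static void preempt_qs_sketch(struct qs_state_sketch *s)
{
	if (!s->passed_quiesce) {
		s->passed_quiesce = true;
		/* An interrupt here sees passed_quiesce and need_qs both true. */
		s->need_qs = false;
	}
}

/*
 * Models rcu_read_unlock_special() running in an interrupt handler.
 * Returns true if unhandled state remains, which is the condition
 * that leads to the WARN_ON_ONCE() splat in step 9.
 */
static bool unlock_special_in_irq(struct qs_state_sketch *s, bool apply_fix)
{
	if (s->need_qs) {
		preempt_qs_sketch(s);    /* no-op when passed_quiesce is set */
		if (apply_fix)
			s->need_qs = false;  /* the clear added by this commit */
	}
	return s->need_qs;
}
```

Starting from the interrupted state (both flags true), the model reports leftover state without the fix and none with it.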
      Reported-by: Sasha Levin <sasha.levin@oracle.com>
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      [ paulmck: Fix embarrassing build bug noted by Sasha Levin. ]
      Tested-by: Sasha Levin <sasha.levin@oracle.com>
  11. 16 Jan, 2015 1 commit
    • rcu: Optionally run grace-period kthreads at real-time priority · a94844b2
      Committed by Paul E. McKenney
      Recent testing has shown that under heavy load, running RCU's grace-period
      kthreads at real-time priority can improve performance (according to 0day
      test robot) and reduce the incidence of RCU CPU stall warnings.  However,
      most systems do just fine with the default non-realtime priorities for
      these kthreads, and it does not make sense to expose the entire user
      base to any risk stemming from this change, given that this change is
      of use only to a few users running extremely heavy workloads.
      
      Therefore, this commit allows users to specify realtime priorities
      for the grace-period kthreads, but leaves them running SCHED_OTHER
      by default.  The realtime priority may be specified at build time
      via the RCU_KTHREAD_PRIO Kconfig parameter, or at boot time via the
      rcutree.kthread_prio parameter.  Either way, 0 says to continue the
      default SCHED_OTHER behavior and values from 1 to 99 select SCHED_FIFO
      at that priority.  Note that a value of 0 is not permitted when
      the RCU_BOOST Kconfig parameter is specified.
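
The mapping from the priority parameter to a scheduling policy might look roughly like the following. This is a hypothetical sketch: the types and `gp_kthread_sched()` are invented for illustration, and clamping out-of-range or disallowed values (rather than rejecting them) is an assumption, not something the commit message states.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for SCHED_OTHER and SCHED_FIFO. */
enum policy_sketch { POLICY_OTHER, POLICY_FIFO };

struct gp_sched_sketch {
	enum policy_sketch policy;
	int prio;
};

/* Map the requested priority to a policy/priority pair. */
static struct gp_sched_sketch gp_kthread_sched(int kthread_prio, bool rcu_boost)
{
	struct gp_sched_sketch s;

	/* Assumption: clamp instead of rejecting bad requests. */
	if (kthread_prio < 0)
		kthread_prio = 0;
	if (kthread_prio > 99)
		kthread_prio = 99;
	if (rcu_boost && kthread_prio < 1)
		kthread_prio = 1;  /* 0 is not permitted with RCU_BOOST */

	s.policy = kthread_prio ? POLICY_FIFO : POLICY_OTHER;
	s.prio = kthread_prio;
	assert(s.prio >= 0 && s.prio <= 99);
	return s;
}
```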
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
  12. 11 Jan, 2015 2 commits
    • rcutorture: Check from beginning to end of grace period · 917963d0
      Committed by Paul E. McKenney
      Currently, rcutorture's Reader Batch checks measure from the end of
      the previous grace period to the end of the current one.  This commit
      tightens up these checks by measuring from the start to the end of the same
      grace period.  This involves adding rcu_batches_started() and friends
      corresponding to the existing rcu_batches_completed() and friends.
      
      We leave SRCU alone for the moment, as it does not yet have a way of
      tracking both ends of its grace periods.
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
    • rcu: Make _batches_completed() functions return unsigned long · 9733e4f0
      Committed by Paul E. McKenney
      Long ago, the various ->completed fields were of type long, but now are
      unsigned long due to signed-integer-overflow concerns.  However, the
      various _batches_completed() functions remained of type long, even though
      their only purpose in life is to return the corresponding ->completed
      field.  This patch cleans this up by changing these functions' return
      types to unsigned long.
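
One reason unsigned counters are preferred here is that their wraparound is well defined, so two counter values can be compared by looking at their difference. The sketch below is modeled on the kernel's ULONG_CMP_GE() idiom; the function names `ulong_cmp_ge()` and `gp_has_completed()` are hypothetical.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>

/*
 * Wrap-tolerant "a >= b" for free-running unsigned counters: true when
 * a is at most ULONG_MAX / 2 steps ahead of b, even across a wrap.
 */
static bool ulong_cmp_ge(unsigned long a, unsigned long b)
{
	return ULONG_MAX / 2 >= a - b;
}

/* Has the ->completed counter reached the snapshot value yet? */
static bool gp_has_completed(unsigned long completed, unsigned long snap)
{
	return ulong_cmp_ge(completed, snap);
}
```

With a signed return type, the same subtraction could overflow, which is undefined behavior in C.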
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
  13. 07 Jan, 2015 7 commits
    • rcu: Handle gpnum/completed wrap while dyntick idle · e3663b10
      Committed by Paul E. McKenney
      Subtle race conditions can result if a CPU stays in dyntick-idle mode
      long enough for the ->gpnum and ->completed fields to wrap.  For
      example, consider the following sequence of events:
      
      o	CPU 1 encounters a quiescent state while waiting for grace period
      	5 to complete, but then enters dyntick-idle mode.
      
      o	While CPU 1 is in dyntick-idle mode, the grace-period counters
      	wrap around so that the grace period number is now 4.
      
      o	Just as CPU 1 exits dyntick-idle mode, grace period 4 completes
      	and grace period 5 begins.
      
      o	The quiescent state that CPU 1 passed through during the old
      	grace period 5 looks like it applies to the new grace period
      	5.  Therefore, the new grace period 5 completes without CPU 1
      	having passed through a quiescent state.
      
      This could clearly be a fatal surprise to any long-running RCU read-side
      critical section that happened to be running on CPU 1 at the time.  At one
      time, this was not a problem, given that it takes significant time for
      the grace-period counters to overflow even on 32-bit systems.  However,
      with the advent of NO_HZ_FULL and SMP embedded systems, arbitrarily long
      idle periods are now becoming quite feasible.  It is therefore time to
      close this race.
      
      This commit therefore avoids this race condition by having the
      quiescent-state forcing code detect when a CPU is falling too far
      behind, and setting a new rcu_data field ->gpwrap when this happens.
      Whenever this new ->gpwrap field is set, the CPU's ->gpnum and ->completed
      fields are known to be untrustworthy, and can be ignored, along with
      any associated quiescent states.
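
A minimal sketch of the "falling too far behind" check, assuming a threshold of ULONG_MAX / 4 counts. The structure, the function names and the threshold are all assumptions made for illustration; only the ->gpnum and ->gpwrap field names come from the commit message.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>

/* Hypothetical per-CPU state; the real rcu_data has many more fields. */
struct rdp_sketch {
	unsigned long gpnum;   /* last grace period this CPU noticed */
	bool passed_quiesce;   /* quiescent state recorded for gpnum */
	bool gpwrap;           /* set when gpnum can no longer be trusted */
};

/* Wrap-tolerant "a < b" for free-running unsigned counters. */
static bool ulong_cmp_lt(unsigned long a, unsigned long b)
{
	return ULONG_MAX / 2 < a - b;
}

/* Called on behalf of an idle CPU by the quiescent-state forcing scan. */
static void check_gp_overflow(struct rdp_sketch *rdp, unsigned long global_gpnum)
{
	if (ulong_cmp_lt(rdp->gpnum + ULONG_MAX / 4, global_gpnum))
		rdp->gpwrap = true;
}

/* A recorded quiescent state counts only if the counters are trustworthy. */
static bool qs_counts(const struct rdp_sketch *rdp, unsigned long global_gpnum)
{
	return rdp->passed_quiesce && !rdp->gpwrap && rdp->gpnum == global_gpnum;
}
```

Once the flag is set, a later grace period that happens to reuse the same number no longer matches the stale quiescent state.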
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
    • rcu: Make RCU_CPU_STALL_INFO include number of fqs attempts · fc908ed3
      Committed by Paul E. McKenney
      One way that an RCU CPU stall warning can happen is if the grace-period
      kthread is not allowed to execute.  One proxy for this kthread's
      forward progress is the number of force-quiescent-state (fqs) scans.
      This commit therefore adds the number of fqs scans to the RCU CPU stall
      warning printouts when CONFIG_RCU_CPU_STALL_INFO=y.
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
    • rcu: Revert "Allow post-unlock reference for rt_mutex" to avoid priority-inversion · abaf3f9d
      Committed by Lai Jiangshan
      The patch dfeb9765 ("Allow post-unlock reference for rt_mutex")
      ensured that rcu-boost remained safe even if the rt_mutex had a
      post-unlock reference.
      
      But rt_mutex allowing post-unlock reference is definitely a bug and it was
      fixed by the commit 27e35715 ("rtmutex: Plug slow unlock race").
      This fix made the previous patch (dfeb9765) useless.
      
      And even worse, the priority-inversion introduced by the previous
      patch still exists.
      
      rcu_read_unlock_special() {
      	rt_mutex_unlock(&rnp->boost_mtx);
      	/* Priority-Inversion:
      	 * The current task has been deboosted and, now being a low
      	 * priority task, may be preempted immediately.  It can then wait
      	 * a long time before being scheduled in again, while the
      	 * rcu-booster sleeps waiting on this low priority task.  This
      	 * priority inversion prevents the rcu-booster from working
      	 * as expected.
      	 */
      	complete(&rnp->boost_completion);
      }
      
      Just revert the patch to avoid it.
      
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
    • rcu: Don't bother affinitying rcub kthreads away from offline CPUs · 5d0b0249
      Committed by Paul E. McKenney
      When rcu_boost_kthread_setaffinity() sees that all CPUs for a given
      rcu_node structure are now offline, it affinities the corresponding
      RCU-boost ("rcub") kthread away from those CPUs.  This is pointless
      because the kthread cannot run on those offline CPUs in any case.
      This commit therefore removes this unneeded code.
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
    • rcu: Don't spawn rcub kthreads on root rcu_node structure · 3e9f5c70
      Committed by Paul E. McKenney
      Now that offlining CPUs no longer moves leaf rcu_node structures'
      ->blkd_tasks lists to the root, there is no way for the root rcu_node
      structure's ->blkd_tasks list to be nonempty, unless the root node is also
      the sole leaf node.  This commit therefore refrains from creating an rcub
      kthread for the root rcu_node structure unless it is also the sole leaf.
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
    • rcu: Make use of rcu_preempt_has_tasks() · 96e92021
      Committed by Paul E. McKenney
      Given that there is now an rcu_preempt_has_tasks() function that checks
      to see if the ->blkd_tasks list is non-empty, this commit makes use of it.
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
    • rcu: Don't migrate blocked tasks even if all corresponding CPUs offline · d19fb8d1
      Committed by Paul E. McKenney
      When the last CPU associated with a given leaf rcu_node structure
      goes offline, something must be done about the tasks queued on that
      rcu_node structure.  Each of these tasks has been preempted on one of
      the leaf rcu_node structure's CPUs while in an RCU read-side critical
      section that it has not yet exited.  Handling these tasks is the job of
      rcu_preempt_offline_tasks(), which migrates them from the leaf rcu_node
      structure to the root rcu_node structure.
      
      Unfortunately, this migration has to be done one task at a time because
      each task's allegiance must be shifted from the original leaf rcu_node to
      the root, so that future attempts to deal with these tasks will acquire
      the root rcu_node structure's ->lock rather than that of the leaf.
      Worse yet, this migration must be done with interrupts disabled, which
      is not so good for realtime response, especially given that there is
      no bound on the number of tasks on a given rcu_node structure's list.
      (OK, OK, there is a bound, it is just that it is unreasonably large,
      especially on 64-bit systems.)  This was not considered a problem back
      when rcu_preempt_offline_tasks() was first written because realtime
      systems were assumed not to do CPU-hotplug operations while real-time
      applications were running.  This assumption has proved of dubious validity
      given that people are starting to run multiple realtime applications
      on a single SMP system and that it is common practice to offline then
      online a CPU before starting its real-time application in order to clear
      extraneous processing off of that CPU.  So we now need CPU hotplug
      operations to avoid undue latencies.
      
      This commit therefore avoids migrating these tasks, instead letting
      them be dequeued one by one from the original leaf rcu_node structure
      by rcu_read_unlock_special().  This means that the clearing of bits
      from the upper-level rcu_node structures must be deferred until the
      last such task has been dequeued, because otherwise subsequent grace
      periods won't wait on them.  This commit has the beneficial side effect
      of simplifying the CPU-hotplug code for TREE_PREEMPT_RCU, especially in
      CONFIG_RCU_BOOST builds.
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>