1. 16 Jul, 2015 1 commit
  2. 19 Jun, 2015 1 commit
    • timer: Reduce timer migration overhead if disabled · bc7a34b8
      Committed by Thomas Gleixner
      Eric reported that the timer_migration sysctl hurts performance,
      because every timer insertion has to check whether the feature is
      enabled. Furthermore, the check does not live in the timer code, so we
      pay for an extra function call that touches an extra cache line just
      to find out that migration is disabled.
      
      We can do better and store that information in the per cpu (hr)timer
      bases. I considered using a static key, but that would be a nightmare
      to update from the nohz code, and the timer base cache line is hot
      anyway when we select a timer base.
      
      The old logic enabled the timer migration unconditionally if
      CONFIG_NO_HZ was set even if nohz was disabled on the kernel command
      line.
      
      With this modification, we start off with migration disabled. The
      user-visible sysctl is still set to enabled. If the kernel switches to
      NOHZ, migration is enabled, provided the user did not disable it via
      the sysctl prior to the switch. If nohz=off is on the kernel command
      line, migration stays disabled no matter what.
      
      Before:
        47.76%  hog       [.] main
        14.84%  [kernel]  [k] _raw_spin_lock_irqsave
         9.55%  [kernel]  [k] _raw_spin_unlock_irqrestore
         6.71%  [kernel]  [k] mod_timer
         6.24%  [kernel]  [k] lock_timer_base.isra.38
         3.76%  [kernel]  [k] detach_if_pending
         3.71%  [kernel]  [k] del_timer
         2.50%  [kernel]  [k] internal_add_timer
         1.51%  [kernel]  [k] get_nohz_timer_target
         1.28%  [kernel]  [k] __internal_add_timer
         0.78%  [kernel]  [k] timerfn
         0.48%  [kernel]  [k] wake_up_nohz_cpu
      
      After:
        48.10%  hog       [.] main
        15.25%  [kernel]  [k] _raw_spin_lock_irqsave
         9.76%  [kernel]  [k] _raw_spin_unlock_irqrestore
         6.50%  [kernel]  [k] mod_timer
         6.44%  [kernel]  [k] lock_timer_base.isra.38
         3.87%  [kernel]  [k] detach_if_pending
         3.80%  [kernel]  [k] del_timer
         2.67%  [kernel]  [k] internal_add_timer
         1.33%  [kernel]  [k] __internal_add_timer
         0.73%  [kernel]  [k] timerfn
         0.54%  [kernel]  [k] wake_up_nohz_cpu
      Reported-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Paul McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Viresh Kumar <viresh.kumar@linaro.org>
      Cc: John Stultz <john.stultz@linaro.org>
      Cc: Joonwoo Park <joonwoop@codeaurora.org>
      Cc: Wenbo Wang <wenbo.wang@memblaze.com>
      Link: http://lkml.kernel.org/r/20150526224512.127050787@linutronix.de
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      bc7a34b8
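The idea in the commit above can be illustrated with a small user-space sketch. It is not the kernel implementation: `struct timer_base`, `bases`, `update_migration()`, `switch_to_nohz()`, `set_sysctl()`, and `select_base_cpu()` are hypothetical names chosen for illustration, and locking and real per-CPU storage are omitted. The sketch only shows the decision being cached in each base, so the insertion path tests one flag instead of calling out to check the sysctl.

```c
#include <assert.h>
#include <stdbool.h>

#define NR_CPUS 4

/* Hypothetical stand-in for a per-CPU (hr)timer base. */
struct timer_base {
	bool nohz_active;       /* has the kernel switched to NOHZ mode? */
	bool migration_enabled; /* cached decision, read on every insertion */
};

/* Zero-initialized, so migration starts out disabled. */
struct timer_base bases[NR_CPUS];

/* The user-visible knob starts out enabled. */
bool sysctl_timer_migration = true;

/* Slow path: recompute the cached flag in every base. */
void update_migration(void)
{
	for (int cpu = 0; cpu < NR_CPUS; cpu++)
		bases[cpu].migration_enabled =
			bases[cpu].nohz_active && sysctl_timer_migration;
}

/* Models the switch to NOHZ; with nohz=off this is never called. */
void switch_to_nohz(void)
{
	for (int cpu = 0; cpu < NR_CPUS; cpu++)
		bases[cpu].nohz_active = true;
	update_migration();
}

/* Models a write to the sysctl. */
void set_sysctl(bool enabled)
{
	sysctl_timer_migration = enabled;
	update_migration();
}

/* Fast path: a single flag test on the local base, which is hot anyway. */
int select_base_cpu(int this_cpu, int preferred_cpu)
{
	assert(this_cpu >= 0 && this_cpu < NR_CPUS);
	assert(preferred_cpu >= 0 && preferred_cpu < NR_CPUS);
	return bases[this_cpu].migration_enabled ? preferred_cpu : this_cpu;
}
```

In this sketch a timer stays on the local CPU until `switch_to_nohz()` has run, and writing `false` through `set_sysctl()` pins it locally again.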
  3. 28 May, 2015 32 commits
  4. 14 May, 2015 1 commit
  5. 22 Apr, 2015 1 commit
  6. 15 Apr, 2015 1 commit
    • rcu: Control grace-period delays directly from value · 8d7dc928
      Committed by Paul E. McKenney
      In a misguided attempt to avoid an #ifdef, the use of the
      gp_init_delay module parameter was conditioned on the corresponding
      RCU_TORTURE_TEST_SLOW_INIT Kconfig variable, using IS_ENABLED() at
      the point of use in the code.  This meant that the compiler always saw
      the delay, which meant that RCU_TORTURE_TEST_SLOW_INIT_DELAY had to be
      unconditionally defined.  This in turn caused "make oldconfig" to ask
      pointless questions about the value of RCU_TORTURE_TEST_SLOW_INIT_DELAY
      in cases where it was not even used.
      
      This commit avoids these pointless questions by defining gp_init_delay
      under #ifdef.  In one branch, gp_init_delay is initialized to
      RCU_TORTURE_TEST_SLOW_INIT_DELAY and is also a module parameter (thus
      allowing boot-time modification), and in the other branch gp_init_delay
      is a const variable initialized by default to zero.
      
      This approach also simplifies the code at the delay point by eliminating
      the IS_ENABLED().  Because gp_init_delay is constant zero in the no-delay
      case intended for production use, the "gp_init_delay > 0" check causes
      the delay to become dead code, as desired in this case.  In addition,
      this commit replaces magic constant "10" with the preprocessor variable
      PER_RCU_NODE_PERIOD, which controls the number of grace periods that
      are allowed to elapse at full speed before a delay is inserted.
      
      Reported-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      8d7dc928
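The #ifdef pattern described in the commit above can be sketched in plain C. This is a simplified illustration, not the kernel source: `CONFIG_SLOW_INIT`, `SLOW_INIT_DELAY`, and `init_delay_for()` are hypothetical stand-ins, and the real code registers the variable as a module parameter and sleeps rather than returning a value. `PER_RCU_NODE_PERIOD` takes the value 10 that the commit message says it replaces.

```c
#include <assert.h>

/*
 * Assumption: CONFIG_SLOW_INIT stands in for the Kconfig option.
 * Build with -DCONFIG_SLOW_INIT to get the delaying variant.
 */
#ifdef CONFIG_SLOW_INIT
#define SLOW_INIT_DELAY 3 /* hypothetical stand-in for the Kconfig value */
/* Mutable, so it could also be exposed as a boot-time parameter. */
static int gp_init_delay = SLOW_INIT_DELAY;
#else
/* Constant zero: the compiler can discard the delay path entirely. */
static const int gp_init_delay = 0;
#endif

/* Grace periods allowed to run at full speed between delays. */
#define PER_RCU_NODE_PERIOD 10

/* Returns the delay to apply while initializing grace period gpnum. */
int init_delay_for(unsigned long gpnum)
{
	if (gp_init_delay > 0 && gpnum % PER_RCU_NODE_PERIOD == 0)
		return gp_init_delay;
	return 0; /* always taken in the production (no-delay) build */
}
```

Because the only use site tests `gp_init_delay > 0`, no `IS_ENABLED()` is needed there, and the delay value never has to exist in configurations that do not use it.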
  7. 20 Mar, 2015 2 commits
    • rcu: Associate quiescent-state reports with grace period · 654e9533
      Committed by Paul E. McKenney
      As noted in earlier commit logs, CPU hotplug operations running
      concurrently with grace-period initialization can result in a given
      leaf rcu_node structure having all CPUs offline and no blocked readers,
      but with this rcu_node structure nevertheless blocking the current
      grace period.  Therefore, the quiescent-state forcing code now checks
      for this situation and repairs it.
      
      Unfortunately, this checking can result in false positives, for example,
      when the last task has just removed itself from this leaf rcu_node
      structure, but has not yet started clearing the ->qsmask bits further
      up the structure.  This means that the grace-period kthread (which
      forces quiescent states) and some other task might be attempting to
      concurrently clear these ->qsmask bits.  This is usually not a problem:
      One of these tasks will be the first to acquire the upper-level rcu_node
      structure's lock and will therefore clear the bit, and the other task,
      seeing the bit already cleared, will stop trying to clear bits.
      
      Sadly, this means that the following unusual sequence of events -can-
      result in a problem:
      
      1.	The grace-period kthread wins, and clears the ->qsmask bits.
      
      2.	This is the last thing blocking the current grace period, so
      	that the grace-period kthread clears ->qsmask bits all the way
      	to the root and finds that the root ->qsmask field is now zero.
      
      3.	Another grace period is required, so that the grace-period kthread
      	initializes it, including setting all the needed qsmask bits.
      
      4.	The leaf rcu_node structure (the one that started this whole
      	mess) is blocking this new grace period, either because it
      	has at least one online CPU or because there is at least one
      	task that had blocked within an RCU read-side critical section
      	while running on one of this leaf rcu_node structure's CPUs.
      	(And yes, that CPU might well have gone offline before the
      	grace period in step (3) above started, which can mean that
      	there is a task on the leaf rcu_node structure's ->blkd_tasks
      	list, but ->qsmask equal to zero.)
      
      5.	The other kthread didn't get around to trying to clear the upper
      	level ->qsmask bits until all the above had happened.  This means
      	that it now sees bits set in the upper-level ->qsmask field, so it
      	proceeds to clear them.  Too bad that it is doing so on behalf of
      	a quiescent state that does not apply to the current grace period!
      
      This sequence of events can result in the new grace period being too
      short.  It can also result in the new grace period ending before the
      leaf rcu_node structure's ->qsmask bits have been cleared, which will
      result in splats during initialization of the next grace period.  In
      addition, it can result in tasks blocking the new grace period still
      being queued at the start of the next grace period, which will result
      in other splats.  Sasha's testing turned up another of these splats,
      as did rcutorture testing.  (And yes, rcutorture is being adjusted to
      make these splats show up more quickly.  Which probably is having the
      undesirable side effect of making other problems show up less quickly.
      Can't have everything!)
      Reported-by: Sasha Levin <sasha.levin@oracle.com>
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: <stable@vger.kernel.org> # 4.0.x
      Tested-by: Sasha Levin <sasha.levin@oracle.com>
      654e9533
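The commit title suggests the remedy: each quiescent-state report carries the grace period it was observed in, and a report for an older grace period is ignored. The following single-threaded sketch illustrates that check under stated assumptions. `struct node` and `report_qs()` are hypothetical names, the field names only echo the commit message, and the locking that the real code relies on is omitted.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, simplified stand-in for an rcu_node structure. */
struct node {
	unsigned long gpnum;  /* grace period this node currently tracks */
	unsigned long qsmask; /* children still blocking that grace period */
};

/*
 * Report a quiescent state for the children in mask, observed during
 * grace period gps.  Returns true if the report was applied.
 */
bool report_qs(struct node *np, unsigned long mask, unsigned long gps)
{
	/* Someone else already cleared the bit, or the report is stale. */
	if (!(np->qsmask & mask) || np->gpnum != gps)
		return false;
	np->qsmask &= ~mask;
	return true;
}
```

Replaying the five-step sequence from the commit message against this sketch: the grace-period kthread clears the bits for grace period 1, grace period 2 starts with the bits set again, and the late report still tagged with grace period 1 is rejected instead of clearing a bit on behalf of the wrong grace period.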
    • rcu: Yet another fix for preemption and CPU hotplug · a77da14c
      Committed by Paul E. McKenney
      As noted earlier, the following sequence of events can occur when
      running PREEMPT_RCU and HOTPLUG_CPU on a system with a multi-level
      rcu_node combining tree:
      
      1.	A group of tasks block on CPUs corresponding to a given leaf
      	rcu_node structure while within RCU read-side critical sections.
      2.	All CPUs corresponding to that rcu_node structure go offline.
      3.	The next grace period starts, but because there are still tasks
      	blocked, the upper-level bits corresponding to this leaf rcu_node
      	structure remain set.
      4.	All the tasks exit their RCU read-side critical sections and
      	remove themselves from the leaf rcu_node structure's list,
      	leaving it empty.
      5.	But because there now is code to check for this condition at
      	force-quiescent-state time, the upper bits are cleared and the
      	grace period completes.
      
      However, there is another complication that can occur following step 4 above:
      
      4a.	The grace period starts, and the leaf rcu_node structure's
      	gp_tasks pointer is set to NULL because there are no tasks
      	blocked on this structure.
      4b.	One of the CPUs corresponding to the leaf rcu_node structure
      	comes back online.
      4c.	An endless stream of tasks are preempted within RCU read-side
      	critical sections on this CPU, such that the ->blkd_tasks
      	list is always non-empty.
      
      The grace period will never end.
      
      This commit therefore makes the force-quiescent-state processing check only
      for absence of tasks blocking the current grace period rather than absence
      of tasks altogether.  This will cause a quiescent state to be reported if
      the current leaf rcu_node structure is not blocking the current grace period
      and its parent thinks that it is, regardless of how RCU managed to get
      itself into this state.
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: <stable@vger.kernel.org> # 4.0.x
      Tested-by: Sasha Levin <sasha.levin@oracle.com>
      a77da14c
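The difference between the old and new force-quiescent-state checks can be shown with a reduced model. This is a sketch built only from the commit message, not the kernel code: `struct leaf`, `old_should_report()`, and `new_should_report()` are hypothetical, and the task lists are reduced to counters. `nr_gp_tasks` plays the role of a non-NULL gp_tasks pointer.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, reduced model of a leaf rcu_node structure. */
struct leaf {
	unsigned long qsmask; /* CPUs still blocking the current grace period */
	int nr_blkd_tasks;    /* all preempted readers queued on this leaf */
	int nr_gp_tasks;      /* subset that blocks the current grace period */
	bool parent_bit_set;  /* does the parent still wait on this leaf? */
};

/* Old check as described above: requires absence of tasks altogether. */
bool old_should_report(const struct leaf *l)
{
	return l->qsmask == 0 && l->nr_blkd_tasks == 0 && l->parent_bit_set;
}

/* New check: only tasks blocking the current grace period count. */
bool new_should_report(const struct leaf *l)
{
	return l->qsmask == 0 && l->nr_gp_tasks == 0 && l->parent_bit_set;
}
```

In the scenario of steps 4a to 4c, the leaf always has queued tasks but none of them blocks the current grace period, so the old predicate never fires while the new one reports the quiescent state and lets the grace period end.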
  8. 13 Mar, 2015 1 commit