1. 04 11月, 2014 2 次提交
  2. 29 10月, 2014 1 次提交
    • P
      rcu: Make rcu_barrier() understand about missing rcuo kthreads · d7e29933
      Paul E. McKenney 提交于
      Commit 35ce7f29 (rcu: Create rcuo kthreads only for onlined CPUs)
      avoids creating rcuo kthreads for CPUs that never come online.  This
      fixes a bug in many instances of firmware: Instead of lying about their
      age, these systems instead lie about the number of CPUs that they have.
      Before commit 35ce7f29, this could result in huge numbers of useless
      rcuo kthreads being created.
      
      It appears that experience indicates that I should have told the
      people suffering from this problem to fix their broken firmware, but
      I instead produced what turned out to be a partial fix.   The missing
      piece supplied by this commit makes sure that rcu_barrier() knows not to
      post callbacks for no-CBs CPUs that have not yet come online, because
      otherwise rcu_barrier() will hang on systems having firmware that lies
      about the number of CPUs.
      
      It is tempting to simply have rcu_barrier() refuse to post a callback on
      any no-CBs CPU that does not have an rcuo kthread.  This unfortunately
      does not work because rcu_barrier() is required to wait for all pending
      callbacks.  It is therefore required to wait even for those callbacks
      that cannot possibly be invoked.  Even if doing so hangs the system.
      
      Given that posting a callback to a no-CBs CPU that does not yet have an
      rcuo kthread can hang rcu_barrier(), It is tempting to report an error
      in this case.  Unfortunately, this will result in false positives at
      boot time, when it is perfectly legal to post callbacks to the boot CPU
      before the scheduler has started, in other words, before it is legal
      to invoke rcu_barrier().
      
      So this commit instead has rcu_barrier() avoid posting callbacks to
      CPUs having neither rcuo kthread nor pending callbacks, and has it
      complain bitterly if it finds CPUs having no rcuo kthread but some
      pending callbacks.  And when rcu_barrier() does find CPUs having no rcuo
      kthread but pending callbacks, as noted earlier, it has no choice but
      to hang indefinitely.
      Reported-by: NYanko Kaneti <yaneti@declera.com>
      Reported-by: NJay Vosburgh <jay.vosburgh@canonical.com>
      Reported-by: NMeelis Roos <mroos@linux.ee>
      Reported-by: NEric B Munson <emunson@akamai.com>
      Signed-off-by: NPaul E. McKenney <paulmck@linux.vnet.ibm.com>
      Tested-by: NEric B Munson <emunson@akamai.com>
      Tested-by: NJay Vosburgh <jay.vosburgh@canonical.com>
      Tested-by: NYanko Kaneti <yaneti@declera.com>
      Tested-by: NKevin Fenzi <kevin@scrye.com>
      Tested-by: NMeelis Roos <mroos@linux.ee>
      d7e29933
  3. 19 9月, 2014 1 次提交
    • P
      rcu: Eliminate deadlock between CPU hotplug and expedited grace periods · dd56af42
      Paul E. McKenney 提交于
      Currently, the expedited grace-period primitives do get_online_cpus().
      This greatly simplifies their implementation, but means that calls
      to them holding locks that are acquired by CPU-hotplug notifiers (to
      say nothing of calls to these primitives from CPU-hotplug notifiers)
      can deadlock.  But this is starting to become inconvenient, as can be
      seen here: https://lkml.org/lkml/2014/8/5/754.  The problem in this
      case is that some developers need to acquire a mutex from a CPU-hotplug
      notifier, but also need to hold it across a synchronize_rcu_expedited().
      As noted above, this currently results in deadlock.
      
      This commit avoids the deadlock and retains the simplicity by creating
      a try_get_online_cpus(), which returns false if the get_online_cpus()
      reference count could not immediately be incremented.  If a call to
      try_get_online_cpus() returns true, the expedited primitives operate as
      before.  If a call returns false, the expedited primitives fall back to
      normal grace-period operations.  This falling back of course results in
      increased grace-period latency, but only during times when CPU hotplug
      operations are actually in flight.  The effect should therefore be
      negligible during normal operation.
      Signed-off-by: NPaul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Josh Triplett <josh@joshtriplett.org>
      Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
      Tested-by: NLan Tianyu <tianyu.lan@intel.com>
      dd56af42
  4. 17 9月, 2014 14 次提交
  5. 08 9月, 2014 9 次提交
  6. 28 8月, 2014 1 次提交
  7. 17 7月, 2014 1 次提交
    • P
      rcu: Allow for NULL tick_nohz_full_mask when nohz_full= missing · 187497fa
      Paul E. McKenney 提交于
      If there isn't a nohz_full= kernel parameter specified, then
      tick_nohz_full_mask can legitimately be NULL.  This can cause
      problems when RCU's boot code tries to cpumask_or() this value into
      rcu_nocb_mask.  In addition, if NO_HZ_FULL_ALL=y, there is no point
      in doing the cpumask_or() in the first place because this will cause
      RCU_NOCB_CPU_ALL=y, which in turn will have all bits already set in
      rcu_nocb_mask.
      
      This commit therefore avoids the cpumask_or() if NO_HZ_FULL_ALL=y
      and checks for !tick_nohz_full_running otherwise, this latter check
      catching cases when there was no nohz_full= kernel parameter specified.
      Signed-off-by: NPaul E. McKenney <paulmck@linux.vnet.ibm.com>
      187497fa
  8. 10 7月, 2014 6 次提交
    • P
      rcu: Fix a sparse warning in rcu_report_unblock_qs_rnp() · b41d1b92
      Pranith Kumar 提交于
      This commit annotates rcu_report_unblock_qs_rnp() in order to fix the
      following sparse warning:
      
      kernel/rcu/tree_plugin.h:990:13: warning: context imbalance in 'rcu_report_unblock_qs_rnp' - unexpected unlock
      Signed-off-by: NPranith Kumar <bobby.prani@gmail.com>
      Signed-off-by: NPaul E. McKenney <paulmck@linux.vnet.ibm.com>
      Reviewed-by: NLai Jiangshan <laijs@cn.fujitsu.com>
      b41d1b92
    • P
      rcu: Fix a sparse warning in rcu_initiate_boost() · 615e41c6
      Pranith Kumar 提交于
      This commit annotates rcu_initiate_boost() fixes the following sparse
      warning:
      
      	kernel/rcu/tree_plugin.h:1494:13: warning: context imbalance in 'rcu_initiate_boost' - unexpected unlock
      Signed-off-by: NPranith Kumar <bobby.prani@gmail.com>
      Signed-off-by: NPaul E. McKenney <paulmck@linux.vnet.ibm.com>
      Reviewed-by: NLai Jiangshan <laijs@cn.fujitsu.com>
      615e41c6
    • P
      rcu: Bind grace-period kthreads to non-NO_HZ_FULL CPUs · c0f489d2
      Paul E. McKenney 提交于
      Binding the grace-period kthreads to the timekeeping CPU resulted in
      significant performance decreases for some workloads.  For more detail,
      see:
      
      https://lkml.org/lkml/2014/6/3/395 for benchmark numbers
      
      https://lkml.org/lkml/2014/6/4/218 for CPU statistics
      
      It turns out that it is necessary to bind the grace-period kthreads
      to the timekeeping CPU only when all but CPU 0 is a nohz_full CPU
      on the one hand or if CONFIG_NO_HZ_FULL_SYSIDLE=y on the other.
      In other cases, it suffices to bind the grace-period kthreads to the
      set of non-nohz_full CPUs.
      
      This commit therefore creates a tick_nohz_not_full_mask that is the
      complement of tick_nohz_full_mask, and then binds the grace-period
      kthread to the set of CPUs indicated by this new mask, which covers
      the CONFIG_NO_HZ_FULL_SYSIDLE=n case.  The CONFIG_NO_HZ_FULL_SYSIDLE=y
      case still binds the grace-period kthreads to the timekeeping CPU.
      This commit also includes the tick_nohz_full_enabled() check suggested
      by Frederic Weisbecker.
      Reported-by: NJet Chen <jet.chen@intel.com>
      Signed-off-by: NPaul E. McKenney <paulmck@linux.vnet.ibm.com>
      [ paulmck: Created housekeeping_affine() and housekeeping_mask per
        fweisbec feedback. ]
      c0f489d2
    • P
      rcu: Simplify priority boosting by putting rt_mutex in rcu_node · abaa93d9
      Paul E. McKenney 提交于
      RCU priority boosting currently checks for boosting via a pointer in
      task_struct.  However, this is not needed: As Oleg noted, if the
      rt_mutex is placed in the rcu_node instead of on the booster's stack,
      the boostee can simply check it see if it owns the lock.  This commit
      makes this change, shrinking task_struct by one pointer and the kernel
      by thirteen lines.
      Suggested-by: NOleg Nesterov <oleg@redhat.com>
      Signed-off-by: NPaul E. McKenney <paulmck@linux.vnet.ibm.com>
      abaa93d9
    • P
      rcu: Allow post-unlock reference for rt_mutex · dfeb9765
      Paul E. McKenney 提交于
      The current approach to RCU priority boosting uses an rt_mutex strictly
      for its priority-boosting side effects.  The rt_mutex_init_proxy_locked()
      function is used by the booster to initialize the lock as held by the
      boostee.  The booster then uses rt_mutex_lock() to acquire this rt_mutex,
      which priority-boosts the boostee.  When the boostee reaches the end
      of its outermost RCU read-side critical section, it checks a field in
      its task structure to see whether it has been boosted, and, if so, uses
      rt_mutex_unlock() to release the rt_mutex.  The booster can then go on
      to boost the next task that is blocking the current RCU grace period.
      
      But reasonable implementations of rt_mutex_unlock() might result in the
      boostee referencing the rt_mutex's data after releasing it.  But the
      booster might have re-initialized the rt_mutex between the time that the
      boostee released it and the time that it later referenced it.  This is
      clearly asking for trouble, so this commit introduces a completion that
      forces the booster to wait until the boostee has completely finished with
      the rt_mutex, thus avoiding the case where the booster is re-initializing
      the rt_mutex before the last boostee's last reference to that rt_mutex.
      
      This of course does introduce some overhead, but the priority-boosting
      code paths are miles from any possible fastpath, and the overhead of
      executing the completion will normally be quite small compared to the
      overhead of priority boosting and deboosting, so this should be OK.
      Signed-off-by: NPaul E. McKenney <paulmck@linux.vnet.ibm.com>
      dfeb9765
    • P
      rcu: Remove redundant ACCESS_ONCE() from tick_do_timer_cpu · 4da117cf
      Paul E. McKenney 提交于
      In kernels built with CONFIG_NO_HZ_FULL, tick_do_timer_cpu is constant
      once boot completes.  Thus, there is no need to wrap it in ACCESS_ONCE()
      in code that is built only when CONFIG_NO_HZ_FULL.  This commit therefore
      removes the redundant ACCESS_ONCE().
      Signed-off-by: NPaul E. McKenney <paulmck@linux.vnet.ibm.com>
      Acked-by: NFrederic Weisbecker <fweisbec@gmail.com>
      Reviewed-by: NLai Jiangshan <laijs@cn.fujitsu.com>
      4da117cf
  9. 08 7月, 2014 2 次提交
    • P
      rcu: Don't offload callbacks unless specifically requested · b58cc46c
      Paul E. McKenney 提交于
      Enabling NO_HZ_FULL currently has the side effect of enabling callback
      offloading on all CPUs.  This results in lots of additional rcuo kthreads,
      and can also increase context switching and wakeups, even in cases where
      callback offloading is neither needed nor particularly desirable.  This
      commit therefore enables callback offloading on a given CPU only if
      specifically requested at build time or boot time, or if that CPU has
      been specifically designated (again, either at build time or boot time)
      as a nohz_full CPU.
      Signed-off-by: NPaul E. McKenney <paulmck@linux.vnet.ibm.com>
      b58cc46c
    • P
      rcu: Parallelize and economize NOCB kthread wakeups · fbce7497
      Paul E. McKenney 提交于
      An 80-CPU system with a context-switch-heavy workload can require so
      many NOCB kthread wakeups that the RCU grace-period kthreads spend several
      tens of percent of a CPU just awakening things.  This clearly will not
      scale well: If you add enough CPUs, the RCU grace-period kthreads would
      get behind, increasing grace-period latency.
      
      To avoid this problem, this commit divides the NOCB kthreads into leaders
      and followers, where the grace-period kthreads awaken the leaders each of
      whom in turn awakens its followers.  By default, the number of groups of
      kthreads is the square root of the number of CPUs, but this default may
      be overridden using the rcutree.rcu_nocb_leader_stride boot parameter.
      This reduces the number of wakeups done per grace period by the RCU
      grace-period kthread by the square root of the number of CPUs, but of
      course by shifting those wakeups to the leaders.  In addition, because
      the leaders do grace periods on behalf of their respective followers,
      the number of wakeups of the followers decreases by up to a factor of two.
      Instead of being awakened once when new callbacks arrive and again
      at the end of the grace period, the followers are awakened only at
      the end of the grace period.
      
      For a numerical example, in a 4096-CPU system, the grace-period kthread
      would awaken 64 leaders, each of which would awaken its 63 followers
      at the end of the grace period.  This compares favorably with the 79
      wakeups for the grace-period kthread on an 80-CPU system.
      Reported-by: NRik van Riel <riel@redhat.com>
      Signed-off-by: NPaul E. McKenney <paulmck@linux.vnet.ibm.com>
      fbce7497
  10. 24 6月, 2014 1 次提交
    • P
      rcu: Reduce overhead of cond_resched() checks for RCU · 4a81e832
      Paul E. McKenney 提交于
      Commit ac1bea85 (Make cond_resched() report RCU quiescent states)
      fixed a problem where a CPU looping in the kernel with but one runnable
      task would give RCU CPU stall warnings, even if the in-kernel loop
      contained cond_resched() calls.  Unfortunately, in so doing, it introduced
      performance regressions in Anton Blanchard's will-it-scale "open1" test.
      The problem appears to be not so much the increased cond_resched() path
      length as an increase in the rate at which grace periods complete, which
      increased per-update grace-period overhead.
      
      This commit takes a different approach to fixing this bug, mainly by
      moving the RCU-visible quiescent state from cond_resched() to
      rcu_note_context_switch(), and by further reducing the check to a
      simple non-zero test of a single per-CPU variable.  However, this
      approach requires that the force-quiescent-state processing send
      resched IPIs to the offending CPUs.  These will be sent only once
      the grace period has reached an age specified by the boot/sysfs
      parameter rcutree.jiffies_till_sched_qs, or once the grace period
      reaches an age halfway to the point at which RCU CPU stall warnings
      will be emitted, whichever comes first.
      Reported-by: NDave Hansen <dave.hansen@intel.com>
      Signed-off-by: NPaul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Christoph Lameter <cl@gentwo.org>
      Cc: Mike Galbraith <umgwanakikbuti@gmail.com>
      Cc: Eric Dumazet <eric.dumazet@gmail.com>
      Reviewed-by: NJosh Triplett <josh@joshtriplett.org>
      [ paulmck: Made rcu_momentary_dyntick_idle() as suggested by the
        ktest build robot.  Also fixed smp_mb() comment as noted by
        Oleg Nesterov. ]
      
      Merge with e552592e (Reduce overhead of cond_resched() checks for RCU)
      Signed-off-by: NPaul E. McKenney <paulmck@linux.vnet.ibm.com>
      4a81e832
  11. 15 5月, 2014 1 次提交
  12. 29 4月, 2014 1 次提交