1. 08 9月, 2014 6 次提交
  2. 28 8月, 2014 1 次提交
  3. 31 7月, 2014 1 次提交
  4. 17 7月, 2014 1 次提交
    • P
      rcu: Allow for NULL tick_nohz_full_mask when nohz_full= missing · 187497fa
      Paul E. McKenney 提交于
      If there isn't a nohz_full= kernel parameter specified, then
      tick_nohz_full_mask can legitimately be NULL.  This can cause
      problems when RCU's boot code tries to cpumask_or() this value into
      rcu_nocb_mask.  In addition, if NO_HZ_FULL_ALL=y, there is no point
      in doing the cpumask_or() in the first place because this will cause
      RCU_NOCB_CPU_ALL=y, which in turn will have all bits already set in
      rcu_nocb_mask.
      
      This commit therefore avoids the cpumask_or() if NO_HZ_FULL_ALL=y
      and checks for !tick_nohz_full_running otherwise, this latter check
      catching cases when there was no nohz_full= kernel parameter specified.
      Signed-off-by: NPaul E. McKenney <paulmck@linux.vnet.ibm.com>
      187497fa
  5. 10 7月, 2014 14 次提交
    • P
      rcu: Fix a sparse warning in rcu_report_unblock_qs_rnp() · b41d1b92
      Pranith Kumar 提交于
      This commit annotates rcu_report_unblock_qs_rnp() in order to fix the
      following sparse warning:
      
      kernel/rcu/tree_plugin.h:990:13: warning: context imbalance in 'rcu_report_unblock_qs_rnp' - unexpected unlock
      Signed-off-by: NPranith Kumar <bobby.prani@gmail.com>
      Signed-off-by: NPaul E. McKenney <paulmck@linux.vnet.ibm.com>
      Reviewed-by: NLai Jiangshan <laijs@cn.fujitsu.com>
      b41d1b92
    • P
      rcu: Fix a sparse warning in rcu_initiate_boost() · 615e41c6
      Pranith Kumar 提交于
      This commit annotates rcu_initiate_boost() fixes the following sparse
      warning:
      
      	kernel/rcu/tree_plugin.h:1494:13: warning: context imbalance in 'rcu_initiate_boost' - unexpected unlock
      Signed-off-by: NPranith Kumar <bobby.prani@gmail.com>
      Signed-off-by: NPaul E. McKenney <paulmck@linux.vnet.ibm.com>
      Reviewed-by: NLai Jiangshan <laijs@cn.fujitsu.com>
      615e41c6
    • P
      rcu: Fix __rcu_reclaim() to use true/false for bool · 406e3e53
      Paul E. McKenney 提交于
      The __rcu_reclaim() function returned 0/1, which is not proper for a
      function of type bool.  This commit therefore converts to false/true.
      Signed-off-by: NPaul E. McKenney <paulmck@linux.vnet.ibm.com>
      Reviewed-by: NLai Jiangshan <laijs@cn.fujitsu.com>
      406e3e53
    • P
      rcu: Remove CONFIG_PROVE_RCU_DELAY · 11992c70
      Paul E. McKenney 提交于
      The CONFIG_PROVE_RCU_DELAY Kconfig parameter doesn't appear to be very
      effective at finding race conditions, so this commit removes it.
      Signed-off-by: NPaul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Andi Kleen <ak@linux.intel.com>
      [ paulmck: Remove definition and uses as noted by Paul Bolle. ]
      11992c70
    • S
      rcu: Use __this_cpu_read() instead of per_cpu_ptr() · d860d403
      Shan Wei 提交于
      The __this_cpu_read() function produces better code than does
      per_cpu_ptr() on both ARM and x86.  For example, gcc (Ubuntu/Linaro
      4.7.3-12ubuntu1) 4.7.3 produces the following:
      
      ARMv7 per_cpu_ptr():
      
      force_quiescent_state:
          mov    r3, sp    @,
          bic    r1, r3, #8128    @ tmp171,,
          ldr    r2, .L98    @ tmp169,
          bic    r1, r1, #63    @ tmp170, tmp171,
          ldr    r3, [r0, #220]    @ __ptr, rsp_6(D)->rda
          ldr    r1, [r1, #20]    @ D.35903_68->cpu, D.35903_68->cpu
          mov    r6, r0    @ rsp, rsp
          ldr    r2, [r2, r1, asl #2]    @ tmp173, __per_cpu_offset
          add    r3, r3, r2    @ tmp175, __ptr, tmp173
          ldr    r5, [r3, #12]    @ rnp_old, D.29162_13->mynode
      
      ARMv7 __this_cpu_read():
      
      force_quiescent_state:
          ldr    r3, [r0, #220]    @ rsp_7(D)->rda, rsp_7(D)->rda
          mov    r6, r0    @ rsp, rsp
          add    r3, r3, #12    @ __ptr, rsp_7(D)->rda,
          ldr    r5, [r2, r3]    @ rnp_old, *D.29176_13
      
      Using gcc 4.8.2:
      
      x86_64 per_cpu_ptr():
      
          movl %gs:cpu_number,%edx    # cpu_number, pscr_ret__
          movslq    %edx, %rdx    # pscr_ret__, pscr_ret__
          movq    __per_cpu_offset(,%rdx,8), %rdx    # __per_cpu_offset, tmp93
          movq    %rdi, %r13    # rsp, rsp
          movq    1000(%rdi), %rax    # rsp_9(D)->rda, __ptr
          movq    24(%rdx,%rax), %r12    # _15->mynode, rnp_old
      
      x86_64 __this_cpu_read():
      
          movq    %rdi, %r13    # rsp, rsp
          movq    1000(%rdi), %rax    # rsp_9(D)->rda, rsp_9(D)->rda
          movq %gs:24(%rax),%r12    # _10->mynode, rnp_old
      
      Because this change produces significant benefits for these two very
      diverse architectures, this commit makes this change.
      Signed-off-by: NShan Wei <davidshan@tencent.com>
      Acked-by: NChristoph Lameter <cl@linux.com>
      Signed-off-by: NPranith Kumar <bobby.prani@gmail.com>
      Signed-off-by: NPaul E. McKenney <paulmck@linux.vnet.ibm.com>
      Reviewed-by: NJosh Triplett <josh@joshtriplett.org>
      Reviewed-by: NLai Jiangshan <laijs@cn.fujitsu.com>
      d860d403
    • P
      rcu: Don't use NMIs to dump other CPUs' stacks · bc1dce51
      Paul E. McKenney 提交于
      Although NMI-based stack dumps are in principle more accurate, they are
      also more likely to trigger deadlocks.  This commit therefore replaces
      all uses of trigger_all_cpu_backtrace() with rcu_dump_cpu_stacks(), so
      that the CPU detecting an RCU CPU stall does the stack dumping.
      Signed-off-by: NPaul E. McKenney <paulmck@linux.vnet.ibm.com>
      Reviewed-by: NLai Jiangshan <laijs@cn.fujitsu.com>
      bc1dce51
    • P
      rcu: Bind grace-period kthreads to non-NO_HZ_FULL CPUs · c0f489d2
      Paul E. McKenney 提交于
      Binding the grace-period kthreads to the timekeeping CPU resulted in
      significant performance decreases for some workloads.  For more detail,
      see:
      
      https://lkml.org/lkml/2014/6/3/395 for benchmark numbers
      
      https://lkml.org/lkml/2014/6/4/218 for CPU statistics
      
      It turns out that it is necessary to bind the grace-period kthreads
      to the timekeeping CPU only when all but CPU 0 is a nohz_full CPU
      on the one hand or if CONFIG_NO_HZ_FULL_SYSIDLE=y on the other.
      In other cases, it suffices to bind the grace-period kthreads to the
      set of non-nohz_full CPUs.
      
      This commit therefore creates a tick_nohz_not_full_mask that is the
      complement of tick_nohz_full_mask, and then binds the grace-period
      kthread to the set of CPUs indicated by this new mask, which covers
      the CONFIG_NO_HZ_FULL_SYSIDLE=n case.  The CONFIG_NO_HZ_FULL_SYSIDLE=y
      case still binds the grace-period kthreads to the timekeeping CPU.
      This commit also includes the tick_nohz_full_enabled() check suggested
      by Frederic Weisbecker.
      Reported-by: NJet Chen <jet.chen@intel.com>
      Signed-off-by: NPaul E. McKenney <paulmck@linux.vnet.ibm.com>
      [ paulmck: Created housekeeping_affine() and housekeeping_mask per
        fweisbec feedback. ]
      c0f489d2
    • P
      rcu: Simplify priority boosting by putting rt_mutex in rcu_node · abaa93d9
      Paul E. McKenney 提交于
      RCU priority boosting currently checks for boosting via a pointer in
      task_struct.  However, this is not needed: As Oleg noted, if the
      rt_mutex is placed in the rcu_node instead of on the booster's stack,
      the boostee can simply check it see if it owns the lock.  This commit
      makes this change, shrinking task_struct by one pointer and the kernel
      by thirteen lines.
      Suggested-by: NOleg Nesterov <oleg@redhat.com>
      Signed-off-by: NPaul E. McKenney <paulmck@linux.vnet.ibm.com>
      abaa93d9
    • P
      rcu: Check both root and current rcu_node when setting up future grace period · 48bd8e9b
      Pranith Kumar 提交于
      The rcu_start_future_gp() function checks the current rcu_node's ->gpnum
      and ->completed twice, once without ACCESS_ONCE() and once with it.
      Which is pointless because we hold that rcu_node's ->lock at that point.
      The intent was to check the current rcu_node structure and the root
      rcu_node structure, the latter locklessly with ACCESS_ONCE().  This
      commit therefore makes that change.
      
      The reason that it is safe to locklessly check the root rcu_nodes's
      ->gpnum and ->completed fields is that we hold the current rcu_node's
      ->lock, which constrains the root rcu_node's ability to change its
      ->gpnum and ->completed fields.  Of course, if there is a single rcu_node
      structure, then rnp_root==rnp, and holding the lock prevents all changes.
      If there is more than one rcu_node structure, then the code updates the
      fields in the following order:
      
      1.	Increment rnp_root->gpnum to start new grace period.
      2.	Increment rnp->gpnum to initialize the current rcu_node,
      	continuing initialization for the new grace period.
      3.	Increment rnp_root->completed to end the current grace period.
      4.	Increment rnp->completed to continue cleaning up after the
      	old grace period.
      
      So there are four possible combinations of relative values of these
      four fields:
      
      N   N   N   N:  RCU idle, new grace period must be initiated.
      		Although rnp_root->gpnum might be incremented immediately
      		after we check, that will just result in unnecessary work.
      		The grace period already started, and we try to start it.
      
      N+1 N   N   N:  RCU grace period just started.  No further change is
      		possible because we hold rnp->lock, so the checks of
      		rnp_root->gpnum and rnp_root->completed are stable.
      		We know that our request for a future grace period will
      		be seen during grace-period cleanup.
      
      N+1 N   N+1 N:  RCU grace period is ongoing.  Because rnp->gpnum is
      		different than rnp->completed, we won't even look at
      		rnp_root->gpnum and rnp_root->completed, so the possible
      		concurrent change to rnp_root->completed does not matter.
      		We know that our request for a future grace period will
      		be seen during grace-period cleanup, which cannot pass
      		this rcu_node because we hold its ->lock.
      
      N+1 N+1 N+1 N:  RCU grace period has ended, but not yet been cleaned up.
      		Because rnp->gpnum is different than rnp->completed, we
      		won't look at rnp_root->gpnum and rnp_root->completed, so
      		the possible concurrent change to rnp_root->completed does
      		not matter.  We know that our request for a future grace
      		period will be seen during grace-period cleanup, which
      		cannot pass this rcu_node because we hold its ->lock.
      
      Therefore, despite initial appearances, the lockless check is safe.
      Signed-off-by: NPranith Kumar <bobby.prani@gmail.com>
      [ paulmck: Update comment to say why the lockless check is safe. ]
      Signed-off-by: NPaul E. McKenney <paulmck@linux.vnet.ibm.com>
      48bd8e9b
    • P
      rcu: Allow post-unlock reference for rt_mutex · dfeb9765
      Paul E. McKenney 提交于
      The current approach to RCU priority boosting uses an rt_mutex strictly
      for its priority-boosting side effects.  The rt_mutex_init_proxy_locked()
      function is used by the booster to initialize the lock as held by the
      boostee.  The booster then uses rt_mutex_lock() to acquire this rt_mutex,
      which priority-boosts the boostee.  When the boostee reaches the end
      of its outermost RCU read-side critical section, it checks a field in
      its task structure to see whether it has been boosted, and, if so, uses
      rt_mutex_unlock() to release the rt_mutex.  The booster can then go on
      to boost the next task that is blocking the current RCU grace period.
      
      But reasonable implementations of rt_mutex_unlock() might result in the
      boostee referencing the rt_mutex's data after releasing it.  But the
      booster might have re-initialized the rt_mutex between the time that the
      boostee released it and the time that it later referenced it.  This is
      clearly asking for trouble, so this commit introduces a completion that
      forces the booster to wait until the boostee has completely finished with
      the rt_mutex, thus avoiding the case where the booster is re-initializing
      the rt_mutex before the last boostee's last reference to that rt_mutex.
      
      This of course does introduce some overhead, but the priority-boosting
      code paths are miles from any possible fastpath, and the overhead of
      executing the completion will normally be quite small compared to the
      overhead of priority boosting and deboosting, so this should be OK.
      Signed-off-by: NPaul E. McKenney <paulmck@linux.vnet.ibm.com>
      dfeb9765
    • P
      rcu: Loosen __call_rcu()'s rcu_head alignment constraint · 1146edcb
      Paul E. McKenney 提交于
      The m68k architecture aligns only to 16-bit boundaries, which can cause
      the align-to-32-bits check in __call_rcu() to trigger.  Because there is
      currently no known potential need for more than one low-order bit, this
      commit loosens the check to 16-bit boundaries.
      Reported-by: NGreg Ungerer <gerg@uclinux.org>
      Signed-off-by: NPaul E. McKenney <paulmck@linux.vnet.ibm.com>
      Reviewed-by: NLai Jiangshan <laijs@cn.fujitsu.com>
      1146edcb
    • P
      rcu: Eliminate read-modify-write ACCESS_ONCE() calls · a792563b
      Paul E. McKenney 提交于
      RCU contains code of the following forms:
      
      	ACCESS_ONCE(x)++;
      	ACCESS_ONCE(x) += y;
      	ACCESS_ONCE(x) -= y;
      
      Now these constructs do operate correctly, but they really result in a
      pair of volatile accesses, one to do the load and another to do the store.
      This can be confusing, as the casual reader might well assume that (for
      example) gcc might generate a memory-to-memory add instruction for each
      of these three cases.  In fact, gcc will do no such thing.  Also, there
      is a good chance that the kernel will move to separate load and store
      variants of ACCESS_ONCE(), and constructs like the above could easily
      confuse both people and scripts attempting to make that sort of change.
      Finally, most of RCU's read-modify-write uses of ACCESS_ONCE() really
      only need the store to be volatile, so that the read-modify-write form
      might be misleading.
      
      This commit therefore changes the above forms in RCU so that each instance
      of ACCESS_ONCE() either does a load or a store, but not both.  In a few
      cases, ACCESS_ONCE() was not critical, for example, for maintaining
      statisitics.  In these cases, ACCESS_ONCE() has been dispensed with
      entirely.
      Suggested-by: NLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: NPaul E. McKenney <paulmck@linux.vnet.ibm.com>
      a792563b
    • P
      rcu: Remove redundant ACCESS_ONCE() from tick_do_timer_cpu · 4da117cf
      Paul E. McKenney 提交于
      In kernels built with CONFIG_NO_HZ_FULL, tick_do_timer_cpu is constant
      once boot completes.  Thus, there is no need to wrap it in ACCESS_ONCE()
      in code that is built only when CONFIG_NO_HZ_FULL.  This commit therefore
      removes the redundant ACCESS_ONCE().
      Signed-off-by: NPaul E. McKenney <paulmck@linux.vnet.ibm.com>
      Acked-by: NFrederic Weisbecker <fweisbec@gmail.com>
      Reviewed-by: NLai Jiangshan <laijs@cn.fujitsu.com>
      4da117cf
    • F
      rcu: Make rcu node arrays static const char * const · b4426b49
      Fabian Frederick 提交于
      Those two arrays are being passed to lockdep_init_map(), which expects
      const char *, and are stored in lockdep_map the same way.
      
      Cc: Dipankar Sarma <dipankar@in.ibm.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NFabian Frederick <fabf@skynet.be>
      Signed-off-by: NPaul E. McKenney <paulmck@linux.vnet.ibm.com>
      b4426b49
  6. 08 7月, 2014 2 次提交
    • P
      rcu: Don't offload callbacks unless specifically requested · b58cc46c
      Paul E. McKenney 提交于
      Enabling NO_HZ_FULL currently has the side effect of enabling callback
      offloading on all CPUs.  This results in lots of additional rcuo kthreads,
      and can also increase context switching and wakeups, even in cases where
      callback offloading is neither needed nor particularly desirable.  This
      commit therefore enables callback offloading on a given CPU only if
      specifically requested at build time or boot time, or if that CPU has
      been specifically designated (again, either at build time or boot time)
      as a nohz_full CPU.
      Signed-off-by: NPaul E. McKenney <paulmck@linux.vnet.ibm.com>
      b58cc46c
    • P
      rcu: Parallelize and economize NOCB kthread wakeups · fbce7497
      Paul E. McKenney 提交于
      An 80-CPU system with a context-switch-heavy workload can require so
      many NOCB kthread wakeups that the RCU grace-period kthreads spend several
      tens of percent of a CPU just awakening things.  This clearly will not
      scale well: If you add enough CPUs, the RCU grace-period kthreads would
      get behind, increasing grace-period latency.
      
      To avoid this problem, this commit divides the NOCB kthreads into leaders
      and followers, where the grace-period kthreads awaken the leaders each of
      whom in turn awakens its followers.  By default, the number of groups of
      kthreads is the square root of the number of CPUs, but this default may
      be overridden using the rcutree.rcu_nocb_leader_stride boot parameter.
      This reduces the number of wakeups done per grace period by the RCU
      grace-period kthread by the square root of the number of CPUs, but of
      course by shifting those wakeups to the leaders.  In addition, because
      the leaders do grace periods on behalf of their respective followers,
      the number of wakeups of the followers decreases by up to a factor of two.
      Instead of being awakened once when new callbacks arrive and again
      at the end of the grace period, the followers are awakened only at
      the end of the grace period.
      
      For a numerical example, in a 4096-CPU system, the grace-period kthread
      would awaken 64 leaders, each of which would awaken its 63 followers
      at the end of the grace period.  This compares favorably with the 79
      wakeups for the grace-period kthread on an 80-CPU system.
      Reported-by: NRik van Riel <riel@redhat.com>
      Signed-off-by: NPaul E. McKenney <paulmck@linux.vnet.ibm.com>
      fbce7497
  7. 24 6月, 2014 2 次提交
    • P
      rcu: Reduce overhead of cond_resched() checks for RCU · 4a81e832
      Paul E. McKenney 提交于
      Commit ac1bea85 (Make cond_resched() report RCU quiescent states)
      fixed a problem where a CPU looping in the kernel with but one runnable
      task would give RCU CPU stall warnings, even if the in-kernel loop
      contained cond_resched() calls.  Unfortunately, in so doing, it introduced
      performance regressions in Anton Blanchard's will-it-scale "open1" test.
      The problem appears to be not so much the increased cond_resched() path
      length as an increase in the rate at which grace periods complete, which
      increased per-update grace-period overhead.
      
      This commit takes a different approach to fixing this bug, mainly by
      moving the RCU-visible quiescent state from cond_resched() to
      rcu_note_context_switch(), and by further reducing the check to a
      simple non-zero test of a single per-CPU variable.  However, this
      approach requires that the force-quiescent-state processing send
      resched IPIs to the offending CPUs.  These will be sent only once
      the grace period has reached an age specified by the boot/sysfs
      parameter rcutree.jiffies_till_sched_qs, or once the grace period
      reaches an age halfway to the point at which RCU CPU stall warnings
      will be emitted, whichever comes first.
      Reported-by: NDave Hansen <dave.hansen@intel.com>
      Signed-off-by: NPaul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Christoph Lameter <cl@gentwo.org>
      Cc: Mike Galbraith <umgwanakikbuti@gmail.com>
      Cc: Eric Dumazet <eric.dumazet@gmail.com>
      Reviewed-by: NJosh Triplett <josh@joshtriplett.org>
      [ paulmck: Made rcu_momentary_dyntick_idle() as suggested by the
        ktest build robot.  Also fixed smp_mb() comment as noted by
        Oleg Nesterov. ]
      
      Merge with e552592e (Reduce overhead of cond_resched() checks for RCU)
      Signed-off-by: NPaul E. McKenney <paulmck@linux.vnet.ibm.com>
      4a81e832
    • P
      rcu: Export debug_init_rcu_head() and and debug_init_rcu_head() · 546a9d85
      Paul E. McKenney 提交于
      Currently, call_rcu() relies on implicit allocation and initialization
      for the debug-objects handling of RCU callbacks.  If you hammer the
      kernel hard enough with Sasha's modified version of trinity, you can end
      up with the sl*b allocators recursing into themselves via this implicit
      call_rcu() allocation.
      
      This commit therefore exports the debug_init_rcu_head() and
      debug_rcu_head_free() functions, which permits the allocators to allocated
      and pre-initialize the debug-objects information, so that there no longer
      any need for call_rcu() to do that initialization, which in turn prevents
      the recursion into the memory allocators.
      Reported-by: NSasha Levin <sasha.levin@oracle.com>
      Suggested-by: NThomas Gleixner <tglx@linutronix.de>
      Signed-off-by: NPaul E. McKenney <paulmck@linux.vnet.ibm.com>
      Acked-by: NThomas Gleixner <tglx@linutronix.de>
      Looks-good-to: Christoph Lameter <cl@linux.com>
      546a9d85
  8. 20 5月, 2014 1 次提交
  9. 15 5月, 2014 12 次提交