1. 26 Jul 2017 (1 commit)
    • rcu: Use timer as backstop for NOCB deferred wakeups · 8be6e1b1
      Committed by Paul E. McKenney
      The handling of RCU's no-CBs CPUs has a maintenance headache, namely
      that if call_rcu() is invoked with interrupts disabled, the rcuo kthread
      wakeup must be deferred to a point where we can be sure that scheduler
      locks are not held.  Of course, there are a lot of code paths leading
      from an interrupts-disabled invocation of call_rcu(), and missing any
      one of these can result in excessive callback-invocation latency, and
      potentially even system hangs.
      
      This commit therefore uses a timer to guarantee that the wakeup will
      eventually occur.  If one of the deferred-wakeup points kicks in, then
      the timer is simply cancelled.
      
      This commit also fixes up an incomplete removal of commits that were
      intended to plug remaining exit paths, which should have the added
      benefit of reducing the overhead of RCU's context-switch hooks.  In
      addition, it simplifies leader-to-follower callback-list handoff by
      introducing locking.  The call_rcu()-to-leader handoff continues to
      use atomic operations in order to maintain good real-time latency for
      common-case use of call_rcu().
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      [ paulmck: Dan Carpenter fix for mod_timer() usage bug found by smatch. ]
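
      A minimal userspace C sketch of this backstop pattern (the POSIX-timer
      analogy and all names are illustrative, not the kernel's mod_timer()/
      del_timer() code; compile with cc backstop.c -lrt):

        #include <signal.h>
        #include <stdatomic.h>
        #include <stdio.h>
        #include <time.h>
        #include <unistd.h>

        static atomic_int wake_pending = 1;

        static void do_wakeup(const char *who)
        {
                /* Whichever path gets here first performs the wakeup. */
                if (atomic_exchange(&wake_pending, 0))
                        printf("wakeup performed by %s\n", who);
        }

        static void backstop_fires(union sigval sv)
        {
                (void)sv;
                do_wakeup("backstop timer");
        }

        int main(void)
        {
                timer_t backstop;
                struct sigevent sev = { 0 };
                struct itimerspec its = { .it_value = { .tv_sec = 1 } };

                sev.sigev_notify = SIGEV_THREAD;
                sev.sigev_notify_function = backstop_fires;
                timer_create(CLOCK_MONOTONIC, &sev, &backstop);
                timer_settime(backstop, 0, &its, NULL);  /* analog of mod_timer() */

                /* A deferred-wakeup point kicks in first: wake, then cancel. */
                do_wakeup("deferred-wakeup point");
                timer_delete(backstop);                  /* analog of del_timer() */

                sleep(2);  /* had the wakeup been missed, the timer would have fired */
                return 0;
        }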
  2. 09 Jun 2017 (4 commits)
  3. 08 Jun 2017 (2 commits)
    • rcu: Use RCU_NOCB_WAKE rather than RCU_NOGP_WAKE · 511324e4
      Committed by Paul E. McKenney
      The RCU_NOGP_WAKE_NOT, RCU_NOGP_WAKE, and RCU_NOGP_WAKE_FORCE flags
      are used to mediate wakeups for the no-CBs CPU kthreads.  The "NOGP"
      really doesn't make any sense, so this commit does s/NOGP/NOCB/.
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
    • rcu: Complain if blocking in preemptible RCU read-side critical section · 5b72f964
      Committed by Paul E. McKenney
      Although preemptible RCU allows its read-side critical sections to be
      preempted, general blocking is forbidden.  The reason for this is that
      excessive preemption times can be handled by CONFIG_RCU_BOOST=y, but a
      voluntarily blocked task doesn't care how high you boost its priority.
      Because preemptible RCU is a global mechanism, one ill-behaved reader
      hurts everyone.  Hence the prohibition against general blocking in
      RCU-preempt read-side critical sections.  Preemption yes, blocking no.
      
      This commit enforces this prohibition.
      
      There is a special exception for the -rt patchset (which they kindly
      volunteered to implement):  It is OK to block (as opposed to merely being
      preempted) within an RCU-preempt read-side critical section, but only if
      the blocking is subject to priority inheritance.  This exception permits
      CONFIG_RCU_BOOST=y to get -rt RCU readers out of trouble.
      
      Why doesn't this exception also apply to mainline's rt_mutex?  Because
      of the possibility that someone does general blocking while holding
      an rt_mutex.  Yes, the priority boosting will affect the rt_mutex,
      but it won't help with the task doing general blocking while holding
      that rt_mutex.
      Reported-by: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
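
      A hedged kernel-style sketch of the rule (gp and do_something_with()
      are hypothetical; this is an illustrative fragment, not a buildable
      module):

        rcu_read_lock();
        p = rcu_dereference(gp);
        do_something_with(p);  /* involuntary preemption here is tolerated:
                                * CONFIG_RCU_BOOST=y can boost the reader   */
        /* msleep(10); */      /* forbidden: a voluntarily blocked task
                                * cannot be helped by priority boosting     */
        rcu_read_unlock();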
  4. 02 May 2017 (1 commit)
    • srcu: Debloat the <linux/rcu_segcblist.h> header · 45753c5f
      Committed by Ingo Molnar
      Linus noticed that the <linux/rcu_segcblist.h> header has huge inline functions
      which should not be inline at all.
      
      As a first step in cleaning this up, move them all to kernel/rcu/ and
      only keep an absolute minimum of data type defines in the header:
      
        before:   -rw-r--r-- 1 mingo mingo 22284 May  2 10:25 include/linux/rcu_segcblist.h
         after:   -rw-r--r-- 1 mingo mingo  3180 May  2 10:22 include/linux/rcu_segcblist.h
      
      More can be done, such as uninlining the large functions, whose
      inlining is unjustified even for an RCU-internal matter.
      Reported-by: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
  5. 21 Apr 2017 (1 commit)
    • srcu: Parallelize callback handling · da915ad5
      Committed by Paul E. McKenney
      Peter Zijlstra proposed using SRCU to reduce mmap_sem contention [1,2],
      however, there are workloads that could result in a high volume of
      concurrent invocations of call_srcu(), which with current SRCU would
      result in excessive lock contention on the srcu_struct structure's
      ->queue_lock, which protects SRCU's callback lists.  This commit therefore
      moves SRCU to per-CPU callback lists, thus greatly reducing contention.
      
      Because a given SRCU instance no longer has a single centralized callback
      list, starting grace periods and invoking callbacks are both more complex
      than in the single-list Classic SRCU implementation.  Starting grace
      periods and handling callbacks are now handled using an srcu_node tree
      that is in some ways similar to the rcu_node trees used by RCU-bh,
      RCU-preempt, and RCU-sched (for example, the srcu_node tree shape is
      controlled by exactly the same Kconfig options and boot parameters that
      control the shape of the rcu_node tree).
      
      In addition, the old per-CPU srcu_array structure is now named srcu_data
      and contains an rcu_segcblist structure named ->srcu_cblist for its
      callbacks (and a spinlock to protect this).  The srcu_struct gets
      an srcu_gp_seq that is used to associate callback segments with the
      corresponding completion-time grace-period number.  These completion-time
      grace-period numbers are propagated up the srcu_node tree so that the
      grace-period workqueue handler can determine whether additional grace
      periods are needed on the one hand and where to look for callbacks that
      are ready to be invoked.
      
      The srcu_barrier() function must now wait on all instances of the per-CPU
      ->srcu_cblist.  Because each ->srcu_cblist is protected by ->lock,
      srcu_barrier() can remotely add the needed callbacks.  In theory,
      it could also remotely start grace periods, but in practice doing so
      is complex and racy.  And interestingly enough, it is never necessary
      for srcu_barrier() to start a grace period because srcu_barrier() only
      enqueues a callback when a callback is already present--and it turns out
      that a grace period has to have already been started for this pre-existing
      callback.  Furthermore, it is only the callback that srcu_barrier()
      needs to wait on, not any particular grace period.  Therefore, a new
      rcu_segcblist_entrain() function enqueues the srcu_barrier() function's
      callback into the same segment occupied by the last pre-existing callback
      in the list.  The special case where all the pre-existing callbacks are
      on a different list (because they are in the process of being invoked)
      is handled by enqueuing srcu_barrier()'s callback into the RCU_DONE_TAIL
      segment, relying on the done-callbacks check that takes place after all
      callbacks are invoked.
      
      Note that the readers use the same algorithm as before.  Note that there
      is a separate srcu_idx that tells the readers what counter to increment.
      This unfortunately cannot be combined with srcu_gp_seq because they
      need to be incremented at different times.
      
      This commit introduces some ugly #ifdefs in rcutorture.  These will go
      away when I feel good enough about Tree SRCU to ditch Classic SRCU.
      
      Some crude performance comparisons, courtesy of a quickly hacked rcuperf
      asynchronous-grace-period capability:
      
      			Callback Queuing Overhead
      			-------------------------
      	# CPUS		Classic SRCU	Tree SRCU
      	------          ------------    ---------
      	     2              0.349 us     0.342 us
      	    16             31.66  us     0.4   us
      	    41             ---------     0.417 us
      
      The times are the 90th percentiles, a statistic that was chosen to reject
      the overheads of the occasional srcu_barrier() call needed to avoid OOMing
      the test machine.  The rcuperf test hangs when running Classic SRCU at 41
      CPUs, hence the line of dashes.  Despite the hacks to both the rcuperf code
      and that statistic, this is a convincing demonstration of Tree SRCU's
      performance and scalability advantages.
      
      [1] https://lwn.net/Articles/309030/
      [2] https://patchwork.kernel.org/patch/5108281/
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      [ paulmck: Fix initialization if synchronize_srcu_expedited() called first. ]
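
      A runnable, much-simplified userspace sketch of the entrain idea
      (types and segment names are illustrative, not rcu_segcblist's): a
      segmented list is one linked list carved up by an array of tail
      pointers, and the barrier callback joins the segment of the last
      pre-existing callback so it waits only for that callback's
      already-started grace period:

        #include <stddef.h>

        struct cb { struct cb *next; };

        #define SEGS 4  /* e.g. DONE, WAIT, NEXT_READY, NEXT */

        struct segcblist {
                struct cb *head;
                struct cb **tails[SEGS];  /* tails[i]: end of segment i */
        };

        static void entrain(struct segcblist *l, struct cb *c)
        {
                int i;

                c->next = NULL;
                for (i = SEGS - 1; i > 0; i--)  /* find last non-empty segment */
                        if (l->tails[i] != l->tails[i - 1])
                                break;
                *l->tails[i] = c;               /* append within that segment  */
                for (; i < SEGS; i++)           /* trailing tails now follow c */
                        l->tails[i] = &c->next;
        }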
  6. 19 Apr 2017 (10 commits)
    • srcu: Move rcu_node traversal macros to rcu.h · efbe451d
      Committed by Paul E. McKenney
      This commit moves rcu_for_each_node_breadth_first(),
      rcu_for_each_nonleaf_node_breadth_first(), and
      rcu_for_each_leaf_node() from kernel/rcu/tree.h to
      kernel/rcu/rcu.h so that SRCU can access them.
      This commit is code-movement only.
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
    • srcu: Move combining-tree definitions for SRCU's benefit · f2425b4e
      Committed by Paul E. McKenney
      This commit moves the C preprocessor code that defines the default shape
      of the rcu_node combining tree to a new include/linux/rcu_node_tree.h
      file as a first step towards enabling SRCU to create its own combining
      tree, which in turn enables SRCU to implement per-CPU callback handling,
      thus avoiding contention on the lock currently guarding the single list
      of callbacks.  Note that users of SRCU still need to know the size of
      the srcu_struct structure, hence include/linux rather than kernel/rcu.
      
      This commit is code-movement only.
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
    • srcu: Use rcu_segcblist to track SRCU callbacks · 8660b7d8
      Committed by Paul E. McKenney
      This commit switches SRCU from custom-built callback queues to the new
      rcu_segcblist structure.  This change associates grace-period sequence
      numbers with groups of callbacks, which will be needed for efficient
      processing of per-CPU callbacks.
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
    • srcu: Abstract multi-tail callback list handling · 15fecf89
      Committed by Paul E. McKenney
      RCU has only one multi-tail callback list, which is implemented via
      the nxtlist, nxttail, nxtcompleted, qlen_lazy, and qlen fields in the
      rcu_data structure, and whose operations are open-coded throughout the
      Tree RCU implementation.  This has been more or less OK in the past,
      but upcoming callback-list optimizations in SRCU could really use
      a multi-tail callback list there as well.
      
      This commit therefore abstracts the multi-tail callback list handling
      into a new kernel/rcu/rcu_segcblist.h file, and uses this new API.
      The simple head-and-tail pointer callback list is also abstracted and
      applied everywhere except for the NOCB callback-offload lists.  (Yes,
      the plan is to apply them there as well, but this commit is already
      bigger than would be good.)
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
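
      A hedged, simplified userspace sketch of such a multi-tail list
      (segment names reduced to three for brevity): one linked list plus
      an array of tail pointers, so whole segments advance by pointer
      assignment rather than by walking the list:

        #include <stddef.h>

        struct cb { struct cb *next; };

        enum { DONE, WAIT, NEXT, NSEGS };  /* simplified segment names */

        struct segcblist {
                struct cb *head;
                struct cb **tails[NSEGS];
        };

        static void segcblist_init(struct segcblist *l)
        {
                l->head = NULL;
                for (int i = 0; i < NSEGS; i++)
                        l->tails[i] = &l->head;
        }

        /* New callbacks always enter the last segment. */
        static void segcblist_enqueue(struct segcblist *l, struct cb *c)
        {
                c->next = NULL;
                *l->tails[NEXT] = c;
                l->tails[NEXT] = &c->next;
        }

        /* After a grace period: everything up to the WAIT tail is DONE,
         * and callbacks queued before the GP began are now waiting. */
        static void segcblist_advance(struct segcblist *l)
        {
                l->tails[DONE] = l->tails[WAIT];
                l->tails[WAIT] = l->tails[NEXT];
        }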
    • rcu: Default RCU_FANOUT_LEAF to 16 unless explicitly changed · b8c78d3a
      Committed by Paul E. McKenney
      If the RCU_EXPERT Kconfig option is not set (the default), then the
      RCU_FANOUT_LEAF Kconfig option will not be defined, which will cause
      the leaf-level rcu_node tree fanout to default to 32 on 32-bit systems
      and 64 on 64-bit systems.  This can result in excessive lock contention.
      This commit therefore changes the computation of the leaf-level rcu_node
      tree fanout so that the result will be 16 unless an explicit Kconfig or
      kernel-boot setting says otherwise.
      Reported-by: Peter Zijlstra <peterz@infradead.org>
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
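
      A hedged sketch of the resulting fallback logic (the exact header
      location and comments differ in the real patch):

        #ifdef CONFIG_RCU_FANOUT_LEAF
        #define RCU_FANOUT_LEAF CONFIG_RCU_FANOUT_LEAF
        #else
        #define RCU_FANOUT_LEAF 16  /* was 32 on 32-bit, 64 on 64-bit */
        #endif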
    • rcu: Place guard on rcu_all_qs() and rcu_note_context_switch() actions · 9226b10d
      Committed by Paul E. McKenney
      The rcu_all_qs() and rcu_note_context_switch() do a series of checks,
      taking various actions to supply RCU with quiescent states, depending
      on the outcomes of the various checks.  This is a bit much for scheduling
      fastpaths, so this commit creates a separate ->rcu_urgent_qs field in
      the rcu_dynticks structure that acts as a global guard for these checks.
      Thus, in the common case, rcu_all_qs() and rcu_note_context_switch()
      check the ->rcu_urgent_qs field, find it false, and simply return.
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
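
      A minimal userspace sketch of the guard pattern (names are
      illustrative; the kernel field is per-CPU, modeled here as
      thread-local):

        #include <stdatomic.h>
        #include <stdbool.h>

        static _Thread_local atomic_bool rcu_urgent_qs;

        static void note_context_switch(void)
        {
                /* Common case: nothing urgent, one load and out. */
                if (!atomic_load_explicit(&rcu_urgent_qs, memory_order_relaxed))
                        return;
                atomic_store_explicit(&rcu_urgent_qs, false, memory_order_relaxed);
                /* ... the expensive quiescent-state checks run only here ... */
        }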
    • rcu: Eliminate flavor scan in rcu_momentary_dyntick_idle() · 0f9be8ca
      Committed by Paul E. McKenney
      The rcu_momentary_dyntick_idle() function scans the RCU flavors, checking
      that one of them still needs a quiescent state before doing an expensive
      atomic operation on the ->dynticks counter.  However, this check reduces
      overhead only after a rare race condition, and increases complexity.  This
      commit therefore removes the scan and the mechanism enabling the scan.
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
    • rcu: Pull rcu_qs_ctr into rcu_dynticks structure · 9577df9a
      Committed by Paul E. McKenney
      The rcu_qs_ctr variable is yet another isolated per-CPU variable,
      so this commit pulls it into the pre-existing rcu_dynticks per-CPU
      structure.
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
    • rcu: Pull rcu_sched_qs_mask into rcu_dynticks structure · abb06b99
      Committed by Paul E. McKenney
      The rcu_sched_qs_mask variable is yet another isolated per-CPU variable,
      so this commit pulls it into the pre-existing rcu_dynticks per-CPU
      structure.
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
    • rcu: Maintain special bits at bottom of ->dynticks counter · b8c17e66
      Committed by Paul E. McKenney
      Currently, IPIs are used to force other CPUs to invalidate their TLBs
      in response to a kernel virtual-memory mapping change.  This works, but
      degrades both battery lifetime (for idle CPUs) and real-time response
      (for nohz_full CPUs), and in addition results in unnecessary IPIs due to
      the fact that CPUs executing in usermode are unaffected by stale kernel
      mappings.  It would be better to cause a CPU executing in usermode to
      wait until it is entering kernel mode to do the flush, first to avoid
      interrupting usermode tasks and second to handle multiple flush requests
      with a single flush in the case of a long-running user task.
      
      This commit therefore reserves a bit at the bottom of the ->dynticks
      counter, which is checked upon exit from extended quiescent states.
      If it is set, it is cleared and then a new rcu_eqs_special_exit() macro is
      invoked, which, if not supplied, is an empty single-pass do-while loop.
      If this bottom bit is set on -entry- to an extended quiescent state,
      then a WARN_ON_ONCE() triggers.
      
      This bottom bit may be set using a new rcu_eqs_special_set() function,
      which returns true if the bit was set, or false if the CPU turned
      out to not be in an extended quiescent state.  Please note that this
      function refuses to set the bit for a non-nohz_full CPU when that CPU
      is executing in usermode because usermode execution is tracked by RCU
      as a dyntick-idle extended quiescent state only for nohz_full CPUs.
      Reported-by: Andy Lutomirski <luto@amacapital.net>
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Reviewed-by: Josh Triplett <josh@joshtriplett.org>
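
      A hedged userspace sketch of reserving the bottom bit (constants
      and names are illustrative, not the kernel's): the counter moves in
      strides of 2, leaving bit 0 free to request special processing at
      EQS exit:

        #include <stdatomic.h>

        #define SPECIAL_MASK 0x1L  /* request bit           */
        #define CTR_INC      0x2L  /* actual counter stride */

        static atomic_long dynticks;

        static void eqs_special_set(void)
        {
                /* Ask the target CPU to do special work at its next EQS
                 * exit.  (The real rcu_eqs_special_set() also verifies
                 * that the CPU is in an EQS and can fail; elided here.) */
                atomic_fetch_or(&dynticks, SPECIAL_MASK);
        }

        static void eqs_exit(void)
        {
                long seq = atomic_fetch_add(&dynticks, CTR_INC) + CTR_INC;

                if (seq & SPECIAL_MASK) {
                        atomic_fetch_and(&dynticks, ~SPECIAL_MASK);
                        /* ... analog of rcu_eqs_special_exit() here ... */
                }
        }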
  7. 02 Mar 2017 (1 commit)
  8. 26 Jan 2017 (1 commit)
    • srcu: Force full grace-period ordering · d85b62f1
      Committed by Paul E. McKenney
      If a process invokes synchronize_srcu(), is delayed just the right amount
      of time, and thus does not sleep when waiting for the grace period to
      complete, there is no ordering between the end of the grace period and
      the code following the synchronize_srcu().  Similarly, there can be a
      lack of ordering between the end of the SRCU grace period and callback
      invocation.
      
      This commit adds the necessary ordering.
      Reported-by: Lance Roy <ldr709@gmail.com>
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      [ paulmck: Further smp_mb() adjustment per email with Lance Roy. ]
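
      A hedged sketch of the no-sleep fast path (gp_seq_done() is
      hypothetical): even when the grace period is observed to be already
      complete without sleeping, a full barrier is still required:

        if (gp_seq_done(sp, s)) {
                smp_mb();  /* order GP completion before caller's later accesses */
                return;
        }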
  9. 24 Jan 2017 (2 commits)
  10. 15 Nov 2016 (1 commit)
  11. 23 Aug 2016 (1 commit)
    • rcu: Drive expedited grace periods from workqueue · 8b355e3b
      Committed by Paul E. McKenney
      The current implementation of expedited grace periods has the user
      task drive the grace period.  This works, but has downsides: (1) The
      user task must awaken tasks piggybacking on this grace period, which
      can result in latencies rivaling that of the grace period itself, and
      (2) User tasks can receive signals, which interfere with RCU CPU stall
      warnings.
      
      This commit therefore uses workqueues to drive the grace periods, so
      that the user task need not do the awakening.  A subsequent commit
      will remove the now-unnecessary code allowing for signals.
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
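
      A hedged kernel-style sketch of the pattern (not the actual tree.c
      code; names are illustrative): the user task queues work and sleeps
      uninterruptibly, while the workqueue handler drives the grace
      period and performs all wakeups:

        struct exp_request {
                struct work_struct work;
                struct completion done;
        };

        static void exp_gp_workfn(struct work_struct *wp)
        {
                struct exp_request *req = container_of(wp, struct exp_request, work);

                /* ... drive the expedited GP, wake any piggybackers ... */
                complete(&req->done);
        }

        static void synchronize_expedited_sketch(void)
        {
                struct exp_request req;

                INIT_WORK_ONSTACK(&req.work, exp_gp_workfn);
                init_completion(&req.done);
                queue_work(system_unbound_wq, &req.work);
                wait_for_completion(&req.done);  /* uninterruptible: no signals */
                destroy_work_on_stack(&req.work);
        }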
  12. 16 Jun 2016 (1 commit)
    • rcu: Correctly handle sparse possible cpus · bc75e999
      Committed by Mark Rutland
      In many cases in the RCU tree code, we iterate over the set of cpus for
      a leaf node described by rcu_node::grplo and rcu_node::grphi, checking
      per-cpu data for each cpu in this range. However, if the set of possible
      cpus is sparse, some cpus described in this range are not possible, and
      thus no per-cpu region will have been allocated (or initialised) for
      them by the generic percpu code.
      
      Erroneous accesses to a per-cpu area for these !possible cpus may fault
      or may hit other data depending on the address generated when the
      erroneous per-CPU offset is applied. In practice, both cases have been
      observed on arm64 hardware (the former being silent, but detectable with
      additional patches).
      
      To avoid issues resulting from this, we must iterate over the set of
      *possible* cpus for a given leaf node. This patch adds a new helper,
      for_each_leaf_node_possible_cpu, to enable this. As iteration is often
      intertwined with rcu_node local bitmask manipulation, a new
      leaf_node_cpu_bit helper is added to make this simpler and more
      consistent. The RCU tree code is made to use both of these where
      appropriate.
      
      Without this patch, running reboot at a shell can result in an oops
      like:
      
      [ 3369.075979] Unable to handle kernel paging request at virtual address ffffff8008b21b4c
      [ 3369.083881] pgd = ffffffc3ecdda000
      [ 3369.087270] [ffffff8008b21b4c] *pgd=00000083eca48003, *pud=00000083eca48003, *pmd=0000000000000000
      [ 3369.096222] Internal error: Oops: 96000007 [#1] PREEMPT SMP
      [ 3369.101781] Modules linked in:
      [ 3369.104825] CPU: 2 PID: 1817 Comm: NetworkManager Tainted: G        W       4.6.0+ #3
      [ 3369.121239] task: ffffffc0fa13e000 ti: ffffffc3eb940000 task.ti: ffffffc3eb940000
      [ 3369.128708] PC is at sync_rcu_exp_select_cpus+0x188/0x510
      [ 3369.134094] LR is at sync_rcu_exp_select_cpus+0x104/0x510
      [ 3369.139479] pc : [<ffffff80081109a8>] lr : [<ffffff8008110924>] pstate: 200001c5
      [ 3369.146860] sp : ffffffc3eb9435a0
      [ 3369.150162] x29: ffffffc3eb9435a0 x28: ffffff8008be4f88
      [ 3369.155465] x27: ffffff8008b66c80 x26: ffffffc3eceb2600
      [ 3369.160767] x25: 0000000000000001 x24: ffffff8008be4f88
      [ 3369.166070] x23: ffffff8008b51c3c x22: ffffff8008b66c80
      [ 3369.171371] x21: 0000000000000001 x20: ffffff8008b21b40
      [ 3369.176673] x19: ffffff8008b66c80 x18: 0000000000000000
      [ 3369.181975] x17: 0000007fa951a010 x16: ffffff80086a30f0
      [ 3369.187278] x15: 0000007fa9505590 x14: 0000000000000000
      [ 3369.192580] x13: ffffff8008b51000 x12: ffffffc3eb940000
      [ 3369.197882] x11: 0000000000000006 x10: ffffff8008b51b78
      [ 3369.203184] x9 : 0000000000000001 x8 : ffffff8008be4000
      [ 3369.208486] x7 : ffffff8008b21b40 x6 : 0000000000001003
      [ 3369.213788] x5 : 0000000000000000 x4 : ffffff8008b27280
      [ 3369.219090] x3 : ffffff8008b21b4c x2 : 0000000000000001
      [ 3369.224406] x1 : 0000000000000001 x0 : 0000000000000140
      ...
      [ 3369.972257] [<ffffff80081109a8>] sync_rcu_exp_select_cpus+0x188/0x510
      [ 3369.978685] [<ffffff80081128b4>] synchronize_rcu_expedited+0x64/0xa8
      [ 3369.985026] [<ffffff80086b987c>] synchronize_net+0x24/0x30
      [ 3369.990499] [<ffffff80086ddb54>] dev_deactivate_many+0x28c/0x298
      [ 3369.996493] [<ffffff80086b6bb8>] __dev_close_many+0x60/0xd0
      [ 3370.002052] [<ffffff80086b6d48>] __dev_close+0x28/0x40
      [ 3370.007178] [<ffffff80086bf62c>] __dev_change_flags+0x8c/0x158
      [ 3370.012999] [<ffffff80086bf718>] dev_change_flags+0x20/0x60
      [ 3370.018558] [<ffffff80086cf7f0>] do_setlink+0x288/0x918
      [ 3370.023771] [<ffffff80086d0798>] rtnl_newlink+0x398/0x6a8
      [ 3370.029158] [<ffffff80086cee84>] rtnetlink_rcv_msg+0xe4/0x220
      [ 3370.034891] [<ffffff80086e274c>] netlink_rcv_skb+0xc4/0xf8
      [ 3370.040364] [<ffffff80086ced8c>] rtnetlink_rcv+0x2c/0x40
      [ 3370.045663] [<ffffff80086e1fe8>] netlink_unicast+0x160/0x238
      [ 3370.051309] [<ffffff80086e24b8>] netlink_sendmsg+0x2f0/0x358
      [ 3370.056956] [<ffffff80086a0070>] sock_sendmsg+0x18/0x30
      [ 3370.062168] [<ffffff80086a21cc>] ___sys_sendmsg+0x26c/0x280
      [ 3370.067728] [<ffffff80086a30ac>] __sys_sendmsg+0x44/0x88
      [ 3370.073027] [<ffffff80086a3100>] SyS_sendmsg+0x10/0x20
      [ 3370.078153] [<ffffff8008085e70>] el0_svc_naked+0x24/0x28
      Signed-off-by: Mark Rutland <mark.rutland@arm.com>
      Reported-by: Dennis Chen <dennis.chen@arm.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Josh Triplett <josh@joshtriplett.org>
      Cc: Lai Jiangshan <jiangshanlai@gmail.com>
      Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
      Cc: Steve Capper <steve.capper@arm.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
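
      A runnable userspace analog of the new iteration (the possible-CPU
      bitmap and all names are illustrative; the real helper consults
      cpu_possible_mask):

        #include <stdio.h>

        #define NR_CPUS 8

        /* Sparse possible map: only CPUs 0, 2, 5, and 7 exist. */
        static const int cpu_possible[NR_CPUS] = { 1, 0, 1, 0, 0, 1, 0, 1 };

        struct leaf_node { int grplo, grphi; };

        #define for_each_leaf_node_possible_cpu(rnp, cpu)                  \
                for ((cpu) = (rnp)->grplo; (cpu) <= (rnp)->grphi; (cpu)++) \
                        if (cpu_possible[(cpu)])

        int main(void)
        {
                struct leaf_node rnp = { .grplo = 0, .grphi = 7 };
                int cpu;

                for_each_leaf_node_possible_cpu(&rnp, cpu)
                        printf("per-cpu access for CPU %d\n", cpu);
                return 0;
        }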
  13. 01 Apr 2016 (5 commits)
    • rcu: Awaken grace-period kthread if too long since FQS · 8c7c4829
      Committed by Paul E. McKenney
      Recent kernels can fail to awaken the grace-period kthread for
      quiescent-state forcing.  This commit is a crude hack that does
      a wakeup if a scheduling-clock interrupt sees that it has been
      too long since force-quiescent-state (FQS) processing.
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
    • rcu: Overlap wakeups with next expedited grace period · 3b5f668e
      Committed by Paul E. McKenney
      The current expedited grace-period implementation makes subsequent grace
      periods wait on wakeups for the prior grace period.  This does not fit
      the dictionary definition of "expedited", so this commit allows these two
      phases to overlap.  Doing this requires four waitqueues rather than two
      because tasks can now be waiting on the previous, current, and next grace
      periods.  The fourth waitqueue makes the bit masking work out nicely.
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
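
      A hedged sketch of the wait-queue selection (names follow the
      description above but are not verbatim kernel code): low-order bits
      of the grace-period sequence number pick one of the four queues, so
      wakeups for an ending grace period never mix with waiters on the
      next one:

        static wait_queue_head_t exp_wq[4];

        static wait_queue_head_t *exp_wq_for(unsigned long s)
        {
                return &exp_wq[(s >> 1) & 0x3];  /* bit 0 flags a GP in progress */
        }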
    • rcu: Enforce expedited-GP fairness via funnel wait queue · f6a12f34
      Committed by Paul E. McKenney
      The current mutex-based funnel-locking approach used by expedited grace
      periods is subject to severe unfairness.  The problem arises when a
      few tasks, making a path from leaves to root, all wake up before other
      tasks do.  A new task can then follow this path all the way to the root,
      which needlessly delays tasks whose grace period is done, but who do
      not happen to acquire the lock quickly enough.
      
      This commit avoids this problem by maintaining per-rcu_node wait queues,
      along with a per-rcu_node counter that tracks the latest grace period
      sought by an earlier task to visit this node.  If that grace period
      would satisfy the current task, instead of proceeding up the tree,
      it waits on the current rcu_node structure using a pair of wait queues
      provided for that purpose.  This decouples awakening of old tasks from
      the arrival of new tasks.
      
      If the wakeups prove to be a bottleneck, additional kthreads can be
      brought to bear for that purpose.
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
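
      A hedged kernel-style sketch of the funnel wait (locking elided;
      sync_exp_done() is hypothetical): climb from the leaf toward the
      root, but stop and wait at the first node whose recorded request
      already covers grace period s:

        for (rnp = rdp->mynode; rnp != NULL; rnp = rnp->parent) {
                if (ULONG_CMP_GE(READ_ONCE(rnp->exp_seq_rq), s)) {
                        /* An earlier task requested our GP: wait here. */
                        wait_event(rnp->exp_wq[(s >> 1) & 0x3],
                                   sync_exp_done(s));
                        return;
                }
                WRITE_ONCE(rnp->exp_seq_rq, s);  /* record, keep funneling */
        }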
    • rcu: Shorten expedited_workdone* to exp_workdone* · d40a4f09
      Committed by Paul E. McKenney
      Just a name change to save a few lines and a bit of typing.
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
    • rcu: Remove expedited GP funnel-lock bypass · e2fd9d35
      Committed by Paul E. McKenney
      Commit #cdacbe1f ("rcu: Add fastpath bypassing funnel locking")
      turns out to be a pessimization at high load because it forces a tree
      full of tasks to wait for an expedited grace period that they probably
      do not need.  This commit therefore removes this optimization.
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
  14. 25 Feb 2016 (2 commits)
    • rcu: Use simple wait queues where possible in rcutree · abedf8e2
      Committed by Paul Gortmaker
      As of commit dae6e64d ("rcu: Introduce proper blocking to no-CBs kthreads
      GP waits") the RCU subsystem started making use of wait queues.
      
      Here we convert all additions of RCU wait queues to use simple wait queues,
      since they don't need the extra overhead of the full wait queue features.
      
      Originally this was done for RT kernels[1], since we would get things like...
      
        BUG: sleeping function called from invalid context at kernel/rtmutex.c:659
        in_atomic(): 1, irqs_disabled(): 1, pid: 8, name: rcu_preempt
        Pid: 8, comm: rcu_preempt Not tainted
        Call Trace:
         [<ffffffff8106c8d0>] __might_sleep+0xd0/0xf0
         [<ffffffff817d77b4>] rt_spin_lock+0x24/0x50
         [<ffffffff8106fcf6>] __wake_up+0x36/0x70
         [<ffffffff810c4542>] rcu_gp_kthread+0x4d2/0x680
         [<ffffffff8105f910>] ? __init_waitqueue_head+0x50/0x50
         [<ffffffff810c4070>] ? rcu_gp_fqs+0x80/0x80
         [<ffffffff8105eabb>] kthread+0xdb/0xe0
         [<ffffffff8106b912>] ? finish_task_switch+0x52/0x100
         [<ffffffff817e0754>] kernel_thread_helper+0x4/0x10
         [<ffffffff8105e9e0>] ? __init_kthread_worker+0x60/0x60
         [<ffffffff817e0750>] ? gs_change+0xb/0xb
      
      ...and hence simple wait queues were deployed on RT out of necessity
      (as simple wait uses a raw lock), but mainline might as well take
      advantage of the more streamlined support as well.
      
      [1] This is a carry forward of work from v3.10-rt; the original conversion
      was by Thomas on an earlier -rt version, and Sebastian extended it to
      additional post-3.10 added RCU waiters; here I've added a commit log and
      unified the RCU changes into one, and uprev'd it to match mainline RCU.
      Signed-off-by: Daniel Wagner <daniel.wagner@bmw-carit.de>
      Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: linux-rt-users@vger.kernel.org
      Cc: Boqun Feng <boqun.feng@gmail.com>
      Cc: Marcelo Tosatti <mtosatti@redhat.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
      Link: http://lkml.kernel.org/r/1455871601-27484-6-git-send-email-wagi@monom.org
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
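
      A hedged before/after sketch of the conversion (gp_ready() is
      hypothetical; simple wait queues use a raw lock internally, which
      is what makes them safe on RT):

        /* before */
        wait_queue_head_t gp_wq;
        init_waitqueue_head(&gp_wq);
        wait_event_interruptible(gp_wq, gp_ready());
        wake_up(&gp_wq);

        /* after */
        struct swait_queue_head gp_swq;
        init_swait_queue_head(&gp_swq);
        swait_event_interruptible(gp_swq, gp_ready());
        swake_up(&gp_swq);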
    • rcu: Do not call rcu_nocb_gp_cleanup() while holding rnp->lock · 065bb78c
      Committed by Daniel Wagner
      rcu_nocb_gp_cleanup() is called while holding rnp->lock. Currently,
      this is okay because the wake_up_all() in rcu_nocb_gp_cleanup() will
      not enable the IRQs. lockdep is happy.
      
      By switching over to swait this is not true anymore. swake_up_all()
      enables the IRQs while processing the waiters. __do_softirq() can now
      run and will eventually call rcu_process_callbacks(), which wants to
      grab rnp->lock.
      
      Let's move the rcu_nocb_gp_cleanup() call outside the lock before we
      switch over to swait.
      
      If we were to hold rnp->lock and use swait, lockdep reports the
      following:
      
       =================================
       [ INFO: inconsistent lock state ]
       4.2.0-rc5-00025-g9a73ba0 #136 Not tainted
       ---------------------------------
       inconsistent {IN-SOFTIRQ-W} -> {SOFTIRQ-ON-W} usage.
       rcu_preempt/8 [HC0[0]:SC0[0]:HE1:SE1] takes:
        (rcu_node_1){+.?...}, at: [<ffffffff811387c7>] rcu_gp_kthread+0xb97/0xeb0
       {IN-SOFTIRQ-W} state was registered at:
         [<ffffffff81109b9f>] __lock_acquire+0xd5f/0x21e0
         [<ffffffff8110be0f>] lock_acquire+0xdf/0x2b0
         [<ffffffff81841cc9>] _raw_spin_lock_irqsave+0x59/0xa0
         [<ffffffff81136991>] rcu_process_callbacks+0x141/0x3c0
         [<ffffffff810b1a9d>] __do_softirq+0x14d/0x670
         [<ffffffff810b2214>] irq_exit+0x104/0x110
         [<ffffffff81844e96>] smp_apic_timer_interrupt+0x46/0x60
         [<ffffffff81842e70>] apic_timer_interrupt+0x70/0x80
         [<ffffffff810dba66>] rq_attach_root+0xa6/0x100
         [<ffffffff810dbc2d>] cpu_attach_domain+0x16d/0x650
         [<ffffffff810e4b42>] build_sched_domains+0x942/0xb00
         [<ffffffff821777c2>] sched_init_smp+0x509/0x5c1
         [<ffffffff821551e3>] kernel_init_freeable+0x172/0x28f
         [<ffffffff8182cdce>] kernel_init+0xe/0xe0
         [<ffffffff8184231f>] ret_from_fork+0x3f/0x70
       irq event stamp: 76
       hardirqs last  enabled at (75): [<ffffffff81841330>] _raw_spin_unlock_irq+0x30/0x60
       hardirqs last disabled at (76): [<ffffffff8184116f>] _raw_spin_lock_irq+0x1f/0x90
       softirqs last  enabled at (0): [<ffffffff810a8df2>] copy_process.part.26+0x602/0x1cf0
       softirqs last disabled at (0): [<          (null)>]           (null)
       other info that might help us debug this:
        Possible unsafe locking scenario:
              CPU0
              ----
         lock(rcu_node_1);
         <Interrupt>
           lock(rcu_node_1);
        *** DEADLOCK ***
       1 lock held by rcu_preempt/8:
        #0:  (rcu_node_1){+.?...}, at: [<ffffffff811387c7>] rcu_gp_kthread+0xb97/0xeb0
       stack backtrace:
       CPU: 0 PID: 8 Comm: rcu_preempt Not tainted 4.2.0-rc5-00025-g9a73ba0 #136
       Hardware name: Dell Inc. PowerEdge R820/066N7P, BIOS 2.0.20 01/16/2014
        0000000000000000 000000006d7e67d8 ffff881fb081fbd8 ffffffff818379e0
        0000000000000000 ffff881fb0812a00 ffff881fb081fc38 ffffffff8110813b
        0000000000000000 0000000000000001 ffff881f00000001 ffffffff8102fa4f
       Call Trace:
        [<ffffffff818379e0>] dump_stack+0x4f/0x7b
        [<ffffffff8110813b>] print_usage_bug+0x1db/0x1e0
        [<ffffffff8102fa4f>] ? save_stack_trace+0x2f/0x50
        [<ffffffff811087ad>] mark_lock+0x66d/0x6e0
        [<ffffffff81107790>] ? check_usage_forwards+0x150/0x150
        [<ffffffff81108898>] mark_held_locks+0x78/0xa0
        [<ffffffff81841330>] ? _raw_spin_unlock_irq+0x30/0x60
        [<ffffffff81108a28>] trace_hardirqs_on_caller+0x168/0x220
        [<ffffffff81108aed>] trace_hardirqs_on+0xd/0x10
        [<ffffffff81841330>] _raw_spin_unlock_irq+0x30/0x60
        [<ffffffff810fd1c7>] swake_up_all+0xb7/0xe0
        [<ffffffff811386e1>] rcu_gp_kthread+0xab1/0xeb0
        [<ffffffff811089bf>] ? trace_hardirqs_on_caller+0xff/0x220
        [<ffffffff81841341>] ? _raw_spin_unlock_irq+0x41/0x60
        [<ffffffff81137c30>] ? rcu_barrier+0x20/0x20
        [<ffffffff810d2014>] kthread+0x104/0x120
        [<ffffffff81841330>] ? _raw_spin_unlock_irq+0x30/0x60
        [<ffffffff810d1f10>] ? kthread_create_on_node+0x260/0x260
        [<ffffffff8184231f>] ret_from_fork+0x3f/0x70
        [<ffffffff810d1f10>] ? kthread_create_on_node+0x260/0x260
      Signed-off-by: Daniel Wagner <daniel.wagner@bmw-carit.de>
      Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: linux-rt-users@vger.kernel.org
      Cc: Boqun Feng <boqun.feng@gmail.com>
      Cc: Marcelo Tosatti <mtosatti@redhat.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
      Link: http://lkml.kernel.org/r/1455871601-27484-5-git-send-email-wagi@monom.org
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
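
      A runnable pthread analog of the fix (names are illustrative): make
      the state change under the lock, but issue the wakeup only after
      dropping it, so the wakeup path cannot re-enter code that wants the
      same lock:

        #include <pthread.h>
        #include <stdbool.h>

        static pthread_mutex_t rnp_lock = PTHREAD_MUTEX_INITIALIZER;
        static pthread_cond_t gp_cond = PTHREAD_COND_INITIALIZER;
        static bool gp_completed;

        static void gp_cleanup(void)
        {
                pthread_mutex_lock(&rnp_lock);
                gp_completed = true;               /* state change under lock */
                pthread_mutex_unlock(&rnp_lock);

                pthread_cond_broadcast(&gp_cond);  /* wakeup after unlock */
        }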
  15. 24 Feb 2016 (1 commit)
    • RCU: Privatize rcu_node::lock · 67c583a7
      Committed by Boqun Feng
      In patch:
      
      "rcu: Add transitivity to remaining rcu_node ->lock acquisitions"
      
      All locking operations on rcu_node::lock are replaced with the wrappers
      because of the need for transitivity, which means we should never
      write code using LOCK primitives alone (i.e., without a proper barrier
      following) on rcu_node::lock outside those wrappers. We could detect
      this kind of misuse of rcu_node::lock in the future by adding the
      __private modifier to rcu_node::lock.
      
      To privatize rcu_node::lock, unlock wrappers are also needed. Replacing
      spinlock unlocks with these wrappers not only privatizes rcu_node::lock
      but also makes it easier to figure out critical sections of rcu_node.
      
      This patch adds the __private modifier to rcu_node::lock and makes every
      access to it go through ACCESS_PRIVATE(). In addition, unlock wrappers are
      added, and raw_spin_unlock(&rnp->lock) and its friends are replaced with
      those wrappers.
      Signed-off-by: Boqun Feng <boqun.feng@gmail.com>
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
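
      A hedged sketch of the wrapper idea (close to, but not verbatim,
      the kernel's helpers): sparse flags any direct use of the __private
      field, and every acquisition supplies the required ordering:

        struct rcu_node {
                raw_spinlock_t __private lock;
                /* ... */
        };

        #define raw_spin_lock_rcu_node(rnp)                              \
        do {                                                             \
                raw_spin_lock(&ACCESS_PRIVATE(rnp, lock));               \
                smp_mb__after_unlock_lock();  /* supply transitivity */  \
        } while (0)

        #define raw_spin_unlock_rcu_node(rnp) \
                raw_spin_unlock(&ACCESS_PRIVATE(rnp, lock))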
  16. 06 Dec 2015 (1 commit)
  17. 05 Dec 2015 (2 commits)
    • rcu: Reduce expedited GP memory contention via per-CPU variables · df5bd514
      Committed by Paul E. McKenney
      Currently, the piggybacked-work checks carried out by sync_exp_work_done()
      atomically increment a small set of variables (the ->expedited_workdone0,
      ->expedited_workdone1, ->expedited_workdone2, ->expedited_workdone3
      fields in the rcu_state structure), which will form a memory-contention
      bottleneck given a sufficiently large number of CPUs concurrently invoking
      either synchronize_rcu_expedited() or synchronize_sched_expedited().
      
      This commit therefore moves these four fields to the per-CPU rcu_data
      structure, eliminating the memory contention.  The show_rcuexp() function
      also changes to sum up each field in the rcu_data structures.
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
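
      A runnable userspace analog of the change (four slots stand in for
      per-CPU data; names are illustrative): writers touch only their own
      counter, and only the rare stats read pays for the sum:

        #include <stdatomic.h>
        #include <stdio.h>

        #define NCPUS 4

        static atomic_long exp_workdone[NCPUS];  /* was one shared counter */

        static void note_workdone(int cpu)
        {
                atomic_fetch_add_explicit(&exp_workdone[cpu], 1,
                                          memory_order_relaxed);
        }

        static long show_rcuexp(void)
        {
                long sum = 0;

                for (int i = 0; i < NCPUS; i++)
                        sum += atomic_load_explicit(&exp_workdone[i],
                                                    memory_order_relaxed);
                return sum;
        }

        int main(void)
        {
                note_workdone(0);
                note_workdone(2);
                printf("total: %ld\n", show_rcuexp());
                return 0;
        }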
    • rcu: Clarify role of ->expmaskinitnext · 1de6e56d
      Committed by Paul E. McKenney
      Analogy with the ->qsmaskinitnext field might lead one to believe that
      ->expmaskinitnext tracks online CPUs.  This belief is incorrect: Any CPU
      that has ever been online will have its bit set in the ->expmaskinitnext
      field.  This commit therefore adds a comment to make this clear, at
      least to people who read comments.
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
  18. 24 Nov 2015 (1 commit)
  19. 08 Oct 2015 (2 commits)
    • rcu: Add online/offline info to expedited stall warning message · 74611ecb
      Committed by Paul E. McKenney
      This commit makes the RCU CPU stall warning message print online/offline
      indications immediately after the CPU number. An "O" indicates that the
      CPU is globally offline, an "o" that RCU believes the CPU is offline for
      the current grace period, and an "N" that RCU believes the CPU will be
      offline for the next grace period; in each position, a "." indicates the
      online case. So for CPU 10, you would normally see "10-...:", indicating
      that everything believes that the CPU is online.
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
    • rcu: Stop silencing lockdep false positive for expedited grace periods · 83c2c735
      Committed by Paul E. McKenney
      This reverts commit af859bea (rcu: Silence lockdep false positive
      for expedited grace periods).  Because synchronize_rcu_expedited()
      no longer invokes synchronize_sched_expedited(), ->exp_funnel_mutex
      acquisition is no longer nested, so the false positive no longer happens.
      This commit therefore removes the extra lockdep data structures, as they
      are no longer needed.