1. 02 May 2017, 1 commit
    • srcu: Debloat the <linux/rcu_segcblist.h> header · 45753c5f
      Authored by Ingo Molnar
      Linus noticed that the <linux/rcu_segcblist.h> header has huge inline
      functions which should not be inline at all.
      
      As a first step in cleaning this up, move them all to kernel/rcu/ and
      only keep an absolute minimum of data type defines in the header:
      
        before:   -rw-r--r-- 1 mingo mingo 22284 May  2 10:25 include/linux/rcu_segcblist.h
         after:   -rw-r--r-- 1 mingo mingo  3180 May  2 10:22 include/linux/rcu_segcblist.h
      
      More can be done, such as uninlining the large functions, whose
      inlining is unjustified even if it is an RCU-internal matter.
      Reported-by: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
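      
      A minimal sketch of the resulting split, assuming a simplified
      rcu_segcblist_empty() body (illustrative only, not the verbatim code):
      
        /* include/linux/rcu_segcblist.h: keep only the data-type defines. */
        struct rcu_segcblist {
        	struct rcu_head *head;
        	/* ... remaining fields elided ... */
        };
        
        /* kernel/rcu/rcu_segcblist.h: the function bodies move here, out
         * of the public header; uninlining into a .c file can follow. */
        static inline bool rcu_segcblist_empty(struct rcu_segcblist *rsclp)
        {
        	return !rsclp->head;
        }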
  2. 27 April 2017, 6 commits
    • srcu: Adjust default auto-expediting holdoff · b5fe223a
      Authored by Paul E. McKenney
      The default value for the kernel boot parameter srcutree.exp_holdoff
      is 50 microseconds, which is too long for good Tree SRCU performance
      (compared to Classic SRCU) on the workloads tested by Mike Galbraith.
      This commit therefore sets the default value to 25 microseconds, which
      shows excellent results in Mike's testing.
      Reported-by: Mike Galbraith <efault@gmx.de>
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Tested-by: Mike Galbraith <efault@gmx.de>
    • srcu: Specify auto-expedite holdoff time · 22607d66
      Authored by Paul E. McKenney
      On small systems, in the absence of readers, expedited SRCU grace
      periods can complete in less than a microsecond.  This means that an
      eight-CPU system can have all CPUs doing synchronize_srcu() in a tight
      loop and almost always expedite.  This might actually be desirable in
      some situations, but in general it is a good way to needlessly burn
      CPU cycles.  And in those situations where it is desirable, your friend
      is the function synchronize_srcu_expedited().
      
      For other situations, this commit adds a kernel parameter that specifies
      a holdoff between completing the last SRCU grace period and auto-expediting
      the next.  If the next grace period starts before the holdoff expires,
      auto-expediting is disabled.  The holdoff is 50 microseconds by default,
      and can be tuned to the desired number of nanoseconds.  A value of zero
      disables auto-expediting.
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Tested-by: Mike Galbraith <efault@gmx.de>
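      
      For reference, a hedged usage sketch: the parameter name is the
      srcutree.exp_holdoff given in the previous commit's text, the units
      are nanoseconds as stated above, and the 25-microsecond value is
      the adjusted default from that commit:
      
        srcutree.exp_holdoff=25000   # 25-microsecond holdoff
        srcutree.exp_holdoff=0       # disable auto-expediting entirely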
    • srcu: Expedite first synchronize_srcu() when idle · 2da4b2a7
      Authored by Paul E. McKenney
      Classic SRCU in effect expedites the first synchronize_srcu() when SRCU
      is idle, and Mike Galbraith demonstrated that some use cases do in fact
      rely on this behavior.  In particular, Mike showed that Steven Rostedt's
      hotplug stress script takes 55 seconds with Classic SRCU and more than
      16 -minutes- when running Tree SRCU.  Assuming that each Tree SRCU's call
      to synchronize_srcu() takes four milliseconds, this implies that Steven's
      test invokes synchronize_srcu() in isolation, but more than once per
      200 microseconds.  Mike used ftrace to demonstrate that the time between
      successive calls to synchronize_srcu() ranged from 118 to 342 microseconds,
      with one outlier at 80 milliseconds.  This data clearly indicates that
      Tree SRCU needs to expedite the first invocation of synchronize_srcu()
      during an SRCU idle period.
      
      This commit therefore introduces an srcu_might_be_idle() function that
      probabilistically checks whether or not SRCU is idle.  This function is
      used by synchronize_srcu() as an additional criterion in deciding whether
      or not to expedite.
      
      (Hat tip to Peter Zijlstra for his earlier suggestion that this might
      in fact be a problem.  Which for all I know might have motivated Mike to
      look into it.)
      Reported-by: Mike Galbraith <efault@gmx.de>
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Tested-by: Mike Galbraith <efault@gmx.de>
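      
      A hedged sketch of the resulting decision: srcu_might_be_idle() is
      named by the commit, while the wrapper shown and the slow-path helper
      __synchronize_srcu_normal() are illustrative assumptions:
      
        void synchronize_srcu(struct srcu_struct *sp)
        {
        	/* Expedite when no readers are likely to be slowed down. */
        	if (rcu_gp_is_expedited() || srcu_might_be_idle(sp))
        		synchronize_srcu_expedited(sp);
        	else
        		__synchronize_srcu_normal(sp); /* hypothetical helper */
        }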
    • srcu: Expedited grace periods with reduced memory contention · 1e9a038b
      Authored by Paul E. McKenney
      Commit f60d231a ("srcu: Crude control of expedited grace periods")
      introduced a per-srcu_struct atomic counter to track outstanding
      requests for grace periods.  This works, but represents a memory-contention
      bottleneck.  This commit therefore uses the srcu_node combining tree
      to remove this bottleneck.
      
      This commit adds new ->srcu_gp_seq_needed_exp fields to the
      srcu_data, srcu_node, and srcu_struct structures, which track the
      farthest-in-the-future grace period that must be expedited, which in
      turn requires that all nearer-term grace periods also be expedited.
      Requests for expediting start with the srcu_data structure, run up
      through the srcu_node tree, and end at the srcu_struct structure.
      Note that it may be necessary to expedite a grace period that just
      now started, and this is handled by a new srcu_funnel_exp_start()
      function, which is invoked when the grace period itself is already
      under way but was not marked as expedited.
      
      A new srcu_get_delay() function returns zero if there is at least one
      expedited SRCU grace period in flight, or SRCU_INTERVAL otherwise.
      This function is used to calculate delays:  Normal grace periods
      are allowed to extend in order to cover more requests with a given
      grace-period computation, which decreases per-request overhead.
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Tested-by: Mike Galbraith <efault@gmx.de>
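      
      A hedged sketch of srcu_get_delay() as described; the sequence-number
      comparison is an assumption about how "expedited grace period in
      flight" is detected:
      
        static unsigned long srcu_get_delay(struct srcu_struct *sp)
        {
        	/* Zero delay while some requested grace period still
        	 * needs expediting; otherwise the normal holdoff. */
        	if (ULONG_CMP_LT(READ_ONCE(sp->srcu_gp_seq),
        			 READ_ONCE(sp->srcu_gp_seq_needed_exp)))
        		return 0;
        	return SRCU_INTERVAL;
        }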
    • srcu: Make rcutorture writer stalls print SRCU GP state · 7f6733c3
      Authored by Paul E. McKenney
      In the past, SRCU was simple enough that there was little point in
      making the rcutorture writer stall messages print the SRCU grace-period
      number state.  With the advent of Tree SRCU, this has changed.  This
      commit therefore makes Classic, Tiny, and Tree SRCU report this state
      to rcutorture as needed.
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Tested-by: Mike Galbraith <efault@gmx.de>
    • srcu: Exact tracking of srcu_data structures containing callbacks · c7e88067
      Authored by Paul E. McKenney
      The current Tree SRCU implementation schedules a workqueue for every
      srcu_data covered by a given leaf srcu_node structure having callbacks,
      even if only one of those srcu_data structures actually contains
      callbacks.  This is clearly inefficient for workloads that don't feature
      callbacks everywhere all the time.  This commit therefore adds an array
      of masks that are used by the leaf srcu_node structures to track exactly
      which srcu_data structures contain callbacks.
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Tested-by: Mike Galbraith <efault@gmx.de>
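      
      A hedged sketch of the bookkeeping; field names follow the commit
      description and may differ from the final code:
      
        /* When this srcu_data gains its first callback for grace-period
         * index idx, mark it in the leaf srcu_node so that only CPUs
         * that actually have callbacks get a workqueue scheduled. */
        snp = sdp->mynode;
        spin_lock_irqsave(&snp->lock, flags);
        snp->srcu_data_have_cbs[idx] |= sdp->grpmask;
        spin_unlock_irqrestore(&snp->lock, flags);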
  3. 21 April 2017, 3 commits
    • rcu: Make non-preemptive schedule be Tasks RCU quiescent state · bcbfdd01
      Authored by Paul E. McKenney
      Currently, a call to schedule() acts as a Tasks RCU quiescent state
      only if a context switch actually takes place.  However, just the
      call to schedule() guarantees that the calling task has moved off of
      whatever tracing trampoline it might have been on previously.
      This commit therefore plumbs schedule()'s "preempt" parameter into
      rcu_note_context_switch(), which then records the Tasks RCU quiescent
      state, but only if this call to schedule() was -not- due to a preemption.
      
      To avoid adding overhead to the common-case context-switch path,
      this commit hides the rcu_note_context_switch() check under an existing
      non-common-case check.
      Suggested-by: Steven Rostedt <rostedt@goodmis.org>
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
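      
      A hedged sketch of the plumbing; the guard structure in both
      functions is simplified relative to the real kernel:
      
        /* kernel/sched/core.c */
        static void __sched notrace __schedule(bool preempt)
        {
        	/* ... */
        	rcu_note_context_switch(preempt);  /* now takes "preempt" */
        	/* ... */
        }
        
        /* Record a Tasks RCU quiescent state only when the schedule()
         * was voluntary, i.e. not due to preemption. */
        void rcu_note_context_switch(bool preempt)
        {
        	/* ... */
        	if (!preempt)
        		rcu_note_voluntary_context_switch_lite(current);
        	/* ... */
        }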
    • srcu: Expedite srcu_schedule_cbs_snp() callback invocation · 0497b489
      Authored by Paul E. McKenney
      Although Tree SRCU does reduce delays when there is at least one
      synchronize_srcu_expedited() invocation pending, srcu_schedule_cbs_snp()
      still waits for SRCU_INTERVAL before invoking callbacks.  Since
      synchronize_srcu_expedited() now posts a callback and waits for
      that callback to do a wakeup, this destroys the expedited nature of
      synchronize_srcu_expedited().  This destruction became apparent to
      Marc Zyngier in the guise of a guest-OS bootup slowdown from five
      seconds to no fewer than forty seconds.
      
      This commit therefore invokes callbacks immediately at the end of the
      grace period when there is at least one synchronize_srcu_expedited()
      invocation pending.  This brought Marc's guest-OS bootup times back
      into the realm of reason.
      Reported-by: Marc Zyngier <marc.zyngier@arm.com>
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Tested-by: Marc Zyngier <marc.zyngier@arm.com>
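      
      A hedged sketch: srcu_exp_cnt is the per-srcu_struct counter from
      commit f60d231a mentioned earlier in this log, and the helper and
      field names are assumptions based on the commit subject:
      
        static void srcu_schedule_cbs_snp(struct srcu_struct *sp,
        				  struct srcu_node *snp)
        {
        	/* Zero delay when an expedited grace period is pending. */
        	unsigned long delay = atomic_read(&sp->srcu_exp_cnt)
        			      ? 0 : SRCU_INTERVAL;
        	int cpu;
        
        	for (cpu = snp->grplo; cpu <= snp->grphi; cpu++)
        		srcu_schedule_cbs_sdp(per_cpu_ptr(sp->sda, cpu),
        				      delay);
        }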
    • srcu: Parallelize callback handling · da915ad5
      Authored by Paul E. McKenney
      Peter Zijlstra proposed using SRCU to reduce mmap_sem contention [1,2],
      however, there are workloads that could result in a high volume of
      concurrent invocations of call_srcu(), which with current SRCU would
      result in excessive lock contention on the srcu_struct structure's
      ->queue_lock, which protects SRCU's callback lists.  This commit therefore
      moves SRCU to per-CPU callback lists, thus greatly reducing contention.
      
      Because a given SRCU instance no longer has a single centralized callback
      list, starting grace periods and invoking callbacks are both more complex
      than in the single-list Classic SRCU implementation.  Starting grace
      periods and handling callbacks are now handled using an srcu_node tree
      that is in some ways similar to the rcu_node trees used by RCU-bh,
      RCU-preempt, and RCU-sched (for example, the srcu_node tree shape is
      controlled by exactly the same Kconfig options and boot parameters that
      control the shape of the rcu_node tree).
      
      In addition, the old per-CPU srcu_array structure is now named srcu_data
      and contains an rcu_segcblist structure named ->srcu_cblist for its
      callbacks (and a spinlock to protect this).  The srcu_struct gets
      an srcu_gp_seq that is used to associate callback segments with the
      corresponding completion-time grace-period number.  These completion-time
      grace-period numbers are propagated up the srcu_node tree so that the
      grace-period workqueue handler can determine whether additional grace
      periods are needed, on the one hand, and where to look for callbacks
      that are ready to be invoked, on the other.
      
      The srcu_barrier() function must now wait on all instances of the per-CPU
      ->srcu_cblist.  Because each ->srcu_cblist is protected by ->lock,
      srcu_barrier() can remotely add the needed callbacks.  In theory,
      it could also remotely start grace periods, but in practice doing so
      is complex and racy.  And interestingly enough, it is never necessary
      for srcu_barrier() to start a grace period because srcu_barrier() only
      enqueues a callback when a callback is already present--and it turns out
      that a grace period has to have already been started for this pre-existing
      callback.  Furthermore, it is only the callback that srcu_barrier()
      needs to wait on, not any particular grace period.  Therefore, a new
      rcu_segcblist_entrain() function enqueues the srcu_barrier() function's
      callback into the same segment occupied by the last pre-existing callback
      in the list.  The special case where all the pre-existing callbacks are
      on a different list (because they are in the process of being invoked)
      is handled by enqueuing srcu_barrier()'s callback into the RCU_DONE_TAIL
      segment, relying on the done-callbacks check that takes place after all
      callbacks are invoked.
      
      Note that the readers use the same algorithm as before, and that a
      separate srcu_idx tells the readers which counter to increment.  This
      srcu_idx unfortunately cannot be combined with srcu_gp_seq because
      the two need to be incremented at different times.
      
      This commit introduces some ugly #ifdefs in rcutorture.  These will go
      away when I feel good enough about Tree SRCU to ditch Classic SRCU.
      
      Some crude performance comparisons, courtesy of a quickly hacked rcuperf
      asynchronous-grace-period capability:
      
      			Callback Queuing Overhead
      			-------------------------
      	# CPUS		Classic SRCU	Tree SRCU
      	------          ------------    ---------
      	     2              0.349 us     0.342 us
      	    16             31.66  us     0.4   us
      	    41             ---------     0.417 us
      
      The times are the 90th percentiles, a statistic that was chosen to reject
      the overheads of the occasional srcu_barrier() call needed to avoid OOMing
      the test machine.  The rcuperf test hangs when running Classic SRCU at 41
      CPUs, hence the line of dashes.  Despite the hacks to both the rcuperf code
      and the statistics, this is a convincing demonstration of Tree SRCU's
      performance and scalability advantages.
      
      [1] https://lwn.net/Articles/309030/
      [2] https://patchwork.kernel.org/patch/5108281/
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      [ paulmck: Fix initialization if synchronize_srcu_expedited() called first. ]
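      
      A hedged sketch of the data layout described above (fields
      abbreviated; names follow the commit text, not necessarily the
      final source):
      
        struct srcu_data {			/* one instance per CPU */
        	spinlock_t lock;		/* protects ->srcu_cblist */
        	struct rcu_segcblist srcu_cblist; /* segmented callbacks */
        	/* ... grace-period and expediting bookkeeping ... */
        };
        
        struct srcu_struct {
        	unsigned int srcu_idx;		/* readers' counter selector */
        	unsigned long srcu_gp_seq;	/* completion-time GP number */
        	struct srcu_data __percpu *sda;	/* per-CPU callback state */
        	/* ... srcu_node combining tree ... */
        };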
  4. 20 April 2017, 5 commits
  5. 19 April 2017, 25 commits