1. 04 December 2013 (4 commits)
    • rcu: Allow task-level idle entry/exit nesting · 3a592405
      Committed by Paul E. McKenney
      The current task-level idle entry/exit code forces an entry/exit on
      each call, regardless of the nesting level.  This commit therefore
      properly accounts for nesting.
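
      As an illustrative sketch of the idea only (the flat counter and the
      do_idle_entry()/do_idle_exit() helpers are assumptions, not the actual
      patch, which keeps per-CPU state):

      	static long idle_nesting;	/* per-CPU in the real code */

      	void task_idle_enter(void)
      	{
      		/* Only the outermost call actually enters idle. */
      		if (idle_nesting++ == 0)
      			do_idle_entry();
      	}

      	void task_idle_exit(void)
      	{
      		/* Only the outermost call actually exits idle. */
      		if (--idle_nesting == 0)
      			do_idle_exit();
      	}
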
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Reviewed-by: Frederic Weisbecker <fweisbec@gmail.com>
    • rcu: Break call_rcu() deadlock involving scheduler and perf · 96d3fd0d
      Committed by Paul E. McKenney
      Dave Jones got the following lockdep splat:
      
      >  ======================================================
      >  [ INFO: possible circular locking dependency detected ]
      >  3.12.0-rc3+ #92 Not tainted
      >  -------------------------------------------------------
      >  trinity-child2/15191 is trying to acquire lock:
      >   (&rdp->nocb_wq){......}, at: [<ffffffff8108ff43>] __wake_up+0x23/0x50
      >
      > but task is already holding lock:
      >   (&ctx->lock){-.-...}, at: [<ffffffff81154c19>] perf_event_exit_task+0x109/0x230
      >
      > which lock already depends on the new lock.
      >
      >
      > the existing dependency chain (in reverse order) is:
      >
      > -> #3 (&ctx->lock){-.-...}:
      >         [<ffffffff810cc243>] lock_acquire+0x93/0x200
      >         [<ffffffff81733f90>] _raw_spin_lock+0x40/0x80
      >         [<ffffffff811500ff>] __perf_event_task_sched_out+0x2df/0x5e0
      >         [<ffffffff81091b83>] perf_event_task_sched_out+0x93/0xa0
      >         [<ffffffff81732052>] __schedule+0x1d2/0xa20
      >         [<ffffffff81732f30>] preempt_schedule_irq+0x50/0xb0
      >         [<ffffffff817352b6>] retint_kernel+0x26/0x30
      >         [<ffffffff813eed04>] tty_flip_buffer_push+0x34/0x50
      >         [<ffffffff813f0504>] pty_write+0x54/0x60
      >         [<ffffffff813e900d>] n_tty_write+0x32d/0x4e0
      >         [<ffffffff813e5838>] tty_write+0x158/0x2d0
      >         [<ffffffff811c4850>] vfs_write+0xc0/0x1f0
      >         [<ffffffff811c52cc>] SyS_write+0x4c/0xa0
      >         [<ffffffff8173d4e4>] tracesys+0xdd/0xe2
      >
      > -> #2 (&rq->lock){-.-.-.}:
      >         [<ffffffff810cc243>] lock_acquire+0x93/0x200
      >         [<ffffffff81733f90>] _raw_spin_lock+0x40/0x80
      >         [<ffffffff810980b2>] wake_up_new_task+0xc2/0x2e0
      >         [<ffffffff81054336>] do_fork+0x126/0x460
      >         [<ffffffff81054696>] kernel_thread+0x26/0x30
      >         [<ffffffff8171ff93>] rest_init+0x23/0x140
      >         [<ffffffff81ee1e4b>] start_kernel+0x3f6/0x403
      >         [<ffffffff81ee1571>] x86_64_start_reservations+0x2a/0x2c
      >         [<ffffffff81ee1664>] x86_64_start_kernel+0xf1/0xf4
      >
      > -> #1 (&p->pi_lock){-.-.-.}:
      >         [<ffffffff810cc243>] lock_acquire+0x93/0x200
      >         [<ffffffff8173419b>] _raw_spin_lock_irqsave+0x4b/0x90
      >         [<ffffffff810979d1>] try_to_wake_up+0x31/0x350
      >         [<ffffffff81097d62>] default_wake_function+0x12/0x20
      >         [<ffffffff81084af8>] autoremove_wake_function+0x18/0x40
      >         [<ffffffff8108ea38>] __wake_up_common+0x58/0x90
      >         [<ffffffff8108ff59>] __wake_up+0x39/0x50
      >         [<ffffffff8110d4f8>] __call_rcu_nocb_enqueue+0xa8/0xc0
      >         [<ffffffff81111450>] __call_rcu+0x140/0x820
      >         [<ffffffff81111b8d>] call_rcu+0x1d/0x20
      >         [<ffffffff81093697>] cpu_attach_domain+0x287/0x360
      >         [<ffffffff81099d7e>] build_sched_domains+0xe5e/0x10a0
      >         [<ffffffff81efa7fc>] sched_init_smp+0x3b7/0x47a
      >         [<ffffffff81ee1f4e>] kernel_init_freeable+0xf6/0x202
      >         [<ffffffff817200be>] kernel_init+0xe/0x190
      >         [<ffffffff8173d22c>] ret_from_fork+0x7c/0xb0
      >
      > -> #0 (&rdp->nocb_wq){......}:
      >         [<ffffffff810cb7ca>] __lock_acquire+0x191a/0x1be0
      >         [<ffffffff810cc243>] lock_acquire+0x93/0x200
      >         [<ffffffff8173419b>] _raw_spin_lock_irqsave+0x4b/0x90
      >         [<ffffffff8108ff43>] __wake_up+0x23/0x50
      >         [<ffffffff8110d4f8>] __call_rcu_nocb_enqueue+0xa8/0xc0
      >         [<ffffffff81111450>] __call_rcu+0x140/0x820
      >         [<ffffffff81111bb0>] kfree_call_rcu+0x20/0x30
      >         [<ffffffff81149abf>] put_ctx+0x4f/0x70
      >         [<ffffffff81154c3e>] perf_event_exit_task+0x12e/0x230
      >         [<ffffffff81056b8d>] do_exit+0x30d/0xcc0
      >         [<ffffffff8105893c>] do_group_exit+0x4c/0xc0
      >         [<ffffffff810589c4>] SyS_exit_group+0x14/0x20
      >         [<ffffffff8173d4e4>] tracesys+0xdd/0xe2
      >
      > other info that might help us debug this:
      >
      > Chain exists of:
      >   &rdp->nocb_wq --> &rq->lock --> &ctx->lock
      >
      >   Possible unsafe locking scenario:
      >
      >         CPU0                    CPU1
      >         ----                    ----
      >    lock(&ctx->lock);
      >                                 lock(&rq->lock);
      >                                 lock(&ctx->lock);
      >    lock(&rdp->nocb_wq);
      >
      >  *** DEADLOCK ***
      >
      > 1 lock held by trinity-child2/15191:
      >  #0:  (&ctx->lock){-.-...}, at: [<ffffffff81154c19>] perf_event_exit_task+0x109/0x230
      >
      > stack backtrace:
      > CPU: 2 PID: 15191 Comm: trinity-child2 Not tainted 3.12.0-rc3+ #92
      >  ffffffff82565b70 ffff880070c2dbf8 ffffffff8172a363 ffffffff824edf40
      >  ffff880070c2dc38 ffffffff81726741 ffff880070c2dc90 ffff88022383b1c0
      >  ffff88022383aac0 0000000000000000 ffff88022383b188 ffff88022383b1c0
      > Call Trace:
      >  [<ffffffff8172a363>] dump_stack+0x4e/0x82
      >  [<ffffffff81726741>] print_circular_bug+0x200/0x20f
      >  [<ffffffff810cb7ca>] __lock_acquire+0x191a/0x1be0
      >  [<ffffffff810c6439>] ? get_lock_stats+0x19/0x60
      >  [<ffffffff8100b2f4>] ? native_sched_clock+0x24/0x80
      >  [<ffffffff810cc243>] lock_acquire+0x93/0x200
      >  [<ffffffff8108ff43>] ? __wake_up+0x23/0x50
      >  [<ffffffff8173419b>] _raw_spin_lock_irqsave+0x4b/0x90
      >  [<ffffffff8108ff43>] ? __wake_up+0x23/0x50
      >  [<ffffffff8108ff43>] __wake_up+0x23/0x50
      >  [<ffffffff8110d4f8>] __call_rcu_nocb_enqueue+0xa8/0xc0
      >  [<ffffffff81111450>] __call_rcu+0x140/0x820
      >  [<ffffffff8109bc8f>] ? local_clock+0x3f/0x50
      >  [<ffffffff81111bb0>] kfree_call_rcu+0x20/0x30
      >  [<ffffffff81149abf>] put_ctx+0x4f/0x70
      >  [<ffffffff81154c3e>] perf_event_exit_task+0x12e/0x230
      >  [<ffffffff81056b8d>] do_exit+0x30d/0xcc0
      >  [<ffffffff810c9af5>] ? trace_hardirqs_on_caller+0x115/0x1e0
      >  [<ffffffff810c9bcd>] ? trace_hardirqs_on+0xd/0x10
      >  [<ffffffff8105893c>] do_group_exit+0x4c/0xc0
      >  [<ffffffff810589c4>] SyS_exit_group+0x14/0x20
      >  [<ffffffff8173d4e4>] tracesys+0xdd/0xe2
      
      The underlying problem is that perf is invoking call_rcu() with the
      scheduler locks held, but in NOCB mode, call_rcu() will with high
      probability invoke the scheduler -- which just might want to use its
      locks.  The reason that call_rcu() needs to invoke the scheduler is
      to wake up the corresponding rcuo callback-offload kthread, which
      does the job of starting up a grace period and invoking the callbacks
      afterwards.
      
      One solution (championed on a related problem by Lai Jiangshan) is to
      simply defer the wakeup to some point where scheduler locks are no longer
      held.  Since we don't want to unnecessarily incur the cost of such
      deferral, the task before us is threefold:
      
      1.	Determine when it is likely that a relevant scheduler lock is held.
      
      2.	Defer the wakeup in such cases.
      
      3.	Ensure that all deferred wakeups eventually happen, preferably
      	sooner rather than later.
      
      We use irqs_disabled_flags() as a proxy for relevant scheduler locks
      being held.  This works because the relevant locks are always acquired
      with interrupts disabled.  We may defer more often than needed, but that
      is at least safe.
      
      The wakeup deferral is tracked via a new field in the per-CPU and
      per-RCU-flavor rcu_data structure, namely ->nocb_defer_wakeup.
      
      This flag is checked by the RCU core processing.  The __rcu_pending()
      function now checks this flag, which causes rcu_check_callbacks()
      to initiate RCU core processing at each scheduling-clock interrupt
      where this flag is set.  Of course this is not sufficient because
      scheduling-clock interrupts are often turned off (the things we used to
      be able to count on!).  So the flags are also checked on entry to any
      state that RCU considers to be idle, which includes both NO_HZ_IDLE idle
      state and NO_HZ_FULL user-mode-execution state.
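
      As a hedged sketch, the enqueue-side logic then has roughly this shape
      (based only on the description above; the surrounding code is
      simplified):

      	if (irqs_disabled_flags(flags)) {
      		/* Scheduler locks may be held: defer the wakeup. */
      		ACCESS_ONCE(rdp->nocb_defer_wakeup) = true;
      	} else {
      		/* Safe to invoke the scheduler directly. */
      		wake_up(&rdp->nocb_wq);
      	}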
      
      This approach should allow call_rcu() to be invoked regardless of what
      locks you might be holding, the key word being "should".
      Reported-by: Dave Jones <davej@redhat.com>
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
    • rcu: Fix and comment ordering around wait_event() · 78e4bc34
      Committed by Paul E. McKenney
      It is all too easy to forget that wait_event() does not necessarily
      imply a full memory barrier.  The case where it does not is where the
      condition transitions to true just as wait_event() starts execution.
      This is actually a feature: The standard use of wait_event() involves
      locking, in which case the locks provide the needed ordering (you hold a
      lock across the wake_up() and acquire that same lock after wait_event()
      returns).
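
      For example, the standard locked pattern looks roughly like this
      (mylock, wq, and cond are illustrative names):

      	/* Waker: the lock is held across the wake_up()... */
      	spin_lock(&mylock);
      	cond = true;
      	wake_up(&wq);
      	spin_unlock(&mylock);

      	/* Waiter: ...and re-acquired after wait_event() returns. */
      	wait_event(wq, cond);
      	spin_lock(&mylock);
      	/* The store to cond is guaranteed visible here. */
      	spin_unlock(&mylock);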
      
      Given that I did forget that wait_event() does not necessarily imply a
      full memory barrier in one case, this commit fixes that case.  This commit
      also adds comments calling out the placement of existing memory barriers
      relied on by wait_event() calls.
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
    • rcu: Kick CPU halfway to RCU CPU stall warning · 6193c76a
      Committed by Paul E. McKenney
      When an RCU CPU stall warning occurs, the CPU invokes resched_cpu() on
      itself.  This can help move the grace period forward in some situations,
      but it would be even better to do this -before- the RCU CPU stall warning.
      This commit therefore causes resched_cpu() to be called every five jiffies
      once the system is halfway to an RCU CPU stall warning.
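
      A sketch of the resulting check (the ->jiffies_resched field name and
      the re-arm details are assumptions based on the description above):

      	/* Once halfway to a stall warning, poke the CPU periodically. */
      	if (ULONG_CMP_GE(jiffies, rsp->jiffies_resched)) {
      		rsp->jiffies_resched += 5;	/* re-arm five jiffies out */
      		resched_cpu(smp_processor_id());
      	}
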
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
  2. 16 October 2013 (1 commit)
  3. 25 September 2013 (4 commits)
  4. 24 September 2013 (11 commits)
    • rcu: Add tracing of normal (non-NOCB) grace-period requests · bb311ecc
      Committed by Paul E. McKenney
      This commit adds tracing to the normal grace-period request points.
      These are rcu_gp_cleanup(), which checks for the need for another
      grace period at the end of the previous grace period, and
      rcu_start_gp_advanced(), which restarts RCU's state machine after
      an idle period.  These trace events are intended to help track down
      bugs where RCU remains idle despite there being work for it to do.
      Reported-by: Clark Williams <williams@redhat.com>
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
    • rcu: Add tracing to rcu_gp_kthread() · 63c4db78
      Committed by Paul E. McKenney
      This commit adds tracing to the rcu_gp_kthread() function in order to
      help track down hangs potentially involving this kthread.
      Reported-by: Clark Williams <williams@redhat.com>
      Reported-by: Carsten Emde <C.Emde@osadl.org>
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
    • rcu: Flag lockless access to ->gp_flags with ACCESS_ONCE() · 591c6d17
      Committed by Paul E. McKenney
      This commit applies ACCESS_ONCE() to an outside-of-lock access to
      ->gp_flags.  Although it is hard to imagine any sane compiler messing
      this particular case up, the documentation benefits are substantial.
      Plus the definition of "sane compiler" grows ever looser.
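
      For instance (illustrative fragment; handle_fqs_request() is a
      hypothetical stand-in for the real consumer of the flag):

      	if (ACCESS_ONCE(rsp->gp_flags) & RCU_GP_FLAG_FQS)
      		handle_fqs_request();	/* racy read, now flagged as such */
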
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
    • rcu: Prevent spurious-wakeup DoS attack on rcu_gp_kthread() · 88d6df61
      Committed by Paul E. McKenney
      Spurious wakeups in the force-quiescent-state loop in rcu_gp_kthread()
      cause the timeout to be recalculated, which would prevent rcu_gp_fqs()
      from ever being called.  This would in turn prevent the grace period
      from ever ending for as long as there was at least one CPU in an extended
      quiescent state that had not yet passed through a quiescent state.
      
      This commit therefore avoids recalculating the timeout unless the
      previous pass's call to wait_event_interruptible_timeout() actually
      did time out, thus preventing the above scenario.
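
      A simplified sketch of the fixed loop (fqs_requested() and
      gp_completed() are hypothetical helpers; the real condition and
      bookkeeping are more involved):

      	unsigned long j, deadline;
      	long ret;

      	j = jiffies_till_first_fqs;
      	deadline = jiffies + j;
      	for (;;) {
      		ret = wait_event_interruptible_timeout(rsp->gp_wq,
      						       fqs_requested(rsp), j);
      		if (gp_completed(rsp))
      			break;
      		if (ret == 0 || fqs_requested(rsp)) {
      			rcu_gp_fqs(rsp);	/* real timeout or request */
      			j = jiffies_till_next_fqs;
      			deadline = jiffies + j;	/* recompute only here */
      		} else {
      			j = deadline - jiffies;	/* spurious: keep deadline */
      		}
      	}
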
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
    • rcu: Improve grace-period start logic · f7be8209
      Committed by Paul E. McKenney
      This commit improves grace-period start logic by checking ->gp_flags
      under the lock and by issuing a warning if a grace period is already
      in progress.
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
    • rcu: Have rcutiny tracepoints use tracepoint_string() · 0d752924
      Committed by Paul E. McKenney
      This commit extends the work done in f7f7bac9 (rcu: Have the RCU
      tracepoints use the tracepoint_string infrastructure) to cover rcutiny.
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
    • rcu: Reject memory-order-induced stall-warning false positives · 26cdfedf
      Committed by Paul E. McKenney
      If a system is idle from an RCU perspective for longer than specified
      by CONFIG_RCU_CPU_STALL_TIMEOUT, and if one CPU starts a grace period
      just as a second checks for CPU stalls, and if this second CPU happens
      to see the old value of rsp->jiffies_stall, it will incorrectly report a
      CPU stall.  This is quite rare, but apparently occurs deterministically
      on systems with about 6TB of memory.
      
      This commit therefore orders accesses to the data used to determine
      whether or not a CPU stall is in progress.  Grace-period initialization
      and cleanup first increment rsp->completed to mark the end of the
      previous grace period, then record the current jiffies in rsp->gp_start,
      then record the jiffies at which a stall can be expected to occur in
      rsp->jiffies_stall, and finally increment rsp->gpnum to mark the start
      of the new grace period.  Now, this ordering by itself does not prevent
      false positives.  For example, if grace-period initialization was delayed
      between recording rsp->gp_start and rsp->jiffies_stall, the CPU stall
      warning code might still see an old value of rsp->jiffies_stall.
      
      Therefore, this commit also orders the CPU stall warning accesses as
      well, loading rsp->gpnum and jiffies, then rsp->jiffies_stall, then
      rsp->gp_start, and finally rsp->completed.  This ordering means that
      the false-positive scenario in the previous paragraph would result
      in rsp->completed being greater than or equal to rsp->gpnum, which is
      never valid for a CPU stall, allowing the false positive to be rejected.
      Furthermore, any fetch that gets an old value of rsp->jiffies_stall
      must also get an old value of rsp->gpnum, which will again be rejected
      by the comparison of rsp->gpnum and rsp->completed.  Situations where
      rsp->gp_start is later than rsp->jiffies_stall are also rejected, as
      are situations where jiffies is less than rsp->jiffies_stall.
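
      The stall-check side then reads in the mirror-image order, roughly as
      follows (a simplified sketch; barrier placement follows the description
      above):

      	unsigned long gpnum, completed, gps, js, now;

      	gpnum = ACCESS_ONCE(rsp->gpnum);
      	now = jiffies;
      	smp_rmb();	/* ->gpnum and jiffies before ->jiffies_stall. */
      	js = ACCESS_ONCE(rsp->jiffies_stall);
      	smp_rmb();	/* ->jiffies_stall before ->gp_start. */
      	gps = ACCESS_ONCE(rsp->gp_start);
      	smp_rmb();	/* ->gp_start before ->completed. */
      	completed = ACCESS_ONCE(rsp->completed);
      	if (ULONG_CMP_GE(completed, gpnum) ||	/* no GP in progress */
      	    ULONG_CMP_LT(now, js) ||		/* stall not yet due */
      	    ULONG_CMP_GE(gps, js))		/* inconsistent snapshot */
      		return;				/* reject false positive */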
      
      Although use of unsynchronized accesses means that there are likely
      still some false-positive scenarios (synchronization has proven to be
      a very bad idea on large systems), this should get rid of a large class
      of these scenarios.
      Reported-by: Fabian Herschel <fabian.herschel@suse.com>
      Reported-by: Michal Hocko <mhocko@suse.com>
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Reviewed-by: Michal Hocko <mhocko@suse.cz>
      Tested-by: Jochen Striepe <jochen@tolot.escape.de>
    • rcu: Micro-optimize rcu_cpu_has_callbacks() · 69c8d28c
      Committed by Paul E. McKenney
      The for_each_rcu_flavor() loop unconditionally scans all flavors, even
      when the first flavor might have some non-lazy callbacks.  Once the
      loop has seen a non-lazy callback, further passes through the loop
      cannot change the state.  This is not a huge problem, given that there
      can be at most three RCU flavors (RCU-bh, RCU-preempt, and RCU-sched),
      but this code is on the path to idle, so speeding it up even a small
      amount would have some benefit.
      
      This commit therefore does two things:
      
      1.	Rearranges the order of the list of RCU flavors in order to
      	place the most active flavor first in the list.  The most active
      	RCU flavor is RCU-preempt, or, if there is no RCU-preempt,
      	RCU-sched.
      
      2.	Reworks the for_each_rcu_flavor() loop to exit early when the
      	first non-lazy callback is seen, or, in the case where the caller
      	does not care about non-lazy callbacks (RCU_FAST_NO_HZ=n), when
      	the first callback is seen (see the sketch after this list).
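
      A sketch of the reworked loop body (simplified; ->qlen, ->qlen_lazy,
      and ->nxtlist are the usual rcu_data fields):

      	bool al = true;		/* only lazy callbacks seen so far */
      	bool hc = false;	/* no callbacks seen so far */
      	struct rcu_data *rdp;
      	struct rcu_state *rsp;

      	for_each_rcu_flavor(rsp) {
      		rdp = per_cpu_ptr(rsp->rda, cpu);
      		if (rdp->qlen != rdp->qlen_lazy)
      			al = false;	/* saw a non-lazy callback */
      		if (rdp->nxtlist)
      			hc = true;	/* saw a callback */
      		if (hc && !al)
      			break;		/* answer can no longer change */
      	}
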
      Reported-by: Chen Gang <gang.chen@asianux.com>
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
    • rcu: Silence unused-variable warnings · 289828e6
      Committed by Paul E. McKenney
      The "idle" variable in both rcu_eqs_enter_common() and
      rcu_eqs_exit_common() is only used in a WARN_ON_ONCE().  If the kernel
      is built disabling WARN_ON_ONCE(), the compiler will complain (rightly)
      that "idle" is unused.  This commit therefore adds a __maybe_unused to
      the declaration of both variables.
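
      The shape of the fix, illustratively (compute_idle_state() is a
      hypothetical stand-in for the real initializer):

      	bool idle __maybe_unused = compute_idle_state();

      	WARN_ON_ONCE(idle);	/* if this compiles away, "idle" is unused */
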
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
    • rcu: Replace __get_cpu_var() uses · c9d4b0af
      Committed by Christoph Lameter
      __get_cpu_var() is used for multiple purposes in the kernel source. One
      of them is address calculation via the form &__get_cpu_var(x). This
      calculates the address for the instance of the percpu variable of the
      current processor based on an offset.
      
      Other use cases are for storing and retrieving data from the current
      processor's percpu area.  __get_cpu_var() can be used as an lvalue when
      writing data or on the right side of an assignment.
      
      __get_cpu_var() is defined as:

      	#define __get_cpu_var(var) (*this_cpu_ptr(&(var)))
      __get_cpu_var() always only does an address determination. However,
      store and retrieve operations could use a segment prefix (or global
      register on other platforms) to avoid the address calculation.
      
      this_cpu_write() and this_cpu_read() can directly take an offset into
      a percpu area and use optimized assembly code to read and write per
      cpu variables.
      
      This patch converts __get_cpu_var into either an explicit address
      calculation using this_cpu_ptr() or into a use of this_cpu operations
      that use the offset.  Thereby address calculations are avoided and
      fewer registers are used when code is generated.
      
      At the end of the patchset all uses of __get_cpu_var have been removed
      so the macro is removed too.
      
      The patchset includes passes over all arches as well. Once these
      operations are used throughout, specialized macros can be defined in
      non-x86 arches as well in order to optimize per cpu access, e.g. by
      using a global register that may be set to the per cpu base.
      
      Transformations done to __get_cpu_var()
      
      1. Determine the address of the percpu instance of the current processor.
      
      	DEFINE_PER_CPU(int, y);
      	int *x = &__get_cpu_var(y);
      
          Converts to
      
      	int *x = this_cpu_ptr(&y);
      
      2. Same as #1 but this time an array structure is involved.
      
      	DEFINE_PER_CPU(int, y[20]);
      	int *x = __get_cpu_var(y);
      
          Converts to
      
      	int *x = this_cpu_ptr(y);
      
      3. Retrieve the content of the current processor's instance of a per cpu
         variable.
      
      	DEFINE_PER_CPU(int, y);
      	int x = __get_cpu_var(y);
      
         Converts to
      
      	int x = __this_cpu_read(y);
      
      4. Retrieve the content of a percpu struct
      
      	DEFINE_PER_CPU(struct mystruct, y);
      	struct mystruct x = __get_cpu_var(y);
      
         Converts to
      
      	memcpy(&x, this_cpu_ptr(&y), sizeof(x));
      
      5. Assignment to a per cpu variable
      
      	DEFINE_PER_CPU(int, y)
      	__get_cpu_var(y) = x;
      
         Converts to
      
      	this_cpu_write(y, x);
      
      6. Increment/Decrement etc of a per cpu variable
      
      	DEFINE_PER_CPU(int, y);
      	__get_cpu_var(y)++
      
         Converts to
      
      	this_cpu_inc(y)
      Signed-off-by: Christoph Lameter <cl@linux.com>
      [ paulmck: Address conflicts. ]
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
    • rcu: Convert local functions to static · 01896f7e
      Committed by Paul E. McKenney
      The rcu_cpu_stall_timeout kernel parameter, the rcu_dynticks per-CPU
      variable, and the rcu_gp_fqs() function are used only locally.  This
      commit therefore marks them as static.
      Reported-by: kbuild test robot <fengguang.wu@intel.com>
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
  5. 21 September 2013 (1 commit)
  6. 01 September 2013 (2 commits)
    • nohz_full: Force RCU's grace-period kthreads onto timekeeping CPU · eb75767b
      Committed by Paul E. McKenney
      Because RCU's quiescent-state-forcing mechanism is used to drive the
      full-system-idle state machine, and because this mechanism is executed
      by RCU's grace-period kthreads, this commit forces these kthreads to
      run on the timekeeping CPU (tick_do_timer_cpu).  To do otherwise would
      mean that the RCU grace-period kthreads would force the system into
      non-idle state every time they drove the state machine, which would
      be just a bit on the futile side.
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
      Reviewed-by: Josh Triplett <josh@joshtriplett.org>
    • nohz_full: Add full-system-idle state machine · 0edd1b17
      Committed by Paul E. McKenney
      This commit adds the state machine that takes the per-CPU idle data
      as input and produces a full-system-idle indication as output.  This
      state machine is driven out of RCU's quiescent-state-forcing
      mechanism, which invokes rcu_sysidle_check_cpu() to collect per-CPU
      idle state and then rcu_sysidle_report() to drive the state machine.
      
      The full-system-idle state is sampled using rcu_sys_is_idle(), which
      also drives the state machine if RCU is idle (and does so by forcing
      RCU to become non-idle).  This function returns true if all but the
      timekeeping CPU (tick_do_timer_cpu) are idle and have been idle long
      enough to avoid memory contention on the full_sysidle_state state
      variable.  The rcu_sysidle_force_exit() function may be called
      externally to reset the state machine back into non-idle state.
      
      For large systems the state machine is driven out of RCU's
      force-quiescent-state logic, which provides good scalability at the price
      of millisecond-scale latencies on the transition to full-system-idle
      state.  This is not so good for battery-powered systems, which are usually
      small enough that they don't need to care about scalability, but which
      do care deeply about energy efficiency.  Small systems therefore drive
      the state machine directly out of the idle-entry code.  The number of
      CPUs in a "small" system is defined by a new NO_HZ_FULL_SYSIDLE_SMALL
      Kconfig parameter, which defaults to 8.  Note that this is a build-time
      definition.
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
      [ paulmck: Use true and false for boolean constants per Lai Jiangshan. ]
      Reviewed-by: Josh Triplett <josh@joshtriplett.org>
      [ paulmck: Simplify logic and provide better comments for memory barriers,
        based on review comments and questions by Lai Jiangshan. ]
  7. 21 August 2013 (1 commit)
  8. 19 August 2013 (7 commits)
    • nohz_full: Add full-system-idle arguments to API · 217af2a2
      Committed by Paul E. McKenney
      This commit adds an isidle and jiffies argument to force_qs_rnp(),
      dyntick_save_progress_counter(), and rcu_implicit_dynticks_qs() to enable
      RCU's force-quiescent-state process to check for full-system idle.
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
      [ paulmck: Use true and false for boolean constants per Lai Jiangshan. ]
      Reviewed-by: Josh Triplett <josh@joshtriplett.org>
    • nohz_full: Add per-CPU idle-state tracking · eb348b89
      Committed by Paul E. McKenney
      This commit adds the code that updates the rcu_dyntick structure's
      new fields to track the per-CPU idle state based on interrupts and
      transitions into and out of the idle loop (NMIs are ignored because NMI
      handlers cannot cleanly read out the time anyway).  This code is similar
      to the code that maintains RCU's idea of per-CPU idleness, but differs
      in that RCU treats CPUs running in user mode as idle, whereas this new
      code does not.
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Acked-by: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Reviewed-by: Josh Triplett <josh@joshtriplett.org>
    • nohz_full: Add rcu_dyntick data for scalable detection of all-idle state · 2333210b
      Committed by Paul E. McKenney
      This commit adds fields to the rcu_dyntick structure that are used to
      detect idle CPUs.  These new fields differ from the existing ones in
      that the existing ones consider a CPU executing in user mode to be idle,
      whereas the new ones consider CPUs executing in user mode to be busy.
      The handling of these new fields is otherwise quite similar to that for
      the existing fields.  This commit also adds the initialization required
      for these fields.
      
      So, why is usermode execution treated differently, with RCU considering
      it a quiescent state equivalent to idle, while in contrast the new
      full-system idle state detection considers usermode execution to be
      non-idle?
      
      It turns out that although one of RCU's quiescent states is usermode
      execution, it is not a full-system idle state.  This is because the
      purpose of the full-system idle state is not RCU, but rather determining
      when accurate timekeeping can safely be disabled.  Whenever accurate
      timekeeping is required in a CONFIG_NO_HZ_FULL kernel, at least one
      CPU must keep the scheduling-clock tick going.  If even one CPU is
      executing in user mode, accurate timekeeping is required, particularly for
      architectures where gettimeofday() and friends do not enter the kernel.
      Only when all CPUs are really and truly idle can accurate timekeeping be
      disabled, allowing all CPUs to turn off the scheduling clock interrupt,
      thus greatly improving energy efficiency.
      
      This naturally raises the question "Why is this code in RCU rather than in
      timekeeping?", and the answer is that RCU has the data and infrastructure
      to efficiently make this determination.
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Acked-by: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Reviewed-by: Josh Triplett <josh@joshtriplett.org>
    • rcu: Eliminate unused APIs intended for adaptive ticks · feed66ed
      Committed by Paul E. McKenney
      The rcu_user_enter_after_irq() and rcu_user_exit_after_irq()
      functions were intended for use by adaptive ticks, but changes
      in implementation have rendered them unnecessary.  This commit
      therefore removes them.
      Reported-by: Frederic Weisbecker <fweisbec@gmail.com>
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Reviewed-by: Josh Triplett <josh@joshtriplett.org>
    • rcu: Avoid redundant grace-period kthread wakeups · 1eafd31c
      Committed by Paul E. McKenney
      When setting up an in-the-future "advanced" grace period, the code needs
      to wake up the relevant grace-period kthread, which it currently does
      unconditionally.  However, this results in needless wakeups in the case
      where the advanced grace period is being set up by the grace-period
      kthread itself, which is not an uncommon situation.  This commit therefore
      checks to see if the running thread is the grace-period kthread, and
      avoids doing the irq_work_queue()-mediated wakeup in that case.
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Reviewed-by: Josh Triplett <josh@joshtriplett.org>
    • rcu: Make call_rcu() leak callbacks for debug-object errors · ae150184
      Committed by Paul E. McKenney
      If someone does a duplicate call_rcu(), the worst thing the second
      call_rcu() could do would be to actually queue the callback the second
      time because doing so corrupts whatever list the callback was already
      queued on.  This commit therefore makes __call_rcu() check the new
      return value from debug-objects and leak the callback upon error.
      This commit also substitutes rcu_leak_callback() for whatever callback
      function was previously in place in order to avoid freeing the callback
      out from under any readers that might still be referencing it.
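
      A sketch of the resulting logic, simplified from the description above
      (the warning text is illustrative):

      	static void rcu_leak_callback(struct rcu_head *rhp)
      	{
      		/* Deliberately empty: never free a doubly-queued callback. */
      	}

      	/* Then, in __call_rcu(): */
      	if (debug_rcu_head_queue(head)) {
      		/* Probable duplicate call_rcu(): leak, don't corrupt lists. */
      		WARN_ONCE(1, "duplicate call_rcu()");
      		ACCESS_ONCE(head->func) = rcu_leak_callback;
      		return;
      	}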
      
      These changes increase the probability that the debug-objects error
      messages will actually make it somewhere visible.
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
      Cc: Sedat Dilek <sedat.dilek@gmail.com>
      Cc: Davidlohr Bueso <davidlohr.bueso@hp.com>
      Cc: Rik van Riel <riel@surriel.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Tested-by: Sedat Dilek <sedat.dilek@gmail.com>
      Reviewed-by: Josh Triplett <josh@joshtriplett.org>
    • rcu: Expedite grace periods during suspend/resume · d1d74d14
      Committed by Borislav Petkov
      CONFIG_RCU_FAST_NO_HZ can increase grace-period durations by up to
      a factor of four, which can result in long suspend and resume times.
      Thus, this commit temporarily switches to expedited grace periods when
      suspending the box and returns to normal settings when resuming.  Similar
      logic is applied to hibernation.
      
      Because expedited grace periods are of dubious benefit on very large
      systems, this commit restricts their automated use during suspend
      and resume to systems of 256 or fewer CPUs.  (Some day a number of
      Linux-kernel facilities, including RCU's expedited grace periods,
      will be more scalable, but I need to see bug reports first.)
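
      A sketch of the mechanism (simplified; the PM-notifier skeleton and the
      rcu_expedited knob are assumptions based on the description above):

      	static int rcu_pm_notify(struct notifier_block *self,
      				 unsigned long action, void *hcpu)
      	{
      		switch (action) {
      		case PM_HIBERNATION_PREPARE:
      		case PM_SUSPEND_PREPARE:
      			if (num_online_cpus() <= 256)
      				rcu_expedited = 1;	/* speed up GPs */
      			break;
      		case PM_POST_HIBERNATION:
      		case PM_POST_SUSPEND:
      			rcu_expedited = 0;	/* back to normal */
      			break;
      		default:
      			break;
      		}
      		return NOTIFY_OK;
      	}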
      
      [ paulmck: This also papers over an audio/irq bug, but hopefully that will
        be fixed soon. ]
      Signed-off-by: Borislav Petkov <bp@suse.de>
      Signed-off-by: Bjørn Mork <bjorn@mork.no>
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Reviewed-by: Josh Triplett <josh@joshtriplett.org>
  9. 30 July 2013 (3 commits)
    • rcu: Have the RCU tracepoints use the tracepoint_string infrastructure · f7f7bac9
      Committed by Steven Rostedt (Red Hat)
      Currently, RCU tracepoints save only a pointer to strings in the
      ring buffer. When displayed via the /sys/kernel/debug/tracing/trace file
      they are referenced like the printf "%s" that looks at the address
      in the ring buffer and prints out the string it points to. This requires
      that the strings are constant and persistent in the kernel.
      
      The problem with this arises for tools like trace-cmd and perf, which read
      the binary data from the buffers but have no access to the kernel memory
      to find out what string is represented by the address in the buffer.
      
      By using the tracepoint_string infrastructure, the RCU tracepoint strings
      can be exported such that userspace tools can map the addresses to
      the strings.
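
      Usage at a trace call site is then a one-line wrapper, along these
      lines (illustrative; the TPS() shorthand is an assumption):

      	#define TPS(x)	tracepoint_string(x)

      	trace_rcu_utilization(TPS("Start context switch"));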
      
       # cat /sys/kernel/debug/tracing/printk_formats
      0xffffffff81a4a0e8 : "rcu_preempt"
      0xffffffff81a4a0f4 : "rcu_bh"
      0xffffffff81a4a100 : "rcu_sched"
      0xffffffff818437a0 : "cpuqs"
      0xffffffff818437a6 : "rcu_sched"
      0xffffffff818437a0 : "cpuqs"
      0xffffffff818437b0 : "rcu_bh"
      0xffffffff818437b7 : "Start context switch"
      0xffffffff818437cc : "End context switch"
      0xffffffff818437a0 : "cpuqs"
      [...]
      
      Now userspace tools can display:
      
       rcu_utilization:      Start context switch
       rcu_dyntick:          Start 1 0
       rcu_utilization:      End context switch
       rcu_batch_start:      rcu_preempt CBs=0/5 bl=10
       rcu_dyntick:          End 0 140000000000000
       rcu_invoke_callback:  rcu_preempt rhp=0xffff880071c0d600 func=proc_i_callback
       rcu_invoke_callback:  rcu_preempt rhp=0xffff880077b5b230 func=__d_free
       rcu_dyntick:          Start 140000000000000 0
       rcu_invoke_callback:  rcu_preempt rhp=0xffff880077563980 func=file_free_rcu
       rcu_batch_end:        rcu_preempt CBs-invoked=3 idle=>c<>c<>c<>c<
       rcu_utilization:      End RCU core
       rcu_grace_period:     rcu_preempt 9741 start
       rcu_dyntick:          Start 1 0
       rcu_dyntick:          End 0 140000000000000
       rcu_dyntick:          Start 140000000000000 0
      
      Instead of:
      
       rcu_utilization:      ffffffff81843110
       rcu_future_grace_period: ffffffff81842f1d 9939 9939 9940 0 0 3 ffffffff81842f32
       rcu_batch_start:      ffffffff81842f1d CBs=0/4 bl=10
       rcu_future_grace_period: ffffffff81842f1d 9939 9939 9940 0 0 3 ffffffff81842f3c
       rcu_grace_period:     ffffffff81842f1d 9939 ffffffff81842f80
       rcu_invoke_callback:  ffffffff81842f1d rhp=0xffff88007888aac0 func=file_free_rcu
       rcu_grace_period:     ffffffff81842f1d 9939 ffffffff81842f95
       rcu_invoke_callback:  ffffffff81842f1d rhp=0xffff88006aeb4600 func=proc_i_callback
       rcu_future_grace_period: ffffffff81842f1d 9939 9939 9940 0 0 3 ffffffff81842f32
       rcu_future_grace_period: ffffffff81842f1d 9939 9939 9940 0 0 3 ffffffff81842f3c
       rcu_invoke_callback:  ffffffff81842f1d rhp=0xffff880071cb9fc0 func=__d_free
       rcu_grace_period:     ffffffff81842f1d 9939 ffffffff81842f80
       rcu_invoke_callback:  ffffffff81842f1d rhp=0xffff88007888ae80 func=file_free_rcu
       rcu_batch_end:        ffffffff81842f1d CBs-invoked=4 idle=>c<>c<>c<>c<
       rcu_utilization:      ffffffff8184311f
      Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
    • rcu: Simplify RCU_STATE_INITIALIZER() macro · a41bfeb2
      Committed by Steven Rostedt (Red Hat)
      The RCU_STATE_INITIALIZER() macro is used only in the rcutree.c file
      as well as the rcutree_plugin.h file. It is passed as an rvalue to
      a variable of a similar name. A per_cpu variable is also created
      with a similar name as well.
      
      The uses of RCU_STATE_INITIALIZER() can be simplified to remove some
      of the duplicate code that is done. Currently the three users of this
      macro have this format:
      
      struct rcu_state rcu_sched_state =
      	RCU_STATE_INITIALIZER(rcu_sched, call_rcu_sched);
      DEFINE_PER_CPU(struct rcu_data, rcu_sched_data);
      
      Notice that "rcu_sched" is called three times. This is the same with
      the other two users. This can be condensed to just:
      
      RCU_STATE_INITIALIZER(rcu_sched, call_rcu_sched);
      
      by moving the rest into the macro itself.
      
      This also opens the door to allow the RCU tracepoint strings and
      their addresses to be exported so that userspace tracing tools can
      translate the contents of the pointers of the RCU tracepoints.
      The change will allow for helper code to be placed in the
      RCU_STATE_INITIALIZER() macro to export the name that is used.
      Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
    • rcu: Add const annotation to char * for RCU tracepoints and functions · e66c33d5
      Committed by Steven Rostedt (Red Hat)
      All the RCU tracepoints and functions that reference char pointers do
      so with just 'char *' even though they do not modify the contents of
      the string itself. This will cause warnings if a const char * is used
      in one of these functions.
      
      The RCU tracepoints store the pointer to the string to refer back to them
      when the trace output is displayed. As this can be minutes, hours or
      even days later, those strings had better be constant.
      
      This change also opens the door to allow the RCU tracepoint strings and
      their addresses to be exported so that userspace tracing tools can
      translate the contents of the pointers of the RCU tracepoints.
      Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
  10. 15 July 2013 (1 commit)
    • rcu: delete __cpuinit usage from all rcu files · 49fb4c62
      Committed by Paul Gortmaker
      The __cpuinit type of throwaway sections might have made sense
      some time ago when RAM was more constrained, but now the savings
      do not offset the cost and complications.  For example, the fix in
      commit 5e427ec2 ("x86: Fix bit corruption at CPU resume time")
      is a good example of the nasty type of bugs that can be created
      with improper use of the various __init prefixes.
      
      After a discussion on LKML[1] it was decided that cpuinit should go
      the way of devinit and be phased out.  Once all the users are gone,
      we can then finally remove the macros themselves from linux/init.h.
      
      This removes all the drivers/rcu uses of the __cpuinit macros
      from all C files.
      
      [1] https://lkml.org/lkml/2013/5/20/589
      
      Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
      Cc: Josh Triplett <josh@freedesktop.org>
      Cc: Dipankar Sarma <dipankar@in.ibm.com>
      Reviewed-by: Josh Triplett <josh@joshtriplett.org>
      Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
  11. 04 July 2013 (1 commit)
  12. 11 June 2013 (4 commits)