1. 11 June 2013: 8 commits
    • rcu: Merge __rcu_process_gp_end() into __note_gp_changes() · ba9fbe95
      Committed by Paul E. McKenney
      This commit eliminates some duplicated code by merging
      __rcu_process_gp_end() into __note_gp_changes().
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Reviewed-by: Josh Triplett <josh@joshtriplett.org>
    • rcu: Switch callers from rcu_process_gp_end() to note_gp_changes() · 470716fc
      Committed by Paul E. McKenney
      Because note_gp_changes() now incorporates the rcu_process_gp_end()
      functionality, this commit switches callers to the former and eliminates
      the latter.  In addition, this commit changes external calls from
      __rcu_process_gp_end() to __note_gp_changes().
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Reviewed-by: Josh Triplett <josh@joshtriplett.org>
    • rcu: Rename note_new_gpnum() to note_gp_changes() · d34ea322
      Committed by Paul E. McKenney
      Because note_new_gpnum() now also checks for the ends of old grace periods,
      this commit changes its name to note_gp_changes().  Later commits will merge
      rcu_process_gp_end() into note_gp_changes().
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Reviewed-by: Josh Triplett <josh@joshtriplett.org>
    • rcu: Make __note_new_gpnum() check for ends of prior grace periods · 398ebe60
      Committed by Paul E. McKenney
      The current implementation can detect the beginning of a new grace period
      before noting the end of a previous grace period.  Although the current
      implementation correctly handles this sort of nonsense, it would be
      good to reduce RCU's state space by making such nonsense unnecessary,
      which is now possible thanks to the fact that RCU's callback groups are
      now numbered.
      
      This commit therefore makes __note_new_gpnum() invoke
      __rcu_process_gp_end() in order to note the ends of prior grace
      periods before noting the beginnings of new grace periods.
      Of course, this now means that note_new_gpnum() notes both the
      beginnings and ends of grace periods, and could therefore be
      used in place of rcu_process_gp_end().  But that is a job for
      later commits.
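
      A minimal sketch of the resulting flow (the argument list and the
      surrounding bookkeeping are illustrative; only the call ordering and
      the function names come from the commit text):

	static void __note_new_gpnum(struct rcu_state *rsp, struct rcu_node *rnp,
				     struct rcu_data *rdp)
	{
		/* First note the end of any grace period that already completed. */
		__rcu_process_gp_end(rsp, rnp, rdp);

		if (rdp->gpnum != rnp->gpnum) {
			/* ...existing start-of-new-grace-period bookkeeping... */
			rdp->gpnum = rnp->gpnum;
		}
	}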
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Reviewed-by: Josh Triplett <josh@joshtriplett.org>
    • rcu: Move code to apply callback-numbering simplifications · 6eaef633
      Committed by Paul E. McKenney
      The addition of callback numbering allows combining the detection of the
      ends of old grace periods and the beginnings of new grace periods.  This
      commit moves code to set the stage for this combining.
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Reviewed-by: Josh Triplett <josh@joshtriplett.org>
    • rcu: Convert rcutree.c printk calls · d7f3e207
      Committed by Paul E. McKenney
      This commit converts printk() calls to the corresponding pr_*() calls.
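
      For instance (the message text is illustrative), a conversion of this
      kind replaces:

	printk(KERN_INFO "rcu: grace-period kthread starting\n");

      with the corresponding:

	pr_info("rcu: grace-period kthread starting\n");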
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Reviewed-by: Josh Triplett <josh@joshtriplett.org>
    • rcu: Fix deadlock with CPU hotplug, RCU GP init, and timer migration · 971394f3
      Committed by Paul E. McKenney
      In Steven Rostedt's words:
      
      > I've been debugging the last couple of days why my tests have been
      > locking up. One of my tracing tests, runs all available tracers. The
      > lockup always happened with the mmiotrace, which is used to trace
      > interactions between priority drivers and the kernel. But to do this
      > easily, when the tracer gets registered, it disables all but the boot
      > CPUs. The lockup always happened after it got done disabling the CPUs.
      >
      > Then I decided to try this:
      >
      > while :; do
      > 	for i in 1 2 3; do
      > 		echo 0 > /sys/devices/system/cpu/cpu$i/online
      > 	done
      > 	for i in 1 2 3; do
      > 		echo 1 > /sys/devices/system/cpu/cpu$i/online
      > 	done
      > done
      >
      > Well, sure enough, that locked up too, with the same users. Doing a
      > sysrq-w (showing all blocked tasks):
      >
      > [ 2991.344562]   task                        PC stack   pid father
      > [ 2991.344562] rcu_preempt     D ffff88007986fdf8     0    10      2 0x00000000
      > [ 2991.344562]  ffff88007986fc98 0000000000000002 ffff88007986fc48 0000000000000908
      > [ 2991.344562]  ffff88007986c280 ffff88007986ffd8 ffff88007986ffd8 00000000001d3c80
      > [ 2991.344562]  ffff880079248a40 ffff88007986c280 0000000000000000 00000000fffd4295
      > [ 2991.344562] Call Trace:
      > [ 2991.344562]  [<ffffffff815437ba>] schedule+0x64/0x66
      > [ 2991.344562]  [<ffffffff81541750>] schedule_timeout+0xbc/0xf9
      > [ 2991.344562]  [<ffffffff8154bec0>] ? ftrace_call+0x5/0x2f
      > [ 2991.344562]  [<ffffffff81049513>] ? cascade+0xa8/0xa8
      > [ 2991.344562]  [<ffffffff815417ab>] schedule_timeout_uninterruptible+0x1e/0x20
      > [ 2991.344562]  [<ffffffff810c980c>] rcu_gp_kthread+0x502/0x94b
      > [ 2991.344562]  [<ffffffff81062791>] ? __init_waitqueue_head+0x50/0x50
      > [ 2991.344562]  [<ffffffff810c930a>] ? rcu_gp_fqs+0x64/0x64
      > [ 2991.344562]  [<ffffffff81061cdb>] kthread+0xb1/0xb9
      > [ 2991.344562]  [<ffffffff81091e31>] ? lock_release_holdtime.part.23+0x4e/0x55
      > [ 2991.344562]  [<ffffffff81061c2a>] ? __init_kthread_worker+0x58/0x58
      > [ 2991.344562]  [<ffffffff8154c1dc>] ret_from_fork+0x7c/0xb0
      > [ 2991.344562]  [<ffffffff81061c2a>] ? __init_kthread_worker+0x58/0x58
      > [ 2991.344562] kworker/0:1     D ffffffff81a30680     0    47      2 0x00000000
      > [ 2991.344562] Workqueue: events cpuset_hotplug_workfn
      > [ 2991.344562]  ffff880078dbbb58 0000000000000002 0000000000000006 00000000000000d8
      > [ 2991.344562]  ffff880078db8100 ffff880078dbbfd8 ffff880078dbbfd8 00000000001d3c80
      > [ 2991.344562]  ffff8800779ca5c0 ffff880078db8100 ffffffff81541fcf 0000000000000000
      > [ 2991.344562] Call Trace:
      > [ 2991.344562]  [<ffffffff81541fcf>] ? __mutex_lock_common+0x3d4/0x609
      > [ 2991.344562]  [<ffffffff815437ba>] schedule+0x64/0x66
      > [ 2991.344562]  [<ffffffff81543a39>] schedule_preempt_disabled+0x18/0x24
      > [ 2991.344562]  [<ffffffff81541fcf>] __mutex_lock_common+0x3d4/0x609
      > [ 2991.344562]  [<ffffffff8103d11b>] ? get_online_cpus+0x3c/0x50
      > [ 2991.344562]  [<ffffffff8103d11b>] ? get_online_cpus+0x3c/0x50
      > [ 2991.344562]  [<ffffffff815422ff>] mutex_lock_nested+0x3b/0x40
      > [ 2991.344562]  [<ffffffff8103d11b>] get_online_cpus+0x3c/0x50
      > [ 2991.344562]  [<ffffffff810af7e6>] rebuild_sched_domains_locked+0x6e/0x3a8
      > [ 2991.344562]  [<ffffffff810b0ec6>] rebuild_sched_domains+0x1c/0x2a
      > [ 2991.344562]  [<ffffffff810b109b>] cpuset_hotplug_workfn+0x1c7/0x1d3
      > [ 2991.344562]  [<ffffffff810b0ed9>] ? cpuset_hotplug_workfn+0x5/0x1d3
      > [ 2991.344562]  [<ffffffff81058e07>] process_one_work+0x2d4/0x4d1
      > [ 2991.344562]  [<ffffffff81058d3a>] ? process_one_work+0x207/0x4d1
      > [ 2991.344562]  [<ffffffff8105964c>] worker_thread+0x2e7/0x3b5
      > [ 2991.344562]  [<ffffffff81059365>] ? rescuer_thread+0x332/0x332
      > [ 2991.344562]  [<ffffffff81061cdb>] kthread+0xb1/0xb9
      > [ 2991.344562]  [<ffffffff81061c2a>] ? __init_kthread_worker+0x58/0x58
      > [ 2991.344562]  [<ffffffff8154c1dc>] ret_from_fork+0x7c/0xb0
      > [ 2991.344562]  [<ffffffff81061c2a>] ? __init_kthread_worker+0x58/0x58
      > [ 2991.344562] bash            D ffffffff81a4aa80     0  2618   2612 0x10000000
      > [ 2991.344562]  ffff8800379abb58 0000000000000002 0000000000000006 0000000000000c2c
      > [ 2991.344562]  ffff880077fea140 ffff8800379abfd8 ffff8800379abfd8 00000000001d3c80
      > [ 2991.344562]  ffff8800779ca5c0 ffff880077fea140 ffffffff81541fcf 0000000000000000
      > [ 2991.344562] Call Trace:
      > [ 2991.344562]  [<ffffffff81541fcf>] ? __mutex_lock_common+0x3d4/0x609
      > [ 2991.344562]  [<ffffffff815437ba>] schedule+0x64/0x66
      > [ 2991.344562]  [<ffffffff81543a39>] schedule_preempt_disabled+0x18/0x24
      > [ 2991.344562]  [<ffffffff81541fcf>] __mutex_lock_common+0x3d4/0x609
      > [ 2991.344562]  [<ffffffff81530078>] ? rcu_cpu_notify+0x2f5/0x86e
      > [ 2991.344562]  [<ffffffff81530078>] ? rcu_cpu_notify+0x2f5/0x86e
      > [ 2991.344562]  [<ffffffff815422ff>] mutex_lock_nested+0x3b/0x40
      > [ 2991.344562]  [<ffffffff81530078>] rcu_cpu_notify+0x2f5/0x86e
      > [ 2991.344562]  [<ffffffff81091c99>] ? __lock_is_held+0x32/0x53
      > [ 2991.344562]  [<ffffffff81548912>] notifier_call_chain+0x6b/0x98
      > [ 2991.344562]  [<ffffffff810671fd>] __raw_notifier_call_chain+0xe/0x10
      > [ 2991.344562]  [<ffffffff8103cf64>] __cpu_notify+0x20/0x32
      > [ 2991.344562]  [<ffffffff8103cf8d>] cpu_notify_nofail+0x17/0x36
      > [ 2991.344562]  [<ffffffff815225de>] _cpu_down+0x154/0x259
      > [ 2991.344562]  [<ffffffff81522710>] cpu_down+0x2d/0x3a
      > [ 2991.344562]  [<ffffffff81526351>] store_online+0x4e/0xe7
      > [ 2991.344562]  [<ffffffff8134d764>] dev_attr_store+0x20/0x22
      > [ 2991.344562]  [<ffffffff811b3c5f>] sysfs_write_file+0x108/0x144
      > [ 2991.344562]  [<ffffffff8114c5ef>] vfs_write+0xfd/0x158
      > [ 2991.344562]  [<ffffffff8114c928>] SyS_write+0x5c/0x83
      > [ 2991.344562]  [<ffffffff8154c494>] tracesys+0xdd/0xe2
      >
      > As well as held locks:
      >
      > [ 3034.728033] Showing all locks held in the system:
      > [ 3034.728033] 1 lock held by rcu_preempt/10:
      > [ 3034.728033]  #0:  (rcu_preempt_state.onoff_mutex){+.+...}, at: [<ffffffff810c9471>] rcu_gp_kthread+0x167/0x94b
      > [ 3034.728033] 4 locks held by kworker/0:1/47:
      > [ 3034.728033]  #0:  (events){.+.+.+}, at: [<ffffffff81058d3a>] process_one_work+0x207/0x4d1
      > [ 3034.728033]  #1:  (cpuset_hotplug_work){+.+.+.}, at: [<ffffffff81058d3a>] process_one_work+0x207/0x4d1
      > [ 3034.728033]  #2:  (cpuset_mutex){+.+.+.}, at: [<ffffffff810b0ec1>] rebuild_sched_domains+0x17/0x2a
      > [ 3034.728033]  #3:  (cpu_hotplug.lock){+.+.+.}, at: [<ffffffff8103d11b>] get_online_cpus+0x3c/0x50
      > [ 3034.728033] 1 lock held by mingetty/2563:
      > [ 3034.728033]  #0:  (&ldata->atomic_read_lock){+.+...}, at: [<ffffffff8131e28a>] n_tty_read+0x252/0x7e8
      > [ 3034.728033] 1 lock held by mingetty/2565:
      > [ 3034.728033]  #0:  (&ldata->atomic_read_lock){+.+...}, at: [<ffffffff8131e28a>] n_tty_read+0x252/0x7e8
      > [ 3034.728033] 1 lock held by mingetty/2569:
      > [ 3034.728033]  #0:  (&ldata->atomic_read_lock){+.+...}, at: [<ffffffff8131e28a>] n_tty_read+0x252/0x7e8
      > [ 3034.728033] 1 lock held by mingetty/2572:
      > [ 3034.728033]  #0:  (&ldata->atomic_read_lock){+.+...}, at: [<ffffffff8131e28a>] n_tty_read+0x252/0x7e8
      > [ 3034.728033] 1 lock held by mingetty/2575:
      > [ 3034.728033]  #0:  (&ldata->atomic_read_lock){+.+...}, at: [<ffffffff8131e28a>] n_tty_read+0x252/0x7e8
      > [ 3034.728033] 7 locks held by bash/2618:
      > [ 3034.728033]  #0:  (sb_writers#5){.+.+.+}, at: [<ffffffff8114bc3f>] file_start_write+0x2a/0x2c
      > [ 3034.728033]  #1:  (&buffer->mutex#2){+.+.+.}, at: [<ffffffff811b3b93>] sysfs_write_file+0x3c/0x144
      > [ 3034.728033]  #2:  (s_active#54){.+.+.+}, at: [<ffffffff811b3c3e>] sysfs_write_file+0xe7/0x144
      > [ 3034.728033]  #3:  (x86_cpu_hotplug_driver_mutex){+.+.+.}, at: [<ffffffff810217c2>] cpu_hotplug_driver_lock+0x17/0x19
      > [ 3034.728033]  #4:  (cpu_add_remove_lock){+.+.+.}, at: [<ffffffff8103d196>] cpu_maps_update_begin+0x17/0x19
      > [ 3034.728033]  #5:  (cpu_hotplug.lock){+.+.+.}, at: [<ffffffff8103cfd8>] cpu_hotplug_begin+0x2c/0x6d
      > [ 3034.728033]  #6:  (rcu_preempt_state.onoff_mutex){+.+...}, at: [<ffffffff81530078>] rcu_cpu_notify+0x2f5/0x86e
      > [ 3034.728033] 1 lock held by bash/2980:
      > [ 3034.728033]  #0:  (&ldata->atomic_read_lock){+.+...}, at: [<ffffffff8131e28a>] n_tty_read+0x252/0x7e8
      >
      > Things looked a little weird. Also, this is a deadlock that lockdep did
      > not catch. But what we have here does not look like a circular lock
      > issue:
      >
      > Bash is blocked in rcu_cpu_notify():
      >
      > 1961		/* Exclude any attempts to start a new grace period. */
      > 1962		mutex_lock(&rsp->onoff_mutex);
      >
      >
      > kworker is blocked in get_online_cpus(), which makes sense as we are
      > currently taking down a CPU.
      >
      > But rcu_preempt is not blocked on anything. It is simply sleeping in
      > rcu_gp_kthread (really rcu_gp_init) here:
      >
      > 1453	#ifdef CONFIG_PROVE_RCU_DELAY
      > 1454			if ((prandom_u32() % (rcu_num_nodes * 8)) == 0 &&
      > 1455			    system_state == SYSTEM_RUNNING)
      > 1456				schedule_timeout_uninterruptible(2);
      > 1457	#endif /* #ifdef CONFIG_PROVE_RCU_DELAY */
      >
      > And it does this while holding the onoff_mutex that bash is waiting for.
      >
      > Doing a function trace, it showed me where it happened:
      >
      > [  125.940066] rcu_pree-10      3.... 28384115273: schedule_timeout_uninterruptible <-rcu_gp_kthread
      > [...]
      > [  125.940066] rcu_pree-10      3d..3 28384202439: sched_switch: prev_comm=rcu_preempt prev_pid=10 prev_prio=120 prev_state=D ==> next_comm=watchdog/3 next_pid=38 next_prio=120
      >
      > The watchdog ran, and then:
      >
      > [  125.940066] watchdog-38      3d..3 28384692863: sched_switch: prev_comm=watchdog/3 prev_pid=38 prev_prio=120 prev_state=P ==> next_comm=modprobe next_pid=2848 next_prio=118
      >
      > Not sure what modprobe was doing, but shortly after that:
      >
      > [  125.940066] modprobe-2848    3d..3 28385041749: sched_switch: prev_comm=modprobe prev_pid=2848 prev_prio=118 prev_state=R+ ==> next_comm=migration/3 next_pid=40 next_prio=0
      >
      > Where the migration thread took down the CPU:
      >
      > [  125.940066] migratio-40      3d..3 28389148276: sched_switch: prev_comm=migration/3 prev_pid=40 prev_prio=0 prev_state=P ==> next_comm=swapper/3 next_pid=0 next_prio=120
      >
      > which finally did:
      >
      > [  125.940066]   <idle>-0       3...1 28389282142: arch_cpu_idle_dead <-cpu_startup_entry
      > [  125.940066]   <idle>-0       3...1 28389282548: native_play_dead <-arch_cpu_idle_dead
      > [  125.940066]   <idle>-0       3...1 28389282924: play_dead_common <-native_play_dead
      > [  125.940066]   <idle>-0       3...1 28389283468: idle_task_exit <-play_dead_common
      > [  125.940066]   <idle>-0       3...1 28389284644: amd_e400_remove_cpu <-play_dead_common
      >
      >
       > CPU 3 is now offline, but the rcu_preempt thread that ran on CPU 3 is still
       > doing a schedule_timeout_uninterruptible() and it registered its
       > timeout to the timer base for CPU 3. You would think that it would get
       > migrated, right? The issue here is that the timer migration happens at
       > the CPU notifier for CPU_DEAD. The problem is that the rcu notifier for
       > CPU_DOWN is blocked waiting for the onoff_mutex to be released, which is
       > held by the thread that just put itself into an uninterruptible sleep,
       > which won't wake up until the CPU_DEAD notifier of the timer
       > infrastructure is called, which won't happen until the rcu notifier
       > finishes. Here's our deadlock!
      
      This commit breaks this deadlock cycle by substituting a shorter udelay()
      for the previous schedule_timeout_uninterruptible(), while at the same
      time increasing the probability of the delay.  This maintains the intensity
      of the testing.
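
      A sketch of the change to the CONFIG_PROVE_RCU_DELAY block quoted above
      (the exact probability divisor and delay length are illustrative):

	#ifdef CONFIG_PROVE_RCU_DELAY
		if ((prandom_u32() % (rcu_num_nodes * 4)) == 0 &&
		    system_state == SYSTEM_RUNNING)
			udelay(200);	/* busy-wait briefly instead of sleeping */
	#endif /* #ifdef CONFIG_PROVE_RCU_DELAY */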
      Reported-by: Steven Rostedt <rostedt@goodmis.org>
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Tested-by: Steven Rostedt <rostedt@goodmis.org>
    • rcu: Don't call wakeup() with rcu_node structure ->lock held · 016a8d5b
      Committed by Steven Rostedt
      This commit fixes a lockdep-detected deadlock by moving a wake_up()
      call out of an rnp->lock critical section.  Please see below for
      the long version of this story.
      
      On Tue, 2013-05-28 at 16:13 -0400, Dave Jones wrote:
      
      > [12572.705832] ======================================================
      > [12572.750317] [ INFO: possible circular locking dependency detected ]
      > [12572.796978] 3.10.0-rc3+ #39 Not tainted
      > [12572.833381] -------------------------------------------------------
      > [12572.862233] trinity-child17/31341 is trying to acquire lock:
      > [12572.870390]  (rcu_node_0){..-.-.}, at: [<ffffffff811054ff>] rcu_read_unlock_special+0x9f/0x4c0
      > [12572.878859]
      > but task is already holding lock:
      > [12572.894894]  (&ctx->lock){-.-...}, at: [<ffffffff811390ed>] perf_lock_task_context+0x7d/0x2d0
      > [12572.903381]
      > which lock already depends on the new lock.
      >
      > [12572.927541]
      > the existing dependency chain (in reverse order) is:
      > [12572.943736]
      > -> #4 (&ctx->lock){-.-...}:
      > [12572.960032]        [<ffffffff810b9851>] lock_acquire+0x91/0x1f0
      > [12572.968337]        [<ffffffff816ebc90>] _raw_spin_lock+0x40/0x80
      > [12572.976633]        [<ffffffff8113c987>] __perf_event_task_sched_out+0x2e7/0x5e0
      > [12572.984969]        [<ffffffff81088953>] perf_event_task_sched_out+0x93/0xa0
      > [12572.993326]        [<ffffffff816ea0bf>] __schedule+0x2cf/0x9c0
      > [12573.001652]        [<ffffffff816eacfe>] schedule_user+0x2e/0x70
      > [12573.009998]        [<ffffffff816ecd64>] retint_careful+0x12/0x2e
      > [12573.018321]
      > -> #3 (&rq->lock){-.-.-.}:
      > [12573.034628]        [<ffffffff810b9851>] lock_acquire+0x91/0x1f0
      > [12573.042930]        [<ffffffff816ebc90>] _raw_spin_lock+0x40/0x80
      > [12573.051248]        [<ffffffff8108e6a7>] wake_up_new_task+0xb7/0x260
      > [12573.059579]        [<ffffffff810492f5>] do_fork+0x105/0x470
      > [12573.067880]        [<ffffffff81049686>] kernel_thread+0x26/0x30
      > [12573.076202]        [<ffffffff816cee63>] rest_init+0x23/0x140
      > [12573.084508]        [<ffffffff81ed8e1f>] start_kernel+0x3f1/0x3fe
      > [12573.092852]        [<ffffffff81ed856f>] x86_64_start_reservations+0x2a/0x2c
      > [12573.101233]        [<ffffffff81ed863d>] x86_64_start_kernel+0xcc/0xcf
      > [12573.109528]
      > -> #2 (&p->pi_lock){-.-.-.}:
      > [12573.125675]        [<ffffffff810b9851>] lock_acquire+0x91/0x1f0
      > [12573.133829]        [<ffffffff816ebe9b>] _raw_spin_lock_irqsave+0x4b/0x90
      > [12573.141964]        [<ffffffff8108e881>] try_to_wake_up+0x31/0x320
      > [12573.150065]        [<ffffffff8108ebe2>] default_wake_function+0x12/0x20
      > [12573.158151]        [<ffffffff8107bbf8>] autoremove_wake_function+0x18/0x40
      > [12573.166195]        [<ffffffff81085398>] __wake_up_common+0x58/0x90
      > [12573.174215]        [<ffffffff81086909>] __wake_up+0x39/0x50
      > [12573.182146]        [<ffffffff810fc3da>] rcu_start_gp_advanced.isra.11+0x4a/0x50
      > [12573.190119]        [<ffffffff810fdb09>] rcu_start_future_gp+0x1c9/0x1f0
      > [12573.198023]        [<ffffffff810fe2c4>] rcu_nocb_kthread+0x114/0x930
      > [12573.205860]        [<ffffffff8107a91d>] kthread+0xed/0x100
      > [12573.213656]        [<ffffffff816f4b1c>] ret_from_fork+0x7c/0xb0
      > [12573.221379]
      > -> #1 (&rsp->gp_wq){..-.-.}:
      > [12573.236329]        [<ffffffff810b9851>] lock_acquire+0x91/0x1f0
      > [12573.243783]        [<ffffffff816ebe9b>] _raw_spin_lock_irqsave+0x4b/0x90
      > [12573.251178]        [<ffffffff810868f3>] __wake_up+0x23/0x50
      > [12573.258505]        [<ffffffff810fc3da>] rcu_start_gp_advanced.isra.11+0x4a/0x50
      > [12573.265891]        [<ffffffff810fdb09>] rcu_start_future_gp+0x1c9/0x1f0
      > [12573.273248]        [<ffffffff810fe2c4>] rcu_nocb_kthread+0x114/0x930
      > [12573.280564]        [<ffffffff8107a91d>] kthread+0xed/0x100
      > [12573.287807]        [<ffffffff816f4b1c>] ret_from_fork+0x7c/0xb0
      
      Notice the above call chain.
      
      rcu_start_future_gp() is called with the rnp->lock held. It then calls
      rcu_start_gp_advanced(), which does a wakeup.

      You can't do wakeups while holding the rnp->lock, as that would mean
      that you could not do an rcu_read_unlock() while holding the rq lock, or
      any lock that was taken while holding the rq lock. The reason is
      explained below.
      
      > [12573.295067]
      > -> #0 (rcu_node_0){..-.-.}:
      > [12573.309293]        [<ffffffff810b8d36>] __lock_acquire+0x1786/0x1af0
      > [12573.316568]        [<ffffffff810b9851>] lock_acquire+0x91/0x1f0
      > [12573.323825]        [<ffffffff816ebc90>] _raw_spin_lock+0x40/0x80
      > [12573.331081]        [<ffffffff811054ff>] rcu_read_unlock_special+0x9f/0x4c0
      > [12573.338377]        [<ffffffff810760a6>] __rcu_read_unlock+0x96/0xa0
      > [12573.345648]        [<ffffffff811391b3>] perf_lock_task_context+0x143/0x2d0
      > [12573.352942]        [<ffffffff8113938e>] find_get_context+0x4e/0x1f0
      > [12573.360211]        [<ffffffff811403f4>] SYSC_perf_event_open+0x514/0xbd0
      > [12573.367514]        [<ffffffff81140e49>] SyS_perf_event_open+0x9/0x10
      > [12573.374816]        [<ffffffff816f4dd4>] tracesys+0xdd/0xe2
      
      Notice the above trace.
      
      perf took its own ctx->lock, which can be taken while holding the rq
      lock. While holding this lock, it did an rcu_read_unlock(). The
      perf_lock_task_context() path basically looks like:
      
      rcu_read_lock();
      raw_spin_lock(ctx->lock);
      rcu_read_unlock();
      
      What appears to have happened is that we scheduled after taking that
      first rcu_read_lock() but before taking the spinlock. When we were
      scheduled back in and took the ctx->lock, the subsequent
      rcu_read_unlock() triggered the "special" code.
      
      rcu_read_unlock_special() takes the rnp->lock, which gives us a
      possible deadlock scenario:
      
      	CPU0		CPU1		CPU2
      	----		----		----
      
      				     rcu_nocb_kthread()
          lock(rq->lock);
      		    lock(ctx->lock);
      				     lock(rnp->lock);
      
      				     wake_up();
      
      				     lock(rq->lock);
      
      		    rcu_read_unlock();
      
      		    rcu_read_unlock_special();
      
      		    lock(rnp->lock);
          lock(ctx->lock);
      
      **** DEADLOCK ****
      
      > [12573.382068]
      > other info that might help us debug this:
      >
      > [12573.403229] Chain exists of:
      >   rcu_node_0 --> &rq->lock --> &ctx->lock
      >
      > [12573.424471]  Possible unsafe locking scenario:
      >
      > [12573.438499]        CPU0                    CPU1
      > [12573.445599]        ----                    ----
      > [12573.452691]   lock(&ctx->lock);
      > [12573.459799]                                lock(&rq->lock);
      > [12573.467010]                                lock(&ctx->lock);
      > [12573.474192]   lock(rcu_node_0);
      > [12573.481262]
      >  *** DEADLOCK ***
      >
      > [12573.501931] 1 lock held by trinity-child17/31341:
      > [12573.508990]  #0:  (&ctx->lock){-.-...}, at: [<ffffffff811390ed>] perf_lock_task_context+0x7d/0x2d0
      > [12573.516475]
      > stack backtrace:
      > [12573.530395] CPU: 1 PID: 31341 Comm: trinity-child17 Not tainted 3.10.0-rc3+ #39
      > [12573.545357]  ffffffff825b4f90 ffff880219f1dbc0 ffffffff816e375b ffff880219f1dc00
      > [12573.552868]  ffffffff816dfa5d ffff880219f1dc50 ffff88023ce4d1f8 ffff88023ce4ca40
      > [12573.560353]  0000000000000001 0000000000000001 ffff88023ce4d1f8 ffff880219f1dcc0
      > [12573.567856] Call Trace:
      > [12573.575011]  [<ffffffff816e375b>] dump_stack+0x19/0x1b
      > [12573.582284]  [<ffffffff816dfa5d>] print_circular_bug+0x200/0x20f
      > [12573.589637]  [<ffffffff810b8d36>] __lock_acquire+0x1786/0x1af0
      > [12573.596982]  [<ffffffff810918f5>] ? sched_clock_cpu+0xb5/0x100
      > [12573.604344]  [<ffffffff810b9851>] lock_acquire+0x91/0x1f0
      > [12573.611652]  [<ffffffff811054ff>] ? rcu_read_unlock_special+0x9f/0x4c0
      > [12573.619030]  [<ffffffff816ebc90>] _raw_spin_lock+0x40/0x80
      > [12573.626331]  [<ffffffff811054ff>] ? rcu_read_unlock_special+0x9f/0x4c0
      > [12573.633671]  [<ffffffff811054ff>] rcu_read_unlock_special+0x9f/0x4c0
      > [12573.640992]  [<ffffffff811390ed>] ? perf_lock_task_context+0x7d/0x2d0
      > [12573.648330]  [<ffffffff810b429e>] ? put_lock_stats.isra.29+0xe/0x40
      > [12573.655662]  [<ffffffff813095a0>] ? delay_tsc+0x90/0xe0
      > [12573.662964]  [<ffffffff810760a6>] __rcu_read_unlock+0x96/0xa0
      > [12573.670276]  [<ffffffff811391b3>] perf_lock_task_context+0x143/0x2d0
      > [12573.677622]  [<ffffffff81139070>] ? __perf_event_enable+0x370/0x370
      > [12573.684981]  [<ffffffff8113938e>] find_get_context+0x4e/0x1f0
      > [12573.692358]  [<ffffffff811403f4>] SYSC_perf_event_open+0x514/0xbd0
      > [12573.699753]  [<ffffffff8108cd9d>] ? get_parent_ip+0xd/0x50
      > [12573.707135]  [<ffffffff810b71fd>] ? trace_hardirqs_on_caller+0xfd/0x1c0
      > [12573.714599]  [<ffffffff81140e49>] SyS_perf_event_open+0x9/0x10
      > [12573.721996]  [<ffffffff816f4dd4>] tracesys+0xdd/0xe2
      
      This commit therefore defers the wakeup via irq_work, the mechanism
      that perf and ftrace use to perform wakeups from within critical sections.
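
      A minimal sketch of the approach, assuming an irq_work item embedded
      in the rcu_state structure (field and helper names are illustrative):

	static void rsp_wakeup(struct irq_work *work)
	{
		struct rcu_state *rsp = container_of(work, struct rcu_state, wakeup_work);

		/* Runs from the irq_work handler, outside the rnp->lock critical section. */
		wake_up(&rsp->gp_wq);
	}

	/* In rcu_start_gp_advanced(), called with rnp->lock held: */
	irq_work_queue(&rsp->wakeup_work);	/* instead of wake_up(&rsp->gp_wq) */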
      Reported-by: Dave Jones <davej@redhat.com>
      Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
  2. 30 April 2013: 1 commit
  3. 19 April 2013: 1 commit
    • nohz: Ensure full dynticks CPUs are RCU nocbs · d1e43fa5
      Committed by Frederic Weisbecker
      We need full dynticks CPUs to also be RCU no-CBs CPUs so
      that we don't have to keep the tick running to handle RCU
      callbacks.

      Make sure the range passed to the nohz_full= boot
      parameter is a subset of rcu_nocbs=.

      CPUs that fail to meet this requirement will be
      excluded from the nohz_full range. This is checked
      early at boot time, before any CPU has had the opportunity
      to stop its tick.
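
      A sketch of such an early boot-time check, assuming a tick_nohz_full_mask
      cpumask and an rcu_is_nocb_cpu() predicate (both names are illustrative):

	void __init tick_nohz_check_rcu_nocbs(void)
	{
		int cpu;

		/* Drop any nohz_full CPU that is not also an RCU no-CBs CPU. */
		for_each_cpu(cpu, tick_nohz_full_mask)
			if (!rcu_is_nocb_cpu(cpu))
				cpumask_clear_cpu(cpu, tick_nohz_full_mask);
	}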
      Suggested-by: Steven Rostedt <rostedt@goodmis.org>
      Reviewed-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Chris Metcalf <cmetcalf@tilera.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Geoff Levand <geoff@infradead.org>
      Cc: Gilad Ben Yossef <gilad@benyossef.com>
      Cc: Hakan Akkan <hakanakkan@gmail.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Kevin Hilman <khilman@linaro.org>
      Cc: Li Zhong <zhong@linux.vnet.ibm.com>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
  4. 16 April 2013: 1 commit
    • rcu: Kick adaptive-ticks CPUs that are holding up RCU grace periods · 65d798f0
      Committed by Paul E. McKenney
      Adaptive-ticks CPUs inform RCU when they enter kernel mode, but they do
      not necessarily turn the scheduler-clock tick back on.  This state of
      affairs could result in RCU waiting on an adaptive-ticks CPU running
      for an extended period in kernel mode.  Such a CPU will never run the
      RCU state machine, and could therefore extend the current grace period
      indefinitely, sooner or later resulting in an OOM condition.

      This patch, inspired by an earlier patch by Frederic Weisbecker, therefore
      causes RCU's force-quiescent-state processing to check for this condition
      and to send an IPI to CPUs that remain in that state for too long.
      "Too long" currently means about three jiffies by default, which is
      quite some time for a CPU to remain in the kernel without blocking.
      The rcutree.jiffies_till_first_fqs and rcutree.jiffies_till_next_fqs
      sysfs variables may be used to tune "too long" if needed.
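
      A sketch of the force-quiescent-state side of this (the helper name is
      illustrative; the commit text only specifies sending an IPI after roughly
      three jiffies):

	static void rcu_kick_nohz_cpu(int cpu)
	{
	#ifdef CONFIG_NO_HZ_FULL
		/* Prod an adaptive-ticks CPU that has stayed too long in the kernel. */
		if (tick_nohz_full_cpu(cpu))
			smp_send_reschedule(cpu);
	#endif
	}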
      Reported-by: Frederic Weisbecker <fweisbec@gmail.com>
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Reviewed-by: Josh Triplett <josh@joshtriplett.org>
      Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Chris Metcalf <cmetcalf@tilera.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Geoff Levand <geoff@infradead.org>
      Cc: Gilad Ben Yossef <gilad@benyossef.com>
      Cc: Hakan Akkan <hakanakkan@gmail.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Kevin Hilman <khilman@linaro.org>
      Cc: Li Zhong <zhong@linux.vnet.ibm.com>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
  5. 26 March 2013: 8 commits
  6. 14 March 2013: 1 commit
  7. 13 March 2013: 4 commits
  8. 29 January 2013: 1 commit
  9. 27 January 2013: 2 commits
  10. 09 January 2013: 5 commits
  11. 01 December 2012: 1 commit
    • context_tracking: New context tracking subsystem · 91d1aa43
      Committed by Frederic Weisbecker
      Create a new subsystem that probes on kernel boundaries
      to keep track of transitions between context levels, with
      two basic initial contexts: user and kernel.

      This is an abstraction of some RCU code that uses such tracking
      to implement its userspace extended quiescent state.

      We need to pull this up from RCU into this new level of indirection
      because this tracking is also going to be used to implement "on
      demand" generic virtual cputime accounting, a necessary step toward
      shutting down the tick while still accounting the cputime.
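
      A minimal sketch of per-CPU context tracking along these lines (structure
      layout, state names, and the user_enter() entry point are illustrative):

	struct context_tracking {
		bool active;
		enum { IN_KERNEL = 0, IN_USER } state;
	};

	static DEFINE_PER_CPU(struct context_tracking, context_tracking);

	void user_enter(void)	/* probed on the kernel -> user boundary */
	{
		unsigned long flags;

		local_irq_save(flags);
		if (__this_cpu_read(context_tracking.active) &&
		    __this_cpu_read(context_tracking.state) != IN_USER) {
			__this_cpu_write(context_tracking.state, IN_USER);
			rcu_user_enter();	/* enter RCU extended quiescent state */
		}
		local_irq_restore(flags);
	}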
      Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Li Zhong <zhong@linux.vnet.ibm.com>
      Cc: Gilad Ben-Yossef <gilad@benyossef.com>
      Reviewed-by: Steven Rostedt <rostedt@goodmis.org>
      [ paulmck: fix whitespace error and email address. ]
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
  12. 17 November 2012: 1 commit
    • rcu: Add callback-free CPUs · 3fbfbf7a
      Committed by Paul E. McKenney
      RCU callback execution can add significant OS jitter and also can
      degrade both scheduling latency and, in asymmetric multiprocessors,
      energy efficiency.  This commit therefore adds the ability for selected
      CPUs ("rcu_nocbs=" boot parameter) to have their callbacks offloaded
      to kthreads.  If the "rcu_nocb_poll" boot parameter is also specified,
      these kthreads will do polling, removing the need for the offloaded
      CPUs to do wakeups.  At least one CPU must be doing normal callback
      processing: currently CPU 0 cannot be selected as a no-CBs CPU.
      In addition, attempts to offline the last normal-CBs CPU will fail.
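
      For example (CPU numbers are illustrative), callbacks can be offloaded
      from CPUs 1-7, with the offload kthreads polling, by booting with:

	rcu_nocbs=1-7 rcu_nocb_poll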
      
      This feature was inspired by Jim Houston's and Joe Korty's JRCU, and
      this commit includes fixes to problems located by Fengguang Wu's
      kbuild test robot.
      
      [ paulmck: Added gfp.h include file as suggested by Fengguang Wu. ]
      Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
  13. 14 November 2012: 2 commits
  14. 09 November 2012: 4 commits
    • rcu: Fix tracing formatting · 42c3533e
      Committed by Paul E. McKenney
      The rcu_state structure's ->completed field is unsigned long, so this
      commit adjusts show_one_rcugp()'s printf() format to suit.  Also add
      the required ACCESS_ONCE() directives while we are in this function.
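
      A sketch of the kind of adjustment involved (the surrounding debugfs
      output code is illustrative):

	seq_printf(m, "completed=%lu\n",
		   (unsigned long)ACCESS_ONCE(rsp->completed));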
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
    • rcu: Instrument synchronize_rcu_expedited() for debugfs tracing · a30489c5
      Committed by Paul E. McKenney
      This commit adds the counters to rcu_state and updates them in
      synchronize_rcu_expedited() to provide the data needed for debugfs
      tracing.
      Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
    • rcu: Move synchronize_sched_expedited() state to rcu_state · 40694d66
      Committed by Paul E. McKenney
      Tracing (debugfs) of expedited RCU primitives is required, which in turn
      requires that the relevant data be located where the tracing code can find
      it, not in its current static global variables in kernel/rcutree.c.
      This commit therefore moves sync_sched_expedited_started and
      sync_sched_expedited_done to the rcu_state structure, as fields
      ->expedited_start and ->expedited_done, respectively.
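
      A sketch of the resulting rcu_state fields (the atomic type is an
      assumption; only the field names come from the commit text):

	struct rcu_state {
		/* ... */
		atomic_long_t expedited_start;	/* was sync_sched_expedited_started */
		atomic_long_t expedited_done;	/* was sync_sched_expedited_done */
	};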
      Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
    • rcu: Avoid counter wrap in synchronize_sched_expedited() · 1924bcb0
      Committed by Paul E. McKenney
      There is a counter scheme similar to ticket locking that
      synchronize_sched_expedited() uses to service multiple concurrent
      callers with the same expedited grace period.  Upon entry, a
      sync_sched_expedited_started variable is atomically incremented,
      and upon completion of an expedited grace period a separate
      sync_sched_expedited_done variable is atomically incremented.
      
      However, if a synchronize_sched_expedited() is delayed while
      in try_stop_cpus(), concurrent invocations will increment the
      sync_sched_expedited_started counter, which will eventually overflow.
      If the original synchronize_sched_expedited() resumes execution just
      as the counter overflows, a concurrent invocation could incorrectly
      conclude that an expedited grace period elapsed in zero time, which
      would be bad.  One could rely on counter size to prevent this from
      happening in practice, but the goal is to formally validate this
      code, so it needs to be fixed anyway.
      
      This commit therefore checks the gap between the two counters before
      incrementing sync_sched_expedited_started, and if the gap is too
      large, does a normal grace period instead.  Overflow is thus only
      possible if there are more than about 3.5 billion threads on 32-bit
      systems, which can be excluded until such time as task_struct fits
      into a single byte and 4G/4G patches are accepted into mainline.
      It is also easy to encode this limitation into mechanical theorem
      provers.
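
      A sketch of the guard described above (counter names follow the earlier
      commits in this series; the exact threshold is illustrative):

	if (ULONG_CMP_GE((ulong)atomic_long_read(&rsp->expedited_start),
			 (ulong)atomic_long_read(&rsp->expedited_done) +
			 ULONG_MAX / 8)) {
		synchronize_sched();	/* too close to wrap: use a normal GP */
		return;
	}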
      Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>