1. 12 Dec, 2011 21 commits
    • F
      x86: Call idle notifier after irq_enter() · 98ad1cc1
      Frederic Weisbecker committed
      Interrupts notify the idle exit state before calling irq_enter().
      But the notifier code calls rcu_read_lock() and this is not
      allowed while rcu is in an extended quiescent state. We need
      to wait for irq_enter() -> rcu_idle_exit() to be called before
      doing so otherwise this results in a grumpy RCU:
      
      [    0.099991] WARNING: at include/linux/rcupdate.h:194 __atomic_notifier_call_chain+0xd2/0x110()
      [    0.099991] Hardware name: AMD690VM-FMH
      [    0.099991] Modules linked in:
      [    0.099991] Pid: 0, comm: swapper Not tainted 3.0.0-rc6+ #255
      [    0.099991] Call Trace:
      [    0.099991]  <IRQ>  [<ffffffff81051c8a>] warn_slowpath_common+0x7a/0xb0
      [    0.099991]  [<ffffffff81051cd5>] warn_slowpath_null+0x15/0x20
      [    0.099991]  [<ffffffff817d6fa2>] __atomic_notifier_call_chain+0xd2/0x110
      [    0.099991]  [<ffffffff817d6ff1>] atomic_notifier_call_chain+0x11/0x20
      [    0.099991]  [<ffffffff81001873>] exit_idle+0x43/0x50
      [    0.099991]  [<ffffffff81020439>] smp_apic_timer_interrupt+0x39/0xa0
      [    0.099991]  [<ffffffff817da253>] apic_timer_interrupt+0x13/0x20
      [    0.099991]  <EOI>  [<ffffffff8100ae67>] ? default_idle+0xa7/0x350
      [    0.099991]  [<ffffffff8100ae65>] ? default_idle+0xa5/0x350
      [    0.099991]  [<ffffffff8100b19b>] amd_e400_idle+0x8b/0x110
      [    0.099991]  [<ffffffff810cb01f>] ? rcu_enter_nohz+0x8f/0x160
      [    0.099991]  [<ffffffff810019a0>] cpu_idle+0xb0/0x110
      [    0.099991]  [<ffffffff817a7505>] rest_init+0xe5/0x140
      [    0.099991]  [<ffffffff817a7468>] ? rest_init+0x48/0x140
      [    0.099991]  [<ffffffff81cc5ca3>] start_kernel+0x3d1/0x3dc
      [    0.099991]  [<ffffffff81cc5321>] x86_64_start_reservations+0x131/0x135
      [    0.099991]  [<ffffffff81cc5412>] x86_64_start_kernel+0xed/0xf4
      Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Andy Henroid <andrew.d.henroid@intel.com>
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Reviewed-by: Josh Triplett <josh@joshtriplett.org>
    • F
      x86: Enter rcu extended qs after idle notifier call · e37e112d
      Frederic Weisbecker committed
      The idle notifier, called by enter_idle(), enters into rcu read
      side critical section but at that time we already switched into
      the RCU-idle window (rcu_idle_enter() has been called). And it's
      illegal to use rcu_read_lock() in that state.
      
      This results in rcu reporting its bad mood:
      
      [    1.275635] WARNING: at include/linux/rcupdate.h:194 __atomic_notifier_call_chain+0xd2/0x110()
      [    1.275635] Hardware name: AMD690VM-FMH
      [    1.275635] Modules linked in:
      [    1.275635] Pid: 0, comm: swapper Not tainted 3.0.0-rc6+ #252
      [    1.275635] Call Trace:
      [    1.275635]  [<ffffffff81051c8a>] warn_slowpath_common+0x7a/0xb0
      [    1.275635]  [<ffffffff81051cd5>] warn_slowpath_null+0x15/0x20
      [    1.275635]  [<ffffffff817d6f22>] __atomic_notifier_call_chain+0xd2/0x110
      [    1.275635]  [<ffffffff817d6f71>] atomic_notifier_call_chain+0x11/0x20
      [    1.275635]  [<ffffffff810018a0>] enter_idle+0x20/0x30
      [    1.275635]  [<ffffffff81001995>] cpu_idle+0xa5/0x110
      [    1.275635]  [<ffffffff817a7465>] rest_init+0xe5/0x140
      [    1.275635]  [<ffffffff817a73c8>] ? rest_init+0x48/0x140
      [    1.275635]  [<ffffffff81cc5ca3>] start_kernel+0x3d1/0x3dc
      [    1.275635]  [<ffffffff81cc5321>] x86_64_start_reservations+0x131/0x135
      [    1.275635]  [<ffffffff81cc5412>] x86_64_start_kernel+0xed/0xf4
      [    1.275635] ---[ end trace a22d306b065d4a66 ]---
      
      Fix this by entering rcu extended quiescent state later, just before
      the CPU goes to sleep.
      Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Reviewed-by: Josh Triplett <josh@joshtriplett.org>
    • F
      nohz: Allow rcu extended quiescent state handling seperately from tick stop · 2bbb6817
      Frederic Weisbecker committed
      It is assumed that rcu won't be used once we switch to tickless
      mode and until we restart the tick. However this is not always
      true, as in x86-64 where we dereference the idle notifiers after
      the tick is stopped.
      
      To prepare for fixing this, add two new APIs:
      tick_nohz_idle_enter_norcu() and tick_nohz_idle_exit_norcu().
      
      If the arch makes no use of RCU in the idle loop between the
      tick_nohz_idle_enter() and tick_nohz_idle_exit() calls, it must
      instead call the new *_norcu() versions, so that it does not need
      to call rcu_idle_enter() and rcu_idle_exit() itself.
      
      Otherwise the arch must call tick_nohz_idle_enter() and
      tick_nohz_idle_exit() and also explicitly call:
      
      - rcu_idle_enter() after its last use of RCU before the CPU is put
      to sleep.
      - rcu_idle_exit() before the first use of RCU after the CPU is woken
      up.
      Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Mike Frysinger <vapier@gentoo.org>
      Cc: Guan Xuetao <gxt@mprc.pku.edu.cn>
      Cc: David Miller <davem@davemloft.net>
      Cc: Chris Metcalf <cmetcalf@tilera.com>
      Cc: Hans-Christian Egtvedt <hans-christian.egtvedt@atmel.com>
      Cc: Ralf Baechle <ralf@linux-mips.org>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Russell King <linux@arm.linux.org.uk>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Paul Mundt <lethal@linux-sh.org>
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
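The ordering rule from the three commits above (use RCU last before rcu_idle_enter(), and first only after rcu_idle_exit()) can be modeled in user space. This is a minimal sketch, not kernel code: every name ending in `_sim` is hypothetical, and the "RCU state" is just a flag.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model: true while "RCU" treats this CPU as idle. */
static bool in_extended_qs_sim;
/* Counts read-side uses made while RCU was not watching. */
static int illegal_uses_sim;

static void rcu_idle_enter_sim(void) { in_extended_qs_sim = true; }
static void rcu_idle_exit_sim(void)  { in_extended_qs_sim = false; }

/* Stand-in for any code that takes rcu_read_lock(), e.g. an idle notifier. */
static void idle_notifier_sim(void)
{
	if (in_extended_qs_sim)
		illegal_uses_sim++;
}

/*
 * Run one modeled idle-loop iteration and return how many illegal RCU
 * uses it produced. With notifier_inside_eqs == false the notifier runs
 * before the RCU-idle window; with true it runs inside the window.
 */
int run_idle_iteration_sim(bool notifier_inside_eqs)
{
	int before = illegal_uses_sim;

	if (!notifier_inside_eqs)
		idle_notifier_sim();   /* last use of RCU before sleeping */
	rcu_idle_enter_sim();
	if (notifier_inside_eqs)
		idle_notifier_sim();   /* misplaced: RCU is not watching */
	/* the CPU would sleep here */
	rcu_idle_exit_sim();
	assert(!in_extended_qs_sim);
	return illegal_uses_sim - before;
}
```

Calling `run_idle_iteration_sim(false)` models the fixed ordering and reports no illegal use; `run_idle_iteration_sim(true)` models the ordering that triggered the warnings quoted above.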
    • F
      nohz: Separate out irq exit and idle loop dyntick logic · 280f0677
      Frederic Weisbecker committed
      The tick_nohz_stop_sched_tick() function, which tries to delay
      the next timer tick as long as possible, can be called from two
      places:
      
      - From the idle loop to start the dyntick idle mode
      - From interrupt exit if we have interrupted the dyntick
      idle mode, so that we reprogram the next tick event in
      case the irq changed some internal state that requires this
      action.
      
      There are only a few minor differences between the two cases,
      which that function handles using the ts->inidle per-cpu variable
      and the inidle parameter. Together they guarantee that we only
      update the dyntick mode on irq exit if we actually interrupted the
      dyntick idle mode, and that we enter the RCU extended quiescent
      state from idle loop entry only.
      
      Split this function into:
      
      - tick_nohz_idle_enter(), which sets ts->inidle to 1, enters
      dynticks idle mode unconditionally if it can, and enters into RCU
      extended quiescent state.
      
      - tick_nohz_irq_exit() which only updates the dynticks idle mode
      when ts->inidle is set (ie: if tick_nohz_idle_enter() has been called).
      
      To maintain symmetry, tick_nohz_restart_sched_tick() has been renamed
      into tick_nohz_idle_exit().
      
      This simplifies the code and micro-optimizes the irq exit path (no need
      for local_irq_save there). This also prepares for the split between
      dynticks and rcu extended quiescent state logics. We'll need this split to
      further fix illegal uses of RCU in extended quiescent states in the idle
      loop.
      Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Mike Frysinger <vapier@gentoo.org>
      Cc: Guan Xuetao <gxt@mprc.pku.edu.cn>
      Cc: David Miller <davem@davemloft.net>
      Cc: Chris Metcalf <cmetcalf@tilera.com>
      Cc: Hans-Christian Egtvedt <hans-christian.egtvedt@atmel.com>
      Cc: Ralf Baechle <ralf@linux-mips.org>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Russell King <linux@arm.linux.org.uk>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Paul Mundt <lethal@linux-sh.org>
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Reviewed-by: Josh Triplett <josh@joshtriplett.org>
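The split described in the commit above can be pictured with a small user-space model. This is only a sketch under assumptions: the struct and the `_sim` functions are hypothetical stand-ins, and the "reprogrammed" counter merely marks where the real code would re-evaluate the next tick.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical per-CPU state, loosely modeled on the ts->inidle flag. */
struct tick_sched_sim {
	bool inidle;       /* set between idle-loop entry and exit */
	int  reprogrammed; /* times the next tick was re-evaluated */
};

/* Idle-loop entry: mark the CPU idle and try to stop the tick. */
void idle_enter_sim(struct tick_sched_sim *ts)
{
	assert(ts);
	ts->inidle = true;
	ts->reprogrammed++;
}

/* Interrupt exit: only touch the tick if the idle loop was interrupted. */
void irq_exit_sim(struct tick_sched_sim *ts)
{
	assert(ts);
	if (!ts->inidle)
		return;
	ts->reprogrammed++;
}

/* Idle-loop exit: leave idle mode (the real code would restart the tick). */
void idle_exit_sim(struct tick_sched_sim *ts)
{
	assert(ts);
	ts->inidle = false;
}
```

In this model an interrupt that arrives outside the idle loop leaves the counter untouched, which is the behaviour the commit message attributes to tick_nohz_irq_exit().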
    • P
      rcu: Make srcu_read_lock_held() call common lockdep-enabled function · 867f236b
      Paul E. McKenney committed
      A common debug_lockdep_rcu_enabled() function is used to check whether
      RCU lockdep splats should be reported, but srcu_read_lock() does not
      use it.  This commit therefore brings srcu_read_lock_held() up to date.
      Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Reviewed-by: Josh Triplett <josh@joshtriplett.org>
    • P
      rcu: Warn when srcu_read_lock() is used in an extended quiescent state · ff195cb6
      Paul E. McKenney committed
      Catch SRCU up to the other variants of RCU by making PROVE_RCU
      complain if either srcu_read_lock() or srcu_read_lock_held() is
      used from within RCU-idle mode.
      
      Frederic reworked this to allow for the new versions of his patches
      that check for extended quiescent states.
      Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
      Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Reviewed-by: Josh Triplett <josh@joshtriplett.org>
    • P
      rcu: Remove one layer of abstraction from PROVE_RCU checking · d8ab29f8
      Paul E. McKenney committed
      Simplify things a bit by substituting the definitions of the single-line
      rcu_read_acquire(), rcu_read_release(), rcu_read_acquire_bh(),
      rcu_read_release_bh(), rcu_read_acquire_sched(), and
      rcu_read_release_sched() functions at their call points.
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Reviewed-by: Josh Triplett <josh@joshtriplett.org>
    • F
      rcu: Warn when rcu_read_lock() is used in extended quiescent state · 00f49e57
      Frederic Weisbecker committed
      We are currently able to detect uses of rcu_dereference_check() inside
      extended quiescent states (such as the RCU-free window in idle).
      But rcu_read_lock() and friends can be used without rcu_dereference(),
      so that the earlier commit checking for use of rcu_dereference() and
      friends while in RCU idle mode misses some error conditions.  This commit
      therefore adds extended quiescent state checking to rcu_read_lock() and
      friends.
      
      Uses of RCU from within RCU-idle mode are totally ignored by
      RCU, hence the importance of these checks.
      Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Reviewed-by: Josh Triplett <josh@joshtriplett.org>
    • F
      rcu: Inform the user about extended quiescent state on PROVE_RCU warning · 0464e937
      Frederic Weisbecker committed
      Inform the user if an RCU usage error is detected by lockdep while in
      an extended quiescent state (in this case, the RCU-free window in idle).
      This is accomplished by adding a line to the RCU lockdep splat indicating
      whether or not the splat occurred in extended quiescent state.
      
      Uses of RCU from within extended quiescent state mode are totally ignored
      by RCU, hence the importance of this diagnostic.
      Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Reviewed-by: Josh Triplett <josh@joshtriplett.org>
    • F
      rcu: Detect illegal rcu dereference in extended quiescent state · e6b80a3b
      Frederic Weisbecker committed
      Report that none of the rcu read lock maps are held while in an RCU
      extended quiescent state (the section between rcu_idle_enter()
      and rcu_idle_exit()). This helps detect any use of rcu_dereference()
      and friends from within the section in idle where RCU is not allowed.
      
      This way we can guarantee an extended quiescent window where the CPU
      can be put in dyntick idle mode or can simply avoid being part of any
      global grace period completion while in the idle loop.
      
      Uses of RCU from such mode are totally ignored by RCU, hence the
      importance of these checks.
      Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Reviewed-by: Josh Triplett <josh@joshtriplett.org>
    • T
      rcu: Remove redundant return from rcu_report_exp_rnp() · a0f8eefb
      Thomas Gleixner committed
      Empty void functions do not need "return", so this commit removes it
      from rcu_report_exp_rnp().
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
    • T
      rcu: Omit self-awaken when setting up expedited grace period · b40d293e
      Thomas Gleixner committed
      When setting up an expedited grace period, if there were no readers, the
      task will awaken itself.  This commit removes this useless self-awakening.
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
    • P
      rcu: Disable preemption in rcu_is_cpu_idle() · 34240697
      Paul E. McKenney committed
      Because rcu_is_cpu_idle() is to be used to check for extended quiescent
      states in RCU-preempt read-side critical sections, it cannot assume that
      preemption is disabled.  And preemption must be disabled when accessing
      the dyntick-idle state, because otherwise the following sequence of events
      could occur:
      
      1.	Task A on CPU 1 enters rcu_is_cpu_idle() and picks up the pointer
      	to CPU 1's per-CPU variables.
      
      2.	Task B preempts Task A and starts running on CPU 1.
      
      3.	Task A migrates to CPU 2.
      
      4.	Task B blocks, leaving CPU 1 idle.
      
      5.	Task A continues execution on CPU 2, accessing CPU 1's dyntick-idle
      	information using the pointer fetched in step 1 above, and finds
      	that CPU 1 is idle.
      
      6.	Task A therefore incorrectly concludes that it is executing in
      	an extended quiescent state, possibly issuing a spurious splat.
      
      Therefore, this commit disables preemption within the rcu_is_cpu_idle()
      function.
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Reviewed-by: Josh Triplett <josh@joshtriplett.org>
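The six-step sequence in the commit above boils down to a stale per-CPU pointer. A minimal user-space sketch follows, assuming a plain array in place of real per-CPU variables; all `_sim` names are hypothetical and no actual preemption takes place.

```c
#include <assert.h>
#include <stdbool.h>

#define NR_CPUS_SIM 4 /* hypothetical CPU count for the model */

/* Hypothetical stand-in for per-CPU dyntick-idle state. */
static bool cpu_is_idle_sim[NR_CPUS_SIM];

/* Step 1: the task fetches a pointer to the state of the CPU it runs on. */
const bool *fetch_percpu_ptr_sim(int cpu)
{
	assert(cpu >= 0 && cpu < NR_CPUS_SIM);
	return &cpu_is_idle_sim[cpu];
}

/* Steps 2-4: the task is preempted and migrates; its old CPU goes idle. */
void old_cpu_goes_idle_sim(int old_cpu)
{
	assert(old_cpu >= 0 && old_cpu < NR_CPUS_SIM);
	cpu_is_idle_sim[old_cpu] = true;
}

/* Steps 5-6: the racy check reads through the pointer fetched earlier. */
bool racy_is_idle_sim(const bool *stale_ptr)
{
	return *stale_ptr;
}

/* With preemption disabled, the lookup and the read use the same CPU. */
bool pinned_is_idle_sim(int current_cpu)
{
	assert(current_cpu >= 0 && current_cpu < NR_CPUS_SIM);
	return cpu_is_idle_sim[current_cpu];
}
```

In the model, a task that fetched the pointer on CPU 1 and then continues on CPU 2 gets "idle" from the racy check, while the pinned check on CPU 2 reports "not idle".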
    • P
      rcu: Document failing tick as cause of RCU CPU stall warning · 2c01531f
      Paul E. McKenney committed
      One of lclaudio's systems was seeing RCU CPU stall warnings from idle.
      These turned out to be caused by a bug that stopped scheduling-clock
      tick interrupts from being sent to a given CPU for several hundred seconds.
      This commit therefore updates the documentation to call this out as a
      possible cause for RCU CPU stall warnings.
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Reviewed-by: Josh Triplett <josh@joshtriplett.org>
    • P
      rcu: Add failure tracing to rcutorture · 91afaf30
      Paul E. McKenney committed
      Trace the rcutorture RCU accesses and dump the trace buffer when the
      first failure is detected.
      Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Reviewed-by: Josh Triplett <josh@joshtriplett.org>
    • P
      trace: Allow ftrace_dump() to be called from modules · a8eecf22
      Paul E. McKenney committed
      Add an EXPORT_SYMBOL_GPL() so that rcutorture can dump the trace buffer
      upon detection of an RCU error.
      Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Reviewed-by: Josh Triplett <josh@joshtriplett.org>
    • P
      rcu: Track idleness independent of idle tasks · 9b2e4f18
      Paul E. McKenney committed
      Earlier versions of RCU used the scheduling-clock tick to detect idleness
      by checking for the idle task, but handled idleness differently for
      CONFIG_NO_HZ=y.  But there are now a number of uses of RCU read-side
      critical sections in the idle task, for example, for tracing.  A more
      fine-grained detection of idleness is therefore required.
      
      This commit presses the old dyntick-idle code into full-time service,
      so that rcu_idle_enter(), previously known as rcu_enter_nohz(), is
      always invoked at the beginning of an idle loop iteration.  Similarly,
      rcu_idle_exit(), previously known as rcu_exit_nohz(), is always invoked
      at the end of an idle-loop iteration.  This allows the idle task to
      use RCU everywhere except between consecutive rcu_idle_enter() and
      rcu_idle_exit() calls, in turn allowing architecture maintainers to
      specify exactly where in the idle loop that RCU may be used.
      
      Because some of the userspace upcall uses can result in what looks
      to RCU like half of an interrupt, it is not possible to expect that
      the irq_enter() and irq_exit() hooks will give exact counts.  This
      patch therefore expands the ->dynticks_nesting counter to 64 bits
      and uses two separate bitfields to count process/idle transitions
      and interrupt entry/exit transitions.  It is presumed that userspace
      upcalls do not happen in the idle loop or from usermode execution
      (though usermode might do a system call that results in an upcall).
      The counter is hard-reset on each process/idle transition, which
      avoids the interrupt entry/exit error from accumulating.  Overflow
      is avoided by the 64-bitness of the ->dynticks_nesting counter.
      
      This commit also adds warnings if a non-idle task asks RCU to enter
      idle state (and these checks will need some adjustment before applying
      Frederic's OS-jitter patches (http://lkml.org/lkml/2011/10/7/246)).
      In addition, validation of ->dynticks and ->dynticks_nesting is added.
      Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Reviewed-by: Josh Triplett <josh@joshtriplett.org>
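The counter scheme described in the commit above (one 64-bit value, one field for process/idle transitions, one for interrupt nesting, hard reset on each process/idle transition) might look roughly like the sketch below. The field widths, the constant and every `_sim` name are assumptions made for illustration and are not taken from the kernel source.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed split: low 32 bits count irq nesting, bit 32 marks "task running". */
#define IRQ_BITS_SIM  32
#define TASK_UNIT_SIM ((int64_t)1 << IRQ_BITS_SIM)

static int64_t nesting_sim;

/* Idle -> process transition: hard reset, discarding any irq miscount. */
void task_runs_sim(void)  { nesting_sim = TASK_UNIT_SIM; }

/* Process -> idle transition: hard reset to "RCU is not watching". */
void task_idles_sim(void) { nesting_sim = 0; }

/* Interrupt entry and exit adjust only the low field. */
void nest_irq_enter_sim(void) { nesting_sim++; }
void nest_irq_exit_sim(void)
{
	assert(nesting_sim > 0);
	nesting_sim--;
}

/* RCU pays attention to this CPU whenever the counter is non-zero. */
int rcu_watching_sim(void) { return nesting_sim != 0; }
```

Because both transitions overwrite the counter instead of adjusting it, an unmatched interrupt entry ("half of an interrupt") cannot leave a residue that survives into the next idle period.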
    • P
      lockdep: Update documentation for lock-class leak detection · b804cb9e
      Paul E. McKenney committed
      There are a number of bugs that can leak or overuse lock classes,
      which can cause the maximum number of lock classes (currently 8191)
      to be exceeded.  However, the documentation does not tell you how to
      track down these problems.  This commit addresses this shortcoming.
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
    • P
      rcu: Make synchronize_sched_expedited() better at work sharing · 7077714e
      Paul E. McKenney committed
      When synchronize_sched_expedited() takes its second and subsequent
      snapshots of sync_sched_expedited_started, it subtracts 1.  This
      means that even if the concurrent caller of
      synchronize_sched_expedited() that incremented to that value sees
      our successful completion, it will not be able to take advantage
      of it.  This restriction is
      pointless, given that our full expedited grace period would have
      happened after the other guy started, and thus should be able to
      serve as a proxy for the other guy successfully executing
      try_stop_cpus().
      
      This commit therefore removes the subtraction of 1.
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Reviewed-by: Josh Triplett <josh@joshtriplett.org>
    • P
      rcu: Avoid RCU-preempt expedited grace-period botch · 389abd48
      Paul E. McKenney committed
      Because rcu_read_unlock_special() samples rcu_preempted_readers_exp(rnp)
      after dropping rnp->lock, the following sequence of events is possible:
      
      1.	Task A exits its RCU read-side critical section, and removes
      	itself from the ->blkd_tasks list, releases rnp->lock, and is
      	then preempted.  Task B remains on the ->blkd_tasks list, and
      	blocks the current expedited grace period.
      
      2.	Task B exits from its RCU read-side critical section and removes
      	itself from the ->blkd_tasks list.  Because it is the last task
      	blocking the current expedited grace period, it ends that
      	expedited grace period.
      
      3.	Task A resumes, and samples rcu_preempted_readers_exp(rnp) which
      	of course indicates that nothing is blocking the nonexistent
      	expedited grace period. Task A is again preempted.
      
      4.	Some other CPU starts an expedited grace period.  There are several
      	tasks blocking this expedited grace period queued on the
      	same rcu_node structure that Task A was using in step 1 above.
      
      5.	Task A examines its state and incorrectly concludes that it was
      	the last task blocking the expedited grace period on the current
      	rcu_node structure.  It therefore reports completion up the
      	rcu_node tree.
      
      6.	The expedited grace period can then incorrectly complete before
      	the tasks blocked on this same rcu_node structure exit their
      	RCU read-side critical sections.  Arbitrarily bad things happen.
      
      This commit therefore takes a snapshot of rcu_preempted_readers_exp(rnp)
      prior to dropping the lock, so that only the last task thinks that it is
      the last task, thus avoiding the failure scenario laid out above.
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Reviewed-by: Josh Triplett <josh@joshtriplett.org>
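The core of the fix above is "snapshot the state while still holding the lock". The sketch below is a simplified single-threaded model of that idea, not the kernel code: the lock is only indicated by comments, the `after_unlock` callback stands in for whatever other tasks do once the lock is dropped, and all names are hypothetical. It collapses the six steps into the simplest case where two tasks both end up believing they were last.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical node: number of tasks still blocking the grace period. */
struct node_sim {
	int blocked;
};

/* Models Task B leaving its critical section while Task A is preempted. */
void other_task_exits_sim(struct node_sim *n)
{
	n->blocked--;
}

/* Racy variant: "was I last?" is sampled after the lock is dropped. */
bool exit_racy_sim(struct node_sim *n, void (*after_unlock)(struct node_sim *))
{
	n->blocked--;            /* done under the (modeled) lock */
	/* lock dropped here */
	if (after_unlock)
		after_unlock(n);     /* other tasks run in the gap */
	return n->blocked == 0;  /* late sample sees their changes */
}

/* Fixed variant: the answer is captured before the lock is dropped. */
bool exit_snapshot_sim(struct node_sim *n, void (*after_unlock)(struct node_sim *))
{
	bool last;

	n->blocked--;
	last = (n->blocked == 0); /* snapshot under the (modeled) lock */
	/* lock dropped here */
	if (after_unlock)
		after_unlock(n);
	return last;
}
```

Starting from two blocked tasks, the racy variant tells the first task it was the last one, whereas the snapshot variant does not.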
    • P
      rcu: ->signaled better named ->fqs_state · af446b70
      Paul E. McKenney committed
      The ->signaled field was named before complications in the form of
      dyntick-idle mode and offlined CPUs.  These complications have required
      that force_quiescent_state() be implemented as a state machine, instead
      of simply unconditionally sending reschedule IPIs.  Therefore, this
      commit renames ->signaled to ->fqs_state to catch up with the new
      force_quiescent_state() reality.
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Reviewed-by: Josh Triplett <josh@joshtriplett.org>
  2. 10 Dec, 2011 9 commits
  3. 09 Dec, 2011 10 commits
    • M
      sys_getppid: add missing rcu_dereference · 031af165
      Mandeep Singh Baines committed
      In order to safely dereference current->real_parent inside an
      rcu_read_lock, we need an rcu_dereference.
      Signed-off-by: Mandeep Singh Baines <msb@chromium.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Pavel Emelyanov <xemul@openvz.org>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Kees Cook <keescook@chromium.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • A
      rapidio/tsi721: modify PCIe capability settings · 1cee22b7
      Alexandre Bounine committed
      Modify initialization of PCIe capability registers in Tsi721 mport driver:
       - change Completion Timeout value to avoid unexpected data transfer
         aborts during intensive traffic.
       - replace hardcoded offset of PCIe capability block by making it use the
         common function.
      
      This patch is applicable to kernel versions starting from 3.2-rc1.
      Signed-off-by: Alexandre Bounine <alexandre.bounine@idt.com>
      Cc: Matt Porter <mporter@kernel.crashing.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • A
      rapidio/tsi721: fix mailbox resource reporting · b439e66f
      Alexandre Bounine committed
      Bug fix for Tsi721 RapidIO mport driver: Tsi721 supports four RapidIO
      mailboxes (MBOX0 - MBOX3) as defined by the RapidIO specification.  Mailbox
      resources have to be properly reported to allow use of all available
      mailboxes (initial version reports only MBOX0).
      
      This patch is applicable to kernel versions starting from 3.2-rc1.
      Signed-off-by: Alexandre Bounine <alexandre.bounine@idt.com>
      Cc: Matt Porter <mporter@kernel.crashing.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • A
      rapidio/tsi721: switch to dma_zalloc_coherent · ceb96398
      Alexandre Bounine committed
      Replace the pair dma_alloc_coherent()+memset() with the new
      dma_zalloc_coherent() added by Andrew Morton for kernel version 3.2
      Signed-off-by: Alexandre Bounine <alexandre.bounine@idt.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • M
      procfs: do not overflow get_{idle,iowait}_time for nohz · 2a95ea6c
      Michal Hocko committed
      Since commit a25cac51 ("proc: Consider NO_HZ when printing idle and
      iowait times") we are reporting idle/io_wait time also while a CPU is
      tickless.  We rely on get_{idle,iowait}_time functions to retrieve
      proper data.
      
      These functions, however, use usecs_to_cputime to translate micro
      seconds time to cputime64_t.  This is just an alias to usecs_to_jiffies
      which reduces the data type from u64 to unsigned int and also checks
      whether the given parameter overflows jiffies_to_usecs(MAX_JIFFY_OFFSET)
      and returns MAX_JIFFY_OFFSET in that case.
      
      When we overflow depends on CONFIG_HZ but especially for CONFIG_HZ_300
      it is quite low (1431649781) so we are getting MAX_JIFFY_OFFSET for
      >3000s! until we overflow unsigned int.  Just for reference
      CONFIG_HZ_100 has an overflow window around 20s, CONFIG_HZ_250 ~8s and
      CONFIG_HZ_1000 ~2s.
      
      This results in a bug when people saw [h]top going mad reporting 100%
      CPU usage even though there was basically no CPU load.  The reason was
      simply that /proc/stat stopped reporting idle/io_wait changes (and
      reported MAX_JIFFY_OFFSET) and so the only change happening was for user
      system time.
      
      Let's use nsecs_to_jiffies64 instead, which doesn't reduce the precision
      to a 32-bit type and is much more appropriate for cumulative time values
      (unlike usecs_to_jiffies, which is intended for timeout calculations).
      Signed-off-by: Michal Hocko <mhocko@suse.cz>
      Tested-by: Artem S. Tashkinov <t.artem@mailcity.com>
      Cc: Dave Jones <davej@redhat.com>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Alexey Dobriyan <adobriyan@gmail.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
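The precision loss described in the commit above can be reproduced with plain integer arithmetic. The sketch below is a simplified model, not the kernel helpers: the tick rate `HZ_SIM` and both function names are hypothetical, and the clamp to MAX_JIFFY_OFFSET mentioned in the commit message is deliberately left out so that only the 64-bit to 32-bit narrowing is shown.

```c
#include <assert.h>
#include <stdint.h>

#define HZ_SIM 100u /* hypothetical tick rate for the model */

/* Narrow path: the argument is squeezed into 32 bits before conversion. */
uint64_t ticks_via_32bit_usecs_sim(uint64_t usecs)
{
	uint32_t narrowed = (uint32_t)usecs; /* high bits are lost here */

	return narrowed / (1000000u / HZ_SIM);
}

/* Wide path: stay in 64 bits all the way, starting from nanoseconds. */
uint64_t ticks_via_64bit_nsecs_sim(uint64_t nsecs)
{
	return nsecs / (UINT64_C(1000000000) / HZ_SIM);
}
```

For 5000 seconds of accumulated idle time, the narrow path in this model yields 70503 ticks (5,000,000,000 microseconds wraps to 705,032,704), while the wide path yields the expected 500000 ticks.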
    • M
      mm: vmalloc: check for page allocation failure before vmlist insertion · 1368edf0
      Mel Gorman committed
      Commit f5252e00 ("mm: avoid null pointer access in vm_struct via
      /proc/vmallocinfo") adds newly allocated vm_structs to the vmlist after
      it is fully initialised.  Unfortunately, it did not check that
      __vmalloc_area_node() successfully populated the area.  In the event of
      allocation failure, the vmalloc area is freed but the pointer to freed
      memory is inserted into the vmlist leading to a crash later in
      get_vmalloc_info().
      
      This patch adds a check for __vmalloc_area_node() failure within
      __vmalloc_node_range.  It does not use "goto fail" as in the previous
      error path as a warning was already displayed by __vmalloc_area_node()
      before it called vfree in its failure path.
      
      Credit goes to Luciano Chavez for doing all the real work of identifying
      exactly where the problem was.
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Reported-by: Luciano Chavez <lnx1138@linux.vnet.ibm.com>
      Tested-by: Luciano Chavez <lnx1138@linux.vnet.ibm.com>
      Reviewed-by: Rik van Riel <riel@redhat.com>
      Acked-by: David Rientjes <rientjes@google.com>
      Cc: <stable@vger.kernel.org>		[3.1.x+]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • M
      mm: Ensure that pfn_valid() is called once per pageblock when reserving pageblocks · d0215638
      Michal Hocko committed
      setup_zone_migrate_reserve() expects that zone->start_pfn starts at
      pageblock_nr_pages aligned pfn otherwise we could access beyond an
      existing memblock resulting in the following panic if
      CONFIG_HOLES_IN_ZONE is not configured and we do not check pfn_valid:
      
        IP: [<c02d331d>] setup_zone_migrate_reserve+0xcd/0x180
        *pdpt = 0000000000000000 *pde = f000ff53f000ff53
        Oops: 0000 [#1] SMP
        Pid: 1, comm: swapper Not tainted 3.0.7-0.7-pae #1 VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform
        EIP: 0060:[<c02d331d>] EFLAGS: 00010006 CPU: 0
        EIP is at setup_zone_migrate_reserve+0xcd/0x180
        EAX: 000c0000 EBX: f5801fc0 ECX: 000c0000 EDX: 00000000
        ESI: 000c01fe EDI: 000c01fe EBP: 00140000 ESP: f2475f58
        DS: 007b ES: 007b FS: 00d8 GS: 0000 SS: 0068
        Process swapper (pid: 1, ti=f2474000 task=f2472cd0 task.ti=f2474000)
        Call Trace:
        [<c02d389c>] __setup_per_zone_wmarks+0xec/0x160
        [<c02d3a1f>] setup_per_zone_wmarks+0xf/0x20
        [<c08a771c>] init_per_zone_wmark_min+0x27/0x86
        [<c020111b>] do_one_initcall+0x2b/0x160
        [<c086639d>] kernel_init+0xbe/0x157
        [<c05cae26>] kernel_thread_helper+0x6/0xd
        Code: a5 39 f5 89 f7 0f 46 fd 39 cf 76 40 8b 03 f6 c4 08 74 32 eb 91 90 89 c8 c1 e8 0e 0f be 80 80 2f 86 c0 8b 14 85 60 2f 86 c0 89 c8 <2b> 82 b4 12 00 00 c1 e0 05 03 82 ac 12 00 00 8b 00 f6 c4 08 0f
        EIP: [<c02d331d>] setup_zone_migrate_reserve+0xcd/0x180 SS:ESP 0068:f2475f58
        CR2: 00000000000012b4
      
      We crashed in pageblock_is_reserved() when accessing pfn 0xc0000 because
      highstart_pfn = 0x36ffe.
      
      The issue was introduced in 3.0-rc1 by 6d3163ce ("mm: check if any page
      in a pageblock is reserved before marking it MIGRATE_RESERVE").
      
      Make sure that start_pfn is always aligned to pageblock_nr_pages to
      ensure that pfn_valid is always called at the start of each pageblock.
      Architectures with holes in pageblocks will be correctly handled by
      pfn_valid_within in pageblock_is_reserved.
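
      The alignment step described above can be illustrated with a small
      user-space sketch. It is not the kernel patch itself: the sketch_*
      names are hypothetical, and the pageblock size of 512 pages is an
      assumption (the real pageblock_nr_pages depends on architecture and
      configuration).

```c
#include <assert.h>
#include <stdbool.h>

/* Assumed for illustration only: 512 page frames per pageblock. */
#define SKETCH_PAGEBLOCK_NR_PAGES 512UL

/* Round pfn up to the next multiple of align (align must be non-zero). */
static unsigned long sketch_roundup(unsigned long pfn, unsigned long align)
{
    assert(align != 0);
    return ((pfn + align - 1) / align) * align;
}

/* True when pfn is the first page frame of a pageblock. */
static bool sketch_is_pageblock_start(unsigned long pfn)
{
    return pfn % SKETCH_PAGEBLOCK_NR_PAGES == 0;
}

/*
 * Walk [start_pfn, end_pfn) one pageblock at a time after aligning the
 * start, and return how many pageblocks were visited. Because the start
 * is rounded up first, every visited pfn is the first pfn of its block,
 * which is where a validity check would be made.
 */
static unsigned long sketch_count_blocks(unsigned long start_pfn,
                                         unsigned long end_pfn)
{
    unsigned long pfn, visited = 0;

    start_pfn = sketch_roundup(start_pfn, SKETCH_PAGEBLOCK_NR_PAGES);
    for (pfn = start_pfn; pfn < end_pfn; pfn += SKETCH_PAGEBLOCK_NR_PAGES) {
        assert(sketch_is_pageblock_start(pfn));
        visited++;
    }
    return visited;
}
```

      With highstart_pfn = 0x36ffe from the report and the assumed
      512-page pageblock, sketch_roundup() yields 0x37000, so the walk
      starts on a pageblock boundary instead of two pages before one.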
      Signed-off-by: Michal Hocko <mhocko@suse.cz>
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Tested-by: Dang Bo <bdang@vmware.com>
      Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Arve Hjønnevåg <arve@android.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: John Stultz <john.stultz@linaro.org>
      Cc: Dave Hansen <dave@linux.vnet.ibm.com>
      Cc: <stable@vger.kernel.org>	[3.0+]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      d0215638
    • H
      mm/migrate.c: pair unlock_page() and lock_page() when migrating huge pages · 09761333
      Hillf Danton committed
      Avoid unlocking an unlocked page if we failed to lock it.
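
      The pairing rule can be sketched in user space as follows. This is
      a minimal model, not the code in mm/migrate.c: struct fake_page and
      the fake_* helpers are hypothetical stand-ins for a page and its
      lock, and the goto label only mimics the usual kernel error-path
      style.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical stand-in for a page with a lock bit. */
struct fake_page {
    bool locked;
    bool migrated;
};

/* Try to take the lock; return false if someone else already holds it. */
static bool fake_trylock(struct fake_page *page)
{
    if (page->locked)
        return false;
    page->locked = true;
    return true;
}

/* Release the lock; only legal if the lock is currently held. */
static void fake_unlock(struct fake_page *page)
{
    assert(page->locked);
    page->locked = false;
}

/*
 * Migrate one page. The unlock sits before the "out" label, so the
 * failure path taken when the lock was never acquired skips it and
 * cannot release a lock held by someone else.
 */
static int fake_migrate(struct fake_page *page)
{
    int rc = -EAGAIN;

    if (!fake_trylock(page))
        goto out;

    page->migrated = true;
    rc = 0;
    fake_unlock(page);
out:
    return rc;
}
```

      Calling fake_migrate() on a page that another holder has locked
      returns -EAGAIN and leaves that holder's lock in place.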
      Signed-off-by: Hillf Danton <dhillf@gmail.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      09761333
    • Y
      thp: set compound tail page _count to zero · 58a84aa9
      Youquan Song committed
      Commit 70b50f94 ("mm: thp: tail page refcounting fix") keeps all
      page_tail->_count zero at all times.  But the current kernel does not
      set page_tail->_count to zero if a 1GB page is utilized.  So when an
      IOMMU 1GB page is used by KVM, it will result in a kernel oops because a
      tail page's _count does not equal zero.
      
        kernel BUG at include/linux/mm.h:386!
        invalid opcode: 0000 [#1] SMP
        Call Trace:
          gup_pud_range+0xb8/0x19d
          get_user_pages_fast+0xcb/0x192
          ? trace_hardirqs_off+0xd/0xf
          hva_to_pfn+0x119/0x2f2
          gfn_to_pfn_memslot+0x2c/0x2e
          kvm_iommu_map_pages+0xfd/0x1c1
          kvm_iommu_map_memslots+0x7c/0xbd
          kvm_iommu_map_guest+0xaa/0xbf
          kvm_vm_ioctl_assigned_device+0x2ef/0xa47
          kvm_vm_ioctl+0x36c/0x3a2
          do_vfs_ioctl+0x49e/0x4e4
          sys_ioctl+0x5a/0x7c
          system_call_fastpath+0x16/0x1b
        RIP  gup_huge_pud+0xf2/0x159
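
      The invariant this commit restores (every tail page of a compound
      page carries a zero _count) can be modelled in a few lines. This is
      a simplified user-space sketch, not the kernel's compound-page
      setup code: struct sketch_page and both functions are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, heavily simplified stand-in for struct page. */
struct sketch_page {
    int count;                       /* stands in for _count */
    struct sketch_page *first_page;  /* head page for tails, NULL for head */
};

/*
 * Turn pages[0..nr_pages) into one compound page: pages[0] is the head,
 * every other entry is a tail whose count is forced to zero and which
 * points back at the head.
 */
static void sketch_prep_compound(struct sketch_page *pages, size_t nr_pages)
{
    size_t i;

    assert(nr_pages > 0);
    pages[0].first_page = NULL;
    for (i = 1; i < nr_pages; i++) {
        pages[i].count = 0;
        pages[i].first_page = &pages[0];
    }
}

/* Return 1 if every tail page has a zero count, 0 otherwise. */
static int sketch_tails_have_zero_count(const struct sketch_page *pages,
                                        size_t nr_pages)
{
    size_t i;

    for (i = 1; i < nr_pages; i++)
        if (pages[i].count != 0)
            return 0;
    return 1;
}
```

      In this model, skipping the reset in sketch_prep_compound() leaves
      tails with a stale non-zero count, which is the state the oops
      above complains about.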
      Signed-off-by: Youquan Song <youquan.song@intel.com>
      Reviewed-by: Andrea Arcangeli <aarcange@redhat.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      58a84aa9
    • Y
      thp: add compound tail page _mapcount when mapped · b6999b19
      Youquan Song committed
      With the 3.2-rc kernel, IOMMU 2M pages in KVM work.  But when I tried
      to use IOMMU 1GB pages in KVM, I encountered an oops and the 1GB page
      failed to be used.
      
      The root cause is that 1GB page allocation calls gup_huge_pud() while 2M
      page allocation calls gup_huge_pmd().  If compound pages are used and the
      page is a tail page, gup_huge_pmd() increases _mapcount to record that the
      tail page is mapped, while gup_huge_pud() does not.
      
      So when the mapped page is released, it results in a kernel oops
      because the page is not marked as mapped.
      
      This patch adds tail-page handling for compound pages to the 1GB huge
      page path, matching what the 2M page path already does.
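
      The bookkeeping described above can be sketched as a toy model. It
      is not the kernel implementation: struct model_page and the model_*
      functions are hypothetical, and mapcount is simplified to start at
      0 rather than following the kernel's own encoding.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified stand-in for struct page. */
struct model_page {
    int count;                /* stands in for _count */
    int mapcount;             /* stands in for _mapcount, starts at 0 here */
    struct model_page *head;  /* head page for tails, NULL for the head */
};

/* Record a reference taken through a tail page. */
static void model_get_tail(struct model_page *tail)
{
    assert(tail->head != NULL);
    assert(tail->count == 0);   /* tail count must stay zero */
    tail->mapcount++;
}

/*
 * Take a reference on a page of a huge mapping. The head's count always
 * goes up; if the page is a tail, the reference is also recorded in the
 * tail's mapcount. The 1GB path was missing this second step.
 */
static void model_gup_huge(struct model_page *page)
{
    struct model_page *head = page->head ? page->head : page;

    head->count++;
    if (page->head)
        model_get_tail(page);
}

/*
 * Drop a reference. Returns 0 on success, or -1 if a tail page is
 * released without a recorded reference (the inconsistent state that
 * triggers the BUG in the report).
 */
static int model_put(struct model_page *page)
{
    if (page->head == NULL) {
        page->count--;
        return 0;
    }
    if (page->mapcount <= 0)
        return -1;
    page->mapcount--;
    page->head->count--;
    return 0;
}
```

      In this model, releasing a tail page that was never passed through
      model_gup_huge() fails, while a get followed by a put leaves both
      counters where they started.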
      
      To reproduce:
      1. Add grub boot option: hugepagesz=1G hugepages=8
      2. mount -t hugetlbfs -o pagesize=1G hugetlbfs /dev/hugepages
      3. qemu-kvm -m 2048 -hda os-kvm.img -cpu kvm64 -smp 4 -mem-path /dev/hugepages
      	-net none -device pci-assign,host=07:00.1
      
        kernel BUG at mm/swap.c:114!
        invalid opcode: 0000 [#1] SMP
        Call Trace:
          put_page+0x15/0x37
          kvm_release_pfn_clean+0x31/0x36
          kvm_iommu_put_pages+0x94/0xb1
          kvm_iommu_unmap_memslots+0x80/0xb6
          kvm_assign_device+0xba/0x117
          kvm_vm_ioctl_assigned_device+0x301/0xa47
          kvm_vm_ioctl+0x36c/0x3a2
          do_vfs_ioctl+0x49e/0x4e4
          sys_ioctl+0x5a/0x7c
          system_call_fastpath+0x16/0x1b
        RIP  put_compound_page+0xd4/0x168
      Signed-off-by: Youquan Song <youquan.song@intel.com>
      Reviewed-by: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      b6999b19