1. Feb 22, 2012: 23 commits
    • rcu: Move synchronize_sched_expedited() to rcutree.c · 3d3b7db0
      Paul E. McKenney committed
      Now that TREE_RCU and TREE_PREEMPT_RCU no longer do anything different
      for the single-CPU case, there is no need for multiple definitions of
      synchronize_sched_expedited().  It is no longer in any sense a plug-in,
      so move it from kernel/rcutree_plugin.h to kernel/rcutree.c.
      Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      3d3b7db0
    • rcu: Check for illegal use of RCU from offlined CPUs · c0d6d01b
      Paul E. McKenney committed
      Although it is legal to use RCU during early boot, it is anything
      but legal to use RCU at runtime from an offlined CPU.  After all, RCU
      explicitly ignores offlined CPUs.  This commit therefore adds checks
      for runtime use of RCU from offlined CPUs.
      
      These checks are not perfect; in particular, they can be subverted
      through use of things like rcu_dereference_raw().  Note that it is
      not possible to put checks in rcu_read_lock() and friends, because
      these primitives are used in code that might be protected either by
      RCU or by locking, so checking rcu_read_lock() would produce fat
      piles of false positives.
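      
      As a rough sketch (the macro name here is illustrative; the real
      patch wires the test into the existing lockdep-RCU machinery), the
      new checks amount to:
      
          /* Complain if RCU is used from a CPU that RCU believes offline. */
          #define RCU_WARN_IF_OFFLINE()                                   \
              do {                                                        \
                  if (debug_lockdep_rcu_enabled() &&                      \
                      !rcu_lockdep_current_cpu_online())                  \
                      lockdep_rcu_suspicious(__FILE__, __LINE__,          \
                          "RCU used from an offline CPU!");               \
              } while (0)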
      Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      c0d6d01b
    • rcu: Add CPU-stall capability to rcutorture · c13f3757
      Paul E. McKenney committed
      Add module parameters to rcutorture that induce a CPU stall.
      The stall_cpu parameter specifies how long to stall in seconds,
      defaulting to zero, which indicates no stalling is to be undertaken.
      The stall_cpu_holdoff parameter specifies how many seconds after
      insmod (or boot, if rcutorture is built into the kernel) that this
      stall is to start.  The default value for stall_cpu_holdoff is ten
      seconds.
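      
      A minimal sketch of the stall kthread, assuming the two parameters
      above (the real rcutorture code differs in detail):
      
          static int rcu_torture_stall(void *arg)
          {
              unsigned long stop_at;
      
              if (stall_cpu_holdoff > 0)
                  schedule_timeout_interruptible(stall_cpu_holdoff * HZ);
              stop_at = get_seconds() + stall_cpu;
              preempt_disable();  /* spin without yielding the CPU */
              while (ULONG_CMP_LT(get_seconds(), stop_at))
                  continue;
              preempt_enable();
              return 0;
          }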
      Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      c13f3757
    • rcutorture: Permit holding off CPU-hotplug operations during boot · 9b9ec9b9
      Paul E. McKenney committed
      When rcutorture is started automatically at boot time, it might well
      also start CPU-hotplug operations at that time, which might not be
      desirable.  This commit therefore adds an rcutorture parameter that
      allows CPU-hotplug operations to be held off for the specified number
      of seconds after the start of boot.
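      
      A sketch, assuming the new parameter is named onoff_holdoff and is
      consulted by the online/offline kthread before it starts:
      
          if (onoff_holdoff > 0)
              schedule_timeout_interruptible(onoff_holdoff * HZ);
          /* ... then begin randomly offlining and onlining CPUs ... */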
      Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      9b9ec9b9
    • rcu: Print scheduling-clock information on RCU CPU stall-warning messages · a858af28
      Paul E. McKenney committed
      There have been situations where RCU CPU stall warnings were caused by
      issues in scheduling-clock timer initialization.  To make it easier to
      track these down, this commit causes the RCU CPU stall-warning messages
      to print out the number of scheduling-clock interrupts taken in the
      current grace period for each stalled CPU.
      Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      a858af28
    • rcu: Set RCU CPU stall times via sysfs · 13cfcca0
      Paul E. McKenney committed
      The default CONFIG_RCU_CPU_STALL_TIMEOUT value of 60 seconds has served
      Linux users well for production use for quite some time.  However, for
      debugging, there will be more than three minutes between subsequent
      stall-warning messages.  This can be an annoyingly long wait if you
      are trying to work out where the offending infinite loop is hiding.
      
      Therefore, this commit provides a rcu_cpu_stall_timeout sysfs
      parameter that may be adjusted at boot time and at runtime to speed
      up debugging.
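      
      The knob is a module parameter along these lines (a sketch; the
      exact placement and sysfs path are assumptions):
      
          static int rcu_cpu_stall_timeout __read_mostly =
              CONFIG_RCU_CPU_STALL_TIMEOUT;
          module_param(rcu_cpu_stall_timeout, int, 0644);
      
      so it can be set on the kernel command line at boot or written at
      runtime under /sys/module/.../parameters/rcu_cpu_stall_timeout.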
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      13cfcca0
    • rcu: Remove #ifdef CONFIG_SMP from TREE_RCU · 27565d64
      Paul E. McKenney committed
      Now that both TINY_RCU and TINY_PREEMPT_RCU have been in place for a
      while, it is time to remove UP support from TREE_RCU, which is what
      this commit does.
      Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      27565d64
    • rcu: Check for idle-loop entry while in RCU read-side critical section · c44e2cdd
      Paul E. McKenney committed
      The inner idle loop is an extended quiescent state for all flavors
      of RCU, but there have been recent bugs involving use of RCU read-side
      primitives from within the idle loop.  Therefore, this commit enlists
      lockdep-RCU to detect attempts to enter the inner idle loop while in
      an RCU read-side critical section, emitting a lockdep-RCU splat if so.
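      
      The check is in the spirit of the following sketch, placed on the
      idle-entry path:
      
          rcu_lockdep_assert(!lock_is_held(&rcu_lock_map) &&
                             !lock_is_held(&rcu_bh_lock_map) &&
                             !lock_is_held(&rcu_sched_lock_map),
                             "Illegal idle entry in RCU read-side critical section.");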
      Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      c44e2cdd
    • rcu: Clean up straggling rcu_preempt_needs_cpu() name · 30fbcc90
      Paul E. McKenney committed
      The recent updates to RCU_CPU_FAST_NO_HZ have an rcu_needs_cpu() that
      does more than just check for callbacks, so bring the name of
      rcu_preempt_needs_cpu() in line with that change by renaming it to
      rcu_preempt_cpu_has_callbacks().
      Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      30fbcc90
    • rcu: Simplify unboosting checks · 1aa03f11
      Paul E. McKenney committed
      This is a port of commit #82e78d80 from TREE_PREEMPT_RCU to
      TINY_PREEMPT_RCU.
      
      This commit uses the fact that current->rcu_boost_mutex is set
      any time that the RCU_READ_UNLOCK_BOOSTED flag is set in the
      current->rcu_read_unlock_special bitmask.  This allows tests of
      the bit to be changed to tests of the pointer, which in turn allows
      the RCU_READ_UNLOCK_BOOSTED flag to be eliminated.
      
      Please note that the check of current->rcu_read_unlock_special need not
      change because any time that RCU_READ_UNLOCK_BOOSTED was set, so was
      RCU_READ_UNLOCK_BLOCKED.  Therefore, __rcu_read_unlock() can continue
      testing current->rcu_read_unlock_special for non-zero, as before.
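      
      The unboost path therefore becomes, as a sketch (the local variable
      is illustrative):
      
          #ifdef CONFIG_RCU_BOOST
              struct rt_mutex *rbmp = t->rcu_boost_mutex;
      
              if (rbmp != NULL) {            /* boosted while blocked */
                  t->rcu_boost_mutex = NULL;
                  rt_mutex_unlock(rbmp);     /* deboost */
              }
          #endif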
      Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      1aa03f11
    • rcu: Inform RCU of irq_exit() activity · 8762705a
      Paul E. McKenney committed
      This is a port to TINY_RCU of Peter Zijlstra's commit #ec433f0c.
      
      The rcu_read_unlock_special() function relies on in_irq() to exclude
      scheduler activity from interrupt level.  This fails because irq_exit()
      can invoke the scheduler after clearing the preempt_count() bits that
      in_irq() uses to determine that it is at interrupt level.  This situation
      can result in failures as follows:
      
           $task			IRQ		SoftIRQ
      
           rcu_read_lock()
      
           /* do stuff */
      
           <preempt> |= UNLOCK_BLOCKED
      
           rcu_read_unlock()
             --t->rcu_read_lock_nesting
      
          			irq_enter();
          			/* do stuff, don't use RCU */
          			irq_exit();
          			  sub_preempt_count(IRQ_EXIT_OFFSET);
          			  invoke_softirq()
      
          					ttwu();
          					  spin_lock_irq(&pi->lock)
          					  rcu_read_lock();
          					  /* do stuff */
          					  rcu_read_unlock();
          					    rcu_read_unlock_special()
          					      rcu_report_exp_rnp()
          					        ttwu()
          					          spin_lock_irq(&pi->lock) /* deadlock */
      
             rcu_read_unlock_special(t);
      
      This can be triggered 'easily' because invoke_softirq() immediately does
      a ttwu() of ksoftirqd/# instead of doing the in-place softirq stuff first,
      but even without that the above happens.
      
      Cure this by also excluding softirqs from the rcu_read_unlock_special()
      handler and ensuring the force_irqthreads ksoftirqd/# wakeup is done
      from full softirq context.
      
      It is also necessary to delay the ->rcu_read_lock_nesting decrement until
      after rcu_read_unlock_special().  This delay is handled by the commit
      "Protect __rcu_read_unlock() against scheduler-using irq handlers".
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      8762705a
    • rcu: Prevent RCU callbacks from executing before scheduler initialized · 768dfffd
      Paul E. McKenney committed
      This is a port of commit #b0d30417 from TREE_RCU to TREE_PREEMPT_RCU.
      
      Under some rare but real combinations of configuration parameters, RCU
      callbacks are posted during early boot that use kernel facilities that are
      not yet initialized.  Therefore, when these callbacks are invoked, hard
      hangs and crashes ensue.  This commit therefore prevents RCU callbacks
      from being invoked until after the scheduler is fully up and running,
      as in after multiple tasks have been spawned.
      
      It might well turn out that a better approach is to identify the specific
      RCU callbacks that are causing this problem, but that discussion will
      wait until such time as someone really needs an RCU callback to be invoked
      (as opposed to merely registered) during early boot.
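      
      The gate is a simple flag test, roughly (placement illustrative):
      
          if (!rcu_scheduler_fully_active)
              return;    /* defer callback invocation until boot completes */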
      Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      768dfffd
    • rcu: Streamline code produced by __rcu_read_unlock() · afef2054
      Paul E. McKenney committed
      This is a port of commit #be0e1e21 to TINY_PREEMPT_RCU.  This uses
      noinline to prevent rcu_read_unlock_special() from being inlined into
      __rcu_read_unlock().
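      
      That is, as a sketch:
      
          static noinline void rcu_read_unlock_special(struct task_struct *t);
      
      keeping the rarely taken slow path out of the fast path's code stream.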
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      afef2054
    • rcu: Protect __rcu_read_unlock() against scheduler-using irq handlers · 26861faf
      Paul E. McKenney committed
      This commit ports commit #10f39bb1 (rcu: protect __rcu_read_unlock()
      against scheduler-using irq handlers) from TREE_PREEMPT_RCU to
      TINY_PREEMPT_RCU.  The following is a corresponding port of that
      commit message.
      
      The addition of RCU read-side critical sections within runqueue and
      priority-inheritance critical sections introduced some deadlocks,
      for example, involving interrupts from __rcu_read_unlock() where the
      interrupt handlers call wake_up().  This situation can cause the
      instance of __rcu_read_unlock() invoked from interrupt to do some
      of the processing that would otherwise have been carried out by the
      task-level instance of __rcu_read_unlock().  When the interrupt-level
      instance of __rcu_read_unlock() is called with a scheduler lock held from
      interrupt-entry/exit situations where in_irq() returns false, deadlock can
      result.  Of course, in a UP kernel, there are not really any deadlocks,
      but the upper-level critical section can still be fatally confused
      by the lower-level critical section changing things out from under it.
      
      This commit resolves these deadlocks by using negative values of the
      per-task ->rcu_read_lock_nesting counter to indicate that an instance of
      __rcu_read_unlock() is in flight, which in turn prevents instances
      invoked from interrupt handlers from doing any special processing.  Note that nested
      rcu_read_lock()/rcu_read_unlock() pairs are still permitted, but they will
      never see ->rcu_read_lock_nesting go to zero, and will therefore never
      invoke rcu_read_unlock_special(), thus preventing them from seeing the
      RCU_READ_UNLOCK_BLOCKED bit should it be set in ->rcu_read_unlock_special.
      This patch also adds a check for ->rcu_read_lock_nesting being negative
      in rcu_check_callbacks(), thus preventing the RCU_READ_UNLOCK_NEED_QS
      bit from being set should a scheduling-clock interrupt occur while
      __rcu_read_unlock() is exiting from an outermost RCU read-side critical
      section.
      
      Of course, __rcu_read_unlock() can be preempted during the time that
      ->rcu_read_lock_nesting is negative.  This could result in the setting
      of the RCU_READ_UNLOCK_BLOCKED bit after __rcu_read_unlock() checks it,
      and would also result in this task being queued on the corresponding
      rcu_node structure's blkd_tasks list.  Therefore, some later RCU read-side
      critical section would enter rcu_read_unlock_special() to clean up --
      which could result in deadlock (OK, OK, fatal confusion) if that RCU
      read-side critical section happened to be in the scheduler where the
      runqueue or priority-inheritance locks were held.
      
      To prevent the possibility of fatal confusion that might result from
      preemption during the time that ->rcu_read_lock_nesting is negative,
      this commit also makes rcu_preempt_note_context_switch() check for
      negative ->rcu_read_lock_nesting, thus refraining from queuing the task
      (and from setting RCU_READ_UNLOCK_BLOCKED) if we are already exiting
      from the outermost RCU read-side critical section (in other words,
      we really are no longer actually in that RCU read-side critical
      section).  In addition, rcu_preempt_note_context_switch() invokes
      rcu_read_unlock_special() to carry out the cleanup in this case, which
      clears out the ->rcu_read_unlock_special bits and dequeues the task
      (if necessary), in turn avoiding needless delay of the current RCU grace
      period and needless RCU priority boosting.
      
      It is still illegal to call rcu_read_unlock() while holding a scheduler
      lock if the prior RCU read-side critical section has ever had both
      preemption and irqs enabled.  However, the common use case is legal,
      namely where the entire RCU read-side critical section executes with
      irqs disabled, for example, when the scheduler lock is held across the
      entire lifetime of the RCU read-side critical section.
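      
      A sketch of the resulting __rcu_read_unlock(), following the
      TREE_PREEMPT_RCU original (TINY_PREEMPT_RCU is analogous):
      
          void __rcu_read_unlock(void)
          {
              struct task_struct *t = current;
      
              if (t->rcu_read_lock_nesting != 1) {
                  --t->rcu_read_lock_nesting;    /* nested case */
              } else {
                  t->rcu_read_lock_nesting = INT_MIN; /* unlock in flight */
                  barrier();  /* assign before ->rcu_read_unlock_special load */
                  if (unlikely(ACCESS_ONCE(t->rcu_read_unlock_special)))
                      rcu_read_unlock_special(t);
                  barrier();  /* load before nesting-counter reset */
                  t->rcu_read_lock_nesting = 0;
              }
          }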
      Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      26861faf
    • rcu: Remove single-rcu_node optimization in rcu_start_gp() · f38bd102
      Paul E. McKenney committed
      The grace-period initialization sequence in rcu_start_gp() has a special
      case for systems where the rcu_node tree is a single rcu_node structure.
      This made sense some years ago when systems were smaller and up to 64
      CPUs could share a single rcu_node structure, but now that large systems
      are common and a given leaf rcu_node structure can support only 16 CPUs
      (due to lock contention on the rcu_node's ->lock field), this optimization
      is almost never taken.  And even the small mobile platforms that might
      make use of it might rather have the kernel text reduction.
      
      Therefore, this commit removes the check for single-rcu_node trees.
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Reviewed-by: Josh Triplett <josh@joshtriplett.org>
      f38bd102
    • rcu: Don't make callbacks go through second full grace period · a50c3af9
      Paul E. McKenney committed
      RCU's current CPU-offline code path dumps all of the outgoing CPU's
      callbacks onto the RCU_NEXT_TAIL portion of the surviving CPU's
      callback list.  This means that all the ready-to-invoke callbacks from
      the outgoing CPU must wait for another full RCU grace period.  This was
      just fine when CPU-hotplug events were rare, but there is increasing
      evidence that users are planning to make increasing use of CPU hotplug.
      
      Therefore, this commit changes the callback-dumping procedure so that
      callbacks that are ready to invoke are moved to the RCU_DONE_TAIL
      portion of the surviving CPU's callback list.  This avoids running
      these callbacks through a second unnecessary grace period.
      Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      a50c3af9
    • rcu: Check for callback invocation from offline CPUs · 8146c4e2
      Paul E. McKenney committed
      Because quiescent states are now reported from offline CPUs in
      CPU_DYING state, there is some possibility that such a CPU might
      note the end of a grace period and attempt to start invoking
      callbacks.  This would be a very bad thing, and is supposed to
      be prevented by the fact that the CPU_DYING CPU gets rid of all
      its callbacks before reporting the quiescent state.  However,
      there is other CPU-offline code in the kernel, and it is quite
      possible that someone will invoke RCU core processing from that
      code.  Therefore, this commit adds a warning for this case.
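      
      The warning is a one-liner of this shape, placed where callbacks
      would otherwise be invoked (a sketch):
      
          WARN_ON_ONCE(cpu_is_offline(smp_processor_id()));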
      Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      8146c4e2
    • rcu: Limit lazy-callback duration · 778d250a
      Paul E. McKenney committed
      Currently, a given CPU is permitted to remain in dyntick-idle mode
      indefinitely if it has only lazy RCU callbacks queued.  This is vulnerable
      to corner cases in NUMA systems, so limit the time to six seconds by
      default.  (Currently controlled by a cpp macro.)
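      
      A sketch of the cap (the macro name is an assumption):
      
          /* Maximum dyntick-idle residency with only lazy callbacks queued. */
          #define RCU_IDLE_LAZY_GP_DELAY    (6 * HZ)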
      Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      778d250a
    • rcu: Make rcutorture flag online/offline failures · 091541bb
      Paul E. McKenney committed
      Make rcutorture check for CPU-hotplug failures and complain if there
      were any.
      Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      091541bb
    • rcu: Simplify offline processing · e5601400
      Paul E. McKenney committed
      Move ->qsmaskinit and blkd_tasks[] manipulation to the CPU_DYING
      notifier.  This simplifies the code by eliminating a potential
      deadlock and by reducing the responsibilities of force_quiescent_state().
      Also rename functions to make their connection to the CPU-hotplug
      stages explicit.
      Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      e5601400
    • rcu: Avoid waking up CPUs having only kfree_rcu() callbacks · 486e2593
      Paul E. McKenney committed
      When CONFIG_RCU_FAST_NO_HZ is enabled, RCU will allow a given CPU to
      enter dyntick-idle mode even if it still has RCU callbacks queued.
      RCU avoids system hangs in this case by scheduling a timer for several
      jiffies in the future.  However, if all of the callbacks on that CPU
      are from kfree_rcu(), there is no reason to wake the CPU up, as it is
      not a problem to defer freeing of memory.
      
      This commit therefore tracks the number of callbacks on a given CPU
      that are from kfree_rcu(), and avoids scheduling the timer if all of
      a given CPU's callbacks are from kfree_rcu().
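      
      Roughly, as a sketch (field and helper names are illustrative):
      
          /* When a callback is queued: */
          rdp->qlen++;
          if (lazy)                      /* e.g., posted via kfree_rcu() */
              rdp->qlen_lazy++;
      
          /* When entering dyntick-idle: */
          if (cpu_has_nonlazy_callbacks(cpu))
              arm_wakeup_timer();        /* otherwise, just let the CPU sleep */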
      Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      486e2593
    • rcu: Add diagnostic for misaligned rcu_head structures · 0bb7b59d
      Paul E. McKenney committed
      The push for energy efficiency will require that RCU tag rcu_head
      structures to indicate whether or not their invocation is time critical.
      This tagging is best carried out in the bottom bits of the ->next
      pointers in the rcu_head structures.  This tagging requires that the
      rcu_head structures be properly aligned, so this commit adds the required
      diagnostics.
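      
      The diagnostic is of this shape, checked as callbacks are posted
      (a sketch; the exact mask depends on how many tag bits are wanted):
      
          WARN_ON_ONCE((unsigned long)head & 0x3);  /* low bits must be clear */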
      Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      0bb7b59d
    • rcu: Add lockdep-RCU checks for simple self-deadlock · fe15d706
      Paul E. McKenney committed
      It is illegal to have a grace period within a same-flavor RCU read-side
      critical section, so this commit adds lockdep-RCU checks to splat when
      such abuse is encountered.  This commit does not detect more elaborate
      RCU deadlock situations.  These situations might be a job for lockdep
      enhancements.
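      
      One such check, as a sketch (the other flavors are analogous):
      
          void synchronize_rcu(void)
          {
              rcu_lockdep_assert(!lock_is_held(&rcu_lock_map),
                  "Illegal synchronize_rcu() in RCU read-side critical section");
              /* ... wait for a grace period as before ... */
          }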
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      fe15d706
  2. Feb 14, 2012: 1 commit
  3. Feb 10, 2012: 1 commit
  4. Feb 7, 2012: 2 commits
    • perf: Fix double start/stop in x86_pmu_start() · f39d47ff
      Stephane Eranian committed
      The following patch fixes a bug introduced by the following
      commit:
      
              e050e3f0 ("perf: Fix broken interrupt rate throttling")
      
      The patch caused the following warning to pop up depending on
      the sampling frequency adjustments:
      
        ------------[ cut here ]------------
        WARNING: at arch/x86/kernel/cpu/perf_event.c:995 x86_pmu_start+0x79/0xd4()
      
      It was caused by the following call sequence:
      
      perf_adjust_freq_unthr_context.part() {
           stop()
           if (delta > 0) {
                perf_adjust_period() {
                    if (period > 8*...) {
                        stop()
                        ...
                        start()
                    }
                }
            }
            start()
      }
      
      Which caused a double start and a double stop, thus triggering
      the assert in x86_pmu_start().
      
      The patch fixes the problem by avoiding the double calls. We
      pass a new argument to perf_adjust_period() to indicate whether
      or not the event is already stopped. We can't just remove the
      start/stop from that function because it's called from
      __perf_event_overflow where the event needs to be reloaded via a
      back-to-back stop/start call.
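      
      A sketch of the reworked helper (argument name as described above;
      the body is abbreviated):
      
          static void perf_adjust_period(struct perf_event *event, u64 nsec,
                                         u64 count, bool disable)
          {
              /* ... compute the new sample_period as before ... */
              if (local64_read(&hwc->period_left) > 8 * sample_period) {
                  if (disable)
                      event->pmu->stop(event, PERF_EF_UPDATE);
                  local64_set(&hwc->period_left, 0);
                  if (disable)
                      event->pmu->start(event, PERF_EF_RELOAD);
              }
          }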
      
      The patch reintroduces the assertion in x86_pmu_start() which
      was removed by commit:
      
      	84f2b9b2 ("perf: Remove deprecated WARN_ON_ONCE()")
      
      In this second version, we've added calls to disable/enable PMU
      during unthrottling or frequency adjustment based on bug report
      of spurious NMI interrupts from Eric Dumazet.
      Reported-and-tested-by: Eric Dumazet <eric.dumazet@gmail.com>
      Signed-off-by: Stephane Eranian <eranian@google.com>
      Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: markus@trippelsdorf.de
      Cc: paulus@samba.org
      Link: http://lkml.kernel.org/r/20120207133956.GA4932@quad
      [ Minor edits to the changelog and to the code ]
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      f39d47ff
    • block: strip out locking optimization in put_io_context() · 11a3122f
      Tejun Heo committed
      put_io_context() performed a complex trylock dance to avoid
      deferring ioc release to workqueue.  It was also broken on UP
      because the trylock was always assumed to succeed, which resulted
      in an unbalanced preemption count.
      
      While there are ways to fix the UP breakage, even the most
      pathological microbench (forced ioc allocation and tight fork/exit
      loop) fails to show any appreciable performance benefit of the
      optimization.  Strip it out.  If there turns out to be workloads which
      are affected by this change, simpler optimization from the discussion
      thread can be applied later.
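      
      The stripped-down form is roughly (a sketch; the actual function
      retains a bit more bookkeeping):
      
          void put_io_context(struct io_context *ioc)
          {
              if (ioc == NULL)
                  return;
              BUG_ON(atomic_long_read(&ioc->refcount) <= 0);
              /* Always defer release to workqueue context. */
              if (atomic_long_dec_and_test(&ioc->refcount))
                  schedule_work(&ioc->release_work);
          }
      
      Always deferring to a workqueue trades a rare extra context switch
      for much simpler locking.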
      Signed-off-by: Tejun Heo <tj@kernel.org>
      LKML-Reference: <1328514611.21268.66.camel@sli10-conroe>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      11a3122f
  5. Feb 5, 2012: 1 commit
    • PM / Freezer: Thaw only kernel threads if freezing of kernel threads fails · 379e0be8
      Srivatsa S. Bhat committed
      If freezing of kernel threads fails, we are expected to automatically
      thaw tasks in the error recovery path. However, at times, we encounter
      situations in which we would like the automatic error recovery path
      to thaw only the kernel threads, because we want to be able to do
      some more cleanup before we thaw userspace. Something like:
      
      error = freeze_kernel_threads();
      if (error) {
      	/* Do some cleanup */
      
      	/* Only then thaw userspace tasks */
      	thaw_processes();
      }
      
      An example of such a situation is where we freeze/thaw filesystems
      during suspend/hibernation. There, if freezing of kernel threads
      fails, we would like to thaw the frozen filesystems before thawing
      the userspace tasks.
      
      So, modify freeze_kernel_threads() to thaw only kernel threads in
      case of freezing failure. And change suspend_freeze_processes()
      accordingly. (At the same time, let us also get rid of the rather
      cryptic usage of the conditional operator (?:) in that function.)
      
      [rjw: In fact, this patch fixes a regression introduced during the
       3.3 merge window, because without it thaw_processes() may be called
       before swsusp_free() in some situations and that may lead to massive
       memory allocation failures.]
      Signed-off-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
      Acked-by: Tejun Heo <tj@kernel.org>
      Acked-by: Nigel Cunningham <nigel@tuxonice.net>
      Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
      379e0be8
  6. Feb 4, 2012: 1 commit
  7. Feb 3, 2012: 1 commit
  8. Feb 2, 2012: 1 commit
  9. Jan 30, 2012: 1 commit
    • PM / Hibernate: Fix s2disk regression related to freezing workqueues · 181e9bde
      Rafael J. Wysocki committed
      Commit 2aede851
      
        PM / Hibernate: Freeze kernel threads after preallocating memory
      
      introduced a mechanism by which kernel threads were frozen after
      the preallocation of hibernate image memory to avoid problems with
      frozen kernel threads not responding to memory freeing requests.
      However, it overlooked the s2disk code path, in which the
      SNAPSHOT_CREATE_IMAGE ioctl was run directly after SNAPSHOT_FREE,
      which caused freeze_workqueues_begin() to BUG(), because it saw
      that workqueues had already been frozen.
      
      Although in principle this issue might be addressed by removing
      the relevant BUG_ON() from freeze_workqueues_begin(), that would
      reintroduce the very problem that commit 2aede851
      attempted to avoid into that particular code path.  For this reason,
      to fix the issue at hand, introduce thaw_kernel_threads() and make
      the SNAPSHOT_FREE ioctl execute it.
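      
      That is, roughly (a sketch of the ioctl handling):
      
          case SNAPSHOT_FREE:
              swsusp_free();
              /* Kernel threads may have been frozen by a previous
               * SNAPSHOT_CREATE_IMAGE; thaw them so that the next
               * one can freeze them again. */
              thaw_kernel_threads();
              break;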
      
      Special thanks to Srivatsa S. Bhat for detailed analysis of the
      problem.
      Reported-and-tested-by: Jiri Slaby <jslaby@suse.cz>
      Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
      Acked-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
      Cc: stable@kernel.org
      181e9bde
  10. Jan 27, 2012: 7 commits
    • sched/rt: Fix task stack corruption under __ARCH_WANT_INTERRUPTS_ON_CTXSW · cb297a3e
      Chanho Min committed
      This issue happens under the following conditions:
      
       1. preemption is off
       2. __ARCH_WANT_INTERRUPTS_ON_CTXSW is defined
       3. RT scheduling class
       4. SMP system
      
      Sequence is as follows:
      
       1. Suppose the current task is A; schedule() starts.
       2. Task A is enqueued as a pushable task at the entry of schedule():
          __schedule
           prev = rq->curr;
           ...
           put_prev_task
            put_prev_task_rt
             enqueue_pushable_task
       3. Task B is picked as the next task:
          next = pick_next_task(rq);
       4. rq->curr is set to task B and context_switch() is started:
          rq->curr = next;
       5. At the entry of context_switch(), this cpu's rq->lock is released:
          context_switch
           prepare_task_switch
            prepare_lock_switch
             raw_spin_unlock_irq(&rq->lock);
       6. Shortly after rq->lock is released, an interrupt occurs and IRQ
          context starts.
       7. try_to_wake_up(), called from the ISR, acquires rq->lock:
           try_to_wake_up
            ttwu_remote
             rq = __task_rq_lock(p)
             ttwu_do_wakeup(rq, p, wake_flags);
               task_woken_rt
       8. push_rt_task picks task A, which was enqueued earlier:
          task_woken_rt
           push_rt_tasks(rq)
            next_task = pick_next_pushable_task(rq)
       9. At find_lock_lowest_rq(), if double_lock_balance() returns 0,
          lowest_rq can be the remote rq.
          (But if preemption is on, double_lock_balance() always returns 1
          and this doesn't happen.)
          push_rt_task
           find_lock_lowest_rq
            if (double_lock_balance(rq, lowest_rq))..
       10. find_lock_lowest_rq() returns the available rq, and task A is
           migrated to the remote cpu/rq:
           push_rt_task
            ...
            deactivate_task(rq, next_task, 0);
            set_task_cpu(next_task, lowest_rq->cpu);
            activate_task(lowest_rq, next_task, 0);
       11. But task A is still in IRQ context on this cpu, so task A is
           scheduled by two cpus at the same time until it returns from the
           IRQ, and task A's stack is corrupted.
      
      To fix it, don't migrate an RT task if it's still running.
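      
      Illustratively (this is the shape of the rule, not the literal
      patch), the migration path now refuses a task that is still running:
      
          if (task_running(rq, next_task))
              return 0;    /* don't migrate; it is still running */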
      Signed-off-by: Chanho Min <chanho.min@lge.com>
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Acked-by: Steven Rostedt <rostedt@goodmis.org>
      Cc: <stable@kernel.org>
      Link: http://lkml.kernel.org/r/CAOAMb1BHA=5fm7KTewYyke6u-8DP0iUuJMpgQw54vNeXFsGpoQ@mail.gmail.com
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      cb297a3e
    • perf: Fix broken interrupt rate throttling · e050e3f0
      Stephane Eranian committed
      This patch fixes the sampling interrupt throttling mechanism.
      
      It was broken in v3.2. Events were not being unthrottled. The
      unthrottling mechanism required that events be checked at each
      timer tick.
      
      This patch solves this problem and also separates:
      
        - unthrottling
        - multiplexing
        - frequency-mode period adjustments
      
      Not all of them need to be executed at each timer tick.
      
      This third version of the patch is based on my original patch +
      PeterZ proposal (https://lkml.org/lkml/2012/1/7/87).
      
      At each timer tick, for each context:
      
        - if the current CPU has throttled events, we unthrottle events
      
        - if context has frequency-based events, we adjust sampling periods
      
        - if we have reached the jiffies interval, we multiplex (rotate)
      
      We decoupled rotation (multiplexing) from frequency-mode sampling
      period adjustments.  They should not necessarily happen at the same
      rate. Multiplexing is subject to jiffies_interval (currently at 1
      but could be higher once the tunable is exposed via sysfs).
      
      We have grouped frequency-mode adjustment and unthrottling into the
      same routine to minimize code duplication. When throttled while in
      frequency mode, we scan the events only once.
      
      We have fixed the threshold enforcement code in __perf_event_overflow().
      There was a bug whereby it would allow more than the authorized rate
      because an increment of hwc->interrupts was not executed at the right
      place.
      
      The patch was tested with low sampling limit (2000) and fixed periods,
      frequency mode, overcommitted PMU.
      
      On a 2.1GHz AMD CPU:
      
       $ cat /proc/sys/kernel/perf_event_max_sample_rate
       2000
      
      We set a rate of 3000 samples/sec (2.1GHz/3000 = 700000):
      
       $ perf record -e cycles,cycles -c 700000  noploop 10
       $ perf report -D | tail -21
      
       Aggregated stats:
                 TOTAL events:      80086
                  MMAP events:         88
                  COMM events:          2
                  EXIT events:          4
              THROTTLE events:      19996
            UNTHROTTLE events:      19996
                SAMPLE events:      40000
      
       cycles stats:
                 TOTAL events:      40006
                  MMAP events:          5
                  COMM events:          1
                  EXIT events:          4
              THROTTLE events:       9998
            UNTHROTTLE events:       9998
                SAMPLE events:      20000
      
       cycles stats:
                 TOTAL events:      39996
              THROTTLE events:       9998
            UNTHROTTLE events:       9998
                SAMPLE events:      20000
      
      For 10s, the cap is 2x2000x10 = 40000 samples.
      We get exactly that: 20000 samples/event.
      Signed-off-by: Stephane Eranian <eranian@google.com>
      Cc: <stable@kernel.org> # v3.2+
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Link: http://lkml.kernel.org/r/20120126160319.GA5655@quad
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      e050e3f0
    • sched: Fix ancient race in do_exit() · b5740f4b
      Yasunori Goto committed
      try_to_wake_up() has a problem: in a race with an SMI, or with the
      guest scheduling of a virtual machine, it may change a task's status
      from TASK_DEAD back to TASK_RUNNING. As a result, the exited task is
      scheduled again and a panic occurs.
      
      Here is the sequence how it occurs:
      
       ----------------------------------+-----------------------------
                                         |
                  CPU A                  |             CPU B
       ----------------------------------+-----------------------------
      
      TASK A calls exit()....
      
      do_exit()
      
        exit_mm()
          down_read(mm->mmap_sem);
      
          rwsem_down_failed_common()
      
            set TASK_UNINTERRUPTIBLE
            set waiter.task <= task A
            list_add to sem->wait_list
                 :
            raw_spin_unlock_irq()
            (I/O interruption occurred)
      
                                            __rwsem_do_wake(mmap_sem)
      
                                              list_del(&waiter->list);
                                              waiter->task = NULL
                                              wake_up_process(task A)
                                                try_to_wake_up()
                                                   (task is still
                                                      TASK_UNINTERRUPTIBLE,
                                                    p->on_rq is still 1.)
      
                                                    ttwu_do_wakeup()
                                                       (*A)
                                                         :
           (I/O interruption handler finished)
      
            if (!waiter.task)
                schedule() is not called
                because waiter.task is NULL.
      
            tsk->state = TASK_RUNNING
      
                :
                                                    check_preempt_curr();
                                                        :
        task->state = TASK_DEAD
                                                    (*B)
                                              <---    set TASK_RUNNING (*C)
      
           schedule()
           (exit task is running again)
           BUG_ON() is called!
       --------------------------------------------------------
      
      The execution time between (*A) and (*B) is usually very short,
      because interrupts are disabled, and setting TASK_RUNNING at (*C)
      must be executed before setting TASK_DEAD.
      
      HOWEVER, if an SMI interrupts the CPU between (*A) and (*B),
      (*C) is able to execute AFTER TASK_DEAD is set!
      Then, the exited task is scheduled again, and BUG_ON() is called....
      
      If the system runs as a guest of a virtual machine, the time
      between (*A) and (*B) may also be long due to hypervisor scheduling,
      and the same phenomenon can occur.
      
      With this patch, do_exit() waits for task->pi_lock, which is used in
      try_to_wake_up(), to be released.  This guarantees that the task
      becomes TASK_DEAD only after any such wakeup has completed.
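      
      A sketch of the do_exit() side:
      
          smp_mb();
          raw_spin_unlock_wait(&tsk->pi_lock); /* wait out any concurrent ttwu() */
      
          tsk->state = TASK_DEAD;              /* now (*C) can no longer follow us */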
      Signed-off-by: Yasunori Goto <y-goto@jp.fujitsu.com>
      Acked-by: Oleg Nesterov <oleg@redhat.com>
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Link: http://lkml.kernel.org/r/20120117174031.3118.E1E9C6FF@jp.fujitsu.com
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      b5740f4b
    • bugs, x86: Fix printk levels for panic, softlockups and stack dumps · b0f4c4b3
      Prarit Bhargava committed
      rsyslog will display KERN_EMERG messages on a connected
      terminal.  However, these messages are useless/undecipherable
      for a general user.
      
      For example, after a softlockup we get:
      
       Message from syslogd@intel-s3e37-04 at Jan 25 14:18:06 ...
       kernel:Stack:
      
       Message from syslogd@intel-s3e37-04 at Jan 25 14:18:06 ...
       kernel:Call Trace:
      
       Message from syslogd@intel-s3e37-04 at Jan 25 14:18:06 ...
       kernel:Code: ff ff a8 08 75 25 31 d2 48 8d 86 38 e0 ff ff 48 89
       d1 0f 01 c8 0f ae f0 48 8b 86 38 e0 ff ff a8 08 75 08 b1 01 4c 89 e0 0f 01 c9 <e8> ea 69 dd ff 4c 29 e8 48 89 c7 e8 0f bc da ff 49 89 c4 49 89
      
      This happens because the printk levels for these messages are
      incorrect. Only an informational message should be displayed on
      a terminal.
      
      I modified the printk levels for various messages in the kernel
      and tested the output by using the drivers/misc/lkdtm.c kernel
      modules (ie, softlockups, panics, hard lockups, etc.) and
      confirmed that the console output was still the same and that
      the output to the terminals was correct.
      
      For example, in the case of a softlockup we now see the much
      more informative:
      
       Message from syslogd@intel-s3e37-04 at Jan 25 10:18:06 ...
       BUG: soft lockup - CPU4 stuck for 60s!
      
      instead of the above confusing messages.
      
      AFAICT, the messages no longer have to be KERN_EMERG.  In the
      most important case of a panic we set console_verbose().  As for
      the other less severe cases the correct data is output to the
      console and /var/log/messages.
      
      Successfully tested by me using the drivers/misc/lkdtm.c module.
      Signed-off-by: Prarit Bhargava <prarit@redhat.com>
      Cc: dzickus@redhat.com
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Link: http://lkml.kernel.org/r/1327586134-11926-1-git-send-email-prarit@redhat.com
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      b0f4c4b3
    • sched/nohz: Fix nohz cpu idle load balancing state with cpu hotplug · 71325960
      Suresh Siddha committed
      With the recent nohz scheduler changes, the rq's nohz flag
      'NOHZ_TICK_STOPPED' and its associated state don't get cleared
      immediately after the cpu exits idle. They get cleared as part
      of the next tick seen on that cpu.
      
      For the cpu offline support, we need to clear this state
      manually. Fix it by registering a cpu notifier, which clears the
      nohz idle load balance state for this rq explicitly during the
      CPU_DYING notification.
      
      There won't be any nohz updates for that cpu after the
      CPU_DYING notification. But let's be extra paranoid and skip
      updating the nohz state in select_nohz_load_balancer() if
      the cpu is not in active state anymore.
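      
      A sketch of the notifier (function and helper names are
      illustrative):
      
          static int __cpuinit sched_ilb_notifier(struct notifier_block *nfb,
                                                  unsigned long action, void *hcpu)
          {
              switch (action & ~CPU_TASKS_FROZEN) {
              case CPU_DYING:
                  clear_nohz_tick_stopped(smp_processor_id());
                  return NOTIFY_OK;
              default:
                  return NOTIFY_DONE;
              }
          }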
      Reported-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
      Reviewed-and-tested-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
      Tested-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Link: http://lkml.kernel.org/r/1327026538.16150.40.camel@sbsiddha-desk.sc.intel.com
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      71325960
    • sched/s390: Fix compile error in sched/core.c · db7e527d
      Christian Borntraeger committed
      Commit 029632fb ("sched: Make
      separate sched*.c translation units") removed the include of
      asm/mutex.h from sched.c.
      
      This breaks the combination of:
      
       CONFIG_MUTEX_SPIN_ON_OWNER=yes
       CONFIG_HAVE_ARCH_MUTEX_CPU_RELAX=yes
      
      like s390 without mutex debugging:
      
        CC      kernel/sched/core.o
        kernel/sched/core.c: In function ‘mutex_spin_on_owner’:
        kernel/sched/core.c:3287: error: implicit declaration of function ‘arch_mutex_cpu_relax’
      
      Let's re-add the include to kernel/sched/core.c.
      Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Link: http://lkml.kernel.org/r/1326268696-30904-1-git-send-email-borntraeger@de.ibm.com
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      db7e527d
    • sched: Fix rq->nr_uninterruptible update race · 4ca9b72b
      Peter Zijlstra committed
      KOSAKI Motohiro noticed the following race:
      
       > CPU0                    CPU1
       > --------------------------------------------------------
       > deactivate_task()
       >                         task->state = TASK_UNINTERRUPTIBLE;
       > activate_task()
       >    rq->nr_uninterruptible--;
       >
       >                         schedule()
       >                           deactivate_task()
       >                             rq->nr_uninterruptible++;
       >
      
      Kosaki-San's scenario is possible when CPU0 runs
      __sched_setscheduler() against CPU1's current @task.
      
      __sched_setscheduler() does a dequeue/enqueue in order to move
      the task to its new queue (position) to reflect the newly provided
      scheduling parameters. However, it should be completely invariant
      with respect to nr_uninterruptible accounting: sched_setscheduler()
      doesn't affect readiness to run, merely the policy on when to run.
      
      So convert the inappropriate activate/deactivate_task usage to
      enqueue/dequeue_task, which avoids the nr_uninterruptible accounting.
      
      Also convert the two other sites, __migrate_task() and
      normalize_task(), that still use activate/deactivate_task. These
      sites aren't really a problem since __migrate_task() will only be
      called on a non-running task (and is therefore immune to the
      described problem) and normalize_task() isn't ever used on regular
      systems.
      
      Also remove the comments from activate/deactivate_task since they're
      misleading at best.
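      
      The conversion in __sched_setscheduler() is then of this shape
      (a sketch):
      
          on_rq = p->on_rq;
          if (on_rq)
              dequeue_task(rq, p, 0);    /* was deactivate_task() */
      
          /* ... apply the new policy and priority ... */
      
          if (on_rq)
              enqueue_task(rq, p, 0);    /* was activate_task() */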
      Reported-by: KOSAKI Motohiro <kosaki.motohiro@gmail.com>
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Link: http://lkml.kernel.org/r/1327486224.2614.45.camel@laptop
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      4ca9b72b
  11. Jan 24, 2012: 1 commit
    • kernel-doc: fix kernel-doc warnings in sched · fa757281
      Randy Dunlap committed
      Fix new kernel-doc notation warnings:
      
      Warning(include/linux/sched.h:2094): No description found for parameter 'p'
      Warning(include/linux/sched.h:2094): Excess function parameter 'tsk' description in 'is_idle_task'
      Warning(kernel/sched/cpupri.c:139): No description found for parameter 'newpri'
      Warning(kernel/sched/cpupri.c:139): Excess function parameter 'pri' description in 'cpupri_set'
      Warning(kernel/sched/cpupri.c:208): Excess function parameter 'bootmem' description in 'cpupri_init'
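      
      The fixes are simple parameter-name corrections in the kernel-doc
      comments, e.g. (a sketch):
      
          /**
           * is_idle_task - is the specified task an idle task?
           * @p: the task in question.
           */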
      Signed-off-by: Randy Dunlap <rdunlap@xenotime.net>
      Cc:	Ingo Molnar <mingo@elte.hu>
      Cc:	Peter Zijlstra <peterz@infradead.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      fa757281