1. 12 12月, 2011 4 次提交
    • P
      rcu: Track idleness independent of idle tasks · 9b2e4f18
      Paul E. McKenney 提交于
      Earlier versions of RCU used the scheduling-clock tick to detect idleness
      by checking for the idle task, but handled idleness differently for
      CONFIG_NO_HZ=y.  But there are now a number of uses of RCU read-side
      critical sections in the idle task, for example, for tracing.  A more
      fine-grained detection of idleness is therefore required.
      
      This commit presses the old dyntick-idle code into full-time service,
      so that rcu_idle_enter(), previously known as rcu_enter_nohz(), is
      always invoked at the beginning of an idle loop iteration.  Similarly,
      rcu_idle_exit(), previously known as rcu_exit_nohz(), is always invoked
      at the end of an idle-loop iteration.  This allows the idle task to
      use RCU everywhere except between consecutive rcu_idle_enter() and
      rcu_idle_exit() calls, in turn allowing architecture maintainers to
      specify exactly where in the idle loop that RCU may be used.
      
      Because some of the userspace upcall uses can result in what looks
      to RCU like half of an interrupt, it is not possible to expect that
      the irq_enter() and irq_exit() hooks will give exact counts.  This
      patch therefore expands the ->dynticks_nesting counter to 64 bits
      and uses two separate bitfields to count process/idle transitions
      and interrupt entry/exit transitions.  It is presumed that userspace
      upcalls do not happen in the idle loop or from usermode execution
      (though usermode might do a system call that results in an upcall).
      The counter is hard-reset on each process/idle transition, which
      avoids the interrupt entry/exit error from accumulating.  Overflow
      is avoided by the 64-bitness of the ->dyntick_nesting counter.
      
      This commit also adds warnings if a non-idle task asks RCU to enter
      idle state (and these checks will need some adjustment before applying
      Frederic's OS-jitter patches (http://lkml.org/lkml/2011/10/7/246).
      In addition, validation of ->dynticks and ->dynticks_nesting is added.
      Signed-off-by: NPaul E. McKenney <paul.mckenney@linaro.org>
      Signed-off-by: NPaul E. McKenney <paulmck@linux.vnet.ibm.com>
      Reviewed-by: NJosh Triplett <josh@joshtriplett.org>
      9b2e4f18
    • P
      rcu: Make synchronize_sched_expedited() better at work sharing · 7077714e
      Paul E. McKenney 提交于
      When synchronize_sched_expedited() takes its second and subsequent
      snapshots of sync_sched_expedited_started, it subtracts 1.  This
      means that the concurrent caller of synchronize_sched_expedited()
      that incremented to that value sees our successful completion, it
      will not be able to take advantage of it.  This restriction is
      pointless, given that our full expedited grace period would have
      happened after the other guy started, and thus should be able to
      serve as a proxy for the other guy successfully executing
      try_stop_cpus().
      
      This commit therefore removes the subtraction of 1.
      Signed-off-by: NPaul E. McKenney <paulmck@linux.vnet.ibm.com>
      Reviewed-by: NJosh Triplett <josh@joshtriplett.org>
      7077714e
    • P
      rcu: Avoid RCU-preempt expedited grace-period botch · 389abd48
      Paul E. McKenney 提交于
      Because rcu_read_unlock_special() samples rcu_preempted_readers_exp(rnp)
      after dropping rnp->lock, the following sequence of events is possible:
      
      1.	Task A exits its RCU read-side critical section, and removes
      	itself from the ->blkd_tasks list, releases rnp->lock, and is
      	then preempted.  Task B remains on the ->blkd_tasks list, and
      	blocks the current expedited grace period.
      
      2.	Task B exits from its RCU read-side critical section and removes
      	itself from the ->blkd_tasks list.  Because it is the last task
      	blocking the current expedited grace period, it ends that
      	expedited grace period.
      
      3.	Task A resumes, and samples rcu_preempted_readers_exp(rnp) which
      	of course indicates that nothing is blocking the nonexistent
      	expedited grace period. Task A is again preempted.
      
      4.	Some other CPU starts an expedited grace period.  There are several
      	tasks blocking this expedited grace period queued on the
      	same rcu_node structure that Task A was using in step 1 above.
      
      5.	Task A examines its state and incorrectly concludes that it was
      	the last task blocking the expedited grace period on the current
      	rcu_node structure.  It therefore reports completion up the
      	rcu_node tree.
      
      6.	The expedited grace period can then incorrectly complete before
      	the tasks blocked on this same rcu_node structure exit their
      	RCU read-side critical sections.  Arbitrarily bad things happen.
      
      This commit therefore takes a snapshot of rcu_preempted_readers_exp(rnp)
      prior to dropping the lock, so that only the last task thinks that it is
      the last task, thus avoiding the failure scenario laid out above.
      Signed-off-by: NPaul E. McKenney <paulmck@linux.vnet.ibm.com>
      Reviewed-by: NJosh Triplett <josh@joshtriplett.org>
      389abd48
    • P
      rcu: ->signaled better named ->fqs_state · af446b70
      Paul E. McKenney 提交于
      The ->signaled field was named before complications in the form of
      dyntick-idle mode and offlined CPUs.  These complications have required
      that force_quiescent_state() be implemented as a state machine, instead
      of simply unconditionally sending reschedule IPIs.  Therefore, this
      commit renames ->signaled to ->fqs_state to catch up with the new
      force_quiescent_state() reality.
      Signed-off-by: NPaul E. McKenney <paulmck@linux.vnet.ibm.com>
      Reviewed-by: NJosh Triplett <josh@joshtriplett.org>
      af446b70
  2. 09 12月, 2011 2 次提交
  3. 07 12月, 2011 2 次提交
  4. 06 12月, 2011 6 次提交
  5. 05 12月, 2011 1 次提交
    • P
      perf: Fix loss of notification with multi-event · 10c6db11
      Peter Zijlstra 提交于
      When you do:
              $ perf record -e cycles,cycles,cycles noploop 10
      
      You expect about 10,000 samples for each event, i.e., 10s at
      1000samples/sec. However, this is not what's happening. You
      get much fewer samples, maybe 3700 samples/event:
      
      $ perf report -D | tail -15
      Aggregated stats:
                 TOTAL events:      10998
                  MMAP events:         66
                  COMM events:          2
                SAMPLE events:      10930
      cycles stats:
                 TOTAL events:       3644
                SAMPLE events:       3644
      cycles stats:
                 TOTAL events:       3642
                SAMPLE events:       3642
      cycles stats:
                 TOTAL events:       3644
                SAMPLE events:       3644
      
      On a Intel Nehalem or even AMD64, there are 4 counters capable
      of measuring cycles, so there is plenty of space to measure those
      events without multiplexing (even with the NMI watchdog active).
      And even with multiplexing, we'd expect roughly the same number
      of samples per event.
      
      The root of the problem was that when the event that caused the buffer
      to become full was not the first event passed on the cmdline, the user
      notification would get lost. The notification was sent to the file
      descriptor of the overflowed event but the perf tool was not polling
      on it.  The perf tool aggregates all samples into a single buffer,
      i.e., the buffer of the first event. Consequently, it assumes
      notifications for any event will come via that descriptor.
      
      The seemingly straight forward solution of moving the waitq into the
      ringbuffer object doesn't work because of life-time issues. One could
      perf_event_set_output() on a fd that you're also blocking on and cause
      the old rb object to be freed while its waitq would still be
      referenced by the blocked thread -> FAIL.
      
      Therefore link all events to the ringbuffer and broadcast the wakeup
      from the ringbuffer object to all possible events that could be waited
      upon. This is rather ugly, and we're open to better solutions but it
      works for now.
      Reported-by: NStephane Eranian <eranian@google.com>
      Finished-by: NStephane Eranian <eranian@google.com>
      Reviewed-by: NStephane Eranian <eranian@google.com>
      Signed-off-by: NPeter Zijlstra <a.p.zijlstra@chello.nl>
      Link: http://lkml.kernel.org/r/20111126014731.GA7030@quadSigned-off-by: NIngo Molnar <mingo@elte.hu>
      10c6db11
  6. 02 12月, 2011 5 次提交
  7. 29 11月, 2011 1 次提交
  8. 25 11月, 2011 1 次提交
    • M
      cgroup_freezer: fix freezing groups with stopped tasks · 884a45d9
      Michal Hocko 提交于
      2d3cbf8b (cgroup_freezer: update_freezer_state() does incorrect state
      transitions) removed is_task_frozen_enough and replaced it with a simple
      frozen call. This, however, breaks freezing for a group with stopped tasks
      because those cannot be frozen and so the group remains in CGROUP_FREEZING
      state (update_if_frozen doesn't count stopped tasks) and never reaches
      CGROUP_FROZEN.
      
      Let's add is_task_frozen_enough back and use it at the original locations
      (update_if_frozen and try_to_freeze_cgroup). Semantically we consider
      stopped tasks as frozen enough so we should consider both cases when
      testing frozen tasks.
      
      Testcase:
      mkdir /dev/freezer
      mount -t cgroup -o freezer none /dev/freezer
      mkdir /dev/freezer/foo
      sleep 1h &
      pid=$!
      kill -STOP $pid
      echo $pid > /dev/freezer/foo/tasks
      echo FROZEN > /dev/freezer/foo/freezer.state
      while true
      do
      	cat /dev/freezer/foo/freezer.state
      	[ "`cat /dev/freezer/foo/freezer.state`" = "FROZEN" ] && break
      	sleep 1
      done
      echo OK
      Signed-off-by: NMichal Hocko <mhocko@suse.cz>
      Acked-by: NLi Zefan <lizf@cn.fujitsu.com>
      Cc: Tomasz Buchert <tomasz.buchert@inria.fr>
      Cc: Paul Menage <paul@paulmenage.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: stable@kernel.org
      Signed-off-by: NTejun Heo <htejun@gmail.com>
      884a45d9
  9. 24 11月, 2011 1 次提交
  10. 19 11月, 2011 3 次提交
  11. 18 11月, 2011 2 次提交
  12. 17 11月, 2011 1 次提交
  13. 16 11月, 2011 2 次提交
  14. 14 11月, 2011 6 次提交
  15. 11 11月, 2011 1 次提交
    • J
      clocksource: Avoid selecting mult values that might overflow when adjusted · d65670a7
      John Stultz 提交于
      For some frequencies, the clocks_calc_mult_shift() function will
      unfortunately select mult values very close to 0xffffffff.  This
      has the potential to overflow when NTP adjusts the clock, adding
      to the mult value.
      
      This patch adds a clocksource.maxadj value, which provides
      an approximation of an 11% adjustment(NTP limits adjustments to
      500ppm and the tick adjustment is limited to 10%), which could
      be made to the clocksource.mult value. This is then used to both
      check that the current mult value won't overflow/underflow, as
      well as warning us if the timekeeping_adjust() code pushes over
      that 11% boundary.
      
      v2: Fix max_adjustment calculation, and improve WARN_ONCE
      messages.
      
      v3: Don't warn before maxadj has actually been set
      
      CC: Yong Zhang <yong.zhang0@gmail.com>
      CC: David Daney <ddaney.cavm@gmail.com>
      CC: Thomas Gleixner <tglx@linutronix.de>
      CC: Chen Jie <chenj@lemote.com>
      CC: zhangfx <zhangfx@lemote.com>
      CC: stable@kernel.org
      Reported-by: NChen Jie <chenj@lemote.com>
      Reported-by: Nzhangfx <zhangfx@lemote.com>
      Tested-by: NYong Zhang <yong.zhang0@gmail.com>
      Signed-off-by: NJohn Stultz <john.stultz@linaro.org>
      d65670a7
  16. 08 11月, 2011 1 次提交
  17. 07 11月, 2011 1 次提交