1. 14 7月, 2008 3 次提交
    • I
      lockdep: fix kernel/fork.c warning · d12c1a37
      Ingo Molnar 提交于
      fix:
      
      [    0.184011] ------------[ cut here ]------------
      [    0.188011] WARNING: at kernel/fork.c:918 copy_process+0x1c0/0x1084()
      [    0.192011] Pid: 0, comm: swapper Not tainted 2.6.26-tip-00351-g01d4a50-dirty #14521
      [    0.196011]  [<c0135d48>] warn_on_slowpath+0x3c/0x60
      [    0.200012]  [<c016f805>] ? __alloc_pages_internal+0x92/0x36b
      [    0.208012]  [<c033de5e>] ? __spin_lock_init+0x24/0x4a
      [    0.212012]  [<c01347e3>] copy_process+0x1c0/0x1084
      [    0.216013]  [<c013575f>] do_fork+0xb8/0x1ad
      [    0.220013]  [<c034f75e>] ? acpi_os_release_lock+0x8/0xa
      [    0.228013]  [<c034ff7a>] ? acpi_os_vprintf+0x20/0x24
      [    0.232014]  [<c01129ee>] kernel_thread+0x75/0x7d
      [    0.236014]  [<c0a491eb>] ? kernel_init+0x0/0x24a
      [    0.240014]  [<c0a491eb>] ? kernel_init+0x0/0x24a
      [    0.244014]  [<c01151b0>] ? kernel_thread_helper+0x0/0x10
      [    0.252015]  [<c06c6ac0>] rest_init+0x14/0x50
      [    0.256015]  [<c0a498ce>] start_kernel+0x2b9/0x2c0
      [    0.260015]  [<c0a4904f>] __init_begin+0x4f/0x57
      [    0.264016]  =======================
      [    0.268016] ---[ end trace 4eaa2a86a8e2da22 ]---
      [    0.272016] enabled ExtINT on CPU#0
      
      which occurs if CONFIG_TRACE_IRQFLAGS=y, CONFIG_DEBUG_LOCKDEP=y,
      but CONFIG_PROVE_LOCKING is disabled.
      Signed-off-by: NIngo Molnar <mingo@elte.hu>
      d12c1a37
    • I
      lockdep: fix ftrace irq tracing false positive · 992860e9
      Ingo Molnar 提交于
      fix this false positive:
      
      [    0.020000] ------------[ cut here ]------------
      [    0.020000] WARNING: at kernel/lockdep.c:2718 check_flags+0x14a/0x170()
      [    0.020000] Modules linked in:
      [    0.020000] Pid: 0, comm: swapper Not tainted 2.6.26-tip-00343-gd7e5521-dirty #14486
      [    0.020000]  [<c01312e4>] warn_on_slowpath+0x54/0x80
      [    0.020000]  [<c067e451>] ? _spin_unlock_irqrestore+0x61/0x70
      [    0.020000]  [<c0131bb1>] ? release_console_sem+0x201/0x210
      [    0.020000]  [<c0143d65>] ? __kernel_text_address+0x35/0x40
      [    0.020000]  [<c010562e>] ? dump_trace+0x5e/0x140
      [    0.020000]  [<c01518b5>] ? __lock_acquire+0x245/0x820
      [    0.020000]  [<c015063a>] check_flags+0x14a/0x170
      [    0.020000]  [<c0151ed8>] ? lock_acquire+0x48/0xc0
      [    0.020000]  [<c0151ee1>] lock_acquire+0x51/0xc0
      [    0.020000]  [<c014a16c>] ? down+0x2c/0x40
      [    0.020000]  [<c010a609>] ? sched_clock+0x9/0x10
      [    0.020000]  [<c067e7b2>] _write_lock+0x32/0x60
      [    0.020000]  [<c013797f>] ? request_resource+0x1f/0xb0
      [    0.020000]  [<c013797f>] request_resource+0x1f/0xb0
      [    0.020000]  [<c02f89ad>] vgacon_startup+0x2bd/0x3e0
      [    0.020000]  [<c094d62a>] con_init+0x19/0x22f
      [    0.020000]  [<c0330c7c>] ? tty_register_ldisc+0x5c/0x70
      [    0.020000]  [<c094cf49>] console_init+0x20/0x2e
      [    0.020000]  [<c092a969>] start_kernel+0x20c/0x379
      [    0.020000]  [<c092a516>] ? unknown_bootoption+0x0/0x1f6
      [    0.020000]  [<c092a099>] __init_begin+0x99/0xa1
      [    0.020000]  =======================
      [    0.020000] ---[ end trace 4eaa2a86a8e2da22 ]---
      [    0.020000] possible reason: unannotated irqs-on.
      [    0.020000] irq event stamp: 0
      
      which occurs if CONFIG_TRACE_IRQFLAGS=y, CONFIG_DEBUG_LOCKDEP=y,
      but CONFIG_PROVE_LOCKING is disabled.
      Signed-off-by: NIngo Molnar <mingo@elte.hu>
      992860e9
    • S
      Security: split proc ptrace checking into read vs. attach · 006ebb40
      Stephen Smalley 提交于
      Enable security modules to distinguish reading of process state via
      proc from full ptrace access by renaming ptrace_may_attach to
      ptrace_may_access and adding a mode argument indicating whether only
      read access or full attach access is requested.  This allows security
      modules to permit access to reading process state without granting
      full ptrace access.  The base DAC/capability checking remains unchanged.
      
      Read access to /proc/pid/mem continues to apply a full ptrace attach
      check since check_mem_permission() already requires the current task
      to already be ptracing the target.  The other ptrace checks within
      proc for elements like environ, maps, and fds are changed to pass the
      read mode instead of attach.
      
      In the SELinux case, we model such reading of process state as a
      reading of a proc file labeled with the target process' label.  This
      enables SELinux policy to permit such reading of process state without
      permitting control or manipulation of the target process, as there are
      a number of cases where programs probe for such information via proc
      but do not need to be able to control the target (e.g. procps,
      lsof, PolicyKit, ConsoleKit).  At present we have to choose between
      allowing full ptrace in policy (more permissive than required/desired)
      or breaking functionality (or in some cases just silencing the denials
      via dontaudit rules but this can hide genuine attacks).
      
      This version of the patch incorporates comments from Casey Schaufler
      (change/replace existing ptrace_may_attach interface, pass access
      mode), and Chris Wright (provide greater consistency in the checking).
      
      Note that like their predecessors __ptrace_may_attach and
      ptrace_may_attach, the __ptrace_may_access and ptrace_may_access
      interfaces use different return value conventions from each other (0
      or -errno vs. 1 or 0).  I retained this difference to avoid any
      changes to the caller logic but made the difference clearer by
      changing the latter interface to return a bool rather than an int and
      by adding a comment about it to ptrace.h for any future callers.
      Signed-off-by: NStephen Smalley <sds@tycho.nsa.gov>
      Acked-by: NChris Wright <chrisw@sous-sol.org>
      Signed-off-by: NJames Morris <jmorris@namei.org>
      006ebb40
  2. 13 7月, 2008 1 次提交
    • D
      cpusets, hotplug, scheduler: fix scheduler domain breakage · 3e84050c
      Dmitry Adamushko 提交于
      Commit f18f982a ("sched: CPU hotplug events must not destroy scheduler
      domains created by the cpusets") introduced a hotplug-related problem as
      described below:
      
      Upon CPU_DOWN_PREPARE,
      
        update_sched_domains() -> detach_destroy_domains(&cpu_online_map)
      
      does the following:
      
      /*
       * Force a reinitialization of the sched domains hierarchy. The domains
       * and groups cannot be updated in place without racing with the balancing
       * code, so we temporarily attach all running cpus to the NULL domain
       * which will prevent rebalancing while the sched domains are recalculated.
       */
      
      The sched-domains should be rebuilt when a CPU_DOWN ops. has been
      completed, effectively either upon CPU_DEAD{_FROZEN} (upon success) or
      CPU_DOWN_FAILED{_FROZEN} (upon failure -- restore the things to their
      initial state). That's what update_sched_domains() also does but only
      for !CPUSETS case.
      
      With f18f982a, sched-domains' reinitialization is delegated to
      CPUSETS code:
      
      cpuset_handle_cpuhp() -> common_cpu_mem_hotplug_unplug() ->
      rebuild_sched_domains()
      
      Being called for CPU_UP_PREPARE and if its callback is called after
      update_sched_domains()), it just negates all the work done by
      update_sched_domains() -- i.e. a soon-to-be-offline cpu is included in
      the sched-domains and that makes it visible for the load-balancer
      while the CPU_DOWN ops. is in progress.
      
      __migrate_live_tasks() moves the tasks off a 'dead' cpu (it's already
      "offline" when this function is called).
      
      try_to_wake_up() is called for one of these tasks from another CPU ->
      the load-balancer (wake_idle()) picks up a "dead" CPU and places the
      task on it. Then e.g. BUG_ON(rq->nr_running) detects this a bit later
      -> oops.
      Signed-off-by: NDmitry Adamushko <dmitry.adamushko@gmail.com>
      Tested-by: NVegard Nossum <vegard.nossum@gmail.com>
      Cc: Paul Menage <menage@google.com>
      Cc: Max Krasnyansky <maxk@qualcomm.com>
      Cc: Paul Jackson <pj@sgi.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: miaox@cn.fujitsu.com
      Cc: rostedt@goodmis.org
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: NIngo Molnar <mingo@elte.hu>
      3e84050c
  3. 11 7月, 2008 19 次提交
    • I
      ftrace: build fix for ftraced_suspend · b2613e37
      Ingo Molnar 提交于
      fix:
      
       kernel/trace/ftrace.c:1615: error: 'ftraced_suspend' undeclared (first use in this function)
       kernel/trace/ftrace.c:1615: error: (Each undeclared identifier is reported only once
       kernel/trace/ftrace.c:1615: error: for each function it appears in.)
      Signed-off-by: NIngo Molnar <mingo@elte.hu>
      b2613e37
    • S
      sched_clock: and multiplier for TSC to gtod drift · c300ba25
      Steven Rostedt 提交于
      The sched_clock code currently tries to keep all CPU clocks of all CPUS
      somewhat in sync. At every clock tick it records the gtod clock and
      uses that and jiffies and the TSC to calculate a CPU clock that tries to
      stay in sync with all the other CPUs.
      
      ftrace depends heavily on this timer and it detects when this timer
      "jumps".  One problem is that the TSC and the gtod also drift.
      When the TSC is 0.1% faster or slower than the gtod it is very noticeable
      in ftrace. To help compensate for this, I've added a multiplier that
      tries to keep the CPU clock updating at the same rate as the gtod.
      
      I've tried various ways to get it to be in sync and this ended up being
      the most reliable. At every scheduler tick we calculate the new multiplier:
      
        multi = delta_gtod / delta_TSC
      
      This means we perform a 64 bit divide at the tick (once a HZ). A shift
      is used to handle the accuracy.
      
      Other methods that failed due to dynamic HZ are:
      
      (not used)  multi += (gtod - tsc) / delta_gtod
      (not used)  multi += (gtod - (last_tsc + delta_tsc)) / delta_gtod
      
      as well as other variants.
      
      This code still allows for a slight drift between TSC and gtod, but
      it keeps the damage down to a minimum.
      Signed-off-by: NSteven Rostedt <srostedt@redhat.com>
      Cc: Steven Rostedt <srostedt@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: john stultz <johnstul@us.ibm.com>
      Signed-off-by: NIngo Molnar <mingo@elte.hu>
      c300ba25
    • S
      sched_clock: record TSC after gtod · a83bc47c
      Steven Rostedt 提交于
      To read the gtod we need to grab the xtime lock for read. Reading the gtod
      before the TSC can cause a bigger gab if the xtime lock is contended.
      
      This patch simply reverses the order to read the TSC after the gtod.
      The locking in the reading of the gtod handles any barriers one might
      think is needed.
      Signed-off-by: NSteven Rostedt <srostedt@redhat.com>
      Cc: Steven Rostedt <srostedt@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: john stultz <johnstul@us.ibm.com>
      Signed-off-by: NIngo Molnar <mingo@elte.hu>
      a83bc47c
    • S
      sched_clock: only update deltas with local reads. · c0c87734
      Steven Rostedt 提交于
      Reading the CPU clock should try to stay accurate within the CPU.
      By reading the CPU clock from another CPU and updating the deltas can
      cause unneeded jumps when reading from the local CPU.
      
      This patch changes the code to update the last read TSC only when read
      from the local CPU.
      Signed-off-by: NSteven Rostedt <srostedt@redhat.com>
      Cc: Steven Rostedt <srostedt@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: john stultz <johnstul@us.ibm.com>
      Signed-off-by: NIngo Molnar <mingo@elte.hu>
      c0c87734
    • S
      sched_clock: fix calculation of other CPU · 2b8a0cf4
      Steven Rostedt 提交于
      The algorithm to calculate the 'now' of another CPU is not correct.
      At each scheduler tick, each CPU records the last sched_clock and
      gtod (tick_raw and tick_gtod respectively). If the TSC is somewhat the
      same in speed between two clocks the algorithm would be:
      
        tick_gtod1 + (now1 - tick_raw1) = tick_gtod2 + (now2 - tick_raw2)
      
      To calculate now2 we would have:
      
        now2 = (tick_gtod1 - tick_gtod2) + (tick_raw2 - tick_raw1) + now1
      
      Currently the algorithm is:
      
        now2 = (tick_gtod1 - tick_gtod2) + (tick_raw1 - tick_raw2) + now1
      
      This solves most of the rest of the issues I've had with timestamps in
      ftace.
      Signed-off-by: NSteven Rostedt <srostedt@redhat.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: john stultz <johnstul@us.ibm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Signed-off-by: NIngo Molnar <mingo@elte.hu>
      2b8a0cf4
    • S
      sched_clock: stop maximum check on NO HZ · af52a90a
      Steven Rostedt 提交于
      Working with ftrace I would get large jumps of 11 millisecs or more with
      the clock tracer. This killed the latencing timings of ftrace and also
      caused the irqoff self tests to fail.
      
      What was happening is with NO_HZ the idle would stop the jiffy counter and
      before the jiffy counter was updated the sched_clock would have a bad
      delta jiffies to compare with the gtod with the maximum.
      
      The jiffies would stop and the last sched_tick would record the last gtod.
      On wakeup, the sched clock update would compare the gtod + delta jiffies
      (which would be zero) and compare it to the TSC. The TSC would have
      correctly (with a stable TSC) moved forward several jiffies. But because the
      jiffies has not been updated yet the clock would be prevented from moving
      forward because it would appear that the TSC jumped too far ahead.
      
      The clock would then virtually stop, until the jiffies are updated. Then
      the next sched clock update would see that the clock was very much behind
      since the delta jiffies is now correct. This would then jump the clock
      forward by several jiffies.
      
      This caused ftrace to report several milliseconds of interrupts off
      latency at every resume from NO_HZ idle.
      
      This patch adds hooks into the nohz code to disable the checking of the
      maximum clock update when nohz is in effect. It resumes the max check
      when nohz has updated the jiffies again.
      Signed-off-by: NSteven Rostedt <srostedt@redhat.com>
      Cc: Steven Rostedt <srostedt@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NIngo Molnar <mingo@elte.hu>
      af52a90a
    • S
      sched_clock: widen the max and min time · f7cce27f
      Steven Rostedt 提交于
      With keeping the max and min sched time within one jiffy of the gtod clock
      was too tight. Just before a schedule tick the max could easily be hit, as
      well as just after a schedule_tick the min could be hit. This caused the
      clock to jump around by a jiffy.
      
      This patch widens the minimum to
         last gtod + (delta_jiffies ? delta_jiffies - 1 : 0) * TICK_NSECS
      
      and the maximum to
          last gtod + (2 + delta_jiffies) * TICK_NSECS
      
      This keeps the minum to gtod or if one jiffy less than delta jiffies
      and the maxim 2 jiffies ahead of gtod. This may cause unstable TSCs to be
      a bit more sporadic, but it helps keep a clock with a stable TSC working well.
      Signed-off-by: NSteven Rostedt <srostedt@redhat.com>
      Cc: Steven Rostedt <srostedt@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NIngo Molnar <mingo@elte.hu>
      f7cce27f
    • S
      sched_clock: record from last tick · 62c43dd9
      Steven Rostedt 提交于
      The sched_clock code tries to keep within the gtod time by one tick (jiffy).
      The current code mistakenly keeps track of the delta jiffies between
      updates of the clock, where the the delta is used to compare with the
      number of jiffies that have past since an update of the gtod. The gtod is
      updated at each schedule tick not each sched_clock update. After one
      jiffy passes the clock is updated fine. But the delta is taken from the
      last update so if the next update happens before the next tick the delta
      jiffies used will be incorrect.
      
      This patch changes the code to check the delta of jiffies between ticks
      and not updates to match the comparison of the updates with the gtod.
      Signed-off-by: NSteven Rostedt <srostedt@redhat.com>
      Cc: Steven Rostedt <srostedt@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NIngo Molnar <mingo@elte.hu>
      62c43dd9
    • S
      ftrace: separate out the function enabled variable · 60bc0800
      Steven Rostedt 提交于
      Currently the function tracer uses the global tracer_enabled variable that
      is used to keep track if the tracer is enabled or not. The function tracing
      startup needs to be separated out, otherwise the internal happenings of
      the tracer startup is also recorded.
      
      This patch creates a ftrace_function_enabled variable to all the starting
      of the function traces to happen after everything has been started.
      Signed-off-by: NSteven Rostedt <srostedt@redhat.com>
      Cc: Steven Rostedt <srostedt@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NIngo Molnar <mingo@elte.hu>
      60bc0800
    • S
      ftrace: add ftrace_kill_atomic · a2bb6a3d
      Steven Rostedt 提交于
      It has been suggested that I add a way to disable the function tracer
      on an oops. This code adds a ftrace_kill_atomic. It is not meant to be
      used in normal situations. It will disable the ftrace tracer, but will
      not perform the nice shutdown that requires scheduling.
      Signed-off-by: NSteven Rostedt <srostedt@redhat.com>
      Cc: Steven Rostedt <srostedt@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NIngo Molnar <mingo@elte.hu>
      a2bb6a3d
    • S
      ftrace: use current CPU for function startup · 26bc83f4
      Steven Rostedt 提交于
      This is more of a clean up. Currently the function tracer initializes the
      tracer with which ever CPU was last used for tracing. This value isn't
      realy useful for function tracing, but at least it should be something other
      than a random number.
      Signed-off-by: NSteven Rostedt <srostedt@redhat.com>
      Cc: Steven Rostedt <srostedt@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NIngo Molnar <mingo@elte.hu>
      26bc83f4
    • S
      ftrace: start wakeup tracing after setting function tracer · ad591240
      Steven Rostedt 提交于
      Enabling the wakeup tracer before enabling the function tracing causes
      some strange results due to the dynamic enabling of the functions.
      Signed-off-by: NSteven Rostedt <srostedt@redhat.com>
      Cc: Steven Rostedt <srostedt@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NIngo Molnar <mingo@elte.hu>
      ad591240
    • S
      ftrace: check proper config for preempt type · b5c21b45
      Steven Rostedt 提交于
      There is no CONFIG_PREEMPT_DESKTOP. Use the proper entry CONFIG_PREEMPT.
      Signed-off-by: NSteven Rostedt <srostedt@redhat.com>
      Cc: Steven Rostedt <srostedt@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NIngo Molnar <mingo@elte.hu>
      b5c21b45
    • S
      ftrace: trace schedule · 1e16c0a0
      Steven Rostedt 提交于
      After the sched_clock code has been removed from sched.c we can now trace
      the scheduler. The scheduler has a lot of functions that would be worth
      tracing.
      Signed-off-by: NSteven Rostedt <srostedt@redhat.com>
      Cc: Steven Rostedt <srostedt@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NIngo Molnar <mingo@elte.hu>
      1e16c0a0
    • S
      ftrace: define function trace nop · 001b6767
      Steven Rostedt 提交于
      When CONFIG_FTRACE is not enabled, the tracing_start_functon_trace
      and tracing_stop_function_trace should be nops.
      Signed-off-by: NSteven Rostedt <srostedt@redhat.com>
      Cc: Steven Rostedt <srostedt@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NIngo Molnar <mingo@elte.hu>
      001b6767
    • S
      ftrace: move sched_switch enable after markers · 007c05d4
      Steven Rostedt 提交于
      We have two markers now that are enabled on sched_switch. One that records
      the context switching and the other that records task wake ups. Currently
      we enable the tracing first and then set the markers. This causes some
      confusing traces:
      
      # tracer: sched_switch
      #
      #           TASK-PID   CPU#    TIMESTAMP  FUNCTION
      #              | |      |          |         |
             trace-cmd-3973  [00]   115.834817:   3973:120:R   +     3:  0:S
             trace-cmd-3973  [01]   115.834910:   3973:120:R   +     6:  0:S
             trace-cmd-3973  [02]   115.834910:   3973:120:R   +     9:  0:S
             trace-cmd-3973  [03]   115.834910:   3973:120:R   +    12:  0:S
             trace-cmd-3973  [02]   115.834910:   3973:120:R   +     9:  0:S
                <idle>-0     [02]   115.834910:      0:140:R ==>  3973:120:R
      
      Here we see that trace-cmd with PID 3973 wakes up task 9 but the next line
      shows the idle task doing a context switch to task 3973.
      
      Enabling the tracing to _after_ the markers are set creates a much saner
      output:
      
      # tracer: sched_switch
      #
      #           TASK-PID   CPU#    TIMESTAMP  FUNCTION
      #              | |      |          |         |
                <idle>-0     [02]  7922.634225:      0:140:R ==>  4790:120:R
             trace-cmd-4789  [03]  7922.634225:      0:140:R   +  4790:120:R
      Signed-off-by: NSteven Rostedt <srostedt@redhat.com>
      Cc: Steven Rostedt <srostedt@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NIngo Molnar <mingo@elte.hu>
      007c05d4
    • L
      sched: fix cpu hotplug, cleanup · b1e38734
      Linus Torvalds 提交于
      Clean up __migrate_task(): to just have separate "done" and "fail"
      cases, instead of that "out" case with random error behavior.
      Signed-off-by: NIngo Molnar <mingo@elte.hu>
      b1e38734
    • N
      Fix PREEMPT_RCU without HOTPLUG_CPU · 70ff0555
      Nick Piggin 提交于
      PREEMPT_RCU without HOTPLUG_CPU is broken.  The rcu_online_cpu is called
      to initially populate rcu_cpu_online_map with all online CPUs when the
      hotplug event handler is installed, and also to populate the map with
      CPUs as they come online.  The former case is meant to happen with and
      without HOTPLUG_CPU, but without HOTPLUG_CPU, the rcu_offline_cpu
      function is no-oped -- while it still gets called, it does not set the
      rcu CPU map.
      
      With a blank RCU CPU map, grace periods get to tick by completely
      oblivious to active RCU read side critical sections.  This results in
      free-before-grace bugs.
      
      Fix is obvious once the problem is known. (Also, change __devinit to
      __cpuinit so the function gets thrown away on !HOTPLUG_CPU kernels).
      Signed-off-by: NNick Piggin <npiggin@suse.de>
      Reported-and-tested-by: NAlexey Dobriyan <adobriyan@gmail.com>
      Acked-by: NIngo Molnar <mingo@elte.hu>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      [ Nick is my personal hero of the day - Linus ]
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      70ff0555
    • D
  4. 10 7月, 2008 1 次提交
    • D
      sched: fix cpu hotplug · dc7fab8b
      Dmitry Adamushko 提交于
      I think we may have a race between try_to_wake_up() and
      migrate_live_tasks() -> move_task_off_dead_cpu() when the later one
      may end up looping endlessly.
      
      Interrupts are enabled on other CPUs when migration_call(CPU_DEAD, ...) is
      called so we may get a race between try_to_wake_up() and
      migrate_live_tasks() -> move_task_off_dead_cpu(). The former one may push
      a task out of a dead CPU causing the later one to loop endlessly.
      
      Heiko Carstens observed:
      
      | That's exactly what explains a dump I got yesterday. Thanks for fixing! :)
      Signed-off-by: NDmitry Adamushko <dmitry.adamushko@gmail.com>
      Cc: miaox@cn.fujitsu.com
      Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Avi Kivity <avi@qumranet.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NIngo Molnar <mingo@elte.hu>
      dc7fab8b
  5. 09 7月, 2008 1 次提交
  6. 08 7月, 2008 3 次提交
    • I
      printk: export console_drivers · a29d1cfe
      Ingo Molnar 提交于
      this symbol is needed by drivers/video/xen-fbfront.ko.
      
      [ cherry-picked from tip/core/printk ]
      Signed-off-by: NIngo Molnar <mingo@elte.hu>
      a29d1cfe
    • M
      sched, numa: replace MAX_NUMNODES with nr_node_ids in kernel/sched.c · 076ac2af
      Mike Travis 提交于
        * Replace usages of MAX_NUMNODES with nr_node_ids in kernel/sched.c,
          where appropriate.  This saves some allocated space as well as many
          wasted cycles going through node entries that are non-existent.
      Signed-off-by: NMike Travis <travis@sgi.com>
      Signed-off-by: NIngo Molnar <mingo@elte.hu>
      076ac2af
    • T
      x86, clockevents: add C1E aware idle function · aa276e1c
      Thomas Gleixner 提交于
      C1E on AMD machines is like C3 but without control from the OS. Up to
      now we disabled the local apic timer for those machines as it stops
      when the CPU goes into C1E. This excludes those machines from high
      resolution timers / dynamic ticks, which hurts especially X2 based
      laptops.
      
      The current boot time C1E detection has another, more serious flaw
      as well: some BIOSes do not enable C1E until the ACPI processor module
      is loaded. This causes systems to stop working after that point.
      
      To work nicely with C1E enabled machines we use a separate idle
      function, which checks on idle entry whether C1E was enabled in the
      Interrupt Pending Message MSR. This allows us to do timer broadcasting
      for C1E and covers the late enablement of C1E as well.
      Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
      Signed-off-by: NIngo Molnar <mingo@elte.hu>
      aa276e1c
  7. 05 7月, 2008 3 次提交
  8. 04 7月, 2008 4 次提交
    • A
      sched: fix accounting in task delay accounting & migration · 46ac22ba
      Ankita Garg 提交于
      On Thu, Jun 19, 2008 at 12:27:14PM +0200, Peter Zijlstra wrote:
      > On Thu, 2008-06-05 at 10:50 +0530, Ankita Garg wrote:
      >
      > > Thanks Peter for the explanation...
      > >
      > > I agree with the above and that is the reason why I did not see weird
      > > values with cpu_time. But, run_delay still would suffer skews as the end
      > > points for delta could be taken on different cpus due to migration (more
      > > so on RT kernel due to the push-pull operations). With the below patch,
      > > I could not reproduce the issue I had seen earlier. After every dequeue,
      > > we take the delta and start wait measurements from zero when moved to a
      > > different rq.
      >
      > OK, so task delay delay accounting is broken because it doesn't take
      > migration into account.
      >
      > What you've done is make it symmetric wrt enqueue, and account it like
      >
      >   cpu0      cpu1
      >
      > enqueue
      >  <wait-d1>
      > dequeue
      >             enqueue
      >              <wait-d2>
      >             run
      >
      > Where you add both d1 and d2 to the run_delay,.. right?
      >
      
      Thanks for reviewing the patch. The above is exactly what I have done.
      
      > This seems like a good fix, however it looks like the patch will break
      > compilation in !CONFIG_SCHEDSTATS && !CONFIG_TASK_DELAY_ACCT, of it
      > failing to provide a stub for sched_info_dequeue() in that case.
      
      Fixed. Pl. find the new patch below.
      Signed-off-by: NAnkita Garg <ankita@in.ibm.com>
      Acked-by: NPeter Zijlstra <peterz@infradead.org>
      Cc: Gregory Haskins <ghaskins@novell.com>
      Cc: rostedt@goodmis.org
      Cc: suresh.b.siddha@intel.com
      Cc: aneesh.kumar@linux.vnet.ibm.com
      Cc: dhaval@linux.vnet.ibm.com
      Cc: vatsa@linux.vnet.ibm.com
      Cc: David Bahi <DBahi@novell.com>
      Signed-off-by: NIngo Molnar <mingo@elte.hu>
      46ac22ba
    • G
      sched: add avg-overlap support to RT tasks · 2087a1ad
      Gregory Haskins 提交于
      We have the notion of tracking process-coupling (a.k.a. buddy-wake) via
      the p->se.last_wake / p->se.avg_overlap facilities, but it is only used
      for cfs to cfs interactions.  There is no reason why an rt to cfs
      interaction cannot share in establishing a relationhip in a similar
      manner.
      
      Because PREEMPT_RT runs many kernel threads as FIFO priority, we often
      times have heavy interaction between RT threads waking CFS applications.
      This patch offers a substantial boost (50-60%+) in perfomance under those
      circumstances.
      Signed-off-by: NGregory Haskins <ghaskins@novell.com>
      Cc: npiggin@suse.de
      Cc: rostedt@goodmis.org
      Acked-by: NPeter Zijlstra <a.p.zijlstra@chello.nl>
      Signed-off-by: NIngo Molnar <mingo@elte.hu>
      2087a1ad
    • G
      sched: terminate newidle balancing once at least one task has moved over · c4acb2c0
      Gregory Haskins 提交于
      Inspired by Peter Zijlstra.
      Signed-off-by: NGregory Haskins <ghaskins@novell.com>
      Cc: npiggin@suse.de
      Cc: rostedt@goodmis.org
      Acked-by: NPeter Zijlstra <a.p.zijlstra@chello.nl>
      Signed-off-by: NIngo Molnar <mingo@elte.hu>
      c4acb2c0
    • S
      hrtimer: prevent migration for raising softirq · ee3ece83
      Steven Rostedt 提交于
      Due to a possible deadlock, the waking of the softirq was pushed outside
      of the hrtimer base locks. See commit 0c96c597
      
      Unfortunately this allows the task to migrate after setting up the softirq
      and raising it. Since softirqs run a queue that is per-cpu we may raise the
      softirq on the wrong CPU and this will keep the queued softirq task from
      running.
      
      To solve this issue, this patch disables preemption around the releasing
      of the hrtimer lock and raising of the softirq.
      Signed-off-by: NSteven Rostedt <srostedt@redhat.com>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      ee3ece83
  9. 03 7月, 2008 3 次提交
  10. 01 7月, 2008 2 次提交
    • P
      mmiotrace broken in linux-next (8-bit writes only) · 3e61e0c9
      Pekka Paalanen 提交于
      The moment mmiotrace is enabled, I hit a NULL deref in:
      
      IP: [<ffffffff80256e71>] __trace_special+0x17c/0x23a
      Call Trace:
       [<ffffffff802573cc>] ftrace_special+0x6f/0x9a
       [<ffffffff8023e3e4>] down+0x19/0x4a
       [<ffffffff80228adc>] acquire_console_sem+0x42/0x58
       [<ffffffff8035d273>] con_flush_chars+0x28/0x43
       [<ffffffff80354a70>] write_chan+0x22e/0x334
       [<ffffffff802244e9>] ? default_wake_function+0x0/0xf
       [<ffffffff8035236d>] tty_write+0x195/0x228
       [<ffffffff80354842>] ? write_chan+0x0/0x334
       [<ffffffff8027c23a>] vfs_write+0xae/0x137
       [<ffffffff8027c6e3>] sys_write+0x47/0x70
       [<ffffffff8020b1db>] system_call_after_swapgs+0x7b/0x80
      
      which means 'entry' in __trace_special() is NULL.
      
      [ mingo@elte.hu: that ftrace_special() was a leftover. ]
      Signed-off-by: NPekka Paalanen <pq@iki.fi>
      Cc: Steven Rostedt <srostedt@redhat.com>
      Cc: proski@gnu.org
      Cc: "Vegard Nossum" <vegard.nossum@gmail.com>
      Signed-off-by: NIngo Molnar <mingo@elte.hu>
      3e61e0c9
    • G
      rcu: fix hotplug vs rcu race · 8558f8f8
      Gautham R Shenoy 提交于
      Dhaval Giani reported this warning during cpu hotplug stress-tests:
      
      | On running kernel compiles in parallel with cpu hotplug:
      |
      | WARNING: at arch/x86/kernel/smp.c:118
      | native_smp_send_reschedule+0x21/0x36()
      | Modules linked in:
      | Pid: 27483, comm: cc1 Not tainted 2.6.26-rc7 #1
      | [...]
      |  [<c0110355>] native_smp_send_reschedule+0x21/0x36
      |  [<c014fe8f>] force_quiescent_state+0x47/0x57
      |  [<c014fef0>] call_rcu+0x51/0x6d
      |  [<c01713b3>] __fput+0x130/0x158
      |  [<c0171231>] fput+0x17/0x19
      |  [<c016fd99>] filp_close+0x4d/0x57
      |  [<c016fdff>] sys_close+0x5c/0x97
      
      IMHO the warning is a spurious one.
      
      cpu_online_map is updated by the _cpu_down() using stop_machine_run().
      Since force_quiescent_state is invoked from irqs disabled section,
      stop_machine_run() won't be executing while a cpu is executing
      force_quiescent_state(). Hence the cpu_online_map is stable while we're
      in the irq disabled section.
      
      However, a cpu might have been offlined _just_ before we disabled irqs
      while entering force_quiescent_state(). And rcu subsystem might not yet
      have handled the CPU_DEAD notification, leading to the offlined cpu's
      bit being set in the rcp->cpumask.
      
      Hence cpumask = (rcp->cpumask & cpu_online_map) to prevent sending
      smp_reschedule() to an offlined CPU.
      
      Here's the timeline:
      
      CPU_A						 CPU_B
      --------------------------------------------------------------
      cpu_down():					.
      .					   	.
      .						.
      stop_machine(): /* disables preemption,		.
      		 * and irqs */			.
      .						.
      .						.
      take_cpu_down();				.
      .						.
      .						.
      .						.
      cpu_disable(); /*this removes cpu 		.
      		*from cpu_online_map 		.
      		*/				.
      .						.
      .						.
      restart_machine(); /* enables irqs */		.
      ------WINDOW DURING WHICH rcp->cpumask is stale ---------------
      .						call_rcu();
      .						/* disables irqs here */
      .						.force_quiescent_state();
      .CPU_DEAD:					.for_each_cpu(rcp->cpumask)
      .						.   smp_send_reschedule();
      .						.
      .						.   WARN_ON() for offlined CPU!
      .
      .
      .
      rcu_cpu_notify:
      .
      -------- WINDOW ENDS ------------------------------------------
      rcu_offline_cpu() /* Which calls cpu_quiet()
      		   * which removes
      		   * cpu from rcp->cpumask.
      		   */
      
      If a new batch was started just before calling stop_machine_run(), the
      "tobe-offlined" cpu is still present in rcp-cpumask.
      
      During a cpu-offline, from take_cpu_down(), we queue an rt-prio idle
      task as the next task to be picked by the scheduler. We also call
      cpu_disable() which will disable any further interrupts and remove the
      cpu's bit from the cpu_online_map.
      
      Once the stop_machine_run() successfully calls take_cpu_down(), it calls
      schedule(). That's the last time a schedule is called on the offlined
      cpu, and hence the last time when rdp->passed_quiesc will be set to 1
      through rcu_qsctr_inc().
      
      But the cpu_quiet() will be on this cpu will be called only when the
      next RCU_SOFTIRQ occurs on this CPU. So at this time, the offlined CPU
      is still set in rcp->cpumask.
      
      Now coming back to the idle_task which truely offlines the CPU, it does
      check for a pending RCU and raises the softirq, since it will find
      rdp->passed_quiesc to be 0 in this case. However, since the cpu is
      offline I am not sure if the softirq will trigger on the CPU.
      
      Even if it doesn't the rcu_offline_cpu() will find that rcp->completed
      is not the same as rcp->cur, which means that our cpu could be holding
      up the grace period progression. Hence we call cpu_quiet() and move
      ahead.
      
      But because of the window explained in the timeline, we could still have
      a call_rcu() before the RCU subsystem executes it's CPU_DEAD
      notification, and we send smp_send_reschedule() to offlined cpu while
      trying to force the quiescent states. The appended patch adds comments
      and prevents checking for offlined cpu everytime.
      
      cpu_online_map is updated by the _cpu_down() using stop_machine_run().
      Since force_quiescent_state is invoked from irqs disabled section,
      stop_machine_run() won't be executing while a cpu is executing
      force_quiescent_state(). Hence the cpu_online_map is stable while we're
      in the irq disabled section.
      Reported-by: NDhaval Giani <dhaval@linux.vnet.ibm.com>
      Signed-off-by: NGautham R Shenoy <ego@in.ibm.com>
      Acked-by: NDhaval Giani <dhaval@linux.vnet.ibm.com>
      Cc: Dipankar Sarma <dipankar@in.ibm.com>
      Cc: laijs@cn.fujitsu.com
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Rusty Russel <rusty@rustcorp.com.au>
      Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
      Signed-off-by: NIngo Molnar <mingo@elte.hu>
      8558f8f8