1. 13 March 2010 (3 commits)
    • tracing: Do not record user stack trace from NMI context · b6345879
      Committed by Steven Rostedt
      A bug was found with Li Zefan's ftrace_stress_test that caused applications
      to segfault during the test.
      
      Placing a tracing_off() in the segfault code, and examining several
      traces, I found that the following was always the case. The lock tracer
      was enabled (lockdep being required) and userstack was enabled. Testing
      this out, I just enabled the two, but that was not good enough. I needed
      to run something else that could trigger it. Running a load like hackbench
      did not work, but executing a new program would. The following would
      trigger the segfault within seconds:
      
        # echo 1 > /debug/tracing/options/userstacktrace
        # echo 1 > /debug/tracing/events/lock/enable
        # while :; do ls > /dev/null ; done
      
      Enabling the function graph tracer and looking at what was happening,
      I finally noticed that all crashes happened just after an NMI.
      
       1)               |    copy_user_handle_tail() {
       1)               |      bad_area_nosemaphore() {
       1)               |        __bad_area_nosemaphore() {
       1)               |          no_context() {
       1)               |            fixup_exception() {
       1)   0.319 us    |              search_exception_tables();
       1)   0.873 us    |            }
      [...]
       1)   0.314 us    |  __rcu_read_unlock();
       1)   0.325 us    |    native_apic_mem_write();
       1)   0.943 us    |  }
       1)   0.304 us    |  rcu_nmi_exit();
      [...]
       1)   0.479 us    |  find_vma();
       1)               |  bad_area() {
       1)               |    __bad_area() {
      
      After capturing several traces of failures, all of them happened
      after an NMI. Curious about this, I added a trace_printk() to the NMI
      handler to read the regs->ip and see where the NMI happened, and
      found that it was here:
      
      ffffffff8135b660 <page_fault>:
      ffffffff8135b660:       48 83 ec 78             sub    $0x78,%rsp
      ffffffff8135b664:       e8 97 01 00 00          callq  ffffffff8135b800 <error_entry>
      
      What was happening is that the NMI would fire at the point where a page
      fault had just occurred. It would call rcu_read_lock(), which was traced by
      the lock events, and the user stack trace would run. This would trigger
      a page fault inside the NMI. I do not see where the CR2 register is
      saved or restored in NMI handling, which means it would corrupt
      the page fault handling that the NMI interrupted.
      
      The reason the while loop of ls helped trigger the bug is that
      each execution of ls faulted in lots of pages, increasing the
      chances of the race happening.
      
      The simple solution is to not allow user stack traces in NMI context.
      After this patch, I ran the above "ls" test for a couple of hours
      without any issues. Without this patch, the bug would trigger in less
      than a minute.
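      As a rough sketch of what such a guard looks like (assuming it sits at
      the top of the user stack trace recording path; the placement and the
      in_nmi() test here are illustrative, not quoted from the patch):

        /* Do not record a user stack trace from NMI context: doing so can
         * fault on user memory and clobber the CR2 of the page fault the
         * NMI interrupted. */
        static void ftrace_trace_userstack(struct ring_buffer *buffer,
                                           unsigned long flags, int pc)
        {
                if (!(trace_flags & TRACE_ITER_USERSTACKTRACE))
                        return;

                if (unlikely(in_nmi()))
                        return;

                /* ... existing code that reserves the event and copies the
                 *     user stack follows ... */
        }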
      
      Cc: stable@kernel.org
      Reported-by: Li Zefan <lizf@cn.fujitsu.com>
      Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
    • tracing: Disable buffer switching when starting or stopping trace · a2f80714
      Committed by Steven Rostedt
      When the trace iterator is read, tracing_start() and tracing_stop()
      are called to stop tracing while the iterator is processing the trace
      output.
      
      These functions disable both the standard buffer and the max latency
      buffer. But if the wakeup tracer is running, it can switch these
      buffers between the two disables:
      
        buffer = global_trace.buffer;
        if (buffer)
            ring_buffer_record_disable(buffer);
      
            <<<--------- swap happens here
      
        buffer = max_tr.buffer;
        if (buffer)
            ring_buffer_record_disable(buffer);
      
      The result is that the same buffer gets disabled twice, and on
      tracing_start() the same buffer can be enabled twice. Every
      ring_buffer_record_disable() must be matched with a
      ring_buffer_record_enable(), or a buffer can be disabled permanently
      or enabled prematurely, causing a bug where a reset happens while a
      trace is committing.
      
      This patch protects these two by taking the ftrace_max_lock to prevent
      a switch from occurring.
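      As a sketch of the shape of the fix (the lock name matches the max-trace
      lock the tracer already uses for buffer swaps; the bookkeeping of the
      real function is omitted):

        void tracing_stop(void)
        {
                struct ring_buffer *buffer;
                unsigned long flags;

                raw_local_irq_save(flags);
                arch_spin_lock(&ftrace_max_lock);   /* block a buffer swap */

                buffer = global_trace.buffer;
                if (buffer)
                        ring_buffer_record_disable(buffer);

                buffer = max_tr.buffer;
                if (buffer)
                        ring_buffer_record_disable(buffer);

                arch_spin_unlock(&ftrace_max_lock);
                raw_local_irq_restore(flags);
        }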
      
      Found with Li Zefan's ftrace_stress_test.
      
      Cc: stable@kernel.org
      Reported-by: Lai Jiangshan <laijs@cn.fujitsu.com>
      Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
    • tracing: Use same local variable when resetting the ring buffer · 283740c6
      Committed by Steven Rostedt
      In the ftrace code that resets the ring buffer, the buffer is referenced
      through a local variable, but tr->buffer is then passed to the reset. If
      the wakeup tracer is running, which can swap tr->buffer with the max
      saved buffer, this can break the requirement that the buffer be disabled
      before the reset.
      
         buffer = tr->buffer;
         ring_buffer_record_disable(buffer);
         synchronize_sched();
         __tracing_reset(tr->buffer, cpu);
      
      If the tr->buffer is swapped, then the reset is not happening to the
      buffer that was disabled. This will cause the ring buffer to fail.
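      A sketch of the corrected sequence, reusing the same local variable that
      was disabled so a concurrent swap of tr->buffer cannot change which
      buffer is reset (illustrative, not the full function):

         buffer = tr->buffer;
         ring_buffer_record_disable(buffer);
         synchronize_sched();
         /* reset the buffer we actually disabled */
         __tracing_reset(buffer, cpu);
         ring_buffer_record_enable(buffer);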
      
      Found with Li Zefan's ftrace_stress_test.
      
      Cc: stable@kernel.org
      Reported-by: Lai Jiangshan <laijs@cn.fujitsu.com>
      Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
  2. 06 March 2010 (2 commits)
    • function-graph: Add tracing_thresh support to function_graph tracer · 0e950173
      Committed by Tim Bird
      Add support for tracing_thresh to the function_graph tracer.  This
      version of this feature isolates the checks into new entry and
      return functions, to avoid adding more conditional code into the
      main function_graph paths.
      
      When the tracing_thresh is set and the function graph tracer is
      enabled, only the functions that took longer than the time in
      microseconds that was set in tracing_thresh are recorded. To do this
      efficiently, only the function exits are recorded:
      
       [tracing]# echo 100 > tracing_thresh
       [tracing]# echo function_graph > current_tracer
       [tracing]# cat trace
       # tracer: function_graph
       #
       # CPU  DURATION                  FUNCTION CALLS
       # |     |   |                     |   |   |   |
        1) ! 119.214 us  |  } /* smp_apic_timer_interrupt */
        1)   <========== |
        0) ! 101.527 us  |              } /* __rcu_process_callbacks */
        0) ! 126.461 us  |            } /* rcu_process_callbacks */
        0) ! 145.111 us  |          } /* __do_softirq */
        0) ! 149.667 us  |        } /* do_softirq */
        0) ! 168.817 us  |      } /* irq_exit */
        0) ! 248.254 us  |    } /* smp_apic_timer_interrupt */
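      The return-side check is roughly of the following shape (a sketch of the
      dedicated return handler the patch describes; the handler name and the
      assumption that tracing_thresh is kept in the same time base as the
      call/return timestamps are illustrative):

        static void trace_graph_thresh_return(struct ftrace_graph_ret *trace)
        {
                /* Only record the exit event if the function ran at least as
                 * long as the configured threshold. */
                if (tracing_thresh &&
                    (trace->rettime - trace->calltime < tracing_thresh))
                        return;

                trace_graph_return(trace);
        }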
      
      Also, add support for specifying tracing_thresh on the kernel
      command line. When used like so: "tracing_thresh=200 ftrace=function_graph",
      this can be used to analyse system startup. It is important to disable
      tracing soon after boot to avoid losing the trace data.
      Acked-by: Frederic Weisbecker <fweisbec@gmail.com>
      Signed-off-by: Tim Bird <tim.bird@am.sony.com>
      LKML-Reference: <4B87098B.4040308@am.sony.com>
      Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
    • tracing: Update the comm field in the right variable in update_max_tr · 1acaa1b2
      Committed by Arnaldo Carvalho de Melo
      The latency output showed:
      
       #    | task: -3 (uid:0 nice:0 policy:1 rt_prio:99)
      
      The comm is missing from the "task:" field, and the output looks like
      a minus 3. The correct display should be:
      
       #    | task: migration/0-3 (uid:0 nice:0 policy:1 rt_prio:99)
      
      The problem is that the comm is being stored in the wrong data
      structure. The max_tr.data[cpu] is what stores the comm, not the
      tr->data[cpu].
      
      Before this patch the max_tr.data[cpu]->comm was zeroed and the /debug/trace
      ended up showing just the '-' sign followed by the pid.
      
      Also remove a needless initialization of max_data.
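      In other words, the comm has to be copied into the max-latency buffer's
      per-cpu data, which is what the latency header reads. Roughly (a sketch
      of the relevant lines in update_max_tr(); surrounding fields omitted):

         struct trace_array_cpu *max_data = max_tr.data[cpu];

         /* store the comm in max_tr's per-cpu data, not in tr->data[cpu];
            the latency header reads it from here */
         memcpy(max_data->comm, tsk->comm, TASK_COMM_LEN);
         max_data->pid = tsk->pid;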
      Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
      LKML-Reference: <1267824230-23861-1-git-send-email-acme@infradead.org>
      Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
  3. 03 March 2010 (1 commit)
    • tracing: Fix warning in s_next of trace file ops · ac91d854
      Committed by Lai Jiangshan
      This warning in s_next() can be triggered by lseek():
       [<c018b3f7>] ? s_next+0x77/0x80
       [<c013e3c1>] warn_slowpath_common+0x81/0xa0
       [<c018b3f7>] ? s_next+0x77/0x80
       [<c013e3fa>] warn_slowpath_null+0x1a/0x20
       [<c018b3f7>] s_next+0x77/0x80
       [<c01efa77>] traverse+0x117/0x200
       [<c01eff13>] seq_lseek+0xa3/0x120
       [<c01efe70>] ? seq_lseek+0x0/0x120
       [<c01d7081>] vfs_llseek+0x41/0x50
       [<c01d8116>] sys_llseek+0x66/0xa0
       [<c0102bd0>] sysenter_do_call+0x12/0x26
      
      The iterator "leftover" variable is zeroed when the trace file is opened.
      But lseek can call s_start(), which will call s_next() without
      resetting the "leftover" variable back to zero, which might trigger
      the WARN_ON_ONCE(iter->leftover) in s_next().
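      A sketch of the fix: clear the flag whenever s_start() rewinds the
      iterator, rather than only at open time (field names as used in the
      description above; the surrounding function is abbreviated):

        /* In s_start(): when the position moved (e.g. because of lseek),
         * restart the iteration and drop any half-emitted entry from the
         * previous pass. */
        if (*pos != iter->pos) {
                iter->ent = NULL;
                iter->cpu = 0;
                iter->idx = -1;
                iter->leftover = 0;     /* the fix */
        }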
      
      Cc: stable@kernel.org
      Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
      LKML-Reference: <4B8CE06A.9090207@cn.fujitsu.com>
      Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
  4. 26 January 2010 (1 commit)
    • tracing: Prevent kernel oops with corrupted buffer · 74bf4076
      Committed by Steven Rostedt
      If the contents of the ftrace ring buffer get corrupted and the trace
      file is read, it could create a kernel oops (usually just killing the user
      task thread). This is caused by the way the pid found in the buffer is
      used: if the pid is negative, it is still used to index the cmdline cache
      array, which could point to an invalid address.
      
      The simple fix is to test for negative PIDs.
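      The test amounts to a bounds check before indexing the cmdline cache,
      roughly as below (a sketch; the "<XXX>" placeholder comm is illustrative):

        /* In trace_find_cmdline(): a corrupted buffer can hand us a bogus
         * pid, so refuse to use it as an index into the cmdline cache. */
        if (unlikely(pid < 0)) {
                strcpy(comm, "<XXX>");
                return;
        }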
      Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
  5. 07 January 2010 (2 commits)
    • tracing: Add stack dump to trace_printk if stacktrace option is set · d931369b
      Committed by Steven Rostedt
      If the ftrace stacktrace option is set, then add the stack dumps to
      trace_printk.
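      Roughly, the trace_printk() path gains a check of the existing stacktrace
      option after committing the print event (a sketch; the stack-recording
      helper's name and arguments are assumptions, not quoted from the patch):

        if (trace_flags & TRACE_ITER_STACKTRACE)
                /* record the kernel stack into the trace buffer as well,
                 * skipping the trace_printk internals themselves */
                __trace_stack(tr, flags, skip, pc);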
      Requested-by: Peter Zijlstra <peterz@infradead.org>
      Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
    • tracing: Consolidate protection of reader access to the ring buffer · 7e53bd42
      Committed by Lai Jiangshan
      At the beginning, access to the ring buffer was fully serialized
      by trace_types_lock. Patch d7350c3f gives more freedom to readers,
      and patch b04cc6b1 adds code to protect trace_pipe and cpu#/trace_pipe.

      But that is not enough: ring buffer readers are not always
      read-only; they may consume data.
      
      This patch serializes accesses to trace, trace_pipe, trace_pipe_raw,
      cpu#/trace, cpu#/trace_pipe and cpu#/trace_pipe_raw, and removes
      tracing_reader_cpumask, which was used to protect trace_pipe.
      
      Details:
      
      The ring buffer serializes readers, but that is low-level protection.
      The validity of the events (returned by ring_buffer_peek() etc.)
      is not protected by the ring buffer.

      The content of events may become garbage if we allow another process to
      consume these events concurrently:
        A) the page holding the consumed events may become a normal page
           (no longer the reader page) in the ring buffer, and be rewritten
           by the events producer.
        B) the page holding the consumed events may become a page for
           splice_read, and be returned to the system.
      
      This patch adds trace_access_lock() and trace_access_unlock() primitives.

      These primitives allow multiple processes to access different cpu ring
      buffers concurrently.

      They do not distinguish read-only from read-consume access;
      multiple read-only accesses are also serialized.

      The primitives are not taken when the files are opened,
      only when the files are read.
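      One plausible shape for such primitives: a per-cpu lock for the per-cpu
      files plus an outer reader-writer lock taken exclusively by whole-buffer
      readers, so readers of different cpu buffers do not serialize against
      each other (a sketch under those assumptions, not necessarily the
      merged code):

        static DECLARE_RWSEM(all_cpu_access_lock);
        static DEFINE_PER_CPU(struct mutex, cpu_access_lock);

        static void trace_access_lock(int cpu)
        {
                if (cpu == TRACE_PIPE_ALL_CPU) {
                        /* reading the whole buffer excludes everyone */
                        down_write(&all_cpu_access_lock);
                } else {
                        /* a per-cpu reader excludes whole-buffer readers and
                         * other readers of the same cpu only */
                        down_read(&all_cpu_access_lock);
                        mutex_lock(&per_cpu(cpu_access_lock, cpu));
                }
        }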
      Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
      LKML-Reference: <4B447D52.1050602@cn.fujitsu.com>
      Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
  6. 22 December 2009 (1 commit)
  7. 16 December 2009 (1 commit)
  8. 15 December 2009 (4 commits)
  9. 14 December 2009 (4 commits)
  10. 12 December 2009 (1 commit)
  11. 11 December 2009 (1 commit)
    • tracing: Add trace_dump_stack() · 03889384
      Committed by Steven Rostedt
      I've been asked a few times about how to find out what is calling
      some location in the kernel. One way is to use dynamic function tracing
      and enable func_stack_trace. But that only finds out who is
      calling a particular function; it does not tell you which caller
      reaches a specific if conditional inside that function.

      I have implemented a quick version of trace_dump_stack() for this
      purpose a few times, and just needed it again. That is when I realized
      it would be a good tool to have in the kernel, like trace_printk().
      
      Using trace_dump_stack() is similar to dump_stack() except that it
      writes to the trace buffer instead and can be used in critical locations.
      
      For example:
      
      @@ -5485,8 +5485,12 @@ need_resched_nonpreemptible:
       	if (prev->state && !(preempt_count() & PREEMPT_ACTIVE)) {
       		if (unlikely(signal_pending_state(prev->state, prev)))
       			prev->state = TASK_RUNNING;
      -		else
      +		else {
       			deactivate_task(rq, prev, 1);
      +			trace_printk("Deactivating task %s:%d\n",
      +				     prev->comm, prev->pid);
      +			trace_dump_stack();
      +		}
       		switch_count = &prev->nvcsw;
       	}
      
      Produces:
      
                 <...>-3249  [001]   296.105269: schedule: Deactivating task ntpd:3249
                 <...>-3249  [001]   296.105270: <stack trace>
       => schedule
       => schedule_hrtimeout_range
       => poll_schedule_timeout
       => do_select
       => core_sys_select
       => sys_select
       => system_call_fastpath
      Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
  12. 10 December 2009 (3 commits)
    • tracing: Remove comparing of NULL to va_list in trace_array_vprintk() · f2942487
      Committed by Carsten Emde
      Olof Johansson stated the following:
      
        Comparing a va_list with NULL is bogus. It's supposed to be treated like
        an opaque type and only be manipulated with va_* accessors.
      
      Olof noticed that this code broke the ARM builds:
      
          kernel/trace/trace.c: In function 'trace_array_vprintk':
          kernel/trace/trace.c:1364: error: invalid operands to binary == (have 'va_list' and 'void *')
          kernel/trace/trace.c: In function 'tracing_mark_write':
          kernel/trace/trace.c:3349: error: incompatible type for argument 3 of 'trace_vprintk'
      
      This patch partly reverts c13d2f7c and
      re-installs the original mark_printk() mechanism.
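      The portable pattern is to never test a va_list at all: the marker write
      path gets its own varargs wrapper and builds the va_list itself. A sketch
      of what the re-installed helper looks like (the exact call into the
      tracer is assumed from the description above):

        static int mark_printk(const char *fmt, ...)
        {
                int ret;
                va_list args;

                va_start(args, fmt);
                /* hand the va_list straight to the tracer; no NULL compare */
                ret = trace_vprintk(0, fmt, args);
                va_end(args);
                return ret;
        }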
      Reported-by: Olof Johansson <olof@lixom.net>
      Signed-off-by: Carsten Emde <C.Emde@osadl.org>
      LKML-Reference: <4B1BAB74.104@osadl.org>
      Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
    • tracing: Buffer the output of seq_file in case of filled buffer · a63ce5b3
      Committed by Steven Rostedt
      If seq_read fills the buffer, it will call s_start again on the next
      iteration with the same position. This causes a problem for the
      function_graph tracer because it consumes the iteration in order to
      determine leaf functions.
      
      What happens is that the iterator stores the entry, and the function
      graph plugin will look at the next entry. If that next entry is a return
      of the same function and task, then the function is a leaf and the
      function_graph plugin calls ring_buffer_read which moves the ring buffer
      iterator forward (the trace iterator still points to the function start
      entry).
      
      The copying of the trace_seq to the seq_file buffer will fail if the
      seq_file buffer is full. The seq_read will not show this entry.
      The next read by userspace will cause seq_read to again call s_start
      which will reuse the trace iterator entry (the function start entry).
      But the function return entry was already consumed. The function graph
      plugin will think that this entry is a nested function and not a leaf.
      
      To solve this, the trace code now checks the return status of the
      seq_printf (trace_print_seq). If the writing to the seq_file buffer
      fails, we set a flag in the iterator (leftover) and we do not reset
      the trace_seq buffer. On the next call to s_start, we check the leftover
      flag, and if it is set, we just reuse the trace_seq buffer and do not
      call into the plugin print functions.
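      Sketched in code, the read side behaves roughly like this (flag and
      helper names as used in the description above; both functions are
      abbreviated to the relevant lines):

        /* In s_show(): copy the formatted entry into the seq_file buffer
         * and remember if it did not fit. */
        ret = trace_print_seq(m, &iter->seq);
        if (ret)
                iter->leftover = ret;   /* keep the trace_seq, retry later */

        /* In s_start(): if there is leftover output, re-emit it before
         * asking the plugin to format a new entry. */
        if (iter->leftover)
                return iter;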
      
      Before this patch:
      
       2)               |      fput() {
       2)               |        __fput() {
       2)   0.550 us    |          inotify_inode_queue_event();
       2)               |          __fsnotify_parent() {
       2)   0.540 us    |          inotify_dentry_parent_queue_event();
      
      After the patch:
      
       2)               |      fput() {
       2)               |        __fput() {
       2)   0.550 us    |          inotify_inode_queue_event();
       2)   0.548 us    |          __fsnotify_parent();
       2)   0.540 us    |          inotify_dentry_parent_queue_event();
      
      [
        Updated the patch to fix a missing return 0 from the trace_print_seq()
        stub when CONFIG_TRACING is disabled.
      Reported-by: Ingo Molnar <mingo@elte.hu>
      ]
      Reported-by: Jiri Olsa <jolsa@redhat.com>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
    • tracing: Only call pipe_close if pipe_close is defined · 29bf4a5e
      Committed by Steven Rostedt
      This fixes a cut-and-paste error that caused pipe_close to be called
      if pipe_open was defined (rather than pipe_close).
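      The fix amounts to testing the right callback before calling it
      (a sketch of the release path; surrounding code omitted):

        /* In tracing_release_pipe(): call pipe_close only when the plugin
         * actually provides one. The copy/paste bug tested pipe_open here. */
        if (iter->trace->pipe_close)
                iter->trace->pipe_close(iter);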
      Reported-by: Kosaki Motohiro <kosaki.motohiro@jp.fujitsu.com>
      LKML-Reference: <20091209153204.F4CD.A69D9226@jp.fujitsu.com>
      Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
  13. 08 December 2009 (1 commit)
    • tracing: Add pipe_close interface · c521efd1
      Committed by Steven Rostedt
      An ftrace plugin can provide a pipe_open hook that is called when the user
      opens trace_pipe. But if the plugin allocates something in pipe_open,
      it cannot free it because there is no pipe_close. The hook for opening
      the trace file has a corresponding close; closing the trace_pipe file
      should have one as well.
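      Concretely, this adds a close hook next to the existing open hook in the
      tracer ops (a sketch of the struct change; the other members of struct
      tracer are omitted):

        struct tracer {
                /* existing hook, called when trace_pipe is opened */
                void    (*pipe_open)(struct trace_iterator *iter);
                /* new hook, called when trace_pipe is released, so a plugin
                 * can free anything it allocated in pipe_open */
                void    (*pipe_close)(struct trace_iterator *iter);
        };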
      Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
  14. 17 November 2009 (1 commit)
  15. 12 November 2009 (1 commit)
  16. 29 October 2009 (2 commits)
    • percpu: remove per_cpu__ prefix. · dd17c8f7
      Committed by Rusty Russell
      Now that the return value of alloc_percpu is compatible with the address
      of per-cpu vars, it makes sense to hand around the address of per-cpu
      variables. To make this sane, we remove the per_cpu__ prefix we
      created to stop people accidentally using these vars directly.

      Now that we have sparse, we can use that instead (next patch).
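      For illustration, the visible effect is that a per-cpu variable's symbol
      is now just its declared name, and its per-cpu addresses can be handed
      around like the pointers returned by alloc_percpu() (the variable name
      below is made up):

        /* Declares a per-cpu int. Before this change the underlying symbol
         * was per_cpu__example_count, so the bare name could not be used
         * directly (accidentally or otherwise); access had to go through
         * the per_cpu accessors. */
        static DEFINE_PER_CPU(int, example_count);

        /* After the rename, taking the address of a given cpu's instance
         * yields an ordinary pointer ('cpu' assumed to be in scope). */
        int *p = &per_cpu(example_count, cpu);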
      
      tj: * Updated to convert stuff that was missed by or added after the
            original patch.
      
          * Kill per_cpu_var() macro.
      Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Reviewed-by: Christoph Lameter <cl@linux-foundation.org>
    • percpu: make percpu symbols in tracer unique · 9705f69e
      Committed by Tejun Heo
      This patch updates percpu-related symbols in the kernel tracer so that
      percpu symbols are unique and don't clash with local symbols. This
      serves two purposes: decreasing the possibility of global percpu
      symbol collisions and allowing the per_cpu__ prefix to be dropped
      from percpu symbols.
      
      * kernel/trace/trace.c: s/max_data/max_tr_data/
      * kernel/trace/trace_hw_branches: s/tracer/hwb_tracer/, s/buffer/hwb_buffer/
      
      Partly based on Rusty Russell's "alloc_percpu: rename percpu vars
      which cause name clashes" patch.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Steven Rostedt <rostedt@goodmis.org>
      Cc: Rusty Russell <rusty@rustcorp.com.au>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Ingo Molnar <mingo@redhat.com>
  17. 24 October 2009 (1 commit)
  18. 15 October 2009 (1 commit)
  19. 13 October 2009 (1 commit)
  20. 12 October 2009 (1 commit)
  21. 09 October 2009 (1 commit)
  22. 24 September 2009 (1 commit)
  23. 23 September 2009 (1 commit)
  24. 22 September 2009 (1 commit)
  25. 21 September 2009 (1 commit)
  26. 19 September 2009 (1 commit)
  27. 13 September 2009 (1 commit)