1. 14 8月, 2010 1 次提交
    • M
      tracing: Sanitize value returned from write(trace_marker, "...", len) · 1aa54bca
      Marcin Slusarz 提交于
      When userspace code writes non-new-line-terminated string to trace_marker
      file, write handler appends new-line and returns number of bytes written
      to trace buffer, so
      write(fd, "abc", 3) will return 4
      
      That's unexpected and unfortunately it confuses glibc's fprintf function.
      
      Example:
      int main() {
        fprintf(stderr, "abc");
        return 0;
      }
      
      $ gcc test.c -o test
      $ echo mmiotrace > /sys/kernel/debug/tracing/current_tracer
      $ ./test 2>/sys/kernel/debug/tracing/trace_marker
      
      results in infinite loop:
      write(fd, "abc", 3) = 4
      write(fd, "", 1) = 0
      write(fd, "", 1) = 0
      write(fd, "", 1) = 0
      write(fd, "", 1) = 0
      write(fd, "", 1) = 0
      write(fd, "", 1) = 0
      write(fd, "", 1) = 0
      (...)
      
      ...and kernel trace buffer full of empty markers.
      
      Fix it by sanitizing write return value.
      Signed-off-by: NMarcin Slusarz <marcin.slusarz@gmail.com>
      LKML-Reference: <20100727231801.GB2826@joi.lan>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Signed-off-by: NSteven Rostedt <rostedt@goodmis.org>
      1aa54bca
  2. 05 8月, 2010 1 次提交
  3. 23 7月, 2010 1 次提交
  4. 21 7月, 2010 2 次提交
    • K
      tracing: Shrink max latency ringbuffer if unnecessary · ef710e10
      KOSAKI Motohiro 提交于
      Documentation/trace/ftrace.txt says
      
        buffer_size_kb:
      
              This sets or displays the number of kilobytes each CPU
              buffer can hold. The tracer buffers are the same size
              for each CPU. The displayed number is the size of the
              CPU buffer and not total size of all buffers. The
              trace buffers are allocated in pages (blocks of memory
              that the kernel uses for allocation, usually 4 KB in size).
              If the last page allocated has room for more bytes
              than requested, the rest of the page will be used,
              making the actual allocation bigger than requested.
              ( Note, the size may not be a multiple of the page size
                due to buffer management overhead. )
      
              This can only be updated when the current_tracer
              is set to "nop".
      
      But it's incorrect. currently total memory consumption is
      'buffer_size_kb x CPUs x 2'.
      
      Why two times difference is there? because ftrace implicitly allocate
      the buffer for max latency too.
      
      That makes sad result when admin want to use large buffer. (If admin
      want full logging and makes detail analysis). example, If admin
      have 24 CPUs machine and write 200MB to buffer_size_kb, the system
      consume ~10GB memory (200MB x 24 x 2). umm.. 5GB memory waste is
      usually unacceptable.
      
      Fortunatelly, almost all users don't use max latency feature.
      The max latency buffer can be disabled easily.
      
      This patch shrink buffer size of the max latency buffer if
      unnecessary.
      Signed-off-by: NKOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      LKML-Reference: <20100701104554.DA2D.A69D9226@jp.fujitsu.com>
      Signed-off-by: NSteven Rostedt <rostedt@goodmis.org>
      ef710e10
    • L
      tracing: Allow to disable cmdline recording · e870e9a1
      Li Zefan 提交于
      We found that even enabling a single trace event that will rarely be
      triggered can add big overhead to context switch.
      
      (lmbench context switch test)
       -------------------------------------------------
       2p/0K 2p/16K 2p/64K 8p/16K 8p/64K 16p/16K 16p/64K
       ctxsw  ctxsw  ctxsw ctxsw  ctxsw   ctxsw   ctxsw
      ------ ------ ------ ------ ------ ------- -------
        2.19   2.3   2.21   2.56   2.13     2.54    2.07
        2.39   2.51  2.35   2.75   2.27     2.81    2.24
      
      The overhead is 6% ~ 11%.
      
      It's because when a trace event is enabled 3 tracepoints (sched_switch,
      sched_wakeup, sched_wakeup_new) will be activated to map pid to cmdname.
      
      We'd like to avoid this overhead, so add a trace option '(no)record-cmd'
      to allow to disable cmdline recording.
      Signed-off-by: NLi Zefan <lizf@cn.fujitsu.com>
      LKML-Reference: <4C2D57F4.2050204@cn.fujitsu.com>
      Signed-off-by: NSteven Rostedt <rostedt@goodmis.org>
      e870e9a1
  5. 20 7月, 2010 3 次提交
  6. 09 7月, 2010 1 次提交
  7. 09 6月, 2010 1 次提交
    • A
      tracing: Remove boot tracer · 30dbb20e
      Américo Wang 提交于
      The boot tracer is useless. It simply logs the initcalls
      but in fact these initcalls are also logged through printk
      while using the initcall_debug kernel parameter.
      
      Nobody seem to be using it so far. Then just remove it.
      Signed-off-by: NWANG Cong <xiyou.wangcong@gmail.com>
      Cc: Chase Douglas <chase.douglas@canonical.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Li Zefan <lizf@cn.fujitsu.com>
      LKML-Reference: <20100526105753.GA5677@cr0.nay.redhat.com>
      [ remove the hooks in main.c, and the headers ]
      Signed-off-by: NFrederic Weisbecker <fweisbec@gmail.com>
      30dbb20e
  8. 04 6月, 2010 1 次提交
    • S
      tracing: Remove ftrace_preempt_disable/enable · 5168ae50
      Steven Rostedt 提交于
      The ftrace_preempt_disable/enable functions were to address a
      recursive race caused by the function tracer. The function tracer
      traces all functions which makes it easily susceptible to recursion.
      One area was preempt_enable(). This would call the scheduler and
      the schedulre would call the function tracer and loop.
      (So was it thought).
      
      The ftrace_preempt_disable/enable was made to protect against recursion
      inside the scheduler by storing the NEED_RESCHED flag. If it was
      set before the ftrace_preempt_disable() it would not call schedule
      on ftrace_preempt_enable(), thinking that if it was set before then
      it would have already scheduled unless it was already in the scheduler.
      
      This worked fine except in the case of SMP, where another task would set
      the NEED_RESCHED flag for a task on another CPU, and then kick off an
      IPI to trigger it. This could cause the NEED_RESCHED to be saved at
      ftrace_preempt_disable() but the IPI to arrive in the the preempt
      disabled section. The ftrace_preempt_enable() would not call the scheduler
      because the flag was already set before entring the section.
      
      This bug would cause a missed preemption check and cause lower latencies.
      
      Investigating further, I found that the recusion caused by the function
      tracer was not due to schedule(), but due to preempt_schedule(). Now
      that preempt_schedule is completely annotated with notrace, the recusion
      no longer is an issue.
      Reported-by: NThomas Gleixner <tglx@linutronix.de>
      Signed-off-by: NSteven Rostedt <rostedt@goodmis.org>
      5168ae50
  9. 25 5月, 2010 1 次提交
    • S
      ring-buffer: Move zeroing out excess in page to ring buffer code · 2711ca23
      Steven Rostedt 提交于
      Currently the trace splice code zeros out the excess bytes in the page before
      sending it off to userspace.
      
      This is to make sure userspace is not getting anything it should not be
      when reading the pages, because the excess data was never initialized
      to zero before writing (for perfomance reasons).
      
      But the splice code has no business in doing this work, it should be
      done by the ring buffer. With the latest changes for recording lost
      events, the splice code gets it wrong anyway.
      
      Move the zeroing out of excess bytes into the ring buffer code.
      Signed-off-by: NSteven Rostedt <rostedt@goodmis.org>
      2711ca23
  10. 22 5月, 2010 1 次提交
  11. 15 5月, 2010 1 次提交
    • S
      tracing: Allow events to share their print functions · a9a57763
      Steven Rostedt 提交于
      Multiple events may use the same method to print their data.
      Instead of having all events have a pointer to their print funtions,
      the trace_event structure now points to a trace_event_functions structure
      that will hold the way to print ouf the event.
      
      The event itself is now passed to the print function to let the print
      function know what kind of event it should print.
      
      This opens the door to consolidating the way several events print
      their output.
      
         text	   data	    bss	    dec	    hex	filename
      4913961	1088356	 861512	6863829	 68bbd5	vmlinux.orig
      4900382	1048964	 861512	6810858	 67ecea	vmlinux.init
      4900446	1049028	 861512	6810986	 67ed6a	vmlinux.preprint
      
      This change slightly increases the size but is needed for the next change.
      
      v3: Fix the branch tracer events to handle this change.
      
      v2: Fix the new function graph tracer event calls to handle this change.
      Acked-by: NMathieu Desnoyers <mathieu.desnoyers@efficios.com>
      Acked-by: NMasami Hiramatsu <mhiramat@redhat.com>
      Acked-by: NFrederic Weisbecker <fweisbec@gmail.com>
      Signed-off-by: NSteven Rostedt <rostedt@goodmis.org>
      a9a57763
  12. 28 4月, 2010 2 次提交
    • D
      ring-buffer: Make non-consuming read less expensive with lots of cpus. · 72c9ddfd
      David Miller 提交于
      When performing a non-consuming read, a synchronize_sched() is
      performed once for every cpu which is actively tracing.
      
      This is very expensive, and can make it take several seconds to open
      up the 'trace' file with lots of cpus.
      
      Only one synchronize_sched() call is actually necessary.  What is
      desired is for all cpus to see the disabling state change.  So we
      transform the existing sequence:
      
      	for_each_cpu() {
      		ring_buffer_read_start();
      	}
      
      where each ring_buffer_start() call performs a synchronize_sched(),
      into the following:
      
      	for_each_cpu() {
      		ring_buffer_read_prepare();
      	}
      	ring_buffer_read_prepare_sync();
      	for_each_cpu() {
      		ring_buffer_read_start();
      	}
      
      wherein only the single ring_buffer_read_prepare_sync() call needs to
      do the synchronize_sched().
      
      The first phase, via ring_buffer_read_prepare(), allocates the 'iter'
      memory and increments ->record_disabled.
      
      In the second phase, ring_buffer_read_prepare_sync() makes sure this
      ->record_disabled state is visible fully to all cpus.
      
      And in the final third phase, the ring_buffer_read_start() calls reset
      the 'iter' objects allocated in the first phase since we now know that
      none of the cpus are adding trace entries any more.
      
      This makes openning the 'trace' file nearly instantaneous on a
      sparc64 Niagara2 box with 128 cpus tracing.
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      LKML-Reference: <20100420.154711.11246950.davem@davemloft.net>
      Signed-off-by: NSteven Rostedt <rostedt@goodmis.org>
      72c9ddfd
    • J
      tracing: Add graph output support for irqsoff tracer · 62b915f1
      Jiri Olsa 提交于
      Add function graph output to irqsoff tracer.
      
      The graph output is enabled by setting new 'display-graph' trace option.
      Signed-off-by: NJiri Olsa <jolsa@redhat.com>
      LKML-Reference: <1270227683-14631-4-git-send-email-jolsa@redhat.com>
      Signed-off-by: NSteven Rostedt <rostedt@goodmis.org>
      62b915f1
  13. 22 4月, 2010 1 次提交
    • F
      tracing: Dump either the oops's cpu source or all cpus buffers · cecbca96
      Frederic Weisbecker 提交于
      The ftrace_dump_on_oops kernel parameter, sysctl and sysrq let one
      dump every cpu buffers when an oops or panic happens.
      
      It's nice when you have few cpus but it may take ages if have many,
      plus you miss the real origin of the problem in all the cpu traces.
      
      Sometimes, all you need is to dump the cpu buffer that triggered the
      opps, most of the time it is our main interest.
      
      This patch modifies ftrace_dump_on_oops to handle this choice.
      
      The ftrace_dump_on_oops kernel parameter, when it comes alone, has
      the same behaviour than before. But ftrace_dump_on_oops=orig_cpu
      will only dump the buffer of the cpu that oops'ed.
      
      Similarly, sysctl kernel.ftrace_dump_on_oops=1 and
      echo 1 > /proc/sys/kernel/ftrace_dump_on_oops keep their previous
      behaviour. But setting 2 jumps into cpu origin dump mode.
      
      v2: Fix double setup
      v3: Fix spelling issues reported by Randy Dunlap
      v4: Also update __ftrace_dump in the selftests
      Signed-off-by: NFrederic Weisbecker <fweisbec@gmail.com>
      Acked-by: NDavid S. Miller <davem@davemloft.net>
      Acked-by: NSteven Rostedt <rostedt@goodmis.org>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Li Zefan <lizf@cn.fujitsu.com>
      Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
      cecbca96
  14. 05 4月, 2010 1 次提交
    • L
      tracing: Fix uninitialized variable of tracing/trace output · aa27497c
      Lai Jiangshan 提交于
      Because a local variable is not initialized, I got these
      when I did 'cat tracing/trace'. (not trace_pipe):
      
      CPU:0 [LOST 18446744071579453134 EVENTS]
                    ps-3099  [000]   560.770221: lock_acquire: ffff880030865010 &(&dentry->d_lock)->rlock
      CPU:0 [LOST 18446744071579453134 EVENTS]
                    ps-3099  [000]   560.770221: lock_release: ffff880030865010 &(&dentry->d_lock)->rlock
      CPU:0 [LOST 18446612133255294080 EVENTS]
                    ps-3099  [000]   560.770221: lock_acquire: ffff880030865010 &(&dentry->d_lock)->rlock
      CPU:0 [LOST 18446744071579453134 EVENTS]
                    ps-3099  [000]   560.770222: lock_release: ffff880030865010 &(&dentry->d_lock)->rlock
      CPU:0 [LOST 18446744071579453134 EVENTS]
                    ps-3099  [000]   560.770222: lock_release: ffffffff816cfb98 dcache_lock
      
      See peek_next_entry(), it does not set *lost_events when we 'cat tracing/trace'
      Signed-off-by: NLai Jiangshan <laijs@cn.fujitsu.com>
      LKML-Reference: <4BB9A929.2000303@cn.fujitsu.com>
      Signed-off-by: NSteven Rostedt <rostedt@goodmis.org>
      aa27497c
  15. 01 4月, 2010 2 次提交
    • S
      tracing: Show the lost events in the trace_pipe output · bc21b478
      Steven Rostedt 提交于
      Now that the ring buffer can keep track of where events are lost.
      Use this information to the output of trace_pipe:
      
             hackbench-3588  [001]  1326.701660: lock_acquire: ffffffff816591e0 read rcu_read_lock
             hackbench-3588  [001]  1326.701661: lock_acquire: ffff88003f4091f0 &(&dentry->d_lock)->rlock
             hackbench-3588  [001]  1326.701664: lock_release: ffff88003f4091f0 &(&dentry->d_lock)->rlock
      CPU:1 [LOST 673 EVENTS]
             hackbench-3588  [001]  1326.702711: kmem_cache_free: call_site=ffffffff81102b85 ptr=ffff880026d96738
             hackbench-3588  [001]  1326.702712: lock_release: ffff88003e1480a8 &mm->mmap_sem
             hackbench-3588  [001]  1326.702713: lock_acquire: ffff88003e1480a8 &mm->mmap_sem
      
      Even works with the function graph tracer:
      
       2) ! 170.098 us  |                                            }
       2)   4.036 us    |                                            rcu_irq_exit();
       2)   3.657 us    |                                            idle_cpu();
       2) ! 190.301 us  |                                          }
      CPU:2 [LOST 2196 EVENTS]
       2)   0.853 us    |                            } /* cancel_dirty_page */
       2)               |                            remove_from_page_cache() {
       2)   1.578 us    |                              _raw_spin_lock_irq();
       2)               |                              __remove_from_page_cache() {
      
      Note, it does not work with the iterator "trace" file, since it requires
      the use of consuming the page from the ring buffer to determine how many
      events were lost, which the iterator does not do.
      Signed-off-by: NSteven Rostedt <rostedt@goodmis.org>
      bc21b478
    • S
      ring-buffer: Add place holder recording of dropped events · 66a8cb95
      Steven Rostedt 提交于
      Currently, when the ring buffer drops events, it does not record
      the fact that it did so. It does inform the writer that the event
      was dropped by returning a NULL event, but it does not put in any
      place holder where the event was dropped.
      
      This is not a trivial thing to add because the ring buffer mostly
      runs in overwrite (flight recorder) mode. That is, when the ring
      buffer is full, new data will overwrite old data.
      
      In a produce/consumer mode, where new data is simply dropped when
      the ring buffer is full, it is trivial to add the placeholder
      for dropped events. When there's more room to write new data, then
      a special event can be added to notify the reader about the dropped
      events.
      
      But in overwrite mode, any new write can overwrite events. A place
      holder can not be inserted into the ring buffer since there never
      may be room. A reader could also come in at anytime and miss the
      placeholder.
      
      Luckily, the way the ring buffer works, the read side can find out
      if events were lost or not, and how many events. Everytime a write
      takes place, if it overwrites the header page (the next read) it
      updates a "overrun" variable that keeps track of the number of
      lost events. When a reader swaps out a page from the ring buffer,
      it can record this number, perfom the swap, and then check to
      see if the number changed, and take the diff if it has, which would be
      the number of events dropped. This can be stored by the reader
      and returned to callers of the reader.
      
      Since the reader page swap will fail if the writer moved the head
      page since the time the reader page set up the swap, this gives room
      to record the overruns without worrying about races. If the reader
      sets up the pages, records the overrun, than performs the swap,
      if the swap succeeds, then the overrun variable has not been
      updated since the setup before the swap.
      
      For binary readers of the ring buffer, a flag is set in the header
      of each sub page (sub buffer) of the ring buffer. This flag is embedded
      in the size field of the data on the sub buffer, in the 31st bit (the size
      can be 32 or 64 bits depending on the architecture), but only 27
      bits needs to be used for the actual size (less actually).
      
      We could add a new field in the sub buffer header to also record the
      number of events dropped since the last read, but this will change the
      format of the binary ring buffer a bit too much. Perhaps this change can
      be made if the information on the number of events dropped is considered
      important enough.
      
      Note, the notification of dropped events is only used by consuming reads
      or peeking at the ring buffer. Iterating over the ring buffer does not
      keep this information because the necessary data is only available when
      a page swap is made, and the iterator does not swap out pages.
      
      Cc: Robert Richter <robert.richter@amd.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: Li Zefan <lizf@cn.fujitsu.com>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: "Luis Claudio R. Goncalves" <lclaudio@uudg.org>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Signed-off-by: NSteven Rostedt <rostedt@goodmis.org>
      66a8cb95
  16. 30 3月, 2010 1 次提交
    • T
      include cleanup: Update gfp.h and slab.h includes to prepare for breaking... · 5a0e3ad6
      Tejun Heo 提交于
      include cleanup: Update gfp.h and slab.h includes to prepare for breaking implicit slab.h inclusion from percpu.h
      
      percpu.h is included by sched.h and module.h and thus ends up being
      included when building most .c files.  percpu.h includes slab.h which
      in turn includes gfp.h making everything defined by the two files
      universally available and complicating inclusion dependencies.
      
      percpu.h -> slab.h dependency is about to be removed.  Prepare for
      this change by updating users of gfp and slab facilities include those
      headers directly instead of assuming availability.  As this conversion
      needs to touch large number of source files, the following script is
      used as the basis of conversion.
      
        http://userweb.kernel.org/~tj/misc/slabh-sweep.py
      
      The script does the followings.
      
      * Scan files for gfp and slab usages and update includes such that
        only the necessary includes are there.  ie. if only gfp is used,
        gfp.h, if slab is used, slab.h.
      
      * When the script inserts a new include, it looks at the include
        blocks and try to put the new include such that its order conforms
        to its surrounding.  It's put in the include block which contains
        core kernel includes, in the same order that the rest are ordered -
        alphabetical, Christmas tree, rev-Xmas-tree or at the end if there
        doesn't seem to be any matching order.
      
      * If the script can't find a place to put a new include (mostly
        because the file doesn't have fitting include block), it prints out
        an error message indicating which .h file needs to be added to the
        file.
      
      The conversion was done in the following steps.
      
      1. The initial automatic conversion of all .c files updated slightly
         over 4000 files, deleting around 700 includes and adding ~480 gfp.h
         and ~3000 slab.h inclusions.  The script emitted errors for ~400
         files.
      
      2. Each error was manually checked.  Some didn't need the inclusion,
         some needed manual addition while adding it to implementation .h or
         embedding .c file was more appropriate for others.  This step added
         inclusions to around 150 files.
      
      3. The script was run again and the output was compared to the edits
         from #2 to make sure no file was left behind.
      
      4. Several build tests were done and a couple of problems were fixed.
         e.g. lib/decompress_*.c used malloc/free() wrappers around slab
         APIs requiring slab.h to be added manually.
      
      5. The script was run on all .h files but without automatically
         editing them as sprinkling gfp.h and slab.h inclusions around .h
         files could easily lead to inclusion dependency hell.  Most gfp.h
         inclusion directives were ignored as stuff from gfp.h was usually
         wildly available and often used in preprocessor macros.  Each
         slab.h inclusion directive was examined and added manually as
         necessary.
      
      6. percpu.h was updated not to include slab.h.
      
      7. Build test were done on the following configurations and failures
         were fixed.  CONFIG_GCOV_KERNEL was turned off for all tests (as my
         distributed build env didn't work with gcov compiles) and a few
         more options had to be turned off depending on archs to make things
         build (like ipr on powerpc/64 which failed due to missing writeq).
      
         * x86 and x86_64 UP and SMP allmodconfig and a custom test config.
         * powerpc and powerpc64 SMP allmodconfig
         * sparc and sparc64 SMP allmodconfig
         * ia64 SMP allmodconfig
         * s390 SMP allmodconfig
         * alpha SMP allmodconfig
         * um on x86_64 SMP allmodconfig
      
      8. percpu.h modifications were reverted so that it could be applied as
         a separate patch and serve as bisection point.
      
      Given the fact that I had only a couple of failures from tests on step
      6, I'm fairly confident about the coverage of this conversion patch.
      If there is a breakage, it's likely to be something in one of the arch
      headers which should be easily discoverable easily on most builds of
      the specific arch.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Guess-its-ok-by: NChristoph Lameter <cl@linux-foundation.org>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
      5a0e3ad6
  17. 13 3月, 2010 3 次提交
    • S
      tracing: Do not record user stack trace from NMI context · b6345879
      Steven Rostedt 提交于
      A bug was found with Li Zefan's ftrace_stress_test that caused applications
      to segfault during the test.
      
      Placing a tracing_off() in the segfault code, and examining several
      traces, I found that the following was always the case. The lock tracer
      was enabled (lockdep being required) and userstack was enabled. Testing
      this out, I just enabled the two, but that was not good enough. I needed
      to run something else that could trigger it. Running a load like hackbench
      did not work, but executing a new program would. The following would
      trigger the segfault within seconds:
      
        # echo 1 > /debug/tracing/options/userstacktrace
        # echo 1 > /debug/tracing/events/lock/enable
        # while :; do ls > /dev/null ; done
      
      Enabling the function graph tracer and looking at what was happening
      I finally noticed that all cashes happened just after an NMI.
      
       1)               |    copy_user_handle_tail() {
       1)               |      bad_area_nosemaphore() {
       1)               |        __bad_area_nosemaphore() {
       1)               |          no_context() {
       1)               |            fixup_exception() {
       1)   0.319 us    |              search_exception_tables();
       1)   0.873 us    |            }
      [...]
       1)   0.314 us    |  __rcu_read_unlock();
       1)   0.325 us    |    native_apic_mem_write();
       1)   0.943 us    |  }
       1)   0.304 us    |  rcu_nmi_exit();
      [...]
       1)   0.479 us    |  find_vma();
       1)               |  bad_area() {
       1)               |    __bad_area() {
      
      After capturing several traces of failures, all of them happened
      after an NMI. Curious about this, I added a trace_printk() to the NMI
      handler to read the regs->ip to see where the NMI happened. In which I
      found out it was here:
      
      ffffffff8135b660 <page_fault>:
      ffffffff8135b660:       48 83 ec 78             sub    $0x78,%rsp
      ffffffff8135b664:       e8 97 01 00 00          callq  ffffffff8135b800 <error_entry>
      
      What was happening is that the NMI would happen at the place that a page
      fault occurred. It would call rcu_read_lock() which was traced by
      the lock events, and the user_stack_trace would run. This would trigger
      a page fault inside the NMI. I do not see where the CR2 register is
      saved or restored in NMI handling. This means that it would corrupt
      the page fault handling that the NMI interrupted.
      
      The reason the while loop of ls helped trigger the bug, was that
      each execution of ls would cause lots of pages to be faulted in, and
      increase the chances of the race happening.
      
      The simple solution is to not allow user stack traces in NMI context.
      After this patch, I ran the above "ls" test for a couple of hours
      without any issues. Without this patch, the bug would trigger in less
      than a minute.
      
      Cc: stable@kernel.org
      Reported-by: NLi Zefan <lizf@cn.fujitsu.com>
      Signed-off-by: NSteven Rostedt <rostedt@goodmis.org>
      b6345879
    • S
      tracing: Disable buffer switching when starting or stopping trace · a2f80714
      Steven Rostedt 提交于
      When the trace iterator is read, tracing_start() and tracing_stop()
      is called to stop tracing while the iterator is processing the trace
      output.
      
      These functions disable both the standard buffer and the max latency
      buffer. But if the wakeup tracer is running, it can switch these
      buffers between the two disables:
      
        buffer = global_trace.buffer;
        if (buffer)
            ring_buffer_record_disable(buffer);
      
            <<<--------- swap happens here
      
        buffer = max_tr.buffer;
        if (buffer)
            ring_buffer_record_disable(buffer);
      
      What happens is that we disabled the same buffer twice. On tracing_start()
      we can enable the same buffer twice. All ring_buffer_record_disable()
      must be matched with a ring_buffer_record_enable() or the buffer
      can be disable permanently, or enable prematurely, and cause a bug
      where a reset happens while a trace is commiting.
      
      This patch protects these two by taking the ftrace_max_lock to prevent
      a switch from occurring.
      
      Found with Li Zefan's ftrace_stress_test.
      
      Cc: stable@kernel.org
      Reported-by: NLai Jiangshan <laijs@cn.fujitsu.com>
      Signed-off-by: NSteven Rostedt <rostedt@goodmis.org>
      a2f80714
    • S
      tracing: Use same local variable when resetting the ring buffer · 283740c6
      Steven Rostedt 提交于
      In the ftrace code that resets the ring buffer it references the
      buffer with a local variable, but then uses the tr->buffer as the
      parameter to reset. If the wakeup tracer is running, which can
      switch the tr->buffer with the max saved buffer, this can break
      the requirement of disabling the buffer before the reset.
      
         buffer = tr->buffer;
         ring_buffer_record_disable(buffer);
         synchronize_sched();
         __tracing_reset(tr->buffer, cpu);
      
      If the tr->buffer is swapped, then the reset is not happening to the
      buffer that was disabled. This will cause the ring buffer to fail.
      
      Found with Li Zefan's ftrace_stress_test.
      
      Cc: stable@kernel.org
      Reported-by: NLai Jiangshan <laijs@cn.fujitsu.com>
      Signed-off-by: NSteven Rostedt <rostedt@goodmis.org>
      283740c6
  18. 06 3月, 2010 2 次提交
    • T
      function-graph: Add tracing_thresh support to function_graph tracer · 0e950173
      Tim Bird 提交于
      Add support for tracing_thresh to the function_graph tracer.  This
      version of this feature isolates the checks into new entry and
      return functions, to avoid adding more conditional code into the
      main function_graph paths.
      
      When the tracing_thresh is set and the function graph tracer is
      enabled, only the functions that took longer than the time in
      microseconds that was set in tracing_thresh are recorded. To do this
      efficiently, only the function exits are recorded:
      
       [tracing]# echo 100 > tracing_thresh
       [tracing]# echo function_graph > current_tracer
       [tracing]# cat trace
       # tracer: function_graph
       #
       # CPU  DURATION                  FUNCTION CALLS
       # |     |   |                     |   |   |   |
        1) ! 119.214 us  |  } /* smp_apic_timer_interrupt */
        1)   <========== |
        0) ! 101.527 us  |              } /* __rcu_process_callbacks */
        0) ! 126.461 us  |            } /* rcu_process_callbacks */
        0) ! 145.111 us  |          } /* __do_softirq */
        0) ! 149.667 us  |        } /* do_softirq */
        0) ! 168.817 us  |      } /* irq_exit */
        0) ! 248.254 us  |    } /* smp_apic_timer_interrupt */
      
      Also, add support for specifying tracing_thresh on the kernel
      command line.  When used like so: "tracing_thresh=200 ftrace=function_graph"
      this can be used to analyse system startup.  It is important to disable
      tracing soon after boot, in order to avoid losing the trace data.
      Acked-by: NFrederic Weisbecker <fweisbec@gmail.com>
      Signed-off-by: NTim Bird <tim.bird@am.sony.com>
      LKML-Reference: <4B87098B.4040308@am.sony.com>
      Signed-off-by: NSteven Rostedt <rostedt@goodmis.org>
      0e950173
    • A
      tracing: Update the comm field in the right variable in update_max_tr · 1acaa1b2
      Arnaldo Carvalho de Melo 提交于
      The latency output showed:
      
       #    | task: -3 (uid:0 nice:0 policy:1 rt_prio:99)
      
      The comm is missing in the "task:" and it looks like a minus 3 is
      the output. The correct display should be:
      
       #    | task: migration/0-3 (uid:0 nice:0 policy:1 rt_prio:99)
      
      The problem is that the comm is being stored in the wrong data
      structure. The max_tr.data[cpu] is what stores the comm, not the
      tr->data[cpu].
      
      Before this patch the max_tr.data[cpu]->comm was zeroed and the /debug/trace
      ended up showing just the '-' sign followed by the pid.
      
      Also remove a needless initialization of max_data.
      Signed-off-by: NArnaldo Carvalho de Melo <acme@redhat.com>
      LKML-Reference: <1267824230-23861-1-git-send-email-acme@infradead.org>
      Signed-off-by: NSteven Rostedt <rostedt@goodmis.org>
      1acaa1b2
  19. 03 3月, 2010 1 次提交
    • L
      tracing: Fix warning in s_next of trace file ops · ac91d854
      Lai Jiangshan 提交于
      This warning in s_next() can be triggered by lseek():
       [<c018b3f7>] ? s_next+0x77/0x80
       [<c013e3c1>] warn_slowpath_common+0x81/0xa0
       [<c018b3f7>] ? s_next+0x77/0x80
       [<c013e3fa>] warn_slowpath_null+0x1a/0x20
       [<c018b3f7>] s_next+0x77/0x80
       [<c01efa77>] traverse+0x117/0x200
       [<c01eff13>] seq_lseek+0xa3/0x120
       [<c01efe70>] ? seq_lseek+0x0/0x120
       [<c01d7081>] vfs_llseek+0x41/0x50
       [<c01d8116>] sys_llseek+0x66/0xa0
       [<c0102bd0>] sysenter_do_call+0x12/0x26
      
      The iterator "leftover" variable is zeroed in the opening of the trace
      file. But lseek can call s_start() which will call s_next() without
      reseting the "leftover" variable back to zero, which might trigger
      the WARN_ON_ONCE(iter->leftover) that is in s_next().
      
      Cc: stable@kernel.org
      Signed-off-by: NLai Jiangshan <laijs@cn.fujitsu.com>
      LKML-Reference: <4B8CE06A.9090207@cn.fujitsu.com>
      Signed-off-by: NSteven Rostedt <rostedt@goodmis.org>
      ac91d854
  20. 26 1月, 2010 1 次提交
    • S
      tracing: Prevent kernel oops with corrupted buffer · 74bf4076
      Steven Rostedt 提交于
      If the contents of the ftrace ring buffer gets corrupted and the trace
      file is read, it could create a kernel oops (usualy just killing the user
      task thread). This is caused by the checking of the pid in the buffer.
      If the pid is negative, it still references the cmdline cache array,
      which could point to an invalid address.
      
      The simple fix is to test for negative PIDs.
      Signed-off-by: NSteven Rostedt <rostedt@goodmis.org>
      74bf4076
  21. 07 1月, 2010 2 次提交
    • S
      tracing: Add stack dump to trace_printk if stacktrace option is set · d931369b
      Steven Rostedt 提交于
      If the ftrace stacktrace option is set, then add the stack dumps to
      trace_printk.
      Requested-by: NPeter Zijlstra <peterz@infradead.org>
      Signed-off-by: NSteven Rostedt <rostedt@goodmis.org>
      d931369b
    • L
      tracing: Consolidate protection of reader access to the ring buffer · 7e53bd42
      Lai Jiangshan 提交于
      At the beginning, access to the ring buffer was fully serialized
      by trace_types_lock. Patch d7350c3f gives more freedom to readers,
      and patch b04cc6b1 adds code to protect trace_pipe and cpu#/trace_pipe.
      
      But actually it is not enough, ring buffer readers are not always
      read-only, they may consume data.
      
      This patch makes accesses to trace, trace_pipe, trace_pipe_raw
      cpu#/trace, cpu#/trace_pipe and cpu#/trace_pipe_raw serialized.
      And removes tracing_reader_cpumask which is used to protect trace_pipe.
      
      Details:
      
      Ring buffer serializes readers, but it is low level protection.
      The validity of the events (which returns by ring_buffer_peek() ..etc)
      are not protected by ring buffer.
      
      The content of events may become garbage if we allow another process to consume
      these events concurrently:
        A) the page of the consumed events may become a normal page
           (not reader page) in ring buffer, and this page will be rewritten
           by the events producer.
        B) The page of the consumed events may become a page for splice_read,
           and this page will be returned to system.
      
      This patch adds trace_access_lock() and trace_access_unlock() primitives.
      
      These primitives allow multi process access to different cpu ring buffers
      concurrently.
      
      These primitives don't distinguish read-only and read-consume access.
      Multi read-only access is also serialized.
      
      And we don't use these primitives when we open files,
      we only use them when we read files.
      Signed-off-by: NLai Jiangshan <laijs@cn.fujitsu.com>
      LKML-Reference: <4B447D52.1050602@cn.fujitsu.com>
      Signed-off-by: NSteven Rostedt <rostedt@goodmis.org>
      7e53bd42
  22. 22 12月, 2009 1 次提交
  23. 16 12月, 2009 1 次提交
  24. 15 12月, 2009 4 次提交
  25. 14 12月, 2009 4 次提交