1. 12 Aug 2009, 1 commit
  2. 09 Aug 2009, 1 commit
    • perf_counter: Fix/complete ftrace event records sampling · f413cdb8
      Committed by Frederic Weisbecker
      This patch implements the kernel side support for ftrace event
      record sampling.
      
      A new counter sampling attribute is added:
      
         PERF_SAMPLE_TP_RECORD
      
      which requests sampling of ftrace event records. In this case,
      if a PERF_TYPE_TRACEPOINT counter is active and a tracepoint
      fires, we emit the tracepoint binary record to the
      perfcounter event buffer as a sample.
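
      As a rough illustration (not part of the patch), a counter
      attribute requesting this sampling might look as follows, where
      event_id stands for the tracepoint id read from debugfs:

       struct perf_counter_attr attr = {
       	.type		= PERF_TYPE_TRACEPOINT,
       	.config		= event_id,	/* illustrative */
       	.sample_type	= PERF_SAMPLE_IP | PERF_SAMPLE_TP_RECORD,
       	.sample_period	= 1,
       };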
      
      Result, after setting the PERF_SAMPLE_TP_RECORD attribute from
      perf record:
      
       perf record -f -F 1 -a -e workqueue:workqueue_execution
       perf report -D
      
       0x21e18 [0x48]: event: 9
       .
       . ... raw event: size 72 bytes
       .  0000:  09 00 00 00 01 00 48 00 d0 c7 00 81 ff ff ff ff  ......H.........
       .  0010:  0a 00 00 00 0a 00 00 00 21 00 00 00 00 00 00 00  ........!.......
       .  0020:  2b 00 01 02 0a 00 00 00 0a 00 00 00 65 76 65 6e  +...........even
       .  0030:  74 73 2f 31 00 00 00 00 00 00 00 00 0a 00 00 00  ts/1............
       .  0040:  e0 b1 31 81 ff ff ff ff                          ..1.....
      .
      0x21e18 [0x48]: PERF_EVENT_SAMPLE (IP, 1): 10: 0xffffffff8100c7d0 period: 33
      
      The raw ftrace binary record starts at offset 0020.
      
      Translation:
      
       struct trace_entry {
      	type		= 0x2b = 43;
      	flags		= 1;
      	preempt_count	= 2;
      	pid		= 0xa = 10;
      	tgid		= 0xa = 10;
       }
      
       thread_comm = "events/1"
       thread_pid  = 0xa = 10;
       func	    = 0xffffffff8131b1e0 = flush_to_ldisc()
      
      What will come next?
      
       - Userspace support ('perf trace'), 'flight data recorder' mode
         for perf trace, etc.
      
       - The unconditional copy from the profiling callback has a
         cost, even when no such sampling is wanted; this needs to
         be fixed in the future. For that we need instant access to
         the perf counter attribute. This is a matter of adding a
         flag to struct ftrace_event.
      
       - Take care of event recursion! Don't ever try to record
         a lock event, for example; some locking is used in the
         profiling fast path and leads to tracing recursion. That
         will be fixed using raw spinlocks or recursion
         protection.
      
       - [...]
      
       - Profit! :-)
      Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Li Zefan <lizf@cn.fujitsu.com>
      Cc: Tom Zanussi <tzanussi@gmail.com>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Pekka Enberg <penberg@cs.helsinki.fi>
      Cc: Gabriel Munteanu <eduard.munteanu@linux360.ro>
      Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      f413cdb8
  3. 06 Aug 2009, 4 commits
  4. 29 Jul 2009, 1 commit
    • tracing: Fix missing function_graph events when we splice_read from trace_pipe · 74e7ff8c
      Committed by Lai Jiangshan
      About half of the events are missing when we splice_read
      from trace_pipe. They are unexpectedly consumed because we ignore
      the TRACE_TYPE_NO_CONSUME return value, which the function graph
      tracer uses when it needs to consume events itself to walk
      the ring buffer.
      
      The same problem appears with ftrace_dump().

      Example of output before this patch:
      
      1)               |      ktime_get_real() {
      1)   2.846 us    |          read_hpet();
      1)   4.558 us    |        }
      1)   6.195 us    |      }
      
      After this patch:
      
      0)               |      ktime_get_real() {
      0)               |        getnstimeofday() {
      0)   1.960 us    |          read_hpet();
      0)   3.597 us    |        }
      0)   5.196 us    |      }
      
      The fix also applies to 2.6.30.
      Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: stable@kernel.org
      LKML-Reference: <4A6EEC52.90704@cn.fujitsu.com>
      Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
      74e7ff8c
  5. 23 Jul 2009, 1 commit
  6. 21 Jul 2009, 1 commit
  7. 18 Jul 2009, 1 commit
  8. 13 Jul 2009, 1 commit
  9. 08 Jul 2009, 1 commit
    • ring-buffer: make lockless · 77ae365e
      Committed by Steven Rostedt
      This patch converts the ring buffers into a completely lockless
      buffer recording system. The read side still takes locks, since
      we still serialize readers. But the writers are the ones that
      must be lockless (writes can happen in NMIs).
      
      The main change is to the "head_page" pointer. We write to the
      tail and read from the head. The "head_page" pointer in the cpu
      buffer is now just a reference to where to look. The real head
      page is now kept in the head_page->list->prev->next pointer.
      That is, we set flags in the list head of the previous page.

      The list pages are allocated with alignment such that the least
      significant bits of the list pointers are always zero. This
      gives us room to store flags in the pointers.
      
      bit 0: set when the page is a head page
      bit 1: set when the writer is moving the page (for overwrite mode)
      
      cmpxchg is used to update the pointer.
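
      As a rough sketch of the technique (the macro and helper names
      below are illustrative, not a verbatim excerpt of the patch),
      the flags live in the low bits of the aligned list pointers and
      are stripped before a pointer is followed:

       #define RB_PAGE_HEAD	1UL	/* bit 0: this is the head page      */
       #define RB_PAGE_UPDATE	2UL	/* bit 1: writer is moving this page */
       #define RB_FLAG_MASK	3UL

       /* strip the flag bits before following the pointer */
       static struct list_head *rb_list_head(struct list_head *list)
       {
       	unsigned long val = (unsigned long)list;

       	return (struct list_head *)(val & ~RB_FLAG_MASK);
       }

       /* atomically flip HEAD -> UPDATE so a concurrent reader backs
        * off; val is the stripped pointer value of prev->next */
       static int rb_head_page_set_update(struct list_head *prev,
       				   unsigned long val)
       {
       	return cmpxchg((unsigned long *)&prev->next,
       		       val | RB_PAGE_HEAD,
       		       val | RB_PAGE_UPDATE) == (val | RB_PAGE_HEAD);
       }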
      
      When the writer wraps the buffer and the tail meets the head,
      in overwrite mode, the writer must move the head page forward.
      It first uses cmpxchg to change the pointer flag from 1 to 2.
      Once this is done, the reader on another CPU will not take the
      page from the buffer.
      
      The writers need to protect against interrupts (we don't bother with
      disabling interrupts because NMIs are allowed to write too).
      
      After the writer sets the pointer flag to 2, it takes care to
      manage interrupts coming in. This is described in detail in the
      comments of the code.
      
       Changes in version 2:
        - Let reader reset entries value of header page.
        - Fix tail page passing commit page on reader page test.
        - Always increment entries and write counter in rb_tail_page_update
        - Add safety check in rb_set_commit_to_write to break out of infinite loop
        - add mask in rb_is_reader_page
      
      [ Impact: lock free writing to the ring buffer ]
      Signed-off-by: Steven Rostedt <srostedt@redhat.com>
      77ae365e
  10. 01 Jul 2009, 1 commit
  11. 24 Jun 2009, 2 commits
  12. 16 Jun 2009, 1 commit
    • debugfs: Fix terminology inconsistency of dir name to mount debugfs filesystem. · 156f5a78
      Committed by GeunSik Lim
      Many developers use "/debug/", "/debugfs/", or "/sys/kernel/debug/"
      as the directory name for mounting the debugfs filesystem for
      ftrace, following ./Documentation/tracers/ftrace.txt.

      All three directory names (/debug/, /debugfs/, and
      /sys/kernel/debug/) appear in the kernel source, e.g. in ftrace,
      DRM, wireless, Documentation, and network [sky2] files that
      mount the debugfs filesystem.

      debugfs is the debug filesystem, written by Greg Kroah-Hartman
      to make debugging easy. "/sys/kernel/debug/" is the appropriate
      directory name for the debugfs filesystem.
      - debugfs reference: http://lwn.net/Articles/334546/

      Fix the inconsistency of the directory name used to mount the
      debugfs filesystem.
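
      For reference, debugfs is mounted at the canonical location
      with the standard mount invocation:

       # mount -t debugfs nodev /sys/kernel/debug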
      
      * From Steven Rostedt
        - find_debugfs() and tracing_files() in this patch.
      Signed-off-by: GeunSik Lim <geunsik.lim@samsung.com>
      Acked-by: Inaky Perez-Gonzalez <inaky@linux.intel.com>
      Reviewed-by: Steven Rostedt <rostedt@goodmis.org>
      Reviewed-by: James Smart <james.smart@emulex.com>
      CC: Jiri Kosina <trivial@kernel.org>
      CC: David Airlie <airlied@linux.ie>
      CC: Peter Osterlund <petero2@telia.com>
      CC: Ananth N Mavinakayanahalli <ananth@in.ibm.com>
      CC: Anil S Keshavamurthy <anil.s.keshavamurthy@intel.com>
      CC: Masami Hiramatsu <mhiramat@redhat.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
      156f5a78
  13. 15 Jun 2009, 2 commits
  14. 02 Jun 2009, 1 commit
  15. 28 May 2009, 1 commit
    • trace: disable preemption before taking raw spinlocks · 5b6045a9
      Committed by Heiko Carstens
      The s390 __raw_spin_lock() code uses smp_processor_id(), which
      reveals that a (raw) spinlock is taken without preemption
      disabled. This can potentially deadlock.

      To fix this, explicitly disable and enable preemption.
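
      The fix in trace_find_cmdline() has roughly this shape (a
      sketch, not the verbatim patch):

       	preempt_disable();
       	__raw_spin_lock(&trace_cmdline_lock);
       	/* ... look up the saved comm for this pid ... */
       	__raw_spin_unlock(&trace_cmdline_lock);
       	preempt_enable();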
      
      BUG: using smp_processor_id() in preemptible [00000000] code: cat/2278
      caller is trace_find_cmdline+0x40/0xfc
      CPU: 0 Not tainted 2.6.30-rc7-dirty #39
      Process cat (pid: 2278, task: 000000003faedb68, ksp: 000000003b33b988)
      000000003b33b988 000000003b33bae0 0000000000000002 0000000000000000
             000000003b33bb80 000000003b33baf8 000000003b33baf8 00000000000175d6
             0000000000000001 000000003b33b988 000000003f9b0000 000000000000000b
             000000000000000c 000000003b33bb40 000000003b33bae0 0000000000000000
             0000000000000000 00000000000175d6 000000003b33bae0 000000003b33bb28
      Call Trace:
      ([<00000000000174b2>] show_trace+0x112/0x170)
       [<0000000000017582>] show_stack+0x72/0x100
       [<0000000000441538>] dump_stack+0xc8/0xd8
       [<000000000025c350>] debug_smp_processor_id+0x114/0x130
       [<00000000000bf0e4>] trace_find_cmdline+0x40/0xfc
       [<00000000000c35d4>] trace_print_context+0x58/0xac
       [<00000000000bb676>] print_trace_line+0x416/0x470
       [<00000000000bc8fe>] s_show+0x4e/0x428
       [<000000000013834e>] seq_read+0x36a/0x5d4
       [<0000000000112a78>] vfs_read+0xc8/0x174
       [<0000000000112c58>] SyS_read+0x74/0xc4
       [<000000000002c7ae>] sysc_noemu+0x10/0x16
       [<000002000012436c>] 0x2000012436c
      1 lock held by cat/2278:
       #0:  (&p->lock){+.+.+.}, at: [<0000000000138056>] seq_read+0x72/0x5d4
      
      [ Impact: fix preempt-unsafe raw spinlock ]
      Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
      Acked-by: Steven Rostedt <rostedt@goodmis.org>
      Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
      5b6045a9
  16. 26 May 2009, 1 commit
    • tracing: add trace_event_read_lock() · 4f535968
      Committed by Lai Jiangshan
      I found that there is nothing protecting event_hash in
      ftrace_find_event(). RCU protects the event hashlist,
      but not the event itself while we use it after extracting
      it through ftrace_find_event().

      This lack of proper locking opens a race window between
      any event dereference and module removal.
      
      Eg:
      
      --Task A--
      
      print_trace_line(trace) {
        event = find_ftrace_event(trace)
      
      --Task B--
      
      trace_module_remove_events(mod) {
        list_trace_events_module(ev, mod) {
          unregister_ftrace_event(ev->event) {
            hlist_del(ev->event->node)
              list_del(....)
          }
        }
      }
      |--> module removed, the event has been dropped
      
      --Task A--
      
        event->print(trace); // Dereferencing freed memory
      
      If the event retrieved belongs to a module and this module
      is concurrently removed, we may end up dereferencing data
      from a freed module.

      RCU could solve this, but it would add latency to the kernel
      and forbid tracer output callbacks from calling any sleepable
      code. So this fix converts 'trace_event_mutex' to a read/write
      semaphore and adds trace_event_read_lock() to protect
      ftrace_find_event().
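
      A minimal sketch of the read-side API (the rw-semaphore name
      here is illustrative):

       static DECLARE_RWSEM(trace_event_sem);	/* illustrative name */

       void trace_event_read_lock(void)
       {
       	down_read(&trace_event_sem);
       }

       void trace_event_read_unlock(void)
       {
       	up_read(&trace_event_sem);
       }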
      
      [ Impact: fix possible freed memory dereference in ftrace ]
      Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
      Acked-by: Steven Rostedt <rostedt@goodmis.org>
      LKML-Reference: <4A114806.7090302@cn.fujitsu.com>
      Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
      4f535968
  17. 16 May 2009, 1 commit
  18. 07 May 2009, 1 commit
    • tracing: reset ring buffer when removing modules with events · 9456f0fa
      Committed by Steven Rostedt
      Li Zefan found that there's a race involving the event ids used
      by events and modules. When a module is loaded, an event id is
      incremented. We only have 16 bits for event ids (65536), and
      there is a possible (but highly unlikely) race where we could
      load and unload a module that registers events so many times
      that the event id counter overflows.
      
      When it overflows, it then restarts and goes looking for available
      ids. An id is available if it was added by a module and released.
      
      The race occurs when one module adds an id and is then removed.
      Another module loaded later can use that same event id. But if
      the old module still had events in the ring buffer, the new
      module's callback would get bogus data. At best (and most
      likely) the output would just be garbage. But if the module for
      some reason used pointers (not recommended), this could
      potentially crash.
      
      The safest thing to do is just reset the ring buffer if a module that
      registered events is removed.
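
      A sketch of where the reset hooks in (the notifier shape follows
      the existing module notifier; the reset helper name here is
      hypothetical):

       static int trace_module_notify(struct notifier_block *self,
       			       unsigned long val, void *data)
       {
       	struct module *mod = data;

       	if (val == MODULE_STATE_GOING) {
       		trace_module_remove_events(mod);
       		/* hypothetical helper: clear the per-cpu ring buffers */
       		tracing_reset_ring_buffers();
       	}
       	return 0;
       }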
      
      [ Impact: prevent unpredictable results of event id overflows ]
      Reported-by: Li Zefan <lizf@cn.fujitsu.com>
      LKML-Reference: <49FEAFD0.30106@cn.fujitsu.com>
      Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
      9456f0fa
  19. 06 May 2009, 2 commits
    • tracing: use proper export symbol for tracing api · 94487d6d
      Committed by Steven Rostedt
      When adding the EXPORT_SYMBOL to some of the tracing API, I
      accidentally used EXPORT_SYMBOL instead of EXPORT_SYMBOL_GPL.
      This patch fixes that mistake.
      
      [ Impact: export the tracing code only for GPL modules ]
      Reported-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
      94487d6d
    • tracing: export stats of ring buffers to userspace · c8d77183
      Committed by Steven Rostedt
      This patch adds stats to the ftrace ring buffers:
      
       # cat /debugfs/tracing/per_cpu/cpu0/stats
       entries: 42360
       overrun: 30509326
       commit overrun: 0
       nmi dropped: 0
      
      Where entries is the total number of data entries in the buffer.

      overrun is the number of entries that were not consumed and were
      overwritten by the writer.
      
      commit overrun is the number of entries dropped due to nested writers
      wrapping the buffer before the initial writer finished the commit.
      
      nmi dropped is the number of entries dropped due to the ring buffer
      lock being held when an nmi was going to write to the ring buffer.
      Note, this field will be meaningless and will go away when the ring
      buffer becomes lockless.
      
      [ Impact: let userspace know what is happening in the ring buffers ]
      Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
      c8d77183
  20. 29 Apr 2009, 4 commits
    • tracing: fix ref count in splice pages · 7267fa68
      Committed by Steven Rostedt
      The pages allocated for the splice binary buffer did not have
      their ref count initialized correctly. This caused pages not to
      be freed, causing a drastic memory leak.
      
      Thanks to logdev I was able to trace the tracer to find where the leak
      was.
      
      [ Impact: stop memory leak when using splice ]
      Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      7267fa68
    • tracing: have splice only copy full pages · f2957f1f
      Committed by Steven Rostedt
      Splice works with pages; it is much more efficient to use an
      entire page than to copy bits over several pages.
      
      Using logdev to trace the internals of the splice mechanism, I
      was able to see that splice can be very aggressive. When tracing
      is occurring and the reader has caught up to the writer, with
      the writer on the reader page, the reader will copy what is
      there into the splice page. Splice may iterate over several
      pages, and if the writer is still writing to the page, the
      reader will keep copying bits to new pages to pass to userspace.
      
      This patch changes it to only pass data to userspace if the page
      is full (the writer has left the page). This has the small side
      effect that splice can not read a partial page, and must wait
      for the page to fill. This should not be an issue: if tracing
      has stopped, a plain "read" will still read all of the page.
      
      [ Impact: better performance for ring buffer splice code ]
      Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
      f2957f1f
    • tracing: only add splice page if entries exist · 93459c6c
      Committed by Steven Rostedt
      The splice code allocates a page even when the ring buffer is
      empty. It only detects that the ring buffer is empty when it
      fails to copy anything from the ring buffer into the page.
      
      This patch adds a check to see if there is anything in the ring buffer
      before allocating a page.
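
      A sketch of the idea (the surrounding splice loop is elided, and
      buffer stands for the trace's ring buffer):

       	/* don't allocate a splice page for an empty ring buffer */
       	if (!ring_buffer_entries(buffer))
       		break;

       	page = alloc_page(GFP_KERNEL);
       	if (!page)
       		break;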
      
      Thanks to logdev for letting me trace the tracer to find this.
      
      [ Impact: speed up due to removing unnecessary allocation ]
      Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
      93459c6c
    • tracing: fix ref count in splice pages · 5beae6ef
      Committed by Steven Rostedt
      The pages allocated for the splice binary buffer did not have
      their ref count initialized correctly. This caused pages not to
      be freed, causing a drastic memory leak.
      
      Thanks to logdev I was able to trace the tracer to find where the leak
      was.
      
      [ Impact: stop memory leak when using splice ]
      Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
      5beae6ef
  21. 28 Apr 2009, 1 commit
    • tracing: convert ftrace_dump spinlocks to raw · cd891ae0
      Committed by Steven Rostedt
      ftrace_dump is used for printing out the contents of the ftrace
      ring buffer to the console on failure. Currently it uses a
      spinlock to synchronize the output from multiple failures on
      different CPUs. This spinlock is a normal spinlock and can cause
      issues with lockdep and lock tracing.
      
      This patch converts it to raw since it is for error handling only.
      The lock is local to the ftrace_dump and is not used by any other
      infrastructure.
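
      The shape of the conversion (a sketch; the initializer cast
      follows the raw-spinlock idiom of this era):

       -static DEFINE_SPINLOCK(ftrace_dump_lock);
       +static raw_spinlock_t ftrace_dump_lock =
       +	(raw_spinlock_t)__RAW_SPIN_LOCK_UNLOCKED;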
      
      [ Impact: prevent ftrace_dump from locking up by internal tracing ]
      Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
      cd891ae0
  22. 22 Apr 2009, 1 commit
    • tracing/events: make struct trace_entry->type to be int type · 7a4f453b
      Committed by Li Zefan
      struct trace_entry->type is an unsigned char, while a trace
      event's id is an int; thus for an event with id >= 256, its
      entry->type is truncated to (id % 256), and we can't see the
      trace output of this event.
      
       # insmod trace-events-sample.ko
       # echo foo_bar > /mnt/tracing/set_event
       # cat /debug/tracing/events/trace-events-sample/foo_bar/id
       256
       # cat /mnt/tracing/trace_pipe
                 <...>-3548  [001]   215.091142: Unknown type 0
                 <...>-3548  [001]   216.089207: Unknown type 0
                 <...>-3548  [001]   217.087271: Unknown type 0
                 <...>-3548  [001]   218.085332: Unknown type 0
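
      The shape of the fix (a sketch of the struct change, other
      fields elided):

       struct trace_entry {
       -	unsigned char	type;
       +	int		type;
       	...
       };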
      
      [ Impact: fix output for trace events with id >= 256 ]
      Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
      Acked-by: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Tom Zanussi <tzanussi@gmail.com>
      LKML-Reference: <49EEDB0E.5070207@cn.fujitsu.com>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      7a4f453b
  23. 18 Apr 2009, 3 commits
  24. 17 Apr 2009, 1 commit
  25. 15 Apr 2009, 1 commit
  26. 14 Apr 2009, 4 commits
    • tracing/filters: use ring_buffer_discard_commit() in filter_check_discard() · eb02ce01
      Committed by Tom Zanussi
      This patch changes filter_check_discard() to make use of the new
      ring_buffer_discard_commit() function and modifies the current users to
      call the old commit function in the non-discard case.
      
      It also introduces a version of filter_check_discard() that uses the
      global trace buffer (filter_current_check_discard()) for those cases.
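
      A sketch of the resulting call pattern at an event site (the
      commit arguments here are illustrative):

       	entry = ring_buffer_event_data(event);
       	/* ... fill in the entry's fields ... */

       	if (!filter_check_discard(call, entry, tr->buffer, event))
       		trace_buffer_unlock_commit(tr, event, flags, pc);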
      
      v2 changes:
      
      - fix compile error noticed by Ingo Molnar
      Signed-off-by: Tom Zanussi <tzanussi@gmail.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: fweisbec@gmail.com
      LKML-Reference: <1239178554.10295.36.camel@tropicana>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      eb02ce01
    • tracing/filters: use ring_buffer_discard_commit for discarded events · 77d9f465
      Committed by Steven Rostedt
      ring_buffer_discard_commit() makes better use of the ring buffer
      when an event has been discarded. It tries to remove the event
      completely if possible.

      This patch converts the trace event filtering to use
      ring_buffer_discard_commit() instead of ring_buffer_event_discard().
      Signed-off-by: Steven Rostedt <srostedt@redhat.com>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      77d9f465
    • tracing/filters: add TRACE_EVENT_FORMAT_NOFILTER event macro · e45f2e2b
      Committed by Tom Zanussi
      Frederic Weisbecker suggested that the trace_special event shouldn't be
      filterable; this patch adds a TRACE_EVENT_FORMAT_NOFILTER event macro
      that allows an event format to be exported without having a filter
      attached, and removes filtering from the trace_special event.
      Signed-off-by: Tom Zanussi <tzanussi@gmail.com>
      Signed-off-by: Steven Rostedt <srostedt@redhat.com>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      e45f2e2b
    • tracing/filters: add run-time field descriptions to TRACE_EVENT_FORMAT events · e1112b4d
      Committed by Tom Zanussi
      This patch adds run-time field descriptions to all the event formats
      exported using TRACE_EVENT_FORMAT.  It also hooks up all the tracers
      that use them (i.e. the tracers in the 'ftrace subsystem') so they can
      also have their output filtered by the event-filtering mechanism.
      
      When I was testing this, there were a couple of things that
      fooled me into thinking the filters weren't working, when
      actually they were. I'll mention them here so others don't make
      the same mistakes (and file bug reports ;-)
      
      One is that some of the tracers trace multiple events, e.g. the
      sched_switch tracer uses the context_switch and wakeup events.
      If you don't set filters on all of the traced events, the
      unfiltered output from the events without filters can make it
      look like the filtering as a whole isn't working properly, when
      actually it is doing what it was asked to do; it just wasn't
      asked to do the right thing.
      
      The other is that for the really high-volume tracers, e.g. the
      function tracer, the volume of filtered events can be so high
      that it pushes the unfiltered events out of the ring buffer
      before they can be read. So, for example, cat'ing the trace file
      repeatedly shows either no output, or once in a while some
      output that isn't there the next time you read the trace, which
      isn't what you normally expect when reading the trace file. If
      you read from the trace_pipe file though, you can catch them
      before they disappear.
      
      Changes from v1:
      
      As suggested by Frederic Weisbecker:
      
      - get rid of externs in functions
      - added unlikely() to filter_check_discard()
      Signed-off-by: Tom Zanussi <tzanussi@gmail.com>
      Signed-off-by: Steven Rostedt <srostedt@redhat.com>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      e1112b4d