1. 08 4月, 2015 1 次提交
    • S
      tracing: Add TRACE_DEFINE_ENUM() macro to map enums to their values · 0c564a53
      Steven Rostedt (Red Hat) 提交于
      Several tracepoints use the helper functions __print_symbolic() or
      __print_flags() and pass in enums that do the mapping between the
      binary data stored and the value to print. This works well for reading
      the ASCII trace files, but when the data is read via userspace tools
      such as perf and trace-cmd, the conversion of the binary value to a
      human string format is lost if an enum is used, as userspace does not
      have access to what the ENUM is.
      
      For example, the tracepoint trace_tlb_flush() has:
      
       __print_symbolic(REC->reason,
          { TLB_FLUSH_ON_TASK_SWITCH, "flush on task switch" },
          { TLB_REMOTE_SHOOTDOWN, "remote shootdown" },
          { TLB_LOCAL_SHOOTDOWN, "local shootdown" },
          { TLB_LOCAL_MM_SHOOTDOWN, "local mm shootdown" })
      
      Which maps the enum values to the strings they represent. But perf and
      trace-cmd do no know what value TLB_LOCAL_MM_SHOOTDOWN is, and would
      not be able to map it.
      
      With TRACE_DEFINE_ENUM(), developers can place these in the event header
      files and ftrace will convert the enums to their values:
      
      By adding:
      
       TRACE_DEFINE_ENUM(TLB_FLUSH_ON_TASK_SWITCH);
       TRACE_DEFINE_ENUM(TLB_REMOTE_SHOOTDOWN);
       TRACE_DEFINE_ENUM(TLB_LOCAL_SHOOTDOWN);
       TRACE_DEFINE_ENUM(TLB_LOCAL_MM_SHOOTDOWN);
      
       $ cat /sys/kernel/debug/tracing/events/tlb/tlb_flush/format
      [...]
       __print_symbolic(REC->reason,
          { 0, "flush on task switch" },
          { 1, "remote shootdown" },
          { 2, "local shootdown" },
          { 3, "local mm shootdown" })
      
      The above is what userspace expects to see, and tools do not need to
      be modified to parse them.
      
      Link: http://lkml.kernel.org/r/20150403013802.220157513@goodmis.org
      
      Cc: Guilherme Cox <cox@computer.org>
      Cc: Tony Luck <tony.luck@gmail.com>
      Cc: Xie XiuQi <xiexiuqi@huawei.com>
      Acked-by: NNamhyung Kim <namhyung@kernel.org>
      Reviewed-by: NMasami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
      Tested-by: NMasami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
      Signed-off-by: NSteven Rostedt <rostedt@goodmis.org>
      0c564a53
  2. 03 4月, 2015 1 次提交
  3. 31 3月, 2015 1 次提交
  4. 25 3月, 2015 4 次提交
  5. 09 3月, 2015 3 次提交
    • S
      ftrace: Fix ftrace enable ordering of sysctl ftrace_enabled · 524a3868
      Steven Rostedt (Red Hat) 提交于
      Some archs (specifically PowerPC), are sensitive with the ordering of
      the enabling of the calls to function tracing and setting of the
      function to use to be traced.
      
      That is, update_ftrace_function() sets what function the ftrace_caller
      trampoline should call. Some archs require this to be set before
      calling ftrace_run_update_code().
      
      Another bug was discovered, that ftrace_startup_sysctl() called
      ftrace_run_update_code() directly. If the function the ftrace_caller
      trampoline changes, then it will not be updated. Instead a call
      to ftrace_startup_enable() should be called because it tests to see
      if the callback changed since the code was disabled, and will
      tell the arch to update appropriately. Most archs do not need this
      notification, but PowerPC does.
      
      The problem could be seen by the following commands:
      
       # echo 0 > /proc/sys/kernel/ftrace_enabled
       # echo function > /sys/kernel/debug/tracing/current_tracer
       # echo 1 > /proc/sys/kernel/ftrace_enabled
       # cat /sys/kernel/debug/tracing/trace
      
      The trace will show that function tracing was not active.
      
      Cc: stable@vger.kernel.org # 2.6.27+
      Signed-off-by: NSteven Rostedt <rostedt@goodmis.org>
      524a3868
    • P
      ftrace: Fix en(dis)able graph caller when en(dis)abling record via sysctl · 1619dc3f
      Pratyush Anand 提交于
      When ftrace is enabled globally through the proc interface, we must check if
      ftrace_graph_active is set. If it is set, then we should also pass the
      FTRACE_START_FUNC_RET command to ftrace_run_update_code(). Similarly, when
      ftrace is disabled globally through the proc interface, we must check if
      ftrace_graph_active is set. If it is set, then we should also pass the
      FTRACE_STOP_FUNC_RET command to ftrace_run_update_code().
      
      Consider the following situation.
      
       # echo 0 > /proc/sys/kernel/ftrace_enabled
      
      After this ftrace_enabled = 0.
      
       # echo function_graph > /sys/kernel/debug/tracing/current_tracer
      
      Since ftrace_enabled = 0, ftrace_enable_ftrace_graph_caller() is never
      called.
      
       # echo 1 > /proc/sys/kernel/ftrace_enabled
      
      Now ftrace_enabled will be set to true, but still
      ftrace_enable_ftrace_graph_caller() will not be called, which is not
      desired.
      
      Further if we execute the following after this:
        # echo nop > /sys/kernel/debug/tracing/current_tracer
      
      Now since ftrace_enabled is set it will call
      ftrace_disable_ftrace_graph_caller(), which causes a kernel warning on
      the ARM platform.
      
      On the ARM platform, when ftrace_enable_ftrace_graph_caller() is called,
      it checks whether the old instruction is a nop or not. If it's not a nop,
      then it returns an error. If it is a nop then it replaces instruction at
      that address with a branch to ftrace_graph_caller.
      ftrace_disable_ftrace_graph_caller() behaves just the opposite. Therefore,
      if generic ftrace code ever calls either ftrace_enable_ftrace_graph_caller()
      or ftrace_disable_ftrace_graph_caller() consecutively two times in a row,
      then it will return an error, which will cause the generic ftrace code to
      raise a warning.
      
      Note, x86 does not have an issue with this because the architecture
      specific code for ftrace_enable_ftrace_graph_caller() and
      ftrace_disable_ftrace_graph_caller() does not check the previous state,
      and calling either of these functions twice in a row has no ill effect.
      
      Link: http://lkml.kernel.org/r/e4fbe64cdac0dd0e86a3bf914b0f83c0b419f146.1425666454.git.panand@redhat.com
      
      Cc: stable@vger.kernel.org # 2.6.31+
      Signed-off-by: NPratyush Anand <panand@redhat.com>
      [
        removed extra if (ftrace_start_up) and defined ftrace_graph_active as 0
        if CONFIG_FUNCTION_GRAPH_TRACER is not set.
      ]
      Signed-off-by: NSteven Rostedt <rostedt@goodmis.org>
      1619dc3f
    • S
      ftrace: Clear REGS_EN and TRAMP_EN flags on disabling record via sysctl · b24d443b
      Steven Rostedt (Red Hat) 提交于
      When /proc/sys/kernel/ftrace_enabled is set to zero, all function
      tracing is disabled. But the records that represent the functions
      still hold information about the ftrace_ops that are hooked to them.
      
      ftrace_ops may request "REGS" (have a full set of pt_regs passed to
      the callback), or "TRAMP" (the ops has its own trampoline to use).
      When the record is updated to represent the state of the ops hooked
      to it, it sets "REGS_EN" and/or "TRAMP_EN" to state that the callback
      points to the correct trampoline (REGS has its own trampoline).
      
      When ftrace_enabled is set to zero, all ftrace locations are a nop,
      so they do not point to any trampoline. But the _EN flags are still
      set. This can cause the accounting to go wrong when ftrace_enabled
      is cleared and an ops that has a trampoline is registered or unregistered.
      
      For example, the following will cause ftrace to crash:
      
       # echo function_graph > /sys/kernel/debug/tracing/current_tracer
       # echo 0 > /proc/sys/kernel/ftrace_enabled
       # echo nop > /sys/kernel/debug/tracing/current_tracer
       # echo 1 > /proc/sys/kernel/ftrace_enabled
       # echo function_graph > /sys/kernel/debug/tracing/current_tracer
      
      As function_graph uses a trampoline, when ftrace_enabled is set to zero
      the updates to the record are not done. When enabling function_graph
      again, the record will still have the TRAMP_EN flag set, and it will
      look for an op that has a trampoline other than the function_graph
      ops, and fail to find one.
      
      Cc: stable@vger.kernel.org # 3.17+
      Reported-by: NPratyush Anand <panand@redhat.com>
      Signed-off-by: NSteven Rostedt <rostedt@goodmis.org>
      b24d443b
  6. 14 2月, 2015 1 次提交
  7. 11 2月, 2015 1 次提交
    • S
      ring-buffer: Do not wake up a splice waiter when page is not full · 1e0d6714
      Steven Rostedt (Red Hat) 提交于
      When an application connects to the ring buffer via splice, it can only
      read full pages. Splice does not work with partial pages. If there is
      not enough data to fill a page, the splice command will either block
      or return -EAGAIN (if set to nonblock).
      
      Code was added where if the page is not full, to just sleep again.
      The problem is, it will get woken up again on the next event. That
      is, when something is written into the ring buffer, if there is a waiter
      it will wake it up. The waiter would then check the buffer, see that
      it still does not have enough data to fill a page and go back to sleep.
      To make matters worse, when the waiter goes back to sleep, it could
      cause another event, which would wake it back up again to see it
      doesn't have enough data and sleep again. This produces a tremendous
      overhead and fills the ring buffer with noise.
      
      For example, recording sched_switch on an idle system for 10 seconds
      produces 25,350,475 events!!!
      
      Create another wait queue for those waiters wanting full pages.
      When an event is written, it only wakes up waiters if there's a full
      page of data. It does not wake up the waiter if the page is not yet
      full.
      
      After this change, recording sched_switch on an idle system for 10
      seconds produces only 800 events. Getting rid of 25,349,675 useless
      events (99.9969% of events!!), is something to take seriously.
      
      Cc: stable@vger.kernel.org # 3.16+
      Cc: Rabin Vincent <rabin@rab.in>
      Fixes: e30f53aa "tracing: Do not busy wait in buffer splice"
      Signed-off-by: NSteven Rostedt <rostedt@goodmis.org>
      1e0d6714
  8. 10 2月, 2015 1 次提交
  9. 02 2月, 2015 2 次提交
  10. 30 1月, 2015 1 次提交
  11. 29 1月, 2015 2 次提交
  12. 28 1月, 2015 2 次提交
  13. 23 1月, 2015 2 次提交
  14. 15 1月, 2015 4 次提交
  15. 14 1月, 2015 1 次提交
    • P
      perf: Avoid horrible stack usage · 86038c5e
      Peter Zijlstra (Intel) 提交于
      Both Linus (most recent) and Steve (a while ago) reported that perf
      related callbacks have massive stack bloat.
      
      The problem is that software events need a pt_regs in order to
      properly report the event location and unwind stack. And because we
      could not assume one was present we allocated one on stack and filled
      it with minimal bits required for operation.
      
      Now, pt_regs is quite large, so this is undesirable. Furthermore it
      turns out that most sites actually have a pt_regs pointer available,
      making this even more onerous, as the stack space is pointless waste.
      
      This patch addresses the problem by observing that software events
      have well defined nesting semantics, therefore we can use static
      per-cpu storage instead of on-stack.
      
      Linus made the further observation that all but the scheduler callers
      of perf_sw_event() have a pt_regs available, so we change the regular
      perf_sw_event() to require a valid pt_regs (where it used to be
      optional) and add perf_sw_event_sched() for the scheduler.
      
      We have a scheduler specific call instead of a more generic _noregs()
      like construct because we can assume non-recursion from the scheduler
      and thereby simplify the code further (_noregs would have to put the
      recursion context call inline in order to assertain which __perf_regs
      element to use).
      
      One last note on the implementation of perf_trace_buf_prepare(); we
      allow .regs = NULL for those cases where we already have a pt_regs
      pointer available and do not need another.
      Reported-by: NLinus Torvalds <torvalds@linux-foundation.org>
      Reported-by: NSteven Rostedt <rostedt@goodmis.org>
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
      Cc: Javi Merino <javi.merino@arm.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Petr Mladek <pmladek@suse.cz>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Tom Zanussi <tom.zanussi@linux.intel.com>
      Cc: Vaibhav Nagarnaik <vnagarnaik@google.com>
      Link: http://lkml.kernel.org/r/20141216115041.GW3337@twins.programming.kicks-ass.netSigned-off-by: NIngo Molnar <mingo@kernel.org>
      86038c5e
  16. 23 12月, 2014 2 次提交
    • S
      tracing: Remove taking of trace_types_lock in pipe files · d716ff71
      Steven Rostedt (Red Hat) 提交于
      Taking the global mutex "trace_types_lock" in the trace_pipe files
      causes a bottle neck as most the pipe files can be read per cpu
      and there's no reason to serialize them.
      
      The current_trace variable was given a ref count and it can not
      change when the ref count is not zero. Opening the trace_pipe
      files will up the ref count (and decremented on close), so that
      the lock no longer needs to be taken when accessing the
      current_trace variable.
      Signed-off-by: NSteven Rostedt <rostedt@goodmis.org>
      d716ff71
    • S
      tracing: Add ref count to tracer for when they are being read by pipe · cf6ab6d9
      Steven Rostedt (Red Hat) 提交于
      When one of the trace pipe files are being read (by either the trace_pipe
      or trace_pipe_raw), do not allow the current_trace to change. By adding
      a ref count that is incremented when the pipe files are opened, will
      prevent the current_trace from being changed.
      
      This will allow for the removal of the global trace_types_lock from
      reading the pipe buffers (which is currently a bottle neck).
      Signed-off-by: NSteven Rostedt <rostedt@goodmis.org>
      cf6ab6d9
  17. 15 12月, 2014 2 次提交
  18. 13 12月, 2014 1 次提交
  19. 10 12月, 2014 1 次提交
    • A
      blktrace: don't let the sysfs interface remove trace from running list · 7eca2103
      Arianna Avanzini 提交于
      Currently, blktrace can be started/stopped via its ioctl-based interface
      (used by the userspace blktrace tool) or via its ftrace interface. The
      function blk_trace_remove_queue(), called each time an "enable" tunable
      of the ftrace interface transitions to zero, removes the trace from the
      running list, even if no function from the sysfs interface adds it to
      such a list. This leads to a null pointer dereference.  This commit
      changes the blk_trace_remove_queue() function so that it does not remove
      the blk_trace from the running list.
      
      v2:
          - Now the patch removes the invocation of list_del() instead of
            adding an useless if branch, as suggested by Namhyung Kim.
      Signed-off-by: NArianna Avanzini <avanzini.arianna@gmail.com>
      Signed-off-by: NJens Axboe <axboe@fb.com>
      7eca2103
  20. 04 12月, 2014 2 次提交
    • D
      tracing: Truncated output is better than nothing · 3558a5ac
      Dan Carpenter 提交于
      The initial reason for this patch is that I noticed that:
      
      	if (len > TRACE_BUF_SIZE)
      
      is off by one.  In this code, if len == TRACE_BUF_SIZE, then it means we
      have truncated the last character off the output string.  If we truncate
      two or more characters then we exit without printing.
      
      After some discussion, we decided that printing truncated data is better
      than not printing at all so we should just use vscnprintf() and remove
      the test entirely.  Also I have updated memcpy() to copy the NUL char
      instead of setting the NUL in a separate step.
      
      Link: http://lkml.kernel.org/r/20141127155752.GA21914@mwandaSigned-off-by: NDan Carpenter <dan.carpenter@oracle.com>
      Signed-off-by: NSteven Rostedt <rostedt@goodmis.org>
      3558a5ac
    • B
      tracing: Add additional marks to signal very large time deltas · 8e1e1df2
      Byungchul Park 提交于
      Currently, function graph tracer prints "!" or "+" just before
      function execution time to signal a function overhead, depending
      on the time. And some tracers tracing latency also print "!" or
      "+" just after time to signal overhead, depending on the interval
      between events. Even it is usually enough to do that, we sometimes
      need to signal for bigger execution time than 100 micro seconds.
      
      For example, I used function graph tracer to detect if there is
      any case that exit_mm() takes too much time. I did following steps
      in /sys/kernel/debug/tracing. It was easier to detect very large
      excution time with patched kernel than with original kernel.
      
      $ echo exit_mm > set_graph_function
      $ echo function_graph > current_tracer
      $ echo > trace
      $ cat trace_pipe > $LOGFILE
       ... (do something and terminate logging)
      $ grep "\\$" $LOGFILE
       3) $ 22082032 us |                      } /* kernel_map_pages */
       3) $ 22082040 us |                    } /* free_pages_prepare */
       3) $ 22082113 us |                  } /* free_hot_cold_page */
       3) $ 22083455 us |                } /* free_hot_cold_page_list */
       3) $ 22083895 us |              } /* release_pages */
       3) $ 22177873 us |            } /* free_pages_and_swap_cache */
       3) $ 22178929 us |          } /* unmap_single_vma */
       3) $ 22198885 us |        } /* unmap_vmas */
       3) $ 22206949 us |      } /* exit_mmap */
       3) $ 22207659 us |    } /* mmput */
       3) $ 22207793 us |  } /* exit_mm */
      
      And then, it was easy to find out that a schedule-out occured by
      sub_preempt_count() within kernel_map_pages().
      
      To detect very large function exection time caused by either problematic
      function implementation or scheduling issues, this patch can be useful.
      
      Link: http://lkml.kernel.org/r/1416789259-24038-1-git-send-email-byungchul.park@lge.comSigned-off-by: NByungchul Park <byungchul.park@lge.com>
      Signed-off-by: NSteven Rostedt <rostedt@goodmis.org>
      8e1e1df2
  21. 03 12月, 2014 2 次提交
  22. 22 11月, 2014 1 次提交
    • M
      ftrace, kprobes: Support IPMODIFY flag to find IP modify conflict · f8b8be8a
      Masami Hiramatsu 提交于
      Introduce FTRACE_OPS_FL_IPMODIFY to avoid conflict among
      ftrace users who may modify regs->ip to change the execution
      path. If two or more users modify the regs->ip on the same
      function entry, one of them will be broken. So they must add
      IPMODIFY flag and make sure that ftrace_set_filter_ip() succeeds.
      
      Note that ftrace doesn't allow ftrace_ops which has IPMODIFY
      flag to have notrace hash, and the ftrace_ops must have a
      filter hash (so that the ftrace_ops can hook only specific
      entries), because it strongly depends on the address and
      must be allowed for only few selected functions.
      
      Link: http://lkml.kernel.org/r/20141121102516.11844.27829.stgit@localhost.localdomain
      
      Cc: Jiri Kosina <jkosina@suse.cz>
      Cc: Seth Jennings <sjenning@redhat.com>
      Cc: Petr Mladek <pmladek@suse.cz>
      Cc: Vojtech Pavlik <vojtech@suse.cz>
      Cc: Miroslav Benes <mbenes@suse.cz>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Ananth N Mavinakayanahalli <ananth@in.ibm.com>
      Cc: Josh Poimboeuf <jpoimboe@redhat.com>
      Cc: Namhyung Kim <namhyung@kernel.org>
      Signed-off-by: NMasami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
      [ fixed up some of the comments ]
      Signed-off-by: NSteven Rostedt <rostedt@goodmis.org>
      f8b8be8a
  23. 20 11月, 2014 2 次提交