1. 02 4月, 2015 4 次提交
    • A
      tracing: Allow BPF programs to call bpf_trace_printk() · 9c959c86
      Alexei Starovoitov 提交于
      Debugging of BPF programs needs some form of printk from the
      program, so let programs call limited trace_printk() with %d %u
      %x %p modifiers only.
      
      Similar to kernel modules, during program load verifier checks
      whether program is calling bpf_trace_printk() and if so, kernel
      allocates trace_printk buffers and emits big 'this is debug
      only' banner.
      Signed-off-by: NAlexei Starovoitov <ast@plumgrid.com>
      Reviewed-by: NSteven Rostedt <rostedt@goodmis.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Arnaldo Carvalho de Melo <acme@infradead.org>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Daniel Borkmann <daniel@iogearbox.net>
      Cc: David S. Miller <davem@davemloft.net>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
      Cc: Namhyung Kim <namhyung@kernel.org>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Link: http://lkml.kernel.org/r/1427312966-8434-6-git-send-email-ast@plumgrid.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
      9c959c86
    • A
      tracing: Allow BPF programs to call bpf_ktime_get_ns() · d9847d31
      Alexei Starovoitov 提交于
      bpf_ktime_get_ns() is used by programs to compute time delta
      between events or as a timestamp
      Signed-off-by: NAlexei Starovoitov <ast@plumgrid.com>
      Reviewed-by: NSteven Rostedt <rostedt@goodmis.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Arnaldo Carvalho de Melo <acme@infradead.org>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Daniel Borkmann <daniel@iogearbox.net>
      Cc: David S. Miller <davem@davemloft.net>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
      Cc: Namhyung Kim <namhyung@kernel.org>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Link: http://lkml.kernel.org/r/1427312966-8434-5-git-send-email-ast@plumgrid.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
      d9847d31
    • A
      tracing, perf: Implement BPF programs attached to kprobes · 2541517c
      Alexei Starovoitov 提交于
      BPF programs, attached to kprobes, provide a safe way to execute
      user-defined BPF byte-code programs without being able to crash or
      hang the kernel in any way. The BPF engine makes sure that such
      programs have a finite execution time and that they cannot break
      out of their sandbox.
      
      The user interface is to attach to a kprobe via the perf syscall:
      
      	struct perf_event_attr attr = {
      		.type	= PERF_TYPE_TRACEPOINT,
      		.config	= event_id,
      		...
      	};
      
      	event_fd = perf_event_open(&attr,...);
      	ioctl(event_fd, PERF_EVENT_IOC_SET_BPF, prog_fd);
      
      'prog_fd' is a file descriptor associated with BPF program
      previously loaded.
      
      'event_id' is an ID of the kprobe created.
      
      Closing 'event_fd':
      
      	close(event_fd);
      
      ... automatically detaches BPF program from it.
      
      BPF programs can call in-kernel helper functions to:
      
        - lookup/update/delete elements in maps
      
        - probe_read - wraper of probe_kernel_read() used to access any
          kernel data structures
      
      BPF programs receive 'struct pt_regs *' as an input ('struct pt_regs' is
      architecture dependent) and return 0 to ignore the event and 1 to store
      kprobe event into the ring buffer.
      
      Note, kprobes are a fundamentally _not_ a stable kernel ABI,
      so BPF programs attached to kprobes must be recompiled for
      every kernel version and user must supply correct LINUX_VERSION_CODE
      in attr.kern_version during bpf_prog_load() call.
      Signed-off-by: NAlexei Starovoitov <ast@plumgrid.com>
      Reviewed-by: NSteven Rostedt <rostedt@goodmis.org>
      Reviewed-by: NMasami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Arnaldo Carvalho de Melo <acme@infradead.org>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Daniel Borkmann <daniel@iogearbox.net>
      Cc: David S. Miller <davem@davemloft.net>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Namhyung Kim <namhyung@kernel.org>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Link: http://lkml.kernel.org/r/1427312966-8434-4-git-send-email-ast@plumgrid.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
      2541517c
    • A
      tracing: Add kprobe flag · 72cbbc89
      Alexei Starovoitov 提交于
      add TRACE_EVENT_FL_KPROBE flag to differentiate kprobe type of
      tracepoints, since bpf programs can only be attached to kprobe
      type of PERF_TYPE_TRACEPOINT perf events.
      Signed-off-by: NAlexei Starovoitov <ast@plumgrid.com>
      Reviewed-by: NSteven Rostedt <rostedt@goodmis.org>
      Reviewed-by: NMasami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Arnaldo Carvalho de Melo <acme@infradead.org>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Daniel Borkmann <daniel@iogearbox.net>
      Cc: David S. Miller <davem@davemloft.net>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Namhyung Kim <namhyung@kernel.org>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Link: http://lkml.kernel.org/r/1427312966-8434-3-git-send-email-ast@plumgrid.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
      72cbbc89
  2. 23 3月, 2015 1 次提交
  3. 09 3月, 2015 3 次提交
    • S
      ftrace: Fix ftrace enable ordering of sysctl ftrace_enabled · 524a3868
      Steven Rostedt (Red Hat) 提交于
      Some archs (specifically PowerPC), are sensitive with the ordering of
      the enabling of the calls to function tracing and setting of the
      function to use to be traced.
      
      That is, update_ftrace_function() sets what function the ftrace_caller
      trampoline should call. Some archs require this to be set before
      calling ftrace_run_update_code().
      
      Another bug was discovered, that ftrace_startup_sysctl() called
      ftrace_run_update_code() directly. If the function the ftrace_caller
      trampoline changes, then it will not be updated. Instead a call
      to ftrace_startup_enable() should be called because it tests to see
      if the callback changed since the code was disabled, and will
      tell the arch to update appropriately. Most archs do not need this
      notification, but PowerPC does.
      
      The problem could be seen by the following commands:
      
       # echo 0 > /proc/sys/kernel/ftrace_enabled
       # echo function > /sys/kernel/debug/tracing/current_tracer
       # echo 1 > /proc/sys/kernel/ftrace_enabled
       # cat /sys/kernel/debug/tracing/trace
      
      The trace will show that function tracing was not active.
      
      Cc: stable@vger.kernel.org # 2.6.27+
      Signed-off-by: NSteven Rostedt <rostedt@goodmis.org>
      524a3868
    • P
      ftrace: Fix en(dis)able graph caller when en(dis)abling record via sysctl · 1619dc3f
      Pratyush Anand 提交于
      When ftrace is enabled globally through the proc interface, we must check if
      ftrace_graph_active is set. If it is set, then we should also pass the
      FTRACE_START_FUNC_RET command to ftrace_run_update_code(). Similarly, when
      ftrace is disabled globally through the proc interface, we must check if
      ftrace_graph_active is set. If it is set, then we should also pass the
      FTRACE_STOP_FUNC_RET command to ftrace_run_update_code().
      
      Consider the following situation.
      
       # echo 0 > /proc/sys/kernel/ftrace_enabled
      
      After this ftrace_enabled = 0.
      
       # echo function_graph > /sys/kernel/debug/tracing/current_tracer
      
      Since ftrace_enabled = 0, ftrace_enable_ftrace_graph_caller() is never
      called.
      
       # echo 1 > /proc/sys/kernel/ftrace_enabled
      
      Now ftrace_enabled will be set to true, but still
      ftrace_enable_ftrace_graph_caller() will not be called, which is not
      desired.
      
      Further if we execute the following after this:
        # echo nop > /sys/kernel/debug/tracing/current_tracer
      
      Now since ftrace_enabled is set it will call
      ftrace_disable_ftrace_graph_caller(), which causes a kernel warning on
      the ARM platform.
      
      On the ARM platform, when ftrace_enable_ftrace_graph_caller() is called,
      it checks whether the old instruction is a nop or not. If it's not a nop,
      then it returns an error. If it is a nop then it replaces instruction at
      that address with a branch to ftrace_graph_caller.
      ftrace_disable_ftrace_graph_caller() behaves just the opposite. Therefore,
      if generic ftrace code ever calls either ftrace_enable_ftrace_graph_caller()
      or ftrace_disable_ftrace_graph_caller() consecutively two times in a row,
      then it will return an error, which will cause the generic ftrace code to
      raise a warning.
      
      Note, x86 does not have an issue with this because the architecture
      specific code for ftrace_enable_ftrace_graph_caller() and
      ftrace_disable_ftrace_graph_caller() does not check the previous state,
      and calling either of these functions twice in a row has no ill effect.
      
      Link: http://lkml.kernel.org/r/e4fbe64cdac0dd0e86a3bf914b0f83c0b419f146.1425666454.git.panand@redhat.com
      
      Cc: stable@vger.kernel.org # 2.6.31+
      Signed-off-by: NPratyush Anand <panand@redhat.com>
      [
        removed extra if (ftrace_start_up) and defined ftrace_graph_active as 0
        if CONFIG_FUNCTION_GRAPH_TRACER is not set.
      ]
      Signed-off-by: NSteven Rostedt <rostedt@goodmis.org>
      1619dc3f
    • S
      ftrace: Clear REGS_EN and TRAMP_EN flags on disabling record via sysctl · b24d443b
      Steven Rostedt (Red Hat) 提交于
      When /proc/sys/kernel/ftrace_enabled is set to zero, all function
      tracing is disabled. But the records that represent the functions
      still hold information about the ftrace_ops that are hooked to them.
      
      ftrace_ops may request "REGS" (have a full set of pt_regs passed to
      the callback), or "TRAMP" (the ops has its own trampoline to use).
      When the record is updated to represent the state of the ops hooked
      to it, it sets "REGS_EN" and/or "TRAMP_EN" to state that the callback
      points to the correct trampoline (REGS has its own trampoline).
      
      When ftrace_enabled is set to zero, all ftrace locations are a nop,
      so they do not point to any trampoline. But the _EN flags are still
      set. This can cause the accounting to go wrong when ftrace_enabled
      is cleared and an ops that has a trampoline is registered or unregistered.
      
      For example, the following will cause ftrace to crash:
      
       # echo function_graph > /sys/kernel/debug/tracing/current_tracer
       # echo 0 > /proc/sys/kernel/ftrace_enabled
       # echo nop > /sys/kernel/debug/tracing/current_tracer
       # echo 1 > /proc/sys/kernel/ftrace_enabled
       # echo function_graph > /sys/kernel/debug/tracing/current_tracer
      
      As function_graph uses a trampoline, when ftrace_enabled is set to zero
      the updates to the record are not done. When enabling function_graph
      again, the record will still have the TRAMP_EN flag set, and it will
      look for an op that has a trampoline other than the function_graph
      ops, and fail to find one.
      
      Cc: stable@vger.kernel.org # 3.17+
      Reported-by: NPratyush Anand <panand@redhat.com>
      Signed-off-by: NSteven Rostedt <rostedt@goodmis.org>
      b24d443b
  4. 14 2月, 2015 1 次提交
  5. 11 2月, 2015 1 次提交
    • S
      ring-buffer: Do not wake up a splice waiter when page is not full · 1e0d6714
      Steven Rostedt (Red Hat) 提交于
      When an application connects to the ring buffer via splice, it can only
      read full pages. Splice does not work with partial pages. If there is
      not enough data to fill a page, the splice command will either block
      or return -EAGAIN (if set to nonblock).
      
      Code was added where if the page is not full, to just sleep again.
      The problem is, it will get woken up again on the next event. That
      is, when something is written into the ring buffer, if there is a waiter
      it will wake it up. The waiter would then check the buffer, see that
      it still does not have enough data to fill a page and go back to sleep.
      To make matters worse, when the waiter goes back to sleep, it could
      cause another event, which would wake it back up again to see it
      doesn't have enough data and sleep again. This produces a tremendous
      overhead and fills the ring buffer with noise.
      
      For example, recording sched_switch on an idle system for 10 seconds
      produces 25,350,475 events!!!
      
      Create another wait queue for those waiters wanting full pages.
      When an event is written, it only wakes up waiters if there's a full
      page of data. It does not wake up the waiter if the page is not yet
      full.
      
      After this change, recording sched_switch on an idle system for 10
      seconds produces only 800 events. Getting rid of 25,349,675 useless
      events (99.9969% of events!!), is something to take seriously.
      
      Cc: stable@vger.kernel.org # 3.16+
      Cc: Rabin Vincent <rabin@rab.in>
      Fixes: e30f53aa "tracing: Do not busy wait in buffer splice"
      Signed-off-by: NSteven Rostedt <rostedt@goodmis.org>
      1e0d6714
  6. 10 2月, 2015 1 次提交
  7. 02 2月, 2015 2 次提交
  8. 30 1月, 2015 1 次提交
  9. 29 1月, 2015 2 次提交
  10. 28 1月, 2015 2 次提交
  11. 23 1月, 2015 2 次提交
  12. 15 1月, 2015 4 次提交
  13. 14 1月, 2015 1 次提交
    • P
      perf: Avoid horrible stack usage · 86038c5e
      Peter Zijlstra (Intel) 提交于
      Both Linus (most recent) and Steve (a while ago) reported that perf
      related callbacks have massive stack bloat.
      
      The problem is that software events need a pt_regs in order to
      properly report the event location and unwind stack. And because we
      could not assume one was present we allocated one on stack and filled
      it with minimal bits required for operation.
      
      Now, pt_regs is quite large, so this is undesirable. Furthermore it
      turns out that most sites actually have a pt_regs pointer available,
      making this even more onerous, as the stack space is pointless waste.
      
      This patch addresses the problem by observing that software events
      have well defined nesting semantics, therefore we can use static
      per-cpu storage instead of on-stack.
      
      Linus made the further observation that all but the scheduler callers
      of perf_sw_event() have a pt_regs available, so we change the regular
      perf_sw_event() to require a valid pt_regs (where it used to be
      optional) and add perf_sw_event_sched() for the scheduler.
      
      We have a scheduler specific call instead of a more generic _noregs()
      like construct because we can assume non-recursion from the scheduler
      and thereby simplify the code further (_noregs would have to put the
      recursion context call inline in order to assertain which __perf_regs
      element to use).
      
      One last note on the implementation of perf_trace_buf_prepare(); we
      allow .regs = NULL for those cases where we already have a pt_regs
      pointer available and do not need another.
      Reported-by: NLinus Torvalds <torvalds@linux-foundation.org>
      Reported-by: NSteven Rostedt <rostedt@goodmis.org>
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
      Cc: Javi Merino <javi.merino@arm.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Petr Mladek <pmladek@suse.cz>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Tom Zanussi <tom.zanussi@linux.intel.com>
      Cc: Vaibhav Nagarnaik <vnagarnaik@google.com>
      Link: http://lkml.kernel.org/r/20141216115041.GW3337@twins.programming.kicks-ass.netSigned-off-by: NIngo Molnar <mingo@kernel.org>
      86038c5e
  14. 23 12月, 2014 2 次提交
    • S
      tracing: Remove taking of trace_types_lock in pipe files · d716ff71
      Steven Rostedt (Red Hat) 提交于
      Taking the global mutex "trace_types_lock" in the trace_pipe files
      causes a bottle neck as most the pipe files can be read per cpu
      and there's no reason to serialize them.
      
      The current_trace variable was given a ref count and it can not
      change when the ref count is not zero. Opening the trace_pipe
      files will up the ref count (and decremented on close), so that
      the lock no longer needs to be taken when accessing the
      current_trace variable.
      Signed-off-by: NSteven Rostedt <rostedt@goodmis.org>
      d716ff71
    • S
      tracing: Add ref count to tracer for when they are being read by pipe · cf6ab6d9
      Steven Rostedt (Red Hat) 提交于
      When one of the trace pipe files are being read (by either the trace_pipe
      or trace_pipe_raw), do not allow the current_trace to change. By adding
      a ref count that is incremented when the pipe files are opened, will
      prevent the current_trace from being changed.
      
      This will allow for the removal of the global trace_types_lock from
      reading the pipe buffers (which is currently a bottle neck).
      Signed-off-by: NSteven Rostedt <rostedt@goodmis.org>
      cf6ab6d9
  15. 15 12月, 2014 2 次提交
  16. 13 12月, 2014 1 次提交
  17. 10 12月, 2014 1 次提交
    • A
      blktrace: don't let the sysfs interface remove trace from running list · 7eca2103
      Arianna Avanzini 提交于
      Currently, blktrace can be started/stopped via its ioctl-based interface
      (used by the userspace blktrace tool) or via its ftrace interface. The
      function blk_trace_remove_queue(), called each time an "enable" tunable
      of the ftrace interface transitions to zero, removes the trace from the
      running list, even if no function from the sysfs interface adds it to
      such a list. This leads to a null pointer dereference.  This commit
      changes the blk_trace_remove_queue() function so that it does not remove
      the blk_trace from the running list.
      
      v2:
          - Now the patch removes the invocation of list_del() instead of
            adding an useless if branch, as suggested by Namhyung Kim.
      Signed-off-by: NArianna Avanzini <avanzini.arianna@gmail.com>
      Signed-off-by: NJens Axboe <axboe@fb.com>
      7eca2103
  18. 04 12月, 2014 2 次提交
    • D
      tracing: Truncated output is better than nothing · 3558a5ac
      Dan Carpenter 提交于
      The initial reason for this patch is that I noticed that:
      
      	if (len > TRACE_BUF_SIZE)
      
      is off by one.  In this code, if len == TRACE_BUF_SIZE, then it means we
      have truncated the last character off the output string.  If we truncate
      two or more characters then we exit without printing.
      
      After some discussion, we decided that printing truncated data is better
      than not printing at all so we should just use vscnprintf() and remove
      the test entirely.  Also I have updated memcpy() to copy the NUL char
      instead of setting the NUL in a separate step.
      
      Link: http://lkml.kernel.org/r/20141127155752.GA21914@mwandaSigned-off-by: NDan Carpenter <dan.carpenter@oracle.com>
      Signed-off-by: NSteven Rostedt <rostedt@goodmis.org>
      3558a5ac
    • B
      tracing: Add additional marks to signal very large time deltas · 8e1e1df2
      Byungchul Park 提交于
      Currently, function graph tracer prints "!" or "+" just before
      function execution time to signal a function overhead, depending
      on the time. And some tracers tracing latency also print "!" or
      "+" just after time to signal overhead, depending on the interval
      between events. Even it is usually enough to do that, we sometimes
      need to signal for bigger execution time than 100 micro seconds.
      
      For example, I used function graph tracer to detect if there is
      any case that exit_mm() takes too much time. I did following steps
      in /sys/kernel/debug/tracing. It was easier to detect very large
      excution time with patched kernel than with original kernel.
      
      $ echo exit_mm > set_graph_function
      $ echo function_graph > current_tracer
      $ echo > trace
      $ cat trace_pipe > $LOGFILE
       ... (do something and terminate logging)
      $ grep "\\$" $LOGFILE
       3) $ 22082032 us |                      } /* kernel_map_pages */
       3) $ 22082040 us |                    } /* free_pages_prepare */
       3) $ 22082113 us |                  } /* free_hot_cold_page */
       3) $ 22083455 us |                } /* free_hot_cold_page_list */
       3) $ 22083895 us |              } /* release_pages */
       3) $ 22177873 us |            } /* free_pages_and_swap_cache */
       3) $ 22178929 us |          } /* unmap_single_vma */
       3) $ 22198885 us |        } /* unmap_vmas */
       3) $ 22206949 us |      } /* exit_mmap */
       3) $ 22207659 us |    } /* mmput */
       3) $ 22207793 us |  } /* exit_mm */
      
      And then, it was easy to find out that a schedule-out occured by
      sub_preempt_count() within kernel_map_pages().
      
      To detect very large function exection time caused by either problematic
      function implementation or scheduling issues, this patch can be useful.
      
      Link: http://lkml.kernel.org/r/1416789259-24038-1-git-send-email-byungchul.park@lge.comSigned-off-by: NByungchul Park <byungchul.park@lge.com>
      Signed-off-by: NSteven Rostedt <rostedt@goodmis.org>
      8e1e1df2
  19. 03 12月, 2014 2 次提交
  20. 22 11月, 2014 1 次提交
    • M
      ftrace, kprobes: Support IPMODIFY flag to find IP modify conflict · f8b8be8a
      Masami Hiramatsu 提交于
      Introduce FTRACE_OPS_FL_IPMODIFY to avoid conflict among
      ftrace users who may modify regs->ip to change the execution
      path. If two or more users modify the regs->ip on the same
      function entry, one of them will be broken. So they must add
      IPMODIFY flag and make sure that ftrace_set_filter_ip() succeeds.
      
      Note that ftrace doesn't allow ftrace_ops which has IPMODIFY
      flag to have notrace hash, and the ftrace_ops must have a
      filter hash (so that the ftrace_ops can hook only specific
      entries), because it strongly depends on the address and
      must be allowed for only few selected functions.
      
      Link: http://lkml.kernel.org/r/20141121102516.11844.27829.stgit@localhost.localdomain
      
      Cc: Jiri Kosina <jkosina@suse.cz>
      Cc: Seth Jennings <sjenning@redhat.com>
      Cc: Petr Mladek <pmladek@suse.cz>
      Cc: Vojtech Pavlik <vojtech@suse.cz>
      Cc: Miroslav Benes <mbenes@suse.cz>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Ananth N Mavinakayanahalli <ananth@in.ibm.com>
      Cc: Josh Poimboeuf <jpoimboe@redhat.com>
      Cc: Namhyung Kim <namhyung@kernel.org>
      Signed-off-by: NMasami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
      [ fixed up some of the comments ]
      Signed-off-by: NSteven Rostedt <rostedt@goodmis.org>
      f8b8be8a
  21. 20 11月, 2014 4 次提交