1. 16 Jun 2015, 2 commits
  2. 11 Jun 2015, 1 commit
  3. 01 Jun 2015, 1 commit
  4. 22 May 2015, 1 commit
    • A
      bpf: allow bpf programs to tail-call other bpf programs · 04fd61ab
Authored by Alexei Starovoitov
      introduce bpf_tail_call(ctx, &jmp_table, index) helper function
      which can be used from BPF programs like:
      int bpf_prog(struct pt_regs *ctx)
      {
        ...
        bpf_tail_call(ctx, &jmp_table, index);
        ...
      }
      that is roughly equivalent to:
      int bpf_prog(struct pt_regs *ctx)
      {
        ...
        if (jmp_table[index])
          return (*jmp_table[index])(ctx);
        ...
      }
The important detail is that this is not a normal call, but a tail call.
The kernel stack is precious, so this helper reuses the current
stack frame and jumps into another BPF program without adding an
extra call frame.
This is trivial in the interpreter and a bit trickier in JITs.
In the x64 JIT, most of the generated assembler prologue
is common to all programs, so it is simply skipped while jumping.
Other JITs can do a similar prologue-skipping optimization or
unwind the stack before jumping into the next program.
      
      bpf_tail_call() arguments:
      ctx - context pointer
      jmp_table - one of BPF_MAP_TYPE_PROG_ARRAY maps used as the jump table
      index - index in the jump table
      
Since all BPF programs are identified by file descriptor, user space
needs to populate the jmp_table with the FDs of other BPF programs.
If jmp_table[index] is empty, bpf_tail_call() doesn't jump anywhere
and program execution continues as normal.
      
A new BPF_MAP_TYPE_PROG_ARRAY map type is introduced so that user space can
populate this jmp_table array with the FDs of other BPF programs, as shown in
the sketch below. Programs can share the same jmp_table array or use multiple
jmp_tables.
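
As a rough illustration, user space could fill one slot like this (a minimal
sketch using the raw bpf(2) syscall; the helper name and the lack of error
handling are mine, not part of the patch):

  #include <linux/bpf.h>
  #include <string.h>
  #include <sys/syscall.h>
  #include <unistd.h>

  /* hypothetical helper: store prog_fd into slot 'index' of a PROG_ARRAY map */
  static int prog_array_set(int map_fd, __u32 index, __u32 prog_fd)
  {
          union bpf_attr attr;

          memset(&attr, 0, sizeof(attr));
          attr.map_fd = map_fd;                         /* FD of the jmp_table map  */
          attr.key    = (__u64)(unsigned long)&index;   /* slot in the jump table   */
          attr.value  = (__u64)(unsigned long)&prog_fd; /* FD of the callee program */
          attr.flags  = BPF_ANY;

          return syscall(__NR_bpf, BPF_MAP_UPDATE_ELEM, &attr, sizeof(attr));
  }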
      
A chain of tail calls can form unpredictable dynamic loops, so tail_call_cnt
is used to limit the number of calls; it is currently set to 32.
      
      Use cases:
==========
      - simplify complex programs by splitting them into a sequence of small programs
      
      - dispatch routine
        For tracing and future seccomp the program may be triggered on all system
        calls, but processing of syscall arguments will be different. It's more
        efficient to implement them as:
        int syscall_entry(struct seccomp_data *ctx)
        {
           bpf_tail_call(ctx, &syscall_jmp_table, ctx->nr /* syscall number */);
           ... default: process unknown syscall ...
        }
        int sys_write_event(struct seccomp_data *ctx) {...}
        int sys_read_event(struct seccomp_data *ctx) {...}
        syscall_jmp_table[__NR_write] = sys_write_event;
        syscall_jmp_table[__NR_read] = sys_read_event;
      
        For networking the program may call into different parsers depending on
        packet format, like:
        int packet_parser(struct __sk_buff *skb)
        {
           ... parse L2, L3 here ...
           __u8 ipproto = load_byte(skb, ... offsetof(struct iphdr, protocol));
           bpf_tail_call(skb, &ipproto_jmp_table, ipproto);
           ... default: process unknown protocol ...
        }
        int parse_tcp(struct __sk_buff *skb) {...}
        int parse_udp(struct __sk_buff *skb) {...}
        ipproto_jmp_table[IPPROTO_TCP] = parse_tcp;
        ipproto_jmp_table[IPPROTO_UDP] = parse_udp;
      
- for the TC use case, bpf_tail_call() allows implementing reclassify-like logic
      
      - bpf_map_update_elem/delete calls into BPF_MAP_TYPE_PROG_ARRAY jump table
        are atomic, so user space can build chains of BPF programs on the fly
      
      Implementation details:
      =======================
      - high performance of bpf_tail_call() is the goal.
  It could have been implemented without JIT changes as a wrapper on top of
  the BPF_PROG_RUN() macro, but with two downsides:
  . all programs would have to pay a performance penalty for this feature, and
    the tail call itself would be slower, since a mandatory stack unwind, return
    and stack allocation would be done for every tail call.
  . tail calls would be limited to programs running with preemption disabled,
    since the generic 'void *ctx' doesn't have room for 'tail_call_cnt', which
    would need to be either a global per-cpu variable accessed by the helper and
    by the wrapper, or a global variable protected by locks.
      
  In this implementation, the x64 JIT bypasses the stack unwind and jumps into
  the callee program after the prologue.
      
- bpf_prog_array_compatible() ensures that the prog_type of callee and caller
  are the same and that the JITed/non-JITed flag is the same, since calling a
  JITed program from a non-JITed one is invalid because the stack frames are
  different. Similarly, calling a kprobe-type program from a socket-type
  program is invalid.
      
- the jump table is implemented as BPF_MAP_TYPE_PROG_ARRAY to reuse the 'map'
  abstraction, its user space API and all of the verifier logic.
  It lives in the existing arraymap.c file, since several functions are
  shared with the regular array map.
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
      04fd61ab
  5. 07 May 2015, 1 commit
  6. 17 Apr 2015, 1 commit
  7. 16 Apr 2015, 4 commits
    • J
      tracing: Fix incorrect enabling of trace events by boot cmdline · 84fce9db
Authored by Joonsoo Kim
Trace events are not properly enabled from the boot cmdline. If we pass
"trace_event=kmem:mm_page_alloc" on the boot cmdline, it enables all kmem
trace events, not just the page_alloc event.

This is caused by the parsing mechanism. When we parse the cmdline, the buffer
contents are modified by tokenization, and if we use this buffer
again, we will get the wrong result.

Unfortunately, this buffer is accessed three times to set up trace events
properly at boot time, so we need to handle this situation.
      
There is already code handling ",", but we need the same for ":".
This patch adds it, as sketched below.
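
A minimal sketch of the idea (illustrative only; the function and buffer names
are mine, and the real patch handles this inside the trace-event boot setup
code):

  #include <linux/string.h>
  #include <linux/printk.h>

  /* split "system:event" without clobbering the original cmdline buffer */
  static void parse_event_spec(const char *buf)
  {
          char copy[128], *p = copy, *system, *event;

          strlcpy(copy, buf, sizeof(copy));   /* tokenize a copy, not 'buf'     */
          system = strsep(&p, ":");           /* e.g. "kmem"                    */
          event  = strsep(&p, ":");           /* e.g. "mm_page_alloc", or NULL
                                                 to enable the whole system     */
          pr_debug("enable %s:%s\n", system, event ? event : "*");
  }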
      
      Link: http://lkml.kernel.org/r/1429159484-22977-1-git-send-email-iamjoonsoo.kim@lge.com
      
      Cc: stable@vger.kernel.org # 3.19+
Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
[ added missing return ret; ]
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
      84fce9db
    • R
      tracing: Handle ftrace_dump() atomic context in graph_trace_open() · ef99b88b
Authored by Rabin Vincent
      graph_trace_open() can be called in atomic context from ftrace_dump().
Use GFP_ATOMIC for the memory allocations when that's the case, in order
to avoid the following splat (a simplified sketch of the fix follows the
backtrace).
      
       BUG: sleeping function called from invalid context at mm/slab.c:2849
       in_atomic(): 1, irqs_disabled(): 128, pid: 0, name: swapper/0
       Backtrace:
       ..
       [<8004dc94>] (__might_sleep) from [<801371f4>] (kmem_cache_alloc_trace+0x160/0x238)
        r7:87800040 r6:000080d0 r5:810d16e8 r4:000080d0
       [<80137094>] (kmem_cache_alloc_trace) from [<800cbd60>] (graph_trace_open+0x30/0xd0)
        r10:00000100 r9:809171a8 r8:00008e28 r7:810d16f0 r6:00000001 r5:810d16e8
        r4:810d16f0
       [<800cbd30>] (graph_trace_open) from [<800c79c4>] (trace_init_global_iter+0x50/0x9c)
        r8:00008e28 r7:808c853c r6:00000001 r5:810d16e8 r4:810d16f0 r3:800cbd30
       [<800c7974>] (trace_init_global_iter) from [<800c7aa0>] (ftrace_dump+0x90/0x2ec)
        r4:810d2580 r3:00000000
       [<800c7a10>] (ftrace_dump) from [<80414b2c>] (sysrq_ftrace_dump+0x1c/0x20)
        r10:00000100 r9:809171a8 r8:808f6e7c r7:00000001 r6:00000007 r5:0000007a
        r4:808d5394
       [<80414b10>] (sysrq_ftrace_dump) from [<800169b8>] (return_to_handler+0x0/0x18)
       [<80415498>] (__handle_sysrq) from [<800169b8>] (return_to_handler+0x0/0x18)
        r8:808c8100 r7:808c8444 r6:00000101 r5:00000010 r4:84eb3210
       [<80415668>] (handle_sysrq) from [<800169b8>] (return_to_handler+0x0/0x18)
       [<8042a760>] (pl011_int) from [<800169b8>] (return_to_handler+0x0/0x18)
        r10:809171bc r9:809171a8 r8:00000001 r7:00000026 r6:808c6000 r5:84f01e60
        r4:8454fe00
       [<8007782c>] (handle_irq_event_percpu) from [<80077b44>] (handle_irq_event+0x4c/0x6c)
        r10:808c7ef0 r9:87283e00 r8:00000001 r7:00000000 r6:8454fe00 r5:84f01e60
        r4:84f01e00
       [<80077af8>] (handle_irq_event) from [<8007aa28>] (handle_fasteoi_irq+0xf0/0x1ac)
        r6:808f52a4 r5:84f01e60 r4:84f01e00 r3:00000000
       [<8007a938>] (handle_fasteoi_irq) from [<80076dc0>] (generic_handle_irq+0x3c/0x4c)
        r6:00000026 r5:00000000 r4:00000026 r3:8007a938
       [<80076d84>] (generic_handle_irq) from [<80077128>] (__handle_domain_irq+0x8c/0xfc)
        r4:808c1e38 r3:0000002e
       [<8007709c>] (__handle_domain_irq) from [<800087b8>] (gic_handle_irq+0x34/0x6c)
        r10:80917748 r9:00000001 r8:88802100 r7:808c7ef0 r6:808c8fb0 r5:00000015
        r4:8880210c r3:808c7ef0
       [<80008784>] (gic_handle_irq) from [<80014044>] (__irq_svc+0x44/0x7c)
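
A simplified sketch of the gfp-flag choice described above (illustrative;
the real change lives inside graph_trace_open()):

  #include <linux/slab.h>
  #include <linux/preempt.h>
  #include <linux/irqflags.h>

  static void *graph_alloc(size_t size)
  {
          /* ftrace_dump() can call us with IRQs off or preemption disabled,
           * where GFP_KERNEL would sleep; use GFP_ATOMIC there instead */
          gfp_t gfpflags = (in_atomic() || irqs_disabled()) ? GFP_ATOMIC
                                                            : GFP_KERNEL;

          return kzalloc(size, gfpflags);
  }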
      
      Link: http://lkml.kernel.org/r/1428953721-31349-1-git-send-email-rabin@rab.in
      Link: http://lkml.kernel.org/r/1428957012-2319-1-git-send-email-rabin@rab.in
      
      Cc: stable@vger.kernel.org # 3.13+
Signed-off-by: Rabin Vincent <rabin@rab.in>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
      ef99b88b
    • J
      tracing: remove use of seq_printf return value · 962e3707
Authored by Joe Perches
      The seq_printf return value, because it's frequently misused,
      will eventually be converted to void.
      
      See: commit 1f33c41c ("seq_file: Rename seq_overflow() to
           seq_has_overflowed() and make public")
      
      Miscellanea:
      
      o Remove unused return value from trace_lookup_stack
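
For illustration, a converted call site then looks like this (a sketch of the
intended usage, not a hunk from this patch):

  #include <linux/seq_file.h>

  static int example_show(struct seq_file *m, void *v)
  {
          /* emit output without testing seq_printf()'s return value;
           * callers that care about truncation use seq_has_overflowed(m) */
          seq_printf(m, "example value: %d\n", 42);
          return 0;
  }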
Signed-off-by: Joe Perches <joe@perches.com>
Acked-by: Steven Rostedt <rostedt@goodmis.org>
Cc: Al Viro <viro@ZenIV.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      962e3707
    • D
      VFS: kernel/: d_inode() annotations · 7682c918
Authored by David Howells
      relayfs and tracefs are dealing with inodes of their own;
      those two act as filesystem drivers
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      7682c918
  8. 08 Apr 2015, 3 commits
  9. 03 Apr 2015, 1 commit
  10. 02 Apr 2015, 5 commits
    • I
      bpf: Fix the build on BPF_SYSCALL=y && !CONFIG_TRACING kernels, make it more configurable · e1abf2cc
Authored by Ingo Molnar
So bpf_tracing.o depends on CONFIG_BPF_SYSCALL - but that's not its only
dependency: it also depends on the tracing infrastructure and on kprobes,
without which it will fail to build with:
      
        In file included from kernel/trace/bpf_trace.c:14:0:
        kernel/trace/trace.h: In function ‘trace_test_and_set_recursion’:
        kernel/trace/trace.h:491:28: error: ‘struct task_struct’ has no member named ‘trace_recursion’
          unsigned int val = current->trace_recursion;
        [...]
      
It took quite some time to trigger this build failure, because right now
BPF_SYSCALL is very obscure: it depends on CONFIG_EXPERT. So also make
BPF_SYSCALL more configurable, not just available under CONFIG_EXPERT.
      
      If BPF_SYSCALL, tracing and kprobes are enabled then enable the bpf_tracing
      gateway as well.
      
      We might want to make this an interactive option later on, although
      I'd not complicate it unnecessarily: enabling BPF_SYSCALL is enough of
      an indicator that the user wants BPF support.
      
      Cc: Alexei Starovoitov <ast@plumgrid.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Arnaldo Carvalho de Melo <acme@infradead.org>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Daniel Borkmann <daniel@iogearbox.net>
      Cc: David S. Miller <davem@davemloft.net>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
      Cc: Namhyung Kim <namhyung@kernel.org>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
      e1abf2cc
    • A
      tracing: Allow BPF programs to call bpf_trace_printk() · 9c959c86
Authored by Alexei Starovoitov
Debugging of BPF programs needs some form of printk from the
program, so let programs call a limited trace_printk() with the %d %u
%x %p modifiers only.
      
Similar to kernel modules, during program load the verifier checks
whether the program calls bpf_trace_printk() and, if so, the kernel
allocates the trace_printk buffers and emits the big 'this is debug
only' banner.
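
An illustrative program fragment, in the style of the kernel's BPF samples
(the helper-pointer declaration and the x86-64 register access are assumed
from those samples, not part of this patch):

  #include <linux/ptrace.h>
  #include <linux/bpf.h>

  /* declare the helper by its BPF_FUNC_* id, as the samples do */
  static int (*bpf_trace_printk)(const char *fmt, int fmt_size, ...) =
          (void *) BPF_FUNC_trace_printk;

  int bpf_prog(struct pt_regs *ctx)
  {
          char fmt[] = "probe hit, arg %d\n";     /* only %d %u %x %p allowed */

          bpf_trace_printk(fmt, sizeof(fmt), (int)ctx->di); /* x86-64 first arg */
          return 0;
  }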
Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
Reviewed-by: Steven Rostedt <rostedt@goodmis.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Arnaldo Carvalho de Melo <acme@infradead.org>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Daniel Borkmann <daniel@iogearbox.net>
      Cc: David S. Miller <davem@davemloft.net>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
      Cc: Namhyung Kim <namhyung@kernel.org>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1427312966-8434-6-git-send-email-ast@plumgrid.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
      9c959c86
    • A
      tracing: Allow BPF programs to call bpf_ktime_get_ns() · d9847d31
Authored by Alexei Starovoitov
bpf_ktime_get_ns() is used by programs to compute the time delta between
events or as a timestamp.
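
A sketch of the timestamp use, again in the style of the BPF samples (the
helper-pointer declaration is assumed from those samples, and the map
plumbing is omitted):

  #include <linux/ptrace.h>
  #include <linux/bpf.h>

  static unsigned long long (*bpf_ktime_get_ns)(void) =
          (void *) BPF_FUNC_ktime_get_ns;

  int bpf_prog(struct pt_regs *ctx)
  {
          unsigned long long ts = bpf_ktime_get_ns();   /* ns timestamp */

          /* store 'ts' in a map here; a later probe reads it back and
           * subtracts to get the delta between the two events */
          return 0;
  }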
Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
Reviewed-by: Steven Rostedt <rostedt@goodmis.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Arnaldo Carvalho de Melo <acme@infradead.org>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Daniel Borkmann <daniel@iogearbox.net>
      Cc: David S. Miller <davem@davemloft.net>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
      Cc: Namhyung Kim <namhyung@kernel.org>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1427312966-8434-5-git-send-email-ast@plumgrid.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
      d9847d31
    • A
      tracing, perf: Implement BPF programs attached to kprobes · 2541517c
Authored by Alexei Starovoitov
      BPF programs, attached to kprobes, provide a safe way to execute
      user-defined BPF byte-code programs without being able to crash or
      hang the kernel in any way. The BPF engine makes sure that such
      programs have a finite execution time and that they cannot break
      out of their sandbox.
      
      The user interface is to attach to a kprobe via the perf syscall:
      
      	struct perf_event_attr attr = {
      		.type	= PERF_TYPE_TRACEPOINT,
      		.config	= event_id,
      		...
      	};
      
      	event_fd = perf_event_open(&attr,...);
      	ioctl(event_fd, PERF_EVENT_IOC_SET_BPF, prog_fd);
      
'prog_fd' is a file descriptor associated with a previously loaded BPF
program.

'event_id' is the ID of the kprobe that was created.
      
      Closing 'event_fd':
      
      	close(event_fd);
      
      ... automatically detaches BPF program from it.
      
      BPF programs can call in-kernel helper functions to:
      
        - lookup/update/delete elements in maps
      
  - probe_read - a wrapper of probe_kernel_read() used to access any
    kernel data structures
      
BPF programs receive 'struct pt_regs *' as an input ('struct pt_regs' is
architecture dependent) and return 0 to ignore the event or 1 to store the
kprobe event into the ring buffer.
      
Note, kprobes are fundamentally _not_ a stable kernel ABI,
so BPF programs attached to kprobes must be recompiled for
every kernel version, and the user must supply the correct LINUX_VERSION_CODE
in attr.kern_version during the bpf_prog_load() call (see the sketch below).
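
A minimal user-space sketch of that load step (raw bpf(2) syscall; the helper
name and the fixed-size log buffer are mine, not part of the patch):

  #include <linux/bpf.h>
  #include <linux/version.h>
  #include <string.h>
  #include <sys/syscall.h>
  #include <unistd.h>

  static char bpf_log[65536];

  static int load_kprobe_prog(const struct bpf_insn *insns, __u32 insn_cnt)
  {
          union bpf_attr attr;

          memset(&attr, 0, sizeof(attr));
          attr.prog_type    = BPF_PROG_TYPE_KPROBE;
          attr.insns        = (__u64)(unsigned long)insns;
          attr.insn_cnt     = insn_cnt;
          attr.license      = (__u64)(unsigned long)"GPL";
          attr.log_buf      = (__u64)(unsigned long)bpf_log;
          attr.log_size     = sizeof(bpf_log);
          attr.log_level    = 1;
          attr.kern_version = LINUX_VERSION_CODE;  /* must match the running kernel */

          return syscall(__NR_bpf, BPF_PROG_LOAD, &attr, sizeof(attr));
  }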
Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
Reviewed-by: Steven Rostedt <rostedt@goodmis.org>
Reviewed-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Arnaldo Carvalho de Melo <acme@infradead.org>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Daniel Borkmann <daniel@iogearbox.net>
      Cc: David S. Miller <davem@davemloft.net>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Namhyung Kim <namhyung@kernel.org>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1427312966-8434-4-git-send-email-ast@plumgrid.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
      2541517c
    • A
      tracing: Add kprobe flag · 72cbbc89
Authored by Alexei Starovoitov
Add a TRACE_EVENT_FL_KPROBE flag to differentiate the kprobe type of
tracepoints, since BPF programs can only be attached to the kprobe type
of PERF_TYPE_TRACEPOINT perf events.
Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
Reviewed-by: Steven Rostedt <rostedt@goodmis.org>
Reviewed-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Arnaldo Carvalho de Melo <acme@infradead.org>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Daniel Borkmann <daniel@iogearbox.net>
      Cc: David S. Miller <davem@davemloft.net>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Namhyung Kim <namhyung@kernel.org>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1427312966-8434-3-git-send-email-ast@plumgrid.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
      72cbbc89
  11. 31 Mar 2015, 1 commit
  12. 25 Mar 2015, 4 commits
  13. 23 Mar 2015, 1 commit
  14. 09 Mar 2015, 3 commits
    • S
      ftrace: Fix ftrace enable ordering of sysctl ftrace_enabled · 524a3868
Authored by Steven Rostedt (Red Hat)
Some archs (specifically PowerPC) are sensitive to the ordering between
enabling the calls to function tracing and setting the function that is
to be used for tracing.
      
      That is, update_ftrace_function() sets what function the ftrace_caller
      trampoline should call. Some archs require this to be set before
      calling ftrace_run_update_code().
      
Another bug was discovered: ftrace_startup_sysctl() called
ftrace_run_update_code() directly. If the function that the ftrace_caller
trampoline calls has changed, it will not be updated. Instead,
ftrace_startup_enable() should be called, because it tests whether the
callback changed while the code was disabled and tells the arch to update
appropriately. Most archs do not need this notification, but PowerPC does.
      
      The problem could be seen by the following commands:
      
       # echo 0 > /proc/sys/kernel/ftrace_enabled
       # echo function > /sys/kernel/debug/tracing/current_tracer
       # echo 1 > /proc/sys/kernel/ftrace_enabled
       # cat /sys/kernel/debug/tracing/trace
      
      The trace will show that function tracing was not active.
      
      Cc: stable@vger.kernel.org # 2.6.27+
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
      524a3868
    • P
      ftrace: Fix en(dis)able graph caller when en(dis)abling record via sysctl · 1619dc3f
Authored by Pratyush Anand
      When ftrace is enabled globally through the proc interface, we must check if
      ftrace_graph_active is set. If it is set, then we should also pass the
      FTRACE_START_FUNC_RET command to ftrace_run_update_code(). Similarly, when
      ftrace is disabled globally through the proc interface, we must check if
      ftrace_graph_active is set. If it is set, then we should also pass the
      FTRACE_STOP_FUNC_RET command to ftrace_run_update_code().
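
A simplified sketch of that pattern (illustrative only; the actual patch
modifies ftrace_startup_sysctl() and ftrace_shutdown_sysctl() in
kernel/trace/ftrace.c, and the helper name here is mine):

  static void sysctl_update_graph_caller(bool enable)
  {
          int command = enable ? FTRACE_UPDATE_CALLS : FTRACE_DISABLE_CALLS;

          /* if the function graph tracer is active, also ask the arch to
           * install (or, on disable, remove) the return-trampoline hook */
          if (ftrace_graph_active)
                  command |= enable ? FTRACE_START_FUNC_RET
                                    : FTRACE_STOP_FUNC_RET;

          ftrace_run_update_code(command);
  }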
      
      Consider the following situation.
      
       # echo 0 > /proc/sys/kernel/ftrace_enabled
      
      After this ftrace_enabled = 0.
      
       # echo function_graph > /sys/kernel/debug/tracing/current_tracer
      
      Since ftrace_enabled = 0, ftrace_enable_ftrace_graph_caller() is never
      called.
      
       # echo 1 > /proc/sys/kernel/ftrace_enabled
      
      Now ftrace_enabled will be set to true, but still
      ftrace_enable_ftrace_graph_caller() will not be called, which is not
      desired.
      
      Further if we execute the following after this:
        # echo nop > /sys/kernel/debug/tracing/current_tracer
      
      Now since ftrace_enabled is set it will call
      ftrace_disable_ftrace_graph_caller(), which causes a kernel warning on
      the ARM platform.
      
      On the ARM platform, when ftrace_enable_ftrace_graph_caller() is called,
      it checks whether the old instruction is a nop or not. If it's not a nop,
      then it returns an error. If it is a nop then it replaces instruction at
      that address with a branch to ftrace_graph_caller.
      ftrace_disable_ftrace_graph_caller() behaves just the opposite. Therefore,
      if generic ftrace code ever calls either ftrace_enable_ftrace_graph_caller()
      or ftrace_disable_ftrace_graph_caller() consecutively two times in a row,
      then it will return an error, which will cause the generic ftrace code to
      raise a warning.
      
      Note, x86 does not have an issue with this because the architecture
      specific code for ftrace_enable_ftrace_graph_caller() and
      ftrace_disable_ftrace_graph_caller() does not check the previous state,
      and calling either of these functions twice in a row has no ill effect.
      
      Link: http://lkml.kernel.org/r/e4fbe64cdac0dd0e86a3bf914b0f83c0b419f146.1425666454.git.panand@redhat.com
      
      Cc: stable@vger.kernel.org # 2.6.31+
Signed-off-by: Pratyush Anand <panand@redhat.com>
      [
        removed extra if (ftrace_start_up) and defined ftrace_graph_active as 0
        if CONFIG_FUNCTION_GRAPH_TRACER is not set.
      ]
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
      1619dc3f
    • S
      ftrace: Clear REGS_EN and TRAMP_EN flags on disabling record via sysctl · b24d443b
Authored by Steven Rostedt (Red Hat)
      When /proc/sys/kernel/ftrace_enabled is set to zero, all function
      tracing is disabled. But the records that represent the functions
      still hold information about the ftrace_ops that are hooked to them.
      
      ftrace_ops may request "REGS" (have a full set of pt_regs passed to
      the callback), or "TRAMP" (the ops has its own trampoline to use).
      When the record is updated to represent the state of the ops hooked
      to it, it sets "REGS_EN" and/or "TRAMP_EN" to state that the callback
      points to the correct trampoline (REGS has its own trampoline).
      
      When ftrace_enabled is set to zero, all ftrace locations are a nop,
      so they do not point to any trampoline. But the _EN flags are still
      set. This can cause the accounting to go wrong when ftrace_enabled
      is cleared and an ops that has a trampoline is registered or unregistered.
      
      For example, the following will cause ftrace to crash:
      
       # echo function_graph > /sys/kernel/debug/tracing/current_tracer
       # echo 0 > /proc/sys/kernel/ftrace_enabled
       # echo nop > /sys/kernel/debug/tracing/current_tracer
       # echo 1 > /proc/sys/kernel/ftrace_enabled
       # echo function_graph > /sys/kernel/debug/tracing/current_tracer
      
      As function_graph uses a trampoline, when ftrace_enabled is set to zero
      the updates to the record are not done. When enabling function_graph
      again, the record will still have the TRAMP_EN flag set, and it will
      look for an op that has a trampoline other than the function_graph
      ops, and fail to find one.
      
      Cc: stable@vger.kernel.org # 3.17+
Reported-by: Pratyush Anand <panand@redhat.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
      b24d443b
  15. 14 Feb 2015, 1 commit
  16. 11 Feb 2015, 1 commit
    • S
      ring-buffer: Do not wake up a splice waiter when page is not full · 1e0d6714
Authored by Steven Rostedt (Red Hat)
      When an application connects to the ring buffer via splice, it can only
      read full pages. Splice does not work with partial pages. If there is
      not enough data to fill a page, the splice command will either block
      or return -EAGAIN (if set to nonblock).
      
Code was added so that if the page is not full, the waiter just sleeps again.
      The problem is, it will get woken up again on the next event. That
      is, when something is written into the ring buffer, if there is a waiter
      it will wake it up. The waiter would then check the buffer, see that
      it still does not have enough data to fill a page and go back to sleep.
      To make matters worse, when the waiter goes back to sleep, it could
      cause another event, which would wake it back up again to see it
      doesn't have enough data and sleep again. This produces a tremendous
      overhead and fills the ring buffer with noise.
      
      For example, recording sched_switch on an idle system for 10 seconds
      produces 25,350,475 events!!!
      
      Create another wait queue for those waiters wanting full pages.
      When an event is written, it only wakes up waiters if there's a full
      page of data. It does not wake up the waiter if the page is not yet
      full.
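
A sketch of the writer-side wakeup logic this describes (illustrative; the
struct and function names here are hypothetical, not the actual ring-buffer
internals):

  #include <linux/wait.h>

  struct rb_wake {
          wait_queue_head_t waiters;        /* readers happy with partial data   */
          wait_queue_head_t full_waiters;   /* splice readers wanting full pages */
  };

  static void rb_wake_up_readers(struct rb_wake *w, bool page_full)
  {
          wake_up_all(&w->waiters);
          if (page_full)
                  wake_up_all(&w->full_waiters);  /* only once a full page exists */
  }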
      
      After this change, recording sched_switch on an idle system for 10
      seconds produces only 800 events. Getting rid of 25,349,675 useless
events (99.9969% of events!!) is something to take seriously.
      
      Cc: stable@vger.kernel.org # 3.16+
      Cc: Rabin Vincent <rabin@rab.in>
      Fixes: e30f53aa "tracing: Do not busy wait in buffer splice"
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
      1e0d6714
  17. 10 Feb 2015, 1 commit
  18. 04 Feb 2015, 5 commits
    • S
      tracing: Have mkdir and rmdir be part of tracefs · eae47358
Authored by Steven Rostedt (Red Hat)
      The tracing "instances" directory can create sub tracing buffers
      with mkdir, and remove them with rmdir. As a mkdir will also create
all the files and directories that control the sub buffer, the inode
mutexes need to be released before this is done, to avoid deadlocks.
It is better to let the tracing system unlock the inode mutexes before
calling the functions that create the files within the new directory
(or delete the files from the one being destroyed).
      
      Now that tracing has been converted over to tracefs, the tracefs file
      system can be modified to accommodate this feature. It still releases
      the locks, but the filesystem itself can take care of the ugly
      business and let the user just do what it needs.
      
      The tracing system now attaches a descriptor to the directory dentry
      that can have userspace create or remove sub directories. If this
      descriptor does not exist for a dentry, then that dentry can not be
used to create other directories. This descriptor holds mkdir and
rmdir methods that take only a character string as an argument.
      
      The tracefs file system will first make a copy of the dentry name
      before releasing the locks. Then it will pass the copied name to the
methods. It is up to the tracing system that supplied the methods to
handle races with duplicate names and the like, as all the inode mutexes
are released when the functions are called.
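
Roughly, the descriptor has this shape (a sketch of what is described above;
the struct name is illustrative):

  struct tracefs_dir_ops {
          int (*mkdir)(const char *name);   /* called with the inode locks dropped */
          int (*rmdir)(const char *name);
  };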
      
      Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
      eae47358
    • S
      tracing: Automatically mount tracefs on debugfs/tracing · f76180bc
Authored by Steven Rostedt (Red Hat)
As tools currently rely on the tracing directory in debugfs, we can not
just create a tracefs infrastructure and expect sysadmins to mount
the new tracefs to keep their old tools working.
      
      Instead, the debugfs tracing directory is still created and the tracefs
      file system is mounted there when the debugfs filesystem is mounted.
      
The tracing infrastructure no longer updates the debugfs file system;
it interacts with the tracefs file system instead. To the user it still
appears as if nothing changed, except that you can now also mount just
the tracing system without needing all of debugfs!
      
      Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
      f76180bc
    • S
      tracing: Convert the tracing facility over to use tracefs · 8434dc93
Authored by Steven Rostedt (Red Hat)
      debugfs was fine for the tracing facility as a quick way to get
      an interface. Now that tracing has matured, it should separate itself
      from debugfs such that it can be mounted separately without needing
      to mount all of debugfs with it. That is, users resist using tracing
because it requires mounting debugfs. Giving tracing its own file
system lets users get the features of tracing without needing to bring
in the rest of the kernel's debug infrastructure.
      
Another reason for tracefs is that debugfs does not support mkdir.
Currently, to create instances, one does a mkdir in the tracing/instance
directory. This is implemented via a hack that forces debugfs to do
something it is not intended to do. By converting over to tracefs, this
hack can be removed and mkdir can be properly implemented. This patch does
not address that yet, but it lays the groundwork for it to be done.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
      8434dc93
    • S
      tracing: Create cmdline tracer options on tracing fs init · 09d23a1d
Authored by Steven Rostedt (Red Hat)
      The options for cmdline tracers are not created if the debugfs system
      is not ready yet. If tracing has started before debugfs is up, then the
      option files for the tracer are not created. Create them when creating
      the tracing directory if the current tracer requires option files.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
      09d23a1d
    • S
      tracing: Only create tracer options files if directory exists · 0f67f04f
Authored by Steven Rostedt (Red Hat)
      Do not bother creating tracer options if no tracing directory
      exists. If a tracer is enabled via the command line, and is
started before the tracing directory is created, then it won't have
its tracer-specific options created.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
      0f67f04f
  19. 02 Feb 2015, 2 commits
  20. 30 Jan 2015, 1 commit