1. 08 Jun 2016 (1 commit)
2. 20 Apr 2016 (2 commits)
• bpf: add event output helper for notifications/sampling/logging · bd570ff9
  Authored by Daniel Borkmann
This patch adds a new helper for cls/act programs that can push events
to user space applications. For networking, this can be used e.g. for
sampling, debugging or logging purposes, or for pushing arbitrary
wake-up events. The idea is similar to a43eec30 ("bpf: introduce
bpf_perf_event_output() helper") and 39111695 ("samples: bpf: add
bpf_perf_event_output example").
      
The eBPF program uses a perf event array map that user space populates
with fds from perf_event_open(). The eBPF program then calls into the
helper, e.g. as skb_event_output(skb, &my_map, BPF_F_CURRENT_CPU, raw,
sizeof(raw)), so that the raw data is pushed into the fd stored at the
map index of the current CPU.
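
A rough sketch of such a program, in the style of the kernel's
samples/bpf (map name, sizes and section names are illustrative;
iproute2 headers expose the same helper under the skb_event_output
name):

 #include <uapi/linux/bpf.h>
 #include <uapi/linux/pkt_cls.h>
 #include "bpf_helpers.h" /* samples/bpf-style SEC() and helper decls */

/* one slot per possible CPU; user space stores a perf_event fd
   from perf_event_open() into each slot it wants to receive on */
struct bpf_map_def SEC("maps") my_map = {
  .type        = BPF_MAP_TYPE_PERF_EVENT_ARRAY,
  .key_size    = sizeof(int),
  .value_size  = sizeof(__u32),
  .max_entries = 64, /* assumption: >= number of possible CPUs */
};

SEC("classifier")
int cls_notify(struct __sk_buff *skb)
{
  /* arbitrary raw payload; its layout is a private contract
     between this program and the user space consumer */
  struct {
    __u32 len;
    __u32 proto;
  } raw = {
    .len   = skb->len,
    .proto = skb->protocol,
  };

  /* push the event to the fd stored at the current CPU's index */
  bpf_perf_event_output(skb, &my_map, BPF_F_CURRENT_CPU,
                        &raw, sizeof(raw));
  return TC_ACT_OK;
}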
      
      User space can poll/mmap/etc on this and has a data channel for receiving
      events that can be post-processed. The nice thing is that since the eBPF
      program and user space application making use of it are tightly coupled,
      they can define their own arbitrary raw data format and what/when they
      want to push.
      
While packet headers could e.g. be one part of the metadata being
pushed, this is not a substitute for things like packet sockets, as the
whole packet is not pushed and the push happens in a single direction
only. The intention is more of a generically usable, efficient event
pipe to applications. The workflow is that tc pins the map and
applications attach themselves, e.g. after cls/act setup, to one or
multiple map slots; demuxing is done by the eBPF program.
      
Adding this facility takes minimal effort: it reuses the helper
introduced in a43eec30 ("bpf: introduce bpf_perf_event_output() helper"),
and we get its functionality for free by overloading its BPF_FUNC_
identifier for cls/act programs. ctx is currently unused, but will be
used in the future. An example will be added to iproute2's BPF example
files.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
• bpf, trace: add BPF_F_CURRENT_CPU flag for bpf_perf_event_output · 1e33759c
  Authored by Daniel Borkmann
Add a BPF_F_CURRENT_CPU flag to optimize the use case where user space
has per-CPU ring buffers and the eBPF program pushes the data into the
current CPU's ring buffer, which saves us an extra helper function call
in eBPF. Also, properly reserve the remaining flag bits, which are not
used yet.
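
As a fragment, assuming a PERF_EVENT_ARRAY map named events and a
payload struct data already set up (names are illustrative):

  /* before: an extra helper call just to pick the per-CPU slot */
  __u32 cpu = bpf_get_smp_processor_id();
  bpf_perf_event_output(ctx, &events, cpu, &data, sizeof(data));

  /* after: the flag resolves the current CPU inside the helper */
  bpf_perf_event_output(ctx, &events, BPF_F_CURRENT_CPU, &data, sizeof(data));
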
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
  3. 19 4月, 2016 1 次提交
• bpf: avoid warning for wrong pointer cast · 266a0a79
  Authored by Arnd Bergmann
      Two new functions in bpf contain a cast from a 'u64' to a
      pointer. This works on 64-bit architectures but causes a warning
      on all 32-bit architectures:
      
      kernel/trace/bpf_trace.c: In function 'bpf_perf_event_output_tp':
      kernel/trace/bpf_trace.c:350:13: error: cast to pointer from integer of different size [-Werror=int-to-pointer-cast]
        u64 ctx = *(long *)r1;
      
      This changes the cast to first convert the u64 argument into a uintptr_t,
      which is guaranteed to be the same size as a pointer.
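
The pattern, reduced to plain C (function name is illustrative):

 #include <stdint.h>

void *ptr_from_u64(uint64_t r1)
{
  /* (void *)r1 would truncate on 32-bit targets, and the compiler
     warns: cast to pointer from integer of different size */

  /* narrowing through uintptr_t first is well-defined, since
     uintptr_t always matches the platform's pointer width */
  return (void *)(uintptr_t)r1;
}
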
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Fixes: 9940d67c ("bpf: support bpf_get_stackid() and bpf_perf_event_output() in tracepoint programs")
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
4. 15 Apr 2016 (1 commit)
5. 08 Apr 2016 (2 commits)
6. 09 Mar 2016 (1 commit)
7. 20 Feb 2016 (1 commit)
• bpf: introduce BPF_MAP_TYPE_STACK_TRACE · d5a3b1f6
  Authored by Alexei Starovoitov
Add a new map type to store stack traces, and a corresponding helper:
bpf_get_stackid(ctx, map, flags) - walk the user or kernel stack and return an id
@ctx: struct pt_regs*
@map: pointer to stack_trace map
@flags: bits 0-7 - number of stack frames to skip
        bit 8 - collect user stack instead of kernel
        bit 9 - compare stacks by hash only
        bit 10 - if two different stacks hash into the same stackid,
                 discard the old one
        other bits - reserved
Return: >= 0 stackid on success or negative error
      
stackid is a 32-bit integer handle that can be further combined with
other data (including other stackids) and used as a key into maps.

User space will access the stackmap using the standard lookup/delete
syscall commands to retrieve the full stack trace for a given stackid.
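
A minimal sketch pairing a stack_trace map with a hash map keyed by
stackid, in the samples/bpf style (probe target, map sizes and flags
are illustrative):

 #include <uapi/linux/bpf.h>
 #include <uapi/linux/perf_event.h> /* PERF_MAX_STACK_DEPTH */
 #include <linux/ptrace.h>
 #include "bpf_helpers.h"

struct bpf_map_def SEC("maps") stackmap = {
  .type        = BPF_MAP_TYPE_STACK_TRACE,
  .key_size    = sizeof(__u32),
  .value_size  = PERF_MAX_STACK_DEPTH * sizeof(__u64),
  .max_entries = 10000,
};

struct bpf_map_def SEC("maps") counts = {
  .type        = BPF_MAP_TYPE_HASH,
  .key_size    = sizeof(__u32), /* stackid */
  .value_size  = sizeof(__u64), /* hit count */
  .max_entries = 10000,
};

SEC("kprobe/kfree_skb") /* illustrative probe target */
int count_stacks(struct pt_regs *ctx)
{
  int id = bpf_get_stackid(ctx, &stackmap, 0 /* kernel stack, skip 0 */);
  __u64 one = 1, *val;
  __u32 key;

  if (id < 0) /* stack walk failed */
    return 0;

  key = id;
  val = bpf_map_lookup_elem(&counts, &key);
  if (val)
    (*val)++;
  else
    bpf_map_update_elem(&counts, &key, &one, BPF_ANY);
  return 0;
}
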
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
8. 29 Jan 2016 (1 commit)
9. 24 Dec 2015 (1 commit)
10. 27 Oct 2015 (2 commits)
11. 22 Oct 2015 (1 commit)
• bpf: introduce bpf_perf_event_output() helper · a43eec30
  Authored by Alexei Starovoitov
This helper is used to send raw data from an eBPF program into a
special PERF_TYPE_SOFTWARE/PERF_COUNT_SW_BPF_OUTPUT perf_event.
User space needs to perf_event_open() it (either for one or all CPUs)
and store the FD into a perf_event_array (similar to the
bpf_perf_event_read() helper) before the eBPF program can send data
into it.

Today, programs triggered by kprobes collect data and either store it
into maps or print it via bpf_trace_printk(), where the latter is a
debug facility and not suitable for streaming data. This new helper
replaces such bpf_trace_printk() usage and gives programs a dedicated
channel into user space for post-processing of the raw data collected.
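
On the user space side this looks roughly as follows (one event per
CPU; the BPF_MAP_UPDATE_ELEM call that stores the fd into the array
map is elided):

 #include <linux/perf_event.h>
 #include <sys/syscall.h>
 #include <unistd.h>

static int open_bpf_output(int cpu)
{
  struct perf_event_attr attr = {
    .type        = PERF_TYPE_SOFTWARE,
    .config      = PERF_COUNT_SW_BPF_OUTPUT,
    .sample_type = PERF_SAMPLE_RAW,
  };

  /* pid = -1, group_fd = -1: per-CPU event, no group */
  return syscall(__NR_perf_event_open, &attr, -1, cpu, -1, 0);
}
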
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
12. 29 Aug 2015 (1 commit)
• bpf: add support for %s specifier to bpf_trace_printk() · 8d3b7dce
  Authored by Alexei Starovoitov
The %s specifier makes bpf program and kernel debugging easier.
To make sure that trace_printk won't crash, the unsafe string is
copied onto the stack and the unsafe pointer is substituted.
      
      The following C program:
       #include <linux/fs.h>
      int foo(struct pt_regs *ctx, struct filename *filename)
      {
        void *name = 0;
      
        bpf_probe_read(&name, sizeof(name), &filename->name);
        bpf_trace_printk("executed %s\n", name);
        return 0;
      }
      
when attached to a kprobe on do_execve() it
will produce output in /sys/kernel/debug/tracing/trace_pipe:
          make-13492 [002] d..1  3250.997277: : executed /bin/sh
            sh-13493 [004] d..1  3250.998716: : executed /usr/bin/gcc
           gcc-13494 [002] d..1  3250.999822: : executed /usr/lib/gcc/x86_64-linux-gnu/4.7/cc1
           gcc-13495 [002] d..1  3251.006731: : executed /usr/bin/as
           gcc-13496 [002] d..1  3251.011831: : executed /usr/lib/gcc/x86_64-linux-gnu/4.7/collect2
      collect2-13497 [000] d..1  3251.012941: : executed /usr/bin/ld
Suggested-by: Brendan Gregg <brendan.d.gregg@gmail.com>
Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
13. 10 Aug 2015 (1 commit)
14. 16 Jun 2015 (3 commits)
15. 01 Jun 2015 (1 commit)
16. 22 May 2015 (1 commit)
• bpf: allow bpf programs to tail-call other bpf programs · 04fd61ab
  Authored by Alexei Starovoitov
Introduce the bpf_tail_call(ctx, &jmp_table, index) helper function,
which can be used from BPF programs like:
      int bpf_prog(struct pt_regs *ctx)
      {
        ...
        bpf_tail_call(ctx, &jmp_table, index);
        ...
      }
      that is roughly equivalent to:
      int bpf_prog(struct pt_regs *ctx)
      {
        ...
        if (jmp_table[index])
          return (*jmp_table[index])(ctx);
        ...
      }
The important detail is that this is not a normal call, but a tail call.
The kernel stack is precious, so this helper reuses the current
stack frame and jumps into another BPF program without adding an
extra call frame.
This is trivially done in the interpreter and a bit trickier in JITs.
In the case of the x64 JIT, the bigger part of the generated assembler
prologue is common to all programs, so it is simply skipped while
jumping. Other JITs can do a similar prologue-skipping optimization or
unwind the stack before jumping into the next program.
      
      bpf_tail_call() arguments:
      ctx - context pointer
      jmp_table - one of BPF_MAP_TYPE_PROG_ARRAY maps used as the jump table
      index - index in the jump table
      
Since all BPF programs are identified by file descriptor, user space
needs to populate the jmp_table with FDs of other BPF programs.
If jmp_table[index] is empty, bpf_tail_call() doesn't jump anywhere
and program execution continues as normal.
      
The new BPF_MAP_TYPE_PROG_ARRAY map type is introduced so that user
space can populate this jmp_table array with FDs of other bpf programs
(a sketch of the update call follows). Programs can share the same
jmp_table array or use multiple jmp_tables.
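
Such an update through the raw bpf(2) syscall could look roughly like
this; jmp_table_fd and prog_fd come from earlier BPF_MAP_CREATE and
BPF_PROG_LOAD calls (not shown):

 #include <linux/bpf.h>
 #include <string.h>
 #include <sys/syscall.h>
 #include <unistd.h>

static int set_tail_call(int jmp_table_fd, __u32 index, int prog_fd)
{
  union bpf_attr attr;

  memset(&attr, 0, sizeof(attr));
  attr.map_fd = jmp_table_fd;
  attr.key    = (__u64)(unsigned long)&index;
  attr.value  = (__u64)(unsigned long)&prog_fd;
  attr.flags  = BPF_ANY; /* create or replace the slot atomically */

  return syscall(__NR_bpf, BPF_MAP_UPDATE_ELEM, &attr, sizeof(attr));
}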
      
The chain of tail calls can form unpredictable dynamic loops, therefore
tail_call_cnt is used to limit the number of calls; it is currently set
to 32.
      
Use cases:
==========
      - simplify complex programs by splitting them into a sequence of small programs
      
      - dispatch routine
        For tracing and future seccomp the program may be triggered on all system
        calls, but processing of syscall arguments will be different. It's more
        efficient to implement them as:
        int syscall_entry(struct seccomp_data *ctx)
        {
           bpf_tail_call(ctx, &syscall_jmp_table, ctx->nr /* syscall number */);
           ... default: process unknown syscall ...
        }
        int sys_write_event(struct seccomp_data *ctx) {...}
        int sys_read_event(struct seccomp_data *ctx) {...}
        syscall_jmp_table[__NR_write] = sys_write_event;
        syscall_jmp_table[__NR_read] = sys_read_event;
      
        For networking the program may call into different parsers depending on
        packet format, like:
        int packet_parser(struct __sk_buff *skb)
        {
           ... parse L2, L3 here ...
           __u8 ipproto = load_byte(skb, ... offsetof(struct iphdr, protocol));
           bpf_tail_call(skb, &ipproto_jmp_table, ipproto);
           ... default: process unknown protocol ...
        }
        int parse_tcp(struct __sk_buff *skb) {...}
        int parse_udp(struct __sk_buff *skb) {...}
        ipproto_jmp_table[IPPROTO_TCP] = parse_tcp;
        ipproto_jmp_table[IPPROTO_UDP] = parse_udp;
      
- for the TC use case, bpf_tail_call() allows reclassify-like logic to be
  implemented
      
- bpf_map_update_elem/delete calls into a BPF_MAP_TYPE_PROG_ARRAY jump
  table are atomic, so user space can build chains of BPF programs on the fly
      
      Implementation details:
      =======================
- high performance of bpf_tail_call() is the goal.
  It could have been implemented without JIT changes as a wrapper on top of
  the BPF_PROG_RUN() macro, but with two downsides:
  . all programs would have to pay a performance penalty for this feature,
    and the tail call itself would be slower, since a mandatory stack unwind,
    return and stack allocation would be done for every tail call.
  . tail calls would be limited to programs running with preemption disabled,
    since the generic 'void *ctx' doesn't have room for 'tail_call_cnt',
    which would need to be either a global per_cpu variable accessed by the
    helper and the wrapper, or a global variable protected by locks.
      
  In this implementation the x64 JIT bypasses the stack unwind and jumps
  into the callee program after the prologue.
      
- bpf_prog_array_compatible() ensures that the prog_type of callee and
  caller are the same and that the JITed/non-JITed flag is the same, since
  calling a JITed program from a non-JITed one is invalid because their
  stack frames differ. Similarly, calling a kprobe type program from a
  socket type program is invalid.
      
- the jump table is implemented as BPF_MAP_TYPE_PROG_ARRAY to reuse the
  'map' abstraction, its user space API and all of the verifier logic.
  It lives in the existing arraymap.c file, since several functions are
  shared with the regular array map.
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
17. 02 Apr 2015 (3 commits)
• tracing: Allow BPF programs to call bpf_trace_printk() · 9c959c86
  Authored by Alexei Starovoitov
Debugging of BPF programs needs some form of printk from the
program, so let programs call a limited trace_printk() with
%d %u %x %p modifiers only.

Similar to kernel modules, during program load the verifier checks
whether the program calls bpf_trace_printk(), and if so, the kernel
allocates trace_printk buffers and emits a big 'this is debug
only' banner.
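
A minimal sketch of such debug output, samples/bpf style (probe target
and format string are illustrative; the format string must live on the
BPF stack):

 #include <uapi/linux/bpf.h>
 #include <linux/ptrace.h>
 #include "bpf_helpers.h"

SEC("kprobe/sys_sync")
int on_sync(struct pt_regs *ctx)
{
  /* only %d %u %x %p are accepted at this point; the verifier
     rejects other specifiers at load time */
  char fmt[] = "sync() entered on cpu %d\n";

  bpf_trace_printk(fmt, sizeof(fmt), bpf_get_smp_processor_id());
  return 0;
}
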
Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
Reviewed-by: Steven Rostedt <rostedt@goodmis.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Arnaldo Carvalho de Melo <acme@infradead.org>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Daniel Borkmann <daniel@iogearbox.net>
      Cc: David S. Miller <davem@davemloft.net>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
      Cc: Namhyung Kim <namhyung@kernel.org>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1427312966-8434-6-git-send-email-ast@plumgrid.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
• tracing: Allow BPF programs to call bpf_ktime_get_ns() · d9847d31
  Authored by Alexei Starovoitov
bpf_ktime_get_ns() is used by programs to compute the time delta
between events or as a timestamp.
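
For instance, a latency sketch in the samples/bpf style (probe target
is illustrative; bpf_get_current_pid_tgid() is a helper added shortly
after this patch, used here only as a convenient map key):

 #include <uapi/linux/bpf.h>
 #include <linux/ptrace.h>
 #include "bpf_helpers.h"

struct bpf_map_def SEC("maps") start_ns = {
  .type        = BPF_MAP_TYPE_HASH,
  .key_size    = sizeof(__u32),
  .value_size  = sizeof(__u64),
  .max_entries = 4096,
};

SEC("kprobe/vfs_read")
int on_entry(struct pt_regs *ctx)
{
  __u32 pid = bpf_get_current_pid_tgid();
  __u64 ts = bpf_ktime_get_ns();

  bpf_map_update_elem(&start_ns, &pid, &ts, BPF_ANY);
  return 0;
}

SEC("kretprobe/vfs_read")
int on_return(struct pt_regs *ctx)
{
  __u32 pid = bpf_get_current_pid_tgid();
  __u64 *tsp = bpf_map_lookup_elem(&start_ns, &pid);

  if (tsp) {
    __u64 delta = bpf_ktime_get_ns() - *tsp; /* latency in ns */

    /* ... aggregate or emit delta here ... */
    bpf_map_delete_elem(&start_ns, &pid);
  }
  return 0;
}
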
Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
Reviewed-by: Steven Rostedt <rostedt@goodmis.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Arnaldo Carvalho de Melo <acme@infradead.org>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Daniel Borkmann <daniel@iogearbox.net>
      Cc: David S. Miller <davem@davemloft.net>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
      Cc: Namhyung Kim <namhyung@kernel.org>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1427312966-8434-5-git-send-email-ast@plumgrid.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
• tracing, perf: Implement BPF programs attached to kprobes · 2541517c
  Authored by Alexei Starovoitov
      BPF programs, attached to kprobes, provide a safe way to execute
      user-defined BPF byte-code programs without being able to crash or
      hang the kernel in any way. The BPF engine makes sure that such
      programs have a finite execution time and that they cannot break
      out of their sandbox.
      
      The user interface is to attach to a kprobe via the perf syscall:
      
      	struct perf_event_attr attr = {
      		.type	= PERF_TYPE_TRACEPOINT,
      		.config	= event_id,
      		...
      	};
      
      	event_fd = perf_event_open(&attr,...);
      	ioctl(event_fd, PERF_EVENT_IOC_SET_BPF, prog_fd);
      
'prog_fd' is a file descriptor associated with a previously
loaded BPF program.

'event_id' is the ID of the kprobe created.
      
      Closing 'event_fd':
      
      	close(event_fd);
      
      ... automatically detaches BPF program from it.
      
      BPF programs can call in-kernel helper functions to:
      
        - lookup/update/delete elements in maps
      
  - probe_read - a wrapper of probe_kernel_read() used to access any
    kernel data structure
      
BPF programs receive 'struct pt_regs *' as an input ('struct pt_regs' is
architecture dependent) and return 0 to ignore the event or 1 to store
the kprobe event into the ring buffer.
      
Note that kprobes are fundamentally _not_ a stable kernel ABI, so BPF
programs attached to kprobes must be recompiled for every kernel
version, and the user must supply the correct LINUX_VERSION_CODE in
attr.kern_version during the bpf_prog_load() call.
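
A hedged load-side sketch through the raw bpf(2) syscall (insns and
license prepared elsewhere; function name is illustrative):

 #include <linux/bpf.h>
 #include <linux/version.h>
 #include <string.h>
 #include <sys/syscall.h>
 #include <unistd.h>

static int load_kprobe_prog(const struct bpf_insn *insns, __u32 insn_cnt,
                            const char *license)
{
  union bpf_attr attr;

  memset(&attr, 0, sizeof(attr));
  attr.prog_type    = BPF_PROG_TYPE_KPROBE;
  attr.insns        = (__u64)(unsigned long)insns;
  attr.insn_cnt     = insn_cnt;
  attr.license      = (__u64)(unsigned long)license;
  /* must match the running kernel or the load is rejected */
  attr.kern_version = LINUX_VERSION_CODE;

  return syscall(__NR_bpf, BPF_PROG_LOAD, &attr, sizeof(attr));
}
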
Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
Reviewed-by: Steven Rostedt <rostedt@goodmis.org>
Reviewed-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Arnaldo Carvalho de Melo <acme@infradead.org>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Daniel Borkmann <daniel@iogearbox.net>
      Cc: David S. Miller <davem@davemloft.net>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Namhyung Kim <namhyung@kernel.org>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1427312966-8434-4-git-send-email-ast@plumgrid.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>