1. 09 Apr 2016, 31 commits
  2. 08 Apr 2016, 9 commits
    • Merge branch 'bpf-tracepoints' · f8711655
      David S. Miller committed
      Alexei Starovoitov says:
      
      ====================
      allow bpf attach to tracepoints
      
      Hi Steven, Peter,
      
      v1->v2: addressed Peter's comments:
      - fixed wording in patch 1, added ack
      - refactored 2nd patch into 3:
      2/10 remove unused __perf_addr macro which frees up
      an argument in perf_trace_buf_submit
      3/10 split perf_trace_buf_prepare into alloc and update parts, so that bpf
      programs don't have to pay performance penalty for update of struct trace_entry
      which is not going to be accessed by bpf
      4/10 actual addition of bpf filter to perf tracepoint handler is now trivial
      and bpf prog can be used as proper filter of tracepoints
      
      v1 cover:
      last time we discussed bpf+tracepoints it was a year ago [1] and the reason
      we didn't proceed with that approach was that it would have exposed the
      arguments arg1, arg2 of the trace_xx(arg1, arg2) call to the bpf program,
      which was considered an unnecessary extension of the abi. Back then I wanted
      to avoid the cost of the buffer alloc and field assign part in all
      of the tracepoints, but it looks like, once optimized, the cost is acceptable.
      So this new approach doesn't expose any new abi to the bpf program.
      The program looks at tracepoint fields after they were copied
      by perf_trace_xx() and described in /sys/kernel/debug/tracing/events/xxx/format.
      We made a tool [2] that takes arguments from /sys/.../format and works as:
      $ tplist.py -v random:urandom_read
          int got_bits;
          int pool_left;
          int input_left;
      Then these fields can be copy-pasted into bpf program like:
      struct urandom_read {
          __u64 hidden_pad;
          int got_bits;
          int pool_left;
          int input_left;
      };
      and the program can use it:
      SEC("tracepoint/random/urandom_read")
      int bpf_prog(struct urandom_read *ctx)
      {
          return ctx->pool_left > 0 ? 1 : 0;
      }
      This way the program can access tracepoint fields faster than
      equivalent bpf+kprobe program, which is the main goal of these patches.
      
      Patches 1-4 are simple changes on the perf core side, please review.
      I'd like to take the whole set via net-next tree, since the rest of
      the patches might conflict with other bpf work going on in net-next
      and we want to avoid cross-tree merge conflicts.
      Alternatively we can put patches 1-4 into both tip and net-next.
      
      Patch 9 is an example of access to tracepoint fields from bpf prog.
      Patch 10 is a micro benchmark for bpf+kprobe vs bpf+tracepoint.
      
      Note that for actual tracing tools the user doesn't need to
      run tplist.py and copy-paste fields manually. The tools do it
      automatically. For example, the argdist tool [3] can be used as:
      $ argdist -H 't:block:block_rq_complete():u32:nr_sector'
      where 'nr_sector' is the name of a tracepoint field taken from
      /sys/kernel/debug/tracing/events/block/block_rq_complete/format
      and appropriate bpf program is generated on the fly.
      
      [1] http://thread.gmane.org/gmane.linux.kernel.api/8127/focus=8165
      [2] https://github.com/iovisor/bcc/blob/master/tools/tplist.py
      [3] https://github.com/iovisor/bcc/blob/master/tools/argdist.py
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • samples/bpf: add tracepoint vs kprobe performance tests · e3edfdec
      Alexei Starovoitov committed
      the first microbenchmark does:
      fd=open("/proc/self/comm");
      for() {
        write(fd, "test");
      }
      and on 4 cpus in parallel:
                                            writes per sec
      base (no tracepoints, no kprobes)         930k
      with kprobe at __set_task_comm()          420k
      with tracepoint at task:task_rename       730k
      
      For the kprobe case the full bpf program manually fetches oldcomm and newcomm via bpf_probe_read.
      For the tracepoint case the bpf program does nothing, since the arguments are copied by the tracepoint.
      
      2nd microbenchmark does:
      fd=open("/dev/urandom");
      for() {
        read(fd, buf);
      }
      and on 4 cpus in parallel:
                                             reads per sec
      base (no tracepoints, no kprobes)         300k
      with kprobe at urandom_read()             279k
      with tracepoint at random:urandom_read    290k
      
      The bpf progs attached to the kprobe and the tracepoint are no-ops.
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
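
      A minimal standalone sketch of the first benchmark loop, assuming a fixed
      iteration count, a main() wrapper, and basic error handling (all mine; the
      actual samples/bpf harness drives this differently):

      /* sketch: rewrite the task comm in a loop so that the task:task_rename
       * tracepoint (or a kprobe on __set_task_comm()) fires on every write;
       * loop count and error handling are illustrative assumptions */
      #include <fcntl.h>
      #include <stdio.h>
      #include <unistd.h>

      int main(void)
      {
          int fd = open("/proc/self/comm", O_WRONLY);

          if (fd < 0) {
              perror("open /proc/self/comm");
              return 1;
          }
          for (int i = 0; i < 1000000; i++) {
              if (write(fd, "test", 4) != 4) {
                  perror("write");
                  break;
              }
          }
          close(fd);
          return 0;
      }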
    • samples/bpf: tracepoint example · 3c9b1644
      Alexei Starovoitov committed
      modify offwaketime to work with the sched/sched_switch tracepoint
      instead of a kprobe on finish_task_switch
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
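
      In sketch form (not a copy of the sample), the conversion swaps the kprobe
      section for a tracepoint section and reads the fields listed in
      /sys/kernel/debug/tracing/events/sched/sched_switch/format through a
      mirrored struct; the struct and function names below are assumptions:

      /* sketch: field layout mirrors the sched_switch format file;
       * the leading 8 bytes are the hidden 'struct pt_regs *' slot */
      #include <uapi/linux/bpf.h>
      #include "bpf_helpers.h"            /* SEC() macro from samples/bpf */

      struct sched_switch_args {
          unsigned long long pad;
          char prev_comm[16];
          int prev_pid;
          int prev_prio;
          long long prev_state;
          char next_comm[16];
          int next_pid;
          int next_prio;
      };

      SEC("tracepoint/sched/sched_switch")  /* was SEC("kprobe/finish_task_switch") */
      int oncpu(struct sched_switch_args *ctx)
      {
          /* tracepoint fields are read directly, no bpf_probe_read() needed */
          return ctx->prev_pid != 0;
      }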
    • samples/bpf: add tracepoint support to bpf loader · c0766040
      Alexei Starovoitov committed
      Recognize "tracepoint/" section name prefix and attach the program
      to that tracepoint.
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
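
      A user-space sketch of what attaching a program from a
      "tracepoint/<category>/<name>" section typically involves: resolve the
      tracepoint id from debugfs, open a PERF_TYPE_TRACEPOINT event, then hand it
      the program fd. The helper name, buffer sizes, and error handling are mine,
      not the loader's actual code:

      #include <linux/perf_event.h>
      #include <sys/ioctl.h>
      #include <sys/syscall.h>
      #include <fcntl.h>
      #include <stdio.h>
      #include <stdlib.h>
      #include <unistd.h>

      static int attach_tracepoint(int prog_fd, const char *category, const char *name)
      {
          struct perf_event_attr attr = { 0 };
          char path[256], buf[32];
          int id_fd, efd, len;

          snprintf(path, sizeof(path),
                   "/sys/kernel/debug/tracing/events/%s/%s/id", category, name);
          id_fd = open(path, O_RDONLY);
          if (id_fd < 0)
              return -1;
          len = read(id_fd, buf, sizeof(buf) - 1);
          close(id_fd);
          if (len <= 0)
              return -1;
          buf[len] = 0;

          attr.type = PERF_TYPE_TRACEPOINT;
          attr.config = atoi(buf);              /* tracepoint id from debugfs */
          attr.size = sizeof(attr);
          attr.sample_period = 1;
          attr.wakeup_events = 1;

          efd = syscall(__NR_perf_event_open, &attr, -1 /* pid */, 0 /* cpu */,
                        -1 /* group */, 0);
          if (efd < 0)
              return -1;
          if (ioctl(efd, PERF_EVENT_IOC_SET_BPF, prog_fd) ||
              ioctl(efd, PERF_EVENT_IOC_ENABLE, 0)) {
              close(efd);
              return -1;
          }
          return efd;                           /* keep open while attached */
      }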
    • bpf: sanitize bpf tracepoint access · 32bbe007
      Alexei Starovoitov committed
      during bpf program loading remember the last byte of ctx access,
      and at the time of attaching the program to a tracepoint check that
      the program doesn't access bytes beyond those defined by the tracepoint fields
      
      This also disallows access to __dynamic_array fields, but that can be
      relaxed in the future.
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
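
      In user-visible terms: a program whose largest ctx offset reaches past the
      tracepoint's static fields still passes the verifier, but attaching it to
      the tracepoint is expected to be rejected. A hedged sketch, reusing the
      urandom_read layout from the cover letter above (the function name is mine):

      /* sketch: random:urandom_read has three int fields after the hidden
       * 8-byte pad, so its static fields end at byte offset 20; the load
       * below starts at offset 20 and therefore reaches beyond them */
      #include <uapi/linux/bpf.h>
      #include "bpf_helpers.h"            /* SEC() macro from samples/bpf */

      struct urandom_read {
          __u64 hidden_pad;
          int got_bits;
          int pool_left;
          int input_left;
      };

      SEC("tracepoint/random/urandom_read")
      int reads_too_far(struct urandom_read *ctx)
      {
          return *((int *)ctx + 5);       /* offset 20: past the last defined field */
      }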
    • bpf: support bpf_get_stackid() and bpf_perf_event_output() in tracepoint programs · 9940d67c
      Alexei Starovoitov committed
      two wrapper functions are needed to fetch 'struct pt_regs *' and convert
      the tracepoint bpf context into a kprobe bpf context, so that the existing
      helper functions can be reused
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
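
      A simplified kernel-side sketch of the idea (not the literal patch; names
      and signatures here are approximations): the tracepoint ctx begins with the
      hidden 'struct pt_regs *', so the tracepoint flavour of a helper recovers
      it and forwards to the existing kprobe-context helper:

      /* sketch: first 8 bytes of the tracepoint buffer hold 'struct pt_regs *';
       * fetch it and call the same bpf_perf_event_output() used by kprobe progs */
      static u64 bpf_perf_event_output_tp(u64 tp_buff, u64 map, u64 index,
                                          u64 data, u64 size)
      {
          struct pt_regs *regs = *(struct pt_regs **)(long)tp_buff;

          return bpf_perf_event_output((long)regs, map, index, data, size);
      }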
    • bpf: register BPF_PROG_TYPE_TRACEPOINT program type · 9fd82b61
      Alexei Starovoitov committed
      register tracepoint bpf program type and let it call the same set
      of helper functions as BPF_PROG_TYPE_KPROBE
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
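
      In the bpf interface of that era, registering a program type meant adding a
      bpf_prog_type_list entry pointing at its verifier ops; a hedged sketch
      (struct and function names are my recollection of that interface, not
      quoted from the patch):

      /* sketch: expose the same helper set as kprobe progs via the new type's
       * ops and register it at boot */
      static struct bpf_prog_type_list tracepoint_tl = {
          .ops  = &tracepoint_prog_ops,        /* same helpers as BPF_PROG_TYPE_KPROBE */
          .type = BPF_PROG_TYPE_TRACEPOINT,
      };

      static int __init register_tracepoint_prog_ops(void)
      {
          bpf_register_prog_type(&tracepoint_tl);
          return 0;
      }
      late_initcall(register_tracepoint_prog_ops);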
    • perf, bpf: allow bpf programs attach to tracepoints · 98b5c2c6
      Alexei Starovoitov committed
      introduce BPF_PROG_TYPE_TRACEPOINT program type and allow it to be attached
      to the perf tracepoint handler, which will copy the arguments into
      the per-cpu buffer and pass it to the bpf program as its first argument.
      The layout of the fields can be discovered by doing
      'cat /sys/kernel/debug/tracing/events/sched/sched_switch/format'
      prior to the compilation of the program, with the exception that the first 8 bytes
      are reserved and not accessible to the program. This area is used to store
      the pointer to 'struct pt_regs' which some of the bpf helpers will use:
      +---------+
      | 8 bytes | hidden 'struct pt_regs *' (inaccessible to bpf program)
      +---------+
      | N bytes | static tracepoint fields defined in tracepoint/format (bpf readonly)
      +---------+
      | dynamic | __dynamic_array bytes of tracepoint (inaccessible to bpf yet)
      +---------+
      
      Note that all of the fields are already dumped to user space via the perf ring
      buffer, and broken applications access them directly without consulting
      tracepoint/format. The same rule applies here: static tracepoint fields should
      only be accessed in the format defined in tracepoint/format. The order of the
      fields and the field sizes are not an ABI.
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • perf: split perf_trace_buf_prepare into alloc and update parts · 1e1dcd93
      Alexei Starovoitov committed
      the split allows moving the expensive update of 'struct trace_entry' to a later phase.
      Repurpose the unused 1st argument of perf_tp_event() to indicate the event type.
      
      While splitting, use a temp variable 'rctx' instead of '*rctx' to avoid
      unnecessary loads done by the compiler due to -fno-strict-aliasing
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
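
      The resulting call pattern inside a perf tracepoint handler, sketched as a
      fragment; the alloc/update split follows the patch description, but the
      exact function names and argument lists here are assumptions:

      /* sketch of the split: the cheap buffer allocation happens first, and the
       * 'struct trace_entry' update is deferred until after any bpf filtering */
      entry = perf_trace_buf_alloc(size, &regs, &rctx);
      if (!entry)
          return;

      /* ... a bpf program attached to this tracepoint runs against 'entry' here;
       * if it filters the event out, the work below is skipped ... */

      perf_trace_buf_update(entry, event_type);   /* fill in struct trace_entry */
      perf_trace_buf_submit(entry, size, rctx, event_type, count, regs, head, NULL);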