1. 17 Nov, 2015 1 commit
  2. 03 Nov, 2015 2 commits
    • D
      bpf: add sample usages for persistent maps/progs · 42984d7c
      Daniel Borkmann committed
      This patch adds a couple of stand-alone examples on how BPF_OBJ_PIN
      and BPF_OBJ_GET commands can be used.
      
      Example with maps:
      
        # ./fds_example -F /sys/fs/bpf/m -P -m -k 1 -v 42
        bpf: map fd:3 (Success)
        bpf: pin ret:(0,Success)
        bpf: fd:3 u->(1:42) ret:(0,Success)
        # ./fds_example -F /sys/fs/bpf/m -G -m -k 1
        bpf: get fd:3 (Success)
        bpf: fd:3 l->(1):42 ret:(0,Success)
        # ./fds_example -F /sys/fs/bpf/m -G -m -k 1 -v 24
        bpf: get fd:3 (Success)
        bpf: fd:3 u->(1:24) ret:(0,Success)
        # ./fds_example -F /sys/fs/bpf/m -G -m -k 1
        bpf: get fd:3 (Success)
        bpf: fd:3 l->(1):24 ret:(0,Success)
      
        # ./fds_example -F /sys/fs/bpf/m2 -P -m
        bpf: map fd:3 (Success)
        bpf: pin ret:(0,Success)
        # ./fds_example -F /sys/fs/bpf/m2 -G -m -k 1
        bpf: get fd:3 (Success)
        bpf: fd:3 l->(1):0 ret:(0,Success)
        # ./fds_example -F /sys/fs/bpf/m2 -G -m
        bpf: get fd:3 (Success)
      
      Example with progs:
      
        # ./fds_example -F /sys/fs/bpf/p -P -p
        bpf: prog fd:3 (Success)
        bpf: pin ret:(0,Success)
        bpf sock:4 <- fd:3 attached ret:(0,Success)
        # ./fds_example -F /sys/fs/bpf/p -G -p
        bpf: get fd:3 (Success)
        bpf: sock:4 <- fd:3 attached ret:(0,Success)
      
        # ./fds_example -F /sys/fs/bpf/p2 -P -p -o ./sockex1_kern.o
        bpf: prog fd:5 (Success)
        bpf: pin ret:(0,Success)
        bpf: sock:3 <- fd:5 attached ret:(0,Success)
        # ./fds_example -F /sys/fs/bpf/p2 -G -p
        bpf: get fd:3 (Success)
        bpf: sock:4 <- fd:3 attached ret:(0,Success)
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
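The two commands exercised above can be sketched as thin wrappers around the bpf(2) syscall. This is a minimal sketch, not the sample's own code: the names `sys_bpf`, `obj_pin` and `obj_get` are hypothetical, and it assumes kernel headers that define `BPF_OBJ_PIN`, `BPF_OBJ_GET` and `__NR_bpf`.

```c
#include <linux/bpf.h>
#include <stdint.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Hypothetical helper: raw bpf(2) call with a zero-initialised attr. */
static long sys_bpf(int cmd, union bpf_attr *attr)
{
    return syscall(__NR_bpf, cmd, attr, sizeof(*attr));
}

/* Pin an existing map or program fd at a path on a mounted bpf filesystem.
 * Returns 0 on success, -1 on error with errno set. */
int obj_pin(int fd, const char *pathname)
{
    union bpf_attr attr;

    memset(&attr, 0, sizeof(attr));
    attr.pathname = (uint64_t)(uintptr_t)pathname;
    attr.bpf_fd = (uint32_t)fd;
    return (int)sys_bpf(BPF_OBJ_PIN, &attr);
}

/* Obtain a new fd for a previously pinned object.
 * Returns the fd on success, -1 on error with errno set. */
int obj_get(const char *pathname)
{
    union bpf_attr attr;

    memset(&attr, 0, sizeof(attr));
    attr.pathname = (uint64_t)(uintptr_t)pathname;
    return (int)sys_bpf(BPF_OBJ_GET, &attr);
}
```

With such wrappers, the `-P` runs above would correspond to `obj_pin(fd, "/sys/fs/bpf/m")` and the `-G` runs to `obj_get("/sys/fs/bpf/m")`. Both normally need sufficient privileges and a mounted bpf filesystem.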
    • C
      Sample: Trace_event: Correct the comments · 67aedeb8
      Chunyan Zhang committed
      The commit 88920427 ("tracing: Update trace-event-sample with
      TRACE_SYSTEM_VAR documentation") changed TRACE_SYSTEM to 'sample-trace',
      but didn't make the corresponding change to its name in the comments.
      
      Link: http://lkml.kernel.org/r/1443599650-23680-1-git-send-email-zhang.chunyan@linaro.org
      Signed-off-by: Chunyan Zhang <zhang.chunyan@linaro.org>
      Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
  3. 28 Oct, 2015 1 commit
  4. 22 Oct, 2015 1 commit
  5. 13 Oct, 2015 1 commit
    • A
      bpf: add unprivileged bpf tests · bf508877
      Alexei Starovoitov committed
      Add new tests samples/bpf/test_verifier:
      
      unpriv: return pointer
        checks that pointer cannot be returned from the eBPF program
      
      unpriv: add const to pointer
      unpriv: add pointer to pointer
      unpriv: neg pointer
        checks that pointer arithmetic is disallowed
      
      unpriv: cmp pointer with const
      unpriv: cmp pointer with pointer
        checks that comparison of pointers is disallowed
        Only one case is allowed: 'void *value = bpf_map_lookup_elem(..); if (value == 0) ...'
      
      unpriv: check that printk is disallowed
        since bpf_trace_printk is not available to unprivileged
      
      unpriv: pass pointer to helper function
        checks that pointers cannot be passed to functions that expect integers
        If function expects a pointer the verifier allows only that type of pointer.
        Like 1st argument of bpf_map_lookup_elem() must be pointer to map.
        (applies to non-root as well)
      
      unpriv: indirectly pass pointer on stack to helper function
        checks that pointer stored into stack cannot be used as part of key
        passed into bpf_map_lookup_elem()
      
      unpriv: mangle pointer on stack 1
      unpriv: mangle pointer on stack 2
        checks that writing into stack slot that already contains a pointer
        is disallowed
      
      unpriv: read pointer from stack in small chunks
        checks that < 8 byte read from stack slot that contains a pointer is
        disallowed
      
      unpriv: write pointer into ctx
        checks that storing pointers into skb->fields is disallowed
      
      unpriv: write pointer into map elem value
        checks that storing pointers into element values is disallowed
        For example:
        int bpf_prog(struct __sk_buff *skb)
        {
          u32 key = 0;
          u64 *value = bpf_map_lookup_elem(&map, &key);
          if (value)
             *value = (u64) skb;
        }
        will be rejected.
      
      unpriv: partial copy of pointer
        checks that doing 32-bit register mov from register containing
        a pointer is disallowed
      
      unpriv: pass pointer to tail_call
        checks that passing pointer as an index into bpf_tail_call
        is disallowed
      
      unpriv: cmp map pointer with zero
        checks that comparing map pointer with constant is disallowed
      
      unpriv: write into frame pointer
        checks that frame pointer is read-only (applies to root too)
      
      unpriv: cmp of frame pointer
        checks that R10 cannot be used in a comparison
      
      unpriv: cmp of stack pointer
        checks that Rx = R10 - imm is ok, but comparing Rx is not
      
      unpriv: obfuscate stack pointer
        checks that Rx = R10 - imm is ok, but Rx -= imm is not
      Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  6. 02 Oct, 2015 1 commit
    • P
      kprobes: use _do_fork() in samples to make them work again · 54aea454
      Petr Mladek committed
      Commit 3033f14a ("clone: support passing tls argument via C rather
      than pt_regs magic") introduced _do_fork(), which allows passing the
      @tls parameter.
      
      The old do_fork() is defined only for architectures that are not ready
      to use this way and do not define HAVE_COPY_THREAD_TLS.
      
      Let's use _do_fork() in the kprobe examples to make them work again on
      all architectures.
      Signed-off-by: Petr Mladek <pmladek@suse.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Thiago Macieira <thiago.macieira@intel.com>
      Cc: Jiri Kosina <jkosina@suse.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  7. 18 Sep, 2015 1 commit
    • A
      bpf: add bpf_redirect() helper · 27b29f63
      Alexei Starovoitov committed
      Existing bpf_clone_redirect() helper clones skb before redirecting
      it to RX or TX of destination netdev.
      Introduce bpf_redirect() helper that does that without cloning.
      
      Benchmarked with two hosts using 10G ixgbe NICs.
      One host is doing line rate pktgen.
      Another host is configured as:
      $ tc qdisc add dev $dev ingress
      $ tc filter add dev $dev root pref 10 u32 match u32 0 0 flowid 1:2 \
         action bpf run object-file tcbpf1_kern.o section clone_redirect_xmit drop
      so it receives the packet on $dev and immediately xmits it on $dev + 1
      The section 'clone_redirect_xmit' in tcbpf1_kern.o file has the program
      that does bpf_clone_redirect() and performance is 2.0 Mpps
      
      $ tc filter add dev $dev root pref 10 u32 match u32 0 0 flowid 1:2 \
         action bpf run object-file tcbpf1_kern.o section redirect_xmit drop
      which is using bpf_redirect() - 2.4 Mpps
      
      and using cls_bpf with integrated actions as:
      $ tc filter add dev $dev root pref 10 \
        bpf run object-file tcbpf1_kern.o section redirect_xmit integ_act classid 1
      performance is 2.5 Mpps
      
      To summarize:
      u32+act_bpf using clone_redirect - 2.0 Mpps
      u32+act_bpf using redirect - 2.4 Mpps
      cls_bpf using redirect - 2.5 Mpps
      
      For comparison linux bridge in this setup is doing 2.1 Mpps
      and ixgbe rx + drop in ip_rcv - 7.8 Mpps
      Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
      Acked-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: John Fastabend <john.r.fastabend@intel.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
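To put the summary figures in relative terms, the gain over the clone_redirect baseline can be computed with a one-line helper. This is only an arithmetic sketch, and the function name `gain_percent` is made up for illustration. With the numbers quoted above, 2.4 Mpps is roughly 20% and 2.5 Mpps is 25% above 2.0 Mpps.

```c
/* Relative gain of new_mpps over base_mpps, in percent. */
double gain_percent(double base_mpps, double new_mpps)
{
    return (new_mpps - base_mpps) / base_mpps * 100.0;
}
```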
  8. 13 Aug, 2015 1 commit
    • K
      bpf: fix build warnings and add function read_trace_pipe() · 5ed3ccbd
      Kaixu Xia committed
      There are two improvements in this patch:
       1. Fix the build warnings;
       2. Add function read_trace_pipe() to print the result on
          the screen;
      
      Before this patch, the result was only available through
      /sys/kernel/debug/tracing/trace_pipe and nothing was shown on the screen.
      With this patch applied, the result is printed on the screen.
        $ ./tracex6
      	...
               tracex6-705   [003] d..1   131.428593: : CPU-3   19981414
                  sshd-683   [000] d..1   131.428727: : CPU-0   221682321
                  sshd-683   [000] d..1   131.428821: : CPU-0   221808766
                  sshd-683   [000] d..1   131.428950: : CPU-0   221982984
                  sshd-683   [000] d..1   131.429045: : CPU-0   222111851
               tracex6-705   [003] d..1   131.429168: : CPU-3   20757551
                  sshd-683   [000] d..1   131.429170: : CPU-0   222281240
                  sshd-683   [000] d..1   131.429261: : CPU-0   222403340
                  sshd-683   [000] d..1   131.429378: : CPU-0   222561024
      	...
      Signed-off-by: Kaixu Xia <xiakaixu@huawei.com>
      Acked-by: Alexei Starovoitov <ast@plumgrid.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
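The idea behind a helper like read_trace_pipe() can be sketched as a loop that copies whatever the trace pipe yields to a stdio stream. This is an illustrative sketch rather than the patch's code: the name `copy_trace` and the `max_reads` bound are assumptions added so the function terminates and can be tried on any file.

```c
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/*
 * Copy up to max_reads chunks from the file at `path` to `out`.
 * Returns the number of bytes copied, or -1 if the file cannot be opened.
 * Reading a real trace_pipe blocks until trace data is available.
 */
long copy_trace(const char *path, FILE *out, int max_reads)
{
    char buf[4096];
    long total = 0;
    int fd = open(path, O_RDONLY);

    if (fd < 0)
        return -1;

    for (int i = 0; i < max_reads; i++) {
        ssize_t sz = read(fd, buf, sizeof(buf));

        if (sz <= 0)    /* end of file or read error: stop copying */
            break;
        fwrite(buf, 1, (size_t)sz, out);
        total += sz;
    }
    close(fd);
    return total;
}
```

A call such as `copy_trace("/sys/kernel/debug/tracing/trace_pipe", stdout, 100)` would typically require root and a mounted debugfs.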
  9. 10 Aug, 2015 1 commit
  10. 27 Jul, 2015 1 commit
  11. 18 Jul, 2015 1 commit
    • S
      tracing: Fix sample output of dynamic arrays · d6726c81
      Steven Rostedt (Red Hat) committed
      He Kuang noticed that the trace event samples for arrays were broken:
      
      "The output result of trace_foo_bar event in traceevent samples is
       wrong. This problem can be reproduced as following:
      
        (Build kernel with SAMPLE_TRACE_EVENTS=m)
      
        $ insmod trace-events-sample.ko
      
        $ echo 1 > /sys/kernel/debug/tracing/events/sample-trace/foo_bar/enable
      
        $ cat /sys/kernel/debug/tracing/trace
      
        event-sample-980 [000] ....  43.649559: foo_bar: foo hello 21 0x15
        BIT1|BIT3|0x10 {0x1,0x6f6f6e53,0xff007970,0xffffffff} Snoopy
                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
                       The array length is not right, should be {0x1}.
        (ffffffff,ffffffff)
      
        event-sample-980 [000] ....  44.653827: foo_bar: foo hello 22 0x16
        BIT2|BIT3|0x10
        {0x1,0x2,0x646e6147,0x666c61,0xffffffff,0xffffffff,0x750aeffe,0x7}
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
                       The array length is not right, should be {0x1,0x2}.
        Gandalf (ffffffff,ffffffff)"
      
      This was caused by an update to have __print_array()'s second parameter
      be the count of items in the array and not the size of the array.
      
      As there are already users of __print_array(), it cannot change. But
      the sample code can, and we can also improve the documentation of
      __print_array() and __get_dynamic_array_len().
      
      Link: http://lkml.kernel.org/r/1436839171-31527-2-git-send-email-hekuang@huawei.com
      
      Fixes: ac01ce14 ("tracing: Make ftrace_print_array_seq compute buf_len")
      Reported-by: He Kuang <hekuang@huawei.com>
      Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
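The count-versus-size distinction described above can be shown in plain user-space C. This is a sketch under stated assumptions: `format_int_array` and `count_from_bytes` are hypothetical names, and the printer only imitates the `{0x1,0x2}` output style seen in the trace. It is not the kernel's __print_array().

```c
#include <stddef.h>
#include <stdio.h>

/* Convert a length in bytes into a number of elements. */
size_t count_from_bytes(size_t len_bytes, size_t el_size)
{
    return len_bytes / el_size;
}

/*
 * Format `count` ints as "{0x1,0x2}" into `out`.
 * `count` is the number of elements, not the size in bytes: passing the
 * byte length here would walk past the end of the array, which is the
 * kind of over-long output shown in the report above.
 */
size_t format_int_array(char *out, size_t out_len, const int *arr, size_t count)
{
    size_t pos = 0;

    if (out_len == 0)
        return 0;
    pos += (size_t)snprintf(out + pos, out_len - pos, "{");
    for (size_t i = 0; i < count && pos < out_len; i++)
        pos += (size_t)snprintf(out + pos, out_len - pos, "%s0x%x",
                                i ? "," : "", (unsigned int)arr[i]);
    if (pos < out_len)
        pos += (size_t)snprintf(out + pos, out_len - pos, "}");
    return pos;
}
```

For an array `int list[] = {1, 2}`, the byte length is `sizeof(list)`, so the element count to pass is `count_from_bytes(sizeof(list), sizeof(int))`, which is 2.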
  12. 09 Jul, 2015 1 commit
  13. 23 Jun, 2015 1 commit
    • D
      bpf: BPF based latency tracing · 0fb1170e
      Daniel Wagner committed
      BPF offers another way to generate latency histograms. We attach
      kprobes at trace_preempt_off and trace_preempt_on and calculate the
      time between the off and on transitions.
      
      The first array is used to store the start time stamp. The key is the
      CPU id. The second array stores the log2(time diff). We need to use
      static allocation here (arrays, not hash tables). The kprobes
      hooking into trace_preempt_on|off should not call into any dynamic
      memory allocation or free path, because we need to avoid being
      called recursively. Besides that, it reduces jitter in the measurement.
      
      CPU 0
            latency        : count     distribution
             1 -> 1        : 0        |                                        |
             2 -> 3        : 0        |                                        |
             4 -> 7        : 0        |                                        |
             8 -> 15       : 0        |                                        |
            16 -> 31       : 0        |                                        |
            32 -> 63       : 0        |                                        |
            64 -> 127      : 0        |                                        |
           128 -> 255      : 0        |                                        |
           256 -> 511      : 0        |                                        |
           512 -> 1023     : 0        |                                        |
          1024 -> 2047     : 0        |                                        |
          2048 -> 4095     : 166723   |*************************************** |
          4096 -> 8191     : 19870    |***                                     |
          8192 -> 16383    : 6324     |                                        |
         16384 -> 32767    : 1098     |                                        |
         32768 -> 65535    : 190      |                                        |
         65536 -> 131071   : 179      |                                        |
        131072 -> 262143   : 18       |                                        |
        262144 -> 524287   : 4        |                                        |
        524288 -> 1048575  : 1363     |                                        |
      CPU 1
            latency        : count     distribution
             1 -> 1        : 0        |                                        |
             2 -> 3        : 0        |                                        |
             4 -> 7        : 0        |                                        |
             8 -> 15       : 0        |                                        |
            16 -> 31       : 0        |                                        |
            32 -> 63       : 0        |                                        |
            64 -> 127      : 0        |                                        |
           128 -> 255      : 0        |                                        |
           256 -> 511      : 0        |                                        |
           512 -> 1023     : 0        |                                        |
          1024 -> 2047     : 0        |                                        |
          2048 -> 4095     : 114042   |*************************************** |
          4096 -> 8191     : 9587     |**                                      |
          8192 -> 16383    : 4140     |                                        |
         16384 -> 32767    : 673      |                                        |
         32768 -> 65535    : 179      |                                        |
         65536 -> 131071   : 29       |                                        |
        131072 -> 262143   : 4        |                                        |
        262144 -> 524287   : 1        |                                        |
        524288 -> 1048575  : 364      |                                        |
      CPU 2
            latency        : count     distribution
             1 -> 1        : 0        |                                        |
             2 -> 3        : 0        |                                        |
             4 -> 7        : 0        |                                        |
             8 -> 15       : 0        |                                        |
            16 -> 31       : 0        |                                        |
            32 -> 63       : 0        |                                        |
            64 -> 127      : 0        |                                        |
           128 -> 255      : 0        |                                        |
           256 -> 511      : 0        |                                        |
           512 -> 1023     : 0        |                                        |
          1024 -> 2047     : 0        |                                        |
          2048 -> 4095     : 40147    |*************************************** |
          4096 -> 8191     : 2300     |*                                       |
          8192 -> 16383    : 828      |                                        |
         16384 -> 32767    : 178      |                                        |
         32768 -> 65535    : 59       |                                        |
         65536 -> 131071   : 2        |                                        |
        131072 -> 262143   : 0        |                                        |
        262144 -> 524287   : 1        |                                        |
        524288 -> 1048575  : 174      |                                        |
      CPU 3
            latency        : count     distribution
             1 -> 1        : 0        |                                        |
             2 -> 3        : 0        |                                        |
             4 -> 7        : 0        |                                        |
             8 -> 15       : 0        |                                        |
            16 -> 31       : 0        |                                        |
            32 -> 63       : 0        |                                        |
            64 -> 127      : 0        |                                        |
           128 -> 255      : 0        |                                        |
           256 -> 511      : 0        |                                        |
           512 -> 1023     : 0        |                                        |
          1024 -> 2047     : 0        |                                        |
          2048 -> 4095     : 29626    |*************************************** |
          4096 -> 8191     : 2704     |**                                      |
          8192 -> 16383    : 1090     |                                        |
         16384 -> 32767    : 160      |                                        |
         32768 -> 65535    : 72       |                                        |
         65536 -> 131071   : 32       |                                        |
        131072 -> 262143   : 26       |                                        |
        262144 -> 524287   : 12       |                                        |
        524288 -> 1048575  : 298      |                                        |
      
      All this is based on the trace3 examples written by
      Alexei Starovoitov <ast@plumgrid.com>.
      Signed-off-by: Daniel Wagner <daniel.wagner@bmw-carit.de>
      Cc: Alexei Starovoitov <ast@plumgrid.com>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Daniel Borkmann <daniel@iogearbox.net>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: linux-kernel@vger.kernel.org
      Cc: netdev@vger.kernel.org
      Acked-by: Alexei Starovoitov <ast@plumgrid.com>
      Acked-by: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: David S. Miller <davem@davemloft.net>
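The log2 bucketing behind the histograms above can be sketched in user-space C. This is a minimal sketch: `log2_slot`, `hist_add` and `MAX_SLOTS` are hypothetical names, and clamping oversized values into the last slot is an assumption, not something the commit message states. The 20 slots match the 20 rows printed per CPU.

```c
#include <stdint.h>

#define MAX_SLOTS 20    /* rows "1 -> 1" through "524288 -> 1048575" */

/* floor(log2(v)) for v >= 1; returns 0 for v == 0. */
unsigned int log2_slot(uint64_t v)
{
    unsigned int slot = 0;

    while (v >>= 1)
        slot++;
    return slot;
}

/* Count one measured time difference in a statically sized histogram. */
void hist_add(uint64_t hist[MAX_SLOTS], uint64_t delta)
{
    unsigned int slot = log2_slot(delta);

    if (slot >= MAX_SLOTS)      /* assumption: clamp into the last row */
        slot = MAX_SLOTS - 1;
    hist[slot]++;
}
```

A value of 2048 through 4095 lands in slot 11, which corresponds to the `2048 -> 4095` row that dominates every CPU above.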
  14. 16 Jun, 2015 1 commit
    • A
      bpf: introduce current->pid, tgid, uid, gid, comm accessors · ffeedafb
      Alexei Starovoitov committed
      eBPF programs attached to kprobes need to filter based on
      current->pid, uid and other fields, so introduce helper functions:
      
      u64 bpf_get_current_pid_tgid(void)
      Return: current->tgid << 32 | current->pid
      
      u64 bpf_get_current_uid_gid(void)
      Return: current_gid << 32 | current_uid
      
      bpf_get_current_comm(char *buf, int size_of_buf)
      stores current->comm into buf
      
      They can be used from the programs attached to TC as well to classify packets
      based on current task fields.
      
      Update tracex2 example to print histogram of write syscalls for each process
      instead of aggregated for all.
      Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
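The packed return values described above split into two 32-bit halves. The following sketch shows the packing and unpacking in plain C. The helper names are invented for illustration and are not part of the BPF API.

```c
#include <stdint.h>

/* Pack two 32-bit values the way the helpers' return values are described:
 * upper half << 32 | lower half. */
uint64_t pack_hi_lo(uint32_t hi, uint32_t lo)
{
    return ((uint64_t)hi << 32) | lo;
}

/* Lower 32 bits: pid for pid_tgid, uid for uid_gid. */
uint32_t lower_half(uint64_t v)
{
    return (uint32_t)v;
}

/* Upper 32 bits: tgid for pid_tgid, gid for uid_gid. */
uint32_t upper_half(uint64_t v)
{
    return (uint32_t)(v >> 32);
}
```

A program filtering on one process would compare `upper_half()` of the bpf_get_current_pid_tgid() result against the target tgid, and `lower_half()` against a specific thread's pid.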
  15. 07 Jun, 2015 2 commits
    • A
      bpf: allow programs to write to certain skb fields · d691f9e8
      Alexei Starovoitov committed
      Allow programs to read/write the skb->mark and tc_index fields and
      ((struct qdisc_skb_cb *)cb)->data.
      
      mark and tc_index are generically useful in TC.
      cb[0]-cb[4] are primarily used to pass arguments from one
      program to another called via bpf_tail_call() which can
      be seen in sockex3_kern.c example.
      
      All fields of 'struct __sk_buff' are readable to socket and tc_cls_act progs.
      mark, tc_index are writeable from tc_cls_act only.
      cb[0]-cb[4] are writeable by both sockets and tc_cls_act.
      
      Add verifier tests and improve sample code.
      Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • A
      bpf: make programs see skb->data == L2 for ingress and egress · 3431205e
      Alexei Starovoitov committed
      eBPF programs attached to ingress and egress qdiscs see inconsistent skb->data.
      For ingress the L2 header is already pulled, whereas for egress it's present.
      This is known to program writers, who are currently forced to use the
      BPF_LL_OFF workaround.
      Since programs don't change skb internal pointers it is safe to do
      pull/push right around invocation of the program and earlier taps and
      later pt->func() will not be affected.
      Multiple taps via packet_rcv(), tpacket_rcv() are doing the same trick
      around run_filter/BPF_PROG_RUN even if skb_shared.
      
      This fix finally allows programs to use optimized LD_ABS/IND instructions
      without BPF_LL_OFF for higher performance.
      tc ingress + cls_bpf + samples/bpf/tcbpf1_kern.o
             w/o JIT   w/JIT
      before  20.5     23.6 Mpps
      after   21.8     26.6 Mpps
      
      Old programs with BPF_LL_OFF will still work as-is.
      
      We can now undo most of the earlier workaround commit:
      a166151c ("bpf: fix bpf helpers to use skb->mac_header relative offsets")
      Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
      Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  16. 23 May, 2015 5 commits
  17. 22 May, 2015 2 commits
    • A
      samples/bpf: bpf_tail_call example for networking · 530b2c86
      Alexei Starovoitov committed
      Usage:
      $ sudo ./sockex3
      IP     src.port -> dst.port               bytes      packets
      127.0.0.1.42010 -> 127.0.0.1.12865         1568            8
      127.0.0.1.59526 -> 127.0.0.1.33778     11422636       173070
      127.0.0.1.33778 -> 127.0.0.1.59526  11260224828       341974
      127.0.0.1.12865 -> 127.0.0.1.42010         1832           12
      IP     src.port -> dst.port               bytes      packets
      127.0.0.1.42010 -> 127.0.0.1.12865         1568            8
      127.0.0.1.59526 -> 127.0.0.1.33778     23198092       351486
      127.0.0.1.33778 -> 127.0.0.1.59526  22972698518       698616
      127.0.0.1.12865 -> 127.0.0.1.42010         1832           12
      
      This example is similar to sockex2 in that it accumulates per-flow
      statistics, but it does packet parsing differently.
      sockex2 inlines the full packet parser routine into a single bpf program.
      This sockex3 example has 4 independent programs that parse vlan, mpls, ip, ipv6
      and one main program that starts the process.
      bpf_tail_call() mechanism allows each program to be small and be called
      on demand potentially multiple times, so that many vlan, mpls, ip in ip,
      gre encapsulations can be parsed. These and other protocol parsers can
      be added or removed at runtime. TLVs can be parsed in similar manner.
      Note, tail_call_cnt dynamic check limits the number of tail calls to 32.
      Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • A
      samples/bpf: bpf_tail_call example for tracing · 5bacd780
      Alexei Starovoitov committed
      kprobe example that demonstrates what future seccomp programs may look like.
      It attaches to seccomp_phase1() function and tail-calls other BPF programs
      depending on syscall number.
      
      Existing optimized classic BPF seccomp programs generated by Chrome look like:
      if (sd.nr < 121) {
        if (sd.nr < 57) {
          if (sd.nr < 22) {
            if (sd.nr < 7) {
              if (sd.nr < 4) {
                if (sd.nr < 1) {
                  check sys_read
                } else {
                  if (sd.nr < 3) {
                    check sys_write and sys_open
                  } else {
                    check sys_close
                  }
                }
              } else {
            } else {
          } else {
        } else {
      } else {
      }
      
      the future seccomp using native eBPF may look like:
        bpf_tail_call(&sd, &syscall_jmp_table, sd.nr);
      which is simpler, faster and leaves more room for per-syscall checks.
      
      Usage:
      $ sudo ./tracex5
      <...>-366   [001] d...     4.870033: : read(fd=1, buf=00007f6d5bebf000, size=771)
      <...>-369   [003] d...     4.870066: : mmap
      <...>-369   [003] d...     4.870077: : syscall=110 (one of get/set uid/pid/gid)
      <...>-369   [003] d...     4.870089: : syscall=107 (one of get/set uid/pid/gid)
         sh-369   [000] d...     4.891740: : read(fd=0, buf=00000000023d1000, size=512)
         sh-369   [000] d...     4.891747: : write(fd=1, buf=00000000023d3000, size=512)
         sh-369   [000] d...     4.891747: : read(fd=1, buf=00000000023d3000, size=512)
      Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
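The contrast between the nested comparisons and a single table lookup can be illustrated with a user-space analogy: a sparse array of function pointers indexed by syscall number. This is only an analogy under stated assumptions. The names `jmp_table`, `dispatch`, `NR_SLOTS` and the `check_*` handlers are hypothetical, and unlike this sketch a successful bpf_tail_call() does not return to its caller.

```c
#include <stddef.h>

#define NR_SLOTS 8  /* hypothetical table size for the sketch */

typedef int (*handler_fn)(long arg);

/* Stand-ins for per-syscall checks; each returns a distinct tag. */
static int check_read(long arg)  { (void)arg; return 1; }
static int check_write(long arg) { (void)arg; return 2; }
static int check_close(long arg) { (void)arg; return 3; }

/* Sparse table: slots without a handler stay NULL, similar to empty
 * entries in a program array. Indices follow the tree above:
 * 0 is read, 1 is write, 3 is close. */
static const handler_fn jmp_table[NR_SLOTS] = {
    [0] = check_read,
    [1] = check_write,
    [3] = check_close,
};

/* One bounds check and one indexed jump replace the comparison tree.
 * Returns the handler's result, or -1 when no handler is installed. */
int dispatch(unsigned long nr, long arg)
{
    if (nr >= NR_SLOTS || jmp_table[nr] == NULL)
        return -1;
    return jmp_table[nr](arg);
}
```

The lookup cost stays constant as more handlers are added, whereas the comparison tree grows deeper with every additional syscall.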
  18. 13 May, 2015 1 commit
  19. 17 Apr, 2015 2 commits
    • A
      bpf: fix two bugs in verification logic when accessing 'ctx' pointer · 725f9dcd
      Alexei Starovoitov committed
      1.
      first bug is a silly mistake. It broke tracing examples and prevented
      simple bpf programs from loading.
      
      In the following code:
      if (insn->imm == 0 && BPF_SIZE(insn->code) == BPF_W) {
      } else if (...) {
        // this part should have been executed when
        // insn->code == BPF_W and insn->imm != 0
      }
      
      Obviously it's not doing that. So simple instructions like:
      r2 = *(u64 *)(r1 + 8)
      will be rejected. Note the comments in the code around these branches
      were and still are valid and indicate the true intent.
      
      Replace it with:
      if (BPF_SIZE(insn->code) != BPF_W)
        continue;
      
      if (insn->imm == 0) {
      } else if (...) {
        // now this code will be executed when
        // insn->code == BPF_W and insn->imm != 0
      }
      
      2.
      second bug is more subtle.
      If malicious code is using the same dest register as source register,
      the checks designed to prevent the same instruction from being used with different
      pointer types will fail to trigger, since we were assigning src_reg_type
      when it was already overwritten by check_mem_access().
      The fix is trivial. Just move line:
      src_reg_type = regs[insn->src_reg].type;
      before check_mem_access().
      Add new 'access skb fields bad4' test to check this case.
      
      Fixes: 9bac3d6d ("bpf: allow extended BPF programs access skb fields")
      Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • A
      bpf: fix bpf helpers to use skb->mac_header relative offsets · a166151c
      Alexei Starovoitov committed
      As a short-term solution, let's fix bpf helper functions to use
      skb->mac_header relative offsets instead of skb->data, so that
      the same eBPF programs with cls_bpf and act_bpf work on both the
      ingress and egress qdisc paths. We need to ensure that mac_header is set
      before calling into programs. This is effectively the first option
      from the discussion referenced below.
      
      A more long-term solution for LD_ABS|LD_IND instructions will be more
      intrusive but also more beneficial than this, and will be implemented
      later as it's too risky at this point in time.
      
      I.e., we plan to look into the option of moving skb_pull() out of
      eth_type_trans() and into netif_receive_skb() as has been suggested
      as second option. Meanwhile, this solution ensures ingress can be
      used with eBPF, too, and that we won't run into ABI troubles later.
      For dealing with negative offsets inside eBPF helper functions,
      we've implemented bpf_skb_clone_unwritable() to test for unwriteable
      headers.
      
      Reference: http://thread.gmane.org/gmane.linux.network/359129/focus=359694
      Fixes: 608cd71a ("tc: bpf: generalize pedit action")
      Fixes: 91bc4822 ("tc: bpf: add checksum helpers")
      Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  20. 08 Apr, 2015 2 commits
  21. 07 Apr, 2015 1 commit
    • A
      tc: bpf: add checksum helpers · 91bc4822
      Alexei Starovoitov committed
      Commit 608cd71a ("tc: bpf: generalize pedit action") has added the
      possibility to mangle packet data to BPF programs in the tc pipeline.
      This patch adds two helpers bpf_l3_csum_replace() and bpf_l4_csum_replace()
      for fixing up the protocol checksums after the packet mangling.
      
      It also adds a 'flags' argument to the bpf_skb_store_bytes() helper to avoid
      unnecessary checksum recomputations when BPF programs adjust l3/l4
      checksums, and documents all three helpers in the uapi header.
      
      Moreover, a sample program is added to show how BPF programs can make use
      of the mangle and csum helpers.
      Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
      Acked-by: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: David S. Miller <davem@davemloft.net>
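Fixing up a checksum after changing one field does not require summing the whole header again. The general technique is the incremental update from RFC 1624, sketched below in user-space C. This illustrates the arithmetic only. The commit message does not describe how bpf_l3_csum_replace() and bpf_l4_csum_replace() are implemented, and the names `csum_fold`, `csum_full` and `csum_replace16` are chosen for this sketch.

```c
#include <stddef.h>
#include <stdint.h>

/* Fold a 32-bit accumulator into 16 bits with end-around carry. */
static uint16_t csum_fold(uint32_t sum)
{
    while (sum >> 16)
        sum = (sum & 0xffff) + (sum >> 16);
    return (uint16_t)sum;
}

/* Full one's-complement checksum over n 16-bit words. */
uint16_t csum_full(const uint16_t *words, size_t n)
{
    uint32_t sum = 0;

    for (size_t i = 0; i < n; i++)
        sum += words[i];
    return (uint16_t)~csum_fold(sum);
}

/* RFC 1624, equation 3: HC' = ~(~HC + ~m + m'),
 * where m is the old 16-bit field value and m' the new one. */
uint16_t csum_replace16(uint16_t check, uint16_t old_val, uint16_t new_val)
{
    uint32_t sum = (uint16_t)~check;

    sum += (uint16_t)~old_val;
    sum += new_val;
    return (uint16_t)~csum_fold(sum);
}
```

Changing a single 16-bit word and calling `csum_replace16()` with the old checksum gives the same result as recomputing `csum_full()` over the modified header, at a cost that does not depend on the header length.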
  22. 02 Apr, 2015 4 commits
    • A
      samples/bpf: Add kmem_alloc()/free() tracker tool · 9811e353
      Alexei Starovoitov committed
      One BPF program attaches to kmem_cache_alloc_node() and
      remembers all allocated objects in the map.
      Another program attaches to kmem_cache_free() and deletes
      corresponding object from the map.
      
      User space walks the map every second and prints any objects
      which are older than 1 second.
      
      Usage:
      
      	$ sudo tracex4
      
      Then start a few long-lived processes. 'tracex4' will print
      something like this:
      
      	obj 0xffff880465928000 is 13sec old was allocated at ip ffffffff8105dc32
      	obj 0xffff88043181c280 is 13sec old was allocated at ip ffffffff8105dc32
      	obj 0xffff880465848000 is  8sec old was allocated at ip ffffffff8105dc32
      	obj 0xffff8804338bc280 is 15sec old was allocated at ip ffffffff8105dc32
      
      	$ addr2line -fispe vmlinux ffffffff8105dc32
      	do_fork at fork.c:1665
      
      As soon as the processes exit, the memory is reclaimed and 'tracex4'
      prints nothing.
      
      A similar experiment can be done with the __kmalloc()/kfree() pair.
      Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Arnaldo Carvalho de Melo <acme@infradead.org>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Daniel Borkmann <daniel@iogearbox.net>
      Cc: David S. Miller <davem@davemloft.net>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
      Cc: Namhyung Kim <namhyung@kernel.org>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Link: http://lkml.kernel.org/r/1427312966-8434-10-git-send-email-ast@plumgrid.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
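      The user-space side of the tracker, which walks the recorded objects
      and prints the ones older than 1 second, can be sketched as below.
      This is an illustration over a plain array rather than a real BPF map:
      struct alloc_rec and report_old_objects() are hypothetical names, and
      the record layout is an assumption, not the layout used by tracex4.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

#define NSEC_PER_SEC 1000000000ULL

/* Hypothetical record for one tracked allocation. */
struct alloc_rec {
	uint64_t obj;      /* object address (the map key in the description) */
	uint64_t alloc_ns; /* allocation time in nanoseconds */
	uint64_t ip;       /* instruction pointer of the allocating caller */
};

/* Print every record older than one second and return how many
 * records were printed. Records from the future are skipped. */
static size_t report_old_objects(const struct alloc_rec *recs, size_t n,
				 uint64_t now_ns)
{
	size_t printed = 0;

	assert(recs != NULL || n == 0);
	for (size_t i = 0; i < n; i++) {
		uint64_t age_ns;

		if (now_ns < recs[i].alloc_ns)
			continue;
		age_ns = now_ns - recs[i].alloc_ns;
		if (age_ns <= NSEC_PER_SEC)
			continue;
		printf("obj 0x%llx is %2llusec old was allocated at ip %llx\n",
		       (unsigned long long)recs[i].obj,
		       (unsigned long long)(age_ns / NSEC_PER_SEC),
		       (unsigned long long)recs[i].ip);
		printed++;
	}
	return printed;
}
```

      Objects freed in the meantime would already have been deleted from the
      map by the second program, so only live allocations reach this loop.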
    • A
      samples/bpf: Add IO latency analysis (iosnoop/heatmap) tool · 5c7fc2d2
      Alexei Starovoitov authored
      A BPF C program attaches to the
      blk_mq_start_request()/blk_update_request() kprobe events to
      calculate IO latency.
      
      For every completed block IO event it computes the time delta
      in nsec and records it in a histogram map:
      
      	map[log10(delta)*10]++
      
      User space reads this histogram map every 2 seconds and prints
      it as a 'heatmap' using gray shades of the text terminal. Black
      spaces have many events and white spaces have very few events.
      The leftmost space is the smallest latency, the rightmost space is
      the largest latency in the range.
      
      Usage:
      
      	$ sudo ./tracex3
      	and run 'sudo dd if=/dev/sda of=/dev/null' in another terminal.
      
      Observe IO latencies and how different activity (like 'make
      kernel') affects them.
      
      Similar experiments can be done for network transmit latencies,
      syscalls, etc.
      
      '-t' flag prints the heatmap using normal ascii characters:
      
      $ sudo ./tracex3 -t
        heatmap of IO latency
        # - many events with this latency
          - few events
      	|1us      |10us     |100us    |1ms      |10ms     |100ms    |1s |10s
      				 *ooo. *O.#.                                    # 221
      			      .  *#     .                                       # 125
      				 ..   .o#*..                                    # 55
      			    .  . .  .  .#O                                      # 37
      				 .#                                             # 175
      				       .#*.                                     # 37
      				  #                                             # 199
      		      .              . *#*.                                     # 55
      				       *#..*                                    # 42
      				  #                                             # 266
      			      ...***Oo#*OO**o#* .                               # 629
      				  #                                             # 271
      				      . .#o* o.*o*                              # 221
      				. . o* *#O..                                    # 50
      Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
      Cc: Arnaldo Carvalho de Melo <acme@infradead.org>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Daniel Borkmann <daniel@iogearbox.net>
      Cc: David S. Miller <davem@davemloft.net>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
      Cc: Namhyung Kim <namhyung@kernel.org>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Link: http://lkml.kernel.org/r/1427312966-8434-9-git-send-email-ast@plumgrid.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
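      The bucketing rule map[log10(delta)*10]++ gives ten histogram slots per
      decade of latency. A minimal integer-only sketch of that rule follows.
      It is not the code from tracex3: latency_slot(), record_latency(), the
      rounded boundary table and the 100 s clamp are all assumptions made
      for illustration, and slots near a boundary may differ by one from an
      exact floating-point log10.

```c
#include <assert.h>
#include <stdint.h>

#define MAX_DELTA_NS 100000000000ULL /* clamp at 100 s (assumption) */
#define NUM_SLOTS    111             /* slots 0..110 cover 1 ns .. 100 s */

/* 10^(k/10) * 1000 for k = 0..9, rounded: slot boundaries in a decade. */
static const uint32_t step_x1000[10] = {
	1000, 1259, 1585, 1995, 2512, 3162, 3981, 5012, 6310, 7943
};

/* Approximate floor(log10(delta_ns) * 10) without floating point. */
static unsigned int latency_slot(uint64_t delta_ns)
{
	uint64_t base = 1;
	unsigned int decade = 0, sub = 0;

	if (delta_ns == 0)
		return 0;
	if (delta_ns > MAX_DELTA_NS)
		delta_ns = MAX_DELTA_NS;

	/* Find the decade: base <= delta_ns < 10 * base. */
	while (delta_ns / base >= 10) {
		base *= 10;
		decade++;
	}
	/* Find the tenth of a decade inside [base, 10 * base). */
	while (sub < 9 && delta_ns * 1000 >= base * step_x1000[sub + 1])
		sub++;

	return decade * 10 + sub;
}

/* Count one completed IO with the given latency. */
static void record_latency(uint64_t hist[NUM_SLOTS], uint64_t delta_ns)
{
	unsigned int slot = latency_slot(delta_ns);

	assert(slot < NUM_SLOTS);
	hist[slot]++;
}
```

      With this rule a 1 us latency (1000 ns) lands in slot 30 and a 1 ms
      latency in slot 60, so each labelled decade of the heatmap spans ten
      slots.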
    • A
      samples/bpf: Add counting example for kfree_skb() function calls and the write() syscall · d822a192
      Alexei Starovoitov authored
      This example has two probes in one C file that attach to
      different kprobe events and use two different maps.
      
      The 1st probe is an x64-specific equivalent of dropmon. It attaches to
      kfree_skb, retrieves the 'ip' address of the kfree_skb() caller and
      counts the number of packet drops at that 'ip' address. User space
      prints the 'location - count' map every second.
      
      The 2nd probe attaches to kprobe:sys_write and computes a histogram
      of different write sizes.
      
      Usage:
      	$ sudo tracex2
      	location 0xffffffff81695995 count 1
      	location 0xffffffff816d0da9 count 2
      
      	location 0xffffffff81695995 count 2
      	location 0xffffffff816d0da9 count 2
      
      	location 0xffffffff81695995 count 3
      	location 0xffffffff816d0da9 count 2
      
      	557145+0 records in
      	557145+0 records out
      	285258240 bytes (285 MB) copied, 1.02379 s, 279 MB/s
      		   syscall write() stats
      	     byte_size       : count     distribution
      	       1 -> 1        : 3        |                                      |
      	       2 -> 3        : 0        |                                      |
      	       4 -> 7        : 0        |                                      |
      	       8 -> 15       : 0        |                                      |
      	      16 -> 31       : 2        |                                      |
      	      32 -> 63       : 3        |                                      |
      	      64 -> 127      : 1        |                                      |
      	     128 -> 255      : 1        |                                      |
      	     256 -> 511      : 0        |                                      |
      	     512 -> 1023     : 1118968  |************************************* |
      
      Press Ctrl-C at any time. The kernel will automatically clean up maps and programs.
      
      	$ addr2line -ape ./bld_x64/vmlinux 0xffffffff81695995 0xffffffff816d0da9
      	0xffffffff81695995: ./bld_x64/../net/ipv4/icmp.c:1038
      	0xffffffff816d0da9: ./bld_x64/../net/unix/af_unix.c:1231
      Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
      Cc: Arnaldo Carvalho de Melo <acme@infradead.org>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Daniel Borkmann <daniel@iogearbox.net>
      Cc: David S. Miller <davem@davemloft.net>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
      Cc: Namhyung Kim <namhyung@kernel.org>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Link: http://lkml.kernel.org/r/1427312966-8434-8-git-send-email-ast@plumgrid.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
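      The write() size histogram shown above uses power-of-two buckets
      (1 -> 1, 2 -> 3, 4 -> 7, ...). A minimal sketch of that bucketing
      follows. The names log2_slot(), slot_low(), slot_high() and
      count_write() are hypothetical, and treating a size of 0 like a size of
      1 is an assumption of this sketch rather than documented behaviour of
      tracex2.

```c
#include <assert.h>
#include <stdint.h>

#define NUM_SIZE_SLOTS 64

/* floor(log2(size)); a size of 0 is placed in slot 0 (assumption). */
static unsigned int log2_slot(uint64_t size)
{
	unsigned int slot = 0;

	while ((size >>= 1) != 0)
		slot++;
	return slot;
}

/* Smallest size that falls into the given slot. */
static uint64_t slot_low(unsigned int slot)
{
	assert(slot < NUM_SIZE_SLOTS);
	return 1ULL << slot;
}

/* Largest size that falls into the given slot. */
static uint64_t slot_high(unsigned int slot)
{
	assert(slot < NUM_SIZE_SLOTS);
	return slot == 63 ? UINT64_MAX : (1ULL << (slot + 1)) - 1;
}

/* Count one write() call of the given size. */
static void count_write(uint64_t hist[NUM_SIZE_SLOTS], uint64_t size)
{
	hist[log2_slot(size)]++;
}
```

      Under this rule the 512-byte writes issued by dd all fall into the
      '512 -> 1023' row, which matches the single dominant bar in the
      output.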
    • A
      samples/bpf: Add simple non-portable kprobe filter example · b896c4f9
      Alexei Starovoitov authored
      tracex1_kern.c - C program compiled into BPF.
      
      It attaches to kprobe:netif_receive_skb()
      
      When skb->dev->name == "lo", it prints sample debug message into
      trace_pipe via bpf_trace_printk() helper function.
      
      tracex1_user.c - corresponding user space component that:
        - loads BPF program via bpf() syscall
        - opens kprobes:netif_receive_skb event via perf_event_open()
          syscall
        - attaches the program to event via ioctl(event_fd,
          PERF_EVENT_IOC_SET_BPF, prog_fd);
        - prints from trace_pipe
      
      Note, this BPF program is non-portable. It must be recompiled
      with current kernel headers. kprobe is not a stable ABI and
      BPF+kprobe scripts may no longer be meaningful when kernel
      internals change.
      
      No matter in what way the kernel changes, neither the kprobe
      nor the BPF program can ever crash or corrupt the kernel,
      assuming the kprobes, perf and BPF subsystems have no bugs.
      
      The verifier will detect that the program is using
      bpf_trace_printk() and the kernel will print 'this is a DEBUG
      kernel' warning banner, which means that bpf_trace_printk()
      should be used for debugging of the BPF program only.
      
      Usage:
      $ sudo tracex1
                  ping-19826 [000] d.s2 63103.382648: : skb ffff880466b1ca00 len 84
                  ping-19826 [000] d.s2 63103.382684: : skb ffff880466b1d300 len 84
      
                  ping-19826 [000] d.s2 63104.382533: : skb ffff880466b1ca00 len 84
                  ping-19826 [000] d.s2 63104.382594: : skb ffff880466b1d300 len 84
      Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
      Cc: Arnaldo Carvalho de Melo <acme@infradead.org>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Daniel Borkmann <daniel@iogearbox.net>
      Cc: David S. Miller <davem@davemloft.net>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
      Cc: Namhyung Kim <namhyung@kernel.org>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Link: http://lkml.kernel.org/r/1427312966-8434-7-git-send-email-ast@plumgrid.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
  23. 25 Mar, 2015 2 commits
  24. 18 Mar, 2015 1 commit
  25. 16 Mar, 2015 1 commit
  26. 15 Mar, 2015 1 commit
  27. 02 Mar, 2015 1 commit