1. 07 6月, 2015 1 次提交
    • A
      bpf: make programs see skb->data == L2 for ingress and egress · 3431205e
      Alexei Starovoitov 提交于
      eBPF programs attached to ingress and egress qdiscs see inconsistent skb->data.
      For ingress L2 header is already pulled, whereas for egress it's present.
      This is known to program writers which are currently forced to use
      BPF_LL_OFF workaround.
      Since programs don't change skb internal pointers it is safe to do
      pull/push right around invocation of the program and earlier taps and
      later pt->func() will not be affected.
      Multiple taps via packet_rcv(), tpacket_rcv() are doing the same trick
      around run_filter/BPF_PROG_RUN even if skb_shared.
      
      This fix finally allows programs to use optimized LD_ABS/IND instructions
      without BPF_LL_OFF for higher performance.
      tc ingress + cls_bpf + samples/bpf/tcbpf1_kern.o
             w/o JIT   w/JIT
      before  20.5     23.6 Mpps
      after   21.8     26.6 Mpps
      
      Old programs with BPF_LL_OFF will still work as-is.
      
      We can now undo most of the earlier workaround commit:
      a166151c ("bpf: fix bpf helpers to use skb->mac_header relative offsets")
      Signed-off-by: NAlexei Starovoitov <ast@plumgrid.com>
      Acked-by: NJamal Hadi Salim <jhs@mojatatu.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      3431205e
  2. 17 4月, 2015 1 次提交
    • A
      bpf: fix bpf helpers to use skb->mac_header relative offsets · a166151c
      Alexei Starovoitov 提交于
      For the short-term solution, lets fix bpf helper functions to use
      skb->mac_header relative offsets instead of skb->data in order to
      get the same eBPF programs with cls_bpf and act_bpf work on ingress
      and egress qdisc path. We need to ensure that mac_header is set
      before calling into programs. This is effectively the first option
      from below referenced discussion.
      
      More long term solution for LD_ABS|LD_IND instructions will be more
      intrusive but also more beneficial than this, and implemented later
      as it's too risky at this point in time.
      
      I.e., we plan to look into the option of moving skb_pull() out of
      eth_type_trans() and into netif_receive_skb() as has been suggested
      as second option. Meanwhile, this solution ensures ingress can be
      used with eBPF, too, and that we won't run into ABI troubles later.
      For dealing with negative offsets inside eBPF helper functions,
      we've implemented bpf_skb_clone_unwritable() to test for unwriteable
      headers.
      
      Reference: http://thread.gmane.org/gmane.linux.network/359129/focus=359694
      Fixes: 608cd71a ("tc: bpf: generalize pedit action")
      Fixes: 91bc4822 ("tc: bpf: add checksum helpers")
      Signed-off-by: NAlexei Starovoitov <ast@plumgrid.com>
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      a166151c
  3. 13 3月, 2015 1 次提交
    • D
      cls_bpf: do eBPF invocation under non-bh RCU lock variant for maps · 54720df1
      Daniel Borkmann 提交于
      Currently, it is possible in cls_bpf to access eBPF maps only under
      rcu_read_lock_bh() variants: while on ingress side, that is, handle_ing(),
      the classifier would be called from __netif_receive_skb_core() under
      rcu_read_lock(); on egress side, however, it's rcu_read_lock_bh() via
      __dev_queue_xmit().
      
      This rcu/rcu_bh mix doesn't work together with eBPF maps as they require
      soley to be called under rcu_read_lock(). eBPF maps could also be shared
      among various other eBPF programs (possibly even with other eBPF program
      types, f.e. tracing) and user space processes, so any context is assumed.
      
      Therefore, a possible fix for cls_bpf is to wrap/nest eBPF program
      invocation under non-bh RCU lock variant.
      
      Fixes: e2e9b654 ("cls_bpf: add initial eBPF support for programmable classifiers")
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      Acked-by: NAlexei Starovoitov <ast@plumgrid.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      54720df1
  4. 10 3月, 2015 1 次提交
    • C
      net_sched: destroy proto tp when all filters are gone · 1e052be6
      Cong Wang 提交于
      Kernel automatically creates a tp for each
      (kind, protocol, priority) tuple, which has handle 0,
      when we add a new filter, but it still is left there
      after we remove our own, unless we don't specify the
      handle (literally means all the filters under
      the tuple). For example this one is left:
      
        # tc filter show dev eth0
        filter parent 8001: protocol arp pref 49152 basic
      
      The user-space is hard to clean up these for kernel
      because filters like u32 are organized in a complex way.
      So kernel is responsible to remove it after all filters
      are gone.  Each type of filter has its own way to
      store the filters, so each type has to provide its
      way to check if all filters are gone.
      
      Cc: Jamal Hadi Salim <jhs@mojatatu.com>
      Signed-off-by: NCong Wang <cwang@twopensource.com>
      Signed-off-by: NCong Wang <xiyou.wangcong@gmail.com>
      Acked-by: Jamal Hadi Salim<jhs@mojatatu.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      1e052be6
  5. 02 3月, 2015 1 次提交
    • D
      cls_bpf: add initial eBPF support for programmable classifiers · e2e9b654
      Daniel Borkmann 提交于
      This work extends the "classic" BPF programmable tc classifier by
      extending its scope also to native eBPF code!
      
      This allows for user space to implement own custom, 'safe' C like
      classifiers (or whatever other frontend language LLVM et al may
      provide in future), that can then be compiled with the LLVM eBPF
      backend to an eBPF elf file. The result of this can be loaded into
      the kernel via iproute2's tc. In the kernel, they can be JITed on
      major archs and thus run in native performance.
      
      Simple, minimal toy example to demonstrate the workflow:
      
        #include <linux/ip.h>
        #include <linux/if_ether.h>
        #include <linux/bpf.h>
      
        #include "tc_bpf_api.h"
      
        __section("classify")
        int cls_main(struct sk_buff *skb)
        {
          return (0x800 << 16) | load_byte(skb, ETH_HLEN + __builtin_offsetof(struct iphdr, tos));
        }
      
        char __license[] __section("license") = "GPL";
      
      The classifier can then be compiled into eBPF opcodes and loaded
      via tc, for example:
      
        clang -O2 -emit-llvm -c cls.c -o - | llc -march=bpf -filetype=obj -o cls.o
        tc filter add dev em1 parent 1: bpf cls.o [...]
      
      As it has been demonstrated, the scope can even reach up to a fully
      fledged flow dissector (similarly as in samples/bpf/sockex2_kern.c).
      
      For tc, maps are allowed to be used, but from kernel context only,
      in other words, eBPF code can keep state across filter invocations.
      In future, we perhaps may reattach from a different application to
      those maps e.g., to read out collected statistics/state.
      
      Similarly as in socket filters, we may extend functionality for eBPF
      classifiers over time depending on the use cases. For that purpose,
      cls_bpf programs are using BPF_PROG_TYPE_SCHED_CLS program type, so
      we can allow additional functions/accessors (e.g. an ABI compatible
      offset translation to skb fields/metadata). For an initial cls_bpf
      support, we allow the same set of helper functions as eBPF socket
      filters, but we could diverge at some point in time w/o problem.
      
      I was wondering whether cls_bpf and act_bpf could share C programs,
      I can imagine that at some point, we introduce i) further common
      handlers for both (or even beyond their scope), and/or if truly needed
      ii) some restricted function space for each of them. Both can be
      abstracted easily through struct bpf_verifier_ops in future.
      
      The context of cls_bpf versus act_bpf is slightly different though:
      a cls_bpf program will return a specific classid whereas act_bpf a
      drop/non-drop return code, latter may also in future mangle skbs.
      That said, we can surely have a "classify" and "action" section in
      a single object file, or considered mentioned constraint add a
      possibility of a shared section.
      
      The workflow for getting native eBPF running from tc [1] is as
      follows: for f_bpf, I've added a slightly modified ELF parser code
      from Alexei's kernel sample, which reads out the LLVM compiled
      object, sets up maps (and dynamically fixes up map fds) if any, and
      loads the eBPF instructions all centrally through the bpf syscall.
      
      The resulting fd from the loaded program itself is being passed down
      to cls_bpf, which looks up struct bpf_prog from the fd store, and
      holds reference, so that it stays available also after tc program
      lifetime. On tc filter destruction, it will then drop its reference.
      
      Moreover, I've also added the optional possibility to annotate an
      eBPF filter with a name (e.g. path to object file, or something
      else if preferred) so that when tc dumps currently installed filters,
      some more context can be given to an admin for a given instance (as
      opposed to just the file descriptor number).
      
      Last but not least, bpf_prog_get() and bpf_prog_put() needed to be
      exported, so that eBPF can be used from cls_bpf built as a module.
      Thanks to 60a3b225 ("net: bpf: make eBPF interpreter images
      read-only") I think this is of no concern since anything wanting to
      alter eBPF opcode after verification stage would crash the kernel.
      
        [1] http://git.breakpoint.cc/cgit/dborkman/iproute2.git/log/?h=ebpfSigned-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      Cc: Jamal Hadi Salim <jhs@mojatatu.com>
      Cc: Jiri Pirko <jiri@resnulli.us>
      Acked-by: NAlexei Starovoitov <ast@plumgrid.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      e2e9b654
  6. 27 1月, 2015 2 次提交
    • D
      net: cls_bpf: fix auto generation of per list handles · 3f2ab135
      Daniel Borkmann 提交于
      When creating a bpf classifier in tc with priority collisions and
      invoking automatic unique handle assignment, cls_bpf_grab_new_handle()
      will return a wrong handle id which in fact is non-unique. Usually
      altering of specific filters is being addressed over major id, but
      in case of collisions we result in a filter chain, where handle ids
      address individual cls_bpf_progs inside the classifier.
      
      Issue is, in cls_bpf_grab_new_handle() we probe for head->hgen handle
      in cls_bpf_get() and in case we found a free handle, we're supposed
      to use exactly head->hgen. In case of insufficient numbers of handles,
      we bail out later as handle id 0 is not allowed.
      
      Fixes: 7d1d65cb ("net: sched: cls_bpf: add BPF-based classifier")
      Signed-off-by: NDaniel Borkmann <dborkman@redhat.com>
      Acked-by: NJiri Pirko <jiri@resnulli.us>
      Acked-by: NAlexei Starovoitov <ast@plumgrid.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      3f2ab135
    • D
      net: cls_bpf: fix size mismatch on filter preparation · 7913ecf6
      Daniel Borkmann 提交于
      In cls_bpf_modify_existing(), we read out the number of filter blocks,
      do some sanity checks, allocate a block on that size, and copy over the
      BPF instruction blob from user space, then pass everything through the
      classic BPF checker prior to installation of the classifier.
      
      We should reject mismatches here, there are 2 scenarios: the number of
      filter blocks could be smaller than the provided instruction blob, so
      we do a partial copy of the BPF program, and thus the instructions will
      either be rejected from the verifier or a valid BPF program will be run;
      in the other case, we'll end up copying more than we're supposed to,
      and most likely the trailing garbage will be rejected by the verifier
      as well (i.e. we need to fit instruction pattern, ret {A,K} needs to be
      last instruction, load/stores must be correct, etc); in case not, we
      would leak memory when dumping back instruction patterns. The code should
      have only used nla_len() as Dave noted to avoid this from the beginning.
      Anyway, lets fix it by rejecting such load attempts.
      
      Fixes: 7d1d65cb ("net: sched: cls_bpf: add BPF-based classifier")
      Signed-off-by: NDaniel Borkmann <dborkman@redhat.com>
      Acked-by: NJiri Pirko <jiri@resnulli.us>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      7913ecf6
  7. 18 1月, 2015 1 次提交
  8. 10 12月, 2014 1 次提交
  9. 09 12月, 2014 2 次提交
  10. 07 10月, 2014 1 次提交
    • J
      net: sched: do not use tcf_proto 'tp' argument from call_rcu · 18cdb37e
      John Fastabend 提交于
      Using the tcf_proto pointer 'tp' from inside the classifiers callback
      is not valid because it may have been cleaned up by another call_rcu
      occuring on another CPU.
      
      'tp' is currently being used by tcf_unbind_filter() in this patch we
      move instances of tcf_unbind_filter outside of the call_rcu() context.
      This is safe to do because any running schedulers will either read the
      valid class field or it will be zeroed.
      
      And all schedulers today when the class is 0 do a lookup using the
      same call used by the tcf_exts_bind(). So even if we have a running
      classifier hit the null class pointer it will do a lookup and get
      to the same result. This is particularly fragile at the moment because
      the only way to verify this is to audit the schedulers call sites.
      Reported-by: NCong Wang <xiyou.wangconf@gmail.com>
      Signed-off-by: NJohn Fastabend <john.r.fastabend@intel.com>
      Acked-by: NCong Wang <cwang@twopensource.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      18cdb37e
  11. 29 9月, 2014 1 次提交
  12. 16 9月, 2014 1 次提交
  13. 14 9月, 2014 2 次提交
  14. 03 8月, 2014 1 次提交
    • A
      net: filter: split 'struct sk_filter' into socket and bpf parts · 7ae457c1
      Alexei Starovoitov 提交于
      clean up names related to socket filtering and bpf in the following way:
      - everything that deals with sockets keeps 'sk_*' prefix
      - everything that is pure BPF is changed to 'bpf_*' prefix
      
      split 'struct sk_filter' into
      struct sk_filter {
      	atomic_t        refcnt;
      	struct rcu_head rcu;
      	struct bpf_prog *prog;
      };
      and
      struct bpf_prog {
              u32                     jited:1,
                                      len:31;
              struct sock_fprog_kern  *orig_prog;
              unsigned int            (*bpf_func)(const struct sk_buff *skb,
                                                  const struct bpf_insn *filter);
              union {
                      struct sock_filter      insns[0];
                      struct bpf_insn         insnsi[0];
                      struct work_struct      work;
              };
      };
      so that 'struct bpf_prog' can be used independent of sockets and cleans up
      'unattached' bpf use cases
      
      split SK_RUN_FILTER macro into:
          SK_RUN_FILTER to be used with 'struct sk_filter *' and
          BPF_PROG_RUN to be used with 'struct bpf_prog *'
      
      __sk_filter_release(struct sk_filter *) gains
      __bpf_prog_release(struct bpf_prog *) helper function
      
      also perform related renames for the functions that work
      with 'struct bpf_prog *', since they're on the same lines:
      
      sk_filter_size -> bpf_prog_size
      sk_filter_select_runtime -> bpf_prog_select_runtime
      sk_filter_free -> bpf_prog_free
      sk_unattached_filter_create -> bpf_prog_create
      sk_unattached_filter_destroy -> bpf_prog_destroy
      sk_store_orig_filter -> bpf_prog_store_orig_filter
      sk_release_orig_filter -> bpf_release_orig_filter
      __sk_migrate_filter -> bpf_migrate_filter
      __sk_prepare_filter -> bpf_prepare_filter
      
      API for attaching classic BPF to a socket stays the same:
      sk_attach_filter(prog, struct sock *)/sk_detach_filter(struct sock *)
      and SK_RUN_FILTER(struct sk_filter *, ctx) to execute a program
      which is used by sockets, tun, af_packet
      
      API for 'unattached' BPF programs becomes:
      bpf_prog_create(struct bpf_prog **)/bpf_prog_destroy(struct bpf_prog *)
      and BPF_PROG_RUN(struct bpf_prog *, ctx) to execute a program
      which is used by isdn, ppp, team, seccomp, ptp, xt_bpf, cls_bpf, test_bpf
      Signed-off-by: NAlexei Starovoitov <ast@plumgrid.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      7ae457c1
  15. 24 5月, 2014 1 次提交
    • D
      net: filter: let unattached filters use sock_fprog_kern · b1fcd35c
      Daniel Borkmann 提交于
      The sk_unattached_filter_create() API is used by BPF filters that
      are not directly attached or related to sockets, and are used in
      team, ptp, xt_bpf, cls_bpf, etc. As such all users do their own
      internal managment of obtaining filter blocks and thus already
      have them in kernel memory and set up before calling into
      sk_unattached_filter_create(). As a result, due to __user annotation
      in sock_fprog, sparse triggers false positives (incorrect type in
      assignment [different address space]) when filters are set up before
      passing them to sk_unattached_filter_create(). Therefore, let
      sk_unattached_filter_create() API use sock_fprog_kern to overcome
      this issue.
      Signed-off-by: NDaniel Borkmann <dborkman@redhat.com>
      Acked-by: NAlexei Starovoitov <ast@plumgrid.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      b1fcd35c
  16. 28 4月, 2014 1 次提交
  17. 14 1月, 2014 1 次提交
  18. 19 12月, 2013 2 次提交
  19. 11 12月, 2013 1 次提交
  20. 30 10月, 2013 1 次提交
    • D
      net: sched: cls_bpf: add BPF-based classifier · 7d1d65cb
      Daniel Borkmann 提交于
      This work contains a lightweight BPF-based traffic classifier that can
      serve as a flexible alternative to ematch-based tree classification, i.e.
      now that BPF filter engine can also be JITed in the kernel. Naturally, tc
      actions and policies are supported as well with cls_bpf. Multiple BPF
      programs/filter can be attached for a class, or they can just as well be
      written within a single BPF program, that's really up to the user how he
      wishes to run/optimize the code, e.g. also for inversion of verdicts etc.
      The notion of a BPF program's return/exit codes is being kept as follows:
      
           0: No match
          -1: Select classid given in "tc filter ..." command
        else: flowid, overwrite the default one
      
      As a minimal usage example with iproute2, we use a 3 band prio root qdisc
      on a router with sfq each as leave, and assign ssh and icmp bpf-based
      filters to band 1, http traffic to band 2 and the rest to band 3. For the
      first two bands we load the bytecode from a file, in the 2nd we load it
      inline as an example:
      
      echo 1 > /proc/sys/net/core/bpf_jit_enable
      
      tc qdisc del dev em1 root
      tc qdisc add dev em1 root handle 1: prio bands 3 priomap 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
      
      tc qdisc add dev em1 parent 1:1 sfq perturb 16
      tc qdisc add dev em1 parent 1:2 sfq perturb 16
      tc qdisc add dev em1 parent 1:3 sfq perturb 16
      
      tc filter add dev em1 parent 1: bpf run bytecode-file /etc/tc/ssh.bpf flowid 1:1
      tc filter add dev em1 parent 1: bpf run bytecode-file /etc/tc/icmp.bpf flowid 1:1
      tc filter add dev em1 parent 1: bpf run bytecode-file /etc/tc/http.bpf flowid 1:2
      tc filter add dev em1 parent 1: bpf run bytecode "`bpfc -f tc -i misc.ops`" flowid 1:3
      
      BPF programs can be easily created and passed to tc, either as inline
      'bytecode' or 'bytecode-file'. There are a couple of front-ends that can
      compile opcodes, for example:
      
      1) People familiar with tcpdump-like filters:
      
         tcpdump -iem1 -ddd port 22 | tr '\n' ',' > /etc/tc/ssh.bpf
      
      2) People that want to low-level program their filters or use BPF
         extensions that lack support by libpcap's compiler:
      
         bpfc -f tc -i ssh.ops > /etc/tc/ssh.bpf
      
         ssh.ops example code:
         ldh [12]
         jne #0x800, drop
         ldb [23]
         jneq #6, drop
         ldh [20]
         jset #0x1fff, drop
         ldxb 4 * ([14] & 0xf)
         ldh [%x + 14]
         jeq #0x16, pass
         ldh [%x + 16]
         jne #0x16, drop
         pass: ret #-1
         drop: ret #0
      
      It was chosen to load bytecode into tc, since the reverse operation,
      tc filter list dev em1, is then able to show the exact commands again.
      Possible follow-up work could also include a small expression compiler
      for iproute2. Tested with the help of bmon. This idea came up during
      the Netfilter Workshop 2013 in Copenhagen. Also thanks to feedback from
      Eric Dumazet!
      Signed-off-by: NDaniel Borkmann <dborkman@redhat.com>
      Cc: Thomas Graf <tgraf@suug.ch>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      7d1d65cb