1. 31 3月, 2016 1 次提交
    • D
      bpf: make padding in bpf_tunnel_key explicit · c0e760c9
      Daniel Borkmann 提交于
      Make the 2 byte padding in struct bpf_tunnel_key between tunnel_ttl
      and tunnel_label members explicit. No issue has been observed, and
      gcc/llvm does padding for the old struct already, where tunnel_label
      was not yet present, so the current code works, but since it's part
      of uapi, make sure we don't introduce holes in structs.
      
      Therefore, add tunnel_ext that we can use generically in future
      (f.e. to flag OAM messages for backends, etc). Also add the offset
      to the compat tests to be sure should some compilers not padd the
      tail of the old version of bpf_tunnel_key.
      
      Fixes: 4018ab18 ("bpf: support flow label for bpf_skb_{set, get}_tunnel_key")
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      Acked-by: NAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      c0e760c9
  2. 19 3月, 2016 3 次提交
  3. 12 3月, 2016 1 次提交
  4. 09 3月, 2016 6 次提交
    • D
      bpf, vxlan, geneve, gre: fix usage of dst_cache on xmit · db3c6139
      Daniel Borkmann 提交于
      The assumptions from commit 0c1d70af ("net: use dst_cache for vxlan
      device"), 468dfffc ("geneve: add dst caching support") and 3c1cb4d2
      ("net/ipv4: add dst cache support for gre lwtunnels") on dst_cache usage
      when ip_tunnel_info is used is unfortunately not always valid as assumed.
      
      While it seems correct for ip_tunnel_info front-ends such as OVS, eBPF
      however can fill in ip_tunnel_info for consumers like vxlan, geneve or gre
      with different remote dsts, tos, etc, therefore they cannot be assumed as
      packet independent.
      
      Right now vxlan, geneve, gre would cache the dst for eBPF and every packet
      would reuse the same entry that was first created on the initial route
      lookup. eBPF doesn't store/cache the ip_tunnel_info, so each skb may have
      a different one.
      
      Fix it by adding a flag that checks the ip_tunnel_info. Also the !tos test
      in vxlan needs to be handeled differently in this context as it is currently
      inferred from ip_tunnel_info as well if present. ip_tunnel_dst_cache_usable()
      helper is added for the three tunnel cases, which checks if we can use dst
      cache.
      
      Fixes: 0c1d70af ("net: use dst_cache for vxlan device")
      Fixes: 468dfffc ("geneve: add dst caching support")
      Fixes: 3c1cb4d2 ("net/ipv4: add dst cache support for gre lwtunnels")
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      Acked-by: NPaolo Abeni <pabeni@redhat.com>
      Acked-by: NHannes Frederic Sowa <hannes@stressinduktion.org>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      db3c6139
    • D
      bpf: support for access to tunnel options · 14ca0751
      Daniel Borkmann 提交于
      After eBPF being able to programmatically access/manage tunnel key meta
      data via commit d3aa45ce ("bpf: add helpers to access tunnel metadata")
      and more recently also for IPv6 through c6c33454 ("bpf: support ipv6
      for bpf_skb_{set,get}_tunnel_key"), this work adds two complementary
      helpers to generically access their auxiliary tunnel options.
      
      Geneve and vxlan support this facility. For geneve, TLVs can be pushed,
      and for the vxlan case its GBP extension. I.e. setting tunnel key for geneve
      case only makes sense, if we can also read/write TLVs into it. In the GBP
      case, it provides the flexibility to easily map the group policy ID in
      combination with other helpers or maps.
      
      I chose to model this as two separate helpers, bpf_skb_{set,get}_tunnel_opt(),
      for a couple of reasons. bpf_skb_{set,get}_tunnel_key() is already rather
      complex by itself, and there may be cases for tunnel key backends where
      tunnel options are not always needed. If we would have integrated this
      into bpf_skb_{set,get}_tunnel_key() nevertheless, we are very limited with
      remaining helper arguments, so keeping compatibility on structs in case of
      passing in a flat buffer gets more cumbersome. Separating both also allows
      for more flexibility and future extensibility, f.e. options could be fed
      directly from a map, etc.
      
      Moreover, change geneve's xmit path to test only for info->options_len
      instead of TUNNEL_GENEVE_OPT flag. This makes it more consistent with vxlan's
      xmit path and allows for avoiding to specify a protocol flag in the API on
      xmit, so it can be protocol agnostic. Having info->options_len is enough
      information that is needed. Tested with vxlan and geneve.
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      Acked-by: NAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      14ca0751
    • D
      bpf: allow to propagate df in bpf_skb_set_tunnel_key · 22080870
      Daniel Borkmann 提交于
      Added by 9a628224 ("ip_tunnel: Add dont fragment flag."), allow to
      feed df flag into tunneling facilities (currently supported on TX by
      vxlan, geneve and gre) as a hint from eBPF's bpf_skb_set_tunnel_key()
      helper.
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      Acked-by: NAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      22080870
    • D
      bpf: make helper function protos static · 577c50aa
      Daniel Borkmann 提交于
      They are only used here, so there's no reason they should not be static.
      Only the vlan push/pop protos are used in the test_bpf suite.
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      Acked-by: NAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      577c50aa
    • D
      bpf: add flags to bpf_skb_store_bytes for clearing hash · 8afd54c8
      Daniel Borkmann 提交于
      When overwriting parts of the packet with bpf_skb_store_bytes() that
      were fed previously into skb->hash calculation, we should clear the
      current hash with skb_clear_hash(), so that a next skb_get_hash() call
      can determine the correct hash related to this skb.
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      Acked-by: NAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      8afd54c8
    • D
      bpf: allow bpf_csum_diff to feed bpf_l3_csum_replace as well · 8050c0f0
      Daniel Borkmann 提交于
      Commit 7d672345 ("bpf: add generic bpf_csum_diff helper") added a
      generic checksum diff helper that can feed bpf_l4_csum_replace() with
      a target __wsum diff that is to be applied to the L4 checksum. This
      facility is very flexible, can be cascaded, allows for adding, removing,
      or diffing data, or for calculating the pseudo header checksum from
      scratch, but it can also be reused for working with the IPv4 header
      checksum.
      
      Thus, analogous to bpf_l4_csum_replace(), add a case for header field
      value of 0 to change the checksum at a given offset through a new helper
      csum_replace_by_diff(). Also, in addition to that, this provides an
      easy to use interface for feeding precalculated diffs f.e. coming from
      a map. It nicely complements bpf_l3_csum_replace() that currently allows
      only for csum updates of 2 and 4 byte diffs.
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      Acked-by: NAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      8050c0f0
  5. 02 3月, 2016 1 次提交
  6. 25 2月, 2016 1 次提交
    • D
      bpf: fix csum setting for bpf_set_tunnel_key · 2da897e5
      Daniel Borkmann 提交于
      The fix in 35e2d115 ("tunnels: Allow IPv6 UDP checksums to be correctly
      controlled.") changed behavior for bpf_set_tunnel_key() when in use with
      IPv6 and thus uncovered a bug that TUNNEL_CSUM needed to be set but wasn't.
      As a result, the stack dropped ingress vxlan IPv6 packets, that have been
      sent via eBPF through collect meta data mode due to checksum now being zero.
      
      Since after LCO, we enable IPv4 checksum by default, so make that analogous
      and only provide a flag BPF_F_ZERO_CSUM_TX for the user to turn it off in
      IPv4 case.
      
      Fixes: 35e2d115 ("tunnels: Allow IPv6 UDP checksums to be correctly controlled.")
      Fixes: c6c33454 ("bpf: support ipv6 for bpf_skb_{set,get}_tunnel_key")
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      Acked-by: NAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      2da897e5
  7. 22 2月, 2016 5 次提交
  8. 11 2月, 2016 1 次提交
    • C
      soreuseport: Prep for fast reuseport TCP socket selection · fa463497
      Craig Gallek 提交于
      Both of the lines in this patch probably should have been included
      in the initial implementation of this code for generic socket
      support, but weren't technically necessary since only UDP sockets
      were supported.
      
      First, the sk_reuseport_cb points to a structure which assumes
      each socket in the group has this pointer assigned at the same
      time it's added to the array in the structure.  The sk_clone_lock
      function breaks this assumption.  Since a child socket shouldn't
      implicitly be in a reuseport group, the simple fix is to clear
      the field in the clone.
      
      Second, the SO_ATTACH_REUSEPORT_xBPF socket options require that
      SO_REUSEPORT also be set first.  For UDP sockets, this is easily
      enforced at bind-time since that process both puts the socket in
      the appropriate receive hlist and updates the reuseport structures.
      Since these operations can happen at two different times for TCP
      sockets (bind and listen) it must be explicitly checked to enforce
      the use of SO_REUSEPORT with SO_ATTACH_REUSEPORT_xBPF in the
      setsockopt call.
      Signed-off-by: NCraig Gallek <kraig@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      fa463497
  9. 13 1月, 2016 2 次提交
  10. 12 1月, 2016 2 次提交
  11. 11 1月, 2016 1 次提交
    • D
      bpf: add skb_postpush_rcsum and fix dev_forward_skb occasions · f8ffad69
      Daniel Borkmann 提交于
      Add a small helper skb_postpush_rcsum() and fix up redirect locations
      that need CHECKSUM_COMPLETE fixups on ingress. dev_forward_skb() expects
      a proper csum that covers also Ethernet header, f.e. since 2c26d34b
      ("net/core: Handle csum for CHECKSUM_COMPLETE VXLAN forwarding"), we
      also do skb_postpull_rcsum() after pulling Ethernet header off via
      eth_type_trans().
      
      When using eBPF in a netns setup f.e. with vxlan in collect metadata mode,
      I can trigger the following csum issue with an IPv6 setup:
      
        [  505.144065] dummy1: hw csum failure
        [...]
        [  505.144108] Call Trace:
        [  505.144112]  <IRQ>  [<ffffffff81372f08>] dump_stack+0x44/0x5c
        [  505.144134]  [<ffffffff81607cea>] netdev_rx_csum_fault+0x3a/0x40
        [  505.144142]  [<ffffffff815fee3f>] __skb_checksum_complete+0xcf/0xe0
        [  505.144149]  [<ffffffff816f0902>] nf_ip6_checksum+0xb2/0x120
        [  505.144161]  [<ffffffffa08c0e0e>] icmpv6_error+0x17e/0x328 [nf_conntrack_ipv6]
        [  505.144170]  [<ffffffffa0898eca>] ? ip6t_do_table+0x2fa/0x645 [ip6_tables]
        [  505.144177]  [<ffffffffa08c0725>] ? ipv6_get_l4proto+0x65/0xd0 [nf_conntrack_ipv6]
        [  505.144189]  [<ffffffffa06c9a12>] nf_conntrack_in+0xc2/0x5a0 [nf_conntrack]
        [  505.144196]  [<ffffffffa08c039c>] ipv6_conntrack_in+0x1c/0x20 [nf_conntrack_ipv6]
        [  505.144204]  [<ffffffff8164385d>] nf_iterate+0x5d/0x70
        [  505.144210]  [<ffffffff816438d6>] nf_hook_slow+0x66/0xc0
        [  505.144218]  [<ffffffff816bd302>] ipv6_rcv+0x3f2/0x4f0
        [  505.144225]  [<ffffffff816bca40>] ? ip6_make_skb+0x1b0/0x1b0
        [  505.144232]  [<ffffffff8160b77b>] __netif_receive_skb_core+0x36b/0x9a0
        [  505.144239]  [<ffffffff8160bdc8>] ? __netif_receive_skb+0x18/0x60
        [  505.144245]  [<ffffffff8160bdc8>] __netif_receive_skb+0x18/0x60
        [  505.144252]  [<ffffffff8160ccff>] process_backlog+0x9f/0x140
        [  505.144259]  [<ffffffff8160c4a5>] net_rx_action+0x145/0x320
        [...]
      
      What happens is that on ingress, we push Ethernet header back in, either
      from cls_bpf or right before skb_do_redirect(), but without updating csum.
      The "hw csum failure" can be fixed by using the new skb_postpush_rcsum()
      helper for the dev_forward_skb() case to correct the csum diff again.
      
      Thanks to Hannes Frederic Sowa for the csum_partial() idea!
      
      Fixes: 3896d655 ("bpf: introduce bpf_clone_redirect() helper")
      Fixes: 27b29f63 ("bpf: add bpf_redirect() helper")
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      Acked-by: NAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      f8ffad69
  12. 05 1月, 2016 1 次提交
  13. 19 12月, 2015 3 次提交
    • D
      bpf: fix misleading comment in bpf_convert_filter · 23bf8807
      Daniel Borkmann 提交于
      Comment says "User BPF's register A is mapped to our BPF register 6",
      which is actually wrong as the mapping is on register 0. This can
      already be inferred from the code itself. So just remove it before
      someone makes assumptions based on that. Only code tells truth. ;)
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      Acked-by: NAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      23bf8807
    • D
      bpf: move clearing of A/X into classic to eBPF migration prologue · 8b614aeb
      Daniel Borkmann 提交于
      Back in the days where eBPF (or back then "internal BPF" ;->) was not
      exposed to user space, and only the classic BPF programs internally
      translated into eBPF programs, we missed the fact that for classic BPF
      A and X needed to be cleared. It was fixed back then via 83d5b7ef
      ("net: filter: initialize A and X registers"), and thus classic BPF
      specifics were added to the eBPF interpreter core to work around it.
      
      This added some confusion for JIT developers later on that take the
      eBPF interpreter code as an example for deriving their JIT. F.e. in
      f75298f5 ("s390/bpf: clear correct BPF accumulator register"), at
      least X could leak stack memory. Furthermore, since this is only needed
      for classic BPF translations and not for eBPF (verifier takes care
      that read access to regs cannot be done uninitialized), more complexity
      is added to JITs as they need to determine whether they deal with
      migrations or native eBPF where they can just omit clearing A/X in
      their prologue and thus reduce image size a bit, see f.e. cde66c2d
      ("s390/bpf: Only clear A and X for converted BPF programs"). In other
      cases (x86, arm64), A and X is being cleared in the prologue also for
      eBPF case, which is unnecessary.
      
      Lets move this into the BPF migration in bpf_convert_filter() where it
      actually belongs as long as the number of eBPF JITs are still few. It
      can thus be done generically; allowing us to remove the quirk from
      __bpf_prog_run() and to slightly reduce JIT image size in case of eBPF,
      while reducing code duplication on this matter in current(/future) eBPF
      JITs.
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      Acked-by: NAlexei Starovoitov <ast@kernel.org>
      Reviewed-by: NMichael Holzheu <holzheu@linux.vnet.ibm.com>
      Tested-by: NMichael Holzheu <holzheu@linux.vnet.ibm.com>
      Cc: Zi Shen Lim <zlim.lnx@gmail.com>
      Cc: Yang Shi <yang.shi@linaro.org>
      Acked-by: NYang Shi <yang.shi@linaro.org>
      Acked-by: NZi Shen Lim <zlim.lnx@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      8b614aeb
    • D
      bpf: add bpf_skb_load_bytes helper · 05c74e5e
      Daniel Borkmann 提交于
      When hacking tc programs with eBPF, one of the issues that come up
      from time to time is to load addresses from headers. In eBPF as in
      classic BPF, we have BPF_LD | BPF_ABS | BPF_{B,H,W} instructions that
      extract a byte, half-word or word out of the skb data though helpers
      such as bpf_load_pointer() (interpreter case).
      
      F.e. extracting a whole IPv6 address could possibly look like ...
      
        union v6addr {
          struct {
            __u32 p1;
            __u32 p2;
            __u32 p3;
            __u32 p4;
          };
          __u8 addr[16];
        };
      
        [...]
      
        a.p1 = htonl(load_word(skb, off));
        a.p2 = htonl(load_word(skb, off +  4));
        a.p3 = htonl(load_word(skb, off +  8));
        a.p4 = htonl(load_word(skb, off + 12));
      
        [...]
      
        /* access to a.addr[...] */
      
      This work adds a complementary helper bpf_skb_load_bytes() (we also
      have bpf_skb_store_bytes()) as an alternative where the same call
      would look like from an eBPF program:
      
        ret = bpf_skb_load_bytes(skb, off, addr, sizeof(addr));
      
      Same verifier restrictions apply as in ffeedafb ("bpf: introduce
      current->pid, tgid, uid, gid, comm accessors") case, where stack memory
      access needs to be statically verified and thus guaranteed to be
      initialized in first use (otherwise verifier cannot tell whether a
      subsequent access to it is valid or not as it's runtime dependent).
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      Acked-by: NAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      05c74e5e
  14. 13 10月, 2015 1 次提交
    • A
      bpf: enable non-root eBPF programs · 1be7f75d
      Alexei Starovoitov 提交于
      In order to let unprivileged users load and execute eBPF programs
      teach verifier to prevent pointer leaks.
      Verifier will prevent
      - any arithmetic on pointers
        (except R10+Imm which is used to compute stack addresses)
      - comparison of pointers
        (except if (map_value_ptr == 0) ... )
      - passing pointers to helper functions
      - indirectly passing pointers in stack to helper functions
      - returning pointer from bpf program
      - storing pointers into ctx or maps
      
      Spill/fill of pointers into stack is allowed, but mangling
      of pointers stored in the stack or reading them byte by byte is not.
      
      Within bpf programs the pointers do exist, since programs need to
      be able to access maps, pass skb pointer to LD_ABS insns, etc
      but programs cannot pass such pointer values to the outside
      or obfuscate them.
      
      Only allow BPF_PROG_TYPE_SOCKET_FILTER unprivileged programs,
      so that socket filters (tcpdump), af_packet (quic acceleration)
      and future kcm can use it.
      tracing and tc cls/act program types still require root permissions,
      since tracing actually needs to be able to see all kernel pointers
      and tc is for root only.
      
      For example, the following unprivileged socket filter program is allowed:
      int bpf_prog1(struct __sk_buff *skb)
      {
        u32 index = load_byte(skb, ETH_HLEN + offsetof(struct iphdr, protocol));
        u64 *value = bpf_map_lookup_elem(&my_map, &index);
      
        if (value)
      	*value += skb->len;
        return 0;
      }
      
      but the following program is not:
      int bpf_prog1(struct __sk_buff *skb)
      {
        u32 index = load_byte(skb, ETH_HLEN + offsetof(struct iphdr, protocol));
        u64 *value = bpf_map_lookup_elem(&my_map, &index);
      
        if (value)
      	*value += (u64) skb;
        return 0;
      }
      since it would leak the kernel address into the map.
      
      Unprivileged socket filter bpf programs have access to the
      following helper functions:
      - map lookup/update/delete (but they cannot store kernel pointers into them)
      - get_random (it's already exposed to unprivileged user space)
      - get_smp_processor_id
      - tail_call into another socket filter program
      - ktime_get_ns
      
      The feature is controlled by sysctl kernel.unprivileged_bpf_disabled.
      This toggle defaults to off (0), but can be set true (1).  Once true,
      bpf programs and maps cannot be accessed from unprivileged process,
      and the toggle cannot be set back to false.
      Signed-off-by: NAlexei Starovoitov <ast@plumgrid.com>
      Reviewed-by: NKees Cook <keescook@chromium.org>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      1be7f75d
  15. 11 10月, 2015 1 次提交
    • A
      bpf: fix cb access in socket filter programs · ff936a04
      Alexei Starovoitov 提交于
      eBPF socket filter programs may see junk in 'u32 cb[5]' area,
      since it could have been used by protocol layers earlier.
      
      For socket filter programs used in af_packet we need to clean
      20 bytes of skb->cb area if it could be used by the program.
      For programs attached to TCP/UDP sockets we need to save/restore
      these 20 bytes, since it's used by protocol layers.
      
      Remove SK_RUN_FILTER macro, since it's no longer used.
      
      Long term we may move this bpf cb area to per-cpu scratch, but that
      requires addition of new 'per-cpu load/store' instructions,
      so not suitable as a short term fix.
      
      Fixes: d691f9e8 ("bpf: allow programs to write to certain skb fields")
      Reported-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NAlexei Starovoitov <ast@plumgrid.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      ff936a04
  16. 08 10月, 2015 3 次提交
  17. 05 10月, 2015 2 次提交
    • D
      bpf, seccomp: prepare for upcoming criu support · bab18991
      Daniel Borkmann 提交于
      The current ongoing effort to dump existing cBPF seccomp filters back
      to user space requires to hold the pre-transformed instructions like
      we do in case of socket filters from sk_attach_filter() side, so they
      can be reloaded in original form at a later point in time by utilities
      such as criu.
      
      To prepare for this, simply extend the bpf_prog_create_from_user()
      API to hold a flag that tells whether we should store the original
      or not. Also, fanout filters could make use of that in future for
      things like diag. While fanout filters already use bpf_prog_destroy(),
      move seccomp over to them as well to handle original programs when
      present.
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      Cc: Tycho Andersen <tycho.andersen@canonical.com>
      Cc: Pavel Emelyanov <xemul@parallels.com>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Alexei Starovoitov <ast@plumgrid.com>
      Tested-by: NTycho Andersen <tycho.andersen@canonical.com>
      Acked-by: NAlexei Starovoitov <ast@plumgrid.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      bab18991
    • D
      bpf: fix panic in SO_GET_FILTER with native ebpf programs · 93d08b69
      Daniel Borkmann 提交于
      When sockets have a native eBPF program attached through
      setsockopt(sk, SOL_SOCKET, SO_ATTACH_BPF, ...), and then try to
      dump these over getsockopt(sk, SOL_SOCKET, SO_GET_FILTER, ...),
      the following panic appears:
      
        [49904.178642] BUG: unable to handle kernel NULL pointer dereference at (null)
        [49904.178762] IP: [<ffffffff81610fd9>] sk_get_filter+0x39/0x90
        [49904.182000] PGD 86fc9067 PUD 531a1067 PMD 0
        [49904.185196] Oops: 0000 [#1] SMP
        [...]
        [49904.224677] Call Trace:
        [49904.226090]  [<ffffffff815e3d49>] sock_getsockopt+0x319/0x740
        [49904.227535]  [<ffffffff812f59e3>] ? sock_has_perm+0x63/0x70
        [49904.228953]  [<ffffffff815e2fc8>] ? release_sock+0x108/0x150
        [49904.230380]  [<ffffffff812f5a43>] ? selinux_socket_getsockopt+0x23/0x30
        [49904.231788]  [<ffffffff815dff36>] SyS_getsockopt+0xa6/0xc0
        [49904.233267]  [<ffffffff8171b9ae>] entry_SYSCALL_64_fastpath+0x12/0x71
      
      The underlying issue is the very same as in commit b382c086
      ("sock, diag: fix panic in sock_diag_put_filterinfo"), that is,
      native eBPF programs don't store an original program since this
      is only needed in cBPF ones.
      
      However, sk_get_filter() wasn't updated to test for this at the
      time when eBPF could be attached. Just throw an error to the user
      to indicate that eBPF cannot be dumped over this interface.
      That way, it can also be known that a program _is_ attached (as
      opposed to just return 0), and a different (future) method needs
      to be consulted for a dump.
      
      Fixes: 89aa0758 ("net: sock: allow eBPF programs to be attached to sockets")
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      Acked-by: NAlexei Starovoitov <ast@plumgrid.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      93d08b69
  18. 03 10月, 2015 3 次提交
  19. 24 9月, 2015 1 次提交
  20. 18 9月, 2015 1 次提交
    • A
      bpf: add bpf_redirect() helper · 27b29f63
      Alexei Starovoitov 提交于
      Existing bpf_clone_redirect() helper clones skb before redirecting
      it to RX or TX of destination netdev.
      Introduce bpf_redirect() helper that does that without cloning.
      
      Benchmarked with two hosts using 10G ixgbe NICs.
      One host is doing line rate pktgen.
      Another host is configured as:
      $ tc qdisc add dev $dev ingress
      $ tc filter add dev $dev root pref 10 u32 match u32 0 0 flowid 1:2 \
         action bpf run object-file tcbpf1_kern.o section clone_redirect_xmit drop
      so it receives the packet on $dev and immediately xmits it on $dev + 1
      The section 'clone_redirect_xmit' in tcbpf1_kern.o file has the program
      that does bpf_clone_redirect() and performance is 2.0 Mpps
      
      $ tc filter add dev $dev root pref 10 u32 match u32 0 0 flowid 1:2 \
         action bpf run object-file tcbpf1_kern.o section redirect_xmit drop
      which is using bpf_redirect() - 2.4 Mpps
      
      and using cls_bpf with integrated actions as:
      $ tc filter add dev $dev root pref 10 \
        bpf run object-file tcbpf1_kern.o section redirect_xmit integ_act classid 1
      performance is 2.5 Mpps
      
      To summarize:
      u32+act_bpf using clone_redirect - 2.0 Mpps
      u32+act_bpf using redirect - 2.4 Mpps
      cls_bpf using redirect - 2.5 Mpps
      
      For comparison linux bridge in this setup is doing 2.1 Mpps
      and ixgbe rx + drop in ip_rcv - 7.8 Mpps
      Signed-off-by: NAlexei Starovoitov <ast@plumgrid.com>
      Acked-by: NDaniel Borkmann <daniel@iogearbox.net>
      Acked-by: NJohn Fastabend <john.r.fastabend@intel.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      27b29f63