1. 16 October 2018, 1 commit
    • net: extend sk_pacing_rate to unsigned long · 76a9ebe8
      Committed by Eric Dumazet
      sk_pacing_rate was introduced as a u32 field in 2013,
      effectively limiting per-flow pacing to about 34Gbit (a u32 counts
      bytes per second, so the ceiling is 2^32 B/s, roughly 34.3 Gbit/s).
      
      We believe it is time to allow TCP to pace high speed flows
      on 64bit hosts, as we now can reach 100Gbit on one TCP flow.
      
      This patch adds no cost for 32bit kernels.
      
      The tcpi_pacing_rate and tcpi_max_pacing_rate were already
      exported as 64bit, so the iproute2/ss command requires no changes.
      
      Unfortunately the SO_MAX_PACING_RATE socket option will stay
      32bit and we will need to add a new option to let applications
      control high pacing rates.
      
      State      Recv-Q Send-Q Local Address:Port             Peer Address:Port
      ESTAB      0      1787144  10.246.9.76:49992             10.246.9.77:36741
                       timer:(on,003ms,0) ino:91863 sk:2 <->
       skmem:(r0,rb540000,t66440,tb2363904,f605944,w1822984,o0,bl0,d0)
       ts sack bbr wscale:8,8 rto:201 rtt:0.057/0.006 mss:1448
       rcvmss:536 advmss:1448
       cwnd:138 ssthresh:178 bytes_acked:256699822585 segs_out:177279177
       segs_in:3916318 data_segs_out:177279175
       bbr:(bw:31276.8Mbps,mrtt:0,pacing_gain:1.25,cwnd_gain:2)
       send 28045.5Mbps lastrcv:73333
       pacing_rate 38705.0Mbps delivery_rate 22997.6Mbps
       busy:73333ms unacked:135 retrans:0/157 rcv_space:14480
       notsent:2085120 minrtt:0.013
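      As a side note on the SO_MAX_PACING_RATE limitation above, a minimal
      user-space sketch of how an application caps pacing today; the option
      still takes a 32-bit value, so rates above ~34Gbit cannot be expressed
      through it:

        #include <sys/socket.h>

        /* cap pacing of one socket; bytes_per_sec is limited to UINT32_MAX */
        static int cap_pacing(int fd, unsigned int bytes_per_sec)
        {
                return setsockopt(fd, SOL_SOCKET, SO_MAX_PACING_RATE,
                                  &bytes_per_sec, sizeof(bytes_per_sec));
        }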
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      76a9ebe8
  2. 09 October 2018, 1 commit
    • bpf: fix building without CONFIG_INET · df3f94a0
      Committed by Arnd Bergmann
      The newly added TCP and UDP handling fails to link when CONFIG_INET
      is disabled:
      
      net/core/filter.o: In function `sk_lookup':
      filter.c:(.text+0x7ff8): undefined reference to `tcp_hashinfo'
      filter.c:(.text+0x7ffc): undefined reference to `tcp_hashinfo'
      filter.c:(.text+0x8020): undefined reference to `__inet_lookup_established'
      filter.c:(.text+0x8058): undefined reference to `__inet_lookup_listener'
      filter.c:(.text+0x8068): undefined reference to `udp_table'
      filter.c:(.text+0x8070): undefined reference to `udp_table'
      filter.c:(.text+0x808c): undefined reference to `__udp4_lib_lookup'
      net/core/filter.o: In function `bpf_sk_release':
      filter.c:(.text+0x82e8): undefined reference to `sock_gen_put'
      
      Wrap the related sections of code in #ifdefs for the config option.
      
      Furthermore, sk_lookup() should always have been marked 'static'; this
      also avoids a warning about a missing prototype when building with
      'make W=1'.
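      A rough sketch of the pattern; the actual sk_lookup() signature and body
      in net/core/filter.c are simplified here:

        #ifdef CONFIG_INET
        /* only built when the INET stack (tcp_hashinfo, udp_table, ...) exists */
        static struct sock *sk_lookup(struct net *net, struct bpf_sock_tuple *tuple)
        {
                /* ... __inet_lookup_established() / __udp4_lib_lookup() ... */
                return NULL;
        }
        #endif /* CONFIG_INET */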
      
      Fixes: 6acc9b43 ("bpf: Add helper to retrieve socket in BPF")
      Signed-off-by: Arnd Bergmann <arnd@arndb.de>
      Signed-off-by: Joe Stringer <joe@wand.net.nz>
      Acked-by: Song Liu <songliubraving@fb.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      df3f94a0
  3. 04 October 2018, 1 commit
  4. 03 October 2018, 2 commits
    • bpf: Add helper to retrieve socket in BPF · 6acc9b43
      Committed by Joe Stringer
      This patch adds new BPF helper functions, bpf_sk_lookup_tcp() and
      bpf_sk_lookup_udp(), which allow BPF programs to find out if there is a
      socket listening on this host, and return a socket pointer which the
      BPF program can then access to determine, for instance, whether to
      forward or drop traffic. bpf_sk_lookup_xxx() may take a reference on the
      socket, so when a BPF program makes use of this function, it must
      subsequently pass the returned pointer into the newly added sk_release()
      to return the reference.
      
      By way of example, the following pseudocode would filter inbound
      connections at XDP if there is no corresponding service listening for
      the traffic:
      
        struct bpf_sock_tuple tuple;
        struct bpf_sock *sk;
      
        populate_tuple(ctx, &tuple); // Extract the 5tuple from the packet
        sk = bpf_sk_lookup_tcp(ctx, &tuple, sizeof tuple, netns, 0);
        if (!sk) {
          // Couldn't find a socket listening for this traffic. Drop.
          return TC_ACT_SHOT;
        }
        bpf_sk_release(sk, 0);
        return TC_ACT_OK;
      Signed-off-by: Joe Stringer <joe@wand.net.nz>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      6acc9b43
    • bpf: Add PTR_TO_SOCKET verifier type · c64b7983
      Committed by Joe Stringer
      Teach the verifier a little bit about a new type of pointer, a
      PTR_TO_SOCKET. This pointer type is accessed from BPF through the
      'struct bpf_sock' structure.
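      Roughly, this is the kind of access the verifier now permits; a sketch
      only, reusing the lookup pseudocode of the companion patch (ctx, tuple
      and netns are assumed to be set up by the program):

        struct bpf_sock *sk;
        __u32 family = 0;

        sk = bpf_sk_lookup_tcp(ctx, &tuple, sizeof(tuple), netns, 0);
        if (sk) {
                /* loads through a PTR_TO_SOCKET are checked by the verifier
                 * against the layout of struct bpf_sock */
                family = sk->family;
                bpf_sk_release(sk, 0);
        }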
      Signed-off-by: Joe Stringer <joe@wand.net.nz>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      c64b7983
  5. 15 September 2018, 1 commit
  6. 13 September 2018, 1 commit
    • bpf: use __GFP_COMP while allocating page · 4c3d795c
      Committed by Tushar Dave
      Helper bpf_msg_pull_data() can allocate multiple pages while
      linearizing multiple scatterlist elements into one shared page.
      However, if the shared page has size > PAGE_SIZE, using
      copy_page_to_iter() causes the warning below.
      
      e.g.
      [ 6367.019832] WARNING: CPU: 2 PID: 7410 at lib/iov_iter.c:825
      page_copy_sane.part.8+0x0/0x8
      
      To avoid the above warning, use __GFP_COMP while allocating multiple
      contiguous pages.
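      A sketch of the allocation with the flag added (variable names follow
      the helper but are abbreviated here):

        /* one compound page covering the whole linearized area */
        page = alloc_pages(__GFP_NOWARN | GFP_ATOMIC | __GFP_COMP,
                           get_order(copy));
        if (unlikely(!page))
                return -ENOMEM;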
      
      Fixes: 015632bb ("bpf: sk_msg program helper bpf_sk_msg_pull_data")
      Signed-off-by: Tushar Dave <tushar.n.dave@oracle.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      4c3d795c
  7. 12 September 2018, 1 commit
    • net/core/filter: fix unused-variable warning · 1edb6e03
      Committed by Anders Roxell
      Building with CONFIG_INET=n will show the warning below:
      net/core/filter.c: In function ‘____bpf_getsockopt’:
      net/core/filter.c:4048:19: warning: unused variable ‘tp’ [-Wunused-variable]
        struct tcp_sock *tp;
                         ^~
      net/core/filter.c:4046:31: warning: unused variable ‘icsk’ [-Wunused-variable]
        struct inet_connection_sock *icsk;
                                     ^~~~
      Move the variable declarations inside the {} block where they are used.
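      A condensed sketch of the move (the surrounding ____bpf_getsockopt()
      code is simplified and only the shape of the change is shown):

        /* before: 'icsk' and 'tp' declared at function scope, unused when
         * CONFIG_INET=n; after: declared inside the block that uses them */
        if (level == SOL_TCP && sk->sk_prot->getsockopt == tcp_getsockopt) {
                struct inet_connection_sock *icsk;
                struct tcp_sock *tp;
                /* ... TCP_CONGESTION / TCP_SAVED_SYN handling ... */
        }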
      
      Fixes: 1e215300 ("bpf: add TCP_SAVE_SYN/TCP_SAVED_SYN options for bpf_(set|get)sockopt")
      Signed-off-by: Anders Roxell <anders.roxell@linaro.org>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      1edb6e03
  8. 07 September 2018, 3 commits
    • xdp: split code for map vs non-map redirect · 47b123ed
      Committed by Jesper Dangaard Brouer
      The compiler does an efficient job of inlining static C functions.
      Perf top clearly shows that almost everything gets inlined into the
      function call xdp_do_redirect.
      
      The function xdp_do_redirect ends up containing and interleaving the
      map and non-map redirect code.  This is sub-optimal, as it would be
      strange for an XDP program to use both types of redirect in the same
      program. The two use-cases are separate, and interleaving the code
      just causes more instruction-cache pressure.
      
      I would like to stress (again) that the non-map variant bpf_redirect
      is very slow compared to the bpf_redirect_map variant, approximately
      half the speed.  Measured with the i40e driver, the difference is:
      
      - map     redirect: 13,250,350 pps
      - non-map redirect:  7,491,425 pps
      
      For this reason, the non-map variant of redirect has been named
      xdp_do_redirect_slow.  This hopefully gives a hint, when using perf,
      that this is not the optimal XDP redirect operating mode.
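      A hedged sketch of the resulting split; signatures and the per-CPU
      bookkeeping are simplified and may not match the tree exactly:

        int xdp_do_redirect(struct net_device *dev, struct xdp_buff *xdp,
                            struct bpf_prog *xdp_prog)
        {
                struct bpf_map *map = __this_cpu_read(redirect_info.map);

                if (likely(map))
                        return xdp_do_redirect_map(dev, xdp, xdp_prog, map);

                /* rarely used path, kept out of the hot instruction cache */
                return xdp_do_redirect_slow(dev, xdp, xdp_prog);
        }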
      Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      47b123ed
    • xdp: explicit inline __xdp_map_lookup_elem · 2a68d85f
      Committed by Jesper Dangaard Brouer
      The compiler chooses not to inline the function __xdp_map_lookup_elem,
      because it can see that it is used by both the Generic-XDP and native-XDP
      redirect calls (xdp_do_generic_redirect_map and xdp_do_redirect_map).
      
      The compiler cannot know that this is a bad choice, as it cannot know
      that a net device cannot run both XDP modes (Generic or Native) at the
      same time.  Thus, mark this function inline, even though we normally
      leave this up to the compiler.
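      The change boils down to an explicit inline hint on the lookup; a sketch
      of the declaration (return and parameter types assumed):

        /* forced inline: generic and native XDP both call it, but a given
         * device only ever runs one of the two modes */
        static __always_inline void *__xdp_map_lookup_elem(struct bpf_map *map,
                                                           u32 index);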
      Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      2a68d85f
    • xdp: unlikely instrumentation for xdp map redirect · e1302542
      Committed by Jesper Dangaard Brouer
      Notice the compiler-generated ASM code layout was suboptimal.  It
      assumed map enqueue errors were the likely case, which they shouldn't be.
      It assumed that xdp_do_flush_map() was a likely case, due to maps
      changing between packets, which should be very unlikely.
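      A sketch of the annotations described (surrounding code abbreviated,
      helper names assumed from the redirect-map path):

        err = __bpf_tx_xdp_map(dev, fwd, map, xdp, index);
        if (unlikely(err))
                return err;             /* enqueue errors are the cold path */

        if (unlikely(ri->map_to_flush && ri->map_to_flush != map))
                xdp_do_flush_map();     /* map rarely changes between packets */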
      Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      e1302542
  9. 03 September 2018, 1 commit
    • bpf: Fix bpf_msg_pull_data() · 9db39f4d
      Committed by Tushar Dave
      Helper bpf_msg_pull_data() mistakenly reuses variable 'offset' while
      linearizing multiple scatterlist elements. Variable 'offset' is used
      to find the first starting scatterlist element,
          i.e. msg->data = sg_virt(&sg[first_sg]) + start - offset
      
      Use a different variable name while linearizing multiple scatterlist
      elements so that the value contained in variable 'offset' won't get
      overwritten.
      
      Fixes: 015632bb ("bpf: sk_msg program helper bpf_sk_msg_pull_data")
      Signed-off-by: Tushar Dave <tushar.n.dave@oracle.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      9db39f4d
  10. 01 September 2018, 1 commit
  11. 30 August 2018, 3 commits
    • bpf: fix sg shift repair start offset in bpf_msg_pull_data · a8cf76a9
      Committed by Daniel Borkmann
      When we perform the sg shift repair for the scatterlist ring, we
      currently start out at i = first_sg + 1. However, this is not
      correct since the first_sg could point to the sge sitting at slot
      MAX_SKB_FRAGS - 1, and a subsequent i = MAX_SKB_FRAGS will access
      the scatterlist ring (sg) out of bounds. Add the sk_msg_iter_var()
      helper for iterating through the ring, and apply the same rule
      for advancing to the next ring element as we do elsewhere. Later
      work will use this helper also in other places.
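      The helper is essentially a ring-advance macro; a sketch of what it
      looks like:

        #define sk_msg_iter_var(var)                    \
                do {                                    \
                        var++;                          \
                        if (var == MAX_SKB_FRAGS)       \
                                var = 0;                \
                } while (0)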
      
      Fixes: 015632bb ("bpf: sk_msg program helper bpf_sk_msg_pull_data")
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: John Fastabend <john.fastabend@gmail.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      a8cf76a9
    • bpf: fix shift upon scatterlist ring wrap-around in bpf_msg_pull_data · 2e43f95d
      Committed by Daniel Borkmann
      If first_sg and last_sg wraps around in the scatterlist ring, then we
      need to account for that in the shift as well. E.g. crafting such msgs
      where this is the case leads to a hang as shift becomes negative. E.g.
      consider the following scenario:
      
        first_sg := 14     |=>    shift := -12     msg->sg_start := 10
        last_sg  :=  3     |                       msg->sg_end   :=  5
      
      round  1:  i := 15, move_from :=   3, sg[15] := sg[  3]
      round  2:  i :=  0, move_from := -12, sg[ 0] := sg[-12]
      round  3:  i :=  1, move_from := -11, sg[ 1] := sg[-11]
      round  4:  i :=  2, move_from := -10, sg[ 2] := sg[-10]
      [...]
      round 13:  i := 11, move_from :=  -1, sg[11] := sg[ -1]
      round 14:  i := 12, move_from :=   0, sg[12] := sg[  0]
      round 15:  i := 13, move_from :=   1, sg[13] := sg[  1]
      round 16:  i := 14, move_from :=   2, sg[14] := sg[  2]
      round 17:  i := 15, move_from :=   3, sg[15] := sg[  3]
      [...]
      
      This means we will loop forever and never hit the msg->sg_end condition
      to break out of the loop. When we see that the ring wraps around, the
      shift should instead be MAX_SKB_FRAGS - first_sg + last_sg - 1, i.e.
      the remaining slots from the tail of the ring plus those from the head
      up to last_sg, combined.
      
      Fixes: 015632bb ("bpf: sk_msg program helper bpf_sk_msg_pull_data")
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: John Fastabend <john.fastabend@gmail.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      2e43f95d
    • bpf: fix msg->data/data_end after sg shift repair in bpf_msg_pull_data · 0e06b227
      Committed by Daniel Borkmann
      In the current code, msg->data is set as sg_virt(&sg[i]) + start - offset
      and msg->data_end relative to it as msg->data + bytes. Using iterator i
      to point to the updated starting scatterlist element holds true for some
      cases, however not for all, where we'd end up pointing out of bounds. It
      is /correct/ for this one:
      
      1) When first finding the starting scatterlist element (sge) where we
         find that the page is already privately owned by the msg and where
         the requested bytes and headroom fit into the sge's length.
      
      However, it's /incorrect/ for the following ones:
      
      2) After we made the requested area private and updated the newly allocated
         page into first_sg slot of the scatterlist ring; when we find that no
         shift repair of the ring is needed where we bail out updating msg->data
         and msg->data_end. At that point i will point to last_sg, which in this
         case is the next elem of first_sg in the ring. The sge at that point
         might as well be invalid (e.g. i == msg->sg_end), which we use for
         setting the range of sg_virt(&sg[i]). The correct one would have been
         first_sg.
      
      3) Similar to 2), but when we find that a shift repair of the ring is
         needed. In this case we fix up all sges and stop once we've reached the
         end. Here i will point to the new msg->sg_end, and the sge at that
         point will be invalid. Again the requested range sits in first_sg.
      
      Fixes: 015632bb ("bpf: sk_msg program helper bpf_sk_msg_pull_data")
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: John Fastabend <john.fastabend@gmail.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      0e06b227
  12. 29 August 2018, 1 commit
    • bpf: fix several offset tests in bpf_msg_pull_data · 5b24109b
      Committed by Daniel Borkmann
      While recently going over bpf_msg_pull_data(), I noticed three
      issues which are fixed in here:
      
      1) When we attempt to find the first scatterlist element (sge)
         for the start offset, we add len to the offset before we check
         for start < offset + len, whereas it should come after, when
         we iterate to the next sge to accumulate the offsets (see the
         sketch after this list). For example, a start offset of 12 with
         an sge length of 8 for the first sge in the list would lead us
         to pick this sge as the first one, thinking it covers the first
         16 bytes where start is located, whereas start actually sits in
         a subsequent sge, so we would end up pulling in the wrong data.
      
      2) After figuring out the starting sge, we have a short-cut test
         in !msg->sg_copy[i] && bytes <= len. This checks whether it is
         unnecessary to make the page at the sge private, in which case
         we can just exit by updating msg->data and msg->data_end. However,
         the length test is not fully correct. bytes <= len checks
         whether the requested bytes (end - start offsets) fit into the
         sge's length. The part that is missing is that start need not
         be aligned with the start of the sge. Meaning, the start offset
         into the sge needs to be accounted for as well, on top of the
         requested bytes, as otherwise we can access the sge out of
         bounds. For example, the sge could have a length of 8 and our
         requested bytes a length of 8, but at a start offset of 4, so
         we would also need to pull in 4 bytes of the next sge; when we
         jump to the out label we set msg->data to sg_virt(&sg[i]) +
         start - offset and msg->data_end to msg->data + bytes, which
         would be out of bounds.
      
      3) The subsequent bytes < copy test for finding the last sge has
         the same issue as point 2), but it also tests for less-than
         rather than less-than-or-equal. Meaning, if the sge length is
         8 and the requested bytes are 8, with start aligned to the sge,
         we would unnecessarily go and pull in the next sge as well to
         make it private.
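      For point 1) above, a hedged sketch of the corrected search for the
      starting sge; the loop is simplified and variable names are assumed:

        i = msg->sg_start;
        do {
                len = sg[i].length;
                if (start < offset + len)
                        break;          /* start falls within this sge */
                offset += len;          /* accumulate only after the test */
                i++;
                if (i == MAX_SKB_FRAGS)
                        i = 0;
        } while (i != msg->sg_end);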
      
      Fixes: 015632bb ("bpf: sk_msg program helper bpf_sk_msg_pull_data")
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: John Fastabend <john.fastabend@gmail.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      5b24109b
  13. 28 August 2018, 1 commit
    • bpf: fix build error with clang · 3f6e138d
      Committed by Stefan Agner
      Building the newly introduced BPF_PROG_TYPE_SK_REUSEPORT leads to
      a compile time error when building with clang:
      net/core/filter.o: In function `sk_reuseport_convert_ctx_access':
        ../net/core/filter.c:7284: undefined reference to `__compiletime_assert_7284'
      
      It seems that clang has issues resolving hweight_long at compile
      time. Since SK_FL_PROTO_MASK is a constant, we can use the interface
      for known constant arguments which works fine with clang.
      
      Fixes: 2dbb9b9e ("bpf: Introduce BPF_PROG_TYPE_SK_REUSEPORT")
      Signed-off-by: Stefan Agner <stefan@agner.ch>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      3f6e138d
  14. 18 August 2018, 1 commit
    • bpf: fix redirect to map under tail calls · f6069b9a
      Committed by Daniel Borkmann
      Commits 109980b8 ("bpf: don't select potentially stale ri->map
      from buggy xdp progs") and 7c300131 ("bpf: fix ri->map_owner
      pointer on bpf_prog_realloc") tried to ensure that buggy programs
      using the bpf_redirect_map() helper call do not leave stale maps behind.
      The idea was to add a map_owner cookie into the per-CPU struct
      redirect_info, set to prog->aux by the prog making the helper call as
      proof that the map is not stale, since the prog is implicitly holding
      a reference to it. This owner cookie could later be compared with the
      program calling into BPF to check whether they match, and the redirect
      could then safely proceed with processing the map.
      
      In (obvious) hindsight, this approach breaks down when tail calls are
      involved since the original caller's prog->aux pointer does not have
      to match the one from one of the progs out of the tail call chain,
      and therefore the xdp buffer will be dropped instead of redirected.
      A way around that is to fix the issue differently (which also allows
      removing related work from the fast path at the same time): once the
      lifetime of a redirect map has come to its end, we use its map free
      callback, where we need to wait on synchronize_rcu() for currently
      outstanding xdp buffers and remove such a map pointer from the
      redirect info if it is found to be present. At that time no program is
      using this map anymore, so we simply invalidate the map pointers to
      NULL iff they previously pointed to that instance, while making sure
      that the redirect path only reads out the map once.
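      A hedged sketch of the invalidation run from the map free path (the
      per-CPU iteration details are assumed):

        static void bpf_clear_redirect_map(struct bpf_map *map)
        {
                int cpu;

                for_each_possible_cpu(cpu) {
                        struct redirect_info *ri = per_cpu_ptr(&redirect_info, cpu);

                        /* only clear it if it still points at this instance */
                        if (ri->map == map)
                                ri->map = NULL;
                }
        }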
      
      Fixes: 97f91a7c ("bpf: add bpf_redirect_map helper routine")
      Fixes: 109980b8 ("bpf: don't select potentially stale ri->map from buggy xdp progs")
      Reported-by: Sebastiano Miano <sebastiano.miano@polito.it>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: John Fastabend <john.fastabend@gmail.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      f6069b9a
  15. 15 August 2018, 1 commit
  16. 13 August 2018, 1 commit
    • bpf: Introduce bpf_skb_ancestor_cgroup_id helper · 77236281
      Committed by Andrey Ignatov
      == Problem description ==
      
      It's useful to be able to identify the cgroup associated with an skb in
      TC so that a policy can be applied to this skb, and the existing
      bpf_skb_cgroup_id helper can help with this.
      
      In real life, though, the cgroup hierarchy and the hierarchy a policy
      should be applied to don't map 1:1.
      
      It's often the case that there is a container and a corresponding cgroup,
      but there are many more sub-cgroups inside the container, e.g. because it
      is delegated to the containerized application to control resources for
      its subsystems, or to separate the application inside the container from
      infra that belongs to the containerization system (e.g. sshd).
      
      At the same time it may be useful to apply a policy to the container as a
      whole.
      
      If multiple containers like this are run on a host (as is often the
      case) and many of them have sub-cgroups, it may not be possible to apply
      a per-container policy in TC with existing helpers such as
      bpf_skb_under_cgroup or bpf_skb_cgroup_id:
      
      * bpf_skb_cgroup_id will return the id of the immediate cgroup associated
        with the skb, i.e. if it's a sub-cgroup inside the container, it can't
        be used to identify the container's cgroup;
      
      * bpf_skb_under_cgroup can work only with one cgroup and doesn't scale,
        i.e. if there are N containers on a host and a policy has to be
        applied to M of them (0 <= M <= N), it'd require M calls to
        bpf_skb_under_cgroup, and, if M changes, it'd require rebuilding and
        loading a new BPF program.
      
      == Solution ==
      
      The patch introduces a new helper, bpf_skb_ancestor_cgroup_id, that can
      be used to get the id of the cgroup v2 that is an ancestor of the cgroup
      associated with the skb, at a specified level of the cgroup hierarchy.
      
      That way an admin can place all containers on one level of the cgroup
      hierarchy (which is good practice in general and already used in many
      configurations) and identify the specific cgroup on this level no matter
      which sub-cgroup the skb is associated with.
      
      E.g. if there is a cgroup hierarchy:
        root/
        root/container1/
        root/container1/app11/
        root/container1/app11/sub-app-a/
        root/container1/app12/
        root/container2/
        root/container2/app21/
        root/container2/app22/
        root/container2/app22/sub-app-b/
      
      , then, for an skb associated with root/container1/app11/sub-app-a/, it's
      possible to get the ancestor at level 1, which is container1, and apply
      the policy for this container, or apply another policy if it's container2.
      
      Policies can be kept e.g. in a hash map where the key is a container
      cgroup id and the value is an action.
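      A minimal sketch of such a TC program; it is written against current
      libbpf conventions rather than those of 2018, and the map name and
      action encoding are assumptions for illustration:

        #include <linux/bpf.h>
        #include <linux/pkt_cls.h>
        #include <bpf/bpf_helpers.h>

        struct {
                __uint(type, BPF_MAP_TYPE_HASH);
                __uint(max_entries, 128);
                __type(key, __u64);     /* container cgroup id */
                __type(value, __u32);   /* assumed encoding: 0 = pass, 1 = drop */
        } policy_map SEC(".maps");

        SEC("tc")
        int container_policy(struct __sk_buff *skb)
        {
                __u64 cg = bpf_skb_ancestor_cgroup_id(skb, 1); /* level 1 = containerN */
                __u32 *act = bpf_map_lookup_elem(&policy_map, &cg);

                if (act && *act == 1)
                        return TC_ACT_SHOT;
                return TC_ACT_OK;
        }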
      
      The levels where container cgroups are created are usually known in
      advance, whereas the cgroup hierarchy inside a container may be hard to
      predict, especially when its creation is delegated to the containerized
      application.
      
      == Implementation details ==
      
      The helper gets the ancestor by walking parents up to the specified level.
      
      Another option would be to get a different kind of "id" from
      cgroup->ancestor_ids[level] and use it with idr_find() to get the struct
      cgroup for the ancestor. But that would require a radix lookup, which
      doesn't seem to be better (at least it's not obviously better).
      
      The format of the new helper's return value is the same as that of
      bpf_skb_cgroup_id.
      Signed-off-by: Andrey Ignatov <rdna@fb.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      77236281
  17. 11 August 2018, 2 commits
    • bpf: Enable BPF_PROG_TYPE_SK_REUSEPORT bpf prog in reuseport selection · 8217ca65
      Committed by Martin KaFai Lau
      This patch allows a BPF_PROG_TYPE_SK_REUSEPORT bpf prog to select a
      SO_REUSEPORT sk from a BPF_MAP_TYPE_REUSEPORT_ARRAY introduced in
      the earlier patch.  "bpf_run_sk_reuseport()" will return -ECONNREFUSED
      when the BPF_PROG_TYPE_SK_REUSEPORT prog returns SK_DROP.
      The callers, in inet[6]_hashtable.c and ipv[46]/udp.c, are modified to
      handle this case and return NULL immediately instead of continuing the
      sk search from its hashtable.
      
      It re-uses the existing SO_ATTACH_REUSEPORT_EBPF setsockopt to attach
      BPF_PROG_TYPE_SK_REUSEPORT.  The "sk_reuseport_attach_bpf()" will check
      if the attaching bpf prog is in the new SK_REUSEPORT or the existing
      SOCKET_FILTER type and then check different things accordingly.
      
      One level of "__reuseport_attach_prog()" call is removed.  The
      "sk_unhashed() && ..." and "sk->sk_reuseport_cb" tests are pushed
      back to "reuseport_attach_prog()" in sock_reuseport.c.  sock_reuseport.c
      seems to have more knowledge on those test requirements than filter.c.
      In "reuseport_attach_prog()", after new_prog is attached to reuse->prog,
      the old_prog (if any) is also directly freed instead of returning the
      old_prog to the caller and asking the caller to free.
      
      The sysctl_optmem_max check is moved back to the
      "sk_reuseport_attach_filter()" and "sk_reuseport_attach_bpf()".
      As with other bpf prog types, the new BPF_PROG_TYPE_SK_REUSEPORT is only
      bounded by the usual "bpf_prog_charge_memlock()" at load time,
      instead of being bounded by both bpf_prog_charge_memlock and sysctl_optmem_max.
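      A minimal user-space sketch of the attach path reusing the existing
      setsockopt, as described above; error handling is omitted:

        int one = 1;

        /* prog_fd: an already loaded BPF_PROG_TYPE_SK_REUSEPORT program */
        setsockopt(fd, SOL_SOCKET, SO_REUSEPORT, &one, sizeof(one));
        setsockopt(fd, SOL_SOCKET, SO_ATTACH_REUSEPORT_EBPF,
                   &prog_fd, sizeof(prog_fd));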
      Signed-off-by: Martin KaFai Lau <kafai@fb.com>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      8217ca65
    • bpf: Introduce BPF_PROG_TYPE_SK_REUSEPORT · 2dbb9b9e
      Committed by Martin KaFai Lau
      This patch adds a BPF_PROG_TYPE_SK_REUSEPORT which can select
      a SO_REUSEPORT sk from a BPF_MAP_TYPE_REUSEPORT_ARRAY.  Like other
      non-SK_FILTER/CGROUP_SKB program types, it requires CAP_SYS_ADMIN.
      
      BPF_PROG_TYPE_SK_REUSEPORT introduces "struct sk_reuseport_kern"
      to store the bpf context instead of using the skb->cb[48].
      
      At SO_REUSEPORT sk lookup time, we are in the middle of transiting
      from a lower layer (ipv4/ipv6) to an upper layer (udp/tcp).  At this
      point, it is not always clear where the bpf context can be appended
      in the skb->cb[48] to avoid saving-and-restoring cb[].  Even putting
      aside the differences between ipv4-vs-ipv6 and udp-vs-tcp, it is not
      clear whether the lower layer will only ever be ipv4 and ipv6 in the
      future, nor whether it will avoid touching the cb[] again before
      transiting to the upper layer.
      
      For example, udp_gro_receive() uses the 48-byte NAPI_GRO_CB
      instead of IP[6]CB and it may still modify the cb[] after calling
      udp[46]_lib_lookup_skb().  For the above reason, if
      skb->cb is used for the bpf ctx, saving-and-restoring is needed,
      and likely the whole 48-byte cb[] has to be saved and restored.
      
      Instead of saving, setting and restoring the cb[], this patch opts
      to create a new "struct sk_reuseport_kern" and set the needed
      values in there.
      
      The new BPF_PROG_TYPE_SK_REUSEPORT and "struct sk_reuseport_(kern|md)"
      will serve all ipv4/ipv6 + udp/tcp combinations.  There is no
      protocol-specific usage at this point, and it is also in line with the
      current sock_reuseport.c implementation (i.e. no protocol-specific
      requirement).
      
      In "struct sk_reuseport_md", this patch exposes data/data_end/len
      with semantic similar to other existing usages.  Together
      with "bpf_skb_load_bytes()" and "bpf_skb_load_bytes_relative()",
      the bpf prog can peek anywhere in the skb.  The "bind_inany" tells
      the bpf prog that the reuseport group is bind-ed to a local
      INANY address which cannot be learned from skb.
      
      The new "bind_inany" is added to "struct sock_reuseport" which will be
      used when running the new "BPF_PROG_TYPE_SK_REUSEPORT" bpf prog in order
      to avoid repeating the "bind INANY" test on
      "sk_v6_rcv_saddr/sk->sk_rcv_saddr" every time a bpf prog is run.  It can
      only be properly initialized when a "sk->sk_reuseport" enabled sk is
      added to a hashtable (i.e. during "reuseport_alloc()" and
      "reuseport_add_sock()").
      
      The new "sk_select_reuseport()" is the main helper that the
      bpf prog will use to select a SO_REUSEPORT sk.  It is the only function
      that can use the new BPF_MAP_TYPE_REUSEPORT_ARRAY.  As mentioned in
      the earlier patch, the validity of a selected sk is checked at
      run time in "sk_select_reuseport()".  Doing the check at
      verification time is difficult and inflexible (consider the map-in-map
      use case).  The runtime check is to compare the selected sk's reuseport_id
      with the reuseport_id that we want.  This helper will return -EXXX if the
      selected sk cannot serve the incoming request (e.g. the reuseport_id
      does not match).  The bpf prog can decide if it wants to do SK_DROP at
      its discretion.
      
      When the bpf prog returns SK_PASS, the kernel will check if a
      valid sk has been selected (i.e. "reuse_kern->selected_sk != NULL").
      If it has, it will use the selected sk.  If not, the kernel
      will select one from "reuse->socks[]" (as before this patch).
      
      The SK_DROP and SK_PASS handling logic will be in the next patch.
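      A rough sketch of such a program; the reuseport_array map (declared
      elsewhere) and the key computation are illustrative assumptions, while
      the helper and the SK_PASS/SK_DROP codes come from this series:

        SEC("sk_reuseport")
        int select_sk(struct sk_reuseport_md *reuse_md)
        {
                __u32 index = 0;        /* e.g. computed from reuse_md->data */

                /* pick a socket out of the BPF_MAP_TYPE_REUSEPORT_ARRAY */
                if (bpf_sk_select_reuseport(reuse_md, &reuseport_array, &index, 0))
                        return SK_DROP; /* empty slot or reuseport_id mismatch */
                return SK_PASS;
        }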
      Signed-off-by: Martin KaFai Lau <kafai@fb.com>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      2dbb9b9e
  18. 10 August 2018, 1 commit
  19. 03 August 2018, 1 commit
  20. 31 July 2018, 2 commits
    • bpf: Support bpf_get_socket_cookie in more prog types · d692f113
      Committed by Andrey Ignatov
      The bpf_get_socket_cookie() helper can be used to identify skbs that
      correspond to the same socket.
      
      The socket cookie can, however, be useful in many other use-cases where a
      socket is available in the program context. Specifically,
      BPF_PROG_TYPE_CGROUP_SOCK_ADDR and BPF_PROG_TYPE_SOCK_OPS programs can
      benefit from it, so that one of them can augment a value in a map prepared
      earlier by another program for the same socket.
      
      The patch adds support to call bpf_get_socket_cookie() from
      BPF_PROG_TYPE_CGROUP_SOCK_ADDR and BPF_PROG_TYPE_SOCK_OPS.
      
      It doesn't introduce new helpers. Instead it reuses the same helper name,
      bpf_get_socket_cookie(), but adds support for this helper to accept
      `struct bpf_sock_addr` and `struct bpf_sock_ops`.
      
      Documentation in bpf.h is changed in a way that should not break
      automatic generation of markdown.
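      A small sketch of the cross-program sharing described above; the
      per_sock_state map (declared elsewhere) is an assumption:

        /* BPF_PROG_TYPE_SOCK_OPS side: key per-socket state by cookie */
        SEC("sockops")
        int track_sock(struct bpf_sock_ops *ops)
        {
                __u64 cookie = bpf_get_socket_cookie(ops);
                __u64 zero = 0;

                bpf_map_update_elem(&per_sock_state, &cookie, &zero, BPF_ANY);
                return 1;
        }

        /* a BPF_PROG_TYPE_CGROUP_SOCK_ADDR program can later call
         * bpf_get_socket_cookie(ctx) on its bpf_sock_addr context and look
         * up the same entry */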
      Signed-off-by: Andrey Ignatov <rdna@fb.com>
      Acked-by: Yonghong Song <yhs@fb.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      d692f113
    • bpf: add End.DT6 action to bpf_lwt_seg6_action helper · 486cdf21
      Committed by Mathieu Xhonneux
      The seg6local LWT provides the End.DT6 action, which allows
      decapsulating an outer IPv6 header containing a Segment Routing Header
      (SRH); the full specification is available here:
      
      https://tools.ietf.org/html/draft-filsfils-spring-srv6-network-programming-05
      
      This patch now adds this action to the seg6local BPF
      interface. Since it is not mandatory that the inner IPv6 header also
      contains an SRH, seg6_bpf_srh_state has been extended with a pointer to
      a possible SRH of the outermost IPv6 header. This helps assess whether
      validation must be triggered or not, and avoids some calls to
      ipv6_find_hdr.
      
      v3: s/1/true, s/0/false for boolean values
      v2: - changed true/false -> 1/0
          - preempt_enable no longer called in first conditional block
      Signed-off-by: Mathieu Xhonneux <m.xhonneux@gmail.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      486cdf21
  21. 29 July 2018, 1 commit
  22. 12 July 2018, 1 commit
  23. 10 July 2018, 1 commit
  24. 08 July 2018, 3 commits
    • xdp: XDP_REDIRECT should check IFF_UP and MTU · d8d7218a
      Committed by Toshiaki Makita
      Otherwise we end up attempting to send packets from down devices
      or to send oversized packets, which may cause unexpected driver/device
      behaviour. Generic XDP already does this check, so reuse the logic
      in native XDP.
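      A hedged sketch of the shared sanity check (function name and the exact
      length accounting are assumed):

        static inline int xdp_ok_fwd_dev(const struct net_device *fwd,
                                         unsigned int pktlen)
        {
                if (unlikely(!(fwd->flags & IFF_UP)))
                        return -ENETDOWN;       /* target device is down */

                if (unlikely(pktlen > fwd->mtu + fwd->hard_header_len))
                        return -EMSGSIZE;       /* frame would exceed the MTU */

                return 0;
        }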
      
      Fixes: 814abfab ("xdp: add bpf_redirect helper function")
      Signed-off-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      d8d7218a
    • bpf: sockmap, convert bpf_compute_data_pointers to bpf_*_sk_skb · 0ea488ff
      Committed by John Fastabend
      In commit
      
        'bpf: bpf_compute_data uses incorrect cb structure' (8108a775)
      
      we added the routine bpf_compute_data_end_sk_skb() to compute the
      correct data_end values, but this has since been lost. In kernel
      v4.14 this was correct and the above patch was applied in its
      entirety. Then, when v4.14 was merged into the v4.15-rc1 net-next tree,
      we lost the piece that renamed bpf_compute_data_pointers to the
      new function bpf_compute_data_end_sk_skb. This was done here,
      
      e1ea2f98 ("Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net")
      
      When it conflicted with the following rename patch,
      
      6aaae2b6 ("bpf: rename bpf_compute_data_end into bpf_compute_data_pointers")
      
      Finally, after a refactor I thought even the function
      bpf_compute_data_end_sk_skb() was no longer needed and it was
      erroneously removed.
      
      However, we never reverted the sk_skb_convert_ctx_access() usage of
      tcp_skb_cb, which had been committed and survived the merge conflict.
      Here we fix this by adding back the helper and the *_data_end_sk_skb()
      usage. Using the bpf_skb_data_end mapping is not correct because it
      expects a qdisc_skb_cb object, but at the sock layer this is not the
      case. Even though it happens to work here, because we don't overwrite
      any in-use data at the socket layer and the cb structure is cleared
      later, this has the potential to create subtle issues. But even
      more concretely, the filter.c access check uses tcp_skb_cb.
      
      And by some act of chance though,
      
      struct bpf_skb_data_end {
              struct qdisc_skb_cb        qdisc_cb;             /*     0    28 */
      
              /* XXX 4 bytes hole, try to pack */
      
              void *                     data_meta;            /*    32     8 */
              void *                     data_end;             /*    40     8 */
      
              /* size: 48, cachelines: 1, members: 3 */
              /* sum members: 44, holes: 1, sum holes: 4 */
              /* last cacheline: 48 bytes */
      };
      
      and then tcp_skb_cb,
      
      struct tcp_skb_cb {
      	[...]
                      struct {
                              __u32      flags;                /*    24     4 */
                              struct sock * sk_redir;          /*    32     8 */
                              void *     data_end;             /*    40     8 */
                      } bpf;                                   /*          24 */
              };
      
      So when we use offsetof() to track down the byte offset we get 40 in
      either case and everything continues to work. Fix this mess and use the
      correct structures; it's unclear how long this coincidence would keep
      working until someone moves the structs around.
      Reported-by: Martin KaFai Lau <kafai@fb.com>
      Fixes: e1ea2f98 ("Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net")
      Fixes: 6aaae2b6 ("bpf: rename bpf_compute_data_end into bpf_compute_data_pointers")
      Signed-off-by: John Fastabend <john.fastabend@gmail.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      0ea488ff
    • bpf: fix sk_skb programs without skb->dev assigned · 0c6bc6e5
      Committed by John Fastabend
      Multiple BPF helpers in use by sk_skb programs calculate the max
      skb length using the __bpf_skb_max_len function. However, this
      calculates the max length using the skb->dev pointer which can be
      NULL when an sk_skb program is paired with an sk_msg program.
      
      To force this, an sk_msg program needs to redirect into the ingress
      path of a sock with an attached sk_skb program. Then the sk_skb
      program would need to call one of the helpers that adjust the skb
      size.
      
      To fix the NULL pointer dereference, use the SKB_MAX_ALLOC size if no
      dev is available.
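      A sketch of the fallback, close to but not necessarily identical to the
      in-tree helper:

        static u32 __bpf_skb_max_len(const struct sk_buff *skb)
        {
                /* skb->dev can be NULL when an sk_msg program redirected into
                 * the ingress path of a socket with an sk_skb program attached */
                return skb->dev ? skb->dev->mtu + skb->dev->hard_header_len
                                : SKB_MAX_ALLOC;
        }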
      
      Fixes: 8934ce2f ("bpf: sockmap redirect ingress support")
      Signed-off-by: John Fastabend <john.fastabend@gmail.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      0c6bc6e5
  25. 05 July 2018, 1 commit
  26. 29 June 2018, 2 commits
  27. 16 June 2018, 1 commit
  28. 04 June 2018, 1 commit
  29. 03 June 2018, 2 commits