  1. Aug 11, 2018 (1 commit)
    • bpf: Introduce BPF_PROG_TYPE_SK_REUSEPORT · 2dbb9b9e
      Authored by Martin KaFai Lau
      This patch adds BPF_PROG_TYPE_SK_REUSEPORT, which can select
      a SO_REUSEPORT sk from a BPF_MAP_TYPE_REUSEPORT_ARRAY.  Like other
      non-SK_FILTER/CGROUP_SKB program types, it requires CAP_SYS_ADMIN.
      
      BPF_PROG_TYPE_SK_REUSEPORT introduces "struct sk_reuseport_kern"
      to store the bpf context instead of using the skb->cb[48].
      
      At SO_REUSEPORT sk lookup time, the skb is in the middle of transiting
      from a lower layer (ipv4/ipv6) to an upper layer (udp/tcp).  At this
      point, it is not always clear where the bpf context could be appended
      in the skb->cb[48] to avoid saving-and-restoring cb[].  Even putting
      aside the differences between ipv4 vs. ipv6 and udp vs. tcp, it is not
      clear whether the lower layer will only ever be ipv4 and ipv6 in the
      future, nor whether it will refrain from touching the cb[] again
      before transiting to the upper layer.
      
      For example, udp_gro_receive() uses the 48-byte NAPI_GRO_CB
      instead of IP[6]CB, and it may still modify the cb[] after calling
      udp[46]_lib_lookup_skb().  For these reasons, if skb->cb were used
      for the bpf ctx, saving-and-restoring would be needed, and likely
      the whole 48-byte cb[] would have to be saved and restored.
      
      Instead of saving, setting and restoring the cb[], this patch opts
      to create a new "struct sk_reuseport_kern" and set the needed
      values there.
      
      The new BPF_PROG_TYPE_SK_REUSEPORT and "struct sk_reuseport_(kern|md)"
      will serve all ipv4/ipv6 + udp/tcp combinations.  There is no
      protocol-specific usage at this point, which is also in line with the
      current sock_reuseport.c implementation (i.e. no protocol-specific
      requirement).
      
      In "struct sk_reuseport_md", this patch exposes data/data_end/len
      with semantic similar to other existing usages.  Together
      with "bpf_skb_load_bytes()" and "bpf_skb_load_bytes_relative()",
      the bpf prog can peek anywhere in the skb.  The "bind_inany" tells
      the bpf prog that the reuseport group is bind-ed to a local
      INANY address which cannot be learned from skb.
      
      The new "bind_inany" is added to "struct sock_reuseport" which will be
      used when running the new "BPF_PROG_TYPE_SK_REUSEPORT" bpf prog in order
      to avoid repeating the "bind INANY" test on
      "sk_v6_rcv_saddr/sk->sk_rcv_saddr" every time a bpf prog is run.  It can
      only be properly initialized when a "sk->sk_reuseport" enabled sk is
      adding to a hashtable (i.e. during "reuseport_alloc()" and
      "reuseport_add_sock()").
      
      The new "sk_select_reuseport()" is the main helper that the
      bpf prog will use to select a SO_REUSEPORT sk.  It is the only function
      that can use the new BPF_MAP_TYPE_REUSEPORT_ARRAY.  As mentioned in
      the earlier patch, the validity of a selected sk is checked in
      run time in "sk_select_reuseport()".  Doing the check in
      verification time is difficult and inflexible (consider the map-in-map
      use case).  The runtime check is to compare the selected sk's reuseport_id
      with the reuseport_id that we want.  This helper will return -EXXX if the
      selected sk cannot serve the incoming request (e.g. reuseport_id
      not match).  The bpf prog can decide if it wants to do SK_DROP as its
      discretion.
      
      When the bpf prog returns SK_PASS, the kernel will check if a
      valid sk has been selected (i.e. "reuse_kern->selected_sk != NULL").
      If it has, it will use the selected sk.  If not, the kernel
      will select one from "reuse->socks[]" (as before this patch).
      
      The SK_DROP and SK_PASS handling logic will be in the next patch.
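
      A minimal sketch of what such a prog could look like.  This is
      illustrative only: the map/section/program names are invented here,
      the include style follows the kernel selftests of this era, and the
      map type is spelled BPF_MAP_TYPE_REUSEPORT_SOCKARRAY in the final
      uapi:

      #include <linux/bpf.h>
      #include "bpf_helpers.h"

      /* Array of reuseport sockets, populated from user space. */
      struct bpf_map_def SEC("maps") reuseport_array = {
              .type        = BPF_MAP_TYPE_REUSEPORT_SOCKARRAY,
              .key_size    = sizeof(__u32),
              .value_size  = sizeof(__u32),
              .max_entries = 16,
      };

      SEC("sk_reuseport")
      int select_by_payload(struct sk_reuseport_md *reuse_md)
      {
              void *data = reuse_md->data;
              void *data_end = reuse_md->data_end;
              __u32 key;

              /* Peek at the first payload byte to pick a bucket. */
              if (data + 1 > data_end)
                      return SK_DROP;
              key = *(__u8 *)data % 16;

              /* The kernel validates the selected sk at run time
               * (reuseport_id check) and returns an error if that sk
               * cannot serve this request. */
              if (bpf_sk_select_reuseport(reuse_md, &reuseport_array,
                                          &key, 0))
                      return SK_DROP;

              return SK_PASS;
      }
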
      Signed-off-by: Martin KaFai Lau <kafai@fb.com>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
  2. Aug 10, 2018 (1 commit)
  3. Aug 03, 2018 (1 commit)
  4. Jul 31, 2018 (2 commits)
    • bpf: Support bpf_get_socket_cookie in more prog types · d692f113
      Authored by Andrey Ignatov
      The bpf_get_socket_cookie() helper can be used to identify skbs that
      correspond to the same socket.
      
      The socket cookie can, however, be useful in many other use-cases
      where a socket is available in the program context.  Specifically,
      BPF_PROG_TYPE_CGROUP_SOCK_ADDR and BPF_PROG_TYPE_SOCK_OPS programs
      can benefit from it, so that one of them can augment a value in a map
      prepared earlier by the other program for the same socket.
      
      The patch adds support to call bpf_get_socket_cookie() from
      BPF_PROG_TYPE_CGROUP_SOCK_ADDR and BPF_PROG_TYPE_SOCK_OPS.
      
      It doesn't introduce new helpers.  Instead, it reuses the same helper
      name bpf_get_socket_cookie(), but extends this helper to accept
      `struct bpf_sock_addr` and `struct bpf_sock_ops`.
      
      Documentation in bpf.h is changed in a way that should not break
      automatic generation of markdown.
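
      A hedged sketch of that pattern (map layout, section names, and
      program logic are illustrative, not taken from this patch):

      #include <linux/bpf.h>
      #include "bpf_helpers.h"

      /* Per-socket values keyed by socket cookie. */
      struct bpf_map_def SEC("maps") sk_vals = {
              .type        = BPF_MAP_TYPE_HASH,
              .key_size    = sizeof(__u64),
              .value_size  = sizeof(__u64),
              .max_entries = 1024,
      };

      SEC("cgroup/connect4")
      int prepare(struct bpf_sock_addr *ctx)
      {
              __u64 cookie = bpf_get_socket_cookie(ctx);
              __u64 init = 0;

              bpf_map_update_elem(&sk_vals, &cookie, &init, BPF_ANY);
              return 1; /* allow the connect */
      }

      SEC("sockops")
      int augment(struct bpf_sock_ops *skops)
      {
              __u64 cookie = bpf_get_socket_cookie(skops);
              __u64 *val = bpf_map_lookup_elem(&sk_vals, &cookie);

              /* Augment the value prepared by the sock_addr prog. */
              if (val)
                      __sync_fetch_and_add(val, 1);
              return 0;
      }
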
      Signed-off-by: Andrey Ignatov <rdna@fb.com>
      Acked-by: Yonghong Song <yhs@fb.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
    • bpf: add End.DT6 action to bpf_lwt_seg6_action helper · 486cdf21
      Authored by Mathieu Xhonneux
      The seg6local LWT provides the End.DT6 action, which allows
      decapsulating an outer IPv6 header containing a Segment Routing
      Header (SRH); the full specification is available here:
      
      https://tools.ietf.org/html/draft-filsfils-spring-srv6-network-programming-05
      
      This patch now adds this action to the seg6local BPF
      interface.  Since the inner IPv6 header does not necessarily contain
      an SRH, seg6_bpf_srh_state has been extended with a pointer to a
      possible SRH of the outermost IPv6 header.  This helps assess whether
      the validation must be triggered or not, and avoids some calls to
      ipv6_find_hdr.
      
      v3: s/1/true, s/0/false for boolean values
      v2: - changed true/false -> 1/0
          - preempt_enable no longer called in first conditional block
      Signed-off-by: Mathieu Xhonneux <m.xhonneux@gmail.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
  5. Jul 29, 2018 (1 commit)
  6. Jul 12, 2018 (1 commit)
  7. Jul 10, 2018 (1 commit)
  8. Jul 08, 2018 (3 commits)
    • xdp: XDP_REDIRECT should check IFF_UP and MTU · d8d7218a
      Authored by Toshiaki Makita
      Otherwise we end up attempting to send packets from devices that are
      down, or to send oversized packets, which may cause unexpected
      driver/device behaviour.  Generic XDP already does this check, so
      reuse the logic in native XDP.
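
      A hedged, kernel-side sketch of the kind of check being shared here
      (function name and exact bound are illustrative, not the code from
      the patch):

      #include <linux/netdevice.h>

      /* Reject forwarding to a device that is down or whose MTU is too
       * small for the frame. */
      static inline bool xdp_fwd_dev_ok(const struct net_device *fwd,
                                        unsigned int pktlen)
      {
              if (unlikely(!(fwd->flags & IFF_UP)))
                      return false;
              if (unlikely(pktlen > fwd->mtu + fwd->hard_header_len))
                      return false;
              return true;
      }
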
      
      Fixes: 814abfab ("xdp: add bpf_redirect helper function")
      Signed-off-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • bpf: sockmap, convert bpf_compute_data_pointers to bpf_*_sk_skb · 0ea488ff
      Authored by John Fastabend
      In commit

        'bpf: bpf_compute_data uses incorrect cb structure' (8108a775)

      we added the routine bpf_compute_data_end_sk_skb() to compute the
      correct data_end values, but this has since been lost.  In kernel
      v4.14 this was correct and the above patch was applied in its
      entirety.  Then, when v4.14 was merged into the v4.15-rc1 net-next
      tree, we lost the piece that renamed bpf_compute_data_pointers to the
      new function bpf_compute_data_end_sk_skb.  This was done here,
      
      e1ea2f98 ("Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net")
      
      when it conflicted with the following rename patch,
      
      6aaae2b6 ("bpf: rename bpf_compute_data_end into bpf_compute_data_pointers")
      
      Finally, after a refactor, I thought even the function
      bpf_compute_data_end_sk_skb() was no longer needed, and it was
      erroneously removed.
      
      However, we never reverted the sk_skb_convert_ctx_access() usage of
      tcp_skb_cb, which had been committed and survived the merge conflict.
      Here we fix this by adding back the helper and the *_data_end_sk_skb()
      usage.  Using the bpf_skc_data_end mapping is not correct, because it
      expects a qdisc_skb_cb object, which is not what we have at the sock
      layer.  Even though it happens to work here, because we don't
      overwrite any data in use at the socket layer and the cb structure is
      cleared later, it has the potential to create subtle issues.  But
      even more concretely, the filter.c access check uses tcp_skb_cb.
      
      And, by some act of chance,
      
      struct bpf_skb_data_end {
              struct qdisc_skb_cb        qdisc_cb;             /*     0    28 */
      
              /* XXX 4 bytes hole, try to pack */
      
              void *                     data_meta;            /*    32     8 */
              void *                     data_end;             /*    40     8 */
      
              /* size: 48, cachelines: 1, members: 3 */
              /* sum members: 44, holes: 1, sum holes: 4 */
              /* last cacheline: 48 bytes */
      };
      
      and then tcp_skb_cb,
      
      struct tcp_skb_cb {
      	[...]
                      struct {
                              __u32      flags;                /*    24     4 */
                              struct sock * sk_redir;          /*    32     8 */
                              void *     data_end;             /*    40     8 */
                      } bpf;                                   /*          24 */
              };
      
      So when we use offsetof() to track down the byte offset, we get 40 in
      either case, and everything continues to work.  Fix this mess and use
      the correct structures; it's unclear how long the accidental layout
      match would have kept working before someone moved the structs
      around.
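
      A hedged user-space illustration of that coincidence (struct bodies
      reduced to the layouts shown above; these are stand-ins, not the
      kernel definitions):

      #include <stdio.h>
      #include <stddef.h>

      /* Mimics bpf_skb_data_end: 28-byte qdisc cb, then a hole up to the
       * next 8-byte boundary. */
      struct fake_skb_data_end {
              char qdisc_cb[28];
              void *data_meta;        /* offset 32 */
              void *data_end;         /* offset 40 */
      };

      /* Mimics the bpf member of tcp_skb_cb starting at offset 24. */
      struct fake_tcp_skb_cb {
              char header[24];
              struct {
                      unsigned int flags;
                      void *sk_redir; /* offset 32 */
                      void *data_end; /* offset 40 */
              } bpf;
      };

      int main(void)
      {
              /* Both print 40 on x86-64, which is why the wrong cb
               * mapping appeared to work. */
              printf("%zu %zu\n",
                     offsetof(struct fake_skb_data_end, data_end),
                     offsetof(struct fake_tcp_skb_cb, bpf.data_end));
              return 0;
      }
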
      Reported-by: Martin KaFai Lau <kafai@fb.com>
      Fixes: e1ea2f98 ("Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net")
      Fixes: 6aaae2b6 ("bpf: rename bpf_compute_data_end into bpf_compute_data_pointers")
      Signed-off-by: John Fastabend <john.fastabend@gmail.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • bpf: fix sk_skb programs without skb->dev assigned · 0c6bc6e5
      Authored by John Fastabend
      Multiple BPF helpers in use by sk_skb programs calculate the max
      skb length using the __bpf_skb_max_len function.  However, this
      calculates the max length using the skb->dev pointer, which can be
      NULL when an sk_skb program is paired with an sk_msg program.
      
      To force this, an sk_msg program needs to redirect into the ingress
      path of a sock with an attached sk_skb program.  Then the sk_skb
      program would need to call one of the helpers that adjust the skb
      size.
      
      To fix the NULL pointer dereference, use the SKB_MAX_ALLOC size if no
      dev is available.
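
      A sketch of the resulting fallback; this is close to, though not
      necessarily byte-for-byte identical with, the patched helper:

      /* Max skb length for size-adjusting helpers: derive it from the
       * device when present, otherwise fall back to a safe bound. */
      static u32 __bpf_skb_max_len(const struct sk_buff *skb)
      {
              return skb->dev ? skb->dev->mtu + skb->dev->hard_header_len
                              : SKB_MAX_ALLOC;
      }
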
      
      Fixes: 8934ce2f ("bpf: sockmap redirect ingress support")
      Signed-off-by: John Fastabend <john.fastabend@gmail.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
  9. Jul 05, 2018 (1 commit)
  10. Jun 29, 2018 (2 commits)
  11. Jun 16, 2018 (1 commit)
  12. Jun 04, 2018 (1 commit)
  13. Jun 03, 2018 (4 commits)
  14. May 31, 2018 (1 commit)
  15. May 30, 2018 (2 commits)
  16. May 28, 2018 (1 commit)
    • bpf: Hooks for sys_sendmsg · 1cedee13
      Authored by Andrey Ignatov
      In addition to the already existing BPF hooks for sys_bind and
      sys_connect, the patch provides new hooks for sys_sendmsg.
      
      It leverages the existing BPF program type
      `BPF_PROG_TYPE_CGROUP_SOCK_ADDR`, which provides access to the socket
      itself (properties like family, type, protocol) and the user-passed
      `struct sockaddr *`, so that a BPF program can override the
      destination IP and port for system calls such as sendto(2) or
      sendmsg(2) and/or assign a source IP to the socket.
      
      The hooks are implemented as two new attach types:
      `BPF_CGROUP_UDP4_SENDMSG` and `BPF_CGROUP_UDP6_SENDMSG` for UDPv4 and
      UDPv6 respectively.
      
      UDPv4 and UDPv6 get separate attach types for the same reason as the
      sys_bind and sys_connect hooks, i.e. to prevent reading from /
      writing to e.g. the user_ip6 fields when the user passes a
      sockaddr_in, since that would be out-of-bounds.
      
      The difference from the already existing hooks is that the
      sys_sendmsg ones are implemented only for unconnected UDP.
      
      For TCP it doesn't make sense to change the user-provided `struct
      sockaddr *` at sendto(2)/sendmsg(2) time: the socket either was
      already connected and has source/destination set, or wasn't
      connected, in which case the call to sendto(2)/sendmsg(2) would lead
      to ENOTCONN anyway.
      
      Connected UDP is already handled by the sys_connect hooks, which can
      override source/destination at connect time so that the fast path can
      be used later, i.e. these hooks don't affect the UDP fast path.
      
      Rewriting the source IP is implemented differently than in the
      sys_connect hooks.  When sys_sendmsg is used with unconnected UDP, it
      doesn't work to just bind the socket to the desired local IP address,
      since the source IP can be set on a per-packet basis via ancillary
      data (cmsg(3)).  So, no matter whether the socket is bound or not,
      the source IP has to be rewritten on every call to sys_sendmsg.
      
      To do so, two new fields are added to the UAPI `struct bpf_sock_addr`
      (a hedged usage sketch follows the list):
      * `msg_src_ip4` to set the source IPv4 for UDPv4;
      * `msg_src_ip6` to set the source IPv6 for UDPv6.
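
      A sketch of a sendmsg4 hook using these fields (addresses, ports, and
      names here are illustrative; the endianness helpers come from the
      selftests' bpf_endian.h):

      #include <linux/bpf.h>
      #include "bpf_helpers.h"
      #include "bpf_endian.h"

      SEC("cgroup/sendmsg4")
      int rewrite_sendmsg(struct bpf_sock_addr *ctx)
      {
              /* Redirect 10.0.0.1:53 to 10.0.0.2:5353 and pin the source
               * IP to 10.0.0.10 for this datagram. */
              if (ctx->user_ip4 == bpf_htonl(0x0a000001) &&
                  ctx->user_port == bpf_htons(53)) {
                      ctx->user_ip4 = bpf_htonl(0x0a000002);
                      ctx->user_port = bpf_htons(5353);
                      ctx->msg_src_ip4 = bpf_htonl(0x0a00000a);
              }
              return 1; /* allow */
      }
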
      Signed-off-by: Andrey Ignatov <rdna@fb.com>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Acked-by: Martin KaFai Lau <kafai@fb.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
  17. May 25, 2018 (3 commits)
    • xdp: change ndo_xdp_xmit API to support bulking · 735fc405
      Authored by Jesper Dangaard Brouer
      This patch changes the API for ndo_xdp_xmit to support bulking
      xdp_frames.
      
      When the kernel is compiled with CONFIG_RETPOLINE, XDP sees a huge
      slowdown.  Most of the slowdown is caused by the DMA API's indirect
      function calls, but also by the net_device->ndo_xdp_xmit() call.
      
      Benchmarking the patch with CONFIG_RETPOLINE, using xdp_redirect_map
      in a single flow/core test (CPU E5-1650 v4 @ 3.60GHz), showed
      improved performance:
       for driver ixgbe: 6,042,682 pps -> 6,853,768 pps = +811,086 pps
       for driver i40e : 6,187,169 pps -> 6,724,519 pps = +537,350 pps
      
      With frames available as a bulk inside the driver's ndo_xdp_xmit
      call, further optimizations are possible, like bulk DMA-mapping for
      TX.
      
      Testing without CONFIG_RETPOLINE shows the same performance for
      physical NIC drivers.
      
      The virtual NIC driver tun sees a huge performance boost, as it can
      avoid per-frame producer locking and instead amortize the locking
      cost over the bulk.
      
      V2: Fix compile errors reported by kbuild test robot <lkp@intel.com>
      V4: Isolated ndo, driver changes and callers.
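
      A hedged sketch of the shape of the ndo change (parameter names are
      illustrative; the exact prototype is in the patch itself):

      /* Before: one frame per indirect call. */
      int (*ndo_xdp_xmit)(struct net_device *dev, struct xdp_frame *xdpf);

      /* After: an array of frames per call, so retpoline overhead and
       * driver locking are amortized over the whole bulk. */
      int (*ndo_xdp_xmit)(struct net_device *dev, int n,
                          struct xdp_frame **frames);
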
      Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • xdp: add tracepoint for devmap like cpumap have · 38edddb8
      Authored by Jesper Dangaard Brouer
      Notice how this allows us to get XDP statistics without affecting XDP
      performance, as the tracepoint is no longer activated on a per-packet
      basis.
      
      V5: Spotted by John Fastabend.
       Fix: 'sent' also counted 'drops' in this patch; a later patch
       corrected this, but it was a mistake in this intermediate step.
      Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • bpf: devmap introduce dev_map_enqueue · 67f29e07
      Authored by Jesper Dangaard Brouer
      Functionality is the same, but the ndo_xdp_xmit call is now
      simply invoked from inside the devmap.c code.
      
      V2: Fix compile issue reported by kbuild test robot <lkp@intel.com>
      
      V5: Cleanups requested by Daniel
       - Newlines before func definition
       - Use BUILD_BUG_ON checks
 - Remove unnecessary return value store in dev_map_enqueue
      Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
  18. May 24, 2018 (3 commits)
    • ipv6: sr: Add seg6local action End.BPF · 004d4b27
      Authored by Mathieu Xhonneux
      This patch adds the End.BPF action to the LWT seg6local
      infrastructure.  This action works like any other seg6local End
      action, meaning that an IPv6 header with an SRH is needed, and that
      the packet's DA has to be equal to the SID of the action.  It will
      also advance the SRH to the next segment; the BPF program does not
      have to take care of this.
      
      Since the BPF program must not be a source of instability in the
      kernel, it is important to ensure that the integrity of the packet is
      maintained before yielding it back to the IPv6 layer.  The hook hence
      keeps track of whether the SRH has been altered through the helpers,
      and re-validates its content if needed with seg6_validate_srh.  The
      state kept for validation is stored in a per-CPU buffer.  The BPF
      program is not allowed to directly write into the packet, and only
      some fields of the SRH can be altered through the helper
      bpf_lwt_seg6_store_bytes.
      
      Performance profiling has shown that the SRH re-validation does not
      induce a significant overhead.  If the altered SRH is deemed invalid,
      the packet is dropped.
      
      This validation is also done before executing any action through
      bpf_lwt_seg6_action, and will not be performed again if the SRH is not
      modified after calling the action.
      
      The BPF program may return 3 types of return codes (a hedged sketch
      follows the list):
          - BPF_OK: the End.BPF action will look up the next destination
                    through seg6_lookup_nexthop.
          - BPF_REDIRECT: if an action has been executed through the
                bpf_lwt_seg6_action helper, the BPF program should return
                this value, as the skb's destination is already set and the
                default lookup should not be performed.
          - BPF_DROP: the packet will be dropped.
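
      A minimal sketch of an End.BPF program (the section name follows
      later libbpf conventions; the offset assumes the SRH immediately
      follows the IPv6 header, and the tag value is arbitrary):

      #include <stddef.h>
      #include <linux/bpf.h>
      #include <linux/ipv6.h>
      #include <linux/seg6.h>
      #include "bpf_helpers.h"
      #include "bpf_endian.h"

      SEC("lwt_seg6local")
      int end_bpf(struct __sk_buff *skb)
      {
              /* Offset of the SRH tag field, assuming no other extension
               * header sits between the IPv6 header and the SRH. */
              __u32 off = sizeof(struct ipv6hdr) +
                          offsetof(struct ipv6_sr_hdr, tag);
              /* The tag is a non-sensitive field, so writing it through
               * the helper is allowed; the kernel re-validates the SRH
               * afterwards if needed. */
              __u16 tag = bpf_htons(1234);

              if (bpf_lwt_seg6_store_bytes(skb, off, &tag, sizeof(tag)) < 0)
                      return BPF_DROP;

              /* No action executed: let the kernel look up the next
               * destination via seg6_lookup_nexthop. */
              return BPF_OK;
      }
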
      Signed-off-by: Mathieu Xhonneux <m.xhonneux@gmail.com>
      Acked-by: David Lebrun <dlebrun@google.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
    • bpf: Split lwt inout verifier structures · cd3092c7
      Authored by Mathieu Xhonneux
      The new bpf_lwt_push_encap helper should only be accessible within
      the LWT BPF IN hook, and not the OUT one, as calling it there may
      lead to a panic while handling the skb.
      
      At the moment, both LWT BPF IN and OUT share the same list of helpers,
      whose calls are authorized by the verifier. This patch separates the
      verifier ops for the IN and OUT hooks, and allows the IN hook to call the
      bpf_lwt_push_encap helper.
      
      This patch is also the occasion to put all lwt_*_func_proto functions
      together for clarity.  At the moment, sock_ops_func_proto sits in the
      middle, between lwt_inout_func_proto and lwt_xmit_func_proto.
      Signed-off-by: Mathieu Xhonneux <m.xhonneux@gmail.com>
      Acked-by: David Lebrun <dlebrun@google.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
    • bpf: Add IPv6 Segment Routing helpers · fe94cc29
      Authored by Mathieu Xhonneux
      The BPF seg6local hook should be powerful enough to enable users to
      implement most of the use-cases one could think of.  After some
      thinking, we figured out that the following actions should be
      possible on an SRv6 packet, requiring 3 specific helpers:
          - bpf_lwt_seg6_store_bytes: Modify non-sensitive fields of the SRH
          - bpf_lwt_seg6_adjust_srh: Grow or shrink an SRH
                                     (to add/delete TLVs)
          - bpf_lwt_seg6_action: Apply some SRv6 network programming actions
                                 (specifically End.X, End.T, End.B6 and
                                  End.B6.Encap)
      
      The specifications of these helpers are provided in the patch (see
      include/uapi/linux/bpf.h).
      
      The non-sensitive fields of the SRH are the following: flags, tag and
      TLVs.  The other fields cannot be modified, to maintain the SRH's
      integrity.  Flags, tag and TLVs can easily be modified, as their
      validity can be checked afterwards via seg6_validate_srh.  It is not
      allowed to modify the segments directly; if one wants to add segments
      on the path, one should stack a new SRH using the End.B6 action via
      bpf_lwt_seg6_action.
      
      Growing, shrinking or editing TLVs via the helpers will flag the SRH as
      invalid, and it will have to be re-validated before re-entering the IPv6
      layer. This flag is stored in a per-CPU buffer, along with the current
      header length in bytes.
      
      Storing the SRH length in bytes in the control block is mandatory
      when using bpf_lwt_seg6_adjust_srh.  The Header Ext. Length field
      contains the SRH length rounded to 8 bytes (a padding TLV can be
      inserted to ensure the 8-byte boundary).  When adding/deleting TLVs
      within the BPF program, the SRH may temporarily be in an invalid
      state where its length cannot be rounded to 8 bytes without
      remainder, hence the need to store the length in bytes separately.
      The caller of the BPF program can then ensure that the SRH's final
      length is valid using this value.  Again, a final SRH modified by a
      BPF program which doesn't respect the 8-byte boundary will be
      discarded, as it will be considered invalid.
      
      Finally, a fourth helper is provided, bpf_lwt_push_encap, which is
      available from the LWT BPF IN hook, but not from the seg6local BPF
      one.  This helper allows encapsulating a Segment Routing Header
      (either with a new outer IPv6 header, or by inlining it directly in
      the existing IPv6 header) into a non-SRv6 packet.  This helper is
      required if we want to offer the possibility to dynamically
      encapsulate an SRH for non-SRv6 packets, as the BPF seg6local hook
      only works on traffic already containing an SRH.  This is the BPF
      equivalent of the seg6 LWT infrastructure, which achieves the same
      purpose but with a static SRH per route.
      
      These helpers require CONFIG_IPV6=y (and not =m).
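
      A hedged sketch of growing the SRH to append a TLV.  The TLV type,
      sizes, and fixed offset are all hypothetical (they assume an SRH
      right after the IPv6 header, carrying a single 16-byte segment):

      #include <linux/bpf.h>
      #include <linux/ipv6.h>
      #include <linux/seg6.h>
      #include "bpf_helpers.h"

      SEC("lwt_seg6local")
      int add_tlv(struct __sk_buff *skb)
      {
              /* Hypothetical 8-byte TLV: type, length, 6 bytes of data. */
              __u8 tlv[8] = { 0x40, 6, 0, 0, 0, 0, 0, 0 };
              /* Insertion point right after the single segment. */
              __u32 off = sizeof(struct ipv6hdr) +
                          sizeof(struct ipv6_sr_hdr) + 16;

              /* Growing the SRH flags it as invalid; it will be
               * re-validated via seg6_validate_srh before re-entering
               * the IPv6 layer. */
              if (bpf_lwt_seg6_adjust_srh(skb, off, sizeof(tlv)) < 0)
                      return BPF_DROP;
              if (bpf_lwt_seg6_store_bytes(skb, off, tlv, sizeof(tlv)) < 0)
                      return BPF_DROP;
              return BPF_OK;
      }
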
      Signed-off-by: Mathieu Xhonneux <m.xhonneux@gmail.com>
      Acked-by: David Lebrun <dlebrun@google.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
  19. May 22, 2018 (1 commit)
    • bpf: Add mtu checking to FIB forwarding helper · 4f74fede
      Authored by David Ahern
      Add a check that the egress MTU can handle the packet to be
      forwarded.  If the MTU is less than the packet length, return 0,
      meaning the packet is expected to continue up the stack for help,
      e.g. fragmenting the packet or sending an ICMP error.
      
      The XDP path needs to leverage the FIB entry for an MTU on the route
      spec, or an exception entry for a given destination.  The skb path
      lets is_skb_forwardable decide whether the packet can be sent.
      Signed-off-by: David Ahern <dsahern@gmail.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
  20. May 19, 2018 (1 commit)
  21. May 18, 2018 (1 commit)
    • bpf: fix truncated jump targets on heavy expansions · 050fad7c
      Authored by Daniel Borkmann
      Recently during testing, I ran into the following panic:
      
        [  207.892422] Internal error: Accessing user space memory outside uaccess.h routines: 96000004 [#1] SMP
        [  207.901637] Modules linked in: binfmt_misc [...]
        [  207.966530] CPU: 45 PID: 2256 Comm: test_verifier Tainted: G        W         4.17.0-rc3+ #7
        [  207.974956] Hardware name: FOXCONN R2-1221R-A4/C2U4N_MB, BIOS G31FB18A 03/31/2017
        [  207.982428] pstate: 60400005 (nZCv daif +PAN -UAO)
        [  207.987214] pc : bpf_skb_load_helper_8_no_cache+0x34/0xc0
        [  207.992603] lr : 0xffff000000bdb754
        [  207.996080] sp : ffff000013703ca0
        [  207.999384] x29: ffff000013703ca0 x28: 0000000000000001
        [  208.004688] x27: 0000000000000001 x26: 0000000000000000
        [  208.009992] x25: ffff000013703ce0 x24: ffff800fb4afcb00
        [  208.015295] x23: ffff00007d2f5038 x22: ffff00007d2f5000
        [  208.020599] x21: fffffffffeff2a6f x20: 000000000000000a
        [  208.025903] x19: ffff000009578000 x18: 0000000000000a03
        [  208.031206] x17: 0000000000000000 x16: 0000000000000000
        [  208.036510] x15: 0000ffff9de83000 x14: 0000000000000000
        [  208.041813] x13: 0000000000000000 x12: 0000000000000000
        [  208.047116] x11: 0000000000000001 x10: ffff0000089e7f18
        [  208.052419] x9 : fffffffffeff2a6f x8 : 0000000000000000
        [  208.057723] x7 : 000000000000000a x6 : 00280c6160000000
        [  208.063026] x5 : 0000000000000018 x4 : 0000000000007db6
        [  208.068329] x3 : 000000000008647a x2 : 19868179b1484500
        [  208.073632] x1 : 0000000000000000 x0 : ffff000009578c08
        [  208.078938] Process test_verifier (pid: 2256, stack limit = 0x0000000049ca7974)
        [  208.086235] Call trace:
        [  208.088672]  bpf_skb_load_helper_8_no_cache+0x34/0xc0
        [  208.093713]  0xffff000000bdb754
        [  208.096845]  bpf_test_run+0x78/0xf8
        [  208.100324]  bpf_prog_test_run_skb+0x148/0x230
        [  208.104758]  sys_bpf+0x314/0x1198
        [  208.108064]  el0_svc_naked+0x30/0x34
        [  208.111632] Code: 91302260 f9400001 f9001fa1 d2800001 (29500680)
        [  208.117717] ---[ end trace 263cb8a59b5bf29f ]---
      
      The program which caused this had a long jump over the whole
      instruction sequence, where all of the inner instructions required
      heavy expansions into multiple BPF instructions.  Additionally, I
      also had BPF hardening enabled, which once more requires rewrites of
      all constant values in order to blind them.  Each time we rewrite
      insns, bpf_adj_branches() potentially needs to adjust branch targets
      which cross the patchlet boundary to accommodate the additional
      delta.  Eventually that led to the case where the target offset could
      no longer fit into insn->off's upper 0x7fff limit, so the offset
      wraps around and becomes negative (in the s16 universe), or vice
      versa depending on the jump direction.
      
      Therefore it becomes necessary to detect and reject any such
      occurrences in a generic way for native eBPF and cBPF-to-eBPF
      migrations.  For the latter we can simply check bounds in
      bpf_convert_filter()'s BPF_EMIT_JMP helper macro and bail out once we
      surpass limits.  The bpf_patch_insn_single() case for native eBPF
      (and cBPF-to-eBPF in case of subsequent hardening) is a bit more
      complex, in that we need to detect such truncations before hitting
      the bpf_prog_realloc().  Thus the latter is split into an extra pass
      to probe problematic offsets on the original program in order to fail
      early.  With that in place and carefully tested, I no longer hit the
      panic and the rewrites are rejected properly.  The above example
      panic I've seen on bpf-next; the issue itself is generic, though, so
      a guard against it in bpf seems more appropriate.
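
      A hedged user-space illustration of the s16 wraparound at the heart
      of the bug (the numbers are made up):

      #include <stdio.h>
      #include <stdint.h>

      int main(void)
      {
              /* insn->off is an s16: a branch target adjusted past
               * 0x7fff silently wraps to a negative offset. */
              int32_t wanted = 0x7fff + 200;
              int16_t off = (int16_t)wanted;

              printf("wanted %d, insn->off stores %d\n", wanted, off);
              /* Prints: wanted 32967, insn->off stores -32569 */
              return 0;
      }
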
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Martin KaFai Lau <kafai@fb.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
  22. May 16, 2018 (1 commit)
  23. May 15, 2018 (1 commit)
    • bpf: sockmap, refactor sockmap routines to work with hashmap · e5cd3abc
      Authored by John Fastabend
      This patch only refactors the existing sockmap code.  This will allow
      much of the psock initialization code path and the bpf helper code to
      work both for sockmap bpf maps backed by an array (the currently
      supported type) and for the new hash-backed bpf map type, sockhash.
      
      Most of the fallout comes from three changes:
      
        - Pushing bpf programs into an independent structure so we
          can use it from the htab struct in the next patch.
        - Generalizing helpers to use void *key instead of the hardcoded
          u32.
        - Instead of passing map/key through the metadata, we now do
          the lookup inline.  This avoids storing the key in the metadata,
          which will be useful when keys can be longer than 4 bytes.  We
          also rename the sk pointers to sk_redir at this point, to avoid
          any confusion between the current sk pointer and the redirect
          pointer sk_redir.
      Signed-off-by: John Fastabend <john.fastabend@gmail.com>
      Acked-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
  24. May 11, 2018 (1 commit)
    • bpf: Provide helper to do forwarding lookups in kernel FIB table · 87f5fc7e
      Authored by David Ahern
      Provide a helper for doing a FIB and neighbor lookup in the kernel
      tables from an XDP program.  The helper provides a fast path for
      forwarding packets.  If the packet is a local delivery, or for any
      reason is not a simple lookup-and-forward, the packet continues up
      the stack.
      
      If it is to be forwarded, the forwarding can be done directly if the
      neighbor is already known. If the neighbor does not exist, the first
      few packets go up the stack for neighbor resolution. Once resolved, the
      xdp program provides the fast path.
      
      On successful lookup the nexthop dmac, current device smac and egress
      device index are returned.
      
      The API supports the IPv4, IPv6 and MPLS protocols, but only IPv4 and
      IPv6 are implemented in this patch.  The API includes layer 4
      parameters if the XDP program chooses to do deep packet inspection,
      to allow comparison against ACLs implemented as FIB rules.
      
      Header rewrite is left to the XDP program.
      
      The lookup takes 2 flags:
      - BPF_FIB_LOOKUP_DIRECT to do a lookup that bypasses FIB rules and goes
        straight to the table associated with the device (expert setting for
        those looking to maximize throughput)
      
      - BPF_FIB_LOOKUP_OUTPUT to do a lookup from the egress perspective.
        Default is an ingress lookup.
      
      Initial performance numbers collected by Jesper, forwarded packets/sec:
      
             Full stack    XDP FIB lookup    XDP Direct lookup
      IPv4   1,947,969       7,074,156          7,415,333
      IPv6   1,728,000       6,165,504          7,262,720
      
      These numbers are single-CPU-core forwarding on a Broadwell
      E5-1650 v4 @ 3.60GHz.
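
      A hedged sketch of a minimal IPv4 XDP forwarder using the helper.
      Per this commit, a successful lookup returns the egress device index
      (0 means send the packet up the stack); everything else here,
      including the program name, is illustrative:

      #include <linux/bpf.h>
      #include <linux/if_ether.h>
      #include <linux/ip.h>
      #include "bpf_helpers.h"
      #include "bpf_endian.h"

      #ifndef AF_INET
      #define AF_INET 2 /* normally from <linux/socket.h> */
      #endif

      SEC("xdp")
      int xdp_fwd(struct xdp_md *ctx)
      {
              void *data = (void *)(long)ctx->data;
              void *data_end = (void *)(long)ctx->data_end;
              struct ethhdr *eth = data;
              struct iphdr *iph = data + sizeof(*eth);
              struct bpf_fib_lookup fib = {};
              int rc;

              if ((void *)(iph + 1) > data_end)
                      return XDP_PASS;
              if (eth->h_proto != bpf_htons(ETH_P_IP))
                      return XDP_PASS;

              fib.family   = AF_INET;
              fib.ipv4_dst = iph->daddr;
              fib.ifindex  = ctx->ingress_ifindex;

              rc = bpf_fib_lookup(ctx, &fib, sizeof(fib), 0);
              if (rc <= 0) /* local, no route, or error: up the stack */
                      return XDP_PASS;

              /* Header rewrite is left to the program: apply the returned
               * nexthop dmac and egress smac, then redirect. */
              __builtin_memcpy(eth->h_dest, fib.dmac, ETH_ALEN);
              __builtin_memcpy(eth->h_source, fib.smac, ETH_ALEN);
              return bpf_redirect(rc, 0);
      }
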
      Signed-off-by: David Ahern <dsahern@gmail.com>
      Acked-by: Jesper Dangaard Brouer <brouer@redhat.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
  25. May 10, 2018 (1 commit)
  26. May 04, 2018 (3 commits)
    • bpf: add skb_load_bytes_relative helper · 4e1ec56c
      Authored by Daniel Borkmann
      This adds a small BPF helper similar to bpf_skb_load_bytes() that
      is able to load data relative to the mac/net header offset from the
      skb's linear data.  Compared to bpf_skb_load_bytes(), it takes a
      fifth argument, namely start_header, which is either
      BPF_HDR_START_MAC or BPF_HDR_START_NET.  This allows for a more
      flexible alternative to LD_ABS/LD_IND with negative offsets.  It's
      enabled for tc BPF programs as well as sock filter program types,
      where it's mainly useful in reuseport programs to ease access to
      lower header data.
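
      A hedged sketch of reading the IPv4 header relative to the network
      header, independent of where skb->data currently points (section and
      program names illustrative):

      #include <linux/bpf.h>
      #include <linux/ip.h>
      #include <linux/in.h>
      #include "bpf_helpers.h"

      SEC("socket")
      int tcp_only(struct __sk_buff *skb)
      {
              struct iphdr iph;

              /* Offset 0 relative to the network header. */
              if (bpf_skb_load_bytes_relative(skb, 0, &iph, sizeof(iph),
                                              BPF_HDR_START_NET) < 0)
                      return 0;

              /* Socket filter: keep TCP packets, drop the rest. */
              return iph.protocol == IPPROTO_TCP ? skb->len : 0;
      }
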
      
      Reference: https://lists.iovisor.org/pipermail/iovisor-dev/2017-March/000698.html
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • bpf: implement ld_abs/ld_ind in native bpf · e0cea7ce
      Authored by Daniel Borkmann
      The main part of this work is to finally allow removal of LD_ABS
      and LD_IND from the BPF core by reimplementing them through native
      eBPF instead.  Both LD_ABS/LD_IND were carried over from cBPF, and
      keeping them around in native eBPF caused way more trouble than it
      was actually worth.  To list just some of the past security issues:
      
        * fdfaf64e ("x86: bpf_jit: support negative offsets")
        * 35607b02 ("sparc: bpf_jit: fix loads from negative offsets")
        * e0ee9c12 ("x86: bpf_jit: fix two bugs in eBPF JIT compiler")
        * 07aee943 ("bpf, sparc: fix usage of wrong reg for load_skb_regs after call")
        * 6d59b7db ("bpf, s390x: do not reload skb pointers in non-skb context")
        * 87338c8e ("bpf, ppc64: do not reload skb pointers in non-skb context")
      
      For programs in native eBPF, LD_ABS/LD_IND are pretty much legacy
      these days due to their limitations and the more efficient/flexible
      alternatives that have been developed over time, such as direct
      packet access.  LD_ABS/LD_IND only cover 1/2/4-byte loads into a
      register; the load happens in host endianness, and their exception
      handling can yield unexpected behavior.  The latter is explained in
      depth in f6b1b3bf ("bpf: fix subprog verifier bypass by
      div/mod by 0 exception") with similar cases of exceptions we had.
      In native eBPF, more recent program types disable LD_ABS/LD_IND
      altogether through may_access_skb() in the verifier, and given the
      limitations in terms of exception handling, they are also disabled
      in programs that use BPF-to-BPF calls.
      
      In terms of cBPF, LD_ABS/LD_IND is used in networking programs to
      access packet data.  It is not used in seccomp-BPF, but in programs
      that use it for socket filtering or for reuseport demuxing with cBPF.
      This is mostly relevant for applications that have not yet migrated
      to native eBPF.
      
      The main complexity and source of bugs in LD_ABS/LD_IND comes from
      their implementation in the various JITs.  Most of them keep the
      model around from cBPF times by implementing a fastpath written
      in asm.  They typically use two CPU registers, hidden from the BPF
      program, for caching the skb's headlen (skb->len - skb->data_len)
      and skb->data.  Throughout the JIT phase this requires keeping track
      of whether LD_ABS/LD_IND are used and, if so, recaching the two
      registers each time a BPF helper changes the underlying packet data
      in the native eBPF case.  At least in the eBPF case, available CPU
      registers are rare, and the additional exit path out of the
      asm-written JIT helper also makes it inflexible, since not all parts
      of the JITer are in control from plain C.  An LD_ABS/LD_IND
      implementation in eBPF therefore allows to significantly reduce the
      complexity in JITs, with comparable performance results for them,
      e.g.:
      
      test_bpf             tcpdump port 22             tcpdump complex
      x64      - before    15 21 10                    14 19  18
               - after      7 10 10                     7 10  15
      arm64    - before    40 91 92                    40 91 151
               - after     51 64 73                    51 62 113
      
      For cBPF we now track any usage of LD_ABS/LD_IND in
      bpf_convert_filter() and cache the skb's headlen and data in the cBPF
      prologue.  The BPF_REG_TMP gets remapped from R8 to R2, since it's
      mainly just used as a local temporary variable.  This also slightly
      shrinks the image on x86_64 for seccomp programs, since %rsi is not
      an ereg.  In the callee-saved R8 and R9 we now track skb data and
      headlen, respectively.  For normal prologue emission in the JITs this
      does not add any extra instructions, since R8 and R9 are pushed to
      the stack in any case from the eBPF side.  cBPF uses the
      convert_bpf_ld_abs() emitter, which already probes the fast path
      inline and falls back to the bpf_skb_load_helper_{8,16,32}() helpers
      relying on the cached skb data and headlen as well.  R8 and R9 never
      need to be reloaded due to bpf_helper_changes_pkt_data(), since all
      skb access in cBPF is read-only.  Then, for the case of native eBPF,
      we use the bpf_gen_ld_abs() emitter, which calls the
      bpf_skb_load_helper_{8,16,32}_no_cache() helper unconditionally, and
      neither caches skb data and headlen nor has an inlined fast path.
      The reason for the latter is that native eBPF does not have any
      extra registers available anyway; but even if there were, it avoids
      any reload of skb data and headlen in the first place.  Additionally,
      for the negative offsets, we provide an alternative
      bpf_skb_load_bytes_relative() helper in eBPF, which operates
      similarly to bpf_skb_load_bytes() and allows for more flexibility.
      Tested by myself on x64, arm64 and s390x, and by Sandipan on ppc64.
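
      For context, a hedged sketch of the classic cBPF socket-filter
      pattern whose LD_ABS loads this conversion has to keep working (the
      filter contents are illustrative):

      #include <sys/socket.h>
      #include <linux/filter.h>

      /* Accept only IPv4 frames: A = half-word at absolute offset 12
       * (the ethertype), compared against ETH_P_IP (0x0800). */
      struct sock_filter code[] = {
              BPF_STMT(BPF_LD  | BPF_H   | BPF_ABS, 12),
              BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, 0x0800, 0, 1),
              BPF_STMT(BPF_RET | BPF_K, 0xffff), /* accept */
              BPF_STMT(BPF_RET | BPF_K, 0),      /* drop */
      };
      struct sock_fprog prog = {
              .len    = sizeof(code) / sizeof(code[0]),
              .filter = code,
      };

      /* Attach with:
       *   setsockopt(fd, SOL_SOCKET, SO_ATTACH_FILTER, &prog,
       *              sizeof(prog));
       */
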
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • bpf: migrate ebpf ld_abs/ld_ind tests to test_verifier · 93731ef0
      Authored by Daniel Borkmann
      Remove all eBPF tests involving LD_ABS/LD_IND from test_bpf.ko.  The
      reason is that the eBPF tests from the test_bpf module do not go via
      the BPF verifier, and therefore any instruction rewrites from the
      verifier cannot take place.
      
      Therefore, move them into test_verifier, which runs out of user
      space, so that the verifier can rewrite LD_ABS/LD_IND internally in
      upcoming patches.  It will have the same effect, since runtime tests
      are also performed from there.  This also allows finally unexporting
      bpf_skb_vlan_{push,pop}_proto and keeping it internal to the core
      kernel.
      
      Additionally, add further cBPF LD_ABS/LD_IND test coverage to the
      test_bpf.ko suite.
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>