1. 11 8月, 2018 8 次提交
    • M
      bpf: Introduce BPF_PROG_TYPE_SK_REUSEPORT · 2dbb9b9e
      Martin KaFai Lau 提交于
      This patch adds a BPF_PROG_TYPE_SK_REUSEPORT which can select
      a SO_REUSEPORT sk from a BPF_MAP_TYPE_REUSEPORT_ARRAY.  Like other
      non SK_FILTER/CGROUP_SKB program, it requires CAP_SYS_ADMIN.
      
      BPF_PROG_TYPE_SK_REUSEPORT introduces "struct sk_reuseport_kern"
      to store the bpf context instead of using the skb->cb[48].
      
      At the SO_REUSEPORT sk lookup time, it is in the middle of transiting
      from a lower layer (ipv4/ipv6) to a upper layer (udp/tcp).  At this
      point,  it is not always clear where the bpf context can be appended
      in the skb->cb[48] to avoid saving-and-restoring cb[].  Even putting
      aside the difference between ipv4-vs-ipv6 and udp-vs-tcp.  It is not
      clear if the lower layer is only ipv4 and ipv6 in the future and
      will it not touch the cb[] again before transiting to the upper
      layer.
      
      For example, in udp_gro_receive(), it uses the 48 byte NAPI_GRO_CB
      instead of IP[6]CB and it may still modify the cb[] after calling
      the udp[46]_lib_lookup_skb().  Because of the above reason, if
      sk->cb is used for the bpf ctx, saving-and-restoring is needed
      and likely the whole 48 bytes cb[] has to be saved and restored.
      
      Instead of saving, setting and restoring the cb[], this patch opts
      to create a new "struct sk_reuseport_kern" and setting the needed
      values in there.
      
      The new BPF_PROG_TYPE_SK_REUSEPORT and "struct sk_reuseport_(kern|md)"
      will serve all ipv4/ipv6 + udp/tcp combinations.  There is no protocol
      specific usage at this point and it is also inline with the current
      sock_reuseport.c implementation (i.e. no protocol specific requirement).
      
      In "struct sk_reuseport_md", this patch exposes data/data_end/len
      with semantic similar to other existing usages.  Together
      with "bpf_skb_load_bytes()" and "bpf_skb_load_bytes_relative()",
      the bpf prog can peek anywhere in the skb.  The "bind_inany" tells
      the bpf prog that the reuseport group is bind-ed to a local
      INANY address which cannot be learned from skb.
      
      The new "bind_inany" is added to "struct sock_reuseport" which will be
      used when running the new "BPF_PROG_TYPE_SK_REUSEPORT" bpf prog in order
      to avoid repeating the "bind INANY" test on
      "sk_v6_rcv_saddr/sk->sk_rcv_saddr" every time a bpf prog is run.  It can
      only be properly initialized when a "sk->sk_reuseport" enabled sk is
      adding to a hashtable (i.e. during "reuseport_alloc()" and
      "reuseport_add_sock()").
      
      The new "sk_select_reuseport()" is the main helper that the
      bpf prog will use to select a SO_REUSEPORT sk.  It is the only function
      that can use the new BPF_MAP_TYPE_REUSEPORT_ARRAY.  As mentioned in
      the earlier patch, the validity of a selected sk is checked in
      run time in "sk_select_reuseport()".  Doing the check in
      verification time is difficult and inflexible (consider the map-in-map
      use case).  The runtime check is to compare the selected sk's reuseport_id
      with the reuseport_id that we want.  This helper will return -EXXX if the
      selected sk cannot serve the incoming request (e.g. reuseport_id
      not match).  The bpf prog can decide if it wants to do SK_DROP as its
      discretion.
      
      When the bpf prog returns SK_PASS, the kernel will check if a
      valid sk has been selected (i.e. "reuse_kern->selected_sk != NULL").
      If it does , it will use the selected sk.  If not, the kernel
      will select one from "reuse->socks[]" (as before this patch).
      
      The SK_DROP and SK_PASS handling logic will be in the next patch.
      Signed-off-by: NMartin KaFai Lau <kafai@fb.com>
      Acked-by: NAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      2dbb9b9e
    • M
      bpf: Introduce BPF_MAP_TYPE_REUSEPORT_SOCKARRAY · 5dc4c4b7
      Martin KaFai Lau 提交于
      This patch introduces a new map type BPF_MAP_TYPE_REUSEPORT_SOCKARRAY.
      
      To unleash the full potential of a bpf prog, it is essential for the
      userspace to be capable of directly setting up a bpf map which can then
      be consumed by the bpf prog to make decision.  In this case, decide which
      SO_REUSEPORT sk to serve the incoming request.
      
      By adding BPF_MAP_TYPE_REUSEPORT_SOCKARRAY, the userspace has total control
      and visibility on where a SO_REUSEPORT sk should be located in a bpf map.
      The later patch will introduce BPF_PROG_TYPE_SK_REUSEPORT such that
      the bpf prog can directly select a sk from the bpf map.  That will
      raise the programmability of the bpf prog attached to a reuseport
      group (a group of sk serving the same IP:PORT).
      
      For example, in UDP, the bpf prog can peek into the payload (e.g.
      through the "data" pointer introduced in the later patch) to learn
      the application level's connection information and then decide which sk
      to pick from a bpf map.  The userspace can tightly couple the sk's location
      in a bpf map with the application logic in generating the UDP payload's
      connection information.  This connection info contact/API stays within the
      userspace.
      
      Also, when used with map-in-map, the userspace can switch the
      old-server-process's inner map to a new-server-process's inner map
      in one call "bpf_map_update_elem(outer_map, &index, &new_reuseport_array)".
      The bpf prog will then direct incoming requests to the new process instead
      of the old process.  The old process can finish draining the pending
      requests (e.g. by "accept()") before closing the old-fds.  [Note that
      deleting a fd from a bpf map does not necessary mean the fd is closed]
      
      During map_update_elem(),
      Only SO_REUSEPORT sk (i.e. which has already been added
      to a reuse->socks[]) can be used.  That means a SO_REUSEPORT sk that is
      "bind()" for UDP or "bind()+listen()" for TCP.  These conditions are
      ensured in "reuseport_array_update_check()".
      
      A SO_REUSEPORT sk can only be added once to a map (i.e. the
      same sk cannot be added twice even to the same map).  SO_REUSEPORT
      already allows another sk to be created for the same IP:PORT.
      There is no need to re-create a similar usage in the BPF side.
      
      When a SO_REUSEPORT is deleted from the "reuse->socks[]" (e.g. "close()"),
      it will notify the bpf map to remove it from the map also.  It is
      done through "bpf_sk_reuseport_detach()" and it will only be called
      if >=1 of the "reuse->sock[]" has ever been added to a bpf map.
      
      The map_update()/map_delete() has to be in-sync with the
      "reuse->socks[]".  Hence, the same "reuseport_lock" used
      by "reuse->socks[]" has to be used here also. Care has
      been taken to ensure the lock is only acquired when the
      adding sk passes some strict tests. and
      freeing the map does not require the reuseport_lock.
      
      The reuseport_array will also support lookup from the syscall
      side.  It will return a sock_gen_cookie().  The sock_gen_cookie()
      is on-demand (i.e. a sk's cookie is not generated until the very
      first map_lookup_elem()).
      
      The lookup cookie is 64bits but it goes against the logical userspace
      expectation on 32bits sizeof(fd) (and as other fd based bpf maps do also).
      It may catch user in surprise if we enforce value_size=8 while
      userspace still pass a 32bits fd during update.  Supporting different
      value_size between lookup and update seems unintuitive also.
      
      We also need to consider what if other existing fd based maps want
      to return 64bits value from syscall's lookup in the future.
      Hence, reuseport_array supports both value_size 4 and 8, and
      assuming user will usually use value_size=4.  The syscall's lookup
      will return ENOSPC on value_size=4.  It will will only
      return 64bits value from sock_gen_cookie() when user consciously
      choose value_size=8 (as a signal that lookup is desired) which then
      requires a 64bits value in both lookup and update.
      Signed-off-by: NMartin KaFai Lau <kafai@fb.com>
      Acked-by: NAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      5dc4c4b7
    • M
      net: Add ID (if needed) to sock_reuseport and expose reuseport_lock · 736b4602
      Martin KaFai Lau 提交于
      A later patch will introduce a BPF_MAP_TYPE_REUSEPORT_ARRAY which
      allows a SO_REUSEPORT sk to be added to a bpf map.  When a sk
      is removed from reuse->socks[], it also needs to be removed from
      the bpf map.  Also, when adding a sk to a bpf map, the bpf
      map needs to ensure it is indeed in a reuse->socks[].
      Hence, reuseport_lock is needed by the bpf map to ensure its
      map_update_elem() and map_delete_elem() operations are in-sync with
      the reuse->socks[].  The BPF_MAP_TYPE_REUSEPORT_ARRAY map will only
      acquire the reuseport_lock after ensuring the adding sk is already
      in a reuseport group (i.e. reuse->socks[]).  The map_lookup_elem()
      will be lockless.
      
      This patch also adds an ID to sock_reuseport.  A later patch
      will introduce BPF_PROG_TYPE_SK_REUSEPORT which allows
      a bpf prog to select a sk from a bpf map.  It is inflexible to
      statically enforce a bpf map can only contain the sk belonging to
      a particular reuse->socks[] (i.e. same IP:PORT) during the bpf
      verification time. For example, think about the the map-in-map situation
      where the inner map can be dynamically changed in runtime and the outer
      map may have inner maps belonging to different reuseport groups.
      Hence, when the bpf prog (in the new BPF_PROG_TYPE_SK_REUSEPORT
      type) selects a sk,  this selected sk has to be checked to ensure it
      belongs to the requesting reuseport group (i.e. the group serving
      that IP:PORT).
      
      The "sk->sk_reuseport_cb" pointer cannot be used for this checking
      purpose because the pointer value will change after reuseport_grow().
      Instead of saving all checking conditions like the ones
      preced calling "reuseport_add_sock()" and compare them everytime a
      bpf_prog is run, a 32bits ID is introduced to survive the
      reuseport_grow().  The ID is only acquired if any of the
      reuse->socks[] is added to the newly introduced
      "BPF_MAP_TYPE_REUSEPORT_ARRAY" map.
      
      If "BPF_MAP_TYPE_REUSEPORT_ARRAY" is not used,  the changes in this
      patch is a no-op.
      Signed-off-by: NMartin KaFai Lau <kafai@fb.com>
      Acked-by: NAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      736b4602
    • M
      tcp: Avoid TCP syncookie rejected by SO_REUSEPORT socket · 40a1227e
      Martin KaFai Lau 提交于
      Although the actual cookie check "__cookie_v[46]_check()" does
      not involve sk specific info, it checks whether the sk has recent
      synq overflow event in "tcp_synq_no_recent_overflow()".  The
      tcp_sk(sk)->rx_opt.ts_recent_stamp is updated every second
      when it has sent out a syncookie (through "tcp_synq_overflow()").
      
      The above per sk "recent synq overflow event timestamp" works well
      for non SO_REUSEPORT use case.  However, it may cause random
      connection request reject/discard when SO_REUSEPORT is used with
      syncookie because it fails the "tcp_synq_no_recent_overflow()"
      test.
      
      When SO_REUSEPORT is used, it usually has multiple listening
      socks serving TCP connection requests destinated to the same local IP:PORT.
      There are cases that the TCP-ACK-COOKIE may not be received
      by the same sk that sent out the syncookie.  For example,
      if reuse->socks[] began with {sk0, sk1},
      1) sk1 sent out syncookies and tcp_sk(sk1)->rx_opt.ts_recent_stamp
         was updated.
      2) the reuse->socks[] became {sk1, sk2} later.  e.g. sk0 was first closed
         and then sk2 was added.  Here, sk2 does not have ts_recent_stamp set.
         There are other ordering that will trigger the similar situation
         below but the idea is the same.
      3) When the TCP-ACK-COOKIE comes back, sk2 was selected.
         "tcp_synq_no_recent_overflow(sk2)" returns true. In this case,
         all syncookies sent by sk1 will be handled (and rejected)
         by sk2 while sk1 is still alive.
      
      The userspace may create and remove listening SO_REUSEPORT sockets
      as it sees fit.  E.g. Adding new thread (and SO_REUSEPORT sock) to handle
      incoming requests, old process stopping and new process starting...etc.
      With or without SO_ATTACH_REUSEPORT_[CB]BPF,
      the sockets leaving and joining a reuseport group makes picking
      the same sk to check the syncookie very difficult (if not impossible).
      
      The later patches will allow bpf prog more flexibility in deciding
      where a sk should be located in a bpf map and selecting a particular
      SO_REUSEPORT sock as it sees fit.  e.g. Without closing any sock,
      replace the whole bpf reuseport_array in one map_update() by using
      map-in-map.  Getting the syncookie check working smoothly across
      socks in the same "reuse->socks[]" is important.
      
      A partial solution is to set the newly added sk's ts_recent_stamp
      to the max ts_recent_stamp of a reuseport group but that will require
      to iterate through reuse->socks[]  OR
      pessimistically set it to "now - TCP_SYNCOOKIE_VALID" when a sk is
      joining a reuseport group.  However, neither of them will solve the
      existing sk getting moved around the reuse->socks[] and that
      sk may not have ts_recent_stamp updated, unlikely under continuous
      synflood but not impossible.
      
      This patch opts to treat the reuseport group as a whole when
      considering the last synq overflow timestamp since
      they are serving the same IP:PORT from the userspace
      (and BPF program) perspective.
      
      "synq_overflow_ts" is added to "struct sock_reuseport".
      The tcp_synq_overflow() and tcp_synq_no_recent_overflow()
      will update/check reuse->synq_overflow_ts if the sk is
      in a reuseport group.  Similar to the reuseport decision in
      __inet_lookup_listener(), both sk->sk_reuseport and
      sk->sk_reuseport_cb are tested for SO_REUSEPORT usage.
      Update on "synq_overflow_ts" happens at roughly once
      every second.
      
      A synflood test was done with a 16 rx-queues and 16 reuseport sockets.
      No meaningful performance change is observed.  Before and
      after the change is ~9Mpps in IPv4.
      
      Cc: Eric Dumazet <edumazet@google.com>
      Signed-off-by: NMartin KaFai Lau <kafai@fb.com>
      Acked-by: NAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      40a1227e
    • D
      Merge branch 'bpf-btf-for-htab-lru' · 74b247f4
      Daniel Borkmann 提交于
      Yonghong Song says:
      
      ====================
      Commit a26ca7c9 ("bpf: btf: Add pretty print support to
      the basic arraymap") added pretty print support to array map.
      This patch adds pretty print for hash and lru_hash maps.
      
      The following example shows the pretty-print result of a pinned
      hashmap. Without this patch set, user will get an error instead.
      
          struct map_value {
                  int count_a;
                  int count_b;
          };
      
          cat /sys/fs/bpf/pinned_hash_map:
      
          87907: {87907,87908}
          57354: {37354,57355}
          76625: {76625,76626}
          ...
      
      Patch #1 fixed a bug in bpffs map_seq_next() function so that
      all elements in the hash table will be traversed.
      Patch #2 implemented map_seq_show_elem() and map_check_btf()
      callback functions for hash and lru hash maps.
      Patch #3 enhanced tools/testing/selftests/bpf/test_btf.c to
      test bpffs hash and lru hash map pretty print.
      ====================
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      74b247f4
    • Y
      tools/bpf: add bpffs pretty print btf test for hash/lru_hash maps · af2a81da
      Yonghong Song 提交于
      Pretty print tests for hash/lru_hash maps are added in test_btf.c.
      The btf type blob is the same as pretty print array map test.
      The test result:
        $ mount -t bpf bpf /sys/fs/bpf
        $ ./test_btf -p
          BTF pretty print array......OK
          BTF pretty print hash......OK
          BTF pretty print lru hash......OK
          PASS:3 SKIP:0 FAIL:0
      Signed-off-by: NYonghong Song <yhs@fb.com>
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      af2a81da
    • Y
      bpf: btf: add pretty print for hash/lru_hash maps · 699c86d6
      Yonghong Song 提交于
      Commit a26ca7c9 ("bpf: btf: Add pretty print support to
      the basic arraymap") added pretty print support to array map.
      This patch adds pretty print for hash and lru_hash maps.
      The following example shows the pretty-print result of
      a pinned hashmap:
      
          struct map_value {
                  int count_a;
                  int count_b;
          };
      
          cat /sys/fs/bpf/pinned_hash_map:
      
          87907: {87907,87908}
          57354: {37354,57355}
          76625: {76625,76626}
          ...
      Signed-off-by: NYonghong Song <yhs@fb.com>
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      699c86d6
    • Y
      bpf: fix bpffs non-array map seq_show issue · dc1508a5
      Yonghong Song 提交于
      In function map_seq_next() of kernel/bpf/inode.c,
      the first key will be the "0" regardless of the map type.
      This works for array. But for hash type, if it happens
      key "0" is in the map, the bpffs map show will miss
      some items if the key "0" is not the first element of
      the first bucket.
      
      This patch fixed the issue by guaranteeing to get
      the first element, if the seq_show is just started,
      by passing NULL pointer key to map_get_next_key() callback.
      This way, no missing elements will occur for
      bpffs hash table show even if key "0" is in the map.
      
      Fixes: a26ca7c9 ("bpf: btf: Add pretty print support to the basic arraymap")
      Acked-by: NAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: NYonghong Song <yhs@fb.com>
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      dc1508a5
  2. 10 8月, 2018 32 次提交
    • D
      Merge branch 'bpf-veth-xdp-support' · 60afdf06
      Daniel Borkmann 提交于
      Toshiaki Makita says:
      
      ====================
      This patch set introduces driver XDP for veth.
      Basically this is used in conjunction with redirect action of another XDP
      program.
      
        NIC -----------> veth===veth
       (XDP) (redirect)        (XDP)
      
      In this case xdp_frame can be forwarded to the peer veth without
      modification, so we can expect far better performance than generic XDP.
      
      Envisioned use-cases
      --------------------
      
      * Container managed XDP program
      Container host redirects frames to containers by XDP redirect action, and
      privileged containers can deploy their own XDP programs.
      
      * XDP program cascading
      Two or more XDP programs can be called for each packet by redirecting
      xdp frames to veth.
      
      * Internal interface for an XDP bridge
      When using XDP redirection to create a virtual bridge, veth can be used
      to create an internal interface for the bridge.
      
      Implementation
      --------------
      
      This changeset is making use of NAPI to implement ndo_xdp_xmit and
      XDP_TX/REDIRECT. This is mainly because XDP heavily relies on NAPI
      context.
       - patch 1: Export a function needed for veth XDP.
       - patch 2-3: Basic implementation of veth XDP.
       - patch 4-6: Add ndo_xdp_xmit.
       - patch 7-9: Add XDP_TX and XDP_REDIRECT.
       - patch 10: Performance optimization for multi-queue env.
      
      Tests and performance numbers
      -----------------------------
      
      Tested with a simple XDP program which only redirects packets between
      NIC and veth. I used i40e 25G NIC (XXV710) for the physical NIC. The
      server has 20 of Xeon Silver 2.20 GHz cores.
      
        pktgen --(wire)--> XXV710 (i40e) <--(XDP redirect)--> veth===veth (XDP)
      
      The rightmost veth loads XDP progs and just does DROP or TX. The number
      of packets is measured in the XDP progs. The leftmost pktgen sends
      packets at 37.1 Mpps (almost 25G wire speed).
      
      veth XDP action    Flows    Mpps
      ================================
      DROP                   1    10.6
      DROP                   2    21.2
      DROP                 100    36.0
      TX                     1     5.0
      TX                     2    10.0
      TX                   100    31.0
      
      I also measured netperf TCP_STREAM but was not so great performance due
      to lack of tx/rx checksum offload and TSO, etc.
      
        netperf <--(wire)--> XXV710 (i40e) <--(XDP redirect)--> veth===veth (XDP PASS)
      
      Direction         Flows   Gbps
      ==============================
      external->veth        1   20.8
      external->veth        2   23.5
      external->veth      100   23.6
      veth->external        1    9.0
      veth->external        2   17.8
      veth->external      100   22.9
      
      Also tested doing ifup/down or load/unload a XDP program repeatedly
      during processing XDP packets in order to check if enabling/disabling
      NAPI is working as expected, and found no problems.
      
      v8:
      - Don't use xdp_frame pointer address to calculate skb->head, headroom,
        and xdp_buff.data_hard_start.
      
      v7:
      - Introduce xdp_scrub_frame() to clear kernel pointers in xdp_frame and
        use it instead of memset().
      
      v6:
      - Check skb->len only if reallocation is needed.
      - Add __GFP_NOWARN to alloc_page() since it can be triggered by external
        events.
      - Fix sparse warning around EXPORT_SYMBOL.
      
      v5:
      - Fix broken SOBs.
      
      v4:
      - Don't adjust MTU automatically.
      - Skip peer IFF_UP check on .ndo_xdp_xmit() because it is unnecessary.
        Add comments to explain that.
      - Use redirect_info instead of xdp_mem_info for storing no_direct flag
        to avoid per packet copy cost.
      
      v3:
      - Drop skb bulk xmit patch since it makes little performance
        difference. The hotspot in TCP skb xmit at this point is checksum
        computation in skb_segment and packet copy on XDP_REDIRECT due to
        cloned/nonlinear skb.
      - Fix race on closing device.
      - Add extack messages in ndo_bpf.
      
      v2:
      - Squash NAPI patch with "Add driver XDP" patch.
      - Remove conversion from xdp_frame to skb when NAPI is not enabled.
      - Introduce per-queue XDP ring (patch 8).
      - Introduce bulk skb xmit when XDP is enabled on the peer (patch 9).
      ====================
      Signed-off-by: NToshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      60afdf06
    • T
      veth: Support per queue XDP ring · 638264dc
      Toshiaki Makita 提交于
      Move XDP and napi related fields from veth_priv to newly created veth_rq
      structure.
      
      When xdp_frames are enqueued from ndo_xdp_xmit and XDP_TX, rxq is
      selected by current cpu.
      
      When skbs are enqueued from the peer device, rxq is one to one mapping
      of its peer txq. This way we have a restriction that the number of rxqs
      must not less than the number of peer txqs, but leave the possibility to
      achieve bulk skb xmit in the future because txq lock would make it
      possible to remove rxq ptr_ring lock.
      
      v3:
      - Add extack messages.
      - Fix array overrun in veth_xmit.
      Signed-off-by: NToshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      638264dc
    • T
      veth: Add XDP TX and REDIRECT · d1396004
      Toshiaki Makita 提交于
      This allows further redirection of xdp_frames like
      
       NIC   -> veth--veth -> veth--veth
       (XDP)          (XDP)         (XDP)
      
      The intermediate XDP, redirecting packets from NIC to the other veth,
      reuses xdp_mem_info from NIC so that page recycling of the NIC works on
      the destination veth's XDP.
      In this way return_frame is not fully guarded by NAPI, since another
      NAPI handler on another cpu may use the same xdp_mem_info concurrently.
      Thus disable napi_direct by xdp_set_return_frame_no_direct() during the
      NAPI context.
      
      v8:
      - Don't use xdp_frame pointer address for data_hard_start of xdp_buff.
      
      v4:
      - Use xdp_[set|clear]_return_frame_no_direct() instead of a flag in
        xdp_mem_info.
      
      v3:
      - Fix double free when veth_xdp_tx() returns a positive value.
      - Convert xdp_xmit and xdp_redir variables into flags.
      Signed-off-by: NToshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      d1396004
    • T
      xdp: Helpers for disabling napi_direct of xdp_return_frame · 2539650f
      Toshiaki Makita 提交于
      We need some mechanism to disable napi_direct on calling
      xdp_return_frame_rx_napi() from some context.
      When veth gets support of XDP_REDIRECT, it will redirects packets which
      are redirected from other devices. On redirection veth will reuse
      xdp_mem_info of the redirection source device to make return_frame work.
      But in this case .ndo_xdp_xmit() called from veth redirection uses
      xdp_mem_info which is not guarded by NAPI, because the .ndo_xdp_xmit()
      is not called directly from the rxq which owns the xdp_mem_info.
      
      This approach introduces a flag in bpf_redirect_info to indicate that
      napi_direct should be disabled even when _rx_napi variant is used as
      well as helper functions to use it.
      
      A NAPI handler who wants to use this flag needs to call
      xdp_set_return_frame_no_direct() before processing packets, and call
      xdp_clear_return_frame_no_direct() after xdp_do_flush_map() before
      exiting NAPI.
      
      v4:
      - Use bpf_redirect_info for storing the flag instead of xdp_mem_info to
        avoid per-frame copy cost.
      Signed-off-by: NToshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      2539650f
    • T
      bpf: Make redirect_info accessible from modules · 0b19cc0a
      Toshiaki Makita 提交于
      We are going to add kern_flags field in redirect_info for kernel
      internal use.
      In order to avoid function call to access the flags, make redirect_info
      accessible from modules. Also as it is now non-static, add prefix bpf_
      to redirect_info.
      
      v6:
      - Fix sparse warning around EXPORT_SYMBOL.
      Signed-off-by: NToshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      0b19cc0a
    • T
      veth: Add ndo_xdp_xmit · af87a3aa
      Toshiaki Makita 提交于
      This allows NIC's XDP to redirect packets to veth. The destination veth
      device enqueues redirected packets to the napi ring of its peer, then
      they are processed by XDP on its peer veth device.
      This can be thought as calling another XDP program by XDP program using
      REDIRECT, when the peer enables driver XDP.
      
      Note that when the peer veth device does not set driver xdp, redirected
      packets will be dropped because the peer is not ready for NAPI.
      
      v4:
      - Don't use xdp_ok_fwd_dev() because checking IFF_UP is not necessary.
        Add comments about it and check only MTU.
      
      v2:
      - Drop the part converting xdp_frame into skb when XDP is not enabled.
      - Implement bulk interface of ndo_xdp_xmit.
      - Implement XDP_XMIT_FLUSH bit and drop ndo_xdp_flush.
      Signed-off-by: NToshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
      Acked-by: NJohn Fastabend <john.fastabend@gmail.com>
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      af87a3aa
    • T
      veth: Handle xdp_frames in xdp napi ring · 9fc8d518
      Toshiaki Makita 提交于
      This is preparation for XDP TX and ndo_xdp_xmit.
      This allows napi handler to handle xdp_frames through xdp ring as well
      as sk_buff.
      
      v8:
      - Don't use xdp_frame pointer address to calculate skb->head and
        headroom.
      
      v7:
      - Use xdp_scrub_frame() instead of memset().
      
      v3:
      - Revert v2 change around rings and use a flag to differentiate skb and
        xdp_frame, since bulk skb xmit makes little performance difference
        for now.
      
      v2:
      - Use another ring instead of using flag to differentiate skb and
        xdp_frame. This approach makes bulk skb transmit possible in
        veth_xmit later.
      - Clear xdp_frame feilds in skb->head.
      - Implement adjust_tail.
      Signed-off-by: NToshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
      Acked-by: NJohn Fastabend <john.fastabend@gmail.com>
      Acked-by: NJesper Dangaard Brouer <brouer@redhat.com>
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      9fc8d518
    • T
      xdp: Helper function to clear kernel pointers in xdp_frame · a8d5b4ab
      Toshiaki Makita 提交于
      xdp_frame has kernel pointers which should not be readable from bpf
      programs. When we want to reuse xdp_frame region but it may be read by
      bpf programs later, we can use this helper to clear kernel pointers.
      This is more efficient than calling memset() for the entire struct.
      Signed-off-by: NToshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
      Acked-by: NJesper Dangaard Brouer <brouer@redhat.com>
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      a8d5b4ab
    • T
      veth: Avoid drops by oversized packets when XDP is enabled · dc224822
      Toshiaki Makita 提交于
      Oversized packets including GSO packets can be dropped if XDP is
      enabled on receiver side, so don't send such packets from peer.
      
      Drop TSO and SCTP fragmentation features so that veth devices themselves
      segment packets with XDP enabled. Also cap MTU accordingly.
      
      v4:
      - Don't auto-adjust MTU but cap max MTU.
      Signed-off-by: NToshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      dc224822
    • T
      veth: Add driver XDP · 948d4f21
      Toshiaki Makita 提交于
      This is the basic implementation of veth driver XDP.
      
      Incoming packets are sent from the peer veth device in the form of skb,
      so this is generally doing the same thing as generic XDP.
      
      This itself is not so useful, but a starting point to implement other
      useful veth XDP features like TX and REDIRECT.
      
      This introduces NAPI when XDP is enabled, because XDP is now heavily
      relies on NAPI context. Use ptr_ring to emulate NIC ring. Tx function
      enqueues packets to the ring and peer NAPI handler drains the ring.
      
      Currently only one ring is allocated for each veth device, so it does
      not scale on multiqueue env. This can be resolved by allocating rings
      on the per-queue basis later.
      
      Note that NAPI is not used but netif_rx is used when XDP is not loaded,
      so this does not change the default behaviour.
      
      v6:
      - Check skb->len only when allocation is needed.
      - Add __GFP_NOWARN to alloc_page() as it can be triggered by external
        events.
      
      v3:
      - Fix race on closing the device.
      - Add extack messages in ndo_bpf.
      
      v2:
      - Squashed with the patch adding NAPI.
      - Implement adjust_tail.
      - Don't acquire consumer lock because it is guarded by NAPI.
      - Make poll_controller noop since it is unnecessary.
      - Register rxq_info on enabling XDP rather than on opening the device.
      Signed-off-by: NToshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      948d4f21
    • T
      net: Export skb_headers_offset_update · b0768a86
      Toshiaki Makita 提交于
      This is needed for veth XDP which does skb_copy_expand()-like operation.
      
      v2:
      - Drop skb_copy_header part because it has already been exported now.
      Signed-off-by: NToshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      b0768a86
    • D
      Merge branch 'bpf-sample-cpumap-lb' · c4c20217
      Daniel Borkmann 提交于
      Jesper Dangaard Brouer says:
      
      ====================
      Background: cpumap moves the SKB allocation out of the driver code,
      and instead allocate it on the remote CPU, and invokes the regular
      kernel network stack with the newly allocated SKB.
      
      The idea behind the XDP CPU redirect feature, is to use XDP as a
      load-balancer step in-front of regular kernel network stack.  But the
      current sample code does not provide a good example of this.  Part of
      the reason is that, I have implemented this as part of Suricata XDP
      load-balancer.
      
      Given this is the most frequent feature request I get.  This patchset
      implement the same XDP load-balancing as Suricata does, which is a
      symmetric hash based on the IP-pairs + L4-protocol.
      
      The expected setup for the use-case is to reduce the number of NIC RX
      queues via ethtool (as XDP can handle more per core), and via
      smp_affinity assign these RX queues to a set of CPUs, which will be
      handling RX packets.  The CPUs that runs the regular network stack is
      supplied to the sample xdp_redirect_cpu tool by specifying
      the --cpu option multiple times on the cmdline.
      
      I do note that cpumap SKB creation is not feature complete yet, and
      more work is coming.  E.g. given GRO is not implemented yet, do expect
      TCP workloads to be slower.  My measurements do indicate UDP workloads
      are faster.
      ====================
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      c4c20217
    • J
      samples/bpf: xdp_redirect_cpu load balance like Suricata · 1bca4e6b
      Jesper Dangaard Brouer 提交于
      This implement XDP CPU redirection load-balancing across available
      CPUs, based on the hashing IP-pairs + L4-protocol.  This equivalent to
      xdp-cpu-redirect feature in Suricata, which is inspired by the
      Suricata 'ippair' hashing code.
      
      An important property is that the hashing is flow symmetric, meaning
      that if the source and destination gets swapped then the selected CPU
      will remain the same.  This is helps locality by placing both directions
      of a flows on the same CPU, in a forwarding/routing scenario.
      
      The hashing INITVAL (15485863 the 10^6th prime number) was fairly
      arbitrary choosen, but experiments with kernel tree pktgen scripts
      (pktgen_sample04_many_flows.sh +pktgen_sample05_flow_per_thread.sh)
      showed this improved the distribution.
      
      This patch also change the default loaded XDP program to be this
      load-balancer.  As based on different user feedback, this seems to be
      the expected behavior of the sample xdp_redirect_cpu.
      
      Link: https://github.com/OISF/suricata/commit/796ec08dd7a63Signed-off-by: NJesper Dangaard Brouer <brouer@redhat.com>
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      1bca4e6b
    • J
      samples/bpf: add Paul Hsieh's (LGPL 2.1) hash function SuperFastHash · 11395686
      Jesper Dangaard Brouer 提交于
      Adjusted function call API to take an initval. This allow the API
      user to set the initial value, as a seed. This could also be used for
      inputting the previous hash.
      Signed-off-by: NJesper Dangaard Brouer <brouer@redhat.com>
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      11395686
    • B
      Revert "xdp: add NULL pointer check in __xdp_return()" · eb91e4d4
      Björn Töpel 提交于
      This reverts commit 36e0f12b.
      
      The reverted commit adds a WARN to check against NULL entries in the
      mem_id_ht rhashtable. Any kernel path implementing the XDP (generic or
      driver) fast path is required to make a paired
      xdp_rxq_info_reg/xdp_rxq_info_unreg call for proper function. In
      addition, a driver using a different allocation scheme than the
      default MEM_TYPE_PAGE_SHARED is required to additionally call
      xdp_rxq_info_reg_mem_model.
      
      For MEM_TYPE_ZERO_COPY, an xdp_rxq_info_reg_mem_model call ensures
      that the mem_id_ht rhashtable has a properly inserted allocator id. If
      not, this would be a driver bug. A NULL pointer kernel OOPS is
      preferred to the WARN.
      Suggested-by: NJesper Dangaard Brouer <brouer@redhat.com>
      Signed-off-by: NBjörn Töpel <bjorn.topel@intel.com>
      Acked-by: NJesper Dangaard Brouer <brouer@redhat.com>
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      eb91e4d4
    • D
      Merge ra.kernel.org:/pub/scm/linux/kernel/git/davem/net · a736e074
      David S. Miller 提交于
      Overlapping changes in RXRPC, changing to ktime_get_seconds() whilst
      adding some tracepoints.
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      a736e074
    • D
      Merge branch 'Add-support-for-XGMAC2-in-stmmac' · 192e91d2
      David S. Miller 提交于
      Jose Abreu says:
      
      ====================
      Add support for XGMAC2 in stmmac
      
      This series adds support for 10Gigabit IP in stmmac. The IP is called XGMAC2
      and has many similarities with GMAC4. Due to this, its relatively easy to
      incorporate this new IP into stmmac driver by adding a new block and
      filling the necessary callbacks.
      
      The functionality added by this series is still reduced but its only a
      starting point which will later be expanded.
      
      I splitted the patches into funcionality and to ease the review. Only the
      patch 8/9 really enables the XGMAC2 block by adding a new compatible string.
      
      Version 4 addresses review comments of Florian Fainelli and Rob Herring.
      
      NOTE: Although the IP supports 10G, for now it was only possible to test it
      at 1G speed due to 10G PHY HW shipping problems. Here follows iperf3
      results at 1G:
      
      Connecting to host 192.168.0.10, port 5201
      [  4] local 192.168.0.3 port 39178 connected to 192.168.0.10 port 5201
      [ ID] Interval           Transfer     Bandwidth       Retr  Cwnd
      [  4]   0.00-1.00   sec   110 MBytes   920 Mbits/sec    0    482 KBytes
      [  4]   1.00-2.00   sec   113 MBytes   946 Mbits/sec    0    482 KBytes
      [  4]   2.00-3.00   sec   112 MBytes   937 Mbits/sec    0    482 KBytes
      [  4]   3.00-4.00   sec   113 MBytes   946 Mbits/sec    0    482 KBytes
      [  4]   4.00-5.00   sec   112 MBytes   935 Mbits/sec    0    482 KBytes
      [  4]   5.00-6.00   sec   113 MBytes   946 Mbits/sec    0    482 KBytes
      [  4]   6.00-7.00   sec   112 MBytes   937 Mbits/sec    0    482 KBytes
      [  4]   7.00-8.00   sec   113 MBytes   946 Mbits/sec    0    482 KBytes
      [  4]   8.00-9.00   sec   112 MBytes   937 Mbits/sec    0    482 KBytes
      [  4]   9.00-10.00  sec   113 MBytes   946 Mbits/sec    0    482 KBytes
      - - - - - - - - - - - - - - - - - - - - - - - - -
      [ ID] Interval           Transfer     Bandwidth       Retr
      [  4]   0.00-10.00  sec  1.09 GBytes   940 Mbits/sec    0             sender
      [  4]   0.00-10.00  sec  1.09 GBytes   938 Mbits/sec                  receiver
      ====================
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      192e91d2
    • J
      dt-bindings: net: stmmac: Add the bindings documentation for XGMAC2. · 80dfb286
      Jose Abreu 提交于
      Adds the documentation for XGMAC2 DT bindings.
      Signed-off-by: NJose Abreu <joabreu@synopsys.com>
      Cc: David S. Miller <davem@davemloft.net>
      Cc: Joao Pinto <jpinto@synopsys.com>
      Cc: Giuseppe Cavallaro <peppe.cavallaro@st.com>
      Cc: Alexandre Torgue <alexandre.torgue@st.com>
      Cc: Sergei Shtylyov <sergei.shtylyov@cogentembedded.com>
      Cc: devicetree@vger.kernel.org
      Cc: Rob Herring <robh+dt@kernel.org>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      80dfb286
    • J
      net: stmmac: Add the bindings parsing for XGMAC2 · a3f14247
      Jose Abreu 提交于
      Add the bindings parsing for XGMAC2 IP block.
      Signed-off-by: NJose Abreu <joabreu@synopsys.com>
      Cc: David S. Miller <davem@davemloft.net>
      Cc: Joao Pinto <jpinto@synopsys.com>
      Cc: Giuseppe Cavallaro <peppe.cavallaro@st.com>
      Cc: Alexandre Torgue <alexandre.torgue@st.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      a3f14247
    • J
      net: stmmac: Integrate XGMAC into main driver flow · 7d9e6c5a
      Jose Abreu 提交于
      Now that we have all the XGMAC related callbacks, lets start integrating
      this IP block into main driver.
      
      Also, we corrected the initialization flow to only start DMA after
      setting descriptors length.
      Signed-off-by: NJose Abreu <joabreu@synopsys.com>
      Cc: David S. Miller <davem@davemloft.net>
      Cc: Joao Pinto <jpinto@synopsys.com>
      Cc: Giuseppe Cavallaro <peppe.cavallaro@st.com>
      Cc: Alexandre Torgue <alexandre.torgue@st.com>
      Cc: Andrew Lunn <andrew@lunn.ch>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      7d9e6c5a
    • J
      net: stmmac: Add PTP support for XGMAC2 · 4bb7aff9
      Jose Abreu 提交于
      XGMAC2 uses the same engine of timestamping as GMAC4. Let's use the same
      callbacks.
      Signed-off-by: NJose Abreu <joabreu@synopsys.com>
      Cc: David S. Miller <davem@davemloft.net>
      Cc: Joao Pinto <jpinto@synopsys.com>
      Cc: Giuseppe Cavallaro <peppe.cavallaro@st.com>
      Cc: Alexandre Torgue <alexandre.torgue@st.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      4bb7aff9
    • J
      net: stmmac: Add MDIO related functions for XGMAC2 · 6fc21117
      Jose Abreu 提交于
      Add the MDIO related funcionalities for the new IP block XGMAC2.
      Signed-off-by: NJose Abreu <joabreu@synopsys.com>
      Cc: David S. Miller <davem@davemloft.net>
      Cc: Joao Pinto <jpinto@synopsys.com>
      Cc: Giuseppe Cavallaro <peppe.cavallaro@st.com>
      Cc: Alexandre Torgue <alexandre.torgue@st.com>
      Cc: Andrew Lunn <andrew@lunn.ch>
      Cc: Florian Fainelli <f.fainelli@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      6fc21117
    • J
      net: stmmac: Add descriptor related callbacks for XGMAC2 · 874dfb65
      Jose Abreu 提交于
      Add the descriptor related callbacks for the new IP block XGMAC2.
      Signed-off-by: NJose Abreu <joabreu@synopsys.com>
      Cc: David S. Miller <davem@davemloft.net>
      Cc: Joao Pinto <jpinto@synopsys.com>
      Cc: Giuseppe Cavallaro <peppe.cavallaro@st.com>
      Cc: Alexandre Torgue <alexandre.torgue@st.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      874dfb65
    • J
      net: stmmac: Add DMA related callbacks for XGMAC2 · d6ddfacd
      Jose Abreu 提交于
      Add the DMA related callbacks for the new IP block XGMAC2.
      Signed-off-by: NJose Abreu <joabreu@synopsys.com>
      Cc: David S. Miller <davem@davemloft.net>
      Cc: Joao Pinto <jpinto@synopsys.com>
      Cc: Giuseppe Cavallaro <peppe.cavallaro@st.com>
      Cc: Alexandre Torgue <alexandre.torgue@st.com>
      Cc: Florian Fainelli <f.fainelli@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      d6ddfacd
    • J
      net: stmmac: Add MAC related callbacks for XGMAC2 · 2142754f
      Jose Abreu 提交于
      Add the MAC related callbacks for the new IP block XGMAC2.
      Signed-off-by: NJose Abreu <joabreu@synopsys.com>
      Cc: David S. Miller <davem@davemloft.net>
      Cc: Joao Pinto <jpinto@synopsys.com>
      Cc: Giuseppe Cavallaro <peppe.cavallaro@st.com>
      Cc: Alexandre Torgue <alexandre.torgue@st.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      2142754f
    • J
      net: stmmac: Add XGMAC 2.10 HWIF entry · 48ae5554
      Jose Abreu 提交于
      Add a new entry to HWIF table for XGMAC 2.10. For now we fill it with
      empty callbacks which will be added in posterior patches.
      Signed-off-by: NJose Abreu <joabreu@synopsys.com>
      Cc: David S. Miller <davem@davemloft.net>
      Cc: Joao Pinto <jpinto@synopsys.com>
      Cc: Giuseppe Cavallaro <peppe.cavallaro@st.com>
      Cc: Alexandre Torgue <alexandre.torgue@st.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      48ae5554
    • D
      Merge branch 'More-complete-PHYLINK-support-for-mv88e6xxx' · ff50eda4
      David S. Miller 提交于
      Andrew Lunn says:
      
      ====================
      More complete PHYLINK support for mv88e6xxx
      
      Previous patches added sufficient PHYLINK support to the mv88e6xxx
      that it did not break existing use cases, basically fixed-link phys.
      
      This patchset builds out the support so that SFP modules, up to
      2.5Gbps can be supported, on mv88e6390X, on ports 9 and 10. It also
      provides a framework which can be extended to support SFPs on ports
      2-8 of mv88e6390X, 10Gbps PHYs, and SFP support on the 6352 family.
      
      Russell King did much of the initial work, implementing the validate
      and mac_link_state calls. However, there is an important TODO in the
      commit message:
      
      needs to call phylink_mac_change() when the port link comes up/goes down.
      
      The remaining patches implement this, by adding more support for the
      SERDES interfaces, in particular, interrupt support so we get notified
      when the SERDES gains/looses sync.
      
      This has been tested on the ZII devel C, using a Clearfog as peer
      device.
      ====================
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      ff50eda4
    • A
      net: dsa: mv88e6xxx: Re-setup interrupts on CMODE change. · 734447d4
      Andrew Lunn 提交于
      When a port changes CMODE, the SERDES interface being used can change.
      Disable interrupts for the old SERDES interface, and enable interrupts
      on the new.
      Signed-off-by: NAndrew Lunn <andrew@lunn.ch>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      734447d4
    • A
      net: dsa: mv88e6xxx: Add SERDES phydev_mac_change up for 6390 · efd1ba6a
      Andrew Lunn 提交于
      phylink wants to know when the MAC layers notices a change in the
      link. For the 6390 family, this is a change in the SERDES state.
      
      Add interrupt support for the SERDES interface used to implement
      SGMII/1000Base-X/2500Base-X. This is currently limited to ports 9 and
      10. Support for the 10G SERDES and other ports will be added later,
      building on this basic framework.
      Signed-off-by: NAndrew Lunn <andrew@lunn.ch>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      efd1ba6a
    • A
      net: dsa: mv88e6xxx: link mv88e6xxx_port to mv88e6xxx_chip · 7b898469
      Andrew Lunn 提交于
      An up coming change will register interrupts for individual switch
      ports, using the mv88e6xxx_port as the interrupt context information.
      Add members to the mv88e6xxx_port structure so we can link it back to
      the mv88e6xxx_chip member the port belongs to and the port number of
      the port.
      Signed-off-by: NAndrew Lunn <andrew@lunn.ch>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      7b898469
    • A
      net: dsa: mv88e6xxx: Power on/off SERDES on cmode change · 364e9d77
      Andrew Lunn 提交于
      The 6390 family has a number of SERDES interfaces per port. When the
      cmode changes, eg 1000Base-X to XAUI, the SERDES interface in use will
      also change. Power down the old SERDES interface and power up the new
      SERDES interface.
      Signed-off-by: NAndrew Lunn <andrew@lunn.ch>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      364e9d77
    • A
      net: dsa: mv88e6xxx: Cache the port cmode · 2d2e1dd2
      Andrew Lunn 提交于
      The ports CMODE indicates the type of link between the MAC and the
      PHY. It is used often in the SERDES code. Rather than read it each
      time, cache its value.
      Signed-off-by: NAndrew Lunn <andrew@lunn.ch>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      2d2e1dd2