1. 03 7月, 2019 8 次提交
    • S
      bpf: add dsack_dups/delivered{, _ce} to bpf_tcp_sock · 0357746d
      Stanislav Fomichev 提交于
      Add more fields to bpf_tcp_sock that might be useful for debugging
      congestion control issues.
      
      Cc: Eric Dumazet <edumazet@google.com>
      Cc: Priyaranjan Jha <priyarjha@google.com>
      Cc: Yuchung Cheng <ycheng@google.com>
      Cc: Soheil Hassas Yeganeh <soheil@google.com>
      Acked-by: NSoheil Hassas Yeganeh <soheil@google.com>
      Acked-by: NYuchung Cheng <ycheng@google.com>
      Signed-off-by: NStanislav Fomichev <sdf@google.com>
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      0357746d
    • S
      bpf: split shared bpf_tcp_sock and bpf_sock_ops implementation · 2377b81d
      Stanislav Fomichev 提交于
      We've added bpf_tcp_sock member to bpf_sock_ops and don't expect
      any new tcp_sock fields in bpf_sock_ops. Let's remove
      CONVERT_COMMON_TCP_SOCK_FIELDS so bpf_tcp_sock can be independently
      extended.
      
      Cc: Eric Dumazet <edumazet@google.com>
      Cc: Priyaranjan Jha <priyarjha@google.com>
      Cc: Yuchung Cheng <ycheng@google.com>
      Cc: Soheil Hassas Yeganeh <soheil@google.com>
      Acked-by: NSoheil Hassas Yeganeh <soheil@google.com>
      Acked-by: NYuchung Cheng <ycheng@google.com>
      Signed-off-by: NStanislav Fomichev <sdf@google.com>
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      2377b81d
    • S
      bpf: add BPF_CGROUP_SOCK_OPS callback that is executed on every RTT · 23729ff2
      Stanislav Fomichev 提交于
      Performance impact should be minimal because it's under a new
      BPF_SOCK_OPS_RTT_CB_FLAG flag that has to be explicitly enabled.
      Suggested-by: NEric Dumazet <edumazet@google.com>
      Cc: Eric Dumazet <edumazet@google.com>
      Cc: Priyaranjan Jha <priyarjha@google.com>
      Cc: Yuchung Cheng <ycheng@google.com>
      Cc: Soheil Hassas Yeganeh <soheil@google.com>
      Acked-by: NSoheil Hassas Yeganeh <soheil@google.com>
      Acked-by: NYuchung Cheng <ycheng@google.com>
      Signed-off-by: NStanislav Fomichev <sdf@google.com>
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      23729ff2
    • J
      selftests: bpf: standardize to static __always_inline · d2f5bbbc
      Jiri Benc 提交于
      The progs for bpf selftests use several different notations to force
      function inlining. Standardize to what most of them use,
      static __always_inline.
      Suggested-by: NSong Liu <liu.song.a23@gmail.com>
      Signed-off-by: NJiri Benc <jbenc@redhat.com>
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      d2f5bbbc
    • B
      bpf: Add support for fq's EDT to HBM · 71634d7f
      brakmo 提交于
      Adds support for fq's Earliest Departure Time to HBM (Host Bandwidth
      Manager). Includes a new BPF program supporting EDT, and also updates
      corresponding programs.
      
      It will drop packets with an EDT of more than 500us in the future
      unless the packet belongs to a flow with less than 2 packets in flight.
      This is done so each flow has at least 2 packets in flight, so they
      will not starve, and also to help prevent delayed ACK timeouts.
      
      It will also work with ECN enabled traffic, where the packets will be
      CE marked if their EDT is more than 50us in the future.
      
      The table below shows some performance numbers. The flows are back to
      back RPCS. One server sending to another, either 2 or 4 flows.
      One flow is a 10KB RPC, the rest are 1MB RPCs. When there are more
      than one flow of a given RPC size, the numbers represent averages.
      
      The rate limit applies to all flows (they are in the same cgroup).
      Tests ending with "-edt" ran with the new BPF program supporting EDT.
      Tests ending with "-hbt" ran on top HBT qdisc with the specified rate
      (i.e. no HBM). The other tests ran with the HBM BPF program included
      in the HBM patch-set.
      
      EDT has limited value when using DCTCP, but it helps in many cases when
      using Cubic. It usually achieves larger link utilization and lower
      99% latencies for the 1MB RPCs.
      HBM ends up queueing a lot of packets with its default parameter values,
      reducing the goodput of the 10KB RPCs and increasing their latency. Also,
      the RTTs seen by the flows are quite large.
      
                               Aggr              10K  10K  10K   1MB  1MB  1MB
               Limit           rate drops  RTT  rate  P90  P99  rate  P90  P99
      Test      rate  Flows    Mbps   %     us  Mbps   us   us  Mbps   ms   ms
      --------  ----  -----    ---- -----  ---  ---- ---- ----  ---- ---- ----
      cubic       1G    2       904  0.02  108   257  511  539   647 13.4 24.5
      cubic-edt   1G    2       982  0.01  156   239  656  967   743 14.0 17.2
      dctcp       1G    2       977  0.00  105   324  408  744   653 14.5 15.9
      dctcp-edt   1G    2       981  0.01  142   321  417  811   660 15.7 17.0
      cubic-htb   1G    2       919  0.00 1825    40 2822 4140   879  9.7  9.9
      
      cubic     200M    2       155  0.30  220    81  532  655    74  283  450
      cubic-edt 200M    2       188  0.02  222    87 1035 1095   101   84   85
      dctcp     200M    2       188  0.03  111    77  912  939   111   76  325
      dctcp-edt 200M    2       188  0.03  217    74 1416 1738   114   76   79
      cubic-htb 200M    2       188  0.00 5015     8 14ms 15ms   180   48   50
      
      cubic       1G    4       952  0.03  110   165  516  546   262   38  154
      cubic-edt   1G    4       973  0.01  190   111 1034 1314   287   65   79
      dctcp       1G    4       951  0.00  103   180  617  905   257   37   38
      dctcp-edt   1G    4       967  0.00  163   151  732 1126   272   43   55
      cubic-htb   1G    4       914  0.00 3249    13  7ms  8ms   300   29   34
      
      cubic       5G    4      4236  0.00  134   305  490  624  1310   10   17
      cubic-edt   5G    4      4865  0.00  156   306  425  759  1520   10   16
      dctcp       5G    4      4936  0.00  128   485  221  409  1484    7    9
      dctcp-edt   5G    4      4924  0.00  148   390  392  623  1508   11   26
      
      v1 -> v2: Incorporated Andrii's suggestions
      v2 -> v3: Incorporated Yonghong's suggestions
      v3 -> v4: Removed credit update that is not needed
      Signed-off-by: NLawrence Brakmo <brakmo@fb.com>
      Acked-by: NYonghong Song <yhs@fb.com>
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      71634d7f
    • L
      bpf, libbpf, smatch: Fix potential NULL pointer dereference · 33bae185
      Leo Yan 提交于
      Based on the following report from Smatch, fix the potential NULL
      pointer dereference check:
      
        tools/lib/bpf/libbpf.c:3493
        bpf_prog_load_xattr() warn: variable dereferenced before check 'attr'
        (see line 3483)
      
        3479 int bpf_prog_load_xattr(const struct bpf_prog_load_attr *attr,
        3480                         struct bpf_object **pobj, int *prog_fd)
        3481 {
        3482         struct bpf_object_open_attr open_attr = {
        3483                 .file           = attr->file,
        3484                 .prog_type      = attr->prog_type,
                                               ^^^^^^
        3485         };
      
      At the head of function, it directly access 'attr' without checking
      if it's NULL pointer. This patch moves the values assignment after
      validating 'attr' and 'attr->file'.
      Signed-off-by: NLeo Yan <leo.yan@linaro.org>
      Acked-by: NYonghong Song <yhs@fb.com>
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      33bae185
    • A
      libbpf: fix GCC8 warning for strncpy · cdfc7f88
      Andrii Nakryiko 提交于
      GCC8 started emitting warning about using strncpy with number of bytes
      exactly equal destination size, which is generally unsafe, as can lead
      to non-zero terminated string being copied. Use IFNAMSIZ - 1 as number
      of bytes to ensure name is always zero-terminated.
      Signed-off-by: NAndrii Nakryiko <andriin@fb.com>
      Cc: Magnus Karlsson <magnus.karlsson@intel.com>
      Acked-by: NYonghong Song <yhs@fb.com>
      Acked-by: NMagnus Karlsson <magnus.karlsson@intel.com>
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      cdfc7f88
    • A
      bpf: fix precision tracking · a3ce685d
      Alexei Starovoitov 提交于
      When equivalent state is found the current state needs to propagate precision marks.
      Otherwise the verifier will prune the search incorrectly.
      
      There is a price for correctness:
                            before      before    broken    fixed
                            cnst spill  precise   precise
      bpf_lb-DLB_L3.o       1923        8128      1863      1898
      bpf_lb-DLB_L4.o       3077        6707      2468      2666
      bpf_lb-DUNKNOWN.o     1062        1062      544       544
      bpf_lxc-DDROP_ALL.o   166729      380712    22629     36823
      bpf_lxc-DUNKNOWN.o    174607      440652    28805     45325
      bpf_netdev.o          8407        31904     6801      7002
      bpf_overlay.o         5420        23569     4754      4858
      bpf_lxc_jit.o         39389       359445    50925     69631
      Overall precision tracking is still very effective.
      
      Fixes: b5dc0163 ("bpf: precise scalar_value tracking")
      Reported-by: NLawrence Brakmo <brakmo@fb.com>
      Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
      Acked-by: NAndrii Nakryiko <andriin@fb.com>
      Tested-by: NLawrence Brakmo <brakmo@fb.com>
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      a3ce685d
  2. 29 6月, 2019 7 次提交
    • D
      Merge branch 'bpf-lookup-devmap' · 8daed767
      Daniel Borkmann 提交于
      Toke Høiland-Jørgensen says:
      
      ====================
      When using the bpf_redirect_map() helper to redirect packets from XDP, the eBPF
      program cannot currently know whether the redirect will succeed, which makes it
      impossible to gracefully handle errors. To properly fix this will probably
      require deeper changes to the way TX resources are allocated, but one thing that
      is fairly straight forward to fix is to allow lookups into devmaps, so programs
      can at least know when a redirect is *guaranteed* to fail because there is no
      entry in the map. Currently, programs work around this by keeping a shadow map
      of another type which indicates whether a map index is valid.
      
      This series contains two changes that are complementary ways to fix this issue:
      
      - Moving the map lookup into the bpf_redirect_map() helper (and caching the
        result), so the helper can return an error if no value is found in the map.
        This includes a refactoring of the devmap and cpumap code to not care about
        the index on enqueue.
      
      - Allowing regular lookups into devmaps from eBPF programs, using the read-only
        flag to make sure they don't change the values.
      
      The performance impact of the series is negligible, in the sense that I cannot
      measure it because the variance between test runs is higher than the difference
      pre/post series.
      
      Changelog:
      
      v6:
        - Factor out list handling in maps to a helper in list.h (new patch 1)
        - Rename variables in struct bpf_redirect_info (new patch 3 + patch 4)
        - Explain why we are clearing out the map in the info struct on lookup failure
        - Remove unneeded check for forwarding target in tracepoint macro
      
      v5:
        - Rebase on latest bpf-next.
        - Update documentation for bpf_redirect_map() with the new meaning of flags.
      
      v4:
        - Fix a few nits from Andrii
        - Lose the #defines in bpf.h and just compare the flags argument directly to
          XDP_TX in bpf_xdp_redirect_map().
      
      v3:
        - Adopt Jonathan's idea of using the lower two bits of the flag value as the
          return code.
        - Always do the lookup, and cache the result for use in xdp_do_redirect(); to
          achieve this, refactor the devmap and cpumap code to get rid the bitmap for
          selecting which devices to flush.
      
      v2:
        - For patch 1, make it clear that the change works for any map type.
        - For patch 2, just use the new BPF_F_RDONLY_PROG flag to make the return
          value read-only.
      ====================
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      8daed767
    • T
      devmap: Allow map lookups from eBPF · 0cdbb4b0
      Toke Høiland-Jørgensen 提交于
      We don't currently allow lookups into a devmap from eBPF, because the map
      lookup returns a pointer directly to the dev->ifindex, which shouldn't be
      modifiable from eBPF.
      
      However, being able to do lookups in devmaps is useful to know (e.g.)
      whether forwarding to a specific interface is enabled. Currently, programs
      work around this by keeping a shadow map of another type which indicates
      whether a map index is valid.
      
      Since we now have a flag to make maps read-only from the eBPF side, we can
      simply lift the lookup restriction if we make sure this flag is always set.
      Signed-off-by: NToke Høiland-Jørgensen <toke@redhat.com>
      Acked-by: NJonathan Lemon <jonathan.lemon@gmail.com>
      Acked-by: NAndrii Nakryiko <andriin@fb.com>
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      0cdbb4b0
    • T
      bpf_xdp_redirect_map: Perform map lookup in eBPF helper · 43e74c02
      Toke Høiland-Jørgensen 提交于
      The bpf_redirect_map() helper used by XDP programs doesn't return any
      indication of whether it can successfully redirect to the map index it was
      given. Instead, BPF programs have to track this themselves, leading to
      programs using duplicate maps to track which entries are populated in the
      devmap.
      
      This patch fixes this by moving the map lookup into the bpf_redirect_map()
      helper, which makes it possible to return failure to the eBPF program. The
      lower bits of the flags argument is used as the return code, which means
      that existing users who pass a '0' flag argument will get XDP_ABORTED.
      
      With this, a BPF program can check the return code from the helper call and
      react by, for instance, substituting a different redirect. This works for
      any type of map used for redirect.
      Signed-off-by: NToke Høiland-Jørgensen <toke@redhat.com>
      Acked-by: NJonathan Lemon <jonathan.lemon@gmail.com>
      Acked-by: NAndrii Nakryiko <andriin@fb.com>
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      43e74c02
    • T
      devmap: Rename ifindex member in bpf_redirect_info · 4b55cf29
      Toke Høiland-Jørgensen 提交于
      The bpf_redirect_info struct has an 'ifindex' member which was named back
      when the redirects could only target egress interfaces. Now that we can
      also redirect to sockets and CPUs, this is a bit misleading, so rename the
      member to tgt_index.
      
      Reorder the struct members so we can have 'tgt_index' and 'tgt_value' next
      to each other in a subsequent patch.
      Signed-off-by: NToke Høiland-Jørgensen <toke@redhat.com>
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      4b55cf29
    • T
      devmap/cpumap: Use flush list instead of bitmap · d5df2830
      Toke Høiland-Jørgensen 提交于
      The socket map uses a linked list instead of a bitmap to keep track of
      which entries to flush. Do the same for devmap and cpumap, as this means we
      don't have to care about the map index when enqueueing things into the
      map (and so we can cache the map lookup).
      Signed-off-by: NToke Høiland-Jørgensen <toke@redhat.com>
      Acked-by: NJonathan Lemon <jonathan.lemon@gmail.com>
      Acked-by: NAndrii Nakryiko <andriin@fb.com>
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      d5df2830
    • T
      xskmap: Move non-standard list manipulation to helper · c8af5cd7
      Toke Høiland-Jørgensen 提交于
      Add a helper in list.h for the non-standard way of clearing a list that is
      used in xskmap. This makes it easier to reuse it in the other map types,
      and also makes sure this usage is not forgotten in any list refactorings in
      the future.
      Signed-off-by: NToke Høiland-Jørgensen <toke@redhat.com>
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      c8af5cd7
    • S
      selftests/bpf: fix -Wstrict-aliasing in test_sockopt_sk.c · 2d6dbb9a
      Stanislav Fomichev 提交于
      Let's use union with u8[4] and u32 members for sockopt buffer,
      that should fix any possible aliasing issues.
      
      test_sockopt_sk.c: In function ‘getsetsockopt’:
      test_sockopt_sk.c:115:2: warning: dereferencing type-punned pointer will break strict-aliasing rules [-Wstrict-aliasing]
        if (*(__u32 *)buf != 0x55AA*2) {
        ^~
      test_sockopt_sk.c:116:3: warning: dereferencing type-punned pointer will break strict-aliasing rules [-Wstrict-aliasing]
         log_err("Unexpected getsockopt(SO_SNDBUF) 0x%x != 0x55AA*2",
         ^~~~~~~
      
      Fixes: 8a027dc0 ("selftests/bpf: add sockopt test that exercises sk helpers")
      Reported-by: NAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: NStanislav Fomichev <sdf@google.com>
      Acked-by: NSong Liu <songliubraving@fb.com>
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      2d6dbb9a
  3. 28 6月, 2019 25 次提交
    • A
      Merge branch 'bpf-sockopt-hooks' · 2ec1899e
      Alexei Starovoitov 提交于
      Stanislav Fomichev says:
      
      ====================
      This series implements two new per-cgroup hooks: getsockopt and
      setsockopt along with a new sockopt program type. The idea is pretty
      similar to recently introduced cgroup sysctl hooks, but
      implementation is simpler (no need to convert to/from strings).
      
      What this can be applied to:
      * move business logic of what tos/priority/etc can be set by
        containers (either pass or reject)
      * handle existing options (or introduce new ones) differently by
        propagating some information in cgroup/socket local storage
      
      Compared to a simple syscall/{g,s}etsockopt tracepoint, those
      hooks are context aware. Meaning, they can access underlying socket
      and use cgroup and socket local storage.
      
      v9:
      * allow overwriting setsocktop arguments (Alexei Starovoitov)
        (see individual changes for more changelog details)
      ====================
      Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
      2ec1899e
    • S
      bpftool: support cgroup sockopt · f6d08d9d
      Stanislav Fomichev 提交于
      Support sockopt prog type and cgroup hooks in the bpftool.
      
      Cc: Andrii Nakryiko <andriin@fb.com>
      Cc: Martin Lau <kafai@fb.com>
      Acked-by: NJakub Kicinski <jakub.kicinski@netronome.com>
      Signed-off-by: NStanislav Fomichev <sdf@google.com>
      Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
      f6d08d9d
    • S
      bpf: add sockopt documentation · 0c51b369
      Stanislav Fomichev 提交于
      Provide user documentation about sockopt prog type and cgroup hooks.
      
      v9:
      * add details about setsockopt context and inheritance
      
      v7:
      * add description for retval=0 and optlen=-1
      
      v6:
      * describe cgroup chaining, add example
      
      v2:
      * use return code 2 for kernel bypass
      
      Cc: Andrii Nakryiko <andriin@fb.com>
      Cc: Martin Lau <kafai@fb.com>
      Signed-off-by: NStanislav Fomichev <sdf@google.com>
      Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
      0c51b369
    • S
      selftests/bpf: add sockopt test that exercises BPF_F_ALLOW_MULTI · 65b4414a
      Stanislav Fomichev 提交于
      sockopt test that verifies chaining behavior.
      
      v9:
      * setsockopt chaining example
      
      v7:
      * rework the test to verify cgroup getsockopt chaining
      
      Cc: Andrii Nakryiko <andriin@fb.com>
      Cc: Martin Lau <kafai@fb.com>
      Signed-off-by: NStanislav Fomichev <sdf@google.com>
      Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
      65b4414a
    • S
      selftests/bpf: add sockopt test that exercises sk helpers · 8a027dc0
      Stanislav Fomichev 提交于
      socktop test that introduces new SOL_CUSTOM sockopt level and
      stores whatever users sets in sk storage. Whenever getsockopt
      is called, the original value is retrieved.
      
      v9:
      * SO_SNDBUF example to override user-supplied buffer
      
      v7:
      * use retval=0 and optlen-1
      
      v6:
      * test 'ret=1' use-case as well (Alexei Starovoitov)
      
      v4:
      * don't call bpf_sk_fullsock helper
      
      v3:
      * drop (__u8 *)(long) casts for optval{,_end}
      
      v2:
      * new test
      
      Cc: Andrii Nakryiko <andriin@fb.com>
      Cc: Martin Lau <kafai@fb.com>
      Signed-off-by: NStanislav Fomichev <sdf@google.com>
      Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
      8a027dc0
    • S
      selftests/bpf: add sockopt test · 9ec8a4c9
      Stanislav Fomichev 提交于
      Add sockopt selftests:
      * require proper expected_attach_type
      * enforce context field read/write access
      * test bpf_sockopt_handled handler
      * test EPERM
      * test limiting optlen from getsockopt
      * test out-of-bounds access
      
      v9:
      * add tests for setsockopt argument mangling
      
      v7:
      * remove return 2; test retval=0 and optlen=-1
      
      v3:
      * use DW for optval{,_end} loads
      
      v2:
      * use return code 2 for kernel bypass
      
      Cc: Andrii Nakryiko <andriin@fb.com>
      Cc: Martin Lau <kafai@fb.com>
      Signed-off-by: NStanislav Fomichev <sdf@google.com>
      Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
      9ec8a4c9
    • S
      selftests/bpf: test sockopt section name · 47ac90bb
      Stanislav Fomichev 提交于
      Add tests that make sure libbpf section detection works.
      
      Cc: Andrii Nakryiko <andriin@fb.com>
      Cc: Martin Lau <kafai@fb.com>
      Signed-off-by: NStanislav Fomichev <sdf@google.com>
      Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
      47ac90bb
    • S
      libbpf: support sockopt hooks · 4cdbfb59
      Stanislav Fomichev 提交于
      Make libbpf aware of new sockopt hooks so it can derive prog type
      and hook point from the section names.
      
      Cc: Andrii Nakryiko <andriin@fb.com>
      Cc: Martin Lau <kafai@fb.com>
      Signed-off-by: NStanislav Fomichev <sdf@google.com>
      Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
      4cdbfb59
    • S
      bpf: sync bpf.h to tools/ · aa6ab647
      Stanislav Fomichev 提交于
      Export new prog type and hook points to the libbpf.
      
      Cc: Andrii Nakryiko <andriin@fb.com>
      Cc: Martin Lau <kafai@fb.com>
      Signed-off-by: NStanislav Fomichev <sdf@google.com>
      Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
      aa6ab647
    • S
      bpf: implement getsockopt and setsockopt hooks · 0d01da6a
      Stanislav Fomichev 提交于
      Implement new BPF_PROG_TYPE_CGROUP_SOCKOPT program type and
      BPF_CGROUP_{G,S}ETSOCKOPT cgroup hooks.
      
      BPF_CGROUP_SETSOCKOPT can modify user setsockopt arguments before
      passing them down to the kernel or bypass kernel completely.
      BPF_CGROUP_GETSOCKOPT can can inspect/modify getsockopt arguments that
      kernel returns.
      Both hooks reuse existing PTR_TO_PACKET{,_END} infrastructure.
      
      The buffer memory is pre-allocated (because I don't think there is
      a precedent for working with __user memory from bpf). This might be
      slow to do for each {s,g}etsockopt call, that's why I've added
      __cgroup_bpf_prog_array_is_empty that exits early if there is nothing
      attached to a cgroup. Note, however, that there is a race between
      __cgroup_bpf_prog_array_is_empty and BPF_PROG_RUN_ARRAY where cgroup
      program layout might have changed; this should not be a problem
      because in general there is a race between multiple calls to
      {s,g}etsocktop and user adding/removing bpf progs from a cgroup.
      
      The return code of the BPF program is handled as follows:
      * 0: EPERM
      * 1: success, continue with next BPF program in the cgroup chain
      
      v9:
      * allow overwriting setsockopt arguments (Alexei Starovoitov):
        * use set_fs (same as kernel_setsockopt)
        * buffer is always kzalloc'd (no small on-stack buffer)
      
      v8:
      * use s32 for optlen (Andrii Nakryiko)
      
      v7:
      * return only 0 or 1 (Alexei Starovoitov)
      * always run all progs (Alexei Starovoitov)
      * use optval=0 as kernel bypass in setsockopt (Alexei Starovoitov)
        (decided to use optval=-1 instead, optval=0 might be a valid input)
      * call getsockopt hook after kernel handlers (Alexei Starovoitov)
      
      v6:
      * rework cgroup chaining; stop as soon as bpf program returns
        0 or 2; see patch with the documentation for the details
      * drop Andrii's and Martin's Acked-by (not sure they are comfortable
        with the new state of things)
      
      v5:
      * skip copy_to_user() and put_user() when ret == 0 (Martin Lau)
      
      v4:
      * don't export bpf_sk_fullsock helper (Martin Lau)
      * size != sizeof(__u64) for uapi pointers (Martin Lau)
      * offsetof instead of bpf_ctx_range when checking ctx access (Martin Lau)
      
      v3:
      * typos in BPF_PROG_CGROUP_SOCKOPT_RUN_ARRAY comments (Andrii Nakryiko)
      * reverse christmas tree in BPF_PROG_CGROUP_SOCKOPT_RUN_ARRAY (Andrii
        Nakryiko)
      * use __bpf_md_ptr instead of __u32 for optval{,_end} (Martin Lau)
      * use BPF_FIELD_SIZEOF() for consistency (Martin Lau)
      * new CG_SOCKOPT_ACCESS macro to wrap repeated parts
      
      v2:
      * moved bpf_sockopt_kern fields around to remove a hole (Martin Lau)
      * aligned bpf_sockopt_kern->buf to 8 bytes (Martin Lau)
      * bpf_prog_array_is_empty instead of bpf_prog_array_length (Martin Lau)
      * added [0,2] return code check to verifier (Martin Lau)
      * dropped unused buf[64] from the stack (Martin Lau)
      * use PTR_TO_SOCKET for bpf_sockopt->sk (Martin Lau)
      * dropped bpf_target_off from ctx rewrites (Martin Lau)
      * use return code for kernel bypass (Martin Lau & Andrii Nakryiko)
      
      Cc: Andrii Nakryiko <andriin@fb.com>
      Cc: Martin Lau <kafai@fb.com>
      Signed-off-by: NStanislav Fomichev <sdf@google.com>
      Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
      0d01da6a
    • D
      Merge branch 'bpf-af-xdp-mlx5e' · 3b1c667e
      Daniel Borkmann 提交于
      Tariq Toukan says:
      
      ====================
      This series contains improvements to the AF_XDP kernel infrastructure
      and AF_XDP support in mlx5e. The infrastructure improvements are
      required for mlx5e, but also some of them benefit to all drivers, and
      some can be useful for other drivers that want to implement AF_XDP.
      
      The performance testing was performed on a machine with the following
      configuration:
      
      - 24 cores of Intel Xeon E5-2620 v3 @ 2.40 GHz
      - Mellanox ConnectX-5 Ex with 100 Gbit/s link
      
      The results with retpoline disabled, single stream:
      
      txonly: 33.3 Mpps (21.5 Mpps with queue and app pinned to the same CPU)
      rxdrop: 12.2 Mpps
      l2fwd: 9.4 Mpps
      
      The results with retpoline enabled, single stream:
      
      txonly: 21.3 Mpps (14.1 Mpps with queue and app pinned to the same CPU)
      rxdrop: 9.9 Mpps
      l2fwd: 6.8 Mpps
      
      v2 changes:
      
      Added patches for mlx5e and addressed the comments for v1. Rebased for
      bpf-next.
      
      v3 changes:
      
      Rebased for the newer bpf-next, resolved conflicts in libbpf. Addressed
      Björn's comments for coding style. Fixed a bug in error handling flow in
      mlx5e_open_xsk.
      
      v4 changes:
      
      UAPI is not changed, XSK RX queues are exposed to the kernel. The lower
      half of the available amount of RX queues are regular queues, and the
      upper half are XSK RX queues. The patch "xsk: Extend channels to support
      combined XSK/non-XSK traffic" was dropped. The final patch was reworked
      accordingly.
      
      Added "net/mlx5e: Attach/detach XDP program safely", as the changes
      introduced in the XSK patch base on the stuff from this one.
      
      Added "libbpf: Support drivers with non-combined channels", which aligns
      the condition in libbpf with the condition in the kernel.
      
      Rebased over the newer bpf-next.
      
      v5 changes:
      
      In v4, ethtool reports the number of channels as 'combined' and the
      number of XSK RX queues as 'rx' for mlx5e. It was changed, so that 'rx'
      is 0, and 'combined' reports the double amount of channels if there is
      an active UMEM - to make libbpf happy.
      
      The patch for libbpf was dropped. Although it's still useful and fixes
      things, it raises some disagreement, so I'm dropping it - it's no longer
      useful for mlx5e anymore after the change above.
      
      v6 changes:
      
      As Maxim is out of office, I rebased the series on behalf of him,
      solved some conflicts, and re-spinned.
      ====================
      Acked-by: NBjörn Töpel <bjorn.topel@intel.com>
      Tested-by: NJonathan Lemon <jonathan.lemon@gmail.com>
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      3b1c667e
    • M
      net/mlx5e: Add XSK zero-copy support · db05815b
      Maxim Mikityanskiy 提交于
      This commit adds support for AF_XDP zero-copy RX and TX.
      
      We create a dedicated XSK RQ inside the channel, it means that two
      RQs are running simultaneously: one for non-XSK traffic and the other
      for XSK traffic. The regular and XSK RQs use a single ID namespace split
      into two halves: the lower half is regular RQs, and the upper half is
      XSK RQs. When any zero-copy AF_XDP socket is active, changing the number
      of channels is not allowed, because it would break to mapping between
      XSK RQ IDs and channels.
      
      XSK requires different page allocation and release routines. Such
      functions as mlx5e_{alloc,free}_rx_mpwqe and mlx5e_{get,put}_rx_frag are
      generic enough to be used for both regular and XSK RQs, and they use the
      mlx5e_page_{alloc,release} wrappers around the real allocation
      functions. Function pointers are not used to avoid losing the
      performance with retpolines. Wherever it's certain that the regular
      (non-XSK) page release function should be used, it's called directly.
      
      Only the stats that could be meaningful for XSK are exposed to the
      userspace. Those that don't take part in the XSK flow are not
      considered.
      
      Note that we don't wait for WQEs on the XSK RQ (unlike the regular RQ),
      because the newer xdpsock sample doesn't provide any Fill Ring entries
      at the setup stage.
      
      We create a dedicated XSK SQ in the channel. This separation has its
      advantages:
      
      1. When the UMEM is closed, the XSK SQ can also be closed and stop
      receiving completions. If an existing SQ was used for XSK, it would
      continue receiving completions for the packets of the closed socket. If
      a new UMEM was opened at that point, it would start getting completions
      that don't belong to it.
      
      2. Calculating statistics separately.
      
      When the userspace kicks the TX, the driver triggers a hardware
      interrupt by posting a NOP to a dedicated XSK ICO (internal control
      operations) SQ, in order to trigger NAPI on the right CPU core. This XSK
      ICO SQ is protected by a spinlock, as the userspace application may kick
      the TX from any core.
      
      Store the pointers to the UMEMs in the net device private context,
      independently from the kernel. This way the driver can distinguish
      between the zero-copy and non-zero-copy UMEMs. The kernel function
      xdp_get_umem_from_qid does not care about this difference, but the
      driver is only interested in zero-copy UMEMs, particularly, on the
      cleanup it determines whether to close the XSK RQ and SQ or not by
      looking at the presence of the UMEM. Use state_lock to protect the
      access to this area of UMEM pointers.
      
      LRO isn't compatible with XDP, but there may be active UMEMs while
      XDP is off. If this is the case, don't allow LRO to ensure XDP can
      be reenabled at any time.
      
      The validation of XSK parameters typically happens when XSK queues
      open. However, when the interface is down or the XDP program isn't
      set, it's still possible to have active AF_XDP sockets and even to
      open new, but the XSK queues will be closed. To cover these cases,
      perform the validation also in these flows:
      
      1. A new UMEM is registered, but the XSK queues aren't going to be
      created due to missing XDP program or interface being down.
      
      2. MTU changes while there are UMEMs registered.
      
      Having this early check prevents mlx5e_open_channels from failing
      at a later stage, where recovery is impossible and the application
      has no chance to handle the error, because it got the successful
      return value for an MTU change or XSK open operation.
      
      The performance testing was performed on a machine with the following
      configuration:
      
      - 24 cores of Intel Xeon E5-2620 v3 @ 2.40 GHz
      - Mellanox ConnectX-5 Ex with 100 Gbit/s link
      
      The results with retpoline disabled, single stream:
      
      txonly: 33.3 Mpps (21.5 Mpps with queue and app pinned to the same CPU)
      rxdrop: 12.2 Mpps
      l2fwd: 9.4 Mpps
      
      The results with retpoline enabled, single stream:
      
      txonly: 21.3 Mpps (14.1 Mpps with queue and app pinned to the same CPU)
      rxdrop: 9.9 Mpps
      l2fwd: 6.8 Mpps
      Signed-off-by: NMaxim Mikityanskiy <maximmi@mellanox.com>
      Signed-off-by: NTariq Toukan <tariqt@mellanox.com>
      Acked-by: NSaeed Mahameed <saeedm@mellanox.com>
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      db05815b
    • M
      net/mlx5e: Move queue param structs to en/params.h · 32a23653
      Maxim Mikityanskiy 提交于
      structs mlx5e_{rq,sq,cq,channel}_param are going to be used in the
      upcoming XSK RX and TX patches. Move them to a header file to make
      them accessible from other C files.
      Signed-off-by: NMaxim Mikityanskiy <maximmi@mellanox.com>
      Signed-off-by: NTariq Toukan <tariqt@mellanox.com>
      Acked-by: NSaeed Mahameed <saeedm@mellanox.com>
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      32a23653
    • M
      net/mlx5e: Encapsulate open/close queues into a function · 0a06382f
      Maxim Mikityanskiy 提交于
      Create new functions mlx5e_{open,close}_queues to encapsulate opening
      and closing RQs and SQs, and call the new functions from
      mlx5e_{open,close}_channel. It simplifies the existing functions a bit
      and prepares them for the upcoming AF_XDP changes.
      Signed-off-by: NMaxim Mikityanskiy <maximmi@mellanox.com>
      Signed-off-by: NTariq Toukan <tariqt@mellanox.com>
      Acked-by: NSaeed Mahameed <saeedm@mellanox.com>
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      0a06382f
    • M
      net/mlx5e: Consider XSK in XDP MTU limit calculation · a011b49f
      Maxim Mikityanskiy 提交于
      Use the existing mlx5e_get_linear_rq_headroom function to calculate the
      headroom for mlx5e_xdp_max_mtu. This function takes the XSK headroom
      into consideration, which will be used in the following patches.
      Signed-off-by: NMaxim Mikityanskiy <maximmi@mellanox.com>
      Signed-off-by: NTariq Toukan <tariqt@mellanox.com>
      Acked-by: NSaeed Mahameed <saeedm@mellanox.com>
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      a011b49f
    • M
      net/mlx5e: XDP_TX from UMEM support · 84a0a231
      Maxim Mikityanskiy 提交于
      When an XDP program returns XDP_TX, and the RQ is XSK-enabled, it
      requires careful handling, because convert_to_xdp_frame creates a new
      page and copies the data there, while our driver expects the xdp_frame
      to point to the same memory as the xdp_buff. Handle this case
      separately: map the page, and in the end unmap it and call
      xdp_return_frame.
      Signed-off-by: NMaxim Mikityanskiy <maximmi@mellanox.com>
      Signed-off-by: NTariq Toukan <tariqt@mellanox.com>
      Acked-by: NSaeed Mahameed <saeedm@mellanox.com>
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      84a0a231
    • M
      net/mlx5e: Share the XDP SQ for XDP_TX between RQs · b9673cf5
      Maxim Mikityanskiy 提交于
      Put the XDP SQ that is used for XDP_TX into the channel. It used to be a
      part of the RQ, but with introduction of AF_XDP there will be one more
      RQ that could share the same XDP SQ. This patch is a preparation for
      that change.
      
      Separate XDP_TX statistics per RQ were implemented in one of the previous
      patches.
      Signed-off-by: NMaxim Mikityanskiy <maximmi@mellanox.com>
      Signed-off-by: NTariq Toukan <tariqt@mellanox.com>
      Acked-by: NSaeed Mahameed <saeedm@mellanox.com>
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      b9673cf5
    • M
      net/mlx5e: Refactor struct mlx5e_xdp_info · d963fa15
      Maxim Mikityanskiy 提交于
      Currently, struct mlx5e_xdp_info has some issues that have to be cleaned
      up before the upcoming AF_XDP support makes things too complicated and
      messy. This structure is used both when sending the packet and on
      completion. Moreover, the cleanup procedure on completion depends on the
      origin of the packet (XDP_REDIRECT, XDP_TX). Adding AF_XDP support will
      add new flows that use this structure even differently. To avoid
      overcomplicating the code, this commit refactors the usage of this
      structure in the following ways:
      
      1. struct mlx5e_xdp_info is split into two different structures. One is
      struct mlx5e_xdp_xmit_data, a transient structure that doesn't need to
      be stored and is only used while sending the packet. The other is still
      struct mlx5e_xdp_info that is stored in a FIFO and contains the fields
      needed on completion.
      
      2. The fields of struct mlx5e_xdp_info that are used in different flows
      are put into a union. A special enum indicates the cleanup mode and
      helps choose the right union member. This approach is clear and
      explicit. Although it could be possible to "guess" the mode by looking
      at the values of the fields and at the XDP SQ type, it wouldn't be that
      clear and extendable and would require looking through the whole chain
      to understand what's going on.
      
      For the reference, there are the fields of struct mlx5e_xdp_info that
      are used in different flows (including AF_XDP ones):
      
      Packet origin          | Fields used on completion | Cleanup steps
      -----------------------+---------------------------+------------------
      XDP_REDIRECT,          | xdpf, dma_addr            | DMA unmap and
      XDP_TX from XSK RQ     |                           | xdp_return_frame.
      -----------------------+---------------------------+------------------
      XDP_TX from regular RQ | di                        | Recycle page.
      -----------------------+---------------------------+------------------
      AF_XDP TX              | (none)                    | Increment the
                             |                           | producer index in
                             |                           | Completion Ring.
      
      On send, the same set of mlx5e_xdp_xmit_data fields is used in all
      flows: DMA and virtual addresses and length.
      Signed-off-by: NMaxim Mikityanskiy <maximmi@mellanox.com>
      Signed-off-by: NTariq Toukan <tariqt@mellanox.com>
      Acked-by: NSaeed Mahameed <saeedm@mellanox.com>
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      d963fa15
    • M
      net/mlx5e: Allow ICO SQ to be used by multiple RQs · ed084fb6
      Maxim Mikityanskiy 提交于
      Prepare to creation of the XSK RQ, which will require posting UMRs, too.
      The same ICO SQ will be used for both RQs and also to trigger interrupts
      by posting NOPs. UMR WQEs can't be reused any more. Optimization
      introduced in commit ab966d7e ("net/mlx5e: RX, Recycle buffer of
      UMR WQEs") is reverted.
      Signed-off-by: NMaxim Mikityanskiy <maximmi@mellanox.com>
      Signed-off-by: NTariq Toukan <tariqt@mellanox.com>
      Acked-by: NSaeed Mahameed <saeedm@mellanox.com>
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      ed084fb6
    • M
      net/mlx5e: Calculate linear RX frag size considering XSK · a069e977
      Maxim Mikityanskiy 提交于
      Additional conditions introduced:
      
      - XSK implies XDP.
      - Headroom includes the XSK headroom if it exists.
      - No space is reserved for struct shared_skb_info in XSK mode.
      - Fragment size smaller than the XSK chunk size is not allowed.
      
      A new auxiliary function mlx5e_get_linear_rq_headroom with the support
      for XSK is introduced. Use this function in the implementation of
      mlx5e_get_rq_headroom. Change headroom to u32 to match the headroom
      field in struct xdp_umem.
      Signed-off-by: NMaxim Mikityanskiy <maximmi@mellanox.com>
      Signed-off-by: NTariq Toukan <tariqt@mellanox.com>
      Acked-by: NSaeed Mahameed <saeedm@mellanox.com>
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      a069e977
    • M
      net/mlx5e: Replace deprecated PCI_DMA_TODEVICE · 6ed9350f
      Maxim Mikityanskiy 提交于
      The PCI API for DMA is deprecated, and PCI_DMA_TODEVICE is just defined
      to DMA_TO_DEVICE for backward compatibility. Just use DMA_TO_DEVICE.
      Signed-off-by: NMaxim Mikityanskiy <maximmi@mellanox.com>
      Signed-off-by: NTariq Toukan <tariqt@mellanox.com>
      Acked-by: NSaeed Mahameed <saeedm@mellanox.com>
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      6ed9350f
    • M
      xsk: Return the whole xdp_desc from xsk_umem_consume_tx · 4bce4e5c
      Maxim Mikityanskiy 提交于
      Some drivers want to access the data transmitted in order to implement
      acceleration features of the NICs. It is also useful in AF_XDP TX flow.
      
      Change the xsk_umem_consume_tx API to return the whole xdp_desc, that
      contains the data pointer, length and DMA address, instead of only the
      latter two. Adapt the implementation of i40e and ixgbe to this change.
      Signed-off-by: NMaxim Mikityanskiy <maximmi@mellanox.com>
      Signed-off-by: NTariq Toukan <tariqt@mellanox.com>
      Acked-by: NSaeed Mahameed <saeedm@mellanox.com>
      Cc: Björn Töpel <bjorn.topel@intel.com>
      Cc: Magnus Karlsson <magnus.karlsson@intel.com>
      Acked-by: NBjörn Töpel <bjorn.topel@intel.com>
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      4bce4e5c
    • M
      xsk: Change the default frame size to 4096 and allow controlling it · 123e8da1
      Maxim Mikityanskiy 提交于
      The typical XDP memory scheme is one packet per page. Change the AF_XDP
      frame size in libbpf to 4096, which is the page size on x86, to allow
      libbpf to be used with the drivers with the packet-per-page scheme.
      
      Add a command line option -f to xdpsock to allow to specify a custom
      frame size.
      Signed-off-by: NMaxim Mikityanskiy <maximmi@mellanox.com>
      Signed-off-by: NTariq Toukan <tariqt@mellanox.com>
      Acked-by: NSaeed Mahameed <saeedm@mellanox.com>
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      123e8da1
    • M
      libbpf: Support getsockopt XDP_OPTIONS · 2761ed4b
      Maxim Mikityanskiy 提交于
      Query XDP_OPTIONS in libbpf to determine if the zero-copy mode is active
      or not.
      Signed-off-by: NMaxim Mikityanskiy <maximmi@mellanox.com>
      Signed-off-by: NTariq Toukan <tariqt@mellanox.com>
      Acked-by: NSaeed Mahameed <saeedm@mellanox.com>
      Acked-by: NBjörn Töpel <bjorn.topel@intel.com>
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      2761ed4b
    • M
      xsk: Add getsockopt XDP_OPTIONS · 2640d3c8
      Maxim Mikityanskiy 提交于
      Make it possible for the application to determine whether the AF_XDP
      socket is running in zero-copy mode. To achieve this, add a new
      getsockopt option XDP_OPTIONS that returns flags. The only flag
      supported for now is the zero-copy mode indicator.
      Signed-off-by: NMaxim Mikityanskiy <maximmi@mellanox.com>
      Signed-off-by: NTariq Toukan <tariqt@mellanox.com>
      Acked-by: NSaeed Mahameed <saeedm@mellanox.com>
      Acked-by: NBjörn Töpel <bjorn.topel@intel.com>
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      2640d3c8