1. 12 October 2020, 22 commits
    • bpf, sockmap: Allow skipping sk_skb parser program · ef565928
      John Fastabend committed
      Currently, we often run with a nop parser, namely one that just does
      'return skb->len'. This happens when either our verdict program can
      handle streaming data or it is only looking at socket data such as IP
      addresses and other metadata associated with the flow. The second case
      is common for an L3/L4 proxy, for instance.
      
      So let's allow loading programs without the parser; then we can skip
      the stream parser logic and avoid having to add a BPF program that is
      effectively a nop (a sketch of such a nop parser follows this entry).
      Signed-off-by: John Fastabend <john.fastabend@gmail.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Link: https://lore.kernel.org/bpf/160239297866.8495.13345662302749219672.stgit@john-Precision-5820-Tower
      ef565928
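      For reference, a minimal sketch of the kind of nop stream parser that no
      longer needs to be attached; this is not taken from the patch, the section
      name follows libbpf's sk_skb conventions, and the program name is
      illustrative:
      
        #include <linux/bpf.h>
        #include <bpf/bpf_helpers.h>
        
        /* "Parser" that treats every skb as one complete message. With this
         * change, a verdict-only setup can simply omit it. */
        SEC("sk_skb/stream_parser")
        int nop_parser(struct __sk_buff *skb)
        {
                return skb->len;
        }
        
        char _license[] SEC("license") = "GPL";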
    • bpf, sockmap: Check skb_verdict and skb_parser programs explicitly · 743df8b7
      John Fastabend committed
      We are about to allow skb_verdict to run without skb_parser programs,
      so as a first step change the code to check each program type
      specifically. This should be a mechanical change without any impact on
      the actual result.
      Signed-off-by: John Fastabend <john.fastabend@gmail.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Link: https://lore.kernel.org/bpf/160239294756.8495.5796595770890272219.stgit@john-Precision-5820-Tower
      743df8b7
    • Merge branch 'sockmap/sk_skb program memory acct fixes' · 20a6d915
      Alexei Starovoitov committed
      John Fastabend says:
      
      ====================
      
      Users of sockmap and skmsg trying to build proxies and other tools
      have pointed out to me that the error handling can be problematic. If
      the proxy is under-provisioned and/or the BPF admin does not have
      the ability to update/modify memory provisions on the sockets,
      it's possible data may be dropped. For some things we have retries
      so everything works out OK, but for most things this is likely
      not great, and things go bad.
      
      The original design dropped memory accounting on the receive
      socket as early as possible. We did this early in sk_skb
      handling and then charged it to the redirect socket immediately
      after running the BPF program.
      
      But this design caused a fundamental problem. Namely, what should we do
      if we redirect to a socket that has already reached its socket memory
      limits? For proxy use cases the network admin can tune memory limits.
      But in general we punted on this problem and told folks to simply make
      your memory limits high enough to handle your workload. This is not a
      really good answer. When deploying into environments where we expect this
      to be transparent, it no longer is, because we need to tune parameters.
      In fact it's really only viable in cases where we have fine-grained
      control over the application, for example a proxy redirecting from an
      ingress socket to an egress socket. The result is I get bug reports,
      because it's surprising for one, but more importantly it also breaks
      some use cases. So let's fix it.
      
      This series cleans up the different cases so that in many common
      modes, such as passing the packet up to the receive socket, we can
      simply rely on the underlying assumption that the TCP stack has
      already done the memory accounting.
      
      Next, instead of trying to do memory accounting against the socket
      we plan to redirect into, we keep memory accounting on the receive
      socket until the skb can be put on the redirect socket. This means
      if we do an egress redirect to a socket and sock_writeable() returns
      EAGAIN, we can requeue the skb on the workqueue and try again. The
      same scenario plays out for ingress. If the skb can not be put on
      the receive queue of the redirect socket, then we simply requeue and
      retry. In both cases memory is still accounted for against the
      receiving socket.
      
      This also handles head-of-line blocking. With the above scheme the
      skb is on a queue associated with the socket it will be sent/recv'd
      on, but the memory accounting is against the receive socket. This
      means the receive socket can advance to the next skb and avoid
      head-of-line blocking, at least until its receive memory runs out.
      This puts a maximum size on the amount of data any socket can enqueue,
      giving us bounds on the skb lists so they can't grow indefinitely.
      
      Overall I think this is a win. Tested with test_sockmap.
      
      These are fixes, but I tagged it for bpf-next considering we are
      at -rc8.
      
      v1->v2: Fix uninitialized/unused variables (kernel test robot)
      v2->v3: Fix typo in patch 2: err=0 needs to be < 0, so use err=-EIO
      ---
      ====================
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      20a6d915
    • bpf, sockmap: Add memory accounting so skbs on ingress lists are visible · 0b17ad25
      John Fastabend committed
      Move the skb->sk assignment out of sk_psock_bpf_run() and into individual
      callers. Then we can use a proper skb_set_owner_r() call to assign a
      sk to a skb. This improves things by also charging the truesize against
      the socket's sk_rmem_alloc counter. With this done we get some accounting
      in place to ensure the memory associated with skbs on the workqueue is
      still being accounted for somewhere. Finally, by using skb_set_owner_r()
      the destructor is set up, so we can just let the normal kfree_skb logic
      recover the memory (see the sketch after this entry). Combined with the
      previous patch dropping skb_orphan(), we can now recover from memory
      pressure and maintain accounting.
      
      Note, we will charge the skbs against their originating socket even
      if being redirected into another socket. Once the skb completes the
      redirect op, the kfree_skb will give the memory back. This is important
      because if we charged the socket we are redirecting to (like it was
      done before this series), the sock_writeable() test could fail because
      the skb being sent is already charged against that socket.
      
      Also, the TLS case is special. Here we wait until we have decided not to
      simply PASS the packet up the stack. In the case where we PASS the
      packet up the stack, we already have an skb which is accounted for on
      the TLS socket context.
      
      For the parser case we continue to just set/clear skb->sk. This is
      because the skb being used here may be combined with other skbs or
      turned into multiple skbs depending on the parser logic. For example
      the parser could request a payload length greater than skb->len, so
      that the strparser needs to collect multiple skbs. At any rate,
      the final result will be handled in the strparser recv callback.
      
      Fixes: 604326b4 ("bpf, sockmap: convert to generic sk_msg interface")
      Signed-off-by: John Fastabend <john.fastabend@gmail.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Link: https://lore.kernel.org/bpf/160226867513.5692.10579573214635925960.stgit@john-Precision-5820-Tower
      0b17ad25
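      For context, this is roughly what skb_set_owner_r() does, paraphrased from
      include/net/sock.h (consult the kernel source for the exact code): it
      assigns the socket, sets the sock_rfree destructor so kfree_skb() uncharges
      the memory again, and charges the truesize to sk_rmem_alloc:
      
        static inline void skb_set_owner_r(struct sk_buff *skb, struct sock *sk)
        {
                skb_orphan(skb);
                skb->sk = sk;
                skb->destructor = sock_rfree;   /* uncharges on kfree_skb() */
                atomic_add(skb->truesize, &sk->sk_rmem_alloc);
                sk_mem_charge(sk, skb->truesize);
        }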
    • bpf, sockmap: Remove skb_orphan and let normal skb_kfree do cleanup · 10d58d00
      John Fastabend committed
      Calling skb_orphan() is unnecessary in the strp rcv handler because the skb
      is from a skb_clone() in __strp_recv(). So it never has a destructor or an
      sk assigned. Plus it's confusing to read because it might hint to the reader
      that the skb could have an sk assigned, which is not true. Even if we did
      have an sk assigned, it would be cleaner to simply wait for the upcoming
      kfree_skb().
      
      Additionally, move the comment about the strparser clone up so it's closer
      to the logic it is describing, and add to it so that it is more complete.
      
      Fixes: 604326b4 ("bpf, sockmap: convert to generic sk_msg interface")
      Signed-off-by: John Fastabend <john.fastabend@gmail.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Link: https://lore.kernel.org/bpf/160226865548.5692.9098315689984599579.stgit@john-Precision-5820-Tower
      10d58d00
    • bpf, sockmap: Remove dropped data on errors in redirect case · 9047f19e
      John Fastabend committed
      In the sk_skb redirect case we didn't handle the case where we overrun
      the sk_rmem_alloc entry on ingress redirect or sk_wmem_alloc on egress.
      Because we didn't have anything implemented we simply dropped the skb.
      This meant data could be dropped if socket memory accounting was in
      place.
      
      This fixes the above dropped-data case by moving the memory checks
      later in the code, where we actually do the send or recv. This pushes
      those checks into the workqueue and allows us to return an EAGAIN error,
      which in turn allows us to try again later from the workqueue (a
      simplified sketch of this retry scheme follows this entry).
      
      Fixes: 51199405 ("bpf: skb_verdict, support SK_PASS on RX BPF path")
      Signed-off-by: John Fastabend <john.fastabend@gmail.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Link: https://lore.kernel.org/bpf/160226863689.5692.13861422742592309285.stgit@john-Precision-5820-Tower
      9047f19e
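      A simplified sketch of the retry scheme described above; this is not the
      kernel's sk_psock backlog code, just the shape of it, and deliver_or_send()
      is an illustrative stand-in for the actual ingress/egress handling:
      
        #include <linux/skbuff.h>
        
        /* Illustrative stand-in, not a kernel function. */
        static int deliver_or_send(struct sk_buff *skb);
        
        static void psock_backlog_work(struct sk_buff_head *backlog)
        {
                struct sk_buff *skb;
        
                while ((skb = skb_dequeue(backlog))) {
                        int err = deliver_or_send(skb);
        
                        if (err == -EAGAIN) {
                                /* Target socket over its rmem/wmem limit: put
                                 * the skb back and retry on a later workqueue
                                 * run instead of dropping it. */
                                skb_queue_head(backlog, skb);
                                break;
                        }
                        if (err < 0)
                                kfree_skb(skb);   /* hard error: drop */
                }
        }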
    • bpf, sockmap: Remove skb_set_owner_w wmem will be taken later from sendpage · 29545f49
      John Fastabend committed
      The skb_set_owner_w() is unnecessary here. The sendpage call will create a
      fresh skb and set the owner correctly from the workqueue. It's also not
      entirely harmless, because it consumes cycles, and it also impacts resource
      accounting by increasing sk_wmem_alloc. This charges the socket we are
      going to send to for the skb, but we will put it on the workqueue for some
      time before this happens, so we are artificially inflating sk_wmem_alloc
      for this period. Further, we don't know how many skbs will be used to send
      the packet or how it will be broken up when sent over the new socket, so
      charging it with one big sum is also not correct when the workqueue may
      break it up under memory pressure. Seeing that we don't know how/when this
      is going to be sent, drop the early accounting.
      
      A later patch will do proper accounting charged on receive socket for
      the case where skbs get enqueued on the workqueue.
      
      Fixes: 604326b4 ("bpf, sockmap: convert to generic sk_msg interface")
      Signed-off-by: John Fastabend <john.fastabend@gmail.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Link: https://lore.kernel.org/bpf/160226861708.5692.17964237936462425136.stgit@john-Precision-5820-Tower
      29545f49
    • bpf, sockmap: On receive programs try to fast track SK_PASS ingress · 9ecbfb06
      John Fastabend committed
      When we receive an skb and the ingress skb verdict program returns
      SK_PASS, we currently set the ingress flag and put it on the workqueue
      so it can be turned into a sk_msg and put on the sk_msg ingress queue,
      finally telling userspace with the data_ready hook.
      
      Here we observe that if the workqueue is empty then we can try to
      convert into a sk_msg type and call data_ready directly without
      bouncing through a workqueue. It's a common pattern to have a recv
      verdict program for visibility that always returns SK_PASS. In this
      case, unless there is an ENOMEM error or we overrun the socket, we
      can avoid the workqueue completely, only using it when we fall back
      to error cases caused by memory pressure (see the control-flow sketch
      after this entry).
      
      By doing this we eliminate another case where data may be dropped
      if errors occur on memory limits in workqueue.
      
      Fixes: 51199405 ("bpf: skb_verdict, support SK_PASS on RX BPF path")
      Signed-off-by: John Fastabend <john.fastabend@gmail.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Link: https://lore.kernel.org/bpf/160226859704.5692.12929678876744977669.stgit@john-Precision-5820-Tower
      9ecbfb06
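      A simplified illustration of the fast-track control flow described above;
      this is not the kernel implementation, and enqueue_as_sk_msg() and
      schedule_backlog_work() are illustrative stand-ins:
      
        #include <linux/skbuff.h>
        #include <net/sock.h>
        
        /* Illustrative stand-ins, not kernel functions. */
        static int enqueue_as_sk_msg(struct sock *sk, struct sk_buff *skb);
        static void schedule_backlog_work(void);
        
        static void ingress_verdict_pass(struct sock *sk, struct sk_buff *skb,
                                         struct sk_buff_head *backlog)
        {
                if (skb_queue_empty(backlog)) {
                        /* Nothing queued ahead of us: convert to sk_msg and
                         * wake the receiver directly, bypassing the workqueue. */
                        if (!enqueue_as_sk_msg(sk, skb)) {
                                sk->sk_data_ready(sk);
                                return;
                        }
                        /* ENOMEM or rcvbuf overrun: fall back to deferred path. */
                }
                skb_queue_tail(backlog, skb);
                schedule_backlog_work();
        }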
    • bpf, sockmap: Skb verdict SK_PASS to self already checked rmem limits · cfea28f8
      John Fastabend committed
      For the sk_skb case where the skb_verdict program returns SK_PASS to
      continue to pass the packet up the stack, the memory limits were already
      checked before enqueuing in skb_queue_tail from the TCP side. So, let's
      remove the extra checks here. The theory is that if the TCP stack believes
      we have memory to receive the packet, then let's trust the stack and not
      double-check the limits.
      
      In fact the accounting here can cause a drop if sk_rmem_alloc has increased
      after the stack accepted this packet, but before the duplicate check here.
      And worse, if this happens there is no retransmit, because the TCP stack
      already believes the data has been received.
      
      Fixes: 51199405 ("bpf: skb_verdict, support SK_PASS on RX BPF path")
      Signed-off-by: John Fastabend <john.fastabend@gmail.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Link: https://lore.kernel.org/bpf/160226857664.5692.668205469388498375.stgit@john-Precision-5820-Tower
      cfea28f8
    • bpf: Migrate from patchwork.ozlabs.org to patchwork.kernel.org. · ebb034b1
      Alexei Starovoitov committed
      Move the bpf/bpf-next patch processing queue to patchwork.kernel.org.
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Link: https://lore.kernel.org/bpf/20201011200149.66537-1-alexei.starovoitov@gmail.com
      ebb034b1
    • bpf: Always return target ifindex in bpf_fib_lookup · d1c362e1
      Toke Høiland-Jørgensen committed
      The bpf_fib_lookup() helper performs a neighbour lookup for the destination
      IP and returns BPF_FIB_LKUP_RET_NO_NEIGH if this fails, with the expectation
      that the BPF program will pass the packet up the stack in this case.
      However, with the addition of bpf_redirect_neigh(), that helper can be used
      instead to perform the neighbour lookup, at the cost of a bit of duplicated
      work.
      
      For that we still need the target ifindex, and since bpf_fib_lookup()
      already has that at the time it performs the neighbour lookup, there is
      really no reason why it can't just return it in any case. So let's just
      always return the ifindex if the FIB lookup itself succeeds (a hedged
      usage sketch follows this entry).
      Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Cc: David Ahern <dsahern@gmail.com>
      Link: https://lore.kernel.org/bpf/20201009184234.134214-1-toke@redhat.com
      d1c362e1
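      A rough sketch of how a tc program might use the now always-filled
      ifindex; this is not taken from the patch, populating the lookup params
      and rewriting MAC addresses are elided, and only bpf_redirect() is called
      here since the exact bpf_redirect_neigh() signature depends on kernel
      version:
      
        #include <linux/bpf.h>
        #include <linux/pkt_cls.h>
        #include <bpf/bpf_helpers.h>
        
        SEC("classifier")
        int fwd(struct __sk_buff *skb)
        {
                struct bpf_fib_lookup p = {};
                int rc;
        
                /* ... fill p.family, addresses and p.ifindex from the packet ... */
                rc = bpf_fib_lookup(skb, &p, sizeof(p), 0);
                if (rc == BPF_FIB_LKUP_RET_SUCCESS)
                        return bpf_redirect(p.ifindex, 0);  /* after MAC rewrite */
                /* On BPF_FIB_LKUP_RET_NO_NEIGH, p.ifindex is now filled in too,
                 * so the program could hand the packet to a neighbour-resolving
                 * redirect (bpf_redirect_neigh) instead of punting it up the
                 * stack. Here we just let the stack handle it. */
                return TC_ACT_OK;
        }
        
        char _license[] SEC("license") = "GPL";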
    • Merge branch 'samples: bpf: Refactor XDP programs with libbpf' · 52b07e56
      Alexei Starovoitov committed
      "Daniel T. Lee" says:
      
      ====================
      To avoid confusion caused by the increasing fragmentation of BPF
      loader programs, this patchset converts samples from the previous
      bpf_load loader to the libbpf loader.
      
      Thanks to libbpf's bpf_link interface, managing a tracepoint BPF
      program is much easier. bpf_program__attach_tracepoint() manages
      enabling the tracepoint event and attaching the BPF program to it
      through a single bpf_link, so there is no need to manage event_fd and
      prog_fd separately.
      
      And due to the addition of the generic bpf_program__attach() to libbpf,
      it is now possible to attach BPF programs with __attach() instead of
      explicitly calling __attach_<type>() (a minimal sketch follows this
      entry).
      
      This patchset refactors xdp_monitor using this libbpf API; bpf_load is
      removed and the sample is migrated to libbpf. Also, attach_tracepoint()
      is replaced with the generic __attach() method in xdp_redirect_cpu.
      Moreover, maps in the kernel programs have been converted to
      BTF-defined maps.
      ---
      Changes in v2:
       - added cleanup logic for bpf_link and bpf_object in xdp_monitor
       - program section match with bpf_program__is_<type> instead of strncmp
       - revert BTF key/val type to default of BPF_MAP_TYPE_PERF_EVENT_ARRAY
       - split increment into a separate statement
       - refactor pointer array initialization
       - error code cleanup
      ====================
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      52b07e56
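      A minimal userspace sketch of the libbpf flow the cover letter describes;
      the object file name is illustrative and error handling is trimmed:
      
        #include <bpf/libbpf.h>
        
        int main(void)
        {
                struct bpf_object *obj;
                struct bpf_program *prog;
                struct bpf_link *link;
        
                obj = bpf_object__open_file("xdp_monitor_kern.o", NULL);
                if (libbpf_get_error(obj))
                        return 1;
                if (bpf_object__load(obj))
                        return 1;
        
                /* Generic attach: libbpf infers the attach type from each
                 * program's SEC() name, so no explicit
                 * bpf_program__attach_tracepoint() calls are needed. */
                bpf_object__for_each_program(prog, obj) {
                        link = bpf_program__attach(prog);
                        if (libbpf_get_error(link))
                                return 1;
                }
        
                /* ... poll maps, then bpf_link__destroy()/bpf_object__close() ... */
                return 0;
        }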
    • samples: bpf: Refactor XDP kern program maps with BTF-defined map · 321f6324
      Daniel T. Lee committed
      Most of the samples were converted to use the new BTF-defined maps as
      they moved to libbpf, but some of the samples were missed.
      
      Instead of using the previous BPF map definition, this commit refactors
      the xdp_monitor and xdp_sample_pkts_kern map definitions to the new
      BTF-defined map format.
      
      Also, this commit removes the max_entries attribute from the
      PERF_EVENT_ARRAY map type. libbpf's bpf_object__create_map() will
      automatically set max_entries to the maximum configured number of CPUs
      on the host (see the sketch after this entry).
      Signed-off-by: Daniel T. Lee <danieltimlee@gmail.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Acked-by: Andrii Nakryiko <andrii@kernel.org>
      Link: https://lore.kernel.org/bpf/20201010181734.1109-4-danieltimlee@gmail.com
      321f6324
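      A sketch of the BTF-defined map format being converted to; the map name
      is illustrative and not taken from the samples:
      
        #include <linux/bpf.h>
        #include <bpf/bpf_helpers.h>
        
        /* BTF-defined perf event array. No max_entries: libbpf sizes
         * PERF_EVENT_ARRAY maps to the number of possible CPUs. */
        struct {
                __uint(type, BPF_MAP_TYPE_PERF_EVENT_ARRAY);
                __uint(key_size, sizeof(int));
                __uint(value_size, sizeof(__u32));
        } my_events SEC(".maps");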
    • samples: bpf: Replace attach_tracepoint() to attach() in xdp_redirect_cpu · 151936bf
      Daniel T. Lee committed
      From commit d7a18ea7 ("libbpf: Add generic bpf_program__attach()"),
      for some BPF programs, it is now possible to attach BPF programs
      with __attach() instead of explicitly calling __attach_<type>().
      
      This commit refactors the __attach_tracepoint() call with libbpf's
      generic __attach() method. In addition, it refactors the logic of
      setting the map FD to simplify the code. Also, the missing removal of
      bpf_load.o in the Makefile has been fixed.
      Signed-off-by: Daniel T. Lee <danieltimlee@gmail.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Acked-by: Andrii Nakryiko <andrii@kernel.org>
      Link: https://lore.kernel.org/bpf/20201010181734.1109-3-danieltimlee@gmail.com
      151936bf
    • samples: bpf: Refactor xdp_monitor with libbpf · 8ac91df6
      Daniel T. Lee committed
      To avoid confusion caused by the increasing fragmentation of BPF
      loader programs, this commit changes xdp_monitor to the libbpf loader
      instead of using bpf_load.
      
      Thanks to libbpf's bpf_link interface, managing a tracepoint BPF
      program is much easier. bpf_program__attach_tracepoint() manages
      enabling the tracepoint event and attaching the BPF program to it
      through a single bpf_link, so there is no need to manage event_fd and
      prog_fd separately.
      
      This commit refactors xdp_monitor using this libbpf API; bpf_load is
      removed and the sample is migrated to libbpf.
      Signed-off-by: Daniel T. Lee <danieltimlee@gmail.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Acked-by: Andrii Nakryiko <andrii@kernel.org>
      Link: https://lore.kernel.org/bpf/20201010181734.1109-2-danieltimlee@gmail.com
      8ac91df6
    • Merge branch 'Follow-up BPF helper improvements' · 673e3752
      Alexei Starovoitov committed
      Daniel Borkmann says:
      
      ====================
      
      This series addresses most of the feedback [0] that was to be followed
      up on from the last series, that is, UAPI helper comment improvements and
      getting rid of the ifindex obj file hacks in the selftest by using a
      BPF map instead. I'm planning to do the __sk_buff data/data_end pointer
      work in a later round, as well as the mem*() BPF improvements we have
      in Cilium for libbpf. Next, the series adds two features: i) a helper
      called redirect_peer() to improve latency on netns switch, and ii)
      allowing map-in-map with dynamic inner array map sizes. Selftests for
      each are added as well. For details, please check individual patches,
      thanks!
      
        [0] https://lore.kernel.org/bpf/cover.1601477936.git.daniel@iogearbox.net/
      
      v5 -> v6:
        - Going with Andrii's suggestion to make the misconfigured verifier
          test more robust, and only probe on -EOPNOTSUPP (Andrii)
      v4 -> v5:
        - Replace cnt == -EOPNOTSUPP check with cnt < 0; I've used < 0
          here as I think it's useful to keep the existing cnt == 0 ||
          cnt >= ARRAY_SIZE(insn_buf) for error detection (Andrii)
      v3 -> v4:
        - Rename new array map flag to BPF_F_INNER_MAP (Alexei)
      v2 -> v3:
        - Remove tab that slipped into uapi helper desc (Jakub)
        - Rework map in map for array to error from map_gen_lookup (Andrii)
      v1 -> v2:
        - Fixed selftest comment wrt inner1/inner2 value (Yonghong)
      ====================
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      673e3752
    • bpf, selftests: Add redirect_peer selftest · 9f4c53ca
      Daniel Borkmann committed
      Extend the test_tc_redirect test and add a small test that exercises the new
      redirect_peer() helper for the IPv4 and IPv6 case.
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Link: https://lore.kernel.org/bpf/20201010234006.7075-7-daniel@iogearbox.net
      9f4c53ca
    • bpf, selftests: Make redirect_neigh test more extensible · 57a73fe7
      Daniel Borkmann committed
      Rename it to test_tc_redirect.sh and move setup and test code into separate
      functions so they can be reused for newly added tests in here. Also remove
      the crude hack to override the ifindex inside the object file via xxd and
      sed and just use a simple map instead, given iproute2 does not fully
      support BTF and therefore no global data at this point.
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Acked-by: Yonghong Song <yhs@fb.com>
      Link: https://lore.kernel.org/bpf/20201010234006.7075-6-daniel@iogearbox.net
      57a73fe7
    • bpf, selftests: Add test for different array inner map size · 6775dab7
      Daniel Borkmann committed
      Extend the "diff_size" subtest to also include a non-inlined array map variant
      where dynamic inner #elems are possible.
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Acked-by: Yonghong Song <yhs@fb.com>
      Link: https://lore.kernel.org/bpf/20201010234006.7075-5-daniel@iogearbox.net
      6775dab7
    • bpf: Allow for map-in-map with dynamic inner array map entries · 4a8f87e6
      Daniel Borkmann committed
      Recent work in f4d05259 ("bpf: Add map_meta_equal map ops") and 134fede4
      ("bpf: Relax max_entries check for most of the inner map types") added support
      for dynamic inner max elements for most map-in-map types. Exceptions were maps
      like array or prog array where the map_gen_lookup() callback uses the maps'
      max_entries field as a constant when emitting instructions.
      
      We recently implemented Maglev consistent hashing into Cilium's load
      balancer, which uses map-in-map with an outer hash map and inner array
      maps holding the Maglev backend table for each service. It has been
      designed this way in order to reduce overall memory consumption, given
      the outer hash map avoids preallocating a large, flat memory area for
      all services. Also, the number of service mappings is not always known
      a priori.
      
      The use case for dynamic inner array map entries is to further reduce
      memory overhead; for example, some services might just have a small
      number of backends while others could have a large number. Right now the
      Maglev backend tables for small and large numbers of backends would need
      to have the same inner array map entries, which adds a lot of unneeded
      overhead.
      
      Dynamic inner array map entries can be realized by avoiding the inlined
      code generation for their lookup. The lookup will still be efficient since
      it will be calling into array_map_lookup_elem() directly and thus avoiding
      the retpoline. The patch adds a BPF_F_INNER_MAP flag to map creation which
      therefore skips inline code generation and relaxes the
      array_map_meta_equal() check to ignore both maps' max_entries. This still
      allows faster lookups for map-in-map when BPF_F_INNER_MAP is not specified
      and dynamic max_entries is hence not needed (a BTF-defined map-in-map
      sketch using the flag follows this entry).
      
      Example code generation where the inner map is a dynamically sized array:
      
        # bpftool p d x i 125
        int handle__sys_enter(void * ctx):
        ; int handle__sys_enter(void *ctx)
           0: (b4) w1 = 0
        ; int key = 0;
           1: (63) *(u32 *)(r10 -4) = r1
           2: (bf) r2 = r10
        ;
           3: (07) r2 += -4
        ; inner_map = bpf_map_lookup_elem(&outer_arr_dyn, &key);
           4: (18) r1 = map[id:468]
           6: (07) r1 += 272
           7: (61) r0 = *(u32 *)(r2 +0)
           8: (35) if r0 >= 0x3 goto pc+5
           9: (67) r0 <<= 3
          10: (0f) r0 += r1
          11: (79) r0 = *(u64 *)(r0 +0)
          12: (15) if r0 == 0x0 goto pc+1
          13: (05) goto pc+1
          14: (b7) r0 = 0
          15: (b4) w6 = -1
        ; if (!inner_map)
          16: (15) if r0 == 0x0 goto pc+6
          17: (bf) r2 = r10
        ;
          18: (07) r2 += -4
        ; val = bpf_map_lookup_elem(inner_map, &key);
          19: (bf) r1 = r0                               | No inlining but instead
          20: (85) call array_map_lookup_elem#149280     | call to array_map_lookup_elem()
        ; return val ? *val : -1;                        | for inner array lookup.
          21: (15) if r0 == 0x0 goto pc+1
        ; return val ? *val : -1;
          22: (61) r6 = *(u32 *)(r0 +0)
        ; }
          23: (bc) w0 = w6
          24: (95) exit
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Acked-by: Andrii Nakryiko <andrii@kernel.org>
      Link: https://lore.kernel.org/bpf/20201010234006.7075-4-daniel@iogearbox.net
      4a8f87e6
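      A sketch of a BTF-defined map-in-map using the new flag so that inner
      arrays can differ in max_entries; map names and sizes are illustrative,
      not taken from Cilium or the selftests:
      
        #include <linux/bpf.h>
        #include <bpf/bpf_helpers.h>
        
        /* Inner array template: BPF_F_INNER_MAP skips the inlined lookup so
         * actual inner maps may be created with any max_entries. */
        struct inner_array {
                __uint(type, BPF_MAP_TYPE_ARRAY);
                __uint(map_flags, BPF_F_INNER_MAP);
                __uint(max_entries, 1);          /* template size only */
                __type(key, __u32);
                __type(value, __u64);
        } backend_tmpl SEC(".maps");
        
        struct {
                __uint(type, BPF_MAP_TYPE_HASH_OF_MAPS);
                __uint(max_entries, 1024);
                __type(key, __u32);              /* e.g. service id */
                __array(values, struct inner_array);
        } services SEC(".maps");
      
      In the program, bpf_map_lookup_elem(&services, &svc_id) returns a pointer
      to the chosen inner map, and a second bpf_map_lookup_elem() on that
      pointer goes through array_map_lookup_elem() rather than inlined code.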
    • bpf: Add redirect_peer helper · 9aa1206e
      Daniel Borkmann committed
      Add an efficient ingress-to-ingress netns switch that can be used out of
      tc BPF programs in order to redirect traffic from host ns ingress into a
      container veth device ingress without having to go via the CPU backlog
      queue [0]. For local containers this can also be utilized, and the path
      via the CPU backlog queue only needs to be taken once, not twice. On a
      high level this borrows from ipvlan, which does a similar switch in
      __netif_receive_skb_core() and then iterates via another_round. This
      helps to reduce latency for the mentioned use cases (a hedged usage
      sketch follows this entry).
      
      Pod to remote pod with redirect(), TCP_RR [1]:
      
        # percpu_netperf 10.217.1.33
                RT_LATENCY:         122.450         (per CPU:         122.666         122.401         122.333         122.401 )
              MEAN_LATENCY:         121.210         (per CPU:         121.100         121.260         121.320         121.160 )
            STDDEV_LATENCY:         120.040         (per CPU:         119.420         119.910         125.460         115.370 )
               MIN_LATENCY:          46.500         (per CPU:          47.000          47.000          47.000          45.000 )
               P50_LATENCY:         118.500         (per CPU:         118.000         119.000         118.000         119.000 )
               P90_LATENCY:         127.500         (per CPU:         127.000         128.000         127.000         128.000 )
               P99_LATENCY:         130.750         (per CPU:         131.000         131.000         129.000         132.000 )
      
          TRANSACTION_RATE:       32666.400         (per CPU:        8152.200        8169.842        8174.439        8169.897 )
      
      Pod to remote pod with redirect_peer(), TCP_RR:
      
        # percpu_netperf 10.217.1.33
                RT_LATENCY:          44.449         (per CPU:          43.767          43.127          45.279          45.622 )
              MEAN_LATENCY:          45.065         (per CPU:          44.030          45.530          45.190          45.510 )
            STDDEV_LATENCY:          84.823         (per CPU:          66.770          97.290          84.380          90.850 )
               MIN_LATENCY:          33.500         (per CPU:          33.000          33.000          34.000          34.000 )
               P50_LATENCY:          43.250         (per CPU:          43.000          43.000          43.000          44.000 )
               P90_LATENCY:          46.750         (per CPU:          46.000          47.000          47.000          47.000 )
               P99_LATENCY:          52.750         (per CPU:          51.000          54.000          53.000          53.000 )
      
          TRANSACTION_RATE:       90039.500         (per CPU:       22848.186       23187.089       22085.077       21919.130 )
      
        [0] https://linuxplumbersconf.org/event/7/contributions/674/attachments/568/1002/plumbers_2020_cilium_load_balancer.pdf
        [1] https://github.com/borkmann/netperf_scripts/blob/master/percpu_netperf
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Link: https://lore.kernel.org/bpf/20201010234006.7075-3-daniel@iogearbox.net
      9aa1206e
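      A minimal sketch of a host-side tc ingress program using the new helper;
      the ifindex value is illustrative (in practice it would come from a map
      or config), and MAC rewriting/policy is elided:
      
        #include <linux/bpf.h>
        #include <bpf/bpf_helpers.h>
        
        SEC("classifier")
        int tc_host_ingress(struct __sk_buff *skb)
        {
                __u32 host_veth_ifindex = 42;    /* illustrative */
        
                /* Redirect into the ingress of the *peer* of the given device
                 * (e.g. the container end of a veth pair), skipping the CPU
                 * backlog queue. */
                return bpf_redirect_peer(host_veth_ifindex, 0);
        }
        
        char _license[] SEC("license") = "GPL";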
    • bpf: Improve bpf_redirect_neigh helper description · dd2ce6a5
      Daniel Borkmann committed
      Follow-up to address David's feedback that we should better describe internals
      of the bpf_redirect_neigh() helper.
      Suggested-by: David Ahern <dsahern@gmail.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Reviewed-by: David Ahern <dsahern@gmail.com>
      Link: https://lore.kernel.org/bpf/20201010234006.7075-2-daniel@iogearbox.net
      dd2ce6a5
  2. 10 October 2020, 5 commits
  3. 09 October 2020, 2 commits
  4. 08 October 2020, 9 commits
    • Merge branch 'libbpf: auto-resize relocatable LOAD/STORE instructions' · 1e9259ec
      Alexei Starovoitov committed
      Andrii Nakryiko says:
      
      ====================
      This patch set implements logic in libbpf to auto-adjust the memory size
      (1-, 2-, 4-, or 8-byte) of load/store (LD/ST/STX) instructions which have
      a BPF CO-RE field offset relocation associated with them. In practice this
      means transparent handling of 32-bit kernels, for both pointer and
      unsigned integer accesses. Signed integers are not relocatable with
      zero-extending loads/stores, so libbpf poisons them and generates a
      warning. If/when BPF gets support for sign-extending loads/stores, it
      would be possible to automatically relocate them as well.
      
      All the details are contained in patch #2's comments and commit message.
      Patch #3 is a simple change in libbpf to make advanced testing with custom
      BTF easier. Patch #4 validates correct uses of auto-resizable loads, as
      well as checks that libbpf fails invalid uses. Patch #1 skips CO-RE
      relocation for programs that had bpf_program__set_autoload(prog, false)
      set on them, reducing warnings and noise.
      
      v2->v3:
        - fix copyright (Alexei);
      v1->v2:
        - more consistent names for instruction mem size conversion routines (Alexei);
        - extended selftests to use relocatable STX instructions (Alexei);
        - added a fix for skipping CO-RE relocation for non-loadable programs.
      
      Cc: Luka Perkov <luka.perkov@sartura.hr>
      Cc: Tony Ambardar <tony.ambardar@gmail.com>
      ====================
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      1e9259ec
    • selftests/bpf: Validate libbpf's auto-sizing of LD/ST/STX instructions · 888d83b9
      Andrii Nakryiko committed
      Add selftests validating libbpf's auto-resizing of load/store instructions
      when used with CO-RE relocations. An explicit and manual approach using
      bpf_core_read() is also demonstrated and tested. A separate BPF program is
      expected to fail due to using signed integers of sizes that differ from
      the kernel's sizes.
      
      To reliably simulate 32-bit BTF (i.e., the one with sizeof(long) ==
      sizeof(void *) == 4), the selftest generates its own custom BTF and passes
      it as a replacement for the real kernel BTF. This allows testing the
      32/64-bit mix on all architectures.
      Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Link: https://lore.kernel.org/bpf/20201008001025.292064-5-andrii@kernel.org
      888d83b9
    • libbpf: Allow specifying both ELF and raw BTF for CO-RE BTF override · 2b7d88c2
      Andrii Nakryiko committed
      Use generalized BTF parsing logic, making it possible to parse BTF both
      from an ELF file and from a raw BTF dump. This makes it easier to write
      custom tests with manually generated BTFs.
      Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Link: https://lore.kernel.org/bpf/20201008001025.292064-4-andrii@kernel.org
      2b7d88c2
    • libbpf: Support safe subset of load/store instruction resizing with CO-RE · a66345bc
      Andrii Nakryiko committed
      Add support for patching instructions of the following form:
        - rX = *(T *)(rY + <off>);
        - *(T *)(rX + <off>) = rY;
        - *(T *)(rX + <off>) = <imm>, where T is one of {u8, u16, u32, u64}.
      
      For such instructions, if the actual kernel field recorded in the CO-RE
      relocation has a different size than the one recorded locally (e.g., from
      vmlinux.h), then libbpf will adjust T to an appropriate 1-, 2-, 4-, or
      8-byte load/store.
      
      In general, such a transformation is not always correct and could lead to
      an invalid final value being loaded or stored. But two classes of cases
      are always safe:
        - if both local and target (kernel) types are unsigned integers, but of
        different sizes, then it's OK to adjust the load/store instruction to
        the necessary memory size. The zero-extending nature of such
        instructions and the unsignedness make sure that the final value is
        always correct;
        - a pointer size mismatch between the BPF target architecture (which is
        always 64-bit) and a 32-bit host kernel architecture can be similarly
        resolved automatically, because a pointer is essentially an unsigned
        integer. Loading a 32-bit pointer into a 64-bit BPF register with zero
        extension will leave the correct pointer in the register.
      
      Both cases are necessary to support CO-RE on 32-bit kernels, as `unsigned
      long` in vmlinux.h generated from a 32-bit kernel is 32-bit, but when
      compiled into a BPF program for the BPF target it will be treated by the
      compiler as a 64-bit integer. Similarly, pointers in vmlinux.h are 32-bit
      for the kernel, but treated as 64-bit values by the compiler for the BPF
      target. Both problems are now resolved by libbpf for direct memory reads.
      
      But similar transformations are useful in general when kernel fields are
      "resized" from, e.g., unsigned int to unsigned long (or vice versa).
      
      Now, similar transformations for signed integers are not safe to perform,
      as they will result in incorrect sign extension of the value. If such a
      situation is detected, libbpf will emit a helpful message and will poison
      the instruction. Not failing immediately means that it's possible to guard
      the instruction based on kernel version (or other conditions) and make
      sure it's not reachable.
      
      If there is a need to read signed integers that change sizes between
      different kernels, it's possible to use the BPF_CORE_READ_BITFIELD()
      macro, which works both with bitfields and non-bitfield integers of any
      signedness and handles sign extension properly. Also, bpf_core_read()
      with the proper size and/or use of the bpf_core_field_size() relocation
      allows dealing with such complicated situations explicitly, if not as
      conveniently as direct memory reads (a sketch follows this entry).
      
      Selftests added in a separate patch in progs/test_core_autosize.c demonstrate
      both direct memory and probed use cases.
      
      BPF_CORE_READ() is not changed and it won't deal with such situations as
      automatically as direct memory reads, due to the signed integer
      limitations, which are much harder to detect and control with compiler
      macro magic. So it's encouraged to utilize direct memory reads as much
      as possible.
      Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Link: https://lore.kernel.org/bpf/20201008001025.292064-3-andrii@kernel.org
      a66345bc
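      A sketch of the explicit fallback mentioned above, for a field whose size
      may differ between local and target kernels; the struct and field names
      are illustrative, not real kernel fields:
      
        #include <bpf/bpf_helpers.h>
        #include <bpf/bpf_core_read.h>
        
        /* Local, CO-RE-relocatable view of some kernel struct. */
        struct foo___local {
                long some_signed_field;          /* illustrative */
        } __attribute__((preserve_access_index));
        
        static __always_inline long read_signed_field(void *obj)
        {
                struct foo___local *f = obj;
                long val = 0;
        
                /* bpf_core_field_size() resolves to the *target* kernel's field
                 * size, so exactly that many bytes are copied. Note this copies
                 * raw bytes; proper sign extension for narrower fields would
                 * need BPF_CORE_READ_BITFIELD() or explicit handling. */
                bpf_core_read(&val, bpf_core_field_size(f->some_signed_field),
                              &f->some_signed_field);
                return val;
        }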
    • libbpf: Skip CO-RE relocations for not loaded BPF programs · 47f7cf63
      Andrii Nakryiko committed
      Bypass the CO-RE relocations step for BPF programs that are not going to
      be loaded. This allows having BPF programs compiled in but disabled
      dynamically if the kernel is not expected to provide enough relocation
      information. In such a case, there won't be unnecessary warnings about
      failed relocations (see the sketch after this entry).
      
      Fixes: d9297581 ("libbpf: Support disabling auto-loading BPF programs")
      Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Link: https://lore.kernel.org/bpf/20201008001025.292064-2-andrii@kernel.org
      47f7cf63
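      A minimal sketch of the usage pattern this fixes; the object/program names
      and the feature probe are illustrative:
      
        #include <stdbool.h>
        #include <bpf/libbpf.h>
        
        static bool kernel_supports_feature(void);   /* illustrative probe */
        
        static int load_with_optional_probe(void)
        {
                struct bpf_object *obj = bpf_object__open_file("probe.bpf.o", NULL);
                struct bpf_program *prog =
                        bpf_object__find_program_by_name(obj, "optional_probe");
        
                /* If the kernel can't satisfy this program's CO-RE relocations,
                 * disable it up front. With this fix, libbpf then also skips its
                 * CO-RE relocation step, so no spurious warnings are emitted. */
                if (prog && !kernel_supports_feature())
                        bpf_program__set_autoload(prog, false);
        
                return bpf_object__load(obj);
        }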
    • libbpf: Fix compatibility problem in xsk_socket__create · 80348d88
      Magnus Karlsson committed
      Fix a compatibility problem when the old XDP_SHARED_UMEM mode is used
      together with the xsk_socket__create() call. In the old XDP_SHARED_UMEM
      mode, only sharing of the same device and queue id was allowed, and
      in this mode, the fill ring and completion ring were shared between
      the AF_XDP sockets.
      
      Therefore, it was perfectly fine to call the xsk_socket__create() API
      for each socket and not use the new xsk_socket__create_shared() API.
      This behavior was broken by the commit introducing XDP_SHARED_UMEM
      support between different devices and/or queue ids. This patch restores
      the ability to use xsk_socket__create() in these circumstances, so that
      backward compatibility is not broken (a usage sketch follows this entry).
      
      Fixes: 2f6324a3 ("libbpf: Support shared umems between queues and devices")
      Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Link: https://lore.kernel.org/bpf/1602070946-11154-1-git-send-email-magnus.karlsson@gmail.com
      80348d88
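      A rough sketch of the restored usage, two AF_XDP sockets on the same
      device and queue sharing one umem, each created with plain
      xsk_socket__create(); names are illustrative and umem/ring setup and
      error handling are elided:
      
        #include <bpf/xsk.h>
        
        static int create_two_sockets(struct xsk_umem *umem)
        {
                struct xsk_socket *xsk1, *xsk2;
                struct xsk_ring_cons rx1, rx2;
                struct xsk_ring_prod tx1, tx2;
                int err;
        
                /* umem (with its fill/completion rings) was set up once
                 * beforehand via xsk_umem__create(). */
                err = xsk_socket__create(&xsk1, "eth0", 0, umem, &rx1, &tx1, NULL);
                if (err)
                        return err;
                return xsk_socket__create(&xsk2, "eth0", 0, umem, &rx2, &tx2, NULL);
        }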
    • bpf: Fix build failure for kernel/trace/bpf_trace.c with CONFIG_NET=n · ebfb4d40
      Yonghong Song committed
      When CONFIG_NET is not defined, I hit the following build error:
          kernel/trace/bpf_trace.o:(.rodata+0x110): undefined reference to `bpf_prog_test_run_raw_tp'
      
      Commit 1b4d60ec ("bpf: Enable BPF_PROG_TEST_RUN for raw_tracepoint")
      added test_run support for raw_tracepoint in kernel/trace/bpf_trace.c.
      But the test_run function bpf_prog_test_run_raw_tp is defined in
      net/bpf/test_run.c, which is only available with CONFIG_NET=y.
      
      Adding a CONFIG_NET guard for
          .test_run = bpf_prog_test_run_raw_tp;
      fixed the above build issue (see the sketch after this entry).
      
      Fixes: 1b4d60ec ("bpf: Enable BPF_PROG_TEST_RUN for raw_tracepoint")
      Signed-off-by: Yonghong Song <yhs@fb.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Link: https://lore.kernel.org/bpf/20201007062933.3425899-1-yhs@fb.com
      ebfb4d40
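      The resulting shape of the fix, roughly (the ops struct name is taken from
      kernel/trace/bpf_trace.c but should be treated as illustrative here):
      
        const struct bpf_prog_ops raw_tracepoint_prog_ops = {
        #ifdef CONFIG_NET
                .test_run = bpf_prog_test_run_raw_tp,
        #endif
        };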
    • kernel/bpf/verifier: Fix build when NET is not enabled · 49a2a4d4
      Randy Dunlap committed
      Fix build errors in kernel/bpf/verifier.c when CONFIG_NET is
      not enabled.
      
      ../kernel/bpf/verifier.c:3995:13: error: ‘btf_sock_ids’ undeclared here (not in a function); did you mean ‘bpf_sock_ops’?
        .btf_id = &btf_sock_ids[BTF_SOCK_TYPE_SOCK_COMMON],
      
      ../kernel/bpf/verifier.c:3995:26: error: ‘BTF_SOCK_TYPE_SOCK_COMMON’ undeclared here (not in a function); did you mean ‘PTR_TO_SOCK_COMMON’?
        .btf_id = &btf_sock_ids[BTF_SOCK_TYPE_SOCK_COMMON],
      
      Fixes: 1df8f55a ("bpf: Enable bpf_skc_to_* sock casting helper to networking prog type")
      Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Acked-by: Yonghong Song <yhs@fb.com>
      Link: https://lore.kernel.org/bpf/20201007021613.13646-1-rdunlap@infradead.org
      49a2a4d4
  5. 07 October 2020, 2 commits