1. 12 Oct 2020, 3 commits
    • bpf: Allow for map-in-map with dynamic inner array map entries · 4a8f87e6
      Committed by Daniel Borkmann
      Recent work in f4d05259 ("bpf: Add map_meta_equal map ops") and 134fede4
      ("bpf: Relax max_entries check for most of the inner map types") added support
      for dynamic inner max elements for most map-in-map types. Exceptions were maps
      like array or prog array where the map_gen_lookup() callback uses the maps'
      max_entries field as a constant when emitting instructions.
      
      We recently implemented Maglev consistent hashing in Cilium's load balancer,
      which uses map-in-map with a hash outer map and an array inner map holding
      the Maglev backend table for each service. It was designed this way to reduce
      overall memory consumption, given that the outer hash map avoids preallocating
      a large, flat memory area for all services. Also, the number of service
      mappings is not always known a priori.

      The use case for dynamic inner array map entries is to further reduce memory
      overhead: for example, some services might have just a small number of
      backends while others could have a large number. Right now the Maglev backend
      tables for small and large services would need the same number of inner array
      map entries, which adds a lot of unneeded overhead.
      
      Dynamic inner array map entries can be realized by avoiding the inlined code
      generation for their lookup. The lookup will still be efficient since it will
      be calling into array_map_lookup_elem() directly and thus avoiding retpoline.
      The patch adds a BPF_F_INNER_MAP flag to map creation which skips the inline
      code generation and relaxes the array_map_meta_equal() check to ignore both
      maps' max_entries. Map-in-map lookups remain fast when BPF_F_INNER_MAP is not
      specified and dynamic max_entries is not needed.
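
      As a rough illustration (not taken from the patch), a BTF-defined map-in-map
      whose inner array template carries BPF_F_INNER_MAP might look as follows; the
      outer map name mirrors the dump below, while the map types, sizes, value type
      and attach point are assumptions:

        #include <linux/bpf.h>
        #include <bpf/bpf_helpers.h>

        /* inner array template: BPF_F_INNER_MAP skips the inlined lookup, so
         * actual inner maps may be created with differing max_entries */
        struct inner_arr {
                __uint(type, BPF_MAP_TYPE_ARRAY);
                __uint(map_flags, BPF_F_INNER_MAP);
                __uint(max_entries, 1);         /* template value only */
                __type(key, __u32);
                __type(value, int);
        };

        struct {
                __uint(type, BPF_MAP_TYPE_ARRAY_OF_MAPS);
                __uint(max_entries, 3);
                __type(key, __u32);
                __array(values, struct inner_arr);
        } outer_arr_dyn SEC(".maps");

        SEC("raw_tp/sys_enter")
        int handle__sys_enter(void *ctx)
        {
                int key = 0, *val;
                void *inner_map;

                inner_map = bpf_map_lookup_elem(&outer_arr_dyn, &key);
                if (!inner_map)
                        return -1;
                val = bpf_map_lookup_elem(inner_map, &key);
                return val ? *val : -1;
        }

        char LICENSE[] SEC("license") = "GPL";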
      
      Example code generation where the inner map is a dynamically sized array:
      
        # bpftool p d x i 125
        int handle__sys_enter(void * ctx):
        ; int handle__sys_enter(void *ctx)
           0: (b4) w1 = 0
        ; int key = 0;
           1: (63) *(u32 *)(r10 -4) = r1
           2: (bf) r2 = r10
        ;
           3: (07) r2 += -4
        ; inner_map = bpf_map_lookup_elem(&outer_arr_dyn, &key);
           4: (18) r1 = map[id:468]
           6: (07) r1 += 272
           7: (61) r0 = *(u32 *)(r2 +0)
           8: (35) if r0 >= 0x3 goto pc+5
           9: (67) r0 <<= 3
          10: (0f) r0 += r1
          11: (79) r0 = *(u64 *)(r0 +0)
          12: (15) if r0 == 0x0 goto pc+1
          13: (05) goto pc+1
          14: (b7) r0 = 0
          15: (b4) w6 = -1
        ; if (!inner_map)
          16: (15) if r0 == 0x0 goto pc+6
          17: (bf) r2 = r10
        ;
          18: (07) r2 += -4
        ; val = bpf_map_lookup_elem(inner_map, &key);
          19: (bf) r1 = r0                               | No inlining but instead
          20: (85) call array_map_lookup_elem#149280     | call to array_map_lookup_elem()
        ; return val ? *val : -1;                        | for inner array lookup.
          21: (15) if r0 == 0x0 goto pc+1
        ; return val ? *val : -1;
          22: (61) r6 = *(u32 *)(r0 +0)
        ; }
          23: (bc) w0 = w6
          24: (95) exit
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Acked-by: Andrii Nakryiko <andrii@kernel.org>
      Link: https://lore.kernel.org/bpf/20201010234006.7075-4-daniel@iogearbox.net
      4a8f87e6
    • bpf: Add redirect_peer helper · 9aa1206e
      Committed by Daniel Borkmann
      Add an efficient ingress-to-ingress netns switch that can be used from tc BPF
      programs in order to redirect traffic from host ns ingress into a container
      veth device ingress without having to go via the CPU backlog queue [0]. For
      local containers this can also be used, and the path via the CPU backlog
      queue only needs to be taken once, not twice. On a high level this borrows
      from ipvlan, which does a similar switch in __netif_receive_skb_core() and
      then iterates via another_round. This helps to reduce latency for the
      mentioned use cases.
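
      A minimal sketch of how a tc ingress program attached on the host-facing
      device might use the new helper (the ifindex value and section name are
      assumptions, not from the patch):

        #include <linux/bpf.h>
        #include <bpf/bpf_helpers.h>

        /* assumption: host-side ifindex of the pod's veth pair */
        #define HOST_VETH_IFINDEX 42

        SEC("tc")
        int tc_host_ingress(struct __sk_buff *skb)
        {
                /* switch netns and queue straight to the veth peer's ingress,
                 * bypassing the CPU backlog queue */
                return bpf_redirect_peer(HOST_VETH_IFINDEX, 0);
        }

        char LICENSE[] SEC("license") = "GPL";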
      
      Pod to remote pod with redirect(), TCP_RR [1]:
      
        # percpu_netperf 10.217.1.33
                RT_LATENCY:         122.450         (per CPU:         122.666         122.401         122.333         122.401 )
              MEAN_LATENCY:         121.210         (per CPU:         121.100         121.260         121.320         121.160 )
            STDDEV_LATENCY:         120.040         (per CPU:         119.420         119.910         125.460         115.370 )
               MIN_LATENCY:          46.500         (per CPU:          47.000          47.000          47.000          45.000 )
               P50_LATENCY:         118.500         (per CPU:         118.000         119.000         118.000         119.000 )
               P90_LATENCY:         127.500         (per CPU:         127.000         128.000         127.000         128.000 )
               P99_LATENCY:         130.750         (per CPU:         131.000         131.000         129.000         132.000 )
      
          TRANSACTION_RATE:       32666.400         (per CPU:        8152.200        8169.842        8174.439        8169.897 )
      
      Pod to remote pod with redirect_peer(), TCP_RR:
      
        # percpu_netperf 10.217.1.33
                RT_LATENCY:          44.449         (per CPU:          43.767          43.127          45.279          45.622 )
              MEAN_LATENCY:          45.065         (per CPU:          44.030          45.530          45.190          45.510 )
            STDDEV_LATENCY:          84.823         (per CPU:          66.770          97.290          84.380          90.850 )
               MIN_LATENCY:          33.500         (per CPU:          33.000          33.000          34.000          34.000 )
               P50_LATENCY:          43.250         (per CPU:          43.000          43.000          43.000          44.000 )
               P90_LATENCY:          46.750         (per CPU:          46.000          47.000          47.000          47.000 )
               P99_LATENCY:          52.750         (per CPU:          51.000          54.000          53.000          53.000 )
      
          TRANSACTION_RATE:       90039.500         (per CPU:       22848.186       23187.089       22085.077       21919.130 )
      
        [0] https://linuxplumbersconf.org/event/7/contributions/674/attachments/568/1002/plumbers_2020_cilium_load_balancer.pdf
        [1] https://github.com/borkmann/netperf_scripts/blob/master/percpu_netperf
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Link: https://lore.kernel.org/bpf/20201010234006.7075-3-daniel@iogearbox.net
      9aa1206e
    • bpf: Improve bpf_redirect_neigh helper description · dd2ce6a5
      Committed by Daniel Borkmann
      Follow-up to address David's feedback that we should better describe internals
      of the bpf_redirect_neigh() helper.
      Suggested-by: David Ahern <dsahern@gmail.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Reviewed-by: David Ahern <dsahern@gmail.com>
      Link: https://lore.kernel.org/bpf/20201010234006.7075-2-daniel@iogearbox.net
      dd2ce6a5
  2. 10 Oct 2020, 5 commits
  3. 09 Oct 2020, 2 commits
  4. 08 Oct 2020, 9 commits
    • Merge branch 'libbpf: auto-resize relocatable LOAD/STORE instructions' · 1e9259ec
      Committed by Alexei Starovoitov
      Andrii Nakryiko says:
      
      ====================
      This patch set implements logic in libbpf to auto-adjust the memory size (1,
      2, 4, or 8 bytes) of load/store (LD/ST/STX) instructions which have a BPF
      CO-RE field offset relocation associated with them. In practice this means
      transparent handling of 32-bit kernels, both for pointers and for unsigned
      integers. Signed integers are not relocatable with zero-extending
      loads/stores, so libbpf poisons them and generates a warning. If/when BPF
      gets support for sign-extending loads/stores, it would be possible to
      automatically relocate them as well.
      
      All the details are contained in patch #2 comments and commit message.
      Patch #3 is a simple change in libbpf to make advanced testing with custom BTF
      easier. Patch #4 validates correct uses of auto-resizable loads, as well as
      checks that libbpf fails invalid uses. Patch #1 skips CO-RE relocation for
      programs that had bpf_program__set_autoload(prog, false) set on them, reducing
      warnings and noise.
      
      v2->v3:
        - fix copyright (Alexei);
      v1->v2:
        - more consistent names for instruction mem size conversion routines (Alexei);
        - extended selftests to use relocatable STX instructions (Alexei);
        - added a fix for skipping CO-RE relocation for non-loadable programs.
      
      Cc: Luka Perkov <luka.perkov@sartura.hr>
      Cc: Tony Ambardar <tony.ambardar@gmail.com>
      ====================
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      1e9259ec
    • selftests/bpf: Validate libbpf's auto-sizing of LD/ST/STX instructions · 888d83b9
      Committed by Andrii Nakryiko
      Add selftests validating libbpf's auto-resizing of load/store instructions
      when used with CO-RE relocations. An explicit and manual approach using
      bpf_core_read() is also demonstrated and tested. A separate BPF program is
      expected to fail because it uses signed integers of sizes that differ from
      the kernel's.

      To reliably simulate 32-bit BTF (i.e., one with sizeof(long) ==
      sizeof(void *) == 4), the selftest generates its own custom BTF and passes it
      as a replacement for the real kernel BTF. This makes it possible to test the
      32/64-bit mix on all architectures.
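
      A hedged sketch of how such a custom 32-bit BTF could be assembled with
      libbpf's BTF-writing API (type names and sizes here are purely illustrative,
      not the selftest's):

        #include <bpf/btf.h>

        static struct btf *make_fake_32bit_btf(void)
        {
                struct btf *btf = btf__new_empty();
                int ulong_id;

                if (!btf)
                        return NULL;
                btf__set_pointer_size(btf, 4);                 /* 32-bit pointers */
                ulong_id = btf__add_int(btf, "unsigned long", 4, 0);
                btf__add_struct(btf, "fake_task", 4);          /* 4-byte struct */
                btf__add_field(btf, "flags", ulong_id, 0, 0);  /* offset 0, not a bitfield */
                return btf;
        }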
      Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Link: https://lore.kernel.org/bpf/20201008001025.292064-5-andrii@kernel.org
      888d83b9
    • libbpf: Allow specifying both ELF and raw BTF for CO-RE BTF override · 2b7d88c2
      Committed by Andrii Nakryiko
      Use generalized BTF parsing logic, making it possible to parse BTF both from
      an ELF file and from a raw BTF dump. This makes it easier to write custom
      tests with manually generated BTFs.
      Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Link: https://lore.kernel.org/bpf/20201008001025.292064-4-andrii@kernel.org
      2b7d88c2
    • libbpf: Support safe subset of load/store instruction resizing with CO-RE · a66345bc
      Committed by Andrii Nakryiko
      Add support for patching instructions of the following form:
        - rX = *(T *)(rY + <off>);
        - *(T *)(rX + <off>) = rY;
        - *(T *)(rX + <off>) = <imm>, where T is one of {u8, u16, u32, u64}.
      
      For such instructions, if the actual kernel field recorded in the CO-RE
      relocation has a different size than the one recorded locally (e.g., from
      vmlinux.h), then libbpf will adjust T to an appropriate 1-, 2-, 4-, or
      8-byte load.
      
      In general, such a transformation is not always correct and could lead to an
      invalid final value being loaded or stored. But two classes of cases are
      always safe:
        - if both local and target (kernel) types are unsigned integers, but of
        different sizes, then it's OK to adjust the load/store instruction to the
        necessary memory size. The zero-extending nature of such instructions,
        together with unsignedness, makes sure that the final value is always
        correct;
        - a pointer size mismatch between the BPF target architecture (which is
        always 64-bit) and a 32-bit host kernel architecture can be similarly
        resolved automatically, because a pointer is essentially an unsigned
        integer. Loading a 32-bit pointer into a 64-bit BPF register with zero
        extension will leave a correct pointer in the register.
      
      Both cases are necessary to support CO-RE on 32-bit kernels, as `unsigned
      long` in a vmlinux.h generated from a 32-bit kernel is 32-bit, but when a BPF
      program is compiled for the BPF target the compiler treats it as a 64-bit
      integer. Similarly, pointers in vmlinux.h are 32-bit for the kernel, but are
      treated as 64-bit values by the compiler for the BPF target. Both problems
      are now resolved by libbpf for direct memory reads.
      
      But similar transformations are useful in general when kernel fields are
      "resized" from, e.g., unsigned int to unsigned long (or vice versa).
      
      Now, similar transformations for signed integers are not safe to perform, as
      they would result in incorrect sign extension of the value. If such a
      situation is detected, libbpf will emit a helpful message and will poison the
      instruction. Not failing immediately means that it's possible to guard the
      instruction based on kernel version (or other conditions) and make sure it's
      not reachable.
      
      If there is a need to read signed integers that change sizes between
      different kernels, it's possible to use the BPF_CORE_READ_BITFIELD() macro,
      which works both with bitfields and non-bitfield integers of any signedness
      and handles sign extension properly. Also, bpf_core_read() with the proper
      size and/or use of the bpf_core_field_size() relocation allows dealing with
      such complicated situations explicitly, if not as conveniently as direct
      memory reads.
      
      Selftests added in a separate patch in progs/test_core_autosize.c demonstrate
      both direct memory and probed use cases.
      
      BPF_CORE_READ() is not changed and won't handle such situations as
      automatically as direct memory reads, due to the signed-integer limitations,
      which are much harder to detect and control with compiler macro magic. So
      it's encouraged to use direct memory reads as much as possible.
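
      For illustration only (not the selftest code), here is a direct, relocatable
      read that benefits from this. Assume the local vmlinux.h declares
      task_struct::min_flt as a 64-bit unsigned long while the target 32-bit
      kernel's BTF records it as 4 bytes; libbpf can then shrink the load, whereas
      a signed field in the same spot would be poisoned:

        #include "vmlinux.h"
        #include <bpf/bpf_helpers.h>
        #include <bpf/bpf_tracing.h>

        __u64 minflt = 0;

        SEC("tp_btf/sched_switch")
        int BPF_PROG(handle_switch, bool preempt, struct task_struct *prev,
                     struct task_struct *next)
        {
                /* direct memory read carrying a CO-RE field-offset relocation;
                 * the 8-byte load is auto-shrunk to 4 bytes on a 32-bit kernel */
                minflt = prev->min_flt;
                return 0;
        }

        char LICENSE[] SEC("license") = "GPL";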
      Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Link: https://lore.kernel.org/bpf/20201008001025.292064-3-andrii@kernel.org
      a66345bc
    • libbpf: Skip CO-RE relocations for not loaded BPF programs · 47f7cf63
      Committed by Andrii Nakryiko
      Bypass the CO-RE relocation step for BPF programs that are not going to be
      loaded. This makes it possible to compile in BPF programs and disable them
      dynamically when the kernel is not expected to provide enough relocation
      information. In that case, there won't be unnecessary warnings about failed
      relocations.
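
      A hedged userspace sketch of the pattern this helps (the skeleton and the
      feature probe are hypothetical):

        struct my_bpf *skel = my_bpf__open();           /* hypothetical skeleton */
        int err;

        if (skel && !kernel_has_needed_btf())           /* hypothetical probe */
                bpf_program__set_autoload(skel->progs.optional_prog, false);

        /* CO-RE relocation is now skipped for optional_prog, so no warnings */
        err = my_bpf__load(skel);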
      
      Fixes: d9297581 ("libbpf: Support disabling auto-loading BPF programs")
      Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Link: https://lore.kernel.org/bpf/20201008001025.292064-2-andrii@kernel.org
      47f7cf63
    • libbpf: Fix compatibility problem in xsk_socket__create · 80348d88
      Committed by Magnus Karlsson
      Fix a compatibility problem when the old XDP_SHARED_UMEM mode is used
      together with the xsk_socket__create() call. In the old XDP_SHARED_UMEM
      mode, only sharing of the same device and queue id was allowed, and
      in this mode, the fill ring and completion ring were shared between
      the AF_XDP sockets.
      
      Therefore, it was perfectly fine to call the xsk_socket__create() API
      for each socket and not use the new xsk_socket__create_shared() API.
      This behavior was broken by the commit that introduced XDP_SHARED_UMEM
      support between different devices and/or queue ids. This patch restores the
      ability to use xsk_socket__create() in these circumstances so that backward
      compatibility is not broken.
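
      A rough sketch of the legacy pattern this restores: two AF_XDP sockets on the
      same device and queue sharing one umem (and therefore its fill and completion
      rings), each created with plain xsk_socket__create(); setup of the umem,
      rings, and config is assumed:

        struct xsk_socket *xsk1 = NULL, *xsk2 = NULL;
        struct xsk_ring_cons rx1, rx2;
        struct xsk_ring_prod tx1, tx2;
        int err;

        err = xsk_socket__create(&xsk1, "eth0", /*queue_id=*/0, umem,
                                 &rx1, &tx1, &cfg);
        if (!err)
                err = xsk_socket__create(&xsk2, "eth0", 0, umem,
                                         &rx2, &tx2, &cfg);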
      
      Fixes: 2f6324a3 ("libbpf: Support shared umems between queues and devices")
      Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Link: https://lore.kernel.org/bpf/1602070946-11154-1-git-send-email-magnus.karlsson@gmail.com
      80348d88
    • bpf: Fix build failure for kernel/trace/bpf_trace.c with CONFIG_NET=n · ebfb4d40
      Committed by Yonghong Song
      When CONFIG_NET is not defined, I hit the following build error:
          kernel/trace/bpf_trace.o:(.rodata+0x110): undefined reference to `bpf_prog_test_run_raw_tp'
      
      Commit 1b4d60ec ("bpf: Enable BPF_PROG_TEST_RUN for raw_tracepoint")
      added test_run support for raw_tracepoint in kernel/trace/bpf_trace.c.
      But the test_run function bpf_prog_test_run_raw_tp is defined in
      net/bpf/test_run.c and is only available with CONFIG_NET=y.
      
      Adding a CONFIG_NET guard for
          .test_run = bpf_prog_test_run_raw_tp;
      fixed the above build issue.
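
      The guard described above looks roughly like this in
      kernel/trace/bpf_trace.c (a sketch of the shape of the fix, not the literal
      patch):

        const struct bpf_prog_ops raw_tracepoint_prog_ops = {
        #ifdef CONFIG_NET
                /* only reference the net-only test_run handler when CONFIG_NET=y */
                .test_run = bpf_prog_test_run_raw_tp,
        #endif
        };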
      
      Fixes: 1b4d60ec ("bpf: Enable BPF_PROG_TEST_RUN for raw_tracepoint")
      Signed-off-by: Yonghong Song <yhs@fb.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Link: https://lore.kernel.org/bpf/20201007062933.3425899-1-yhs@fb.com
      ebfb4d40
    • kernel/bpf/verifier: Fix build when NET is not enabled · 49a2a4d4
      Committed by Randy Dunlap
      Fix build errors in kernel/bpf/verifier.c when CONFIG_NET is
      not enabled.
      
      ../kernel/bpf/verifier.c:3995:13: error: ‘btf_sock_ids’ undeclared here (not in a function); did you mean ‘bpf_sock_ops’?
        .btf_id = &btf_sock_ids[BTF_SOCK_TYPE_SOCK_COMMON],
      
      ../kernel/bpf/verifier.c:3995:26: error: ‘BTF_SOCK_TYPE_SOCK_COMMON’ undeclared here (not in a function); did you mean ‘PTR_TO_SOCK_COMMON’?
        .btf_id = &btf_sock_ids[BTF_SOCK_TYPE_SOCK_COMMON],
      
      Fixes: 1df8f55a ("bpf: Enable bpf_skc_to_* sock casting helper to networking prog type")
      Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Acked-by: Yonghong Song <yhs@fb.com>
      Link: https://lore.kernel.org/bpf/20201007021613.13646-1-rdunlap@infradead.org
      49a2a4d4
  5. 07 Oct 2020, 11 commits
  6. 06 Oct 2020, 2 commits
    • bpf, doc: Update Andrii's email in MAINTAINERS · dca4121c
      Committed by Andrii Nakryiko
      Update Andrii Nakryiko's reviewer email to kernel.org account. This optimizes
      email logistics on my side and makes it less likely for me to miss important
      patches.
      Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Link: https://lore.kernel.org/bpf/20201005223648.2437130-1-andrii@kernel.org
      dca4121c
    • bpf: Use raw_spin_trylock() for pcpu_freelist_push/pop in NMI · 39d8f0d1
      Committed by Song Liu
      Recent improvements in LOCKDEP highlighted a potential A-A deadlock with
      pcpu_freelist in NMI:
      
      ./tools/testing/selftests/bpf/test_progs -t stacktrace_build_id_nmi
      
      [   18.984807] ================================
      [   18.984807] WARNING: inconsistent lock state
      [   18.984808] 5.9.0-rc6-01771-g1466de1330e1 #2967 Not tainted
      [   18.984809] --------------------------------
      [   18.984809] inconsistent {INITIAL USE} -> {IN-NMI} usage.
      [   18.984810] test_progs/1990 [HC2[2]:SC0[0]:HE0:SE1] takes:
      [   18.984810] ffffe8ffffc219c0 (&head->lock){....}-{2:2}, at: __pcpu_freelist_pop+0xe3/0x180
      [   18.984813] {INITIAL USE} state was registered at:
      [   18.984814]   lock_acquire+0x175/0x7c0
      [   18.984814]   _raw_spin_lock+0x2c/0x40
      [   18.984815]   __pcpu_freelist_pop+0xe3/0x180
      [   18.984815]   pcpu_freelist_pop+0x31/0x40
      [   18.984816]   htab_map_alloc+0xbbf/0xf40
      [   18.984816]   __do_sys_bpf+0x5aa/0x3ed0
      [   18.984817]   do_syscall_64+0x2d/0x40
      [   18.984818]   entry_SYSCALL_64_after_hwframe+0x44/0xa9
      [   18.984818] irq event stamp: 12
      [...]
      [   18.984822] other info that might help us debug this:
      [   18.984823]  Possible unsafe locking scenario:
      [   18.984823]
      [   18.984824]        CPU0
      [   18.984824]        ----
      [   18.984824]   lock(&head->lock);
      [   18.984826]   <Interrupt>
      [   18.984826]     lock(&head->lock);
      [   18.984827]
      [   18.984828]  *** DEADLOCK ***
      [   18.984828]
      [   18.984829] 2 locks held by test_progs/1990:
      [...]
      [   18.984838]  <NMI>
      [   18.984838]  dump_stack+0x9a/0xd0
      [   18.984839]  lock_acquire+0x5c9/0x7c0
      [   18.984839]  ? lock_release+0x6f0/0x6f0
      [   18.984840]  ? __pcpu_freelist_pop+0xe3/0x180
      [   18.984840]  _raw_spin_lock+0x2c/0x40
      [   18.984841]  ? __pcpu_freelist_pop+0xe3/0x180
      [   18.984841]  __pcpu_freelist_pop+0xe3/0x180
      [   18.984842]  pcpu_freelist_pop+0x17/0x40
      [   18.984842]  ? lock_release+0x6f0/0x6f0
      [   18.984843]  __bpf_get_stackid+0x534/0xaf0
      [   18.984843]  bpf_prog_1fd9e30e1438d3c5_oncpu+0x73/0x350
      [   18.984844]  bpf_overflow_handler+0x12f/0x3f0
      
      This is because pcpu_freelist_head.lock is accessed in both NMI and
      non-NMI context. Fix this issue by using raw_spin_trylock() in NMI.
      
      Since NMI interrupts non-NMI context, when the NMI context tries to take the
      raw_spinlock, the non-NMI context of the same CPU may already hold that lock
      and, being interrupted, cannot release it. For a system with N CPUs, there
      could be N NMIs at the same time, and they may block N non-NMI
      raw_spinlocks. This is tricky for pcpu_freelist_push(), where unlike _pop(),
      a failing _push() means leaking memory. This issue is more likely to trigger
      on a non-SMP system.
      
      Fix this issue with an extra list, pcpu_freelist.extralist. The extralist is
      primarily used to take _push() when raw_spin_trylock() fails on all the
      per-CPU lists. It should be empty most of the time. The following table
      summarizes the behavior of pcpu_freelist in NMI and non-NMI:
      
      non-NMI pop(): 	use _lock(); check per CPU lists first;
                      if all per CPU lists are empty, check extralist;
                      if extralist is empty, return NULL.
      
      non-NMI push(): use _lock(); only push to per CPU lists.
      
      NMI pop():    use _trylock(); check per CPU lists first;
                    if all per CPU lists are locked or empty, check extralist;
                    if extralist is locked or empty, return NULL.
      
      NMI push():   use _trylock(); check per CPU lists first;
                    if all per CPU lists are locked, try to push to extralist;
                    if extralist is also locked, keep trying on per CPU lists.
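
      A simplified sketch of the NMI push path described above (condensed from the
      idea, not the literal patch; pcpu_freelist_push_node() stands in for the
      plain list insertion):

        static void pcpu_freelist_push_nmi(struct pcpu_freelist *s,
                                           struct pcpu_freelist_node *node)
        {
                int cpu, orig_cpu;

                orig_cpu = cpu = raw_smp_processor_id();
                while (true) {
                        struct pcpu_freelist_head *head;

                        head = per_cpu_ptr(s->freelist, cpu);
                        if (raw_spin_trylock(&head->lock)) {
                                pcpu_freelist_push_node(head, node);
                                raw_spin_unlock(&head->lock);
                                return;
                        }
                        /* this per-CPU list is contended, move to the next one */
                        cpu = cpumask_next(cpu, cpu_possible_mask);
                        if (cpu >= nr_cpu_ids)
                                cpu = 0;
                        /* walked all per-CPU lists once: try the extralist */
                        if (cpu == orig_cpu &&
                            raw_spin_trylock(&s->extralist.lock)) {
                                pcpu_freelist_push_node(&s->extralist, node);
                                raw_spin_unlock(&s->extralist.lock);
                                return;
                        }
                }
        }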
      Reported-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: Song Liu <songliubraving@fb.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Martin KaFai Lau <kafai@fb.com>
      Link: https://lore.kernel.org/bpf/20201005165838.3735218-1-songliubraving@fb.com
      39d8f0d1
  7. 05 Oct 2020, 2 commits
  8. 03 Oct 2020, 6 commits
    • bpf: Deref map in BPF_PROG_BIND_MAP when it's already used · 1028ae40
      Committed by Stanislav Fomichev
      We are missing a deref for the case when we are doing BPF_PROG_BIND_MAP
      on a map that's already held by the program.
      There is an 'if (ret) bpf_map_put(map)' below which doesn't trigger
      because we don't consider this an error.
      Let's add the missing bpf_map_put() for this specific condition.
      
      Fixes: ef15314a ("bpf: Add BPF_PROG_BIND_MAP syscall")
      Reported-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: Stanislav Fomichev <sdf@google.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Acked-by: Andrii Nakryiko <andriin@fb.com>
      Link: https://lore.kernel.org/bpf/20201003002544.3601440-1-sdf@google.com
      1028ae40
    • Merge branch 'Add skb_adjust_room() for SK_SKB' · fb91db01
      Committed by Alexei Starovoitov
      John Fastabend says:
      
      ====================
      This implements the helper skb_adjust_room() for BPF_SK_SKB_STREAM_VERDICT
      programs so we can push/pop headers from the data on receive. One use
      case is to pop TLS headers off kTLS packets.
      
      The first patch implements the helper and the second updates test_sockmap
      to use it removing some case handling we had to do earlier to account for
      the TLS headers in the kTLS tests.
      
      v1->v2:
       Fix error path for TLS case (Daniel)
       check mode input is 0 because we don't use it now (Daniel)
       Remove incorrect/misleading comment (Lorenz)
      
      Thanks,
      John
      Acked-by: Martin KaFai Lau <kafai@fb.com>
      ---
      ====================
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      fb91db01
    • bpf, sockmap: Update selftests to use skb_adjust_room · 91274ca5
      Committed by John Fastabend
      Instead of working around TLS headers in the sockmap selftests, use the
      new skb_adjust_room helper. This allows us to avoid special-casing the
      receive side to skip headers.
      Signed-off-by: John Fastabend <john.fastabend@gmail.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Link: https://lore.kernel.org/bpf/160160100932.7052.3646935243867660528.stgit@john-Precision-5820-Tower
      91274ca5
    • bpf, sockmap: Add skb_adjust_room to pop bytes off ingress payload · 18ebe16d
      Committed by John Fastabend
      This implements a new helper skb_adjust_room() so users can push/pop
      extra bytes from a BPF_SK_SKB_STREAM_VERDICT program.
      
      Some protocols may include headers and other information that we may
      not want to include when doing a redirect from a BPF_SK_SKB_STREAM_VERDICT
      program. One use case is to redirect TLS packets into a receive socket
      that doesn't expect TLS data. In the TLS case the first 13 bytes or so contain
      the protocol header. With kTLS the payload is decrypted, so we should be able
      to redirect this to a receiving socket, but the receiving socket may not be
      expecting a TLS header and will discard the data. Using the
      above helper we can pop the header off and put an appropriate header on
      the payload. This allows for creating a proxy between protocols without
      extra hops through the stack or userspace.
      
      So, in order to fix this case, add skb_adjust_room() so users can strip the
      header. With that, an unmodified receiver thread will work correctly when
      data is redirected into the ingress path of a sock.
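
      A hedged sketch of a stream verdict program using the new helper to strip a
      ~13-byte TLS record header before redirecting into a plain TCP receiver
      (the header length, sockmap layout and key are assumptions):

        #include <linux/bpf.h>
        #include <bpf/bpf_helpers.h>

        #define TLS_HDR_LEN 13

        struct {
                __uint(type, BPF_MAP_TYPE_SOCKMAP);
                __uint(max_entries, 2);
                __type(key, __u32);
                __type(value, __u64);
        } sock_map SEC(".maps");

        SEC("sk_skb/stream_verdict")
        int stream_verdict(struct __sk_buff *skb)
        {
                /* pop the protocol header (mode must be 0 for now), then
                 * redirect into the ingress path of the receiving socket */
                if (bpf_skb_adjust_room(skb, -TLS_HDR_LEN, 0, 0))
                        return SK_DROP;
                return bpf_sk_redirect_map(skb, &sock_map, 0, BPF_F_INGRESS);
        }

        char LICENSE[] SEC("license") = "GPL";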
      Signed-off-by: John Fastabend <john.fastabend@gmail.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Link: https://lore.kernel.org/bpf/160160099197.7052.8443193973242831692.stgit@john-Precision-5820-Tower
      18ebe16d
    • Merge branch 'bpf: BTF support for ksyms' · 60a128b5
      Committed by Alexei Starovoitov
      Hao Luo says:
      
      ====================
      v3 -> v4:
       - Rebasing
       - Cast bpf_[per|this]_cpu_ptr's parameter to void __percpu * before
         passing into per_cpu_ptr.
      
      v2 -> v3:
       - Rename functions and variables in verifier for better readability.
       - Stick to logging message convention in libbpf.
       - Move bpf_per_cpu_ptr and bpf_this_cpu_ptr from trace-specific
         helper set to base helper set.
       - More specific test in ksyms_btf.
       - Fix return type cast in bpf_*_cpu_ptr.
       - Fix btf leak in ksyms_btf selftest.
       - Fix return error code for kallsyms_find().
      
      v1 -> v2:
       - Move check_pseudo_btf_id from check_ld_imm() to
         replace_map_fd_with_map_ptr() and rename the latter.
       - Add bpf_this_cpu_ptr().
       - Use bpf_core_types_are_compat() in libbpf.c for checking type
         compatibility.
       - Rewrite typed ksym extern type in BTF with int to save space.
       - Minor revision of bpf_per_cpu_ptr()'s comments.
       - Avoid using long in tests that use skeleton.
       - Refactored test_ksyms.c by moving kallsyms_find() to trace_helpers.c
       - Fold the patches that sync include/linux/uapi and
         tools/include/linux/uapi.
      
      rfc -> v1:
       - Encode VAR's btf_id for PSEUDO_BTF_ID.
       - More checks in verifier. Checking the btf_id passed as
         PSEUDO_BTF_ID is valid VAR, its name and type.
       - Checks in libbpf on type compatibility of ksyms.
       - Add bpf_per_cpu_ptr() to access kernel percpu vars. Introduced
         new ARG and RET types for this helper.
      
      This patch series extends the previously added __ksym externs with
      btf support.
      
      Right now the __ksym externs are treated as pure 64-bit scalar values.
      Libbpf replaces the ld_imm64 insn of a __ksym with its kernel address at
      load time. This patch series extends those externs with their BTF info. Note
      that BTF support for __ksym requires a kernel BTF that has VARs encoded to
      work properly. The corresponding changes in pahole are available at [1]
      (with a fix at [2] for gcc 4.9+).
      
      The first 3 patches in this series add support for general kernel
      global variables, which includes verifier checking (01/06), libbpf
      support (02/06) and selftests for getting a typed ksym extern's kernel
      address (03/06).
      
      The next 3 patches extend that capability further by introducing the
      helpers bpf_per_cpu_ptr() and bpf_this_cpu_ptr(), which allow accessing
      kernel percpu variables correctly (04/06 and 05/06).
      
      The tests of this feature were performed against a pahole extended with
      [1] and [2]. For kernel BTF that does not have VARs encoded, the
      selftests will be skipped.
      
      [1] https://git.kernel.org/pub/scm/devel/pahole/pahole.git/commit/?id=f3d9054ba8ff1df0fc44e507e3a01c0964cabd42
      [2] https://www.spinics.net/lists/dwarves/msg00451.html
      ====================
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      60a128b5
    • bpf/selftests: Test for bpf_per_cpu_ptr() and bpf_this_cpu_ptr() · 00dc73e4
      Committed by Hao Luo
      Test bpf_per_cpu_ptr() and bpf_this_cpu_ptr(). Test two paths in the
      kernel: if the base pointer points to a struct, the returned reg is
      of type PTR_TO_BTF_ID, and a direct pointer dereference can be applied
      to the returned variable; if the base pointer isn't a struct, the
      returned reg is of type PTR_TO_MEM, which also supports direct pointer
      dereference.
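
      A hedged sketch of the pattern being exercised (it mirrors the selftest's
      idea; treat the variable and field choices as illustrative):

        #include "vmlinux.h"
        #include <bpf/bpf_helpers.h>

        /* typed ksym extern; needs kernel BTF with VARs encoded */
        extern const struct rq runqueues __ksym;

        __u64 cpu0_nr_running = 0;

        SEC("raw_tp/sys_enter")
        int dump_rq(void *ctx)
        {
                struct rq *rq;

                /* returns PTR_TO_BTF_ID (or NULL for an invalid cpu), so a
                 * direct dereference is allowed after the NULL check */
                rq = (struct rq *)bpf_per_cpu_ptr(&runqueues, 0);
                if (rq)
                        cpu0_nr_running = rq->nr_running;
                return 0;
        }

        char LICENSE[] SEC("license") = "GPL";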
      Signed-off-by: Hao Luo <haoluo@google.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Acked-by: Andrii Nakryiko <andriin@fb.com>
      Link: https://lore.kernel.org/bpf/20200929235049.2533242-7-haoluo@google.com
      00dc73e4