1. 27 Mar 2021, 3 commits
    • bpf: selftests: Add kfunc_call test · 7bd1590d
      Committed by Martin KaFai Lau
      This patch adds a few kernel functions, bpf_kfunc_call_test*(), for
      the selftest's test_run purpose.  They will be allowed for tc_cls progs.
      
      The selftest calling the kernel function bpf_kfunc_call_test*()
      is also added in this patch.
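      
      As a rough sketch (not the verbatim selftest), a tc_cls program
      calling one of these kfuncs could look like the following; the kfunc
      signature and section name here are assumptions:
      
        #include <linux/bpf.h>
        #include <bpf/bpf_helpers.h>
        
        struct sock;
        
        /* Assumed signature; __ksym tells libbpf to resolve the symbol
         * against kernel BTF at load time. */
        extern __u64 bpf_kfunc_call_test1(struct sock *sk, __u32 a, __u64 b,
                                          __u32 c, __u64 d) __ksym;
        
        SEC("classifier")
        int kfunc_call_test(struct __sk_buff *skb)
        {
                struct bpf_sock *sk = skb->sk;
        
                if (!sk)
                        return -1;
                sk = bpf_sk_fullsock(sk);
                if (!sk)
                        return -1;
        
                /* the verifier checks this call against kernel BTF */
                return (int)bpf_kfunc_call_test1((struct sock *)sk, 1, 2, 3, 4);
        }
        
        /* kfunc calls require a GPL-compatible program */
        char _license[] SEC("license") = "GPL";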
      Signed-off-by: Martin KaFai Lau <kafai@fb.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Link: https://lore.kernel.org/bpf/20210325015252.1551395-1-kafai@fb.com
    • bpf: Support bpf program calling kernel function · e6ac2450
      Committed by Martin KaFai Lau
      This patch adds support to the BPF verifier to allow bpf programs
      to call kernel functions directly.
      
      The use case included in this set is to allow bpf-tcp-cc to directly
      call some tcp-cc helper functions (e.g. "tcp_cong_avoid_ai()").  Those
      functions have already been used by some kernel tcp-cc implementations.
      
      This set will also allow the bpf-tcp-cc program to directly call the
      kernel tcp-cc implementation.  For example, a bpf_dctcp may only want
      to implement its own dctcp_cwnd_event() and reuse other dctcp_*()
      functions directly from the kernel's tcp_dctcp.c instead of
      reimplementing (or copy-and-pasting) them.
      
      The tcp-cc kernel functions mentioned above will be whitelisted
      for the struct_ops bpf-tcp-cc programs to use in a later patch.
      The whitelisted functions are not bound to a fixed ABI contract.
      They have already been used by the existing kernel tcp-cc
      implementations.  If any of them changes, both in-tree and
      out-of-tree kernel tcp-cc implementations have to be changed; the
      same goes for the struct_ops bpf-tcp-cc programs, which have to be
      adjusted accordingly.
      
      This patch makes the required changes in the bpf verifier.
      
      The first change is in btf.c: it adds a case in "btf_check_func_arg_match()".
      When the passed-in "btf->kernel_btf == true", it means matching the
      verifier regs' states against a kernel function.  This will handle the
      PTR_TO_BTF_ID reg.  It also maps PTR_TO_SOCK_COMMON, PTR_TO_SOCKET,
      and PTR_TO_TCP_SOCK to their kernel btf_ids.
      
      In the later libbpf patch, the insn calling a kernel function will
      look like:
      
      insn->code == (BPF_JMP | BPF_CALL)
      insn->src_reg == BPF_PSEUDO_KFUNC_CALL /* <- new in this patch */
      insn->imm == func_btf_id /* btf_id of the running kernel */
      
      [ For the future calling function-in-kernel-module support, an array
        of module btf_fds can be passed at the load time and insn->off
        can be used to index into this array. ]
      
      At an early stage, the verifier collects all kernel function
      calls into "struct bpf_kfunc_desc".  Those descriptors are stored
      in "prog->aux->kfunc_tab" and will be available to the JIT.  Since
      this "add" operation is similar to the current "add_subprog()" and
      looks for the same insn->code, they are done together in the new
      "add_subprog_and_kfunc()".
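      
      A plausible shape for such a descriptor, inferred from the description
      above (field names are an assumption, not the verbatim kernel struct):
      
        /* one entry per distinct kernel function called by the prog;
         * stored in prog->aux->kfunc_tab for later use by the JIT */
        struct bpf_kfunc_desc {
                struct btf_func_model func_model; /* arg/ret model for the JIT */
                u32 func_id;                      /* BTF id of the kernel func */
                s32 imm;                          /* addr - __bpf_call_base */
        };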
      
      In the "do_check()" stage, the new "check_kfunc_call()" is added
      to verify the kernel function call instruction:
      1. Ensure the kernel function can be used by a particular BPF_PROG_TYPE.
         A new bpf_verifier_ops "check_kfunc_call" is added to do that (see
         the sketch after this list).  The bpf-tcp-cc struct_ops program will
         implement this callback in a later patch.
      2. Call "btf_check_kfunc_args_match()" to ensure the regs can be
         used as the args of a kernel function.
      3. Mark the regs' type, subreg_def, and zext_dst.
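      
      A minimal sketch of the ops hook from step 1, assuming the callback
      only needs the callee's BTF id to decide (the exact signature is an
      assumption):
      
        /* In struct bpf_verifier_ops: return true if this prog type may
         * call the kernel function identified by kfunc_btf_id. */
        bool (*check_kfunc_call)(u32 kfunc_btf_id);
        
        /* Hypothetical implementation allowlisting a single kfunc whose
         * BTF id was resolved at init time: */
        static u32 allowed_kfunc_btf_id;
        
        static bool my_check_kfunc_call(u32 kfunc_btf_id)
        {
                return kfunc_btf_id == allowed_kfunc_btf_id;
        }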
      
      At the later do_misc_fixups() stage, the new fixup_kfunc_call()
      will replace the insn->imm with the function address (relative
      to __bpf_call_base).  If needed, the JIT can find the btf_func_model
      by calling the new bpf_jit_find_kfunc_model(prog, insn).
      With the imm set to the function address, "bpftool prog dump xlated"
      will be able to display the kernel function calls the same way as
      it displays other bpf helper calls.
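      
      Conceptually the fixup is a single imm rewrite per call insn; a sketch
      assuming the descriptor table above (not the verbatim kernel code):
      
        static int fixup_kfunc_call(struct bpf_verifier_env *env,
                                    struct bpf_insn *insn)
        {
                const struct bpf_kfunc_desc *desc;
        
                /* look up by the BTF id currently stored in insn->imm */
                desc = find_kfunc_desc(env->prog, insn->imm);
                if (!desc)
                        return -EFAULT;
        
                /* now imm holds the kernel function address relative to
                 * __bpf_call_base, like any other helper call */
                insn->imm = desc->imm;
                return 0;
        }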
      
      A gpl_compatible program is required to call kernel functions.
      
      This feature currently requires JIT.
      
      The verifier selftests are adjusted because of the changes in
      the verbose log in add_subprog_and_kfunc().
      Signed-off-by: Martin KaFai Lau <kafai@fb.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Link: https://lore.kernel.org/bpf/20210325015142.1544736-1-kafai@fb.com
    • bpf: Refactor btf_check_func_arg_match · 34747c41
      Committed by Martin KaFai Lau
      This patch moves the subprog-specific logic from
      btf_check_func_arg_match() to the new btf_check_subprog_arg_match().
      The core logic is left in btf_check_func_arg_match(), which will be
      reused later to check kernel function calls.
      
      The "if (!btf_type_is_ptr(t))" is checked first to improve the
      indentation which will be useful for a later patch.
      
      Some of the "btf_kind_str[]" usages is replaced with the shortcut
      "btf_type_str(t)".
      Signed-off-by: Martin KaFai Lau <kafai@fb.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Link: https://lore.kernel.org/bpf/20210325015136.1544504-1-kafai@fb.com
  2. 26 Mar 2021, 2 commits
  3. 18 Mar 2021, 1 commit
    • bpf: Fix fexit trampoline. · e21aa341
      Committed by Alexei Starovoitov
      The fexit/fmod_ret programs can be attached to kernel functions that
      can sleep.  synchronize_rcu_tasks() will not wait for such tasks to
      complete.  In such a case the trampoline image will be freed, and when
      the task wakes up the return IP will point to freed memory, causing a
      crash.  Solve this by adding percpu_ref_get/put for the duration of
      the trampoline and by separating the lifetime of the trampoline from
      that of its image.
      
      The "half page" optimization has to be removed, since the
      first_half->second_half->first_half transition cannot be guaranteed to
      complete in deterministic time; every trampoline update becomes a new
      image.  An image with fmod_ret or fexit progs will be freed via
      percpu_ref_kill and call_rcu_tasks; together they will wait for the
      original function and the trampoline asm to complete.  The trampoline
      is patched from nop to jmp to skip fexit progs; they are freed
      independently of the trampoline.  An image with only fentry progs will
      be freed via call_rcu_tasks_trace+call_rcu_tasks, which will wait for
      both sleepable and non-sleepable progs to complete.
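      
      The refcount bracketing could be as small as the following sketch
      (helper and field names are assumptions based on the description):
      
        /* called from the generated trampoline asm on entry/exit so the
         * image cannot be freed while any task still executes it */
        void notrace __bpf_tramp_enter(struct bpf_tramp_image *im)
        {
                percpu_ref_get(&im->pcref);
        }
        
        void notrace __bpf_tramp_exit(struct bpf_tramp_image *im)
        {
                percpu_ref_put(&im->pcref);
        }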
      
      Fixes: fec56f58 ("bpf: Introduce BPF trampoline")
      Reported-by: Andrii Nakryiko <andrii@kernel.org>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Paul E. McKenney <paulmck@kernel.org>  # for RCU
      Link: https://lore.kernel.org/bpf/20210316210007.38949-1-alexei.starovoitov@gmail.com
  4. 10 Mar 2021, 2 commits
  5. 05 Mar 2021, 1 commit
  6. 27 Feb 2021, 6 commits
  7. 12 Feb 2021, 1 commit
  8. 11 Feb 2021, 4 commits
  9. 28 Jan 2021, 1 commit
  10. 13 Jan 2021, 3 commits
  11. 09 Dec 2020, 1 commit
  12. 04 Dec 2020, 1 commit
    • bpf: Remove hard-coded btf_vmlinux assumption from BPF verifier · 22dc4a0f
      Committed by Andrii Nakryiko
      Remove a pervasive assumption throughout the BPF verifier that only
      vmlinux BTF exists.  Instead, wherever BTF type IDs are involved, also
      track the instance of struct btf that goes along with the type ID.
      This makes it possible to gradually add support for kernel module BTFs
      and to use/track module types across BPF helper calls and registers.
      
      This patch also renames the btf_id() function to btf_obj_id() to
      minimize the naming clash of using btf_id to denote a BTF *type* ID
      rather than a BTF *object*'s ID.
      
      Also, although btf_vmlinux can't be destructed and thus doesn't need
      refcounting, module BTFs do, so apply BTF refcounting universally when
      a BPF program is using BTF-powered attachment (tp_btf, fentry/fexit,
      etc.).  This makes for simpler clean-up code.
      
      Now that a BTF type ID is not enough to uniquely identify a BTF type,
      extend the BPF trampoline key to include the BTF object ID.  To
      differentiate that from a target program's BPF ID, set the 31st bit of
      the type ID.  BTF type IDs (at least currently) are not allowed to use
      the full 32 bits, so there is no danger of confusing that bit with a
      valid BTF type ID.
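      
      One way to read the resulting key construction (a sketch; the helper
      name and exact bit layout are assumptions drawn from the text):
      
        /* high 32 bits: BTF object id (or target prog id);
         * low 32 bits: BTF type id, with bit 31 marking "object id,
         * not prog id" since type ids never use the full 32 bits */
        static u64 bpf_trampoline_compute_key(const struct bpf_prog *tgt_prog,
                                              struct btf *btf, u32 btf_id)
        {
                if (tgt_prog)
                        return ((u64)tgt_prog->aux->id << 32) | btf_id;
                return ((u64)btf_obj_id(btf) << 32) | 0x80000000 | btf_id;
        }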
      Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Link: https://lore.kernel.org/bpf/20201203204634.1325171-10-andrii@kernel.org
  13. 03 Dec 2020, 3 commits
  14. 19 Nov 2020, 1 commit
  15. 11 Nov 2020, 1 commit
    • bpf: Load and verify kernel module BTFs · 36e68442
      Committed by Andrii Nakryiko
      Add a kernel module listener that will load/validate and unload module
      BTF.  Module BTFs get IDs generated for them, which makes it possible
      to iterate them with the existing BTF iteration API.  They are given
      their respective module's name, which gets reported through the
      GET_OBJ_INFO API.  They are also marked as in-kernel BTFs so that
      tooling can distinguish them from user-provided BTFs.
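      
      For example, user space can now walk all BTF objects, module BTFs
      included, with the existing iteration API; a small libbpf sketch
      (error handling trimmed, illustrative only):
      
        #include <bpf/bpf.h>
        #include <stdio.h>
        #include <unistd.h>
        
        int main(void)
        {
                __u32 id = 0;
        
                while (!bpf_btf_get_next_id(id, &id)) {
                        int fd = bpf_btf_get_fd_by_id(id);
        
                        if (fd < 0)
                                continue;
                        /* GET_OBJ_INFO on fd would report the module name */
                        printf("btf id %u -> fd %d\n", id, fd);
                        close(fd);
                }
                return 0;
        }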
      
      Also, similarly to vmlinux BTF, kernel module BTFs are exposed through
      sysfs as /sys/kernel/btf/<module-name>.  This makes it convenient for
      user-space tools to inspect module BTF contents and dump their types
      with existing tools:
      
      [vmuser@archvm bpf]$ ls -la /sys/kernel/btf
      total 0
      drwxr-xr-x  2 root root       0 Nov  4 19:46 .
      drwxr-xr-x 13 root root       0 Nov  4 19:46 ..
      
      ...
      
      -r--r--r--  1 root root     888 Nov  4 19:46 irqbypass
      -r--r--r--  1 root root  100225 Nov  4 19:46 kvm
      -r--r--r--  1 root root   35401 Nov  4 19:46 kvm_intel
      -r--r--r--  1 root root     120 Nov  4 19:46 pcspkr
      -r--r--r--  1 root root     399 Nov  4 19:46 serio_raw
      -r--r--r--  1 root root 4094095 Nov  4 19:46 vmlinux
      Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Link: https://lore.kernel.org/bpf/20201110011932.3201430-5-andrii@kernel.org
  16. 07 Nov 2020, 1 commit
  17. 29 Oct 2020, 1 commit
    • bpf: Permit cond_resched for some iterators · cf83b2d2
      Committed by Yonghong Song
      Commit e679654a ("bpf: Fix a rcu_sched stall issue with
      bpf task/task_file iterator") tried to fix an rcu_sched stall
      warning caused by the bpf task_file iterator when running
      "bpftool prog".
      
            rcu: INFO: rcu_sched self-detected stall on CPU
            rcu:     7-....: (20999 ticks this GP) idle=302/1/0x4000000000000000 softirq=1508852/1508852 fqs=4913
                (t=21031 jiffies g=2534773 q=179750)
            NMI backtrace for cpu 7
            CPU: 7 PID: 184195 Comm: bpftool Kdump: loaded Tainted: G        W         5.8.0-00004-g68bfc7f8c1b4 #6
            Hardware name: Quanta Twin Lakes MP/Twin Lakes Passive MP, BIOS F09_3A17 05/03/2019
            Call Trace:
            <IRQ>
            dump_stack+0x57/0x70
            nmi_cpu_backtrace.cold+0x14/0x53
            ? lapic_can_unplug_cpu.cold+0x39/0x39
            nmi_trigger_cpumask_backtrace+0xb7/0xc7
            rcu_dump_cpu_stacks+0xa2/0xd0
            rcu_sched_clock_irq.cold+0x1ff/0x3d9
            ? tick_nohz_handler+0x100/0x100
            update_process_times+0x5b/0x90
            tick_sched_timer+0x5e/0xf0
            __hrtimer_run_queues+0x12a/0x2a0
            hrtimer_interrupt+0x10e/0x280
            __sysvec_apic_timer_interrupt+0x51/0xe0
            asm_call_on_stack+0xf/0x20
            </IRQ>
            sysvec_apic_timer_interrupt+0x6f/0x80
            ...
            task_file_seq_next+0x52/0xa0
            bpf_seq_read+0xb9/0x320
            vfs_read+0x9d/0x180
            ksys_read+0x5f/0xe0
            do_syscall_64+0x38/0x60
            entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
      The fix was to limit the number of bpf program runs to
      one million.  This fixed the problem in most cases.  But we also
      found that under heavy load, which can increase the wallclock time
      of bpf_seq_read(), the warning may still appear.
      
      For example, with bpf_delay() called in the "while" loop of
      bpf_seq_read() to introduce an artificial delay, the warning
      shows up in my qemu run.
      
        static unsigned q;
        volatile unsigned *p = &q;
        volatile unsigned long long ll;
        static void bpf_delay(void)
        {
               int i, j;
      
               for (i = 0; i < 10000; i++)
                       for (j = 0; j < 10000; j++)
                               ll += *p;
        }
      
      There are two ways to fix this issue.  One is to reduce the above
      one-million threshold to, say, 100,000 and hope the rcu warning will
      no longer show up.  The other is to introduce a target feature
      that enables bpf_seq_read() to call cond_resched().
      
      This patch takes the second approach, as the first one may cause
      more -EAGAIN failures for read() syscalls.  Note that not all
      bpf_iter targets can permit cond_resched() in bpf_seq_read(): for
      some, e.g., the netlink seq iterator, an rcu read lock critical
      section spans seq_ops->next() -> seq_ops->show() -> seq_ops->next().
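      
      A rough sketch of the opt-in target feature described above (the flag
      and field names are assumptions, not the verbatim patch):
      
        /* targets opt in at registration time */
        #define BPF_ITER_RESCHED        (1U << 0)
        
        struct bpf_iter_reg {
                const char *target;
                /* ... existing callbacks ... */
                unsigned int feature;   /* e.g. BPF_ITER_RESCHED */
        };
        
        /* inside the bpf_seq_read() read loop: */
        if (feature & BPF_ITER_RESCHED)
                cond_resched();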
      
      For the kernel code with the above hack, "bpftool p" roughly takes
      38 seconds to finish on my VM with 184 bpf program runs.
      Using the following command, I am able to collect the number of
      context switches:
         perf stat -e context-switches -- ./bpftool p >& log
      Without this patch,
         69      context-switches
      With this patch,
         75      context-switches
      This patch adds 6 additional context switches, roughly one every
      6 seconds, to reschedule and so avoid the lengthy no-rescheduling
      stretches that cause the above RCU warnings.
      Signed-off-by: Yonghong Song <yhs@fb.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Acked-by: Andrii Nakryiko <andrii@kernel.org>
      Link: https://lore.kernel.org/bpf/20201028061054.1411116-1-yhs@fb.com
  18. 12 Oct 2020, 1 commit
    • bpf: Allow for map-in-map with dynamic inner array map entries · 4a8f87e6
      Committed by Daniel Borkmann
      Recent work in f4d05259 ("bpf: Add map_meta_equal map ops") and 134fede4
      ("bpf: Relax max_entries check for most of the inner map types") added support
      for dynamic inner max elements for most map-in-map types. Exceptions were maps
      like array or prog array where the map_gen_lookup() callback uses the maps'
      max_entries field as a constant when emitting instructions.
      
      We recently implemented Maglev consistent hashing in Cilium's load
      balancer, which uses map-in-map with an outer hash map and inner array
      maps holding the Maglev backend table for each service.  It was
      designed this way to reduce overall memory consumption, given that the
      outer hash map avoids preallocating a large, flat memory area for all
      services.  Also, the number of service mappings is not always known
      a priori.
      
      The use case for dynamic inner array map entries is to further reduce
      memory overhead: some services might have just a small number of
      backends while others have a large one.  Right now the Maglev backend
      tables for small and large backend counts would need the same number
      of inner array map entries, which adds a lot of unneeded overhead.
      
      Dynamic inner array map entries can be realized by avoiding the
      inlined code generation for their lookup.  The lookup will still be
      efficient since it calls into array_map_lookup_elem() directly, thus
      avoiding a retpoline.  The patch adds a BPF_F_INNER_MAP flag to map
      creation which therefore skips inline code generation and relaxes the
      array_map_meta_equal() check to ignore both maps' max_entries.  This
      still allows faster lookups for map-in-map when BPF_F_INNER_MAP is not
      specified and dynamic max_entries is not needed.
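      
      Declaring such a map-in-map could look like the following BTF-defined
      map sketch for libbpf (names and sizes are illustrative):
      
        #include <linux/bpf.h>
        #include <bpf/bpf_helpers.h>
        
        /* inner array template: BPF_F_INNER_MAP skips the inlined lookup,
         * so inner maps inserted at runtime may differ in max_entries */
        struct inner_arr {
                __uint(type, BPF_MAP_TYPE_ARRAY);
                __uint(map_flags, BPF_F_INNER_MAP);
                __uint(max_entries, 1); /* placeholder; real size varies */
                __type(key, __u32);
                __type(value, __u64);
        };
        
        struct {
                __uint(type, BPF_MAP_TYPE_HASH_OF_MAPS);
                __uint(max_entries, 1024);      /* one slot per service */
                __type(key, __u32);
                __array(values, struct inner_arr);
        } services SEC(".maps");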
      
      Example code generation where the inner map is a dynamically sized array:
      
        # bpftool p d x i 125
        int handle__sys_enter(void * ctx):
        ; int handle__sys_enter(void *ctx)
           0: (b4) w1 = 0
        ; int key = 0;
           1: (63) *(u32 *)(r10 -4) = r1
           2: (bf) r2 = r10
        ;
           3: (07) r2 += -4
        ; inner_map = bpf_map_lookup_elem(&outer_arr_dyn, &key);
           4: (18) r1 = map[id:468]
           6: (07) r1 += 272
           7: (61) r0 = *(u32 *)(r2 +0)
           8: (35) if r0 >= 0x3 goto pc+5
           9: (67) r0 <<= 3
          10: (0f) r0 += r1
          11: (79) r0 = *(u64 *)(r0 +0)
          12: (15) if r0 == 0x0 goto pc+1
          13: (05) goto pc+1
          14: (b7) r0 = 0
          15: (b4) w6 = -1
        ; if (!inner_map)
          16: (15) if r0 == 0x0 goto pc+6
          17: (bf) r2 = r10
        ;
          18: (07) r2 += -4
        ; val = bpf_map_lookup_elem(inner_map, &key);
          19: (bf) r1 = r0                               | No inlining but instead
          20: (85) call array_map_lookup_elem#149280     | call to array_map_lookup_elem()
        ; return val ? *val : -1;                        | for inner array lookup.
          21: (15) if r0 == 0x0 goto pc+1
        ; return val ? *val : -1;
          22: (61) r6 = *(u32 *)(r0 +0)
        ; }
          23: (bc) w0 = w6
          24: (95) exit
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Acked-by: Andrii Nakryiko <andrii@kernel.org>
      Link: https://lore.kernel.org/bpf/20201010234006.7075-4-daniel@iogearbox.net
  19. 03 Oct 2020, 2 commits
  20. 30 Sep 2020, 2 commits
  21. 29 Sep 2020, 2 commits
    • bpf: Add bpf_snprintf_btf helper · c4d0bfb4
      Committed by Alan Maguire
      A helper is added to support tracing kernel type information in BPF
      using the BPF Type Format (BTF).  Its signature is
      
      long bpf_snprintf_btf(char *str, u32 str_size, struct btf_ptr *ptr,
      		      u32 btf_ptr_size, u64 flags);
      
      struct btf_ptr * specifies
      
      - a pointer to the data to be traced
      - the BTF id of the type of data pointed to
      - a flags field for future use; these flags are not to be confused
        with the BTF_F_* flags below that control how the btf_ptr is
        displayed.  The flags member of struct btf_ptr may later be used,
        e.g., to disambiguate types in kernel versus module BTF; the main
        distinction is that these flags relate to the type and to the
        information needed to identify it, not to how it is displayed.
      
      For example, a BPF program with a struct sk_buff *skb
      could do the following:
      
      	static struct btf_ptr b = { };
      
      	b.ptr = skb;
      	b.type_id = __builtin_btf_type_id(struct sk_buff, 1);
      	bpf_snprintf_btf(str, sizeof(str), &b, sizeof(b), 0);
      
      Default output looks like this:
      
      (struct sk_buff){
       .transport_header = (__u16)65535,
       .mac_header = (__u16)65535,
       .end = (sk_buff_data_t)192,
       .head = (unsigned char *)0x000000007524fd8b,
       .data = (unsigned char *)0x000000007524fd8b,
       .truesize = (unsigned int)768,
       .users = (refcount_t){
        .refs = (atomic_t){
         .counter = (int)1,
        },
       },
      }
      
      Flags modifying display are as follows:
      
      - BTF_F_COMPACT:	no formatting around type information
      - BTF_F_NONAME:		no struct/union member names/types
      - BTF_F_PTR_RAW:	show raw (unobfuscated) pointer values;
      			equivalent to %px.
      - BTF_F_ZERO:		show zero-valued struct/union members;
      			they are not displayed by default
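      
      Flags can be combined; for instance, a compact dump without member
      names could be requested like this (a variation on the example above,
      not from the original message):
      
      	/* same btf_ptr setup as the skb example above */
      	bpf_snprintf_btf(str, sizeof(str), &b, sizeof(b),
      			 BTF_F_COMPACT | BTF_F_NONAME);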
      Signed-off-by: Alan Maguire <alan.maguire@oracle.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Link: https://lore.kernel.org/bpf/1601292670-1616-4-git-send-email-alan.maguire@oracle.com
    • 76654e67