1. 19 Dec 2021, 7 commits
  2. 14 Dec 2021, 1 commit
    • bpf: Add get_func_[arg|ret|arg_cnt] helpers · f92c1e18
      Committed by Jiri Olsa
      Adding the following helpers for tracing programs:
      
      Get n-th argument of the traced function:
        long bpf_get_func_arg(void *ctx, u32 n, u64 *value)
      
      Get return value of the traced function:
        long bpf_get_func_ret(void *ctx, u64 *value)
      
      Get arguments count of the traced function:
        long bpf_get_func_arg_cnt(void *ctx)
      
      The trampoline now stores the number of arguments at the ctx-8
      address, so it is easy to validate the argument index and find
      the position of the return value argument.
      
      The function IP address is moved behind the number of function
      arguments on the trampoline stack, so it is now stored at the
      ctx-16 address when it is needed.
      
      All of the above helpers are inlined by the verifier.
      
      Also a small, mostly unrelated change: use the newly added function
      bpf_prog_has_trampoline in check_get_func_ip.
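      
      A minimal hedged sketch of using these helpers from a tracing (fexit)
      program; the attach point and all names here are illustrative, not part
      of the patch:
      
        // SPDX-License-Identifier: GPL-2.0
        #include <vmlinux.h>
        #include <bpf/bpf_helpers.h>
        #include <bpf/bpf_tracing.h>
        
        char LICENSE[] SEC("license") = "GPL";
        
        SEC("fexit/bpf_fentry_test1")   /* illustrative attach point */
        int BPF_PROG(dump_args)
        {
            __u64 cnt = bpf_get_func_arg_cnt(ctx);
            __u64 arg0 = 0, ret = 0;
        
            if (cnt > 0)
                bpf_get_func_arg(ctx, 0, &arg0);    /* n is zero-based */
        
            /* the return value is only meaningful in fexit/fmod_ret programs */
            bpf_get_func_ret(ctx, &ret);
        
            bpf_printk("args=%llu arg0=%llu ret=%llu", cnt, arg0, ret);
            return 0;
        }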
      Signed-off-by: Jiri Olsa <jolsa@kernel.org>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Link: https://lore.kernel.org/bpf/20211208193245.172141-5-jolsa@kernel.org
  3. 12 Dec 2021, 1 commit
  4. 03 Dec 2021, 1 commit
  5. 01 Dec 2021, 1 commit
    • bpf: Add bpf_loop helper · e6f2dd0f
      Committed by Joanne Koong
      This patch adds the kernel-side and API changes for a new helper
      function, bpf_loop:
      
      long bpf_loop(u32 nr_loops, void *callback_fn, void *callback_ctx,
                    u64 flags);
      
      where long (*callback_fn)(u32 index, void *ctx);
      
      bpf_loop invokes the "callback_fn" nr_loops times or until the
      callback_fn returns 1. The callback_fn may only return 0 or 1, and
      this is enforced by the verifier. The index passed to callback_fn is zero-based.
      
      A few things to note:
      ~ The "u64 flags" parameter is currently unused but is included in
      case a future use case for it arises.
      ~ In the kernel-side implementation of bpf_loop (kernel/bpf/bpf_iter.c),
      bpf_callback_t is used as the callback function cast.
      ~ A program can have nested bpf_loop calls, but the program must
      still adhere to the verifier's stack depth constraint (the stack depth
      cannot exceed MAX_BPF_STACK).
      ~ Recursive callback_fns do not pass the verifier, due to the call stack
      for these being too deep.
      ~ The next patch will include the tests and benchmark
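      
      A minimal hedged sketch of driving bpf_loop with a callback (the attach
      point and names are illustrative):
      
        // SPDX-License-Identifier: GPL-2.0
        #include <vmlinux.h>
        #include <bpf/bpf_helpers.h>
        
        char LICENSE[] SEC("license") = "GPL";
        
        long total;
        
        /* callback_fn: return 0 to continue, 1 to break out of the loop early */
        static long sum_cb(__u32 index, void *ctx)
        {
            long *sum = ctx;
        
            *sum += index;
            return 0;
        }
        
        SEC("fentry/bpf_fentry_test1")  /* illustrative attach point */
        int sum_indices(void *ctx)
        {
            long sum = 0;
        
            /* flags must currently be 0 */
            bpf_loop(10, sum_cb, &sum, 0);
            total = sum;
            return 0;
        }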
      Signed-off-by: Joanne Koong <joannekoong@fb.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Acked-by: Andrii Nakryiko <andrii@kernel.org>
      Link: https://lore.kernel.org/bpf/20211130030622.4131246-2-joannekoong@fb.com
  6. 30 Nov 2021, 1 commit
  7. 17 Nov 2021, 1 commit
  8. 16 Nov 2021, 2 commits
    • bpf: Change value of MAX_TAIL_CALL_CNT from 32 to 33 · ebf7f6f0
      Committed by Tiezhu Yang
      In the current code, the actual max tail call count is 33 which is greater
      than MAX_TAIL_CALL_CNT (defined as 32). The actual limit is not consistent
      with the meaning of MAX_TAIL_CALL_CNT and thus confusing at first glance.
      We can see the historical evolution from commit 04fd61ab ("bpf: allow
      bpf programs to tail-call other bpf programs") and commit f9dabe01
      ("bpf: Undo off-by-one in interpreter tail call count limit"). In order
      to avoid changing existing behavior, the actual limit is now 33, which is
      reasonable.
      
      After commit 874be05f ("bpf, tests: Add tail call test suite"), we can
      see that there is a failing test case.
      
      On all archs when CONFIG_BPF_JIT_ALWAYS_ON is not set:
       # echo 0 > /proc/sys/net/core/bpf_jit_enable
       # modprobe test_bpf
       # dmesg | grep -w FAIL
       Tail call error path, max count reached jited:0 ret 34 != 33 FAIL
      
      On some archs:
       # echo 1 > /proc/sys/net/core/bpf_jit_enable
       # modprobe test_bpf
       # dmesg | grep -w FAIL
       Tail call error path, max count reached jited:1 ret 34 != 33 FAIL
      
      Although the above failed testcase has been fixed in commit 18935a72
      ("bpf/tests: Fix error in tail call limit tests"), it would still be good
      to change the value of MAX_TAIL_CALL_CNT from 32 to 33 to make the code
      more readable.
      
      The 32-bit x86 JIT was using a limit of 32; fix the incorrect comments and
      raise the limit to 33 tail calls to match the updated MAX_TAIL_CALL_CNT. For
      the mips64 JIT, use "ori" instead of "addiu" as suggested by Johan Almbladh.
      For the riscv JIT, use RV_REG_TCC directly to save one register move as
      suggested by Björn Töpel. For the other implementations there are no
      functional changes: the current limit of 33 stays as it is, the new value of
      MAX_TAIL_CALL_CNT now reflects the actual maximum tail call count, and the
      related tail call test cases in the test_bpf module and selftests keep
      working for both the interpreter and the JIT.
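      
      A hedged sketch (not the selftest itself) of a tc program that tail-calls
      back into itself through a prog array; user space is assumed to place this
      program's own fd into slot 0 of jmp_table, and the final value of count
      shows how far the chain was allowed to run before the tail call limit
      stopped it:
      
        // SPDX-License-Identifier: GPL-2.0
        #include <vmlinux.h>
        #include <bpf/bpf_helpers.h>
        
        char LICENSE[] SEC("license") = "GPL";
        
        struct {
            __uint(type, BPF_MAP_TYPE_PROG_ARRAY);
            __uint(max_entries, 1);
            __uint(key_size, sizeof(__u32));
            __uint(value_size, sizeof(__u32));
        } jmp_table SEC(".maps");
        
        int count;
        
        SEC("tc")
        int classifier_0(struct __sk_buff *skb)
        {
            count++;                              /* bumped once per (re)entry */
            bpf_tail_call(skb, &jmp_table, 0);    /* tail-call back into slot 0 */
            return 0;                             /* reached once the limit stops the chain */
        }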
      
      Here are the test results on x86_64:
      
       # uname -m
       x86_64
       # echo 0 > /proc/sys/net/core/bpf_jit_enable
       # modprobe test_bpf test_suite=test_tail_calls
       # dmesg | tail -1
       test_bpf: test_tail_calls: Summary: 8 PASSED, 0 FAILED, [0/8 JIT'ed]
       # rmmod test_bpf
       # echo 1 > /proc/sys/net/core/bpf_jit_enable
       # modprobe test_bpf test_suite=test_tail_calls
       # dmesg | tail -1
       test_bpf: test_tail_calls: Summary: 8 PASSED, 0 FAILED, [8/8 JIT'ed]
       # rmmod test_bpf
       # ./test_progs -t tailcalls
       #142 tailcalls:OK
       Summary: 1/11 PASSED, 0 SKIPPED, 0 FAILED
      Signed-off-by: Tiezhu Yang <yangtiezhu@loongson.cn>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Tested-by: Johan Almbladh <johan.almbladh@anyfinetworks.com>
      Tested-by: Ilya Leoshkevich <iii@linux.ibm.com>
      Acked-by: Björn Töpel <bjorn@kernel.org>
      Acked-by: Johan Almbladh <johan.almbladh@anyfinetworks.com>
      Acked-by: Ilya Leoshkevich <iii@linux.ibm.com>
      Link: https://lore.kernel.org/bpf/1636075800-3264-1-git-send-email-yangtiezhu@loongson.cn
    • bpf: Fix toctou on read-only map's constant scalar tracking · 353050be
      Committed by Daniel Borkmann
      Commit a23740ec ("bpf: Track contents of read-only maps as scalars") is
      checking whether maps are read-only both from BPF program side and user space
      side, and then, given their content is constant, reading out their data via
      map->ops->map_direct_value_addr() which is then subsequently used as known
      scalar value for the register, that is, it is marked as __mark_reg_known()
      with the read value at verification time. Before a23740ec, the register
      content was marked as an unknown scalar so the verifier could not make any
      assumptions about the map content.
      
      The current implementation however is prone to a TOCTOU race, meaning, the
      value read as known scalar for the register is not guaranteed to be exactly
      the same at a later point when the program is executed, and as such, the
      prior made assumptions of the verifier with regards to the program will be
      invalid which can cause issues such as OOB access, etc.
      
      While the BPF_F_RDONLY_PROG map flag is always fixed and required to be
      specified at map creation time, the map->frozen property is initially set to
      false for the map given the map value needs to be populated, e.g. for global
      data sections. Once complete, the loader "freezes" the map from user space
      such that no subsequent updates/deletes are possible anymore. For the rest
      of the lifetime of the map, this freeze one-time trigger cannot be undone
      anymore after a successful BPF_MAP_FREEZE cmd return. Meaning, any new BPF_*
      cmd calls which would update/delete map entries will be rejected with -EPERM
      since map_get_sys_perms() removes the FMODE_CAN_WRITE permission. This also
      means that pending update/delete map entries must still complete before this
      guarantee is given. This corner case is not an issue for loaders since they
      create and prepare such program private map in successive steps.
      
      However, a malicious user is able to trigger this TOCTOU race in two different
      ways: i) via userfaultfd, and ii) via batched updates. For i) userfaultfd is
      used to widen the race window, so that map_update_elem() can modify
      the contents of the map after map_freeze() and bpf_prog_load() were executed.
      This works, because userfaultfd halts the parallel thread which triggered a
      map_update_elem() at the time where we copy key/value from the user buffer and
      this already passed the FMODE_CAN_WRITE capability test given at that time the
      map was not "frozen". Then, the main thread performs the map_freeze() and
      bpf_prog_load(), and once that had completed successfully, the other thread
      is woken up to complete the pending map_update_elem() which then changes the
      map content. For ii) the idea of the batched update is similar, meaning, when
      there are a large number of updates to be processed, which widens the race
      window between the two. It is therefore possible in practice to
      modify the contents of the map after executing map_freeze() and bpf_prog_load().
      
      One way to fix both i) and ii) at the same time is to expand the use of the
      map's map->writecnt. The latter was introduced in fc970227 ("bpf: Add mmap()
      support for BPF_MAP_TYPE_ARRAY") and further refined in 1f6cb19b ("bpf:
      Prevent re-mmap()'ing BPF map as writable for initially r/o mapping") with
      the rationale to make a writable mmap()'ing of a map mutually exclusive with
      read-only freezing. The counter indicates writable mmap() mappings and then
      prevents/fails the freeze operation. Its semantics can be expanded beyond
      just mmap() by generally indicating ongoing write phases. This would essentially
      span any parallel regular and batched flavor of update/delete operation and
      then also have map_freeze() fail with -EBUSY. For the check_mem_access() in
      the verifier we expand upon the bpf_map_is_rdonly() check ensuring that all
      last pending writes have completed via bpf_map_write_active() test. Once the
      map->frozen is set and bpf_map_write_active() indicates a map->writecnt of 0
      only then we are really guaranteed to use the map's data as known constants.
      For map->frozen being set and pending writes in process of still being completed
      we fall back to marking that register as unknown scalar so we don't end up
      making assumptions about it. With this, both TOCTOU reproducers from i) and
      ii) are fixed.
      
      Note that the map->writecnt has been converted into an atomic64 in the fix in
      order to avoid a double freeze_mutex mutex_{un,}lock() pair when updating
      map->writecnt in the various map update/delete BPF_* cmd flavors. Spanning
      the freeze_mutex over entire map update/delete operations in syscall side
      would not be possible due to then causing everything to be serialized.
      Similarly, something like synchronize_rcu() after setting map->frozen to wait
      for update/deletes to complete is not possible either since it would also
      have to span the user copy which can sleep. On the libbpf side, this won't
      break d66562fb ("libbpf: Add BPF object skeleton support") as the
      anonymous mmap()-ed "map initialization image" is remapped as a BPF map-backed
      mmap()-ed memory where for .rodata it's non-writable.
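      
      A hedged user-space sketch (libbpf low-level API; map fd and value are
      illustrative) of the create → populate → freeze → load ordering that this
      fix protects against racing writers:
      
        #include <bpf/bpf.h>
        
        /* Illustrative loader-side ordering: populate the program-private r/o map
         * first, then freeze it, and only then load the program the verifier will
         * specialize on the map contents. With the fix, a freeze racing with
         * still-pending writers fails with -EBUSY instead of letting the verifier
         * read a value that may change later. */
        static int populate_and_freeze(int map_fd)
        {
            __u32 key = 0;
            __u64 value = 42;   /* example constant the program will see */
            int err;
        
            err = bpf_map_update_elem(map_fd, &key, &value, BPF_ANY);
            if (err)
                return err;
        
            return bpf_map_freeze(map_fd);  /* no updates/deletes past this point */
        }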
      
      Fixes: a23740ec ("bpf: Track contents of read-only maps as scalars")
      Reported-by: w1tcher.bupt@gmail.com
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Andrii Nakryiko <andrii@kernel.org>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
  9. 08 Nov 2021, 1 commit
  10. 07 Nov 2021, 1 commit
    • bpf: Stop caching subprog index in the bpf_pseudo_func insn · 3990ed4c
      Committed by Martin KaFai Lau
      This patch fixes an out-of-bounds access issue when JITing the
      bpf_pseudo_func insn (i.e. ld_imm64 with src_reg == BPF_PSEUDO_FUNC).
      
      In jit_subprog(), it currently reuses the subprog index cached in
      insn[1].imm.  This subprog index is an index into a few arrays related
      to subprogs.  For example, in jit_subprog(), it is an index into the newly
      allocated 'struct bpf_prog **func' array.
      
      The subprog index was cached in insn[1].imm after add_subprog().  However,
      this could become outdated (and too big in this case) if some subprogs
      are completely removed during dead code elimination (in
      adjust_subprog_starts_after_remove).  The cached index in insn[1].imm
      is not updated accordingly, causing an out-of-bounds access in the later
      jit_subprog().
      
      Unlike the bpf_pseudo_'func' insn, the current bpf_pseudo_'call' insn
      handles the DCE properly by calling find_subprog(insn->imm) to
      figure out the index instead of caching the subprog index.
      The existing bpf_adj_branches() will adjust the insn->imm
      whenever an insn is added or removed.
      
      Instead of having two ways of handling the subprog index,
      this patch makes bpf_pseudo_func work more like
      bpf_pseudo_call.
      
      The first change is to stop caching the subprog index result
      in insn[1].imm after add_subprog().  The verification
      process will use find_subprog(insn->imm) to figure
      out the subprog index.
      
      The second change is in bpf_adj_branches(): have it
      adjust the insn->imm for the bpf_pseudo_func insn as well
      whenever an insn is added or removed.
      
      The third change is in jit_subprog().  Like the bpf_pseudo_call handling,
      bpf_pseudo_func temporarily stores the find_subprog() result
      in insn->off.  It is fine because the prog's insn has been finalized
      at this point.  insn->off will be reset back to 0 later to avoid
      confusing the userspace prog dump tool.
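      
      For context, the callback passed to bpf_for_each_map_elem() is what a
      bpf_pseudo_func ld_imm64 refers to; a minimal hedged sketch (map, names
      and attach point are illustrative, not taken from the patch):
      
        // SPDX-License-Identifier: GPL-2.0
        #include <vmlinux.h>
        #include <bpf/bpf_helpers.h>
        
        char LICENSE[] SEC("license") = "GPL";
        
        struct {
            __uint(type, BPF_MAP_TYPE_ARRAY);
            __uint(max_entries, 16);
            __type(key, __u32);
            __type(value, __u64);
        } vals SEC(".maps");
        
        /* The address of this static callback is materialized by a ld_imm64 insn
         * with src_reg == BPF_PSEUDO_FUNC -- the insn this fix is about. */
        static long zero_cb(struct bpf_map *map, __u32 *key, __u64 *val, void *ctx)
        {
            *val = 0;
            return 0;
        }
        
        SEC("fentry/bpf_fentry_test1")  /* illustrative attach point */
        int clear_values(void *ctx)
        {
            bpf_for_each_map_elem(&vals, zero_cb, NULL, 0);
            return 0;
        }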
      
      Fixes: 69c087ba ("bpf: Add bpf_for_each_map_elem() helper")
      Signed-off-by: Martin KaFai Lau <kafai@fb.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Link: https://lore.kernel.org/bpf/20211106014014.651018-1-kafai@fb.com
  11. 02 Nov 2021, 4 commits
    • bpf: Add alignment padding for "map_extra" + consolidate holes · 8845b468
      Committed by Joanne Koong
      This patch makes 2 changes regarding alignment padding
      for the "map_extra" field.
      
      1) In the kernel header, "map_extra" and "btf_value_type_id"
      are rearranged to consolidate the hole.
      
      Before:
      struct bpf_map {
      	...
              u32		max_entries;	/*    36     4	*/
              u32		map_flags;	/*    40     4	*/
      
              /* XXX 4 bytes hole, try to pack */
      
              u64		map_extra;	/*    48     8	*/
              int		spin_lock_off;	/*    56     4	*/
              int		timer_off;	/*    60     4	*/
              /* --- cacheline 1 boundary (64 bytes) --- */
              u32		id;		/*    64     4	*/
              int		numa_node;	/*    68     4	*/
      	...
              bool		frozen;		/*   117     1	*/
      
              /* XXX 10 bytes hole, try to pack */
      
              /* --- cacheline 2 boundary (128 bytes) --- */
      	...
              struct work_struct	work;	/*   144    72	*/
      
              /* --- cacheline 3 boundary (192 bytes) was 24 bytes ago --- */
      	struct mutex	freeze_mutex;	/*   216   144 	*/
      
              /* --- cacheline 5 boundary (320 bytes) was 40 bytes ago --- */
              u64		writecnt; 	/*   360     8	*/
      
          /* size: 384, cachelines: 6, members: 26 */
          /* sum members: 354, holes: 2, sum holes: 14 */
          /* padding: 16 */
          /* forced alignments: 2, forced holes: 1, sum forced holes: 10 */
      
      } __attribute__((__aligned__(64)));
      
      After:
      struct bpf_map {
      	...
              u32		max_entries;	/*    36     4	*/
              u64		map_extra;	/*    40     8 	*/
              u32		map_flags;	/*    48     4	*/
              int		spin_lock_off;	/*    52     4	*/
              int		timer_off;	/*    56     4	*/
              u32		id;		/*    60     4	*/
      
              /* --- cacheline 1 boundary (64 bytes) --- */
              int		numa_node;	/*    64     4	*/
      	...
      	bool		frozen;		/*   113     1  */
      
              /* XXX 14 bytes hole, try to pack */
      
              /* --- cacheline 2 boundary (128 bytes) --- */
      	...
              struct work_struct	work;	/*   144    72	*/
      
              /* --- cacheline 3 boundary (192 bytes) was 24 bytes ago --- */
              struct mutex	freeze_mutex;	/*   216   144	*/
      
              /* --- cacheline 5 boundary (320 bytes) was 40 bytes ago --- */
              u64		writecnt;       /*   360     8	*/
      
          /* size: 384, cachelines: 6, members: 26 */
          /* sum members: 354, holes: 1, sum holes: 14 */
          /* padding: 16 */
          /* forced alignments: 2, forced holes: 1, sum forced holes: 14 */
      
      } __attribute__((__aligned__(64)));
      
      2) Add alignment padding to the bpf_map_info struct
      More details can be found in commit 36f9814a ("bpf: fix uapi hole
      for 32 bit compat applications")
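      
      A shape-only sketch of this kind of explicit alignment padding
      (illustrative struct, not the actual uapi bpf_map_info layout):
      
        #include <linux/types.h>
        
        /* An anonymous 32-bit bit-field keeps the following __u64 8-byte aligned
         * on 32-bit compat builds instead of relying on implicit, ABI-dependent
         * compiler padding. */
        struct example_info {
            __u32   some_id;
            __u32   :32;        /* explicit alignment pad */
            __u64   map_extra;
        };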
      Signed-off-by: Joanne Koong <joannekoong@fb.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Acked-by: Yonghong Song <yhs@fb.com>
      Link: https://lore.kernel.org/bpf/20211029224909.1721024-3-joannekoong@fb.com
    • bpf: Add dummy BPF STRUCT_OPS for test purpose · c196906d
      Committed by Hou Tao
      Currently the test of BPF STRUCT_OPS depends on the specific bpf
      implementation of tcp_congestion_ops, but it cannot cover all
      basic functionality (e.g., return value handling), so introduce
      a dummy BPF STRUCT_OPS for test purposes.
      
      Loading a bpf_dummy_ops implementation from userspace is prohibited,
      and its only purpose is to run BPF_PROG_TYPE_STRUCT_OPS program
      through bpf(BPF_PROG_TEST_RUN). Now programs for test_1() & test_2()
      are supported. The following three cases are exercised in
      bpf_dummy_struct_ops_test_run():
      
      (1) test and check the value returned from state arg in test_1(state)
      The content of state is copied from a userspace pointer and copied back
      after calling test_1(state). The user pointer is saved in a u64 array
      and the array address is passed through ctx_in.
      
      (2) test and check the return value of test_1(NULL)
      Just simulate the case in which an invalid input argument is passed in.
      
      (3) test passing multiple arguments in test_2(state, ...)
      5 arguments are passed through ctx_in in the form of a u64 array. The first
      element of the array is the userspace pointer of state and the other 4
      arguments follow.
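      
      A hedged user-space sketch of driving such a test via BPF_PROG_TEST_RUN
      with ctx_in (written against the current libbpf opts API, not copied from
      the selftest; names are illustrative):
      
        #include <bpf/bpf.h>
        
        /* Run an already-loaded BPF_PROG_TYPE_STRUCT_OPS test_1 program, passing
         * the user pointer of 'state' through a u64 array in ctx_in. */
        static int run_test_1(int prog_fd, void *state)
        {
            __u64 args[1] = { (__u64)(unsigned long)state };
            LIBBPF_OPTS(bpf_test_run_opts, opts,
                .ctx_in = args,
                .ctx_size_in = sizeof(args),
            );
            int err;
        
            err = bpf_prog_test_run_opts(prog_fd, &opts);
            return err ? err : opts.retval;   /* retval is test_1()'s return value */
        }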
      Signed-off-by: Hou Tao <houtao1@huawei.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Acked-by: Martin KaFai Lau <kafai@fb.com>
      Link: https://lore.kernel.org/bpf/20211025064025.2567443-4-houtao1@huawei.com
    • bpf: Factor out helpers for ctx access checking · 35346ab6
      Committed by Hou Tao
      Factor out two helpers to check ctx read access for raw tracepoint
      and BTF functions. bpf_tracing_ctx_access() checks that the read
      access to an argument is valid, and bpf_tracing_btf_ctx_access()
      additionally checks whether the BTF type of the argument is valid.
      bpf_tracing_btf_ctx_access() will be used by the following patch.
      Signed-off-by: Hou Tao <houtao1@huawei.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Acked-by: Martin KaFai Lau <kafai@fb.com>
      Link: https://lore.kernel.org/bpf/20211025064025.2567443-3-houtao1@huawei.com
    • bpf: Factor out a helper to prepare trampoline for struct_ops prog · 31a645ae
      Committed by Hou Tao
      Factor out a helper, bpf_struct_ops_prepare_trampoline(), to prepare
      the trampoline for a BPF_PROG_TYPE_STRUCT_OPS prog. It will be used by
      the .test_run callback in a following patch.
      Signed-off-by: Hou Tao <houtao1@huawei.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Acked-by: Martin KaFai Lau <kafai@fb.com>
      Link: https://lore.kernel.org/bpf/20211025064025.2567443-2-houtao1@huawei.com
  12. 29 Oct 2021, 2 commits
    • bpf: Add bpf_kallsyms_lookup_name helper · d6aef08a
      Committed by Kumar Kartikeya Dwivedi
      This helper allows us to get the address of a kernel symbol from inside
      a BPF_PROG_TYPE_SYSCALL prog (used by gen_loader), so that we can
      relocate typeless ksym vars.
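      
      A minimal hedged sketch of using the helper from a BPF_PROG_TYPE_SYSCALL
      program (the symbol name and program layout are illustrative):
      
        // SPDX-License-Identifier: GPL-2.0
        #include <vmlinux.h>
        #include <bpf/bpf_helpers.h>
        
        char LICENSE[] SEC("license") = "GPL";
        
        __u64 sym_addr;     /* read back from global data by user space */
        
        SEC("syscall")
        int resolve_sym(void *ctx)
        {
            const char name[] = "bpf_prog_put";   /* illustrative symbol */
        
            /* name_sz must include the terminating NUL */
            return bpf_kallsyms_lookup_name(name, sizeof(name), 0, &sym_addr);
        }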
      Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Acked-by: Song Liu <songliubraving@fb.com>
      Link: https://lore.kernel.org/bpf/20211028063501.2239335-2-memxor@gmail.com
    • bpf: Add bloom filter map implementation · 9330986c
      Committed by Joanne Koong
      This patch adds the kernel-side changes for the implementation of
      a bpf bloom filter map.
      
      The bloom filter map supports peek (determining whether an element
      is present in the map) and push (adding an element to the map)
      operations. These operations are exposed to userspace applications
      through the already existing syscalls in the following way:
      
      BPF_MAP_LOOKUP_ELEM -> peek
      BPF_MAP_UPDATE_ELEM -> push
      
      The bloom filter map does not have keys, only values. In light of
      this, the bloom filter map's API matches that of the queue/stack maps:
      user applications use BPF_MAP_LOOKUP_ELEM/BPF_MAP_UPDATE_ELEM
      which correspond internally to bpf_map_peek_elem/bpf_map_push_elem,
      and bpf programs must use the bpf_map_peek_elem and bpf_map_push_elem
      APIs to query or add an element to the bloom filter map. When the
      bloom filter map is created, it must be created with a key_size of 0.
      
      For updates, the user will pass in the element to add to the map
      as the value, with a NULL key. For lookups, the user will pass in the
      element to query in the map as the value, with a NULL key. In the
      verifier layer, this requires us to modify the argument type of
      a bloom filter's BPF_FUNC_map_peek_elem call to ARG_PTR_TO_MAP_VALUE;
      as well, in the syscall layer, we need to copy over the user value
      so that in bpf_map_peek_elem, we know which specific value to query.
      
      A few things to take note of:
       * If there are any concurrent lookups + updates, the user is
      responsible for synchronizing this to ensure no false negative lookups
      occur.
       * The number of hashes to use for the bloom filter is configurable from
      userspace. If no number is specified, the default used will be 5 hash
      functions. The benchmarks later in this patchset can help compare the
      performance of using different numbers of hashes on different entry
      sizes. In general, using more hashes decreases both the false positive
      rate and the speed of a lookup.
       * Deleting an element in the bloom filter map is not supported.
       * The bloom filter map may be used as an inner map.
       * The "max_entries" size that is specified at map creation time is used
      to approximate a reasonable bitmap size for the bloom filter, and is not
      otherwise strictly enforced. If the user wishes to insert more entries
      into the bloom filter than "max_entries", they may do so but they should
      be aware that this may lead to a higher false positive rate.
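      
      A hedged sketch of defining and using such a map from the BPF side
      (sizes, the map_extra value, and the attach point are illustrative):
      
        // SPDX-License-Identifier: GPL-2.0
        #include <vmlinux.h>
        #include <bpf/bpf_helpers.h>
        
        char LICENSE[] SEC("license") = "GPL";
        
        struct {
            __uint(type, BPF_MAP_TYPE_BLOOM_FILTER);
            __uint(max_entries, 1000);
            __uint(value_size, sizeof(__u64));
            __uint(map_extra, 3);   /* number of hash functions (default is 5) */
        } seen SEC(".maps");
        
        SEC("fentry/bpf_fentry_test1")  /* illustrative attach point */
        int record(void *ctx)
        {
            __u64 val = 1234;   /* illustrative element */
        
            /* peek: 0 means "possibly present", -ENOENT means definitely not */
            if (bpf_map_peek_elem(&seen, &val))
                bpf_map_push_elem(&seen, &val, 0);
            return 0;
        }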
      Signed-off-by: Joanne Koong <joannekoong@fb.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Acked-by: Andrii Nakryiko <andrii@kernel.org>
      Link: https://lore.kernel.org/bpf/20211027234504.30744-2-joannekoong@fb.com
  13. 27 Oct 2021, 1 commit
    • bpf: Fix potential race in tail call compatibility check · 54713c85
      Committed by Toke Høiland-Jørgensen
      Lorenzo noticed that the code testing for program type compatibility of
      tail call maps is potentially racy in that two threads could encounter a
      map with an unset type simultaneously and both return true even though they
      are inserting incompatible programs.
      
      The race window is quite small, but artificially enlarging it by adding a
      usleep_range() inside the check in bpf_prog_array_compatible() makes it
      trivial to trigger from userspace with a program that does, essentially:
      
              map_fd = bpf_create_map(BPF_MAP_TYPE_PROG_ARRAY, 4, 4, 2, 0);
              pid = fork();
              if (pid) {
                      key = 0;
                      value = xdp_fd;
              } else {
                      key = 1;
                      value = tc_fd;
              }
              err = bpf_map_update_elem(map_fd, &key, &value, 0);
      
      While the race window is small, it has potentially serious ramifications in
      that triggering it would allow a BPF program to tail call to a program of a
      different type. So let's get rid of it by protecting the update with a
      spinlock. The commit in the Fixes tag is the last commit that touches the
      code in question.
      
      v2:
      - Use a spinlock instead of an atomic variable and cmpxchg() (Alexei)
      v3:
      - Put lock and the members it protects into an embedded 'owner' struct (Daniel)
      
      Fixes: 3324b584 ("ebpf: misc core cleanup")
      Reported-by: Lorenzo Bianconi <lorenzo.bianconi@redhat.com>
      Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Link: https://lore.kernel.org/bpf/20211026110019.363464-1-toke@redhat.com
  14. 22 Oct 2021, 2 commits
  15. 06 Oct 2021, 1 commit
    • bpf: Introduce BPF support for kernel module function calls · 2357672c
      Committed by Kumar Kartikeya Dwivedi
      This change adds support on the kernel side to allow for BPF programs to
      call kernel module functions. Userspace will prepare an array of module
      BTF fds that is passed in during BPF_PROG_LOAD using fd_array parameter.
      In the kernel, the module BTFs are placed in the auxiliary struct for
      bpf_prog, and loaded as needed.
      
      The verifier then uses insn->off to index into the fd_array. insn->off
      0 is reserved for vmlinux BTF (for backwards compat), so userspace must
      use an fd_array index > 0 for module kfunc support. kfunc_btf_tab is
      sorted based on offset in an array, and each offset corresponds to one
      descriptor, with a maximum of 256 such module BTFs.
      
      We also change existing kfunc_tab to distinguish each element based on
      imm, off pair as each such call will now be distinct.
      
      Another change is to the check_kfunc_call callback, which now includes a
      struct module * pointer; this is to be used in a later patch such that the
      kfunc_id and module pointer are matched for dynamically registered BTF
      sets from loadable modules, so that the same kfunc_id in two modules doesn't
      lead to check_kfunc_call succeeding. For the duration of the
      check_kfunc_call, the reference to struct module exists, as it returns
      the pointer stored in kfunc_btf_tab.
      Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Link: https://lore.kernel.org/bpf/20211002011757.311265-2-memxor@gmail.com
  16. 29 Sep 2021, 1 commit
  17. 18 Sep 2021, 2 commits
  18. 15 Sep 2021, 1 commit
  19. 17 Aug 2021, 4 commits
    • bpf: Add bpf_get_attach_cookie() BPF helper to access bpf_cookie value · 7adfc6c9
      Committed by Andrii Nakryiko
      Add new BPF helper, bpf_get_attach_cookie(), which can be used by BPF programs
      to get access to a user-provided bpf_cookie value, specified during BPF
      program attachment (BPF link creation) time.
      
      Naming is hard, though. With the concept being named "BPF cookie", I've
      considered calling the helper:
        - bpf_get_cookie() -- seems too unspecific and easily mistaken with socket
          cookie;
        - bpf_get_bpf_cookie() -- too much tautology;
        - bpf_get_link_cookie() -- would be ok, but while we create a BPF link to
          attach BPF program to BPF hook, it's still an "attachment" and the
          bpf_cookie is associated with BPF program attachment to a hook, not a BPF
          link itself. Technically, we could support bpf_cookie with old-style
          cgroup programs.So I ultimately rejected it in favor of
          bpf_get_attach_cookie().
      
      Currently all perf_event-backed BPF program types support
      bpf_get_attach_cookie() helper. Follow-up patches will add support for
      fentry/fexit programs as well.
      
      While at it, mark bpf_tracing_func_proto() as static to make it obvious that
      it's only used from within the kernel/trace/bpf_trace.c.
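      
      A minimal hedged sketch of reading the cookie from a perf_event-backed
      (kprobe) program; the attach point is illustrative:
      
        // SPDX-License-Identifier: GPL-2.0
        #include <vmlinux.h>
        #include <bpf/bpf_helpers.h>
        #include <bpf/bpf_tracing.h>
        
        char LICENSE[] SEC("license") = "GPL";
        
        SEC("kprobe/do_unlinkat")   /* illustrative attach point */
        int BPF_KPROBE(on_unlink)
        {
            /* the u64 supplied by the attaching application at link creation
             * time (0 if none was provided) */
            __u64 cookie = bpf_get_attach_cookie(ctx);
        
            bpf_printk("attachment cookie %llu", cookie);
            return 0;
        }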
      Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Yonghong Song <yhs@fb.com>
      Link: https://lore.kernel.org/bpf/20210815070609.987780-7-andrii@kernel.org
    • bpf: Allow to specify user-provided bpf_cookie for BPF perf links · 82e6b1ee
      Committed by Andrii Nakryiko
      Add ability for users to specify custom u64 value (bpf_cookie) when creating
      BPF link for perf_event-backed BPF programs (kprobe/uprobe, perf_event,
      tracepoints).
      
      This is useful for cases when the same BPF program is used for attaching and
      processing invocation of different tracepoints/kprobes/uprobes in a generic
      fashion, but such that each invocation is distinguished from each other (e.g.,
      BPF program can look up additional information associated with a specific
      kernel function without having to rely on function IP lookups). This enables
      new use cases to be implemented simply and efficiently that previously were
      possible only through code generation (and thus multiple instances of almost
      identical BPF program) or compilation at runtime (BCC-style) on target hosts
      (even more expensive resource-wise). For uprobes it is not even possible in
      some cases to know function IP before hand (e.g., when attaching to shared
      library without PID filtering, in which case base load address is not known
      for a library).
      
      This is done by storing u64 bpf_cookie in struct bpf_prog_array_item,
      corresponding to each attached and run BPF program. Given cgroup BPF programs
      already use two 8-byte pointers for their needs and cgroup BPF programs don't
      have (yet?) support for bpf_cookie, reuse that space through union of
      cgroup_storage and new bpf_cookie field.
      
      Make it available to kprobe/tracepoint BPF programs through bpf_trace_run_ctx.
      This is set by BPF_PROG_RUN_ARRAY, used by kprobe/uprobe/tracepoint BPF
      program execution code, which luckily is now also split from
      BPF_PROG_RUN_ARRAY_CG. This run context will be utilized by a new BPF helper
      giving access to this user-provided cookie value from inside a BPF program.
      Generic perf_event BPF programs will access this value from perf_event itself
      through the passed-in BPF program context.
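      
      A hedged user-space sketch of supplying the cookie at attach time via
      libbpf's kprobe opts (the attached functions are illustrative and link
      bookkeeping is trimmed):
      
        #include <errno.h>
        #include <bpf/libbpf.h>
        
        /* Attach one kprobe program to two functions with different cookies so
         * each attachment can be told apart via bpf_get_attach_cookie(). */
        static int attach_with_cookies(struct bpf_program *prog)
        {
            LIBBPF_OPTS(bpf_kprobe_opts, opts, .bpf_cookie = 1);
            struct bpf_link *l1, *l2;
        
            l1 = bpf_program__attach_kprobe_opts(prog, "do_sys_openat2", &opts);
            if (!l1)
                return -errno;
        
            opts.bpf_cookie = 2;
            l2 = bpf_program__attach_kprobe_opts(prog, "do_unlinkat", &opts);
            if (!l2) {
                bpf_link__destroy(l1);
                return -errno;
            }
            return 0;
        }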
      Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Yonghong Song <yhs@fb.com>
      Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Link: https://lore.kernel.org/bpf/20210815070609.987780-6-andrii@kernel.org
    • bpf: Refactor BPF_PROG_RUN_ARRAY family of macros into functions · 7d08c2c9
      Committed by Andrii Nakryiko
      Similar to BPF_PROG_RUN, turn BPF_PROG_RUN_ARRAY macros into proper functions
      with all the same readability and maintainability benefits. Making them into
      functions required shuffling around bpf_set_run_ctx/bpf_reset_run_ctx
      functions. Also, explicitly specifying the type of the BPF prog run callback
      required adjusting __bpf_prog_run_save_cb() to accept const void *, casted
      internally to const struct sk_buff.
      
      Further, split out a cgroup-specific BPF_PROG_RUN_ARRAY_CG and
      BPF_PROG_RUN_ARRAY_CG_FLAGS from the more generic BPF_PROG_RUN_ARRAY due to
      the differences in bpf_run_ctx used for those two different use cases.
      
      I think BPF_PROG_RUN_ARRAY_CG would benefit from further refactoring to accept
      struct cgroup and enum bpf_attach_type instead of bpf_prog_array, fetching
      cgrp->bpf.effective[type] and RCU-dereferencing it internally. But that
      required including include/linux/cgroup-defs.h, which I wasn't sure is ok with
      everyone.
      
      The remaining generic BPF_PROG_RUN_ARRAY function will be extended to
      pass-through user-provided context value in the next patch.
      Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Yonghong Song <yhs@fb.com>
      Link: https://lore.kernel.org/bpf/20210815070609.987780-3-andrii@kernel.org
    • bpf: Refactor BPF_PROG_RUN into a function · fb7dd8bc
      Committed by Andrii Nakryiko
      Turn BPF_PROG_RUN into a proper always inlined function. No functional and
      performance changes are intended, but it makes it much easier to understand
      what's going on with how BPF programs actually get executed. It's more
      obvious what types and callbacks are expected. Also extra () around input
      parameters can be dropped, as well as `__` variable prefixes intended to avoid
      naming collisions, which makes the code simpler to read and write.
      
      This refactoring also highlighted one extra issue. BPF_PROG_RUN is both
      a macro and an enum value (BPF_PROG_RUN == BPF_PROG_TEST_RUN). Turning
      BPF_PROG_RUN into a function causes naming conflict compilation error. So
      rename BPF_PROG_RUN into lower-case bpf_prog_run(), similar to
      bpf_prog_run_xdp(), bpf_prog_run_pin_on_cpu(), etc. All existing callers of
      BPF_PROG_RUN, the macro, are switched to bpf_prog_run() explicitly.
      Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Yonghong Song <yhs@fb.com>
      Link: https://lore.kernel.org/bpf/20210815070609.987780-2-andrii@kernel.org
  20. 24 Jul 2021, 1 commit
  21. 17 Jul 2021, 1 commit
    • bpf: Add ambient BPF runtime context stored in current · c7603cfa
      Committed by Andrii Nakryiko
      Commit b910eaaa ("bpf: Fix NULL pointer dereference in bpf_get_local_storage()
      helper") fixed the problem with cgroup-local storage use in BPF by
      pre-allocating per-CPU array of 8 cgroup storage pointers to accommodate
      possible BPF program preemptions and nested executions.
      
      While this seems to work well in practice, it introduces a new and unnecessary
      failure mode in which not all BPF programs might be executed if we fail to
      find an unused slot for cgroup storage, however unlikely it is. It might also
      not be so unlikely when/if we allow sleepable cgroup BPF programs in the
      future.
      
      Further, the way that cgroup storage is implemented as ambiently-available
      property during entire BPF program execution is a convenient way to pass extra
      information to BPF program and helpers without requiring user code to pass
      around extra arguments explicitly. So it would be good to have a generic
      solution that can allow implementing this without arbitrary restrictions.
      Ideally, such solution would work for both preemptable and sleepable BPF
      programs in exactly the same way.
      
      This patch introduces such solution, bpf_run_ctx. It adds one pointer field
      (bpf_ctx) to task_struct. This field is maintained by BPF_PROG_RUN family of
      macros in such a way that it always stays valid throughout BPF program
      execution. BPF program preemption is handled by remembering previous
      current->bpf_ctx value locally while executing nested BPF program and
      restoring old value after nested BPF program finishes. This is handled by two
      helper functions, bpf_set_run_ctx() and bpf_reset_run_ctx(), which are
      supposed to be used before and after BPF program runs, respectively.
      
      Restoring old value of the pointer handles preemption, while bpf_run_ctx
      pointer being a property of current task_struct naturally solves this problem
      for sleepable BPF programs by "following" BPF program execution as it is
      scheduled in and out of CPU. It would even allow CPU migration of BPF
      programs, even though it's not currently allowed by BPF infra.
      
      This patch cleans up cgroup local storage handling as a first application. The
      design itself is generic, though, with bpf_run_ctx being an empty struct that
      is supposed to be embedded into a specific struct for a given BPF program type
      (bpf_cg_run_ctx in this case). Follow up patches are planned that will expand
      this mechanism for other uses within tracing BPF programs.
      
      To verify that this change doesn't revert the fix to the original cgroup
      storage issue, I ran the same repro as in the original report ([0]) and didn't
      get any problems. Replacing bpf_reset_run_ctx(old_run_ctx) with
      bpf_reset_run_ctx(NULL) triggers the issue pretty quickly (so repro does work).
      
        [0] https://lore.kernel.org/bpf/YEEvBUiJl2pJkxTd@krava/
      
      Fixes: b910eaaa ("bpf: Fix NULL pointer dereference in bpf_get_local_storage() helper")
      Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Yonghong Song <yhs@fb.com>
      Link: https://lore.kernel.org/bpf/20210712230615.3525979-1-andrii@kernel.org
  22. 16 Jul 2021, 3 commits
    • sock_map: Relax config dependency to CONFIG_NET · 17edea21
      Committed by Cong Wang
      Currently sock_map still has Kconfig dependency on CONFIG_INET,
      but there is no actual functional dependency on it after we
      introduce ->psock_update_sk_prot().
      
      We have to extend it to CONFIG_NET now as we are going to
      support AF_UNIX.
      Signed-off-by: Cong Wang <cong.wang@bytedance.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Link: https://lore.kernel.org/bpf/20210704190252.11866-2-xiyou.wangcong@gmail.com
    • bpf, x86: Store caller's ip in trampoline stack · 7e6f3cd8
      Committed by Jiri Olsa
      Store the caller's IP on the trampoline's stack. Trampoline programs
      can reach the IP at the (ctx - 8) address, so there is no change to the
      program's arguments interface.
      
      The IP address is taken from [fp + 8], which is the return address
      from the initial 'call fentry' call into the trampoline.
      
      This IP address will be returned via the bpf_get_func_ip helper,
      which is added in following patches.
      Signed-off-by: Jiri Olsa <jolsa@kernel.org>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Link: https://lore.kernel.org/bpf/20210714094400.396467-2-jolsa@kernel.org
    • bpf: Add map side support for bpf timers. · 68134668
      Committed by Alexei Starovoitov
      Restrict bpf timers to array, hash (both preallocated and kmalloced), and
      lru map types. The per-cpu maps with timers don't make sense, since 'struct
      bpf_timer' is a part of map value. bpf timers in per-cpu maps would mean that
      the number of timers depends on number of possible cpus and timers would not be
      accessible from all cpus. lpm map support can be added in the future.
      The timers in inner maps are supported.
      
      The bpf_map_update/delete_elem() helpers and sys_bpf commands cancel and free
      bpf_timer in a given map element.
      
      Similar to 'struct bpf_spin_lock' BTF is required and it is used to validate
      that map element indeed contains 'struct bpf_timer'.
      
      Make check_and_init_map_value() init both bpf_spin_lock and bpf_timer when
      map element data is reused in preallocated htab and lru maps.
      
      Teach copy_map_value() to support both bpf_spin_lock and bpf_timer in a single
      map element. There could be one of each, but not more than one. Due to 'one
      bpf_timer in one element' restriction do not support timers in global data,
      since global data is a map of single element, but from bpf program side it's
      seen as many global variables and restriction of single global timer would be
      odd. The sys_bpf map_freeze and sys_mmap syscalls are not allowed on maps with
      timers, since user space could have corrupted mmap element and crashed the
      kernel. The maps with timers cannot be readonly. Due to these restrictions
      search for bpf_timer in datasec BTF in case it was placed in the global data to
      report clear error.
      
      The previous patch allowed 'struct bpf_timer' as a first field in a map
      element only. Relax this restriction.
      
      Refactor lru map to s/bpf_lru_push_free/htab_lru_push_free/ to cancel and free
      the timer when lru map deletes an element as a part of it eviction algorithm.
      
      Make sure that a bpf program cannot access 'struct bpf_timer' via direct load/store.
      The timer operations are done through helpers only.
      This is similar to 'struct bpf_spin_lock'.
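      
      A hedged sketch of arming such a timer from a program, using the helpers
      this series introduces (the map layout, timeout, and attach point are
      illustrative):
      
        // SPDX-License-Identifier: GPL-2.0
        #include <vmlinux.h>
        #include <bpf/bpf_helpers.h>
        
        char LICENSE[] SEC("license") = "GPL";
        
        #define CLOCK_MONOTONIC 1   /* mirrors the uapi clockid value */
        
        struct elem {
            struct bpf_timer t;
        };
        
        struct {
            __uint(type, BPF_MAP_TYPE_ARRAY);
            __uint(max_entries, 1);
            __type(key, int);
            __type(value, struct elem);
        } timers SEC(".maps");
        
        static int timer_cb(void *map, int *key, struct elem *e)
        {
            /* runs later, asynchronously to the program that armed the timer */
            return 0;
        }
        
        SEC("fentry/bpf_fentry_test1")  /* illustrative attach point */
        int arm_timer(void *ctx)
        {
            int key = 0;
            struct elem *e = bpf_map_lookup_elem(&timers, &key);
        
            if (!e)
                return 0;
            bpf_timer_init(&e->t, &timers, CLOCK_MONOTONIC);
            bpf_timer_set_callback(&e->t, timer_cb);
            bpf_timer_start(&e->t, 1000000000 /* 1s in ns */, 0);
            return 0;
        }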
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Yonghong Song <yhs@fb.com>
      Acked-by: Martin KaFai Lau <kafai@fb.com>
      Acked-by: Andrii Nakryiko <andrii@kernel.org>
      Acked-by: Toke Høiland-Jørgensen <toke@redhat.com>
      Link: https://lore.kernel.org/bpf/20210715005417.78572-5-alexei.starovoitov@gmail.com