1. 08 Nov, 2021 1 commit
  2. 02 Nov, 2021 4 commits
    • bpf: Add alignment padding for "map_extra" + consolidate holes · 8845b468
      Joanne Koong authored
      This patch makes 2 changes regarding alignment padding
      for the "map_extra" field.
      
      1) In the kernel header, "map_extra" and "btf_value_type_id"
      are rearranged to consolidate the hole.
      
      Before:
      struct bpf_map {
      	...
              u32		max_entries;	/*    36     4	*/
              u32		map_flags;	/*    40     4	*/
      
              /* XXX 4 bytes hole, try to pack */
      
              u64		map_extra;	/*    48     8	*/
              int		spin_lock_off;	/*    56     4	*/
              int		timer_off;	/*    60     4	*/
              /* --- cacheline 1 boundary (64 bytes) --- */
              u32		id;		/*    64     4	*/
              int		numa_node;	/*    68     4	*/
      	...
              bool		frozen;		/*   117     1	*/
      
              /* XXX 10 bytes hole, try to pack */
      
              /* --- cacheline 2 boundary (128 bytes) --- */
      	...
              struct work_struct	work;	/*   144    72	*/
      
              /* --- cacheline 3 boundary (192 bytes) was 24 bytes ago --- */
      	struct mutex	freeze_mutex;	/*   216   144 	*/
      
              /* --- cacheline 5 boundary (320 bytes) was 40 bytes ago --- */
              u64		writecnt; 	/*   360     8	*/
      
          /* size: 384, cachelines: 6, members: 26 */
          /* sum members: 354, holes: 2, sum holes: 14 */
          /* padding: 16 */
          /* forced alignments: 2, forced holes: 1, sum forced holes: 10 */
      
      } __attribute__((__aligned__(64)));
      
      After:
      struct bpf_map {
      	...
              u32		max_entries;	/*    36     4	*/
              u64		map_extra;	/*    40     8 	*/
              u32		map_flags;	/*    48     4	*/
              int		spin_lock_off;	/*    52     4	*/
              int		timer_off;	/*    56     4	*/
              u32		id;		/*    60     4	*/
      
              /* --- cacheline 1 boundary (64 bytes) --- */
              int		numa_node;	/*    64     4	*/
      	...
              bool		frozen;		/*   113     1	*/
      
              /* XXX 14 bytes hole, try to pack */
      
              /* --- cacheline 2 boundary (128 bytes) --- */
      	...
              struct work_struct	work;	/*   144    72	*/
      
              /* --- cacheline 3 boundary (192 bytes) was 24 bytes ago --- */
              struct mutex	freeze_mutex;	/*   216   144	*/
      
              /* --- cacheline 5 boundary (320 bytes) was 40 bytes ago --- */
              u64		writecnt;       /*   360     8	*/
      
          /* size: 384, cachelines: 6, members: 26 */
          /* sum members: 354, holes: 1, sum holes: 14 */
          /* padding: 16 */
          /* forced alignments: 2, forced holes: 1, sum forced holes: 14 */
      
      } __attribute__((__aligned__(64)));
      
      2) Add alignment padding to the bpf_map_info struct
      More details can be found in commit 36f9814a ("bpf: fix uapi hole
      for 32 bit compat applications")
      Signed-off-by: Joanne Koong <joannekoong@fb.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Acked-by: Yonghong Song <yhs@fb.com>
      Link: https://lore.kernel.org/bpf/20211029224909.1721024-3-joannekoong@fb.com
    • bpf: Add dummy BPF STRUCT_OPS for test purpose · c196906d
      Hou Tao authored
      Currently the test of BPF STRUCT_OPS depends on the specific bpf
      implementation of tcp_congestion_ops, but it cannot cover all
      basic functionalities (e.g., return value handling), so introduce
      a dummy BPF STRUCT_OPS for test purposes.
      
      Loading a bpf_dummy_ops implementation from userspace is prohibited,
      and its only purpose is to run BPF_PROG_TYPE_STRUCT_OPS program
      through bpf(BPF_PROG_TEST_RUN). Now programs for test_1() & test_2()
      are supported. The following three cases are exercised in
      bpf_dummy_struct_ops_test_run():
      
      (1) test and check the value returned from state arg in test_1(state)
      The content of state is copied from a userspace pointer and copied back
      after calling test_1(state). The user pointer is saved in a u64 array
      and the array address is passed through ctx_in.
      
      (2) test and check the return value of test_1(NULL)
      Just simulate the case in which an invalid input argument is passed in.
      
      (3) test multiple arguments passing in test_2(state, ...)
      5 arguments are passed through ctx_in in the form of a u64 array. The first
      element of the array is the userspace pointer to state, and the other 4
      arguments follow.
      Signed-off-by: Hou Tao <houtao1@huawei.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Acked-by: Martin KaFai Lau <kafai@fb.com>
      Link: https://lore.kernel.org/bpf/20211025064025.2567443-4-houtao1@huawei.com
    • bpf: Factor out helpers for ctx access checking · 35346ab6
      Hou Tao authored
      Factor out two helpers to check the read access of ctx for raw tp
      and BTF function. bpf_tracing_ctx_access() checks that the read
      access to an argument is valid, and bpf_tracing_btf_ctx_access()
      additionally checks whether the BTF type of the argument is valid.
      bpf_tracing_btf_ctx_access() will be used by the following patch.
      Signed-off-by: Hou Tao <houtao1@huawei.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Acked-by: Martin KaFai Lau <kafai@fb.com>
      Link: https://lore.kernel.org/bpf/20211025064025.2567443-3-houtao1@huawei.com
    • bpf: Factor out a helper to prepare trampoline for struct_ops prog · 31a645ae
      Hou Tao authored
      Factor out a helper bpf_struct_ops_prepare_trampoline() to prepare
      the trampoline for a BPF_PROG_TYPE_STRUCT_OPS prog. It will be used by
      the .test_run callback in a following patch.
      Signed-off-by: Hou Tao <houtao1@huawei.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Acked-by: Martin KaFai Lau <kafai@fb.com>
      Link: https://lore.kernel.org/bpf/20211025064025.2567443-2-houtao1@huawei.com
  3. 29 Oct, 2021 2 commits
    • bpf: Add bpf_kallsyms_lookup_name helper · d6aef08a
      Kumar Kartikeya Dwivedi authored
      This helper allows us to get the address of a kernel symbol from inside
      a BPF_PROG_TYPE_SYSCALL prog (used by gen_loader), so that we can
      relocate typeless ksym vars.
      Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Acked-by: Song Liu <songliubraving@fb.com>
      Link: https://lore.kernel.org/bpf/20211028063501.2239335-2-memxor@gmail.com
    • bpf: Add bloom filter map implementation · 9330986c
      Joanne Koong authored
      This patch adds the kernel-side changes for the implementation of
      a bpf bloom filter map.
      
      The bloom filter map supports peek (determining whether an element
      is present in the map) and push (adding an element to the map)
      operations. These operations are exposed to userspace applications
      through the already existing syscalls in the following way:
      
      BPF_MAP_LOOKUP_ELEM -> peek
      BPF_MAP_UPDATE_ELEM -> push
      
      The bloom filter map does not have keys, only values. In light of
      this, the bloom filter map's API matches that of queue stack maps:
      user applications use BPF_MAP_LOOKUP_ELEM/BPF_MAP_UPDATE_ELEM
      which correspond internally to bpf_map_peek_elem/bpf_map_push_elem,
      and bpf programs must use the bpf_map_peek_elem and bpf_map_push_elem
      APIs to query or add an element to the bloom filter map. When the
      bloom filter map is created, it must be created with a key_size of 0.
      
      For updates, the user will pass in the element to add to the map
      as the value, with a NULL key. For lookups, the user will pass in the
      element to query in the map as the value, with a NULL key. In the
      verifier layer, this requires us to modify the argument type of
      a bloom filter's BPF_FUNC_map_peek_elem call to ARG_PTR_TO_MAP_VALUE.
      In the syscall layer, we also need to copy over the user value
      so that in bpf_map_peek_elem, we know which specific value to query.
      
      A few things to note:
       * If there are any concurrent lookups + updates, the user is
      responsible for synchronizing this to ensure no false negative lookups
      occur.
       * The number of hashes to use for the bloom filter is configurable from
      userspace. If no number is specified, the default used will be 5 hash
      functions. The benchmarks later in this patchset can help compare the
      performance of using different numbers of hashes on different entry
      sizes. In general, using more hashes decreases both the false positive
      rate and the speed of a lookup.
       * Deleting an element in the bloom filter map is not supported.
       * The bloom filter map may be used as an inner map.
       * The "max_entries" size that is specified at map creation time is used
      to approximate a reasonable bitmap size for the bloom filter, and is not
      otherwise strictly enforced. If the user wishes to insert more entries
      into the bloom filter than "max_entries", they may do so but they should
      be aware that this may lead to a higher false positive rate.
      Signed-off-by: Joanne Koong <joannekoong@fb.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Acked-by: Andrii Nakryiko <andrii@kernel.org>
      Link: https://lore.kernel.org/bpf/20211027234504.30744-2-joannekoong@fb.com
  4. 27 Oct, 2021 1 commit
    • bpf: Fix potential race in tail call compatibility check · 54713c85
      Toke Høiland-Jørgensen authored
      Lorenzo noticed that the code testing for program type compatibility of
      tail call maps is potentially racy in that two threads could encounter a
      map with an unset type simultaneously and both return true even though they
      are inserting incompatible programs.
      
      The race window is quite small, but artificially enlarging it by adding a
      usleep_range() inside the check in bpf_prog_array_compatible() makes it
      trivial to trigger from userspace with a program that does, essentially:
      
              map_fd = bpf_create_map(BPF_MAP_TYPE_PROG_ARRAY, 4, 4, 2, 0);
              pid = fork();
              if (pid) {
                      key = 0;
                      value = xdp_fd;
              } else {
                      key = 1;
                      value = tc_fd;
              }
              err = bpf_map_update_elem(map_fd, &key, &value, 0);
      
      While the race window is small, it has potentially serious ramifications in
      that triggering it would allow a BPF program to tail call to a program of a
      different type. So let's get rid of it by protecting the update with a
      spinlock. The commit in the Fixes tag is the last commit that touches the
      code in question.
      
      v2:
      - Use a spinlock instead of an atomic variable and cmpxchg() (Alexei)
      v3:
      - Put lock and the members it protects into an embedded 'owner' struct (Daniel)
      
      Fixes: 3324b584 ("ebpf: misc core cleanup")
      Reported-by: Lorenzo Bianconi <lorenzo.bianconi@redhat.com>
      Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Link: https://lore.kernel.org/bpf/20211026110019.363464-1-toke@redhat.com
  5. 22 Oct, 2021 2 commits
  6. 06 Oct, 2021 1 commit
    • bpf: Introduce BPF support for kernel module function calls · 2357672c
      Kumar Kartikeya Dwivedi authored
      This change adds support on the kernel side to allow for BPF programs to
      call kernel module functions. Userspace will prepare an array of module
      BTF fds that is passed in during BPF_PROG_LOAD using fd_array parameter.
      In the kernel, the module BTFs are placed in the auxiliary struct for
      bpf_prog, and loaded as needed.
      
      The verifier then uses insn->off to index into the fd_array. insn->off
      0 is reserved for vmlinux BTF (for backwards compat), so userspace must
      use an fd_array index > 0 for module kfunc support. kfunc_btf_tab is
      sorted based on offset in an array, and each offset corresponds to one
      descriptor, with a max limit up to 256 such module BTFs.
      
      We also change existing kfunc_tab to distinguish each element based on
      imm, off pair as each such call will now be distinct.
      
      Another change is to the check_kfunc_call callback, which now includes a
      struct module * pointer. This is to be used in a later patch so that the
      kfunc_id and module pointer are matched for dynamically registered BTF
      sets from loadable modules, so that the same kfunc_id in two modules doesn't
      lead to check_kfunc_call succeeding. For the duration of the
      check_kfunc_call, the reference to struct module exists, as it returns
      the pointer stored in kfunc_btf_tab.
      Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Link: https://lore.kernel.org/bpf/20211002011757.311265-2-memxor@gmail.com
  7. 29 Sep, 2021 1 commit
  8. 18 Sep, 2021 2 commits
  9. 15 Sep, 2021 1 commit
  10. 17 Aug, 2021 4 commits
    • bpf: Add bpf_get_attach_cookie() BPF helper to access bpf_cookie value · 7adfc6c9
      Andrii Nakryiko authored
      Add new BPF helper, bpf_get_attach_cookie(), which can be used by BPF programs
      to get access to a user-provided bpf_cookie value, specified during BPF
      program attachment (BPF link creation) time.
      
      Naming is hard, though. With the concept being named "BPF cookie", I've
      considered calling the helper:
        - bpf_get_cookie() -- seems too unspecific and easily mistaken with socket
          cookie;
        - bpf_get_bpf_cookie() -- too much tautology;
        - bpf_get_link_cookie() -- would be ok, but while we create a BPF link to
          attach BPF program to BPF hook, it's still an "attachment" and the
          bpf_cookie is associated with BPF program attachment to a hook, not a BPF
          link itself. Technically, we could support bpf_cookie with old-style
          cgroup programs. So I ultimately rejected it in favor of
          bpf_get_attach_cookie().
      
      Currently all perf_event-backed BPF program types support
      bpf_get_attach_cookie() helper. Follow-up patches will add support for
      fentry/fexit programs as well.
      
      While at it, mark bpf_tracing_func_proto() as static to make it obvious that
      it's only used from within the kernel/trace/bpf_trace.c.
      Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Yonghong Song <yhs@fb.com>
      Link: https://lore.kernel.org/bpf/20210815070609.987780-7-andrii@kernel.org
    • bpf: Allow to specify user-provided bpf_cookie for BPF perf links · 82e6b1ee
      Andrii Nakryiko authored
      Add ability for users to specify custom u64 value (bpf_cookie) when creating
      BPF link for perf_event-backed BPF programs (kprobe/uprobe, perf_event,
      tracepoints).
      
      This is useful for cases when the same BPF program is used for attaching and
      processing invocation of different tracepoints/kprobes/uprobes in a generic
      fashion, but such that each invocation is distinguished from each other (e.g.,
      BPF program can look up additional information associated with a specific
      kernel function without having to rely on function IP lookups). This enables
      new use cases to be implemented simply and efficiently that previously were
      possible only through code generation (and thus multiple instances of almost
      identical BPF program) or compilation at runtime (BCC-style) on target hosts
      (even more expensive resource-wise). For uprobes it is not even possible in
      some cases to know the function IP beforehand (e.g., when attaching to a shared
      library without PID filtering, in which case the base load address is not known
      for a library).
      
      This is done by storing u64 bpf_cookie in struct bpf_prog_array_item,
      corresponding to each attached and run BPF program. Given cgroup BPF programs
      already use two 8-byte pointers for their needs and cgroup BPF programs don't
      have (yet?) support for bpf_cookie, reuse that space through union of
      cgroup_storage and new bpf_cookie field.
      
      Make it available to kprobe/tracepoint BPF programs through bpf_trace_run_ctx.
      This is set by BPF_PROG_RUN_ARRAY, used by kprobe/uprobe/tracepoint BPF
      program execution code, which luckily is now also split from
      BPF_PROG_RUN_ARRAY_CG. This run context will be utilized by a new BPF helper
      giving access to this user-provided cookie value from inside a BPF program.
      Generic perf_event BPF programs will access this value from perf_event itself
      through passed in BPF program context.
      Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Yonghong Song <yhs@fb.com>
      Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Link: https://lore.kernel.org/bpf/20210815070609.987780-6-andrii@kernel.org
    • bpf: Refactor BPF_PROG_RUN_ARRAY family of macros into functions · 7d08c2c9
      Andrii Nakryiko authored
      Similar to BPF_PROG_RUN, turn BPF_PROG_RUN_ARRAY macros into proper functions
      with all the same readability and maintainability benefits. Making them into
      functions required shuffling around bpf_set_run_ctx/bpf_reset_run_ctx
      functions. Also, explicitly specifying the type of the BPF prog run callback
      required adjusting __bpf_prog_run_save_cb() to accept const void *, cast
      internally to const struct sk_buff.
      
      Further, split out a cgroup-specific BPF_PROG_RUN_ARRAY_CG and
      BPF_PROG_RUN_ARRAY_CG_FLAGS from the more generic BPF_PROG_RUN_ARRAY due to
      the differences in bpf_run_ctx used for those two different use cases.
      
      I think BPF_PROG_RUN_ARRAY_CG would benefit from further refactoring to accept
      struct cgroup and enum bpf_attach_type instead of bpf_prog_array, fetching
      cgrp->bpf.effective[type] and RCU-dereferencing it internally. But that
      required including include/linux/cgroup-defs.h, which I wasn't sure is ok with
      everyone.
      
      The remaining generic BPF_PROG_RUN_ARRAY function will be extended to
      pass-through user-provided context value in the next patch.
      Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Yonghong Song <yhs@fb.com>
      Link: https://lore.kernel.org/bpf/20210815070609.987780-3-andrii@kernel.org
    • bpf: Refactor BPF_PROG_RUN into a function · fb7dd8bc
      Andrii Nakryiko authored
      Turn BPF_PROG_RUN into a proper always-inlined function. No functional or
      performance changes are intended, but it makes it much easier to understand
      how BPF programs actually get executed. It's more
      obvious what types and callbacks are expected. Also, the extra () around input
      parameters can be dropped, as well as the `__` variable prefixes intended to avoid
      naming collisions, which makes the code simpler to read and write.
      
      This refactoring also highlighted one extra issue. BPF_PROG_RUN is both
      a macro and an enum value (BPF_PROG_RUN == BPF_PROG_TEST_RUN). Turning
      BPF_PROG_RUN into a function causes naming conflict compilation error. So
      rename BPF_PROG_RUN into lower-case bpf_prog_run(), similar to
      bpf_prog_run_xdp(), bpf_prog_run_pin_on_cpu(), etc. All existing callers of
      BPF_PROG_RUN, the macro, are switched to bpf_prog_run() explicitly.
      Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Yonghong Song <yhs@fb.com>
      Link: https://lore.kernel.org/bpf/20210815070609.987780-2-andrii@kernel.org
  11. 24 Jul, 2021 1 commit
  12. 17 Jul, 2021 1 commit
    • bpf: Add ambient BPF runtime context stored in current · c7603cfa
      Andrii Nakryiko authored
      b910eaaa ("bpf: Fix NULL pointer dereference in bpf_get_local_storage()
      helper") fixed the problem with cgroup-local storage use in BPF by
      pre-allocating per-CPU array of 8 cgroup storage pointers to accommodate
      possible BPF program preemptions and nested executions.
      
      While this seems to work well in practice, it introduces a new and unnecessary
      failure mode in which not all BPF programs might be executed if we fail to
      find an unused slot for cgroup storage, however unlikely it is. It might also
      not be so unlikely when/if we allow sleepable cgroup BPF programs in the
      future.
      
      Further, the way that cgroup storage is implemented as ambiently-available
      property during entire BPF program execution is a convenient way to pass extra
      information to BPF program and helpers without requiring user code to pass
      around extra arguments explicitly. So it would be good to have a generic
      solution that can allow implementing this without arbitrary restrictions.
      Ideally, such a solution would work for both preemptable and sleepable BPF
      programs in exactly the same way.

      This patch introduces such a solution, bpf_run_ctx. It adds one pointer field
      (bpf_ctx) to task_struct. This field is maintained by BPF_PROG_RUN family of
      macros in such a way that it always stays valid throughout BPF program
      execution. BPF program preemption is handled by remembering previous
      current->bpf_ctx value locally while executing nested BPF program and
      restoring old value after nested BPF program finishes. This is handled by two
      helper functions, bpf_set_run_ctx() and bpf_reset_run_ctx(), which are
      supposed to be used before and after BPF program runs, respectively.
      
      Restoring old value of the pointer handles preemption, while bpf_run_ctx
      pointer being a property of current task_struct naturally solves this problem
      for sleepable BPF programs by "following" BPF program execution as it is
      scheduled in and out of CPU. It would even allow CPU migration of BPF
      programs, even though it's not currently allowed by BPF infra.
      
      This patch cleans up cgroup local storage handling as a first application. The
      design itself is generic, though, with bpf_run_ctx being an empty struct that
      is supposed to be embedded into a specific struct for a given BPF program type
      (bpf_cg_run_ctx in this case). Follow up patches are planned that will expand
      this mechanism for other uses within tracing BPF programs.
      
      To verify that this change doesn't revert the fix to the original cgroup
      storage issue, I ran the same repro as in the original report ([0]) and didn't
      get any problems. Replacing bpf_reset_run_ctx(old_run_ctx) with
      bpf_reset_run_ctx(NULL) triggers the issue pretty quickly (so repro does work).
      
        [0] https://lore.kernel.org/bpf/YEEvBUiJl2pJkxTd@krava/
      
      Fixes: b910eaaa ("bpf: Fix NULL pointer dereference in bpf_get_local_storage() helper")
      Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Yonghong Song <yhs@fb.com>
      Link: https://lore.kernel.org/bpf/20210712230615.3525979-1-andrii@kernel.org
  13. 16 Jul, 2021 4 commits
    • sock_map: Relax config dependency to CONFIG_NET · 17edea21
      Cong Wang authored
      Currently sock_map still has a Kconfig dependency on CONFIG_INET,
      but there is no actual functional dependency on it since we
      introduced ->psock_update_sk_prot().

      We have to extend it to CONFIG_NET now as we are going to
      support AF_UNIX.
      Signed-off-by: Cong Wang <cong.wang@bytedance.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Link: https://lore.kernel.org/bpf/20210704190252.11866-2-xiyou.wangcong@gmail.com
    • bpf, x86: Store caller's ip in trampoline stack · 7e6f3cd8
      Jiri Olsa authored
      Store the caller's IP in the trampoline's stack. Trampoline programs
      can reach the IP at the (ctx - 8) address, so there's no change in
      the program's arguments interface.

      The IP address is taken from [fp + 8], which is the return address
      from the initial 'call fentry' call to the trampoline.

      This IP address will be returned via the bpf_get_func_ip
      helper, which is added in following patches.
      Signed-off-by: Jiri Olsa <jolsa@kernel.org>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Link: https://lore.kernel.org/bpf/20210714094400.396467-2-jolsa@kernel.org
    • bpf: Add map side support for bpf timers. · 68134668
      Alexei Starovoitov authored
      Restrict bpf timers to array, hash (both preallocated and kmalloced), and
      lru map types. The per-cpu maps with timers don't make sense, since 'struct
      bpf_timer' is a part of map value. bpf timers in per-cpu maps would mean that
      the number of timers depends on number of possible cpus and timers would not be
      accessible from all cpus. lpm map support can be added in the future.
      The timers in inner maps are supported.
      
      The bpf_map_update/delete_elem() helpers and sys_bpf commands cancel and free
      bpf_timer in a given map element.
      
      Similar to 'struct bpf_spin_lock' BTF is required and it is used to validate
      that map element indeed contains 'struct bpf_timer'.
      
      Make check_and_init_map_value() init both bpf_spin_lock and bpf_timer when
      map element data is reused in preallocated htab and lru maps.
      
      Teach copy_map_value() to support both bpf_spin_lock and bpf_timer in a single
      map element. There could be one of each, but not more than one. Due to 'one
      bpf_timer in one element' restriction do not support timers in global data,
      since global data is a map of single element, but from bpf program side it's
      seen as many global variables and restriction of single global timer would be
      odd. The sys_bpf map_freeze and sys_mmap syscalls are not allowed on maps with
      timers, since user space could have corrupted mmap element and crashed the
      kernel. The maps with timers cannot be readonly. Due to these restrictions
      search for bpf_timer in datasec BTF in case it was placed in the global data to
      report clear error.
      
      The previous patch allowed 'struct bpf_timer' as a first field in a map
      element only. Relax this restriction.
      
      Refactor lru map to s/bpf_lru_push_free/htab_lru_push_free/ to cancel and free
      the timer when lru map deletes an element as a part of its eviction algorithm.

      Make sure that bpf program cannot access 'struct bpf_timer' via direct load/store.
      The timer operations are done through helpers only.
      This is similar to 'struct bpf_spin_lock'.
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Yonghong Song <yhs@fb.com>
      Acked-by: Martin KaFai Lau <kafai@fb.com>
      Acked-by: Andrii Nakryiko <andrii@kernel.org>
      Acked-by: Toke Høiland-Jørgensen <toke@redhat.com>
      Link: https://lore.kernel.org/bpf/20210715005417.78572-5-alexei.starovoitov@gmail.com
    • bpf: Introduce bpf timers. · b00628b1
      Alexei Starovoitov authored
      Introduce 'struct bpf_timer { __u64 :64; __u64 :64; };' that can be embedded
      in hash/array/lru maps as a regular field and helpers to operate on it:
      
      // Initialize the timer.
      // First 4 bits of 'flags' specify clockid.
      // Only CLOCK_MONOTONIC, CLOCK_REALTIME, CLOCK_BOOTTIME are allowed.
      long bpf_timer_init(struct bpf_timer *timer, struct bpf_map *map, int flags);
      
      // Configure the timer to call 'callback_fn' static function.
      long bpf_timer_set_callback(struct bpf_timer *timer, void *callback_fn);
      
      // Arm the timer to expire 'nsec' nanoseconds from the current time.
      long bpf_timer_start(struct bpf_timer *timer, u64 nsec, u64 flags);
      
      // Cancel the timer and wait for callback_fn to finish if it was running.
      long bpf_timer_cancel(struct bpf_timer *timer);
      
      Here is what a BPF program might look like:
      struct map_elem {
          int counter;
          struct bpf_timer timer;
      };
      
      struct {
          __uint(type, BPF_MAP_TYPE_HASH);
          __uint(max_entries, 1000);
          __type(key, int);
          __type(value, struct map_elem);
      } hmap SEC(".maps");
      
      static int timer_cb(void *map, int *key, struct map_elem *val);
      /* val points to particular map element that contains bpf_timer. */
      
      SEC("fentry/bpf_fentry_test1")
      int BPF_PROG(test1, int a)
      {
          struct map_elem *val;
          int key = 0;

          val = bpf_map_lookup_elem(&hmap, &key);
          if (val) {
              bpf_timer_init(&val->timer, &hmap, CLOCK_REALTIME);
              bpf_timer_set_callback(&val->timer, timer_cb);
              bpf_timer_start(&val->timer, 1000 /* call timer_cb in 1 usec */, 0);
          }
          return 0;
      }
      
      This patch adds helper implementations that rely on hrtimers
      to call bpf functions as timers expire.
      The following patches add necessary safety checks.
      
      Only programs with CAP_BPF are allowed to use bpf_timer.
      
      The number of timers used by the program is constrained by
      the memcg recorded at map creation time.
      
      The bpf_timer_init() helper needs an explicit 'map' argument because inner maps
      are dynamic and not known at load time. bpf_timer_set_callback(), by contrast,
      receives a hidden 'aux->prog' argument supplied by the verifier.
      
      The prog pointer is needed to do refcnting of bpf program to make sure that
      program doesn't get freed while the timer is armed. This approach relies on
      "user refcnt" scheme used in prog_array that stores bpf programs for
      bpf_tail_call. The bpf_timer_set_callback() will increment the prog refcnt which is
      paired with bpf_timer_cancel() that will drop the prog refcnt. The
      ops->map_release_uref is responsible for cancelling the timers and dropping
      prog refcnt when user space reference to a map reaches zero.
      This uref approach is done to make sure that Ctrl-C of user space process will
      not leave timers running forever unless the user space explicitly pinned a map
      that contained timers in bpffs.
      
      bpf_timer_init() and bpf_timer_set_callback() will return -EPERM if map doesn't
      have user references (is not held by open file descriptor from user space and
      not pinned in bpffs).
      
      The bpf_map_delete_elem() and bpf_map_update_elem() operations cancel
      and free the timer if given map element had it allocated.
      "bpftool map update" command can be used to cancel timers.
      
      The 'struct bpf_timer' is explicitly __attribute__((aligned(8))) because
      '__u64 :64' has 1 byte alignment of 8 byte padding.
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Martin KaFai Lau <kafai@fb.com>
      Acked-by: Andrii Nakryiko <andrii@kernel.org>
      Acked-by: Toke Høiland-Jørgensen <toke@redhat.com>
      Link: https://lore.kernel.org/bpf/20210715005417.78572-4-alexei.starovoitov@gmail.com
  14. 09 Jul, 2021 1 commit
    • J
      bpf: Track subprog poke descriptors correctly and fix use-after-free · f263a814
      John Fastabend authored
      Subprograms are calling map_poke_track(), but on program release there is no
      hook to call map_poke_untrack(). However, on program release, the aux memory
      (and poke descriptor table) is freed even though we still have a reference to
      it in the element list of the map aux data. When we run map_poke_run(), we then
      end up accessing free'd memory, triggering KASAN in prog_array_map_poke_run():
      
        [...]
        [  402.824689] BUG: KASAN: use-after-free in prog_array_map_poke_run+0xc2/0x34e
        [  402.824698] Read of size 4 at addr ffff8881905a7940 by task hubble-fgs/4337
        [  402.824705] CPU: 1 PID: 4337 Comm: hubble-fgs Tainted: G          I       5.12.0+ #399
        [  402.824715] Call Trace:
        [  402.824719]  dump_stack+0x93/0xc2
        [  402.824727]  print_address_description.constprop.0+0x1a/0x140
        [  402.824736]  ? prog_array_map_poke_run+0xc2/0x34e
        [  402.824740]  ? prog_array_map_poke_run+0xc2/0x34e
        [  402.824744]  kasan_report.cold+0x7c/0xd8
        [  402.824752]  ? prog_array_map_poke_run+0xc2/0x34e
        [  402.824757]  prog_array_map_poke_run+0xc2/0x34e
        [  402.824765]  bpf_fd_array_map_update_elem+0x124/0x1a0
        [...]
      
      The elements concerned are walked as follows:
      
          for (i = 0; i < elem->aux->size_poke_tab; i++) {
                 poke = &elem->aux->poke_tab[i];
          [...]
      
      The access to size_poke_tab is a 4 byte read, verified by checking offsets
      in the KASAN dump:
      
        [  402.825004] The buggy address belongs to the object at ffff8881905a7800
                       which belongs to the cache kmalloc-1k of size 1024
        [  402.825008] The buggy address is located 320 bytes inside of
                       1024-byte region [ffff8881905a7800, ffff8881905a7c00)
      
      The pahole output of bpf_prog_aux:
      
        struct bpf_prog_aux {
          [...]
          /* --- cacheline 5 boundary (320 bytes) --- */
          u32                        size_poke_tab;        /*   320     4 */
          [...]
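
      The offset match can be reproduced from the two addresses in the KASAN
      report. The helper name below is made up for this sketch; the addresses
      are the ones quoted above.

```c
#include <assert.h>
#include <stdint.h>

/* Byte offset of a faulting address inside the slab object that holds it. */
static uint64_t offset_in_object(uint64_t fault_addr, uint64_t object_base)
{
	return fault_addr - object_base;
}
```

      Calling offset_in_object(0xffff8881905a7940, 0xffff8881905a7800) yields
      0x140, i.e. 320, which is the pahole offset of size_poke_tab.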
      
      In general, subprograms do not necessarily manage their own data structures.
      For example, BTF func_info and linfo are just pointers to the main program
      structure. This allows reference counting and cleanup to be done on the latter
      which simplifies their management a bit. The aux->poke_tab struct, however,
      did not follow this logic. The initial proposed fix for this use-after-free
      bug further embedded poke data tracking into the subprogram with proper
      reference counting. However, Daniel and Alexei questioned why we were treating
      these objects special; I agree, it's unnecessary. The fix here removes the per
      subprogram poke table allocation and map tracking and instead simply points
      the aux->poke_tab pointer at the main program's poke table. This way, map
      tracking is simplified to the main program and we do not need to manage them
      per subprogram.
      
      This also means, bpf_prog_free_deferred(), which unwinds the program reference
      counting and kfrees objects, needs to ensure that we don't try to double free
      the poke_tab when free'ing the subprog structures. This is easily solved by
      NULL'ing the poke_tab pointer. The second detail is to ensure that per
      subprogram JIT logic only does fixups on poke_tab[] entries it owns. To do
      this, we add a pointer in the poke structure to point at the subprogram value
      so JITs can easily check while walking the poke_tab structure if the current
      entry belongs to the current program. The aux pointer is stable and therefore
      suitable for such comparison. On the jit_subprogs() error path, we omit
      cleaning up the poke->aux field because these are only ever referenced from
      the JIT side, but on error we will never make it to the JIT, so it's fine to
      leave them dangling. Removing these pointers would complicate the error path
      for no reason. However, we do need to untrack all poke descriptors from the
      main program as otherwise they could race with the freeing of JIT memory from
      the subprograms. Lastly, a748c697 ("bpf: propagate poke descriptors to
      subprograms") had an off-by-one on the subprogram instruction index range
      check as it was testing 'insn_idx >= subprog_start && insn_idx <= subprog_end'.
      However, subprog_end is the next subprogram's start instruction.
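
      The off-by-one can be illustrated in isolation. This is a minimal sketch
      with hypothetical function names, assuming the corrected check treats
      subprog_end as exclusive; it is not the verifier code itself.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Form quoted in the commit message: treats subprog_end as inclusive. */
static bool in_subprog_inclusive(uint32_t insn_idx, uint32_t subprog_start,
				 uint32_t subprog_end)
{
	return insn_idx >= subprog_start && insn_idx <= subprog_end;
}

/* Half-open form: subprog_end is the next subprogram's first instruction. */
static bool in_subprog_half_open(uint32_t insn_idx, uint32_t subprog_start,
				 uint32_t subprog_end)
{
	return insn_idx >= subprog_start && insn_idx < subprog_end;
}
```

      With two adjacent subprograms covering [0, 10) and [10, 20), the
      inclusive form claims instruction 10 for both of them, while the
      half-open form assigns it only to the second.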
      
      Fixes: a748c697 ("bpf: propagate poke descriptors to subprograms")
      Signed-off-by: John Fastabend <john.fastabend@gmail.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Co-developed-by: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Link: https://lore.kernel.org/bpf/20210707223848.14580-2-john.fastabend@gmail.com
      f263a814
  15. 08 Jul, 2021 2 commits
  16. 16 Jun, 2021 1 commit
  17. 26 May, 2021 1 commit
    • H
      xdp: Extend xdp_redirect_map with broadcast support · e624d4ed
      Hangbin Liu authored
      This patch adds two flags BPF_F_BROADCAST and BPF_F_EXCLUDE_INGRESS to
      extend xdp_redirect_map for broadcast support.
      
      With BPF_F_BROADCAST the packet will be broadcast to all the interfaces
      in the map. With BPF_F_EXCLUDE_INGRESS the ingress interface will be
      excluded when broadcasting.
      
      When getting the devices in dev hash map via dev_map_hash_get_next_key(),
      there is a possibility that we fall back to the first key when a device
      was removed. This will duplicate packets on some interfaces. So just walk
      the whole buckets to avoid this issue. For dev array map, we also walk the
      whole map to find valid interfaces.
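
      The "walk the whole map" behaviour for the array case can be modelled in
      user space as follows. This is only a sketch: the SKETCH_F_* flag bits,
      the use of 0 to mark an empty slot, and the function name are assumptions
      made for illustration and are not taken from the kernel sources.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical flag bits for this sketch only. */
#define SKETCH_F_BROADCAST       (1u << 0)
#define SKETCH_F_EXCLUDE_INGRESS (1u << 1)

/*
 * Walk every slot of a device array (0 marks an empty slot) and collect the
 * interface indexes a broadcast would be sent to. Returns the number of
 * targets written to out, which must have room for n_slots entries.
 */
static size_t collect_targets(const int *slots, size_t n_slots,
			      int ingress_ifindex, uint32_t flags, int *out)
{
	size_t n = 0;

	if (!(flags & SKETCH_F_BROADCAST))
		return 0;

	for (size_t i = 0; i < n_slots; i++) {
		if (slots[i] == 0)
			continue;	/* empty slot, nothing to send to */
		if ((flags & SKETCH_F_EXCLUDE_INGRESS) &&
		    slots[i] == ingress_ifindex)
			continue;	/* do not echo back to the ingress device */
		out[n++] = slots[i];
	}
	return n;
}
```

      Because every slot is visited exactly once, no interface can be
      selected twice, which is the property the full walk is meant to give.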
      
      Function bpf_clear_redirect_map() was removed in
      commit ee75aef2 ("bpf, xdp: Restructure redirect actions").
      Add it back as we need to use ri->map again.
      
      With test topology:
        +-------------------+             +-------------------+
        | Host A (i40e 10G) |  ---------- | eno1(i40e 10G)    |
        +-------------------+             |                   |
                                          |   Host B          |
        +-------------------+             |                   |
        | Host C (i40e 10G) |  ---------- | eno2(i40e 10G)    |
        +-------------------+             |                   |
                                          |          +------+ |
                                          | veth0 -- | Peer | |
                                          | veth1 -- |      | |
                                          | veth2 -- |  NS  | |
                                          |          +------+ |
                                          +-------------------+
      
      On Host A:
       # pktgen/pktgen_sample03_burst_single_flow.sh -i eno1 -d $dst_ip -m $dst_mac -s 64
      
      On Host B(Intel(R) Xeon(R) CPU E5-2690 v3 @ 2.60GHz, 128G Memory):
      Use xdp_redirect_map and xdp_redirect_map_multi in samples/bpf for testing.
      All the veth peers in the NS have an XDP_DROP program loaded. The
      forward_map max_entries in xdp_redirect_map_multi is modified to 4.
      
      Testing the performance impact on the regular xdp_redirect path with and
      without patch (to check impact of additional check for broadcast mode):
      
      5.12 rc4         | redirect_map        i40e->i40e      |    2.0M |  9.7M
      5.12 rc4         | redirect_map        i40e->veth      |    1.7M | 11.8M
      5.12 rc4 + patch | redirect_map        i40e->i40e      |    2.0M |  9.6M
      5.12 rc4 + patch | redirect_map        i40e->veth      |    1.7M | 11.7M
      
      Testing the performance when cloning packets with the redirect_map_multi
      test, using a redirect map size of 4, filled with 1-3 devices:
      
      5.12 rc4 + patch | redirect_map multi  i40e->veth (x1) |    1.7M | 11.4M
      5.12 rc4 + patch | redirect_map multi  i40e->veth (x2) |    1.1M |  4.3M
      5.12 rc4 + patch | redirect_map multi  i40e->veth (x3) |    0.8M |  2.6M
      Signed-off-by: Hangbin Liu <liuhangbin@gmail.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Toke Høiland-Jørgensen <toke@redhat.com>
      Acked-by: Martin KaFai Lau <kafai@fb.com>
      Acked-by: John Fastabend <john.fastabend@gmail.com>
      Acked-by: Jesper Dangaard Brouer <brouer@redhat.com>
      Link: https://lore.kernel.org/bpf/20210519090747.1655268-3-liuhangbin@gmail.com
      e624d4ed
  18. 25 May, 2021 1 commit
  19. 19 May, 2021 3 commits
  20. 28 Apr, 2021 1 commit
    • F
      bpf: Implement formatted output helpers with bstr_printf · 48cac3f4
      Florent Revest authored
      BPF has three formatted output helpers: bpf_trace_printk, bpf_seq_printf
      and bpf_snprintf. Their signatures specify that all arguments are
      provided from the BPF world as u64s (in an array or as registers). All
      of these helpers are currently implemented by calling functions such as
      snprintf() whose signatures take a variable number of arguments, then
      placed in a va_list by the compiler to call vsnprintf().
      
      "d9c9e4db bpf: Factorize bpf_trace_printk and bpf_seq_printf" introduced
      a bpf_printf_prepare function that fills an array of u64 sanitized
      arguments with an array of "modifiers" which indicate what the "real"
      size of each argument should be (given by the format specifier). The
      BPF_CAST_FMT_ARG macro consumes these arrays and casts each argument to
      its real size. However, the C promotion rules implicitly cast them all
      back to u64s. Therefore, the arguments given to snprintf are u64s and
      the va_list constructed by the compiler will use 64 bits for each
      argument. On 64 bit machines, this happens to work well because 32 bit
      arguments in va_lists need to occupy 64 bits anyway, but on 32 bit
      architectures this breaks the layout of the va_list expected by the
      called function and mangles values.
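
      The conversion back to 64 bits can be observed directly. The macro below
      is a hypothetical simplification, loosely modelled on the description of
      BPF_CAST_FMT_ARG rather than copied from it. It selects a cast with the
      conditional operator, and the usual arithmetic conversions then give both
      branches the same 64-bit result type.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical macro: choose the "real" size of an argument at run time. */
#define SKETCH_CAST_ARG(arg, is_64bit) \
	((is_64bit) ? (uint64_t)(arg) : (uint32_t)(arg))

/* Size of the expression when the 32-bit branch is selected. */
static size_t cast_result_size(void)
{
	uint64_t raw = 42;

	return sizeof(SKETCH_CAST_ARG(raw, 0));
}

/* Value of the expression: truncated to 32 bits, but carried as 64 bits. */
static uint64_t cast_result_value(uint64_t raw)
{
	return SKETCH_CAST_ARG(raw, 0);
}
```

      The value is truncated as intended, yet the expression still has a
      64-bit type, so a variadic callee receives a 64-bit argument either way.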
      
      In "88a5c690 bpf: fix bpf_trace_printk on 32 bit archs", this problem
      had been solved for bpf_trace_printk only with a "horrid workaround"
      that emitted multiple calls to trace_printk where each call had
      different argument types and generated different va_list layouts. One of
      the calls would be dynamically chosen at runtime. This was ok with the 3
      arguments that bpf_trace_printk takes but bpf_seq_printf and
      bpf_snprintf accept up to 12 arguments. Because this approach scales
      code exponentially, it is not a viable option anymore.
      
      Because the promotion rules are part of the language and because the
      construction of a va_list is an arch-specific ABI, it's best to just
      avoid variadic arguments and va_lists altogether. Thankfully the
      kernel's snprintf() has an alternative in the form of bstr_printf() that
      accepts arguments in a "binary buffer representation". These binary
      buffers are currently created by vbin_printf and used in the tracing
      subsystem to split the cost of printing into two parts: a fast one that
      only dereferences and remembers values, and a slower one, called later,
      that does the pretty-printing.
      
      This patch refactors bpf_printf_prepare to construct binary buffers of
      arguments consumable by bstr_printf() instead of arrays of arguments and
      modifiers. This gets rid of BPF_CAST_FMT_ARG and greatly simplifies the
      bpf_printf_prepare usage but there are a few gotchas that change how
      bpf_printf_prepare needs to do things.
      
      Currently, bpf_printf_prepare uses a per cpu temporary buffer as a
      generic storage for strings and IP addresses. With this refactoring, the
      temporary buffer now holds all the arguments in a structured binary
      format.
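
      The idea of a binary argument buffer can be sketched as follows. This is
      a simplified model with hypothetical helper names, assuming only 4-byte
      and 8-byte arguments stored at 4-byte aligned offsets; it is not the
      exact layout that bstr_printf() expects.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Round off up to the next multiple of align (align is a power of two). */
static size_t align_up(size_t off, size_t align)
{
	return (off + align - 1) & ~(align - 1);
}

/*
 * Append one argument of `size` bytes (4 or 8) to buf at a 4-byte aligned
 * offset. Returns the offset just past the argument, or 0 on error.
 */
static size_t pack_arg(unsigned char *buf, size_t cap, size_t off,
		       uint64_t raw, size_t size)
{
	off = align_up(off, 4);
	if ((size != 4 && size != 8) || off + size > cap)
		return 0;
	if (size == 4) {
		uint32_t v = (uint32_t)raw;

		memcpy(buf + off, &v, sizeof(v));
	} else {
		memcpy(buf + off, &raw, sizeof(raw));
	}
	return off + size;
}

/* Read back one argument of `size` bytes and advance *off past it. */
static uint64_t unpack_arg(const unsigned char *buf, size_t *off, size_t size)
{
	uint64_t out = 0;

	*off = align_up(*off, 4);
	if (size == 4) {
		uint32_t v;

		memcpy(&v, buf + *off, sizeof(v));
		out = v;
	} else {
		memcpy(&out, buf + *off, sizeof(out));
	}
	*off += size;
	return out;
}
```

      Each argument occupies exactly the size its format specifier calls for,
      so the consumer never depends on how a va_list is laid out on a given
      architecture.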
      
      To comply with the format expected by bstr_printf, certain format
      specifiers also need to be pre-formatted: %pB and %pi6/%pi4/%pI4/%pI6.
      Because vsnprintf subroutines for these specifiers are hard to expose,
      we pre-format these arguments with calls to snprintf().
      Reported-by: Rasmus Villemoes <linux@rasmusvillemoes.dk>
      Signed-off-by: Florent Revest <revest@chromium.org>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Link: https://lore.kernel.org/bpf/20210427174313.860948-3-revest@chromium.org
      48cac3f4
  21. 20 Apr, 2021 3 commits
    • F
      bpf: Add a bpf_snprintf helper · 7b15523a
      Florent Revest authored
      The implementation takes inspiration from the existing bpf_trace_printk
      helper but there are a few differences:
      
      To allow for a large number of format-specifiers, parameters are
      provided in an array, like in bpf_seq_printf.
      
      Because the output string takes two arguments and the array of
      parameters also takes two arguments, the format string needs to fit in
      one argument. Thankfully, ARG_PTR_TO_CONST_STR is guaranteed to point to
      a zero-terminated read-only map so we don't need a format string length
      arg.
      
      Because the format-string is known at verification time, we also do
      a first pass of format string validation in the verifier logic. This
      makes debugging easier.
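
      A first pass over a format string known ahead of time could look like
      the sketch below. It is a deliberately simplified, hypothetical checker:
      it only counts conversion specifiers and rejects a dangling '%', whereas
      the real validation also has to check which specifiers are allowed.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Count the conversion specifiers in fmt. "%%" is a literal percent sign and
 * consumes no argument. Returns -1 if the string ends right after a '%'.
 */
static int count_format_args(const char *fmt)
{
	int n = 0;

	for (size_t i = 0; fmt[i] != '\0'; i++) {
		if (fmt[i] != '%')
			continue;
		if (fmt[i + 1] == '%') {
			i++;		/* skip the escaped percent sign */
			continue;
		}
		if (fmt[i + 1] == '\0')
			return -1;	/* dangling '%' at end of string */
		n++;
	}
	return n;
}
```

      Comparing the returned count with the number of supplied arguments
      catches a mismatch before the program ever runs.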
      Signed-off-by: Florent Revest <revest@chromium.org>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Acked-by: Andrii Nakryiko <andrii@kernel.org>
      Link: https://lore.kernel.org/bpf/20210419155243.1632274-4-revest@chromium.org
      7b15523a
    • F
      bpf: Add a ARG_PTR_TO_CONST_STR argument type · fff13c4b
      Florent Revest authored
      This type provides the guarantee that an argument is going to be a const
      pointer to somewhere in a read-only map value. It also checks that this
      pointer is followed by a zero character before the end of the map value.
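
      The terminator check amounts to a bounded search. The sketch below is a
      user-space model with a hypothetical function name, where `value` stands
      for the map value and `off` for the pointer's offset into it.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/*
 * Return true if a zero byte occurs at or after offset off and before the
 * end of a value that is value_size bytes long.
 */
static bool str_terminated_in_value(const char *value, size_t value_size,
				    size_t off)
{
	if (off >= value_size)
		return false;	/* pointer is outside the value */
	return memchr(value + off, '\0', value_size - off) != NULL;
}
```

      A string that starts inside the value but runs to its last byte without
      a zero is rejected, so a consumer can never read past the map value.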
      Signed-off-by: Florent Revest <revest@chromium.org>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Acked-by: Andrii Nakryiko <andrii@kernel.org>
      Link: https://lore.kernel.org/bpf/20210419155243.1632274-3-revest@chromium.org
      fff13c4b
    • F
      bpf: Factorize bpf_trace_printk and bpf_seq_printf · d9c9e4db
      Florent Revest authored
      Two helpers (trace_printk and seq_printf) have very similar
      implementations of format string parsing and a third one is coming
      (snprintf). To avoid code duplication and make the code easier to
      maintain, this moves the operations associated with format string
      parsing (validation and argument sanitization) into one generic
      function.
      
      The implementation of the two existing helpers already drifted quite a
      bit so unifying them entailed a lot of changes:
      
      - bpf_trace_printk always expected fmt[fmt_size] to be the terminating
        NUL character; this is no longer true, the first 0 now terminates the string.
      - bpf_trace_printk now supports %% (which produces the percentage char).
      - bpf_trace_printk now skips width formatting fields.
      - bpf_trace_printk now supports the X modifier (capital hexadecimal).
      - bpf_trace_printk now supports %pK, %px, %pB, %pi4, %pI4, %pi6 and %pI6
      - argument casting on 32 bit has been simplified into one macro and
        using an enum instead of obscure int increments.
      
      - bpf_seq_printf now uses bpf_trace_copy_string instead of
        strncpy_from_kernel_nofault and handles the %pks %pus specifiers.
      - bpf_seq_printf now prints longs correctly on 32 bit architectures.
      
      - both were changed to use a global per-cpu tmp buffer instead of one
        stack buffer for trace_printk and 6 small buffers for seq_printf.
      - to avoid per-cpu buffer usage conflict, these helpers disable
        preemption while the per-cpu buffer is in use.
      - both helpers now support the %ps and %pS specifiers to print symbols.
      
      The implementation is also moved from bpf_trace.c to helpers.c because
      the upcoming bpf_snprintf helper will be made available to all BPF
      programs and will need it.
      Signed-off-by: Florent Revest <revest@chromium.org>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Link: https://lore.kernel.org/bpf/20210419155243.1632274-2-revest@chromium.org
      d9c9e4db
  22. 09 Apr, 2021 1 commit
  23. 03 Apr, 2021 1 commit