1. 11 Mar 2022, 3 commits
  2. 10 Mar 2022, 1 commit
    • bpf: Add "live packet" mode for XDP in BPF_PROG_RUN · b530e9e1
      Authored by Toke Høiland-Jørgensen
      This adds support for running XDP programs through BPF_PROG_RUN in a mode
      that enables live packet processing of the resulting frames. Previous uses
      of BPF_PROG_RUN for XDP returned the XDP program return code and the
      modified packet data to userspace, which is useful for unit testing of XDP
      programs.
      
      The existing BPF_PROG_RUN for XDP allows userspace to set the ingress
      ifindex and RXQ number as part of the context object being passed to the
      kernel. This patch reuses that code, but adds a new mode with different
      semantics, which can be selected with the new BPF_F_TEST_XDP_LIVE_FRAMES
      flag.
      
      When running BPF_PROG_RUN in this mode, the XDP program return codes will
      be honoured: returning XDP_PASS will result in the frame being injected
      into the networking stack as if it came from the selected networking
      interface, while returning XDP_TX and XDP_REDIRECT will result in the frame
      being transmitted out that interface. XDP_TX is translated into an
      XDP_REDIRECT operation to the same interface, since the real XDP_TX action
      is only possible from within the network drivers themselves, not from the
      process context where BPF_PROG_RUN is executed.
      
      Internally, this new mode of operation creates a page pool instance while
      setting up the test run, and feeds pages from that into the XDP program.
      The setup cost of this is amortised over the number of repetitions
      specified by userspace.
      
      To support the performance testing use case, we further optimise the setup
      step so that all pages in the pool are pre-initialised with the packet
      data, and pre-computed context and xdp_frame objects stored at the start of
      each page. This makes it possible to entirely avoid touching the page
      content on each XDP program invocation, and enables sending up to 9
      Mpps/core on my test box.
      
      Because the data pages are recycled by the page pool, and the test runner
      doesn't re-initialise them for each run, subsequent invocations of the XDP
      program will see the packet data in the state it was after the last time it
      ran on that particular page. This means that an XDP program that modifies
      the packet before redirecting it has to be careful about which assumptions
      it makes about the packet content, but that is only an issue for the most
      naively written programs.
      
      Enabling the new flag is only allowed when not setting ctx_out and data_out
      in the test specification, since using it means frames will be redirected
      somewhere else, so they can't be returned.
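      As a rough userspace sketch (not part of this patch; it assumes a loaded
      XDP program fd, a prepared packet buffer, and libbpf's bpf_test_run_opts
      with its flags/repeat fields), driving the new mode could look like:
      
      #include <bpf/bpf.h>
      #include <bpf/libbpf.h>
      #include <linux/bpf.h>
      
      /* Run prog_fd over pkt in live-frame mode: frames returning XDP_PASS,
       * XDP_TX or XDP_REDIRECT are injected/transmitted instead of being
       * copied back to userspace. */
      static int run_live_frames(int prog_fd, void *pkt, __u32 pkt_len, int ifindex)
      {
          struct xdp_md ctx_in = {
              .data_end = pkt_len,          /* must match data_size_in */
              .ingress_ifindex = ifindex,   /* frames appear to arrive here */
          };
          DECLARE_LIBBPF_OPTS(bpf_test_run_opts, opts,
              .data_in = pkt,
              .data_size_in = pkt_len,
              .ctx_in = &ctx_in,
              .ctx_size_in = sizeof(ctx_in),
              .repeat = 1 << 20,            /* amortise the page pool setup */
              .flags = BPF_F_TEST_XDP_LIVE_FRAMES,
              /* ctx_out/data_out must stay unset in this mode */
          );
      
          return bpf_prog_test_run_opts(prog_fd, &opts);
      }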
      Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Acked-by: Martin KaFai Lau <kafai@fb.com>
      Link: https://lore.kernel.org/bpf/20220309105346.100053-2-toke@redhat.com
  3. 03 Mar 2022, 1 commit
    • bpf: Add __sk_buff->delivery_time_type and bpf_skb_set_skb_delivery_time() · 8d21ec0e
      Authored by Martin KaFai Lau
      * __sk_buff->delivery_time_type:
      This patch adds __sk_buff->delivery_time_type.  It tells if the
      delivery_time is stored in __sk_buff->tstamp or not.
      
      It will be most useful for ingress to tell if the __sk_buff->tstamp
      has the (rcv) timestamp or delivery_time.  If delivery_time_type
      is 0 (BPF_SKB_DELIVERY_TIME_NONE), it has the (rcv) timestamp.
      
      Two non-zero types are defined for delivery_time_type:
      BPF_SKB_DELIVERY_TIME_MONO and BPF_SKB_DELIVERY_TIME_UNSPEC.  UNSPEC
      can only happen at egress because only a mono delivery_time can be
      forwarded to ingress for now.  The clock of an UNSPEC delivery_time
      can be deduced from skb->sk->sk_clockid, which is also how
      sch_etf does it.
      
      * Provide forwarded delivery_time to tc-bpf@ingress:
      With the help of the new delivery_time_type, the tc-bpf has a way
      to tell if the __sk_buff->tstamp has the (rcv) timestamp or
      the delivery_time.  During bpf load time, the verifier will learn if
      the bpf prog has accessed the new __sk_buff->delivery_time_type.
      If it does, it means the tc-bpf@ingress expects that the
      skb->tstamp may hold the delivery_time.  The kernel will then
      read the skb->tstamp as-is during bpf insn rewrite without
      checking the skb->mono_delivery_time.  This is done by adding a
      new prog->delivery_time_access bit.  The same goes for
      writing skb->tstamp.
      
      * bpf_skb_set_delivery_time():
      The bpf_skb_set_delivery_time() helper is added to allow setting both
      delivery_time and the delivery_time_type at the same time.  If the
      tc-bpf does not need to change the delivery_time_type, it can directly
      write to the __sk_buff->tstamp as the existing tc-bpf has already been
      doing.  It will be most useful at ingress to change the
      __sk_buff->tstamp from the (rcv) timestamp to
      a mono delivery_time and then bpf_redirect_*().
      
      bpf only has a mono clock helper (bpf_ktime_get_ns), the current
      known use case is the mono EDT for fq, and only a mono delivery_time
      can be kept during forwarding for now, so bpf_skb_set_delivery_time()
      only supports setting BPF_SKB_DELIVERY_TIME_MONO.  It can be extended
      later when new use cases come up and the forwarding path also
      supports other clock bases.
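      As a rough illustration (not part of this patch; it uses the uapi names
      introduced here, which later kernels renamed to tstamp_type /
      bpf_skb_set_tstamp), a tc-bpf@ingress program could convert the (rcv)
      timestamp into a mono delivery_time before forwarding:
      
      #include <linux/bpf.h>
      #include <bpf/bpf_helpers.h>
      
      #define EGRESS_IFINDEX 2   /* hypothetical forwarding target */
      
      SEC("tc")
      int ingress_set_dt(struct __sk_buff *skb)
      {
          /* Reading delivery_time_type also tells the verifier that this
           * program expects skb->tstamp to possibly carry a delivery_time. */
          if (skb->delivery_time_type == BPF_SKB_DELIVERY_TIME_NONE) {
              /* tstamp currently holds the (rcv) timestamp; replace it with
               * a mono delivery_time before redirecting. */
              __u64 now = bpf_ktime_get_ns();
      
              bpf_skb_set_delivery_time(skb, now, BPF_SKB_DELIVERY_TIME_MONO);
          }
          return bpf_redirect(EGRESS_IFINDEX, 0);
      }
      
      char LICENSE[] SEC("license") = "GPL";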
      Signed-off-by: Martin KaFai Lau <kafai@fb.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  4. 10 Feb 2022, 1 commit
  5. 01 Feb 2022, 1 commit
  6. 25 Jan 2022, 1 commit
  7. 22 Jan 2022, 3 commits
  8. 20 Jan 2022, 2 commits
  9. 14 Dec 2021, 1 commit
    • bpf: Add get_func_[arg|ret|arg_cnt] helpers · f92c1e18
      Authored by Jiri Olsa
      Adding the following helpers for tracing programs:
      
      Get n-th argument of the traced function:
        long bpf_get_func_arg(void *ctx, u32 n, u64 *value)
      
      Get return value of the traced function:
        long bpf_get_func_ret(void *ctx, u64 *value)
      
      Get arguments count of the traced function:
        long bpf_get_func_arg_cnt(void *ctx)
      
      The trampoline now stores the number of arguments at the ctx-8
      address, so it's easy to verify the argument index and find the
      return value argument's position.
      
      The function IP address is moved on the trampoline stack behind the
      number of function arguments, so it's now stored at the ctx-16
      address if it's needed.
      
      All helpers above are inlined by the verifier.
      
      Also a somewhat unrelated small change: use the newly added function
      bpf_prog_has_trampoline in check_get_func_ip.
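      For illustration (a sketch under the assumption that the fexit target
      bpf_fentry_test2 is available; any fexit target works), the helpers can
      be used like this:
      
      #include "vmlinux.h"
      #include <bpf/bpf_helpers.h>
      #include <bpf/bpf_tracing.h>
      
      SEC("fexit/bpf_fentry_test2")
      int BPF_PROG(dump_args)
      {
          __u64 nr_args = bpf_get_func_arg_cnt(ctx);   /* argument count */
          __u64 a0 = 0, ret = 0;
      
          bpf_get_func_arg(ctx, 0, &a0);   /* n-th argument, zero-based */
          bpf_get_func_ret(ctx, &ret);     /* return value (fexit/fmod_ret) */
      
          bpf_printk("args=%llu a0=%llu ret=%llu", nr_args, a0, ret);
          return 0;
      }
      
      char LICENSE[] SEC("license") = "GPL";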
      Signed-off-by: Jiri Olsa <jolsa@kernel.org>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Link: https://lore.kernel.org/bpf/20211208193245.172141-5-jolsa@kernel.org
  10. 12 Dec 2021, 1 commit
  11. 03 Dec 2021, 2 commits
  12. 01 Dec 2021, 1 commit
    • bpf: Add bpf_loop helper · e6f2dd0f
      Authored by Joanne Koong
      This patch adds the kernel-side and API changes for a new helper
      function, bpf_loop:
      
      long bpf_loop(u32 nr_loops, void *callback_fn, void *callback_ctx,
      u64 flags);
      
      where long (*callback_fn)(u32 index, void *ctx);
      
      bpf_loop invokes the "callback_fn" **nr_loops** times or until the
      callback_fn returns 1. The callback_fn can only return 0 or 1, and
      this is enforced by the verifier. The callback_fn index is zero-indexed.
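      As an illustration (a sketch, not part of this patch), a program could
      sum loop indices with a static callback:
      
      #include "vmlinux.h"
      #include <bpf/bpf_helpers.h>
      #include <bpf/bpf_tracing.h>
      
      struct loop_ctx {
          __u64 sum;
      };
      
      static long sum_cb(__u32 index, void *ctx)
      {
          struct loop_ctx *lc = ctx;
      
          lc->sum += index;
          return 0;          /* 0 = keep looping, 1 = break out early */
      }
      
      SEC("fentry/bpf_fentry_test1")
      int BPF_PROG(use_loop)
      {
          struct loop_ctx lc = {};
      
          bpf_loop(100, sum_cb, &lc, 0);   /* flags are currently unused */
          bpf_printk("sum=%llu", lc.sum);
          return 0;
      }
      
      char LICENSE[] SEC("license") = "GPL";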
      
      A few things to note:
      ~ The "u64 flags" parameter is currently unused but is included in
      case a future use case for it arises.
      ~ In the kernel-side implementation of bpf_loop (kernel/bpf/bpf_iter.c),
      bpf_callback_t is used as the callback function cast.
      ~ A program can have nested bpf_loop calls, but the program must
      still adhere to the verifier's stack depth constraint (the stack
      depth cannot exceed MAX_BPF_STACK)
      ~ Recursive callback_fns do not pass the verifier, due to the call stack
      for these being too deep.
      ~ The next patch will include the tests and benchmark
      Signed-off-by: Joanne Koong <joannekoong@fb.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Acked-by: Andrii Nakryiko <andrii@kernel.org>
      Link: https://lore.kernel.org/bpf/20211130030622.4131246-2-joannekoong@fb.com
  13. 16 Nov 2021, 1 commit
    • bpf: Change value of MAX_TAIL_CALL_CNT from 32 to 33 · ebf7f6f0
      Authored by Tiezhu Yang
      In the current code, the actual max tail call count is 33 which is greater
      than MAX_TAIL_CALL_CNT (defined as 32). The actual limit is not consistent
      with the meaning of MAX_TAIL_CALL_CNT and thus confusing at first glance.
      We can see the historical evolution from commit 04fd61ab ("bpf: allow
      bpf programs to tail-call other bpf programs") and commit f9dabe01
      ("bpf: Undo off-by-one in interpreter tail call count limit"). In order
      to avoid changing existing behavior, the actual limit is 33 now, this is
      reasonable.
      
      After commit 874be05f ("bpf, tests: Add tail call test suite"), we
      can see that there is a failed testcase.
      
      On all archs when CONFIG_BPF_JIT_ALWAYS_ON is not set:
       # echo 0 > /proc/sys/net/core/bpf_jit_enable
       # modprobe test_bpf
       # dmesg | grep -w FAIL
       Tail call error path, max count reached jited:0 ret 34 != 33 FAIL
      
      On some archs:
       # echo 1 > /proc/sys/net/core/bpf_jit_enable
       # modprobe test_bpf
       # dmesg | grep -w FAIL
       Tail call error path, max count reached jited:1 ret 34 != 33 FAIL
      
      Although the above failed testcase has been fixed in commit 18935a72
      ("bpf/tests: Fix error in tail call limit tests"), it would still be good
      to change the value of MAX_TAIL_CALL_CNT from 32 to 33 to make the code
      more readable.
      
      The 32-bit x86 JIT was using a limit of 32; fix its wrong comments
      and raise its limit to 33 tail calls to match the updated
      MAX_TAIL_CALL_CNT constant. For the mips64 JIT, use "ori" instead of
      "addiu" as suggested by Johan Almbladh. For the riscv JIT, use
      RV_REG_TCC directly to save one register move as suggested by Björn
      Töpel. The other implementations need no functional changes: the
      current limit of 33 is unchanged, the new value of MAX_TAIL_CALL_CNT
      now reflects the actual max tail call count, and the related tail
      call testcases in the test_bpf module and selftests work well for
      both the interpreter and the JITs.
      
      Here are the test results on x86_64:
      
       # uname -m
       x86_64
       # echo 0 > /proc/sys/net/core/bpf_jit_enable
       # modprobe test_bpf test_suite=test_tail_calls
       # dmesg | tail -1
       test_bpf: test_tail_calls: Summary: 8 PASSED, 0 FAILED, [0/8 JIT'ed]
       # rmmod test_bpf
       # echo 1 > /proc/sys/net/core/bpf_jit_enable
       # modprobe test_bpf test_suite=test_tail_calls
       # dmesg | tail -1
       test_bpf: test_tail_calls: Summary: 8 PASSED, 0 FAILED, [8/8 JIT'ed]
       # rmmod test_bpf
       # ./test_progs -t tailcalls
       #142 tailcalls:OK
       Summary: 1/11 PASSED, 0 SKIPPED, 0 FAILED
      Signed-off-by: Tiezhu Yang <yangtiezhu@loongson.cn>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Tested-by: Johan Almbladh <johan.almbladh@anyfinetworks.com>
      Tested-by: Ilya Leoshkevich <iii@linux.ibm.com>
      Acked-by: Björn Töpel <bjorn@kernel.org>
      Acked-by: Johan Almbladh <johan.almbladh@anyfinetworks.com>
      Acked-by: Ilya Leoshkevich <iii@linux.ibm.com>
      Link: https://lore.kernel.org/bpf/1636075800-3264-1-git-send-email-yangtiezhu@loongson.cn
  14. 11 Nov 2021, 1 commit
  15. 08 Nov 2021, 1 commit
  16. 02 Nov 2021, 1 commit
    • bpf: Add alignment padding for "map_extra" + consolidate holes · 8845b468
      Authored by Joanne Koong
      This patch makes 2 changes regarding alignment padding
      for the "map_extra" field.
      
      1) In the kernel header, "map_extra" and "btf_value_type_id"
      are rearranged to consolidate the hole.
      
      Before:
      struct bpf_map {
      	...
              u32		max_entries;	/*    36     4	*/
              u32		map_flags;	/*    40     4	*/
      
              /* XXX 4 bytes hole, try to pack */
      
              u64		map_extra;	/*    48     8	*/
              int		spin_lock_off;	/*    56     4	*/
              int		timer_off;	/*    60     4	*/
              /* --- cacheline 1 boundary (64 bytes) --- */
              u32		id;		/*    64     4	*/
              int		numa_node;	/*    68     4	*/
      	...
              bool		frozen;		/*   117     1	*/
      
              /* XXX 10 bytes hole, try to pack */
      
              /* --- cacheline 2 boundary (128 bytes) --- */
      	...
              struct work_struct	work;	/*   144    72	*/
      
              /* --- cacheline 3 boundary (192 bytes) was 24 bytes ago --- */
      	struct mutex	freeze_mutex;	/*   216   144 	*/
      
              /* --- cacheline 5 boundary (320 bytes) was 40 bytes ago --- */
              u64		writecnt; 	/*   360     8	*/
      
          /* size: 384, cachelines: 6, members: 26 */
          /* sum members: 354, holes: 2, sum holes: 14 */
          /* padding: 16 */
          /* forced alignments: 2, forced holes: 1, sum forced holes: 10 */
      
      } __attribute__((__aligned__(64)));
      
      After:
      struct bpf_map {
      	...
              u32		max_entries;	/*    36     4	*/
              u64		map_extra;	/*    40     8 	*/
              u32		map_flags;	/*    48     4	*/
              int		spin_lock_off;	/*    52     4	*/
              int		timer_off;	/*    56     4	*/
              u32		id;		/*    60     4	*/
      
              /* --- cacheline 1 boundary (64 bytes) --- */
              int		numa_node;	/*    64     4	*/
      	...
      	bool		frozen;		/*   113     1  */
      
              /* XXX 14 bytes hole, try to pack */
      
              /* --- cacheline 2 boundary (128 bytes) --- */
      	...
              struct work_struct	work;	/*   144    72	*/
      
              /* --- cacheline 3 boundary (192 bytes) was 24 bytes ago --- */
              struct mutex	freeze_mutex;	/*   216   144	*/
      
              /* --- cacheline 5 boundary (320 bytes) was 40 bytes ago --- */
              u64		writecnt;       /*   360     8	*/
      
          /* size: 384, cachelines: 6, members: 26 */
          /* sum members: 354, holes: 1, sum holes: 14 */
          /* padding: 16 */
          /* forced alignments: 2, forced holes: 1, sum forced holes: 14 */
      
      } __attribute__((__aligned__(64)));
      
      2) Add alignment padding to the bpf_map_info struct.
      More details can be found in commit 36f9814a ("bpf: fix uapi hole
      for 32 bit compat applications").
      Signed-off-by: Joanne Koong <joannekoong@fb.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Acked-by: Yonghong Song <yhs@fb.com>
      Link: https://lore.kernel.org/bpf/20211029224909.1721024-3-joannekoong@fb.com
  17. 29 Oct 2021, 2 commits
    • bpf: Add bpf_kallsyms_lookup_name helper · d6aef08a
      Authored by Kumar Kartikeya Dwivedi
      This helper allows us to get the address of a kernel symbol from inside
      a BPF_PROG_TYPE_SYSCALL prog (used by gen_loader), so that we can
      relocate typeless ksym vars.
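      A hedged sketch of using it from a BPF_PROG_TYPE_SYSCALL program (the
      symbol name is only illustrative; flags must be 0 and name_sz counts the
      trailing NUL):
      
      #include "vmlinux.h"
      #include <bpf/bpf_helpers.h>
      
      SEC("syscall")
      int resolve_ksym(void *ctx)
      {
          const char name[] = "bpf_fentry_test1";   /* illustrative symbol */
          __u64 addr = 0;
      
          if (bpf_kallsyms_lookup_name(name, sizeof(name), 0, &addr) < 0)
              return 1;
      
          bpf_printk("addr=0x%llx", addr);
          return 0;
      }
      
      char LICENSE[] SEC("license") = "GPL";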
      Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Acked-by: Song Liu <songliubraving@fb.com>
      Link: https://lore.kernel.org/bpf/20211028063501.2239335-2-memxor@gmail.com
    • bpf: Add bloom filter map implementation · 9330986c
      Authored by Joanne Koong
      This patch adds the kernel-side changes for the implementation of
      a bpf bloom filter map.
      
      The bloom filter map supports peek (determining whether an element
      is present in the map) and push (adding an element to the map)
      operations. These operations are exposed to userspace applications
      through the already existing syscalls in the following way:
      
      BPF_MAP_LOOKUP_ELEM -> peek
      BPF_MAP_UPDATE_ELEM -> push
      
      The bloom filter map does not have keys, only values. In light of
      this, the bloom filter map's API matches that of queue stack maps:
      user applications use BPF_MAP_LOOKUP_ELEM/BPF_MAP_UPDATE_ELEM
      which correspond internally to bpf_map_peek_elem/bpf_map_push_elem,
      and bpf programs must use the bpf_map_peek_elem and bpf_map_push_elem
      APIs to query or add an element to the bloom filter map. When the
      bloom filter map is created, it must be created with a key_size of 0.
      
      For updates, the user will pass in the element to add to the map
      as the value, with a NULL key. For lookups, the user will pass in the
      element to query in the map as the value, with a NULL key. In the
      verifier layer, this requires us to modify the argument type of
      a bloom filter's BPF_FUNC_map_peek_elem call to ARG_PTR_TO_MAP_VALUE;
      as well, in the syscall layer, we need to copy over the user value
      so that in bpf_map_peek_elem, we know which specific value to query.
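      A hedged BPF-side sketch (it assumes libbpf support for the map_extra
      attribute in BTF map definitions, added later in this series):
      
      #include "vmlinux.h"
      #include <bpf/bpf_helpers.h>
      #include <bpf/bpf_tracing.h>
      
      struct {
          __uint(type, BPF_MAP_TYPE_BLOOM_FILTER);
          __uint(max_entries, 10000);    /* sizes the bitmap; not strictly enforced */
          __type(value, __u32);          /* no keys, only values */
          __uint(map_extra, 3);          /* number of hash functions (default 5) */
      } seen SEC(".maps");
      
      SEC("fentry/bpf_fentry_test1")
      int BPF_PROG(track, int a)
      {
          __u32 val = a;
      
          if (bpf_map_peek_elem(&seen, &val) == 0)
              return 0;                       /* probably pushed before */
      
          bpf_map_push_elem(&seen, &val, 0);  /* remember this value */
          return 0;
      }
      
      char LICENSE[] SEC("license") = "GPL";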
      
      A few things to take note of:
       * If there are any concurrent lookups + updates, the user is
      responsible for synchronizing this to ensure no false negative lookups
      occur.
       * The number of hashes to use for the bloom filter is configurable from
      userspace. If no number is specified, the default used will be 5 hash
      functions. The benchmarks later in this patchset can help compare the
      performance of using different numbers of hashes on different entry
      sizes. In general, using more hashes decreases both the false positive
      rate and the speed of a lookup.
       * Deleting an element in the bloom filter map is not supported.
       * The bloom filter map may be used as an inner map.
       * The "max_entries" size that is specified at map creation time is used
      to approximate a reasonable bitmap size for the bloom filter, and is not
      otherwise strictly enforced. If the user wishes to insert more entries
      into the bloom filter than "max_entries", they may do so but they should
      be aware that this may lead to a higher false positive rate.
      Signed-off-by: Joanne Koong <joannekoong@fb.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Acked-by: Andrii Nakryiko <andrii@kernel.org>
      Link: https://lore.kernel.org/bpf/20211027234504.30744-2-joannekoong@fb.com
  18. 22 Oct 2021, 2 commits
  19. 18 Sep 2021, 2 commits
  20. 16 Sep 2021, 1 commit
  21. 14 Sep 2021, 1 commit
  22. 11 Sep 2021, 1 commit
  23. 26 Aug 2021, 1 commit
  24. 24 Aug 2021, 1 commit
    • bpf: Migrate cgroup_bpf to internal cgroup_bpf_attach_type enum · 6fc88c35
      Authored by Dave Marchevsky
      Add an enum (cgroup_bpf_attach_type) containing only valid cgroup_bpf
      attach types and a function to map bpf_attach_type values to the new
      enum. Inspired by netns_bpf_attach_type.
      
      Then, migrate cgroup_bpf to use cgroup_bpf_attach_type wherever
      possible.  Functionality is unchanged as attach_type_to_prog_type
      switches in bpf/syscall.c were preventing non-cgroup programs from
      making use of the invalid cgroup_bpf array slots.
      
      As a result struct cgroup_bpf uses 504 fewer bytes relative to when its
      arrays were sized using MAX_BPF_ATTACH_TYPE.
      
      bpf_cgroup_storage is notably not migrated as struct
      bpf_cgroup_storage_key is part of uapi and contains a bpf_attach_type
      member which is not meant to be opaque. Similarly, bpf_cgroup_link
      continues to report its bpf_attach_type member to userspace via fdinfo
      and bpf_link_info.
      
      To ease disambiguation, bpf_attach_type variables are renamed from
      'type' to 'atype' when changed to cgroup_bpf_attach_type.
      Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Link: https://lore.kernel.org/bpf/20210819092420.1984861-2-davemarchevsky@fb.com
  25. 17 Aug 2021, 3 commits
    • bpf: Add bpf_get_attach_cookie() BPF helper to access bpf_cookie value · 7adfc6c9
      Authored by Andrii Nakryiko
      Add new BPF helper, bpf_get_attach_cookie(), which can be used by BPF programs
      to get access to a user-provided bpf_cookie value, specified during BPF
      program attachment (BPF link creation) time.
      
      Naming is hard, though. With the concept being named "BPF cookie", I've
      considered calling the helper:
        - bpf_get_cookie() -- seems too unspecific and easily mistaken with socket
          cookie;
        - bpf_get_bpf_cookie() -- too much tautology;
        - bpf_get_link_cookie() -- would be ok, but while we create a BPF link to
          attach BPF program to BPF hook, it's still an "attachment" and the
          bpf_cookie is associated with BPF program attachment to a hook, not a BPF
          link itself. Technically, we could support bpf_cookie with old-style
          cgroup programs. So I ultimately rejected it in favor of
          bpf_get_attach_cookie().
      
      Currently all perf_event-backed BPF program types support
      bpf_get_attach_cookie() helper. Follow-up patches will add support for
      fentry/fexit programs as well.
      
      While at it, mark bpf_tracing_func_proto() as static to make it obvious that
      it's only used from within kernel/trace/bpf_trace.c.
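      A minimal kprobe-side sketch of reading the cookie (the attach point is
      only illustrative; the cookie value itself is supplied at attach/link
      creation time):
      
      #include "vmlinux.h"
      #include <bpf/bpf_helpers.h>
      #include <bpf/bpf_tracing.h>
      
      SEC("kprobe/do_sys_openat2")
      int BPF_KPROBE(on_openat2)
      {
          /* Distinguishes which of the many attachments of this same
           * program actually fired. */
          __u64 cookie = bpf_get_attach_cookie(ctx);
      
          bpf_printk("cookie=%llu", cookie);
          return 0;
      }
      
      char LICENSE[] SEC("license") = "GPL";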
      Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Yonghong Song <yhs@fb.com>
      Link: https://lore.kernel.org/bpf/20210815070609.987780-7-andrii@kernel.org
    • bpf: Allow to specify user-provided bpf_cookie for BPF perf links · 82e6b1ee
      Authored by Andrii Nakryiko
      Add ability for users to specify custom u64 value (bpf_cookie) when creating
      BPF link for perf_event-backed BPF programs (kprobe/uprobe, perf_event,
      tracepoints).
      
      This is useful for cases when the same BPF program is used for attaching and
      processing invocation of different tracepoints/kprobes/uprobes in a generic
      fashion, but such that each invocation is distinguished from each other (e.g.,
      BPF program can look up additional information associated with a specific
      kernel function without having to rely on function IP lookups). This enables
      new use cases to be implemented simply and efficiently that previously were
      possible only through code generation (and thus multiple instances of almost
      identical BPF program) or compilation at runtime (BCC-style) on target hosts
      (even more expensive resource-wise). For uprobes it is not even possible in
      some cases to know the function IP beforehand (e.g., when attaching to a shared
      library without PID filtering, in which case base load address is not known
      for a library).
      
      This is done by storing u64 bpf_cookie in struct bpf_prog_array_item,
      corresponding to each attached and run BPF program. Given cgroup BPF programs
      already use two 8-byte pointers for their needs and cgroup BPF programs don't
      have (yet?) support for bpf_cookie, reuse that space through union of
      cgroup_storage and new bpf_cookie field.
      
      Make it available to kprobe/tracepoint BPF programs through bpf_trace_run_ctx.
      This is set by BPF_PROG_RUN_ARRAY, used by kprobe/uprobe/tracepoint BPF
      program execution code, which luckily is now also split from
      BPF_PROG_RUN_ARRAY_CG. This run context will be utilized by a new BPF helper
      giving access to this user-provided cookie value from inside a BPF program.
      Generic perf_event BPF programs will access this value from perf_event itself
      through passed in BPF program context.
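      On the userspace side, a hedged libbpf sketch (it assumes the opts-based
      kprobe attach API with a bpf_cookie field, which was added alongside this
      kernel change):
      
      #include <bpf/libbpf.h>
      
      /* Attach the same program to two kernel functions, tagging each
       * attachment with a distinct cookie. */
      static int attach_both(struct bpf_program *prog)
      {
          DECLARE_LIBBPF_OPTS(bpf_kprobe_opts, o1, .bpf_cookie = 1);
          DECLARE_LIBBPF_OPTS(bpf_kprobe_opts, o2, .bpf_cookie = 2);
          struct bpf_link *l1, *l2;
      
          l1 = bpf_program__attach_kprobe_opts(prog, "tcp_v4_connect", &o1);
          if (libbpf_get_error(l1))
              return -1;
      
          l2 = bpf_program__attach_kprobe_opts(prog, "tcp_v6_connect", &o2);
          if (libbpf_get_error(l2))
              return -1;
      
          return 0;
      }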
      Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Yonghong Song <yhs@fb.com>
      Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Link: https://lore.kernel.org/bpf/20210815070609.987780-6-andrii@kernel.org
    • bpf: Implement minimal BPF perf link · b89fbfbb
      Authored by Andrii Nakryiko
      Introduce a new type of BPF link - BPF perf link. This brings perf_event-based
      BPF program attachments (perf_event, tracepoints, kprobes, and uprobes) into
      the common BPF link infrastructure, allowing users to list all active
      perf_event-based attachments, auto-detach the BPF program from the
      perf_event when the link's FD is closed, and get generic BPF link
      fdinfo/get_info functionality.
      
      BPF_LINK_CREATE command expects perf_event's FD as target_fd. No extra flags
      are currently supported.
      
      Force-detaching and atomic BPF program updates are not yet implemented, but
      with perf_event-based BPF links we now have common framework for this without
      the need to extend ioctl()-based perf_event interface.
      
      One interesting consideration is a new value for bpf_attach_type, which
      BPF_LINK_CREATE command expects. Generally, it's either 1-to-1 mapping from
      bpf_attach_type to bpf_prog_type, or many-to-1 mapping from a subset of
      bpf_attach_types to one bpf_prog_type (e.g., see BPF_PROG_TYPE_SK_SKB or
      BPF_PROG_TYPE_CGROUP_SOCK). In this case, though, we have three different
      program types (KPROBE, TRACEPOINT, PERF_EVENT) using the same perf_event-based
      mechanism, so it's many bpf_prog_types to one bpf_attach_type. I chose to
      define a single BPF_PERF_EVENT attach type for all of them and adjust
      link_create()'s logic for checking correspondence between attach type and
      program type.
      
      The alternative would be to define three new attach types (e.g., BPF_KPROBE,
      BPF_TRACEPOINT, and BPF_PERF_EVENT), but that seemed like unnecessary overkill
      and BPF_KPROBE would cause naming conflicts with the BPF_KPROBE() macro defined by
      libbpf. I chose to not do this to avoid unnecessary proliferation of
      bpf_attach_type enum values and not have to deal with naming conflicts.
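      At the syscall-wrapper level, attaching a program to an existing
      perf_event FD then looks roughly like this (a sketch; setting up prog_fd
      and perf_fd is omitted):
      
      #include <bpf/bpf.h>
      
      /* Returns a BPF link FD; closing it auto-detaches the program. */
      static int attach_perf_link(int prog_fd, int perf_fd)
      {
          DECLARE_LIBBPF_OPTS(bpf_link_create_opts, opts);
      
          /* target_fd is the perf_event FD; the attach type is the new
           * BPF_PERF_EVENT value. */
          return bpf_link_create(prog_fd, perf_fd, BPF_PERF_EVENT, &opts);
      }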
      Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Yonghong Song <yhs@fb.com>
      Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Link: https://lore.kernel.org/bpf/20210815070609.987780-5-andrii@kernel.org
  26. 16 Jul 2021, 3 commits
    • bpf: Add bpf_get_func_ip helper for kprobe programs · 9ffd9f3f
      Authored by Jiri Olsa
      Adding bpf_get_func_ip helper for BPF_PROG_TYPE_KPROBE programs,
      so it's now possible to call bpf_get_func_ip from both kprobe and
      kretprobe programs.
      
      Taking the caller's address from 'struct kprobe::addr', which is
      defined for both kprobe and kretprobe.
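      For example (a sketch; the probed function is illustrative), a kretprobe
      can now recover the probed function's address even though the IP register
      points elsewhere at return time:
      
      #include "vmlinux.h"
      #include <bpf/bpf_helpers.h>
      #include <bpf/bpf_tracing.h>
      
      SEC("kretprobe/do_sys_openat2")
      int BPF_KRETPROBE(on_openat2_ret)
      {
          __u64 ip = bpf_get_func_ip(ctx);   /* address from struct kprobe::addr */
      
          bpf_printk("probed func at 0x%llx", ip);
          return 0;
      }
      
      char LICENSE[] SEC("license") = "GPL";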
      Signed-off-by: Jiri Olsa <jolsa@kernel.org>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Reviewed-by: Masami Hiramatsu <mhiramat@kernel.org>
      Link: https://lore.kernel.org/bpf/20210714094400.396467-5-jolsa@kernel.org
    • bpf: Add bpf_get_func_ip helper for tracing programs · 9b99edca
      Authored by Jiri Olsa
      Adding bpf_get_func_ip helper for BPF_PROG_TYPE_TRACING programs,
      specifically for all trampoline attach types.
      
      The trampoline's caller IP address is stored at the (ctx - 8) address,
      so there's no reason to actually call the helper; instead, the verifier
      fixes up the call instruction and returns the [ctx - 8] value directly.
      Signed-off-by: Jiri Olsa <jolsa@kernel.org>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Link: https://lore.kernel.org/bpf/20210714094400.396467-4-jolsa@kernel.org
    • bpf: Introduce bpf timers. · b00628b1
      Authored by Alexei Starovoitov
      Introduce 'struct bpf_timer { __u64 :64; __u64 :64; };' that can be embedded
      in hash/array/lru maps as a regular field and helpers to operate on it:
      
      // Initialize the timer.
      // First 4 bits of 'flags' specify clockid.
      // Only CLOCK_MONOTONIC, CLOCK_REALTIME, CLOCK_BOOTTIME are allowed.
      long bpf_timer_init(struct bpf_timer *timer, struct bpf_map *map, int flags);
      
      // Configure the timer to call 'callback_fn' static function.
      long bpf_timer_set_callback(struct bpf_timer *timer, void *callback_fn);
      
      // Arm the timer to expire 'nsec' nanoseconds from the current time.
      long bpf_timer_start(struct bpf_timer *timer, u64 nsec, u64 flags);
      
      // Cancel the timer and wait for callback_fn to finish if it was running.
      long bpf_timer_cancel(struct bpf_timer *timer);
      
      Here is how BPF program might look like:
      struct map_elem {
          int counter;
          struct bpf_timer timer;
      };
      
      struct {
          __uint(type, BPF_MAP_TYPE_HASH);
          __uint(max_entries, 1000);
          __type(key, int);
          __type(value, struct map_elem);
      } hmap SEC(".maps");
      
      static int timer_cb(void *map, int *key, struct map_elem *val);
      /* val points to particular map element that contains bpf_timer. */
      
      SEC("fentry/bpf_fentry_test1")
      int BPF_PROG(test1, int a)
      {
          struct map_elem *val;
          int key = 0;
      
          val = bpf_map_lookup_elem(&hmap, &key);
          if (val) {
              bpf_timer_init(&val->timer, &hmap, CLOCK_REALTIME);
              bpf_timer_set_callback(&val->timer, timer_cb);
              bpf_timer_start(&val->timer, 1000 /* call timer_cb in 1 usec */, 0);
          }
          return 0;
      }
      
      This patch adds helper implementations that rely on hrtimers
      to call bpf functions as timers expire.
      The following patches add necessary safety checks.
      
      Only programs with CAP_BPF are allowed to use bpf_timer.
      
      The amount of timers used by the program is constrained by
      the memcg recorded at map creation time.
      
      The bpf_timer_init() helper needs explicit 'map' argument because inner maps
      are dynamic and not known at load time. While the bpf_timer_set_callback() is
      receiving hidden 'aux->prog' argument supplied by the verifier.
      
      The prog pointer is needed to do refcnting of bpf program to make sure that
      program doesn't get freed while the timer is armed. This approach relies on
      "user refcnt" scheme used in prog_array that stores bpf programs for
      bpf_tail_call. The bpf_timer_set_callback() will increment the prog refcnt which is
      paired with bpf_timer_cancel() that will drop the prog refcnt. The
      ops->map_release_uref is responsible for cancelling the timers and dropping
      prog refcnt when user space reference to a map reaches zero.
      This uref approach is done to make sure that Ctrl-C of user space process will
      not leave timers running forever unless the user space explicitly pinned a map
      that contained timers in bpffs.
      
      bpf_timer_init() and bpf_timer_set_callback() will return -EPERM if map doesn't
      have user references (is not held by open file descriptor from user space and
      not pinned in bpffs).
      
      The bpf_map_delete_elem() and bpf_map_update_elem() operations cancel
      and free the timer if given map element had it allocated.
      "bpftool map update" command can be used to cancel timers.
      
      The 'struct bpf_timer' is explicitly __attribute__((aligned(8))) because
      '__u64 :64' bitfields have only 1-byte alignment despite taking 8 bytes.
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Martin KaFai Lau <kafai@fb.com>
      Acked-by: Andrii Nakryiko <andrii@kernel.org>
      Acked-by: Toke Høiland-Jørgensen <toke@redhat.com>
      Link: https://lore.kernel.org/bpf/20210715005417.78572-4-alexei.starovoitov@gmail.com
  27. 15 Jul 2021, 1 commit