1. 29 Sep 2022 (1 commit)
  2. 27 Sep 2022 (2 commits)
  3. 22 Sep 2022 (4 commits)
    • bpf: Prevent bpf program recursion for raw tracepoint probes · 05b24ff9
      Committed by Jiri Olsa
      We got a report from syzbot [1] about warnings caused by a bpf program
      attached to the contention_begin raw tracepoint triggering the same
      tracepoint again: the program uses the bpf_trace_printk helper, which
      takes the trace_printk_lock lock.
      
       Call Trace:
        <TASK>
        ? trace_event_raw_event_bpf_trace_printk+0x5f/0x90
        bpf_trace_printk+0x2b/0xe0
        bpf_prog_a9aec6167c091eef_prog+0x1f/0x24
        bpf_trace_run2+0x26/0x90
        native_queued_spin_lock_slowpath+0x1c6/0x2b0
        _raw_spin_lock_irqsave+0x44/0x50
        bpf_trace_printk+0x3f/0xe0
        bpf_prog_a9aec6167c091eef_prog+0x1f/0x24
        bpf_trace_run2+0x26/0x90
        native_queued_spin_lock_slowpath+0x1c6/0x2b0
        _raw_spin_lock_irqsave+0x44/0x50
        bpf_trace_printk+0x3f/0xe0
        bpf_prog_a9aec6167c091eef_prog+0x1f/0x24
        bpf_trace_run2+0x26/0x90
        native_queued_spin_lock_slowpath+0x1c6/0x2b0
        _raw_spin_lock_irqsave+0x44/0x50
        bpf_trace_printk+0x3f/0xe0
        bpf_prog_a9aec6167c091eef_prog+0x1f/0x24
        bpf_trace_run2+0x26/0x90
        native_queued_spin_lock_slowpath+0x1c6/0x2b0
        _raw_spin_lock_irqsave+0x44/0x50
        __unfreeze_partials+0x5b/0x160
        ...
      
      This can be reproduced by attaching a bpf program as a raw tracepoint
      on the contention_begin tracepoint. The bpf prog calls the
      bpf_trace_printk helper. Then, by running perf bench, the spin lock
      code is forced to take the slow path and hit the contention_begin
      tracepoint (a reproduction sketch follows this entry).
      
      Fix this by skipping execution of the bpf program if it is already
      running, using the bpf prog 'active' field, which is currently used
      by trampoline programs for the same reason.
      
      Move bpf_prog_inc_misses_counter to syscall.c because trampoline.c is
      compiled only when the CONFIG_BPF_JIT option is set.
      Reviewed-by: Stanislav Fomichev <sdf@google.com>
      Reported-by: syzbot+2251879aa068ad9c960d@syzkaller.appspotmail.com
      [1] https://lore.kernel.org/bpf/YxhFe3EwqchC%2FfYf@krava/T/#t
      Signed-off-by: Jiri Olsa <jolsa@kernel.org>
      Link: https://lore.kernel.org/r/20220916071914.7156-1-jolsa@kernel.org
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      05b24ff9
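      A minimal reproduction sketch in BPF C, based on the description above;
      the section name, tracepoint arguments and format string are assumptions
      for illustration, not taken from the syzbot reproducer:

      // SPDX-License-Identifier: GPL-2.0
      /* Hedged sketch: attach to the contention_begin raw tracepoint and call
       * bpf_trace_printk(), which takes trace_printk_lock and can itself hit
       * contention_begin again, recursing until the 'active' check stops it. */
      #include "vmlinux.h"
      #include <bpf/bpf_helpers.h>
      #include <bpf/bpf_tracing.h>

      char LICENSE[] SEC("license") = "GPL";

      SEC("raw_tp/contention_begin")            /* assumed section name */
      int BPF_PROG(on_contention, void *lock, unsigned int flags)
      {
              static const char fmt[] = "contention on %p\n";

              bpf_trace_printk(fmt, sizeof(fmt), lock);
              return 0;
      }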
    • bpf: Add bpf_lookup_*_key() and bpf_key_put() kfuncs · f3cf4134
      Committed by Roberto Sassu
      Add the bpf_lookup_user_key(), bpf_lookup_system_key() and bpf_key_put()
      kfuncs, to respectively search a key with a given key handle serial number
      and flags, obtain a key from a pre-determined ID defined in
      include/linux/verification.h, and release the acquired key reference
      (a usage sketch follows this entry).
      
      Introduce system_keyring_id_check() to validate the keyring ID parameter of
      bpf_lookup_system_key().
      Signed-off-by: Roberto Sassu <roberto.sassu@huawei.com>
      Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
      Acked-by: Song Liu <song@kernel.org>
      Link: https://lore.kernel.org/r/20220920075951.929132-8-roberto.sassu@huaweicloud.com
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      f3cf4134
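      A hedged usage sketch of these kfuncs from a sleepable LSM program. It
      assumes a kernel with this series applied (so struct bpf_key exists in
      vmlinux.h); the hand-written extern prototypes, the hook choice and the
      serial value are illustrative assumptions:

      // SPDX-License-Identifier: GPL-2.0
      /* Sketch only: look up a user key by serial, then drop the reference. */
      #include "vmlinux.h"
      #include <bpf/bpf_helpers.h>
      #include <bpf/bpf_tracing.h>

      struct bpf_key *bpf_lookup_user_key(__u32 serial, __u64 flags) __ksym;
      void bpf_key_put(struct bpf_key *bkey) __ksym;

      char LICENSE[] SEC("license") = "GPL";

      SEC("lsm.s/bpf")                     /* sleepable LSM hook, illustrative */
      int BPF_PROG(key_demo, int cmd, union bpf_attr *attr, unsigned int size)
      {
              struct bpf_key *bkey;

              bkey = bpf_lookup_user_key(0x12345678, 0); /* example serial */
              if (!bkey)
                      return 0;

              /* ... inspect or use the key here ... */
              bpf_key_put(bkey);
              return 0;
      }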
    • bpf: Export bpf_dynptr_get_size() · 51df4865
      Committed by Roberto Sassu
      Export bpf_dynptr_get_size(), so that kernel code dealing with eBPF dynamic
      pointers can obtain the real size of the data carried by this data
      structure (a usage sketch follows this entry).
      Signed-off-by: Roberto Sassu <roberto.sassu@huawei.com>
      Reviewed-by: Joanne Koong <joannelkoong@gmail.com>
      Acked-by: KP Singh <kpsingh@kernel.org>
      Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
      Link: https://lore.kernel.org/r/20220920075951.929132-6-roberto.sassu@huaweicloud.com
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      51df4865
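      A minimal kernel-side sketch of the intended use. It assumes the exported
      helper takes a struct bpf_dynptr_kern pointer and returns the data size as
      a u32, as described in the series; the surrounding function is illustrative:

      #include <linux/bpf.h>
      #include <linux/errno.h>

      /* Sketch: refuse to consume a dynptr whose real payload would not fit
       * into a fixed-size destination buffer. */
      static int check_dynptr_fits(struct bpf_dynptr_kern *ptr, u32 max_len)
      {
              u32 size = bpf_dynptr_get_size(ptr);    /* real data size */

              return size <= max_len ? 0 : -E2BIG;
      }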
    • bpf: Add bpf_user_ringbuf_drain() helper · 20571567
      Committed by David Vernet
      In a prior change, we added a new BPF_MAP_TYPE_USER_RINGBUF map type which
      will allow user-space applications to publish messages to a ring buffer
      that is consumed by a BPF program in kernel-space. In order for this
      map-type to be useful, it will require a BPF helper function that BPF
      programs can invoke to drain samples from the ring buffer, and invoke
      callbacks on those samples. This change adds that capability via a new BPF
      helper function:
      
      bpf_user_ringbuf_drain(struct bpf_map *map, void *callback_fn, void *ctx,
                             u64 flags)
      
      BPF programs may invoke this function to run callback_fn() on a series of
      samples in the ring buffer. callback_fn() has the following signature:
      
      long callback_fn(struct bpf_dynptr *dynptr, void *context);
      
      Samples are provided to the callback in the form of struct bpf_dynptr *'s,
      which the program can read using BPF helper functions for querying
      struct bpf_dynptr's.
      
      In order to support bpf_user_ringbuf_drain(), a new PTR_TO_DYNPTR register
      type is added to the verifier to reflect a dynptr that was allocated by
      a helper function and passed to a BPF program. Unlike PTR_TO_STACK
      dynptrs which are allocated on the stack by a BPF program, PTR_TO_DYNPTR
      dynptrs need not use reference tracking, as the BPF helper is trusted to
      properly free the dynptr before returning. The verifier currently only
      supports PTR_TO_DYNPTR registers that are also DYNPTR_TYPE_LOCAL.
      
      Note that while the corresponding user-space libbpf logic will be added
      in a subsequent patch, this patch does contain an implementation of the
      .map_poll() callback for BPF_MAP_TYPE_USER_RINGBUF maps. This
      .map_poll() callback guarantees that an epoll-waiting user-space
      producer will receive at least one event notification whenever at least
      one sample is drained in an invocation of bpf_user_ringbuf_drain(),
      provided that the function is not invoked with the BPF_RB_NO_WAKEUP
      flag. If the BPF_RB_FORCE_WAKEUP flag is provided, a wakeup
      notification is sent even if no sample was drained. A BPF-side usage
      sketch follows this entry.
      Signed-off-by: David Vernet <void@manifault.com>
      Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
      Link: https://lore.kernel.org/bpf/20220920000100.477320-3-void@manifault.com
      20571567
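      A hedged BPF-side sketch of the drain pattern described above; the map
      size, attach point and sample layout are illustrative assumptions:

      // SPDX-License-Identifier: GPL-2.0
      /* Sketch: drain samples that user space published into a
       * BPF_MAP_TYPE_USER_RINGBUF map, reading each sample via its dynptr. */
      #include "vmlinux.h"
      #include <bpf/bpf_helpers.h>

      struct {
              __uint(type, BPF_MAP_TYPE_USER_RINGBUF);
              __uint(max_entries, 256 * 1024);        /* size is an example */
      } user_rb SEC(".maps");

      char LICENSE[] SEC("license") = "GPL";

      static long handle_sample(struct bpf_dynptr *dynptr, void *ctx)
      {
              __u64 value = 0;

              /* Copy the first 8 bytes of the sample; the layout is assumed. */
              if (bpf_dynptr_read(&value, sizeof(value), dynptr, 0, 0))
                      return 0;       /* skip malformed sample, keep draining */

              *(long *)ctx += 1;
              return 0;               /* 0 = continue, 1 = stop draining */
      }

      SEC("tp/syscalls/sys_enter_getpgid")    /* illustrative attach point */
      int drain_demo(void *ctx)
      {
              long drained = 0;

              bpf_user_ringbuf_drain(&user_rb, handle_sample, &drained, 0);
              return 0;
      }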
  4. 17 Sep 2022 (1 commit)
  5. 11 Sep 2022 (1 commit)
  6. 08 Sep 2022 (4 commits)
  7. 07 Sep 2022 (1 commit)
  8. 26 Aug 2022 (1 commit)
    • bpf: Introduce cgroup iter · d4ccaf58
      Committed by Hao Luo
      Cgroup_iter is a type of bpf_iter. It walks over cgroups in four modes:
      
       - walking a cgroup's descendants in pre-order.
       - walking a cgroup's descendants in post-order.
       - walking a cgroup's ancestors.
       - processing only the given cgroup.
      
      When attaching a cgroup_iter, one can set a cgroup on the iter_link
      created from the attachment. This cgroup is passed as a file descriptor
      or cgroup id and serves as the starting point of the walk. If no
      cgroup is specified, the starting point is the cgroup v2 root.
      
      For walking descendants, one can specify the order: either pre-order or
      post-order. For walking ancestors, the walk starts at the specified
      cgroup and ends at the root.
      
      One can also terminate the walk early by returning 1 from the iter
      program.
      
      Note that because walking cgroup hierarchy holds cgroup_mutex, the iter
      program is called with cgroup_mutex held.
      
      Currently only one session is supported, which means that, depending on
      the volume of data the bpf program intends to send to user space, the
      number of cgroups that can be walked is limited. For example, given that
      the current buffer size is 8 * PAGE_SIZE and assuming PAGE_SIZE is 4KB,
      if the program sends 64B of data for each cgroup, the total number of
      cgroups that can be walked is 512. This is a limitation of cgroup_iter.
      If the output data is larger than the kernel buffer size, then after all
      data in the kernel buffer is consumed by user space, the subsequent
      read() syscall will signal EOPNOTSUPP. To work around this, the user may
      have to update the program to reduce the volume of data sent to output,
      for example by skipping some uninteresting cgroups. In the future, we may
      extend the bpf_iter flags to allow customizing the buffer size. (An
      example iterator program follows this entry.)
      Acked-by: Yonghong Song <yhs@fb.com>
      Acked-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Hao Luo <haoluo@google.com>
      Link: https://lore.kernel.org/r/20220824233117.1312810-2-haoluo@google.com
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      d4ccaf58
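      A hedged sketch of a cgroup iterator program. It assumes a kernel with
      this patch applied, so that struct bpf_iter__cgroup is available in
      vmlinux.h; the id field access follows common bpf_iter selftest style:

      // SPDX-License-Identifier: GPL-2.0
      /* Sketch: print the id of every cgroup visited by the walk.
       * Returning 1 instead would terminate the walk early, as noted above. */
      #include "vmlinux.h"
      #include <bpf/bpf_helpers.h>
      #include <bpf/bpf_tracing.h>

      char LICENSE[] SEC("license") = "GPL";

      SEC("iter/cgroup")
      int dump_cgroup_ids(struct bpf_iter__cgroup *ctx)
      {
              struct seq_file *seq = ctx->meta->seq;
              struct cgroup *cgrp = ctx->cgroup;

              /* A NULL cgroup marks the end of the iteration session. */
              if (!cgrp)
                      return 0;

              BPF_SEQ_PRINTF(seq, "cgroup id: %llu\n", cgrp->kn->id);
              return 0;
      }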
  9. 24 Aug 2022 (1 commit)
  10. 19 Aug 2022 (1 commit)
    • bpf: net: Avoid sk_setsockopt() taking sk lock when called from bpf · 24426654
      Committed by Martin KaFai Lau
      Most of the code in bpf_setsockopt(SOL_SOCKET) is duplicated from
      sk_setsockopt().  The number of supported optnames keeps increasing,
      and so does the duplicated code.
      
      One issue in reusing sk_setsockopt() is that the bpf prog
      has already acquired the sk lock.  This patch adds
      has_current_bpf_ctx() to tell whether sk_setsockopt() is called from
      a bpf prog.  The bpf prog calling bpf_setsockopt() is either running
      in_task() or in_serving_softirq().  Both cases have current->bpf_ctx
      initialized.  Thus, has_current_bpf_ctx() only needs to
      test !!current->bpf_ctx.
      
      This patch also adds sockopt_{lock,release}_sock() helpers
      for sk_setsockopt() to use.  These helpers test
      has_current_bpf_ctx() before acquiring/releasing the lock (see the
      sketch after this entry).  They are exported with EXPORT_SYMBOL so
      the ipv6 module can use them in a later patch.
      
      Note the change in sock_setbindtodevice(): sockopt_lock_sock()
      is now called in sock_setbindtodevice() instead of taking lock_sock
      in sock_bindtoindex(..., lock_sk = true).
      Reviewed-by: Stanislav Fomichev <sdf@google.com>
      Signed-off-by: Martin KaFai Lau <kafai@fb.com>
      Link: https://lore.kernel.org/r/20220817061717.4175589-1-kafai@fb.com
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      24426654
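      A minimal sketch of the locking pattern described above, assuming the
      helper shapes given in this commit message rather than quoting the exact
      kernel diff:

      /* Kernel-side sketch (net/core/sock.c context is assumed): skip the
       * socket lock when sk_setsockopt() is reached from a bpf program,
       * which has already acquired it. */
      #include <linux/bpf.h>
      #include <net/sock.h>

      static inline bool has_current_bpf_ctx(void)
      {
              return !!current->bpf_ctx;
      }

      static void sockopt_lock_sock(struct sock *sk)
      {
              /* The bpf prog calling bpf_setsockopt() already owns the lock. */
              if (has_current_bpf_ctx())
                      return;

              lock_sock(sk);
      }

      static void sockopt_release_sock(struct sock *sk)
      {
              if (has_current_bpf_ctx())
                      return;

              release_sock(sk);
      }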
  11. 10 Aug 2022 (1 commit)
  12. 23 Jul 2022 (2 commits)
  13. 22 Jul 2022 (1 commit)
  14. 21 Jul 2022 (1 commit)
  15. 13 Jul 2022 (2 commits)
    • bpf, x86: fix freeing of not-finalized bpf_prog_pack · 1d5f82d9
      Committed by Song Liu
      syzbot reported a few issues with bpf_prog_pack [1], [2]. These only happen
      with multiple subprogs. In jit_subprogs(), we first call bpf_int_jit_compile()
      on each sub program, and then we call it on each sub program again. jit_data
      is not freed in the first call of bpf_int_jit_compile(). Similarly, we don't
      call bpf_jit_binary_pack_finalize() in the first call of bpf_int_jit_compile().
      
      If bpf_int_jit_compile() fails for one sub program, we will call
      bpf_jit_binary_pack_finalize() for that sub program. However, we don't get a
      chance to call it for the other sub programs. Then we hit "goto out_free" in
      jit_subprogs() and call bpf_jit_free() on some subprograms that haven't had
      bpf_jit_binary_pack_finalize() called yet.
      
      At this point, bpf_jit_binary_pack_free() is called and the whole 2MB page is
      freed erroneously.
      
      Fix this with a custom bpf_jit_free() for x86_64, which calls
      bpf_jit_binary_pack_finalize() if necessary. Also, with custom
      bpf_jit_free(), bpf_prog_aux->use_bpf_prog_pack is not needed any more,
      remove it.
      
      Fixes: 1022a549 ("bpf, x86_64: Use bpf_jit_binary_pack_alloc")
      [1] https://syzkaller.appspot.com/bug?extid=2f649ec6d2eea1495a8f
      [2] https://syzkaller.appspot.com/bug?extid=87f65c75f4a72db05445
      Reported-by: syzbot+2f649ec6d2eea1495a8f@syzkaller.appspotmail.com
      Reported-by: syzbot+87f65c75f4a72db05445@syzkaller.appspotmail.com
      Signed-off-by: Song Liu <song@kernel.org>
      Link: https://lore.kernel.org/r/20220706002612.4013790-1-song@kernel.org
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      1d5f82d9
    • bpf: reparent bpf maps on memcg offlining · 4201d9ab
      Committed by Roman Gushchin
      The memory consumed by a bpf map is always accounted to the memory
      cgroup of the process which created the map. The map can outlive
      the memory cgroup if it's used by processes in other cgroups or
      is pinned on bpffs. In this case the map pins the original cgroup
      in the dying state.
      
      For other types of objects (slab objects, non-slab kernel allocations,
      percpu objects and recently LRU pages) there is a reparenting process
      implemented: on cgroup offlining charged objects are getting
      reassigned to the parent cgroup. Because all charges and statistics
      are fully recursive it's a fairly cheap operation.
      
      For efficiency and consistency with other types of objects, let's do
      the same for bpf maps. Fortunately thanks to the objcg API, the
      required changes are minimal.
      
      Please note that individual allocations (slabs, percpu and large
      kmallocs) already have the reparenting mechanism. This commit adds
      it to the saved map->memcg pointer by replacing it with map->objcg.
      Because dying cgroups are not visible to a user and all charges are
      recursive, this commit doesn't bring any behavior changes for a user.
      (A sketch of the objcg-based pattern follows this entry.)
      
      v2:
        added a missing const qualifier
      Signed-off-by: Roman Gushchin <roman.gushchin@linux.dev>
      Reviewed-by: Shakeel Butt <shakeelb@google.com>
      Link: https://lore.kernel.org/r/20220711162827.184743-1-roman.gushchin@linux.dev
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      4201d9ab
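      A hedged sketch of the objcg-based accounting described above; the helper
      names around the saved pointer are assumptions for illustration, not the
      verbatim diff:

      #include <linux/bpf.h>
      #include <linux/memcontrol.h>

      /* Sketch: remember the object cgroup at map creation time and resolve
       * the (possibly reparented) memory cgroup from it for every charge. */
      static void bpf_map_save_memcg(struct bpf_map *map)
      {
              /* The objcg survives cgroup offlining; on offlining it is
               * re-pointed at the parent memcg, which is the reparenting. */
              map->objcg = get_obj_cgroup_from_current();
      }

      static struct mem_cgroup *bpf_map_get_memcg(const struct bpf_map *map)
      {
              if (map->objcg)
                      return get_mem_cgroup_from_objcg(map->objcg);

              return root_mem_cgroup;
      }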
  16. 30 Jun 2022 (4 commits)
    • bpf: expose bpf_{g,s}etsockopt to lsm cgroup · 9113d7e4
      Committed by Stanislav Fomichev
      I don't see how to make this nice without introducing btf id lists
      for the hooks where these helpers are allowed. Some LSM hooks
      work on locked sockets, some trigger early and
      don't grab any locks, so have two lists for now:
      
      1. LSM hooks which trigger under socket lock - minority of the hooks,
         but ideal case for us, we can expose existing BTF-based helpers
      2. LSM hooks which trigger without socket lock, but they trigger
         early in the socket creation path where it should be safe to
         do setsockopt without any locks
       3. The rest are prohibited. I'm thinking that this use-case might
          be a good gateway to sleeping lsm cgroup hooks in the future.
          We can either expose lock/unlock operations (and add tracking
          to the verifier) or have another set of bpf_setsockopt
          wrappers that grab the locks and might sleep.
      Reviewed-by: Martin KaFai Lau <kafai@fb.com>
      Signed-off-by: Stanislav Fomichev <sdf@google.com>
      Link: https://lore.kernel.org/r/20220628174314.1216643-7-sdf@google.com
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      9113d7e4
    • bpf: minimize number of allocated lsm slots per program · c0e19f2c
      Committed by Stanislav Fomichev
      The previous patch adds a 1:1 mapping between all 211 LSM hooks
      and the bpf_cgroup program array. Instead of reserving a slot per
      possible hook, reserve 10 slots per cgroup for lsm programs.
      Those slots are dynamically allocated on demand and reclaimed.
      
      struct cgroup_bpf {
      	struct bpf_prog_array *    effective[33];        /*     0   264 */
      	/* --- cacheline 4 boundary (256 bytes) was 8 bytes ago --- */
      	struct hlist_head          progs[33];            /*   264   264 */
      	/* --- cacheline 8 boundary (512 bytes) was 16 bytes ago --- */
      	u8                         flags[33];            /*   528    33 */
      
      	/* XXX 7 bytes hole, try to pack */
      
      	struct list_head           storages;             /*   568    16 */
      	/* --- cacheline 9 boundary (576 bytes) was 8 bytes ago --- */
      	struct bpf_prog_array *    inactive;             /*   584     8 */
      	struct percpu_ref          refcnt;               /*   592    16 */
      	struct work_struct         release_work;         /*   608    72 */
      
      	/* size: 680, cachelines: 11, members: 7 */
      	/* sum members: 673, holes: 1, sum holes: 7 */
      	/* last cacheline: 40 bytes */
      };
      Reviewed-by: Martin KaFai Lau <kafai@fb.com>
      Signed-off-by: Stanislav Fomichev <sdf@google.com>
      Link: https://lore.kernel.org/r/20220628174314.1216643-5-sdf@google.com
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      c0e19f2c
    • bpf: per-cgroup lsm flavor · 69fd337a
      Committed by Stanislav Fomichev
      Allow attaching to lsm hooks in the cgroup context.
      
      Attaching to per-cgroup LSM works exactly like attaching
      to other per-cgroup hooks. A new BPF_LSM_CGROUP attach type is added
      to trigger the new mode; the actual lsm hook we attach to is
      signaled via the existing attach_btf_id (a sketch of such a program
      follows this entry).
      
      For the hooks that have 'struct socket' or 'struct sock' as their first
      argument, we use the cgroup associated with that socket. For the rest,
      we use the 'current' cgroup (this is all on the default hierarchy == v2 only).
      Note that for some hooks that work on 'struct sock' we still
      take the cgroup from 'current' because some of them run on a socket
      that hasn't been properly initialized yet.
      
      Behind the scenes, we allocate a shim program that is attached
      to the trampoline and runs the cgroup's effective BPF program array.
      This shim has some rudimentary ref counting and can be shared
      between several programs attaching to the same lsm hook from
      different cgroups.
      
      Note that this patch bloats the cgroup size because we add 211
      cgroup_bpf_attach_type(s) for simplicity's sake. This will be
      addressed in the subsequent patch.
      
      Also note that we only add the non-sleepable flavor for now. To enable
      sleepable use-cases, bpf_prog_run_array_cg has to grab trace rcu,
      shim programs have to be freed via trace rcu, cgroup_bpf.effective
      should also be trace-rcu-managed, plus maybe some other changes that
      I'm not aware of.
      Reviewed-by: Martin KaFai Lau <kafai@fb.com>
      Signed-off-by: Stanislav Fomichev <sdf@google.com>
      Link: https://lore.kernel.org/r/20220628174314.1216643-4-sdf@google.com
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      69fd337a
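      A hedged sketch of a program using the new attach mode; the hook choice
      and the cgroup-style return convention (1 = allow, 0 = reject) are
      assumptions for illustration:

      // SPDX-License-Identifier: GPL-2.0
      /* Sketch: a per-cgroup LSM program attached as BPF_LSM_CGROUP. */
      #include "vmlinux.h"
      #include <bpf/bpf_helpers.h>
      #include <bpf/bpf_tracing.h>

      char LICENSE[] SEC("license") = "GPL";

      SEC("lsm_cgroup/socket_post_create")
      int BPF_PROG(deny_raw_sockets, struct socket *sock, int family,
                   int type, int protocol, int kern)
      {
              /* Deny raw sockets for tasks in the attached cgroup. */
              if (type == SOCK_RAW)
                      return 0;       /* reject */

              return 1;               /* allow */
      }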
    • bpf: add bpf_func_t and trampoline helpers · af3f4134
      Committed by Stanislav Fomichev
      I'll be adding lsm cgroup specific helpers that grab
      trampoline mutex.
      
      No functional changes.
      Reviewed-by: Martin KaFai Lau <kafai@fb.com>
      Signed-off-by: Stanislav Fomichev <sdf@google.com>
      Link: https://lore.kernel.org/r/20220628174314.1216643-2-sdf@google.com
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      af3f4134
  17. 21 Jun 2022 (1 commit)
    • bpf: Inline calls to bpf_loop when callback is known · 1ade2371
      Committed by Eduard Zingerman
      Calls to `bpf_loop` are replaced with direct loops to avoid
      indirection. E.g. the following:
      
        bpf_loop(10, foo, NULL, 0);
      
      Is replaced by the equivalent of the following:
      
        for (int i = 0; i < 10; ++i)
          foo(i, NULL);
      
      This transformation could be applied when:
      - callback is known and does not change during program execution;
      - flags passed to `bpf_loop` are always zero.
      
      Inlining logic works as follows:
      
      - During execution simulation, the function `update_loop_inline_state`
        tracks the following information for each `bpf_loop` call
        instruction:
        - is the callback known and constant?
        - are the flags constant and zero?
      - The function `optimize_bpf_loop` increases the stack depth for
        functions where `bpf_loop` calls can be inlined and invokes
        `inline_bpf_loop` to apply the inlining. The additional stack space
        is used to spill registers R6, R7 and R8, which are used as the loop
        counter, the loop maximum bound and the callback context parameter.
      
      Measurements using `benchs/run_bench_bpf_loop.sh` inside QEMU/KVM on an
      i7-4710HQ CPU show a drop in latency from 14 ns/op to 2 ns/op. An example
      of an inlinable call shape follows this entry.
      Signed-off-by: Eduard Zingerman <eddyz87@gmail.com>
      Acked-by: Song Liu <songliubraving@fb.com>
      Link: https://lore.kernel.org/r/20220620235344.569325-4-eddyz87@gmail.com
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      1ade2371
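      A hedged BPF C example of a call that meets the inlining conditions above
      (static, constant callback and zero flags); the attach point is illustrative:

      // SPDX-License-Identifier: GPL-2.0
      /* Sketch: bpf_loop() with a known static callback and flags == 0,
       * the pattern the verifier can rewrite into a direct loop. */
      #include "vmlinux.h"
      #include <bpf/bpf_helpers.h>

      char LICENSE[] SEC("license") = "GPL";

      static long sum_cb(__u64 index, void *ctx)
      {
              *(__u64 *)ctx += index;
              return 0;               /* 0 = continue, 1 = break out */
      }

      SEC("tp/syscalls/sys_enter_getpid")     /* illustrative attach point */
      int loop_demo(void *ctx)
      {
              __u64 sum = 0;

              /* Known callback and zero flags: eligible for inlining. */
              bpf_loop(10, sum_cb, &sum, 0);
              return 0;
      }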
  18. 17 Jun 2022 (4 commits)
  19. 03 Jun 2022 (1 commit)
  20. 24 May 2022 (4 commits)
  21. 21 May 2022 (2 commits)