提交 · 720e6a435194fb5237833a4a7ec6aa60a78964a8 · openeuler / Kernel

07 9月, 2022 1 次提交

bpf: Allow struct argument in trampoline based programs · 720e6a43

由 Yonghong Song 提交于 8月 31, 2022

Allow struct argument in trampoline based programs where
the struct size should be <= 16 bytes. In such cases, the argument
will be put into up to 2 registers for bpf, x86_64 and arm64
architectures.

To support arch-specific trampoline manipulation,
add arg_flags for additional struct information about arguments
in btf_func_model. Such information will be used in arch specific
function arch_prepare_bpf_trampoline() to prepare argument access
properly in trampoline.
Signed-off-by: NYonghong Song <yhs@fb.com>
Link: https://lore.kernel.org/r/20220831152646.2078089-1-yhs@fb.comSigned-off-by: NAlexei Starovoitov <ast@kernel.org>

720e6a43

26 8月, 2022 1 次提交

bpf: Introduce cgroup iter · d4ccaf58

由 Hao Luo 提交于 8月 24, 2022

Cgroup_iter is a type of bpf_iter. It walks over cgroups in four modes:

- walking a cgroup's descendants in pre-order.
- walking a cgroup's descendants in post-order.
- walking a cgroup's ancestors.
- process only the given cgroup.

When attaching cgroup_iter, one can set a cgroup to the iter_link
created from attaching. This cgroup is passed as a file descriptor
or cgroup id and serves as the starting point of the walk. If no
cgroup is specified, the starting point will be the root cgroup v2.

For walking descendants, one can specify the order: either pre-order or
post-order. For walking ancestors, the walk starts at the specified
cgroup and ends at the root.

One can also terminate the walk early by returning 1 from the iter
program.

Note that because walking cgroup hierarchy holds cgroup_mutex, the iter
program is called with cgroup_mutex held.

Currently only one session is supported, which means, depending on the
volume of data bpf program intends to send to user space, the number
of cgroups that can be walked is limited. For example, given the current
buffer size is 8 * PAGE_SIZE, if the program sends 64B data for each
cgroup, assuming PAGE_SIZE is 4kb, the total number of cgroups that can
be walked is 512. This is a limitation of cgroup_iter. If the output
data is larger than the kernel buffer size, after all data in the
kernel buffer is consumed by user space, the subsequent read() syscall
will signal EOPNOTSUPP. In order to work around, the user may have to
update their program to reduce the volume of data sent to output. For
example, skip some uninteresting cgroups. In future, we may extend
bpf_iter flags to allow customizing buffer size.
Acked-by: NYonghong Song <yhs@fb.com>
Acked-by: NTejun Heo <tj@kernel.org>
Signed-off-by: NHao Luo <haoluo@google.com>
Link: https://lore.kernel.org/r/20220824233117.1312810-2-haoluo@google.comSigned-off-by: NAlexei Starovoitov <ast@kernel.org>

d4ccaf58

24 8月, 2022 1 次提交

bpf: Use cgroup_{common,current}_func_proto in more hooks · bed89185

由 Stanislav Fomichev 提交于 8月 23, 2022

The following hooks are per-cgroup hooks but they are not
using cgroup_{common,current}_func_proto, fix it:

* BPF_PROG_TYPE_CGROUP_SKB (cg_skb)
* BPF_PROG_TYPE_CGROUP_SOCK_ADDR (cg_sock_addr)
* BPF_PROG_TYPE_CGROUP_SOCK (cg_sock)
* BPF_PROG_TYPE_LSM+BPF_LSM_CGROUP

Also:

* move common func_proto's into cgroup func_proto handlers
* make sure bpf_{g,s}et_retval are not accessible from recvmsg,
  getpeername and getsockname (return/errno is ignored in these
  places)
* as a side effect, expose get_current_pid_tgid, get_current_comm_proto,
  get_current_ancestor_cgroup_id, get_cgroup_classid to more cgroup
  hooks
Acked-by: NMartin KaFai Lau <kafai@fb.com>
Signed-off-by: NStanislav Fomichev <sdf@google.com>
Link: https://lore.kernel.org/r/20220823222555.523590-3-sdf@google.comSigned-off-by: NAlexei Starovoitov <ast@kernel.org>

bed89185

19 8月, 2022 1 次提交

bpf: net: Avoid sk_setsockopt() taking sk lock when called from bpf · 24426654

由 Martin KaFai Lau 提交于 8月 16, 2022

Most of the code in bpf_setsockopt(SOL_SOCKET) are duplicated from
the sk_setsockopt().  The number of supported optnames are
increasing ever and so as the duplicated code.

One issue in reusing sk_setsockopt() is that the bpf prog
has already acquired the sk lock.  This patch adds a
has_current_bpf_ctx() to tell if the sk_setsockopt() is called from
a bpf prog.  The bpf prog calling bpf_setsockopt() is either running
in_task() or in_serving_softirq().  Both cases have the current->bpf_ctx
initialized.  Thus, the has_current_bpf_ctx() only needs to
test !!current->bpf_ctx.

This patch also adds sockopt_{lock,release}_sock() helpers
for sk_setsockopt() to use.  These helpers will test
has_current_bpf_ctx() before acquiring/releasing the lock.  They are
in EXPORT_SYMBOL for the ipv6 module to use in a latter patch.

Note on the change in sock_setbindtodevice().  sockopt_lock_sock()
is done in sock_setbindtodevice() instead of doing the lock_sock
in sock_bindtoindex(..., lock_sk = true).
Reviewed-by: NStanislav Fomichev <sdf@google.com>
Signed-off-by: NMartin KaFai Lau <kafai@fb.com>
Link: https://lore.kernel.org/r/20220817061717.4175589-1-kafai@fb.comSigned-off-by: NAlexei Starovoitov <ast@kernel.org>

24426654

10 8月, 2022 1 次提交

bpf: Add BPF-helper for accessing CLOCK_TAI · c8996c98

由 Jesper Dangaard Brouer 提交于 8月 09, 2022

Commit 3dc6ffae ("timekeeping: Introduce fast accessor to clock tai")
introduced a fast and NMI-safe accessor for CLOCK_TAI. Especially in time
sensitive networks (TSN), where all nodes are synchronized by Precision Time
Protocol (PTP), it's helpful to have the possibility to generate timestamps
based on CLOCK_TAI instead of CLOCK_MONOTONIC. With a BPF helper for TAI in
place, it becomes very convenient to correlate activity across different
machines in the network.

Use cases for such a BPF helper include functionalities such as Tx launch
time (e.g. ETF and TAPRIO Qdiscs) and timestamping.

Note: CLOCK_TAI is nothing new per se, only the NMI-safe variant of it is.
Signed-off-by: NJesper Dangaard Brouer <brouer@redhat.com>
[Kurt: Wrote changelog and renamed helper]
Signed-off-by: NKurt Kanzenbach <kurt@linutronix.de>
Link: https://lore.kernel.org/r/20220809060803.5773-2-kurt@linutronix.deSigned-off-by: NAlexei Starovoitov <ast@kernel.org>

c8996c98

23 7月, 2022 2 次提交

bpf: Support bpf_trampoline on functions with IPMODIFY (e.g. livepatch) · 00963a2e

由 Song Liu 提交于 7月 19, 2022

When tracing a function with IPMODIFY ftrace_ops (livepatch), the bpf
trampoline must follow the instruction pointer saved on stack. This needs
extra handling for bpf trampolines with BPF_TRAMP_F_CALL_ORIG flag.

Implement bpf_tramp_ftrace_ops_func and use it for the ftrace_ops used
by BPF trampoline. This enables tracing functions with livepatch.

This also requires moving bpf trampoline to *_ftrace_direct_mult APIs.
Signed-off-by: NSong Liu <song@kernel.org>
Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/all/20220602193706.2607681-2-song@kernel.org/
Link: https://lore.kernel.org/bpf/20220720002126.803253-5-song@kernel.org

00963a2e

bpf, x64: Allow to use caller address from stack · 316cba62

由 Jiri Olsa 提交于 7月 19, 2022

Currently we call the original function by using the absolute address
given at the JIT generation. That's not usable when having trampoline
attached to multiple functions, or the target address changes dynamically
(in case of live patch). In such cases we need to take the return address
from the stack.

Adding support to retrieve the original function address from the stack
by adding new BPF_TRAMP_F_ORIG_STACK flag for arch_prepare_bpf_trampoline
function.

Basically we take the return address of the 'fentry' call:

   function + 0: call fentry    # stores 'function + 5' address on stack
   function + 5: ...

The 'function + 5' address will be used as the address for the
original function to call.
Signed-off-by: NJiri Olsa <jolsa@kernel.org>
Signed-off-by: NSong Liu <song@kernel.org>
Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20220720002126.803253-4-song@kernel.org

316cba62

22 7月, 2022 1 次提交

bpf: Switch to new kfunc flags infrastructure · a4703e31

由 Kumar Kartikeya Dwivedi 提交于 7月 21, 2022

Instead of populating multiple sets to indicate some attribute and then
researching the same BTF ID in them, prepare a single unified BTF set
which indicates whether a kfunc is allowed to be called, and also its
attributes if any at the same time. Now, only one call is needed to
perform the lookup for both kfunc availability and its attributes.
Signed-off-by: NKumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/r/20220721134245.2450-4-memxor@gmail.comSigned-off-by: NAlexei Starovoitov <ast@kernel.org>

a4703e31

21 7月, 2022 1 次提交

bpf: Fix bpf_trampoline_{,un}link_cgroup_shim ifdef guards · 9cb61fda

由 Stanislav Fomichev 提交于 7月 20, 2022

They were updated in kernel/bpf/trampoline.c to fix another build
issue. We should to do the same for include/linux/bpf.h header.

Fixes: 3908fcdd ("bpf: fix lsm_cgroup build errors on esoteric configs")
Reported-by: NStephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: NStanislav Fomichev <sdf@google.com>
Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
Acked-by: NMartin KaFai Lau <kafai@fb.com>
Link: https://lore.kernel.org/bpf/20220720155220.4087433-1-sdf@google.com

9cb61fda

13 7月, 2022 2 次提交

bpf, x86: fix freeing of not-finalized bpf_prog_pack · 1d5f82d9

由 Song Liu 提交于 7月 05, 2022

syzbot reported a few issues with bpf_prog_pack [1], [2]. This only happens
with multiple subprogs. In jit_subprogs(), we first call bpf_int_jit_compile()
on each sub program. And then, we call it on each sub program again. jit_data
is not freed in the first call of bpf_int_jit_compile(). Similarly we don't
call bpf_jit_binary_pack_finalize() in the first call of bpf_int_jit_compile().

If bpf_int_jit_compile() failed for one sub program, we will call
bpf_jit_binary_pack_finalize() for this sub program. However, we don't have a
chance to call it for other sub programs. Then we will hit "goto out_free" in
jit_subprogs(), and call bpf_jit_free on some subprograms that haven't got
bpf_jit_binary_pack_finalize() yet.

At this point, bpf_jit_binary_pack_free() is called and the whole 2MB page is
freed erroneously.

Fix this with a custom bpf_jit_free() for x86_64, which calls
bpf_jit_binary_pack_finalize() if necessary. Also, with custom
bpf_jit_free(), bpf_prog_aux->use_bpf_prog_pack is not needed any more,
remove it.

Fixes: 1022a549 ("bpf, x86_64: Use bpf_jit_binary_pack_alloc")
[1] https://syzkaller.appspot.com/bug?extid=2f649ec6d2eea1495a8f
[2] https://syzkaller.appspot.com/bug?extid=87f65c75f4a72db05445
Reported-by: syzbot+2f649ec6d2eea1495a8f@syzkaller.appspotmail.com
Reported-by: syzbot+87f65c75f4a72db05445@syzkaller.appspotmail.com
Signed-off-by: NSong Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20220706002612.4013790-1-song@kernel.orgSigned-off-by: NAlexei Starovoitov <ast@kernel.org>

1d5f82d9

bpf: reparent bpf maps on memcg offlining · 4201d9ab

由 Roman Gushchin 提交于 7月 11, 2022

The memory consumed by a bpf map is always accounted to the memory
cgroup of the process which created the map. The map can outlive
the memory cgroup if it's used by processes in other cgroups or
is pinned on bpffs. In this case the map pins the original cgroup
in the dying state.

For other types of objects (slab objects, non-slab kernel allocations,
percpu objects and recently LRU pages) there is a reparenting process
implemented: on cgroup offlining charged objects are getting
reassigned to the parent cgroup. Because all charges and statistics
are fully recursive it's a fairly cheap operation.

For efficiency and consistency with other types of objects, let's do
the same for bpf maps. Fortunately thanks to the objcg API, the
required changes are minimal.

Please, note that individual allocations (slabs, percpu and large
kmallocs) already have the reparenting mechanism. This commit adds
it to the saved map->memcg pointer by replacing it to map->objcg.
Because dying cgroups are not visible for a user and all charges are
recursive, this commit doesn't bring any behavior changes for a user.

v2:
  added a missing const qualifier
Signed-off-by: NRoman Gushchin <roman.gushchin@linux.dev>
Reviewed-by: NShakeel Butt <shakeelb@google.com>
Link: https://lore.kernel.org/r/20220711162827.184743-1-roman.gushchin@linux.devSigned-off-by: NAlexei Starovoitov <ast@kernel.org>

4201d9ab

30 6月, 2022 4 次提交

bpf: expose bpf_{g,s}etsockopt to lsm cgroup · 9113d7e4

由 Stanislav Fomichev 提交于 6月 28, 2022

I don't see how to make it nice without introducing btf id lists
for the hooks where these helpers are allowed. Some LSM hooks
work on the locked sockets, some are triggering early and
don't grab any locks, so have two lists for now:

1. LSM hooks which trigger under socket lock - minority of the hooks,
but ideal case for us, we can expose existing BTF-based helpers
2. LSM hooks which trigger without socket lock, but they trigger
early in the socket creation path where it should be safe to
do setsockopt without any locks
3. The rest are prohibited. I'm thinking that this use-case might
be a good gateway to sleeping lsm cgroup hooks in the future.
We can either expose lock/unlock operations (and add tracking
to the verifier) or have another set of bpf_setsockopt
wrapper that grab the locks and might sleep.
Reviewed-by: NMartin KaFai Lau <kafai@fb.com>
Signed-off-by: NStanislav Fomichev <sdf@google.com>
Link: https://lore.kernel.org/r/20220628174314.1216643-7-sdf@google.comSigned-off-by: NAlexei Starovoitov <ast@kernel.org>

9113d7e4

bpf: minimize number of allocated lsm slots per program · c0e19f2c

由 Stanislav Fomichev 提交于 6月 28, 2022

Previous patch adds 1:1 mapping between all 211 LSM hooks
and bpf_cgroup program array. Instead of reserving a slot per
possible hook, reserve 10 slots per cgroup for lsm programs.
Those slots are dynamically allocated on demand and reclaimed.

struct cgroup_bpf {
	struct bpf_prog_array *    effective[33];        /*     0   264 */
	/* --- cacheline 4 boundary (256 bytes) was 8 bytes ago --- */
	struct hlist_head          progs[33];            /*   264   264 */
	/* --- cacheline 8 boundary (512 bytes) was 16 bytes ago --- */
	u8                         flags[33];            /*   528    33 */

	/* XXX 7 bytes hole, try to pack */

	struct list_head           storages;             /*   568    16 */
	/* --- cacheline 9 boundary (576 bytes) was 8 bytes ago --- */
	struct bpf_prog_array *    inactive;             /*   584     8 */
	struct percpu_ref          refcnt;               /*   592    16 */
	struct work_struct         release_work;         /*   608    72 */

	/* size: 680, cachelines: 11, members: 7 */
	/* sum members: 673, holes: 1, sum holes: 7 */
	/* last cacheline: 40 bytes */
};
Reviewed-by: NMartin KaFai Lau <kafai@fb.com>
Signed-off-by: NStanislav Fomichev <sdf@google.com>
Link: https://lore.kernel.org/r/20220628174314.1216643-5-sdf@google.comSigned-off-by: NAlexei Starovoitov <ast@kernel.org>

c0e19f2c

bpf: per-cgroup lsm flavor · 69fd337a

由 Stanislav Fomichev 提交于 6月 28, 2022

Allow attaching to lsm hooks in the cgroup context.

Attaching to per-cgroup LSM works exactly like attaching
to other per-cgroup hooks. New BPF_LSM_CGROUP is added
to trigger new mode; the actual lsm hook we attach to is
signaled via existing attach_btf_id.

For the hooks that have 'struct socket' or 'struct sock' as its first
argument, we use the cgroup associated with that socket. For the rest,
we use 'current' cgroup (this is all on default hierarchy == v2 only).
Note that for some hooks that work on 'struct sock' we still
take the cgroup from 'current' because some of them work on the socket
that hasn't been properly initialized yet.

Behind the scenes, we allocate a shim program that is attached
to the trampoline and runs cgroup effective BPF programs array.
This shim has some rudimentary ref counting and can be shared
between several programs attaching to the same lsm hook from
different cgroups.

Note that this patch bloats cgroup size because we add 211
cgroup_bpf_attach_type(s) for simplicity sake. This will be
addressed in the subsequent patch.

Also note that we only add non-sleepable flavor for now. To enable
sleepable use-cases, bpf_prog_run_array_cg has to grab trace rcu,
shim programs have to be freed via trace rcu, cgroup_bpf.effective
should be also trace-rcu-managed + maybe some other changes that
I'm not aware of.
Reviewed-by: NMartin KaFai Lau <kafai@fb.com>
Signed-off-by: NStanislav Fomichev <sdf@google.com>
Link: https://lore.kernel.org/r/20220628174314.1216643-4-sdf@google.comSigned-off-by: NAlexei Starovoitov <ast@kernel.org>

69fd337a

bpf: add bpf_func_t and trampoline helpers · af3f4134

由 Stanislav Fomichev 提交于 6月 28, 2022

I'll be adding lsm cgroup specific helpers that grab
trampoline mutex.

No functional changes.
Reviewed-by: NMartin KaFai Lau <kafai@fb.com>
Signed-off-by: NStanislav Fomichev <sdf@google.com>
Link: https://lore.kernel.org/r/20220628174314.1216643-2-sdf@google.comSigned-off-by: NAlexei Starovoitov <ast@kernel.org>

af3f4134

21 6月, 2022 1 次提交

bpf: Inline calls to bpf_loop when callback is known · 1ade2371

由 Eduard Zingerman 提交于 6月 21, 2022

Calls to `bpf_loop` are replaced with direct loops to avoid
indirection. E.g. the following:

  bpf_loop(10, foo, NULL, 0);

Is replaced by equivalent of the following:

  for (int i = 0; i < 10; ++i)
    foo(i, NULL);

This transformation could be applied when:
- callback is known and does not change during program execution;
- flags passed to `bpf_loop` are always zero.

Inlining logic works as follows:

- During execution simulation function `update_loop_inline_state`
  tracks the following information for each `bpf_loop` call
  instruction:
  - is callback known and constant?
  - are flags constant and zero?
- Function `optimize_bpf_loop` increases stack depth for functions
  where `bpf_loop` calls can be inlined and invokes `inline_bpf_loop`
  to apply the inlining. The additional stack space is used to spill
  registers R6, R7 and R8. These registers are used as loop counter,
  loop maximal bound and callback context parameter;

Measurements using `benchs/run_bench_bpf_loop.sh` inside QEMU / KVM on
i7-4710HQ CPU show a drop in latency from 14 ns/op to 2 ns/op.
Signed-off-by: NEduard Zingerman <eddyz87@gmail.com>
Acked-by: NSong Liu <songliubraving@fb.com>
Link: https://lore.kernel.org/r/20220620235344.569325-4-eddyz87@gmail.comSigned-off-by: NAlexei Starovoitov <ast@kernel.org>

1ade2371

17 6月, 2022 4 次提交

bpf: Fix non-static bpf_func_proto struct definitions · dc368e1c

由 Joanne Koong 提交于 6月 16, 2022

This patch does two things:

1) Marks the dynptr bpf_func_proto structs that were added in [1]
   as static, as pointed out by the kernel test robot in [2].

2) There are some bpf_func_proto structs marked as extern which can
   instead be statically defined.

  [1] https://lore.kernel.org/bpf/20220523210712.3641569-1-joannelkoong@gmail.com/
  [2] https://lore.kernel.org/bpf/62ab89f2.Pko7sI08RAKdF8R6%25lkp@intel.com/Reported-by: Nkernel test robot <lkp@intel.com>
Signed-off-by: NJoanne Koong <joannelkoong@gmail.com>
Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20220616225407.1878436-1-joannelkoong@gmail.com

dc368e1c

bpf: Allow helpers to accept pointers with a fixed size · 508362ac

由 Maxim Mikityanskiy 提交于 6月 15, 2022

Before this commit, the BPF verifier required ARG_PTR_TO_MEM arguments
to be followed by ARG_CONST_SIZE holding the size of the memory region.
The helpers had to check that size in runtime.

There are cases where the size expected by a helper is a compile-time
constant. Checking it in runtime is an unnecessary overhead and waste of
BPF registers.

This commit allows helpers to accept pointers to memory without the
corresponding ARG_CONST_SIZE, given that they define the memory region
size in struct bpf_func_proto and use ARG_PTR_TO_FIXED_SIZE_MEM type.

arg_size is unionized with arg_btf_id to reduce the kernel image size,
and it's valid because they are used by different argument types.
Signed-off-by: NMaxim Mikityanskiy <maximmi@nvidia.com>
Reviewed-by: NTariq Toukan <tariqt@nvidia.com>
Link: https://lore.kernel.org/r/20220615134847.3753567-3-maximmi@nvidia.comSigned-off-by: NAlexei Starovoitov <ast@kernel.org>

508362ac

bpf: implement sleepable uprobes by chaining gps · 8c7dcb84

由 Delyan Kratunov 提交于 6月 14, 2022

uprobes work by raising a trap, setting a task flag from within the
interrupt handler, and processing the actual work for the uprobe on the
way back to userspace. As a result, uprobe handlers already execute in a
might_fault/_sleep context. The primary obstacle to sleepable bpf uprobe
programs is therefore on the bpf side.

Namely, the bpf_prog_array attached to the uprobe is protected by normal
rcu. In order for uprobe bpf programs to become sleepable, it has to be
protected by the tasks_trace rcu flavor instead (and kfree() called after
a corresponding grace period).

Therefore, the free path for bpf_prog_array now chains a tasks_trace and
normal grace periods one after the other.

Users who iterate under tasks_trace read section would
be safe, as would users who iterate under normal read sections (from
non-sleepable locations).

The downside is that the tasks_trace latency affects all perf_event-attached
bpf programs (and not just uprobe ones). This is deemed safe given the
possible attach rates for kprobe/uprobe/tp programs.

Separately, non-sleepable programs need access to dynamically sized
rcu-protected maps, so bpf_run_prog_array_sleepables now conditionally takes
an rcu read section, in addition to the overarching tasks_trace section.
Signed-off-by: NDelyan Kratunov <delyank@fb.com>
Link: https://lore.kernel.org/r/ce844d62a2fd0443b08c5ab02e95bc7149f9aeb1.1655248076.git.delyank@fb.comSigned-off-by: NAlexei Starovoitov <ast@kernel.org>

8c7dcb84

bpf: move bpf_prog to bpf.h · d687f621

由 Delyan Kratunov 提交于 6月 14, 2022

In order to add a version of bpf_prog_run_array which accesses the
bpf_prog->aux member, bpf_prog needs to be more than a forward
declaration inside bpf.h.

Given that filter.h already includes bpf.h, this merely reorders
the type declarations for filter.h users. bpf.h users now have access to
bpf_prog internals.
Signed-off-by: NDelyan Kratunov <delyank@fb.com>
Link: https://lore.kernel.org/r/3ed7824e3948f22d84583649ccac0ff0d38b6b58.1655248076.git.delyank@fb.comSigned-off-by: NAlexei Starovoitov <ast@kernel.org>

d687f621

03 6月, 2022 1 次提交

bpf, sockmap: Fix sk->sk_forward_alloc warn_on in sk_stream_kill_queues · d8616ee2

由 Wang Yufen 提交于 5月 24, 2022

During TCP sockmap redirect pressure test, the following warning is triggered:

WARNING: CPU: 3 PID: 2145 at net/core/stream.c:205 sk_stream_kill_queues+0xbc/0xd0
CPU: 3 PID: 2145 Comm: iperf Kdump: loaded Tainted: G        W         5.10.0+ #9
Call Trace:
 inet_csk_destroy_sock+0x55/0x110
 inet_csk_listen_stop+0xbb/0x380
 tcp_close+0x41b/0x480
 inet_release+0x42/0x80
 __sock_release+0x3d/0xa0
 sock_close+0x11/0x20
 __fput+0x9d/0x240
 task_work_run+0x62/0x90
 exit_to_user_mode_prepare+0x110/0x120
 syscall_exit_to_user_mode+0x27/0x190
 entry_SYSCALL_64_after_hwframe+0x44/0xa9

The reason we observed is that:

When the listener is closing, a connection may have completed the three-way
handshake but not accepted, and the client has sent some packets. The child
sks in accept queue release by inet_child_forget()->inet_csk_destroy_sock(),
but psocks of child sks have not released.

To fix, add sock_map_destroy to release psocks.
Signed-off-by: NWang Yufen <wangyufen@huawei.com>
Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
Signed-off-by: NAndrii Nakryiko <andrii@kernel.org>
Acked-by: NJakub Sitnicki <jakub@cloudflare.com>
Acked-by: NJohn Fastabend <john.fastabend@gmail.com>
Link: https://lore.kernel.org/bpf/20220524075311.649153-1-wangyufen@huawei.com

d8616ee2

24 5月, 2022 4 次提交

bpf: Add dynptr data slices · 34d4ef57

由 Joanne Koong 提交于 5月 23, 2022

This patch adds a new helper function

void *bpf_dynptr_data(struct bpf_dynptr *ptr, u32 offset, u32 len);

which returns a pointer to the underlying data of a dynptr. *len*
must be a statically known value. The bpf program may access the returned
data slice as a normal buffer (eg can do direct reads and writes), since
the verifier associates the length with the returned pointer, and
enforces that no out of bounds accesses occur.
Signed-off-by: NJoanne Koong <joannelkoong@gmail.com>
Signed-off-by: NAndrii Nakryiko <andrii@kernel.org>
Acked-by: NYonghong Song <yhs@fb.com>
Link: https://lore.kernel.org/bpf/20220523210712.3641569-6-joannelkoong@gmail.com

34d4ef57

bpf: Dynptr support for ring buffers · bc34dee6

由 Joanne Koong 提交于 5月 23, 2022

Currently, our only way of writing dynamically-sized data into a ring
buffer is through bpf_ringbuf_output but this incurs an extra memcpy
cost. bpf_ringbuf_reserve + bpf_ringbuf_commit avoids this extra
memcpy, but it can only safely support reservation sizes that are
statically known since the verifier cannot guarantee that the bpf
program won’t access memory outside the reserved space.

The bpf_dynptr abstraction allows for dynamically-sized ring buffer
reservations without the extra memcpy.

There are 3 new APIs:

long bpf_ringbuf_reserve_dynptr(void *ringbuf, u32 size, u64 flags, struct bpf_dynptr *ptr);
void bpf_ringbuf_submit_dynptr(struct bpf_dynptr *ptr, u64 flags);
void bpf_ringbuf_discard_dynptr(struct bpf_dynptr *ptr, u64 flags);

These closely follow the functionalities of the original ringbuf APIs.
For example, all ringbuffer dynptrs that have been reserved must be
either submitted or discarded before the program exits.
Signed-off-by: NJoanne Koong <joannelkoong@gmail.com>
Signed-off-by: NAndrii Nakryiko <andrii@kernel.org>
Acked-by: NAndrii Nakryiko <andrii@kernel.org>
Acked-by: NDavid Vernet <void@manifault.com>
Link: https://lore.kernel.org/bpf/20220523210712.3641569-4-joannelkoong@gmail.com

bc34dee6

bpf: Add verifier support for dynptrs · 97e03f52

由 Joanne Koong 提交于 5月 23, 2022

This patch adds the bulk of the verifier work for supporting dynamic
pointers (dynptrs) in bpf.

A bpf_dynptr is opaque to the bpf program. It is a 16-byte structure
defined internally as:

struct bpf_dynptr_kern {
    void *data;
    u32 size;
    u32 offset;
} __aligned(8);

The upper 8 bits of *size* is reserved (it contains extra metadata about
read-only status and dynptr type). Consequently, a dynptr only supports
memory less than 16 MB.

There are different types of dynptrs (eg malloc, ringbuf, ...). In this
patchset, the most basic one, dynptrs to a bpf program's local memory,
is added. For now only local memory that is of reg type PTR_TO_MAP_VALUE
is supported.

In the verifier, dynptr state information will be tracked in stack
slots. When the program passes in an uninitialized dynptr
(ARG_PTR_TO_DYNPTR | MEM_UNINIT), the stack slots corresponding
to the frame pointer where the dynptr resides at are marked
STACK_DYNPTR. For helper functions that take in initialized dynptrs (eg
bpf_dynptr_read + bpf_dynptr_write which are added later in this
patchset), the verifier enforces that the dynptr has been initialized
properly by checking that their corresponding stack slots have been
marked as STACK_DYNPTR.

The 6th patch in this patchset adds test cases that the verifier should
successfully reject, such as for example attempting to use a dynptr
after doing a direct write into it inside the bpf program.
Signed-off-by: NJoanne Koong <joannelkoong@gmail.com>
Signed-off-by: NAndrii Nakryiko <andrii@kernel.org>
Acked-by: NAndrii Nakryiko <andrii@kernel.org>
Acked-by: NDavid Vernet <void@manifault.com>
Link: https://lore.kernel.org/bpf/20220523210712.3641569-2-joannelkoong@gmail.com

97e03f52

bpf: Introduce bpf_arch_text_invalidate for bpf_prog_pack · fe736565

由 Song Liu 提交于 5月 20, 2022

Introduce bpf_arch_text_invalidate and use it to fill unused part of the
bpf_prog_pack with illegal instructions when a BPF program is freed.

Fixes: 57631054 ("bpf: Introduce bpf_prog_pack allocator")
Fixes: 33c98058 ("bpf: Introduce bpf_jit_binary_pack_[alloc|finalize|free]")
Reported-by: NLinus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: NSong Liu <song@kernel.org>
Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20220520235758.1858153-4-song@kernel.org

fe736565

21 5月, 2022 2 次提交

selftests/bpf: Add missing trampoline program type to trampoline_count test · b23316aa

由 Yuntao Wang 提交于 5月 19, 2022

Currently the trampoline_count test doesn't include any fmod_ret bpf
programs, fix it to make the test cover all possible trampoline program
types.

Since fmod_ret bpf programs can't be attached to __set_task_comm function,
as it's neither whitelisted for error injection nor a security hook, change
it to bpf_modify_return_test.

This patch also does some other cleanups such as removing duplicate code,
dropping inconsistent comments, etc.
Signed-off-by: NYuntao Wang <ytcoode@gmail.com>
Signed-off-by: NAndrii Nakryiko <andrii@kernel.org>
Acked-by: NYonghong Song <yhs@fb.com>
Link: https://lore.kernel.org/bpf/20220519150610.601313-1-ytcoode@gmail.com

b23316aa

bpf: Add bpf_skc_to_mptcp_sock_proto · 3bc253c2

由 Geliang Tang 提交于 5月 19, 2022

This patch implements a new struct bpf_func_proto, named
bpf_skc_to_mptcp_sock_proto. Define a new bpf_id BTF_SOCK_TYPE_MPTCP,
and a new helper bpf_skc_to_mptcp_sock(), which invokes another new
helper bpf_mptcp_sock_from_subflow() in net/mptcp/bpf.c to get struct
mptcp_sock from a given subflow socket.

v2: Emit BTF type, add func_id checks in verifier.c and bpf_trace.c,
remove build check for CONFIG_BPF_JIT
v5: Drop EXPORT_SYMBOL (Martin)
Co-developed-by: NNicolas Rybowski <nicolas.rybowski@tessares.net>
Co-developed-by: NMatthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: NNicolas Rybowski <nicolas.rybowski@tessares.net>
Signed-off-by: NMatthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: NGeliang Tang <geliang.tang@suse.com>
Signed-off-by: NMat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: NAndrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20220519233016.105670-2-mathew.j.martineau@linux.intel.com

3bc253c2

14 5月, 2022 1 次提交

bpf: Add MEM_UNINIT as a bpf_type_flag · 16d1e00c

由 Joanne Koong 提交于 5月 09, 2022

Instead of having uninitialized versions of arguments as separate
bpf_arg_types (eg ARG_PTR_TO_UNINIT_MEM as the uninitialized version
of ARG_PTR_TO_MEM), we can instead use MEM_UNINIT as a bpf_type_flag
modifier to denote that the argument is uninitialized.

Doing so cleans up some of the logic in the verifier. We no longer
need to do two checks against an argument type (eg "if
(base_type(arg_type) == ARG_PTR_TO_MEM || base_type(arg_type) ==
ARG_PTR_TO_UNINIT_MEM)"), since uninitialized and initialized
versions of the same argument type will now share the same base type.

In the near future, MEM_UNINIT will be used by dynptr helper functions
as well.
Signed-off-by: NJoanne Koong <joannelkoong@gmail.com>
Acked-by: NAndrii Nakryiko <andrii@kernel.org>
Acked-by: NDavid Vernet <void@manifault.com>
Link: https://lore.kernel.org/r/20220509224257.3222614-2-joannelkoong@gmail.comSigned-off-by: NAlexei Starovoitov <ast@kernel.org>

16d1e00c

12 5月, 2022 2 次提交

bpf: add bpf_map_lookup_percpu_elem for percpu map · 07343110

由 Feng Zhou 提交于 5月 11, 2022

Add new ebpf helpers bpf_map_lookup_percpu_elem.

The implementation method is relatively simple, refer to the implementation
method of map_lookup_elem of percpu map, increase the parameters of cpu, and
obtain it according to the specified cpu.
Signed-off-by: NFeng Zhou <zhoufeng.zf@bytedance.com>
Link: https://lore.kernel.org/r/20220511093854.411-2-zhoufeng.zf@bytedance.comSigned-off-by: NAlexei Starovoitov <ast@kernel.org>

07343110

bpf: Fix sparse warning for bpf_kptr_xchg_proto · 5b74c690

由 Kumar Kartikeya Dwivedi 提交于 5月 12, 2022

Kernel Test Robot complained about missing static storage class
annotation for bpf_kptr_xchg_proto variable.

sparse: symbol 'bpf_kptr_xchg_proto' was not declared. Should it be static?

This caused by missing extern definition in the header. Add it to
suppress the sparse warning.

Fixes: c0a5a21c ("bpf: Allow storing referenced kptr in map")
Reported-by: Nkernel test robot <lkp@intel.com>
Signed-off-by: NKumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/r/20220511194654.765705-2-memxor@gmail.comSigned-off-by: NAlexei Starovoitov <ast@kernel.org>

5b74c690

11 5月, 2022 4 次提交

bpf, x86: Attach a cookie to fentry/fexit/fmod_ret/lsm. · 2fcc8241

由 Kui-Feng Lee 提交于 5月 10, 2022

Pass a cookie along with BPF_LINK_CREATE requests.

Add a bpf_cookie field to struct bpf_tracing_link to attach a cookie.
The cookie of a bpf_tracing_link is available by calling
bpf_get_attach_cookie when running the BPF program of the attached
link.

The value of a cookie will be set at bpf_tramp_run_ctx by the
trampoline of the link.
Signed-off-by: NKui-Feng Lee <kuifeng@fb.com>
Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
Signed-off-by: NAndrii Nakryiko <andrii@kernel.org>
Acked-by: NAndrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20220510205923.3206889-4-kuifeng@fb.com

2fcc8241

bpf, x86: Create bpf_tramp_run_ctx on the caller thread's stack · e384c7b7

由 Kui-Feng Lee 提交于 5月 10, 2022

BPF trampolines will create a bpf_tramp_run_ctx, a bpf_run_ctx, on
stacks and set/reset the current bpf_run_ctx before/after calling a
bpf_prog.
Signed-off-by: NKui-Feng Lee <kuifeng@fb.com>
Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
Signed-off-by: NAndrii Nakryiko <andrii@kernel.org>
Acked-by: NAndrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20220510205923.3206889-3-kuifeng@fb.com

e384c7b7

bpf, x86: Generate trampolines from bpf_tramp_links · f7e0beaf

由 Kui-Feng Lee 提交于 5月 10, 2022

Replace struct bpf_tramp_progs with struct bpf_tramp_links to collect
struct bpf_tramp_link(s) for a trampoline.  struct bpf_tramp_link
extends bpf_link to act as a linked list node.

arch_prepare_bpf_trampoline() accepts a struct bpf_tramp_links to
collects all bpf_tramp_link(s) that a trampoline should call.

Change BPF trampoline and bpf_struct_ops to pass bpf_tramp_links
instead of bpf_tramp_progs.
Signed-off-by: NKui-Feng Lee <kuifeng@fb.com>
Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
Signed-off-by: NAndrii Nakryiko <andrii@kernel.org>
Acked-by: NAndrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20220510205923.3206889-2-kuifeng@fb.com

f7e0beaf

bpf: Add bpf_link iterator · 9f883612

由 Dmitrii Dolgov 提交于 5月 10, 2022

Implement bpf_link iterator to traverse links via bpf_seq_file
operations. The changeset is mostly shamelessly copied from
commit a228a64f ("bpf: Add bpf_prog iterator")
Signed-off-by: NDmitrii Dolgov <9erthalion6@gmail.com>
Acked-by: NYonghong Song <yhs@fb.com>
Link: https://lore.kernel.org/r/20220510155233.9815-2-9erthalion6@gmail.comSigned-off-by: NAlexei Starovoitov <ast@kernel.org>

9f883612

28 4月, 2022 1 次提交

x86/speculation: Add missing prototype for unpriv_ebpf_notify() · 2147c438

由 Josh Poimboeuf 提交于 4月 25, 2022

Fix the following warnings seen with "make W=1":

  kernel/sysctl.c:183:13: warning: no previous prototype for ‘unpriv_ebpf_notify’ [-Wmissing-prototypes]
    183 | void __weak unpriv_ebpf_notify(int new_state)
        |             ^~~~~~~~~~~~~~~~~~

  arch/x86/kernel/cpu/bugs.c:659:6: warning: no previous prototype for ‘unpriv_ebpf_notify’ [-Wmissing-prototypes]
    659 | void unpriv_ebpf_notify(int new_state)
        |      ^~~~~~~~~~~~~~~~~~

Fixes: 44a3918c ("x86/speculation: Include unprivileged eBPF status in Spectre v2 mitigation reporting")
Reported-by: Nkernel test robot <lkp@intel.com>
Signed-off-by: NJosh Poimboeuf <jpoimboe@redhat.com>
Signed-off-by: NBorislav Petkov <bp@suse.de>
Link: https://lore.kernel.org/r/5689d065f739602ececaee1e05e68b8644009608.1650930000.git.jpoimboe@redhat.com

2147c438

27 4月, 2022 1 次提交

bpf: Compute map_btf_id during build time · c317ab71

由 Menglong Dong 提交于 4月 25, 2022

For now, the field 'map_btf_id' in 'struct bpf_map_ops' for all map
types are computed during vmlinux-btf init:

  btf_parse_vmlinux() -> btf_vmlinux_map_ids_init()

It will lookup the btf_type according to the 'map_btf_name' field in
'struct bpf_map_ops'. This process can be done during build time,
thanks to Jiri's resolve_btfids.

selftest of map_ptr has passed:

  $96 map_ptr:OK
  Summary: 1/0 PASSED, 0 SKIPPED, 0 FAILED
Reported-by: Nkernel test robot <lkp@intel.com>
Signed-off-by: NMenglong Dong <imagedong@tencent.com>
Signed-off-by: NAlexei Starovoitov <ast@kernel.org>

c317ab71

26 4月, 2022 4 次提交

bpf: Make BTF type match stricter for release arguments · 2ab3b380

由 Kumar Kartikeya Dwivedi 提交于 4月 25, 2022

The current of behavior of btf_struct_ids_match for release arguments is
that when type match fails, it retries with first member type again
(recursively). Since the offset is already 0, this is akin to just
casting the pointer in normal C, since if type matches it was just
embedded inside parent sturct as an object. However, we want to reject
cases for release function type matching, be it kfunc or BPF helpers.

An example is the following:

struct foo {
	struct bar b;
};

struct foo *v = acq_foo();
rel_bar(&v->b); // btf_struct_ids_match fails btf_types_are_same, then
		// retries with first member type and succeeds, while
		// it should fail.

Hence, don't walk the struct and only rely on btf_types_are_same for
strict mode. All users of strict mode must be dealing with zero offset
anyway, since otherwise they would want the struct to be walked.
Signed-off-by: NKumar Kartikeya Dwivedi <memxor@gmail.com>
Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20220424214901.2743946-10-memxor@gmail.com

2ab3b380

bpf: Wire up freeing of referenced kptr · 14a324f6

由 Kumar Kartikeya Dwivedi 提交于 4月 25, 2022

A destructor kfunc can be defined as void func(type *), where type may
be void or any other pointer type as per convenience.

In this patch, we ensure that the type is sane and capture the function
pointer into off_desc of ptr_off_tab for the specific pointer offset,
with the invariant that the dtor pointer is always set when 'kptr_ref'
tag is applied to the pointer's pointee type, which is indicated by the
flag BPF_MAP_VALUE_OFF_F_REF.

Note that only BTF IDs whose destructor kfunc is registered, thus become
the allowed BTF IDs for embedding as referenced kptr. Hence it serves
the purpose of finding dtor kfunc BTF ID, as well acting as a check
against the whitelist of allowed BTF IDs for this purpose.

Finally, wire up the actual freeing of the referenced pointer if any at
all available offsets, so that no references are leaked after the BPF
map goes away and the BPF program previously moved the ownership a
referenced pointer into it.

The behavior is similar to BPF timers, where bpf_map_{update,delete}_elem
will free any existing referenced kptr. The same case is with LRU map's
bpf_lru_push_free/htab_lru_push_free functions, which are extended to
reset unreferenced and free referenced kptr.

Note that unlike BPF timers, kptr is not reset or freed when map uref
drops to zero.
Signed-off-by: NKumar Kartikeya Dwivedi <memxor@gmail.com>
Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20220424214901.2743946-8-memxor@gmail.com

14a324f6

bpf: Adapt copy_map_value for multiple offset case · 4d7d7f69

由 Kumar Kartikeya Dwivedi 提交于 4月 25, 2022

Since now there might be at most 10 offsets that need handling in
copy_map_value, the manual shuffling and special case is no longer going
to work. Hence, let's generalise the copy_map_value function by using
a sorted array of offsets to skip regions that must be avoided while
copying into and out of a map value.

When the map is created, we populate the offset array in struct map,
Then, copy_map_value uses this sorted offset array is used to memcpy
while skipping timer, spin lock, and kptr. The array is allocated as
in most cases none of these special fields would be present in map
value, hence we can save on space for the common case by not embedding
the entire object inside bpf_map struct.
Signed-off-by: NKumar Kartikeya Dwivedi <memxor@gmail.com>
Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20220424214901.2743946-6-memxor@gmail.com

4d7d7f69

bpf: Prevent escaping of kptr loaded from maps · 6efe152d

由 Kumar Kartikeya Dwivedi 提交于 4月 25, 2022

While we can guarantee that even for unreferenced kptr, the object
pointer points to being freed etc. can be handled by the verifier's
exception handling (normal load patching to PROBE_MEM loads), we still
cannot allow the user to pass these pointers to BPF helpers and kfunc,
because the same exception handling won't be done for accesses inside
the kernel. The same is true if a referenced pointer is loaded using
normal load instruction. Since the reference is not guaranteed to be
held while the pointer is used, it must be marked as untrusted.

Hence introduce a new type flag, PTR_UNTRUSTED, which is used to mark
all registers loading unreferenced and referenced kptr from BPF maps,
and ensure they can never escape the BPF program and into the kernel by
way of calling stable/unstable helpers.

In check_ptr_to_btf_access, the !type_may_be_null check to reject type
flags is still correct, as apart from PTR_MAYBE_NULL, only MEM_USER,
MEM_PERCPU, and PTR_UNTRUSTED may be set for PTR_TO_BTF_ID. The first
two are checked inside the function and rejected using a proper error
message, but we still want to allow dereference of untrusted case.

Also, we make sure to inherit PTR_UNTRUSTED when chain of pointers are
walked, so that this flag is never dropped once it has been set on a
PTR_TO_BTF_ID (i.e. trusted to untrusted transition can only be in one
direction).

In convert_ctx_accesses, extend the switch case to consider untrusted
PTR_TO_BTF_ID in addition to normal PTR_TO_BTF_ID for PROBE_MEM
conversion for BPF_LDX.
Signed-off-by: NKumar Kartikeya Dwivedi <memxor@gmail.com>
Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20220424214901.2743946-5-memxor@gmail.com

6efe152d

openeuler / Kernel 1 年多 前同步成功

openeuler / Kernel
1 年多前同步成功