- 17 September 2022, 3 commits
-
-
Committed by Jiri Olsa

The dispatcher function is attached/detached to the trampoline by the dispatcher update function. At the same time it's available as an ftrace-attachable function.

After discussion [1] the proposed solution is to use compiler attributes to alter the bpf_dispatcher_##name##_func function:

- remove it from being instrumented with the __no_instrument_function__ attribute, so ftrace keeps no track of it
- but still generate 5 nop instructions with the patchable_function_entry(5) attribute, which are expected by bpf_arch_text_poke as used by the dispatcher update function

Enable the HAVE_DYNAMIC_FTRACE_NO_PATCHABLE option for x86, so __patchable_function_entries functions are not part of ftrace/mcount locations.

Add the attributes to the bpf_dispatcher_XXX function on x86_64 so it's kept out of ftrace locations and has a 5-byte nop generated at entry. These attributes need to be arch specific, as pointed out by Ilya Leoshkevich in [2]. The dispatcher image is generated only for the x86_64 arch, so the code can stay as is for other archs.

[1] https://lore.kernel.org/bpf/20220722110811.124515-1-jolsa@kernel.org/
[2] https://lore.kernel.org/bpf/969a14281a7791c334d476825863ee449964dd0c.camel@linux.ibm.com/

Suggested-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/bpf/20220903131154.420467-3-jolsa@kernel.org
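A minimal sketch of the attribute wiring described above; the macro name, its placement, and the example function are illustrative, not the exact kernel diff:

```
/* x86_64 only: emit 5 patchable nop bytes at entry for
 * bpf_arch_text_poke(), while keeping the function out of
 * ftrace/mcount instrumentation.
 */
#ifdef CONFIG_X86_64
#define BPF_DISPATCHER_ATTRIBUTES				\
	__attribute__((patchable_function_entry(5)))		\
	__attribute__((__no_instrument_function__))
#else
#define BPF_DISPATCHER_ATTRIBUTES
#endif

unsigned int BPF_DISPATCHER_ATTRIBUTES
bpf_dispatcher_example_func(const void *ctx, const struct bpf_insn *insnsi,
			    unsigned int (*bpf_func)(const void *,
						     const struct bpf_insn *))
{
	/* default body until the update function pokes in a trampoline */
	return bpf_func(ctx, insnsi);
}
```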
-
Committed by Peter Zijlstra (Intel)

x86 will shortly start using -fpatchable-function-entry for purposes other than ftrace; make sure the __patchable_function_entries section isn't merged into the mcount_loc section.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20220903131154.420467-2-jolsa@kernel.org
-
Committed by Yauheni Kaliuta

The full CAP_SYS_ADMIN requirement for blinding looks too strict nowadays. These days, given that unprivileged BPF is disabled by default, the main users of constant blinding come from the unprivileged path, in particular via the cBPF -> eBPF migration (e.g. old-style socket filters).

Signed-off-by: Yauheni Kaliuta <ykaliuta@redhat.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20220831090655.156434-1-ykaliuta@redhat.com
Link: https://lore.kernel.org/bpf/20220905090149.61221-1-ykaliuta@redhat.com
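A hedged sketch of the resulting gate; treat bpf_capable() here as the assumed relaxation of the old capable(CAP_SYS_ADMIN) check, and the surrounding predicates as condensed:

```
static inline bool bpf_jit_blinding_enabled(struct bpf_prog *prog)
{
	/* blinding is pointless without a JIT or with hardening off */
	if (!prog->jit_requested)
		return false;
	if (!bpf_jit_harden)
		return false;
	/* harden == 1: blind unprivileged only; CAP_BPF now also exempts */
	if (bpf_jit_harden == 1 && bpf_capable())
		return false;

	return true;
}
```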
-
- 15 September 2022, 1 commit
-
-
Committed by Dave Marchevsky

BPF_PTR_POISON was added in commit c0a5a21c ("bpf: Allow storing referenced kptr in map") to denote a bpf_func_proto btf_id which the verifier will replace with a dynamically-determined btf_id at verification time.

This patch adds verifier 'poison' functionality to BPF_PTR_POISON in order to prepare for expanded use of the value to poison ret- and arg-btf_id in ongoing work, namely the rbtree and linked list patchsets [0, 1]. Specifically, when the verifier checks helper calls, it assumes that a BPF_PTR_POISON'ed ret type will be replaced with a valid type before - or in lieu of - the default ret_btf_id logic. Similarly for arg btf_id. If a poisoned btf_id reaches the default handling block for either, consider this a verifier internal error and fail verification. Otherwise a helper with a poisoned btf_id but no verifier logic replacing the type will cause a crash as the invalid pointer is dereferenced.

Also move BPF_PTR_POISON to the existing include/linux/poison.h header and remove the unnecessary shift.

[0]: lore.kernel.org/bpf/20220830172759.4069786-1-davemarchevsky@fb.com
[1]: lore.kernel.org/bpf/20220904204145.3089-1-memxor@gmail.com

Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com>
Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/r/20220912154544.1398199-1-davemarchevsky@fb.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
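A sketch of the two pieces: the poison constant as it might sit in include/linux/poison.h, and the kind of default-handling check described above (the verbose() message and exact location are illustrative):

```
/* include/linux/poison.h */
#define BPF_PTR_POISON ((void *)(0xeB9FUL + POISON_POINTER_DELTA))

/* verifier: default handling of a helper's ret_btf_id */
if (fn->ret_btf_id == BPF_PTR_POISON) {
	/* no verifier logic replaced the poisoned btf_id: internal error */
	verbose(env, "verifier internal error: ret_btf_id poisoned but never replaced\n");
	return -EINVAL;
}
```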
-
- 11 September 2022, 3 commits
-
-
Committed by Dave Marchevsky

Verifier logic to confirm that a callback function returns 0 or 1 was added in commit 69c087ba ("bpf: Add bpf_for_each_map_elem() helper"). At the time, the callback return value was only used to continue or stop iteration.

In order to support callbacks with a broader return value range, such as those added in the rbtree series [0] and others, add a callback_ret_range to bpf_func_state. The verifier's helpers which set in_callback_fn will also set the new field, which the verifier will later use to check return value bounds.

Default to tnum_range(0, 0) instead of using tnum_unknown as a sentinel value, as the latter would prevent the valid range (0, U64_MAX) from being used. The previous global default of tnum_range(0, 1) is explicitly set for extant callback helpers. The change to the global default was made after discussion around this patch in the rbtree series [1]; the goal here is to make it more obvious that callback_ret_range should be explicitly set.

[0]: lore.kernel.org/bpf/20220830172759.4069786-1-davemarchevsky@fb.com/
[1]: lore.kernel.org/bpf/20220830172759.4069786-2-davemarchevsky@fb.com/

Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com>
Reviewed-by: Stanislav Fomichev <sdf@google.com>
Link: https://lore.kernel.org/r/20220908230716.2751723-1-davemarchevsky@fb.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
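A sketch of the new field and how a helper's callee-state setup might pin the bound; names follow the commit text, exact call sites are illustrative:

```
/* in struct bpf_func_state */
struct tnum callback_ret_range;	/* allowed R0 values at callback exit */

/* in a helper's set_callee_state callback, e.g. bpf_for_each_map_elem: */
callee->in_callback_fn = true;
callee->callback_ret_range = tnum_range(0, 1);

/* at callback exit, the verifier then checks: */
if (!tnum_in(callee->callback_ret_range, r0->var_off))
	return -EINVAL;	/* callback returned an out-of-range value */
```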
-
Committed by Daniel Xu

Support direct writes to nf_conn:mark from TC and XDP prog types. This is useful when applications want to store per-connection metadata. It is also particularly useful for applications that run both bpf and iptables/nftables, because the latter can trivially access this metadata.

One example use case would be if a bpf prog is responsible for advanced packet classification and iptables/nftables is later used for routing due to pre-existing/legacy code.

Signed-off-by: Daniel Xu <dxu@dxuuu.xyz>
Link: https://lore.kernel.org/r/ebca06dea366e3e7e861c12f375a548cc4c61108.1662568410.git.dxu@dxuuu.xyz
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
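A minimal TC-side sketch of the idea, assuming the conntrack lookup kfuncs (bpf_skb_ct_lookup/bpf_ct_release) and eliding the tuple extraction from the packet:

```
#include "vmlinux.h"
#include <bpf/bpf_helpers.h>

#define TC_ACT_OK 0

extern struct nf_conn *bpf_skb_ct_lookup(struct __sk_buff *skb_ctx,
		struct bpf_sock_tuple *bpf_tuple, __u32 tuple__sz,
		struct bpf_ct_opts *opts, __u32 opts__sz) __ksym;
extern void bpf_ct_release(struct nf_conn *ct) __ksym;

SEC("tc")
int classify(struct __sk_buff *skb)
{
	struct bpf_sock_tuple tuple = {};	/* fill from the packet in real code */
	struct bpf_ct_opts opts = { .netns_id = -1, .l4proto = 6 /* TCP */ };
	struct nf_conn *ct;

	ct = bpf_skb_ct_lookup(skb, &tuple, sizeof(tuple.ipv4), &opts, sizeof(opts));
	if (ct) {
		ct->mark = 42;	/* direct write; nftables can match it via connmark */
		bpf_ct_release(ct);
	}
	return TC_ACT_OK;
}

char LICENSE[] SEC("license") = "GPL";
```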
-
Committed by Daniel Xu

Add a corresponding unimplemented stub for when CONFIG_BPF_SYSCALL=n.

Signed-off-by: Daniel Xu <dxu@dxuuu.xyz>
Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/r/4021398e884433b1fef57a4d28361bb9fcf1bd05.1662568410.git.dxu@dxuuu.xyz
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
-
- 08 September 2022, 5 commits
-
-
Committed by Kumar Kartikeya Dwivedi

For a lot of use cases in future patches, we will want to modify the state of registers that are part of some same 'group' (e.g. same ref_obj_id). It won't just be limited to releasing reference state, but also setting a type flag dynamically based on certain actions, etc.

Hence, we need a way to easily pass a callback to the function that iterates over all registers in the current bpf_verifier_state in all frames up to (and including) the curframe. While in C++ we would be able to easily use a lambda to pass state and the callback together, sadly we aren't using C++ in the kernel. The next best thing to avoid defining a function for each case seems to be statement expressions in GNU C. The kernel already uses them heavily, hence they can be passed to the macro in the style of a lambda. The statement expression will then be substituted in the for-loop bodies. The variables __state and __reg are set to the current bpf_func_state and reg for each invocation of the expression inside the passed-in verifier state.

Then, convert mark_ptr_or_null_regs, clear_all_pkt_pointers, release_reference, find_good_pkt_pointers, and find_equal_scalars to use bpf_for_each_reg_in_vstate.

Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/r/20220904204145.3089-16-memxor@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
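A simplified sketch of the macro shape; the real version also walks stack slots and spilled registers, this shows only the statement-expression mechanism:

```
#define bpf_for_each_reg_in_vstate(__vst, __state, __reg, __expr)	\
	({								\
		int __i, __j;						\
		for (__i = 0; __i <= (__vst)->curframe; __i++) {	\
			__state = (__vst)->frame[__i];			\
			for (__j = 0; __j < MAX_BPF_REG; __j++) {	\
				__reg = &(__state)->regs[__j];		\
				(void)(__expr);				\
			}						\
		}							\
	})

/* usage in e.g. release_reference(), lambda-style: */
bpf_for_each_reg_in_vstate(env->cur_state, state, reg, ({
	if (reg->ref_obj_id == ref_obj_id)
		__mark_reg_unknown(env, reg);
}));
```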
-
Committed by Kumar Kartikeya Dwivedi

We need this helper to skip over special fields (bpf_spin_lock, bpf_timer, kptrs) while zeroing a map value. Use the same logic as copy_map_value, but memset instead of memcpy. Currently, the code zeroing map value memory does not have to deal with special fields, hence this is a prerequisite for introducing such support.

Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/r/20220904204145.3089-4-memxor@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
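A sketch of the walk, mirroring copy_map_value but with memset; the off_arr field names are assumptions based on that era's struct bpf_map layout:

```
static inline void zero_map_value(struct bpf_map *map, void *dst)
{
	u32 curr_off = 0;
	int i;

	for (i = 0; i < map->off_arr->cnt; i++) {
		u32 next_off = map->off_arr->field_off[i];

		/* zero up to, but not including, the special field */
		memset(dst + curr_off, 0, next_off - curr_off);
		curr_off = next_off + map->off_arr->field_sz[i];
	}
	/* zero the tail after the last special field */
	memset(dst + curr_off, 0, map->value_size - curr_off);
}
```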
-
Committed by Kumar Kartikeya Dwivedi

bpf_long_memcpy is used while copying to remote percpu regions from the BPF syscall and helpers, so that the copy is atomic at word size granularity. This might not be possible when copying map values hosting kptrs from or to percpu maps, as the alignment or size of the disjoint regions may not be a multiple of the word size. Hence, to avoid complicating the copy loop, only use bpf_long_memcpy when special fields are not present; otherwise use normal memcpy to copy the disjoint regions.

Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/r/20220904204145.3089-2-memxor@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
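A sketch of the resulting copy-path decision; the map_value_has_special_fields() predicate is a hypothetical name standing in for the actual checks:

```
/* percpu copy path, for each cpu */
if (!map_value_has_special_fields(map)) {
	/* word-atomic copy; value_size is rounded up to 8 bytes */
	bpf_long_memcpy(per_cpu_ptr(pptr, cpu), value,
			round_up(map->value_size, 8));
} else {
	/* kptr/timer/spin_lock carve-outs may be odd-sized or unaligned:
	 * fall back to plain memcpy over the disjoint regions
	 */
	copy_map_value(map, per_cpu_ptr(pptr, cpu), value);
}
```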
-
Committed by Benjamin Tissoires

For drivers (outside of networking), the incoming data is not statically defined in a struct. Most of the time the data buffer is kzalloc-ed, and thus we can not rely on eBPF and BTF to explore the data.

This commit allows returning arbitrary memory, previously allocated by the driver. An interesting extra point is that the kfunc can mark the exported memory region as read only or read/write.

So, when a kfunc is not returning a pointer to a struct but to a plain type, we can consider it a valid allocated memory region, assuming that:
- one of the arguments is either called rdonly_buf_size or rdwr_buf_size
- and this argument is a const from the caller's point of view

We can then use this parameter as the size of the allocated memory. The memory is either read-only or read-write based on the name of the size parameter.

Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Signed-off-by: Benjamin Tissoires <benjamin.tissoires@redhat.com>
Link: https://lore.kernel.org/r/20220906151303.2780789-7-benjamin.tissoires@redhat.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
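For example, the HID-BPF work that motivated this declares a kfunc along these lines; the signature is shown as an illustration of the naming convention, not necessarily the final form:

```
/* returns a pointer to up to rdwr_buf_size bytes of driver-owned memory;
 * the const size argument doubles as the size of the returned region, and
 * its name ("rdwr_buf_size") marks the region read-write for the program
 */
__u8 *hid_bpf_get_data(struct hid_bpf_ctx *ctx, unsigned int offset,
		       const size_t rdwr_buf_size);
```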
-
Committed by Benjamin Tissoires

btf_check_subprog_arg_match() was used twice in verifier.c:
- when checking for type mismatches between a (sub)prog declaration and BTF
- when checking the call of a subprog to see if the provided arguments are correct and valid

This is problematic when we check whether the first argument of a program (pointer to ctx) is correctly accessed: to be able to ensure we access valid memory in the ctx, the verifier assumes the pointer to the context is not null. This has the side effect of marking the program as accessing the entire context, even if the context is never dereferenced.

For example, by checking the context access with the current code, the following eBPF program would fail with -EINVAL if the ctx is set to null from userspace:

```
SEC("syscall")
int prog(struct my_ctx *args)
{
	return 0;
}
```

In that particular case, we do not want to actually check that the memory is correct while checking for BTF validity; we just want to ensure that the (sub)prog definition matches the BTF we have. So split btf_check_subprog_arg_match() in two, so we can actually check the memory used when in a call, and ignore that part when not.

Note that a further patch is in preparation to disentangle btf_check_func_arg_match() from these two purposes, so for now we just add a new hack around it by adding a boolean to this function.

Signed-off-by: Benjamin Tissoires <benjamin.tissoires@redhat.com>
Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/r/20220906151303.2780789-3-benjamin.tissoires@redhat.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
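A sketch of the split at the two call sites; the call-time function name is an assumption for illustration, only the first name appears in the commit text:

```
/* declaration time: BTF/prototype consistency only, no memory marking */
err = btf_check_subprog_arg_match(env, subprog, regs);

/* call time: additionally validates and marks the memory actually passed */
err = btf_check_subprog_call(env, subprog, regs);
```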
-
- 07 September 2022, 2 commits
-
-
Committed by Yonghong Song

Now, instead of the number of arguments, the number of registers holding argument values is stored in the trampoline. Update the description of the bpf_get_func_arg[_cnt]() helpers. Previous programs without struct arguments should continue to work as usual.

Signed-off-by: Yonghong Song <yhs@fb.com>
Link: https://lore.kernel.org/r/20220831152657.2078805-1-yhs@fb.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
-
Committed by Yonghong Song

Allow struct arguments in trampoline-based programs where the struct size is <= 16 bytes. In such cases, the argument will be put into up to 2 registers for the bpf, x86_64 and arm64 architectures.

To support arch-specific trampoline manipulation, add arg_flags carrying additional struct information about arguments to btf_func_model. Such information will be used in the arch-specific function arch_prepare_bpf_trampoline() to prepare argument access properly in the trampoline.

Signed-off-by: Yonghong Song <yhs@fb.com>
Link: https://lore.kernel.org/r/20220831152646.2078089-1-yhs@fb.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
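A sketch of the model extension; the flag name follows the series, the layout and register math are illustrative:

```
struct btf_func_model {
	u8 ret_size;
	u8 nr_args;
	u8 arg_size[MAX_BPF_FUNC_ARGS];
	u8 arg_flags[MAX_BPF_FUNC_ARGS];	/* e.g. BTF_FMODEL_STRUCT_ARG */
};

/* arch_prepare_bpf_trampoline() can then spill a <=16-byte struct arg,
 * which arrives in up to two registers, onto the trampoline stack:
 */
if (m->arg_flags[i] & BTF_FMODEL_STRUCT_ARG)
	nr_regs = (m->arg_size[i] + 7) / 8;	/* 1 or 2 registers */
```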
-
- 05 September 2022, 6 commits
-
-
Committed by Alexei Starovoitov

User space might be creating and destroying a lot of hash maps. Synchronous rcu_barrier-s in the destruction path of a hash map delay freeing of hash buckets and other map memory and may cause an artificial OOM situation under stress.

Optimize rcu_barrier usage between bpf hash map and bpf_mem_alloc:
- remove rcu_barrier from the hash map, since htab doesn't use call_rcu directly and there are no callbacks to wait for.
- bpf_mem_alloc has a call_rcu_in_progress flag that indicates pending callbacks. Use it to avoid barriers in the fast path.
- When barriers are needed, copy bpf_mem_alloc into a temp structure and wait for rcu barrier-s in the worker to let the rest of the hash map freeing proceed.

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20220902211058.60789-17-alexei.starovoitov@gmail.com
-
Committed by Alexei Starovoitov

Extend bpf_mem_alloc to cache a free list of fixed-size per-cpu allocations. Once such a cache is created, bpf_mem_cache_alloc() will return per-cpu objects. bpf_mem_cache_free() will free them back into the global per-cpu pool after observing an RCU grace period. The per-cpu flavor of bpf_mem_alloc is going to be used by per-cpu hash maps.

The free list cache consists of tuples { llist_node, per-cpu pointer }. Unlike alloc_percpu(), which returns a per-cpu pointer, bpf_mem_cache_alloc() returns a pointer to a per-cpu pointer, and bpf_mem_cache_free() expects to receive it back.

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20220902211058.60789-11-alexei.starovoitov@gmail.com
-
Committed by Alexei Starovoitov

Tracing BPF programs can attach to kprobe and fentry. Hence they run in an unknown context where calling plain kmalloc() might not be safe. Front-end kmalloc() with a minimal per-cpu cache of free elements, and refill this cache asynchronously from irq_work.

BPF programs always run with migration disabled. It's safe to allocate from the cache of the current cpu with irqs disabled. Freeing is always done into the bucket of the current cpu as well. irq_work trims extra free elements from buckets with kfree and refills them with kmalloc, so the global kmalloc logic takes care of freeing objects allocated by one cpu and freed on another.

struct bpf_mem_alloc supports two modes:
- When size != 0, create a kmem_cache and bpf_mem_cache for each cpu. This is the typical bpf hash map use case when all elements have equal size.
- When size == 0, allocate 11 bpf_mem_cache-s for each cpu, then rely on kmalloc/kfree. The max allocation size is 4096 in this case. This is the bpf_dynptr and bpf_kptr use case.

bpf_mem_alloc/bpf_mem_free are bpf-specific 'wrappers' of kmalloc/kfree. bpf_mem_cache_alloc/bpf_mem_cache_free are 'wrappers' of kmem_cache_alloc/kmem_cache_free. The allocators are NMI-safe from bpf programs only. They are not NMI-safe in general.

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20220902211058.60789-2-alexei.starovoitov@gmail.com
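A sketch of the API surface described above; signatures follow the series, and the third (percpu) init argument belongs to the per-cpu follow-up in this same set:

```
struct my_elem { u64 key; u64 val; };

static int example(struct bpf_mem_alloc *ma)
{
	void *p, *q;

	/* fixed-size mode: a kmem_cache + per-cpu free list per cpu */
	bpf_mem_alloc_init(ma, sizeof(struct my_elem), false);
	p = bpf_mem_cache_alloc(ma);
	bpf_mem_cache_free(ma, p);
	bpf_mem_alloc_destroy(ma);

	/* size == 0: 11 size-class caches, kmalloc-backed, up to 4096 bytes */
	bpf_mem_alloc_init(ma, 0, false);
	q = bpf_mem_alloc(ma, 64);
	bpf_mem_free(ma, q);
	bpf_mem_alloc_destroy(ma);
	return 0;
}
```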
-
Committed by Sean Anderson

Add a 1000BASE-KX interface mode. This is 1G backplane ethernet, as described in clause 70. Clause 73 autonegotiation is mandatory, and only full duplex operation is supported.

Although at the PMA level this interface mode is identical to 1000BASE-X, it uses a different form of in-band autonegotiation. This justifies a separate interface mode, since the interface mode (along with the MLO_AN_* autonegotiation mode) sets the type of autonegotiation which will be used on a link. This results in more than just electrical differences between the link modes.

With regard to 1000BASE-X, 1000BASE-KX holds a similar position to SGMII: same signaling, but different autonegotiation. PCS drivers (which typically handle in-band autonegotiation) may only support 1000BASE-X, and not 1000BASE-KX. Similarly, the phy mode is used to configure serdes phys with phy_set_mode_ext. Due to the different electrical standards (SFI or XFI vs Clause 70), they will likely want to use different configurations. Adding a phy interface mode for 1000BASE-KX helps simplify configuration in these areas.

Signed-off-by: Sean Anderson <sean.anderson@seco.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
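A sketch of how a MAC driver might use the new mode when configuring its serdes; the driver fields are illustrative, phy_set_mode_ext() is the existing generic-phy API:

```
switch (interface) {
case PHY_INTERFACE_MODE_1000BASEKX:
	/* clause 70 electricals, clause 73 autoneg */
	ret = phy_set_mode_ext(priv->serdes, PHY_MODE_ETHERNET,
			       PHY_INTERFACE_MODE_1000BASEKX);
	break;
case PHY_INTERFACE_MODE_1000BASEX:
	/* same signaling, but clause 37 in-band autoneg */
	ret = phy_set_mode_ext(priv->serdes, PHY_MODE_ETHERNET,
			       PHY_INTERFACE_MODE_1000BASEX);
	break;
default:
	return -EINVAL;
}
```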
-
Committed by Sean Anderson

This adds a function to update a CGR with new parameters. qman_create_cgr can almost be used for this (with flags=0), but it's not suitable because it also registers the callback function. The _safe variant was modeled on qman_cgr_delete_safe; however, we handle multiple arguments and a return value.

Signed-off-by: Sean Anderson <sean.anderson@seco.com>
Acked-by: Camelia Groza <camelia.groza@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
-
Committed by Maxime Chevallier

The Altera Triple Speed Ethernet has an SGMII/1000BaseX PCS that can be integrated in several ways. It can either be part of the TSE MAC's address space, accessed through 32-bit accesses on the mapped mdio device 0, or through a dedicated 16-bit register set.

This driver allows using the TSE PCS outside of the Altera TSE driver, since it can be used standalone by other MACs.

Signed-off-by: Maxime Chevallier <maxime.chevallier@bootlin.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
-
- 03 September 2022, 11 commits
-
-
Committed by Johannes Berg

Add an MLD address attribute to the BSS entries that the interface is currently associated with, to help userspace figure out what's going on.

Signed-off-by: Johannes Berg <johannes.berg@intel.com>
-
Committed by James Prestwood

Add a new extended feature bit signifying that the wireless hardware supports changing the MAC address while the underlying net_device is powered. Note that this has a different meaning from IFF_LIVE_ADDR_CHANGE, as additional restrictions might be imposed by the hardware, such as:

- No connection is active on this interface, carrier is off
- No scan is in progress
- No offchannel operations are in progress

Signed-off-by: James Prestwood <prestwoj@gmail.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
-
Committed by Gustavo A. R. Silva

We now have a cleaner way to keep compatibility with user-space (a.k.a. not breaking it) when we need to keep in place a one-element array (for its use in user-space) together with a flexible-array member (for its use in kernel-space), without making it hard to read at the source level. This is through the use of the new __DECLARE_FLEX_ARRAY() helper macro.

The size and memory layout of the structure is preserved after the changes. See below.

Before changes:

```
$ pahole -C ip_msfilter net/ipv4/igmp.o
struct ip_msfilter {
	union {
		struct {
			__be32     imsf_multiaddr_aux;    /*     0     4 */
			__be32     imsf_interface_aux;    /*     4     4 */
			__u32      imsf_fmode_aux;        /*     8     4 */
			__u32      imsf_numsrc_aux;       /*    12     4 */
			__be32     imsf_slist[1];         /*    16     4 */
		};                                        /*     0    20 */
		struct {
			__be32     imsf_multiaddr;        /*     0     4 */
			__be32     imsf_interface;        /*     4     4 */
			__u32      imsf_fmode;            /*     8     4 */
			__u32      imsf_numsrc;           /*    12     4 */
			__be32     imsf_slist_flex[0];    /*    16     0 */
		};                                        /*     0    16 */
	};                                                /*     0    20 */

	/* size: 20, cachelines: 1, members: 1 */
	/* last cacheline: 20 bytes */
};
```

After changes:

```
$ pahole -C ip_msfilter net/ipv4/igmp.o
struct ip_msfilter {
	__be32                     imsf_multiaddr;       /*     0     4 */
	__be32                     imsf_interface;       /*     4     4 */
	__u32                      imsf_fmode;           /*     8     4 */
	__u32                      imsf_numsrc;          /*    12     4 */
	union {
		__be32             imsf_slist[1];         /*    16     4 */
		struct {
			struct {} __empty_imsf_slist_flex; /*   16     0 */
			__be32     imsf_slist_flex[0];     /*   16     0 */
		};                                        /*    16     0 */
	};                                                /*    16     4 */

	/* size: 20, cachelines: 1, members: 5 */
	/* last cacheline: 20 bytes */
};
```

In the past, we had to duplicate the whole original structure within a union and update the names of all the members. Now, we just need to declare the flexible-array member to be used in kernel-space through the __DECLARE_FLEX_ARRAY() helper together with the one-element array, within a union. This makes the source code cleaner and easier to read.

Link: https://github.com/KSPP/linux/issues/193
Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org>
Reviewed-by: Kees Cook <keescook@chromium.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
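The source-level shape of the fix, as reflected in the pahole output above:

```
struct ip_msfilter {
	__be32 imsf_multiaddr;
	__be32 imsf_interface;
	__u32  imsf_fmode;
	__u32  imsf_numsrc;
	union {
		__be32 imsf_slist[1];				/* legacy user-space view */
		__DECLARE_FLEX_ARRAY(__be32, imsf_slist_flex);	/* kernel-space view */
	};
};
```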
-
Committed by Martin KaFai Lau

This patch changes bpf_getsockopt(SOL_IPV6) to reuse do_ipv6_getsockopt(). It removes the duplicated code from bpf_getsockopt(SOL_IPV6).

This also makes bpf_getsockopt(SOL_IPV6) support the same set of optnames as bpf_setsockopt(SOL_IPV6). In particular, this adds IPV6_AUTOFLOWLABEL support to bpf_getsockopt(SOL_IPV6).

ipv6 could be compiled as a module. Like how other code solved this with stubs in ipv6_stubs.h, this patch adds do_ipv6_getsockopt to the ipv6_bpf_stub.

Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Link: https://lore.kernel.org/r/20220902002931.2896218-1-kafai@fb.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
-
Committed by Martin KaFai Lau

This patch changes bpf_getsockopt(SOL_IP) to reuse do_ip_getsockopt() and removes the duplicated code.

Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Link: https://lore.kernel.org/r/20220902002925.2895416-1-kafai@fb.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
-
Committed by Martin KaFai Lau

This patch changes bpf_getsockopt(SOL_TCP) to reuse do_tcp_getsockopt(). It removes the duplicated code from bpf_getsockopt(SOL_TCP).

Before this patch, there were some optnames available to bpf_setsockopt(SOL_TCP) but missing in bpf_getsockopt(SOL_TCP), for example TCP_NODELAY, TCP_MAXSEG, TCP_KEEPIDLE, TCP_KEEPINTVL, and a few more. It surprises users from time to time. This patch automatically closes this gap without duplicating more code.

bpf_getsockopt(TCP_SAVED_SYN) does not free the saved_syn, so it stays in sol_tcp_sockopt(). For a string-name value like TCP_CONGESTION, bpf expects it to always be null terminated, so sol_tcp_sockopt() decrements optlen by one before calling do_tcp_getsockopt(), and the 'if (optlen < saved_optlen) memset(.., 0, ..);' in __bpf_getsockopt() will always do the null termination.

Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Link: https://lore.kernel.org/r/20220902002918.2894511-1-kafai@fb.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
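A sketch of the TCP_CONGESTION null-termination dance described above (control flow condensed and illustrative):

```
/* sol_tcp_sockopt(): reserve room for the terminating NUL */
if (optname == TCP_CONGESTION) {
	if (*optlen < 2)
		return -EINVAL;
	*optlen -= 1;	/* do_tcp_getsockopt() fills one byte less */
}

/* __bpf_getsockopt(): zero the tail whenever optlen shrank, which
 * also guarantees the returned name is null terminated
 */
if (optlen < saved_optlen)
	memset(optval + optlen, 0, saved_optlen - optlen);
```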
-
Committed by Martin KaFai Lau

This patch changes bpf_getsockopt(SOL_SOCKET) to reuse sk_getsockopt(). It removes all duplicated code from bpf_getsockopt(SOL_SOCKET).

Before this patch, there were some optnames available to bpf_setsockopt(SOL_SOCKET) but missing in bpf_getsockopt(SOL_SOCKET). It surprises users from time to time. For example: SO_REUSEADDR, SO_KEEPALIVE, SO_RCVLOWAT, and SO_MAX_PACING_RATE. This patch automatically closes this gap without duplicating more code. The only exception is SO_BINDTODEVICE, because it needs to acquire a blocking lock. Thus, SO_BINDTODEVICE is not supported.

Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Link: https://lore.kernel.org/r/20220902002912.2894040-1-kafai@fb.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
-
Committed by Martin KaFai Lau

Similar to the earlier patch that changes sk_getsockopt() to take the sockptr_t argument, this patch also changes do_ipv6_getsockopt() to take the sockptr_t argument, such that a later patch can make bpf_getsockopt(SOL_IPV6) reuse do_ipv6_getsockopt().

Note the change in ip6_mc_msfget(). This function returns an array of sockaddr_storage in optval. It is shared between ipv6_get_msfilter() and compat_ipv6_get_msfilter(). However, the sockaddr_storage is stored at a different offset of optval because of the difference between group_filter and compat_group_filter. Thus, a new 'ss_offset' argument is added to ip6_mc_msfget().

Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Link: https://lore.kernel.org/r/20220902002853.2892532-1-kafai@fb.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
-
Committed by Martin KaFai Lau

Similar to the earlier patch that changes sk_getsockopt() to take the sockptr_t argument, this patch also changes do_ip_getsockopt() to take the sockptr_t argument, such that a later patch can make bpf_getsockopt(SOL_IP) reuse do_ip_getsockopt().

Note the change in ip_mc_gsfget(). This function returns an array of sockaddr_storage in optval. It is shared between ip_get_mcast_msfilter() and compat_ip_get_mcast_msfilter(). However, the sockaddr_storage is stored at a different offset of optval because of the difference between group_filter and compat_group_filter. Thus, a new 'ss_offset' argument is added to ip_mc_gsfget().

Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Link: https://lore.kernel.org/r/20220902002828.2890585-1-kafai@fb.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
-
Committed by Martin KaFai Lau

This patch changes sk_getsockopt() to take the sockptr_t argument, such that it can be used by bpf_getsockopt(SOL_SOCKET) in a later patch.

security_socket_getpeersec_stream() is not changed. It stays with the __user ptr (optval.user and optlen.user) to avoid changes to other security hooks. bpf_getsockopt(SOL_SOCKET) also does not support SO_PEERSEC.

Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Link: https://lore.kernel.org/r/20220902002802.2888419-1-kafai@fb.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
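A sketch of the resulting signature and the wrapping at the two kinds of call sites; sockptr_t, USER_SOCKPTR() and KERNEL_SOCKPTR() are existing kernel primitives, the call sites are illustrative:

```
int sk_getsockopt(struct sock *sk, int level, int optname,
		  sockptr_t optval, sockptr_t optlen);

/* syscall path: user pointers */
err = sk_getsockopt(sk, level, optname,
		    USER_SOCKPTR(uoptval), USER_SOCKPTR(uoptlen));

/* bpf_getsockopt(SOL_SOCKET) path: kernel pointers */
err = sk_getsockopt(sk, level, optname,
		    KERNEL_SOCKPTR(koptval), KERNEL_SOCKPTR(&koptlen));
```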
-
Committed by Gal Pressman

When CONFIG_IEEE802154_NL802154_EXPERIMENTAL is disabled, NL802154_CMD_DEL_SEC_LEVEL is undefined, which results in a compilation error:

```
net/ieee802154/nl802154.c:2503:19: error: 'NL802154_CMD_DEL_SEC_LEVEL' undeclared here (not in a function); did you mean 'NL802154_CMD_SET_CCA_ED_LEVEL'?
 2503 |         .resv_start_op = NL802154_CMD_DEL_SEC_LEVEL + 1,
      |                          ^~~~~~~~~~~~~~~~~~~~~~~~~~
      |                          NL802154_CMD_SET_CCA_ED_LEVEL
```

Unhide the experimental commands; having them defined in an enum makes no difference.

Fixes: 9c5d03d3 ("genetlink: start to validate reserved header bytes")
Signed-off-by: Gal Pressman <gal@nvidia.com>
Acked-by: Stefan Schmidt <stefan@datenfreihafen.org>
Tested-by: Sudip Mukherjee <sudipm.mukherjee@gmail.com>
Link: https://lore.kernel.org/r/20220902030620.2737091-1-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
-
- 02 September 2022, 4 commits
-
-
Committed by Shmulik Ladkani

The existing 'bpf_skb_get_tunnel_key' helper extracts various tunnel parameters (id, ttl, tos, local and remote) but does not expose ip_tunnel_info's tun_flags to the BPF program.

It makes sense to expose tun_flags to the BPF program. Assume, for example, multiple GRE tunnels maintained on a single GRE interface in collect_md mode. The program expects origins to initiate over GRE, however different origins use different GRE characteristics (e.g. some prefer to use GRE checksum, some do not; some pass a GRE key, some do not, etc.).

A BPF program getting tun_flags can therefore remember the relevant flags (e.g. TUNNEL_CSUM, TUNNEL_SEQ...) for each initiating remote. In the reply path, the program can use 'bpf_skb_set_tunnel_key' in order to correctly reply to the remote, using similar characteristics, based on the stored tunnel flags.

Introduce the BPF_F_TUNINFO_FLAGS flag for bpf_skb_get_tunnel_key. If specified, 'bpf_tunnel_key->tunnel_flags' is set with the tun_flags. It was decided to use the existing, unused 'tunnel_ext' as the storage for the 'tunnel_flags' in order to avoid changing bpf_tunnel_key's layout.

Also, the following was considered during the design:

1. Convert the "interesting" internal TUNNEL_xxx flags back to BPF_F_yyy and place them into the new 'tunnel_flags' field. This has 2 drawbacks:
   - The BPF_F_yyy flags are from the *set_tunnel_key* enumeration space, e.g. BPF_F_ZERO_CSUM_TX. It is awkward that it is "returned" into tunnel_flags from a *get_tunnel_key* call.
   - Not all "interesting" TUNNEL_xxx flags can be mapped to existing BPF_F_yyy flags, and it doesn't make sense to create new BPF_F_yyy flags just for the purposes of the returned tunnel_flags.

2. Place key.tun_flags into 'tunnel_flags' but mask them, keeping only "interesting" flags. That's ok, but the drawback is that what's "interesting" for one use case might be limiting for other use cases.

Therefore it was decided to expose what's in key.tun_flags *as is*, which seems most flexible. The BPF user can just choose to ignore bits they're not interested in. The TUNNEL_xxx flags are also UAPI, so there is no harm exposing them back in the get_tunnel_key call.

Signed-off-by: Shmulik Ladkani <shmulik.ladkani@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20220831144010.174110-1-shmulik.ladkani@gmail.com
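A BPF-side sketch of the get-and-remember pattern; store_peer_flags() is a hypothetical program-local helper:

```
struct bpf_tunnel_key key = {};

if (bpf_skb_get_tunnel_key(skb, &key, sizeof(key), BPF_F_TUNINFO_FLAGS) < 0)
	return TC_ACT_SHOT;

/* tunnel_flags shares storage with the previously unused tunnel_ext */
if (key.tunnel_flags & TUNNEL_CSUM)
	/* remember: this remote wants GRE checksums on replies */
	store_peer_flags(key.remote_ipv4, key.tunnel_flags);
```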
-
Committed by Shung-Hsi Yu

Commit a657182a ("bpf: Don't use tnum_range on array range checking for poke descriptors") has shown that using tnum_range() as an argument to tnum_in() can lead to misleading code that looks like a tight bound check when in fact the actual allowed range is much wider. Document such behavior to warn against its usage in general, and suggest some scenarios where the result can be trusted.

Signed-off-by: Shung-Hsi Yu <shung-hsi.yu@suse.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/984b37f9fdf7ac36831d2137415a4a915744c1b6.1661462653.git.daniel@iogearbox.net
Link: https://www.openwall.com/lists/oss-security/2022/08/26/1
Link: https://lore.kernel.org/bpf/20220831031907.16133-3-shung-hsi.yu@suse.com
Link: https://lore.kernel.org/bpf/20220831031907.16133-2-shung-hsi.yu@suse.com
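A worked example of the pitfall being documented: a tnum can only track known/unknown bits, so a non-power-of-two range silently widens:

```
struct tnum r = tnum_range(0, 5);	/* => { .value = 0, .mask = 7 } */

/* 6 is outside [0, 5], yet containment still passes: */
bool leak = tnum_in(r, tnum_const(6));	/* true — r admits 0..7, not 0..5 */
```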
-
Committed by Jakub Kicinski

All callers are now gone.

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
-
Committed by Eric Dumazet

Add some documentation for netdev_tx_sent_queue() and netdev_tx_completed_queue(). Stating that netdev_tx_completed_queue() must be called once per TX completion round is apparently not obvious to everybody.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
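A sketch of the contract the new documentation spells out; the two APIs are real, the queue lookup and counters are illustrative driver code:

```
/* ndo_start_xmit(): account bytes as descriptors are queued */
netdev_tx_sent_queue(netdev_get_tx_queue(dev, qidx), skb->len);

/* TX completion handler: accumulate, then report ONCE per round */
unsigned int pkts = 0, bytes = 0;
/* ... walk completed descriptors: pkts++; bytes += len; ... */
netdev_tx_completed_queue(netdev_get_tx_queue(dev, qidx), pkts, bytes);
```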
-
- 01 September 2022, 3 commits
-
-
Committed by Eric Dumazet

Because per-host rate limiting has been proven problematic (side channel attacks can be based on it), per-host rate limiting of challenge acks ideally should be per netns and turned off by default.

This is a long-due followup of the following commits:
083ae308 ("tcp: enable per-socket rate limiting of all 'challenge acks'")
f2b2c582 ("tcp: mitigate ACK loops for connections as tcp_sock")
75ff39cc ("tcp: make challenge acks less predictable")

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Jason Baron <jbaron@akamai.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
-
Committed by Zhengchao Shao

The variable "other" in struct red_stats is not used. Remove it.

Signed-off-by: Zhengchao Shao <shaozhengchao@huawei.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
-
Committed by Jann Horn

anon_vma->degree tracks the combined number of child anon_vmas and VMAs that use the anon_vma as their ->anon_vma.

anon_vma_clone() then assumes that for any anon_vma attached to src->anon_vma_chain other than src->anon_vma, it is impossible for it to be a leaf node of the VMA tree, meaning that for such VMAs ->degree is elevated by 1 because of a child anon_vma, meaning that if ->degree equals 1 there are no VMAs that use the anon_vma as their ->anon_vma.

This assumption is wrong because the ->degree optimization leads to leaf nodes being abandoned on anon_vma_clone() - an existing anon_vma is reused and no new parent-child relationship is created. So it is possible to reuse an anon_vma for one VMA while it is still tied to another VMA.

This is an issue because is_mergeable_anon_vma() and its callers assume that if two VMAs have the same ->anon_vma, the list of anon_vmas attached to the VMAs is guaranteed to be the same. When this assumption is violated, vma_merge() can merge pages into a VMA that is not attached to the corresponding anon_vma, leading to dangling page->mapping pointers that will be dereferenced during rmap walks.

Fix it by separately tracking the number of child anon_vmas and the number of VMAs using the anon_vma as their ->anon_vma.

Fixes: 7a3ef208 ("mm: prevent endless growth of anon_vma hierarchy")
Cc: stable@kernel.org
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Jann Horn <jannh@google.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
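A sketch of the fix's bookkeeping split; the field names follow the fix, and the reuse condition is condensed for illustration:

```
struct anon_vma {
	/* ... */
	unsigned long num_children;	/* child anon_vmas */
	unsigned long num_active_vmas;	/* VMAs with ->anon_vma == this */
};

/* anon_vma_clone(): only reuse an anon_vma that has no children and is
 * not currently serving any VMA as its ->anon_vma
 */
if (!dst->anon_vma && src->anon_vma &&
    anon_vma->num_children < 2 && anon_vma->num_active_vmas == 0)
	dst->anon_vma = anon_vma;
```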
-
- 31 August 2022, 2 commits
-
-
Committed by Khalid Masum

This patch fixes two warnings generated by make docs. The functions fscache_use_cookie and fscache_unuse_cookie both have a parameter named cookie, but they are documented under the name "object" with an unclear description, which generates warnings when building the docs. This commit replaces the misdocumented parameter names with the correct ones while adding proper descriptions.

CC: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Khalid Masum <khalid.masum.92@gmail.com>
Signed-off-by: David Howells <dhowells@redhat.com>
Link: https://lore.kernel.org/r/20220521142446.4746-1-khalid.masum.92@gmail.com/ # v1
Link: https://lore.kernel.org/r/20220818040738.12036-1-khalid.masum.92@gmail.com/ # v2
Link: https://lore.kernel.org/r/880d7d25753fb326ee17ac08005952112fcf9bdb.1657360984.git.mchehab@kernel.org/ # Mauro's version
-
Committed by Mika Westerberg

Following what we already do for routers, extend this to XDomain connections as well. This will show in sysfs whether the link is in USB4 or Thunderbolt mode.

Signed-off-by: Mika Westerberg <mika.westerberg@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
-