Commit 2843ba2e, authored by David S. Miller

Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next

Alexei Starovoitov says:

====================
pull-request: bpf-next 2019-04-22

The following pull-request contains BPF updates for your *net-next* tree.

The main changes are:

1) allow stack/queue helpers from more bpf program types, from Alban.

2) allow parallel verification of root bpf programs, from Alexei.

3) introduce bpf sysctl hook for trusted root cases, from Andrey.

4) recognize var/datasec in btf deduplication, from Andrii.

5) cpumap performance optimizations, from Jesper.

6) verifier prep for alu32 optimization, from Jiong.

7) libbpf xsk cleanup, from Magnus.

8) other various fixes and cleanups.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
......@@ -85,8 +85,33 @@ Q: Can loops be supported in a safe way?
A: It's not clear yet.
BPF developers are trying to find a way to
support bounded loops where the verifier can guarantee that
the program terminates in less than 4096 instructions.
support bounded loops.
Q: What are the verifier limits?
--------------------------------
A: The only limit known to the user space is BPF_MAXINSNS (4096).
It's the maximum number of instructions that the unprivileged bpf
program can have. The verifier has various internal limits,
such as the maximum number of instructions that can be explored during
program analysis. Currently, that limit is set to 1 million,
which essentially means that the largest program can consist
of 1 million NOP instructions. There is a limit to the maximum number
of subsequent branches, a limit to the number of nested bpf-to-bpf
calls, a limit to the number of the verifier states per instruction,
and a limit to the number of maps used by the program.
All these limits can be hit with a sufficiently complex program.
There are also non-numerical limits that can cause the program
to be rejected. The verifier used to recognize only pointer + constant
expressions. Now it can recognize pointer + bounded_register.
bpf_map_lookup_elem(key) had a requirement that 'key' must be
a pointer to the stack. Now, 'key' can be a pointer to map value.
The verifier is steadily getting 'smarter'. The limits are
being removed. The only way to know that the program is going to
be accepted by the verifier is to try to load it.
The bpf development process guarantees that the future kernel
versions will accept all bpf programs that were accepted by
the earlier versions.
Instruction level questions
---------------------------
......
......@@ -36,6 +36,16 @@ Two sets of Questions and Answers (Q&A) are maintained.
bpf_devel_QA
Program types
=============
.. toctree::
:maxdepth: 1
prog_cgroup_sysctl
prog_flow_dissector
.. Links:
.. _Documentation/networking/filter.txt: ../networking/filter.txt
.. _man-pages: https://www.kernel.org/doc/man-pages/
......
.. SPDX-License-Identifier: (LGPL-2.1 OR BSD-2-Clause)
===========================
BPF_PROG_TYPE_CGROUP_SYSCTL
===========================
This document describes the ``BPF_PROG_TYPE_CGROUP_SYSCTL`` program type, which
provides a cgroup-bpf hook for sysctl.
The hook has to be attached to a cgroup and is called every time a
process inside that cgroup tries to read from or write to a sysctl knob in proc.
1. Attach type
**************
``BPF_CGROUP_SYSCTL`` attach type has to be used to attach
``BPF_PROG_TYPE_CGROUP_SYSCTL`` program to a cgroup.
2. Context
**********
``BPF_PROG_TYPE_CGROUP_SYSCTL`` provides access to the following context from
BPF program::
struct bpf_sysctl {
__u32 write;
__u32 file_pos;
};
* ``write`` indicates whether the sysctl value is being read (``0``) or written
(``1``). This field is read-only.
* ``file_pos`` indicates the file position at which the sysctl is being accessed,
for either a read or a write. This field is read-write. Writing to the field sets
the starting position in the sysctl proc file that ``read(2)`` will read from or
``write(2)`` will write to. Writing zero to the field can be used, e.g., to override
the whole sysctl value by ``bpf_sysctl_set_new_value()`` on ``write(2)`` even
when it's called by user space on ``file_pos > 0``. Writing a non-zero
value to the field can be used to access part of the sysctl value starting from
the specified ``file_pos``. Not all sysctls support access with ``file_pos !=
0``, e.g. writes to numeric sysctl entries must always be at file position
``0``. See also the ``kernel.sysctl_writes_strict`` sysctl.
See `linux/bpf.h`_ for more details on how context field can be accessed.
3. Return code
**************
``BPF_PROG_TYPE_CGROUP_SYSCTL`` program must return one of the following
return codes:
* ``0`` means "reject access to sysctl";
* ``1`` means "proceed with access".
If the program returns ``0``, user space will get ``-1`` from ``read(2)`` or
``write(2)`` and ``errno`` will be set to ``EPERM``.
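The verdict handling described above can be modeled in plain user-space C. This is a minimal sketch rather than kernel code: the function name ``sysctl_hook_result`` is hypothetical, and the mapping simply mirrors the ``ret == 1 ? 0 : -EPERM`` expression that ``__cgroup_bpf_run_filter_sysctl()`` uses in this commit.

```c
#include <errno.h>

/* Hypothetical user-space model (not a kernel API): map the value returned
 * by a BPF_PROG_TYPE_CGROUP_SYSCTL program to the error the hook reports.
 * 1 means "proceed with access"; any other value is reported as -EPERM.
 */
static int sysctl_hook_result(int prog_ret)
{
	return prog_ret == 1 ? 0 : -EPERM;
}
```

A caller in this model would treat ``0`` as "continue to the sysctl handler" and a negative value as the error to hand back to user space.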
4. Helpers
**********
Since sysctl knob is represented by a name and a value, sysctl specific BPF
helpers focus on providing access to these properties:
* ``bpf_sysctl_get_name()`` to get the sysctl name as it is visible in
``/proc/sys`` into a buffer provided by the BPF program;
* ``bpf_sysctl_get_current_value()`` to get the string value currently held by
the sysctl into a buffer provided by the BPF program. This helper is available on
both ``read(2)`` from and ``write(2)`` to sysctl;
* ``bpf_sysctl_get_new_value()`` to get new string value currently being
written to sysctl before actual write happens. This helper can be used only
on ``ctx->write == 1``;
* ``bpf_sysctl_set_new_value()`` to override new string value currently being
written to sysctl before actual write happens. Sysctl value will be
overridden starting from the current ``ctx->file_pos``. If the whole value
has to be overridden BPF program can set ``file_pos`` to zero before calling
to the helper. This helper can be used only on ``ctx->write == 1``. New
string value set by the helper is treated and verified by kernel same way as
an equivalent string passed by user space.
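The buffer semantics shared by the two value getters (always NUL terminated, ``-E2BIG`` on truncation, ``-EINVAL`` when no value is available) can be tried out in user space. The sketch below is a port of the ``copy_sysctl_value()`` logic added by this commit; the name ``model_copy_value`` is hypothetical, and ``min()`` is replaced by a conditional expression because the kernel macro is not available outside the kernel.

```c
#include <errno.h>
#include <stddef.h>
#include <string.h>

/* User-space port of the copy logic behind the value getters in this
 * commit; model_copy_value is a hypothetical name, not a kernel symbol.
 * Returns the number of bytes copied (without the trailing NUL), -E2BIG if
 * dst was too small (dst then holds a truncated, NUL-terminated copy), or
 * -EINVAL if there is no source value to copy.
 */
static int model_copy_value(char *dst, size_t dst_len,
			    const char *src, size_t src_len)
{
	if (!dst)
		return -EINVAL;
	if (!dst_len)
		return -E2BIG;
	if (!src || !src_len) {
		/* No value available: hand back an all-zero buffer. */
		memset(dst, 0, dst_len);
		return -EINVAL;
	}

	memcpy(dst, src, dst_len < src_len ? dst_len : src_len);

	if (dst_len > src_len) {
		/* Zero the rest of dst so the result is NUL terminated. */
		memset(dst + src_len, '\0', dst_len - src_len);
		return (int)src_len;
	}

	/* Not enough room: truncate and terminate. */
	dst[dst_len - 1] = '\0';
	return -E2BIG;
}
```

A destination buffer exactly as long as the value also yields ``-E2BIG``, because one byte is always reserved for the terminating NUL.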
BPF program sees sysctl value same way as user space does in proc filesystem,
i.e. as a string. Since many sysctl values represent an integer or a vector
of integers, the following helpers can be used to get numeric value from the
string:
* ``bpf_strtol()`` to convert initial part of the string to long integer
similar to user space `strtol(3)`_;
* ``bpf_strtoul()`` to convert initial part of the string to unsigned long
integer similar to user space `strtoul(3)`_;
See `linux/bpf.h`_ for more details on helpers described here.
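To make the ``bpf_strtol()`` contract concrete (base taken from the five least significant bits of ``flags``, return value equal to the number of characters consumed), here is a rough user-space approximation built on ``strtol(3)``. It is a sketch under stated assumptions, not the kernel implementation: ``model_bpf_strtol`` and ``MODEL_BASE_MASK`` are hypothetical names, inputs longer than 63 bytes are truncated, the unused flag bits are simply ignored, and ``strtol(3)`` also accepts a leading ``+``, which the helper documentation does not mention.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define MODEL_BASE_MASK 0x1fULL	/* five least significant bits of flags */

/* Hypothetical user-space approximation of the documented bpf_strtol()
 * contract. Returns the number of characters consumed on success, -EINVAL
 * for an unsupported base or when no digits are found, -ERANGE on overflow.
 */
static int model_bpf_strtol(const char *buf, size_t buf_len, uint64_t flags,
			    long *res)
{
	unsigned int base = (unsigned int)(flags & MODEL_BASE_MASK);
	char tmp[64];
	char *end;
	long val;

	if (base != 0 && base != 8 && base != 10 && base != 16)
		return -EINVAL;
	if (!buf || !buf_len || !res)
		return -EINVAL;

	/* strtol(3) needs a NUL-terminated string, so copy a bounded prefix. */
	if (buf_len > sizeof(tmp) - 1)
		buf_len = sizeof(tmp) - 1;
	memcpy(tmp, buf, buf_len);
	tmp[buf_len] = '\0';

	errno = 0;
	val = strtol(tmp, &end, (int)base);
	if (end == tmp)
		return -EINVAL;		/* no valid digits */
	if (errno == ERANGE)
		return -ERANGE;

	*res = val;
	return (int)(end - tmp);	/* characters consumed */
}
```

Because the return value is the number of characters consumed, a vector such as ``4096 87380 6291456`` can be parsed by calling the function repeatedly and advancing the buffer pointer by each positive return value.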
5. Examples
***********
See `test_sysctl_prog.c`_ for an example of a BPF program in C that accesses the
sysctl name and value, parses the string value to get a vector of integers and uses
the result to decide whether to allow or deny access to the sysctl.
6. Notes
********
``BPF_PROG_TYPE_CGROUP_SYSCTL`` is intended to be used in **trusted** root
environment, for example to monitor sysctl usage or catch unreasonable values
an application, running as root in a separate cgroup, is trying to set.
Since `task_dfl_cgroup(current)` is called at `sys_read` / `sys_write` time it
may return results different from that at `sys_open` time, i.e. process that
opened sysctl file in proc filesystem may differ from process that is trying
to read from / write to it and two such processes may run in different
cgroups, which means ``BPF_PROG_TYPE_CGROUP_SYSCTL`` should not be used as a
security mechanism to limit sysctl usage.
As with any cgroup-bpf program additional care should be taken if an
application running as root in a cgroup should not be allowed to
detach/replace BPF program attached by administrator.
.. Links
.. _linux/bpf.h: ../../include/uapi/linux/bpf.h
.. _strtol(3): http://man7.org/linux/man-pages/man3/strtol.3p.html
.. _strtoul(3): http://man7.org/linux/man-pages/man3/strtoul.3p.html
.. _test_sysctl_prog.c:
../../tools/testing/selftests/bpf/progs/test_sysctl_prog.c
.. SPDX-License-Identifier: GPL-2.0
==================
BPF Flow Dissector
==================
============================
BPF_PROG_TYPE_FLOW_DISSECTOR
============================
Overview
========
......
......@@ -9,7 +9,6 @@ Contents:
netdev-FAQ
af_xdp
batman-adv
bpf_flow_dissector
can
can_ucan_protocol
device_drivers/freescale/dpaa2/index
......
......@@ -97,6 +97,12 @@ lirc_mode2_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
return &bpf_map_update_elem_proto;
case BPF_FUNC_map_delete_elem:
return &bpf_map_delete_elem_proto;
case BPF_FUNC_map_push_elem:
return &bpf_map_push_elem_proto;
case BPF_FUNC_map_pop_elem:
return &bpf_map_pop_elem_proto;
case BPF_FUNC_map_peek_elem:
return &bpf_map_peek_elem_proto;
case BPF_FUNC_ktime_get_ns:
return &bpf_ktime_get_ns_proto;
case BPF_FUNC_tail_call:
......
......@@ -13,6 +13,7 @@
#include <linux/namei.h>
#include <linux/mm.h>
#include <linux/module.h>
#include <linux/bpf-cgroup.h>
#include "internal.h"
static const struct dentry_operations proc_sys_dentry_operations;
......@@ -569,8 +570,8 @@ static ssize_t proc_sys_call_handler(struct file *filp, void __user *buf,
struct inode *inode = file_inode(filp);
struct ctl_table_header *head = grab_header(inode);
struct ctl_table *table = PROC_I(inode)->sysctl_entry;
void *new_buf = NULL;
ssize_t error;
size_t res;
if (IS_ERR(head))
return PTR_ERR(head);
......@@ -588,11 +589,27 @@ static ssize_t proc_sys_call_handler(struct file *filp, void __user *buf,
if (!table->proc_handler)
goto out;
error = BPF_CGROUP_RUN_PROG_SYSCTL(head, table, write, buf, &count,
ppos, &new_buf);
if (error)
goto out;
/* careful: calling conventions are nasty here */
res = count;
error = table->proc_handler(table, write, buf, &res, ppos);
if (new_buf) {
mm_segment_t old_fs;
old_fs = get_fs();
set_fs(KERNEL_DS);
error = table->proc_handler(table, write, (void __user *)new_buf,
&count, ppos);
set_fs(old_fs);
kfree(new_buf);
} else {
error = table->proc_handler(table, write, buf, &count, ppos);
}
if (!error)
error = res;
error = count;
out:
sysctl_head_finish(head);
......
......@@ -17,6 +17,8 @@ struct bpf_map;
struct bpf_prog;
struct bpf_sock_ops_kern;
struct bpf_cgroup_storage;
struct ctl_table;
struct ctl_table_header;
#ifdef CONFIG_CGROUP_BPF
......@@ -109,6 +111,12 @@ int __cgroup_bpf_run_filter_sock_ops(struct sock *sk,
int __cgroup_bpf_check_dev_permission(short dev_type, u32 major, u32 minor,
short access, enum bpf_attach_type type);
int __cgroup_bpf_run_filter_sysctl(struct ctl_table_header *head,
struct ctl_table *table, int write,
void __user *buf, size_t *pcount,
loff_t *ppos, void **new_buf,
enum bpf_attach_type type);
static inline enum bpf_cgroup_storage_type cgroup_storage_type(
struct bpf_map *map)
{
......@@ -253,6 +261,18 @@ int bpf_percpu_cgroup_storage_update(struct bpf_map *map, void *key,
\
__ret; \
})
#define BPF_CGROUP_RUN_PROG_SYSCTL(head, table, write, buf, count, pos, nbuf) \
({ \
int __ret = 0; \
if (cgroup_bpf_enabled) \
__ret = __cgroup_bpf_run_filter_sysctl(head, table, write, \
buf, count, pos, nbuf, \
BPF_CGROUP_SYSCTL); \
__ret; \
})
int cgroup_bpf_prog_attach(const union bpf_attr *attr,
enum bpf_prog_type ptype, struct bpf_prog *prog);
int cgroup_bpf_prog_detach(const union bpf_attr *attr,
......@@ -321,6 +341,7 @@ static inline int bpf_percpu_cgroup_storage_update(struct bpf_map *map,
#define BPF_CGROUP_RUN_PROG_UDP6_SENDMSG_LOCK(sk, uaddr, t_ctx) ({ 0; })
#define BPF_CGROUP_RUN_PROG_SOCK_OPS(sock_ops) ({ 0; })
#define BPF_CGROUP_RUN_PROG_DEVICE_CGROUP(type,major,minor,access) ({ 0; })
#define BPF_CGROUP_RUN_PROG_SYSCTL(head,table,write,buf,count,pos,nbuf) ({ 0; })
#define for_each_cgroup_storage_type(stype) for (; false; )
......
......@@ -202,6 +202,8 @@ enum bpf_arg_type {
ARG_ANYTHING, /* any (initialized) argument is ok */
ARG_PTR_TO_SPIN_LOCK, /* pointer to bpf_spin_lock */
ARG_PTR_TO_SOCK_COMMON, /* pointer to sock_common */
ARG_PTR_TO_INT, /* pointer to int */
ARG_PTR_TO_LONG, /* pointer to long */
};
/* type of values returned from helper functions */
......@@ -987,6 +989,8 @@ extern const struct bpf_func_proto bpf_sk_redirect_map_proto;
extern const struct bpf_func_proto bpf_spin_lock_proto;
extern const struct bpf_func_proto bpf_spin_unlock_proto;
extern const struct bpf_func_proto bpf_get_local_storage_proto;
extern const struct bpf_func_proto bpf_strtol_proto;
extern const struct bpf_func_proto bpf_strtoul_proto;
/* Shared helpers among cBPF and eBPF. */
void bpf_user_rnd_init_once(void);
......
......@@ -28,6 +28,7 @@ BPF_PROG_TYPE(BPF_PROG_TYPE_RAW_TRACEPOINT, raw_tracepoint)
#endif
#ifdef CONFIG_CGROUP_BPF
BPF_PROG_TYPE(BPF_PROG_TYPE_CGROUP_DEVICE, cg_dev)
BPF_PROG_TYPE(BPF_PROG_TYPE_CGROUP_SYSCTL, cg_sysctl)
#endif
#ifdef CONFIG_BPF_LIRC_MODE2
BPF_PROG_TYPE(BPF_PROG_TYPE_LIRC_MODE2, lirc_mode2)
......
......@@ -295,6 +295,11 @@ struct bpf_verifier_env {
const struct bpf_line_info *prev_linfo;
struct bpf_verifier_log log;
struct bpf_subprog_info subprog_info[BPF_MAX_SUBPROGS + 1];
struct {
int *insn_state;
int *insn_stack;
int cur_stack;
} cfg;
u32 subprog_cnt;
/* number of instructions analyzed by the verifier */
u32 insn_processed;
......
......@@ -33,6 +33,8 @@ struct bpf_prog_aux;
struct xdp_rxq_info;
struct xdp_buff;
struct sock_reuseport;
struct ctl_table;
struct ctl_table_header;
/* ArgX, context and stack frame pointer register positions. Note,
* Arg1, Arg2, Arg3, etc are used as argument mappings of function
......@@ -1177,4 +1179,18 @@ struct bpf_sock_ops_kern {
*/
};
struct bpf_sysctl_kern {
struct ctl_table_header *head;
struct ctl_table *table;
void *cur_val;
size_t cur_len;
void *new_val;
size_t new_len;
int new_updated;
int write;
loff_t *ppos;
/* Temporary "register" for indirect stores to ppos. */
u64 tmp_reg;
};
#endif /* __LINUX_FILTER_H__ */
......@@ -1042,6 +1042,8 @@ struct sk_buff *__alloc_skb(unsigned int size, gfp_t priority, int flags,
int node);
struct sk_buff *__build_skb(void *data, unsigned int frag_size);
struct sk_buff *build_skb(void *data, unsigned int frag_size);
struct sk_buff *build_skb_around(struct sk_buff *skb,
void *data, unsigned int frag_size);
/**
* alloc_skb - allocate a network buffer
......
......@@ -167,6 +167,7 @@ enum bpf_prog_type {
BPF_PROG_TYPE_LIRC_MODE2,
BPF_PROG_TYPE_SK_REUSEPORT,
BPF_PROG_TYPE_FLOW_DISSECTOR,
BPF_PROG_TYPE_CGROUP_SYSCTL,
};
enum bpf_attach_type {
......@@ -188,6 +189,7 @@ enum bpf_attach_type {
BPF_CGROUP_UDP6_SENDMSG,
BPF_LIRC_MODE2,
BPF_FLOW_DISSECTOR,
BPF_CGROUP_SYSCTL,
__MAX_BPF_ATTACH_TYPE
};
......@@ -1735,12 +1737,19 @@ union bpf_attr {
* error if an eBPF program tries to set a callback that is not
* supported in the current kernel.
*
* The supported callback values that *argval* can combine are:
* *argval* is a flag array which can combine these flags:
*
* * **BPF_SOCK_OPS_RTO_CB_FLAG** (retransmission time out)
* * **BPF_SOCK_OPS_RETRANS_CB_FLAG** (retransmission)
* * **BPF_SOCK_OPS_STATE_CB_FLAG** (TCP state change)
*
* Therefore, this function can be used to clear a callback flag by
* setting the appropriate bit to zero. e.g. to disable the RTO
* callback:
*
* **bpf_sock_ops_cb_flags_set(bpf_sock,**
* **bpf_sock->bpf_sock_ops_cb_flags & ~BPF_SOCK_OPS_RTO_CB_FLAG)**
*
* Here are some examples of where one could call such eBPF
* program:
*
......@@ -2504,6 +2513,122 @@ union bpf_attr {
* Return
* 0 if iph and th are a valid SYN cookie ACK, or a negative error
* otherwise.
*
* int bpf_sysctl_get_name(struct bpf_sysctl *ctx, char *buf, size_t buf_len, u64 flags)
* Description
* Get name of sysctl in /proc/sys/ and copy it into provided by
* program buffer *buf* of size *buf_len*.
*
* The buffer is always NUL terminated, unless it's zero-sized.
*
* If *flags* is zero, full name (e.g. "net/ipv4/tcp_mem") is
* copied. Use **BPF_F_SYSCTL_BASE_NAME** flag to copy base name
* only (e.g. "tcp_mem").
* Return
* Number of characters copied (not including the trailing NUL).
*
* **-E2BIG** if the buffer wasn't big enough (*buf* will contain
* truncated name in this case).
*
* int bpf_sysctl_get_current_value(struct bpf_sysctl *ctx, char *buf, size_t buf_len)
* Description
* Get current value of sysctl as it is presented in /proc/sys
* (incl. newline, etc), and copy it as a string into provided
* by program buffer *buf* of size *buf_len*.
*
* The whole value is copied, no matter what file position user
* space issued e.g. sys_read at.
*
* The buffer is always NUL terminated, unless it's zero-sized.
* Return
* Number of characters copied (not including the trailing NUL).
*
* **-E2BIG** if the buffer wasn't big enough (*buf* will contain
* truncated value in this case).
*
* **-EINVAL** if current value was unavailable, e.g. because
* sysctl is uninitialized and read returns -EIO for it.
*
* int bpf_sysctl_get_new_value(struct bpf_sysctl *ctx, char *buf, size_t buf_len)
* Description
* Get new value being written by user space to sysctl (before
* the actual write happens) and copy it as a string into
* provided by program buffer *buf* of size *buf_len*.
*
* User space may write new value at file position > 0.
*
* The buffer is always NUL terminated, unless it's zero-sized.
* Return
* Number of characters copied (not including the trailing NUL).
*
* **-E2BIG** if the buffer wasn't big enough (*buf* will contain
* truncated value in this case).
*
* **-EINVAL** if sysctl is being read.
*
* int bpf_sysctl_set_new_value(struct bpf_sysctl *ctx, const char *buf, size_t buf_len)
* Description
* Override new value being written by user space to sysctl with
* value provided by program in buffer *buf* of size *buf_len*.
*
* *buf* should contain a string in same form as provided by user
* space on sysctl write.
*
* User space may write new value at file position > 0. To override
* the whole sysctl value file position should be set to zero.
* Return
* 0 on success.
*
* **-E2BIG** if the *buf_len* is too big.
*
* **-EINVAL** if sysctl is being read.
*
* int bpf_strtol(const char *buf, size_t buf_len, u64 flags, long *res)
* Description
* Convert the initial part of the string from buffer *buf* of
* size *buf_len* to a long integer according to the given base
* and save the result in *res*.
*
* The string may begin with an arbitrary amount of white space
* (as determined by isspace(3)) followed by a single optional '-'
* sign.
*
* Five least significant bits of *flags* encode base, other bits
* are currently unused.
*
* Base must be either 8, 10, 16 or 0 to detect it automatically
* similar to user space strtol(3).
* Return
* Number of characters consumed on success. Must be positive but
* no more than buf_len.
*
* **-EINVAL** if no valid digits were found or unsupported base
* was provided.
*
* **-ERANGE** if resulting value was out of range.
*
* int bpf_strtoul(const char *buf, size_t buf_len, u64 flags, unsigned long *res)
* Description
* Convert the initial part of the string from buffer *buf* of
* size *buf_len* to an unsigned long integer according to the
* given base and save the result in *res*.
*
* The string may begin with an arbitrary amount of white space
* (as determined by isspace(3)).
*
* Five least significant bits of *flags* encode base, other bits
* are currently unused.
*
* Base must be either 8, 10, 16 or 0 to detect it automatically
* similar to user space strtoul(3).
* Return
* Number of characters consumed on success. Must be positive but
* no more than buf_len.
*
* **-EINVAL** if no valid digits were found or unsupported base
* was provided.
*
* **-ERANGE** if resulting value was out of range.
*/
#define __BPF_FUNC_MAPPER(FN) \
FN(unspec), \
......@@ -2606,7 +2731,13 @@ union bpf_attr {
FN(skb_ecn_set_ce), \
FN(get_listener_sock), \
FN(skc_lookup_tcp), \
FN(tcp_check_syncookie),
FN(tcp_check_syncookie), \
FN(sysctl_get_name), \
FN(sysctl_get_current_value), \
FN(sysctl_get_new_value), \
FN(sysctl_set_new_value), \
FN(strtol), \
FN(strtoul),
/* integer value in 'imm' field of BPF_CALL instruction selects which helper
* function eBPF program intends to call
......@@ -2668,17 +2799,20 @@ enum bpf_func_id {
/* BPF_FUNC_skb_adjust_room flags. */
#define BPF_F_ADJ_ROOM_FIXED_GSO (1ULL << 0)
#define BPF_ADJ_ROOM_ENCAP_L2_MASK 0xff
#define BPF_ADJ_ROOM_ENCAP_L2_SHIFT 56
#define BPF_ADJ_ROOM_ENCAP_L2_MASK 0xff
#define BPF_ADJ_ROOM_ENCAP_L2_SHIFT 56
#define BPF_F_ADJ_ROOM_ENCAP_L3_IPV4 (1ULL << 1)
#define BPF_F_ADJ_ROOM_ENCAP_L3_IPV6 (1ULL << 2)
#define BPF_F_ADJ_ROOM_ENCAP_L4_GRE (1ULL << 3)
#define BPF_F_ADJ_ROOM_ENCAP_L4_UDP (1ULL << 4)
#define BPF_F_ADJ_ROOM_ENCAP_L2(len) (((__u64)len & \
#define BPF_F_ADJ_ROOM_ENCAP_L2(len) (((__u64)len & \
BPF_ADJ_ROOM_ENCAP_L2_MASK) \
<< BPF_ADJ_ROOM_ENCAP_L2_SHIFT)
/* BPF_FUNC_sysctl_get_name flags. */
#define BPF_F_SYSCTL_BASE_NAME (1ULL << 0)
/* Mode for BPF_FUNC_skb_adjust_room helper. */
enum bpf_adj_room_mode {
BPF_ADJ_ROOM_NET,
......@@ -3308,4 +3442,14 @@ struct bpf_line_info {
struct bpf_spin_lock {
__u32 val;
};
struct bpf_sysctl {
__u32 write; /* Sysctl is being read (= 0) or written (= 1).
* Allows 1,2,4-byte read, but no write.
*/
__u32 file_pos; /* Sysctl file position to read from, write to.
* Allows 1,2,4-byte read and 4-byte write.
*/
};
#endif /* _UAPI__LINUX_BPF_H__ */
......@@ -11,7 +11,10 @@
#include <linux/kernel.h>
#include <linux/atomic.h>
#include <linux/cgroup.h>
#include <linux/filter.h>
#include <linux/slab.h>
#include <linux/sysctl.h>
#include <linux/string.h>
#include <linux/bpf.h>
#include <linux/bpf-cgroup.h>
#include <net/sock.h>
......@@ -701,7 +704,7 @@ int __cgroup_bpf_check_dev_permission(short dev_type, u32 major, u32 minor,
EXPORT_SYMBOL(__cgroup_bpf_check_dev_permission);
static const struct bpf_func_proto *
cgroup_dev_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
cgroup_base_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
{
switch (func_id) {
case BPF_FUNC_map_lookup_elem:
......@@ -710,6 +713,12 @@ cgroup_dev_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
return &bpf_map_update_elem_proto;
case BPF_FUNC_map_delete_elem:
return &bpf_map_delete_elem_proto;
case BPF_FUNC_map_push_elem:
return &bpf_map_push_elem_proto;
case BPF_FUNC_map_pop_elem:
return &bpf_map_pop_elem_proto;
case BPF_FUNC_map_peek_elem:
return &bpf_map_peek_elem_proto;
case BPF_FUNC_get_current_uid_gid:
return &bpf_get_current_uid_gid_proto;
case BPF_FUNC_get_local_storage:
......@@ -725,6 +734,12 @@ cgroup_dev_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
}
}
static const struct bpf_func_proto *
cgroup_dev_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
{
return cgroup_base_func_proto(func_id, prog);
}
static bool cgroup_dev_is_valid_access(int off, int size,
enum bpf_access_type type,
const struct bpf_prog *prog,
......@@ -762,3 +777,356 @@ const struct bpf_verifier_ops cg_dev_verifier_ops = {
.get_func_proto = cgroup_dev_func_proto,
.is_valid_access = cgroup_dev_is_valid_access,
};
/**
* __cgroup_bpf_run_filter_sysctl - Run a program on sysctl
*
* @head: sysctl table header
* @table: sysctl table
* @write: sysctl is being read (= 0) or written (= 1)
* @buf: pointer to buffer passed by user space
* @pcount: value-result argument: value is size of buffer pointed to by @buf,
* result is size of @new_buf if program set new value, initial value
* otherwise
* @ppos: value-result argument: value is position at which read from or write
* to sysctl is happening, result is new position if program overrode it,
* initial value otherwise
* @new_buf: pointer to pointer to new buffer that will be allocated if program
* overrides new value provided by user space on sysctl write
* NOTE: it's the caller's responsibility to free *new_buf if it was set
* @type: type of program to be executed
*
* Program is run when sysctl is being accessed, either read or written, and
* can allow or deny such access.
*
* This function will return %-EPERM if an attached program is found and
* returned value != 1 during execution. In all other cases 0 is returned.
*/
int __cgroup_bpf_run_filter_sysctl(struct ctl_table_header *head,
struct ctl_table *table, int write,
void __user *buf, size_t *pcount,
loff_t *ppos, void **new_buf,
enum bpf_attach_type type)
{
struct bpf_sysctl_kern ctx = {
.head = head,
.table = table,
.write = write,
.ppos = ppos,
.cur_val = NULL,
.cur_len = PAGE_SIZE,
.new_val = NULL,
.new_len = 0,
.new_updated = 0,
};
struct cgroup *cgrp;
int ret;
ctx.cur_val = kmalloc_track_caller(ctx.cur_len, GFP_KERNEL);
if (ctx.cur_val) {
mm_segment_t old_fs;
loff_t pos = 0;
old_fs = get_fs();
set_fs(KERNEL_DS);
if (table->proc_handler(table, 0, (void __user *)ctx.cur_val,
&ctx.cur_len, &pos)) {
/* Let BPF program decide how to proceed. */
ctx.cur_len = 0;
}
set_fs(old_fs);
} else {
/* Let BPF program decide how to proceed. */
ctx.cur_len = 0;
}
if (write && buf && *pcount) {
/* BPF program should be able to override new value with a
* buffer bigger than provided by user.
*/
ctx.new_val = kmalloc_track_caller(PAGE_SIZE, GFP_KERNEL);
ctx.new_len = min_t(size_t, PAGE_SIZE, *pcount);
if (!ctx.new_val ||
copy_from_user(ctx.new_val, buf, ctx.new_len))
/* Let BPF program decide how to proceed. */
ctx.new_len = 0;
}
rcu_read_lock();
cgrp = task_dfl_cgroup(current);
ret = BPF_PROG_RUN_ARRAY(cgrp->bpf.effective[type], &ctx, BPF_PROG_RUN);
rcu_read_unlock();
kfree(ctx.cur_val);
if (ret == 1 && ctx.new_updated) {
*new_buf = ctx.new_val;
*pcount = ctx.new_len;
} else {
kfree(ctx.new_val);
}
return ret == 1 ? 0 : -EPERM;
}
EXPORT_SYMBOL(__cgroup_bpf_run_filter_sysctl);
static ssize_t sysctl_cpy_dir(const struct ctl_dir *dir, char **bufp,
size_t *lenp)
{
ssize_t tmp_ret = 0, ret;
if (dir->header.parent) {
tmp_ret = sysctl_cpy_dir(dir->header.parent, bufp, lenp);
if (tmp_ret < 0)
return tmp_ret;
}
ret = strscpy(*bufp, dir->header.ctl_table[0].procname, *lenp);
if (ret < 0)
return ret;
*bufp += ret;
*lenp -= ret;
ret += tmp_ret;
/* Avoid leading slash. */
if (!ret)
return ret;
tmp_ret = strscpy(*bufp, "/", *lenp);
if (tmp_ret < 0)
return tmp_ret;
*bufp += tmp_ret;
*lenp -= tmp_ret;
return ret + tmp_ret;
}
BPF_CALL_4(bpf_sysctl_get_name, struct bpf_sysctl_kern *, ctx, char *, buf,
size_t, buf_len, u64, flags)
{
ssize_t tmp_ret = 0, ret;
if (!buf)
return -EINVAL;
if (!(flags & BPF_F_SYSCTL_BASE_NAME)) {
if (!ctx->head)
return -EINVAL;
tmp_ret = sysctl_cpy_dir(ctx->head->parent, &buf, &buf_len);
if (tmp_ret < 0)
return tmp_ret;
}
ret = strscpy(buf, ctx->table->procname, buf_len);
return ret < 0 ? ret : tmp_ret + ret;
}
static const struct bpf_func_proto bpf_sysctl_get_name_proto = {
.func = bpf_sysctl_get_name,
.gpl_only = false,
.ret_type = RET_INTEGER,
.arg1_type = ARG_PTR_TO_CTX,
.arg2_type = ARG_PTR_TO_MEM,
.arg3_type = ARG_CONST_SIZE,
.arg4_type = ARG_ANYTHING,
};
static int copy_sysctl_value(char *dst, size_t dst_len, char *src,
size_t src_len)
{
if (!dst)
return -EINVAL;
if (!dst_len)
return -E2BIG;
if (!src || !src_len) {
memset(dst, 0, dst_len);
return -EINVAL;
}
memcpy(dst, src, min(dst_len, src_len));
if (dst_len > src_len) {
memset(dst + src_len, '\0', dst_len - src_len);
return src_len;
}
dst[dst_len - 1] = '\0';
return -E2BIG;
}
BPF_CALL_3(bpf_sysctl_get_current_value, struct bpf_sysctl_kern *, ctx,
char *, buf, size_t, buf_len)
{
return copy_sysctl_value(buf, buf_len, ctx->cur_val, ctx->cur_len);
}
static const struct bpf_func_proto bpf_sysctl_get_current_value_proto = {
.func = bpf_sysctl_get_current_value,
.gpl_only = false,
.ret_type = RET_INTEGER,
.arg1_type = ARG_PTR_TO_CTX,
.arg2_type = ARG_PTR_TO_UNINIT_MEM,
.arg3_type = ARG_CONST_SIZE,
};
BPF_CALL_3(bpf_sysctl_get_new_value, struct bpf_sysctl_kern *, ctx, char *, buf,
size_t, buf_len)
{
if (!ctx->write) {
if (buf && buf_len)
memset(buf, '\0', buf_len);
return -EINVAL;
}
return copy_sysctl_value(buf, buf_len, ctx->new_val, ctx->new_len);
}
static const struct bpf_func_proto bpf_sysctl_get_new_value_proto = {
.func = bpf_sysctl_get_new_value,
.gpl_only = false,
.ret_type = RET_INTEGER,
.arg1_type = ARG_PTR_TO_CTX,
.arg2_type = ARG_PTR_TO_UNINIT_MEM,
.arg3_type = ARG_CONST_SIZE,
};
BPF_CALL_3(bpf_sysctl_set_new_value, struct bpf_sysctl_kern *, ctx,
const char *, buf, size_t, buf_len)
{
if (!ctx->write || !ctx->new_val || !ctx->new_len || !buf || !buf_len)
return -EINVAL;
if (buf_len > PAGE_SIZE - 1)
return -E2BIG;
memcpy(ctx->new_val, buf, buf_len);
ctx->new_len = buf_len;
ctx->new_updated = 1;
return 0;
}
static const struct bpf_func_proto bpf_sysctl_set_new_value_proto = {
.func = bpf_sysctl_set_new_value,
.gpl_only = false,
.ret_type = RET_INTEGER,
.arg1_type = ARG_PTR_TO_CTX,
.arg2_type = ARG_PTR_TO_MEM,
.arg3_type = ARG_CONST_SIZE,
};
static const struct bpf_func_proto *
sysctl_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
{
switch (func_id) {
case BPF_FUNC_strtol:
return &bpf_strtol_proto;
case BPF_FUNC_strtoul:
return &bpf_strtoul_proto;
case BPF_FUNC_sysctl_get_name:
return &bpf_sysctl_get_name_proto;
case BPF_FUNC_sysctl_get_current_value:
return &bpf_sysctl_get_current_value_proto;
case BPF_FUNC_sysctl_get_new_value:
return &bpf_sysctl_get_new_value_proto;
case BPF_FUNC_sysctl_set_new_value:
return &bpf_sysctl_set_new_value_proto;
default:
return cgroup_base_func_proto(func_id, prog);
}
}
static bool sysctl_is_valid_access(int off, int size, enum bpf_access_type type,
const struct bpf_prog *prog,
struct bpf_insn_access_aux *info)
{
const int size_default = sizeof(__u32);
if (off < 0 || off + size > sizeof(struct bpf_sysctl) || off % size)
return false;
switch (off) {
case offsetof(struct bpf_sysctl, write):
if (type != BPF_READ)
return false;
bpf_ctx_record_field_size(info, size_default);
return bpf_ctx_narrow_access_ok(off, size, size_default);
case offsetof(struct bpf_sysctl, file_pos):
if (type == BPF_READ) {
bpf_ctx_record_field_size(info, size_default);
return bpf_ctx_narrow_access_ok(off, size, size_default);
} else {
return size == size_default;
}
default:
return false;
}
}
static u32 sysctl_convert_ctx_access(enum bpf_access_type type,
const struct bpf_insn *si,
struct bpf_insn *insn_buf,
struct bpf_prog *prog, u32 *target_size)
{
struct bpf_insn *insn = insn_buf;
switch (si->off) {
case offsetof(struct bpf_sysctl, write):
*insn++ = BPF_LDX_MEM(
BPF_SIZE(si->code), si->dst_reg, si->src_reg,
bpf_target_off(struct bpf_sysctl_kern, write,
FIELD_SIZEOF(struct bpf_sysctl_kern,
write),
target_size));
break;
case offsetof(struct bpf_sysctl, file_pos):
/* ppos is a pointer so it should be accessed via indirect
* loads and stores. Also for stores additional temporary
* register is used since neither src_reg nor dst_reg can be
* overridden.
*/
if (type == BPF_WRITE) {
int treg = BPF_REG_9;
if (si->src_reg == treg || si->dst_reg == treg)
--treg;
if (si->src_reg == treg || si->dst_reg == treg)
--treg;
*insn++ = BPF_STX_MEM(
BPF_DW, si->dst_reg, treg,
offsetof(struct bpf_sysctl_kern, tmp_reg));
*insn++ = BPF_LDX_MEM(
BPF_FIELD_SIZEOF(struct bpf_sysctl_kern, ppos),
treg, si->dst_reg,
offsetof(struct bpf_sysctl_kern, ppos));
*insn++ = BPF_STX_MEM(
BPF_SIZEOF(u32), treg, si->src_reg, 0);
*insn++ = BPF_LDX_MEM(
BPF_DW, treg, si->dst_reg,
offsetof(struct bpf_sysctl_kern, tmp_reg));
} else {
*insn++ = BPF_LDX_MEM(
BPF_FIELD_SIZEOF(struct bpf_sysctl_kern, ppos),
si->dst_reg, si->src_reg,
offsetof(struct bpf_sysctl_kern, ppos));
*insn++ = BPF_LDX_MEM(
BPF_SIZE(si->code), si->dst_reg, si->dst_reg, 0);
}
*target_size = sizeof(u32);
break;
}
return insn - insn_buf;
}
const struct bpf_verifier_ops cg_sysctl_verifier_ops = {
.get_func_proto = sysctl_func_proto,
.is_valid_access = sysctl_is_valid_access,
.convert_ctx_access = sysctl_convert_ctx_access,
};
const struct bpf_prog_ops cg_sysctl_prog_ops = {
};
......@@ -160,12 +160,12 @@ static void cpu_map_kthread_stop(struct work_struct *work)
}
static struct sk_buff *cpu_map_build_skb(struct bpf_cpu_map_entry *rcpu,
struct xdp_frame *xdpf)
struct xdp_frame *xdpf,
struct sk_buff *skb)
{
unsigned int hard_start_headroom;
unsigned int frame_size;
void *pkt_data_start;
struct sk_buff *skb;
/* Part of headroom was reserved to xdpf */
hard_start_headroom = sizeof(struct xdp_frame) + xdpf->headroom;
......@@ -191,8 +191,8 @@ static struct sk_buff *cpu_map_build_skb(struct bpf_cpu_map_entry *rcpu,
SKB_DATA_ALIGN(sizeof(struct skb_shared_info));
pkt_data_start = xdpf->data - hard_start_headroom;
skb = build_skb(pkt_data_start, frame_size);
if (!skb)
skb = build_skb_around(skb, pkt_data_start, frame_size);
if (unlikely(!skb))
return NULL;
skb_reserve(skb, hard_start_headroom);
......@@ -240,6 +240,8 @@ static void put_cpu_map_entry(struct bpf_cpu_map_entry *rcpu)
}
}
#define CPUMAP_BATCH 8
static int cpu_map_kthread_run(void *data)
{
struct bpf_cpu_map_entry *rcpu = data;
......@@ -252,8 +254,11 @@ static int cpu_map_kthread_run(void *data)
* kthread_stop signal until queue is empty.
*/
while (!kthread_should_stop() || !__ptr_ring_empty(rcpu->queue)) {
unsigned int processed = 0, drops = 0, sched = 0;
struct xdp_frame *xdpf;
unsigned int drops = 0, sched = 0;
void *frames[CPUMAP_BATCH];
void *skbs[CPUMAP_BATCH];
gfp_t gfp = __GFP_ZERO | GFP_ATOMIC;
int i, n, m;
/* Release CPU reschedule checks */
if (__ptr_ring_empty(rcpu->queue)) {
......@@ -269,18 +274,38 @@ static int cpu_map_kthread_run(void *data)
sched = cond_resched();
}
/* Process packets in rcpu->queue */
local_bh_disable();
/*
* The bpf_cpu_map_entry is single consumer, with this
* kthread CPU pinned. Lockless access to the ptr_ring
* consume side is valid because the queue is never resized.
*/
while ((xdpf = __ptr_ring_consume(rcpu->queue))) {
struct sk_buff *skb;
n = ptr_ring_consume_batched(rcpu->queue, frames, CPUMAP_BATCH);
for (i = 0; i < n; i++) {
void *f = frames[i];
struct page *page = virt_to_page(f);
/* Bring struct page memory area to curr CPU. Read by
* build_skb_around via page_is_pfmemalloc(), and when
* freed written by page_frag_free call.
*/
prefetchw(page);
}
m = kmem_cache_alloc_bulk(skbuff_head_cache, gfp, n, skbs);
if (unlikely(m == 0)) {
for (i = 0; i < n; i++)
skbs[i] = NULL; /* effect: xdp_return_frame */
drops = n;
}
local_bh_disable();
for (i = 0; i < n; i++) {
struct xdp_frame *xdpf = frames[i];
struct sk_buff *skb = skbs[i];
int ret;
skb = cpu_map_build_skb(rcpu, xdpf);
skb = cpu_map_build_skb(rcpu, xdpf, skb);
if (!skb) {
xdp_return_frame(xdpf);
continue;
......@@ -290,13 +315,9 @@ static int cpu_map_kthread_run(void *data)
ret = netif_receive_skb_core(skb);
if (ret == NET_RX_DROP)
drops++;
/* Limit BH-disable period */
if (++processed == 8)
break;
}
/* Feedback loop via tracepoint */
trace_xdp_cpumap_kthread(rcpu->map_id, processed, drops, sched);
trace_xdp_cpumap_kthread(rcpu->map_id, n, drops, sched);
local_bh_enable(); /* resched point, may call do_softirq() */
}
......
......@@ -18,6 +18,9 @@
#include <linux/sched.h>
#include <linux/uidgid.h>
#include <linux/filter.h>
#include <linux/ctype.h>
#include "../../lib/kstrtox.h"
/* If kernel subsystem is allowing eBPF programs to call this function,
* inside its own verifier_ops->get_func_proto() callback it should return
......@@ -363,4 +366,132 @@ const struct bpf_func_proto bpf_get_local_storage_proto = {
.arg2_type = ARG_ANYTHING,
};
#endif
#define BPF_STRTOX_BASE_MASK 0x1F
static int __bpf_strtoull(const char *buf, size_t buf_len, u64 flags,
unsigned long long *res, bool *is_negative)
{
unsigned int base = flags & BPF_STRTOX_BASE_MASK;
const char *cur_buf = buf;
size_t cur_len = buf_len;
unsigned int consumed;
size_t val_len;
char str[64];
if (!buf || !buf_len || !res || !is_negative)
return -EINVAL;
if (base != 0 && base != 8 && base != 10 && base != 16)
return -EINVAL;
if (flags & ~BPF_STRTOX_BASE_MASK)
return -EINVAL;
while (cur_buf < buf + buf_len && isspace(*cur_buf))
++cur_buf;
*is_negative = (cur_buf < buf + buf_len && *cur_buf == '-');
if (*is_negative)
++cur_buf;
consumed = cur_buf - buf;
cur_len -= consumed;
if (!cur_len)
return -EINVAL;
cur_len = min(cur_len, sizeof(str) - 1);
memcpy(str, cur_buf, cur_len);
str[cur_len] = '\0';
cur_buf = str;
cur_buf = _parse_integer_fixup_radix(cur_buf, &base);
val_len = _parse_integer(cur_buf, base, res);
if (val_len & KSTRTOX_OVERFLOW)
return -ERANGE;
if (val_len == 0)
return -EINVAL;
cur_buf += val_len;
consumed += cur_buf - str;
return consumed;
}
static int __bpf_strtoll(const char *buf, size_t buf_len, u64 flags,
long long *res)
{
unsigned long long _res;
bool is_negative;
int err;
err = __bpf_strtoull(buf, buf_len, flags, &_res, &is_negative);
if (err < 0)
return err;
if (is_negative) {
if ((long long)-_res > 0)
return -ERANGE;
*res = -_res;
} else {
if ((long long)_res < 0)
return -ERANGE;
*res = _res;
}
return err;
}
BPF_CALL_4(bpf_strtol, const char *, buf, size_t, buf_len, u64, flags,
long *, res)
{
long long _res;
int err;
err = __bpf_strtoll(buf, buf_len, flags, &_res);
if (err < 0)
return err;
if (_res != (long)_res)
return -ERANGE;
*res = _res;
return err;
}
const struct bpf_func_proto bpf_strtol_proto = {
.func = bpf_strtol,
.gpl_only = false,
.ret_type = RET_INTEGER,
.arg1_type = ARG_PTR_TO_MEM,
.arg2_type = ARG_CONST_SIZE,
.arg3_type = ARG_ANYTHING,
.arg4_type = ARG_PTR_TO_LONG,
};
BPF_CALL_4(bpf_strtoul, const char *, buf, size_t, buf_len, u64, flags,
unsigned long *, res)
{
unsigned long long _res;
bool is_negative;
int err;
err = __bpf_strtoull(buf, buf_len, flags, &_res, &is_negative);
if (err < 0)
return err;
if (is_negative)
return -EINVAL;
if (_res != (unsigned long)_res)
return -ERANGE;
*res = _res;
return err;
}
const struct bpf_func_proto bpf_strtoul_proto = {
.func = bpf_strtoul,
.gpl_only = false,
.ret_type = RET_INTEGER,
.arg1_type = ARG_PTR_TO_MEM,
.arg2_type = ARG_CONST_SIZE,
.arg3_type = ARG_ANYTHING,
.arg4_type = ARG_PTR_TO_LONG,
};
#endif
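The parsing contract of the new helpers (skip leading whitespace, accept an optional `-`, detect the radix, return the number of bytes consumed or a negative errno) can be sketched in plain userspace C. This is an illustration under stated assumptions, not the kernel implementation: `parse_u64` and `parse_s64` are hypothetical names, the digit loop is hand-written instead of calling `_parse_integer()`, and the 64-byte copy limit of the kernel version is not modelled.

```c
#include <assert.h>
#include <ctype.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Parse an unsigned number from a length-bounded, not necessarily
 * NUL-terminated buffer. Returns bytes consumed or a negative errno.
 */
static int parse_u64(const char *buf, size_t len, unsigned int base,
		     uint64_t *res, int *is_negative)
{
	size_t i = 0, start;
	uint64_t val = 0;

	if (!buf || !len || !res || !is_negative)
		return -EINVAL;
	if (base != 0 && base != 8 && base != 10 && base != 16)
		return -EINVAL;

	while (i < len && isspace((unsigned char)buf[i]))
		i++;
	*is_negative = (i < len && buf[i] == '-');
	if (*is_negative)
		i++;
	if (i == len)
		return -EINVAL;

	/* Radix detection: "0x<hex digit>" is hex, a leading "0" is octal. */
	if (base == 0) {
		if (buf[i] != '0')
			base = 10;
		else if (i + 2 < len &&
			 tolower((unsigned char)buf[i + 1]) == 'x' &&
			 isxdigit((unsigned char)buf[i + 2]))
			base = 16;
		else
			base = 8;
	}
	if (base == 16 && i + 1 < len && buf[i] == '0' &&
	    tolower((unsigned char)buf[i + 1]) == 'x')
		i += 2;

	start = i;
	for (; i < len; i++) {
		unsigned char c = (unsigned char)buf[i];
		unsigned int d;

		if (isdigit(c))
			d = c - '0';
		else if (isxdigit(c))
			d = tolower(c) - 'a' + 10;
		else
			break;
		if (d >= base)
			break;
		if (val > (UINT64_MAX - d) / base)
			return -ERANGE;	/* val * base + d would overflow */
		val = val * base + d;
	}
	if (i == start)
		return -EINVAL;		/* no digits at all */
	*res = val;
	return (int)i;
}

/* Signed wrapper: range-check the magnitude before applying the sign. */
static int parse_s64(const char *buf, size_t len, unsigned int base,
		     int64_t *res)
{
	uint64_t u;
	int neg, n;

	n = parse_u64(buf, len, base, &u, &neg);
	if (n < 0)
		return n;
	if (neg) {
		if (u > (uint64_t)INT64_MAX + 1)
			return -ERANGE;
		*res = (u == (uint64_t)INT64_MAX + 1) ? INT64_MIN : -(int64_t)u;
	} else {
		if (u > (uint64_t)INT64_MAX)
			return -ERANGE;
		*res = (int64_t)u;
	}
	return n;
}
```

Returning the consumed length rather than 0 lets a caller walk a buffer holding several whitespace-separated values, which is the common shape of sysctl data.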
......@@ -1888,6 +1888,9 @@ static int bpf_prog_attach(const union bpf_attr *attr)
case BPF_FLOW_DISSECTOR:
ptype = BPF_PROG_TYPE_FLOW_DISSECTOR;
break;
case BPF_CGROUP_SYSCTL:
ptype = BPF_PROG_TYPE_CGROUP_SYSCTL;
break;
default:
return -EINVAL;
}
......@@ -1966,6 +1969,9 @@ static int bpf_prog_detach(const union bpf_attr *attr)
return lirc_prog_detach(attr);
case BPF_FLOW_DISSECTOR:
return skb_flow_dissector_bpf_prog_detach(attr);
case BPF_CGROUP_SYSCTL:
ptype = BPF_PROG_TYPE_CGROUP_SYSCTL;
break;
default:
return -EINVAL;
}
......@@ -1999,6 +2005,7 @@ static int bpf_prog_query(const union bpf_attr *attr,
case BPF_CGROUP_UDP6_SENDMSG:
case BPF_CGROUP_SOCK_OPS:
case BPF_CGROUP_DEVICE:
case BPF_CGROUP_SYSCTL:
break;
case BPF_LIRC_MODE2:
return lirc_prog_query(attr, uattr);
......
......@@ -1177,30 +1177,32 @@ static int check_reg_arg(struct bpf_verifier_env *env, u32 regno,
{
struct bpf_verifier_state *vstate = env->cur_state;
struct bpf_func_state *state = vstate->frame[vstate->curframe];
struct bpf_reg_state *regs = state->regs;
struct bpf_reg_state *reg, *regs = state->regs;
if (regno >= MAX_BPF_REG) {
verbose(env, "R%d is invalid\n", regno);
return -EINVAL;
}
reg = &regs[regno];
if (t == SRC_OP) {
/* check whether register used as source operand can be read */
if (regs[regno].type == NOT_INIT) {
if (reg->type == NOT_INIT) {
verbose(env, "R%d !read_ok\n", regno);
return -EACCES;
}
/* We don't need to worry about FP liveness because it's read-only */
if (regno != BPF_REG_FP)
return mark_reg_read(env, &regs[regno],
regs[regno].parent);
if (regno == BPF_REG_FP)
return 0;
return mark_reg_read(env, reg, reg->parent);
} else {
/* check whether register used as dest operand can be written to */
if (regno == BPF_REG_FP) {
verbose(env, "frame pointer is read only\n");
return -EACCES;
}
regs[regno].live |= REG_LIVE_WRITTEN;
reg->live |= REG_LIVE_WRITTEN;
if (t == DST_OP)
mark_reg_unknown(env, regs, regno);
}
......@@ -2462,6 +2464,22 @@ static bool arg_type_is_mem_size(enum bpf_arg_type type)
type == ARG_CONST_SIZE_OR_ZERO;
}
static bool arg_type_is_int_ptr(enum bpf_arg_type type)
{
return type == ARG_PTR_TO_INT ||
type == ARG_PTR_TO_LONG;
}
static int int_ptr_type_to_size(enum bpf_arg_type type)
{
if (type == ARG_PTR_TO_INT)
return sizeof(u32);
else if (type == ARG_PTR_TO_LONG)
return sizeof(u64);
return -EINVAL;
}
static int check_func_arg(struct bpf_verifier_env *env, u32 regno,
enum bpf_arg_type arg_type,
struct bpf_call_arg_meta *meta)
......@@ -2554,6 +2572,12 @@ static int check_func_arg(struct bpf_verifier_env *env, u32 regno,
type != expected_type)
goto err_type;
meta->raw_mode = arg_type == ARG_PTR_TO_UNINIT_MEM;
} else if (arg_type_is_int_ptr(arg_type)) {
expected_type = PTR_TO_STACK;
if (!type_is_pkt_pointer(type) &&
type != PTR_TO_MAP_VALUE &&
type != expected_type)
goto err_type;
} else {
verbose(env, "unsupported arg_type %d\n", arg_type);
return -EFAULT;
......@@ -2635,6 +2659,13 @@ static int check_func_arg(struct bpf_verifier_env *env, u32 regno,
err = check_helper_mem_access(env, regno - 1,
reg->umax_value,
zero_size_allowed, meta);
} else if (arg_type_is_int_ptr(arg_type)) {
int size = int_ptr_type_to_size(arg_type);
err = check_helper_mem_access(env, regno, size, false, meta);
if (err)
return err;
err = check_ptr_alignment(env, reg, 0, size, true);
}
return err;
......@@ -5267,6 +5298,7 @@ static int check_return_code(struct bpf_verifier_env *env)
case BPF_PROG_TYPE_CGROUP_SOCK_ADDR:
case BPF_PROG_TYPE_SOCK_OPS:
case BPF_PROG_TYPE_CGROUP_DEVICE:
case BPF_PROG_TYPE_CGROUP_SYSCTL:
break;
default:
return 0;
......@@ -5337,10 +5369,6 @@ enum {
#define STATE_LIST_MARK ((struct bpf_verifier_state_list *) -1L)
static int *insn_stack; /* stack of insns to process */
static int cur_stack; /* current stack index */
static int *insn_state;
/* t, w, e - match pseudo-code above:
* t - index of current instruction
* w - next instruction
......@@ -5348,6 +5376,9 @@ static int *insn_state;
*/
static int push_insn(int t, int w, int e, struct bpf_verifier_env *env)
{
int *insn_stack = env->cfg.insn_stack;
int *insn_state = env->cfg.insn_state;
if (e == FALLTHROUGH && insn_state[t] >= (DISCOVERED | FALLTHROUGH))
return 0;
......@@ -5368,9 +5399,9 @@ static int push_insn(int t, int w, int e, struct bpf_verifier_env *env)
/* tree-edge */
insn_state[t] = DISCOVERED | e;
insn_state[w] = DISCOVERED;
if (cur_stack >= env->prog->len)
if (env->cfg.cur_stack >= env->prog->len)
return -E2BIG;
insn_stack[cur_stack++] = w;
insn_stack[env->cfg.cur_stack++] = w;
return 1;
} else if ((insn_state[w] & 0xF0) == DISCOVERED) {
verbose_linfo(env, t, "%d: ", t);
......@@ -5394,14 +5425,15 @@ static int check_cfg(struct bpf_verifier_env *env)
{
struct bpf_insn *insns = env->prog->insnsi;
int insn_cnt = env->prog->len;
int *insn_stack, *insn_state;
int ret = 0;
int i, t;
insn_state = kvcalloc(insn_cnt, sizeof(int), GFP_KERNEL);
insn_state = env->cfg.insn_state = kvcalloc(insn_cnt, sizeof(int), GFP_KERNEL);
if (!insn_state)
return -ENOMEM;
insn_stack = kvcalloc(insn_cnt, sizeof(int), GFP_KERNEL);
insn_stack = env->cfg.insn_stack = kvcalloc(insn_cnt, sizeof(int), GFP_KERNEL);
if (!insn_stack) {
kvfree(insn_state);
return -ENOMEM;
......@@ -5409,12 +5441,12 @@ static int check_cfg(struct bpf_verifier_env *env)
insn_state[0] = DISCOVERED; /* mark 1st insn as discovered */
insn_stack[0] = 0; /* 0 is the first instruction */
cur_stack = 1;
env->cfg.cur_stack = 1;
peek_stack:
if (cur_stack == 0)
if (env->cfg.cur_stack == 0)
goto check_state;
t = insn_stack[cur_stack - 1];
t = insn_stack[env->cfg.cur_stack - 1];
if (BPF_CLASS(insns[t].code) == BPF_JMP ||
BPF_CLASS(insns[t].code) == BPF_JMP32) {
......@@ -5483,7 +5515,7 @@ static int check_cfg(struct bpf_verifier_env *env)
mark_explored:
insn_state[t] = EXPLORED;
if (cur_stack-- <= 0) {
if (env->cfg.cur_stack-- <= 0) {
verbose(env, "pop stack internal bug\n");
ret = -EFAULT;
goto err_free;
......@@ -5503,6 +5535,7 @@ static int check_cfg(struct bpf_verifier_env *env)
err_free:
kvfree(insn_state);
kvfree(insn_stack);
env->cfg.insn_state = env->cfg.insn_stack = NULL;
return ret;
}
......@@ -6191,6 +6224,22 @@ static bool states_equal(struct bpf_verifier_env *env,
return true;
}
static int propagate_liveness_reg(struct bpf_verifier_env *env,
struct bpf_reg_state *reg,
struct bpf_reg_state *parent_reg)
{
int err;
if (parent_reg->live & REG_LIVE_READ || !(reg->live & REG_LIVE_READ))
return 0;
err = mark_reg_read(env, reg, parent_reg);
if (err)
return err;
return 0;
}
/* A write screens off any subsequent reads; but write marks come from the
* straight-line code between a state and its parent. When we arrive at an
* equivalent state (jump target or such) we didn't arrive by the straight-line
......@@ -6202,8 +6251,9 @@ static int propagate_liveness(struct bpf_verifier_env *env,
const struct bpf_verifier_state *vstate,
struct bpf_verifier_state *vparent)
{
int i, frame, err = 0;
struct bpf_reg_state *state_reg, *parent_reg;
struct bpf_func_state *state, *parent;
int i, frame, err = 0;
if (vparent->curframe != vstate->curframe) {
WARN(1, "propagate_live: parent frame %d current frame %d\n",
......@@ -6213,30 +6263,27 @@ static int propagate_liveness(struct bpf_verifier_env *env,
/* Propagate read liveness of registers... */
BUILD_BUG_ON(BPF_REG_FP + 1 != MAX_BPF_REG);
for (frame = 0; frame <= vstate->curframe; frame++) {
parent = vparent->frame[frame];
state = vstate->frame[frame];
parent_reg = parent->regs;
state_reg = state->regs;
/* We don't need to worry about FP liveness, it's read-only */
for (i = frame < vstate->curframe ? BPF_REG_6 : 0; i < BPF_REG_FP; i++) {
if (vparent->frame[frame]->regs[i].live & REG_LIVE_READ)
continue;
if (vstate->frame[frame]->regs[i].live & REG_LIVE_READ) {
err = mark_reg_read(env, &vstate->frame[frame]->regs[i],
&vparent->frame[frame]->regs[i]);
if (err)
return err;
}
err = propagate_liveness_reg(env, &state_reg[i],
&parent_reg[i]);
if (err)
return err;
}
}
/* ... and stack slots */
for (frame = 0; frame <= vstate->curframe; frame++) {
state = vstate->frame[frame];
parent = vparent->frame[frame];
/* Propagate stack slots. */
for (i = 0; i < state->allocated_stack / BPF_REG_SIZE &&
i < parent->allocated_stack / BPF_REG_SIZE; i++) {
if (parent->stack[i].spilled_ptr.live & REG_LIVE_READ)
continue;
if (state->stack[i].spilled_ptr.live & REG_LIVE_READ)
mark_reg_read(env, &state->stack[i].spilled_ptr,
&parent->stack[i].spilled_ptr);
parent_reg = &parent->stack[i].spilled_ptr;
state_reg = &state->stack[i].spilled_ptr;
err = propagate_liveness_reg(env, state_reg,
parent_reg);
if (err)
return err;
}
}
return err;
......@@ -7601,9 +7648,8 @@ static int jit_subprogs(struct bpf_verifier_env *env)
insn->src_reg != BPF_PSEUDO_CALL)
continue;
subprog = insn->off;
insn->imm = (u64 (*)(u64, u64, u64, u64, u64))
func[subprog]->bpf_func -
__bpf_call_base;
insn->imm = BPF_CAST_CALL(func[subprog]->bpf_func) -
__bpf_call_base;
}
/* we use the aux data to keep a list of the start addresses
......@@ -8086,9 +8132,11 @@ int bpf_check(struct bpf_prog **prog, union bpf_attr *attr,
env->insn_aux_data[i].orig_idx = i;
env->prog = *prog;
env->ops = bpf_verifier_ops[env->prog->type];
is_priv = capable(CAP_SYS_ADMIN);
/* grab the mutex to protect few globals used by verifier */
mutex_lock(&bpf_verifier_lock);
if (!is_priv)
mutex_lock(&bpf_verifier_lock);
if (attr->log_level || attr->log_buf || attr->log_size) {
/* user requested verbose verifier output
......@@ -8111,7 +8159,6 @@ int bpf_check(struct bpf_prog **prog, union bpf_attr *attr,
if (attr->prog_flags & BPF_F_ANY_ALIGNMENT)
env->strict_alignment = false;
is_priv = capable(CAP_SYS_ADMIN);
env->allow_ptr_leaks = is_priv;
ret = replace_map_fd_with_map_ptr(env);
......@@ -8224,7 +8271,8 @@ int bpf_check(struct bpf_prog **prog, union bpf_attr *attr,
release_maps(env);
*prog = env->prog;
err_unlock:
mutex_unlock(&bpf_verifier_lock);
if (!is_priv)
mutex_unlock(&bpf_verifier_lock);
vfree(env->insn_aux_data);
err_free_env:
kfree(env);
......
......@@ -569,6 +569,12 @@ tracing_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
return &bpf_map_update_elem_proto;
case BPF_FUNC_map_delete_elem:
return &bpf_map_delete_elem_proto;
case BPF_FUNC_map_push_elem:
return &bpf_map_push_elem_proto;
case BPF_FUNC_map_pop_elem:
return &bpf_map_pop_elem_proto;
case BPF_FUNC_map_peek_elem:
return &bpf_map_peek_elem_proto;
case BPF_FUNC_probe_read:
return &bpf_probe_read_proto;
case BPF_FUNC_ktime_get_ns:
......
......@@ -3069,6 +3069,9 @@ static int bpf_skb_net_shrink(struct sk_buff *skb, u32 off, u32 len_diff,
{
int ret;
if (flags & ~BPF_F_ADJ_ROOM_FIXED_GSO)
return -EINVAL;
if (skb_is_gso(skb) && !skb_is_gso_tcp(skb)) {
/* udp gso_size delineates datagrams, only allow if fixed */
if (!(skb_shinfo(skb)->gso_type & SKB_GSO_UDP_L4) ||
......@@ -4434,8 +4437,7 @@ BPF_CALL_2(bpf_sock_ops_cb_flags_set, struct bpf_sock_ops_kern *, bpf_sock,
if (!IS_ENABLED(CONFIG_INET) || !sk_fullsock(sk))
return -EINVAL;
if (val)
tcp_sk(sk)->bpf_sock_ops_cb_flags = val;
tcp_sk(sk)->bpf_sock_ops_cb_flags = val;
return argval & (~BPF_SOCK_OPS_ALL_CB_FLAGS);
}
......
......@@ -258,6 +258,33 @@ struct sk_buff *__alloc_skb(unsigned int size, gfp_t gfp_mask,
}
EXPORT_SYMBOL(__alloc_skb);
/* Caller must provide SKB that is memset cleared */
static struct sk_buff *__build_skb_around(struct sk_buff *skb,
void *data, unsigned int frag_size)
{
struct skb_shared_info *shinfo;
unsigned int size = frag_size ? : ksize(data);
size -= SKB_DATA_ALIGN(sizeof(struct skb_shared_info));
/* Assumes caller memset cleared SKB */
skb->truesize = SKB_TRUESIZE(size);
refcount_set(&skb->users, 1);
skb->head = data;
skb->data = data;
skb_reset_tail_pointer(skb);
skb->end = skb->tail + size;
skb->mac_header = (typeof(skb->mac_header))~0U;
skb->transport_header = (typeof(skb->transport_header))~0U;
/* make sure we initialize shinfo sequentially */
shinfo = skb_shinfo(skb);
memset(shinfo, 0, offsetof(struct skb_shared_info, dataref));
atomic_set(&shinfo->dataref, 1);
return skb;
}
/**
* __build_skb - build a network buffer
* @data: data buffer provided by caller
......@@ -279,32 +306,15 @@ EXPORT_SYMBOL(__alloc_skb);
*/
struct sk_buff *__build_skb(void *data, unsigned int frag_size)
{
struct skb_shared_info *shinfo;
struct sk_buff *skb;
unsigned int size = frag_size ? : ksize(data);
skb = kmem_cache_alloc(skbuff_head_cache, GFP_ATOMIC);
if (!skb)
if (unlikely(!skb))
return NULL;
size -= SKB_DATA_ALIGN(sizeof(struct skb_shared_info));
memset(skb, 0, offsetof(struct sk_buff, tail));
skb->truesize = SKB_TRUESIZE(size);
refcount_set(&skb->users, 1);
skb->head = data;
skb->data = data;
skb_reset_tail_pointer(skb);
skb->end = skb->tail + size;
skb->mac_header = (typeof(skb->mac_header))~0U;
skb->transport_header = (typeof(skb->transport_header))~0U;
/* make sure we initialize shinfo sequentially */
shinfo = skb_shinfo(skb);
memset(shinfo, 0, offsetof(struct skb_shared_info, dataref));
atomic_set(&shinfo->dataref, 1);
return skb;
return __build_skb_around(skb, data, frag_size);
}
/* build_skb() is wrapper over __build_skb(), that specifically
......@@ -325,6 +335,29 @@ struct sk_buff *build_skb(void *data, unsigned int frag_size)
}
EXPORT_SYMBOL(build_skb);
/**
* build_skb_around - build a network buffer around provided skb
* @skb: sk_buff provided by caller, must be memset cleared
* @data: data buffer provided by caller
* @frag_size: size of data, or 0 if head was kmalloced
*/
struct sk_buff *build_skb_around(struct sk_buff *skb,
void *data, unsigned int frag_size)
{
if (unlikely(!skb))
return NULL;
skb = __build_skb_around(skb, data, frag_size);
if (skb && frag_size) {
skb->head_frag = 1;
if (page_is_pfmemalloc(virt_to_head_page(data)))
skb->pfmemalloc = 1;
}
return skb;
}
EXPORT_SYMBOL(build_skb_around);
#define NAPI_SKB_CACHE_SIZE 64
struct napi_alloc_cache {
......
......@@ -43,6 +43,48 @@ struct xsk_queue {
u64 invalid_descs;
};
/* The structure of the shared state of the rings is the same as the
* ring buffer in kernel/events/ring_buffer.c. For the Rx and completion
* ring, the kernel is the producer and user space is the consumer. For
* the Tx and fill rings, the kernel is the consumer and user space is
* the producer.
*
* producer consumer
*
* if (LOAD ->consumer) { LOAD ->producer
* (A) smp_rmb() (C)
* STORE $data LOAD $data
* smp_wmb() (B) smp_mb() (D)
* STORE ->producer STORE ->consumer
* }
*
* (A) pairs with (D), and (B) pairs with (C).
*
* Starting with (B), it protects the data from being written after
* the producer pointer. If this barrier was missing, the consumer
* could observe the producer pointer being set and thus load the data
* before the producer has written the new data. The consumer would in
* this case load the old data.
*
* (C) protects the consumer from speculatively loading the data before
* the producer pointer actually has been read. If we do not have this
* barrier, some architectures could load old data as speculative loads
* are not discarded as the CPU does not know there is a dependency
* between ->producer and data.
*
* (A) is a control dependency that separates the load of ->consumer
* from the stores of $data. In case ->consumer indicates there is no
* room in the buffer to store $data we do not. So no barrier is needed.
*
* (D) protects the load of the data to be observed to happen after the
* store of the consumer pointer. If we did not have this memory
* barrier, the producer could observe the consumer pointer being set
* and overwrite the data with a new value before the consumer got the
* chance to read the old value. The consumer would thus miss reading
* the old entry and very likely read the new entry twice, once right
* now and again after circling through the ring.
*/
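The pairing described in the comment above can be sketched as a single-producer/single-consumer ring in userspace C11. This is an assumption-laden illustration, not the xsk code: `struct ring`, `ring_produce` and `ring_consume` are hypothetical names, and C11 acquire/release operations stand in for the kernel's `smp_wmb()`, `smp_rmb()` and `smp_mb()`. In particular, (A) is a plain control dependency in the kernel, while the sketch uses an acquire load as a conservative substitute.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

#define RING_SIZE 8u			/* must be a power of two */
#define RING_MASK (RING_SIZE - 1)

struct ring {
	_Atomic uint32_t producer;	/* written only by the producer */
	_Atomic uint32_t consumer;	/* written only by the consumer */
	uint64_t desc[RING_SIZE];
};

/* Producer side: returns 0 on success, -1 if the ring is full. */
static int ring_produce(struct ring *r, uint64_t data)
{
	uint32_t prod = atomic_load_explicit(&r->producer,
					     memory_order_relaxed);
	/* (A): read ->consumer before deciding to store data. */
	uint32_t cons = atomic_load_explicit(&r->consumer,
					     memory_order_acquire);

	if (prod - cons == RING_SIZE)
		return -1;
	r->desc[prod & RING_MASK] = data;	/* STORE $data */
	/* (B): the data store becomes visible before the new producer index. */
	atomic_store_explicit(&r->producer, prod + 1, memory_order_release);
	return 0;
}

/* Consumer side: returns 0 on success, -1 if the ring is empty. */
static int ring_consume(struct ring *r, uint64_t *data)
{
	uint32_t cons = atomic_load_explicit(&r->consumer,
					     memory_order_relaxed);
	/* (C): read ->producer before loading the data it publishes. */
	uint32_t prod = atomic_load_explicit(&r->producer,
					     memory_order_acquire);

	if (prod == cons)
		return -1;
	*data = r->desc[cons & RING_MASK];	/* LOAD $data */
	/* (D): the data load completes before the slot is handed back. */
	atomic_store_explicit(&r->consumer, cons + 1, memory_order_release);
	return 0;
}
```

The indices are free-running 32-bit counters masked on access, so `prod - cons` gives the fill level even after wrap-around, the same scheme the `q->ring_mask` accesses in the hunks below rely on.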
/* Common functions operating for both RXTX and umem queues */
static inline u64 xskq_nb_invalid_descs(struct xsk_queue *q)
......@@ -106,6 +148,7 @@ static inline u64 *xskq_validate_addr(struct xsk_queue *q, u64 *addr)
static inline u64 *xskq_peek_addr(struct xsk_queue *q, u64 *addr)
{
if (q->cons_tail == q->cons_head) {
smp_mb(); /* D, matches A */
WRITE_ONCE(q->ring->consumer, q->cons_tail);
q->cons_head = q->cons_tail + xskq_nb_avail(q, RX_BATCH_SIZE);
......@@ -128,10 +171,11 @@ static inline int xskq_produce_addr(struct xsk_queue *q, u64 addr)
if (xskq_nb_free(q, q->prod_tail, 1) == 0)
return -ENOSPC;
/* A, matches D */
ring->desc[q->prod_tail++ & q->ring_mask] = addr;
/* Order producer and data */
smp_wmb();
smp_wmb(); /* B, matches C */
WRITE_ONCE(q->ring->producer, q->prod_tail);
return 0;
......@@ -144,6 +188,7 @@ static inline int xskq_produce_addr_lazy(struct xsk_queue *q, u64 addr)
if (xskq_nb_free(q, q->prod_head, LAZY_UPDATE_THRESHOLD) == 0)
return -ENOSPC;
/* A, matches D */
ring->desc[q->prod_head++ & q->ring_mask] = addr;
return 0;
}
......@@ -152,7 +197,7 @@ static inline void xskq_produce_flush_addr_n(struct xsk_queue *q,
u32 nb_entries)
{
/* Order producer and data */
smp_wmb();
smp_wmb(); /* B, matches C */
q->prod_tail += nb_entries;
WRITE_ONCE(q->ring->producer, q->prod_tail);
......@@ -163,6 +208,7 @@ static inline int xskq_reserve_addr(struct xsk_queue *q)
if (xskq_nb_free(q, q->prod_head, 1) == 0)
return -ENOSPC;
/* A, matches D */
q->prod_head++;
return 0;
}
......@@ -204,11 +250,12 @@ static inline struct xdp_desc *xskq_peek_desc(struct xsk_queue *q,
struct xdp_desc *desc)
{
if (q->cons_tail == q->cons_head) {
smp_mb(); /* D, matches A */
WRITE_ONCE(q->ring->consumer, q->cons_tail);
q->cons_head = q->cons_tail + xskq_nb_avail(q, RX_BATCH_SIZE);
/* Order consumer and data */
smp_rmb();
smp_rmb(); /* C, matches B */
}
return xskq_validate_desc(q, desc);
......@@ -228,6 +275,7 @@ static inline int xskq_produce_batch_desc(struct xsk_queue *q,
if (xskq_nb_free(q, q->prod_head, 1) == 0)
return -ENOSPC;
/* A, matches D */
idx = (q->prod_head++) & q->ring_mask;
ring->desc[idx].addr = addr;
ring->desc[idx].len = len;
......@@ -238,7 +286,7 @@ static inline int xskq_produce_batch_desc(struct xsk_queue *q,
static inline void xskq_produce_flush_desc(struct xsk_queue *q)
{
/* Order producer and data */
smp_wmb();
smp_wmb(); /* B, matches C */
q->prod_tail = q->prod_head;
WRITE_ONCE(q->ring->producer, q->prod_tail);
......
......@@ -99,7 +99,7 @@ gen_btf()
pahole_ver=$(${PAHOLE} --version | sed -E 's/v([0-9]+)\.([0-9]+)/\1\2/')
if [ "${pahole_ver}" -lt "113" ]; then
info "BTF" "${1}: pahole version $(${PAHOLE} --version) is too old, need at least v1.13"
exit 0
return 0
fi
info "BTF" ${1}
......
......@@ -29,7 +29,7 @@ CGROUP COMMANDS
| *PROG* := { **id** *PROG_ID* | **pinned** *FILE* | **tag** *PROG_TAG* }
| *ATTACH_TYPE* := { **ingress** | **egress** | **sock_create** | **sock_ops** | **device** |
| **bind4** | **bind6** | **post_bind4** | **post_bind6** | **connect4** | **connect6** |
| **sendmsg4** | **sendmsg6** }
| **sendmsg4** | **sendmsg6** | **sysctl** }
| *ATTACH_FLAGS* := { **multi** | **override** }
DESCRIPTION
......@@ -85,7 +85,8 @@ DESCRIPTION
**sendmsg4** call to sendto(2), sendmsg(2), sendmmsg(2) for an
unconnected udp4 socket (since 4.18);
**sendmsg6** call to sendto(2), sendmsg(2), sendmmsg(2) for an
unconnected udp6 socket (since 4.18).
unconnected udp6 socket (since 4.18);
**sysctl** sysctl access (since 5.2).
**bpftool cgroup detach** *CGROUP* *ATTACH_TYPE* *PROG*
Detach *PROG* from the cgroup *CGROUP* and attach type
......@@ -99,7 +100,7 @@ OPTIONS
-h, --help
Print short generic help message (similar to **bpftool help**).
-v, --version
-V, --version
Print version number (similar to **bpftool version**).
-j, --json
......
......@@ -63,7 +63,7 @@ OPTIONS
-h, --help
Print short generic help message (similar to **bpftool help**).
-v, --version
-V, --version
Print version number (similar to **bpftool version**).
-j, --json
......
......@@ -135,7 +135,7 @@ OPTIONS
-h, --help
Print short generic help message (similar to **bpftool help**).
-v, --version
-V, --version
Print version number (similar to **bpftool version**).
-j, --json
......
......@@ -55,7 +55,7 @@ OPTIONS
-h, --help
Print short generic help message (similar to **bpftool help**).
-v, --version
-V, --version
Print version number (similar to **bpftool version**).
-j, --json
......
......@@ -43,7 +43,7 @@ OPTIONS
-h, --help
Print short generic help message (similar to **bpftool help**).
-v, --version
-V, --version
Print version number (similar to **bpftool version**).
-j, --json
......
......@@ -25,7 +25,7 @@ PROG COMMANDS
| **bpftool** **prog dump xlated** *PROG* [{**file** *FILE* | **opcodes** | **visual** | **linum**}]
| **bpftool** **prog dump jited** *PROG* [{**file** *FILE* | **opcodes** | **linum**}]
| **bpftool** **prog pin** *PROG* *FILE*
| **bpftool** **prog { load | loadall }** *OBJ* *PATH* [**type** *TYPE*] [**map** {**idx** *IDX* | **name** *NAME*} *MAP*] [**dev** *NAME*]
| **bpftool** **prog { load | loadall }** *OBJ* *PATH* [**type** *TYPE*] [**map** {**idx** *IDX* | **name** *NAME*} *MAP*] [**dev** *NAME*] [**pinmaps** *MAP_DIR*]
| **bpftool** **prog attach** *PROG* *ATTACH_TYPE* [*MAP*]
| **bpftool** **prog detach** *PROG* *ATTACH_TYPE* [*MAP*]
| **bpftool** **prog tracelog**
......@@ -39,7 +39,8 @@ PROG COMMANDS
| **cgroup/sock** | **cgroup/dev** | **lwt_in** | **lwt_out** | **lwt_xmit** |
| **lwt_seg6local** | **sockops** | **sk_skb** | **sk_msg** | **lirc_mode2** |
| **cgroup/bind4** | **cgroup/bind6** | **cgroup/post_bind4** | **cgroup/post_bind6** |
| **cgroup/connect4** | **cgroup/connect6** | **cgroup/sendmsg4** | **cgroup/sendmsg6**
| **cgroup/connect4** | **cgroup/connect6** | **cgroup/sendmsg4** | **cgroup/sendmsg6** |
| **cgroup/sysctl**
| }
| *ATTACH_TYPE* := {
| **msg_verdict** | **stream_verdict** | **stream_parser** | **flow_dissector**
......@@ -56,6 +57,14 @@ DESCRIPTION
Output will start with program ID followed by program type and
zero or more named attributes (depending on kernel version).
Since Linux 5.1 the kernel can collect statistics on BPF
programs (such as the total time spent running the program,
and the number of times it was run). If available, bpftool
shows such statistics. However, the kernel does not collect
them by default, as it slightly impacts performance on each
program run. Activation or deactivation of the feature is
performed via the **kernel.bpf_stats_enabled** sysctl knob.
**bpftool prog dump xlated** *PROG* [{ **file** *FILE* | **opcodes** | **visual** | **linum** }]
Dump eBPF instructions of the program from the kernel. By
default, eBPF will be disassembled and printed to standard
......@@ -144,7 +153,7 @@ OPTIONS
-h, --help
Print short generic help message (similar to **bpftool help**).
-v, --version
-V, --version
Print version number (similar to **bpftool version**).
-j, --json
......
......@@ -49,7 +49,7 @@ OPTIONS
-h, --help
Print short help message (similar to **bpftool help**).
-v, --version
-V, --version
Print version number (similar to **bpftool version**).
-j, --json
......
......@@ -370,7 +370,8 @@ _bpftool()
lirc_mode2 cgroup/bind4 cgroup/bind6 \
cgroup/connect4 cgroup/connect6 \
cgroup/sendmsg4 cgroup/sendmsg6 \
cgroup/post_bind4 cgroup/post_bind6" -- \
cgroup/post_bind4 cgroup/post_bind6 \
cgroup/sysctl" -- \
"$cur" ) )
return 0
;;
......@@ -619,7 +620,7 @@ _bpftool()
attach|detach)
local ATTACH_TYPES='ingress egress sock_create sock_ops \
device bind4 bind6 post_bind4 post_bind6 connect4 \
connect6 sendmsg4 sendmsg6'
connect6 sendmsg4 sendmsg6 sysctl'
local ATTACH_FLAGS='multi override'
local PROG_TYPE='id pinned tag'
case $prev in
......@@ -629,7 +630,7 @@ _bpftool()
;;
ingress|egress|sock_create|sock_ops|device|bind4|bind6|\
post_bind4|post_bind6|connect4|connect6|sendmsg4|\
sendmsg6)
sendmsg6|sysctl)
COMPREPLY=( $( compgen -W "$PROG_TYPE" -- \
"$cur" ) )
return 0
......
......@@ -25,7 +25,7 @@
" ATTACH_TYPE := { ingress | egress | sock_create |\n" \
" sock_ops | device | bind4 | bind6 |\n" \
" post_bind4 | post_bind6 | connect4 |\n" \
" connect6 | sendmsg4 | sendmsg6 }"
" connect6 | sendmsg4 | sendmsg6 | sysctl }"
static const char * const attach_type_strings[] = {
[BPF_CGROUP_INET_INGRESS] = "ingress",
......@@ -41,6 +41,7 @@ static const char * const attach_type_strings[] = {
[BPF_CGROUP_INET6_POST_BIND] = "post_bind6",
[BPF_CGROUP_UDP4_SENDMSG] = "sendmsg4",
[BPF_CGROUP_UDP6_SENDMSG] = "sendmsg6",
[BPF_CGROUP_SYSCTL] = "sysctl",
[__MAX_BPF_ATTACH_TYPE] = NULL,
};
......@@ -248,6 +249,13 @@ static int do_show_tree_fn(const char *fpath, const struct stat *sb,
for (type = 0; type < __MAX_BPF_ATTACH_TYPE; type++)
show_attached_bpf_progs(cgroup_fd, type, ftw->level);
if (errno == EINVAL)
/* Last attach type does not support query.
* Do not report an error for this, especially because batch
* mode would stop processing commands.
*/
errno = 0;
if (json_output) {
jsonw_end_array(json_wtr);
jsonw_end_object(json_wtr);
......
......@@ -73,6 +73,7 @@ static const char * const prog_type_name[] = {
[BPF_PROG_TYPE_LIRC_MODE2] = "lirc_mode2",
[BPF_PROG_TYPE_SK_REUSEPORT] = "sk_reuseport",
[BPF_PROG_TYPE_FLOW_DISSECTOR] = "flow_dissector",
[BPF_PROG_TYPE_CGROUP_SYSCTL] = "cgroup_sysctl",
};
extern const char * const map_type_name[];
......
......@@ -261,20 +261,20 @@ static void print_entry_json(struct bpf_map_info *info, unsigned char *key,
}
static void print_entry_error(struct bpf_map_info *info, unsigned char *key,
const char *value)
const char *error_msg)
{
int value_size = strlen(value);
int msg_size = strlen(error_msg);
bool single_line, break_names;
break_names = info->key_size > 16 || value_size > 16;
single_line = info->key_size + value_size <= 24 && !break_names;
break_names = info->key_size > 16 || msg_size > 16;
single_line = info->key_size + msg_size <= 24 && !break_names;
printf("key:%c", break_names ? '\n' : ' ');
fprint_hex(stdout, key, info->key_size, " ");
printf(single_line ? " " : "\n");
printf("value:%c%s", break_names ? '\n' : ' ', value);
printf("value:%c%s", break_names ? '\n' : ' ', error_msg);
printf("\n");
}
......@@ -298,11 +298,7 @@ static void print_entry_plain(struct bpf_map_info *info, unsigned char *key,
if (info->value_size) {
printf("value:%c", break_names ? '\n' : ' ');
if (value)
fprint_hex(stdout, value, info->value_size,
" ");
else
printf("<no entry>");
fprint_hex(stdout, value, info->value_size, " ");
}
printf("\n");
......@@ -321,11 +317,8 @@ static void print_entry_plain(struct bpf_map_info *info, unsigned char *key,
for (i = 0; i < n; i++) {
printf("value (CPU %02d):%c",
i, info->value_size > 16 ? '\n' : ' ');
if (value)
fprint_hex(stdout, value + i * step,
info->value_size, " ");
else
printf("<no entry>");
fprint_hex(stdout, value + i * step,
info->value_size, " ");
printf("\n");
}
}
......@@ -538,6 +531,9 @@ static int show_map_close_json(int fd, struct bpf_map_info *info)
}
close(fd);
if (info->btf_id)
jsonw_int_field(json_wtr, "btf_id", info->btf_id);
if (!hash_empty(map_table.table)) {
struct pinned_obj *obj;
......@@ -604,15 +600,19 @@ static int show_map_close_plain(int fd, struct bpf_map_info *info)
}
close(fd);
printf("\n");
if (!hash_empty(map_table.table)) {
struct pinned_obj *obj;
hash_for_each_possible(map_table.table, obj, hash, info->id) {
if (obj->id == info->id)
printf("\tpinned %s\n", obj->path);
printf("\n\tpinned %s", obj->path);
}
}
if (info->btf_id)
printf("\n\tbtf_id %d", info->btf_id);
printf("\n");
return 0;
}
......@@ -722,11 +722,16 @@ static int dump_map_elem(int fd, void *key, void *value,
jsonw_string_field(json_wtr, "error", strerror(lookup_errno));
jsonw_end_object(json_wtr);
} else {
const char *msg = NULL;
if (errno == ENOENT)
print_entry_plain(map_info, key, NULL);
else
print_entry_error(map_info, key,
strerror(lookup_errno));
msg = "<no entry>";
else if (lookup_errno == ENOSPC &&
map_info->type == BPF_MAP_TYPE_REUSEPORT_SOCKARRAY)
msg = "<cannot read>";
print_entry_error(map_info, key,
msg ? : strerror(lookup_errno));
}
return 0;
......@@ -780,6 +785,10 @@ static int do_dump(int argc, char **argv)
}
}
if (info.type == BPF_MAP_TYPE_REUSEPORT_SOCKARRAY &&
info.value_size != 8)
p_info("Warning: cannot read values from %s map with value_size != 8",
map_type_name[info.type]);
while (true) {
err = bpf_map_get_next_key(fd, prev_key, key);
if (err) {
......
......@@ -323,7 +323,7 @@ static void print_prog_plain(struct bpf_prog_info *info, int fd)
}
if (info->btf_id)
printf("\n\tbtf_id %d\n", info->btf_id);
printf("\n\tbtf_id %d", info->btf_id);
printf("\n");
}
......@@ -1060,7 +1060,7 @@ static int do_help(int argc, char **argv)
" tracepoint | raw_tracepoint | xdp | perf_event | cgroup/skb |\n"
" cgroup/sock | cgroup/dev | lwt_in | lwt_out | lwt_xmit |\n"
" lwt_seg6local | sockops | sk_skb | sk_msg | lirc_mode2 |\n"
" sk_reuseport | flow_dissector |\n"
" sk_reuseport | flow_dissector | cgroup/sysctl |\n"
" cgroup/bind4 | cgroup/bind6 | cgroup/post_bind4 |\n"
" cgroup/post_bind6 | cgroup/connect4 | cgroup/connect6 |\n"
" cgroup/sendmsg4 | cgroup/sendmsg6 }\n"
......
......@@ -167,6 +167,7 @@ enum bpf_prog_type {
BPF_PROG_TYPE_LIRC_MODE2,
BPF_PROG_TYPE_SK_REUSEPORT,
BPF_PROG_TYPE_FLOW_DISSECTOR,
BPF_PROG_TYPE_CGROUP_SYSCTL,
};
enum bpf_attach_type {
......@@ -188,6 +189,7 @@ enum bpf_attach_type {
BPF_CGROUP_UDP6_SENDMSG,
BPF_LIRC_MODE2,
BPF_FLOW_DISSECTOR,
BPF_CGROUP_SYSCTL,
__MAX_BPF_ATTACH_TYPE
};
......@@ -2504,6 +2506,122 @@ union bpf_attr {
* Return
* 0 if iph and th are a valid SYN cookie ACK, or a negative error
* otherwise.
*
* int bpf_sysctl_get_name(struct bpf_sysctl *ctx, char *buf, size_t buf_len, u64 flags)
* Description
* Get name of sysctl in /proc/sys/ and copy it into the
* program-provided buffer *buf* of size *buf_len*.
*
* The buffer is always NUL terminated, unless it's zero-sized.
*
* If *flags* is zero, full name (e.g. "net/ipv4/tcp_mem") is
* copied. Use **BPF_F_SYSCTL_BASE_NAME** flag to copy base name
* only (e.g. "tcp_mem").
* Return
* Number of characters copied (not including the trailing NUL).
*
* **-E2BIG** if the buffer wasn't big enough (*buf* will contain
* truncated name in this case).
*
* int bpf_sysctl_get_current_value(struct bpf_sysctl *ctx, char *buf, size_t buf_len)
* Description
* Get current value of sysctl as it is presented in /proc/sys
* (incl. newline, etc), and copy it as a string into the
* program-provided buffer *buf* of size *buf_len*.
*
* The whole value is copied, no matter what file position user
* space issued e.g. sys_read at.
*
* The buffer is always NUL terminated, unless it's zero-sized.
* Return
* Number of characters copied (not including the trailing NUL).
*
* **-E2BIG** if the buffer wasn't big enough (*buf* will contain
* truncated value in this case).
*
* **-EINVAL** if current value was unavailable, e.g. because
* sysctl is uninitialized and read returns -EIO for it.
*
* int bpf_sysctl_get_new_value(struct bpf_sysctl *ctx, char *buf, size_t buf_len)
* Description
* Get new value being written by user space to sysctl (before
* the actual write happens) and copy it as a string into the
* program-provided buffer *buf* of size *buf_len*.
*
* User space may write new value at file position > 0.
*
* The buffer is always NUL terminated, unless it's zero-sized.
* Return
* Number of characters copied (not including the trailing NUL).
*
* **-E2BIG** if the buffer wasn't big enough (*buf* will contain
* truncated value in this case).
*
* **-EINVAL** if sysctl is being read.
*
* int bpf_sysctl_set_new_value(struct bpf_sysctl *ctx, const char *buf, size_t buf_len)
* Description
* Override new value being written by user space to sysctl with
* value provided by program in buffer *buf* of size *buf_len*.
*
* *buf* should contain a string in same form as provided by user
* space on sysctl write.
*
* User space may write new value at file position > 0. To override
* the whole sysctl value, the file position should be set to zero.
* Return
* 0 on success.
*
* **-E2BIG** if the *buf_len* is too big.
*
* **-EINVAL** if sysctl is being read.
*
* int bpf_strtol(const char *buf, size_t buf_len, u64 flags, long *res)
* Description
* Convert the initial part of the string from buffer *buf* of
* size *buf_len* to a long integer according to the given base
* and save the result in *res*.
*
* The string may begin with an arbitrary amount of white space
* (as determined by isspace(3)) followed by a single optional '-'
* sign.
*
* Five least significant bits of *flags* encode base, other bits
* are currently unused.
*
* Base must be either 8, 10, 16 or 0 to detect it automatically
* similar to user space strtol(3).
* Return
* Number of characters consumed on success. Must be positive but
* no more than buf_len.
*
* **-EINVAL** if no valid digits were found or unsupported base
* was provided.
*
* **-ERANGE** if resulting value was out of range.
*
* int bpf_strtoul(const char *buf, size_t buf_len, u64 flags, unsigned long *res)
* Description
* Convert the initial part of the string from buffer *buf* of
* size *buf_len* to an unsigned long integer according to the
* given base and save the result in *res*.
*
* The string may begin with an arbitrary amount of white space
* (as determined by isspace(3)).
*
* Five least significant bits of *flags* encode base, other bits
* are currently unused.
*
* Base must be either 8, 10, 16 or 0 to detect it automatically
* similar to user space strtoul(3).
* Return
* Number of characters consumed on success. Must be positive but
* no more than buf_len.
*
* **-EINVAL** if no valid digits were found or unsupported base
* was provided.
*
* **-ERANGE** if resulting value was out of range.
*/
#define __BPF_FUNC_MAPPER(FN) \
FN(unspec), \
......@@ -2606,7 +2724,13 @@ union bpf_attr {
FN(skb_ecn_set_ce), \
FN(get_listener_sock), \
FN(skc_lookup_tcp), \
FN(tcp_check_syncookie),
FN(tcp_check_syncookie), \
FN(sysctl_get_name), \
FN(sysctl_get_current_value), \
FN(sysctl_get_new_value), \
FN(sysctl_set_new_value), \
FN(strtol), \
FN(strtoul),
/* integer value in 'imm' field of BPF_CALL instruction selects which helper
* function eBPF program intends to call
......@@ -2668,17 +2792,20 @@ enum bpf_func_id {
/* BPF_FUNC_skb_adjust_room flags. */
#define BPF_F_ADJ_ROOM_FIXED_GSO (1ULL << 0)
#define BPF_ADJ_ROOM_ENCAP_L2_MASK 0xff
#define BPF_ADJ_ROOM_ENCAP_L2_SHIFT 56
#define BPF_ADJ_ROOM_ENCAP_L2_MASK 0xff
#define BPF_ADJ_ROOM_ENCAP_L2_SHIFT 56
#define BPF_F_ADJ_ROOM_ENCAP_L3_IPV4 (1ULL << 1)
#define BPF_F_ADJ_ROOM_ENCAP_L3_IPV6 (1ULL << 2)
#define BPF_F_ADJ_ROOM_ENCAP_L4_GRE (1ULL << 3)
#define BPF_F_ADJ_ROOM_ENCAP_L4_UDP (1ULL << 4)
#define BPF_F_ADJ_ROOM_ENCAP_L2(len) (((__u64)len & \
#define BPF_F_ADJ_ROOM_ENCAP_L2(len) (((__u64)len & \
BPF_ADJ_ROOM_ENCAP_L2_MASK) \
<< BPF_ADJ_ROOM_ENCAP_L2_SHIFT)
/* BPF_FUNC_sysctl_get_name flags. */
#define BPF_F_SYSCTL_BASE_NAME (1ULL << 0)
/* Mode for BPF_FUNC_skb_adjust_room helper. */
enum bpf_adj_room_mode {
BPF_ADJ_ROOM_NET,
......@@ -3308,4 +3435,14 @@ struct bpf_line_info {
struct bpf_spin_lock {
__u32 val;
};
struct bpf_sysctl {
__u32 write; /* Sysctl is being read (= 0) or written (= 1).
* Allows 1,2,4-byte read, but no write.
*/
__u32 file_pos; /* Sysctl file position to read from, write to.
* Allows 1,2,4-byte read and 4-byte write.
*/
};
#endif /* _UAPI__LINUX_BPF_H__ */
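The buffer and return-value contract documented above for bpf_sysctl_get_name() (a non-empty buffer is always NUL terminated; the return value is the copied length, or -E2BIG with a truncated copy left in the buffer) can be modelled in plain user-space C. The sketch below is illustrative only: copy_sysctl_str() and base_name() are hypothetical names, not kernel functions, and the behaviour for a zero-sized buffer is an assumption.

```c
#include <errno.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical user-space model of the documented contract; not kernel code.
 * Copies src into buf (capacity buf_len) and NUL-terminates it unless
 * buf_len is zero. Returns the number of characters copied (excluding the
 * NUL), or -E2BIG when src did not fit, in which case buf holds a
 * truncated copy.
 */
static int copy_sysctl_str(char *buf, size_t buf_len, const char *src)
{
	size_t src_len = strlen(src);
	size_t n;

	if (buf_len == 0)
		return -E2BIG;	/* assumption: nothing can be copied */

	n = src_len < buf_len ? src_len : buf_len - 1;
	memcpy(buf, src, n);
	buf[n] = '\0';

	return src_len < buf_len ? (int)n : -E2BIG;
}

/* Models the effect of BPF_F_SYSCTL_BASE_NAME:
 * "net/ipv4/tcp_mem" becomes "tcp_mem".
 */
static const char *base_name(const char *full_name)
{
	const char *slash = strrchr(full_name, '/');

	return slash ? slash + 1 : full_name;
}
```

A caller that sees -E2BIG can still inspect the truncated prefix, which is why the buffer is terminated before the error is returned.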
......@@ -92,7 +92,7 @@ struct bpf_load_program_attr {
#define MAPS_RELAX_COMPAT 0x01
/* Recommend log buffer size */
#define BPF_LOG_BUF_SIZE (16 * 1024 * 1024) /* verifier maximum in kernels <= 5.1 */
#define BPF_LOG_BUF_SIZE (UINT32_MAX >> 8) /* verifier maximum in kernels <= 5.1 */
LIBBPF_API int
bpf_load_program_xattr(const struct bpf_load_program_attr *load_attr,
char *log_buf, size_t log_buf_sz);
......
......@@ -1354,8 +1354,16 @@ static struct btf_dedup *btf_dedup_new(struct btf *btf, struct btf_ext *btf_ext,
}
/* special BTF "void" type is made canonical immediately */
d->map[0] = 0;
for (i = 1; i <= btf->nr_types; i++)
d->map[i] = BTF_UNPROCESSED_ID;
for (i = 1; i <= btf->nr_types; i++) {
struct btf_type *t = d->btf->types[i];
__u16 kind = BTF_INFO_KIND(t->info);
/* VAR and DATASEC are never deduped and are self-canonical */
if (kind == BTF_KIND_VAR || kind == BTF_KIND_DATASEC)
d->map[i] = i;
else
d->map[i] = BTF_UNPROCESSED_ID;
}
d->hypot_map = malloc(sizeof(__u32) * (1 + btf->nr_types));
if (!d->hypot_map) {
......@@ -1946,6 +1954,8 @@ static int btf_dedup_prim_type(struct btf_dedup *d, __u32 type_id)
case BTF_KIND_UNION:
case BTF_KIND_FUNC:
case BTF_KIND_FUNC_PROTO:
case BTF_KIND_VAR:
case BTF_KIND_DATASEC:
return 0;
case BTF_KIND_INT:
......@@ -2699,6 +2709,7 @@ static int btf_dedup_remap_type(struct btf_dedup *d, __u32 type_id)
case BTF_KIND_PTR:
case BTF_KIND_TYPEDEF:
case BTF_KIND_FUNC:
case BTF_KIND_VAR:
r = btf_dedup_remap_type_id(d, t->type);
if (r < 0)
return r;
......@@ -2753,6 +2764,20 @@ static int btf_dedup_remap_type(struct btf_dedup *d, __u32 type_id)
break;
}
case BTF_KIND_DATASEC: {
struct btf_var_secinfo *var = (struct btf_var_secinfo *)(t + 1);
__u16 vlen = BTF_INFO_VLEN(t->info);
for (i = 0; i < vlen; i++) {
r = btf_dedup_remap_type_id(d, var->type);
if (r < 0)
return r;
var->type = r;
var++;
}
break;
}
default:
return -EINVAL;
}
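The effect of keeping VAR types self-canonical while remapping the type ids they reference can be sketched on a toy type table. Everything here is hypothetical (struct toy_type, the TOY_* kinds, toy_remap()); it is a simplified model that skips id compaction and does not use libbpf's data structures.

```c
#include <stddef.h>

enum toy_kind { TOY_INT, TOY_VAR };	/* hypothetical, not BTF_KIND_* values */

struct toy_type {
	enum toy_kind kind;
	unsigned int ref;	/* referenced type id, 0 if none */
};

/* Rewrite the type id referenced by every VAR through the canonical map.
 * map[id] holds the canonical id chosen for type id; a VAR maps to itself.
 */
static void toy_remap(struct toy_type *types, size_t n, const unsigned int *map)
{
	size_t i;

	for (i = 1; i < n; i++)		/* id 0 is reserved for "void" */
		if (types[i].kind == TOY_VAR)
			types[i].ref = map[types[i].ref];
}
```

With two identical int types at ids 1 and 3 and map[3] == 1, a variable that referenced id 3 ends up referencing id 1, while the variable itself is never merged with another one.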
......
......@@ -817,7 +817,7 @@ bpf_object__init_internal_map(struct bpf_object *obj, struct bpf_map *map,
memcpy(*data_buff, data->d_buf, data->d_size);
}
pr_debug("map %ld is \"%s\"\n", map - obj->maps, map->name);
pr_debug("map %td is \"%s\"\n", map - obj->maps, map->name);
return 0;
}
......@@ -2064,6 +2064,7 @@ static bool bpf_prog_type__needs_kver(enum bpf_prog_type type)
case BPF_PROG_TYPE_TRACEPOINT:
case BPF_PROG_TYPE_RAW_TRACEPOINT:
case BPF_PROG_TYPE_PERF_EVENT:
case BPF_PROG_TYPE_CGROUP_SYSCTL:
return false;
case BPF_PROG_TYPE_KPROBE:
default:
......@@ -3004,6 +3005,8 @@ static const struct {
BPF_CGROUP_UDP4_SENDMSG),
BPF_EAPROG_SEC("cgroup/sendmsg6", BPF_PROG_TYPE_CGROUP_SOCK_ADDR,
BPF_CGROUP_UDP6_SENDMSG),
BPF_EAPROG_SEC("cgroup/sysctl", BPF_PROG_TYPE_CGROUP_SYSCTL,
BPF_CGROUP_SYSCTL),
};
#undef BPF_PROG_SEC_IMPL
......
......@@ -97,6 +97,7 @@ probe_load(enum bpf_prog_type prog_type, const struct bpf_insn *insns,
case BPF_PROG_TYPE_LIRC_MODE2:
case BPF_PROG_TYPE_SK_REUSEPORT:
case BPF_PROG_TYPE_FLOW_DISSECTOR:
case BPF_PROG_TYPE_CGROUP_SYSCTL:
default:
break;
}
......
......@@ -23,6 +23,36 @@ do { \
#define pr_info(fmt, ...) __pr(LIBBPF_INFO, fmt, ##__VA_ARGS__)
#define pr_debug(fmt, ...) __pr(LIBBPF_DEBUG, fmt, ##__VA_ARGS__)
/* Use these barrier functions instead of smp_[rw]mb() when they are
* used in a libbpf header file. That way they can be built into the
* application that uses libbpf.
*/
#if defined(__i386__) || defined(__x86_64__)
# define libbpf_smp_rmb() asm volatile("" : : : "memory")
# define libbpf_smp_wmb() asm volatile("" : : : "memory")
# define libbpf_smp_mb() \
asm volatile("lock; addl $0,-4(%%rsp)" : : : "memory", "cc")
/* Prevents stores from being observed before older loads. */
# define libbpf_smp_rwmb() asm volatile("" : : : "memory")
#elif defined(__aarch64__)
# define libbpf_smp_rmb() asm volatile("dmb ishld" : : : "memory")
# define libbpf_smp_wmb() asm volatile("dmb ishst" : : : "memory")
# define libbpf_smp_mb() asm volatile("dmb ish" : : : "memory")
# define libbpf_smp_rwmb() libbpf_smp_mb()
#elif defined(__arm__)
/* These are only valid for armv7 and above */
# define libbpf_smp_rmb() asm volatile("dmb ish" : : : "memory")
# define libbpf_smp_wmb() asm volatile("dmb ishst" : : : "memory")
# define libbpf_smp_mb() asm volatile("dmb ish" : : : "memory")
# define libbpf_smp_rwmb() libbpf_smp_mb()
#else
/* Architecture missing native barrier functions. */
# define libbpf_smp_rmb() __sync_synchronize()
# define libbpf_smp_wmb() __sync_synchronize()
# define libbpf_smp_mb() __sync_synchronize()
# define libbpf_smp_rwmb() __sync_synchronize()
#endif
#ifdef __cplusplus
} /* extern "C" */
#endif
......
......@@ -16,6 +16,7 @@
#include <linux/if_xdp.h>
#include "libbpf.h"
#include "libbpf_util.h"
#ifdef __cplusplus
extern "C" {
......@@ -36,6 +37,10 @@ struct name { \
DEFINE_XSK_RING(xsk_ring_prod);
DEFINE_XSK_RING(xsk_ring_cons);
/* For a detailed explanation on the memory barriers associated with the
* ring, please take a look at net/xdp/xsk_queue.h.
*/
struct xsk_umem;
struct xsk_socket;
......@@ -105,7 +110,7 @@ static inline __u32 xsk_cons_nb_avail(struct xsk_ring_cons *r, __u32 nb)
static inline size_t xsk_ring_prod__reserve(struct xsk_ring_prod *prod,
size_t nb, __u32 *idx)
{
if (unlikely(xsk_prod_nb_free(prod, nb) < nb))
if (xsk_prod_nb_free(prod, nb) < nb)
return 0;
*idx = prod->cached_prod;
......@@ -116,10 +121,10 @@ static inline size_t xsk_ring_prod__reserve(struct xsk_ring_prod *prod,
static inline void xsk_ring_prod__submit(struct xsk_ring_prod *prod, size_t nb)
{
/* Make sure everything has been written to the ring before signalling
* this to the kernel.
/* Make sure everything has been written to the ring before indicating
* this to the kernel by writing the producer pointer.
*/
smp_wmb();
libbpf_smp_wmb();
*prod->producer += nb;
}
......@@ -129,11 +134,11 @@ static inline size_t xsk_ring_cons__peek(struct xsk_ring_cons *cons,
{
size_t entries = xsk_cons_nb_avail(cons, nb);
if (likely(entries > 0)) {
if (entries > 0) {
/* Make sure we do not speculatively read the data before
* we have received the packet buffers from the ring.
*/
smp_rmb();
libbpf_smp_rmb();
*idx = cons->cached_cons;
cons->cached_cons += entries;
......@@ -144,6 +149,11 @@ static inline size_t xsk_ring_cons__peek(struct xsk_ring_cons *cons,
static inline void xsk_ring_cons__release(struct xsk_ring_cons *cons, size_t nb)
{
/* Make sure data has been read before indicating we are done
* with the entries by updating the consumer pointer.
*/
libbpf_smp_rwmb();
*cons->consumer += nb;
}
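The barrier pairing in the ring helpers above (write the entries, then publish the producer index; observe the producer index, then read the entries; finish reading, then publish the consumer index) can be illustrated with C11 atomics. This is a sketch under assumptions, not libbpf code: struct demo_ring and its functions are hypothetical, and release/acquire orderings stand in for libbpf_smp_wmb(), libbpf_smp_rmb() and libbpf_smp_rwmb().

```c
#include <stdatomic.h>
#include <stdint.h>

#define DEMO_RING_SIZE 8u	/* must be a power of two */

/* Hypothetical single-producer/single-consumer ring; not the xsk layout. */
struct demo_ring {
	_Atomic uint32_t producer;	/* written by the producer only */
	_Atomic uint32_t consumer;	/* written by the consumer only */
	uint64_t entries[DEMO_RING_SIZE];
};

/* Returns 1 if the value was queued, 0 if the ring is full. */
static int demo_ring_push(struct demo_ring *r, uint64_t val)
{
	uint32_t prod = atomic_load_explicit(&r->producer, memory_order_relaxed);
	uint32_t cons = atomic_load_explicit(&r->consumer, memory_order_acquire);

	if (prod - cons == DEMO_RING_SIZE)
		return 0;

	r->entries[prod & (DEMO_RING_SIZE - 1)] = val;
	/* Release: the entry is visible before the new producer index. */
	atomic_store_explicit(&r->producer, prod + 1, memory_order_release);
	return 1;
}

/* Returns 1 and stores the value in *val, or 0 if the ring is empty. */
static int demo_ring_pop(struct demo_ring *r, uint64_t *val)
{
	uint32_t cons = atomic_load_explicit(&r->consumer, memory_order_relaxed);
	/* Acquire: do not read the entry before seeing the producer index. */
	uint32_t prod = atomic_load_explicit(&r->producer, memory_order_acquire);

	if (prod == cons)
		return 0;

	*val = r->entries[cons & (DEMO_RING_SIZE - 1)];
	/* Release: the read completes before the slot is handed back. */
	atomic_store_explicit(&r->consumer, cons + 1, memory_order_release);
	return 1;
}
```

The release store on the consumer index plays the role of the barrier added to xsk_ring_cons__release(): without it, the producer could reuse a slot while the consumer is still reading it.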
......
......@@ -23,7 +23,7 @@ TEST_GEN_PROGS = test_verifier test_tag test_maps test_lru_map test_lpm_map test
test_align test_verifier_log test_dev_cgroup test_tcpbpf_user \
test_sock test_btf test_sockmap test_lirc_mode2_user get_cgroup_id_user \
test_socket_cookie test_cgroup_storage test_select_reuseport test_section_names \
test_netcnt test_tcpnotify_user test_sock_fields
test_netcnt test_tcpnotify_user test_sock_fields test_sysctl
BPF_OBJ_FILES = $(patsubst %.c,%.o, $(notdir $(wildcard progs/*.c)))
TEST_GEN_FILES = $(BPF_OBJ_FILES)
......@@ -93,6 +93,7 @@ $(OUTPUT)/get_cgroup_id_user: cgroup_helpers.c
$(OUTPUT)/test_cgroup_storage: cgroup_helpers.c
$(OUTPUT)/test_netcnt: cgroup_helpers.c
$(OUTPUT)/test_sock_fields: cgroup_helpers.c
$(OUTPUT)/test_sysctl: cgroup_helpers.c
.PHONY: force
......
......@@ -192,6 +192,25 @@ static int (*bpf_skb_ecn_set_ce)(void *ctx) =
static int (*bpf_tcp_check_syncookie)(struct bpf_sock *sk,
void *ip, int ip_len, void *tcp, int tcp_len) =
(void *) BPF_FUNC_tcp_check_syncookie;
static int (*bpf_sysctl_get_name)(void *ctx, char *buf,
unsigned long long buf_len,
unsigned long long flags) =
(void *) BPF_FUNC_sysctl_get_name;
static int (*bpf_sysctl_get_current_value)(void *ctx, char *buf,
unsigned long long buf_len) =
(void *) BPF_FUNC_sysctl_get_current_value;
static int (*bpf_sysctl_get_new_value)(void *ctx, char *buf,
unsigned long long buf_len) =
(void *) BPF_FUNC_sysctl_get_new_value;
static int (*bpf_sysctl_set_new_value)(void *ctx, const char *buf,
unsigned long long buf_len) =
(void *) BPF_FUNC_sysctl_set_new_value;
static int (*bpf_strtol)(const char *buf, unsigned long long buf_len,
unsigned long long flags, long *res) =
(void *) BPF_FUNC_strtol;
static int (*bpf_strtoul)(const char *buf, unsigned long long buf_len,
unsigned long long flags, unsigned long *res) =
(void *) BPF_FUNC_strtoul;
/* llvm builtin functions that eBPF C program may use to
* emit BPF_LD_ABS and BPF_LD_IND instructions
......
......@@ -2,7 +2,7 @@
#include <test_progs.h>
#define CHECK_FLOW_KEYS(desc, got, expected) \
CHECK(memcmp(&got, &expected, sizeof(got)) != 0, \
CHECK_ATTR(memcmp(&got, &expected, sizeof(got)) != 0, \
desc, \
"nhoff=%u/%u " \
"thoff=%u/%u " \
......@@ -10,6 +10,7 @@
"is_frag=%u/%u " \
"is_first_frag=%u/%u " \
"is_encap=%u/%u " \
"ip_proto=0x%x/0x%x " \
"n_proto=0x%x/0x%x " \
"sport=%u/%u " \
"dport=%u/%u\n", \
......@@ -19,53 +20,32 @@
got.is_frag, expected.is_frag, \
got.is_first_frag, expected.is_first_frag, \
got.is_encap, expected.is_encap, \
got.ip_proto, expected.ip_proto, \
got.n_proto, expected.n_proto, \
got.sport, expected.sport, \
got.dport, expected.dport)
static struct bpf_flow_keys pkt_v4_flow_keys = {
.nhoff = 0,
.thoff = sizeof(struct iphdr),
.addr_proto = ETH_P_IP,
.ip_proto = IPPROTO_TCP,
.n_proto = __bpf_constant_htons(ETH_P_IP),
};
static struct bpf_flow_keys pkt_v6_flow_keys = {
.nhoff = 0,
.thoff = sizeof(struct ipv6hdr),
.addr_proto = ETH_P_IPV6,
.ip_proto = IPPROTO_TCP,
.n_proto = __bpf_constant_htons(ETH_P_IPV6),
};
#define VLAN_HLEN 4
struct ipv4_pkt {
struct ethhdr eth;
struct iphdr iph;
struct tcphdr tcp;
} __packed;
static struct {
struct svlan_ipv4_pkt {
struct ethhdr eth;
__u16 vlan_tci;
__u16 vlan_proto;
struct iphdr iph;
struct tcphdr tcp;
} __packed pkt_vlan_v4 = {
.eth.h_proto = __bpf_constant_htons(ETH_P_8021Q),
.vlan_proto = __bpf_constant_htons(ETH_P_IP),
.iph.ihl = 5,
.iph.protocol = IPPROTO_TCP,
.iph.tot_len = __bpf_constant_htons(MAGIC_BYTES),
.tcp.urg_ptr = 123,
.tcp.doff = 5,
};
} __packed;
static struct bpf_flow_keys pkt_vlan_v4_flow_keys = {
.nhoff = VLAN_HLEN,
.thoff = VLAN_HLEN + sizeof(struct iphdr),
.addr_proto = ETH_P_IP,
.ip_proto = IPPROTO_TCP,
.n_proto = __bpf_constant_htons(ETH_P_IP),
};
struct ipv6_pkt {
struct ethhdr eth;
struct ipv6hdr iph;
struct tcphdr tcp;
} __packed;
static struct {
struct dvlan_ipv6_pkt {
struct ethhdr eth;
__u16 vlan_tci;
__u16 vlan_proto;
......@@ -73,31 +53,97 @@ static struct {
__u16 vlan_proto2;
struct ipv6hdr iph;
struct tcphdr tcp;
} __packed pkt_vlan_v6 = {
.eth.h_proto = __bpf_constant_htons(ETH_P_8021AD),
.vlan_proto = __bpf_constant_htons(ETH_P_8021Q),
.vlan_proto2 = __bpf_constant_htons(ETH_P_IPV6),
.iph.nexthdr = IPPROTO_TCP,
.iph.payload_len = __bpf_constant_htons(MAGIC_BYTES),
.tcp.urg_ptr = 123,
.tcp.doff = 5,
} __packed;
struct test {
const char *name;
union {
struct ipv4_pkt ipv4;
struct svlan_ipv4_pkt svlan_ipv4;
struct ipv6_pkt ipv6;
struct dvlan_ipv6_pkt dvlan_ipv6;
} pkt;
struct bpf_flow_keys keys;
};
static struct bpf_flow_keys pkt_vlan_v6_flow_keys = {
.nhoff = VLAN_HLEN * 2,
.thoff = VLAN_HLEN * 2 + sizeof(struct ipv6hdr),
.addr_proto = ETH_P_IPV6,
.ip_proto = IPPROTO_TCP,
.n_proto = __bpf_constant_htons(ETH_P_IPV6),
#define VLAN_HLEN 4
struct test tests[] = {
{
.name = "ipv4",
.pkt.ipv4 = {
.eth.h_proto = __bpf_constant_htons(ETH_P_IP),
.iph.ihl = 5,
.iph.protocol = IPPROTO_TCP,
.iph.tot_len = __bpf_constant_htons(MAGIC_BYTES),
.tcp.doff = 5,
},
.keys = {
.nhoff = 0,
.thoff = sizeof(struct iphdr),
.addr_proto = ETH_P_IP,
.ip_proto = IPPROTO_TCP,
.n_proto = __bpf_constant_htons(ETH_P_IP),
},
},
{
.name = "ipv6",
.pkt.ipv6 = {
.eth.h_proto = __bpf_constant_htons(ETH_P_IPV6),
.iph.nexthdr = IPPROTO_TCP,
.iph.payload_len = __bpf_constant_htons(MAGIC_BYTES),
.tcp.doff = 5,
},
.keys = {
.nhoff = 0,
.thoff = sizeof(struct ipv6hdr),
.addr_proto = ETH_P_IPV6,
.ip_proto = IPPROTO_TCP,
.n_proto = __bpf_constant_htons(ETH_P_IPV6),
},
},
{
.name = "802.1q-ipv4",
.pkt.svlan_ipv4 = {
.eth.h_proto = __bpf_constant_htons(ETH_P_8021Q),
.vlan_proto = __bpf_constant_htons(ETH_P_IP),
.iph.ihl = 5,
.iph.protocol = IPPROTO_TCP,
.iph.tot_len = __bpf_constant_htons(MAGIC_BYTES),
.tcp.doff = 5,
},
.keys = {
.nhoff = VLAN_HLEN,
.thoff = VLAN_HLEN + sizeof(struct iphdr),
.addr_proto = ETH_P_IP,
.ip_proto = IPPROTO_TCP,
.n_proto = __bpf_constant_htons(ETH_P_IP),
},
},
{
.name = "802.1ad-ipv6",
.pkt.dvlan_ipv6 = {
.eth.h_proto = __bpf_constant_htons(ETH_P_8021AD),
.vlan_proto = __bpf_constant_htons(ETH_P_8021Q),
.vlan_proto2 = __bpf_constant_htons(ETH_P_IPV6),
.iph.nexthdr = IPPROTO_TCP,
.iph.payload_len = __bpf_constant_htons(MAGIC_BYTES),
.tcp.doff = 5,
},
.keys = {
.nhoff = VLAN_HLEN * 2,
.thoff = VLAN_HLEN * 2 + sizeof(struct ipv6hdr),
.addr_proto = ETH_P_IPV6,
.ip_proto = IPPROTO_TCP,
.n_proto = __bpf_constant_htons(ETH_P_IPV6),
},
},
};
void test_flow_dissector(void)
{
struct bpf_flow_keys flow_keys;
struct bpf_object *obj;
__u32 duration, retval;
int err, prog_fd;
__u32 size;
int i, err, prog_fd;
err = bpf_flow_load(&obj, "./bpf_flow.o", "flow_dissector",
"jmp_table", &prog_fd);
......@@ -106,35 +152,24 @@ void test_flow_dissector(void)
return;
}
err = bpf_prog_test_run(prog_fd, 10, &pkt_v4, sizeof(pkt_v4),
&flow_keys, &size, &retval, &duration);
CHECK(size != sizeof(flow_keys) || err || retval != 1, "ipv4",
"err %d errno %d retval %d duration %d size %u/%lu\n",
err, errno, retval, duration, size, sizeof(flow_keys));
CHECK_FLOW_KEYS("ipv4_flow_keys", flow_keys, pkt_v4_flow_keys);
err = bpf_prog_test_run(prog_fd, 10, &pkt_v6, sizeof(pkt_v6),
&flow_keys, &size, &retval, &duration);
CHECK(size != sizeof(flow_keys) || err || retval != 1, "ipv6",
"err %d errno %d retval %d duration %d size %u/%lu\n",
err, errno, retval, duration, size, sizeof(flow_keys));
CHECK_FLOW_KEYS("ipv6_flow_keys", flow_keys, pkt_v6_flow_keys);
for (i = 0; i < ARRAY_SIZE(tests); i++) {
struct bpf_flow_keys flow_keys;
struct bpf_prog_test_run_attr tattr = {
.prog_fd = prog_fd,
.data_in = &tests[i].pkt,
.data_size_in = sizeof(tests[i].pkt),
.data_out = &flow_keys,
};
err = bpf_prog_test_run(prog_fd, 10, &pkt_vlan_v4, sizeof(pkt_vlan_v4),
&flow_keys, &size, &retval, &duration);
CHECK(size != sizeof(flow_keys) || err || retval != 1, "vlan_ipv4",
"err %d errno %d retval %d duration %d size %u/%lu\n",
err, errno, retval, duration, size, sizeof(flow_keys));
CHECK_FLOW_KEYS("vlan_ipv4_flow_keys", flow_keys,
pkt_vlan_v4_flow_keys);
err = bpf_prog_test_run(prog_fd, 10, &pkt_vlan_v6, sizeof(pkt_vlan_v6),
&flow_keys, &size, &retval, &duration);
CHECK(size != sizeof(flow_keys) || err || retval != 1, "vlan_ipv6",
"err %d errno %d retval %d duration %d size %u/%lu\n",
err, errno, retval, duration, size, sizeof(flow_keys));
CHECK_FLOW_KEYS("vlan_ipv6_flow_keys", flow_keys,
pkt_vlan_v6_flow_keys);
err = bpf_prog_test_run_xattr(&tattr);
CHECK_ATTR(tattr.data_size_out != sizeof(flow_keys) ||
err || tattr.retval != 1,
tests[i].name,
"err %d errno %d retval %d duration %d size %u/%lu\n",
err, errno, tattr.retval, tattr.duration,
tattr.data_size_out, sizeof(flow_keys));
CHECK_FLOW_KEYS(tests[i].name, flow_keys, tests[i].keys);
}
bpf_object__close(obj);
}
// SPDX-License-Identifier: GPL-2.0
// Copyright (c) 2019 Facebook
#include <stdint.h>
#include <string.h>
#include <linux/stddef.h>
#include <linux/bpf.h>
#include "bpf_helpers.h"
#include "bpf_util.h"
/* Max supported length of a string with unsigned long in base 10 (pow2 - 1). */
#define MAX_ULONG_STR_LEN 0xF
/* Max supported length of sysctl value string (pow2). */
#define MAX_VALUE_STR_LEN 0x40
static __always_inline int is_tcp_mem(struct bpf_sysctl *ctx)
{
char tcp_mem_name[] = "net/ipv4/tcp_mem";
unsigned char i;
char name[64];
int ret;
memset(name, 0, sizeof(name));
ret = bpf_sysctl_get_name(ctx, name, sizeof(name), 0);
if (ret < 0 || ret != sizeof(tcp_mem_name) - 1)
return 0;
#pragma clang loop unroll(full)
for (i = 0; i < sizeof(tcp_mem_name); ++i)
if (name[i] != tcp_mem_name[i])
return 0;
return 1;
}
SEC("cgroup/sysctl")
int sysctl_tcp_mem(struct bpf_sysctl *ctx)
{
unsigned long tcp_mem[3] = {0, 0, 0};
char value[MAX_VALUE_STR_LEN];
unsigned char i, off = 0;
int ret;
if (ctx->write)
return 0;
if (!is_tcp_mem(ctx))
return 0;
ret = bpf_sysctl_get_current_value(ctx, value, MAX_VALUE_STR_LEN);
if (ret < 0 || ret >= MAX_VALUE_STR_LEN)
return 0;
#pragma clang loop unroll(full)
for (i = 0; i < ARRAY_SIZE(tcp_mem); ++i) {
ret = bpf_strtoul(value + off, MAX_ULONG_STR_LEN, 0,
tcp_mem + i);
if (ret <= 0 || ret > MAX_ULONG_STR_LEN)
return 0;
off += ret & MAX_ULONG_STR_LEN;
}
return tcp_mem[0] < tcp_mem[1] && tcp_mem[1] < tcp_mem[2];
}
char _license[] SEC("license") = "GPL";
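For comparison, the check performed by the program above can be written in user space with strtoul(3), which bpf_strtoul() is documented as resembling. This is a sketch under assumptions: tcp_mem_is_ordered() is a hypothetical helper, and because it uses the C library rather than the BPF helper, details such as sign handling and length limits are not identical.

```c
#include <errno.h>
#include <stdlib.h>

/* Hypothetical helper: parse three unsigned longs from value and return 1
 * if they are strictly increasing, 0 otherwise (including on parse errors).
 * Base 0 mirrors the BPF program, which lets the base be auto-detected.
 */
static int tcp_mem_is_ordered(const char *value)
{
	unsigned long tcp_mem[3];
	const char *p = value;
	char *end;
	int i;

	for (i = 0; i < 3; i++) {
		errno = 0;
		tcp_mem[i] = strtoul(p, &end, 0);
		/* No digits consumed, or the value overflowed. */
		if (end == p || errno == ERANGE)
			return 0;
		p = end;
	}

	return tcp_mem[0] < tcp_mem[1] && tcp_mem[1] < tcp_mem[2];
}
```

Unlike this version, the BPF program cannot loop over a pointer freely: it advances a bounded offset and masks it so that the verifier can prove every access stays inside the value buffer.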
......@@ -157,7 +157,7 @@ static __always_inline int encap_ipv4(struct __sk_buff *skb, __u8 encap_proto,
bpf_ntohs(h_outer.ip.tot_len));
h_outer.ip.protocol = encap_proto;
set_ipv4_csum(&h_outer.ip);
set_ipv4_csum((void *)&h_outer.ip);
/* store new outer network header */
if (bpf_skb_store_bytes(skb, ETH_HLEN, &h_outer, olen,
......
// SPDX-License-Identifier: GPL-2.0
#include <stddef.h>
#include <string.h>
#include <netinet/in.h>
#include <linux/bpf.h>
#include <linux/if_ether.h>
#include <linux/if_packet.h>
......@@ -9,7 +10,6 @@
#include <linux/types.h>
#include <linux/socket.h>
#include <linux/tcp.h>
#include <netinet/in.h>
#include "bpf_helpers.h"
#include "bpf_endian.h"
#include "test_tcpbpf.h"
......
// SPDX-License-Identifier: GPL-2.0
#include <stddef.h>
#include <string.h>
#include <netinet/in.h>
#include <linux/bpf.h>
#include <linux/if_ether.h>
#include <linux/if_packet.h>
......@@ -9,7 +10,6 @@
#include <linux/types.h>
#include <linux/socket.h>
#include <linux/tcp.h>
#include <netinet/in.h>
#include "bpf_helpers.h"
#include "bpf_endian.h"
#include "test_tcpnotify.h"
......
......@@ -6642,6 +6642,51 @@ const struct btf_dedup_test dedup_tests[] = {
.dont_resolve_fwds = false,
},
},
{
.descr = "dedup: datasec and vars pass-through",
.input = {
.raw_types = {
/* int */
BTF_TYPE_INT_ENC(0, BTF_INT_SIGNED, 0, 32, 4), /* [1] */
/* static int t */
BTF_VAR_ENC(NAME_NTH(2), 1, 0), /* [2] */
/* .bss section */ /* [3] */
BTF_TYPE_ENC(NAME_NTH(1), BTF_INFO_ENC(BTF_KIND_DATASEC, 0, 1), 4),
BTF_VAR_SECINFO_ENC(2, 0, 4),
/* int, referenced from [5] */
BTF_TYPE_INT_ENC(0, BTF_INT_SIGNED, 0, 32, 4), /* [4] */
/* another static int t */
BTF_VAR_ENC(NAME_NTH(2), 4, 0), /* [5] */
/* another .bss section */ /* [6] */
BTF_TYPE_ENC(NAME_NTH(1), BTF_INFO_ENC(BTF_KIND_DATASEC, 0, 1), 4),
BTF_VAR_SECINFO_ENC(5, 0, 4),
BTF_END_RAW,
},
BTF_STR_SEC("\0.bss\0t"),
},
.expect = {
.raw_types = {
/* int */
BTF_TYPE_INT_ENC(0, BTF_INT_SIGNED, 0, 32, 4), /* [1] */
/* static int t */
BTF_VAR_ENC(NAME_NTH(2), 1, 0), /* [2] */
/* .bss section */ /* [3] */
BTF_TYPE_ENC(NAME_NTH(1), BTF_INFO_ENC(BTF_KIND_DATASEC, 0, 1), 4),
BTF_VAR_SECINFO_ENC(2, 0, 4),
/* another static int t */
BTF_VAR_ENC(NAME_NTH(2), 1, 0), /* [4] */
/* another .bss section */ /* [5] */
BTF_TYPE_ENC(NAME_NTH(1), BTF_INFO_ENC(BTF_KIND_DATASEC, 0, 1), 4),
BTF_VAR_SECINFO_ENC(4, 0, 4),
BTF_END_RAW,
},
BTF_STR_SEC("\0.bss\0t"),
},
.opts = {
.dont_resolve_fwds = false,
.dedup_table_size = 1
},
},
};
......@@ -6671,6 +6716,10 @@ static int btf_type_size(const struct btf_type *t)
return base_size + vlen * sizeof(struct btf_member);
case BTF_KIND_FUNC_PROTO:
return base_size + vlen * sizeof(struct btf_param);
case BTF_KIND_VAR:
return base_size + sizeof(struct btf_var);
case BTF_KIND_DATASEC:
return base_size + vlen * sizeof(struct btf_var_secinfo);
default:
fprintf(stderr, "Unsupported BTF_KIND:%u\n", kind);
return -EINVAL;
......
......@@ -129,6 +129,24 @@ setup()
ip link set veth7 netns ${NS2}
ip link set veth8 netns ${NS3}
if [ ! -z "${VRF}" ] ; then
ip -netns ${NS1} link add red type vrf table 1001
ip -netns ${NS1} link set red up
ip -netns ${NS1} route add table 1001 unreachable default metric 8192
ip -netns ${NS1} -6 route add table 1001 unreachable default metric 8192
ip -netns ${NS1} link set veth1 vrf red
ip -netns ${NS1} link set veth5 vrf red
ip -netns ${NS2} link add red type vrf table 1001
ip -netns ${NS2} link set red up
ip -netns ${NS2} route add table 1001 unreachable default metric 8192
ip -netns ${NS2} -6 route add table 1001 unreachable default metric 8192
ip -netns ${NS2} link set veth2 vrf red
ip -netns ${NS2} link set veth3 vrf red
ip -netns ${NS2} link set veth6 vrf red
ip -netns ${NS2} link set veth7 vrf red
fi
# configure addresses: the top route (1-2-3-4)
ip -netns ${NS1} addr add ${IPv4_1}/24 dev veth1
ip -netns ${NS2} addr add ${IPv4_2}/24 dev veth2
......@@ -163,29 +181,29 @@ setup()
# NS1
# top route
ip -netns ${NS1} route add ${IPv4_2}/32 dev veth1
ip -netns ${NS1} route add default dev veth1 via ${IPv4_2} # go top by default
ip -netns ${NS1} -6 route add ${IPv6_2}/128 dev veth1
ip -netns ${NS1} -6 route add default dev veth1 via ${IPv6_2} # go top by default
ip -netns ${NS1} route add ${IPv4_2}/32 dev veth1 ${VRF}
ip -netns ${NS1} route add default dev veth1 via ${IPv4_2} ${VRF} # go top by default
ip -netns ${NS1} -6 route add ${IPv6_2}/128 dev veth1 ${VRF}
ip -netns ${NS1} -6 route add default dev veth1 via ${IPv6_2} ${VRF} # go top by default
# bottom route
ip -netns ${NS1} route add ${IPv4_6}/32 dev veth5
ip -netns ${NS1} route add ${IPv4_7}/32 dev veth5 via ${IPv4_6}
ip -netns ${NS1} route add ${IPv4_8}/32 dev veth5 via ${IPv4_6}
ip -netns ${NS1} -6 route add ${IPv6_6}/128 dev veth5
ip -netns ${NS1} -6 route add ${IPv6_7}/128 dev veth5 via ${IPv6_6}
ip -netns ${NS1} -6 route add ${IPv6_8}/128 dev veth5 via ${IPv6_6}
ip -netns ${NS1} route add ${IPv4_6}/32 dev veth5 ${VRF}
ip -netns ${NS1} route add ${IPv4_7}/32 dev veth5 via ${IPv4_6} ${VRF}
ip -netns ${NS1} route add ${IPv4_8}/32 dev veth5 via ${IPv4_6} ${VRF}
ip -netns ${NS1} -6 route add ${IPv6_6}/128 dev veth5 ${VRF}
ip -netns ${NS1} -6 route add ${IPv6_7}/128 dev veth5 via ${IPv6_6} ${VRF}
ip -netns ${NS1} -6 route add ${IPv6_8}/128 dev veth5 via ${IPv6_6} ${VRF}
# NS2
# top route
ip -netns ${NS2} route add ${IPv4_1}/32 dev veth2
ip -netns ${NS2} route add ${IPv4_4}/32 dev veth3
ip -netns ${NS2} -6 route add ${IPv6_1}/128 dev veth2
ip -netns ${NS2} -6 route add ${IPv6_4}/128 dev veth3
ip -netns ${NS2} route add ${IPv4_1}/32 dev veth2 ${VRF}
ip -netns ${NS2} route add ${IPv4_4}/32 dev veth3 ${VRF}
ip -netns ${NS2} -6 route add ${IPv6_1}/128 dev veth2 ${VRF}
ip -netns ${NS2} -6 route add ${IPv6_4}/128 dev veth3 ${VRF}
# bottom route
ip -netns ${NS2} route add ${IPv4_5}/32 dev veth6
ip -netns ${NS2} route add ${IPv4_8}/32 dev veth7
ip -netns ${NS2} -6 route add ${IPv6_5}/128 dev veth6
ip -netns ${NS2} -6 route add ${IPv6_8}/128 dev veth7
ip -netns ${NS2} route add ${IPv4_5}/32 dev veth6 ${VRF}
ip -netns ${NS2} route add ${IPv4_8}/32 dev veth7 ${VRF}
ip -netns ${NS2} -6 route add ${IPv6_5}/128 dev veth6 ${VRF}
ip -netns ${NS2} -6 route add ${IPv6_8}/128 dev veth7 ${VRF}
# NS3
# top route
......@@ -207,16 +225,16 @@ setup()
ip -netns ${NS3} tunnel add gre_dev mode gre remote ${IPv4_1} local ${IPv4_GRE} ttl 255
ip -netns ${NS3} link set gre_dev up
ip -netns ${NS3} addr add ${IPv4_GRE} dev gre_dev
ip -netns ${NS1} route add ${IPv4_GRE}/32 dev veth5 via ${IPv4_6}
ip -netns ${NS2} route add ${IPv4_GRE}/32 dev veth7 via ${IPv4_8}
ip -netns ${NS1} route add ${IPv4_GRE}/32 dev veth5 via ${IPv4_6} ${VRF}
ip -netns ${NS2} route add ${IPv4_GRE}/32 dev veth7 via ${IPv4_8} ${VRF}
# configure IPv6 GRE device in NS3, and a route to it via the "bottom" route
ip -netns ${NS3} -6 tunnel add name gre6_dev mode ip6gre remote ${IPv6_1} local ${IPv6_GRE} ttl 255
ip -netns ${NS3} link set gre6_dev up
ip -netns ${NS3} -6 addr add ${IPv6_GRE} nodad dev gre6_dev
ip -netns ${NS1} -6 route add ${IPv6_GRE}/128 dev veth5 via ${IPv6_6}
ip -netns ${NS2} -6 route add ${IPv6_GRE}/128 dev veth7 via ${IPv6_8}
ip -netns ${NS1} -6 route add ${IPv6_GRE}/128 dev veth5 via ${IPv6_6} ${VRF}
ip -netns ${NS2} -6 route add ${IPv6_GRE}/128 dev veth7 via ${IPv6_8} ${VRF}
# rp_filter gets confused by what these tests are doing, so disable it
ip netns exec ${NS1} sysctl -wq net.ipv4.conf.all.rp_filter=0
......@@ -244,18 +262,18 @@ trap cleanup EXIT
remove_routes_to_gredev()
{
ip -netns ${NS1} route del ${IPv4_GRE} dev veth5
ip -netns ${NS2} route del ${IPv4_GRE} dev veth7
ip -netns ${NS1} -6 route del ${IPv6_GRE}/128 dev veth5
ip -netns ${NS2} -6 route del ${IPv6_GRE}/128 dev veth7
ip -netns ${NS1} route del ${IPv4_GRE} dev veth5 ${VRF}
ip -netns ${NS2} route del ${IPv4_GRE} dev veth7 ${VRF}
ip -netns ${NS1} -6 route del ${IPv6_GRE}/128 dev veth5 ${VRF}
ip -netns ${NS2} -6 route del ${IPv6_GRE}/128 dev veth7 ${VRF}
}
add_unreachable_routes_to_gredev()
{
ip -netns ${NS1} route add unreachable ${IPv4_GRE}/32
ip -netns ${NS2} route add unreachable ${IPv4_GRE}/32
ip -netns ${NS1} -6 route add unreachable ${IPv6_GRE}/128
ip -netns ${NS2} -6 route add unreachable ${IPv6_GRE}/128
ip -netns ${NS1} route add unreachable ${IPv4_GRE}/32 ${VRF}
ip -netns ${NS2} route add unreachable ${IPv4_GRE}/32 ${VRF}
ip -netns ${NS1} -6 route add unreachable ${IPv6_GRE}/128 ${VRF}
ip -netns ${NS2} -6 route add unreachable ${IPv6_GRE}/128 ${VRF}
}
test_ping()
......@@ -265,10 +283,10 @@ test_ping()
local RET=0
if [ "${PROTO}" == "IPv4" ] ; then
ip netns exec ${NS1} ping -c 1 -W 1 -I ${IPv4_SRC} ${IPv4_DST} 2>&1 > /dev/null
ip netns exec ${NS1} ping -c 1 -W 1 -I veth1 ${IPv4_DST} 2>&1 > /dev/null
RET=$?
elif [ "${PROTO}" == "IPv6" ] ; then
ip netns exec ${NS1} ping6 -c 1 -W 6 -I ${IPv6_SRC} ${IPv6_DST} 2>&1 > /dev/null
ip netns exec ${NS1} ping6 -c 1 -W 6 -I veth1 ${IPv6_DST} 2>&1 > /dev/null
RET=$?
else
echo " test_ping: unknown PROTO: ${PROTO}"
......@@ -328,7 +346,7 @@ test_gso()
test_egress()
{
local readonly ENCAP=$1
echo "starting egress ${ENCAP} encap test"
echo "starting egress ${ENCAP} encap test ${VRF}"
setup
# by default, pings work
......@@ -336,26 +354,35 @@ test_egress()
test_ping IPv6 0
# remove NS2->DST routes, ping fails
ip -netns ${NS2} route del ${IPv4_DST}/32 dev veth3
ip -netns ${NS2} -6 route del ${IPv6_DST}/128 dev veth3
ip -netns ${NS2} route del ${IPv4_DST}/32 dev veth3 ${VRF}
ip -netns ${NS2} -6 route del ${IPv6_DST}/128 dev veth3 ${VRF}
test_ping IPv4 1
test_ping IPv6 1
# install replacement routes (LWT/eBPF), pings succeed
if [ "${ENCAP}" == "IPv4" ] ; then
ip -netns ${NS1} route add ${IPv4_DST} encap bpf xmit obj test_lwt_ip_encap.o sec encap_gre dev veth1
ip -netns ${NS1} -6 route add ${IPv6_DST} encap bpf xmit obj test_lwt_ip_encap.o sec encap_gre dev veth1
ip -netns ${NS1} route add ${IPv4_DST} encap bpf xmit obj \
test_lwt_ip_encap.o sec encap_gre dev veth1 ${VRF}
ip -netns ${NS1} -6 route add ${IPv6_DST} encap bpf xmit obj \
test_lwt_ip_encap.o sec encap_gre dev veth1 ${VRF}
elif [ "${ENCAP}" == "IPv6" ] ; then
ip -netns ${NS1} route add ${IPv4_DST} encap bpf xmit obj test_lwt_ip_encap.o sec encap_gre6 dev veth1
ip -netns ${NS1} -6 route add ${IPv6_DST} encap bpf xmit obj test_lwt_ip_encap.o sec encap_gre6 dev veth1
ip -netns ${NS1} route add ${IPv4_DST} encap bpf xmit obj \
test_lwt_ip_encap.o sec encap_gre6 dev veth1 ${VRF}
ip -netns ${NS1} -6 route add ${IPv6_DST} encap bpf xmit obj \
test_lwt_ip_encap.o sec encap_gre6 dev veth1 ${VRF}
else
echo " unknown encap ${ENCAP}"
TEST_STATUS=1
fi
test_ping IPv4 0
test_ping IPv6 0
test_gso IPv4
test_gso IPv6
# skip GSO tests with VRF: VRF routing needs properly assigned
# source IP/device, which is easy to do with ping and hard with dd/nc.
if [ -z "${VRF}" ] ; then
test_gso IPv4
test_gso IPv6
fi
# a negative test: remove routes to GRE devices: ping fails
remove_routes_to_gredev
......@@ -374,7 +401,7 @@ test_egress()
test_ingress()
{
local readonly ENCAP=$1
echo "starting ingress ${ENCAP} encap test"
echo "starting ingress ${ENCAP} encap test ${VRF}"
setup
# need to wait a bit for IPv6 to autoconf, otherwise
......@@ -385,18 +412,22 @@ test_ingress()
test_ping IPv6 0
# remove NS2->DST routes, pings fail
ip -netns ${NS2} route del ${IPv4_DST}/32 dev veth3
ip -netns ${NS2} -6 route del ${IPv6_DST}/128 dev veth3
ip -netns ${NS2} route del ${IPv4_DST}/32 dev veth3 ${VRF}
ip -netns ${NS2} -6 route del ${IPv6_DST}/128 dev veth3 ${VRF}
test_ping IPv4 1
test_ping IPv6 1
# install replacement routes (LWT/eBPF), pings succeed
if [ "${ENCAP}" == "IPv4" ] ; then
ip -netns ${NS2} route add ${IPv4_DST} encap bpf in obj test_lwt_ip_encap.o sec encap_gre dev veth2
ip -netns ${NS2} -6 route add ${IPv6_DST} encap bpf in obj test_lwt_ip_encap.o sec encap_gre dev veth2
ip -netns ${NS2} route add ${IPv4_DST} encap bpf in obj \
test_lwt_ip_encap.o sec encap_gre dev veth2 ${VRF}
ip -netns ${NS2} -6 route add ${IPv6_DST} encap bpf in obj \
test_lwt_ip_encap.o sec encap_gre dev veth2 ${VRF}
elif [ "${ENCAP}" == "IPv6" ] ; then
ip -netns ${NS2} route add ${IPv4_DST} encap bpf in obj test_lwt_ip_encap.o sec encap_gre6 dev veth2
ip -netns ${NS2} -6 route add ${IPv6_DST} encap bpf in obj test_lwt_ip_encap.o sec encap_gre6 dev veth2
ip -netns ${NS2} route add ${IPv4_DST} encap bpf in obj \
test_lwt_ip_encap.o sec encap_gre6 dev veth2 ${VRF}
ip -netns ${NS2} -6 route add ${IPv6_DST} encap bpf in obj \
test_lwt_ip_encap.o sec encap_gre6 dev veth2 ${VRF}
else
echo "FAIL: unknown encap ${ENCAP}"
TEST_STATUS=1
......@@ -418,6 +449,13 @@ test_ingress()
process_test_results
}
VRF=""
test_egress IPv4
test_egress IPv6
test_ingress IPv4
test_ingress IPv6
VRF="vrf red"
test_egress IPv4
test_egress IPv6
test_ingress IPv4
......
......@@ -119,6 +119,11 @@ static struct sec_name_test tests[] = {
{0, BPF_PROG_TYPE_CGROUP_SOCK_ADDR, BPF_CGROUP_UDP6_SENDMSG},
{0, BPF_CGROUP_UDP6_SENDMSG},
},
{
"cgroup/sysctl",
{0, BPF_PROG_TYPE_CGROUP_SYSCTL, BPF_CGROUP_SYSCTL},
{0, BPF_CGROUP_SYSCTL},
},
};
static int test_prog_type_by_name(const struct sec_name_test *test)
......
This diff is collapsed.
......@@ -52,7 +52,7 @@
#define MAX_INSNS BPF_MAXINSNS
#define MAX_TEST_INSNS 1000000
#define MAX_FIXUPS 8
#define MAX_NR_MAPS 16
#define MAX_NR_MAPS 17
#define MAX_TEST_RUNS 8
#define POINTER_VALUE 0xcafe4all
#define TEST_DATA_LEN 64
......@@ -208,6 +208,76 @@ static void bpf_fill_rand_ld_dw(struct bpf_test *self)
self->retval = (uint32_t)res;
}
/* test the sequence of 1k jumps */
static void bpf_fill_scale1(struct bpf_test *self)
{
struct bpf_insn *insn = self->fill_insns;
int i = 0, k = 0;
insn[i++] = BPF_MOV64_REG(BPF_REG_6, BPF_REG_1);
/* test to check that the sequence of 1024 jumps is acceptable */
while (k++ < 1024) {
insn[i++] = BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0,
BPF_FUNC_get_prandom_u32);
insn[i++] = BPF_JMP_IMM(BPF_JGT, BPF_REG_0, bpf_semi_rand_get(), 2);
insn[i++] = BPF_MOV64_REG(BPF_REG_1, BPF_REG_10);
insn[i++] = BPF_STX_MEM(BPF_DW, BPF_REG_1, BPF_REG_6,
-8 * (k % 64 + 1));
}
/* every jump adds 1024 steps to insn_processed, so to stay exactly
* within the 1M limit, pad with MOVs up to index MAX_TEST_INSNS - 1025
* and finish with 1 EXIT
*/
while (i < MAX_TEST_INSNS - 1025)
insn[i++] = BPF_ALU32_IMM(BPF_MOV, BPF_REG_0, 42);
insn[i] = BPF_EXIT_INSN();
self->prog_len = i + 1;
self->retval = 42;
}
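
The index arithmetic in bpf_fill_scale1() can be checked in isolation. What follows is a minimal sketch and not part of the patch: it replaces the BPF instruction macros with a plain counter, and the names `scale1_layout` and `scale1_count` are hypothetical. It assumes MAX_TEST_INSNS is 1000000, as defined earlier in this diff.

```c
#include <assert.h>

#define MAX_TEST_INSNS 1000000

/* Hypothetical summary of the program that bpf_fill_scale1() emits. */
struct scale1_layout {
	int jump_blocks;   /* number of call/jump/mov/store groups */
	int filler_movs;   /* MOV instructions used as padding */
	int prog_len;      /* total instructions, including the final EXIT */
};

static struct scale1_layout scale1_count(void)
{
	struct scale1_layout l = {0, 0, 0};
	int i = 0, k = 0;

	i++;                           /* stands in for: r6 = r1 */
	while (k++ < 1024) {
		i += 4;                /* call, conditional jump, mov, store */
		l.jump_blocks++;
	}
	while (i < MAX_TEST_INSNS - 1025) {
		i++;                   /* stands in for: w0 = 42 */
		l.filler_movs++;
	}
	/* insn[i] would hold EXIT, so the emitted program is i + 1 long. */
	l.prog_len = i + 1;
	assert(l.prog_len <= MAX_TEST_INSNS);
	return l;
}
```

Under these assumptions the counter reports 1024 jump blocks, 994878 filler MOVs, and 998976 instructions in total, which leaves 1024 instructions of headroom below MAX_TEST_INSNS.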
/* test the sequence of 1k jumps in innermost function (function depth 8) */
static void bpf_fill_scale2(struct bpf_test *self)
{
struct bpf_insn *insn = self->fill_insns;
int i = 0, k = 0;
#define FUNC_NEST 7
for (k = 0; k < FUNC_NEST; k++) {
insn[i++] = BPF_CALL_REL(1);
insn[i++] = BPF_EXIT_INSN();
}
insn[i++] = BPF_MOV64_REG(BPF_REG_6, BPF_REG_1);
/* test to check that the sequence of 1024 jumps is acceptable */
while (k++ < 1024) {
insn[i++] = BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0,
BPF_FUNC_get_prandom_u32);
insn[i++] = BPF_JMP_IMM(BPF_JGT, BPF_REG_0, bpf_semi_rand_get(), 2);
insn[i++] = BPF_MOV64_REG(BPF_REG_1, BPF_REG_10);
insn[i++] = BPF_STX_MEM(BPF_DW, BPF_REG_1, BPF_REG_6,
-8 * (k % (64 - 4 * FUNC_NEST) + 1));
}
/* every jump adds 1024 steps to insn_processed, so to stay exactly
* within the 1M limit, pad with MOVs up to index MAX_TEST_INSNS - 1025
* and finish with 1 EXIT
*/
while (i < MAX_TEST_INSNS - 1025)
insn[i++] = BPF_ALU32_IMM(BPF_MOV, BPF_REG_0, 42);
insn[i] = BPF_EXIT_INSN();
self->prog_len = i + 1;
self->retval = 42;
}
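
Two details of bpf_fill_scale2() are easy to miss: k is not reset after the call/exit prologue, and the stack-slot modulus shrinks with FUNC_NEST. The sketch below is not part of the patch; it only replays the loop counters, and `scale2_summary` and `scale2_count` are hypothetical names.

```c
#include <assert.h>

#define FUNC_NEST 7

/* Hypothetical summary of the loops in bpf_fill_scale2(). */
struct scale2_summary {
	int prologue_insns;   /* call/exit pairs emitted before the jumps */
	int jump_blocks;      /* iterations of the "1k jumps" loop */
	int max_stack_bytes;  /* deepest store below the frame pointer */
};

static struct scale2_summary scale2_count(void)
{
	struct scale2_summary s = {0, 0, 0};
	int k;

	/* One BPF_CALL_REL(1) plus one EXIT per nesting level. */
	for (k = 0; k < FUNC_NEST; k++)
		s.prologue_insns += 2;

	/* k still equals FUNC_NEST here, so the loop runs 1024 - FUNC_NEST times. */
	while (k++ < 1024) {
		int depth = 8 * (k % (64 - 4 * FUNC_NEST) + 1);

		if (depth > s.max_stack_bytes)
			s.max_stack_bytes = depth;
		s.jump_blocks++;
	}
	assert(s.max_stack_bytes <= 512);
	return s;
}
```

With FUNC_NEST at 7 the sketch reports 14 prologue instructions, 1017 jump blocks rather than 1024, and a deepest store of 288 bytes. If the verifier is assumed to account at least 32 bytes for each of the 7 outer frames, 7 * 32 + 288 comes to exactly 512 bytes, which would explain the 64 - 4 * FUNC_NEST modulus; that accounting is an assumption here and is not shown in this diff.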
static void bpf_fill_scale(struct bpf_test *self)
{
switch (self->retval) {
case 1:
return bpf_fill_scale1(self);
case 2:
return bpf_fill_scale2(self);
default:
self->prog_len = 0;
break;
}
}
/* BPF_SK_LOOKUP contains 13 instructions, if you need to fix up maps */
#define BPF_SK_LOOKUP(func) \
/* struct bpf_sock_tuple tuple = {} */ \
......
{
"ARG_PTR_TO_LONG uninitialized",
.insns = {
/* bpf_strtoul arg1 (buf) */
BPF_MOV64_REG(BPF_REG_7, BPF_REG_10),
BPF_ALU64_IMM(BPF_ADD, BPF_REG_7, -8),
BPF_MOV64_IMM(BPF_REG_0, 0x00303036),
BPF_STX_MEM(BPF_DW, BPF_REG_7, BPF_REG_0, 0),
BPF_MOV64_REG(BPF_REG_1, BPF_REG_7),
/* bpf_strtoul arg2 (buf_len) */
BPF_MOV64_IMM(BPF_REG_2, 4),
/* bpf_strtoul arg3 (flags) */
BPF_MOV64_IMM(BPF_REG_3, 0),
/* bpf_strtoul arg4 (res) */
BPF_ALU64_IMM(BPF_ADD, BPF_REG_7, -8),
BPF_MOV64_REG(BPF_REG_4, BPF_REG_7),
/* bpf_strtoul() */
BPF_EMIT_CALL(BPF_FUNC_strtoul),
BPF_MOV64_IMM(BPF_REG_0, 1),
BPF_EXIT_INSN(),
},
.result = REJECT,
.prog_type = BPF_PROG_TYPE_CGROUP_SYSCTL,
.errstr = "invalid indirect read from stack off -16+0 size 8",
},
{
"ARG_PTR_TO_LONG half-uninitialized",
.insns = {
/* bpf_strtoul arg1 (buf) */
BPF_MOV64_REG(BPF_REG_7, BPF_REG_10),
BPF_ALU64_IMM(BPF_ADD, BPF_REG_7, -8),
BPF_MOV64_IMM(BPF_REG_0, 0x00303036),
BPF_STX_MEM(BPF_DW, BPF_REG_7, BPF_REG_0, 0),
BPF_MOV64_REG(BPF_REG_1, BPF_REG_7),
/* bpf_strtoul arg2 (buf_len) */
BPF_MOV64_IMM(BPF_REG_2, 4),
/* bpf_strtoul arg3 (flags) */
BPF_MOV64_IMM(BPF_REG_3, 0),
/* bpf_strtoul arg4 (res) */
BPF_ALU64_IMM(BPF_ADD, BPF_REG_7, -8),
BPF_STX_MEM(BPF_W, BPF_REG_7, BPF_REG_0, 0),
BPF_MOV64_REG(BPF_REG_4, BPF_REG_7),
/* bpf_strtoul() */
BPF_EMIT_CALL(BPF_FUNC_strtoul),
BPF_MOV64_IMM(BPF_REG_0, 1),
BPF_EXIT_INSN(),
},
.result = REJECT,
.prog_type = BPF_PROG_TYPE_CGROUP_SYSCTL,
.errstr = "invalid indirect read from stack off -16+4 size 8",
},
{
"ARG_PTR_TO_LONG misaligned",
.insns = {
/* bpf_strtoul arg1 (buf) */
BPF_MOV64_REG(BPF_REG_7, BPF_REG_10),
BPF_ALU64_IMM(BPF_ADD, BPF_REG_7, -8),
BPF_MOV64_IMM(BPF_REG_0, 0x00303036),
BPF_STX_MEM(BPF_DW, BPF_REG_7, BPF_REG_0, 0),
BPF_MOV64_REG(BPF_REG_1, BPF_REG_7),
/* bpf_strtoul arg2 (buf_len) */
BPF_MOV64_IMM(BPF_REG_2, 4),
/* bpf_strtoul arg3 (flags) */
BPF_MOV64_IMM(BPF_REG_3, 0),
/* bpf_strtoul arg4 (res) */
BPF_ALU64_IMM(BPF_ADD, BPF_REG_7, -12),
BPF_MOV64_IMM(BPF_REG_0, 0),
BPF_STX_MEM(BPF_W, BPF_REG_7, BPF_REG_0, 0),
BPF_STX_MEM(BPF_DW, BPF_REG_7, BPF_REG_0, 4),
BPF_MOV64_REG(BPF_REG_4, BPF_REG_7),
/* bpf_strtoul() */
BPF_EMIT_CALL(BPF_FUNC_strtoul),
BPF_MOV64_IMM(BPF_REG_0, 1),
BPF_EXIT_INSN(),
},
.result = REJECT,
.prog_type = BPF_PROG_TYPE_CGROUP_SYSCTL,
.errstr = "misaligned stack access off (0x0; 0x0)+-20+0 size 8",
},
{
"ARG_PTR_TO_LONG size < sizeof(long)",
.insns = {
/* bpf_strtoul arg1 (buf) */
BPF_MOV64_REG(BPF_REG_7, BPF_REG_10),
BPF_ALU64_IMM(BPF_ADD, BPF_REG_7, -16),
BPF_MOV64_IMM(BPF_REG_0, 0x00303036),
BPF_STX_MEM(BPF_DW, BPF_REG_7, BPF_REG_0, 0),
BPF_MOV64_REG(BPF_REG_1, BPF_REG_7),
/* bpf_strtoul arg2 (buf_len) */
BPF_MOV64_IMM(BPF_REG_2, 4),
/* bpf_strtoul arg3 (flags) */
BPF_MOV64_IMM(BPF_REG_3, 0),
/* bpf_strtoul arg4 (res) */
BPF_ALU64_IMM(BPF_ADD, BPF_REG_7, 12),
BPF_STX_MEM(BPF_W, BPF_REG_7, BPF_REG_0, 0),
BPF_MOV64_REG(BPF_REG_4, BPF_REG_7),
/* bpf_strtoul() */
BPF_EMIT_CALL(BPF_FUNC_strtoul),
BPF_MOV64_IMM(BPF_REG_0, 1),
BPF_EXIT_INSN(),
},
.result = REJECT,
.prog_type = BPF_PROG_TYPE_CGROUP_SYSCTL,
.errstr = "invalid stack type R4 off=-4 access_size=8",
},
{
"ARG_PTR_TO_LONG initialized",
.insns = {
/* bpf_strtoul arg1 (buf) */
BPF_MOV64_REG(BPF_REG_7, BPF_REG_10),
BPF_ALU64_IMM(BPF_ADD, BPF_REG_7, -8),
BPF_MOV64_IMM(BPF_REG_0, 0x00303036),
BPF_STX_MEM(BPF_DW, BPF_REG_7, BPF_REG_0, 0),
BPF_MOV64_REG(BPF_REG_1, BPF_REG_7),
/* bpf_strtoul arg2 (buf_len) */
BPF_MOV64_IMM(BPF_REG_2, 4),
/* bpf_strtoul arg3 (flags) */
BPF_MOV64_IMM(BPF_REG_3, 0),
/* bpf_strtoul arg4 (res) */
BPF_ALU64_IMM(BPF_ADD, BPF_REG_7, -8),
BPF_STX_MEM(BPF_DW, BPF_REG_7, BPF_REG_0, 0),
BPF_MOV64_REG(BPF_REG_4, BPF_REG_7),
/* bpf_strtoul() */
BPF_EMIT_CALL(BPF_FUNC_strtoul),
BPF_MOV64_IMM(BPF_REG_0, 1),
BPF_EXIT_INSN(),
},
.result = ACCEPT,
.prog_type = BPF_PROG_TYPE_CGROUP_SYSCTL,
},
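
The five ARG_PTR_TO_LONG cases above differ only in where R4 points and which stack bytes were written beforehand. The toy model below is a sketch for reasoning about those offsets, not verifier code: the names `toy_stack`, `toy_store`, and `toy_check_long_arg` are hypothetical, and the order of the checks (bounds, then initialization, then alignment) is an assumption chosen to reproduce the errstr values in these tests.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define STACK_SIZE 512

enum long_arg_check {
	LONG_ARG_OK,
	LONG_ARG_OUT_OF_BOUNDS,   /* cf. "invalid stack type ..." */
	LONG_ARG_UNINITIALIZED,   /* cf. "invalid indirect read from stack ..." */
	LONG_ARG_MISALIGNED,      /* cf. "misaligned stack access ..." */
};

/* init[j] tracks the byte at frame-pointer offset -(j + 1). */
struct toy_stack {
	bool init[STACK_SIZE];
};

static void toy_stack_reset(struct toy_stack *s)
{
	memset(s, 0, sizeof(*s));
}

/* Mark 'size' bytes starting at negative offset 'off' as written. */
static void toy_store(struct toy_stack *s, int off, int size)
{
	for (int b = 0; b < size; b++) {
		assert(off + b < 0 && off + b >= -STACK_SIZE);
		s->init[-(off + b) - 1] = true;
	}
}

/* Classify an 8-byte "long" argument that points at offset 'off'. */
static enum long_arg_check toy_check_long_arg(const struct toy_stack *s, int off)
{
	const int size = 8;

	if (off >= 0 || off < -STACK_SIZE || off + size > 0)
		return LONG_ARG_OUT_OF_BOUNDS;
	for (int b = 0; b < size; b++)
		if (!s->init[-(off + b) - 1])
			return LONG_ARG_UNINITIALIZED;
	if (off % size != 0)
		return LONG_ARG_MISALIGNED;
	return LONG_ARG_OK;
}
```

To replay a case, reset the model, call toy_store() once per BPF_STX_MEM in the test, and pass the final R7 offset to toy_check_long_arg(). For example, the "misaligned" case stores 8 bytes at -8, 4 bytes at -20, and 8 bytes at -16, then checks offset -20.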
{
"scale: scale test 1",
.insns = { },
.data = { },
.fill_helper = bpf_fill_scale,
.prog_type = BPF_PROG_TYPE_SCHED_CLS,
.result = ACCEPT,
.retval = 1,
},
{
"scale: scale test 2",
.insns = { },
.data = { },
.fill_helper = bpf_fill_scale,
.prog_type = BPF_PROG_TYPE_SCHED_CLS,
.result = ACCEPT,
.retval = 2,
},