提交 2bbc078f 编写于 作者: D David S. Miller

Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next

Daniel Borkmann says:

====================
pull-request: bpf-next 2019-12-27

The following pull-request contains BPF updates for your *net-next* tree.

We've added 127 non-merge commits during the last 17 day(s) which contain
a total of 110 files changed, 6901 insertions(+), 2721 deletions(-).

There are three merge conflicts. Conflicts and resolution looks as follows:

1) Merge conflict in net/bpf/test_run.c:

There was a tree-wide cleanup c593642c ("treewide: Use sizeof_field() macro")
which gets in the way with b590cb5f ("bpf: Switch to offsetofend in
BPF_PROG_TEST_RUN"):

  <<<<<<< HEAD
          if (!range_is_zero(__skb, offsetof(struct __sk_buff, priority) +
                             sizeof_field(struct __sk_buff, priority),
  =======
          if (!range_is_zero(__skb, offsetofend(struct __sk_buff, priority),
  >>>>>>> 7c8dce4b

There are a few occasions that look similar to this. Always take the chunk with
offsetofend(). Note that there is one where the fields differ in here:

  <<<<<<< HEAD
          if (!range_is_zero(__skb, offsetof(struct __sk_buff, tstamp) +
                             sizeof_field(struct __sk_buff, tstamp),
  =======
          if (!range_is_zero(__skb, offsetofend(struct __sk_buff, gso_segs),
  >>>>>>> 7c8dce4b

Just take the one with offsetofend() /and/ gso_segs. Latter is correct due to
850a88cc ("bpf: Expose __sk_buff wire_len/gso_segs to BPF_PROG_TEST_RUN").

2) Merge conflict in arch/riscv/net/bpf_jit_comp.c:

(I'm keeping Bjorn in Cc here for a double-check in case I got it wrong.)

  <<<<<<< HEAD
          if (is_13b_check(off, insn))
                  return -1;
          emit(rv_blt(tcc, RV_REG_ZERO, off >> 1), ctx);
  =======
          emit_branch(BPF_JSLT, RV_REG_T1, RV_REG_ZERO, off, ctx);
  >>>>>>> 7c8dce4b

Result should look like:

          emit_branch(BPF_JSLT, tcc, RV_REG_ZERO, off, ctx);

3) Merge conflict in arch/riscv/include/asm/pgtable.h:

  <<<<<<< HEAD
  =======
  #define VMALLOC_SIZE     (KERN_VIRT_SIZE >> 1)
  #define VMALLOC_END      (PAGE_OFFSET - 1)
  #define VMALLOC_START    (PAGE_OFFSET - VMALLOC_SIZE)

  #define BPF_JIT_REGION_SIZE     (SZ_128M)
  #define BPF_JIT_REGION_START    (PAGE_OFFSET - BPF_JIT_REGION_SIZE)
  #define BPF_JIT_REGION_END      (VMALLOC_END)

  /*
   * Roughly size the vmemmap space to be large enough to fit enough
   * struct pages to map half the virtual address space. Then
   * position vmemmap directly below the VMALLOC region.
   */
  #define VMEMMAP_SHIFT \
          (CONFIG_VA_BITS - PAGE_SHIFT - 1 + STRUCT_PAGE_MAX_SHIFT)
  #define VMEMMAP_SIZE    BIT(VMEMMAP_SHIFT)
  #define VMEMMAP_END     (VMALLOC_START - 1)
  #define VMEMMAP_START   (VMALLOC_START - VMEMMAP_SIZE)

  #define vmemmap         ((struct page *)VMEMMAP_START)

  >>>>>>> 7c8dce4b

Only take the BPF_* defines from there and move them higher up in the
same file. Remove the rest from the chunk. The VMALLOC_* etc defines
got moved via 01f52e16 ("riscv: define vmemmap before pfn_to_page
calls"). Result:

  [...]
  #define __S101  PAGE_READ_EXEC
  #define __S110  PAGE_SHARED_EXEC
  #define __S111  PAGE_SHARED_EXEC

  #define VMALLOC_SIZE     (KERN_VIRT_SIZE >> 1)
  #define VMALLOC_END      (PAGE_OFFSET - 1)
  #define VMALLOC_START    (PAGE_OFFSET - VMALLOC_SIZE)

  #define BPF_JIT_REGION_SIZE     (SZ_128M)
  #define BPF_JIT_REGION_START    (PAGE_OFFSET - BPF_JIT_REGION_SIZE)
  #define BPF_JIT_REGION_END      (VMALLOC_END)

  /*
   * Roughly size the vmemmap space to be large enough to fit enough
   * struct pages to map half the virtual address space. Then
   * position vmemmap directly below the VMALLOC region.
   */
  #define VMEMMAP_SHIFT \
          (CONFIG_VA_BITS - PAGE_SHIFT - 1 + STRUCT_PAGE_MAX_SHIFT)
  #define VMEMMAP_SIZE    BIT(VMEMMAP_SHIFT)
  #define VMEMMAP_END     (VMALLOC_START - 1)
  #define VMEMMAP_START   (VMALLOC_START - VMEMMAP_SIZE)

  [...]

Let me know if there are any other issues.

Anyway, the main changes are:

1) Extend bpftool to produce a struct (aka "skeleton") tailored and specific
   to a provided BPF object file. This provides an alternative, simplified API
   compared to standard libbpf interaction. Also, add libbpf extern variable
   resolution for .kconfig section to import Kconfig data, from Andrii Nakryiko.

2) Add BPF dispatcher for XDP which is a mechanism to avoid indirect calls by
   generating a branch funnel as discussed back in bpfconf'19 at LSF/MM. Also,
   add various BPF riscv JIT improvements, from Björn Töpel.

3) Extend bpftool to allow matching BPF programs and maps by name,
   from Paul Chaignon.

4) Support for replacing cgroup BPF programs attached with BPF_F_ALLOW_MULTI
   flag for allowing updates without service interruption, from Andrey Ignatov.

5) Cleanup and simplification of ring access functions for AF_XDP with a
   bonus of 0-5% performance improvement, from Magnus Karlsson.

6) Enable BPF JITs for x86-64 and arm64 by default. Also, final version of
   audit support for BPF, from Daniel Borkmann and latter with Jiri Olsa.

7) Move and extend test_select_reuseport into BPF program tests under
   BPF selftests, from Jakub Sitnicki.

8) Various BPF sample improvements for xdpsock for customizing parameters
   to set up and benchmark AF_XDP, from Jay Jayatheerthan.

9) Improve libbpf to provide a ulimit hint on permission denied errors.
   Also change XDP sample programs to attach in driver mode by default,
   from Toke Høiland-Jørgensen.

10) Extend BPF test infrastructure to allow changing skb mark from tc BPF
    programs, from Nikita V. Shirokov.

11) Optimize prologue code sequence in BPF arm32 JIT, from Russell King.

12) Fix xdp_redirect_cpu BPF sample to manually attach to tracepoints after
    libbpf conversion, from Jesper Dangaard Brouer.

13) Minor misc improvements from various others.
====================
Signed-off-by: NDavid S. Miller <davem@davemloft.net>
......@@ -1260,12 +1260,9 @@ static inline void emit_push_r64(const s8 src[], struct jit_ctx *ctx)
static void build_prologue(struct jit_ctx *ctx)
{
const s8 r0 = bpf2a32[BPF_REG_0][1];
const s8 r2 = bpf2a32[BPF_REG_1][1];
const s8 r3 = bpf2a32[BPF_REG_1][0];
const s8 r4 = bpf2a32[BPF_REG_6][1];
const s8 fplo = bpf2a32[BPF_REG_FP][1];
const s8 fphi = bpf2a32[BPF_REG_FP][0];
const s8 arm_r0 = bpf2a32[BPF_REG_0][1];
const s8 *bpf_r1 = bpf2a32[BPF_REG_1];
const s8 *bpf_fp = bpf2a32[BPF_REG_FP];
const s8 *tcc = bpf2a32[TCALL_CNT];
/* Save callee saved registers. */
......@@ -1278,8 +1275,10 @@ static void build_prologue(struct jit_ctx *ctx)
emit(ARM_PUSH(CALLEE_PUSH_MASK), ctx);
emit(ARM_MOV_R(ARM_FP, ARM_SP), ctx);
#endif
/* Save frame pointer for later */
emit(ARM_SUB_I(ARM_IP, ARM_SP, SCRATCH_SIZE), ctx);
/* mov r3, #0 */
/* sub r2, sp, #SCRATCH_SIZE */
emit(ARM_MOV_I(bpf_r1[0], 0), ctx);
emit(ARM_SUB_I(bpf_r1[1], ARM_SP, SCRATCH_SIZE), ctx);
ctx->stack_size = imm8m(STACK_SIZE);
......@@ -1287,18 +1286,15 @@ static void build_prologue(struct jit_ctx *ctx)
emit(ARM_SUB_I(ARM_SP, ARM_SP, ctx->stack_size), ctx);
/* Set up BPF prog stack base register */
emit_a32_mov_r(fplo, ARM_IP, ctx);
emit_a32_mov_i(fphi, 0, ctx);
emit_a32_mov_r64(true, bpf_fp, bpf_r1, ctx);
/* mov r4, 0 */
emit(ARM_MOV_I(r4, 0), ctx);
/* Initialize Tail Count */
emit(ARM_MOV_I(bpf_r1[1], 0), ctx);
emit_a32_mov_r64(true, tcc, bpf_r1, ctx);
/* Move BPF_CTX to BPF_R1 */
emit(ARM_MOV_R(r3, r4), ctx);
emit(ARM_MOV_R(r2, r0), ctx);
/* Initialize Tail Count */
emit(ARM_STR_I(r4, ARM_FP, EBPF_SCRATCH_TO_ARM_FP(tcc[0])), ctx);
emit(ARM_STR_I(r4, ARM_FP, EBPF_SCRATCH_TO_ARM_FP(tcc[1])), ctx);
emit(ARM_MOV_R(bpf_r1[1], arm_r0), ctx);
/* end of prologue */
}
......
......@@ -69,6 +69,7 @@ config ARM64
select ARCH_SUPPORTS_INT128 if CC_HAS_INT128 && (GCC_VERSION >= 50000 || CC_IS_CLANG)
select ARCH_SUPPORTS_NUMA_BALANCING
select ARCH_WANT_COMPAT_IPC_PARSE_VERSION if COMPAT
select ARCH_WANT_DEFAULT_BPF_JIT
select ARCH_WANT_DEFAULT_TOPDOWN_MMAP_LAYOUT
select ARCH_WANT_FRAME_POINTERS
select ARCH_WANT_HUGE_PMD_SHARE if ARM64_4K_PAGES || (ARM64_16K_PAGES && !ARM64_VA_BITS_36)
......
......@@ -82,4 +82,8 @@ struct riscv_pmu {
int irq;
};
#ifdef CONFIG_PERF_EVENTS
#define perf_arch_bpf_user_pt_regs(regs) (struct user_regs_struct *)regs
#endif
#endif /* _ASM_RISCV_PERF_EVENT_H */
......@@ -94,6 +94,10 @@ extern pgd_t swapper_pg_dir[];
#define VMALLOC_END (PAGE_OFFSET - 1)
#define VMALLOC_START (PAGE_OFFSET - VMALLOC_SIZE)
#define BPF_JIT_REGION_SIZE (SZ_128M)
#define BPF_JIT_REGION_START (PAGE_OFFSET - BPF_JIT_REGION_SIZE)
#define BPF_JIT_REGION_END (VMALLOC_END)
/*
* Roughly size the vmemmap space to be large enough to fit enough
* struct pages to map half the virtual address space. Then
......
/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */
#ifndef _UAPI__ASM_BPF_PERF_EVENT_H__
#define _UAPI__ASM_BPF_PERF_EVENT_H__
#include <asm/ptrace.h>
typedef struct user_regs_struct bpf_user_pt_regs_t;
#endif /* _UAPI__ASM_BPF_PERF_EVENT_H__ */
此差异已折叠。
......@@ -93,6 +93,7 @@ config X86
select ARCH_USE_QUEUED_RWLOCKS
select ARCH_USE_QUEUED_SPINLOCKS
select ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH
select ARCH_WANT_DEFAULT_BPF_JIT if X86_64
select ARCH_WANTS_DYNAMIC_TASK_STRUCT
select ARCH_WANT_HUGE_PMD_SHARE
select ARCH_WANTS_THP_SWAP if X86_64
......
......@@ -10,10 +10,12 @@
#include <linux/if_vlan.h>
#include <linux/bpf.h>
#include <linux/memory.h>
#include <linux/sort.h>
#include <asm/extable.h>
#include <asm/set_memory.h>
#include <asm/nospec-branch.h>
#include <asm/text-patching.h>
#include <asm/asm-prototypes.h>
static u8 *emit_code(u8 *ptr, u32 bytes, unsigned int len)
{
......@@ -1530,6 +1532,154 @@ int arch_prepare_bpf_trampoline(void *image, struct btf_func_model *m, u32 flags
return 0;
}
static int emit_cond_near_jump(u8 **pprog, void *func, void *ip, u8 jmp_cond)
{
u8 *prog = *pprog;
int cnt = 0;
s64 offset;
offset = func - (ip + 2 + 4);
if (!is_simm32(offset)) {
pr_err("Target %p is out of range\n", func);
return -EINVAL;
}
EMIT2_off32(0x0F, jmp_cond + 0x10, offset);
*pprog = prog;
return 0;
}
static void emit_nops(u8 **pprog, unsigned int len)
{
unsigned int i, noplen;
u8 *prog = *pprog;
int cnt = 0;
while (len > 0) {
noplen = len;
if (noplen > ASM_NOP_MAX)
noplen = ASM_NOP_MAX;
for (i = 0; i < noplen; i++)
EMIT1(ideal_nops[noplen][i]);
len -= noplen;
}
*pprog = prog;
}
static int emit_fallback_jump(u8 **pprog)
{
u8 *prog = *pprog;
int err = 0;
#ifdef CONFIG_RETPOLINE
/* Note that this assumes the the compiler uses external
* thunks for indirect calls. Both clang and GCC use the same
* naming convention for external thunks.
*/
err = emit_jump(&prog, __x86_indirect_thunk_rdx, prog);
#else
int cnt = 0;
EMIT2(0xFF, 0xE2); /* jmp rdx */
#endif
*pprog = prog;
return err;
}
static int emit_bpf_dispatcher(u8 **pprog, int a, int b, s64 *progs)
{
u8 *jg_reloc, *jg_target, *prog = *pprog;
int pivot, err, jg_bytes = 1, cnt = 0;
s64 jg_offset;
if (a == b) {
/* Leaf node of recursion, i.e. not a range of indices
* anymore.
*/
EMIT1(add_1mod(0x48, BPF_REG_3)); /* cmp rdx,func */
if (!is_simm32(progs[a]))
return -1;
EMIT2_off32(0x81, add_1reg(0xF8, BPF_REG_3),
progs[a]);
err = emit_cond_near_jump(&prog, /* je func */
(void *)progs[a], prog,
X86_JE);
if (err)
return err;
err = emit_fallback_jump(&prog); /* jmp thunk/indirect */
if (err)
return err;
*pprog = prog;
return 0;
}
/* Not a leaf node, so we pivot, and recursively descend into
* the lower and upper ranges.
*/
pivot = (b - a) / 2;
EMIT1(add_1mod(0x48, BPF_REG_3)); /* cmp rdx,func */
if (!is_simm32(progs[a + pivot]))
return -1;
EMIT2_off32(0x81, add_1reg(0xF8, BPF_REG_3), progs[a + pivot]);
if (pivot > 2) { /* jg upper_part */
/* Require near jump. */
jg_bytes = 4;
EMIT2_off32(0x0F, X86_JG + 0x10, 0);
} else {
EMIT2(X86_JG, 0);
}
jg_reloc = prog;
err = emit_bpf_dispatcher(&prog, a, a + pivot, /* emit lower_part */
progs);
if (err)
return err;
/* From Intel 64 and IA-32 Architectures Optimization
* Reference Manual, 3.4.1.4 Code Alignment, Assembly/Compiler
* Coding Rule 11: All branch targets should be 16-byte
* aligned.
*/
jg_target = PTR_ALIGN(prog, 16);
if (jg_target != prog)
emit_nops(&prog, jg_target - prog);
jg_offset = prog - jg_reloc;
emit_code(jg_reloc - jg_bytes, jg_offset, jg_bytes);
err = emit_bpf_dispatcher(&prog, a + pivot + 1, /* emit upper_part */
b, progs);
if (err)
return err;
*pprog = prog;
return 0;
}
static int cmp_ips(const void *a, const void *b)
{
const s64 *ipa = a;
const s64 *ipb = b;
if (*ipa > *ipb)
return 1;
if (*ipa < *ipb)
return -1;
return 0;
}
int arch_prepare_bpf_dispatcher(void *image, s64 *funcs, int num_funcs)
{
u8 *prog = image;
sort(funcs, num_funcs, sizeof(funcs[0]), cmp_ips, NULL);
return emit_bpf_dispatcher(&prog, 0, num_funcs - 1, funcs);
}
struct x64_jit_data {
struct bpf_binary_header *header;
int *addrs;
......
......@@ -269,7 +269,7 @@ static bool i40e_alloc_buffer_zc(struct i40e_ring *rx_ring,
bi->handle = xsk_umem_adjust_offset(umem, handle, umem->headroom);
xsk_umem_discard_addr(umem);
xsk_umem_release_addr(umem);
return true;
}
......@@ -306,7 +306,7 @@ static bool i40e_alloc_buffer_slow_zc(struct i40e_ring *rx_ring,
bi->handle = xsk_umem_adjust_offset(umem, handle, umem->headroom);
xsk_umem_discard_addr_rq(umem);
xsk_umem_release_addr_rq(umem);
return true;
}
......
......@@ -555,7 +555,7 @@ ice_alloc_buf_fast_zc(struct ice_ring *rx_ring, struct ice_rx_buf *rx_buf)
rx_buf->handle = handle + umem->headroom;
xsk_umem_discard_addr(umem);
xsk_umem_release_addr(umem);
return true;
}
......@@ -591,7 +591,7 @@ ice_alloc_buf_slow_zc(struct ice_ring *rx_ring, struct ice_rx_buf *rx_buf)
rx_buf->handle = handle + umem->headroom;
xsk_umem_discard_addr_rq(umem);
xsk_umem_release_addr_rq(umem);
return true;
}
......
......@@ -277,7 +277,7 @@ static bool ixgbe_alloc_buffer_zc(struct ixgbe_ring *rx_ring,
bi->handle = xsk_umem_adjust_offset(umem, handle, umem->headroom);
xsk_umem_discard_addr(umem);
xsk_umem_release_addr(umem);
return true;
}
......@@ -304,7 +304,7 @@ static bool ixgbe_alloc_buffer_slow_zc(struct ixgbe_ring *rx_ring,
bi->handle = xsk_umem_adjust_offset(umem, handle, umem->headroom);
xsk_umem_discard_addr_rq(umem);
xsk_umem_release_addr_rq(umem);
return true;
}
......
......@@ -35,7 +35,7 @@ int mlx5e_xsk_page_alloc_umem(struct mlx5e_rq *rq,
*/
dma_info->addr = xdp_umem_get_dma(umem, handle);
xsk_umem_discard_addr_rq(umem);
xsk_umem_release_addr_rq(umem);
dma_sync_single_for_device(rq->pdev, dma_info->addr, PAGE_SIZE,
DMA_BIDIRECTIONAL);
......
......@@ -85,6 +85,7 @@ int cgroup_bpf_inherit(struct cgroup *cgrp);
void cgroup_bpf_offline(struct cgroup *cgrp);
int __cgroup_bpf_attach(struct cgroup *cgrp, struct bpf_prog *prog,
struct bpf_prog *replace_prog,
enum bpf_attach_type type, u32 flags);
int __cgroup_bpf_detach(struct cgroup *cgrp, struct bpf_prog *prog,
enum bpf_attach_type type);
......@@ -93,7 +94,8 @@ int __cgroup_bpf_query(struct cgroup *cgrp, const union bpf_attr *attr,
/* Wrapper for __cgroup_bpf_*() protected by cgroup_mutex */
int cgroup_bpf_attach(struct cgroup *cgrp, struct bpf_prog *prog,
enum bpf_attach_type type, u32 flags);
struct bpf_prog *replace_prog, enum bpf_attach_type type,
u32 flags);
int cgroup_bpf_detach(struct cgroup *cgrp, struct bpf_prog *prog,
enum bpf_attach_type type, u32 flags);
int cgroup_bpf_query(struct cgroup *cgrp, const union bpf_attr *attr,
......
......@@ -471,11 +471,69 @@ struct bpf_trampoline {
void *image;
u64 selector;
};
#define BPF_DISPATCHER_MAX 48 /* Fits in 2048B */
struct bpf_dispatcher_prog {
struct bpf_prog *prog;
refcount_t users;
};
struct bpf_dispatcher {
/* dispatcher mutex */
struct mutex mutex;
void *func;
struct bpf_dispatcher_prog progs[BPF_DISPATCHER_MAX];
int num_progs;
void *image;
u32 image_off;
};
static __always_inline unsigned int bpf_dispatcher_nopfunc(
const void *ctx,
const struct bpf_insn *insnsi,
unsigned int (*bpf_func)(const void *,
const struct bpf_insn *))
{
return bpf_func(ctx, insnsi);
}
#ifdef CONFIG_BPF_JIT
struct bpf_trampoline *bpf_trampoline_lookup(u64 key);
int bpf_trampoline_link_prog(struct bpf_prog *prog);
int bpf_trampoline_unlink_prog(struct bpf_prog *prog);
void bpf_trampoline_put(struct bpf_trampoline *tr);
void *bpf_jit_alloc_exec_page(void);
#define BPF_DISPATCHER_INIT(name) { \
.mutex = __MUTEX_INITIALIZER(name.mutex), \
.func = &name##func, \
.progs = {}, \
.num_progs = 0, \
.image = NULL, \
.image_off = 0 \
}
#define DEFINE_BPF_DISPATCHER(name) \
noinline unsigned int name##func( \
const void *ctx, \
const struct bpf_insn *insnsi, \
unsigned int (*bpf_func)(const void *, \
const struct bpf_insn *)) \
{ \
return bpf_func(ctx, insnsi); \
} \
EXPORT_SYMBOL(name##func); \
struct bpf_dispatcher name = BPF_DISPATCHER_INIT(name);
#define DECLARE_BPF_DISPATCHER(name) \
unsigned int name##func( \
const void *ctx, \
const struct bpf_insn *insnsi, \
unsigned int (*bpf_func)(const void *, \
const struct bpf_insn *)); \
extern struct bpf_dispatcher name;
#define BPF_DISPATCHER_FUNC(name) name##func
#define BPF_DISPATCHER_PTR(name) (&name)
void bpf_dispatcher_change_prog(struct bpf_dispatcher *d, struct bpf_prog *from,
struct bpf_prog *to);
#else
static inline struct bpf_trampoline *bpf_trampoline_lookup(u64 key)
{
......@@ -490,6 +548,13 @@ static inline int bpf_trampoline_unlink_prog(struct bpf_prog *prog)
return -ENOTSUPP;
}
static inline void bpf_trampoline_put(struct bpf_trampoline *tr) {}
#define DEFINE_BPF_DISPATCHER(name)
#define DECLARE_BPF_DISPATCHER(name)
#define BPF_DISPATCHER_FUNC(name) bpf_dispatcher_nopfunc
#define BPF_DISPATCHER_PTR(name) NULL
static inline void bpf_dispatcher_change_prog(struct bpf_dispatcher *d,
struct bpf_prog *from,
struct bpf_prog *to) {}
#endif
struct bpf_func_info_aux {
......@@ -897,14 +962,14 @@ struct sk_buff;
struct bpf_dtab_netdev *__dev_map_lookup_elem(struct bpf_map *map, u32 key);
struct bpf_dtab_netdev *__dev_map_hash_lookup_elem(struct bpf_map *map, u32 key);
void __dev_map_flush(struct bpf_map *map);
void __dev_map_flush(void);
int dev_map_enqueue(struct bpf_dtab_netdev *dst, struct xdp_buff *xdp,
struct net_device *dev_rx);
int dev_map_generic_redirect(struct bpf_dtab_netdev *dst, struct sk_buff *skb,
struct bpf_prog *xdp_prog);
struct bpf_cpu_map_entry *__cpu_map_lookup_elem(struct bpf_map *map, u32 key);
void __cpu_map_flush(struct bpf_map *map);
void __cpu_map_flush(void);
int cpu_map_enqueue(struct bpf_cpu_map_entry *rcpu, struct xdp_buff *xdp,
struct net_device *dev_rx);
......@@ -943,6 +1008,8 @@ int btf_distill_func_proto(struct bpf_verifier_log *log,
int btf_check_func_arg_match(struct bpf_verifier_env *env, int subprog);
struct bpf_prog *bpf_prog_by_id(u32 id);
#else /* !CONFIG_BPF_SYSCALL */
static inline struct bpf_prog *bpf_prog_get(u32 ufd)
{
......@@ -1004,7 +1071,7 @@ static inline struct net_device *__dev_map_hash_lookup_elem(struct bpf_map *map
return NULL;
}
static inline void __dev_map_flush(struct bpf_map *map)
static inline void __dev_map_flush(void)
{
}
......@@ -1033,7 +1100,7 @@ struct bpf_cpu_map_entry *__cpu_map_lookup_elem(struct bpf_map *map, u32 key)
return NULL;
}
static inline void __cpu_map_flush(struct bpf_map *map)
static inline void __cpu_map_flush(void)
{
}
......@@ -1074,6 +1141,11 @@ static inline int bpf_prog_test_run_flow_dissector(struct bpf_prog *prog,
static inline void bpf_map_put(struct bpf_map *map)
{
}
static inline struct bpf_prog *bpf_prog_by_id(u32 id)
{
return ERR_PTR(-ENOTSUPP);
}
#endif /* CONFIG_BPF_SYSCALL */
static inline struct bpf_prog *bpf_prog_get_type(u32 ufd,
......
......@@ -559,23 +559,26 @@ struct sk_filter {
DECLARE_STATIC_KEY_FALSE(bpf_stats_enabled_key);
#define BPF_PROG_RUN(prog, ctx) ({ \
u32 ret; \
cant_sleep(); \
if (static_branch_unlikely(&bpf_stats_enabled_key)) { \
struct bpf_prog_stats *stats; \
u64 start = sched_clock(); \
ret = (*(prog)->bpf_func)(ctx, (prog)->insnsi); \
stats = this_cpu_ptr(prog->aux->stats); \
u64_stats_update_begin(&stats->syncp); \
stats->cnt++; \
stats->nsecs += sched_clock() - start; \
u64_stats_update_end(&stats->syncp); \
} else { \
ret = (*(prog)->bpf_func)(ctx, (prog)->insnsi); \
} \
#define __BPF_PROG_RUN(prog, ctx, dfunc) ({ \
u32 ret; \
cant_sleep(); \
if (static_branch_unlikely(&bpf_stats_enabled_key)) { \
struct bpf_prog_stats *stats; \
u64 start = sched_clock(); \
ret = dfunc(ctx, (prog)->insnsi, (prog)->bpf_func); \
stats = this_cpu_ptr(prog->aux->stats); \
u64_stats_update_begin(&stats->syncp); \
stats->cnt++; \
stats->nsecs += sched_clock() - start; \
u64_stats_update_end(&stats->syncp); \
} else { \
ret = dfunc(ctx, (prog)->insnsi, (prog)->bpf_func); \
} \
ret; })
#define BPF_PROG_RUN(prog, ctx) __BPF_PROG_RUN(prog, ctx, \
bpf_dispatcher_nopfunc)
#define BPF_SKB_CB_LEN QDISC_CB_PRIV_LEN
struct bpf_skb_data_end {
......@@ -589,7 +592,6 @@ struct bpf_redirect_info {
u32 tgt_index;
void *tgt_value;
struct bpf_map *map;
struct bpf_map *map_to_flush;
u32 kern_flags;
};
......@@ -699,6 +701,8 @@ static inline u32 bpf_prog_run_clear_cb(const struct bpf_prog *prog,
return res;
}
DECLARE_BPF_DISPATCHER(bpf_dispatcher_xdp)
static __always_inline u32 bpf_prog_run_xdp(const struct bpf_prog *prog,
struct xdp_buff *xdp)
{
......@@ -708,9 +712,12 @@ static __always_inline u32 bpf_prog_run_xdp(const struct bpf_prog *prog,
* already takes rcu_read_lock() when fetching the program, so
* it's not necessary here anymore.
*/
return BPF_PROG_RUN(prog, xdp);
return __BPF_PROG_RUN(prog, xdp,
BPF_DISPATCHER_FUNC(bpf_dispatcher_xdp));
}
void bpf_prog_change_xdp(struct bpf_prog *prev_prog, struct bpf_prog *prog);
static inline u32 bpf_prog_insn_size(const struct bpf_prog *prog)
{
return prog->len * sizeof(struct bpf_insn);
......
......@@ -72,7 +72,6 @@ struct xdp_umem {
struct xsk_map {
struct bpf_map map;
struct list_head __percpu *flush_list;
spinlock_t lock; /* Synchronize map updates */
struct xdp_sock *xsk_map[];
};
......@@ -119,8 +118,8 @@ int xsk_generic_rcv(struct xdp_sock *xs, struct xdp_buff *xdp);
bool xsk_is_setup_for_bpf_map(struct xdp_sock *xs);
/* Used from netdev driver */
bool xsk_umem_has_addrs(struct xdp_umem *umem, u32 cnt);
u64 *xsk_umem_peek_addr(struct xdp_umem *umem, u64 *addr);
void xsk_umem_discard_addr(struct xdp_umem *umem);
bool xsk_umem_peek_addr(struct xdp_umem *umem, u64 *addr);
void xsk_umem_release_addr(struct xdp_umem *umem);
void xsk_umem_complete_tx(struct xdp_umem *umem, u32 nb_entries);
bool xsk_umem_consume_tx(struct xdp_umem *umem, struct xdp_desc *desc);
void xsk_umem_consume_tx_done(struct xdp_umem *umem);
......@@ -139,9 +138,8 @@ void xsk_map_try_sock_delete(struct xsk_map *map, struct xdp_sock *xs,
struct xdp_sock **map_entry);
int xsk_map_inc(struct xsk_map *map);
void xsk_map_put(struct xsk_map *map);
int __xsk_map_redirect(struct bpf_map *map, struct xdp_buff *xdp,
struct xdp_sock *xs);
void __xsk_map_flush(struct bpf_map *map);
int __xsk_map_redirect(struct xdp_sock *xs, struct xdp_buff *xdp);
void __xsk_map_flush(void);
static inline struct xdp_sock *__xsk_map_lookup_elem(struct bpf_map *map,
u32 key)
......@@ -199,7 +197,7 @@ static inline bool xsk_umem_has_addrs_rq(struct xdp_umem *umem, u32 cnt)
return xsk_umem_has_addrs(umem, cnt - rq->length);
}
static inline u64 *xsk_umem_peek_addr_rq(struct xdp_umem *umem, u64 *addr)
static inline bool xsk_umem_peek_addr_rq(struct xdp_umem *umem, u64 *addr)
{
struct xdp_umem_fq_reuse *rq = umem->fq_reuse;
......@@ -210,12 +208,12 @@ static inline u64 *xsk_umem_peek_addr_rq(struct xdp_umem *umem, u64 *addr)
return addr;
}
static inline void xsk_umem_discard_addr_rq(struct xdp_umem *umem)
static inline void xsk_umem_release_addr_rq(struct xdp_umem *umem)
{
struct xdp_umem_fq_reuse *rq = umem->fq_reuse;
if (!rq->length)
xsk_umem_discard_addr(umem);
xsk_umem_release_addr(umem);
else
rq->length--;
}
......@@ -260,7 +258,7 @@ static inline u64 *xsk_umem_peek_addr(struct xdp_umem *umem, u64 *addr)
return NULL;
}
static inline void xsk_umem_discard_addr(struct xdp_umem *umem)
static inline void xsk_umem_release_addr(struct xdp_umem *umem)
{
}
......@@ -334,7 +332,7 @@ static inline u64 *xsk_umem_peek_addr_rq(struct xdp_umem *umem, u64 *addr)
return NULL;
}
static inline void xsk_umem_discard_addr_rq(struct xdp_umem *umem)
static inline void xsk_umem_release_addr_rq(struct xdp_umem *umem)
{
}
......@@ -369,13 +367,12 @@ static inline u64 xsk_umem_adjust_offset(struct xdp_umem *umem, u64 handle,
return 0;
}
static inline int __xsk_map_redirect(struct bpf_map *map, struct xdp_buff *xdp,
struct xdp_sock *xs)
static inline int __xsk_map_redirect(struct xdp_sock *xs, struct xdp_buff *xdp)
{
return -EOPNOTSUPP;
}
static inline void __xsk_map_flush(struct bpf_map *map)
static inline void __xsk_map_flush(void)
{
}
......
......@@ -116,6 +116,7 @@
#define AUDIT_FANOTIFY 1331 /* Fanotify access decision */
#define AUDIT_TIME_INJOFFSET 1332 /* Timekeeping offset injected */
#define AUDIT_TIME_ADJNTPVAL 1333 /* NTP value adjustment */
#define AUDIT_BPF 1334 /* BPF subsystem */
#define AUDIT_AVC 1400 /* SE Linux avc denial or grant */
#define AUDIT_SELINUX_ERR 1401 /* Internal SE Linux Errors */
......
......@@ -231,6 +231,11 @@ enum bpf_attach_type {
* When children program makes decision (like picking TCP CA or sock bind)
* parent program has a chance to override it.
*
* With BPF_F_ALLOW_MULTI a new program is added to the end of the list of
* programs for a cgroup. Though it's possible to replace an old program at
* any position by also specifying BPF_F_REPLACE flag and position itself in
* replace_bpf_fd attribute. Old program at this position will be released.
*
* A cgroup with MULTI or OVERRIDE flag allows any attach flags in sub-cgroups.
* A cgroup with NONE doesn't allow any programs in sub-cgroups.
* Ex1:
......@@ -249,6 +254,7 @@ enum bpf_attach_type {
*/
#define BPF_F_ALLOW_OVERRIDE (1U << 0)
#define BPF_F_ALLOW_MULTI (1U << 1)
#define BPF_F_REPLACE (1U << 2)
/* If BPF_F_STRICT_ALIGNMENT is used in BPF_PROG_LOAD command, the
* verifier will perform strict alignment checking as if the kernel
......@@ -442,6 +448,10 @@ union bpf_attr {
__u32 attach_bpf_fd; /* eBPF program to attach */
__u32 attach_type;
__u32 attach_flags;
__u32 replace_bpf_fd; /* previously attached eBPF
* program to replace if
* BPF_F_REPLACE is used
*/
};
struct { /* anonymous struct used by BPF_PROG_TEST_RUN command */
......
......@@ -142,7 +142,8 @@ struct btf_param {
enum {
BTF_VAR_STATIC = 0,
BTF_VAR_GLOBAL_ALLOCATED,
BTF_VAR_GLOBAL_ALLOCATED = 1,
BTF_VAR_GLOBAL_EXTERN = 2,
};
/* BTF_KIND_VAR is followed by a single "struct btf_var" to describe
......
......@@ -1604,6 +1604,9 @@ config BPF_SYSCALL
Enable the bpf() system call that allows to manipulate eBPF
programs and maps via file descriptors.
config ARCH_WANT_DEFAULT_BPF_JIT
bool
config BPF_JIT_ALWAYS_ON
bool "Permanently enable BPF JIT and remove BPF interpreter"
depends on BPF_SYSCALL && HAVE_EBPF_JIT && BPF_JIT
......@@ -1611,6 +1614,10 @@ config BPF_JIT_ALWAYS_ON
Enables BPF JIT and removes BPF interpreter to avoid
speculative execution of BPF instructions by the interpreter
config BPF_JIT_DEFAULT_ON
def_bool ARCH_WANT_DEFAULT_BPF_JIT || BPF_JIT_ALWAYS_ON
depends on HAVE_EBPF_JIT && BPF_JIT
config USERFAULTFD
bool "Enable userfaultfd() system call"
depends on MMU
......
......@@ -8,6 +8,7 @@ obj-$(CONFIG_BPF_SYSCALL) += local_storage.o queue_stack_maps.o
obj-$(CONFIG_BPF_SYSCALL) += disasm.o
obj-$(CONFIG_BPF_JIT) += trampoline.o
obj-$(CONFIG_BPF_SYSCALL) += btf.o
obj-$(CONFIG_BPF_JIT) += dispatcher.o
ifeq ($(CONFIG_NET),y)
obj-$(CONFIG_BPF_SYSCALL) += devmap.o
obj-$(CONFIG_BPF_SYSCALL) += cpumap.o
......
......@@ -103,8 +103,7 @@ static u32 prog_list_length(struct list_head *head)
* if parent has overridable or multi-prog, allow attaching
*/
static bool hierarchy_allows_attach(struct cgroup *cgrp,
enum bpf_attach_type type,
u32 new_flags)
enum bpf_attach_type type)
{
struct cgroup *p;
......@@ -283,31 +282,34 @@ static int update_effective_progs(struct cgroup *cgrp,
* propagate the change to descendants
* @cgrp: The cgroup which descendants to traverse
* @prog: A program to attach
* @replace_prog: Previously attached program to replace if BPF_F_REPLACE is set
* @type: Type of attach operation
* @flags: Option flags
*
* Must be called with cgroup_mutex held.
*/
int __cgroup_bpf_attach(struct cgroup *cgrp, struct bpf_prog *prog,
struct bpf_prog *replace_prog,
enum bpf_attach_type type, u32 flags)
{
u32 saved_flags = (flags & (BPF_F_ALLOW_OVERRIDE | BPF_F_ALLOW_MULTI));
struct list_head *progs = &cgrp->bpf.progs[type];
struct bpf_prog *old_prog = NULL;
struct bpf_cgroup_storage *storage[MAX_BPF_CGROUP_STORAGE_TYPE],
*old_storage[MAX_BPF_CGROUP_STORAGE_TYPE] = {NULL};
struct bpf_prog_list *pl, *replace_pl = NULL;
enum bpf_cgroup_storage_type stype;
struct bpf_prog_list *pl;
bool pl_was_allocated;
int err;
if ((flags & BPF_F_ALLOW_OVERRIDE) && (flags & BPF_F_ALLOW_MULTI))
if (((flags & BPF_F_ALLOW_OVERRIDE) && (flags & BPF_F_ALLOW_MULTI)) ||
((flags & BPF_F_REPLACE) && !(flags & BPF_F_ALLOW_MULTI)))
/* invalid combination */
return -EINVAL;
if (!hierarchy_allows_attach(cgrp, type, flags))
if (!hierarchy_allows_attach(cgrp, type))
return -EPERM;
if (!list_empty(progs) && cgrp->bpf.flags[type] != flags)
if (!list_empty(progs) && cgrp->bpf.flags[type] != saved_flags)
/* Disallow attaching non-overridable on top
* of existing overridable in this cgroup.
* Disallow attaching multi-prog if overridable or none
......@@ -317,6 +319,21 @@ int __cgroup_bpf_attach(struct cgroup *cgrp, struct bpf_prog *prog,
if (prog_list_length(progs) >= BPF_CGROUP_MAX_PROGS)
return -E2BIG;
if (flags & BPF_F_ALLOW_MULTI) {
list_for_each_entry(pl, progs, node) {
if (pl->prog == prog)
/* disallow attaching the same prog twice */
return -EINVAL;
if (pl->prog == replace_prog)
replace_pl = pl;
}
if ((flags & BPF_F_REPLACE) && !replace_pl)
/* prog to replace not found for cgroup */
return -ENOENT;
} else if (!list_empty(progs)) {
replace_pl = list_first_entry(progs, typeof(*pl), node);
}
for_each_cgroup_storage_type(stype) {
storage[stype] = bpf_cgroup_storage_alloc(prog, stype);
if (IS_ERR(storage[stype])) {
......@@ -327,53 +344,28 @@ int __cgroup_bpf_attach(struct cgroup *cgrp, struct bpf_prog *prog,
}
}
if (flags & BPF_F_ALLOW_MULTI) {
list_for_each_entry(pl, progs, node) {
if (pl->prog == prog) {
/* disallow attaching the same prog twice */
for_each_cgroup_storage_type(stype)
bpf_cgroup_storage_free(storage[stype]);
return -EINVAL;
}
if (replace_pl) {
pl = replace_pl;
old_prog = pl->prog;
for_each_cgroup_storage_type(stype) {
old_storage[stype] = pl->storage[stype];
bpf_cgroup_storage_unlink(old_storage[stype]);
}
} else {
pl = kmalloc(sizeof(*pl), GFP_KERNEL);
if (!pl) {
for_each_cgroup_storage_type(stype)
bpf_cgroup_storage_free(storage[stype]);
return -ENOMEM;
}
pl_was_allocated = true;
pl->prog = prog;
for_each_cgroup_storage_type(stype)
pl->storage[stype] = storage[stype];
list_add_tail(&pl->node, progs);
} else {
if (list_empty(progs)) {
pl = kmalloc(sizeof(*pl), GFP_KERNEL);
if (!pl) {
for_each_cgroup_storage_type(stype)
bpf_cgroup_storage_free(storage[stype]);
return -ENOMEM;
}
pl_was_allocated = true;
list_add_tail(&pl->node, progs);
} else {
pl = list_first_entry(progs, typeof(*pl), node);
old_prog = pl->prog;
for_each_cgroup_storage_type(stype) {
old_storage[stype] = pl->storage[stype];
bpf_cgroup_storage_unlink(old_storage[stype]);
}
pl_was_allocated = false;
}
pl->prog = prog;
for_each_cgroup_storage_type(stype)
pl->storage[stype] = storage[stype];
}
cgrp->bpf.flags[type] = flags;
pl->prog = prog;
for_each_cgroup_storage_type(stype)
pl->storage[stype] = storage[stype];
cgrp->bpf.flags[type] = saved_flags;
err = update_effective_progs(cgrp, type);
if (err)
......@@ -401,7 +393,7 @@ int __cgroup_bpf_attach(struct cgroup *cgrp, struct bpf_prog *prog,
pl->storage[stype] = old_storage[stype];
bpf_cgroup_storage_link(old_storage[stype], cgrp, type);
}
if (pl_was_allocated) {
if (!replace_pl) {
list_del(&pl->node);
kfree(pl);
}
......@@ -539,6 +531,7 @@ int __cgroup_bpf_query(struct cgroup *cgrp, const union bpf_attr *attr,
int cgroup_bpf_prog_attach(const union bpf_attr *attr,
enum bpf_prog_type ptype, struct bpf_prog *prog)
{
struct bpf_prog *replace_prog = NULL;
struct cgroup *cgrp;
int ret;
......@@ -546,8 +539,20 @@ int cgroup_bpf_prog_attach(const union bpf_attr *attr,
if (IS_ERR(cgrp))
return PTR_ERR(cgrp);
ret = cgroup_bpf_attach(cgrp, prog, attr->attach_type,
if ((attr->attach_flags & BPF_F_ALLOW_MULTI) &&
(attr->attach_flags & BPF_F_REPLACE)) {
replace_prog = bpf_prog_get_type(attr->replace_bpf_fd, ptype);
if (IS_ERR(replace_prog)) {
cgroup_put(cgrp);
return PTR_ERR(replace_prog);
}
}
ret = cgroup_bpf_attach(cgrp, prog, replace_prog, attr->attach_type,
attr->attach_flags);
if (replace_prog)
bpf_prog_put(replace_prog);
cgroup_put(cgrp);
return ret;
}
......
......@@ -222,8 +222,6 @@ struct bpf_prog *bpf_prog_realloc(struct bpf_prog *fp_old, unsigned int size,
u32 pages, delta;
int ret;
BUG_ON(fp_old == NULL);
size = round_up(size, PAGE_SIZE);
pages = size / PAGE_SIZE;
if (pages <= fp_old->pages)
......@@ -520,9 +518,9 @@ void bpf_prog_kallsyms_del_all(struct bpf_prog *fp)
#ifdef CONFIG_BPF_JIT
/* All BPF JIT sysctl knobs here. */
int bpf_jit_enable __read_mostly = IS_BUILTIN(CONFIG_BPF_JIT_ALWAYS_ON);
int bpf_jit_enable __read_mostly = IS_BUILTIN(CONFIG_BPF_JIT_DEFAULT_ON);
int bpf_jit_kallsyms __read_mostly = IS_BUILTIN(CONFIG_BPF_JIT_DEFAULT_ON);
int bpf_jit_harden __read_mostly;
int bpf_jit_kallsyms __read_mostly;
long bpf_jit_limit __read_mostly;
static __always_inline void
......
......@@ -72,17 +72,18 @@ struct bpf_cpu_map {
struct bpf_map map;
/* Below members specific for map type */
struct bpf_cpu_map_entry **cpu_map;
struct list_head __percpu *flush_list;
};
static int bq_flush_to_queue(struct xdp_bulk_queue *bq, bool in_napi_ctx);
static DEFINE_PER_CPU(struct list_head, cpu_map_flush_list);
static int bq_flush_to_queue(struct xdp_bulk_queue *bq);
static struct bpf_map *cpu_map_alloc(union bpf_attr *attr)
{
struct bpf_cpu_map *cmap;
int err = -ENOMEM;
int ret, cpu;
u64 cost;
int ret;
if (!capable(CAP_SYS_ADMIN))
return ERR_PTR(-EPERM);
......@@ -106,7 +107,6 @@ static struct bpf_map *cpu_map_alloc(union bpf_attr *attr)
/* make sure page count doesn't overflow */
cost = (u64) cmap->map.max_entries * sizeof(struct bpf_cpu_map_entry *);
cost += sizeof(struct list_head) * num_possible_cpus();
/* Notice returns -EPERM on if map size is larger than memlock limit */
ret = bpf_map_charge_init(&cmap->map.memory, cost);
......@@ -115,23 +115,14 @@ static struct bpf_map *cpu_map_alloc(union bpf_attr *attr)
goto free_cmap;
}
cmap->flush_list = alloc_percpu(struct list_head);
if (!cmap->flush_list)
goto free_charge;
for_each_possible_cpu(cpu)
INIT_LIST_HEAD(per_cpu_ptr(cmap->flush_list, cpu));
/* Alloc array for possible remote "destination" CPUs */
cmap->cpu_map = bpf_map_area_alloc(cmap->map.max_entries *
sizeof(struct bpf_cpu_map_entry *),
cmap->map.numa_node);
if (!cmap->cpu_map)
goto free_percpu;
goto free_charge;
return &cmap->map;
free_percpu:
free_percpu(cmap->flush_list);
free_charge:
bpf_map_charge_finish(&cmap->map.memory);
free_cmap:
......@@ -399,22 +390,14 @@ static struct bpf_cpu_map_entry *__cpu_map_entry_alloc(u32 qsize, u32 cpu,
static void __cpu_map_entry_free(struct rcu_head *rcu)
{
struct bpf_cpu_map_entry *rcpu;
int cpu;
/* This cpu_map_entry have been disconnected from map and one
* RCU graze-period have elapsed. Thus, XDP cannot queue any
* RCU grace-period have elapsed. Thus, XDP cannot queue any
* new packets and cannot change/set flush_needed that can
* find this entry.
*/
rcpu = container_of(rcu, struct bpf_cpu_map_entry, rcu);
/* Flush remaining packets in percpu bulkq */
for_each_online_cpu(cpu) {
struct xdp_bulk_queue *bq = per_cpu_ptr(rcpu->bulkq, cpu);
/* No concurrent bq_enqueue can run at this point */
bq_flush_to_queue(bq, false);
}
free_percpu(rcpu->bulkq);
/* Cannot kthread_stop() here, last put free rcpu resources */
put_cpu_map_entry(rcpu);
......@@ -436,7 +419,7 @@ static void __cpu_map_entry_free(struct rcu_head *rcu)
* percpu bulkq to queue. Due to caller map_delete_elem() disable
* preemption, cannot call kthread_stop() to make sure queue is empty.
* Instead a work_queue is started for stopping kthread,
* cpu_map_kthread_stop, which waits for an RCU graze period before
* cpu_map_kthread_stop, which waits for an RCU grace period before
* stopping kthread, emptying the queue.
*/
static void __cpu_map_entry_replace(struct bpf_cpu_map *cmap,
......@@ -507,7 +490,6 @@ static int cpu_map_update_elem(struct bpf_map *map, void *key, void *value,
static void cpu_map_free(struct bpf_map *map)
{
struct bpf_cpu_map *cmap = container_of(map, struct bpf_cpu_map, map);
int cpu;
u32 i;
/* At this point bpf_prog->aux->refcnt == 0 and this map->refcnt == 0,
......@@ -522,18 +504,6 @@ static void cpu_map_free(struct bpf_map *map)
bpf_clear_redirect_map(map);
synchronize_rcu();
/* To ensure all pending flush operations have completed wait for flush
* list be empty on _all_ cpus. Because the above synchronize_rcu()
* ensures the map is disconnected from the program we can assume no new
* items will be added to the list.
*/
for_each_online_cpu(cpu) {
struct list_head *flush_list = per_cpu_ptr(cmap->flush_list, cpu);
while (!list_empty(flush_list))
cond_resched();
}
/* For cpu_map the remote CPUs can still be using the entries
* (struct bpf_cpu_map_entry).
*/
......@@ -544,10 +514,9 @@ static void cpu_map_free(struct bpf_map *map)
if (!rcpu)
continue;
/* bq flush and cleanup happens after RCU graze-period */
/* bq flush and cleanup happens after RCU grace-period */
__cpu_map_entry_replace(cmap, i, NULL); /* call_rcu */
}
free_percpu(cmap->flush_list);
bpf_map_area_free(cmap->cpu_map);
kfree(cmap);
}
......@@ -599,7 +568,7 @@ const struct bpf_map_ops cpu_map_ops = {
.map_check_btf = map_check_no_btf,
};
static int bq_flush_to_queue(struct xdp_bulk_queue *bq, bool in_napi_ctx)
static int bq_flush_to_queue(struct xdp_bulk_queue *bq)
{
struct bpf_cpu_map_entry *rcpu = bq->obj;
unsigned int processed = 0, drops = 0;
......@@ -620,10 +589,7 @@ static int bq_flush_to_queue(struct xdp_bulk_queue *bq, bool in_napi_ctx)
err = __ptr_ring_produce(q, xdpf);
if (err) {
drops++;
if (likely(in_napi_ctx))
xdp_return_frame_rx_napi(xdpf);
else
xdp_return_frame(xdpf);
xdp_return_frame_rx_napi(xdpf);
}
processed++;
}
......@@ -642,11 +608,11 @@ static int bq_flush_to_queue(struct xdp_bulk_queue *bq, bool in_napi_ctx)
*/
static int bq_enqueue(struct bpf_cpu_map_entry *rcpu, struct xdp_frame *xdpf)
{
struct list_head *flush_list = this_cpu_ptr(rcpu->cmap->flush_list);
struct list_head *flush_list = this_cpu_ptr(&cpu_map_flush_list);
struct xdp_bulk_queue *bq = this_cpu_ptr(rcpu->bulkq);
if (unlikely(bq->count == CPU_MAP_BULK_SIZE))
bq_flush_to_queue(bq, true);
bq_flush_to_queue(bq);
/* Notice, xdp_buff/page MUST be queued here, long enough for
* driver to code invoking us to finished, due to driver
......@@ -681,16 +647,26 @@ int cpu_map_enqueue(struct bpf_cpu_map_entry *rcpu, struct xdp_buff *xdp,
return 0;
}
void __cpu_map_flush(struct bpf_map *map)
void __cpu_map_flush(void)
{
struct bpf_cpu_map *cmap = container_of(map, struct bpf_cpu_map, map);
struct list_head *flush_list = this_cpu_ptr(cmap->flush_list);
struct list_head *flush_list = this_cpu_ptr(&cpu_map_flush_list);
struct xdp_bulk_queue *bq, *tmp;
list_for_each_entry_safe(bq, tmp, flush_list, flush_node) {
bq_flush_to_queue(bq, true);
bq_flush_to_queue(bq);
/* If already running, costs spin_lock_irqsave + smb_mb */
wake_up_process(bq->obj->kthread);
}
}
static int __init cpu_map_init(void)
{
int cpu;
for_each_possible_cpu(cpu)
INIT_LIST_HEAD(&per_cpu(cpu_map_flush_list, cpu));
return 0;
}
subsys_initcall(cpu_map_init);
......@@ -75,7 +75,6 @@ struct bpf_dtab_netdev {
struct bpf_dtab {
struct bpf_map map;
struct bpf_dtab_netdev **netdev_map; /* DEVMAP type only */
struct list_head __percpu *flush_list;
struct list_head list;
/* these are only used for DEVMAP_HASH type maps */
......@@ -85,6 +84,7 @@ struct bpf_dtab {
u32 n_buckets;
};
static DEFINE_PER_CPU(struct list_head, dev_map_flush_list);
static DEFINE_SPINLOCK(dev_map_lock);
static LIST_HEAD(dev_map_list);
......@@ -109,8 +109,8 @@ static inline struct hlist_head *dev_map_index_hash(struct bpf_dtab *dtab,
static int dev_map_init_map(struct bpf_dtab *dtab, union bpf_attr *attr)
{
int err, cpu;
u64 cost;
u64 cost = 0;
int err;
/* check sanity of attributes */
if (attr->max_entries == 0 || attr->key_size != 4 ||
......@@ -125,9 +125,6 @@ static int dev_map_init_map(struct bpf_dtab *dtab, union bpf_attr *attr)
bpf_map_init_from_attr(&dtab->map, attr);
/* make sure page count doesn't overflow */
cost = (u64) sizeof(struct list_head) * num_possible_cpus();
if (attr->map_type == BPF_MAP_TYPE_DEVMAP_HASH) {
dtab->n_buckets = roundup_pow_of_two(dtab->map.max_entries);
......@@ -143,17 +140,10 @@ static int dev_map_init_map(struct bpf_dtab *dtab, union bpf_attr *attr)
if (err)
return -EINVAL;
dtab->flush_list = alloc_percpu(struct list_head);
if (!dtab->flush_list)
goto free_charge;
for_each_possible_cpu(cpu)
INIT_LIST_HEAD(per_cpu_ptr(dtab->flush_list, cpu));
if (attr->map_type == BPF_MAP_TYPE_DEVMAP_HASH) {
dtab->dev_index_head = dev_map_create_hash(dtab->n_buckets);
if (!dtab->dev_index_head)
goto free_percpu;
goto free_charge;
spin_lock_init(&dtab->index_lock);
} else {
......@@ -161,13 +151,11 @@ static int dev_map_init_map(struct bpf_dtab *dtab, union bpf_attr *attr)
sizeof(struct bpf_dtab_netdev *),
dtab->map.numa_node);
if (!dtab->netdev_map)
goto free_percpu;
goto free_charge;
}
return 0;
free_percpu:
free_percpu(dtab->flush_list);
free_charge:
bpf_map_charge_finish(&dtab->map.memory);
return -ENOMEM;
......@@ -201,7 +189,7 @@ static struct bpf_map *dev_map_alloc(union bpf_attr *attr)
static void dev_map_free(struct bpf_map *map)
{
struct bpf_dtab *dtab = container_of(map, struct bpf_dtab, map);
int i, cpu;
int i;
/* At this point bpf_prog->aux->refcnt == 0 and this map->refcnt == 0,
* so the programs (can be more than one that used this map) were
......@@ -221,18 +209,6 @@ static void dev_map_free(struct bpf_map *map)
/* Make sure prior __dev_map_entry_free() have completed. */
rcu_barrier();
/* To ensure all pending flush operations have completed wait for flush
* list to empty on _all_ cpus.
* Because the above synchronize_rcu() ensures the map is disconnected
* from the program we can assume no new items will be added.
*/
for_each_online_cpu(cpu) {
struct list_head *flush_list = per_cpu_ptr(dtab->flush_list, cpu);
while (!list_empty(flush_list))
cond_resched();
}
if (dtab->map.map_type == BPF_MAP_TYPE_DEVMAP_HASH) {
for (i = 0; i < dtab->n_buckets; i++) {
struct bpf_dtab_netdev *dev;
......@@ -266,7 +242,6 @@ static void dev_map_free(struct bpf_map *map)
bpf_map_area_free(dtab->netdev_map);
}
free_percpu(dtab->flush_list);
kfree(dtab);
}
......@@ -345,8 +320,7 @@ static int dev_map_hash_get_next_key(struct bpf_map *map, void *key,
return -ENOENT;
}
static int bq_xmit_all(struct xdp_bulk_queue *bq, u32 flags,
bool in_napi_ctx)
static int bq_xmit_all(struct xdp_bulk_queue *bq, u32 flags)
{
struct bpf_dtab_netdev *obj = bq->obj;
struct net_device *dev = obj->dev;
......@@ -384,11 +358,7 @@ static int bq_xmit_all(struct xdp_bulk_queue *bq, u32 flags,
for (i = 0; i < bq->count; i++) {
struct xdp_frame *xdpf = bq->q[i];
/* RX path under NAPI protection, can return frames faster */
if (likely(in_napi_ctx))
xdp_return_frame_rx_napi(xdpf);
else
xdp_return_frame(xdpf);
xdp_return_frame_rx_napi(xdpf);
drops++;
}
goto out;
......@@ -401,15 +371,14 @@ static int bq_xmit_all(struct xdp_bulk_queue *bq, u32 flags,
* net device can be torn down. On devmap tear down we ensure the flush list
* is empty before completing to ensure all flush operations have completed.
*/
void __dev_map_flush(struct bpf_map *map)
void __dev_map_flush(void)
{
struct bpf_dtab *dtab = container_of(map, struct bpf_dtab, map);
struct list_head *flush_list = this_cpu_ptr(dtab->flush_list);
struct list_head *flush_list = this_cpu_ptr(&dev_map_flush_list);
struct xdp_bulk_queue *bq, *tmp;
rcu_read_lock();
list_for_each_entry_safe(bq, tmp, flush_list, flush_node)
bq_xmit_all(bq, XDP_XMIT_FLUSH, true);
bq_xmit_all(bq, XDP_XMIT_FLUSH);
rcu_read_unlock();
}
......@@ -436,11 +405,11 @@ static int bq_enqueue(struct bpf_dtab_netdev *obj, struct xdp_frame *xdpf,
struct net_device *dev_rx)
{
struct list_head *flush_list = this_cpu_ptr(obj->dtab->flush_list);
struct list_head *flush_list = this_cpu_ptr(&dev_map_flush_list);
struct xdp_bulk_queue *bq = this_cpu_ptr(obj->bulkq);
if (unlikely(bq->count == DEV_MAP_BULK_SIZE))
bq_xmit_all(bq, 0, true);
bq_xmit_all(bq, 0);
/* Ingress dev_rx will be the same for all xdp_frame's in
* bulk_queue, because bq stored per-CPU and must be flushed
......@@ -509,27 +478,11 @@ static void *dev_map_hash_lookup_elem(struct bpf_map *map, void *key)
return dev ? &dev->ifindex : NULL;
}
static void dev_map_flush_old(struct bpf_dtab_netdev *dev)
{
if (dev->dev->netdev_ops->ndo_xdp_xmit) {
struct xdp_bulk_queue *bq;
int cpu;
rcu_read_lock();
for_each_online_cpu(cpu) {
bq = per_cpu_ptr(dev->bulkq, cpu);
bq_xmit_all(bq, XDP_XMIT_FLUSH, false);
}
rcu_read_unlock();
}
}
static void __dev_map_entry_free(struct rcu_head *rcu)
{
struct bpf_dtab_netdev *dev;
dev = container_of(rcu, struct bpf_dtab_netdev, rcu);
dev_map_flush_old(dev);
free_percpu(dev->bulkq);
dev_put(dev->dev);
kfree(dev);
......@@ -810,10 +763,15 @@ static struct notifier_block dev_map_notifier = {
static int __init dev_map_init(void)
{
int cpu;
/* Assure tracepoint shadow struct _bpf_dtab_netdev is in sync */
BUILD_BUG_ON(offsetof(struct bpf_dtab_netdev, dev) !=
offsetof(struct _bpf_dtab_netdev, dev));
register_netdevice_notifier(&dev_map_notifier);
for_each_possible_cpu(cpu)
INIT_LIST_HEAD(&per_cpu(dev_map_flush_list, cpu));
return 0;
}
......
// SPDX-License-Identifier: GPL-2.0-only
/* Copyright(c) 2019 Intel Corporation. */
#include <linux/hash.h>
#include <linux/bpf.h>
#include <linux/filter.h>
/* The BPF dispatcher is a multiway branch code generator. The
* dispatcher is a mechanism to avoid the performance penalty of an
* indirect call, which is expensive when retpolines are enabled. A
* dispatch client registers a BPF program into the dispatcher, and if
* there is available room in the dispatcher a direct call to the BPF
* program will be generated. All calls to the BPF programs called via
* the dispatcher will then be a direct call, instead of an
* indirect. The dispatcher hijacks a trampoline function it via the
* __fentry__ of the trampoline. The trampoline function has the
* following signature:
*
* unsigned int trampoline(const void *ctx, const struct bpf_insn *insnsi,
* unsigned int (*bpf_func)(const void *,
* const struct bpf_insn *));
*/
static struct bpf_dispatcher_prog *bpf_dispatcher_find_prog(
struct bpf_dispatcher *d, struct bpf_prog *prog)
{
int i;
for (i = 0; i < BPF_DISPATCHER_MAX; i++) {
if (prog == d->progs[i].prog)
return &d->progs[i];
}
return NULL;
}
static struct bpf_dispatcher_prog *bpf_dispatcher_find_free(
struct bpf_dispatcher *d)
{
return bpf_dispatcher_find_prog(d, NULL);
}
static bool bpf_dispatcher_add_prog(struct bpf_dispatcher *d,
struct bpf_prog *prog)
{
struct bpf_dispatcher_prog *entry;
if (!prog)
return false;
entry = bpf_dispatcher_find_prog(d, prog);
if (entry) {
refcount_inc(&entry->users);
return false;
}
entry = bpf_dispatcher_find_free(d);
if (!entry)
return false;
bpf_prog_inc(prog);
entry->prog = prog;
refcount_set(&entry->users, 1);
d->num_progs++;
return true;
}
static bool bpf_dispatcher_remove_prog(struct bpf_dispatcher *d,
struct bpf_prog *prog)
{
struct bpf_dispatcher_prog *entry;
if (!prog)
return false;
entry = bpf_dispatcher_find_prog(d, prog);
if (!entry)
return false;
if (refcount_dec_and_test(&entry->users)) {
entry->prog = NULL;
bpf_prog_put(prog);
d->num_progs--;
return true;
}
return false;
}
int __weak arch_prepare_bpf_dispatcher(void *image, s64 *funcs, int num_funcs)
{
return -ENOTSUPP;
}
static int bpf_dispatcher_prepare(struct bpf_dispatcher *d, void *image)
{
s64 ips[BPF_DISPATCHER_MAX] = {}, *ipsp = &ips[0];
int i;
for (i = 0; i < BPF_DISPATCHER_MAX; i++) {
if (d->progs[i].prog)
*ipsp++ = (s64)(uintptr_t)d->progs[i].prog->bpf_func;
}
return arch_prepare_bpf_dispatcher(image, &ips[0], d->num_progs);
}
static void bpf_dispatcher_update(struct bpf_dispatcher *d, int prev_num_progs)
{
void *old, *new;
u32 noff;
int err;
if (!prev_num_progs) {
old = NULL;
noff = 0;
} else {
old = d->image + d->image_off;
noff = d->image_off ^ (PAGE_SIZE / 2);
}
new = d->num_progs ? d->image + noff : NULL;
if (new) {
if (bpf_dispatcher_prepare(d, new))
return;
}
err = bpf_arch_text_poke(d->func, BPF_MOD_JUMP, old, new);
if (err || !new)
return;
d->image_off = noff;
}
void bpf_dispatcher_change_prog(struct bpf_dispatcher *d, struct bpf_prog *from,
struct bpf_prog *to)
{
bool changed = false;
int prev_num_progs;
if (from == to)
return;
mutex_lock(&d->mutex);
if (!d->image) {
d->image = bpf_jit_alloc_exec_page();
if (!d->image)
goto out;
}
prev_num_progs = d->num_progs;
changed |= bpf_dispatcher_remove_prog(d, from);
changed |= bpf_dispatcher_add_prog(d, to);
if (!changed)
goto out;
bpf_dispatcher_update(d, prev_num_progs);
out:
mutex_unlock(&d->mutex);
}
......@@ -23,6 +23,7 @@
#include <linux/timekeeping.h>
#include <linux/ctype.h>
#include <linux/nospec.h>
#include <linux/audit.h>
#include <uapi/linux/btf.h>
#define IS_FD_ARRAY(map) ((map)->map_type == BPF_MAP_TYPE_PERF_EVENT_ARRAY || \
......@@ -1306,6 +1307,36 @@ static int find_prog_type(enum bpf_prog_type type, struct bpf_prog *prog)
return 0;
}
enum bpf_audit {
BPF_AUDIT_LOAD,
BPF_AUDIT_UNLOAD,
BPF_AUDIT_MAX,
};
static const char * const bpf_audit_str[BPF_AUDIT_MAX] = {
[BPF_AUDIT_LOAD] = "LOAD",
[BPF_AUDIT_UNLOAD] = "UNLOAD",
};
static void bpf_audit_prog(const struct bpf_prog *prog, unsigned int op)
{
struct audit_context *ctx = NULL;
struct audit_buffer *ab;
if (WARN_ON_ONCE(op >= BPF_AUDIT_MAX))
return;
if (audit_enabled == AUDIT_OFF)
return;
if (op == BPF_AUDIT_LOAD)
ctx = audit_context();
ab = audit_log_start(ctx, GFP_ATOMIC, AUDIT_BPF);
if (unlikely(!ab))
return;
audit_log_format(ab, "prog-id=%u op=%s",
prog->aux->id, bpf_audit_str[op]);
audit_log_end(ab);
}
int __bpf_prog_charge(struct user_struct *user, u32 pages)
{
unsigned long memlock_limit = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT;
......@@ -1421,6 +1452,7 @@ static void __bpf_prog_put(struct bpf_prog *prog, bool do_idr_lock)
{
if (atomic64_dec_and_test(&prog->aux->refcnt)) {
perf_event_bpf_event(prog, PERF_BPF_EVENT_PROG_UNLOAD, 0);
bpf_audit_prog(prog, BPF_AUDIT_UNLOAD);
/* bpf_prog_free_id() must be called first */
bpf_prog_free_id(prog, do_idr_lock);
__bpf_prog_put_noref(prog, true);
......@@ -1830,6 +1862,7 @@ static int bpf_prog_load(union bpf_attr *attr, union bpf_attr __user *uattr)
*/
bpf_prog_kallsyms_add(prog);
perf_event_bpf_event(prog, PERF_BPF_EVENT_PROG_LOAD, 0);
bpf_audit_prog(prog, BPF_AUDIT_LOAD);
err = bpf_prog_new_fd(prog);
if (err < 0)
......@@ -2040,10 +2073,10 @@ static int bpf_prog_attach_check_attach_type(const struct bpf_prog *prog,
}
}
#define BPF_PROG_ATTACH_LAST_FIELD attach_flags
#define BPF_PROG_ATTACH_LAST_FIELD replace_bpf_fd
#define BPF_F_ATTACH_MASK \
(BPF_F_ALLOW_OVERRIDE | BPF_F_ALLOW_MULTI)
(BPF_F_ALLOW_OVERRIDE | BPF_F_ALLOW_MULTI | BPF_F_REPLACE)
static int bpf_prog_attach(const union bpf_attr *attr)
{
......@@ -2305,17 +2338,12 @@ static int bpf_obj_get_next_id(const union bpf_attr *attr,
#define BPF_PROG_GET_FD_BY_ID_LAST_FIELD prog_id
static int bpf_prog_get_fd_by_id(const union bpf_attr *attr)
struct bpf_prog *bpf_prog_by_id(u32 id)
{
struct bpf_prog *prog;
u32 id = attr->prog_id;
int fd;
if (CHECK_ATTR(BPF_PROG_GET_FD_BY_ID))
return -EINVAL;
if (!capable(CAP_SYS_ADMIN))
return -EPERM;
if (!id)
return ERR_PTR(-ENOENT);
spin_lock_bh(&prog_idr_lock);
prog = idr_find(&prog_idr, id);
......@@ -2324,7 +2352,22 @@ static int bpf_prog_get_fd_by_id(const union bpf_attr *attr)
else
prog = ERR_PTR(-ENOENT);
spin_unlock_bh(&prog_idr_lock);
return prog;
}
static int bpf_prog_get_fd_by_id(const union bpf_attr *attr)
{
struct bpf_prog *prog;
u32 id = attr->prog_id;
int fd;
if (CHECK_ATTR(BPF_PROG_GET_FD_BY_ID))
return -EINVAL;
if (!capable(CAP_SYS_ADMIN))
return -EPERM;
prog = bpf_prog_by_id(id);
if (IS_ERR(prog))
return PTR_ERR(prog);
......
......@@ -14,6 +14,22 @@ static struct hlist_head trampoline_table[TRAMPOLINE_TABLE_SIZE];
/* serializes access to trampoline_table */
static DEFINE_MUTEX(trampoline_mutex);
void *bpf_jit_alloc_exec_page(void)
{
void *image;
image = bpf_jit_alloc_exec(PAGE_SIZE);
if (!image)
return NULL;
set_vm_flush_reset_perms(image);
/* Keep image as writeable. The alternative is to keep flipping ro/rw
* everytime new program is attached or detached.
*/
set_memory_x((long)image, 1);
return image;
}
struct bpf_trampoline *bpf_trampoline_lookup(u64 key)
{
struct bpf_trampoline *tr;
......@@ -34,7 +50,7 @@ struct bpf_trampoline *bpf_trampoline_lookup(u64 key)
goto out;
/* is_root was checked earlier. No need for bpf_jit_charge_modmem() */
image = bpf_jit_alloc_exec(PAGE_SIZE);
image = bpf_jit_alloc_exec_page();
if (!image) {
kfree(tr);
tr = NULL;
......@@ -48,12 +64,6 @@ struct bpf_trampoline *bpf_trampoline_lookup(u64 key)
mutex_init(&tr->mutex);
for (i = 0; i < BPF_TRAMP_MAX; i++)
INIT_HLIST_HEAD(&tr->progs_hlist[i]);
set_vm_flush_reset_perms(image);
/* Keep image as writeable. The alternative is to keep flipping ro/rw
* everytime new program is attached or detached.
*/
set_memory_x((long)image, 1);
tr->image = image;
out:
mutex_unlock(&trampoline_mutex);
......
......@@ -72,9 +72,9 @@ static void xsk_map_sock_delete(struct xdp_sock *xs,
static struct bpf_map *xsk_map_alloc(union bpf_attr *attr)
{
struct bpf_map_memory mem;
int cpu, err, numa_node;
int err, numa_node;
struct xsk_map *m;
u64 cost, size;
u64 size;
if (!capable(CAP_NET_ADMIN))
return ERR_PTR(-EPERM);
......@@ -86,9 +86,8 @@ static struct bpf_map *xsk_map_alloc(union bpf_attr *attr)
numa_node = bpf_map_attr_numa_node(attr);
size = struct_size(m, xsk_map, attr->max_entries);
cost = size + array_size(sizeof(*m->flush_list), num_possible_cpus());
err = bpf_map_charge_init(&mem, cost);
err = bpf_map_charge_init(&mem, size);
if (err < 0)
return ERR_PTR(err);
......@@ -102,16 +101,6 @@ static struct bpf_map *xsk_map_alloc(union bpf_attr *attr)
bpf_map_charge_move(&m->map.memory, &mem);
spin_lock_init(&m->lock);
m->flush_list = alloc_percpu(struct list_head);
if (!m->flush_list) {
bpf_map_charge_finish(&m->map.memory);
bpf_map_area_free(m);
return ERR_PTR(-ENOMEM);
}
for_each_possible_cpu(cpu)
INIT_LIST_HEAD(per_cpu_ptr(m->flush_list, cpu));
return &m->map;
}
......@@ -121,7 +110,6 @@ static void xsk_map_free(struct bpf_map *map)
bpf_clear_redirect_map(map);
synchronize_net();
free_percpu(m->flush_list);
bpf_map_area_free(m);
}
......
......@@ -6288,12 +6288,13 @@ void cgroup_sk_free(struct sock_cgroup_data *skcd)
#ifdef CONFIG_CGROUP_BPF
int cgroup_bpf_attach(struct cgroup *cgrp, struct bpf_prog *prog,
enum bpf_attach_type type, u32 flags)
struct bpf_prog *replace_prog, enum bpf_attach_type type,
u32 flags)
{
int ret;
mutex_lock(&cgroup_mutex);
ret = __cgroup_bpf_attach(cgrp, prog, type, flags);
ret = __cgroup_bpf_attach(cgrp, prog, replace_prog, type, flags);
mutex_unlock(&cgroup_mutex);
return ret;
}
......
......@@ -15,7 +15,7 @@
#include <trace/events/bpf_test_run.h>
static int bpf_test_run(struct bpf_prog *prog, void *ctx, u32 repeat,
u32 *retval, u32 *time)
u32 *retval, u32 *time, bool xdp)
{
struct bpf_cgroup_storage *storage[MAX_BPF_CGROUP_STORAGE_TYPE] = { NULL };
enum bpf_cgroup_storage_type stype;
......@@ -41,7 +41,11 @@ static int bpf_test_run(struct bpf_prog *prog, void *ctx, u32 repeat,
time_start = ktime_get_ns();
for (i = 0; i < repeat; i++) {
bpf_cgroup_storage_set(storage);
*retval = BPF_PROG_RUN(prog, ctx);
if (xdp)
*retval = bpf_prog_run_xdp(prog, ctx);
else
*retval = BPF_PROG_RUN(prog, ctx);
if (signal_pending(current)) {
ret = -EINTR;
......@@ -247,34 +251,53 @@ static int convert___skb_to_skb(struct sk_buff *skb, struct __sk_buff *__skb)
return 0;
/* make sure the fields we don't use are zeroed */
if (!range_is_zero(__skb, 0, offsetof(struct __sk_buff, priority)))
if (!range_is_zero(__skb, 0, offsetof(struct __sk_buff, mark)))
return -EINVAL;
/* mark is allowed */
if (!range_is_zero(__skb, offsetofend(struct __sk_buff, mark),
offsetof(struct __sk_buff, priority)))
return -EINVAL;
/* priority is allowed */
if (!range_is_zero(__skb, offsetof(struct __sk_buff, priority) +
sizeof_field(struct __sk_buff, priority),
if (!range_is_zero(__skb, offsetofend(struct __sk_buff, priority),
offsetof(struct __sk_buff, cb)))
return -EINVAL;
/* cb is allowed */
if (!range_is_zero(__skb, offsetof(struct __sk_buff, cb) +
sizeof_field(struct __sk_buff, cb),
if (!range_is_zero(__skb, offsetofend(struct __sk_buff, cb),
offsetof(struct __sk_buff, tstamp)))
return -EINVAL;
/* tstamp is allowed */
/* wire_len is allowed */
/* gso_segs is allowed */
if (!range_is_zero(__skb, offsetof(struct __sk_buff, tstamp) +
sizeof_field(struct __sk_buff, tstamp),
if (!range_is_zero(__skb, offsetofend(struct __sk_buff, gso_segs),
sizeof(struct __sk_buff)))
return -EINVAL;
skb->mark = __skb->mark;
skb->priority = __skb->priority;
skb->tstamp = __skb->tstamp;
memcpy(&cb->data, __skb->cb, QDISC_CB_PRIV_LEN);
if (__skb->wire_len == 0) {
cb->pkt_len = skb->len;
} else {
if (__skb->wire_len < skb->len ||
__skb->wire_len > GSO_MAX_SIZE)
return -EINVAL;
cb->pkt_len = __skb->wire_len;
}
if (__skb->gso_segs > GSO_MAX_SEGS)
return -EINVAL;
skb_shinfo(skb)->gso_segs = __skb->gso_segs;
return 0;
}
......@@ -285,9 +308,12 @@ static void convert_skb_to___skb(struct sk_buff *skb, struct __sk_buff *__skb)
if (!__skb)
return;
__skb->mark = skb->mark;
__skb->priority = skb->priority;
__skb->tstamp = skb->tstamp;
memcpy(__skb->cb, &cb->data, QDISC_CB_PRIV_LEN);
__skb->wire_len = cb->pkt_len;
__skb->gso_segs = skb_shinfo(skb)->gso_segs;
}
int bpf_prog_test_run_skb(struct bpf_prog *prog, const union bpf_attr *kattr,
......@@ -359,7 +385,7 @@ int bpf_prog_test_run_skb(struct bpf_prog *prog, const union bpf_attr *kattr,
ret = convert___skb_to_skb(skb, ctx);
if (ret)
goto out;
ret = bpf_test_run(prog, skb, repeat, &retval, &duration);
ret = bpf_test_run(prog, skb, repeat, &retval, &duration, false);
if (ret)
goto out;
if (!is_l2) {
......@@ -416,8 +442,8 @@ int bpf_prog_test_run_xdp(struct bpf_prog *prog, const union bpf_attr *kattr,
rxqueue = __netif_get_rx_queue(current->nsproxy->net_ns->loopback_dev, 0);
xdp.rxq = &rxqueue->xdp_rxq;
ret = bpf_test_run(prog, &xdp, repeat, &retval, &duration);
bpf_prog_change_xdp(NULL, prog);
ret = bpf_test_run(prog, &xdp, repeat, &retval, &duration, true);
if (ret)
goto out;
if (xdp.data != data + XDP_PACKET_HEADROOM + NET_IP_ALIGN ||
......@@ -425,6 +451,7 @@ int bpf_prog_test_run_xdp(struct bpf_prog *prog, const union bpf_attr *kattr,
size = xdp.data_end - xdp.data;
ret = bpf_test_finish(kattr, uattr, xdp.data, size, retval, duration);
out:
bpf_prog_change_xdp(prog, NULL);
kfree(data);
return ret;
}
......@@ -437,8 +464,7 @@ static int verify_user_bpf_flow_keys(struct bpf_flow_keys *ctx)
/* flags is allowed */
if (!range_is_zero(ctx, offsetof(struct bpf_flow_keys, flags) +
sizeof_field(struct bpf_flow_keys, flags),
if (!range_is_zero(ctx, offsetofend(struct bpf_flow_keys, flags),
sizeof(struct bpf_flow_keys)))
return -EINVAL;
......
......@@ -8542,7 +8542,17 @@ static int dev_xdp_install(struct net_device *dev, bpf_op_t bpf_op,
struct netlink_ext_ack *extack, u32 flags,
struct bpf_prog *prog)
{
bool non_hw = !(flags & XDP_FLAGS_HW_MODE);
struct bpf_prog *prev_prog = NULL;
struct netdev_bpf xdp;
int err;
if (non_hw) {
prev_prog = bpf_prog_by_id(__dev_xdp_query(dev, bpf_op,
XDP_QUERY_PROG));
if (IS_ERR(prev_prog))
prev_prog = NULL;
}
memset(&xdp, 0, sizeof(xdp));
if (flags & XDP_FLAGS_HW_MODE)
......@@ -8553,7 +8563,14 @@ static int dev_xdp_install(struct net_device *dev, bpf_op_t bpf_op,
xdp.flags = flags;
xdp.prog = prog;
return bpf_op(dev, &xdp);
err = bpf_op(dev, &xdp);
if (!err && non_hw)
bpf_prog_change_xdp(prev_prog, prog);
if (prev_prog)
bpf_prog_put(prev_prog);
return err;
}
static void dev_xdp_uninstall(struct net_device *dev)
......
......@@ -3511,36 +3511,16 @@ xdp_do_redirect_slow(struct net_device *dev, struct xdp_buff *xdp,
}
static int __bpf_tx_xdp_map(struct net_device *dev_rx, void *fwd,
struct bpf_map *map,
struct xdp_buff *xdp,
u32 index)
struct bpf_map *map, struct xdp_buff *xdp)
{
int err;
switch (map->map_type) {
case BPF_MAP_TYPE_DEVMAP:
case BPF_MAP_TYPE_DEVMAP_HASH: {
struct bpf_dtab_netdev *dst = fwd;
err = dev_map_enqueue(dst, xdp, dev_rx);
if (unlikely(err))
return err;
break;
}
case BPF_MAP_TYPE_CPUMAP: {
struct bpf_cpu_map_entry *rcpu = fwd;
err = cpu_map_enqueue(rcpu, xdp, dev_rx);
if (unlikely(err))
return err;
break;
}
case BPF_MAP_TYPE_XSKMAP: {
struct xdp_sock *xs = fwd;
err = __xsk_map_redirect(map, xdp, xs);
return err;
}
case BPF_MAP_TYPE_DEVMAP_HASH:
return dev_map_enqueue(fwd, xdp, dev_rx);
case BPF_MAP_TYPE_CPUMAP:
return cpu_map_enqueue(fwd, xdp, dev_rx);
case BPF_MAP_TYPE_XSKMAP:
return __xsk_map_redirect(fwd, xdp);
default:
break;
}
......@@ -3549,26 +3529,9 @@ static int __bpf_tx_xdp_map(struct net_device *dev_rx, void *fwd,
void xdp_do_flush_map(void)
{
struct bpf_redirect_info *ri = this_cpu_ptr(&bpf_redirect_info);
struct bpf_map *map = ri->map_to_flush;
ri->map_to_flush = NULL;
if (map) {
switch (map->map_type) {
case BPF_MAP_TYPE_DEVMAP:
case BPF_MAP_TYPE_DEVMAP_HASH:
__dev_map_flush(map);
break;
case BPF_MAP_TYPE_CPUMAP:
__cpu_map_flush(map);
break;
case BPF_MAP_TYPE_XSKMAP:
__xsk_map_flush(map);
break;
default:
break;
}
}
__dev_map_flush();
__cpu_map_flush();
__xsk_map_flush();
}
EXPORT_SYMBOL_GPL(xdp_do_flush_map);
......@@ -3617,14 +3580,10 @@ static int xdp_do_redirect_map(struct net_device *dev, struct xdp_buff *xdp,
ri->tgt_value = NULL;
WRITE_ONCE(ri->map, NULL);
if (ri->map_to_flush && unlikely(ri->map_to_flush != map))
xdp_do_flush_map();
err = __bpf_tx_xdp_map(dev, fwd, map, xdp, index);
err = __bpf_tx_xdp_map(dev, fwd, map, xdp);
if (unlikely(err))
goto err;
ri->map_to_flush = map;
_trace_xdp_redirect_map(dev, xdp_prog, fwd, map, index);
return 0;
err:
......@@ -8941,3 +8900,11 @@ const struct bpf_verifier_ops sk_reuseport_verifier_ops = {
const struct bpf_prog_ops sk_reuseport_prog_ops = {
};
#endif /* CONFIG_INET */
DEFINE_BPF_DISPATCHER(bpf_dispatcher_xdp)
void bpf_prog_change_xdp(struct bpf_prog *prev_prog, struct bpf_prog *prog)
{
bpf_dispatcher_change_prog(BPF_DISPATCHER_PTR(bpf_dispatcher_xdp),
prev_prog, prog);
}
......@@ -31,6 +31,8 @@
#define TX_BATCH_SIZE 16
static DEFINE_PER_CPU(struct list_head, xskmap_flush_list);
bool xsk_is_setup_for_bpf_map(struct xdp_sock *xs)
{
return READ_ONCE(xs->rx) && READ_ONCE(xs->umem) &&
......@@ -39,21 +41,21 @@ bool xsk_is_setup_for_bpf_map(struct xdp_sock *xs)
bool xsk_umem_has_addrs(struct xdp_umem *umem, u32 cnt)
{
return xskq_has_addrs(umem->fq, cnt);
return xskq_cons_has_entries(umem->fq, cnt);
}
EXPORT_SYMBOL(xsk_umem_has_addrs);
u64 *xsk_umem_peek_addr(struct xdp_umem *umem, u64 *addr)
bool xsk_umem_peek_addr(struct xdp_umem *umem, u64 *addr)
{
return xskq_peek_addr(umem->fq, addr, umem);
return xskq_cons_peek_addr(umem->fq, addr, umem);
}
EXPORT_SYMBOL(xsk_umem_peek_addr);
void xsk_umem_discard_addr(struct xdp_umem *umem)
void xsk_umem_release_addr(struct xdp_umem *umem)
{
xskq_discard_addr(umem->fq);
xskq_cons_release(umem->fq);
}
EXPORT_SYMBOL(xsk_umem_discard_addr);
EXPORT_SYMBOL(xsk_umem_release_addr);
void xsk_set_rx_need_wakeup(struct xdp_umem *umem)
{
......@@ -124,7 +126,7 @@ static void __xsk_rcv_memcpy(struct xdp_umem *umem, u64 addr, void *from_buf,
void *to_buf = xdp_umem_get_data(umem, addr);
addr = xsk_umem_add_offset_to_addr(addr);
if (xskq_crosses_non_contig_pg(umem, addr, len + metalen)) {
if (xskq_cons_crosses_non_contig_pg(umem, addr, len + metalen)) {
void *next_pg_addr = umem->pages[(addr >> PAGE_SHIFT) + 1].addr;
u64 page_start = addr & ~(PAGE_SIZE - 1);
u64 first_len = PAGE_SIZE - (addr - page_start);
......@@ -146,7 +148,7 @@ static int __xsk_rcv(struct xdp_sock *xs, struct xdp_buff *xdp, u32 len)
u32 metalen;
int err;
if (!xskq_peek_addr(xs->umem->fq, &addr, xs->umem) ||
if (!xskq_cons_peek_addr(xs->umem->fq, &addr, xs->umem) ||
len > xs->umem->chunk_size_nohr - XDP_PACKET_HEADROOM) {
xs->rx_dropped++;
return -ENOSPC;
......@@ -165,9 +167,9 @@ static int __xsk_rcv(struct xdp_sock *xs, struct xdp_buff *xdp, u32 len)
offset += metalen;
addr = xsk_umem_adjust_offset(xs->umem, addr, offset);
err = xskq_produce_batch_desc(xs->rx, addr, len);
err = xskq_prod_reserve_desc(xs->rx, addr, len);
if (!err) {
xskq_discard_addr(xs->umem->fq);
xskq_cons_release(xs->umem->fq);
xdp_return_buff(xdp);
return 0;
}
......@@ -178,7 +180,7 @@ static int __xsk_rcv(struct xdp_sock *xs, struct xdp_buff *xdp, u32 len)
static int __xsk_rcv_zc(struct xdp_sock *xs, struct xdp_buff *xdp, u32 len)
{
int err = xskq_produce_batch_desc(xs->rx, (u64)xdp->handle, len);
int err = xskq_prod_reserve_desc(xs->rx, xdp->handle, len);
if (err)
xs->rx_dropped++;
......@@ -214,7 +216,7 @@ static int xsk_rcv(struct xdp_sock *xs, struct xdp_buff *xdp)
static void xsk_flush(struct xdp_sock *xs)
{
xskq_produce_flush_desc(xs->rx);
xskq_prod_submit(xs->rx);
xs->sk.sk_data_ready(&xs->sk);
}
......@@ -234,7 +236,7 @@ int xsk_generic_rcv(struct xdp_sock *xs, struct xdp_buff *xdp)
goto out_unlock;
}
if (!xskq_peek_addr(xs->umem->fq, &addr, xs->umem) ||
if (!xskq_cons_peek_addr(xs->umem->fq, &addr, xs->umem) ||
len > xs->umem->chunk_size_nohr - XDP_PACKET_HEADROOM) {
err = -ENOSPC;
goto out_drop;
......@@ -245,12 +247,12 @@ int xsk_generic_rcv(struct xdp_sock *xs, struct xdp_buff *xdp)
memcpy(buffer, xdp->data_meta, len + metalen);
addr = xsk_umem_adjust_offset(xs->umem, addr, metalen);
err = xskq_produce_batch_desc(xs->rx, addr, len);
err = xskq_prod_reserve_desc(xs->rx, addr, len);
if (err)
goto out_drop;
xskq_discard_addr(xs->umem->fq);
xskq_produce_flush_desc(xs->rx);
xskq_cons_release(xs->umem->fq);
xskq_prod_submit(xs->rx);
spin_unlock_bh(&xs->rx_lock);
......@@ -264,11 +266,9 @@ int xsk_generic_rcv(struct xdp_sock *xs, struct xdp_buff *xdp)
return err;
}
int __xsk_map_redirect(struct bpf_map *map, struct xdp_buff *xdp,
struct xdp_sock *xs)
int __xsk_map_redirect(struct xdp_sock *xs, struct xdp_buff *xdp)
{
struct xsk_map *m = container_of(map, struct xsk_map, map);
struct list_head *flush_list = this_cpu_ptr(m->flush_list);
struct list_head *flush_list = this_cpu_ptr(&xskmap_flush_list);
int err;
err = xsk_rcv(xs, xdp);
......@@ -281,10 +281,9 @@ int __xsk_map_redirect(struct bpf_map *map, struct xdp_buff *xdp,
return 0;
}
void __xsk_map_flush(struct bpf_map *map)
void __xsk_map_flush(void)
{
struct xsk_map *m = container_of(map, struct xsk_map, map);
struct list_head *flush_list = this_cpu_ptr(m->flush_list);
struct list_head *flush_list = this_cpu_ptr(&xskmap_flush_list);
struct xdp_sock *xs, *tmp;
list_for_each_entry_safe(xs, tmp, flush_list, flush_node) {
......@@ -295,7 +294,7 @@ void __xsk_map_flush(struct bpf_map *map)
void xsk_umem_complete_tx(struct xdp_umem *umem, u32 nb_entries)
{
xskq_produce_flush_addr_n(umem->cq, nb_entries);
xskq_prod_submit_n(umem->cq, nb_entries);
}
EXPORT_SYMBOL(xsk_umem_complete_tx);
......@@ -317,13 +316,18 @@ bool xsk_umem_consume_tx(struct xdp_umem *umem, struct xdp_desc *desc)
rcu_read_lock();
list_for_each_entry_rcu(xs, &umem->xsk_list, list) {
if (!xskq_peek_desc(xs->tx, desc, umem))
if (!xskq_cons_peek_desc(xs->tx, desc, umem))
continue;
if (xskq_produce_addr_lazy(umem->cq, desc->addr))
/* This is the backpreassure mechanism for the Tx path.
* Reserve space in the completion queue and only proceed
* if there is space in it. This avoids having to implement
* any buffering in the Tx path.
*/
if (xskq_prod_reserve_addr(umem->cq, desc->addr))
goto out;
xskq_discard_desc(xs->tx);
xskq_cons_release(xs->tx);
rcu_read_unlock();
return true;
}
......@@ -358,7 +362,7 @@ static void xsk_destruct_skb(struct sk_buff *skb)
unsigned long flags;
spin_lock_irqsave(&xs->tx_completion_lock, flags);
WARN_ON_ONCE(xskq_produce_addr(xs->umem->cq, addr));
xskq_prod_submit_addr(xs->umem->cq, addr);
spin_unlock_irqrestore(&xs->tx_completion_lock, flags);
sock_wfree(skb);
......@@ -378,7 +382,7 @@ static int xsk_generic_xmit(struct sock *sk)
if (xs->queue_id >= xs->dev->real_num_tx_queues)
goto out;
while (xskq_peek_desc(xs->tx, &desc, xs->umem)) {
while (xskq_cons_peek_desc(xs->tx, &desc, xs->umem)) {
char *buffer;
u64 addr;
u32 len;
......@@ -399,7 +403,12 @@ static int xsk_generic_xmit(struct sock *sk)
addr = desc.addr;
buffer = xdp_umem_get_data(xs->umem, addr);
err = skb_store_bits(skb, 0, buffer, len);
if (unlikely(err) || xskq_reserve_addr(xs->umem->cq)) {
/* This is the backpreassure mechanism for the Tx path.
* Reserve space in the completion queue and only proceed
* if there is space in it. This avoids having to implement
* any buffering in the Tx path.
*/
if (unlikely(err) || xskq_prod_reserve(xs->umem->cq)) {
kfree_skb(skb);
goto out;
}
......@@ -411,7 +420,7 @@ static int xsk_generic_xmit(struct sock *sk)
skb->destructor = xsk_destruct_skb;
err = dev_direct_xmit(skb, xs->queue_id);
xskq_discard_desc(xs->tx);
xskq_cons_release(xs->tx);
/* Ignore NET_XMIT_CN as packet might have been sent */
if (err == NET_XMIT_DROP || err == NETDEV_TX_BUSY) {
/* SKB completed but not sent */
......@@ -477,9 +486,9 @@ static __poll_t xsk_poll(struct file *file, struct socket *sock,
__xsk_sendmsg(sk);
}
if (xs->rx && !xskq_empty_desc(xs->rx))
if (xs->rx && !xskq_prod_is_empty(xs->rx))
mask |= EPOLLIN | EPOLLRDNORM;
if (xs->tx && !xskq_full_desc(xs->tx))
if (xs->tx && !xskq_cons_is_full(xs->tx))
mask |= EPOLLOUT | EPOLLWRNORM;
return mask;
......@@ -1183,7 +1192,7 @@ static struct pernet_operations xsk_net_ops = {
static int __init xsk_init(void)
{
int err;
int err, cpu;
err = proto_register(&xsk_proto, 0 /* no slab */);
if (err)
......@@ -1201,6 +1210,8 @@ static int __init xsk_init(void)
if (err)
goto out_pernet;
for_each_possible_cpu(cpu)
INIT_LIST_HEAD(&per_cpu(xskmap_flush_list, cpu));
return 0;
out_pernet:
......
......@@ -18,14 +18,14 @@ void xskq_set_umem(struct xsk_queue *q, u64 size, u64 chunk_mask)
q->chunk_mask = chunk_mask;
}
static u32 xskq_umem_get_ring_size(struct xsk_queue *q)
static size_t xskq_get_ring_size(struct xsk_queue *q, bool umem_queue)
{
return sizeof(struct xdp_umem_ring) + q->nentries * sizeof(u64);
}
struct xdp_umem_ring *umem_ring;
struct xdp_rxtx_ring *rxtx_ring;
static u32 xskq_rxtx_get_ring_size(struct xsk_queue *q)
{
return sizeof(struct xdp_ring) + q->nentries * sizeof(struct xdp_desc);
if (umem_queue)
return struct_size(umem_ring, desc, q->nentries);
return struct_size(rxtx_ring, desc, q->nentries);
}
struct xsk_queue *xskq_create(u32 nentries, bool umem_queue)
......@@ -43,8 +43,7 @@ struct xsk_queue *xskq_create(u32 nentries, bool umem_queue)
gfp_flags = GFP_KERNEL | __GFP_ZERO | __GFP_NOWARN |
__GFP_COMP | __GFP_NORETRY;
size = umem_queue ? xskq_umem_get_ring_size(q) :
xskq_rxtx_get_ring_size(q);
size = xskq_get_ring_size(q, umem_queue);
q->ring = (struct xdp_ring *)__get_free_pages(gfp_flags,
get_order(size));
......
......@@ -10,9 +10,6 @@
#include <linux/if_xdp.h>
#include <net/xdp_sock.h>
#define RX_BATCH_SIZE 16
#define LAZY_UPDATE_THRESHOLD 128
struct xdp_ring {
u32 producer ____cacheline_aligned_in_smp;
u32 consumer ____cacheline_aligned_in_smp;
......@@ -36,10 +33,8 @@ struct xsk_queue {
u64 size;
u32 ring_mask;
u32 nentries;
u32 prod_head;
u32 prod_tail;
u32 cons_head;
u32 cons_tail;
u32 cached_prod;
u32 cached_cons;
struct xdp_ring *ring;
u64 invalid_descs;
};
......@@ -86,56 +81,31 @@ struct xsk_queue {
* now and again after circling through the ring.
*/
/* Common functions operating for both RXTX and umem queues */
static inline u64 xskq_nb_invalid_descs(struct xsk_queue *q)
{
return q ? q->invalid_descs : 0;
}
static inline u32 xskq_nb_avail(struct xsk_queue *q, u32 dcnt)
{
u32 entries = q->prod_tail - q->cons_tail;
if (entries == 0) {
/* Refresh the local pointer */
q->prod_tail = READ_ONCE(q->ring->producer);
entries = q->prod_tail - q->cons_tail;
}
return (entries > dcnt) ? dcnt : entries;
}
static inline u32 xskq_nb_free(struct xsk_queue *q, u32 producer, u32 dcnt)
{
u32 free_entries = q->nentries - (producer - q->cons_tail);
if (free_entries >= dcnt)
return free_entries;
/* Refresh the local tail pointer */
q->cons_tail = READ_ONCE(q->ring->consumer);
return q->nentries - (producer - q->cons_tail);
}
static inline bool xskq_has_addrs(struct xsk_queue *q, u32 cnt)
{
u32 entries = q->prod_tail - q->cons_tail;
if (entries >= cnt)
return true;
/* Refresh the local pointer. */
q->prod_tail = READ_ONCE(q->ring->producer);
entries = q->prod_tail - q->cons_tail;
return entries >= cnt;
}
/* The operations on the rings are the following:
*
* producer consumer
*
* RESERVE entries PEEK in the ring for entries
* WRITE data into the ring READ data from the ring
* SUBMIT entries RELEASE entries
*
* The producer reserves one or more entries in the ring. It can then
* fill in these entries and finally submit them so that they can be
* seen and read by the consumer.
*
* The consumer peeks into the ring to see if the producer has written
* any new entries. If so, the producer can then read these entries
* and when it is done reading them release them back to the producer
* so that the producer can use these slots to fill in new entries.
*
* The function names below reflect these operations.
*/
/* UMEM queue */
/* Functions that read and validate content from consumer rings. */
static inline bool xskq_crosses_non_contig_pg(struct xdp_umem *umem, u64 addr,
u64 length)
static inline bool xskq_cons_crosses_non_contig_pg(struct xdp_umem *umem,
u64 addr,
u64 length)
{
bool cross_pg = (addr & (PAGE_SIZE - 1)) + length > PAGE_SIZE;
bool next_pg_contig =
......@@ -145,9 +115,16 @@ static inline bool xskq_crosses_non_contig_pg(struct xdp_umem *umem, u64 addr,
return cross_pg && !next_pg_contig;
}
static inline bool xskq_is_valid_addr(struct xsk_queue *q, u64 addr)
static inline bool xskq_cons_is_valid_unaligned(struct xsk_queue *q,
u64 addr,
u64 length,
struct xdp_umem *umem)
{
if (addr >= q->size) {
u64 base_addr = xsk_umem_extract_addr(addr);
addr = xsk_umem_add_offset_to_addr(addr);
if (base_addr >= q->size || addr >= q->size ||
xskq_cons_crosses_non_contig_pg(umem, addr, length)) {
q->invalid_descs++;
return false;
}
......@@ -155,15 +132,9 @@ static inline bool xskq_is_valid_addr(struct xsk_queue *q, u64 addr)
return true;
}
static inline bool xskq_is_valid_addr_unaligned(struct xsk_queue *q, u64 addr,
u64 length,
struct xdp_umem *umem)
static inline bool xskq_cons_is_valid_addr(struct xsk_queue *q, u64 addr)
{
u64 base_addr = xsk_umem_extract_addr(addr);
addr = xsk_umem_add_offset_to_addr(addr);
if (base_addr >= q->size || addr >= q->size ||
xskq_crosses_non_contig_pg(umem, addr, length)) {
if (addr >= q->size) {
q->invalid_descs++;
return false;
}
......@@ -171,204 +142,240 @@ static inline bool xskq_is_valid_addr_unaligned(struct xsk_queue *q, u64 addr,
return true;
}
static inline u64 *xskq_validate_addr(struct xsk_queue *q, u64 *addr,
struct xdp_umem *umem)
static inline bool xskq_cons_read_addr(struct xsk_queue *q, u64 *addr,
struct xdp_umem *umem)
{
while (q->cons_tail != q->cons_head) {
struct xdp_umem_ring *ring = (struct xdp_umem_ring *)q->ring;
unsigned int idx = q->cons_tail & q->ring_mask;
struct xdp_umem_ring *ring = (struct xdp_umem_ring *)q->ring;
*addr = READ_ONCE(ring->desc[idx]) & q->chunk_mask;
while (q->cached_cons != q->cached_prod) {
u32 idx = q->cached_cons & q->ring_mask;
*addr = ring->desc[idx] & q->chunk_mask;
if (umem->flags & XDP_UMEM_UNALIGNED_CHUNK_FLAG) {
if (xskq_is_valid_addr_unaligned(q, *addr,
if (xskq_cons_is_valid_unaligned(q, *addr,
umem->chunk_size_nohr,
umem))
return addr;
return true;
goto out;
}
if (xskq_is_valid_addr(q, *addr))
return addr;
if (xskq_cons_is_valid_addr(q, *addr))
return true;
out:
q->cons_tail++;
q->cached_cons++;
}
return NULL;
return false;
}
static inline u64 *xskq_peek_addr(struct xsk_queue *q, u64 *addr,
struct xdp_umem *umem)
static inline bool xskq_cons_is_valid_desc(struct xsk_queue *q,
struct xdp_desc *d,
struct xdp_umem *umem)
{
if (q->cons_tail == q->cons_head) {
smp_mb(); /* D, matches A */
WRITE_ONCE(q->ring->consumer, q->cons_tail);
q->cons_head = q->cons_tail + xskq_nb_avail(q, RX_BATCH_SIZE);
if (umem->flags & XDP_UMEM_UNALIGNED_CHUNK_FLAG) {
if (!xskq_cons_is_valid_unaligned(q, d->addr, d->len, umem))
return false;
if (d->len > umem->chunk_size_nohr || d->options) {
q->invalid_descs++;
return false;
}
/* Order consumer and data */
smp_rmb();
return true;
}
return xskq_validate_addr(q, addr, umem);
}
if (!xskq_cons_is_valid_addr(q, d->addr))
return false;
static inline void xskq_discard_addr(struct xsk_queue *q)
{
q->cons_tail++;
if (((d->addr + d->len) & q->chunk_mask) != (d->addr & q->chunk_mask) ||
d->options) {
q->invalid_descs++;
return false;
}
return true;
}
static inline int xskq_produce_addr(struct xsk_queue *q, u64 addr)
static inline bool xskq_cons_read_desc(struct xsk_queue *q,
struct xdp_desc *desc,
struct xdp_umem *umem)
{
struct xdp_umem_ring *ring = (struct xdp_umem_ring *)q->ring;
if (xskq_nb_free(q, q->prod_tail, 1) == 0)
return -ENOSPC;
while (q->cached_cons != q->cached_prod) {
struct xdp_rxtx_ring *ring = (struct xdp_rxtx_ring *)q->ring;
u32 idx = q->cached_cons & q->ring_mask;
/* A, matches D */
ring->desc[q->prod_tail++ & q->ring_mask] = addr;
*desc = ring->desc[idx];
if (xskq_cons_is_valid_desc(q, desc, umem))
return true;
/* Order producer and data */
smp_wmb(); /* B, matches C */
q->cached_cons++;
}
WRITE_ONCE(q->ring->producer, q->prod_tail);
return 0;
return false;
}
static inline int xskq_produce_addr_lazy(struct xsk_queue *q, u64 addr)
/* Functions for consumers */
static inline void __xskq_cons_release(struct xsk_queue *q)
{
struct xdp_umem_ring *ring = (struct xdp_umem_ring *)q->ring;
smp_mb(); /* D, matches A */
WRITE_ONCE(q->ring->consumer, q->cached_cons);
}
if (xskq_nb_free(q, q->prod_head, LAZY_UPDATE_THRESHOLD) == 0)
return -ENOSPC;
static inline void __xskq_cons_peek(struct xsk_queue *q)
{
/* Refresh the local pointer */
q->cached_prod = READ_ONCE(q->ring->producer);
smp_rmb(); /* C, matches B */
}
/* A, matches D */
ring->desc[q->prod_head++ & q->ring_mask] = addr;
return 0;
static inline void xskq_cons_get_entries(struct xsk_queue *q)
{
__xskq_cons_release(q);
__xskq_cons_peek(q);
}
static inline void xskq_produce_flush_addr_n(struct xsk_queue *q,
u32 nb_entries)
static inline bool xskq_cons_has_entries(struct xsk_queue *q, u32 cnt)
{
/* Order producer and data */
smp_wmb(); /* B, matches C */
u32 entries = q->cached_prod - q->cached_cons;
q->prod_tail += nb_entries;
WRITE_ONCE(q->ring->producer, q->prod_tail);
if (entries >= cnt)
return true;
__xskq_cons_peek(q);
entries = q->cached_prod - q->cached_cons;
return entries >= cnt;
}
static inline int xskq_reserve_addr(struct xsk_queue *q)
static inline bool xskq_cons_peek_addr(struct xsk_queue *q, u64 *addr,
struct xdp_umem *umem)
{
if (xskq_nb_free(q, q->prod_head, 1) == 0)
return -ENOSPC;
if (q->cached_prod == q->cached_cons)
xskq_cons_get_entries(q);
return xskq_cons_read_addr(q, addr, umem);
}
/* A, matches D */
q->prod_head++;
return 0;
static inline bool xskq_cons_peek_desc(struct xsk_queue *q,
struct xdp_desc *desc,
struct xdp_umem *umem)
{
if (q->cached_prod == q->cached_cons)
xskq_cons_get_entries(q);
return xskq_cons_read_desc(q, desc, umem);
}
/* Rx/Tx queue */
static inline void xskq_cons_release(struct xsk_queue *q)
{
/* To improve performance, only update local state here.
* Reflect this to global state when we get new entries
* from the ring in xskq_cons_get_entries().
*/
q->cached_cons++;
}
static inline bool xskq_is_valid_desc(struct xsk_queue *q, struct xdp_desc *d,
struct xdp_umem *umem)
static inline bool xskq_cons_is_full(struct xsk_queue *q)
{
if (umem->flags & XDP_UMEM_UNALIGNED_CHUNK_FLAG) {
if (!xskq_is_valid_addr_unaligned(q, d->addr, d->len, umem))
return false;
/* No barriers needed since data is not accessed */
return READ_ONCE(q->ring->producer) - READ_ONCE(q->ring->consumer) ==
q->nentries;
}
if (d->len > umem->chunk_size_nohr || d->options) {
q->invalid_descs++;
return false;
}
/* Functions for producers */
return true;
}
static inline bool xskq_prod_is_full(struct xsk_queue *q)
{
u32 free_entries = q->nentries - (q->cached_prod - q->cached_cons);
if (!xskq_is_valid_addr(q, d->addr))
if (free_entries)
return false;
if (((d->addr + d->len) & q->chunk_mask) != (d->addr & q->chunk_mask) ||
d->options) {
q->invalid_descs++;
return false;
}
/* Refresh the local tail pointer */
q->cached_cons = READ_ONCE(q->ring->consumer);
free_entries = q->nentries - (q->cached_prod - q->cached_cons);
return true;
return !free_entries;
}
static inline struct xdp_desc *xskq_validate_desc(struct xsk_queue *q,
struct xdp_desc *desc,
struct xdp_umem *umem)
static inline int xskq_prod_reserve(struct xsk_queue *q)
{
while (q->cons_tail != q->cons_head) {
struct xdp_rxtx_ring *ring = (struct xdp_rxtx_ring *)q->ring;
unsigned int idx = q->cons_tail & q->ring_mask;
*desc = READ_ONCE(ring->desc[idx]);
if (xskq_is_valid_desc(q, desc, umem))
return desc;
q->cons_tail++;
}
if (xskq_prod_is_full(q))
return -ENOSPC;
return NULL;
/* A, matches D */
q->cached_prod++;
return 0;
}
static inline struct xdp_desc *xskq_peek_desc(struct xsk_queue *q,
struct xdp_desc *desc,
struct xdp_umem *umem)
static inline int xskq_prod_reserve_addr(struct xsk_queue *q, u64 addr)
{
if (q->cons_tail == q->cons_head) {
smp_mb(); /* D, matches A */
WRITE_ONCE(q->ring->consumer, q->cons_tail);
q->cons_head = q->cons_tail + xskq_nb_avail(q, RX_BATCH_SIZE);
/* Order consumer and data */
smp_rmb(); /* C, matches B */
}
struct xdp_umem_ring *ring = (struct xdp_umem_ring *)q->ring;
return xskq_validate_desc(q, desc, umem);
}
if (xskq_prod_is_full(q))
return -ENOSPC;
static inline void xskq_discard_desc(struct xsk_queue *q)
{
q->cons_tail++;
/* A, matches D */
ring->desc[q->cached_prod++ & q->ring_mask] = addr;
return 0;
}
static inline int xskq_produce_batch_desc(struct xsk_queue *q,
u64 addr, u32 len)
static inline int xskq_prod_reserve_desc(struct xsk_queue *q,
u64 addr, u32 len)
{
struct xdp_rxtx_ring *ring = (struct xdp_rxtx_ring *)q->ring;
unsigned int idx;
u32 idx;
if (xskq_nb_free(q, q->prod_head, 1) == 0)
if (xskq_prod_is_full(q))
return -ENOSPC;
/* A, matches D */
idx = (q->prod_head++) & q->ring_mask;
idx = q->cached_prod++ & q->ring_mask;
ring->desc[idx].addr = addr;
ring->desc[idx].len = len;
return 0;
}
static inline void xskq_produce_flush_desc(struct xsk_queue *q)
static inline void __xskq_prod_submit(struct xsk_queue *q, u32 idx)
{
/* Order producer and data */
smp_wmb(); /* B, matches C */
q->prod_tail = q->prod_head;
WRITE_ONCE(q->ring->producer, q->prod_tail);
WRITE_ONCE(q->ring->producer, idx);
}
static inline void xskq_prod_submit(struct xsk_queue *q)
{
__xskq_prod_submit(q, q->cached_prod);
}
static inline void xskq_prod_submit_addr(struct xsk_queue *q, u64 addr)
{
struct xdp_umem_ring *ring = (struct xdp_umem_ring *)q->ring;
u32 idx = q->ring->producer;
ring->desc[idx++ & q->ring_mask] = addr;
__xskq_prod_submit(q, idx);
}
static inline bool xskq_full_desc(struct xsk_queue *q)
static inline void xskq_prod_submit_n(struct xsk_queue *q, u32 nb_entries)
{
return xskq_nb_avail(q, q->nentries) == q->nentries;
__xskq_prod_submit(q, q->ring->producer + nb_entries);
}
static inline bool xskq_empty_desc(struct xsk_queue *q)
static inline bool xskq_prod_is_empty(struct xsk_queue *q)
{
return xskq_nb_free(q, q->prod_tail, q->nentries) == q->nentries;
/* No barriers needed since data is not accessed */
return READ_ONCE(q->ring->consumer) == READ_ONCE(q->ring->producer);
}
/* For both producers and consumers */
static inline u64 xskq_nb_invalid_descs(struct xsk_queue *q)
{
return q ? q->invalid_descs : 0;
}
void xskq_set_umem(struct xsk_queue *q, u64 size, u64 chunk_mask);
......
......@@ -38,6 +38,8 @@ tprogs-y += tc_l2_redirect
tprogs-y += lwt_len_hist
tprogs-y += xdp_tx_iptunnel
tprogs-y += test_map_in_map
tprogs-y += per_socket_stats_example
tprogs-y += xdp_redirect
tprogs-y += xdp_redirect_map
tprogs-y += xdp_redirect_cpu
tprogs-y += xdp_monitor
......@@ -196,7 +198,7 @@ endif
TPROGCFLAGS_bpf_load.o += -Wno-unused-variable
TPROGS_LDLIBS += $(LIBBPF) -lelf
TPROGS_LDLIBS += $(LIBBPF) -lelf -lz
TPROGLDLIBS_tracex4 += -lrt
TPROGLDLIBS_trace_output += -lrt
TPROGLDLIBS_map_perf_test += -lrt
......@@ -234,6 +236,7 @@ BTF_LLVM_PROBE := $(shell echo "int main() { return 0; }" | \
readelf -S ./llvm_btf_verify.o | grep BTF; \
/bin/rm -f ./llvm_btf_verify.o)
BPF_EXTRA_CFLAGS += -fno-stack-protector
ifneq ($(BTF_LLVM_PROBE),)
BPF_EXTRA_CFLAGS += -g
else
......
......@@ -98,7 +98,7 @@ int main(int argc, char **argv)
xdp_flags |= XDP_FLAGS_SKB_MODE;
break;
case 'N':
xdp_flags |= XDP_FLAGS_DRV_MODE;
/* default, set below */
break;
case 'F':
xdp_flags &= ~XDP_FLAGS_UPDATE_IF_NOEXIST;
......@@ -109,6 +109,9 @@ int main(int argc, char **argv)
}
}
if (!(xdp_flags & XDP_FLAGS_SKB_MODE))
xdp_flags |= XDP_FLAGS_DRV_MODE;
if (optind == argc) {
usage(basename(argv[0]));
return 1;
......
......@@ -120,7 +120,7 @@ int main(int argc, char **argv)
xdp_flags |= XDP_FLAGS_SKB_MODE;
break;
case 'N':
xdp_flags |= XDP_FLAGS_DRV_MODE;
/* default, set below */
break;
case 'F':
xdp_flags &= ~XDP_FLAGS_UPDATE_IF_NOEXIST;
......@@ -132,6 +132,9 @@ int main(int argc, char **argv)
opt_flags[opt] = 0;
}
if (!(xdp_flags & XDP_FLAGS_SKB_MODE))
xdp_flags |= XDP_FLAGS_DRV_MODE;
for (i = 0; i < strlen(optstr); i++) {
if (opt_flags[(unsigned int)optstr[i]]) {
fprintf(stderr, "Missing argument -%c\n", optstr[i]);
......
......@@ -27,11 +27,13 @@
#include "libbpf.h"
#include <bpf/bpf.h>
static __u32 xdp_flags = XDP_FLAGS_UPDATE_IF_NOEXIST;
static int do_attach(int idx, int prog_fd, int map_fd, const char *name)
{
int err;
err = bpf_set_link_xdp_fd(idx, prog_fd, 0);
err = bpf_set_link_xdp_fd(idx, prog_fd, xdp_flags);
if (err < 0) {
printf("ERROR: failed to attach program to %s\n", name);
return err;
......@@ -49,7 +51,7 @@ static int do_detach(int idx, const char *name)
{
int err;
err = bpf_set_link_xdp_fd(idx, -1, 0);
err = bpf_set_link_xdp_fd(idx, -1, xdp_flags);
if (err < 0)
printf("ERROR: failed to detach program from %s\n", name);
......@@ -83,11 +85,17 @@ int main(int argc, char **argv)
int attach = 1;
int ret = 0;
while ((opt = getopt(argc, argv, ":dD")) != -1) {
while ((opt = getopt(argc, argv, ":dDSF")) != -1) {
switch (opt) {
case 'd':
attach = 0;
break;
case 'S':
xdp_flags |= XDP_FLAGS_SKB_MODE;
break;
case 'F':
xdp_flags &= ~XDP_FLAGS_UPDATE_IF_NOEXIST;
break;
case 'D':
prog_name = "xdp_fwd_direct";
break;
......@@ -97,6 +105,9 @@ int main(int argc, char **argv)
}
}
if (!(xdp_flags & XDP_FLAGS_SKB_MODE))
xdp_flags |= XDP_FLAGS_DRV_MODE;
if (optind == argc) {
usage(basename(argv[0]));
return 1;
......
......@@ -16,6 +16,10 @@ static const char *__doc__ =
#include <getopt.h>
#include <net/if.h>
#include <time.h>
#include <linux/limits.h>
#define __must_check
#include <linux/err.h>
#include <arpa/inet.h>
#include <linux/if_link.h>
......@@ -46,6 +50,10 @@ static int cpus_count_map_fd;
static int cpus_iterator_map_fd;
static int exception_cnt_map_fd;
#define NUM_TP 5
struct bpf_link *tp_links[NUM_TP] = { 0 };
static int tp_cnt = 0;
/* Exit return codes */
#define EXIT_OK 0
#define EXIT_FAIL 1
......@@ -88,6 +96,10 @@ static void int_exit(int sig)
printf("program on interface changed, not removing\n");
}
}
/* Detach tracepoints */
while (tp_cnt)
bpf_link__destroy(tp_links[--tp_cnt]);
exit(EXIT_OK);
}
......@@ -588,23 +600,61 @@ static void stats_poll(int interval, bool use_separators, char *prog_name,
free_stats_record(prev);
}
static struct bpf_link * attach_tp(struct bpf_object *obj,
const char *tp_category,
const char* tp_name)
{
struct bpf_program *prog;
struct bpf_link *link;
char sec_name[PATH_MAX];
int len;
len = snprintf(sec_name, PATH_MAX, "tracepoint/%s/%s",
tp_category, tp_name);
if (len < 0)
exit(EXIT_FAIL);
prog = bpf_object__find_program_by_title(obj, sec_name);
if (!prog) {
fprintf(stderr, "ERR: finding progsec: %s\n", sec_name);
exit(EXIT_FAIL_BPF);
}
link = bpf_program__attach_tracepoint(prog, tp_category, tp_name);
if (IS_ERR(link))
exit(EXIT_FAIL_BPF);
return link;
}
static void init_tracepoints(struct bpf_object *obj) {
tp_links[tp_cnt++] = attach_tp(obj, "xdp", "xdp_redirect_err");
tp_links[tp_cnt++] = attach_tp(obj, "xdp", "xdp_redirect_map_err");
tp_links[tp_cnt++] = attach_tp(obj, "xdp", "xdp_exception");
tp_links[tp_cnt++] = attach_tp(obj, "xdp", "xdp_cpumap_enqueue");
tp_links[tp_cnt++] = attach_tp(obj, "xdp", "xdp_cpumap_kthread");
}
static int init_map_fds(struct bpf_object *obj)
{
cpu_map_fd = bpf_object__find_map_fd_by_name(obj, "cpu_map");
rx_cnt_map_fd = bpf_object__find_map_fd_by_name(obj, "rx_cnt");
/* Maps updated by tracepoints */
redirect_err_cnt_map_fd =
bpf_object__find_map_fd_by_name(obj, "redirect_err_cnt");
exception_cnt_map_fd =
bpf_object__find_map_fd_by_name(obj, "exception_cnt");
cpumap_enqueue_cnt_map_fd =
bpf_object__find_map_fd_by_name(obj, "cpumap_enqueue_cnt");
cpumap_kthread_cnt_map_fd =
bpf_object__find_map_fd_by_name(obj, "cpumap_kthread_cnt");
/* Maps used by XDP */
rx_cnt_map_fd = bpf_object__find_map_fd_by_name(obj, "rx_cnt");
cpu_map_fd = bpf_object__find_map_fd_by_name(obj, "cpu_map");
cpus_available_map_fd =
bpf_object__find_map_fd_by_name(obj, "cpus_available");
cpus_count_map_fd = bpf_object__find_map_fd_by_name(obj, "cpus_count");
cpus_iterator_map_fd =
bpf_object__find_map_fd_by_name(obj, "cpus_iterator");
exception_cnt_map_fd =
bpf_object__find_map_fd_by_name(obj, "exception_cnt");
if (cpu_map_fd < 0 || rx_cnt_map_fd < 0 ||
redirect_err_cnt_map_fd < 0 || cpumap_enqueue_cnt_map_fd < 0 ||
......@@ -662,6 +712,7 @@ int main(int argc, char **argv)
strerror(errno));
return EXIT_FAIL;
}
init_tracepoints(obj);
if (init_map_fds(obj) < 0) {
fprintf(stderr, "bpf_object__find_map_fd_by_name failed\n");
return EXIT_FAIL;
......@@ -728,6 +779,10 @@ int main(int argc, char **argv)
return EXIT_FAIL_OPTION;
}
}
if (!(xdp_flags & XDP_FLAGS_SKB_MODE))
xdp_flags |= XDP_FLAGS_DRV_MODE;
/* Required option */
if (ifindex == -1) {
fprintf(stderr, "ERR: required option --dev missing\n");
......
......@@ -116,7 +116,7 @@ int main(int argc, char **argv)
xdp_flags |= XDP_FLAGS_SKB_MODE;
break;
case 'N':
xdp_flags |= XDP_FLAGS_DRV_MODE;
/* default, set below */
break;
case 'F':
xdp_flags &= ~XDP_FLAGS_UPDATE_IF_NOEXIST;
......@@ -127,6 +127,9 @@ int main(int argc, char **argv)
}
}
if (!(xdp_flags & XDP_FLAGS_SKB_MODE))
xdp_flags |= XDP_FLAGS_DRV_MODE;
if (optind == argc) {
printf("usage: %s <IFNAME|IFINDEX>_IN <IFNAME|IFINDEX>_OUT\n", argv[0]);
return 1;
......
......@@ -117,7 +117,7 @@ int main(int argc, char **argv)
xdp_flags |= XDP_FLAGS_SKB_MODE;
break;
case 'N':
xdp_flags |= XDP_FLAGS_DRV_MODE;
/* default, set below */
break;
case 'F':
xdp_flags &= ~XDP_FLAGS_UPDATE_IF_NOEXIST;
......@@ -128,6 +128,9 @@ int main(int argc, char **argv)
}
}
if (!(xdp_flags & XDP_FLAGS_SKB_MODE))
xdp_flags |= XDP_FLAGS_DRV_MODE;
if (optind == argc) {
printf("usage: %s <IFNAME|IFINDEX>_IN <IFNAME|IFINDEX>_OUT\n", argv[0]);
return 1;
......
......@@ -662,6 +662,9 @@ int main(int ac, char **argv)
}
}
if (!(flags & XDP_FLAGS_SKB_MODE))
flags |= XDP_FLAGS_DRV_MODE;
if (optind == ac) {
usage(basename(argv[0]));
return 1;
......
......@@ -551,6 +551,10 @@ int main(int argc, char **argv)
return EXIT_FAIL_OPTION;
}
}
if (!(xdp_flags & XDP_FLAGS_SKB_MODE))
xdp_flags |= XDP_FLAGS_DRV_MODE;
/* Required option */
if (ifindex == -1) {
fprintf(stderr, "ERR: required option --dev missing\n");
......
......@@ -52,13 +52,13 @@ static int do_detach(int idx, const char *name)
__u32 curr_prog_id = 0;
int err = 0;
err = bpf_get_link_xdp_id(idx, &curr_prog_id, 0);
err = bpf_get_link_xdp_id(idx, &curr_prog_id, xdp_flags);
if (err) {
printf("bpf_get_link_xdp_id failed\n");
return err;
}
if (prog_id == curr_prog_id) {
err = bpf_set_link_xdp_fd(idx, -1, 0);
err = bpf_set_link_xdp_fd(idx, -1, xdp_flags);
if (err < 0)
printf("ERROR: failed to detach prog from %s\n", name);
} else if (!curr_prog_id) {
......@@ -115,7 +115,7 @@ int main(int argc, char **argv)
.prog_type = BPF_PROG_TYPE_XDP,
};
struct perf_buffer_opts pb_opts = {};
const char *optstr = "F";
const char *optstr = "FS";
int prog_fd, map_fd, opt;
struct bpf_object *obj;
struct bpf_map *map;
......@@ -127,12 +127,18 @@ int main(int argc, char **argv)
case 'F':
xdp_flags &= ~XDP_FLAGS_UPDATE_IF_NOEXIST;
break;
case 'S':
xdp_flags |= XDP_FLAGS_SKB_MODE;
break;
default:
usage(basename(argv[0]));
return 1;
}
}
if (!(xdp_flags & XDP_FLAGS_SKB_MODE))
xdp_flags |= XDP_FLAGS_DRV_MODE;
if (optind == argc) {
usage(basename(argv[0]));
return 1;
......
......@@ -231,7 +231,7 @@ int main(int argc, char **argv)
xdp_flags |= XDP_FLAGS_SKB_MODE;
break;
case 'N':
xdp_flags |= XDP_FLAGS_DRV_MODE;
/* default, set below */
break;
case 'F':
xdp_flags &= ~XDP_FLAGS_UPDATE_IF_NOEXIST;
......@@ -243,6 +243,9 @@ int main(int argc, char **argv)
opt_flags[opt] = 0;
}
if (!(xdp_flags & XDP_FLAGS_SKB_MODE))
xdp_flags |= XDP_FLAGS_DRV_MODE;
for (i = 0; i < strlen(optstr); i++) {
if (opt_flags[(unsigned int)optstr[i]]) {
fprintf(stderr, "Missing argument -%c\n", optstr[i]);
......
......@@ -10,6 +10,9 @@
#include <linux/if_link.h>
#include <linux/if_xdp.h>
#include <linux/if_ether.h>
#include <linux/ip.h>
#include <linux/udp.h>
#include <arpa/inet.h>
#include <locale.h>
#include <net/ethernet.h>
#include <net/if.h>
......@@ -45,12 +48,14 @@
#endif
#define NUM_FRAMES (4 * 1024)
#define BATCH_SIZE 64
#define MIN_PKT_SIZE 64
#define DEBUG_HEXDUMP 0
typedef __u64 u64;
typedef __u32 u32;
typedef __u16 u16;
typedef __u8 u8;
static unsigned long prev_time;
......@@ -65,6 +70,13 @@ static u32 opt_xdp_flags = XDP_FLAGS_UPDATE_IF_NOEXIST;
static const char *opt_if = "";
static int opt_ifindex;
static int opt_queue;
static unsigned long opt_duration;
static unsigned long start_time;
static bool benchmark_done;
static u32 opt_batch_size = 64;
static int opt_pkt_count;
static u16 opt_pkt_size = MIN_PKT_SIZE;
static u32 opt_pkt_fill_pattern = 0x12345678;
static int opt_poll;
static int opt_interval = 1;
static u32 opt_xdp_bind_flags = XDP_USE_NEED_WAKEUP;
......@@ -167,10 +179,21 @@ static void dump_stats(void)
}
}
static bool is_benchmark_done(void)
{
if (opt_duration > 0) {
unsigned long dt = (get_nsecs() - start_time);
if (dt >= opt_duration)
benchmark_done = true;
}
return benchmark_done;
}
static void *poller(void *arg)
{
(void)arg;
for (;;) {
while (!is_benchmark_done()) {
sleep(opt_interval);
dump_stats();
}
......@@ -195,6 +218,11 @@ static void remove_xdp_program(void)
}
static void int_exit(int sig)
{
benchmark_done = true;
}
static void xdpsock_cleanup(void)
{
struct xsk_umem *umem = xsks[0]->umem->umem;
int i;
......@@ -204,8 +232,6 @@ static void int_exit(int sig)
xsk_socket__delete(xsks[i]->xsk);
(void)xsk_umem__delete(umem);
remove_xdp_program();
exit(EXIT_SUCCESS);
}
static void __exit_with_error(int error, const char *file, const char *func,
......@@ -220,13 +246,6 @@ static void __exit_with_error(int error, const char *file, const char *func,
#define exit_with_error(error) __exit_with_error(error, __FILE__, __func__, \
__LINE__)
static const char pkt_data[] =
"\x3c\xfd\xfe\x9e\x7f\x71\xec\xb1\xd7\x98\x3a\xc0\x08\x00\x45\x00"
"\x00\x2e\x00\x00\x00\x00\x40\x11\x88\x97\x05\x08\x07\x08\xc8\x14"
"\x1e\x04\x10\x92\x10\x92\x00\x1a\x6d\xa3\x34\x33\x1f\x69\x40\x6b"
"\x54\x59\xb6\x14\x2d\x11\x44\xbf\xaf\xd9\xbe\xaa";
static void swap_mac_addresses(void *data)
{
struct ether_header *eth = (struct ether_header *)data;
......@@ -274,11 +293,243 @@ static void hex_dump(void *pkt, size_t length, u64 addr)
printf("\n");
}
static size_t gen_eth_frame(struct xsk_umem_info *umem, u64 addr)
static void *memset32_htonl(void *dest, u32 val, u32 size)
{
u32 *ptr = (u32 *)dest;
int i;
val = htonl(val);
for (i = 0; i < (size & (~0x3)); i += 4)
ptr[i >> 2] = val;
for (; i < size; i++)
((char *)dest)[i] = ((char *)&val)[i & 3];
return dest;
}
/*
* This function code has been taken from
* Linux kernel lib/checksum.c
*/
static inline unsigned short from32to16(unsigned int x)
{
/* add up 16-bit and 16-bit for 16+c bit */
x = (x & 0xffff) + (x >> 16);
/* add up carry.. */
x = (x & 0xffff) + (x >> 16);
return x;
}
/*
* This function code has been taken from
* Linux kernel lib/checksum.c
*/
static unsigned int do_csum(const unsigned char *buff, int len)
{
unsigned int result = 0;
int odd;
if (len <= 0)
goto out;
odd = 1 & (unsigned long)buff;
if (odd) {
#ifdef __LITTLE_ENDIAN
result += (*buff << 8);
#else
result = *buff;
#endif
len--;
buff++;
}
if (len >= 2) {
if (2 & (unsigned long)buff) {
result += *(unsigned short *)buff;
len -= 2;
buff += 2;
}
if (len >= 4) {
const unsigned char *end = buff +
((unsigned int)len & ~3);
unsigned int carry = 0;
do {
unsigned int w = *(unsigned int *)buff;
buff += 4;
result += carry;
result += w;
carry = (w > result);
} while (buff < end);
result += carry;
result = (result & 0xffff) + (result >> 16);
}
if (len & 2) {
result += *(unsigned short *)buff;
buff += 2;
}
}
if (len & 1)
#ifdef __LITTLE_ENDIAN
result += *buff;
#else
result += (*buff << 8);
#endif
result = from32to16(result);
if (odd)
result = ((result >> 8) & 0xff) | ((result & 0xff) << 8);
out:
return result;
}
__sum16 ip_fast_csum(const void *iph, unsigned int ihl);
/*
* This is a version of ip_compute_csum() optimized for IP headers,
* which always checksum on 4 octet boundaries.
* This function code has been taken from
* Linux kernel lib/checksum.c
*/
__sum16 ip_fast_csum(const void *iph, unsigned int ihl)
{
return (__force __sum16)~do_csum(iph, ihl * 4);
}
/*
* Fold a partial checksum
* This function code has been taken from
* Linux kernel include/asm-generic/checksum.h
*/
static inline __sum16 csum_fold(__wsum csum)
{
u32 sum = (__force u32)csum;
sum = (sum & 0xffff) + (sum >> 16);
sum = (sum & 0xffff) + (sum >> 16);
return (__force __sum16)~sum;
}
/*
* This function code has been taken from
* Linux kernel lib/checksum.c
*/
static inline u32 from64to32(u64 x)
{
/* add up 32-bit and 32-bit for 32+c bit */
x = (x & 0xffffffff) + (x >> 32);
/* add up carry.. */
x = (x & 0xffffffff) + (x >> 32);
return (u32)x;
}
__wsum csum_tcpudp_nofold(__be32 saddr, __be32 daddr,
__u32 len, __u8 proto, __wsum sum);
/*
* This function code has been taken from
* Linux kernel lib/checksum.c
*/
__wsum csum_tcpudp_nofold(__be32 saddr, __be32 daddr,
__u32 len, __u8 proto, __wsum sum)
{
unsigned long long s = (__force u32)sum;
s += (__force u32)saddr;
s += (__force u32)daddr;
#ifdef __BIG_ENDIAN__
s += proto + len;
#else
s += (proto + len) << 8;
#endif
return (__force __wsum)from64to32(s);
}
/*
* This function has been taken from
* Linux kernel include/asm-generic/checksum.h
*/
static inline __sum16
csum_tcpudp_magic(__be32 saddr, __be32 daddr, __u32 len,
__u8 proto, __wsum sum)
{
return csum_fold(csum_tcpudp_nofold(saddr, daddr, len, proto, sum));
}
static inline u16 udp_csum(u32 saddr, u32 daddr, u32 len,
u8 proto, u16 *udp_pkt)
{
u32 csum = 0;
u32 cnt = 0;
/* udp hdr and data */
for (; cnt < len; cnt += 2)
csum += udp_pkt[cnt >> 1];
return csum_tcpudp_magic(saddr, daddr, len, proto, csum);
}
#define ETH_FCS_SIZE 4
#define PKT_HDR_SIZE (sizeof(struct ethhdr) + sizeof(struct iphdr) + \
sizeof(struct udphdr))
#define PKT_SIZE (opt_pkt_size - ETH_FCS_SIZE)
#define IP_PKT_SIZE (PKT_SIZE - sizeof(struct ethhdr))
#define UDP_PKT_SIZE (IP_PKT_SIZE - sizeof(struct iphdr))
#define UDP_PKT_DATA_SIZE (UDP_PKT_SIZE - sizeof(struct udphdr))
static u8 pkt_data[XSK_UMEM__DEFAULT_FRAME_SIZE];
static void gen_eth_hdr_data(void)
{
struct udphdr *udp_hdr = (struct udphdr *)(pkt_data +
sizeof(struct ethhdr) +
sizeof(struct iphdr));
struct iphdr *ip_hdr = (struct iphdr *)(pkt_data +
sizeof(struct ethhdr));
struct ethhdr *eth_hdr = (struct ethhdr *)pkt_data;
/* ethernet header */
memcpy(eth_hdr->h_dest, "\x3c\xfd\xfe\x9e\x7f\x71", ETH_ALEN);
memcpy(eth_hdr->h_source, "\xec\xb1\xd7\x98\x3a\xc0", ETH_ALEN);
eth_hdr->h_proto = htons(ETH_P_IP);
/* IP header */
ip_hdr->version = IPVERSION;
ip_hdr->ihl = 0x5; /* 20 byte header */
ip_hdr->tos = 0x0;
ip_hdr->tot_len = htons(IP_PKT_SIZE);
ip_hdr->id = 0;
ip_hdr->frag_off = 0;
ip_hdr->ttl = IPDEFTTL;
ip_hdr->protocol = IPPROTO_UDP;
ip_hdr->saddr = htonl(0x0a0a0a10);
ip_hdr->daddr = htonl(0x0a0a0a20);
/* IP header checksum */
ip_hdr->check = 0;
ip_hdr->check = ip_fast_csum((const void *)ip_hdr, ip_hdr->ihl);
/* UDP header */
udp_hdr->source = htons(0x1000);
udp_hdr->dest = htons(0x1000);
udp_hdr->len = htons(UDP_PKT_SIZE);
/* UDP data */
memset32_htonl(pkt_data + PKT_HDR_SIZE, opt_pkt_fill_pattern,
UDP_PKT_DATA_SIZE);
/* UDP header checksum */
udp_hdr->check = 0;
udp_hdr->check = udp_csum(ip_hdr->saddr, ip_hdr->daddr, UDP_PKT_SIZE,
IPPROTO_UDP, (u16 *)udp_hdr);
}
static void gen_eth_frame(struct xsk_umem_info *umem, u64 addr)
{
memcpy(xsk_umem__get_data(umem->buffer, addr), pkt_data,
sizeof(pkt_data) - 1);
return sizeof(pkt_data) - 1;
PKT_SIZE);
}
static struct xsk_umem_info *xsk_configure_umem(void *buffer, u64 size)
......@@ -375,6 +626,11 @@ static struct option long_options[] = {
{"unaligned", no_argument, 0, 'u'},
{"shared-umem", no_argument, 0, 'M'},
{"force", no_argument, 0, 'F'},
{"duration", required_argument, 0, 'd'},
{"batch-size", required_argument, 0, 'b'},
{"tx-pkt-count", required_argument, 0, 'C'},
{"tx-pkt-size", required_argument, 0, 's'},
{"tx-pkt-pattern", required_argument, 0, 'P'},
{0, 0, 0, 0}
};
......@@ -399,8 +655,21 @@ static void usage(const char *prog)
" -u, --unaligned Enable unaligned chunk placement\n"
" -M, --shared-umem Enable XDP_SHARED_UMEM\n"
" -F, --force Force loading the XDP prog\n"
" -d, --duration=n Duration in secs to run command.\n"
" Default: forever.\n"
" -b, --batch-size=n Batch size for sending or receiving\n"
" packets. Default: %d\n"
" -C, --tx-pkt-count=n Number of packets to send.\n"
" Default: Continuous packets.\n"
" -s, --tx-pkt-size=n Transmit packet size.\n"
" (Default: %d bytes)\n"
" Min size: %d, Max size %d.\n"
" -P, --tx-pkt-pattern=nPacket fill pattern. Default: 0x%x\n"
"\n";
fprintf(stderr, str, prog, XSK_UMEM__DEFAULT_FRAME_SIZE);
fprintf(stderr, str, prog, XSK_UMEM__DEFAULT_FRAME_SIZE,
opt_batch_size, MIN_PKT_SIZE, MIN_PKT_SIZE,
XSK_UMEM__DEFAULT_FRAME_SIZE, opt_pkt_fill_pattern);
exit(EXIT_FAILURE);
}
......@@ -411,7 +680,7 @@ static void parse_command_line(int argc, char **argv)
opterr = 0;
for (;;) {
c = getopt_long(argc, argv, "Frtli:q:psSNn:czf:muM",
c = getopt_long(argc, argv, "Frtli:q:pSNn:czf:muMd:b:C:s:P:",
long_options, &option_index);
if (c == -1)
break;
......@@ -440,7 +709,7 @@ static void parse_command_line(int argc, char **argv)
opt_xdp_bind_flags |= XDP_COPY;
break;
case 'N':
opt_xdp_flags |= XDP_FLAGS_DRV_MODE;
/* default, set below */
break;
case 'n':
opt_interval = atoi(optarg);
......@@ -469,11 +738,37 @@ static void parse_command_line(int argc, char **argv)
case 'M':
opt_num_xsks = MAX_SOCKS;
break;
case 'd':
opt_duration = atoi(optarg);
opt_duration *= 1000000000;
break;
case 'b':
opt_batch_size = atoi(optarg);
break;
case 'C':
opt_pkt_count = atoi(optarg);
break;
case 's':
opt_pkt_size = atoi(optarg);
if (opt_pkt_size > (XSK_UMEM__DEFAULT_FRAME_SIZE) ||
opt_pkt_size < MIN_PKT_SIZE) {
fprintf(stderr,
"ERROR: Invalid frame size %d\n",
opt_pkt_size);
usage(basename(argv[0]));
}
break;
case 'P':
opt_pkt_fill_pattern = strtol(optarg, NULL, 16);
break;
default:
usage(basename(argv[0]));
}
}
if (!(opt_xdp_flags & XDP_FLAGS_SKB_MODE))
opt_xdp_flags |= XDP_FLAGS_DRV_MODE;
opt_ifindex = if_nametoindex(opt_if);
if (!opt_ifindex) {
fprintf(stderr, "ERROR: interface \"%s\" does not exist\n",
......@@ -513,7 +808,7 @@ static inline void complete_tx_l2fwd(struct xsk_socket_info *xsk,
if (!opt_need_wakeup || xsk_ring_prod__needs_wakeup(&xsk->tx))
kick_tx(xsk);
ndescs = (xsk->outstanding_tx > BATCH_SIZE) ? BATCH_SIZE :
ndescs = (xsk->outstanding_tx > opt_batch_size) ? opt_batch_size :
xsk->outstanding_tx;
/* re-add completed Tx buffers */
......@@ -542,7 +837,8 @@ static inline void complete_tx_l2fwd(struct xsk_socket_info *xsk,
}
}
static inline void complete_tx_only(struct xsk_socket_info *xsk)
static inline void complete_tx_only(struct xsk_socket_info *xsk,
int batch_size)
{
unsigned int rcvd;
u32 idx;
......@@ -553,7 +849,7 @@ static inline void complete_tx_only(struct xsk_socket_info *xsk)
if (!opt_need_wakeup || xsk_ring_prod__needs_wakeup(&xsk->tx))
kick_tx(xsk);
rcvd = xsk_ring_cons__peek(&xsk->umem->cq, BATCH_SIZE, &idx);
rcvd = xsk_ring_cons__peek(&xsk->umem->cq, batch_size, &idx);
if (rcvd > 0) {
xsk_ring_cons__release(&xsk->umem->cq, rcvd);
xsk->outstanding_tx -= rcvd;
......@@ -567,7 +863,7 @@ static void rx_drop(struct xsk_socket_info *xsk, struct pollfd *fds)
u32 idx_rx = 0, idx_fq = 0;
int ret;
rcvd = xsk_ring_cons__peek(&xsk->rx, BATCH_SIZE, &idx_rx);
rcvd = xsk_ring_cons__peek(&xsk->rx, opt_batch_size, &idx_rx);
if (!rcvd) {
if (xsk_ring_prod__needs_wakeup(&xsk->umem->fq))
ret = poll(fds, num_socks, opt_timeout);
......@@ -619,36 +915,68 @@ static void rx_drop_all(void)
for (i = 0; i < num_socks; i++)
rx_drop(xsks[i], fds);
if (benchmark_done)
break;
}
}
static void tx_only(struct xsk_socket_info *xsk, u32 frame_nb)
static void tx_only(struct xsk_socket_info *xsk, u32 frame_nb, int batch_size)
{
u32 idx;
unsigned int i;
if (xsk_ring_prod__reserve(&xsk->tx, BATCH_SIZE, &idx) == BATCH_SIZE) {
unsigned int i;
for (i = 0; i < BATCH_SIZE; i++) {
xsk_ring_prod__tx_desc(&xsk->tx, idx + i)->addr =
(frame_nb + i) << XSK_UMEM__DEFAULT_FRAME_SHIFT;
xsk_ring_prod__tx_desc(&xsk->tx, idx + i)->len =
sizeof(pkt_data) - 1;
}
while (xsk_ring_prod__reserve(&xsk->tx, batch_size, &idx) <
batch_size) {
complete_tx_only(xsk, batch_size);
}
xsk_ring_prod__submit(&xsk->tx, BATCH_SIZE);
xsk->outstanding_tx += BATCH_SIZE;
frame_nb += BATCH_SIZE;
frame_nb %= NUM_FRAMES;
for (i = 0; i < batch_size; i++) {
struct xdp_desc *tx_desc = xsk_ring_prod__tx_desc(&xsk->tx,
idx + i);
tx_desc->addr = (frame_nb + i) << XSK_UMEM__DEFAULT_FRAME_SHIFT;
tx_desc->len = PKT_SIZE;
}
complete_tx_only(xsk);
xsk_ring_prod__submit(&xsk->tx, batch_size);
xsk->outstanding_tx += batch_size;
frame_nb += batch_size;
frame_nb %= NUM_FRAMES;
complete_tx_only(xsk, batch_size);
}
static inline int get_batch_size(int pkt_cnt)
{
if (!opt_pkt_count)
return opt_batch_size;
if (pkt_cnt + opt_batch_size <= opt_pkt_count)
return opt_batch_size;
return opt_pkt_count - pkt_cnt;
}
static void complete_tx_only_all(void)
{
bool pending;
int i;
do {
pending = false;
for (i = 0; i < num_socks; i++) {
if (xsks[i]->outstanding_tx) {
complete_tx_only(xsks[i], opt_batch_size);
pending = !!xsks[i]->outstanding_tx;
}
}
} while (pending);
}
static void tx_only_all(void)
{
struct pollfd fds[MAX_SOCKS] = {};
u32 frame_nb[MAX_SOCKS] = {};
int pkt_cnt = 0;
int i, ret;
for (i = 0; i < num_socks; i++) {
......@@ -656,7 +984,9 @@ static void tx_only_all(void)
fds[0].events = POLLOUT;
}
for (;;) {
while ((opt_pkt_count && pkt_cnt < opt_pkt_count) || !opt_pkt_count) {
int batch_size = get_batch_size(pkt_cnt);
if (opt_poll) {
ret = poll(fds, num_socks, opt_timeout);
if (ret <= 0)
......@@ -667,8 +997,16 @@ static void tx_only_all(void)
}
for (i = 0; i < num_socks; i++)
tx_only(xsks[i], frame_nb[i]);
tx_only(xsks[i], frame_nb[i], batch_size);
pkt_cnt += batch_size;
if (benchmark_done)
break;
}
if (opt_pkt_count)
complete_tx_only_all();
}
static void l2fwd(struct xsk_socket_info *xsk, struct pollfd *fds)
......@@ -679,7 +1017,7 @@ static void l2fwd(struct xsk_socket_info *xsk, struct pollfd *fds)
complete_tx_l2fwd(xsk, fds);
rcvd = xsk_ring_cons__peek(&xsk->rx, BATCH_SIZE, &idx_rx);
rcvd = xsk_ring_cons__peek(&xsk->rx, opt_batch_size, &idx_rx);
if (!rcvd) {
if (xsk_ring_prod__needs_wakeup(&xsk->umem->fq))
ret = poll(fds, num_socks, opt_timeout);
......@@ -736,6 +1074,9 @@ static void l2fwd_all(void)
for (i = 0; i < num_socks; i++)
l2fwd(xsks[i], fds);
if (benchmark_done)
break;
}
}
......@@ -831,9 +1172,12 @@ int main(int argc, char **argv)
for (i = 0; i < opt_num_xsks; i++)
xsks[num_socks++] = xsk_configure_socket(umem, rx, tx);
if (opt_bench == BENCH_TXONLY)
if (opt_bench == BENCH_TXONLY) {
gen_eth_hdr_data();
for (i = 0; i < NUM_FRAMES; i++)
gen_eth_frame(umem, i * opt_xsk_frame_size);
}
if (opt_num_xsks > 1 && opt_bench != BENCH_TXONLY)
enter_xsks_into_map(obj);
......@@ -849,6 +1193,7 @@ int main(int argc, char **argv)
exit_with_error(ret);
prev_time = get_nsecs();
start_time = prev_time;
if (opt_bench == BENCH_RXDROP)
rx_drop_all();
......@@ -857,5 +1202,11 @@ int main(int argc, char **argv)
else
l2fwd_all();
benchmark_done = true;
pthread_join(pt, NULL);
xdpsock_cleanup();
return 0;
}
================
bpftool-gen
================
-------------------------------------------------------------------------------
tool for BPF code-generation
-------------------------------------------------------------------------------
:Manual section: 8
SYNOPSIS
========
**bpftool** [*OPTIONS*] **gen** *COMMAND*
*OPTIONS* := { { **-j** | **--json** } [{ **-p** | **--pretty** }] }
*COMMAND* := { **skeleton | **help** }
GEN COMMANDS
=============
| **bpftool** **gen skeleton** *FILE*
| **bpftool** **gen help**
DESCRIPTION
===========
**bpftool gen skeleton** *FILE*
Generate BPF skeleton C header file for a given *FILE*.
BPF skeleton is an alternative interface to existing libbpf
APIs for working with BPF objects. Skeleton code is intended
to significantly shorten and simplify code to load and work
with BPF programs from userspace side. Generated code is
tailored to specific input BPF object *FILE*, reflecting its
structure by listing out available maps, program, variables,
etc. Skeleton eliminates the need to lookup mentioned
components by name. Instead, if skeleton instantiation
succeeds, they are populated in skeleton structure as valid
libbpf types (e.g., struct bpf_map pointer) and can be
passed to existing generic libbpf APIs.
In addition to simple and reliable access to maps and
programs, skeleton provides a storage for BPF links (struct
bpf_link) for each BPF program within BPF object. When
requested, supported BPF programs will be automatically
attached and resulting BPF links stored for further use by
user in pre-allocated fields in skeleton struct. For BPF
programs that can't be automatically attached by libbpf,
user can attach them manually, but store resulting BPF link
in per-program link field. All such set up links will be
automatically destroyed on BPF skeleton destruction. This
eliminates the need for users to manage links manually and
rely on libbpf support to detach programs and free up
resources.
Another facility provided by BPF skeleton is an interface to
global variables of all supported kinds: mutable, read-only,
as well as extern ones. This interface allows to pre-setup
initial values of variables before BPF object is loaded and
verified by kernel. For non-read-only variables, the same
interface can be used to fetch values of global variables on
userspace side, even if they are modified by BPF code.
During skeleton generation, contents of source BPF object
*FILE* is embedded within generated code and is thus not
necessary to keep around. This ensures skeleton and BPF
object file are matching 1-to-1 and always stay in sync.
Generated code is dual-licensed under LGPL-2.1 and
BSD-2-Clause licenses.
It is a design goal and guarantee that skeleton interfaces
are interoperable with generic libbpf APIs. User should
always be able to use skeleton API to create and load BPF
object, and later use libbpf APIs to keep working with
specific maps, programs, etc.
As part of skeleton, few custom functions are generated.
Each of them is prefixed with object name, derived from
object file name. I.e., if BPF object file name is
**example.o**, BPF object name will be **example**. The
following custom functions are provided in such case:
- **example__open** and **example__open_opts**.
These functions are used to instantiate skeleton. It
corresponds to libbpf's **bpf_object__open()** API.
**_opts** variants accepts extra **bpf_object_open_opts**
options.
- **example__load**.
This function creates maps, loads and verifies BPF
programs, initializes global data maps. It corresponds to
libppf's **bpf_object__load** API.
- **example__open_and_load** combines **example__open** and
**example__load** invocations in one commonly used
operation.
- **example__attach** and **example__detach**
This pair of functions allow to attach and detach,
correspondingly, already loaded BPF object. Only BPF
programs of types supported by libbpf for auto-attachment
will be auto-attached and their corresponding BPF links
instantiated. For other BPF programs, user can manually
create a BPF link and assign it to corresponding fields in
skeleton struct. **example__detach** will detach both
links created automatically, as well as those populated by
user manually.
- **example__destroy**
Detach and unload BPF programs, free up all the resources
used by skeleton and BPF object.
If BPF object has global variables, corresponding structs
with memory layout corresponding to global data data section
layout will be created. Currently supported ones are: *.data*,
*.bss*, *.rodata*, and *.kconfig* structs/data sections.
These data sections/structs can be used to set up initial
values of variables, if set before **example__load**.
Afterwards, if target kernel supports memory-mapped BPF
arrays, same structs can be used to fetch and update
(non-read-only) data from userspace, with same simplicity
as for BPF side.
**bpftool gen help**
Print short help message.
OPTIONS
=======
-h, --help
Print short generic help message (similar to **bpftool help**).
-V, --version
Print version number (similar to **bpftool version**).
-j, --json
Generate JSON output. For commands that cannot produce JSON,
this option has no effect.
-p, --pretty
Generate human-readable JSON output. Implies **-j**.
-d, --debug
Print all logs available from libbpf, including debug-level
information.
EXAMPLES
========
**$ cat example.c**
::
#include <stdbool.h>
#include <linux/ptrace.h>
#include <linux/bpf.h>
#include "bpf_helpers.h"
const volatile int param1 = 42;
bool global_flag = true;
struct { int x; } data = {};
struct {
__uint(type, BPF_MAP_TYPE_HASH);
__uint(max_entries, 128);
__type(key, int);
__type(value, long);
} my_map SEC(".maps");
SEC("raw_tp/sys_enter")
int handle_sys_enter(struct pt_regs *ctx)
{
static long my_static_var;
if (global_flag)
my_static_var++;
else
data.x += param1;
return 0;
}
SEC("raw_tp/sys_exit")
int handle_sys_exit(struct pt_regs *ctx)
{
int zero = 0;
bpf_map_lookup_elem(&my_map, &zero);
return 0;
}
This is example BPF application with two BPF programs and a mix of BPF maps
and global variables.
**$ bpftool gen skeleton example.o**
::
/* SPDX-License-Identifier: (LGPL-2.1 OR BSD-2-Clause) */
/* THIS FILE IS AUTOGENERATED! */
#ifndef __EXAMPLE_SKEL_H__
#define __EXAMPLE_SKEL_H__
#include <stdlib.h>
#include <libbpf.h>
struct example {
struct bpf_object_skeleton *skeleton;
struct bpf_object *obj;
struct {
struct bpf_map *rodata;
struct bpf_map *data;
struct bpf_map *bss;
struct bpf_map *my_map;
} maps;
struct {
struct bpf_program *handle_sys_enter;
struct bpf_program *handle_sys_exit;
} progs;
struct {
struct bpf_link *handle_sys_enter;
struct bpf_link *handle_sys_exit;
} links;
struct example__bss {
struct {
int x;
} data;
} *bss;
struct example__data {
_Bool global_flag;
long int handle_sys_enter_my_static_var;
} *data;
struct example__rodata {
int param1;
} *rodata;
};
static void example__destroy(struct example *obj);
static inline struct example *example__open_opts(
const struct bpf_object_open_opts *opts);
static inline struct example *example__open();
static inline int example__load(struct example *obj);
static inline struct example *example__open_and_load();
static inline int example__attach(struct example *obj);
static inline void example__detach(struct example *obj);
#endif /* __EXAMPLE_SKEL_H__ */
**$ cat example_user.c**
::
#include "example.skel.h"
int main()
{
struct example *skel;
int err = 0;
skel = example__open();
if (!skel)
goto cleanup;
skel->rodata->param1 = 128;
err = example__load(skel);
if (err)
goto cleanup;
err = example__attach(skel);
if (err)
goto cleanup;
/* all libbpf APIs are usable */
printf("my_map name: %s\n", bpf_map__name(skel->maps.my_map));
printf("sys_enter prog FD: %d\n",
bpf_program__fd(skel->progs.handle_sys_enter));
/* detach and re-attach sys_exit program */
bpf_link__destroy(skel->links.handle_sys_exit);
skel->links.handle_sys_exit =
bpf_program__attach(skel->progs.handle_sys_exit);
printf("my_static_var: %ld\n",
skel->bss->handle_sys_enter_my_static_var);
cleanup:
example__destroy(skel);
return err;
}
**# ./example_user**
::
my_map name: my_map
sys_enter prog FD: 8
my_static_var: 7
This is a stripped-out version of skeleton generated for above example code.
SEE ALSO
========
**bpf**\ (2),
**bpf-helpers**\ (7),
**bpftool**\ (8),
**bpftool-map**\ (8),
**bpftool-prog**\ (8),
**bpftool-cgroup**\ (8),
**bpftool-feature**\ (8),
**bpftool-net**\ (8),
**bpftool-perf**\ (8),
**bpftool-btf**\ (8)
......@@ -39,9 +39,9 @@ MAP COMMANDS
| **bpftool** **map freeze** *MAP*
| **bpftool** **map help**
|
| *MAP* := { **id** *MAP_ID* | **pinned** *FILE* }
| *MAP* := { **id** *MAP_ID* | **pinned** *FILE* | **name** *MAP_NAME* }
| *DATA* := { [**hex**] *BYTES* }
| *PROG* := { **id** *PROG_ID* | **pinned** *FILE* | **tag** *PROG_TAG* }
| *PROG* := { **id** *PROG_ID* | **pinned** *FILE* | **tag** *PROG_TAG* | **name** *PROG_NAME* }
| *VALUE* := { *DATA* | *MAP* | *PROG* }
| *UPDATE_FLAGS* := { **any** | **exist** | **noexist** }
| *TYPE* := { **hash** | **array** | **prog_array** | **perf_event_array** | **percpu_hash**
......@@ -55,8 +55,9 @@ DESCRIPTION
===========
**bpftool map { show | list }** [*MAP*]
Show information about loaded maps. If *MAP* is specified
show information only about given map, otherwise list all
maps currently loaded on the system.
show information only about given maps, otherwise list all
maps currently loaded on the system. In case of **name**,
*MAP* may match several maps which will all be shown.
Output will start with map ID followed by map type and
zero or more named attributes (depending on kernel version).
......@@ -66,7 +67,8 @@ DESCRIPTION
as *FILE*.
**bpftool map dump** *MAP*
Dump all entries in a given *MAP*.
Dump all entries in a given *MAP*. In case of **name**,
*MAP* may match several maps which will all be dumped.
**bpftool map update** *MAP* [**key** *DATA*] [**value** *VALUE*] [*UPDATE_FLAGS*]
Update map entry for a given *KEY*.
......
......@@ -33,7 +33,7 @@ PROG COMMANDS
| **bpftool** **prog help**
|
| *MAP* := { **id** *MAP_ID* | **pinned** *FILE* }
| *PROG* := { **id** *PROG_ID* | **pinned** *FILE* | **tag** *PROG_TAG* }
| *PROG* := { **id** *PROG_ID* | **pinned** *FILE* | **tag** *PROG_TAG* | **name** *PROG_NAME* }
| *TYPE* := {
| **socket** | **kprobe** | **kretprobe** | **classifier** | **action** |
| **tracepoint** | **raw_tracepoint** | **xdp** | **perf_event** | **cgroup/skb** |
......@@ -53,8 +53,10 @@ DESCRIPTION
===========
**bpftool prog { show | list }** [*PROG*]
Show information about loaded programs. If *PROG* is
specified show information only about given program, otherwise
list all programs currently loaded on the system.
specified show information only about given programs,
otherwise list all programs currently loaded on the system.
In case of **tag** or **name**, *PROG* may match several
programs which will all be shown.
Output will start with program ID followed by program type and
zero or more named attributes (depending on kernel version).
......@@ -68,11 +70,15 @@ DESCRIPTION
performed via the **kernel.bpf_stats_enabled** sysctl knob.
**bpftool prog dump xlated** *PROG* [{ **file** *FILE* | **opcodes** | **visual** | **linum** }]
Dump eBPF instructions of the program from the kernel. By
Dump eBPF instructions of the programs from the kernel. By
default, eBPF will be disassembled and printed to standard
output in human-readable format. In this case, **opcodes**
controls if raw opcodes should be printed as well.
In case of **tag** or **name**, *PROG* may match several
programs which will all be dumped. However, if **file** or
**visual** is specified, *PROG* must match a single program.
If **file** is specified, the binary image will instead be
written to *FILE*.
......@@ -80,15 +86,17 @@ DESCRIPTION
built instead, and eBPF instructions will be presented with
CFG in DOT format, on standard output.
If the prog has line_info available, the source line will
If the programs have line_info available, the source line will
be displayed by default. If **linum** is specified,
the filename, line number and line column will also be
displayed on top of the source line.
**bpftool prog dump jited** *PROG* [{ **file** *FILE* | **opcodes** | **linum** }]
Dump jited image (host machine code) of the program.
If *FILE* is specified image will be written to a file,
otherwise it will be disassembled and printed to stdout.
*PROG* must match a single program when **file** is specified.
**opcodes** controls if raw opcodes will be printed.
......
......@@ -81,4 +81,5 @@ SEE ALSO
**bpftool-feature**\ (8),
**bpftool-net**\ (8),
**bpftool-perf**\ (8),
**bpftool-btf**\ (8)
**bpftool-btf**\ (8),
**bpftool-gen**\ (8),
......@@ -59,6 +59,21 @@ _bpftool_get_map_ids_for_type()
command sed -n 's/.*"id": \(.*\),$/\1/p' )" -- "$cur" ) )
}
_bpftool_get_map_names()
{
COMPREPLY+=( $( compgen -W "$( bpftool -jp map 2>&1 | \
command sed -n 's/.*"name": \(.*\),$/\1/p' )" -- "$cur" ) )
}
# Takes map type and adds matching map names to the list of suggestions.
_bpftool_get_map_names_for_type()
{
local type="$1"
COMPREPLY+=( $( compgen -W "$( bpftool -jp map 2>&1 | \
command grep -C2 "$type" | \
command sed -n 's/.*"name": \(.*\),$/\1/p' )" -- "$cur" ) )
}
_bpftool_get_prog_ids()
{
COMPREPLY+=( $( compgen -W "$( bpftool -jp prog 2>&1 | \
......@@ -71,6 +86,12 @@ _bpftool_get_prog_tags()
command sed -n 's/.*"tag": "\(.*\)",$/\1/p' )" -- "$cur" ) )
}
_bpftool_get_prog_names()
{
COMPREPLY+=( $( compgen -W "$( bpftool -jp prog 2>&1 | \
command sed -n 's/.*"name": "\(.*\)",$/\1/p' )" -- "$cur" ) )
}
_bpftool_get_btf_ids()
{
COMPREPLY+=( $( compgen -W "$( bpftool -jp btf 2>&1 | \
......@@ -180,6 +201,52 @@ _bpftool_map_update_get_id()
esac
}
_bpftool_map_update_get_name()
{
local command="$1"
# Is it the map to update, or a map to insert into the map to update?
# Search for "value" keyword.
local idx value
for (( idx=7; idx < ${#words[@]}-1; idx++ )); do
if [[ ${words[idx]} == "value" ]]; then
value=1
break
fi
done
if [[ $value -eq 0 ]]; then
case "$command" in
push)
_bpftool_get_map_names_for_type stack
;;
enqueue)
_bpftool_get_map_names_for_type queue
;;
*)
_bpftool_get_map_names
;;
esac
return 0
fi
# Name to complete is for a value. It can be either prog name or map name. This
# depends on the type of the map to update.
local type=$(_bpftool_map_guess_map_type)
case $type in
array_of_maps|hash_of_maps)
_bpftool_get_map_names
return 0
;;
prog_array)
_bpftool_get_prog_names
return 0
;;
*)
return 0
;;
esac
}
_bpftool()
{
local cur prev words objword
......@@ -251,7 +318,8 @@ _bpftool()
# Completion depends on object and command in use
case $object in
prog)
# Complete id, only for subcommands that use prog (but no map) ids
# Complete id and name, only for subcommands that use prog (but no
# map) ids/names.
case $command in
show|list|dump|pin)
case $prev in
......@@ -259,12 +327,16 @@ _bpftool()
_bpftool_get_prog_ids
return 0
;;
name)
_bpftool_get_prog_names
return 0
;;
esac
;;
esac
local PROG_TYPE='id pinned tag'
local MAP_TYPE='id pinned'
local PROG_TYPE='id pinned tag name'
local MAP_TYPE='id pinned name'
case $command in
show|list)
[[ $prev != "$command" ]] && return 0
......@@ -315,6 +387,9 @@ _bpftool()
id)
_bpftool_get_prog_ids
;;
name)
_bpftool_get_map_names
;;
pinned)
_filedir
;;
......@@ -335,6 +410,9 @@ _bpftool()
id)
_bpftool_get_map_ids
;;
name)
_bpftool_get_map_names
;;
pinned)
_filedir
;;
......@@ -399,6 +477,10 @@ _bpftool()
_bpftool_get_map_ids
return 0
;;
name)
_bpftool_get_map_names
return 0
;;
pinned|pinmaps)
_filedir
return 0
......@@ -447,7 +529,7 @@ _bpftool()
esac
;;
map)
local MAP_TYPE='id pinned'
local MAP_TYPE='id pinned name'
case $command in
show|list|dump|peek|pop|dequeue|freeze)
case $prev in
......@@ -473,6 +555,24 @@ _bpftool()
esac
return 0
;;
name)
case "$command" in
peek)
_bpftool_get_map_names_for_type stack
_bpftool_get_map_names_for_type queue
;;
pop)
_bpftool_get_map_names_for_type stack
;;
dequeue)
_bpftool_get_map_names_for_type queue
;;
*)
_bpftool_get_map_names
;;
esac
return 0
;;
*)
return 0
;;
......@@ -520,6 +620,10 @@ _bpftool()
_bpftool_get_map_ids
return 0
;;
name)
_bpftool_get_map_names
return 0
;;
key)
COMPREPLY+=( $( compgen -W 'hex' -- "$cur" ) )
;;
......@@ -545,6 +649,10 @@ _bpftool()
_bpftool_map_update_get_id $command
return 0
;;
name)
_bpftool_map_update_get_name $command
return 0
;;
key)
COMPREPLY+=( $( compgen -W 'hex' -- "$cur" ) )
;;
......@@ -553,13 +661,13 @@ _bpftool()
# map, depending on the type of the map to update.
case "$(_bpftool_map_guess_map_type)" in
array_of_maps|hash_of_maps)
local MAP_TYPE='id pinned'
local MAP_TYPE='id pinned name'
COMPREPLY+=( $( compgen -W "$MAP_TYPE" \
-- "$cur" ) )
return 0
;;
prog_array)
local PROG_TYPE='id pinned tag'
local PROG_TYPE='id pinned tag name'
COMPREPLY+=( $( compgen -W "$PROG_TYPE" \
-- "$cur" ) )
return 0
......@@ -621,6 +729,10 @@ _bpftool()
_bpftool_get_map_ids_for_type perf_event_array
return 0
;;
name)
_bpftool_get_map_names_for_type perf_event_array
return 0
;;
cpu)
return 0
;;
......@@ -644,8 +756,8 @@ _bpftool()
esac
;;
btf)
local PROG_TYPE='id pinned tag'
local MAP_TYPE='id pinned'
local PROG_TYPE='id pinned tag name'
local MAP_TYPE='id pinned name'
case $command in
dump)
case $prev in
......@@ -676,6 +788,17 @@ _bpftool()
esac
return 0
;;
name)
case $pprev in
prog)
_bpftool_get_prog_names
;;
map)
_bpftool_get_map_names
;;
esac
return 0
;;
format)
COMPREPLY=( $( compgen -W "c raw" -- "$cur" ) )
;;
......@@ -716,6 +839,17 @@ _bpftool()
;;
esac
;;
gen)
case $command in
skeleton)
_filedir
;;
*)
[[ $prev == $object ]] && \
COMPREPLY=( $( compgen -W 'skeleton help' -- "$cur" ) )
;;
esac
;;
cgroup)
case $command in
show|list|tree)
......@@ -735,7 +869,7 @@ _bpftool()
connect6 sendmsg4 sendmsg6 recvmsg4 recvmsg6 sysctl \
getsockopt setsockopt'
local ATTACH_FLAGS='multi override'
local PROG_TYPE='id pinned tag'
local PROG_TYPE='id pinned tag name'
case $prev in
$command)
_filedir
......@@ -760,7 +894,7 @@ _bpftool()
elif [[ "$command" == "attach" ]]; then
# We have an attach type on the command line,
# but it is not the previous word, or
# "id|pinned|tag" (we already checked for
# "id|pinned|tag|name" (we already checked for
# that). This should only leave the case when
# we need attach flags for "attach" commamnd.
_bpftool_one_of_list "$ATTACH_FLAGS"
......@@ -786,7 +920,7 @@ _bpftool()
esac
;;
net)
local PROG_TYPE='id pinned tag'
local PROG_TYPE='id pinned tag name'
local ATTACH_TYPES='xdp xdpgeneric xdpdrv xdpoffload'
case $command in
show|list)
......
此差异已折叠。
此差异已折叠。
......@@ -58,7 +58,7 @@ static int do_help(int argc, char **argv)
" %s batch file FILE\n"
" %s version\n"
"\n"
" OBJECT := { prog | map | cgroup | perf | net | feature | btf }\n"
" OBJECT := { prog | map | cgroup | perf | net | feature | btf | gen }\n"
" " HELP_SPEC_OPTIONS "\n"
"",
bin_name, bin_name, bin_name);
......@@ -227,6 +227,7 @@ static const struct cmd cmds[] = {
{ "net", do_net },
{ "feature", do_feature },
{ "btf", do_btf },
{ "gen", do_gen },
{ "version", do_version },
{ 0 }
};
......
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
Markdown is supported
0% .
You are about to add 0 people to the discussion. Proceed with caution.
先完成此消息的编辑!
想要评论请 注册