提交 · 2929ad29a3015120ab64c86f362fc2a590ceb69c · openeuler / Kernel

20 10月, 2018 12 次提交

Merge branch 'improve_perf_barriers' · 2929ad29

由 Alexei Starovoitov 提交于 10月 19, 2018

Daniel Borkmann says:

====================
This set first adds smp_* barrier variants to tools infrastructure
and updates perf and libbpf to make use of them. For details, please
see individual patches, thanks!

Arnaldo, if there are no objections, could this be routed via bpf-next
with Acked-by's due to later dependencies in libbpf? Alternatively,
I could also get the 2nd patch out during merge window, but perhaps
it's okay to do in one go as there shouldn't be much conflict in perf
itself.

Thanks!

v1 -> v2:
  - add common helper and switch to acquire/release variants
    when possible, thanks Peter!
====================
Signed-off-by: NAlexei Starovoitov <ast@kernel.org>

2929ad29

bpf, libbpf: use correct barriers in perf ring buffer walk · a64af0ef

由 Daniel Borkmann 提交于 10月 19, 2018

Given libbpf is a generic library and not restricted to x86-64 only,
the compiler barrier in bpf_perf_event_read_simple() after fetching
the head needs to be replaced with smp_rmb() at minimum. Also, writing
out the tail we should use WRITE_ONCE() to avoid store tearing.

Now that we have the logic in place in ring_buffer_read_head() and
ring_buffer_write_tail() helper also used by perf tool which would
select the correct and best variant for a given architecture (e.g.
x86-64 can avoid CPU barriers entirely), make use of these in order
to fix bpf_perf_event_read_simple().

Fixes: d0cabbb0 ("tools: bpf: move the event reading loop to libbpf")
Fixes: 39111695 ("samples: bpf: add bpf_perf_event_output example")
Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: NAlexei Starovoitov <ast@kernel.org>

a64af0ef

tools, perf: add and use optimized ring_buffer_{read_head, write_tail} helpers · 09d62154

由 Daniel Borkmann 提交于 10月 19, 2018

Currently, on x86-64, perf uses LFENCE and MFENCE (rmb() and mb(),
respectively) when processing events from the perf ring buffer which
is unnecessarily expensive as we can do more lightweight in particular
given this is critical fast-path in perf.

According to Peter rmb()/mb() were added back then via a94d342b
("tools/perf: Add required memory barriers") at a time where kernel
still supported chips that needed it, but nowadays support for these
has been ditched completely, therefore we can fix them up as well.

While for x86-64, replacing rmb() and mb() with smp_*() variants would
result in just a compiler barrier for the former and LOCK + ADD for
the latter (__sync_synchronize() uses slower MFENCE by the way), Peter
suggested we can use smp_{load_acquire,store_release}() instead for
architectures where its implementation doesn't resolve in slower smp_mb().
Thus, e.g. in x86-64 we would be able to avoid CPU barrier entirely due
to TSO. For architectures where the latter needs to use smp_mb() e.g.
on arm, we stick to cheaper smp_rmb() variant for fetching the head.

This work adds helpers ring_buffer_read_head() and ring_buffer_write_tail()
for tools infrastructure that either switches to smp_load_acquire() for
architectures where it is cheaper or uses READ_ONCE() + smp_rmb() barrier
for those where it's not in order to fetch the data_head from the perf
control page, and it uses smp_store_release() to write the data_tail.
Latter is smp_mb() + WRITE_ONCE() combination or a cheaper variant if
architecture allows for it. Those that rely on smp_rmb() and smp_mb() can
further improve performance in a follow up step by implementing the two
under tools/arch/*/include/asm/barrier.h such that they don't have to
fallback to rmb() and mb() in tools/include/asm/barrier.h.

Switch perf to use ring_buffer_read_head() and ring_buffer_write_tail()
so it can make use of the optimizations. Later, we convert libbpf as
well to use the same helpers.

Side note [0]: the topic has been raised of whether one could simply use
the C11 gcc builtins [1] for the smp_load_acquire() and smp_store_release()
instead:

  __atomic_load_n(ptr, __ATOMIC_ACQUIRE);
  __atomic_store_n(ptr, val, __ATOMIC_RELEASE);

Kernel and (presumably) tooling shipped along with the kernel has a
minimum requirement of being able to build with gcc-4.6 and the latter
does not have C11 builtins. While generally the C11 memory models don't
align with the kernel's, the C11 load-acquire and store-release alone
/could/ suffice, however. Issue is that this is implementation dependent
on how the load-acquire and store-release is done by the compiler and
the mapping of supported compilers must align to be compatible with the
kernel's implementation, and thus needs to be verified/tracked on a
case by case basis whether they match (unless an architecture uses them
also from kernel side). The implementations for smp_load_acquire() and
smp_store_release() in this patch have been adapted from the kernel side
ones to have a concrete and compatible mapping in place.

  [0] http://patchwork.ozlabs.org/patch/985422/
  [1] https://gcc.gnu.org/onlinedocs/gcc/_005f_005fatomic-Builtins.htmlSigned-off-by: NDaniel Borkmann <daniel@iogearbox.net>
Acked-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: NAlexei Starovoitov <ast@kernel.org>

09d62154

selftests/bpf: add missing executables to .gitignore · 78de3546

由 Anders Roxell 提交于 10月 19, 2018

Fixes: 371e4fcc ("selftests/bpf: cgroup local storage-based network counters")
Fixes: 370920c4 ("selftests/bpf: Test libbpf_{prog,attach}_type_by_name")
Signed-off-by: NAnders Roxell <anders.roxell@linaro.org>
Acked-by: NYonghong Song <yhs@fb.com>
Signed-off-by: NAlexei Starovoitov <ast@kernel.org>

78de3546

Merge branch 'queue_stack_maps' · 43ed375f

由 Alexei Starovoitov 提交于 10月 19, 2018

Mauricio Vasquez says:

====================
In some applications this is needed have a pool of free elements, for
example the list of free L4 ports in a SNAT.  None of the current maps allow
to do it as it is not possible to get any element without having they key
it is associated to, even if it were possible, the lack of locking mecanishms in
eBPF would do it almost impossible to be implemented without data races.

This patchset implements two new kind of eBPF maps: queue and stack.
Those maps provide to eBPF programs the peek, push and pop operations, and for
userspace applications a new bpf_map_lookup_and_delete_elem() is added.
Signed-off-by: NMauricio Vasquez B <mauricio.vasquez@polito.it>

v2 -> v3:
 - Remove "almost dead code" in syscall.c
 - Remove unnecessary copy_from_user in bpf_map_lookup_and_delete_elem
 - Rebase

v1 -> v2:
 - Put ARG_PTR_TO_UNINIT_MAP_VALUE logic into a separated patch
 - Fix missing __this_cpu_dec & preempt_enable calls in kernel/bpf/syscall.c

RFC v4 -> v1:
 - Remove roundup to power of 2 in memory allocation
 - Remove count and use a free slot to check if queue/stack is empty
 - Use if + assigment for wrapping indexes
 - Fix some minor style issues
 - Squash two patches together

RFC v3 -> RFC v4:
 - Revert renaming of kernel/bpf/stackmap.c
 - Remove restriction on value size
 - Remove len arguments from peek/pop helpers
 - Add new ARG_PTR_TO_UNINIT_MAP_VALUE

RFC v2 -> RFC v3:
 - Return elements by value instead that by reference
 - Implement queue/stack base on array and head + tail indexes
 - Rename stack trace related files to avoid confusion and conflicts

RFC v1 -> RFC v2:
 - Create two separate maps instead of single one + flags
 - Implement bpf_map_lookup_and_delete syscall
 - Support peek operation
 - Define replacement policy through flags in the update() method
 - Add eBPF side tests
====================
Acked-by: NDaniel Borkmann <daniel@iogearbox.net>
Signed-off-by: NAlexei Starovoitov <ast@kernel.org>

43ed375f

selftests/bpf: add test cases for queue and stack maps · 43b987d2

由 Mauricio Vasquez B 提交于 10月 18, 2018

test_maps:
Tests that queue/stack maps are behaving correctly even in corner cases

test_progs:
Tests new ebpf helpers
Signed-off-by: NMauricio Vasquez B <mauricio.vasquez@polito.it>
Acked-by: NSong Liu <songliubraving@fb.com>
Signed-off-by: NAlexei Starovoitov <ast@kernel.org>

43b987d2

Sync uapi/bpf.h to tools/include · da4e1b15

由 Mauricio Vasquez B 提交于 10月 18, 2018

Sync both files.
Signed-off-by: NMauricio Vasquez B <mauricio.vasquez@polito.it>
Acked-by: NSong Liu <songliubraving@fb.com>
Signed-off-by: NAlexei Starovoitov <ast@kernel.org>

da4e1b15

bpf: add MAP_LOOKUP_AND_DELETE_ELEM syscall · bd513cd0

由 Mauricio Vasquez B 提交于 10月 18, 2018

The previous patch implemented a bpf queue/stack maps that
provided the peek/pop/push functions.  There is not a direct
relationship between those functions and the current maps
syscalls, hence a new MAP_LOOKUP_AND_DELETE_ELEM syscall is added,
this is mapped to the pop operation in the queue/stack maps
and it is still to implement in other kind of maps.
Signed-off-by: NMauricio Vasquez B <mauricio.vasquez@polito.it>
Acked-by: NSong Liu <songliubraving@fb.com>
Signed-off-by: NAlexei Starovoitov <ast@kernel.org>

bd513cd0

bpf: add queue and stack maps · f1a2e44a

由 Mauricio Vasquez B 提交于 10月 18, 2018

Queue/stack maps implement a FIFO/LIFO data storage for ebpf programs.
These maps support peek, pop and push operations that are exposed to eBPF
programs through the new bpf_map[peek/pop/push] helpers.  Those operations
are exposed to userspace applications through the already existing
syscalls in the following way:

BPF_MAP_LOOKUP_ELEM            -> peek
BPF_MAP_LOOKUP_AND_DELETE_ELEM -> pop
BPF_MAP_UPDATE_ELEM            -> push

Queue/stack maps are implemented using a buffer, tail and head indexes,
hence BPF_F_NO_PREALLOC is not supported.

As opposite to other maps, queue and stack do not use RCU for protecting
maps values, the bpf_map[peek/pop] have a ARG_PTR_TO_UNINIT_MAP_VALUE
argument that is a pointer to a memory zone where to save the value of a
map.  Basically the same as ARG_PTR_TO_UNINIT_MEM, but the size has not
be passed as an extra argument.

Our main motivation for implementing queue/stack maps was to keep track
of a pool of elements, like network ports in a SNAT, however we forsee
other use cases, like for exampling saving last N kernel events in a map
and then analysing from userspace.
Signed-off-by: NMauricio Vasquez B <mauricio.vasquez@polito.it>
Acked-by: NSong Liu <songliubraving@fb.com>
Signed-off-by: NAlexei Starovoitov <ast@kernel.org>

f1a2e44a

bpf/verifier: add ARG_PTR_TO_UNINIT_MAP_VALUE · 2ea864c5

由 Mauricio Vasquez B 提交于 10月 18, 2018

ARG_PTR_TO_UNINIT_MAP_VALUE argument is a pointer to a memory zone
used to save the value of a map.  Basically the same as
ARG_PTR_TO_UNINIT_MEM, but the size has not be passed as an extra
argument.

This will be used in the following patch that implements some new
helpers that receive a pointer to be filled with a map value.
Signed-off-by: NMauricio Vasquez B <mauricio.vasquez@polito.it>
Acked-by: NSong Liu <songliubraving@fb.com>
Signed-off-by: NAlexei Starovoitov <ast@kernel.org>

2ea864c5

bpf/syscall: allow key to be null in map functions · c9d29f46

由 Mauricio Vasquez B 提交于 10月 18, 2018

This commit adds the required logic to allow key being NULL
in case the key_size of the map is 0.

A new __bpf_copy_key function helper only copies the key from
userpsace when key_size != 0, otherwise it enforces that key must be
null.
Signed-off-by: NMauricio Vasquez B <mauricio.vasquez@polito.it>
Acked-by: NSong Liu <songliubraving@fb.com>
Signed-off-by: NAlexei Starovoitov <ast@kernel.org>

c9d29f46

bpf: rename stack trace map operations · 14499160

由 Mauricio Vasquez B 提交于 10月 18, 2018

In the following patches queue and stack maps (FIFO and LIFO
datastructures) will be implemented.  In order to avoid confusion and
a possible name clash rename stack_map_ops to stack_trace_map_ops
Signed-off-by: NMauricio Vasquez B <mauricio.vasquez@polito.it>
Acked-by: NSong Liu <songliubraving@fb.com>
Signed-off-by: NAlexei Starovoitov <ast@kernel.org>

14499160

19 10月, 2018 2 次提交

tools: bpftool: use 4 context mode for the NFP disasm · 3ddeac67

由 Jakub Kicinski 提交于 10月 18, 2018

The nfp driver is currently always JITing the BPF for 4 context/thread
mode of the NFP flow processors.  Tell this to the disassembler,
otherwise some registers may be incorrectly decoded.
Signed-off-by: NJakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: NJiong Wang <jiong.wang@netronome.com>
Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>

3ddeac67

selftests/bpf: fix file resource leak in load_kallsyms · 1bd70d2e

由 Peng Hao 提交于 10月 18, 2018

FILE pointer variable f is opened but never closed.
Signed-off-by: NPeng Hao <peng.hao2@zte.com.cn>
Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>

1bd70d2e

18 10月, 2018 1 次提交

bpf: fix doc of bpf_skb_adjust_room() in uapi · b55cbc8d

由 Nicolas Dichtel 提交于 10月 17, 2018

len_diff is signed.

Fixes: fa15601a ("bpf: add documentation for eBPF helpers (33-41)")
CC: Quentin Monnet <quentin.monnet@netronome.com>
Signed-off-by: NNicolas Dichtel <nicolas.dichtel@6wind.com>
Reviewed-by: NQuentin Monnet <quentin.monnet@netronome.com>
Signed-off-by: NAlexei Starovoitov <ast@kernel.org>

b55cbc8d

17 10月, 2018 9 次提交

Merge branch 'bpf-sk-msg-peek' · 44d520eb

由 Daniel Borkmann 提交于 10月 17, 2018

John Fastabend says:

====================
This adds support for the MSG_PEEK flag when redirecting into
an ingress psock sk_msg queue.

The first patch adds some base support to the helpers, then the
feature, and finally we add an option for the test suite to do
a duplicate MSG_PEEK call on every recv to test the feature.

With duplicate MSG_PEEK call all tests continue to PASS.
====================
Acked-by: NAlexei Starovoitov <ast@kernel.org>
Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>

44d520eb

bpf: sockmap, add msg_peek tests to test_sockmap · 753fb2ee

由 John Fastabend 提交于 10月 16, 2018

Add tests that do a MSG_PEEK recv followed by a regular receive to
test flag support.
Signed-off-by: NJohn Fastabend <john.fastabend@gmail.com>
Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>

753fb2ee

bpf: sockmap, support for msg_peek in sk_msg with redirect ingress · 02c558b2

由 John Fastabend 提交于 10月 16, 2018

This adds support for the MSG_PEEK flag when doing redirect to ingress
and receiving on the sk_msg psock queue. Previously the flag was
being ignored which could confuse applications if they expected the
flag to work as normal.
Signed-off-by: NJohn Fastabend <john.fastabend@gmail.com>
Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>

02c558b2

bpf: skmsg, improve sk_msg_used_element to work in cork context · 8734a162

由 John Fastabend 提交于 10月 16, 2018

Currently sk_msg_used_element is only called in zerocopy context where
cork is not possible and if this case happens we fallback to copy
mode. However the helper is more useful if it works in all contexts.

This patch resolved the case where if end == head indicating a full
or empty ring the helper always reports an empty ring. To fix this
add a test for the full ring case to avoid reporting a full ring
has 0 elements. This additional functionality will be used in the
next patches from recvmsg context where end = head with a full ring
is a valid case.
Signed-off-by: NJohn Fastabend <john.fastabend@gmail.com>
Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>

8734a162

bpf: sockmap, fix skmsg recvmsg handler to track size correctly · 3f4c3127

由 John Fastabend 提交于 10月 16, 2018

When converting sockmap to new skmsg generic data structures we missed
that the recvmsg handler did not correctly use sg.size and instead was
using individual elements length. The result is if a sock is closed
with outstanding data we omit the call to sk_mem_uncharge() and can
get the warning below.

[   66.728282] WARNING: CPU: 6 PID: 5783 at net/core/stream.c:206 sk_stream_kill_queues+0x1fa/0x210

To fix this correct the redirect handler to xfer the size along with
the scatterlist and also decrement the size from the recvmsg handler.
Now when a sock is closed the remaining 'size' will be decremented
with sk_mem_uncharge().
Signed-off-by: NJohn Fastabend <john.fastabend@gmail.com>
Acked-by: NAlexei Starovoitov <ast@kernel.org>
Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>

3f4c3127

Merge branch 'nfp-improve-bpf-offload' · 9032c10e

由 Alexei Starovoitov 提交于 10月 16, 2018

Jakub Kicinski says:

====================
this set adds check to make sure offload behaviour is correct.
First when atomic counters are used, we must make sure the map
does not already contain data we did not prepare for holding
atomics.

Second patch double checks vNIC capabilities for program offload
in case program is shared by multiple vNICs with different
constraints.
====================
Signed-off-by: NAlexei Starovoitov <ast@kernel.org>

9032c10e

nfp: bpf: double check vNIC capabilities after object sharing · 44b6fed0

由 Jakub Kicinski 提交于 10月 16, 2018

Program translation stage checks that program can be offloaded to
the netdev which was passed during the load (bpf_attr->prog_ifindex).
After program sharing was introduced, however, the netdev on which
program is loaded can theoretically be different, and therefore
we should recheck the program size and max stack size at load time.

This was found by code inspection, AFAIK today all vNICs have
identical caps.
Signed-off-by: NJakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: NQuentin Monnet <quentin.monnet@netronome.com>
Signed-off-by: NAlexei Starovoitov <ast@kernel.org>

44b6fed0

nfp: bpf: protect against mis-initializing atomic counters · 527db74b

由 Jakub Kicinski 提交于 10月 16, 2018

Atomic operations on the NFP are currently always in big endian.
The driver keeps track of regions of memory storing atomic values
and byte swaps them accordingly.  There are corner cases where
the map values may be initialized before the driver knows they
are used as atomic counters.  This can happen either when the
datapath is performing the update and the stack contents are
unknown or when map is updated before the program which will
use it for atomic values is loaded.

To avoid situation where user initializes the value to 0 1 2 3
and then after loading a program which uses the word as an atomic
counter starts reading 3 2 1 0 - only allow atomic counters to be
initialized to endian-neutral values.

For updates from the datapath the stack information may not be
as precise, so just allow initializing such values to 0.

Example code which would break:
struct bpf_map_def SEC("maps") rxcnt = {
       .type = BPF_MAP_TYPE_HASH,
       .key_size = sizeof(__u32),
       .value_size = sizeof(__u64),
       .max_entries = 1,
};

int xdp_prog1()
{
      	__u64 nonzeroval = 3;
	__u32 key = 0;
	__u64 *value;

	value = bpf_map_lookup_elem(&rxcnt, &key);
	if (!value)
		bpf_map_update_elem(&rxcnt, &key, &nonzeroval, BPF_ANY);
	else
		__sync_fetch_and_add(value, 1);

	return XDP_PASS;
}

$ offload bpftool map dump
key: 00 00 00 00 value: 00 00 00 03 00 00 00 00

should be:

$ offload bpftool map dump
key: 00 00 00 00 value: 03 00 00 00 00 00 00 00
Reported-by: NDavid Beckett <david.beckett@netronome.com>
Signed-off-by: NJakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: NQuentin Monnet <quentin.monnet@netronome.com>
Signed-off-by: NAlexei Starovoitov <ast@kernel.org>

527db74b

libbpf: Per-symbol visibility for DSO · ab9e0848

由 Andrey Ignatov 提交于 10月 15, 2018

Make global symbols in libbpf DSO hidden by default with
-fvisibility=hidden and export symbols that are part of ABI explicitly
with __attribute__((visibility("default"))).

This is common practice that should prevent from accidentally exporting
a symbol, that is not supposed to be a part of ABI what, in turn,
improves both libbpf developer- and user-experiences. See [1] for more
details.

Export control becomes more important since more and more projects use
libbpf.

The patch doesn't export a bunch of netlink related functions since as
agreed in [2] they'll be reworked. That doesn't break bpftool since
bpftool links libbpf statically.

[1] https://www.akkadia.org/drepper/dsohowto.pdf (2.2 Export Control)
[2] https://www.mail-archive.com/netdev@vger.kernel.org/msg251434.htmlSigned-off-by: NAndrey Ignatov <rdna@fb.com>
Signed-off-by: NAlexei Starovoitov <ast@kernel.org>

ab9e0848

16 10月, 2018 16 次提交

bpf, tls: add tls header to tools infrastructure · 421f4292

由 Daniel Borkmann 提交于 10月 16, 2018

Andrey reported a build error for the BPF kselftest suite when compiled on
a machine which does not have tls related header bits installed natively:

  test_sockmap.c:120:23: fatal error: linux/tls.h: No such file or directory
   #include <linux/tls.h>
                         ^
  compilation terminated.

Fix it by adding the header to the tools include infrastructure and add
definitions such as SOL_TLS that could potentially be missing.

Fixes: e9dd9047 ("bpf: add tls support for testing in test_sockmap")
Reported-by: NAndrey Ignatov <rdna@fb.com>
Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
Signed-off-by: NAlexei Starovoitov <ast@kernel.org>

421f4292

Merge branch 'net-Kernel-side-filtering-for-route-dumps' · 2c59f06c

由 David S. Miller 提交于 10月 16, 2018

David Ahern says:

====================
net: Kernel side filtering for route dumps

Implement kernel side filtering of route dumps by protocol (e.g., which
routing daemon installed the route), route type (e.g., unicast), table
id and nexthop device.

iproute2 has been doing this filtering in userspace for years; pushing
the filters to the kernel side reduces the amount of data the kernel
sends and reduces wasted cycles on both sides processing unwanted data.
These initial options provide a huge improvement for efficiently
examining routes on large scale systems.

v2
- better handling of requests for a specific table. Rather than walking
  the hash of all tables, lookup the specific table and dump it
- refactor mr_rtm_dumproute moving the loop over the table into a
  helper that can be invoked directly
- add hook to return NLM_F_DUMP_FILTERED in DONE message to ensure
  it is returned even when the dump returns nothing
====================
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

2c59f06c

net/ipv4: Bail early if user only wants prefix entries · e4e92fb1