提交 · ae30727fa4beabd8403b016896eb1f714e6c0fd4 · openeuler / Kernel

20 3月, 2018 18 次提交

由 John Fastabend 提交于 3月 18, 2018

This adds the test script I am currently using to validate
the latest sockmap changes. Shortly sockmap will be ported
to selftests and these will be run from the infrastructure
there. Until then add the script here so we have a coverage
checklist when porting into selftests.
Signed-off-by: NJohn Fastabend <john.fastabend@gmail.com>
Acked-by: NDavid S. Miller <davem@davemloft.net>
Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>

ae30727f

bpf: sockmap sample test for bpf_msg_pull_data · 0dcbbf67

由 John Fastabend 提交于 3月 18, 2018

This adds an option to test the msg_pull_data helper. This
uses two options txmsg_start and txmsg_end to let the user
specify start and end bytes to pull.

The options can be used with txmsg_apply, txmsg_cork options
as well as with any of the basic tests, txmsg, txmsg_redir and
txmsg_drop (plus noisy variants) to run pull_data inline with
those tests. By giving user direct control over the variables
we can easily do negative testing as well as positive tests.
Signed-off-by: NJohn Fastabend <john.fastabend@gmail.com>
Acked-by: NDavid S. Miller <davem@davemloft.net>
Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>

0dcbbf67

bpf: sockmap add SK_DROP tests · e6373ce7

由 John Fastabend 提交于 3月 18, 2018

Add tests for SK_DROP.
Signed-off-by: NJohn Fastabend <john.fastabend@gmail.com>
Acked-by: NDavid S. Miller <davem@davemloft.net>
Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>

e6373ce7

bpf: sockmap sample support for bpf_msg_cork_bytes() · 468b3fde

由 John Fastabend 提交于 3月 18, 2018

Add sample application support for the bpf_msg_cork_bytes helper. This
lets the user specify how many bytes each verdict should apply to.

Similar to apply_bytes() tests these can be run as a stand-alone test
when used without other options or inline with other tests by using
the txmsg_cork option along with any of the basic tests txmsg,
txmsg_redir, txmsg_drop.
Signed-off-by: NJohn Fastabend <john.fastabend@gmail.com>
Acked-by: NDavid S. Miller <davem@davemloft.net>
Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>

468b3fde

bpf: sockmap, add sample option to test apply_bytes helper · 1c16c312

由 John Fastabend 提交于 3月 18, 2018

This adds an option to test the apply_bytes helper. This option lets
the user specify an int on the command line specifying how much data
each verdict should apply to.

When this is set a map entry is set with the bytes input by the user
and then the specified program --txmsg or --txmsg_redir will use the
value and set the applied data. If no other option is set then a
default --txmsg_apply program is run. This program will drop pkts
if an error is detected on the bytes map lookup. Useful to verify
the map lookup and apply helper are working and causing a hard
error if it is not.
Signed-off-by: NJohn Fastabend <john.fastabend@gmail.com>
Acked-by: NDavid S. Miller <davem@davemloft.net>
Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>

1c16c312

bpf: sockmap sample, add data verification option · 6bce9d2c

由 John Fastabend 提交于 3月 18, 2018

To verify data is not being dropped or corrupted this adds an option
to verify test-patterns on recv.
Signed-off-by: NJohn Fastabend <john.fastabend@gmail.com>
Acked-by: NDavid S. Miller <davem@davemloft.net>
Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>

6bce9d2c

bpf: sockmap sample, add sendfile test · e67463cb

由 John Fastabend 提交于 3月 18, 2018

To exercise TX ULP sendpage implementation we need a test that does
a sendfile. Add sendfile test option here.
Signed-off-by: NJohn Fastabend <john.fastabend@gmail.com>
Acked-by: NDavid S. Miller <davem@davemloft.net>
Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>

e67463cb

bpf: sockmap sample, add option to attach SK_MSG program · 4c4c3c27

由 John Fastabend 提交于 3月 18, 2018

Add sockmap option to use SK_MSG program types.
Signed-off-by: NJohn Fastabend <john.fastabend@gmail.com>
Acked-by: NDavid S. Miller <davem@davemloft.net>
Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>

4c4c3c27

bpf: add verifier tests for BPF_PROG_TYPE_SK_MSG · 1acc60b6

由 John Fastabend 提交于 3月 18, 2018

Test read and writes for BPF_PROG_TYPE_SK_MSG.
Signed-off-by: NJohn Fastabend <john.fastabend@gmail.com>
Acked-by: NDavid S. Miller <davem@davemloft.net>
Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>

1acc60b6

bpf: add map tests for BPF_PROG_TYPE_SK_MSG · 82a86168

由 John Fastabend 提交于 3月 18, 2018

Add map tests to attach BPF_PROG_TYPE_SK_MSG types to a sockmap.
Signed-off-by: NJohn Fastabend <john.fastabend@gmail.com>
Acked-by: NDavid S. Miller <davem@davemloft.net>
Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>

82a86168

bpf: sk_msg program helper bpf_sk_msg_pull_data · 015632bb

由 John Fastabend 提交于 3月 18, 2018

Currently, if a bpf sk msg program is run the program
can only parse data that the (start,end) pointers already
consumed. For sendmsg hooks this is likely the first
scatterlist element. For sendpage this will be the range
(0,0) because the data is shared with userspace and by
default we want to avoid allowing userspace to modify
data while (or after) BPF verdict is being decided.

To support pulling in additional bytes for parsing use
a new helper bpf_sk_msg_pull(start, end, flags) which
works similar to cls tc logic. This helper will attempt
to point the data start pointer at 'start' bytes offest
into msg and data end pointer at 'end' bytes offset into
message.

After basic sanity checks to ensure 'start' <= 'end' and
'end' <= msg_length there are a few cases we need to
handle.

First the sendmsg hook has already copied the data from
userspace and has exclusive access to it. Therefor, it
is not necessesary to copy the data. However, it may
be required. After finding the scatterlist element with
'start' offset byte in it there are two cases. One the
range (start,end) is entirely contained in the sg element
and is already linear. All that is needed is to update the
data pointers, no allocate/copy is needed. The other case
is (start, end) crosses sg element boundaries. In this
case we allocate a block of size 'end - start' and copy
the data to linearize it.

Next sendpage hook has not copied any data in initial
state so that data pointers are (0,0). In this case we
handle it similar to the above sendmsg case except the
allocation/copy must always happen. Then when sending
the data we have possibly three memory regions that
need to be sent, (0, start - 1), (start, end), and
(end + 1, msg_length). This is required to ensure any
writes by the BPF program are correctly transmitted.

Lastly this operation will invalidate any previous
data checks so BPF programs will have to revalidate
pointers after making this BPF call.
Signed-off-by: NJohn Fastabend <john.fastabend@gmail.com>
Acked-by: NDavid S. Miller <davem@davemloft.net>
Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>

015632bb

bpf: sockmap, add msg_cork_bytes() helper · 91843d54

由 John Fastabend 提交于 3月 18, 2018

In the case where we need a specific number of bytes before a
verdict can be assigned, even if the data spans multiple sendmsg
or sendfile calls. The BPF program may use msg_cork_bytes().

The extreme case is a user can call sendmsg repeatedly with
1-byte msg segments. Obviously, this is bad for performance but
is still valid. If the BPF program needs N bytes to validate
a header it can use msg_cork_bytes to specify N bytes and the
BPF program will not be called again until N bytes have been
accumulated. The infrastructure will attempt to coalesce data
if possible so in many cases (most my use cases at least) the
data will be in a single scatterlist element with data pointers
pointing to start/end of the element. However, this is dependent
on available memory so is not guaranteed. So BPF programs must
validate data pointer ranges, but this is the case anyways to
convince the verifier the accesses are valid.
Signed-off-by: NJohn Fastabend <john.fastabend@gmail.com>
Acked-by: NDavid S. Miller <davem@davemloft.net>
Acked-by: NAlexei Starovoitov <ast@kernel.org>
Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>

91843d54

bpf: sockmap, add bpf_msg_apply_bytes() helper · 2a100317

由 John Fastabend 提交于 3月 18, 2018

A single sendmsg or sendfile system call can contain multiple logical
messages that a BPF program may want to read and apply a verdict. But,
without an apply_bytes helper any verdict on the data applies to all
bytes in the sendmsg/sendfile. Alternatively, a BPF program may only
care to read the first N bytes of a msg. If the payload is large say
MB or even GB setting up and calling the BPF program repeatedly for
all bytes, even though the verdict is already known, creates
unnecessary overhead.

To allow BPF programs to control how many bytes a given verdict
applies to we implement a bpf_msg_apply_bytes() helper. When called
from within a BPF program this sets a counter, internal to the
BPF infrastructure, that applies the last verdict to the next N
bytes. If the N is smaller than the current data being processed
from a sendmsg/sendfile call, the first N bytes will be sent and
the BPF program will be re-run with start_data pointing to the N+1
byte. If N is larger than the current data being processed the
BPF verdict will be applied to multiple sendmsg/sendfile calls
until N bytes are consumed.

Note1 if a socket closes with apply_bytes counter non-zero this
is not a problem because data is not being buffered for N bytes
and is sent as its received.
Signed-off-by: NJohn Fastabend <john.fastabend@gmail.com>
Acked-by: NDavid S. Miller <davem@davemloft.net>
Acked-by: NAlexei Starovoitov <ast@kernel.org>
Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>

2a100317

bpf: create tcp_bpf_ulp allowing BPF to monitor socket TX/RX data · 4f738adb

由 John Fastabend 提交于 3月 18, 2018

This implements a BPF ULP layer to allow policy enforcement and
monitoring at the socket layer. In order to support this a new
program type BPF_PROG_TYPE_SK_MSG is used to run the policy at
the sendmsg/sendpage hook. To attach the policy to sockets a
sockmap is used with a new program attach type BPF_SK_MSG_VERDICT.

Similar to previous sockmap usages when a sock is added to a
sockmap, via a map update, if the map contains a BPF_SK_MSG_VERDICT
program type attached then the BPF ULP layer is created on the
socket and the attached BPF_PROG_TYPE_SK_MSG program is run for
every msg in sendmsg case and page/offset in sendpage case.

BPF_PROG_TYPE_SK_MSG Semantics/API:

BPF_PROG_TYPE_SK_MSG supports only two return codes SK_PASS and
SK_DROP. Returning SK_DROP free's the copied data in the sendmsg
case and in the sendpage case leaves the data untouched. Both cases
return -EACESS to the user. Returning SK_PASS will allow the msg to
be sent.

In the sendmsg case data is copied into kernel space buffers before
running the BPF program. The kernel space buffers are stored in a
scatterlist object where each element is a kernel memory buffer.
Some effort is made to coalesce data from the sendmsg call here.
For example a sendmsg call with many one byte iov entries will
likely be pushed into a single entry. The BPF program is run with
data pointers (start/end) pointing to the first sg element.

In the sendpage case data is not copied. We opt not to copy the
data by default here, because the BPF infrastructure does not
know what bytes will be needed nor when they will be needed. So
copying all bytes may be wasteful. Because of this the initial
start/end data pointers are (0,0). Meaning no data can be read or
written. This avoids reading data that may be modified by the
user. A new helper is added later in this series if reading and
writing the data is needed. The helper call will do a copy by
default so that the page is exclusively owned by the BPF call.

The verdict from the BPF_PROG_TYPE_SK_MSG applies to the entire msg
in the sendmsg() case and the entire page/offset in the sendpage case.
This avoids ambiguity on how to handle mixed return codes in the
sendmsg case. Again a helper is added later in the series if
a verdict needs to apply to multiple system calls and/or only
a subpart of the currently being processed message.

The helper msg_redirect_map() can be used to select the socket to
send the data on. This is used similar to existing redirect use
cases. This allows policy to redirect msgs.

Pseudo code simple example:

The basic logic to attach a program to a socket is as follows,

  // load the programs
  bpf_prog_load(SOCKMAP_TCP_MSG_PROG, BPF_PROG_TYPE_SK_MSG,
		&obj, &msg_prog);

  // lookup the sockmap
  bpf_map_msg = bpf_object__find_map_by_name(obj, "my_sock_map");

  // get fd for sockmap
  map_fd_msg = bpf_map__fd(bpf_map_msg);

  // attach program to sockmap
  bpf_prog_attach(msg_prog, map_fd_msg, BPF_SK_MSG_VERDICT, 0);

Adding sockets to the map is done in the normal way,

  // Add a socket 'fd' to sockmap at location 'i'
  bpf_map_update_elem(map_fd_msg, &i, fd, BPF_ANY);

After the above any socket attached to "my_sock_map", in this case
'fd', will run the BPF msg verdict program (msg_prog) on every
sendmsg and sendpage system call.

For a complete example see BPF selftests or sockmap samples.

Implementation notes:

It seemed the simplest, to me at least, to use a refcnt to ensure
psock is not lost across the sendmsg copy into the sg, the bpf program
running on the data in sg_data, and the final pass to the TCP stack.
Some performance testing may show a better method to do this and avoid
the refcnt cost, but for now use the simpler method.

Another item that will come after basic support is in place is
supporting MSG_MORE flag. At the moment we call sendpages even if
the MSG_MORE flag is set. An enhancement would be to collect the
pages into a larger scatterlist and pass down the stack. Notice that
bpf_tcp_sendmsg() could support this with some additional state saved
across sendmsg calls. I built the code to support this without having
to do refactoring work. Other features TBD include ZEROCOPY and the
TCP_RECV_QUEUE/TCP_NO_QUEUE support. This will follow initial series
shortly.

Future work could improve size limits on the scatterlist rings used
here. Currently, we use MAX_SKB_FRAGS simply because this was being
used already in the TLS case. Future work could extend the kernel sk
APIs to tune this depending on workload. This is a trade-off
between memory usage and throughput performance.
Signed-off-by: NJohn Fastabend <john.fastabend@gmail.com>
Acked-by: NDavid S. Miller <davem@davemloft.net>
Acked-by: NAlexei Starovoitov <ast@kernel.org>
Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>

4f738adb

net: generalize sk_alloc_sg to work with scatterlist rings · 8c05dbf0

由 John Fastabend 提交于 3月 18, 2018

The current implementation of sk_alloc_sg expects scatterlist to always
start at entry 0 and complete at entry MAX_SKB_FRAGS.

Future patches will want to support starting at arbitrary offset into
scatterlist so add an additional sg_start parameters and then default
to the current values in TLS code paths.
Signed-off-by: NJohn Fastabend <john.fastabend@gmail.com>
Acked-by: NDavid S. Miller <davem@davemloft.net>
Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>

8c05dbf0

net: do_tcp_sendpages flag to avoid SKBTX_SHARED_FRAG · 312fc2b4

由 John Fastabend 提交于 3月 18, 2018

When calling do_tcp_sendpages() from in kernel and we know the data
has no references from user side we can omit SKBTX_SHARED_FRAG flag.
This patch adds an internal flag, NO_SKBTX_SHARED_FRAG that can be used
to omit setting SKBTX_SHARED_FRAG.

The flag is not exposed to userspace because the sendpage call from
the splice logic masks out all bits except MSG_MORE.
Signed-off-by: NJohn Fastabend <john.fastabend@gmail.com>
Acked-by: NDavid S. Miller <davem@davemloft.net>
Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>

312fc2b4

sockmap: convert refcnt to an atomic refcnt · ffa35660

由 John Fastabend 提交于 3月 18, 2018

The sockmap refcnt up until now has been wrapped in the
sk_callback_lock(). So its not actually needed any locking of its
own. The counter itself tracks the lifetime of the psock object.
Sockets in a sockmap have a lifetime that is independent of the
map they are part of. This is possible because a single socket may
be in multiple maps. When this happens we can only release the
psock data associated with the socket when the refcnt reaches
zero. There are three possible delete sock reference decrement
paths first through the normal sockmap process, the user deletes
the socket from the map. Second the map is removed and all sockets
in the map are removed, delete path is similar to case 1. The third
case is an asyncronous socket event such as a closing the socket. The
last case handles removing sockets that are no longer available.
For completeness, although inc does not pose any problems in this
patch series, the inc case only happens when a psock is added to a
map.

Next we plan to add another socket prog type to handle policy and
monitoring on the TX path. When we do this however we will need to
keep a reference count open across the sendmsg/sendpage call and
holding the sk_callback_lock() here (on every send) seems less than
ideal, also it may sleep in cases where we hit memory pressure.
Instead of dealing with these issues in some clever way simply make
the reference counting a refcnt_t type and do proper atomic ops.
Signed-off-by: NJohn Fastabend <john.fastabend@gmail.com>
Acked-by: NDavid S. Miller <davem@davemloft.net>
Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>

ffa35660

sock: make static tls function alloc_sg generic sock helper · 2c3682f0

由 John Fastabend 提交于 3月 18, 2018

The TLS ULP module builds scatterlists from a sock using
page_frag_refill(). This is going to be useful for other ULPs
so move it into sock file for more general use.

In the process remove useless goto at end of while loop.
Signed-off-by: NJohn Fastabend <john.fastabend@gmail.com>
Acked-by: NDavid S. Miller <davem@davemloft.net>
Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>

2c3682f0

16 3月, 2018 5 次提交

Merge branch 'bpf-tools-build-improvements' · 318df9f0

由 Daniel Borkmann 提交于 3月 16, 2018

Jakub Kicinski says:

====================
As promised this series addresses nits and minor issues in tools/bpf
build infra.  One GCC-7 warning which is nice to get rid of.  Dependencies
when built with OUTPUT are fixed.  make clean will now remove the
FEATURE-DUMP.* files.  PHONY target is also updated to match reality.
====================
Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>

318df9f0

tools: bpf: remove feature detection output · cc5b3403

由 Jakub Kicinski 提交于 3月 15, 2018

bpf tools use feature detection for libbfd dependency, clean up
the output files on make clean.
Signed-off-by: NJakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>

cc5b3403

tools: bpf: cleanup PHONY target · 8050ea46

由 Jakub Kicinski 提交于 3月 15, 2018

There is no FORCE target in the Makefile and some of the PHONY
targets are missing, update the list.
Signed-off-by: NJakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>

8050ea46

tools: bpftool: fix potential format truncation · d5fc73dc

由 Jakub Kicinski 提交于 3月 15, 2018

GCC 7 complains:

xlated_dumper.c: In function â€˜print_callâ€™:
xlated_dumper.c:179:10: warning: â€˜%sâ€™ directive output may be truncated writing up to 255 bytes into a region of size between 249 and 253 [-Wformat-truncation=]
"%+d#%s", insn->off, sym->name);

Add a bit more space to the buffer so it can handle the entire
string and integer without truncation.
Signed-off-by: NJakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>

d5fc73dc

tools: bpftool: fix dependency file path · 90126e3a

由 Jakub Kicinski 提交于 3月 15, 2018

Auto-generated dependency files are in the OUTPUT directory,
we need to include them from there.  This fixes object files
not being rebuilt after header changes.
Signed-off-by: NJakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>

90126e3a

15 3月, 2018 3 次提交

Merge branch 'bpf-stackmap-build-id' · 68de5ef4

由 Daniel Borkmann 提交于 3月 15, 2018

Song Liu says:

====================
This work follows up discussion at Plumbers'17 on improving addr->sym
resolution of user stack traces. The following links have more information
of the discussion:

http://www.linuxplumbersconf.org/2017/ocw/proposals/4764
https://lwn.net/Articles/734453/     Section "Stack traces and kprobes"

Currently, bpf stackmap store address for each entry in the call trace.
To map these addresses to user space files, it is necessary to maintain
the mapping from these virtual address to symbols in the binary. Usually,
the user space profiler (such as perf) has to scan /proc/pid/maps at the
beginning of profiling, and monitor mmap2() calls afterwards. Given the
cost of maintaining the address map, this solution is not practical for
system wide profiling that is always on.

This patch tries to address this with a variation to stackmap. Instead
of storing addresses, the variation stores ELF file build_id + offset.
After profiling, a user space tool will look up these functions with
build_id (to find the binary or shared library) and the offset.

I also updated bcc/cc library for the stackmap (no python/lua support yet).
You can find the work at:

  https://github.com/liu-song-6/bcc/commits/bpf_get_stackid_v02

Changes v5 -> v6:

1. When kernel stack is added to stackmap with build_id, use fallback
   mechanism to store ip (status == BPF_STACK_BUILD_ID_IP).

Changes v4 -> v5:

1. Only allow build_id lookup in non-nmi context. Added comment and
   commit message to highlight this limitation.
2. Minor fix reported by kbuild test robot.

Changes v3 -> v4:

1. Add fallback when build_id lookup failed. In this case, status is set
   to BPF_STACK_BUILD_ID_IP, and ip of this entry is saved.
2. Handle cases where vma is only part of the file (vma->vm_pgoff != 0).
   Thanks to Teng for helping me identify this issue!
3. Address feedbacks for previous versions.
====================
Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>

68de5ef4

bpf: add selftest for stackmap with BPF_F_STACK_BUILD_ID · 81f77fd0

由 Song Liu 提交于 3月 14, 2018

test_stacktrace_build_id() is added. It accesses tracepoint urandom_read
with "dd" and "urandom_read" and gathers stack traces. Then it reads the
stack traces from the stackmap.

urandom_read is a statically link binary that reads from /dev/urandom.
test_stacktrace_build_id() calls readelf to read build ID of urandom_read
and compares it with build ID from the stackmap.
Signed-off-by: NSong Liu <songliubraving@fb.com>
Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>

81f77fd0

bpf: extend stackmap to save binary_build_id+offset instead of address · 615755a7

由 Song Liu 提交于 3月 14, 2018

Currently, bpf stackmap store address for each entry in the call trace.
To map these addresses to user space files, it is necessary to maintain
the mapping from these virtual address to symbols in the binary. Usually,
the user space profiler (such as perf) has to scan /proc/pid/maps at the
beginning of profiling, and monitor mmap2() calls afterwards. Given the
cost of maintaining the address map, this solution is not practical for
system wide profiling that is always on.

This patch tries to solve this problem with a variation of stackmap. This
variation is enabled by flag BPF_F_STACK_BUILD_ID. Instead of storing
addresses, the variation stores ELF file build_id + offset.

Build ID is a 20-byte unique identifier for ELF files. The following
command shows the Build ID of /bin/bash:

  [user@]$ readelf -n /bin/bash
  ...
    Build ID: XXXXXXXXXX
  ...

With BPF_F_STACK_BUILD_ID, bpf_get_stackid() tries to parse Build ID
for each entry in the call trace, and translate it into the following
struct:

  struct bpf_stack_build_id_offset {
          __s32           status;
          unsigned char   build_id[BPF_BUILD_ID_SIZE];
          union {
                  __u64   offset;
                  __u64   ip;
          };
  };

The search of build_id is limited to the first page of the file, and this
page should be in page cache. Otherwise, we fallback to store ip for this
entry (ip field in struct bpf_stack_build_id_offset). This requires the
build_id to be stored in the first page. A quick survey of binary and
dynamic library files in a few different systems shows that almost all
binary and dynamic library files have build_id in the first page.

Build_id is only meaningful for user stack. If a kernel stack is added to
a stackmap with BPF_F_STACK_BUILD_ID, it will automatically fallback to
only store ip (status == BPF_STACK_BUILD_ID_IP). Similarly, if build_id
lookup failed for some reason, it will also fallback to store ip.

User space can access struct bpf_stack_build_id_offset with bpf
syscall BPF_MAP_LOOKUP_ELEM. It is necessary for user space to
maintain mapping from build id to binary files. This mostly static
mapping is much easier to maintain than per process address maps.

Note: Stackmap with build_id only works in non-nmi context at this time.
This is because we need to take mm->mmap_sem for find_vma(). If this
changes, we would like to allow build_id lookup in nmi context.
Signed-off-by: NSong Liu <songliubraving@fb.com>
Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>

615755a7

09 3月, 2018 9 次提交

bpf: comment why dots in filenames under BPF virtual FS are not allowed · 6d8cb045

由 Quentin Monnet 提交于 3月 08, 2018

When pinning a file under the BPF virtual file system (traditionally
/sys/fs/bpf), using a dot in the name of the location to pin at is not
allowed. For example, trying to pin at "/sys/fs/bpf/foo.bar" will be
rejected with -EPERM.

This check was introduced at the same time as the BPF file system
itself, with commit b2197755 ("bpf: add support for persistent
maps/progs"). At this time, it was checked in a function called
"bpf_dname_reserved()", which made clear that using a dot was reserved
for future extensions.

This function disappeared and the check was moved elsewhere with commit
0c93b7d8 ("bpf: reject invalid names right in ->lookup()"), and the
meaning of the dot ban was lost.

The present commit simply adds a comment in the source to explain to the
reader that the usage of dots is reserved for future usage.
Signed-off-by: NQuentin Monnet <quentin.monnet@netronome.com>
Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>

6d8cb045

Merge branch 'bpf-tools-makefile-improvements' · 75a141af

由 Daniel Borkmann 提交于 3月 09, 2018

Jiri Benc says:

====================
Currently, 'make bpf' in the tools/ directory does not provide the
standard quiet output except for bpftool (which is however listed
with a wrong directory). Worse, it does not respect the build output
directory.

The 'make bpf_install' does not work as one would expect, either.
It installs unconditionally to /usr/bin without respecting DESTDIR
and prefix.

This patchset improves that behavior.
====================
Reviewed-by: NJakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>

75a141af

tools: bpf: silence make by not deleting intermediate file · ef8ba83b

由 Jiri Benc 提交于 3月 08, 2018

Even in quiet mode, make finishes with

rm tools/bpf/bpf_exp.lex.c

That's because it considers the file to be intermediate. Silence that by
mentioning the lex.c file instead of the lex.o file; the dependency still
stays.
Signed-off-by: NJiri Benc <jbenc@redhat.com>
Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>

ef8ba83b

tools: bpf: respect quiet/verbose build · a50b7f8c

由 Jiri Benc 提交于 3月 08, 2018

Default to quiet build, with V=1 enabling verbose build as is usual.
Signed-off-by: NJiri Benc <jbenc@redhat.com>
Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>

a50b7f8c

tools: bpf: call descend in Makefile · 58416c37

由 Jiri Benc 提交于 3月 08, 2018

Use the descend macro to properly propagate $(subdir) to bpftool.
Signed-off-by: NJiri Benc <jbenc@redhat.com>
Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>

58416c37

tools: bpf: make install should build first · 6c071008

由 Jiri Benc 提交于 3月 08, 2018

Make the 'install' target depend on the 'all' target to build the binaries
first.
Signed-off-by: NJiri Benc <jbenc@redhat.com>
Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>

6c071008

tools: bpf: consistent make bpf_install · fde68c5b

由 Jiri Benc 提交于 3月 08, 2018

Currently, make bpf_install in tools/ does not respect DESTDIR. Moreover, it
installs to /usr/bin/ unconditionally.

Let it respect DESTDIR and allow prefix to be specified. Also, to be more
consistent with bpftool and with the usual customs, default the prefix to
/usr/local instead of /usr.
Signed-off-by: NJiri Benc <jbenc@redhat.com>
Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>

fde68c5b

tools: bpf: respect output directory during build · 5a8997f2

由 Jiri Benc 提交于 3月 08, 2018

Currently, the programs under tools/bpf (with the notable exception of
bpftool) do not respect the output directory (make O=dir). Fix that.
Signed-off-by: NJiri Benc <jbenc@redhat.com>
Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>

5a8997f2

tools: bpftool: silence 'missing initializer' warnings · 72ab55e9

由 Jiri Benc 提交于 3月 08, 2018

When building bpf tool, gcc emits piles of warnings:

prog.c: In function ‘prog_fd_by_tag’:
prog.c:101:9: warning: missing initializer for field ‘type’ of ‘struct bpf_prog_info’ [-Wmissing-field-initializers]
  struct bpf_prog_info info = {};
         ^
In file included from /home/storage/jbenc/git/net-next/tools/lib/bpf/bpf.h:26:0,
                 from prog.c:47:
/home/storage/jbenc/git/net-next/tools/include/uapi/linux/bpf.h:925:8: note: ‘type’ declared here
  __u32 type;
        ^

As these warnings are not useful, switch them off.
Signed-off-by: NJiri Benc <jbenc@redhat.com>
Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>

72ab55e9

08 3月, 2018 5 次提交

Merge branch 'bpf-perf-sample-addr' · 12ef9bda

由 Daniel Borkmann 提交于 3月 08, 2018

Teng Qin says:

====================
These patches add support that allows bpf programs attached to perf events to
read the address values recorded with the perf events. These values are
requested by specifying sample_type with PERF_SAMPLE_ADDR when calling
perf_event_open().

The main motivation for these changes is to support building memory or lock
access profiling and tracing tools. For example on Intel CPUs, the recorded
address values for supported memory or lock access perf events would be
the access or lock target addresses from PEBS buffer. Such information would
be very valuable for building tools that help understand memory access or
lock acquire pattern.
====================
Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>

12ef9bda

samples/bpf: add example to test reading address · 12fe1225

由 Teng Qin 提交于 3月 06, 2018

This commit adds additional test in the trace_event example, by
attaching the bpf program to MEM_UOPS_RETIRED.LOCK_LOADS event with
PERF_SAMPLE_ADDR requested, and print the lock address value read from
the bpf program to trace_pipe.
Signed-off-by: NTeng Qin <qinteng@fb.com>
Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>

12fe1225

bpf: add support to read sample address in bpf program · 95da0cdb

由 Teng Qin 提交于 3月 06, 2018

This commit adds new field "addr" to bpf_perf_event_data which could be
read and used by bpf programs attached to perf events. The value of the
field is copied from bpf_perf_event_data_kern.addr and contains the
address value recorded by specifying sample_type with PERF_SAMPLE_ADDR
when calling perf_event_open.
Signed-off-by: NTeng Qin <qinteng@fb.com>
Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>

95da0cdb

ip6mr: remove synchronize_rcu() in favor of SOCK_RCU_FREE · a366e300

由 Eric Dumazet 提交于 3月 07, 2018

Kirill found that recently added synchronize_rcu() call in
ip6mr_sk_done()
was slowing down netns dismantle and posted a patch to use it only if
the socket
was found.

I instead suggested to get rid of this call, and use instead
SOCK_RCU_FREE

We might later change IPv4 side to use the same technique and unify
both stacks. IPv4 does not use synchronize_rcu() but has a call_rcu()
that could be replaced by SOCK_RCU_FREE.

Tested:
 time for i in {1..1000}; do unshare -n /bin/false;done

 Before : real 7m18.911s
 After : real 10.187s

Fixes: 8571ab47 ("ip6mr: Make mroute_sk rcu-based")
Signed-off-by: NEric Dumazet <edumazet@google.com>
Reported-by: NKirill Tkhai <ktkhai@virtuozzo.com>
Cc: Yuval Mintz <yuvalm@mellanox.com>
Reviewed-by: NKirill Tkhai <ktkhai@virtuozzo.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

a366e300

Merge branch 'RDS-zerocopy-code-enhancements' · 3f6be0eb

由 David S. Miller 提交于 3月 07, 2018

Sowmini Varadhan says:

====================
RDS: zerocopy code enhancements

A couple of enhancements to the rds zerocop code
- patch 1 refactors rds_message_copy_from_user to pull the zcopy logic
  into its own function
- patch 2 drops the usage sk_buff to track MSG_ZEROCOPY cookies and
  uses a simple linked list (enhancement suggested by willemb during
  code review)
====================
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

3f6be0eb

openeuler / Kernel 大约 1 年 前同步成功

openeuler / Kernel
大约 1 年前同步成功