提交 · 661b37cd437ef49cd28444f79b9b0c71ea76e8c8 · openeuler / Kernel

26 8月, 2020 4 次提交

由 Jiri Olsa 提交于 8月 25, 2020

Adding d_path helper function that returns full path for
given 'struct path' object, which needs to be the kernel
BTF 'path' object. The path is returned in buffer provided
'buf' of size 'sz' and is zero terminated.

  bpf_d_path(&file->f_path, buf, size);

The helper calls directly d_path function, so there's only
limited set of function it can be called from. Adding just
very modest set for the start.

Updating also bpf.h tools uapi header and adding 'path' to
bpf_helpers_doc.py script.
Signed-off-by: NJiri Olsa <jolsa@kernel.org>
Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
Acked-by: NAndrii Nakryiko <andriin@fb.com>
Acked-by: NKP Singh <kpsingh@google.com>
Link: https://lore.kernel.org/bpf/20200825192124.710397-11-jolsa@kernel.org

6e22ab9d

bpf: Allow local storage to be used from LSM programs · 30897832

由 KP Singh 提交于 8月 25, 2020

Adds support for both bpf_{sk, inode}_storage_{get, delete} to be used
in LSM programs. These helpers are not used for tracing programs
(currently) as their usage is tied to the life-cycle of the object and
should only be used where the owning object won't be freed (when the
owning object is passed as an argument to the LSM hook). Thus, they
are safer to use in LSM hooks than tracing. Usage of local storage in
tracing programs will probably follow a per function based whitelist
approach.

Since the UAPI helper signature for bpf_sk_storage expect a bpf_sock,
it, leads to a compilation warning for LSM programs, it's also updated
to accept a void * pointer instead.
Signed-off-by: NKP Singh <kpsingh@google.com>
Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
Acked-by: NMartin KaFai Lau <kafai@fb.com>
Link: https://lore.kernel.org/bpf/20200825182919.1118197-7-kpsingh@chromium.org

30897832

bpf: Implement bpf_local_storage for inodes · 8ea63684

由 KP Singh 提交于 8月 25, 2020

Similar to bpf_local_storage for sockets, add local storage for inodes.
The life-cycle of storage is managed with the life-cycle of the inode.
i.e. the storage is destroyed along with the owning inode.

The BPF LSM allocates an __rcu pointer to the bpf_local_storage in the
security blob which are now stackable and can co-exist with other LSMs.
Signed-off-by: NKP Singh <kpsingh@google.com>
Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20200825182919.1118197-6-kpsingh@chromium.org

8ea63684

bpf: Generalize bpf_sk_storage · f836a56e

由 KP Singh 提交于 8月 25, 2020

Refactor the functionality in bpf_sk_storage.c so that concept of
storage linked to kernel objects can be extended to other objects like
inode, task_struct etc.

Each new local storage will still be a separate map and provide its own
set of helpers. This allows for future object specific extensions and
still share a lot of the underlying implementation.

This includes the changes suggested by Martin in:

  https://lore.kernel.org/bpf/20200725013047.4006241-1-kafai@fb.com/

adding new map operations to support bpf_local_storage maps:

* storages for different kernel objects to optionally have different
  memory charging strategy (map_local_storage_charge,
  map_local_storage_uncharge)
* Functionality to extract the storage pointer from a pointer to the
  owning object (map_owner_storage_ptr)
Co-developed-by: NMartin KaFai Lau <kafai@fb.com>
Signed-off-by: NMartin KaFai Lau <kafai@fb.com>
Signed-off-by: NKP Singh <kpsingh@google.com>
Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20200825182919.1118197-4-kpsingh@chromium.org

f836a56e

25 8月, 2020 6 次提交

tcp: bpf: Optionally store mac header in TCP_SAVE_SYN · 267cf9fa

由 Martin KaFai Lau 提交于 8月 20, 2020

This patch is adapted from Eric's patch in an earlier discussion [1].

The TCP_SAVE_SYN currently only stores the network header and
tcp header.  This patch allows it to optionally store
the mac header also if the setsockopt's optval is 2.

It requires one more bit for the "save_syn" bit field in tcp_sock.
This patch achieves this by moving the syn_smc bit next to the is_mptcp.
The syn_smc is currently used with the TCP experimental option.  Since
syn_smc is only used when CONFIG_SMC is enabled, this patch also puts
the "IS_ENABLED(CONFIG_SMC)" around it like the is_mptcp did
with "IS_ENABLED(CONFIG_MPTCP)".

The mac_hdrlen is also stored in the "struct saved_syn"
to allow a quick offset from the bpf prog if it chooses to start
getting from the network header or the tcp header.

[1]: https://lore.kernel.org/netdev/CANn89iLJNWh6bkH7DNhy_kmcAexuUCccqERqe7z2QsvPhGrYPQ@mail.gmail.com/Suggested-by: NEric Dumazet <edumazet@google.com>
Signed-off-by: NMartin KaFai Lau <kafai@fb.com>
Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
Reviewed-by: NEric Dumazet <edumazet@google.com>
Link: https://lore.kernel.org/bpf/20200820190123.2886935-1-kafai@fb.com

267cf9fa

bpf: tcp: Allow bpf prog to write and parse TCP header option · 0813a841

由 Martin KaFai Lau 提交于 8月 20, 2020

[ Note: The TCP changes here is mainly to implement the bpf
pieces into the bpf_skops_*() functions introduced
in the earlier patches. ]

The earlier effort in BPF-TCP-CC allows the TCP Congestion Control
algorithm to be written in BPF. It opens up opportunities to allow
a faster turnaround time in testing/releasing new congestion control
ideas to production environment.

The same flexibility can be extended to writing TCP header option.
It is not uncommon that people want to test new TCP header option
to improve the TCP performance. Another use case is for data-center
that has a more controlled environment and has more flexibility in
putting header options for internal only use.

For example, we want to test the idea in putting maximum delay
ACK in TCP header option which is similar to a draft RFC proposal [1].

This patch introduces the necessary BPF API and use them in the
TCP stack to allow BPF_PROG_TYPE_SOCK_OPS program to parse
and write TCP header options. It currently supports most of
the TCP packet except RST.

Supported TCP header option:
───────────────────────────
This patch allows the bpf-prog to write any option kind.
Different bpf-progs can write its own option by calling the new helper
bpf_store_hdr_opt(). The helper will ensure there is no duplicated
option in the header.

By allowing bpf-prog to write any option kind, this gives a lot of
flexibility to the bpf-prog. Different bpf-prog can write its
own option kind. It could also allow the bpf-prog to support a
recently standardized option on an older kernel.

Sockops Callback Flags:
──────────────────────
The bpf program will only be called to parse/write tcp header option
if the following newly added callback flags are enabled
in tp->bpf_sock_ops_cb_flags:
BPF_SOCK_OPS_PARSE_UNKNOWN_HDR_OPT_CB_FLAG
BPF_SOCK_OPS_PARSE_ALL_HDR_OPT_CB_FLAG
BPF_SOCK_OPS_WRITE_HDR_OPT_CB_FLAG

A few words on the PARSE CB flags. When the above PARSE CB flags are
turned on, the bpf-prog will be called on packets received
at a sk that has at least reached the ESTABLISHED state.
The parsing of the SYN-SYNACK-ACK will be discussed in the
"3 Way HandShake" section.

The default is off for all of the above new CB flags, i.e. the bpf prog
will not be called to parse or write bpf hdr option. There are
details comment on these new cb flags in the UAPI bpf.h.

sock_ops->skb_data and bpf_load_hdr_opt()
─────────────────────────────────────────
sock_ops->skb_data and sock_ops->skb_data_end covers the whole
TCP header and its options. They are read only.

The new bpf_load_hdr_opt() helps to read a particular option "kind"
from the skb_data.

Please refer to the comment in UAPI bpf.h. It has details
on what skb_data contains under different sock_ops->op.

3 Way HandShake
───────────────
The bpf-prog can learn if it is sending SYN or SYNACK by reading the
sock_ops->skb_tcp_flags.

* Passive side

When writing SYNACK (i.e. sock_ops->op == BPF_SOCK_OPS_WRITE_HDR_OPT_CB),
the received SYN skb will be available to the bpf prog. The bpf prog can
use the SYN skb (which may carry the header option sent from the remote bpf
prog) to decide what bpf header option should be written to the outgoing
SYNACK skb. The SYN packet can be obtained by getsockopt(TCP_BPF_SYN*).
More on this later. Also, the bpf prog can learn if it is in syncookie
mode (by checking sock_ops->args[0] == BPF_WRITE_HDR_TCP_SYNACK_COOKIE).

The bpf prog can store the received SYN pkt by using the existing
bpf_setsockopt(TCP_SAVE_SYN). The example in a later patch does it.
[ Note that the fullsock here is a listen sk, bpf_sk_storage
is not very useful here since the listen sk will be shared
by many concurrent connection requests.

Extending bpf_sk_storage support to request_sock will add weight
to the minisock and it is not necessary better than storing the
whole ~100 bytes SYN pkt. ]

When the connection is established, the bpf prog will be called
in the existing PASSIVE_ESTABLISHED_CB callback. At that time,
the bpf prog can get the header option from the saved syn and
then apply the needed operation to the newly established socket.
The later patch will use the max delay ack specified in the SYN
header and set the RTO of this newly established connection
as an example.

The received ACK (that concludes the 3WHS) will also be available to
the bpf prog during PASSIVE_ESTABLISHED_CB through the sock_ops->skb_data.
It could be useful in syncookie scenario. More on this later.

There is an existing getsockopt "TCP_SAVED_SYN" to return the whole
saved syn pkt which includes the IP[46] header and the TCP header.
A few "TCP_BPF_SYN*" getsockopt has been added to allow specifying where to
start getting from, e.g. starting from TCP header, or from IP[46] header.

The new getsockopt(TCP_BPF_SYN*) will also know where it can get
the SYN's packet from:
- (a) the just received syn (available when the bpf prog is writing SYNACK)
and it is the only way to get SYN during syncookie mode.
or
- (b) the saved syn (available in PASSIVE_ESTABLISHED_CB and also other
existing CB).

The bpf prog does not need to know where the SYN pkt is coming from.
The getsockopt(TCP_BPF_SYN*) will hide this details.

Similarly, a flags "BPF_LOAD_HDR_OPT_TCP_SYN" is also added to
bpf_load_hdr_opt() to read a particular header option from the SYN packet.

* Fastopen

Fastopen should work the same as the regular non fastopen case.
This is a test in a later patch.

* Syncookie

For syncookie, the later example patch asks the active
side's bpf prog to resend the header options in ACK. The server
can use bpf_load_hdr_opt() to look at the options in this
received ACK during PASSIVE_ESTABLISHED_CB.

* Active side

The bpf prog will get a chance to write the bpf header option
in the SYN packet during WRITE_HDR_OPT_CB. The received SYNACK
pkt will also be available to the bpf prog during the existing
ACTIVE_ESTABLISHED_CB callback through the sock_ops->skb_data
and bpf_load_hdr_opt().

* Turn off header CB flags after 3WHS

If the bpf prog does not need to write/parse header options
beyond the 3WHS, the bpf prog can clear the bpf_sock_ops_cb_flags
to avoid being called for header options.
Or the bpf-prog can select to leave the UNKNOWN_HDR_OPT_CB_FLAG on
so that the kernel will only call it when there is option that
the kernel cannot handle.

[1]: draft-wang-tcpm-low-latency-opt-00
https://tools.ietf.org/html/draft-wang-tcpm-low-latency-opt-00Signed-off-by: NMartin KaFai Lau <kafai@fb.com>
Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20200820190104.2885895-1-kafai@fb.com

0813a841

bpf: tcp: Add bpf_skops_hdr_opt_len() and bpf_skops_write_hdr_opt() · 331fca43

由 Martin KaFai Lau 提交于 8月 20, 2020

The bpf prog needs to parse the SYN header to learn what options have
been sent by the peer's bpf-prog before writing its options into SYNACK.
This patch adds a "syn_skb" arg to tcp_make_synack() and send_synack().
This syn_skb will eventually be made available (as read-only) to the
bpf prog.  This will be the only SYN packet available to the bpf
prog during syncookie.  For other regular cases, the bpf prog can
also use the saved_syn.

When writing options, the bpf prog will first be called to tell the
kernel its required number of bytes.  It is done by the new
bpf_skops_hdr_opt_len().  The bpf prog will only be called when the new
BPF_SOCK_OPS_WRITE_HDR_OPT_CB_FLAG is set in tp->bpf_sock_ops_cb_flags.
When the bpf prog returns, the kernel will know how many bytes are needed
and then update the "*remaining" arg accordingly.  4 byte alignment will
be included in the "*remaining" before this function returns.  The 4 byte
aligned number of bytes will also be stored into the opts->bpf_opt_len.
"bpf_opt_len" is a newly added member to the struct tcp_out_options.

Then the new bpf_skops_write_hdr_opt() will call the bpf prog to write the
header options.  The bpf prog is only called if it has reserved spaces
before (opts->bpf_opt_len > 0).

The bpf prog is the last one getting a chance to reserve header space
and writing the header option.

These two functions are half implemented to highlight the changes in
TCP stack.  The actual codes preparing the bpf running context and
invoking the bpf prog will be added in the later patch with other
necessary bpf pieces.
Signed-off-by: NMartin KaFai Lau <kafai@fb.com>
Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
Reviewed-by: NEric Dumazet <edumazet@google.com>
Link: https://lore.kernel.org/bpf/20200820190052.2885316-1-kafai@fb.com

331fca43

bpf: tcp: Add bpf_skops_parse_hdr() · 00d211a4

由 Martin KaFai Lau 提交于 8月 20, 2020

The patch adds a function bpf_skops_parse_hdr().
It will call the bpf prog to parse the TCP header received at
a tcp_sock that has at least reached the ESTABLISHED state.

For the packets received during the 3WHS (SYN, SYNACK and ACK),
the received skb will be available to the bpf prog during the callback
in bpf_skops_established() introduced in the previous patch and
in the bpf_skops_write_hdr_opt() that will be added in the
next patch.

Calling bpf prog to parse header is controlled by two new flags in
tp->bpf_sock_ops_cb_flags:
BPF_SOCK_OPS_PARSE_UNKNOWN_HDR_OPT_CB_FLAG and
BPF_SOCK_OPS_PARSE_ALL_HDR_OPT_CB_FLAG.

When BPF_SOCK_OPS_PARSE_UNKNOWN_HDR_OPT_CB_FLAG is set,
the bpf prog will only be called when there is unknown
option in the TCP header.

When BPF_SOCK_OPS_PARSE_ALL_HDR_OPT_CB_FLAG is set,
the bpf prog will be called on all received TCP header.

This function is half implemented to highlight the changes in
TCP stack.  The actual codes preparing the bpf running context and
invoking the bpf prog will be added in the later patch with other
necessary bpf pieces.
Signed-off-by: NMartin KaFai Lau <kafai@fb.com>
Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
Reviewed-by: NEric Dumazet <edumazet@google.com>
Link: https://lore.kernel.org/bpf/20200820190046.2885054-1-kafai@fb.com

00d211a4

tcp: bpf: Add TCP_BPF_RTO_MIN for bpf_setsockopt · ca584ba0

由 Martin KaFai Lau 提交于 8月 20, 2020

This patch adds bpf_setsockopt(TCP_BPF_RTO_MIN) to allow bpf prog
to set the min rto of a connection.  It could be used together
with the earlier patch which has added bpf_setsockopt(TCP_BPF_DELACK_MAX).

A later selftest patch will communicate the max delay ack in a
bpf tcp header option and then the receiving side can use
bpf_setsockopt(TCP_BPF_RTO_MIN) to set a shorter rto.
Signed-off-by: NMartin KaFai Lau <kafai@fb.com>
Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
Reviewed-by: NEric Dumazet <edumazet@google.com>
Acked-by: NJohn Fastabend <john.fastabend@gmail.com>
Link: https://lore.kernel.org/bpf/20200820190027.2884170-1-kafai@fb.com

ca584ba0

tcp: bpf: Add TCP_BPF_DELACK_MAX setsockopt · 2b8ee4f0

由 Martin KaFai Lau 提交于 8月 20, 2020

This change is mostly from an internal patch and adapts it from sysctl
config to the bpf_setsockopt setup.

The bpf_prog can set the max delay ack by using
bpf_setsockopt(TCP_BPF_DELACK_MAX).  This max delay ack can be communicated
to its peer through bpf header option.  The receiving peer can then use
this max delay ack and set a potentially lower rto by using
bpf_setsockopt(TCP_BPF_RTO_MIN) which will be introduced
in the next patch.

Another later selftest patch will also use it like the above to show
how to write and parse bpf tcp header option.
Signed-off-by: NMartin KaFai Lau <kafai@fb.com>
Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
Reviewed-by: NEric Dumazet <edumazet@google.com>
Acked-by: NJohn Fastabend <john.fastabend@gmail.com>
Link: https://lore.kernel.org/bpf/20200820190021.2884000-1-kafai@fb.com

2b8ee4f0

22 8月, 2020 1 次提交

bpf: Implement link_query for bpf iterators · 6b0a249a

由 Yonghong Song 提交于 8月 21, 2020

This patch implemented bpf_link callback functions
show_fdinfo and fill_link_info to support link_query
interface.

The general interface for show_fdinfo and fill_link_info
will print/fill the target_name. Each targets can
register show_fdinfo and fill_link_info callbacks
to print/fill more target specific information.

For example, the below is a fdinfo result for a bpf
task iterator.
  $ cat /proc/1749/fdinfo/7
  pos:    0
  flags:  02000000
  mnt_id: 14
  link_type:      iter
  link_id:        11
  prog_tag:       990e1f8152f7e54f
  prog_id:        59
  target_name:    task
Signed-off-by: NYonghong Song <yhs@fb.com>
Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20200821184418.574122-1-yhs@fb.com

6b0a249a

12 8月, 2020 2 次提交

tools headers UAPI: Sync kvm.h headers with the kernel sources · 6016e034

由 Arnaldo Carvalho de Melo 提交于 8月 12, 2020

To pick the changes in:

  3edd6839 ("KVM: x86: Add a capability for GUEST_MAXPHYADDR < HOST_MAXPHYADDR support")
  1aa561b1 ("kvm: x86: Add "last CPU" to some KVM_EXIT information")
  23a60f83 ("s390/kvm: diagnose 0x318 sync and reset")

That do not result in any change in tooling, as the additions are not
being used in any table generator.

This silences these perf build warning:

  Warning: Kernel ABI header at 'tools/include/uapi/linux/kvm.h' differs from latest version at 'include/uapi/linux/kvm.h'
  diff -u tools/include/uapi/linux/kvm.h include/uapi/linux/kvm.h

Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Christian Borntraeger <borntraeger@de.ibm.com>
Cc: Collin Walling <walling@linux.ibm.com>
Cc: Jim Mattson <jmattson@google.com>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Mohammed Gamal <mgamal@redhat.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: NArnaldo Carvalho de Melo <acme@redhat.com>

6016e034

tools include UAPI: Sync linux/vhost.h with the kernel sources · fe452fb8

由 Arnaldo Carvalho de Melo 提交于 8月 12, 2020

To get the changes in:

  25abc060 ("vhost-vdpa: support IOTLB batching hints")

This doesn't result in any changes in tooling, no new ioctls to be
picked up by the id->string table generators, etc.

Silencing this perf build warning:

  Warning: Kernel ABI header at 'tools/include/uapi/linux/vhost.h' differs from latest version at 'include/uapi/linux/vhost.h'
  diff -u tools/include/uapi/linux/vhost.h include/uapi/linux/vhost.h

Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Jason Wang <jasowang@redhat.com>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Michael S. Tsirkin <mst@redhat.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Signed-off-by: NArnaldo Carvalho de Melo <acme@redhat.com>

fe452fb8

07 8月, 2020 2 次提交

tools headers UAPI: update linux/in.h copy · 7a36b9d2

由 Arnaldo Carvalho de Melo 提交于 8月 07, 2020

To get the changes from:

  eba75c58 ("icmp: support rfc 4884")

That don't cause any changes in tooling, as we still don't have a
[gs]etsockopt 'level' beautifier, will try and have one soon.

This silences this tools/perf build warning:

  Warning: Kernel ABI header at 'tools/include/uapi/linux/in.h' differs from latest version at 'include/uapi/linux/in.h'
  diff -u tools/include/uapi/linux/in.h include/uapi/linux/in.h

Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: David S. Miller <davem@davemloft.net>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Willem de Bruijn <willemb@google.com>
Signed-off-by: NArnaldo Carvalho de Melo <acme@redhat.com>

7a36b9d2

tools/bpf: Support new uapi for map element bpf iterator · 74fc097d

由 Yonghong Song 提交于 8月 04, 2020

Previous commit adjusted kernel uapi for map
element bpf iterator. This patch adjusted libbpf API
due to uapi change. bpftool and bpf_iter selftests
are also changed accordingly.
Signed-off-by: NYonghong Song <yhs@fb.com>
Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
Acked-by: NAndrii Nakryiko <andriin@fb.com>
Acked-by: NJohn Fastabend <john.fastabend@gmail.com>
Link: https://lore.kernel.org/bpf/20200805055058.1457623-1-yhs@fb.com

74fc097d

02 8月, 2020 1 次提交

libbpf: Add bpf_link detach APIs · 2e49527e

由 Andrii Nakryiko 提交于 7月 31, 2020

Add low-level bpf_link_detach() API. Also add higher-level bpf_link__detach()
one.
Signed-off-by: NAndrii Nakryiko <andriin@fb.com>
Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
Acked-by: NSong Liu <songliubraving@fb.com>
Acked-by: NJohn Fastabend <john.fastabend@gmail.com>
Link: https://lore.kernel.org/bpf/20200731182830.286260-3-andriin@fb.com

2e49527e

28 7月, 2020 1 次提交

bpf: Fix bpf_ringbuf_output() signature to return long · e1613b57

由 Andrii Nakryiko 提交于 7月 27, 2020

Due to bpf tree fix merge, bpf_ringbuf_output() signature ended up with int as
a return type, while all other helpers got converted to returning long. So fix
it in bpf-next now.

Fixes: b0659d8a ("bpf: Fix definition of bpf_ringbuf_output() helper in UAPI comments")
Signed-off-by: NAndrii Nakryiko <andriin@fb.com>
Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
Acked-by: NSong Liu <songliubraving@fb.com>
Link: https://lore.kernel.org/bpf/20200727224715.652037-1-andriin@fb.com

e1613b57

26 7月, 2020 2 次提交

libbpf: Add support for BPF XDP link · dc8698ca

由 Andrii Nakryiko 提交于 7月 21, 2020

Sync UAPI header and add support for using bpf_link-based XDP attachment.
Make xdp/ prog type set expected attach type. Kernel didn't enforce
attach_type for XDP programs before, so there is no backwards compatiblity
issues there.

Also fix section_names selftest to recognize that xdp prog types now have
expected attach type.
Signed-off-by: NAndrii Nakryiko <andriin@fb.com>
Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20200722064603.3350758-8-andriin@fb.com

dc8698ca

bpf: Implement bpf iterator for map elements · a5cbe05a

由 Yonghong Song 提交于 7月 23, 2020

The bpf iterator for map elements are implemented.
The bpf program will receive four parameters:
  bpf_iter_meta *meta: the meta data
  bpf_map *map:        the bpf_map whose elements are traversed
  void *key:           the key of one element
  void *value:         the value of the same element

Here, meta and map pointers are always valid, and
key has register type PTR_TO_RDONLY_BUF_OR_NULL and
value has register type PTR_TO_RDWR_BUF_OR_NULL.
The kernel will track the access range of key and value
during verification time. Later, these values will be compared
against the values in the actual map to ensure all accesses
are within range.

A new field iter_seq_info is added to bpf_map_ops which
is used to add map type specific information, i.e., seq_ops,
init/fini seq_file func and seq_file private data size.
Subsequent patches will have actual implementation
for bpf_map_ops->iter_seq_info.

In user space, BPF_ITER_LINK_MAP_FD needs to be
specified in prog attr->link_create.flags, which indicates
that attr->link_create.target_fd is a map_fd.
The reason for such an explicit flag is for possible
future cases where one bpf iterator may allow more than
one possible customization, e.g., pid and cgroup id for
task_file.

Current kernel internal implementation only allows
the target to register at most one required bpf_iter_link_info.
To support the above case, optional bpf_iter_link_info's
are needed, the target can be extended to register such link
infos, and user provided link_info needs to match one of
target supported ones.
Signed-off-by: NYonghong Song <yhs@fb.com>
Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20200723184112.590360-1-yhs@fb.com

a5cbe05a

21 7月, 2020 1 次提交

tools: bpf: Use local copy of headers including uapi/linux/filter.h · f143c11b

由 Will Deacon 提交于 11月 21, 2019

Pulling header files directly out of the kernel sources for inclusion in
userspace programs is highly error prone, not least because it bypasses
the kbuild infrastructure entirely and so may end up referencing other
header files that have not been generated.

Subsequent patches will cause compiler.h to pull in the ungenerated
asm/rwonce.h file via filter.h, breaking the build for tools/bpf:

  | $ make -C tools/bpf
  | make: Entering directory '/linux/tools/bpf'
  |   CC       bpf_jit_disasm.o
  |   LINK     bpf_jit_disasm
  |   CC       bpf_dbg.o
  | In file included from /linux/include/uapi/linux/filter.h:9,
  |                  from /linux/tools/bpf/bpf_dbg.c:41:
  | /linux/include/linux/compiler.h:247:10: fatal error: asm/rwonce.h: No such file or directory
  |  #include <asm/rwonce.h>
  |           ^~~~~~~~~~~~~~
  | compilation terminated.
  | make: *** [Makefile:61: bpf_dbg.o] Error 1
  | make: Leaving directory '/linux/tools/bpf'

Take a copy of the installed version of linux/filter.h  (i.e. the one
created by the 'headers_install' target) into tools/include/uapi/linux/
and adjust the BPF tool Makefile to reference the local include
directories instead of those in the main source tree.

Cc: Masahiro Yamada <yamada.masahiro@socionext.com>
Acked-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: NAlexei Starovoitov <ast@kernel.org>
Suggested-by: NDaniel Borkmann <daniel@iogearbox.net>
Reported-by: NXiao Yang <ice_yangxiao@163.com>
Signed-off-by: NWill Deacon <will@kernel.org>

f143c11b

20 7月, 2020 1 次提交

tools headers UAPI: Update tools's copy of linux/perf_event.h · 5271d915

由 Leo Yan 提交于 7月 16, 2020

To get the changes in the commit:

  "perf: Add perf_event_mmap_page::cap_user_time_short ABI"

This update is a prerequisite to add support for short clock counters
related ABI extension.
Signed-off-by: NLeo Yan <leo.yan@linaro.org>
Link: https://lore.kernel.org/r/20200716051130.4359-8-leo.yan@linaro.orgSigned-off-by: NWill Deacon <will@kernel.org>

5271d915

18 7月, 2020 1 次提交

bpf: Sync linux/bpf.h to tools/ · a352b32a

由 Jakub Sitnicki 提交于 7月 17, 2020

Newly added program, context type and helper is used by tests in a
subsequent patch. Synchronize the header file.
Signed-off-by: NJakub Sitnicki <jakub@cloudflare.com>
Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
Acked-by: NAndrii Nakryiko <andriin@fb.com>
Link: https://lore.kernel.org/bpf/20200717103536.397595-12-jakub@cloudflare.com

a352b32a

17 7月, 2020 1 次提交

bpf: Drop duplicated words in uapi helper comments · bfdfa517

由 Randy Dunlap 提交于 7月 15, 2020

Drop doubled words "will" and "attach".
Signed-off-by: NRandy Dunlap <rdunlap@infradead.org>
Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/6b9f71ae-4f8e-0259-2c5d-187ddaefe6eb@infradead.org

bfdfa517

16 7月, 2020 2 次提交

bpf: cpumap: Add the possibility to attach an eBPF program to cpumap · 92164774

由 Lorenzo Bianconi 提交于 7月 14, 2020

Introduce the capability to attach an eBPF program to cpumap entries.
The idea behind this feature is to add the possibility to define on
which CPU run the eBPF program if the underlying hw does not support
RSS. Current supported verdicts are XDP_DROP and XDP_PASS.

This patch has been tested on Marvell ESPRESSObin using xdp_redirect_cpu
sample available in the kernel tree to identify possible performance
regressions. Results show there are no observable differences in
packet-per-second:

$./xdp_redirect_cpu --progname xdp_cpu_map0 --dev eth0 --cpu 1
rx: 354.8 Kpps
rx: 356.0 Kpps
rx: 356.8 Kpps
rx: 356.3 Kpps
rx: 356.6 Kpps
rx: 356.6 Kpps
rx: 356.7 Kpps
rx: 355.8 Kpps
rx: 356.8 Kpps
rx: 356.8 Kpps
Co-developed-by: NJesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: NJesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: NLorenzo Bianconi <lorenzo@kernel.org>
Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
Acked-by: NJesper Dangaard Brouer <brouer@redhat.com>
Link: https://lore.kernel.org/bpf/5c9febdf903d810b3415732e5cd98491d7d9067a.1594734381.git.lorenzo@kernel.org

92164774

cpumap: Formalize map value as a named struct · 644bfe51

由 Lorenzo Bianconi 提交于 7月 14, 2020

As it has been already done for devmap, introduce 'struct bpf_cpumap_val'
to formalize the expected values that can be passed in for a CPUMAP.
Update cpumap code to use the struct.
Signed-off-by: NLorenzo Bianconi <lorenzo@kernel.org>
Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
Acked-by: NJesper Dangaard Brouer <brouer@redhat.com>
Link: https://lore.kernel.org/bpf/754f950674665dae6139c061d28c1d982aaf4170.1594734381.git.lorenzo@kernel.org

644bfe51

15 7月, 2020 1 次提交

net: bridge: Add port attribute IFLA_BRPORT_MRP_IN_OPEN · ffb3adba

由 Horatiu Vultur 提交于 7月 14, 2020

This patch adds a new port attribute, IFLA_BRPORT_MRP_IN_OPEN, which
allows to notify the userspace when the node lost the contiuity of
MRP_InTest frames.
Signed-off-by: NHoratiu Vultur <horatiu.vultur@microchip.com>
Acked-by: NNikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

ffb3adba

14 7月, 2020 1 次提交

xsk: Add new statistics · 8aa5a335

由 Ciara Loftus 提交于 7月 08, 2020

It can be useful for the user to know the reason behind a dropped packet.
Introduce new counters which track drops on the receive path caused by:
1. rx ring being full
2. fill ring being empty

Also, on the tx path introduce a counter which tracks the number of times
we attempt pull from the tx ring when it is empty.
Signed-off-by: NCiara Loftus <ciara.loftus@intel.com>
Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20200708072835.4427-2-ciara.loftus@intel.com

8aa5a335

10 7月, 2020 2 次提交

perf tools: Add support for PERF_RECORD_KSYMBOL_TYPE_OOL · 789e2419

由 Adrian Hunter 提交于 5月 12, 2020

PERF_RECORD_KSYMBOL_TYPE_OOL marks an executable page. Create a map
backed only by memory, which will be populated as necessary by text poke
events.

Committer notes:

From the patch:

OOL stands for "Out of line" code such as kprobe-replaced instructions
or optimized kprobes or ftrace trampolines.
Signed-off-by: NAdrian Hunter <adrian.hunter@intel.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Leo Yan <leo.yan@linaro.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mathieu Poirier <mathieu.poirier@linaro.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt (VMware) <rostedt@goodmis.org>
Cc: x86@kernel.org
Link: http://lore.kernel.org/lkml/20200512121922.8997-13-adrian.hunter@intel.comSigned-off-by: NArnaldo Carvalho de Melo <acme@redhat.com>

789e2419

perf tools: Add support for PERF_RECORD_TEXT_POKE · 246eba8e

由 Adrian Hunter 提交于 5月 12, 2020

Add processing for PERF_RECORD_TEXT_POKE events. When a text poke event
is processed, then the kernel dso data cache is updated with the poked
bytes.
Signed-off-by: NAdrian Hunter <adrian.hunter@intel.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Leo Yan <leo.yan@linaro.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mathieu Poirier <mathieu.poirier@linaro.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt (VMware) <rostedt@goodmis.org>
Cc: x86@kernel.org
Link: http://lore.kernel.org/lkml/20200512121922.8997-12-adrian.hunter@intel.comSigned-off-by: NArnaldo Carvalho de Melo <acme@redhat.com>

246eba8e

08 7月, 2020 1 次提交

libbpf: Add support for BPF_CGROUP_INET_SOCK_RELEASE · e8b012e9

由 Stanislav Fomichev 提交于 7月 06, 2020

Add auto-detection for the cgroup/sock_release programs.
Signed-off-by: NStanislav Fomichev <sdf@google.com>
Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
Acked-by: NAndrii Nakryiko <andriin@fb.com>
Link: https://lore.kernel.org/bpf/20200706230128.4073544-3-sdf@google.com

e8b012e9

01 7月, 2020 1 次提交

bpf: Introduce helper bpf_get_task_stack() · fa28dcb8

由 Song Liu 提交于 6月 29, 2020

Introduce helper bpf_get_task_stack(), which dumps stack trace of given
task. This is different to bpf_get_stack(), which gets stack track of
current task. One potential use case of bpf_get_task_stack() is to call
it from bpf_iter__task and dump all /proc/<pid>/stack to a seq_file.

bpf_get_task_stack() uses stack_trace_save_tsk() instead of
get_perf_callchain() for kernel stack. The benefit of this choice is that
stack_trace_save_tsk() doesn't require changes in arch/. The downside of
using stack_trace_save_tsk() is that stack_trace_save_tsk() dumps the
stack trace to unsigned long array. For 32-bit systems, we need to
translate it to u64 array.
Signed-off-by: NSong Liu <songliubraving@fb.com>
Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
Acked-by: NAndrii Nakryiko <andriin@fb.com>
Link: https://lore.kernel.org/bpf/20200630062846.664389-3-songliubraving@fb.com

fa28dcb8

25 6月, 2020 4 次提交

bpf: Add bpf_skc_to_udp6_sock() helper · 0d4fad3e

由 Yonghong Song 提交于 6月 23, 2020

The helper is used in tracing programs to cast a socket
pointer to a udp6_sock pointer.
The return value could be NULL if the casting is illegal.
Signed-off-by: NYonghong Song <yhs@fb.com>
Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
Acked-by: NMartin KaFai Lau <kafai@fb.com>
Cc: Eric Dumazet <edumazet@google.com>
Link: https://lore.kernel.org/bpf/20200623230815.3988481-1-yhs@fb.com

0d4fad3e

bpf: Add bpf_skc_to_{tcp, tcp_timewait, tcp_request}_sock() helpers · 478cfbdf

由 Yonghong Song 提交于 6月 23, 2020

Three more helpers are added to cast a sock_common pointer to
an tcp_sock, tcp_timewait_sock or a tcp_request_sock for
tracing programs.
Signed-off-by: NYonghong Song <yhs@fb.com>
Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
Acked-by: NMartin KaFai Lau <kafai@fb.com>
Link: https://lore.kernel.org/bpf/20200623230811.3988277-1-yhs@fb.com

478cfbdf

bpf: Add bpf_skc_to_tcp6_sock() helper · af7ec138

由 Yonghong Song 提交于 6月 23, 2020

The helper is used in tracing programs to cast a socket
pointer to a tcp6_sock pointer.
The return value could be NULL if the casting is illegal.

A new helper return type RET_PTR_TO_BTF_ID_OR_NULL is added
so the verifier is able to deduce proper return types for the helper.

Different from the previous BTF_ID based helpers,
the bpf_skc_to_tcp6_sock() argument can be several possible
btf_ids. More specifically, all possible socket data structures
with sock_common appearing in the first in the memory layout.
This patch only added socket types related to tcp and udp.

All possible argument btf_id and return value btf_id
for helper bpf_skc_to_tcp6_sock() are pre-calculcated and
cached. In the future, it is even possible to precompute
these btf_id's at kernel build time.
Signed-off-by: NYonghong Song <yhs@fb.com>
Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
Acked-by: NAndrii Nakryiko <andriin@fb.com>
Acked-by: NMartin KaFai Lau <kafai@fb.com>
Link: https://lore.kernel.org/bpf/20200623230809.3988195-1-yhs@fb.com

af7ec138

bpf: Add SO_KEEPALIVE and related options to bpf_setsockopt · f9bcf968

由 Dmitry Yakunin 提交于 6月 20, 2020

This patch adds support of SO_KEEPALIVE flag and TCP related options
to bpf_setsockopt() routine. This is helpful if we want to enable or tune
TCP keepalive for applications which don't do it in the userspace code.

v3:
  - update kernel-doc in uapi (Nikita Vetoshkin <nekto0n@yandex-team.ru>)

v4:
  - update kernel-doc in tools too (Alexei Starovoitov)
  - add test to selftests (Alexei Starovoitov)
Signed-off-by: NDmitry Yakunin <zeil@yandex-team.ru>
Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
Acked-by: NMartin KaFai Lau <kafai@fb.com>
Link: https://lore.kernel.org/bpf/20200620153052.9439-3-zeil@yandex-team.ru

f9bcf968

24 6月, 2020 2 次提交

bpf: Fix formatting in documentation for BPF helpers · bcc7f554

由 Quentin Monnet 提交于 6月 23, 2020

When producing the bpf-helpers.7 man page from the documentation from
the BPF user space header file, rst2man complains:

    <stdin>:2636: (ERROR/3) Unexpected indentation.
    <stdin>:2640: (WARNING/2) Block quote ends without a blank line; unexpected unindent.

Let's fix formatting for the relevant chunk (item list in
bpf_ringbuf_query()'s description), and for a couple other functions.
Signed-off-by: NQuentin Monnet <quentin@isovalent.com>
Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
Acked-by: NAndrii Nakryiko <andriin@fb.com>
Link: https://lore.kernel.org/bpf/20200623153935.6215-1-quentin@isovalent.com

bcc7f554

bpf: Switch most helper return values from 32-bit int to 64-bit long · bdb7b79b

由 Andrii Nakryiko 提交于 6月 22, 2020

Switch most of BPF helper definitions from returning int to long. These
definitions are coming from comments in BPF UAPI header and are used to
generate bpf_helper_defs.h (under libbpf) to be later included and used from
BPF programs.

In actual in-kernel implementation, all the helpers are defined as returning
u64, but due to some historical reasons, most of them are actually defined as
returning int in UAPI (usually, to return 0 on success, and negative value on
error).

This actually causes Clang to quite often generate sub-optimal code, because
compiler believes that return value is 32-bit, and in a lot of cases has to be
up-converted (usually with a pair of 32-bit bit shifts) to 64-bit values,
before they can be used further in BPF code.

Besides just "polluting" the code, these 32-bit shifts quite often cause
problems for cases in which return value matters. This is especially the case
for the family of bpf_probe_read_str() functions. There are few other similar
helpers (e.g., bpf_read_branch_records()), in which return value is used by
BPF program logic to record variable-length data and process it. For such
cases, BPF program logic carefully manages offsets within some array or map to
read variable-length data. For such uses, it's crucial for BPF verifier to
track possible range of register values to prove that all the accesses happen
within given memory bounds. Those extraneous zero-extending bit shifts,
inserted by Clang (and quite often interleaved with other code, which makes
the issues even more challenging and sometimes requires employing extra
per-variable compiler barriers), throws off verifier logic and makes it mark
registers as having unknown variable offset. We'll study this pattern a bit
later below.

Another common pattern is to check return of BPF helper for non-zero state to
detect error conditions and attempt alternative actions in such case. Even in
this simple and straightforward case, this 32-bit vs BPF's native 64-bit mode
quite often leads to sub-optimal and unnecessary extra code. We'll look at
this pattern as well.

Clang's BPF target supports two modes of code generation: ALU32, in which it
is capable of using lower 32-bit parts of registers, and no-ALU32, in which
only full 64-bit registers are being used. ALU32 mode somewhat mitigates the
above described problems, but not in all cases.

This patch switches all the cases in which BPF helpers return 0 or negative
error from returning int to returning long. It is shown below that such change
in definition leads to equivalent or better code. No-ALU32 mode benefits more,
but ALU32 mode doesn't degrade or still gets improved code generation.

Another class of cases switched from int to long are bpf_probe_read_str()-like
helpers, which encode successful case as non-negative values, while still
returning negative value for errors.

In all of such cases, correctness is preserved due to two's complement
encoding of negative values and the fact that all helpers return values with
32-bit absolute value. Two's complement ensures that for negative values
higher 32 bits are all ones and when truncated, leave valid negative 32-bit
value with the same value. Non-negative values have upper 32 bits set to zero
and similarly preserve value when high 32 bits are truncated. This means that
just casting to int/u32 is correct and efficient (and in ALU32 mode doesn't
require any extra shifts).

To minimize the chances of regressions, two code patterns were investigated,
as mentioned above. For both patterns, BPF assembly was analyzed in
ALU32/NO-ALU32 compiler modes, both with current 32-bit int return type and
new 64-bit long return type.

Case 1. Variable-length data reading and concatenation. This is quite
ubiquitous pattern in tracing/monitoring applications, reading data like
process's environment variables, file path, etc. In such case, many pieces of
string-like variable-length data are read into a single big buffer, and at the
end of the process, only a part of array containing actual data is sent to
user-space for further processing. This case is tested in test_varlen.c
selftest (in the next patch). Code flow is roughly as follows:

  void *payload = &sample->payload;
  u64 len;

  len = bpf_probe_read_kernel_str(payload, MAX_SZ1, &source_data1);
  if (len <= MAX_SZ1) {
      payload += len;
      sample->len1 = len;
  }
  len = bpf_probe_read_kernel_str(payload, MAX_SZ2, &source_data2);
  if (len <= MAX_SZ2) {
      payload += len;
      sample->len2 = len;
  }
  /* and so on */
  sample->total_len = payload - &sample->payload;
  /* send over, e.g., perf buffer */

There could be two variations with slightly different code generated: when len
is 64-bit integer and when it is 32-bit integer. Both variations were analysed.
BPF assembly instructions between two successive invocations of
bpf_probe_read_kernel_str() were used to check code regressions. Results are
below, followed by short analysis. Left side is using helpers with int return
type, the right one is after the switch to long.

ALU32 + INT                                ALU32 + LONG
===========                                ============

64-BIT (13 insns):                         64-BIT (10 insns):
------------------------------------       ------------------------------------
  17:   call 115                             17:   call 115
  18:   if w0 > 256 goto +9 <LBB0_4>         18:   if r0 > 256 goto +6 <LBB0_4>
  19:   w1 = w0                              19:   r1 = 0 ll
  20:   r1 <<= 32                            21:   *(u64 *)(r1 + 0) = r0
  21:   r1 s>>= 32                           22:   r6 = 0 ll
  22:   r2 = 0 ll                            24:   r6 += r0
  24:   *(u64 *)(r2 + 0) = r1              00000000000000c8 <LBB0_4>:
  25:   r6 = 0 ll                            25:   r1 = r6
  27:   r6 += r1                             26:   w2 = 256
00000000000000e0 <LBB0_4>:                   27:   r3 = 0 ll
  28:   r1 = r6                              29:   call 115
  29:   w2 = 256
  30:   r3 = 0 ll
  32:   call 115

32-BIT (11 insns):                         32-BIT (12 insns):
------------------------------------       ------------------------------------
  17:   call 115                             17:   call 115
  18:   if w0 > 256 goto +7 <LBB1_4>         18:   if w0 > 256 goto +8 <LBB1_4>
  19:   r1 = 0 ll                            19:   r1 = 0 ll
  21:   *(u32 *)(r1 + 0) = r0                21:   *(u32 *)(r1 + 0) = r0
  22:   w1 = w0                              22:   r0 <<= 32
  23:   r6 = 0 ll                            23:   r0 >>= 32
  25:   r6 += r1                             24:   r6 = 0 ll
00000000000000d0 <LBB1_4>:                   26:   r6 += r0
  26:   r1 = r6                            00000000000000d8 <LBB1_4>:
  27:   w2 = 256                             27:   r1 = r6
  28:   r3 = 0 ll                            28:   w2 = 256
  30:   call 115                             29:   r3 = 0 ll
                                             31:   call 115

In ALU32 mode, the variant using 64-bit length variable clearly wins and
avoids unnecessary zero-extension bit shifts. In practice, this is even more
important and good, because BPF code won't need to do extra checks to "prove"
that payload/len are within good bounds.

32-bit len is one instruction longer. Clang decided to do 64-to-32 casting
with two bit shifts, instead of equivalent `w1 = w0` assignment. The former
uses extra register. The latter might potentially lose some range information,
but not for 32-bit value. So in this case, verifier infers that r0 is [0, 256]
after check at 18:, and shifting 32 bits left/right keeps that range intact.
We should probably look into Clang's logic and see why it chooses bitshifts
over sub-register assignments for this.

NO-ALU32 + INT                             NO-ALU32 + LONG
==============                             ===============

64-BIT (14 insns):                         64-BIT (10 insns):
------------------------------------       ------------------------------------
  17:   call 115                             17:   call 115
  18:   r0 <<= 32                            18:   if r0 > 256 goto +6 <LBB0_4>
  19:   r1 = r0                              19:   r1 = 0 ll
  20:   r1 >>= 32                            21:   *(u64 *)(r1 + 0) = r0
  21:   if r1 > 256 goto +7 <LBB0_4>         22:   r6 = 0 ll
  22:   r0 s>>= 32                           24:   r6 += r0
  23:   r1 = 0 ll                          00000000000000c8 <LBB0_4>:
  25:   *(u64 *)(r1 + 0) = r0                25:   r1 = r6
  26:   r6 = 0 ll                            26:   r2 = 256
  28:   r6 += r0                             27:   r3 = 0 ll
00000000000000e8 <LBB0_4>:                   29:   call 115
  29:   r1 = r6
  30:   r2 = 256
  31:   r3 = 0 ll
  33:   call 115

32-BIT (13 insns):                         32-BIT (13 insns):
------------------------------------       ------------------------------------
  17:   call 115                             17:   call 115
  18:   r1 = r0                              18:   r1 = r0
  19:   r1 <<= 32                            19:   r1 <<= 32
  20:   r1 >>= 32                            20:   r1 >>= 32
  21:   if r1 > 256 goto +6 <LBB1_4>         21:   if r1 > 256 goto +6 <LBB1_4>
  22:   r2 = 0 ll                            22:   r2 = 0 ll
  24:   *(u32 *)(r2 + 0) = r0                24:   *(u32 *)(r2 + 0) = r0
  25:   r6 = 0 ll                            25:   r6 = 0 ll
  27:   r6 += r1                             27:   r6 += r1
00000000000000e0 <LBB1_4>:                 00000000000000e0 <LBB1_4>:
  28:   r1 = r6                              28:   r1 = r6
  29:   r2 = 256                             29:   r2 = 256
  30:   r3 = 0 ll                            30:   r3 = 0 ll
  32:   call 115                             32:   call 115

In NO-ALU32 mode, for the case of 64-bit len variable, Clang generates much
superior code, as expected, eliminating unnecessary bit shifts. For 32-bit
len, code is identical.

So overall, only ALU-32 32-bit len case is more-or-less equivalent and the
difference stems from internal Clang decision, rather than compiler lacking
enough information about types.

Case 2. Let's look at the simpler case of checking return result of BPF helper
for errors. The code is very simple:

  long bla;
  if (bpf_probe_read_kenerl(&bla, sizeof(bla), 0))
      return 1;
  else
      return 0;

ALU32 + CHECK (9 insns)                    ALU32 + CHECK (9 insns)
====================================       ====================================
  0:    r1 = r10                             0:    r1 = r10
  1:    r1 += -8                             1:    r1 += -8
  2:    w2 = 8                               2:    w2 = 8
  3:    r3 = 0                               3:    r3 = 0
  4:    call 113                             4:    call 113
  5:    w1 = w0                              5:    r1 = r0
  6:    w0 = 1                               6:    w0 = 1
  7:    if w1 != 0 goto +1 <LBB2_2>          7:    if r1 != 0 goto +1 <LBB2_2>
  8:    w0 = 0                               8:    w0 = 0
0000000000000048 <LBB2_2>:                 0000000000000048 <LBB2_2>:
  9:    exit                                 9:    exit

Almost identical code, the only difference is the use of full register
assignment (r1 = r0) vs half-registers (w1 = w0) in instruction #5. On 32-bit
architectures, new BPF assembly might be slightly less optimal, in theory. But
one can argue that's not a big issue, given that use of full registers is
still prevalent (e.g., for parameter passing).

NO-ALU32 + CHECK (11 insns)                NO-ALU32 + CHECK (9 insns)
====================================       ====================================
  0:    r1 = r10                             0:    r1 = r10
  1:    r1 += -8                             1:    r1 += -8
  2:    r2 = 8                               2:    r2 = 8
  3:    r3 = 0                               3:    r3 = 0
  4:    call 113                             4:    call 113
  5:    r1 = r0                              5:    r1 = r0
  6:    r1 <<= 32                            6:    r0 = 1
  7:    r1 >>= 32                            7:    if r1 != 0 goto +1 <LBB2_2>
  8:    r0 = 1                               8:    r0 = 0
  9:    if r1 != 0 goto +1 <LBB2_2>        0000000000000048 <LBB2_2>:
 10:    r0 = 0                               9:    exit
0000000000000058 <LBB2_2>:
 11:    exit

NO-ALU32 is a clear improvement, getting rid of unnecessary zero-extension bit
shifts.
Signed-off-by: NAndrii Nakryiko <andriin@fb.com>
Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20200623032224.4020118-1-andriin@fb.com

bdb7b79b

18 6月, 2020 2 次提交

tools headers UAPI: Sync linux/fs.h with the kernel sources · 0e093c77

由 Arnaldo Carvalho de Melo 提交于 6月 17, 2020

To pick the changes from:

  b383a73f ("fs/ext4: Introduce DAX inode flag")

And silence this perf build warning:

  Warning: Kernel ABI header at 'tools/include/uapi/linux/fs.h' differs from latest version at 'include/uapi/linux/fs.h'
  diff -u tools/include/uapi/linux/fs.h include/uapi/linux/fs.h

It causes various beautifiers for things like fspick, fsmount, etc (see
below) to get rebuilt, but this specific change doesn't make 'perf
trace' be capable of decoding anything new, as we still don't decode
what comes from ioctls, just its cmds.

Details about the update:

  $ cp include/uapi/linux/fs.h tools/include/uapi/linux/fs.h
  $ git diff
  diff --git a/tools/include/uapi/linux/fs.h b/tools/include/uapi/linux/fs.h
  index 379a612f8f1d..f44eb0a04afd 100644
  --- a/tools/include/uapi/linux/fs.h
  +++ b/tools/include/uapi/linux/fs.h
  @@ -262,6 +262,7 @@ struct fsxattr {
   #define FS_EA_INODE_FL                 0x00200000 /* Inode used for large EA */
   #define FS_EOFBLOCKS_FL                        0x00400000 /* Reserved for ext4 */
   #define FS_NOCOW_FL                    0x00800000 /* Do not cow file */
  +#define FS_DAX_FL                      0x02000000 /* Inode is DAX */
   #define FS_INLINE_DATA_FL              0x10000000 /* Reserved for ext4 */
   #define FS_PROJINHERIT_FL              0x20000000 /* Create with parents projid */
   #define FS_CASEFOLD_FL                 0x40000000 /* Folder is case insensitive */
  $ m
  make: Entering directory '/home/acme/git/perf/tools/perf'
    BUILD:   Doing 'make -j8' parallel build
    INSTALL  GTK UI
    CC       /tmp/build/perf/builtin-trace.o
    DESCEND  plugins
    CC       /tmp/build/perf/trace/beauty/fsmount.o
    CC       /tmp/build/perf/trace/beauty/fspick.o
    CC       /tmp/build/perf/trace/beauty/mount_flags.o
    CC       /tmp/build/perf/trace/beauty/move_mount.o
    CC       /tmp/build/perf/trace/beauty/renameat.o
    CC       /tmp/build/perf/trace/beauty/sync_file_range.o
    INSTALL  trace_plugins
    LD       /tmp/build/perf/trace/beauty/perf-in.o
    LD       /tmp/build/perf/perf-in.o
    LINK     /tmp/build/perf/perf
  <SNIP>

Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Ira Weiny <ira.weiny@intel.com>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: NArnaldo Carvalho de Melo <acme@redhat.com>

0e093c77

tools include UAPI: Sync linux/vhost.h with the kernel sources · f64925c1

由 Arnaldo Carvalho de Melo 提交于 6月 17, 2020

To get the changes in:

  776f3950 ("vhost_vdpa: Support config interrupt in vdpa")

Silencing this perf build warning:

  Warning: Kernel ABI header at 'tools/include/uapi/linux/vhost.h' differs from latest version at 'include/uapi/linux/vhost.h'
  diff -u tools/include/uapi/linux/vhost.h include/uapi/linux/vhost.h

This automatically picks the new ioctl introduced in the above patch,
making tools such as 'perf trace' aware of them and possibly allowing to
use the strings in filters, etc:

  # perf trace -e ioctl --pid 7951
  <SNIP>
     0.178 ( 0.010 ms): CPU 0/KVM/8023 ioctl(fd: 14, cmd: KVM_RUN) = 0
     0.194 ( 0.010 ms): CPU 0/KVM/8023 ioctl(fd: 14, cmd: KVM_RUN) = 0
     0.209 ( 0.010 ms): CPU 0/KVM/8023 ioctl(fd: 14, cmd: KVM_RUN) = 0
     0.224 (249.413 ms): CPU 0/KVM/8023 ioctl(fd: 14, cmd: KVM_RUN) = 0
   249.660 ( 0.011 ms): CPU 0/KVM/8023 ioctl(fd: 14, cmd: KVM_RUN) = 0
   249.675 ( 0.007 ms): CPU 0/KVM/8023 ioctl(fd: 14, cmd: KVM_RUN) = 0
   249.686 ( 0.007 ms): CPU 0/KVM/8023 ioctl(fd: 14, cmd: KVM_RUN) = 0
   249.697 ( 0.008 ms): CPU 0/KVM/8023 ioctl(fd: 14, cmd: KVM_RUN) = 0
   249.709 ( 0.007 ms): CPU 0/KVM/8023 ioctl(fd: 14, cmd: KVM_RUN) = 0
   249.720 ( 0.007 ms): CPU 0/KVM/8023 ioctl(fd: 14, cmd: KVM_RUN) = 0
   249.730 ( 0.007 ms): CPU 0/KVM/8023 ioctl(fd: 14, cmd: KVM_RUN) = 0
   249.740 ( 0.007 ms): CPU 0/KVM/8023 ioctl(fd: 14, cmd: KVM_RUN) = 0
   249.752 ( 0.007 ms): CPU 0/KVM/8023 ioctl(fd: 14, cmd: KVM_RUN) = 0
   249.762 ( 0.007 ms): CPU 0/KVM/8023 ioctl(fd: 14, cmd: KVM_RUN) = 0
   249.772 ( 0.007 ms): CPU 0/KVM/8023 ioctl(fd: 14, cmd: KVM_RUN) = 0
   249.782 (120.138 ms): CPU 0/KVM/8023 ioctl(fd: 14, cmd: KVM_RUN) = 0
   370.201 ( 0.039 ms): CPU 0/KVM/8023 ioctl(fd: 12, cmd: KVM_IRQ_LINE_STATUS, arg: 0x7f744f9e1420) = 0
   370.254 ( 0.052 ms): CPU 0/KVM/8023 ioctl(fd: 14, cmd: KVM_RUN) = 0
   370.575 ( 0.365 ms): CPU 0/KVM/8023 ioctl(fd: 14, cmd: KVM_RUN) = 0
   370.973 ( 0.028 ms): CPU 0/KVM/8023 ioctl(fd: 14, cmd: KVM_RUN) = 0
   371.015 ( 0.037 ms): CPU 0/KVM/8023 ioctl(fd: 14, cmd: KVM_RUN) = 0
   371.071 ( 0.009 ms): CPU 0/KVM/8023 ioctl(fd: 12, cmd: KVM_IRQ_LINE_STATUS, arg: 0x7f744f9e14b0) = 0
  <SNIP>
  #

Details about the update:

  $ diff -u tools/include/uapi/linux/vhost.h include/uapi/linux/vhost.h
  --- tools/include/uapi/linux/vhost.h	2020-04-16 13:19:12.056763843 -0300
  +++ include/uapi/linux/vhost.h	2020-06-17 10:04:20.532056428 -0300
  @@ -15,6 +15,8 @@
   #include <linux/types.h>
   #include <linux/ioctl.h>

  +#define VHOST_FILE_UNBIND -1
  +
   /* ioctls */

   #define VHOST_VIRTIO 0xAF
  @@ -140,4 +142,6 @@
   /* Get the max ring size. */
   #define VHOST_VDPA_GET_VRING_NUM	_IOR(VHOST_VIRTIO, 0x76, __u16)

  +/* Set event fd for config interrupt*/
  +#define VHOST_VDPA_SET_CONFIG_CALL	_IOW(VHOST_VIRTIO, 0x77, int)
   #endif
  $
  $ tools/perf/trace/beauty/vhost_virtio_ioctl.sh > before
  $ cp include/uapi/linux/vhost.h tools/include/uapi/linux/vhost.h
  $ tools/perf/trace/beauty/vhost_virtio_ioctl.sh > after
  $ diff -u before after
  --- before	2020-06-17 10:15:35.123275966 -0300
  +++ after	2020-06-17 10:15:51.812482117 -0300
  @@ -27,6 +27,7 @@
   	[0x72] = "VDPA_SET_STATUS",
   	[0x74] = "VDPA_SET_CONFIG",
   	[0x75] = "VDPA_SET_VRING_ENABLE",
  +	[0x77] = "VDPA_SET_CONFIG_CALL",
   };
   static const char *vhost_virtio_ioctl_read_cmds[] = {
   	[0x00] = "GET_FEATURES",
  $

This causes these parts to get rebuilt:

  CC       /tmp/build/perf/trace/beauty/ioctl.o
  INSTALL  trace_plugins
  LD       /tmp/build/perf/trace/beauty/perf-in.o
  LD       /tmp/build/perf/perf-in.o
  LINK     /tmp/build/perf/perf

Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Michael S. Tsirkin <mst@redhat.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Zhu Lingshan <lingshan.zhu@intel.com>
Signed-off-by: NArnaldo Carvalho de Melo <acme@redhat.com>

f64925c1

16 6月, 2020 1 次提交

bpf: Fix definition of bpf_ringbuf_output() helper in UAPI comments · b0659d8a

由 Andrii Nakryiko 提交于 6月 15, 2020

Fix definition of bpf_ringbuf_output() in UAPI header comments, which is used
to generate libbpf's bpf_helper_defs.h header. Return value is a number (error
code), not a pointer.

Fixes: 457f4436 ("bpf: Implement BPF ring buffer and verifier support for it")
Signed-off-by: NAndrii Nakryiko <andriin@fb.com>
Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20200615214926.3638836-1-andriin@fb.com

b0659d8a

openeuler / Kernel 1 年多 前同步成功

openeuler / Kernel
1 年多前同步成功