- 23 5月, 2018 4 次提交
-
-
由 Martin KaFai Lau 提交于
In "struct bpf_map_info", the name "btf_id", "btf_key_id" and "btf_value_id" could cause confusion because the "id" of "btf_id" means the BPF obj id given to the BTF object while "btf_key_id" and "btf_value_id" means the BTF type id within that BTF object. To make it clear, btf_key_id and btf_value_id are renamed to btf_key_type_id and btf_value_type_id. Suggested-by: NDaniel Borkmann <daniel@iogearbox.net> Signed-off-by: NMartin KaFai Lau <kafai@fb.com> Acked-by: NYonghong Song <yhs@fb.com> Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
-
由 Martin KaFai Lau 提交于
This patch does the followings: 1. Limit BTF_MAX_TYPES and BTF_MAX_NAME_OFFSET to 64k. We can raise it later. 2. Remove the BTF_TYPE_PARENT and BTF_STR_TBL_ELF_ID. They are currently encoded at the highest bit of a u32. It is because the current use case does not require supporting parent type (i.e type_id referring to a type in another BTF file). It also does not support referring to a string in ELF. The BTF_TYPE_PARENT and BTF_STR_TBL_ELF_ID checks are replaced by BTF_TYPE_ID_CHECK and BTF_STR_OFFSET_CHECK which are defined in btf.c instead of uapi/linux/btf.h. 3. Limit the BTF_INFO_KIND from 5 bits to 4 bits which is enough. There is unused bits headroom if we ever needed it later. 4. The root bit in BTF_INFO is also removed because it is not used in the current use case. 5. Remove BTF_INT_VARARGS since func type is not supported now. The BTF_INT_ENCODING is limited to 4 bits instead of 8 bits. The above can be added back later because the verifier ensures the unused bits are zeros. Signed-off-by: NMartin KaFai Lau <kafai@fb.com> Acked-by: NYonghong Song <yhs@fb.com> Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
-
由 Martin KaFai Lau 提交于
There are currently unused section descriptions in the btf_header. Those sections are here to support future BTF use cases. For example, the func section (func_off) is to support function signature (e.g. the BPF prog function signature). Instead of spelling out all potential sections up-front in the btf_header. This patch makes changes to btf_header such that extending it (e.g. adding a section) is possible later. The unused ones can be removed for now and they can be added back later. This patch: 1. adds a hdr_len to the btf_header. It will allow adding sections (and other info like parent_label and parent_name) later. The check is similar to the existing bpf_attr. If a user passes in a longer hdr_len, the kernel ensures the extra tailing bytes are 0. 2. allows the section order in the BTF object to be different from its sec_off order in btf_header. 3. each sec_off is followed by a sec_len. It must not have gap or overlapping among sections. The string section is ensured to be at the end due to the 4 bytes alignment requirement of the type section. The above changes will allow enough flexibility to add new sections (and other info) to the btf_header later. This patch also removes an unnecessary !err check at the end of btf_parse(). Signed-off-by: NMartin KaFai Lau <kafai@fb.com> Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
-
由 Martin KaFai Lau 提交于
This patch exposes check_uarg_tail_zero() which will be reused by a later BTF patch. Its name is changed to bpf_check_uarg_tail_zero(). Signed-off-by: NMartin KaFai Lau <kafai@fb.com> Acked-by: NYonghong Song <yhs@fb.com> Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
-
- 22 5月, 2018 4 次提交
-
-
由 David Ahern 提交于
Determine path MTU from a FIB lookup result. Logic is based on ip6_dst_mtu_forward plus lookup of nexthop exception. Add ip6_dst_mtu_forward to ipv6_stubs to handle access by core bpf code. Signed-off-by: NDavid Ahern <dsahern@gmail.com> Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
-
由 David Ahern 提交于
Determine path MTU from a FIB lookup result. Logic is a distillation of ip_dst_mtu_maybe_forward. Signed-off-by: NDavid Ahern <dsahern@gmail.com> Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
-
由 Björn Töpel 提交于
In this commit we remove the explicit ring structure from the the uapi. It is tricky for an uapi to depend on a certain L1 cache line size, since it can differ for variants of the same architecture. Now, we let the user application determine the offsets of the producer, consumer and descriptors by asking the socket via getsockopt. A typical flow would be (Rx ring): struct xdp_mmap_offsets off; struct xdp_desc *ring; u32 *prod, *cons; void *map; ... getsockopt(fd, SOL_XDP, XDP_MMAP_OFFSETS, &off, &optlen); map = mmap(NULL, off.rx.desc + NUM_DESCS * sizeof(struct xdp_desc), PROT_READ | PROT_WRITE, MAP_SHARED | MAP_POPULATE, sfd, XDP_PGOFF_RX_RING); prod = map + off.rx.producer; cons = map + off.rx.consumer; ring = map + off.rx.desc; Signed-off-by: NBjörn Töpel <bjorn.topel@intel.com> Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
-
由 Björn Töpel 提交于
Move the sxdp_flags up, avoiding a hole in the uapi structure. Signed-off-by: NBjörn Töpel <bjorn.topel@intel.com> Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
-
- 19 5月, 2018 1 次提交
-
-
由 John Fastabend 提交于
Currently sk_msg programs only have access to the raw data. However, it is often useful when building policies to have the policies specific to the socket endpoint. This allows using the socket tuple as input into filters, etc. This patch adds ctx access to the sock fields. Signed-off-by: NJohn Fastabend <john.fastabend@gmail.com> Acked-by: NMartin KaFai Lau <kafai@fb.com> Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
-
- 18 5月, 2018 1 次提交
-
-
由 Björn Töpel 提交于
Clean up SPDX-License-Identifier and removing licensing leftovers. Signed-off-by: NBjörn Töpel <bjorn.topel@intel.com> Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
-
- 17 5月, 2018 3 次提交
-
-
由 Mathieu Malaterre 提交于
__printf is useful to verify format and arguments. ‘bpf_verifier_vlog’ function is used twice in verifier.c in both cases the caller function already uses the __printf gcc attribute. Remove the following warning, triggered with W=1: kernel/bpf/verifier.c:176:2: warning: function might be possible candidate for ‘gnu_printf’ format attribute [-Wsuggest-attribute=format] Signed-off-by: NMathieu Malaterre <malat@debian.org> Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
-
由 Paolo Abeni 提交于
Currently NOLOCK qdiscs pay a measurable overhead to atomically manipulate the __QDISC_STATE_RUNNING. Such bit is flipped twice per packet in the uncontended scenario with packet rate below the line rate: on packed dequeue and on the next, failing dequeue attempt. This changeset moves the bit manipulation into the qdisc_run_{begin,end} helpers, so that the bit is now flipped only once per packet, with measurable performance improvement in the uncontended scenario. This also allows simplifying the qdisc teardown code path - since qdisc_is_running() is now effective for each qdisc type - and avoid a possible race between qdisc_run() and dev_deactivate_many(), as now the some_qdisc_is_busy() can properly detect NOLOCK qdiscs being busy dequeuing packets. Signed-off-by: NPaolo Abeni <pabeni@redhat.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Debabrata Banerjee 提交于
The rx load balancing provided by balance-alb is not mutually exclusive with using hashing for tx selection, and should provide a decent speed increase because this eliminates spinlocks and cache contention. Signed-off-by: NDebabrata Banerjee <dbanerje@akamai.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 16 5月, 2018 1 次提交
-
-
由 John Fastabend 提交于
Sockmap is currently backed by an array and enforces keys to be four bytes. This works well for many use cases and was originally modeled after devmap which also uses four bytes keys. However, this has become limiting in larger use cases where a hash would be more appropriate. For example users may want to use the 5-tuple of the socket as the lookup key. To support this add hash support. Signed-off-by: NJohn Fastabend <john.fastabend@gmail.com> Acked-by: NDavid S. Miller <davem@davemloft.net> Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
-
- 15 5月, 2018 3 次提交
-
-
由 John Fastabend 提交于
This patch only refactors the existing sockmap code. This will allow much of the psock initialization code path and bpf helper codes to work for both sockmap bpf map types that are backed by an array, the currently supported type, and the new hash backed bpf map type sockhash. Most the fallout comes from three changes, - Pushing bpf programs into an independent structure so we can use it from the htab struct in the next patch. - Generalizing helpers to use void *key instead of the hardcoded u32. - Instead of passing map/key through the metadata we now do the lookup inline. This avoids storing the key in the metadata which will be useful when keys can be longer than 4 bytes. We rename the sk pointers to sk_redir at this point as well to avoid any confusion between the current sk pointer and the redirect pointer sk_redir. Signed-off-by: NJohn Fastabend <john.fastabend@gmail.com> Acked-by: NDavid S. Miller <davem@davemloft.net> Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
-
由 Marcelo Ricardo Leitner 提交于
Currently, when the rule is not to be exclusively executed by the hardware, extack is not passed along and offloading failures don't get logged. The idea was that hardware failures are okay because the rule will get executed in software then and this way it doesn't confuse unware users. But this is not helpful in case one needs to understand why a certain rule failed to get offloaded. Considering it may have been a temporary failure, like resources exceeded or so, reproducing it later and knowing that it is triggering the same reason may be challenging. The ultimate goal is to improve Open vSwitch debuggability when using flower offloading. This patch adds a new flag to enable verbose logging. With the flag set, extack will be passed to the driver, which will be able to log the error. As the operation itself probably won't fail (not because of this, at least), current iproute will already log it as a Warning. The flag is generic, so it can be reused later. No need to restrict it just for HW offloading. The command line will follow the syntax that tc-ebpf already uses, tc ... [ verbose ] ... , and extend its meaning. For example: # ./tc qdisc add dev p7p1 ingress # ./tc filter add dev p7p1 parent ffff: protocol ip prio 1 \ flower verbose \ src_mac ed:13:db:00:00:00 dst_mac 01:80:c2:00:00:d0 \ src_ip 56.0.0.0 dst_ip 55.0.0.0 action drop Warning: TC offload is disabled on net device. # echo $? 0 # ./tc filter add dev p7p1 parent ffff: protocol ip prio 1 \ flower \ src_mac ff:13:db:00:00:00 dst_mac 01:80:c2:00:00:d0 \ src_ip 56.0.0.0 dst_ip 55.0.0.0 action drop # echo $? 0 Signed-off-by: NMarcelo Ricardo Leitner <marcelo.leitner@gmail.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Rahul Lakkireddy 提交于
The sequence of actions done by device drivers to append their device specific hardware/firmware logs to /proc/vmcore are as follows: 1. During probe (before hardware is initialized), device drivers register to the vmcore module (via vmcore_add_device_dump()), with callback function, along with buffer size and log name needed for firmware/hardware log collection. 2. vmcore module allocates the buffer with requested size. It adds an Elf note and invokes the device driver's registered callback function. 3. Device driver collects all hardware/firmware logs into the buffer and returns control back to vmcore module. Ensure that the device dump buffer size is always aligned to page size so that it can be mmaped. Also, rename alloc_elfnotes_buf() to vmcore_alloc_buf() to make it more generic and reserve NT_VMCOREDD note type to indicate vmcore device dump. Suggested-by: Eric Biederman <ebiederm@xmission.com>. Signed-off-by: NRahul Lakkireddy <rahul.lakkireddy@chelsio.com> Signed-off-by: NGanesh Goudar <ganeshgr@chelsio.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 14 5月, 2018 1 次提交
-
-
由 Bruce Allan 提交于
Clean up existing instances of unnecessary parentheses in if statement and change order of conditionals to make it easier to read The opening /* should be followed by a single space and the closing */ should be preceded with a single space. Signed-off-by: NBruce Allan <bruce.w.allan@intel.com> Tested-by: NAndrew Bowers <andrewx.bowers@intel.com> Signed-off-by: NJeff Kirsher <jeffrey.t.kirsher@intel.com>
-
- 12 5月, 2018 3 次提交
-
-
由 Eric Dumazet 提交于
linux-4.16 got support for softirq based hrtimers. TCP can switch its pacing hrtimer to this variant, since this avoids going through a tasklet and some atomic operations. pacing timer logic looks like other (jiffies based) tcp timers. v2: use hrtimer_try_to_cancel() in tcp_clear_xmit_timers() to correctly release reference on socket if needed. Signed-off-by: NEric Dumazet <edumazet@google.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Florian Fainelli 提交于
Add support for PHYLINK within the DSA subsystem in order to support more complex devices such as pluggable (SFP) and non-pluggable (SFF) modules, 10G PHYs, and traditional PHYs. Using PHYLINK allows us to drop some amount of complexity we had while probing fixed and non-fixed PHYs using Device Tree. Because PHYLINK separates the Ethernet MAC/port configuration into different stages, we let switch drivers implement those, and for now, we maintain functionality by calling dsa_slave_adjust_link() during phylink_mac_link_{up,down} which provides semantically equivalent steps. Drivers willing to take advantage of PHYLINK should implement the phylink_mac_* operations that DSA wraps. We cannot quite remove the adjust_link() callback just yet, because a number of drivers rely on that for configuring their "CPU" and "DSA" ports, this is done dsa_port_setup_phy_of() and dsa_port_fixed_link_register_of() still. Drivers that utilize fixed links for user-facing ports (e.g: bcm_sf2) will need to implement phylink_mac_ops from now on to preserve functionality, since PHYLINK *does not* create a phy_device instance for fixed links. Signed-off-by: NFlorian Fainelli <f.fainelli@gmail.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Florian Fainelli 提交于
In preparation for adding support for PHYLINK within DSA, define a number of operations that we will need and that switch drivers can start implementing. Proper integration with PHYLINK will follow in subsequent patches. We start selecting PHYLINK (which implies PHYLIB) in net/dsa/Kconfig such that drivers can be guaranteed that this dependency is properly taken care of and can start referencing PHYLINK helper functions without requiring stubs or anything. Signed-off-by: NFlorian Fainelli <f.fainelli@gmail.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 11 5月, 2018 15 次提交
-
-
由 Debabrata Banerjee 提交于
There was a regression at some point from the intended functionality of commit f60c3704 ("bonding: Fix alb mode to only use first level vlans.") Given the return value vlan_get_encap_level() we need to store the nest level of the bond device, and then compare the vlan's encap level to this. Without this, this check always fails and learning packets are never sent. In addition, this same commit caused a regression in the behavior of balance_alb, which requires learning packets be sent for all interfaces using the slave's mac in order to load balance properly. For vlan's that have not set a user mac, we can send after checking one bit. Otherwise we need send the set mac, albeit defeating rx load balancing for that vlan. Signed-off-by: NDebabrata Banerjee <dbanerje@akamai.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 David Howells 提交于
Add a tracepoint to log transmission failure from the UDP transport socket being used by AF_RXRPC. Signed-off-by: NDavid Howells <dhowells@redhat.com>
-
由 David Howells 提交于
Add a tracepoint to log received ICMP/ICMP6 events and other error messages. Signed-off-by: NDavid Howells <dhowells@redhat.com>
-
由 David Ahern 提交于
Provide a helper for doing a FIB and neighbor lookup in the kernel tables from an XDP program. The helper provides a fastpath for forwarding packets. If the packet is a local delivery or for any reason is not a simple lookup and forward, the packet continues up the stack. If it is to be forwarded, the forwarding can be done directly if the neighbor is already known. If the neighbor does not exist, the first few packets go up the stack for neighbor resolution. Once resolved, the xdp program provides the fast path. On successful lookup the nexthop dmac, current device smac and egress device index are returned. The API supports IPv4, IPv6 and MPLS protocols, but only IPv4 and IPv6 are implemented in this patch. The API includes layer 4 parameters if the XDP program chooses to do deep packet inspection to allow compare against ACLs implemented as FIB rules. Header rewrite is left to the XDP program. The lookup takes 2 flags: - BPF_FIB_LOOKUP_DIRECT to do a lookup that bypasses FIB rules and goes straight to the table associated with the device (expert setting for those looking to maximize throughput) - BPF_FIB_LOOKUP_OUTPUT to do a lookup from the egress perspective. Default is an ingress lookup. Initial performance numbers collected by Jesper, forwarded packets/sec: Full stack XDP FIB lookup XDP Direct lookup IPv4 1,947,969 7,074,156 7,415,333 IPv6 1,728,000 6,165,504 7,262,720 These number are single CPU core forwarding on a Broadwell E5-1650 v4 @ 3.60GHz. Signed-off-by: NDavid Ahern <dsahern@gmail.com> Acked-by: NJesper Dangaard Brouer <brouer@redhat.com> Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
-
由 David Ahern 提交于
Add stubs to retrieve a handle to an IPv6 FIB table, fib6_get_table, a stub to do a lookup in a specific table, fib6_table_lookup, and a stub for a full route lookup. The stubs are needed for core bpf code to handle the case when the IPv6 module is not builtin. Signed-off-by: NDavid Ahern <dsahern@gmail.com> Acked-by: NDavid S. Miller <davem@davemloft.net> Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
-
由 David Ahern 提交于
Similar to IPv4, IPv6 should use the FIB lookup result in the tracepoint. Signed-off-by: NDavid Ahern <dsahern@gmail.com> Acked-by: NDavid S. Miller <davem@davemloft.net> Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
-
由 David Ahern 提交于
Add IPv6 equivalent to fib_lookup. Does a fib lookup, including rules, but returns a FIB entry, fib6_info, rather than a dst based rt6_info. fib6_lookup is any where from 140% (MULTIPLE_TABLES config disabled) to 60% faster than any of the dst based lookup methods (without custom rules) and 25% faster with custom rules (e.g., l3mdev rule). Since the lookup function has a completely different signature, fib6_rule_action is split into 2 paths: the existing one is renamed __fib6_rule_action and a new one for the fib6_info path is added. fib6_rule_action decides which to call based on the lookup_ptr. If it is fib6_table_lookup then the new path is taken. Caller must hold rcu lock as no reference is taken on the returned fib entry. Signed-off-by: NDavid Ahern <dsahern@gmail.com> Acked-by: NDavid S. Miller <davem@davemloft.net> Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
-
由 David Ahern 提交于
ip6_pol_route is used for ingress and egress FIB lookups. Refactor it moving the table lookup into a separate fib6_table_lookup that can be invoked separately and export the new function. ip6_pol_route now calls fib6_table_lookup and uses the result to generate a dst based rt6_info. Signed-off-by: NDavid Ahern <dsahern@gmail.com> Acked-by: NDavid S. Miller <davem@davemloft.net> Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
-
由 David Ahern 提交于
Rename rt6_multipath_select to fib6_multipath_select and export it. A later patch wants access to it similar to IPv4's fib_select_path. Signed-off-by: NDavid Ahern <dsahern@gmail.com> Acked-by: NDavid S. Miller <davem@davemloft.net> Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
-
由 David Ahern 提交于
Rename fib6_lookup to fib6_node_lookup to better reflect what it returns. The fib6_lookup name will be used in a later patch for an IPv6 equivalent to IPv4's fib_lookup. Signed-off-by: NDavid Ahern <dsahern@gmail.com> Acked-by: NDavid S. Miller <davem@davemloft.net> Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
-
由 Jon Maxwell 提交于
This version has some suggestions by Eric Dumazet: - Use a local variable for the mark in IPv6 instead of ctl_sk to avoid SMP races. - Use the more elegant "IP4_REPLY_MARK(net, skb->mark) ?: sk->sk_mark" statement. - Factorize code as sk_fullsock() check is not necessary. Aidan McGurn from Openwave Mobility systems reported the following bug: "Marked routing is broken on customer deployment. Its effects are large increase in Uplink retransmissions caused by the client never receiving the final ACK to their FINACK - this ACK misses the mark and routes out of the incorrect route." Currently marks are added to sk_buffs for replies when the "fwmark_reflect" sysctl is enabled. But not for TW sockets that had sk->sk_mark set via setsockopt(SO_MARK..). Fix this in IPv4/v6 by adding tw->tw_mark for TIME_WAIT sockets. Copy the the original sk->sk_mark in __inet_twsk_hashdance() to the new tw->tw_mark location. Then progate this so that the skb gets sent with the correct mark. Do the same for resets. Give the "fwmark_reflect" sysctl precedence over sk->sk_mark so that netfilter rules are still honored. Signed-off-by: NJon Maxwell <jmaxwell37@gmail.com> Reviewed-by: NEric Dumazet <edumazet@google.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Joe Perches 提交于
INET_CSK_DEBUG is always set and only is used for 2 pr_debug calls. EXPORT_SYMBOL(inet_csk_timer_bug_msg) is only used by these 2 pr_debug calls and is also unnecessary as the exported string can be used directly by these calls. Signed-off-by: NJoe Perches <joe@perches.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Eran Ben Elisha 提交于
If supported, write a driver version string to FW as part of the INIT_HCA command. Example of driver version: "Linux,mlx4_core,4.0-0" Signed-off-by: NEran Ben Elisha <eranbe@mellanox.com> Signed-off-by: NTariq Toukan <tariqt@mellanox.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Davidlohr Bueso 提交于
No changes in refcount semantics -- key init is false; replace static_key_slow_inc|dec with static_branch_inc|dec static_key_false with static_branch_unlikely Added a '_key' suffix to memalloc_socks, for better self documentation. Signed-off-by: NDavidlohr Bueso <dbueso@suse.de> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Davidlohr Bueso 提交于
No changes in refcount semantics -- key init is false; replace static_key_slow_inc|dec with static_branch_inc|dec static_key_false with static_branch_unlikely Signed-off-by: NDavidlohr Bueso <dbueso@suse.de> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 10 5月, 2018 2 次提交
-
-
由 Ilya Dryomov 提交于
... and store num_bvecs for client code's convenience. Signed-off-by: NIlya Dryomov <idryomov@gmail.com> Reviewed-by: NJeff Layton <jlayton@redhat.com> Reviewed-by: N"Yan, Zheng" <zyan@redhat.com>
-
由 Jakub Kicinski 提交于
It's fairly easy for offloaded XDP programs to select the RX queue packets go to. We need a way of expressing this in the software. Allow write to the rx_queue_index field of struct xdp_md for device-bound programs. Skip convert_ctx_access callback entirely for offloads. Signed-off-by: NJakub Kicinski <jakub.kicinski@netronome.com> Reviewed-by: NQuentin Monnet <quentin.monnet@netronome.com> Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
-
- 09 5月, 2018 2 次提交
-
-
由 Martin KaFai Lau 提交于
During BPF_OBJ_GET_INFO_BY_FD on a btf_fd, the current bpf_attr's info.info is directly filled with the BTF binary data. It is not extensible. In this case, we want to add BTF ID. This patch adds "struct bpf_btf_info" which has the BTF ID as one of its member. The BTF binary data itself is exposed through the "btf" and "btf_size" members. Signed-off-by: NMartin KaFai Lau <kafai@fb.com> Acked-by: NAlexei Starovoitov <ast@fb.com> Acked-by: NSong Liu <songliubraving@fb.com> Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
-
由 Martin KaFai Lau 提交于
This patch gives an ID to each loaded BTF. The ID is allocated by the idr like the existing prog-id and map-id. The bpf_put(map->btf) is moved to __bpf_map_put() so that the userspace can stop seeing the BTF ID ASAP when the last BTF refcnt is gone. It also makes BTF accessible from userspace through the 1. new BPF_BTF_GET_FD_BY_ID command. It is limited to CAP_SYS_ADMIN which is inline with the BPF_BTF_LOAD cmd and the existing BPF_[MAP|PROG]_GET_FD_BY_ID cmd. 2. new btf_id (and btf_key_id + btf_value_id) in "struct bpf_map_info" Once the BTF ID handler is accessible from userspace, freeing a BTF object has to go through a rcu period. The BPF_BTF_GET_FD_BY_ID cmd can then be done under a rcu_read_lock() instead of taking spin_lock. [Note: A similar rcu usage can be done to the existing bpf_prog_get_fd_by_id() in a follow up patch] When processing the BPF_BTF_GET_FD_BY_ID cmd, refcount_inc_not_zero() is needed because the BTF object could be already in the rcu dead row . btf_get() is removed since its usage is currently limited to btf.c alone. refcount_inc() is used directly instead. Signed-off-by: NMartin KaFai Lau <kafai@fb.com> Acked-by: NAlexei Starovoitov <ast@fb.com> Acked-by: NSong Liu <songliubraving@fb.com> Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
-