- 12 March 2021, 12 commits
-
-
Committed by Petr Machata

Add a dump handler for resilient next-hop buckets. When a next-hop group ID is given, it walks the buckets of that group; otherwise it walks the buckets of all groups. It then dumps the buckets whose next hops match the given filtering criteria.

Signed-off-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
-
Committed by Petr Machata

Implement the netlink messages that allow creation and dumping of resilient nexthop groups.

Signed-off-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
-
Committed by Ido Schimmel

The kernel periodically checks the idle time of nexthop buckets to determine if they are idle and can be re-populated with a new nexthop. When the resilient nexthop group is offloaded to hardware, the kernel will not see activity on nexthop buckets unless it is reported from hardware. Add a function that device drivers can call periodically to report activity on nexthop buckets after querying it from the underlying device.

Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
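A hedged sketch of how a driver might use this: nexthop_res_grp_activity_update() is the entry point this commit adds, while everything prefixed mydrv_ (the state, the HW query, the poll interval) is an illustrative assumption, not a real driver.

```c
#include <linux/bitmap.h>
#include <linux/slab.h>
#include <linux/workqueue.h>
#include <net/nexthop.h>

#define MYDRV_NH_POLL_INTERVAL HZ	/* assumed poll interval */

struct mydrv {				/* hypothetical driver state */
	struct net *net;
	u32 nh_grp_id;
	u16 num_buckets;
	struct delayed_work nh_work;
};

/* Hypothetical device query: sets bit i if bucket i saw traffic. */
void mydrv_hw_query_activity(struct mydrv *drv, unsigned long *activity);

static void mydrv_nh_activity_work(struct work_struct *work)
{
	struct mydrv *drv = container_of(to_delayed_work(work),
					 struct mydrv, nh_work);
	unsigned long *activity;

	activity = bitmap_zalloc(drv->num_buckets, GFP_KERNEL);
	if (!activity)
		goto resched;

	mydrv_hw_query_activity(drv, activity);

	/* Report to the nexthop code so bucket idle timers are reset. */
	nexthop_res_grp_activity_update(drv->net, drv->nh_grp_id,
					drv->num_buckets, activity);
	bitmap_free(activity);
resched:
	schedule_delayed_work(&drv->nh_work, MYDRV_NH_POLL_INTERVAL);
}
```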
-
Committed by Ido Schimmel

Add a function that device drivers can call to set the "offload" or "trap" indication on nexthop buckets following nexthop notifications and other changes, such as a neighbour becoming invalid.

Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
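A minimal hedged sketch of a driver reflecting bucket state back to the kernel; nexthop_bucket_set_hw_flags() is the function this commit adds, while the caller and its context are assumed.

```c
#include <net/nexthop.h>

/* Called after (re)programming one bucket into the device, or after
 * demoting it to a trap because e.g. its neighbour became invalid.
 */
static void mydrv_bucket_state_sync(struct net *net, u32 nh_grp_id,
				    u16 bucket_index, bool in_hw)
{
	nexthop_bucket_set_hw_flags(net, nh_grp_id, bucket_index,
				    in_hw,	/* offload */
				    !in_hw);	/* trap */
}
```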
-
Committed by Petr Machata

Implement the following notifications towards drivers:

- NEXTHOP_EVENT_REPLACE, when a resilient nexthop group is created.

- NEXTHOP_EVENT_BUCKET_REPLACE, any time there is a change in assignment of next hops to hash table buckets. That includes replacements, deletions, and delayed upkeep cycles. Some bucket notifications can be vetoed by the driver, to make it possible to propagate bucket busy-ness flags from the HW back to the algorithm. Some are however forced, e.g. if a next hop is deleted, all buckets that use this next hop simply must be migrated, whether the HW wishes so or not.

- NEXTHOP_EVENT_RES_TABLE_PRE_REPLACE, before a resilient nexthop group is replaced. Usually the driver will get the bucket notifications as well, and could veto those. But in some cases, a bucket may not be migrated immediately, but during delayed upkeep, and that is too late to roll the transaction back. This notification allows the driver to take a look and veto the new proposed group up front, before anything is committed.

Signed-off-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
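A hedged sketch of a driver notifier consuming these events; struct nh_notifier_info and the event codes come from the nexthop code, while the mydrv_* handlers are hypothetical.

```c
#include <linux/notifier.h>
#include <net/nexthop.h>

/* Hypothetical per-event handlers; a negative return vetoes. */
int mydrv_nh_group_replace(struct nh_notifier_info *info);
int mydrv_nh_bucket_replace(struct nh_notifier_info *info);
int mydrv_nh_res_table_check(struct nh_notifier_info *info);

static int mydrv_nexthop_event(struct notifier_block *nb,
			       unsigned long event, void *ptr)
{
	struct nh_notifier_info *info = ptr;
	int err;

	switch (event) {
	case NEXTHOP_EVENT_REPLACE:
		err = mydrv_nh_group_replace(info);
		break;
	case NEXTHOP_EVENT_BUCKET_REPLACE:
		/* May veto to keep a busy bucket, unless the kernel
		 * forces the migration (e.g. next hop removal).
		 */
		err = mydrv_nh_bucket_replace(info);
		break;
	case NEXTHOP_EVENT_RES_TABLE_PRE_REPLACE:
		/* Up-front veto point for a whole-group replace. */
		err = mydrv_nh_res_table_check(info);
		break;
	default:
		return NOTIFY_DONE;
	}
	return notifier_from_errno(err);
}
```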
-
Committed by Petr Machata

At this moment, there is only one type of next-hop group: an mpath group, which implements the hash-threshold algorithm.

To select a next hop, the hash-threshold algorithm first assigns a range of hashes to each next hop in the group, and then selects the next hop by comparing the SKB hash with the individual ranges. When a next hop is removed from the group, the ranges are recomputed, which leads to reassignment of parts of the hash space from one next hop to another. While there will usually be some overlap between the previous and the new distribution, some traffic flows change the next hop that they resolve to. That causes problems, e.g. as established TCP connections are reset, because the traffic is forwarded to a server that is not familiar with the connection.

Resilient hashing is a technique to address the above problem. A resilient next-hop group has another layer of indirection between the group itself and its constituent next hops: a hash table. The selection algorithm uses a straightforward modulo operation to choose a hash bucket, then reads the next hop that this bucket contains, and forwards traffic there.

This indirection brings an important feature. In the hash-threshold algorithm, the range of hashes associated with a next hop must be continuous. With a hash table, the mapping between the hash table buckets and the individual next hops is arbitrary. Therefore when a next hop is deleted, the buckets that held it are simply reassigned to other next hops. When weights of next hops in a group are altered, it may be possible to choose a subset of buckets that are currently not used for forwarding traffic, and use those to satisfy the new next-hop distribution demands, keeping the "busy" buckets intact. This way, established flows ideally keep being forwarded to the same endpoints through the same paths as before the next-hop group change.

In a nutshell, the algorithm works as follows. Each next hop has a number of buckets that it wants to have, according to its weight and the number of buckets in the hash table. In case of an event that might cause bucket allocation change, the numbers for individual next hops are updated, similarly to how ranges are updated for mpath group next hops. Following that, a new "upkeep" algorithm runs, and for idle buckets that belong to a next hop that is currently occupying more buckets than it wants (it is "overweight"), it migrates the buckets to one of the next hops that have fewer buckets than they want (they are "underweight"). If, after this, there are still underweight next hops, another upkeep run is scheduled for a future time.

Chances are there are not enough "idle" buckets to satisfy the new demands. The algorithm has knobs to select both what it means for a bucket to be idle, and whether and when to forcefully migrate buckets if there keeps being an insufficient number of idle buckets.

There are three users of the resilient data structures:

- The forwarding code accesses them under RCU, and does not modify them except for updating the time a selected bucket was last used.

- Netlink code, running under RTNL, which may modify the data.

- The delayed upkeep code, which may modify the data. This runs unlocked, and mutual exclusion between the RTNL code and the delayed upkeep is maintained by canceling the delayed work synchronously before the RTNL code touches anything. Later it restarts the delayed work if necessary.

The RTNL code has to implement next-hop group replacement, next hop removal, etc. For removal, the mpath code uses a neat trick of having a backup next-hop group structure, doing the necessary changes offline, and then RCU-swapping them in. However, the hash tables for resilient hashing are about an order of magnitude larger than the groups themselves (the size might be e.g. 4K entries), and it was felt that keeping two of them is overkill. Both the primary next-hop group and the spare therefore use the same resilient table, and writers are careful to keep all references valid for the forwarding code. The hash table references next-hop group entries from the next-hop group that is currently in the primary role (i.e. not spare). During the transition from primary to spare, the table references a mix of both the primary group and the spare. When a next hop is deleted, the corresponding buckets are not set to NULL, but instead marked as empty, so that the pointer is valid and can be used by the forwarding code. The buckets are then migrated to a new next-hop group entry during upkeep. The only times that the hash table is invalid are the very beginning and very end of its lifetime. Between those points, it is always kept valid.

This patch introduces the core support code itself. It does not handle notifications towards drivers, which are kept as if the group were an mpath one. It does not handle netlink either. The only bit currently exposed to user space is the new next-hop group type, and that is currently bounced. There is therefore no way to actually access this code.

Signed-off-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
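A condensed sketch of the selection path described above; the structures are illustrative stand-ins for the kernel's resilient-table types, not the actual layout.

```c
#include <linux/atomic.h>
#include <linux/jiffies.h>
#include <linux/rcupdate.h>
#include <linux/types.h>

struct nh_grp_entry;	/* real type lives in the nexthop code */

struct res_bucket {
	struct nh_grp_entry __rcu *nh_entry; /* may be "empty", never stale */
	atomic_long_t used;		     /* last-used time, for upkeep */
};

struct res_table {
	u32 num_buckets;
	struct res_bucket buckets[];
};

/* Forwarding path, under RCU: modulo into the table, touch the
 * activity timestamp, read the bucket's next hop.
 */
static struct nh_grp_entry *res_select(struct res_table *tbl, u32 hash)
{
	struct res_bucket *b = &tbl->buckets[hash % tbl->num_buckets];

	atomic_long_set(&b->used, (long)jiffies);
	return rcu_dereference(b->nh_entry);
}
```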
-
Committed by Ido Schimmel

- RTM_NEWNEXTHOP et al. that handle resilient groups will have a new nested attribute, NHA_RES_GROUP, whose elements are attributes NHA_RES_GROUP_*.

- RTM_NEWNEXTHOPBUCKET et al. is a suite of new messages that will currently serve only for dumping of individual buckets of resilient next-hop groups. For nexthop group buckets, these messages will carry a nested attribute NHA_RES_BUCKET, whose elements are attributes NHA_RES_BUCKET_*.

  There are several reasons why a new suite of messages is created for nexthop buckets instead of overloading the information on the existing RTM_{NEW,DEL,GET}NEXTHOP messages. First, a nexthop group can contain a large number of nexthop buckets (4k is not unheard of). This imposes limits on the amount of information that can be encoded for each nexthop bucket, given that a netlink message is limited to 64k bytes. Second, while RTM_NEWNEXTHOPBUCKET is only used for notifications at this point, in the future it can be extended to provide user space with control over nexthop bucket configuration.

- The new group type is NEXTHOP_GRP_TYPE_RES. Note that the nexthop code is adjusted to bounce groups with that type for now.

Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
-
Committed by Petr Machata

With the introduction of resilient nexthop groups, there will be two types of multipath groups: the current hash-threshold "mpath" ones, and resilient groups. Both are multipath, but to determine that, the system needs to consider two flags. This might prove costly in the datapath. Therefore introduce a new flag, which should be set for next-hop groups that have more than one nexthop and should be considered multipath.

Signed-off-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
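An illustrative before/after of the datapath benefit; the struct below is an abbreviated sketch along the lines of the description, not the exact kernel layout.

```c
struct nh_group {		/* abbreviated, illustrative layout */
	bool mpath;
	bool resilient;
	bool is_multipath;	/* the new flag */
};

/* Before: the datapath must test both group types. */
static bool nh_is_multipath_old(const struct nh_group *nhg)
{
	return nhg->mpath || nhg->resilient;
}

/* After: a single flag, set at group creation time. */
static bool nh_is_multipath_new(const struct nh_group *nhg)
{
	return nhg->is_multipath;
}
```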
-
Committed by Petr Machata

The cited function currently uses rtnl_dereference() to get nh_info from a handed-in nexthop. However, under the resilient hashing scheme, this function will not always be called under RTNL; sometimes the mutual exclusion will be achieved differently. Therefore move the nh_info extraction from the function to its callers, to make it possible to use a different synchronization guarantee.

Signed-off-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
-
Committed by Petr Machata

Currently, replace assumes that the new group that is given is a fully-formed object. But mpath groups really only have one attribute, and that is the constituent next-hop configuration. This may not be universally true. From the usability perspective, it is desirable to allow the replace operation to adjust just the constituent next-hop configuration and leave the group attributes as such intact.

But the object that keeps track of whether an attribute was or was not given is the nh_config object, not the next hop or next-hop group. To allow (selective) attribute updates during NH group replacement, propagate `cfg' to replace_nexthop() and further to replace_nexthop_grp().

Signed-off-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
-
Committed by Julien Massonneau

When there are two segment routing headers, for example after an End.B6 action, the second SRH will never be handled by an action: the packet will be dropped when the first SRH has segments left equal to 0. For actions that don't perform decapsulation (currently: End, End.X, End.T, End.B6, End.B6.Encaps), this patch adds the IP6_FH_F_SKIP_RH flag in the arguments for ipv6_find_hdr().

Signed-off-by: Julien Massonneau <julien.massonneau@6wind.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
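A hedged sketch of the change's shape: passing IP6_FH_F_SKIP_RH makes ipv6_find_hdr() skip routing headers whose segments_left is 0, so the lookup can land on the second SRH. The wrapper below is illustrative, not the actual call site.

```c
#include <net/ipv6.h>

static int seg6_find_target_hdr(struct sk_buff *skb, int target)
{
	unsigned int off = 0;
	int flags = IP6_FH_F_SKIP_RH;	/* skip exhausted routing headers */

	return ipv6_find_hdr(skb, &off, target, NULL, &flags);
}
```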
-
Committed by Julien Massonneau

As specified in IETF RFC 8754, section 4.3.1.2, if the upper layer header is IPv4 or IPv6, perform IPv6 decapsulation and resubmit the decapsulated packet to the IPv4 or IPv6 module. Only IPv6 decapsulation was implemented. This patch adds support for IPv4 decapsulation.

Link: https://tools.ietf.org/html/rfc8754#section-4.3.1.2
Signed-off-by: Julien Massonneau <julien.massonneau@6wind.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
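A sketch of the dispatch this implies, under stated assumptions: the seg6_resubmit_* helpers are hypothetical names, and only the "resubmit per inner protocol" logic comes from the commit.

```c
#include <linux/in.h>
#include <linux/if_ether.h>
#include <linux/skbuff.h>

int seg6_resubmit_ipv4(struct sk_buff *skb);	/* hypothetical */
int seg6_resubmit_ipv6(struct sk_buff *skb);	/* hypothetical */

/* After the outer IPv6 header and SRH are removed, pick the module
 * to resubmit the inner packet to.
 */
static int seg6_resubmit_inner(struct sk_buff *skb, u8 inner_proto)
{
	switch (inner_proto) {
	case IPPROTO_IPIP:	/* inner IPv4, added by this patch */
		skb->protocol = htons(ETH_P_IP);
		return seg6_resubmit_ipv4(skb);
	case IPPROTO_IPV6:	/* inner IPv6, already supported */
		skb->protocol = htons(ETH_P_IPV6);
		return seg6_resubmit_ipv6(skb);
	default:
		kfree_skb(skb);
		return -EPROTONOSUPPORT;
	}
}
```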
-
- 11 March 2021, 10 commits
-
-
Committed by Shubhankar Kuranagatti

Remove an extra space appearing before a tab.

Signed-off-by: Shubhankar Kuranagatti <shubhankarvk@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
-
Committed by Shubhankar Kuranagatti

Replace runs of spaces with tabs wherever required.

Signed-off-by: Shubhankar Kuranagatti <shubhankarvk@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
-
Committed by Ido Schimmel

Implement this callback in order to get the offloaded stats added to the kernel stats.

Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
-
Committed by Yunsheng Lin

The GRO list uses skb_shinfo(skb)->frag_list to link two skbs together, and NAPI_GRO_CB(p)->last->next is used when there are more skbs; see skb_gro_receive_list(). GSO expects each segmented skb to be linked together using skb->next, so only the first skb->next needs to be set to skb_shinfo(skb)->frag_list when doing GSO list segmentation.

For the same reason, nskb->next does not need to be set to list_skb before going to the error handling, because nskb->next already points to list_skb. And since nskb is also the last skb at the end of the loop, remove the tail variable and use nskb instead.

Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
-
Committed by Gustavo A. R. Silva

In preparation to enable -Wimplicit-fallthrough for Clang, fix multiple warnings by explicitly adding multiple break statements instead of letting the code fall through to the next case.

Link: https://github.com/KSPP/linux/issues/115
Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
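An illustrative before/after of this class of fix (example code, not taken from the patched files): Clang's -Wimplicit-fallthrough warns on any silent fall-through, so break statements are added where falling through was never intended.

```c
void handle_start(void);	/* hypothetical handlers */
void handle_stop(void);

void dispatch(int cmd)
{
	switch (cmd) {
	case 0:
		handle_start();
		break;		/* added: fall-through was unintended */
	case 1:
		handle_stop();
		break;
	}
}
```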
-
Committed by Gustavo A. R. Silva

In preparation to enable -Wimplicit-fallthrough for Clang, fix a warning by explicitly adding a break statement instead of letting the code fall through to the next case.

Link: https://github.com/KSPP/linux/issues/115
Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
-
Committed by Gustavo A. R. Silva

In preparation to enable -Wimplicit-fallthrough for Clang, fix a warning by explicitly adding a break statement instead of letting the code fall through to the next case.

Link: https://github.com/KSPP/linux/issues/115
Acked-by: Nikolay Aleksandrov <nikolay@nvidia.com>
Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
-
Committed by Gustavo A. R. Silva

In preparation to enable -Wimplicit-fallthrough for Clang, fix a warning by explicitly adding a break statement instead of letting the code fall through to the next case.

Link: https://github.com/KSPP/linux/issues/115
Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
-
Committed by Gustavo A. R. Silva

In preparation to enable -Wimplicit-fallthrough for Clang, fix a warning by explicitly adding a break statement instead of letting the code fall through to the next case.

Link: https://github.com/KSPP/linux/issues/115
Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
-
Committed by Yejune Deng

There is no need to assign msg->msg_name to sin or sin6, because the DECLARE_SOCKADDR statement already performs that assignment.

Signed-off-by: Yejune Deng <yejune.deng@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
-
- 10 March 2021, 3 commits
-
-
Committed by Balazs Nemeth

A packet with skb_inner_network_header(skb) == skb_network_header(skb) and ETH_P_MPLS_UC will prevent mpls_gso_segment() from pulling any headers from the packet. Subsequently, the call to skb_mac_gso_segment() will again call mpls_gso_segment() with the same packet, leading to an infinite loop. In addition, ensure that the header length is a multiple of four, which should hold irrespective of the number of stacked labels.

Signed-off-by: Balazs Nemeth <bnemeth@redhat.com>
Acked-by: Willem de Bruijn <willemb@google.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
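A minimal sketch of the added sanity check, modeled on the description above; MPLS_HLEN is the 4-byte label size from <linux/mpls.h>, while the helper shape is an assumption.

```c
#include <linux/mpls.h>
#include <linux/skbuff.h>

/* Refuse to segment unless at least one whole MPLS label can be
 * pulled and the stack length is a multiple of the 4-byte label
 * size; otherwise mpls_gso_segment() would make no progress and
 * skb_mac_gso_segment() would loop forever.
 */
static bool mpls_stack_len_valid(const struct sk_buff *skb)
{
	unsigned int mpls_hlen = skb_inner_network_header(skb) -
				 skb_network_header(skb);

	return mpls_hlen >= MPLS_HLEN && !(mpls_hlen % MPLS_HLEN);
}
```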
-
Committed by Björn Töpel

The XDP_REDIRECT implementations for maps and non-maps are fairly similar, but obviously need to take different code paths depending on whether the target is using a map or not. Today, the redirect targets for XDP either use a map, or are based on ifindex.

Here, the map type and id are added to bpf_redirect_info, instead of the actual map. Map type, map item/ifindex, and the map_id (if any) are passed to xdp_do_redirect(). For ifindex-based redirect, used by the bpf_redirect() XDP BPF helper, a special map type/id is used: a map type of UNSPEC together with a map id equal to INT_MAX has the special meaning of an ifindex-based redirect. Note that valid map ids are in [1, INT_MAX), i.e. 1 inclusive, INT_MAX exclusive.

In addition to making the code easier to follow, using explicit type and id in bpf_redirect_info has a slight positive performance impact by avoiding a pointer indirection for the map type lookup, and instead using the cacheline for bpf_redirect_info.

Since the actual map is not passed via bpf_redirect_info anymore, the map lookup is only done in the BPF helper. This means that the bpf_clear_redirect_map() function can be removed. The actual map item is RCU protected. The bpf_redirect_info flags member is not used by XDP, and not read/written any more. The map member is only written to when required/used, and not unconditionally.

Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
Acked-by: Jesper Dangaard Brouer <brouer@redhat.com>
Acked-by: Toke Høiland-Jørgensen <toke@redhat.com>
Link: https://lore.kernel.org/bpf/20210308112907.559576-3-bjorn.topel@gmail.com
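A small sketch of the special encoding described above; the helper is illustrative, and only the UNSPEC/INT_MAX convention comes from the commit.

```c
#include <linux/bpf.h>
#include <linux/limits.h>
#include <linux/types.h>

/* Valid map ids live in [1, INT_MAX); BPF_MAP_TYPE_UNSPEC together
 * with map_id == INT_MAX encodes an ifindex-based bpf_redirect().
 */
static bool xdp_redirect_is_ifindex(enum bpf_map_type map_type, u32 map_id)
{
	return map_type == BPF_MAP_TYPE_UNSPEC && map_id == INT_MAX;
}
```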
-
Committed by Björn Töpel

Currently the bpf_redirect_map() implementation dispatches to the correct map-lookup function via a switch statement. To avoid the dispatching, this change adds bpf_redirect_map() as a map operation. Each map provides its own bpf_redirect_map() version, and the correct function is automatically selected by the BPF verifier.

A nice side-effect of the code movement is that the map lookup functions are now local to the map implementation files, which removes one additional function call.

Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Jesper Dangaard Brouer <brouer@redhat.com>
Acked-by: Toke Høiland-Jørgensen <toke@redhat.com>
Link: https://lore.kernel.org/bpf/20210308112907.559576-2-bjorn.topel@gmail.com
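A hedged sketch of the new hook, modeled on what the commit describes for the devmap case; treat the exact signatures and internal helper names as assumptions.

```c
/* Each redirect-capable map supplies its own implementation; the
 * verifier selects it, so the runtime switch statement disappears.
 */
static int dev_map_redirect(struct bpf_map *map, u32 ifindex, u64 flags)
{
	/* The shared helper records type/id/item in bpf_redirect_info
	 * using the map's own, now file-local, lookup function.
	 */
	return __bpf_xdp_redirect_map(map, ifindex, flags,
				      __dev_map_lookup_elem);
}

const struct bpf_map_ops dev_map_ops = {
	/* ...alloc/free/lookup/update ops elided... */
	.map_redirect = dev_map_redirect,
};
```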
-
- 09 March 2021, 4 commits
-
-
Committed by Jia-Ju Bai

When sock_alloc_send_skb() returns NULL to skb, qrtr_sendmsg() returns without an error code assigned. To fix this bug, rc is assigned -ENOMEM in this case.

Fixes: 194ccc88 ("net: qrtr: Support decoding incoming v2 packets")
Reported-by: TOTE Robot <oslab@tsinghua.edu.cn>
Signed-off-by: Jia-Ju Bai <baijiaju1990@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
-
Committed by Davide Caratti

In current Linux, MPTCP peers advertising endpoints with port numbers use a sub-option length that wrongly accounts for the trailing TCP NOP. Also, receivers will only process incoming ADD_ADDR with port having such a wrong sub-option length. Fix this, making ADD_ADDR compliant with RFC 8684 §3.4.1.

This can be verified running tcpdump on the kselftest artifacts:

unpatched kernel:
[root@bottarga mptcp]# tcpdump -tnnr unpatched.pcap | grep add-addr
reading from file unpatched.pcap, link-type LINUX_SLL (Linux cooked v1), snapshot length 65535
IP 10.0.1.1.10000 > 10.0.1.2.53078: Flags [.], ack 101, win 509, options [nop,nop,TS val 214459678 ecr 521312851,mptcp add-addr v1 id 1 a00:201:2774:2d88:7436:85c3:17fd:101], length 0
IP 10.0.1.2.53078 > 10.0.1.1.10000: Flags [.], ack 101, win 502, options [nop,nop,TS val 521312852 ecr 214459678,mptcp add-addr[bad opt]]

patched kernel:
[root@bottarga mptcp]# tcpdump -tnnr patched.pcap | grep add-addr
reading from file patched.pcap, link-type LINUX_SLL (Linux cooked v1), snapshot length 65535
IP 10.0.1.1.10000 > 10.0.1.2.38178: Flags [.], ack 101, win 509, options [nop,nop,TS val 3728873902 ecr 2732713192,mptcp add-addr v1 id 1 10.0.2.1:10100 hmac 0xbccdfcbe59292a1f,nop,nop], length 0
IP 10.0.1.2.38178 > 10.0.1.1.10000: Flags [.], ack 101, win 502, options [nop,nop,TS val 2732713195 ecr 3728873902,mptcp add-addr v1-echo id 1 10.0.2.1:10100,nop,nop], length 0

Fixes: 22fb85ff ("mptcp: add port support for ADD_ADDR suboption writing")
CC: stable@vger.kernel.org # 5.11+
Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Acked-and-tested-by: Geliang Tang <geliangtang@gmail.com>
Signed-off-by: Davide Caratti <dcaratti@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
-
Committed by Vladimir Oltean

Tobias reports that after the blamed patch, VLAN objects being added to a bridge device are being added to all slave ports instead (swp2, swp3):

ip link add br0 type bridge vlan_filtering 1
ip link set swp2 master br0
ip link set swp3 master br0
bridge vlan add dev br0 vid 100 self

This is because the fix was too broad: we made dsa_port_offloads_netdev() say "yes, I offload the br0 bridge" for all slave ports, but we didn't add the checks whether the switchdev object was in fact meant for the physical port or for the bridge itself. So we are reacting on events in a way in which we shouldn't.

The reason why the fix was too broad is that the question itself, "does this DSA port offload this netdev", was too broad in the first place. The solution is to disambiguate the question and separate it into two different functions: one to be called for each switchdev attribute/object that has an orig_dev == net_bridge (dsa_port_offloads_bridge), and the other for orig_dev == net_bridge_port (*_offloads_bridge_port); see the sketch after this message.

In the case of VLAN objects on the bridge interface, this solves the problem because we know that VLAN objects are per bridge port and not per bridge. And when orig_dev is equal to the net_bridge, we offload it as a bridge, but not as a bridge port; that's how we are able to skip reacting on those events. Note that this is compatible with future plans to have explicit offloading of VLAN objects on the bridge interface as a bridge port (in DSA, this signifies that we should add that VLAN towards the CPU port).

Fixes: 99b8202b ("net: dsa: fix SWITCHDEV_ATTR_ID_BRIDGE_VLAN_FILTERING getting ignored")
Reported-by: Tobias Waldekranz <tobias@waldekranz.com>
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Tobias Waldekranz <tobias@waldekranz.com>
Tested-by: Tobias Waldekranz <tobias@waldekranz.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
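The two predicates as a hedged sketch, modeled on the commit description; the helper and field names are assumptions about the DSA internals.

```c
#include <linux/netdevice.h>

/* For switchdev attributes/objects whose orig_dev is a bridge port. */
static bool dsa_port_offloads_bridge_port(struct dsa_port *dp,
					  const struct net_device *dev)
{
	return dsa_port_to_bridge_port(dp) == dev;	/* assumed helper */
}

/* For switchdev attributes/objects whose orig_dev is the bridge
 * itself, e.g. a VLAN added with "bridge vlan add dev br0 ... self".
 */
static bool dsa_port_offloads_bridge(struct dsa_port *dp,
				     const struct net_device *bridge_dev)
{
	return dp->bridge_dev == bridge_dev;	/* assumed field */
}
```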
-
Committed by Björn Töpel

Currently, the AF_XDP rings use general smp_{r,w,}mb() barriers on the kernel side. On most modern architectures, load-acquire/store-release barriers perform better and result in simpler code for circular ring buffers. This change updates the XDP socket rings to use load-acquire/store-release barriers.

It is important to note that changing from the old smp_{r,w,}mb() barriers to load-acquire/store-release barriers does not break compatibility: the old semantics work with the new ones, and vice versa. As pointed out by "Documentation/memory-barriers.txt" in the "SMP BARRIER PAIRING" section:

"General barriers pair with each other, though they also pair with most other types of barriers, albeit without multicopy atomicity. An acquire barrier pairs with a release barrier, but both may also pair with other barriers, including of course general barriers."

How different barriers behave and pair is outlined in "tools/memory-model/Documentation/cheatsheet.txt".

In order to make sure that compatibility is not broken, LKMM herd7 based litmus tests can be constructed and verified. We generalize the XDP socket ring to a one-entry ring, and create two scenarios: one where the ring is full, where only the consumer can proceed, followed by the producer; and one where the ring is empty, where only the producer can proceed, followed by the consumer. Each scenario is then expanded to four different tests: general producer/general consumer, general producer/acqrel consumer, acqrel producer/general consumer, acqrel producer/acqrel consumer. In total, eight tests.

The empty ring test:

  C spsc-rb+empty

  // Simple one entry ring:
  //   prod cons   allowed action   prod cons
  //   0    0  =>  prod          => 1    0
  //   0    1  =>  cons          => 0    0
  //   1    0  =>  cons          => 1    1
  //   1    1  =>  prod          => 0    1

  {}

  // We start at prod==0, cons==0, data==0, i.e. nothing has been
  // written to the ring. From here only the producer can start, and
  // should write 1. Afterwards, consumer can continue and read 1 to
  // data. Can we enter state prod==1, cons==1, but consumer observed
  // the incorrect value of 0?

  P0(int *prod, int *cons, int *data)
  {
      ... producer
  }

  P1(int *prod, int *cons, int *data)
  {
      ... consumer
  }

  exists( 1:d=0 /\ prod=1 /\ cons=1 );

The full ring test:

  C spsc-rb+full

  // Simple one entry ring:
  //   prod cons   allowed action   prod cons
  //   0    0  =>  prod          => 1    0
  //   0    1  =>  cons          => 0    0
  //   1    0  =>  cons          => 1    1
  //   1    1  =>  prod          => 0    1

  { prod = 1; }

  // We start at prod==1, cons==0, data==1, i.e. producer has
  // written 0, so from here only the consumer can start, and should
  // consume 0. Afterwards, producer can continue and write 1 to
  // data. Can we enter state prod==0, cons==1, but consumer observed
  // the write of 1?

  P0(int *prod, int *cons, int *data)
  {
      ... producer
  }

  P1(int *prod, int *cons, int *data)
  {
      ... consumer
  }

  exists( 1:d=1 /\ prod=0 /\ cons=1 );

where P0 and P1 are:

  P0(int *prod, int *cons, int *data)
  {
      int p;

      p = READ_ONCE(*prod);
      if (READ_ONCE(*cons) == p) {
          WRITE_ONCE(*data, 1);
          smp_wmb();
          WRITE_ONCE(*prod, p ^ 1);
      }
  }

  P0(int *prod, int *cons, int *data)
  {
      int p;

      p = READ_ONCE(*prod);
      if (READ_ONCE(*cons) == p) {
          WRITE_ONCE(*data, 1);
          smp_store_release(prod, p ^ 1);
      }
  }

  P1(int *prod, int *cons, int *data)
  {
      int c;
      int d = -1;

      c = READ_ONCE(*cons);
      if (READ_ONCE(*prod) != c) {
          smp_rmb();
          d = READ_ONCE(*data);
          smp_mb();
          WRITE_ONCE(*cons, c ^ 1);
      }
  }

  P1(int *prod, int *cons, int *data)
  {
      int c;
      int d = -1;

      c = READ_ONCE(*cons);
      if (smp_load_acquire(prod) != c) {
          d = READ_ONCE(*data);
          smp_store_release(cons, c ^ 1);
      }
  }

The full LKMM litmus tests are found at [1].

On x86-64 systems the l2fwd AF_XDP xdpsock sample performance increases by 1%. This is mostly due to the removal of smp_mb(), which is a relatively expensive operation on these platforms. Weakly-ordered platforms, such as ARM64, might benefit even more.

[1] https://github.com/bjoto/litmus-xsk

Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Toke Høiland-Jørgensen <toke@redhat.com>
Link: https://lore.kernel.org/bpf/20210305094113.413544-2-bjorn.topel@gmail.com
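For reference, a minimal sketch of the kernel-side pattern after this change, in the spirit of the acqrel P0/P1 above; the struct and function names are illustrative, while the real code lives in net/xdp/xsk_queue.h.

```c
#include <asm/barrier.h>
#include <linux/types.h>

struct xdp_ring_sketch {	/* illustrative stand-in */
	u32 producer;
	u32 consumer;
};

/* Producer: a release store publishes the new producer index only
 * after the ring entries themselves are visible.
 */
static void prod_publish(struct xdp_ring_sketch *r, u32 prod)
{
	smp_store_release(&r->producer, prod);	/* pairs with acquire */
}

/* Consumer: an acquire load of the producer index guarantees the
 * published entries are visible before they are read.
 */
static u32 cons_read_prod(struct xdp_ring_sketch *r)
{
	return smp_load_acquire(&r->producer);	/* pairs with release */
}
```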
-
- 06 March 2021, 1 commit
-
-
Committed by Sergey Nazarov

We need to use put_unaligned() when writing the 32-bit DOI value in cipso_v4_gentag_hdr() to avoid unaligned memory access.

v2: unneeded type cast removed, as Ondrej Mosnacek suggested.

Signed-off-by: Sergey Nazarov <s-nazarov@yandex.ru>
Acked-by: Paul Moore <paul@paul-moore.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
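The shape of the fix as a hedged sketch; the offset into the option buffer is an assumption, and only the switch to an unaligned accessor comes from the commit.

```c
#include <asm/unaligned.h>
#include <linux/types.h>

/* Before (sketch): a direct 32-bit store assumes an alignment the
 * CIPSO option buffer does not guarantee:
 *
 *	*(__be32 *)&buf[2] = htonl(doi);
 *
 * After: put_unaligned_be32() is safe at any alignment.
 */
static void cipso_write_doi(unsigned char *buf, u32 doi)
{
	put_unaligned_be32(doi, &buf[2]);
}
```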
-
- 05 March 2021, 10 commits
-
-
Committed by Xuesen Huang

bpf_skb_adjust_room() sets the inner_protocol as skb->protocol for packet encapsulation. But that is not appropriate when pushing an Ethernet header. Add an option to further specify the encap L2 type and set the inner_protocol to ETH_P_TEB.

Suggested-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: Xuesen Huang <huangxuesen@kuaishou.com>
Signed-off-by: Zhiyong Cheng <chengzhiyong@kuaishou.com>
Signed-off-by: Li Wang <wangli09@kuaishou.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Willem de Bruijn <willemb@google.com>
Link: https://lore.kernel.org/bpf/20210304064046.6232-1-hxseverything@gmail.com
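A hedged BPF usage sketch: BPF_F_ADJ_ROOM_ENCAP_L2_ETH is the new flag per this series, telling the helper that the pushed L2 header is Ethernet so the inner protocol becomes ETH_P_TEB; the program shape, mode, and header sizes are assumptions.

```c
#include <linux/bpf.h>
#include <linux/if_ether.h>
#include <linux/ip.h>
#include <linux/pkt_cls.h>
#include <bpf/bpf_helpers.h>

SEC("tc")
int encap_eth(struct __sk_buff *skb)
{
	__u64 flags = BPF_F_ADJ_ROOM_FIXED_GSO |
		      BPF_F_ADJ_ROOM_ENCAP_L3_IPV4 |
		      BPF_F_ADJ_ROOM_ENCAP_L2_ETH |	/* new flag */
		      BPF_F_ADJ_ROOM_ENCAP_L2(ETH_HLEN);
	long olen = sizeof(struct iphdr) + ETH_HLEN;	/* assumed outer size */

	/* Make room for outer IPv4 + Ethernet; inner proto -> ETH_P_TEB. */
	if (bpf_skb_adjust_room(skb, olen, BPF_ADJ_ROOM_MAC, flags))
		return TC_ACT_SHOT;

	/* ...then store the new outer headers with bpf_skb_store_bytes(). */
	return TC_ACT_OK;
}
```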
-
Committed by Lorenz Bauer

Allow passing sk_lookup programs to PROG_TEST_RUN. User space provides the full bpf_sk_lookup struct as context. Since the context includes a socket pointer that can't be exposed to user space, we define that PROG_TEST_RUN returns the cookie of the selected socket, or zero in place of the socket pointer. We don't support testing programs that select a reuseport socket, since this would mean running another (unrelated) BPF program from the sk_lookup test handler.

Signed-off-by: Lorenz Bauer <lmb@cloudflare.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20210303101816.36774-3-lmb@cloudflare.com
-
Committed by Lorenz Bauer

Share the timing / signal-interruption logic between different implementations of PROG_TEST_RUN. There is a change in behaviour as well: we now check the loop exit condition before checking for pending signals. This resolves an edge case where a signal arrives during the last iteration. Instead of aborting with EINTR, we return the successful result to user space.

Signed-off-by: Lorenz Bauer <lmb@cloudflare.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20210303101816.36774-2-lmb@cloudflare.com
-
Committed by Paul Moore

The current CIPSO and CALIPSO refcounting scheme for the DOI definitions is a bit flawed in that we:

1. Don't correctly match gets/puts in netlbl_cipsov4_list().

2. Decrement the refcount on each attempt to remove the DOI from the DOI list, only removing it from the list once the refcount drops to zero.

This patch fixes these problems by adding the missing "puts" to netlbl_cipsov4_list() and introducing a more conventional, i.e. not-buggy, refcounting mechanism to the DOI definitions. Upon the addition of a DOI to the DOI list, it is initialized with a refcount of one; removing a DOI from the list removes it from the list and drops the refcount by one. "Gets" and "puts" behave as expected with respect to refcounts, increasing and decreasing the DOI's refcount by one.

Fixes: b1edeb10 ("netlabel: Replace protocol/NetLabel linking with refrerence counts")
Fixes: d7cce015 ("netlabel: Add support for removing a CALIPSO DOI.")
Reported-by: syzbot+9ec037722d2603a9f52e@syzkaller.appspotmail.com
Signed-off-by: Paul Moore <paul@paul-moore.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
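A generic sketch of the conventional scheme the patch adopts; the types are illustrative, and NetLabel's actual DOI structures and locking differ in detail.

```c
#include <linux/list.h>
#include <linux/refcount.h>
#include <linux/slab.h>

struct doi_def_sketch {		/* illustrative stand-in */
	refcount_t ref_count;
	struct list_head list;
};

static void doi_put(struct doi_def_sketch *doi)
{
	if (refcount_dec_and_test(&doi->ref_count))
		kfree(doi);	/* last reference dropped */
}

static bool doi_get(struct doi_def_sketch *doi)
{
	return refcount_inc_not_zero(&doi->ref_count);
}

static void doi_add(struct doi_def_sketch *doi, struct list_head *doi_list)
{
	refcount_set(&doi->ref_count, 1);	/* the list's reference */
	list_add_tail(&doi->list, doi_list);
}

static void doi_remove(struct doi_def_sketch *doi)
{
	list_del(&doi->list);
	doi_put(doi);	/* drop the list's reference exactly once */
}
```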
-
Committed by Geliang Tang

When the port number mismatches the announced ones, use 'goto dispose_child' to free the resources instead of 'goto out'. This patch also moves the port-number checking code in subflow_syn_recv_sock() before mptcp_finish_join(); otherwise subflow_drop_ctx() will fail in dispose_child.

Fixes: 5bc56388 ("mptcp: add port number check for MP_JOIN")
Reported-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Geliang Tang <geliangtang@gmail.com>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
-
Committed by Paolo Abeni

__mptcp_clean_una() can free write memory and should wake up user-space processes when needed. When this function is invoked by the MPTCP receive path, the wakeup is not needed, as the TCP stack will later trigger subflow_write_space(), which will do the wakeup as needed. Other __mptcp_clean_una() call sites need an additional wakeup check. Let's bundle the relevant code in a new helper and use it.

Closes: https://github.com/multipath-tcp/mptcp_net-next/issues/165
Fixes: 6e628cd3 ("mptcp: use mptcp release_cb for delayed tasks")
Fixes: 64b9cea7 ("mptcp: fix spurious retransmissions")
Tested-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
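The helper bundling described above, as a minimal sketch modeled on the commit; the surrounding declarations are assumed.

```c
/* Clean acked data and, for the call sites that need it, wake any
 * user-space writers waiting for send buffer space.
 */
static void __mptcp_clean_una_wakeup(struct sock *sk)
{
	__mptcp_clean_una(sk);
	mptcp_write_space(sk);
}
```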
-
Committed by Paolo Abeni

If we receive an MPTCP_PUSH_PENDING event from a subflow when mptcp_release_cb() is serving the previous one, the latter will be delayed up to the next release_sock(msk). Address the issue by implementing a test/serve loop for such events. Additionally, rename the push helper to __mptcp_push_pending() to be more consistent with the existing code.

Fixes: 6e628cd3 ("mptcp: use mptcp release_cb for delayed tasks")
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
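A simplified, hedged sketch of the test/serve loop; the real mptcp_release_cb() also juggles the socket spinlock around the call, a detail omitted here.

```c
static void mptcp_serve_push_pending(struct sock *sk)
{
	struct mptcp_sock *msk = mptcp_sk(sk);

	/* Keep serving until no new MPTCP_PUSH_PENDING arrived while
	 * the previous one was being handled, so a late event is not
	 * deferred to the next release_sock().
	 */
	for (;;) {
		if (!test_and_clear_bit(MPTCP_PUSH_PENDING, &msk->flags))
			break;

		__mptcp_push_pending(sk, 0);
	}
}
```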
-
Committed by Paolo Abeni

This will simplify the following patch; no functional change intended.

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
-
Committed by Florian Westphal

Just like with last_snd, we have to NULL 'first' on subflow close. ack_hint isn't strictly required (it's never dereferenced), but it's better to clear it explicitly as well instead of making it an exception. msk->first is dereferenced unconditionally at accept time, but at that point the ssk is not on the conn_list yet -- this means the worker can't see it when iterating the conn_list.

Reported-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
-
Committed by Florian Westphal

Christoph Paasch reported the following crash:

dst_release underflow
WARNING: CPU: 0 PID: 1319 at net/core/dst.c:175 dst_release+0xc1/0xd0 net/core/dst.c:175
CPU: 0 PID: 1319 Comm: syz-executor217 Not tainted 5.11.0-rc6af8e85128b4d0d24083c5cac646e891227052e0c #70
Call Trace:
 rt_cache_route+0x12e/0x140 net/ipv4/route.c:1503
 rt_set_nexthop.constprop.0+0x1fc/0x590 net/ipv4/route.c:1612
 __mkroute_output net/ipv4/route.c:2484 [inline]
 ...

The worker leaves msk->subflow alone even when it happened to close the subflow ssk associated with it.

Fixes: 866f26f2 ("mptcp: always graft subflow socket to parent")
Closes: https://github.com/multipath-tcp/mptcp_net-next/issues/157
Reported-by: Christoph Paasch <cpaasch@apple.com>
Suggested-by: Paolo Abeni <pabeni@redhat.com>
Acked-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
-