- 16 11月, 2021 8 次提交
-
-
由 Eric Dumazet 提交于
Use INDIRECT_CALL_INET() to avoid an indirect call when/if CONFIG_RETPOLINE=y Signed-off-by: NEric Dumazet <edumazet@google.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Eric Dumazet 提交于
(struct proto)->sk_forward_alloc is currently only used by MPTCP. Signed-off-by: NEric Dumazet <edumazet@google.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Eric Dumazet 提交于
Move sk_bind_phc next to sk_peer_lock to fill a hole. Signed-off-by: NEric Dumazet <edumazet@google.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Eric Dumazet 提交于
gso_size can be moved after tclass, to use an existing hole. (8 bytes saved on 64bit arches) Signed-off-by: NEric Dumazet <edumazet@google.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Eric Dumazet 提交于
Instead of using a full netdev_features_t, we can use a single bit, as sk_route_nocaps is only used to remove NETIF_F_GSO_MASK from sk->sk_route_cap. Signed-off-by: NEric Dumazet <edumazet@google.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Eric Dumazet 提交于
We were only using one bit, and we can replace it by sk_is_tcp() Signed-off-by: NEric Dumazet <edumazet@google.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Eric Dumazet 提交于
Move sk_is_tcp() to include/net/sock.h and use it where we can. Signed-off-by: NEric Dumazet <edumazet@google.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Eric Dumazet 提交于
For some reason, I forgot to change __tcp_v6_send_check() at the same time I removed (ip_summed == CHECKSUM_PARTIAL) check in __tcp_v4_send_check() Fixes: 98be9b12 ("tcp: remove dead code after CHECKSUM_PARTIAL adoption") Signed-off-by: NEric Dumazet <edumazet@google.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 09 11月, 2021 2 次提交
-
-
由 Jussi Maki 提交于
The current conversion of skb->data_end reads like this: ; data_end = (void*)(long)skb->data_end; 559: (79) r1 = *(u64 *)(r2 +200) ; r1 = skb->data 560: (61) r11 = *(u32 *)(r2 +112) ; r11 = skb->len 561: (0f) r1 += r11 562: (61) r11 = *(u32 *)(r2 +116) 563: (1f) r1 -= r11 But similar to the case in 84f44df6 ("bpf: sock_ops sk access may stomp registers when dst_reg = src_reg"), the code will read an incorrect skb->len when src == dst. In this case we end up generating this xlated code: ; data_end = (void*)(long)skb->data_end; 559: (79) r1 = *(u64 *)(r1 +200) ; r1 = skb->data 560: (61) r11 = *(u32 *)(r1 +112) ; r11 = (skb->data)->len 561: (0f) r1 += r11 562: (61) r11 = *(u32 *)(r1 +116) 563: (1f) r1 -= r11 ... where line 560 is the reading 4B of (skb->data + 112) instead of the intended skb->len Here the skb pointer in r1 gets set to skb->data and the later deref for skb->len ends up following skb->data instead of skb. This fixes the issue similarly to the patch mentioned above by creating an additional temporary variable and using to store the register when dst_reg = src_reg. We name the variable bpf_temp_reg and place it in the cb context for sk_skb. Then we restore from the temp to ensure nothing is lost. Fixes: 16137b09 ("bpf: Compute data_end dynamically with JIT code") Signed-off-by: NJussi Maki <joamaki@gmail.com> Signed-off-by: NJohn Fastabend <john.fastabend@gmail.com> Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net> Reviewed-by: NJakub Sitnicki <jakub@cloudflare.com> Link: https://lore.kernel.org/bpf/20211103204736.248403-6-john.fastabend@gmail.com
-
由 John Fastabend 提交于
Strparser is reusing the qdisc_skb_cb struct to stash the skb message handling progress, e.g. offset and length of the skb. First this is poorly named and inherits a struct from qdisc that doesn't reflect the actual usage of cb[] at this layer. But, more importantly strparser is using the following to access its metadata. (struct _strp_msg *)((void *)skb->cb + offsetof(struct qdisc_skb_cb, data)) Where _strp_msg is defined as: struct _strp_msg { struct strp_msg strp; /* 0 8 */ int accum_len; /* 8 4 */ /* size: 12, cachelines: 1, members: 2 */ /* last cacheline: 12 bytes */ }; So we use 12 bytes of ->data[] in struct. However in BPF code running parser and verdict the user has read capabilities into the data[] array as well. Its not too problematic, but we should not be exposing internal state to BPF program. If its really needed then we can use the probe_read() APIs which allow reading kernel memory. And I don't believe cb[] layer poses any API breakage by moving this around because programs can't depend on cb[] across layers. In order to fix another issue with a ctx rewrite we need to stash a temp variable somewhere. To make this work cleanly this patch builds a cb struct for sk_skb types called sk_skb_cb struct. Then we can use this consistently in the strparser, sockmap space. Additionally we can start allowing ->cb[] write access after this. Fixes: 604326b4 ("bpf, sockmap: convert to generic sk_msg interface") Signed-off-by: NJohn Fastabend <john.fastabend@gmail.com> Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net> Tested-by: NJussi Maki <joamaki@gmail.com> Reviewed-by: NJakub Sitnicki <jakub@cloudflare.com> Link: https://lore.kernel.org/bpf/20211103204736.248403-5-john.fastabend@gmail.com
-
- 08 11月, 2021 1 次提交
-
-
由 Eric Dumazet 提交于
Both ifindex and LLC_SK_DEV_HASH_ENTRIES are signed. This means that (ifindex % LLC_SK_DEV_HASH_ENTRIES) is negative if @ifindex is negative. We could simply make LLC_SK_DEV_HASH_ENTRIES unsigned. In this patch I chose to use hash_32() to get more entropy from @ifindex, like llc_sk_laddr_hashfn(). UBSAN: array-index-out-of-bounds in ./include/net/llc.h:75:26 index -43 is out of range for type 'hlist_head [64]' CPU: 1 PID: 20999 Comm: syz-executor.3 Not tainted 5.15.0-syzkaller #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 Call Trace: <TASK> __dump_stack lib/dump_stack.c:88 [inline] dump_stack_lvl+0xcd/0x134 lib/dump_stack.c:106 ubsan_epilogue+0xb/0x5a lib/ubsan.c:151 __ubsan_handle_out_of_bounds.cold+0x62/0x6c lib/ubsan.c:291 llc_sk_dev_hash include/net/llc.h:75 [inline] llc_sap_add_socket+0x49c/0x520 net/llc/llc_conn.c:697 llc_ui_bind+0x680/0xd70 net/llc/af_llc.c:404 __sys_bind+0x1e9/0x250 net/socket.c:1693 __do_sys_bind net/socket.c:1704 [inline] __se_sys_bind net/socket.c:1702 [inline] __x64_sys_bind+0x6f/0xb0 net/socket.c:1702 do_syscall_x64 arch/x86/entry/common.c:50 [inline] do_syscall_64+0x35/0xb0 arch/x86/entry/common.c:80 entry_SYSCALL_64_after_hwframe+0x44/0xae RIP: 0033:0x7fa503407ae9 Fixes: 6d2e3ea2 ("llc: use a device based hash table to speed up multicast delivery") Signed-off-by: NEric Dumazet <edumazet@google.com> Reported-by: Nsyzbot <syzkaller@googlegroups.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 04 11月, 2021 1 次提交
-
-
由 Dominique Martinet 提交于
Sohaib Mohamed started a serie of tiny and incomplete checkpatch fixes but seemingly stopped halfway -- take over and do most of it. This is still missing net/9p/trans* and net/9p/protocol.c for a later time... Link: http://lkml.kernel.org/r/20211102134608.1588018-3-dominique.martinet@atmark-techno.comSigned-off-by: NDominique Martinet <asmadeus@codewreck.org>
-
- 03 11月, 2021 4 次提交
-
-
由 Talal Ahmad 提交于
Track skbs containing only zerocopy data and avoid charging them to kernel memory to correctly account the memory utilization for msg_zerocopy. All of the data in such skbs is held in user pages which are already accounted to user. Before this change, they are charged again in kernel in __zerocopy_sg_from_iter. The charging in kernel is excessive because data is not being copied into skb frags. This excessive charging can lead to kernel going into memory pressure state which impacts all sockets in the system adversely. Mark pure zerocopy skbs with a SKBFL_PURE_ZEROCOPY flag and remove charge/uncharge for data in such skbs. Initially, an skb is marked pure zerocopy when it is empty and in zerocopy path. skb can then change from a pure zerocopy skb to mixed data skb (zerocopy and copy data) if it is at tail of write queue and there is room available in it and non-zerocopy data is being sent in the next sendmsg call. At this time sk_mem_charge is done for the pure zerocopied data and the pure zerocopy flag is unmarked. We found that this happens very rarely on workloads that pass MSG_ZEROCOPY. A pure zerocopy skb can later be coalesced into normal skb if they are next to each other in queue but this patch prevents coalescing from happening. This avoids complexity of charging when skb downgrades from pure zerocopy to mixed. This is also rare. In sk_wmem_free_skb, if it is a pure zerocopy skb, an sk_mem_uncharge for SKB_TRUESIZE(skb_end_offset(skb)) is done for sk_mem_charge in tcp_skb_entail for an skb without data. Testing with the msg_zerocopy.c benchmark between two hosts(100G nics) with zerocopy showed that before this patch the 'sock' variable in memory.stat for cgroup2 that tracks sum of sk_forward_alloc, sk_rmem_alloc and sk_wmem_queued is around 1822720 and with this change it is 0. This is due to no charge to sk_forward_alloc for zerocopy data and shows memory utilization for kernel is lowered. With this commit we don't see the warning we saw in previous commit which resulted in commit 84882cf7. Signed-off-by: NTalal Ahmad <talalahmad@google.com> Acked-by: NArjun Roy <arjunroy@google.com> Acked-by: NSoheil Hassas Yeganeh <soheil@google.com> Signed-off-by: NWillem de Bruijn <willemb@google.com> Signed-off-by: NEric Dumazet <edumazet@google.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Xin Long 提交于
This patch is to move secid and peer_secid from endpoint to association, and pass asoc to sctp_assoc_request and sctp_sk_clone instead of ep. As ep is the local endpoint and asoc represents a connection, and in SCTP one sk/ep could have multiple asoc/connection, saving secid/peer_secid for new asoc will overwrite the old asoc's. Note that since asoc can be passed as NULL, security_sctp_assoc_request() is moved to the place right after the new_asoc is created in sctp_sf_do_5_1B_init() and sctp_sf_do_unexpected_init(). v1->v2: - fix the description of selinux_netlbl_skbuff_setsid(), as Jakub noticed. - fix the annotation in selinux_sctp_assoc_request(), as Richard Noticed. Fixes: 72e89f50 ("security: Add support for SCTP security hooks") Reported-by: NPrashanth Prahlad <pprahlad@redhat.com> Reviewed-by: NRichard Haines <richard_c_haines@btinternet.com> Tested-by: NRichard Haines <richard_c_haines@btinternet.com> Signed-off-by: NXin Long <lucien.xin@gmail.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Dominique Martinet 提交于
- add missing SPDX-License-Identifier - remove (sometimes incorrect) file name from file header Link: http://lkml.kernel.org/r/20211102134608.1588018-2-dominique.martinet@atmark-techno.comSigned-off-by: NDominique Martinet <asmadeus@codewreck.org>
-
由 Thomas Weißschuh 提交于
Automatically load transport modules based on the trans= parameter passed to mount. This removes the requirement for the user to know which module to use. Link: http://lkml.kernel.org/r/20211017134611.4330-1-linux@weissschuh.netSigned-off-by: NThomas Weißschuh <linux@weissschuh.net> Signed-off-by: NDominique Martinet <asmadeus@codewreck.org>
-
- 02 11月, 2021 3 次提交
-
-
由 Jakub Kicinski 提交于
This reverts commit f1a456f8. WARNING: CPU: 1 PID: 6819 at net/core/skbuff.c:5429 skb_try_coalesce+0x78b/0x7e0 CPU: 1 PID: 6819 Comm: xxxxxxx Kdump: loaded Tainted: G S 5.15.0-04194-gd852503f7711 #16 RIP: 0010:skb_try_coalesce+0x78b/0x7e0 Code: e8 2a bf 41 ff 44 8b b3 bc 00 00 00 48 8b 7c 24 30 e8 19 c0 41 ff 44 89 f0 48 03 83 c0 00 00 00 48 89 44 24 40 e9 47 fb ff ff <0f> 0b e9 ca fc ff ff 4c 8d 70 ff 48 83 c0 07 48 89 44 24 38 e9 61 RSP: 0018:ffff88881f449688 EFLAGS: 00010282 RAX: 00000000fffffe96 RBX: ffff8881566e4460 RCX: ffffffff82079f7e RDX: 0000000000000003 RSI: dffffc0000000000 RDI: ffff8881566e47b0 RBP: ffff8881566e46e0 R08: ffffed102619235d R09: ffffed102619235d R10: ffff888130c91ae3 R11: ffffed102619235c R12: ffff88881f4498a0 R13: 0000000000000056 R14: 0000000000000009 R15: ffff888130c91ac0 FS: 00007fec2cbb9700(0000) GS:ffff88881f440000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007fec1b060d80 CR3: 00000003acf94005 CR4: 00000000003706e0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Call Trace: <IRQ> tcp_try_coalesce+0xeb/0x290 ? tcp_parse_options+0x610/0x610 ? mark_held_locks+0x79/0xa0 tcp_queue_rcv+0x69/0x2f0 tcp_rcv_established+0xa49/0xd40 ? tcp_data_queue+0x18a0/0x18a0 tcp_v6_do_rcv+0x1c9/0x880 ? rt6_mtu_change_route+0x100/0x100 tcp_v6_rcv+0x1624/0x1830 Signed-off-by: NJakub Kicinski <kuba@kernel.org>
-
由 Talal Ahmad 提交于
Track skbs with only zerocopy data and avoid charging them to kernel memory to correctly account the memory utilization for msg_zerocopy. All of the data in such skbs is held in user pages which are already accounted to user. Before this change, they are charged again in kernel in __zerocopy_sg_from_iter. The charging in kernel is excessive because data is not being copied into skb frags. This excessive charging can lead to kernel going into memory pressure state which impacts all sockets in the system adversely. Mark pure zerocopy skbs with a SKBFL_PURE_ZEROCOPY flag and remove charge/uncharge for data in such skbs. Initially, an skb is marked pure zerocopy when it is empty and in zerocopy path. skb can then change from a pure zerocopy skb to mixed data skb (zerocopy and copy data) if it is at tail of write queue and there is room available in it and non-zerocopy data is being sent in the next sendmsg call. At this time sk_mem_charge is done for the pure zerocopied data and the pure zerocopy flag is unmarked. We found that this happens very rarely on workloads that pass MSG_ZEROCOPY. A pure zerocopy skb can later be coalesced into normal skb if they are next to each other in queue but this patch prevents coalescing from happening. This avoids complexity of charging when skb downgrades from pure zerocopy to mixed. This is also rare. In sk_wmem_free_skb, if it is a pure zerocopy skb, an sk_mem_uncharge for SKB_TRUESIZE(MAX_TCP_HEADER) is done for sk_mem_charge in tcp_skb_entail for an skb without data. Testing with the msg_zerocopy.c benchmark between two hosts(100G nics) with zerocopy showed that before this patch the 'sock' variable in memory.stat for cgroup2 that tracks sum of sk_forward_alloc, sk_rmem_alloc and sk_wmem_queued is around 1822720 and with this change it is 0. This is due to no charge to sk_forward_alloc for zerocopy data and shows memory utilization for kernel is lowered. Signed-off-by: NTalal Ahmad <talalahmad@google.com> Acked-by: NArjun Roy <arjunroy@google.com> Acked-by: NSoheil Hassas Yeganeh <soheil@google.com> Signed-off-by: NWillem de Bruijn <willemb@google.com> Signed-off-by: NEric Dumazet <edumazet@google.com> Signed-off-by: NJakub Kicinski <kuba@kernel.org>
-
由 Talal Ahmad 提交于
sk_wmem_free_skb() is only used by TCP. Rename it to make this clear, and move its declaration to include/net/tcp.h Signed-off-by: NTalal Ahmad <talalahmad@google.com> Acked-by: NSoheil Hassas Yeganeh <soheil@google.com> Acked-by: NArjun Roy <arjunroy@google.com> Signed-off-by: NEric Dumazet <edumazet@google.com> Signed-off-by: NJakub Kicinski <kuba@kernel.org>
-
- 01 11月, 2021 9 次提交
-
-
由 Taehee Yoo 提交于
In the previous patch, igmp report handler was added. That handler can be used for mld too. So, it uses that common code to parse mld report message. Signed-off-by: NTaehee Yoo <ap420073@gmail.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Taehee Yoo 提交于
amt 'Relay' interface manages multicast groups(igmp/mld) and sources. In order to manage, it should have the function to parse igmp/mld report messages. So, this adds the logic for parsing igmp report messages and saves them on their own data structure. struct amt_group_node means one group(igmp/mld). struct amt_source_node means one source. The same source can't exist in the same group. The same group can exist in the same tunnel because it manages the host address too. The group information is used when forwarding multicast data. If there are no groups in the specific tunnel, Relay doesn't forward it. Although Relay manages sources, it doesn't support the source filtering feature. Because the reason to manage sources is just that in order to manage group more correctly. In the next patch, MLD part will be added. Signed-off-by: NTaehee Yoo <ap420073@gmail.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Taehee Yoo 提交于
Before forwarding multicast traffic, the amt interface establishes between gateway and relay. In order to establish, amt defined some message type and those message flow looks like the below. Gateway Relay ------- ----- : Request : [1] | N | |---------------------->| | Membership Query | [2] | N,MAC,gADDR,gPORT | |<======================| [3] | Membership Update | | ({G:INCLUDE({S})}) | |======================>| | | ---------------------:-----------------------:--------------------- | | | | | | *Multicast Data | *IP Packet(S,G) | | | gADDR,gPORT |<-----------------() | | *IP Packet(S,G) |<======================| | | ()<-----------------| | | | | | | ---------------------:-----------------------:--------------------- ~ ~ ~ Request ~ [4] | N' | |---------------------->| | Membership Query | [5] | N',MAC',gADDR',gPORT' | |<======================| [6] | | | Teardown | | N,MAC,gADDR,gPORT | |---------------------->| | | [7] | Membership Update | | ({G:INCLUDE({S})}) | |======================>| | | ---------------------:-----------------------:--------------------- | | | | | | *Multicast Data | *IP Packet(S,G) | | | gADDR',gPORT' |<-----------------() | | *IP Packet (S,G) |<======================| | | ()<-----------------| | | | | | | ---------------------:-----------------------:--------------------- | | : : 1. Discovery - Sent by Gateway to Relay - To find Relay unique ip address 2. Advertisement - Sent by Relay to Gateway - Contains the unique IP address 3. Request - Sent by Gateway to Relay - Solicit to receive 'Query' message. 4. Query - Sent by Relay to Gateway - Contains General Query message. 5. Update - Sent by Gateway to Relay - Contains report message. 6. Multicast Data - Sent by Relay to Gateway - encapsulated multicast traffic. 7. Teardown - Not supported at this time. Except for the Teardown message, it supports all messages. In the next patch, IGMP/MLD logic will be added. Signed-off-by: NTaehee Yoo <ap420073@gmail.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Taehee Yoo 提交于
It adds definitions and control plane code for AMT. this is very similar to udp tunneling interfaces such as gtp, vxlan, etc. In the next patch, data plane code will be added. Signed-off-by: NTaehee Yoo <ap420073@gmail.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Jakub Kicinski 提交于
devlink compat code needs to drop rtnl_lock to take devlink->lock to ensure correct lock ordering. This is problematic because we're not strictly guaranteed that the netdev will not disappear after we re-lock. It may open a possibility of nested ->begin / ->complete calls. Instead of calling into devlink under rtnl_lock take a ref on the devlink instance and make the call after we've dropped rtnl_lock. We (continue to) assume that netdevs have an implicit reference on the devlink returned from ndo_get_devlink_port Note that ndo_get_devlink_port will now get called under rtnl_lock. That should be fine since none of the drivers seem to be taking serious locks inside ndo_get_devlink_port. Signed-off-by: NJakub Kicinski <kuba@kernel.org> Reviewed-by: NLeon Romanovsky <leonro@nvidia.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Jakub Kicinski 提交于
Allow those who hold implicit reference on a devlink instance to try to take a full ref on it. This will be used from netdev code which has an implicit ref because of driver call ordering. Note that after recent changes devlink_unregister() may happen before netdev unregister, but devlink_free() should still happen after, so we are safe to try, but we can't just refcount_inc() and assume it's not zero. Signed-off-by: NJakub Kicinski <kuba@kernel.org> Reviewed-by: NLeon Romanovsky <leonro@nvidia.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Marek Behún 提交于
Add a new DSA switch operation, phylink_get_interfaces, which should fill in which PHY_INTERFACE_MODE_* are supported by given port. Use this before phylink_create() to fill phylinks supported_interfaces member, allowing phylink to determine which PHY_INTERFACE_MODEs are supported. Signed-off-by: NMarek Behún <kabel@kernel.org> [tweaked patch and description to add more complete support -- rmk] Signed-off-by: NRussell King <rmk+kernel@armlinux.org.uk> Signed-off-by: NRussell King (Oracle) <rmk+kernel@armlinux.org.uk> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Pablo Neira Ayuso 提交于
Allow to match and mangle on inner headers / payload data after the transport header. There is a new field in the pktinfo structure that stores the inner header offset which is calculated only when requested. Only TCP and UDP supported at this stage. Signed-off-by: NPablo Neira Ayuso <pablo@netfilter.org>
-
由 Pablo Neira Ayuso 提交于
Generalize boolean field to store more flags on the pktinfo structure. Signed-off-by: NPablo Neira Ayuso <pablo@netfilter.org>
-
- 29 10月, 2021 5 次提交
-
-
由 Yoshiki Komachi 提交于
When a packet of a new flow arrives in openvswitch kernel module, it dissects the packet and passes the extracted flow key to ovs-vswtichd daemon. If hw- offload configuration is enabled, the daemon creates a new TC flower entry to bypass openvswitch kernel module for the flow (TC flower can also offload flows to NICs but this time that does not matter). In this processing flow, I found the following issue in cases of GRE/IPIP packets. When ovs_flow_key_extract() in openvswitch module parses a packet of a new GRE (or IPIP) flow received on non-tunneling vports, it extracts information of the outer IP header for ip_proto/src_ip/dst_ip match keys. This means ovs-vswitchd creates a TC flower entry with IP protocol/addresses match keys whose values are those of the outer IP header. OTOH, TC flower, which uses flow_dissector (different parser from openvswitch module), extracts information of the inner IP header. The following flow is an example to describe the issue in more detail. <----------- Outer IP -----------------> <---------- Inner IP ----------> +----------+--------------+--------------+----------+----------+----------+ | ip_proto | src_ip | dst_ip | ip_proto | src_ip | dst_ip | | 47 (GRE) | 192.168.10.1 | 192.168.10.2 | 6 (TCP) | 10.0.0.1 | 10.0.0.2 | +----------+--------------+--------------+----------+----------+----------+ In this case, TC flower entry and extracted information are shown as below: - ovs-vswitchd creates TC flower entry with: - ip_proto: 47 - src_ip: 192.168.10.1 - dst_ip: 192.168.10.2 - TC flower extracts below for IP header matches: - ip_proto: 6 - src_ip: 10.0.0.1 - dst_ip: 10.0.0.2 Thus, GRE or IPIP packets never match the TC flower entry, as each dissector behaves differently. IMHO, the behavior of TC flower (flow dissector) does not look correct, as ip_proto/src_ip/dst_ip in TC flower match means the outermost IP header information except for GRE/IPIP cases. This patch adds a new flow_dissector flag FLOW_DISSECTOR_F_STOP_BEFORE_ENCAP which skips dissection of the encapsulated inner GRE/IPIP header in TC flower classifier. Signed-off-by: NYoshiki Komachi <komachi.yoshiki@gmail.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Jeremy Kerr 提交于
Now that we have an extension for MCTP data in skbs, populate the flow when a key has been created for the packet, and add a device driver operation to inform of flow destruction. Includes a fix for a warning with test builds: Reported-by: Nkernel test robot <lkp@intel.com> Signed-off-by: NJeremy Kerr <jk@codeconstruct.com.au> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Jeremy Kerr 提交于
This change adds a new skb extension for MCTP, to represent a request/response flow. The intention is to use this in a later change to allow i2c controllers to correctly configure a multiplexer over a flow. Since we have a cleanup function in the core path (if an extension is present), we'll need to make CONFIG_MCTP a bool, rather than a tristate. Includes a fix for a build warning with clang: Reported-by: Nkernel test robot <lkp@intel.com> Signed-off-by: NJeremy Kerr <jk@codeconstruct.com.au> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Xin Long 提交于
sctp_transport_pl_hlen() is called to calculate the outer header length for PL. However, as the Figure in rfc8899#section-4.4: Any additional headers .--- MPS -----. | | | v v v +------------------------------+ | IP | ** | PL | protocol data | +------------------------------+ <----- PLPMTU -----> <---------- PMTU --------------> Outer header are IP + Any additional headers, which doesn't include Packetization Layer itself header, namely sctphdr, whereas sctphdr is counted by __sctp_mtu_payload(). The incorrect calculation caused the link pathmtu to be set larger than expected by t->pl.pmtu + sctp_transport_pl_hlen(). This patch is to fix it by subtracting sctphdr len in sctp_transport_pl_hlen(). Fixes: d9e2e410 ("sctp: add the constants/variables and states and some APIs for transport") Signed-off-by: NXin Long <lucien.xin@gmail.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Xin Long 提交于
sctp_transport_pl_update() is called when transport update its dst and pathmtu, instead of stopping the PLPMTUD probe timer, PLPMTUD should start over and reset the probe timer. Otherwise, the PLPMTUD service would stop. Fixes: 92548ec2 ("sctp: add the probe timer in transport for PLPMTUD") Signed-off-by: NXin Long <lucien.xin@gmail.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 28 10月, 2021 7 次提交
-
-
由 Davide Caratti 提交于
using packetdrill it's possible to observe that the receiver key contains random values when clients transmit MP_CAPABLE with data and checksum (as specified in RFC8684 §3.1). Fix the layout of mptcp_out_options, to avoid using the skb extension copy when writing the MP_CAPABLE sub-option. Fixes: d7b26908 ("mptcp: shrink mptcp_out_options struct") Closes: https://github.com/multipath-tcp/mptcp_net-next/issues/233Reported-by: NPoorva Sonparote <psonparo@redhat.com> Signed-off-by: NDavide Caratti <dcaratti@redhat.com> Signed-off-by: NMat Martineau <mathew.j.martineau@linux.intel.com> Link: https://lore.kernel.org/r/20211027203855.264600-1-mathew.j.martineau@linux.intel.comSigned-off-by: NJakub Kicinski <kuba@kernel.org>
-
由 Daniel Jordan 提交于
sk->sk_err appears to expect a positive value, a convention that ktls doesn't always follow and that leads to memory corruption in other code. For instance, [kworker] tls_encrypt_done(..., err=<negative error from crypto request>) tls_err_abort(.., err) sk->sk_err = err; [task] splice_from_pipe_feed ... tls_sw_do_sendpage if (sk->sk_err) { ret = -sk->sk_err; // ret is positive splice_from_pipe_feed (continued) ret = actor(...) // ret is still positive and interpreted as bytes // written, resulting in underflow of buf->len and // sd->len, leading to huge buf->offset and bogus // addresses computed in later calls to actor() Fix all tls_err_abort() callers to pass a negative error code consistently and centralize the error-prone sign flip there, throwing in a warning to catch future misuse and uninlining the function so it really does only warn once. Cc: stable@vger.kernel.org Fixes: c46234eb ("tls: RX path for ktls") Reported-by: syzbot+b187b77c8474f9648fae@syzkaller.appspotmail.com Signed-off-by: NDaniel Jordan <daniel.m.jordan@oracle.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Eric Dumazet 提交于
We now have INDIRECT_CALL_INET_1() macro, no need to use #ifdef CONFIG_INET Signed-off-by: NEric Dumazet <edumazet@google.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Eric Dumazet 提交于
All tcp_remove_empty_skb() callers now use tcp_write_queue_tail() for the skb argument, we can therefore factorize code. Signed-off-by: NEric Dumazet <edumazet@google.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Paolo Abeni 提交于
A later patch will change the MPTCP memory accounting schema in such a way that MPTCP sockets will encode the total amount of forward allocated memory in two separate fields (one for tx and one for rx). MPTCP sockets will use their own helper to provide the accurate amount of fwd allocated memory. To allow the above, this patch adds a new, optional, sk method to fetch the fwd memory, wrap the call in a new helper and use it where it is appropriate. Signed-off-by: NPaolo Abeni <pabeni@redhat.com> Signed-off-by: NMat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: NJakub Kicinski <kuba@kernel.org>
-
由 Paolo Abeni 提交于
A following patch is going to implement a similar reclaim schema for the MPTCP protocol, with different locking. Let's define a couple of macros for the used thresholds, so that the latter code will be more easily maintainable. Signed-off-by: NPaolo Abeni <pabeni@redhat.com> Signed-off-by: NMat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: NJakub Kicinski <kuba@kernel.org>
-
由 Seth Forshee 提交于
Currently rcu_barrier() is used to ensure that no readers of the inactive mini_Qdisc buffer remain before it is reused. This waits for any pending RCU callbacks to complete, when all that is actually required is to wait for one RCU grace period to elapse after the buffer was made inactive. This means that using rcu_barrier() may result in unnecessary waits. To improve this, store the current RCU state when a buffer is made inactive and use poll_state_synchronize_rcu() to check whether a full grace period has elapsed before reusing it. If a full grace period has not elapsed, wait for a grace period to elapse, and in the non-RT case use synchronize_rcu_expedited() to hasten it. Since this approach eliminates the RCU callback it is no longer necessary to synchronize_rcu() in the tp_head==NULL case. However, the RCU state should still be saved for the previously active buffer. Before this change I would typically see mini_qdisc_pair_swap() take tens of milliseconds to complete. After this change it typcially finishes in less than 1 ms, and often it takes just a few microseconds. Thanks to Paul for walking me through the options for improving this. Cc: "Paul E. McKenney" <paulmck@kernel.org> Signed-off-by: NSeth Forshee <sforshee@digitalocean.com> Link: https://lore.kernel.org/r/20211026130700.121189-1-seth@forshee.meSigned-off-by: NJakub Kicinski <kuba@kernel.org>
-