- 24 June 2022, 2 commits
-
-
By Hangbin Liu
Add per-port priority support for bonding active slave re-selection during failover. A higher number means higher priority in selection. The primary slave still has the highest priority. This option also follows the primary_reselect rules. This option can only be configured via netlink. Signed-off-by: Hangbin Liu <liuhangbin@gmail.com> Acked-by: Jonathan Toppins <jtoppins@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
By Hangbin Liu
Currently, bond_opt_value is mostly used for bonding option settings. If we want to set a value for a slave, we need to re-allocate a string to store both the slave name and the value, like bond_option_queue_id_set() does, which is complex and dumb. As Jon suggested, let's add a union field slave_dev to bond_opt_value, which will benefit future slave option settings. In function __bond_opt_init(), we will always check the extra field and set it if it's not NULL. Suggested-by: Jonathan Toppins <jtoppins@redhat.com> Signed-off-by: Hangbin Liu <liuhangbin@gmail.com> Acked-by: Jonathan Toppins <jtoppins@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
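A minimal sketch of the resulting layout, assuming the fields described above; the union member names are illustrative, not copied from the kernel source:

    struct bond_opt_value {
            char *string;
            u64 value;
            u32 flags;
            union {
                    /* existing storage for extra option data */
                    char extra[BOND_OPT_EXTRA_MAXLEN];
                    /* new: lets a slave option carry its net_device directly
                     * instead of a re-allocated "name:value" string */
                    struct net_device *slave_dev;
            };
    };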
-
- 23 June 2022, 1 commit
-
-
By Jakub Kicinski
Commit 8a59f9d1 ("sock: Introduce sk->sk_prot->psock_update_sk_prot()") has moved the inet_csk_has_ulp(sk) check from sk_psock_init() to the new tcp_bpf_update_proto() function. I'm guessing that this was done to allow creating psocks for non-inet sockets. Unfortunately the destruction path for psock includes the ULP unwind, so we need to fail sk_psock_init() itself. Otherwise, if a ULP is already present, we'll notice that later and call tcp_update_ulp() with the sk_proto of the ULP itself, which will most likely result in the ULP looping its callbacks. Fixes: 8a59f9d1 ("sock: Introduce sk->sk_prot->psock_update_sk_prot()") Signed-off-by: Jakub Kicinski <kuba@kernel.org> Reviewed-by: John Fastabend <john.fastabend@gmail.com> Reviewed-by: Jakub Sitnicki <jakub@cloudflare.com> Tested-by: Jakub Sitnicki <jakub@cloudflare.com> Link: https://lore.kernel.org/r/20220620191353.1184629-2-kuba@kernel.org Signed-off-by: Paolo Abeni <pabeni@redhat.com>
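A sketch of the restored early check; the error value, locking, and surrounding flow of sk_psock_init() shown here are assumed for illustration:

    struct sk_psock *sk_psock_init(struct sock *sk, int node)
    {
            struct sk_psock *psock;

            write_lock_bh(&sk->sk_callback_lock);

            /* Bail out before any psock state exists, so the destruction
             * path never runs the ULP unwind against the ULP's own
             * sk_proto (error code assumed). */
            if (inet_csk_has_ulp(sk)) {
                    psock = ERR_PTR(-EINVAL);
                    goto out;
            }

            /* ... normal psock allocation continues here ... */
    out:
            write_unlock_bh(&sk->sk_callback_lock);
            return psock;
    }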
-
- 22 June 2022, 4 commits
-
-
By Kuniyuki Iwashima
unix_table_locks protect the global hash table, unix_socket_table. The previous commit removed the table, so let's clean up the now-unnecessary locks. Here is a test result on an EC2 c5.9xlarge instance, where 10 processes run concurrently in different netns and each binds 100,000 sockets:

    without this series : 1m 38s
    with this series    : 11s

It is ~10x faster because the global hash table is split into 10 netns in this case. Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
By Kuniyuki Iwashima
This commit replaces the global hash table with a per-netns one and removes the global one. We now link each socket into its netns's hash table, so we can save some netns comparisons when iterating through a hash bucket. Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
By Kuniyuki Iwashima
This commit adds a per-netns hash table for AF_UNIX, whose size is fixed at UNIX_HASH_SIZE for now. The first implementation defined the per-netns hash table as a single array of lock and list:

    struct unix_hashbucket {
            spinlock_t        lock;
            struct hlist_head head;
    };

    struct netns_unix {
            struct unix_hashbucket *hash;
            ...
    };

But Eric pointed out the memory cost: the structure has holes because of sizeof(spinlock_t), which is 4 (or more if LOCKDEP is enabled). [0] It could be expensive on a host with thousands of netns and few AF_UNIX sockets. For this reason, the per-netns hash table instead uses two dense arrays:

    struct unix_table {
            spinlock_t        *locks;
            struct hlist_head *buckets;
    };

    struct netns_unix {
            struct unix_table table;
            ...
    };

Note that the list length, rather than lock contention, has the significant impact here, so having shared locks could be an option; but per-netns locks and lists still perform better than global locks with per-netns lists. [1] Also, this patch adds a change so that struct netns_unix disappears from struct net if CONFIG_UNIX is disabled.
[0]: https://lore.kernel.org/netdev/CANn89iLVxO5aqx16azNU7p7Z-nz5NrnM5QTqOzueVxEnkVTxyg@mail.gmail.com/
[1]: https://lore.kernel.org/netdev/20220617175215.1769-1-kuniyu@amazon.com/
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
By Kuniyuki Iwashima
Currently, the size of the AF_UNIX hash table is UNIX_HASH_SIZE * 2: the first half for bind()ed sockets and the second half for unbound ones. UNIX_HASH_SIZE * 2 is used to define the table and iterate over it. In some places, we use ARRAY_SIZE(unix_socket_table) instead of UNIX_HASH_SIZE * 2. However, we cannot use it anymore because we will allocate the hash table dynamically. Then we would have to write UNIX_HASH_SIZE * 2 in many places, which would be troublesome. This patch adapts the UNIX_HASH_SIZE definition to include both bound and unbound sockets and defines a new UNIX_HASH_MOD macro to ease calculations. Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Signed-off-by: David S. Miller <davem@davemloft.net>
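A sketch of what the adapted definitions could look like; the 256-bucket count is assumed purely for illustration:

    /* illustrative values only -- the real bucket count is whatever
     * the kernel used before this change */
    #define UNIX_HASH_MOD   (256 - 1)       /* eases "hash & UNIX_HASH_MOD" math */
    #define UNIX_HASH_SIZE  (256 * 2)       /* bound + unbound sockets together */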
-
- 21 June 2022, 1 commit
-
-
By Eric Dumazet
raw_diag_dump() can use rcu_read_lock() instead of read_lock(). Now that the hashinfo lock is only used from process context, and in write mode only, we can convert it to a spinlock, and we no longer need to block BH. Signed-off-by: Eric Dumazet <edumazet@google.com> Link: https://lore.kernel.org/r/20220620100509.3493504-1-eric.dumazet@gmail.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
-
- 19 June 2022, 2 commits
-
-
By Eric Dumazet
Using rwlock in networking code is extremely risky: writers can starve if enough readers are constantly grabbing the rwlock. I thought rwlocks were at fault and sent this patch: https://lkml.org/lkml/2022/6/17/272 But Peter and Linus essentially told me rwlock had to be unfair. We need to get rid of rwlock in networking code. Without this fix, the following script triggers soft lockups:

    for i in {1..48}
    do
        ping -f -n -q 127.0.0.1 &
        sleep 0.1
    done

Fixes: 1da177e4 ("Linux-2.6.12-rc2") Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
By Eric Dumazet
In order to prepare for the following patch, I change the raw v4 & v6 code to use more conventional iterators. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
- 17 June 2022, 2 commits
-
-
By Maxim Mikityanskiy
The new helpers bpf_tcp_raw_{gen,check}_syncookie_ipv{4,6} allow an XDP program to generate SYN cookies in response to TCP SYN packets and to check those cookies upon receiving the first ACK packet (the final packet of the TCP handshake). Unlike bpf_tcp_{gen,check}_syncookie, these new helpers don't need a listening socket on the local machine, which allows using them together with synproxy to accelerate SYN cookie generation. Signed-off-by: Maxim Mikityanskiy <maximmi@nvidia.com> Reviewed-by: Tariq Toukan <tariqt@nvidia.com> Link: https://lore.kernel.org/r/20220615134847.3753567-4-maximmi@nvidia.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
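A minimal sketch of the check side from XDP, assuming a fixed 20-byte IPv4 header with no options; a real synproxy program would also craft the SYNACK using bpf_tcp_raw_gen_syncookie_ipv4():

    #include <linux/bpf.h>
    #include <linux/if_ether.h>
    #include <linux/ip.h>
    #include <linux/tcp.h>
    #include <bpf/bpf_helpers.h>
    #include <bpf/bpf_endian.h>

    SEC("xdp")
    int syncookie_check(struct xdp_md *ctx)
    {
            void *data = (void *)(long)ctx->data;
            void *data_end = (void *)(long)ctx->data_end;
            struct ethhdr *eth = data;
            struct iphdr *iph = (void *)(eth + 1);
            struct tcphdr *tcph = (void *)(iph + 1);

            /* one bounds check covers all three headers */
            if ((void *)(tcph + 1) > data_end ||
                eth->h_proto != bpf_htons(ETH_P_IP) ||
                iph->protocol != IPPROTO_TCP || iph->ihl != 5)
                    return XDP_PASS;

            /* the first ACK of the handshake must carry a valid cookie;
             * no listening socket is consulted */
            if (tcph->ack && !tcph->syn &&
                bpf_tcp_raw_check_syncookie_ipv4(iph, tcph) < 0)
                    return XDP_DROP;

            return XDP_PASS;
    }

    char _license[] SEC("license") = "GPL";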
-
By Joanne Koong
This reverts commit d5a42de8 ("net: Add a second bind table hashed by port and address") and commit 538aaf9b ("selftests: Add test for timing a bind request to a port with a populated bhash entry"). Link: https://lore.kernel.org/netdev/20220520001834.2247810-1-kuba@kernel.org/ There are a few things that need to be fixed here: updating bhash2 in cases where the socket's rcv saddr changes, and adding bhash2 hashbucket locks. Links to syzbot reports: https://lore.kernel.org/netdev/00000000000022208805e0df247a@google.com/ https://lore.kernel.org/netdev/0000000000003f33bc05dfaf44fe@google.com/ Fixes: d5a42de8 ("net: Add a second bind table hashed by port and address") Reported-by: syzbot+015d756bbd1f8b5c8f09@syzkaller.appspotmail.com Reported-by: syzbot+98fd2d1422063b0f8c44@syzkaller.appspotmail.com Reported-by: syzbot+0a847a982613c6438fba@syzkaller.appspotmail.com Signed-off-by: Joanne Koong <joannelkoong@gmail.com> Link: https://lore.kernel.org/r/20220615193213.2419568-1-joannelkoong@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
-
- 13 June 2022, 1 commit
-
-
By Eric Dumazet
sk_memory_allocated_add() has three callers and returns @memory_allocated to them. sk_forced_mem_schedule() is one of them and ignores the returned value. Change sk_memory_allocated_add() to return void. Change sock_reserve_memory() and __sk_mem_raise_allocated() to call sk_memory_allocated(). This removes one cache line miss [1] for RPC workloads, as the first skbs in the TCP write queue and receive queue go through sk_forced_mem_schedule(). [1] The cache line holding tcp_memory_allocated. Signed-off-by: Eric Dumazet <edumazet@google.com> Acked-by: Soheil Hassas Yeganeh <soheil@google.com> Reviewed-by: Shakeel Butt <shakeelb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
- 11 June 2022, 6 commits
-
-
By Eric Dumazet
Currently, tcp_memory_allocated can hit the tcp_mem[] limits quite fast. Each TCP socket can forward-allocate up to 2 MB of memory, even after the flow has become less active. 10,000 sockets can have reserved 20 GB of memory, and we have no shrinker in place to reclaim that. Instead of trying to reclaim the extra allocations in some places, just keep sk->sk_forward_alloc values as small as possible. This should not impact performance too much now that we have per-cpu reserves: changes to tcp_memory_allocated should not be too frequent. For sockets not using SO_RESERVE_MEM: idle sockets (no packets in tx/rx queues) have zero forward alloc, and non-idle sockets have a forward alloc smaller than one page. Note: removal of SK_RECLAIM_CHUNK and SK_RECLAIM_THRESHOLD is left to the MPTCP maintainers as a follow-up. Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Shakeel Butt <shakeelb@google.com> Acked-by: Soheil Hassas Yeganeh <soheil@google.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
-
By Eric Dumazet
If sk->sk_forward_alloc is 150000 and we need to schedule 150001 bytes, we want to allocate only 1 byte more (rounded up to one page), instead of 150001 :/ Fixes: 1da177e4 ("Linux-2.6.12-rc2") Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Shakeel Butt <shakeelb@google.com> Acked-by: Soheil Hassas Yeganeh <soheil@google.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
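A minimal sketch of the corrected accounting; the helper shape is assumed from the sock API described here, so treat it as illustrative rather than the exact diff:

    static inline int sk_mem_schedule(struct sock *sk, int size, int kind)
    {
            int delta;

            if (!sk_has_account(sk))
                    return 1;

            /* charge only the shortfall: 150001 - 150000 = 1 byte,
             * which __sk_mem_schedule() rounds up to one page */
            delta = size - sk->sk_forward_alloc;
            return delta <= 0 || __sk_mem_schedule(sk, delta, kind);
    }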
-
By Eric Dumazet
We plan to keep sk->sk_forward_alloc as small as possible in future patches. This means we are going to call sk_memory_allocated_add() and sk_memory_allocated_sub() more often. Implement a per-cpu cache of +1/-1 MB to reduce the number of changes to sk->sk_prot->memory_allocated, which would otherwise be a cause of false sharing. Signed-off-by: Eric Dumazet <edumazet@google.com> Acked-by: Soheil Hassas Yeganeh <soheil@google.com> Reviewed-by: Shakeel Butt <shakeelb@google.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
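A sketch of the batching idea, using the per-protocol per-cpu reserve pointer introduced by the commit below; the names and the 1 MB flush threshold mirror the description but are not copied from the kernel source:

    #define SK_MEMORY_PCPU_RESERVE (1 << 20)        /* flush every +/- 1 MB */

    static void sk_memory_allocated_add(struct sock *sk, int amt)
    {
            int local_reserve;

            preempt_disable();
            local_reserve = __this_cpu_add_return(*sk->sk_prot->per_cpu_fw_alloc, amt);
            if (local_reserve >= SK_MEMORY_PCPU_RESERVE) {
                    /* flush the accumulated per-cpu delta to the shared
                     * atomic, keeping false sharing rare */
                    __this_cpu_sub(*sk->sk_prot->per_cpu_fw_alloc, local_reserve);
                    atomic_long_add(local_reserve, sk->sk_prot->memory_allocated);
            }
            preempt_enable();
    }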
-
By Eric Dumazet
Each protocol having a ->memory_allocated pointer gets a corresponding per-cpu reserve, which the following patches will use. Instead of having reserved bytes per socket, we want per-cpu reserves. Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Shakeel Butt <shakeelb@google.com> Acked-by: Soheil Hassas Yeganeh <soheil@google.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
-
By Eric Dumazet
Due to the memcg interface, SK_MEM_QUANTUM is effectively PAGE_SIZE. This might change in the future, but it seems better to avoid the confusion. Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Shakeel Butt <shakeelb@google.com> Acked-by: Soheil Hassas Yeganeh <soheil@google.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
-
By Eric Dumazet
This reverts commit bd68a2a8. That change broke memcg on arches with PAGE_SIZE != 4096. Later, commit 2bb2f5fb ("net: add new socket option SO_RESERVE_MEM") also assumed PAGE_SIZE == SK_MEM_QUANTUM. The following patches in the series will greatly reduce the over-allocation problem. Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Shakeel Butt <shakeelb@google.com> Acked-by: Soheil Hassas Yeganeh <soheil@google.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
-
- 10 June 2022, 4 commits
-
-
By Johannes Berg
The only driver using this was iwlwifi, where we just removed the support because it was never really used. Remove the code from mac80211 as well. Change-Id: I1667417a5932315ee9d81f5c233c56a354923f09 Signed-off-by: Johannes Berg <johannes.berg@intel.com>
-
By Jonathan Toppins
Add support for reporting errors via extack in both bond_newlink and bond_changelink. Instead of having to look in the kernel log for why an option was not correct, just report the error to the user via the extack variable. What is reported today:

    ip link add bond0 type bond
    ip link set bond0 up
    ip link set bond0 type bond mode 4
    RTNETLINK answers: Device or resource busy

After this change:

    ip link add bond0 type bond
    ip link set bond0 up
    ip link set bond0 type bond mode 4
    Error: unable to set option because the bond is up.

Signed-off-by: Jonathan Toppins <jtoppins@redhat.com> Acked-by: Jay Vosburgh <jay.vosburgh@canonical.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
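A hedged sketch of how such an extack report can be emitted in the changelink path; the exact check and call site in the bonding driver are assumed:

    /* illustrative fragment, e.g. inside bond_changelink() */
    if (mode_changed && (bond_dev->flags & IFF_UP)) {
            NL_SET_ERR_MSG(extack,
                           "unable to set option because the bond is up");
            return -EBUSY;
    }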
-
By Eric Dumazet
As explained in commit 316580b6 ("u64_stats: provide u64_stats_t type"), we should use u64_stats_t and the related accessors to avoid load/store tearing. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
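A small sketch of the u64_stats_t pattern this kind of conversion moves to; the struct here is invented for illustration, not the one touched by the commit:

    struct rx_stats {
            u64_stats_t             rx_bytes;
            struct u64_stats_sync   syncp;
    };

    static void record_rx(struct rx_stats *s, unsigned int len)
    {
            u64_stats_update_begin(&s->syncp);
            u64_stats_add(&s->rx_bytes, len);       /* tearing-safe add */
            u64_stats_update_end(&s->syncp);
    }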
-
By Jakub Kicinski
Netdev reference helpers have a dev_ prefix for historic reasons. Renaming the old helpers would be too much churn, but we can rename the tracking ones, which are relatively recent and should be the default for new code. Rename:

    dev_hold_track()    -> netdev_hold()
    dev_put_track()     -> netdev_put()
    dev_replace_track() -> netdev_ref_replace()

Link: https://lore.kernel.org/r/20220608043955.919359-1-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
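A usage sketch of the renamed tracking helpers; the private struct is invented for illustration:

    struct my_priv {
            struct net_device *lower;
            netdevice_tracker  lower_tracker;
    };

    static void take_lower(struct my_priv *p, struct net_device *dev)
    {
            p->lower = dev;
            netdev_hold(dev, &p->lower_tracker, GFP_KERNEL);
    }

    static void put_lower(struct my_priv *p)
    {
            netdev_put(p->lower, &p->lower_tracker); /* pairs with netdev_hold() */
            p->lower = NULL;
    }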
-
- 09 June 2022, 2 commits
-
-
By Wang Yufen
Resurrecting the UBSAN overflow checks made UBSAN report the warning below; fix it by changing the type of the [length] variable to size_t.

    UBSAN: signed-integer-overflow in net/ipv6/ip6_output.c:1489:19
    2147479552 + 8567 cannot be represented in type 'int'
    CPU: 0 PID: 253 Comm: err Not tainted 5.16.0+ #1
    Hardware name: linux,dummy-virt (DT)
    Call trace:
     dump_backtrace+0x214/0x230
     show_stack+0x30/0x78
     dump_stack_lvl+0xf8/0x118
     dump_stack+0x18/0x30
     ubsan_epilogue+0x18/0x60
     handle_overflow+0xd0/0xf0
     __ubsan_handle_add_overflow+0x34/0x44
     __ip6_append_data.isra.48+0x1598/0x1688
     ip6_append_data+0x128/0x260
     udpv6_sendmsg+0x680/0xdd0
     inet6_sendmsg+0x54/0x90
     sock_sendmsg+0x70/0x88
     ____sys_sendmsg+0xe8/0x368
     ___sys_sendmsg+0x98/0xe0
     __sys_sendmmsg+0xf4/0x3b8
     __arm64_sys_sendmmsg+0x34/0x48
     invoke_syscall+0x64/0x160
     el0_svc_common.constprop.4+0x124/0x300
     do_el0_svc+0x44/0xc8
     el0_svc+0x3c/0x1e8
     el0t_64_sync_handler+0x88/0xb0
     el0t_64_sync+0x16c/0x170

Changes since v1: change the variable [length] type to unsigned, as Eric Dumazet suggested. Changes since v2: don't change the exthdrlen type in ip6_make_skb, as Paolo Abeni suggested. Changes since v3: don't change the ulen type in udpv6_sendmsg and l2tp_ip6_sendmsg, as Jakub Kicinski suggested. Reported-by: Hulk Robot <hulkci@huawei.com> Signed-off-by: Wang Yufen <wangyufen@huawei.com> Link: https://lore.kernel.org/r/20220607120028.845916-1-wangyufen@huawei.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
-
By Peter Lafreniere
Despite these inline functions having full visibility to the compiler at compile time, they still strip const from passed pointers. This change allows functions in various network drivers to be marked const where they could not be before. Signed-off-by: Peter Lafreniere <pjlafren@mtu.edu> Link: https://lore.kernel.org/r/20220606113458.35953-1-pjlafren@mtu.edu Signed-off-by: Jakub Kicinski <kuba@kernel.org>
-
- 07 June 2022, 4 commits
-
-
By Menglong Dong
To make the code clearer, reformat the comment in dropreason.h to kernel-doc style. Now the comment passes the kernel-doc check without warnings:

    $ ./scripts/kernel-doc -v -none include/linux/dropreason.h
    include/linux/dropreason.h:7: info: Scanning doc for enum skb_drop_reason

Signed-off-by: Menglong Dong <imagedong@tencent.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
-
By Menglong Dong
It is annoying to add new skb drop reasons to both 'enum skb_drop_reason' and TRACE_SKB_DROP_REASON in trace/event/skb.h, and it's easy to forget the TRACE_SKB_DROP_REASON half. TRACE_SKB_DROP_REASON converts a drop reason from its numeric value to a string. For now, the string we pass to user space is exactly the name in 'enum skb_drop_reason' minus the 'SKB_DROP_REASON_' prefix, so we can auto-generate the reason-to-string mapping at build time. The new source 'dropreason_str.c' is auto-generated during the build and contains the string array 'const char * const drop_reasons[]'. Signed-off-by: Menglong Dong <imagedong@tencent.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
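A hedged sketch of what the auto-generated array in dropreason_str.c might look like; only the naming rule (enum name minus the SKB_DROP_REASON_ prefix) comes from the description above:

    const char * const drop_reasons[] = {
            [SKB_DROP_REASON_NOT_SPECIFIED] = "NOT_SPECIFIED",
            [SKB_DROP_REASON_NO_SOCKET]     = "NO_SOCKET",
            [SKB_DROP_REASON_PKT_TOO_SMALL] = "PKT_TOO_SMALL",
            /* ... one entry per enum skb_drop_reason value ... */
    };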
-
By Menglong Dong
As the skb drop reasons keep growing in number, move the enum 'skb_drop_reason' and the related functions to the standalone header 'dropreason.h', as Jakub Kicinski suggested. Signed-off-by: Menglong Dong <imagedong@tencent.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
-
By Pablo Neira Ayuso
If the user requests NFT_CHAIN_HW_OFFLOAD, check that the device either provides the .ndo_setup_tc interface or has an indirect flow block registered; otherwise, bail out early from the preparation phase. Moreover, validate that family == NFPROTO_NETDEV and the hook is NF_NETDEV_INGRESS. Fixes: c9626a2c ("netfilter: nf_tables: add hardware offload support") Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
-
- 06 June 2022, 1 commit
-
-
By Linus Torvalds
The bluetooth code uses our bitmap infrastructure for the two bits (!) of connection setup flags, and in the process causes odd problems when it converts between a bitmap and the regular values of said bits. It's completely pointless to do things like bitmap_to_arr32() to convert a bitmap into a u32. It shouldn't have been a bitmap in the first place. The reason to use bitmaps is if you have an arbitrary number of bits you want to manage (not two!), or if you rely on the atomicity guarantees of the bitmap setting and clearing. The code could use an "atomic_t" and use "atomic_or/andnot()" to set and clear the bit values, but considering that it then copies the bitmaps around with "bitmap_to_arr32()" and friends, there clearly cannot be a lot of atomicity requirements. So just use a regular integer. In the process, this avoids the warnings about erroneous use of bitmap_from_u64() which were triggered on 32-bit architectures when conversion from a u64 would access two words (and, surprise, surprise, only one word is needed - and indeed overkill - for a 2-bit bitmap). That was always problematic, but the compiler seems to notice it and warn about the invalid pattern only after commit 0a97953f ("lib: add bitmap_{from,to}_arr64") changed the exact implementation details of 'bitmap_from_u64()', as reported by Sudip Mukherjee and Stephen Rothwell. Fixes: fe92ee64 ("Bluetooth: hci_core: Rework hci_conn_params flags") Link: https://lore.kernel.org/all/YpyJ9qTNHJzz0FHY@debian/ Link: https://lore.kernel.org/all/20220606080631.0c3014f2@canb.auug.org.au/ Link: https://lore.kernel.org/all/20220605162537.1604762-1-yury.norov@gmail.com/ Reported-by: Stephen Rothwell <sfr@canb.auug.org.au> Reported-by: Sudip Mukherjee <sudipm.mukherjee@gmail.com> Reviewed-by: Yury Norov <yury.norov@gmail.com> Cc: Luiz Augusto von Dentz <luiz.von.dentz@intel.com> Cc: Marcel Holtmann <marcel@holtmann.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
- 02 June 2022, 2 commits
-
-
By Duoming Zhou
There are session cleanup problems in ax25_release() and ax25_disconnect(). If we set up a session and then disconnect, the disconnected session is still in the "LISTENING" state, as shown below:

    Active AX.25 sockets
    Dest       Source     Device  State      Vr/Vs    Send-Q  Recv-Q
    DL9SAU-4   DL9SAU-3   ???     LISTENING  000/000  0       0
    DL9SAU-3   DL9SAU-4   ???     LISTENING  000/000  0       0

The first cause is the del_timer_sync() calls in ax25_release(). The ax25 timers are needed for correct session cleanup. If we use ax25_release() to close ax25 sessions and ax25_dev is not null, the del_timer_sync() calls in ax25_release() will execute. As a result, the sessions cannot be cleaned up correctly, because the timers have stopped. To solve this problem, this patch adds a device_up flag in ax25_dev in order to judge whether the device is up. If there are sessions to be cleaned up, the del_timer_sync() calls in ax25_release() will not execute. What's more, we add ax25_cb_del() in ax25_kill_by_device(), because the timers have been stopped and there are no functions that could delete the ax25_cb if we do not call ax25_release(). Finally, we reorder the position of ax25_list_lock in ax25_cb_del() in order to synchronize among the different functions that call ax25_cb_del(). The second cause is an improper check in ax25_disconnect(). Incoming ax25 sessions whose ax25->sk is null will close the heartbeat timer, because the check "if (!ax25->sk || ..)" is satisfied. As a result, the session cannot be cleaned up properly. To solve this problem, this patch changes the improper check to "if (ax25->sk && ..)" in ax25_disconnect(). What's more, ax25_disconnect() may be called twice, which is unnecessary. For example, ax25_kill_by_device() calls ax25_disconnect() and sets ax25->state to AX25_STATE_0, but ax25_release() calls ax25_disconnect() again. To solve this problem, this patch adds a check in ax25_release(): if the SOCK_DEAD flag of ax25->sk is set, the ax25_disconnect() in ax25_release() is not executed. Fixes: 82e31755 ("ax25: Fix UAF bugs in ax25 timers") Fixes: 8a367e74 ("ax25: Fix segfault after sock connection timeout") Reported-and-tested-by: Thomas Osterried <thomas@osterried.de> Signed-off-by: Duoming Zhou <duoming@zju.edu.cn> Link: https://lore.kernel.org/r/20220530152158.108619-1-duoming@zju.edu.cn Signed-off-by: Paolo Abeni <pabeni@redhat.com>
-
By Pablo Neira Ayuso
Remove the inactive bool field in the nft_hook object that was introduced in abadb2f8 ("netfilter: nf_tables: delete devices from flowtable"). Move stale flowtable hooks to the transaction list instead. Deleting the same device twice does not result in ENOENT. Fixes: abadb2f8 ("netfilter: nf_tables: delete devices from flowtable") Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
-
- 01 June 2022, 2 commits
-
-
By Hangbin Liu
Guard ns_targets in struct bond_params with CONFIG_IPV6, which saves 256 bytes when IPv6 is not configured. Also add this protection to the functions bond_is_ip6_target_ok() and bond_get_targets_ip6(). Remove the IS_ENABLED() check for bond_opts[], as it would leave BOND_OPT_NS_TARGETS uninitialized when CONFIG_IPV6 is not enabled; add a dummy bond_option_ns_ip6_targets_set() for that situation instead. Fixes: 4e24be01 ("bonding: add new parameter ns_targets") Signed-off-by: Hangbin Liu <liuhangbin@gmail.com> Acked-by: Jonathan Toppins <jtoppins@redhat.com> Link: https://lore.kernel.org/r/20220531063727.224043-1-liuhangbin@gmail.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
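A sketch of the guard; the array bound macro is assumed for illustration (16 targets times 16 bytes per in6_addr would account for the 256 bytes saved):

    struct bond_params {
            /* ... other parameters ... */
    #if IS_ENABLED(CONFIG_IPV6)
            struct in6_addr ns_targets[BOND_MAX_NS_TARGETS];
    #endif
    };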
-
By Guoju Fang
In qdisc_run_end(), the spin_unlock() only has store-release semantics, which guarantees that all earlier memory accesses are visible before it. But the subsequent test_bit() has no barrier semantics, so it may be reordered ahead of the spin_unlock(). This store-load reordering may cause a stuck-packet problem. The concurrent operations can be described as below:

    CPU 0                              |  CPU 1
    qdisc_run_end()                    |  qdisc_run_begin()
            .                          |          .
     ----> /* may be reordered here */ |          .
    |       .                          |          .
    |   spin_unlock()                  |      set_bit()
    |       .                          |      smp_mb__after_atomic()
     ---- test_bit()                   |      spin_trylock()
            .                          |          .

Consider the following sequence of events:
    1. CPU 0 reorders test_bit() ahead and sees MISSED = 0
    2. CPU 1 calls set_bit()
    3. CPU 1 calls spin_trylock() and it fails
    4. CPU 0 executes spin_unlock()
At the end of the sequence, CPU 0 does nothing after the spin_unlock() because it saw MISSED = 0. The skb on CPU 1 has been enqueued but no one takes it, until the next cpu pushing to the qdisc (if ever ...) notices and dequeues it. This patch fixes this by adding one explicit barrier: as spin_unlock() followed by test_bit() is a store-load ordering, a full memory barrier smp_mb() is needed here. Fixes: a90c57f2 ("net: sched: fix packet stuck problem for lockless qdisc") Signed-off-by: Guoju Fang <gjfang@linux.alibaba.com> Link: https://lore.kernel.org/r/20220528101628.120193-1-gjfang@linux.alibaba.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
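A hedged sketch of the fixed unlock path: a full barrier between spin_unlock() and test_bit() prevents the store-load reordering described above (the non-lockless branch is shown only for completeness):

    static inline void qdisc_run_end(struct Qdisc *qdisc)
    {
            if (qdisc->flags & TCQ_F_NOLOCK) {
                    spin_unlock(&qdisc->seqlock);

                    /* spin_unlock() is only store-release; a full barrier
                     * keeps the MISSED read from moving above the unlock */
                    smp_mb();

                    if (unlikely(test_bit(__QDISC_STATE_MISSED, &qdisc->state)))
                            __netif_schedule(qdisc);
            } else {
                    __clear_bit(__QDISC_STATE2_RUNNING, &qdisc->state2);
            }
    }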
-
- 27 May 2022, 2 commits
-
-
By Florian Westphal
In case the conntrack is clashing, insertion can free skb->_nfct and set skb->_nfct to the already-confirmed entry. This wasn't found before because the conntrack entry and the extension space used to be freed after an rcu grace period; additionally, the race needs events enabled to trigger. Reported-by: <syzbot+793a590957d9c1b96620@syzkaller.appspotmail.com> Fixes: 71d8c47f ("netfilter: conntrack: introduce clash resolution on insertion race") Fixes: 2ad9d774 ("netfilter: conntrack: free extension area immediately") Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
-
By Vincent Ray
In qdisc_run_begin(), the smp_mb__before_atomic() used before test_bit() does not provide any ordering guarantee, as test_bit() is not an atomic operation. This, added to the fact that the spin_trylock() call at the beginning of qdisc_run_begin() does not guarantee acquire semantics if it does not grab the lock, makes it possible for the following statement:

    if (test_bit(__QDISC_STATE_MISSED, &qdisc->state))

to be executed before an enqueue operation called before qdisc_run_begin(). As a result the following race can happen:

               CPU 1                                CPU 2
         qdisc_run_begin()                    qdisc_run_begin() /* true */
            set(MISSED)                               .
         /* returns false */                          .
               .                            /* sees MISSED = 1 */
               .                            /* so qdisc not empty */
               .                                __qdisc_run()
               .                                      .
               .                             pfifo_fast_dequeue()
     ----> /* may be done here */                     .
    |          .                                 clear(MISSED)
    |          .                                      .
    |          .                             smp_mb__after_atomic();
    |          .                                      .
    |          .                            /* recheck the queue */
    |          .                            /* nothing => exit   */
    |      enqueue(skb1)
    |          .
    |     qdisc_run_begin()
    |          .
    |     spin_trylock() /* fail */
    |          .
    |     smp_mb__before_atomic() /* not enough */
    |          .
     ---- if (test_bit(MISSED)) return false; /* exit */

In the above scenario, CPU 1 and CPU 2 both try to grab qdisc->seqlock at the same time. Only CPU 2 succeeds and enters the bypass code path, where it emits its skb then calls __qdisc_run(). CPU 1 fails, sets MISSED and goes down the traditional enqueue() + dequeue() code path. But when executing qdisc_run_begin() for the second time, after enqueuing its skbuff, it sees the MISSED bit still set (by itself) and consequently chooses to exit early without setting it again or trying to grab the spinlock again. Meanwhile CPU 2 has seen MISSED = 1, cleared it, checked the queue and found it empty, so it returned. At the end of the sequence, we end up with skb1 enqueued in the backlog, both CPUs out of __dev_xmit_skb(), the MISSED bit not set, and no __netif_schedule() call made. skb1 will now linger in the qdisc until somebody later performs a full __qdisc_run(). Combined with the bypass capacity of the qdisc, and the ability of the TCP layer to avoid resending packets which it knows are still in the qdisc, this can lead to serious traffic "holes" in a TCP connection. We fix this by replacing the smp_mb__before_atomic() / test_bit() / set_bit() / smp_mb__after_atomic() sequence inside qdisc_run_begin() with a single test_and_set_bit() call, which is more concise and enforces the needed memory barriers. Fixes: 89837eb4 ("net: sched: add barrier to ensure correct ordering for lockless qdisc") Signed-off-by: Vincent Ray <vray@kalrayinc.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Link: https://lore.kernel.org/r/20220526001746.2437669-1-eric.dumazet@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
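A hedged sketch of the fixed lock path: one fully ordered test_and_set_bit() replaces the smp_mb__before_atomic() / test_bit() / set_bit() / smp_mb__after_atomic() sequence (the non-lockless branch is shown only for completeness):

    static inline bool qdisc_run_begin(struct Qdisc *qdisc)
    {
            if (qdisc->flags & TCQ_F_NOLOCK) {
                    if (spin_trylock(&qdisc->seqlock))
                            return true;

                    /* a single fully ordered RMW: sets MISSED and is
                     * guaranteed to be observed after the earlier enqueue */
                    if (test_and_set_bit(__QDISC_STATE_MISSED, &qdisc->state))
                            return false;

                    /* retry, in case the seqlock owner released it while
                     * we were setting MISSED */
                    return spin_trylock(&qdisc->seqlock);
            }
            return !__test_and_set_bit(__QDISC_STATE2_RUNNING, &qdisc->state2);
    }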
-
- 26 May 2022, 1 commit
-
-
By Taehee Yoo
AMT_MSG_TEARDOWM is defined, but it should be AMT_MSG_TEARDOWN. Fixes: b9022b53 ("amt: add control plane of amt interface") Signed-off-by: Taehee Yoo <ap420073@gmail.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
-
- 23 May 2022, 1 commit
-
-
By Jakub Kicinski
Most protocol-specific pointers in struct net_device are under a respective ifdef. Wireless is the notable exception. Since there's a sizable number of custom-built kernels for datacenter workloads which don't build wireless, it seems reasonable to ifdef those pointers as well. While at it, move the IPv4 and IPv6 pointers up; those are special for obvious reasons. Acked-by: Johannes Berg <johannes@sipsolutions.net> Acked-by: Stefan Schmidt <stefan@datenfreihafen.org> # ieee802154 Acked-by: Sven Eckelmann <sven@narfation.org> Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
-
- 21 May 2022, 2 commits
-
-
By Joanne Koong
We currently have one tcp bind table (bhash), which hashes by port number only. In the socket bind path, we check for bind conflicts by traversing the specified port's inet_bind2_bucket while holding the bucket's spinlock (see inet_csk_get_port() and inet_csk_bind_conflict()). In instances where tons of sockets are hashed to the same port at different addresses, checking for a bind conflict is time-intensive and can cause softirq cpu lockups, as well as stalling new tcp connections, since __inet_inherit_port() also contends for the spinlock. This patch proposes adding a second bind table, bhash2, that hashes by port and ip address. Searching the bhash2 table leads to significantly faster conflict resolution and less time holding the spinlock. Signed-off-by: Joanne Koong <joannelkoong@gmail.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Acked-by: Kuniyuki Iwashima <kuniyu@amazon.co.jp> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
-
By Geliang Tang
This patch implements a new struct bpf_func_proto named bpf_skc_to_mptcp_sock_proto. It defines a new bpf_id BTF_SOCK_TYPE_MPTCP and a new helper bpf_skc_to_mptcp_sock(), which invokes another new helper, bpf_mptcp_sock_from_subflow() in net/mptcp/bpf.c, to get a struct mptcp_sock from a given subflow socket. v2: emit the BTF type, add func_id checks in verifier.c and bpf_trace.c, remove the build check for CONFIG_BPF_JIT. v5: drop EXPORT_SYMBOL (Martin). Co-developed-by: Nicolas Rybowski <nicolas.rybowski@tessares.net> Co-developed-by: Matthieu Baerts <matthieu.baerts@tessares.net> Signed-off-by: Nicolas Rybowski <nicolas.rybowski@tessares.net> Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net> Signed-off-by: Geliang Tang <geliang.tang@suse.com> Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/20220519233016.105670-2-mathew.j.martineau@linux.intel.com
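A hedged usage sketch from a sockops BPF program; the attach point and the token read are illustrative and assume a BTF-enabled build with vmlinux.h:

    #include "vmlinux.h"
    #include <bpf/bpf_helpers.h>

    SEC("sockops")
    int mark_mptcp(struct bpf_sock_ops *ctx)
    {
            struct bpf_sock *sk = ctx->sk;
            struct mptcp_sock *msk;

            if (!sk)
                    return 1;

            /* returns NULL when the socket is not an MPTCP subflow */
            msk = bpf_skc_to_mptcp_sock(sk);
            if (msk)
                    bpf_printk("mptcp token %u", msk->token);

            return 1;
    }

    char _license[] SEC("license") = "GPL";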
-