- 21 1月, 2021 2 次提交
-
-
由 Jakub Kicinski 提交于
rollback_registered() is a local helper, it's common for driver code to call unregister_netdevice_queue(dev, NULL) when they want to unregister netdevices under rtnl_lock. Inline rollback_registered() and adjust the only remaining caller. Reviewed-by: NEdwin Peer <edwin.peer@broadcom.com> Signed-off-by: NJakub Kicinski <kuba@kernel.org>
-
由 Jakub Kicinski 提交于
Commit 93ee31f1 ("[NET]: Fix free_netdev on register_netdev failure.") moved net_set_todo() outside of rollback_registered() so that rollback_registered() can be used in the failure path of register_netdevice() but without risking a double free. Since commit cf124db5 ("net: Fix inconsistent teardown and release of private netdev state."), however, we have a better way of handling that condition, since destructors don't call free_netdev() directly. After the change in commit c269a24c ("net: make free_netdev() more lenient with unregistering devices") we can now move net_set_todo() back. Reviewed-by: NEdwin Peer <edwin.peer@broadcom.com> Signed-off-by: NJakub Kicinski <kuba@kernel.org>
-
- 20 1月, 2021 2 次提交
-
-
由 Tariq Toukan 提交于
With NETIF_F_HW_TLS_RX packets are decrypted in HW. This cannot be logically done when RXCSUM offload is off. Fixes: 14136564 ("net: Add TLS RX offload feature") Signed-off-by: NTariq Toukan <tariqt@nvidia.com> Reviewed-by: NBoris Pismenny <borisp@nvidia.com> Link: https://lore.kernel.org/r/20210117151538.9411-1-tariqt@nvidia.comSigned-off-by: NJakub Kicinski <kuba@kernel.org>
-
由 Xin Long 提交于
This patch is to define a inline function skb_csum_is_sctp(), and also replace all places where it checks if it's a SCTP CSUM skb. This function would be used later in many networking drivers in the following patches. Suggested-by: NAlexander Duyck <alexander.duyck@gmail.com> Signed-off-by: NXin Long <lucien.xin@gmail.com> Reviewed-by: NAlexander Duyck <alexanderduyck@fb.com> Signed-off-by: NJakub Kicinski <kuba@kernel.org>
-
- 19 1月, 2021 1 次提交
-
-
由 Tariq Toukan 提交于
ndo_sk_get_lower_dev returns the lower netdev that corresponds to a given socket. Additionally, we implement a helper netdev_sk_get_lowest_dev() to get the lowest one in chain. Signed-off-by: NTariq Toukan <tariqt@nvidia.com> Reviewed-by: NBoris Pismenny <borisp@nvidia.com> Signed-off-by: NJakub Kicinski <kuba@kernel.org>
-
- 15 1月, 2021 1 次提交
-
-
由 Tariq Toukan 提交于
Cited patch below blocked the TLS TX device offload unless HW_CSUM is set. This broke devices that use IP_CSUM && IP6_CSUM. Here we fix it. Note that the single HW_TLS_TX feature flag indicates support for both IPv4/6, hence it should still be disabled in case only one of (IP_CSUM | IPV6_CSUM) is set. Fixes: ae0b04b2 ("net: Disable NETIF_F_HW_TLS_TX when HW_CSUM is disabled") Signed-off-by: NTariq Toukan <tariqt@nvidia.com> Reported-by: NRohit Maheshwari <rohitm@chelsio.com> Reviewed-by: NMaxim Mikityanskiy <maximmi@mellanox.com> Link: https://lore.kernel.org/r/20210114151215.7061-1-tariqt@nvidia.comSigned-off-by: NJakub Kicinski <kuba@kernel.org>
-
- 14 1月, 2021 1 次提交
-
-
由 Menglong Dong 提交于
Replace the check for ETH_P_8021Q and ETH_P_8021AD in __netif_receive_skb_core with eth_type_vlan. Signed-off-by: NMenglong Dong <dong.menglong@zte.com.cn> Link: https://lore.kernel.org/r/20210111104221.3451-1-dong.menglong@zte.com.cnSigned-off-by: NJakub Kicinski <kuba@kernel.org>
-
- 10 1月, 2021 1 次提交
-
-
由 Eric Dumazet 提交于
GRO_DROP can only be returned from napi_gro_frags() if the skb has not been allocated by a prior napi_get_frags() Since drivers must use napi_get_frags() and test its result before populating the skb with metadata, we can safely remove GRO_DROP since it offers no practical use. Signed-off-by: NEric Dumazet <edumazet@google.com> Cc: Jesse Brandeburg <jesse.brandeburg@intel.com> Acked-by: NEdward Cree <ecree.xilinx@gmail.com> Signed-off-by: NJakub Kicinski <kuba@kernel.org>
-
- 09 1月, 2021 4 次提交
-
-
由 Jakub Kicinski 提交于
If register_netdevice() fails at the very last stage - the notifier call - some subsystems may have already seen it and grabbed a reference. struct net_device can't be freed right away without calling netdev_wait_all_refs(). Now that we have a clean interface in form of dev->needs_free_netdev and lenient free_netdev() we can undo what commit 93ee31f1 ("[NET]: Fix free_netdev on register_netdev failure.") has done and complete the unregistration path by bringing the net_set_todo() call back. After registration fails user is still expected to explicitly free the net_device, so make sure ->needs_free_netdev is cleared, otherwise rolling back the registration will cause the old double free for callers who release rtnl_lock before the free. This also solves the problem of priv_destructor not being called on notifier error. net_set_todo() will be moved back into unregister_netdevice_queue() in a follow up. Reported-by: NHulk Robot <hulkci@huawei.com> Reported-by: NYang Yingliang <yangyingliang@huawei.com> Signed-off-by: NJakub Kicinski <kuba@kernel.org>
-
由 Jakub Kicinski 提交于
There are two flavors of handling netdev registration: - ones called without holding rtnl_lock: register_netdev() and unregister_netdev(); and - those called with rtnl_lock held: register_netdevice() and unregister_netdevice(). While the semantics of the former are pretty clear, the same can't be said about the latter. The netdev_todo mechanism is utilized to perform some of the device unregistering tasks and it hooks into rtnl_unlock() so the locked variants can't actually finish the work. In general free_netdev() does not mix well with locked calls. Most drivers operating under rtnl_lock set dev->needs_free_netdev to true and expect core to make the free_netdev() call some time later. The part where this becomes most problematic is error paths. There is no way to unwind the state cleanly after a call to register_netdevice(), since unreg can't be performed fully without dropping locks. Make free_netdev() more lenient, and defer the freeing if device is being unregistered. This allows error paths to simply call free_netdev() both after register_netdevice() failed, and after a call to unregister_netdevice() but before dropping rtnl_lock. Simplify the error paths which are currently doing gymnastics around free_netdev() handling. Signed-off-by: NJakub Kicinski <kuba@kernel.org>
-
由 Lorenzo Bianconi 提交于
Introduce xdp_prepare_buff utility routine to initialize per-descriptor xdp_buff fields (e.g. xdp_buff pointers). Rely on xdp_prepare_buff() in all XDP capable drivers. Signed-off-by: NLorenzo Bianconi <lorenzo@kernel.org> Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net> Reviewed-by: NAlexander Duyck <alexanderduyck@fb.com> Acked-by: NJesper Dangaard Brouer <brouer@redhat.com> Acked-by: NJohn Fastabend <john.fastabend@gmail.com> Acked-by: NShay Agroskin <shayagr@amazon.com> Acked-by: NMartin Habets <habetsm.xilinx@gmail.com> Acked-by: NCamelia Groza <camelia.groza@nxp.com> Acked-by: NMarcin Wojtas <mw@semihalf.com> Link: https://lore.kernel.org/bpf/45f46f12295972a97da8ca01990b3e71501e9d89.1608670965.git.lorenzo@kernel.orgSigned-off-by: NAlexei Starovoitov <ast@kernel.org>
-
由 Lorenzo Bianconi 提交于
Introduce xdp_init_buff utility routine to initialize xdp_buff fields const over NAPI iterations (e.g. frame_sz or rxq pointer). Rely on xdp_init_buff in all XDP capable drivers. Signed-off-by: NLorenzo Bianconi <lorenzo@kernel.org> Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net> Reviewed-by: NAlexander Duyck <alexanderduyck@fb.com> Acked-by: NJesper Dangaard Brouer <brouer@redhat.com> Acked-by: NJohn Fastabend <john.fastabend@gmail.com> Acked-by: NShay Agroskin <shayagr@amazon.com> Acked-by: NMartin Habets <habetsm.xilinx@gmail.com> Acked-by: NCamelia Groza <camelia.groza@nxp.com> Acked-by: NMarcin Wojtas <mw@semihalf.com> Link: https://lore.kernel.org/bpf/7f8329b6da1434dc2b05a77f2e800b29628a8913.1608670965.git.lorenzo@kernel.orgSigned-off-by: NAlexei Starovoitov <ast@kernel.org>
-
- 08 1月, 2021 1 次提交
-
-
由 Jakub Kicinski 提交于
All drivers use udp_tunnel_nic_*_port() helpers, prepare for NDO removal by invoking those helpers directly. The helpers are safe to call on all devices, they check if device has the UDP tunnel state initialized. Reviewed-by: NAlexander Duyck <alexanderduyck@fb.com> Reviewed-by: NJacob Keller <jacob.e.keller@intel.com> Signed-off-by: NJakub Kicinski <kuba@kernel.org>
-
- 17 12月, 2020 1 次提交
-
-
由 Lijun Pan 提交于
There are some use cases for netdev_notify_peers in the context when rtnl lock is already held. Introduce lockless version of netdev_notify_peers call to save the extra code to call call_netdevice_notifiers(NETDEV_NOTIFY_PEERS, dev); call_netdevice_notifiers(NETDEV_RESEND_IGMP, dev); After that, convert netdev_notify_peers to call the new helper. Suggested-by: NNathan Lynch <nathanl@linux.ibm.com> Signed-off-by: NLijun Pan <ljp@linux.ibm.com> Signed-off-by: NJakub Kicinski <kuba@kernel.org>
-
- 15 12月, 2020 1 次提交
-
-
由 Tariq Toukan 提交于
With NETIF_F_HW_TLS_TX packets are encrypted in HW. This cannot be logically done when HW_CSUM offload is off. Fixes: 2342a851 ("net: Add TLS TX offload features") Signed-off-by: NTariq Toukan <tariqt@nvidia.com> Reviewed-by: NBoris Pismenny <borisp@nvidia.com> Link: https://lore.kernel.org/r/20201213143929.26253-1-tariqt@nvidia.comSigned-off-by: NJakub Kicinski <kuba@kernel.org>
-
- 09 12月, 2020 1 次提交
-
-
由 Toke Høiland-Jørgensen 提交于
Since commit 7f0a8382 ("bpf, xdp: Maintain info on attached XDP BPF programs in net_device"), the XDP program attachment info is now maintained in the core code. This interacts badly with the xdp_attachment_flags_ok() check that prevents unloading an XDP program with different load flags than it was loaded with. In practice, two kinds of failures are seen: - An XDP program loaded without specifying a mode (and which then ends up in driver mode) cannot be unloaded if the program mode is specified on unload. - The dev_xdp_uninstall() hook always calls the driver callback with the mode set to the type of the program but an empty flags argument, which means the flags_ok() check prevents the program from being removed, leading to bpf prog reference leaks. The original reason this check was added was to avoid ambiguity when multiple programs were loaded. With the way the checks are done in the core now, this is quite simple to enforce in the core code, so let's add a check there and get rid of the xdp_attachment_flags_ok() callback entirely. Fixes: 7f0a8382 ("bpf, xdp: Maintain info on attached XDP BPF programs in net_device") Signed-off-by: NToke Høiland-Jørgensen <toke@redhat.com> Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net> Acked-by: NJakub Kicinski <kuba@kernel.org> Link: https://lore.kernel.org/bpf/160752225751.110217.10267659521308669050.stgit@toke.dk
-
- 02 12月, 2020 1 次提交
-
-
由 Vladimir Oltean 提交于
The last user of the RTNL brother of dev_getfirstbyhwtype (the latter being synchronized under RCU) has been deleted in commit b4db2b35 ("afs: Use core kernel UUID generation"). Cc: Arnd Bergmann <arnd@arndb.de> Cc: David Howells <dhowells@redhat.com> Cc: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: NVladimir Oltean <vladimir.oltean@nxp.com> Link: https://lore.kernel.org/r/20201129200550.2433401-1-vladimir.oltean@nxp.comSigned-off-by: NJakub Kicinski <kuba@kernel.org>
-
- 01 12月, 2020 3 次提交
-
-
由 Björn Töpel 提交于
Add napi_id to the xdp_rxq_info structure, and make sure the XDP socket pick up the napi_id in the Rx path. The napi_id is used to find the corresponding NAPI structure for socket busy polling. Signed-off-by: NBjörn Töpel <bjorn.topel@intel.com> Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net> Acked-by: NIlias Apalodimas <ilias.apalodimas@linaro.org> Acked-by: NMichael S. Tsirkin <mst@redhat.com> Acked-by: NTariq Toukan <tariqt@nvidia.com> Link: https://lore.kernel.org/bpf/20201130185205.196029-7-bjorn.topel@gmail.com
-
由 Björn Töpel 提交于
This option lets a user set a per socket NAPI budget for busy-polling. If the options is not set, it will use the default of 8. Signed-off-by: NBjörn Töpel <bjorn.topel@intel.com> Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net> Reviewed-by: NJakub Kicinski <kuba@kernel.org> Link: https://lore.kernel.org/bpf/20201130185205.196029-3-bjorn.topel@gmail.com
-
由 Björn Töpel 提交于
The existing busy-polling mode, enabled by the SO_BUSY_POLL socket option or system-wide using the /proc/sys/net/core/busy_read knob, is an opportunistic. That means that if the NAPI context is not scheduled, it will poll it. If, after busy-polling, the budget is exceeded the busy-polling logic will schedule the NAPI onto the regular softirq handling. One implication of the behavior above is that a busy/heavy loaded NAPI context will never enter/allow for busy-polling. Some applications prefer that most NAPI processing would be done by busy-polling. This series adds a new socket option, SO_PREFER_BUSY_POLL, that works in concert with the napi_defer_hard_irqs and gro_flush_timeout knobs. The napi_defer_hard_irqs and gro_flush_timeout knobs were introduced in commit 6f8b12d6 ("net: napi: add hard irqs deferral feature"), and allows for a user to defer interrupts to be enabled and instead schedule the NAPI context from a watchdog timer. When a user enables the SO_PREFER_BUSY_POLL, again with the other knobs enabled, and the NAPI context is being processed by a softirq, the softirq NAPI processing will exit early to allow the busy-polling to be performed. If the application stops performing busy-polling via a system call, the watchdog timer defined by gro_flush_timeout will timeout, and regular softirq handling will resume. In summary; Heavy traffic applications that prefer busy-polling over softirq processing should use this option. Example usage: $ echo 2 | sudo tee /sys/class/net/ens785f1/napi_defer_hard_irqs $ echo 200000 | sudo tee /sys/class/net/ens785f1/gro_flush_timeout Note that the timeout should be larger than the userspace processing window, otherwise the watchdog will timeout and fall back to regular softirq processing. Enable the SO_BUSY_POLL/SO_PREFER_BUSY_POLL options on your socket. Signed-off-by: NBjörn Töpel <bjorn.topel@intel.com> Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net> Reviewed-by: NJakub Kicinski <kuba@kernel.org> Link: https://lore.kernel.org/bpf/20201130185205.196029-2-bjorn.topel@gmail.com
-
- 28 11月, 2020 1 次提交
-
-
由 wenxu 提交于
The mru in the qdisc_skb_cb should be init as 0. Only defrag packets in the act_ct will set the value. Fixes: 038ebb1a ("net/sched: act_ct: fix miss set mru for ovs after defrag in act_ct") Signed-off-by: Nwenxu <wenxu@ucloud.cn> Signed-off-by: NJakub Kicinski <kuba@kernel.org>
-
- 25 11月, 2020 2 次提交
-
-
由 Heiner Kallweit 提交于
In bug report [0] a warning in r8169 driver was reported that was caused by an invalid GSO SKB (gso_type was 0). See [1] for a discussion about this issue. Still the origin of the invalid GSO SKB isn't clear. It shouldn't be a network drivers task to check for invalid GSO SKB's. Also, even if issue [0] can be fixed, we can't be sure that a similar issue doesn't pop up again at another place. Therefore let gso_features_check() check for such invalid GSO SKB's. [0] https://bugzilla.kernel.org/show_bug.cgi?id=209423 [1] https://www.spinics.net/lists/netdev/msg690794.htmlSigned-off-by: NHeiner Kallweit <hkallweit1@gmail.com> Link: https://lore.kernel.org/r/97c78d21-7f0b-d843-df17-3589f224d2cf@gmail.comSigned-off-by: NJakub Kicinski <kuba@kernel.org>
-
由 Björn Töpel 提交于
Commit 642e450b ("xsk: Do not discard packet when NETDEV_TX_BUSY") addressed the problem that packets were discarded from the Tx AF_XDP ring, when the driver returned NETDEV_TX_BUSY. Part of the fix was bumping the skbuff reference count, so that the buffer would not be freed by dev_direct_xmit(). A reference count larger than one means that the skbuff is "shared", which is not the case. If the "shared" skbuff is sent to the generic XDP receive path, netif_receive_generic_xdp(), and pskb_expand_head() is entered the BUG_ON(skb_shared(skb)) will trigger. This patch adds a variant to dev_direct_xmit(), __dev_direct_xmit(), where a user can select the skbuff free policy. This allows AF_XDP to avoid bumping the reference count, but still keep the NETDEV_TX_BUSY behavior. Fixes: 642e450b ("xsk: Do not discard packet when NETDEV_TX_BUSY") Reported-by: NYonghong Song <yhs@fb.com> Signed-off-by: NBjörn Töpel <bjorn.topel@intel.com> Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/bpf/20201123175600.146255-1-bjorn.topel@gmail.com
-
- 24 11月, 2020 1 次提交
-
-
由 Peter Zijlstra 提交于
Get rid of the __call_single_node union and cleanup the API a little to avoid external code relying on the structure layout as much. Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: NFrederic Weisbecker <frederic@kernel.org>
-
- 18 11月, 2020 1 次提交
-
-
由 Mauro Carvalho Chehab 提交于
Some identifiers have different names between their prototypes and the kernel-doc markup. Signed-off-by: NMauro Carvalho Chehab <mchehab+huawei@kernel.org> Reviewed-by: NKirill Tkhai <ktkhai@virtuozzo.com> Signed-off-by: NJakub Kicinski <kuba@kernel.org>
-
- 10 11月, 2020 1 次提交
-
-
由 Heiner Kallweit 提交于
It's a frequent pattern to use netdev->stats for the less frequently accessed counters and per-cpu counters for the frequently accessed counters (rx/tx bytes/packets). Add a default ndo_get_stats64() implementation for this use case. Reviewed-by: NFlorian Fainelli <f.fainelli@gmail.com> Signed-off-by: NHeiner Kallweit <hkallweit1@gmail.com> Signed-off-by: NJakub Kicinski <kuba@kernel.org>
-
- 03 11月, 2020 1 次提交
-
-
由 Tom Rix 提交于
A semicolon is not needed after a switch statement. Signed-off-by: NTom Rix <trix@redhat.com> Acked-by: NAndrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/r/20201101153647.2292322-1-trix@redhat.comSigned-off-by: NJakub Kicinski <kuba@kernel.org>
-
- 28 10月, 2020 1 次提交
-
-
由 Yi Li 提交于
No functional changes, just minor refactoring. Signed-off-by: NYi Li <yili@winhong.com> Link: https://lore.kernel.org/r/20201027055904.2683444-1-yili@winhong.comSigned-off-by: NJakub Kicinski <kuba@kernel.org>
-
- 25 10月, 2020 1 次提交
-
-
由 Willy Tarreau 提交于
With the removal of the interrupt perturbations in previous random32 change (random32: make prandom_u32() output unpredictable), the PRNG has become 100% deterministic again. While SipHash is expected to be way more robust against brute force than the previous Tausworthe LFSR, there's still the risk that whoever has even one temporary access to the PRNG's internal state is able to predict all subsequent draws till the next reseed (roughly every minute). This may happen through a side channel attack or any data leak. This patch restores the spirit of commit f227e3ec ("random32: update the net random state on interrupt and activity") in that it will perturb the internal PRNG's statee using externally collected noise, except that it will not pick that noise from the random pool's bits nor upon interrupt, but will rather combine a few elements along the Tx path that are collectively hard to predict, such as dev, skb and txq pointers, packet length and jiffies values. These ones are combined using a single round of SipHash into a single long variable that is mixed with the net_rand_state upon each invocation. The operation was inlined because it produces very small and efficient code, typically 3 xor, 2 add and 2 rol. The performance was measured to be the same (even very slightly better) than before the switch to SipHash; on a 6-core 12-thread Core i7-8700k equipped with a 40G NIC (i40e), the connection rate dropped from 556k/s to 555k/s while the SYN cookie rate grew from 5.38 Mpps to 5.45 Mpps. Link: https://lore.kernel.org/netdev/20200808152628.GA27941@SDF.ORG/ Cc: George Spelvin <lkml@sdf.org> Cc: Amit Klein <aksecurity@gmail.com> Cc: Eric Dumazet <edumazet@google.com> Cc: "Jason A. Donenfeld" <Jason@zx2c4.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Kees Cook <keescook@chromium.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: tytso@mit.edu Cc: Florian Westphal <fw@strlen.de> Cc: Marc Plumb <lkml.mplumb@gmail.com> Tested-by: NSedat Dilek <sedat.dilek@gmail.com> Signed-off-by: NWilly Tarreau <w@1wt.eu>
-
- 19 10月, 2020 1 次提交
-
-
由 Taehee Yoo 提交于
dev->unlink_list is reused unless dev is deleted. So, list_del() should not be used. Due to using list_del(), dev->unlink_list can't be reused so that dev->nested_level update logic doesn't work. In order to fix this bug, list_del_init() should be used instead of list_del(). Test commands: ip link add bond0 type bond ip link add bond1 type bond ip link set bond0 master bond1 ip link set bond0 nomaster ip link set bond1 master bond0 ip link set bond1 nomaster Splat looks like: [ 255.750458][ T1030] ============================================ [ 255.751967][ T1030] WARNING: possible recursive locking detected [ 255.753435][ T1030] 5.9.0-rc8+ #772 Not tainted [ 255.754553][ T1030] -------------------------------------------- [ 255.756047][ T1030] ip/1030 is trying to acquire lock: [ 255.757304][ T1030] ffff88811782a280 (&dev_addr_list_lock_key/1){+...}-{2:2}, at: dev_mc_sync_multiple+0xc2/0x150 [ 255.760056][ T1030] [ 255.760056][ T1030] but task is already holding lock: [ 255.761862][ T1030] ffff88811130a280 (&dev_addr_list_lock_key/1){+...}-{2:2}, at: bond_enslave+0x3d4d/0x43e0 [bonding] [ 255.764581][ T1030] [ 255.764581][ T1030] other info that might help us debug this: [ 255.766645][ T1030] Possible unsafe locking scenario: [ 255.766645][ T1030] [ 255.768566][ T1030] CPU0 [ 255.769415][ T1030] ---- [ 255.770259][ T1030] lock(&dev_addr_list_lock_key/1); [ 255.771629][ T1030] lock(&dev_addr_list_lock_key/1); [ 255.772994][ T1030] [ 255.772994][ T1030] *** DEADLOCK *** [ 255.772994][ T1030] [ 255.775091][ T1030] May be due to missing lock nesting notation [ 255.775091][ T1030] [ 255.777182][ T1030] 2 locks held by ip/1030: [ 255.778299][ T1030] #0: ffffffffb1f63250 (rtnl_mutex){+.+.}-{3:3}, at: rtnetlink_rcv_msg+0x2e4/0x8b0 [ 255.780600][ T1030] #1: ffff88811130a280 (&dev_addr_list_lock_key/1){+...}-{2:2}, at: bond_enslave+0x3d4d/0x43e0 [bonding] [ 255.783411][ T1030] [ 255.783411][ T1030] stack backtrace: [ 255.784874][ T1030] CPU: 7 PID: 1030 Comm: ip Not tainted 5.9.0-rc8+ #772 [ 255.786595][ T1030] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1ubuntu1 04/01/2014 [ 255.789030][ T1030] Call Trace: [ 255.789850][ T1030] dump_stack+0x99/0xd0 [ 255.790882][ T1030] __lock_acquire.cold.71+0x166/0x3cc [ 255.792285][ T1030] ? register_lock_class+0x1a30/0x1a30 [ 255.793619][ T1030] ? rcu_read_lock_sched_held+0x91/0xc0 [ 255.794963][ T1030] ? rcu_read_lock_bh_held+0xa0/0xa0 [ 255.796246][ T1030] lock_acquire+0x1b8/0x850 [ 255.797332][ T1030] ? dev_mc_sync_multiple+0xc2/0x150 [ 255.798624][ T1030] ? bond_enslave+0x3d4d/0x43e0 [bonding] [ 255.800039][ T1030] ? check_flags+0x50/0x50 [ 255.801143][ T1030] ? lock_contended+0xd80/0xd80 [ 255.802341][ T1030] _raw_spin_lock_nested+0x2e/0x70 [ 255.803592][ T1030] ? dev_mc_sync_multiple+0xc2/0x150 [ 255.804897][ T1030] dev_mc_sync_multiple+0xc2/0x150 [ 255.806168][ T1030] bond_enslave+0x3d58/0x43e0 [bonding] [ 255.807542][ T1030] ? __lock_acquire+0xe53/0x51b0 [ 255.808824][ T1030] ? bond_update_slave_arr+0xdc0/0xdc0 [bonding] [ 255.810451][ T1030] ? check_chain_key+0x236/0x5e0 [ 255.811742][ T1030] ? mutex_is_locked+0x13/0x50 [ 255.812910][ T1030] ? rtnl_is_locked+0x11/0x20 [ 255.814061][ T1030] ? netdev_master_upper_dev_get+0xf/0x120 [ 255.815553][ T1030] do_setlink+0x94c/0x3040 [ ... ] Reported-by: syzbot+4a0f7bc34e3997a6c7df@syzkaller.appspotmail.com Fixes: 1fc70edb ("net: core: add nested_level variable in net_device") Signed-off-by: NTaehee Yoo <ap420073@gmail.com> Link: https://lore.kernel.org/r/20201015162606.9377-1-ap420073@gmail.comSigned-off-by: NJakub Kicinski <kuba@kernel.org>
-
- 14 10月, 2020 1 次提交
-
-
由 Heiner Kallweit 提交于
In several places the same code is used to populate rtnl_link_stats64 fields with data from pcpu_sw_netstats. Therefore factor out this code to a new function dev_fetch_sw_netstats(). v2: - constify argument netstats - don't ignore netstats being NULL or an ERRPTR - switch to EXPORT_SYMBOL_GPL Signed-off-by: NHeiner Kallweit <hkallweit1@gmail.com> Link: https://lore.kernel.org/r/6d16a338-52f5-df69-0020-6bc771a7d498@gmail.comSigned-off-by: NJakub Kicinski <kuba@kernel.org>
-
- 12 10月, 2020 1 次提交
-
-
由 Daniel Borkmann 提交于
Add an efficient ingress to ingress netns switch that can be used out of tc BPF programs in order to redirect traffic from host ns ingress into a container veth device ingress without having to go via CPU backlog queue [0]. For local containers this can also be utilized and path via CPU backlog queue only needs to be taken once, not twice. On a high level this borrows from ipvlan which does similar switch in __netif_receive_skb_core() and then iterates via another_round. This helps to reduce latency for mentioned use cases. Pod to remote pod with redirect(), TCP_RR [1]: # percpu_netperf 10.217.1.33 RT_LATENCY: 122.450 (per CPU: 122.666 122.401 122.333 122.401 ) MEAN_LATENCY: 121.210 (per CPU: 121.100 121.260 121.320 121.160 ) STDDEV_LATENCY: 120.040 (per CPU: 119.420 119.910 125.460 115.370 ) MIN_LATENCY: 46.500 (per CPU: 47.000 47.000 47.000 45.000 ) P50_LATENCY: 118.500 (per CPU: 118.000 119.000 118.000 119.000 ) P90_LATENCY: 127.500 (per CPU: 127.000 128.000 127.000 128.000 ) P99_LATENCY: 130.750 (per CPU: 131.000 131.000 129.000 132.000 ) TRANSACTION_RATE: 32666.400 (per CPU: 8152.200 8169.842 8174.439 8169.897 ) Pod to remote pod with redirect_peer(), TCP_RR: # percpu_netperf 10.217.1.33 RT_LATENCY: 44.449 (per CPU: 43.767 43.127 45.279 45.622 ) MEAN_LATENCY: 45.065 (per CPU: 44.030 45.530 45.190 45.510 ) STDDEV_LATENCY: 84.823 (per CPU: 66.770 97.290 84.380 90.850 ) MIN_LATENCY: 33.500 (per CPU: 33.000 33.000 34.000 34.000 ) P50_LATENCY: 43.250 (per CPU: 43.000 43.000 43.000 44.000 ) P90_LATENCY: 46.750 (per CPU: 46.000 47.000 47.000 47.000 ) P99_LATENCY: 52.750 (per CPU: 51.000 54.000 53.000 53.000 ) TRANSACTION_RATE: 90039.500 (per CPU: 22848.186 23187.089 22085.077 21919.130 ) [0] https://linuxplumbersconf.org/event/7/contributions/674/attachments/568/1002/plumbers_2020_cilium_load_balancer.pdf [1] https://github.com/borkmann/netperf_scripts/blob/master/percpu_netperfSigned-off-by: NDaniel Borkmann <daniel@iogearbox.net> Signed-off-by: NAlexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/20201010234006.7075-3-daniel@iogearbox.net
-
- 30 9月, 2020 1 次提交
-
-
Quite some drivers make conditional decisions based on in_interrupt() to invoke either netif_rx() or netif_rx_ni(). Conditionals based on in_interrupt() or other variants of preempt count checks in drivers should not exist for various reasons and Linus clearly requested to either split the code pathes or pass an argument to the common functions which provides the context. This is obviously the correct solution, but for some of the affected drivers this needs a major rewrite due to their convoluted structure. As in_interrupt() usage in drivers needs to be phased out, provide netif_rx_any_context() as a stop gap for these drivers. This confines the in_interrupt() conditional to core code which in turn allows to remove the access to this check for driver code and provides one central place to do further modifications once the driver maze is cleaned up. Suggested-by: NThomas Gleixner <tglx@linutronix.de> Signed-off-by: NSebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: NThomas Gleixner <tglx@linutronix.de> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 29 9月, 2020 3 次提交
-
-
由 Taehee Yoo 提交于
This patch is to add a new variable 'nested_level' into the net_device structure. This variable will be used as a parameter of spin_lock_nested() of dev->addr_list_lock. netif_addr_lock() can be called recursively so spin_lock_nested() is used instead of spin_lock() and dev->lower_level is used as a parameter of spin_lock_nested(). But, dev->lower_level value can be updated while it is being used. So, lockdep would warn a possible deadlock scenario. When a stacked interface is deleted, netif_{uc | mc}_sync() is called recursively. So, spin_lock_nested() is called recursively too. At this moment, the dev->lower_level variable is used as a parameter of it. dev->lower_level value is updated when interfaces are being unlinked/linked immediately. Thus, After unlinking, dev->lower_level shouldn't be a parameter of spin_lock_nested(). A (macvlan) | B (vlan) | C (bridge) | D (macvlan) | E (vlan) | F (bridge) A->lower_level : 6 B->lower_level : 5 C->lower_level : 4 D->lower_level : 3 E->lower_level : 2 F->lower_level : 1 When an interface 'A' is removed, it releases resources. At this moment, netif_addr_lock() would be called. Then, netdev_upper_dev_unlink() is called recursively. Then dev->lower_level is updated. There is no problem. But, when the bridge module is removed, 'C' and 'F' interfaces are removed at once. If 'F' is removed first, a lower_level value is like below. A->lower_level : 5 B->lower_level : 4 C->lower_level : 3 D->lower_level : 2 E->lower_level : 1 F->lower_level : 1 Then, 'C' is removed. at this moment, netif_addr_lock() is called recursively. The ordering is like this. C(3)->D(2)->E(1)->F(1) At this moment, the lower_level value of 'E' and 'F' are the same. So, lockdep warns a possible deadlock scenario. In order to avoid this problem, a new variable 'nested_level' is added. This value is the same as dev->lower_level - 1. But this value is updated in rtnl_unlock(). So, this variable can be used as a parameter of spin_lock_nested() safely in the rtnl context. Test commands: ip link add br0 type bridge vlan_filtering 1 ip link add vlan1 link br0 type vlan id 10 ip link add macvlan2 link vlan1 type macvlan ip link add br3 type bridge vlan_filtering 1 ip link set macvlan2 master br3 ip link add vlan4 link br3 type vlan id 10 ip link add macvlan5 link vlan4 type macvlan ip link add br6 type bridge vlan_filtering 1 ip link set macvlan5 master br6 ip link add vlan7 link br6 type vlan id 10 ip link add macvlan8 link vlan7 type macvlan ip link set br0 up ip link set vlan1 up ip link set macvlan2 up ip link set br3 up ip link set vlan4 up ip link set macvlan5 up ip link set br6 up ip link set vlan7 up ip link set macvlan8 up modprobe -rv bridge Splat looks like: [ 36.057436][ T744] WARNING: possible recursive locking detected [ 36.058848][ T744] 5.9.0-rc6+ #728 Not tainted [ 36.059959][ T744] -------------------------------------------- [ 36.061391][ T744] ip/744 is trying to acquire lock: [ 36.062590][ T744] ffff8c4767509280 (&vlan_netdev_addr_lock_key){+...}-{2:2}, at: dev_set_rx_mode+0x19/0x30 [ 36.064922][ T744] [ 36.064922][ T744] but task is already holding lock: [ 36.066626][ T744] ffff8c4767769280 (&vlan_netdev_addr_lock_key){+...}-{2:2}, at: dev_uc_add+0x1e/0x60 [ 36.068851][ T744] [ 36.068851][ T744] other info that might help us debug this: [ 36.070731][ T744] Possible unsafe locking scenario: [ 36.070731][ T744] [ 36.072497][ T744] CPU0 [ 36.073238][ T744] ---- [ 36.074007][ T744] lock(&vlan_netdev_addr_lock_key); [ 36.075290][ T744] lock(&vlan_netdev_addr_lock_key); [ 36.076590][ T744] [ 36.076590][ T744] *** DEADLOCK *** [ 36.076590][ T744] [ 36.078515][ T744] May be due to missing lock nesting notation [ 36.078515][ T744] [ 36.080491][ T744] 3 locks held by ip/744: [ 36.081471][ T744] #0: ffffffff98571df0 (rtnl_mutex){+.+.}-{3:3}, at: rtnetlink_rcv_msg+0x236/0x490 [ 36.083614][ T744] #1: ffff8c4767769280 (&vlan_netdev_addr_lock_key){+...}-{2:2}, at: dev_uc_add+0x1e/0x60 [ 36.085942][ T744] #2: ffff8c476c8da280 (&bridge_netdev_addr_lock_key/4){+...}-{2:2}, at: dev_uc_sync+0x39/0x80 [ 36.088400][ T744] [ 36.088400][ T744] stack backtrace: [ 36.089772][ T744] CPU: 6 PID: 744 Comm: ip Not tainted 5.9.0-rc6+ #728 [ 36.091364][ T744] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1ubuntu1 04/01/2014 [ 36.093630][ T744] Call Trace: [ 36.094416][ T744] dump_stack+0x77/0x9b [ 36.095385][ T744] __lock_acquire+0xbc3/0x1f40 [ 36.096522][ T744] lock_acquire+0xb4/0x3b0 [ 36.097540][ T744] ? dev_set_rx_mode+0x19/0x30 [ 36.098657][ T744] ? rtmsg_ifinfo+0x1f/0x30 [ 36.099711][ T744] ? __dev_notify_flags+0xa5/0xf0 [ 36.100874][ T744] ? rtnl_is_locked+0x11/0x20 [ 36.101967][ T744] ? __dev_set_promiscuity+0x7b/0x1a0 [ 36.103230][ T744] _raw_spin_lock_bh+0x38/0x70 [ 36.104348][ T744] ? dev_set_rx_mode+0x19/0x30 [ 36.105461][ T744] dev_set_rx_mode+0x19/0x30 [ 36.106532][ T744] dev_set_promiscuity+0x36/0x50 [ 36.107692][ T744] __dev_set_promiscuity+0x123/0x1a0 [ 36.108929][ T744] dev_set_promiscuity+0x1e/0x50 [ 36.110093][ T744] br_port_set_promisc+0x1f/0x40 [bridge] [ 36.111415][ T744] br_manage_promisc+0x8b/0xe0 [bridge] [ 36.112728][ T744] __dev_set_promiscuity+0x123/0x1a0 [ 36.113967][ T744] ? __hw_addr_sync_one+0x23/0x50 [ 36.115135][ T744] __dev_set_rx_mode+0x68/0x90 [ 36.116249][ T744] dev_uc_sync+0x70/0x80 [ 36.117244][ T744] dev_uc_add+0x50/0x60 [ 36.118223][ T744] macvlan_open+0x18e/0x1f0 [macvlan] [ 36.119470][ T744] __dev_open+0xd6/0x170 [ 36.120470][ T744] __dev_change_flags+0x181/0x1d0 [ 36.121644][ T744] dev_change_flags+0x23/0x60 [ 36.122741][ T744] do_setlink+0x30a/0x11e0 [ 36.123778][ T744] ? __lock_acquire+0x92c/0x1f40 [ 36.124929][ T744] ? __nla_validate_parse.part.6+0x45/0x8e0 [ 36.126309][ T744] ? __lock_acquire+0x92c/0x1f40 [ 36.127457][ T744] __rtnl_newlink+0x546/0x8e0 [ 36.128560][ T744] ? lock_acquire+0xb4/0x3b0 [ 36.129623][ T744] ? deactivate_slab.isra.85+0x6a1/0x850 [ 36.130946][ T744] ? __lock_acquire+0x92c/0x1f40 [ 36.132102][ T744] ? lock_acquire+0xb4/0x3b0 [ 36.133176][ T744] ? is_bpf_text_address+0x5/0xe0 [ 36.134364][ T744] ? rtnl_newlink+0x2e/0x70 [ 36.135445][ T744] ? rcu_read_lock_sched_held+0x32/0x60 [ 36.136771][ T744] ? kmem_cache_alloc_trace+0x2d8/0x380 [ 36.138070][ T744] ? rtnl_newlink+0x2e/0x70 [ 36.139164][ T744] rtnl_newlink+0x47/0x70 [ ... ] Fixes: 845e0ebb ("net: change addr_list_lock back to static key") Signed-off-by: NTaehee Yoo <ap420073@gmail.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Taehee Yoo 提交于
Functions related to nested interface infrastructure such as netdev_walk_all_{ upper | lower }_dev() pass both private functions and "data" pointer to handle their own things. At this point, the data pointer type is void *. In order to make it easier to expand common variables and functions, this new netdev_nested_priv structure is added. In the following patch, a new member variable will be added into this struct to fix the lockdep issue. Signed-off-by: NTaehee Yoo <ap420073@gmail.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Taehee Yoo 提交于
The netdev_upper_dev_unlink() has to work differently according to flags. This idea is the same with __netdev_upper_dev_link(). In the following patches, new flags will be added. Signed-off-by: NTaehee Yoo <ap420073@gmail.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 24 9月, 2020 1 次提交
-
-
由 Mauro Carvalho Chehab 提交于
kernel-doc expects the function prototype to be just after the kernel-doc markup, as otherwise it will get it all wrong: ./net/core/dev.c:10036: warning: Excess function parameter 'dev' description in 'WAIT_REFS_MIN_MSECS' Fixes: 0e4be9e5 ("net: use exponential backoff in netdev_wait_allrefs") Signed-off-by: NMauro Carvalho Chehab <mchehab+huawei@kernel.org> Reviewed-by: NFrancesco Ruggeri <fruggeri@arista.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 19 9月, 2020 2 次提交
-
-
由 Randy Dunlap 提交于
Drop repeated words in net/core/. Signed-off-by: NRandy Dunlap <rdunlap@infradead.org> Cc: "David S. Miller" <davem@davemloft.net> Cc: Jakub Kicinski <kuba@kernel.org> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Francesco Ruggeri 提交于
The combination of aca_free_rcu, introduced in commit 2384d025 ("net/ipv6: Add anycast addresses to a global hashtable"), and fib6_info_destroy_rcu, introduced in commit 9b0a8da8 ("net/ipv6: respect rcu grace period before freeing fib6_info"), can result in an extra rcu grace period being needed when deleting an interface, with the result that netdev_wait_allrefs ends up hitting the msleep(250), which is considerably longer than the required grace period. This can result in long delays when deleting a large number of interfaces, and it can be observed with this script: ns=dummy-ns NIFS=100 ip netns add $ns ip netns exec $ns ip link set lo up ip netns exec $ns sysctl net.ipv6.conf.default.disable_ipv6=0 ip netns exec $ns sysctl net.ipv6.conf.default.forwarding=1 for ((i=0; i<$NIFS; i++)) do if=eth$i ip netns exec $ns ip link add $if type dummy ip netns exec $ns ip link set $if up ip netns exec $ns ip -6 addr add 2021:$i::1/120 dev $if done for ((i=0; i<$NIFS; i++)) do if=eth$i ip netns exec $ns ip link del $if done ip netns del $ns Instead of using a fixed msleep(250), this patch tries an extra rcu_barrier() followed by an exponential backoff. Time with this patch on a 5.4 kernel: real 0m7.704s user 0m0.385s sys 0m1.230s Time without this patch: real 0m31.522s user 0m0.438s sys 0m1.156s v2: use exponential backoff instead of trying to wake up netdev_wait_allrefs. v3: preserve reverse christmas tree ordering of local variables v4: try an extra rcu_barrier before the backoff, plus some cosmetic changes. Signed-off-by: NFrancesco Ruggeri <fruggeri@arista.com> Reviewed-by: NEric Dumazet <edumazet@google.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 16 9月, 2020 1 次提交
-
-
由 YiFei Zhu 提交于
To support modifying the used_maps array, we use a mutex to protect the use of the counter and the array. The mutex is initialized right after the prog aux is allocated, and destroyed right before prog aux is freed. This way we guarantee it's initialized for both cBPF and eBPF. Signed-off-by: NYiFei Zhu <zhuyifei@google.com> Signed-off-by: NStanislav Fomichev <sdf@google.com> Signed-off-by: NAlexei Starovoitov <ast@kernel.org> Acked-by: NAndrii Nakryiko <andriin@fb.com> Cc: YiFei Zhu <zhuyifei1999@gmail.com> Link: https://lore.kernel.org/bpf/20200915234543.3220146-2-sdf@google.com
-