- 15 May 2021, 1 commit
-
-
Submitted by Yunsheng Lin
Currently qdisc_run() checks the STATE_DEACTIVATED of a lockless qdisc before calling __qdisc_run(), which ultimately clears STATE_MISSED once all the skbs are dequeued. If STATE_DEACTIVATED is set before clearing STATE_MISSED, there may be a rescheduling of net_tx_action() at the end of qdisc_run_end(), see below:

CPU0(net_tx_action)          CPU1(__dev_xmit_skb)      CPU2(dev_deactivate)
          .                          .                          .
          .                   set STATE_MISSED                  .
          .                  __netif_schedule()                 .
          .                          .                 set STATE_DEACTIVATED
          .                          .                      qdisc_reset()
          .                          .                          .
          .<---------------          .                   synchronize_net()
clear __QDISC_STATE_SCHED |          .                          .
          .               |          .                          .
          .               |          .                some_qdisc_is_busy()
          .               |          .                   return *false*
          .               |          .                          .
  test STATE_DEACTIVATED  |          .                          .
__qdisc_run() *not* called|          .                          .
          .               |          .                          .
    test STATE_MISSED     |          .                          .
  __netif_schedule()------|          .                          .
          .                          .                          .
          .                          .                          .

__qdisc_run() is not called by net_tx_action() on CPU0 because CPU2 has set the STATE_DEACTIVATED flag during dev_deactivate(), and STATE_MISSED is only cleared in __qdisc_run(). __netif_schedule() is then called at the end of qdisc_run_end(), causing the tx action rescheduling problem.

qdisc_run() called by net_tx_action() runs in softirq context, which should have the same semantics as the qdisc_run() called by __dev_xmit_skb() under rcu_read_lock_bh(). And since there is a synchronize_net() between the STATE_DEACTIVATED flag being set and qdisc_reset()/some_qdisc_is_busy() in dev_deactivate(), we can safely bail out for a deactivated lockless qdisc in net_tx_action(); qdisc_reset() will reset all skbs not yet dequeued.

So add the rcu_read_lock() explicitly to protect qdisc_run() and do the STATE_DEACTIVATED check in net_tx_action() before calling qdisc_run_begin(). Another option is to do the check in qdisc_run_end(), but that would add unnecessary overhead to the non-tx_action case, because __dev_queue_xmit() will not see a qdisc with STATE_DEACTIVATED after synchronize_net(); such a qdisc can only be seen by net_tx_action() because of __netif_schedule().

The STATE_DEACTIVATED check in qdisc_run() was there to avoid the race between net_tx_action() and qdisc_reset(), see commit d518d2ed ("net/sched: fix race between deactivation and dequeue for NOLOCK qdisc"). As the bail-out added above for a deactivated lockless qdisc in net_tx_action() protects against that race without calling qdisc_run() at all, remove the STATE_DEACTIVATED check from qdisc_run().

After qdisc_reset(), there is no skb left in the qdisc to be dequeued, so clear STATE_MISSED in dev_reset_queue() too.

Fixes: 6b3ba914 ("net: sched: allow qdiscs to handle locking")
Acked-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com>

V8: Clearing STATE_MISSED before calling __netif_schedule() avoids the endless rescheduling problem, but there may still be an unnecessary rescheduling, so adjust the commit log.

Signed-off-by: David S. Miller <davem@davemloft.net>
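The bail-out described above can be sketched roughly as follows. This is a minimal illustration modeled on the lockless-qdisc branch of net_tx_action() in net/core/dev.c, not the verbatim patch, and the helper name is made up for the example:

```c
#include <linux/netdevice.h>
#include <net/pkt_sched.h>
#include <net/sch_generic.h>

/* Process one qdisc taken off the softnet output queue. */
static void tx_action_one_qdisc(struct Qdisc *q)
{
	rcu_read_lock();

	if (q->flags & TCQ_F_NOLOCK) {
		/* A deactivated lockless qdisc is only reachable here via
		 * __netif_schedule(); the synchronize_net() in
		 * dev_deactivate() guarantees qdisc_reset() will purge its
		 * skbs, so simply skip it instead of calling qdisc_run().
		 */
		if (unlikely(test_bit(__QDISC_STATE_DEACTIVATED, &q->state))) {
			rcu_read_unlock();
			return;
		}
		qdisc_run(q);
	} else {
		spinlock_t *root_lock = qdisc_lock(q);

		spin_lock(root_lock);
		qdisc_run(q);
		spin_unlock(root_lock);
	}

	rcu_read_unlock();
}
```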
-
- 23 April 2021, 1 commit
-
-
Submitted by Martin Willi
If a generic XDP program changes the destination MAC address from/to multicast/broadcast, the skb->pkt_type is updated to properly handle the packet when passed up the stack. When changing the MAC from/to the NIC's MAC, PACKET_HOST/OTHERHOST is not updated, though, making the behavior different from that of native XDP. Remember the PACKET_HOST/OTHERHOST state before calling the program in generic XDP, and update pkt_type accordingly if the destination MAC address has changed. As eth_type_trans() assumes a default pkt_type of PACKET_HOST, restore that before calling it. The use case for this is when an XDP program wants to push received packets up the stack by rewriting the MAC to the NIC's MAC, for example by cluster nodes sharing MAC addresses. Fixes: 29724956 ("net: fix generic XDP to handle if eth header was mangled") Signed-off-by: Martin Willi <martin@strongswan.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Toke Høiland-Jørgensen <toke@redhat.com> Link: https://lore.kernel.org/bpf/20210419141559.8611-1-martin@strongswan.org
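A rough sketch of the bookkeeping described above. The upstream change lives inline in the generic-XDP receive path, so the helper below (and its name) is hypothetical and only illustrates the before/after comparison under the assumption that the caller sampled the orig_* values before running the program:

```c
#include <linux/etherdevice.h>
#include <linux/skbuff.h>

/* Hypothetical helper: re-derive pkt_type/protocol if the XDP program
 * rewrote the Ethernet header. orig_* values were sampled before the
 * program ran; eth points at the (possibly rewritten) MAC header.
 */
static void generic_xdp_fix_pkt_type(struct sk_buff *skb, struct ethhdr *eth,
				     bool orig_host, bool orig_bcast,
				     __be16 orig_eth_type)
{
	bool host = ether_addr_equal_64bits(eth->h_dest, skb->dev->dev_addr);
	bool bcast = is_multicast_ether_addr_64bits(eth->h_dest);

	if (orig_eth_type == eth->h_proto &&
	    orig_host == host && orig_bcast == bcast)
		return;

	/* eth_type_trans() assumes PACKET_HOST and expects skb->data to
	 * point at the MAC header, so reset both before re-deriving. */
	__skb_push(skb, ETH_HLEN);
	skb->pkt_type = PACKET_HOST;
	skb->protocol = eth_type_trans(skb, skb->dev);
}
```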
-
- 20 April 2021, 1 commit
-
-
Submitted by Alexander Lobakin
Commit 38ec4944 ("gro: ensure frag0 meets IP header alignment") did the right thing, but missed the fact that the napi_gro_frags() logic calls skb_gro_reset_offset() *before* pulling the Ethernet header into the skb linear space. That said, the introduced check for the frag0 address being aligned to 4 always fails for it, as the Ethernet header is 14 bytes long and, with NET_IP_ALIGN, its start is not aligned to 4.

Fix this by adding an @nhoff argument to skb_gro_reset_offset() which tells whether an IP header is placed right at the start of frag0 or not. This restores fast GRO for napi_gro_frags(), which became very slow after the mentioned commit, and preserves the introduced check to avoid silent unaligned accesses.

From v1 [0]:
 - inline tiny skb_gro_reset_offset() to let the code be optimized more efficiently (esp. for the !NET_IP_ALIGN case) (Eric);
 - pull in Reviewed-by from Eric.

[0] https://lore.kernel.org/netdev/20210418114200.5839-1-alobakin@pm.me

Fixes: 38ec4944 ("gro: ensure frag0 meets IP header alignment")
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Alexander Lobakin <alobakin@pm.me>
Signed-off-by: David S. Miller <davem@davemloft.net>
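For reference, the resulting check can be sketched like this; it is a simplified rendition of skb_gro_reset_offset() in net/core/dev.c after the fix, not a verbatim copy:

```c
#include <linux/netdevice.h>
#include <linux/skbuff.h>

/* nhoff is 0 when the IP header starts at frag0 (napi_gro_receive()),
 * and the Ethernet header length for napi_gro_frags(), whose frag0
 * still begins with the MAC header. */
static void skb_gro_reset_offset(struct sk_buff *skb, u32 nhoff)
{
	const struct skb_shared_info *pinfo = skb_shinfo(skb);
	const skb_frag_t *frag0 = &pinfo->frags[0];

	NAPI_GRO_CB(skb)->data_offset = 0;
	NAPI_GRO_CB(skb)->frag0 = NULL;
	NAPI_GRO_CB(skb)->frag0_len = 0;

	if (!skb_headlen(skb) && pinfo->nr_frags &&
	    !PageHighMem(skb_frag_page(frag0)) &&
	    (!NET_IP_ALIGN || !((skb_frag_off(frag0) + nhoff) & 3))) {
		NAPI_GRO_CB(skb)->frag0 = skb_frag_address(frag0);
		NAPI_GRO_CB(skb)->frag0_len = min_t(unsigned int,
						    skb_frag_size(frag0),
						    skb->end - skb->tail);
	}
}
```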
-
- 14 April 2021, 1 commit
-
-
Submitted by Eric Dumazet
After commit 0f6925b3 ("virtio_net: Do not pull payload in skb->head"), Guenter Roeck reported one failure in his tests using the sh architecture. After much debugging, we have been able to spot silent unaligned accesses in inet_gro_receive(). The issue at hand is that upper networking stacks assume their header is word-aligned. Low-level drivers are supposed to reserve NET_IP_ALIGN bytes before the Ethernet header to make that happen. This patch hardens skb_gro_reset_offset() to not allow the frag0 fast path if the fragment is not properly aligned. Some arches like x86, arm64 and powerpc do not care and define NET_IP_ALIGN as 0, so this extra check will be a NOP for them. Note that if frag0 is not used, GRO will call pskb_may_pull() as many times as needed to pull the network and transport headers. Fixes: 0f6925b3 ("virtio_net: Do not pull payload in skb->head") Fixes: 78a478d0 ("gro: Inline skb_gro_header and cache frag0 virtual address") Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: Guenter Roeck <linux@roeck-us.net> Cc: Xuan Zhuo <xuanzhuo@linux.alibaba.com> Cc: "Michael S. Tsirkin" <mst@redhat.com> Cc: Jason Wang <jasowang@redhat.com> Acked-by: Michael S. Tsirkin <mst@redhat.com> Tested-by: Guenter Roeck <linux@roeck-us.net> Signed-off-by: David S. Miller <davem@davemloft.net>
-
- 10 April 2021, 1 commit
-
-
Submitted by Paolo Abeni
napi_disable() is subject to a hangup when the threaded mode is enabled and the napi is under heavy traffic. If the relevant napi has been scheduled and napi_disable() kicks in before the next napi_threaded_wait() completes - so that the latter quits due to the napi_disable_pending() condition - the existing code leaves the NAPI_STATE_SCHED bit set, and the napi_disable() loop waiting for that bit will hang. This patch addresses the issue by dropping the NAPI_STATE_DISABLE bit test in napi_thread_wait(). The later napi_threaded_poll() iteration will take care of clearing NAPI_STATE_SCHED. This also addresses a related problem reported by Jakub: before this patch a napi_disable()/napi_enable() pair killed the napi thread, effectively disabling the threaded mode. On the patched kernel napi_disable() simply stops scheduling the relevant thread. v1 -> v2: let the main napi_threaded_poll() loop clear the SCHED bit. Reported-by: Jakub Kicinski <kuba@kernel.org> Fixes: 29863d41 ("net: implement threaded-able napi poll loop support") Signed-off-by: Paolo Abeni <pabeni@redhat.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://lore.kernel.org/r/883923fa22745a9589e8610962b7dc59df09fb1f.1617981844.git.pabeni@redhat.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
-
- 08 April 2021, 1 commit
-
-
Submitted by Andrei Vagin
There is only one place where we want to specify new_ifindex. In all other cases, callers pass 0 as new_ifindex. It looks reasonable to add a low-level function that takes new_ifindex and to convert dev_change_net_namespace into a static inline wrapper. Fixes: eeb85a14 ("net: Allow to specify ifindex when device is moved to another namespace") Suggested-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Andrei Vagin <avagin@gmail.com> Acked-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
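The wrapper described here looks roughly like the following; a sketch of the include/linux/netdevice.h change, close to but not guaranteed to be the exact upstream hunk:

```c
/* Low-level variant that can pick a specific ifindex in the target netns. */
int __dev_change_net_namespace(struct net_device *dev, struct net *net,
			       const char *pat, int new_ifindex);

/* Existing callers keep the old signature; 0 means "keep/auto ifindex". */
static inline
int dev_change_net_namespace(struct net_device *dev, struct net *net,
			     const char *pat)
{
	return __dev_change_net_namespace(dev, net, pat, 0);
}
```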
-
- 06 April 2021, 1 commit
-
-
Submitted by Andrei Vagin
Currently, we can specify an ifindex on link creation. This change allows specifying an ifindex when a device is moved to another network namespace. Even now, a device's ifindex can change if there is another device with the same ifindex in the target namespace, so this change doesn't introduce completely new behavior; it adds more control to the process. CRIU users want to restore containers with pre-created network devices. A user will provide network devices and instructions for where they have to be restored, then CRIU will restore network namespaces and move devices into them. The problem is that devices have to be restored with the same indexes that they had before C/R. Cc: Alexander Mikhalitsyn <alexander.mikhalitsyn@virtuozzo.com> Suggested-by: Christian Brauner <christian.brauner@ubuntu.com> Signed-off-by: Andrei Vagin <avagin@gmail.com> Reviewed-by: Christian Brauner <christian.brauner@ubuntu.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
- 26 March 2021, 1 commit
-
-
Submitted by Dmitry Vyukov
netdev_unregister_timeout_secs=0 can lead to printing the "waiting for dev to become free" message every jiffy. This is too frequent and unnecessary. Set the minimum value to 1 second. Also fix the merge issue introduced by "net: make unregister netdev warning timeout configurable": it changed "refcnt != 1" to "refcnt". Signed-off-by: Dmitry Vyukov <dvyukov@google.com> Suggested-by: Eric Dumazet <edumazet@google.com> Fixes: 5aa3afe1 ("net: make unregister netdev warning timeout configurable") Cc: netdev@vger.kernel.org Cc: linux-kernel@vger.kernel.org Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
- 25 March 2021, 1 commit
-
-
Submitted by Pablo Neira Ayuso
This patch adds dev_fill_forward_path(), which resolves the path to reach the real netdevice from the IP forwarding side. This function takes as input the netdevice and the destination hardware address, and it walks down the devices calling .ndo_fill_forward_path() for each device until the real device is found.

For instance, assuming the following topology:

               IP forwarding
              /             \
           br0               eth0
           /  \
       eth1   eth2
        .
        .
        .
       ethX
 ab:cd:ef:ab:cd:ef

where eth1 and eth2 are bridge ports and eth0 provides WAN connectivity. ethX is the interface in another box which is connected to the eth1 bridge port.

For packets going through IP forwarding to br0 whose destination MAC address is ab:cd:ef:ab:cd:ef, dev_fill_forward_path() provides the following path: br0 -> eth1.

The .ndo_fill_forward_path handler for br0 looks up the FDB for the bridge port based on the destination MAC address to get the bridge port eth1. This information allows creating a fast path that bypasses the classic bridge and IP forwarding paths, so packets go directly from the bridge port eth1 to eth0 (the WAN interface) and vice versa.

             fast path
      .------------------------.
     /                          \
    |        IP forwarding       |
    |       /             \      \/
    |    br0               eth0
    .    /  \
     -> eth1   eth2
         .
         .
         .
        ethX
  ab:cd:ef:ab:cd:ef

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
-
- 24 March 2021, 1 commit
-
-
Submitted by Dmitry Vyukov
netdev_wait_allrefs() issues a warning if the refcount does not drop to 0 after 10 seconds. While a 10 second wait generally should not happen under normal workloads in normal environments, it seems to fire falsely very often during fuzzing and/or in qemu emulation (~10x slower). At least it's not possible to understand whether it's really a false positive or not. Automated testing generally bumps all timeouts to very high values to avoid flaky failures. Add a net.core.netdev_unregister_timeout_secs sysctl to make the timeout configurable for automated testing systems. Lowering the timeout may also be useful for e.g. manual bisection. The default value matches the current behavior. Signed-off-by: Dmitry Vyukov <dvyukov@google.com> Fixes: https://bugzilla.kernel.org/show_bug.cgi?id=211877 Cc: netdev@vger.kernel.org Cc: linux-kernel@vger.kernel.org Signed-off-by: David S. Miller <davem@davemloft.net>
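A sysctl like this is registered in net/core/sysctl_net_core.c; the entry below is an illustrative sketch (the table name is a placeholder, and the 1-second lower bound reflects the follow-up fix listed further up in this log):

```c
#include <linux/sysctl.h>

extern unsigned int netdev_unregister_timeout_secs;	/* defined in net/core/dev.c */

static struct ctl_table netdev_unregister_table[] = {	/* placeholder table name */
	{
		.procname	= "netdev_unregister_timeout_secs",
		.data		= &netdev_unregister_timeout_secs,
		.maxlen		= sizeof(unsigned int),
		.mode		= 0644,
		.proc_handler	= proc_dointvec_minmax,
		.extra1		= SYSCTL_ONE,	/* 1 s minimum, per the fix above */
		/* the real entry also caps the value via .extra2 */
	},
	{ }
};
```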
-
- 23 March 2021, 2 commits
-
-
Submitted by Eric Dumazet
When adding CONFIG_PCPU_DEV_REFCNT, I forgot that the initial net device refcount was 0. When CONFIG_PCPU_DEV_REFCNT is not set, this means the first dev_hold() triggers an illegal refcount operation (addition on 0):

refcount_t: addition on 0; use-after-free.
WARNING: CPU: 0 PID: 1 at lib/refcount.c:25 refcount_warn_saturate+0x128/0x1a4

The fix is to change the initial (and final) refcount to 1. Also add a missing kerneldoc piece, as reported by Stephen Rothwell.

Fixes: 919067cc ("net: add CONFIG_PCPU_DEV_REFCNT")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Guenter Roeck <groeck@google.com>
Tested-by: Guenter Roeck <groeck@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
-
Submitted by Vladimir Oltean
Since their introduction in commit 04157469 ("net: Use static_key for XPS maps"), xps_needed and xps_rxqs_needed have never been used outside net/core/dev.c, so I don't really understand why they were exported as symbols in the first place. This is needed in order to silence a "make W=1" warning about these static keys not being declared static while nonetheless having no previous declaration in a header file. Cc: Amritha Nambiar <amritha.nambiar@intel.com> Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
- 20 March 2021, 1 commit
-
-
Submitted by Eric Dumazet
I was working on a syzbot issue claiming one device could not be dismantled because its refcount was -1:

unregister_netdevice: waiting for sit0 to become free. Usage count = -1

It would be nice if syzbot could trigger a warning at the time this reference count becomes negative. This patch adds a CONFIG_PCPU_DEV_REFCNT option which defaults to per-cpu variables (as before this patch) on SMP builds.

v2: the free_dev label in alloc_netdev_mqs() is moved to avoid a compiler warning (-Wunused-label), as reported by kernel test robot <lkp@intel.com>

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
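The effect of the option on reference counting can be sketched as follows (a simplified sketch in the spirit of include/linux/netdevice.h, not the verbatim upstream code; the helper name is made up):

```c
#include <linux/netdevice.h>

static inline void dev_hold_sketch(struct net_device *dev)
{
	if (!dev)
		return;
#ifdef CONFIG_PCPU_DEV_REFCNT
	/* Cheap per-cpu increment, but a buggy extra dev_put() can only be
	 * noticed later, when the summed refcount goes negative. */
	this_cpu_inc(*dev->pcpu_refcnt);
#else
	/* Plain refcount_t: slower, but underflow warns immediately. */
	refcount_inc(&dev->dev_refcnt);
#endif
}
```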
-
- 19 March 2021, 9 commits
-
-
Submitted by Antoine Tenart
In __netif_set_xps_queue, old map entries from the old dev_maps are freed, but their corresponding entries in the old dev_maps aren't NULLed. Fix this. Signed-off-by: Antoine Tenart <atenart@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Submitted by Antoine Tenart
When setting up a new dev_maps in __netif_set_xps_queue, we remove and free maps from unused CPUs/rx-queues near the end of the function, by calling remove_xps_queue. However, it's possible those maps are also part of the old, not-freed-yet dev_maps, which might be used concurrently. When that happens, a map can be freed while its corresponding entry in the old dev_maps table isn't NULLed, leading to "BUG: KASAN: use-after-free" in different places. This fixes the map-freeing logic for unused CPUs/rx-queues to also NULL the map entries in the old dev_maps table. Signed-off-by: Antoine Tenart <atenart@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Submitted by Antoine Tenart
Improve the readability of the loop removing tx-queues from unused CPUs/rx-queues in __netif_set_xps_queue. The change should only be cosmetic. Signed-off-by: Antoine Tenart <atenart@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Submitted by Antoine Tenart
This patch adds a helper, xps_copy_dev_maps, to copy maps from dev_maps to new_dev_maps at a given index. The logic should be the same, with improved code readability and maintainability. Signed-off-by: Antoine Tenart <atenart@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Submitted by Antoine Tenart
Move the xps maps (xps_cpus_map and xps_rxqs_map) to an array in net_device. That greatly simplifies the code by removing the need for lots of if/else conditionals, as the correct map will be available using its offset in the array. This should not modify the xps maps behaviour in any way. Suggested-by: Alexander Duyck <alexander.duyck@gmail.com> Signed-off-by: Antoine Tenart <atenart@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Submitted by Antoine Tenart
Remove the xps possible_mask. It was an optimization, but we can just loop from 0 to nr_ids now that it is embedded in the xps dev_maps. That simplifies the code a bit. Suggested-by: Alexander Duyck <alexander.duyck@gmail.com> Signed-off-by: Antoine Tenart <atenart@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Submitted by Antoine Tenart
Embed nr_ids (the number of CPUs for the xps cpus map, and the number of rxqs for the xps rxqs map) in dev_maps. That helps avoid accessing out-of-bound memory if those values change after dev_maps was allocated. Suggested-by: Alexander Duyck <alexander.duyck@gmail.com> Signed-off-by: Antoine Tenart <atenart@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Submitted by Antoine Tenart
The xps cpus/rxqs map is accessed using dev->num_tc, which is used when allocating the map. But later updates of dev->num_tc can lead to a mismatch between the maps and how they're accessed. In such cases the map values do not make any sense and out-of-bound accesses can occur (which can easily be seen using KASAN). This patch aims at fixing this by embedding num_tc into the maps, using the value at the time the map is created. This brings two improvements: - The maps can be accessed using the embedded num_tc, so we know for sure we won't have out-of-bound accesses. - Checks can be made before accessing the maps, so we know the values retrieved will make sense. We also update __netif_set_xps_queue to conditionally copy old maps from dev_maps into the new one, only if the number of traffic classes in both maps matches. Signed-off-by: Antoine Tenart <atenart@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
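Taken together, this series converges on a map layout along these lines; a sketch of the include/linux/netdevice.h structures after the changes, close to but not guaranteed to be the exact upstream definitions:

```c
#include <linux/rcupdate.h>

enum xps_map_type {
	XPS_CPUS = 0,
	XPS_RXQS,
	XPS_MAPS_MAX,
};

struct xps_dev_maps {
	struct rcu_head rcu;
	unsigned int nr_ids;	/* nr_cpu_ids or rx-queue count at allocation time */
	s16 num_tc;		/* dev->num_tc at allocation time */
	struct xps_map __rcu *attr_map[]; /* nr_ids * num_tc entries; CPUs or RXQs map */
};

/* In struct net_device, the two former pointers become one array:
 *	struct xps_dev_maps __rcu *xps_maps[XPS_MAPS_MAX];
 */
```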
-
Submitted by Jiri Bohac
__dev_alloc_name(), when supplied with a name containing '%d', will search for the first available device number to generate a unique device name. Since commit ff927412 ("net: introduce name_node struct to be used in hashlist") network devices may have alternate names. __dev_alloc_name() does not take these alternate names into account when picking the number, possibly generating a name that is already taken and failing with -ENFILE as a result.

This demonstrates the bug:

  # rmmod dummy 2>/dev/null
  # ip link property add dev lo altname dummy0
  # modprobe dummy numdummies=1
  modprobe: ERROR: could not insert 'dummy': Too many open files in system

Instead of creating a device named dummy1, modprobe fails.

Fix this by checking all the names in the d->name_node list, not just d->name.

Signed-off-by: Jiri Bohac <jbohac@suse.cz>
Fixes: ff927412 ("net: introduce name_node struct to be used in hashlist")
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
-
- 18 March 2021, 1 commit
-
-
Submitted by Wei Wang
Currently, napi_thread_wait() checks for the NAPI_STATE_SCHED bit to determine whether the kthread owns this napi and may call napi->poll() on it. However, if socket busy poll is enabled, it is possible that the busy poll thread grabs this SCHED bit (after the previous napi->poll() invokes napi_complete_done() and clears the SCHED bit) and tries to poll the same napi. napi_disable() could grab the SCHED bit as well. This patch tries to fix this race by adding a new bit, NAPI_STATE_SCHED_THREADED, to napi->state. This bit gets set in ____napi_schedule() if the threaded mode is enabled, and gets cleared in napi_complete_done(); we only poll the napi in the kthread if this bit is set. This helps distinguish the ownership of the napi between the kthread and other scenarios and fixes the race issue. Fixes: 29863d41 ("net: implement threaded-able napi poll loop support") Reported-by: Martin Zaharinov <micron10@gmail.com> Suggested-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Wei Wang <weiwan@google.com> Cc: Alexander Duyck <alexanderduyck@fb.com> Cc: Eric Dumazet <edumazet@google.com> Cc: Paolo Abeni <pabeni@redhat.com> Cc: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
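The scheduling side of this can be sketched as follows; a simplified version of ____napi_schedule() in net/core/dev.c (the upstream code carries additional memory-ordering comments and later refinements):

```c
#include <linux/interrupt.h>
#include <linux/netdevice.h>
#include <linux/sched.h>

static inline void ____napi_schedule(struct softnet_data *sd,
				     struct napi_struct *napi)
{
	struct task_struct *thread;

	if (test_bit(NAPI_STATE_THREADED, &napi->state)) {
		thread = READ_ONCE(napi->thread);
		if (thread) {
			/* Mark kthread ownership: only the kthread may run
			 * ->poll() while SCHED_THREADED is set. */
			set_bit(NAPI_STATE_SCHED_THREADED, &napi->state);
			wake_up_process(thread);
			return;
		}
	}

	/* Fall back to the classic softirq path. */
	list_add_tail(&napi->poll_list, &sd->poll_list);
	__raise_softirq_irqoff(NET_RX_SOFTIRQ);
}
```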
-
- 16 March 2021, 2 commits
-
-
Submitted by Martin Willi
When a non-initial netns is destroyed, the usual policy is to delete all virtual network interfaces contained in it, but move physical interfaces back to the initial netns. This keeps the physical interface visible on the system. CAN devices are somewhat special, as they define rtnl_link_ops even if they are physical devices. If a CAN interface is moved into a non-initial netns, destroying that netns lets the interface vanish instead of moving it back to the initial netns: default_device_exit() skips CAN interfaces because they have rtnl_link_ops set.

Reproducer:

  ip netns add foo
  ip link set can0 netns foo
  ip netns delete foo

WARNING: CPU: 1 PID: 84 at net/core/dev.c:11030 ops_exit_list+0x38/0x60
CPU: 1 PID: 84 Comm: kworker/u4:2 Not tainted 5.10.19 #1
Workqueue: netns cleanup_net
[<c010e700>] (unwind_backtrace) from [<c010a1d8>] (show_stack+0x10/0x14)
[<c010a1d8>] (show_stack) from [<c086dc10>] (dump_stack+0x94/0xa8)
[<c086dc10>] (dump_stack) from [<c086b938>] (__warn+0xb8/0x114)
[<c086b938>] (__warn) from [<c086ba10>] (warn_slowpath_fmt+0x7c/0xac)
[<c086ba10>] (warn_slowpath_fmt) from [<c0629f20>] (ops_exit_list+0x38/0x60)
[<c0629f20>] (ops_exit_list) from [<c062a5c4>] (cleanup_net+0x230/0x380)
[<c062a5c4>] (cleanup_net) from [<c0142c20>] (process_one_work+0x1d8/0x438)
[<c0142c20>] (process_one_work) from [<c0142ee4>] (worker_thread+0x64/0x5a8)
[<c0142ee4>] (worker_thread) from [<c0148a98>] (kthread+0x148/0x14c)
[<c0148a98>] (kthread) from [<c0100148>] (ret_from_fork+0x14/0x2c)

To properly restore physical CAN devices to the initial netns on owning-netns exit, introduce a flag on rtnl_link_ops that can be set by drivers. For CAN devices setting this flag, default_device_exit() considers them non-virtual, applying the usual namespace move. The issue was introduced in the commit mentioned below, as at that time CAN devices did not have a dellink() operation.

Fixes: e008b5fc ("net: Simplfy default_device_exit and improve batching.")
Link: https://lore.kernel.org/r/20210302122423.872326-1-martin@strongswan.org
Signed-off-by: Martin Willi <martin@strongswan.org>
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
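The opt-in shows up as a new rtnl_link_ops flag checked in default_device_exit(); the sketch below is a simplified rendition of that loop, not the full upstream function (which also handles name conflicts in init_net):

```c
#include <linux/netdevice.h>
#include <net/net_namespace.h>
#include <net/rtnetlink.h>

/* Called (with RTNL held) while tearing down a non-initial netns. */
static void default_device_exit_sketch(struct net *net)
{
	struct net_device *dev, *aux;

	for_each_netdev_safe(net, dev, aux) {
		char fb_name[IFNAMSIZ];
		int err;

		/* Ignore unmoveable devices (i.e. loopback). */
		if (dev->features & NETIF_F_NETNS_LOCAL)
			continue;

		/* Leave virtual devices for the generic cleanup, unless the
		 * driver asks for a refund to init_net; physical CAN devices
		 * set rtnl_link_ops.netns_refund for exactly this reason. */
		if (dev->rtnl_link_ops && !dev->rtnl_link_ops->netns_refund)
			continue;

		/* Push remaining (physical) devices back to init_net. */
		snprintf(fb_name, IFNAMSIZ, "dev%d", dev->ifindex);
		err = dev_change_net_namespace(dev, &init_net, fb_name);
		if (err)
			pr_emerg("%s: failed to move %s to init_net: %d\n",
				 __func__, dev->name, err);
	}
}
```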
-
Submitted by Lorenzo Bianconi
For wireless devices (e.g. the mt76 driver), multiple net_devices belong to the same wireless phy, and the napi object is registered in a dummy netdevice related to the wireless phy. Export dev_set_threaded so it can be reused by device drivers enabling threaded NAPI. Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
-
- 15 March 2021, 3 commits
-
-
Submitted by Alexander Lobakin
'hash' stores not the flow hash, but the index of the GRO bucket corresponding to it. Change its name to 'bucket' to avoid confusion while reading lines like '__set_bit(hash, &napi->gro_bitmask)'. Signed-off-by: Alexander Lobakin <alobakin@pm.me> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Submitted by Alexander Lobakin
The GRO bucket index doesn't change through the entire function. Store a pointer to the corresponding bucket instead of its member and use it consistently through the function. It is performance-safe since &gro_list->list == gro_list. Misc: remove superfluous braces around single-line branches. Signed-off-by: Alexander Lobakin <alobakin@pm.me> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Submitted by Alexander Lobakin
gro_list_prepare() always returns &napi->gro_hash[bucket].list, without any variations. Moreover, it uses the 'napi' argument only to access this list, and it calculates the bucket index a second time (it is first computed at the beginning of dev_gro_receive()) to do so. Given that dev_gro_receive() already has an index to the needed list, just pass it as the first argument to eliminate redundant calculations, and make gro_list_prepare() return void. Also, both arguments of gro_list_prepare() can be constified since this function can only modify the skbs from the bucket list. Signed-off-by: Alexander Lobakin <alobakin@pm.me> Signed-off-by: David S. Miller <davem@davemloft.net>
-
- 11 March 2021, 1 commit
-
-
Submitted by Gustavo A. R. Silva
In preparation for enabling -Wimplicit-fallthrough for Clang, fix a warning by explicitly adding a break statement instead of letting the code fall through to the next case. Link: https://github.com/KSPP/linux/issues/115 Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
-
- 14 February 2021, 2 commits
-
-
Submitted by Alexander Lobakin
napi_frags_finish() and napi_skb_finish() can only be called inside the NAPI Rx context, so we can feed the NAPI cache with skbuff_heads that got the NAPI_MERGED_FREE verdict instead of freeing them immediately. Replace __kfree_skb() with __kfree_skb_defer() in napi_skb_finish() and move napi_skb_free_stolen_head() to skbuff.c, so it can drop skbs into the NAPI cache. As many drivers call napi_alloc_skb()/napi_get_frags() on their receive path, this becomes especially useful. Signed-off-by: Alexander Lobakin <alobakin@pm.me> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Submitted by Alexander Lobakin
This function isn't much needed, as the NAPI skb queue gets bulk-freed anyway when there's no more room, and it may even reduce the efficiency of bulk operations. It will be even less needed after reusing the skb cache on the allocation path, so remove it and this way lighten network softirqs a bit. Suggested-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Alexander Lobakin <alobakin@pm.me> Signed-off-by: David S. Miller <davem@davemloft.net>
-
- 13 February 2021, 1 commit
-
-
Submitted by Jesper Dangaard Brouer
The use case for dropping the MTU check when TC-BPF does redirect to ingress is described by Eyal Birger in the email[0]. The summary is the ability to increase the packet size (e.g. with IPv6 headers for NAT64), ingress-redirect the packet, and let the normal netstack fragment the packet as needed.

[0] https://lore.kernel.org/netdev/CAHsH6Gug-hsLGHQ6N0wtixdOa85LDZ3HNRHVd0opR=19Qo4W4Q@mail.gmail.com/

V15:
 - missing static for function declaration
V9:
 - Make net_device "up" (IFF_UP) check explicit in skb_do_redirect
V4:
 - Keep net_device "up" (IFF_UP) check.
 - Adjustment to handle bpf_redirect_peer() helper

Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Link: https://lore.kernel.org/bpf/161287790971.790810.11785274340154740591.stgit@firesoul
-
- 12 February 2021, 1 commit
-
-
Submitted by Cong Wang
dev_ifsioc_locked() is called with only the RCU read lock held, so when there is a parallel writer changing the MAC address, it could read a partially updated MAC address, as shown below:

Thread 1				Thread 2
// eth_commit_mac_addr_change()
memcpy(dev->dev_addr, addr->sa_data, ETH_ALEN);
					// dev_ifsioc_locked()
					memcpy(ifr->ifr_hwaddr.sa_data,
					       dev->dev_addr,...);

Close this race condition by guarding them with an RW semaphore, like netdev_get_name(). We can not use a seqlock here as it does not allow blocking. The writers already take RTNL anyway, so this does not affect the slow path. To avoid bothering existing dev_set_mac_address() callers in drivers, introduce a new wrapper just for user-facing callers on the ioctl and rtnetlink paths. Note, bonding also changes slave MAC addresses, but that requires a separate patch due to the complexity of the bonding code.

Fixes: 3710becf ("net: RCU locking for simple ioctl()")
Reported-by: "Gong, Sishuai" <sishuai@purdue.edu>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Cong Wang <cong.wang@bytedance.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
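The new serialization can be sketched as follows; this is simplified from the patch (the reader side in the ioctl path takes the semaphore for read before copying dev->dev_addr out to userspace):

```c
#include <linux/netdevice.h>
#include <linux/rwsem.h>

static DECLARE_RWSEM(dev_addr_sem);

/* User-facing wrapper used by the ioctl and rtnetlink paths; in-kernel
 * dev_set_mac_address() callers in drivers are left untouched. */
int dev_set_mac_address_user(struct net_device *dev, struct sockaddr *sa,
			     struct netlink_ext_ack *extack)
{
	int ret;

	down_write(&dev_addr_sem);
	ret = dev_set_mac_address(dev, sa, extack);
	up_write(&dev_addr_sem);

	return ret;
}
```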
-
- 10 February 2021, 3 commits
-
-
Submitted by Wei Wang
This patch adds a new sysfs attribute to the network device class. Said attribute provides a per-device control to enable/disable the threaded mode for all the napi instances of the given network device, without the need for a device up/down. The user sets it to 1 or 0 to enable or disable threaded mode. Note: when switching between threaded and the current softirq-based mode for a napi instance, the switch will not immediately take effect if the napi is currently being polled. The mode switch will happen the next time napi_schedule() is called. Co-developed-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com> Co-developed-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Co-developed-by: Felix Fietkau <nbd@nbd.name> Signed-off-by: Felix Fietkau <nbd@nbd.name> Signed-off-by: Wei Wang <weiwan@google.com> Reviewed-by: Alexander Duyck <alexanderduyck@fb.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Submitted by Wei Wang
This patch allows running each napi poll loop inside its own kernel thread. The kthread is created during netif_napi_add() if dev->threaded is set, and threaded mode is enabled in napi_enable(). We will provide a way to set dev->threaded and enable threaded mode without a device up/down in the following patch. Once threaded mode is enabled and the kthread is started, napi_schedule() will wake up that thread instead of scheduling the softirq. The threaded poll loop behaves much like net_rx_action, but it does not have to manipulate local irqs and uses an explicit scheduling point based on netdev_budget. Co-developed-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com> Co-developed-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Co-developed-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Wei Wang <weiwan@google.com> Reviewed-by: Alexander Duyck <alexanderduyck@fb.com> Signed-off-by: David S. Miller <davem@davemloft.net>
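The thread's main loop is close to the following sketch; a simplified rendition of napi_threaded_poll() in net/core/dev.c, with later refinements omitted:

```c
#include <linux/netdevice.h>
#include <linux/netpoll.h>
#include <linux/sched.h>

static int napi_threaded_poll(void *data)
{
	struct napi_struct *napi = data;
	void *have;

	/* napi_thread_wait() sleeps until this kthread owns the napi. */
	while (!napi_thread_wait(napi)) {
		for (;;) {
			bool repoll = false;

			local_bh_disable();
			have = netpoll_poll_lock(napi);
			__napi_poll(napi, &repoll);	/* runs ->poll() once */
			netpoll_poll_unlock(have);
			local_bh_enable();

			if (!repoll)
				break;

			/* Explicit scheduling point instead of softirq limits. */
			cond_resched();
		}
	}
	return 0;
}
```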
-
Submitted by Felix Fietkau
This commit introduces a new function, __napi_poll(), which contains the main logic of the existing napi_poll() function and will be called by other functions in later commits. The idea and implementation are by Felix Fietkau <nbd@nbd.name> and were proposed as part of the patch to move napi work to work_queue context. This commit by itself is only a code restructure. Signed-off-by: Felix Fietkau <nbd@nbd.name> Signed-off-by: Wei Wang <weiwan@google.com> Reviewed-by: Alexander Duyck <alexanderduyck@fb.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
- 06 February 2021, 1 commit
-
-
Submitted by Eric Dumazet
Commit c8079432 ("net: Fix packet reordering caused by GRO and listified RX cooperation") had the unfortunate effect of adding latencies in common workloads. Before the patch, GRO packets were immediately passed to the upper stacks. After the patch, we can accumulate quite a lot of GRO packets (depending on the NAPI budget). My fix is to count the number of segments in napi->rx_count instead of the number of logical packets. Fixes: c8079432 ("net: Fix packet reordering caused by GRO and listified RX cooperation") Signed-off-by: Eric Dumazet <edumazet@google.com> Bisected-by: John Sperbeck <jsperbeck@google.com> Tested-by: Jian Yang <jianyang@google.com> Cc: Maxim Mikityanskiy <maximmi@mellanox.com> Reviewed-by: Saeed Mahameed <saeedm@nvidia.com> Reviewed-by: Edward Cree <ecree.xilinx@gmail.com> Reviewed-by: Alexander Lobakin <alobakin@pm.me> Link: https://lore.kernel.org/r/20210204213146.4192368-1-eric.dumazet@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
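The accounting change amounts to something like the following; a sketch of the gro_normal_one() helper in net/core/dev.c after the fix (gro_normal_batch is the existing flush-threshold sysctl variable):

```c
/* Queue one GRO'd skb for delivery and flush the batch once enough
 * *segments* (not logical packets) have accumulated. */
static inline void gro_normal_one(struct napi_struct *napi,
				  struct sk_buff *skb, int segs)
{
	list_add_tail(&skb->list, &napi->rx_list);
	napi->rx_count += segs;		/* was: napi->rx_count++ */
	if (napi->rx_count >= gro_normal_batch)
		gro_normal_list(napi);
}
```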
-
- 05 February 2021, 1 commit
-
-
Submitted by Leon Romanovsky
Fix the following compilation warnings:

 1031 | INDIRECT_CALLABLE_SCOPE void udp_v6_early_demux(struct sk_buff *skb)
net/ipv6/ip6_offload.c:182:41: warning: no previous prototype for ‘ipv6_gro_receive’ [-Wmissing-prototypes]
  182 | INDIRECT_CALLABLE_SCOPE struct sk_buff *ipv6_gro_receive(struct list_head *head,
      |                                         ^~~~~~~~~~~~~~~~
net/ipv6/ip6_offload.c:320:29: warning: no previous prototype for ‘ipv6_gro_complete’ [-Wmissing-prototypes]
  320 | INDIRECT_CALLABLE_SCOPE int ipv6_gro_complete(struct sk_buff *skb, int nhoff)
      |                             ^~~~~~~~~~~~~~~~~
net/ipv6/ip6_offload.c:182:41: warning: no previous prototype for ‘ipv6_gro_receive’ [-Wmissing-prototypes]
  182 | INDIRECT_CALLABLE_SCOPE struct sk_buff *ipv6_gro_receive(struct list_head *head,
      |                                         ^~~~~~~~~~~~~~~~
net/ipv6/ip6_offload.c:320:29: warning: no previous prototype for ‘ipv6_gro_complete’ [-Wmissing-prototypes]
  320 | INDIRECT_CALLABLE_SCOPE int ipv6_gro_complete(struct sk_buff *skb, int nhoff)

Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
-
- 30 January 2021, 1 commit
-
-
Submitted by Xin Long
The NETIF_F_IP|IPV6_CSUM feature flags indicate UDP and TCP csum offload, while the NETIF_F_HW_CSUM feature flag indicates IP generic csum offload for the HW, which includes not only TCP/UDP csum but also other protocols' csums, like GRE's. However, skb_csum_hwoffload_help() only checks features against NETIF_F_CSUM_MASK (NETIF_F_HW|IP|IPV6_CSUM). So if it's a non-TCP/UDP packet and the features don't support NETIF_F_HW_CSUM but support NETIF_F_IP|IPV6_CSUM only, it would still return 0 and leave the HW to do the csum. This patch supports IP generic csum processing by checking NETIF_F_HW_CSUM for all protocols, and checking (NETIF_F_IP_CSUM | NETIF_F_IPV6_CSUM) only for TCP and UDP. Note that we're using skb->csum_offset to check whether it's a TCP/UDP protocol; this might be fragile. However, as Alex said, for now we only have a few L4 protocols that are requesting Tx csum offload, so we'd better fix this until a new protocol comes with the same csum offset. v1->v2: do not extend skb->csum_not_inet, but use skb->csum_offset to tell whether it's a UDP/TCP csum packet. v2->v3: add a note in the changelog, as Willem suggested. Suggested-by: Alexander Duyck <alexander.duyck@gmail.com> Signed-off-by: Xin Long <lucien.xin@gmail.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
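The resulting decision logic can be sketched like this; a simplified version of skb_csum_hwoffload_help() in net/core/dev.c (the real function also handles SCTP CRC32c offload before these checks):

```c
#include <linux/netdevice.h>
#include <linux/tcp.h>
#include <linux/udp.h>

int skb_csum_hwoffload_help(struct sk_buff *skb,
			    const netdev_features_t features)
{
	/* NETIF_F_HW_CSUM: the device can checksum any protocol. */
	if (features & NETIF_F_HW_CSUM)
		return 0;

	/* NETIF_F_IP(V6)_CSUM only covers TCP and UDP; use csum_offset
	 * to recognize those among CHECKSUM_PARTIAL packets. */
	if (features & (NETIF_F_IP_CSUM | NETIF_F_IPV6_CSUM)) {
		switch (skb->csum_offset) {
		case offsetof(struct tcphdr, check):
		case offsetof(struct udphdr, check):
			return 0;
		}
	}

	/* Otherwise resolve the checksum in software. */
	return skb_checksum_help(skb);
}
```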
-
- 23 January 2021, 1 commit
-
-
Submitted by Yousuk Seung
This patch adds TCP_NLA_TTL to SCM_TIMESTAMPING_OPT_STATS, exporting the time-to-live or hop limit of the latest incoming packet with SCM_TSTAMP_ACK. The value exported may not be from the packet that acks the sequence when incoming packets are aggregated. Exporting the time-to-live or hop limit value of incoming packets helps to estimate the hop count of the path of the flow, which may change over time. Signed-off-by: Yousuk Seung <ysseung@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Neal Cardwell <ncardwell@google.com> Link: https://lore.kernel.org/r/20210120204155.552275-1-ysseung@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
-