- 09 2月, 2022 2 次提交
-
-
由 Eric Dumazet 提交于
For some reason default_device_ops kept two exit method: 1) default_device_exit() is called for each netns being dismantled in a cleanup_net() round. This acquires rtnl for each invocation. 2) default_device_exit_batch() is called once with the list of all netns int the batch, allowing for a single rtnl invocation. Get rid of the .exit() method to handle the logic from default_device_exit_batch(), to decrease the number of rtnl acquisition to one. Signed-off-by: NEric Dumazet <edumazet@google.com> Reviewed-by: NDavid Ahern <dsahern@kernel.org> Signed-off-by: NJakub Kicinski <kuba@kernel.org>
-
由 Eric Dumazet 提交于
Convert one dev_hold()/dev_put() pair in register_netdevice() and unregister_netdevice_many() to dev_hold_track() and dev_put_track(). This would allow to detect a rogue dev_put() a bit earlier. Signed-off-by: NEric Dumazet <edumazet@google.com> Link: https://lore.kernel.org/r/20220207184107.1401096-1-eric.dumazet@gmail.comSigned-off-by: NJakub Kicinski <kuba@kernel.org>
-
- 06 2月, 2022 1 次提交
-
-
由 Eric Dumazet 提交于
While testing a patch that will follow later ("net: add netns refcount tracker to struct nsproxy") I found that devtmpfs_init() was called before init_net was initialized. This is a bug, because devtmpfs_setup() calls ksys_unshare(CLONE_NEWNS); This has the effect of increasing init_net refcount, which will be later overwritten to 1, as part of setup_net(&init_net) We had too many prior patches [1] trying to work around the root cause. Really, make sure init_net is in BSS section, and that net_ns_init() is called earlier at boot time. Note that another patch ("vfs: add netns refcount tracker to struct fs_context") also will need net_ns_init() being called before vfs_caches_init() As a bonus, this patch saves around 4KB in .data section. [1] f8c46cb3 ("netns: do not call pernet ops for not yet set up init_net namespace") b5082df8 ("net: Initialise init_net.count to 1") 734b6541 ("net: Statically initialize init_net.dev_base_head") v2: fixed a build error reported by kernel build bots (CONFIG_NET=n) Signed-off-by: NEric Dumazet <edumazet@google.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 05 2月, 2022 1 次提交
-
-
由 Eric Dumazet 提交于
We are still chasing some syzbot reports where we think a rogue dev_put() is called with no corresponding prior dev_hold(). Unfortunately it eats a reference on dev->dev_refcnt taken by innocent dev_hold_track(), meaning that the refcount saturation splat comes too late to be useful. Make sure that 'not tracked' dev_put() and dev_hold() better use CONFIG_NET_DEV_REFCNT_TRACKER=y debug infrastructure: Prior patch in the series allowed ref_tracker_alloc() and ref_tracker_free() to be called with a NULL @trackerp parameter, and to use a separate refcount only to detect too many put() even in the following case: dev_hold_track(dev, tracker_1, GFP_ATOMIC); dev_hold(dev); dev_put(dev); dev_put(dev); // Should complain loudly here. dev_put_track(dev, tracker_1); // instead of here Add clarification about netdev_tracker_alloc() role. v2: I replaced the dev_put() in linkwatch_do_dev() with __dev_put() because callers called netdev_tracker_free(). Signed-off-by: NEric Dumazet <edumazet@google.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 04 2月, 2022 1 次提交
-
-
由 Eric Dumazet 提交于
__dev_alloc_name() allocates a private zeroed page, then sets bits in it while iterating through net devices. It can use __set_bit() to avoid unnecessary locked operations. Signed-off-by: NEric Dumazet <edumazet@google.com> Link: https://lore.kernel.org/r/20220203064609.3242863-1-eric.dumazet@gmail.comSigned-off-by: NJakub Kicinski <kuba@kernel.org>
-
- 12 1月, 2022 1 次提交
-
-
由 Toke Høiland-Jørgensen 提交于
The bpf_xdp_link_update() function didn't check the program type before updating the program, which made it possible to install any program type as an XDP program, which is obviously not good. Syzbot managed to trigger this by swapping in an LWT program on the XDP hook which would crash in a helper call. Fix this by adding a check and bailing out if the types don't match. Fixes: 026a4c28 ("bpf, xdp: Implement LINK_UPDATE for BPF XDP link") Reported-by: syzbot+983941aa85af6ded1fd9@syzkaller.appspotmail.com Acked-by: NAndrii Nakryiko <andrii@kernel.org> Signed-off-by: NToke Høiland-Jørgensen <toke@redhat.com> Link: https://lore.kernel.org/r/20220107221115.326171-1-toke@redhat.comSigned-off-by: NAlexei Starovoitov <ast@kernel.org>
-
- 10 1月, 2022 1 次提交
-
-
由 Menglong Dong 提交于
Introduce the interface kfree_skb_reason(), which is able to pass the reason why the skb is dropped to 'kfree_skb' tracepoint. Add the 'reason' field to 'trace_kfree_skb', therefor user can get more detail information about abnormal skb with 'drop_monitor' or eBPF. All drop reasons are defined in the enum 'skb_drop_reason', and they will be print as string in 'kfree_skb' tracepoint in format of 'reason: XXX'. ( Maybe the reasons should be defined in a uapi header file, so that user space can use them? ) Signed-off-by: NMenglong Dong <imagedong@tencent.com> Signed-off-by: NJakub Kicinski <kuba@kernel.org>
-
- 06 1月, 2022 1 次提交
-
-
由 Coco Li 提交于
Eric Dumazet suggested to allow users to modify max GRO packet size. We have seen GRO being disabled by users of appliances (such as wifi access points) because of claimed bufferbloat issues, or some work arounds in sch_cake, to split GRO/GSO packets. Instead of disabling GRO completely, one can chose to limit the maximum packet size of GRO packets, depending on their latency constraints. This patch adds a per device gro_max_size attribute that can be changed with ip link command. ip link set dev eth0 gro_max_size 16000 Suggested-by: NEric Dumazet <edumazet@google.com> Signed-off-by: NCoco Li <lixiaoyan@google.com> Signed-off-by: NEric Dumazet <edumazet@google.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 18 12月, 2021 1 次提交
-
-
由 Paul Blakey 提交于
BPF layer extends the qdisc control block via struct bpf_skb_data_end and because of that there is no more room to add variables to the qdisc layer control block without going over the skb->cb size. Extend the qdisc control block with a tc control block, and move all tc related variables to there as a pre-step for extending the tc control block with additional members. Signed-off-by: NPaul Blakey <paulb@nvidia.com> Signed-off-by: NJakub Kicinski <kuba@kernel.org>
-
- 14 12月, 2021 2 次提交
-
-
Change the order of arguments and make qdisc_is_running() appear first. This is more readable for the general case. Signed-off-by: NSebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Paolo Abeni 提交于
In non trivial scenarios, the action id alone is not sufficient to identify the program causing the warning. Before the previous patch, the generated stack-trace pointed out at least the involved device driver. Let's additionally include the program name and id, and the relevant device name. If the user needs additional infos, he can fetch them via a kernel probe, leveraging the arguments added here. Signed-off-by: NPaolo Abeni <pabeni@redhat.com> Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net> Acked-by: NToke Høiland-Jørgensen <toke@redhat.com> Link: https://lore.kernel.org/bpf/ddb96bb975cbfddb1546cf5da60e77d5100b533c.1638189075.git.pabeni@redhat.com
-
- 13 12月, 2021 1 次提交
-
-
The root-lock is dropped before dev_hard_start_xmit() is invoked and after setting the __QDISC___STATE_RUNNING bit. If the Qdisc owner is preempted by another sender/task with a higher priority then this new sender won't be able to submit packets to the NIC directly instead they will be enqueued into the Qdisc. The NIC will remain idle until the Qdisc owner is scheduled again and finishes the job. By serializing every task on the ->busylock then the task will be preempted by a sender only after the Qdisc has no owner. Always serialize on the busylock on PREEMPT_RT. Signed-off-by: NSebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 07 12月, 2021 2 次提交
-
-
由 Eric Dumazet 提交于
Signed-off-by: NEric Dumazet <edumazet@google.com> Signed-off-by: NJakub Kicinski <kuba@kernel.org>
-
由 Eric Dumazet 提交于
net device are refcounted. Over the years we had numerous bugs caused by imbalanced dev_hold() and dev_put() calls. The general idea is to be able to precisely pair each decrement with a corresponding prior increment. Both share a cookie, basically a pointer to private data storing stack traces. This patch adds dev_hold_track() and dev_put_track(). To use these helpers, each data structure owning a refcount should also use a "netdevice_tracker" to pair the hold and put. netdevice_tracker dev_tracker; ... dev_hold_track(dev, &dev_tracker, GFP_ATOMIC); ... dev_put_track(dev, &dev_tracker); Whenever a leak happens, we will get precise stack traces of the point dev_hold_track() happened, at device dismantle phase. We will also get a stack trace if too many dev_put_track() for the same netdevice_tracker are attempted. This is guarded by CONFIG_NET_DEV_REFCNT_TRACKER option. Signed-off-by: NEric Dumazet <edumazet@google.com> Signed-off-by: NJakub Kicinski <kuba@kernel.org>
-
- 02 12月, 2021 1 次提交
-
-
由 Eric Dumazet 提交于
syzbot found that __dev_queue_xmit() is reading txq->xmit_lock_owner without annotations. No serious issue there, let's document what is happening there. BUG: KCSAN: data-race in __dev_queue_xmit / __dev_queue_xmit write to 0xffff888139d09484 of 4 bytes by interrupt on cpu 0: __netif_tx_unlock include/linux/netdevice.h:4437 [inline] __dev_queue_xmit+0x948/0xf70 net/core/dev.c:4229 dev_queue_xmit_accel+0x19/0x20 net/core/dev.c:4265 macvlan_queue_xmit drivers/net/macvlan.c:543 [inline] macvlan_start_xmit+0x2b3/0x3d0 drivers/net/macvlan.c:567 __netdev_start_xmit include/linux/netdevice.h:4987 [inline] netdev_start_xmit include/linux/netdevice.h:5001 [inline] xmit_one+0x105/0x2f0 net/core/dev.c:3590 dev_hard_start_xmit+0x72/0x120 net/core/dev.c:3606 sch_direct_xmit+0x1b2/0x7c0 net/sched/sch_generic.c:342 __dev_xmit_skb+0x83d/0x1370 net/core/dev.c:3817 __dev_queue_xmit+0x590/0xf70 net/core/dev.c:4194 dev_queue_xmit+0x13/0x20 net/core/dev.c:4259 neigh_hh_output include/net/neighbour.h:511 [inline] neigh_output include/net/neighbour.h:525 [inline] ip6_finish_output2+0x995/0xbb0 net/ipv6/ip6_output.c:126 __ip6_finish_output net/ipv6/ip6_output.c:191 [inline] ip6_finish_output+0x444/0x4c0 net/ipv6/ip6_output.c:201 NF_HOOK_COND include/linux/netfilter.h:296 [inline] ip6_output+0x10e/0x210 net/ipv6/ip6_output.c:224 dst_output include/net/dst.h:450 [inline] NF_HOOK include/linux/netfilter.h:307 [inline] ndisc_send_skb+0x486/0x610 net/ipv6/ndisc.c:508 ndisc_send_rs+0x3b0/0x3e0 net/ipv6/ndisc.c:702 addrconf_rs_timer+0x370/0x540 net/ipv6/addrconf.c:3898 call_timer_fn+0x2e/0x240 kernel/time/timer.c:1421 expire_timers+0x116/0x240 kernel/time/timer.c:1466 __run_timers+0x368/0x410 kernel/time/timer.c:1734 run_timer_softirq+0x2e/0x60 kernel/time/timer.c:1747 __do_softirq+0x158/0x2de kernel/softirq.c:558 __irq_exit_rcu kernel/softirq.c:636 [inline] irq_exit_rcu+0x37/0x70 kernel/softirq.c:648 sysvec_apic_timer_interrupt+0x3e/0xb0 arch/x86/kernel/apic/apic.c:1097 asm_sysvec_apic_timer_interrupt+0x12/0x20 read to 0xffff888139d09484 of 4 bytes by interrupt on cpu 1: __dev_queue_xmit+0x5e3/0xf70 net/core/dev.c:4213 dev_queue_xmit_accel+0x19/0x20 net/core/dev.c:4265 macvlan_queue_xmit drivers/net/macvlan.c:543 [inline] macvlan_start_xmit+0x2b3/0x3d0 drivers/net/macvlan.c:567 __netdev_start_xmit include/linux/netdevice.h:4987 [inline] netdev_start_xmit include/linux/netdevice.h:5001 [inline] xmit_one+0x105/0x2f0 net/core/dev.c:3590 dev_hard_start_xmit+0x72/0x120 net/core/dev.c:3606 sch_direct_xmit+0x1b2/0x7c0 net/sched/sch_generic.c:342 __dev_xmit_skb+0x83d/0x1370 net/core/dev.c:3817 __dev_queue_xmit+0x590/0xf70 net/core/dev.c:4194 dev_queue_xmit+0x13/0x20 net/core/dev.c:4259 neigh_resolve_output+0x3db/0x410 net/core/neighbour.c:1523 neigh_output include/net/neighbour.h:527 [inline] ip6_finish_output2+0x9be/0xbb0 net/ipv6/ip6_output.c:126 __ip6_finish_output net/ipv6/ip6_output.c:191 [inline] ip6_finish_output+0x444/0x4c0 net/ipv6/ip6_output.c:201 NF_HOOK_COND include/linux/netfilter.h:296 [inline] ip6_output+0x10e/0x210 net/ipv6/ip6_output.c:224 dst_output include/net/dst.h:450 [inline] NF_HOOK include/linux/netfilter.h:307 [inline] ndisc_send_skb+0x486/0x610 net/ipv6/ndisc.c:508 ndisc_send_rs+0x3b0/0x3e0 net/ipv6/ndisc.c:702 addrconf_rs_timer+0x370/0x540 net/ipv6/addrconf.c:3898 call_timer_fn+0x2e/0x240 kernel/time/timer.c:1421 expire_timers+0x116/0x240 kernel/time/timer.c:1466 __run_timers+0x368/0x410 kernel/time/timer.c:1734 run_timer_softirq+0x2e/0x60 kernel/time/timer.c:1747 __do_softirq+0x158/0x2de kernel/softirq.c:558 __irq_exit_rcu kernel/softirq.c:636 [inline] irq_exit_rcu+0x37/0x70 kernel/softirq.c:648 sysvec_apic_timer_interrupt+0x8d/0xb0 arch/x86/kernel/apic/apic.c:1097 asm_sysvec_apic_timer_interrupt+0x12/0x20 kcsan_setup_watchpoint+0x94/0x420 kernel/kcsan/core.c:443 folio_test_anon include/linux/page-flags.h:581 [inline] PageAnon include/linux/page-flags.h:586 [inline] zap_pte_range+0x5ac/0x10e0 mm/memory.c:1347 zap_pmd_range mm/memory.c:1467 [inline] zap_pud_range mm/memory.c:1496 [inline] zap_p4d_range mm/memory.c:1517 [inline] unmap_page_range+0x2dc/0x3d0 mm/memory.c:1538 unmap_single_vma+0x157/0x210 mm/memory.c:1583 unmap_vmas+0xd0/0x180 mm/memory.c:1615 exit_mmap+0x23d/0x470 mm/mmap.c:3170 __mmput+0x27/0x1b0 kernel/fork.c:1113 mmput+0x3d/0x50 kernel/fork.c:1134 exit_mm+0xdb/0x170 kernel/exit.c:507 do_exit+0x608/0x17a0 kernel/exit.c:819 do_group_exit+0xce/0x180 kernel/exit.c:929 get_signal+0xfc3/0x1550 kernel/signal.c:2852 arch_do_signal_or_restart+0x8c/0x2e0 arch/x86/kernel/signal.c:868 handle_signal_work kernel/entry/common.c:148 [inline] exit_to_user_mode_loop kernel/entry/common.c:172 [inline] exit_to_user_mode_prepare+0x113/0x190 kernel/entry/common.c:207 __syscall_exit_to_user_mode_work kernel/entry/common.c:289 [inline] syscall_exit_to_user_mode+0x20/0x40 kernel/entry/common.c:300 do_syscall_64+0x50/0xd0 arch/x86/entry/common.c:86 entry_SYSCALL_64_after_hwframe+0x44/0xae value changed: 0x00000000 -> 0xffffffff Reported by Kernel Concurrency Sanitizer on: CPU: 1 PID: 28712 Comm: syz-executor.0 Tainted: G W 5.16.0-rc1-syzkaller #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 Fixes: 1da177e4 ("Linux-2.6.12-rc2") Signed-off-by: NEric Dumazet <edumazet@google.com> Reported-by: Nsyzbot <syzkaller@googlegroups.com> Link: https://lore.kernel.org/r/20211130170155.2331929-1-eric.dumazet@gmail.comSigned-off-by: NJakub Kicinski <kuba@kernel.org>
-
- 29 11月, 2021 1 次提交
-
-
The writer acquires dev_base_lock with disabled bottom halves. The reader can acquire dev_base_lock without disabling bottom halves because there is no writer in softirq context. On PREEMPT_RT the softirqs are preemptible and local_bh_disable() acts as a lock to ensure that resources, that are protected by disabling bottom halves, remain protected. This leads to a circular locking dependency if the lock acquired with disabled bottom halves (as in write_lock_bh()) and somewhere else with enabled bottom halves (as by read_lock() in netstat_show()) followed by disabling bottom halves (cxgb_get_stats() -> t4_wr_mbox_meat_timeout() -> spin_lock_bh()). This is the reverse locking order. All read_lock() invocation are from sysfs callback which are not invoked from softirq context. Therefore there is no need to disable bottom halves while acquiring a write lock. Acquire the write lock of dev_base_lock without disabling bottom halves. Reported-by: NPei Zhang <pezhang@redhat.com> Reported-by: NLuis Claudio R. Goncalves <lgoncalv@redhat.com> Signed-off-by: NSebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 23 11月, 2021 1 次提交
-
-
由 Jakub Kicinski 提交于
.ndo_change_proto_down was added seemingly to enable out-of-tree implementations. Over 2.5yrs later we still have no real users upstream. Hardwire the generic implementation for now, we can revert once real users materialize. (rocker is a test vehicle, not a user.) We need to drop the optimization on the sysfs side, because unlike ndos priv_flags will be changed at runtime, so we'd need READ_ONCE/WRITE_ONCE everywhere.. Signed-off-by: NJakub Kicinski <kuba@kernel.org> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 22 11月, 2021 1 次提交
-
-
由 Eric Dumazet 提交于
dev->gso_max_segs is written under RTNL protection, or when the device is not yet visible, but is read locklessly. Add netif_set_gso_max_segs() helper. Add the READ_ONCE()/WRITE_ONCE() pairs, and use netif_set_gso_max_segs() where we can to better document what is going on. Signed-off-by: NEric Dumazet <edumazet@google.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 20 11月, 2021 1 次提交
-
-
由 Jakub Kicinski 提交于
netdev->dev_addr should only be modified via helpers, but someone may be casting off the const. Add a runtime check to catch abuses. Signed-off-by: NJakub Kicinski <kuba@kernel.org> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 16 11月, 2021 1 次提交
-
-
由 Eric Dumazet 提交于
Move gro code and data from net/core/dev.c to net/core/gro.c to ease maintenance. gro_normal_list() and gro_normal_one() are inlined because they are called from both files. Signed-off-by: NEric Dumazet <edumazet@google.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 11 11月, 2021 1 次提交
-
-
由 Alexander Lobakin 提交于
Commit 719c5719 ("net: make napi_disable() symmetric with enable") accidentally introduced a bug sometimes leading to a kernel BUG when bringing an iface up/down under heavy traffic load. Prior to this commit, napi_disable() was polling n->state until none of (NAPIF_STATE_SCHED | NAPIF_STATE_NPSVC) is set and then always flip them. Now there's a possibility to get away with the NAPIF_STATE_SCHE unset as 'continue' drops us to the cmpxchg() call with an uninitialized variable, rather than straight to another round of the state check. Error path looks like: napi_disable(): unsigned long val, new; /* new is uninitialized */ do { val = READ_ONCE(n->state); /* NAPIF_STATE_NPSVC and/or NAPIF_STATE_SCHED is set */ if (val & (NAPIF_STATE_SCHED | NAPIF_STATE_NPSVC)) { /* true */ usleep_range(20, 200); continue; /* go straight to the condition check */ } new = val | <...> } while (cmpxchg(&n->state, val, new) != val); /* state == val, cmpxchg() writes garbage */ napi_enable(): do { val = READ_ONCE(n->state); BUG_ON(!test_bit(NAPI_STATE_SCHED, &val)); /* 50/50 boom */ <...> while the typical BUG splat is like: [ 172.652461] ------------[ cut here ]------------ [ 172.652462] kernel BUG at net/core/dev.c:6937! [ 172.656914] invalid opcode: 0000 [#1] PREEMPT SMP PTI [ 172.661966] CPU: 36 PID: 2829 Comm: xdp_redirect_cp Tainted: G I 5.15.0 #42 [ 172.670222] Hardware name: Intel Corporation S2600WFT/S2600WFT, BIOS SE5C620.86B.02.01.0014.082620210524 08/26/2021 [ 172.680646] RIP: 0010:napi_enable+0x5a/0xd0 [ 172.684832] Code: 07 49 81 cc 00 01 00 00 4c 89 e2 48 89 d8 80 e6 fb f0 48 0f b1 55 10 48 39 c3 74 10 48 8b 5d 10 f6 c7 04 75 3d f6 c3 01 75 b4 <0f> 0b 5b 5d 41 5c c3 65 ff 05 b8 e5 61 53 48 c7 c6 c0 f3 34 ad 48 [ 172.703578] RSP: 0018:ffffa3c9497477a8 EFLAGS: 00010246 [ 172.708803] RAX: ffffa3c96615a014 RBX: 0000000000000000 RCX: ffff8a4b575301a0 < snip > [ 172.782403] Call Trace: [ 172.784857] <TASK> [ 172.786963] ice_up_complete+0x6f/0x210 [ice] [ 172.791349] ice_xdp+0x136/0x320 [ice] [ 172.795108] ? ice_change_mtu+0x180/0x180 [ice] [ 172.799648] dev_xdp_install+0x61/0xe0 [ 172.803401] dev_xdp_attach+0x1e0/0x550 [ 172.807240] dev_change_xdp_fd+0x1e6/0x220 [ 172.811338] do_setlink+0xee8/0x1010 [ 172.814917] rtnl_setlink+0xe5/0x170 [ 172.818499] ? bpf_lsm_binder_set_context_mgr+0x10/0x10 [ 172.823732] ? security_capable+0x36/0x50 < snip > Fix this by replacing 'do { } while (cmpxchg())' with an "infinite" for-loop with an explicit break. From v1 [0]: - just use a for-loop to simplify both the fix and the existing code (Eric). [0] https://lore.kernel.org/netdev/20211110191126.1214-1-alexandr.lobakin@intel.com Fixes: 719c5719 ("net: make napi_disable() symmetric with enable") Suggested-by: Eric Dumazet <edumazet@google.com> # for-loop Signed-off-by: NAlexander Lobakin <alexandr.lobakin@intel.com> Reviewed-by: NJesse Brandeburg <jesse.brandeburg@intel.com> Reviewed-by: NEric Dumazet <edumazet@google.com> Link: https://lore.kernel.org/r/20211110195605.1304-1-alexandr.lobakin@intel.comSigned-off-by: NJakub Kicinski <kuba@kernel.org>
-
- 27 10月, 2021 1 次提交
-
-
由 Ben Ben-ishay 提交于
LRO and HW-GRO are mutually exclusive, this commit adds this restriction in netdev_fix_feature. HW-GRO is preferred, that means in case both HW-GRO and LRO features are requested, LRO is cleared. Signed-off-by: NBen Ben-ishay <benishay@nvidia.com> Reviewed-by: NTariq Toukan <tariqt@nvidia.com> Signed-off-by: NSaeed Mahameed <saeedm@nvidia.com>
-
- 26 10月, 2021 1 次提交
-
-
由 Cyril Strejc 提交于
During a testing of an user-space application which transmits UDP multicast datagrams and utilizes multicast routing to send the UDP datagrams out of defined network interfaces, I've found a multicast router does not fill-in UDP checksum into locally produced, looped-back and forwarded UDP datagrams, if an original output NIC the datagrams are sent to has UDP TX checksum offload enabled. The datagrams are sent malformed out of the NIC the datagrams have been forwarded to. It is because: 1. If TX checksum offload is enabled on the output NIC, UDP checksum is not calculated by kernel and is not filled into skb data. 2. dev_loopback_xmit(), which is called solely by ip_mc_finish_output(), sets skb->ip_summed = CHECKSUM_UNNECESSARY unconditionally. 3. Since 35fc92a9 ("[NET]: Allow forwarding of ip_summed except CHECKSUM_COMPLETE"), the ip_summed value is preserved during forwarding. 4. If ip_summed != CHECKSUM_PARTIAL, checksum is not calculated during a packet egress. The minimum fix in dev_loopback_xmit(): 1. Preserves skb->ip_summed CHECKSUM_PARTIAL. This is the case when the original output NIC has TX checksum offload enabled. The effects are: a) If the forwarding destination interface supports TX checksum offloading, the NIC driver is responsible to fill-in the checksum. b) If the forwarding destination interface does NOT support TX checksum offloading, checksums are filled-in by kernel before skb is submitted to the NIC driver. c) For local delivery, checksum validation is skipped as in the case of CHECKSUM_UNNECESSARY, thanks to skb_csum_unnecessary(). 2. Translates ip_summed CHECKSUM_NONE to CHECKSUM_UNNECESSARY. It means, for CHECKSUM_NONE, the behavior is unmodified and is there to skip a looped-back packet local delivery checksum validation. Signed-off-by: NCyril Strejc <cyril.strejc@skoda.cz> Reviewed-by: NWillem de Bruijn <willemb@google.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 25 10月, 2021 1 次提交
-
-
由 Michael Chan 提交于
Drivers call netdev_set_num_tc() and then netdev_set_tc_queue() to set the queue count and offset for each TC. So the queue count and offset for the TCs may be zero for a short period after dev->num_tc has been set. If a TX packet is being transmitted at this time in the code path netdev_pick_tx() -> skb_tx_hash(), skb_tx_hash() may see nonzero dev->num_tc but zero qcount for the TC. The while loop that keeps looping while hash >= qcount will not end. Fix it by checking the TC's qcount to be nonzero before using it. Fixes: eadec877 ("net: Add support for subordinate traffic classes to netdev_pick_tx") Reviewed-by: NAndy Gospodarek <gospo@broadcom.com> Signed-off-by: NMichael Chan <michael.chan@broadcom.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 20 10月, 2021 1 次提交
-
-
由 Jesse Brandeburg 提交于
While loading a driver and changing the number of queues, I noticed this message in the kernel log: "[253489.070080] Number of in use tx queues changed invalidating tc mappings. Priority traffic classification disabled!" But I had no idea what interface was being talked about because this message used pr_warn(). After investigating, it appears we can use the netdev_* helpers already defined to create predictably formatted messages, and that already handle <unknown netdev> cases, in more of the messages in dev.c. After this change, this message (and others) will look like this: "[ 170.181093] ice 0000:3b:00.0 ens785f0: Number of in use tx queues changed invalidating tc mappings. Priority traffic classification disabled!" One goal here was not to change the message significantly from the original format so as to not break user's expectations, so I just changed messages that used pr_* and generally started with %s == dev->name. Signed-off-by: NJesse Brandeburg <jesse.brandeburg@intel.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 15 10月, 2021 3 次提交
-
-
由 Lukas Wunner 提交于
Support classifying packets with netfilter on egress to satisfy user requirements such as: * outbound security policies for containers (Laura) * filtering and mangling intra-node Direct Server Return (DSR) traffic on a load balancer (Laura) * filtering locally generated traffic coming in through AF_PACKET, such as local ARP traffic generated for clustering purposes or DHCP (Laura; the AF_PACKET plumbing is contained in a follow-up commit) * L2 filtering from ingress and egress for AVB (Audio Video Bridging) and gPTP with nftables (Pablo) * in the future: in-kernel NAT64/NAT46 (Pablo) The egress hook introduced herein complements the ingress hook added by commit e687ad60 ("netfilter: add netfilter ingress hook after handle_ing() under unique static key"). A patch for nftables to hook up egress rules from user space has been submitted separately, so users may immediately take advantage of the feature. Alternatively or in addition to netfilter, packets can be classified with traffic control (tc). On ingress, packets are classified first by tc, then by netfilter. On egress, the order is reversed for symmetry. Conceptually, tc and netfilter can be thought of as layers, with netfilter layered above tc. Traffic control is capable of redirecting packets to another interface (man 8 tc-mirred). E.g., an ingress packet may be redirected from the host namespace to a container via a veth connection: tc ingress (host) -> tc egress (veth host) -> tc ingress (veth container) In this case, netfilter egress classifying is not performed when leaving the host namespace! That's because the packet is still on the tc layer. If tc redirects the packet to a physical interface in the host namespace such that it leaves the system, the packet is never subjected to netfilter egress classifying. That is only logical since it hasn't passed through netfilter ingress classifying either. Packets can alternatively be redirected at the netfilter layer using nft fwd. Such a packet *is* subjected to netfilter egress classifying since it has reached the netfilter layer. Internally, the skb->nf_skip_egress flag controls whether netfilter is invoked on egress by __dev_queue_xmit(). Because __dev_queue_xmit() may be called recursively by tunnel drivers such as vxlan, the flag is reverted to false after sch_handle_egress(). This ensures that netfilter is applied both on the overlay and underlying network. Interaction between tc and netfilter is possible by setting and querying skb->mark. If netfilter egress classifying is not enabled on any interface, it is patched out of the data path by way of a static_key and doesn't make a performance difference that is discernible from noise: Before: 1537 1538 1538 1537 1538 1537 Mb/sec After: 1536 1534 1539 1539 1539 1540 Mb/sec Before + tc accept: 1418 1418 1418 1419 1419 1418 Mb/sec After + tc accept: 1419 1424 1418 1419 1422 1420 Mb/sec Before + tc drop: 1620 1619 1619 1619 1620 1620 Mb/sec After + tc drop: 1616 1624 1625 1624 1622 1619 Mb/sec When netfilter egress classifying is enabled on at least one interface, a minimal performance penalty is incurred for every egress packet, even if the interface it's transmitted over doesn't have any netfilter egress rules configured. That is caused by checking dev->nf_hooks_egress against NULL. Measurements were performed on a Core i7-3615QM. Commands to reproduce: ip link add dev foo type dummy ip link set dev foo up modprobe pktgen echo "add_device foo" > /proc/net/pktgen/kpktgend_3 samples/pktgen/pktgen_bench_xmit_mode_queue_xmit.sh -i foo -n 400000000 -m "11:11:11:11:11:11" -d 1.1.1.1 Accept all traffic with tc: tc qdisc add dev foo clsact tc filter add dev foo egress bpf da bytecode '1,6 0 0 0,' Drop all traffic with tc: tc qdisc add dev foo clsact tc filter add dev foo egress bpf da bytecode '1,6 0 0 2,' Apply this patch when measuring packet drops to avoid errors in dmesg: https://lore.kernel.org/netdev/a73dda33-57f4-95d8-ea51-ed483abd6a7a@iogearbox.net/Signed-off-by: NLukas Wunner <lukas@wunner.de> Cc: Laura García Liébana <nevola@gmail.com> Cc: John Fastabend <john.fastabend@gmail.com> Cc: Daniel Borkmann <daniel@iogearbox.net> Cc: Alexei Starovoitov <ast@kernel.org> Cc: Eric Dumazet <edumazet@google.com> Cc: Thomas Graf <tgraf@suug.ch> Signed-off-by: NPablo Neira Ayuso <pablo@netfilter.org>
-
由 Lukas Wunner 提交于
Prepare for addition of a netfilter egress hook by generalizing the ingress hook include file. No functional change intended. Signed-off-by: NLukas Wunner <lukas@wunner.de> Signed-off-by: NPablo Neira Ayuso <pablo@netfilter.org>
-
由 Lukas Wunner 提交于
Prepare for addition of a netfilter egress hook by renaming <linux/netfilter_ingress.h> to <linux/netfilter_netdev.h>. The egress hook also necessitates a refactoring of the include file, but that is done in a separate commit to ease reviewing. No functional change intended. Signed-off-by: NLukas Wunner <lukas@wunner.de> Signed-off-by: NPablo Neira Ayuso <pablo@netfilter.org>
-
- 10 10月, 2021 1 次提交
-
-
由 Antoine Tenart 提交于
Cosmetic commit making dev_get_port_parent_id slightly more readable. There is no need to split the condition to return after calling devlink_compat_switch_id_get and after that 'recurse' is always true. Signed-off-by: NAntoine Tenart <atenart@kernel.org> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 09 10月, 2021 1 次提交
-
-
由 Antoine Tenart 提交于
__dev_get_by_name is currently used to either retrieve a net device reference using its name or to check if a name is already used by a registered net device (per ns). In the later case there is no need to return a reference to a net device. Introduce a new helper, netdev_name_in_use, to check if a name is currently used by a registered net device without leaking a reference the corresponding net device. This helper uses netdev_name_node_lookup instead of __dev_get_by_name as we don't need the extra logic retrieving a reference to the corresponding net device. Signed-off-by: NAntoine Tenart <atenart@kernel.org> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 02 10月, 2021 1 次提交
-
-
由 Gyumin Hwang 提交于
napi_gro_complete always returned the same value, NET_RX_SUCCESS And the value was not used anywhere Signed-off-by: NGyumin Hwang <hkm73560@gmail.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 27 9月, 2021 1 次提交
-
-
由 Jakub Kicinski 提交于
Commit 3765996e ("napi: fix race inside napi_enable") fixed an ordering bug in napi_enable() and made the napi_enable() diverge from napi_disable(). The state transitions done on disable are not symmetric to enable. There is no known bug in napi_disable() this is just refactoring. Eric suggests we can also replace msleep(1) with a more opportunistic usleep_range(). Signed-off-by: NJakub Kicinski <kuba@kernel.org> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 20 9月, 2021 1 次提交
-
-
由 Xuan Zhuo 提交于
The process will cause napi.state to contain NAPI_STATE_SCHED and not in the poll_list, which will cause napi_disable() to get stuck. The prefix "NAPI_STATE_" is removed in the figure below, and NAPI_STATE_HASHED is ignored in napi.state. CPU0 | CPU1 | napi.state =============================================================================== napi_disable() | | SCHED | NPSVC napi_enable() | | { | | smp_mb__before_atomic(); | | clear_bit(SCHED, &n->state); | | NPSVC | napi_schedule_prep() | SCHED | NPSVC | napi_poll() | | napi_complete_done() | | { | | if (n->state & (NPSVC | | (1) | _BUSY_POLL))) | | return false; | | ................ | | } | SCHED | NPSVC | | clear_bit(NPSVC, &n->state); | | SCHED } | | | | napi_schedule_prep() | | SCHED | MISSED (2) (1) Here return direct. Because of NAPI_STATE_NPSVC exists. (2) NAPI_STATE_SCHED exists. So not add napi.poll_list to sd->poll_list Since NAPI_STATE_SCHED already exists and napi is not in the sd->poll_list queue, NAPI_STATE_SCHED cannot be cleared and will always exist. 1. This will cause this queue to no longer receive packets. 2. If you encounter napi_disable under the protection of rtnl_lock, it will cause the entire rtnl_lock to be locked, affecting the overall system. This patch uses cmpxchg to implement napi_enable(), which ensures that there will be no race due to the separation of clear two bits. Fixes: 2d8bff12 ("netpoll: Close race condition between poll_one_napi and napi_disable") Signed-off-by: NXuan Zhuo <xuanzhuo@linux.alibaba.com> Reviewed-by: NDust Li <dust.li@linux.alibaba.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 15 9月, 2021 1 次提交
-
-
由 Jakub Kicinski 提交于
mq / mqprio make the default child qdiscs visible. They only do so for the qdiscs which are within real_num_tx_queues when the device is registered. Depending on order of calls in the driver, or if user space changes config via ethtool -L the number of qdiscs visible under tc qdisc show will differ from the number of queues. This is confusing to users and potentially to system configuration scripts which try to make sure qdiscs have the right parameters. Add a new Qdisc_ops callback and make relevant qdiscs TTRT. Note that this uncovers the "shortcut" created by commit 1f27cde3 ("net: sched: use pfifo_fast for non real queues") The default child qdiscs beyond initial real_num_tx are always pfifo_fast, no matter what the sysfs setting is. Fixing this gets a little tricky because we'd need to keep a reference on whatever the default qdisc was at the time of creation. In practice this is likely an non-issue the qdiscs likely have to be configured to non-default settings, so whatever user space is doing such configuration can replace the pfifos... now that it will see them. Reported-by: NMatthew Massey <matthewmassey@fb.com> Reviewed-by: NDave Taht <dave.taht@gmail.com> Signed-off-by: NJakub Kicinski <kuba@kernel.org> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 14 8月, 2021 1 次提交
-
-
由 Changbin Du 提交于
Replace the obsolete and ambiguos macro in_irq() with new macro in_hardirq(). Signed-off-by: NChangbin Du <changbin.du@gmail.com> Link: https://lore.kernel.org/r/20210813145749.86512-1-changbin.du@gmail.comSigned-off-by: NJakub Kicinski <kuba@kernel.org>
-
- 10 8月, 2021 2 次提交
-
-
由 Jussi Maki 提交于
For the XDP bonding slave lookup to work in the NAPI poll context in which the redudant rcu_read_lock() has been removed we have to follow the same approach as in 694cea39 ("bpf: Allow RCU-protected lookups to happen from bh context") and modify the WARN_ON to also check rcu_read_lock_bh_held(). Signed-off-by: NJussi Maki <joamaki@gmail.com> Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net> Cc: Toke Høiland-Jørgensen <toke@redhat.com> Link: https://lore.kernel.org/bpf/20210731055738.16820-6-joamaki@gmail.com
-
由 Jussi Maki 提交于
This adds the ndo_xdp_get_xmit_slave hook for transforming XDP_TX into XDP_REDIRECT after BPF program run when the ingress device is a bond slave. The dev_xdp_prog_count is exposed so that slave devices can be checked for loaded XDP programs in order to avoid the situation where both bond master and slave have programs loaded according to xdp_state. Signed-off-by: NJussi Maki <joamaki@gmail.com> Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net> Cc: Jay Vosburgh <j.vosburgh@gmail.com> Cc: Veaceslav Falico <vfalico@gmail.com> Cc: Andy Gospodarek <andy@greyhouse.net> Link: https://lore.kernel.org/bpf/20210731055738.16820-3-joamaki@gmail.com
-
- 05 8月, 2021 2 次提交
-
-
由 Yajun Deng 提交于
The 'if (dev)' statement already move into dev_{put , hold}, so remove redundant if statements. Signed-off-by: NYajun Deng <yajun.deng@linux.dev> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
The functions get_online_cpus() and put_online_cpus() have been deprecated during the CPU hotplug rework. They map directly to cpus_read_lock() and cpus_read_unlock(). Replace deprecated CPU-hotplug functions with the official version. The behavior remains unchanged. Signed-off-by: NSebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: NJakub Kicinski <kuba@kernel.org>
-
- 04 8月, 2021 1 次提交
-
-
由 Jakub Kicinski 提交于
netif_set_real_num_rx_queues() and netif_set_real_num_tx_queues() can fail which breaks drivers trying to implement reconfiguration in a way that can't leave the device half-broken. In other words those functions are incompatible with prepare/commit approach. Luckily setting real number of queues can fail only if the number is increased, meaning that if we order operations correctly we can guarantee ending up with either new config (success), or the old one (on error). Provide a helper implementing such logic so that drivers don't have to duplicate it. Signed-off-by: NJakub Kicinski <kuba@kernel.org> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-