- 17 8月, 2020 1 次提交
-
-
由 Jason A. Donenfeld 提交于
When an XDP program changes the ethernet header protocol field, eth_type_trans is used to recalculate skb->protocol. In order for eth_type_trans to work correctly, the ethernet header must actually be part of the skb data segment, so the code first pushes that onto the head of the skb. However, it subsequently forgets to pull it back off, making the behavior of the passed-on packet inconsistent between the protocol modifying case and the static protocol case. This patch fixes the issue by simply pulling the ethernet header back off of the skb head. Fixes: 29724956 ("net: fix generic XDP to handle if eth header was mangled") Cc: Jesper Dangaard Brouer <brouer@redhat.com> Cc: David S. Miller <davem@davemloft.net> Signed-off-by: NJason A. Donenfeld <Jason@zx2c4.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 13 8月, 2020 1 次提交
-
-
由 Andrii Nakryiko 提交于
Enforce XDP_FLAGS_UPDATE_IF_NOEXIST only if new BPF program to be attached is non-NULL (i.e., we are not detaching a BPF program). Fixes: d4baa936 ("bpf, xdp: Extract common XDP program attachment logic") Reported-by: NStanislav Fomichev <sdf@google.com> Signed-off-by: NAndrii Nakryiko <andriin@fb.com> Signed-off-by: NAlexei Starovoitov <ast@kernel.org> Tested-by: NStanislav Fomichev <sdf@google.com> Acked-by: NToke Høiland-Jørgensen <toke@redhat.com> Link: https://lore.kernel.org/bpf/20200812022923.1217922-1-andriin@fb.com
-
- 02 8月, 2020 1 次提交
-
-
由 Andrii Nakryiko 提交于
Add LINK_DETACH command to force-detach bpf_link without destroying it. It has the same behavior as auto-detaching of bpf_link due to cgroup dying for bpf_cgroup_link or net_device being destroyed for bpf_xdp_link. In such case, bpf_link is still a valid kernel object, but is defuncts and doesn't hold BPF program attached to corresponding BPF hook. This functionality allows users with enough access rights to manually force-detach attached bpf_link without killing respective owner process. This patch implements LINK_DETACH for cgroup, xdp, and netns links, mostly re-using existing link release handling code. Signed-off-by: NAndrii Nakryiko <andriin@fb.com> Signed-off-by: NAlexei Starovoitov <ast@kernel.org> Acked-by: NSong Liu <songliubraving@fb.com> Acked-by: NJohn Fastabend <john.fastabend@gmail.com> Link: https://lore.kernel.org/bpf/20200731182830.286260-2-andriin@fb.com
-
- 01 8月, 2020 2 次提交
-
-
由 Roopa Prabhu 提交于
netdev protodown is a mechanism that allows protocols to hold an interface down. It was initially introduced in the kernel to hold links down by a multihoming protocol. There was also an attempt to introduce protodown reason at the time but was rejected. protodown and protodown reason is supported by almost every switching and routing platform. It was ok for a while to live without a protodown reason. But, its become more critical now given more than one protocol may need to keep a link down on a system at the same time. eg: vrrp peer node, port security, multihoming protocol. Its common for Network operators and protocol developers to look for such a reason on a networking box (Its also known as errDisable by most networking operators) This patch adds support for link protodown reason attribute. There are two ways to maintain protodown reasons. (a) enumerate every possible reason code in kernel - A protocol developer has to make a request and have that appear in a certain kernel version (b) provide the bits in the kernel, and allow user-space (sysadmin or NOS distributions) to manage the bit-to-reasonname map. - This makes extending reason codes easier (kind of like the iproute2 table to vrf-name map /etc/iproute2/rt_tables.d/) This patch takes approach (b). a few things about the patch: - It treats the protodown reason bits as counter to indicate active protodown users - Since protodown attribute is already an exposed UAPI, the reason is not enforced on a protodown set. Its a no-op if not used. the patch follows the below algorithm: - presence of reason bits set indicates protodown is in use - user can set protodown and protodown reason in a single or multiple setlink operations - setlink operation to clear protodown, will return -EBUSY if there are active protodown reason bits - reason is not included in link dumps if not used example with patched iproute2: $cat /etc/iproute2/protodown_reasons.d/r.conf 0 mlag 1 evpn 2 vrrp 3 psecurity $ip link set dev vxlan0 protodown on protodown_reason vrrp on $ip link set dev vxlan0 protodown_reason mlag on $ip link show 14: vxlan0: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000 link/ether f6:06:be:17:91:e7 brd ff:ff:ff:ff:ff:ff protodown on <mlag,vrrp> $ip link set dev vxlan0 protodown_reason mlag off $ip link set dev vxlan0 protodown off protodown_reason vrrp off Signed-off-by: NRoopa Prabhu <roopa@cumulusnetworks.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Miaohe Lin 提交于
When we don't care about vlan depth, we could pass NULL instead of the address of a unused local variable to skb_network_protocol() as a param. Signed-off-by: NMiaohe Lin <linmiaohe@huawei.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 26 7月, 2020 6 次提交
-
-
由 Andrii Nakryiko 提交于
Now that BPF program/link management is centralized in generic net_device code, kernel code never queries program id from drivers, so XDP_QUERY_PROG/XDP_QUERY_PROG_HW commands are unnecessary. This patch removes all the implementations of those commands in kernel, along the xdp_attachment_query(). This patch was compile-tested on allyesconfig. Signed-off-by: NAndrii Nakryiko <andriin@fb.com> Signed-off-by: NAlexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/20200722064603.3350758-10-andriin@fb.com
-
由 Andrii Nakryiko 提交于
Implement XDP link-specific show_fdinfo and link_info to emit ifindex. Signed-off-by: NAndrii Nakryiko <andriin@fb.com> Signed-off-by: NAlexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/20200722064603.3350758-7-andriin@fb.com
-
由 Andrii Nakryiko 提交于
Add support for LINK_UPDATE command for BPF XDP link to enable reliable replacement of underlying BPF program. Signed-off-by: NAndrii Nakryiko <andriin@fb.com> Signed-off-by: NAlexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/20200722064603.3350758-6-andriin@fb.com
-
由 Andrii Nakryiko 提交于
Add bpf_link-based API (bpf_xdp_link) to attach BPF XDP program through BPF_LINK_CREATE command. bpf_xdp_link is mutually exclusive with direct BPF program attachment, previous BPF program should be detached prior to attempting to create a new bpf_xdp_link attachment (for a given XDP mode). Once BPF link is attached, it can't be replaced by other BPF program attachment or link attachment. It will be detached only when the last BPF link FD is closed. bpf_xdp_link will be auto-detached when net_device is shutdown, similarly to how other BPF links behave (cgroup, flow_dissector). At that point bpf_link will become defunct, but won't be destroyed until last FD is closed. Signed-off-by: NAndrii Nakryiko <andriin@fb.com> Signed-off-by: NAlexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/20200722064603.3350758-5-andriin@fb.com
-
由 Andrii Nakryiko 提交于
Further refactor XDP attachment code. dev_change_xdp_fd() is split into two parts: getting bpf_progs from FDs and attachment logic, working with bpf_progs. This makes attachment logic a bit more straightforward and prepares code for bpf_xdp_link inclusion, which will share the common logic. Signed-off-by: NAndrii Nakryiko <andriin@fb.com> Signed-off-by: NAlexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/20200722064603.3350758-4-andriin@fb.com
-
由 Andrii Nakryiko 提交于
Instead of delegating to drivers, maintain information about which BPF programs are attached in which XDP modes (generic/skb, driver, or hardware) locally in net_device. This effectively obsoletes XDP_QUERY_PROG command. Such re-organization simplifies existing code already. But it also allows to further add bpf_link-based XDP attachments without drivers having to know about any of this at all, which seems like a good setup. XDP_SETUP_PROG/XDP_SETUP_PROG_HW are just low-level commands to driver to install/uninstall active BPF program. All the higher-level concerns about prog/link interaction will be contained within generic driver-agnostic logic. All the XDP_QUERY_PROG calls to driver in dev_xdp_uninstall() were removed. It's not clear for me why dev_xdp_uninstall() were passing previous prog_flags when resetting installed programs. That seems unnecessary, plus most drivers don't populate prog_flags anyways. Having XDP_SETUP_PROG vs XDP_SETUP_PROG_HW should be enough of an indicator of what is required of driver to correctly reset active BPF program. dev_xdp_uninstall() is also generalized as an iteration over all three supported mode. Signed-off-by: NAndrii Nakryiko <andriin@fb.com> Signed-off-by: NAlexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/20200722064603.3350758-3-andriin@fb.com
-
- 25 7月, 2020 1 次提交
-
-
IRQs are disabled when freeing skbs in input queue. Use the IRQ safe variant to free skbs here. Fixes: 145dd5f9 ("net: flush the softnet backlog in process context") Signed-off-by: NSubash Abhinov Kasiviswanathan <subashab@codeaurora.org> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 24 7月, 2020 1 次提交
-
-
由 Vladimir Oltean 提交于
The purpose of this override is to give the user an indication of what the number of the CPU port is (in DSA, the CPU port is a hardware implementation detail and not a network interface capable of traffic). However, it has always failed (by design) at providing this information to the user in a reliable fashion. Prior to commit 3369afba ("net: Call into DSA netdevice_ops wrappers"), the behavior was to only override this callback if it was not provided by the DSA master. That was its first failure: if the DSA master itself was a DSA port or a switchdev, then the user would not see the number of the CPU port in /sys/class/net/eth0/phys_port_name, but the number of the DSA master port within its respective physical switch. But that was actually ok in a way. The commit mentioned above changed that behavior, and now overrides the master's ndo_get_phys_port_name unconditionally. That comes with problems of its own, which are worse in a way. The idea is that it's typical for switchdev users to have udev rules for consistent interface naming. These are based, among other things, on the phys_port_name attribute. If we let the DSA switch at the bottom to start randomly overriding ndo_get_phys_port_name with its own CPU port, we basically lose any predictability in interface naming, or even uniqueness, for that matter. So, there are reasons to let DSA override the master's callback (to provide a consistent interface, a number which has a clear meaning and must not be interpreted according to context), and there are reasons to not let DSA override it (it breaks udev matching for the DSA master). But, there is an alternative method for users to retrieve the number of the CPU port of each DSA switch in the system: $ devlink port pci/0000:00:00.5/0: type eth netdev swp0 flavour physical port 0 pci/0000:00:00.5/2: type eth netdev swp2 flavour physical port 2 pci/0000:00:00.5/4: type notset flavour cpu port 4 spi/spi2.0/0: type eth netdev sw0p0 flavour physical port 0 spi/spi2.0/1: type eth netdev sw0p1 flavour physical port 1 spi/spi2.0/2: type eth netdev sw0p2 flavour physical port 2 spi/spi2.0/4: type notset flavour cpu port 4 spi/spi2.1/0: type eth netdev sw1p0 flavour physical port 0 spi/spi2.1/1: type eth netdev sw1p1 flavour physical port 1 spi/spi2.1/2: type eth netdev sw1p2 flavour physical port 2 spi/spi2.1/3: type eth netdev sw1p3 flavour physical port 3 spi/spi2.1/4: type notset flavour cpu port 4 So remove this duplicated, unreliable and troublesome method. From this patch on, the phys_port_name attribute of the DSA master will only contain information about itself (if at all). If the users need reliable information about the CPU port they're probably using devlink anyway. Signed-off-by: NVladimir Oltean <olteanv@gmail.com> Acked-by: Nflorian Fainelli <f.fainelli@gmail.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 21 7月, 2020 1 次提交
-
-
由 Florian Fainelli 提交于
Make the core net_device code call into our ndo_do_ioctl() and ndo_get_phys_port_name() functions via the wrappers defined previously Signed-off-by: NFlorian Fainelli <f.fainelli@gmail.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 17 7月, 2020 1 次提交
-
-
由 Petr Machata 提交于
This reverts commit aebe4426. Signed-off-by: NPetr Machata <petrm@mellanox.com> Signed-off-by: NJakub Kicinski <kuba@kernel.org>
-
- 16 7月, 2020 1 次提交
-
-
由 Lorenzo Bianconi 提交于
Introduce the capability to attach an eBPF program to cpumap entries. The idea behind this feature is to add the possibility to define on which CPU run the eBPF program if the underlying hw does not support RSS. Current supported verdicts are XDP_DROP and XDP_PASS. This patch has been tested on Marvell ESPRESSObin using xdp_redirect_cpu sample available in the kernel tree to identify possible performance regressions. Results show there are no observable differences in packet-per-second: $./xdp_redirect_cpu --progname xdp_cpu_map0 --dev eth0 --cpu 1 rx: 354.8 Kpps rx: 356.0 Kpps rx: 356.8 Kpps rx: 356.3 Kpps rx: 356.6 Kpps rx: 356.6 Kpps rx: 356.7 Kpps rx: 355.8 Kpps rx: 356.8 Kpps rx: 356.8 Kpps Co-developed-by: NJesper Dangaard Brouer <brouer@redhat.com> Signed-off-by: NJesper Dangaard Brouer <brouer@redhat.com> Signed-off-by: NLorenzo Bianconi <lorenzo@kernel.org> Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net> Acked-by: NJesper Dangaard Brouer <brouer@redhat.com> Link: https://lore.kernel.org/bpf/5c9febdf903d810b3415732e5cd98491d7d9067a.1594734381.git.lorenzo@kernel.org
-
- 14 7月, 2020 2 次提交
-
-
由 Wei Yongjun 提交于
The sparse tool complains as follows: net/core/dev.c:5594:1: warning: symbol '__pcpu_scope_flush_works' was not declared. Should it be static? 'flush_works' is not used outside of dev.c, so marks it static. Fixes: 41852497 ("net: batch calls to flush_all_backlogs()") Reported-by: NHulk Robot <hulkci@huawei.com> Signed-off-by: NWei Yongjun <weiyongjun1@huawei.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Andrew Lunn 提交于
Simple fixes which require no deep knowledge of the code. Signed-off-by: NAndrew Lunn <andrew@lunn.ch> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 30 6月, 2020 1 次提交
-
-
由 Petr Machata 提交于
A following patch introduces qevents, points in qdisc algorithm where packet can be processed by user-defined filters. Should this processing lead to a situation where a new packet is to be enqueued on the same port, holding the root lock would lead to deadlocks. To solve the issue, qevent handler needs to unlock and relock the root lock when necessary. To that end, add the root lock argument to the qdisc op enqueue, and propagate throughout. Signed-off-by: NPetr Machata <petrm@mellanox.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 27 6月, 2020 1 次提交
-
-
由 Mauro Carvalho Chehab 提交于
The dev argument was not listed at kernel-doc markup: ./net/core/dev.c:7878: warning: Function parameter or member 'dev' not described in 'netdev_get_xmit_slave' Signed-off-by: NMauro Carvalho Chehab <mchehab+huawei@kernel.org> Link: https://lore.kernel.org/r/6e22325bb9bd4cc2249c3768b0e3ad75933445f8.1592895969.git.mchehab+huawei@kernel.orgSigned-off-by: NJonathan Corbet <corbet@lwn.net>
-
- 23 6月, 2020 1 次提交
-
-
由 Heiner Kallweit 提交于
A netdevice may be marked as detached because the parent is runtime-suspended and not accessible whilst interface or link is down. An example are PCI network devices that go into PCI D3hot, see e.g. __igc_shutdown() or rtl8169_net_suspend(). If netdevice is down and marked as detached we can only open it if we runtime-resume it before __dev_open() calls netif_device_present(). Therefore, if netdevice is detached, try to runtime-resume the parent and only return with an error if it's still detached. Signed-off-by: NHeiner Kallweit <hkallweit1@gmail.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 19 6月, 2020 3 次提交
-
-
由 Eric Dumazet 提交于
Back in commit f60e5990 ("ipv6: protect skb->sk accesses from recursive dereference inside the stack") Hannes added code so that IPv6 stack would not trust skb->sk for typical cases where packet goes through 'standard' xmit path (__dev_queue_xmit()) Alas af_packet had a dev_direct_xmit() path that was not dealing yet with xmit_recursion level. Also change sk_mc_loop() to dump a stack once only. Without this patch, syzbot was able to trigger : [1] [ 153.567378] WARNING: CPU: 7 PID: 11273 at net/core/sock.c:721 sk_mc_loop+0x51/0x70 [ 153.567378] Modules linked in: nfnetlink ip6table_raw ip6table_filter iptable_raw iptable_nat nf_nat nf_conntrack nf_defrag_ipv4 nf_defrag_ipv6 iptable_filter macsec macvtap tap macvlan 8021q hsr wireguard libblake2s blake2s_x86_64 libblake2s_generic udp_tunnel ip6_udp_tunnel libchacha20poly1305 poly1305_x86_64 chacha_x86_64 libchacha curve25519_x86_64 libcurve25519_generic netdevsim batman_adv dummy team bridge stp llc w1_therm wire i2c_mux_pca954x i2c_mux cdc_acm ehci_pci ehci_hcd mlx4_en mlx4_ib ib_uverbs ib_core mlx4_core [ 153.567386] CPU: 7 PID: 11273 Comm: b159172088 Not tainted 5.8.0-smp-DEV #273 [ 153.567387] RIP: 0010:sk_mc_loop+0x51/0x70 [ 153.567388] Code: 66 83 f8 0a 75 24 0f b6 4f 12 b8 01 00 00 00 31 d2 d3 e0 a9 bf ef ff ff 74 07 48 8b 97 f0 02 00 00 0f b6 42 3a 83 e0 01 5d c3 <0f> 0b b8 01 00 00 00 5d c3 0f b6 87 18 03 00 00 5d c0 e8 04 83 e0 [ 153.567388] RSP: 0018:ffff95c69bb93990 EFLAGS: 00010212 [ 153.567388] RAX: 0000000000000011 RBX: ffff95c6e0ee3e00 RCX: 0000000000000007 [ 153.567389] RDX: ffff95c69ae50000 RSI: ffff95c6c30c3000 RDI: ffff95c6c30c3000 [ 153.567389] RBP: ffff95c69bb93990 R08: ffff95c69a77f000 R09: 0000000000000008 [ 153.567389] R10: 0000000000000040 R11: 00003e0e00026128 R12: ffff95c6c30c3000 [ 153.567390] R13: ffff95c6cc4fd500 R14: ffff95c6f84500c0 R15: ffff95c69aa13c00 [ 153.567390] FS: 00007fdc3a283700(0000) GS:ffff95c6ff9c0000(0000) knlGS:0000000000000000 [ 153.567390] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 153.567391] CR2: 00007ffee758e890 CR3: 0000001f9ba20003 CR4: 00000000001606e0 [ 153.567391] Call Trace: [ 153.567391] ip6_finish_output2+0x34e/0x550 [ 153.567391] __ip6_finish_output+0xe7/0x110 [ 153.567391] ip6_finish_output+0x2d/0xb0 [ 153.567392] ip6_output+0x77/0x120 [ 153.567392] ? __ip6_finish_output+0x110/0x110 [ 153.567392] ip6_local_out+0x3d/0x50 [ 153.567392] ipvlan_queue_xmit+0x56c/0x5e0 [ 153.567393] ? ksize+0x19/0x30 [ 153.567393] ipvlan_start_xmit+0x18/0x50 [ 153.567393] dev_direct_xmit+0xf3/0x1c0 [ 153.567393] packet_direct_xmit+0x69/0xa0 [ 153.567394] packet_sendmsg+0xbf0/0x19b0 [ 153.567394] ? plist_del+0x62/0xb0 [ 153.567394] sock_sendmsg+0x65/0x70 [ 153.567394] sock_write_iter+0x93/0xf0 [ 153.567394] new_sync_write+0x18e/0x1a0 [ 153.567395] __vfs_write+0x29/0x40 [ 153.567395] vfs_write+0xb9/0x1b0 [ 153.567395] ksys_write+0xb1/0xe0 [ 153.567395] __x64_sys_write+0x1a/0x20 [ 153.567395] do_syscall_64+0x43/0x70 [ 153.567396] entry_SYSCALL_64_after_hwframe+0x44/0xa9 [ 153.567396] RIP: 0033:0x453549 [ 153.567396] Code: Bad RIP value. [ 153.567396] RSP: 002b:00007fdc3a282cc8 EFLAGS: 00000246 ORIG_RAX: 0000000000000001 [ 153.567397] RAX: ffffffffffffffda RBX: 00000000004d32d0 RCX: 0000000000453549 [ 153.567397] RDX: 0000000000000020 RSI: 0000000020000300 RDI: 0000000000000003 [ 153.567398] RBP: 00000000004d32d8 R08: 0000000000000000 R09: 0000000000000000 [ 153.567398] R10: 0000000000000000 R11: 0000000000000246 R12: 00000000004d32dc [ 153.567398] R13: 00007ffee742260f R14: 00007fdc3a282dc0 R15: 00007fdc3a283700 [ 153.567399] ---[ end trace c1d5ae2b1059ec62 ]--- f60e5990 ("ipv6: protect skb->sk accesses from recursive dereference inside the stack") Signed-off-by: NEric Dumazet <edumazet@google.com> Reported-by: Nsyzbot <syzkaller@googlegroups.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Eric Dumazet 提交于
Whenever a buggy NAPI driver returns more than its budget, we emit a stack trace that is of no use, since it does not tell which driver is buggy. Instead, emit a message giving the function name, and a descriptive message. Signed-off-by: NEric Dumazet <edumazet@google.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Yang Yingliang 提交于
I got a memleak report when doing some fuzz test: unreferenced object 0xffff888112584000 (size 13599): comm "ip", pid 3048, jiffies 4294911734 (age 343.491s) hex dump (first 32 bytes): 74 61 70 30 00 00 00 00 00 00 00 00 00 00 00 00 tap0............ 00 ee d9 19 81 88 ff ff 00 00 00 00 00 00 00 00 ................ backtrace: [<000000002f60ba65>] __kmalloc_node+0x309/0x3a0 [<0000000075b211ec>] kvmalloc_node+0x7f/0xc0 [<00000000d3a97396>] alloc_netdev_mqs+0x76/0xfc0 [<00000000609c3655>] __tun_chr_ioctl+0x1456/0x3d70 [<000000001127ca24>] ksys_ioctl+0xe5/0x130 [<00000000b7d5e66a>] __x64_sys_ioctl+0x6f/0xb0 [<00000000e1023498>] do_syscall_64+0x56/0xa0 [<000000009ec0eb12>] entry_SYSCALL_64_after_hwframe+0x44/0xa9 unreferenced object 0xffff888111845cc0 (size 8): comm "ip", pid 3048, jiffies 4294911734 (age 343.491s) hex dump (first 8 bytes): 74 61 70 30 00 88 ff ff tap0.... backtrace: [<000000004c159777>] kstrdup+0x35/0x70 [<00000000d8b496ad>] kstrdup_const+0x3d/0x50 [<00000000494e884a>] kvasprintf_const+0xf1/0x180 [<0000000097880a2b>] kobject_set_name_vargs+0x56/0x140 [<000000008fbdfc7b>] dev_set_name+0xab/0xe0 [<000000005b99e3b4>] netdev_register_kobject+0xc0/0x390 [<00000000602704fe>] register_netdevice+0xb61/0x1250 [<000000002b7ca244>] __tun_chr_ioctl+0x1cd1/0x3d70 [<000000001127ca24>] ksys_ioctl+0xe5/0x130 [<00000000b7d5e66a>] __x64_sys_ioctl+0x6f/0xb0 [<00000000e1023498>] do_syscall_64+0x56/0xa0 [<000000009ec0eb12>] entry_SYSCALL_64_after_hwframe+0x44/0xa9 unreferenced object 0xffff88811886d800 (size 512): comm "ip", pid 3048, jiffies 4294911734 (age 343.491s) hex dump (first 32 bytes): 00 00 00 00 ad 4e ad de ff ff ff ff 00 00 00 00 .....N.......... ff ff ff ff ff ff ff ff c0 66 3d a3 ff ff ff ff .........f=..... backtrace: [<0000000050315800>] device_add+0x61e/0x1950 [<0000000021008dfb>] netdev_register_kobject+0x17e/0x390 [<00000000602704fe>] register_netdevice+0xb61/0x1250 [<000000002b7ca244>] __tun_chr_ioctl+0x1cd1/0x3d70 [<000000001127ca24>] ksys_ioctl+0xe5/0x130 [<00000000b7d5e66a>] __x64_sys_ioctl+0x6f/0xb0 [<00000000e1023498>] do_syscall_64+0x56/0xa0 [<000000009ec0eb12>] entry_SYSCALL_64_after_hwframe+0x44/0xa9 If call_netdevice_notifiers() failed, then rollback_registered() calls netdev_unregister_kobject() which holds the kobject. The reference cannot be put because the netdev won't be add to todo list, so it will leads a memleak, we need put the reference to avoid memleak. Reported-by: NHulk Robot <hulkci@huawei.com> Signed-off-by: NYang Yingliang <yangyingliang@huawei.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 10 6月, 2020 1 次提交
-
-
由 Cong Wang 提交于
The dynamic key update for addr_list_lock still causes troubles, for example the following race condition still exists: CPU 0: CPU 1: (RCU read lock) (RTNL lock) dev_mc_seq_show() netdev_update_lockdep_key() -> lockdep_unregister_key() -> netif_addr_lock_bh() because lockdep doesn't provide an API to update it atomically. Therefore, we have to move it back to static keys and use subclass for nest locking like before. In commit 1a33e10e ("net: partially revert dynamic lockdep key changes"), I already reverted most parts of commit ab92d68f ("net: core: add generic lockdep keys"). This patch reverts the rest and also part of commit f3b0a18b ("net: remove unnecessary variables and callback"). After this patch, addr_list_lock changes back to using static keys and subclasses to satisfy lockdep. Thanks to dev->lower_level, we do not have to change back to ->ndo_get_lock_subclass(). And hopefully this reduces some syzbot lockdep noises too. Reported-by: syzbot+f3a0e80c34b3fc28ac5e@syzkaller.appspotmail.com Cc: Taehee Yoo <ap420073@gmail.com> Cc: Dmitry Vyukov <dvyukov@google.com> Signed-off-by: NCong Wang <xiyou.wangcong@gmail.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 05 6月, 2020 1 次提交
-
-
由 Ahmed S. Darwish 提交于
Sequence counters write paths are critical sections that must never be preempted, and blocking, even for CONFIG_PREEMPTION=n, is not allowed. Commit 5dbe7c17 ("net: fix kernel deadlock with interface rename and netdev name retrieval.") handled a deadlock, observed with CONFIG_PREEMPTION=n, where the devnet_rename seqcount read side was infinitely spinning: it got scheduled after the seqcount write side blocked inside its own critical section. To fix that deadlock, among other issues, the commit added a cond_resched() inside the read side section. While this will get the non-preemptible kernel eventually unstuck, the seqcount reader is fully exhausting its slice just spinning -- until TIF_NEED_RESCHED is set. The fix is also still broken: if the seqcount reader belongs to a real-time scheduling policy, it can spin forever and the kernel will livelock. Disabling preemption over the seqcount write side critical section will not work: inside it are a number of GFP_KERNEL allocations and mutex locking through the drivers/base/ :: device_rename() call chain. >From all the above, replace the seqcount with a rwsem. Fixes: 5dbe7c17 (net: fix kernel deadlock with interface rename and netdev name retrieval.) Fixes: 30e6c9fa (net: devnet_rename_seq should be a seqcount) Fixes: c91f6df2 (sockopt: Change getsockopt() of SO_BINDTODEVICE to return an interface name) Cc: <stable@vger.kernel.org> Reported-by: kbuild test robot <lkp@intel.com> [ v1 missing up_read() on error exit ] Reported-by: Dan Carpenter <dan.carpenter@oracle.com> [ v1 missing up_read() on error exit ] Signed-off-by: NAhmed S. Darwish <a.darwish@linutronix.de> Reviewed-by: NSebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 02 6月, 2020 1 次提交
-
-
由 David Ahern 提交于
Add BPF_XDP_DEVMAP attach type for use with programs associated with a DEVMAP entry. Allow DEVMAPs to associate a program with a device entry by adding a bpf_prog.fd to 'struct bpf_devmap_val'. Values read show the program id, so the fd and id are a union. bpf programs can get access to the struct via vmlinux.h. The program associated with the fd must have type XDP with expected attach type BPF_XDP_DEVMAP. When a program is associated with a device index, the program is run on an XDP_REDIRECT and before the buffer is added to the per-cpu queue. At this point rxq data is still valid; the next patch adds tx device information allowing the prorgam to see both ingress and egress device indices. XDP generic is skb based and XDP programs do not work with skb's. Block the use case by walking maps used by a program that is to be attached via xdpgeneric and fail if any of them are DEVMAP / DEVMAP_HASH with Block attach of BPF_XDP_DEVMAP programs to devices. Signed-off-by: NDavid Ahern <dsahern@kernel.org> Signed-off-by: NAlexei Starovoitov <ast@kernel.org> Acked-by: NToke Høiland-Jørgensen <toke@redhat.com> Link: https://lore.kernel.org/bpf/20200529220716.75383-3-dsahern@kernel.orgSigned-off-by: NAlexei Starovoitov <ast@kernel.org>
-
- 20 5月, 2020 1 次提交
-
-
由 Boris Sukholitko 提交于
__netif_receive_skb_core may change the skb pointer passed into it (e.g. in rx_handler). The original skb may be freed as a result of this operation. The callers of __netif_receive_skb_core may further process original skb by using pt_prev pointer returned by __netif_receive_skb_core thus leading to unpleasant effects. The solution is to pass skb by reference into __netif_receive_skb_core. v2: Added Fixes tag and comment regarding ppt_prev and skb invariant. Fixes: 88eb1944 ("net: core: propagate SKB lists through packet_type lookup") Signed-off-by: NBoris Sukholitko <boris.sukholitko@broadcom.com> Acked-by: NEdward Cree <ecree@solarflare.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 15 5月, 2020 1 次提交
-
-
由 Jesper Dangaard Brouer 提交于
The SKB "head" pointer points to the data area that contains skb_shared_info, that can be found via skb_end_pointer(). Given xdp->data_hard_start have been established (basically pointing to skb->head), frame size is between skb_end_pointer() and data_hard_start, plus the size reserved to skb_shared_info. Change the bpf_xdp_adjust_tail offset adjust of skb->len, to be a positive offset number on grow, and negative number on shrink. As this seems more natural when reading the code. Signed-off-by: NJesper Dangaard Brouer <brouer@redhat.com> Signed-off-by: NAlexei Starovoitov <ast@kernel.org> Acked-by: NToke Høiland-Jørgensen <toke@redhat.com> Link: https://lore.kernel.org/bpf/158945336804.97035.7164852191163722056.stgit@firesoul
-
- 08 5月, 2020 1 次提交
-
-
由 Cong Wang 提交于
syzbot managed to trigger a recursive NETDEV_FEAT_CHANGE event between bonding master and slave. I managed to find a reproducer for this: ip li set bond0 up ifenslave bond0 eth0 brctl addbr br0 ethtool -K eth0 lro off brctl addif br0 bond0 ip li set br0 up When a NETDEV_FEAT_CHANGE event is triggered on a bonding slave, it captures this and calls bond_compute_features() to fixup its master's and other slaves' features. However, when syncing with its lower devices by netdev_sync_lower_features() this event is triggered again on slaves when the LRO feature fails to change, so it goes back and forth recursively until the kernel stack is exhausted. Commit 17b85d29 intentionally lets __netdev_update_features() return -1 for such a failure case, so we have to just rely on the existing check inside netdev_sync_lower_features() and skip NETDEV_FEAT_CHANGE event only for this specific failure case. Fixes: fd867d51 ("net/core: generic support for disabling netdev features down stack") Reported-by: syzbot+e73ceacfd8560cc8a3ca@syzkaller.appspotmail.com Reported-by: syzbot+c2fb6f9ddcea95ba49b5@syzkaller.appspotmail.com Cc: Jarod Wilson <jarod@redhat.com> Cc: Nikolay Aleksandrov <nikolay@cumulusnetworks.com> Cc: Josh Poimboeuf <jpoimboe@redhat.com> Cc: Jann Horn <jannh@google.com> Reviewed-by: NJay Vosburgh <jay.vosburgh@canonical.com> Signed-off-by: NCong Wang <xiyou.wangcong@gmail.com> Acked-by: NNikolay Aleksandrov <nikolay@cumulusnetworks.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 05 5月, 2020 1 次提交
-
-
由 Cong Wang 提交于
This patch reverts the folowing commits: commit 064ff66e "bonding: add missing netdev_update_lockdep_key()" commit 53d37497 "net: avoid updating qdisc_xmit_lock_key in netdev_update_lockdep_key()" commit 1f26c0d3 "net: fix kernel-doc warning in <linux/netdevice.h>" commit ab92d68f "net: core: add generic lockdep keys" but keeps the addr_list_lock_key because we still lock addr_list_lock nestedly on stack devices, unlikely xmit_lock this is safe because we don't take addr_list_lock on any fast path. Reported-and-tested-by: syzbot+aaa6fa4949cc5d9b7b25@syzkaller.appspotmail.com Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Taehee Yoo <ap420073@gmail.com> Signed-off-by: NCong Wang <xiyou.wangcong@gmail.com> Acked-by: NTaehee Yoo <ap420073@gmail.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 02 5月, 2020 1 次提交
-
-
由 Maor Gottlieb 提交于
Add new ndo to get the xmit slave of master device. The reference counters are not incremented so the caller must be careful with locks. User can ask to get the xmit slave assume all the slaves can transmit by set all_slaves arg to true. Signed-off-by: NMaor Gottlieb <maorg@mellanox.com> Reviewed-by: NJiri Pirko <jiri@mellanox.com> Reviewed-by: NDavid Ahern <dsahern@gmail.com> Acked-by: NDavid S. Miller <davem@davemloft.net> Signed-off-by: NSaeed Mahameed <saeedm@mellanox.com>
-
- 24 4月, 2020 2 次提交
-
-
由 Eric Dumazet 提交于
gro_flush_timeout and napi_defer_hard_irqs can be read from napi_complete_done() while other cpus write the value, whithout explicit synchronization. Use READ_ONCE()/WRITE_ONCE() to annotate the races. Signed-off-by: NEric Dumazet <edumazet@google.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Eric Dumazet 提交于
Back in commit 3b47d303 ("net: gro: add a per device gro flush timer") we added the ability to arm one high resolution timer, that we used to keep not-complete packets in GRO engine a bit longer, hoping that further frames might be added to them. Since then, we added the napi_complete_done() interface, and commit 364b6055 ("net: busy-poll: return busypolling status to drivers") allowed drivers to avoid re-arming NIC interrupts if we made a promise that their NAPI poll() handler would be called in the near future. This infrastructure can be leveraged, thanks to a new device parameter, which allows to arm the napi hrtimer, instead of re-arming the device hard IRQ. We have noticed that on some servers with 32 RX queues or more, the chit-chat between the NIC and the host caused by IRQ delivery and re-arming could hurt throughput by ~20% on 100Gbit NIC. In contrast, hrtimers are using local (percpu) resources and might have lower cost. The new tunable, named napi_defer_hard_irqs, is placed in the same hierarchy than gro_flush_timeout (/sys/class/net/ethX/) By default, both gro_flush_timeout and napi_defer_hard_irqs are zero. This patch does not change the prior behavior of gro_flush_timeout if used alone : NIC hard irqs should be rearmed as before. One concrete usage can be : echo 20000 >/sys/class/net/eth1/gro_flush_timeout echo 10 >/sys/class/net/eth1/napi_defer_hard_irqs If at least one packet is retired, then we will reset napi counter to 10 (napi_defer_hard_irqs), ensuring at least 10 periodic scans of the queue. On busy queues, this should avoid NIC hard IRQ, while before this patch IRQ avoidance was only possible if napi->poll() was exhausting its budget and not call napi_complete_done(). This feature also can be used to work around some non-optimal NIC irq coalescing strategies. Having the ability to insert XX usec delays between each napi->poll() can increase cache efficiency, since we increase batch sizes. It also keeps serving cpus not idle too long, reducing tail latencies. Co-developed-by: NLuigi Rizzo <lrizzo@google.com> Signed-off-by: NEric Dumazet <edumazet@google.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 21 4月, 2020 1 次提交
-
-
由 Andrew Lunn 提交于
RFC 2863 defines the operational state testing. Add support for this state, both as a IF_LINK_MODE_ and __LINK_STATE_. Signed-off-by: NAndrew Lunn <andrew@lunn.ch> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 15 4月, 2020 1 次提交
-
-
由 David Ahern 提交于
The commit mentioned in the Fixes tag reuses the local prog variable when looking up an expected_fd. The variable is not reset when fd < 0 causing a detach with the expected_fd set to actually call dev_xdp_install for the existing program. The end result is that the detach does not happen. Fixes: 92234c8f ("xdp: Support specifying expected existing program when attaching XDP") Signed-off-by: NDavid Ahern <dsahern@gmail.com> Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net> Reviewed-by: NJakub Kicinski <kuba@kernel.org> Reviewed-by: NToke Høiland-Jørgensen <toke@redhat.com> Link: https://lore.kernel.org/bpf/20200412133204.43847-1-dsahern@kernel.org
-
- 08 4月, 2020 1 次提交
-
-
由 Konstantin Khlebnikov 提交于
For HZ < 1000 timeout 2000us rounds up to 1 jiffy but expires randomly because next timer interrupt could come shortly after starting softirq. For commonly used CONFIG_HZ=1000 nothing changes. Fixes: 7acf8a1e ("Replace 2 jiffies with sysctl netdev_budget_usecs to enable softirq tuning") Reported-by: NDmitry Yakunin <zeil@yandex-team.ru> Signed-off-by: NKonstantin Khlebnikov <khlebnikov@yandex-team.ru> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 30 3月, 2020 1 次提交
-
-
由 Cambda Zhu 提交于
The SKB_SGO_CB_OFFSET should be SKB_GSO_CB_OFFSET which means the offset of the GSO in skb cb. This patch fixes the typo. Fixes: 9207f9d4 ("net: preserve IP control block during GSO segmentation") Signed-off-by: NCambda Zhu <cambda@linux.alibaba.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 29 3月, 2020 1 次提交
-
-
由 Toke Høiland-Jørgensen 提交于
While it is currently possible for userspace to specify that an existing XDP program should not be replaced when attaching to an interface, there is no mechanism to safely replace a specific XDP program with another. This patch adds a new netlink attribute, IFLA_XDP_EXPECTED_FD, which can be set along with IFLA_XDP_FD. If set, the kernel will check that the program currently loaded on the interface matches the expected one, and fail the operation if it does not. This corresponds to a 'cmpxchg' memory operation. Setting the new attribute with a negative value means that no program is expected to be attached, which corresponds to setting the UPDATE_IF_NOEXIST flag. A new companion flag, XDP_FLAGS_REPLACE, is also added to explicitly request checking of the EXPECTED_FD attribute. This is needed for userspace to discover whether the kernel supports the new attribute. Signed-off-by: NToke Høiland-Jørgensen <toke@redhat.com> Signed-off-by: NAlexei Starovoitov <ast@kernel.org> Reviewed-by: NJakub Kicinski <kuba@kernel.org> Link: https://lore.kernel.org/bpf/158515700640.92963.3551295145441017022.stgit@toke.dk
-
- 26 3月, 2020 1 次提交
-
-
由 Pablo Neira Ayuso 提交于
net/netfilter/nft_fwd_netdev.c: In function ‘nft_fwd_netdev_eval’: net/netfilter/nft_fwd_netdev.c:32:10: error: ‘struct sk_buff’ has no member named ‘tc_redirected’ pkt->skb->tc_redirected = 1; ^~ net/netfilter/nft_fwd_netdev.c:33:10: error: ‘struct sk_buff’ has no member named ‘tc_from_ingress’ pkt->skb->tc_from_ingress = 1; ^~ To avoid a direct dependency with tc actions from netfilter, wrap the redirect bits around CONFIG_NET_REDIRECT and move helpers to include/linux/skbuff.h. Turn on this toggle from the ifb driver, the only existing client of these bits in the tree. This patch adds skb_set_redirected() that sets on the redirected bit on the skbuff, it specifies if the packet was redirect from ingress and resets the timestamp (timestamp reset was originally missing in the netfilter bugfix). Fixes: bcfabee1 ("netfilter: nft_fwd_netdev: allow to redirect to ifb via ingress") Reported-by: noreply@ellerman.id.au Reported-by: NGeert Uytterhoeven <geert@linux-m68k.org> Signed-off-by: NPablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-