提交 · 977d293e23b48a1129830d7968605f61c4af71a0 · openeuler / Kernel

23 6月, 2021 1 次提交

vxlan: add missing rcu_read_lock() in neigh_reduce() · 85e8b032

由 Eric Dumazet 提交于 6月 21, 2021

syzbot complained in neigh_reduce(), because rcu_read_lock_bh()
is treated differently than rcu_read_lock()

WARNING: suspicious RCU usage
5.13.0-rc6-syzkaller #0 Not tainted
-----------------------------
include/net/addrconf.h:313 suspicious rcu_dereference_check() usage!

other info that might help us debug this:

rcu_scheduler_active = 2, debug_locks = 1
3 locks held by kworker/0:0/5:
 #0: ffff888011064d38 ((wq_completion)events){+.+.}-{0:0}, at: arch_atomic64_set arch/x86/include/asm/atomic64_64.h:34 [inline]
 #0: ffff888011064d38 ((wq_completion)events){+.+.}-{0:0}, at: atomic64_set include/asm-generic/atomic-instrumented.h:856 [inline]
 #0: ffff888011064d38 ((wq_completion)events){+.+.}-{0:0}, at: atomic_long_set include/asm-generic/atomic-long.h:41 [inline]
 #0: ffff888011064d38 ((wq_completion)events){+.+.}-{0:0}, at: set_work_data kernel/workqueue.c:617 [inline]
 #0: ffff888011064d38 ((wq_completion)events){+.+.}-{0:0}, at: set_work_pool_and_clear_pending kernel/workqueue.c:644 [inline]
 #0: ffff888011064d38 ((wq_completion)events){+.+.}-{0:0}, at: process_one_work+0x871/0x1600 kernel/workqueue.c:2247
 #1: ffffc90000ca7da8 ((work_completion)(&port->wq)){+.+.}-{0:0}, at: process_one_work+0x8a5/0x1600 kernel/workqueue.c:2251
 #2: ffffffff8bf795c0 (rcu_read_lock_bh){....}-{1:2}, at: __dev_queue_xmit+0x1da/0x3130 net/core/dev.c:4180

stack backtrace:
CPU: 0 PID: 5 Comm: kworker/0:0 Not tainted 5.13.0-rc6-syzkaller #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
Workqueue: events ipvlan_process_multicast
Call Trace:
 __dump_stack lib/dump_stack.c:79 [inline]
 dump_stack+0x141/0x1d7 lib/dump_stack.c:120
 __in6_dev_get include/net/addrconf.h:313 [inline]
 __in6_dev_get include/net/addrconf.h:311 [inline]
 neigh_reduce drivers/net/vxlan.c:2167 [inline]
 vxlan_xmit+0x34d5/0x4c30 drivers/net/vxlan.c:2919
 __netdev_start_xmit include/linux/netdevice.h:4944 [inline]
 netdev_start_xmit include/linux/netdevice.h:4958 [inline]
 xmit_one net/core/dev.c:3654 [inline]
 dev_hard_start_xmit+0x1eb/0x920 net/core/dev.c:3670
 __dev_queue_xmit+0x2133/0x3130 net/core/dev.c:4246
 ipvlan_process_multicast+0xa99/0xd70 drivers/net/ipvlan/ipvlan_core.c:287
 process_one_work+0x98d/0x1600 kernel/workqueue.c:2276
 worker_thread+0x64c/0x1120 kernel/workqueue.c:2422
 kthread+0x3b1/0x4a0 kernel/kthread.c:313
 ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:294

Fixes: f564f45c ("vxlan: add ipv6 proxy support")
Signed-off-by: NEric Dumazet <edumazet@google.com>
Reported-by: Nsyzbot <syzkaller@googlegroups.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

85e8b032

31 3月, 2021 1 次提交

vxlan: allow L4 GRO passthrough · d18931a9

由 Paolo Abeni 提交于 3月 30, 2021

When passing up an UDP GSO packet with L4 aggregation, there is
no need to segment it at the vxlan level. We can propagate the
packet untouched and let it be segmented later, if needed.

Introduce an helper to allow let the UDP socket to accept any
L4 aggregation and use it in the vxlan driver.

v1 -> v2:
 - updated to use the newly introduced UDP socket 'accept*' fields
Signed-off-by: NPaolo Abeni <pabeni@redhat.com>
Reviewed-by: NWillem de Bruijn <willemb@google.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

d18931a9

26 3月, 2021 1 次提交

vxlan: do not modify the shared tunnel info when PMTU triggers an ICMP reply · 30a93d2b

由 Antoine Tenart 提交于 3月 25, 2021

When the interface is part of a bridge or an Open vSwitch port and a
packet exceed a PMTU estimate, an ICMP reply is sent to the sender. When
using the external mode (collect metadata) the source and destination
addresses are reversed, so that Open vSwitch can match the packet
against an existing (reverse) flow.

But inverting the source and destination addresses in the shared
ip_tunnel_info will make following packets of the flow to use a wrong
destination address (packets will be tunnelled to itself), if the flow
isn't updated. Which happens with Open vSwitch, until the flow times
out.

Fixes this by uncloning the skb's ip_tunnel_info before inverting its
source and destination addresses, so that the modification will only be
made for the PTMU packet, not the following ones.

Fixes: fc68c995 ("vxlan: Support for PMTU discovery on directly bridged links")
Tested-by: NEelco Chaudron <echaudro@redhat.com>
Reviewed-by: NEelco Chaudron <echaudro@redhat.com>
Signed-off-by: NAntoine Tenart <atenart@kernel.org>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

30a93d2b

14 3月, 2021 1 次提交

drivers: net: vxlan.c: Fix declaration issue · 6fadbdd6

由 Sanjana Srinidhi 提交于 3月 13, 2021

Added a blank line after structure declaration.
This is done to maintain code uniformity.
Signed-off-by: NSanjana Srinidhi <sanjanasrinidhi1810@gmail.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

6fadbdd6

24 2月, 2021 1 次提交

vxlan: move debug check after netdev unregister · 92584ddf

由 Taehee Yoo 提交于 2月 21, 2021

The debug check must be done after unregister_netdevice_many() call --
the hlist_del_rcu() for this is done inside .ndo_stop.

This is the same with commit 0fda7600 ("geneve: move debug check after
netdev unregister")

Test commands:
    ip netns del A
    ip netns add A
    ip netns add B

    ip netns exec B ip link add vxlan0 type vxlan vni 100 local 10.0.0.1 \
	    remote 10.0.0.2 dstport 4789 srcport 4789 4789
    ip netns exec B ip link set vxlan0 netns A
    ip netns exec A ip link set vxlan0 up
    ip netns del B

Splat looks like:
[   73.176249][    T7] ------------[ cut here ]------------
[   73.178662][    T7] WARNING: CPU: 4 PID: 7 at drivers/net/vxlan.c:4743 vxlan_exit_batch_net+0x52e/0x720 [vxlan]
[   73.182597][    T7] Modules linked in: vxlan openvswitch nsh nf_conncount nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 mlx5_core nfp mlxfw ixgbevf tls sch_fq_codel nf_tables nfnetlink ip_tables x_tables unix
[   73.190113][    T7] CPU: 4 PID: 7 Comm: kworker/u16:0 Not tainted 5.11.0-rc7+ #838
[   73.193037][    T7] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1ubuntu1 04/01/2014
[   73.196986][    T7] Workqueue: netns cleanup_net
[   73.198946][    T7] RIP: 0010:vxlan_exit_batch_net+0x52e/0x720 [vxlan]
[   73.201509][    T7] Code: 00 01 00 00 0f 84 39 fd ff ff 48 89 ca 48 c1 ea 03 80 3c 1a 00 0f 85 a6 00 00 00 89 c2 48 83 c2 02 49 8b 14 d4 48 85 d2 74 ce <0f> 0b eb ca e8 b9 51 db dd 84 c0 0f 85 4a fe ff ff 48 c7 c2 80 bc
[   73.208813][    T7] RSP: 0018:ffff888100907c10 EFLAGS: 00010286
[   73.211027][    T7] RAX: 000000000000003c RBX: dffffc0000000000 RCX: ffff88800ec411f0
[   73.213702][    T7] RDX: ffff88800a278000 RSI: ffff88800fc78c70 RDI: ffff88800fc78070
[   73.216169][    T7] RBP: ffff88800b5cbdc0 R08: fffffbfff424de61 R09: fffffbfff424de61
[   73.218463][    T7] R10: ffffffffa126f307 R11: fffffbfff424de60 R12: ffff88800ec41000
[   73.220794][    T7] R13: ffff888100907d08 R14: ffff888100907c50 R15: ffff88800fc78c40
[   73.223337][    T7] FS:  0000000000000000(0000) GS:ffff888114800000(0000) knlGS:0000000000000000
[   73.225814][    T7] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   73.227616][    T7] CR2: 0000562b5cb4f4d0 CR3: 0000000105fbe001 CR4: 00000000003706e0
[   73.229700][    T7] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[   73.231820][    T7] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[   73.233844][    T7] Call Trace:
[   73.234698][    T7]  ? vxlan_err_lookup+0x3c0/0x3c0 [vxlan]
[   73.235962][    T7]  ? ops_exit_list.isra.11+0x93/0x140
[   73.237134][    T7]  cleanup_net+0x45e/0x8a0
[ ... ]

Fixes: 57b61127 ("vxlan: speedup vxlan tunnels dismantle")
Signed-off-by: NTaehee Yoo <ap420073@gmail.com>
Link: https://lore.kernel.org/r/20210221154552.11749-1-ap420073@gmail.comSigned-off-by: NJakub Kicinski <kuba@kernel.org>

92584ddf

19 1月, 2021 1 次提交

vxlan: add NETIF_F_FRAGLIST flag for dev features · cb2c5711

由 Xin Long 提交于 1月 15, 2021

Some protocol HW GSO requires fraglist supported by the device, like
SCTP. Without NETIF_F_FRAGLIST set in the dev features of vxlan, it
would have to do SW GSO before the packets enter the driver, even
when the vxlan dev and lower dev (like veth) both have the feature
of NETIF_F_GSO_SCTP.

So this patch is to add it for vxlan.
Signed-off-by: NXin Long <lucien.xin@gmail.com>
Signed-off-by: NJakub Kicinski <kuba@kernel.org>

cb2c5711

08 1月, 2021 1 次提交

udp_tunnel: remove REGISTER/UNREGISTER handling from tunnel drivers · dedc33e7

由 Jakub Kicinski 提交于 1月 06, 2021

udp_tunnel_nic handles REGISTER and UNREGISTER event, now that all
drivers use that infra we can drop the event handling in the tunnel
drivers.
Reviewed-by: NAlexander Duyck <alexanderduyck@fb.com>
Reviewed-by: NJacob Keller <jacob.e.keller@intel.com>
Signed-off-by: NJakub Kicinski <kuba@kernel.org>

dedc33e7

11 12月, 2020 1 次提交

vxlan: avoid double unlikely() notation when using IS_ERR() · a10b24b8

由 Antonio Quartulli 提交于 12月 10, 2020

The definition of IS_ERR() already applies the unlikely() notation
when checking the error status of the passed pointer. For this
reason there is no need to have the same notation outside of
IS_ERR() itself.

Clean up code by removing redundant notation.
Signed-off-by: NAntonio Quartulli <a@unstable.cc>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

a10b24b8

03 12月, 2020 1 次提交

vxlan: fix error return code in __vxlan_dev_create() · 832e0979

由 Zhang Changzhong 提交于 12月 02, 2020

Fix to return a negative error code from the error handling
case instead of 0, as done elsewhere in this function.

Fixes: 0ce1822c ("vxlan: add adjacent link to limit depth level")
Reported-by: NHulk Robot <hulkci@huawei.com>
Signed-off-by: NZhang Changzhong <zhangchangzhong@huawei.com>
Link: https://lore.kernel.org/r/1606903122-2098-1-git-send-email-zhangchangzhong@huawei.comSigned-off-by: NJakub Kicinski <kuba@kernel.org>

832e0979

01 12月, 2020 2 次提交

vxlan: Copy needed_tailroom from lowerdev · a5e74021

由 Sven Eckelmann 提交于 11月 26, 2020

While vxlan doesn't need any extra tailroom, the lowerdev might need it. In
that case, copy it over to reduce the chance for additional (re)allocations
in the transmit path.
Signed-off-by: NSven Eckelmann <sven@narfation.org>
Link: https://lore.kernel.org/r/20201126125247.1047977-2-sven@narfation.orgSigned-off-by: NJakub Kicinski <kuba@kernel.org>

a5e74021

vxlan: Add needed_headroom for lower device · 0a35dc41

由 Sven Eckelmann 提交于 11月 26, 2020

It was observed that sending data via batadv over vxlan (on top of
wireguard) reduced the performance massively compared to raw ethernet or
batadv on raw ethernet. A check of perf data showed that the
vxlan_build_skb was calling all the time pskb_expand_head to allocate
enough headroom for:

  min_headroom = LL_RESERVED_SPACE(dst->dev) + dst->header_len
  		+ VXLAN_HLEN + iphdr_len;

But the vxlan_config_apply only requested needed headroom for:

  lowerdev->hard_header_len + VXLAN6_HEADROOM or VXLAN_HEADROOM

So it completely ignored the needed_headroom of the lower device. The first
caller of net_dev_xmit could therefore never make sure that enough headroom
was allocated for the rest of the transmit path.

Cc: Annika Wickert <annika.wickert@exaring.de>
Signed-off-by: NSven Eckelmann <sven@narfation.org>
Tested-by: NAnnika Wickert <aw@awlnx.space>
Link: https://lore.kernel.org/r/20201126125247.1047977-1-sven@narfation.orgSigned-off-by: NJakub Kicinski <kuba@kernel.org>

0a35dc41

10 11月, 2020 1 次提交

net: switch to dev_get_tstats64 · b220a4a7

由 Heiner Kallweit 提交于 11月 07, 2020

Replace ip_tunnel_get_stats64() with the new identical core function
dev_get_tstats64().
Signed-off-by: NHeiner Kallweit <hkallweit1@gmail.com>
Signed-off-by: NJakub Kicinski <kuba@kernel.org>

b220a4a7

07 11月, 2020 2 次提交

nexthop: Pass extack to register_nexthop_notifier() · ce7e9c8a

由 Ido Schimmel 提交于 11月 04, 2020

This will be used by the next patch which extends the function to replay
all the existing nexthops to the notifier block being registered.

Device drivers will be able to pass extack to the function since it is
passed to them upon reload from devlink.
Signed-off-by: NIdo Schimmel <idosch@nvidia.com>
Reviewed-by: NDavid Ahern <dsahern@gmail.com>
Signed-off-by: NJakub Kicinski <kuba@kernel.org>

ce7e9c8a

nexthop: vxlan: Convert to new notification info · 1ec69d18

由 Ido Schimmel 提交于 11月 04, 2020

Convert the sole listener of the nexthop notification chain (the VXLAN
driver) to the new notification info.
Signed-off-by: NIdo Schimmel <idosch@nvidia.com>
Signed-off-by: NJakub Kicinski <kuba@kernel.org>

1ec69d18

04 11月, 2020 1 次提交

vxlan: Use a per-namespace nexthop listener instead of a global one · 626d667b

由 Ido Schimmel 提交于 11月 01, 2020

The nexthop notification chain is a per-namespace chain and not a global
one like the netdev notification chain.

Therefore, a single (global) listener cannot be registered to all these
chains simultaneously as it will result in list corruptions whenever
listeners are registered / unregistered.

Instead, register a different listener in each namespace.

Currently this is not an issue because only the VXLAN driver registers a
listener to this chain, but this is going to change with netdevsim and
mlxsw also registering their own listeners.
Signed-off-by: NIdo Schimmel <idosch@nvidia.com>
Reviewed-by: NDavid Ahern <dsahern@kernel.org>
Link: https://lore.kernel.org/r/20201101113926.705630-1-idosch@idosch.orgSigned-off-by: NJakub Kicinski <kuba@kernel.org>

626d667b

06 10月, 2020 1 次提交

vxlan: use dev_sw_netstats_rx_add() · 1f8dda1d

由 Fabian Frederick 提交于 10月 05, 2020

use new helper for netstats settings
Signed-off-by: NFabian Frederick <fabf@skynet.be>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

1f8dda1d

27 9月, 2020 1 次提交

Revert "vxlan: move encapsulation warning" · 435be28b

由 Jakub Kicinski 提交于 9月 25, 2020

This reverts commit 546c044c.

Nothing prevents user from sending frames to "external" VxLAN devices.
In fact kernel itself may generate icmp chatter.

This is fine, such frames should be dropped.

The point of the "missing encapsulation" warning was that
frames with missing encap should not make it into vxlan_xmit_one().
And vxlan_xmit() drops them cleanly, so let it just do that.

Without this revert the warning is triggered by the udp_tunnel_nic.sh
test, but the minimal repro is:

$ ip link add vxlan0 type vxlan \
     	      	     group 239.1.1.1 \
		     dev lo \
		     dstport 1234 \
		     external
$ ip li set dev vxlan0 up

[  419.165981] vxlan0: Missing encapsulation instructions
[  419.166551] WARNING: CPU: 0 PID: 1041 at drivers/net/vxlan.c:2889 vxlan_xmit+0x15c0/0x1fc0 [vxlan]
Signed-off-by: NJakub Kicinski <kuba@kernel.org>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

435be28b

26 9月, 2020 5 次提交

vxlan: fix vxlan_find_sock() documentation for l3mdev · 78ec710e

由 Fabian Frederick 提交于 9月 25, 2020

Since commit aab8cc36
("vxlan: add support for underlay in non-default VRF")

vxlan_find_sock() also checks if socket is assigned to the right
level 3 master device when lower device is not in the default VRF.
Signed-off-by: NFabian Frederick <fabf@skynet.be>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

78ec710e

vxlan: check rtnl_configure_link return code correctly · 2eabcb8a

由 Fabian Frederick 提交于 9月 25, 2020

rtnl_configure_link is always checked if < 0 for error code.
Signed-off-by: NFabian Frederick <fabf@skynet.be>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

2eabcb8a

vxlan: move encapsulation warning · 546c044c

由 Fabian Frederick 提交于 9月 25, 2020

vxlan_xmit_one() was only called from vxlan_xmit() without rdst and
info was already tested. Emit warning in that function instead
Signed-off-by: NFabian Frederick <fabf@skynet.be>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

546c044c

vxlan: add unlikely to vxlan_remcsum check · 0189399c

由 Fabian Frederick 提交于 9月 25, 2020

small optimization around checking as it's being done in all
receptions
Signed-off-by: NFabian Frederick <fabf@skynet.be>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

0189399c

vxlan: don't collect metadata if remote checksum is wrong · 2ae2904b

由 Fabian Frederick 提交于 9月 25, 2020

call vxlan_remcsum() before md filling in vxlan_rcv()
Signed-off-by: NFabian Frederick <fabf@skynet.be>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

2ae2904b

06 8月, 2020 1 次提交

Revert "vxlan: fix tos value before xmit" · a0dced17

由 Hangbin Liu 提交于 8月 05, 2020

This reverts commit 71130f29.

In commit 71130f29 ("vxlan: fix tos value before xmit") we want to
make sure the tos value are filtered by RT_TOS() based on RFC1349.

       0     1     2     3     4     5     6     7
    +-----+-----+-----+-----+-----+-----+-----+-----+
    |   PRECEDENCE    |          TOS          | MBZ |
    +-----+-----+-----+-----+-----+-----+-----+-----+

But RFC1349 has been obsoleted by RFC2474. The new DSCP field defined like

       0     1     2     3     4     5     6     7
    +-----+-----+-----+-----+-----+-----+-----+-----+
    |          DS FIELD, DSCP           | ECN FIELD |
    +-----+-----+-----+-----+-----+-----+-----+-----+

So with

IPTOS_TOS_MASK          0x1E
RT_TOS(tos)		((tos)&IPTOS_TOS_MASK)

the first 3 bits DSCP info will get lost.

To take all the DSCP info in xmit, we should revert the patch and just push
all tos bits to ip_tunnel_ecn_encap(), which will handling ECN field later.

Fixes: 71130f29 ("vxlan: fix tos value before xmit")
Signed-off-by: NHangbin Liu <liuhangbin@gmail.com>
Acked-by: NGuillaume Nault <gnault@redhat.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

a0dced17

05 8月, 2020 2 次提交

vxlan: Support for PMTU discovery on directly bridged links · fc68c995

由 Stefano Brivio 提交于 8月 04, 2020

If the interface is a bridge or Open vSwitch port, and we can't
forward a packet because it exceeds the local PMTU estimate,
trigger an ICMP or ICMPv6 reply to the sender, using the same
interface to forward it back.

If metadata collection is enabled, reverse destination and source
addresses, so that Open vSwitch is able to match this packet against
the existing, reverse flow.

v2: Use netif_is_any_bridge_port() (David Ahern)
Signed-off-by: NStefano Brivio <sbrivio@redhat.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

fc68c995

tunnels: PMTU discovery support for directly bridged IP packets · 4cb47a86

由 Stefano Brivio 提交于 8月 04, 2020

It's currently possible to bridge Ethernet tunnels carrying IP
packets directly to external interfaces without assigning them
addresses and routes on the bridged network itself: this is the case
for UDP tunnels bridged with a standard bridge or by Open vSwitch.

PMTU discovery is currently broken with those configurations, because
the encapsulation effectively decreases the MTU of the link, and
while we are able to account for this using PMTU discovery on the
lower layer, we don't have a way to relay ICMP or ICMPv6 messages
needed by the sender, because we don't have valid routes to it.

On the other hand, as a tunnel endpoint, we can't fragment packets
as a general approach: this is for instance clearly forbidden for
VXLAN by RFC 7348, section 4.3:

   VTEPs MUST NOT fragment VXLAN packets.  Intermediate routers may
   fragment encapsulated VXLAN packets due to the larger frame size.
   The destination VTEP MAY silently discard such VXLAN fragments.

The same paragraph recommends that the MTU over the physical network
accomodates for encapsulations, but this isn't a practical option for
complex topologies, especially for typical Open vSwitch use cases.

Further, it states that:

   Other techniques like Path MTU discovery (see [RFC1191] and
   [RFC1981]) MAY be used to address this requirement as well.

Now, PMTU discovery already works for routed interfaces, we get
route exceptions created by the encapsulation device as they receive
ICMP Fragmentation Needed and ICMPv6 Packet Too Big messages, and
we already rebuild those messages with the appropriate MTU and route
them back to the sender.

Add the missing bits for bridged cases:

- checks in skb_tunnel_check_pmtu() to understand if it's appropriate
  to trigger a reply according to RFC 1122 section 3.2.2 for ICMP and
  RFC 4443 section 2.4 for ICMPv6. This function is already called by
  UDP tunnels

- a new function generating those ICMP or ICMPv6 replies. We can't
  reuse icmp_send() and icmp6_send() as we don't see the sender as a
  valid destination. This doesn't need to be generic, as we don't
  cover any other type of ICMP errors given that we only provide an
  encapsulation function to the sender

While at it, make the MTU check in skb_tunnel_check_pmtu() accurate:
we might receive GSO buffers here, and the passed headroom already
includes the inner MAC length, so we don't have to account for it
a second time (that would imply three MAC headers on the wire, but
there are just two).

This issue became visible while bridging IPv6 packets with 4500 bytes
of payload over GENEVE using IPv4 with a PMTU of 4000. Given the 50
bytes of encapsulation headroom, we would advertise MTU as 3950, and
we would reject fragmented IPv6 datagrams of 3958 bytes size on the
wire. We're exclusively dealing with network MTU here, though, so we
could get Ethernet frames up to 3964 octets in that case.

v2:
- moved skb_tunnel_check_pmtu() to ip_tunnel_core.c (David Ahern)
- split IPv4/IPv6 functions (David Ahern)
Signed-off-by: NStefano Brivio <sbrivio@redhat.com>
Reviewed-by: NDavid Ahern <dsahern@gmail.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

4cb47a86

02 8月, 2020 1 次提交

vxlan: fix memleak of fdb · fda2ec62

由 Taehee Yoo 提交于 8月 01, 2020

When vxlan interface is deleted, all fdbs are deleted by vxlan_flush().
vxlan_flush() flushes fdbs but it doesn't delete fdb, which contains
all-zeros-mac because it is deleted by vxlan_uninit().
But vxlan_uninit() deletes only the fdb, which contains both all-zeros-mac
and default vni.
So, the fdb, which contains both all-zeros-mac and non-default vni
will not be deleted.

Test commands:
    ip link add vxlan0 type vxlan dstport 4789 external
    ip link set vxlan0 up
    bridge fdb add to 00:00:00:00:00:00 dst 172.0.0.1 dev vxlan0 via lo \
	    src_vni 10000 self permanent
    ip link del vxlan0

kmemleak reports as follows:
unreferenced object 0xffff9486b25ced88 (size 96):
  comm "bridge", pid 2151, jiffies 4294701712 (age 35506.901s)
  hex dump (first 32 bytes):
    02 00 00 00 ac 00 00 01 40 00 09 b1 86 94 ff ff  ........@.......
    46 02 00 00 00 00 00 00 a7 03 00 00 12 b5 6a 6b  F.............jk
  backtrace:
    [<00000000c10cf651>] vxlan_fdb_append.part.51+0x3c/0xf0 [vxlan]
    [<000000006b31a8d9>] vxlan_fdb_create+0x184/0x1a0 [vxlan]
    [<0000000049399045>] vxlan_fdb_update+0x12f/0x220 [vxlan]
    [<0000000090b1ef00>] vxlan_fdb_add+0x12a/0x1b0 [vxlan]
    [<0000000056633c2c>] rtnl_fdb_add+0x187/0x270
    [<00000000dd5dfb6b>] rtnetlink_rcv_msg+0x264/0x490
    [<00000000fc44dd54>] netlink_rcv_skb+0x4a/0x110
    [<00000000dff433e7>] netlink_unicast+0x18e/0x250
    [<00000000b87fb421>] netlink_sendmsg+0x2e9/0x400
    [<000000002ed55153>] ____sys_sendmsg+0x237/0x260
    [<00000000faa51c66>] ___sys_sendmsg+0x88/0xd0
    [<000000006c3982f1>] __sys_sendmsg+0x4e/0x80
    [<00000000a8f875d2>] do_syscall_64+0x56/0xe0
    [<000000003610eefa>] entry_SYSCALL_64_after_hwframe+0x44/0xa9
unreferenced object 0xffff9486b1c40080 (size 128):
  comm "bridge", pid 2157, jiffies 4294701754 (age 35506.866s)
  hex dump (first 32 bytes):
    00 00 00 00 00 00 00 00 f8 dc 42 b2 86 94 ff ff  ..........B.....
    6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b  kkkkkkkkkkkkkkkk
  backtrace:
    [<00000000a2981b60>] vxlan_fdb_create+0x67/0x1a0 [vxlan]
    [<0000000049399045>] vxlan_fdb_update+0x12f/0x220 [vxlan]
    [<0000000090b1ef00>] vxlan_fdb_add+0x12a/0x1b0 [vxlan]
    [<0000000056633c2c>] rtnl_fdb_add+0x187/0x270
    [<00000000dd5dfb6b>] rtnetlink_rcv_msg+0x264/0x490
    [<00000000fc44dd54>] netlink_rcv_skb+0x4a/0x110
    [<00000000dff433e7>] netlink_unicast+0x18e/0x250
    [<00000000b87fb421>] netlink_sendmsg+0x2e9/0x400
    [<000000002ed55153>] ____sys_sendmsg+0x237/0x260
    [<00000000faa51c66>] ___sys_sendmsg+0x88/0xd0
    [<000000006c3982f1>] __sys_sendmsg+0x4e/0x80
    [<00000000a8f875d2>] do_syscall_64+0x56/0xe0
    [<000000003610eefa>] entry_SYSCALL_64_after_hwframe+0x44/0xa9

Fixes: 3ad7a4b1 ("vxlan: support fdb and learning in COLLECT_METADATA mode")
Signed-off-by: NTaehee Yoo <ap420073@gmail.com>
Acked-by: NRoopa Prabhu <roopa@cumulusnetworks.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

fda2ec62

30 7月, 2020 1 次提交

vxlan: Ensure FDB dump is performed under RCU · b5141915

由 Ido Schimmel 提交于 7月 29, 2020

The commit cited below removed the RCU read-side critical section from
rtnl_fdb_dump() which means that the ndo_fdb_dump() callback is invoked
without RCU protection.

This results in the following warning [1] in the VXLAN driver, which
relied on the callback being invoked from an RCU read-side critical
section.

Fix this by calling rcu_read_lock() in the VXLAN driver, as already done
in the bridge driver.

[1]
WARNING: suspicious RCU usage
5.8.0-rc4-custom-01521-g481007553ce6 #29 Not tainted
-----------------------------
drivers/net/vxlan.c:1379 RCU-list traversed in non-reader section!!

other info that might help us debug this:

rcu_scheduler_active = 2, debug_locks = 1
1 lock held by bridge/166:
 #0: ffffffff85a27850 (rtnl_mutex){+.+.}-{3:3}, at: netlink_dump+0xea/0x1090

stack backtrace:
CPU: 1 PID: 166 Comm: bridge Not tainted 5.8.0-rc4-custom-01521-g481007553ce6 #29
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.13.0-2.fc32 04/01/2014
Call Trace:
 dump_stack+0x100/0x184
 lockdep_rcu_suspicious+0x153/0x15d
 vxlan_fdb_dump+0x51e/0x6d0
 rtnl_fdb_dump+0x4dc/0xad0
 netlink_dump+0x540/0x1090
 __netlink_dump_start+0x695/0x950
 rtnetlink_rcv_msg+0x802/0xbd0
 netlink_rcv_skb+0x17a/0x480
 rtnetlink_rcv+0x22/0x30
 netlink_unicast+0x5ae/0x890
 netlink_sendmsg+0x98a/0xf40
 __sys_sendto+0x279/0x3b0
 __x64_sys_sendto+0xe6/0x1a0
 do_syscall_64+0x54/0xa0
 entry_SYSCALL_64_after_hwframe+0x44/0xa9
RIP: 0033:0x7fe14fa2ade0
Code: Bad RIP value.
RSP: 002b:00007fff75bb5b88 EFLAGS: 00000246 ORIG_RAX: 000000000000002c
RAX: ffffffffffffffda RBX: 00005614b1ba0020 RCX: 00007fe14fa2ade0
RDX: 000000000000011c RSI: 00007fff75bb5b90 RDI: 0000000000000003
RBP: 00007fff75bb5b90 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000246 R12: 00005614b1b89160
R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000

Fixes: 5e6d2435 ("bridge: netlink dump interface at par with brctl")
Signed-off-by: NIdo Schimmel <idosch@mellanox.com>
Reviewed-by: NJiri Pirko <jiri@mellanox.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

b5141915

11 7月, 2020 1 次提交

udp_tunnel: add central NIC RX port offload infrastructure · cc4e3835

由 Jakub Kicinski 提交于 7月 09, 2020

Cater to devices which:
 (a) may want to sleep in the callbacks;
 (b) only have IPv4 support;
 (c) need all the programming to happen while the netdev is up.

Drivers attach UDP tunnel offload info struct to their netdevs,
where they declare how many UDP ports of various tunnel types
they support. Core takes care of tracking which ports to offload.

Use a fixed-size array since this matches what almost all drivers
do, and avoids a complexity and uncertainty around memory allocations
in an atomic context.

Make sure that tunnel drivers don't try to replay the ports when
new NIC netdev is registered. Automatic replays would mess up
reference counting, and will be removed completely once all drivers
are converted.

v4:
 - use a #define NULL to avoid build issues with CONFIG_INET=n.
Signed-off-by: NJakub Kicinski <kuba@kernel.org>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

cc4e3835

26 6月, 2020 1 次提交

vxlan: fix last fdb index during dump of fdb with nhid · b18e9834

由 Roopa Prabhu 提交于 6月 24, 2020

This patch fixes last saved fdb index in fdb dump handler when
handling fdb's with nhid.

Fixes: 1274e1cc ("vxlan: ecmp support for mac fdb entries")
Signed-off-by: NRoopa Prabhu <roopa@cumulusnetworks.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

b18e9834

11 6月, 2020 2 次提交

vxlan: Remove access to nexthop group struct · 50cb8769

由 David Ahern 提交于 6月 09, 2020

vxlan driver should be using helpers to access nexthop struct
internals. Remove open check if whether nexthop is multipath in
favor of the existing nexthop_is_multipath helper. Add a new
helper, nexthop_has_v4, to cover the need to check has_v4 in
a group.

Fixes: 1274e1cc ("vxlan: ecmp support for mac fdb entries")
Cc: Roopa Prabhu <roopa@cumulusnetworks.com>
Signed-off-by: NDavid Ahern <dsahern@kernel.org>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

50cb8769

nexthop: Fix fdb labeling for groups · ce9ac056

由 David Ahern 提交于 6月 08, 2020

fdb nexthops are marked with a flag. For standalone nexthops, a flag was
added to the nh_info struct. For groups that flag was added to struct
nexthop when it should have been added to the group information. Fix
by removing the flag from the nexthop struct and adding a flag to nh_group
that mirrors nh_info and is really only a caching of the individual types.
Add a helper, nexthop_is_fdb, for use by the vxlan code and fixup the
internal code to use the flag from either nh_info or nh_group.

v2
- propagate fdb_nh in remove_nh_grp_entry

Fixes: 38428d68 ("nexthop: support for fdb ecmp nexthops")
Cc: Roopa Prabhu <roopa@cumulusnetworks.com>
Signed-off-by: NDavid Ahern <dsahern@kernel.org>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

ce9ac056

10 6月, 2020 1 次提交

net: change addr_list_lock back to static key · 845e0ebb

由 Cong Wang 提交于 6月 08, 2020

The dynamic key update for addr_list_lock still causes troubles,
for example the following race condition still exists:

CPU 0:				CPU 1:
(RCU read lock)			(RTNL lock)
dev_mc_seq_show()		netdev_update_lockdep_key()
				  -> lockdep_unregister_key()
 -> netif_addr_lock_bh()

because lockdep doesn't provide an API to update it atomically.
Therefore, we have to move it back to static keys and use subclass
for nest locking like before.

In commit 1a33e10e ("net: partially revert dynamic lockdep key
changes"), I already reverted most parts of commit ab92d68f
("net: core: add generic lockdep keys").

This patch reverts the rest and also part of commit f3b0a18b
("net: remove unnecessary variables and callback"). After this
patch, addr_list_lock changes back to using static keys and
subclasses to satisfy lockdep. Thanks to dev->lower_level, we do
not have to change back to ->ndo_get_lock_subclass().

And hopefully this reduces some syzbot lockdep noises too.

Reported-by: syzbot+f3a0e80c34b3fc28ac5e@syzkaller.appspotmail.com
Cc: Taehee Yoo <ap420073@gmail.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Signed-off-by: NCong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

845e0ebb

02 6月, 2020 2 次提交

vxlan: fix dereference of nexthop group in nexthop update path · 03eaeda7

由 Roopa Prabhu 提交于 5月 30, 2020

fix dereference of nexthop group in fdb nexthop group
update validation path.

Fixes: 1274e1cc ("vxlan: ecmp support for mac fdb entries")
Reported-by: NIdo Schimmel <idosch@idosch.org>
Suggested-by: NIdo Schimmel <idosch@idosch.org>
Signed-off-by: NRoopa Prabhu <roopa@cumulusnetworks.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

03eaeda7

vxlan: Avoid infinite loop when suppressing NS messages with invalid options · 8066e6b4

由 Ido Schimmel 提交于 6月 01, 2020

When proxy mode is enabled the vxlan device might reply to Neighbor
Solicitation (NS) messages on behalf of remote hosts.

In case the NS message includes the "Source link-layer address" option
[1], the vxlan device will use the specified address as the link-layer
destination address in its reply.

To avoid an infinite loop, break out of the options parsing loop when
encountering an option with length zero and disregard the NS message.

This is consistent with the IPv6 ndisc code and RFC 4886 which states
that "Nodes MUST silently discard an ND packet that contains an option
with length zero" [2].

[1] https://tools.ietf.org/html/rfc4861#section-4.3
[2] https://tools.ietf.org/html/rfc4861#section-4.6

Fixes: 4b29dba9 ("vxlan: fix nonfunctional neigh_reduce()")
Signed-off-by: NIdo Schimmel <idosch@mellanox.com>
Acked-by: NNikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

8066e6b4

31 5月, 2020 2 次提交

vxlan: few locking fixes in nexthop event handler · 79472fe8

由 Roopa Prabhu 提交于 5月 28, 2020

- remove fdb from nh_list before the rcu grace period
- protect fdb->vdev with rcu
- hold spin lock before destroying fdb

Fixes: c7cdbe2e ("vxlan: support for nexthop notifiers")
Signed-off-by: NRoopa Prabhu <roopa@cumulusnetworks.com>
Reviewed-by: NNikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

79472fe8

vxlan: add check to prevent use of remote ip attributes with NDA_NH_ID · 72b48682

由 Roopa Prabhu 提交于 5月 28, 2020

NDA_NH_ID represents a remote ip or a group of remote ips.
It allows use of nexthop groups in lieu of a remote ip or a
list of remote ips supported by the fdb api.

Current code ignores the other remote ip attrs when NDA_NH_ID is
specified. In the spirit of strict checking, This commit adds a
check to explicitly return an error on incorrect usage.

Fixes: 1274e1cc ("vxlan: ecmp support for mac fdb entries")
Signed-off-by: NRoopa Prabhu <roopa@cumulusnetworks.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

72b48682

25 5月, 2020 1 次提交

vxlan: Do not assume RTNL is held in vxlan_fdb_info() · 06ec313e

由 Ido Schimmel 提交于 5月 25, 2020

vxlan_fdb_info() is not always called with RTNL held or from an RCU
read-side critical section. For example, in the following call path:

vxlan_cleanup()
  vxlan_fdb_destroy()
    vxlan_fdb_notify()
      __vxlan_fdb_notify()
        vxlan_fdb_info()

The use of rtnl_dereference() can therefore result in the following
splat [1].

Fix this by dereferencing the nexthop under RCU read-side critical
section.

[1]
[May24 22:56] =============================
[  +0.004676] WARNING: suspicious RCU usage
[  +0.004614] 5.7.0-rc5-custom-16219-g201392003491 #2772 Not tainted
[  +0.007116] -----------------------------
[  +0.004657] drivers/net/vxlan.c:276 suspicious rcu_dereference_check() usage!
[  +0.008164]
              other info that might help us debug this:

[  +0.009126]
              rcu_scheduler_active = 2, debug_locks = 1
[  +0.007504] 5 locks held by bash/6892:
[  +0.004392]  #0: ffff8881d47e3410 (&sig->cred_guard_mutex){+.+.}-{3:3}, at: __do_execve_file.isra.27+0x392/0x23c0
[  +0.011795]  #1: ffff8881d47e34b0 (&sig->exec_update_mutex){+.+.}-{3:3}, at: flush_old_exec+0x510/0x2030
[  +0.010947]  #2: ffff8881a141b0b0 (ptlock_ptr(page)#2){+.+.}-{2:2}, at: unmap_page_range+0x9c0/0x2590
[  +0.010585]  #3: ffff888230009d50 ((&vxlan->age_timer)){+.-.}-{0:0}, at: call_timer_fn+0xe8/0x800
[  +0.010192]  #4: ffff888183729bc8 (&vxlan->hash_lock[h]){+.-.}-{2:2}, at: vxlan_cleanup+0x133/0x4a0
[  +0.010382]
              stack backtrace:
[  +0.005103] CPU: 1 PID: 6892 Comm: bash Not tainted 5.7.0-rc5-custom-16219-g201392003491 #2772
[  +0.009675] Hardware name: Mellanox Technologies Ltd. MSN2100-CB2FO/SA001017, BIOS 5.6.5 06/07/2016
[  +0.010155] Call Trace:
[  +0.002775]  <IRQ>
[  +0.002313]  dump_stack+0xfd/0x178
[  +0.003895]  lockdep_rcu_suspicious+0x14a/0x153
[  +0.005157]  vxlan_fdb_info+0xe39/0x12a0
[  +0.004775]  __vxlan_fdb_notify+0xb8/0x160
[  +0.004672]  vxlan_fdb_notify+0x8e/0xe0
[  +0.004370]  vxlan_fdb_destroy+0x117/0x330
[  +0.004662]  vxlan_cleanup+0x1aa/0x4a0
[  +0.004329]  call_timer_fn+0x1c4/0x800
[  +0.004357]  run_timer_softirq+0x129d/0x17e0
[  +0.004762]  __do_softirq+0x24c/0xaef
[  +0.004232]  irq_exit+0x167/0x190
[  +0.003767]  smp_apic_timer_interrupt+0x1dd/0x6a0
[  +0.005340]  apic_timer_interrupt+0xf/0x20
[  +0.004620]  </IRQ>

Fixes: 1274e1cc ("vxlan: ecmp support for mac fdb entries")
Signed-off-by: NIdo Schimmel <idosch@mellanox.com>
Reported-by: NAmit Cohen <amitc@mellanox.com>
Acked-by: NRoopa Prabhu <roopa@cumulusnetworks.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

06ec313e

23 5月, 2020 2 次提交

vxlan: support for nexthop notifiers · c7cdbe2e

由 Roopa Prabhu 提交于 5月 21, 2020

vxlan driver registers for nexthop add/del notifiers to
cleanup fdb entries pointing to such nexthops.
Signed-off-by: NRoopa Prabhu <roopa@cumulusnetworks.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

c7cdbe2e

vxlan: ecmp support for mac fdb entries · 1274e1cc

由 Roopa Prabhu 提交于 5月 21, 2020

Todays vxlan mac fdb entries can point to multiple remote
ips (rdsts) with the sole purpose of replicating
broadcast-multicast and unknown unicast packets to those remote ips.

E-VPN multihoming [1,2,3] requires bridged vxlan traffic to be
load balanced to remote switches (vteps) belonging to the
same multi-homed ethernet segment (E-VPN multihoming is analogous
to multi-homed LAG implementations, but with the inter-switch
peerlink replaced with a vxlan tunnel). In other words it needs
support for mac ecmp. Furthermore, for faster convergence, E-VPN
multihoming needs the ability to update fdb ecmp nexthops independent
of the fdb entries.

New route nexthop API is perfect for this usecase.
This patch extends the vxlan fdb code to take a nexthop id
pointing to an ecmp nexthop group.

Changes include:
- New NDA_NH_ID attribute for fdbs
- Use the newly added fdb nexthop groups
- makes vxlan rdsts and nexthop handling code mutually
  exclusive
- since this is a new use-case and the requirement is for ecmp
nexthop groups, the fdb add and update path checks that the
nexthop is really an ecmp nexthop group. This check can be relaxed
in the future, if we want to introduce replication fdb nexthop groups
and allow its use in lieu of current rdst lists.
- fdb update requests with nexthop id's only allowed for existing
fdb's that have nexthop id's
- learning will not override an existing fdb entry with nexthop
group
- I have wrapped the switchdev offload code around the presence of
rdst

[1] E-VPN RFC https://tools.ietf.org/html/rfc7432
[2] E-VPN with vxlan https://tools.ietf.org/html/rfc8365
[3] http://vger.kernel.org/lpc_net2018_talks/scaling_bridge_fdb_database_slidesV3.pdf

Includes a null check fix in vxlan_xmit from Nikolay

v2 - Fixed build issue:
Reported-by: Nkbuild test robot <lkp@intel.com>
Signed-off-by: NRoopa Prabhu <roopa@cumulusnetworks.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

1274e1cc

24 4月, 2020 1 次提交

vxlan: use the correct nlattr array in NL_SET_ERR_MSG_ATTR · cc8e7c69

由 Sabrina Dubroca 提交于 4月 22, 2020

IFLA_VXLAN_* attributes are in the data array, which is correctly
used when fetching the value, but not when setting the extended
ack. Because IFLA_VXLAN_MAX < IFLA_MAX, we avoid out of bounds
array accesses, but we don't provide a pointer to the invalid
attribute to userspace.

Fixes: 653ef6a3 ("vxlan: change vxlan_[config_]validate() to use netlink_ext_ack for error reporting")
Fixes: b4d30697 ("vxlan: Allow configuration of DF behaviour")
Signed-off-by: NSabrina Dubroca <sd@queasysnail.net>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

cc8e7c69

openeuler / Kernel 1 年多 前同步成功

openeuler / Kernel
1 年多前同步成功