提交 · f9c4bb0b245cee35ef66f75bf409c9573d934cf9 · openeuler / Kernel

01 3月, 2022 8 次提交

vxlan: vni filtering support on collect metadata device · f9c4bb0b

由 Roopa Prabhu 提交于 3月 01, 2022

This patch adds vnifiltering support to collect metadata device.

Motivation:
You can only use a single vxlan collect metadata device for a given
vxlan udp port in the system today. The vxlan collect metadata device
terminates all received vxlan packets. As shown in the below diagram,
there are use-cases where you need to support multiple such vxlan devices in
independent bridge domains. Each vxlan device must terminate the vni's
it is configured for.
Example usecase: In a service provider network a service provider
typically supports multiple bridge domains with overlapping vlans.
One bridge domain per customer. Vlans in each bridge domain are
mapped to globally unique vxlan ranges assigned to each customer.

vnifiltering support in collect metadata devices terminates only configured
vnis. This is similar to vlan filtering in bridge driver. The vni filtering
capability is provided by a new flag on collect metadata device.

In the below pic:
	- customer1 is mapped to br1 bridge domain
	- customer2 is mapped to br2 bridge domain
	- customer1 vlan 10-11 is mapped to vni 1001-1002
	- customer2 vlan 10-11 is mapped to vni 2001-2002
	- br1 and br2 are vlan filtering bridges
	- vxlan1 and vxlan2 are collect metadata devices with
	  vnifiltering enabled

┌──────────────────────────────────────────────────────────────────┐
│  switch                                                          │
│                                                                  │
│         ┌───────────┐                 ┌───────────┐              │
│         │           │                 │           │              │
│         │   br1     │                 │   br2     │              │
│         └┬─────────┬┘                 └──┬───────┬┘              │
│     vlans│         │               vlans │       │               │
│     10,11│         │                10,11│       │               │
│          │     vlanvnimap:               │    vlanvnimap:        │
│          │       10-1001,11-1002         │      10-2001,11-2002  │
│          │         │                     │       │               │
│   ┌──────┴┐     ┌──┴─────────┐       ┌───┴────┐  │               │
│   │ swp1  │     │vxlan1      │       │ swp2   │ ┌┴─────────────┐ │
│   │       │     │  vnifilter:│       │        │ │vxlan2        │ │
│   └───┬───┘     │   1001,1002│       └───┬────┘ │ vnifilter:   │ │
│       │         └────────────┘           │      │  2001,2002   │ │
│       │                                  │      └──────────────┘ │
│       │                                  │                       │
└───────┼──────────────────────────────────┼───────────────────────┘
        │                                  │
        │                                  │
  ┌─────┴───────┐                          │
  │  customer1  │                    ┌─────┴──────┐
  │ host/VM     │                    │customer2   │
  └─────────────┘                    │ host/VM    │
                                     └────────────┘

With this implementation, vxlan dst metadata device can
be associated with range of vnis.
struct vxlan_vni_node is introduced to represent
a configured vni. We start with vni and its
associated remote_ip in this structure. This
structure can be extended to bring in other
per vni attributes if there are usecases for it.
A vni inherits an attribute from the base vxlan device
if there is no per vni attributes defined.

struct vxlan_dev gets a new rhashtable for
vnis called vxlan_vni_group. vxlan_vnifilter.c
implements the necessary netlink api, notifications
and helper functions to process and manage lifecycle
of vxlan_vni_node.

This patch also adds new helper functions in vxlan_multicast.c
to handle per vni remote_ip multicast groups which are part
of vxlan_vni_group.

Fix build problems:
Reported-by: Nkernel test robot <lkp@intel.com>
Signed-off-by: NRoopa Prabhu <roopa@nvidia.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

f9c4bb0b

vxlan_multicast: Move multicast helpers to a separate file · a498c595

由 Roopa Prabhu 提交于 3月 01, 2022

subsequent patches will add more helpers.
Signed-off-by: NRoopa Prabhu <roopa@nvidia.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

a498c595

vxlan_core: add helper vxlan_vni_in_use · efe0f94b

由 Roopa Prabhu 提交于 3月 01, 2022

more users in follow up patches
Signed-off-by: NRoopa Prabhu <roopa@nvidia.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

efe0f94b

vxlan_core: make multicast helper take rip and ifindex explicitly · a9508d12

由 Roopa Prabhu 提交于 3月 01, 2022

This patch changes multicast helpers to take rip and ifindex as input.
This is needed in future patches where rip can come from a pervni
structure while the ifindex can come from the vxlan device.
Signed-off-by: NRoopa Prabhu <roopa@nvidia.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

a9508d12

vxlan_core: move some fdb helpers to non-static · c63053e0

由 Roopa Prabhu 提交于 3月 01, 2022

This patch moves some fdb helpers to non-static
for use in later patches. Ideally, all fdb code
could move into its own file vxlan_fdb.c.
This can be done as a subsequent patch and is out
of scope of this series.
Signed-off-by: NRoopa Prabhu <roopa@nvidia.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

c63053e0

vxlan_core: move common declarations to private header file · 76fc217d

由 Roopa Prabhu 提交于 3月 01, 2022

This patch moves common structures and global declarations
to a shared private headerfile vxlan_private.h. Subsequent
patches use this header file as a common header file for
additional shared declarations.
Signed-off-by: NRoopa Prabhu <roopa@nvidia.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

76fc217d

vxlan_core: fix build warnings in vxlan_xmit_one · fba55a66

由 Roopa Prabhu 提交于 3月 01, 2022

Fix the below build warnings reported by kernel test robot:
   - initialize vni in vxlan_xmit_one
   - wrap label in ipv6 enabled checks in vxlan_xmit_one

warnings:
static
   drivers/net/vxlan/vxlan_core.c:2437:14: warning: variable 'label' set
but not used [-Wunused-but-set-variable]
           __be32 vni, label;
                       ^

>> drivers/net/vxlan/vxlan_core.c:2483:7: warning: variable 'vni' is
used uninitialized whenever 'if' condition is true
[-Wsometimes-uninitialized]
Reported-by: Nkernel test robot <lkp@intel.com>
Signed-off-by: NRoopa Prabhu <roopa@nvidia.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

fba55a66

vxlan: move to its own directory · 67653936

由 Roopa Prabhu 提交于 3月 01, 2022

vxlan.c has grown too long. This patch moves
it to its own directory. subsequent patches add new
functionality in new files.
Signed-off-by: NRoopa Prabhu <roopa@nvidia.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

67653936

14 2月, 2022 1 次提交

net: dev: Makes sure netif_rx() can be invoked in any context. · baebdf48

由 Sebastian Andrzej Siewior 提交于 2月 12, 2022

Dave suggested a while ago (eleven years by now) "Let's make netif_rx()
work in all contexts and get rid of netif_rx_ni()". Eric agreed and
pointed out that modern devices should use netif_receive_skb() to avoid
the overhead.
In the meantime someone added another variant, netif_rx_any_context(),
which behaves as suggested.

netif_rx() must be invoked with disabled bottom halves to ensure that
pending softirqs, which were raised within the function, are handled.
netif_rx_ni() can be invoked only from process context (bottom halves
must be enabled) because the function handles pending softirqs without
checking if bottom halves were disabled or not.
netif_rx_any_context() invokes on the former functions by checking
in_interrupts().

netif_rx() could be taught to handle both cases (disabled and enabled
bottom halves) by simply disabling bottom halves while invoking
netif_rx_internal(). The local_bh_enable() invocation will then invoke
pending softirqs only if the BH-disable counter drops to zero.

Eric is concerned about the overhead of BH-disable+enable especially in
regard to the loopback driver. As critical as this driver is, it will
receive a shortcut to avoid the additional overhead which is not needed.

Add a local_bh_disable() section in netif_rx() to ensure softirqs are
handled if needed.
Provide __netif_rx() which does not disable BH and has a lockdep assert
to ensure that interrupts are disabled. Use this shortcut in the
loopback driver and in drivers/net/*.c.
Make netif_rx_ni() and netif_rx_any_context() invoke netif_rx() so they
can be removed once they are no more users left.

Link: https://lkml.kernel.org/r/20100415.020246.218622820.davem@davemloft.netSigned-off-by: NSebastian Andrzej Siewior <bigeasy@linutronix.de>
Reviewed-by: NEric Dumazet <edumazet@google.com>
Reviewed-by: NToke Høiland-Jørgensen <toke@redhat.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

baebdf48

29 11月, 2021 1 次提交

net: vxlan: add macro definition for number of IANA VXLAN-GPE port · ed618bd8

由 Hao Chen 提交于 11月 27, 2021

Add macro definition for number of IANA VXLAN-GPE port for generic use.
Signed-off-by: NHao Chen <chenhao288@hisilicon.com>
Signed-off-by: NGuangbin Huang <huangguangbin2@huawei.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

ed618bd8

23 11月, 2021 1 次提交

net: remove .ndo_change_proto_down · 2106efda

由 Jakub Kicinski 提交于 11月 22, 2021

.ndo_change_proto_down was added seemingly to enable out-of-tree
implementations. Over 2.5yrs later we still have no real users
upstream. Hardwire the generic implementation for now, we can
revert once real users materialize. (rocker is a test vehicle,
not a user.)

We need to drop the optimization on the sysfs side, because
unlike ndos priv_flags will be changed at runtime, so we'd
need READ_ONCE/WRITE_ONCE everywhere..
Signed-off-by: NJakub Kicinski <kuba@kernel.org>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

2106efda

22 11月, 2021 2 次提交

net: annotate accesses to dev->gso_max_segs · 6d872df3

由 Eric Dumazet 提交于 11月 19, 2021

dev->gso_max_segs is written under RTNL protection, or when the device is
not yet visible, but is read locklessly.

Add netif_set_gso_max_segs() helper.

Add the READ_ONCE()/WRITE_ONCE() pairs, and use netif_set_gso_max_segs()
where we can to better document what is going on.
Signed-off-by: NEric Dumazet <edumazet@google.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

6d872df3

net: annotate accesses to dev->gso_max_size · 4b66d216

由 Eric Dumazet 提交于 11月 19, 2021

dev->gso_max_size is written under RTNL protection, or when the device is
not yet visible, but is read locklessly.

Add the READ_ONCE()/WRITE_ONCE() pairs, and use netif_set_gso_max_size()
where we can to better document what is going on.
Signed-off-by: NEric Dumazet <edumazet@google.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

4b66d216

16 11月, 2021 1 次提交

net: move gro definitions to include/net/gro.h · 4721031c

由 Eric Dumazet 提交于 11月 15, 2021

include/linux/netdevice.h became too big, move gro stuff
into include/net/gro.h
Signed-off-by: NEric Dumazet <edumazet@google.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

4721031c

23 9月, 2021 1 次提交

nexthop: Fix memory leaks in nexthop notification chain listeners · 3106a084

由 Ido Schimmel 提交于 9月 22, 2021

syzkaller discovered memory leaks [1] that can be reduced to the
following commands:

 # ip nexthop add id 1 blackhole
 # devlink dev reload pci/0000:06:00.0

As part of the reload flow, mlxsw will unregister its netdevs and then
unregister from the nexthop notification chain. Before unregistering
from the notification chain, mlxsw will receive delete notifications for
nexthop objects using netdevs registered by mlxsw or their uppers. mlxsw
will not receive notifications for nexthops using netdevs that are not
dismantled as part of the reload flow. For example, the blackhole
nexthop above that internally uses the loopback netdev as its nexthop
device.

One way to fix this problem is to have listeners flush their nexthop
tables after unregistering from the notification chain. This is
error-prone as evident by this patch and also not symmetric with the
registration path where a listener receives a dump of all the existing
nexthops.

Therefore, fix this problem by replaying delete notifications for the
listener being unregistered. This is symmetric to the registration path
and also consistent with the netdev notification chain.

The above means that unregister_nexthop_notifier(), like
register_nexthop_notifier(), will have to take RTNL in order to iterate
over the existing nexthops and that any callers of the function cannot
hold RTNL. This is true for mlxsw and netdevsim, but not for the VXLAN
driver. To avoid a deadlock, change the latter to unregister its nexthop
listener without holding RTNL, making it symmetric to the registration
path.

[1]
unreferenced object 0xffff88806173d600 (size 512):
  comm "syz-executor.0", pid 1290, jiffies 4295583142 (age 143.507s)
  hex dump (first 32 bytes):
    41 9d 1e 60 80 88 ff ff 08 d6 73 61 80 88 ff ff  A..`......sa....
    08 d6 73 61 80 88 ff ff 01 00 00 00 00 00 00 00  ..sa............
  backtrace:
    [<ffffffff81a6b576>] kmemleak_alloc_recursive include/linux/kmemleak.h:43 [inline]
    [<ffffffff81a6b576>] slab_post_alloc_hook+0x96/0x490 mm/slab.h:522
    [<ffffffff81a716d3>] slab_alloc_node mm/slub.c:3206 [inline]
    [<ffffffff81a716d3>] slab_alloc mm/slub.c:3214 [inline]
    [<ffffffff81a716d3>] kmem_cache_alloc_trace+0x163/0x370 mm/slub.c:3231
    [<ffffffff82e8681a>] kmalloc include/linux/slab.h:591 [inline]
    [<ffffffff82e8681a>] kzalloc include/linux/slab.h:721 [inline]
    [<ffffffff82e8681a>] mlxsw_sp_nexthop_obj_group_create drivers/net/ethernet/mellanox/mlxsw/spectrum_router.c:4918 [inline]
    [<ffffffff82e8681a>] mlxsw_sp_nexthop_obj_new drivers/net/ethernet/mellanox/mlxsw/spectrum_router.c:5054 [inline]
    [<ffffffff82e8681a>] mlxsw_sp_nexthop_obj_event+0x59a/0x2910 drivers/net/ethernet/mellanox/mlxsw/spectrum_router.c:5239
    [<ffffffff813ef67d>] notifier_call_chain+0xbd/0x210 kernel/notifier.c:83
    [<ffffffff813f0662>] blocking_notifier_call_chain kernel/notifier.c:318 [inline]
    [<ffffffff813f0662>] blocking_notifier_call_chain+0x72/0xa0 kernel/notifier.c:306
    [<ffffffff8384b9c6>] call_nexthop_notifiers+0x156/0x310 net/ipv4/nexthop.c:244
    [<ffffffff83852bd8>] insert_nexthop net/ipv4/nexthop.c:2336 [inline]
    [<ffffffff83852bd8>] nexthop_add net/ipv4/nexthop.c:2644 [inline]
    [<ffffffff83852bd8>] rtm_new_nexthop+0x14e8/0x4d10 net/ipv4/nexthop.c:2913
    [<ffffffff833e9a78>] rtnetlink_rcv_msg+0x448/0xbf0 net/core/rtnetlink.c:5572
    [<ffffffff83608703>] netlink_rcv_skb+0x173/0x480 net/netlink/af_netlink.c:2504
    [<ffffffff833de032>] rtnetlink_rcv+0x22/0x30 net/core/rtnetlink.c:5590
    [<ffffffff836069de>] netlink_unicast_kernel net/netlink/af_netlink.c:1314 [inline]
    [<ffffffff836069de>] netlink_unicast+0x5ae/0x7f0 net/netlink/af_netlink.c:1340
    [<ffffffff83607501>] netlink_sendmsg+0x8e1/0xe30 net/netlink/af_netlink.c:1929
    [<ffffffff832fde84>] sock_sendmsg_nosec net/socket.c:704 [inline]
    [<ffffffff832fde84>] sock_sendmsg net/socket.c:724 [inline]
    [<ffffffff832fde84>] ____sys_sendmsg+0x874/0x9f0 net/socket.c:2409
    [<ffffffff83304a44>] ___sys_sendmsg+0x104/0x170 net/socket.c:2463
    [<ffffffff83304c01>] __sys_sendmsg+0x111/0x1f0 net/socket.c:2492
    [<ffffffff83304d5d>] __do_sys_sendmsg net/socket.c:2501 [inline]
    [<ffffffff83304d5d>] __se_sys_sendmsg net/socket.c:2499 [inline]
    [<ffffffff83304d5d>] __x64_sys_sendmsg+0x7d/0xc0 net/socket.c:2499

Fixes: 2a014b20 ("mlxsw: spectrum_router: Add support for nexthop objects")
Signed-off-by: NIdo Schimmel <idosch@nvidia.com>
Reviewed-by: NPetr Machata <petrm@nvidia.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

3106a084

23 6月, 2021 1 次提交

vxlan: add missing rcu_read_lock() in neigh_reduce() · 85e8b032

由 Eric Dumazet 提交于 6月 21, 2021

syzbot complained in neigh_reduce(), because rcu_read_lock_bh()
is treated differently than rcu_read_lock()

WARNING: suspicious RCU usage
5.13.0-rc6-syzkaller #0 Not tainted
-----------------------------
include/net/addrconf.h:313 suspicious rcu_dereference_check() usage!

other info that might help us debug this:

rcu_scheduler_active = 2, debug_locks = 1
3 locks held by kworker/0:0/5:
 #0: ffff888011064d38 ((wq_completion)events){+.+.}-{0:0}, at: arch_atomic64_set arch/x86/include/asm/atomic64_64.h:34 [inline]
 #0: ffff888011064d38 ((wq_completion)events){+.+.}-{0:0}, at: atomic64_set include/asm-generic/atomic-instrumented.h:856 [inline]
 #0: ffff888011064d38 ((wq_completion)events){+.+.}-{0:0}, at: atomic_long_set include/asm-generic/atomic-long.h:41 [inline]
 #0: ffff888011064d38 ((wq_completion)events){+.+.}-{0:0}, at: set_work_data kernel/workqueue.c:617 [inline]
 #0: ffff888011064d38 ((wq_completion)events){+.+.}-{0:0}, at: set_work_pool_and_clear_pending kernel/workqueue.c:644 [inline]
 #0: ffff888011064d38 ((wq_completion)events){+.+.}-{0:0}, at: process_one_work+0x871/0x1600 kernel/workqueue.c:2247
 #1: ffffc90000ca7da8 ((work_completion)(&port->wq)){+.+.}-{0:0}, at: process_one_work+0x8a5/0x1600 kernel/workqueue.c:2251
 #2: ffffffff8bf795c0 (rcu_read_lock_bh){....}-{1:2}, at: __dev_queue_xmit+0x1da/0x3130 net/core/dev.c:4180

stack backtrace:
CPU: 0 PID: 5 Comm: kworker/0:0 Not tainted 5.13.0-rc6-syzkaller #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
Workqueue: events ipvlan_process_multicast
Call Trace:
 __dump_stack lib/dump_stack.c:79 [inline]
 dump_stack+0x141/0x1d7 lib/dump_stack.c:120
 __in6_dev_get include/net/addrconf.h:313 [inline]
 __in6_dev_get include/net/addrconf.h:311 [inline]
 neigh_reduce drivers/net/vxlan.c:2167 [inline]
 vxlan_xmit+0x34d5/0x4c30 drivers/net/vxlan.c:2919
 __netdev_start_xmit include/linux/netdevice.h:4944 [inline]
 netdev_start_xmit include/linux/netdevice.h:4958 [inline]
 xmit_one net/core/dev.c:3654 [inline]
 dev_hard_start_xmit+0x1eb/0x920 net/core/dev.c:3670
 __dev_queue_xmit+0x2133/0x3130 net/core/dev.c:4246
 ipvlan_process_multicast+0xa99/0xd70 drivers/net/ipvlan/ipvlan_core.c:287
 process_one_work+0x98d/0x1600 kernel/workqueue.c:2276
 worker_thread+0x64c/0x1120 kernel/workqueue.c:2422
 kthread+0x3b1/0x4a0 kernel/kthread.c:313
 ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:294

Fixes: f564f45c ("vxlan: add ipv6 proxy support")
Signed-off-by: NEric Dumazet <edumazet@google.com>
Reported-by: Nsyzbot <syzkaller@googlegroups.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

85e8b032

31 3月, 2021 1 次提交

vxlan: allow L4 GRO passthrough · d18931a9

由 Paolo Abeni 提交于 3月 30, 2021

When passing up an UDP GSO packet with L4 aggregation, there is
no need to segment it at the vxlan level. We can propagate the
packet untouched and let it be segmented later, if needed.

Introduce an helper to allow let the UDP socket to accept any
L4 aggregation and use it in the vxlan driver.

v1 -> v2:
 - updated to use the newly introduced UDP socket 'accept*' fields
Signed-off-by: NPaolo Abeni <pabeni@redhat.com>
Reviewed-by: NWillem de Bruijn <willemb@google.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

d18931a9

26 3月, 2021 1 次提交

vxlan: do not modify the shared tunnel info when PMTU triggers an ICMP reply · 30a93d2b

由 Antoine Tenart 提交于 3月 25, 2021

When the interface is part of a bridge or an Open vSwitch port and a
packet exceed a PMTU estimate, an ICMP reply is sent to the sender. When
using the external mode (collect metadata) the source and destination
addresses are reversed, so that Open vSwitch can match the packet
against an existing (reverse) flow.

But inverting the source and destination addresses in the shared
ip_tunnel_info will make following packets of the flow to use a wrong
destination address (packets will be tunnelled to itself), if the flow
isn't updated. Which happens with Open vSwitch, until the flow times
out.

Fixes this by uncloning the skb's ip_tunnel_info before inverting its
source and destination addresses, so that the modification will only be
made for the PTMU packet, not the following ones.

Fixes: fc68c995 ("vxlan: Support for PMTU discovery on directly bridged links")
Tested-by: NEelco Chaudron <echaudro@redhat.com>
Reviewed-by: NEelco Chaudron <echaudro@redhat.com>
Signed-off-by: NAntoine Tenart <atenart@kernel.org>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

30a93d2b

14 3月, 2021 1 次提交

drivers: net: vxlan.c: Fix declaration issue · 6fadbdd6

由 Sanjana Srinidhi 提交于 3月 13, 2021

Added a blank line after structure declaration.
This is done to maintain code uniformity.
Signed-off-by: NSanjana Srinidhi <sanjanasrinidhi1810@gmail.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

6fadbdd6

24 2月, 2021 1 次提交

vxlan: move debug check after netdev unregister · 92584ddf

由 Taehee Yoo 提交于 2月 21, 2021

The debug check must be done after unregister_netdevice_many() call --
the hlist_del_rcu() for this is done inside .ndo_stop.

This is the same with commit 0fda7600 ("geneve: move debug check after
netdev unregister")

Test commands:
    ip netns del A
    ip netns add A
    ip netns add B

    ip netns exec B ip link add vxlan0 type vxlan vni 100 local 10.0.0.1 \
	    remote 10.0.0.2 dstport 4789 srcport 4789 4789
    ip netns exec B ip link set vxlan0 netns A
    ip netns exec A ip link set vxlan0 up
    ip netns del B

Splat looks like:
[   73.176249][    T7] ------------[ cut here ]------------
[   73.178662][    T7] WARNING: CPU: 4 PID: 7 at drivers/net/vxlan.c:4743 vxlan_exit_batch_net+0x52e/0x720 [vxlan]
[   73.182597][    T7] Modules linked in: vxlan openvswitch nsh nf_conncount nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 mlx5_core nfp mlxfw ixgbevf tls sch_fq_codel nf_tables nfnetlink ip_tables x_tables unix
[   73.190113][    T7] CPU: 4 PID: 7 Comm: kworker/u16:0 Not tainted 5.11.0-rc7+ #838
[   73.193037][    T7] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1ubuntu1 04/01/2014
[   73.196986][    T7] Workqueue: netns cleanup_net
[   73.198946][    T7] RIP: 0010:vxlan_exit_batch_net+0x52e/0x720 [vxlan]
[   73.201509][    T7] Code: 00 01 00 00 0f 84 39 fd ff ff 48 89 ca 48 c1 ea 03 80 3c 1a 00 0f 85 a6 00 00 00 89 c2 48 83 c2 02 49 8b 14 d4 48 85 d2 74 ce <0f> 0b eb ca e8 b9 51 db dd 84 c0 0f 85 4a fe ff ff 48 c7 c2 80 bc
[   73.208813][    T7] RSP: 0018:ffff888100907c10 EFLAGS: 00010286
[   73.211027][    T7] RAX: 000000000000003c RBX: dffffc0000000000 RCX: ffff88800ec411f0
[   73.213702][    T7] RDX: ffff88800a278000 RSI: ffff88800fc78c70 RDI: ffff88800fc78070
[   73.216169][    T7] RBP: ffff88800b5cbdc0 R08: fffffbfff424de61 R09: fffffbfff424de61
[   73.218463][    T7] R10: ffffffffa126f307 R11: fffffbfff424de60 R12: ffff88800ec41000
[   73.220794][    T7] R13: ffff888100907d08 R14: ffff888100907c50 R15: ffff88800fc78c40
[   73.223337][    T7] FS:  0000000000000000(0000) GS:ffff888114800000(0000) knlGS:0000000000000000
[   73.225814][    T7] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   73.227616][    T7] CR2: 0000562b5cb4f4d0 CR3: 0000000105fbe001 CR4: 00000000003706e0
[   73.229700][    T7] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[   73.231820][    T7] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[   73.233844][    T7] Call Trace:
[   73.234698][    T7]  ? vxlan_err_lookup+0x3c0/0x3c0 [vxlan]
[   73.235962][    T7]  ? ops_exit_list.isra.11+0x93/0x140
[   73.237134][    T7]  cleanup_net+0x45e/0x8a0
[ ... ]

Fixes: 57b61127 ("vxlan: speedup vxlan tunnels dismantle")
Signed-off-by: NTaehee Yoo <ap420073@gmail.com>
Link: https://lore.kernel.org/r/20210221154552.11749-1-ap420073@gmail.comSigned-off-by: NJakub Kicinski <kuba@kernel.org>

92584ddf

19 1月, 2021 1 次提交

vxlan: add NETIF_F_FRAGLIST flag for dev features · cb2c5711

由 Xin Long 提交于 1月 15, 2021

Some protocol HW GSO requires fraglist supported by the device, like
SCTP. Without NETIF_F_FRAGLIST set in the dev features of vxlan, it
would have to do SW GSO before the packets enter the driver, even
when the vxlan dev and lower dev (like veth) both have the feature
of NETIF_F_GSO_SCTP.

So this patch is to add it for vxlan.
Signed-off-by: NXin Long <lucien.xin@gmail.com>
Signed-off-by: NJakub Kicinski <kuba@kernel.org>

cb2c5711

08 1月, 2021 1 次提交

udp_tunnel: remove REGISTER/UNREGISTER handling from tunnel drivers · dedc33e7

由 Jakub Kicinski 提交于 1月 06, 2021

udp_tunnel_nic handles REGISTER and UNREGISTER event, now that all
drivers use that infra we can drop the event handling in the tunnel
drivers.
Reviewed-by: NAlexander Duyck <alexanderduyck@fb.com>
Reviewed-by: NJacob Keller <jacob.e.keller@intel.com>
Signed-off-by: NJakub Kicinski <kuba@kernel.org>

dedc33e7

11 12月, 2020 1 次提交

vxlan: avoid double unlikely() notation when using IS_ERR() · a10b24b8

由 Antonio Quartulli 提交于 12月 10, 2020

The definition of IS_ERR() already applies the unlikely() notation
when checking the error status of the passed pointer. For this
reason there is no need to have the same notation outside of
IS_ERR() itself.

Clean up code by removing redundant notation.
Signed-off-by: NAntonio Quartulli <a@unstable.cc>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

a10b24b8

03 12月, 2020 1 次提交

vxlan: fix error return code in __vxlan_dev_create() · 832e0979

由 Zhang Changzhong 提交于 12月 02, 2020

Fix to return a negative error code from the error handling
case instead of 0, as done elsewhere in this function.

Fixes: 0ce1822c ("vxlan: add adjacent link to limit depth level")
Reported-by: NHulk Robot <hulkci@huawei.com>
Signed-off-by: NZhang Changzhong <zhangchangzhong@huawei.com>
Link: https://lore.kernel.org/r/1606903122-2098-1-git-send-email-zhangchangzhong@huawei.comSigned-off-by: NJakub Kicinski <kuba@kernel.org>

832e0979

01 12月, 2020 2 次提交

vxlan: Copy needed_tailroom from lowerdev · a5e74021

由 Sven Eckelmann 提交于 11月 26, 2020

While vxlan doesn't need any extra tailroom, the lowerdev might need it. In
that case, copy it over to reduce the chance for additional (re)allocations
in the transmit path.
Signed-off-by: NSven Eckelmann <sven@narfation.org>
Link: https://lore.kernel.org/r/20201126125247.1047977-2-sven@narfation.orgSigned-off-by: NJakub Kicinski <kuba@kernel.org>

a5e74021

vxlan: Add needed_headroom for lower device · 0a35dc41

由 Sven Eckelmann 提交于 11月 26, 2020

It was observed that sending data via batadv over vxlan (on top of
wireguard) reduced the performance massively compared to raw ethernet or
batadv on raw ethernet. A check of perf data showed that the
vxlan_build_skb was calling all the time pskb_expand_head to allocate
enough headroom for:

  min_headroom = LL_RESERVED_SPACE(dst->dev) + dst->header_len
  		+ VXLAN_HLEN + iphdr_len;

But the vxlan_config_apply only requested needed headroom for:

  lowerdev->hard_header_len + VXLAN6_HEADROOM or VXLAN_HEADROOM

So it completely ignored the needed_headroom of the lower device. The first
caller of net_dev_xmit could therefore never make sure that enough headroom
was allocated for the rest of the transmit path.

Cc: Annika Wickert <annika.wickert@exaring.de>
Signed-off-by: NSven Eckelmann <sven@narfation.org>
Tested-by: NAnnika Wickert <aw@awlnx.space>
Link: https://lore.kernel.org/r/20201126125247.1047977-1-sven@narfation.orgSigned-off-by: NJakub Kicinski <kuba@kernel.org>

0a35dc41

10 11月, 2020 1 次提交

net: switch to dev_get_tstats64 · b220a4a7

由 Heiner Kallweit 提交于 11月 07, 2020

Replace ip_tunnel_get_stats64() with the new identical core function
dev_get_tstats64().
Signed-off-by: NHeiner Kallweit <hkallweit1@gmail.com>
Signed-off-by: NJakub Kicinski <kuba@kernel.org>

b220a4a7

07 11月, 2020 2 次提交

nexthop: Pass extack to register_nexthop_notifier() · ce7e9c8a

由 Ido Schimmel 提交于 11月 04, 2020

This will be used by the next patch which extends the function to replay
all the existing nexthops to the notifier block being registered.

Device drivers will be able to pass extack to the function since it is
passed to them upon reload from devlink.
Signed-off-by: NIdo Schimmel <idosch@nvidia.com>
Reviewed-by: NDavid Ahern <dsahern@gmail.com>
Signed-off-by: NJakub Kicinski <kuba@kernel.org>

ce7e9c8a

nexthop: vxlan: Convert to new notification info · 1ec69d18

由 Ido Schimmel 提交于 11月 04, 2020

Convert the sole listener of the nexthop notification chain (the VXLAN
driver) to the new notification info.
Signed-off-by: NIdo Schimmel <idosch@nvidia.com>
Signed-off-by: NJakub Kicinski <kuba@kernel.org>

1ec69d18

04 11月, 2020 1 次提交

vxlan: Use a per-namespace nexthop listener instead of a global one · 626d667b

由 Ido Schimmel 提交于 11月 01, 2020

The nexthop notification chain is a per-namespace chain and not a global
one like the netdev notification chain.

Therefore, a single (global) listener cannot be registered to all these
chains simultaneously as it will result in list corruptions whenever
listeners are registered / unregistered.

Instead, register a different listener in each namespace.

Currently this is not an issue because only the VXLAN driver registers a
listener to this chain, but this is going to change with netdevsim and
mlxsw also registering their own listeners.
Signed-off-by: NIdo Schimmel <idosch@nvidia.com>
Reviewed-by: NDavid Ahern <dsahern@kernel.org>
Link: https://lore.kernel.org/r/20201101113926.705630-1-idosch@idosch.orgSigned-off-by: NJakub Kicinski <kuba@kernel.org>

626d667b

06 10月, 2020 1 次提交

vxlan: use dev_sw_netstats_rx_add() · 1f8dda1d

由 Fabian Frederick 提交于 10月 05, 2020

use new helper for netstats settings
Signed-off-by: NFabian Frederick <fabf@skynet.be>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

1f8dda1d

27 9月, 2020 1 次提交

Revert "vxlan: move encapsulation warning" · 435be28b

由 Jakub Kicinski 提交于 9月 25, 2020

This reverts commit 546c044c.

Nothing prevents user from sending frames to "external" VxLAN devices.
In fact kernel itself may generate icmp chatter.

This is fine, such frames should be dropped.

The point of the "missing encapsulation" warning was that
frames with missing encap should not make it into vxlan_xmit_one().
And vxlan_xmit() drops them cleanly, so let it just do that.

Without this revert the warning is triggered by the udp_tunnel_nic.sh
test, but the minimal repro is:

$ ip link add vxlan0 type vxlan \
     	      	     group 239.1.1.1 \
		     dev lo \
		     dstport 1234 \
		     external
$ ip li set dev vxlan0 up

[  419.165981] vxlan0: Missing encapsulation instructions
[  419.166551] WARNING: CPU: 0 PID: 1041 at drivers/net/vxlan.c:2889 vxlan_xmit+0x15c0/0x1fc0 [vxlan]
Signed-off-by: NJakub Kicinski <kuba@kernel.org>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

435be28b

26 9月, 2020 5 次提交

vxlan: fix vxlan_find_sock() documentation for l3mdev · 78ec710e

由 Fabian Frederick 提交于 9月 25, 2020

Since commit aab8cc36
("vxlan: add support for underlay in non-default VRF")

vxlan_find_sock() also checks if socket is assigned to the right
level 3 master device when lower device is not in the default VRF.
Signed-off-by: NFabian Frederick <fabf@skynet.be>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

78ec710e

vxlan: check rtnl_configure_link return code correctly · 2eabcb8a

由 Fabian Frederick 提交于 9月 25, 2020

rtnl_configure_link is always checked if < 0 for error code.
Signed-off-by: NFabian Frederick <fabf@skynet.be>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

2eabcb8a

vxlan: move encapsulation warning · 546c044c

由 Fabian Frederick 提交于 9月 25, 2020

vxlan_xmit_one() was only called from vxlan_xmit() without rdst and
info was already tested. Emit warning in that function instead
Signed-off-by: NFabian Frederick <fabf@skynet.be>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

546c044c

vxlan: add unlikely to vxlan_remcsum check · 0189399c

由 Fabian Frederick 提交于 9月 25, 2020

small optimization around checking as it's being done in all
receptions
Signed-off-by: NFabian Frederick <fabf@skynet.be>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

0189399c

vxlan: don't collect metadata if remote checksum is wrong · 2ae2904b

由 Fabian Frederick 提交于 9月 25, 2020

call vxlan_remcsum() before md filling in vxlan_rcv()
Signed-off-by: NFabian Frederick <fabf@skynet.be>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

2ae2904b

06 8月, 2020 1 次提交

Revert "vxlan: fix tos value before xmit" · a0dced17

由 Hangbin Liu 提交于 8月 05, 2020

This reverts commit 71130f29.

In commit 71130f29 ("vxlan: fix tos value before xmit") we want to
make sure the tos value are filtered by RT_TOS() based on RFC1349.

       0     1     2     3     4     5     6     7
    +-----+-----+-----+-----+-----+-----+-----+-----+
    |   PRECEDENCE    |          TOS          | MBZ |
    +-----+-----+-----+-----+-----+-----+-----+-----+

But RFC1349 has been obsoleted by RFC2474. The new DSCP field defined like

       0     1     2     3     4     5     6     7
    +-----+-----+-----+-----+-----+-----+-----+-----+
    |          DS FIELD, DSCP           | ECN FIELD |
    +-----+-----+-----+-----+-----+-----+-----+-----+

So with

IPTOS_TOS_MASK          0x1E
RT_TOS(tos)		((tos)&IPTOS_TOS_MASK)

the first 3 bits DSCP info will get lost.

To take all the DSCP info in xmit, we should revert the patch and just push
all tos bits to ip_tunnel_ecn_encap(), which will handling ECN field later.

Fixes: 71130f29 ("vxlan: fix tos value before xmit")
Signed-off-by: NHangbin Liu <liuhangbin@gmail.com>
Acked-by: NGuillaume Nault <gnault@redhat.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

a0dced17

05 8月, 2020 2 次提交

vxlan: Support for PMTU discovery on directly bridged links · fc68c995

由 Stefano Brivio 提交于 8月 04, 2020

If the interface is a bridge or Open vSwitch port, and we can't
forward a packet because it exceeds the local PMTU estimate,
trigger an ICMP or ICMPv6 reply to the sender, using the same
interface to forward it back.

If metadata collection is enabled, reverse destination and source
addresses, so that Open vSwitch is able to match this packet against
the existing, reverse flow.

v2: Use netif_is_any_bridge_port() (David Ahern)
Signed-off-by: NStefano Brivio <sbrivio@redhat.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

fc68c995

tunnels: PMTU discovery support for directly bridged IP packets · 4cb47a86

由 Stefano Brivio 提交于 8月 04, 2020

It's currently possible to bridge Ethernet tunnels carrying IP
packets directly to external interfaces without assigning them
addresses and routes on the bridged network itself: this is the case
for UDP tunnels bridged with a standard bridge or by Open vSwitch.

PMTU discovery is currently broken with those configurations, because
the encapsulation effectively decreases the MTU of the link, and
while we are able to account for this using PMTU discovery on the
lower layer, we don't have a way to relay ICMP or ICMPv6 messages
needed by the sender, because we don't have valid routes to it.

On the other hand, as a tunnel endpoint, we can't fragment packets
as a general approach: this is for instance clearly forbidden for
VXLAN by RFC 7348, section 4.3:

   VTEPs MUST NOT fragment VXLAN packets.  Intermediate routers may
   fragment encapsulated VXLAN packets due to the larger frame size.
   The destination VTEP MAY silently discard such VXLAN fragments.

The same paragraph recommends that the MTU over the physical network
accomodates for encapsulations, but this isn't a practical option for
complex topologies, especially for typical Open vSwitch use cases.

Further, it states that:

   Other techniques like Path MTU discovery (see [RFC1191] and
   [RFC1981]) MAY be used to address this requirement as well.

Now, PMTU discovery already works for routed interfaces, we get
route exceptions created by the encapsulation device as they receive
ICMP Fragmentation Needed and ICMPv6 Packet Too Big messages, and
we already rebuild those messages with the appropriate MTU and route
them back to the sender.

Add the missing bits for bridged cases:

- checks in skb_tunnel_check_pmtu() to understand if it's appropriate
  to trigger a reply according to RFC 1122 section 3.2.2 for ICMP and
  RFC 4443 section 2.4 for ICMPv6. This function is already called by
  UDP tunnels

- a new function generating those ICMP or ICMPv6 replies. We can't
  reuse icmp_send() and icmp6_send() as we don't see the sender as a
  valid destination. This doesn't need to be generic, as we don't
  cover any other type of ICMP errors given that we only provide an
  encapsulation function to the sender

While at it, make the MTU check in skb_tunnel_check_pmtu() accurate:
we might receive GSO buffers here, and the passed headroom already
includes the inner MAC length, so we don't have to account for it
a second time (that would imply three MAC headers on the wire, but
there are just two).

This issue became visible while bridging IPv6 packets with 4500 bytes
of payload over GENEVE using IPv4 with a PMTU of 4000. Given the 50
bytes of encapsulation headroom, we would advertise MTU as 3950, and
we would reject fragmented IPv6 datagrams of 3958 bytes size on the
wire. We're exclusively dealing with network MTU here, though, so we
could get Ethernet frames up to 3964 octets in that case.

v2:
- moved skb_tunnel_check_pmtu() to ip_tunnel_core.c (David Ahern)
- split IPv4/IPv6 functions (David Ahern)
Signed-off-by: NStefano Brivio <sbrivio@redhat.com>
Reviewed-by: NDavid Ahern <dsahern@gmail.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

4cb47a86

openeuler / Kernel 接近 2 年 前同步成功

openeuler / Kernel
接近 2 年前同步成功