1. 01 Mar, 2022 8 commits
    • vxlan: vni filtering support on collect metadata device · f9c4bb0b
      Roopa Prabhu committed
      This patch adds vnifiltering support to collect metadata device.
      
      Motivation:
      You can only use a single vxlan collect metadata device for a given
      vxlan udp port in the system today. The vxlan collect metadata device
      terminates all received vxlan packets. As shown in the below diagram,
      there are use-cases where you need to support multiple such vxlan devices in
      independent bridge domains. Each vxlan device must terminate the vni's
      it is configured for.
      Example usecase: In a service provider network a service provider
      typically supports multiple bridge domains with overlapping vlans.
      One bridge domain per customer. Vlans in each bridge domain are
      mapped to globally unique vxlan ranges assigned to each customer.
      
      vnifiltering support in collect metadata devices terminates only configured
      vnis. This is similar to vlan filtering in bridge driver. The vni filtering
      capability is provided by a new flag on collect metadata device.
      
      In the below pic:
      	- customer1 is mapped to br1 bridge domain
      	- customer2 is mapped to br2 bridge domain
      	- customer1 vlan 10-11 is mapped to vni 1001-1002
      	- customer2 vlan 10-11 is mapped to vni 2001-2002
      	- br1 and br2 are vlan filtering bridges
      	- vxlan1 and vxlan2 are collect metadata devices with
      	  vnifiltering enabled
      
      ┌──────────────────────────────────────────────────────────────────┐
      │  switch                                                          │
      │                                                                  │
      │         ┌───────────┐                 ┌───────────┐              │
      │         │           │                 │           │              │
      │         │   br1     │                 │   br2     │              │
      │         └┬─────────┬┘                 └──┬───────┬┘              │
      │     vlans│         │               vlans │       │               │
      │     10,11│         │                10,11│       │               │
      │          │     vlanvnimap:               │    vlanvnimap:        │
      │          │       10-1001,11-1002         │      10-2001,11-2002  │
      │          │         │                     │       │               │
      │   ┌──────┴┐     ┌──┴─────────┐       ┌───┴────┐  │               │
      │   │ swp1  │     │vxlan1      │       │ swp2   │ ┌┴─────────────┐ │
      │   │       │     │  vnifilter:│       │        │ │vxlan2        │ │
      │   └───┬───┘     │   1001,1002│       └───┬────┘ │ vnifilter:   │ │
      │       │         └────────────┘           │      │  2001,2002   │ │
      │       │                                  │      └──────────────┘ │
      │       │                                  │                       │
      └───────┼──────────────────────────────────┼───────────────────────┘
              │                                  │
              │                                  │
        ┌─────┴───────┐                          │
        │  customer1  │                    ┌─────┴──────┐
        │ host/VM     │                    │customer2   │
        └─────────────┘                    │ host/VM    │
                                           └────────────┘
      
      With this implementation, a vxlan dst metadata device can
      be associated with a range of vnis.
      struct vxlan_vni_node is introduced to represent
      a configured vni. We start with the vni and its
      associated remote_ip in this structure. This
      structure can be extended to bring in other
      per-vni attributes if there are use cases for them.
      A vni inherits an attribute from the base vxlan device
      if no per-vni attribute is defined for it.
      
      struct vxlan_dev gets a new rhashtable for
      vnis called vxlan_vni_group. vxlan_vnifilter.c
      implements the necessary netlink api, notifications
      and helper functions to process and manage lifecycle
      of vxlan_vni_node.
      
      This patch also adds new helper functions in vxlan_multicast.c
      to handle per vni remote_ip multicast groups which are part
      of vxlan_vni_group.
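
The filtering and inheritance rules described above can be illustrated with a small user-space sketch. It is not the kernel implementation: the type and function names (`vni_node`, `vxlan_dev_model`, `vni_lookup`, `vni_terminates`, `vni_remote_ip`) are hypothetical, and a linear array stands in for the rhashtable the commit mentions.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified stand-in for a configured per-VNI node. */
struct vni_node {
    uint32_t vni;
    bool has_remote_ip;   /* false: inherit remote_ip from the base device */
    uint32_t remote_ip;   /* IPv4 address as a plain integer, for brevity */
};

/* Hypothetical model of a collect metadata device with VNI filtering. */
struct vxlan_dev_model {
    uint32_t default_remote_ip;   /* attribute of the base device */
    const struct vni_node *vnis;  /* array used here instead of a hash table */
    size_t n_vnis;
};

/* Return the configured node for this VNI, or NULL if it is filtered out. */
const struct vni_node *vni_lookup(const struct vxlan_dev_model *dev,
                                  uint32_t vni)
{
    for (size_t i = 0; i < dev->n_vnis; i++)
        if (dev->vnis[i].vni == vni)
            return &dev->vnis[i];
    return NULL;
}

/* The device terminates a packet only if its VNI is configured. */
bool vni_terminates(const struct vxlan_dev_model *dev, uint32_t vni)
{
    return vni_lookup(dev, vni) != NULL;
}

/* A per-VNI remote_ip wins; otherwise the base device value is inherited. */
uint32_t vni_remote_ip(const struct vxlan_dev_model *dev,
                       const struct vni_node *node)
{
    return node->has_remote_ip ? node->remote_ip : dev->default_remote_ip;
}
```

With nodes for vni 1001 and 1002 configured on a model of vxlan1, `vni_terminates()` is true for 1001 and false for 2001, which mirrors how vxlan1 and vxlan2 in the diagram split traffic arriving on the same UDP port.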
      
      Fix build problems:
      Reported-by: kernel test robot <lkp@intel.com>
      Signed-off-by: Roopa Prabhu <roopa@nvidia.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • vxlan_multicast: Move multicast helpers to a separate file · a498c595
      Roopa Prabhu committed
      Subsequent patches will add more helpers.
      Signed-off-by: Roopa Prabhu <roopa@nvidia.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • vxlan_core: add helper vxlan_vni_in_use · efe0f94b
      Roopa Prabhu committed
      More users follow in later patches.
      Signed-off-by: Roopa Prabhu <roopa@nvidia.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • vxlan_core: make multicast helper take rip and ifindex explicitly · a9508d12
      Roopa Prabhu committed
      This patch changes the multicast helpers to take rip and ifindex as input.
      This is needed in future patches where rip can come from a per-vni
      structure while the ifindex can come from the vxlan device.
      Signed-off-by: Roopa Prabhu <roopa@nvidia.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • vxlan_core: move some fdb helpers to non-static · c63053e0
      Roopa Prabhu committed
      This patch makes some fdb helpers non-static
      for use in later patches. Ideally, all fdb code
      could move into its own file vxlan_fdb.c.
      This can be done as a subsequent patch and is out
      of scope of this series.
      Signed-off-by: Roopa Prabhu <roopa@nvidia.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • vxlan_core: move common declarations to private header file · 76fc217d
      Roopa Prabhu committed
      This patch moves common structures and global declarations
      to a shared private header file vxlan_private.h. Subsequent
      patches use this header file as a common header file for
      additional shared declarations.
      Signed-off-by: Roopa Prabhu <roopa@nvidia.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • vxlan_core: fix build warnings in vxlan_xmit_one · fba55a66
      Roopa Prabhu committed
      Fix the below build warnings reported by kernel test robot:
         - initialize vni in vxlan_xmit_one
         - wrap label in ipv6 enabled checks in vxlan_xmit_one
      
      warnings:
      static
         drivers/net/vxlan/vxlan_core.c:2437:14: warning: variable 'label' set
      but not used [-Wunused-but-set-variable]
                 __be32 vni, label;
                             ^
      
      >> drivers/net/vxlan/vxlan_core.c:2483:7: warning: variable 'vni' is
      used uninitialized whenever 'if' condition is true
      [-Wsometimes-uninitialized]
      Reported-by: kernel test robot <lkp@intel.com>
      Signed-off-by: Roopa Prabhu <roopa@nvidia.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • vxlan: move to its own directory · 67653936
      Roopa Prabhu committed
      vxlan.c has grown too long. This patch moves
      it to its own directory. Subsequent patches add new
      functionality in new files.
      Signed-off-by: Roopa Prabhu <roopa@nvidia.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  2. 14 Feb, 2022 1 commit
    • net: dev: Makes sure netif_rx() can be invoked in any context. · baebdf48
      Sebastian Andrzej Siewior committed
      Dave suggested a while ago (eleven years by now) "Let's make netif_rx()
      work in all contexts and get rid of netif_rx_ni()". Eric agreed and
      pointed out that modern devices should use netif_receive_skb() to avoid
      the overhead.
      In the meantime someone added another variant, netif_rx_any_context(),
      which behaves as suggested.
      
      netif_rx() must be invoked with disabled bottom halves to ensure that
      pending softirqs, which were raised within the function, are handled.
      netif_rx_ni() can be invoked only from process context (bottom halves
      must be enabled) because the function handles pending softirqs without
      checking if bottom halves were disabled or not.
      netif_rx_any_context() invokes one of the former functions by checking
      in_interrupt().
      
      netif_rx() could be taught to handle both cases (disabled and enabled
      bottom halves) by simply disabling bottom halves while invoking
      netif_rx_internal(). The local_bh_enable() invocation will then invoke
      pending softirqs only if the BH-disable counter drops to zero.
      
      Eric is concerned about the overhead of BH-disable+enable especially in
      regard to the loopback driver. As critical as this driver is, it will
      receive a shortcut to avoid the additional overhead which is not needed.
      
      Add a local_bh_disable() section in netif_rx() to ensure softirqs are
      handled if needed.
      Provide __netif_rx() which does not disable BH and has a lockdep assert
      to ensure that interrupts are disabled. Use this shortcut in the
      loopback driver and in drivers/net/*.c.
      Make netif_rx_ni() and netif_rx_any_context() invoke netif_rx() so they
      can be removed once there are no more users left.
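
The nesting behaviour this commit relies on (pending softirqs run only when the BH-disable counter drops to zero) can be modelled in user space. The sketch below is an assumption-laden toy, not kernel code: every `model_*` name is hypothetical, and a single flag stands in for the softirq pending mask.

```c
#include <stdbool.h>

/* User-space model only; none of this is the kernel's implementation. */
static int bh_disable_count;   /* nesting depth of BH-disabled sections */
static bool softirq_pending;   /* a softirq was raised and has not run yet */
static int softirqs_run;       /* number of times pending work was handled */

void model_local_bh_disable(void)
{
    bh_disable_count++;
}

void model_local_bh_enable(void)
{
    /* Pending softirqs are handled only when the outermost section ends. */
    if (--bh_disable_count == 0 && softirq_pending) {
        softirq_pending = false;
        softirqs_run++;
    }
}

/* Stand-in for netif_rx_internal(): enqueue the packet, raise a softirq. */
void model_netif_rx_internal(void)
{
    softirq_pending = true;
}

/* Shortcut variant: the caller guarantees BH is already disabled. */
void model_netif_rx_nobh(void)
{
    model_netif_rx_internal();
}

/* Variant usable from any context: wraps the call in a BH-disabled section. */
void model_netif_rx(void)
{
    model_local_bh_disable();
    model_netif_rx_internal();
    model_local_bh_enable();
}

int model_softirqs_run(void)
{
    return softirqs_run;
}

bool model_softirq_pending(void)
{
    return softirq_pending;
}
```

Calling `model_netif_rx()` with the counter at zero handles the raised softirq immediately. Calling it inside an outer `model_local_bh_disable()` section leaves the softirq pending until the outer `model_local_bh_enable()`, which is why one function can serve both kinds of caller.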
      
      Link: https://lkml.kernel.org/r/20100415.020246.218622820.davem@davemloft.net
      Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
      Reviewed-by: Eric Dumazet <edumazet@google.com>
      Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  3. 29 Nov, 2021 1 commit
  4. 23 Nov, 2021 1 commit
    • net: remove .ndo_change_proto_down · 2106efda
      Jakub Kicinski committed
      .ndo_change_proto_down was added seemingly to enable out-of-tree
      implementations. Over 2.5yrs later we still have no real users
      upstream. Hardwire the generic implementation for now, we can
      revert once real users materialize. (rocker is a test vehicle,
      not a user.)
      
      We need to drop the optimization on the sysfs side, because
      unlike ndos priv_flags will be changed at runtime, so we'd
      need READ_ONCE/WRITE_ONCE everywhere.
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  5. 22 Nov, 2021 2 commits
  6. 16 Nov, 2021 1 commit
  7. 23 Sep, 2021 1 commit
    • nexthop: Fix memory leaks in nexthop notification chain listeners · 3106a084
      Ido Schimmel committed
      syzkaller discovered memory leaks [1] that can be reduced to the
      following commands:
      
       # ip nexthop add id 1 blackhole
       # devlink dev reload pci/0000:06:00.0
      
      As part of the reload flow, mlxsw will unregister its netdevs and then
      unregister from the nexthop notification chain. Before unregistering
      from the notification chain, mlxsw will receive delete notifications for
      nexthop objects using netdevs registered by mlxsw or their uppers. mlxsw
      will not receive notifications for nexthops using netdevs that are not
      dismantled as part of the reload flow. For example, the blackhole
      nexthop above that internally uses the loopback netdev as its nexthop
      device.
      
      One way to fix this problem is to have listeners flush their nexthop
      tables after unregistering from the notification chain. This is
      error-prone, as evidenced by this patch, and also not symmetric with the
      registration path where a listener receives a dump of all the existing
      nexthops.
      
      Therefore, fix this problem by replaying delete notifications for the
      listener being unregistered. This is symmetric to the registration path
      and also consistent with the netdev notification chain.
      
      The above means that unregister_nexthop_notifier(), like
      register_nexthop_notifier(), will have to take RTNL in order to iterate
      over the existing nexthops and that any callers of the function cannot
      hold RTNL. This is true for mlxsw and netdevsim, but not for the VXLAN
      driver. To avoid a deadlock, change the latter to unregister its nexthop
      listener without holding RTNL, making it symmetric to the registration
      path.
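
The symmetry argued for above (a dump of existing nexthops on registration, replayed deletes on unregistration) can be sketched as a toy notifier in user space. All names here (`nh_add`, `nh_register_listener`, `nh_unregister_listener`, `counting_listener`) are hypothetical, the sketch keeps no listener list, and it ignores locking, so it only illustrates why a listener's per-nexthop state ends up balanced.

```c
#include <stddef.h>

/* Hypothetical event types for this model. */
enum nh_event { NH_EVENT_ADD, NH_EVENT_DEL };

typedef void (*nh_notifier_fn)(enum nh_event ev, unsigned int nh_id,
                               void *priv);

#define MAX_NH 8
static unsigned int nh_ids[MAX_NH];   /* existing nexthop objects */
static size_t nh_count;

/* Create a nexthop object; returns 0 on success, -1 if the table is full. */
int nh_add(unsigned int id)
{
    if (nh_count == MAX_NH)
        return -1;
    nh_ids[nh_count++] = id;
    return 0;
}

/* Registration path: the listener receives a dump of existing nexthops. */
void nh_register_listener(nh_notifier_fn fn, void *priv)
{
    for (size_t i = 0; i < nh_count; i++)
        fn(NH_EVENT_ADD, nh_ids[i], priv);
}

/* Unregistration path: replay a delete for every existing nexthop. */
void nh_unregister_listener(nh_notifier_fn fn, void *priv)
{
    for (size_t i = 0; i < nh_count; i++)
        fn(NH_EVENT_DEL, nh_ids[i], priv);
}

/* Example listener: counts the per-nexthop objects it currently holds. */
void counting_listener(enum nh_event ev, unsigned int nh_id, void *priv)
{
    int *held = priv;

    (void)nh_id;
    if (ev == NH_EVENT_ADD)
        (*held)++;
    else
        (*held)--;
}
```

A listener that allocates on add and frees on delete holds nothing after `nh_unregister_listener()` returns, even for nexthops (such as the blackhole one in the reproducer) that were never deleted.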
      
      [1]
      unreferenced object 0xffff88806173d600 (size 512):
        comm "syz-executor.0", pid 1290, jiffies 4295583142 (age 143.507s)
        hex dump (first 32 bytes):
          41 9d 1e 60 80 88 ff ff 08 d6 73 61 80 88 ff ff  A..`......sa....
          08 d6 73 61 80 88 ff ff 01 00 00 00 00 00 00 00  ..sa............
        backtrace:
          [<ffffffff81a6b576>] kmemleak_alloc_recursive include/linux/kmemleak.h:43 [inline]
          [<ffffffff81a6b576>] slab_post_alloc_hook+0x96/0x490 mm/slab.h:522
          [<ffffffff81a716d3>] slab_alloc_node mm/slub.c:3206 [inline]
          [<ffffffff81a716d3>] slab_alloc mm/slub.c:3214 [inline]
          [<ffffffff81a716d3>] kmem_cache_alloc_trace+0x163/0x370 mm/slub.c:3231
          [<ffffffff82e8681a>] kmalloc include/linux/slab.h:591 [inline]
          [<ffffffff82e8681a>] kzalloc include/linux/slab.h:721 [inline]
          [<ffffffff82e8681a>] mlxsw_sp_nexthop_obj_group_create drivers/net/ethernet/mellanox/mlxsw/spectrum_router.c:4918 [inline]
          [<ffffffff82e8681a>] mlxsw_sp_nexthop_obj_new drivers/net/ethernet/mellanox/mlxsw/spectrum_router.c:5054 [inline]
          [<ffffffff82e8681a>] mlxsw_sp_nexthop_obj_event+0x59a/0x2910 drivers/net/ethernet/mellanox/mlxsw/spectrum_router.c:5239
          [<ffffffff813ef67d>] notifier_call_chain+0xbd/0x210 kernel/notifier.c:83
          [<ffffffff813f0662>] blocking_notifier_call_chain kernel/notifier.c:318 [inline]
          [<ffffffff813f0662>] blocking_notifier_call_chain+0x72/0xa0 kernel/notifier.c:306
          [<ffffffff8384b9c6>] call_nexthop_notifiers+0x156/0x310 net/ipv4/nexthop.c:244
          [<ffffffff83852bd8>] insert_nexthop net/ipv4/nexthop.c:2336 [inline]
          [<ffffffff83852bd8>] nexthop_add net/ipv4/nexthop.c:2644 [inline]
          [<ffffffff83852bd8>] rtm_new_nexthop+0x14e8/0x4d10 net/ipv4/nexthop.c:2913
          [<ffffffff833e9a78>] rtnetlink_rcv_msg+0x448/0xbf0 net/core/rtnetlink.c:5572
          [<ffffffff83608703>] netlink_rcv_skb+0x173/0x480 net/netlink/af_netlink.c:2504
          [<ffffffff833de032>] rtnetlink_rcv+0x22/0x30 net/core/rtnetlink.c:5590
          [<ffffffff836069de>] netlink_unicast_kernel net/netlink/af_netlink.c:1314 [inline]
          [<ffffffff836069de>] netlink_unicast+0x5ae/0x7f0 net/netlink/af_netlink.c:1340
          [<ffffffff83607501>] netlink_sendmsg+0x8e1/0xe30 net/netlink/af_netlink.c:1929
          [<ffffffff832fde84>] sock_sendmsg_nosec net/socket.c:704 [inline]
          [<ffffffff832fde84>] sock_sendmsg net/socket.c:724 [inline]
          [<ffffffff832fde84>] ____sys_sendmsg+0x874/0x9f0 net/socket.c:2409
          [<ffffffff83304a44>] ___sys_sendmsg+0x104/0x170 net/socket.c:2463
          [<ffffffff83304c01>] __sys_sendmsg+0x111/0x1f0 net/socket.c:2492
          [<ffffffff83304d5d>] __do_sys_sendmsg net/socket.c:2501 [inline]
          [<ffffffff83304d5d>] __se_sys_sendmsg net/socket.c:2499 [inline]
          [<ffffffff83304d5d>] __x64_sys_sendmsg+0x7d/0xc0 net/socket.c:2499
      
      Fixes: 2a014b20 ("mlxsw: spectrum_router: Add support for nexthop objects")
      Signed-off-by: Ido Schimmel <idosch@nvidia.com>
      Reviewed-by: Petr Machata <petrm@nvidia.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  8. 23 Jun, 2021 1 commit
    • vxlan: add missing rcu_read_lock() in neigh_reduce() · 85e8b032
      Eric Dumazet committed
      syzbot complained in neigh_reduce(), because rcu_read_lock_bh()
      is treated differently than rcu_read_lock().
      
      WARNING: suspicious RCU usage
      5.13.0-rc6-syzkaller #0 Not tainted
      -----------------------------
      include/net/addrconf.h:313 suspicious rcu_dereference_check() usage!
      
      other info that might help us debug this:
      
      rcu_scheduler_active = 2, debug_locks = 1
      3 locks held by kworker/0:0/5:
       #0: ffff888011064d38 ((wq_completion)events){+.+.}-{0:0}, at: arch_atomic64_set arch/x86/include/asm/atomic64_64.h:34 [inline]
       #0: ffff888011064d38 ((wq_completion)events){+.+.}-{0:0}, at: atomic64_set include/asm-generic/atomic-instrumented.h:856 [inline]
       #0: ffff888011064d38 ((wq_completion)events){+.+.}-{0:0}, at: atomic_long_set include/asm-generic/atomic-long.h:41 [inline]
       #0: ffff888011064d38 ((wq_completion)events){+.+.}-{0:0}, at: set_work_data kernel/workqueue.c:617 [inline]
       #0: ffff888011064d38 ((wq_completion)events){+.+.}-{0:0}, at: set_work_pool_and_clear_pending kernel/workqueue.c:644 [inline]
       #0: ffff888011064d38 ((wq_completion)events){+.+.}-{0:0}, at: process_one_work+0x871/0x1600 kernel/workqueue.c:2247
       #1: ffffc90000ca7da8 ((work_completion)(&port->wq)){+.+.}-{0:0}, at: process_one_work+0x8a5/0x1600 kernel/workqueue.c:2251
       #2: ffffffff8bf795c0 (rcu_read_lock_bh){....}-{1:2}, at: __dev_queue_xmit+0x1da/0x3130 net/core/dev.c:4180
      
      stack backtrace:
      CPU: 0 PID: 5 Comm: kworker/0:0 Not tainted 5.13.0-rc6-syzkaller #0
      Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
      Workqueue: events ipvlan_process_multicast
      Call Trace:
       __dump_stack lib/dump_stack.c:79 [inline]
       dump_stack+0x141/0x1d7 lib/dump_stack.c:120
       __in6_dev_get include/net/addrconf.h:313 [inline]
       __in6_dev_get include/net/addrconf.h:311 [inline]
       neigh_reduce drivers/net/vxlan.c:2167 [inline]
       vxlan_xmit+0x34d5/0x4c30 drivers/net/vxlan.c:2919
       __netdev_start_xmit include/linux/netdevice.h:4944 [inline]
       netdev_start_xmit include/linux/netdevice.h:4958 [inline]
       xmit_one net/core/dev.c:3654 [inline]
       dev_hard_start_xmit+0x1eb/0x920 net/core/dev.c:3670
       __dev_queue_xmit+0x2133/0x3130 net/core/dev.c:4246
       ipvlan_process_multicast+0xa99/0xd70 drivers/net/ipvlan/ipvlan_core.c:287
       process_one_work+0x98d/0x1600 kernel/workqueue.c:2276
       worker_thread+0x64c/0x1120 kernel/workqueue.c:2422
       kthread+0x3b1/0x4a0 kernel/kthread.c:313
       ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:294
      
      Fixes: f564f45c ("vxlan: add ipv6 proxy support")
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Reported-by: syzbot <syzkaller@googlegroups.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  9. 31 Mar, 2021 1 commit
  10. 26 Mar, 2021 1 commit
    • vxlan: do not modify the shared tunnel info when PMTU triggers an ICMP reply · 30a93d2b
      Antoine Tenart committed
      When the interface is part of a bridge or an Open vSwitch port and a
      packet exceeds a PMTU estimate, an ICMP reply is sent to the sender. When
      using the external mode (collect metadata) the source and destination
      addresses are reversed, so that Open vSwitch can match the packet
      against an existing (reverse) flow.
      
      But inverting the source and destination addresses in the shared
      ip_tunnel_info makes the following packets of the flow use a wrong
      destination address (packets will be tunnelled to itself) if the flow
      isn't updated, which is what happens with Open vSwitch until the flow
      times out.
      
      Fix this by uncloning the skb's ip_tunnel_info before inverting its
      source and destination addresses, so that the modification is only
      made for the PMTU packet, not the following ones.
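
The fix is an instance of copy-before-write on shared data. A minimal user-space sketch of that idea follows; the structures and helpers (`tun_info_model`, `pkt_model`, `pkt_unclone_info`, `pkt_reverse_addrs`, `pkt_release`) are hypothetical and much simpler than the kernel's skb and ip_tunnel_info handling.

```c
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical model of tunnel metadata shared by all packets of a flow. */
struct tun_info_model {
    uint32_t src;
    uint32_t dst;
};

struct pkt_model {
    struct tun_info_model *info;  /* may point at flow-wide shared data */
    int info_is_private;          /* nonzero once the packet owns a copy */
};

/* Give the packet its own copy of the metadata. Returns 0 or -1. */
int pkt_unclone_info(struct pkt_model *pkt)
{
    struct tun_info_model *copy;

    if (pkt->info_is_private)
        return 0;
    copy = malloc(sizeof(*copy));
    if (!copy)
        return -1;
    *copy = *pkt->info;
    pkt->info = copy;
    pkt->info_is_private = 1;
    return 0;
}

/* Reverse source and destination for this packet only. */
int pkt_reverse_addrs(struct pkt_model *pkt)
{
    uint32_t tmp;

    if (pkt_unclone_info(pkt))
        return -1;
    tmp = pkt->info->src;
    pkt->info->src = pkt->info->dst;
    pkt->info->dst = tmp;
    return 0;
}

/* Free the private copy, if the packet has one. */
void pkt_release(struct pkt_model *pkt)
{
    if (pkt->info_is_private)
        free(pkt->info);
    pkt->info = NULL;
    pkt->info_is_private = 0;
}
```

After `pkt_reverse_addrs()` the reply packet carries swapped addresses while the shared structure is untouched, so later packets of the same flow keep their original destination.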
      
      Fixes: fc68c995 ("vxlan: Support for PMTU discovery on directly bridged links")
      Tested-by: Eelco Chaudron <echaudro@redhat.com>
      Reviewed-by: Eelco Chaudron <echaudro@redhat.com>
      Signed-off-by: Antoine Tenart <atenart@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  11. 14 Mar, 2021 1 commit
  12. 24 Feb, 2021 1 commit
    • vxlan: move debug check after netdev unregister · 92584ddf
      Taehee Yoo committed
      The debug check must be done after unregister_netdevice_many() call --
      the hlist_del_rcu() for this is done inside .ndo_stop.
      
      This is the same as commit 0fda7600 ("geneve: move debug check after
      netdev unregister")
      
      Test commands:
          ip netns del A
          ip netns add A
          ip netns add B
      
          ip netns exec B ip link add vxlan0 type vxlan vni 100 local 10.0.0.1 \
      	    remote 10.0.0.2 dstport 4789 srcport 4789 4789
          ip netns exec B ip link set vxlan0 netns A
          ip netns exec A ip link set vxlan0 up
          ip netns del B
      
      Splat looks like:
      [   73.176249][    T7] ------------[ cut here ]------------
      [   73.178662][    T7] WARNING: CPU: 4 PID: 7 at drivers/net/vxlan.c:4743 vxlan_exit_batch_net+0x52e/0x720 [vxlan]
      [   73.182597][    T7] Modules linked in: vxlan openvswitch nsh nf_conncount nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 mlx5_core nfp mlxfw ixgbevf tls sch_fq_codel nf_tables nfnetlink ip_tables x_tables unix
      [   73.190113][    T7] CPU: 4 PID: 7 Comm: kworker/u16:0 Not tainted 5.11.0-rc7+ #838
      [   73.193037][    T7] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1ubuntu1 04/01/2014
      [   73.196986][    T7] Workqueue: netns cleanup_net
      [   73.198946][    T7] RIP: 0010:vxlan_exit_batch_net+0x52e/0x720 [vxlan]
      [   73.201509][    T7] Code: 00 01 00 00 0f 84 39 fd ff ff 48 89 ca 48 c1 ea 03 80 3c 1a 00 0f 85 a6 00 00 00 89 c2 48 83 c2 02 49 8b 14 d4 48 85 d2 74 ce <0f> 0b eb ca e8 b9 51 db dd 84 c0 0f 85 4a fe ff ff 48 c7 c2 80 bc
      [   73.208813][    T7] RSP: 0018:ffff888100907c10 EFLAGS: 00010286
      [   73.211027][    T7] RAX: 000000000000003c RBX: dffffc0000000000 RCX: ffff88800ec411f0
      [   73.213702][    T7] RDX: ffff88800a278000 RSI: ffff88800fc78c70 RDI: ffff88800fc78070
      [   73.216169][    T7] RBP: ffff88800b5cbdc0 R08: fffffbfff424de61 R09: fffffbfff424de61
      [   73.218463][    T7] R10: ffffffffa126f307 R11: fffffbfff424de60 R12: ffff88800ec41000
      [   73.220794][    T7] R13: ffff888100907d08 R14: ffff888100907c50 R15: ffff88800fc78c40
      [   73.223337][    T7] FS:  0000000000000000(0000) GS:ffff888114800000(0000) knlGS:0000000000000000
      [   73.225814][    T7] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [   73.227616][    T7] CR2: 0000562b5cb4f4d0 CR3: 0000000105fbe001 CR4: 00000000003706e0
      [   73.229700][    T7] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      [   73.231820][    T7] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      [   73.233844][    T7] Call Trace:
      [   73.234698][    T7]  ? vxlan_err_lookup+0x3c0/0x3c0 [vxlan]
      [   73.235962][    T7]  ? ops_exit_list.isra.11+0x93/0x140
      [   73.237134][    T7]  cleanup_net+0x45e/0x8a0
      [ ... ]
      
      Fixes: 57b61127 ("vxlan: speedup vxlan tunnels dismantle")
      Signed-off-by: Taehee Yoo <ap420073@gmail.com>
      Link: https://lore.kernel.org/r/20210221154552.11749-1-ap420073@gmail.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
  13. 19 Jan, 2021 1 commit
  14. 08 Jan, 2021 1 commit
  15. 11 Dec, 2020 1 commit
  16. 03 Dec, 2020 1 commit
  17. 01 Dec, 2020 2 commits
  18. 10 Nov, 2020 1 commit
  19. 07 Nov, 2020 2 commits
  20. 04 Nov, 2020 1 commit
  21. 06 Oct, 2020 1 commit
  22. 27 Sep, 2020 1 commit
    • Revert "vxlan: move encapsulation warning" · 435be28b
      Jakub Kicinski committed
      This reverts commit 546c044c.
      
      Nothing prevents a user from sending frames to "external" VxLAN devices.
      In fact the kernel itself may generate icmp chatter.
      
      This is fine, such frames should be dropped.
      
      The point of the "missing encapsulation" warning was that
      frames with missing encap should not make it into vxlan_xmit_one().
      And vxlan_xmit() drops them cleanly, so let it just do that.
      
      Without this revert the warning is triggered by the udp_tunnel_nic.sh
      test, but the minimal repro is:
      
      $ ip link add vxlan0 type vxlan \
           	      	     group 239.1.1.1 \
      		     dev lo \
      		     dstport 1234 \
      		     external
      $ ip li set dev vxlan0 up
      
      [  419.165981] vxlan0: Missing encapsulation instructions
      [  419.166551] WARNING: CPU: 0 PID: 1041 at drivers/net/vxlan.c:2889 vxlan_xmit+0x15c0/0x1fc0 [vxlan]
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  23. 26 Sep, 2020 5 commits
  24. 06 Aug, 2020 1 commit
    • Revert "vxlan: fix tos value before xmit" · a0dced17
      Hangbin Liu committed
      This reverts commit 71130f29.
      
      In commit 71130f29 ("vxlan: fix tos value before xmit") we wanted to
      make sure the tos value is filtered by RT_TOS() based on RFC1349.
      
             0     1     2     3     4     5     6     7
          +-----+-----+-----+-----+-----+-----+-----+-----+
          |   PRECEDENCE    |          TOS          | MBZ |
          +-----+-----+-----+-----+-----+-----+-----+-----+
      
      But RFC1349 has been obsoleted by RFC2474. The new DSCP field is defined as
      
             0     1     2     3     4     5     6     7
          +-----+-----+-----+-----+-----+-----+-----+-----+
          |          DS FIELD, DSCP           | ECN FIELD |
          +-----+-----+-----+-----+-----+-----+-----+-----+
      
      So with
      
      IPTOS_TOS_MASK          0x1E
      RT_TOS(tos)		((tos)&IPTOS_TOS_MASK)
      
      the first 3 bits of DSCP info will get lost.
      
      To take all the DSCP info in xmit, we should revert the patch and just push
      all tos bits to ip_tunnel_ecn_encap(), which will handle the ECN field later.
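
The bit loss is easy to check by hand. The sketch below redefines the two macros exactly as quoted in the commit message and adds two hypothetical helpers, `dscp_of()` and `tos_after_rt_tos()`, to show what happens to a DSCP value when the mask is applied.

```c
#include <stdint.h>

/* Definitions as quoted in the commit message. */
#define IPTOS_TOS_MASK 0x1E
#define RT_TOS(tos)    ((tos) & IPTOS_TOS_MASK)

/* DSCP is the upper six bits of the DS field (RFC 2474). */
uint8_t dscp_of(uint8_t tos)
{
    return (uint8_t)(tos >> 2);
}

/* What the reverted patch would have passed on to the encapsulation. */
uint8_t tos_after_rt_tos(uint8_t tos)
{
    return (uint8_t)RT_TOS(tos);
}
```

For example, DSCP 46 (Expedited Forwarding) is encoded as tos 0xB8. `tos_after_rt_tos(0xB8)` yields 0x18, whose DSCP is 6: the three most significant DSCP bits were masked away, which is the loss the revert avoids.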
      
      Fixes: 71130f29 ("vxlan: fix tos value before xmit")
      Signed-off-by: Hangbin Liu <liuhangbin@gmail.com>
      Acked-by: Guillaume Nault <gnault@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  25. 05 Aug, 2020 2 commits
    • vxlan: Support for PMTU discovery on directly bridged links · fc68c995
      Stefano Brivio committed
      If the interface is a bridge or Open vSwitch port, and we can't
      forward a packet because it exceeds the local PMTU estimate,
      trigger an ICMP or ICMPv6 reply to the sender, using the same
      interface to forward it back.
      
      If metadata collection is enabled, reverse destination and source
      addresses, so that Open vSwitch is able to match this packet against
      the existing, reverse flow.
      
      v2: Use netif_is_any_bridge_port() (David Ahern)
      Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • tunnels: PMTU discovery support for directly bridged IP packets · 4cb47a86
      Stefano Brivio committed
      It's currently possible to bridge Ethernet tunnels carrying IP
      packets directly to external interfaces without assigning them
      addresses and routes on the bridged network itself: this is the case
      for UDP tunnels bridged with a standard bridge or by Open vSwitch.
      
      PMTU discovery is currently broken with those configurations, because
      the encapsulation effectively decreases the MTU of the link, and
      while we are able to account for this using PMTU discovery on the
      lower layer, we don't have a way to relay ICMP or ICMPv6 messages
      needed by the sender, because we don't have valid routes to it.
      
      On the other hand, as a tunnel endpoint, we can't fragment packets
      as a general approach: this is for instance clearly forbidden for
      VXLAN by RFC 7348, section 4.3:
      
         VTEPs MUST NOT fragment VXLAN packets.  Intermediate routers may
         fragment encapsulated VXLAN packets due to the larger frame size.
         The destination VTEP MAY silently discard such VXLAN fragments.
      
      The same paragraph recommends that the MTU over the physical network
      accommodates encapsulation, but this isn't a practical option for
      complex topologies, especially for typical Open vSwitch use cases.
      
      Further, it states that:
      
         Other techniques like Path MTU discovery (see [RFC1191] and
         [RFC1981]) MAY be used to address this requirement as well.
      
      Now, PMTU discovery already works for routed interfaces, we get
      route exceptions created by the encapsulation device as they receive
      ICMP Fragmentation Needed and ICMPv6 Packet Too Big messages, and
      we already rebuild those messages with the appropriate MTU and route
      them back to the sender.
      
      Add the missing bits for bridged cases:
      
      - checks in skb_tunnel_check_pmtu() to understand if it's appropriate
        to trigger a reply according to RFC 1122 section 3.2.2 for ICMP and
        RFC 4443 section 2.4 for ICMPv6. This function is already called by
        UDP tunnels
      
      - a new function generating those ICMP or ICMPv6 replies. We can't
        reuse icmp_send() and icmp6_send() as we don't see the sender as a
        valid destination. This doesn't need to be generic, as we don't
        cover any other type of ICMP errors given that we only provide an
        encapsulation function to the sender
      
      While at it, make the MTU check in skb_tunnel_check_pmtu() accurate:
      we might receive GSO buffers here, and the passed headroom already
      includes the inner MAC length, so we don't have to account for it
      a second time (that would imply three MAC headers on the wire, but
      there are just two).
      
      This issue became visible while bridging IPv6 packets with 4500 bytes
      of payload over GENEVE using IPv4 with a PMTU of 4000. Given the 50
      bytes of encapsulation headroom, we would advertise MTU as 3950, and
      we would reject fragmented IPv6 datagrams of 3958 bytes size on the
      wire. We're exclusively dealing with network MTU here, though, so we
      could get Ethernet frames up to 3964 octets in that case.
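
The numbers in this paragraph can be reproduced with a few lines of arithmetic. The sketch below is one reading that matches them, under stated assumptions: the 3958-byte figure is taken to be the largest non-final IPv6 fragment that fits a 3950-byte MTU (40-byte IPv6 header, 8-byte fragment header, payload rounded down to a multiple of 8) plus a 14-byte Ethernet header. The function names are hypothetical.

```c
enum {
    ETH_HDR_LEN = 14,        /* inner Ethernet header */
    IPV6_HDR_LEN = 40,       /* fixed IPv6 header */
    IPV6_FRAG_HDR_LEN = 8,   /* IPv6 fragment extension header */
};

/* MTU advertised to the sender: lower-layer PMTU minus encapsulation. */
int advertised_mtu(int lower_pmtu, int encap_headroom)
{
    return lower_pmtu - encap_headroom;
}

/* Largest non-final IPv6 fragment: its payload is a multiple of 8 bytes. */
int max_ipv6_fragment_len(int mtu)
{
    int payload = (mtu - IPV6_HDR_LEN - IPV6_FRAG_HDR_LEN) & ~7;

    return IPV6_HDR_LEN + IPV6_FRAG_HDR_LEN + payload;
}

/* Size of the Ethernet frame carrying an IP packet of the given length. */
int on_wire_len(int ip_len)
{
    return ip_len + ETH_HDR_LEN;
}
```

With a PMTU of 4000 and 50 bytes of headroom, `advertised_mtu()` gives 3950, the largest fragment is 3944 bytes of IP, and its frame is 3958 bytes. A check that counts the inner MAC header a second time would compare 3958 against 3950 and reject it, although frames up to `on_wire_len(3950)`, that is 3964 bytes, are legitimate.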
      
      v2:
      - moved skb_tunnel_check_pmtu() to ip_tunnel_core.c (David Ahern)
      - split IPv4/IPv6 functions (David Ahern)
      Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
      Reviewed-by: David Ahern <dsahern@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>