1. 11 Nov, 2014 1 commit
  2. 18 Oct, 2014 1 commit
  3. 16 Oct, 2014 2 commits
  4. 08 Oct, 2014 1 commit
    • net: better IFF_XMIT_DST_RELEASE support · 02875878
      Committed by Eric Dumazet
      Testing xmit_more support with netperf and connected UDP sockets,
      I found strange dst refcount false sharing.
      
      Current handling of IFF_XMIT_DST_RELEASE is not optimal.
      
      Dropping dst in validate_xmit_skb() is certainly too late in case the
      packet was queued by cpu X but dequeued by cpu Y.
      
      The logical point to take care of drop/force is in __dev_queue_xmit()
      before even taking qdisc lock.
      
      As Julian Anastasov pointed out, need for skb_dst() might come from some
      packet schedulers or classifiers.
      
      This patch adds new helper to cleanly express needs of various drivers
      or qdiscs/classifiers.
      
      Drivers that need skb_dst() in their ndo_start_xmit() should, in their
      setup, replace the prior flag manipulation with the following helper:
      
      	dev->priv_flags &= ~IFF_XMIT_DST_RELEASE;
      ->
      	netif_keep_dst(dev);
      
      Instead of using a single bit, we use two bits, one being
      eventually rebuilt in bonding/team drivers.
      
      The other one, is permanent and blocks IFF_XMIT_DST_RELEASE being
      rebuilt in bonding/team. Eventually, we could add something
      smarter later.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Cc: Julian Anastasov <ja@ssi.bg>
      Signed-off-by: David S. Miller <davem@davemloft.net>
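
The two-bit scheme described in this commit can be modelled in user space roughly as follows. This is a simplified sketch, not the kernel code: `struct fake_dev`, the `XMIT_DST_RELEASE` and `XMIT_DST_RELEASE_PERM` constants, and every function name here are hypothetical stand-ins; only the idea (one rebuildable bit, one permanent bit that blocks the rebuild) comes from the commit message.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical flag bits modelling the two bits the commit describes. */
#define XMIT_DST_RELEASE      (1u << 0) /* may be rebuilt by bonding/team */
#define XMIT_DST_RELEASE_PERM (1u << 1) /* permanent: once cleared, never rebuilt */

struct fake_dev {
	unsigned int priv_flags;
};

/* New devices start with both bits set: dst may be dropped early. */
static void fake_dev_init(struct fake_dev *dev)
{
	assert(dev != NULL);
	dev->priv_flags = XMIT_DST_RELEASE | XMIT_DST_RELEASE_PERM;
}

/* Model of the new helper: the device needs skb_dst() in its xmit path. */
static void model_keep_dst(struct fake_dev *dev)
{
	dev->priv_flags &= ~(XMIT_DST_RELEASE | XMIT_DST_RELEASE_PERM);
}

/* Model of a bonding/team master recomputing its flag from its slaves. */
static void rebuild_master_flags(struct fake_dev *master,
				 const struct fake_dev *slaves, size_t n)
{
	bool release = true;

	for (size_t i = 0; i < n; i++)
		if (!(slaves[i].priv_flags & XMIT_DST_RELEASE))
			release = false;

	master->priv_flags &= ~XMIT_DST_RELEASE;
	/* Only re-enable if the master itself never asked to keep dst. */
	if (release && (master->priv_flags & XMIT_DST_RELEASE_PERM))
		master->priv_flags |= XMIT_DST_RELEASE;
}
```

In this model, a master whose slaves all allow early release gets the rebuildable bit back, while a master that called `model_keep_dst()` itself stays opted out regardless of its slaves.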
  5. 02 Oct, 2014 1 commit
  6. 24 Sep, 2014 1 commit
  7. 20 Sep, 2014 1 commit
  8. 02 Sep, 2014 1 commit
  9. 30 Aug, 2014 1 commit
    • net: Clarification of CHECKSUM_UNNECESSARY · 77cffe23
      Committed by Tom Herbert
      This patch:
       - Clarifies the specific requirements of devices returning
         CHECKSUM_UNNECESSARY (comments in skbuff.h).
       - Adds csum_level field to skbuff. This is used to express how
         many checksums are covered by CHECKSUM_UNNECESSARY (stores n - 1).
         This replaces the overloading of skb->encapsulation; that field is
         now only used to indicate inner headers are valid.
       - Change __skb_checksum_validate_needed to "consume" each checksum
         as indicated by csum_level as layers of the packet are parsed.
       - Remove skb_pop_rcv_encapsulation, no longer needed in the new
         csum_level model.
      Signed-off-by: Tom Herbert <therbert@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
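
The "consume one checksum per layer" behaviour described in this commit can be sketched as below. This is a user-space model under stated assumptions: `struct model_skb`, the `MODEL_CSUM_*` values and both function names are hypothetical; only the `csum_level` field name and its "n - 1" encoding are taken from the commit message.

```c
#include <assert.h>
#include <stdbool.h>

enum model_ip_summed { MODEL_CSUM_NONE, MODEL_CSUM_UNNECESSARY };

struct model_skb {
	enum model_ip_summed ip_summed;
	unsigned int csum_level; /* number of verified checksums minus one */
};

/* Mark a packet as having n (n >= 1) checksums verified by the device. */
static void model_set_verified(struct model_skb *skb, unsigned int n)
{
	assert(n >= 1);
	skb->ip_summed = MODEL_CSUM_UNNECESSARY;
	skb->csum_level = n - 1;
}

/*
 * Called once per protocol layer that carries a checksum.
 * Returns true if software must validate this layer's checksum,
 * false if the device already covered it (one level is consumed).
 */
static bool model_checksum_validate_needed(struct model_skb *skb)
{
	if (skb->ip_summed != MODEL_CSUM_UNNECESSARY)
		return true;

	if (skb->csum_level == 0)
		skb->ip_summed = MODEL_CSUM_NONE; /* last covered checksum used up */
	else
		skb->csum_level--;
	return false;
}
```

With two verified checksums (for example an outer and an inner one), the first two layers skip software validation and a third layer would have to validate on its own.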
  10. 23 Aug, 2014 1 commit
    • vxlan: fix incorrect initializer in union vxlan_addr · a45e92a5
      Committed by Gerhard Stenzel
      The first initializer in the following
      
              union vxlan_addr ipa = {
                  .sin.sin_addr.s_addr = tip,
                  .sa.sa_family = AF_INET,
              };
      
      is optimised away by the compiler, due to the second initializer,
      therefore always initialising .sin.sin_addr.s_addr to 0.
      As a result, netlink messages indicating an L3 miss never contain the
      missed IP address. This was observed with GCC 4.8 and 4.9; I do not know
      about previous versions.
      The problem affects user space programs relying on an IP address being
      sent as part of a netlink message indicating a L3 miss.
      
      Changing
                  .sa.sa_family = AF_INET,
      to
                  .sin.sin_family = AF_INET,
      fixes the problem.
      Signed-off-by: Gerhard Stenzel <gerhard.stenzel@de.ibm.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
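
A user-space sketch of the corrected initializer pattern follows. It assumes a hypothetical `union model_addr` that mirrors only the IPv4 part of the kernel union, and a hypothetical helper `model_make_ipv4()`. Only the fixed form is shown, because the outcome of the original form was reported for specific GCC versions and is not asserted here.

```c
#include <assert.h>
#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdint.h>
#include <sys/socket.h>

/* User-space stand-in for the kernel's union vxlan_addr (IPv4 part only). */
union model_addr {
	struct sockaddr_in sin;
	struct sockaddr    sa;
};

/* Build an IPv4 address using designators of a single union member. */
static union model_addr model_make_ipv4(uint32_t tip)
{
	union model_addr ipa = {
		.sin.sin_addr.s_addr = tip,
		/* Same union member as above, so the address is not overridden. */
		.sin.sin_family = AF_INET,
	};
	return ipa;
}
```

Because both designators name fields of `.sin`, they initialise two subobjects of the same union member instead of switching the active member halfway through the initializer list.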
  11. 29 Jul, 2014 1 commit
  12. 15 Jul, 2014 1 commit
  13. 11 Jul, 2014 1 commit
  14. 08 Jul, 2014 1 commit
  15. 15 Jun, 2014 1 commit
  16. 14 Jun, 2014 1 commit
  17. 12 Jun, 2014 1 commit
  18. 05 Jun, 2014 1 commit
  19. 14 May, 2014 1 commit
  20. 25 Apr, 2014 1 commit
    • vxlan: add x-netns support · f01ec1c0
      Committed by Nicolas Dichtel
      This patch allows switching the netns when a packet is encapsulated or
      decapsulated.
      The vxlan socket is opened in the i/o netns, i.e. the netns where
      encapsulated packets are received. The socket lookup is done in this netns to
      find the corresponding vxlan tunnel. After decapsulation, the packet is
      injected into the corresponding interface, which may reside in another netns.
      
      When one of the two netns is removed, the tunnel is destroyed.
      
      Configuration example:
      ip netns add netns1
      ip netns exec netns1 ip link set lo up
      ip link add vxlan10 type vxlan id 10 group 239.0.0.10 dev eth0 dstport 0
      ip link set vxlan10 netns netns1
      ip netns exec netns1 ip addr add 192.168.0.249/24 broadcast 192.168.0.255 dev vxlan10
      ip netns exec netns1 ip link set vxlan10 up
      Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  21. 24 Apr, 2014 1 commit
  22. 16 Apr, 2014 1 commit
  23. 03 Apr, 2014 1 commit
    • net: vxlan: fix crash when interface is created with no group · 5933a7bb
      Committed by Mike Rapoport
      If the vxlan interface is created without explicit group definition,
      there are corner cases which may cause kernel panic.
      
      For instance, in the following scenario:
      
      node A:
      $ ip link add dev vxlan42  address 2c:c2:60:00:10:20 type vxlan id 42
      $ ip addr add dev vxlan42 10.0.0.1/24
      $ ip link set up dev vxlan42
      $ arp -i vxlan42 -s 10.0.0.2 2c:c2:60:00:01:02
      $ bridge fdb add dev vxlan42 to 2c:c2:60:00:01:02 dst <IPv4 address>
      $ ping 10.0.0.2
      
      node B:
      $ ip link add dev vxlan42 address 2c:c2:60:00:01:02 type vxlan id 42
      $ ip addr add dev vxlan42 10.0.0.2/24
      $ ip link set up dev vxlan42
      $ arp -i vxlan42 -s 10.0.0.1 2c:c2:60:00:10:20
      
      node B crashes:
      
       vxlan42: 2c:c2:60:00:10:20 migrated from 4011:eca4:c0a8:6466:c0a8:6415:8e09:2118 to (invalid address)
       vxlan42: 2c:c2:60:00:10:20 migrated from 4011:eca4:c0a8:6466:c0a8:6415:8e09:2118 to (invalid address)
       BUG: unable to handle kernel NULL pointer dereference at 0000000000000046
       IP: [<ffffffff8143c459>] ip6_route_output+0x58/0x82
       PGD 7bd89067 PUD 7bd4e067 PMD 0
       Oops: 0000 [#1] SMP
       Modules linked in:
       CPU: 1 PID: 0 Comm: swapper/1 Not tainted 3.14.0-rc8-hvx-xen-00019-g97a5221f-dirty #154
       Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
       task: ffff88007c774f50 ti: ffff88007c79c000 task.ti: ffff88007c79c000
       RIP: 0010:[<ffffffff8143c459>]  [<ffffffff8143c459>] ip6_route_output+0x58/0x82
       RSP: 0018:ffff88007fd03668  EFLAGS: 00010282
       RAX: 0000000000000000 RBX: ffffffff8186a000 RCX: 0000000000000040
       RDX: 0000000000000000 RSI: ffff88007b0e4a80 RDI: ffff88007fd03754
       RBP: ffff88007fd03688 R08: ffff88007b0e4a80 R09: 0000000000000000
       R10: 0200000a0100000a R11: 0001002200000000 R12: ffff88007fd03740
       R13: ffff88007b0e4a80 R14: ffff88007b0e4a80 R15: ffff88007bba0c50
       FS:  0000000000000000(0000) GS:ffff88007fd00000(0000) knlGS:0000000000000000
       CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
       CR2: 0000000000000046 CR3: 000000007bb60000 CR4: 00000000000006e0
       Stack:
        0000000000000000 ffff88007fd037a0 ffffffff8186a000 ffff88007fd03740
        ffff88007fd036c8 ffffffff814320bb 0000000000006e49 ffff88007b8b7360
        ffff88007bdbf200 ffff88007bcbc000 ffff88007b8b7000 ffff88007b8b7360
       Call Trace:
        <IRQ>
        [<ffffffff814320bb>] ip6_dst_lookup_tail+0x2d/0xa4
        [<ffffffff814322a5>] ip6_dst_lookup+0x10/0x12
        [<ffffffff81323b4e>] vxlan_xmit_one+0x32a/0x68c
        [<ffffffff814a325a>] ? _raw_spin_unlock_irqrestore+0x12/0x14
        [<ffffffff8104c551>] ? lock_timer_base.isra.23+0x26/0x4b
        [<ffffffff8132451a>] vxlan_xmit+0x66a/0x6a8
        [<ffffffff8141a365>] ? ipt_do_table+0x35f/0x37e
        [<ffffffff81204ba2>] ? selinux_ip_postroute+0x41/0x26e
        [<ffffffff8139d0c1>] dev_hard_start_xmit+0x2ce/0x3ce
        [<ffffffff8139d491>] __dev_queue_xmit+0x2d0/0x392
        [<ffffffff813b380f>] ? eth_header+0x28/0xb5
        [<ffffffff8139d569>] dev_queue_xmit+0xb/0xd
        [<ffffffff813a5aa6>] neigh_resolve_output+0x134/0x152
        [<ffffffff813db741>] ip_finish_output2+0x236/0x299
        [<ffffffff813dc074>] ip_finish_output+0x98/0x9d
        [<ffffffff813dc749>] ip_output+0x62/0x67
        [<ffffffff813da9f2>] dst_output+0xf/0x11
        [<ffffffff813dc11c>] ip_local_out+0x1b/0x1f
        [<ffffffff813dcf1b>] ip_send_skb+0x11/0x37
        [<ffffffff813dcf70>] ip_push_pending_frames+0x2f/0x33
        [<ffffffff813ff732>] icmp_push_reply+0x106/0x115
        [<ffffffff813ff9e4>] icmp_reply+0x142/0x164
        [<ffffffff813ffb3b>] icmp_echo.part.16+0x46/0x48
        [<ffffffff813c1d30>] ? nf_iterate+0x43/0x80
        [<ffffffff813d8037>] ? xfrm4_policy_check.constprop.11+0x52/0x52
        [<ffffffff813ffb62>] icmp_echo+0x25/0x27
        [<ffffffff814005f7>] icmp_rcv+0x1d2/0x20a
        [<ffffffff813d8037>] ? xfrm4_policy_check.constprop.11+0x52/0x52
        [<ffffffff813d810d>] ip_local_deliver_finish+0xd6/0x14f
        [<ffffffff813d8037>] ? xfrm4_policy_check.constprop.11+0x52/0x52
        [<ffffffff813d7fde>] NF_HOOK.constprop.10+0x4c/0x53
        [<ffffffff813d82bf>] ip_local_deliver+0x4a/0x4f
        [<ffffffff813d7f7b>] ip_rcv_finish+0x253/0x26a
        [<ffffffff813d7d28>] ? inet_add_protocol+0x3e/0x3e
        [<ffffffff813d7fde>] NF_HOOK.constprop.10+0x4c/0x53
        [<ffffffff813d856a>] ip_rcv+0x2a6/0x2ec
        [<ffffffff8139a9a0>] __netif_receive_skb_core+0x43e/0x478
        [<ffffffff812a346f>] ? virtqueue_poll+0x16/0x27
        [<ffffffff8139aa2f>] __netif_receive_skb+0x55/0x5a
        [<ffffffff8139aaaa>] process_backlog+0x76/0x12f
        [<ffffffff8139add8>] net_rx_action+0xa2/0x1ab
        [<ffffffff81047847>] __do_softirq+0xca/0x1d1
        [<ffffffff81047ace>] irq_exit+0x3e/0x85
        [<ffffffff8100b98b>] do_IRQ+0xa9/0xc4
        [<ffffffff814a37ad>] common_interrupt+0x6d/0x6d
        <EOI>
        [<ffffffff810378db>] ? native_safe_halt+0x6/0x8
        [<ffffffff810110c7>] default_idle+0x9/0xd
        [<ffffffff81011694>] arch_cpu_idle+0x13/0x1c
        [<ffffffff8107480d>] cpu_startup_entry+0xbc/0x137
        [<ffffffff8102e741>] start_secondary+0x1a0/0x1a5
       Code: 24 14 e8 f1 e5 01 00 31 d2 a8 32 0f 95 c2 49 8b 44 24 2c 49 0b 44 24 24 74 05 83 ca 04 eb 1c 4d 85 ed 74 17 49 8b 85 a8 02 00 00 <66> 8b 40 46 66 c1 e8 07 83 e0 07 c1 e0 03 09 c2 4c 89 e6 48 89
       RIP  [<ffffffff8143c459>] ip6_route_output+0x58/0x82
        RSP <ffff88007fd03668>
       CR2: 0000000000000046
       ---[ end trace 4612329caab37efd ]---
      
      When a vxlan interface is created without explicit group definition, the
      default_dst protocol family is initialized to AF_UNSPEC and the driver
      assumes IPv4 configuration. On the other hand, the default_dst protocol
      family is used to differentiate between IPv4 and IPv6 cases and, since
      AF_UNSPEC != AF_INET, the processing takes the IPv6 path.
      
      Making the IPv4 assumption explicit by setting the default_dst protocol
      family to AF_INET and preventing mixing of IPv4 and IPv6 addresses in
      snooped fdb entries fixes the corner case crashes.
      Signed-off-by: Mike Rapoport <mike.rapoport@ravellosystems.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
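
The family mix-up described in this commit can be illustrated with a small model. Everything here is hypothetical (`model_pick_path()`, `model_default_family()` and the `MODEL_PATH_*` values); the sketch only assumes, as the commit message states, that the transmit path treats "not AF_INET" as IPv6.

```c
#include <assert.h>
#include <sys/socket.h>

enum model_path { MODEL_PATH_IPV4, MODEL_PATH_IPV6 };

/* Dispatch that only tests for AF_INET, as the commit message describes. */
static enum model_path model_pick_path(sa_family_t family)
{
	return family == AF_INET ? MODEL_PATH_IPV4 : MODEL_PATH_IPV6;
}

/* Default destination family when no group is given: before and after the fix. */
static sa_family_t model_default_family(int fixed)
{
	return fixed ? AF_INET : AF_UNSPEC;
}
```

Before the fix, the default family is `AF_UNSPEC` and the model selects the IPv6 path for what is really an IPv4 setup; with the default set to `AF_INET` the IPv4 path is selected.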
  24. 25 Mar, 2014 1 commit
    • vxlan: fix nonfunctional neigh_reduce() · 4b29dba9
      Committed by David Stevens
      The VXLAN neigh_reduce() code is completely non-functional since
      check-in. Specific errors:
      
      1) The original code drops all packets with a multicast destination address,
      	even though neighbor solicitations are sent to the solicited-node
      	address, a multicast address. The code after this check was never run.
      2) The neighbor table lookup used the IPv6 header destination, which is the
      	solicited node address, rather than the target address from the
      	neighbor solicitation. So neighbor lookups would always fail if it
      	got this far. Also for L3MISSes.
      3) The code calls ndisc_send_na(), which does a send on the tunnel device.
      	The context for neigh_reduce() is the transmit path, vxlan_xmit(),
      	where the host or a bridge-attached neighbor is trying to transmit
      	a neighbor solicitation. To respond to it, the tunnel endpoint needs
      	to do a *receive* of the appropriate neighbor advertisement. Doing a
      	send, would only try to send the advertisement, encapsulated, to the
      	remote destinations in the fdb -- hosts that definitely did not do the
      	corresponding solicitation.
      4) The code uses the tunnel endpoint IPv6 forwarding flag to determine the
      	isrouter flag in the advertisement. This has nothing to do with whether
      	or not the target is a router, and generally won't be set since the
      	tunnel endpoint is bridging, not routing, traffic.
      
      The patch below creates a proxy neighbor advertisement to respond to
      neighbor solicitations as intended, providing proper IPv6 support for neighbor
      reduction.
      Signed-off-by: David L Stevens <dlstevens@us.ibm.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
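
Points 1 and 2 of this commit both hinge on the relationship between a neighbor solicitation's target and its IPv6 destination. The sketch below derives the solicited-node multicast address (ff02::1:ffXX:XXXX, per RFC 4291) from a target address. The helper names `model_solicited_node()` and `model_is_multicast()` are hypothetical and are not taken from the vxlan driver.

```c
#include <assert.h>
#include <netinet/in.h>
#include <string.h>

/*
 * Derive the solicited-node multicast address (ff02::1:ffXX:XXXX)
 * from the low 24 bits of the target address.
 */
static struct in6_addr model_solicited_node(const struct in6_addr *target)
{
	struct in6_addr sn;

	memset(&sn, 0, sizeof(sn));
	sn.s6_addr[0]  = 0xff;
	sn.s6_addr[1]  = 0x02;
	sn.s6_addr[11] = 0x01;
	sn.s6_addr[12] = 0xff;
	memcpy(&sn.s6_addr[13], &target->s6_addr[13], 3);
	return sn;
}

/* A multicast IPv6 address has 0xff as its first byte. */
static int model_is_multicast(const struct in6_addr *a)
{
	return a->s6_addr[0] == 0xff;
}
```

The derived destination is always multicast and never equals the unicast target, which is why dropping multicast destinations discards every solicitation and why a neighbor lookup keyed on the header destination cannot find the target.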
  25. 19 Mar, 2014 1 commit
  26. 27 Feb, 2014 1 commit
  27. 15 Feb, 2014 1 commit
  28. 02 Feb, 2014 1 commit
  29. 31 Jan, 2014 1 commit
  30. 24 Jan, 2014 1 commit
  31. 23 Jan, 2014 1 commit
    • net: vxlan: convert to act as a pernet subsystem · 783c1463
      Committed by Daniel Borkmann
      As per suggestion from Eric W. Biederman, vxlan should be using
      {un,}register_pernet_subsys() instead of {un,}register_pernet_device()
      to ensure the vxlan_net structure is initialized before and cleaned
      up after all network devices in a given network namespace i.e. when
      dealing with network notifiers. This is similarly handled already in
      commit 91e2ff35 ("net: Teach vlans to cleanup as a pernet subsystem")
      and, thus, improves upon fd27e0d4 ("net: vxlan: do not use vxlan_net
      before checking event type"). Just as in 91e2ff35, we do not need
      to explicitly handle deletion of vxlan devices as network namespace
      exit calls dellink on all remaining virtual devices, and
      rtnl_link_unregister() calls dellink on all outstanding devices in that
      network namespace, so we can entirely drop the pernet exit operation
      as well. Moreover, on vxlan module exit, rcu_barrier() is called by
      netns since commit 3a765eda ("netns: Add an explicit rcu_barrier
      to unregister_pernet_{device|subsys}"), so this may be omitted. Tested
      with various scenarios and works well on my side.
      Suggested-by: Eric W. Biederman <ebiederm@xmission.com>
      Cc: Jesse Brandeburg <jesse.brandeburg@intel.com>
      Cc: Eric W. Biederman <ebiederm@xmission.com>
      Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  32. 22 Jan, 2014 2 commits
    • net: Add GRO support for vxlan traffic · dc01e7d3
      Committed by Or Gerlitz
      Add GRO handlers for vxlan, by using the UDP GRO infrastructure.
      
      For a single TCP session that goes through vxlan tunneling I got a nice
      improvement from 6.8 Gb/s to 11.5 Gb/s.
      
      --> UDP/VXLAN GRO disabled
      $ netperf  -H 192.168.52.147 -c -C
      
      $ netperf -t TCP_STREAM -H 192.168.52.147 -c -C
      MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 192.168.52.147 () port 0 AF_INET
      Recv   Send    Send                          Utilization       Service Demand
      Socket Socket  Message  Elapsed              Send     Recv     Send    Recv
      Size   Size    Size     Time     Throughput  local    remote   local   remote
      bytes  bytes   bytes    secs.    10^6bits/s  % S      % S      us/KB   us/KB
      
       87380  65536  65536    10.00      6799.75   12.54    24.79    0.604   1.195
      
      --> UDP/VXLAN GRO enabled
      
      $ netperf -t TCP_STREAM -H 192.168.52.147 -c -C
      MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 192.168.52.147 () port 0 AF_INET
      Recv   Send    Send                          Utilization       Service Demand
      Socket Socket  Message  Elapsed              Send     Recv     Send    Recv
      Size   Size    Size     Time     Throughput  local    remote   local   remote
      bytes  bytes   bytes    secs.    10^6bits/s  % S      % S      us/KB   us/KB
      
       87380  65536  65536    10.00      11562.72   24.90    20.34    0.706   0.577
      Signed-off-by: Shlomo Pongratz <shlomop@mellanox.com>
      Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: add vxlan description · ead5139a
      Committed by Jesse Brandeburg
      Add a description to the vxlan module, helping save the world
      from the minions of destruction and confusion.
      Signed-off-by: Jesse Brandeburg <jesse.brandeburg@intel.com>
      CC: Stephen Hemminger <stephen@networkplumber.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  33. 18 Jan, 2014 1 commit
    • net: vxlan: do not use vxlan_net before checking event type · fd27e0d4
      Committed by Daniel Borkmann
      Jesse Brandeburg reported that commit acaf4e70 caused a panic
      when adding a network namespace while vxlan module was present in
      the system:
      
      [<ffffffff814d0865>] vxlan_lowerdev_event+0xf5/0x100
      [<ffffffff816e9e5d>] notifier_call_chain+0x4d/0x70
      [<ffffffff810912be>] __raw_notifier_call_chain+0xe/0x10
      [<ffffffff810912d6>] raw_notifier_call_chain+0x16/0x20
      [<ffffffff815d9610>] call_netdevice_notifiers_info+0x40/0x70
      [<ffffffff815d9656>] call_netdevice_notifiers+0x16/0x20
      [<ffffffff815e1bce>] register_netdevice+0x1be/0x3a0
      [<ffffffff815e1dce>] register_netdev+0x1e/0x30
      [<ffffffff814cb94a>] loopback_net_init+0x4a/0xb0
      [<ffffffffa016ed6e>] ? lockd_init_net+0x6e/0xb0 [lockd]
      [<ffffffff815d6bac>] ops_init+0x4c/0x150
      [<ffffffff815d6d23>] setup_net+0x73/0x110
      [<ffffffff815d725b>] copy_net_ns+0x7b/0x100
      [<ffffffff81090e11>] create_new_namespaces+0x101/0x1b0
      [<ffffffff81090f45>] copy_namespaces+0x85/0xb0
      [<ffffffff810693d5>] copy_process.part.26+0x935/0x1500
      [<ffffffff811d5186>] ? mntput+0x26/0x40
      [<ffffffff8106a15c>] do_fork+0xbc/0x2e0
      [<ffffffff811b7f2e>] ? ____fput+0xe/0x10
      [<ffffffff81089c5c>] ? task_work_run+0xac/0xe0
      [<ffffffff8106a406>] SyS_clone+0x16/0x20
      [<ffffffff816ee689>] stub_clone+0x69/0x90
      [<ffffffff816ee329>] ? system_call_fastpath+0x16/0x1b
      
      Apparently loopback device is being registered first and thus we
      receive an event notification when vxlan_net is not ready. Hence,
      when we call net_generic() and request vxlan_net_id, we seem to
      access garbage at that point in time. In setup_net() where we set
      up a newly allocated network namespace, we traverse the list of
      pernet ops ...
      
      list_for_each_entry(ops, &pernet_list, list) {
      	error = ops_init(ops, net);
      	if (error < 0)
      		goto out_undo;
      }
      
      ... and loopback_net_init() is invoked first here, so in the middle
      of setup_net() we get this notification in vxlan. As currently we
      only care about devices that unregister, move access through
      net_generic() there. Fix is based on Cong Wang's proposal, but
      only changes what is needed here. It sucks a bit as we only work
      around the actual cure: right now it seems the only way to check if
      a netns actually finished traversing all init ops would be to check
      if it's part of net_namespace_list. But that I find quite expensive
      each time we go through a notifier callback. Anyway, did a couple
      of tests and it seems good for now.
      
      Fixes: acaf4e70 ("net: vxlan: when lower dev unregisters remove vxlan dev as well")
      Reported-by: Jesse Brandeburg <jesse.brandeburg@intel.com>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Jesse Brandeburg <jesse.brandeburg@intel.com>
      Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
      Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
      Tested-by: Jesse Brandeburg <jesse.brandeburg@intel.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
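
The ordering problem and its workaround in this commit can be modelled as follows. This is a sketch with hypothetical types and names (`struct model_net`, `model_net_generic()`, `model_lowerdev_event()`); it assumes only what the commit message says: the handler cares solely about unregister events and must not touch per-namespace state for any other event.

```c
#include <assert.h>
#include <stddef.h>

enum model_event { MODEL_NETDEV_REGISTER, MODEL_NETDEV_UNREGISTER };

struct model_vxlan_net {
	int ndevs;
};

struct model_net {
	struct model_vxlan_net *vn; /* NULL until the vxlan per-net init has run */
};

/* Hypothetical accessor standing in for net_generic(net, vxlan_net_id). */
static struct model_vxlan_net *model_net_generic(struct model_net *net)
{
	assert(net->vn != NULL); /* touching it too early is the bug */
	return net->vn;
}

/* Returns the number of vxlan devices that would be torn down. */
static int model_lowerdev_event(struct model_net *net, enum model_event ev)
{
	/* Filter on the event first; only then touch per-net state. */
	if (ev != MODEL_NETDEV_UNREGISTER)
		return 0;

	return model_net_generic(net)->ndevs;
}
```

A register event that arrives while the namespace is still being set up (as with the loopback device in the trace above) now returns before the per-net pointer is ever dereferenced.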
  34. 15 Jan, 2014 3 commits
    • net: vxlan: properly cleanup devs on module unload · 8425783c
      Committed by Daniel Borkmann
      We should use vxlan_dellink() handler in vxlan_exit_net(), since
      i) we're not in fast-path and we should be consistent in dismantle
      just as we would remove a device through rtnl ops, and more
      importantly, ii) in case future code will kfree() memory in
      vxlan_dellink(), we would leak it right here unnoticed. Therefore,
      do not do only half of the cleanup work, but do it properly.
      Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: vxlan: when lower dev unregisters remove vxlan dev as well · acaf4e70
      Committed by Daniel Borkmann
      We can create a vxlan device with an explicit underlying carrier.
      In that case, when the carrier link is being deleted from the
      system (e.g. due to module unload) we should also clean up all
      created vxlan devices on top of it since otherwise we're in an
      inconsistent state in vxlan device. In that case, the user needs
      to remove all such devices, while in case of other virtual devs
      that sit on top of physical ones, it is usually the case that
      these devices do unregister automatically as well and do not
      leave the burden on the user.
      
      This work is not necessary when vxlan device was not created with
      a real underlying device, as connections can resume in that case
      when driver is plugged again. But at least for the other cases,
      we should go ahead and do the cleanup on removal.
      
      We don't register the notifier during vxlan_newlink() here since
      I consider this event rather rare, and therefore we should not
      bloat vxlan's core structure unnecessarily. Also, we can simply make
      use of unregister_netdevice_many() to batch that. fdb is flushed
      upon ndo_stop().
      
      E.g. `ip -d link show vxlan13` after carrier removal before
      this patch:
      
      5: vxlan13: <BROADCAST,MULTICAST> mtu 1450 qdisc noop state DOWN mode DEFAULT group default
          link/ether 1e:47:da:6d:4d:99 brd ff:ff:ff:ff:ff:ff promiscuity 0
          vxlan id 13 group 239.0.0.10 dev 2 port 32768 61000 ageing 300
                                       ^^^^^
      Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • vxlan: use __dev_get_by_index instead of dev_get_by_index to find interface · 73763949
      Committed by Ying Xue
      The following call chains indicate that vxlan_fdb_parse() is
      under rtnl_lock protection. So if we use __dev_get_by_index()
      instead of dev_get_by_index() to find the interface handler in it,
      we avoid changing the interface reference counter.
      
      rtnetlink_rcv()
        rtnl_lock()
        netlink_rcv_skb()
          rtnl_fdb_add()
            vxlan_fdb_add()
              vxlan_fdb_parse()
        rtnl_unlock()
      
      rtnetlink_rcv()
        rtnl_lock()
        netlink_rcv_skb()
          rtnl_fdb_del()
            vxlan_fdb_del()
              vxlan_fdb_parse()
        rtnl_unlock()
      
      Cc: Stephen Hemminger <stephen@networkplumber.org>
      Signed-off-by: Ying Xue <ying.xue@windriver.com>
      Acked-by: Stephen Hemminger <stephen@networkplumber.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
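
The difference between the two lookup styles in this commit can be sketched with a toy device table. All names here (`struct model_dev`, `model_lookup_locked()`, `model_lookup_hold()`, `model_put()`) are hypothetical; the model assumes only that the caller of the first variant already holds the lock protecting the table, as the call chains above show for rtnl_lock.

```c
#include <assert.h>
#include <stddef.h>

struct model_dev {
	int ifindex;
	int refcnt;
};

/*
 * Lookup for callers that already hold the lock protecting the table
 * (the analogue of __dev_get_by_index() under rtnl_lock): no refcount change.
 */
static struct model_dev *model_lookup_locked(struct model_dev *tbl, size_t n,
					     int ifindex)
{
	for (size_t i = 0; i < n; i++)
		if (tbl[i].ifindex == ifindex)
			return &tbl[i];
	return NULL;
}

/*
 * Lookup that pins the device (the analogue of dev_get_by_index()):
 * the caller must drop the reference with model_put() afterwards.
 */
static struct model_dev *model_lookup_hold(struct model_dev *tbl, size_t n,
					   int ifindex)
{
	struct model_dev *dev = model_lookup_locked(tbl, n, ifindex);

	if (dev)
		dev->refcnt++;
	return dev;
}

/* Drop a reference taken by model_lookup_hold(). */
static void model_put(struct model_dev *dev)
{
	assert(dev->refcnt > 0);
	dev->refcnt--;
}
```

When the lock is held for the whole operation, the locked variant saves one increment/decrement pair per lookup and removes the risk of forgetting the matching put.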
  35. 07 Jan, 2014 1 commit
    • vxlan: keep original skb ownership · 8f646c92
      Committed by Eric Dumazet
      Sathya Perla posted a patch trying to address following problem :
      
      <quote>
       The vxlan driver sets itself as the socket owner for all the TX flows
       it encapsulates (using vxlan_set_owner()) and assigns its own skb
       destructor. This causes all tunneled traffic to land up on only one TXQ
       as all encapsulated skbs refer to the vxlan socket and not the original
       socket.  Also, the vxlan skb destructor breaks some functionality for
       tunneled traffic such as wmem accounting, TCP small queues and the
       FQ/pacing packet scheduler.
      </quote>
      
      I reworked Sathya's patch and added some explanations.
      
      vxlan_xmit() can avoid one skb_clone()/dev_kfree_skb() pair
      and gain better drop monitor accuracy, by calling kfree_skb() when
      appropriate.
      
      The UDP socket used by vxlan to perform encapsulation of xmit packets
      does not need to be alive while packets leave vxlan code. It's better
      to keep original socket ownership to get proper feedback from qdisc and
      NIC layers.
      
      We use skb->sk to
      
      A) control the amount of bytes/packets queued on behalf of a socket, but
      prior vxlan code did the skb->sk transfer without any limit/control
      on vxlan socket sk_sndbuf.
      
      B) security purposes (such as selinux) or netfilter uses, and I do not think
      anything is prepared to handle the vxlan stacked case in this area.
      
      By not changing ownership, vxlan tunnels behave like other tunnels.
      As Stephen mentioned, we might do the same change in L2TP.
      Reported-by: Sathya Perla <sathya.perla@emulex.com>
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Cc: Stephen Hemminger <stephen@networkplumber.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  36. 05 Jan, 2014 1 commit