1. 12 12月, 2014 1 次提交
    • M
      Fix race condition between vxlan_sock_add and vxlan_sock_release · 00c83b01
      Marcelo Leitner 提交于
      Currently, when trying to reuse a socket, vxlan_sock_add will grab
      vn->sock_lock, locate a reusable socket, inc refcount and release
      vn->sock_lock.
      
      But vxlan_sock_release() will first decrement refcount, and then grab
      that lock. refcnt operations are atomic but as currently we have
      deferred works which hold vs->refcnt each, this might happen, leading to
      a use after free (specially after vxlan_igmp_leave):
      
        CPU 1                            CPU 2
      
      deferred work                    vxlan_sock_add
        ...                              ...
                                         spin_lock(&vn->sock_lock)
                                         vs = vxlan_find_sock();
        vxlan_sock_release
          dec vs->refcnt, reaches 0
          spin_lock(&vn->sock_lock)
                                         vxlan_sock_hold(vs), refcnt=1
                                         spin_unlock(&vn->sock_lock)
          hlist_del_rcu(&vs->hlist);
          vxlan_notify_del_rx_port(vs)
          spin_unlock(&vn->sock_lock)
      
      So when we look for a reusable socket, we check if it wasn't freed
      already before reusing it.
      Signed-off-by: NMarcelo Ricardo Leitner <mleitner@redhat.com>
      Fixes: 7c47cedf ("vxlan: move IGMP join/leave to work queue")
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      00c83b01
  2. 03 12月, 2014 1 次提交
  3. 26 11月, 2014 1 次提交
  4. 22 11月, 2014 2 次提交
  5. 19 11月, 2014 1 次提交
  6. 15 11月, 2014 1 次提交
  7. 14 11月, 2014 1 次提交
    • M
      vxlan: Do not reuse sockets for a different address family · 19ca9fc1
      Marcelo Leitner 提交于
      Currently, we only match against local port number in order to reuse
      socket. But if this new vxlan wants an IPv6 socket and a IPv4 one bound
      to that port, vxlan will reuse an IPv4 socket as IPv6 and a panic will
      follow. The following steps reproduce it:
      
         # ip link add vxlan6 type vxlan id 42 group 229.10.10.10 \
             srcport 5000 6000 dev eth0
         # ip link add vxlan7 type vxlan id 43 group ff0e::110 \
             srcport 5000 6000 dev eth0
         # ip link set vxlan6 up
         # ip link set vxlan7 up
         <panic>
      
      [    4.187481] BUG: unable to handle kernel NULL pointer dereference at 0000000000000058
      ...
      [    4.188076] Call Trace:
      [    4.188085]  [<ffffffff81667c4a>] ? ipv6_sock_mc_join+0x3a/0x630
      [    4.188098]  [<ffffffffa05a6ad6>] vxlan_igmp_join+0x66/0xd0 [vxlan]
      [    4.188113]  [<ffffffff810a3430>] process_one_work+0x220/0x710
      [    4.188125]  [<ffffffff810a33c4>] ? process_one_work+0x1b4/0x710
      [    4.188138]  [<ffffffff810a3a3b>] worker_thread+0x11b/0x3a0
      [    4.188149]  [<ffffffff810a3920>] ? process_one_work+0x710/0x710
      
      So address family must also match in order to reuse a socket.
      Reported-by: NJean-Tsung Hsiao <jhsiao@redhat.com>
      Signed-off-by: NMarcelo Ricardo Leitner <mleitner@redhat.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      19ca9fc1
  8. 11 11月, 2014 1 次提交
  9. 07 11月, 2014 1 次提交
    • T
      vxlan: Fix to enable UDP checksums on interface · 5c91ae08
      Tom Herbert 提交于
      Add definition to vxlan nla_policy for UDP checksum. This is necessary
      to enable UDP checksums on VXLAN.
      
      In some instances, enabling UDP checksums can improve performance on
      receive for devices that return legacy checksum-unnecessary for UDP/IP.
      Also, UDP checksum provides some protection against VNI corruption.
      
      Testing:
      
      Ran 200 instances of TCP_STREAM and TCP_RR on bnx2x.
      
      TCP_STREAM
        IPv4, without UDP checksums
            14.41% TX CPU utilization
            25.71% RX CPU utilization
            9083.4 Mbps
        IPv4, with UDP checksums
            13.99% TX CPU utilization
            13.40% RX CPU utilization
            9095.65 Mbps
      
      TCP_RR
        IPv4, without UDP checksums
            94.08% TX CPU utilization
            156/248/462 90/95/99% latencies
            1.12743e+06 tps
        IPv4, with UDP checksums
            94.43% TX CPU utilization
            158/250/462 90/95/99% latencies
            1.13345e+06 tps
      Signed-off-by: NTom Herbert <therbert@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      5c91ae08
  10. 18 10月, 2014 1 次提交
  11. 16 10月, 2014 2 次提交
  12. 08 10月, 2014 1 次提交
    • E
      net: better IFF_XMIT_DST_RELEASE support · 02875878
      Eric Dumazet 提交于
      Testing xmit_more support with netperf and connected UDP sockets,
      I found strange dst refcount false sharing.
      
      Current handling of IFF_XMIT_DST_RELEASE is not optimal.
      
      Dropping dst in validate_xmit_skb() is certainly too late in case
      packet was queued by cpu X but dequeued by cpu Y
      
      The logical point to take care of drop/force is in __dev_queue_xmit()
      before even taking qdisc lock.
      
      As Julian Anastasov pointed out, need for skb_dst() might come from some
      packet schedulers or classifiers.
      
      This patch adds new helper to cleanly express needs of various drivers
      or qdiscs/classifiers.
      
      Drivers that need skb_dst() in their ndo_start_xmit() should call
      following helper in their setup instead of the prior :
      
      	dev->priv_flags &= ~IFF_XMIT_DST_RELEASE;
      ->
      	netif_keep_dst(dev);
      
      Instead of using a single bit, we use two bits, one being
      eventually rebuilt in bonding/team drivers.
      
      The other one, is permanent and blocks IFF_XMIT_DST_RELEASE being
      rebuilt in bonding/team. Eventually, we could add something
      smarter later.
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Cc: Julian Anastasov <ja@ssi.bg>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      02875878
  13. 02 10月, 2014 1 次提交
  14. 24 9月, 2014 1 次提交
  15. 20 9月, 2014 1 次提交
  16. 02 9月, 2014 1 次提交
  17. 30 8月, 2014 1 次提交
    • T
      net: Clarification of CHECKSUM_UNNECESSARY · 77cffe23
      Tom Herbert 提交于
      This patch:
       - Clarifies the specific requirements of devices returning
         CHECKSUM_UNNECESSARY (comments in skbuff.h).
       - Adds csum_level field to skbuff. This is used to express how
         many checksums are covered by CHECKSUM_UNNECESSARY (stores n - 1).
         This replaces the overloading of skb->encapsulation, that field is
         is now only used to indicate inner headers are valid.
       - Change __skb_checksum_validate_needed to "consume" each checksum
         as indicated by csum_level as layers of the the packet are parsed.
       - Remove skb_pop_rcv_encapsulation, no longer needed in the new
         csum_level model.
      Signed-off-by: NTom Herbert <therbert@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      77cffe23
  18. 23 8月, 2014 1 次提交
    • G
      vxlan: fix incorrect initializer in union vxlan_addr · a45e92a5
      Gerhard Stenzel 提交于
      The first initializer in the following
      
              union vxlan_addr ipa = {
                  .sin.sin_addr.s_addr = tip,
                  .sa.sa_family = AF_INET,
              };
      
      is optimised away by the compiler, due to the second initializer,
      therefore initialising .sin.sin_addr.s_addr always to 0.
      This results in netlink messages indicating a L3 miss never contain the
      missed IP address. This was observed with GCC 4.8 and 4.9. I do not know about previous versions.
      The problem affects user space programs relying on an IP address being
      sent as part of a netlink message indicating a L3 miss.
      
      Changing
                  .sa.sa_family = AF_INET,
      to
                  .sin.sin_family = AF_INET,
      fixes the problem.
      Signed-off-by: NGerhard Stenzel <gerhard.stenzel@de.ibm.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      a45e92a5
  19. 29 7月, 2014 1 次提交
  20. 15 7月, 2014 1 次提交
  21. 11 7月, 2014 1 次提交
  22. 08 7月, 2014 1 次提交
  23. 15 6月, 2014 1 次提交
  24. 14 6月, 2014 1 次提交
  25. 12 6月, 2014 1 次提交
  26. 05 6月, 2014 1 次提交
  27. 14 5月, 2014 1 次提交
  28. 25 4月, 2014 1 次提交
    • N
      vxlan: add x-netns support · f01ec1c0
      Nicolas Dichtel 提交于
      This patch allows to switch the netns when packet is encapsulated or
      decapsulated.
      The vxlan socket is openned into the i/o netns, ie into the netns where
      encapsulated packets are received. The socket lookup is done into this netns to
      find the corresponding vxlan tunnel. After decapsulation, the packet is
      injecting into the corresponding interface which may stand to another netns.
      
      When one of the two netns is removed, the tunnel is destroyed.
      
      Configuration example:
      ip netns add netns1
      ip netns exec netns1 ip link set lo up
      ip link add vxlan10 type vxlan id 10 group 239.0.0.10 dev eth0 dstport 0
      ip link set vxlan10 netns netns1
      ip netns exec netns1 ip addr add 192.168.0.249/24 broadcast 192.168.0.255 dev vxlan10
      ip netns exec netns1 ip link set vxlan10 up
      Signed-off-by: NNicolas Dichtel <nicolas.dichtel@6wind.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      f01ec1c0
  29. 24 4月, 2014 1 次提交
  30. 16 4月, 2014 1 次提交
  31. 03 4月, 2014 1 次提交
    • M
      net: vxlan: fix crash when interface is created with no group · 5933a7bb
      Mike Rapoport 提交于
      If the vxlan interface is created without explicit group definition,
      there are corner cases which may cause kernel panic.
      
      For instance, in the following scenario:
      
      node A:
      $ ip link add dev vxlan42  address 2c:c2:60:00:10:20 type vxlan id 42
      $ ip addr add dev vxlan42 10.0.0.1/24
      $ ip link set up dev vxlan42
      $ arp -i vxlan42 -s 10.0.0.2 2c:c2:60:00:01:02
      $ bridge fdb add dev vxlan42 to 2c:c2:60:00:01:02 dst <IPv4 address>
      $ ping 10.0.0.2
      
      node B:
      $ ip link add dev vxlan42 address 2c:c2:60:00:01:02 type vxlan id 42
      $ ip addr add dev vxlan42 10.0.0.2/24
      $ ip link set up dev vxlan42
      $ arp -i vxlan42 -s 10.0.0.1 2c:c2:60:00:10:20
      
      node B crashes:
      
       vxlan42: 2c:c2:60:00:10:20 migrated from 4011:eca4:c0a8:6466:c0a8:6415:8e09:2118 to (invalid address)
       vxlan42: 2c:c2:60:00:10:20 migrated from 4011:eca4:c0a8:6466:c0a8:6415:8e09:2118 to (invalid address)
       BUG: unable to handle kernel NULL pointer dereference at 0000000000000046
       IP: [<ffffffff8143c459>] ip6_route_output+0x58/0x82
       PGD 7bd89067 PUD 7bd4e067 PMD 0
       Oops: 0000 [#1] SMP
       Modules linked in:
       CPU: 1 PID: 0 Comm: swapper/1 Not tainted 3.14.0-rc8-hvx-xen-00019-g97a5221f-dirty #154
       Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
       task: ffff88007c774f50 ti: ffff88007c79c000 task.ti: ffff88007c79c000
       RIP: 0010:[<ffffffff8143c459>]  [<ffffffff8143c459>] ip6_route_output+0x58/0x82
       RSP: 0018:ffff88007fd03668  EFLAGS: 00010282
       RAX: 0000000000000000 RBX: ffffffff8186a000 RCX: 0000000000000040
       RDX: 0000000000000000 RSI: ffff88007b0e4a80 RDI: ffff88007fd03754
       RBP: ffff88007fd03688 R08: ffff88007b0e4a80 R09: 0000000000000000
       R10: 0200000a0100000a R11: 0001002200000000 R12: ffff88007fd03740
       R13: ffff88007b0e4a80 R14: ffff88007b0e4a80 R15: ffff88007bba0c50
       FS:  0000000000000000(0000) GS:ffff88007fd00000(0000) knlGS:0000000000000000
       CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
       CR2: 0000000000000046 CR3: 000000007bb60000 CR4: 00000000000006e0
       Stack:
        0000000000000000 ffff88007fd037a0 ffffffff8186a000 ffff88007fd03740
        ffff88007fd036c8 ffffffff814320bb 0000000000006e49 ffff88007b8b7360
        ffff88007bdbf200 ffff88007bcbc000 ffff88007b8b7000 ffff88007b8b7360
       Call Trace:
        <IRQ>
        [<ffffffff814320bb>] ip6_dst_lookup_tail+0x2d/0xa4
        [<ffffffff814322a5>] ip6_dst_lookup+0x10/0x12
        [<ffffffff81323b4e>] vxlan_xmit_one+0x32a/0x68c
        [<ffffffff814a325a>] ? _raw_spin_unlock_irqrestore+0x12/0x14
        [<ffffffff8104c551>] ? lock_timer_base.isra.23+0x26/0x4b
        [<ffffffff8132451a>] vxlan_xmit+0x66a/0x6a8
        [<ffffffff8141a365>] ? ipt_do_table+0x35f/0x37e
        [<ffffffff81204ba2>] ? selinux_ip_postroute+0x41/0x26e
        [<ffffffff8139d0c1>] dev_hard_start_xmit+0x2ce/0x3ce
        [<ffffffff8139d491>] __dev_queue_xmit+0x2d0/0x392
        [<ffffffff813b380f>] ? eth_header+0x28/0xb5
        [<ffffffff8139d569>] dev_queue_xmit+0xb/0xd
        [<ffffffff813a5aa6>] neigh_resolve_output+0x134/0x152
        [<ffffffff813db741>] ip_finish_output2+0x236/0x299
        [<ffffffff813dc074>] ip_finish_output+0x98/0x9d
        [<ffffffff813dc749>] ip_output+0x62/0x67
        [<ffffffff813da9f2>] dst_output+0xf/0x11
        [<ffffffff813dc11c>] ip_local_out+0x1b/0x1f
        [<ffffffff813dcf1b>] ip_send_skb+0x11/0x37
        [<ffffffff813dcf70>] ip_push_pending_frames+0x2f/0x33
        [<ffffffff813ff732>] icmp_push_reply+0x106/0x115
        [<ffffffff813ff9e4>] icmp_reply+0x142/0x164
        [<ffffffff813ffb3b>] icmp_echo.part.16+0x46/0x48
        [<ffffffff813c1d30>] ? nf_iterate+0x43/0x80
        [<ffffffff813d8037>] ? xfrm4_policy_check.constprop.11+0x52/0x52
        [<ffffffff813ffb62>] icmp_echo+0x25/0x27
        [<ffffffff814005f7>] icmp_rcv+0x1d2/0x20a
        [<ffffffff813d8037>] ? xfrm4_policy_check.constprop.11+0x52/0x52
        [<ffffffff813d810d>] ip_local_deliver_finish+0xd6/0x14f
        [<ffffffff813d8037>] ? xfrm4_policy_check.constprop.11+0x52/0x52
        [<ffffffff813d7fde>] NF_HOOK.constprop.10+0x4c/0x53
        [<ffffffff813d82bf>] ip_local_deliver+0x4a/0x4f
        [<ffffffff813d7f7b>] ip_rcv_finish+0x253/0x26a
        [<ffffffff813d7d28>] ? inet_add_protocol+0x3e/0x3e
        [<ffffffff813d7fde>] NF_HOOK.constprop.10+0x4c/0x53
        [<ffffffff813d856a>] ip_rcv+0x2a6/0x2ec
        [<ffffffff8139a9a0>] __netif_receive_skb_core+0x43e/0x478
        [<ffffffff812a346f>] ? virtqueue_poll+0x16/0x27
        [<ffffffff8139aa2f>] __netif_receive_skb+0x55/0x5a
        [<ffffffff8139aaaa>] process_backlog+0x76/0x12f
        [<ffffffff8139add8>] net_rx_action+0xa2/0x1ab
        [<ffffffff81047847>] __do_softirq+0xca/0x1d1
        [<ffffffff81047ace>] irq_exit+0x3e/0x85
        [<ffffffff8100b98b>] do_IRQ+0xa9/0xc4
        [<ffffffff814a37ad>] common_interrupt+0x6d/0x6d
        <EOI>
        [<ffffffff810378db>] ? native_safe_halt+0x6/0x8
        [<ffffffff810110c7>] default_idle+0x9/0xd
        [<ffffffff81011694>] arch_cpu_idle+0x13/0x1c
        [<ffffffff8107480d>] cpu_startup_entry+0xbc/0x137
        [<ffffffff8102e741>] start_secondary+0x1a0/0x1a5
       Code: 24 14 e8 f1 e5 01 00 31 d2 a8 32 0f 95 c2 49 8b 44 24 2c 49 0b 44 24 24 74 05 83 ca 04 eb 1c 4d 85 ed 74 17 49 8b 85 a8 02 00 00 <66> 8b 40 46 66 c1 e8 07 83 e0 07 c1 e0 03 09 c2 4c 89 e6 48 89
       RIP  [<ffffffff8143c459>] ip6_route_output+0x58/0x82
        RSP <ffff88007fd03668>
       CR2: 0000000000000046
       ---[ end trace 4612329caab37efd ]---
      
      When vxlan interface is created without explicit group definition, the
      default_dst protocol family is initialiazed to AF_UNSPEC and the driver
      assumes IPv4 configuration. On the other side, the default_dst protocol
      family is used to differentiate between IPv4 and IPv6 cases and, since,
      AF_UNSPEC != AF_INET, the processing takes the IPv6 path.
      
      Making the IPv4 assumption explicit by settting default_dst protocol
      family to AF_INET4 and preventing mixing of IPv4 and IPv6 addresses in
      snooped fdb entries fixes the corner case crashes.
      Signed-off-by: NMike Rapoport <mike.rapoport@ravellosystems.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      5933a7bb
  32. 25 3月, 2014 1 次提交
    • D
      vxlan: fix nonfunctional neigh_reduce() · 4b29dba9
      David Stevens 提交于
      The VXLAN neigh_reduce() code is completely non-functional since
      check-in. Specific errors:
      
      1) The original code drops all packets with a multicast destination address,
      	even though neighbor solicitations are sent to the solicited-node
      	address, a multicast address. The code after this check was never run.
      2) The neighbor table lookup used the IPv6 header destination, which is the
      	solicited node address, rather than the target address from the
      	neighbor solicitation. So neighbor lookups would always fail if it
      	got this far. Also for L3MISSes.
      3) The code calls ndisc_send_na(), which does a send on the tunnel device.
      	The context for neigh_reduce() is the transmit path, vxlan_xmit(),
      	where the host or a bridge-attached neighbor is trying to transmit
      	a neighbor solicitation. To respond to it, the tunnel endpoint needs
      	to do a *receive* of the appropriate neighbor advertisement. Doing a
      	send, would only try to send the advertisement, encapsulated, to the
      	remote destinations in the fdb -- hosts that definitely did not do the
      	corresponding solicitation.
      4) The code uses the tunnel endpoint IPv6 forwarding flag to determine the
      	isrouter flag in the advertisement. This has nothing to do with whether
      	or not the target is a router, and generally won't be set since the
      	tunnel endpoint is bridging, not routing, traffic.
      
      	The patch below creates a proxy neighbor advertisement to respond to
      neighbor solicitions as intended, providing proper IPv6 support for neighbor
      reduction.
      Signed-off-by: NDavid L Stevens <dlstevens@us.ibm.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      4b29dba9
  33. 19 3月, 2014 1 次提交
  34. 27 2月, 2014 1 次提交
  35. 15 2月, 2014 1 次提交
  36. 02 2月, 2014 1 次提交
  37. 31 1月, 2014 1 次提交
  38. 24 1月, 2014 1 次提交