1. 18 1月, 2015 1 次提交
    • J
      netlink: make nlmsg_end() and genlmsg_end() void · 053c095a
      Johannes Berg 提交于
      Contrary to common expectations for an "int" return, these functions
      return only a positive value -- if used correctly they cannot even
      return 0 because the message header will necessarily be in the skb.
      
      This makes the very common pattern of
      
        if (genlmsg_end(...) < 0) { ... }
      
      be a whole bunch of dead code. Many places also simply do
      
        return nlmsg_end(...);
      
      and the caller is expected to deal with it.
      
      This also commonly (at least for me) causes errors, because it is very
      common to write
      
        if (my_function(...))
          /* error condition */
      
      and if my_function() does "return nlmsg_end()" this is of course wrong.
      
      Additionally, there's not a single place in the kernel that actually
      needs the message length returned, and if anyone needs it later then
      it'll be very easy to just use skb->len there.
      
      Remove this, and make the functions void. This removes a bunch of dead
      code as described above. The patch adds lines because I did
      
      -	return nlmsg_end(...);
      +	nlmsg_end(...);
      +	return 0;
      
      I could have preserved all the function's return values by returning
      skb->len, but instead I've audited all the places calling the affected
      functions and found that none cared. A few places actually compared
      the return value with <= 0 in dump functionality, but that could just
      be changed to < 0 with no change in behaviour, so I opted for the more
      efficient version.
      
      One instance of the error I've made numerous times now is also present
      in net/phonet/pn_netlink.c in the route_dumpit() function - it didn't
      check for <0 or <=0 and thus broke out of the loop every single time.
      I've preserved this since it will (I think) have caused the messages to
      userspace to be formatted differently with just a single message for
      every SKB returned to userspace. It's possible that this isn't needed
      for the tools that actually use this, but I don't even know what they
      are so couldn't test that changing this behaviour would be acceptable.
      Signed-off-by: NJohannes Berg <johannes.berg@intel.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      053c095a
  2. 15 1月, 2015 4 次提交
    • T
      vxlan: Only bind to sockets with compatible flags enabled · ac5132d1
      Thomas Graf 提交于
      A VXLAN net_device looking for an appropriate socket may only consider
      a socket which has a matching set of flags/extensions enabled. If
      incompatible flags are enabled, return a conflict to have the caller
      create a distinct socket with distinct port.
      
      The OVS VXLAN port is kept unaware of extensions at this point.
      Signed-off-by: NThomas Graf <tgraf@suug.ch>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      ac5132d1
    • T
      vxlan: Group Policy extension · 3511494c
      Thomas Graf 提交于
      Implements supports for the Group Policy VXLAN extension [0] to provide
      a lightweight and simple security label mechanism across network peers
      based on VXLAN. The security context and associated metadata is mapped
      to/from skb->mark. This allows further mapping to a SELinux context
      using SECMARK, to implement ACLs directly with nftables, iptables, OVS,
      tc, etc.
      
      The group membership is defined by the lower 16 bits of skb->mark, the
      upper 16 bits are used for flags.
      
      SELinux allows to manage label to secure local resources. However,
      distributed applications require ACLs to implemented across hosts. This
      is typically achieved by matching on L2-L4 fields to identify the
      original sending host and process on the receiver. On top of that,
      netlabel and specifically CIPSO [1] allow to map security contexts to
      universal labels.  However, netlabel and CIPSO are relatively complex.
      This patch provides a lightweight alternative for overlay network
      environments with a trusted underlay. No additional control protocol
      is required.
      
                 Host 1:                       Host 2:
      
            Group A        Group B        Group B     Group A
            +-----+   +-------------+    +-------+   +-----+
            | lxc |   | SELinux CTX |    | httpd |   | VM  |
            +--+--+   +--+----------+    +---+---+   +--+--+
      	  \---+---/                     \----+---/
      	      |                              |
      	  +---+---+                      +---+---+
      	  | vxlan |                      | vxlan |
      	  +---+---+                      +---+---+
      	      +------------------------------+
      
      Backwards compatibility:
      A VXLAN-GBP socket can receive standard VXLAN frames and will assign
      the default group 0x0000 to such frames. A Linux VXLAN socket will
      drop VXLAN-GBP  frames. The extension is therefore disabled by default
      and needs to be specifically enabled:
      
         ip link add [...] type vxlan [...] gbp
      
      In a mixed environment with VXLAN and VXLAN-GBP sockets, the GBP socket
      must run on a separate port number.
      
      Examples:
       iptables:
        host1# iptables -I OUTPUT -m owner --uid-owner 101 -j MARK --set-mark 0x200
        host2# iptables -I INPUT -m mark --mark 0x200 -j DROP
      
       OVS:
        # ovs-ofctl add-flow br0 'in_port=1,actions=load:0x200->NXM_NX_TUN_GBP_ID[],NORMAL'
        # ovs-ofctl add-flow br0 'in_port=2,tun_gbp_id=0x200,actions=drop'
      
      [0] https://tools.ietf.org/html/draft-smith-vxlan-group-policy
      [1] http://lwn.net/Articles/204905/Signed-off-by: NThomas Graf <tgraf@suug.ch>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      3511494c
    • T
      vxlan: Remote checksum offload · dfd8645e
      Tom Herbert 提交于
      Add support for remote checksum offload in VXLAN. This uses a
      reserved bit to indicate that RCO is being done, and uses the low order
      reserved eight bits of the VNI to hold the start and offset values in a
      compressed manner.
      
      Start is encoded in the low order seven bits of VNI. This is start >> 1
      so that the checksum start offset is 0-254 using even values only.
      Checksum offset (transport checksum field) is indicated in the high
      order bit in the low order byte of the VNI. If the bit is set, the
      checksum field is for UDP (so offset = start + 6), else checksum
      field is for TCP (so offset = start + 16). Only TCP and UDP are
      supported in this implementation.
      
      Remote checksum offload for VXLAN is described in:
      
      https://tools.ietf.org/html/draft-herbert-vxlan-rco-00
      
      Tested by running 200 TCP_STREAM connections with VXLAN (over IPv4).
      
      With UDP checksums and Remote Checksum Offload
        IPv4
            Client
              11.84% CPU utilization
            Server
              12.96% CPU utilization
            9197 Mbps
        IPv6
            Client
              12.46% CPU utilization
            Server
              14.48% CPU utilization
            8963 Mbps
      
      With UDP checksums, no remote checksum offload
        IPv4
            Client
              15.67% CPU utilization
            Server
              14.83% CPU utilization
            9094 Mbps
        IPv6
            Client
              16.21% CPU utilization
            Server
              14.32% CPU utilization
            9058 Mbps
      
      No UDP checksums
        IPv4
            Client
              15.03% CPU utilization
            Server
              23.09% CPU utilization
            9089 Mbps
        IPv6
            Client
              16.18% CPU utilization
            Server
              26.57% CPU utilization
             8954 Mbps
      Signed-off-by: NTom Herbert <therbert@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      dfd8645e
    • T
      udp: pass udp_offload struct to UDP gro callbacks · a2b12f3c
      Tom Herbert 提交于
      This patch introduces udp_offload_callbacks which has the same
      GRO functions (but not a GSO function) as offload_callbacks,
      except there is an argument to a udp_offload struct passed to
      gro_receive and gro_complete functions. This additional argument
      can be used to retrieve the per port structure of the encapsulation
      for use in gro processing (mostly by doing container_of on the
      structure).
      Signed-off-by: NTom Herbert <therbert@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      a2b12f3c
  3. 14 1月, 2015 1 次提交
  4. 13 1月, 2015 1 次提交
    • T
      vxlan: Improve support for header flags · 3bf39475
      Tom Herbert 提交于
      This patch cleans up the header flags of VXLAN in anticipation of
      defining some new ones:
      
      - Move header related definitions from vxlan.c to vxlan.h
      - Change VXLAN_FLAGS to be VXLAN_HF_VNI (only currently defined flag)
      - Move check for unknown flags to after we find vxlan_sock, this
        assumes that some flags may be processed based on tunnel
        configuration
      - Add a comment about why the stack treating unknown set flags as an
        error instead of ignoring them
      Signed-off-by: NTom Herbert <therbert@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      3bf39475
  5. 03 1月, 2015 1 次提交
  6. 24 12月, 2014 1 次提交
  7. 12 12月, 2014 1 次提交
    • M
      Fix race condition between vxlan_sock_add and vxlan_sock_release · 00c83b01
      Marcelo Leitner 提交于
      Currently, when trying to reuse a socket, vxlan_sock_add will grab
      vn->sock_lock, locate a reusable socket, inc refcount and release
      vn->sock_lock.
      
      But vxlan_sock_release() will first decrement refcount, and then grab
      that lock. refcnt operations are atomic but as currently we have
      deferred works which hold vs->refcnt each, this might happen, leading to
      a use after free (specially after vxlan_igmp_leave):
      
        CPU 1                            CPU 2
      
      deferred work                    vxlan_sock_add
        ...                              ...
                                         spin_lock(&vn->sock_lock)
                                         vs = vxlan_find_sock();
        vxlan_sock_release
          dec vs->refcnt, reaches 0
          spin_lock(&vn->sock_lock)
                                         vxlan_sock_hold(vs), refcnt=1
                                         spin_unlock(&vn->sock_lock)
          hlist_del_rcu(&vs->hlist);
          vxlan_notify_del_rx_port(vs)
          spin_unlock(&vn->sock_lock)
      
      So when we look for a reusable socket, we check if it wasn't freed
      already before reusing it.
      Signed-off-by: NMarcelo Ricardo Leitner <mleitner@redhat.com>
      Fixes: 7c47cedf ("vxlan: move IGMP join/leave to work queue")
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      00c83b01
  8. 03 12月, 2014 1 次提交
  9. 26 11月, 2014 1 次提交
  10. 22 11月, 2014 2 次提交
  11. 19 11月, 2014 1 次提交
  12. 15 11月, 2014 1 次提交
  13. 14 11月, 2014 1 次提交
    • M
      vxlan: Do not reuse sockets for a different address family · 19ca9fc1
      Marcelo Leitner 提交于
      Currently, we only match against local port number in order to reuse
      socket. But if this new vxlan wants an IPv6 socket and a IPv4 one bound
      to that port, vxlan will reuse an IPv4 socket as IPv6 and a panic will
      follow. The following steps reproduce it:
      
         # ip link add vxlan6 type vxlan id 42 group 229.10.10.10 \
             srcport 5000 6000 dev eth0
         # ip link add vxlan7 type vxlan id 43 group ff0e::110 \
             srcport 5000 6000 dev eth0
         # ip link set vxlan6 up
         # ip link set vxlan7 up
         <panic>
      
      [    4.187481] BUG: unable to handle kernel NULL pointer dereference at 0000000000000058
      ...
      [    4.188076] Call Trace:
      [    4.188085]  [<ffffffff81667c4a>] ? ipv6_sock_mc_join+0x3a/0x630
      [    4.188098]  [<ffffffffa05a6ad6>] vxlan_igmp_join+0x66/0xd0 [vxlan]
      [    4.188113]  [<ffffffff810a3430>] process_one_work+0x220/0x710
      [    4.188125]  [<ffffffff810a33c4>] ? process_one_work+0x1b4/0x710
      [    4.188138]  [<ffffffff810a3a3b>] worker_thread+0x11b/0x3a0
      [    4.188149]  [<ffffffff810a3920>] ? process_one_work+0x710/0x710
      
      So address family must also match in order to reuse a socket.
      Reported-by: NJean-Tsung Hsiao <jhsiao@redhat.com>
      Signed-off-by: NMarcelo Ricardo Leitner <mleitner@redhat.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      19ca9fc1
  14. 11 11月, 2014 1 次提交
  15. 07 11月, 2014 1 次提交
    • T
      vxlan: Fix to enable UDP checksums on interface · 5c91ae08
      Tom Herbert 提交于
      Add definition to vxlan nla_policy for UDP checksum. This is necessary
      to enable UDP checksums on VXLAN.
      
      In some instances, enabling UDP checksums can improve performance on
      receive for devices that return legacy checksum-unnecessary for UDP/IP.
      Also, UDP checksum provides some protection against VNI corruption.
      
      Testing:
      
      Ran 200 instances of TCP_STREAM and TCP_RR on bnx2x.
      
      TCP_STREAM
        IPv4, without UDP checksums
            14.41% TX CPU utilization
            25.71% RX CPU utilization
            9083.4 Mbps
        IPv4, with UDP checksums
            13.99% TX CPU utilization
            13.40% RX CPU utilization
            9095.65 Mbps
      
      TCP_RR
        IPv4, without UDP checksums
            94.08% TX CPU utilization
            156/248/462 90/95/99% latencies
            1.12743e+06 tps
        IPv4, with UDP checksums
            94.43% TX CPU utilization
            158/250/462 90/95/99% latencies
            1.13345e+06 tps
      Signed-off-by: NTom Herbert <therbert@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      5c91ae08
  16. 18 10月, 2014 1 次提交
  17. 16 10月, 2014 2 次提交
  18. 08 10月, 2014 1 次提交
    • E
      net: better IFF_XMIT_DST_RELEASE support · 02875878
      Eric Dumazet 提交于
      Testing xmit_more support with netperf and connected UDP sockets,
      I found strange dst refcount false sharing.
      
      Current handling of IFF_XMIT_DST_RELEASE is not optimal.
      
      Dropping dst in validate_xmit_skb() is certainly too late in case
      packet was queued by cpu X but dequeued by cpu Y
      
      The logical point to take care of drop/force is in __dev_queue_xmit()
      before even taking qdisc lock.
      
      As Julian Anastasov pointed out, need for skb_dst() might come from some
      packet schedulers or classifiers.
      
      This patch adds new helper to cleanly express needs of various drivers
      or qdiscs/classifiers.
      
      Drivers that need skb_dst() in their ndo_start_xmit() should call
      following helper in their setup instead of the prior :
      
      	dev->priv_flags &= ~IFF_XMIT_DST_RELEASE;
      ->
      	netif_keep_dst(dev);
      
      Instead of using a single bit, we use two bits, one being
      eventually rebuilt in bonding/team drivers.
      
      The other one, is permanent and blocks IFF_XMIT_DST_RELEASE being
      rebuilt in bonding/team. Eventually, we could add something
      smarter later.
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Cc: Julian Anastasov <ja@ssi.bg>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      02875878
  19. 02 10月, 2014 1 次提交
  20. 24 9月, 2014 1 次提交
  21. 20 9月, 2014 1 次提交
  22. 02 9月, 2014 1 次提交
  23. 30 8月, 2014 1 次提交
    • T
      net: Clarification of CHECKSUM_UNNECESSARY · 77cffe23
      Tom Herbert 提交于
      This patch:
       - Clarifies the specific requirements of devices returning
         CHECKSUM_UNNECESSARY (comments in skbuff.h).
       - Adds csum_level field to skbuff. This is used to express how
         many checksums are covered by CHECKSUM_UNNECESSARY (stores n - 1).
         This replaces the overloading of skb->encapsulation, that field is
         is now only used to indicate inner headers are valid.
       - Change __skb_checksum_validate_needed to "consume" each checksum
         as indicated by csum_level as layers of the the packet are parsed.
       - Remove skb_pop_rcv_encapsulation, no longer needed in the new
         csum_level model.
      Signed-off-by: NTom Herbert <therbert@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      77cffe23
  24. 23 8月, 2014 1 次提交
    • G
      vxlan: fix incorrect initializer in union vxlan_addr · a45e92a5
      Gerhard Stenzel 提交于
      The first initializer in the following
      
              union vxlan_addr ipa = {
                  .sin.sin_addr.s_addr = tip,
                  .sa.sa_family = AF_INET,
              };
      
      is optimised away by the compiler, due to the second initializer,
      therefore initialising .sin.sin_addr.s_addr always to 0.
      This results in netlink messages indicating a L3 miss never contain the
      missed IP address. This was observed with GCC 4.8 and 4.9. I do not know about previous versions.
      The problem affects user space programs relying on an IP address being
      sent as part of a netlink message indicating a L3 miss.
      
      Changing
                  .sa.sa_family = AF_INET,
      to
                  .sin.sin_family = AF_INET,
      fixes the problem.
      Signed-off-by: NGerhard Stenzel <gerhard.stenzel@de.ibm.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      a45e92a5
  25. 29 7月, 2014 1 次提交
  26. 15 7月, 2014 1 次提交
  27. 11 7月, 2014 1 次提交
  28. 08 7月, 2014 1 次提交
  29. 15 6月, 2014 1 次提交
  30. 14 6月, 2014 1 次提交
  31. 12 6月, 2014 1 次提交
  32. 05 6月, 2014 1 次提交
  33. 14 5月, 2014 1 次提交
  34. 25 4月, 2014 1 次提交
    • N
      vxlan: add x-netns support · f01ec1c0
      Nicolas Dichtel 提交于
      This patch allows to switch the netns when packet is encapsulated or
      decapsulated.
      The vxlan socket is openned into the i/o netns, ie into the netns where
      encapsulated packets are received. The socket lookup is done into this netns to
      find the corresponding vxlan tunnel. After decapsulation, the packet is
      injecting into the corresponding interface which may stand to another netns.
      
      When one of the two netns is removed, the tunnel is destroyed.
      
      Configuration example:
      ip netns add netns1
      ip netns exec netns1 ip link set lo up
      ip link add vxlan10 type vxlan id 10 group 239.0.0.10 dev eth0 dstport 0
      ip link set vxlan10 netns netns1
      ip netns exec netns1 ip addr add 192.168.0.249/24 broadcast 192.168.0.255 dev vxlan10
      ip netns exec netns1 ip link set vxlan10 up
      Signed-off-by: NNicolas Dichtel <nicolas.dichtel@6wind.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      f01ec1c0
  35. 24 4月, 2014 1 次提交