1. 03 2月, 2015 6 次提交
  2. 29 1月, 2015 2 次提交
  3. 28 1月, 2015 6 次提交
  4. 27 1月, 2015 2 次提交
  5. 25 1月, 2015 2 次提交
    • T
      vxlan: Eliminate dependency on UDP socket in transmit path · af33c1ad
      Tom Herbert 提交于
      In the vxlan transmit path there is no need to reference the socket
      for a tunnel which is needed for the receive side. We do, however,
      need the vxlan_dev flags. This patch eliminate references
      to the socket in the transmit path, and changes VXLAN_F_UNSHAREABLE
      to be VXLAN_F_RCV_FLAGS. This mask is used to store the flags
      applicable to receive (GBP, CSUM6_RX, and REMCSUM_RX) in the
      vxlan_sock flags.
      Signed-off-by: NTom Herbert <therbert@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      af33c1ad
    • T
      udp: Do not require sock in udp_tunnel_xmit_skb · d998f8ef
      Tom Herbert 提交于
      The UDP tunnel transmit functions udp_tunnel_xmit_skb and
      udp_tunnel6_xmit_skb include a socket argument. The socket being
      passed to the functions (from VXLAN) is a UDP created for receive
      side. The only thing that the socket is used for in the transmit
      functions is to get the setting for checksum (enabled or zero).
      This patch removes the argument and and adds a nocheck argument
      for checksum setting. This eliminates the unnecessary dependency
      on a UDP socket for UDP tunnel transmit.
      Signed-off-by: NTom Herbert <therbert@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      d998f8ef
  6. 20 1月, 2015 4 次提交
  7. 19 1月, 2015 1 次提交
  8. 18 1月, 2015 5 次提交
    • M
      ip_tunnel: Create percpu gro_cell · 88340160
      Martin KaFai Lau 提交于
      In the ipip tunnel, the skb->queue_mapping is lost in ipip_rcv().
      All skb will be queued to the same cell->napi_skbs.  The
      gro_cell_poll is pinned to one core under load.  In production traffic,
      we also see severe rx_dropped in the tunl iface and it is probably due to
      this limit: skb_queue_len(&cell->napi_skbs) > netdev_max_backlog.
      
      This patch is trying to alloc_percpu(struct gro_cell) and schedule
      gro_cell_poll to process the skb in the same core.
      Signed-off-by: NMartin KaFai Lau <kafai@fb.com>
      Acked-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      88340160
    • J
      netlink: make nlmsg_end() and genlmsg_end() void · 053c095a
      Johannes Berg 提交于
      Contrary to common expectations for an "int" return, these functions
      return only a positive value -- if used correctly they cannot even
      return 0 because the message header will necessarily be in the skb.
      
      This makes the very common pattern of
      
        if (genlmsg_end(...) < 0) { ... }
      
      be a whole bunch of dead code. Many places also simply do
      
        return nlmsg_end(...);
      
      and the caller is expected to deal with it.
      
      This also commonly (at least for me) causes errors, because it is very
      common to write
      
        if (my_function(...))
          /* error condition */
      
      and if my_function() does "return nlmsg_end()" this is of course wrong.
      
      Additionally, there's not a single place in the kernel that actually
      needs the message length returned, and if anyone needs it later then
      it'll be very easy to just use skb->len there.
      
      Remove this, and make the functions void. This removes a bunch of dead
      code as described above. The patch adds lines because I did
      
      -	return nlmsg_end(...);
      +	nlmsg_end(...);
      +	return 0;
      
      I could have preserved all the function's return values by returning
      skb->len, but instead I've audited all the places calling the affected
      functions and found that none cared. A few places actually compared
      the return value with <= 0 in dump functionality, but that could just
      be changed to < 0 with no change in behaviour, so I opted for the more
      efficient version.
      
      One instance of the error I've made numerous times now is also present
      in net/phonet/pn_netlink.c in the route_dumpit() function - it didn't
      check for <0 or <=0 and thus broke out of the loop every single time.
      I've preserved this since it will (I think) have caused the messages to
      userspace to be formatted differently with just a single message for
      every SKB returned to userspace. It's possible that this isn't needed
      for the tools that actually use this, but I don't even know what they
      are so couldn't test that changing this behaviour would be acceptable.
      Signed-off-by: NJohannes Berg <johannes.berg@intel.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      053c095a
    • J
      net: replace br_fdb_external_learn_* calls with switchdev notifier events · 3aeb6617
      Jiri Pirko 提交于
      This patch benefits from newly introduced switchdev notifier and uses it
      to propagate fdb learn events from rocker driver to bridge. That avoids
      direct function calls and possible use by other listeners (ovs).
      Suggested-by: NThomas Graf <tgraf@suug.ch>
      Signed-off-by: NJiri Pirko <jiri@resnulli.us>
      Signed-off-by: NScott Feldman <sfeldma@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      3aeb6617
    • J
      switchdev: introduce switchdev notifier · 03bf0c28
      Jiri Pirko 提交于
      This patch introduces new notifier for purposes of exposing events which happen
      on switch driver side. The consumers of the event messages are mainly involved
      masters, namely bridge and ovs.
      Suggested-by: NThomas Graf <tgraf@suug.ch>
      Signed-off-by: NJiri Pirko <jiri@resnulli.us>
      Signed-off-by: NScott Feldman <sfeldma@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      03bf0c28
    • J
      tc: add BPF based action · d23b8ad8
      Jiri Pirko 提交于
      This action provides a possibility to exec custom BPF code.
      Signed-off-by: NJiri Pirko <jiri@resnulli.us>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      d23b8ad8
  9. 17 1月, 2015 2 次提交
    • J
      genetlink: synchronize socket closing and family removal · ee1c2442
      Johannes Berg 提交于
      In addition to the problem Jeff Layton reported, I looked at the code
      and reproduced the same warning by subscribing and removing the genl
      family with a socket still open. This is a fairly tricky race which
      originates in the fact that generic netlink allows the family to go
      away while sockets are still open - unlike regular netlink which has
      a module refcount for every open socket so in general this cannot be
      triggered.
      
      Trying to resolve this issue by the obvious locking isn't possible as
      it will result in deadlocks between unregistration and group unbind
      notification (which incidentally lockdep doesn't find due to the home
      grown locking in the netlink table.)
      
      To really resolve this, introduce a "closing socket" reference counter
      (for generic netlink only, as it's the only affected family) in the
      core netlink code and use that in generic netlink to wait for all the
      sockets that are being closed at the same time as a generic netlink
      family is removed.
      
      This fixes the race that when a socket is closed, it will should call
      the unbind, but if the family is removed at the same time the unbind
      will not find it, leading to the warning. The real problem though is
      that in this case the unbind could actually find a new family that is
      registered to have a multicast group with the same ID, and call its
      mcast_unbind() leading to confusing.
      
      Also remove the warning since it would still trigger, but is now no
      longer a problem.
      
      This also moves the code in af_netlink.c to before unreferencing the
      module to avoid having the same problem in the normal non-genl case.
      Signed-off-by: NJohannes Berg <johannes.berg@intel.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      ee1c2442
    • J
      genetlink: document parallel_ops · f555f3d7
      Johannes Berg 提交于
      The kernel-doc for the parallel_ops family struct member is
      missing, add it.
      Signed-off-by: NJohannes Berg <johannes.berg@intel.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      f555f3d7
  10. 16 1月, 2015 3 次提交
  11. 15 1月, 2015 6 次提交
    • J
      cfg80211: remove 80+80 MHz rate reporting · 97d910d0
      Johannes Berg 提交于
      These rates are treated the same as 160 MHz in the spec, so
      it makes no sense to distinguish them. As no driver uses them
      yet, this is also not a problem, just remove them.
      
      In the userspace API the field remains reserved to preserve
      API and ABI.
      Signed-off-by: NJohannes Berg <johannes.berg@intel.com>
      97d910d0
    • J
      mac80211: remove 80+80 MHz rate reporting · f89903d5
      Johannes Berg 提交于
      These rates are treated the same as 160 MHz in the spec,
      so it makes no sense to distinguish them. As no driver
      uses them yet, this is also not a problem, just remove
      them.
      Signed-off-by: NJohannes Berg <johannes.berg@intel.com>
      f89903d5
    • T
      openvswitch: Support VXLAN Group Policy extension · 1dd144cf
      Thomas Graf 提交于
      Introduces support for the group policy extension to the VXLAN virtual
      port. The extension is disabled by default and only enabled if the user
      has provided the respective configuration.
      
        ovs-vsctl add-port br0 vxlan0 -- \
           set Interface vxlan0 type=vxlan options:exts=gbp
      
      The configuration interface to enable the extension is based on a new
      attribute OVS_VXLAN_EXT_GBP nested inside OVS_TUNNEL_ATTR_EXTENSION
      which can carry additional extensions as needed in the future.
      
      The group policy metadata is stored as binary blob (struct ovs_vxlan_opts)
      internally just like Geneve options but transported as nested Netlink
      attributes to user space.
      
      Renames the existing TUNNEL_OPTIONS_PRESENT to TUNNEL_GENEVE_OPT with the
      binary value kept intact, a new flag TUNNEL_VXLAN_OPT is introduced.
      
      The attributes OVS_TUNNEL_KEY_ATTR_VXLAN_OPTS and existing
      OVS_TUNNEL_KEY_ATTR_GENEVE_OPTS are implemented mutually exclusive.
      Signed-off-by: NThomas Graf <tgraf@suug.ch>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      1dd144cf
    • T
      vxlan: Only bind to sockets with compatible flags enabled · ac5132d1
      Thomas Graf 提交于
      A VXLAN net_device looking for an appropriate socket may only consider
      a socket which has a matching set of flags/extensions enabled. If
      incompatible flags are enabled, return a conflict to have the caller
      create a distinct socket with distinct port.
      
      The OVS VXLAN port is kept unaware of extensions at this point.
      Signed-off-by: NThomas Graf <tgraf@suug.ch>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      ac5132d1
    • T
      vxlan: Group Policy extension · 3511494c
      Thomas Graf 提交于
      Implements supports for the Group Policy VXLAN extension [0] to provide
      a lightweight and simple security label mechanism across network peers
      based on VXLAN. The security context and associated metadata is mapped
      to/from skb->mark. This allows further mapping to a SELinux context
      using SECMARK, to implement ACLs directly with nftables, iptables, OVS,
      tc, etc.
      
      The group membership is defined by the lower 16 bits of skb->mark, the
      upper 16 bits are used for flags.
      
      SELinux allows to manage label to secure local resources. However,
      distributed applications require ACLs to implemented across hosts. This
      is typically achieved by matching on L2-L4 fields to identify the
      original sending host and process on the receiver. On top of that,
      netlabel and specifically CIPSO [1] allow to map security contexts to
      universal labels.  However, netlabel and CIPSO are relatively complex.
      This patch provides a lightweight alternative for overlay network
      environments with a trusted underlay. No additional control protocol
      is required.
      
                 Host 1:                       Host 2:
      
            Group A        Group B        Group B     Group A
            +-----+   +-------------+    +-------+   +-----+
            | lxc |   | SELinux CTX |    | httpd |   | VM  |
            +--+--+   +--+----------+    +---+---+   +--+--+
      	  \---+---/                     \----+---/
      	      |                              |
      	  +---+---+                      +---+---+
      	  | vxlan |                      | vxlan |
      	  +---+---+                      +---+---+
      	      +------------------------------+
      
      Backwards compatibility:
      A VXLAN-GBP socket can receive standard VXLAN frames and will assign
      the default group 0x0000 to such frames. A Linux VXLAN socket will
      drop VXLAN-GBP  frames. The extension is therefore disabled by default
      and needs to be specifically enabled:
      
         ip link add [...] type vxlan [...] gbp
      
      In a mixed environment with VXLAN and VXLAN-GBP sockets, the GBP socket
      must run on a separate port number.
      
      Examples:
       iptables:
        host1# iptables -I OUTPUT -m owner --uid-owner 101 -j MARK --set-mark 0x200
        host2# iptables -I INPUT -m mark --mark 0x200 -j DROP
      
       OVS:
        # ovs-ofctl add-flow br0 'in_port=1,actions=load:0x200->NXM_NX_TUN_GBP_ID[],NORMAL'
        # ovs-ofctl add-flow br0 'in_port=2,tun_gbp_id=0x200,actions=drop'
      
      [0] https://tools.ietf.org/html/draft-smith-vxlan-group-policy
      [1] http://lwn.net/Articles/204905/Signed-off-by: NThomas Graf <tgraf@suug.ch>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      3511494c
    • T
      vxlan: Remote checksum offload · dfd8645e
      Tom Herbert 提交于
      Add support for remote checksum offload in VXLAN. This uses a
      reserved bit to indicate that RCO is being done, and uses the low order
      reserved eight bits of the VNI to hold the start and offset values in a
      compressed manner.
      
      Start is encoded in the low order seven bits of VNI. This is start >> 1
      so that the checksum start offset is 0-254 using even values only.
      Checksum offset (transport checksum field) is indicated in the high
      order bit in the low order byte of the VNI. If the bit is set, the
      checksum field is for UDP (so offset = start + 6), else checksum
      field is for TCP (so offset = start + 16). Only TCP and UDP are
      supported in this implementation.
      
      Remote checksum offload for VXLAN is described in:
      
      https://tools.ietf.org/html/draft-herbert-vxlan-rco-00
      
      Tested by running 200 TCP_STREAM connections with VXLAN (over IPv4).
      
      With UDP checksums and Remote Checksum Offload
        IPv4
            Client
              11.84% CPU utilization
            Server
              12.96% CPU utilization
            9197 Mbps
        IPv6
            Client
              12.46% CPU utilization
            Server
              14.48% CPU utilization
            8963 Mbps
      
      With UDP checksums, no remote checksum offload
        IPv4
            Client
              15.67% CPU utilization
            Server
              14.83% CPU utilization
            9094 Mbps
        IPv6
            Client
              16.21% CPU utilization
            Server
              14.32% CPU utilization
            9058 Mbps
      
      No UDP checksums
        IPv4
            Client
              15.03% CPU utilization
            Server
              23.09% CPU utilization
            9089 Mbps
        IPv6
            Client
              16.18% CPU utilization
            Server
              26.57% CPU utilization
             8954 Mbps
      Signed-off-by: NTom Herbert <therbert@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      dfd8645e
  12. 14 1月, 2015 1 次提交