1. 03 12月, 2014 3 次提交
  2. 25 11月, 2014 1 次提交
    • M
      ipvlan: Initial check-in of the IPVLAN driver. · 2ad7bf36
      Mahesh Bandewar 提交于
      This driver is very similar to the macvlan driver except that it
      uses L3 on the frame to determine the logical interface while
      functioning as packet dispatcher. It inherits L2 of the master
      device hence the packets on wire will have the same L2 for all
      the packets originating from all virtual devices off of the same
      master device.
      
      This driver was developed keeping the namespace use-case in
      mind. Hence most of the examples given here take that as the
      base setup where main-device belongs to the default-ns and
      virtual devices are assigned to the additional namespaces.
      
      The device operates in two different modes and the difference
      in these two modes in primarily in the TX side.
      
      (a) L2 mode : In this mode, the device behaves as a L2 device.
      TX processing upto L2 happens on the stack of the virtual device
      associated with (namespace). Packets are switched after that
      into the main device (default-ns) and queued for xmit.
      
      RX processing is simple and all multicast, broadcast (if
      applicable), and unicast belonging to the address(es) are
      delivered to the virtual devices.
      
      (b) L3 mode : In this mode, the device behaves like a L3 device.
      TX processing upto L3 happens on the stack of the virtual device
      associated with (namespace). Packets are switched to the
      main-device (default-ns) for the L2 processing. Hence the routing
      table of the default-ns will be used in this mode.
      
      RX processins is somewhat similar to the L2 mode except that in
      this mode only Unicast packets are delivered to the virtual device
      while main-dev will handle all other packets.
      
      The devices can be added using the "ip" command from the iproute2
      package -
      
      	ip link add link <master> <virtual> type ipvlan mode [ l2 | l3 ]
      Signed-off-by: NMahesh Bandewar <maheshb@google.com>
      Cc: Eric Dumazet <edumazet@google.com>
      Cc: Maciej Żenczykowski <maze@google.com>
      Cc: Laurent Chavey <chavey@google.com>
      Cc: Tim Hockin <thockin@google.com>
      Cc: Brandon Philips <brandon.philips@coreos.com>
      Cc: Pavel Emelianov <xemul@parallels.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      2ad7bf36
  3. 17 11月, 2014 1 次提交
    • E
      net: provide a per host RSS key generic infrastructure · 960fb622
      Eric Dumazet 提交于
      RSS (Receive Side Scaling) typically uses Toeplitz hash and a 40 or 52 bytes
      RSS key.
      
      Some drivers use a constant (and well known key), some drivers use a random
      key per port, making bonding setups hard to tune. Well known keys increase
      attack surface, considering that number of queues is usually a power of two.
      
      This patch provides infrastructure to help drivers doing the right thing.
      
      netdev_rss_key_fill() should be used by drivers to initialize their RSS key,
      even if they provide ethtool -X support to let user redefine the key later.
      
      A new /proc/sys/net/core/netdev_rss_key file can be used to get the host
      RSS key even for drivers not providing ethtool -x support, in case some
      applications want to precisely setup flows to match some RX queues.
      
      Tested:
      
      myhost:~# cat /proc/sys/net/core/netdev_rss_key
      11:63:99:bb:79:fb:a5:a7:07:45:b2:20:bf:02:42:2d:08:1a:dd:19:2b:6b:23:ac:56:28:9d:70:c3:ac:e8:16:4b:b7:c1:10:53:a4:78:41:36:40:74:b6:15:ca:27:44:aa:b3:4d:72
      
      myhost:~# ethtool -x eth0
      RX flow hash indirection table for eth0 with 8 RX ring(s):
          0:      0     1     2     3     4     5     6     7
      RSS hash key:
      11:63:99:bb:79:fb:a5:a7:07:45:b2:20:bf:02:42:2d:08:1a:dd:19:2b:6b:23:ac:56:28:9d:70:c3:ac:e8:16:4b:b7:c1:10:53:a4:78:41
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      960fb622
  4. 12 11月, 2014 1 次提交
  5. 11 11月, 2014 1 次提交
    • E
      net: gro: add a per device gro flush timer · 3b47d303
      Eric Dumazet 提交于
      Tuning coalescing parameters on NIC can be really hard.
      
      Servers can handle both bulk and RPC like traffic, with conflicting
      goals : bulk flows want as big GRO packets as possible, RPC want minimal
      latencies.
      
      To reach big GRO packets on 10Gbe NIC, one can use :
      
      ethtool -C eth0 rx-usecs 4 rx-frames 44
      
      But this penalizes rpc sessions, with an increase of latencies, up to
      50% in some cases, as NICs generally do not force an interrupt when
      a packet with TCP Push flag is received.
      
      Some NICs do not have an absolute timer, only a timer rearmed for every
      incoming packet.
      
      This patch uses a different strategy : Let GRO stack decides what do do,
      based on traffic pattern.
      
      Packets with Push flag wont be delayed.
      Packets without Push flag might be held in GRO engine, if we keep
      receiving data.
      
      This new mechanism is off by default, and shall be enabled by setting
      /sys/class/net/ethX/gro_flush_timeout to a value in nanosecond.
      
      To fully enable this mechanism, drivers should use napi_complete_done()
      instead of napi_complete().
      
      Tested:
       Ran 200 netperf TCP_STREAM from A to B (10Gbe mlx4 link, 8 RX queues)
      
      Without this feature, we send back about 305,000 ACK per second.
      
      GRO aggregation ratio is low (811/305 = 2.65 segments per GRO packet)
      
      Setting a timer of 2000 nsec is enough to increase GRO packet sizes
      and reduce number of ACK packets. (811/19.2 = 42)
      
      Receiver performs less calls to upper stacks, less wakes up.
      This also reduces cpu usage on the sender, as it receives less ACK
      packets.
      
      Note that reducing number of wakes up increases cpu efficiency, but can
      decrease QPS, as applications wont have the chance to warmup cpu caches
      doing a partial read of RPC requests/answers if they fit in one skb.
      
      B:~# sar -n DEV 1 10 | grep eth0 | tail -1
      Average:         eth0 811269.80 305732.30 1199462.57  19705.72      0.00
      0.00      0.50
      
      B:~# echo 2000 >/sys/class/net/eth0/gro_flush_timeout
      
      B:~# sar -n DEV 1 10 | grep eth0 | tail -1
      Average:         eth0 811577.30  19230.80 1199916.51   1239.80      0.00
      0.00      0.50
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      3b47d303
  6. 06 11月, 2014 2 次提交
  7. 04 11月, 2014 2 次提交
    • A
      netdevice: add ieee802154_ptr to net_device · 98a18b6f
      Alexander Aring 提交于
      This patch adds an ieee802154_ptr to the net_device structure.
      Furthermore the 802.15.4 subsystem will introduce a nl802154 framework
      which is similar like the nl80211 framework and a wpan_dev structure.
      The wpan_dev structure will hold additional net_device attributes like
      address options which are 802.15.4 specific. In the upcoming nl802154
      implementation we will introduce a NL802154_FLAG_NEED_WPAN_DEV like
      NL80211_FLAG_NEED_WDEV. For this flag an ieee802154_ptr in net_device is
      needed. Additional we can access the wpan_dev attributes in upper layers
      like IEEE 802.15.4 6LoWPAN easily. Current solution is a complicated
      callback interface and getting these values over subif data structure
      in mac802154.
      Signed-off-by: NAlexander Aring <alex.aring@gmail.com>
      Acked-by: NDavid S. Miller <davem@davemloft.net>
      Signed-off-by: NMarcel Holtmann <marcel@holtmann.org>
      98a18b6f
    • E
      net: shrink struct softnet_data · 4cdb1e2e
      Eric Dumazet 提交于
      flow_limit in struct softnet_data is only read from local cpu
      and can be moved to fill a hole, reducing softnet_data size by
      64 bytes on x86_64
      
      While we are at it, move output_queue, output_queue_tailp and
      completion_queue, so that rx / tx paths touch a single cache line.
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      4cdb1e2e
  8. 30 10月, 2014 1 次提交
  9. 16 10月, 2014 1 次提交
    • T
      net: Add ndo_gso_check · 04ffcb25
      Tom Herbert 提交于
      Add ndo_gso_check which a device can define to indicate whether is
      is capable of doing GSO on a packet. This funciton would be called from
      the stack to determine whether software GSO is needed to be done. A
      driver should populate this function if it advertises GSO types for
      which there are combinations that it wouldn't be able to handle. For
      instance a device that performs UDP tunneling might only implement
      support for transparent Ethernet bridging type of inner packets
      or might have limitations on lengths of inner headers.
      Signed-off-by: NTom Herbert <therbert@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      04ffcb25
  10. 09 10月, 2014 1 次提交
  11. 08 10月, 2014 1 次提交
    • E
      net: better IFF_XMIT_DST_RELEASE support · 02875878
      Eric Dumazet 提交于
      Testing xmit_more support with netperf and connected UDP sockets,
      I found strange dst refcount false sharing.
      
      Current handling of IFF_XMIT_DST_RELEASE is not optimal.
      
      Dropping dst in validate_xmit_skb() is certainly too late in case
      packet was queued by cpu X but dequeued by cpu Y
      
      The logical point to take care of drop/force is in __dev_queue_xmit()
      before even taking qdisc lock.
      
      As Julian Anastasov pointed out, need for skb_dst() might come from some
      packet schedulers or classifiers.
      
      This patch adds new helper to cleanly express needs of various drivers
      or qdiscs/classifiers.
      
      Drivers that need skb_dst() in their ndo_start_xmit() should call
      following helper in their setup instead of the prior :
      
      	dev->priv_flags &= ~IFF_XMIT_DST_RELEASE;
      ->
      	netif_keep_dst(dev);
      
      Instead of using a single bit, we use two bits, one being
      eventually rebuilt in bonding/team drivers.
      
      The other one, is permanent and blocks IFF_XMIT_DST_RELEASE being
      rebuilt in bonding/team. Eventually, we could add something
      smarter later.
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Cc: Julian Anastasov <ja@ssi.bg>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      02875878
  12. 07 10月, 2014 1 次提交
  13. 04 10月, 2014 2 次提交
    • T
      fou: eliminate IPv4,v6 specific GRO functions · efc98d08
      Tom Herbert 提交于
      This patch removes fou[46]_gro_receive and fou[46]_gro_complete
      functions. The v4 or v6 variants were chosen for the UDP offloads
      based on the address family of the socket this is not necessary
      or correct. Alternatively, this patch adds is_ipv6 to napi_gro_skb.
      This is set in udp6_gro_receive and unset in udp4_gro_receive. In
      fou_gro_receive the value is used to select the correct inet_offloads
      for the protocol of the outer IP header.
      Signed-off-by: NTom Herbert <therbert@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      efc98d08
    • E
      qdisc: validate skb without holding lock · 55a93b3e
      Eric Dumazet 提交于
      Validation of skb can be pretty expensive :
      
      GSO segmentation and/or checksum computations.
      
      We can do this without holding qdisc lock, so that other cpus
      can queue additional packets.
      
      Trick is that requeued packets were already validated, so we carry
      a boolean so that sch_direct_xmit() can validate a fresh skb list,
      or directly use an old one.
      
      Tested on 40Gb NIC (8 TX queues) and 200 concurrent flows, 48 threads
      host.
      
      Turning TSO on or off had no effect on throughput, only few more cpu
      cycles. Lock contention on qdisc lock disappeared.
      
      Same if disabling TX checksum offload.
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      55a93b3e
  14. 27 9月, 2014 1 次提交
  15. 26 9月, 2014 1 次提交
  16. 20 9月, 2014 1 次提交
  17. 16 9月, 2014 1 次提交
    • A
      dsa: Split ops up, and avoid assigning tag_protocol and receive separately · 5075314e
      Alexander Duyck 提交于
      This change addresses several issues.
      
      First, it was possible to set tag_protocol without setting the ops pointer.
      To correct that I have reordered things so that rcv is now populated before
      we set tag_protocol.
      
      Second, it didn't make much sense to keep setting the device ops each time a
      new slave was registered.  So by moving the receive portion out into root
      switch initialization that issue should be addressed.
      
      Third, I wanted to avoid sending tags if the rcv pointer was not registered
      so I changed the tag check to verify if the rcv function pointer is set on
      the root tree.  If it is then we start sending DSA tagged frames.
      
      Finally I split the device ops pointer in the structures into two spots.  I
      placed the rcv function pointer in the root switch since this makes it
      easiest to access from there, and I placed the xmit function pointer in the
      slave for the same reason.
      Signed-off-by: NAlexander Duyck <alexander.h.duyck@intel.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      5075314e
  18. 14 9月, 2014 4 次提交
  19. 06 9月, 2014 1 次提交
  20. 02 9月, 2014 6 次提交
  21. 30 8月, 2014 2 次提交
  22. 28 8月, 2014 2 次提交
    • F
      net: dsa: allow switches to work without tagging · 5aed85ce
      Florian Fainelli 提交于
      In case switch port tagging is disabled (voluntarily, or the switch just
      does not support it), allow us to continue using the defined set of
      dsa_device_ops in net/dsa/slave.c.
      
      We introduce dsa_protocol_is_tagged() to check whether we need to
      override skb->protocol and go through the DSA-specifif packet_type
      function, or if we just go on and receive the SKB through the normal
      path.
      Signed-off-by: NFlorian Fainelli <f.fainelli@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      5aed85ce
    • F
      net: dsa: reduce number of protocol hooks · 3e8a72d1
      Florian Fainelli 提交于
      DSA is currently registering one packet_type function per EtherType it
      needs to intercept in the receive path of a DSA-enabled Ethernet device.
      Right now we have three of them: trailer, DSA and eDSA, and there might
      be more in the future, this will not scale to the addition of new
      protocols.
      
      This patch proceeds with adding a new layer of abstraction and two new
      functions:
      
      dsa_switch_rcv() which will dispatch into the tag-protocol specific
      receive function implemented by net/dsa/tag_*.c
      
      dsa_slave_xmit() which will dispatch into the tag-protocol specific
      transmit function implemented by net/dsa/tag_*.c
      
      When we do create the per-port slave network devices, we iterate over
      the switch protocol to assign the DSA-specific receive and transmit
      operations.
      
      A new fake ethertype value is used: ETH_P_XDSA to illustrate the fact
      that this is no longer going to look like ETH_P_DSA or ETH_P_TRAILER
      like it used to be.
      
      This allows us to greatly simplify the check in eth_type_trans() and
      always override the skb->protocol with ETH_P_XDSA for Ethernet switches
      tagged protocol, while also reducing the number repetitive slave
      netdevice_ops assignments.
      Signed-off-by: NFlorian Fainelli <f.fainelli@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      3e8a72d1
  23. 26 8月, 2014 1 次提交
    • D
      net: Remove ndo_xmit_flush netdev operation, use signalling instead. · 0b725a2c
      David S. Miller 提交于
      As reported by Jesper Dangaard Brouer, for high packet rates the
      overhead of having another indirect call in the TX path is
      non-trivial.
      
      There is the indirect call itself, and then there is all of the
      reloading of the state to refetch the tail pointer value and
      then write the device register.
      
      Move to a more passive scheme, which requires very light modifications
      to the device drivers.
      
      The signal is a new skb->xmit_more value, if it is non-zero it means
      that more SKBs are pending to be transmitted on the same queue as the
      current SKB.  And therefore, the driver may elide the tail pointer
      update.
      
      Right now skb->xmit_more is always zero.
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      0b725a2c
  24. 25 8月, 2014 2 次提交