1. 02 September 2016, 1 commit
    • rtnetlink: fdb dump: optimize by saving last interface markers · d297653d
      Roopa Prabhu authored
      fdb dumps spanning multiple skbs currently restart from the first
      interface again for every skb. This results in unnecessary
      iterations over the already-visited interfaces and their fdb
      entries. In large-scale setups, we have seen this slow
      down fdb dumps considerably. On a system with 30k macs we
      see fdb dumps spanning more than 300 skbs.
      
      To fix the problem, this patch replaces the existing single fdb
      marker with three markers (netdev hash bucket, netdev, and fdb
      index) so the dump can continue where it left off instead of
      restarting from the first netdev (a rough sketch of the resume
      logic follows after the list below). This is consistent with link dumps.
      
      In the process of fixing the performance issue, this patch also
      re-implements the fix done by
      commit 472681d5 ("net: ndo_fdb_dump should report -EMSGSIZE to rtnl_fdb_dump")
      (with an internal fix from Wilson Kok) in the following ways:
      - change the ndo_fdb_dump handlers to return an error code instead
        of the last fdb index
      - use cb->args strictly for dump frag markers and not error codes.
        This is consistent with other dump functions.
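
      A rough sketch of what the three-marker resume logic looks like (an
      illustrative userspace model; the struct and field names here are
      assumptions, not the actual rtnl_fdb_dump() code):

      /* cb->args[0..2] remember the netdev hash bucket, the netdev within
       * that bucket and the fdb entry within that netdev, so the next round
       * continues where the previous skb filled up. */
      struct cb_model { long args[3]; };

      static void fdb_dump_model(struct cb_model *cb, int nbuckets, int ndevs,
                                 int nfdbs, int room)
      {
              long s_h = cb->args[0], s_d = cb->args[1], s_f = cb->args[2];

              for (long h = s_h; h < nbuckets; h++) {
                      for (long d = (h == s_h ? s_d : 0); d < ndevs; d++) {
                              for (long f = (h == s_h && d == s_d ? s_f : 0);
                                   f < nfdbs; f++) {
                                      if (room-- <= 0) {
                                              /* skb full: record where to resume */
                                              cb->args[0] = h;
                                              cb->args[1] = d;
                                              cb->args[2] = f;
                                              return;
                                      }
                                      /* ... emit one fdb entry into the skb ... */
                              }
                      }
              }
              cb->args[0] = nbuckets;   /* mark the dump as complete */
      }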
      
      Below results were taken on a system with 1000 netdevs
      and 35085 fdb entries:
      before patch:
      $time bridge fdb show | wc -l
      15065
      
      real    1m11.791s
      user    0m0.070s
      sys 1m8.395s
      
      (existing code does not return all macs)
      
      after patch:
      $time bridge fdb show | wc -l
      35085
      
      real    0m2.017s
      user    0m0.113s
      sys 0m1.942s
      Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com>
      Signed-off-by: Wilson Kok <wkok@cumulusnetworks.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  2. 27 August 2016, 1 commit
    • bridge: switchdev: Add forward mark support for stacked devices · 6bc506b4
      Ido Schimmel authored
      switchdev_port_fwd_mark_set() is used to set the 'offload_fwd_mark' of
      port netdevs so that packets being flooded by the device won't be
      flooded twice.
      
      It works by assigning a unique identifier (the ifindex of the first
      bridge port) to bridge ports sharing the same parent ID. This prevents
      packets from being flooded twice by the same switch, but will flood
      packets through bridge ports belonging to a different switch.
      
      This method is problematic when stacked devices are taken into account,
      such as VLANs. In such cases, a physical port netdev can have upper
      devices being members in two different bridges, thus requiring two
      different 'offload_fwd_mark's to be configured on the port netdev, which
      is impossible.
      
      The main problem is that packet and netdev marking is performed at the
      physical netdev level, whereas flooding occurs between bridge ports,
      which are not necessarily port netdevs.
      
      Instead, packet and netdev marking should really be done in the bridge
      driver with the switch driver only telling it which packets it already
      forwarded. The bridge driver will mark such packets using the mark
      assigned to the ingress bridge port and will prevent the packet from
      being forwarded through any bridge port sharing the same mark (i.e.
      having the same parent ID).
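
      A minimal sketch of the per-port check this implies (simplified model
      structs, not the real bridge or skb types):

      struct skb_model     { int offload_fwd_mark; };
      struct br_port_model { int offload_fwd_mark; };

      static int br_port_may_flood(const struct skb_model *skb,
                                   const struct br_port_model *to)
      {
              /* Skip software flooding only when the switch already forwarded
               * this packet to the ports sharing the ingress port's mark. */
              return !skb->offload_fwd_mark ||
                     skb->offload_fwd_mark != to->offload_fwd_mark;
      }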
      
      Remove the current switchdev 'offload_fwd_mark' implementation and
      instead implement the proposed method. In addition, make rocker - the
      sole user of the mark - use the proposed method.
      Signed-off-by: Ido Schimmel <idosch@mellanox.com>
      Signed-off-by: Jiri Pirko <jiri@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  3. 14 August 2016, 1 commit
    • net: remove type_check from dev_get_nest_level() · 952fcfd0
      Sabrina Dubroca authored
      The idea for type_check in dev_get_nest_level() was to count the number
      of nested devices of the same type (currently, only macvlan or vlan
      devices).
      This prevented the false positive lockdep warning on configurations such
      as:
      
      eth0 <--- macvlan0 <--- vlan0 <--- macvlan1
      
      However, this doesn't prevent a warning on a configuration such as:
      
      eth0 <--- macvlan0 <--- vlan0
      eth1 <--- vlan1 <--- macvlan1
      
      In this case, all the locks end up with a nesting subclass of 1, so
      lockdep thinks that there is still a deadlock:
      
      - in the first case we have (macvlan_netdev_addr_lock_key, 1) and then
        take (vlan_netdev_xmit_lock_key, 1)
      - in the second case, we have (vlan_netdev_xmit_lock_key, 1) and then
        take (macvlan_netdev_addr_lock_key, 1)
      
      By removing the linktype check in dev_get_nest_level() and always
      incrementing the nesting depth, lockdep considers this configuration
      valid.
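
      A hedged model of the resulting behaviour, simplified to a linear chain
      of lower devices (the real dev_get_nest_level() walks the adjacency
      lists):

      struct dev_model { const struct dev_model *lower; };

      static int dev_get_nest_level_model(const struct dev_model *dev)
      {
              int depth = 0;

              /* previously only lower devices of the same type were counted */
              for (const struct dev_model *l = dev->lower; l; l = l->lower)
                      depth++;

              return depth;
      }
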
      Signed-off-by: Sabrina Dubroca <sd@queasysnail.net>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  4. 11 August 2016, 1 commit
  5. 25 July 2016, 1 commit
  6. 20 July 2016, 1 commit
  7. 17 July 2016, 1 commit
    • vlan: use a valid default mtu value for vlan over macsec · 18d3df3e
      Paolo Abeni authored
      macsec can't cope with MTU-sized frames which need vlan tag insertion, and
      vlan devices set their default mtu equal to the underlying dev's one.
      By default, vlan-over-macsec devices therefore use an invalid mtu, dropping
      all large packets.
      This patch adds a netif helper to check whether an upper vlan device
      needs mtu reduction. The helper is used during vlan device
      initialization to set a valid default, and during mtu updates to
      forbid invalid, too big, mtu values.
      The helper currently only checks whether the lower dev is a macsec device;
      if we get more users, we only need to update the helper (possibly
      reserving an additional IFF bit).
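
      A hedged sketch of the helper's intent; the names, flag and VLAN_HLEN
      constant below are illustrative assumptions, not necessarily the in-tree
      spelling:

      #define VLAN_HLEN_MODEL 4

      struct lower_model { int mtu; int is_macsec; };

      static int lower_reduces_vlan_mtu(const struct lower_model *lower)
      {
              /* today only macsec; with more users this could become an IFF_ bit */
              return lower->is_macsec;
      }

      static int vlan_default_mtu(const struct lower_model *lower)
      {
              return lower->mtu -
                     (lower_reduces_vlan_mtu(lower) ? VLAN_HLEN_MODEL : 0);
      }
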
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  8. 06 July 2016, 2 commits
  9. 05 July 2016, 1 commit
  10. 01 July 2016, 1 commit
  11. 18 June 2016, 2 commits
  12. 16 June 2016, 2 commits
  13. 13 June 2016, 1 commit
  14. 11 June 2016, 1 commit
    • bpf: enforce recursion limit on redirects · a70b506e
      Daniel Borkmann authored
      Respect the stack's xmit_recursion limit for calls into dev_queue_xmit().
      Currently, they are not handled by the limiter when attached to clsact's
      egress parent, for example, and a buggy program redirecting packets to the
      same device again could eventually run into a stack overflow. It would be
      good if we could notify an admin to give them a chance to react. We reuse
      xmit_recursion instead of having one private to eBPF, so that the stack's
      current recursion depth will be taken into account as well. Follow-up to
      commit 3896d655 ("bpf: introduce bpf_clone_redirect() helper") and
      27b29f63 ("bpf: add bpf_redirect() helper").
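
      A minimal sketch of the recursion guard being reused, with an illustrative
      limit value and a thread-local stand-in for the kernel's per-cpu counter:

      #define XMIT_RECURSION_LIMIT_MODEL 8    /* illustrative, not the real value */

      static __thread unsigned int xmit_recursion_model;

      static int bpf_redirect_xmit_model(void *skb, int (*xmit)(void *))
      {
              int ret;

              if (xmit_recursion_model > XMIT_RECURSION_LIMIT_MODEL)
                      return -1;      /* drop instead of overflowing the stack */

              xmit_recursion_model++;
              ret = xmit(skb);        /* may recurse back into this path */
              xmit_recursion_model--;
              return ret;
      }
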
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  15. 10 June 2016, 1 commit
  16. 09 June 2016, 1 commit
  17. 08 June 2016, 1 commit
  18. 04 June 2016, 1 commit
    • sctp: Add GSO support · 90017acc
      Marcelo Ricardo Leitner authored
      SCTP has the peculiarity that its packets cannot simply be segmented to
      (P)MTU. Its chunks must be contained in IP segments, with padding respected.
      So we can't just generate a big skb, set gso_size to the fragmentation
      point and deliver it to the IP layer.
      
      This patch takes a different approach. SCTP will now build an skb as if
      it had been received via GRO. That is, there will be a cover skb with the
      protocol headers and child skbs containing the actual segments, already
      split in a way that respects SCTP's RFCs.

      With that, we can tell skb_segment() to just split based on frag_list,
      trusting that the sizes are already correct.
      
      This way SCTP can benefit from GSO and instead of passing several
      packets through the stack, it can pass a single large packet.
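
      A hedged illustration of the skb layout described above (diagram only,
      not kernel code):

      /*
       *   head skb:  IP header + SCTP common header (the "cover")
       *     frag_list -> child 1: chunks already sized and padded to the PMTU
       *               -> child 2: ...
       *               -> child N: ...
       *
       * skb_segment() is then told to split on the frag_list boundaries,
       * trusting that each child already respects SCTP's chunk/padding rules.
       */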
      
      v2:
      - Added support for receiving GSO frames, as requested by Dave Miller.
      - Clear skb->cb if packet is GSO (otherwise it's not used by SCTP)
      - Added heuristics similar to what we have in TCP for not generating
        single GSO packets that fill the cwnd.
      v3:
      - consider sctphdr size in skb_gso_transport_seglen()
      - rebased due to 5c7cdf33 ("gso: Remove arbitrary checks for
        unsupported GSO")
      Signed-off-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
      Tested-by: Xin Long <lucien.xin@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  19. 21 May 2016, 1 commit
  20. 17 May 2016, 1 commit
  21. 12 May 2016, 1 commit
    • net: l3mdev: Add hook in ip and ipv6 · 74b20582
      David Ahern authored
      Currently the VRF driver uses the rx_handler to switch the skb device
      to the VRF device. Switching the dev prior to the ip / ipv6 layer
      means the VRF driver has to duplicate IP/IPv6 processing which adds
      overhead and makes features such as retaining the ingress device index
      more complicated than necessary.
      
      This patch moves the hook to the L3 layer just after the first NF_HOOK
      for PRE_ROUTING. This location makes exposing the original ingress device
      trivial (next patch) and allows adding other NF_HOOKs to the VRF driver
      in the future.
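
      A simplified sketch of where the hook sits, using model types; the real
      hook is called from the IPv4/IPv6 receive path right after the
      PRE_ROUTING NF_HOOK:

      struct skb_l3_model { int consumed_by_l3mdev; };

      static struct skb_l3_model *l3mdev_ip_rcv_model(struct skb_l3_model *skb)
      {
              /* A VRF (l3 master) driver may switch skb->dev here, or consume
               * the skb entirely, before the route lookup happens. */
              return skb->consumed_by_l3mdev ? 0 : skb;
      }

      static int ip_rcv_finish_model(struct skb_l3_model *skb)
      {
              skb = l3mdev_ip_rcv_model(skb);
              if (!skb)
                      return 0;       /* handled by the l3mdev driver */
              /* ... continue with route lookup and local delivery ... */
              return 0;
      }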
      
      dev_queue_xmit_nit is exported so that the VRF driver can cycle the skb
      with the switched device through the packet taps to maintain current
      behavior (tcpdump can be used on either the vrf device or the enslaved
      devices).
      Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  22. 07 May 2016, 1 commit
    • udp_offload: Set encapsulation before inner completes. · 229740c6
      Jarno Rajahalme authored
      UDP tunnel segmentation code relies on the inner offsets being set for
      a UDP tunnel GSO packet, but the inner *_complete() functions will
      set the inner offsets only if 'encapsulation' is set before calling
      them.  Currently, udp_gro_complete() sets 'encapsulation' only after
      the inner *_complete() functions are done.  This causes the inner
      offsets to have invalid values after udp_gro_complete() returns, which
      in turn makes it impossible to properly segment the packet in case
      it needs to be forwarded, which would be visible to the user either as
      invalid packets being sent or as packet loss.
      
      This patch fixes this by setting skb's 'encapsulation' in
      udp_gro_complete() before calling into the inner complete functions,
      and by making each possible UDP tunnel gro_complete() callback set the
      inner_mac_header to the beginning of the tunnel payload.
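
      A minimal model of the ordering fix (illustrative types; not the literal
      udp_gro_complete()):

      struct gro_skb_model { int encapsulation; int inner_offsets_valid; };

      static int inner_complete_model(struct gro_skb_model *skb)
      {
              /* inner handlers record the inner offsets only when the skb is
               * already flagged as encapsulated */
              if (skb->encapsulation)
                      skb->inner_offsets_valid = 1;
              return 0;
      }

      static int udp_gro_complete_model(struct gro_skb_model *skb)
      {
              skb->encapsulation = 1;         /* set BEFORE the inner handler */
              return inner_complete_model(skb);
      }
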
      Signed-off-by: Jarno Rajahalme <jarno@ovn.org>
      Reviewed-by: Alexander Duyck <aduyck@mirantis.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  23. 05 May 2016, 2 commits
    • net: remove dev->trans_start · 9b36627a
      Florian Westphal authored
      Previous patches removed all direct accesses to dev->trans_start,
      so change the netif_trans_update helper to update trans_start of
      netdev queue 0 instead, and then remove trans_start from struct net_device.

      AFAICS a lot of the netif_trans_update() invocations are now useless
      because they occur in ndo_start_xmit and the driver doesn't set LLTX
      (i.e. the stack already took care of the update).
      
      As I can't test any of them it seems better to just leave them alone.
      Signed-off-by: Florian Westphal <fw@strlen.de>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • netdevice: add helper to update trans_start · ba162f8e
      Florian Westphal authored
      trans_start exists twice:
      - as member of net_device (legacy)
      - as member of netdev_queue
      
      In order to get rid of the legacy case, add a helper for the
      dev->trans_start update (this patch), then convert the spots that do

      dev->trans_start = jiffies

      to use this helper (next patch).

      This will then allow us to change the helper so that it updates the
      trans_start of netdev queue 0 instead.
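
      A hedged sketch of the two stages described by this patch and the
      follow-up listed above (the exact in-tree code may differ in detail):

      /* Stage 1 (this patch): the helper still writes the legacy field. */
      static inline void netif_trans_update_v1(struct net_device *dev)
      {
              dev->trans_start = jiffies;
      }

      /* Stage 2 (the follow-up): point it at tx queue 0 so that
       * net_device::trans_start can be removed entirely. */
      static inline void netif_trans_update_v2(struct net_device *dev)
      {
              struct netdev_queue *txq = netdev_get_tx_queue(dev, 0);

              if (txq->trans_start != jiffies)
                      txq->trans_start = jiffies;
      }
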
      Signed-off-by: Florian Westphal <fw@strlen.de>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  24. 03 May 2016, 1 commit
  25. 30 April 2016, 1 commit
  26. 29 April 2016, 1 commit
    • net: fix net_gso_ok for new GSO types. · 7b748340
      Marcelo Ricardo Leitner authored
      Fix the casting in net_gso_ok(). Otherwise the shift in
      gso_type << NETIF_F_GSO_SHIFT may hit the 32nd bit and make the value look
      like INT_MIN, which is then sign-extended when promoted from signed int to
      uint64, giving 0xffffffff80000000. This results in wrong behavior when it
      is ANDed with the feature mask, as in:
      
      This test app:
      #include <stdio.h>
      #include <stdint.h>
      
      int main(int argc, char **argv)
      {
      	uint64_t feature1;
      	uint64_t feature2;
      	int gso_type = 1 << 15;
      
      	feature1 = gso_type << 16;
      	feature2 = (uint64_t)gso_type << 16;
      	printf("%lx %lx\n", feature1, feature2);
      
      	return 0;
      }
      
      Gives:
      ffffffff80000000 80000000
      
      So this:
         return (features & feature) == feature;
      actually tests more bits than expected, including invalid ones.

      The fix is to promote the value to the 64-bit feature type before shifting.
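
      A hedged sketch of the corrected computation, using stand-in names for
      the feature type and shift constant:

      typedef unsigned long long netdev_features_model_t;
      #define NETIF_F_GSO_SHIFT_MODEL 16

      static int net_gso_ok_model(netdev_features_model_t features, int gso_type)
      {
              /* cast to the 64-bit feature type BEFORE shifting, so the shift
               * never happens in (signed) int */
              netdev_features_model_t feature =
                      (netdev_features_model_t)gso_type << NETIF_F_GSO_SHIFT_MODEL;

              return (features & feature) == feature;
      }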
      
      Issue noted while rebasing the SCTP GSO patch, but posting separately as
      someone else may hit this in the meantime.
      Signed-off-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  27. 28 April 2016, 1 commit
  28. 27 April 2016, 1 commit
  29. 22 April 2016, 2 commits
  30. 15 April 2016, 3 commits
    • GSO: Support partial segmentation offload · 802ab55a
      Alexander Duyck authored
      This patch adds support for something I am referring to as GSO partial.
      The basic idea is that we can support a broader range of devices for
      segmentation if we use fixed outer headers and have the hardware only
      really deal with segmenting the inner header.  The idea behind the naming
      is due to the fact that everything before csum_start will be fixed headers,
      and everything after will be the region that is handled by hardware.
      
      With the current implementation it allows us to add support for the
      following GSO types with an inner TSO_MANGLEID or TSO6 offload:
      NETIF_F_GSO_GRE
      NETIF_F_GSO_GRE_CSUM
      NETIF_F_GSO_IPIP
      NETIF_F_GSO_SIT
      NETIF_F_GSO_UDP_TUNNEL
      NETIF_F_GSO_UDP_TUNNEL_CSUM
      
      In the case of hardware that already supports tunneling we may be able to
      extend this further to support TSO_TCPV4 without TSO_MANGLEID if the
      hardware can support updating inner IPv4 headers.
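
      A hedged illustration of the GSO-partial split point (diagram only, not
      kernel code):

      /*
       *   outer IP | outer UDP/GRE | inner IP || inner TCP | payload
       *   `-------- before csum_start --------'`--- from csum_start ---'
       *
       * Everything before csum_start is treated as one fixed header blob that
       * is replicated as-is onto every segment; only the region from
       * csum_start onward (the inner transport header and payload) is
       * actually segmented by the device.
       */
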
      Signed-off-by: Alexander Duyck <aduyck@mirantis.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • GRO: Add support for TCP with fixed IPv4 ID field, limit tunnel IP ID values · 1530545e
      Alexander Duyck authored
      This patch does two things.
      
      First it allows TCP to aggregate TCP frames with a fixed IPv4 ID field.  As
      a result we should now be able to aggregate flows that were converted from
      IPv6 to IPv4.  In addition this allows us more flexibility for future
      implementations of segmentation as we may be able to use a fixed IP ID when
      segmenting the flow.
      
      The second thing this does is place limitations on the outer IPv4 ID
      field in the case of tunneled frames.  Specifically, it forces the IP ID
      to increment by 1 unless the DF bit is set in the outer IPv4 header.
      This way we can avoid creating overlapping series of IP IDs that could
      possibly be fragmented if the frame goes through GRO and is then
      resegmented via GSO.
      Signed-off-by: Alexander Duyck <aduyck@mirantis.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • GSO: Add GSO type for fixed IPv4 ID · cbc53e08
      Alexander Duyck authored
      This patch adds support for TSO using IPv4 headers with a fixed IP ID
      field.  This is meant to allow us to do a lossless GRO in the case of TCP
      flows that use a fixed IP ID such as those that convert IPv6 header to IPv4
      headers.
      
      In addition I am adding a feature that, for now, I am referring to as TSO
      with IP ID mangling.  Basically when this flag is enabled the device has the
      option to either output the flow with incrementing IP IDs or with a fixed
      IP ID regardless of what the original IP ID ordering was.  This is useful
      in cases where the DF bit is set and we do not care if the original IP ID
      value is maintained.
      Signed-off-by: Alexander Duyck <aduyck@mirantis.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  31. 14 April 2016, 2 commits
    • net: remove netdevice gso_min_segs · 743b03a8
      Eric Dumazet authored
      After the introduction of ndo_features_check(), we believe that very
      specific checks for rare features should not be done in the core
      networking stack.

      No driver uses gso_min_segs yet, so we revert this feature and save a
      few instructions per tx packet in the fast path.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: force inlining of netif_tx_start/stop_queue, sock_hold, __sock_put · f9a7cbbf
      Denys Vlasenko authored
      Sometimes gcc mysteriously doesn't inline
      very small functions we expect to be inlined. See
          https://gcc.gnu.org/bugzilla/show_bug.cgi?id=66122
      Arguably, gcc should do better, but gcc people aren't willing
      to invest time into it, asking to use __always_inline instead.
      
      With this .config:
      http://busybox.net/~vda/kernel_config_OPTIMIZE_INLINING_and_Os,
      the following functions get deinlined many times.
      
      netif_tx_stop_queue: 207 copies, 590 calls:
      	55                      push   %rbp
      	48 89 e5                mov    %rsp,%rbp
      	f0 80 8f e0 01 00 00 01 lock orb $0x1,0x1e0(%rdi)
      	5d                      pop    %rbp
      	c3                      retq
      
      netif_tx_start_queue: 47 copies, 111 calls
      	55                      push   %rbp
      	48 89 e5                mov    %rsp,%rbp
      	f0 80 a7 e0 01 00 00 fe lock andb $0xfe,0x1e0(%rdi)
      	5d                      pop    %rbp
      	c3                      retq
      
      sock_hold: 39 copies, 124 calls
      	55                      push   %rbp
      	48 89 e5                mov    %rsp,%rbp
      	f0 ff 87 80 00 00 00    lock incl 0x80(%rdi)
      	5d                      pop    %rbp
      	c3                      retq
      
      __sock_put: 6 copies, 13 calls
      	55                      push   %rbp
      	48 89 e5                mov    %rsp,%rbp
      	f0 ff 8f 80 00 00 00    lock decl 0x80(%rdi)
      	5d                      pop    %rbp
      	c3                      retq
      
      This patch fixes this via s/inline/__always_inline/.
      
      Code size decrease after the patch is ~2.5k:
      
          text      data      bss       dec     hex filename
      56719876  56364551 36196352 149280779 8e5d80b vmlinux_before
      56717440  56364551 36196352 149278343 8e5ce87 vmlinux
      Signed-off-by: Denys Vlasenko <dvlasenk@redhat.com>
      CC: David S. Miller <davem@davemloft.net>
      CC: linux-kernel@vger.kernel.org
      CC: netdev@vger.kernel.org
      CC: netfilter-devel@vger.kernel.org
      Signed-off-by: David S. Miller <davem@davemloft.net>
  32. 08 April 2016, 1 commit
    • GRE: Disable segmentation offloads w/ CSUM and we are encapsulated via FOU · a0ca153f
      Alexander Duyck authored
      This patch fixes an issue I found in which we were dropping frames if we
      had enabled checksums on GRE headers that were encapsulated by either FOU
      or GUE.  Without this patch I was barely able to get 1 Gb/s of throughput.
      With this patch applied I am now at least getting around 6 Gb/s.
      
      The issue is due to the fact that with FOU or GUE applied we do not provide
      a transport offset pointing to the GRE header, nor do we offload it in
      software as the GRE header is completely skipped by GSO and treated like a
      VXLAN or GENEVE type header.  As such we need to prevent the stack from
      generating it and also prevent GRE from generating it via any interface we
      create.
      
      Fixes: c3483384 ("gro: Allow tunnel stacking in the case of FOU/GUE")
      Signed-off-by: Alexander Duyck <aduyck@mirantis.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>