1. 06 Nov, 2014 12 commits
    • D
      Merge branch 'gue-next' · 1d76c1d0
      David S. Miller committed
      Tom Herbert says:
      
      ====================
      gue: Remote checksum offload
      
      This patch set implements remote checksum offload for
      GUE, which is a mechanism that provides checksum offload of
      encapsulated packets using rudimentary offload capabilities found in
      most Network Interface Card (NIC) devices. The outer header checksum
      for UDP is enabled in packets and, with some additional meta
      information in the GUE header, a receiver is able to deduce the
      checksum to be set for an inner encapsulated packet. Effectively this
      offloads the computation of the inner checksum. Enabling the outer
      checksum in encapsulation has the additional advantage that it covers
      more of the packet than the inner checksum including the encapsulation
      headers.
      
      Remote checksum offload is described in:
      http://tools.ietf.org/html/draft-herbert-remotecsumoffload-01
      
      The GUE transmit and receive paths are modified to support the
      remote checksum offload option. The option contains a checksum
      offset and checksum start which are directly derived from values
      set in the stack when doing CHECKSUM_PARTIAL. On receipt of the option, the
      operation is to calculate the packet checksum from "start" to end of
      the packet (normally derived for checksum complete), and then set
      the resultant value at checksum "offset" (the checksum field has
      already been primed with the pseudo header). This emulates a NIC
      that implements NETIF_F_HW_CSUM.
      
      The primary purpose of this feature is to eliminate the cost of performing
      checksum calculation over a packet when encapsulating.
      
      In this patch set:
        - Move fou_build_header into fou.c and split it into a couple of
          functions
        - Enable offloading of outer UDP checksum in encapsulation
        - Change udp_offload to support remote checksum offload; this includes
          a new GSO type and ensures that encapsulated layers (TCP) don't try to
          set a checksum covered by RCO
        - TX support for RCO with GUE. This is configured through ip_tunnel
          and sets the option on transmit when the packet being encapsulated is
          CHECKSUM_PARTIAL
        - RX support for RCO with GUE for normal and GRO paths. Includes
          resolving the offloaded checksum
      
      v2:
        Address comments from davem: Move accounting for the private option
        field in gue_encap_hlen to the patch in which we add the remote checksum
        offload option.
      
      Testing:
      
      I ran performance numbers using netperf TCP_STREAM and TCP_RR with 200
      streams, comparing GUE with and without remote checksum offload (doing
      checksum-unnecessary to complete conversion in both cases). These
      were run on mlnx4 and bnx2x. Some mlnx4 results are below.
      
      GRE/GUE
          TCP_STREAM
            IPv4, with remote checksum offload
              9.71% TX CPU utilization
              7.42% RX CPU utilization
              36380 Mbps
            IPv4, without remote checksum offload
              12.40% TX CPU utilization
              7.36% RX CPU utilization
              36591 Mbps
          TCP_RR
            IPv4, with remote checksum offload
              77.79% CPU utilization
              91/144/216 90/95/99% latencies
              1.95127e+06 tps
            IPv4, without remote checksum offload
              78.70% CPU utilization
              89/152/297 90/95/99% latencies
              1.95458e+06 tps
      
      IPIP/GUE
          TCP_STREAM
            With remote checksum offload
              10.30% TX CPU utilization
              7.43% RX CPU utilization
              36486 Mbps
            Without remote checksum offload
              12.47% TX CPU utilization
              7.49% RX CPU utilization
              36694 Mbps
          TCP_RR
            With remote checksum offload
              77.80% CPU utilization
              87/153/270 90/95/99% latencies
              1.98735e+06 tps
            Without remote checksum offload
              77.98% CPU utilization
              87/150/287 90/95/99% latencies
              1.98737e+06 tps
      
      SIT/GUE
          TCP_STREAM
            With remote checksum offload
              9.68% TX CPU utilization
              7.36% RX CPU utilization
              35971 Mbps
            Without remote checksum offload
              12.95% TX CPU utilization
              8.04% RX CPU utilization
              36177 Mbps
          TCP_RR
            With remote checksum offload
              79.32% CPU utilization
              94/158/295 90/95/99% latencies
              1.88842e+06 tps
            Without remote checksum offload
              80.23% CPU utilization
              94/149/226 90/95/99% latencies
              1.90338e+06 tps
      
      VXLAN
          TCP_STREAM
              35.03% TX CPU utilization
              20.85% RX CPU utilization
              36230 Mbps
          TCP_RR
              77.36% CPU utilization
              84/146/270 90/95/99% latencies
              2.08063e+06 tps
      
      We can also look at CPU time in csum_partial using perf (with bnx2x
      setup). For GRE with TCP_STREAM I see:
      
          With remote checksum offload
              0.33% TX
              1.81% RX
          Without remote checksum offload
              6.00% TX
              0.51% RX
      
      I suspect the fact that time in csum_partial noticeably increases
      with remote checksum offload for RX is due to taking the cache miss on
      the encapsulated header in that function. By similar reasoning, if on
      the TX side the packet were not in cache (say we did a splice from a
      file whose data was never touched by the CPU) the CPU savings for TX
      would probably be more pronounced.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
      1d76c1d0
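The TX CPU figures in the cover letter above can be turned into relative savings with a little arithmetic. The following is only a sketch in Python: the input numbers are copied from the mlx4 TCP_STREAM results quoted in the message, while the dictionary and function names are made up for this illustration.

```python
# TX CPU utilization (%) from the TCP_STREAM results quoted above:
# (with remote checksum offload, without remote checksum offload)
TX_CPU = {
    "GRE/GUE": (9.71, 12.40),
    "IPIP/GUE": (10.30, 12.47),
    "SIT/GUE": (9.68, 12.95),
}


def relative_reduction(with_rco: float, without_rco: float) -> float:
    """Percentage by which TX CPU utilization drops when RCO is enabled."""
    return (without_rco - with_rco) / without_rco * 100.0


if __name__ == "__main__":
    for tunnel, (with_rco, without_rco) in TX_CPU.items():
        saving = relative_reduction(with_rco, without_rco)
        print(f"{tunnel}: {saving:.1f}% less TX CPU with RCO")
```

By this measure the TX CPU reduction falls between roughly 17% and 25% for the three tunnel types, while the reported throughput stays within about 1% in each case.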
    • T
      gue: Receive side of remote checksum offload · a8d31c12
      Tom Herbert committed
      Add processing of the remote checksum offload option in both the normal
      path as well as the GRO path. This implements patching the affected
      checksum to derive the offloaded checksum.
      Signed-off-by: Tom Herbert <therbert@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      a8d31c12
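The receive-side operation described in the cover letter (sum the packet from "start" to the end, then store the result at "offset", where the field was already primed with the pseudo header) can be sketched in a few lines. This is an illustrative Python model, not the kernel code: the function names are hypothetical, and it assumes both `start` and `offset` are byte positions relative to the beginning of the encapsulated packet.

```python
def ones_complement_sum(data: bytes) -> int:
    """16-bit one's-complement sum (odd lengths are padded with a zero byte)."""
    if len(data) % 2:
        data += b"\x00"
    total = 0
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]
    # Fold the carries back into the low 16 bits.
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)
    return total


def resolve_remote_checksum(packet: bytearray, start: int, offset: int) -> None:
    """Sum packet[start:] and write the complemented result at packet[offset].

    Assumption of this sketch: the two bytes at `offset` already hold the
    pseudo-header sum, as they would for a CHECKSUM_PARTIAL packet.
    """
    csum = ~ones_complement_sum(bytes(packet[start:])) & 0xFFFF
    packet[offset:offset + 2] = csum.to_bytes(2, "big")
```

After this step, a receiver that sums the pseudo header together with the bytes from `start` onward obtains 0xFFFF, which is the usual validity condition for an Internet checksum.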
    • T
      gue: TX support for using remote checksum offload option · b17f709a
      Tom Herbert committed
      Add if_tunnel flag TUNNEL_ENCAP_FLAG_REMCSUM to configure
      remote checksum offload on an IP tunnel. Add logic in gue_build_header
      to insert the remote checksum offload option.
      Signed-off-by: Tom Herbert <therbert@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      b17f709a
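On transmit, the option carries a checksum start and a checksum offset derived from the CHECKSUM_PARTIAL state of the packet being encapsulated. The sketch below shows one way such an option could be packed and parsed. It is a Python model under stated assumptions: the option is taken to be two 16-bit big-endian fields, both positions are taken relative to the start of the encapsulated packet, and the function and parameter names are invented for the example.

```python
import struct


def build_remcsum_option(csum_start: int, csum_offset: int, inner_start: int) -> bytes:
    """Pack (start, offset) for a remote checksum offload option.

    csum_start  -- where checksumming begins, counted from the buffer start
    csum_offset -- position of the checksum field, counted from csum_start
    inner_start -- where the encapsulated packet begins in the same buffer
    All three names are hypothetical; the layout is an assumption.
    """
    start = csum_start - inner_start
    if start < 0:
        raise ValueError("checksum start lies before the encapsulated packet")
    return struct.pack("!HH", start, start + csum_offset)


def parse_remcsum_option(option: bytes) -> tuple[int, int]:
    """Return (start, offset) from a 4-byte option built as above."""
    start, offset = struct.unpack("!HH", option)
    return start, offset
```

For a TCP segment inside an IPv4 packet without options, `start` would be 20 (the inner IP header length) and `offset` would be 36 (20 plus the 16-byte position of the TCP checksum field).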
    • T
      gue: Protocol constants for remote checksum offload · c1aa8347
      Tom Herbert committed
      Define a private flag for remote checksum offload as well as a length
      for the option.
      Signed-off-by: Tom Herbert <therbert@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      c1aa8347
    • T
      udp: Changes to udp_offload to support remote checksum offload · e585f236
      Tom Herbert committed
      Add a new GSO type, SKB_GSO_TUNNEL_REMCSUM, which indicates that remote
      checksum offload is being done (in this case the inner checksum must not
      be offloaded to the NIC).
      
      Added logic in __skb_udp_tunnel_segment to handle the remote checksum
      offload case.
      Signed-off-by: Tom Herbert <therbert@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      e585f236
    • T
      gue: Add infrastructure for flags and options · 5024c33a
      Tom Herbert committed
      Add functions and basic definitions for processing standard flags,
      private flags, and control messages. This includes definitions
      to compute the length of optional fields corresponding to a set of flags.
      Flag validation is in the validate_gue_flags function. This checks for
      unknown flags, and that the length of the optional fields is <= the length
      in guehdr hlen.
      Signed-off-by: Tom Herbert <therbert@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      5024c33a
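The two checks named in this commit message (reject unknown flags, and require that the optional fields implied by the flags fit in the header length) follow a common pattern. A minimal Python sketch of that pattern is given here. The flag bits and field lengths in the table are placeholders chosen for illustration, not the real GUE constants, and the function names are hypothetical.

```python
# Hypothetical table: flag bit -> length in bytes of the optional field it enables.
FLAG_FIELD_LEN = {
    0x01: 4,
    0x02: 8,
}


def optional_fields_len(flags: int) -> int:
    """Total length of the optional fields selected by `flags`."""
    return sum(length for bit, length in FLAG_FIELD_LEN.items() if flags & bit)


def validate_flags(flags: int, header_optlen: int) -> bool:
    """Accept only known flags whose optional fields fit in the header.

    header_optlen is the space the header declares for optional data.
    """
    known_mask = 0
    for bit in FLAG_FIELD_LEN:
        known_mask |= bit
    if flags & ~known_mask:
        return False  # at least one flag we do not understand
    return optional_fields_len(flags) <= header_optlen
```

Checking for unknown bits first means a packet is rejected before any length arithmetic is done on flags the receiver cannot interpret.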
    • T
      udp: Offload outer UDP tunnel csum if available · 4bcb877d
      Tom Herbert committed
      In __skb_udp_tunnel_segment, if outer UDP checksums are enabled and
      ip_summed is not already CHECKSUM_PARTIAL, set up checksum offload
      if device features allow it.
      Signed-off-by: Tom Herbert <therbert@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      4bcb877d
    • T
      net: Move fou_build_header into fou.c and refactor · 63487bab
      Tom Herbert committed
      Move fou_build_header out of ip_tunnel.c and into fou.c, splitting
      it up into fou_build_header, gue_build_header, and fou_build_udp.
      This allows for other users for TX of FOU or GUE. Change ip_tunnel_encap
      to call fou_build_header or gue_build_header based on the tunnel
      encapsulation type. Similarly, added fou_encap_hlen and gue_encap_hlen
      functions which are called by ip_encap_hlen. The new net/fou.h has
      prototypes and defines for this.
      
      Added NET_FOU_IP_TUNNELS configuration. When this is set, IP tunnels
      can use FOU/GUE and the fou module is also selected.
      Signed-off-by: Tom Herbert <therbert@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      63487bab
    • D
      Merge branch 'stmmac-next' · 890b7916
      David S. Miller committed
      Giuseppe Cavallaro says:
      
      ====================
      stmmac: review driver Koptions
      
      Recently many Koptions have been added to support new glue logic on several
      platforms.
      
      The main goal behind this work is to guarantee that the driver builds
      fine on all the branches where it is present, independently of which
      glue logic is selected.
      
      IMHO, it is better to remove all the unnecessary Koption(s) that can hide
      build problems when something changes in the driver, especially when
      the DT compatibility allows us to manage all the platform data.
      
      I compiled the driver without any issue on net-next Git for:
      
        x86, arm and sh4.
      
      If there are build problems on some repos, it will now be
      easy to catch them and cherry-pick patches from mainstream.
      
      Do not hesitate to contact me in case of issues.
      
      This set also removes STMMAC_DEBUG_FS and BUS_MODE_DA. The latter is useless
      and the former can be replaced by DEBUG_FS (always, to keep the build safe).
      
      V2: patch-set re-based on top of the latest updates for net-next
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
      890b7916
    • G
      stmmac: remove BUS_MODE_DA · 98fbebcb
      Giuseppe CAVALLARO committed
      This is a very old and often unused option to configure
      a bit in a register inside the DMA. This support should
      not stay under a Koption and should be extended for new chips too.
      This will be done later, maybe via device-tree parameters.
      Also, there is no performance impact when removing this setting on STi platforms.
      Signed-off-by: Giuseppe Cavallaro <peppe.cavallaro@st.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      98fbebcb
    • G
      stmmac: remove STMMAC_DEBUG_FS · 50fb4f74
      Giuseppe CAVALLARO committed
      The STMMAC_DEBUG_FS Koption is now removed from the
      driver configuration and this support will be built
      by default when DEBUG_FS is present. This can also be
      useful when verifying the driver build.
      Signed-off-by: Giuseppe Cavallaro <peppe.cavallaro@st.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      50fb4f74
    • G
      stmmac: remove specific SoC Koption from platform. · c0d54066
      Giuseppe CAVALLARO committed
      This patch removes all the Koptions added to build the glue-logic files
      for all different architectures: DWMAC_MESON, DWMAC_SUNXI, DWMAC_STI ...
      Nowadays the stmmac needs to be compiled on several platforms; in some
      cases it is very convenient to guarantee that its build always completes
      successfully on all the branches where the driver is present.
      Signed-off-by: Giuseppe Cavallaro <peppe.cavallaro@st.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      c0d54066
  2. 05 Nov, 2014 17 commits
  3. 04 Nov, 2014 11 commits
    • E
      net: add rbnode to struct sk_buff · 56b17425
      Eric Dumazet committed
      Yaogong replaces the TCP out-of-order receive queue with an RB tree.
      
      As netem already does a private skb->{next/prev/tstamp} union
      with a 'struct rb_node', let's do this in a cleaner way.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Cc: Yaogong Wang <wygivan@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      56b17425
    • D
      Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/jkirsher/net-next · 8ce0c825
      David S. Miller committed
      Jeff Kirsher says:
      
      ====================
      Intel Wired LAN Driver Updates 2014-11-03
      
      This series contains updates to i40e and i40evf.
      
      Akeem adds a check for i40e so that flow director flush and reinit are
      not done when flow director is not enabled.
      
      Mitch fixes the i40evf driver to properly handle multiple admin queue
      messages, by reinitializing the msg_size field each time we go through the loop.
      Without this, we may receive truncated messages due to the firmware
      thinking we have insufficient buffer size.  Also fixes the link checking
      logic to only check the carrier state if the interface is actually
      open, which allows link changes to be reported correctly without spamming
      the VFs.  Updates i40e to insert the VSI ID in the QTX_CTL register
      when configuring queues for VMDq VSIs.
      
      Paul adds support for 10G-base-T in i40evf.
      
      Jesse fixes i40e where the call to irq_dynamic_disable() was turning off
      the interrupt completely when trying to set ITR to 0 (for lowest
      moderation).
      
      Shannon removes debugfs dump stats function, since it was not being
      kept up-to-date and was redundant with the ethtool output.  Also, scales
      back the LAN MSIx usage to force queue/vector sharing and leave some
      vectors for Flow Director, VMDq, etc. when there are more cores than
      vectors available to the PF.  Cleans up the error reporting for
      get_lump() resource tracking errors.  Also adds a check for the
      debug module parameter earlier to be able to catch the early configuration
      phase admin queue messages.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
      8ce0c825
    • S
      hamradio: 6pack: remove unnecessary check · ec5a0ec1
      Sudip Mukherjee committed
      This check for dev is unnecessary, as we are already checking dev
      after allocating it via alloc_netdev, and jumping to label: out
      if it is NULL.
      Signed-off-by: Sudip Mukherjee <sudip@vectorindia.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      ec5a0ec1
    • D
      PPC: bpf_jit_comp: add SKF_AD_PKTTYPE instruction · 4e235761
      Denis Kirjanov committed
      Add BPF extension SKF_AD_PKTTYPE to ppc JIT to load
      skb->pkt_type field.
      
      Before:
      [   88.262622] test_bpf: #11 LD_IND_NET 86 97 99 PASS
      [   88.265740] test_bpf: #12 LD_PKTTYPE 109 107 PASS
      
      After:
      [   80.605964] test_bpf: #11 LD_IND_NET 44 40 39 PASS
      [   80.607370] test_bpf: #12 LD_PKTTYPE 9 9 PASS
      
      Cc: Alexei Starovoitov <alexei.starovoitov@gmail.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Matt Evans <matt@ozlabs.org>
      Signed-off-by: Denis Kirjanov <kda@linux-powerpc.org>
      
      v2: Added test results
      Signed-off-by: David S. Miller <davem@davemloft.net>
      4e235761
    • D
      Merge branch 'mlx4-next' · 547f2735
      David S. Miller committed
      Or Gerlitz says:
      
      ====================
      Mellanox ethernet driver update Oct-30-2014
      
      The 1st patch from Saeed fixes a bug in the last net-next batch where
      a VF could get access to set port configuration; the next patch from Amir
      fixes a race in the port VPI logic. Next are two performance patches from Ido.
      
      The patch to add checksum complete status on GRE and such packets was
      preceded by a patch that converted the driver to only use napi_gro_receive
      vs. the current code which goes through napi_gro_frags on its usual track.
      Eric D. has some thoughts and suggestions on that change which we
      want to take the time to consider, so for the time being we dropped that
      patch and the ones that depend on it.
      
      Changes from V0:
        - have the caller provide the __GFP_COLD hint to the service function
        - dropped the patch that changes the GRO logic and the subsequent dependent
          patches.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
      547f2735
    • M
      net/mlx4_core: Add retrieval of CONFIG_DEV parameters · d475c95b
      Matan Barak committed
      Add code to issue CONFIG_DEV "get" firmware command.
      
      This command is used in order to obtain certain parameters used for
      supporting various RX checksumming options and vxlan UDP port.
      
      The GET operation is allowed for VFs too.
      Signed-off-by: Matan Barak <matanb@mellanox.com>
      Signed-off-by: Shani Michaeli <shanim@mellanox.com>
      Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      d475c95b
    • I
      net/mlx4_en: Add __GFP_COLD gfp flags in alloc_pages · 1ab25f86
      Ido Shamay committed
      Needed in order to get cache cold pages (L3 flushed) for HW scatter.
      
      Otherwise memory may flush those entries when the packet comes from
      PCI, causing back pressure resulting in BW decrease.
      Signed-off-by: Ido Shamay <idos@mellanox.com>
      Signed-off-by: Amir Vadai <amirv@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      1ab25f86
    • I
      net/mlx4_en: Remove RX buffers alignment to IP_ALIGN · 5f6e9800
      Ido Shamay committed
      When IP_ALIGN has a non-zero value, hardware will write to a non-aligned
      address. The only read from this address happens when copying the header
      from the first frag into the linear buffer (further access to the IP
      address will be from the linear buffer, in which the headers are
      aligned). Since the penalty of non-aligned access by the hardware is
      greater than that of the software memcpy, change frag_align to always be 0.
      Signed-off-by: Ido Shamay <idos@mellanox.com>
      Signed-off-by: Amir Vadai <amirv@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      5f6e9800
    • A
      net/mlx4_core: Protect port type setting by mutex · 0a984556
      Amir Vadai committed
      We need to protect set_port_type() against concurrency, as the sysfs code could
      call it from multiple contexts in parallel.
      
      The port_mutex is not enough because we need to protect from concurrent
      modification of 'info' and stopping of the port sensing work.
      Signed-off-by: Amir Vadai <amirv@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      0a984556
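The idea in this commit message, one lock that covers both the update of the port state and the stopping of the sensing work, can be modelled outside the kernel. The Python sketch below uses `threading.Lock` as a stand-in for the mutex. The class, its fields, and the "auto" value are all hypothetical and only illustrate the locking pattern, not the mlx4 data structures.

```python
import threading


class PortInfo:
    """Toy model of per-port state that several callers may change at once."""

    def __init__(self) -> None:
        self._lock = threading.Lock()  # stand-in for the added mutex
        self.port_type = "auto"
        self.sensing = True            # whether the sensing work is running
        self.changes = 0

    def set_port_type(self, new_type: str) -> None:
        # Both the state update and the sensing start/stop happen under one
        # lock, so no caller can observe or create a half-updated state.
        with self._lock:
            self.sensing = (new_type == "auto")
            self.port_type = new_type
            self.changes += 1


def hammer(info: PortInfo, new_type: str, rounds: int) -> None:
    """Call set_port_type repeatedly, as parallel sysfs writers might."""
    for _ in range(rounds):
        info.set_port_type(new_type)
```

With the lock held across the whole update, the invariant "sensing runs only when the type is auto" holds no matter how the callers interleave.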
    • S
      net/mlx4_core: Prevent VF from changing port configuration · 6e806699
      Saeed Mahameed committed
      Added a wrapper to the ACCESS_REG command for handling guest HW
      register access, preventing write operations but allowing reads.
      
      This will prevent SRIOV guests from changing the port PTYS configuration,
      such as speed/advertised link modes.
      
      Fixes: adbc7ac5 ('net/mlx4_core: Introduce ACCESS_REG CMD [...]')
      Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
      Signed-off-by: Amir Vadai <amirv@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      6e806699
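A wrapper that lets guests read registers but refuses their writes is a small filtering layer in front of the real command. The following Python sketch shows the shape of such a wrapper. Everything in it is hypothetical (the `Method` enum, the function names, the dictionary used as a register file); it is not the mlx4 command interface.

```python
from enum import Enum


class Method(Enum):
    QUERY = "query"
    WRITE = "write"


def access_reg(registers: dict, method: Method, reg_id: str, value=None):
    """Unrestricted register access, as the host itself would use it."""
    if method is Method.WRITE:
        registers[reg_id] = value
    return registers[reg_id]


def access_reg_guest_wrapper(registers: dict, method: Method, reg_id: str, value=None):
    """Guest-facing wrapper: reads pass through, writes are rejected."""
    if method is not Method.QUERY:
        raise PermissionError("guests may only read registers")
    return access_reg(registers, method, reg_id)
```

Because the wrapper rejects the request before calling the underlying function, a guest's write attempt never reaches the register state at all.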
    • E
      net: less interrupt masking in NAPI · d75b1ade
      Eric Dumazet committed
      net_rx_action() can mask irqs a single time to transfer sd->poll_list
      into a private list, for a very short duration.
      
      Then, napi_complete() can avoid masking irqs again,
      and net_rx_action() only needs to mask irqs again in the slow path.
      
      This patch removes two pairs of irq mask/unmask per typical NAPI run,
      more if multiple NAPI instances were triggered.
      
      Note this also allows giving control back to the caller (do_softirq())
      more often, so that other softirq handlers can be called a bit earlier,
      or ksoftirqd can be woken up earlier under pressure.
      
      This was developed while testing an alternative to RX interrupt
      mitigation to reduce latencies while keeping or improving GRO
      aggregation on fast NICs.
      
      The idea is to test napi->gro_list at the end of a napi->poll() and
      reschedule one NAPI poll, but after servicing a full round of
      softirqs (timers, TX, rcu, ...). This will be allowed only if the softirq
      is currently serviced by the idle task or ksoftirqd, and resched is not needed.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Cc: Willem de Bruijn <willemb@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      d75b1ade
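The pattern described in this message, taking the protected section once to move the shared poll list into a private list and returning to it only in the slow path, can be illustrated with a small model. In the Python sketch below a `threading.Lock` stands in for irq masking, and every name (`SoftnetData`, `rx_action`, `make_poller`, the counter) is made up for the example. It shows the list-splice idea only, not the kernel's actual net_rx_action().

```python
import threading
from collections import deque


class SoftnetData:
    def __init__(self) -> None:
        self.lock = threading.Lock()   # stand-in for masking irqs
        self.poll_list = deque()
        self.lock_acquisitions = 0     # counts "mask/unmask" pairs

    def schedule(self, poll) -> None:
        with self.lock:
            self.lock_acquisitions += 1
            self.poll_list.append(poll)


def make_poller(pending: int):
    """Return a poll(weight) callable that processes up to `weight` items."""
    state = {"pending": pending}

    def poll(weight: int) -> int:
        work = min(state["pending"], weight)
        state["pending"] -= work
        return work

    return poll


def rx_action(sd: SoftnetData, budget: int, weight: int = 64) -> None:
    # Fast path: one short protected section to splice the shared list.
    with sd.lock:
        sd.lock_acquisitions += 1
        private, sd.poll_list = sd.poll_list, deque()

    repoll = deque()
    while private and budget > 0:
        poll = private.popleft()
        work = poll(weight)
        budget -= work
        if work == weight:             # used its full weight: more may be pending
            repoll.append(poll)

    # Slow path: only if something is left over do we "mask" again.
    leftovers = list(private) + list(repoll)
    if leftovers:
        with sd.lock:
            sd.lock_acquisitions += 1
            sd.poll_list.extendleft(reversed(leftovers))
```

When every poller finishes within its weight, the run costs a single protected section regardless of how many pollers were queued, which is the saving the commit message describes.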