1. 26 May 2018 (5 commits)
    • net/mlx5e: Avoid reset netdev stats on configuration changes · 05909bab
      Authored by Eran Ben Elisha
      Move all RQ, SQ, and channel counters from the channel objects into
      the priv structure. With this change, counters are no longer reset
      upon channel configuration changes.
      
      Channel statistics for SQs associated with TCs higher than zero are
      presented in ethtool -S only for SQs that were opened at least once
      since the module was loaded (regardless of their current
      open/closed status). This is done to decrease the total amount of
      statistics presented and calculated for the common out-of-box use
      case (no QoS).
      
      mlx5e_channel_stats is a compound of the CH, RQ, and SQ stats, in
      order to create locality for the NAPI when handling TX and RX of
      the same channel.
      
      Align the new per-ring statistics struct to avoid several channels
      updating the same cache line at the same time.
      Packet rate was tested; no degradation was sensed.
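      
      As a rough C sketch of the layout described above (field and macro
      names follow the driver's conventions, but the exact definitions
      here are illustrative, not the real ones):
      
        struct mlx5e_ch_stats { u64 events; };
        struct mlx5e_rq_stats { u64 packets, bytes; };
        struct mlx5e_sq_stats { u64 packets, bytes; };
        
        /* CH, RQ and SQ stats kept together for NAPI locality; the
         * struct is cacheline-aligned so no two channels ever update
         * the same cache line. */
        struct mlx5e_channel_stats {
                struct mlx5e_ch_stats ch;
                struct mlx5e_sq_stats sq[MLX5E_MAX_NUM_TC];
                struct mlx5e_rq_stats rq;
        } ____cacheline_aligned_in_smp;
        
        struct mlx5e_priv {
                /* ... */
                /* owned by priv rather than by the channel objects, so
                 * the counters survive channel close/reopen */
                struct mlx5e_channel_stats channel_stats[MLX5E_MAX_NUM_CHANNELS];
        };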
      Signed-off-by: Eran Ben Elisha <eranbe@mellanox.com>
      CC: Qing Huang <qing.huang@oracle.com>
      Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
    • net/mlx5e: Introducing new statistics rwlock · 868a01a2
      Authored by Shalom Lagziel
      Introduce a new read/write lock that protects statistics gathering
      from netdev channel configuration changes, e.g. when channels are
      being replaced (increasing/decreasing the number of rings). This
      prevents the statistics-gathering path (ndo_get_stats64) from
      reading the statistics of inactive channels (channels that are
      being closed).
      
      In addition, update the channels' software statistics on the fly
      when calling ndo_get_stats64, and remove that update from the
      periodic stats work.
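      
      A minimal sketch of the locking scheme, assuming the rwlock lives
      in priv (names here are assumptions, not the exact driver code):
      
        /* readers: stats queries; writer: channel reconfiguration */
        static void get_stats64_sketch(struct mlx5e_priv *priv)
        {
                read_lock(&priv->stats_lock);  /* many concurrent readers */
                /* fold per-channel SW counters on the fly, skipping
                 * channels that are being closed */
                read_unlock(&priv->stats_lock);
        }
        
        static void switch_channels_sketch(struct mlx5e_priv *priv)
        {
                write_lock(&priv->stats_lock); /* exclude readers */
                /* close old channels, activate new ones */
                write_unlock(&priv->stats_lock);
        }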
      
      Fixes: 9218b44d ("net/mlx5e: Statistics handling refactoring")
      Signed-off-by: Shalom Lagziel <shaloml@mellanox.com>
      Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
    • net/mlx5: Use order-0 allocations for all WQ types · 3a2f7033
      Authored by Tariq Toukan
      Complete the transition of all WQ types to use fragmented
      order-0 coherent memory instead of high-order allocations.
      
      The CQ WQ already uses order-0 pages.
      Here we do the same for the cyclic and linked-list WQs.
      
      This allows the driver to load cleanly on systems with highly
      fragmented coherent memory.
      
      Performance tests:
      ConnectX-5 100Gbps, CPU: Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50GHz.
      Packet rate of 64B packets, single transmit ring, ring size 8K.
      
      No degradation was sensed.
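      
      As a generic illustration of the allocation change (a sketch, not
      the driver's actual frag-buffer code):
      
        struct frag_buf_sketch {
                struct { void *buf; dma_addr_t dma; } *frags;
                int nfrags;
        };
        
        static int frag_buf_alloc_sketch(struct device *dev,
                                         struct frag_buf_sketch *fb,
                                         int nfrags)
        {
                int i;
        
                fb->frags = kcalloc(nfrags, sizeof(*fb->frags), GFP_KERNEL);
                if (!fb->frags)
                        return -ENOMEM;
        
                for (i = 0; i < nfrags; i++) {
                        /* order-0 (PAGE_SIZE) requests succeed even when
                         * coherent memory is highly fragmented */
                        fb->frags[i].buf =
                                dma_alloc_coherent(dev, PAGE_SIZE,
                                                   &fb->frags[i].dma,
                                                   GFP_KERNEL);
                        if (!fb->frags[i].buf)
                                goto err_unwind;
                }
                fb->nfrags = nfrags;
                return 0;
        
        err_unwind:
                while (--i >= 0)
                        dma_free_coherent(dev, PAGE_SIZE, fb->frags[i].buf,
                                          fb->frags[i].dma);
                kfree(fb->frags);
                return -ENOMEM;
        }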
      Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
      Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
    • net/mlx5e: TX, Use actual WQE size for SQ edge fill · 043dc78e
      Authored by Tariq Toukan
      We fill the SQ edge with NOPs to avoid WQE wrap-around.
      Here, instead of doing that in advance for the maximum possible
      WQE size, we do it on demand, using the actual WQE size.
      We reorder some parts of mlx5e_sq_xmit to finish the calculation
      of the WQE size (ds_cnt) before doing any writes to the WQE buffer.
      
      When the SQ work queue is fragmented (introduced in a downstream
      patch), dealing with WQE wraps becomes more frequent. This change
      drastically reduces the overhead in that case.
      
      Performance tests:
      ConnectX-5 100Gbps, CPU: Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50GHz.
      Packet rate of 64B packets, single transmit ring, ring size 8K.
      
      Before: 14.9 Mpps
      After:  15.8 Mpps
      
      An improvement of 6%.
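      
      A sketch of the on-demand decision (sizing follows the 64B WQEBB /
      16B data-segment convention; the helper itself is illustrative):
      
        /* How many NOP WQEBBs to post before a WQE of the ACTUAL size
         * (ds_cnt 16B segments) at producer index pi, in a ring of
         * wq_sz_bb 64B building blocks. Returns 0 in the common,
         * non-wrapping case. */
        static u16 edge_nops_needed_sketch(u16 pi, u16 wq_sz_bb, u16 ds_cnt)
        {
                u16 wqebbs  = DIV_ROUND_UP(ds_cnt, 4); /* 4 x 16B per BB */
                u16 to_edge = wq_sz_bb - pi;   /* BBs left before wrap */
        
                /* the old code padded in advance for the maximum
                 * possible WQE size; now we pad only when this WQE
                 * would actually wrap */
                return wqebbs > to_edge ? to_edge : 0;
        }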
      Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
      Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
    • net/mlx5e: Use WQ API functions instead of direct fields access · ddf385e3
      Authored by Tariq Toukan
      Use the WQ API to get the WQ size and to map a counter into a WQ
      entry index.
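      
      A simplified sketch of the accessors' shape (the real ones live in
      the driver's wq.h; these definitions are illustrative):
      
        struct wq_cyc_sketch { u16 sz; /* power of two */ };
        
        static inline u16 wq_cyc_get_size_sketch(struct wq_cyc_sketch *wq)
        {
                return wq->sz;             /* no open-coded field math */
        }
        
        static inline u16 wq_cyc_ctr2ix_sketch(struct wq_cyc_sketch *wq,
                                               u16 ctr)
        {
                return ctr & (wq->sz - 1); /* free-running counter -> index */
        }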
      Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
      Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
  2. 25 May 2018 (1 commit)
    • net/mlx5e: Move port speed code from en_ethtool.c to en/port.c · 2c81bfd5
      Authored by Huy Nguyen
      Move the four functions below from en_ethtool.c to en/port.c. These
      functions are used by both en_ethtool.c and en_main.c, and future
      code can use them without an ethtool link-mode dependency:
        u32 mlx5e_port_ptys2speed(u32 eth_proto_oper);
        int mlx5e_port_linkspeed(struct mlx5_core_dev *mdev, u32 *speed);
        int mlx5e_port_max_linkspeed(struct mlx5_core_dev *mdev, u32 *speed);
        u32 mlx5e_port_speed2linkmodes(u32 speed);
      
      Delete the speed field from the mlx5e_build_ptys2ethtool_map table;
      this table now only keeps the mapping between mlx5e link modes and
      ethtool link modes. Add a new table, mlx5e_link_speed, for
      translation from mlx5e link modes to actual speeds.
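      
      A hypothetical sketch of the new table's shape (a few entries
      shown; the enum indices exist in the driver, but the values and
      the lookup helper here are illustrative):
      
        static const u32 mlx5e_link_speed_sketch[] = {
                [MLX5E_1000BASE_CX_SGMII] = 1000,    /* Mbps */
                [MLX5E_10GBASE_CX4]       = 10000,
                [MLX5E_40GBASE_CR4]       = 40000,
                [MLX5E_100GBASE_CR4]      = 100000,
                /* ... */
        };
        
        /* sketch of mlx5e_port_ptys2speed: eth_proto_oper is a bitmask
         * of active link modes; report the first set mode's speed */
        static u32 port_ptys2speed_sketch(u32 eth_proto_oper)
        {
                int i;
        
                for (i = 0; i < ARRAY_SIZE(mlx5e_link_speed_sketch); i++)
                        if (eth_proto_oper & BIT(i))
                                return mlx5e_link_speed_sketch[i];
                return 0;       /* unknown speed */
        }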
      Signed-off-by: Huy Nguyen <huyn@mellanox.com>
      Reviewed-by: Parav Pandit <parav@mellanox.com>
      Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
  3. 18 May 2018 (2 commits)
  4. 15 May 2018 (2 commits)
  5. 05 May 2018 (1 commit)
  6. 01 May 2018 (3 commits)
  7. 24 Apr 2018 (2 commits)
  8. 17 Apr 2018 (2 commits)
    • mlx5: use page_pool for xdp_return_frame call · 60bbf7ee
      Authored by Jesper Dangaard Brouer
      This patch shows how it is possible to have both the driver-local
      page cache, which uses an elevated refcnt to "catch"/avoid cases
      where SKB put_page would return the page through the page
      allocator, and, at the same time, have pages returned to the
      page_pool from the ndo_xdp_xmit DMA completion.
      
      The performance improvement for XDP_REDIRECT in this patch is
      really good, especially considering that (currently) the
      xdp_return_frame API and page_pool_put_page() perform per-frame
      operations of both an rhashtable ID lookup and a locked return into
      the (page_pool) ptr_ring. (The plan is to remove these per-frame
      operations in a follow-up patchset.)
      
      The benchmark performed was RX on mlx5 and XDP_REDIRECT out ixgbe,
      with xdp_redirect_map (using devmap). The target/maximum capability
      of ixgbe is 13 Mpps (on this HW setup).
      
      Before this patch for mlx5, XDP-redirected frames were returned via
      the page allocator. The single-flow performance was 6 Mpps, and
      with two flows the collective performance dropped to 4 Mpps,
      because we hit the page allocator lock (further negative scaling
      occurs).
      
      Two test scenarios need to be covered for the xdp_return_frame API:
      DMA-TX completion running on the same CPU, and cross-CPU
      free/return. Results were same-CPU = 10 Mpps and cross-CPU =
      12 Mpps. This is very close to our 13 Mpps max target.
      
      The reason the max target isn't reached in the cross-CPU test is
      likely the RX-ring DMA unmap/map overhead (which doesn't occur in
      ixgbe-to-ixgbe testing). It is also planned to remove this
      unnecessary DMA unmap in a later patchset.
      
      V2: Adjustments requested by Tariq
       - Changed page_pool_create to never return NULL, only an ERR_PTR,
         as this simplifies error handling in drivers.
       - Save a branch in mlx5e_page_release
       - Correct page_pool size calc for MLX5_WQ_TYPE_LINKED_LIST_STRIDING_RQ
      
      V5: Updated patch desc
      
      V8: Adjust for b0cedc84 ("net/mlx5e: Remove rq_headroom field from params")
      V9:
       - Adjust for 121e8927 ("net/mlx5e: Refactor RQ XDP_TX indication")
       - Adjust for 73281b78 ("net/mlx5e: Derive Striding RQ size from MTU")
       - Correct handling if page_pool_create fails for MLX5_WQ_TYPE_LINKED_LIST_STRIDING_RQ
      
      V10: Req from Tariq
       - Change pool_size calc for MLX5_WQ_TYPE_LINKED_LIST_STRIDING_RQ
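      
      A sketch of the per-RQ pool setup this implies (page_pool_params
      and page_pool_create() are the real page_pool API of this kernel
      generation; the sizing policy shown is an assumption):
      
        #include <net/page_pool.h>
        
        static struct page_pool *rq_page_pool_sketch(struct device *dev,
                                                     u32 pool_size)
        {
                struct page_pool_params pp_params = {
                        .order     = 0,            /* order-0 pages only */
                        .flags     = 0,
                        .pool_size = pool_size,    /* sized to the RX ring */
                        .nid       = NUMA_NO_NODE,
                        .dev       = dev,
                        .dma_dir   = DMA_FROM_DEVICE,
                };
        
                /* per the V2 note above: failure is reported via
                 * ERR_PTR(), never NULL */
                return page_pool_create(&pp_params);
        }
      
      A caller therefore checks the result with IS_ERR()/PTR_ERR()
      rather than a NULL test.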
      Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
      Reviewed-by: Tariq Toukan <tariqt@mellanox.com>
      Acked-by: Saeed Mahameed <saeedm@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • mlx5: register a memory model when XDP is enabled · 84f5e3fb
      Authored by Jesper Dangaard Brouer
      Now all the users of ndo_xdp_xmit have been converted to use
      xdp_return_frame. This enables a different memory model, thus
      activating another code path in the xdp_return_frame API.
      
      V2: Fixed issues pointed out by Tariq.
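      
      The registration itself is essentially one call per RX queue
      (xdp_rxq_info_reg_mem_model() is the real API; the rq field names
      are assumptions):
      
        err = xdp_rxq_info_reg_mem_model(&rq->xdp_rxq,
                                         MEM_TYPE_PAGE_POOL,
                                         rq->page_pool);
        if (err)
                goto err_destroy_page_pool;
        /* from here on, xdp_return_frame() recycles these pages through
         * the page_pool instead of the page allocator */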
      Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
      Reviewed-by: Tariq Toukan <tariqt@mellanox.com>
      Acked-by: Saeed Mahameed <saeedm@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  9. 06 Apr 2018 (1 commit)
  10. 03 Apr 2018 (1 commit)
  11. 02 Apr 2018 (1 commit)
    • net/mlx5e: Set EQE based as default TX interrupt moderation mode · 48bfc397
      Authored by Tal Gilboa
      The default TX moderation mode was mistakenly set to CQE based. The
      intention was to add a control ability in order to improve some
      specific use cases. In general, we prefer EQE-based moderation, as
      it gives much better numbers for the common cases.
      
      CQE-based moderation causes a degradation in the common case, since
      it resets the moderation timer on CQE generation. This becomes an
      issue when TSO is well utilized (large TSO sessions): the timer is
      set to 16us, so traffic of ~64KB TSO sessions per second means the
      timer restarts on every CQE (one CQE per TSO session -> a long time
      between CQEs). In this case we quickly reach tcp_limit_output_bytes
      (256KB by default) and cause a halt in TX traffic.
      
      By setting EQE-based moderation, we make sure the timer expires
      after 16us regardless of the packet rate.
      This fixes packet rate degradations of up to 40% and bandwidth
      degradations of up to 23%.
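      
      The fix itself amounts to flipping the default period mode (the
      enum values are the real mlx5 CQ period modes; the params field
      name is an assumption):
      
        params->tx_cq_moderation.cq_period_mode =
                MLX5_CQ_PERIOD_MODE_START_FROM_EQE; /* was ..._FROM_CQE */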
      
      Fixes: 0088cbbc ("net/mlx5e: Enable CQE based moderation on TX CQ")
      Signed-off-by: Tal Gilboa <talgi@mellanox.com>
      Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  12. 31 Mar 2018 (7 commits)
    • net/mlx5e: Keep single pre-initialized UMR WQE per RQ · b8a98a4c
      Authored by Tariq Toukan
      All UMR WQEs of an RQ share many common fields. We use
      pre-initialized structures to save calculations in the datapath.
      One field (xlt_offset) was the only reason we kept a
      pre-initialized copy per WQE index.
      Here we remove its initialization (moving its calculation to the
      datapath) and reduce the number of copies to one per RQ.
      
      A very small datapath calculation is added. It occurs once per
      MPWQE (i.e. once every 256KB), but it reduces memory consumption
      and gives better cache utilization.
      
      Performance testing:
      Tested packet rate; no degradation was sensed.
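      
      A sketch of the resulting datapath (names are assumptions; the
      point is that only xlt_offset is per-posting now):
      
        static void post_umr_wqe_sketch(struct rq_sketch *rq, u16 wqe_ix)
        {
                struct umr_wqe_sketch *wqe = &rq->umr_wqe; /* one per RQ */
        
                /* runs once per MPWQE, i.e. once every ~256KB: */
                wqe->uctrl.xlt_offset =
                        cpu_to_be16(umr_xlt_offset_sketch(rq, wqe_ix));
                /* every other field was pre-initialized at RQ creation;
                 * ring the doorbell here */
        }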
      Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
      Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
    • net/mlx5e: Support XDP over Striding RQ · 22f45398
      Authored by Tariq Toukan
      Add XDP support over Striding RQ.
      Now that linear SKBs are supported over Striding RQ,
      we can support XDP by setting the stride size to PAGE_SIZE
      and the headroom to XDP_PACKET_HEADROOM.
      
      Upon freeing an MPWQE, do not release pages that are in flight
      for XDP xmit; they will be released upon completion.
      
      Striding RQ is capable of a higher packet rate than
      conventional RQ.
      A performance gain is expected for all cases that had
      a HW packet-rate bottleneck. This is the case whenever
      using many flows that distribute over many cores.
      
      Performance testing:
      ConnectX-5, 24 rings, default MTU.
      CQE compression ON (to reduce completion bandwidth over PCIe).
      
      XDP_DROP packet rate:
      --------------------------------------------------
      | pkt size | XDP rate   | 100GbE linerate | pct% |
      --------------------------------------------------
      |   64byte | 126.2 Mpps |      148.0 Mpps |  85% |
      |  128byte |  80.0 Mpps |       84.8 Mpps |  94% |
      |  256byte |  42.7 Mpps |       42.7 Mpps | 100% |
      |  512byte |  23.4 Mpps |       23.4 Mpps | 100% |
      --------------------------------------------------
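      
      A sketch of the parameter choice described above
      (XDP_PACKET_HEADROOM is the real 256B kernel constant; the rq
      field names are assumptions):
      
        if (xdp_prog) {
                /* one page per stride: a packet never crosses a page
                 * boundary, as XDP requires */
                rq->stride_sz = PAGE_SIZE;
                /* room for the program to extend headers backwards */
                rq->headroom  = XDP_PACKET_HEADROOM;
        }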
      Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
      Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
    • net/mlx5e: Use linear SKB in Striding RQ · 619a8f2a
      Authored by Tariq Toukan
      The current Striding RQ HW feature utilizes the RX buffers so that
      there is no wasted room between the strides. This maximizes
      memory utilization, but it prevents the use of build_skb() (which
      requires headroom and tailroom) and demands a memcpy of the packet
      headers into the SKB linear part.
      
      In this patch, whenever a set of conditions holds, we apply
      an RQ configuration that allows combining the use of a linear SKB
      with a Striding RQ.
      
      To use build_skb() with Striding RQ, the following must hold:
      1. The packet does not cross a page boundary.
      2. There is enough headroom and tailroom surrounding the packet.
      
      We can satisfy 1 and 2 by configuring:
      	stride size = MTU + headroom + tailroom.
      
      This is possible only when:
      a. (MTU + headroom + tailroom) does not exceed PAGE_SIZE.
      b. HW LRO is turned off.
      
      Using a linear SKB has many advantages:
      - Saves a memcpy of the headers.
      - No page-boundary checks in the datapath.
      - No filler CQEs.
      - Significantly smaller CQ.
      - SKB data resides contiguously in the linear part, instead of
        being split into a small part (linear) and a large part
        (fragment). This saves datapath cycles in the driver and
        improves utilization of SKB fragments in GRO.
      - The fragments of a resulting GRO SKB follow the IP forwarding
        assumption of equal-size fragments.
      
      Some implementation details:
      HW writes the packets to the beginning of a stride,
      i.e. it does not keep headroom. To overcome this, we make sure we
      can extend backwards and use the last bytes of stride i-1.
      Extra care is needed for stride 0, as it has no preceding stride.
      We make sure headroom bytes are available by shifting the buffer
      pointer passed to HW by headroom bytes.
      
      This configuration now becomes the default whenever capable.
      Of course, this implies turning LRO off.
      
      Performance testing:
      ConnectX-5, single core, single RX ring, default MTU.
      
      UDP packet rate, early drop in TC layer:
      
      --------------------------------------------
      | pkt size | before    | after     | ratio |
      --------------------------------------------
      | 1500byte | 4.65 Mpps | 5.96 Mpps | 1.28x |
      |  500byte | 5.23 Mpps | 5.97 Mpps | 1.14x |
      |   64byte | 5.94 Mpps | 5.96 Mpps | 1.00x |
      --------------------------------------------
      
      TCP streams: ~20% gain
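      
      A sketch of the eligibility test (SKB_DATA_ALIGN and
      skb_shared_info are the real kernel symbols; the helper itself is
      illustrative):
      
        static bool linear_skb_possible_sketch(u32 mtu, u32 headroom,
                                               bool hw_lro)
        {
                /* tailroom that build_skb() needs for skb_shared_info */
                u32 tailroom = SKB_DATA_ALIGN(sizeof(struct skb_shared_info));
                u32 stride   = mtu + headroom + tailroom;
        
                /* (a) the whole stride fits in one page;
                 * (b) HW LRO is off */
                return stride <= PAGE_SIZE && !hw_lro;
        }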
      Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
      Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
    • net/mlx5e: Use inline MTTs in UMR WQEs · ea3886ca
      Authored by Tariq Toukan
      When modifying the page mapping of a HW memory region
      (via a UMR post), post the new values inline in the WQE,
      instead of using a data pointer.
      
      This is a micro-optimization: inline UMR WQEs of different
      rings scale better in HW.
      
      In addition, this obsoletes a few control flows and helps
      delete ~50 LOC.
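      
      A sketch of the inline layout (MLX5_UMR_INLINE and the segment
      structs are real mlx5 definitions; the sizing and field placement
      here are illustrative):
      
        struct umr_wqe_inline_sketch {
                struct mlx5_wqe_ctrl_seg     ctrl;
                struct mlx5_wqe_umr_ctrl_seg uctrl;
                struct mlx5_mkey_seg         mkc;
                __be64                       inline_mtts[0]; /* in-WQE */
        };
        
        /* marking the translation entries as inline replaces the
         * data pointer (and its associated control flows): */
        wqe->uctrl.flags = MLX5_UMR_TRANSLATION_OFFSET_EN | MLX5_UMR_INLINE;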
      Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
      Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
    • net/mlx5e: Derive Striding RQ size from MTU · 73281b78
      Authored by Tariq Toukan
      In Striding RQ, each WQE serves multiple packets
      (hence it is called a Multi-Packet WQE, or MPWQE).
      The size of an MPWQE is constant (currently 256KB).
      
      Upon a ringparam set operation, we calculate the number of
      MPWQEs per RQ. For this, we first need to determine the
      number of packets that can reside within a single MPWQE.
      In this patch we use the actual MTU size, instead of ETH_DATA_LEN,
      for this calculation.
      
      This implies that a change in MTU might require a change
      in the Striding RQ ring size.
      
      In addition, this obsoletes some WQE-to-packet translation
      functions and helps delete ~60 LOC.
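      
      A worked example of the sizing rule (constants and rounding are
      illustrative, not the exact driver formulas):
      
        #define MPWQE_BYTES (256 * 1024)        /* constant MPWQE size */
        
        static u8 log_pkts_per_mpwqe_sketch(u32 mtu)
        {
                /* one packet per stride, stride rounded to a power of 2 */
                u32 stride = roundup_pow_of_two(mtu);
        
                return ilog2(MPWQE_BYTES / stride);
        }
      
      E.g. MTU 1500 gives 2KB strides (128 packets per MPWQE), while MTU
      9000 gives 16KB strides (16 packets per MPWQE), so a ringparam of
      N packets maps to a different number of MPWQEs after an MTU change.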
      Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
      Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
    • net/mlx5e: Save MTU in channels params · 472a1e44
      Authored by Tariq Toukan
      Knowing the MTU is required for the RQ creation flow.
      By our design, the channels creation flow is totally isolated
      from priv/netdev, and can be completed with access only to the
      channels params and mdev.
      Adding the MTU to the channels params helps preserve that.
      In addition, we save it in the RQ to make its access faster in
      datapath checks.
      Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
      Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
    • net/mlx5e: Use eq ptr from cq · 7b2117bb
      Authored by Saeed Mahameed
      Instead of looking up the EQ of the CQ, remove that redundant code
      and use the eq pointer already stored in the cq struct.
      Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
  13. 28 Mar 2018 (9 commits)
  14. 27 Mar 2018 (3 commits)