1. 27 7月, 2022 1 次提交
  2. 13 4月, 2022 1 次提交
    • J
      ice: Add mpls+tso support · 69e66c04
      Joe Damato 提交于
      Attempt to add mpls+tso support.
      
      I don't have ice hardware available to test myself, but I just implemented
      this feature in i40e and thought it might be useful to implement for ice
      while this is fresh in my brain.
      
      Hoping some one at intel will be able to test this on my behalf.
      Signed-off-by: NJoe Damato <jdamato@fastly.com>
      Tested-by: Gurucharan <gurucharanx.g@intel.com> (A Contingent worker at Intel)
      Signed-off-by: NTony Nguyen <anthony.l.nguyen@intel.com>
      69e66c04
  3. 10 3月, 2022 1 次提交
  4. 04 3月, 2022 1 次提交
    • J
      ice: store VF pointer instead of VF ID · b03d519d
      Jacob Keller 提交于
      The VSI structure contains a vf_id field used to associate a VSI with a
      VF. This is used mainly for ICE_VSI_VF as well as partially for
      ICE_VSI_CTRL associated with the VFs.
      
      This API was designed with the idea that VFs are stored in a simple
      array that was expected to be static throughout most of the driver's
      life.
      
      We plan on refactoring VF storage in a few key ways:
      
        1) converting from a simple static array to a hash table
        2) using krefs to track VF references obtained from the hash table
        3) use RCU to delay release of VF memory until after all references
           are dropped
      
      This is motivated by the goal to ensure that the lifetime of VF
      structures is accounted for, and prevent various use-after-free bugs.
      
      With the existing vsi->vf_id, the reference tracking for VFs would
      become somewhat convoluted, because each VSI maintains a vf_id field
      which will then require performing a look up. This means all these flows
      will require reference tracking and proper usage of rcu_read_lock, etc.
      
      We know that the VF VSI will always be backed by a valid VF structure,
      because the VSI is created during VF initialization and removed before
      the VF is destroyed. Rely on this and store a reference to the VF in the
      VSI structure instead of storing a VF ID. This will simplify the usage
      and avoid the need to perform lookups on the hash table in the future.
      
      For ICE_VSI_VF, it is expected that vsi->vf is always non-NULL after
      ice_vsi_alloc succeeds. Because of this, use WARN_ON when checking if a
      vsi->vf pointer is valid when dealing with VF VSIs. This will aid in
      debugging code which violates this assumption and avoid more disastrous
      panics.
      Signed-off-by: NJacob Keller <jacob.e.keller@intel.com>
      Tested-by: NKonrad Jankowski <konrad0.jankowski@intel.com>
      Signed-off-by: NTony Nguyen <anthony.l.nguyen@intel.com>
      b03d519d
  5. 10 2月, 2022 1 次提交
    • B
      ice: Add hot path support for 802.1Q and 802.1ad VLAN offloads · 0d54d8f7
      Brett Creeley 提交于
      Currently the driver only supports 802.1Q VLAN insertion and stripping.
      However, once Double VLAN Mode (DVM) is fully supported, then both 802.1Q
      and 802.1ad VLAN insertion and stripping will be supported. Unfortunately
      the VSI context parameters only allow for one VLAN ethertype at a time
      for VLAN offloads so only one or the other VLAN ethertype offload can be
      supported at once.
      
      To support this, multiple changes are needed.
      
      Rx path changes:
      
      [1] In DVM, the Rx queue context l2tagsel field needs to be cleared so
      the outermost tag shows up in the l2tag2_2nd field of the Rx flex
      descriptor. In Single VLAN Mode (SVM), the l2tagsel field should remain
      1 to support SVM configurations.
      
      [2] Modify the ice_test_staterr() function to take a __le16 instead of
      the ice_32b_rx_flex_desc union pointer so this function can be used for
      both rx_desc->wb.status_error0 and rx_desc->wb.status_error1.
      
      [3] Add the new inline function ice_get_vlan_tag_from_rx_desc() that
      checks if there is a VLAN tag in l2tag1 or l2tag2_2nd.
      
      [4] In ice_receive_skb(), add a check to see if NETIF_F_HW_VLAN_STAG_RX
      is enabled in netdev->features. If it is, then this is the VLAN
      ethertype that needs to be added to the stripping VLAN tag. Since
      ice_fix_features() prevents CTAG_RX and STAG_RX from being enabled
      simultaneously, the VLAN ethertype will only ever be 802.1Q or 802.1ad.
      
      Tx path changes:
      
      [1] In DVM, the VLAN tag needs to be placed in the l2tag2 field of the Tx
      context descriptor. The new define ICE_TX_FLAGS_HW_OUTER_SINGLE_VLAN was
      added to the list of tx_flags to handle this case.
      
      [2] When the stack requests the VLAN tag to be offloaded on Tx, the
      driver needs to set either ICE_TX_FLAGS_HW_OUTER_SINGLE_VLAN or
      ICE_TX_FLAGS_HW_VLAN, so the tag is inserted in l2tag2 or l2tag1
      respectively. To determine which location to use, set a bit in the Tx
      ring flags field during ring allocation that can be used to determine
      which field to use in the Tx descriptor. In DVM, always use l2tag2,
      and in SVM, always use l2tag1.
      Signed-off-by: NBrett Creeley <brett.creeley@intel.com>
      Tested-by: NGurucharan G <gurucharanx.g@intel.com>
      Signed-off-by: NTony Nguyen <anthony.l.nguyen@intel.com>
      0d54d8f7
  6. 01 2月, 2022 1 次提交
  7. 28 1月, 2022 3 次提交
  8. 29 12月, 2021 1 次提交
  9. 18 12月, 2021 1 次提交
  10. 16 12月, 2021 3 次提交
  11. 14 12月, 2021 1 次提交
  12. 20 10月, 2021 3 次提交
    • G
      ice: use devm_kcalloc() instead of devm_kzalloc() · 6f332353
      Gustavo A. R. Silva 提交于
      Use 2-factor multiplication argument form devm_kcalloc() instead
      of devm_kzalloc().
      
      Link: https://github.com/KSPP/linux/issues/162Signed-off-by: NGustavo A. R. Silva <gustavoars@kernel.org>
      Signed-off-by: NTony Nguyen <anthony.l.nguyen@intel.com>
      6f332353
    • J
      ice: fix software generating extra interrupts · 23be7075
      Jesse Brandeburg 提交于
      The driver tried to work around missing completion events that occurred
      while interrupts are disabled, by triggering a software interrupt
      whenever we exit polling (but we had to have polled at least once).
      
      This was causing a *lot* of extra interrupts for some workloads like
      NVMe over TCP, which resulted in regressions in performance. It was also
      visible when polling didn't prevent interrupts when busy_poll was
      enabled.
      
      Fix the extra interrupts by utilizing our previously unused 3rd ITR
      (interrupt throttle) index and set it to 20K interrupts per second, and
      then trigger a software interrupt within that rate limit.
      
      While here, slightly refactor the code to avoid an overwrite of a local
      variable in the case of wb_en = true.
      
      Fixes: b7306b42 ("ice: manage interrupts during poll exit")
      Signed-off-by: NJesse Brandeburg <jesse.brandeburg@intel.com>
      Tested-by: NGurucharan G <gurucharanx.g@intel.com>
      Signed-off-by: NTony Nguyen <anthony.l.nguyen@intel.com>
      23be7075
    • J
      ice: update dim usage and moderation · d8eb7ad5
      Jesse Brandeburg 提交于
      The driver was having trouble with unreliable latency when doing single
      threaded ping-pong tests. This was root caused to the DIM algorithm
      landing on a too slow interrupt value, which caused high latency, and it
      was especially present when queues were being switched frequently by the
      scheduler as happens on default setups today.
      
      In attempting to improve this, we allow the upper rate limit for
      interrupts to move to rate limit of 4 microseconds as a max, which means
      that no vector can generate more than 250,000 interrupts per second. The
      old config was up to 100,000. The driver previously tried to program the
      rate limit too frequently and if the receive and transmit side were both
      active on the same vector, the INTRL would be set incorrectly, and this
      change fixes that issue as a side effect of the redesign.
      
      This driver will operate from now on with a slightly changed DIM table
      with more emphasis towards latency sensitivity by having more table
      entries with lower latency than with high latency (high being >= 64
      microseconds).
      
      The driver also resets the DIM algorithm state with a new stats set when
      there is no work done and the data becomes stale (older than 1 second),
      for the respective receive or transmit portion of the interrupt.
      
      Add a new helper for setting rate limit, which will be used more
      in a followup patch.
      Signed-off-by: NJesse Brandeburg <jesse.brandeburg@intel.com>
      Tested-by: NGurucharan G <gurucharanx.g@intel.com>
      Signed-off-by: NTony Nguyen <anthony.l.nguyen@intel.com>
      d8eb7ad5
  13. 15 10月, 2021 5 次提交
    • M
      ice: introduce XDP_TX fallback path · 22bf877e
      Maciej Fijalkowski 提交于
      Under rare circumstances there might be a situation where a requirement
      of having XDP Tx queue per CPU could not be fulfilled and some of the Tx
      resources have to be shared between CPUs. This yields a need for placing
      accesses to xdp_ring inside a critical section protected by spinlock.
      These accesses happen to be in the hot path, so let's introduce the
      static branch that will be triggered from the control plane when driver
      could not provide Tx queue dedicated for XDP on each CPU.
      
      Currently, the design that has been picked is to allow any number of XDP
      Tx queues that is at least half of a count of CPUs that platform has.
      For lower number driver will bail out with a response to user that there
      were not enough Tx resources that would allow configuring XDP. The
      sharing of rings is signalled via static branch enablement which in turn
      indicates that lock for xdp_ring accesses needs to be taken in hot path.
      
      Approach based on static branch has no impact on performance of a
      non-fallback path. One thing that is needed to be mentioned is a fact
      that the static branch will act as a global driver switch, meaning that
      if one PF got out of Tx resources, then other PFs that ice driver is
      servicing will suffer. However, given the fact that HW that ice driver
      is handling has 1024 Tx queues per each PF, this is currently an
      unlikely scenario.
      Signed-off-by: NMaciej Fijalkowski <maciej.fijalkowski@intel.com>
      Tested-by: NGeorge Kuruvinakunnel <george.kuruvinakunnel@intel.com>
      Signed-off-by: NTony Nguyen <anthony.l.nguyen@intel.com>
      22bf877e
    • M
      ice: optimize XDP_TX workloads · 9610bd98
      Maciej Fijalkowski 提交于
      Optimize Tx descriptor cleaning for XDP. Current approach doesn't
      really scale and chokes when multiple flows are handled.
      
      Introduce two ring fields, @next_dd and @next_rs that will keep track of
      descriptor that should be looked at when the need for cleaning arise and
      the descriptor that should have the RS bit set, respectively.
      
      Note that at this point the threshold is a constant (32), but it is
      something that we could make configurable.
      
      First thing is to get away from setting RS bit on each descriptor. Let's
      do this only once NTU is higher than the currently @next_rs value. In
      such case, grab the tx_desc[next_rs], set the RS bit in descriptor and
      advance the @next_rs by a 32.
      
      Second thing is to clean the Tx ring only when there are less than 32
      free entries. For that case, look up the tx_desc[next_dd] for a DD bit.
      This bit is written back by HW to let the driver know that xmit was
      successful. It will happen only for those descriptors that had RS bit
      set. Clean only 32 descriptors and advance the DD bit.
      
      Actual cleaning routine is moved from ice_napi_poll() down to the
      ice_xmit_xdp_ring(). It is safe to do so as XDP ring will not get any
      SKBs in there that would rely on interrupts for the cleaning. Nice side
      effect is that for rare case of Tx fallback path (that next patch is
      going to introduce) we don't have to trigger the SW irq to clean the
      ring.
      
      With those two concepts, ring is kept at being almost full, but it is
      guaranteed that driver will be able to produce Tx descriptors.
      
      This approach seems to work out well even though the Tx descriptors are
      produced in one-by-one manner. Test was conducted with the ice HW
      bombarded with packets from HW generator, configured to generate 30
      flows.
      
      Xdp2 sample yields the following results:
      <snip>
      proto 17:   79973066 pkt/s
      proto 17:   80018911 pkt/s
      proto 17:   80004654 pkt/s
      proto 17:   79992395 pkt/s
      proto 17:   79975162 pkt/s
      proto 17:   79955054 pkt/s
      proto 17:   79869168 pkt/s
      proto 17:   79823947 pkt/s
      proto 17:   79636971 pkt/s
      </snip>
      
      As that sample reports the Rx'ed frames, let's look at sar output.
      It says that what we Rx'ed we do actually Tx, no noticeable drops.
      Average:        IFACE   rxpck/s   txpck/s    rxkB/s    txkB/s   rxcmp/s txcmp/s  rxmcst/s   %ifutil
      Average:       ens4f1 79842324.00 79842310.40 4678261.17 4678260.38 0.00      0.00      0.00     38.32
      
      with tx_busy staying calm.
      
      When compared to a state before:
      Average:        IFACE   rxpck/s   txpck/s    rxkB/s    txkB/s   rxcmp/s txcmp/s  rxmcst/s   %ifutil
      Average:       ens4f1 90919711.60 42233822.60 5327326.85 2474638.04 0.00      0.00      0.00     43.64
      
      it can be observed that the amount of txpck/s is almost doubled, meaning
      that the performance is improved by around 90%. All of this due to the
      drops in the driver, previously the tx_busy stat was bumped at a 7mpps
      rate.
      Signed-off-by: NMaciej Fijalkowski <maciej.fijalkowski@intel.com>
      Tested-by: NGeorge Kuruvinakunnel <george.kuruvinakunnel@intel.com>
      Signed-off-by: NTony Nguyen <anthony.l.nguyen@intel.com>
      9610bd98
    • M
      ice: propagate xdp_ring onto rx_ring · eb087cd8
      Maciej Fijalkowski 提交于
      With rings being split, it is now convenient to introduce a pointer to
      XDP ring within the Rx ring. For XDP_TX workloads this means that
      xdp_rings array access will be skipped, which was executed per each
      processed frame.
      
      Also, read the XDP prog once per NAPI and if prog is present, set up the
      local xdp_ring pointer. Reading prog a single time was discussed in [1]
      with some concern raised by Toke around dispatcher handling and having
      the need for going through the RCU grace period in the ndo_bpf driver
      callback, but ice currently is torning down NAPI instances regardless of
      the prog presence on VSI.
      
      Although the pointer to XDP ring introduced to Rx ring makes things a
      lot slimmer/simpler, I still feel that single prog read per NAPI
      lifetime is beneficial.
      
      Further patch that will introduce the fallback path will also get a
      profit from that as xdp_ring pointer will be set during the XDP rings
      setup.
      
      [1]: https://lore.kernel.org/bpf/87k0oseo6e.fsf@toke.dk/Signed-off-by: NMaciej Fijalkowski <maciej.fijalkowski@intel.com>
      Tested-by: NGeorge Kuruvinakunnel <george.kuruvinakunnel@intel.com>
      Signed-off-by: NTony Nguyen <anthony.l.nguyen@intel.com>
      eb087cd8
    • M
      ice: do not create xdp_frame on XDP_TX · a55e16fa
      Maciej Fijalkowski 提交于
      xdp_frame is not needed for XDP_TX data path in ice driver case.
      For this data path cleaning of sent descriptor will not happen anywhere
      outside of the driver, which means that carrying the information about
      the underlying memory model via xdp_frame will not be used. Therefore,
      this conversion can be simply dropped, which would relieve CPU a bit.
      Signed-off-by: NMaciej Fijalkowski <maciej.fijalkowski@intel.com>
      Tested-by: NGeorge Kuruvinakunnel <george.kuruvinakunnel@intel.com>
      Signed-off-by: NTony Nguyen <anthony.l.nguyen@intel.com>
      a55e16fa
    • M
      ice: split ice_ring onto Tx/Rx separate structs · e72bba21
      Maciej Fijalkowski 提交于
      While it was convenient to have a generic ring structure that served
      both Tx and Rx sides, next commits are going to introduce several
      Tx-specific fields, so in order to avoid hurting the Rx side, let's
      pull out the Tx ring onto new ice_tx_ring and ice_rx_ring structs.
      
      Rx ring could be handled by the old ice_ring which would reduce the code
      churn within this patch, but this would make things asymmetric.
      
      Make the union out of the ring container within ice_q_vector so that it
      is possible to iterate over newly introduced ice_tx_ring.
      
      Remove the @size as it's only accessed from control path and it can be
      calculated pretty easily.
      
      Change definitions of ice_update_ring_stats and
      ice_fetch_u64_stats_per_ring so that they are ring agnostic and can be
      used for both Rx and Tx rings.
      
      Sizes of Rx and Tx ring structs are 256 and 192 bytes, respectively. In
      Rx ring xdp_rxq_info occupies its own cacheline, so it's the major
      difference now.
      Signed-off-by: NMaciej Fijalkowski <maciej.fijalkowski@intel.com>
      Tested-by: NGurucharan G <gurucharanx.g@intel.com>
      Signed-off-by: NTony Nguyen <anthony.l.nguyen@intel.com>
      e72bba21
  14. 08 10月, 2021 1 次提交
  15. 29 9月, 2021 1 次提交
  16. 25 6月, 2021 2 次提交
  17. 18 6月, 2021 1 次提交
  18. 11 6月, 2021 1 次提交
    • J
      ice: enable transmit timestamps for E810 devices · ea9b847c
      Jacob Keller 提交于
      Add support for enabling Tx timestamp requests for outgoing packets on
      E810 devices.
      
      The ice hardware can support multiple outstanding Tx timestamp requests.
      When sending a descriptor to hardware, a Tx timestamp request is made by
      setting a request bit, and assigning an index that represents which Tx
      timestamp index to store the timestamp in.
      
      Hardware makes no effort to synchronize the index use, so it is up to
      software to ensure that Tx timestamp indexes are not re-used before the
      timestamp is reported back.
      
      To do this, introduce a Tx timestamp tracker which will keep track of
      currently in-use indexes.
      
      In the hot path, if a packet has a timestamp request, an index will be
      requested from the tracker. Unfortunately, this does require a lock as
      the indexes are shared across all queues on a PHY. There are not enough
      indexes to reliably assign only 1 to each queue.
      
      For the E810 devices, the timestamp indexes are not shared across PHYs,
      so each port can have its own tracking.
      
      Once hardware captures a timestamp, an interrupt is fired. In this
      interrupt, trigger a new work item that will figure out which timestamp
      was completed, and report the timestamp back to the stack.
      
      This function loops through the Tx timestamp indexes and checks whether
      there is now a valid timestamp. If so, it clears the PHY timestamp
      indication in the PHY memory, locks and removes the SKB and bit in the
      tracker, then reports the timestamp to the stack.
      
      It is possible in some cases that a timestamp request will be initiated
      but never completed. This might occur if the packet is dropped by
      software or hardware before it reaches the PHY.
      
      Add a task to the periodic work function that will check whether
      a timestamp request is more than a few seconds old. If so, the timestamp
      index is cleared in the PHY, and the SKB is released.
      
      Just as with Rx timestamps, the Tx timestamps are only 40 bits wide, and
      use the same overall logic for extending to 64 bits of nanoseconds.
      
      With this change, E810 devices should be able to perform basic PTP
      functionality.
      
      Future changes will extend the support to cover the E822-based devices.
      Signed-off-by: NJacob Keller <jacob.e.keller@intel.com>
      Tested-by: NTony Brelinski <tonyx.brelinski@intel.com>
      Signed-off-by: NTony Nguyen <anthony.l.nguyen@intel.com>
      ea9b847c
  19. 04 6月, 2021 1 次提交
    • D
      ice: Allow all LLDP packets from PF to Tx · f9f83202
      Dave Ertman 提交于
      Currently in the ice driver, the check whether to
      allow a LLDP packet to egress the interface from the
      PF_VSI is being based on the SKB's priority field.
      It checks to see if the packets priority is equal to
      TC_PRIO_CONTROL.  Injected LLDP packets do not always
      meet this condition.
      
      SCAPY defaults to a sk_buff->protocol value of ETH_P_ALL
      (0x0003) and does not set the priority field.  There will
      be other injection methods (even ones used by end users)
      that will not correctly configure the socket so that
      SKB fields are correctly populated.
      
      Then ethernet header has to have to correct value for
      the protocol though.
      
      Add a check to also allow packets whose ethhdr->h_proto
      matches ETH_P_LLDP (0x88CC).
      
      Fixes: 0c3a6101 ("ice: Allow egress control packets from PF_VSI")
      Signed-off-by: NDave Ertman <david.m.ertman@intel.com>
      Tested-by: NTony Brelinski <tonyx.brelinski@intel.com>
      Signed-off-by: NTony Nguyen <anthony.l.nguyen@intel.com>
      f9f83202
  20. 03 6月, 2021 1 次提交
  21. 15 4月, 2021 3 次提交
    • J
      ice: refactor ITR data structures · d59684a0
      Jesse Brandeburg 提交于
      Use a dedicated bitfield in order to both increase
      the amount of checking around the length of ITR writes
      as well as simplify the checks of dynamic mode.
      
      Basically unpack the "high bit means dynamic" logic
      into bitfields.
      
      Also, remove some unused ITR defines.
      Signed-off-by: NJesse Brandeburg <jesse.brandeburg@intel.com>
      Tested-by: NTony Brelinski <tonyx.brelinski@intel.com>
      Signed-off-by: NTony Nguyen <anthony.l.nguyen@intel.com>
      d59684a0
    • J
      ice: manage interrupts during poll exit · b7306b42
      Jesse Brandeburg 提交于
      The driver would occasionally miss that there were outstanding
      descriptors to clean when exiting busy/napi poll. This issue has
      been in the code since the introduction of the ice driver.
      
      Attempt to "catch" any remaining work by triggering a software
      interrupt when exiting napi poll or busy-poll. This will not
      cause extra interrupts in the case of normal execution.
      
      This issue was found when running sfnt-pingpong, with busy
      poll enabled, and typically with larger I/O sizes like > 8192,
      the program would occasionally report > 1 second maximums
      to complete a ping pong.
      Signed-off-by: NJesse Brandeburg <jesse.brandeburg@intel.com>
      Tested-by: NTony Brelinski <tonyx.brelinski@intel.com>
      Signed-off-by: NTony Nguyen <anthony.l.nguyen@intel.com>
      b7306b42
    • J
      ice: replace custom AIM algorithm with kernel's DIM library · cdf1f1f1
      Jacob Keller 提交于
      The ice driver has support for adaptive interrupt moderation, an
      algorithm for tuning the interrupt rate dynamically. This algorithm
      is based on various assumptions about ring size, socket buffer size,
      link speed, SKB overhead, ethernet frame overhead and more.
      
      The Linux kernel has support for a dynamic interrupt moderation
      algorithm known as "dimlib". Replace the custom driver-specific
      implementation of dynamic interrupt moderation with the kernel's
      algorithm.
      
      The Intel hardware has a different hardware implementation than the
      originators of the dimlib code had to work with, which requires the
      driver to use a slightly different set of inputs for the actual
      moderation values, while getting all the advice from dimlib of
      better/worse, shift left or right.
      
      The change made for this implementation is to use a pair of values
      for each of the 5 "slots" that the dimlib moderation expects, and
      the driver will program those pairs when dimlib recommends a slot to
      use. The currently implementation uses two tables, one for receive
      and one for transmit, and the pairs of values in each slot set the
      maximum delay of an interrupt and a maximum number of interrupts per
      second (both expressed in microseconds).
      
      There are two separate kinds of bugs fixed by using DIMLIB, one is
      UDP single stream send was too slow, and the other is that 8K
      ping-pong was going to the most aggressive moderation and has much
      too high latency.
      
      The overall result of using DIMLIB is that we meet or exceed our
      performance expectations set based on the old algorithm.
      Co-developed-by: NJesse Brandeburg <jesse.brandeburg@intel.com>
      Signed-off-by: NJesse Brandeburg <jesse.brandeburg@intel.com>
      Signed-off-by: NJacob Keller <jacob.e.keller@intel.com>
      Tested-by: NTony Brelinski <tonyx.brelinski@intel.com>
      Signed-off-by: NTony Nguyen <anthony.l.nguyen@intel.com>
      cdf1f1f1
  22. 01 4月, 2021 1 次提交
  23. 23 3月, 2021 1 次提交
  24. 18 3月, 2021 1 次提交
  25. 12 3月, 2021 1 次提交
  26. 13 2月, 2021 2 次提交