1. 15 Oct 2021, 6 commits
    • ice: introduce XDP_TX fallback path · 22bf877e
      Maciej Fijalkowski committed
      Under rare circumstances the requirement of having one XDP Tx queue per
      CPU cannot be fulfilled and some of the Tx resources have to be shared
      between CPUs. This creates a need to place accesses to xdp_ring inside a
      critical section protected by a spinlock. These accesses happen to be in
      the hot path, so introduce a static branch that is flipped from the
      control plane when the driver cannot provide a Tx queue dedicated to XDP
      on each CPU.
      
      Currently, the chosen design allows any number of XDP Tx queues that is
      at least half the number of CPUs on the platform. For a lower number the
      driver bails out and reports to the user that there were not enough Tx
      resources to allow configuring XDP. Ring sharing is signalled via static
      branch enablement, which in turn indicates that the lock for xdp_ring
      accesses needs to be taken in the hot path (a hedged sketch of this
      pattern follows this entry).
      
      The static-branch-based approach has no impact on the performance of the
      non-fallback path. One thing worth mentioning is that the static branch
      acts as a global driver switch, meaning that if one PF runs out of Tx
      resources, then the other PFs serviced by the ice driver will suffer as
      well. However, given that the HW handled by the ice driver has 1024 Tx
      queues per PF, this is currently an unlikely scenario.
      Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
      Tested-by: George Kuruvinakunnel <george.kuruvinakunnel@intel.com>
      Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
      22bf877e
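
      A minimal sketch of the pattern described above; the names
      (example_xdp_locking_key, tx_lock, example_xmit_xdp) are illustrative
      assumptions rather than the driver's actual identifiers, and the lock is
      assumed to be initialized during ring setup:

      #include <linux/jump_label.h>
      #include <linux/spinlock.h>

      /* Illustrative only: these names are assumptions, not the driver's API. */
      DEFINE_STATIC_KEY_FALSE(example_xdp_locking_key);

      struct example_xdp_ring {
          spinlock_t tx_lock;   /* assumed initialized with spin_lock_init() */
          /* descriptor ring state, tail pointer, ... */
      };

      /* Control plane: flip the branch when there are fewer XDP Tx rings than CPUs. */
      static void example_cfg_xdp_locking(unsigned int num_xdp_rings,
                                          unsigned int num_cpus)
      {
          if (num_xdp_rings < num_cpus)
              static_branch_enable(&example_xdp_locking_key);
          else
              static_branch_disable(&example_xdp_locking_key);
      }

      /* Hot path: the spinlock is taken only when rings are shared between CPUs. */
      static int example_xmit_xdp(struct example_xdp_ring *xdp_ring, void *frame)
      {
          int ret = 0;

          if (static_branch_unlikely(&example_xdp_locking_key))
              spin_lock(&xdp_ring->tx_lock);

          /* ... place the frame on the ring and bump the tail register ... */

          if (static_branch_unlikely(&example_xdp_locking_key))
              spin_unlock(&xdp_ring->tx_lock);

          return ret;
      }
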
    • ice: optimize XDP_TX workloads · 9610bd98
      Maciej Fijalkowski committed
      Optimize Tx descriptor cleaning for XDP. The current approach doesn't
      really scale and chokes when multiple flows are handled.
      
      Introduce two ring fields, @next_dd and @next_rs, that keep track of the
      descriptor that should be looked at when the need for cleaning arises and
      the descriptor that should have the RS bit set, respectively.
      
      Note that at this point the threshold is a constant (32), but it is
      something that we could make configurable.
      
      The first thing is to get away from setting the RS bit on each
      descriptor. Do this only once NTU is higher than the current @next_rs
      value. In that case, grab tx_desc[next_rs], set the RS bit in the
      descriptor and advance @next_rs by 32.
      
      The second thing is to clean the Tx ring only when there are fewer than
      32 free entries. For that case, look up tx_desc[next_dd] and check its DD
      bit. This bit is written back by HW to let the driver know that the xmit
      was successful. It will happen only for those descriptors that had the RS
      bit set. Clean only 32 descriptors and advance @next_dd by 32.
      
      The actual cleaning routine is moved from ice_napi_poll() down to
      ice_xmit_xdp_ring(). It is safe to do so as the XDP ring will not get any
      SKBs that would rely on interrupts for the cleaning. A nice side effect
      is that for the rare Tx fallback path (which the next patch is going to
      introduce) we don't have to trigger the SW irq to clean the ring.
      
      With these two concepts, the ring is kept almost full, but it is
      guaranteed that the driver will be able to produce Tx descriptors (a
      rough sketch of this RS/DD bookkeeping follows this entry).
      
      This approach seems to work out well even though the Tx descriptors are
      produced one by one. The test was conducted with the ice HW bombarded
      with packets from a HW generator configured to generate 30 flows.
      
      Xdp2 sample yields the following results:
      <snip>
      proto 17:   79973066 pkt/s
      proto 17:   80018911 pkt/s
      proto 17:   80004654 pkt/s
      proto 17:   79992395 pkt/s
      proto 17:   79975162 pkt/s
      proto 17:   79955054 pkt/s
      proto 17:   79869168 pkt/s
      proto 17:   79823947 pkt/s
      proto 17:   79636971 pkt/s
      </snip>
      
      As that sample reports the Rx'ed frames, let's look at the sar output.
      It shows that what we Rx we actually Tx, with no noticeable drops.
      Average:        IFACE   rxpck/s   txpck/s    rxkB/s    txkB/s   rxcmp/s txcmp/s  rxmcst/s   %ifutil
      Average:       ens4f1 79842324.00 79842310.40 4678261.17 4678260.38 0.00      0.00      0.00     38.32
      
      with tx_busy staying calm.
      
      When compared to the state before:
      Average:        IFACE   rxpck/s   txpck/s    rxkB/s    txkB/s   rxcmp/s txcmp/s  rxmcst/s   %ifutil
      Average:       ens4f1 90919711.60 42233822.60 5327326.85 2474638.04 0.00      0.00      0.00     43.64
      
      it can be observed that the txpck/s rate is almost doubled, meaning that
      the performance is improved by around 90%. All of this is due to the
      drops in the driver: previously the tx_busy stat was being bumped at a
      rate of 7 Mpps.
      Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
      Tested-by: George Kuruvinakunnel <george.kuruvinakunnel@intel.com>
      Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
      9610bd98
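
      A rough sketch of the @next_rs/@next_dd bookkeeping described above;
      struct and function names, the descriptor layout and the power-of-two
      wrap handling are simplified assumptions, not the driver's actual code:

      #include <linux/bits.h>
      #include <linux/types.h>

      #define EX_TX_THRESH    32
      #define EX_CMD_RS       BIT(0)   /* "report status" request bit */
      #define EX_CMD_DD       BIT(1)   /* "descriptor done" write-back bit */

      struct ex_tx_desc {
          u64 addr;
          u64 cmd;
      };

      struct ex_xdp_ring {
          struct ex_tx_desc *desc;
          u16 count;          /* ring length, assumed to be a power of two */
          u16 next_to_use;    /* NTU */
          u16 next_to_clean;
          u16 next_rs;        /* descriptor that gets the next RS bit */
          u16 next_dd;        /* descriptor whose DD bit is checked next */
      };

      /* Request a write-back only once per EX_TX_THRESH produced descriptors. */
      static void ex_maybe_set_rs(struct ex_xdp_ring *r)
      {
          if (r->next_to_use > r->next_rs) {
              r->desc[r->next_rs].cmd |= EX_CMD_RS;
              r->next_rs = (r->next_rs + EX_TX_THRESH) & (r->count - 1);
          }
      }

      /* Clean one EX_TX_THRESH batch, but only when the ring is nearly full and
       * HW has confirmed completion by writing back the DD bit.
       */
      static void ex_maybe_clean(struct ex_xdp_ring *r, u16 free_entries)
      {
          if (free_entries >= EX_TX_THRESH)
              return;
          if (!(r->desc[r->next_dd].cmd & EX_CMD_DD))
              return;

          /* ... unmap and free the EX_TX_THRESH buffers ending at next_dd ... */
          r->next_to_clean = (r->next_dd + 1) & (r->count - 1);
          r->next_dd = (r->next_dd + EX_TX_THRESH) & (r->count - 1);
      }
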
    • ice: propagate xdp_ring onto rx_ring · eb087cd8
      Maciej Fijalkowski committed
      With the rings being split, it is now convenient to introduce a pointer
      to the XDP ring within the Rx ring. For XDP_TX workloads this means that
      the xdp_rings array access, previously executed for each processed frame,
      will be skipped.
      
      Also, read the XDP prog once per NAPI poll and, if a prog is present, set
      up the local xdp_ring pointer. Reading the prog a single time was
      discussed in [1], with some concern raised by Toke around dispatcher
      handling and the need for going through the RCU grace period in the
      ndo_bpf driver callback, but ice currently tears down NAPI instances
      regardless of the prog presence on the VSI.
      
      Although the pointer to the XDP ring introduced to the Rx ring makes
      things a lot slimmer/simpler, I still feel that a single prog read per
      NAPI lifetime is beneficial (a hedged sketch of this pattern follows this
      entry).
      
      A further patch that introduces the fallback path will also benefit from
      this, as the xdp_ring pointer will be set during the XDP rings setup.
      
      [1]: https://lore.kernel.org/bpf/87k0oseo6e.fsf@toke.dk/
      Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
      Tested-by: George Kuruvinakunnel <george.kuruvinakunnel@intel.com>
      Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
      eb087cd8
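
      A hedged sketch of the idea: read the prog and the paired XDP Tx ring
      once per NAPI poll instead of per frame. The struct and function names
      here are assumptions, not the driver's actual code:

      #include <linux/rcupdate.h>

      struct bpf_prog;
      struct ex_xdp_ring;

      struct ex_rx_ring {
          struct bpf_prog __rcu *xdp_prog;
          struct ex_xdp_ring *xdp_ring;   /* set once during XDP ring setup */
          /* descriptors, xdp_rxq_info, ... */
      };

      static int ex_clean_rx_irq(struct ex_rx_ring *rx_ring, int budget)
      {
          struct ex_xdp_ring *xdp_ring = NULL;
          struct bpf_prog *xdp_prog;
          int done = 0;

          /* one prog read per NAPI poll (NAPI runs under RCU protection) */
          xdp_prog = rcu_dereference(rx_ring->xdp_prog);
          if (xdp_prog)
              xdp_ring = rx_ring->xdp_ring;

          while (done < budget) {
              /* ... fetch a frame; on an XDP_TX verdict, transmit directly on
               * xdp_ring instead of indexing an xdp_rings array per frame ...
               */
              done++;
          }

          return done;
      }
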
    • ice: split ice_ring onto Tx/Rx separate structs · e72bba21
      Maciej Fijalkowski committed
      While it was convenient to have a generic ring structure that served both
      the Tx and Rx sides, the next commits are going to introduce several
      Tx-specific fields, so in order to avoid hurting the Rx side, pull the
      rings out into new ice_tx_ring and ice_rx_ring structs.
      
      Rx ring could be handled by the old ice_ring which would reduce the code
      churn within this patch, but this would make things asymmetric.
      
      Make a union out of the ring container within ice_q_vector so that it is
      possible to iterate over the newly introduced ice_tx_ring.
      
      Remove @size as it's only accessed from the control path and it can be
      calculated pretty easily.
      
      Change definitions of ice_update_ring_stats and
      ice_fetch_u64_stats_per_ring so that they are ring agnostic and can be
      used for both Rx and Tx rings.
      
      The sizes of the Rx and Tx ring structs are 256 and 192 bytes,
      respectively. In the Rx ring, xdp_rxq_info occupies its own cacheline,
      which is the major difference now (a rough sketch of the split follows
      this entry).
      Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
      Tested-by: Gurucharan G <gurucharanx.g@intel.com>
      Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
      e72bba21
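
      A rough sketch of the split, with trimmed-down field sets that are
      assumptions rather than the real layouts; the union in the container lets
      per-vector code hold either ring type:

      #include <linux/types.h>

      struct ex_xdp_ring;

      struct ex_rx_ring {
          void *desc;
          u16 count;
          u16 next_to_use;
          u16 next_to_clean;
          struct ex_xdp_ring *xdp_ring;   /* Rx-only: paired XDP Tx ring */
          /* xdp_rxq_info and other Rx-only state (its own cacheline) */
      };

      struct ex_tx_ring {
          void *desc;
          u16 count;
          u16 next_to_use;
          u16 next_to_clean;
          u16 next_rs;                    /* Tx-only bookkeeping */
          u16 next_dd;
      };

      /* The ring container can point at either type without growing. */
      struct ex_ring_container {
          union {
              struct ex_tx_ring *tx_ring;
              struct ex_rx_ring *rx_ring;
          };
          u16 itr_setting;
      };
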
    • ice: move ice_container_type onto ice_ring_container · dc23715c
      Maciej Fijalkowski committed
      Currently ice_container_type is scoped only to ice_ethtool.c. The next
      commit, which splits the ice_ring struct into Rx/Tx specific ring
      structs, is also going to modify the type of the linked list of rings
      within ice_ring_container. Therefore, the functions that take an
      ice_ring_container as an input argument will need to be aware of the ring
      type that will be looked up.
      
      Embed ice_container_type within ice_ring_container and initialize it
      properly when allocating the q_vectors (a small sketch follows this
      entry).
      Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
      Tested-by: Gurucharan G <gurucharanx.g@intel.com>
      Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
      dc23715c
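
      A small sketch of the embedding described above; the enum values and
      field names are assumptions:

      #include <linux/types.h>

      enum ex_container_type {
          EX_RX_CONTAINER,
          EX_TX_CONTAINER,
      };

      struct ex_ring_container {
          enum ex_container_type type;    /* set once when q_vectors are allocated */
          u16 itr_setting;
          /* linked list of rings of the matching type, ... */
      };

      /* Functions that only receive the container can now branch on ring type. */
      static bool ex_container_is_rx(const struct ex_ring_container *rc)
      {
          return rc->type == EX_RX_CONTAINER;
      }
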
    • ice: remove ring_active from ice_ring · e93d1c37
      Maciej Fijalkowski committed
      This field is dead and the driver is not making any use of it. Simply
      remove it.
      Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
      Tested-by: Gurucharan G <gurucharanx.g@intel.com>
      Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
      e93d1c37
  2. 29 Sep 2021, 1 commit
  3. 28 Sep 2021, 1 commit
    • ice: Use xdp_buf instead of rx_buf for xsk zero-copy · 57f7f8b6
      Magnus Karlsson committed
      In order to use the new xsk batched buffer allocation interface, a
      pointer to an array of struct xdp_buff pointers needs to be provided so
      that the function can put the result of the allocation there. In the ice
      driver, we already have a ring that stores pointers to xdp_buffs. This is
      only used for the xsk zero-copy path and is a union with the structure
      that is used for the regular non-zero-copy path. Unfortunately, that
      structure is larger than an xdp_buff pointer, which means that there will
      be a stride (of 20 bytes) between each xdp_buff pointer. Feeding this
      into the xsk_buff_alloc_batch interface will not work, since it assumes a
      regular array of xdp_buff pointers (each 8 bytes with 0 bytes in between
      them on a 64-bit system).
      
      To fix this, remove the xdp_buff pointer from the rx_buf union and move
      it one level higher, to the union above it that only contains pointers to
      arrays. This solves the problem, and in the next patch we can feed the SW
      ring of xdp_buff pointers straight into the allocation function when that
      interface is used. This will improve performance (a hedged sketch of the
      layout change follows this entry).
      Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Link: https://lore.kernel.org/bpf/20210922075613.12186-4-magnus.karlsson@gmail.com
      57f7f8b6
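
      A hedged before/after sketch of the layout change; the struct shapes are
      simplified assumptions, while the 20-byte stride figure comes from the
      commit text:

      #include <linux/types.h>

      struct page;
      struct xdp_buff;

      /* Before: the xdp_buff pointer lives inside the per-buffer rx_buf union,
       * so consecutive xdp pointers are separated by sizeof(struct ex_rx_buf)
       * bytes (a ~20-byte stride) instead of sizeof(void *).
       */
      struct ex_rx_buf {
          union {
              struct {
                  struct page *page;
                  unsigned int page_offset;
                  u16 pagecnt_bias;
              };                          /* regular, non-zero-copy path */
              struct xdp_buff *xdp;       /* xsk zero-copy path */
          };
      };

      /* After: the zero-copy SW ring is a plain array of xdp_buff pointers kept
       * in the ring-level union of array pointers, so it can be handed straight
       * to a batched allocator that expects struct xdp_buff **.
       */
      struct ex_rx_ring {
          union {
              struct ex_rx_buf *rx_buf;   /* regular path */
              struct xdp_buff **xdp_buf;  /* zero-copy path */
          };
      };
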
  4. 11 Jun 2021, 2 commits
    • ice: enable transmit timestamps for E810 devices · ea9b847c
      Jacob Keller committed
      Add support for enabling Tx timestamp requests for outgoing packets on
      E810 devices.
      
      The ice hardware can support multiple outstanding Tx timestamp requests.
      When sending a descriptor to hardware, a Tx timestamp request is made by
      setting a request bit, and assigning an index that represents which Tx
      timestamp index to store the timestamp in.
      
      Hardware makes no effort to synchronize the index use, so it is up to
      software to ensure that Tx timestamp indexes are not re-used before the
      timestamp is reported back.
      
      To do this, introduce a Tx timestamp tracker which will keep track of the
      currently in-use indexes (a simplified sketch of such a tracker follows
      this entry).
      
      In the hot path, if a packet has a timestamp request, an index will be
      requested from the tracker. Unfortunately, this does require a lock as
      the indexes are shared across all queues on a PHY. There are not enough
      indexes to reliably assign only 1 to each queue.
      
      For the E810 devices, the timestamp indexes are not shared across PHYs,
      so each port can have its own tracking.
      
      Once hardware captures a timestamp, an interrupt is fired. In this
      interrupt, trigger a new work item that will figure out which timestamp
      was completed, and report the timestamp back to the stack.
      
      This function loops through the Tx timestamp indexes and checks whether
      there is now a valid timestamp. If so, it clears the PHY timestamp
      indication in the PHY memory, locks and removes the SKB and bit in the
      tracker, then reports the timestamp to the stack.
      
      It is possible in some cases that a timestamp request will be initiated
      but never completed. This might occur if the packet is dropped by
      software or hardware before it reaches the PHY.
      
      Add a task to the periodic work function that will check whether
      a timestamp request is more than a few seconds old. If so, the timestamp
      index is cleared in the PHY, and the SKB is released.
      
      Just as with Rx timestamps, the Tx timestamps are only 40 bits wide, and
      use the same overall logic for extending to 64 bits of nanoseconds.
      
      With this change, E810 devices should be able to perform basic PTP
      functionality.
      
      Future changes will extend the support to cover the E822-based devices.
      Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
      Tested-by: Tony Brelinski <tonyx.brelinski@intel.com>
      Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
      ea9b847c
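
      A simplified sketch of such a tracker; the slot count, names and locking
      granularity are assumptions, not the driver's actual implementation:

      #include <linux/bitmap.h>
      #include <linux/ktime.h>
      #include <linux/netdevice.h>
      #include <linux/skbuff.h>
      #include <linux/spinlock.h>

      #define EX_TX_TSTAMP_SLOTS    64

      struct ex_tx_tstamp_tracker {
          spinlock_t lock;        /* indexes are shared by all queues on a PHY */
          DECLARE_BITMAP(in_use, EX_TX_TSTAMP_SLOTS);
          struct sk_buff *skb[EX_TX_TSTAMP_SLOTS];
      };

      /* Hot path: reserve an index for a packet that requested a Tx timestamp.
       * Returns a negative value when every slot is currently in use.
       */
      static int ex_tstamp_request(struct ex_tx_tstamp_tracker *t,
                                   struct sk_buff *skb)
      {
          int idx;

          spin_lock(&t->lock);
          idx = find_first_zero_bit(t->in_use, EX_TX_TSTAMP_SLOTS);
          if (idx < EX_TX_TSTAMP_SLOTS) {
              set_bit(idx, t->in_use);
              t->skb[idx] = skb_get(skb);   /* hold the skb until completion */
          } else {
              idx = -1;
          }
          spin_unlock(&t->lock);

          return idx;
      }

      /* Completion path (work item triggered by the interrupt): free the slot
       * and report the extended timestamp back to the stack.
       */
      static void ex_tstamp_complete(struct ex_tx_tstamp_tracker *t, int idx, u64 ns)
      {
          struct skb_shared_hwtstamps hwts = { .hwtstamp = ns_to_ktime(ns) };
          struct sk_buff *skb;

          spin_lock(&t->lock);
          skb = t->skb[idx];
          t->skb[idx] = NULL;
          clear_bit(idx, t->in_use);
          spin_unlock(&t->lock);

          if (skb) {
              skb_tstamp_tx(skb, &hwts);
              dev_kfree_skb_any(skb);
          }
      }
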
    • ice: enable receive hardware timestamping · 77a78115
      Jacob Keller committed
      Add SIOCGHWTSTAMP and SIOCSHWTSTAMP ioctl handlers to respond to
      requests to enable timestamping support. If the request is for enabling
      Rx timestamps, set a bit in the Rx descriptors to indicate that receive
      timestamps should be reported.
      
      Hardware captures receive timestamps in the PHY which only captures part
      of the timer, and reports only 40 bits into the Rx descriptor. The upper
      32 bits represent the contents of GLTSYN_TIME_L at the point of packet
      reception, while the lower 8 bits represent the upper 8 bits of
      GLTSYN_TIME_0.
      
      The networking and PTP stack expect 64 bit timestamps in nanoseconds. To
      support this, implement some logic to extend the timestamps by using the
      full PHC time.
      
      If the Rx timestamp was captured prior to the PHC time, then the real
      timestamp is
      
        PHC - (lower_32_bits(PHC) - timestamp)
      
      If the Rx timestamp was captured after the PHC time, then the real
      timestamp is
      
        PHC + (timestamp - lower_32_bits(PHC))
      
      These calculations are correct as long as neither the PHC timestamp nor
      the Rx timestamps are more than 2^32-1 nanoseconds old. Further, we can
      detect whether the Rx timestamp is before or after the PHC as long as the
      PHC timestamp is no more than 2^31-1 nanoseconds old.
      
      In that case, we calculate the delta between the lower 32 bits of the
      PHC and the Rx timestamp. If it's larger than 2^31-1 then the Rx
      timestamp must have been captured in the past. If it's smaller, then the
      Rx timestamp must have been captured after PHC time.
      
      Add an ice_ptp_extend_32b_ts function that relies on a cached copy of
      the PHC time and implements this algorithm to calculate the proper upper
      32 bits of the Rx timestamps (a hedged sketch follows this entry).
      
      Cache the PHC time periodically in all of the Rx rings. This enables
      each Rx ring to simply call the extension function with a recent copy of
      the PHC time. By ensuring that the PHC time is kept up to date
      periodically, we ensure this algorithm doesn't use stale data and
      produce incorrect results.
      
      To cache the time, introduce a kworker and a kwork item to periodically
      store the Rx time. It might seem like we should use the .do_aux_work
      interface of the PTP clock. This doesn't work because all PFs must cache
      this time, but only one PF owns the PTP clock device.
      
      Thus, the ice driver will manage its own kthread instead of relying on
      the PTP do_aux_work handler.
      
      With this change, the driver can now report Rx timestamps on all
      incoming packets.
      Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
      Tested-by: Tony Brelinski <tonyx.brelinski@intel.com>
      Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
      77a78115
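
      A hedged sketch of the extension logic, following the two formulas above;
      the function mirrors what ice_ptp_extend_32b_ts is described to do, but
      the body here is a simplified assumption rather than the driver's exact
      code:

      #include <linux/kernel.h>
      #include <linux/types.h>

      static u64 ex_ptp_extend_32b_ts(u64 cached_phc_time, u32 in_tstamp)
      {
          u32 phc_low = lower_32_bits(cached_phc_time);
          u32 delta = in_tstamp - phc_low;

          if (delta > U32_MAX / 2) {
              /* Timestamp is older than the cached PHC time:
               * ts = PHC - (lower_32_bits(PHC) - timestamp)
               */
              return cached_phc_time - (phc_low - in_tstamp);
          }

          /* Timestamp was captured after the cached PHC time:
           * ts = PHC + (timestamp - lower_32_bits(PHC))
           */
          return cached_phc_time + delta;
      }
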
  5. 15 Apr 2021, 2 commits
    • ice: refactor ITR data structures · d59684a0
      Jesse Brandeburg committed
      Use a dedicated bitfield in order to both increase the amount of checking
      around the length of ITR writes and simplify the checks of dynamic mode.
      
      Basically, unpack the "high bit means dynamic" logic into bitfields (a
      small sketch follows this entry).
      
      Also, remove some unused ITR defines.
      Signed-off-by: Jesse Brandeburg <jesse.brandeburg@intel.com>
      Tested-by: Tony Brelinski <tonyx.brelinski@intel.com>
      Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
      d59684a0
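
      A small sketch of the idea; the field names and widths are assumptions,
      not the driver's actual layout:

      #include <linux/types.h>

      struct ex_ring_container {
          u16 itr_setting:13;   /* throttle value in usecs, bounded by the width */
          u16 itr_reserved:2;
          u16 itr_mode:1;       /* 1 = dynamic (adaptive), 0 = fixed */
      };

      /* The dynamic check no longer relies on a magic high bit in the value. */
      static inline bool ex_itr_is_dynamic(const struct ex_ring_container *rc)
      {
          return rc->itr_mode;
      }
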
    • ice: replace custom AIM algorithm with kernel's DIM library · cdf1f1f1
      Jacob Keller committed
      The ice driver has support for adaptive interrupt moderation, an
      algorithm for tuning the interrupt rate dynamically. This algorithm
      is based on various assumptions about ring size, socket buffer size,
      link speed, SKB overhead, ethernet frame overhead and more.
      
      The Linux kernel has support for a dynamic interrupt moderation
      algorithm known as "dimlib". Replace the custom driver-specific
      implementation of dynamic interrupt moderation with the kernel's
      algorithm.
      
      The Intel hardware has a different implementation than the hardware the
      originators of the dimlib code worked with, which requires the driver to
      use a slightly different set of inputs for the actual moderation values,
      while still getting all of dimlib's advice on better/worse and shifting
      left or right.
      
      The change made for this implementation is to use a pair of values for
      each of the 5 "slots" that the dimlib moderation expects, and the driver
      programs those pairs when dimlib recommends a slot to use. The current
      implementation uses two tables, one for receive and one for transmit, and
      the pairs of values in each slot set the maximum delay of an interrupt
      and a maximum number of interrupts per second (both expressed in
      microseconds). A hedged sketch of this wiring follows this entry.
      
      There are two separate kinds of bugs fixed by using DIMLIB: one is that
      UDP single-stream send was too slow, and the other is that 8K ping-pong
      was going to the most aggressive moderation and had much too high
      latency.
      
      The overall result of using DIMLIB is that we meet or exceed our
      performance expectations set based on the old algorithm.
      Co-developed-by: Jesse Brandeburg <jesse.brandeburg@intel.com>
      Signed-off-by: Jesse Brandeburg <jesse.brandeburg@intel.com>
      Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
      Tested-by: Tony Brelinski <tonyx.brelinski@intel.com>
      Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
      cdf1f1f1
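
      A hedged sketch of wiring dimlib into a queue vector along the lines
      described above; the structure names and the profile values are
      assumptions, while dim_update_sample()/net_dim() follow the dimlib API of
      this kernel generation:

      #include <linux/dim.h>
      #include <linux/kernel.h>
      #include <linux/workqueue.h>

      struct ex_itr_pair {
          u16 itr_us;     /* maximum delay of an interrupt */
          u16 intrl_us;   /* interrupt rate limit value */
      };

      /* one pair per dimlib "slot"; values are illustrative only */
      static const struct ex_itr_pair ex_rx_profile[] = {
          { 2, 0 }, { 8, 0 }, { 16, 0 }, { 62, 0 }, { 126, 64 },
      };

      struct ex_q_vector {
          struct dim rx_dim;
          u16 total_events;   /* interrupt event counter */
          u64 rx_packets;
          u64 rx_bytes;
      };

      /* Deferred work that dimlib schedules when it recommends a new slot. */
      static void ex_rx_dim_work(struct work_struct *work)
      {
          struct dim *dim = container_of(work, struct dim, work);
          const struct ex_itr_pair *pair = &ex_rx_profile[dim->profile_ix];

          /* ... program pair->itr_us and pair->intrl_us into the HW ... */
          (void)pair;

          dim->state = DIM_START_MEASURE;
      }

      /* Called at the end of NAPI polling: feed the latest counters to dimlib. */
      static void ex_net_dim(struct ex_q_vector *q_vector)
      {
          struct dim_sample sample;

          dim_update_sample(q_vector->total_events, q_vector->rx_packets,
                            q_vector->rx_bytes, &sample);
          net_dim(&q_vector->rx_dim, sample);
      }

      static void ex_dim_init(struct ex_q_vector *q_vector)
      {
          INIT_WORK(&q_vector->rx_dim.work, ex_rx_dim_work);
          q_vector->rx_dim.mode = DIM_CQ_PERIOD_MODE_START_FROM_EQE;
      }
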
  6. 08 Apr 2021, 1 commit
  7. 01 Apr 2021, 2 commits
  8. 13 Feb 2021, 2 commits
  9. 09 Feb 2021, 1 commit
    • ice: fix writeback enable logic · 1d9f7ca3
      Jesse Brandeburg committed
      The writeback enable logic was incorrectly implemented (due to
      misunderstanding what the side effects of the implementation would be
      during polling).
      
      Fix this logic issue, while implementing a new feature allowing the user
      to control the writeback frequency using the knobs for controlling
      interrupt throttling that we already have. Basically, if you leave
      adaptive interrupts enabled, the writeback frequency will be varied even
      if busy polling or napi-poll is in use. If the interrupt rates are set to
      a fixed value by ethtool -C and adaptive is off, the driver will allow
      the user-set interrupt rate to guide how frequently the hardware will
      complete descriptors to the driver.
      
      Effectively the user will get a control over the hardware efficiency,
      allowing the choice between immediate interrupts or delayed up to a
      maximum of the interrupt rate, even when interrupts are disabled
      during polling.
      Signed-off-by: Jesse Brandeburg <jesse.brandeburg@intel.com>
      Co-developed-by: Brett Creeley <brett.creeley@intel.com>
      Signed-off-by: Brett Creeley <brett.creeley@intel.com>
      Tested-by: Tony Brelinski <tonyx.brelinski@intel.com>
      Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
      1d9f7ca3
  10. 26 Sep 2020, 1 commit
    • intel-ethernet: clean up W=1 warnings in kdoc · b50f7bca
      Jesse Brandeburg committed
      This takes care of all of the trivial W=1 fixes in the Intel
      Ethernet drivers, which allows developers and maintainers to
      build more of the networking tree with more complete warning
      checks.
      
      There are three classes of kdoc warnings fixed:
       - cannot understand function prototype: 'x'
       - Excess function parameter 'x' description in 'y'
       - Function parameter or member 'x' not described in 'y'
      
      All of the changes were trivial comment updates on
      function headers.
      
      Inspired by Lee Jones' series of wireless work to do the same.
      Compile tested only, and passes simple test of
      $ git ls-files *.[ch] | egrep drivers/net/ethernet/intel | \
        xargs scripts/kernel-doc -none
      Signed-off-by: Jesse Brandeburg <jesse.brandeburg@intel.com>
      Tested-by: Aaron Brown <aaron.f.brown@intel.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      b50f7bca
  11. 01 Sep 2020, 1 commit
  12. 01 Aug 2020, 2 commits
  13. 28 May 2020, 1 commit
  14. 23 May 2020, 2 commits
  15. 22 May 2020, 2 commits
  16. 20 Feb 2020, 1 commit
  17. 13 Feb 2020, 1 commit
  18. 04 Jan 2020, 1 commit
  19. 05 Nov 2019, 6 commits
  20. 21 Aug 2019, 1 commit
  21. 24 May 2019, 2 commits
  22. 02 May 2019, 1 commit