1. 25 September 2016 (2 commits)
  2. 06 May 2016 (4 commits)
  3. 14 April 2016 (1 commit)
  4. 07 April 2016 (1 commit)
  5. 05 April 2016 (1 commit)
    • i40e/i40evf: Allow up to 12K bytes of data per Tx descriptor instead of 8K · 5c4654da
      Authored by Alexander Duyck
      From what I can tell the practical limitation on the size of the Tx data
      buffer is the fact that the Tx descriptor is limited to 14 bits.  As such
      we cannot use 16K as is typically used on the other Intel drivers.  However
      artificially limiting ourselves to 8K can be expensive as this means that
      we will consume up to 10 descriptors (1 context, 1 for header, and 9 for
      payload, non-8K aligned) in a single send.
      
      I propose that we can reduce this by increasing the maximum data for a 4K
      aligned block to 12K.  This reduces by one the number of descriptors used
      for a 32K aligned block.  In addition, the remaining 4K - 1 bytes of
      otherwise unused space can serve as extra padding when dealing with data
      that is not aligned to 4K.
      
      By aligning the descriptors after the first to 4K we can improve the
      efficiency of PCIe accesses as we can avoid using byte enables and can fetch
      full TLP transactions after the first fetch of the buffer.  This helps to
      improve PCIe efficiency.  Below are the results of testing before and after
      with this patch:
      
      Recv   Send   Send                         Utilization      Service Demand
      Socket Socket Message  Elapsed             Send     Recv    Send    Recv
      Size   Size   Size     Time    Throughput  local    remote  local   remote
      bytes  bytes  bytes    secs.   10^6bits/s  % S      % U     us/KB   us/KB
      Before:
      87380  16384  16384    10.00     33682.24  20.27    -1.00   0.592   -1.00
      After:
      87380  16384  16384    10.00     34204.08  20.54    -1.00   0.590   -1.00
      
      So the net result of this patch is that we have a small gain in throughput
      due to a reduction in overhead for putting together the frame.
      Signed-off-by: Alexander Duyck <aduyck@mirantis.com>
      Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
      Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
      5c4654da
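The arithmetic behind the change can be sketched in a few lines. This is a minimal, illustrative model (the constant names echo the driver's `I40E_MAX_DATA_PER_TXD`-style defines, but the code below is not the driver's source): the 14-bit size field caps a buffer at 16K - 1 bytes, and rounding that down to the 4K read-request boundary yields the 12K per-descriptor maximum.

```c
#include <assert.h>

#define MAX_READ_REQ_SIZE        4096           /* PCIe max read request */
#define MAX_DATA_PER_TXD         (16 * 1024 - 1) /* 14-bit size field */
#define MAX_DATA_PER_TXD_ALIGNED \
        (MAX_DATA_PER_TXD & ~(MAX_READ_REQ_SIZE - 1)) /* = 12K */

/* descriptors needed for a data buffer of the given size */
static unsigned int txd_use_count(unsigned int size)
{
    return (size + MAX_DATA_PER_TXD_ALIGNED - 1) / MAX_DATA_PER_TXD_ALIGNED;
}
```

With the old 8K limit a 32K-aligned block needed 4 data descriptors; at 12K it needs only 3, which is the one-descriptor saving the message describes.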
  6. 19 February 2016 (4 commits)
    • i40e/i40evf: Rewrite logic for 8 descriptor per packet check · 2d37490b
      Authored by Alexander Duyck
      This patch is meant to rewrite the logic for how we determine if we can
      transmit the frame or if it needs to be linearized.
      
      The previous code for this function was using a mix of division and modulus
      division as a part of computing if we need to take the slow path.  Instead
      I have replaced this by simply working with a sliding window which will
      tell us if the frame would be capable of causing a single packet to span
      several descriptors.
      
      The logic for the scan is fairly simple.  If any given group of 6 fragments
      totals less than gso_size - 1 bytes, then it is possible for us to have one byte
      coming out of the first fragment, 6 fragments, and one or more bytes coming
      out of the last fragment.  This gives us a total of 8 fragments
      which exceeds what we can allow so we send such frames to be linearized.
      
      Arguably the use of modulus might be more exact as the approach I propose
      may generate some false positives.  However the likelihood of us taking much
      of a hit for those false positives is fairly low, and I would rather not
      add more overhead in the case where we are receiving a frame composed of 4K
      pages.
      Signed-off-by: Alexander Duyck <aduyck@mirantis.com>
      Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
      Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
      2d37490b
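The sliding-window check described above can be modeled standalone. This is an illustrative sketch, not the driver's implementation (the real code walks skb fragments with an O(n) running sum; the function and array here are hypothetical): if one gso_size segment could take one byte from a leading fragment, consume six whole fragments, and spill into an eighth, the frame is sent to be linearized.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Return true if some gso_size segment could span 8 data fragments,
 * which exceeds the per-packet descriptor limit. */
static bool chk_linearize(const size_t *frag_sz, int nr_frags,
                          size_t gso_size)
{
    /* with 7 or fewer fragments no segment can touch 8 of them */
    if (nr_frags < 8)
        return false;

    /* slide a window of 6 fragments with one fragment on each side */
    for (int i = 1; i + 6 < nr_frags; i++) {
        size_t sum = 1;                 /* worst case: 1 byte ahead */
        for (int j = i; j < i + 6; j++)
            sum += frag_sz[j];
        if (sum < gso_size)             /* segment spills past window */
            return true;
    }
    return false;
}
```

The false positives the message mentions come from assuming the worst-case one-byte head; an exact modulus-based test would avoid them at the cost of division in the hot path.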
    • i40e/i40evf: Break up xmit_descriptor_count from maybe_stop_tx · 4ec441df
      Authored by Alexander Duyck
      In an upcoming patch I would like to have access to the descriptor count
      used for the data portion of the frame.  For this reason I am splitting up
      the descriptor count function from the function that stops the ring.
      
      Also, in order to reduce unnecessary code duplication, I am moving the
      slow-path portions of the code out of line, so that we can jump to them
      and process them instead of having to build a copy into each function
      that calls them.
      Signed-off-by: Alexander Duyck <aduyck@mirantis.com>
      Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
      Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
      4ec441df
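The inline-fast-path/out-of-line-slow-path split can be sketched as follows. This is a simplified model under assumed names (the real driver stops the netdev subqueue and re-checks after a memory barrier in its slow path):

```c
#include <assert.h>
#include <stdbool.h>

struct tx_ring { int next_to_use, next_to_clean, count; };

/* descriptors still unused on the ring */
static int desc_unused(const struct tx_ring *r)
{
    return ((r->next_to_clean > r->next_to_use) ? 0 : r->count) +
           r->next_to_clean - r->next_to_use - 1;
}

/* out-of-line slow path: in the driver this stops the queue, re-checks
 * after a barrier, and restarts if space appeared; modeled here as a
 * plain "busy" result */
static int maybe_stop_tx_slow(struct tx_ring *r, int needed)
{
    (void)r; (void)needed;
    return -1;
}

/* fast path stays inline; only the rare full-ring case takes the call */
static inline int maybe_stop_tx(struct tx_ring *r, int needed)
{
    if (desc_unused(r) >= needed)
        return 0;
    return maybe_stop_tx_slow(r, needed);
}
```

Keeping only the cheap comparison inline means each transmit function pays one branch in the common case, while the descriptor count itself can be computed separately and reused, which is what the next patch relies on.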
    • i40e/i40evf: Add exception handling for Tx checksum · 529f1f65
      Authored by Alexander Duyck
      Add exception handling to the Tx checksum path so that we can handle cases
      of TSO where the frame is bad, or Tx checksum where we didn't recognize a
      protocol.
      
      Drop I40E_TX_FLAGS_CSUM as it is unused, and move the CHECKSUM_PARTIAL check
      into the function itself so that we can decrease the indent.
      Signed-off-by: Alexander Duyck <aduyck@mirantis.com>
      Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
      Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
      529f1f65
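The shape of the change can be sketched as turning a silent skip into a reportable error. This is an illustrative model with hypothetical names and return conventions, not the driver's actual signature:

```c
#include <assert.h>
#include <stdbool.h>

enum { PROTO_TCP = 6, PROTO_UDP = 17, PROTO_SCTP = 132 };

/* Before: a void function that silently skipped unknown protocols.
 * After: a return value lets the caller drop a bad TSO frame or fall
 * back to a software checksum instead of transmitting half-offloaded. */
static int tx_enable_csum(int l4_proto, bool is_tso)
{
    switch (l4_proto) {
    case PROTO_TCP:
    case PROTO_UDP:
    case PROTO_SCTP:
        return 0;       /* offload programmed into the descriptor */
    default:
        if (is_tso)
            return -1;  /* bad TSO frame: caller drops it */
        return 1;       /* unknown protocol: software checksum */
    }
}
```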
    • i40e/i40evf: Drop outer checksum offload that was not requested · a9c9a81f
      Authored by Alexander Duyck
      The i40e and i40evf drivers contained code for inserting an outer checksum
      on UDP tunnels.  The issue however is that the upper levels of the stack
      never requested such an offload and it results in possible errors.
      
      In addition the same logic was being applied to the Rx side where it was
      attempting to validate the outer checksum, but the logic there was
      incorrect in that it was testing for the resultant sum to be equal to the
      header checksum instead of being equal to 0.
      
      Since this code is so massively flawed, and doing things that we didn't ask
      for, I am just dropping it, and will bring it back later to use as an
      offload for SKB_GSO_UDP_TUNNEL_CSUM, which can make use of such a feature.
      
      As for the Rx feature, I am dropping it completely, since it would need to
      be massively expanded and applied to IPv4 and IPv6 checksums for all parts,
      not just the one that supports Tx checksum offload for the outer header.
      Signed-off-by: Alexander Duyck <aduyck@mirantis.com>
      Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
      Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
      a9c9a81f
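The Rx bug the message describes is worth spelling out: an RFC 1071 Internet checksum is valid when the ones'-complement sum over the data *including* the checksum field folds to all-ones (equivalently, its complement is zero), not when the sum equals the header checksum. A minimal sketch of the correct check (standalone, big-endian byte pairing; not the driver's code):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* fold a 32-bit running sum into 16 bits (RFC 1071) */
static uint16_t csum_fold(uint32_t sum)
{
    while (sum >> 16)
        sum = (sum & 0xffff) + (sum >> 16);
    return (uint16_t)sum;
}

/* ones'-complement sum over a buffer, 16 bits at a time */
static uint32_t csum_partial(const uint8_t *buf, size_t len)
{
    uint32_t sum = 0;
    for (size_t i = 0; i + 1 < len; i += 2)
        sum += (uint32_t)((buf[i] << 8) | buf[i + 1]);
    if (len & 1)
        sum += (uint32_t)buf[len - 1] << 8;
    return sum;
}

/* valid iff the sum INCLUDING the checksum field folds to all-ones */
static int csum_ok(const uint8_t *hdr, size_t len)
{
    return csum_fold(csum_partial(hdr, len)) == 0xffff;
}
```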
  7. 18 February 2016 (5 commits)
  8. 03 December 2015 (1 commit)
  9. 02 December 2015 (1 commit)
    • i40e/i40evf: Fix RS bit update in Tx path and disable force WB workaround · 6a7fded7
      Authored by Anjali Singhai Jain
      This patch fixes the issue of forcing write-backs (WB) too often, which
      kept us from benefiting from NAPI.
      
      Without this patch we were forcing a WB/arming the interrupt too often,
      taking away the benefits of NAPI and causing a performance impact.
      
      With this patch we disable force WB in the clean routine for X710
      and XL710 adapters. X722 adapters do not enable interrupt to force
      a WB and benefit from WB_ON_ITR and hence force WB is left enabled
      for those adapters.
      For XL710 and X710 adapters, if we have fewer than 4 packets pending,
      a software interrupt triggered from the service task will force a WB.
      
      This patch also changes the conditions for setting the RS bit as described
      in the code comments. This optimizes when the HW does a tail bump and when
      it does a WB. It also optimizes when we do a wmb().
      
      Change-ID: Id831e1ae7d3e2ec3f52cd0917b41ce1d22d75d9d
      Signed-off-by: Anjali Singhai Jain <anjali.singhai@intel.com>
      Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
      Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
      6a7fded7
  10. 26 November 2015 (1 commit)
  11. 20 October 2015 (3 commits)
  12. 16 October 2015 (1 commit)
    • i40e/i40evf: moderate interrupts differently · ac26fc13
      Authored by Jesse Brandeburg
      The XL710 hardware has a different interrupt moderation design
      that can support a limit of total interrupts per second per
      vector, in addition to the "number of interrupts per second"
      controls already established in the driver.  This combination
      of hardware features allows us to set very low default latency
      settings while minimizing the total CPU utilization by not
      generating too many interrupts, should the user so desire.
      
      The current driver implementation is still enabling the dynamic
      moderation in the driver, and only using the rx/tx-usecs
      limit in ethtool to limit the interrupt rate per second, by default.
      
      The new code implemented in this patch:
      1) adds init/use of the new "Interrupt Limit" register
      2) adds an ethtool knob to control/report the limits above
      
      Usage is "ethtool -C ethx rx-usecs-high <value>", where <value> is a
      number of microseconds that caps the vector at one interrupt per <value>
      microseconds, regardless of the rx-usecs or tx-usecs values. Since there
      is a credit-based scheme in the hardware, rx-usecs and tx-usecs can be
      configured for very low latency over short bursts, but once the credit
      runs out the refill rate of the credits is limited by rx-usecs-high.
      
      Change-ID: I3a1075d3296123b0f4f50623c779b027af5b188d
      Signed-off-by: Jesse Brandeburg <jesse.brandeburg@intel.com>
      Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
      Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
      ac26fc13
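The relationship between the microsecond knob and the resulting interrupt rate is simple arithmetic; a one-line sketch (illustrative, not the driver's register encoding):

```c
#include <assert.h>

/* total interrupt rate implied by an interval given in microseconds;
 * rx-usecs-high caps the vector's overall rate at this value, while
 * rx-usecs/tx-usecs still set the short-burst latency */
static unsigned long ints_per_sec(unsigned long usecs)
{
    return usecs ? 1000000UL / usecs : 0;
}
```

For example, rx-usecs-high of 100 limits the vector to at most 10000 interrupts per second, even if rx-usecs is set low enough to fire far more often during a burst.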
  13. 09 October 2015 (1 commit)
  14. 08 October 2015 (1 commit)
  15. 29 September 2015 (1 commit)
  16. 06 August 2015 (3 commits)
  17. 23 July 2015 (1 commit)
  18. 28 May 2015 (2 commits)
  19. 26 February 2015 (1 commit)
  20. 24 February 2015 (1 commit)
    • i40e/i40evf: Refactor the receive routines · a132af24
      Authored by Mitch Williams
      Split the receive hot path code into two, one for packet split and one
      for single buffer. This improves receive performance since we only need
      to check if the ring is in packet split mode once per NAPI poll time,
      not several times per packet. The single buffer code is further improved
      by the removal of a bunch of code and several variables that are not
      needed. On a receive-oriented test this can improve single-threaded
      throughput.
      
      Also refactor the packet split receive path to use a fixed buffer for
      headers, like ixgbe does. This vastly reduces the number of DMA mappings
      and unmappings we need to do, allowing for much better performance in
      the presence of an IOMMU.
      
      Lastly, correct packet split descriptor types now that we are actually
      using them.
      
      Change-ID: I3a194a93af3d2c31e77ff17644ac7376da6f3e4b
      Signed-off-by: Mitch Williams <mitch.a.williams@intel.com>
      Tested-by: Jim Young <james.m.young@intel.com>
      Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
      a132af24
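The "check once per poll, not per packet" idea can be sketched with a trivial dispatcher. This is a schematic model under assumed names (the driver's real functions take a ring and budget and walk descriptors):

```c
#include <assert.h>
#include <stdbool.h>

struct rx_ring { bool ps_enabled; /* packet-split mode flag */ };

/* stand-ins for the two specialized hot paths; each would walk the
 * ring and return the number of packets cleaned */
static int clean_rx_irq_ps(struct rx_ring *r, int budget)
{
    (void)r;
    return budget;
}

static int clean_rx_irq_1buf(struct rx_ring *r, int budget)
{
    (void)r;
    return budget;
}

/* the packet-split test runs once per NAPI poll, not once per packet,
 * so each hot path stays branch-free on a per-packet basis */
static int napi_clean_rx(struct rx_ring *r, int budget)
{
    return r->ps_enabled ? clean_rx_irq_ps(r, budget)
                         : clean_rx_irq_1buf(r, budget);
}
```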
  21. 09 February 2015 (1 commit)
  22. 11 November 2014 (1 commit)
  23. 27 August 2014 (1 commit)
  24. 03 July 2014 (1 commit)