1. 23 Sep, 2016 1 commit
  2. 08 Apr, 2016 1 commit
  3. 19 Nov, 2015 1 commit
    • E
      net: provide generic busy polling to all NAPI drivers · 93d05d4a
      Eric Dumazet committed
      NAPI drivers no longer need to observe a particular protocol
      to benefit from busy polling (CONFIG_NET_RX_BUSY_POLL=y)
      
      napi_hash_add() and napi_hash_del() are automatically called
      from core networking stack, respectively from
      netif_napi_add() and netif_napi_del()
      
      This patch depends on free_netdev() and netif_napi_del() being
      called from process context, which seems to be the norm.
      
      Drivers might still prefer to call napi_hash_del() on their
      own, since they might combine all the rcu grace periods into
      a single one, knowing their NAPI structures lifetime, while
      core networking stack has no idea of a possible combining.
      
      Once this patch proves to not bring serious regressions,
      we will cleanup drivers to either remove napi_hash_del()
      or provide appropriate rcu grace periods combining.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      93d05d4a
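
Commit 93d05d4a changes the driver side only; an application still has to opt in to busy polling. A minimal user-space sketch of that opt-in, assuming a Linux system whose kernel was built with CONFIG_NET_RX_BUSY_POLL=y. `SO_BUSY_POLL` is the socket option documented in socket(7); the helper name `enable_busy_poll` and the fallback `#define` are assumptions made for this illustration and are not part of the patch.

```c
#include <sys/socket.h>

/* Fallback for older libc headers. 46 is the asm-generic value; this is an
 * assumption and does not hold on every architecture. */
#ifndef SO_BUSY_POLL
#define SO_BUSY_POLL 46
#endif

/* Hypothetical helper: ask the kernel to busy poll for roughly `usecs`
 * microseconds on blocking receives on this socket.
 * Returns 0 on success, or -1 with errno set by setsockopt(). */
int enable_busy_poll(int fd, int usecs)
{
    return setsockopt(fd, SOL_SOCKET, SO_BUSY_POLL, &usecs, sizeof(usecs));
}
```

A caller would typically invoke `enable_busy_poll(fd, 50)` right after creating the socket. According to socket(7), raising the value above the system default requires CAP_NET_ADMIN, so the call can fail for unprivileged processes.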
  4. 16 Sep, 2015 1 commit
    • A
      ixgbe: Limit lowest interrupt rate for adaptive interrupt moderation to 12K · 8ac34f10
      Alexander Duyck committed
      This patch updates the lowest limit for adaptive interrupt
      moderation to roughly 12K interrupts per second.
      
      The way I came about reaching 12K as the desired interrupt rate is by
      testing with UDP flows.  Specifically I had a simple test that ran a
      netperf UDP_STREAM test at varying sizes.  What I found was as the packet
      sizes increased the performance fell steadily behind until we were only
      able to receive at ~4Gb/s with a message size of 65507.  A bit of digging
      found that we were dropping packets for the socket in the network stack,
      and looking at things further what I found was I could solve it by increasing
      the interrupt rate, or increasing the rmem_default/rmem_max.  What I found was
      that when the interrupt coalescing resulted in more data being processed
      per interrupt than could be stored in the socket buffer we started losing
      packets and the performance dropped.  So I reached 12K based on the
      following math.
      
      rmem_default = 212992
      skb->truesize = 2994
      212992 / 2994 = 71.14 packets to fill the buffer
      
      packet rate at 1514 packet size is 812744pps
      71.14 / 812744 = 87.5us to fill socket buffer
      
      From there it was just a matter of choosing the interrupt rate and
      providing a bit of wiggle room which is why I decided to go with 12K
      interrupts per second as that uses a value of 84us.
      
      The data below is based on VM to VM over a direct assigned ixgbe interface.
      The test run was:
      	netperf -H <ip> -t UDP_STREAM
      
      Socket  Message  Elapsed      Messages                   CPU      Service
      Size    Size     Time         Okay Errors   Throughput   Util     Demand
      bytes   bytes    secs            #      #   10^6bits/sec % SS     us/KB
      Before:
      212992   65507   60.00     1100662      0     9613.4     10.89    0.557
      212992           60.00      473474            4135.4     11.27    0.576
      
      After:
      212992   65507   60.00     1100413      0     9611.2     10.73    0.549
      212992           60.00      974132            8508.3     11.69    0.598
      
      Using bare metal the data is similar but not as dramatic as the throughput
      increases from about 8.5Gb/s to 9.5Gb/s.
      Signed-off-by: Alexander Duyck <alexander.h.duyck@redhat.com>
      Tested-by: Krishneil Singh <krishneil.k.singh@intel.com>
      Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
      8ac34f10
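
The sizing argument in commit 8ac34f10 can be reproduced with a few lines of arithmetic. The sketch below is only an illustration of that reasoning; the function names are hypothetical and nothing here is taken from the ixgbe driver itself.

```c
#include <assert.h>

/* Number of packets of a given skb truesize that fit in a socket
 * receive buffer of rmem_bytes. */
double packets_to_fill(double rmem_bytes, double truesize_bytes)
{
    assert(truesize_bytes > 0);
    return rmem_bytes / truesize_bytes;
}

/* Time in microseconds to receive `packets` at `packets_per_sec`. */
double fill_time_us(double packets, double packets_per_sec)
{
    assert(packets_per_sec > 0);
    return packets / packets_per_sec * 1e6;
}

/* Interrupt rate implied by a fixed interval between interrupts. */
double interrupts_per_sec(double interval_us)
{
    assert(interval_us > 0);
    return 1e6 / interval_us;
}
```

With the commit's inputs (rmem_default = 212992, truesize = 2994, 812744 pps), `packets_to_fill` gives about 71 packets and `fill_time_us` gives a little under 88 us. An 84 us interval corresponds to roughly 11.9K interrupts per second, which leaves a small margin before the socket buffer would overflow.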
  5. 11 Nov, 2014 1 commit
  6. 18 Sep, 2014 7 commits
  7. 12 Sep, 2014 1 commit
    • A
      ixgbe: Refactor busy poll socket code to address multiple issues · adc81090
      Alexander Duyck committed
      This change addresses several issues in the current ixgbe implementation of
      busy poll sockets.
      
      First was the fact that it was possible for frames to be delivered out of
      order if they were held in GRO.  This is addressed by flushing the GRO buffers
      before releasing the q_vector back to the idle state.
      
      The other issue was the fact that we were having to take a spinlock on
      changing the state to and from idle.  To resolve this I have replaced the
      state value with an atomic and use atomic_cmpxchg to change the value from
      idle, and a simple atomic set to restore it back to idle after we have
      acquired it.  This allows us to only use a locked operation on acquiring the
      vector without a need for a locked operation to release it.
      Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
      Tested-by: Phil Schmitt <phillip.j.schmitt@intel.com>
      Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
      adc81090
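
The locking scheme described in commit adc81090 (compare-and-exchange to acquire, plain store to release) can be modelled in user space with C11 atomics. This is a sketch under stated assumptions, not the driver code: the type `qv_model`, the state constants, and both function names are hypothetical, and the kernel's `atomic_cmpxchg` and atomic set are replaced here by their `<stdatomic.h>` counterparts, with a release store chosen for the unlock.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical states for the model; only "idle" may be acquired. */
enum { QV_STATE_IDLE = 0, QV_STATE_NAPI = 1, QV_STATE_POLL = 2 };

struct qv_model {
    atomic_int state;
};

/* Acquire: a single locked compare-and-exchange from idle to `owner`.
 * Returns false if someone else already holds the vector. */
bool qv_try_lock(struct qv_model *qv, int owner)
{
    int expected = QV_STATE_IDLE;
    return atomic_compare_exchange_strong(&qv->state, &expected, owner);
}

/* Release: the owner holds the vector exclusively, so a plain atomic
 * store back to idle is enough and no read-modify-write is needed.
 * In the real driver, GRO would be flushed before this point. */
void qv_unlock(struct qv_model *qv)
{
    atomic_store_explicit(&qv->state, QV_STATE_IDLE, memory_order_release);
}
```

The asymmetry is the point of the design: contention is only possible while the state is idle, so only the acquire path pays for a locked operation.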
  8. 04 Sep, 2014 1 commit
  9. 23 May, 2014 1 commit
  10. 13 Mar, 2014 1 commit
  11. 19 Feb, 2014 1 commit
  12. 08 Nov, 2013 1 commit
  13. 11 Jun, 2013 1 commit
  14. 16 Feb, 2013 2 commits
  15. 05 Feb, 2013 1 commit
  16. 01 Nov, 2012 1 commit
  17. 19 Oct, 2012 1 commit
  18. 22 Jul, 2012 3 commits
  19. 19 Jul, 2012 2 commits
  20. 18 Jul, 2012 2 commits
  21. 15 Jul, 2012 2 commits
  22. 11 Jul, 2012 5 commits
  23. 27 Jun, 2012 1 commit
  24. 04 May, 2012 1 commit
    • A
      ixgbe: Reorder the ring to q_vector mapping to improve performance · d0bfcdfd
      Alexander Duyck committed
      This change reorders the mapping of rings to q_vectors in the case that the
      number of rings exceeds the number of q_vectors.  Previously we would
      allocate the first R/N queues to the first q_vector where R is the number
      of rings and N is the number of q_vectors.  Instead of doing this we can do
      a better job of interleaving the rings to the CPUs by assigning every Nth
      ring to the q_vector.
      
      The below tables illustrate this change for the R = 16 N = 4 case.
                Before patch  After patch
      q_vector:  0  1  2  3    0  1  2  3
      Rings:     0  4  8 12    0  1  2  3
                 1  5  9 13    4  5  6  7
                 2  6 10 14    8  9 10 11
                 3  7 11 15   12 13 14 15
      
      This should improve the performance for both DCB and ATR when the number of
      rings exceeds the number of q_vectors allocated by the adapter.
      Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
      Tested-by: Ross Brattain <ross.b.brattain@intel.com>
      Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
      d0bfcdfd
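
The two ring-to-q_vector mappings described in commit d0bfcdfd can be written as one-line index functions. This is a minimal sketch assuming the number of rings divides evenly by the number of q_vectors, as in the R = 16, N = 4 example; the function names are hypothetical and the actual driver code is not reproduced here.

```c
#include <assert.h>

/* Old scheme: the first R/N rings go to q_vector 0, the next R/N to
 * q_vector 1, and so on (contiguous blocks). */
int q_vector_before(int ring, int num_rings, int num_q_vectors)
{
    assert(num_q_vectors > 0 && num_rings % num_q_vectors == 0);
    return ring / (num_rings / num_q_vectors);
}

/* New scheme: every Nth ring goes to the same q_vector (interleaved). */
int q_vector_after(int ring, int num_q_vectors)
{
    assert(num_q_vectors > 0);
    return ring % num_q_vectors;
}
```

For R = 16 and N = 4, ring 3 moves from q_vector 0 to q_vector 3, and ring 12 moves from q_vector 3 to q_vector 0, so consecutive rings end up spread across different vectors.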