1. 26 Mar, 2019 1 commit
  2. 25 Mar, 2019 22 commits
  3. 24 Mar, 2019 17 commits
    • D
      Merge branch 'aquantia-rx-perf' · 956ca8fc
      David S. Miller committed
      Igor Russkikh says:
      
      ====================
      net: aquantia: RX performance optimization patches
      
      Here is a set of patches targeting performance improvements
      on various platforms and protocols.
      
      Our main target was rx performance on iommu systems, notably
      NVIDIA Jetson TX2 and NVIDIA Xavier platforms.
      
      We introduce a page reuse strategy to better deal with IOMMU DMA mapping costs.
      With it we see 80-90% page reuse under some test configurations on UDP traffic.
      
      This shows good improvements on other systems with IOMMU hardware, like
      AMD Ryzen.
      
      We've also improved TCP LRO configuration parameters, allowing packets to better
      coalesce.
      
      Page reuse tests were carried out using iperf3, iperf2, netperf and pktgen.
      Mainly on UDP traffic, with various packet lengths.
      
      Jetson TX2, UDP, Default MTU:
      RX Lost Datagrams
        Before: Max: 69%  Min: 68% Avg: 68.5%
        After:  Max: 41%  Min: 38% Avg: 39.2%
      Maximum throughput
        Before: 1.27 Gbits/sec
        After:  2.41 Gbits/sec
      
      AMD Ryzen 5 2400G, UDP, Default MTU:
      RX Lost Datagrams
        Before:  Max: 12%  Min: 4.5% Avg: 7.17%
        After:   Max: 6.2% Min: 2.3% Avg: 4.26%
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
      956ca8fc
    • I
      net: aquantia: enable driver build for arm64 or compile_test · d0d443cd
      Igor Russkikh committed
      The driver is now constantly tested in our lab on aarch64 hardware:
      Jetson TX2, Pascal and Xavier Tegra-based hardware.
      Many Tegra SMMU-related HW bugs have already been fixed or worked around.
      
      Thus, add ARM64 into Kconfig.
      
      Also add a COMPILE_TEST dependency.
      Signed-off-by: Igor Russkikh <igor.russkikh@aquantia.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      d0d443cd
    • N
      net: aquantia: improve LRO configuration · 1eef4757
      Nikita Danilov committed
      Default LRO HW configuration was very conservative.
      
      It used a low number of descriptors per LRO sequence, a small session
      timeout, and inefficient settings in the interrupt generation logic.
      
      Change max number of LRO descriptors from 2 to 16 to
      increase performance. Increase maximum coalescing interval
      in HW to 250uS. Tune up HW LRO interrupt generation setting
      to prevent hw issues with long LRO sessions.
      Signed-off-by: Nikita Danilov <nikita.danilov@aquantia.com>
      Signed-off-by: Igor Russkikh <igor.russkikh@aquantia.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      1eef4757
    • I
      net: aquantia: Increase rx ring default size from 1K to 2K · 1b09e72d
      Igor Russkikh committed
      For multigig rates 1K ring size is often not enough and causes extra
      packet drops in hardware.
      Signed-off-by: Igor Russkikh <igor.russkikh@aquantia.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      1b09e72d
    • I
      net: aquantia: Make RX default frame size 2K · 8bd7e763
      Igor Russkikh committed
      This correlates with default internet MTU. This also allows page
      flip/reuse to be activated, since each allocated RX page now serves for
      two frags/packets.
      Signed-off-by: Igor Russkikh <igor.russkikh@aquantia.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      8bd7e763
    • I
      net: aquantia: Introduce rx refill threshold value · 9773ef18
      Igor Russkikh committed
      Previously, we refilled the ring even after a single descriptor was processed.
      Under high packet load that caused page allocation logic to be triggered
      too often. That made overall ring processing slower.
      
      Moreover, with page buffer reuse implemented, we should give higher
      networking levels a chance to process received packets faster and release
      the pages they consumed, which in turn gives these pages a higher chance
      of being reused.
      
      The RX ring is now refilled only when AQ_CFG_RX_REFILL_THRES or more
      descriptors have been processed (32 by default). Under regular traffic this
      gives enough time for a packet to be consumed and its page to be reused.
      Signed-off-by: Igor Russkikh <igor.russkikh@aquantia.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      9773ef18
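The threshold rule described in this commit can be modelled in a few lines of user-space C. This is only a sketch: `rx_ring_model`, `rx_ring_processed` and the `RX_REFILL_THRES` macro are hypothetical names for illustration, not the driver's own code; only the default value of 32 comes from the commit message.

```c
#include <assert.h>
#include <stddef.h>

/* Default threshold quoted in the commit message (AQ_CFG_RX_REFILL_THRES). */
#define RX_REFILL_THRES 32u

/* Hypothetical user-space model of an RX ring, not the driver's structure. */
struct rx_ring_model {
    unsigned int pending; /* descriptors processed but not yet refilled */
    unsigned int refills; /* number of refill passes actually run */
};

/*
 * Account for n processed descriptors. The refill pass runs only once the
 * backlog reaches the threshold; until then the call returns 0.
 * Returns the number of descriptors refilled by this call.
 */
static unsigned int rx_ring_processed(struct rx_ring_model *r, unsigned int n)
{
    unsigned int filled;

    assert(r != NULL);
    r->pending += n;
    if (r->pending < RX_REFILL_THRES)
        return 0; /* defer: give the stack time to release pages */

    filled = r->pending;
    r->pending = 0;
    r->refills++;
    return filled;
}
```

In this model, 31 processed descriptors trigger no refill, and the 32nd triggers one pass that covers all 32.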
    • I
      net: aquantia: optimize rx performance by page reuse strategy · 46f4c29d
      Igor Russkikh committed
      We introduce an internal aq_rxpage wrapper over a regular page,
      in which an extra field is tracked: the rxpage offset inside the allocated page.
      
      This offset allows one page to be reused for multiple packets.
      When needed (for example, when processing large frames), the allocated
      page order can be customized. This gives even higher page reuse
      efficiency.
      
      page_ref_count is used to track page users. If the underlying page
      still has users during rx refill, we increase pg_off by the rx frame size,
      so the top half of the page is reused.
      Signed-off-by: Igor Russkikh <igor.russkikh@aquantia.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      46f4c29d
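A user-space model may help picture the refill decision described above. It is a sketch under stated assumptions: `rxpage_model` and its helpers are hypothetical stand-ins for aq_rxpage, a plain counter stands in for page_ref_count, the page size is assumed to be 4096 bytes, and the "no other users, reuse from offset 0" branch is an assumption rather than something the commit message spells out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define PAGE_BYTES     4096u /* assumed page size */
#define RX_FRAME_BYTES 2048u /* 2K default RX frame size from this series */

/* Hypothetical stand-in for the aq_rxpage wrapper. */
struct rxpage_model {
    bool allocated;
    unsigned int refcount; /* models page_ref_count(); 1 = only the ring */
    unsigned int pg_off;   /* offset of the frame currently posted to HW */
};

enum refill_result {
    REFILL_NEW_PAGE,         /* a fresh page had to be allocated and mapped */
    REFILL_REUSED_WHOLE,     /* assumption: no other users, restart at 0 */
    REFILL_REUSED_NEXT_FRAME /* page has users, pg_off advanced by one frame */
};

/* The stack takes a reference on the page when it receives a frame. */
static void rxpage_hand_to_stack(struct rxpage_model *p) { p->refcount++; }

/* The stack drops its reference once the packet has been consumed. */
static void rxpage_stack_release(struct rxpage_model *p) { p->refcount--; }

static enum refill_result rxpage_refill(struct rxpage_model *p)
{
    assert(p != NULL);
    if (p->allocated && p->refcount == 1) {
        p->pg_off = 0;
        return REFILL_REUSED_WHOLE;
    }
    if (p->allocated && p->pg_off + 2 * RX_FRAME_BYTES <= PAGE_BYTES) {
        /* Page still has users: post the next frame-sized slice. */
        p->pg_off += RX_FRAME_BYTES;
        return REFILL_REUSED_NEXT_FRAME;
    }
    /* No page yet, or no room left: model a fresh allocation. */
    p->allocated = true;
    p->refcount = 1;
    p->pg_off = 0;
    return REFILL_NEW_PAGE;
}
```

With a 4K page and 2K frames, each page serves at most two frames before the model falls back to a fresh allocation, which matches the "two frags/packets" remark in the 2K frame size commit.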
    • I
      net: aquantia: optimize rx path using larger preallocated skb len · 7e2698c4
      Igor Russkikh committed
      The Atlantic driver used a preallocated skb size of 14 bytes. That made L3 protocol
      processing inefficient because pskb_pull had to fetch all the L3/L4 headers
      from extra fragments.
      
      Especially on UDP flows, this caused extra packet drops because the CPU was
      overloaded with pskb_pull.
      
      This patch uses eth_get_headlen for skb preallocation.
      Signed-off-by: Igor Russkikh <igor.russkikh@aquantia.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      7e2698c4
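To see why 14 bytes is too small, it helps to compute how many header bytes a typical frame carries. The function below is a simplified, hypothetical model (`headlen_model` is not a kernel API and is far less general than eth_get_headlen): it assumes untagged Ethernet carrying IPv4 with UDP or TCP and ignores VLAN tags, IPv6 and tunnels.

```c
#include <assert.h>
#include <stddef.h>

#define ETH_HDR_LEN 14u /* destination MAC + source MAC + EtherType */

/*
 * Return the number of leading header bytes in an Ethernet frame.
 * Sketch only: handles untagged IPv4 with UDP or TCP, and falls back to
 * the bare Ethernet header length for anything else.
 */
static size_t headlen_model(const unsigned char *f, size_t len)
{
    size_t off = ETH_HDR_LEN;
    size_t ihl, doff;
    unsigned char proto;

    assert(f != NULL);
    if (len < ETH_HDR_LEN)
        return len;
    if (f[12] != 0x08 || f[13] != 0x00) /* EtherType is not IPv4 */
        return off;
    if (len < off + 20 || (f[off] >> 4) != 4)
        return off;

    ihl = (size_t)(f[off] & 0x0f) * 4; /* IPv4 header length in bytes */
    if (ihl < 20 || len < off + ihl)
        return off;
    proto = f[off + 9];
    off += ihl;

    if (proto == 17 && len >= off + 8) { /* UDP: fixed 8-byte header */
        off += 8;
    } else if (proto == 6 && len >= off + 20) { /* TCP: data offset field */
        doff = (size_t)(f[off + 12] >> 4) * 4;
        if (doff >= 20 && len >= off + doff)
            off += doff;
    }
    return off;
}
```

For a plain IPv4/UDP frame this yields 42 bytes (14 + 20 + 8), so a 14-byte linear area leaves all L3/L4 headers in fragments.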
    • D
      Merge tag 'mlx5-updates-2019-03-20' of git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux · d64fee0a
      David S. Miller committed
      Saeed Mahameed says:
      
      ====================
      mlx5-updates-2019-03-20
      
      This series includes updates to mlx5 driver,
      
      1) Compiler warnings cleanup from Saeed Mahameed
      2) Parav Pandit simplifies sriov enable/disables
      3) Gustavo A. R. Silva, Removes a redundant assignment
      4) Moshe Shemesh, Adds Geneve tunnel stateless offload support
      5) Eli Britstein, Adds the Support for VLAN modify action and
         Replaces TC VLAN pop and push actions with VLAN modify
      
      Note: This series includes two simple non-mlx5 patches,
      
      1) Declare IANA_VXLAN_UDP_PORT definition in include/net/vxlan.h,
      and use it in some drivers.
      2) Declare GENEVE_UDP_PORT definition in include/net/geneve.h,
      and use it in mlx5 and nfp drivers.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
      d64fee0a
    • D
      Merge branch '100GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/jkirsher/next-queue · 071d08af
      David S. Miller committed
      Jeff Kirsher says:
      
      ====================
      100GbE Intel Wired LAN Driver Updates 2019-03-22
      
      This series contains updates to ice driver only.
      
      Akeem enables MAC anti-spoofing by default when a new VSI is being
      created.  Fixes an issue when reclaiming VF resources back to the pool
      after reset, by freeing VF resources separately using the first VF
      vector index to traverse the list, instead of starting at the last
      assigned vectors list.  Added support for VF & PF promiscuous mode in
      the ice driver.  Fixed the PF driver from letting the VF know it is "not
      trusted" when it attempts to add more than its permitted additional MAC
      addresses.  Altered how the driver gets the VF VSIs instances, instead
      of using the mailbox messages to retrieve VSIs, get it directly via the
      VF object in the PF data structure.
      
      Bruce fixes return values to resolve static analysis warnings.  Made
      whitespace changes to increase readability and reduce code wrapping.
      
      Anirudh cleans up code by removing a function prototype that was never
      implemented and removed an unused field in the ice_sched_vsi_info
      structure.
      
      Kiran fixes a potential divide by zero issue by adding a check.
      
      Victor cleans up the transmit scheduler by adjusting the stack variable
      usage and added/modified debug prints to make them more useful.
      
      Yashaswini updates the driver in VEB mode to ensure that the LAN_EN bit
      is set if all the right conditions are met.
      
      Christopher ensures the loopback enable bit is not set for prune switch
      rules, since all transmit traffic would be looped back to the internal
      switch and dropped.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
      071d08af
    • D
      Merge branch 'tcp-rx-tx-cache' · bdaba895
      David S. Miller committed
      Eric Dumazet says:
      
      ====================
      tcp: add rx/tx cache to reduce lock contention
      
      On hosts with many cpus we can observe a very serious contention
      on spinlocks used in mm slab layer.
      
      The following can happen quite often :
      
      1) TX path
        sendmsg() allocates one (fclone) skb on CPU A, sends a clone.
        ACK is received on CPU B, and consumes the skb that was in the retransmit
        queue.
      
      2) RX path
        network driver allocates skb on CPU C
        recvmsg() happens on CPU D, freeing the skb after it has been delivered
        to user space.
      
      In both cases, we are hitting the asymmetric alloc/free pattern
      for which slab has to drain alien caches. At 8 Mpps,
      this represents 16 million alloc/free operations per second and has a huge penalty.
      
      In an interesting experiment, I tried to use a single kmem_cache for all the skbs
      (in skb_init() : skbuff_fclone_cache = skbuff_head_cache =
                        kmem_cache_create("skbuff_fclone_cache", sizeof(struct sk_buff_fclones),);
      and most of the contention disappeared, since cpus could better use
      their local slab per-cpu cache.
      
      But we can actually do better, in the following patches.
      
      TX : at ACK time, no longer free the skb but put it back in a tcp socket cache,
           so that next sendmsg() can reuse it immediately.
      
      RX : at recvmsg() time, do not free the skb but put it in a tcp socket cache
         so that it can be freed by the cpu feeding the incoming packets in BH.
      
      This increased the performance of small RPC benchmark by about 10 % on a host
      with 112 hyperthreads.
      
      v2 : - Solved a race condition : sk_stream_alloc_skb() to make sure the prior
             clone has been freed.
           - Really test rps_needed in sk_eat_skb() as claimed.
           - Fixed rps_needed use in drivers/net/tun.c
      
      v3: Added a #ifdef CONFIG_RPS, to avoid compile error (kbuild robot)
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
      bdaba895
    • E
      tcp: add one skb cache for rx · 8b27dae5
      Eric Dumazet committed
      Oftentimes, recvmsg() system calls and BH handling for a particular
      TCP socket are done on different cpus.
      
      This means the incoming skb had to be allocated on a cpu,
      but freed on another.
      
      This incurs high spinlock contention in the slab layer for small rpc,
      but also a high number of cache line ping pongs for larger packets.
      
      A full size GRO packet might use 45 page fragments, meaning
      that up to 45 put_page() can be involved.
      
      Moreover, performing the __kfree_skb() in the recvmsg() context
      adds latency for user applications, and increases the probability
      of trapping them in backlog processing, since the BH handler
      might find the socket owned by the user.
      
      This patch, combined with the prior one, increases the rpc
      performance by about 10 % on servers with a large number of cores.
      
      (tcp_rr workload with 10,000 flows and 112 threads reaches 9 Mpps
       instead of 8 Mpps)
      
      This also increases single bulk flow performance on 40Gbit+ links,
      since in this case there are often two cpus working in tandem :
      
       - CPU handling the NIC rx interrupts, feeding the receive queue,
        and (after this patch) freeing the skbs that were consumed.
      
       - CPU in recvmsg() system call, essentially 100 % busy copying out
        data to user space.
      
      Having at most one skb in a per-socket cache has very little risk
      of memory exhaustion, and since it is protected by socket lock,
      its management is essentially free.
      
      Note that if rps/rfs is used, we do not enable this feature, because
      there is high chance that the same cpu is handling both the recvmsg()
      system call and the TCP rx path, but that another cpu did the skb
      allocations in the device driver right before the RPS/RFS logic.
      
      To properly handle this case, it seems we would need to record
      on which cpu skb was allocated, and use a different channel
      to give skbs back to this cpu.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
      Acked-by: Willem de Bruijn <willemb@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      8b27dae5
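The deferred-free idea from this commit can be sketched as a one-slot handoff between the reader and the BH side. Everything here is a hypothetical user-space model: `rx_sock_model`, `reader_consumed` and `bh_receive` are illustrative names, `malloc`/`free` stand in for skb allocation and `__kfree_skb()`, and the socket lock that protects the slot in the kernel is not modelled.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical model of a TCP socket with a one-entry RX cache. */
struct rx_sock_model {
    void *rx_cache;               /* at most one consumed buffer */
    bool rps_in_use;              /* feature is disabled when RPS/RFS is on */
    unsigned int freed_by_reader; /* frees paid for in recvmsg() context */
    unsigned int freed_by_bh;     /* frees done by the cpu feeding packets */
};

/* recvmsg() side: park the consumed buffer instead of freeing it. */
static void reader_consumed(struct rx_sock_model *sk, void *buf)
{
    assert(sk != NULL);
    if (!sk->rps_in_use && sk->rx_cache == NULL) {
        sk->rx_cache = buf;
        return;
    }
    free(buf); /* slot busy or feature off: free in the reader's context */
    sk->freed_by_reader++;
}

/* BH side: on the next incoming packet, free whatever the reader parked. */
static void bh_receive(struct rx_sock_model *sk)
{
    assert(sk != NULL);
    if (sk->rx_cache != NULL) {
        free(sk->rx_cache);
        sk->rx_cache = NULL;
        sk->freed_by_bh++;
    }
}
```

Because the slot holds at most one buffer, the worst-case extra memory in this model is one buffer per socket, in line with the commit's memory exhaustion argument.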
    • E
      tcp: add one skb cache for tx · 472c2e07
      Eric Dumazet committed
      On hosts with a lot of cores, RPC workloads suffer from heavy contention on slab spinlocks.
      
          20.69%  [kernel]       [k] queued_spin_lock_slowpath
           5.64%  [kernel]       [k] _raw_spin_lock
           3.83%  [kernel]       [k] syscall_return_via_sysret
           3.48%  [kernel]       [k] __entry_text_start
           1.76%  [kernel]       [k] __netif_receive_skb_core
           1.64%  [kernel]       [k] __fget
      
      For each sendmsg(), we allocate one skb, and free it when the ACK packet arrives.
      
      In many cases, ACK packets are handled by other cpus, and this unfortunately
      incurs heavy costs for the slab layer.
      
      This patch uses an extra pointer in socket structure, so that we try to reuse
      the same skb and avoid these expensive costs.
      
      We cache at most one skb per socket so this should be safe as far as
      memory pressure is concerned.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
      Acked-by: Willem de Bruijn <willemb@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      472c2e07
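The "extra pointer in socket structure" can be pictured as a one-slot cache sitting in front of the allocator. The sketch below is a hypothetical user-space model, not the kernel change itself: `sock_model`, `sock_alloc_buf` and `sock_ack_buf` are invented names, and `malloc`/`free` stand in for the slab allocator.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

struct buf_model { char data[256]; }; /* stand-in for an skb */

/* Hypothetical model of a socket with a one-entry TX cache. */
struct sock_model {
    struct buf_model *tx_cache; /* at most one cached buffer */
    unsigned int slab_allocs;   /* trips to the real allocator */
    unsigned int slab_frees;    /* buffers returned to the allocator */
};

/* sendmsg() side: prefer the cached buffer over a fresh allocation. */
static struct buf_model *sock_alloc_buf(struct sock_model *sk)
{
    struct buf_model *b;

    assert(sk != NULL);
    b = sk->tx_cache;
    if (b != NULL) {
        sk->tx_cache = NULL;
        return b;
    }
    sk->slab_allocs++;
    return malloc(sizeof(*b));
}

/* ACK side: keep one buffer for the next sendmsg(), free any surplus. */
static void sock_ack_buf(struct sock_model *sk, struct buf_model *b)
{
    assert(sk != NULL);
    if (sk->tx_cache == NULL) {
        sk->tx_cache = b;
        return;
    }
    sk->slab_frees++;
    free(b);
}
```

In a send/ACK/send sequence the second allocation is served from the cache, so the model's allocator counter stays at one.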
    • E
      net: convert rps_needed and rfs_needed to new static branch api · dc05360f
      Eric Dumazet committed
      We prefer static_branch_unlikely() over static_key_false() these days.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
      Acked-by: Willem de Bruijn <willemb@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      dc05360f
    • D
      Merge branch 'net-dev-BYPASS-for-lockless-qdisc' · 7c1508e5
      David S. Miller committed
      Paolo Abeni says:
      
      ====================
      net: dev: BYPASS for lockless qdisc
      
      This patch series is aimed at improving xmit performance of lockless qdisc
      in the uncontended scenario.
      
      After the lockless refactor pfifo_fast can't leverage the BYPASS optimization.
      Due to retpolines the overhead for the avoidable enqueue and dequeue operations
      has increased and we see measurable regressions.
      
      The first patch introduces the BYPASS code path for lockless qdisc, and the
      second one optimizes such path further. Overall this avoids up to 3 indirect
      calls per xmit packet. Detailed performance figures are reported in the 2nd
      patch.
      
       v2 -> v3:
        - qdisc_is_empty() has a const argument (Eric)
      
       v1 -> v2:
        - use really an 'empty' flag instead of 'not_empty', as
          suggested by Eric
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
      7c1508e5
    • P
      net: dev: introduce support for sch BYPASS for lockless qdisc · ba27b4cd
      Paolo Abeni committed
      With commit c5ad119f ("net: sched: pfifo_fast use skb_array")
      pfifo_fast no longer benefits from the TCQ_F_CAN_BYPASS optimization.
      Due to retpolines the cost of the enqueue()/dequeue() pair has become
      relevant and we observe measurable regression for the uncontended
      scenario when the packet-rate is below line rate.
      
      After commit 46b1c18f ("net: sched: put back q.qlen into a
      single location") we can check for empty qdisc with a reasonably
      fast operation even for nolock qdiscs.
      
      This change extends TCQ_F_CAN_BYPASS support to nolock qdisc.
      The new chunk of code closely mirrors the existing one for traditional
      qdisc, leveraging a newly introduced helper to atomically read the
      qdisc length.
      
      Tested with pktgen in queue xmit mode, with pfifo_fast, a MQ
      device, and MQ root qdisc:
      
      threads         vanilla         patched
                      kpps            kpps
      1               2465            2889
      2               4304            5188
      4               7898            9589
      
      Same as above, but with a single queue device:
      
      threads         vanilla         patched
                      kpps            kpps
      1               2556            2827
      2               2900            2900
      4               5000            5000
      8               4700            4700
      
      No measurable changes in the contended scenarios, and more than 10%
      improvement in the uncontended ones.
      
       v1 -> v2:
        - rebased after flag name change
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
      Tested-by: Ivan Vecera <ivecera@redhat.com>
      Reviewed-by: Eric Dumazet <edumazet@google.com>
      Reviewed-by: Ivan Vecera <ivecera@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      ba27b4cd
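The bypass decision itself is small: if the qdisc allows bypass and holds no packets, the packet goes straight to the device and the enqueue/dequeue pair is skipped. The following is a hypothetical single-threaded model (`qdisc_model`, `xmit_one` and the ring buffer are illustrative only); it does not model the seqlock or the atomic length read used by the real code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define QLEN_MAX 8u

/* Hypothetical single-threaded model of a qdisc. */
struct qdisc_model {
    bool can_bypass;           /* models the TCQ_F_CAN_BYPASS flag */
    int queue[QLEN_MAX];       /* small ring buffer of packet ids */
    unsigned int head;
    unsigned int qlen;
    unsigned int enqueues;     /* packets that went through the queue */
    unsigned int direct_xmits; /* packets that took the bypass path */
};

static bool qdisc_enqueue(struct qdisc_model *q, int pkt)
{
    assert(q != NULL);
    if (q->qlen == QLEN_MAX)
        return false; /* queue full: packet dropped in this model */
    q->queue[(q->head + q->qlen) % QLEN_MAX] = pkt;
    q->qlen++;
    q->enqueues++;
    return true;
}

static bool qdisc_dequeue(struct qdisc_model *q, int *pkt)
{
    assert(q != NULL && pkt != NULL);
    if (q->qlen == 0)
        return false;
    *pkt = q->queue[q->head];
    q->head = (q->head + 1) % QLEN_MAX;
    q->qlen--;
    return true;
}

/* Returns true when the packet bypassed the queue entirely. */
static bool xmit_one(struct qdisc_model *q, int pkt)
{
    assert(q != NULL);
    if (q->can_bypass && q->qlen == 0) {
        q->direct_xmits++; /* sent directly, no enqueue/dequeue pair */
        return true;
    }
    (void)qdisc_enqueue(q, pkt); /* preserve ordering behind queued packets */
    return false;
}
```

Bypass is only taken on an empty queue, so packets already waiting are never overtaken.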
    • P
      net: sched: add empty status flag for NOLOCK qdisc · 28cff537
      Paolo Abeni committed
      The queue is marked not empty after acquiring the seqlock,
      and it is up to the NOLOCK qdisc to clear the flag on dequeue.
      Since the empty status lies on the same cache line as the
      seqlock, it is always hot in cache during the updates.
      
      This makes the empty flag update a little bit loose. Given
      the lack of synchronization between enqueue and dequeue, this
      is unavoidable.
      
      v2 -> v3:
       - qdisc_is_empty() has a const argument (Eric)
      
      v1 -> v2:
       - use really an 'empty' flag instead of 'not_empty', as
         suggested by Eric
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
      Reviewed-by: Eric Dumazet <edumazet@google.com>
      Reviewed-by: Ivan Vecera <ivecera@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      28cff537