1. 11 10月, 2018 1 次提交
  2. 02 10月, 2018 1 次提交
  3. 25 9月, 2018 1 次提交
  4. 06 9月, 2018 8 次提交
    • S
      net/mlx5e: Replace PTP clock lock from RW lock to seq lock · 64109f1d
      Shay Agroskin 提交于
      Changed "priv.clock.lock" lock from 'rw_lock' to 'seq_lock'
      in order to improve packet rate performance.
      
      Tested on Intel(R) Xeon(R) CPU E5-2660 v2 @ 2.20GHz.
      Sent 64b packets between two peers connected by ConnectX-5,
      and measured packet rate for the receiver in three modes:
      	no time-stamping (base rate)
      	time-stamping using rw_lock (old lock) for critical region
      	time-stamping using seq_lock (new lock) for critical region
      Only the receiver time stamped its packets.
      
      The measured packet rate improvements are:
      
      	Single flow (multiple TX rings to single RX ring):
      		without timestamping:	  4.26 (M packets)/sec
      		with rw-lock (old lock):  4.1  (M packets)/sec
      		with seq-lock (new lock): 4.16 (M packets)/sec
      		1.46% improvement
      
      	Multiple flows (multiple TX rings to six RX rings):
      		without timestamping: 	  22   (M packets)/sec
      		with rw-lock (old lock):  11.7 (M packets)/sec
      		with seq-lock (new lock): 21.3 (M packets)/sec
      		82.05% improvement
      
      The packet rate improvement is due to the lack of atomic operations
      for the 'readers' by the seq-lock.
      Since there are much more 'readers' than 'writers' contention
      on this lock, almost all atomic operations are saved.
      this results in a dramatic decrease in overall
      cache misses.
      Signed-off-by: NShay Agroskin <shayag@mellanox.com>
      Signed-off-by: NSaeed Mahameed <saeedm@mellanox.com>
      64109f1d
    • V
      net/mlx5: Add flow counters idr · 12d6066c
      Vlad Buslov 提交于
      Previous patch in series changed flow counter storage structure from
      rb_tree to linked list in order to improve flow counter traversal
      performance. The drawback of such solution is that flow counter lookup by
      id becomes linear in complexity.
      
      Store pointers to flow counters in idr in order to improve lookup
      performance to logarithmic again. Idr is non-intrusive data structure and
      doesn't require extending flow counter struct with new elements. This means
      that idr can be used for lookup, while linked list from previous patch is
      used for traversal, and struct mlx5_fc size is <= 2 cache lines.
      Signed-off-by: NVlad Buslov <vladbu@mellanox.com>
      Acked-by: NAmir Vadai <amir@vadai.me>
      Reviewed-by: NPaul Blakey <paulb@mellanox.com>
      Signed-off-by: NSaeed Mahameed <saeedm@mellanox.com>
      12d6066c
    • V
      net/mlx5: Store flow counters in a list · 9aff93d7
      Vlad Buslov 提交于
      In order to improve performance of flow counter stats query loop that
      traverses all configured flow counters, replace rb_tree with double-linked
      list. This change improves performance of traversing flow counters by
      removing the tree traversal. (profiling data showed that call to rb_next
      was most top CPU consumer)
      
      However, lookup of flow flow counter in list becomes linear, instead of
      logarithmic. This problem is fixed by next patch in series, which adds idr
      for fast lookup. Idr is to be used because it is not an intrusive data
      structure and doesn't require adding any new members to struct mlx5_fc,
      which allows its control data part to stay <= 1 cache line in size.
      Signed-off-by: NVlad Buslov <vladbu@mellanox.com>
      Acked-by: NAmir Vadai <amir@vadai.me>
      Reviewed-by: NPaul Blakey <paulb@mellanox.com>
      Signed-off-by: NSaeed Mahameed <saeedm@mellanox.com>
      9aff93d7
    • V
      net/mlx5: Add new list to store deleted flow counters · 6e5e2283
      Vlad Buslov 提交于
      In order to prevent flow counters stats work function from traversing whole
      flow counters tree while searching for deleted flow counters, new list to
      store deleted flow counters is added to struct mlx5_fc_stats. Lockless
      NULL-terminated single linked list data type is used due to following
      reasons:
       - This use case only needs to add single element to list and
       remove/iterate whole list. Lockless list doesn't require any additional
       synchronization for these operations.
       - First cache line of flow counter data structure only has space to store
       single additional pointer, which precludes usage of double linked list.
      
      Remove flow counter 'deleted' flag that is no longer needed.
      Signed-off-by: NVlad Buslov <vladbu@mellanox.com>
      Acked-by: NAmir Vadai <amir@vadai.me>
      Reviewed-by: NPaul Blakey <paulb@mellanox.com>
      Signed-off-by: NSaeed Mahameed <saeedm@mellanox.com>
      6e5e2283
    • V
      net/mlx5: Change flow counters addlist type to single linked list · 83033688
      Vlad Buslov 提交于
      In order to prevent flow counters stats work function from traversing whole
      flow counters tree while searching for deleted flow counters, new list to
      store deleted flow counters will be added to struct mlx5_fc_stats. However,
      the flow counter structure itself has no space left to store any more data
      in first cache line. To free space that is needed to store additional list
      node, convert current addlist double linked list (two pointers per node) to
      atomic single linked list (one pointer per node).
      
      Lockless NULL-terminated single linked list data type doesn't require any
      additional external synchronization for operations used by flow counters
      module (add single new element, remove all elements from list and traverse
      them). Remove addlist_lock that is no longer needed.
      Signed-off-by: NVlad Buslov <vladbu@mellanox.com>
      Acked-by: NAmir Vadai <amir@vadai.me>
      Reviewed-by: NPaul Blakey <paulb@mellanox.com>
      Reviewed-by: NRoi Dayan <roid@mellanox.com>
      Signed-off-by: NSaeed Mahameed <saeedm@mellanox.com>
      83033688
    • T
      net/mlx5: Use u16 for Work Queue buffer strides offset · a0903622
      Tariq Toukan 提交于
      Minimal stride size is 16.
      Hence, the number of strides in a fragment (of PAGE_SIZE)
      is <= PAGE_SIZE / 16 <= 4K.
      
      u16 is sufficient to represent this.
      
      Fixes: d7037ad7 ("net/mlx5: Fix QP fragmented buffer allocation")
      Signed-off-by: NTariq Toukan <tariqt@mellanox.com>
      Reviewed-by: NEran Ben Elisha <eranbe@mellanox.com>
      Signed-off-by: NSaeed Mahameed <saeedm@mellanox.com>
      a0903622
    • T
      net/mlx5: Use u16 for Work Queue buffer fragment size · 8d71e818
      Tariq Toukan 提交于
      Minimal stride size is 16.
      Hence, the number of strides in a fragment (of PAGE_SIZE)
      is <= PAGE_SIZE / 16 <= 4K.
      
      u16 is sufficient to represent this.
      
      Fixes: 388ca8be ("IB/mlx5: Implement fragmented completion queue (CQ)")
      Signed-off-by: NTariq Toukan <tariqt@mellanox.com>
      Reviewed-by: NEran Ben Elisha <eranbe@mellanox.com>
      Signed-off-by: NSaeed Mahameed <saeedm@mellanox.com>
      8d71e818
    • J
      net/mlx5: Fix use-after-free in self-healing flow · 76d5581c
      Jack Morgenstein 提交于
      When the mlx5 health mechanism detects a problem while the driver
      is in the middle of init_one or remove_one, the driver needs to prevent
      the health mechanism from scheduling future work; if future work
      is scheduled, there is a problem with use-after-free: the system WQ
      tries to run the work item (which has been freed) at the scheduled
      future time.
      
      Prevent this by disabling work item scheduling in the health mechanism
      when the driver is in the middle of init_one() or remove_one().
      
      Fixes: e126ba97 ("mlx5: Add driver for Mellanox Connect-IB adapters")
      Signed-off-by: NJack Morgenstein <jackm@dev.mellanox.co.il>
      Reviewed-by: NFeras Daoud <ferasda@mellanox.com>
      Signed-off-by: NSaeed Mahameed <saeedm@mellanox.com>
      76d5581c
  5. 04 9月, 2018 1 次提交
  6. 03 8月, 2018 1 次提交
    • J
      RDMA/netdev: Use priv_destructor for netdev cleanup · 9f49a5b5
      Jason Gunthorpe 提交于
      Now that the unregister_netdev flow for IPoIB no longer relies on external
      code we can now introduce the use of priv_destructor and
      needs_free_netdev.
      
      The rdma_netdev flow is switched to use the netdev common priv_destructor
      instead of the special free_rdma_netdev and the IPOIB ULP adjusted:
       - priv_destructor needs to switch to point to the ULP's destructor
         which will then call the rdma_ndev's in the right order
       - We need to be careful around the error unwind of register_netdev
         as it sometimes calls priv_destructor on failure
       - ULPs need to use ndo_init/uninit to ensure proper ordering
         of failures around register_netdev
      
      Switching to priv_destructor is a necessary pre-requisite to using
      the rtnl new_link mechanism.
      
      The VNIC user for rdma_netdev should also be revised, but that is left for
      another patch.
      Signed-off-by: NJason Gunthorpe <jgg@mellanox.com>
      Signed-off-by: NDenis Drozdov <denisd@mellanox.com>
      Signed-off-by: NLeon Romanovsky <leonro@mellanox.com>
      9f49a5b5
  7. 28 7月, 2018 1 次提交
  8. 24 7月, 2018 1 次提交
  9. 19 7月, 2018 3 次提交
  10. 05 7月, 2018 1 次提交
  11. 26 5月, 2018 1 次提交
    • T
      net/mlx5: Use order-0 allocations for all WQ types · 3a2f7033
      Tariq Toukan 提交于
      Complete the transition of all WQ types to use fragmented
      order-0 coherent memory instead of high-order allocations.
      
      CQ-WQ already uses order-0.
      Here we do the same for cyclic and linked-list WQs.
      
      This allows the driver to load cleanly on systems with a highly
      fragmented coherent memory.
      
      Performance tests:
      ConnectX-5 100Gbps, CPU: Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50GHz
      Packet rate of 64B packets, single transmit ring, size 8K.
      
      No degradation is sensed.
      Signed-off-by: NTariq Toukan <tariqt@mellanox.com>
      Signed-off-by: NSaeed Mahameed <saeedm@mellanox.com>
      3a2f7033
  12. 25 5月, 2018 1 次提交
  13. 17 5月, 2018 1 次提交
  14. 27 4月, 2018 1 次提交
    • I
      net/mlx5: Fix mlx5_get_vector_affinity function · 6082d9c9
      Israel Rukshin 提交于
      Adding the vector offset when calling to mlx5_vector2eqn() is wrong.
      This is because mlx5_vector2eqn() checks if EQ index is equal to vector number
      and the fact that the internal completion vectors that mlx5 allocates
      don't get an EQ index.
      
      The second problem here is that using effective_affinity_mask gives the same
      CPU for different vectors.
      This leads to unmapped queues when calling it from blk_mq_rdma_map_queues().
      This doesn't happen when using affinity_hint mask.
      
      Fixes: 2572cf57 ("mlx5: fix mlx5_get_vector_affinity to start from completion vector 0")
      Fixes: 05e0cc84 ("net/mlx5: Fix get vector affinity helper function")
      Signed-off-by: NIsrael Rukshin <israelr@mellanox.com>
      Reviewed-by: NMax Gurtovoy <maxg@mellanox.com>
      Reviewed-by: NSagi Grimberg <sagi@grimberg.me>
      6082d9c9
  15. 20 3月, 2018 1 次提交
    • B
      net/mlx5: Packet pacing enhancement · 05d3ac97
      Bodong Wang 提交于
      Add two new parameters: max_burst_sz and typical_pkt_size (both
      in bytes) to rate limit configurations.
      
      max_burst_sz: The device will schedule bursts of packets for an
      SQ connected to this rate, smaller than or equal to this value.
      Value 0x0 indicates packet bursts will be limited to the device
      defaults. This field should be used if bursts of packets must be
      strictly kept under a certain value.
      
      typical_pkt_size: When the rate limit is intended for a stream of
      similar packets, stating the typical packet size can improve the
      accuracy of the rate limiter. The expected packet size will be
      the same for all SQs associated with the same rate limit index.
      
      Ethernet driver is updated according to this change, but these two
      parameters will be kept as 0 due to lacking of proper way to get the
      configurations from user space which requires to change
      ndo_set_tx_maxrate interface.
      Signed-off-by: NBodong Wang <bodong@mellanox.com>
      Reviewed-by: NDaniel Jurgens <danielj@mellanox.com>
      Reviewed-by: NYishai Hadas <yishaih@mellanox.com>
      Signed-off-by: NLeon Romanovsky <leonro@mellanox.com>
      Signed-off-by: NJason Gunthorpe <jgg@mellanox.com>
      05d3ac97
  16. 14 3月, 2018 1 次提交
  17. 24 2月, 2018 1 次提交
  18. 15 2月, 2018 4 次提交
    • Y
      IB/mlx5: Implement fragmented completion queue (CQ) · 388ca8be
      Yonatan Cohen 提交于
      The current implementation of create CQ requires contiguous
      memory, such requirement is problematic once the memory is
      fragmented or the system is low in memory, it causes for
      failures in dma_zalloc_coherent().
      
      This patch implements new scheme of fragmented CQ to overcome
      this issue by introducing new type: 'struct mlx5_frag_buf_ctrl'
      to allocate fragmented buffers, rather than contiguous ones.
      
      Base the Completion Queues (CQs) on this new fragmented buffer.
      
      It fixes following crashes:
      kworker/29:0: page allocation failure: order:6, mode:0x80d0
      CPU: 29 PID: 8374 Comm: kworker/29:0 Tainted: G OE 3.10.0
      Workqueue: ib_cm cm_work_handler [ib_cm]
      Call Trace:
      [<>] dump_stack+0x19/0x1b
      [<>] warn_alloc_failed+0x110/0x180
      [<>] __alloc_pages_slowpath+0x6b7/0x725
      [<>] __alloc_pages_nodemask+0x405/0x420
      [<>] dma_generic_alloc_coherent+0x8f/0x140
      [<>] x86_swiotlb_alloc_coherent+0x21/0x50
      [<>] mlx5_dma_zalloc_coherent_node+0xad/0x110 [mlx5_core]
      [<>] ? mlx5_db_alloc_node+0x69/0x1b0 [mlx5_core]
      [<>] mlx5_buf_alloc_node+0x3e/0xa0 [mlx5_core]
      [<>] mlx5_buf_alloc+0x14/0x20 [mlx5_core]
      [<>] create_cq_kernel+0x90/0x1f0 [mlx5_ib]
      [<>] mlx5_ib_create_cq+0x3b0/0x4e0 [mlx5_ib]
      Signed-off-by: NYonatan Cohen <yonatanc@mellanox.com>
      Reviewed-by: NTariq Toukan <tariqt@mellanox.com>
      Signed-off-by: NLeon Romanovsky <leon@kernel.org>
      Signed-off-by: NSaeed Mahameed <saeedm@mellanox.com>
      388ca8be
    • S
      net/mlx5: Remove redundant EQ API exports · 3ec5693b
      Saeed Mahameed 提交于
      EQ structure and API is private to mlx5_core driver only, external
      drivers should not have access or the means to manipulate EQ objects.
      
      Remove redundant exports and move API functions out of the linux/mlx5
      include directory into the driver's mlx5_core.h private include file.
      Signed-off-by: NSaeed Mahameed <saeedm@mellanox.com>
      Reviewed-by: NGal Pressman <galp@mellanox.com>
      3ec5693b
    • S
      net/mlx5: Move CQ completion and event forwarding logic to eq.c · 3ac7afdb
      Saeed Mahameed 提交于
      Since CQ tree is now per EQ, CQ completion and event forwarding became
      specific implementation of EQ logic, this patch moves that logic to eq.c
      and makes those functions static.
      Signed-off-by: NSaeed Mahameed <saeedm@mellanox.com>
      Reviewed-by: NGal Pressman <galp@mellanox.com>
      3ac7afdb
    • S
      net/mlx5: CQ Database per EQ · 02d92f79
      Saeed Mahameed 提交于
      Before this patch the driver had one CQ database protected via one
      spinlock, this spinlock is meant to synchronize between CQ
      adding/removing and CQ IRQ interrupt handling.
      
      On a system with large number of CPUs and on a work load that requires
      lots of interrupts, this global spinlock becomes a very nasty hotspot
      and introduces a contention between the active cores, which will
      significantly hurt performance and becomes a bottleneck that prevents
      seamless cpu scaling.
      
      To solve this we simply move the CQ database and its spinlock to be per
      EQ (IRQ), thus per core.
      
      Tested with:
      system: 2 sockets, 14 cores per socket, hyperthreading, 2x14x2=56 cores
      netperf command: ./super_netperf 200 -P 0 -t TCP_RR  -H <server> -l 30 -- -r 300,300 -o -s 1M,1M -S 1M,1M
      
      WITHOUT THIS PATCH:
      Average:     CPU    %usr   %nice    %sys %iowait    %irq   %soft %steal  %guest  %gnice   %idle
      Average:     all    4.32    0.00   36.15    0.09    0.00   34.02   0.00    0.00    0.00   25.41
      
      Samples: 2M of event 'cycles:pp', Event count (approx.): 1554616897271
      Overhead  Command          Shared Object                 Symbol
      +   14.28%  swapper          [kernel.vmlinux]              [k] intel_idle
      +   12.25%  swapper          [kernel.vmlinux]              [k] queued_spin_lock_slowpath
      +   10.29%  netserver        [kernel.vmlinux]              [k] queued_spin_lock_slowpath
      +    1.32%  netserver        [kernel.vmlinux]              [k] mlx5e_xmit
      
      WITH THIS PATCH:
      Average:     CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest  %gnice   %idle
      Average:     all    4.27    0.00   34.31    0.01    0.00   18.71    0.00    0.00    0.00   42.69
      
      Samples: 2M of event 'cycles:pp', Event count (approx.): 1498132937483
      Overhead  Command          Shared Object             Symbol
      +   23.33%  swapper          [kernel.vmlinux]          [k] intel_idle
      +    1.69%  netserver        [kernel.vmlinux]          [k] mlx5e_xmit
      Tested-by: NSong Liu <songliubraving@fb.com>
      Signed-off-by: NSaeed Mahameed <saeedm@mellanox.com>
      Reviewed-by: NGal Pressman <galp@mellanox.com>
      02d92f79
  19. 05 2月, 2018 1 次提交
  20. 19 1月, 2018 1 次提交
  21. 12 1月, 2018 1 次提交
    • S
      net/mlx5: Fix get vector affinity helper function · 05e0cc84
      Saeed Mahameed 提交于
      mlx5_get_vector_affinity used to call pci_irq_get_affinity and after
      reverting the patch that sets the device affinity via PCI_IRQ_AFFINITY
      API, calling pci_irq_get_affinity becomes useless and it breaks RDMA
      mlx5 users.  To fix this, this patch provides an alternative way to
      retrieve IRQ vector affinity using legacy IRQ API, following
      smp_affinity read procfs implementation.
      
      Fixes: 231243c8 ("Revert mlx5: move affinity hints assignments to generic code")
      Fixes: a435393a ("mlx5: move affinity hints assignments to generic code")
      Cc: Sagi Grimberg <sagi@grimberg.me>
      Signed-off-by: NSaeed Mahameed <saeedm@mellanox.com>
      05e0cc84
  22. 09 1月, 2018 5 次提交
  23. 29 12月, 2017 1 次提交
    • Y
      IB/mlx5: Extend UAR stuff to support dynamic allocation · 31a78a5a
      Yishai Hadas 提交于
      This patch extends the alloc context flow to be prepared for working
      with dynamic UAR allocations.
      
      Currently upon alloc context there is some fix size of UARs that are
      allocated (named 'static allocation') and there is no option to user
      application to ask for more or control which UAR will be used by which
      QP.
      
      In this patch the driver prepares its data structures to manage both the
      static and the dynamic allocations and let the user driver knows about
      the max value of dynamic blue-flame registers that are allowed.
      
      Downstream patches from this series will enable the dynamic allocation
      and the association as part of QP creation.
      Signed-off-by: NYishai Hadas <yishaih@mellanox.com>
      Signed-off-by: NLeon Romanovsky <leon@kernel.org>
      Signed-off-by: NJason Gunthorpe <jgg@mellanox.com>
      31a78a5a
  24. 22 12月, 2017 1 次提交