1. 18 May 2016 (1 commit)
    • net/mlx5_core: Use tasklet for user-space CQ completion events · 94c6825e
      Committed by Matan Barak
      Previously, we fired all our completion callbacks straight from
      our ISR.
      
      Some of those callbacks were lightweight (for example, the mlx5
      Ethernet napi callbacks), but some of them did more work (for
      example, the user-space RDMA stack uverbs' completion handler).
      Beyond that, doing more than the minimal work in an ISR is generally
      considered wrong, and it can even lead to a hard lockup: when the
      hardware generates many completion events, the loop over those
      events can run so long that the system watchdog declares a hard
      lockup.
      
      To avoid that, add a new way of invoking completion-event callbacks,
      as sketched after this entry: in the interrupt itself, add each CQ
      that received a completion event to a per-EQ list and schedule a
      tasklet; in tasklet context, loop over all the CQs in the list and
      invoke the user callbacks.
      Signed-off-by: Matan Barak <matanb@mellanox.com>
      Signed-off-by: Doug Ledford <dledford@redhat.com>
      94c6825e
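      A minimal sketch of this deferred-completion pattern, using
      hypothetical struct and field names (the real mlx5 code also
      reference-counts the CQs):

      #include <linux/interrupt.h>
      #include <linux/list.h>
      #include <linux/spinlock.h>

      struct sketch_cq {
              struct list_head list_entry;
              void (*comp)(struct sketch_cq *cq); /* user completion callback */
      };

      struct sketch_eq {
              spinlock_t lock;            /* protects cq_list */
              struct list_head cq_list;   /* CQs with pending completions */
              struct tasklet_struct task;
      };

      /* Tasklet (softirq) context: drain the list, invoke the callbacks. */
      static void sketch_eq_tasklet(unsigned long data)
      {
              struct sketch_eq *eq = (struct sketch_eq *)data;
              struct sketch_cq *cq, *tmp;
              LIST_HEAD(done);

              spin_lock_irq(&eq->lock);
              list_splice_init(&eq->cq_list, &done);
              spin_unlock_irq(&eq->lock);

              list_for_each_entry_safe(cq, tmp, &done, list_entry) {
                      list_del_init(&cq->list_entry);
                      cq->comp(cq);
              }
      }

      static void sketch_eq_init(struct sketch_eq *eq)
      {
              spin_lock_init(&eq->lock);
              INIT_LIST_HEAD(&eq->cq_list);
              tasklet_init(&eq->task, sketch_eq_tasklet, (unsigned long)eq);
      }

      /* Hard-IRQ context: only queue the CQ and defer the real work. */
      static void sketch_eq_cq_event(struct sketch_eq *eq, struct sketch_cq *cq)
      {
              spin_lock(&eq->lock);
              if (list_empty(&cq->list_entry))
                      list_add_tail(&cq->list_entry, &eq->cq_list);
              spin_unlock(&eq->lock);

              tasklet_schedule(&eq->task);
      }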
  2. 17 May 2016 (1 commit)
    • net/mlx5_core: Flow counters infrastructure · 43a335e0
      Committed by Amir Vadai
      If a counter has the aging flag set when created, it is added to a
      list of counters that will be queried periodically from a workqueue.
      The query result and the last-use timestamp are cached.
      Counter add/delete must be very efficient, since thousands of such
      operations may be issued per second.
      Counters without aging have only a single reference, so no locks are
      needed. Counters with aging enabled, however, are stored in a list;
      to keep the code as lockless as possible, all list manipulation and
      hardware access is done from a single context - the periodic
      counter-query work (sketched after this entry).
      
      The hardware supports multiple counters per FTE; however, we
      currently use one counter for each FTE.
      Signed-off-by: Amir Vadai <amirva@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      43a335e0
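      A rough sketch of the aging scheme, with illustrative names;
      query_hw_packets() is a stub standing in for the real FW query
      command:

      #include <linux/jiffies.h>
      #include <linux/list.h>
      #include <linux/types.h>
      #include <linux/workqueue.h>

      struct sketch_fc {
              struct list_head list;
              u64 packets;            /* cached query result */
              u64 bytes;
              unsigned long lastuse;  /* jiffies of last observed traffic */
      };

      struct sketch_fc_stats {
              struct list_head counters;  /* counters with aging enabled */
              struct delayed_work work;
      };

      /* Hypothetical stand-in for the real FW counter-query command. */
      static u64 query_hw_packets(struct sketch_fc *counter)
      {
              return 0; /* a real implementation issues a FW query here */
      }

      /* Single context touching the list and the hardware: no locks. */
      static void sketch_fc_stats_work(struct work_struct *work)
      {
              struct sketch_fc_stats *stats =
                      container_of(work, struct sketch_fc_stats, work.work);
              struct sketch_fc *counter;

              list_for_each_entry(counter, &stats->counters, list) {
                      u64 packets = query_hw_packets(counter);

                      if (packets != counter->packets) {
                              counter->packets = packets;
                              counter->lastuse = jiffies; /* saw traffic */
                      }
              }

              /* re-arm with a ~1 second period */
              queue_delayed_work(system_wq, &stats->work, HZ);
      }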
  3. 05 May 2016 (1 commit)
    • net/mlx5: Flow steering, Add vport ACL support · efdc810b
      Committed by Mohamad Haj Yahia
      Update the relevant flow steering device structs and commands to
      support vports.
      Update the flow steering core API to receive a vport number.
      Add ingress and egress ACL flow table namespaces.
      Add ACL flow table support (an illustrative sketch follows this
      entry):
      * An ACL (Access Control List) flow table is a table that contains
      only allow/drop steering rules.
      
      * There are two types of ACL flow tables - ingress and egress.
      
      * ACLs handle traffic sent from/to the E-Switch FDB table; ingress
      refers to traffic sent from a vport to the E-Switch, and egress
      refers to traffic sent from the E-Switch to a vport.
      
      * Ingress ACL flow table allow/drop rules are checked against
      traffic sent from the VF.
      
      * Egress ACL flow table allow/drop rules are checked against traffic
      sent to the VF.
      Signed-off-by: Mohamad Haj Yahia <mohamad@mellanox.com>
      Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      efdc810b
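      A purely illustrative sketch of the ACL rule semantics; none of
      these names are the real flow steering API:

      #include <linux/if_ether.h>
      #include <linux/types.h>

      enum sketch_acl_dir { SKETCH_ACL_INGRESS, SKETCH_ACL_EGRESS };
      enum sketch_acl_action { SKETCH_ACL_ALLOW, SKETCH_ACL_DROP };

      struct sketch_acl_rule {
              enum sketch_acl_action action; /* ACL tables: allow/drop only */
              u8 smac[ETH_ALEN];             /* e.g. match on source MAC */
      };

      /* The core API now takes a vport number, so each vport gets its own
       * ingress table (checked against traffic from the VF) and egress
       * table (checked against traffic to the VF). */
      static int sketch_vport_acl_add_rule(int vport, enum sketch_acl_dir dir,
                                           const struct sketch_acl_rule *rule)
      {
              /* a real driver resolves the per-vport ACL namespace here and
               * programs the rule into the corresponding flow table */
              return 0;
      }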
  4. 30 Apr 2016 (1 commit)
  5. 27 Apr 2016 (4 commits)
  6. 25 Apr 2016 (1 commit)
  7. 22 Mar 2016 (2 commits)
  8. 02 Mar 2016 (3 commits)
  9. 01 Mar 2016 (1 commit)
  10. 25 Feb 2016 (2 commits)
  11. 22 Jan 2016 (1 commit)
  12. 18 Jan 2016 (1 commit)
    • net/mlx5_core: Fix trimming down IRQ number · 0b6e26ce
      Committed by Doron Tsur
      With several ConnectX-4 cards installed in a server, one may receive
      an irqn > 255 from the kernel API, which we mistakenly trim to 8 bits.
      
      This causes EQ creation failure with the following stack trace:
      [<ffffffff812a11f4>] dump_stack+0x48/0x64
      [<ffffffff810ace21>] __setup_irq+0x3a1/0x4f0
      [<ffffffff810ad7e0>] request_threaded_irq+0x120/0x180
      [<ffffffffa0923660>] ? mlx5_eq_int+0x450/0x450 [mlx5_core]
      [<ffffffffa0922f64>] mlx5_create_map_eq+0x1e4/0x2b0 [mlx5_core]
      [<ffffffffa091de01>] alloc_comp_eqs+0xb1/0x180 [mlx5_core]
      [<ffffffffa091ea99>] mlx5_dev_init+0x5e9/0x6e0 [mlx5_core]
      [<ffffffffa091ec29>] init_one+0x99/0x1c0 [mlx5_core]
      [<ffffffff812e2afc>] local_pci_probe+0x4c/0xa0
      
      Fix this by changing the irqn type from u8 to unsigned int, to
      support values > 255 (see the illustration after this entry).
      
      Fixes: 61d0e73e ('net/mlx5_core: Use the the real irqn in eq->irqn')
      Reported-by: Jiri Pirko <jiri@mellanox.com>
      Signed-off-by: Doron Tsur <doront@mellanox.com>
      Signed-off-by: Matan Barak <matanb@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      0b6e26ce
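      The bug in miniature (values are illustrative):

      #include <linux/types.h>

      static void sketch_irqn_trim(void)
      {
              unsigned int kernel_irqn = 260;  /* possible with several cards */
              u8 trimmed = kernel_irqn;        /* silently becomes 260 % 256 == 4 */
              unsigned int kept = kernel_irqn; /* u8 -> unsigned int keeps 260 */

              (void)trimmed;
              (void)kept;
      }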
  13. 24 Dec 2015 (2 commits)
  14. 12 Dec 2015 (1 commit)
  15. 04 Dec 2015 (2 commits)
    • net/mlx5: Introducing E-Switch and l2 table · 073bb189
      Committed by Saeed Mahameed
      The E-Switch is the software entity that represents and manages the
      ConnectX-4 inter-HCA Ethernet L2 switching.
      
      The E-Switch has its own virtual ports; each vport/vNIC/VF can be
      connected to the device through a vport of an e-switch.
      
      Each e-switch is managed by one vNIC identified by
      HCA_CAP.vport_group_manager (usually the PF, vport[0]), whose main
      responsibility is to forward each packet to the right vport.
      
      The e-switch needs to manage its own L2 table and FDB tables.
      
      The L2 table is a flow table managed by FW. It is needed in
      multi-host (multi-PF) configurations for inter-HCA switching
      between PFs.
      
      The FDB table is a flow table that is entirely managed by the
      e-switch driver; its main responsibility is to switch packets
      between the e-switch's internal vports and the uplink vport that
      belong to the same e-switch.
      
      This patch introduces only e-switch L2 table management (sketched
      after this entry); FDB management will come later, when Ethernet
      SRIOV/VFs are enabled.
      
      This is preparation for Ethernet SRIOV and L2 table management.
      Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
      Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      073bb189
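      A rough sketch of FW-managed L2-table bookkeeping: the driver only
      tracks free indices and issues a FW command per MAC. All names here
      are hypothetical:

      #include <linux/bitmap.h>
      #include <linux/errno.h>
      #include <linux/types.h>

      #define SKETCH_L2_TABLE_SIZE 256

      struct sketch_l2_table {
              unsigned long bitmap[BITS_TO_LONGS(SKETCH_L2_TABLE_SIZE)];
      };

      static int sketch_l2_table_set_entry(struct sketch_l2_table *l2,
                                           const u8 mac[6], u32 *index)
      {
              u32 ix = find_first_zero_bit(l2->bitmap, SKETCH_L2_TABLE_SIZE);

              if (ix >= SKETCH_L2_TABLE_SIZE)
                      return -ENOSPC;
              set_bit(ix, l2->bitmap);
              *index = ix;

              /* a real driver now issues the FW command that binds this
               * index to the MAC, steering matching packets to the vport */
              return 0;
      }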
    • net/mlx5_core: Add base sriov support · fc50db98
      Committed by Eli Cohen
      This patch adds base SRIOV support for mlx5 devices. The same driver
      is used for both PFs and VFs; VFs are identified by the driver
      through the flag MLX5_PCI_DEV_IS_VF added to the pci table entries.
      Virtual functions are created as usual by writing a value to the
      sriov_numvfs sysfs file of the PF device (see the sketch after this
      entry). Upon instantiating VFs, they will all be probed by the
      driver on the hypervisor. One can gracefully unbind them through
      /sys/bus/pci/drivers/mlx5_core/unbind.
      
      mlx5_wait_for_vf_pages() was added to ensure that when a VF dies
      without executing proper teardown, the hypervisor driver waits until
      all of the pages that the hypervisor allocated to maintain the VF's
      operation are returned.
      
      In order for a VF to be operational, the PF needs to call enable_hca
      for it. This can be done before the VFs are created, through a call
      to pci_enable_sriov.
      
      If there are VFs assigned to VMs when the PF driver is unloaded, all
      the VFs will experience a system error, and the PF driver unloads
      cleanly; in this case pci_disable_sriov is not called, and the
      devices still show up when running lspci. Once the PF driver is
      reloaded, it will sync the data structures which maintain state on
      its VFs.
      Signed-off-by: Eli Cohen <eli@mellanox.com>
      Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      fc50db98
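      A sketch of the standard PCI plumbing behind the sriov_numvfs flow,
      with the mlx5-specific enable_hca step reduced to a comment (the
      callback name is illustrative):

      #include <linux/pci.h>

      /* "echo N > .../sriov_numvfs" lands here via the driver's
       * .sriov_configure callback. */
      static int sketch_sriov_configure(struct pci_dev *pdev, int num_vfs)
      {
              int err;

              if (!num_vfs) {
                      pci_disable_sriov(pdev);
                      return 0;
              }

              /* a real driver calls enable_hca for each VF around this
               * point, so the VFs come up operational */
              err = pci_enable_sriov(pdev, num_vfs);
              return err ? err : num_vfs;
      }

      /* Hooked up in the pci_driver, with the VF entries flagged in the
       * id table: .sriov_configure = sketch_sriov_configure, */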
  16. 15 Oct 2015 (3 commits)
  17. 09 Oct 2015 (2 commits)
  18. 25 Sep 2015 (1 commit)
    • IB/mlx5: Remove support for IB_DEVICE_LOCAL_DMA_LKEY · c6790aa9
      Committed by Sagi Grimberg
      Commit 96249d70 ("IB/core: Guarantee that a local_dma_lkey
      is available") allows ULPs that make use of the local DMA lkey to
      keep working as before by allocating a DMA MR with local
      permissions, and converted these consumers to use the MR associated
      with the PD rather than device->local_dma_lkey (see the sketch after
      this entry).
      
      Connect-IB has some known issues with memory registration using the
      local_dma_lkey (SEND, RDMA and RECV seem to work OK).
      
      Thus don't expose support for it (remove the device->local_dma_lkey
      setting), and take advantage of the above commit so that no
      regression is introduced to working systems.
      
      The local_dma_lkey support will be restored in CX4 pending a FW
      capability query.
      Signed-off-by: Sagi Grimberg <sagig@mellanox.com>
      Signed-off-by: Doug Ledford <dledford@redhat.com>
      c6790aa9
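      For context, a sketch of the ULP-side pattern the above commit
      enables: SGEs take the PD-level lkey, so it no longer matters
      whether device->local_dma_lkey is exposed (the function name is
      illustrative; the ib_sge/ib_pd fields are the standard verbs ones):

      #include <rdma/ib_verbs.h>

      /* Whether pd->local_dma_lkey is backed by the device capability or
       * by an allocated DMA MR is hidden behind the PD. */
      static void sketch_fill_sge(struct ib_sge *sge, struct ib_pd *pd,
                                  u64 dma_addr, u32 len)
      {
              sge->addr   = dma_addr;
              sge->length = len;
              sge->lkey   = pd->local_dma_lkey;
      }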
  19. 29 Aug 2015 (1 commit)
  20. 18 Aug 2015 (2 commits)
  21. 07 Aug 2015 (1 commit)
  22. 27 Jul 2015 (2 commits)
    • net/mlx5e: TX latency optimization to save DMA reads · 88a85f99
      Committed by Achiad Shochat
      A regular TX WQE execution involves two or more DMA reads -
      one to fetch the WQE, and another one per WQE gather entry.
      
      These DMA reads obviously increase the TX latency.
      There are two mlx5 mechanisms to bypass these DMA reads:
      1) Inline WQE
      2) Blue Flame (BF)
      
      An inline WQE contains a whole packet, thus saving the DMA reads
      for the regular WQE's gather entries. Inline WQE support was already
      added in the previous commit.
      
      A BF WQE is written directly to the device's I/O-mapped memory, thus
      saving the DMA read that fetches the WQE.
      
      The BF WQE I/O write must be done at cache-line granularity, and
      therefore uses the CPU write-combining mechanism (see the sketch
      after this entry).
      A BF WQE I/O write also acts as a TX doorbell, notifying the
      device of new TX WQEs.
      A BF WQE is written to the same I/O-mapped address as the regular TX
      doorbell, so this address is mapped twice - once by ioremap()
      and once by io_mapping_map_wc().
      
      While both mechanisms reduce the TX latency, they both consume more
      CPU cycles than a regular WQE:
      - A BF WQE must still be written to host memory, in addition to
        being written directly to the device's I/O-mapped memory.
      - An inline WQE involves copying the SKB data into it.
      
      To handle this tradeoff, we introduce here a heuristic algorithm
      that strives to avoid using these two mechanisms when the TX queue
      is being back-pressured by the device, and limits their usage rate
      otherwise.
      
      An inline WQE will always be "Blue Flamed" (written directly to the
      device's I/O-mapped memory), while a BF WQE may not be inlined (it
      may contain gather entries).
      
      Preliminary testing using netperf UDP_RR shows that the latency goes
      down from 17.5us to 16.9us, while the message rate (tested with
      pktgen) stays the same.
      Signed-off-by: Achiad Shochat <achiad@mellanox.com>
      Signed-off-by: Amir Vadai <amirv@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      88a85f99
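      A sketch of the Blue Flame push, under these assumptions: the
      struct and function names are made up, wqe_bytes is a multiple of a
      cache line, and ioremap_wc() stands in for the driver's
      io_mapping_map_wc() double mapping:

      #include <linux/io.h>

      struct sketch_uar {
              void __iomem *map;    /* ioremap(): regular (UC) TX doorbell */
              void __iomem *bf_map; /* write-combining map of the same area */
      };

      /* Push a whole (inline) WQE through the WC mapping in 64-bit
       * chunks; this write doubles as the TX doorbell. */
      static void sketch_tx_bf_post(struct sketch_uar *uar, const void *wqe,
                                    unsigned int wqe_bytes)
      {
              __iowrite64_copy(uar->bf_map, wqe, wqe_bytes / 8);
              wmb(); /* order the WC flush against later doorbell writes */
      }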
    • net/mlx5e: Allocate DMA coherent memory on reader NUMA node · 311c7c71
      Committed by Saeed Mahameed
      Through IRQ affinity hints and XPS, each mlx5e channel is assigned a
      CPU core.
      
      Channel DMA-coherent memory that is written by the NIC and read by
      SW (e.g. the CQ buffer) is allocated on the NUMA node of the CPU
      core assigned to the channel.
      
      Channel DMA-coherent memory that is written by SW and read by the
      NIC (e.g. the SQ/RQ buffers) is allocated on the NUMA node of the
      NIC.
      
      The doorbell record (written by SW and read by the NIC) is an
      exception, since it is accessed more frequently by SW. A sketch of
      the per-node allocation helper follows this entry.
      Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
      Signed-off-by: Amir Vadai <amirv@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      311c7c71
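      A minimal sketch of a per-node coherent allocator, assuming the
      common set_dev_node() override trick (the helper name is made up;
      error handling and locking are trimmed):

      #include <linux/device.h>
      #include <linux/dma-mapping.h>

      static void *sketch_zalloc_coherent_node(struct device *dev, size_t size,
                                               dma_addr_t *dma_handle, int node)
      {
              int orig_node = dev_to_node(dev);
              void *cpu_addr;

              set_dev_node(dev, node); /* e.g. the channel CPU's node */
              cpu_addr = dma_zalloc_coherent(dev, size, dma_handle, GFP_KERNEL);
              set_dev_node(dev, orig_node); /* restore the NIC's node */

              return cpu_addr;
      }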
  23. 12 Jun 2015 (1 commit)
  24. 05 Jun 2015 (3 commits)