1. 26 March 2019, 1 commit
  2. 18 March 2019, 1 commit
  3. 27 February 2019, 1 commit
  4. 23 February 2019, 2 commits
    • RDMA: Handle ucontext allocations by IB/core · a2a074ef
      Leon Romanovsky authored
      Following the PD conversion patch, do the same for ucontext allocations.
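
      As a minimal sketch of the resulting driver hook, assuming the same
      pattern as the PD conversion (the driver name "foo" and its ucontext
      struct are hypothetical, not part of this patch): IB/core allocates the
      ucontext from a size the driver declares, and the driver only
      initializes it.

      struct foo_ucontext {
              struct ib_ucontext ibucontext;
              /* driver-private state */
      };

      /* The core allocates the memory; the driver initializes it and
       * returns 0 on success or a negative errno. */
      static int foo_alloc_ucontext(struct ib_ucontext *uctx,
                                    struct ib_udata *udata)
      {
              struct foo_ucontext *ctx =
                      container_of(uctx, struct foo_ucontext, ibucontext);

              /* driver-specific initialization of ctx goes here */
              return 0;
      }

      /* The per-driver object size is declared next to the device ops,
       * e.g. INIT_RDMA_OBJ_SIZE(ib_ucontext, foo_ucontext, ibucontext). */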
      Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
      Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
      a2a074ef
    • IB/mlx4: Increase the timeout for CM cache · 2612d723
      Håkon Bugge authored
      When using CX-3 virtual functions, either on a bare-metal machine or
      passed through to a VM, MAD packets are proxied through the PF driver.
      
      Since the VF drivers have separate namespaces for MAD Transaction IDs
      (TIDs), the PF driver has to re-map the TIDs and keep the bookkeeping
      in a cache.
      
      Following the RDMA Connection Manager (CM) protocol, it is clear when
      an entry has to be evicted from the cache. But life is not perfect:
      remote peers may die or be rebooted. Hence, a timeout is used to wipe
      out a cache entry once the PF driver assumes the remote peer has gone.
      
      During workloads where a high number of QPs are destroyed concurrently,
      an excessive number of CM DREQ retries has been observed.
      
      The problem can be demonstrated in a bare-metal environment where two
      nodes have instantiated 8 VFs each. The HCAs are dual-ported, so there
      are 16 vPorts per physical server.
      
      64 processes are associated with each vPort, and each creates and
      destroys one QP for each of the 64 remote processes. That is, 1024 QPs
      per vPort, 16K QPs in all. The QPs are created and destroyed using the
      CM.
      
      When tearing down these 16K QPs, excessive CM DREQ retries (and
      duplicates) are observed. With some cat/paste/awk wizardry on the
      infiniband_cm sysfs, we observe the following, summed over the 16
      vPorts on one of the nodes:
      
      cm_rx_duplicates:
            dreq  2102
      cm_rx_msgs:
            drep  1989
            dreq  6195
             rep  3968
             req  4224
             rtu  4224
      cm_tx_msgs:
            drep  4093
            dreq 27568
             rep  4224
             req  3968
             rtu  3968
      cm_tx_retries:
            dreq 23469
      
      Note that the active/passive side is equally distributed between the
      two nodes.
      
      Enabling pr_debug in cm.c gives tons of:
      
      [171778.814239] <mlx4_ib> mlx4_ib_multiplex_cm_handler: id{slave:
      1,sl_cm_id: 0xd393089f} is NULL!
      
      By increasing the CM_CLEANUP_CACHE_TIMEOUT from 5 to 30 seconds, the
      tear-down phase of the application is reduced from approximately 90 to
      50 seconds. Retries/duplicates are also significantly reduced:
      
      cm_rx_duplicates:
            dreq  2460
      []
      cm_tx_retries:
            dreq  3010
             req    47
      
      Increasing the timeout further didn't help, as these duplicates and
      retries stem from a too-short CMA timeout, which was 20 (~4 seconds)
      on the systems. By increasing the CMA timeout to 22 (~17 seconds), the
      numbers fell to about 10 for both of them.
      
      Adjustment of the CMA timeout is not part of this commit.
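
      As a minimal sketch of the change itself (the macro name is taken from
      this message and lives in drivers/infiniband/hw/mlx4/cm.c; the HZ-based
      definition is an assumption):

      /* Lifetime of a proxied-TID cache entry before the PF driver assumes
       * the remote peer is gone; raised from 5 to 30 seconds. */
      #define CM_CLEANUP_CACHE_TIMEOUT  (30 * HZ)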
      Signed-off-by: Håkon Bugge <haakon.bugge@oracle.com>
      Acked-by: Jack Morgenstein <jackm@dev.mellanox.co.il>
      Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
      2612d723
  5. 16 February 2019, 1 commit
  6. 09 February 2019, 1 commit
  7. 31 January 2019, 1 commit
  8. 22 January 2019, 1 commit
    • IB/mlx4: Fix using wrong function to destroy sqp AHs under SRIOV · f45f8edb
      Jack Morgenstein authored
      The commit cited below replaced rdma_create_ah with
      mlx4_ib_create_slave_ah when creating AHs for the paravirtualized special
      QPs.
      
      However, this change also required replacing rdma_destroy_ah with
      mlx4_ib_destroy_ah in the affected flows.
      
      The commit missed 3 places where rdma_destroy_ah should have been replaced
      with mlx4_ib_destroy_ah.
      
      As a result, the pd usecount was decremented when the ah was destroyed --
      although the usecount was NOT incremented when the ah was created.
      
      This caused the pd usecount to become negative, and resulted in the
      WARN_ON stack trace below when the mlx4_ib.ko module was unloaded:
      
      WARNING: CPU: 3 PID: 25303 at drivers/infiniband/core/verbs.c:329 ib_dealloc_pd+0x6d/0x80 [ib_core]
      Modules linked in: rdma_ucm rdma_cm iw_cm ib_cm ib_umad mlx4_ib(-) ib_uverbs ib_core mlx4_en mlx4_core nfsv3 nfs fscache configfs xt_conntrack nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 libcrc32c ipt_REJECT nf_reject_ipv4 tun ebtable_filter ebtables ip6table_filter ip6_tables iptable_filter bridge stp llc dm_mirror dm_region_hash dm_log dm_mod dax rndis_wlan rndis_host coretemp kvm_intel cdc_ether kvm usbnet iTCO_wdt iTCO_vendor_support cfg80211 irqbypass lpc_ich ipmi_si i2c_i801 mii pcspkr i2c_core mfd_core ipmi_devintf i7core_edac ipmi_msghandler ioatdma pcc_cpufreq dca acpi_cpufreq nfsd auth_rpcgss nfs_acl lockd grace sunrpc ip_tables ext4 mbcache jbd2 sr_mod cdrom ata_generic pata_acpi mptsas scsi_transport_sas mptscsih crc32c_intel ata_piix bnx2 mptbase ipv6 crc_ccitt autofs4 [last unloaded: mlx4_core]
      CPU: 3 PID: 25303 Comm: modprobe Tainted: G        W I       5.0.0-rc1-net-mlx4+ #1
      Hardware name: IBM  -[7148ZV6]-/Node 1, System Card, BIOS -[MLE170CUS-1.70]- 09/23/2011
      RIP: 0010:ib_dealloc_pd+0x6d/0x80 [ib_core]
      Code: 00 00 85 c0 75 02 5b c3 80 3d aa 87 03 00 00 75 f5 48 c7 c7 88 d7 8f a0 31 c0 c6 05 98 87 03 00 01 e8 07 4c 79 e0 0f 0b 5b c3 <0f> 0b eb be 0f 0b eb ab 90 66 2e 0f 1f 84 00 00 00 00 00 66 66 66
      RSP: 0018:ffffc90005347e30 EFLAGS: 00010282
      RAX: 00000000ffffffea RBX: ffff8888589e9540 RCX: 0000000000000006
      RDX: 0000000000000006 RSI: ffff88885d57ad40 RDI: 0000000000000000
      RBP: ffff88885b029c00 R08: 0000000000000000 R09: 0000000000000000
      R10: 0000000000000001 R11: 0000000000000004 R12: ffff8887f06c0000
      R13: ffff8887f06c13e8 R14: 0000000000000000 R15: 0000000000000000
      FS:  00007fd6743c6740(0000) GS:ffff88887fcc0000(0000) knlGS:0000000000000000
      CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      CR2: 0000000000ed1038 CR3: 00000007e3156000 CR4: 00000000000006e0
      Call Trace:
       mlx4_ib_close_sriov+0x125/0x180 [mlx4_ib]
       mlx4_ib_remove+0x57/0x1f0 [mlx4_ib]
       mlx4_remove_device+0x92/0xa0 [mlx4_core]
       mlx4_unregister_interface+0x39/0x90 [mlx4_core]
       mlx4_ib_cleanup+0xc/0xd7 [mlx4_ib]
       __x64_sys_delete_module+0x17d/0x290
       ? trace_hardirqs_off_thunk+0x1a/0x1c
       ? do_syscall_64+0x12/0x180
       do_syscall_64+0x4a/0x180
       entry_SYSCALL_64_after_hwframe+0x49/0xbe
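
      As a sketch of the asymmetry being fixed (the function names come from
      this message; the surrounding flow and exact signatures are simplified
      assumptions):

      /* Create path: the slave AH is built directly and takes no reference
       * on the PD. */
      ah = mlx4_ib_create_slave_ah(pd, &ah_attr);

      /* Before the fix: drops a PD reference that was never taken, driving
       * the PD usecount negative. */
      rdma_destroy_ah(ah);

      /* After the fix: the destroy path matches the slave create path. */
      mlx4_ib_destroy_ah(ah);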
      
      Fixes: 5e62d5ff ("IB/mlx4: Create slave AH's directly")
      Signed-off-by: Jack Morgenstein <jackm@dev.mellanox.co.il>
      Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
      Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
      f45f8edb
  9. 15 January 2019, 2 commits
  10. 11 January 2019, 3 commits
  11. 21 December 2018, 1 commit
  12. 20 December 2018, 2 commits
  13. 19 December 2018, 2 commits
  14. 12 December 2018, 3 commits
  15. 07 December 2018, 1 commit
  16. 23 November 2018, 2 commits
  17. 17 October 2018, 4 commits
  18. 04 October 2018, 1 commit
  19. 27 September 2018, 2 commits
  20. 22 September 2018, 1 commit
  21. 21 September 2018, 1 commit
  22. 07 September 2018, 1 commit
  23. 31 July 2018, 3 commits
    • IB/mlx4: Use 4K pages for kernel QP's WQE buffer · f95ccffc
      Jack Morgenstein authored
      In the current implementation, the driver tries to allocate contiguous
      memory, and if it fails, it falls back to 4K fragmented allocation.
      
      Once the memory is fragmented, the first allocation might take a long
      time or even fail, which can cause connection failures.
      
      This patch changes the logic to always allocate with 4K granularity,
      since it's more robust and more likely to succeed.
      
      This patch was tested with Lustre and no performance degradation
      was observed.
      
      Note: This commit eliminates the "shrinking WQE" feature. This feature
      depended on using vmap to create a virtually contiguous send WQ.
      vmap use was abandoned due to problems with several processors (see the
      commit cited below). As a result, shrinking WQE was available only with
      physically contiguous send WQs. Allocating such send WQs caused the
      problems described above.
      Therefore, as a side effect of eliminating the use of large physically
      contiguous send WQs, the shrinking WQE feature became unavailable.
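
      As a sketch of the allocation change, assuming the
      mlx4_buf_alloc(dev, size, max_direct, buf) helper of that era (the
      exact call site in create_qp_common() is simplified):

      /* Before: request one large physically contiguous buffer, falling
       * back to 4K fragments only on failure. */
      err = mlx4_buf_alloc(dev->dev, qp->buf_size, qp->buf_size, &qp->buf);

      /* After: cap the direct part at one page, so the WQE buffer is always
       * built from 4K fragments and never depends on a high-order
       * allocation. */
      err = mlx4_buf_alloc(dev->dev, qp->buf_size, PAGE_SIZE, &qp->buf);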
      
      Warning example:
      worker/20:1: page allocation failure: order:8, mode:0x80d0
      CPU: 20 PID: 513 Comm: kworker/20:1 Tainted: G OE ------------
      Workqueue: ib_cm cm_work_handler [ib_cm]
      Call Trace:
      [<ffffffff81686d81>] dump_stack+0x19/0x1b
      [<ffffffff81186160>] warn_alloc_failed+0x110/0x180
      [<ffffffff8118a954>] __alloc_pages_nodemask+0x9b4/0xba0
      [<ffffffff811ce868>] alloc_pages_current+0x98/0x110
      [<ffffffff81184fae>] __get_free_pages+0xe/0x50
      [<ffffffff8133f6fe>] swiotlb_alloc_coherent+0x5e/0x150
      [<ffffffff81062551>] x86_swiotlb_alloc_coherent+0x41/0x50
      [<ffffffffa056b4c4>] mlx4_buf_direct_alloc.isra.7+0xc4/0x180 [mlx4_core]
      [<ffffffffa056b73b>] mlx4_buf_alloc+0x1bb/0x260 [mlx4_core]
      [<ffffffffa0b15496>] create_qp_common+0x536/0x1000 [mlx4_ib]
      [<ffffffff811c6ef7>] ? dma_pool_free+0xa7/0xd0
      [<ffffffffa0b163c1>] mlx4_ib_create_qp+0x3b1/0xdc0 [mlx4_ib]
      [<ffffffffa0b01bc2>] ? mlx4_ib_create_cq+0x2d2/0x430 [mlx4_ib]
      [<ffffffffa0b21f20>] mlx4_ib_create_qp_wrp+0x10/0x20 [mlx4_ib]
      [<ffffffffa08f152a>] ib_create_qp+0x7a/0x2f0 [ib_core]
      [<ffffffffa06205d4>] rdma_create_qp+0x34/0xb0 [rdma_cm]
      [<ffffffffa08275c9>] kiblnd_create_conn+0xbf9/0x1950 [ko2iblnd]
      [<ffffffffa074077a>] ? cfs_percpt_unlock+0x1a/0xb0 [libcfs]
      [<ffffffffa0835519>] kiblnd_passive_connect+0xa99/0x18c0 [ko2iblnd]
      
      Fixes: 73898db0 ("net/mlx4: Avoid wrong virtual mappings")
      Signed-off-by: Jack Morgenstein <jackm@dev.mellanox.co.il>
      Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
      Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
      f95ccffc
    • RDMA, core and ULPs: Declare ib_post_send() and ib_post_recv() arguments const · d34ac5cd
      Bart Van Assche authored
      Since neither ib_post_send() nor ib_post_recv() modifies the data
      structure its second argument points at, declare that argument const.
      This change makes it necessary to declare the 'bad_wr' argument const
      too and also to modify all ULPs that call ib_post_send(),
      ib_post_recv() or ib_post_srq_recv(). This patch does not change any
      functionality but makes it possible for the compiler to verify that
      the ib_post_(send|recv|srq_recv) functions really do not modify the
      posted work request.
      
      To make this possible, only one cast that casts away constness had to
      be introduced, namely in rpcrdma_post_recvs(). The only way I can think
      of to avoid that cast is to introduce an additional loop in that
      function or to change the data type of bad_wr from struct ib_recv_wr **
      to int (an index that refers to an element in the work request list).
      However, both approaches would require even more extensive changes than
      this patch.
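
      The resulting posting interfaces look roughly as follows (prototypes
      paraphrased from include/rdma/ib_verbs.h; parameter names and inline
      qualifiers may differ by kernel version):

      int ib_post_send(struct ib_qp *qp, const struct ib_send_wr *send_wr,
                       const struct ib_send_wr **bad_send_wr);
      int ib_post_recv(struct ib_qp *qp, const struct ib_recv_wr *recv_wr,
                       const struct ib_recv_wr **bad_recv_wr);
      int ib_post_srq_recv(struct ib_srq *srq, const struct ib_recv_wr *recv_wr,
                           const struct ib_recv_wr **bad_recv_wr);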
      Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
      Reviewed-by: Chuck Lever <chuck.lever@oracle.com>
      Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
      d34ac5cd
    • RDMA: Constify the argument of the work request conversion functions · f696bf6d
      Bart Van Assche authored
      When posting a send work request, the work request that is posted is not
      modified by any of the RDMA drivers. Make this explicit by constifying
      most ib_send_wr pointers in RDMA transport drivers.
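
      An illustrative driver-internal prototype after constification (the
      function and type names here are hypothetical, not taken from the
      patch):

      /* A WQE-build helper can now promise not to modify the posted WR. */
      int foo_build_send_wqe(struct foo_qp *qp, const struct ib_send_wr *wr,
                             void *wqe);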
      Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
      Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
      Reviewed-by: Steve Wise <swise@opengridcomputing.com>
      Reviewed-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
      Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
      f696bf6d
  24. 11 July 2018, 1 commit
    • RDMA: Fix storage of PortInfo CapabilityMask in the kernel · 2f944c0f
      Jason Gunthorpe authored
      The internal flag IP_BASED_GIDS was added to a field that was being
      used to hold the PortInfo CapabilityMask, without considering the
      effects this would have. Since most drivers just use the value from the
      HW MAD, IP_BASED_GIDS would also become set on any HW that sets the IBA
      flag IsOtherLocalChangesNoticeSupported, which is not intended.
      
      Fix this by keeping port_cap_flags only for the IBA CapabilityMask
      value and storing unrelated flags externally. Move the bit definitions
      for this to ib_mad.h to make it clear what is happening.
      
      To keep the uAPI unchanged, define a new set of flags in the uapi
      header that are only used by ib_uverbs_query_port_resp.port_cap_flags;
      these match the flags currently supported in rdma-core and the values
      exposed by the current kernel.
      
      Fixes: b4a26a27 ("IB: Report using RoCE IP based gids in port caps")
      Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
      Signed-off-by: Artemy Kovalyov <artemyko@mellanox.com>
      Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
      2f944c0f
  25. 05 July 2018, 1 commit