1. 19 Jul 2020, 1 commit
  2. 03 Jun 2020, 1 commit
  3. 30 May 2020, 1 commit
    • RDMA/core: Introduce shared CQ pool API · c7ff819a
      Authored by Yamin Friedman
      Allow a ULP to ask the core to provide a completion queue based on a
      least-used search over the per-device CQ pools. The device CQ pools grow
      in a lazy fashion when more CQs are requested.
      
      This feature reduces the number of interrupts when using many QPs. Using
      shared CQs allows for more efficient completion handling. It also reduces
      the amount of overhead needed for CQ contexts.
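      
      The pair of entry points this patch adds is ib_cq_pool_get() and
      ib_cq_pool_put(). A minimal usage sketch follows; the surrounding
      variables (device, nr_cqe) are illustrative, not taken from the patch:
      
        struct ib_cq *cq;
        
        /* Borrow a CQ with room for nr_cqe entries. The core picks the
         * least-used pool CQ, growing the pool lazily; a negative
         * comp_vector_hint lets the core choose the completion vector.
         */
        cq = ib_cq_pool_get(device, nr_cqe, -1, IB_POLL_SOFTIRQ);
        if (IS_ERR(cq))
                return PTR_ERR(cq);
        
        /* ... create QPs that use cq and run I/O ... */
        
        ib_cq_pool_put(cq, nr_cqe);     /* return the slots to the pool */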
      
      Test setup:
      Intel(R) Xeon(R) Platinum 8176M CPU @ 2.10GHz servers.
      Running NVMeoF 4KB read IOs over ConnectX-5EX across Spectrum switch.
      TX-depth = 32. The patch was applied in the nvme driver on both the target
      and initiator. Four controllers are accessed from each core. In the
      current test case we have exposed sixteen NVMe namespaces using four
      different subsystems (four namespaces per subsystem) from one NVM port.
      Each controller allocates X queues (RDMA QPs) attached to Y CQs. Before
      this series we had X == Y, i.e. for four controllers we created a total
      of 4X QPs and 4X CQs. In the shared case, we create 4X QPs but only X
      CQs, which means four controllers share a completion queue per core. Up
      to fourteen cores there is no significant change in performance, and the
      number of interrupts per second stays below a million in the current
      case.
      |Cores|Current KIOPs  |Shared KIOPs  |improvement|
      |-----|---------------|--------------|-----------|
      |14   |2332           |2723          |16.7%      |
      |20   |2086           |2712          |30%        |
      |28   |1971           |2669          |35.4%      |
      
      |Cores|Current avg lat|Shared avg lat|improvement|
      |-----|---------------|--------------|-----------|
      |14   |767us          |657us         |14.3%      |
      |20   |1225us         |943us         |23%        |
      |28   |1816us         |1341us        |26.1%      |
      
      |Cores|Current interrupts|Shared interrupts|improvement|
      |-----|------------------|-----------------|-----------|
      |14   |1.6M/sec          |0.4M/sec         |72%        |
      |20   |2.8M/sec          |0.6M/sec         |72.4%      |
      |28   |2.9M/sec          |0.8M/sec         |63.4%      |
      
      |Cores|Current 99.99th PCTL lat|Shared 99.99th PCTL lat|improvement|
      |-----|------------------------|-----------------------|-----------|
      |14   |67ms                    |6ms                    |90.9%      |
      |20   |5ms                     |6ms                    |-10%       |
      |28   |8.7ms                   |6ms                    |25.9%      |
      
      Performance improvement with sixteen disks (sixteen CQs per core) is
      comparable.
      
      Link: https://lore.kernel.org/r/1590568495-101621-3-git-send-email-yaminf@mellanox.com
      Signed-off-by: Yamin Friedman <yaminf@mellanox.com>
      Reviewed-by: Or Gerlitz <ogerlitz@mellanox.com>
      Reviewed-by: Max Gurtovoy <maxg@mellanox.com>
      Reviewed-by: Leon Romanovsky <leonro@mellanox.com>
      Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
  4. 06 May 2020, 1 commit
  5. 13 Mar 2020, 1 commit
    • RDMA/core: Fix missing error check on dev_set_name() · f2f2b3bb
      Authored by Jason Gunthorpe
      If the name memory allocation fails, the name will be left empty and
      device_add_one() will crash:
      
        kobject: (0000000004952746): attempted to be registered with empty name!
        WARNING: CPU: 0 PID: 329 at lib/kobject.c:234 kobject_add_internal+0x7ac/0x9a0 lib/kobject.c:234
        Kernel panic - not syncing: panic_on_warn set ...
        CPU: 0 PID: 329 Comm: syz-executor.5 Not tainted 5.6.0-rc2-syzkaller #0
        Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
        Call Trace:
         __dump_stack lib/dump_stack.c:77 [inline]
         dump_stack+0x197/0x210 lib/dump_stack.c:118
         panic+0x2e3/0x75c kernel/panic.c:221
         __warn.cold+0x2f/0x3e kernel/panic.c:582
         report_bug+0x289/0x300 lib/bug.c:195
         fixup_bug arch/x86/kernel/traps.c:174 [inline]
         fixup_bug arch/x86/kernel/traps.c:169 [inline]
         do_error_trap+0x11b/0x200 arch/x86/kernel/traps.c:267
         do_invalid_op+0x37/0x50 arch/x86/kernel/traps.c:286
         invalid_op+0x23/0x30 arch/x86/entry/entry_64.S:1027
        RIP: 0010:kobject_add_internal+0x7ac/0x9a0 lib/kobject.c:234
        Code: 1a 98 ca f9 e9 f0 f8 ff ff 4c 89 f7 e8 6d 98 ca f9 e9 95 f9 ff ff e8 c3 f0 8b f9 4c 89 e6 48 c7 c7 a0 0e 1a 89 e8 e3 41 5c f9 <0f> 0b 41 bd ea ff ff ff e9 52 ff ff ff e8 a2 f0 8b f9 0f 0b e8 9b
        RSP: 0018:ffffc90005b27908 EFLAGS: 00010286
        RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000
        RDX: 0000000000040000 RSI: ffffffff815eae46 RDI: fffff52000b64f13
        RBP: ffffc90005b27960 R08: ffff88805aeba480 R09: ffffed1015d06659
        R10: ffffed1015d06658 R11: ffff8880ae8332c7 R12: ffff8880a37fd000
        R13: 0000000000000000 R14: ffff888096691780 R15: 0000000000000001
         kobject_add_varg lib/kobject.c:390 [inline]
         kobject_add+0x150/0x1c0 lib/kobject.c:442
         device_add+0x3be/0x1d00 drivers/base/core.c:2412
         add_one_compat_dev drivers/infiniband/core/device.c:901 [inline]
         add_one_compat_dev+0x46a/0x7e0 drivers/infiniband/core/device.c:857
         rdma_dev_init_net+0x2eb/0x490 drivers/infiniband/core/device.c:1120
         ops_init+0xb3/0x420 net/core/net_namespace.c:137
         setup_net+0x2d5/0x8b0 net/core/net_namespace.c:327
         copy_net_ns+0x29e/0x5a0 net/core/net_namespace.c:468
         create_new_namespaces+0x403/0xb50 kernel/nsproxy.c:108
         unshare_nsproxy_namespaces+0xc2/0x200 kernel/nsproxy.c:229
         ksys_unshare+0x444/0x980 kernel/fork.c:2955
         __do_sys_unshare kernel/fork.c:3023 [inline]
         __se_sys_unshare kernel/fork.c:3021 [inline]
         __x64_sys_unshare+0x31/0x40 kernel/fork.c:3021
         do_syscall_64+0xfa/0x790 arch/x86/entry/common.c:294
         entry_SYSCALL_64_after_hwframe+0x49/0xbe
      
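      The fix is to check dev_set_name()'s return value in
      add_one_compat_dev() so registration fails cleanly instead of reaching
      kobject_add() with an empty name. A sketch of the pattern; the error
      label is a simplification of the patch's unwind path:
      
        ret = dev_set_name(&cdev->dev, "%s", dev_name(&device->dev));
        if (ret)
                goto add_err;   /* unwind cdev and propagate ret */
        
        ret = device_add(&cdev->dev);
      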
      Link: https://lore.kernel.org/r/20200309193200.GA10633@ziepe.ca
      Cc: stable@kernel.org
      Fixes: 4e0f7b90 ("RDMA/core: Implement compat device/sysfs tree in net namespace")
      Reported-by: syzbot+ab4dae63f7d310641ded@syzkaller.appspotmail.com
      Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
  6. 10 Jan 2020, 1 commit
  7. 08 Jan 2020, 2 commits
  8. 24 Nov 2019, 1 commit
    • RDMA/odp: Use mmu_interval_notifier_insert() · f25a546e
      Authored by Jason Gunthorpe
      Replace the internal interval tree based mmu notifier with the new common
      mmu_interval_notifier_insert() API. This removes a lot of code and fixes a
      deadlock that can be triggered in ODP:
      
       zap_page_range()
        mmu_notifier_invalidate_range_start()
         [..]
          ib_umem_notifier_invalidate_range_start()
             down_read(&per_mm->umem_rwsem)
        unmap_single_vma()
          [..]
            __split_huge_page_pmd()
              mmu_notifier_invalidate_range_start()
              [..]
                 ib_umem_notifier_invalidate_range_start()
                    down_read(&per_mm->umem_rwsem)   // DEADLOCK
      
              mmu_notifier_invalidate_range_end()
                 up_read(&per_mm->umem_rwsem)
        mmu_notifier_invalidate_range_end()
           up_read(&per_mm->umem_rwsem)
      
      The umem_rwsem is held across the range_start/end as the ODP algorithm for
      invalidate_range_end cannot tolerate changes to the interval
      tree. However, due to the nested invalidation regions the second
      down_read() can deadlock if there are competing writers. The new core code
      provides an alternative scheme to solve this problem.
      
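      For reference, the new common API is used roughly like this: the ULP
      embeds a struct mmu_interval_notifier, supplies an invalidate callback,
      and the core tracks the interval and pairs range_start/range_end. A
      minimal sketch; the callback body and the umem_odp/addr/length names
      are illustrative:
      
        static bool odp_invalidate(struct mmu_interval_notifier *mni,
                                   const struct mmu_notifier_range *range,
                                   unsigned long cur_seq)
        {
                /* Record the new sequence so concurrent page faults retry,
                 * then tear down device mappings for the overlapping range.
                 */
                mmu_interval_set_seq(mni, cur_seq);
                /* ... invalidate [range->start, range->end) ... */
                return true;
        }
        
        static const struct mmu_interval_notifier_ops odp_mni_ops = {
                .invalidate = odp_invalidate,
        };
        
        /* Subscribe to invalidations of [addr, addr + length) in mm: */
        int ret = mmu_interval_notifier_insert(&umem_odp->notifier, mm,
                                               addr, length, &odp_mni_ops);
      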
      Fixes: ca748c39 ("RDMA/umem: Get rid of per_mm->notifier_count")
      Link: https://lore.kernel.org/r/20191112202231.3856-6-jgg@ziepe.ca
      Tested-by: Artemy Kovalyov <artemyko@mellanox.com>
      Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
  9. 23 Nov 2019, 1 commit
  10. 07 Nov 2019, 2 commits
  11. 29 Oct 2019, 1 commit
  12. 25 Oct 2019, 1 commit
    • IB/core: Avoid deadlock during netlink message handling · 549af008
      Authored by Parav Pandit
      When the rdmacm module is not loaded and a netlink message is received
      to get char device info, a deadlock results from recursive locking of
      rdma_nl_mutex, with the call sequence below.
      
      [..]
        rdma_nl_rcv()
        mutex_lock()
         [..]
         rdma_nl_rcv_msg()
            ib_get_client_nl_info()
               request_module()
                 iw_cm_init()
                   rdma_nl_register()
                     mutex_lock(); <- Deadlock, acquiring mutex again
      
      This call sequence produces the following call trace and deadlock.
      
        kernel: __mutex_lock+0x35e/0x860
        kernel: ? __mutex_lock+0x129/0x860
        kernel: ? rdma_nl_register+0x1a/0x90 [ib_core]
        kernel: rdma_nl_register+0x1a/0x90 [ib_core]
        kernel: ? 0xffffffffc029b000
        kernel: iw_cm_init+0x34/0x1000 [iw_cm]
        kernel: do_one_initcall+0x67/0x2d4
        kernel: ? kmem_cache_alloc_trace+0x1ec/0x2a0
        kernel: do_init_module+0x5a/0x223
        kernel: load_module+0x1998/0x1e10
        kernel: ? __symbol_put+0x60/0x60
        kernel: __do_sys_finit_module+0x94/0xe0
        kernel: do_syscall_64+0x5a/0x270
        kernel: entry_SYSCALL_64_after_hwframe+0x49/0xbe
      
        process stack trace:
        [<0>] __request_module+0x1c9/0x460
        [<0>] ib_get_client_nl_info+0x5e/0xb0 [ib_core]
        [<0>] nldev_get_chardev+0x1ac/0x320 [ib_core]
        [<0>] rdma_nl_rcv_msg+0xeb/0x1d0 [ib_core]
        [<0>] rdma_nl_rcv+0xcd/0x120 [ib_core]
        [<0>] netlink_unicast+0x179/0x220
        [<0>] netlink_sendmsg+0x2f6/0x3f0
        [<0>] sock_sendmsg+0x30/0x40
        [<0>] ___sys_sendmsg+0x27a/0x290
        [<0>] __sys_sendmsg+0x58/0xa0
        [<0>] do_syscall_64+0x5a/0x270
        [<0>] entry_SYSCALL_64_after_hwframe+0x49/0xbe
      
      To overcome this deadlock and allow multiple netlink messages to
      progress in parallel, the following scheme is implemented.
      
      1. Split the lock protecting the cb_table into a per-index lock, and
         make it an rwlock. This lock is used to ensure no callbacks are
         running after unregistration returns. Since a module will not be
         registered once it is already running callbacks, this avoids the
         deadlock.
      
      2. Use smp_store_release() to update the cb_table during registration so
         that no lock is required. This avoids lockdep problems where all the
         rwsems would be treated as the same lock class (see the sketch below).
      
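      A condensed sketch of the resulting pattern; the structure and names
      only approximate the patch:
      
        /* One slot per netlink client index; the per-index rwsem only fences
         * running callbacks against unregistration.
         */
        static struct {
                struct rw_semaphore       sem;
                const struct rdma_nl_cbs *cb_table;
        } rdma_nl_types[RDMA_NL_NUM_CLIENTS];
        
        void rdma_nl_register(unsigned int index,
                              const struct rdma_nl_cbs cb_table[])
        {
                /* Lock-free publication: a module cannot already be running
                 * callbacks while it registers, so no mutex is needed and a
                 * nested request_module() cannot deadlock.
                 */
                smp_store_release(&rdma_nl_types[index].cb_table, cb_table);
        }
        
        /* Reader side, per received message (fragment): */
        down_read(&rdma_nl_types[index].sem);
        cb_table = smp_load_acquire(&rdma_nl_types[index].cb_table);
        if (cb_table && cb_table[op].doit)
                err = cb_table[op].doit(skb, nlh, extack);
        up_read(&rdma_nl_types[index].sem);
      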
      Fixes: 0e2d00eb ("RDMA: Add NLDEV_GET_CHARDEV to allow char dev discovery and autoload")
      Link: https://lore.kernel.org/r/20191015080733.18625-1-leon@kernel.org
      Signed-off-by: Parav Pandit <parav@mellanox.com>
      Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
      Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
  13. 23 Oct 2019, 2 commits
  14. 02 Oct 2019, 1 commit
  15. 01 Oct 2019, 1 commit
  16. 22 Aug 2019, 1 commit
  17. 12 Aug 2019, 1 commit
  18. 01 Aug 2019, 2 commits
    • RDMA/devices: Remove the lock around remove_client_context · 9cd58817
      Authored by Jason Gunthorpe
      Due to the complexity of client->remove() callbacks it is desirable not
      to hold any locks while calling them. Remove the last one by tracking
      only the highest client ID and running backwards from there over the
      xarray.
      
      Since the only purpose of that lock was to protect the linked list, we can
      drop the lock.
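      
      A sketch of the resulting removal loop; names approximate the patch:
      
        static void disable_device(struct ib_device *device)
        {
                u32 cid = highest_client_id;    /* snapshot of the high mark */
        
                /* Clients are assigned IDs in registration order, so walking
                 * backwards removes them in LIFO order; no lock is held
                 * while a client's remove() callback runs.
                 */
                while (cid) {
                        struct ib_client *client;
        
                        cid--;
                        client = xa_load(&clients, cid);
                        if (client)
                                remove_client_context(device, cid);
                }
        }
      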
      Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
      Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
      Link: https://lore.kernel.org/r/20190731081841.32345-3-leon@kernel.org
      Signed-off-by: Doug Ledford <dledford@redhat.com>
    • RDMA/devices: Do not deadlock during client removal · 621e55ff
      Authored by Jason Gunthorpe
      lockdep reports:
      
         WARNING: possible circular locking dependency detected
      
         modprobe/302 is trying to acquire lock:
         0000000007c8919c ((wq_completion)ib_cm){+.+.}, at: flush_workqueue+0xdf/0x990
      
         but task is already holding lock:
         000000002d3d2ca9 (&device->client_data_rwsem){++++}, at: remove_client_context+0x79/0xd0 [ib_core]
      
         which lock already depends on the new lock.
      
         the existing dependency chain (in reverse order) is:
      
         -> #2 (&device->client_data_rwsem){++++}:
                down_read+0x3f/0x160
                ib_get_net_dev_by_params+0xd5/0x200 [ib_core]
                cma_ib_req_handler+0x5f6/0x2090 [rdma_cm]
                cm_process_work+0x29/0x110 [ib_cm]
                cm_req_handler+0x10f5/0x1c00 [ib_cm]
                cm_work_handler+0x54c/0x311d [ib_cm]
                process_one_work+0x4aa/0xa30
                worker_thread+0x62/0x5b0
                kthread+0x1ca/0x1f0
                ret_from_fork+0x24/0x30
      
         -> #1 ((work_completion)(&(&work->work)->work)){+.+.}:
                process_one_work+0x45f/0xa30
                worker_thread+0x62/0x5b0
                kthread+0x1ca/0x1f0
                ret_from_fork+0x24/0x30
      
         -> #0 ((wq_completion)ib_cm){+.+.}:
                lock_acquire+0xc8/0x1d0
                flush_workqueue+0x102/0x990
                cm_remove_one+0x30e/0x3c0 [ib_cm]
                remove_client_context+0x94/0xd0 [ib_core]
                disable_device+0x10a/0x1f0 [ib_core]
                __ib_unregister_device+0x5a/0xe0 [ib_core]
                ib_unregister_device+0x21/0x30 [ib_core]
                mlx5_ib_stage_ib_reg_cleanup+0x9/0x10 [mlx5_ib]
                __mlx5_ib_remove+0x3d/0x70 [mlx5_ib]
                mlx5_ib_remove+0x12e/0x140 [mlx5_ib]
                mlx5_remove_device+0x144/0x150 [mlx5_core]
                mlx5_unregister_interface+0x3f/0xf0 [mlx5_core]
                mlx5_ib_cleanup+0x10/0x3a [mlx5_ib]
                __x64_sys_delete_module+0x227/0x350
                do_syscall_64+0xc3/0x6a4
                entry_SYSCALL_64_after_hwframe+0x49/0xbe
      
      This is due to the read side of the client_data_rwsem being acquired
      recursively through a work queue flush during cm client removal.
      
      The lock is held across the remove in remove_client_context() so that
      the function acts as a fence: once it returns, the client is removed.
      This is required so that the two callers do not proceed with destruction
      until the client completes removal.
      
      Instead of using client_data_rwsem, use the existing device
      unregistration refcount and add a similar client unregistration
      (client->uses) refcount.
      
      This will fence the two unregistration paths without holding any locks.
      
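      The fence itself is a refcount paired with a completion, sketched below
      in simplified form (the unregistration tail is elided):
      
        struct ib_client {
                /* ... */
                refcount_t        uses;         /* concurrent users of client */
                struct completion uses_zero;    /* fired when uses hits zero */
        };
        
        static void ib_client_put(struct ib_client *client)
        {
                if (refcount_dec_and_test(&client->uses))
                        complete(&client->uses_zero);
        }
        
        void ib_unregister_client(struct ib_client *client)
        {
                /* Drop the registration reference, then wait for every
                 * in-flight remove_client_context() to finish. No rwsem is
                 * held, so a work queue flush inside remove() cannot recurse
                 * into client_data_rwsem.
                 */
                ib_client_put(client);
                wait_for_completion(&client->uses_zero);
                /* ... erase the client from the clients xarray ... */
        }
      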
      Cc: <stable@vger.kernel.org>
      Fixes: 921eab11 ("RDMA/devices: Re-organize device.c locking")
      Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
      Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
      Link: https://lore.kernel.org/r/20190731081841.32345-2-leon@kernel.org
      Signed-off-by: Doug Ledford <dledford@redhat.com>
  19. 26 Jul 2019, 1 commit
  20. 25 Jul 2019, 1 commit
  21. 09 Jul 2019, 1 commit
  22. 05 Jul 2019, 3 commits
  23. 24 Jun 2019, 2 commits
  24. 19 Jun 2019, 2 commits
  25. 17 Jun 2019, 1 commit
  26. 14 Jun 2019, 1 commit
  27. 12 Jun 2019, 1 commit
  28. 11 Jun 2019, 3 commits
  29. 28 May 2019, 1 commit
    • RDMA/core: Fix panic when port_data isn't initialized · 46bdf370
      Authored by Kamal Heib
      This happens if assign_name() fails when called from
      ib_register_device(); that failure leads to the following panic every
      time someone touches port_data's members.
      
       BUG: unable to handle kernel NULL pointer dereference at 00000000000000c0
       PGD 0 P4D 0
       Oops: 0002 [#1] SMP PTI
       CPU: 19 PID: 1994 Comm: systemd-udevd Not tainted 5.1.0-rc5+ #1
       Hardware name: HP ProLiant DL360p Gen8, BIOS P71 12/20/2013
       RIP: 0010:_raw_spin_lock_irqsave+0x1e/0x40
       Code: 85 ff 66 2e 0f 1f 84 00 00 00 00 00 66 66 66 66 90 53 9c 58 66 66 90
       66 90 48 89 c3 fa 66 66 90 66 66 90 31 c0 ba 01 00 00 00 <f0> 0f b1 17 0f
       94 c2 84 d2 74 05 48 89 d8 5b c3 89 c6 e8 b4 85 8a
       RSP: 0018:ffffa8d7079a7c08 EFLAGS: 00010046
       RAX: 0000000000000000 RBX: 0000000000000202 RCX: ffffa8d7079a7bf8
       RDX: 0000000000000001 RSI: ffff93607c990000 RDI: 00000000000000c0
       RBP: 0000000000000001 R08: 0000000000000000 R09: ffffffffc08c4dd8
       R10: 0000000000000000 R11: 0000000000000001 R12: 00000000000000c0
       R13: ffff93607c990000 R14: ffffffffc05a9740 R15: ffffa8d7079a7e98
       FS:  00007f1c6ee438c0(0000) GS:ffff93609f6c0000(0000) knlGS:0000000000000000
       CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
       CR2: 00000000000000c0 CR3: 0000000819fca002 CR4: 00000000000606e0
       Call Trace:
        free_netdevs+0x4d/0xe0 [ib_core]
        ib_dealloc_device+0x51/0xb0 [ib_core]
        __mlx5_ib_add+0x5e/0x70 [mlx5_ib]
        mlx5_add_device+0x57/0xe0 [mlx5_core]
        mlx5_register_interface+0x85/0xc0 [mlx5_core]
        ? 0xffffffffc0474000
        do_one_initcall+0x4e/0x1d4
        ? _cond_resched+0x15/0x30
        ? kmem_cache_alloc_trace+0x15f/0x1c0
        do_init_module+0x5a/0x218
        load_module+0x186b/0x1e40
        ? m_show+0x1c0/0x1c0
        __do_sys_finit_module+0x94/0xe0
        do_syscall_64+0x5b/0x180
        entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
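      A sketch of the guard added to the teardown path, simplified from the
      patch (the loop body is elided):
      
        static void free_netdevs(struct ib_device *device)
        {
                unsigned int port;
        
                /* Registration can fail (e.g. in assign_name()) before
                 * port_data is allocated; in that case there is nothing to
                 * release.
                 */
                if (!device->port_data)
                        return;
        
                rdma_for_each_port (device, port) {
                        /* ... drop the netdev reference for this port ... */
                }
        }
      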
      Fixes: 8ceb1357 ("RDMA/device: Consolidate ib_device per_port data into one place")
      Signed-off-by: Kamal Heib <kamalheib1@gmail.com>
      Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
  30. 22 May 2019, 1 commit