1. 19 Feb 2020 — 1 commit
  2. 17 Jan 2020 — 3 commits
  3. 16 Jan 2020 — 2 commits
  4. 14 Jan 2020 — 4 commits
  5. 08 Jan 2020 — 4 commits
  6. 13 Dec 2019 — 1 commit
  7. 24 Nov 2019 — 1 commit
    • RDMA/odp: Use mmu_interval_notifier_insert() · f25a546e
      Authored by Jason Gunthorpe
      Replace the internal interval tree based mmu notifier with the new common
      mmu_interval_notifier_insert() API. This removes a lot of code and fixes a
      deadlock that can be triggered in ODP:
      
       zap_page_range()
        mmu_notifier_invalidate_range_start()
         [..]
          ib_umem_notifier_invalidate_range_start()
             down_read(&per_mm->umem_rwsem)
        unmap_single_vma()
          [..]
            __split_huge_page_pmd()
              mmu_notifier_invalidate_range_start()
              [..]
                 ib_umem_notifier_invalidate_range_start()
                    down_read(&per_mm->umem_rwsem)   // DEADLOCK
      
              mmu_notifier_invalidate_range_end()
                 up_read(&per_mm->umem_rwsem)
        mmu_notifier_invalidate_range_end()
           up_read(&per_mm->umem_rwsem)
      
      The umem_rwsem is held across the range_start/end as the ODP algorithm for
      invalidate_range_end cannot tolerate changes to the interval
      tree. However, due to the nested invalidation regions the second
      down_read() can deadlock if there are competing writers. The new core code
      provides an alternative scheme to solve this problem.
      
      Fixes: ca748c39 ("RDMA/umem: Get rid of per_mm->notifier_count")
Link: https://lore.kernel.org/r/20191112202231.3856-6-jgg@ziepe.ca
Tested-by: Artemy Kovalyov <artemyko@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
      f25a546e
  8. 23 Nov 2019 — 1 commit
  9. 13 Nov 2019 — 1 commit
  10. 07 Nov 2019 — 2 commits
  11. 29 Oct 2019 — 1 commit
    • RDMA/core: Fix ib_dma_max_seg_size() · ecdfdfdb
      Authored by Bart Van Assche
      If dev->dma_device->params == NULL then the maximum DMA segment size is 64
      KB. See also the dma_get_max_seg_size() implementation. This patch fixes
      the following kernel warning:
      
        DMA-API: infiniband rxe0: mapping sg segment longer than device claims to support [len=126976] [max=65536]
        WARNING: CPU: 4 PID: 4848 at kernel/dma/debug.c:1220 debug_dma_map_sg+0x3d9/0x450
        RIP: 0010:debug_dma_map_sg+0x3d9/0x450
        Call Trace:
         srp_queuecommand+0x626/0x18d0 [ib_srp]
         scsi_queue_rq+0xd02/0x13e0 [scsi_mod]
         __blk_mq_try_issue_directly+0x2b3/0x3f0
         blk_mq_request_issue_directly+0xac/0xf0
         blk_insert_cloned_request+0xdf/0x170
         dm_mq_queue_rq+0x43d/0x830 [dm_mod]
         __blk_mq_try_issue_directly+0x2b3/0x3f0
         blk_mq_request_issue_directly+0xac/0xf0
         blk_mq_try_issue_list_directly+0xb8/0x170
         blk_mq_sched_insert_requests+0x23c/0x3b0
         blk_mq_flush_plug_list+0x529/0x730
         blk_flush_plug_list+0x21f/0x260
         blk_mq_make_request+0x56b/0xf20
         generic_make_request+0x196/0x660
         submit_bio+0xae/0x290
         blkdev_direct_IO+0x822/0x900
         generic_file_direct_write+0x110/0x200
         __generic_file_write_iter+0x124/0x2a0
         blkdev_write_iter+0x168/0x270
         aio_write+0x1c4/0x310
         io_submit_one+0x971/0x1390
         __x64_sys_io_submit+0x12a/0x390
         do_syscall_64+0x6f/0x2e0
         entry_SYSCALL_64_after_hwframe+0x49/0xbe
      
      Link: https://lore.kernel.org/r/20191025225830.257535-2-bvanassche@acm.org
      Cc: <stable@vger.kernel.org>
      Fixes: 0b5cb330 ("RDMA/srp: Increase max_segment_size")
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Jason Gunthorpe <jgg@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
      ecdfdfdb
  12. 23 Oct 2019 — 4 commits
  13. 22 Aug 2019 — 2 commits
  14. 21 Aug 2019 — 1 commit
  15. 12 Aug 2019 — 1 commit
  16. 06 Aug 2019 — 1 commit
  17. 05 Aug 2019 — 1 commit
  18. 01 Aug 2019 — 2 commits
    • RDMA/devices: Remove the lock around remove_client_context · 9cd58817
      Authored by Jason Gunthorpe
      Due to the complexity of client->remove() callbacks it is desirable to not
      hold any locks while calling them. Remove the last one by tracking only
      the highest client ID and running backwards from there over the xarray.
      
      Since the only purpose of that lock was to protect the linked list, we can
      drop the lock.
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Link: https://lore.kernel.org/r/20190731081841.32345-3-leon@kernel.org
Signed-off-by: Doug Ledford <dledford@redhat.com>
      9cd58817
    • RDMA/devices: Do not deadlock during client removal · 621e55ff
      Authored by Jason Gunthorpe
      lockdep reports:
      
         WARNING: possible circular locking dependency detected
      
         modprobe/302 is trying to acquire lock:
         0000000007c8919c ((wq_completion)ib_cm){+.+.}, at: flush_workqueue+0xdf/0x990
      
         but task is already holding lock:
         000000002d3d2ca9 (&device->client_data_rwsem){++++}, at: remove_client_context+0x79/0xd0 [ib_core]
      
         which lock already depends on the new lock.
      
         the existing dependency chain (in reverse order) is:
      
         -> #2 (&device->client_data_rwsem){++++}:
                down_read+0x3f/0x160
                ib_get_net_dev_by_params+0xd5/0x200 [ib_core]
                cma_ib_req_handler+0x5f6/0x2090 [rdma_cm]
                cm_process_work+0x29/0x110 [ib_cm]
                cm_req_handler+0x10f5/0x1c00 [ib_cm]
                cm_work_handler+0x54c/0x311d [ib_cm]
                process_one_work+0x4aa/0xa30
                worker_thread+0x62/0x5b0
                kthread+0x1ca/0x1f0
                ret_from_fork+0x24/0x30
      
         -> #1 ((work_completion)(&(&work->work)->work)){+.+.}:
                process_one_work+0x45f/0xa30
                worker_thread+0x62/0x5b0
                kthread+0x1ca/0x1f0
                ret_from_fork+0x24/0x30
      
         -> #0 ((wq_completion)ib_cm){+.+.}:
                lock_acquire+0xc8/0x1d0
                flush_workqueue+0x102/0x990
                cm_remove_one+0x30e/0x3c0 [ib_cm]
                remove_client_context+0x94/0xd0 [ib_core]
                disable_device+0x10a/0x1f0 [ib_core]
                __ib_unregister_device+0x5a/0xe0 [ib_core]
                ib_unregister_device+0x21/0x30 [ib_core]
                mlx5_ib_stage_ib_reg_cleanup+0x9/0x10 [mlx5_ib]
                __mlx5_ib_remove+0x3d/0x70 [mlx5_ib]
                mlx5_ib_remove+0x12e/0x140 [mlx5_ib]
                mlx5_remove_device+0x144/0x150 [mlx5_core]
                mlx5_unregister_interface+0x3f/0xf0 [mlx5_core]
                mlx5_ib_cleanup+0x10/0x3a [mlx5_ib]
                __x64_sys_delete_module+0x227/0x350
                do_syscall_64+0xc3/0x6a4
                entry_SYSCALL_64_after_hwframe+0x49/0xbe
      
      Which is due to the read side of the client_data_rwsem being obtained
      recursively through a work queue flush during cm client removal.
      
      The lock is being held across the remove in remove_client_context() so
      that the function is a fence, once it returns the client is removed. This
      is required so that the two callers do not proceed with destruction until
      the client completes removal.
      
      Instead of using client_data_rwsem use the existing device unregistration
      refcount and add a similar client unregistration (client->uses) refcount.
      
      This will fence the two unregistration paths without holding any locks.
      
      Cc: <stable@vger.kernel.org>
      Fixes: 921eab11 ("RDMA/devices: Re-organize device.c locking")
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Link: https://lore.kernel.org/r/20190731081841.32345-2-leon@kernel.org
Signed-off-by: Doug Ledford <dledford@redhat.com>
      621e55ff
  19. 09 Jul 2019 — 2 commits
    • RDMA/core: Provide RDMA DIM support for ULPs · da662979
      Authored by Yamin Friedman
      Add an interface in the InfiniBand driver that applies rdma_dim
      adaptive moderation. There is now a dedicated function for
      allocating an ib_cq that uses rdma_dim.
      
      Performance improvement (ConnectX-5 100GbE, x86) running FIO benchmark over
      NVMf between two equal end-hosts with 56 cores across a Mellanox switch
      using null_blk device:
      
      READS without DIM:
      blk size | BW       | IOPS | 99th percentile latency  | 99.99th latency
      512B     | 3.8GiB/s | 7.7M | 1401  usec               | 2442  usec
      4k       | 7.0GiB/s | 1.8M | 4817  usec               | 6587  usec
      64k      | 10.7GiB/s| 175k | 9896  usec               | 10028 usec
      
      IO WRITES without DIM:
      blk size | BW       | IOPS | 99th percentile latency  | 99.99th latency
      512B     | 3.6GiB/s | 7.5M | 1434  usec               | 2474  usec
      4k       | 6.3GiB/s | 1.6M | 938   usec               | 1221  usec
      64k      | 10.7GiB/s| 175k | 8979  usec               | 12780 usec
      
      IO READS with DIM:
      blk size | BW       | IOPS | 99th percentile latency  | 99.99th latency
      512B     | 4GiB/s   | 8.2M | 816    usec              | 889   usec
      4k       | 10.1GiB/s| 2.65M| 3359   usec              | 5080  usec
      64k      | 10.7GiB/s| 175k | 9896   usec              | 10028 usec
      
      IO WRITES with DIM:
      blk size | BW       | IOPS  | 99th percentile latency | 99.99th latency
      512B     | 3.9GiB/s | 8.1M  | 799   usec              | 922   usec
      4k       | 9.6GiB/s | 2.5M  | 717   usec              | 1004  usec
      64k      | 10.7GiB/s| 176k  | 8586  usec              | 12256 usec
      
      The rdma_dim algorithm was designed to measure the effectiveness of
      moderation on the flow in a general way and thus should be appropriate
      for all RDMA storage protocols.
      
      rdma_dim is configured to be the default option based on performance
      improvement seen after extensive tests.
Signed-off-by: Yamin Friedman <yaminf@mellanox.com>
Reviewed-by: Max Gurtovoy <maxg@mellanox.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
      da662979
    • IB/mlx5: Report correctly tag matching rendezvous capability · 89705e92
      Authored by Danit Goldberg
      Userspace expects the IB_TM_CAP_RC bit to indicate that the device
      supports RC transport tag matching with rendezvous offload. However the
      firmware splits this into two capabilities for eager and rendezvous tag
      matching.
      
      Only if the FW supports both modes should userspace be told the tag
      matching capability is available.
      
      Cc: <stable@vger.kernel.org> # 4.13
      Fixes: eb761894 ("IB/mlx5: Fill XRQ capabilities")
Signed-off-by: Danit Goldberg <danitg@mellanox.com>
Reviewed-by: Yishai Hadas <yishaih@mellanox.com>
Reviewed-by: Artemy Kovalyov <artemyko@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
      89705e92
  20. 05 Jul 2019 — 4 commits
  21. 24 Jun 2019 — 1 commit