- 04 April 2019 (2 commits)
-
-
By Kaike Wan

This patch adds a function to read the next expected KDETH PSN from the hardware flow, simplifying the code.

Reviewed-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Reviewed-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Reviewed-by: Michael J. Ruhl <michael.j.ruhl@intel.com>
Signed-off-by: Kaike Wan <kaike.wan@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
-
By Kaike Wan

The reference to the destination memory region is first obtained when a TID RDMA WRITE request is received on the responder side. This reference is released once all TID RDMA WRITE RESP packets are sent to the requester side, even though not all TID RDMA WRITE DATA packets may have been received. This early release is especially undesirable if software needs to access the destination memory before the last data packet arrives. This patch delays the release of the MR until all TID RDMA WRITE DATA packets have been received. A helper function to release the reference is also created to simplify the code.

Reviewed-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Reviewed-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Reviewed-by: Michael J. Ruhl <michael.j.ruhl@intel.com>
Signed-off-by: Kaike Wan <kaike.wan@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
-
- 28 March 2019 (5 commits)
-
-
By Moni Shoua

If a page-fault handler spans multiple MRs, the access mask needs to be reset before handling each MR; otherwise write access will be granted to mapped pages instead of read-only access.

Cc: <stable@vger.kernel.org> # 3.19
Fixes: 7bdf65d4 ("IB/mlx5: Handle page faults")
Reported-by: Jerome Glisse <jglisse@redhat.com>
Signed-off-by: Moni Shoua <monis@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
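A minimal sketch of the fixed loop shape, with hypothetical helper names modeled loosely on the mlx5 ODP page-fault path; the point is only the placement of the access-mask reset:

    /* Hypothetical sketch: the mask must be rebuilt for every MR. */
    static int handle_fault_span(struct fault_ctx *ctx)
    {
            while (ctx->bytes_left > 0) {
                    /* Reset here: a writable MR in a previous iteration must
                     * not leak its write bit into a read-only MR handled next. */
                    u32 access_mask = ODP_READ_ALLOWED_BIT;
                    struct fault_mr *mr = lookup_next_mr(ctx);

                    if (!mr)
                            return -EFAULT;
                    if (mr->writable && ctx->fault_is_write)
                            access_mask |= ODP_WRITE_ALLOWED_BIT;

                    map_fault_pages(mr, ctx->va, access_mask);
                    fault_advance(ctx, mr);
            }
            return 0;
    }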
-
By Kaike Wan

The receive side mapping (RSM) on hfi1 hardware is a special matching mechanism that directs an incoming packet to a given hardware receive context. It has 4 instances of matching capabilities (RSM0 - RSM3) that share the same RSM table (RMT). The RMT has a total of 256 entries, each of which points to a receive context. Currently, three instances of RSM are used:

1. RSM0 by QOS;
2. RSM1 by PSM FECN;
3. RSM2 by VNIC.

Each RSM instance should reserve enough entries in the RMT to function properly. Since both PSM and VNIC can allocate any receive context between dd->first_dyn_alloc_ctxt and dd->num_rcv_contexts, PSM FECN must reserve enough RMT entries to cover the entire receive context index range (dd->num_rcv_contexts - dd->first_dyn_alloc_ctxt) instead of only the user receive contexts allocated for PSM (dd->num_user_contexts). Consequently, the sizing of dd->num_user_contexts in set_up_context_variables is incorrect.

Fixes: 2280740f ("IB/hfi1: Virtual Network Interface Controller (VNIC) HW support")
Reviewed-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Reviewed-by: Michael J. Ruhl <michael.j.ruhl@intel.com>
Signed-off-by: Kaike Wan <kaike.wan@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
-
By Kaike Wan

When an old ack_queue entry is used to store an incoming request, it may need to be cleaned up if it still references an MR. Originally only the RDMA READ request needed to reference an MR on the responder side, and therefore the opcode was tested when cleaning up the old entry. The introduction of TID RDMA-specific operations in the ack_queue makes the opcode-specific tests wrong: multiple opcodes (RDMA READ, TID RDMA READ, and TID RDMA WRITE) may need MR ref cleanup. Remove the opcode-specific tests associated with the ack_queue.

Fixes: f48ad614 ("IB/hfi1: Move driver out of staging")
Signed-off-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Kaike Wan <kaike.wan@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
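The opcode-agnostic cleanup reduces to a small helper along these lines (a sketch using rdmavt naming conventions, not necessarily the exact upstream helper):

    static void release_rdma_sge_mr(struct rvt_ack_entry *e)
    {
            /* Release the MR reference whenever one is held, regardless of
             * which opcode (RDMA READ, TID RDMA READ/WRITE) took it. */
            if (e->rdma_sge.mr) {
                    rvt_put_mr(e->rdma_sge.mr);
                    e->rdma_sge.mr = NULL;
            }
    }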
-
By Kaike Wan

When a QP is put into the error state, it may be waiting for send engine resources. In this case, the QP is removed from the send engine's waiting list, but its IOWAIT pending bits are not cleared. This will normally not have any major impact as the QP is being destroyed. However, the QP still needs to wind down its operations, such as draining the send queue by scheduling the send engine, and clearing the pending bits avoids any potential complications. In addition, if the QP eventually hangs, cleared pending bits help debugging by presenting a consistent picture if the user dumps the qp_stats. This patch clears a QP's IOWAIT_PENDING_IB and IOWAIT_PENDING_TID bits in priv->s_iowait.flags in this case.

Fixes: 5da0fc9d ("IB/hfi1: Prepare resource waits for dual leg")
Reviewed-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Reviewed-by: Alex Estrin <alex.estrin@intel.com>
Signed-off-by: Kaike Wan <kaike.wan@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
-
By Kaike Wan

When a QP is put into the error state, all pending requests in the send work queue should be drained. The following sequence of events could lead to a failure, causing a request to hang:

(1) The QP builds a packet and tries to send it through the SDMA engine. However, the PIO engine is still busy. Consequently, this packet is put on the QP's tx list and the QP is put on the PIO waiting list. The field qp->s_flags is set with HFI1_S_WAIT_PIO_DRAIN;

(2) The QP is put into the error state by the user application and notify_error_qp() is called, which removes the QP from the PIO waiting list and the packet from the QP's tx list. In addition, qp->s_flags is cleared of the RVT_S_ANY_WAIT_IO bits, which do not include the HFI1_S_WAIT_PIO_DRAIN bit;

(3) The hfi1_schedule_send() function is called to drain the QP's send queue. Subsequently, hfi1_do_send() is called. Since the flag bit HFI1_S_WAIT_PIO_DRAIN is still set in qp->s_flags, hfi1_send_ok() fails. As a result, hfi1_do_send() bails out without draining any request from the send queue;

(4) The PIO engine completes the sending and tries to wake up any QP on its waiting list. But the QP has been removed from the PIO waiting list and therefore sleeps forever.

The fix is to clear qp->s_flags of the HFI1_S_ANY_WAIT_IO bits in step (2). HFI1_S_ANY_WAIT_IO includes RVT_S_ANY_WAIT_IO and HFI1_S_WAIT_PIO_DRAIN.

Fixes: 2e2ba09e ("IB/rdmavt, IB/hfi1: Create device dependent s_flags")
Cc: <stable@vger.kernel.org> # 4.19.x+
Reviewed-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Reviewed-by: Alex Estrin <alex.estrin@intel.com>
Signed-off-by: Kaike Wan <kaike.wan@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
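In code terms, the fix in step (2) amounts to clearing the wider mask (a sketch; HFI1_S_ANY_WAIT_IO is described above as the union of the two sets of bits):

    #define HFI1_S_ANY_WAIT_IO (RVT_S_ANY_WAIT_IO | HFI1_S_WAIT_PIO_DRAIN)

    /* in notify_error_qp(), step (2): */
    qp->s_flags &= ~HFI1_S_ANY_WAIT_IO;     /* was ~RVT_S_ANY_WAIT_IO, which
                                             * left the PIO drain bit set */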
-
- 18 March 2019 (4 commits)
-
-
By Feng Tang

A panic was reported on a system with X722 Ethernet when performing operations like:

    # ip link add br0 type bridge
    # ip link set eno1 master br0
    # systemctl restart systemd-networkd

The system panics with "BUG: unable to handle kernel null pointer dereference at 0000000000000034", with the call chain:

    i40iw_inetaddr_event
    notifier_call_chain
    blocking_notifier_call_chain
    notifier_call_chain
    __inet_del_ifa
    inet_rtm_deladdr
    rtnetlink_rcv_msg
    netlink_rcv_skb
    rtnetlink_rcv
    netlink_unicast
    netlink_sendmsg
    sock_sendmsg
    __sys_sendto

It is caused by "local_ipaddr = ntohl(in->ifa_list->ifa_address)" while in->ifa_list is NULL. Add a check for the "in->ifa_list == NULL" case and skip the ARP operation accordingly.

Signed-off-by: Feng Tang <feng.tang@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
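A sketch of the guard (simplified; the notifier can fire on address deletion after the interface's last IPv4 address is already gone):

    struct in_device *in = __in_dev_get_rtnl(dev);
    u32 local_ipaddr;

    if (!in->ifa_list)
            return NOTIFY_DONE;     /* no address left: skip the ARP operation */

    local_ipaddr = ntohl(in->ifa_list->ifa_address);
    /* ... proceed to update the ARP cache with local_ipaddr ... */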
-
By Aya Levin

Add mapping of link mode CAUI4 (100Gbps CR4/KR4) as 4 lanes of 25Gbps. Fix mapping of link mode GAUI2 (50Gbps CR2/KR2) to be 2 lanes of 25Gbps.

Fixes: 08e8676f ("IB/mlx5: Add support for 50Gbps per lane link modes")
Signed-off-by: Aya Levin <ayal@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
-
By Yishai Hadas

To prevent a hardware memory leak when a DEVX DCT object is destroyed without DRAIN DCT having been called first (e.g. in a cleanup flow), its creation and destruction need to be managed via mlx5 core. That way the DRAIN DCT command will be issued, and only once it has completed will the DESTROY DCT command be issued; otherwise, DESTROY DCT may fail and a hardware leak may occur. With this change the DRAIN DCT command is no longer exposed from DEVX; it is managed internally by the driver to work as expected by the device specification.

Fixes: 7efce369 ("IB/mlx5: Add obj create and destroy functionality")
Signed-off-by: Yishai Hadas <yishaih@mellanox.com>
Reviewed-by: Artemy Kovalyov <artemyko@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
-
By Jack Morgenstein

Code review revealed a race condition which could allow the catas error flow to interrupt the alias guid query post mechanism at random points. This is fixed by doing cancel_delayed_work_sync() instead of cancel_delayed_work() during the alias guid mechanism destroy flow.

Fixes: a0c64a17 ("mlx4: Add alias_guid mechanism")
Signed-off-by: Jack Morgenstein <jackm@dev.mellanox.co.il>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
-
- 07 March 2019 (2 commits)
-
-
By Michael J. Ruhl

When disabling and removing a receive context, it is possible for an asynchronous event (i.e. an IRQ) to occur. Because of this, there is a race between cleaning up the context and the context being used by the asynchronous event.

    cpu 0 (context cleanup)
        rc->ref_count-- (ref_count == 0)
        hfi1_rcd_free()
    cpu 1 (IRQ (with rcd index))
        rcd_get_by_index()
        lock
        ref_count++ <-- reference count race (WARNING)
        return rcd
        unlock
    cpu 0
        hfi1_free_ctxtdata() <-- incorrect free location
        lock
        remove rcd from array
        unlock
        free rcd

This race will cause the following WARNING trace:

    WARNING: CPU: 0 PID: 175027 at include/linux/kref.h:52 hfi1_rcd_get_by_index+0x84/0xa0 [hfi1]
    CPU: 0 PID: 175027 Comm: IMB-MPI1 Kdump: loaded Tainted: G OE ------------ 3.10.0-957.el7.x86_64 #1
    Hardware name: Intel Corporation S2600KP/S2600KP, BIOS SE5C610.86B.11.01.0076.C4.111920150602 11/19/2015
    Call Trace:
        dump_stack+0x19/0x1b
        __warn+0xd8/0x100
        warn_slowpath_null+0x1d/0x20
        hfi1_rcd_get_by_index+0x84/0xa0 [hfi1]
        is_rcv_urgent_int+0x24/0x90 [hfi1]
        general_interrupt+0x1b6/0x210 [hfi1]
        __handle_irq_event_percpu+0x44/0x1c0
        handle_irq_event_percpu+0x32/0x80
        handle_irq_event+0x3c/0x60
        handle_edge_irq+0x7f/0x150
        handle_irq+0xe4/0x1a0
        do_IRQ+0x4d/0xf0
        common_interrupt+0x162/0x162

The race can also lead to a use after free which could be similar to:

    general protection fault: 0000 1 SMP
    CPU: 71 PID: 177147 Comm: IMB-MPI1 Kdump: loaded Tainted: G W OE ------------ 3.10.0-957.el7.x86_64 #1
    Hardware name: Intel Corporation S2600KP/S2600KP, BIOS SE5C610.86B.11.01.0076.C4.111920150602 11/19/2015
    task: ffff9962a8098000 ti: ffff99717a508000 task.ti: ffff99717a508000
    __kmalloc+0x94/0x230
    Call Trace:
        ? hfi1_user_sdma_process_request+0x9c8/0x1250 [hfi1]
        hfi1_user_sdma_process_request+0x9c8/0x1250 [hfi1]
        hfi1_aio_write+0xba/0x110 [hfi1]
        do_sync_readv_writev+0x7b/0xd0
        do_readv_writev+0xce/0x260
        ? handle_mm_fault+0x39d/0x9b0
        ? pick_next_task_fair+0x5f/0x1b0
        ? sched_clock_cpu+0x85/0xc0
        ? __schedule+0x13a/0x890
        vfs_writev+0x35/0x60
        SyS_writev+0x7f/0xf0
        system_call_fastpath+0x22/0x27

Use the appropriate kref API to verify access, and reorder the context cleanup so that the context is removed from the array before it is freed.

Cc: stable@vger.kernel.org # v4.14.0+
Fixes: f683c80c ("IB/hfi1: Resolve kernel panics by reference counting receive contexts")
Reviewed-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Michael J. Ruhl <michael.j.ruhl@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
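The "appropriate kref API" here is the conditional-get pattern; a sketch using the field names visible in the traces above, with the details simplified:

    static struct hfi1_ctxtdata *get_rcd_sketch(struct hfi1_devdata *dd, u16 i)
    {
            struct hfi1_ctxtdata *rcd = NULL;
            unsigned long flags;

            spin_lock_irqsave(&dd->uctxt_lock, flags);
            /* Only take the reference if the count has not hit zero, so an
             * IRQ racing with teardown gets NULL instead of resurrecting a
             * context that is already being freed. */
            if (dd->rcd[i] && kref_get_unless_zero(&dd->rcd[i]->kref))
                    rcd = dd->rcd[i];
            spin_unlock_irqrestore(&dd->uctxt_lock, flags);

            return rcd;
    }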
-
By John Hubbard

The previous attempted bug fix overlooked the fact that ib_umem_odp_map_dma_single_page() was doing a put_page() upon hitting an error, so there was not really a bug there. Therefore, this reverts the off-by-one change but keeps the change to use release_pages() in the error path.

Fixes: 75a3e6a3 ("RDMA/umem: minor bug fix in error handling path")
Suggested-by: Artemy Kovalyov <artemyko@mellanox.com>
Signed-off-by: John Hubbard <jhubbard@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
-
- 06 March 2019 (1 commit)
-
-
By Anshuman Khandual

Patch series "Replace all open encodings for NUMA_NO_NODE", v3.

All these places for replacement were found by running the following grep patterns on the entire kernel code. Please let me know if this might have missed some instances. This might also have replaced some false positives. I will appreciate suggestions, inputs and review.

1. git grep "nid == -1"
2. git grep "node == -1"
3. git grep "nid = -1"
4. git grep "node = -1"

This patch (of 2):

At present there are multiple places where an invalid node number is encoded as -1. Even though this is implicitly understood, it is always better to have macros for it. Replace these open encodings of an invalid node number with the global macro NUMA_NO_NODE. This helps remove NUMA-related assumptions like 'invalid node' from various places, redirecting them to a common definition.

Link: http://lkml.kernel.org/r/1545127933-10711-2-git-send-email-anshuman.khandual@arm.com
Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Acked-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com> [ixgbe]
Acked-by: Jens Axboe <axboe@kernel.dk> [mtip32xx]
Acked-by: Vinod Koul <vkoul@kernel.org> [dmaengine.c]
Acked-by: Michael Ellerman <mpe@ellerman.id.au> [powerpc]
Acked-by: Doug Ledford <dledford@redhat.com> [drivers/infiniband]
Cc: Joseph Qi <jiangqi903@gmail.com>
Cc: Hans Verkuil <hverkuil@xs4all.nl>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
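The replacement itself is mechanical; for example:

    #include <linux/numa.h>         /* NUMA_NO_NODE (-1) */

    /* before: magic constant */
    int nid = -1;

    /* after: the "no node" intent is explicit and grep-able */
    int nid = NUMA_NO_NODE;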
-
- 05 March 2019 (6 commits)
-
-
By John Hubbard

1. Bug fix: fix an off-by-one error in the code that cleans up when it fails to dma-map a page, after having done a get_user_pages_remote() on a range of pages.

2. Refinement: for that same cleanup code, release_pages() is better than put_page() in a loop.

Signed-off-by: John Hubbard <jhubbard@nvidia.com>
Signed-off-by: Ira Weiny <ira.weiny@intel.com>
Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Acked-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
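A sketch of the release_pages() refinement, the part of this change that survived the later revert noted above (hypothetical variable names; after get_user_pages_remote() pinned npages and dma-mapping failed at index j):

    /* before:
     *      for (k = j; k < npages; k++)
     *              put_page(local_page_list[k]);
     */
    release_pages(&local_page_list[j], npages - j);     /* one batched call */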
-
By YueHaibing

In the commit below, hns_roce_v2_modify_qp() is called inside a spinlock while using GFP_KERNEL. Change it to GFP_ATOMIC.

Fixes: 0425e3e6 ("RDMA/hns: Support flush cqe for hip08 in kernel space")
Signed-off-by: YueHaibing <yuehaibing@huawei.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
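The general rule being applied, as a sketch with hypothetical names: code holding a spinlock must not sleep, and GFP_KERNEL allocations may sleep on memory reclaim.

    spin_lock_irqsave(&hr_qp->sq.lock, flags);

    ctx = kzalloc(sizeof(*ctx), GFP_ATOMIC);    /* GFP_KERNEL could sleep here */
    if (!ctx) {
            spin_unlock_irqrestore(&hr_qp->sq.lock, flags);
            return -ENOMEM;
    }
    /* ... hns_roce_v2_modify_qp(...) ... */

    spin_unlock_irqrestore(&hr_qp->sq.lock, flags);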
-
By Shaobo He

In the function c4iw_dealloc_mw, the value of the variable mhp is printed after it is freed; it is clearer to have the print before the kfree. Otherwise racing threads could allocate another mhp with the same pointer value and create confusing tracing.

Signed-off-by: Shaobo He <shaobo@cs.utah.edu>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
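The shape of the change, as a sketch:

    pr_debug("mhp %p\n", mhp);      /* log first, while the pointer is still ours */
    kfree(mhp);                     /* then free; a racing allocator may now reuse
                                     * this address, but the trace stays truthful */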
-
By Michael J. Ruhl

The RC/UC code path can go through a software loopback. In this code path the receive side QP is manipulated. If two threads are working on the QP receive side (i.e. post_send and a modify_qp to an error state), QP information can be corrupted:

    (post_send via loopback)
        set r_sge
        loop
            update r_sge
    (modify_qp)
        take r_lock
        update r_sge <---- r_sge is now incorrect
    (post_send)
        update r_sge <---- crash, etc.

This can lead to one of the two following crashes:

    BUG: unable to handle kernel NULL pointer dereference at (null)
    IP: hfi1_copy_sge+0xf1/0x2e0 [hfi1]
    PGD 8000001fe6a57067 PUD 1fd9e0c067 PMD 0
    Call Trace:
        ruc_loopback+0x49b/0xbc0 [hfi1]
        hfi1_do_send+0x38e/0x3e0 [hfi1]
        _hfi1_do_send+0x1e/0x20 [hfi1]
        process_one_work+0x17f/0x440
        worker_thread+0x126/0x3c0
        kthread+0xd1/0xe0
        ret_from_fork_nospec_begin+0x21/0x21

or:

    BUG: unable to handle kernel NULL pointer dereference at 0000000000000048
    IP: rvt_clear_mr_refs+0x45/0x370 [rdmavt]
    PGD 80000006ae5eb067 PUD ef15d0067 PMD 0
    Call Trace:
        rvt_error_qp+0xaa/0x240 [rdmavt]
        rvt_modify_qp+0x47f/0xaa0 [rdmavt]
        ib_security_modify_qp+0x8f/0x400 [ib_core]
        ib_modify_qp_with_udata+0x44/0x70 [ib_core]
        modify_qp.isra.23+0x1eb/0x2b0 [ib_uverbs]
        ib_uverbs_modify_qp+0xaa/0xf0 [ib_uverbs]
        ib_uverbs_write+0x272/0x430 [ib_uverbs]
        vfs_write+0xc0/0x1f0
        SyS_write+0x7f/0xf0
        system_call_fastpath+0x1c/0x21

Fix by using the appropriate locking on the receiving QP.

Fixes: 15703461 ("IB/{hfi1, qib, rdmavt}: Move ruc_loopback to rdmavt")
Cc: <stable@vger.kernel.org> #v4.9+
Reviewed-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Michael J. Ruhl <michael.j.ruhl@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
-
By Mike Marciniszyn

The IBTA spec notes:

    o9-5.2.1: For any HCA which supports SEND with Invalidate, upon receiving an IETH, the Invalidate operation must not take place until after the normal transport header validation checks have been successfully completed.

The rdmavt loopback code does the validation after the invalidate. Fix by relocating the operation-specific logic for all SEND variants to after the validity checks.

Cc: <stable@vger.kernel.org> #v4.20+
Reviewed-by: Michael J. Ruhl <michael.j.ruhl@intel.com>
Signed-off-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
-
By Max Gurtovoy

The value returned from ib_dma_map_sg() is saved in the dma_nents variable. To avoid a future mismatch between types, define dma_nents as a signed int instead of unsigned.

Fixes: 57b26497 ("IB/iser: Pass the correct number of entries for dma mapped SGL")
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Reviewed-by: Israel Rukshin <israelr@mellanox.com>
Signed-off-by: Max Gurtovoy <maxg@mellanox.com>
Acked-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
-
- 04 March 2019 (2 commits)
-
-
By Moni Shoua

The write access of an implicit MR is inherited by all of its children, so we must set the correct write access on the parent MR. Pass the full access_flags when creating the umem to let it calculate write access correctly.

Fixes: da6a496a ("IB/mlx5: Ranges in implicit ODP MR inherit its write access")
Signed-off-by: Moni Shoua <monis@mellanox.com>
Reviewed-by: Artemy Kovalyov <artemyko@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
-
By Devesh Sharma

The kernel-space provider driver should clean only those CQs that belong to kernel-space consumers. The current implementation does the reverse. Fix this by avoiding the call to __clean_cq on a kernel QP during destroy.

Fixes: c50866e2 ("bnxt_re: fix the regression due to changes in alloc_pbl")
Signed-off-by: Devesh Sharma <devesh.sharma@broadcom.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
-
- 27 February 2019 (1 commit)
-
-
By Jakub Kicinski

Being able to build devlink as a module causes growing pains. First, all drivers had to add a meta dependency to make sure they are not built in when devlink is built as a module. Now we are struggling to invoke ethtool compat code reliably. Make the devlink code built-in; users can still choose not to build it at all, but the dynamically loadable module option is removed.

Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
-
- 26 February 2019 (1 commit)
-
-
By Leon Romanovsky

There is no need to call kfree(pd), because ib_dealloc_pd() frees the PD internally.

Fixes: 21a428a0 ("RDMA: Handle PD allocations by IB/core")
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
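A sketch of the corrected error path; since the core-allocation change named in the Fixes tag, the PD's memory is owned and freed by ib_dealloc_pd():

    ib_dealloc_pd(pd);      /* frees the PD object itself */
    /* kfree(pd);          <-- removed: would be a double free */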
-
- 23 February 2019 (4 commits)
-
-
By Leon Romanovsky

Following the PD conversion patch, do the same for ucontext allocations.

Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
-
By Dan Carpenter

The first parameter of WARN_ONCE() is a condition; the following parameters are the message. In this case, we left out the condition, so it will just print the ops->type string.

Fixes: 3856ec4b ("RDMA/core: Add RDMA_NLDEV_CMD_NEWLINK/DELLINK support")
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Reviewed-by: Majd Dibbiny <majd@mellanox.com>
Reviewed-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
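The API shape, illustrated (the corrected condition shown is illustrative, not the exact upstream line):

    /* bug: the format string slid into the condition slot, so ops->type
     * was interpreted as the format */
    WARN_ONCE("unsupported device type %s\n", ops->type);

    /* fixed shape: condition first, then the message */
    WARN_ONCE(1, "unsupported device type %s\n", ops->type);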
-
By Devesh Sharma

While adding the use of the for_each_sg_dma_page iterator for Broadcom's RDMA driver, a regression was introduced in the __alloc_pbl path. The change left bnxt_re in a DOA state in the for-next branch. Fix the regression to avoid the host crash when a user-space object is created, by restricting the unconditional access to hwq.pg_arr to the case where hwq is initialized for user-space objects.

Fixes: 161ebe24 ("RDMA/bnxt_re: Use for_each_sg_dma_page iterator on umem SGL")
Reported-by: Gal Pressman <galpress@amazon.com>
Signed-off-by: Selvin Xavier <selvin.xavier@broadcom.com>
Signed-off-by: Devesh Sharma <devesh.sharma@broadcom.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
-
By Håkon Bugge

Using CX-3 virtual functions, either from a bare-metal machine or passed through from a VM, MAD packets are proxied through the PF driver. Since the VF drivers have separate name spaces for MAD Transaction Ids (TIDs), the PF driver has to re-map the TIDs and keep the bookkeeping in a cache.

Following the RDMA Connection Manager (CM) protocol, it is clear when an entry has to be evicted from the cache. But life is not perfect: remote peers may die or be rebooted. Hence, there is a timeout to wipe out a cache entry, after which the PF driver assumes the remote peer has gone.

During workloads where a high number of QPs are destroyed concurrently, an excessive number of CM DREQ retries has been observed.

The problem can be demonstrated in a bare-metal environment where two nodes have instantiated 8 VFs each. This uses dual-ported HCAs, so there are 16 vPorts per physical server. 64 processes are associated with each vPort, and each creates and destroys one QP for each of the remote 64 processes: that is, 1024 QPs per vPort, 16K QPs in all. The QPs are created/destroyed using the CM.

When tearing down these 16K QPs, excessive CM DREQ retries (and duplicates) are observed. With some cat/paste/awk wizardry on the infiniband_cm sysfs, we observe as the sum over the 16 vPorts on one of the nodes:

    cm_rx_duplicates:
          dreq  2102
    cm_rx_msgs:
          drep  1989
          dreq  6195
           rep  3968
           req  4224
           rtu  4224
    cm_tx_msgs:
          drep  4093
          dreq 27568
           rep  4224
           req  3968
           rtu  3968
    cm_tx_retries:
          dreq 23469

Note that the active/passive side is equally distributed between the two nodes. Enabling pr_debug in cm.c gives tons of:

    [171778.814239] <mlx4_ib> mlx4_ib_multiplex_cm_handler: id{slave: 1,sl_cm_id: 0xd393089f} is NULL!

By increasing CM_CLEANUP_CACHE_TIMEOUT from 5 to 30 seconds, the tear-down phase of the application is reduced from approximately 90 to 50 seconds. Retries/duplicates are also significantly reduced:

    cm_rx_duplicates:
          dreq  2460
    []
    cm_tx_retries:
          dreq  3010
           req    47

Increasing the timeout further didn't help, as these duplicates and retries stem from a too-short CMA timeout, which was 20 (~4 seconds) on the systems. By increasing the CMA timeout to 22 (~17 seconds), both numbers fell to about 10. Adjustment of the CMA timeout is not part of this commit.

Signed-off-by: Håkon Bugge <haakon.bugge@oracle.com>
Acked-by: Jack Morgenstein <jackm@dev.mellanox.co.il>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
-
- 22 February 2019 (5 commits)
-
-
By Moni Shoua

It is possible that, during page-fault handling, the process that owns the MR is terminating. The indication of this is a failure to get the task_struct or to take a reference on the mm_struct. In this case, just abort the page-fault handler with an error, but without a warning to the kernel log.

Signed-off-by: Moni Shoua <monis@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
-
By Moni Shoua

When prefetching an ODP MR, it is required to verify that the PD of the MR is identical to the PD with which the advise_mr request arrived. This check was missing from the synchronous flow and is added now.

Fixes: 813e90b1 ("IB/mlx5: Add advise_mr() support")
Reported-by: Parav Pandit <parav@mellanox.com>
Signed-off-by: Moni Shoua <monis@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
-
By Moni Shoua

When deferring a prefetch request we need to protect against the MR or PD being destroyed while the request is still enqueued. The first step is to validate that the PD owns the lkey that describes the MR, and that the MR the lkey refers to is owned by that PD. The second step is to dequeue all requests when the MR is destroyed. Since a PD can't be destroyed while it owns MRs, it is guaranteed that when a worker wakes up, the request it refers to is still valid. It is now also possible to refrain from taking a reference on the device, since it is assured to be present via the PD. While at it, replace the dedicated ordered workqueue with the system unbound workqueue, to reuse an existing resource and improve performance. This also fixes a bug of queueing to the wrong workqueue.

Fixes: 813e90b1 ("IB/mlx5: Add advise_mr() support")
Reported-by: Parav Pandit <parav@mellanox.com>
Signed-off-by: Moni Shoua <monis@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
-
By Leon Romanovsky

The IB_MR_REREG_PD command rewrites mr->pd after a successful rereg_user_mr(); this change loses the usecnt information and produces the following warning:

    WARNING: CPU: 1 PID: 1771 at drivers/infiniband/core/verbs.c:336 ib_dealloc_pd+0x4e/0x60 [ib_core]
    CPU: 1 PID: 1771 Comm: rereg_mr Tainted: G W OE 5.0.0-rc7-for-upstream-perf-2019-02-20_14-03-40-34 #1
    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1ubuntu1 04/01/2014
    RIP: 0010:ib_dealloc_pd+0x4e/0x60 [ib_core]
    RSP: 0018:ffffc90003923dc0 EFLAGS: 00010286
    RAX: 00000000ffffffff RBX: ffff88821f7f0400 RCX: ffff888236a40c00
    RDX: ffff88821f7f0400 RSI: 0000000000000001 RDI: 0000000000000000
    RBP: 0000000000000001 R08: ffff88835f665d80 R09: ffff8882209c90d8
    R10: ffff88835ec003e0 R11: 0000000000000000 R12: ffff888221680ba0
    R13: ffff888221680b00 R14: 00000000ffffffea R15: ffff88821f53c318
    FS: 00007f70db11e740(0000) GS:ffff88835f640000(0000) knlGS:0000000000000000
    CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    CR2: 0000000001dfd030 CR3: 000000029d9d8000 CR4: 00000000000006e0
    DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
    DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
    Call Trace:
        uverbs_free_pd+0x2d/0x30 [ib_uverbs]
        destroy_hw_idr_uobject+0x16/0x40 [ib_uverbs]
        uverbs_destroy_uobject+0x28/0x170 [ib_uverbs]
        __uverbs_cleanup_ufile+0x6b/0x90 [ib_uverbs]
        uverbs_destroy_ufile_hw+0x8b/0x110 [ib_uverbs]
        ib_uverbs_close+0x1f/0x80 [ib_uverbs]
        __fput+0xb1/0x220
        task_work_run+0x7f/0xa0
        exit_to_usermode_loop+0x6b/0xb2
        do_syscall_64+0xc5/0x100
        entry_SYSCALL_64_after_hwframe+0x44/0xa9
    RIP: 0033:0x7f70dad00664

Fixes: e278173f ("RDMA/core: Cosmetic change - move member initialization to correct block")
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Reviewed-by: Majd Dibbiny <majd@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
-
By Gustavo A. R. Silva

Fix the following warning by adding a missing break:

    drivers/infiniband/hw/hfi1/tid_rdma.c: In function ‘hfi1_tid_rdma_wqe_interlock’:
    drivers/infiniband/hw/hfi1/tid_rdma.c:3251:3: warning: this statement may fall through [-Wimplicit-fallthrough=]
       switch (prev->wr.opcode) {
       ^~~~~~
    drivers/infiniband/hw/hfi1/tid_rdma.c:3259:2: note: here
      case IB_WR_RDMA_READ:
      ^~~~

Warning level 3 was used: -Wimplicit-fallthrough=3. This patch is part of the ongoing efforts to enable -Wimplicit-fallthrough.

Fixes: c6c23117 ("IB/hfi1: Add interlock between TID RDMA WRITE and other requests")
Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>
Reviewed-by: Kaike Wan <Kaike.wan@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
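The fix shape, sketched (case labels abbreviated and illustrative; only the added break matters):

    switch (prev->wr.opcode) {
    case IB_WR_TID_RDMA_WRITE:
            /* ... interlock handling ... */
            break;          /* added: control no longer falls into the next case */
    case IB_WR_RDMA_READ:
            /* ... */
            break;
    }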
-
- 21 February 2019 (1 commit)
-
-
By Davidlohr Bueso

The current check does not take into account the previous value of pinned_vm and is thus quite bogus as-is. Fix this by checking the new value after the (optimistic) atomic increment.

Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
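A sketch of the corrected pattern (names approximate the umem pinning path):

    new_pinned = atomic64_add_return(npages, &mm->pinned_vm);
    if (new_pinned > lock_limit && !capable(CAP_IPC_LOCK)) {
            atomic64_sub(npages, &mm->pinned_vm);   /* roll back the optimistic inc */
            ret = -ENOMEM;
            goto out;
    }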
-
- 20 February 2019 (6 commits)
-
-
By Noa Osherovich

Before calling the provider's alloc_mw function, verify that the given memory window type is either IB_MW_TYPE_1 or IB_MW_TYPE_2.

Signed-off-by: Noa Osherovich <noaos@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
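A sketch of the guard (placement and surrounding code hypothetical):

    if (type != IB_MW_TYPE_1 && type != IB_MW_TYPE_2)
            return ERR_PTR(-EINVAL);

    mw = pd->device->ops.alloc_mw(pd, type, udata);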
-
By Leon Romanovsky

The strlen() check at the beginning of iw_cm_map() ensures that the devname and ifname strings are shorter than the destinations to which they are copied. Change the strncpy() call to strcpy(), because we are protected from overflow, and zero the entire string buffer to avoid copying uninitialized kernel stack memory to userspace.

This fixes the compilation warning below:

    In file included from ./include/linux/dma-mapping.h:6,
                     from drivers/infiniband/core/iwcm.c:38:
    In function 'strncpy',
        inlined from 'iw_cm_map' at drivers/infiniband/core/iwcm.c:519:2:
    ./include/linux/string.h:253:9: warning: '__builtin_strncpy' specified bound 32 equals destination size [-Wstringop-truncation]
      return __builtin_strncpy(p, q, size);
             ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Fixes: d53ec8af ("RDMA/iwcm: Don't copy past the end of dev_name() string")
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
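A sketch of the resulting copy (field names approximate iw_cm_map()):

    /* lengths were validated with strlen() above, so strcpy() cannot
     * overflow; zeroing first keeps uninitialized stack bytes out of the
     * tail that is later copied to userspace */
    memset(pm_reg_msg.dev_name, 0, sizeof(pm_reg_msg.dev_name));
    strcpy(pm_reg_msg.dev_name, dev_name(&cm_id->device->dev));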
-
By Yuval Shaia

old_pd is used only if the IB_MR_REREG_PD flag is set; for readability, move its initialization to where it is used. While there, rewrite the whole 'if-else' block so that on error we jump directly to a label and no 'else' is needed.

Signed-off-by: Yuval Shaia <yuval.shaia@oracle.com>
Reviewed-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
-
By Wei Yongjun

Fixes the following sparse warning:

    drivers/infiniband/hw/cxgb4/cm.c:658:6: warning: symbol 'read_tcb' was not declared. Should it be static?

Fixes: 11a27e21 ("iw_cxgb4: complete the cached SRQ buffers")
Signed-off-by: Wei Yongjun <weiyongjun1@huawei.com>
Acked-by: Raju Rangoju <rajur@chelsio.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
-
By Yangyang Li

The method of setting HEM for the SCC context differs from that of other contexts: the hardware must be notified with the detailed index in BT0 for SCC, while for other contexts it only needs to be notified of the BT step and the hardware calculates the index itself. This fixes the following error when unloading the hip08 driver:

    [ 123.570768] {1}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 0
    [ 123.579023] {1}[Hardware Error]: event severity: recoverable
    [ 123.584670] {1}[Hardware Error]: Error 0, type: recoverable
    [ 123.590317] {1}[Hardware Error]: section_type: PCIe error
    [ 123.595877] {1}[Hardware Error]: version: 4.0
    [ 123.600395] {1}[Hardware Error]: command: 0x0006, status: 0x0010
    [ 123.606562] {1}[Hardware Error]: device_id: 0000:7d:00.0
    [ 123.612034] {1}[Hardware Error]: slot: 0
    [ 123.616120] {1}[Hardware Error]: secondary_bus: 0x00
    [ 123.621245] {1}[Hardware Error]: vendor_id: 0x19e5, device_id: 0xa222
    [ 123.627847] {1}[Hardware Error]: class_code: 000002
    [ 123.632977] hns3 0000:7d:00.0: aer_status: 0x00000000, aer_mask: 0x00000000
    [ 123.639928] hns3 0000:7d:00.0: aer_layer=Transaction Layer, aer_agent=Receiver ID
    [ 123.647400] hns3 0000:7d:00.0: aer_uncor_severity: 0x00000000
    [ 123.653136] hns3 0000:7d:00.0: PCI error detected, state(=1)!!
    [ 123.658959] hns3 0000:7d:00.0: ROCEE uncorrected RAS error identified
    [ 123.665395] hns3 0000:7d:00.0: ROCEE RAS AXI rresp error
    [ 123.670713] hns3 0000:7d:00.0: requesting reset due to PCI error
    [ 123.676715] hns3 0000:7d:00.0: received reset event , reset type is 5
    [ 123.683147] hns3 0000:7d:00.0: AER: Device recovery successful
    [ 123.688978] hns3 0000:7d:00.0: PF Reset requested
    [ 123.693684] hns3 0000:7d:00.0: PF failed(=-5) to send mailbox message to VF
    [ 123.700633] hns3 0000:7d:00.0: inform reset to vf(1) failded -5!

Fixes: 6a157f7d ("RDMA/hns: Add SCC context allocation support for hip08")
Signed-off-by: Yangyang Li <liyangyang20@huawei.com>
Reviewed-by: Yixian Liu <liuyixian@huawei.com>
Reviewed-by: Lijun Ou <oulijun@huawei.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
-
By Lijun Ou

According to hip08's limitations, the QP and CQ specifications are each 1M, and the MTPT specification is 1M in kernel space.

Signed-off-by: Yangyang Li <liyangyang20@huawei.com>
Signed-off-by: Lijun Ou <oulijun@huawei.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
-