Commits · 6a40693a884dacae68c1771d369ad3be0594ba1c · openeuler / Kernel

04 Apr, 2019 2 commits

IB/hfi1: Add a function to read next expected psn from hardware flow · 6a40693a

Committed by Kaike Wan on Mar 18, 2019

This patch adds a function to read next expected KDETH PSN from hardware
flow to simplify the code.

Reviewed-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Reviewed-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Reviewed-by: Michael J. Ruhl <michael.j.ruhl@intel.com>
Signed-off-by: Kaike Wan <kaike.wan@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>

IB/hfi1: Delay the release of destination mr for TID RDMA WRITE DATA · f6f3f532

Committed by Kaike Wan on Mar 18, 2019

The reference of destination memory region is first obtained when TID RDMA
WRITE request is first received on the responder side. This reference is
released once all TID RDMA WRITE RESP packets are sent to the requester
side, even though not all TID RDMA WRITE DATA packets may have been
received. This early release is especially undesirable if the software
needs to access the destination memory before the last data packet is
received.

This patch delays the release of the MR until all TID RDMA DATA packets
have been received. A helper function to release the reference is also
created to simplify the code.

Reviewed-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Reviewed-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Reviewed-by: Michael J. Ruhl <michael.j.ruhl@intel.com>
Signed-off-by: Kaike Wan <kaike.wan@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>

28 Mar, 2019 4 commits

IB/hfi1: Fix the allocation of RSM table · d0294344

Committed by Kaike Wan on Mar 18, 2019

The receive side mapping (RSM) on hfi1 hardware is a special
matching mechanism to direct an incoming packet to a given
hardware receive context. It has 4 instances of matching capabilities
(RSM0 - RSM3) that share the same RSM table (RMT). The RMT has a total of
256 entries, each of which points to a receive context.

Currently, three instances of RSM have been used:
1. RSM0 by QOS;
2. RSM1 by PSM FECN;
3. RSM2 by VNIC.

Each RSM instance should reserve enough entries in RMT to function
properly. Since both PSM and VNIC could allocate any receive context
between dd->first_dyn_alloc_ctxt and dd->num_rcv_contexts, PSM FECN must
reserve enough RMT entries to cover the entire receive context index
range (dd->num_rcv_contexts - dd->first_dyn_alloc_ctxt) instead of only
the user receive contexts allocated for PSM
(dd->num_user_contexts). Consequently, the sizing of
dd->num_user_contexts in set_up_context_variables is incorrect.

Fixes: 2280740f ("IB/hfi1: Virtual Network Interface Controller (VNIC) HW support")
Reviewed-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Reviewed-by: Michael J. Ruhl <michael.j.ruhl@intel.com>
Signed-off-by: Kaike Wan <kaike.wan@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>

IB/hfi1: Eliminate opcode tests on mr deref · a8639a79

Committed by Kaike Wan on Mar 18, 2019

When an old ack_queue entry is reused to store an incoming request, the
old entry may need to be cleaned up if it is still referencing the
MR. Originally only RDMA READ request needed to reference MR on the
responder side and therefore the opcode was tested when cleaning up the
old entry. The introduction of tid rdma specific operations in the
ack_queue makes the specific opcode tests wrong. Multiple opcodes (RDMA
READ, TID RDMA READ, and TID RDMA WRITE) may need MR ref cleanup.

Remove the opcode specific tests associated with the ack_queue.

Fixes: f48ad614 ("IB/hfi1: Move driver out of staging")
Signed-off-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Kaike Wan <kaike.wan@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>

IB/hfi1: Clear the IOWAIT pending bits when QP is put into error state · 93b289b9

Committed by Kaike Wan on Mar 18, 2019

When a QP is put into error state, it may be waiting for send engine
resources. In this case, the QP will be removed from the send engine's
waiting list, but its IOWAIT pending bits are not cleared. This will
normally not have any major impact as the QP is being destroyed. However,
the QP still needs to wind down its operations, such as draining the send
queue by scheduling the send engine. Clearing the pending bits will avoid
any potential complications. In addition, if the QP eventually hangs,
clearing the pending bits can help debugging by presenting a consistent
picture if the user dumps the qp_stats.

This patch clears a QP's IOWAIT_PENDING_IB and IOWAIT_PENDING_TID bits in
priv->s_iowait.flags in this case.

Fixes: 5da0fc9d ("IB/hfi1: Prepare resource waits for dual leg")
Reviewed-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Reviewed-by: Alex Estrin <alex.estrin@intel.com>
Signed-off-by: Kaike Wan <kaike.wan@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>

IB/hfi1: Failed to drain send queue when QP is put into error state · 662d6646

Committed by Kaike Wan on Mar 18, 2019

When a QP is put into error state, all pending requests in the send work
queue should be drained. The following sequence of events could lead to a
failure, causing a request to hang:

(1) The QP builds a packet and tries to send through SDMA engine.
    However, PIO engine is still busy. Consequently, this packet is put on
    the QP's tx list and the QP is put on the PIO waiting list. The field
    qp->s_flags is set with HFI1_S_WAIT_PIO_DRAIN;

(2) The QP is put into error state by the user application and
    notify_error_qp() is called, which removes the QP from the PIO waiting
    list and the packet from the QP's tx list. In addition, qp->s_flags is
    cleared of RVT_S_ANY_WAIT_IO bits, which does not include
    HFI1_S_WAIT_PIO_DRAIN bit;

(3) The hfi1_schedule_send() function is called to drain the QP's send
    queue. Subsequently, hfi1_do_send() is called. Since the flag bit
    HFI1_S_WAIT_PIO_DRAIN is set in qp->s_flags, hfi1_send_ok() fails.  As
    a result, hfi1_do_send() bails out without draining any request from
    the send queue;

(4) The PIO engine completes the sending and tries to wake up any QP on
    its waiting list. But the QP has been removed from the PIO waiting
    list and therefore sleeps forever.

The fix is to clear qp->s_flags of HFI1_S_ANY_WAIT_IO bits in step (2).
HFI1_S_ANY_WAIT_IO includes RVT_S_ANY_WAIT_IO and HFI1_S_WAIT_PIO_DRAIN.
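The effect of the narrower mask can be modelled in a few lines. This is only a sketch: the bit values below are hypothetical and do not match the driver's headers, and send_ok() is a simplified stand-in for the real eligibility check.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical bit values for illustration only; the real definitions
 * live in the rdmavt and hfi1 headers and differ from these. */
#define RVT_S_ANY_WAIT_IO      0x00ffu  /* generic wait-for-I/O bits */
#define HFI1_S_WAIT_PIO_DRAIN  0x0100u  /* device-specific wait bit */
#define HFI1_S_ANY_WAIT_IO     (RVT_S_ANY_WAIT_IO | HFI1_S_WAIT_PIO_DRAIN)

/* Simplified stand-in for the send check: any wait bit blocks sending. */
static bool send_ok(uint32_t s_flags)
{
    return (s_flags & HFI1_S_ANY_WAIT_IO) == 0;
}

/* Old behaviour: only the generic wait bits are cleared on error. */
static uint32_t clear_wait_old(uint32_t s_flags)
{
    return s_flags & ~RVT_S_ANY_WAIT_IO;
}

/* Fixed behaviour: the device-specific wait bit is cleared as well. */
static uint32_t clear_wait_fixed(uint32_t s_flags)
{
    return s_flags & ~HFI1_S_ANY_WAIT_IO;
}
```

With HFI1_S_WAIT_PIO_DRAIN set, clear_wait_old() leaves the bit in place and send_ok() keeps failing, which corresponds to step (3); clear_wait_fixed() removes it so the send queue can be drained.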

Fixes: 2e2ba09e ("IB/rdmavt, IB/hfi1: Create device dependent s_flags")
Cc: <stable@vger.kernel.org> # 4.19.x+
Reviewed-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Reviewed-by: Alex Estrin <alex.estrin@intel.com>
Signed-off-by: Kaike Wan <kaike.wan@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>

07 Mar, 2019 1 commit

IB/hfi1: Close race condition on user context disable and close · bc5add09

Committed by Michael J. Ruhl on Feb 26, 2019

When disabling and removing a receive context, it is possible for an
asynchronous event (e.g., an IRQ) to occur.  Because of this, there is a
race between cleaning up the context and the context being used by the
asynchronous event.

cpu 0  (context cleanup)
    rc->ref_count-- (ref_count == 0)
    hfi1_rcd_free()
cpu 1  (IRQ (with rcd index))
    rcd_get_by_index()
    lock
    ref_count++     <-- reference count race (WARNING)
    return rcd
    unlock
cpu 0
    hfi1_free_ctxtdata() <-- incorrect free location
    lock
    remove rcd from array
    unlock
    free rcd

This race will cause the following WARNING trace:

WARNING: CPU: 0 PID: 175027 at include/linux/kref.h:52 hfi1_rcd_get_by_index+0x84/0xa0 [hfi1]
CPU: 0 PID: 175027 Comm: IMB-MPI1 Kdump: loaded Tainted: G OE ------------ 3.10.0-957.el7.x86_64 #1
Hardware name: Intel Corporation S2600KP/S2600KP, BIOS SE5C610.86B.11.01.0076.C4.111920150602 11/19/2015
Call Trace:
  dump_stack+0x19/0x1b
  __warn+0xd8/0x100
  warn_slowpath_null+0x1d/0x20
  hfi1_rcd_get_by_index+0x84/0xa0 [hfi1]
  is_rcv_urgent_int+0x24/0x90 [hfi1]
  general_interrupt+0x1b6/0x210 [hfi1]
  __handle_irq_event_percpu+0x44/0x1c0
  handle_irq_event_percpu+0x32/0x80
  handle_irq_event+0x3c/0x60
  handle_edge_irq+0x7f/0x150
  handle_irq+0xe4/0x1a0
  do_IRQ+0x4d/0xf0
  common_interrupt+0x162/0x162

The race can also lead to a use after free which could be similar to:

general protection fault: 0000 1 SMP
CPU: 71 PID: 177147 Comm: IMB-MPI1 Kdump: loaded Tainted: G W OE ------------ 3.10.0-957.el7.x86_64 #1
Hardware name: Intel Corporation S2600KP/S2600KP, BIOS SE5C610.86B.11.01.0076.C4.111920150602 11/19/2015
task: ffff9962a8098000 ti: ffff99717a508000 task.ti: ffff99717a508000 __kmalloc+0x94/0x230
Call Trace:
  ? hfi1_user_sdma_process_request+0x9c8/0x1250 [hfi1]
  hfi1_user_sdma_process_request+0x9c8/0x1250 [hfi1]
  hfi1_aio_write+0xba/0x110 [hfi1]
  do_sync_readv_writev+0x7b/0xd0
  do_readv_writev+0xce/0x260
  ? handle_mm_fault+0x39d/0x9b0
  ? pick_next_task_fair+0x5f/0x1b0
  ? sched_clock_cpu+0x85/0xc0
  ? __schedule+0x13a/0x890
  vfs_writev+0x35/0x60
  SyS_writev+0x7f/0x110
  system_call_fastpath+0x22/0x27

Use the appropriate kref API to verify access.

Reorder context cleanup so that the context is removed from the array
before it is freed.
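The "get only if still live" idea behind the kref check can be modelled in user space with C11 atomics. This is a sketch under stated assumptions: struct ctx_ref and both helpers are hypothetical names, and the kernel's own primitive for this pattern is kref_get_unless_zero(), which this code only imitates.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical user-space stand-in for a kref-style reference count. */
struct ctx_ref {
    atomic_int count;
};

/* Take a reference only if the object is still live (count > 0). */
static bool ref_get_unless_zero(struct ctx_ref *r)
{
    int old = atomic_load(&r->count);

    while (old != 0) {
        /* On failure, old is refreshed with the current value; retry. */
        if (atomic_compare_exchange_weak(&r->count, &old, old + 1))
            return true;
    }
    return false; /* count already hit zero: the object is being freed */
}

/* Drop a reference; returns true if the caller dropped the last one. */
static bool ref_put(struct ctx_ref *r)
{
    return atomic_fetch_sub(&r->count, 1) == 1;
}
```

In the race shown earlier, the IRQ path on cpu 1 would get false from ref_get_unless_zero() once cpu 0 has dropped the last reference, instead of incrementing a count that is already zero.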

Cc: stable@vger.kernel.org # v4.14.0+
Fixes: f683c80c ("IB/hfi1: Resolve kernel panics by reference counting receive contexts")
Reviewed-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Michael J. Ruhl <michael.j.ruhl@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>

06 Mar, 2019 1 commit

mm: replace all open encodings for NUMA_NO_NODE · 98fa15f3

Committed by Anshuman Khandual on Mar 05, 2019

Patch series "Replace all open encodings for NUMA_NO_NODE", v3.

All these places for replacement were found by running the following
grep patterns on the entire kernel code.  Please let me know if this
might have missed some instances.  This might also have replaced some
false positives.  I will appreciate suggestions, inputs and review.

1. git grep "nid == -1"
2. git grep "node == -1"
3. git grep "nid = -1"
4. git grep "node = -1"

This patch (of 2):

At present there are multiple places where an invalid node number is
encoded as -1.  Even though this is implicitly understood, it is better to
use a macro.  Replace these open encodings for an invalid node
number with the global macro NUMA_NO_NODE.  This helps remove NUMA
related assumptions like 'invalid node' from various places redirecting
them to a common definition.
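As an illustration of the kind of replacement described, here is a small sketch. pick_node() is a hypothetical helper, not a function from the patch, and NUMA_NO_NODE is redefined locally only so the snippet compiles outside the kernel.

```c
/* The kernel provides NUMA_NO_NODE as (-1); it is redefined here only
 * so that this sketch builds in user space. */
#define NUMA_NO_NODE (-1)

/* Hypothetical helper: fall back to the local node when the caller
 * did not request a specific one. */
static int pick_node(int requested_nid, int local_nid)
{
    /* Before the cleanup this would read: if (requested_nid == -1) */
    if (requested_nid == NUMA_NO_NODE)
        return local_nid;
    return requested_nid;
}
```

The behaviour is unchanged; only the meaning of the sentinel is now spelled out at the point of use.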

Link: http://lkml.kernel.org/r/1545127933-10711-2-git-send-email-anshuman.khandual@arm.com
Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Acked-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com> [ixgbe]
Acked-by: Jens Axboe <axboe@kernel.dk> [mtip32xx]
Acked-by: Vinod Koul <vkoul@kernel.org> [dmaengine.c]
Acked-by: Michael Ellerman <mpe@ellerman.id.au> [powerpc]
Acked-by: Doug Ledford <dledford@redhat.com> [drivers/infiniband]
Cc: Joseph Qi <jiangqi903@gmail.com>
Cc: Hans Verkuil <hverkuil@xs4all.nl>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

22 Feb, 2019 1 commit

IB/hfi1: Add missing break in switch statement · 7264235e

Committed by Gustavo A. R. Silva on Feb 20, 2019

Fix the following warning by adding a missing break:

drivers/infiniband/hw/hfi1/tid_rdma.c: In function ‘hfi1_tid_rdma_wqe_interlock’:
drivers/infiniband/hw/hfi1/tid_rdma.c:3251:3: warning: this statement may fall through [-Wimplicit-fallthrough=]
   switch (prev->wr.opcode) {
   ^~~~~~
drivers/infiniband/hw/hfi1/tid_rdma.c:3259:2: note: here
  case IB_WR_RDMA_READ:
  ^~~~

Warning level 3 was used: -Wimplicit-fallthrough=3

This patch is part of the ongoing efforts to enable
-Wimplicit-fallthrough.

Fixes: c6c23117 ("IB/hfi1: Add interlock between TID RDMA WRITE and other requests")
Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>
Reviewed-by: Kaike Wan <Kaike.wan@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>

16 Feb, 2019 1 commit

IB/hfi1: Fix a build warning for TID RDMA READ · e50838c2

Committed by Kaike Wan on Feb 15, 2019

The following build warning was produced for the TID RDMA READ
patch ("IB/hfi1: Enable TID RDMA READ protocol"):

drivers/infiniband/hw/hfi1/qp.c: In function 'hfi1_setup_wqe':
drivers/infiniband/hw/hfi1/qp.c:328:3: warning: this statement may fall through [-Wimplicit-fallthrough=]
   hfi1_setup_tid_rdma_wqe(qp, wqe);
   ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
drivers/infiniband/hw/hfi1/qp.c:329:2: note: here
  case IB_QPT_UC:
  ^~~~

This patch will fix the issue by adding the "fall through" comment.

Fixes: f1ab4efa ("IB/hfi1: Enable TID RDMA READ protocol")
Reviewed-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Kaike Wan <kaike.wan@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>

08 Feb, 2019 2 commits

drivers/IB,hfi1: do not use mmap_sem · 0e15c253

Committed by Davidlohr Bueso on Feb 06, 2019

This driver already uses gup_fast() and thus we can just drop the mmap_sem
protection around the pinned_vm counter. Note that the window between when
hfi1_can_pin_pages() is called and the actual counter is incremented
remains the same as mmap_sem was _only_ used for when ->pinned_vm was
touched.

Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>

mm: make mm->pinned_vm an atomic64 counter · 70f8a3ca

Committed by Davidlohr Bueso on Feb 06, 2019

Taking a sleeping lock to _only_ increment a variable is quite the
overkill, and pretty much all users do this. Furthermore, some drivers
(e.g., infiniband and scif) that need pinned semantics can go to quite
some trouble to actually delay via workqueue (un)accounting for pinned
pages when it is not possible to acquire the lock.

By making the counter atomic we no longer need to hold the mmap_sem and
can simplify some code around it for pinned_vm users. The counter is 64-bit
such that we need not worry about overflows such as rdma user input
controlled from userspace.

Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Reviewed-by: Christoph Lameter <cl@linux.com>
Reviewed-by: Daniel Jordan <daniel.m.jordan@oracle.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
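The lock-free accounting that the pinned_vm change enables can be sketched in user space with a C11 64-bit atomic. All names here (struct mm_model, pin_pages, unpin_pages) and the limit check are assumptions for illustration; the kernel uses its own atomic64_t type and helpers rather than <stdatomic.h>.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model of an mm with an atomic pinned-page counter. */
struct mm_model {
    _Atomic int64_t pinned_vm;
};

/* Account npages without taking any lock; undo if the limit is exceeded. */
static bool pin_pages(struct mm_model *mm, int64_t npages, int64_t limit)
{
    int64_t now = atomic_fetch_add(&mm->pinned_vm, npages) + npages;

    if (now > limit) {
        atomic_fetch_sub(&mm->pinned_vm, npages);
        return false;
    }
    return true;
}

/* Release previously accounted pages. */
static void unpin_pages(struct mm_model *mm, int64_t npages)
{
    atomic_fetch_sub(&mm->pinned_vm, npages);
}
```

Because each update is a single atomic operation, no sleeping lock is needed just to adjust the counter, which is the simplification the commit message describes.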

06 Feb, 2019 28 commits

IB/hfi1: Prioritize the sending of ACK packets · 34025fb0

Committed by Kaike Wan on Jan 23, 2019

ACK packets are generally associated with request completion and resource
release and therefore should be sent first. This patch optimizes the
send engine by using the following policies:
(1) QPs with RVT_S_ACK_PENDING bit set in qp->s_flags or qpriv->s_flags
should have their priority incremented;
(2) QPs with ACK or TID-ACK packet queued should have their priority
incremented;
(3) When a QP is queued to the wait list due to resource constraints, it
will be queued to the head if it has ACK packet to send;
(4) When selecting QPs to run from the wait list, the one with the highest
priority and starve_cnt will be selected; each priority will be equivalent
to a fixed number of starve_cnt (16).

Reviewed-by: Mitko Haralanov <mitko.haralanov@intel.com>
Signed-off-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Kaike Wan <kaike.wan@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
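Policy (4) of the ACK prioritization commit can be read as a combined score in which one priority step is worth 16 units of starve_cnt. The sketch below shows that reading only; struct waiter, waiter_score() and pick_waiter() are hypothetical names, and the driver's actual selection logic may be structured differently.

```c
#include <stddef.h>
#include <stdint.h>

/* One priority level counts as 16 starve_cnt units (value from the
 * commit message; the macro name is hypothetical). */
#define STARVE_PER_PRIORITY 16u

struct waiter {
    uint32_t priority;
    uint32_t starve_cnt;
};

/* Fold priority and starvation into a single comparable score. */
static uint32_t waiter_score(const struct waiter *w)
{
    return w->priority * STARVE_PER_PRIORITY + w->starve_cnt;
}

/* Return the index of the best waiter, or -1 for an empty list.
 * Ties go to the earliest entry. */
static int pick_waiter(const struct waiter *list, size_t n)
{
    int best = -1;

    for (size_t i = 0; i < n; i++) {
        if (best < 0 || waiter_score(&list[i]) > waiter_score(&list[best]))
            best = (int)i;
    }
    return best;
}
```

Under this reading, a QP with priority 1 and no starvation outranks a priority-0 QP that has been starved fewer than 16 times, but not one that has been starved longer.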

IB/hfi1: Add static trace for TID RDMA WRITE protocol · a05c9bdc

Committed by Kaike Wan on Jan 23, 2019

This patch makes the following changes to the static trace:
1. Adds the decoding of TID RDMA WRITE packets in IB header trace;
2. Adds trace events for various stages of the TID RDMA WRITE
protocol. These events provide a fine-grained control for monitoring
and debugging the hfi1 driver in the field.

Reviewed-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Kaike Wan <kaike.wan@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>

IB/hfi1: Enable TID RDMA WRITE protocol · ad00889e

Committed by Kaike Wan on Jan 23, 2019

This patch enables TID RDMA WRITE protocol by converting a qualified
RDMA WRITE request into a TID RDMA WRITE request internally:
(1) The TID RDMA capability must be enabled;
(2) The request must start on a 4K page boundary;
(3) The request length must be a multiple of 4K and must be greater than or
equal to 256K.

Signed-off-by: Mitko Haralanov <mitko.haralanov@intel.com>
Signed-off-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Kaike Wan <kaike.wan@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
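The three qualification conditions for converting an RDMA WRITE into a TID RDMA WRITE can be written as a single predicate. This is a sketch: the function and macro names are hypothetical, and the driver may check further conditions that the commit message does not list.

```c
#include <stdbool.h>
#include <stdint.h>

#define PAGE_4K            (4u * 1024u)    /* 4K page size */
#define TID_WRITE_MIN_LEN  (256u * 1024u)  /* minimum qualifying length */

/* Hypothetical predicate mirroring conditions (1) to (3). */
static bool qualifies_for_tid_write(bool tid_rdma_enabled,
                                    uint64_t addr, uint32_t len)
{
    return tid_rdma_enabled &&
           (addr % PAGE_4K) == 0 &&   /* starts on a 4K boundary */
           (len % PAGE_4K) == 0 &&    /* length is a multiple of 4K */
           len >= TID_WRITE_MIN_LEN;  /* at least 256K */
}
```

A request that fails any one of the checks would simply go out as a normal RDMA WRITE.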

IB/hfi1: Add interlock between TID RDMA WRITE and other requests · c6c23117

Committed by Kaike Wan on Jan 23, 2019

This locking mechanism is designed to prevent various memory corruption
scenarios from occurring when requests are pipelined, especially when
RDMA WRITE requests are interleaved with TID RDMA READ requests:
1. READ-AFTER-READ;
2. READ-AFTER-WRITE;
3. WRITE-AFTER-READ;
4. WRITE-AFTER-WRITE.
When memory corruption is likely, a request will be held back until
previous requests have been completed.

Reviewed-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Mitko Haralanov <mitko.haralanov@intel.com>
Signed-off-by: Kaike Wan <kaike.wan@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>

IB/hfi1: Add TID RDMA WRITE functionality into RDMA verbs · 3c6cb20a

Committed by Kaike Wan on Jan 23, 2019

This patch integrates TID RDMA WRITE protocol into normal RDMA verbs
framework. The TID RDMA WRITE protocol is an end-to-end protocol
between the hfi1 drivers on two OPA nodes that converts a qualified
RDMA WRITE request into a TID RDMA WRITE request to avoid data copying
on the responder side.

Reviewed-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Mitko Haralanov <mitko.haralanov@intel.com>
Signed-off-by: Kaike Wan <kaike.wan@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>

IB/hfi1: Add the dual leg code · 572f0c33

Committed by Kaike Wan on Jan 23, 2019

The "Second Leg" of the TID RDMA WRITE protocol deals with
the transfer of data and ack packets, which are in the KDETH
PSN space, as opposed to the IB PSN space.

Therefore, the Second Leg could be considered as a separate
state machine. As such, it is handled by a different work
queue item which is scheduled along with the normal IB state
machine work item.

Reviewed-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Mitko Haralanov <mitko.haralanov@intel.com>
Signed-off-by: Kaike Wan <kaike.wan@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>

IB/hfi1: Add the TID second leg ACK packet builder · 24c5bfea

Committed by Kaike Wan on Jan 23, 2019

This patch adds the TID packet builder for the responder side, which
contains the state machine to build TID RDMA ACK packet for either
TID RDMA WRITE DATA or TID RDMA RESYNC packets.

Reviewed-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Mitko Haralanov <mitko.haralanov@intel.com>
Signed-off-by: Kaike Wan <kaike.wan@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>

IB/hfi1: Add the TID second leg send packet builder · 70dcb2e3

Committed by Kaike Wan on Jan 23, 2019

To improve performance, the TID RDMA WRITE protocol is designed to
own a second leg to send data and ack packets in the KDETH PSN space.
This patch adds the packet builder for the requester side, which
contains the state machine to build TID RDMA WRITE DATA and TID
RDMA RESYNC packet.

Reviewed-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Mitko Haralanov <mitko.haralanov@intel.com>
Signed-off-by: Kaike Wan <kaike.wan@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>

IB/hfi1: Resend the TID RDMA WRITE DATA packets · 6e38fca6

Committed by Kaike Wan on Jan 23, 2019

This patch adds the logic to resend TID RDMA WRITE DATA packets.
The tracking indices will be reset properly so that the correct
TID entries will be used.

Reviewed-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Mitko Haralanov <mitko.haralanov@intel.com>
Signed-off-by: Kaike Wan <kaike.wan@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>

IB/hfi1: Add a function to receive TID RDMA RESYNC packet · 7cf0ad67

Committed by Kaike Wan on Jan 23, 2019

This patch adds a function to receive TID RDMA RESYNC packet on the
responder side. The QP's hardware flow will be updated and all
allocated software flows will be updated accordingly in order to
drop all stale packets.

Signed-off-by: Mitko Haralanov <mitko.haralanov@intel.com>
Signed-off-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Ashutosh Dixit <ashutosh.dixit@intel.com>
Signed-off-by: Kaike Wan <kaike.wan@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>

IB/hfi1: Add a function to build TID RDMA RESYNC packet · 6e391c6a

Committed by Kaike Wan on Jan 23, 2019

This patch adds a function to build TID RDMA RESYNC packet, which is
sent by the requester to notify the responder that no TID RDMA ACK
packet has been received for a given KDETH PSN.

Signed-off-by: Mitko Haralanov <mitko.haralanov@intel.com>
Signed-off-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Ashutosh Dixit <ashutosh.dixit@intel.com>
Signed-off-by: Kaike Wan <kaike.wan@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>

IB/hfi1: Add TID RDMA retry timer · 829eaee5

Committed by Kaike Wan on Jan 23, 2019

This patch adds the TID RDMA retry timer to make sure that TID RDMA
WRITE DATA packets for a segment are received successfully by the
responder. This timer is generally armed when the last TID RDMA
WRITE DATA packet for a segment is sent out and stopped when all
TID RDMA DATA packets are acknowledged.

Reviewed-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Mitko Haralanov <mitko.haralanov@intel.com>
Signed-off-by: Kaike Wan <kaike.wan@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>

IB/hfi1: Add a function to receive TID RDMA ACK packet · 9e93e967

Committed by Kaike Wan on Jan 23, 2019

This patch adds a function to receive TID RDMA ACK packet, which could
be an acknowledgment of either a TID RDMA WRITE DATA packet or a TID
RDMA RESYNC packet. For an ACK to TID RDMA WRITE DATA packet, the
request segments are completed appropriately. For an ACK to a TID
RDMA RESYNC packet, any pending segment flow information is updated
accordingly.

Signed-off-by: Mitko Haralanov <mitko.haralanov@intel.com>
Signed-off-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Ashutosh Dixit <ashutosh.dixit@intel.com>
Signed-off-by: Kaike Wan <kaike.wan@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>

IB/hfi1: Add a function to build TID RDMA ACK packet · 0f75e325

Committed by Kaike Wan on Jan 23, 2019

This patch adds a function to build TID RDMA ACK packet, which is also
in the KDETH PSN space for packet ordering. This packet is used to
acknowledge the receiving of all the TID RDMA WRITE DATA packets
before the given KDETH PSN. Similar to RC ACK packets, TID RDMA ACK
packets could also be coalesced.

Signed-off-by: Mitko Haralanov <mitko.haralanov@intel.com>
Signed-off-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Ashutosh Dixit <ashutosh.dixit@intel.com>
Signed-off-by: Kaike Wan <kaike.wan@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>

IB/hfi1: Add a function to receive TID RDMA WRITE DATA packet · d72fe7d5

Committed by Kaike Wan on Jan 23, 2019

This patch adds a function to receive TID RDMA WRITE DATA packet,
which is in the KDETH PSN space in packet ordering. Due to the use
of header suppression, software is generally only notified when
the last data packet for a segment is received. This patch also
adds code to handle KDETH EFLAGS errors for ingress TID RDMA WRITE
DATA packets.

Signed-off-by: Mitko Haralanov <mitko.haralanov@intel.com>
Signed-off-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Ashutosh Dixit <ashutosh.dixit@intel.com>
Signed-off-by: Kaike Wan <kaike.wan@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>

IB/hfi1: Add a function to build TID RDMA WRITE DATA packet · 539e1908

Committed by Kaike Wan on Jan 23, 2019

This patch adds a function to build TID RDMA WRITE DATA packet.

Signed-off-by: Mitko Haralanov <mitko.haralanov@intel.com>
Signed-off-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Ashutosh Dixit <ashutosh.dixit@intel.com>
Signed-off-by: Kaike Wan <kaike.wan@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>

IB/hfi1: Add a function to receive TID RDMA WRITE response · 72a0ea99

Committed by Kaike Wan on Jan 23, 2019

This patch adds a function to receive TID RDMA WRITE response.
The TID entries will be stored for encoding TID RDMA WRITE DATA
packet later.

Signed-off-by: Mitko Haralanov <mitko.haralanov@intel.com>
Signed-off-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Ashutosh Dixit <ashutosh.dixit@intel.com>
Signed-off-by: Kaike Wan <kaike.wan@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>

IB/hfi1: Add TID resource timer · 3c759e00

Committed by Kaike Wan on Jan 23, 2019

This patch adds the TID resource timer, which is used by the responder
to free any TID resources that are allocated for TID RDMA WRITE request
and not returned by the requester after a reasonable time.

Reviewed-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Mitko Haralanov <mitko.haralanov@intel.com>
Signed-off-by: Kaike Wan <kaike.wan@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>

IB/hfi1: Add a function to build TID RDMA WRITE response · 38d46d36

Committed by Kaike Wan on Jan 23, 2019

This patch adds the function to build TID RDMA WRITE response. The
main role of the TID RDMA WRITE RESP packet is to send TID entries
to the requester so that they can be used to encode TID RDMA WRITE
DATA packet.

Signed-off-by: Mitko Haralanov <mitko.haralanov@intel.com>
Signed-off-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Ashutosh Dixit <ashutosh.dixit@intel.com>
Signed-off-by: Kaike Wan <kaike.wan@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>

IB/hfi1: Add functions to receive TID RDMA WRITE request · 07b92370

Committed by Kaike Wan on Jan 23, 2019

This patch adds the functions to receive TID RDMA WRITE request. The
request will be stored in the QP's s_ack_queue. This patch also adds
code to handle duplicate TID RDMA WRITE request and a function to
allocate TID resources for data receiving on the responder side.

Signed-off-by: Mitko Haralanov <mitko.haralanov@intel.com>
Signed-off-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Ashutosh Dixit <ashutosh.dixit@intel.com>
Signed-off-by: Kaike Wan <kaike.wan@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>

IB/hfi1: Add an s_acked_ack_queue pointer · 4f9264d1

Committed by Kaike Wan on Jan 23, 2019

The s_ack_queue is managed by two pointers into the ring:
r_head_ack_queue and s_tail_ack_queue. r_head_ack_queue is the index of
where the next received request is going to be placed and s_tail_ack_queue
is the entry of the request currently being processed. This works
perfectly fine for normal Verbs as the requests are processed one at a
time and the s_tail_ack_queue is not moved until the request that it
points to is fully completed.

In this fashion, s_tail_ack_queue constantly chases r_head_ack_queue and
the two pointers can easily be used to determine "queue full" and "queue
empty" conditions.
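The "queue full" and "queue empty" tests on a two-index ring can be sketched as follows. This shows one common convention (one slot is always left unused so that full and empty are distinguishable); the ring size and all names are hypothetical, and the driver's exact checks may differ.

```c
#include <stdbool.h>
#include <stdint.h>

#define ACK_QUEUE_SIZE 8u  /* hypothetical ring size */

/* Model of the two indices described above. */
struct ack_ring {
    uint32_t head;  /* slot where the next received request is placed */
    uint32_t tail;  /* slot of the request currently being processed */
};

/* Advance an index by one slot, wrapping at the end of the ring. */
static uint32_t ring_next(uint32_t idx)
{
    return (idx + 1u) % ACK_QUEUE_SIZE;
}

/* Empty: the tail has caught up with the head. */
static bool ring_empty(const struct ack_ring *r)
{
    return r->head == r->tail;
}

/* Full: placing one more request would make the head collide with the tail. */
static bool ring_full(const struct ack_ring *r)
{
    return ring_next(r->head) == r->tail;
}
```

These checks are only valid as long as every entry behind the tail is truly finished, which is the assumption that pipelined requests break.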

The detection of these two conditions is important in determining when an
old entry can safely be overwritten with a new received request and the
resources associated with the old request be safely released.

When pipelined TID RDMA WRITE is introduced into this mix, things look
very different. r_head_ack_queue is still the point at which a newly
received request will be inserted, s_tail_ack_queue is still the
currently processed request. However, with pipelined TID RDMA WRITE
requests, s_tail_ack_queue moves to the next request once all TID RDMA
WRITE responses for that request have been sent. The rest of the protocol
for a particular request is managed by other pointers specific to TID RDMA
- r_tid_tail and r_tid_ack - which point to the entries for which the next
TID RDMA DATA packets are going to arrive and the request for which
the next TID RDMA ACK packets are to be generated, respectively.

What this means is that entries in the ring, which are "behind"
s_tail_ack_queue (entries which s_tail_ack_queue has gone past) are no
longer considered complete. This is where the problem is - a newly
received request could potentially overwrite a still active TID RDMA WRITE
request.

The reason why the TID RDMA pointers trail s_tail_ack_queue is that the
normal Verbs send engine uses s_tail_ack_queue as the pointer for the next
response. Since TID RDMA WRITE responses are processed by the normal Verbs
send engine, s_tail_ack_queue had to be moved to the next entry once all
TID RDMA WRITE response packets were sent to get the desired pipelining
between requests. Doing otherwise would mean that the normal Verbs send
engine would not be able to send the TID RDMA WRITE responses for the next
TID RDMA request until the current one is fully completed.

This patch introduces the s_acked_ack_queue index to point to the next
request to complete on the responder side. For requests other than TID
RDMA WRITE, s_acked_ack_queue should always be kept in sync with
s_tail_ack_queue. For TID RDMA WRITE request, it may fall behind
s_tail_ack_queue.

Reviewed-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Mitko Haralanov <mitko.haralanov@intel.com>
Signed-off-by: Kaike Wan <kaike.wan@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>

IB/hfi1: Allow for extra entries in QP's s_ack_queue · f5a4a95f

Committed by Kaike Wan on Jan 23, 2019

The TID RDMA WRITE protocol differs from normal IB RDMA WRITE
in that TID RDMA WRITE requests do require responses, not just
ACKs.

Therefore, TID RDMA WRITE requests need to be treated as RDMA
READ requests from the point of view of the QPs' s_ack_queue.
In other words, the QPs need to allow for TID RDMA WRITE
requests to be stored in their s_ack_queue.

However, because the user does not know anything about the TID
RDMA capability and/or protocols, these extra entries in the
queue cannot be advertised to the user.
Reviewed-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Mitko Haralanov <mitko.haralanov@intel.com>
Signed-off-by: Kaike Wan <kaike.wan@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>

f5a4a95f
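
The split between allocated and advertised queue entries in f5a4a95f can
be sketched as follows. The struct and function names here are
hypothetical, and the extra "+ 1" slot is an assumption of this sketch (a
common way to tell a full ring from an empty one), not something stated in
the commit message.

```c
/* Hypothetical parameters for sizing a QP's s_ack_queue. */
struct ack_queue_params {
	unsigned int max_rdma_atomic;	/* limit the user is told about */
	unsigned int extra_rdma_atomic;	/* reserved for TID RDMA WRITE */
};

/* Number of entries to allocate for the queue (assumed one spare slot). */
static unsigned int ack_queue_alloc_size(const struct ack_queue_params *p)
{
	return p->max_rdma_atomic + p->extra_rdma_atomic + 1;
}

/* Number of entries reported to the user; the extras stay hidden. */
static unsigned int ack_queue_advertised(const struct ack_queue_params *p)
{
	return p->max_rdma_atomic;
}
```

Keeping the two values in separate helpers makes it harder to leak the
internal reservation into a user-visible attribute by accident.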

IB/hfi1: Build TID RDMA WRITE request · c098bbb0

Committed by Kaike Wan on Jan 23, 2019

This patch adds the functions to build TID RDMA WRITE request.
The work request opcode, packet opcode, and packet formats for TID
RDMA WRITE protocol are also defined in this patch.
Signed-off-by: Mitko Haralanov <mitko.haralanov@intel.com>
Signed-off-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Ashutosh Dixit <ashutosh.dixit@intel.com>
Signed-off-by: Kaike Wan <kaike.wan@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>

c098bbb0

IB/hfi1: Add static trace for TID RDMA READ protocol · 3ce5daa2

Committed by Kaike Wan on Jan 23, 2019

This patch makes the following changes to the static trace:
1. Adds the decoding of TID RDMA READ packets in IB header trace;
2. Tracks qpriv->s_flags and iow_flags in qpsleepwakeup trace;
3. Adds a new event to track RC ACK receiving;
4. Adds trace events for various stages of the TID RDMA READ
protocol. These events provide fine-grained control for monitoring
and debugging the hfi1 driver in the field.
Reviewed-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Kaike Wan <kaike.wan@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>

3ce5daa2

IB/hfi1: Enable TID RDMA READ protocol · f1ab4efa

Committed by Kaike Wan on Jan 23, 2019

This patch enables TID RDMA READ protocol by converting a qualified
RDMA READ request into a TID RDMA READ request internally:
(1) The TID RDMA capability must be enabled;
(2) The request must start on a 4K page boundary and all receiving
buffers must start on 4K page boundaries;
(3) The request length must be a multiple of 4K and must be greater than
or equal to 256K. Each receiving buffer length must be a multiple of 4K.
Signed-off-by: Mitko Haralanov <mitko.haralanov@intel.com>
Signed-off-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Kaike Wan <kaike.wan@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>

f1ab4efa
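
The three qualification rules listed in f1ab4efa can be expressed as a
single predicate. The sketch below is an illustration only: struct
recv_buf, the constants, and tid_rdma_read_qualifies() are hypothetical
names, and the real driver may perform these checks in a different place
or order.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define TID_PAGE_SIZE	4096u		/* 4K page, per rule (2) */
#define TID_MIN_LENGTH	(256u * 1024u)	/* 256K minimum, per rule (3) */

/* Hypothetical description of one receiving buffer. */
struct recv_buf {
	uint64_t addr;
	uint32_t length;
};

static bool is_4k_multiple(uint64_t value)
{
	return (value & (TID_PAGE_SIZE - 1)) == 0;
}

/*
 * Return true if an RDMA READ request may be converted into a
 * TID RDMA READ request under the three rules in the commit message.
 */
static bool tid_rdma_read_qualifies(bool tid_enabled, uint64_t request_addr,
				    uint32_t request_len,
				    const struct recv_buf *bufs, size_t nbufs)
{
	size_t i;

	/* Rule (1): the TID RDMA capability must be enabled. */
	if (!tid_enabled)
		return false;

	/* Rules (2) and (3) for the request itself. */
	if (!is_4k_multiple(request_addr) || !is_4k_multiple(request_len))
		return false;
	if (request_len < TID_MIN_LENGTH)
		return false;

	/* Rules (2) and (3) for every receiving buffer. */
	for (i = 0; i < nbufs; i++) {
		if (!is_4k_multiple(bufs[i].addr) ||
		    !is_4k_multiple(bufs[i].length))
			return false;
	}
	return true;
}
```

A request that fails any check would simply stay a normal RDMA READ
request, since the conversion is internal and invisible to the user.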

IB/hfi1: Add interlock between a TID RDMA request and other requests · a0b34f75

Committed by Kaike Wan on Jan 24, 2019

This locking mechanism is designed to prevent various memory corruption
scenarios from occurring when requests are pipelined, especially when
RDMA READ/WRITE requests are interleaved with TID RDMA READ/WRITE
requests:
1. READ-AFTER-READ;
2. READ-AFTER-WRITE;
3. WRITE-AFTER-READ;
When memory corruption is likely, a request will be held back until
previous requests have been completed.
Reviewed-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Mitko Haralanov <mitko.haralanov@intel.com>
Signed-off-by: Kaike Wan <kaike.wan@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>

a0b34f75

IB/hfi1: Integrate TID RDMA READ protocol into RC protocol · 24b11923

Committed by Kaike Wan on Jan 23, 2019

This patch integrates the TID RDMA READ protocol into the IB RC protocol.
This protocol is an end-to-end protocol between the hfi1 drivers on two
OPA nodes that converts a qualified RDMA READ request into a TID RDMA
READ request to avoid data copying on the requester side. The following
code is added in this patch:
- Send the TID RDMA READ request;
- Complete the TID RDMA READ send request;
- Send the TID RDMA READ response;
- Complete the TID RDMA READ request;
Reviewed-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Kaike Wan <kaike.wan@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>

24b11923

IB/hfi1: Add functions for restarting TID RDMA READ request · b126078e

Committed by Kaike Wan on Jan 23, 2019

This patch adds functions to retry a TID RDMA READ request. Since a TID
RDMA READ request can be retried from any segment boundary, it requires
a number of tracking fields in various structures and those fields
should be reset properly. The qp->s_num_rd_atomic field is reset before
retry and therefore should be incremented for each new or retried
RDMA READ or atomic request.
Reviewed-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Kaike Wan <kaike.wan@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>

b126078e

openeuler / Kernel: synced successfully nearly 2 years ago