Commits · bd7ed1d13304d914648dacec4dbb9145aaae614e · openeuler / Kernel

11 Oct, 2008 3 commits

RPC/RDMA: check selected memory registration mode at runtime. · bd7ed1d1

Committed by Tom Talpey on Oct 09, 2008

At transport creation, check for, and use, any local DMA lkey.
Then, check that the selected memory registration mode is in fact
supported by the RDMA adapter selected for the mount. Fall back
to the best alternative if not.
Signed-off-by: Tom Talpey <talpey@netapp.com>
Acked-by: Tom Tucker <tom@opengridcomputing.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>

bd7ed1d1
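
The runtime check and fallback described in this commit can be sketched roughly as follows. This is a minimal user-space model, not the kernel code: the enum values, the `adapter_caps` structure, and the preference order are all assumptions made for illustration, and the last mode is simply assumed to be always available.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical registration modes, ordered from most to least preferred. */
enum memreg_mode {
    MEMREG_FRMR,        /* fast registration work requests */
    MEMREG_FMR,         /* fast memory regions */
    MEMREG_ALLPHYSICAL, /* one DMA key covering all memory */
    MEMREG_MODE_COUNT
};

/* Hypothetical summary of what the adapter reports at transport creation. */
struct adapter_caps {
    bool has_frmr;
    bool has_fmr;
};

static bool mode_supported(enum memreg_mode mode, const struct adapter_caps *caps)
{
    switch (mode) {
    case MEMREG_FRMR:        return caps->has_frmr;
    case MEMREG_FMR:         return caps->has_fmr;
    case MEMREG_ALLPHYSICAL: return true; /* assumed always usable in this model */
    default:                 return false;
    }
}

/* Start at the requested mode and walk down until the adapter supports one. */
enum memreg_mode select_memreg_mode(enum memreg_mode requested,
                                    const struct adapter_caps *caps)
{
    assert(requested < MEMREG_MODE_COUNT);
    for (int m = requested; m < MEMREG_MODE_COUNT; m++)
        if (mode_supported((enum memreg_mode)m, caps))
            return (enum memreg_mode)m;
    return MEMREG_ALLPHYSICAL;
}
```

In this model, a mount that asks for `MEMREG_FRMR` on an adapter that only offers FMR ends up with `MEMREG_FMR` instead of failing.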

RPC/RDMA: add data types and new FRMR memory registration enum. · fe9053b3

Committed by Tom Talpey on Oct 09, 2008

Internal RPC/RDMA structure updates in preparation for FRMR support.
Signed-off-by: Tom Talpey <talpey@netapp.com>
Acked-by: Tom Tucker <tom@opengridcomputing.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>

fe9053b3

RPC/RDMA: refactor the inline memory registration code. · 8d4ba034

Committed by Tom Talpey on Oct 09, 2008

Refactor the memory registration and deregistration routines.
This saves stack space, makes the code more readable and prepares
to add the new FRMR registration methods.
Signed-off-by: Tom Talpey <talpey@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>

8d4ba034

14 Aug, 2008 1 commit

svcrdma: Fix race between svc_rdma_recvfrom thread and the dto_tasklet · 24b8b447

Committed by Tom Tucker on Aug 13, 2008

RDMA_READ completions are kept on a separate queue from the general
I/O request queue. Since a separate lock is used to protect the RDMA_READ
completion queue, a race exists between the dto_tasklet and the
svc_rdma_recvfrom thread where the dto_tasklet sets the XPT_DATA
bit and adds I/O to the read-completion queue. Concurrently, the
recvfrom thread checks the generic queue, finds it empty and resets
the XPT_DATA bit. A subsequent svc_xprt_enqueue will fail to enqueue
the transport for I/O and cause the transport to "stall".

The fix is to protect both lists with the same lock and set the XPT_DATA
bit with this lock held.
Signed-off-by: Tom Tucker <tom@opengridcomputing.com>
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>

24b8b447
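
The fix described in this commit, one lock covering both queues with the XPT_DATA bit changed only while that lock is held, might look like the sketch below. It is a simplified user-space model: `xprt_model`, the counters standing in for the two lists, and the function names are hypothetical, and a pthread mutex stands in for the kernel spin lock.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the transport: counters instead of real lists. */
struct xprt_model {
    pthread_mutex_t lock;   /* one lock guards both queues and the flag */
    int rq_pending;         /* generic I/O request queue depth */
    int read_done_pending;  /* RDMA_READ completion queue depth */
    bool data_flag;         /* models the XPT_DATA bit */
};

#define XPRT_MODEL_INIT { PTHREAD_MUTEX_INITIALIZER, 0, 0, false }

/* Tasklet side: queue a read completion and flag data under the same lock. */
void model_read_complete(struct xprt_model *x)
{
    assert(x != NULL);
    pthread_mutex_lock(&x->lock);
    x->read_done_pending++;
    x->data_flag = true;
    pthread_mutex_unlock(&x->lock);
}

/* recvfrom side: take one item; clear the flag only if both queues are empty. */
bool model_recvfrom(struct xprt_model *x)
{
    bool got = false;

    assert(x != NULL);
    pthread_mutex_lock(&x->lock);
    if (x->read_done_pending > 0) {
        x->read_done_pending--;
        got = true;
    } else if (x->rq_pending > 0) {
        x->rq_pending--;
        got = true;
    }
    if (x->read_done_pending == 0 && x->rq_pending == 0)
        x->data_flag = false;
    pthread_mutex_unlock(&x->lock);
    return got;
}
```

Because the emptiness check and the flag update happen in one critical section, the receiver can no longer clear the flag between the tasklet's enqueue and its flag update.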

27 Jul, 2008 1 commit

dma-mapping: add the device argument to dma_mapping_error() · 8d8bb39b

Committed by FUJITA Tomonori on Jul 25, 2008

Add per-device dma_mapping_ops support for CONFIG_X86_64 as POWER
architecture does:

This enables us to cleanly fix the Calgary IOMMU issue that some devices
are not behind the IOMMU (http://lkml.org/lkml/2008/5/8/423).

I think that per-device dma_mapping_ops support would be also helpful for
KVM people to support PCI passthrough but Andi thinks that this makes it
difficult to support the PCI passthrough (see the above thread).  So I
CC'ed this to KVM camp.  Comments are appreciated.

A pointer to dma_mapping_ops to struct dev_archdata is added.  If the
pointer is non NULL, DMA operations in asm/dma-mapping.h use it.  If it's
NULL, the system-wide dma_ops pointer is used as before.

If it's useful for KVM people, I plan to implement a mechanism to register
a hook called when a new pci (or dma capable) device is created (it works
with hot plugging).  It enables IOMMUs to set up an appropriate
dma_mapping_ops per device.

The major obstacle is that dma_mapping_error doesn't take a pointer to the
device unlike other DMA operations.  So x86 can't have dma_mapping_ops per
device.  Note all the POWER IOMMUs use the same dma_mapping_error function
so this is not a problem for POWER but x86 IOMMUs use different
dma_mapping_error functions.

The first patch adds the device argument to dma_mapping_error.  The patch
is trivial but large since it touches lots of drivers and dma-mapping.h in
all the architectures.

This patch:

dma_mapping_error() doesn't take a pointer to the device unlike other DMA
operations.  So we can't have dma_mapping_ops per device.

Note that POWER already has dma_mapping_ops per device but all the POWER
IOMMUs use the same dma_mapping_error function.  x86 IOMMUs use device
argument.

[akpm@linux-foundation.org: fix sge]
[akpm@linux-foundation.org: fix svc_rdma]
[akpm@linux-foundation.org: build fix]
[akpm@linux-foundation.org: fix bnx2x]
[akpm@linux-foundation.org: fix s2io]
[akpm@linux-foundation.org: fix pasemi_mac]
[akpm@linux-foundation.org: fix sdhci]
[akpm@linux-foundation.org: build fix]
[akpm@linux-foundation.org: fix sparc]
[akpm@linux-foundation.org: fix ibmvscsi]
Signed-off-by: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp>
Cc: Muli Ben-Yehuda <muli@il.ibm.com>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Avi Kivity <avi@qumranet.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

8d8bb39b
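
The per-device lookup with a system-wide fallback, and the reason dma_mapping_error() needs the device argument, can be illustrated with a small user-space model. All type and function names below carry a `_model` suffix or are otherwise hypothetical, and the two error sentinels are invented purely to show that different IOMMUs may test for failure differently.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef uint64_t dma_addr_model_t;

/* Hypothetical ops table: each IOMMU flavor has its own error test. */
struct dma_ops_model {
    bool (*mapping_error)(dma_addr_model_t addr);
};

struct device_model {
    const struct dma_ops_model *dma_ops; /* NULL means "use the system-wide ops" */
};

/* Invented sentinels: 0 for the default ops, all-ones for the IOMMU ops. */
static bool default_mapping_error(dma_addr_model_t addr) { return addr == 0; }
static bool iommu_mapping_error(dma_addr_model_t addr) { return addr == UINT64_MAX; }

static const struct dma_ops_model system_dma_ops = { default_mapping_error };
const struct dma_ops_model iommu_dma_ops = { iommu_mapping_error };

/* Use the device's own ops when set, otherwise the system-wide table. */
static const struct dma_ops_model *get_dma_ops_model(const struct device_model *dev)
{
    if (dev == NULL || dev->dma_ops == NULL)
        return &system_dma_ops;
    return dev->dma_ops;
}

/* With the device argument, the error test can dispatch per device. */
bool dma_mapping_error_model(const struct device_model *dev, dma_addr_model_t addr)
{
    const struct dma_ops_model *ops = get_dma_ops_model(dev);

    assert(ops->mapping_error != NULL);
    return ops->mapping_error(addr);
}
```

Without the `dev` parameter there would be no way to choose between `default_mapping_error` and `iommu_mapping_error` for a given address, which is the obstacle the commit message describes.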

03 Jul, 2008 10 commits

svcrdma: Change WR context get/put to use the kmem cache · 8948896c

Committed by Tom Tucker on May 28, 2008

Change the WR context pool to be shared across mount points. This
reduces the RDMA transport memory footprint significantly since
idle mounts don't consume WR context memory.
Signed-off-by: Tom Tucker <tom@opengridcomputing.com>

8948896c

svcrdma: Create a kmem cache for the WR contexts · bf5927d8

Committed by Tom Tucker on May 28, 2008

Create a kmem cache to hold WR contexts. Next we will convert
the WR context get and put services to use this kmem cache.
Signed-off-by: Tom Tucker <tom@opengridcomputing.com>

bf5927d8

svcrdma: Add flush_scheduled_work to module exit function · 902a94e0

Committed by Tom Tucker on May 28, 2008

Make certain all transports pending free are flushed from the wq
before unloading the module.
Signed-off-by: Tom Tucker <tom@opengridcomputing.com>

902a94e0

svcrdma: Limit ORD based on client's advertised IRD · 36ef25e4

Committed by Tom Tucker on May 19, 2008

When adapters have differing IRD limits, the RDMA transport will fail to
connect properly. The RDMA transport should use the client's advertised
inbound read limit when computing its outbound read limit. For iWARP
transports, there is currently no standard for exchanging IRD/ORD
during connection establishment so the 'responder_resources' field in the
connect event is the local device's limit. The RDMA transport can be
configured to use a smaller ORD by writing the desired number to the
/proc/sys/sunrpc/svc_rdma/max_outbound_read_requests file.
Signed-off-by: Tom Tucker <tom@opengridcomputing.com>

36ef25e4
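
The limit computation described in this commit amounts to taking the smallest of the applicable bounds. The helper below is a hedged sketch: `compute_ord` and its parameter names are hypothetical, and the way the kernel actually obtains each input is not shown.

```c
#include <assert.h>

static unsigned int min_uint(unsigned int a, unsigned int b)
{
    return a < b ? a : b;
}

/*
 * local_max:      the local adapter's outbound read limit
 * client_ird:     inbound read depth advertised by the client
 * configured_max: administrator cap, modeling max_outbound_read_requests
 */
unsigned int compute_ord(unsigned int local_max, unsigned int client_ird,
                         unsigned int configured_max)
{
    assert(configured_max > 0); /* a zero cap would disable RDMA reads entirely */
    return min_uint(min_uint(local_max, client_ird), configured_max);
}
```

For iWARP, where the commit notes that the connect event only carries the local device's limit, the configured cap is the only input that can bring the result down to what the client supports.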

svcrdma: Remove unneeded spin locks from __svc_rdma_free · 94dba491

Committed by Tom Tucker on May 28, 2008

At the time __svc_rdma_free is called, we are guaranteed that all references
to this transport are gone. There is, therefore, no need to protect the
resource lists with a spin lock.
Signed-off-by: Tom Tucker <tom@opengridcomputing.com>

94dba491

svcrdma: Add dma map count and WARN_ON · 87295b6c

Committed by Tom Tucker on May 28, 2008

Add a dma map count in order to verify that all DMA mapping resources
have been freed when the transport is closed.
Signed-off-by: Tom Tucker <tom@opengridcomputing.com>

87295b6c

svcrdma: Move the DMA unmap logic to the CQ handler · e6ab9143

Committed by Tom Tucker on May 28, 2008

Separate DMA unmap from context destruction and perform DMA unmapping
in the SQ/RQ CQ reap functions. This is necessary to support software
based RDMA implementations that actually copy the data in their
ib_dma_unmap callback functions and architectures that don't have
cache coherent I/O busses.
Signed-off-by: Tom Tucker <tom@opengridcomputing.com>

e6ab9143

svcrdma: Use reply and chunk map for RDMA_READ processing · f820c57e

Committed by Tom Tucker on May 27, 2008

Modify the RDMA_READ processing to use the reply and chunk list mapping data
types. Also add a special purpose 'hdr_count' field in the context to hold
the header page count instead of overloading the SGE length field and
corrupting the DMA map length.
Signed-off-by: Tom Tucker <tom@opengridcomputing.com>

f820c57e

svcrdma: Use RPC reply map for RDMA_WRITE processing · 34d16e42

Committed by Tom Tucker on Jul 02, 2008

Use the new svc_rdma_req_map data type for mapping the client side memory
to the server side memory. Move the DMA mapping to the context pointed to
by each WR individually so that it is unmapped after the WR completes.
Signed-off-by: Tom Tucker <tom@opengridcomputing.com>

34d16e42

svcrdma: Add a type for keeping NFS RPC mapping · ab96dddb

Committed by Tom Tucker on May 28, 2008

Create a new data structure to hold the remote client address space
to local server address space mapping.
Signed-off-by: Tom Tucker <tom@opengridcomputing.com>

ab96dddb

19 May, 2008 21 commits

svcrdma: Verify read-list fits within RPCSVC_MAXPAGES · a6f911c0

Committed by Tom Tucker on May 13, 2008

An RDMA read-list cannot contain more elements than RPCSVC_MAXPAGES or
it will overflow the DTO context. Verify this when processing the
protocol header.
Signed-off-by: Tom Tucker <tom@opengridcomputing.com>

a6f911c0
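
A bounds check of the kind this commit describes could be sketched as below. The model is deliberately simplified: `read_chunk_model` keeps only a "present" discriminator, `MODEL_MAXPAGES` is a small stand-in for RPCSVC_MAXPAGES rather than its real value, and the function name is hypothetical.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

#define MODEL_MAXPAGES 8 /* stand-in for RPCSVC_MAXPAGES; the real value differs */

/* Hypothetical read-list entry: only the "another chunk follows" marker is kept. */
struct read_chunk_model {
    unsigned int present; /* nonzero for a chunk, zero terminates the list */
};

/*
 * Count the chunks in a read-list held in a buffer of n_entries slots.
 * Returns the count, or -EINVAL if the list exceeds MODEL_MAXPAGES or
 * runs off the end of the buffer without a terminator.
 */
int count_read_chunks(const struct read_chunk_model *list, size_t n_entries)
{
    size_t count = 0;

    assert(list != NULL);
    while (count < n_entries && list[count].present) {
        count++;
        if (count > MODEL_MAXPAGES)
            return -EINVAL;
    }
    if (count == n_entries)
        return -EINVAL; /* no terminator inside the buffer */
    return (int)count;
}
```

Rejecting the oversized list while the header is being parsed means later code can size its per-request arrays by the cap without rechecking.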

svcrdma: Change svc_rdma_send_error return type to void · 008fdbc5

Committed by Tom Tucker on May 07, 2008

The svc_rdma_send_error function is called when an RPCRDMA protocol
error is detected. This function attempts to post an error reply message.
Since an error posting to a transport in error is ignored, change
the return type to void.
Signed-off-by: Tom Tucker <tom@opengridcomputing.com>

008fdbc5

svcrdma: Copy transport address and arm CQ before calling rdma_accept · af261af4

Committed by Tom Tucker on May 07, 2008

This race was found by inspection. Messages can be received from the peer
immediately following the rdma_accept call; however, the CQs have not yet
been armed and the transport address has not yet been set.

Set the transport address in the connect request handler and arm the CQ
prior to calling rdma_accept.
Signed-off-by: Tom Tucker <tom@opengridcomputing.com>

af261af4

svcrdma: Set rqstp transport address in rdma_read_complete function · 69500c43

Committed by Tom Tucker on May 07, 2008

The rdma_read_complete function needs to copy the rqstp transport address
from the transport. Failure to do so can result in using the wrong
authentication method for the RPC or bug checking if the rqstp address
is not valid.
Signed-off-by: Tom Tucker <tom@opengridcomputing.com>

69500c43

svcrdma: Use ib verbs version of dma_unmap · 97a3df38

Committed by Tom Tucker on May 01, 2008

Use the ib_verbs version of the dma_unmap service in the
svc_rdma_put_context function. This should support providers
using software rdma.
Signed-off-by: Tom Tucker <tom@opengridcomputing.com>

97a3df38

svcrdma: Cleanup queued, but unprocessed I/O in svc_rdma_free · 356d0a15

Committed by Tom Tucker on May 01, 2008

When the transport is closing, the DTO tasklet may queue data
that never gets processed. Clean up resources associated with
this I/O.
Signed-off-by: Tom Tucker <tom@opengridcomputing.com>

356d0a15

svcrdma: Move the QP and cm_id destruction to svc_rdma_free · 1711386c

Committed by Tom Tucker on May 01, 2008

Move the destruction of the QP and CM_ID to the free path so that the
QP cleanup code doesn't race with the dto_tasklet handling flushed WR.
The QP reference is not needed because we now have a reference for
every WR.

Also add a guard in the SQ and RQ completion handlers to ignore
calls generated by some providers when the QP is destroyed.
Signed-off-by: Tom Tucker <tom@opengridcomputing.com>

1711386c

svcrdma: Add reference for each SQ/RQ WR · 0905c0f0

Committed by Tom Tucker on May 01, 2008

Add a reference on the transport for every outstanding WR.
Signed-off-by: Tom Tucker <tom@opengridcomputing.com>

0905c0f0

svcrdma: Move destroy to kernel thread · 8da91ea8

Committed by Tom Tucker on Apr 30, 2008

Some providers may wait while destroying adapter resources.
Since it is possible that the last reference is put on the
dto_tasklet, the actual destroy must be scheduled as a work item.
Signed-off-by: Tom Tucker <tom@opengridcomputing.com>

8da91ea8
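
The deferral this commit describes, where dropping the last reference only schedules the teardown and a later sleepable context performs it, can be modeled as below. This is a single-threaded sketch with hypothetical names; a real implementation would use an atomic reference count and an actual work queue.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical transport lifetime state. */
struct xprt_ref_model {
    int refs;             /* owner reference plus one per outstanding WR */
    bool free_scheduled;  /* models a queued work item */
    bool freed;
};

/* Take a reference, for example when a work request is posted. */
void model_xprt_get(struct xprt_ref_model *x)
{
    assert(x->refs > 0);
    x->refs++;
}

/* Safe to call from a tasklet-like context: it never tears down inline. */
void model_xprt_put(struct xprt_ref_model *x)
{
    assert(x->refs > 0);
    if (--x->refs == 0)
        x->free_scheduled = true;
}

/* Runs later in a context that is allowed to sleep. */
void model_run_free_work(struct xprt_ref_model *x)
{
    if (x->free_scheduled && !x->freed)
        x->freed = true; /* provider resources would be destroyed here */
}
```

The point of the split is that `model_xprt_put` stays cheap and non-blocking no matter which caller happens to drop the final reference.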

svcrdma: Shrink scope of spinlock on RQ CQ · 47698e08

Committed by Tom Tucker on May 06, 2008

The rq_cq_reap function is only called from the dto_tasklet. The
only resource shared with other threads is the sc_rq_dto_q. Move the
spin lock to protect only this list.
Signed-off-by: Tom Tucker <tom@opengridcomputing.com>

47698e08

svcrdma: Use standard Linux lists for context cache · 87407673

Committed by Tom Tucker on Apr 30, 2008

Replace the one-off linked list implementation used to implement the
context cache with the standard Linux list_head lists. Add a context
counter to catch resource leaks. A WARN_ON will be added later to
ensure that we've freed all contexts.
Signed-off-by: Tom Tucker <tom@opengridcomputing.com>

87407673

svcrdma: Simplify RDMA_READ deferral buffer management · 02e7452d

Committed by Tom Tucker on Apr 30, 2008

An NFS_WRITE requires a set of RDMA_READ requests to fetch the write
data from the client. There are two principal pieces of data that
need to be tracked: the list of pages that comprise the completed RPC
and the SGE of dma mapped pages to refer to this list of pages. Previously
this whole bit was managed as a linked list of contexts with the
context containing the page list buried in this list. This patch
simplifies this processing by not keeping a linked list, but rather only
a pointer from the last submitted RDMA_READ's context to the context
that maps the set of pages that describe the RPC. This significantly
simplifies this code path. SGE contexts are cleaned up inline in the DTO
path instead of at read completion time.
Signed-off-by: Tom Tucker <tom@opengridcomputing.com>

02e7452d

svcrdma: Remove unused READ_DONE context flags bit · 10a38c33

Committed by Tom Tucker on Apr 30, 2008

The RDMACTXT_F_READ_DONE bit is no longer used. Remove it.
Signed-off-by: Tom Tucker <tom@opengridcomputing.com>

10a38c33

svcrdma: Return error from rdma_read_xdr so caller knows to free context · d16d4009

Committed by Tom Tucker on May 06, 2008

The rdma_read_xdr function did not discriminate between no read-list and
an error posting the read-list. This results in a leak of a page if there
is an error posting the read-list.
Signed-off-by: Tom Tucker <tom@opengridcomputing.com>

d16d4009

svcrdma: Fix error handling during listening endpoint creation · 58e8f621

Committed by Tom Tucker on May 06, 2008

A listening endpoint isn't known to the generic transport switch until
the svc_create_xprt function returns without error. Calling
svc_xprt_put within the xpo_create function causes the module reference
count to be erroneously decremented.
Signed-off-by: Tom Tucker <tom@opengridcomputing.com>

58e8f621

svcrdma: Free context on post_recv error in send_reply · 5ac461a6

Committed by Tom Tucker on Apr 25, 2008

If an error is encountered trying to post a recv buffer in send_reply,
free the passed in context. Return an error to the caller so it is
aware that the request was not posted.
Signed-off-by: Tom Tucker <tom@opengridcomputing.com>

5ac461a6

svcrdma: Free context on ib_post_recv error · 05a0826a

Committed by Tom Tucker on Apr 25, 2008

If there is an error posting the recv WR to the RQ, free the
context associated with the WR. This would leak a context when
asynchronous errors occurred on the transport while concurrent threads
were processing their RPC.
Signed-off-by: Tom Tucker <tom@opengridcomputing.com>

05a0826a

svcrdma: Add put of connection ESTABLISHED reference in rdma_cma_handler · 120693d1

Committed by Tom Tucker on Apr 24, 2008

The svcrdma transport takes a reference when it gets the ESTABLISHED
event from the provider. This reference is supposed to be removed when
the DISCONNECT event is received; however, the call to svc_xprt_put
was missing in the switch statement. This results in the memory
associated with the transport never being freed.
Signed-off-by: Tom Tucker <tom@opengridcomputing.com>

120693d1

svcrdma: Fix return value in svc_rdma_send · 9d6347ac

Committed by Tom Tucker on Apr 25, 2008

Fix the return value on close to -ENOTCONN so caller knows to free context.
Also if a thread is waiting for free SQ space, check for close when waking
to avoid posting WR to a closing transport.
Signed-off-by: Tom Tucker <tom@opengridcomputing.com>

9d6347ac
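
The behavior this commit describes, returning -ENOTCONN on a closing transport and rechecking for close after every wake-up, can be sketched as a retry loop. The model below is hypothetical throughout: `sq_model`, the `wait_for_space` hook that stands in for sleeping, and the two sample wake-up functions exist only for illustration.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical send-queue state; the real transport tracks more than this. */
struct sq_model {
    int depth;       /* total SQ slots */
    int in_flight;   /* slots currently in use */
    bool closing;    /* models the XPT_CLOSE bit */
    void (*wait_for_space)(struct sq_model *sq); /* stands in for sleeping */
};

/* Sample wake-up: a completion freed one slot while we slept. */
void wake_slot_freed(struct sq_model *sq) { sq->in_flight--; }

/* Sample wake-up: the connection was closed while we slept. */
void wake_closed(struct sq_model *sq) { sq->closing = true; }

/* Returns 0 once a slot is claimed, or -ENOTCONN if the transport is closing. */
int model_send(struct sq_model *sq)
{
    assert(sq != NULL && sq->wait_for_space != NULL);
    for (;;) {
        /* Checked on every pass, so a close seen after waking is honored. */
        if (sq->closing)
            return -ENOTCONN;
        if (sq->in_flight < sq->depth) {
            sq->in_flight++;
            return 0;
        }
        sq->wait_for_space(sq);
    }
}
```

A caller that sees -ENOTCONN knows the request was never posted and can free its context, which is the contract the commit message spells out.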

svcrdma: Fix race with dto_tasklet in svc_rdma_send · dbcd00eb

Committed by Tom Tucker on May 06, 2008

The svc_rdma_send function will attempt to reap SQ WR to make room for
a new request if it finds the SQ full. This function races with the
dto_tasklet that also reaps SQ WR. To avoid polling and arming the CQ
unnecessarily move the test_and_clear_bit of the RDMAXPRT_SQ_PENDING
flag and arming of the CQ to the sq_cq_reap function.

Refactor the rq_cq_reap function to match sq_cq_reap so that the
code is easier to follow.
Signed-off-by: Tom Tucker <tom@opengridcomputing.com>

dbcd00eb

svcrdma: Simplify receive buffer posting · 0e7f011a

Committed by Tom Tucker on Apr 23, 2008

The svcrdma transport provider currently allocates receive buffers
to the RQ through the xpo_release_rqst method. This approach is overly
complicated since it means that the rqstp rq_xprt_ctxt has to be
selectively set based on whether the RPC is going to be processed
immediately or deferred. Instead, just post the receive buffer when
we are certain that we are replying in the send_reply function.
Signed-off-by: Tom Tucker <tom@opengridcomputing.com>

0e7f011a

24 Apr, 2008 1 commit

SVCRDMA: Add check for XPT_CLOSE in svc_rdma_send · 830bb59b

Committed by Tom Tucker on Mar 11, 2008

The svcrdma transport can crash if a send is waiting for an
empty SQ slot and the connection is closed due to an asynchronous error.
The crash is caused when svc_rdma_send attempts to send on a deleted
QP.
Signed-off-by: Tom Tucker <tom@opengridcomputing.com>
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>

830bb59b

17 Apr, 2008 1 commit

IB/core: Add support for "send with invalidate" work requests · 0f39cf3d

Committed by Roland Dreier on Apr 16, 2008

Add a new IB_WR_SEND_WITH_INV send opcode that can be used to mark a
"send with invalidate" work request as defined in the iWARP verbs and
the InfiniBand base memory management extensions. Also put "imm_data"
and a new "invalidate_rkey" member in a new "ex" union in struct
ib_send_wr. The invalidate_rkey member can be used to pass in an
R_Key/STag to be invalidated. Add this new union to struct
ib_uverbs_send_wr. Add code to copy the invalidate_rkey field in
ib_uverbs_post_send().

Fix up low-level drivers to deal with the change to struct ib_send_wr,
and just remove the imm_data initialization from net/sunrpc/xprtrdma/,
since that code never does any send with immediate operations.

Also, move the existing IB_DEVICE_SEND_W_INV flag to a new bit, since
the iWARP drivers currently in the tree set the bit. The amso1100
driver at least will silently fail to honor the IB_SEND_INVALIDATE bit
if passed in as part of userspace send requests (since it does not
implement kernel bypass work request queueing). Remove the flag from
all existing drivers that set it until we know which ones are OK.

The value chosen for the new flag is not consecutive to avoid clashing
with flags defined in the XRC patches, which are not merged yet but
which are already in use and are likely to be merged soon.

This resurrects a patch sent long ago by Mikkel Hagen <mhagen@iol.unh.edu>.
Signed-off-by: Roland Dreier <rolandd@cisco.com>

0f39cf3d
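
The "ex" union described in this commit shares storage between the immediate data and the key to invalidate, with the opcode deciding which member is meaningful. The sketch below is a user-space model with hypothetical names (`send_wr_model`, `wr_opcode_model`, and both helpers); it does not reproduce the layout of struct ib_send_wr.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical opcodes; the real enum has many more entries. */
enum wr_opcode_model {
    WR_MODEL_SEND,
    WR_MODEL_SEND_WITH_IMM,
    WR_MODEL_SEND_WITH_INV
};

struct send_wr_model {
    enum wr_opcode_model opcode;
    union {
        uint32_t imm_data;         /* meaningful for send-with-immediate */
        uint32_t invalidate_rkey;  /* meaningful for send-with-invalidate */
    } ex;
};

/* Build a send-with-invalidate request carrying the R_Key/STag to invalidate. */
struct send_wr_model make_send_with_inv(uint32_t rkey)
{
    struct send_wr_model wr;

    wr.opcode = WR_MODEL_SEND_WITH_INV;
    wr.ex.invalidate_rkey = rkey;
    return wr;
}

/* Returns 1 and stores the key only when the opcode makes that member valid. */
int wr_invalidate_rkey(const struct send_wr_model *wr, uint32_t *rkey_out)
{
    assert(wr != NULL && rkey_out != NULL);
    if (wr->opcode != WR_MODEL_SEND_WITH_INV)
        return 0;
    *rkey_out = wr->ex.invalidate_rkey;
    return 1;
}
```

Since a single request is either a send with immediate or a send with invalidate, the two fields never need to coexist, so a union keeps the structure from growing.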

27 Mar, 2008 1 commit

SVCRDMA: Check num_sge when setting LAST_CTXT bit · c8237a5f

Committed by Tom Tucker on Mar 25, 2008

The RDMACTXT_F_LAST_CTXT bit was getting set incorrectly
when the last chunk in the read-list spanned multiple pages. This
resulted in a kernel panic when the wrong context was used to
build the RPC iovec page list.

RDMA_READ is used to fetch RPC data from the client for
NFS_WRITE requests. A scatter-gather is used to map the
advertised client side buffer to the server-side iovec and
associated page list.

WR contexts are used to convey which scatter-gather entries are
handled by each WR. When the write data is large, a single RPC may
require multiple RDMA_READ requests so the contexts for a single RPC
are chained together in a linked list. The last context in this list
is marked with a bit RDMACTXT_F_LAST_CTXT so that when this WR completes,
the CQ handler code can enqueue the RPC for processing.

The code in rdma_read_xdr was setting this bit on the last two
contexts on this list when the last read-list chunk spanned multiple
pages. This caused the svc_rdma_recvfrom logic to incorrectly build
the RPC and caused the kernel to crash because the second-to-last
context doesn't contain the iovec page list.

Modified the condition that sets this bit so that it correctly detects
the last context for the RPC.
Signed-off-by: Tom Tucker <tom@opengridcomputing.com>
Tested-by: Roland Dreier <rolandd@cisco.com>
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

c8237a5f
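
One way to picture the corrected condition is to split the scatter-gather entries of a request across several work requests and mark only the context whose WR consumes the final entries. The function below is a hedged model, not the rdma_read_xdr code: `build_read_contexts`, `CTXT_F_LAST`, and the flat `flags` array are assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>

#define CTXT_F_LAST 0x1u /* stand-in for RDMACTXT_F_LAST_CTXT */

/*
 * Split total_sge scatter-gather entries into WRs of at most max_sge each.
 * flags[i] receives the flag word for the i-th context; only the context
 * that consumes the final entries is marked as last.
 * Returns the number of contexts used, or -1 if max_ctxts is too small.
 */
int build_read_contexts(int total_sge, int max_sge,
                        unsigned int *flags, int max_ctxts)
{
    int n = 0;

    assert(flags != NULL && max_sge > 0);
    while (total_sge > 0 && n < max_ctxts) {
        int this_wr = total_sge < max_sge ? total_sge : max_sge;

        total_sge -= this_wr;
        /* Checking the remaining entry count keeps the flag on one context. */
        flags[n++] = (total_sge == 0) ? CTXT_F_LAST : 0;
    }
    return total_sge > 0 ? -1 : n;
}
```

With five entries and a limit of two per WR, the model produces three contexts and flags only the third, so the completion handler would enqueue the RPC exactly once.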

25 Mar, 2008 1 commit

SVCRDMA: Use only 1 RDMA read scatter entry for iWARP adapters · d3073779

Committed by Roland Dreier on Mar 24, 2008

The iWARP protocol limits RDMA read requests to a single scatter
entry.  NFS/RDMA has code in rdma_read_max_sge() that is supposed to
limit the sge_count for RDMA read requests to 1, but the code to do
that is inside an #ifdef RDMA_TRANSPORT_IWARP block.  In the mainline
kernel at least, RDMA_TRANSPORT_IWARP is an enum and not a
preprocessor #define, so the #ifdef'ed code is never compiled.

In my test of a kernel build with -j8 on an NFS/RDMA mount, this
problem eventually leads to trouble starting with:

    svcrdma: Error posting send = -22
    svcrdma : RDMA_READ error = -22

and things go downhill from there.

The trivial fix is to delete the #ifdef guard.  The check seems to be
a remnant of when the NFS/RDMA code was not merged and needed to
compile against multiple kernel versions, although I don't think it
ever worked as intended.  In any case now that the code is upstream
there's no need to test whether the RDMA_TRANSPORT_IWARP constant is
defined or not.

Without this patch, my kernel build on an NFS/RDMA mount using NetEffect
adapters quickly and 100% reproducibly failed with an error like:

    ld: final link failed: Software caused connection abort

With the patch applied I was able to complete a kernel build on the
same setup.

(Tom Tucker says this is "actually an _ancient_ remnant when it had to
compile against iWARP vs. non-iWARP enabled OFA trees.")
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Acked-by: Tom Tucker <tom@opengridcomputing.com>
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

d3073779
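
The pitfall behind this commit is that the preprocessor cannot see enumerators, so an #ifdef on an enum constant is always false. The sketch below reproduces the effect in isolation; the enum and both function names are hypothetical and only mirror the shape of the problem described above.

```c
#include <assert.h>

/* Hypothetical transport types, declared as an enum like the kernel constant. */
enum transport_model {
    TRANSPORT_MODEL_IB,
    TRANSPORT_MODEL_IWARP
};

/* Buggy shape: the guarded block is dropped before the compiler ever runs. */
int max_sge_buggy(enum transport_model t, int device_max)
{
#ifdef TRANSPORT_MODEL_IWARP /* never defined: enumerators are invisible to cpp */
    if (t == TRANSPORT_MODEL_IWARP)
        return 1;
#endif
    (void)t;
    return device_max;
}

/* Fixed shape: an ordinary runtime comparison against the enumerator. */
int max_sge_fixed(enum transport_model t, int device_max)
{
    assert(device_max >= 1);
    if (t == TRANSPORT_MODEL_IWARP)
        return 1;
    return device_max;
}
```

In the buggy variant an iWARP transport still gets the full device limit, which matches the failure mode the commit message reports when multi-entry RDMA reads are posted to an iWARP adapter.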
