Commits · 07e10308ee5da8e6132e0b737ece1c99dd651fb6 · openeuler / Kernel

03 Jan, 2019 31 commits

xprtrdma: Prevent leak of rpcrdma_rep objects · 07e10308

Committed by Chuck Lever on Dec 07, 2018

If a reply has been processed but the RPC is later retransmitted
anyway, the req->rl_reply field still contains the only pointer to
the old rpcrdma rep. When the next reply comes in, the reply handler
will stomp on the rl_reply field, leaking the old rep.

A trace event is added to capture such leaks.

This problem seems to be worsened by the restructuring of the RPC
Call path in v4.20. Fully addressing this issue will require at
least a re-architecture of the disconnect logic, which is not
appropriate during -rc.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
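The leak described in 07e10308 can be illustrated with a minimal user-space sketch. The types `struct rep` and `struct req` and the helpers `rep_get`, `rep_put`, and `req_set_reply` are hypothetical stand-ins, not the kernel's definitions; the sketch only shows why the old rep has to be released before `rl_reply` is overwritten.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical, simplified stand-ins for rpcrdma_rep and rpcrdma_req. */
struct rep { int id; };
struct req { struct rep *rl_reply; };

static int reps_in_use;	/* reps handed out and not yet returned */

static struct rep *rep_get(int id)
{
	struct rep *rep = malloc(sizeof(*rep));

	assert(rep != NULL);
	rep->id = id;
	reps_in_use++;
	return rep;
}

static void rep_put(struct rep *rep)
{
	free(rep);
	reps_in_use--;
}

/* Attach a new reply, releasing any rep left over from an earlier reply. */
static void req_set_reply(struct req *req, struct rep *rep)
{
	if (req->rl_reply != NULL)
		rep_put(req->rl_reply);	/* without this, the old rep leaks */
	req->rl_reply = rep;
}
```

If `req_set_reply` simply assigned `req->rl_reply = rep`, the second reply for a retransmitted RPC would drop the only pointer to the first rep.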

NFSv4.2 fix async copy reboot recovery · 9aeaf8cf

Committed by Olga Kornievskaia on Dec 06, 2018

Original commit (e4648aa4 "NFS recover from destination server
reboot for copies") used memcmp(), which was later changed to
nfs4_stateid_match_other(). That function returns the opposite of
memcmp(), so recovery can't find the copy and the copy
hangs.

Fixes: 80f42368 ("NFSv4: Split out NFS v4.2 copy completion functions")
Fixes: cb7a8384 ("NFS: Split out the body of nfs4_reclaim_open_state")
Signed-off-by: Olga Kornievskaia <kolga@netapp.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
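The inverted test in 9aeaf8cf comes from swapping a memcmp()-style comparison (zero means equal) for a boolean matcher (true means equal) without negating the condition. Below is a small user-space sketch of that pitfall. `struct stateid`, `STATEID_OTHER_SIZE`, and all three functions are hypothetical names for illustration, not the kernel's code.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define STATEID_OTHER_SIZE 12	/* size assumed for this sketch */

struct stateid { char other[STATEID_OTHER_SIZE]; };

/* Boolean matcher: true when the "other" fields are identical. */
static bool stateid_match_other(const struct stateid *a,
				const struct stateid *b)
{
	assert(a != NULL && b != NULL);
	return memcmp(a->other, b->other, STATEID_OTHER_SIZE) == 0;
}

/* Buggy conversion: keeps the memcmp() sense, so it skips the match. */
static bool should_skip_buggy(const struct stateid *a,
			      const struct stateid *b)
{
	return stateid_match_other(a, b);
}

/* Corrected conversion: skip only the entries that do not match. */
static bool should_skip(const struct stateid *a, const struct stateid *b)
{
	return !stateid_match_other(a, b);
}
```

With the buggy form, a loop that skips entries when the function returns true would pass over exactly the copy it is looking for.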

xprtrdma: Don't leak freed MRs · f85adb1b

Committed by Chuck Lever on Dec 19, 2018

Defensive clean up. Don't set frwr->fr_mr until we know that the
scatterlist allocation has succeeded.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
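The ordering described in f85adb1b can be sketched in user space as follows. The structures and `mr_init` are hypothetical simplifications, and `fail_sg` exists only to simulate a failed scatterlist allocation; the point is that the MR pointer is published only after every allocation has succeeded.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical, simplified stand-ins for the kernel structures. */
struct frwr { void *fr_mr; };
struct mr { struct frwr frwr; void *mr_sg; };

/* Returns 0 on success, -1 on failure. fail_sg simulates an
 * allocation failure for the scatterlist. */
static int mr_init(struct mr *mr, size_t depth, int fail_sg)
{
	void *hw_mr;
	void *sg;

	assert(depth > 0);
	hw_mr = malloc(64);	/* stands in for the device MR */
	if (hw_mr == NULL)
		return -1;

	sg = fail_sg ? NULL : calloc(depth, sizeof(long));
	if (sg == NULL) {
		free(hw_mr);	/* freed before it was ever published */
		return -1;
	}

	/* Only now is it safe to record both pointers. */
	mr->frwr.fr_mr = hw_mr;
	mr->mr_sg = sg;
	return 0;
}
```

On the failure path `mr->frwr.fr_mr` is never set, so no later cleanup code can find and free the MR a second time.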

xprtrdma: Add documenting comment for rpcrdma_buffer_destroy · af65ed40

Committed by Chuck Lever on Dec 19, 2018

Make a note of the function's dependency on an earlier ib_drain_qp.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>

xprtrdma: Replace outdated comment for rpcrdma_ep_post · 995d312a

Committed by Chuck Lever on Dec 19, 2018

Since commit 7c8d9e7c ("xprtrdma: Move Receive posting to
Receive handler"), rpcrdma_ep_post is no longer responsible for
posting Receive buffers. Update the documenting comment to reflect
this change.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>

xprtrdma: Update comments in frwr_op_send · e0f86bc4

Committed by Chuck Lever on Dec 19, 2018

Commit f2877623 ("xprtrdma: Chain Send to FastReg WRs") was
written before commit ce5b3717 ("xprtrdma: Replace all usage of
"frmr" with "frwr""), but was merged afterwards. Thus it still
refers to FRMR and MWs.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>

SUNRPC: Fix some kernel doc complaints · acf0a39f

Committed by Chuck Lever on Dec 19, 2018

Clean up some warnings observed when building with "make W=1".
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>

SUNRPC: Simplify defining common RPC trace events · dc5820bd

Committed by Chuck Lever on Dec 19, 2018

Clean up, no functional change is expected.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>

NFS: Fix NFSv4 symbolic trace point output · 5b2095d0

Committed by Chuck Lever on Dec 19, 2018

These symbolic values were not being displayed in string form.
TRACE_DEFINE_ENUM was missing in many cases. It also turns out that
__print_symbolic wants an unsigned long in the first field...
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>

xprtrdma: Trace mapping, alloc, and dereg failures · 53b2c1cb

Committed by Chuck Lever on Dec 19, 2018

These are rare, but can be helpful for tracking down DMAR and other
problems.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>

xprtrdma: Add trace points for calls to transport switch methods · 395069fc

Committed by Chuck Lever on Dec 19, 2018

Name them "trace_xprtrdma_op_*" so they can be easily enabled as a
group. No trace point is added where the generic layer already has
observability.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>

xprtrdma: Relocate the xprtrdma_mr_map trace points · ba217ec6

Committed by Chuck Lever on Dec 19, 2018

The mr_map trace points were capturing information about the previous
use of the MR rather than about the segment that was just mapped.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>

xprtrdma: Clean up of xprtrdma chunk trace points · aba11831

Committed by Chuck Lever on Dec 19, 2018

The chunk-related trace points capture nearly the same information
as the MR-related trace points.

Also, rename them so globbing can be used to enable or disable
these trace points more easily.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>

xprtrdma: Remove unused fields from rpcrdma_ia · 9bef848f

Committed by Chuck Lever on Dec 19, 2018

Clean up. The last use of these fields was in commit 173b8f49
("xprtrdma: Demote "connect" log messages").
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>

xprtrdma: Cull dprintk() call sites · ddbb347f

Committed by Chuck Lever on Dec 19, 2018

Clean up: Remove dprintk() call sites that report rare or impossible
errors. Leave a few that display high-value, low-noise status
information.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>

xprtrdma: Simplify locking that protects the rl_allreqs list · 92f4433e

Committed by Chuck Lever on Dec 19, 2018

Clean up: There's little chance of contention between the use of
rb_lock and rb_reqslock, so merge the two. This avoids having to
take both in some (possibly future) cases.

Transport tear-down is already serialized, thus there is no need for
locking at all when destroying rpcrdma_reqs.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>

xprtrdma: Expose transport header errors · 236b0943

Committed by Chuck Lever on Dec 19, 2018

For better observability of parsing errors, return the error code
generated in the decoders to the upper layer consumer.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>

xprtrdma: Remove request_module from backchannel · 889ee07f

Committed by Chuck Lever on Dec 19, 2018

Since commit ffe1f0df ("rpcrdma: Merge svcrdma and xprtrdma
modules into one"), the forward and backchannel components are part
of the same kernel module. A separate request_module() call in the
backchannel code is no longer necessary.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>

xprtrdma: Recognize XDRBUF_SPARSE_PAGES · 15303d9e

Committed by Chuck Lever on Dec 19, 2018

Commit 431f6eb3 ("SUNRPC: Add a label for RPC calls that require
allocation on receive") didn't update similar logic in rpc_rdma.c.
I don't think this is a bug, per se; the commit just adds more
careful checking for broken upper layer behavior.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>

NFS: Make "port=" mount option optional for RDMA mounts · 0dfbb5f0

Committed by Chuck Lever on Dec 19, 2018

Having to specify "proto=rdma,port=20049" is cumbersome.

RFC 8267 Section 6.3 requires NFSv4 clients to use "the alternative
well-known port number", which is 20049. Make the use of the well-
known port number automatic, just as it is for NFS/TCP and port
2049.

For NFSv2/3, Section 4.2 allows clients to simply choose 20049 as
the default or use rpcbind. I don't know of an NFS/RDMA server
implementation that registers its NFS/RDMA service with rpcbind,
so automatically choosing 20049 seems like the better choice. The
other widely-deployed NFS/RDMA client, Solaris, also uses 20049
as the default port.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
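The port selection described in 0dfbb5f0 amounts to a per-transport default that applies only when no "port=" option was given. A minimal sketch follows; the enum, the macro names, and `nfs_pick_port` are chosen for illustration and are not taken from the patch. Only the port numbers 2049 and 20049 come from the commit message.

```c
#include <assert.h>

/* Names chosen for this sketch; only the port numbers come from the text. */
enum sketch_proto { SKETCH_PROTO_TCP, SKETCH_PROTO_RDMA };

#define SKETCH_NFS_PORT		2049
#define SKETCH_NFS_RDMA_PORT	20049

/* requested_port == 0 means the user gave no "port=" mount option. */
static unsigned int nfs_pick_port(enum sketch_proto proto,
				  unsigned int requested_port)
{
	assert(requested_port <= 65535);
	if (requested_port != 0)
		return requested_port;	/* an explicit option always wins */
	return proto == SKETCH_PROTO_RDMA ? SKETCH_NFS_RDMA_PORT
					  : SKETCH_NFS_PORT;
}
```

With this behavior, "proto=rdma" alone is enough, while "proto=rdma,port=20049" keeps working as before.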

xprtrdma: Plant XID in on-the-wire RDMA offset (FRWR) · 0a93fbcb

Committed by Chuck Lever on Dec 19, 2018

Place the associated RPC transaction's XID in the upper 32 bits of
each RDMA segment's rdma_offset field. There are two reasons to do
this:

- The R_key only has 8 bits that are different from registration to
  registration. The XID adds more uniqueness to each RDMA segment to
  reduce the likelihood of a software bug on the server reading from
  or writing into memory it's not supposed to.

- On-the-wire RDMA Read and Write requests do not otherwise carry
  any identifier that matches them up to an RPC. The XID in the
  upper 32 bits will act as an eye-catcher in network captures.
Suggested-by: Tom Talpey <ttalpey@microsoft.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
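The bit layout described in 0a93fbcb can be sketched with plain integer arithmetic. The function names `plant_xid` and `offset_to_xid` are hypothetical, and byte-order conversion of the XID is left out; the sketch only shows the XID replacing the upper 32 bits of a 64-bit offset while the lower 32 bits are preserved.

```c
#include <assert.h>
#include <stdint.h>

/* Replace the upper 32 bits of a 64-bit RDMA offset with the XID. */
static uint64_t plant_xid(uint64_t offset, uint32_t xid)
{
	offset &= UINT64_C(0x00000000ffffffff);	/* keep the low 32 bits */
	offset |= (uint64_t)xid << 32;		/* XID goes in the high 32 */
	assert((uint32_t)(offset >> 32) == xid);
	return offset;
}

/* Recover the XID, e.g. when reading a network capture. */
static uint32_t offset_to_xid(uint64_t offset)
{
	return (uint32_t)(offset >> 32);
}
```

For example, planting XID 0xdeadbeef into offset 0x1234567890abcdef yields 0xdeadbeef90abcdef, which is easy to spot in a packet trace.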

xprtrdma: Remove rpcrdma_memreg_ops · 5f62412b

Committed by Chuck Lever on Dec 19, 2018

Clean up: Now that there is only FRWR, there is no need for a memory
registration switch. The indirect calls to the memreg operations can
be replaced with faster direct calls.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>

xprtrdma: Remove support for FMR memory registration · ba69cd12

Committed by Chuck Lever on Dec 19, 2018

FMR is not supported on most recent RDMA devices. It is also less
secure than FRWR because an FMR memory registration can expose
adjacent bytes to remote reading or writing. As discussed during the
RDMA BoF at LPC 2018, it is time to remove support for FMR in the
NFS/RDMA client stack.

Note that NFS/RDMA server-side uses either local memory registration
or FRWR. FMR is not used.

There are a few InfiniBand/RoCE devices in the kernel tree that do
not appear to support MEM_MGT_EXTENSIONS (FRWR), and therefore will
not support client-side NFS/RDMA after this patch. These are:

 - mthca
 - qib
 - hns (RoCE)

Users of these devices can use NFS/TCP on IPoIB instead.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>

xprtrdma: Reduce max_frwr_depth · a7886849

Committed by Chuck Lever on Dec 19, 2018

Some devices advertise a large max_fast_reg_page_list_len
capability, but perform optimally when MRs are significantly smaller
than that depth -- probably when the MR itself is no larger than a
page.

By default, the RDMA R/W core API uses max_sge_rd as the maximum
page depth for MRs. For some devices, the value of max_sge_rd is
1, which is also not optimal. Thus, when max_sge_rd is larger than
1, use that value. Otherwise use the value of the
max_fast_reg_page_list_len attribute.

I've tested this with CX-3 Pro, FastLinq, and CX-5 devices. It
reproducibly improves the throughput of large I/Os by several
percent.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
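The selection rule stated in a7886849 is short enough to write out directly. In this sketch the function name `choose_frwr_depth` is hypothetical, and any further capping the real code may apply is omitted; the two parameters correspond to the device attributes named in the commit message.

```c
#include <assert.h>

/* Pick the MR page depth: prefer max_sge_rd when it is larger than 1,
 * otherwise fall back to max_fast_reg_page_list_len. */
static unsigned int choose_frwr_depth(unsigned int max_sge_rd,
				      unsigned int max_fast_reg_page_list_len)
{
	assert(max_fast_reg_page_list_len > 0);
	if (max_sge_rd > 1)
		return max_sge_rd;
	return max_fast_reg_page_list_len;
}
```

A device that reports a max_sge_rd of 30 would therefore get MRs of depth 30, while one that reports 1 would keep using its full fast-registration page list length.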

xprtrdma: Fix ri_max_segs and the result of ro_maxpages · 6946f823

Committed by Chuck Lever on Dec 19, 2018

With certain combinations of krb5i/p, MR size, and r/wsize, I/O can
fail with EMSGSIZE. This is because the calculated value of
ri_max_segs (the max number of MRs per RPC) exceeded
RPCRDMA_MAX_HDR_SEGS, which caused Read or Write list encoding to
walk off the end of the transport header.

Once that was addressed, the ro_maxpages result has to be corrected
to account for the number of MRs needed for Reply chunks, which is
2 MRs smaller than a normal Read or Write chunk.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
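One way to read the two corrections in 6946f823 is as a clamp followed by an adjusted page count. The sketch below is an interpretation under stated assumptions, not the patch itself: the constants are illustrative values rather than the kernel's, the function names are hypothetical, and the "two extra MRs" term is assumed from the commit message's mention of a 2-MR difference.

```c
#include <assert.h>

/* Illustrative limits only; the kernel's actual constants may differ. */
#define SKETCH_MAX_DATA_SEGS	64u
#define SKETCH_MAX_HDR_SEGS	8u

/* MRs one RPC may use, clamped so that chunk list encoding cannot
 * run past the end of the transport header. */
static unsigned int sketch_max_segs(unsigned int frwr_depth)
{
	unsigned int segs;

	assert(frwr_depth > 0);
	segs = SKETCH_MAX_DATA_SEGS / frwr_depth;
	if (segs == 0)
		segs = 1;
	segs += 2;			/* assumed: two extra MRs */
	if (segs > SKETCH_MAX_HDR_SEGS)
		segs = SKETCH_MAX_HDR_SEGS;
	return segs;
}

/* Largest payload in pages, leaving the two extra MRs out of the count. */
static unsigned int sketch_maxpages(unsigned int frwr_depth)
{
	unsigned int pages = (sketch_max_segs(frwr_depth) - 2) * frwr_depth;

	return pages < SKETCH_MAX_DATA_SEGS ? pages : SKETCH_MAX_DATA_SEGS;
}
```

With these illustrative limits, a depth of 4 would ask for 18 MRs without the clamp; with it, the count stays at 8 and the advertised payload shrinks to match.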

xprtrdma: Don't wake pending tasks until disconnect is done · 0c0829bc

Committed by Chuck Lever on Dec 19, 2018

Transport disconnect processing does a "wake pending tasks" at
various points.

Suppose an RPC Reply is being processed. The RPC task that Reply
goes with is waiting on the pending queue. If a disconnect wake-up
happens before reply processing is done, that reply, even if it is
good, is thrown away, and the RPC has to be sent again.

This window apparently does not exist for socket transports because
there is a lock held while a reply is being received which prevents
the wake-up call until after reply processing is done.

To resolve this, all RPC replies being processed on an RPC-over-RDMA
transport have to complete before pending tasks are awoken due to a
transport disconnect.

Callers that already hold the transport write lock may invoke
->ops->close directly. Others use a generic helper that schedules
a close when the write lock can be taken safely.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>

xprtrdma: No qp_event disconnect · 3d433ad8

Committed by Chuck Lever on Dec 19, 2018

After thinking about this more, and auditing other kernel ULP
implementations, I believe that a DISCONNECT cm_event will occur after a
fatal QP event. If that's the case, there's no need for an explicit
disconnect in the QP event handler.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>

xprtrdma: Replace rpcrdma_receive_wq with a per-xprt workqueue · 6d2d0ee2

Committed by Chuck Lever on Dec 19, 2018

To address a connection-close ordering problem, we need the ability
to drain the RPC completions running on rpcrdma_receive_wq for just
one transport. Give each transport its own RPC completion workqueue,
and drain that workqueue when disconnecting the transport.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>

xprtrdma: Refactor Receive accounting · 6ceea368

Committed by Chuck Lever on Dec 19, 2018

Clean up: Divide the work cleanly:

- rpcrdma_wc_receive is responsible only for RDMA Receives
- rpcrdma_reply_handler is responsible only for RPC Replies
- the posted send and receive counts both belong in rpcrdma_ep
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>

xprtrdma: Ensure MRs are DMA-unmapped when posting LOCAL_INV fails · b674c4b4

Committed by Chuck Lever on Dec 19, 2018

The recovery case in frwr_op_unmap_sync needs to DMA unmap each MR.
frwr_release_mr does not DMA-unmap, but the recycle worker does.

Fixes: 61da886b ("xprtrdma: Explicitly resetting MRs is ... ")
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>

xprtrdma: Yet another double DMA-unmap · e2f34e26

Committed by Chuck Lever on Dec 19, 2018

While chasing yet another set of DMAR fault reports, I noticed that
the frwr recycler conflates whether or not an MR has been DMA
unmapped with frwr->fr_state. Actually the two have only an indirect
relationship. It's in fact impossible to guess reliably whether the
MR has been DMA unmapped based on its fr_state field, especially as
the surrounding code and its assumptions have changed over time.

A better approach is to track the DMA mapping status explicitly so
that the recycler is less brittle to unexpected situations, and
attempts to DMA-unmap a second time are prevented.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Cc: stable@vger.kernel.org # v4.20
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
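The approach in e2f34e26, tracking the DMA mapping status explicitly instead of inferring it from `fr_state`, can be sketched like this. The structure, the `dma_mapped` flag, and the helper names are hypothetical; the kernel may record the mapping status in a different field.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, simplified MR with an explicit mapping flag. */
enum sketch_fr_state { SKETCH_FRWR_INVALID, SKETCH_FRWR_VALID };

struct sketch_mr {
	enum sketch_fr_state fr_state;	/* registration state only */
	bool dma_mapped;		/* mapping status, tracked separately */
	int unmap_calls;		/* counts simulated DMA unmaps */
};

static void sketch_mr_dma_map(struct sketch_mr *mr)
{
	assert(!mr->dma_mapped);	/* mapping twice would be a bug */
	mr->dma_mapped = true;
}

/* Safe to call from any recovery path: unmaps at most once. */
static void sketch_mr_dma_unmap(struct sketch_mr *mr)
{
	if (!mr->dma_mapped)
		return;
	mr->unmap_calls++;		/* stands in for the real DMA unmap */
	mr->dma_mapped = false;
}
```

Because the unmap helper consults only `dma_mapped`, a recycler that runs after another path has already unmapped the MR becomes a no-op regardless of `fr_state`.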

22 Dec, 2018 1 commit

NFS: nfs_compare_mount_options always compare auth flavors. · 594d1644

Committed by Chris Perl on Dec 17, 2018

This patch removes the check from nfs_compare_mount_options to see if a
`sec' option was passed for the current mount before comparing auth
flavors and instead just always compares auth flavors.

Consider the following scenario:

You have a server with the address 192.168.1.1 and two exports /export/a
and /export/b.  The first export supports `sys' and `krb5' security, the
second just `sys'.

Assume you start with no mounts from the server.

The following results in EIOs being returned as the kernel nfs client
incorrectly thinks it can share the underlying `struct nfs_server's:

$ mkdir /tmp/{a,b}
$ sudo mount -t nfs -o vers=3,sec=krb5 192.168.1.1:/export/a /tmp/a
$ sudo mount -t nfs -o vers=3          192.168.1.1:/export/b /tmp/b
$ df >/dev/null
df: ‘/tmp/b’: Input/output error
Signed-off-by: Chris Perl <cperl@janestreet.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>

20 Dec, 2018 8 commits

SUNRPC discard cr_uid from struct rpc_cred. · 04d1532b

Committed by NeilBrown on Dec 03, 2018

Just use ->cr_cred->fsuid directly.
Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>

SUNRPC: simplify auth_unix. · 2edd8d74

Committed by NeilBrown on Dec 03, 2018

1/ discard 'struct unx_cred'.  We don't need any data that
   is not already in 'struct rpc_cred'.
2/ Don't keep these creds in a hash table.  When a credential
   is needed, simply allocate it.  When not needed, discard it.
   This can easily be faster than performing a lookup on
   a shared hash table.
   As the lookup can happen during write-out, use a mempool
   to ensure forward progress.
   This means that we cannot compare two credentials for
   equality by comparing the pointers, but we never do that anyway.
Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>

SUNRPC: remove crbind rpc_cred operation · d6efccd9

Committed by NeilBrown on Dec 03, 2018

This now always just does get_rpccred(), so we
don't need an operation pointer to know to do that.
Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>

SUNRPC: remove generic cred code. · 89a4f758

Committed by NeilBrown on Dec 03, 2018

This is no longer used.
Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>

NFS/NFSD/SUNRPC: replace generic creds with 'struct cred'. · a52458b4

Committed by NeilBrown on Dec 03, 2018

SUNRPC has two sorts of credentials, both of which appear as
"struct rpc_cred".
There are "generic credentials" which are supplied by clients
such as NFS and passed in 'struct rpc_message' to indicate
which user should be used to authorize the request, and there
are low-level credentials such as AUTH_NULL, AUTH_UNIX, AUTH_GSS
which describe the credential to be sent over the wire.

This patch replaces all the generic credentials by 'struct cred'
pointers - the credential structure used throughout Linux.

For machine credentials, there is a special 'struct cred *' pointer
which is statically allocated and recognized where needed as
having a special meaning.  A look-up of a low-level cred will
map this to a machine credential.
Signed-off-by: NeilBrown <neilb@suse.com>
Acked-by: J. Bruce Fields <bfields@redhat.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>

NFS: struct nfs_open_dir_context: convert rpc_cred pointer to cred. · 684f39b4

Committed by NeilBrown on Dec 03, 2018

Use the common 'struct cred' to pass credentials for readdir.
Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>

NFS: change access cache to use 'struct cred'. · b68572e0

Committed by NeilBrown on Dec 03, 2018

Rather than keying the access cache with 'struct rpc_cred',
use 'struct cred'.  Then use cred_fscmp() to compare
credentials rather than comparing the raw pointer.

A benefit of this approach is that in the common case we avoid the
rpc_lookup_cred_nonblock() call which can be slow when the cred cache is large.
This also keeps many fewer items pinned in the rpc cred cache, so the
cred cache is less likely to get large.
Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
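The difference between comparing credential pointers and comparing credential contents, as in b68572e0, can be shown with a reduced model. `struct sketch_cred` and `sketch_cred_fscmp` are hypothetical; this sketch compares only fsuid and fsgid and leaves out anything else the kernel's cred_fscmp() may take into account.

```c
#include <assert.h>
#include <stddef.h>

/* Reduced credential: only the filesystem uid and gid. */
struct sketch_cred {
	unsigned int fsuid;
	unsigned int fsgid;
};

/* Returns 0 when the two creds are equivalent for filesystem access,
 * a negative or positive value otherwise (usable for ordered lookups). */
static int sketch_cred_fscmp(const struct sketch_cred *a,
			     const struct sketch_cred *b)
{
	assert(a != NULL && b != NULL);
	if (a == b)
		return 0;
	if (a->fsuid != b->fsuid)
		return a->fsuid < b->fsuid ? -1 : 1;
	if (a->fsgid != b->fsgid)
		return a->fsgid < b->fsgid ? -1 : 1;
	return 0;
}
```

Two separately allocated credentials with the same ids compare equal here, so an access cache keyed this way can hit without first looking up a shared credential object.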

SUNRPC: remove RPCAUTH_AUTH_NO_CRKEY_TIMEOUT · 354698b7

Committed by NeilBrown on Dec 03, 2018

This is no longer used.
Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
