Commits · a97c331f9aa9080706a7835225d9d82e832e0bb6 · openanolis / cloud-kernel

16 Jan, 2015 10 commits

svcrdma: Handle additional inline content · a97c331f

Committed by Chuck Lever on Jan 13, 2015

Most NFS RPCs place their large payload argument at the end of the
RPC header (e.g., NFSv3 WRITE). For NFSv3 WRITE and SYMLINK, RPC/RDMA
sends the complete RPC header inline, and the payload argument in
the read list. Data in the read list is the last part of the XDR
stream.

One important case is not like this, however. NFSv4 COMPOUND is a
counted array of operations. A WRITE operation, with its large data
payload, can appear in the middle of the compound's operations
array. Thus NFSv4 WRITE compounds can have header content after the
WRITE payload.

The Linux client, for example, performs an NFSv4 WRITE like this:

  { PUTFH, WRITE, GETATTR }

Though RFC 5667 is not precise about this, the proper way to convey
this compound is to place the GETATTR inline, _after_ the front of
the RPC header. The receiver inserts the read list payload into the
XDR stream after the initial WRITE arguments, and before the GETATTR
operation, thanks to the value of the read list "position" field.

The Linux client currently sends the GETATTR at the end of the
RPC/RDMA read list, which is incorrect. It will be corrected in the
future.

The Linux server currently rejects NFSv4 compounds with inline
content after the read list. For the above NFSv4 WRITE compound, the
NFS compound header indicates there are three operations, but the
server finds nonsense when it looks in the XDR stream for the third
operation, and the compound fails with OP_ILLEGAL.

Move trailing inline content to the end of the XDR buffer's page
list. This presents incoming NFSv4 WRITE compounds to NFSD in the
same way the socket transport does.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Reviewed-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
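The placement rule described in this commit message can be sketched in a few lines of userspace C. This is only an illustration, not the svcrdma code: the helper name xdr_splice_read_chunk and its flat-buffer interface are assumptions made for the example, and XDR padding of the payload is ignored. It shows how the read list "position" value splits the inline bytes into a front part (e.g. PUTFH and the WRITE arguments) and a trailing part (e.g. GETATTR), with the payload landing in between.

```c
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper, not the svcrdma implementation: rebuild a flat
 * XDR stream from the inline bytes and a single read-list payload.
 *
 * inline_buf : bytes that arrived via RDMA SEND
 * position   : read list "position", a byte offset into the XDR stream
 * payload    : bytes fetched for the read list
 *
 * Returns the total stream length, or 0 if the arguments do not fit.
 */
static size_t xdr_splice_read_chunk(unsigned char *out, size_t out_len,
                                    const unsigned char *inline_buf,
                                    size_t inline_len, size_t position,
                                    const unsigned char *payload,
                                    size_t payload_len)
{
    if (position > inline_len || inline_len > out_len ||
        payload_len > out_len - inline_len)
        return 0;

    /* Inline content in front of the payload. */
    memcpy(out, inline_buf, position);
    /* The read-list payload is inserted at "position". */
    memcpy(out + position, payload, payload_len);
    /* Trailing inline content follows the payload. */
    memcpy(out + position + payload_len, inline_buf + position,
           inline_len - position);

    return inline_len + payload_len;
}
```

With six inline bytes, a position of 4, and a four-byte payload, the result is the first four inline bytes, then the payload, then the last two inline bytes.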


svcrdma: Move read list XDR round-up logic · fcbeced5

Committed by Chuck Lever on Jan 13, 2015

This is a prerequisite for a subsequent patch.

Read list XDR round-up needs to be done _before_ additional inline
content is copied to the end of the XDR buffer's page list. Move
the logic added by commit e560e3b5 ("svcrdma: Add zero padding
if the client doesn't send it").
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Reviewed-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
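As a reminder of what the round-up does, here is a minimal userspace sketch, assuming the usual XDR rule that items are padded with zero bytes to a multiple of four. The names xdr_round_up and xdr_pad_payload are made up for the example and are not the kernel's helpers. Doing this step first matters because any content appended afterwards must start on the padded boundary.

```c
#include <stddef.h>
#include <string.h>

#define XDR_UNIT 4u /* XDR items are padded to 4-byte boundaries */

/* Round a length up to the next multiple of XDR_UNIT. */
static size_t xdr_round_up(size_t len)
{
    return (len + XDR_UNIT - 1) & ~(size_t)(XDR_UNIT - 1);
}

/*
 * Hypothetical helper: zero-fill the pad bytes after a payload that
 * occupies buf[0..payload_len). Returns the padded length, or 0 if the
 * padded payload does not fit in the buffer.
 */
static size_t xdr_pad_payload(unsigned char *buf, size_t buf_len,
                              size_t payload_len)
{
    size_t padded = xdr_round_up(payload_len);

    if (padded < payload_len || padded > buf_len)
        return 0;

    memset(buf + payload_len, 0, padded - payload_len);
    return padded;
}
```

A five-byte payload is padded to eight bytes, so trailing content appended afterwards begins at offset 8 rather than offset 5.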


svcrdma: Support RDMA_NOMSG requests · 0b056c22

Committed by Chuck Lever on Jan 13, 2015

Currently the Linux server cannot decode RDMA_NOMSG type requests.
Operations whose length exceeds the fixed size of RDMA SEND buffers,
like large NFSv4 CREATE(NF4LNK) operations, must be conveyed via
RDMA_NOMSG.

For an RDMA_MSG type request, the client sends the RPC/RDMA, RPC
headers, and some or all of the NFS arguments via RDMA SEND.

For an RDMA_NOMSG type request, the client sends just the RPC/RDMA
header via RDMA SEND. The request's read list contains elements for
the entire RPC message, including the RPC header.

NFSD expects the RPC/RDMA header and RPC header to be contiguous in
page zero of the XDR buffer. Add logic in the RDMA READ path to make
the read list contents land where the server prefers, when the
incoming message is a type RDMA_NOMSG message.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Reviewed-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>


svcrdma: rc_position sanity checking · 61edbcb7

Committed by Chuck Lever on Jan 13, 2015

An RPC/RDMA client may send large RPC arguments via a read
list. This is a list of scatter/gather elements which convey
RPC call arguments too large to fit in a small RDMA SEND.

Each entry in the read list has a "position" field, whose value is
the byte offset in the XDR stream where the data in that entry is to
be inserted. Entries which share the same "position" value make up
the same RPC argument. The receiver inserts entries with the same
position field value in list order into the XDR stream.

Currently the Linux NFS/RDMA server cannot handle receiving read
chunks in more than one position, mostly because no current client
sends read lists with elements in more than one position. As a
sanity check, ensure that all received chunks have the same
"rc_position."
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Reviewed-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
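The sanity check amounts to comparing every entry's position with the first one. The sketch below shows that idea only; struct read_chunk and read_list_positions_match are simplified, hypothetical stand-ins, and the real read list entries carry more fields than the two shown here.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified read list entry. */
struct read_chunk {
    uint32_t position; /* byte offset in the XDR stream */
    uint32_t length;   /* bytes conveyed by this entry */
};

/*
 * Return true if every entry in the list shares one "position" value,
 * i.e. the whole list describes a single RPC argument. An empty list
 * trivially passes.
 */
static bool read_list_positions_match(const struct read_chunk *chunks,
                                      size_t count)
{
    for (size_t i = 1; i < count; i++) {
        if (chunks[i].position != chunks[0].position)
            return false;
    }
    return true;
}
```

A receiver that supports only one position would reject the request when this check fails, rather than attempting to place the data.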


svcrdma: Plant reader function in struct svcxprt_rdma · e5452411

Committed by Chuck Lever on Jan 13, 2015

The RDMA reader function doesn't change once an svcxprt_rdma is
instantiated. Instead of checking sc_devcap during every incoming
RPC, set the reader function once when the connection is accepted.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Reviewed-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
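The pattern is a function pointer chosen once at setup time. A minimal sketch follows, with everything in it hypothetical: struct xprt, the DEVCAP_FAST_REG bit, and the two reader functions only stand in for the real transport structure, its sc_devcap field, and the actual reader implementations.

```c
#include <stdint.h>

#define DEVCAP_FAST_REG 0x1u /* hypothetical capability bit */

struct xprt;
typedef int (*reader_fn)(struct xprt *x);

/* Hypothetical, simplified transport. */
struct xprt {
    uint32_t devcap;  /* device capabilities, fixed after accept */
    reader_fn reader; /* chosen once, used for every request */
};

/* Two placeholder readers; the return value just identifies them. */
static int read_with_fast_reg(struct xprt *x) { (void)x; return 1; }
static int read_with_plain_sge(struct xprt *x) { (void)x; return 2; }

/* Connection setup: pick the reader once, based on capabilities. */
static void xprt_accept(struct xprt *x, uint32_t devcap)
{
    x->devcap = devcap;
    x->reader = (devcap & DEVCAP_FAST_REG) ? read_with_fast_reg
                                           : read_with_plain_sge;
}

/* Per-request path: no capability test, just an indirect call. */
static int xprt_handle_request(struct xprt *x)
{
    return x->reader(x);
}
```

The per-request path no longer branches on capabilities, which is the saving the commit describes.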


svcrdma: Find rmsgp more reliably · e5523bd2

Committed by Chuck Lever on Jan 13, 2015

xdr_start() can return the wrong rmsgp address if an assumption
about how the xdr_buf was constructed changes.  When it gets it
wrong, the client receives a reply that has gibberish in the
RPC/RDMA header, preventing it from matching a waiting RPC request.

Instead, make (and document) just one assumption: that the RDMA
header for the client's RPC call is at the start of the first page
in rq_pages.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Reviewed-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>


svcrdma: Scrub BUG_ON() and WARN_ON() call sites · 3fe04ee9

Committed by Chuck Lever on Jan 13, 2015

Current convention is to avoid using BUG_ON() in places where an
oops could cause complete system failure.

Replace BUG_ON() call sites in svcrdma with an assertion error
message and allow execution to continue safely.

Some BUG_ON() calls are removed because they have never fired in
production (that we are aware of).

Some WARN_ON() calls are also replaced where a back trace is not
helpful; e.g., in a workqueue task.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Reviewed-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>


svcrdma: Clean up read chunk counting · 2397aa8b

Committed by Chuck Lever on Jan 13, 2015

The byte_count argument is not used, and the function is called
only from one place.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Reviewed-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>


svcrdma: Remove unused variable · 83f2bedf

Committed by Chuck Lever on Jan 13, 2015

Nit: remove an unused variable to squelch a compiler warning.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Reviewed-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>


svcrdma: Clean up dprintk · 597561bf

Committed by Chuck Lever on Jan 13, 2015

Nit: Fix inconsistent white space in dprintk messages.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Reviewed-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>


08 Jan, 2015 1 commit

rpc: fix xdr_truncate_encode to handle buffer ending on page boundary · 49a068f8

Committed by J. Bruce Fields on Dec 22, 2014

A struct xdr_stream at a page boundary might point to the end of one
page or the beginning of the next, but xdr_truncate_encode isn't
prepared to handle the former.

This can cause corruption of NFSv4 READDIR replies in the case that a
readdir entry that would have exceeded the client's dircount/maxcount
limit would have ended exactly on a 4k page boundary.  You're more
likely to hit this case on large directories.

Other xdr_truncate_encode callers are probably also affected.
Reported-by: Holger Hoffstätte <holger.hoffstaette@googlemail.com>
Tested-by: Holger Hoffstätte <holger.hoffstaette@googlemail.com>
Fixes: 3e19ce76 "rpc: xdr_truncate_encode"
Cc: stable@vger.kernel.org
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
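The ambiguity the message describes can be shown with a small sketch: the same stream offset has two (page, offset) spellings at a page boundary. This is not the xdr_truncate_encode fix itself. The struct, the helper names, and the fixed 4096-byte page size are all assumptions made for the illustration.

```c
#include <stddef.h>

#define SKETCH_PAGE_SIZE 4096u /* assumed page size for the example */

/* Hypothetical cursor: a page index plus an offset within that page. */
struct stream_pos {
    size_t page;
    size_t offset; /* 0..SKETCH_PAGE_SIZE inclusive */
};

/* Flatten a cursor into a byte offset from the start of the pages. */
static size_t pos_to_linear(struct stream_pos p)
{
    return p.page * SKETCH_PAGE_SIZE + p.offset;
}

/*
 * Pick one canonical spelling: "end of page N" becomes
 * "start of page N + 1". Code that assumes the canonical form must
 * normalize first, or handle both spellings explicitly.
 */
static struct stream_pos pos_normalize(struct stream_pos p)
{
    if (p.offset == SKETCH_PAGE_SIZE) {
        p.page++;
        p.offset = 0;
    }
    return p;
}
```

Both {0, 4096} and {1, 0} flatten to byte 4096, which is why code written with only one spelling in mind can go wrong exactly at a page boundary.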


10 Dec, 2014 14 commits

sunrpc/cache: convert to use string_escape_str() · 1b2e122d

Committed by Andy Shevchenko on Nov 28, 2014

There is a nice kernel helper to escape a given string by provided rules. Let's
use it instead of a custom approach.
Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
[bfields@redhat.com: fix length calculation]
Signed-off-by: J. Bruce Fields <bfields@redhat.com>


sunrpc: only call test_bit once in svc_xprt_received · acf06a7f

Committed by Jeff Layton on Dec 01, 2014

...move the WARN_ON_ONCE inside the following if block since they use
the same condition.
Signed-off-by: Jeff Layton <jlayton@primarydata.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>


sunrpc: add some tracepoints around enqueue and dequeue of svc_xprt · 83a712e0

Committed by Jeff Layton on Nov 21, 2014

These were useful when I was tracking down a race condition between
svc_xprt_do_enqueue and svc_get_next_xprt.
Signed-off-by: Jeff Layton <jlayton@primarydata.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>


sunrpc: convert to lockless lookup of queued server threads · b1691bc0

Committed by Jeff Layton on Nov 21, 2014

Testing has shown that the pool->sp_lock can be a bottleneck on a busy
server. Every time data is received on a socket, the server must take
that lock in order to dequeue a thread from the sp_threads list.

Address this problem by eliminating the sp_threads list (which contains
threads that are currently idle) and replacing it with a RQ_BUSY flag in
svc_rqst. This allows us to walk the sp_all_threads list under the
rcu_read_lock and find a suitable thread for the xprt by doing a
test_and_set_bit.

Note, however, that this approach still has a potential atomicity
problem. We don't want svc_xprt_do_enqueue to set the
rqst->rq_xprt pointer unless a test_and_set_bit of RQ_BUSY returned
zero (which indicates that the thread was idle). But, by the time we
check that, the bit could be flipped by a waking thread.

To address this, we acquire a new per-rqst spinlock (rq_lock) and take
that before doing the test_and_set_bit. If that returns false, then we
can set rq_xprt and drop the spinlock. Then, when the thread wakes up,
it must set the bit under the same spinlock and can trust that if it was
already set then the rq_xprt is also properly set.

With this scheme, the case where we have an idle thread no longer needs
to take the highly contended pool->sp_lock at all, and that removes the
bottleneck.

That still leaves one issue: What of the case where we walk the whole
sp_all_threads list and don't find an idle thread? Because the search is
lockless, it's possible for the queueing to race with a thread that is
going to sleep. To address that, we queue the xprt and then search again.

If we find an idle thread at that point, we can't attach the xprt to it
directly since that might race with a different thread waking up and
finding it. All we can do is wake the idle thread back up and let it
attempt to find the now-queued xprt.
Signed-off-by: Jeff Layton <jlayton@primarydata.com>
Tested-by: Chris Worley <chris.worley@primarydata.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
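The core of the idle-thread search can be sketched in userspace with C11 atomics. This is a deliberately reduced illustration and not the sunrpc code: struct worker and the function names are hypothetical, an atomic exchange stands in for test_and_set_bit on RQ_BUSY, and both the RCU-protected list walk and the per-rqst rq_lock described above are left out.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a server thread's request structure. */
struct worker {
    atomic_bool busy; /* plays the role of the RQ_BUSY bit */
    int xprt_id;      /* plays the role of the rq_xprt pointer */
};

static void worker_init(struct worker *w)
{
    atomic_init(&w->busy, false);
    w->xprt_id = -1;
}

/*
 * Walk all workers and claim the first idle one. atomic_exchange()
 * returns the previous value, so "false" means the worker was idle
 * and this caller is the one that marked it busy.
 */
static struct worker *claim_idle_worker(struct worker *pool, size_t n,
                                        int xprt_id)
{
    for (size_t i = 0; i < n; i++) {
        if (!atomic_exchange(&pool[i].busy, true)) {
            pool[i].xprt_id = xprt_id;
            return &pool[i];
        }
    }
    return NULL; /* no idle worker: the caller must queue the work */
}
```

Because the claim is a single atomic operation per worker, two callers can never both win the same idle worker. The gap between winning the flag and storing xprt_id is the atomicity problem the commit closes with rq_lock.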


sunrpc: fix potential races in pool_stats collection · 403c7b44

Committed by Jeff Layton on Nov 21, 2014

In a later patch, we'll be removing some spinlocking around the socket
and thread queueing code in order to fix some contention problems. At
that point, the stats counters will no longer be protected by the
sp_lock.

Change the counters to atomic_long_t fields, except for the
"sockets_queued" counter which will still be manipulated under a
spinlock.
Signed-off-by: Jeff Layton <jlayton@primarydata.com>
Tested-by: Chris Worley <chris.worley@primarydata.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
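A userspace analogue of the change, using C11 atomic_long in place of the kernel's atomic_long_t. The structure and the "packets" and "threads_woken" field names are illustrative assumptions; only "sockets_queued" is named in the message above.

```c
#include <stdatomic.h>

/* Hypothetical stats block for one pool. */
struct pool_stats_sketch {
    atomic_long packets;          /* updated without any lock */
    atomic_long threads_woken;    /* updated without any lock */
    unsigned long sockets_queued; /* still updated under a spinlock */
};

static void stats_init(struct pool_stats_sketch *s)
{
    atomic_init(&s->packets, 0);
    atomic_init(&s->threads_woken, 0);
    s->sockets_queued = 0;
}

/* Safe to call from any thread, with no lock held. */
static void stats_count_packet(struct pool_stats_sketch *s)
{
    atomic_fetch_add(&s->packets, 1);
}

static long stats_packets(struct pool_stats_sketch *s)
{
    return atomic_load(&s->packets);
}
```

The plain field stays non-atomic because every update to it already happens with a lock held, so making it atomic would add cost for no benefit.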


sunrpc: add a rcu_head to svc_rqst and use kfree_rcu to free it · 81244386

Committed by Jeff Layton on Nov 21, 2014

...also make the manipulation of sp_all_threads list use RCU-friendly
functions.
Signed-off-by: Jeff Layton <jlayton@primarydata.com>
Tested-by: Chris Worley <chris.worley@primarydata.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>


sunrpc: require svc_create callers to pass in meaningful shutdown routine · 0b5707e4