Commits · 7928b2cbe55b2a410a0f5c1f154610059c57b1b2 · openanolis / cloud-kernel

12 Feb, 2018 1 commit

vfs: do bulk POLL* -> EPOLL* replacement · a9a08845

Authored by Linus Torvalds on Feb 11, 2018

This is the mindless scripted replacement of kernel use of POLL*
variables as described by Al, done by this script:

    for V in IN OUT PRI ERR RDNORM RDBAND WRNORM WRBAND HUP RDHUP NVAL MSG; do
        L=`git grep -l -w POLL$V | grep -v '^t' | grep -v /um/ | grep -v '^sa' | grep -v '/poll.h$'|grep -v '^D'`
        for f in $L; do sed -i "-es/^\([^\"]*\)\(\<POLL$V\>\)/\\1E\\2/" $f; done
    done

with de-mangling cleanups yet to come.

NOTE! On almost all architectures, the EPOLL* constants have the same
values as the POLL* constants do.  But the keyword here is "almost".
For various bad reasons they aren't the same, and epoll() doesn't
actually work quite correctly in some cases due to this on Sparc et al.

The next patch from Al will sort out the final differences, and we
should be all done.
Scripted-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

a9a08845
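
The note about "almost all architectures" can be checked from userspace. The following is a minimal sketch, assuming a Linux libc that ships `<poll.h>` and `<sys/epoll.h>`; it compares the userspace header values rather than the kernel-internal definitions the commit touches, and the names `flag_pairs`, `count_poll_epoll_mismatches` and `assert_flags_identical` are made up for this illustration.

```c
#include <assert.h>
#include <poll.h>
#include <stddef.h>
#include <sys/epoll.h>

/* One POLL* / EPOLL* pair, stored by value so the two can be compared. */
struct flag_pair {
    const char *name;
    unsigned int poll_value;
    unsigned int epoll_value;
};

static const struct flag_pair flag_pairs[] = {
    { "IN",  POLLIN,  EPOLLIN  },
    { "PRI", POLLPRI, EPOLLPRI },
    { "OUT", POLLOUT, EPOLLOUT },
    { "ERR", POLLERR, EPOLLERR },
    { "HUP", POLLHUP, EPOLLHUP },
#ifdef POLLRDNORM /* only visible when the libc exposes the XSI flags */
    { "RDNORM", POLLRDNORM, EPOLLRDNORM },
    { "RDBAND", POLLRDBAND, EPOLLRDBAND },
    { "WRNORM", POLLWRNORM, EPOLLWRNORM },
    { "WRBAND", POLLWRBAND, EPOLLWRBAND },
#endif
};

/* Count the flags whose poll and epoll values differ on this build target. */
size_t count_poll_epoll_mismatches(void)
{
    size_t mismatches = 0;

    for (size_t i = 0; i < sizeof(flag_pairs) / sizeof(flag_pairs[0]); i++) {
        if (flag_pairs[i].poll_value != flag_pairs[i].epoll_value)
            mismatches++;
    }
    return mismatches;
}

/* Optional smoke test: aborts on a target that is one of the exceptions. */
void assert_flags_identical(void)
{
    assert(count_poll_epoll_mismatches() == 0);
}
```

A non-zero count on a given target would correspond to the exceptions the commit message mentions. The sketch does not cover `RDHUP`, `MSG` or `NVAL`, because their availability in userspace headers varies.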

09 Feb, 2018 3 commits

SUNRPC: Don't call __UDPX_INC_STATS() from a preemptible context · 0afa6b44

Authored by Trond Myklebust on Feb 09, 2018

Calling __UDPX_INC_STATS() from a preemptible context leads to a
warning of the form:

 BUG: using __this_cpu_add() in preemptible [00000000] code: kworker/u5:0/31
 caller is xs_udp_data_receive_workfn+0x194/0x270
 CPU: 1 PID: 31 Comm: kworker/u5:0 Not tainted 4.15.0-rc8-00076-g90ea9f1b #2
 Workqueue: xprtiod xs_udp_data_receive_workfn
 Call Trace:
  dump_stack+0x85/0xc1
  check_preemption_disabled+0xce/0xe0
  xs_udp_data_receive_workfn+0x194/0x270
  process_one_work+0x318/0x620
  worker_thread+0x20a/0x390
  ? process_one_work+0x620/0x620
  kthread+0x120/0x130
  ? __kthread_bind_mask+0x60/0x60
  ret_from_fork+0x24/0x30

Since we're taking a spinlock in those functions anyway, let's fix the
issue by moving the call so that it occurs under the spinlock.
Reported-by: kernel test robot <fengguang.wu@intel.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>

0afa6b44

fix parallelism for rpc tasks · f515f86b

Authored by Olga Kornievskaia on Jun 29, 2017

Hi folks,

On a multi-core machine, is it expected that we can have parallel RPCs
handled by each of the per-core workqueues?

In testing a read workload, I observed via the "top" command that a
single "kworker" thread services the requests (no parallelism).
It's more prominent while doing these operations over a krb5p mount.

Bruce suggested trying this change, and in my testing the read
workload is then spread among all the kworker threads.
Signed-off-by: Olga Kornievskaia <kolga@netapp.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>

f515f86b

svcrdma: Fix Read chunk round-up · 175e0310

Authored by Chuck Lever on Feb 02, 2018

A single NFSv4 WRITE compound can often have three operations:
PUTFH, WRITE, then GETATTR.

When the WRITE payload is sent in a Read chunk, the client places
the GETATTR in the inline part of the RPC/RDMA message, just after
the WRITE operation (sans payload). The position value in the Read
chunk enables the receiver to insert the Read chunk at the correct
place in the received XDR stream; that is between the WRITE and
GETATTR.

According to RFC 8166, an NFS/RDMA client does not have to add XDR
round-up to the Read chunk that carries the WRITE payload. The
receiver adds XDR round-up padding if it is absent and the
receiver's XDR decoder requires it to be present.

Commit 193bcb7b ("svcrdma: Populate tail iovec when receiving")
attempted to add support for receiving such a compound so that just
the WRITE payload appears in rq_arg's page list, and the trailing
GETATTR is placed in rq_arg's tail iovec. (TCP just strings the
whole compound into the head iovec and page list, without regard
to the alignment of the WRITE payload).

The server transport logic also had to accommodate the optional XDR
round-up of the Read chunk, which it did simply by lengthening the
tail iovec when round-up was needed. This approach is adequate for
the NFSv2 and NFSv3 WRITE decoders.

Unfortunately it is not sufficient for nfsd4_decode_write. When the
Read chunk length is a couple of bytes less than PAGE_SIZE, the
computation at the end of nfsd4_decode_write allows argp->pagelen to
go negative, which breaks the logic in read_buf that looks for the
tail iovec.

The result is that a WRITE operation whose payload length is just
less than a multiple of a page succeeds, but the subsequent GETATTR
in the same compound fails with NFS4ERR_OP_ILLEGAL because the XDR
decoder can't find it. Clients ignore the error, but they must
update their attribute cache via a separate round trip.

As nfsd4_decode_write appears to expect the payload itself to always
have appropriate XDR round-up, have svc_rdma_build_normal_read_chunk
add the Read chunk XDR round-up to the page_len rather than
lengthening the tail iovec.
Reported-by: Olga Kornievskaia <kolga@netapp.com>
Fixes: 193bcb7b ("svcrdma: Populate tail iovec when receiving")
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: Olga Kornievskaia <kolga@netapp.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>

175e0310
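
The round-up arithmetic behind the svcrdma fix above can be sketched as follows. This is an illustration only, assuming the usual 4-byte XDR alignment; `struct demo_xdr_buf`, `xdr_pad_size` and `demo_add_read_chunk` are hypothetical names and do not mirror the kernel's `struct xdr_buf` or `svc_rdma_build_normal_read_chunk`.

```c
#include <assert.h>
#include <stddef.h>

#define XDR_UNIT 4u /* XDR items are aligned to 4-byte boundaries */

/* Pad bytes needed to bring len up to the next multiple of XDR_UNIT. */
size_t xdr_pad_size(size_t len)
{
    return (XDR_UNIT - (len & (XDR_UNIT - 1))) & (XDR_UNIT - 1);
}

/* Hypothetical stand-in for the head / page list / tail split of a buffer. */
struct demo_xdr_buf {
    size_t head_len;
    size_t page_len;
    size_t tail_len;
};

/*
 * Account for a Read chunk of chunk_len bytes. The round-up is counted as
 * part of the page data, so the tail length is left untouched.
 */
void demo_add_read_chunk(struct demo_xdr_buf *buf, size_t chunk_len)
{
    size_t padded = chunk_len + xdr_pad_size(chunk_len);

    assert(padded % XDR_UNIT == 0);
    buf->page_len += padded;
}
```

With a 4094-byte payload, `xdr_pad_size` yields 2, so the page data is accounted as 4096 bytes and the tail still begins at the operation that follows the payload.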

08 Feb, 2018 1 commit

Make the xprtiod workqueue unbounded. · 90ea9f1b

Authored by Trond Myklebust on Feb 06, 2018

This should help reduce the latency on replies.
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>

90ea9f1b

07 Feb, 2018 1 commit

SUNRPC: Queue latency-sensitive socket tasks to xprtiod · 2275cde4

Authored by Trond Myklebust on Feb 07, 2018

The response to a write_space notification is very latency sensitive,
so we should queue it to the lower latency xprtiod_workqueue. This
is something we already do for the other cases where an rpc task
holds the transport XPRT_LOCKED bitlock.
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>

2275cde4

06 Feb, 2018 2 commits

SUNRPC: Ensure we always close the socket after a connection shuts down · 9b30889c

Authored by Trond Myklebust on Feb 05, 2018

Ensure that we release the TCP socket once it is in the TCP_CLOSE or
TCP_TIME_WAIT state (and only then) so that we don't confuse rkhunter
and its ilk.
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>

9b30889c

sunrpc: remove dead code in svc_sock_setbufsize · d0945caa

Authored by Christoph Hellwig on Jan 08, 2018

Setting values in struct sock directly is the usual method.  Remove
the long dead code using set_fs() and the related comment.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>

d0945caa

03 Feb, 2018 2 commits

xprtrdma: Fix BUG after a device removal · e89e8d8f

Authored by Chuck Lever on Jan 31, 2018

Michal Kalderon reports a BUG that occurs just after device removal:

[  169.112490] rpcrdma: removing device qedr0 for 192.168.110.146:20049
[  169.143909] BUG: unable to handle kernel NULL pointer dereference at 0000000000000010
[  169.181837] IP: rpcrdma_dma_unmap_regbuf+0xa/0x60 [rpcrdma]

The RPC/RDMA client transport attempts to allocate some resources
on demand. Registered buffers are one such resource. These are
allocated (or re-allocated) by xprt_rdma_allocate to hold RPC Call
and Reply messages. A hardware resource is associated with each of
these buffers, as they can be used for a Send or Receive Work
Request.

If a device is removed from under an NFS/RDMA mount, the transport
layer is responsible for releasing all hardware resources before
the device can be finally unplugged. A BUG results when the NFS
mount hasn't yet seen much activity: the transport tries to release
resources that haven't yet been allocated.

rpcrdma_free_regbuf() already checks for this case, so just move
that check to cover the DEVICE_REMOVAL case as well.
Reported-by: Michal Kalderon <Michal.Kalderon@cavium.com>
Fixes: bebd0318 ("xprtrdma: Support unplugging an HCA ...")
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: Michal Kalderon <Michal.Kalderon@cavium.com>
Cc: stable@vger.kernel.org # v4.12+
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>

e89e8d8f
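
The idea of moving the "not yet allocated" check into the shared unmap helper, so that both the free path and the device-removal path are covered, might look roughly like this. It is a userspace sketch under stated assumptions: `struct demo_regbuf`, `demo_unmap_regbuf` and `demo_device_removal` are hypothetical stand-ins, not the rpcrdma functions themselves.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a registered buffer; not the kernel's struct. */
struct demo_regbuf {
    bool mapped;
};

/*
 * Shared helper: tolerate buffers that were never allocated (NULL) and
 * buffers that were allocated but never mapped. Returns true only when
 * an unmap actually took place.
 */
bool demo_unmap_regbuf(struct demo_regbuf *rb)
{
    if (rb == NULL)
        return false;
    if (!rb->mapped)
        return false;

    rb->mapped = false;
    return true;
}

/*
 * Device-removal path: walk every slot, including slots that have not
 * been populated yet because the mount has seen little activity.
 */
size_t demo_device_removal(struct demo_regbuf **bufs, size_t count)
{
    size_t unmapped = 0;

    assert(bufs != NULL);
    for (size_t i = 0; i < count; i++) {
        if (demo_unmap_regbuf(bufs[i]))
            unmapped++;
    }
    return unmapped;
}
```

Because the NULL check lives in the helper, every caller gets it without repeating it, which is the design point the commit message makes.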

xprtrdma: Fix calculation of ri_max_send_sges · 1179e2c2

Authored by Chuck Lever on Jan 31, 2018

Commit 16f906d6 ("xprtrdma: Reduce required number of send
SGEs") introduced the rpcrdma_ia::ri_max_send_sges field. This fixes
a problem where xprtrdma would not work if the device's max_sge
capability was small (low single digits).

At least RPCRDMA_MIN_SEND_SGES are needed for the inline parts of
each RPC. ri_max_send_sges is set to this value:

  ia->ri_max_send_sges = max_sge - RPCRDMA_MIN_SEND_SGES;

Then when marshaling each RPC, rpcrdma_args_inline uses that value
to determine whether the device has enough Send SGEs to convey an
NFS WRITE payload inline, or whether instead a Read chunk is
required.

More recently, commit ae72950a ("xprtrdma: Add data structure to
manage RDMA Send arguments") used the ri_max_send_sges value to
calculate the size of an array, but that commit erroneously assumed
ri_max_send_sges contains a value similar to the device's max_sge,
and not one that was reduced by the minimum SGE count.

This assumption causes the calculated size of the sendctx's
Send SGE array to be too small. When the array is used to marshal
an RPC, the code can write Send SGEs into the following sendctx
element in that array, corrupting it. When the device's max_sge is
large, this issue is entirely harmless; but it results in an oops
in the provider's post_send method, if dev.attrs.max_sge is small.

So let's straighten this out: ri_max_send_sges will now contain a
value with the same meaning as dev.attrs.max_sge, which makes
the code easier to understand, and enables rpcrdma_sendctx_create
to calculate the size of the SGE array correctly.
Reported-by: Michal Kalderon <Michal.Kalderon@cavium.com>
Fixes: 16f906d6 ("xprtrdma: Reduce required number of send SGEs")
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: Michal Kalderon <Michal.Kalderon@cavium.com>
Cc: stable@vger.kernel.org # v4.10+
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>

1179e2c2
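
The sizing mismatch described in the ri_max_send_sges commit above can be modelled with a few lines of arithmetic. This is a simplified sketch, not the kernel code: `DEMO_MIN_SEND_SGES` is a placeholder for `RPCRDMA_MIN_SEND_SGES`, the helper names are hypothetical, and the model assumes a Send may use the inline SGEs plus every remaining SGE for payload.

```c
#include <assert.h>

#define DEMO_MIN_SEND_SGES 3u /* placeholder for RPCRDMA_MIN_SEND_SGES */

/* SGEs left for payload once the inline parts are accounted for. */
unsigned int demo_payload_sges(unsigned int max_sge)
{
    assert(max_sge >= DEMO_MIN_SEND_SGES);
    return max_sge - DEMO_MIN_SEND_SGES;
}

/* Worst case for one Send in this model: inline parts plus payload. */
unsigned int demo_worst_case_sges(unsigned int max_sge)
{
    return DEMO_MIN_SEND_SGES + demo_payload_sges(max_sge);
}

/* Before the fix: the field held the reduced value and sized the array. */
unsigned int demo_array_len_before(unsigned int max_sge)
{
    unsigned int ri_max_send_sges = max_sge - DEMO_MIN_SEND_SGES;

    return ri_max_send_sges;
}

/* After the fix: the field has the same meaning as the device's max_sge. */
unsigned int demo_array_len_after(unsigned int max_sge)
{
    unsigned int ri_max_send_sges = max_sge;

    return ri_max_send_sges;
}
```

For a device with `max_sge` of 4, the model needs up to 4 entries per Send, while the pre-fix sizing provides only 1. That is the kind of shortfall that lets writes spill into the neighbouring array element.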

23 Jan, 2018 16 commits

SUNRPC: Micro-optimize __rpc_execute · 21ead9ff

Authored by Chuck Lever on Jan 03, 2018

The common case: There are 13 to 14 actions per RPC, and tk_callback
is non-NULL in only one of them. There's no need to store a NULL in
the tk_callback field during each FSM step.

This slightly improves throughput results in dbench and other multi-
threaded benchmarks on my two-socket client on 56Gb InfiniBand, but
will probably be inconsequential on slower systems.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>

21ead9ff
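
The micro-optimization in __rpc_execute above amounts to clearing `tk_callback` only on the step where it is actually set. A minimal userspace sketch of such a state-machine step follows; `struct demo_task`, its counters and the `demo_*` functions are invented for illustration and are not the SUNRPC definitions.

```c
#include <assert.h>
#include <stddef.h>

struct demo_task;
typedef void (*demo_fn)(struct demo_task *);

/* Hypothetical task with a one-shot callback and a regular FSM action. */
struct demo_task {
    demo_fn tk_callback;
    demo_fn tk_action;
    int callbacks_run;
    int actions_run;
    int null_stores; /* how often tk_callback was written back as NULL */
};

void demo_callback(struct demo_task *task) { task->callbacks_run++; }
void demo_action(struct demo_task *task)   { task->actions_run++; }

/* One FSM step: prefer a pending callback, otherwise run the action. */
void demo_run_step(struct demo_task *task)
{
    demo_fn do_action = task->tk_action;

    if (task->tk_callback != NULL) {
        do_action = task->tk_callback;
        task->tk_callback = NULL; /* store only in the uncommon case */
        task->null_stores++;
    }

    assert(do_action != NULL);
    do_action(task);
}
```

Running 14 steps on a task that starts with a pending callback performs a single NULL store instead of 14. The local `do_action` also names the function that really runs in each step, which is the value a trace point would want to report.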

SUNRPC: task_run_action should display tk_callback · cf08d6f2

Authored by Chuck Lever on Jan 03, 2018

This shows up in every RPC:

kworker/4:1-19772 [004] 3467.373443: rpc_task_run_action: task:4711@2 flags=0e81 state=0005 status=0 action=call_status
kworker/4:1-19772 [004] 3467.373444: rpc_task_run_action: task:4711@2 flags=0e81 state=0005 status=0 action=call_status

What's actually going on is that the first iteration of the RPC
scheduler is invoking the function in tk_callback (in this case,
xprt_timer), then invoking call_status on the next iteration.

Feeding do_action, rather than tk_action, to the "task_run_action"
trace point will now always display the correct FSM step.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>

cf08d6f2

SUNRPC: Trace xprt_timer events · 82476d9f

Authored by Chuck Lever on Jan 03, 2018

Track RPC timeouts: report the XID and the server address to match
the content of network capture.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>

82476d9f

xprtrdma: Correct some documenting comments · 9ab6d89e

Authored by Chuck Lever on Jan 03, 2018

Fix kernel-doc warnings in net/sunrpc/xprtrdma/ .

net/sunrpc/xprtrdma/verbs.c:1575: warning: No description found for parameter 'count'
net/sunrpc/xprtrdma/verbs.c:1575: warning: Excess function parameter 'min_reqs' description in 'rpcrdma_ep_post_extra_recv'

net/sunrpc/xprtrdma/backchannel.c:288: warning: No description found for parameter 'r_xprt'
net/sunrpc/xprtrdma/backchannel.c:288: warning: Excess function parameter 'xprt' description in 'rpcrdma_bc_receive_call'
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>

9ab6d89e

xprtrdma: Fix "bytes registered" accounting · aae2349c

Authored by Chuck Lever on Jan 03, 2018

The contents of seg->mr_len changed when ->ro_map stopped returning
the full chunk length in the first segment. Count the full length of
each Write chunk, not the length of the first segment (which now can
only be as large as a page).

Fixes: 9d6b0409 ("xprtrdma: Place registered MWs on a ... ")
Signed-off-by: NChuck Lever <chuck.lever@oracle.com>
Signed-off-by: NAnna Schumaker <Anna.Schumaker@Netapp.com>

aae2349c

xprtrdma: Instrument allocation/release of rpcrdma_req/rep objects · ae724676

Authored by Chuck Lever on Dec 20, 2017

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>

ae724676

xprtrdma: Add trace points to instrument QP and CQ access upcalls · 643cf323

Authored by Chuck Lever on Dec 20, 2017

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>

643cf323

xprtrdma: Add trace points in the client-side backchannel code paths · fc1eb807

Authored by Chuck Lever on Dec 20, 2017

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>

fc1eb807

xprtrdma: Add trace points for connect events · b4744e00

Authored by Chuck Lever on Dec 20, 2017

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>

b4744e00

xprtrdma: Add trace points to instrument MR allocation and recovery · 1c443eff

Authored by Chuck Lever on Dec 20, 2017

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>

1c443eff

xprtrdma: Add trace points to instrument memory invalidation · 2937fede

Authored by Chuck Lever on Dec 20, 2017

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>

2937fede

xprtrdma: Add trace points in reply decoder path · e11b7c96

Authored by Chuck Lever on Dec 20, 2017

This includes decoding Write and Reply chunks, and fixing up inline
payloads.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>

e11b7c96

xprtrdma: Add trace points to instrument memory registration · 58f10ad4

Authored by Chuck Lever on Dec 20, 2017

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>

58f10ad4

xprtrdma: Add trace points in the RPC Reply handler paths · b4a7f91c

Authored by Chuck Lever on Dec 20, 2017

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>

b4a7f91c

xprtrdma: Add trace points in RPC Call transmit paths · ab03eff5

Authored by Chuck Lever on Dec 20, 2017

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>

ab03eff5

rpcrdma: infrastructure for static trace points in rpcrdma.ko · e48f083e

Authored by Chuck Lever on Jan 20, 2018

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>

e48f083e

19 Jan, 2018 1 commit

svcrdma: Post Receives in the Receive completion handler · 48272502

Authored by Chuck Lever on Jan 03, 2018

This change improves Receive efficiency by posting Receives only
on the same CPU that handles Receive completion. Improved latency
and throughput have been noted with this change.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>

48272502

17 Jan, 2018 13 commits

xprtrdma: Introduce rpcrdma_mw_unmap_and_put · ec12e479