提交 · dd6fd213b05e7a1f590b470500343dd97c3a32c1 · openeuler / Kernel

01 12月, 2016 5 次提交

svcrdma: Remove DMA map accounting · dd6fd213

由 Chuck Lever 提交于 11月 29, 2016

Clean up: sc_dma_used is not required for correct operation. It is
simply a debugging tool to report when svcrdma has leaked DMA maps.

However, manipulating an atomic has a measurable CPU cost, and DMA
map accounting specific to svcrdma will be meaningless once svcrdma
is converted to use the new generic r/w API.

A similar kind of debug accounting can be done simply by enabling
the IOMMU or by using CONFIG_DMA_API_DEBUG, CONFIG_IOMMU_DEBUG, and
CONFIG_IOMMU_LEAK.
Signed-off-by: NChuck Lever <chuck.lever@oracle.com>
Signed-off-by: NJ. Bruce Fields <bfields@redhat.com>

dd6fd213

svcrdma: Remove BH-disabled spin locking in svc_rdma_send() · e4eb42ce

由 Chuck Lever 提交于 11月 29, 2016

svcrdma's current SQ accounting algorithm takes sc_lock and disables
bottom-halves while posting all RDMA Read, Write, and Send WRs.

This is relatively heavyweight serialization. And note that Write and
Send are already fully serialized by the xpt_mutex.

Using a single atomic_t should be all that is necessary to guarantee
that ib_post_send() is called only when there is enough space on the
send queue. This is what the other RDMA-enabled storage targets do.
Signed-off-by: NChuck Lever <chuck.lever@oracle.com>
Signed-off-by: NJ. Bruce Fields <bfields@redhat.com>

e4eb42ce

svcrdma: Renovate sendto chunk list parsing · 5fdca653

由 Chuck Lever 提交于 11月 29, 2016

The current sendto code appears to support clients that provide only
one of a Read list, a Write list, or a Reply chunk. My reading of
that code is that it doesn't support the following cases:

 - Read list + Write list
 - Read list + Reply chunk
 - Write list + Reply chunk
 - Read list + Write list + Reply chunk

The protocol allows more than one Read or Write chunk in those
lists. Some clients do send a Read list and Reply chunk
simultaneously. NFSv4 WRITE uses a Read list for the data payload,
and a Reply chunk because the GETATTR result in the reply can
contain a large object like an ACL.

Generalize one of the sendto code paths needed to support all of
the above cases, and attempt to ensure that only one pass is done
through the RPC Call's transport header to gather chunk list
information for building the reply.
Signed-off-by: NChuck Lever <chuck.lever@oracle.com>
Signed-off-by: NJ. Bruce Fields <bfields@redhat.com>

5fdca653

svcauth_gss: Close connection when dropping an incoming message · 4d712ef1

由 Chuck Lever 提交于 11月 29, 2016

S5.3.3.1 of RFC 2203 requires that an incoming GSS-wrapped message
whose sequence number lies outside the current window is dropped.
The rationale is:

  The reason for discarding requests silently is that the server
  is unable to determine if the duplicate or out of range request
  was due to a sequencing problem in the client, network, or the
  operating system, or due to some quirk in routing, or a replay
  attack by an intruder.  Discarding the request allows the client
  to recover after timing out, if indeed the duplication was
  unintentional or well intended.

However, clients may rely on the server dropping the connection to
indicate that a retransmit is needed. Without a connection reset, a
client can wait forever without retransmitting, and the workload
just stops dead. I've reproduced this behavior by running xfstests
generic/323 on an NFSv4.0 mount with proto=rdma and sec=krb5i.

To address this issue, have the server close the connection when it
silently discards an incoming message due to a GSS sequence number
problem.

There are a few other places where the server will never reply.
Change those spots in a similar fashion.
Signed-off-by: NChuck Lever <chuck.lever@oracle.com>
Signed-off-by: NJ. Bruce Fields <bfields@redhat.com>

4d712ef1

svcrdma: Clear xpt_bc_xps in xprt_setup_rdma_bc() error exit arm · 1b9f700b

由 Chuck Lever 提交于 11月 29, 2016

Logic copied from xs_setup_bc_tcp().

Fixes: 39a9beab ('rpc: share one xps between all backchannels')
Signed-off-by: NChuck Lever <chuck.lever@oracle.com>
Signed-off-by: NJ. Bruce Fields <bfields@redhat.com>

1b9f700b

02 11月, 2016 2 次提交

sunrpc: GFP_KERNEL should be GFP_NOFS in crypto code · 56094edd

由 J. Bruce Fields 提交于 10月 26, 2016

Writes may depend on the auth_gss crypto code, so we shouldn't be
allocating with GFP_KERNEL there.

This still leaves some crypto_alloc_* calls which end up doing
GFP_KERNEL allocations in the crypto code.  Those could probably done at
crypto import time.
Signed-off-by: NJ. Bruce Fields <bfields@redhat.com>

56094edd

svcrdma: backchannel cannot share a page for send and rcv buffers · 8d42629b

由 Chuck Lever 提交于 10月 28, 2016

The underlying transport releases the page pointed to by rq_buffer
during xprt_rdma_bc_send_request. When the backchannel reply arrives,
rq_rbuffer then points to freed memory.

Fixes: 68778945 ('SUNRPC: Separate buffer pointers for RPC ...')
Signed-off-by: NChuck Lever <chuck.lever@oracle.com>
Cc: Jeff Layton <jlayton@redhat.com>
Signed-off-by: NJ. Bruce Fields <bfields@redhat.com>

8d42629b

29 10月, 2016 1 次提交

sunrpc: fix some missing rq_rbuffer assignments · 18e601d6

由 Jeff Layton 提交于 10月 24, 2016

We've been seeing some crashes in testing that look like this:

BUG: unable to handle kernel NULL pointer dereference at           (null)
IP: [<ffffffff8135ce99>] memcpy_orig+0x29/0x110
PGD 212ca2067 PUD 212ca3067 PMD 0
Oops: 0002 [#1] SMP
Modules linked in: rpcsec_gss_krb5 nfsv4 dns_resolver nfs fscache ppdev parport_pc i2c_piix4 sg parport i2c_core virtio_balloon pcspkr acpi_cpufreq nfsd auth_rpcgss nfs_acl lockd grace sunrpc ip_tables xfs libcrc32c sd_mod ata_generic pata_acpi virtio_scsi 8139too ata_piix libata 8139cp mii virtio_pci floppy virtio_ring serio_raw virtio
CPU: 1 PID: 1540 Comm: nfsd Not tainted 4.9.0-rc1 #39
Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2007
task: ffff88020d7ed200 task.stack: ffff880211838000
RIP: 0010:[<ffffffff8135ce99>]  [<ffffffff8135ce99>] memcpy_orig+0x29/0x110
RSP: 0018:ffff88021183bdd0  EFLAGS: 00010206
RAX: 0000000000000000 RBX: ffff88020d7fa000 RCX: 000000f400000000
RDX: 0000000000000014 RSI: ffff880212927020 RDI: 0000000000000000
RBP: ffff88021183be30 R08: 01000000ef896996 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000000 R12: ffff880211704ca8
R13: ffff88021473f000 R14: 00000000ef896996 R15: ffff880211704800
FS:  0000000000000000(0000) GS:ffff88021fc80000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000000000 CR3: 0000000212ca1000 CR4: 00000000000006e0
Stack:
 ffffffffa01ea087 ffffffff63400001 ffff880215145e00 ffff880211bacd00
 ffff88021473f2b8 0000000000000004 00000000d0679d67 ffff880211bacd00
 ffff88020d7fa000 ffff88021473f000 0000000000000000 ffff88020d7faa30
Call Trace:
 [<ffffffffa01ea087>] ? svc_tcp_recvfrom+0x5a7/0x790 [sunrpc]
 [<ffffffffa01f84d8>] svc_recv+0xad8/0xbd0 [sunrpc]
 [<ffffffffa0262d5e>] nfsd+0xde/0x160 [nfsd]
 [<ffffffffa0262c80>] ? nfsd_destroy+0x60/0x60 [nfsd]
 [<ffffffff810a9418>] kthread+0xd8/0xf0
 [<ffffffff816dbdbf>] ret_from_fork+0x1f/0x40
 [<ffffffff810a9340>] ? kthread_park+0x60/0x60
Code: 00 00 48 89 f8 48 83 fa 20 72 7e 40 38 fe 7c 35 48 83 ea 20 48 83 ea 20 4c 8b 06 4c 8b 4e 08 4c 8b 56 10 4c 8b 5e 18 48 8d 76 20 <4c> 89 07 4c 89 4f 08 4c 89 57 10 4c 89 5f 18 48 8d 7f 20 73 d4
RIP  [<ffffffff8135ce99>] memcpy_orig+0x29/0x110
 RSP <ffff88021183bdd0>
CR2: 0000000000000000

Both Bruce and Eryu ran a bisect here and found that the problematic
patch was 68778945 (SUNRPC: Separate buffer pointers for RPC Call and
Reply messages).

That patch changed rpc_xdr_encode to use a new rq_rbuffer pointer to
set up the receive buffer, but didn't change all of the necessary
codepaths to set it properly. In particular the backchannel setup was
missing.

We need to set rq_rbuffer whenever rq_buffer is set. Ensure that it is.
Reviewed-by: NChuck Lever <chuck.lever@oracle.com>
Tested-by: NChuck Lever <chuck.lever@oracle.com>
Reported-by: NEryu Guan <guaneryu@gmail.com>
Tested-by: NEryu Guan <guaneryu@gmail.com>
Fixes: 68778945 "SUNRPC: Separate buffer pointers..."
Reported-by: NJ. Bruce Fields <bfields@fieldses.org>
Signed-off-by: NJeff Layton <jlayton@redhat.com>
Signed-off-by: NJ. Bruce Fields <bfields@redhat.com>

18e601d6

27 10月, 2016 1 次提交

sunrpc: don't pass on-stack memory to sg_set_buf · 2876a344

由 J. Bruce Fields 提交于 10月 18, 2016

As of ac4e97ab "scatterlist: sg_set_buf() argument must be in linear
mapping", sg_set_buf hits a BUG when make_checksum_v2->xdr_process_buf,
among other callers, passes it memory on the stack.

We only need a scatterlist to pass this to the crypto code, and it seems
like overkill to require kmalloc'd memory just to encrypt a few bytes,
but for now this seems the best fix.

Many of these callers are in the NFS write paths, so we allocate with
GFP_NOFS.  It might be possible to do without allocations here entirely,
but that would probably be a bigger project.

Cc: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: NJ. Bruce Fields <bfields@redhat.com>

2876a344

08 10月, 2016 1 次提交

cred: simpler, 1D supplementary groups · 81243eac

由 Alexey Dobriyan 提交于 10月 07, 2016

Current supplementary groups code can massively overallocate memory and
is implemented in a way so that access to individual gid is done via 2D
array.

If number of gids is <= 32, memory allocation is more or less tolerable
(140/148 bytes).  But if it is not, code allocates full page (!)
regardless and, what's even more fun, doesn't reuse small 32-entry
array.

2D array means dependent shifts, loads and LEAs without possibility to
optimize them (gid is never known at compile time).

All of the above is unnecessary.  Switch to the usual
trailing-zero-len-array scheme.  Memory is allocated with
kmalloc/vmalloc() and only as much as needed.  Accesses become simpler
(LEA 8(gi,idx,4) or even without displacement).

Maximum number of gids is 65536 which translates to 256KB+8 bytes.  I
think kernel can handle such allocation.

On my usual desktop system with whole 9 (nine) aux groups, struct
group_info shrinks from 148 bytes to 44 bytes, yay!

Nice side effects:

 - "gi->gid[i]" is shorter than "GROUP_AT(gi, i)", less typing,

 - fix little mess in net/ipv4/ping.c
   should have been using GROUP_AT macro but this point becomes moot,

 - aux group allocation is persistent and should be accounted as such.

Link: http://lkml.kernel.org/r/20160817201927.GA2096@p183.telecom.bySigned-off-by: NAlexey Dobriyan <adobriyan@gmail.com>
Cc: Vasily Kulikov <segoon@openwall.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

81243eac

01 10月, 2016 4 次提交

sunrpc: replace generic auth_cred hash with auth-specific function · 66cbd4ba

由 Frank Sorenson 提交于 9月 29, 2016

Replace the generic code to hash the auth_cred with the call to
the auth-specific hash function in the rpc_authops struct.
Signed-off-by: NFrank Sorenson <sorenson@redhat.com>
Signed-off-by: NAnna Schumaker <Anna.Schumaker@Netapp.com>

66cbd4ba

sunrpc: add RPCSEC_GSS hash_cred() function · a960f8d6

由 Frank Sorenson 提交于 9月 29, 2016

Add a hash_cred() function for RPCSEC_GSS, using only the
uid from the auth_cred.
Signed-off-by: NFrank Sorenson <sorenson@redhat.com>
Signed-off-by: NAnna Schumaker <Anna.Schumaker@Netapp.com>

a960f8d6

sunrpc: add auth_unix hash_cred() function · 1e035d06

由 Frank Sorenson 提交于 9月 29, 2016

Add a hash_cred() function for auth_unix, using both the
uid and gid from the auth_cred.
Signed-off-by: NFrank Sorenson <sorenson@redhat.com>
Signed-off-by: NAnna Schumaker <Anna.Schumaker@Netapp.com>

1e035d06

sunrpc: add generic_auth hash_cred() function · 18028c96

由 Frank Sorenson 提交于 9月 29, 2016

Add a hash_cred() function for generic_auth, using both the
uid and gid from the auth_cred.
Signed-off-by: NFrank Sorenson <sorenson@redhat.com>
Signed-off-by: NAnna Schumaker <Anna.Schumaker@Netapp.com>

18028c96

28 9月, 2016 2 次提交

fs: Replace CURRENT_TIME with current_time() for inode timestamps · 078cd827

由 Deepa Dinamani 提交于 9月 14, 2016

CURRENT_TIME macro is not appropriate for filesystems as it
doesn't use the right granularity for filesystem timestamps.
Use current_time() instead.

CURRENT_TIME is also not y2038 safe.

This is also in preparation for the patch that transitions
vfs timestamps to use 64 bit time and hence make them
y2038 safe. As part of the effort current_time() will be
extended to do range checks. Hence, it is necessary for all
file system timestamps to use current_time(). Also,
current_time() will be transitioned along with vfs to be
y2038 safe.

Note that whenever a single call to current_time() is used
to change timestamps in different inodes, it is because they
share the same time granularity.
Signed-off-by: NDeepa Dinamani <deepa.kernel@gmail.com>
Reviewed-by: NArnd Bergmann <arnd@arndb.de>
Acked-by: NFelipe Balbi <balbi@kernel.org>
Acked-by: NSteven Whitehouse <swhiteho@redhat.com>
Acked-by: NRyusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
Acked-by: NDavid Sterba <dsterba@suse.com>
Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>

078cd827

sunrpc: queue work on system_power_efficient_wq · 77b00bc0

由 Ke Wang 提交于 9月 01, 2016

sunrpc uses workqueue to clean cache regulary. There is no real dependency
of executing work on the cpu which queueing it.

On a idle system, especially for a heterogeneous systems like big.LITTLE,
it is observed that the big idle cpu was woke up many times just to service
this work, which against the principle of power saving. It would be better
if we can schedule it on a cpu which the scheduler believes to be the most
appropriate one.

After apply this patch, system_wq will be replaced by
system_power_efficient_wq for sunrpc. This functionality is enabled when
CONFIG_WQ_POWER_EFFICIENT is selected.
Signed-off-by: NKe Wang <ke.wang@spreadtrum.com>
Signed-off-by: NAnna Schumaker <Anna.Schumaker@Netapp.com>

77b00bc0

24 9月, 2016 1 次提交

IB/core: add support to create a unsafe global rkey to ib_create_pd · ed082d36

由 Christoph Hellwig 提交于 9月 05, 2016

Instead of exposing ib_get_dma_mr to ULPs and letting them use it more or
less unchecked, this moves the capability of creating a global rkey into
the RDMA core, where it can be easily audited.  It also prints a warning
everytime this feature is used as well.
Signed-off-by: NChristoph Hellwig <hch@lst.de>
Reviewed-by: NSagi Grimberg <sagi@grimberg.me>
Reviewed-by: NJason Gunthorpe <jgunthorpe@obsidianresearch.com>
Reviewed-by: NSteve Wise <swise@opengridcomputing.com>
Signed-off-by: NDoug Ledford <dledford@redhat.com>

ed082d36

23 9月, 2016 7 次提交

svcrdma: support Remote Invalidation · 25d55296

由 Chuck Lever 提交于 9月 13, 2016

Support Remote Invalidation. A private message is exchanged with
the client upon RDMA transport connect that indicates whether
Send With Invalidation may be used by the server to send RPC
replies. The invalidate_rkey is arbitrarily chosen from among
rkeys present in the RPC-over-RDMA header's chunk lists.

Send With Invalidate improves performance only when clients can
recognize, while processing an RPC reply, that an rkey has already
been invalidated. That has been submitted as a separate change.

In the future, the RPC-over-RDMA protocol might support Remote
Invalidation properly. The protocol needs to enable signaling
between peers to indicate when Remote Invalidation can be used
for each individual RPC.
Signed-off-by: NChuck Lever <chuck.lever@oracle.com>
Reviewed-by: NSagi Grimberg <sagi@grimberg.me>
Signed-off-by: NJ. Bruce Fields <bfields@redhat.com>

25d55296

svcrdma: Server-side support for rpcrdma_connect_private · cc9d8340

由 Chuck Lever 提交于 9月 13, 2016

Prepare to receive an RDMA-CM private message when handling a new
connection attempt, and send a similar message as part of connection
acceptance.

Both sides can communicate their various implementation limits.
Implementations that don't support this sideband protocol ignore it.
Signed-off-by: NChuck Lever <chuck.lever@oracle.com>
Reviewed-by: NSagi Grimberg <sagi@grimberg.me>
Signed-off-by: NJ. Bruce Fields <bfields@redhat.com>

cc9d8340

svcrdma: Skip put_page() when send_reply() fails · 9995237b

由 Chuck Lever 提交于 9月 13, 2016

Message from syslogd@klimt at Aug 18 17:00:37 ...
kernel:page:ffffea0020639b00 count:0 mapcount:0 mapping: (null) index:0x0
Aug 18 17:00:37 klimt kernel: flags: 0x2fffff80000000()
Aug 18 17:00:37 klimt kernel: page dumped because: VM_BUG_ON_PAGE(page_ref_count(page) == 0)

Aug 18 17:00:37 klimt kernel: kernel BUG at /home/cel/src/linux/linux-2.6/include/linux/mm.h:445!
Aug 18 17:00:37 klimt kernel: RIP: 0010:[<ffffffffa05c21c1>] svc_rdma_sendto+0x641/0x820 [rpcrdma]

send_reply() assigns its page argument as the first page of ctxt. On
error, send_reply() already invokes svc_rdma_put_context(ctxt, 1);
which does a put_page() on that very page. No need to do that again
as svc_rdma_sendto exits.

Fixes: 3e1eeb98 ("svcrdma: Close connection when a send error occurs")
Signed-off-by: NChuck Lever <chuck.lever@oracle.com>
Signed-off-by: NJ. Bruce Fields <bfields@redhat.com>

9995237b

svcrdma: Tail iovec leaves an orphaned DMA mapping · cace564f

由 Chuck Lever 提交于 9月 13, 2016

The ctxt's count field is overloaded to mean the number of pages in
the ctxt->page array and the number of SGEs in the ctxt->sge array.
Typically these two numbers are the same.

However, when an inline RPC reply is constructed from an xdr_buf
with a tail iovec, the head and tail often occupy the same page,
but each are DMA mapped independently. In that case, ->count equals
the number of pages, but it does not equal the number of SGEs.
There's one more SGE, for the tail iovec. Hence there is one more
DMA mapping than there are pages in the ctxt->page array.

This isn't a real problem until the server's iommu is enabled. Then
each RPC reply that has content in that iovec orphans a DMA mapping
that consists of real resources.

krb5i and krb5p always populate that tail iovec. After a couple
million sent krb5i/p RPC replies, the NFS server starts behaving
erratically. Reboot is needed to clear the problem.

Fixes: 9d11b51c ("svcrdma: Fix send_reply() scatter/gather set-up")
Signed-off-by: NChuck Lever <chuck.lever@oracle.com>
Signed-off-by: NJ. Bruce Fields <bfields@redhat.com>

cace564f

xprtrdma: use complete() instead complete_all() · 5690a22d

由 Daniel Wagner 提交于 9月 23, 2016

There is only one waiter for the completion, therefore there
is no need to use complete_all(). Let's make that clear by
using complete() instead of complete_all().

The usage pattern of the completion is:

waiter context                          waker context

frwr_op_unmap_sync()
  reinit_completion()
  ib_post_send()
  wait_for_completion()

					frwr_wc_localinv_wake()
					  complete()
Signed-off-by: NDaniel Wagner <daniel.wagner@bmw-carit.de>
Cc: Anna Schumaker <Anna.Schumaker@Netapp.com>
Cc: Trond Myklebust <trond.myklebust@primarydata.com>
Cc: Chuck Lever <chuck.lever@oracle.com>
Cc: linux-nfs@vger.kernel.org
Cc: netdev@vger.kernel.org
Signed-off-by: NAnna Schumaker <Anna.Schumaker@Netapp.com>

5690a22d

SUNRPC: Fix setting of buffer length in xdr_set_next_buffer() · a6cebd41

由 Trond Myklebust 提交于 9月 20, 2016

Use xdr->nwords to tell us how much buffer remains.
Signed-off-by: NTrond Myklebust <trond.myklebust@primarydata.com>
Signed-off-by: NAnna Schumaker <Anna.Schumaker@Netapp.com>

a6cebd41

SUNRPC: Fix corruption of xdr->nwords in xdr_copy_to_scratch · ace0e14f

由 Trond Myklebust 提交于 9月 20, 2016

When we copy the first part of the data, we need to ensure that value
of xdr->nwords is updated as well. Do so by calling __xdr_inline_decode()
Signed-off-by: NTrond Myklebust <trond.myklebust@primarydata.com>
Signed-off-by: NAnna Schumaker <Anna.Schumaker@Netapp.com>

ace0e14f

20 9月, 2016 16 次提交

sunrpc: fix write space race causing stalls · d48f9ce7

由 David Vrabel 提交于 9月 19, 2016

Write space becoming available may race with putting the task to sleep
in xprt_wait_for_buffer_space().  The existing mechanism to avoid the
race does not work.

This (edited) partial trace illustrates the problem:

   [1] rpc_task_run_action: task:43546@5 ... action=call_transmit
   [2] xs_write_space <-xs_tcp_write_space
   [3] xprt_write_space <-xs_write_space
   [4] rpc_task_sleep: task:43546@5 ...
   [5] xs_write_space <-xs_tcp_write_space

[1] Task 43546 runs but is out of write space.

[2] Space becomes available, xs_write_space() clears the
    SOCKWQ_ASYNC_NOSPACE bit.

[3] xprt_write_space() attemts to wake xprt->snd_task (== 43546), but
    this has not yet been queued and the wake up is lost.

[4] xs_nospace() is called which calls xprt_wait_for_buffer_space()
    which queues task 43546.

[5] The call to sk->sk_write_space() at the end of xs_nospace() (which
    is supposed to handle the above race) does not call
    xprt_write_space() as the SOCKWQ_ASYNC_NOSPACE bit is clear and
    thus the task is not woken.

Fix the race by resetting the SOCKWQ_ASYNC_NOSPACE bit in xs_nospace()
so the second call to sk->sk_write_space() calls xprt_write_space().
Suggested-by: NTrond Myklebust <trondmy@primarydata.com>
Signed-off-by: NDavid Vrabel <david.vrabel@citrix.com>
cc: stable@vger.kernel.org # 4.4
Signed-off-by: NAnna Schumaker <Anna.Schumaker@Netapp.com>

d48f9ce7

xprtrdma: Eliminate rpcrdma_receive_worker() · 496b77a5

由 Chuck Lever 提交于 9月 15, 2016

Clean up: the extra layer of indirection doesn't add value.
Signed-off-by: NChuck Lever <chuck.lever@oracle.com>
Signed-off-by: NAnna Schumaker <Anna.Schumaker@Netapp.com>

496b77a5

xprtrdma: Rename rpcrdma_receive_wc() · 1519e969

由 Chuck Lever 提交于 9月 15, 2016

Clean up: When converting xprtrdma to use the new CQ API, I missed a
spot. The naming convention elsewhere is:

  {svc_rdma,rpcrdma}_wc_{operation}
Signed-off-by: NChuck Lever <chuck.lever@oracle.com>
Signed-off-by: NAnna Schumaker <Anna.Schumaker@Netapp.com>

1519e969

xprtrmda: Report address of frmr, not mw · eeb30613

由 Chuck Lever 提交于 9月 15, 2016

Tie frwr debugging messages together by always reporting the address
of the frwr.
Signed-off-by: NChuck Lever <chuck.lever@oracle.com>
Signed-off-by: NAnna Schumaker <Anna.Schumaker@Netapp.com>

eeb30613

xprtrdma: Support larger inline thresholds · 44829d02

由 Chuck Lever 提交于 9月 15, 2016

The Version One default inline threshold is still 1KB. But allow
testing with thresholds up to 64KB.

This maximum is somewhat arbitrary. There's no fundamental
architectural limit I'm aware of, but it's good to keep the size of
Receive buffers reasonable. Now that Send can use a s/g list, a
Send buffer is only as large as each RPC requires. Receive buffers
are always the size of the inline threshold, however.
Signed-off-by: NChuck Lever <chuck.lever@oracle.com>
Signed-off-by: NAnna Schumaker <Anna.Schumaker@Netapp.com>

44829d02

xprtrdma: Use gathered Send for large inline messages · 655fec69

由 Chuck Lever 提交于 9月 15, 2016

An RPC Call message that is sent inline but that has a data payload
(ie, one or more items in rq_snd_buf's page list) must be "pulled
up:"

- call_allocate has to reserve enough RPC Call buffer space to
accommodate the data payload

- call_transmit has to memcopy the rq_snd_buf's page list and tail
into its head iovec before it is sent

As the inline threshold is increased beyond its current 1KB default,
however, this means data payloads of more than a few KB are copied
by the host CPU. For example, if the inline threshold is increased
just to 4KB, then NFS WRITE requests up to 4KB would involve a
memcpy of the NFS WRITE's payload data into the RPC Call buffer.
This is an undesirable amount of participation by the host CPU.

The inline threshold may be much larger than 4KB in the future,
after negotiation with a peer server.

Instead of copying the components of rq_snd_buf into its head iovec,
construct a gather list of these components, and send them all in
place. The same approach is already used in the Linux server's
RPC-over-RDMA reply path.

This mechanism also eliminates the need for rpcrdma_tail_pullup,
which is used to manage the XDR pad and trailing inline content when
a Read list is present.

This requires that the pages in rq_snd_buf's page list be DMA-mapped
during marshaling, and unmapped when a data-bearing RPC is
completed. This is slightly less efficient for very small I/O
payloads, but significantly more efficient as data payload size and
inline threshold increase past a kilobyte.
Signed-off-by: NChuck Lever <chuck.lever@oracle.com>
Signed-off-by: NAnna Schumaker <Anna.Schumaker@Netapp.com>

655fec69

xprtrdma: Basic support for Remote Invalidation · c8b920bb

由 Chuck Lever 提交于 9月 15, 2016

Have frwr's ro_unmap_sync recognize an invalidated rkey that appears
as part of a Receive completion. Local invalidation can be skipped
for that rkey.

Use an out-of-band signaling mechanism to indicate to the server
that the client is prepared to receive RDMA Send With Invalidate.
Signed-off-by: NChuck Lever <chuck.lever@oracle.com>
Signed-off-by: NAnna Schumaker <Anna.Schumaker@Netapp.com>

c8b920bb

xprtrdma: Client-side support for rpcrdma_connect_private · 87cfb9a0

由 Chuck Lever 提交于 9月 15, 2016

Send an RDMA-CM private message on connect, and look for one during
a connection-established event.

Both sides can communicate their various implementation limits.
Implementations that don't support this sideband protocol ignore it.

Once the client knows the server's inline threshold maxima, it can
adjust the use of Reply chunks, and eliminate most use of Position
Zero Read chunks. Moderately-sized I/O can be done using a pure
inline RDMA Send instead of RDMA operations that require memory
registration.
Signed-off-by: NChuck Lever <chuck.lever@oracle.com>
Signed-off-by: NAnna Schumaker <Anna.Schumaker@Netapp.com>

87cfb9a0

xprtrdma: Move recv_wr to struct rpcrdma_rep · 6ea8e711

由 Chuck Lever 提交于 9月 15, 2016

Clean up: The fields in the recv_wr do not vary. There is no need to
initialize them before each ib_post_recv(). This removes a large-ish
data structure from the stack.
Signed-off-by: NChuck Lever <chuck.lever@oracle.com>
Signed-off-by: NAnna Schumaker <Anna.Schumaker@Netapp.com>

6ea8e711

xprtrdma: Move send_wr to struct rpcrdma_req · 90aab602

由 Chuck Lever 提交于 9月 15, 2016

Clean up: Most of the fields in each send_wr do not vary. There is
no need to initialize them before each ib_post_send(). This removes
a large-ish data structure from the stack.
Signed-off-by: NChuck Lever <chuck.lever@oracle.com>
Signed-off-by: NAnna Schumaker <Anna.Schumaker@Netapp.com>

90aab602

xprtrdma: Simplify rpcrdma_ep_post_recv() · b157380a

由 Chuck Lever 提交于 9月 15, 2016

Clean up.

Since commit fc664485 ("xprtrdma: Split the completion queue"),
rpcrdma_ep_post_recv() no longer uses the "ep" argument.
Signed-off-by: NChuck Lever <chuck.lever@oracle.com>
Signed-off-by: NAnna Schumaker <Anna.Schumaker@Netapp.com>

b157380a

xprtrdma: Eliminate "ia" argument in rpcrdma_{alloc, free}_regbuf · 13650c23

由 Chuck Lever 提交于 9月 15, 2016

Clean up. The "ia" argument is no longer used.
Signed-off-by: NChuck Lever <chuck.lever@oracle.com>
Signed-off-by: NAnna Schumaker <Anna.Schumaker@Netapp.com>

13650c23

xprtrdma: Delay DMA mapping Send and Receive buffers · 54cbd6b0

由 Chuck Lever 提交于 9月 15, 2016

Currently, each regbuf is allocated and DMA mapped at the same time.
This is done during transport creation.

When a device driver is unloaded, every DMA-mapped buffer in use by
a transport has to be unmapped, and then remapped to the new
device if the driver is loaded again. Remapping will have to be done
_after_ the connect worker has set up the new device.

But there's an ordering problem:

call_allocate, which invokes xprt_rdma_allocate which calls
rpcrdma_alloc_regbuf to allocate Send buffers, happens _before_
the connect worker can run to set up the new device.

Instead, at transport creation, allocate each buffer, but leave it
unmapped. Once the RPC carries these buffers into ->send_request, by
which time a transport connection should have been established,
check to see that the RPC's buffers have been DMA mapped. If not,
map them there.

When device driver unplug support is added, it will simply unmap all
the transport's regbufs, but it doesn't have to deallocate the
underlying memory.
Signed-off-by: NChuck Lever <chuck.lever@oracle.com>
Signed-off-by: NAnna Schumaker <Anna.Schumaker@Netapp.com>

54cbd6b0

xprtrdma: Replace DMA_BIDIRECTIONAL · 99ef4db3

由 Chuck Lever 提交于 9月 15, 2016

The use of DMA_BIDIRECTIONAL is discouraged by DMA-API.txt.
Fortunately, xprtrdma now knows which direction I/O is going as
soon as it allocates each regbuf.

The RPC Call and Reply buffers are no longer the same regbuf. They
can each be labeled correctly now. The RPC Reply buffer is never
part of either a Send or Receive WR, but it can be part of Reply
chunk, which is mapped and registered via ->ro_map . So it is not
DMA mapped when it is allocated (DMA_NONE), to avoid a double-
mapping.

Since Receive buffers are no longer DMA_BIDIRECTIONAL and their
contents are never modified by the host CPU, DMA-API-HOWTO.txt
suggests that a DMA sync before posting each buffer should be
unnecessary. (See my_card_interrupt_handler).
Signed-off-by: NChuck Lever <chuck.lever@oracle.com>
Signed-off-by: NAnna Schumaker <Anna.Schumaker@Netapp.com>

99ef4db3

xprtrdma: Use smaller buffers for RPC-over-RDMA headers · 08cf2efd

由 Chuck Lever 提交于 9月 15, 2016

Commit 94931746 ("xprtrdma: Limit number of RDMA segments in
RPC-over-RDMA headers") capped the number of chunks that may appear
in RPC-over-RDMA headers. The maximum header size can be estimated
and fixed to avoid allocating buffer space that is never used.
Signed-off-by: NChuck Lever <chuck.lever@oracle.com>
Signed-off-by: NAnna Schumaker <Anna.Schumaker@Netapp.com>

08cf2efd

xprtrdma: Initialize separate RPC call and reply buffers · 9c40c49f

由 Chuck Lever 提交于 9月 15, 2016

RPC-over-RDMA needs to separate its RPC call and reply buffers.

 o When an RPC Call is sent, rq_snd_buf is DMA mapped for an RDMA
   Send operation using DMA_TO_DEVICE

 o If the client expects a large RPC reply, it DMA maps rq_rcv_buf
   as part of a Reply chunk using DMA_FROM_DEVICE

The two mappings are for data movement in opposite directions.

DMA-API.txt suggests that if these mappings share a DMA cacheline,
bad things can happen. This could occur in the final bytes of
rq_snd_buf and the first bytes of rq_rcv_buf if the two buffers
happen to share a DMA cacheline.

On x86_64 the cacheline size is typically 8 bytes, and RPC call
messages are usually much smaller than the send buffer, so this
hasn't been a noticeable problem. But the DMA cacheline size can be
larger on other platforms.

Also, often rq_rcv_buf starts most of the way into a page, thus
an additional RDMA segment is needed to map and register the end of
that buffer. Try to avoid that scenario to reduce the cost of
registering and invalidating Reply chunks.

Instead of carrying a single regbuf that covers both rq_snd_buf and
rq_rcv_buf, each struct rpcrdma_req now carries one regbuf for
rq_snd_buf and one regbuf for rq_rcv_buf.

Some incidental changes worth noting:

- To clear out some spaghetti, refactor xprt_rdma_allocate.
- The value stored in rg_size is the same as the value stored in
  the iov.length field, so eliminate rg_size
Signed-off-by: NChuck Lever <chuck.lever@oracle.com>
Signed-off-by: NAnna Schumaker <Anna.Schumaker@Netapp.com>

9c40c49f

openeuler / Kernel 接近 2 年 前同步成功

openeuler / Kernel
接近 2 年前同步成功