1. 30 11月, 2016 9 次提交
    • C
      xprtrdma: Squelch "max send, max recv" messages at connect time · 6d6bf72d
      Chuck Lever 提交于
      Clean up: This message was intended to be a dprintk, as it is on the
      server-side.
      
      Fixes: 87cfb9a0 ('xprtrdma: Client-side support for ...')
      Signed-off-by: NChuck Lever <chuck.lever@oracle.com>
      Signed-off-by: NAnna Schumaker <Anna.Schumaker@Netapp.com>
      6d6bf72d
    • C
      xprtrdma: Update documenting comment · 289400af
      Chuck Lever 提交于
      Clean up: If reset fails, FRMRs are no longer abandoned, rather
      they are released immediately. Update the comment to reflect this.
      
      Fixes: 2ffc871a ('xprtrdma: Release orphaned MRs immediately')
      Signed-off-by: NChuck Lever <chuck.lever@oracle.com>
      Signed-off-by: NAnna Schumaker <Anna.Schumaker@Netapp.com>
      289400af
    • C
      xprtrdma: Refactor FRMR invalidation · a100fda1
      Chuck Lever 提交于
      Clean up: After some recent updates, clarifications can be made to
      the FRMR invalidation logic.
      
      - Both the remote and local invalidation case mark the frmr INVALID,
        so make that a common path.
      
      - Manage the WR list more "tastefully" by replacing the conditional
        that discriminates between the list head and ->next pointers.
      
      - Use mw->mw_handle in all cases, since that has the same value as
        f->fr_mr->rkey, and is already in cache.
      Signed-off-by: NChuck Lever <chuck.lever@oracle.com>
      Signed-off-by: NAnna Schumaker <Anna.Schumaker@Netapp.com>
      a100fda1
    • C
      xprtrdma: Avoid calls to ro_unmap_safe() · 48016dce
      Chuck Lever 提交于
      Micro-optimization: Most of the time, calls to ro_unmap_safe are
      expensive no-ops. Call only when there is work to do.
      Signed-off-by: NChuck Lever <chuck.lever@oracle.com>
      Signed-off-by: NAnna Schumaker <Anna.Schumaker@Netapp.com>
      48016dce
    • C
      xprtrdma: Address coverity complaint about wait_for_completion() · 109b88ab
      Chuck Lever 提交于
      > ** CID 114101:  Error handling issues  (CHECKED_RETURN)
      > /net/sunrpc/xprtrdma/verbs.c: 355 in rpcrdma_create_id()
      
      Commit 5675add3 ("RPC/RDMA: harden connection logic against
      missing/late rdma_cm upcalls.") replaced wait_for_completion() calls
      with these two call sites.
      
      The original wait_for_completion() calls were added in the initial
      commit of verbs.c, which was commit c56c65fb ("RPCRDMA: rpc rdma
      verbs interface implementation"), but these returned void.
      
      rpcrdma_create_id() is called by the RDMA connect worker, which
      probably won't ever be interrupted. It is also called by
      rpcrdma_ia_open which is in the synchronous mount path, and ^C is
      possible there.
      
      Add a bit of logic at those two call sites to return if the waits
      return ERESTARTSYS.
      Signed-off-by: NChuck Lever <chuck.lever@oracle.com>
      Signed-off-by: NAnna Schumaker <Anna.Schumaker@Netapp.com>
      109b88ab
    • C
      SUNRPC: Proper metric accounting when RPC is not transmitted · ae09531d
      Chuck Lever 提交于
      I noticed recently that during an xfstests on a krb5i mount, the
      retransmit count for certain operations had gone negative, and the
      backlog value became unreasonably large. I recall that Andy has
      pointed this out to me in the past.
      
      When call_refresh fails to find a valid credential for an RPC, the
      RPC exits immediately without sending anything on the wire. This
      leaves rq_ntrans, rq_xtime, and rq_rtt set to zero.
      
      The solution for om_queue is to not add the to RPC's running backlog
      queue total whenever rq_xtime is zero.
      
      For om_ntrans, it's a bit more difficult. A zero rq_ntrans causes
      om_ops to become larger than om_ntrans. The design of the RPC
      metrics API assumes that ntrans will always be equal to or larger
      than the ops count. The result is that when an RPC fails to find
      credentials, the RPC operation's reported retransmit count, which is
      computed in user space as the difference between ops and ntrans,
      goes negative.
      
      Ideally the kernel API should report a separate retransmit and
      "exited before initial transmission" metric, so that user space can
      sort out the difference properly.
      
      To avoid kernel API changes and changes to the way rq_ntrans is used
      when performing transport locking, account for untransmitted RPCs
      so that om_ntrans keeps up with om_ops: always add one or more to
      om_ntrans.
      Signed-off-by: NChuck Lever <chuck.lever@oracle.com>
      Signed-off-by: NAnna Schumaker <Anna.Schumaker@Netapp.com>
      ae09531d
    • C
      xprtrdma: Support for SG_GAP devices · 5e9fc6a0
      Chuck Lever 提交于
      Some devices (such as the Mellanox CX-4) can register, under a
      single R_key, a set of memory regions that are not contiguous. When
      this is done, all the segments in a Reply list, say, can then be
      invalidated in a single LocalInv Work Request (or via Remote
      Invalidation, which can invalidate exactly one R_key when completing
      a Receive).
      
      This means a single FastReg WR is used to register, and one or zero
      LocalInv WRs can invalidate, the memory involved with RDMA transfers
      on behalf of an RPC.
      
      In addition, xprtrdma constructs some Reply chunks from three or
      more segments. By registering them with SG_GAP, only one segment
      is needed for the Reply chunk, allowing the whole chunk to be
      invalidated remotely.
      Signed-off-by: NChuck Lever <chuck.lever@oracle.com>
      Signed-off-by: NAnna Schumaker <Anna.Schumaker@Netapp.com>
      5e9fc6a0
    • C
      xprtrdma: Make FRWR send queue entry accounting more accurate · 8d38de65
      Chuck Lever 提交于
      Verbs providers may perform house-keeping on the Send Queue during
      each signaled send completion. It is necessary therefore for a verbs
      consumer (like xprtrdma) to occasionally force a signaled send
      completion if it runs unsignaled most of the time.
      
      xprtrdma does not require signaled completions for Send or FastReg
      Work Requests, but does signal some LocalInv Work Requests. To
      ensure that Send Queue house-keeping can run before the Send Queue
      is more than half-consumed, xprtrdma forces a signaled completion
      on occasion by counting the number of Send Queue Entries it
      consumes. It currently does this by counting each ib_post_send as
      one Entry.
      
      Commit c9918ff5 ("xprtrdma: Add ro_unmap_sync method for FRWR")
      introduced the ability for frwr_op_unmap_sync to post more than one
      Work Request with a single post_send. Thus the underlying assumption
      of one Send Queue Entry per ib_post_send is no longer true.
      
      Also, FastReg Work Requests are currently never signaled. They
      should be signaled once in a while, just as Send is, to keep the
      accounting of consumed SQEs accurate.
      
      While we're here, convert the CQCOUNT macros to the currently
      preferred kernel coding style, which is inline functions.
      
      Fixes: c9918ff5 ("xprtrdma: Add ro_unmap_sync method for FRWR")
      Signed-off-by: NChuck Lever <chuck.lever@oracle.com>
      Signed-off-by: NAnna Schumaker <Anna.Schumaker@Netapp.com>
      8d38de65
    • C
      xprtrdma: Cap size of callback buffer resources · 62aee0e3
      Chuck Lever 提交于
      When the inline threshold size is set to large values (say, 32KB)
      any NFSv4.1 CB request from the server gets a reply with status
      NFS4ERR_RESOURCE.
      
      Looks like there are some upper layer assumptions about the maximum
      size of a reply (for example, in process_op). Cap the size of the
      NFSv4 client's reply resources at a page.
      Signed-off-by: NChuck Lever <chuck.lever@oracle.com>
      Signed-off-by: NAnna Schumaker <Anna.Schumaker@Netapp.com>
      62aee0e3
  2. 14 11月, 2016 1 次提交
    • S
      sunrpc: svc_age_temp_xprts_now should not call setsockopt non-tcp transports · ea08e392
      Scott Mayhew 提交于
      This fixes the following panic that can occur with NFSoRDMA.
      
      general protection fault: 0000 [#1] SMP
      Modules linked in: rpcrdma ib_isert iscsi_target_mod ib_iser libiscsi
      scsi_transport_iscsi ib_srpt target_core_mod ib_srp scsi_transport_srp
      scsi_tgt ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm
      mlx5_ib ib_core intel_powerclamp coretemp kvm_intel kvm sg ioatdma
      ipmi_devintf ipmi_ssif dcdbas iTCO_wdt iTCO_vendor_support pcspkr
      irqbypass sb_edac shpchp dca crc32_pclmul ghash_clmulni_intel edac_core
      lpc_ich aesni_intel lrw gf128mul glue_helper ablk_helper mei_me mei
      ipmi_si cryptd wmi ipmi_msghandler acpi_pad acpi_power_meter nfsd
      auth_rpcgss nfs_acl lockd grace sunrpc ip_tables xfs libcrc32c sd_mod
      crc_t10dif crct10dif_generic mgag200 i2c_algo_bit drm_kms_helper
      syscopyarea sysfillrect sysimgblt ahci fb_sys_fops ttm libahci mlx5_core
      tg3 crct10dif_pclmul drm crct10dif_common
      ptp i2c_core libata crc32c_intel pps_core fjes dm_mirror dm_region_hash
      dm_log dm_mod
      CPU: 1 PID: 120 Comm: kworker/1:1 Not tainted 3.10.0-514.el7.x86_64 #1
      Hardware name: Dell Inc. PowerEdge R320/0KM5PX, BIOS 2.4.2 01/29/2015
      Workqueue: events check_lifetime
      task: ffff88031f506dd0 ti: ffff88031f584000 task.ti: ffff88031f584000
      RIP: 0010:[<ffffffff8168d847>]  [<ffffffff8168d847>]
      _raw_spin_lock_bh+0x17/0x50
      RSP: 0018:ffff88031f587ba8  EFLAGS: 00010206
      RAX: 0000000000020000 RBX: 20041fac02080072 RCX: ffff88031f587fd8
      RDX: 0000000000000000 RSI: 0000000000000000 RDI: 20041fac02080072
      RBP: ffff88031f587bb0 R08: 0000000000000008 R09: ffffffff8155be77
      R10: ffff880322a59b00 R11: ffffea000bf39f00 R12: 20041fac02080072
      R13: 000000000000000d R14: ffff8800c4fbd800 R15: 0000000000000001
      FS:  0000000000000000(0000) GS:ffff880322a40000(0000)
      knlGS:0000000000000000
      CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      CR2: 00007f3c52d4547e CR3: 00000000019ba000 CR4: 00000000001407e0
      DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
      Stack:
      20041fac02080002 ffff88031f587bd0 ffffffff81557830 20041fac02080002
      ffff88031f587c78 ffff88031f587c40 ffffffff8155ae08 000000010157df32
      0000000800000001 ffff88031f587c20 ffffffff81096acb ffffffff81aa37d0
      Call Trace:
      [<ffffffff81557830>] lock_sock_nested+0x20/0x50
      [<ffffffff8155ae08>] sock_setsockopt+0x78/0x940
      [<ffffffff81096acb>] ? lock_timer_base.isra.33+0x2b/0x50
      [<ffffffff8155397d>] kernel_setsockopt+0x4d/0x50
      [<ffffffffa0386284>] svc_age_temp_xprts_now+0x174/0x1e0 [sunrpc]
      [<ffffffffa03b681d>] nfsd_inetaddr_event+0x9d/0xd0 [nfsd]
      [<ffffffff81691ebc>] notifier_call_chain+0x4c/0x70
      [<ffffffff810b687d>] __blocking_notifier_call_chain+0x4d/0x70
      [<ffffffff810b68b6>] blocking_notifier_call_chain+0x16/0x20
      [<ffffffff815e8538>] __inet_del_ifa+0x168/0x2d0
      [<ffffffff815e8cef>] check_lifetime+0x25f/0x270
      [<ffffffff810a7f3b>] process_one_work+0x17b/0x470
      [<ffffffff810a8d76>] worker_thread+0x126/0x410
      [<ffffffff810a8c50>] ? rescuer_thread+0x460/0x460
      [<ffffffff810b052f>] kthread+0xcf/0xe0
      [<ffffffff810b0460>] ? kthread_create_on_node+0x140/0x140
      [<ffffffff81696418>] ret_from_fork+0x58/0x90
      [<ffffffff810b0460>] ? kthread_create_on_node+0x140/0x140
      Code: ca 75 f1 5d c3 0f 1f 80 00 00 00 00 eb d9 66 0f 1f 44 00 00 0f 1f
      44 00 00 55 48 89 e5 53 48 89 fb e8 7e 04 a0 ff b8 00 00 02 00 <f0> 0f
      c1 03 89 c2 c1 ea 10 66 39 c2 75 03 5b 5d c3 83 e2 fe 0f
      RIP  [<ffffffff8168d847>] _raw_spin_lock_bh+0x17/0x50
      RSP <ffff88031f587ba8>
      Signed-off-by: NScott Mayhew <smayhew@redhat.com>
      Fixes: c3d4879e ("sunrpc: Add a function to close temporary transports immediately")
      Reviewed-by: NChuck Lever <chuck.lever@oracle.com>
      Signed-off-by: NJ. Bruce Fields <bfields@redhat.com>
      ea08e392
  3. 11 11月, 2016 1 次提交
  4. 08 11月, 2016 1 次提交
  5. 02 11月, 2016 1 次提交
  6. 29 10月, 2016 1 次提交
    • J
      sunrpc: fix some missing rq_rbuffer assignments · 18e601d6
      Jeff Layton 提交于
      We've been seeing some crashes in testing that look like this:
      
      BUG: unable to handle kernel NULL pointer dereference at           (null)
      IP: [<ffffffff8135ce99>] memcpy_orig+0x29/0x110
      PGD 212ca2067 PUD 212ca3067 PMD 0
      Oops: 0002 [#1] SMP
      Modules linked in: rpcsec_gss_krb5 nfsv4 dns_resolver nfs fscache ppdev parport_pc i2c_piix4 sg parport i2c_core virtio_balloon pcspkr acpi_cpufreq nfsd auth_rpcgss nfs_acl lockd grace sunrpc ip_tables xfs libcrc32c sd_mod ata_generic pata_acpi virtio_scsi 8139too ata_piix libata 8139cp mii virtio_pci floppy virtio_ring serio_raw virtio
      CPU: 1 PID: 1540 Comm: nfsd Not tainted 4.9.0-rc1 #39
      Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2007
      task: ffff88020d7ed200 task.stack: ffff880211838000
      RIP: 0010:[<ffffffff8135ce99>]  [<ffffffff8135ce99>] memcpy_orig+0x29/0x110
      RSP: 0018:ffff88021183bdd0  EFLAGS: 00010206
      RAX: 0000000000000000 RBX: ffff88020d7fa000 RCX: 000000f400000000
      RDX: 0000000000000014 RSI: ffff880212927020 RDI: 0000000000000000
      RBP: ffff88021183be30 R08: 01000000ef896996 R09: 0000000000000000
      R10: 0000000000000000 R11: 0000000000000000 R12: ffff880211704ca8
      R13: ffff88021473f000 R14: 00000000ef896996 R15: ffff880211704800
      FS:  0000000000000000(0000) GS:ffff88021fc80000(0000) knlGS:0000000000000000
      CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      CR2: 0000000000000000 CR3: 0000000212ca1000 CR4: 00000000000006e0
      Stack:
       ffffffffa01ea087 ffffffff63400001 ffff880215145e00 ffff880211bacd00
       ffff88021473f2b8 0000000000000004 00000000d0679d67 ffff880211bacd00
       ffff88020d7fa000 ffff88021473f000 0000000000000000 ffff88020d7faa30
      Call Trace:
       [<ffffffffa01ea087>] ? svc_tcp_recvfrom+0x5a7/0x790 [sunrpc]
       [<ffffffffa01f84d8>] svc_recv+0xad8/0xbd0 [sunrpc]
       [<ffffffffa0262d5e>] nfsd+0xde/0x160 [nfsd]
       [<ffffffffa0262c80>] ? nfsd_destroy+0x60/0x60 [nfsd]
       [<ffffffff810a9418>] kthread+0xd8/0xf0
       [<ffffffff816dbdbf>] ret_from_fork+0x1f/0x40
       [<ffffffff810a9340>] ? kthread_park+0x60/0x60
      Code: 00 00 48 89 f8 48 83 fa 20 72 7e 40 38 fe 7c 35 48 83 ea 20 48 83 ea 20 4c 8b 06 4c 8b 4e 08 4c 8b 56 10 4c 8b 5e 18 48 8d 76 20 <4c> 89 07 4c 89 4f 08 4c 89 57 10 4c 89 5f 18 48 8d 7f 20 73 d4
      RIP  [<ffffffff8135ce99>] memcpy_orig+0x29/0x110
       RSP <ffff88021183bdd0>
      CR2: 0000000000000000
      
      Both Bruce and Eryu ran a bisect here and found that the problematic
      patch was 68778945 (SUNRPC: Separate buffer pointers for RPC Call and
      Reply messages).
      
      That patch changed rpc_xdr_encode to use a new rq_rbuffer pointer to
      set up the receive buffer, but didn't change all of the necessary
      codepaths to set it properly. In particular the backchannel setup was
      missing.
      
      We need to set rq_rbuffer whenever rq_buffer is set. Ensure that it is.
      Reviewed-by: NChuck Lever <chuck.lever@oracle.com>
      Tested-by: NChuck Lever <chuck.lever@oracle.com>
      Reported-by: NEryu Guan <guaneryu@gmail.com>
      Tested-by: NEryu Guan <guaneryu@gmail.com>
      Fixes: 68778945 "SUNRPC: Separate buffer pointers..."
      Reported-by: NJ. Bruce Fields <bfields@fieldses.org>
      Signed-off-by: NJeff Layton <jlayton@redhat.com>
      Signed-off-by: NJ. Bruce Fields <bfields@redhat.com>
      18e601d6
  7. 27 10月, 2016 1 次提交
    • J
      sunrpc: don't pass on-stack memory to sg_set_buf · 2876a344
      J. Bruce Fields 提交于
      As of ac4e97ab "scatterlist: sg_set_buf() argument must be in linear
      mapping", sg_set_buf hits a BUG when make_checksum_v2->xdr_process_buf,
      among other callers, passes it memory on the stack.
      
      We only need a scatterlist to pass this to the crypto code, and it seems
      like overkill to require kmalloc'd memory just to encrypt a few bytes,
      but for now this seems the best fix.
      
      Many of these callers are in the NFS write paths, so we allocate with
      GFP_NOFS.  It might be possible to do without allocations here entirely,
      but that would probably be a bigger project.
      
      Cc: Rusty Russell <rusty@rustcorp.com.au>
      Signed-off-by: NJ. Bruce Fields <bfields@redhat.com>
      2876a344
  8. 08 10月, 2016 1 次提交
    • A
      cred: simpler, 1D supplementary groups · 81243eac
      Alexey Dobriyan 提交于
      Current supplementary groups code can massively overallocate memory and
      is implemented in a way so that access to individual gid is done via 2D
      array.
      
      If number of gids is <= 32, memory allocation is more or less tolerable
      (140/148 bytes).  But if it is not, code allocates full page (!)
      regardless and, what's even more fun, doesn't reuse small 32-entry
      array.
      
      2D array means dependent shifts, loads and LEAs without possibility to
      optimize them (gid is never known at compile time).
      
      All of the above is unnecessary.  Switch to the usual
      trailing-zero-len-array scheme.  Memory is allocated with
      kmalloc/vmalloc() and only as much as needed.  Accesses become simpler
      (LEA 8(gi,idx,4) or even without displacement).
      
      Maximum number of gids is 65536 which translates to 256KB+8 bytes.  I
      think kernel can handle such allocation.
      
      On my usual desktop system with whole 9 (nine) aux groups, struct
      group_info shrinks from 148 bytes to 44 bytes, yay!
      
      Nice side effects:
      
       - "gi->gid[i]" is shorter than "GROUP_AT(gi, i)", less typing,
      
       - fix little mess in net/ipv4/ping.c
         should have been using GROUP_AT macro but this point becomes moot,
      
       - aux group allocation is persistent and should be accounted as such.
      
      Link: http://lkml.kernel.org/r/20160817201927.GA2096@p183.telecom.bySigned-off-by: NAlexey Dobriyan <adobriyan@gmail.com>
      Cc: Vasily Kulikov <segoon@openwall.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      81243eac
  9. 01 10月, 2016 4 次提交
  10. 28 9月, 2016 2 次提交
  11. 24 9月, 2016 1 次提交
  12. 23 9月, 2016 7 次提交
  13. 20 9月, 2016 10 次提交
    • D
      sunrpc: fix write space race causing stalls · d48f9ce7
      David Vrabel 提交于
      Write space becoming available may race with putting the task to sleep
      in xprt_wait_for_buffer_space().  The existing mechanism to avoid the
      race does not work.
      
      This (edited) partial trace illustrates the problem:
      
         [1] rpc_task_run_action: task:43546@5 ... action=call_transmit
         [2] xs_write_space <-xs_tcp_write_space
         [3] xprt_write_space <-xs_write_space
         [4] rpc_task_sleep: task:43546@5 ...
         [5] xs_write_space <-xs_tcp_write_space
      
      [1] Task 43546 runs but is out of write space.
      
      [2] Space becomes available, xs_write_space() clears the
          SOCKWQ_ASYNC_NOSPACE bit.
      
      [3] xprt_write_space() attemts to wake xprt->snd_task (== 43546), but
          this has not yet been queued and the wake up is lost.
      
      [4] xs_nospace() is called which calls xprt_wait_for_buffer_space()
          which queues task 43546.
      
      [5] The call to sk->sk_write_space() at the end of xs_nospace() (which
          is supposed to handle the above race) does not call
          xprt_write_space() as the SOCKWQ_ASYNC_NOSPACE bit is clear and
          thus the task is not woken.
      
      Fix the race by resetting the SOCKWQ_ASYNC_NOSPACE bit in xs_nospace()
      so the second call to sk->sk_write_space() calls xprt_write_space().
      Suggested-by: NTrond Myklebust <trondmy@primarydata.com>
      Signed-off-by: NDavid Vrabel <david.vrabel@citrix.com>
      cc: stable@vger.kernel.org # 4.4
      Signed-off-by: NAnna Schumaker <Anna.Schumaker@Netapp.com>
      d48f9ce7
    • C
      xprtrdma: Eliminate rpcrdma_receive_worker() · 496b77a5
      Chuck Lever 提交于
      Clean up: the extra layer of indirection doesn't add value.
      Signed-off-by: NChuck Lever <chuck.lever@oracle.com>
      Signed-off-by: NAnna Schumaker <Anna.Schumaker@Netapp.com>
      496b77a5
    • C
      xprtrdma: Rename rpcrdma_receive_wc() · 1519e969
      Chuck Lever 提交于
      Clean up: When converting xprtrdma to use the new CQ API, I missed a
      spot. The naming convention elsewhere is:
      
        {svc_rdma,rpcrdma}_wc_{operation}
      Signed-off-by: NChuck Lever <chuck.lever@oracle.com>
      Signed-off-by: NAnna Schumaker <Anna.Schumaker@Netapp.com>
      1519e969
    • C
      xprtrmda: Report address of frmr, not mw · eeb30613
      Chuck Lever 提交于
      Tie frwr debugging messages together by always reporting the address
      of the frwr.
      Signed-off-by: NChuck Lever <chuck.lever@oracle.com>
      Signed-off-by: NAnna Schumaker <Anna.Schumaker@Netapp.com>
      eeb30613
    • C
      xprtrdma: Support larger inline thresholds · 44829d02
      Chuck Lever 提交于
      The Version One default inline threshold is still 1KB. But allow
      testing with thresholds up to 64KB.
      
      This maximum is somewhat arbitrary. There's no fundamental
      architectural limit I'm aware of, but it's good to keep the size of
      Receive buffers reasonable. Now that Send can use a s/g list, a
      Send buffer is only as large as each RPC requires. Receive buffers
      are always the size of the inline threshold, however.
      Signed-off-by: NChuck Lever <chuck.lever@oracle.com>
      Signed-off-by: NAnna Schumaker <Anna.Schumaker@Netapp.com>
      44829d02
    • C
      xprtrdma: Use gathered Send for large inline messages · 655fec69
      Chuck Lever 提交于
      An RPC Call message that is sent inline but that has a data payload
      (ie, one or more items in rq_snd_buf's page list) must be "pulled
      up:"
      
      - call_allocate has to reserve enough RPC Call buffer space to
      accommodate the data payload
      
      - call_transmit has to memcopy the rq_snd_buf's page list and tail
      into its head iovec before it is sent
      
      As the inline threshold is increased beyond its current 1KB default,
      however, this means data payloads of more than a few KB are copied
      by the host CPU. For example, if the inline threshold is increased
      just to 4KB, then NFS WRITE requests up to 4KB would involve a
      memcpy of the NFS WRITE's payload data into the RPC Call buffer.
      This is an undesirable amount of participation by the host CPU.
      
      The inline threshold may be much larger than 4KB in the future,
      after negotiation with a peer server.
      
      Instead of copying the components of rq_snd_buf into its head iovec,
      construct a gather list of these components, and send them all in
      place. The same approach is already used in the Linux server's
      RPC-over-RDMA reply path.
      
      This mechanism also eliminates the need for rpcrdma_tail_pullup,
      which is used to manage the XDR pad and trailing inline content when
      a Read list is present.
      
      This requires that the pages in rq_snd_buf's page list be DMA-mapped
      during marshaling, and unmapped when a data-bearing RPC is
      completed. This is slightly less efficient for very small I/O
      payloads, but significantly more efficient as data payload size and
      inline threshold increase past a kilobyte.
      Signed-off-by: NChuck Lever <chuck.lever@oracle.com>
      Signed-off-by: NAnna Schumaker <Anna.Schumaker@Netapp.com>
      655fec69
    • C
      xprtrdma: Basic support for Remote Invalidation · c8b920bb
      Chuck Lever 提交于
      Have frwr's ro_unmap_sync recognize an invalidated rkey that appears
      as part of a Receive completion. Local invalidation can be skipped
      for that rkey.
      
      Use an out-of-band signaling mechanism to indicate to the server
      that the client is prepared to receive RDMA Send With Invalidate.
      Signed-off-by: NChuck Lever <chuck.lever@oracle.com>
      Signed-off-by: NAnna Schumaker <Anna.Schumaker@Netapp.com>
      c8b920bb
    • C
      xprtrdma: Client-side support for rpcrdma_connect_private · 87cfb9a0
      Chuck Lever 提交于
      Send an RDMA-CM private message on connect, and look for one during
      a connection-established event.
      
      Both sides can communicate their various implementation limits.
      Implementations that don't support this sideband protocol ignore it.
      
      Once the client knows the server's inline threshold maxima, it can
      adjust the use of Reply chunks, and eliminate most use of Position
      Zero Read chunks. Moderately-sized I/O can be done using a pure
      inline RDMA Send instead of RDMA operations that require memory
      registration.
      Signed-off-by: NChuck Lever <chuck.lever@oracle.com>
      Signed-off-by: NAnna Schumaker <Anna.Schumaker@Netapp.com>
      87cfb9a0
    • C
      xprtrdma: Move recv_wr to struct rpcrdma_rep · 6ea8e711
      Chuck Lever 提交于
      Clean up: The fields in the recv_wr do not vary. There is no need to
      initialize them before each ib_post_recv(). This removes a large-ish
      data structure from the stack.
      Signed-off-by: NChuck Lever <chuck.lever@oracle.com>
      Signed-off-by: NAnna Schumaker <Anna.Schumaker@Netapp.com>
      6ea8e711
    • C
      xprtrdma: Move send_wr to struct rpcrdma_req · 90aab602
      Chuck Lever 提交于
      Clean up: Most of the fields in each send_wr do not vary. There is
      no need to initialize them before each ib_post_send(). This removes
      a large-ish data structure from the stack.
      Signed-off-by: NChuck Lever <chuck.lever@oracle.com>
      Signed-off-by: NAnna Schumaker <Anna.Schumaker@Netapp.com>
      90aab602