1. 26 Apr 2017, 7 commits
    • svcrdma: Clean up RDMA_ERROR path · 6b19cc5c
      Authored by Chuck Lever
      Now that svc_rdma_sendto has been renovated, svc_rdma_send_error can
      be refactored to reduce code duplication and remove C structure-
      based XDR encoding. It is also relocated to the source file that
      contains its only caller.
      
      This is a refactoring change only.
      Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
      Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
      Signed-off-by: J. Bruce Fields <bfields@redhat.com>
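
      The struct-free encoding style referred to here can be sketched
      directly against the RPC-over-RDMA Version One wire format (XID,
      version, credits, proc, then the error body; RDMA_ERROR is proc 4,
      ERR_VERS is 1, ERR_CHUNK is 2). The helper name and buffer handling
      below are illustrative, not the patch itself:

          static unsigned int encode_rdma_error(__be32 *rdma_resp,
                                                __be32 xid, int err)
          {
                  __be32 *p = rdma_resp;

                  *p++ = xid;                     /* copied from the Call */
                  *p++ = cpu_to_be32(1);          /* Version One */
                  *p++ = cpu_to_be32(1);          /* one credit */
                  *p++ = cpu_to_be32(4);          /* RDMA_ERROR */
                  if (err == 1) {                 /* ERR_VERS */
                          *p++ = cpu_to_be32(1);
                          *p++ = cpu_to_be32(1);  /* lowest supported */
                          *p++ = cpu_to_be32(1);  /* highest supported */
                  } else {
                          *p++ = cpu_to_be32(2);  /* ERR_CHUNK */
                  }
                  return (p - rdma_resp) * sizeof(*p);
          }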
    • svcrdma: Use rdma_rw API in RPC reply path · 9a6a180b
      Authored by Chuck Lever
      The current svcrdma sendto code path posts one RDMA Write WR at a
      time. Each of these Writes typically carries a small number of pages
      (for instance, up to 30 pages for mlx4 devices). That means a 1MB
      NFS READ reply requires 9 ib_post_send() calls for the Write WRs,
      and one for the Send WR carrying the actual RPC Reply message.
      
      Instead, use the new rdma_rw API. The details of Write WR chain
      construction and memory registration are taken care of in the RDMA
      core. svcrdma can focus on the details of the RPC-over-RDMA
      protocol. This gives three main benefits:
      
      1. All Write WRs for one RDMA segment are posted in a single chain.
      As few as one ib_post_send() for each Write chunk.
      
      2. The Write path can now use FRWR to register the Write buffers.
      If the device's maximum page list depth is large, this means a
      single Write WR is needed for each RPC's Write chunk data.
      
      3. The new code introduces support for RPCs that carry both a Write
      list and a Reply chunk. This combination can be used for an NFSv4
      READ where the data payload is large, and thus is removed from the
      Payload Stream, but the Payload Stream is still larger than the
      inline threshold.
      Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
      Signed-off-by: J. Bruce Fields <bfields@redhat.com>
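
      A minimal sketch of pushing one Write-chunk segment through the
      rdma_rw API named above; rdma_rw_ctx_init() and rdma_rw_ctx_post()
      are the real kernel API, while the function and parameter names
      are illustrative:

          static int write_one_segment(struct rdma_rw_ctx *ctx,
                                       struct ib_qp *qp, u8 port_num,
                                       struct scatterlist *sgl, u32 sg_cnt,
                                       u64 remote_addr, u32 rkey,
                                       struct ib_cqe *cqe)
          {
                  int ret;

                  /* Registers the buffer (FRWR if the device needs it)
                   * and builds the whole Write WR chain. Returns the
                   * number of WQEs consumed, or a negative errno.
                   */
                  ret = rdma_rw_ctx_init(ctx, qp, port_num, sgl, sg_cnt, 0,
                                         remote_addr, rkey, DMA_TO_DEVICE);
                  if (ret < 0)
                          return ret;

                  /* One ib_post_send() for the entire chain; cqe fires
                   * once, when the last WR completes.
                   */
                  return rdma_rw_ctx_post(ctx, qp, port_num, cqe, NULL);
          }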
    • svcrdma: Introduce local rdma_rw API helpers · f13193f5
      Authored by Chuck Lever
      The plan is to replace the local bespoke code that constructs and
      posts RDMA Read and Write Work Requests with calls to the rdma_rw
      API. This shares with other RDMA-enabled ULPs the common code that
      manages the gory details of buffer registration and posting Work
      Requests.
      
      Some design notes:
      
       o The structure of RPC-over-RDMA transport headers is flexible,
         allowing multiple segments per Reply with arbitrary alignment,
         each with a unique R_key. Write and Send WRs continue to be
         built and posted in separate code paths. However, one whole
         chunk (with one or more RDMA segments apiece) gets exactly
         one ib_post_send and one work completion.
      
       o svc_xprt reference counting is modified, since a chain of
         rdma_rw_ctx structs generates one completion, no matter how
         many Write WRs are posted.
      
       o The current code builds the transport header as it is
         constructing Write WRs. I've replaced that with marshaling of
         transport header data items in a separate step. This is
         because the exact
         structure of client-provided segments may not align with the
         components of the server's reply xdr_buf, or the pages in the
         page list. Thus parts of each client-provided segment may be
         written at different points in the send path.
      Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
      Signed-off-by: J. Bruce Fields <bfields@redhat.com>
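
      The reference-counting note above implies a completion handler
      shaped roughly like this, with one svc_xprt_put() per chunk no
      matter how many Write WRs were chained. The chunk_ctxt structure
      and its field names are assumptions for illustration:

          struct chunk_ctxt {
                  struct ib_cqe           cc_cqe;
                  struct rdma_rw_ctx      cc_ctx;
                  struct ib_qp            *cc_qp;
                  u8                      cc_port;
                  struct scatterlist      *cc_sgl;
                  u32                     cc_sg_cnt;
                  struct svc_xprt         *cc_xprt;
          };

          static void write_chunk_done(struct ib_cq *cq, struct ib_wc *wc)
          {
                  struct chunk_ctxt *cc =
                          container_of(wc->wr_cqe, struct chunk_ctxt, cc_cqe);

                  rdma_rw_ctx_destroy(&cc->cc_ctx, cc->cc_qp, cc->cc_port,
                                      cc->cc_sgl, cc->cc_sg_cnt,
                                      DMA_TO_DEVICE);
                  if (wc->status != IB_WC_SUCCESS)
                          set_bit(XPT_CLOSE, &cc->cc_xprt->xpt_flags);
                  svc_xprt_put(cc->cc_xprt);      /* once per chunk */
          }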
    • svcrdma: Eliminate RPCRDMA_SQ_DEPTH_MULT · b623589d
      Authored by Chuck Lever
      The Send Queue depth is temporarily reduced to 1 SQE per credit. The
      new rdma_rw API does an internal computation, during QP creation, to
      increase the depth of the Send Queue to handle RDMA Read and Write
      operations.
      
      This change has to come before the NFSD code paths are updated to
      use the rdma_rw API. Without this patch, rdma_rw_init_qp() increases
      the size of the SQ too much, resulting in memory allocation failures
      during QP creation.
      Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: J. Bruce Fields <bfields@redhat.com>
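
      The sizing rule can be sketched with the real ib_qp_init_attr
      fields; the credit count is an illustrative value:

          static void init_qp_caps(struct ib_qp_init_attr *attr,
                                   unsigned int credits)
          {
                  memset(attr, 0, sizeof(*attr));
                  attr->cap.max_send_wr = credits;  /* 1 SQE per credit */
                  attr->cap.max_recv_wr = credits;
                  /* A non-zero max_rdma_ctxs is what makes the core call
                   * rdma_rw_init_qp(), which then grows max_send_wr
                   * internally to cover Read/Write contexts.
                   */
                  attr->cap.max_rdma_ctxs = credits;
          }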
    • svcrdma: Add svc_rdma_map_reply_hdr() · 6e6092ca
      Authored by Chuck Lever
      Introduce a helper to DMA-map a reply's transport header before
      sending it. This will in part replace the map vector cache.
      Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: J. Bruce Fields <bfields@redhat.com>
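
      A sketch of such a mapping helper, under assumed names;
      ib_dma_map_page() and ib_dma_mapping_error() are the real DMA API
      wrappers:

          static int map_reply_hdr(struct ib_device *dev, struct ib_pd *pd,
                                   struct ib_sge *sge, struct page *hdr_page,
                                   unsigned int hdr_len)
          {
                  sge->addr = ib_dma_map_page(dev, hdr_page, 0, hdr_len,
                                              DMA_TO_DEVICE);
                  if (ib_dma_mapping_error(dev, sge->addr))
                          return -EIO;
                  sge->length = hdr_len;
                  sge->lkey = pd->local_dma_lkey;
                  return 0;
          }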
    • svcrdma: Move send_wr to svc_rdma_op_ctxt · 17f5f7f5
      Authored by Chuck Lever
      Clean up: Move the ib_send_wr off the stack, and move common code
      to post a Send Work Request into a helper.
      
      This is a refactoring change only.
      Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
      Signed-off-by: J. Bruce Fields <bfields@redhat.com>
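
      The helper described above plausibly takes this shape; op_ctxt is
      an illustrative stand-in for svc_rdma_op_ctxt:

          struct op_ctxt {
                  struct ib_send_wr send_wr;      /* moved off the stack */
                  struct ib_cqe cqe;
                  struct ib_sge sge[4];
          };

          static int post_send_wr(struct ib_qp *qp, struct op_ctxt *ctxt,
                                  int num_sge)
          {
                  struct ib_send_wr *bad_wr;

                  ctxt->send_wr.next = NULL;
                  ctxt->send_wr.wr_cqe = &ctxt->cqe;
                  ctxt->send_wr.sg_list = ctxt->sge;
                  ctxt->send_wr.num_sge = num_sge;
                  ctxt->send_wr.opcode = IB_WR_SEND;
                  ctxt->send_wr.send_flags = IB_SEND_SIGNALED;
                  return ib_post_send(qp, &ctxt->send_wr, &bad_wr);
          }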
    • nfsd: check for oversized NFSv2/v3 arguments · 51f56777
      Authored by J. Bruce Fields
      A client can append random data to the end of an NFSv2 or NFSv3 RPC call
      without our complaining; we'll just stop parsing at the end of the
      expected data and ignore the rest.
      
      Encoded arguments and replies are stored together in an array of pages,
      and if a call is too large it could leave inadequate space for the
      reply.  This is normally OK because NFS RPCs typically have either
      short arguments and long replies (like READ) or long arguments and short
      replies (like WRITE).  But a client that sends an incorrectly long call
      can violate those assumptions.  This was observed to cause crashes.
      
      So, insist that the argument not be any longer than we expect.
      
      Also, several operations increment rq_next_page in the decode routine
      before checking the argument size, which can leave rq_next_page pointing
      well past the end of the page array, causing trouble later in
      svc_free_pages.
      
      As a followup, we may also want to rewrite the encoding routines to check
      more carefully that they aren't running off the end of the page array.
      Reported-by: Tuomas Haanpää <thaan@synopsys.com>
      Reported-by: Ari Kauppi <ari@synopsys.com>
      Cc: stable@vger.kernel.org
      Signed-off-by: J. Bruce Fields <bfields@redhat.com>
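
      The insistence above boils down to a length check against the
      decoded argument before processing. svc_rqst and rq_arg are the
      real structures; the helper name and its call site are
      illustrative:

          static bool nfs_call_too_big(struct svc_rqst *rqstp,
                                       unsigned int max_arg_len)
          {
                  /* rq_arg.len covers the whole decoded argument; any
                   * excess would crowd out the pages reserved for the
                   * reply.
                   */
                  return rqstp->rq_arg.len > max_arg_len;
          }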
  2. 02 Mar 2017, 1 commit
  3. 25 Feb 2017, 3 commits
  4. 22 Feb 2017, 2 commits
  5. 10 Feb 2017, 3 commits
  6. 09 Feb 2017, 8 commits
  7. 01 Feb 2017, 1 commit
  8. 31 Jan 2017, 1 commit
  9. 25 Jan 2017, 1 commit
    • SUNRPC: cleanup ida information when removing sunrpc module · c929ea0b
      Authored by Kinglong Mee
      After removing the sunrpc module, kmemleak reports many leaks such as:
      unreferenced object 0xffff88003316b1e0 (size 544):
        comm "gssproxy", pid 2148, jiffies 4294794465 (age 4200.081s)
        hex dump (first 32 bytes):
          00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
          00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
        backtrace:
          [<ffffffffb0cfb58a>] kmemleak_alloc+0x4a/0xa0
          [<ffffffffb03507fe>] kmem_cache_alloc+0x15e/0x1f0
          [<ffffffffb0639baa>] ida_pre_get+0xaa/0x150
          [<ffffffffb0639cfd>] ida_simple_get+0xad/0x180
          [<ffffffffc06054fb>] nlmsvc_lookup_host+0x4ab/0x7f0 [lockd]
          [<ffffffffc0605e1d>] lockd+0x4d/0x270 [lockd]
          [<ffffffffc06061e5>] param_set_timeout+0x55/0x100 [lockd]
          [<ffffffffc06cba24>] svc_defer+0x114/0x3f0 [sunrpc]
          [<ffffffffc06cbbe7>] svc_defer+0x2d7/0x3f0 [sunrpc]
          [<ffffffffc06c71da>] rpc_show_info+0x8a/0x110 [sunrpc]
          [<ffffffffb044a33f>] proc_reg_write+0x7f/0xc0
          [<ffffffffb038e41f>] __vfs_write+0xdf/0x3c0
          [<ffffffffb0390f1f>] vfs_write+0xef/0x240
          [<ffffffffb0392fbd>] SyS_write+0xad/0x130
          [<ffffffffb0d06c37>] entry_SYSCALL_64_fastpath+0x1a/0xa9
          [<ffffffffffffffff>] 0xffffffffffffffff
      
      The cause is that the IDA's internal allocations (dynamic memory) are
      never cleaned up.
      Signed-off-by: Kinglong Mee <kinglongmee@gmail.com>
      Fixes: 2f048db4 ("SUNRPC: Add an identifier for struct rpc_clnt")
      Cc: stable@vger.kernel.org # v3.12+
      Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
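
      The shape of the fix: release the IDA's cached layers on module
      teardown. ida_destroy() is the real API; the names below assume
      the client-id IDA is a module-level DEFINE_IDA():

          static DEFINE_IDA(rpc_clids);

          static void rpc_cleanup_clids(void)
          {
                  ida_destroy(&rpc_clids);  /* call from cleanup_sunrpc() */
          }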
  10. 14 Jan 2017, 1 commit
    • locking/atomic, kref: Add kref_read() · 2c935bc5
      Authored by Peter Zijlstra
      Since we need to change the implementation, stop exposing internals.
      
      Provide kref_read() to read the current reference count; typically
      used for debug messages.
      
      Kills two anti-patterns:
      
      	atomic_read(&kref->refcount)
      	kref->refcount.counter
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
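
      With the counter still an atomic_t at the time of this series, a
      faithful sketch of the accessor is:

          static inline int kref_read(const struct kref *kref)
          {
                  return atomic_read(&kref->refcount);
          }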
  11. 13 Jan 2017, 1 commit
  12. 01 Dec 2016, 4 commits
    • svcrdma: Remove svc_rdma_op_ctxt::wc_status · 96a58f9c
      Authored by Chuck Lever
      Clean up: Completion status is already reported in the individual
      completion handlers. Save a few bytes in struct svc_rdma_op_ctxt.
      Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
      Signed-off-by: J. Bruce Fields <bfields@redhat.com>
    • svcrdma: Remove DMA map accounting · dd6fd213
      Authored by Chuck Lever
      Clean up: sc_dma_used is not required for correct operation. It is
      simply a debugging tool to report when svcrdma has leaked DMA maps.
      
      However, manipulating an atomic has a measurable CPU cost, and DMA
      map accounting specific to svcrdma will be meaningless once svcrdma
      is converted to use the new generic r/w API.
      
      A similar kind of debug accounting can be done simply by enabling
      the IOMMU or by using CONFIG_DMA_API_DEBUG, CONFIG_IOMMU_DEBUG, and
      CONFIG_IOMMU_LEAK.
      Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
      Signed-off-by: J. Bruce Fields <bfields@redhat.com>
    • svcrdma: Remove BH-disabled spin locking in svc_rdma_send() · e4eb42ce
      Authored by Chuck Lever
      svcrdma's current SQ accounting algorithm takes sc_lock and disables
      bottom-halves while posting all RDMA Read, Write, and Send WRs.
      
      This is relatively heavyweight serialization. And note that Write and
      Send are already fully serialized by the xpt_mutex.
      
      Using a single atomic_t should be all that is necessary to guarantee
      that ib_post_send() is called only when there is enough space on the
      send queue. This is what the other RDMA-enabled storage targets do.
      Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
      Signed-off-by: J. Bruce Fields <bfields@redhat.com>
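
      The atomic_t scheme described above follows a reserve/refund
      pattern; all names here are illustrative:

          static int reserve_and_post(struct ib_qp *qp,
                                      struct ib_send_wr *wr, int num_wrs,
                                      atomic_t *sq_avail,
                                      wait_queue_head_t *sq_wait)
          {
                  struct ib_send_wr *bad_wr;
                  int ret;

                  while (atomic_sub_return(num_wrs, sq_avail) < 0) {
                          atomic_add(num_wrs, sq_avail);  /* refund, wait */
                          wait_event(*sq_wait,
                                     atomic_read(sq_avail) >= num_wrs);
                  }
                  ret = ib_post_send(qp, wr, &bad_wr);
                  if (ret)
                          atomic_add(num_wrs, sq_avail);  /* post failed */
                  return ret;
          }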
    • svcrdma: Renovate sendto chunk list parsing · 5fdca653
      Authored by Chuck Lever
      The current sendto code appears to support clients that provide only
      one of a Read list, a Write list, or a Reply chunk. My reading of
      that code is that it doesn't support the following cases:
      
       - Read list + Write list
       - Read list + Reply chunk
       - Write list + Reply chunk
       - Read list + Write list + Reply chunk
      
      The protocol allows more than one Read or Write chunk in those
      lists. Some clients do send a Read list and Reply chunk
      simultaneously. NFSv4 WRITE uses a Read list for the data payload,
      and a Reply chunk because the GETATTR result in the reply can
      contain a large object like an ACL.
      
      Generalize one of the sendto code paths needed to support all of
      the above cases, and attempt to ensure that only one pass is done
      through the RPC Call's transport header to gather chunk list
      information for building the reply.
      Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
      Signed-off-by: J. Bruce Fields <bfields@redhat.com>
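
      The single pass amounts to one walk over the Call's chunk lists. A
      sketch of the Write-list portion, assuming p points just past the
      Read-list terminator (xdr_zero and xdr_decode_hyper() are real XDR
      helpers; the rest is illustrative, with the wire layout recalled
      from the RPC-over-RDMA spec):

          static __be32 *walk_write_list(__be32 *p)
          {
                  while (*p++ != xdr_zero) {      /* one Write chunk */
                          u32 nsegs = be32_to_cpup(p++);

                          while (nsegs--) {
                                  u32 handle = be32_to_cpup(p++);
                                  u32 length = be32_to_cpup(p++);
                                  u64 offset;

                                  p = xdr_decode_hyper(p, &offset);
                                  /* stash handle/length/offset for the
                                   * reply path */
                          }
                  }
                  return p;       /* at the Reply chunk discriminator */
          }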
  13. 14 Nov 2016, 1 commit
    • sunrpc: svc_age_temp_xprts_now should not call setsockopt non-tcp transports · ea08e392
      Authored by Scott Mayhew
      This fixes the following panic that can occur with NFSoRDMA.
      
      general protection fault: 0000 [#1] SMP
      Modules linked in: rpcrdma ib_isert iscsi_target_mod ib_iser libiscsi
      scsi_transport_iscsi ib_srpt target_core_mod ib_srp scsi_transport_srp
      scsi_tgt ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm
      mlx5_ib ib_core intel_powerclamp coretemp kvm_intel kvm sg ioatdma
      ipmi_devintf ipmi_ssif dcdbas iTCO_wdt iTCO_vendor_support pcspkr
      irqbypass sb_edac shpchp dca crc32_pclmul ghash_clmulni_intel edac_core
      lpc_ich aesni_intel lrw gf128mul glue_helper ablk_helper mei_me mei
      ipmi_si cryptd wmi ipmi_msghandler acpi_pad acpi_power_meter nfsd
      auth_rpcgss nfs_acl lockd grace sunrpc ip_tables xfs libcrc32c sd_mod
      crc_t10dif crct10dif_generic mgag200 i2c_algo_bit drm_kms_helper
      syscopyarea sysfillrect sysimgblt ahci fb_sys_fops ttm libahci mlx5_core
      tg3 crct10dif_pclmul drm crct10dif_common
      ptp i2c_core libata crc32c_intel pps_core fjes dm_mirror dm_region_hash
      dm_log dm_mod
      CPU: 1 PID: 120 Comm: kworker/1:1 Not tainted 3.10.0-514.el7.x86_64 #1
      Hardware name: Dell Inc. PowerEdge R320/0KM5PX, BIOS 2.4.2 01/29/2015
      Workqueue: events check_lifetime
      task: ffff88031f506dd0 ti: ffff88031f584000 task.ti: ffff88031f584000
      RIP: 0010:[<ffffffff8168d847>]  [<ffffffff8168d847>]
      _raw_spin_lock_bh+0x17/0x50
      RSP: 0018:ffff88031f587ba8  EFLAGS: 00010206
      RAX: 0000000000020000 RBX: 20041fac02080072 RCX: ffff88031f587fd8
      RDX: 0000000000000000 RSI: 0000000000000000 RDI: 20041fac02080072
      RBP: ffff88031f587bb0 R08: 0000000000000008 R09: ffffffff8155be77
      R10: ffff880322a59b00 R11: ffffea000bf39f00 R12: 20041fac02080072
      R13: 000000000000000d R14: ffff8800c4fbd800 R15: 0000000000000001
      FS:  0000000000000000(0000) GS:ffff880322a40000(0000)
      knlGS:0000000000000000
      CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      CR2: 00007f3c52d4547e CR3: 00000000019ba000 CR4: 00000000001407e0
      DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
      Stack:
      20041fac02080002 ffff88031f587bd0 ffffffff81557830 20041fac02080002
      ffff88031f587c78 ffff88031f587c40 ffffffff8155ae08 000000010157df32
      0000000800000001 ffff88031f587c20 ffffffff81096acb ffffffff81aa37d0
      Call Trace:
      [<ffffffff81557830>] lock_sock_nested+0x20/0x50
      [<ffffffff8155ae08>] sock_setsockopt+0x78/0x940
      [<ffffffff81096acb>] ? lock_timer_base.isra.33+0x2b/0x50
      [<ffffffff8155397d>] kernel_setsockopt+0x4d/0x50
      [<ffffffffa0386284>] svc_age_temp_xprts_now+0x174/0x1e0 [sunrpc]
      [<ffffffffa03b681d>] nfsd_inetaddr_event+0x9d/0xd0 [nfsd]
      [<ffffffff81691ebc>] notifier_call_chain+0x4c/0x70
      [<ffffffff810b687d>] __blocking_notifier_call_chain+0x4d/0x70
      [<ffffffff810b68b6>] blocking_notifier_call_chain+0x16/0x20
      [<ffffffff815e8538>] __inet_del_ifa+0x168/0x2d0
      [<ffffffff815e8cef>] check_lifetime+0x25f/0x270
      [<ffffffff810a7f3b>] process_one_work+0x17b/0x470
      [<ffffffff810a8d76>] worker_thread+0x126/0x410
      [<ffffffff810a8c50>] ? rescuer_thread+0x460/0x460
      [<ffffffff810b052f>] kthread+0xcf/0xe0
      [<ffffffff810b0460>] ? kthread_create_on_node+0x140/0x140
      [<ffffffff81696418>] ret_from_fork+0x58/0x90
      [<ffffffff810b0460>] ? kthread_create_on_node+0x140/0x140
      Code: ca 75 f1 5d c3 0f 1f 80 00 00 00 00 eb d9 66 0f 1f 44 00 00 0f 1f
      44 00 00 55 48 89 e5 53 48 89 fb e8 7e 04 a0 ff b8 00 00 02 00 <f0> 0f
      c1 03 89 c2 c1 ea 10 66 39 c2 75 03 5b 5d c3 83 e2 fe 0f
      RIP  [<ffffffff8168d847>] _raw_spin_lock_bh+0x17/0x50
      RSP <ffff88031f587ba8>
      Signed-off-by: Scott Mayhew <smayhew@redhat.com>
      Fixes: c3d4879e ("sunrpc: Add a function to close temporary transports immediately")
      Reviewed-by: Chuck Lever <chuck.lever@oracle.com>
      Signed-off-by: J. Bruce Fields <bfields@redhat.com>
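
      The fix takes roughly this shape: each transport class supplies
      its own "kill temporary xprt" method, so the socket-specific
      setsockopt stays in the TCP path and RDMA supplies a no-op. A
      sketch close to the upstream change, with details recalled from
      memory:

          static void svc_tcp_kill_temp_xprt(struct svc_xprt *xprt)
          {
                  struct svc_sock *svsk =
                          container_of(xprt, struct svc_sock, sk_xprt);
                  struct socket *sock = svsk->sk_sock;
                  struct linger no_linger = {
                          .l_onoff  = 1,
                          .l_linger = 0,          /* RST on close */
                  };

                  kernel_setsockopt(sock, SOL_SOCKET, SO_LINGER,
                                    (char *)&no_linger, sizeof(no_linger));
          }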
  14. 01 Oct 2016, 1 commit
  15. 23 Sep 2016, 3 commits
    • svcrdma: support Remote Invalidation · 25d55296
      Authored by Chuck Lever
      Support Remote Invalidation. A private message is exchanged with
      the client upon RDMA transport connect that indicates whether
      Send With Invalidation may be used by the server to send RPC
      replies. The invalidate_rkey is arbitrarily chosen from among
      rkeys present in the RPC-over-RDMA header's chunk lists.
      
      Send With Invalidate improves performance only when clients can
      recognize, while processing an RPC reply, that an rkey has already
      been invalidated. That has been submitted as a separate change.
      
      In the future, the RPC-over-RDMA protocol might support Remote
      Invalidation properly. The protocol needs to enable signaling
      between peers to indicate when Remote Invalidation can be used
      for each individual RPC.
      Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
      Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
      Signed-off-by: J. Bruce Fields <bfields@redhat.com>
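
      Choosing Send With Invalidate on the reply Send can be sketched
      with the real ib_verbs fields (opcode and ex.invalidate_rkey); the
      policy flag and rkey selection are illustrative:

          static void set_reply_opcode(struct ib_send_wr *wr,
                                       bool remote_inv_ok, u32 inv_rkey)
          {
                  if (remote_inv_ok && inv_rkey) {
                          wr->opcode = IB_WR_SEND_WITH_INV;
                          wr->ex.invalidate_rkey = inv_rkey;
                  } else {
                          wr->opcode = IB_WR_SEND;
                  }
          }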
    • rpcrdma: RDMA/CM private message data structure · 5d487096
      Authored by Chuck Lever
      Introduce data structure used by both client and server to exchange
      implementation details during RDMA/CM connection establishment.
      
      This is an experimental out-of-band exchange between Linux
      RPC-over-RDMA Version One implementations, replacing the deprecated
      CCP (see RFC 5666bis). The purpose of this extension is to enable
      prototyping of features that might be introduced in a subsequent
      version of RPC-over-RDMA.
      
      Suggested by Christoph Hellwig and Devesh Sharma.
      Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
      Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
      Signed-off-by: J. Bruce Fields <bfields@redhat.com>
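
      The message is small enough to sketch in full; this mirrors the
      structure that landed in include/linux/sunrpc/rpcrdma.h, with the
      field comments reconstructed from memory:

          struct rpcrdma_connect_private {
                  __be32  cp_magic;       /* guards against stray data */
                  u8      cp_version;     /* format version, 1 today */
                  u8      cp_flags;       /* RPCRDMA_CMP_F_* bits */
                  u8      cp_send_size;   /* encoded inline threshold */
                  u8      cp_recv_size;   /* encoded inline threshold */
          } __packed;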
    • svcrdma: Tail iovec leaves an orphaned DMA mapping · cace564f
      Authored by Chuck Lever
      The ctxt's count field is overloaded to mean the number of pages in
      the ctxt->page array and the number of SGEs in the ctxt->sge array.
      Typically these two numbers are the same.
      
      However, when an inline RPC reply is constructed from an xdr_buf
      with a tail iovec, the head and tail often occupy the same page,
      but each are DMA mapped independently. In that case, ->count equals
      the number of pages, but it does not equal the number of SGEs.
      There's one more SGE, for the tail iovec. Hence there is one more
      DMA mapping than there are pages in the ctxt->page array.
      
      This isn't a real problem until the server's IOMMU is enabled. Then
      each RPC reply that has content in that iovec orphans a DMA mapping
      that consists of real resources.
      
      krb5i and krb5p always populate that tail iovec. After a couple
      million sent krb5i/p RPC replies, the NFS server starts behaving
      erratically. Reboot is needed to clear the problem.
      
      Fixes: 9d11b51c ("svcrdma: Fix send_reply() scatter/gather set-up")
      Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
      Signed-off-by: J. Bruce Fields <bfields@redhat.com>
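
      The fix implied here is to key DMA unmapping to the number of
      mapped SGEs rather than to the page count. A sketch, with the ctxt
      field names as assumptions:

          static void unmap_ctxt_sges(struct ib_device *dev,
                                      struct op_ctxt *ctxt)
          {
                  unsigned int i;

                  /* Walk mapped SGEs, not pages: the tail iovec adds
                   * one SGE beyond ctxt->count.
                   */
                  for (i = 0; i < ctxt->mapped_sges; i++)
                          ib_dma_unmap_page(dev, ctxt->sge[i].addr,
                                            ctxt->sge[i].length,
                                            DMA_TO_DEVICE);
                  ctxt->mapped_sges = 0;
          }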
  16. 20 Sep 2016, 2 commits
    • xprtrdma: Support larger inline thresholds · 44829d02
      Authored by Chuck Lever
      The Version One default inline threshold is still 1KB. But allow
      testing with thresholds up to 64KB.
      
      This maximum is somewhat arbitrary. There's no fundamental
      architectural limit I'm aware of, but it's good to keep the size of
      Receive buffers reasonable. Now that Send can use a s/g list, a
      Send buffer is only as large as each RPC requires. Receive buffers
      are always the size of the inline threshold, however.
      Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
      Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
    • xprtrdma: Client-side support for rpcrdma_connect_private · 87cfb9a0
      Authored by Chuck Lever
      Send an RDMA-CM private message on connect, and look for one during
      a connection-established event.
      
      Both sides can communicate their various implementation limits.
      Implementations that don't support this sideband protocol ignore it.
      
      Once the client knows the server's inline threshold maxima, it can
      adjust the use of Reply chunks, and eliminate most use of Position
      Zero Read chunks. Moderately-sized I/O can be done using a pure
      inline RDMA Send instead of RDMA operations that require memory
      registration.
      Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
      Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
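
      On the wire, the sideband message simply rides in the CM private
      data during connect; rdma_conn_param and rdma_connect() are the
      real rdma_cm API, and pmsg is the private-message structure
      sketched earlier:

          static int connect_with_private(struct rdma_cm_id *id,
                                          struct rpcrdma_connect_private *pmsg)
          {
                  struct rdma_conn_param param;

                  memset(&param, 0, sizeof(param));
                  param.private_data = pmsg;
                  param.private_data_len = sizeof(*pmsg);
                  param.responder_resources = 32; /* illustrative */
                  param.initiator_depth = 32;     /* illustrative */
                  return rdma_connect(id, &param);
          }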