1. 18 Nov 2013 (1 commit)
    • IB/core: extended command: an improved infrastructure for uverbs commands · f21519b2
      By Yann Droneaud
      Commit 400dbc96 ("IB/core: Infrastructure for extensible uverbs
      commands") added an infrastructure for extensible uverbs commands
      while later commit 436f2ad0 ("IB/core: Export ib_create/destroy_flow
      through uverbs") exported ib_create_flow()/ib_destroy_flow() functions
      using this new infrastructure.
      
      According to commit 400dbc96, the purpose of this
      infrastructure is to support passing provider (e.g. hardware)
      specific buffers around when userspace issues commands to the
      kernel, so that uverbs (e.g. core) buffers can be extended
      independently of the provider buffers.
      
      But the new kernel command function prototypes were not modified to
      take advantage of this extension. This issue was exposed by Roland
      Dreier in a previous review[1].
      
      So this patch is an attempt at a revised extensible command
      infrastructure.
      
      This improved extensible command infrastructure distinguishes the
      core (e.g. legacy) command/response buffers from the provider
      (e.g. hardware) command/response buffers: each function
      implementing an extended command is given one struct ib_udata to
      hold the core (e.g. uverbs) input and output buffers, and another
      struct ib_udata to hold the hw (e.g. provider) input and output
      buffers.
      
      Having those buffers identified separately makes it easier to grow
      one buffer to support an extension without adding code to guess
      the exact size of each command/response part.  This should make
      the extended functions more reliable.
      
      Additionally, instead of relying on the command identifier being
      greater than IB_USER_VERBS_CMD_THRESHOLD, the proposed
      infrastructure relies on unused bits in the command field: of the
      32 bits in the field, only 6 bits are really needed to encode the
      identifiers of the commands currently supported by the kernel.
      (Even using only 6 bits leaves room for about 23 new commands.)
      
      So this patch makes use of some high-order bits in the command
      field to store flags, leaving enough room for more command
      identifiers than one will ever need (e.g. 256).
      
      The new flags specify whether the command should be processed as
      an extended one or a legacy one.  While designing the new command
      format, care was taken to keep the use of flags itself extensible.
      
      Using high-order bits of the command field ensures that a newer
      libibverbs on an older kernel will properly fail when trying to
      call extended commands.  On the other hand, an older libibverbs
      on a newer kernel will never be able to issue calls to extended
      commands.
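      
      To make the split concrete, here is a sketch of how a dispatcher
      might decode the field; the mask values are illustrative,
      assuming an 8-bit command part and flags in the top byte:
      
          #define IB_USER_VERBS_CMD_COMMAND_MASK  0x000000ffu
          #define IB_USER_VERBS_CMD_FLAGS_MASK    0xff000000u
          #define IB_USER_VERBS_CMD_FLAGS_SHIFT   24
          #define IB_USER_VERBS_CMD_FLAG_EXTENDED 0x80u
      
          __u32 command = hdr.command & IB_USER_VERBS_CMD_COMMAND_MASK;
          __u32 flags   = (hdr.command & IB_USER_VERBS_CMD_FLAGS_MASK)
                          >> IB_USER_VERBS_CMD_FLAGS_SHIFT;
      
          /* reject any flag we do not know, so future flags stay usable */
          if (flags & ~IB_USER_VERBS_CMD_FLAG_EXTENDED)
                  return -EINVAL;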
      
      The extended command header includes the optional response pointer
      so that the output buffer length and output buffer pointer are
      located together in the command, allowing proper parameter
      checking.  This should make implementing functions easier and
      safer.
      
      Additionally, the extended header ensures 64-bit alignment and
      makes all sizes multiples of 8 bytes, extending the maximum buffer
      sizes:
      
                                   legacy      extended
      
         Maximum command buffer:  256KBytes   1024KBytes (512KBytes + 512KBytes)
        Maximum response buffer:  256KBytes   1024KBytes (512KBytes + 512KBytes)
      
      For proper buffer size accounting, the header sizes are no longer
      counted in "in_words".
      
      One oddity of the current extensible infrastructure, reading the
      "legacy" command header twice, is fixed by removing the "legacy"
      command header from the extended command header: the two are
      processed as different parts of the command, so memory is read
      once and no information is duplicated.  This makes it clear that
      this is an extended command scheme, not a different command
      scheme.
      
      The proposed scheme will format input (command) and output (response)
      buffers this way:
      
      - command:
      
        legacy header +
        extended header +
        command data (core + hw):
      
          +----------------------------------------+
          | flags     |   00      00    |  command |
          |        in_words    |   out_words       |
          +----------------------------------------+
          |                 response               |
          |                 response               |
          | provider_in_words | provider_out_words |
          |                 padding                |
          +----------------------------------------+
          |                                        |
          .              <uverbs input>            .
          .              (in_words * 8)            .
          |                                        |
          +----------------------------------------+
          |                                        |
          .             <provider input>           .
          .          (provider_in_words * 8)       .
          |                                        |
          +----------------------------------------+
      
      - response, if present:
      
          +----------------------------------------+
          |                                        |
          .          <uverbs output space>         .
          .             (out_words * 8)            .
          |                                        |
          +----------------------------------------+
          |                                        |
          .         <provider output space>        .
          .         (provider_out_words * 8)       .
          |                                        |
          +----------------------------------------+
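      
      As a sketch, the two headers diagrammed above could be described
      by structures like the following (names are assumed here for
      illustration; the legacy header keeps its existing layout):
      
          struct cmd_hdr {                  /* legacy header */
                  __u32 command;            /* flags | 0 | 0 | command */
                  __u16 in_words;           /* 8-byte units for extended
                                             * commands (4-byte units for
                                             * legacy ones, hence the
                                             * doubled maximums above) */
                  __u16 out_words;
          };
      
          struct ex_cmd_hdr {               /* extended header */
                  __u64 response;           /* optional response pointer */
                  __u16 provider_in_words;  /* hw input, 8-byte units */
                  __u16 provider_out_words; /* hw output, 8-byte units */
                  __u32 padding;            /* keeps 64-bit alignment */
          };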
      
      The overall design ensures that the extensible infrastructure is
      itself extensible, while being more reliable thanks to more input
      and bounds checking.
      
      Note:
      
      The unused field in the extended header would be a perfect
      candidate to hold the command's "comp_mask" (i.e. the bit field
      used to handle compatibility).  This was suggested by Roland
      Dreier in a previous review[2].  But the "comp_mask" field is
      likely to be present in the uverbs input and/or the provider
      input, and likewise for the response, as noted by Matan
      Barak[3], so it doesn't make sense to put "comp_mask" in the
      header.
      
      [1]:
      http://marc.info/?i=CAL1RGDWxmM17W2o_era24A-TTDeKyoL6u3NRu_=t_dhV_ZA9MA@mail.gmail.com
      
      [2]:
      http://marc.info/?i=CAL1RGDXJtrc849M6_XNZT5xO1+ybKtLWGq6yg6LhoSsKpsmkYA@mail.gmail.com
      
      [3]:
      http://marc.info/?i=525C1149.6000701@mellanox.com
      
      Signed-off-by: Yann Droneaud <ydroneaud@opteya.com>
      Link: http://marc.info/?i=cover.1383773832.git.ydroneaud@opteya.com
      
      [ Convert "ret ? ret : 0" to the equivalent "ret".  - Roland ]
      Signed-off-by: Roland Dreier <roland@purestorage.com>
  2. 16 Nov 2013 (1 commit)
  3. 09 Nov 2013 (1 commit)
  4. 03 Sep 2013 (1 commit)
  5. 29 Aug 2013 (2 commits)
    • IB/core: Export ib_create/destroy_flow through uverbs · 436f2ad0
      By Hadar Hen Zion
      Implement ib_uverbs_create_flow() and ib_uverbs_destroy_flow() to
      support flow steering for user space applications.
      Signed-off-by: Hadar Hen Zion <hadarh@mellanox.com>
      Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
      Signed-off-by: Roland Dreier <roland@purestorage.com>
    • IB/core: Add receive flow steering support · 319a441d
      By Hadar Hen Zion
      The RDMA stack allows applications to create IB_QPT_RAW_PACKET
      QPs, which receive plain Ethernet packets, specifically packets
      that don't carry any QPN to be matched by the receiving side.
      Applications using these QPs must be provided with a method to
      program a steering rule into the HW so that packets arriving at
      the local port can be routed to them.
      
      This patch adds ib_create_flow(), which allows providing a flow
      specification for a QP.  When there's a match between the
      specification and a received packet, the packet is forwarded to
      that QP, in the same way one uses ib_attach_multicast() for IB UD
      multicast handling.
      
      Flow specifications are provided as instances of struct ib_flow_spec_yyy,
      which describe L2, L3 and L4 headers.  Currently specs for Ethernet, IPv4,
      TCP and UDP are defined.  Flow specs are made of values and masks.
      
      The input to ib_create_flow() is a struct ib_flow_attr, which contains
      a few mandatory control elements and optional flow specs.
      
          struct ib_flow_attr {
                  enum ib_flow_attr_type type;
                  u16      size;
                  u16      priority;
                  u32      flags;
                  u8       num_of_specs;
                  u8       port;
                  /* Following are the optional layers according to user request
                   * struct ib_flow_spec_yyy
                   * struct ib_flow_spec_zzz
                   */
          };
      
      As these specs eventually come from user space, they are defined
      and used in a way that allows adding new spec types without a
      kernel/user ABI change, just a small API addition defining the
      newly added spec.
      
      The flow spec structures are defined as TLV (Type-Length-Value)
      entries, which allows calling ib_create_flow() with a
      variable-length list of optional specs.
      
      For the actual processing of ib_flow_attr, the driver uses the
      mandatory num_of_specs and size fields along with the TLV nature
      of the specs.
      
      Steering rule processing order is determined by the domain over
      which the rule is set and by the rule priority.  All rules set by
      user space applications fall into the IB_FLOW_DOMAIN_USER domain;
      other domains could be used by future IPoIB RFS and ethtool
      flow-steering interface implementations.  A lower numerical value
      for the priority field means higher priority.
      
      The returned value from ib_create_flow() is a struct ib_flow, which
      contains a database pointer (handle) provided by the HW driver to be
      used when calling ib_destroy_flow().
      
      Applications that offload TCP/IP traffic can also be written over IB
      UD QPs.  The ib_create_flow() / ib_destroy_flow() API is designed to
      support UD QPs too.  A HW driver can set IB_DEVICE_MANAGED_FLOW_STEERING
      to denote support for flow steering.
      
      The ib_flow_attr enum type supports usage of flow steering for promiscuous
      and sniffer purposes:
      
          IB_FLOW_ATTR_NORMAL - "regular" rule, steering according to rule specification
      
          IB_FLOW_ATTR_ALL_DEFAULT - default unicast and multicast rule, receive
              all Ethernet traffic which isn't steered to any QP
      
          IB_FLOW_ATTR_MC_DEFAULT - same as IB_FLOW_ATTR_ALL_DEFAULT but only for multicast
      
          IB_FLOW_ATTR_SNIFFER - sniffer rule, receive all port traffic
      
      The ALL_DEFAULT and MC_DEFAULT rule options are valid only for the
      Ethernet link type.
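      
      As an illustration, a kernel consumer might install a rule roughly
      as follows (a hedged sketch: the TLV spec layouts are abbreviated,
      and the QP and port values are hypothetical):
      
          struct {
                  struct ib_flow_attr         attr;
                  struct ib_flow_spec_ipv4    ipv4;
                  struct ib_flow_spec_tcp_udp tcp;
          } rule = {
                  .attr = {
                          .type         = IB_FLOW_ATTR_NORMAL,
                          .size         = sizeof(rule),
                          .num_of_specs = 2,
                          .port         = 1,
                  },
                  /* .ipv4/.tcp: each TLV spec carries type, size,
                   * and val/mask pairs for its header fields */
          };
          struct ib_flow *flow;
      
          flow = ib_create_flow(qp, &rule.attr, IB_FLOW_DOMAIN_USER);
          if (IS_ERR(flow))
                  return PTR_ERR(flow);
          /* matching packets are now steered to qp */
          ib_destroy_flow(flow);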
      Signed-off-by: Hadar Hen Zion <hadarh@mellanox.com>
      Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
      Signed-off-by: Roland Dreier <roland@purestorage.com>
  6. 14 Aug 2013 (1 commit)
  7. 08 Jul 2013 (1 commit)
  8. 22 Feb 2013 (1 commit)
    • IB/core: Add "type 2" memory windows support · 7083e42e
      By Shani Michaeli
      This patch enhances the IB core support for Memory Windows (MWs).
      
      MWs allow an application to have better/flexible control over remote
      access to memory.
      
      Two types of MWs are supported, with the second type having two flavors:
      
          Type 1  - associated with PD only
          Type 2A - associated with QPN only
          Type 2B - associated with PD and QPN
      
      Applications can allocate a MW once, and then repeatedly bind the MW
      to different ranges in MRs that are associated to the same PD. Type 1
      windows are bound through a verb, while type 2 windows are bound by
      posting a work request.
      
      The 32-bit memory key is composed of a 24-bit index and an 8-bit
      key. The key is changed with each bind, thus allowing more control
      over the peer's use of the memory key.
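      
      The ib_inc_rkey() helper added below captures this; a sketch of
      the arithmetic, assuming the key part occupies the low byte of
      the rkey:
      
          static inline u32 ib_inc_rkey_sketch(u32 rkey)
          {
                  const u32 mask = 0x000000ff;   /* 8-bit key part */
      
                  return ((rkey + 1) & mask) | (rkey & ~mask);
          }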
      
      The changes introduced are the following:
      
      * add memory window type enum and a corresponding parameter to ib_alloc_mw.
      * type 2 memory window bind work request support.
      * combine the common part of the bind verb struct ibv_mw_bind and
        the bind work request into a single struct.
      * add the ib_inc_rkey helper function to advance the tag part of an rkey.
      
      Consumer interface details:
      
      * new device capability flags IB_DEVICE_MEM_WINDOW_TYPE_2A and
        IB_DEVICE_MEM_WINDOW_TYPE_2B are added to indicate device support
        for these features.
      
        A device can set either IB_DEVICE_MEM_WINDOW_TYPE_2A or
        IB_DEVICE_MEM_WINDOW_TYPE_2B to indicate type 2A or type 2B
        support, or set neither to indicate that it doesn't support
        type 2 windows at all.
      
      * modify existing provider and consumer code for the new parameter
        of ib_alloc_mw and the ib_mw_bind_info structure.
      Signed-off-by: Haggai Eran <haggaie@mellanox.com>
      Signed-off-by: Shani Michaeli <shanim@mellanox.com>
      Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
      Signed-off-by: Roland Dreier <roland@purestorage.com>
  9. 01 Oct 2012 (1 commit)
  10. 09 May 2012 (2 commits)
  11. 09 Mar 2012 (1 commit)
    • IB: Change CQE "csum_ok" field to a bit flag · d927d505
      By Or Gerlitz
      Use a bit in wc_flags rather than a whole integer to hold the
      "checksum OK" flag.  By itself, this change doesn't reduce the
      size of struct ib_wc on 64-bit machines -- it stays at 56 bytes
      because of padding.  However, it will allow adding more fields in
      the future without enlarging the struct.  Also, it will let us
      have a unified approach with future libibverbs checksum offload
      reporting, because a bit flag doesn't break the library ABI.
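      
      A consumer then tests the bit instead of a separate field; a
      minimal sketch (flag name as introduced by this change; the skb
      consumer context is illustrative):
      
          struct ib_wc wc;
      
          /* ... after polling a receive completion ... */
          if (wc.wc_flags & IB_WC_IP_CSUM_OK)
                  skb->ip_summed = CHECKSUM_UNNECESSARY; /* HW-verified */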
      
      This patch was suggested during conversation with Liran Liss
      <liranl@mellanox.com>.
      Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
      Reviewed-by: Sean Hefty <sean.hefty@intel.com>
      Signed-off-by: Roland Dreier <roland@purestorage.com>
  12. 06 Mar 2012 (1 commit)
  13. 14 Oct 2011 (6 commits)
    • RDMA/core: Export ib_open_qp() to share XRC TGT QPs · 0e0ec7e0
      By Sean Hefty
      XRC TGT QPs are shared resources among multiple processes.  Since the
      creating process may exit, allow other processes which share the same
      XRC domain to open an existing QP.  This allows us to transfer
      ownership of an XRC TGT QP to another process.
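      
      A sketch of the intended use, assuming open attributes along the
      lines this description implies (the QP number is shared out of
      band by the creating process):
      
          struct ib_qp_open_attr attr = {
                  .qp_num  = shared_tgt_qpn,  /* created elsewhere */
                  .qp_type = IB_QPT_XRC_TGT,
          };
          struct ib_qp *qp = ib_open_qp(xrcd, &attr);
      
          if (IS_ERR(qp))
                  return PTR_ERR(qp);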
      Signed-off-by: Sean Hefty <sean.hefty@intel.com>
      Signed-off-by: Roland Dreier <roland@purestorage.com>
    • RDMA/uverbs: Export XRC domains to user space · 53d0bd1e
      By Sean Hefty
      Allow user space to create XRC domains.  Because XRCDs are expected to
      be shared among multiple processes, we use inodes to identify an XRCD.
      
      Based on patches by Jack Morgenstein <jackm@dev.mellanox.co.il>
      Signed-off-by: Sean Hefty <sean.hefty@intel.com>
      Signed-off-by: Roland Dreier <roland@purestorage.com>
    • RDMA/verbs: Cleanup XRC TGT QPs when destroying XRCD · d3d72d90
      By Sean Hefty
      XRC TGT QPs are intended to be shared among multiple users and
      processes.  Allow the destruction of an XRC TGT QP to be done explicitly
      through ib_destroy_qp() or when the XRCD is destroyed.
      
      To support destroying an XRC TGT QP, we need to track TGT QPs with the
      XRCD.  When the XRCD is destroyed, all tracked XRC TGT QPs are also
      cleaned up.
      
      To avoid stale reference issues, if a user is holding a reference on a
      TGT QP, we increment a reference count on the QP.  The user releases the
      reference by calling ib_release_qp.  This releases any access to the QP
      from a user above verbs, but allows the QP to continue to exist until
      destroyed by the XRCD.
      Signed-off-by: Sean Hefty <sean.hefty@intel.com>
      Signed-off-by: Roland Dreier <roland@purestorage.com>
    • RDMA/core: Add XRC QPs · b42b63cf
      By Sean Hefty
      XRC ("eXtended reliable connected") is an IB transport that provides
      better scalability by allowing senders to specify which shared receive
      queue (SRQ) should be used to receive a message, which essentially
      allows one transport context (QP connection) to serve multiple
      destinations (as long as they share an adapter, of course).
      
      XRC communication is between an initiator (INI) QP and a target (TGT)
      QP.  Target QPs are associated with SRQs through an XRCD.  An XRC TGT QP
      behaves like a receive-only RD QP.  XRC INI QPs behave similarly to RC
      QPs, except that work requests posted to an XRC INI QP must specify the
      remote SRQ that is the target of the work request.
      
      We define two new QP types for XRC, to distinguish between INI and TGT
      QPs, and update the core layer to support XRC QPs.
      
      This patch is derived from work by Jack Morgenstein
      <jackm@dev.mellanox.co.il>
      Signed-off-by: Sean Hefty <sean.hefty@intel.com>
      Signed-off-by: Roland Dreier <roland@purestorage.com>
    • RDMA/core: Add XRC SRQ type · 418d5130
      By Sean Hefty
      XRC ("eXtended reliable connected") is an IB transport that provides
      better scalability by allowing senders to specify which shared receive
      queue (SRQ) should be used to receive a message, which essentially
      allows one transport context (QP connection) to serve multiple
      destinations (as long as they share an adapter, of course).
      
      XRC defines SRQs that are specifically used by XRC connections.  Expand
      the SRQ code to support XRC SRQs.  An XRC SRQ is currently restricted to
      only XRC use according to the IB XRC Annex.
      
      Portions of this patch were derived from work by
      Jack Morgenstein <jackm@dev.mellanox.co.il>.
      Signed-off-by: Sean Hefty <sean.hefty@intel.com>
      Signed-off-by: Roland Dreier <roland@purestorage.com>
    • RDMA/core: Add SRQ type field · 96104eda
      By Sean Hefty
      Currently, there is only a single ("basic") type of SRQ, but with XRC
      support we will add a second.  Prepare for this by defining an SRQ type
      and setting all current users to IB_SRQT_BASIC.
      Signed-off-by: Sean Hefty <sean.hefty@intel.com>
      Signed-off-by: Roland Dreier <roland@purestorage.com>
  14. 13 Oct 2011 (1 commit)
    • RDMA/core: Add XRC domain support · 59991f94
      By Sean Hefty
      XRC ("eXtended reliable connected") is an IB transport that provides
      better scalability by allowing senders to specify which shared receive
      queue (SRQ) should be used to receive a message, which essentially
      allows one transport context (QP connection) to serve multiple
      destinations (as long as they share an adapter, of course).
      
      A few new concepts are introduced to support this.  This patch adds:
      
       - A new device capability flag, IB_DEVICE_XRC, which low-level
         drivers set to indicate that a device supports XRC.
       - A new object type, XRC domains (struct ib_xrcd), and new verbs
         ib_alloc_xrcd()/ib_dealloc_xrcd().  XRCDs are used to limit which
         XRC SRQs an incoming message can target.
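      
      A minimal usage sketch of the new verbs (error handling
      abbreviated; dev_attr is assumed to come from ib_query_device()):
      
          struct ib_xrcd *xrcd;
      
          if (!(dev_attr.device_cap_flags & IB_DEVICE_XRC))
                  return -ENOSYS;         /* device lacks XRC support */
      
          xrcd = ib_alloc_xrcd(device);
          if (IS_ERR(xrcd))
                  return PTR_ERR(xrcd);
          /* ... create XRC SRQs/QPs limited to this domain ... */
          ib_dealloc_xrcd(xrcd);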
      
      This patch is derived from work by Jack Morgenstein <jackm@dev.mellanox.co.il>.
      Signed-off-by: Sean Hefty <sean.hefty@intel.com>
      Signed-off-by: Roland Dreier <roland@purestorage.com>
  15. 12 Oct 2011 (1 commit)
  16. 27 Jul 2011 (1 commit)
  17. 19 Jul 2011 (1 commit)
  18. 17 Jan 2011 (1 commit)
    • RDMA: Update workqueue usage · f0626710
      By Tejun Heo
      * ib_wq is added, which is used as the common workqueue for infiniband
        instead of the system workqueue.  All system workqueue usages
        including flush_scheduled_work() callers are converted to use and
        flush ib_wq.
      
      * cancel_delayed_work() + flush_scheduled_work() converted to
        cancel_delayed_work_sync().
      
      * qib_wq is removed and ib_wq is used instead.
      
      This is to prepare for deprecation of flush_scheduled_work().
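      
      The conversion pattern, sketched on a hypothetical driver (the
      delayed-work member name is illustrative):
      
          /* before:
           *      cancel_delayed_work(&priv->mcast_task);
           *      flush_scheduled_work();
           * after, using ib_wq and the synchronous cancel:
           */
          queue_delayed_work(ib_wq, &priv->mcast_task, HZ);
          /* ... later, on teardown: */
          cancel_delayed_work_sync(&priv->mcast_task);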
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Roland Dreier <rolandd@cisco.com>
  19. 28 Sep 2010 (1 commit)
    • IB/core: Add link layer property to ports · a3f5adaf
      By Eli Cohen
      This patch allows ports to have different link layers:
      IB_LINK_LAYER_INFINIBAND or IB_LINK_LAYER_ETHERNET.  This is
      required for adding IBoE (InfiniBand-over-Ethernet, aka RoCE)
      support.  For devices that do not provide an implementation for
      querying the link layer property of a port, we return a default
      value based on the transport: RDMA_TRANSPORT_IB nodes will return
      IB_LINK_LAYER_INFINIBAND and RDMA_TRANSPORT_IWARP nodes will
      return IB_LINK_LAYER_ETHERNET.
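      
      The fallback amounts to a switch on the node transport; a sketch
      (helper name hypothetical):
      
          static enum rdma_link_layer
          default_link_layer(struct ib_device *dev)
          {
                  switch (rdma_node_get_transport(dev->node_type)) {
                  case RDMA_TRANSPORT_IB:
                          return IB_LINK_LAYER_INFINIBAND;
                  case RDMA_TRANSPORT_IWARP:
                          return IB_LINK_LAYER_ETHERNET;
                  default:
                          return IB_LINK_LAYER_UNSPECIFIED;
                  }
          }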
      Signed-off-by: Eli Cohen <eli@mellanox.co.il>
      Signed-off-by: Roland Dreier <rolandd@cisco.com>
  20. 05 Aug 2010 (1 commit)
  21. 22 May 2010 (1 commit)
    • IB/core: Allow device-specific per-port sysfs files · 9a6edb60
      By Ralph Campbell
      Add a new parameter to ib_register_device() so that low-level device
      drivers can pass in a pointer to a callback function that will be
      called for each port that is registered in sysfs.  This allows
      low-level device drivers to create files in
      
          /sys/class/infiniband/<hca>/ports/<N>/
      
      without having to poke through the internals of the RDMA sysfs handling.
      
      There is no need for an unregister function since the kobject
      reference will go to zero when ib_unregister_device() is called.
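      
      A sketch of a low-level driver using the new parameter (callback
      prototype assumed from the description; the attribute group is
      hypothetical):
      
          static int my_port_callback(struct ib_device *ibdev,
                                      u8 port_num, struct kobject *kobj)
          {
                  /* files appear under
                   * /sys/class/infiniband/<hca>/ports/<port_num>/ */
                  return sysfs_create_group(kobj, &my_port_attr_group);
          }
      
          ret = ib_register_device(&dev->ibdev, my_port_callback);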
      Signed-off-by: Ralph Campbell <ralph.campbell@qlogic.com>
      Signed-off-by: Roland Dreier <rolandd@cisco.com>
  22. 22 Apr 2010 (1 commit)
    • IB/core: Add support for masked atomic operations · 5e80ba8f
      By Vladimir Sokolovsky
       - Add new IB_WR_MASKED_ATOMIC_CMP_AND_SWP and
         IB_WR_MASKED_ATOMIC_FETCH_AND_ADD send opcodes that can be used
         to post "masked atomic compare and swap" and "masked atomic
         fetch and add" work requests, respectively.
       - Add the masked_atomic_cap capability.
       - Add mask fields to the atomic struct of ib_send_wr.
       - Add new opcodes to ib_wc_opcode.
      
      The new operations are described more precisely below:
      
      * Masked Compare and Swap (MskCmpSwap)
      
      The MskCmpSwap atomic operation is an extension to the CmpSwap
      operation defined in the IB spec.  MskCmpSwap allows the user to
      select a portion of the 64 bit target data for the “compare” check as
      well as to restrict the swap to a (possibly different) portion.  The
      pseudo code below describes the operation:
      
      | atomic_response = *va
      | if (!((compare_add ^ *va) & compare_add_mask)) then
      |     *va = (*va & ~(swap_mask)) | (swap & swap_mask)
      |
      | return atomic_response
      
      The additional operands are carried in the Extended Transport Header.
      Atomic response generation and packet format for MskCmpSwap is as for
      standard IB Atomic operations.
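      
      Equivalently, in C (a host-side model of the semantics only; the
      real operation executes atomically in the HCA):
      
          static u64 msk_cmp_swap(u64 *va, u64 compare_add,
                                  u64 compare_add_mask,
                                  u64 swap, u64 swap_mask)
          {
                  u64 atomic_response = *va;
      
                  /* compare only the bits selected by compare_add_mask */
                  if (!((compare_add ^ *va) & compare_add_mask))
                          /* swap only the bits selected by swap_mask */
                          *va = (*va & ~swap_mask) | (swap & swap_mask);
      
                  return atomic_response;
          }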
      
      * Masked Fetch and Add (MFetchAdd)
      
      The MFetchAdd atomic operation extends the functionality of the
      standard IB FetchAdd by allowing the user to split the target into
      multiple fields of selectable length.  The atomic add is done
      independently on each one of these fields.  A bit set in the
      field_boundary parameter specifies the field boundaries.  The
      pseudo code below describes the operation:
      
      | bit_adder(ci, b1, b2, *co)
      | {
      |	value = ci + b1 + b2
      |	*co = !!(value & 2)
      |
      |	return value & 1
      | }
      |
      | #define MASK_IS_SET(mask, attr)      (!!((mask)&(attr)))
      | bit_position = 1
      | carry = 0
      | atomic_response = 0
      |
      | for i = 0 to 63
      | {
      |         if ( i != 0 )
      |                 bit_position =  bit_position << 1
      |
      |         bit_add_res = bit_adder(carry, MASK_IS_SET(*va, bit_position),
      |                                 MASK_IS_SET(compare_add, bit_position), &new_carry)
      |         if (bit_add_res)
      |                 atomic_response |= bit_position
      |
      |         carry = ((new_carry) && (!MASK_IS_SET(compare_add_mask, bit_position)))
      | }
      |
      | return atomic_response
      Signed-off-by: Vladimir Sokolovsky <vlad@mellanox.co.il>
      Signed-off-by: Roland Dreier <rolandd@cisco.com>
  23. 25 Feb 2010 (1 commit)
  24. 10 Dec 2009 (1 commit)
  25. 15 Feb 2009 (1 commit)
  26. 27 Jul 2008 (1 commit)
    • dma-mapping: add the device argument to dma_mapping_error() · 8d8bb39b
      By FUJITA Tomonori
      Add per-device dma_mapping_ops support for CONFIG_X86_64 as POWER
      architecture does:
      
      This enables us to cleanly fix the Calgary IOMMU issue that some devices
      are not behind the IOMMU (http://lkml.org/lkml/2008/5/8/423).
      
      I think that per-device dma_mapping_ops support would also be
      helpful for KVM people to support PCI passthrough, but Andi thinks
      that this makes it difficult to support PCI passthrough (see the
      above thread).  So I CC'ed this to the KVM camp.  Comments are
      appreciated.
      
      A pointer to dma_mapping_ops is added to struct dev_archdata.  If
      the pointer is non-NULL, DMA operations in asm/dma-mapping.h use
      it.  If it's NULL, the system-wide dma_ops pointer is used as
      before.
      
      If it's useful for KVM people, I plan to implement a mechanism to register
      a hook called when a new pci (or dma capable) device is created (it works
      with hot plugging).  It enables IOMMUs to set up an appropriate
      dma_mapping_ops per device.
      
      The major obstacle is that dma_mapping_error doesn't take a pointer to the
      device unlike other DMA operations.  So x86 can't have dma_mapping_ops per
      device.  Note all the POWER IOMMUs use the same dma_mapping_error function
      so this is not a problem for POWER but x86 IOMMUs use different
      dma_mapping_error functions.
      
      The first patch adds the device argument to dma_mapping_error().
      The patch is trivial but large, since it touches lots of drivers
      and dma-mapping.h in all the architectures.
      
      This patch:
      
      dma_mapping_error() doesn't take a pointer to the device unlike other DMA
      operations.  So we can't have dma_mapping_ops per device.
      
      Note that POWER already has dma_mapping_ops per device but all the POWER
      IOMMUs use the same dma_mapping_error function.  x86 IOMMUs use device
      argument.
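      
      The mechanical part of the change is then a tree-wide signature
      update, roughly:
      
          dma_addr = dma_map_single(dev, ptr, size, DMA_TO_DEVICE);
      
          /* before: if (dma_mapping_error(dma_addr)) ...          */
          /* after, so per-device ops can supply their own check:  */
          if (dma_mapping_error(dev, dma_addr))
                  goto err;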
      
      [akpm@linux-foundation.org: fix sge]
      [akpm@linux-foundation.org: fix svc_rdma]
      [akpm@linux-foundation.org: build fix]
      [akpm@linux-foundation.org: fix bnx2x]
      [akpm@linux-foundation.org: fix s2io]
      [akpm@linux-foundation.org: fix pasemi_mac]
      [akpm@linux-foundation.org: fix sdhci]
      [akpm@linux-foundation.org: build fix]
      [akpm@linux-foundation.org: fix sparc]
      [akpm@linux-foundation.org: fix ibmvscsi]
      Signed-off-by: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp>
      Cc: Muli Ben-Yehuda <muli@il.ibm.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Avi Kivity <avi@qumranet.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  27. 15 Jul 2008 (5 commits)
    • RDMA/core: Add local DMA L_Key support · 96f15c03
      By Steve Wise
      - Change the IB_DEVICE_ZERO_STAG flag to the transport-neutral name
        IB_DEVICE_LOCAL_DMA_LKEY, which is used by iWARP RNICs to indicate 0
        STag support and IB HCAs to indicate reserved L_Key support.
      
      - Add a u32 local_dma_lkey member to struct ib_device.  Drivers fill
        this in with the appropriate local DMA L_Key (if they support it).
      
      - Fix up the drivers using this flag.
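      
      A consumer sketch: prefer the reserved lkey when the capability
      bit is set (the sge, device attributes, and fallback DMA MR are
      illustrative):
      
          if (device_attr.device_cap_flags & IB_DEVICE_LOCAL_DMA_LKEY)
                  sge.lkey = ibdev->local_dma_lkey; /* reserved L_Key */
          else
                  sge.lkey = my_dma_mr->lkey;       /* per-PD DMA MR */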
      Signed-off-by: Steve Wise <swise@opengridcomputing.com>
      Signed-off-by: Roland Dreier <rolandd@cisco.com>
    • IB/core: Add support for multicast loopback blocking · 47ee1b9f
      By Ron Livne
      This patch adds a creation flag for QPs,
      IB_QP_CREATE_BLOCK_MULTICAST_LOOPBACK, which when set means that
      multicast sends from the QP to a group that the QP is attached to
      will not be looped back to the QP's receive queue.  This can be
      used to save receive resources when a consumer does not want a
      local copy of multicast traffic; for example, IPoIB must waste CPU
      time throwing away such local copies of multicast traffic.
      
      This patch also adds a device capability flag that shows whether a
      device supports this feature or not.
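      
      Usage is a single creation flag; a sketch (other init attributes
      elided):
      
          struct ib_qp_init_attr init_attr = {
                  .qp_type      = IB_QPT_UD,
                  .create_flags = IB_QP_CREATE_BLOCK_MULTICAST_LOOPBACK,
                  /* ... send/recv CQs and capabilities ... */
          };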
      Signed-off-by: Ron Livne <ronli@voltaire.com>
      Signed-off-by: Roland Dreier <rolandd@cisco.com>
    • RDMA/core: Add iWARP protocol statistics attributes in sysfs · 7f624d02
      By Steve Wise
      This patch adds a sysfs attribute group called "proto_stats" under
      /sys/class/infiniband/$device/ and populates this group with protocol
      statistics if they exist for a given device.  Currently, only iWARP
      stats are defined, but the code is designed to allow InfiniBand
      protocol stats if they become available.  These stats are per-device
      and more importantly -not- per port.
      
      Details:
      
      - Add union rdma_protocol_stats in ib_verbs.h.  This union allows
        defining transport-specific stats.  Currently only iwarp stats are
        defined.
      
      - Add struct iw_protocol_stats to define the current set of iwarp
        protocol stats.
      
      - Add new ib_device method called get_proto_stats() to return protocol
        statistics.
      
      - Add logic in core/sysfs.c to create iwarp protocol stats attributes
        if the device is an RNIC and has a get_proto_stats() method.
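      
      An RNIC driver would then provide something like the following
      (prototype assumed from the description; the counters themselves
      are elided):
      
          static int my_rnic_get_proto_stats(struct ib_device *ibdev,
                                             union rdma_protocol_stats *stats)
          {
                  memset(stats, 0, sizeof(*stats));
                  /* fill stats->iw.* from the RNIC's TCP/MPA counters */
                  return 0;
          }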
      Signed-off-by: Steve Wise <swise@opengridcomputing.com>
      Signed-off-by: Roland Dreier <rolandd@cisco.com>
    • RDMA/core: Add memory management extensions support · 00f7ec36
      By Steve Wise
      This patch adds support for the IB "base memory management
      extensions" (BMME) and the equivalent iWARP operations (which the
      iWARP verbs mandate all devices must implement).  The new
      operations are:
      
       - Allocate an ib_mr for use in fast register work requests.
      
       - Allocate/free physical buffer lists for use in fast register
         work requests.  This allows device drivers to allocate this
         memory as needed for use in posting send requests (e.g. via
         dma_alloc_coherent).
      
       - New send queue work requests:
         * send with remote invalidate
         * fast register memory region
         * local invalidate memory region
         * RDMA read with invalidate local memory region (iWARP only)
      
      Consumer interface details:
      
       - A new device capability flag IB_DEVICE_MEM_MGT_EXTENSIONS is added
         to indicate device support for these features.
      
       - New send work request opcodes IB_WR_FAST_REG_MR, IB_WR_LOCAL_INV,
         IB_WR_RDMA_READ_WITH_INV are added.
      
       - A new consumer API function, ib_alloc_mr(), is added to
         allocate fast register memory regions.
      
       - New consumer API functions, ib_alloc_fast_reg_page_list() and
         ib_free_fast_reg_page_list() are added to allocate and free
         device-specific memory for fast registration page lists.
      
       - A new consumer API function, ib_update_fast_reg_key(), is added
         to allow the key portion of the R_Key and L_Key of a fast
         registration MR to be updated.  Consumers call this if desired
         before posting an IB_WR_FAST_REG_MR work request.
      
      Consumers can use this as follows:
      
       - MR is allocated with ib_alloc_mr().
      
       - Page list memory is allocated with ib_alloc_fast_reg_page_list().
      
       - MR R_Key/L_Key "key" field is updated with ib_update_fast_reg_key().
      
       - MR made VALID and bound to a specific page list via
         ib_post_send(IB_WR_FAST_REG_MR)
      
       - MR made INVALID via ib_post_send(IB_WR_LOCAL_INV),
         ib_post_send(IB_WR_RDMA_READ_WITH_INV) or an incoming send with
         invalidate operation.
      
       - MR is deallocated with ib_dereg_mr()
      
       - Page lists are deallocated via ib_free_fast_reg_page_list().
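      
      In code, the flow above looks roughly like this (a hedged sketch;
      the IB_WR_FAST_REG_MR work-request fields are elided, and the key
      counter is hypothetical):
      
          struct ib_mr *mr = ib_alloc_mr(pd, max_pages);
          struct ib_fast_reg_page_list *pl =
                  ib_alloc_fast_reg_page_list(pd->device, max_pages);
      
          ib_update_fast_reg_key(mr, ++key); /* fresh 8-bit key */
          /* fill an IB_WR_FAST_REG_MR work request referencing pl,
           * then post it: */
          ret = ib_post_send(qp, &fr_wr, &bad_wr);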
      
      Applications can allocate a fast register MR once, and then can
      repeatedly bind the MR to different physical block lists (PBLs) via
      posting work requests to a send queue (SQ).  For each outstanding
      MR-to-PBL binding in the SQ pipe, a fast_reg_page_list needs to be
      allocated (the fast_reg_page_list is owned by the low-level driver
      from the consumer posting a work request until the request completes).
      Thus pipelining can be achieved while still allowing device-specific
      page_list processing.
      
      The 32-bit fast register memory key/STag is composed of a 24-bit
      index and an 8-bit key.  The application can change the key each
      time it fast registers, thus allowing more control over the peer's
      use of the key/STag (i.e. it can effectively be changed each time
      the rkey is rebound to a page list).
      Signed-off-by: Steve Wise <swise@opengridcomputing.com>
      Signed-off-by: Roland Dreier <rolandd@cisco.com>
    • RDMA: Improve include file coding style · 4deccd6d
      By Dotan Barak
      Remove subversion $Id lines and improve readability by fixing other
      coding style problems pointed out by checkpatch.pl.
      Signed-off-by: Dotan Barak <dotanba@gmail.com>
      Signed-off-by: Roland Dreier <rolandd@cisco.com>
  28. 10 Jun 2008 (1 commit)
    • IB/core: Remove IB_DEVICE_SEND_W_INV capability flag · 4c0283fc
      By Roland Dreier
      In 2.6.26, we added some support for send with invalidate work
      requests, including a device capability flag to indicate whether a
      device supports such requests.  However, the support was incomplete:
      the completion structure was not extended with a field for the key
      contained in incoming send with invalidate requests.
      
      Full support for memory management extensions (send with invalidate,
      local invalidate, fast register through a send queue, etc) is planned
      for 2.6.27.  Since send with invalidate is not very useful by itself,
      just remove the IB_DEVICE_SEND_W_INV bit before the 2.6.26 final
      release; we will add an IB_DEVICE_MEM_MGT_EXTENSIONS bit in 2.6.27,
      which makes things simpler for applications, since they will not have
      quite as confusing an array of fine-grained bits to check.
      Signed-off-by: Roland Dreier <rolandd@cisco.com>
  29. 29 Apr 2008 (1 commit)
    • IB: expand ib_umem_get() prototype · cb9fbc5c
      By Arthur Kepner
      Add a new parameter, dmasync, to the ib_umem_get() prototype.  Use dmasync = 1
      when mapping user-allocated CQs with ib_umem_get().
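      
      For example, a driver mapping a user-allocated CQ buffer would now
      pass the extra flag (a sketch; the surrounding names are
      illustrative):
      
          umem = ib_umem_get(context, cmd.buf_addr, cmd.size,
                             IB_ACCESS_LOCAL_WRITE, 1 /* dmasync */);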
      Signed-off-by: Arthur Kepner <akepner@sgi.com>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: Jesse Barnes <jbarnes@virtuousgeek.org>
      Cc: Jes Sorensen <jes@sgi.com>
      Cc: Randy Dunlap <randy.dunlap@oracle.com>
      Cc: Roland Dreier <rdreier@cisco.com>
      Cc: James Bottomley <James.Bottomley@HansenPartnership.com>
      Cc: David Miller <davem@davemloft.net>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Grant Grundler <grundler@parisc-linux.org>
      Cc: Michael Ellerman <michael@ellerman.id.au>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>