1. 15 Jan 2014 (1 commit)
    • IB/core: Ethernet L2 attributes in verbs/cm structures · dd5f03be
      Committed by Matan Barak
      This patch adds support for Ethernet L2 attributes in the
      verbs/cm/cma structures.
      
      When dealing with L2 Ethernet, we should use smac, dmac, vlan ID and
      priority in a manner similar to how the IB L2 (and the L4 PKEY)
      attributes are used.
      
      Thus, those attributes were added to the following structures:
      
      * ib_ah_attr - added dmac
      * ib_qp_attr - added smac and vlan_id, (sl remains vlan priority)
      * ib_wc - added smac, vlan_id
      * ib_sa_path_rec - added smac, dmac, vlan_id
      * cm_av - added smac and vlan_id
      
      For the path record structure, extra care was taken to avoid the new
      fields when packing it into wire format, so we don't break the IB CM
      and SA wire protocol.
      
      On the active side, the CM fills its internal structures from the
      path provided by the ULP.  We add code there to take the ETH L2
      attributes and place them into the CM Address Handle (struct cm_av).

      On the passive side, the CM fills its internal structures from the WC
      associated with the REQ message.  We add code there to take the ETH
      L2 attributes from the WC.
      
      When the HW driver provides the required ETH L2 attributes in the WC,
      it sets the IB_WC_WITH_SMAC and IB_WC_WITH_VLAN flags.  The IB core
      code checks for the presence of these flags, and in their absence
      does address resolution from the ib_init_ah_from_wc() helper
      function.
      
      ib_modify_qp_is_ok() is also updated to consider the link layer: some
      parameters are mandatory for the Ethernet link layer, while they are
      irrelevant for IB.  Vendor drivers are modified to support the new
      function signature.
      Signed-off-by: Matan Barak <matanb@mellanox.com>
      Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
      Signed-off-by: Roland Dreier <roland@purestorage.com>
      dd5f03be
  2. 21 Dec 2013 (7 commits)
  3. 17 Dec 2013 (1 commit)
  4. 16 Dec 2013 (1 commit)
    • RDMA/iwcm: Don't touch cm_id after deref in rem_ref · 6b59ba60
      Committed by Steve Wise
      rem_ref() calls iwcm_deref_id(), which will wake up any blockers on
      cm_id_priv->destroy_comp if the refcnt hits 0.  That will unblock
      someone in iw_destroy_cm_id() which will free the cmid.  If that
      happens before rem_ref() calls test_bit(IWCM_F_CALLBACK_DESTROY,
      &cm_id_priv->flags), then the test_bit() will touch freed memory.
      
      The fix is to read the bit first, then deref.  We should never be in
      iw_destroy_cm_id() with IWCM_F_CALLBACK_DESTROY set, and there is a
      BUG_ON() to make sure of that.
      Signed-off-by: Steve Wise <swise@opengridcomputing.com>
      Signed-off-by: Roland Dreier <roland@purestorage.com>
      6b59ba60
  5. 18 Nov 2013 (6 commits)
    • IB/core: Re-enable create_flow/destroy_flow uverbs · 69ad5da4
      Committed by Matan Barak
      This commit reverts commit 7afbddfa ("IB/core: Temporarily disable
      create_flow/destroy_flow uverbs").  Since the uverbs extensions
      functionality was experimental for v3.12, this patch re-enables the
      support for them and flow-steering for v3.13.
      Signed-off-by: Matan Barak <matanb@mellanox.com>
      Signed-off-by: Roland Dreier <roland@purestorage.com>
      69ad5da4
    • IB/core: extended command: an improved infrastructure for uverbs commands · f21519b2
      Committed by Yann Droneaud
      Commit 400dbc96 ("IB/core: Infrastructure for extensible uverbs
      commands") added an infrastructure for extensible uverbs commands
      while later commit 436f2ad0 ("IB/core: Export ib_create/destroy_flow
      through uverbs") exported ib_create_flow()/ib_destroy_flow() functions
      using this new infrastructure.
      
      According to commit 400dbc96, the purpose of this infrastructure is
      to support passing provider (e.g. hardware) specific buffers when
      userspace issues commands to the kernel, so that it would be possible
      to extend uverbs (e.g. core) buffers independently of the provider
      buffers.
      
      But the new kernel command function prototypes were not modified to
      take advantage of this extension. This issue was exposed by Roland
      Dreier in a previous review[1].
      
      So the following patch is an attempt at a revised extensible command
      infrastructure.
      
      This improved extensible command infrastructure distinguishes core
      (e.g. legacy) command/response buffers from provider (e.g. hardware)
      command/response buffers: each extended command implementing function
      is given a struct ib_udata to hold the core (e.g. uverbs) input and
      output buffers, and another struct ib_udata to hold the hw
      (e.g. provider) input and output buffers.

      Having those buffers identified separately makes it easier to grow
      one buffer to support extension without having to add code to guess
      the exact size of each command/response part.  This should make the
      extended functions more reliable.
      
      Additionally, instead of relying on the command identifier being
      greater than IB_USER_VERBS_CMD_THRESHOLD, the proposed infrastructure
      relies on unused bits in the command field: of the 32 bits provided
      by the command field, only 6 bits are really needed to encode the
      identifiers of the commands currently supported by the kernel.  (Even
      using only 6 bits leaves room for about 23 new commands.)
      
      So this patch makes use of some high order bits in command field to
      store flags, leaving enough room for more command identifiers than one
      will ever need (eg. 256).
      
      The new flags are used to specify if the command should be processed
      as an extended one or a legacy one. While designing the new command
      format, care was taken to make usage of flags itself extensible.
      
      Using the high order bits of the command field ensures that a newer
      libibverbs on an older kernel will properly fail when trying to call
      extended commands.  On the other hand, an older libibverbs on a newer
      kernel will never be able to issue calls to extended commands.
      
      The extended command header includes the optional response pointer so
      that output buffer length and output buffer pointer are located
      together in the command, allowing proper parameters checking. This
      should make implementing functions easier and safer.
      
      Additionally, the extended header ensures 64-bit alignment, while
      making all sizes multiples of 8 bytes, extending the maximum buffer
      size:
      
                                   legacy      extended
      
         Maximum command buffer:  256KBytes   1024KBytes (512KBytes + 512KBytes)
        Maximum response buffer:  256KBytes   1024KBytes (512KBytes + 512KBytes)
      
      For the purpose of doing proper buffer size accounting, the header
      sizes are no longer taken into account in "in_words".
      
      One oddity of the current extensible infrastructure, reading the
      "legacy" command header twice, is fixed by removing the "legacy"
      command header from the extended command header: the two are
      processed as different parts of the command, memory is read once,
      and information is not duplicated.  This makes it clear that this is
      an extended command scheme, not a different command scheme.
      
      The proposed scheme will format input (command) and output (response)
      buffers this way:
      
      - command:
      
        legacy header +
        extended header +
        command data (core + hw):
      
          +----------------------------------------+
          | flags     |   00      00    |  command |
          |        in_words    |   out_words       |
          +----------------------------------------+
          |                 response               |
          |                 response               |
          | provider_in_words | provider_out_words |
          |                 padding                |
          +----------------------------------------+
          |                                        |
          .              <uverbs input>            .
          .              (in_words * 8)            .
          |                                        |
          +----------------------------------------+
          |                                        |
          .             <provider input>           .
          .          (provider_in_words * 8)       .
          |                                        |
          +----------------------------------------+
      
      - response, if present:
      
          +----------------------------------------+
          |                                        |
          .          <uverbs output space>         .
          .             (out_words * 8)            .
          |                                        |
          +----------------------------------------+
          |                                        |
          .         <provider output space>        .
          .         (provider_out_words * 8)       .
          |                                        |
          +----------------------------------------+
      
      The overall design is to ensure that the extensible infrastructure is
      itself extensible, while being more reliable thanks to more input and
      bounds checking.
      
      Note:
      
      The unused field in the extended header would be a perfect candidate
      to hold the command "comp_mask" (e.g. a bit field used to handle
      compatibility).  This was suggested by Roland Dreier in a previous
      review [2].  But the "comp_mask" field is likely to be present in the
      uverbs input and/or provider input, and likewise for the response, as
      noted by Matan Barak [3], so it doesn't make sense to put "comp_mask"
      in the header.
      
      [1]:
      http://marc.info/?i=CAL1RGDWxmM17W2o_era24A-TTDeKyoL6u3NRu_=t_dhV_ZA9MA@mail.gmail.com
      
      [2]:
      http://marc.info/?i=CAL1RGDXJtrc849M6_XNZT5xO1+ybKtLWGq6yg6LhoSsKpsmkYA@mail.gmail.com
      
      [3]:
      http://marc.info/?i=525C1149.6000701@mellanox.com
      Signed-off-by: Yann Droneaud <ydroneaud@opteya.com>
      Link: http://marc.info/?i=cover.1383773832.git.ydroneaud@opteya.com
      
      [ Convert "ret ? ret : 0" to the equivalent "ret".  - Roland ]
      Signed-off-by: Roland Dreier <roland@purestorage.com>
      f21519b2
    • IB/core: Remove ib_uverbs_flow_spec structure from userspace · 2490f20b
      Committed by Yann Droneaud
      The structure holding any types of flow_spec is of no use to
      userspace.  It would be wrong for userspace to do:
      
        struct ib_uverbs_flow_spec flow_spec;
      
        flow_spec.type = IB_FLOW_SPEC_TCP;
        flow_spec.size = sizeof(flow_spec);
      
      Instead, userspace should use the dedicated flow_spec structure for
        - Ethernet : struct ib_uverbs_flow_spec_eth,
        - IPv4     : struct ib_uverbs_flow_spec_ipv4,
        - TCP/UDP  : struct ib_uverbs_flow_spec_tcp_udp.
      
      In other words, struct ib_uverbs_flow_spec is a "virtual" data
      structure that can only be used by the kernel, as an alias for the
      others.
      Signed-off-by: Yann Droneaud <ydroneaud@opteya.com>
      Link: http://marc.info/?i=cover.1383773832.git.ydroneaud@opteya.com
      Signed-off-by: Roland Dreier <roland@purestorage.com>
      2490f20b
    • IB/core: Make uverbs flow structure use names like verbs ones · b68c9560
      Committed by Yann Droneaud
      This patch adds a "flow" prefix to most of the data structures added
      as part of commit 436f2ad0 ("IB/core: Export ib_create/destroy_flow
      through uverbs") to keep those names in sync with the data structures
      added in commit 319a441d ("IB/core: Add receive flow steering
      support").
      
      It's just a matter of translating 'ib_flow' to 'ib_uverbs_flow'.
      Signed-off-by: Yann Droneaud <ydroneaud@opteya.com>
      Link: http://marc.info/?i=cover.1383773832.git.ydroneaud@opteya.com
      Signed-off-by: Roland Dreier <roland@purestorage.com>
      b68c9560
    • IB/core: Rename 'flow' structs to match other uverbs structs · d82693da
      Committed by Yann Droneaud
      Commit 436f2ad0 ("IB/core: Export ib_create/destroy_flow through
      uverbs") added public data structures to support receive flow
      steering.  The new structs do not follow the 'uverbs' pattern: they
      lack the common prefix 'ib_uverbs'.

      This patch replaces the ib_kern prefix with ib_uverbs.
      Signed-off-by: Yann Droneaud <ydroneaud@opteya.com>
      Link: http://marc.info/?i=cover.1383773832.git.ydroneaud@opteya.com
      Signed-off-by: Roland Dreier <roland@purestorage.com>
      d82693da
    • IB/core: clarify overflow/underflow checks on ib_create/destroy_flow · f8848274
      Committed by Matan Barak
      This patch fixes the following issues:

      1. Remove unneeded checks.

      2. Remove the fixed size from flow_attr.size, thus simplifying the
         checks.

      3. Remove a 32-bit hole on 64-bit systems with strict alignment in
         struct ib_kern_flow_att by adding a reserved field.
      Signed-off-by: Matan Barak <matanb@mellanox.com>
      Signed-off-by: Roland Dreier <roland@purestorage.com>
      f8848274
  6. 17 Nov 2013 (2 commits)
  7. 16 Nov 2013 (1 commit)
  8. 12 Nov 2013 (2 commits)
    • RDMA/cma: Remove unused argument and minor dead code · 352b9056
      Committed by Michal Nazarewicz
      The dev variable is never assigned after being initialised.
      Signed-off-by: Michal Nazarewicz <mina86@mina86.com>
      Signed-off-by: Roland Dreier <roland@purestorage.com>
      352b9056
    • RDMA/ucma: Discard events for IDs not yet claimed by user space · c6b21824
      Committed by Sean Hefty
      Problem reported by Avneesh Pant <avneesh.pant@oracle.com>:
      
          It looks like we are triggering a bug in RDMA CM/UCM interaction.
          The bug specifically hits when we have an incoming connection
          request and the connecting process dies BEFORE the passive end of
          the connection can process the request i.e. it does not call
          rdma_get_cm_event() to retrieve the initial connection event.  We
          were able to triage this further and have some additional
          information now.
      
          In the example below when P1 dies after issuing a connect request
          as the CM id is being destroyed all outstanding connects (to P2)
          are sent a reject message. We see this reject message being
          received on the passive end and the appropriate CM ID created for
          the initial connection message being retrieved in cm_match_req().
          The problem is in the ucma_event_handler() code when this reject
          message is delivered to it and the initial connect message itself
          HAS NOT been delivered to the client. In fact the client has not
          even called rdma_cm_get_event() at this stage so we haven't
          allocated a new ctx in ucma_get_event() and updated the new
          connection CM_ID to point to the new UCMA context.
      
          This results in the reject message not being dropped in
          ucma_event_handler() for the new connection request as the
          (if (!ctx->uid)) block is skipped since the ctx it refers to is
          the listen CM id context which does have a valid UID associated
          with it (I believe the new CMID for the connection initially
          uses the listen CMID -> context when it is created in
          cma_new_conn_id).  Thus the assumption that new events for a
          connection can get dropped in ucma_event_handler() is incorrect
          IF the initial connect request has not been retrieved in the
          first place.  We end up getting a CM Reject event on the listen
          CM ID and our upper layer code asserts (in fact this event does
          not even have the listen_id set, as that only gets set up by
          librdmacm for connect requests).
      
      The solution is to verify that the cm_id being reported in the event
      is the same as the cm_id referenced by the ucma context.  A mismatch
      indicates that the ucma context corresponds to the listen.  This fix
      was validated by using a modified version of librdmacm that was able
      to verify the problem and see that the reject message was indeed
      dropped after this patch was applied.
      Signed-off-by: Sean Hefty <sean.hefty@intel.com>
      Signed-off-by: Roland Dreier <roland@purestorage.com>
      c6b21824
  9. 09 Nov 2013 (5 commits)
    • IB/core: Add Cisco usNIC rdma node and transport types · 180771a3
      Committed by Upinder Malhi (umalhi)
      This patch adds a new rdma node type and a new rdma transport, plus
      supporting code used by Cisco's low latency driver called usNIC.
      usNIC uses its own transport, distinct from IB and iWARP.
      Signed-off-by: Upinder Malhi <umalhi@cisco.com>
      Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
      Signed-off-by: Roland Dreier <roland@purestorage.com>
      180771a3
    • IB/netlink: Remove superfluous RDMA_NL_GET_OP() masking · 5476781b
      Committed by Mathias Krause
      'op' is the 'type' already masked by RDMA_NL_GET_OP().  No need to
      mask it again.
      Signed-off-by: Mathias Krause <minipli@googlemail.com>
      Reviewed-by: Yann Droneaud <ydroneaud@opteya.com>
      Acked-by: Sean Hefty <sean.hefty@intel.com>
      Signed-off-by: Roland Dreier <roland@purestorage.com>
      5476781b
    • IB/core: Pass imm_data from ib_uverbs_send_wr to ib_send_wr correctly · 6b7d103c
      Committed by Latchesar Ionkov
      Currently, we don't copy the immediate data from the userspace struct
      to the kernel one when UD messages are being sent.
      
      This patch makes sure that the immediate data is set correctly.
      Signed-off-by: Latchesar Ionkov <lucho@ionkov.net>
      Signed-off-by: Roland Dreier <roland@purestorage.com>
      6b7d103c
    • IB/cma: Check for GID on listening device first · be9130cc
      Committed by Doug Ledford
      As a simple optimization that should speed up the vast majority of
      connect attempts on IB devices, when we are searching for the GID of
      an incoming connection in the cached GID lists of devices, search the
      device that received the incoming connection request first.  If we
      don't find it there, then move on to other devices.
      
      This reduces the time to perform 10,000 connections considerably.
      Prior to this patch, a bad run of cmtime would look like this:
      
      connect      :    12399.26   12351.10    8609.00    1239.93
      
      With this patch, it looks more like this:
      
      connect      :     5864.86    5799.80    8876.00     586.49
      Signed-off-by: Doug Ledford <dledford@redhat.com>
      Signed-off-by: Roland Dreier <roland@purestorage.com>
      be9130cc
    • IB/cma: Use cached gids · 29f27e84
      Committed by Doug Ledford
      The cma_acquire_dev function was changed by commit 3c86aa70
      ("RDMA/cm: Add RDMA CM support for IBoE devices") to use find_gid_port()
      because multiport devices might have either IB or IBoE formatted gids.
      The old function assumed that all ports on the same device used the
      same GID format.
      
      However, when it was changed to use find_gid_port(), we inadvertently
      lost usage of the GID cache.  This turned out to be a very costly
      change.  In our testing, each iteration through each index of the GID
      table takes roughly 35us.  When you have multiple devices in a system,
      and the GID you are looking for is on one of the later devices, the
      code loops through all of the GID indexes on all of the early devices
      before it finally succeeds on the target device.  This pathological
      search behavior combined with 35us per GID table index retrieval
      results in results such as the following from the cmtime application
      that's part of the latest librdmacm git repo:
      
      ib1:
      step              total ms     max ms     min us  us / conn
      create id    :       29.42       0.04       1.00       2.94
      bind addr    :   186705.66      19.00   18556.00   18670.57
      resolve addr :       41.93       9.68     619.00       4.19
      resolve route:      486.93       0.48     101.00      48.69
      create qp    :     4021.95       6.18     330.00     402.20
      connect      :    68350.39   68588.17   24632.00    6835.04
      disconnect   :     1460.43     252.65 -1862269.00     146.04
      destroy      :       41.16       0.04       2.00       4.12
      
      ib0:
      step              total ms     max ms     min us  us / conn
      create id    :       28.61       0.68       1.00       2.86
      bind addr    :     2178.86       2.95     201.00     217.89
      resolve addr :       51.26      16.85     845.00       5.13
      resolve route:      620.08       0.43      92.00      62.01
      create qp    :     3344.40       6.36     273.00     334.44
      connect      :     6435.99    6368.53    7844.00     643.60
      disconnect   :     5095.38     321.90     757.00     509.54
      destroy      :       37.13       0.02       2.00       3.71
      
      Clearly, both the bind address and connect operations suffer
      a huge penalty for being anything other than the default
      GID on the first port in the system.
      
      After applying this patch, the numbers now look like this:
      
      ib1:
      step              total ms     max ms     min us  us / conn
      create id    :       30.15       0.03       1.00       3.01
      bind addr    :       80.27       0.04       7.00       8.03
      resolve addr :       43.02      13.53     589.00       4.30
      resolve route:      482.90       0.45     100.00      48.29
      create qp    :     3986.55       5.80     330.00     398.66
      connect      :     7141.53    7051.29    5005.00     714.15
      disconnect   :     5038.85     193.63     918.00     503.88
      destroy      :       37.02       0.04       2.00       3.70
      
      ib0:
      step              total ms     max ms     min us  us / conn
      create id    :       34.27       0.05       1.00       3.43
      bind addr    :       26.45       0.04       1.00       2.64
      resolve addr :       38.25      10.54     760.00       3.82
      resolve route:      604.79       0.43      97.00      60.48
      create qp    :     3314.95       6.34     273.00     331.49
      connect      :    12399.26   12351.10    8609.00    1239.93
      disconnect   :     5096.76     270.72    1015.00     509.68
      destroy      :       37.10       0.03       2.00       3.71
      
      It's worth noting that we still suffer a bit of a penalty on
      connecting to the wrong device, but the penalty is much less than it
      used to be.  Follow-on patches deal with this penalty.

      Many thanks to Neil Horman for helping to track down the slow
      function, which allowed us to see that the original patch mentioned
      above backed out cache usage, and to identify just how much that
      impacted the system.
      Signed-off-by: Doug Ledford <dledford@redhat.com>
      Signed-off-by: Roland Dreier <roland@purestorage.com>
      29f27e84
  10. 08 Nov 2013 (1 commit)
  11. 22 Oct 2013 (1 commit)
  12. 01 Oct 2013 (1 commit)
  13. 03 Sep 2013 (1 commit)
  14. 29 Aug 2013 (3 commits)
    • IB/core: Export ib_create/destroy_flow through uverbs · 436f2ad0
      Committed by Hadar Hen Zion
      Implement ib_uverbs_create_flow() and ib_uverbs_destroy_flow() to
      support flow steering for user space applications.
      Signed-off-by: Hadar Hen Zion <hadarh@mellanox.com>
      Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
      Signed-off-by: Roland Dreier <roland@purestorage.com>
      436f2ad0
    • IB/core: Infrastructure for extensible uverbs commands · 400dbc96
      Committed by Igor Ivanov
      Add infrastructure to support extended uverbs capabilities in a
      forward/backward compatible manner.  Uverbs command opcodes which are
      based on the verbs extensions approach should be greater than or
      equal to IB_USER_VERBS_CMD_THRESHOLD.  They have a new header format
      and are processed a bit differently.
      
      Whenever a specific IB_USER_VERBS_CMD_XXX is extended, which
      practically means it needs additional arguments, we will be able to
      add them without creating a completely new IB_USER_VERBS_CMD_YYY
      command or bumping the uverbs ABI version.
      
      This patch by itself doesn't provide the whole scheme, which also
      depends on adding a comp_mask field to each extended uverbs command
      struct.
      
      The new header framework allows for future extension of the CMD
      arguments (ib_uverbs_cmd_hdr.in_words, ib_uverbs_cmd_hdr.out_words)
      for an existing command (that is, a command that supports the new
      uverbs command header format suggested in this patch) without
      bumping the ABI version, while maintaining backward and forward
      compatibility with new and old libibverbs versions.
      
      In a uverbs command we pass both uverbs arguments and provider
      arguments.  We split ib_uverbs_cmd_hdr.in_words so that it now
      carries only the uverbs input argument struct size, and add
      ib_uverbs_cmd_hdr.provider_in_words to carry the provider input
      argument size.  The same goes for the response (the uverbs CMD
      output argument).
      
      For example, take the create_cq call and the mlx4_ib provider:

      The uverbs layer gets libibverbs's struct ibv_create_cq (named
      struct ib_uverbs_create_cq in the kernel), mlx4_ib gets libmlx4's
      struct mlx4_create_cq (which includes struct ibv_create_cq and is
      named struct mlx4_ib_create_cq in the kernel), and
      in_words = sizeof(mlx4_create_cq)/4.

      Thus ib_uverbs_cmd_hdr.in_words carries both the uverbs and the
      mlx4_ib input argument sizes, where uverbs assumes it knows the size
      of its own input argument, struct ibv_create_cq.
      
      Now, if we wish to add a field to struct ibv_create_cq, we can add a
      comp_mask field to the struct, which is basically a bit field
      indicating which fields exist in the struct (as done for the
      libibverbs API extension).  But we need a way to tell the total size
      of the struct rather than assume a predefined size (since we may get
      different struct sizes from different libibverbs versions), so that
      we know where the provider input argument (struct mlx4_create_cq)
      begins.  The same goes for extending the provider struct
      mlx4_create_cq.  Thus we split ib_uverbs_cmd_hdr.in_words, which now
      carries only the uverbs input argument struct size, and add
      ib_uverbs_cmd_hdr.provider_in_words to carry the provider (mlx4_ib)
      input argument size.
      Signed-off-by: Igor Ivanov <Igor.Ivanov@itseez.com>
      Signed-off-by: Hadar Hen Zion <hadarh@mellanox.com>
      Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
      Signed-off-by: Roland Dreier <roland@purestorage.com>
      400dbc96
    • IB/core: Add receive flow steering support · 319a441d
      Committed by Hadar Hen Zion
      The RDMA stack allows for applications to create IB_QPT_RAW_PACKET
      QPs, which receive plain Ethernet packets, specifically packets that
      don't carry any QPN to be matched by the receiving side.  Applications
      using these QPs must be provided with a method to program some
      steering rule with the HW so packets arriving at the local port can be
      routed to them.
      
      This patch adds ib_create_flow(), which allows providing a flow
      specification for a QP.  When there's a match between the
      specification and a received packet, the packet is forwarded to that
      QP, in the same way one uses ib_attach_multicast() for IB UD
      multicast handling.
      
      Flow specifications are provided as instances of struct ib_flow_spec_yyy,
      which describe L2, L3 and L4 headers.  Currently specs for Ethernet, IPv4,
      TCP and UDP are defined.  Flow specs are made of values and masks.
      
      The input to ib_create_flow() is a struct ib_flow_attr, which contains
      a few mandatory control elements and optional flow specs.
      
          struct ib_flow_attr {
                  enum ib_flow_attr_type type;
                  u16      size;
                  u16      priority;
                  u32      flags;
                  u8       num_of_specs;
                  u8       port;
                  /* Following are the optional layers according to user request
                   * struct ib_flow_spec_yyy
                   * struct ib_flow_spec_zzz
                   */
          };
      
      As these specs are eventually coming from user space, they are defined and
      used in a way which allows adding new spec types without kernel/user ABI
      change, just with a little API enhancement which defines the newly added spec.
      
      The flow spec structures are defined with TLV (Type-Length-Value)
      entries, which allows calling ib_create_flow() with a list of variable
      length of optional specs.
      
      For the actual processing of ib_flow_attr, the driver uses the
      num_of_specs and size mandatory fields along with the TLV nature of
      the specs.
      
      Steering rule processing order is according to the domain over which
      the rule is set and the rule priority.  All rules set by user space
      applications fall into the IB_FLOW_DOMAIN_USER domain; other domains
      could be used by a future IPoIB RFS and Ethtool flow-steering
      interface implementation.  A lower numerical value for the priority
      field means higher priority.
      
      The returned value from ib_create_flow() is a struct ib_flow, which
      contains a database pointer (handle) provided by the HW driver to be
      used when calling ib_destroy_flow().
      
      Applications that offload TCP/IP traffic can also be written over IB
      UD QPs.  The ib_create_flow() / ib_destroy_flow() API is designed to
      support UD QPs too.  A HW driver can set IB_DEVICE_MANAGED_FLOW_STEERING
      to denote support for flow steering.
      
      The ib_flow_attr enum type supports usage of flow steering for promiscuous
      and sniffer purposes:
      
          IB_FLOW_ATTR_NORMAL - "regular" rule, steering according to rule specification
      
          IB_FLOW_ATTR_ALL_DEFAULT - default unicast and multicast rule, receive
              all Ethernet traffic which isn't steered to any QP
      
          IB_FLOW_ATTR_MC_DEFAULT - same as IB_FLOW_ATTR_ALL_DEFAULT but only for multicast
      
          IB_FLOW_ATTR_SNIFFER - sniffer rule, receive all port traffic
      
      The ALL_DEFAULT and MC_DEFAULT rule options are valid only for the
      Ethernet link type.
      Signed-off-by: Hadar Hen Zion <hadarh@mellanox.com>
      Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
      Signed-off-by: Roland Dreier <roland@purestorage.com>
      319a441d
  15. 14 Aug 2013 (2 commits)
  16. 13 Aug 2013 (1 commit)
  17. 01 Aug 2013 (1 commit)
    • IB/core: Create QP1 using the pkey index which contains the default pkey · ef5ed416
      Committed by Jack Morgenstein
      Currently, QP1 is created using pkey index 0. This patch simply looks
      for the index containing the default pkey, rather than hard-coding
      pkey index 0.
      
      This change will have no effect in native mode, since QP0 and QP1 are
      created before the SM configures the port, so pkey table will still be
      the default table defined by the IB Spec, in C10-123: "If non-volatile
      storage is not used to hold P_Key Table contents, then if a PM
      (Partition Manager) is not present, and prior to PM initialization of
      the P_Key Table, the P_Key Table must act as if it contains a single
      valid entry, at P_Key_ix = 0, containing the default partition
      key. All other entries in the P_Key Table must be invalid."
      
      Thus, in the native mode case, the driver will find the default pkey
      at index 0 (so it will be no different than the hard-coding).
      
      However, in SR-IOV mode, for VFs, the pkey table may be
      paravirtualized, so that the VF's pkey index zero may not necessarily
      be mapped to the real pkey index 0. For VFs, therefore, it is
      important to find the virtual index which maps to the real default
      pkey.
      
      This commit does the following for QP1 creation:
      
      1. Find the pkey index containing the default pkey, and use that index
         if found.  ib_find_pkey() returns the index of the
         limited-membership default pkey (0x7FFF) if the full-member default
         pkey is not in the table.
      
      2. If neither form of the default pkey is found, use pkey index 0
         (previous behavior).
      Signed-off-by: Jack Morgenstein <jackm@dev.mellanox.co.il>
      Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
      Reviewed-by: Sean Hefty <sean.hefty@intel.com>
      Signed-off-by: Roland Dreier <roland@purestorage.com>
      ef5ed416
  18. 31 Jul 2013 (3 commits)