1. 15 Jan, 2014 (1 commit)
    • IB/core: Ethernet L2 attributes in verbs/cm structures · dd5f03be
      Committed by Matan Barak
      This patch adds support for Ethernet L2 attributes in the
      verbs/cm/cma structures.
      
      When dealing with L2 Ethernet, we should use smac, dmac, vlan ID and priority
      in a similar manner to the way the IB L2 (and the L4 PKEY) attributes are used.
      
      Thus, those attributes were added to the following structures:
      
      * ib_ah_attr - added dmac
      * ib_qp_attr - added smac and vlan_id (sl remains the vlan priority)
      * ib_wc - added smac, vlan_id
      * ib_sa_path_rec - added smac, dmac, vlan_id
      * cm_av - added smac and vlan_id
      
      For the path record structure, extra care was taken to avoid the new
      fields when packing it into wire format, so we don't break the IB CM
      and SA wire protocol.
      
      On the active side, the CM fills its internal structures from the
      path provided by the ULP.  We add code there that takes the ETH L2
      attributes and places them into the CM Address Handle (struct cm_av).
      
      On the passive side, the CM fills its internal structures from the WC
      associated with the REQ message.  We add code there that takes the
      ETH L2 attributes from the WC.
      
      When a HW driver provides the required ETH L2 attributes in the WC,
      it sets the IB_WC_WITH_SMAC and IB_WC_WITH_VLAN flags. The IB core
      code checks for the presence of these flags, and in their absence does
      address resolution from the ib_init_ah_from_wc() helper function.
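      
      As an illustration, a minimal sketch of how a consumer of these WC fields
      might behave (the flag and field names are the ones introduced by this
      patch; the helper itself is illustrative, not the upstream implementation):
      
        #include <rdma/ib_verbs.h>
        #include <linux/etherdevice.h>
        
        /* Sketch only: copy the reported L2 info when the driver provided it,
         * otherwise leave it to the core's own address resolution. */
        static void sketch_fill_ah_l2(struct ib_ah_attr *ah_attr,
                                      const struct ib_wc *wc, u16 *vlan_id)
        {
                if ((wc->wc_flags & IB_WC_WITH_SMAC) &&
                    (wc->wc_flags & IB_WC_WITH_VLAN)) {
                        /* the sender's source MAC becomes the dmac of the reply AH */
                        memcpy(ah_attr->dmac, wc->smac, ETH_ALEN);
                        *vlan_id = wc->vlan_id;
                } else {
                        /* no L2 info from the HW driver: the core resolves the
                         * address instead, as ib_init_ah_from_wc() does */
                        *vlan_id = 0xffff;      /* illustrative "no VLAN" sentinel */
                }
        }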
      
      ib_modify_qp_is_ok() is also updated to consider the link layer. Some
      parameters are mandatory for the Ethernet link layer, while they are
      irrelevant for IB.  Vendor drivers are modified to support the new
      function signature.
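      
      For reference, a sketch of what the updated helper's signature plausibly
      looks like with the added link-layer argument (parameter names are
      illustrative):
      
        /* callers pass the port's link layer, e.g. obtained via
         * rdma_port_get_link_layer(), so Ethernet-only attributes are
         * required only when they actually apply */
        int ib_modify_qp_is_ok(enum ib_qp_state cur_state, enum ib_qp_state next_state,
                               enum ib_qp_type type, enum ib_qp_attr_mask mask,
                               enum rdma_link_layer ll);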
      Signed-off-by: Matan Barak <matanb@mellanox.com>
      Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
      Signed-off-by: Roland Dreier <roland@purestorage.com>
  2. 18 Nov, 2013 (2 commits)
    • IB/core: Re-enable create_flow/destroy_flow uverbs · 69ad5da4
      Committed by Matan Barak
      This commit reverts commit 7afbddfa ("IB/core: Temporarily disable
      create_flow/destroy_flow uverbs").  Since the uverbs extension
      functionality was experimental in v3.12, this patch re-enables
      support for it, along with flow steering, for v3.13.
      Signed-off-by: Matan Barak <matanb@mellanox.com>
      Signed-off-by: Roland Dreier <roland@purestorage.com>
    • IB/core: extended command: an improved infrastructure for uverbs commands · f21519b2
      Committed by Yann Droneaud
      Commit 400dbc96 ("IB/core: Infrastructure for extensible uverbs
      commands") added an infrastructure for extensible uverbs commands,
      while a later commit 436f2ad0 ("IB/core: Export ib_create/destroy_flow
      through uverbs") exported the ib_create_flow()/ib_destroy_flow()
      functions using this new infrastructure.
      
      According to commit 400dbc96, the purpose of this infrastructure is
      to support passing provider (e.g. hardware) specific buffers when
      userspace issues commands to the kernel, so that uverbs (i.e. core)
      buffers can be extended independently of the provider buffers.
      
      But the new kernel command function prototypes were not modified to
      take advantage of this extension. This issue was exposed by Roland
      Dreier in a previous review[1].
      
      So the following patch is an attempt at a revised extensible command
      infrastructure.
      
      This improved extensible command infrastructure distinguishes the core
      (i.e. legacy) command/response buffers from the provider
      (i.e. hardware) command/response buffers: each extended command
      implementing function is given a struct ib_udata to hold the core
      (i.e. uverbs) input and output buffers, and another struct ib_udata to
      hold the hw (i.e. provider) input and output buffers.
      
      Having those buffers identified separately makes it easier to grow
      one buffer to support an extension without having to add code that
      guesses the exact size of each command/response part. This should make
      the extended functions more reliable.
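      
      As an illustration, an extended command handler then takes two struct
      ib_udata arguments, roughly like this (a sketch consistent with the
      description above, not necessarily the exact merged prototype):
      
        /* ucore carries the core (uverbs) input/output buffers,
         * uhw carries the provider (hw) input/output buffers */
        int ib_uverbs_ex_create_flow(struct ib_uverbs_file *file,
                                     struct ib_udata *ucore,
                                     struct ib_udata *uhw);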
      
      Additionally, instead of relying on the command identifier being greater
      than IB_USER_VERBS_CMD_THRESHOLD, the proposed infrastructure relies on
      unused bits in the command field: of the 32 bits provided by the command
      field, only 6 bits are really needed to encode the identifiers of the
      commands currently supported by the kernel. (Even using only 6 bits
      leaves room for about 23 new commands.)
      
      So this patch makes use of some high-order bits in the command field to
      store flags, leaving enough room for more command identifiers than one
      will ever need (e.g. 256).
      
      The new flags specify whether the command should be processed as an
      extended one or a legacy one. While designing the new command format,
      care was taken to make the use of flags itself extensible.
      
      Using the high-order bits of the command field ensures that a newer
      libibverbs on an older kernel will properly fail when trying to call
      extended commands. On the other hand, an older libibverbs on a newer
      kernel will never be able to issue calls to extended commands.
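      
      A sketch of the command-field layout this describes (the exact names and
      mask values are illustrative; the point is low-order bits for the
      identifier, high-order bits for flags, with one flag marking an extended
      command):
      
        #define IB_USER_VERBS_CMD_COMMAND_MASK  0x000000ffu  /* room for 256 command ids */
        #define IB_USER_VERBS_CMD_FLAGS_MASK    0xff000000u  /* high-order flag bits */
        #define IB_USER_VERBS_CMD_FLAGS_SHIFT   24
        #define IB_USER_VERBS_CMD_FLAG_EXTENDED 0x80         /* process as extended */
        
        u32 command = hdr.command & IB_USER_VERBS_CMD_COMMAND_MASK;
        u32 flags   = (hdr.command & IB_USER_VERBS_CMD_FLAGS_MASK) >>
                      IB_USER_VERBS_CMD_FLAGS_SHIFT;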
      
      The extended command header includes the optional response pointer so
      that the output buffer length and output buffer pointer are located
      together in the command, allowing proper parameter checking. This
      should make the implementing functions easier to write and safer.
      
      Additionally, the extended header ensures 64-bit alignment, while making
      all sizes multiples of 8 bytes, extending the maximum buffer size:
      
                                   legacy      extended
      
         Maximum command buffer:  256KBytes   1024KBytes (512KBytes + 512KBytes)
        Maximum response buffer:  256KBytes   1024KBytes (512KBytes + 512KBytes)
      
      For the purpose of proper buffer size accounting, the header sizes are
      no longer taken into account in "in_words".
      
      One of the oddities of the current extensible infrastructure, reading
      the "legacy" command header twice, is fixed by removing the "legacy"
      command header from the extended command header: the two are processed
      as different parts of the command, memory is read once, and information
      is not duplicated. This makes it clear that this is an extended command
      scheme and not a different command scheme.
      
      The proposed scheme will format input (command) and output (response)
      buffers this way:
      
      - command:
      
        legacy header +
        extended header +
        command data (core + hw):
      
          +----------------------------------------+
          | flags     |   00      00    |  command |
          |        in_words    |   out_words       |
          +----------------------------------------+
          |                 response               |
          |                 response               |
          | provider_in_words | provider_out_words |
          |                 padding                |
          +----------------------------------------+
          |                                        |
          .              <uverbs input>            .
          .              (in_words * 8)            .
          |                                        |
          +----------------------------------------+
          |                                        |
          .             <provider input>           .
          .          (provider_in_words * 8)       .
          |                                        |
          +----------------------------------------+
      
      - response, if present:
      
          +----------------------------------------+
          |                                        |
          .          <uverbs output space>         .
          .             (out_words * 8)            .
          |                                        |
          +----------------------------------------+
          |                                        |
          .         <provider output space>        .
          .         (provider_out_words * 8)       .
          |                                        |
          +----------------------------------------+
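      
      Expressed as C structures, the two headers sketched in the command
      diagram above look roughly like this (the legacy header is the
      pre-existing one; the field shown as "padding" is named here purely for
      illustration):
      
        struct ib_uverbs_cmd_hdr {              /* legacy header, read first */
                __u32 command;                  /* low bits: id, high bits: flags */
                __u16 in_words;                 /* uverbs input size, 8-byte units */
                __u16 out_words;                /* uverbs output size, 8-byte units */
        };
        
        struct ib_uverbs_ex_cmd_hdr {           /* extended header, read next */
                __u64 response;                 /* optional response buffer pointer */
                __u16 provider_in_words;        /* provider (hw) input, 8-byte units */
                __u16 provider_out_words;       /* provider (hw) output, 8-byte units */
                __u32 padding;                  /* unused; keeps the header 64-bit aligned */
        };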
      
      The overall design is to ensure that the extensible infrastructure is
      itself extensible, while being more reliable thanks to more input and
      bounds checking.
      
      Note:
      
      The unused field in the extended header would be a perfect candidate to
      hold the command "comp_mask" (i.e. a bit field used to handle
      compatibility).  This was suggested by Roland Dreier in a previous
      review[2].  But the "comp_mask" field is likely to be present in the
      uverbs input and/or provider input, and likewise for the response, as
      noted by Matan Barak[3], so it doesn't make sense to put "comp_mask" in
      the header.
      
      [1]:
      http://marc.info/?i=CAL1RGDWxmM17W2o_era24A-TTDeKyoL6u3NRu_=t_dhV_ZA9MA@mail.gmail.com
      
      [2]:
      http://marc.info/?i=CAL1RGDXJtrc849M6_XNZT5xO1+ybKtLWGq6yg6LhoSsKpsmkYA@mail.gmail.com
      
      [3]:
      http://marc.info/?i=525C1149.6000701@mellanox.com
      
      Signed-off-by: Yann Droneaud <ydroneaud@opteya.com>
      Link: http://marc.info/?i=cover.1383773832.git.ydroneaud@opteya.com
      
      [ Convert "ret ? ret : 0" to the equivalent "ret".  - Roland ]
      Signed-off-by: Roland Dreier <roland@purestorage.com>
  3. 16 Nov, 2013 (2 commits)
  4. 08 Nov, 2013 (1 commit)
  5. 05 Nov, 2013 (1 commit)
    • mlx4: Structures and init/teardown for VF resource quotas · 5a0d0a61
      Committed by Jack Morgenstein
      This is step #1 for implementing SRIOV resource quotas for VFs.
      
      Quotas are implemented per resource type for VFs and the PF, to prevent
      any entity from simply grabbing all the resources for itself and leaving
      the other entities unable to obtain such resources.
      
      Resources which are allocated using quotas:  QPs, CQs, SRQs, MPTs, MTTs, MAC,
                                                   VLAN, and Counters.
      
      The quota system works as follows:
      Each entity (VF or PF) is given a max number of a given resource (its quota),
      and a guaranteed minimum number for each resource (starvation prevention).
      
      For QPs, CQs, SRQs, MPTs and MTTs:
      50% of the available quantity for the resource is divided equally among
      the PF and all the active VFs (i.e., the number of VFs in the mlx4_core module
      parameter "num_vfs"). This 50% represents the "guaranteed minimum" pool.
      The other 50% is the "free pool", allocated on a first-come, first-served basis.
      For each VF/PF, resources are first allocated from its "guaranteed-minimum"
      pool. When that pool is exhausted, the driver attempts to allocate from
      the resource "free-pool".
      
      The quota (i.e., max) for the VFs and the PF is:
        The free-pool amount (50% of the real max) + the guaranteed minimum
      
      For MACs:
        Guarantee 2 MACs per VF/PF per port. As a result, since we have only
        128 MACs per port, reduce the allowable number of VFs from 64 to 63.
        Any remaining MACs are put into a free pool.
      
      For VLANs:
        For the PF, the per-port quota is 128 and guarantee is 64
           (to allow the PF to register at least a VLAN per VF in VST mode).
        For the VFs, the per-port quota is 64 and the guarantee is 0.
            We assume that VGT VFs are trusted not to abuse the VLAN resource.
      
      For Counters:
        For all functions (PF and VFs), the quota is 128 and the guarantee is 0.
      
      In this patch, we define the needed structures, which are added to the
      resource-tracker struct.  In addition, we initialize the resource
      quotas and adjust the query_device response to use quotas rather than
      resource maxima.
      
      As part of the implementation, we introduce a new field in
      mlx4_dev: quotas.  This field holds the resource quotas used
      to report maxima to the upper layers (ib_core, via query_device).
      
      The HCA maxima of these values are passed to the VFs (via
      QUERY_HCA) so that they may continue to use these in handling
      QPs, CQs, SRQs and MPTs.
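      
      To make the guaranteed/quota arithmetic above concrete, a minimal sketch
      (function and variable names are illustrative, not the driver's actual
      code):
      
        /* 50% of the resource is split equally among the PF and the active
         * VFs as guaranteed minimums; the other 50% is a shared free pool;
         * the per-function quota is the free pool plus the guarantee. */
        static void sketch_vf_quota(int max_avail, int num_vfs,
                                    int *guaranteed, int *quota)
        {
                int free_pool = max_avail / 2;
        
                *guaranteed = (max_avail / 2) / (num_vfs + 1);  /* +1 for the PF */
                *quota      = free_pool + *guaranteed;
        }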
      Signed-off-by: Jack Morgenstein <jackm@dev.mellanox.co.il>
      Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  6. 22 Oct, 2013 (1 commit)
  7. 29 Aug, 2013 (1 commit)
  8. 01 Aug, 2013 (1 commit)
    • IB/mlx4: Use default pkey when creating tunnel QPs · 3eac103f
      Committed by Jack Morgenstein
      When creating tunnel QPs for special QP tunneling, look for the
      default pkey in the slave's virtual pkey table.  If it is present, use
      the real pkey index where the default pkey is located.
      
      If the default pkey is not found in the pkey table, use the real pkey
      index which is stored at index 0 in the slave's virtual pkey table
      (this is the current behavior).
      
      This change is required to support cloud computing, where the
      paravirtualized index of the default pkey is moved to index 1 or
      higher.  The pkey at paravirtualized index 0 is used for the default
      IPoIB interface created by the VF.
      
      It's possible for the pkey value at paravirtualized index 0 to be
      invalid (zero) at VF probe time (pkey index 0 is mapped to real pkey
      index 127, which contains pkey = 0).
      
      At some point after the VF probe, the cloud computing interface at the
      hypervisor maps virtual index 0 for the VF to the pkey index
      containing the pkey that IPoIB will use in its operation.  However,
      when the tunnel QP is created, the pkey at the slave's virtual index 0
      is still mapped to the invalid pkey index, so tunnel QP creation
      fails.
      
      This commit causes the hypervisor to search for the default pkey in
      the slave's pkey table -- and this pkey is present in the table (at
      index > 0) at tunnel QP creation time, so that the tunnel QP creation
      will succeed.
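      
      A rough sketch of the lookup described above (the helper and table
      layout are illustrative; 0x7fff is the default P_Key base value with
      the membership bit masked off):
      
        #include <linux/types.h>
        
        /* Return the slave's virtual index holding the default P_Key if it
         * is present, otherwise fall back to virtual index 0 (the previous
         * behavior). */
        static int sketch_default_pkey_slot(const u16 *virt_pkey_table, int len)
        {
                int i;
        
                for (i = 0; i < len; i++)
                        if ((virt_pkey_table[i] & 0x7fff) == 0x7fff)
                                return i;       /* default P_Key, full or limited */
                return 0;
        }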
      Signed-off-by: Jack Morgenstein <jackm@dev.mellanox.co.il>
      Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
      Signed-off-by: Roland Dreier <roland@purestorage.com>
  9. 29 May, 2013 (1 commit)
  10. 08 May, 2013 (1 commit)
  11. 30 Apr, 2013 (1 commit)
  12. 25 Apr, 2013 (3 commits)
  13. 17 Apr, 2013 (2 commits)
  14. 14 Mar, 2013 (1 commit)
  15. 28 Feb, 2013 (2 commits)
    • idr: remove MAX_IDR_MASK and move left MAX_IDR_* into idr.c · e8c8d1bc
      Committed by Tejun Heo
      MAX_IDR_MASK is another weirdness in the idr interface.  As idr covers
      the whole positive integer range, it's defined as 0x7fffffff or INT_MAX.
      
      Its usage in idr_find(), idr_replace() and idr_remove() is bizarre.
      They basically mask off the sign bit and operate on the rest, so if
      the caller, by accident, passes in a negative number, the sign bit
      will be masked off and the remaining part will be used as if that was
      the input, which is worse than crashing.
      
      The constant is visible in idr.h and there are several users in the
      kernel.
      
      * drivers/i2c/i2c-core.c:i2c_add_numbered_adapter()
      
        Basically used to test if adap->nr is a negative number which isn't
        -1 and returns -EINVAL if so.  idr_alloc() already has negative
        @start checking (w/ WARN_ON_ONCE), so this can go away.
      
      * drivers/infiniband/core/cm.c:cm_alloc_id()
        drivers/infiniband/hw/mlx4/cm.c:id_map_alloc()
      
        Used to wrap cyclic @start.  Can be replaced with max(next, 0).
        Note that this type of cyclic allocation using idr is buggy; it is
        prone to spurious -ENOSPC failures after the first wraparound.
      
      * fs/super.c:get_anon_bdev()
      
        The ID allocated from ida is masked off before being tested for
        whether it's inside the valid range.  An ida-allocated ID can never
        be a negative number, so the masking is unnecessary.
      
      Update idr_*() functions to fail with -EINVAL when negative @id is
      specified and update other MAX_IDR_MASK users as described above.
      
      This leaves MAX_IDR_MASK without any user; remove it and relocate the
      other MAX_IDR_* constants to lib/idr.c.
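      
      For the cyclic-@start case in cm_alloc_id()/id_map_alloc(), the
      conversion is roughly the following (simplified sketch; locking and
      idr_preload() are omitted):
      
        /* before: next_id was wrapped with "& MAX_IDR_MASK", silently masking
         * off the sign bit; after: clamp with max() and let idr_alloc()
         * reject a negative @start on its own */
        id = idr_alloc(&cm.local_id_table, cm_id_priv, next_id, 0, GFP_KERNEL);
        if (id >= 0)
                next_id = max(id + 1, 0);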
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Jean Delvare <khali@linux-fr.org>
      Cc: Roland Dreier <roland@kernel.org>
      Cc: Sean Hefty <sean.hefty@intel.com>
      Cc: Hal Rosenstock <hal.rosenstock@gmail.com>
      Cc: "Marciniszyn, Mike" <mike.marciniszyn@intel.com>
      Cc: Jack Morgenstein <jackm@dev.mellanox.co.il>
      Cc: Or Gerlitz <ogerlitz@mellanox.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Acked-by: Wolfram Sang <wolfram@the-dreams.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • IB/mlx4: convert to idr_alloc() · 6a920060
      Committed by Tejun Heo
      Convert to the much saner new idr interface.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Jack Morgenstein <jackm@dev.mellanox.co.il>
      Cc: Or Gerlitz <ogerlitz@mellanox.com>
      Cc: Roland Dreier <roland@purestorage.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  16. 26 Feb, 2013 (6 commits)
  17. 22 Feb, 2013 (2 commits)
  18. 16 Feb, 2013 (2 commits)
    • IB/mlx4: Adjust duplicate test · 6950a235
      Committed by Julia Lawall
      Delete successive tests of the same location.  The code tested the result
      of a previous allocation that itself had already been tested.  It is
      changed to test the result of the most recent allocation.
      
      A simplified version of the semantic match that finds this problem is as
      follows: (http://coccinelle.lip6.fr/)
      
      // <smpl>
      @s exists@
      local idexpression y;
      expression x,e;
      @@
      
      *if ( \(x == NULL\|IS_ERR(x)\|y != 0\) )
       { ... when forall
         return ...; }
      ... when != \(y = e\|y += e\|y -= e\|y |= e\|y &= e\|y++\|y--\|&y\)
          when != \(XT_GETPAGE(...,y)\|WMI_CMD_BUF(...)\)
      *if ( \(x == NULL\|IS_ERR(x)\|y != 0\) )
       { ... when forall
         return ...; }
      // </smpl>
      Signed-off-by: Julia Lawall <Julia.Lawall@lip6.fr>
      Signed-off-by: Roland Dreier <roland@purestorage.com>
    • IB/mlx4: Fix bug unwinding on error in mlx4_ib_init_sriov() · cab66d12
      Committed by Dan Carpenter
      We have to decrement "i" before calling mlx4_ib_free_demux_ctx() or we
      free something that wasn't allocated.  That's fine for free_pv_object()
      but it would lead to a NULL dereference when calling mlx4_ib_free_demux_ctx().
      The NULL dereference happens because ->tun is NULL when we check:
      
      	if (!ctx->tun[i])
      
      Also, we didn't free ->sriov.demux[0], so there was a small leak.
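      
      The corrected unwind follows the usual decrement-before-free pattern,
      roughly (sketch only; teardown_slot() is a hypothetical stand-in for the
      free_pv_object()/mlx4_ib_free_demux_ctx() calls):
      
        err:
                /* only slots below the failing index were fully set up */
                while (--i >= 0)
                        teardown_slot(dev, i);  /* hypothetical helper */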
      Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
      Signed-off-by: Roland Dreier <roland@purestorage.com>
  19. 30 Nov, 2012 (1 commit)
  20. 27 Nov, 2012 (1 commit)
    • mlx4: 64-byte CQE/EQE support · 08ff3235
      Committed by Or Gerlitz
      ConnectX-3 devices can use either 64- or 32-byte completion queue
      entries (CQEs) and event queue entries (EQEs).  Using 64-byte
      EQEs/CQEs performs better because each entry is aligned to a complete
      cacheline.  This patch queries the HCA's capabilities, and if it
      supports 64-byte CQEs and EQEs, the driver will configure the HW to
      work in 64-byte mode.
      
      The 32-byte vs 64-byte mode is global per HCA and not per CQ or EQ.
      
      Since this mode is global, userspace (libmlx4) must be updated to work
      with the configured CQE size, and guests using SR-IOV virtual
      functions need to know both EQE and CQE size.
      
      In case one of the 64-byte CQE/EQE capabilities is activated, the
      patch makes sure that older guest drivers that use the QUERY_DEV_FUNC
      command (e.g. as done in mlx4_core of Linux 3.3..3.6) will notice that
      they need an update to be able to work with the PPF. This is done by
      changing the returned pf_context_behaviour so that it is no longer zero.
      In case none of these capabilities is activated, that value remains zero
      and older guest drivers can run OK.
      
      The SRIOV-related flow is as follows:
      
      1. the PPF does the detection of the new capabilities using
         QUERY_DEV_CAP command.
      
      2. the PPF activates the new capabilities using INIT_HCA.
      
      3. the VF detects if the PPF activated the capabilities using
         QUERY_HCA, and if this is the case activates them for itself too.
      
      Note that the VF detects that it must be aware of the new PF behaviour
      using QUERY_FUNC_CAP.  Steps 1 and 2 also apply in native mode.
      
      User space notification is done through a new field introduced in
      struct mlx4_ib_ucontext which holds device capabilities for which user
      space must take action. This changes the binary interface, so the ABI
      towards libmlx4 exposed through uverbs is bumped from 3 to 4, but only
      when **needed**, i.e. only when the driver actually uses 64-byte CQEs or
      future device capabilities which must be kept in sync with user space.
      This practice allows unmodified libmlx4 to keep working with older
      devices (e.g. A0, B0) which don't support 64-byte CQEs.
      
      In order to keep existing systems functional when they update to a
      newer kernel that contains these changes in VF and userspace ABI, a
      module parameter enable_64b_cqe_eqe must be set to enable 64-byte
      mode; the default is currently false.
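      
      A sketch of what such an opt-in module parameter declaration typically
      looks like (the parameter name comes from the text above; the
      description string is illustrative):
      
        #include <linux/moduleparam.h>
        
        static bool enable_64b_cqe_eqe;         /* defaults to false (32-byte mode) */
        module_param(enable_64b_cqe_eqe, bool, 0444);
        MODULE_PARM_DESC(enable_64b_cqe_eqe,
                         "Enable 64 byte CQEs/EQEs when supported by FW (default: off)");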
      Signed-off-by: Eli Cohen <eli@mellanox.com>
      Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
      Signed-off-by: Roland Dreier <roland@purestorage.com>
  21. 19 Oct, 2012 (3 commits)
    • IB/mlx4: Synchronize cleanup of MCGs in MCG paravirtualization · bef83ed9
      Committed by Eli Cohen
      A client re-register event invokes cleanup of all MCGs.  This is
      required to protect against misbehaved guests corrupting the
      join/leave database.  However, since cleaning up the MCGs is a heavy
      operation, it is pushed to a work queue for further processing.
      Client re-register is also propagated to ULPs (e.g. IPoIB).
      
      However, since the cleanup is performed in a workqueue, the ULP could
      leave and re-join groups before the cleanup occurs.  In this case,
      when the cleanup takes place, it prunes the (newly-joined) MCGs and
      the ULP is left without actual MCGs while believing it joined them.
      
      Fix this by setting the flushing flag before invoking the cleanup task
      and clearing it after flushing is complete.
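      
      In code, the ordering fix is essentially the following (the field and
      work-item names here are placeholders, not necessarily the driver's):
      
        /* before queueing the heavy MCG cleanup: */
        ctx->flushing = 1;                      /* park join/leave handling until done */
        queue_work(cleanup_wq, &ctx->mcg_cleanup_work);
        
        /* ...and at the end of the cleanup work handler, once pruning is done: */
        ctx->flushing = 0;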
      Signed-off-by: Eli Cohen <eli@mellanox.com>
      Reviewed-by: Jack Morgenstein <jackm@dev.mellanox.co.il>
      Signed-off-by: Roland Dreier <roland@purestorage.com>
    • IB/mlx4: Fix QP1 P_Key processing in the Primary Physical Function (PPF) · 2c75d2cc
      Committed by Jack Morgenstein
      In the MAD paravirtualization code, one of the checks performed when
      forwarding QP1 (GSI) packets from wire to slave was a P_Key check: the
      P_Key received in the MAD must be present in the guest's paravirtualized
      P_Key table, and at least one of the (packet P_Key, guest P_Key) must
      be a full-membership P_Key.
      
      However, if everyone involved has only limited membership in the
      default P_Key, then packets sent by full-member remote hosts arrive at
      the PPF but are not passed on to the VFs under the current P_Key check.
      
      Fix this as follows:
      
      1. Don't care if P_Key received over wire is full or not. If it
         successfully passed HW checks on the real QP1, then simply pass it
         to guest regardless of whether the guest has full or limited
         membership in its P_Key table.
      
      2. If the guest (including paravirtualized master) has both full and
         limited P_Key forms in its table, preferentially pass the
         paravirtualized P_Key index of the full P_Key form in the tunnel
         header.
      
      3. In the multicast join flow (mlx4/mcg.c), use the index for the
         default P_Key (wherever it is located) in replies generated from
         within the mcg module (previously, P_Key index 0 was used in all
         cases).
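      
      A sketch of the preference in step 2 above (0x8000 is the standard
      P_Key full-membership bit; the loop and table layout are illustrative):
      
        /* Prefer a virtual slot holding the full-membership form of the
         * P_Key; otherwise fall back to a matching limited-membership slot. */
        static int sketch_pick_pkey_slot(const u16 *virt_pkeys, int len, u16 pkey)
        {
                int limited_slot = -1, i;
        
                for (i = 0; i < len; i++) {
                        if ((virt_pkeys[i] | 0x8000) != (pkey | 0x8000))
                                continue;               /* different partition */
                        if (virt_pkeys[i] & 0x8000)
                                return i;               /* full membership: use it */
                        if (limited_slot < 0)
                                limited_slot = i;
                }
                return limited_slot;
        }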
      Signed-off-by: Jack Morgenstein <jackm@dev.mellanox.co.il>
      Signed-off-by: Roland Dreier <roland@purestorage.com>
    • IB/mlx4: Fix build error on platforms where UL is not 64 bits · 8a095030
      Committed by Doug Ledford
      Line 110 uses UL as a type suffix for the 0x constant, but that is not
      large enough to hold a 64-bit value on a 32-bit arch.
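      
      For reference, a portable way to express an all-ones 64-bit value
      regardless of the native word size (illustrative snippet, not the
      patched line itself):
      
        u64 mask = (u64)-1;     /* all bits set; equivalent to ~0ULL, and correct
                                 * even where "unsigned long" is only 32 bits */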
      Signed-off-by: Doug Ledford <dledford@redhat.com>
      
      [ Use "-1" instead of "FFFFFFFFFFFFFFFFULL".  - Roland ]
      Signed-off-by: Roland Dreier <roland@purestorage.com>
  22. 06 Oct, 2012 (1 commit)
  23. 01 Oct, 2012 (3 commits)