1. 05 11月, 2013 1 次提交
    • J
      mlx4: Structures and init/teardown for VF resource quotas · 5a0d0a61
      Jack Morgenstein 提交于
      This is step #1 for implementing SRIOV resource quotas for VFs.
      
      Quotas are implemented per resource type for VFs and the PF, to prevent
      any entity from simply grabbing all the resources for itself and leaving
      the other entities unable to obtain such resources.
      
      Resources which are allocated using quotas:  QPs, CQs, SRQs, MPTs, MTTs, MAC,
                                                   VLAN, and Counters.
      
      The quota system works as follows:
      Each entity (VF or PF) is given a max number of a given resource (its quota),
      and a guaranteed minimum number for each resource (starvation prevention).
      
      For QPs, CQs, SRQs, MPTs and MTTs:
      50% of the available quantity for the resource is divided equally among
      the PF and all the active VFs (i.e., the number of VFs in the mlx4_core module
      parameter "num_vfs"). This 50% represents the "guaranteed minimum" pool.
      The other 50% is the "free pool", allocated on a first-come-first-serve basis.
      For each VF/PF, resources are first allocated from its "guaranteed-minimum"
      pool. When that pool is exhausted, the driver attempts to allocate from
      the resource "free-pool".
      
      The quota (i.e., max) for the VFs and the PF is:
        The free-pool amount (50% of the real max) + the guaranteed minimum
      
      For MACs:
        Guarantee 2 MACs per VF/PF per port. As a result, since we have only
        128 MACs per port, reduce the allowable number of VFs from 64 to 63.
        Any remaining MACs are put into a free pool.
      
      For VLANs:
        For the PF, the per-port quota is 128 and guarantee is 64
           (to allow the PF to register at least a VLAN per VF in VST mode).
        For the VFs, the per-port quota is 64 and the guarantee is 0.
            We assume that VGT VFs are trusted not to abuse the VLAN resource.
      
      For Counters:
        For all functions (PF and VFs), the quota is 128 and the guarantee is 0.
      
      In this patch, we define the needed structures, which are added to the
      resource-tracker struct.  In addition, we do initialization
      for the resource quota, and adjust the query_device response to use quotas
      rather than resource maxima.
      
      As part of the implementation, we introduce a new field in
      mlx4_dev: quotas.  This field holds the resource quotas used
      to report maxima to the upper layers (ib_core, via query_device).
      
      The HCA maxima of these values are passed to the VFs (via
      QUERY_HCA) so that they may continue to use these in handling
      QPs, CQs, SRQs and MPTs.
      Signed-off-by: NJack Morgenstein <jackm@dev.mellanox.co.il>
      Signed-off-by: NOr Gerlitz <ogerlitz@mellanox.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      5a0d0a61
  2. 22 10月, 2013 1 次提交
  3. 29 8月, 2013 1 次提交
  4. 01 8月, 2013 1 次提交
    • J
      IB/mlx4: Use default pkey when creating tunnel QPs · 3eac103f
      Jack Morgenstein 提交于
      When creating tunnel QPs for special QP tunneling, look for the
      default pkey in the slave's virtual pkey table.  If it is present, use
      the real pkey index where the default pkey is located.
      
      If the default pkey is not found in the pkey table, use the real pkey
      index which is stored at index 0 in the slave's virtual pkey table
      (this is the current behavior).
      
      This change is required to support cloud computing, where the
      paravirtualized index of the default pkey is moved to index 1 or
      higher.  The pkey at paravirtualized index 0 is used for the default
      IPoIB interface created by the VF.
      
      Its possible for the pkey value at paravirtualized index 0 to be
      invalid (zero) at VF probe time (pkey index 0 is mapped to real pkey
      index 127, which contains pkey = 0).
      
      At some point after the VF probe, the cloud computing interface at the
      hypervisor maps virtual index 0 for the VF to the pkey index
      containing the pkey that IPoIB will use in its operation.  However,
      when the tunnel QP is created, the pkey at the slave's virtual index 0
      is still mapped to the invalid pkey index, so tunnel QP creation
      fails.
      
      This commit causes the hypervisor to search for the default pkey in
      the slave's pkey table -- and this pkey is present in the table (at
      index > 0) at tunnel QP creation time, so that the tunnel QP creation
      will succeed.
      Signed-off-by: NJack Morgenstein <jackm@dev.mellanox.co.il>
      Signed-off-by: NOr Gerlitz <ogerlitz@mellanox.com>
      Signed-off-by: NRoland Dreier <roland@purestorage.com>
      3eac103f
  5. 29 5月, 2013 1 次提交
  6. 08 5月, 2013 1 次提交
  7. 30 4月, 2013 1 次提交
  8. 25 4月, 2013 3 次提交
  9. 17 4月, 2013 2 次提交
  10. 14 3月, 2013 1 次提交
  11. 28 2月, 2013 2 次提交
    • T
      idr: remove MAX_IDR_MASK and move left MAX_IDR_* into idr.c · e8c8d1bc
      Tejun Heo 提交于
      MAX_IDR_MASK is another weirdness in the idr interface.  As idr covers
      whole positive integer range, it's defined as 0x7fffffff or INT_MAX.
      
      Its usage in idr_find(), idr_replace() and idr_remove() is bizarre.
      They basically mask off the sign bit and operate on the rest, so if
      the caller, by accident, passes in a negative number, the sign bit
      will be masked off and the remaining part will be used as if that was
      the input, which is worse than crashing.
      
      The constant is visible in idr.h and there are several users in the
      kernel.
      
      * drivers/i2c/i2c-core.c:i2c_add_numbered_adapter()
      
        Basically used to test if adap->nr is a negative number which isn't
        -1 and returns -EINVAL if so.  idr_alloc() already has negative
        @start checking (w/ WARN_ON_ONCE), so this can go away.
      
      * drivers/infiniband/core/cm.c:cm_alloc_id()
        drivers/infiniband/hw/mlx4/cm.c:id_map_alloc()
      
        Used to wrap cyclic @start.  Can be replaced with max(next, 0).
        Note that this type of cyclic allocation using idr is buggy.  These
        are prone to spurious -ENOSPC failure after the first wraparound.
      
      * fs/super.c:get_anon_bdev()
      
        The ID allocated from ida is masked off before being tested whether
        it's inside valid range.  ida allocated ID can never be a negative
        number and the masking is unnecessary.
      
      Update idr_*() functions to fail with -EINVAL when negative @id is
      specified and update other MAX_IDR_MASK users as described above.
      
      This leaves MAX_IDR_MASK without any user, remove it and relocate
      other MAX_IDR_* constants to lib/idr.c.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Cc: Jean Delvare <khali@linux-fr.org>
      Cc: Roland Dreier <roland@kernel.org>
      Cc: Sean Hefty <sean.hefty@intel.com>
      Cc: Hal Rosenstock <hal.rosenstock@gmail.com>
      Cc: "Marciniszyn, Mike" <mike.marciniszyn@intel.com>
      Cc: Jack Morgenstein <jackm@dev.mellanox.co.il>
      Cc: Or Gerlitz <ogerlitz@mellanox.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Acked-by: NWolfram Sang <wolfram@the-dreams.de>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      e8c8d1bc
    • T
      IB/mlx4: convert to idr_alloc() · 6a920060
      Tejun Heo 提交于
      Convert to the much saner new idr interface.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Cc: Jack Morgenstein <jackm@dev.mellanox.co.il>
      Cc: Or Gerlitz <ogerlitz@mellanox.com>
      Cc: Roland Dreier <roland@purestorage.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      6a920060
  12. 26 2月, 2013 6 次提交
  13. 22 2月, 2013 2 次提交
  14. 16 2月, 2013 2 次提交
    • J
      IB/mlx4: Adjust duplicate test · 6950a235
      Julia Lawall 提交于
      Delete successive tests to the same location.  The code tested the result
      of a previous allocation, that itself was already tested.  It is changed to
      test the result of the most recent allocation.
      
      A simplified version of the semantic match that finds this problem is as
      follows: (http://coccinelle.lip6.fr/)
      
      // <smpl>
      @s exists@
      local idexpression y;
      expression x,e;
      @@
      
      *if ( \(x == NULL\|IS_ERR(x)\|y != 0\) )
       { ... when forall
         return ...; }
      ... when != \(y = e\|y += e\|y -= e\|y |= e\|y &= e\|y++\|y--\|&y\)
          when != \(XT_GETPAGE(...,y)\|WMI_CMD_BUF(...)\)
      *if ( \(x == NULL\|IS_ERR(x)\|y != 0\) )
       { ... when forall
         return ...; }
      // </smpl>
      Signed-off-by: NJulia Lawall <Julia.Lawall@lip6.fr>
      Signed-off-by: NRoland Dreier <roland@purestorage.com>
      6950a235
    • D
      IB/mlx4: Fix bug unwinding on error in mlx4_ib_init_sriov() · cab66d12
      Dan Carpenter 提交于
      We have to decrement "i" before calling mlx4_ib_free_demux_ctx() or we
      free something that wasn't allocated.  That's fine for free_pv_object()
      but it would lead to a NULL dereference calling mlx4_ib_free_demux_ctx().
      The null dereference is because ->tun is NULL when we check:
      
      	if (!ctx->tun[i])
      
      Also we didn't free ->sriov.demux[0] so it was a small leak.
      Signed-off-by: NDan Carpenter <dan.carpenter@oracle.com>
      Signed-off-by: NRoland Dreier <roland@purestorage.com>
      cab66d12
  15. 30 11月, 2012 1 次提交
  16. 27 11月, 2012 1 次提交
    • O
      mlx4: 64-byte CQE/EQE support · 08ff3235
      Or Gerlitz 提交于
      ConnectX-3 devices can use either 64- or 32-byte completion queue
      entries (CQEs) and event queue entries (EQEs).  Using 64-byte
      EQEs/CQEs performs better because each entry is aligned to a complete
      cacheline.  This patch queries the HCA's capabilities, and if it
      supports 64-byte CQEs and EQES the driver will configure the HW to
      work in 64-byte mode.
      
      The 32-byte vs 64-byte mode is global per HCA and not per CQ or EQ.
      
      Since this mode is global, userspace (libmlx4) must be updated to work
      with the configured CQE size, and guests using SR-IOV virtual
      functions need to know both EQE and CQE size.
      
      In case one of the 64-byte CQE/EQE capabilities is activated, the
      patch makes sure that older guest drivers that use the QUERY_DEV_FUNC
      command (e.g as done in mlx4_core of Linux 3.3..3.6) will notice that
      they need an update to be able to work with the PPF. This is done by
      changing the returned pf_context_behaviour not to be zero any more. In
      case none of these capabilities is activated that value remains zero
      and older guest drivers can run OK.
      
      The SRIOV related flow is as follows
      
      1. the PPF does the detection of the new capabilities using
         QUERY_DEV_CAP command.
      
      2. the PPF activates the new capabilities using INIT_HCA.
      
      3. the VF detects if the PPF activated the capabilities using
         QUERY_HCA, and if this is the case activates them for itself too.
      
      Note that the VF detects that it must be aware to the new PF behaviour
      using QUERY_FUNC_CAP.  Steps 1 and 2 apply also for native mode.
      
      User space notification is done through a new field introduced in
      struct mlx4_ib_ucontext which holds device capabilities for which user
      space must take action. This changes the binary interface so the ABI
      towards libmlx4 exposed through uverbs is bumped from 3 to 4 but only
      when **needed** i.e. only when the driver does use 64-byte CQEs or
      future device capabilities which must be in sync by user space. This
      practice allows to work with unmodified libmlx4 on older devices (e.g
      A0, B0) which don't support 64-byte CQEs.
      
      In order to keep existing systems functional when they update to a
      newer kernel that contains these changes in VF and userspace ABI, a
      module parameter enable_64b_cqe_eqe must be set to enable 64-byte
      mode; the default is currently false.
      Signed-off-by: NEli Cohen <eli@mellanox.com>
      Signed-off-by: NOr Gerlitz <ogerlitz@mellanox.com>
      Signed-off-by: NRoland Dreier <roland@purestorage.com>
      08ff3235
  17. 19 10月, 2012 3 次提交
    • E
      IB/mlx4: Synchronize cleanup of MCGs in MCG paravirtualization · bef83ed9
      Eli Cohen 提交于
      A client re-register event invokes cleanup of all MCGs.  This is
      required to protect against misbehaved guests leading to corruption of
      join/leave database.  However, since cleaning up the MCGs is a heavy
      operation, it is pushed to a work queue for further processing.
      Client re-register is also propagated to ULPs (e.g IPoIB).
      
      However, since the cleanup is performed in a workqueue, the ULP could
      leave and re-join groups before the cleanup occurs.  In this case,
      when the cleanup takes place, it prunes the (newly-joined) MCGs and
      the ULP is left without actual MCGs while believing it joined them.
      
      Fix this by setting the flushing flag before invoking the cleanup task
      and clearing it after flushing is complete.
      Signed-off-by: NEli Cohen <eli@mellanox.com>
      Reviewed-by: NJack Morgenstein <jackm@dev.mellanox.co.il>
      Signed-off-by: NRoland Dreier <roland@purestorage.com>
      bef83ed9
    • J
      IB/mlx4: Fix QP1 P_Key processing in the Primary Physical Function (PPF) · 2c75d2cc
      Jack Morgenstein 提交于
      In the MAD paravirtualization code, one of the checks performed when
      forwarding QP1 (GSI) packets from wire to slave was a P_Key check: the
      P_Key received in the MAD must be present in the guest's paravirtualized
      P_Key table, and at least one of the (packet P_Key, guest P_Key) must
      be a full-membership P_Key.
      
      However, if everyone involved has only limited membership in the
      default P_Key, then packets sent by full-member remote hosts arrive at
      the PPF but are not passed on to the VFs with the current P_Key1 check.
      
      Fix this as follows:
      
      1. Don't care if P_Key received over wire is full or not. If it
         successfully passed HW checks on the real QP1, then simply pass it
         to guest regardless of whether the guest has full or limited
         membership in its P_Key table.
      
      2. If the guest (including paravirtualized master) has both full and
         limited P_Key forms in its table, preferentially pass the
         paravirtualized P_Key index of the full P_Key form in the tunnel
         header.
      
      3. In the multicast join flow (mlx4/mcg.c), use the index for the
         default P_Key (wherever it is located) in replies generated from
         within the mcg module (previously, P_Key index 0 was used in all
         cases).
      Signed-off-by: NJack Morgenstein <jackm@dev.mellanox.co.il>
      Signed-off-by: NRoland Dreier <roland@purestorage.com>
      2c75d2cc
    • D
      IB/mlx4: Fix build error on platforms where UL is not 64 bits · 8a095030
      Doug Ledford 提交于
      Line 110 uses UL as a compiler cast for the 0x constant, but it's not
      large enough to hold a 64-bit value on a 32-bit arch.
      Signed-off-by: NDoug Ledford <dledford@redhat.com>
      
      [ Use "-1" instead of "FFFFFFFFFFFFFFFFULL".  - Roland ]
      Signed-off-by: NRoland Dreier <roland@purestorage.com>
      8a095030
  18. 06 10月, 2012 1 次提交
  19. 01 10月, 2012 9 次提交
    • J
      IB/mlx4: Create paravirt contexts for VFs when master IB driver initializes · 3806d08c
      Jack Morgenstein 提交于
      When we have VFs and PFs on same host, the VFs are activated within
      the mlx4_core module before the mlx4_ib kernel module is loaded.
      
      When the mlx4_ib module initializes the PF (master), it now creates
      MAD paravirtualization contexts for any VFs that already active.
      Signed-off-by: NJack Morgenstein <jackm@dev.mellanox.co.il>
      Signed-off-by: NRoland Dreier <roland@purestorage.com>
      3806d08c
    • J
      mlx4: Modify proxy/tunnel QP mechanism so that guests do no calculations · 47605df9
      Jack Morgenstein 提交于
      Previously, the structure of a guest's proxy QPs followed the
      structure of the PPF special qps (qp0 port 1, qp0 port 2, qp1 port 1,
      qp1 port 2, ...).  The guest then did offset calculations on the
      sqp_base qp number that the PPF passed to it in QUERY_FUNC_CAP().
      
      This is now changed so that the guest does no offset calculations
      regarding proxy or tunnel QPs to use.  This change frees the PPF from
      needing to adhere to a specific order in allocating proxy and tunnel
      QPs.
      
      Now QUERY_FUNC_CAP provides each port individually with its proxy
      qp0, proxy qp1, tunnel qp0, and tunnel qp1 QP numbers, and these are
      used directly where required (with no offset calculations).
      
      To accomplish this change, several fields were added to the phys_caps
      structure for use by the PPF and by non-SR-IOV mode:
      
          base_sqpn -- in non-sriov mode, this was formerly sqp_start.
          base_proxy_sqpn -- the first physical proxy qp number -- used by PPF
          base_tunnel_sqpn -- the first physical tunnel qp number -- used by PPF.
      
      The current code in the PPF still adheres to the previous layout of
      sqps, proxy-sqps and tunnel-sqps.  However, the PPF can change this
      layout without affecting VF or (paravirtualized) PF code.
      Signed-off-by: NJack Morgenstein <jackm@dev.mellanox.co.il>
      Signed-off-by: NRoland Dreier <roland@purestorage.com>
      47605df9
    • J
      mlx4: Paravirtualize Node Guids for slaves · afa8fd1d
      Jack Morgenstein 提交于
      This is necessary in order to support > 1 VF/PF in a VM for software
      that uses the node guid as a discriminator, such as librdmacm.
      Signed-off-by: NJack Morgenstein <jackm@dev.mellanox.co.il>
      Signed-off-by: NRoland Dreier <roland@purestorage.com>
      afa8fd1d
    • J
      mlx4: Activate SR-IOV mode for IB · 026149cb
      Jack Morgenstein 提交于
      Remove the error returns for IB ports from mlx4_ib_add,
      mlx4_INIT_PORT_wrapper, and mlx4_CLOSE_PORT_wrapper.
      
      Currently, SRIOV is supported only for devices for which the
      link layer is IB on all ports; RoCE support will be added later.
      Signed-off-by: NJack Morgenstein <jackm@dev.mellanox.co.il>
      Signed-off-by: NRoland Dreier <roland@purestorage.com>
      026149cb
    • J
      IB/mlx4: Miscellaneous adjustments for SR-IOV IB support · 992e8e6e
      Jack Morgenstein 提交于
      1. Allow only master to change node description.
      2. Prevent AH leakage in send mads.
      3. Take device part number from PCI structure, so that guests see the
         VF part number (and not the PF part number).
      4. Place the device revision ID into caps structure at startup.
      5. SET_PORT in update_gids_task needs to go through wrapper on master.
      6. In mlx4_ib_event(), PORT_MGMT_EVENT needs be handled in a work
         queue on the master, since it propagates events to slaves using
         GEN_EQE.
      7. Do not support FMR on slaves.
      8. Add spinlock to slave_event(), since it is called both in interrupt
         context and in process context (due to 6 above, and also if
         smp_snoop is used).  This fix was found and implemented by Saeed
         Mahameed <saeedm@mellanox.com>
      Signed-off-by: NJack Morgenstein <jackm@dev.mellanox.co.il>
      Signed-off-by: NRoland Dreier <roland@purestorage.com>
      992e8e6e
    • J
      IB/mlx4: Add iov directory in sysfs under the ib device · c1e7e466
      Jack Morgenstein 提交于
      This directory is added only for the master -- slaves do not have it.
      
      The sysfs iov directory is used to manage and examine the port P_Key
      and guid paravirtualization.
      
      Under iov/ports, the administrator may examine the gid and P_Key tables
      as they are present in the device (and as are seen in the "network
      view" presented to the SM).
      
      Under the iov/<pci slot number> directories, the admin may map the
      index numbers in the physical tables (as under iov/ports) to the
      paravirtualized index numbers that guests see.
      
      For example, if the administrator, for port 1 on guest 2 maps physical
      pkey index 10 to virtual index 1, then that guest, whenever it uses
      its pkey index 1, will actually be using the real pkey index 10.
      
      Based on patch from Erez Shitrit <erezsh@mellanox.com>
      Signed-off-by: NJack Morgenstein <jackm@dev.mellanox.co.il>
      Signed-off-by: NRoland Dreier <roland@purestorage.com>
      c1e7e466
    • J
      IB/mlx4: Propagate P_Key and guid change port management events to slaves · 2a4fae14
      Jack Morgenstein 提交于
      P_Key change and guid change events are not of interest to all slaves,
      but only to those slaves which "see" the table slots whose contents
      have change.
      
      For example, if the guid at port 1, index 5 has changed in the PPF, we
      wish to propagate the gid-change event only to the function which has
      that guid index mapped to its port/guid table (in this case it is
      slave #5). Other functions should not get the event, since the event
      does not affect them.
      
      Similarly with P_Keys -- P_Key change events are forwarded only to
      slaves which have that P_Key index mapped to their virtual P_Key table.
      Signed-off-by: NJack Morgenstein <jackm@dev.mellanox.co.il>
      Signed-off-by: NRoland Dreier <roland@purestorage.com>
      2a4fae14
    • J
      mlx4: Add alias_guid mechanism · a0c64a17
      Jack Morgenstein 提交于
      For IB ports, we paravirtualize the GUID at index 0 on slaves.  The
      GUID at index 0 seen by a slave is the actual GUID occupying the GUID
      table at the slave-id index.
      
      The driver, by default, requests at startup time that subnet manager
      populate its entire guid table with GUIDs. These guids are then mapped
      (paravirtualized) to the slaves, and appear for each slave as its GUID
      at index 0.
      
      Until each slave has such a guid, its port status is DOWN.
      
      The guid table is cached to support special QP paravirtualization, and
      event propagation to slaves on guid change (we test to see if the guid
      really changed before propagating an event to the slave).
      
      To support this caching, add capability to __mlx4_ib_query_gid() to
      obtain the network view (i.e., physical view) gid at index X, not just
      the host (paravirtualized) view.
      
      Based on a patch from Erez Shitrit <erezsh@mellanox.com>
      Signed-off-by: NJack Morgenstein <jackm@dev.mellanox.co.il>
      Signed-off-by: NRoland Dreier <roland@purestorage.com>
      a0c64a17
    • A
      IB/mlx4: Add CM paravirtualization · 3cf69cc8
      Amir Vadai 提交于
      In CM para-virtualization:
      
      1. Incoming requests are steered to the correct vHCA according to the
         embedded GID.
      2. Communication IDs on outgoing requests are replaced by a globally
         unique ID, generated by the PPF, since there is no synchronization
         of ID generation between guests (and so these IDs are not
         guaranteed to be globally unique).  The guest's comm ID is stored,
         and is returned to the response MAD when it arrives.
      Signed-off-by: NAmir Vadai <amirv@mellanox.co.il>
      Signed-off-by: NJack Morgenstein <jackm@dev.mellanox.co.il>
      Signed-off-by: NRoland Dreier <roland@purestorage.com>
      3cf69cc8