1. 01 10月, 2012 3 次提交
    • J
      IB/mlx4: SR-IOV IB context objects and proxy/tunnel SQP support · 1ffeb2eb
      Jack Morgenstein 提交于
      1. Introduce the basic SR-IOV parvirtualization context objects for
         multiplexing and demultiplexing MADs.
      2. Introduce support for the new proxy and tunnel QP types.
      
      This patch introduces the objects required by the master for managing
      QP paravirtualization for guests.
      
      struct mlx4_ib_sriov is created by the master only.
      It is a container for the following:
      
      1. All the info required by the PPF to multiplex and de-multiplex MADs
         (including those from the PF). (struct mlx4_ib_demux_ctx demux)
      2. All the info required to manage alias GUIDs (i.e., the GUID at
         index 0 that each guest perceives.  In fact, this is not the GUID
         which is actually at index 0, but is, in fact, the GUID which is at
         index[<VF number>] in the physical table.
      3. structures which are used to manage CM paravirtualization
      4. structures for managing the real special QPs when running in SR-IOV
         mode.  The real SQPs are controlled by the PPF in this case.  All
         SQPs created and controlled by the ib core layer are proxy SQP.
      
      struct mlx4_ib_demux_ctx contains the information per port needed
      to manage paravirtualization:
      
      1. All multicast paravirt info
      2. All tunnel-qp paravirt info for the port.
      3. GUID-table and GUID-prefix for the port
      4. work queues.
      
      struct mlx4_ib_demux_pv_ctx contains all the info for managing the
      paravirtualized QPs for one slave/port.
      
      struct mlx4_ib_demux_pv_qp contains the info need to run an individual
      QP (either tunnel qp or real SQP).
      
      Note:  We made use of the 2 most significant bits in enum
      mlx4_ib_qp_flags (based on enum ib_qp_create_flags in ib_verbs.h).
      We need these bits in the low-level driver for internal purposes.
      Signed-off-by: NJack Morgenstein <jackm@dev.mellanox.co.il>
      Signed-off-by: NRoland Dreier <roland@purestorage.com>
      1ffeb2eb
    • J
      IB/core: Add ib_find_exact_cached_pkey() · 73aaa741
      Jack Morgenstein 提交于
      When P_Key tables potentially contain both full and partial membership
      copies for the same P_Key, we need a function to find the index for an
      exact (16-bit) P_Key.
      
      This is necessary when the master forwards QP1 MADs sent by guests.
      If the guest has sent the MAD with a limited membership P_Key, we need
      to to forward the MAD using the same limited membership P_Key.  Since
      the master may have both the limited and the full member P_Keys in its
      table, we must make sure to retrieve the limited membership P_Key in
      this case.
      Signed-off-by: NJack Morgenstein <jackm@dev.mellanox.co.il>
      Signed-off-by: NOr Gerlitz <ogerlitz@mellanox.com>
      Signed-off-by: NRoland Dreier <roland@purestorage.com>
      73aaa741
    • J
      IB/core: Handle table with full and partial membership for the same P_Key · ff7166c4
      Jack Morgenstein 提交于
      Extend the cached and non-cached P_Key table lookups to handle limited
      and full membership of the same P_Key to co-exist in the P_Key table.
      
      This is necessary for SR-IOV, to allow for some guests would to have
      the full membership P_Key in their virtual P_Key table, while other
      guests on the same physical HCA would have the limited one.
      To support this, we need both the limited and full membership P_Keys
      to be present in the master's (hypervisor physical port) P_Key table.
      
      The algorithm for handling P_Key tables which contain both the limited
      and the full membership versions of the same P_Key works as follows:
      
      When scanning the P_Key table for a 15-bit P_Key:
      
      A. If there is a full member version of that P_Key anywhere in the
          table, return its index (even if a limited-member version of the
          P_Key exists earlier in the table).
      
      B. If the full member version is not in the table, but the
         limited-member version is in the table, return the index of the
         limited P_Key.
      Signed-off-by: NLiran Liss <liranl@mellanox.com>
      Signed-off-by: NJack Morgenstein <jackm@dev.mellanox.co.il>
      Signed-off-by: NOr Gerlitz <ogerlitz@mellanox.com>
      Signed-off-by: NRoland Dreier <roland@purestorage.com>
      ff7166c4
  2. 15 9月, 2012 2 次提交
  3. 13 9月, 2012 2 次提交
  4. 08 9月, 2012 1 次提交
  5. 17 8月, 2012 1 次提交
  6. 16 8月, 2012 3 次提交
  7. 15 8月, 2012 2 次提交
  8. 14 8月, 2012 1 次提交
  9. 11 8月, 2012 2 次提交
    • R
      RDMA/ocrdma: Don't call vlan_dev_real_dev() for non-VLAN netdevs · d549f55f
      Roland Dreier 提交于
      If CONFIG_VLAN_8021Q is not set, then vlan_dev_real_dev() just goes BUG(),
      so we shouldn't call it unless we're actually dealing with a VLAN netdev.
      Reported-by: NFengguang Wu <fengguang.wu@intel.com>
      Signed-off-by: NRoland Dreier <roland@purestorage.com>
      d549f55f
    • J
      IB/mlx4: Fix possible deadlock on sm_lock spinlock · df7fba66
      Jack Morgenstein 提交于
      The sm_lock spinlock is taken in the process context by
      mlx4_ib_modify_device, and in the interrupt context by update_sm_ah,
      so we need to take that spinlock with irqsave, and release it with
      irqrestore.
      
      Lockdeps reports this as follows:
      
          [ INFO: inconsistent lock state ]
          3.5.0+ #20 Not tainted
          inconsistent {HARDIRQ-ON-W} -> {IN-HARDIRQ-W} usage.
          swapper/0/0 [HC1[1]:SC0[0]:HE0:SE1] takes:
          (&(&ibdev->sm_lock)->rlock){?.+...}, at: [<ffffffffa028af1d>] update_sm_ah+0xad/0x100 [mlx4_ib]
          {HARDIRQ-ON-W} state was registered at:
            [<ffffffff810b84a0>] mark_irqflags+0x120/0x190
            [<ffffffff810b9ce7>] __lock_acquire+0x307/0x4c0
            [<ffffffff810b9f51>] lock_acquire+0xb1/0x150
            [<ffffffff815523b1>] _raw_spin_lock+0x41/0x50
            [<ffffffffa028d563>] mlx4_ib_modify_device+0x63/0x240 [mlx4_ib]
            [<ffffffffa026d1fc>] ib_modify_device+0x1c/0x20 [ib_core]
            [<ffffffffa026c353>] set_node_desc+0x83/0xc0 [ib_core]
            [<ffffffff8136a150>] dev_attr_store+0x20/0x30
            [<ffffffff81201fd6>] sysfs_write_file+0xe6/0x170
            [<ffffffff8118da38>] vfs_write+0xc8/0x190
            [<ffffffff8118dc01>] sys_write+0x51/0x90
            [<ffffffff8155b869>] system_call_fastpath+0x16/0x1b
      
          ...
          *** DEADLOCK ***
      
          1 lock held by swapper/0/0:
      
          stack backtrace:
          Pid: 0, comm: swapper/0 Not tainted 3.5.0+ #20
          Call Trace:
          <IRQ>  [<ffffffff810b7bea>] print_usage_bug+0x18a/0x190
          [<ffffffff810b7370>] ? print_irq_inversion_bug+0x210/0x210
          [<ffffffff810b7fb2>] mark_lock_irq+0xf2/0x280
          [<ffffffff810b8290>] mark_lock+0x150/0x240
          [<ffffffff810b84ef>] mark_irqflags+0x16f/0x190
          [<ffffffff810b9ce7>] __lock_acquire+0x307/0x4c0
          [<ffffffffa028af1d>] ? update_sm_ah+0xad/0x100 [mlx4_ib]
          [<ffffffff810b9f51>] lock_acquire+0xb1/0x150
          [<ffffffffa028af1d>] ? update_sm_ah+0xad/0x100 [mlx4_ib]
          [<ffffffff815523b1>] _raw_spin_lock+0x41/0x50
          [<ffffffffa028af1d>] ? update_sm_ah+0xad/0x100 [mlx4_ib]
          [<ffffffffa026b2fa>] ? ib_create_ah+0x1a/0x40 [ib_core]
          [<ffffffffa028af1d>] update_sm_ah+0xad/0x100 [mlx4_ib]
          [<ffffffff810c27c3>] ? is_module_address+0x23/0x30
          [<ffffffffa028b05b>] handle_port_mgmt_change_event+0xeb/0x150 [mlx4_ib]
          [<ffffffffa028c177>] mlx4_ib_event+0x117/0x160 [mlx4_ib]
          [<ffffffff81552501>] ? _raw_spin_lock_irqsave+0x61/0x70
          [<ffffffffa022718c>] mlx4_dispatch_event+0x6c/0x90 [mlx4_core]
          [<ffffffffa0221b40>] mlx4_eq_int+0x500/0x950 [mlx4_core]
      
      Reported by: Or Gerlitz <ogerlitz@mellanox.com>
      Tested-by: NBart Van Assche <bvanassche@acm.org>
      Signed-off-by: NJack Morgenstein <jackm@dev.mellanox.co.il>
      Signed-off-by: NRoland Dreier <roland@purestorage.com>
      df7fba66
  10. 30 7月, 2012 2 次提交
    • S
      IPoIB: Use a private hash table for path lookup in xmit path · b63b70d8
      Shlomo Pongratz 提交于
      Dave Miller <davem@davemloft.net> provided a detailed description of
      why the way IPoIB is using neighbours for its own ipoib_neigh struct
      is buggy:
      
          Any time an ipoib_neigh is changed, a sequence like the following is made:
      
          			spin_lock_irqsave(&priv->lock, flags);
          			/*
          			 * It's safe to call ipoib_put_ah() inside
          			 * priv->lock here, because we know that
          			 * path->ah will always hold one more reference,
          			 * so ipoib_put_ah() will never do more than
          			 * decrement the ref count.
          			 */
          			if (neigh->ah)
          				ipoib_put_ah(neigh->ah);
          			list_del(&neigh->list);
          			ipoib_neigh_free(dev, neigh);
          			spin_unlock_irqrestore(&priv->lock, flags);
          			ipoib_path_lookup(skb, n, dev);
      
          This doesn't work, because you're leaving a stale pointer to the freed up
          ipoib_neigh in the special neigh->ha pointer cookie.  Yes, it even fails
          with all the locking done to protect _changes_ to *ipoib_neigh(n), and
          with the code in ipoib_neigh_free() that NULLs out the pointer.
      
          The core issue is that read side calls to *to_ipoib_neigh(n) are not
          being synchronized at all, they are performed without any locking.  So
          whether we hold the lock or not when making changes to *ipoib_neigh(n)
          you still can have threads see references to freed up ipoib_neigh
          objects.
      
          	cpu 1			cpu 2
          	n = *ipoib_neigh()
          				*ipoib_neigh() = NULL
          				kfree(n)
          	n->foo == OOPS
      
          [..]
      
          Perhaps the ipoib code can have a private path database it manages
          entirely itself, which holds all the necessary information and is
          looked up by some generic key which is available easily at transmit
          time and does not involve generic neighbour entries.
      
      See <http://marc.info/?l=linux-rdma&m=132812793105624&w=2> and
      <http://marc.info/?l=linux-rdma&w=2&r=1&s=allows+references+to+freed+memory&q=b>
      for the full discussion.
      
      This patch aims to solve the race conditions found in the IPoIB driver.
      
      The patch removes the connection between the core networking neighbour
      structure and the ipoib_neigh structure.  In addition to avoiding the
      race described above, it allows us to handle SKBs carrying IP packets
      that don't have any associated neighbour.
      
      We add an ipoib_neigh hash table with N buckets where the key is the
      destination hardware address.  The ipoib_neigh is fetched from the
      hash table and instead of the stashed location in the neighbour
      structure. The hash table uses both RCU and reference counting to
      guarantee that no ipoib_neigh instance is ever deleted while in use.
      
      Fetching the ipoib_neigh structure instance from the hash also makes
      the special code in ipoib_start_xmit that handles remote and local
      bonding failover redundant.
      
      Aged ipoib_neigh instances are deleted by a garbage collection task
      that runs every M seconds and deletes every ipoib_neigh instance that
      was idle for at least 2*M seconds. The deletion is safe since the
      ipoib_neigh instances are protected using RCU and reference count
      mechanisms.
      
      The number of buckets (N) and frequency of running the GC thread (M),
      are taken from the exported arb_tbl.
      Signed-off-by: NShlomo Pongratz <shlomop@mellanox.com>
      Signed-off-by: NOr Gerlitz <ogerlitz@mellanox.com>
      Signed-off-by: NRoland Dreier <roland@purestorage.com>
      b63b70d8
    • M
      IB/qib: Fix size of cc_supported_table_entries · 5d7fe4ef
      Mike Marciniszyn 提交于
      Commit 36a8f01c ("IB/qib: Add congestion control agent
      implementation") tries to store the value 1984 in a u8, which leads to
      truncation.  Fix this by making the member big enough.
      
      This bug was detected by a smatch warning.
      Reported-by: NDan Carpenter <dan.carpenter@oracle.com>
      Reviewed-by: NRamkrishna Vepa <ramkrishna.vepa@intel.com>
      Signed-off-by: NMike Marciniszyn <mike.marciniszyn@intel.com>
      Signed-off-by: NRoland Dreier <roland@purestorage.com>
      5d7fe4ef
  11. 28 7月, 2012 3 次提交
  12. 20 7月, 2012 4 次提交
  13. 19 7月, 2012 1 次提交
  14. 18 7月, 2012 1 次提交
  15. 17 7月, 2012 2 次提交
    • D
      net: Pass optional SKB and SK arguments to dst_ops->{update_pmtu,redirect}() · 6700c270
      David S. Miller 提交于
      This will be used so that we can compose a full flow key.
      
      Even though we have a route in this context, we need more.  In the
      future the routes will be without destination address, source address,
      etc. keying.  One ipv4 route will cover entire subnets, etc.
      
      In this environment we have to have a way to possess persistent storage
      for redirects and PMTU information.  This persistent storage will exist
      in the FIB tables, and that's why we'll need to be able to rebuild a
      full lookup flow key here.  Using that flow key will do a fib_lookup()
      and create/update the persistent entry.
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      6700c270
    • C
      srpt: use target_execute_cmd for WRITEs in srpt_handle_rdma_comp · e672a47f
      Christoph Hellwig 提交于
      srpt_handle_rdma_comp is called from kthread context and thus can execute
      target_execute_cmd directly.  srpt_abort_cmd sets the CMD_T_LUN_STOP
      flag directly, and thus the abuse of transport_generic_handle_data can be
      replaced with an opencoded variant of that code path.  I'm still not happy
      about a fabric driver poking into target core internals like this, but
      let's defer the bigger architecture changes for now.
      Signed-off-by: NChristoph Hellwig <hch@lst.de>
      Signed-off-by: NNicholas Bellinger <nab@linux-iscsi.org>
      e672a47f
  16. 12 7月, 2012 5 次提交
  17. 11 7月, 2012 3 次提交
    • E
      IPoIB: fix skb truesize underestimatiom · b28ba726
      Eric Dumazet 提交于
      Or Gerlitz reported triggering of WARN_ON_ONCE(delta < len); in
      skb_try_coalesce()
      This warning tracks drivers that incorrectly set skb->truesize
      
      IPoIB indeed allocates a full page to store a fragment, but only
      accounts in skb->truesize the used part of the page (frame length)
      
      This patch fixes skb truesize underestimation, and
      also fixes a performance issue, because RX skbs have not enough tailroom
      to allow IP and TCP stacks to pull their header in skb linear part
      without an expensive call to pskb_expand_head()
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Reported-by: NOr Gerlitz <ogerlitz@mellanox.com>
      Cc: Erez Shitrit <erezsh@mellanox.com>
      Cc: Shlomo Pongartz <shlomop@mellanox.com>
      Cc: Roland Dreier <roland@purestorage.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      b28ba726
    • M
      IB/qib: Fix sparse RCU warnings in qib_keys.c · 7e230177
      Mike Marciniszyn 提交于
      Commit 8aac4cc3 ("IB/qib: RCU locking for MR validation") introduced
      new sparse warnings in qib_keys.c.
      Acked-by: NFengguang Wu <fengguang.wu@intel.com>
      Signed-off-by: NMike Marciniszyn <mike.marciniszyn@intel.com>
      Signed-off-by: NRoland Dreier <roland@purestorage.com>
      7e230177
    • J
      mlx4: Use port management change event instead of smp_snoop · 00f5ce99
      Jack Morgenstein 提交于
      The port management change event can replace smp_snoop.  If the
      capability bit for this event is set in dev-caps, the event is used
      (by the driver setting the PORT_MNG_CHG_EVENT bit in the async event
      mask in the MAP_EQ fw command).  In this case, when the driver passes
      incoming SMP PORT_INFO SET mads to the FW, the FW generates port
      management change events to signal any changes to the driver.
      
      If the FW generates these events, smp_snoop shouldn't be invoked in
      ib_process_mad(), or duplicate events will occur (once from the
      FW-generated event, and once from smp_snoop).
      
      In the case where the FW does not generate port management change
      events smp_snoop needs to be invoked to create these events.  The flow
      in smp_snoop has been modified to make use of the same procedures as
      in the fw-generated-event event case to generate the port management
      events (LID change, Client-rereg, Pkey change, and/or GID change).
      
      Port management change event handling required changing the
      mlx4_ib_event and mlx4_dispatch_event prototypes; the "param" argument
      (last argument) had to be changed to unsigned long in order to
      accomodate passing the EQE pointer.
      
      We also needed to move the definition of struct mlx4_eqe from
      net/mlx4.h to file device.h -- to make it available to the IB driver,
      to handle port management change events.
      Signed-off-by: NJack Morgenstein <jackm@dev.mellanox.co.il>
      Signed-off-by: NOr Gerlitz <ogerlitz@mellanox.com>
      Signed-off-by: NRoland Dreier <roland@purestorage.com>
      00f5ce99
  18. 09 7月, 2012 2 次提交
    • M
      IB/qib: RCU locking for MR validation · 8aac4cc3
      Mike Marciniszyn 提交于
      Profiling indicates that MR validation locking is expensive.  The MR
      table is largely read-only and is a suitable candidate for RCU locking.
      
      The patch uses RCU locking during validation to eliminate one
      lock/unlock during that validation.
      Reviewed-by: NMike Heinz <michael.william.heinz@intel.com>
      Signed-off-by: NMike Marciniszyn <mike.marciniszyn@intel.com>
      Signed-off-by: NRoland Dreier <roland@purestorage.com>
      8aac4cc3
    • M
      IB/qib: Avoid returning EBUSY from MR deregister · 6a82649f
      Mike Marciniszyn 提交于
      A timing issue can occur where qib_mr_dereg can return -EBUSY if the
      MR use count is not zero.
      
      This can occur if the MR is de-registered while RDMA read response
      packets are being progressed from the SDMA ring.  The suspicion is
      that the peer sent an RDMA read request, which has already been copied
      across to the peer.  The peer sees the completion of his request and
      then communicates to the responder that the MR is not needed any
      longer.  The responder tries to de-register the MR, catching some
      responses remaining in the SDMA ring holding the MR use count.
      
      The code now uses a get/put paradigm to track MR use counts and
      coordinates with the MR de-registration process using a completion
      when the count has reached zero.  A timeout on the delay is in place
      to catch other EBUSY issues.
      
      The reference count protocol is as follows:
      - The return to the user counts as 1
      - A reference from the lk_table or the qib_ibdev counts as 1.
      - Transient I/O operations increase/decrease as necessary
      
      A lot of code duplication has been folded into the new routines
      init_qib_mregion() and deinit_qib_mregion().  Additionally, explicit
      initialization of fields to zero is now handled by kzalloc().
      
      Also, duplicated code 'while.*num_sge' that decrements reference
      counts have been consolidated in qib_put_ss().
      Reviewed-by: NRamkrishna Vepa <ramkrishna.vepa@intel.com>
      Signed-off-by: NMike Marciniszyn <mike.marciniszyn@intel.com>
      Signed-off-by: NRoland Dreier <roland@purestorage.com>
      6a82649f