Commits · 819a087316a63ef7d60e7b816d18c4e29a05861a · openeuler / raspberrypi-kernel

22 Feb, 2013 2 commits

IB/iser: Avoid error prints on EAGAIN registration failures · 819a0873

Committed by Or Gerlitz on Feb 21, 2013

Under IO/CPU stress it's possible that the FMR pool might not have a
free FMR mapping element for iSER to use because of incomplete
background unmapping processing.  In that case we get -EAGAIN and the
IO is pushed back to the SCSI layer which soon retries it.  No need to
be so verbose about that.
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>

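The policy described in commit 819a0873 above can be illustrated with a small user-space sketch. It is not the iSER driver code: `reg_error_is_noteworthy()` and `report_reg_result()` are hypothetical names, and the only thing taken from the commit message is the idea that -EAGAIN from the FMR pool is an expected, retryable condition that does not deserve an error print.

```c
#include <errno.h>
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical helper: is this registration result worth an error print? */
static bool reg_error_is_noteworthy(int err)
{
    /* -EAGAIN only means no free mapping element right now; the IO is retried. */
    return err != 0 && err != -EAGAIN;
}

/* Hypothetical wrapper: log only real failures, always hand the code back
 * so the caller can still push the IO back to the upper layer. */
static int report_reg_result(int err)
{
    if (reg_error_is_noteworthy(err))
        fprintf(stderr, "memory registration failed: %d\n", err);
    return err;
}
```

The error code is returned unchanged in every case, so only the verbosity changes, not the retry behaviour.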

IB/iser: Use proper define for the commands per LUN value advertised to SCSI ML · b96e4aba

Committed by Or Gerlitz on Feb 21, 2013

ISER_DEF_CMD_PER_LUN was meant to be ISCSI_DEF_XMIT_CMDS_MAX, not plain 128
Signed-off-by: Roi Dayan <roid@mellanox.com>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>

06 Feb, 2013 1 commit

IPoIB: Fix crash due to skb double destruct · 7e5a90c2

Committed by Shlomo Pongratz on Feb 04, 2013

After commit b13912bb ("IPoIB: Call skb_dst_drop() once skb is
enqueued for sending"), using connected mode and running multithreaded
iperf for a long time, i.e.

    iperf -c <IP> -P 16 -t 3600

results in a crash.

After the above-mentioned patch, the driver is calling skb_orphan() and
skb_dst_drop() after calling post_send() in ipoib_cm.c::ipoib_cm_send()
(also in ipoib_ib.c::ipoib_send())

The problem with this is, as is written in a comment in both routines,
"it's entirely possible that the completion handler will run before we
execute anything after the post_send()."  This leads to running the
skb cleanup routines simultaneously in two different contexts.

The solution is to always perform the skb_orphan() and skb_dst_drop()
before queueing the send work request.  If an error occurs, then it
will be no different from the regular case where dev_free_skb_any() runs
in the completion path, which is assumed to happen after these two routines.
Signed-off-by: Shlomo Pongratz <shlomop@mellanox.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>

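The ordering problem fixed by commit 7e5a90c2 above can be modelled in plain C. This is a sequential toy model, not kernel code: `struct fake_skb` and all function names are hypothetical, and the "completion handler runs before post_send() returns" case is simulated by having the fake post routine mark the buffer as released immediately.

```c
#include <stdbool.h>

/* Hypothetical stand-in for an skb; only the fields needed for the model. */
struct fake_skb {
    bool has_owner;          /* models the socket ownership dropped by skb_orphan() */
    bool has_dst;            /* models the dst reference dropped by skb_dst_drop() */
    bool freed;              /* set once the completion path released the buffer */
    int  touched_after_free; /* counts cleanup attempts on a released buffer */
};

/* Models skb_orphan() + skb_dst_drop(); records misuse instead of crashing. */
static void fake_orphan_and_dst_drop(struct fake_skb *skb)
{
    if (skb->freed) {
        skb->touched_after_free++;
        return;
    }
    skb->has_owner = false;
    skb->has_dst = false;
}

/* Worst case from the commit message: the completion handler has already
 * run, and released the buffer, by the time the post call returns. */
static void fake_post_send(struct fake_skb *skb)
{
    skb->freed = true;
}

/* Old ordering: cleanup after posting. */
static int send_cleanup_after_post(struct fake_skb *skb)
{
    fake_post_send(skb);
    fake_orphan_and_dst_drop(skb);
    return skb->touched_after_free;
}

/* Fixed ordering: cleanup before posting. */
static int send_cleanup_before_post(struct fake_skb *skb)
{
    fake_orphan_and_dst_drop(skb);
    fake_post_send(skb);
    return skb->touched_after_free;
}
```

With the old ordering the model registers one access to an already released buffer; with the fixed ordering it registers none, because nothing touches the buffer once it has been handed to the hardware queue.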

20 Dec, 2012 1 commit

IPoIB: Call skb_dst_drop() once skb is enqueued for sending · b13912bb

Committed by Roland Dreier on Dec 19, 2012

Currently, IPoIB delays collecting send completions for TX packets in
order to batch work more efficiently.  It does skb_orphan() right after
queuing the packets so that destructors run early, to avoid problems
like holding socket send buffers for too long (since we might not
collect a send completion until a long time after the packet is
actually sent).

However, IPoIB clears IFF_XMIT_DST_RELEASE because it actually looks
at skb_dst() to update the PMTU when it gets a too-long packet.  This
means that the packets sitting in the TX ring with uncollected send
completions are holding a reference on the dst.  We've seen this lead
to pathological behavior with respect to route and neighbour GC.  The
easy fix for this is to call skb_dst_drop() when we call skb_orphan().

Also, give packets sent via connected mode (CM) the same skb_orphan()
/ skb_dst_drop() treatment that packets sent via datagram mode get.
Signed-off-by: Roland Dreier <roland@purestorage.com>

01 Dec, 2012 12 commits

IB/srp: Allow SRP disconnect through sysfs · dc1bdbd9

Committed by Bart Van Assche on Sep 16, 2011

Make it possible to disconnect the IB RC connection used by the SRP
protocol to communicate with a target.

Have the SRP transport layer create a sysfs "delete" attribute for
initiator drivers that support this functionality.
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Acked-by: David Dillow <dillowda@ornl.gov>
Cc: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp>
Cc: Robert Jennings <rcj@linux.vnet.ibm.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>

IB/srp: send disconnect request without waiting for CM timewait exit · 55d93898

Committed by Vu Pham on Nov 26, 2012

Now that SRP recreates the CM ID, QP, and CQ for each connection,
there is no need to wait for the timewait state to complete.
Signed-off-by: Vu Pham <vu@mellanox.com>
Signed-off-by: David Dillow <dillowda@ornl.gov>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Roland Dreier <roland@purestorage.com>

IB/srp: destroy and recreate QP and CQs when reconnecting · 73aa89ed

Committed by Ishai Rabinovitz on Nov 26, 2012

HW QP FATAL errors persist over a reset operation, but we can recover
from that by recreating the QP and associated CQs for each connection.
Creating a new QP/CQ also completely forecloses any possibility of
getting stale completions or packets on the new connection.
Signed-off-by: Ishai Rabinovitz <ishai@mellanox.co.il>
Signed-off-by: Michael S. Tsirkin <mst@mellanox.co.il>

[ updated to current code from OFED, cleaned up commit message ]
Signed-off-by: David Dillow <dillowda@ornl.gov>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Roland Dreier <roland@purestorage.com>

IB/srp: Eliminate state SRP_TARGET_DEAD · ef6c49d8

Committed by Bart Van Assche on Dec 26, 2011

Only queue removal work after having changed the target state
into SRP_TARGET_REMOVED and not if that state was already equal
to SRP_TARGET_REMOVED.  That allows us to remove the state
SRP_TARGET_DEAD.  Add a call to srp_disconnect_target() in
srp_remove_target() -- due to previous changes it is now safe to
invoke that function even if the IB connection has already
been disconnected.  This change allows us to replace the target
removal code in srp_remove_one() by an (indirect) call to
srp_remove_target().  Rename srp_target_port.work into
srp_target_port.remove_work to reflect its usage.
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Acked-by: David Dillow <dillowda@ornl.gov>
Signed-off-by: Roland Dreier <roland@purestorage.com>

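The rule at the start of commit ef6c49d8 above (queue removal work only when this call is the one that moved the target into the removed state) can be sketched as follows. The types and names are hypothetical and only modelled on SRP_TARGET_REMOVED; the real driver would perform the state change under a lock, which this single-threaded sketch leaves out.

```c
#include <stdbool.h>

/* Hypothetical two-state model; the real driver has more states. */
enum target_state { TARGET_LIVE, TARGET_REMOVED };

struct target {
    enum target_state state;
    int removal_work_queued; /* how many times removal work was queued */
};

/* Returns true only if this call performed the LIVE -> REMOVED transition. */
static bool target_mark_removed(struct target *t)
{
    if (t->state == TARGET_REMOVED)
        return false;
    t->state = TARGET_REMOVED;
    return true;
}

/* Queue removal work at most once, no matter how often this is called. */
static void target_queue_removal(struct target *t)
{
    if (target_mark_removed(t))
        t->removal_work_queued++;
}
```

Because the transition itself decides who queues the work, no extra "dead" state is needed to remember that removal is already pending.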

IB/srp: Introduce the helper function srp_remove_target() · ee12d6a8

Committed by Bart Van Assche on Dec 25, 2011

Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Acked-by: David Dillow <dillowda@ornl.gov>
Signed-off-by: Roland Dreier <roland@purestorage.com>

IB/srp: Suppress superfluous error messages · 294c875a

Committed by Bart Van Assche on Dec 25, 2011

Keep track of the connection state.  Only report QP errors while
connected.  Only invoke ib_send_cm_dreq() when connected so that
invoking srp_disconnect_target() after having received a DREQ does not
cause an error message to be printed.
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Acked-by: David Dillow <dillowda@ornl.gov>
Signed-off-by: Roland Dreier <roland@purestorage.com>

IB/srp: Process all error completions · 4f0af697

Committed by Bart Van Assche on Nov 26, 2012

If the RDMA RC connection is closed, tell the SCSI mid-layer to
terminate all pending commands instead of only the first.
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Acked-by: David Dillow <dillowda@ornl.gov>
Signed-off-by: Roland Dreier <roland@purestorage.com>

IB/srp: Introduce srp_handle_qp_err() · 948d1e88

Committed by Bart Van Assche on Sep 03, 2011

Introduce the function srp_handle_qp_err(), change the type of
qp_in_error from int into bool and move the initialization of that
variable from srp_reconnect_target() to srp_connect_target().
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Acked-by: David Dillow <dillowda@ornl.gov>
Signed-off-by: Roland Dreier <roland@purestorage.com>

IB/srp: Simplify SCSI error handling · 224db157

Committed by Bart Van Assche on Oct 24, 2012

Since scsi_remove_host() has been modified so that SCSI error handling
functions will no longer be invoked after scsi_remove_host() returns,
the test at the start of srp_send_tsk_mgmt() is now superfluous.
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Acked-by: David Dillow <dillowda@ornl.gov>
Signed-off-by: Roland Dreier <roland@purestorage.com>

IB/srp: Keep processing commands during host removal · f3718231

Committed by Bart Van Assche on Apr 19, 2012

Some SCSI upper layer drivers, e.g. sd, issue SCSI commands from
inside scsi_remove_host() (see the sd_shutdown() call in sd_remove()).
Make sure that these commands have a chance to reach the SCSI device.
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Acked-by: David Dillow <dillowda@ornl.gov>
Signed-off-by: Roland Dreier <roland@purestorage.com>

IB/srp: Eliminate state SRP_TARGET_CONNECTING · 09be70a2

Committed by Bart Van Assche on Mar 17, 2012

Block the SCSI host while reconnecting instead of representing the
reconnection activity as a distinct SRP target state.  This allows us
to eliminate the target state SRP_TARGET_CONNECTING.
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Acked-by: David Dillow <dillowda@ornl.gov>
Signed-off-by: Roland Dreier <roland@purestorage.com>

IB/srp: Increase block layer timeout · c9b03c1a

Committed by Bart Van Assche on Sep 03, 2011

Increase the block layer timeout for disks so that it is above the
InfiniBand transport layer timeout.
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Acked-by: David Dillow <dillowda@ornl.gov>
Signed-off-by: Roland Dreier <roland@purestorage.com>

29 Nov, 2012 1 commit

ib_srpt: Convert TMR path to target_submit_tmr · 3e4f5748

Committed by Nicholas Bellinger on Nov 28, 2012

This patch converts the TMR path in srpt_handle_tsk_mgmt() to use
target_submit_tmr() with TARGET_SCF_ACK_KREF flag usage.

v2: Drop unused res in target_submit_tmr (Fengguang.Wu)

Cc: Christoph Hellwig <hch@lst.de>
Cc: Bart Van Assche <bvanassche@acm.org>
Cc: Roland Dreier <roland@kernel.org>
Signed-off-by: Nicholas Bellinger <nab@linux-iscsi.org>

28 Nov, 2012 1 commit

ib_srpt: Convert I/O path to target_submit_cmd + drop legacy ioctx->kref · 9474b043

Committed by Nicholas Bellinger on Nov 27, 2012

This patch converts the main srpt_handle_cmd() I/O path to use modern
target_submit_cmd() with TARGET_SCF_ACK_KREF flag usage.  This includes
dropping the original internal ioctx->kref + srpt_put_send_ioctx() usage
in favor of target_put_sess_cmd() w/ se_cmd_t->cmd_kref within ib_srpt
response callbacks.

It also updates srpt_abort_cmd() to call target_put_sess_cmd() for
completion of aborted commands, and adds target_wait_for_sess_cmds() into
srpt_release_channel_work() to allow outstanding I/O to complete during
session shutdown.

Also, go ahead and update srpt_handle_tsk_mgmt() to make the remaining
transport_init_se_cmd() call to set up the ioctx->cmd with se_tmr_req.

Cc: Christoph Hellwig <hch@lst.de>
Cc: Bart Van Assche <bvanassche@acm.org>
Cc: Roland Dreier <roland@kernel.org>
Signed-off-by: Nicholas Bellinger <nab@linux-iscsi.org>

07 Nov, 2012 1 commit

target: pass sense_reason as a return value · de103c93

Committed by Christoph Hellwig on Nov 06, 2012

Pass the sense reason as an explicit return value from the I/O submission
path instead of storing it in struct se_cmd and using negative return
values.  This cleans up a lot of the code paths, and with the sparse
annotations for the new sense_reason_t type allows for much better
error checking.

(nab: Convert spc_emulate_modesense + spc_emulate_modeselect to use
      sense_reason_t with Roland's MODE SELECT changes)
Signed-off-by: Christoph Hellwig <hch@lst.de>
Cc: Roland Dreier <roland@purestorage.com>
Signed-off-by: Nicholas Bellinger <nab@linux-iscsi.org>

04 Oct, 2012 1 commit

IB/iser: Add more RX CQs to scale out processing of SCSI responses · 5a33a669

Committed by Alex Tabachnik on Sep 23, 2012

RX/TX CQs will now be selected from a per HCA pool.  For the RX flow
this has the effect of using different interrupt vectors when using
low level drivers (such as mlx4) that map the "vector" param provided
by the ULP on CQ creation to a dedicated IRQ/MSI-X vector.  This
allows the RX flow processing of IO responses to be distributed across
multiple CPUs.

QPs (--> iSER sessions) are assigned to CQs in round robin order using
the CQ with the minimum number of sessions attached to it.
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Alex Tabachnik <alext@mellanox.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>

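The CQ assignment policy from commit 5a33a669 above (attach each new session to the CQ that currently has the fewest sessions) can be sketched in a few lines of C. The function names and the plain `int` counter array are assumptions for illustration; the driver keeps this bookkeeping in its own per-HCA structures, and breaking ties by lowest index is a choice made here, not something the commit message states.

```c
#include <stddef.h>

/* Return the index of the CQ with the fewest attached sessions.
 * Ties go to the lowest index (an assumption of this sketch). n must be > 0. */
static size_t pick_least_loaded_cq(const int *active, size_t n)
{
    size_t best = 0;

    for (size_t i = 1; i < n; i++) {
        if (active[i] < active[best])
            best = i;
    }
    return best;
}

/* Attach one new session: choose a CQ and bump its session count. */
static size_t attach_session(int *active, size_t n)
{
    size_t idx = pick_least_loaded_cq(active, n);

    active[idx]++;
    return idx;
}
```

Starting from an empty pool this degenerates into round robin, which matches the wording of the commit message, while a pool that has become unbalanced through disconnects fills its least loaded CQ first.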

03 Oct, 2012 1 commit

IPoIB: Fix build with CONFIG_INFINIBAND_IPOIB_CM=n · 71d9c5f9

Committed by Roland Dreier on Oct 02, 2012

With the new netlink support in commit 862096a8 ("IB/ipoib: Add more
rtnl_link_ops callbacks") we need ipoib_set_mode() to be available even
if connected mode isn't built.  Move the function from ipoib_cm.c to
ipoib_main.c (and make a few CM-related macros available unconditionally).

This fixes the build error

    drivers/built-in.o: In function 'ipoib_changelink':
    ipoib_netlink.c:(.text+0x6a5fc9): undefined reference to 'ipoib_set_mode'
    ipoib_netlink.c:(.text+0x6a5fe3): undefined reference to 'ipoib_set_mode'

when CONFIG_INFINIBAND_IPOIB_CM isn't set.
Reported-by: Randy Dunlap <rdunlap@xenotime.net>
Reported-by: Michael Neuling <mikey@neuling.org>
Signed-off-by: Roland Dreier <roland@purestorage.com>

02 Oct, 2012 1 commit

IB/ipoib: Add more rtnl_link_ops callbacks · 862096a8

Committed by Or Gerlitz on Sep 27, 2012

Add the rtnl_link_ops changelink and fill_info callbacks, through
which the admin can now set/get the driver mode and other policies.
Maintain the proprietary sysfs entries only for legacy child devices.

For child devices, set dev->iflink to point to the parent
device ifindex, such that user space tools can now correctly
show the uplink relation as done for vlan, macvlan, etc
devices. Pointed out by Patrick McHardy <kaber@trash.net>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

01 Oct, 2012 3 commits

IB/srp: Avoid having aborted requests hang · d8536670

Committed by Bart Van Assche on Aug 24, 2012

We need to call scsi_done() for commands after we abort them.
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Acked-by: David Dillow <dillowda@ornl.gov>
Cc: <stable@vger.kernel.org>
Signed-off-by: Roland Dreier <roland@purestorage.com>

IB/srp: Fix use-after-free in srp_reset_req() · 9b796d06

Committed by Bart Van Assche on Aug 24, 2012

srp_free_req() uses the scsi_cmnd structure contents to unmap
buffers, so we must invoke srp_free_req() before we release
ownership of that structure.
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Acked-by: David Dillow <dillowda@ornl.gov>
Cc: <stable@vger.kernel.org>
Signed-off-by: Roland Dreier <roland@purestorage.com>

IPoIB: Fix use-after-free of multicast object · bea1e22d

Committed by Patrick McHardy on Aug 30, 2012

Fix a crash in ipoib_mcast_join_task().  (with help from Or Gerlitz)

Commit c8c2afe3 ("IPoIB: Use rtnl lock/unlock when changing device
flags") added a call to rtnl_lock() in ipoib_mcast_join_task(), which
is run from the ipoib_workqueue, and hence the workqueue can't be
flushed from the context of ipoib_stop().

In the current code, ipoib_stop() (which doesn't flush the workqueue)
calls ipoib_mcast_dev_flush(), which goes and deletes all the
multicast entries.  This takes place without any synchronization with
a possible running instance of ipoib_mcast_join_task() for the same
ipoib device, leading to a crash due to NULL pointer dereference.

Fix this by making sure that the workqueue is flushed before
ipoib_mcast_dev_flush() is called.  To make that possible, we move the
RTNL-lock wrapped code to ipoib_mcast_join_finish().
Signed-off-by: Patrick McHardy <kaber@trash.net>
Cc: <stable@vger.kernel.org>
Signed-off-by: Roland Dreier <roland@purestorage.com>

21 Sep, 2012 1 commit

IB/ipoib: Add rtnl_link_ops support · 9baa0b03

Committed by Or Gerlitz on Sep 13, 2012

Add rtnl_link_ops to IPoIB, with the first usage being child device
create/delete through them. Child devices are now either legacy ones,
created/deleted through the ipoib sysfs entries, or RTNL ones.

Adding support for RTNL child devices involved refactoring of ipoib_vlan_add
which is now used by both the sysfs and the link_ops code.

Also, added ndo_uninit entry to support calling unregister_netdevice_queue
from the rtnl dellink entry. This required removal of calls to
ipoib_dev_cleanup from the driver in flows which use unregister_netdevice,
since the networking core will invoke ipoib_uninit which does exactly that.
Signed-off-by: Erez Shitrit <erezsh@mellanox.co.il>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

18 Sep, 2012 2 commits

target: Simplify fabric sense data length handling · 9c58b7dd

Committed by Roland Dreier on Aug 15, 2012

Every fabric driver has to supply a se_tfo->set_fabric_sense_len()
method, just so iSCSI can return an offset of 2.  However, every fabric
driver is already allocating a sense buffer and passing it into the
target core, either via transport_init_se_cmd() or target_submit_cmd().

So instead of having iSCSI pass the start of its sense buffer into the
core and then later tell the core to skip the first 2 bytes, it seems
easier for iSCSI just to do the offset of 2 when it passes the sense
buffer into the core.  Then we can drop the se_tfo->set_fabric_sense_len()
everywhere, and just add a couple of lines of code to iSCSI to set the
sense data length to the beginning of the buffer right before it sends
it over the network.

(nab: Remove .set_fabric_sense_len usage from tcm_qla2xxx_npiv_ops +
      change transport_get_sense_buffer to follow v3.6-rc6 code w/o
      ->set_fabric_sense_len usage)
Signed-off-by: Roland Dreier <roland@purestorage.com>
Signed-off-by: Nicholas Bellinger <nab@linux-iscsi.org>

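The buffer layout that commit 9c58b7dd above moves into iSCSI can be sketched as follows. The helper names are hypothetical, and storing the length in big-endian byte order is an assumption of this sketch (the commit message only says the length is written to the beginning of the buffer). The point is that the core only ever sees the area past the two-byte prefix, so it needs no knowledge of the offset.

```c
#include <stddef.h>
#include <stdint.h>

#define SENSE_PREFIX_LEN 2 /* two-byte length field in front of the sense data */

/* Pointer that would be handed to the core: the area after the prefix. */
static uint8_t *sense_area_for_core(uint8_t *wire_buf)
{
    return wire_buf + SENSE_PREFIX_LEN;
}

/* Right before sending: store the sense length at the start of the buffer
 * (big-endian here, by assumption) and return the total bytes to transmit. */
static size_t finalize_sense(uint8_t *wire_buf, uint16_t sense_len)
{
    wire_buf[0] = (uint8_t)(sense_len >> 8);
    wire_buf[1] = (uint8_t)(sense_len & 0xff);
    return (size_t)sense_len + SENSE_PREFIX_LEN;
}
```

With this split, a per-fabric callback that reports an offset is no longer needed: the one fabric that wants a prefix allocates room for it and fills it in itself.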

target: Remove unused target_core_fabric_ops.get_fabric_sense_len method · 2ed772b7

Committed by Roland Dreier on Aug 15, 2012

There are no callers of se_tfo->get_fabric_sense_len(), so we should
stop having every fabric driver implement it.
Signed-off-by: Roland Dreier <roland@purestorage.com>
Signed-off-by: Nicholas Bellinger <nab@linux-iscsi.org>

13 Sep, 2012 2 commits

IPoIB: Fix AB-BA deadlock when deleting neighbours · b5120a6e

Committed by Shlomo Pongratz on Aug 29, 2012

Lockdep points out a circular locking dependency between the ipoib
device priv spinlock (priv->lock) and the neighbour table rwlock
(ntbl->rwlock).

In the normal path, i.e. the neighbour garbage collection task, the neigh
table rwlock is taken first and then if the neighbour needs to be
deleted, priv->lock is taken.

However in some error paths, such as in ipoib_cm_handle_tx_wc(),
priv->lock is taken first and then ipoib_neigh_free routine is called
which in turn takes the neighbour table ntbl->rwlock.

The solution is to get rid of the neigh table rwlock completely and use
only priv->lock.
Signed-off-by: Shlomo Pongratz <shlomop@mellanox.com>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>

IPoIB: Fix memory leak in the neigh table deletion flow · 66172c09

Committed by Shlomo Pongratz on Aug 29, 2012

If the neighbours hash table is empty when unloading the module, then
ipoib_flush_neighs(), the cleanup routine, isn't called and the
memory used for the hash table itself is leaked.

To fix this, ipoib_flush_neighs() is always called, and another
completion object is added to signal when the table is freed.

Once invoked, ipoib_flush_neighs() flushes all the neighbours (if
there are any), calls the hash table RCU free routine, which now
signals completion of the deletion process, and waits for the last
neighbour to be freed.
Signed-off-by: Shlomo Pongratz <shlomop@mellanox.com>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>

16 Aug, 2012 2 commits

IB/srp: Fix a race condition · 22032991

Committed by Bart Van Assche on Aug 14, 2012

Avoid a crash caused by the scmnd->scsi_done(scmnd) call in
srp_process_rsp() being invoked with scsi_done == NULL.  This can
happen if a reply is received during or after a command abort.
Reported-by: Joseph Glanville <joseph.glanville@orionvm.com.au>
Reference: http://marc.info/?l=linux-rdma&m=134314367801595
Cc: <stable@vger.kernel.org>
Acked-by: David Dillow <dillowda@ornl.gov>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Roland Dreier <roland@purestorage.com>

IB: Fix typos in infiniband drivers · 142ad5db

Committed by Masanari Iida on Aug 10, 2012

Correct spelling typos in comments in drivers/infiniband.
Signed-off-by: Masanari Iida <standby24x7@gmail.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>

15 Aug, 2012 2 commits

IB/ipoib: Fix RCU pointer dereference of wrong object · 6c723a68

Committed by Shlomo Pongratz on Aug 13, 2012

Commit b63b70d8 ("IPoIB: Use a private hash table for path lookup
in xmit path") introduced a bug where in ipoib_neigh_free() (which is
called from a few error flows in the driver), rcu_dereference() is
invoked with the wrong pointer object, which results in a crash.
Signed-off-by: Shlomo Pongratz <shlomop@mellanox.com>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>

IB/ipoib: Add missing locking when CM object is deleted · fa16ebed

Committed by Shlomo Pongratz on Aug 13, 2012

Commit b63b70d8 ("IPoIB: Use a private hash table for path lookup
in xmit path") introduced a bug where in ipoib_cm_destroy_tx() a CM
object is moved between lists without any supported locking.  Under a
stress test, this eventually leads to list corruption and a crash.

Previously when this routine was called, callers were taking the
device priv lock.  Currently this function is called from the RCU
callback associated with neighbour deletion.  Fix the race by taking
the same lock we used to before.
Signed-off-by: Shlomo Pongratz <shlomop@mellanox.com>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>

30 Jul, 2012 1 commit

IPoIB: Use a private hash table for path lookup in xmit path · b63b70d8

Committed by Shlomo Pongratz on Jul 24, 2012

Dave Miller <davem@davemloft.net> provided a detailed description of
why the way IPoIB is using neighbours for its own ipoib_neigh struct
is buggy:

    Any time an ipoib_neigh is changed, a sequence like the following is made:

    			spin_lock_irqsave(&priv->lock, flags);
    			/*
    			 * It's safe to call ipoib_put_ah() inside
    			 * priv->lock here, because we know that
    			 * path->ah will always hold one more reference,
    			 * so ipoib_put_ah() will never do more than
    			 * decrement the ref count.
    			 */
    			if (neigh->ah)
    				ipoib_put_ah(neigh->ah);
    			list_del(&neigh->list);
    			ipoib_neigh_free(dev, neigh);
    			spin_unlock_irqrestore(&priv->lock, flags);
    			ipoib_path_lookup(skb, n, dev);

    This doesn't work, because you're leaving a stale pointer to the freed up
    ipoib_neigh in the special neigh->ha pointer cookie.  Yes, it even fails
    with all the locking done to protect _changes_ to *ipoib_neigh(n), and
    with the code in ipoib_neigh_free() that NULLs out the pointer.

    The core issue is that read side calls to *to_ipoib_neigh(n) are not
    being synchronized at all, they are performed without any locking.  So
    whether we hold the lock or not when making changes to *ipoib_neigh(n)
    you still can have threads see references to freed up ipoib_neigh
    objects.

    	cpu 1			cpu 2
    	n = *ipoib_neigh()
    				*ipoib_neigh() = NULL
    				kfree(n)
    	n->foo == OOPS

    [..]

    Perhaps the ipoib code can have a private path database it manages
    entirely itself, which holds all the necessary information and is
    looked up by some generic key which is available easily at transmit
    time and does not involve generic neighbour entries.

See <http://marc.info/?l=linux-rdma&m=132812793105624&w=2> and
<http://marc.info/?l=linux-rdma&w=2&r=1&s=allows+references+to+freed+memory&q=b>
for the full discussion.

This patch aims to solve the race conditions found in the IPoIB driver.

The patch removes the connection between the core networking neighbour
structure and the ipoib_neigh structure.  In addition to avoiding the
race described above, it allows us to handle SKBs carrying IP packets
that don't have any associated neighbour.

We add an ipoib_neigh hash table with N buckets where the key is the
destination hardware address.  The ipoib_neigh is fetched from the
hash table instead of from the stashed location in the neighbour
structure.  The hash table uses both RCU and reference counting to
guarantee that no ipoib_neigh instance is ever deleted while in use.

Fetching the ipoib_neigh structure instance from the hash also makes
the special code in ipoib_start_xmit that handles remote and local
bonding failover redundant.

Aged ipoib_neigh instances are deleted by a garbage collection task
that runs every M seconds and deletes every ipoib_neigh instance that
was idle for at least 2*M seconds. The deletion is safe since the
ipoib_neigh instances are protected using RCU and reference count
mechanisms.

The number of buckets (N) and the frequency of running the GC thread (M)
are taken from the exported arp_tbl.
Signed-off-by: Shlomo Pongratz <shlomop@mellanox.com>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>

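The aging rule from commit b63b70d8 above (a garbage collection pass every M seconds removes every entry that has been idle for at least 2*M seconds) can be sketched in isolation. This is a simplified model with hypothetical names: it uses a flat array instead of a hash table, plain integer timestamps instead of jiffies, and leaves out the RCU and reference counting that make the real deletion safe.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical table entry: last-use timestamp plus a membership flag. */
struct entry {
    unsigned long alive; /* time of last use, in seconds; assumed <= now */
    bool in_table;
};

/* An entry is stale once it has been idle for at least two GC intervals. */
static bool entry_is_stale(const struct entry *e, unsigned long now,
                           unsigned long gc_interval)
{
    return now - e->alive >= 2 * gc_interval;
}

/* One GC pass: unlink every stale entry and return how many were removed. */
static size_t gc_pass(struct entry *tbl, size_t n, unsigned long now,
                      unsigned long gc_interval)
{
    size_t removed = 0;

    for (size_t i = 0; i < n; i++) {
        if (tbl[i].in_table && entry_is_stale(&tbl[i], now, gc_interval)) {
            tbl[i].in_table = false;
            removed++;
        }
    }
    return removed;
}
```

Using a threshold of two intervals rather than one means an entry that was touched just after a pass still survives the next pass, so an entry in active use is not evicted merely because of where its last use fell relative to the GC timer.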

17 Jul, 2012 2 commits

net: Pass optional SKB and SK arguments to dst_ops->{update_pmtu,redirect}() · 6700c270

Committed by David S. Miller on Jul 17, 2012

This will be used so that we can compose a full flow key.

Even though we have a route in this context, we need more. In the
future the routes will be without destination address, source address,
etc. keying. One ipv4 route will cover entire subnets, etc.

In this environment we have to have a way to possess persistent storage
for redirects and PMTU information. This persistent storage will exist
in the FIB tables, and that's why we'll need to be able to rebuild a
full lookup flow key here. Using that flow key will do a fib_lookup()
and create/update the persistent entry.
Signed-off-by: David S. Miller <davem@davemloft.net>

srpt: use target_execute_cmd for WRITEs in srpt_handle_rdma_comp · e672a47f

Committed by Christoph Hellwig on Jul 08, 2012

srpt_handle_rdma_comp is called from kthread context and thus can execute
target_execute_cmd directly. srpt_abort_cmd sets the CMD_T_LUN_STOP
flag directly, and thus the abuse of transport_generic_handle_data can be
replaced with an opencoded variant of that code path. I'm still not happy
about a fabric driver poking into target core internals like this, but
let's defer the bigger architecture changes for now.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Nicholas Bellinger <nab@linux-iscsi.org>

11 Jul, 2012 1 commit

IPoIB: fix skb truesize underestimation · b28ba726

Committed by Eric Dumazet on Jul 10, 2012

Or Gerlitz reported triggering of WARN_ON_ONCE(delta < len); in
skb_try_coalesce().
This warning tracks drivers that incorrectly set skb->truesize.

IPoIB indeed allocates a full page to store a fragment, but only
accounts in skb->truesize the used part of the page (frame length).

This patch fixes skb truesize underestimation, and
also fixes a performance issue, because RX skbs do not have enough tailroom
to allow IP and TCP stacks to pull their header in skb linear part
without an expensive call to pskb_expand_head().
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Or Gerlitz <ogerlitz@mellanox.com>
Cc: Erez Shitrit <erezsh@mellanox.com>
Cc: Shlomo Pongartz <shlomop@mellanox.com>
Cc: Roland Dreier <roland@purestorage.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

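The accounting difference behind commit b28ba726 above can be shown with a toy model. The struct and function names are hypothetical, and the 4096-byte page size is an assumption of this sketch; the only thing taken from the commit message is that a whole page backs each fragment while only the frame length was being charged to truesize.

```c
/* Assumed page size for this sketch only. */
#define FAKE_PAGE_SIZE 4096u

/* Hypothetical stand-in for the two skb counters that matter here. */
struct fake_skb_acct {
    unsigned int len;      /* payload bytes carried */
    unsigned int truesize; /* memory actually charged to the buffer */
};

/* Before: only the used part of the page is charged. */
static void add_frag_underestimated(struct fake_skb_acct *skb,
                                    unsigned int frag_len)
{
    skb->len += frag_len;
    skb->truesize += frag_len;
}

/* After: the full page backing the fragment is charged. */
static void add_frag_full_page(struct fake_skb_acct *skb,
                               unsigned int frag_len)
{
    skb->len += frag_len;
    skb->truesize += FAKE_PAGE_SIZE;
}
```

Both variants report the same payload length; they differ only in how much memory they admit to holding, which is exactly what consistency checks on truesize compare against.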

09 Jul, 2012 1 commit

IB: Use IS_ENABLED(CONFIG_IPV6) · d90f9b35

Committed by Roland Dreier on Jul 05, 2012

Instead of testing defined(CONFIG_IPV6) || defined(CONFIG_IPV6_MODULE)
Signed-off-by: Roland Dreier <roland@purestorage.com>

06 Jul, 2012 1 commit

ipoib: Need to do dst_neigh_lookup_skb() outside of priv->lock. · 700db99d

Committed by David S. Miller on Jul 05, 2012

Otherwise local_bh_enable() complains.
Reported-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>