Commits · ef5217a6e2e60bc3d0679f2652480b99730956fe · openeuler / raspberrypi-kernel

26 Aug, 2015 · 40 commits

RDS: flush the FMR pool less often · ef5217a6

Committed by santosh.shilimkar@oracle.com on Aug 25, 2015

FMR flush is an expensive and time-consuming operation. Reduce the
frequency of FMR pool flushes by 50% so that more FMR work accumulates
for more efficient flushing.
Signed-off-by: Santosh Shilimkar <ssantosh@kernel.org>
Signed-off-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

RDS: push FMR pool flush work to its own worker · ad1d7dc0

Committed by santosh.shilimkar@oracle.com on Aug 25, 2015

The RDS FMR flush operation is expensive, and it also races with
connect/reconnect, which happens a lot with RDS. Having the FMR flush on
the common rds_wq aggravates the problem. Let's push the RDS FMR pool
flush work to its own worker.
Signed-off-by: Santosh Shilimkar <ssantosh@kernel.org>
Signed-off-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

RDS: fix fmr pool dirty_count · 6116c203

Committed by Wengang Wang on Aug 25, 2015

In rds_ib_flush_mr_pool(), dirty_count also counts the clean ones,
which is wrong. This can lead to a negative dirty count value.

Let's fix it.
Signed-off-by: Wengang Wang <wen.gang.wang@oracle.com>
Signed-off-by: Santosh Shilimkar <ssantosh@kernel.org>
Signed-off-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

RDS: Fix rds MR reference count in rds_rdma_unuse() · 3f6b3143

Committed by santosh.shilimkar@oracle.com on Aug 25, 2015

rds_rdma_unuse() drops an mr reference count that it hasn't
taken. The correct way to remove an mr is to remove it from the tree,
then rdma_destroy_mr() it, and then call rds_mr_put() to decrement
its reference count. Whichever thread holds the last reference will free
the mr via rds_mr_put().

This bug was triggering weird null pointer crashes. One of the traces
for it is captured below.

BUG: unable to handle kernel NULL pointer dereference at
0000000000000104
IP: [<ffffffffa0899471>] rds_ib_free_mr+0x31/0x130 [rds_rdma]
PGD 4366fa067 PUD 4366f9067 PMD 0
Oops: 0000 [#1] SMP

[...]

task: ffff88046da6a000 ti: ffff88046da6c000 task.ti: ffff88046da6c000
RIP: 0010:[<ffffffffa0899471>]  [<ffffffffa0899471>]
rds_ib_free_mr+0x31/0x130 [rds_rdma]
RSP: 0018:ffff88046fa43bd8  EFLAGS: 00010286
RAX: 0000000071d38b80 RBX: 0000000000000000 RCX: 0000000000000000
RDX: 0000000000000001 RSI: 0000000000000000 RDI: ffff880079e7ff40
RBP: ffff88046fa43bf8 R08: 0000000000000000 R09: 0000000000000000
R10: ffff88046fa43ca8 R11: ffff88046a802ed8 R12: ffff880079e7fa40
R13: 0000000000000000 R14: ffff880079e7ff40 R15: 0000000000000000
FS:  0000000000000000(0000) GS:ffff88046fa40000(0000)
knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 0000000000000104 CR3: 00000004366fb000 CR4: 00000000000006e0
Stack:
 ffff880079e7fa40 ffff880671d38f08 ffff880079e7ff40 0000000000000296
 ffff88046fa43c28 ffffffffa087a38b ffff880079e7fa40 ffff880671d38f10
 0000000000000000 0000000000000292 ffff88046fa43c48 ffffffffa087a3b6
Call Trace:
 <IRQ>
 [<ffffffffa087a38b>] rds_destroy_mr+0x8b/0xa0 [rds]
 [<ffffffffa087a3b6>] __rds_put_mr_final+0x16/0x30 [rds]
 [<ffffffffa087a492>] rds_rdma_unuse+0xc2/0x120 [rds]
 [<ffffffffa08766d3>] rds_recv_incoming_exthdrs+0x83/0xa0 [rds]
 [<ffffffffa0876782>] rds_recv_incoming+0x92/0x200 [rds]
 [<ffffffffa0895269>] rds_ib_process_recv+0x259/0x320 [rds_rdma]
 [<ffffffffa08962a8>] rds_ib_recv_tasklet_fn+0x1a8/0x490 [rds_rdma]
 [<ffffffff810dcd78>] ? __remove_hrtimer+0x58/0x90
 [<ffffffff810799e1>] tasklet_action+0xb1/0xc0
 [<ffffffff81079b52>] __do_softirq+0xe2/0x290
 [<ffffffff81079df6>] irq_exit+0xa6/0xb0
 [<ffffffff81613915>] do_IRQ+0x65/0xf0
 [<ffffffff816118ab>] common_interrupt+0x6b/0x6b
Signed-off-by: Santosh Shilimkar <ssantosh@kernel.org>
Signed-off-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
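
The ownership rule described in this commit can be modeled in user space. The sketch below is a minimal Python model, not the kernel code: `MemoryRegion`, `mr_put`, and `unuse` are hypothetical names, and it assumes the lookup tree owns exactly one reference to each MR.

```python
class MemoryRegion:
    """Toy stand-in for an RDS MR; all names here are illustrative."""

    def __init__(self, key):
        self.key = key
        self.refcount = 1       # assumption: the lookup tree owns one reference
        self.destroyed = False  # set once the mapping has been torn down
        self.freed = False      # set when the last reference is dropped


def mr_put(mr):
    """Drop one reference; whoever drops the last one frees the MR."""
    if mr.refcount <= 0:
        raise RuntimeError("put without a matching reference")
    mr.refcount -= 1
    if mr.refcount == 0:
        mr.freed = True


def unuse(tree, key):
    """Remove the MR from the tree, destroy it, then drop the tree's reference."""
    mr = tree.pop(key, None)    # after this, nobody can look the MR up again
    if mr is None:
        return None             # someone else already removed it
    mr.destroyed = True
    mr_put(mr)                  # this is the reference the tree used to hold
    return mr
```

If another holder still has a reference when `unuse()` runs, the MR is destroyed but stays allocated until that holder calls `mr_put()`.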


RDS: fix the dangling reference to rds_ib_incoming_slab · ba54d3ce

Committed by santosh.shilimkar@oracle.com on Aug 25, 2015

On rds_ib_frag_slab allocation failure, ensure rds_ib_incoming_slab
is not pointing to the destroyed memory.
Signed-off-by: Santosh Shilimkar <ssantosh@kernel.org>
Signed-off-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge tag 'batman-adv-for-davem' of git://git.open-mesh.org/linux-merge · b8766e4e

Committed by David S. Miller on Aug 25, 2015

Antonio Quartulli says:

====================
Included changes:
- code restyling and beautification
- use int kernel types instead of C99
- update kerneldoc
- prevent potential hlist double deletion of VLAN objects
- fix gw bandwidth calculation
- convert list to hlist when needed
- add lockdep_assert calls in functions with lock requirements
  described in kerneldoc
====================
Signed-off-by: David S. Miller <davem@davemloft.net>

enic: reduce ioread in devcmd2 · dafc2199

Committed by Govindarajulu Varadarajan on Aug 25, 2015

posted_index is RO in firmware. We need not do an ioread every time to get
the posted index. Store the posted index locally.
Signed-off-by: Govindarajulu Varadarajan <_govind@gmx.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

r8169: Add values missing in @get_stats64 from HW counters · 6e85d5ad

Committed by Corinna Vinschen on Aug 24, 2015

The r8169 driver collects statistical information returned by
@get_stats64 by counting them in the driver itself, even though many
(but not all) of the values are already collected by tally counters
(TCs) in the NIC.  Some of these TC values are not returned by
@get_stats64.  In particular, the received multicast packets are missing
from /proc/net/dev.

Rectify this by fetching the TCs and returning them from
rtl8169_get_stats64.

The counters collected in the driver obviously disappear as soon as the
driver is unloaded, so after a driver is loaded the counters always start
at 0. The TCs, on the other hand, are only reset by a power cycle.  Without
further considerations the values collected by the driver would not match
up against the TC values.

This patch introduces a new function rtl8169_reset_counters which
resets the TCs.  Also, since rtl8169_reset_counters shares most of
its code with rtl8169_update_counters, refactor the shared code into
two new functions, rtl8169_map_counters and rtl8169_unmap_counters.

Unfortunately chip versions prior to RTL_GIGA_MAC_VER_19 don't allow
resetting the TCs programmatically.  Therefore introduce an addition to
the rtl8169_private struct and a function rtl8169_init_counter_offsets
to store the TCs at first rtl_open.  Use these values as offsets in
rtl8169_get_stats64.  Propagate a failure to reset *and* update the
counters up to rtl_open and emit a warning message if so.
Signed-off-by: Corinna Vinschen <vinschen@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
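
The offset technique for chips that cannot reset their tally counters can be illustrated with a small user-space model. This is a sketch only: `TallyCounters`, `CounterView`, and the two counter fields are hypothetical names chosen for the example, not the driver's structures.

```python
from dataclasses import dataclass


@dataclass
class TallyCounters:
    """Hypothetical subset of the NIC's hardware tally counters."""
    tx_errors: int = 0
    rx_multicast: int = 0


class CounterView:
    """Reports hardware counters relative to the moment the device was opened."""

    def __init__(self, hw, can_reset):
        self.hw = hw
        if can_reset:
            # Chips that support it: clear the hardware counters directly.
            hw.tx_errors = 0
            hw.rx_multicast = 0
            self.offset = TallyCounters()
        else:
            # Older chips: remember the current values as offsets instead.
            self.offset = TallyCounters(hw.tx_errors, hw.rx_multicast)

    def get_stats(self):
        # Subtract the offsets so the reported counts start at 0 after open.
        return TallyCounters(
            self.hw.tx_errors - self.offset.tx_errors,
            self.hw.rx_multicast - self.offset.rx_multicast,
        )
```

Either way the caller sees counters that begin at zero, which keeps them comparable with the values the driver counts itself.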


rds: Fix improper gfp_t usage. · b01d04aa

Committed by David S. Miller on Aug 25, 2015

>> net/rds/ib_recv.c:382:28: sparse: incorrect type in initializer (different base types)
   net/rds/ib_recv.c:382:28:    expected int [signed] can_wait
   net/rds/ib_recv.c:382:28:    got restricted gfp_t
   net/rds/ib_recv.c:828:23: sparse: cast to restricted __le64
Reported-by: kbuild test robot <fengguang.wu@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

MAINTAINERS: update vmxnet3 driver maintainer · 04e1b734

Committed by Shrikrishna Khare on Aug 24, 2015

Shreyas Bhatewara will no longer maintain the vmxnet3 driver. I am taking
over the role of vmxnet3 maintainer.
Signed-off-by: Shrikrishna Khare <skhare@vmware.com>
Signed-off-by: Shreyas Bhatewara <sbhatewara@vmware.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

vxlan: fix multiple inclusion of vxlan.h · 48e92c44

Committed by Jiri Benc on Aug 25, 2015

The vxlan_get_sk_family inline function was added after the last #endif,
making multiple inclusion of net/vxlan.h fail. Move it to the proper place.
Reported-by: Mark Rustad <mark.d.rustad@intel.com>
Fixes: 705cc62f ("vxlan: provide access function for vxlan socket address family")
Signed-off-by: Jiri Benc <jbenc@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

MAINTAINERS: Add VRF entry · 081958eb

Committed by David Ahern on Aug 25, 2015

Add an entry for the new VRF device driver.
Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

route: fix a use-after-free · e252b3d1

Committed by WANG Cong on Aug 25, 2015

This patch fixes the following crash:

 general protection fault: 0000 [#1] SMP DEBUG_PAGEALLOC
 CPU: 1 PID: 0 Comm: swapper/1 Not tainted 4.2.0-rc7+ #166
 Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
 task: ffff88010656d280 ti: ffff880106570000 task.ti: ffff880106570000
 RIP: 0010:[<ffffffff8182f91b>]  [<ffffffff8182f91b>] dst_destroy+0xa6/0xef
 RSP: 0018:ffff880107603e38  EFLAGS: 00010202
 RAX: 0000000000000001 RBX: ffff8800d225a000 RCX: ffffffff82250fd0
 RDX: 0000000000000001 RSI: ffffffff82250fd0 RDI: 6b6b6b6b6b6b6b6b
 RBP: ffff880107603e58 R08: 0000000000000001 R09: 0000000000000001
 R10: 000000000000b530 R11: ffff880107609000 R12: 0000000000000000
 R13: ffffffff82343c40 R14: 0000000000000000 R15: ffffffff8182fb4f
 FS:  0000000000000000(0000) GS:ffff880107600000(0000) knlGS:0000000000000000
 CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
 CR2: 00007fcabd9d3000 CR3: 00000000d7279000 CR4: 00000000000006e0
 Stack:
  ffffffff82250fd0 ffff8801077d6f00 ffffffff82253c40 ffff8800d225a000
  ffff880107603e68 ffffffff8182fb5d ffff880107603f08 ffffffff810d795e
  ffffffff810d7648 ffff880106574000 ffff88010656d280 ffff88010656d280
 Call Trace:
  <IRQ>
  [<ffffffff8182fb5d>] dst_destroy_rcu+0xe/0x1d
  [<ffffffff810d795e>] rcu_process_callbacks+0x618/0x7eb
  [<ffffffff810d7648>] ? rcu_process_callbacks+0x302/0x7eb
  [<ffffffff8182fb4f>] ? dst_gc_task+0x1eb/0x1eb
  [<ffffffff8107e11b>] __do_softirq+0x178/0x39f
  [<ffffffff8107e52e>] irq_exit+0x41/0x95
  [<ffffffff81a4f215>] smp_apic_timer_interrupt+0x34/0x40
  [<ffffffff81a4d5cd>] apic_timer_interrupt+0x6d/0x80
  <EOI>
  [<ffffffff8100b968>] ? default_idle+0x21/0x32
  [<ffffffff8100b966>] ? default_idle+0x1f/0x32
  [<ffffffff8100bf19>] arch_cpu_idle+0xf/0x11
  [<ffffffff810b0bc7>] default_idle_call+0x1f/0x21
  [<ffffffff810b0dce>] cpu_startup_entry+0x1ad/0x273
  [<ffffffff8102fe67>] start_secondary+0x135/0x156

dst is freed right before lwtstate_put() is called, which is not correct.

Fixes: 61adedf3 ("route: move lwtunnel state to dst_entry")
Acked-by: Jiri Benc <jbenc@redhat.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: Cong Wang <cwang@twopensource.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net-next: Fix warning while make xmldocs caused by skbuff.c · d7499160

Committed by Masanari Iida on Aug 24, 2015

This patch fixes the following warnings.

.//net/core/skbuff.c:407: warning: No description found
for parameter 'len'
.//net/core/skbuff.c:407: warning: Excess function parameter
 'length' description in '__netdev_alloc_skb'
.//net/core/skbuff.c:476: warning: No description found
 for parameter 'len'
.//net/core/skbuff.c:476: warning: Excess function parameter
'length' description in '__napi_alloc_skb'
Signed-off-by: Masanari Iida <standby24x7@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

ppp: implement x-netns support · 79c441ae

Committed by Guillaume Nault on Aug 24, 2015

Let packets move from one netns to the other at PPP encapsulation and
decapsulation time.

PPP units and channels remain in the netns in which they were
originally created. Only the net_device may move to a different
namespace. Cross netns handling is thus transparent to lower PPP
layers (PPPoE, L2TP, etc.).

PPP devices are automatically unregistered when their netns gets
removed. So read() and poll() on the unit file descriptor will
respectively receive EOF and POLLHUP. Channels aren't affected.
Signed-off-by: Guillaume Nault <g.nault@alphalink.fr>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: sun4i-emac: Claim emac sram · 542a64c7

Committed by Hans de Goede on Aug 23, 2015

Claim the emac sram ourselves, rather than relying on the bootloader
having mapped the sram to the emac controller during boot.
Signed-off-by: Hans de Goede <hdegoede@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

inetpeer: remove dead code · 2c0027cd

Committed by David Ahern on Aug 23, 2015

Remove various inlined functions not referenced in the kernel.
Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net/mlx5e: Avoid accessing NULL pointer at ndo_select_queue · 5283af89

Committed by Rana Shahout on Aug 23, 2015

To avoid multiply/division operations on the data path,
we hold a {channel, tc}==>txq mapping table.
We held this mapping table inside the channel object that is
being destroyed upon some configuration operations (e.g. MTU change).
So in case ndo_select_queue occurs during such a configuration operation,
it may access a NULL channel pointer, resulting in a kernel panic.
To fix this issue we moved the {channel, tc}==>txq mapping table
outside the channel object so that it will be available also
during such configuration operations.
Signed-off-by: Rana Shahout <ranas@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
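
The idea of keeping the lookup table outside the objects that get torn down can be sketched as follows. This is a user-space illustration, not the driver code: the `Device` class, its method names, and the index formula `ch + tc * num_channels` are assumptions made for the example.

```python
class Device:
    """Toy model of a NIC whose per-channel objects can be torn down."""

    def __init__(self, num_channels, num_tc):
        self.num_channels = num_channels
        # Precomputed mapping table owned by the device, not by a channel.
        # The index formula is an assumption made for this sketch.
        self.channeltc_to_txq = [
            [ch + tc * num_channels for tc in range(num_tc)]
            for ch in range(num_channels)
        ]
        self.channels = [object() for _ in range(num_channels)]

    def close_channels(self):
        # Models a configuration change (e.g. MTU) destroying the channels.
        self.channels = None

    def select_queue(self, channel_ix, tc):
        # Only touches the device-level table, so the lookup still works
        # while the channels are gone.
        return self.channeltc_to_txq[channel_ix][tc]
```

Because `select_queue()` never dereferences `self.channels`, it returns the same answer before, during, and after a reconfiguration.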


ah4: Fix error return in ah_input(). · 94c10f0e

Committed by David S. Miller on Aug 25, 2015

Noticed by Herbert Xu.
Signed-off-by: David S. Miller <davem@davemloft.net>

ah6: fix error return code · 25105051

Committed by Julia Lawall on Aug 23, 2015

Return a negative error code on failure.

A simplified version of the semantic match that finds this problem is as
follows: (http://coccinelle.lip6.fr/)

// <smpl>
@@
identifier ret; expression e1,e2;
@@
(
if (\(ret < 0\|ret != 0\))
 { ... return ret; }
|
ret = 0
)
... when != ret = e1
    when != &ret
*if(...)
{
  ... when != ret = e2
      when forall
 return ret;
}
// </smpl>
Signed-off-by: Julia Lawall <Julia.Lawall@lip6.fr>
Acked-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
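
The class of bug that this semantic match looks for can be shown with a small model: a failure branch that returns a status variable which still holds 0. The sketch below is illustrative Python, not the patched kernel code; `setup_buggy`, `setup_fixed`, and the `allocate` callback are hypothetical names.

```python
import errno


def setup_buggy(allocate):
    """Failure path returns a stale 'ret' that is still 0."""
    ret = 0
    buf = allocate()
    if buf is None:
        return ret          # bug: the caller sees 0, i.e. success
    return ret


def setup_fixed(allocate):
    """Failure path sets a negative error code before returning."""
    ret = 0
    buf = allocate()
    if buf is None:
        ret = -errno.ENOMEM
        return ret
    return ret
```

With an allocator that fails, `setup_buggy` reports success while `setup_fixed` returns `-ENOMEM`, which is the behavior the fix restores.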


mlxsw: fix error return code · 5c121979

Committed by Julia Lawall on Aug 23, 2015

Return a negative error code on failure.

A simplified version of the semantic match that finds this problem is as
follows: (http://coccinelle.lip6.fr/)

// <smpl>
@@
identifier ret; expression e1,e2;
@@
(
if (\(ret < 0\|ret != 0\))
 { ... return ret; }
|
ret = 0
)
... when != ret = e1
    when != &ret
*if(...)
{
  ... when != ret = e2
      when forall
 return ret;
}
// </smpl>
Signed-off-by: Julia Lawall <Julia.Lawall@lip6.fr>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: davinci_emac: fix error return code · 1ef53ebf

Committed by Julia Lawall on Aug 23, 2015

Propagate error code on failure.

A simplified version of the semantic match that finds this problem is as
follows: (http://coccinelle.lip6.fr/)

// <smpl>
@@
identifier ret; expression e1,e2;
@@
(
if (\(ret < 0\|ret != 0\))
 { ... return ret; }
|
ret = 0
)
... when != ret = e1
    when != &ret
*if(...)
{
  ... when != ret = e2
      when forall
 return ret;
}
// </smpl>
Signed-off-by: Julia Lawall <Julia.Lawall@lip6.fr>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge branch 'rds-assorted-bug-fixes' · 96fd26b9

Committed by David S. Miller on Aug 25, 2015

Santosh Shilimkar says:

====================
RDS: Assorted bug fixes

We would like to improve RDS upstream support and in that context, I
started playing with it.  But I ran into a number of issues, including
one as basic as RDS IB RDMA not working. As part of the debugging, I
ended up creating the $subject series, which has a bunch of assorted
fixes. At least with this series I can run RDS IB RDMA and other tests
successfully.

Some of these fixes have been done by Chris Meson, Andy Grover and
Zach Brown while at Oracle. There are still more kinks with FMR and
error handling and I plan to address them in a follow-up series.

Series generated against Linus's master (v4.2-rc-7) but also applies
against net-next cleanly. It's tested on Oracle hardware with IB
fabric for both bcopy as well as RDMA mode. I don't have access
to iWARP hardware, so any testing help on iWARP hardware is appreciated.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>

RDS: check for valid cm_id before initiating connection · ae05368a

Committed by santosh.shilimkar@oracle.com on Aug 22, 2015

The connection could have been dropped while the route was being resolved,
so check for a valid cm_id before initiating the connection.
Reviewed-by: Ajaykumar Hotchandani <ajaykumar.hotchandani@oracle.com>
Signed-off-by: Santosh Shilimkar <ssantosh@kernel.org>
Signed-off-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

RDS: return EMSGSIZE for oversize requests before processing/queueing · 06e8941e

Committed by Mukesh Kacker on Aug 22, 2015

rds_send_queue_rm() allows for the "current datagram" being queued
to exceed SO_SNDBUF thresholds by checking bytes queued without
counting in length of current datagram. (Since sk_sndbuf is set
to twice requested SO_SNDBUF value as a kernel heuristic this
is usually fine!)

If this "current datagram" squeezing past the threshold is itself
many times the size of the sk_sndbuf threshold itself then even
twice the SO_SNDBUF does not save us and it gets queued but
cannot be transmitted. Threads block and deadlock and device
becomes unusable. The check for this datagram not exceeding
SNDBUF thresholds (EMSGSIZE) is not done on this datagram as
that check is only done if queueing attempt fails.
(Datagrams that follow this datagram fail queueing attempts, go
through the check and eventually trip EMSGSIZE error but zero
length datagrams silently fail!)

This fix moves the check for datagrams exceeding SNDBUF limits
before any processing or queueing is attempted and returns EMSGSIZE
early in the rds_sendmsg() code. This change also ensures that all
datagrams get checked for exceeding SNDBUF/sk_sndbuf size limits
and the large datagrams that exceed those limits do not get to
rds_send_queue_rm() code for processing.
Signed-off-by: Mukesh Kacker <mukesh.kacker@oracle.com>
Signed-off-by: Santosh Shilimkar <ssantosh@kernel.org>
Signed-off-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
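
The effect of moving the size check ahead of queueing can be sketched with a toy socket model. This is an illustration under stated assumptions, not the RDS code: the `SendSocket` class is hypothetical, and returning `-EAGAIN` for a full queue is an assumption of the sketch.

```python
import errno


class SendSocket:
    """Toy model of a datagram socket with a send-buffer limit."""

    def __init__(self, sndbuf):
        self.sndbuf = sndbuf    # stand-in for the sk_sndbuf limit
        self.queued = 0         # bytes queued but not yet transmitted

    def sendmsg(self, payload_len):
        # Early check: a datagram larger than the whole buffer can never
        # be sent, so reject it before any queueing is attempted.
        if payload_len > self.sndbuf:
            return -errno.EMSGSIZE
        # Queueing check as described in the commit message: it looks only
        # at the bytes already queued, not at the current datagram's length.
        if self.queued >= self.sndbuf:
            return -errno.EAGAIN    # assumption: models "would block"
        self.queued += payload_len
        return payload_len
```

An oversize datagram is now refused up front and leaves the queue untouched, while a datagram that merely pushes the queue past the limit is still accepted, as before.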


RDS: make sure rds_send_drop_to properly takes the m_rs_lock · dfcec251

Committed by santosh.shilimkar@oracle.com on Aug 22, 2015

rds_send_drop_to() is used during socket tear down to find all the
messages on the socket and flush them. It can race with the
acking code unless it takes the m_rs_lock on each and every message.

This plugs a hole where we didn't take m_rs_lock on any message that
didn't have the RDS_MSG_ON_CONN set. Taking m_rs_lock avoids
double frees and other memory corruptions as the ack code trusts
the message m_rs pointer on a socket that had actually been freed.

We must take m_rs_lock to access m_rs. Because of lock nesting and
rs access, we also need to acquire rs_lock.
Reviewed-by: Ajaykumar Hotchandani <ajaykumar.hotchandani@oracle.com>
Signed-off-by: Santosh Shilimkar <ssantosh@kernel.org>
Signed-off-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

RDS: Don't destroy the rdma id until after we're done using it · 1c3be624

Committed by Santosh Shilimkar on Aug 22, 2015

During connection resets, we are destroying the rdma id too soon. We can't
destroy it when it is still in use. So let's move rdma_destroy_id() after
we clear the rings.
Reviewed-by: Ajaykumar Hotchandani <ajaykumar.hotchandani@oracle.com>
Signed-off-by: Santosh Shilimkar <ssantosh@kernel.org>
Signed-off-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

RDS: Fix assertion level from fatal to warning · 5c240fa2

Committed by santosh.shilimkar@oracle.com on Aug 22, 2015

Fix the assertion level since it's not fatal and can be hit
in normal execution paths. There is no need to take the
system down.

We keep the WARN_ON() to detect the condition if we get
here with bad pages.
Reviewed-by: Ajaykumar Hotchandani <ajaykumar.hotchandani@oracle.com>
Signed-off-by: Santosh Shilimkar <ssantosh@kernel.org>
Signed-off-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

RDS: Make sure we do a signaled send for large-send · 3049147c

Committed by santosh.shilimkar@oracle.com on Aug 22, 2015

A WR (Work Request) always generates a WC (Work Completion) with a
signaled send. The default RDS IB code is set up for un-signaled
completion. Since an RDS connection is persistent, we can end up
sending data even after a large send when the remote end is
not active (for any reason).

By doing a signaled send at least once per large send,
we can at least detect the problem in the work completion
handler, thereby avoiding sending more data to an
inactive remote.
Reviewed-by: Ajaykumar Hotchandani <ajaykumar.hotchandani@oracle.com>
Signed-off-by: Santosh Shilimkar <ssantosh@kernel.org>
Signed-off-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

RDS: Mark message mapped before transmit · 4f73113c

Committed by santosh.shilimkar@oracle.com on Aug 22, 2015

rds_send_xmit() marks the rds message map flag after
xmit_[rdma/atomic](), which is clearly wrong.  We need
to maintain the ownership between transport and rds.

Also take care of the error path.
Reviewed-by: Ajaykumar Hotchandani <ajaykumar.hotchandani@oracle.com>
Signed-off-by: Santosh Shilimkar <ssantosh@kernel.org>
Signed-off-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

RDS: add a sock_destruct callback debug aid · 0df5f9a6

Committed by santosh.shilimkar@oracle.com on Aug 22, 2015

This helps detect processes/apps that accidentally try to destroy
an RDS socket which they are sharing with other processes/apps.
Reviewed-by: Ajaykumar Hotchandani <ajaykumar.hotchandani@oracle.com>
Signed-off-by: Santosh Shilimkar <ssantosh@kernel.org>
Signed-off-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

RDS: check for congestion updates during rds_send_xmit · 0c484240

Committed by santosh.shilimkar@oracle.com on Aug 22, 2015

Ensure we don't keep sending data if the link is congested.
Reviewed-by: Ajaykumar Hotchandani <ajaykumar.hotchandani@oracle.com>
Signed-off-by: Santosh Shilimkar <ssantosh@kernel.org>
Signed-off-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

RDS: make sure we post recv buffers · 73ce4317

Committed by santosh.shilimkar@oracle.com on Aug 22, 2015

If we get an ENOMEM during rds_ib_recv_refill, we might never come
back and refill again later. This patch makes sure to kick krdsd into
helping out.

To achieve this we add the RDS_RECV_REFILL flag and update the refill
path based on it so that at least some thread will keep posting
receive buffers.

Since krdsd and softirq might both race for refill, we decide to
schedule on the work queue based on ring_low instead of ring_empty.
Reviewed-by: Ajaykumar Hotchandani <ajaykumar.hotchandani@oracle.com>
Signed-off-by: Santosh Shilimkar <ssantosh@kernel.org>
Signed-off-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

RDS: don't update ip address tables if the address hasn't changed · e1f475a7

Committed by santosh.shilimkar@oracle.com on Aug 22, 2015

If the ip address tables haven't changed, there is no need to remove
them only to add them back again.

Let's fix it.
Reviewed-by: Ajaykumar Hotchandani <ajaykumar.hotchandani@oracle.com>
Signed-off-by: Santosh Shilimkar <ssantosh@kernel.org>
Signed-off-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

RDS: destroy the ib state earlier during shutdown · 1bc7b863

Committed by santosh.shilimkar@oracle.com on Aug 22, 2015

Destroy ib state early during shutdown. Otherwise we can get callbacks
after the QP isn't really able to handle them.
Reviewed-by: Ajaykumar Hotchandani <ajaykumar.hotchandani@oracle.com>
Signed-off-by: Santosh Shilimkar <ssantosh@kernel.org>
Signed-off-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

RDS: always free recv frag as we free its ring entry · 43962dd7

Committed by santosh.shilimkar@oracle.com on Aug 22, 2015

We were still seeing rare occurrences of the WARN_ON(recv->r_frag) which
indicates that the recv refill path was finding allocated frags in ring
entries that were marked free. These were usually followed by OOM crashes.
They only seem to be occurring in the presence of completion errors and
connection resets.

This patch ensures that we free the frag as we mark the ring entry free.
This should stop the refill path from finding allocated frags in ring
entries that were marked free.
Reviewed-by: Ajaykumar Hotchandani <ajaykumar.hotchandani@oracle.com>
Signed-off-by: Santosh Shilimkar <ssantosh@kernel.org>
Signed-off-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

RDS: restore return value in rds_cmsg_rdma_args() · 1d2e3f39

Committed by santosh.shilimkar@oracle.com on Aug 22, 2015

In rds_cmsg_rdma_args() 'ret' is used by rds_pin_pages() which returns
number of pinned pages on success. And the same value is returned to the
caller of rds_cmsg_rdma_args() on success which is not intended.

Commit f4a3fc03 ("RDS: Clean up error handling in rds_cmsg_rdma_args")
removed the 'ret = 0' line which broke RDS RDMA mode.

Fix it by restoring the return value on rds_pin_pages() success
keeping the clean-up in place.
Signed-off-by: Santosh Shilimkar <ssantosh@kernel.org>
Signed-off-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
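
The return-value problem described here is easy to reproduce in a small model: a helper that returns a positive count on success, whose result is stored in the same variable the caller later returns. The sketch below is illustrative Python, not the kernel function; `pin_pages`, `rdma_args_buggy`, and `rdma_args_fixed` are hypothetical stand-ins.

```python
import errno


def pin_pages(nr_pages):
    """Stand-in helper: page count on success, negative errno on failure."""
    return nr_pages if nr_pages > 0 else -errno.EFAULT


def rdma_args_buggy(segments):
    ret = 0
    for nr_pages in segments:
        ret = pin_pages(nr_pages)
        if ret < 0:
            break
    return ret              # bug: leaks the last page count to the caller


def rdma_args_fixed(segments):
    ret = 0
    for nr_pages in segments:
        ret = pin_pages(nr_pages)
        if ret < 0:
            break
        ret = 0             # restore the success value after pinning
    return ret
```

A caller that treats any non-zero result as failure would reject every successful request from the buggy variant, which matches the report that RDMA mode stopped working.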


tcp: refine pacing rate determination · 43e122b0

Committed by Eric Dumazet on Aug 21, 2015

When TCP pacing was added back in linux-3.12, we chose
to apply a fixed ratio of 200 % against current rate,
to allow probing for optimal throughput even during
slow start phase, where cwnd can be doubled every other RTT.

At Google, we found it was better applying a different ratio
while in Congestion Avoidance phase.
This ratio was set to 120 %.

We've used the normal tcp_in_slow_start() helper for a while,
then tuned the condition to select the conservative ratio
as soon as cwnd >= ssthresh/2 :

- After cwnd reduction, it is safer to ramp up more slowly,
  as we approach optimal cwnd.
- Initial ramp up (ssthresh == INFINITY) still allows doubling
  cwnd every other RTT.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Neal Cardwell <ncardwell@google.com>
Cc: Yuchung Cheng <ycheng@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
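
The ratio selection described in this commit can be written down as a small calculation. The sketch below is a simplified model, not the kernel implementation: it assumes the base rate is one congestion window of `mss`-sized segments per smoothed RTT, and the function and constant names are made up for the example. Only the 200 % and 120 % ratios and the `cwnd >= ssthresh/2` condition come from the commit message.

```python
SLOW_START_RATIO = 200      # percent, from the commit message
CONG_AVOID_RATIO = 120      # percent, from the commit message


def pacing_rate(mss, cwnd, ssthresh, srtt_us):
    """Return a pacing rate in bytes per second, using integer arithmetic."""
    # Use the conservative ratio as soon as cwnd >= ssthresh / 2.
    if cwnd < ssthresh // 2:
        ratio = SLOW_START_RATIO
    else:
        ratio = CONG_AVOID_RATIO
    # Assumed base rate: one congestion window per smoothed RTT.
    return ratio * mss * cwnd * 1_000_000 // (100 * srtt_us)
```

With a very large `ssthresh` (the initial ramp-up case) the 200 % ratio always applies, so cwnd can still double every other RTT; after a cwnd reduction the 120 % ratio takes over once cwnd reaches half of `ssthresh`.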


xfrm: Use VRF master index if output device is enslaved · 4ec3b28c

Committed by David Ahern on Aug 20, 2015

Directs route lookups to the VRF table. Compiles out if NET_VRF is not
enabled. With this patch, IPsec tunnels can be brought up successfully
in VRFs, even with duplicate network configuration.
Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Acked-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Acked-by: Steffen Klassert <steffen.klassert@secunet.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

tcp: fix slow start after idle vs TSO/GSO · 6f021c62

Committed by Eric Dumazet on Aug 21, 2015

slow start after idle might reduce cwnd, but we perform this
after first packet was cooked and sent.

With TSO/GSO, it means that we might send a full TSO packet
even if cwnd should have been reduced to IW10.

Moving the SSAI check in skb_entail() makes sense, because
we slightly reduce number of times this check is done,
especially for large send() and TCP Small queue callbacks from
softirq context.

As Neal pointed out, we also need to perform the check
if/when receive window opens.

Tested:

The following packetdrill test demonstrates the problem:
// Test of slow start after idle

`sysctl -q net.ipv4.tcp_slow_start_after_idle=1`

0.000 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
+0    setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
+0    bind(3, ..., ...) = 0
+0    listen(3, 1) = 0

+0    < S 0:0(0) win 65535 <mss 1000,sackOK,nop,nop,nop,wscale 7>
+0    > S. 0:0(0) ack 1 <mss 1460,nop,nop,sackOK,nop,wscale 6>
+.100 < . 1:1(0) ack 1 win 511
+0    accept(3, ..., ...) = 4
+0    setsockopt(4, SOL_SOCKET, SO_SNDBUF, [200000], 4) = 0

+0    write(4, ..., 26000) = 26000
+0    > . 1:5001(5000) ack 1
+0    > . 5001:10001(5000) ack 1
+0    %{ assert tcpi_snd_cwnd == 10 }%

+.100 < . 1:1(0) ack 10001 win 511
+0    %{ assert tcpi_snd_cwnd == 20, tcpi_snd_cwnd }%
+0    > . 10001:20001(10000) ack 1
+0    > P. 20001:26001(6000) ack 1

+.100 < . 1:1(0) ack 26001 win 511
+0    %{ assert tcpi_snd_cwnd == 36, tcpi_snd_cwnd }%

+4 write(4, ..., 20000) = 20000
// If slow start after idle works properly, we should send 5 MSS here (cwnd/2)
+0    > . 26001:31001(5000) ack 1
+0    %{ assert tcpi_snd_cwnd == 10, tcpi_snd_cwnd }%
+0    > . 31001:36001(5000) ack 1
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Neal Cardwell <ncardwell@google.com>
Cc: Yuchung Cheng <ycheng@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
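
The cwnd values asserted in the packetdrill script can be reproduced with a small model of slow start after idle. The sketch below is loosely modeled on that behavior and is not the kernel code: it assumes cwnd is halved once per RTO of idle time and never drops below the smaller of the initial window (10) and the current cwnd. The function name and the 300 ms RTO used in the usage note are assumptions.

```python
def cwnd_after_idle(cwnd, idle_ms, rto_ms, init_cwnd=10):
    """Halve cwnd once per RTO of idle time, but not below the restart window."""
    restart_cwnd = min(init_cwnd, cwnd)
    remaining = idle_ms - rto_ms
    while remaining > 0 and cwnd > restart_cwnd:
        cwnd //= 2              # one halving per elapsed RTO
        remaining -= rto_ms
    return max(cwnd, restart_cwnd)
```

For the scenario in the test (cwnd 36, then 4 seconds idle), `cwnd_after_idle(36, 4000, 300)` goes 36, 18, 9 and is then clamped to 10, matching the `tcpi_snd_cwnd == 10` assertion. The point of the fix is that this reduction has to happen before the first TSO packet is built, not after it has been sent.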
