提交 · 5f74f82ea34c0da80ea0b49192bb5ea06e063593 · openeuler / raspberrypi-kernel

09 2月, 2016 1 次提交

由 Hans Westgaard Ry 提交于 2月 03, 2016

Devices may have limits on the number of fragments in an skb they support.
Current codebase uses a constant as maximum for number of fragments one
skb can hold and use.
When enabling scatter/gather and running traffic with many small messages
the codebase uses the maximum number of fragments and may thereby violate
the max for certain devices.
The patch introduces a global variable as max number of fragments.
Signed-off-by: NHans Westgaard Ry <hans.westgaard.ry@oracle.com>
Reviewed-by: NHåkon Bugge <haakon.bugge@oracle.com>
Acked-by: NEric Dumazet <edumazet@google.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

5f74f82e

29 1月, 2016 1 次提交

tcp: beware of alignments in tcp_get_info() · ff5d7497

由 Eric Dumazet 提交于 1月 27, 2016

With some combinations of user provided flags in netlink command,
it is possible to call tcp_get_info() with a buffer that is not 8-bytes
aligned.

It does matter on some arches, so we need to use put_unaligned() to
store the u64 fields.

Current iproute2 package does not trigger this particular issue.

Fixes: 0df48c26 ("tcp: add tcpi_bytes_acked to tcp_info")
Fixes: 977cb0ec ("tcp: add pacing_rate information into tcp_info")
Signed-off-by: NEric Dumazet <edumazet@google.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

ff5d7497

15 1月, 2016 1 次提交

net: tcp_memcontrol: protect all tcp_memcontrol calls by jump-label · 3d596f7b

由 Johannes Weiner 提交于 1月 14, 2016

Move the jump-label from sock_update_memcg() and sock_release_memcg() to
the callsite, and so eliminate those function calls when socket
accounting is not enabled.

This also eliminates the need for dummy functions because the calls will
be optimized away if the Kconfig options are not enabled.
Signed-off-by: NJohannes Weiner <hannes@cmpxchg.org>
Acked-by: NDavid S. Miller <davem@davemloft.net>
Reviewed-by: NVladimir Davydov <vdavydov@virtuozzo.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

3d596f7b

23 12月, 2015 1 次提交

net: tcp: deal with listen sockets properly in tcp_abort. · 2010b93e

由 Lorenzo Colitti 提交于 12月 22, 2015

When closing a listen socket, tcp_abort currently calls
tcp_done without clearing the request queue. If the socket has a
child socket that is established but not yet accepted, the child
socket is then left without a parent, causing a leak.

Fix this by setting the socket state to TCP_CLOSE and calling
inet_csk_listen_stop with the socket lock held, like tcp_close
does.

Tested using net_test. With this patch, calling SOCK_DESTROY on a
listen socket that has an established but not yet accepted child
socket results in the parent and the child being closed, such
that they no longer appear in sock_diag dumps.
Reported-by: NEric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: NLorenzo Colitti <lorenzo@google.com>
Acked-by: NEric Dumazet <edumazet@google.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

2010b93e

19 12月, 2015 1 次提交

tcp: diag: add support for request sockets to tcp_abort() · 07f6f4a3

由 Eric Dumazet 提交于 12月 17, 2015

Adding support for SYN_RECV request sockets to tcp_abort()
is quite easy after our tcp listener rewrite.

Note that we also need to better handle listeners, or we might
leak not yet accepted children, because of a missing
inet_csk_listen_stop() call.
Signed-off-by: NEric Dumazet <edumazet@google.com>
Cc: Lorenzo Colitti <lorenzo@google.com>
Tested-by: NLorenzo Colitti <lorenzo@google.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

07f6f4a3

16 12月, 2015 3 次提交

net: diag: Support destroying TCP sockets. · c1e64e29

由 Lorenzo Colitti 提交于 12月 16, 2015

This implements SOCK_DESTROY for TCP sockets. It causes all
blocking calls on the socket to fail fast with ECONNABORTED and
causes a protocol close of the socket. It informs the other end
of the connection by sending a RST, i.e., initiating a TCP ABORT
as per RFC 793. ECONNABORTED was chosen for consistency with
FreeBSD.
Signed-off-by: NLorenzo Colitti <lorenzo@google.com>
Acked-by: NEric Dumazet <edumazet@google.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

c1e64e29

tcp: Fix conditions to determine checksum offload · 9a49850d

由 Tom Herbert 提交于 12月 14, 2015

In tcp_send_sendpage and tcp_sendmsg we check the route capabilities to
determine if checksum offload can be performed. This check currently
does not take the IP protocol into account for devices that advertise
only one of NETIF_F_IPV6_CSUM or NETIF_F_IP_CSUM. This patch adds a
function to check capabilities for checksum offload with a socket
called sk_check_csum_caps. This function checks for specific IPv4 or
IPv6 offload support based on the family of the socket.
Signed-off-by: NTom Herbert <tom@herbertland.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

9a49850d

net: Rename NETIF_F_ALL_CSUM to NETIF_F_CSUM_MASK · a188222b

由 Tom Herbert 提交于 12月 14, 2015

The name NETIF_F_ALL_CSUM is a misnomer. This does not correspond to the
set of features for offloading all checksums. This is a mask of the
checksum offload related features bits. It is incorrect to set both
NETIF_F_HW_CSUM and NETIF_F_IP_CSUM or NETIF_F_IPV6 at the same time for
features of a device.

This patch:
  - Changes instances of NETIF_F_ALL_CSUM to NETIF_F_CSUM_MASK (where
    NETIF_F_ALL_CSUM is being used as a mask).
  - Changes bonding, sfc/efx, ipvlan, macvlan, vlan, and team drivers to
    use NEITF_F_HW_CSUM in features list instead of NETIF_F_ALL_CSUM.
Signed-off-by: NTom Herbert <tom@herbertland.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

a188222b

02 12月, 2015 1 次提交

net: rename SOCK_ASYNC_NOSPACE and SOCK_ASYNC_WAITDATA · 9cd3e072

由 Eric Dumazet 提交于 11月 29, 2015

This patch is a cleanup to make following patch easier to
review.

Goal is to move SOCK_ASYNC_NOSPACE and SOCK_ASYNC_WAITDATA
from (struct socket)->flags to a (struct socket_wq)->flags
to benefit from RCU protection in sock_wake_async()

To ease backports, we rename both constants.

Two new helpers, sk_set_bit(int nr, struct sock *sk)
and sk_clear_bit(int net, struct sock *sk) are added so that
following patch can change their implementation.
Signed-off-by: NEric Dumazet <edumazet@google.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

9cd3e072

16 11月, 2015 1 次提交

tcp: ensure proper barriers in lockless contexts · 00fd38d9

由 Eric Dumazet 提交于 11月 12, 2015

Some functions access TCP sockets without holding a lock and
might output non consistent data, depending on compiler and or
architecture.

tcp_diag_get_info(), tcp_get_info(), tcp_poll(), get_tcp4_sock() ...

Introduce sk_state_load() and sk_state_store() to fix the issues,
and more clearly document where this lack of locking is happening.
Signed-off-by: NEric Dumazet <edumazet@google.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

00fd38d9

21 10月, 2015 1 次提交

tcp: track min RTT using windowed min-filter · f6722583

由 Yuchung Cheng 提交于 10月 16, 2015

Kathleen Nichols' algorithm for tracking the minimum RTT of a
data stream over some measurement window. It uses constant space
and constant time per update. Yet it almost always delivers
the same minimum as an implementation that has to keep all
the data in the window. The measurement window is tunable via
sysctl.net.ipv4.tcp_min_rtt_wlen with a default value of 5 minutes.

The algorithm keeps track of the best, 2nd best & 3rd best min
values, maintaining an invariant that the measurement time of
the n'th best >= n-1'th best. It also makes sure that the three
values are widely separated in the time window since that bounds
the worse case error when that data is monotonically increasing
over the window.

Upon getting a new min, we can forget everything earlier because
it has no value - the new min is less than everything else in the
window by definition and it's the most recent. So we restart fresh
on every new min and overwrites the 2nd & 3rd choices. The same
property holds for the 2nd & 3rd best.

Therefore we have to maintain two invariants to maximize the
information in the samples, one on values (1st.v <= 2nd.v <=
3rd.v) and the other on times (now-win <=1st.t <= 2nd.t <= 3rd.t <=
now). These invariants determine the structure of the code

The RTT input to the windowed filter is the minimum RTT measured
from ACK or SACK, or as the last resort from TCP timestamps.

The accessor tcp_min_rtt() returns the minimum RTT seen in the
window. ~0U indicates it is not available. The minimum is 1usec
even if the true RTT is below that.
Signed-off-by: NYuchung Cheng <ycheng@google.com>
Signed-off-by: NNeal Cardwell <ncardwell@google.com>
Signed-off-by: NEric Dumazet <edumazet@google.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

f6722583

07 10月, 2015 1 次提交

net: ipv4: tcp.c Fixed an assignment coding style issue · 686a5624

由 Yuvaraja Mariappan 提交于 10月 06, 2015

Fixed an assignment coding style issue
Signed-off-by: NYuvaraja Mariappan <ymariappan@gmail.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

686a5624

30 9月, 2015 1 次提交

tcp: prepare fastopen code for upcoming listener changes · 0536fcc0

由 Eric Dumazet 提交于 9月 29, 2015

While auditing TCP stack for upcoming 'lockless' listener changes,
I found I had to change fastopen_init_queue() to properly init the object
before publishing it.

Otherwise an other cpu could try to lock the spinlock before it gets
properly initialized.

Instead of adding appropriate barriers, just remove dynamic memory
allocations :
- Structure is 28 bytes on 64bit arches. Using additional 8 bytes
  for holding a pointer seems overkill.
- Two listeners can share same cache line and performance would suffer.

If we really want to save few bytes, we would instead dynamically allocate
whole struct request_sock_queue in the future.
Signed-off-by: NEric Dumazet <edumazet@google.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

0536fcc0

26 8月, 2015 1 次提交

tcp: fix slow start after idle vs TSO/GSO · 6f021c62

由 Eric Dumazet 提交于 8月 21, 2015

slow start after idle might reduce cwnd, but we perform this
after first packet was cooked and sent.

With TSO/GSO, it means that we might send a full TSO packet
even if cwnd should have been reduced to IW10.

Moving the SSAI check in skb_entail() makes sense, because
we slightly reduce number of times this check is done,
especially for large send() and TCP Small queue callbacks from
softirq context.

As Neal pointed out, we also need to perform the check
if/when receive window opens.

Tested:

Following packetdrill test demonstrates the problem
// Test of slow start after idle

`sysctl -q net.ipv4.tcp_slow_start_after_idle=1`

0.000 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
+0    setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
+0    bind(3, ..., ...) = 0
+0    listen(3, 1) = 0

+0    < S 0:0(0) win 65535 <mss 1000,sackOK,nop,nop,nop,wscale 7>
+0    > S. 0:0(0) ack 1 <mss 1460,nop,nop,sackOK,nop,wscale 6>
+.100 < . 1:1(0) ack 1 win 511
+0    accept(3, ..., ...) = 4
+0    setsockopt(4, SOL_SOCKET, SO_SNDBUF, [200000], 4) = 0

+0    write(4, ..., 26000) = 26000
+0    > . 1:5001(5000) ack 1
+0    > . 5001:10001(5000) ack 1
+0    %{ assert tcpi_snd_cwnd == 10 }%

+.100 < . 1:1(0) ack 10001 win 511
+0    %{ assert tcpi_snd_cwnd == 20, tcpi_snd_cwnd }%
+0    > . 10001:20001(10000) ack 1
+0    > P. 20001:26001(6000) ack 1

+.100 < . 1:1(0) ack 26001 win 511
+0    %{ assert tcpi_snd_cwnd == 36, tcpi_snd_cwnd }%

+4 write(4, ..., 20000) = 20000
// If slow start after idle works properly, we should send 5 MSS here (cwnd/2)
+0    > . 26001:31001(5000) ack 1
+0    %{ assert tcpi_snd_cwnd == 10, tcpi_snd_cwnd }%
+0    > . 31001:36001(5000) ack 1
Signed-off-by: NEric Dumazet <edumazet@google.com>
Cc: Neal Cardwell <ncardwell@google.com>
Cc: Yuchung Cheng <ycheng@google.com>
Acked-by: NNeal Cardwell <ncardwell@google.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

6f021c62

27 7月, 2015 1 次提交

tcp: fix recv with flags MSG_WAITALL | MSG_PEEK · dfbafc99

由 Sabrina Dubroca 提交于 7月 24, 2015

Currently, tcp_recvmsg enters a busy loop in sk_wait_data if called
with flags = MSG_WAITALL | MSG_PEEK.

sk_wait_data waits for sk_receive_queue not empty, but in this case,
the receive queue is not empty, but does not contain any skb that we
can use.

Add a "last skb seen on receive queue" argument to sk_wait_data, so
that it sleeps until the receive queue has new skbs.

Link: https://bugzilla.kernel.org/show_bug.cgi?id=99461
Link: https://sourceware.org/bugzilla/show_bug.cgi?id=18493
Link: https://bugzilla.redhat.com/show_bug.cgi?id=1205258Reported-by: NEnrico Scholz <rh-bugzilla@ensc.de>
Reported-by: NDan Searle <dan@censornet.com>
Signed-off-by: NSabrina Dubroca <sd@queasysnail.net>
Acked-by: NEric Dumazet <edumazet@google.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

dfbafc99

23 6月, 2015 1 次提交

tcp: Do not call tcp_fastopen_reset_cipher from interrupt context · dfea2aa6

由 Christoph Paasch 提交于 6月 18, 2015

tcp_fastopen_reset_cipher really cannot be called from interrupt
context. It allocates the tcp_fastopen_context with GFP_KERNEL and
calls crypto_alloc_cipher, which allocates all kind of stuff with
GFP_KERNEL.

Thus, we might sleep when the key-generation is triggered by an
incoming TFO cookie-request which would then happen in interrupt-
context, as shown by enabling CONFIG_DEBUG_ATOMIC_SLEEP:

[   36.001813] BUG: sleeping function called from invalid context at mm/slub.c:1266
[   36.003624] in_atomic(): 1, irqs_disabled(): 0, pid: 1016, name: packetdrill
[   36.004859] CPU: 1 PID: 1016 Comm: packetdrill Not tainted 4.1.0-rc7 #14
[   36.006085] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.7.5-0-ge51488c-20140602_164612-nilsson.home.kraxel.org 04/01/2014
[   36.008250]  00000000000004f2 ffff88007f8838a8 ffffffff8171d53a ffff880075a084a8
[   36.009630]  ffff880075a08000 ffff88007f8838c8 ffffffff810967d3 ffff88007f883928
[   36.011076]  0000000000000000 ffff88007f8838f8 ffffffff81096892 ffff88007f89be00
[   36.012494] Call Trace:
[   36.012953]  <IRQ>  [<ffffffff8171d53a>] dump_stack+0x4f/0x6d
[   36.014085]  [<ffffffff810967d3>] ___might_sleep+0x103/0x170
[   36.015117]  [<ffffffff81096892>] __might_sleep+0x52/0x90
[   36.016117]  [<ffffffff8118e887>] kmem_cache_alloc_trace+0x47/0x190
[   36.017266]  [<ffffffff81680d82>] ? tcp_fastopen_reset_cipher+0x42/0x130
[   36.018485]  [<ffffffff81680d82>] tcp_fastopen_reset_cipher+0x42/0x130
[   36.019679]  [<ffffffff81680f01>] tcp_fastopen_init_key_once+0x61/0x70
[   36.020884]  [<ffffffff81680f2c>] __tcp_fastopen_cookie_gen+0x1c/0x60
[   36.022058]  [<ffffffff816814ff>] tcp_try_fastopen+0x58f/0x730
[   36.023118]  [<ffffffff81671788>] tcp_conn_request+0x3e8/0x7b0
[   36.024185]  [<ffffffff810e3872>] ? __module_text_address+0x12/0x60
[   36.025327]  [<ffffffff8167b2e1>] tcp_v4_conn_request+0x51/0x60
[   36.026410]  [<ffffffff816727e0>] tcp_rcv_state_process+0x190/0xda0
[   36.027556]  [<ffffffff81661f97>] ? __inet_lookup_established+0x47/0x170
[   36.028784]  [<ffffffff8167c2ad>] tcp_v4_do_rcv+0x16d/0x3d0
[   36.029832]  [<ffffffff812e6806>] ? security_sock_rcv_skb+0x16/0x20
[   36.030936]  [<ffffffff8167cc8a>] tcp_v4_rcv+0x77a/0x7b0
[   36.031875]  [<ffffffff816af8c3>] ? iptable_filter_hook+0x33/0x70
[   36.032953]  [<ffffffff81657d22>] ip_local_deliver_finish+0x92/0x1f0
[   36.034065]  [<ffffffff81657f1a>] ip_local_deliver+0x9a/0xb0
[   36.035069]  [<ffffffff81657c90>] ? ip_rcv+0x3d0/0x3d0
[   36.035963]  [<ffffffff81657569>] ip_rcv_finish+0x119/0x330
[   36.036950]  [<ffffffff81657ba7>] ip_rcv+0x2e7/0x3d0
[   36.037847]  [<ffffffff81610652>] __netif_receive_skb_core+0x552/0x930
[   36.038994]  [<ffffffff81610a57>] __netif_receive_skb+0x27/0x70
[   36.040033]  [<ffffffff81610b72>] process_backlog+0xd2/0x1f0
[   36.041025]  [<ffffffff81611482>] net_rx_action+0x122/0x310
[   36.042007]  [<ffffffff81076743>] __do_softirq+0x103/0x2f0
[   36.042978]  [<ffffffff81723e3c>] do_softirq_own_stack+0x1c/0x30

This patch moves the call to tcp_fastopen_init_key_once to the places
where a listener socket creates its TFO-state, which always happens in
user-context (either from the setsockopt, or implicitly during the
listen()-call)

Cc: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Hannes Frederic Sowa <hannes@stressinduktion.org>
Fixes: 222e83d2 ("tcp: switch tcp_fastopen key generation to net_get_random_once")
Signed-off-by: NChristoph Paasch <cpaasch@apple.com>
Acked-by: NEric Dumazet <edumazet@google.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

dfea2aa6

16 6月, 2015 1 次提交

sock_diag: implement a get_info handler for inet · 35ac838a

由 Craig Gallek 提交于 6月 15, 2015

This get_info handler will simply dispatch to the appropriate
existing inet protocol handler.

This patch also includes a new netlink attribute
(INET_DIAG_PROTOCOL).  This attribute is currently only used
for multicast messages.  Without this attribute, there is no
way of knowing the IP protocol used by the socket information
being broadcast.  This attribute is not necessary in the 'dump'
variant of this protocol (though it could easily be added)
because dump requests are issued for specific family/protocol
pairs.

Tested: ss -E (note, the -E option has not yet been merged into
the upstream version of ss).
Signed-off-by: NCraig Gallek <kraig@google.com>
Acked-by: NEric Dumazet <edumazet@google.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

35ac838a

25 5月, 2015 1 次提交

net: make skb_splice_bits more configureable · a60e3cc7

由 Hannes Frederic Sowa 提交于 5月 21, 2015

Prepare skb_splice_bits to be able to deal with AF_UNIX sockets.

AF_UNIX sockets don't use lock_sock/release_sock and thus we have to
use a callback to make the locking and unlocking configureable.
Signed-off-by: NHannes Frederic Sowa <hannes@stressinduktion.org>
Acked-by: NEric Dumazet <edumazet@google.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

a60e3cc7

23 5月, 2015 1 次提交

tcp: fix a potential deadlock in tcp_get_info() · d654976c

由 Eric Dumazet 提交于 5月 21, 2015

Taking socket spinlock in tcp_get_info() can deadlock, as
inet_diag_dump_icsk() holds the &hashinfo->ehash_locks[i],
while packet processing can use the reverse locking order.

We could avoid this locking for TCP_LISTEN states, but lockdep would
certainly get confused as all TCP sockets share same lockdep classes.

[  523.722504] ======================================================
[  523.728706] [ INFO: possible circular locking dependency detected ]
[  523.734990] 4.1.0-dbg-DEV #1676 Not tainted
[  523.739202] -------------------------------------------------------
[  523.745474] ss/18032 is trying to acquire lock:
[  523.750002]  (slock-AF_INET){+.-...}, at: [<ffffffff81669d44>] tcp_get_info+0x2c4/0x360
[  523.758129]
[  523.758129] but task is already holding lock:
[  523.763968]  (&(&hashinfo->ehash_locks[i])->rlock){+.-...}, at: [<ffffffff816bcb75>] inet_diag_dump_icsk+0x1d5/0x6c0
[  523.774661]
[  523.774661] which lock already depends on the new lock.
[  523.774661]
[  523.782850]
[  523.782850] the existing dependency chain (in reverse order) is:
[  523.790326]
-> #1 (&(&hashinfo->ehash_locks[i])->rlock){+.-...}:
[  523.796599]        [<ffffffff811126bb>] lock_acquire+0xbb/0x270
[  523.802565]        [<ffffffff816f5868>] _raw_spin_lock+0x38/0x50
[  523.808628]        [<ffffffff81665af8>] __inet_hash_nolisten+0x78/0x110
[  523.815273]        [<ffffffff816819db>] tcp_v4_syn_recv_sock+0x24b/0x350
[  523.822067]        [<ffffffff81684d41>] tcp_check_req+0x3c1/0x500
[  523.828199]        [<ffffffff81682d09>] tcp_v4_do_rcv+0x239/0x3d0
[  523.834331]        [<ffffffff816842fe>] tcp_v4_rcv+0xa8e/0xc10
[  523.840202]        [<ffffffff81658fa3>] ip_local_deliver_finish+0x133/0x3e0
[  523.847214]        [<ffffffff81659a9a>] ip_local_deliver+0xaa/0xc0
[  523.853440]        [<ffffffff816593b8>] ip_rcv_finish+0x168/0x5c0
[  523.859624]        [<ffffffff81659db7>] ip_rcv+0x307/0x420

Lets use u64_sync infrastructure instead. As a bonus, 64bit
arches get optimized, as these are nop for them.

Fixes: 0df48c26 ("tcp: add tcpi_bytes_acked to tcp_info")
Signed-off-by: NEric Dumazet <edumazet@google.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

d654976c

22 5月, 2015 3 次提交

tcp: add tcpi_segs_in and tcpi_segs_out to tcp_info · 2efd055c

由 Marcelo Ricardo Leitner 提交于 5月 20, 2015

This patch tracks the total number of inbound and outbound segments on a
TCP socket. One may use this number to have an idea on connection
quality when compared against the retransmissions.

RFC4898 named these : tcpEStatsPerfSegsIn and tcpEStatsPerfSegsOut

These are a 32bit field each and can be fetched both from TCP_INFO
getsockopt() if one has a handle on a TCP socket, or from inet_diag
netlink facility (iproute2/ss patch will follow)

Note that tp->segs_out was placed near tp->snd_nxt for good data
locality and minimal performance impact, while tp->segs_in was placed
near tp->bytes_received for the same reason.

Join work with Eric Dumazet.

Note that received SYN are accounted on the listener, but sent SYNACK
are not accounted.
Signed-off-by: NMarcelo Ricardo Leitner <mleitner@redhat.com>
Signed-off-by: NEric Dumazet <edumazet@google.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

2efd055c

tcp: ensure epoll edge trigger wakeup when write queue is empty · ce5ec440

由 Jason Baron 提交于 5月 20, 2015

We currently rely on the setting of SOCK_NOSPACE in the write()
path to ensure that we wake up any epoll edge trigger waiters when
acks return to free space in the write queue. However, if we fail
to allocate even a single skb in the write queue, we could end up
waiting indefinitely.

Fix this by explicitly issuing a wakeup when we detect the condition
of an empty write queue and a return value of -EAGAIN. This allows
userspace to re-try as we expect this to be a temporary failure.

I've tested this approach by artificially making
sk_stream_alloc_skb() return NULL periodically. In that case,
epoll edge trigger waiters will hang indefinitely in epoll_wait()
without this patch.
Signed-off-by: NJason Baron <jbaron@akamai.com>
Acked-by: NEric Dumazet <edumazet@google.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

ce5ec440

tcp: add a force_schedule argument to sk_stream_alloc_skb() · eb934478

由 Eric Dumazet 提交于 5月 19, 2015

In commit 8e4d980a ("tcp: fix behavior for epoll edge trigger")
we fixed a possible hang of TCP sockets under memory pressure,
by allowing sk_stream_alloc_skb() to use sk_forced_mem_schedule()
if no packet is in socket write queue.

It turns out there are other cases where we want to force memory
schedule :

tcp_fragment() & tso_fragment() need to split a big TSO packet into
two smaller ones. If we block here because of TCP memory pressure,
we can effectively block TCP socket from sending new data.
If no further ACK is coming, this hang would be definitive, and socket
has no chance to effectively reduce its memory usage.
Signed-off-by: NEric Dumazet <edumazet@google.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

eb934478

20 5月, 2015 1 次提交

tcp: Return error instead of partial read for saved syn headers · aea0929e

由 Eric B Munson 提交于 5月 18, 2015

Currently the getsockopt() requesting the cached contents of the syn
packet headers will fail silently if the caller uses a buffer that is
too small to contain the requested data.  Rather than fail silently and
discard the headers, getsockopt() should return an error and report the
required size to hold the data.
Signed-off-by: NEric B Munson <emunson@akamai.com>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru>
Cc: James Morris <jmorris@namei.org>
Cc: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org>
Cc: Patrick McHardy <kaber@trash.net>
Cc: netdev@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

aea0929e

18 5月, 2015 2 次提交

tcp: halves tcp_mem[] limits · b66e91cc

由 Eric Dumazet 提交于 5月 15, 2015

Allowing tcp to use ~19% of physical memory is way too much,
and allowed bugs to be hidden. Add to this that some drivers use a full
page per incoming frame, so real cost can be twice the advertized one.

Reduce tcp_mem by 50 % as a first step to sanity.

tcp_mem[0,1,2] defaults are now 4.68%, 6.25%, 9.37% of physical memory.
Signed-off-by: NEric Dumazet <edumazet@google.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

b66e91cc

tcp: fix behavior for epoll edge trigger · 8e4d980a

由 Eric Dumazet 提交于 5月 15, 2015

Under memory pressure, tcp_sendmsg() can fail to queue a packet
while no packet is present in write queue. If we return -EAGAIN
with no packet in write queue, no ACK packet will ever come
to raise EPOLLOUT.

We need to allow one skb per TCP socket, and make sure that
tcp sockets can release their forward allocations under pressure.

This is a followup to commit 790ba456 ("tcp: set SOCK_NOSPACE
under memory pressure")
Signed-off-by: NEric Dumazet <edumazet@google.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

8e4d980a

06 5月, 2015 1 次提交

tcp: provide SYN headers for passive connections · cd8ae852

由 Eric Dumazet 提交于 5月 03, 2015

This patch allows a server application to get the TCP SYN headers for
its passive connections.  This is useful if the server is doing
fingerprinting of clients based on SYN packet contents.

Two socket options are added: TCP_SAVE_SYN and TCP_SAVED_SYN.

The first is used on a socket to enable saving the SYN headers
for child connections. This can be set before or after the listen()
call.

The latter is used to retrieve the SYN headers for passive connections,
if the parent listener has enabled TCP_SAVE_SYN.

TCP_SAVED_SYN is read once, it frees the saved SYN headers.

The data returned in TCP_SAVED_SYN are network (IPv4/IPv6) and TCP
headers.

Original patch was written by Tom Herbert, I changed it to not hold
a full skb (and associated dst and conntracking reference).

We have used such patch for about 3 years at Google.
Signed-off-by: NEric Dumazet <edumazet@google.com>
Acked-by: NNeal Cardwell <ncardwell@google.com>
Tested-by: NNeal Cardwell <ncardwell@google.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

cd8ae852

30 4月, 2015 3 次提交

tcp: add TCP_CC_INFO socket option · 6e9250f5

由 Eric Dumazet 提交于 4月 28, 2015

Some Congestion Control modules can provide per flow information,
but current way to get this information is to use netlink.

Like TCP_INFO, let's add TCP_CC_INFO so that applications can
issue a getsockopt() if they have a socket file descriptor,
instead of playing complex netlink games.

Sample usage would be :

  union tcp_cc_info info;
  socklen_t len = sizeof(info);

  if (getsockopt(fd, SOL_TCP, TCP_CC_INFO, &info, &len) == -1)
Signed-off-by: NEric Dumazet <edumazet@google.com>
Cc: Yuchung Cheng <ycheng@google.com>
Cc: Neal Cardwell <ncardwell@google.com>
Acked-by: NNeal Cardwell <ncardwell@google.com>
Acked-by: NDaniel Borkmann <daniel@iogearbox.net>
Acked-by: NYuchung Cheng <ycheng@google.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

6e9250f5

tcp: add tcpi_bytes_received to tcp_info · bdd1f9ed

由 Eric Dumazet 提交于 4月 28, 2015

This patch tracks total number of payload bytes received on a TCP socket.
This is the sum of all changes done to tp->rcv_nxt

RFC4898 named this : tcpEStatsAppHCThruOctetsReceived

This is a 64bit field, and can be fetched both from TCP_INFO
getsockopt() if one has a handle on a TCP socket, or from inet_diag
netlink facility (iproute2/ss patch will follow)

Note that tp->bytes_received was placed near tp->rcv_nxt for
best data locality and minimal performance impact.
Signed-off-by: NEric Dumazet <edumazet@google.com>
Cc: Yuchung Cheng <ycheng@google.com>
Cc: Matt Mathis <mattmathis@google.com>
Cc: Eric Salo <salo@google.com>
Cc: Martin Lau <kafai@fb.com>
Cc: Chris Rapier <rapier@psc.edu>
Acked-by: NYuchung Cheng <ycheng@google.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

bdd1f9ed

tcp: add tcpi_bytes_acked to tcp_info · 0df48c26

由 Eric Dumazet 提交于 4月 28, 2015

This patch tracks total number of bytes acked for a TCP socket.
This is the sum of all changes done to tp->snd_una, and allows
for precise tracking of delivered data.

RFC4898 named this : tcpEStatsAppHCThruOctetsAcked

This is a 64bit field, and can be fetched both from TCP_INFO
getsockopt() if one has a handle on a TCP socket, or from inet_diag
netlink facility (iproute2/ss patch will follow)

Note that tp->bytes_acked was placed near tp->snd_una for
best data locality and minimal performance impact.
Signed-off-by: NEric Dumazet <edumazet@google.com>
Acked-by: NYuchung Cheng <ycheng@google.com>
Cc: Matt Mathis <mattmathis@google.com>
Cc: Eric Salo <salo@google.com>
Cc: Martin Lau <kafai@fb.com>
Cc: Chris Rapier <rapier@psc.edu>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

0df48c26

22 4月, 2015 1 次提交

tcp: add memory barriers to write space paths · 3c715127

由 jbaron@akamai.com 提交于 4月 20, 2015

Ensure that we either see that the buffer has write space
in tcp_poll() or that we perform a wakeup from the input
side. Did not run into any actual problem here, but thought
that we should make things explicit.
Signed-off-by: NJason Baron <jbaron@akamai.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

3c715127

18 4月, 2015 1 次提交

tcp: tcp_get_info() should fetch socket fields once · fad9dfef

由 Eric Dumazet 提交于 4月 16, 2015

tcp_get_info() can be called without holding socket lock,
so any socket fields can change under us.

Use READ_ONCE() to fetch sk_pacing_rate and sk_max_pacing_rate

Fixes: 977cb0ec ("tcp: add pacing_rate information into tcp_info")
Signed-off-by: NEric Dumazet <edumazet@google.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

fad9dfef

12 4月, 2015 1 次提交

new helper: msg_data_left() · 01e97e65

由 Al Viro 提交于 12月 15, 2014

convert open-coded instances
Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>

01e97e65

04 4月, 2015 2 次提交

ipv4: coding style: comparison for inequality with NULL · 00db4124

由 Ian Morris 提交于 4月 03, 2015

The ipv4 code uses a mixture of coding styles. In some instances check
for non-NULL pointer is done as x != NULL and sometimes as x. x is
preferred according to checkpatch and this patch makes the code
consistent by adopting the latter form.

No changes detected by objdiff.
Signed-off-by: NIan Morris <ipm@chirality.org.uk>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

00db4124

ipv4: coding style: comparison for equality with NULL · 51456b29

由 Ian Morris 提交于 4月 03, 2015

The ipv4 code uses a mixture of coding styles. In some instances check
for NULL pointer is done as x == NULL and sometimes as !x. !x is
preferred according to checkpatch and this patch makes the code
consistent by adopting the latter form.

No changes detected by objdiff.
Signed-off-by: NIan Morris <ipm@chirality.org.uk>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

51456b29

25 3月, 2015 1 次提交

tcp: use C99 initializers in new_state[] · 0980c1e3

由 Eric Dumazet 提交于 3月 24, 2015

Signed-off-by: NEric Dumazet <edumazet@google.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

0980c1e3

06 3月, 2015 1 次提交

tcp: align tcp_xmit_size_goal() on tcp_tso_autosize() · 6c09fa09

由 Eric Dumazet 提交于 3月 05, 2015

With some mss values, it is possible tcp_xmit_size_goal() puts
one segment more in TSO packet than tcp_tso_autosize().

We send then one TSO packet followed by one single MSS.

It is not a serious bug, but we can do slightly better, especially
for drivers using netif_set_gso_max_size() to lower gso_max_size.

Using same formula avoids these corner cases and makes
tcp_xmit_size_goal() a bit faster.
Signed-off-by: NEric Dumazet <edumazet@google.com>
Fixes: 605ad7f1 ("tcp: refine TSO autosizing")
Acked-by: NNeal Cardwell <ncardwell@google.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

6c09fa09

03 3月, 2015 1 次提交

net: Remove iocb argument from sendmsg and recvmsg · 1b784140

由 Ying Xue 提交于 3月 02, 2015

After TIPC doesn't depend on iocb argument in its internal
implementations of sendmsg() and recvmsg() hooks defined in proto
structure, no any user is using iocb argument in them at all now.
Then we can drop the redundant iocb argument completely from kinds of
implementations of both sendmsg() and recvmsg() in the entire
networking stack.

Cc: Christoph Hellwig <hch@lst.de>
Suggested-by: NAl Viro <viro@ZenIV.linux.org.uk>
Signed-off-by: NYing Xue <ying.xue@windriver.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

1b784140

02 3月, 2015 1 次提交

net: use common macro for assering skb->cb[] available size in protocol families · b4772ef8

由 Eyal Birger 提交于 3月 01, 2015

As part of an effort to move skb->dropcount to skb->cb[] use a common
macro in protocol families using skb->cb[] for ancillary data to
validate available room in skb->cb[].
Signed-off-by: NEyal Birger <eyal.birger@gmail.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

b4772ef8

04 2月, 2015 1 次提交

ip: convert tcp_sendmsg() to iov_iter primitives · 57be5bda

由 Al Viro 提交于 11月 28, 2014

patch is actually smaller than it seems to be - most of it is unindenting
the inner loop body in tcp_sendmsg() itself...

the bit in tcp_input.c is going to get reverted very soon - that's what
memcpy_from_msg() will become, but not in this commit; let's keep it
reasonably contained...

There's one potentially subtle change here: in case of short copy from
userland, mainline tcp_send_syn_data() discards the skb it has allocated
and falls back to normal path, where we'll send as much as possible after
rereading the same data again. This patch trims SYN+data skb instead -
that way we don't need to copy from the same place twice.
Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>

57be5bda

10 12月, 2014 1 次提交

tcp: refine TSO autosizing · 605ad7f1

由 Eric Dumazet 提交于 12月 07, 2014

Commit 95bd09eb ("tcp: TSO packets automatic sizing") tried to
control TSO size, but did this at the wrong place (sendmsg() time)

At sendmsg() time, we might have a pessimistic view of flow rate,
and we end up building very small skbs (with 2 MSS per skb).

This is bad because :

 - It sends small TSO packets even in Slow Start where rate quickly
   increases.
 - It tends to make socket write queue very big, increasing tcp_ack()
   processing time, but also increasing memory needs, not necessarily
   accounted for, as fast clones overhead is currently ignored.
 - Lower GRO efficiency and more ACK packets.

Servers with a lot of small lived connections suffer from this.

Lets instead fill skbs as much as possible (64KB of payload), but split
them at xmit time, when we have a precise idea of the flow rate.
skb split is actually quite efficient.

Patch looks bigger than necessary, because TCP Small Queue decision now
has to take place after the eventual split.

As Neal suggested, introduce a new tcp_tso_autosize() helper, so that
tcp_tso_should_defer() can be synchronized on same goal.

Rename tp->xmit_size_goal_segs to tp->gso_segs, as this variable
contains number of mss that we can put in GSO packet, and is not
related to the autosizing goal anymore.

Tested:

40 ms rtt link

nstat >/dev/null
netperf -H remote -l -2000000 -- -s 1000000
nstat | egrep "IpInReceives|IpOutRequests|TcpOutSegs|IpExtOutOctets"

Before patch :

Recv   Send    Send
Socket Socket  Message  Elapsed
Size   Size    Size     Time     Throughput
bytes  bytes   bytes    secs.    10^6bits/s

 87380 2000000 2000000    0.36         44.22
IpInReceives                    600                0.0
IpOutRequests                   599                0.0
TcpOutSegs                      1397               0.0
IpExtOutOctets                  2033249            0.0

After patch :

Recv   Send    Send
Socket Socket  Message  Elapsed
Size   Size    Size     Time     Throughput
bytes  bytes   bytes    secs.    10^6bits/sec

 87380 2000000 2000000    0.36       44.27
IpInReceives                    221                0.0
IpOutRequests                   232                0.0
TcpOutSegs                      1397               0.0
IpExtOutOctets                  2013953            0.0
Signed-off-by: NEric Dumazet <edumazet@google.com>
Signed-off-by: NNeal Cardwell <ncardwell@google.com>
Acked-by: NYuchung Cheng <ycheng@google.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

605ad7f1