Commits · aab4874355679c70f93993cf3b3fd74643b9ac33 · openeuler / raspberrypi-kernel

20 Jul, 2012 · 8 commits

net-tcp: Fast Open client - detecting SYN-data drops · aab48743

Committed by Yuchung Cheng on Jul 19, 2012

On paths with firewalls dropping SYN with data or experimental TCP options,
Fast Open connections will experience SYN timeouts and bad performance.
The solution is to track such incidents in the cookie cache and disable
Fast Open temporarily.

Since only the original SYN includes data and/or the Fast Open option, the
SYN-ACK has some tell-tale signs (tcp_rcv_fastopen_synack()) to detect
such drops. If a path has recurring Fast Open SYN drops, Fast Open is
disabled for 2^(recurring_losses) minutes, starting from four minutes up to
roughly one and a half days. sendmsg() with the MSG_FASTOPEN flag will still
succeed, but it behaves as connect() followed by write().
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
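
As a rough illustration of the back-off described in the message above, the following C sketch computes how long Fast Open would stay disabled for a given number of recurring SYN-data losses. The function name, the threshold of two losses, and the cap of 2^11 minutes are assumptions chosen to match the "four minutes up to roughly one and a half days" range; this is not the kernel's code.

```c
/* Assumed cap: 2^11 minutes = 2048 minutes, a little under 1.5 days. */
#define TFO_MAX_BACKOFF_SHIFT 11u

/*
 * Hypothetical helper: minutes for which Fast Open stays disabled on a path.
 * Returns 0 when Fast Open is not disabled at all.
 */
unsigned int tfo_disable_minutes(unsigned int recurring_losses)
{
    /* Assumption: a single loss is not yet "recurring". */
    if (recurring_losses < 2)
        return 0;

    /* Clamp the exponent so the period stops growing at the cap. */
    if (recurring_losses > TFO_MAX_BACKOFF_SHIFT)
        recurring_losses = TFO_MAX_BACKOFF_SHIFT;

    return 1u << recurring_losses;   /* 2^(recurring_losses) minutes */
}
```

With these assumptions, two losses give 4 minutes and eleven or more losses give 2048 minutes.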

net-tcp: Fast Open client - sendmsg(MSG_FASTOPEN) · cf60af03

Committed by Yuchung Cheng on Jul 19, 2012

sendmsg() (or sendto()) with MSG_FASTOPEN is a combo of connect(2)
and write(2). The application should replace connect() with it to
send data in the opening SYN packet.

For a blocking socket, sendmsg() blocks until all the data are buffered
locally and the handshake is completed, like a connect() call. It
returns an errno similar to connect()'s if the TCP handshake fails.

For a non-blocking socket, it returns the number of bytes queued (and
transmitted in the SYN-data packet) if a cookie is available. If a cookie
is not available, it transmits a data-less SYN packet with the Fast Open
cookie request option and returns -EINPROGRESS like connect().

Using MSG_FASTOPEN on a connecting or connected socket will result in
an errno similar to that of repeated connect() calls. Therefore the
application should only use this flag on new sockets.

The buffer size of sendmsg() is independent of the MSS of the connection.
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
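
From the application side, the call pattern described above might look like the following C sketch. The wrapper name tfo_connect_send() is hypothetical; sendto() and the MSG_FASTOPEN flag are the real interfaces, and the fallback #define assumes the value used by the Linux UAPI headers.

```c
#include <errno.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <sys/types.h>

#ifndef MSG_FASTOPEN
#define MSG_FASTOPEN 0x20000000 /* value from the Linux socket headers */
#endif

/*
 * Hypothetical wrapper: replaces connect() + write() on a NEW TCP socket.
 * Returns the number of bytes queued, or -1 with errno set. On a
 * non-blocking socket without a cached cookie, errno is expected to be
 * EINPROGRESS, as for connect().
 */
ssize_t tfo_connect_send(int fd, const struct sockaddr *dst,
                         socklen_t dst_len, const void *buf, size_t len)
{
    return sendto(fd, buf, len, MSG_FASTOPEN, dst, dst_len);
}
```

A caller would create the socket with socket(AF_INET, SOCK_STREAM, 0), fill in the destination address, and call this helper instead of connect(). Whether data actually travels in the SYN depends on the tcp_fastopen sysctl and on a cookie being cached.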

net-tcp: Fast Open client - sending SYN-data · 783237e8

Committed by Yuchung Cheng on Jul 19, 2012

This patch implements sending SYN-data in tcp_connect(). The data is
from tcp_sendmsg() with flag MSG_FASTOPEN (implemented in a later patch).

The length of the cookie in tcp_fastopen_req, init'd to 0, controls the
type of the SYN. If the cookie is not cached (len==0), the host sends
a data-less SYN with the Fast Open cookie request option to solicit a cookie
from the remote. If the cookie is available (len > 0), the host sends
a SYN-data with the Fast Open cookie option. If the cookie length is negative,
the SYN will not include any Fast Open option (for fallback operations).

To deal with middleboxes that may drop SYN with data or experimental TCP
options, the SYN-data is only sent once. SYN retransmits do not include
data or Fast Open options. The connection will fall back to a regular TCP
handshake.
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
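
The three cases keyed on the cookie length can be summarized in a small C sketch. The enum and function names are hypothetical and only mirror the rules stated in the message above; they are not taken from the patch.

```c
/* Hypothetical labels for the three kinds of SYN described above. */
enum tfo_syn_type {
    TFO_SYN_PLAIN,          /* no Fast Open option at all (fallback)   */
    TFO_SYN_COOKIE_REQUEST, /* data-less SYN asking the peer for a cookie */
    TFO_SYN_DATA            /* SYN carrying data plus the cached cookie */
};

/* Map the cookie length stored in the request to the SYN type. */
enum tfo_syn_type tfo_syn_type_for(int cookie_len)
{
    if (cookie_len < 0)
        return TFO_SYN_PLAIN;
    if (cookie_len == 0)
        return TFO_SYN_COOKIE_REQUEST;
    return TFO_SYN_DATA;
}
```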

net-tcp: Fast Open client - cookie cache · 1fe4c481

Committed by Yuchung Cheng on Jul 19, 2012

With help from Eric Dumazet, add Fast Open metrics to the TCP metrics cache.
The basic ones are MSS and the cookies. A later patch will cache more to
handle unfriendly middleboxes.
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net-tcp: Fast Open base · 2100c8d2

Committed by Yuchung Cheng on Jul 19, 2012

This patch implements the common code for both the client and server.

1. TCP Fast Open option processing. Since Fast Open does not have an
   option number assigned by IANA yet, it shares the experimental option
   code 254 by implementing draft-ietf-tcpm-experimental-options
   with a 16-bit magic number 0xF989. This enables global experiments
   without clashing over the two scarce experimental option codes
   available for TCP.

   When the draft status becomes standard (maybe), the client should
   switch to the newly assigned option number while the server supports
   both numbers for the transition.

2. The new sysctl tcp_fastopen

3. A placeholder init function
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
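
To make the shared experimental option concrete, here is a minimal C sketch of how such an option could be laid out on the wire: kind 254, a length byte, the 16-bit magic 0xF989 in network byte order, then the cookie. The function and macro names are hypothetical, and the layout follows the general scheme of the experimental-options draft rather than being copied from the patch.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define TFO_OPT_KIND_EXP 254u    /* shared experimental option code  */
#define TFO_OPT_MAGIC    0xF989u /* 16-bit magic identifying Fast Open */
#define TFO_OPT_BASE_LEN 4u      /* kind + length + 2 magic bytes     */
#define TCP_MAX_OPT_SPACE 40u    /* TCP header option space limit     */

/*
 * Hypothetical encoder. cookie_len == 0 yields a cookie request.
 * Returns the number of bytes written, or 0 if the option does not fit.
 */
size_t tfo_encode_option(uint8_t *out, size_t out_len,
                         const uint8_t *cookie, size_t cookie_len)
{
    size_t total = TFO_OPT_BASE_LEN + cookie_len;

    if (total > out_len || total > TCP_MAX_OPT_SPACE)
        return 0;

    out[0] = TFO_OPT_KIND_EXP;
    out[1] = (uint8_t)total;                  /* length covers whole option */
    out[2] = (uint8_t)(TFO_OPT_MAGIC >> 8);   /* magic, high byte first     */
    out[3] = (uint8_t)(TFO_OPT_MAGIC & 0xFF);
    if (cookie_len > 0)
        memcpy(out + TFO_OPT_BASE_LEN, cookie, cookie_len);

    return total;
}
```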

net: Fix warnings in dst_ops.h · d8f1641b

Committed by David S. Miller on Jul 19, 2012

include/net/dst_ops.h:28:20: warning: ‘struct sock’ declared inside parameter list
Signed-off-by: David S. Miller <davem@davemloft.net>

ipv4: tcp: remove per net tcp_sock · be9f4a44

Committed by Eric Dumazet on Jul 19, 2012

tcp_v4_send_reset() and tcp_v4_send_ack() use a single socket
per network namespace.

This leads to bad behavior on multiqueue NICs, because many CPUs
contend for the socket lock, and once the socket lock is acquired, extra
false sharing on various socket fields slows down the operations.

To better resist attacks, we use a percpu socket. Each CPU can
run without contention, using appropriate memory (local node).

Additional features :

1) We also mirror the queue_mapping of the incoming skb, so that
answers use the same queue if possible.

2) Setting the SOCK_USE_WRITE_QUEUE socket flag speeds up sock_wfree()

3) We now limit the number of in-flight RST/ACK [1] packets
per cpu, instead of per namespace, and we honor the sysctl_wmem_default
limit dynamically. (Prior to this patch, sysctl_wmem_default value was
copied at boot time, so any further change would not affect tcp_sock
limit)

[1] These packets are only generated when no socket was matched for
the incoming packet.
Reported-by: Bill Sommerfeld <wsommerfeld@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Tom Herbert <therbert@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

ipv4: use seqlock for nh_exceptions · aee06da6

Committed by Julian Anastasov on Jul 18, 2012

Use global seqlock for the nh_exceptions. Call
fnhe_oldest with the right hash chain. Correct the diff
value for dst_set_expires.

v2: after suggestions from Eric Dumazet:
* get rid of spin lock fnhe_lock, rearrange update_or_create_fnhe
* continue daddr search in rt_bind_exception

v3:
* remove the daddr check before seqlock in rt_bind_exception
* restart lookup in rt_bind_exception on detected seqlock change,
as suggested by David Miller
Signed-off-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: David S. Miller <davem@davemloft.net>

19 Jul, 2012 · 2 commits

ipv6: add ipv6_addr_hash() helper · ddbe5032

Committed by Eric Dumazet on Jul 18, 2012

Introduce an ipv6_addr_hash() helper doing an XOR on all bits
of an IPv6 address, with an optimized x86_64 version.

Use it in flow dissector, as suggested by Andrew McGregor,
to reduce hash collision probabilities in fq_codel (and other
users of flow dissector)

Use it in ip6_tunnel.c and use more bit shuffling, as suggested
by David Laight, as existing hash was ignoring most of them.

Use it in sunrpc and use more bit shuffling, using hash_32().

Use it in net/ipv6/addrconf.c, using hash_32() as well.

As a cleanup, use it in net/ipv4/tcp_metrics.c
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Andrew McGregor <andrewmcgr@gmail.com>
Cc: Dave Taht <dave.taht@gmail.com>
Cc: Tom Herbert <therbert@google.com>
Cc: David Laight <David.Laight@ACULAB.COM>
Cc: Joe Perches <joe@perches.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
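
The idea of XOR-ing all bits of an IPv6 address into a 32-bit value can be sketched in portable userspace C as follows. The function names and the byte-array interface are assumptions made for illustration; the second variant folds two 64-bit words, which is one plausible shape for a 64-bit optimization and yields the same result.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: XOR the four 32-bit words of a 16-byte address. */
uint32_t addr6_xor_hash(const uint8_t addr[16])
{
    uint32_t w[4];

    memcpy(w, addr, sizeof(w));   /* avoids unaligned access issues */
    return w[0] ^ w[1] ^ w[2] ^ w[3];
}

/* Same result using two 64-bit loads and a final fold to 32 bits. */
uint32_t addr6_xor_hash64(const uint8_t addr[16])
{
    uint64_t q[2];
    uint64_t x;

    memcpy(q, addr, sizeof(q));
    x = q[0] ^ q[1];
    return (uint32_t)(x ^ (x >> 32));
}
```

A plain XOR is cheap but weak on its own, which is consistent with the message's note that some callers add further bit shuffling such as hash_32().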

net/ipv4: VTI support rx-path hook in xfrm4_mode_tunnel. · eb8637cd

Committed by Saurabh on Jul 17, 2012

Incorporated David and Steffen's comments.
Add hook for rx-path xfrm4_mode_tunnel for VTI tunnel module.
Signed-off-by: Saurabh Mohan <saurabh.mohan@vyatta.com>
Reviewed-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

18 Jul, 2012 · 1 commit

ipv6: fix inet6_csk_xmit() · d3818c92

Committed by Eric Dumazet on Jul 17, 2012

We should provide a struct flowi6 pointer to inet6_csk_route_socket(),
so that inet6_csk_xmit() works correctly instead of sending garbage.

Also add some consts.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Yuchung Cheng <ycheng@google.com>
Cc: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

17 Jul, 2012 · 4 commits

ipv4: Add FIB nexthop exceptions. · 4895c771

Committed by David S. Miller on Jul 17, 2012

In a regime where we have subnetted route entries, we need
persistent storage for destination-specific learned values
such as redirects and PMTU values.

This is implemented here via nexthop exceptions.

The initial implementation is a 2048 entry hash table with reclaiming
starting at chain length 5.  A more sophisticated scheme can be
devised if that proves necessary.
Signed-off-by: David S. Miller <davem@davemloft.net>

net: Pass optional SKB and SK arguments to dst_ops->{update_pmtu,redirect}() · 6700c270

Committed by David S. Miller on Jul 17, 2012

This will be used so that we can compose a full flow key.

Even though we have a route in this context, we need more. In the
future the routes will be without destination address, source address,
etc. keying. One ipv4 route will cover entire subnets, etc.

In this environment we have to have a way to possess persistent storage
for redirects and PMTU information. This persistent storage will exist
in the FIB tables, and that's why we'll need to be able to rebuild a
full lookup flow key here. Using that flow key will do a fib_lookup()
and create/update the persistent entry.
Signed-off-by: David S. Miller <davem@davemloft.net>

tcp: implement RFC 5961 3.2 · 282f23c6

Committed by Eric Dumazet on Jul 17, 2012

Implement the RFC 5961 mitigation against the Blind
Reset attack using the RST bit.

The idea is to validate the incoming RST sequence number
so that it matches the RCV.NXT value exactly, instead of accepting
anything in the previously accepted window: (RCV.NXT <= SEG.SEQ < RCV.NXT+RCV.WND)

If the sequence is in window but not an exact match, send
a "challenge ACK", so that the other party can resend an
RST with the appropriate sequence.

Add a new sysctl, tcp_challenge_ack_limit, to limit the
number of challenge ACKs sent per second.

Add a new SNMP counter to count the number of challenge ACKs sent.
(netstat -s | grep TCPChallengeACK)
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Kiran Kumar Kella <kkiran@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
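
The RST validation rule described above can be sketched as a small decision function in C. The enum and function names are hypothetical; the comparison uses wrap-safe 32-bit sequence arithmetic, and the rate limiting controlled by tcp_challenge_ack_limit is deliberately left out.

```c
#include <stdint.h>

/* Hypothetical outcome of checking an incoming RST segment. */
enum rst_action {
    RST_DROP,          /* outside the receive window: ignore         */
    RST_ACCEPT,        /* exact match with RCV.NXT: reset connection */
    RST_CHALLENGE_ACK  /* in window but not exact: send challenge ACK */
};

/* Wrap-safe "a is before b" for 32-bit TCP sequence numbers. */
static int seq_before(uint32_t a, uint32_t b)
{
    return (int32_t)(a - b) < 0;
}

enum rst_action classify_rst(uint32_t seg_seq, uint32_t rcv_nxt,
                             uint32_t rcv_wnd)
{
    if (seg_seq == rcv_nxt)
        return RST_ACCEPT;

    /* RCV.NXT <= SEG.SEQ < RCV.NXT + RCV.WND, modulo 2^32 */
    if (!seq_before(seg_seq, rcv_nxt) &&
        seq_before(seg_seq, rcv_nxt + rcv_wnd))
        return RST_CHALLENGE_ACK;

    return RST_DROP;
}
```

A legitimate peer that receives the challenge ACK learns the expected sequence number and can resend an RST that matches exactly, while a blind attacker has to guess a single value instead of any value in the window.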

net: make sock diag per-namespace · 51d7cccf

Committed by Andrey Vagin on Jul 16, 2012

Before this patch sock_diag works for init_net only and dumps
information about sockets from all namespaces.

This patch extends sock_diag to all namespaces.
It creates a netlink kernel socket for each netns and filters
data during dumping.

v2: filter according to netns in all places
    remove an unused variable.

Cc: "David S. Miller" <davem@davemloft.net>
Cc: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru>
Cc: James Morris <jmorris@namei.org>
Cc: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org>
Cc: Patrick McHardy <kaber@trash.net>
Cc: Pavel Emelyanov <xemul@parallels.com>
CC: Eric Dumazet <eric.dumazet@gmail.com>
Cc: linux-kernel@vger.kernel.org
Cc: netdev@vger.kernel.org
Signed-off-by: Andrew Vagin <avagin@openvz.org>
Acked-by: Pavel Emelyanov <xemul@parallels.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

16 Jul, 2012 · 3 commits

sctp: Adjust PMTU updates to accommodate route invalidation. · 02f3d4ce

Committed by David S. Miller on Jul 16, 2012

This adjusts the call to dst_ops->update_pmtu() so that we can
transparently handle the fact that, in the future, the dst itself can
be invalidated by the PMTU update (when we have non-host routes cached
in sockets).
Signed-off-by: David S. Miller <davem@davemloft.net>

ipv6: Add helper inet6_csk_update_pmtu(). · 35ad9b9c

Committed by David S. Miller on Jul 16, 2012

This is the ipv6 version of inet_csk_update_pmtu().
Signed-off-by: David S. Miller <davem@davemloft.net>

ipv4: Add helper inet_csk_update_pmtu(). · 80d0a69f

Committed by David S. Miller on Jul 16, 2012

This abstracts away the call to dst_ops->update_pmtu() so that we can
transparently handle the fact that, in the future, the dst itself can
be invalidated by the PMTU update (when we have non-host routes cached
in sockets).

So we try to rebuild the socket cached route after the method
invocation if necessary.

This isn't used by SCTP because it needs to cache dsts per-transport,
and thus will need its own local version of this helper.
Signed-off-by: David S. Miller <davem@davemloft.net>

13 Jul, 2012 · 2 commits

ipv4: Don't store a rule pointer in fib_result. · 85b91b03

Committed by David S. Miller on Jul 13, 2012

We only use it to fetch the rule's tclassid, so just store the
tclassid there instead.

This also decreases the size of fib_result by a full 8 bytes on
64-bit.  On 32-bit it's a wash.
Signed-off-by: David S. Miller <davem@davemloft.net>

ipv4: Remove tb_peers from fib_table. · 391e5c22

Committed by David S. Miller on Jul 12, 2012

No longer used.
Signed-off-by: David S. Miller <davem@davemloft.net>

12 Jul, 2012 · 10 commits

ipv6: Use icmpv6_notify() to propagate redirect, instead of rt6_redirect(). · b94f1c09

Committed by David S. Miller on Jul 12, 2012

And delete rt6_redirect(), since it is no longer used.
Signed-off-by: David S. Miller <davem@davemloft.net>

ipv6: Add redirect support to all protocol icmp error handlers. · ec18d9a2

Committed by David S. Miller on Jul 12, 2012

Signed-off-by: David S. Miller <davem@davemloft.net>

ipv6: Add ip6_redirect() and ip6_sk_redirect() helper functions. · 3a5ad2ee

Committed by David S. Miller on Jul 12, 2012

Signed-off-by: David S. Miller <davem@davemloft.net>

ipv6: Move bulk of redirect handling into rt6_redirect(). · e8599ff4

Committed by David S. Miller on Jul 11, 2012

This sets things up so that we can have the protocol error handlers
call down into the ipv6 route code for redirects just as ipv4 already
does.
Signed-off-by: David S. Miller <davem@davemloft.net>

ipv6: Export ndisc option parsing from ndisc.c · 30f2a5f3

Committed by David S. Miller on Jul 11, 2012

This is going to be used internally by the rt6 redirect code.
Signed-off-by: David S. Miller <davem@davemloft.net>

ipv4: Kill ip_rt_redirect(). · 1f42539d

Committed by David S. Miller on Jul 11, 2012

No longer needed, as the protocol handlers now all properly
propagate the redirect back into the routing code.
Signed-off-by: David S. Miller <davem@davemloft.net>

ipv4: Add ipv4_redirect() and ipv4_sk_redirect() helper functions. · b42597e2

Committed by David S. Miller on Jul 11, 2012

Signed-off-by: David S. Miller <davem@davemloft.net>

ipv4: Generalize ip_do_redirect() and hook into new dst_ops->redirect. · e47a185b

Committed by David S. Miller on Jul 11, 2012

All of the redirect acceptance policy is now contained within.
Signed-off-by: David S. Miller <davem@davemloft.net>

ipv4: Rearrange arguments to ip_rt_redirect() · 94206125

Committed by David S. Miller on Jul 11, 2012

Pass in the SKB rather than just the IP addresses, so that policy
and other aspects can reside in ip_rt_redirect() rather than
icmp_redirect().
Signed-off-by: David S. Miller <davem@davemloft.net>

tcp: TCP Small Queues · 46d3ceab

Committed by Eric Dumazet on Jul 11, 2012

This introduces TSQ (TCP Small Queues).

The goal of TSQ is to reduce the number of TCP packets in xmit queues (qdisc &
device queues), to reduce RTT and cwnd bias, part of the bufferbloat
problem.

sk->sk_wmem_alloc is not allowed to grow above a given limit,
allowing no more than ~128KB [1] per tcp socket in qdisc/dev layers at a
given time.

TSO packets are sized/capped to half the limit, so that we have two
TSO packets in flight, allowing better bandwidth use.

As a side effect, setting the limit to 40000 automatically reduces the
standard gso max limit (65536) to 40000/2 : It can help to reduce
latencies of high prio packets, having smaller TSO packets.

This means we divert sock_wfree() to a tcp_wfree() handler, to
queue/send following frames when skb_orphan() [2] is called for the
already queued skbs.

Results on my dev machines (tg3/ixgbe nics) are really impressive,
using standard pfifo_fast, and with or without TSO/GSO.

Without reduction of nominal bandwidth, we have reduction of buffering
per bulk sender :
< 1ms on Gbit (instead of 50ms with TSO)
< 8ms on 100Mbit (instead of 132 ms)

I no longer have 4 MBytes backlogged in qdisc by a single netperf
session, and socket autotuning on both sides no longer uses 4 MBytes.

As the skb destructor cannot restart xmit itself (as the qdisc lock might be
taken at this point), we delegate the work to a tasklet. We use one
tasklet per cpu for performance reasons.

If tasklet finds a socket owned by the user, it sets TSQ_OWNED flag.
This flag is tested in a new protocol method called from release_sock(),
to eventually send new segments.

[1] New /proc/sys/net/ipv4/tcp_limit_output_bytes tunable
[2] skb_orphan() is usually called at TX completion time,
  but some drivers call it in their start_xmit() handler.
  These drivers should at least use BQL, or else a single TCP
  session can still fill the whole NIC TX ring, since TSQ will
  have no effect.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Dave Taht <dave.taht@bufferbloat.net>
Cc: Tom Herbert <therbert@google.com>
Cc: Matt Mathis <mattmathis@google.com>
Cc: Yuchung Cheng <ycheng@google.com>
Cc: Nandita Dukkipati <nanditad@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
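
The two limits mentioned in the TSQ message, the per-socket byte limit and the TSO size capped to half of it, can be illustrated with a short C sketch. The function names are hypothetical and the throttle condition is a simplified reading of the message, not the kernel's implementation; 65536 is the standard GSO maximum quoted in the text.

```c
#include <stdbool.h>
#include <stdint.h>

#define GSO_MAX_BYTES 65536u /* standard gso max limit from the message */

/* Largest TSO packet allowed: half the limit, never above the GSO max. */
uint32_t tsq_tso_cap(uint32_t limit_output_bytes)
{
    uint32_t half = limit_output_bytes / 2;

    return half < GSO_MAX_BYTES ? half : GSO_MAX_BYTES;
}

/*
 * Assumed throttle rule: stop queueing more segments once the bytes
 * already sitting in qdisc/device queues reach the limit.
 */
bool tsq_should_throttle(uint32_t wmem_alloc, uint32_t limit_output_bytes)
{
    return wmem_alloc >= limit_output_bytes;
}
```

With a limit of 40000 this gives a TSO cap of 20000 bytes, matching the 40000/2 example in the message, so roughly two TSO packets per socket can be in flight below the TCP layer.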

11 Jul, 2012 · 10 commits

ipv6: optimize ipv6 addresses compares · 1a203cb3

Committed by Eric Dumazet on Jul 10, 2012

On 64-bit arches having efficient unaligned accesses (e.g. x86_64) we can
use long words to reduce the number of instructions for free.

Joe Perches suggested changing ipv6_masked_addr_cmp() to return a bool
instead of 'int', to make sure ipv6_masked_addr_cmp() cannot be used
in a sorting function.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Joe Perches <joe@perches.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
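
A userspace C sketch of comparing IPv6 addresses with two 64-bit words instead of four 32-bit words is shown below. The function names and byte-array interface are hypothetical; memcpy() is used so the code stays portable, whereas the message targets architectures where direct unaligned 64-bit loads are cheap. The masked variant returns a bool meaning "differs", which only answers equality and therefore cannot be misused as an ordering function.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* True when the two 16-byte addresses are identical. */
bool addr6_equal(const uint8_t a[16], const uint8_t b[16])
{
    uint64_t a64[2], b64[2];

    memcpy(a64, a, sizeof(a64));
    memcpy(b64, b, sizeof(b64));
    return ((a64[0] ^ b64[0]) | (a64[1] ^ b64[1])) == 0;
}

/* True when a and b differ in any bit selected by mask. */
bool addr6_masked_differs(const uint8_t a[16], const uint8_t mask[16],
                          const uint8_t b[16])
{
    uint64_t a64[2], m64[2], b64[2];

    memcpy(a64, a, sizeof(a64));
    memcpy(m64, mask, sizeof(m64));
    memcpy(b64, b, sizeof(b64));
    return (((a64[0] ^ b64[0]) & m64[0]) |
            ((a64[1] ^ b64[1]) & m64[1])) != 0;
}
```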

ipv4: Remove inetpeer from routes. · f185071d

Committed by David S. Miller on Jul 10, 2012

No longer used.
Signed-off-by: David S. Miller <davem@davemloft.net>

ipv4: Maintain redirect and PMTU info in struct rtable again. · 5943634f

Committed by David S. Miller on Jul 10, 2012

Maintaining this in the inetpeer entries was not the right way to do
this at all.
Signed-off-by: David S. Miller <davem@davemloft.net>

inet: Kill FLOWI_FLAG_PRECOW_METRICS. · 3e12939a

Committed by David S. Miller on Jul 10, 2012

No longer needed.  TCP writes metrics, but now in its own special
cache that does not dirty the route metrics.  Therefore there is no
longer any reason to pre-cow metrics in this way.
Signed-off-by: David S. Miller <davem@davemloft.net>

inet: Remove ->get_peer() method. · 16d18399

Committed by David S. Miller on Jul 10, 2012

No longer used.
Signed-off-by: David S. Miller <davem@davemloft.net>

tcp: Move timestamps from inetpeer to metrics cache. · 81166dd6

Committed by David S. Miller on Jul 10, 2012

With help from Lin Ming.
Signed-off-by: David S. Miller <davem@davemloft.net>

net: Kill set_dst_metric_rtt(). · 94334d5e

Committed by David S. Miller on Jul 10, 2012

No longer used.
Signed-off-by: David S. Miller <davem@davemloft.net>

tcp: Maintain dynamic metrics in local cache. · 51c5d0c4

Committed by David S. Miller on Jul 10, 2012

Maintain a local hash table of TCP dynamic metrics blobs.

Computed TCP metrics are no longer maintained in the route metrics.

The table uses RCU and an extremely simple hash so that it has low
latency and low overhead.  A simple hash is legitimate because we only
make metrics blobs for fully established connections.

The default hash table sizes, metric timeouts, and the hash chain
length limit could certainly use some tweaking.  But
the basic design seems sound.

With help from Eric Dumazet and Joe Perches.
Signed-off-by: David S. Miller <davem@davemloft.net>

tcp: Abstract back handling peer aliveness test into helper function. · ab92bb2f

Committed by David S. Miller on Jul 09, 2012

Signed-off-by: David S. Miller <davem@davemloft.net>

tcp: Move dynamic metrics handling into separate file. · 4aabd8ef

Committed by David S. Miller on Jul 09, 2012

Signed-off-by: David S. Miller <davem@davemloft.net>