1. 08 Feb 2015 (1 commit)
    • tcp: helpers to mitigate ACK loops by rate-limiting out-of-window dupacks · 032ee423
      Committed by Neal Cardwell
      Helpers for mitigating ACK loops by rate-limiting dupacks sent in
      response to incoming out-of-window packets.
      
      This patch includes:
      
      - rate-limiting logic
      - sysctl to control how often we allow dupacks to out-of-window packets
      - SNMP counter for cases where we rate-limited our dupack sending
      
      The rate-limiting logic in this patch decides to not send dupacks in
      response to out-of-window segments if (a) they are SYNs or pure ACKs
      and (b) the remote endpoint is sending them faster than the configured
      rate limit.
      
      We rate-limit our responses rather than blocking them entirely or
      resetting the connection, because legitimate connections can rely on
      dupacks in response to some out-of-window segments. For example, zero
      window probes are typically sent with a sequence number that is below
      the current window, and ZWPs thus expect to elicit a dupack in
      response.
      
      We allow dupacks in response to TCP segments with data, because these
      may be spurious retransmissions for which the remote endpoint wants to
      receive DSACKs. This is safe because segments with data can't
      realistically be part of ACK loops, which by their nature consist of
      each side sending pure/data-less ACKs to each other.
      
      The dupack interval is controlled by a new sysctl knob,
      tcp_invalid_ratelimit, given in milliseconds, in case an administrator
      needs to dial this upward in the face of a high-rate DoS attack. The
      name and units are chosen to be analogous to the existing
      knob for ICMP, icmp_ratelimit.
      
      The default value for tcp_invalid_ratelimit is 500ms, which allows at
      most one such dupack per 500ms. This is chosen to be 2x faster than
      the 1-second minimum RTO interval allowed by RFC 6298 (section 2, rule
      2.4). We allow the extra 2x factor because network delay variations
      can cause packets sent at 1 second intervals to be compressed and
      arrive much closer.
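      
      A minimal sketch of the rate-limiting check described above (names and
      state handling are illustrative, not the exact kernel symbols):
      
          #include <linux/jiffies.h>
          #include <linux/types.h>
          
          /* Allow at most one dupack per rate-limit interval for out-of-window
           * SYNs and pure ACKs; anything more frequent is suppressed. */
          static bool oow_dupack_rate_limited(unsigned long *last_oow,
                                              unsigned long now,
                                              unsigned long min_interval)
          {
                  if (*last_oow && time_before(now, *last_oow + min_interval))
                          return true;    /* suppress this dupack */
                  *last_oow = now;        /* remember the one we allow */
                  return false;           /* ok to send the dupack */
          }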
      Reported-by: Avery Fay <avery@mixpanel.com>
      Signed-off-by: Neal Cardwell <ncardwell@google.com>
      Signed-off-by: Yuchung Cheng <ycheng@google.com>
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  2. 05 Feb 2015 (1 commit)
    • tcp: do not pace pure ack packets · 98781965
      Committed by Eric Dumazet
      When we added pacing to TCP, we decided to let sch_fq take care
      of actual pacing.
      
      All TCP had to do was to compute sk->pacing_rate using simple formula:
      
      sk->pacing_rate = 2 * cwnd * mss / rtt
      
      It works well for senders (bulk flows), but not very well for receivers
      or for RPC workloads:
      
      cwnd on the receiver can be less than 10, rtt can be around 100ms, so we
      can end up pacing ACK packets, slowing down the sender.
      
      Really, only the sender should pace, according to its own logic.
      
      Instead of adding a new bit in the skb, or calling yet another flow
      dissection, we tweak skb->truesize to a small value (2), and
      we instruct sch_fq to use a new helper and not pace pure ACKs.
      
      Note this also helps TCP Small Queues, as ACK packets present
      in the qdisc/NIC do not prevent sending a data packet (RPC workload).
      
      This helps to reduce tx completion overhead: ACK packets can use regular
      sock_wfree() instead of tcp_wfree(), which is a bit more expensive.
      
      This has no impact when packets are sent to the loopback interface,
      as we do not coalesce ACK packets there (where we would detect the
      skb->truesize lie).
      
      In case netem (with a delay) is used, skb_orphan_partial() also sets
      skb->truesize to 1.
      
      This patch is a combination of two patches we used for about one year at
      Google.
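      
      As a rough sketch, the check a pacing qdisc can now make looks like this
      (the helper name is illustrative; the idea is just the tiny-truesize marker):
      
          #include <linux/skbuff.h>
          
          /* A data-less TCP ACK was marked by the sender with a tiny
           * truesize (2), so a pacing qdisc can recognize it and skip pacing. */
          static bool looks_like_tcp_pure_ack(const struct sk_buff *skb)
          {
                  return skb->truesize == 2;
          }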
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  3. 29 Jan 2015 (1 commit)
    • tcp: stretch ACK fixes prep · e73ebb08
      Committed by Neal Cardwell
      LRO, GRO, delayed ACKs, and middleboxes can cause "stretch ACKs" that
      cover more than the RFC-specified maximum of 2 packets. These stretch
      ACKs can cause serious performance shortfalls in common congestion
      control algorithms that were designed and tuned years ago with
      receiver hosts that were not using LRO or GRO, and were instead
      politely ACKing every other packet.
      
      This patch series fixes Reno and CUBIC to handle stretch ACKs.
      
      This patch prepares for the upcoming stretch ACK bug fix patches. It
      adds an "acked" parameter to tcp_cong_avoid_ai() to allow for future
      fixes to tcp_cong_avoid_ai() to correctly handle stretch ACKs, and
      changes all congestion control algorithms to pass in 1 for the ACKed
      count. It also changes tcp_slow_start() to return the number of packet
      ACK "credits" that were not processed in slow start mode, and can be
      processed by the congestion control module in additive increase mode.
      
      In future patches we will fix tcp_cong_avoid_ai() to handle stretch
      ACKs, and fix Reno and CUBIC handling of stretch ACKs in slow start
      and additive increase mode.
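      
      A simplified sketch of the new tcp_slow_start() contract described above
      (not the exact kernel code):
      
          #include <linux/types.h>
          
          /* Grow cwnd by the acked count while below ssthresh, and return the
           * unused ACK credits for additive-increase processing. */
          static u32 slow_start_sketch(u32 *snd_cwnd, u32 ssthresh, u32 acked)
          {
                  u32 cwnd = *snd_cwnd + acked;
          
                  if (cwnd > ssthresh)
                          cwnd = ssthresh;        /* slow start ends here */
                  acked -= cwnd - *snd_cwnd;      /* credits consumed */
                  *snd_cwnd = cwnd;
                  return acked;                   /* 0 if fully consumed */
          }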
      Reported-by: Eyal Perry <eyalpe@mellanox.com>
      Signed-off-by: Neal Cardwell <ncardwell@google.com>
      Signed-off-by: Yuchung Cheng <ycheng@google.com>
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  4. 06 Jan 2015 (3 commits)
    • net: tcp: add per route congestion control · 81164413
      Committed by Daniel Borkmann
      This work adds the possibility to define a per route/destination
      congestion control algorithm. Generally, this opens up the possibility
      for a machine with different links to enforce specific congestion
      control algorithms with optimal strategies for each of them based
      on their network characteristics, even transparently for a single
      application listening on all links.
      
      For our specific use case, this additionally facilitates deployment
      of DCTCP: for example, applications can easily serve internal
      traffic/dsts with DCTCP and external ones with CUBIC. Other scenarios
      would also allow for utilizing e.g. long-lived, low-priority
      background flows for certain destinations/routes while normal
      traffic can still utilize the default congestion control
      algorithm. We also thought about a per-netns setting (where different
      defaults are possible), but given it's actually a link-specific
      property, we argue that a per route/destination setting is the most
      natural and flexible.
      
      The administrator can utilize this through ip-route(8) by appending
      "congctl [lock] <name>", where <name> denotes the name of a
      congestion control algorithm and the optional lock parameter
      enforces the given algorithm, so that applications in user space
      are not allowed to override it for that destination.
      
      The dst metric lookups are done when a dst entry is already
      available, in order to avoid a costly lookup, and still before the
      algorithms are initialized, so overhead is very low when the
      feature is not being used. While the client side needs to drop
      its current reference on the module, on the server side this can
      even be avoided, as we just got a flat-copied socket clone.
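      
      For illustration, a hypothetical invocation (prefix, gateway and device
      are made up):
      
          ip route add 192.168.10.0/24 via 192.168.10.1 dev eth0 congctl lock dctcp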
      
      Joint work with Florian Westphal.
      Suggested-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
      Signed-off-by: Florian Westphal <fw@strlen.de>
      Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: tcp: add RTAX_CC_ALGO fib handling · ea697639
      Committed by Daniel Borkmann
      This patch adds the minimum necessary for the RTAX_CC_ALGO congestion
      control metric to be set up and dumped back to user space.
      
      While the internal representation of RTAX_CC_ALGO is handled as a u32
      key, we avoided exposing this implementation detail to user space;
      instead, we chose the netlink attribute exchanged with user space
      to be the actual congestion control algorithm name, similar to
      the setsockopt(2) API, in order to allow for maximum flexibility,
      even for third-party modules.
      
      It is a bit unfortunate that RTAX_QUICKACK used up a whole RTAX slot, as
      it should have been stored in RTAX_FEATURES instead. We first thought
      about reusing it for the congestion control key, but it brings more
      complications and/or confusion than it's worth.
      
      Joint work with Florian Westphal.
      Signed-off-by: Florian Westphal <fw@strlen.de>
      Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: tcp: add key management to congestion control · c5c6a8ab
      Committed by Daniel Borkmann
      This patch adds necessary infrastructure to the congestion control
      framework for later per route congestion control support.
      
      To make per route congestion control possible, our aim is to store
      a unique u32 key identifier in the dst metrics, which can then be
      mapped to a tcp_congestion_ops struct. We argue that having an
      RTAX key entry is the simplest, most generic and easiest way to
      manage, and it also keeps the memory footprint of dst entries lower
      on 64 bit than storing a pointer directly, for example. Having a unique
      key id also allows for decoupling actual TCP congestion control
      module management from the FIB layer, i.e. we don't have to care
      about expensive module refcounting inside the FIB at this point.
      
      We first thought of using an IDR store for the realization, which
      handles dynamic assignment of unused key space and also performs
      the key-to-pointer mapping in RCU. While doing so, we stumbled upon
      the issue that, due to the nature of dynamic key distribution, it
      can happen, arguably on very rare occasions, that excessive
      module loads and unloads lead to a possible reuse of previously
      used key space. Thus, previously stale keys in the dst metric would
      be reassigned to a different congestion control algorithm,
      which might lead to unexpected behaviour. One way to resolve this
      would have been to walk FIBs on the actually rare occasion of a
      module unload and reset the metric keys for each FIB in each netns,
      but that's just very costly.
      
      Therefore, we argue a better solution is to reuse the unique
      congestion control algorithm name member and map that into u32 key
      space through jhash. For that, we split the flags attribute (as it
      currently uses only 2 bits anyway) into two u32 attributes, flags
      and key, so that we can keep the cacheline boundary of 2 cachelines
      on x86_64 and cache the precalculated key at registration time for
      the fast path. On average we might expect 2-4 modules being loaded,
      worst case perhaps 15, so the key collision possibility is extremely
      low, and guaranteed collision-free on LE/BE for all in-tree modules.
      Overall this results in much simpler code, and all without the
      overhead of an IDR. Due to the deterministic nature, modules can
      now be unloaded, the congestion control algorithm for a specific
      but unloaded key will fall back to the default one, and on module
      reload time it will switch back to the expected algorithm
      transparently.
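      
      A sketch of the name-to-key mapping described above (simplified; per the
      text, the kernel caches the result at registration time):
      
          #include <linux/jhash.h>
          #include <linux/string.h>
          
          /* jhash is deterministic, so the same algorithm name always maps to
           * the same u32 key across module unload/reload cycles. */
          static u32 cc_name_to_key(const char *name)
          {
                  return jhash(name, strlen(name), 0);
          }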
      
      Joint work with Florian Westphal.
      Signed-off-by: Florian Westphal <fw@strlen.de>
      Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  5. 05 Nov 2014 (2 commits)
    • net: allow setting ecn via routing table · f7b3bec6
      Committed by Florian Westphal
      This patch allows setting ECN on a per-route basis in case the sysctl
      tcp_ecn is not set to 1. In other words, when ECN is set for specific
      routes, it provides tcp_ecn=1 behaviour for those routes while the rest
      of the stack acts according to the global settings.
      
      One can use 'ip route change dev $dev $net features ecn' to toggle this.
      
      Having a more fine-grained per-route setting can be beneficial for various
      reasons, for example, 1) within data centers, or 2) local ISPs may deploy
      ECN support for their own video/streaming services [1], etc.
      
      There was a recent measurement study/paper [2] which scanned Alexa's
      publicly available top-million websites list from vantage points in the
      US, Europe and Asia:
      
      Half of the Alexa list will now happily use ECN (tcp_ecn=2, most likely
      thanks to commit 255cac91 ("tcp: extend ECN sysctl to allow server-side
      only ECN") ;)); an on-path break in connectivity was found in about
      1 in 10,000 cases. Timeouts, rather than RSTs being received, were much
      more common in the negotiation phase (and mostly seen in the Alexa
      middle band, ranks around 50k-150k): of the twelve thousand hosts on
      which there _may_ be ECN-linked connection failures, only 79 failed with
      an RST when ECN was requested while _not_ failing when it was not.
      
      It's unclear though, how much equipment in the wild actually marks CE
      when buffers start to fill up.
      
      We thought about a fallback to non-ECN for retransmitted SYNs as another
      global option (which could perhaps one day be made default), but as Eric
      points out, there's much more work needed to detect broken middleboxes.
      
      Two examples Eric mentioned are buggy firewalls that accept only a single
      SYN per flow, and middleboxes that successfully let an ECN flow establish,
      but later mark CE for all packets (so cwnd converges to 1).
      
       [1] http://www.ietf.org/proceedings/89/slides/slides-89-tsvarea-1.pdf, p.15
       [2] http://ecn.ethz.ch/
      
      Joint work with Daniel Borkmann.
      
      Reference: http://thread.gmane.org/gmane.linux.network/335797
      Suggested-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
      Acked-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
      Signed-off-by: Florian Westphal <fw@strlen.de>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • syncookies: split cookie_check_timestamp() into two functions · f1673381
      Committed by Florian Westphal
      The function cookie_check_timestamp(), called from both IPv4 and IPv6
      contexts, is used to decode the echoed timestamp from the SYN/ACK into
      TCP options used for follow-up communication with the peer.
      
      We can remove ECN handling from that function, split it into a separate
      one, and simply rename the original function to cookie_decode_options().
      cookie_decode_options() just fills in the tcp_options struct based on the
      echoed timestamp received from the peer. Anything that fails in this
      function will actually discard the request socket.
      
      While this is the natural place for decoding options such as ECN, which
      commit 172d69e6 ("syncookies: add support for ECN") added, we argue
      that, in particular for ECN handling, it can be checked at a later point
      in time, as the request sock would not actually need to be dropped for
      this; ECN support would merely be turned off.
      
      Therefore, we split this functionality into cookie_ecn_ok(), which tells
      us if the timestamp indicates ECN support AND the tcp_ecn sysctl is enabled.
      
      This prepares for per-route ECN support: just looking at the tcp_ecn sysctl
      won't be enough anymore at that point; if the timestamp indicates ECN
      and sysctl tcp_ecn == 0, we will also need to check the ECN dst metric.
      
      This would mean adding a route lookup to cookie_check_timestamp(), which
      we definitely want to avoid. As we already do a route lookup at a later
      point in cookie_{v4,v6}_check(), we can simply make use of that as well
      for the new cookie_ecn_ok() function w/o any additional cost.
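      
      A simplified sketch of the resulting check (per-route dst handling only
      arrives with the later patch):
      
          #include <linux/types.h>
          
          /* ECN is usable for this request iff the echoed timestamp said so
           * and the sysctl allows it; failure merely disables ECN. */
          static bool cookie_ecn_ok_sketch(bool tstamp_ecn_ok, int sysctl_tcp_ecn)
          {
                  return tstamp_ecn_ok && sysctl_tcp_ecn != 0;
          }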
      
      Joint work with Daniel Borkmann.
      Acked-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
      Signed-off-by: Florian Westphal <fw@strlen.de>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  6. 30 Oct 2014 (1 commit)
    • tcp: allow for bigger reordering level · dca145ff
      Committed by Eric Dumazet
      While testing an upcoming patch from Yaogong (converting the out-of-order
      queue into an RB tree), I hit the max reordering level of the Linux TCP
      stack.
      
      The reordering level was limited to 127 for no good reason, and some
      network setups [1] can easily reach this limit and get limited
      throughput.
      
      Allow a new max limit of 300, and add a sysctl to let admins set even
      bigger (or lower) values if needed.
      
      [1] Aggregation of links, per packet load balancing, fabrics not doing
       deep packet inspections, alternative TCP congestion modules...
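      
      Assuming the knob ends up as net.ipv4.tcp_max_reordering (the commit text
      above does not spell out the name), raising the cap could look like:
      
          sysctl -w net.ipv4.tcp_max_reordering=300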
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Cc: Yaogong Wang <wygivan@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  7. 24 Oct 2014 (1 commit)
  8. 19 Oct 2014 (1 commit)
  9. 18 Oct 2014 (3 commits)
  10. 01 Oct 2014 (1 commit)
  11. 30 Sep 2014 (1 commit)
  12. 29 Sep 2014 (6 commits)
    • net: tcp: more detailed ACK events and events for CE marked packets · 9890092e
      Committed by Florian Westphal
      DataCenter TCP (DCTCP) determines cwnd growth based on ECN information
      and ACK properties; e.g. an ACK that updates the window is treated
      differently from a DUPACK.
      
      DCTCP also needs to know whether an ACK was a delayed ACK. Furthermore,
      DCTCP implements a CE state machine that keeps track of CE markings
      of incoming packets.
      
      Therefore, extend the congestion control framework to provide these
      event types, so that DCTCP can be properly implemented as a normal
      congestion algorithm module outside of the core stack.
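      
      An illustrative sketch of the kind of per-ACK event flags such a
      framework can report to a CA module (names and values approximate):
      
          enum ca_ack_event_flags {
                  CA_ACK_SLOWPATH   = (1 << 0),   /* ACK took the slow path */
                  CA_ACK_WIN_UPDATE = (1 << 1),   /* ACK updated the window */
                  CA_ACK_ECE        = (1 << 2),   /* ACK carried the ECE flag */
          };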
      
      Joint work with Daniel Borkmann and Glenn Judd.
      Signed-off-by: Florian Westphal <fw@strlen.de>
      Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
      Signed-off-by: Glenn Judd <glenn.judd@morganstanley.com>
      Acked-by: Stephen Hemminger <stephen@networkplumber.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: tcp: split ack slow/fast events from cwnd_event · 7354c8c3
      Committed by Florian Westphal
      The congestion control ops "cwnd_event" currently supports
      CA_EVENT_FAST_ACK and CA_EVENT_SLOW_ACK events (among others).
      Both FAST_ACK and SLOW_ACK are only used by the Westwood
      congestion control algorithm.
      
      This removes both flags from cwnd_event and adds a new
      in_ack_event callback for this. The goal is to be able to
      provide more detailed information about ACKs, such as whether
      the ECE flag was set, or whether the ACK resulted in a window
      update.
      
      It is required for DataCenter TCP (DCTCP) congestion control
      algorithm as it makes a different choice depending on ECE being
      set or not.
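      
      A sketch of the new hook's shape (simplified; only the relevant member
      is shown):
      
          #include <linux/types.h>
          
          struct sock;
          
          /* CA modules receive per-ACK details via a dedicated callback
           * instead of overloading cwnd_event. */
          struct ca_ops_sketch {
                  void (*in_ack_event)(struct sock *sk, u32 ack_flags);
          };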
      
      Joint work with Daniel Borkmann and Glenn Judd.
      Signed-off-by: Florian Westphal <fw@strlen.de>
      Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
      Signed-off-by: Glenn Judd <glenn.judd@morganstanley.com>
      Acked-by: Stephen Hemminger <stephen@networkplumber.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: tcp: add flag for ca to indicate that ECN is required · 30e502a3
      Committed by Daniel Borkmann
      This patch adds a flag for TCP congestion algorithms that allows a
      congestion algorithm to request that IPv4/IPv6 sockets mark their
      packets as ECN capable, that is, ECT(0), when the algorithm requires it.
      
      It is currently used and needed in DataCenter TCP (DCTCP), as it
      requires both peers to assert ECT on all IP packets sent - it
      uses ECN feedback (i.e. CE, Congestion Experienced information)
      from switches inside the data center to derive feedback to the
      end hosts.
      
      Therefore, simply add a new flag to icsk_ca_ops. Note that DCTCP's
      algorithm/behaviour slightly diverges from RFC3168, therefore this
      is only (!) enabled if the assigned congestion control ops module
      has requested it. That way, we tightly couple this logic
      only to the provided congestion control ops.
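      
      A sketch of the flag check (mainline calls the flag TCP_CONG_NEEDS_ECN;
      the value and helper here are simplified):
      
          #include <linux/types.h>
          
          #define CA_NEEDS_ECN 0x2
          
          struct ca_ecn_ops_sketch {
                  u32 flags;
          };
          
          /* ECT(0) marking is requested only when the bound CA set the flag. */
          static bool ca_wants_ect0(const struct ca_ecn_ops_sketch *ops)
          {
                  return ops->flags & CA_NEEDS_ECN;
          }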
      
      Joint work with Florian Westphal and Glenn Judd.
      Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
      Signed-off-by: Florian Westphal <fw@strlen.de>
      Signed-off-by: Glenn Judd <glenn.judd@morganstanley.com>
      Acked-by: Stephen Hemminger <stephen@networkplumber.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: tcp: assign tcp cong_ops when tcp sk is created · 55d8694f
      Committed by Florian Westphal
      Split assignment and initialization of the congestion control ops
      into two functions.
      
      This is required by followup patches that add the Datacenter TCP
      (DCTCP) congestion control algorithm: we need to be able to
      determine if the connection is moderated by DCTCP before the
      3WHS has finished.
      
      As we walk the available congestion control list during the
      assignment, we are always guaranteed to have Reno present, as
      it is compiled in unconditionally. Therefore, since we're doing the
      early assignment, we no longer have a real use for the Reno alias
      tcp_init_congestion_ops and can thus remove it.
      
      Actual usage of the congestion control operations happens
      after the 3WHS has finished; in some cases, however, get_info()
      can be accessed via diag if implemented, therefore we
      need to zero out the private area for those modules.
      
      Joint work with Daniel Borkmann and Glenn Judd.
      Signed-off-by: Florian Westphal <fw@strlen.de>
      Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
      Signed-off-by: Glenn Judd <glenn.judd@morganstanley.com>
      Acked-by: Stephen Hemminger <stephen@networkplumber.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • tcp: change tcp_skb_pcount() location · cd7d8498
      Committed by Eric Dumazet
      Our goal is to access no more than one cache line per skb in
      a write or receive queue when doing the various walks.
      
      After recent TCP_SKB_CB() reorganizations, it is almost done.
      
      The last part is tcp_skb_pcount(), which currently uses
      skb_shinfo(skb)->gso_segs: a terrible choice, because it needs
      3 cache lines in the current kernel (skb->head, skb->end, and
      shinfo->gso_segs are in 3 different cache lines, far from skb->cb).
      
      This very simple patch reuses the space currently taken by tcp_tw_isn,
      which is used only in the input path, as tcp_skb_pcount() is only
      needed for skbs stored in the write queue.
      
      This considerably speeds up tcp_ack(), provided we avoid touching
      shinfo->tx_flags to get SKBTX_ACK_TSTAMP, which seems possible.
      
      This also speeds up all sack processing in general.
      
      This speeds up tcp_sendmsg() because it no longer has to access/dirty
      shinfo.
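      
      A sketch of the space reuse described above (simplified):
      
          #include <linux/types.h>
          
          /* The same 4 bytes in the skb control block serve the input path
           * (tcp_tw_isn) and the write queue (segment count); the two users
           * never need them at the same time. */
          union tcp_cb_word_sketch {
                  u32 tcp_tw_isn;         /* input path only */
                  u32 tcp_gso_segs;       /* write queue: tcp_skb_pcount() */
          };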
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • tcp: better TCP_SKB_CB layout to reduce cache line misses · 971f10ec
      Committed by Eric Dumazet
      TCP maintains lists of skbs in the write queue and in the receive queues
      (in-order and out-of-order queues).
      
      Scanning these lists in both the input and output paths usually requires
      access to skb->next, TCP_SKB_CB(skb)->seq, and TCP_SKB_CB(skb)->end_seq.
      
      These fields are currently in two different cache lines, meaning we
      waste a lot of memory bandwidth when these queues are big and flows
      have either packet drops or packet reorders.
      
      We can move TCP_SKB_CB(skb)->header to the end of TCP_SKB_CB, because
      this header is not used in the fast path. This allows TCP to search much
      faster in the skb lists.
      
      Even with regular flows, we save one cache line miss in the fast path.
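      
      A sketch of the reordering described (field set and header size are
      illustrative, not the real struct):
      
          #include <linux/types.h>
          
          struct tcp_cb_layout_sketch {
                  __u32 seq;                /* hot: starting sequence number */
                  __u32 end_seq;            /* hot: seq + FIN + SYN + datalen */
                  /* ... other fast-path fields ... */
                  unsigned char header[40]; /* cold IPCB/IP6CB area, moved last */
          };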
      
      Thanks to Christoph Paasch for noticing we need to clean up
      skb->cb[] (IPCB/IP6CB) before entering the IP stack in the tx path,
      and that I forgot the IPCB use in tcp_v4_hnd_req() and tcp_v4_save_options().
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  13. 28 Sep 2014 (2 commits)
  14. 06 Sep 2014 (2 commits)
  15. 15 Aug 2014 (3 commits)
    • tcp: don't allow syn packets without timestamps to pass tcp_tw_recycle logic · a26552af
      Committed by Hannes Frederic Sowa
      tcp_tw_recycle heavily relies on tcp timestamps to build a per-host
      ordering of incoming connections and teardowns without the need to
      hold state on a specific quadruple for TCP_TIMEWAIT_LEN, but only for
      the last measured RTO. To do so, we keep the last seen timestamp in a
      per-host indexed data structure and verify if the incoming timestamp
      in a connection request is strictly greater than the saved one during
      last connection teardown. Thus we can verify later on that no old data
      packets will be accepted by the new connection.
      
      When moving a socket to time-wait state, we already verify whether
      timestamps were seen on the connection. Only in that case do we let the
      time-wait socket expire after the RTO; otherwise the normal
      TCP_TIMEWAIT_LEN is used. But we don't verify this on incoming SYN
      packets. If a connection teardown was less than TCP_PAWS_MSL seconds in
      the past, we cannot guarantee not to accept data packets from an old
      connection if no timestamps are present. We should drop this SYN packet.
      This patch closes this loophole.
      
      Please note, this patch does not make tcp_tw_recycle in any way more
      usable; it only adds another safety check:
      Sporadic drops of SYN packets because of reordering in the network or
      in the socket backlog queues can happen. Users behind NAT trying to
      connect to a tcp_tw_recycle-enabled server can get caught in blackholes,
      and their connection requests may regularly get dropped, because hosts
      behind an address translator don't have synchronized TCP timestamp clocks.
      tcp_tw_recycle cannot work if peers don't have TCP timestamps enabled.
      
      In general, use of tcp_tw_recycle is discouraged.
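      
      A sketch of the added safety check (simplified to its essence):
      
          #include <linux/types.h>
          
          /* Under tcp_tw_recycle, a SYN whose options carry no timestamp
           * cannot be ordered against the last teardown, so it must be
           * dropped. */
          static bool syn_must_be_dropped(bool tw_recycle_enabled,
                                          bool saw_tstamp)
          {
                  return tw_recycle_enabled && !saw_tstamp;
          }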
      
      Cc: Eric Dumazet <eric.dumazet@gmail.com>
      Cc: Florian Westphal <fw@strlen.de>
      Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • tcp: fix tcp_release_cb() to dispatch via address family for mtu_reduced() · 4fab9071
      Committed by Neal Cardwell
      Make sure we use the correct address-family-specific function for
      handling MTU reductions from within tcp_release_cb().
      
      Previously AF_INET6 sockets were incorrectly always using the IPv6
      code path when sometimes they were handling IPv4 traffic and thus had
      an IPv4 dst.
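      
      A sketch of the dispatch change (simplified; the real hook sits in the
      address-family ops the socket is bound to):
      
          struct sock;
          
          struct af_ops_sketch {
                  void (*mtu_reduced)(struct sock *sk);
          };
          
          /* Called from tcp_release_cb() context: runs the IPv4 or IPv6
           * handler depending on which ops table the socket carries. */
          static void deferred_mtu_reduction(struct sock *sk,
                                             const struct af_ops_sketch *ops)
          {
                  ops->mtu_reduced(sk);
          }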
      Signed-off-by: Neal Cardwell <ncardwell@google.com>
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Diagnosed-by: Willem de Bruijn <willemb@google.com>
      Fixes: 563d34d0 ("tcp: dont drop MTU reduction indications")
      Reviewed-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • tcp: don't use timestamp from repaired skb-s to calculate RTT (v2) · 9d186cac
      Committed by Andrey Vagin
      We don't know the right timestamp for repaired skbs. Wrong RTT
      estimations aren't good, because some congestion modules heavily
      depend on them.
      
      This patch adds the TCPCB_REPAIRED flag, which is included in
      TCPCB_RETRANS.
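      
      A sketch of the flag relationship (the REPAIRED bit matches mainline;
      treat the other values as context):
      
          #define TCPCB_SACKED_RETRANS 0x02   /* skb retransmitted */
          #define TCPCB_REPAIRED       0x10   /* skb sent via repair mode */
          #define TCPCB_EVER_RETRANS   0x80   /* skb was ever retransmitted */
          /* A repaired skb counts as retransmitted and is thus excluded
           * from RTT sampling: */
          #define TCPCB_RETRANS        (TCPCB_SACKED_RETRANS | \
                                        TCPCB_EVER_RETRANS | TCPCB_REPAIRED)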
      
      Thanks to Eric for the advice on how to fix this issue.
      
      This patch fixes the warning:
      [  879.562947] WARNING: CPU: 0 PID: 2825 at net/ipv4/tcp_input.c:3078 tcp_ack+0x11f5/0x1380()
      [  879.567253] CPU: 0 PID: 2825 Comm: socket-tcpbuf-l Not tainted 3.16.0-next-20140811 #1
      [  879.567829] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
      [  879.568177]  0000000000000000 00000000c532680c ffff880039643d00 ffffffff817aa2d2
      [  879.568776]  0000000000000000 ffff880039643d38 ffffffff8109afbd ffff880039d6ba80
      [  879.569386]  ffff88003a449800 000000002983d6bd 0000000000000000 000000002983d6bc
      [  879.569982] Call Trace:
      [  879.570264]  [<ffffffff817aa2d2>] dump_stack+0x4d/0x66
      [  879.570599]  [<ffffffff8109afbd>] warn_slowpath_common+0x7d/0xa0
      [  879.570935]  [<ffffffff8109b0ea>] warn_slowpath_null+0x1a/0x20
      [  879.571292]  [<ffffffff816d0a05>] tcp_ack+0x11f5/0x1380
      [  879.571614]  [<ffffffff816d10bd>] tcp_rcv_established+0x1ed/0x710
      [  879.571958]  [<ffffffff816dc9da>] tcp_v4_do_rcv+0x10a/0x370
      [  879.572315]  [<ffffffff81657459>] release_sock+0x89/0x1d0
      [  879.572642]  [<ffffffff816c81a0>] do_tcp_setsockopt.isra.36+0x120/0x860
      [  879.573000]  [<ffffffff8110a52e>] ? rcu_read_lock_held+0x6e/0x80
      [  879.573352]  [<ffffffff816c8912>] tcp_setsockopt+0x32/0x40
      [  879.573678]  [<ffffffff81654ac4>] sock_common_setsockopt+0x14/0x20
      [  879.574031]  [<ffffffff816537b0>] SyS_setsockopt+0x80/0xf0
      [  879.574393]  [<ffffffff817b40a9>] system_call_fastpath+0x16/0x1b
      [  879.574730] ---[ end trace a17cbc38eb8c5c00 ]---
      
      v2: move the setting of skb->when for repaired skbs into tcp_write_xmit,
          where it is set for other skbs.
      
      Fixes: 431a9124 ("tcp: timestamp SYN+DATA messages")
      Fixes: 740b0f18 ("tcp: switch rtt estimations to usec resolution")
      Cc: Eric Dumazet <edumazet@google.com>
      Cc: Pavel Emelyanov <xemul@parallels.com>
      Cc: "David S. Miller" <davem@davemloft.net>
      Signed-off-by: Andrey Vagin <avagin@openvz.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  16. 06 Aug 2014 (1 commit)
    • tcp: reduce spurious retransmits due to transient SACK reneging · 5ae344c9
      Committed by Neal Cardwell
      This commit reduces spurious retransmits due to apparent SACK reneging
      by only reacting to SACK reneging that persists for a short delay.
      
      When a sequence space hole at snd_una is filled, some TCP receivers
      send a series of ACKs as they apparently scan their out-of-order queue
      and cumulatively ACK all the packets that have now been consecutively
      received. This is essentially misbehavior B in "Misbehaviors in TCP
      SACK generation" (ACM SIGCOMM Computer Communication Review, April
      2011), so we suspect that this is from several common OSes (Windows
      2000, Windows Server 2003, Windows XP). However, this issue has also
      been seen in other cases, e.g. the netdev thread "TCP being hoodwinked
      into spurious retransmissions by lack of timestamps?" from March 2014,
      where the receiver was thought to be a BSD box.
      
      Since snd_una would temporarily be adjacent to a previously SACKed
      range in these scenarios, this receiver behavior triggered the Linux
      SACK reneging code path in the sender. This led the sender to clear
      the SACK scoreboard, enter CA_Loss, and spuriously retransmit
      (potentially) every packet from the entire write queue at line rate
      just a few milliseconds before the ACK for each packet arrives at the
      sender.
      
      To avoid such situations, now when a sender sees apparent reneging it
      does not yet retransmit, but rather adjusts the RTO timer to give the
      receiver a little time (max(RTT/2, 10ms)) to send us some more ACKs
      that will restore sanity to the SACK scoreboard. If the reneging
      persists until this RTO then, as before, we clear the SACK scoreboard
      and enter CA_Loss.
      
      A 10ms delay tolerates a receiver sending such a stream of ACKs at
      56Kbit/sec. And to allow for receivers with slower or more congested
      paths, we wait for at least RTT/2.
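      
      The delay formula itself, as a small sketch in microseconds (the kernel
      works in jiffies and its own srtt encoding):
      
          #include <linux/types.h>
          
          #define RENEGE_MIN_DELAY_US 10000U      /* the 10ms floor */
          
          static u32 reneging_delay_us(u32 rtt_us)
          {
                  u32 half_rtt = rtt_us / 2;
          
                  return half_rtt > RENEGE_MIN_DELAY_US ?
                         half_rtt : RENEGE_MIN_DELAY_US;
          }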
      
      We validated the resulting max(RTT/2, 10ms) delay formula with a mix
      of North American and South American Google web server traffic, and
      found that for ACKs displaying transient reneging:
      
       (1) 90% of inter-ACK delays were less than 10ms
       (2) 99% of inter-ACK delays were less than RTT/2
      
      In tests on Google web servers this commit reduced reneging events by
      75%-90% (as measured by the TcpExtTCPSACKReneging counter), without
      any measurable impact on latency for user HTTP and SPDY requests.
      Signed-off-by: Neal Cardwell <ncardwell@google.com>
      Signed-off-by: Yuchung Cheng <ycheng@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  17. 16 Jul 2014 (1 commit)
  18. 08 Jul 2014 (1 commit)
    • tcp: switch snt_synack back to measuring transmit time of first SYNACK · 86c6a2c7
      Committed by Neal Cardwell
      Always store in snt_synack the time at which the server received the
      first client SYN and attempted to send the first SYNACK.
      
      The recent commit aa27fc50 ("tcp: tcp_v[46]_conn_request: fix snt_synack
      initialization") resolved an inconsistency between IPv4 and IPv6 in
      the initialization of snt_synack. This commit brings back the idea
      from 843f4a55 ("tcp: use tcp_v4_send_synack on first SYN-ACK"), which
      aimed for the original behavior of snt_synack from the commit that
      added it, 9ad7c049 ("tcp: RFC2988bis + taking RTT
      sample from 3WHS for the passive open side"), in v3.1.
      
      In addition to being simpler (and probably a tiny bit faster),
      unconditionally storing the time of the first SYNACK attempt has been
      useful because it allows calculating a performance metric quantifying
      how long it took to establish a passive TCP connection.
      Signed-off-by: Neal Cardwell <ncardwell@google.com>
      Signed-off-by: Yuchung Cheng <ycheng@google.com>
      Cc: Octavian Purdila <octavian.purdila@intel.com>
      Cc: Jerry Chu <hkchu@google.com>
      Acked-by: Octavian Purdila <octavian.purdila@intel.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  19. 28 Jun 2014 (8 commits)