Commits · eaf6dc03388d5ea7b4151cf55cfc3370c2f9884c · openanolis / cloud-kernel

07 Aug, 2017 1 commit

tcp: fix cwnd undo in Reno and HTCP congestion controls · 4faf7839

Committed by Yuchung Cheng on Aug 03, 2017

Using ssthresh to revert cwnd is less reliable when ssthresh is
bounded to 2 packets. This patch uses an existing variable in TCP
"prior_cwnd" that snapshots the cwnd right before entering fast
recovery and RTO recovery in Reno. This fixes the issue discussed
in netdev thread: "A buggy behavior for Linux TCP Reno and HTCP"
https://www.spinics.net/lists/netdev/msg444955.html

Suggested-by: Neal Cardwell <ncardwell@google.com>
Reported-by: Wei Sun <unlcsewsun@gmail.com>
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

4faf7839
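
To illustrate why reverting cwnd from ssthresh is unreliable once ssthresh is clamped to 2 packets, here is a minimal sketch in plain C. It is not the kernel code: `struct toy_tcp` and every `toy_*` helper are hypothetical stand-ins, and the "double ssthresh" undo is only an assumed model of the older behaviour. The snapshot-based undo follows the idea described in the commit message above.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical, simplified stand-in for the kernel's per-socket state. */
struct toy_tcp {
    uint32_t snd_cwnd;     /* current congestion window, in packets */
    uint32_t snd_ssthresh; /* slow-start threshold, in packets */
    uint32_t prior_cwnd;   /* cwnd snapshot taken before recovery */
};

static uint32_t toy_max(uint32_t a, uint32_t b)
{
    return a > b ? a : b;
}

/* Reno-style reaction to loss: halve the window, but never below 2. */
void toy_enter_recovery(struct toy_tcp *tp)
{
    tp->prior_cwnd = tp->snd_cwnd;
    tp->snd_ssthresh = toy_max(tp->snd_cwnd / 2, 2);
    tp->snd_cwnd = tp->snd_ssthresh;
}

/* Assumed older approach: guess the previous cwnd as twice ssthresh. */
uint32_t toy_undo_from_ssthresh(const struct toy_tcp *tp)
{
    return toy_max(tp->snd_cwnd, tp->snd_ssthresh * 2);
}

/* Snapshot approach: restore exactly what was saved before recovery. */
uint32_t toy_undo_from_prior_cwnd(const struct toy_tcp *tp)
{
    return toy_max(tp->snd_cwnd, tp->prior_cwnd);
}

/* Small usage example: a window of 3 packets hits the ssthresh floor. */
void toy_undo_demo(void)
{
    struct toy_tcp tp = { 3, 0, 0 };

    toy_enter_recovery(&tp);
    assert(tp.snd_ssthresh == 2);              /* clamped, not 3 / 2 = 1 */
    assert(toy_undo_from_ssthresh(&tp) == 4);  /* overshoots the old cwnd */
    assert(toy_undo_from_prior_cwnd(&tp) == 3);
}
```

With small windows the ssthresh-based guess can come back larger than the window that was actually in use, which is the kind of unreliability the commit refers to.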

04 Aug, 2017 1 commit

tcp: remove extra POLL_OUT added for finished active connect() · d06c3583

Committed by Neal Cardwell on Aug 02, 2017

Commit 45f119bf ("tcp: remove header prediction") introduced a
minor bug: the sk_state_change() and sk_wake_async() notifications for
a completed active connection happen twice: once in this new spot
inside tcp_finish_connect() and once in the existing code in
tcp_rcv_synsent_state_process() immediately after it calls
tcp_finish_connect(). This commit removes the duplicate POLL_OUT
notifications.

Fixes: 45f119bf ("tcp: remove header prediction")
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Cc: Florian Westphal <fw@strlen.de>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Yuchung Cheng <ycheng@google.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

d06c3583

03 Aug, 2017 1 commit

tcp: tcp_data_queue() cleanup · 5357f0bd

Committed by Eric Dumazet on Aug 01, 2017

Commit c13ee2a4 ("tcp: reindent two spots after prequeue removal")
removed code in tcp_data_queue().

We can go a little farther, removing an always true test,
and removing initializers for fragstolen and eaten variables.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>

5357f0bd

01 Aug, 2017 4 commits

tcp: remove CA_ACK_SLOWPATH · 573aeb04

Committed by Florian Westphal on Jul 30, 2017

re-indent tcp_ack, and remove CA_ACK_SLOWPATH; it is always set now.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>

573aeb04

tcp: remove header prediction · 45f119bf

Committed by Florian Westphal on Jul 30, 2017

Like prequeue, I am not sure this is overly useful nowadays.

If we receive a train of packets, GRO will aggregate them if the
headers are the same (HP predates GRO by several years) so we don't
get a per-packet benefit, only a per-aggregated-packet one.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>

45f119bf

tcp: reindent two spots after prequeue removal · c13ee2a4

Committed by Florian Westphal on Jul 30, 2017

These two branches are now always true, remove the conditional.
objdiff shows no changes.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>

c13ee2a4

tcp: remove prequeue support · e7942d06

Committed by Florian Westphal on Jul 30, 2017

prequeue is a tcp receive optimization that moves part of rx processing
from bh to process context.

This only works if the socket being processed belongs to a process that
is blocked in recv on that socket.

In practice, this doesn't happen that often anymore because nowadays
servers tend to use an event-driven (epoll) model.

Even normal client applications (web browsers) commonly use many tcp
connections in parallel.

This has measurable impact only in netperf (which uses plain recv and
thus allows prequeue use) from host to locally running vm (~4%); however,
there were no changes when using netperf between two physical hosts with
ixgbe interfaces.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>

e7942d06

25 Jul, 2017 1 commit

tcp: remove redundant argument from tcp_rcv_established() · e42e24c3

Committed by Matvejchikov Ilya on Jul 24, 2017

The last (4th) argument of tcp_rcv_established() is redundant as it
always equals skb->len and the skb itself is always passed as the 2nd
argument. There is no reason to have it.
Signed-off-by: Ilya V. Matveychikov <matvejchikov@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

e42e24c3

02 Jul, 2017 3 commits

bpf: Add support for changing congestion control · 91b5b21c

Committed by Lawrence Brakmo on Jun 30, 2017

Added support for changing congestion control for SOCK_OPS bpf
programs through the setsockopt bpf helper function. It also adds
a new SOCK_OPS op, BPF_SOCK_OPS_NEEDS_ECN, that is needed for
congestion controls, like dctcp, that need to enable ECN in the
SYN packets.
Signed-off-by: Lawrence Brakmo <brakmo@fb.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

91b5b21c

bpf: Add TCP connection BPF callbacks · 9872a4bd

Committed by Lawrence Brakmo on Jun 30, 2017

Added callbacks to BPF SOCK_OPS type program before an active
connection is initialized and after a passive or active connection is
established.

The following patch demonstrates how they can be used to set send and
receive buffer sizes.
Signed-off-by: Lawrence Brakmo <brakmo@fb.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

9872a4bd

bpf: Support for per connection SYN/SYN-ACK RTOs · 8550f328

Committed by Lawrence Brakmo on Jun 30, 2017

This patch adds support for setting per-connection SYN and
SYN_ACK RTOs from within a BPF_SOCK_OPS program. For example,
to set small RTOs when it is known both hosts are within a
datacenter.
Signed-off-by: Lawrence Brakmo <brakmo@fb.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

8550f328

08 Jun, 2017 4 commits

tcp: Namespaceify sysctl_tcp_timestamps · 5d2ed052

Committed by Eric Dumazet on Jun 07, 2017

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

5d2ed052

tcp: Namespaceify sysctl_tcp_window_scaling · 9bb37ef0

Committed by Eric Dumazet on Jun 07, 2017

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

9bb37ef0

tcp: Namespaceify sysctl_tcp_sack · f9301034

Committed by Eric Dumazet on Jun 07, 2017

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

f9301034

tcp: add a struct net parameter to tcp_parse_options() · eed29f17

Committed by Eric Dumazet on Jun 07, 2017

We want to move some TCP sysctls to net namespaces in the future.

Since tcp_window_scaling, tcp_sack and tcp_timestamps are fetched
from tcp_parse_options(), we need to pass an extra parameter.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

eed29f17

03 Jun, 2017 1 commit

tcp: use TS opt on RTTs for congestion control · 775e68a9

Committed by Yuchung Cheng on May 31, 2017

Currently when a data packet is retransmitted, we do not compute an
RTT sample for congestion control due to Karn's check. Therefore a
congestion control that uses RTT signals may not receive any update
during loss recovery, which could last many round trips. For example,
BBR and Vegas may not be able to update their min RTT estimates when
the network path has shortened, until the connection recovers from
losses. This patch mitigates that by using TCP timestamp options for
RTT measurement for congestion control. Note that we already use
timestamps for RTT estimation.
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

775e68a9
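
The fallback described in the commit above can be sketched roughly as follows. This is an illustration only, not the kernel's implementation: `toy_ca_rtt_us` and all of its parameters are hypothetical names, and a 1 ms timestamp clock is assumed for the echoed value.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Returns an RTT sample in microseconds for congestion control, or -1
 * when no sample can be taken. Every parameter is a hypothetical
 * stand-in for state a real stack tracks per socket and per packet.
 */
int64_t toy_ca_rtt_us(int was_retransmitted,
                      int64_t first_sent_us, int64_t now_us,
                      int saw_tstamp, uint32_t tsecr_ms, uint32_t now_ms)
{
    assert(now_us >= first_sent_us);

    /* Never retransmitted: the send time is unambiguous. */
    if (!was_retransmitted)
        return now_us - first_sent_us;

    /*
     * Retransmitted: Karn's check rules out the send-time sample, but
     * an echoed timestamp tells us which transmission was acknowledged.
     * Unsigned subtraction keeps the result correct across clock wrap.
     */
    if (saw_tstamp && tsecr_ms != 0)
        return (int64_t)(uint32_t)(now_ms - tsecr_ms) * 1000;

    /* No timestamp option: still no sample during loss recovery. */
    return -1;
}
```

The timestamp path is coarser than the send-time path (milliseconds rather than microseconds in this sketch), but it still lets an RTT-driven congestion control see updates while losses are being repaired.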

26 May, 2017 1 commit

tcp: better validation of received ack sequences · d0e1a1b5

Committed by Eric Dumazet on May 23, 2017

Paul Fiterau Brostean reported:

<quote>
Linux TCP stack we analyze exhibits behavior that seems odd to me.
The scenario is as follows (all packets have empty payloads, no window
scaling, rcv/snd window size should not be a factor):

       TEST HARNESS (CLIENT)                        LINUX SERVER

   1.  -                                          LISTEN (server listen,
                                                  then accepts)

   2.  - --> <SEQ=100><CTL=SYN>               --> SYN-RECEIVED

   3.  - <-- <SEQ=300><ACK=101><CTL=SYN,ACK>  <-- SYN-RECEIVED

   4.  - --> <SEQ=101><ACK=301><CTL=ACK>      --> ESTABLISHED

   5.  - <-- <SEQ=301><ACK=101><CTL=FIN,ACK>  <-- FIN WAIT-1 (server
                                                  opts to close the data
                                                  connection calling "close"
                                                  on the connection socket)

   6.  - --> <SEQ=101><ACK=99999><CTL=FIN,ACK> --> CLOSING (client sends
                                                  FIN,ACK with not yet sent
                                                  acknowledgement number)

   7.  - <-- <SEQ=302><ACK=102><CTL=ACK>      <-- CLOSING (ACK is 102
                                                  instead of 101, why?)

... (silence from CLIENT)

   8.  - <-- <SEQ=301><ACK=102><CTL=FIN,ACK>  <-- CLOSING
                                                  (retransmission, again
                                                  ACK is 102)

Now, note that packet 6 while having the expected sequence number,
acknowledges something that wasn't sent by the server. So I would
expect the packet to maybe prompt an ACK response from the server,
and then be ignored. Yet it is not ignored and actually leads to an
increase of the acknowledgement number in the server's retransmission
of the FIN,ACK packet. The explanation I found is that the FIN in
packet 6 was processed, despite the acknowledgement number being
unacceptable. Further experiments indeed show that the server
processes this FIN, transitioning to CLOSING, then on receiving an
ACK for the FIN it had sent in packet 5, the server (or better said
connection) transitions from CLOSING to TIME_WAIT (as signaled by
netstat).

</quote>

Indeed, tcp_rcv_state_process() calls tcp_ack() but
uses the @acceptable status only in the TCP_SYN_RECV
state.

What we want here is to send a challenge ACK, if not in TCP_SYN_RECV
state. TCP_FIN_WAIT1 state is not the only state we should fix.

Add a FLAG_NO_CHALLENGE_ACK so that tcp_rcv_state_process()
can choose to send a challenge ACK and discard the packet instead
of wrongly changing socket state.

With help from Neal Cardwell.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Paul Fiterau Brostean <p.fiterau-brostean@science.ru.nl>
Cc: Neal Cardwell <ncardwell@google.com>
Cc: Yuchung Cheng <ycheng@google.com>
Cc: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

d0e1a1b5

20 May, 2017 1 commit

tcp: warn on negative reordering values · 6f5b24ee

Committed by Soheil Hassas Yeganeh on May 16, 2017

Commit bafbb9c7 ("tcp: eliminate negative reordering
in tcp_clean_rtx_queue") fixes an issue for negative
reordering metrics.

To be resilient to such errors, warn and return
when a negative metric is passed to tcp_update_reordering().
Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

6f5b24ee

19 May, 2017 1 commit

tcp: fix tcp_rearm_rto() · b17b8a20

Committed by Eric Dumazet on May 18, 2017

skbs in (re)transmit queue no longer have a copy of jiffies
at the time of the transmit: skb->skb_mstamp is now in usec unit,
with no correlation to tcp_jiffies32.

We have to convert rto from jiffies to usec, compute a time difference
in usec, then convert the delta to HZ units.

Fixes: 9a568de4 ("tcp: switch TCP TS option (RFC 7323) to 1ms clock")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

b17b8a20
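
The three-step conversion in the commit above (jiffies to usec, difference in usec, back to HZ units) can be sketched as below. This is a self-contained illustration under stated assumptions, not the kernel code: `TOY_HZ`, the `toy_*` helpers, and the choice to keep the full RTO once the deadline has already passed are all assumptions made for the example.

```c
#include <assert.h>
#include <stdint.h>

#define TOY_HZ 250                              /* assumed tick rate */
#define TOY_USEC_PER_JIFFY (1000000 / TOY_HZ)   /* 4000 us per tick */

static int64_t toy_jiffies_to_usecs(uint32_t j)
{
    return (int64_t)j * TOY_USEC_PER_JIFFY;
}

/* Round up so that a partial tick never shortens the timer. */
static uint32_t toy_usecs_to_jiffies(int64_t us)
{
    assert(us >= 0);
    return (uint32_t)((us + TOY_USEC_PER_JIFFY - 1) / TOY_USEC_PER_JIFFY);
}

/*
 * Given an RTO in jiffies and a usec send timestamp, return how many
 * jiffies remain until the retransmit deadline.
 */
uint32_t toy_rearm_rto(uint32_t rto_jiffies, int64_t sent_us, int64_t now_us)
{
    /* Step 1: express the deadline in usec, the unit of the timestamp. */
    int64_t deadline_us = sent_us + toy_jiffies_to_usecs(rto_jiffies);
    /* Step 2: compute the remaining time in usec. */
    int64_t delta_us = deadline_us - now_us;

    /* Step 3: convert the positive remainder back to timer ticks. */
    if (delta_us > 0)
        return toy_usecs_to_jiffies(delta_us);

    /* Deadline already passed: keep the full RTO (sketch assumption). */
    return rto_jiffies;
}
```

Mixing the two units directly, for example subtracting a usec timestamp from a jiffies value, would produce a meaningless delta, which is why the conversion has to happen before the subtraction.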

18 May, 2017 6 commits

tcp: switch TCP TS option (RFC 7323) to 1ms clock · 9a568de4

Committed by Eric Dumazet on May 16, 2017

The TCP Timestamps option is defined in RFC 7323.

Traditionally on Linux, it has been tied to the internal
'jiffies' variable, because it had been a cheap and good enough
generator.

For TCP flows on the Internet, 1 ms resolution would be much better
than 4ms or 10ms (HZ=250 or HZ=100 respectively).

For TCP flows in the DC, Google has used usec resolution for more
than two years with great success [1]

Receive size autotuning (DRS) is indeed more precise and converges
faster to optimal window size.

This patch converts tp->tcp_mstamp to a plain u64 value storing
a 1 usec TCP clock.

This choice will allow us to upstream the 1 usec TS option as
discussed in IETF 97.

[1] https://www.ietf.org/proceedings/97/slides/slides-97-tcpm-tcp-options-for-low-latency-00.pdf

Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

9a568de4

tcp: replace misc tcp_time_stamp to tcp_jiffies32 · ac9517fc

Committed by Eric Dumazet on May 16, 2017

After this patch, all uses of tcp_time_stamp will require
a change when we introduce 1 ms and/or 1 us TCP TS option.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

ac9517fc

tcp: use tcp_jiffies32 in __tcp_oow_rate_limited() · 594208af

Committed by Eric Dumazet on May 16, 2017

This place wants to use tcp_jiffies32, this is good enough.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

594208af

tcp: use tcp_jiffies32 for rcv_tstamp and lrcvtime · 70eabf0e

Committed by Eric Dumazet on May 16, 2017

Use tcp_jiffies32 instead of tcp_time_stamp, since
tcp_time_stamp will soon be only used for TCP TS option.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

70eabf0e

tcp: use tcp_jiffies32 to feed tp->snd_cwnd_stamp · c2203cf7

Committed by Eric Dumazet on May 16, 2017

Use tcp_jiffies32 instead of tcp_time_stamp to feed
tp->snd_cwnd_stamp.

tcp_time_stamp will soon be a little bit more expensive
than simply reading 'jiffies'.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

c2203cf7

tcp: use tcp_jiffies32 to feed tp->lsndtime · d635fbe2

Committed by Eric Dumazet on May 16, 2017

Use tcp_jiffies32 instead of tcp_time_stamp to feed
tp->lsndtime.

tcp_time_stamp will soon be a little bit more expensive
than simply reading 'jiffies'.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

d635fbe2

17 May, 2017 1 commit

tcp: eliminate negative reordering in tcp_clean_rtx_queue · bafbb9c7

Committed by Soheil Hassas Yeganeh on May 15, 2017

tcp_ack() can call tcp_fragment() which may reduce the
value of tp->fackets_out when MSS changes. When prior_fackets
is larger than tp->fackets_out, tcp_clean_rtx_queue() can
invoke tcp_update_reordering() with negative values. This
results in absurd tp->reordering values higher than
sysctl_tcp_max_reordering.

Note that tcp_update_reordering() indeed sets tp->reordering
to min(sysctl_tcp_max_reordering, metric), but because
the comparison is signed, a negative metric always wins.

Fixes: c7caf8d3 ("[TCP]: Fix reord detection due to snd_una covered holes")
Reported-by: Rebecca Isaacs <risaacs@google.com>
Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

bafbb9c7
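
The signed min() pitfall described in the commit above is easy to reproduce in isolation. The sketch below is a hypothetical model, not the kernel code: `TOY_MAX_REORDERING` stands in for sysctl_tcp_max_reordering, the stored value is assumed to be unsigned, and the guarded variant only illustrates the "reject negative metrics" idea.

```c
#include <assert.h>
#include <limits.h>

/* Hypothetical cap standing in for sysctl_tcp_max_reordering. */
#define TOY_MAX_REORDERING 300

static int toy_min(int a, int b)
{
    return a < b ? a : b;
}

/*
 * Unguarded: a signed min() followed by a store into an unsigned field.
 * A negative metric "wins" the comparison and then wraps around.
 */
unsigned int toy_reordering_unguarded(int metric)
{
    return (unsigned int)toy_min(TOY_MAX_REORDERING, metric);
}

/* Guarded: ignore a negative metric and keep the current value. */
unsigned int toy_reordering_guarded(unsigned int current, int metric)
{
    if (metric < 0)
        return current;
    return (unsigned int)toy_min(TOY_MAX_REORDERING, metric);
}

/* Small usage example showing the wrap-around and the guard. */
void toy_reordering_demo(void)
{
    /* -1 beats the cap in the signed compare, then becomes UINT_MAX. */
    assert(toy_reordering_unguarded(-1) == UINT_MAX);
    assert(toy_reordering_unguarded(500) == TOY_MAX_REORDERING);
    assert(toy_reordering_guarded(3, -1) == 3);
}
```

The cap only protects against large positive metrics; anything below zero slips under it, which is why the value has to be kept non-negative before the comparison.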

12 May, 2017 1 commit

tcp: avoid fragmenting peculiar skbs in SACK · b451e5d2

Committed by Yuchung Cheng on May 10, 2017

This patch fixes a bug in splitting an SKB during SACK
processing. Specifically if an skb contains multiple
packets and is only partially sacked in the higher sequences,
tcp_match_sack_to_skb() splits the skb and marks the second fragment
as SACKed.

The current code further attempts rounding up the first fragment
to MSS boundaries. But it misses a boundary condition when the
rounded-up fragment size (pkt_len) is exactly skb size. Splitting
such an skb is pointless, causes a kernel warning, and aborts
the SACK processing. This patch universally checks such over-split
before calling tcp_fragment to prevent these unnecessary warnings.

Fixes: adb92db8 ("tcp: Make SACK code to split only at mss boundaries")
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

b451e5d2
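
The boundary condition in the commit above can be shown with a small, self-contained sketch. It models only the case the message describes (the higher sequences are SACKed, so the un-SACKed head is rounded up to an MSS boundary); `toy_sack_split_len` and its parameters are hypothetical and do not mirror the kernel function's signature.

```c
#include <assert.h>

/*
 * skb_len: total payload bytes in the buffer
 * pkt_len: bytes at the front that are NOT covered by the SACK block
 * mss:     current maximum segment size
 *
 * Returns the offset at which to split, or 0 when no split is needed.
 */
unsigned int toy_sack_split_len(unsigned int skb_len,
                                unsigned int pkt_len,
                                unsigned int mss)
{
    assert(mss > 0 && pkt_len > 0 && pkt_len < skb_len);

    /* Round the first fragment up to a whole number of MSS. */
    if (pkt_len > mss) {
        unsigned int rounded = (pkt_len / mss) * mss;

        if (rounded < pkt_len)
            rounded += mss;
        pkt_len = rounded;
    }

    /*
     * Over-split check: after rounding, the first fragment may cover
     * the entire buffer. Splitting there would be pointless, so report
     * that there is nothing to split.
     */
    if (pkt_len >= skb_len)
        return 0;

    return pkt_len;
}
```

For example, with an MSS of 1000, a 3000-byte buffer whose first 2100 bytes are un-SACKed rounds up to exactly 3000, so the sketch returns 0 instead of asking for a split at the end of the buffer.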

06 May, 2017 1 commit

tcp: randomize timestamps on syncookies · 84b114b9

Committed by Eric Dumazet on May 05, 2017

The whole point of randomization was to hide server uptime, but an attacker
can simply start a syn flood and TCP generates 'old style' timestamps,
directly revealing the server jiffies value.

Also, the TSval sent by the server to a particular remote address varies
depending on syncookies being sent or not, potentially triggering PAWS
drops for innocent clients.

Let's implement proper randomization, including for SYNcookies.

Also we do not need to export sysctl_tcp_timestamps, since it is not
used from a module.

In v2, I added Florian's feedback and contribution, adding tsoff to
tcp_get_cookie_sock().

v3 removed one unused variable in tcp_v4_connect() as Florian spotted.

Fixes: 95a22cae ("tcp: randomize tcp timestamp offsets for each connection")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Florian Westphal <fw@strlen.de>
Tested-by: Florian Westphal <fw@strlen.de>
Cc: Yuchung Cheng <ycheng@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

84b114b9

27 Apr, 2017 9 commits

tcp: switch rcv_rtt_est and rcvq_space to high resolution timestamps · 645f4c6f