1. 09 Sep, 2016 (1 commit)
    • tcp: use an RB tree for ooo receive queue · 9f5afeae
      By Yaogong Wang
      Over the years, TCP BDP has increased by several orders of magnitude,
      and some people are considering reaching the 2 Gbytes limit.
      
      Even with the current window scale limit of 14, ~1 Gbyte maps to
      ~740,000 MSS.
      
      In the presence of packet losses (or reordering), TCP stores incoming
      packets in an out-of-order queue, and the number of skbs sitting there
      waiting for the missing packets to arrive can be in the 10^5 range.
      
      Most packets are appended to the tail of this queue, and when packets
      can finally be transferred to the receive queue, we scan the queue
      from its head.
      
      However, in the presence of heavy losses, we might have to find an
      arbitrary point in this queue, involving a linear scan for every
      incoming packet and trashing CPU caches.
      
      This patch converts it to an RB tree, to get bounded latencies.
      
      Yaogong wrote a preliminary patch about 2 years ago.
      Eric did the rebase, added ofo_last_skb cache, polishing and tests.
      
      Tested with the network dropping between 1 and 10% of packets, with
      good success (about a 30% throughput increase in stress tests).
      
      The next step would be to also use an RB tree for the write queue at
      the sender side ;)
      Signed-off-by: Yaogong Wang <wygivan@google.com>
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Cc: Yuchung Cheng <ycheng@google.com>
      Cc: Neal Cardwell <ncardwell@google.com>
      Cc: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
      Acked-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      9f5afeae
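
      An RB tree keyed by sequence number gives O(log n) insertion at an
      arbitrary position, versus O(n) for a list scan. Below is a minimal
      userspace C sketch of the idea, not the kernel code: a plain binary
      search tree stands in for the kernel rb-tree (rebalancing omitted),
      and all names are illustrative.
      ~~~~~~
      #include <stdio.h>
      #include <stdlib.h>
      #include <stdint.h>

      struct ooo_seg {
          uint32_t seq;                  /* start sequence of the segment */
          uint32_t end_seq;              /* one past the last byte */
          struct ooo_seg *left, *right;
      };

      static struct ooo_seg *ooo_insert(struct ooo_seg *root,
                                        uint32_t seq, uint32_t end_seq)
      {
          if (!root) {
              struct ooo_seg *n = calloc(1, sizeof(*n));  /* leaked: sketch only */
              n->seq = seq;
              n->end_seq = end_seq;
              return n;
          }
          /* (int32_t)(a - b) < 0 is the kernel's wrap-safe before(a, b) */
          if ((int32_t)(seq - root->seq) < 0)
              root->left = ooo_insert(root->left, seq, end_seq);
          else
              root->right = ooo_insert(root->right, seq, end_seq);
          return root;
      }

      static void ooo_walk(const struct ooo_seg *n)  /* in-order == seq order */
      {
          if (!n)
              return;
          ooo_walk(n->left);
          printf("seg %u:%u\n", (unsigned)n->seq, (unsigned)n->end_seq);
          ooo_walk(n->right);
      }

      int main(void)
      {
          struct ooo_seg *root = NULL;

          root = ooo_insert(root, 2920, 4380);  /* arrives out of order */
          root = ooo_insert(root, 1460, 2920);
          root = ooo_insert(root, 5840, 7300);
          ooo_walk(root);                       /* printed in sequence order */
          return 0;
      }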
  2. 19 Aug, 2016 (1 commit)
    • tcp: refine tcp_prune_ofo_queue() to not drop all packets · 36a6503f
      By Eric Dumazet
      Over the years, TCP BDP has increased a lot, and is typically on the
      order of ~10 Mbytes with the help of clever Congestion Control
      modules.
      
      In the presence of packet losses, TCP stores incoming packets in an
      out-of-order queue, and the number of skbs sitting there waiting for
      the missing packets to be received can match the BDP (~10 Mbytes).
      
      In some cases, TCP needs to make room for incoming skbs, and the
      current strategy can simply remove all skbs in the out-of-order queue
      as a last resort, incurring a huge penalty for both receiver and
      sender.
      
      Unfortunately these 'last resort' events are quite frequent, forcing
      the sender to send all packets again, stalling the flow and wasting a
      lot of resources.
      
      This patch prunes only part of the out-of-order queue in order to
      meet the memory constraints.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Cc: Neal Cardwell <ncardwell@google.com>
      Cc: Yuchung Cheng <ycheng@google.com>
      Cc: Soheil Hassas Yeganeh <soheil@google.com>
      Cc: C. Stephen Gun <csg@google.com>
      Cc: Van Jacobson <vanj@google.com>
      Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
      Acked-by: Yuchung Cheng <ycheng@google.com>
      Acked-by: Neal Cardwell <ncardwell@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      36a6503f
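
      A minimal userspace sketch of the strategy (illustrative names; the
      kernel operates on an skb queue with truesize accounting): drop
      segments from the tail only until the memory budget is met, instead
      of flushing the whole queue.
      ~~~~~~
      #include <stdio.h>
      #include <stddef.h>

      struct seg { unsigned int seq; size_t truesize; };

      /* Free segments from the tail (highest seq, least useful) until the
       * queue fits the budget again. */
      static size_t prune_ofo(struct seg *q, size_t *n, size_t used, size_t limit)
      {
          while (*n > 0 && used > limit)
              used -= q[--(*n)].truesize;
          return used;
      }

      int main(void)
      {
          struct seg q[] = { {1460, 2048}, {2920, 2048},
                             {5840, 2048}, {8760, 2048} };
          size_t n = 4;
          size_t used = prune_ofo(q, &n, 8192, 5000);

          printf("kept %zu segs, %zu bytes accounted\n", n, used);
          return 0;
      }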
  3. 16 Jul, 2016 (1 commit)
    • tcp: enable per-socket rate limiting of all 'challenge acks' · 083ae308
      By Jason Baron
      The per-socket rate limit for 'challenge acks' was introduced in the
      context of limiting ack loops:
      
      commit f2b2c582 ("tcp: mitigate ACK loops for connections as tcp_sock")
      
      And I think it can be extended to rate limit all 'challenge acks' on a
      per-socket basis.
      
      Since we have the global tcp_challenge_ack_limit, this patch allows
      tcp_challenge_ack_limit to be set to a large value, effectively
      relying on the per-socket limit, or to a lower value, while still
      preventing a single connection from consuming the entire challenge
      ack quota.
      
      It further moves in the direction of eliminating the global limit at
      some point, as Eric Dumazet has suggested. This is a follow-up to:
      Subject: tcp: make challenge acks less predictable
      
      Cc: Eric Dumazet <edumazet@google.com>
      Cc: David S. Miller <davem@davemloft.net>
      Cc: Neal Cardwell <ncardwell@google.com>
      Cc: Yuchung Cheng <ycheng@google.com>
      Cc: Yue Cao <ycao009@ucr.edu>
      Signed-off-by: Jason Baron <jbaron@akamai.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      083ae308
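
      A userspace sketch of a per-socket limiter of the kind described
      (the window length and all names here are assumptions, not the
      kernel's implementation): each socket gets its own counter that
      resets every second, so one connection cannot drain a shared global
      quota.
      ~~~~~~
      #include <stdbool.h>
      #include <stdio.h>
      #include <time.h>

      struct sock_ratelimit {
          time_t window_start;   /* start of the current 1-second window */
          unsigned int sent;     /* challenge ACKs sent in this window */
      };

      static bool challenge_ack_allowed(struct sock_ratelimit *rl, time_t now,
                                        unsigned int per_sock_limit)
      {
          if (now != rl->window_start) {   /* new window: reset the counter */
              rl->window_start = now;
              rl->sent = 0;
          }
          if (rl->sent >= per_sock_limit)
              return false;                /* this socket used up its quota */
          rl->sent++;
          return true;
      }

      int main(void)
      {
          struct sock_ratelimit rl = {0};
          time_t now = time(NULL);

          for (int i = 0; i < 5; i++)
              printf("ack %d: %s\n", i,
                     challenge_ack_allowed(&rl, now, 3) ? "sent" : "dropped");
          return 0;
      }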
  4. 12 Jul, 2016 (1 commit)
  5. 28 Jun, 2016 (1 commit)
  6. 11 Jun, 2016 (1 commit)
  7. 08 Jun, 2016 (1 commit)
    • tcp: accept RST if SEQ matches right edge of right-most SACK block · e00431bc
      By Pau Espin Pedrol
      RFC 5961 advises accepting only RST packets whose sequence number
      matches the next expected sequence number, rather than any value
      within the receive window, in order to avoid spoofing attacks.
      
      However, this is not optimal when SACK is in use at the time the RST
      is sent. I recently ran into a scenario in which packet losses were
      high while uploading data to a server, and userspace would frequently
      terminate connections by sending an RST. In this case, the ACK sent
      on the receiver side (rcv_nxt) is frozen waiting for a lost packet
      retransmission, and SACK blocks are used to let the client continue
      uploading data. At some point later on, the client sends the RST
      (snd_nxt), which matches the next expected seq number of the
      right-most SACK block on the receiver side, which keeps moving
      forward as data is received.
      
      In this scenario, as RFC 5961 defines, the RST SEQ doesn't match the
      frozen main ACK at the receiver side and thus gets dropped, and a
      challenge ACK is sent, which usually gets lost due to network
      conditions. The main consequence is that the connection stays alive
      for a while even though it made sense to accept the RST. This can get
      really bad if lots of connections like this one are created within a
      few seconds, easily exhausting all the resources of the server.
      
      For security reasons, not all SACK blocks are checked (there could be
      a large number of SACK blocks, and hence of acceptable SEQ numbers).
      Furthermore, it wouldn't make sense to check for an RST in blocks
      other than the right-most received one, because the sender is not
      expected to be sending new data after the RST. For simplicity, only
      up to the 4 most recently updated SACK blocks (the selective_acks[4]
      field) are compared to find the right-most block, as those are
      usually the ones most likely to contain it.
      
      This patch was tested on a 3.18 kernel and proved to improve the
      situation in the scenario described above.
      Signed-off-by: Pau Espin Pedrol <pau.espin@tessares.net>
      Acked-by: Eric Dumazet <edumazet@google.com>
      Acked-by: Neal Cardwell <ncardwell@google.com>
      Tested-by: Neal Cardwell <ncardwell@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      e00431bc
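
      A userspace sketch of the acceptance rule (struct and helper names
      are stand-ins, not the kernel's): accept the RST if its SEQ matches
      rcv_nxt, or the right edge of the right-most cached SACK block.
      ~~~~~~
      #include <stdbool.h>
      #include <stdint.h>
      #include <stdio.h>

      struct sack_block { uint32_t start_seq, end_seq; };

      /* wrap-safe "a comes after b" on 32-bit sequence numbers */
      static bool after(uint32_t a, uint32_t b) { return (int32_t)(a - b) > 0; }

      static bool rst_seq_ok(uint32_t seq, uint32_t rcv_nxt,
                             const struct sack_block *sacks, int nsacks)
      {
          uint32_t max_end = rcv_nxt;

          if (seq == rcv_nxt)                /* the plain RFC 5961 check */
              return true;
          for (int i = 0; i < nsacks; i++)   /* find the right-most edge */
              if (after(sacks[i].end_seq, max_end))
                  max_end = sacks[i].end_seq;
          return nsacks > 0 && seq == max_end;
      }

      int main(void)
      {
          struct sack_block s[2] = { {1461, 5841}, {8761, 14601} };

          printf("%d\n", rst_seq_ok(14601, 1, s, 2)); /* 1: right-most edge */
          printf("%d\n", rst_seq_ok(5841, 1, s, 2));  /* 0: not right-most */
          return 0;
      }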
  8. 12 May, 2016 (1 commit)
  9. 05 May, 2016 (1 commit)
  10. 03 May, 2016 (2 commits)
  11. 29 Apr, 2016 (2 commits)
    • tcp: Handle eor bit when coalescing skb · a643b5d4
      By Martin KaFai Lau
      This patch:
      1. Prevent next_skb from coalescing to the prev_skb if
         TCP_SKB_CB(prev_skb)->eor is set
      2. Update the TCP_SKB_CB(prev_skb)->eor if coalescing is
         allowed
      
      Packetdrill script for testing:
      ~~~~~~
      +0 `sysctl -q -w net.ipv4.tcp_min_tso_segs=10`
      +0 `sysctl -q -w net.ipv4.tcp_no_metrics_save=1`
      +0 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
      +0 setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
      +0 bind(3, ..., ...) = 0
      +0 listen(3, 1) = 0
      
      0.100 < S 0:0(0) win 32792 <mss 1460,sackOK,nop,nop,nop,wscale 7>
      0.100 > S. 0:0(0) ack 1 <mss 1460,nop,nop,sackOK,nop,wscale 7>
      0.200 < . 1:1(0) ack 1 win 257
      0.200 accept(3, ..., ...) = 4
      +0 setsockopt(4, SOL_TCP, TCP_NODELAY, [1], 4) = 0
      
      0.200 sendto(4, ..., 730, MSG_EOR, ..., ...) = 730
      0.200 sendto(4, ..., 730, MSG_EOR, ..., ...) = 730
      0.200 write(4, ..., 11680) = 11680
      
      0.200 > P. 1:731(730) ack 1
      0.200 > P. 731:1461(730) ack 1
      0.200 > . 1461:8761(7300) ack 1
      0.200 > P. 8761:13141(4380) ack 1
      
      0.300 < . 1:1(0) ack 1 win 257 <sack 1461:13141,nop,nop>
      0.300 > P. 1:731(730) ack 1
      0.300 > P. 731:1461(730) ack 1
      0.400 < . 1:1(0) ack 13141 win 257
      
      0.400 close(4) = 0
      0.400 > F. 13141:13141(0) ack 1
      0.500 < F. 1:1(0) ack 13142 win 257
      0.500 > . 13142:13142(0) ack 2
      Signed-off-by: Martin KaFai Lau <kafai@fb.com>
      Cc: Eric Dumazet <edumazet@google.com>
      Cc: Neal Cardwell <ncardwell@google.com>
      Cc: Soheil Hassas Yeganeh <soheil@google.com>
      Cc: Willem de Bruijn <willemb@google.com>
      Cc: Yuchung Cheng <ycheng@google.com>
      Acked-by: Eric Dumazet <edumazet@google.com>
      Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      a643b5d4
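
      A toy model of the two rules above (field names are illustrative;
      the kernel tracks EOR in TCP_SKB_CB(skb)->eor): an skb carrying
      MSG_EOR refuses to absorb its successor, and a successful merge
      carries the successor's EOR mark over to the survivor.
      ~~~~~~
      #include <stdbool.h>
      #include <stdio.h>

      struct skb { unsigned int len; bool eor; };

      static bool try_coalesce(struct skb *prev, struct skb *next)
      {
          if (prev->eor)
              return false;          /* rule 1: EOR skbs keep their boundary */
          prev->len += next->len;
          prev->eor = next->eor;     /* rule 2: inherit next's EOR mark */
          return true;
      }

      int main(void)
      {
          struct skb a = {730, true}, b = {730, false}, c = {11680, false};

          printf("a+b: %d\n", try_coalesce(&a, &b)); /* 0: a carries EOR */
          printf("b+c: %d\n", try_coalesce(&b, &c)); /* 1: merged, eor stays 0 */
          return 0;
      }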
    • tcp: remove SKBTX_ACK_TSTAMP since it is redundant · 0a2cf20c
      By Soheil Hassas Yeganeh
      The SKBTX_ACK_TSTAMP flag is set in skb_shinfo->tx_flags when the
      timestamp of the TCP acknowledgement should be reported on the error
      queue. Since accessing skb_shinfo is likely to incur a cache-line
      miss at the time of receiving the ack, the txstamp_ack bit was added
      in tcp_skb_cb, and is set iff the SKBTX_ACK_TSTAMP flag is set for an
      skb. This makes the SKBTX_ACK_TSTAMP flag redundant.

      Remove SKBTX_ACK_TSTAMP and use the txstamp_ack bit everywhere
      instead.
      
      Note that this frees one bit in shinfo->tx_flags.
      Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
      Acked-by: Martin KaFai Lau <kafai@fb.com>
      Suggested-by: Willem de Bruijn <willemb@google.com>
      Acked-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      0a2cf20c
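
      The data-structure point, as a sketch with illustrative names (not
      the kernel's struct layout): the request bit lives in the per-skb
      TCP control block that is already hot at ACK-processing time, so no
      extra cache line (shinfo) has to be touched.
      ~~~~~~
      #include <stdio.h>

      struct tcp_cb {
          unsigned int seq, end_seq;
          unsigned char txstamp_ack : 1;  /* report ACK timestamp on errqueue */
      };

      int main(void)
      {
          struct tcp_cb cb = { .seq = 1, .end_seq = 1461, .txstamp_ack = 1 };

          if (cb.txstamp_ack)   /* checked without dereferencing shinfo */
              printf("queue SCM_TSTAMP_ACK for byte %u\n", cb.end_seq - 1);
          return 0;
      }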
  12. 28 Apr, 2016 (2 commits)
  13. 26 Apr, 2016 (1 commit)
  14. 25 Apr, 2016 (1 commit)
    • tcp-tso: do not split TSO packets at retransmit time · 10d3be56
      By Eric Dumazet
      The Linux TCP stack painfully segments all TSO/GSO packets before
      retransmitting them.
      
      This was fine back in the days when TSO/GSO were emerging, with their
      bugs, but we believe the dark age is over.
      
      Keeping big packets in write queues, and keeping them big across the
      stack traversal, has a lot of benefits:
       - Less memory overhead, because write queues contain fewer skbs.
       - Less CPU overhead at ACK processing.
       - Better SACK processing, as a lot of studies mentioned how awful
         Linux was at this ;)
       - Less CPU overhead to send the rtx packets (IP stack traversal,
         netfilter traversal, drivers...)
       - Better latencies in the presence of losses.
       - Smaller spikes in fq-like packet schedulers, as retransmits are
         not constrained by TCP Small Queues.
      
      1% packet loss is common today, and at 100Gbit speeds this translates
      to ~80,000 losses per second (a back-of-envelope check follows this
      commit). Losses are often correlated, and we see many retransmit
      events leading to trains of 1-MSS packets, at a time when hosts are
      already under stress.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Acked-by: Yuchung Cheng <ycheng@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      10d3be56
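
      The ~80,000 figure checks out with back-of-envelope arithmetic,
      assuming a 1460-byte MSS (a minimal C version of the calculation):
      ~~~~~~
      #include <stdio.h>

      int main(void)
      {
          double link_bps   = 100e9;        /* 100 Gbit/s */
          double mss_bits   = 1460 * 8.0;   /* payload bits per MSS */
          double loss_rate  = 0.01;         /* 1% packet loss */
          double pkts_per_s = link_bps / mss_bits;

          printf("~%.0f losses/sec\n", pkts_per_s * loss_rate); /* ~85616 */
          return 0;
      }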
  15. 22 Apr, 2016 (2 commits)
    • tcp: Merge tx_flags and tskey in tcp_shifted_skb · cfea5a68
      By Martin KaFai Lau
      After receiving SACKs, tcp_shifted_skb() will collapse skbs if
      possible. tx_flags and tskey also have to be merged.

      This patch reuses tcp_skb_collapse_tstamp() to handle them.
      
      BPF Output Before:
      ~~~~~
      <no-output-due-to-missing-tstamp-event>
      
      BPF Output After:
      ~~~~~
      <...>-2024  [007] d.s.    88.644374: : ee_data:14599
      
      Packetdrill Script:
      ~~~~~
      +0 `sysctl -q -w net.ipv4.tcp_min_tso_segs=10`
      +0 `sysctl -q -w net.ipv4.tcp_no_metrics_save=1`
      +0 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
      +0 setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
      +0 bind(3, ..., ...) = 0
      +0 listen(3, 1) = 0
      
      0.100 < S 0:0(0) win 32792 <mss 1460,sackOK,nop,nop,nop,wscale 7>
      0.100 > S. 0:0(0) ack 1 <mss 1460,nop,nop,sackOK,nop,wscale 7>
      0.200 < . 1:1(0) ack 1 win 257
      0.200 accept(3, ..., ...) = 4
      +0 setsockopt(4, SOL_TCP, TCP_NODELAY, [1], 4) = 0
      
      0.200 write(4, ..., 1460) = 1460
      +0 setsockopt(4, SOL_SOCKET, 37, [2688], 4) = 0
      0.200 write(4, ..., 13140) = 13140
      
      0.200 > P. 1:1461(1460) ack 1
      0.200 > . 1461:8761(7300) ack 1
      0.200 > P. 8761:14601(5840) ack 1
      
      0.300 < . 1:1(0) ack 1 win 257 <sack 1461:14601,nop,nop>
      0.300 > P. 1:1461(1460) ack 1
      0.400 < . 1:1(0) ack 14601 win 257
      
      0.400 close(4) = 0
      0.400 > F. 14601:14601(0) ack 1
      0.500 < F. 1:1(0) ack 14602 win 257
      0.500 > . 14602:14602(0) ack 2
      Signed-off-by: Martin KaFai Lau <kafai@fb.com>
      Cc: Eric Dumazet <edumazet@google.com>
      Cc: Neal Cardwell <ncardwell@google.com>
      Cc: Soheil Hassas Yeganeh <soheil@google.com>
      Cc: Willem de Bruijn <willemb@google.com>
      Cc: Yuchung Cheng <ycheng@google.com>
      Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
      Tested-by: Soheil Hassas Yeganeh <soheil@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      cfea5a68
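
      A sketch of the merge rule (flag and field names are illustrative
      stand-ins for shinfo->tx_flags and shinfo->tskey): when next_skb is
      collapsed into prev_skb, the surviving skb must inherit the
      timestamping request and its key, or the tstamp event is lost.
      ~~~~~~
      #include <stdio.h>

      #define TXSTAMP_ACK 0x1u   /* illustrative flag bit */

      struct skb_ts { unsigned int tx_flags, tskey; };

      static void collapse_tstamp(struct skb_ts *prev, const struct skb_ts *next)
      {
          if (next->tx_flags & TXSTAMP_ACK) {
              prev->tx_flags |= TXSTAMP_ACK;
              prev->tskey = next->tskey;  /* report against the merged skb */
          }
      }

      int main(void)
      {
          struct skb_ts prev = {0, 0}, next = {TXSTAMP_ACK, 14599};

          collapse_tstamp(&prev, &next);
          printf("flags=%#x tskey=%u\n", prev.tx_flags, prev.tskey);
          return 0;
      }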
    • tcp: Fix SOF_TIMESTAMPING_TX_ACK when handling dup acks · 479f85c3
      By Martin KaFai Lau
      Assume SOF_TIMESTAMPING_TX_ACK is on. When dup acks are received, the
      stack could incorrectly think that an skb has already been acked and
      queue a SCM_TSTAMP_ACK cmsg to the sk->sk_error_queue.
      
      In tcp_ack_tstamp(), it checks
      'between(shinfo->tskey, prior_snd_una, tcp_sk(sk)->snd_una - 1)'.
      If prior_snd_una == tcp_sk(sk)->snd_una, as in the following
      packetdrill script, between() returns true but the tskey has actually
      not been acked; e.g. try between(3, 2, 1). (A standalone
      demonstration follows this commit.)
      
      The fix is to replace between() with one before() and one !before().
      By doing this, the -1 offset on the tcp_sk(sk)->snd_una can also be
      removed.
      
      A packetdrill script is used to reproduce the dup ack scenario. Due
      to the lack of cmsg support in packetdrill (or maybe I just could not
      find it), a BPF program is used to kprobe sock_queue_err_skb() and
      print out the value of serr->ee.ee_data.

      Both the packetdrill and the bcc BPF scripts are attached at the end
      of this commit message.
      
      BPF Output Before Fix:
      ~~~~~~
            <...>-2056  [001] d.s.   433.927987: : ee_data:1459  #incorrect
      packetdrill-2056  [001] d.s.   433.929563: : ee_data:1459  #incorrect
      packetdrill-2056  [001] d.s.   433.930765: : ee_data:1459  #incorrect
      packetdrill-2056  [001] d.s.   434.028177: : ee_data:1459
      packetdrill-2056  [001] d.s.   434.029686: : ee_data:14599
      
      BPF Output After Fix:
      ~~~~~~
            <...>-2049  [000] d.s.   113.517039: : ee_data:1459
            <...>-2049  [000] d.s.   113.517253: : ee_data:14599
      
      BCC BPF Script:
      ~~~~~~
      #!/usr/bin/env python
      
      from __future__ import print_function
      from bcc import BPF
      
      bpf_text = """
      #include <uapi/linux/ptrace.h>
      #include <net/sock.h>
      #include <bcc/proto.h>
      #include <linux/errqueue.h>
      
      #ifdef memset
      #undef memset
      #endif
      
      int trace_err_skb(struct pt_regs *ctx)
      {
      	struct sk_buff *skb = (struct sk_buff *)ctx->si;
      	struct sock *sk = (struct sock *)ctx->di;
      	struct sock_exterr_skb *serr;
      	u32 ee_data = 0;
      
      	if (!sk || !skb)
      		return 0;
      
      	serr = SKB_EXT_ERR(skb);
      	bpf_probe_read(&ee_data, sizeof(ee_data), &serr->ee.ee_data);
      	bpf_trace_printk("ee_data:%u\\n", ee_data);
      
      	return 0;
      };
      """
      
      b = BPF(text=bpf_text)
      b.attach_kprobe(event="sock_queue_err_skb", fn_name="trace_err_skb")
      print("Attached to kprobe")
      b.trace_print()
      
      Packetdrill Script:
      ~~~~~~
      +0 `sysctl -q -w net.ipv4.tcp_min_tso_segs=10`
      +0 `sysctl -q -w net.ipv4.tcp_no_metrics_save=1`
      +0 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
      +0 setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
      +0 bind(3, ..., ...) = 0
      +0 listen(3, 1) = 0
      
      0.100 < S 0:0(0) win 32792 <mss 1460,sackOK,nop,nop,nop,wscale 7>
      0.100 > S. 0:0(0) ack 1 <mss 1460,nop,nop,sackOK,nop,wscale 7>
      0.200 < . 1:1(0) ack 1 win 257
      0.200 accept(3, ..., ...) = 4
      +0 setsockopt(4, SOL_TCP, TCP_NODELAY, [1], 4) = 0
      
      +0 setsockopt(4, SOL_SOCKET, 37, [2688], 4) = 0
      0.200 write(4, ..., 1460) = 1460
      0.200 write(4, ..., 13140) = 13140
      
      0.200 > P. 1:1461(1460) ack 1
      0.200 > . 1461:8761(7300) ack 1
      0.200 > P. 8761:14601(5840) ack 1
      
      0.300 < . 1:1(0) ack 1 win 257 <sack 1461:2921,nop,nop>
      0.300 < . 1:1(0) ack 1 win 257 <sack 1461:4381,nop,nop>
      0.300 < . 1:1(0) ack 1 win 257 <sack 1461:5841,nop,nop>
      0.300 > P. 1:1461(1460) ack 1
      0.400 < . 1:1(0) ack 14601 win 257
      
      0.400 close(4) = 0
      0.400 > F. 14601:14601(0) ack 1
      0.500 < F. 1:1(0) ack 14602 win 257
      0.500 > . 14602:14602(0) ack 2
      Signed-off-by: Martin KaFai Lau <kafai@fb.com>
      Cc: Eric Dumazet <edumazet@google.com>
      Cc: Neal Cardwell <ncardwell@google.com>
      Cc: Soheil Hassas Yeganeh <soheil.kdev@gmail.com>
      Cc: Willem de Bruijn <willemb@google.com>
      Cc: Yuchung Cheng <ycheng@google.com>
      Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
      Tested-by: Soheil Hassas Yeganeh <soheil@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      479f85c3
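
      The misfire is easy to reproduce in userspace. The helpers below
      mirror the kernel's wrap-safe u32 comparisons before() and
      between(); the values come from the degenerate dup-ack case where
      prior_snd_una == snd_una.
      ~~~~~~
      #include <stdio.h>
      #include <stdint.h>
      #include <stdbool.h>

      static bool before(uint32_t seq1, uint32_t seq2)
      {
          return (int32_t)(seq1 - seq2) < 0;
      }

      /* is seq2 <= seq1 <= seq3, modulo 2^32 ? */
      static bool between(uint32_t seq1, uint32_t seq2, uint32_t seq3)
      {
          return seq3 - seq2 >= seq1 - seq2;
      }

      int main(void)
      {
          uint32_t tskey = 3, prior_snd_una = 2, snd_una = 2;

          /* old check: wraps when snd_una - 1 < prior_snd_una */
          printf("old: %d\n",
                 between(tskey, prior_snd_una, snd_una - 1));  /* 1: bogus */

          /* fixed check: acked iff prior_snd_una <= tskey < snd_una */
          printf("new: %d\n",
                 !before(tskey, prior_snd_una) && before(tskey, snd_una)); /* 0 */
          return 0;
      }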
  16. 16 Apr, 2016 (2 commits)
  17. 05 Apr, 2016 (3 commits)
  18. 03 Apr, 2016 (1 commit)
    • tcp: remove cwnd moderation after recovery · 23492623
      By Yuchung Cheng
      For non-SACK connections, cwnd is lowered to inflight plus 3 packets
      when the recovery ends. This is an optional feature in the NewReno
      RFC 2582 to reduce the potential burst when cwnd is "re-opened"
      after recovery and inflight is low.
      
      This feature is questionably effective because of PRR: when the
      recovery ends (i.e., snd_una == high_seq), NewReno holds the
      CA_Recovery state for another round trip to prevent false fast
      retransmits. But if the inflight is low, PRR will later overwrite the
      moderated cwnd in tcp_cwnd_reduction() regardless. So if a receiver
      responds with bogus ACKs (i.e., acking future data) to speed up
      transfer after recovery, it can only induce a burst of up to a
      window's worth of data packets by acking up to SND.NXT. A restart
      from (short) idle or receiving stretched ACKs can cause such bursts
      as well.
      
      On the other hand, if the recovery ends because the sender detects
      that the losses were spurious (e.g., due to reordering), this feature
      unconditionally lowers a reverted cwnd even though nothing was lost.
      
      In principle, the loss recovery module should not update cwnd, and
      pacing is much more effective at reducing bursts. Hence this patch
      removes the cwnd moderation feature. (A simplified model of the PRR
      behavior follows this commit.)
      
      v2 changes: revised commit message on bogus ACKs and bursts, and
                  added the missing signature
      Signed-off-by: Matt Mathis <mattmathis@google.com>
      Signed-off-by: Neal Cardwell <ncardwell@google.com>
      Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
      Signed-off-by: Yuchung Cheng <ycheng@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      23492623
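
      A much-simplified model of the PRR update the message refers to
      (not the kernel function; the proportional branch is collapsed to a
      constant): whatever value cwnd moderation had set, the next ACK in
      recovery recomputes cwnd as inflight + sndcnt and overwrites it.
      ~~~~~~
      #include <stdio.h>

      static unsigned int prr_cwnd(unsigned int inflight, unsigned int ssthresh,
                                   unsigned int newly_acked_sacked)
      {
          unsigned int sndcnt;

          if (inflight > ssthresh) {        /* proportional mode (simplified) */
              sndcnt = 1;
          } else {                          /* reduction-bound mode */
              unsigned int delta = ssthresh - inflight;

              sndcnt = delta < newly_acked_sacked ? delta : newly_acked_sacked;
          }
          return inflight + sndcnt;         /* moderated cwnd is overwritten */
      }

      int main(void)
      {
          /* low inflight at the end of recovery: cwnd re-opens regardless
           * of any moderation applied when recovery ended */
          printf("cwnd = %u\n", prr_cwnd(2, 10, 3));   /* 5 */
          return 0;
      }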
  19. 17 Feb, 2016 (1 commit)
  20. 08 Feb, 2016 (8 commits)
  21. 07 Feb, 2016 (1 commit)
    • tcp: fastopen: call tcp_fin() if FIN present in SYNACK · e3e17b77
      By Eric Dumazet
      When we acknowledge a FIN, it is not enough to ack the sequence
      number and queue the skb into the receive queue. We also have to call
      tcp_fin() to properly update socket state and send proper poll()
      notifications.
      
      It seems we also had this problem if we received a SYN packet with
      the FIN flag set, but it does not seem to be an urgent issue, as no
      known implementation can do that.
      
      Fixes: 61d2bcae ("tcp: fastopen: accept data/FIN present in SYNACK message")
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Cc: Yuchung Cheng <ycheng@google.com>
      Cc: Neal Cardwell <ncardwell@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      e3e17b77
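
      A stub-level sketch of the control flow the fix adds (all types and
      helpers below are illustrative stand-ins, not kernel code): after
      acking and queueing the SYNACK payload, a FIN flag must also drive
      the state machine.
      ~~~~~~
      #include <stdio.h>

      #define TCPHDR_FIN 0x01

      struct sk { int fin_received; };

      static void tcp_fin(struct sk *sk)  /* stub: updates socket state and */
      {                                   /* wakes poll() readers */
          sk->fin_received = 1;
      }

      static void queue_synack_data(struct sk *sk, unsigned char tcp_flags)
      {
          /* ... ack the sequence number, queue skb to receive queue ... */
          if (tcp_flags & TCPHDR_FIN)
              tcp_fin(sk);                /* the call the patch adds */
      }

      int main(void)
      {
          struct sk s = {0};

          queue_synack_data(&s, TCPHDR_FIN);
          printf("fin_received=%d\n", s.fin_received);
          return 0;
      }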
  22. 06 Feb, 2016 (1 commit)
  23. 30 Jan, 2016 (1 commit)
  24. 29 Jan, 2016 (1 commit)
    • tcp: fix tcp_mark_head_lost to check skb len before fragmenting · d88270ee
      By Neal Cardwell
      This commit fixes a corner case in tcp_mark_head_lost() which was
      causing the WARN_ON(len > skb->len) in tcp_fragment() to fire.
      
      tcp_mark_head_lost() was assuming that if a packet has a
      tcp_skb_pcount(skb) of N, then it's safe to fragment off a prefix of
      M*mss bytes for any M < N. But with the tricky way TCP pcounts are
      maintained, this is not always true.

      For example, suppose the sender sends 4 1-byte packets and has the
      last 3 packets SACKed. It will merge the last 3 packets in the write
      queue into an skb with pcount = 3 and len = 3 bytes. If another
      recovery happens after a SACK reneging event, tcp_mark_head_lost()
      may attempt to split the skb assuming it has more than 2*MSS bytes.
      
      This sounds very counterintuitive, but as the commit description for
      the related commit c0638c24 ("tcp: don't fragment SACKed skbs in
      tcp_mark_head_lost()") notes, this is because tcp_shifted_skb()
      coalesces adjacent regions of SACKed skbs, and when doing this it
      preserves the sum of their packet counts in order to reflect the
      real-world dynamics on the wire. The c0638c24 commit tried to
      avoid problems by not fragmenting SACKed skbs, since SACKed skbs are
      where the non-proportionality between pcount and skb->len/mss is known
      to be possible. However, that commit did not handle the case where
      during a reneging event one of these weird SACKed skbs becomes an
      un-SACKed skb, which tcp_mark_head_lost() can then try to fragment.
      
      The fix is to simply mark the entire skb lost when this happens. This
      makes the recovery slightly more aggressive in such corner cases
      before we detect reordering. But once we detect reordering, this code
      path is bypassed because FACK is disabled.
      Signed-off-by: Neal Cardwell <ncardwell@google.com>
      Signed-off-by: Yuchung Cheng <ycheng@google.com>
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      d88270ee
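
      Roughly the shape of the added check, in a toy model (field names
      are illustrative): before fragmenting off packets*mss bytes, verify
      the skb actually has that many bytes; otherwise mark the whole skb
      lost.
      ~~~~~~
      #include <stdio.h>

      struct skb { unsigned int len, pcount; };

      static void mark_head_lost(const struct skb *skb, unsigned int packets,
                                 unsigned int mss)
      {
          if (skb->pcount <= packets)
              return;                    /* whole skb is lost anyway */
          if (packets * mss >= skb->len)
              printf("mark whole skb lost (len %u < %u*%u)\n",
                     skb->len, packets, mss);  /* the fix: no bogus fragment */
          else
              printf("fragment off %u bytes\n", packets * mss);
      }

      int main(void)
      {
          /* 3 coalesced 1-byte SACKed packets, later reneged:
           * pcount = 3 but len = 3 bytes */
          struct skb weird = { .len = 3, .pcount = 3 };

          mark_head_lost(&weird, 2, 1460); /* would have requested 2920 bytes */
          return 0;
      }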
  25. 07 Jan, 2016 (1 commit)
    • tcp: fix zero cwnd in tcp_cwnd_reduction · 8b8a321f
      By Yuchung Cheng
      Patch 3759824d ("tcp: PRR uses CRB mode by default and SS mode
      conditionally") introduced a bug in which cwnd may become 0 when both
      inflight and sndcnt are 0 (cwnd = inflight + sndcnt). This may lead
      to a div-by-zero if the connection starts another cwnd reduction
      phase by setting tp->prior_cwnd to the current cwnd (0) in
      tcp_init_cwnd_reduction().
      
      To prevent this, we skip the PRR operation when nothing is acked or
      sacked. Then cwnd must be positive in all cases as long as ssthresh
      is positive:
      
      1) The proportional reduction mode
         inflight > ssthresh > 0
      
      2) The reduction bound mode
        a) inflight == ssthresh > 0
      
        b) inflight < ssthresh
           sndcnt > 0 since newly_acked_sacked > 0 and inflight < ssthresh
      
      Therefore in all cases inflight and sndcnt cannot both be 0. We also
      check for an invalid tp->prior_cwnd to avoid potential div-by-zero
      bugs.
      
      In reality this bug is triggered only by a sequence of less common
      events. For example, the connection is terminating an ECN-triggered
      cwnd reduction with an inflight of 0, then receives reordered/old
      ACKs or DSACKs from a prior transmission (which ack nothing). Or the
      connection is in a fast recovery stage that marks everything lost,
      but fails to retransmit due to local issues, then receives data
      packets from the other end which ack nothing.
      
      Fixes: 3759824d ("tcp: PRR uses CRB mode by default and SS mode conditionally")
      Reported-by: Oleksandr Natalenko <oleksandr@natalenko.name>
      Signed-off-by: Yuchung Cheng <ycheng@google.com>
      Signed-off-by: Neal Cardwell <ncardwell@google.com>
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      8b8a321f
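
      The shape of the guard in a tiny userspace model (simplified; not
      the kernel function): return early when the ACK advanced nothing,
      so cwnd = inflight + sndcnt can never be computed as 0.
      ~~~~~~
      #include <stdio.h>

      static unsigned int cwnd_reduction(unsigned int cwnd, unsigned int inflight,
                                         int newly_acked_sacked,
                                         unsigned int sndcnt)
      {
          if (newly_acked_sacked <= 0)  /* the fix: nothing (s)acked -> no-op */
              return cwnd;
          return inflight + sndcnt;
      }

      int main(void)
      {
          /* DSACK for old data: acks nothing while inflight is already 0 */
          printf("cwnd = %u\n", cwnd_reduction(10, 0, 0, 0)); /* stays 10 */
          return 0;
      }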
  26. 19 Dec, 2015 (1 commit)
    • net: Allow accepted sockets to be bound to l3mdev domain · 6dd9a14e
      By David Ahern
      Allow accepted sockets to derive their sk_bound_dev_if setting from
      the l3mdev domain in which the packets originated. A sysctl setting
      is added to control the behavior, similar to sk_mark and
      sysctl_tcp_fwmark_accept.

      This effectively allows a process to have a "VRF-global" listen
      socket, with child sockets bound to the VRF device in which the
      packet originated. A similar behavior can be achieved using sk_mark,
      but a solution using marks is incomplete, as it does not handle
      duplicate addresses in different L3 domains/VRFs. Allowing sockets to
      inherit sk_bound_dev_if from the l3mdev domain provides a complete
      solution.
      Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      6dd9a14e
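
      A sketch of the inheritance rule described above (struct, field, and
      sysctl names are illustrative stand-ins for the kernel's): if the
      sysctl is enabled and the SYN arrived via an l3mdev (VRF) device,
      the accepted child binds to that device.
      ~~~~~~
      #include <stdio.h>

      struct req { int l3mdev_ifindex; };    /* VRF device of the SYN, 0 if none */
      struct child_sk { int bound_dev_if; };

      static void inherit_l3mdev(struct child_sk *sk, const struct req *req,
                                 int sysctl_l3mdev_accept)
      {
          if (sysctl_l3mdev_accept && req->l3mdev_ifindex)
              sk->bound_dev_if = req->l3mdev_ifindex; /* VRF-bound child of a
                                                         "VRF-global" listener */
      }

      int main(void)
      {
          struct req r = { .l3mdev_ifindex = 7 };
          struct child_sk c = { .bound_dev_if = 0 };

          inherit_l3mdev(&c, &r, 1);
          printf("child bound_dev_if = %d\n", c.bound_dev_if);
          return 0;
      }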