1. 26 Sep 2015 (3 commits)
  2. 22 Sep 2015 (1 commit)
  3. 18 Sep 2015 (1 commit)
    • tcp: provide skb->hash to synack packets · 58d607d3
      Authored by Eric Dumazet
      In commit b73c3d0e ("net: Save TX flow hash in sock and set in skbuf
      on xmit"), Tom provided an L4 hash to most outgoing TCP packets.
      
      We'd like to provide one for SYNACK packets as well, so that all packets
      of a given flow share the same txhash, to later enable the bonding
      driver to also use skb->hash to perform slave selection.
      
      Note that a SYNACK retransmit shuffles the tx hash, as Tom did
      in commit 265f94ff ("net: Recompute sk_txhash on negative routing
      advice") for established sockets.
      
      This has the nice effect of making TCP flows resilient to certain kinds
      of black holes, even during the connection-establishment phase.
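      
      A minimal sketch of the mechanism, per the description above (the call
      sites are paraphrased; the exact hunks in the patch may differ):
      
          /* at request sock creation: give the flow its own txhash */
          tcp_rsk(req)->txhash = net_tx_rndhash();
      
          /* in tcp_make_synack(): stamp the hash on the SYNACK skb so it
           * follows the same path as the rest of the flow
           */
          skb_set_hash(skb, tcp_rsk(req)->txhash, PKT_HASH_TYPE_L4);
      
          /* in tcp_rtx_synack(): reshuffle the hash on retransmit, as commit
           * 265f94ff does for established sockets
           */
          tcp_rsk(req)->txhash = net_tx_rndhash();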
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Cc: Tom Herbert <tom@herbertland.com>
      Cc: Mahesh Bandewar <maheshb@google.com>
      Acked-by: Tom Herbert <tom@herbertland.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      58d607d3
  4. 11 Sep 2015 (1 commit)
  5. 26 Aug 2015 (1 commit)
    • tcp: fix slow start after idle vs TSO/GSO · 6f021c62
      Authored by Eric Dumazet
      Slow start after idle might reduce cwnd, but we currently perform this
      check only after the first packet has been cooked and sent.
      
      With TSO/GSO, it means that we might send a full TSO packet
      even if cwnd should have been reduced to IW10.
      
      Moving the SSAI check into skb_entail() makes sense, because it
      slightly reduces the number of times this check is done, especially
      for large send() calls and TCP Small Queues callbacks from softirq
      context.
      
      As Neal pointed out, we also need to perform the check if/when the
      receive window opens.
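      
      A sketch of the consolidated check as a helper, per the description
      above (names follow the patch; the exact signature of
      tcp_cwnd_restart() is assumed):
      
          static inline void tcp_slow_start_after_idle_check(struct sock *sk)
          {
                  const struct tcp_sock *tp = tcp_sk(sk);
                  s32 delta;
      
                  if (!sysctl_tcp_slow_start_after_idle || tp->packets_out)
                          return;
                  delta = tcp_time_stamp - tp->lsndtime;
                  if (delta > inet_csk(sk)->icsk_rto)
                          tcp_cwnd_restart(sk, delta);
          }
      
      It would be called from skb_entail() and again when the receive window
      opens.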
      
      Tested:
      
      The following packetdrill test demonstrates the problem:
      // Test of slow start after idle
      
      `sysctl -q net.ipv4.tcp_slow_start_after_idle=1`
      
      0.000 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
      +0    setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
      +0    bind(3, ..., ...) = 0
      +0    listen(3, 1) = 0
      
      +0    < S 0:0(0) win 65535 <mss 1000,sackOK,nop,nop,nop,wscale 7>
      +0    > S. 0:0(0) ack 1 <mss 1460,nop,nop,sackOK,nop,wscale 6>
      +.100 < . 1:1(0) ack 1 win 511
      +0    accept(3, ..., ...) = 4
      +0    setsockopt(4, SOL_SOCKET, SO_SNDBUF, [200000], 4) = 0
      
      +0    write(4, ..., 26000) = 26000
      +0    > . 1:5001(5000) ack 1
      +0    > . 5001:10001(5000) ack 1
      +0    %{ assert tcpi_snd_cwnd == 10 }%
      
      +.100 < . 1:1(0) ack 10001 win 511
      +0    %{ assert tcpi_snd_cwnd == 20, tcpi_snd_cwnd }%
      +0    > . 10001:20001(10000) ack 1
      +0    > P. 20001:26001(6000) ack 1
      
      +.100 < . 1:1(0) ack 26001 win 511
      +0    %{ assert tcpi_snd_cwnd == 36, tcpi_snd_cwnd }%
      
      +4 write(4, ..., 20000) = 20000
      // If slow start after idle works properly, we should send 5 MSS here (cwnd/2)
      +0    > . 26001:31001(5000) ack 1
      +0    %{ assert tcpi_snd_cwnd == 10, tcpi_snd_cwnd }%
      +0    > . 31001:36001(5000) ack 1
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Cc: Neal Cardwell <ncardwell@google.com>
      Cc: Yuchung Cheng <ycheng@google.com>
      Acked-by: Neal Cardwell <ncardwell@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      6f021c62
  6. 14 Aug 2015 (2 commits)
  7. 27 Jul 2015 (1 commit)
  8. 10 Jul 2015 (1 commit)
    • tcp: v1 always send a quick ack when quickacks are enabled · 2251ae46
      Authored by Jon Maxwell
      V1 of this patch contains Eric Dumazet's suggestion to move the per-dst
      RTAX_QUICKACK check into tcp_in_quickack_mode(). Thanks, Eric.
      
      I ran some tests, and after setting the "ip route change quickack 1"
      knob there were still many delayed ACKs sent. This occurred because
      when icsk_ack.quick == 0, the !icsk_ack.pingpong value is subsequently
      ignored, as tcp_in_quickack_mode() checks both values: a quick ack
      triggers only when icsk_ack.quick != 0 and icsk_ack.pingpong == 0.
      Currently only icsk_ack.pingpong is controlled by the knob, while the
      icsk_ack.quick value changes dynamically depending on heuristics.
      The crux of the matter is that delayed ACKs still cannot be entirely
      disabled even with the RTAX_QUICKACK per-dst knob enabled. This
      patch ensures that a quick ack is always sent when the RTAX_QUICKACK
      per-dst knob is turned on.
      
      The "ip route change quickack 1" knob was recently added to enable
      quickacks. It was modeled around the TCP_QUICKACK setsockopt() option.
      This issue is that even with "ip route change quickack 1" enabled
      we still see delayed ACKs under some conditions. It would be nice
      to be able to completely disable delayed ACKs.
      
      Here is an example:
      
      # netstat -s|grep dela
          3 delayed acks sent
      
      Enable the knob for all routes:
      
      # ip route change quickack 1
      
      Generate some traffic across a slow link, and we still see the delayed
      ACKs:
      
      # netstat -s|grep dela
          106 delayed acks sent
          1 delayed acks further delayed because of locked socket
      
      The issue is that both the "ip route change quickack 1" knob and
      the TCP_QUICKACK option set the icsk_ack.pingpong variable to 0.
      However, at the business end, in the __tcp_ack_snd_check() routine,
      tcp_in_quickack_mode() checks that both icsk_ack.quick != 0
      and icsk_ack.pingpong == 0 in order to trigger a quickack. As
      icsk_ack.quick is determined by heuristics, it can be 0; when
      that occurs, the icsk_ack.pingpong value is ignored and a delayed
      ACK is sent regardless.
      
      This patch moves the RTAX_QUICKACK per-dst check into the
      tcp_in_quickack_mode() routine, which ensures that a quickack is
      always sent when the quickack knob is enabled for that dst.
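      
      A minimal sketch of the resulting check, assuming the patch keeps the
      existing helper name:
      
          static bool tcp_in_quickack_mode(struct sock *sk)
          {
                  const struct inet_connection_sock *icsk = inet_csk(sk);
                  const struct dst_entry *dst = __sk_dst_get(sk);
      
                  /* the per-dst RTAX_QUICKACK metric now forces a quick ack,
                   * regardless of the icsk_ack.quick heuristic
                   */
                  return (dst && dst_metric(dst, RTAX_QUICKACK)) ||
                          (icsk->icsk_ack.quick && !icsk->icsk_ack.pingpong);
          }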
      Signed-off-by: Jon Maxwell <jmaxwell37@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      2251ae46
  9. 12 Jun 2015 (4 commits)
  10. 04 Jun 2015 (1 commit)
  11. 27 May 2015 (1 commit)
  12. 22 May 2015 (2 commits)
    • tcp: add tcpi_segs_in and tcpi_segs_out to tcp_info · 2efd055c
      Authored by Marcelo Ricardo Leitner
      This patch tracks the total number of inbound and outbound segments on
      a TCP socket. One may use these counters to get an idea of connection
      quality when compared against the number of retransmissions.
      
      RFC 4898 names these tcpEStatsPerfSegsIn and tcpEStatsPerfSegsOut.
      
      Each is a 32-bit field, and both can be fetched either from the
      TCP_INFO getsockopt() if one has a handle on a TCP socket, or from the
      inet_diag netlink facility (an iproute2/ss patch will follow).
      
      Note that tp->segs_out was placed near tp->snd_nxt for good data
      locality and minimal performance impact, while tp->segs_in was placed
      near tp->bytes_received for the same reason.
      
      Joint work with Eric Dumazet.
      
      Note that received SYNs are accounted on the listener, but sent
      SYNACKs are not.
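      
      A small userspace sketch of fetching the new counters (requires kernel
      headers new enough to expose the tcpi_segs_* fields):
      
          #include <stdio.h>
          #include <netinet/in.h>
          #include <netinet/tcp.h>
          #include <sys/socket.h>
      
          /* print segment counters for a connected TCP socket fd */
          static void print_seg_counters(int fd)
          {
                  struct tcp_info ti;
                  socklen_t len = sizeof(ti);
      
                  if (getsockopt(fd, IPPROTO_TCP, TCP_INFO, &ti, &len) == 0)
                          printf("segs_in=%u segs_out=%u retrans=%u\n",
                                 ti.tcpi_segs_in, ti.tcpi_segs_out,
                                 ti.tcpi_total_retrans);
          }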
      Signed-off-by: Marcelo Ricardo Leitner <mleitner@redhat.com>
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      2efd055c
    • tcp: add a force_schedule argument to sk_stream_alloc_skb() · eb934478
      Authored by Eric Dumazet
      In commit 8e4d980a ("tcp: fix behavior for epoll edge trigger")
      we fixed a possible hang of TCP sockets under memory pressure by
      allowing sk_stream_alloc_skb() to use sk_forced_mem_schedule()
      if no packet is in the socket write queue.
      
      It turns out there are other cases where we want to force memory
      scheduling:
      
      tcp_fragment() and tso_fragment() need to split a big TSO packet into
      two smaller ones. If we block here because of TCP memory pressure,
      we can effectively block the TCP socket from sending new data.
      If no further ACK is coming, this hang would be permanent, and the
      socket would have no chance to effectively reduce its memory usage.
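      
      A sketch of the new signature and the force path (error handling
      trimmed; details may differ from the actual patch):
      
          struct sk_buff *sk_stream_alloc_skb(struct sock *sk, int size,
                                              gfp_t gfp, bool force_schedule)
          {
                  struct sk_buff *skb;
                  bool mem_scheduled;
      
                  skb = alloc_skb_fclone(size + sk->sk_prot->max_header, gfp);
                  if (likely(skb)) {
                          if (force_schedule) {
                                  /* callers splitting an already-queued
                                   * packet must make progress even under
                                   * TCP memory pressure
                                   */
                                  mem_scheduled = true;
                                  sk_forced_mem_schedule(sk, skb->truesize);
                          } else {
                                  mem_scheduled = sk_wmem_schedule(sk, skb->truesize);
                          }
                          if (likely(mem_scheduled)) {
                                  skb_reserve(skb, sk->sk_prot->max_header);
                                  return skb;
                          }
                          __kfree_skb(skb);
                  }
                  return NULL;
          }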
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      eb934478
  13. 20 May 2015 (1 commit)
    • tcp: add rfc3168, section 6.1.1.1. fallback · 49213555
      Authored by Daniel Borkmann
      This work is a follow-up to commit f7b3bec6 ("net: allow setting ecn
      via routing table") and adds an RFC 3168, section 6.1.1.1. fallback for
      outgoing ECN connections. In other words, it adds a retry with a
      non-ECN-setup SYN packet on the first timeout, as suggested by the RFC:
      
        [...] A host that receives no reply to an ECN-setup SYN within the
        normal SYN retransmission timeout interval MAY resend the SYN and
        any subsequent SYN retransmissions with CWR and ECE cleared. [...]
      
      Schematic client-side view, assuming the server is in tcp_ecn=2 mode,
      that is, the Linux default since 2009 via commit 255cac91 ("tcp: extend
      ECN sysctl to allow server-side only ECN"):
      
       1) Normal ECN-capable path:
      
          SYN ECE CWR ----->
                      <----- SYN ACK ECE
                  ACK ----->
      
       2) Path with broken middlebox, when client has fallback:
      
          SYN ECE CWR ----X crappy middlebox drops packet
                            (timeout, rtx)
                  SYN ----->
                      <----- SYN ACK
                  ACK ----->
      
      Had we not implemented the fallback, the middlebox drop point would
      basically end up as:
      
          SYN ECE CWR ----X crappy middlebox drops packet
                            (timeout, rtx)
          SYN ECE CWR ----X crappy middlebox drops packet
                            (timeout, rtx)
          SYN ECE CWR ----X crappy middlebox drops packet
                            (timeout, rtx)
      
      In any case, only a rather small percentage of sites would incur such
      additional setup latency: it was found at the end of 2014 that ~56%
      of IPv4 and 65% of IPv6 servers on the Alexa 1 million list would
      negotiate ECN (i.e. under the tcp_ecn=2 default), while 0.42% of these
      webservers fail to connect when trying to negotiate ECN (tcp_ecn=1) due
      to timeouts, which the fallback would mitigate with a slight latency
      trade-off. A recent related paper on this topic:
      
        Brian Trammell, Mirja Kühlewind, Damiano Boppart, Iain Learmonth,
        Gorry Fairhurst, and Richard Scheffenegger:
          "Enabling Internet-Wide Deployment of Explicit Congestion Notification."
          Proc. PAM 2015, New York.
        http://ecn.ethz.ch/ecn-pam15.pdf
      
      Thus, when net.ipv4.tcp_ecn=1 is set, this patch performs the RFC 3168,
      section 6.1.1.1. fallback on timeout. For users who explicitly do not
      want this, which can be the case in DC deployments, we add a
      net.ipv4.tcp_ecn_fallback knob that allows disabling the fallback.
      
      tp->ecn_flags is not cleared in tcp_ecn_clear_syn() on output; rather,
      we let tcp_ecn_rcv_synack() take that over on the input path, in case a
      SYN ACK ECE was delayed. Thus a spurious SYN retransmission will not
      prevent ECN from eventually being negotiated in that case.
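      
      A minimal sketch of the output-side helper, per the description above
      (called on a SYN retransmit; the sysctl placement is assumed):
      
          static void tcp_ecn_clear_syn(struct sock *sk, struct sk_buff *skb)
          {
                  if (sock_net(sk)->ipv4.sysctl_tcp_ecn_fallback)
                          /* tp->ecn_flags is cleared later, once the SYN ACK
                           * is ultimately received (tcp_ecn_rcv_synack())
                           */
                          TCP_SKB_CB(skb)->tcp_flags &=
                                  ~(TCPHDR_ECE | TCPHDR_CWR);
          }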
      
      Reference: https://www.ietf.org/proceedings/92/slides/slides-92-iccrg-1.pdf
      Reference: https://www.ietf.org/proceedings/89/slides/slides-89-tsvarea-1.pdf
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: Florian Westphal <fw@strlen.de>
      Signed-off-by: Mirja Kühlewind <mirja.kuehlewind@tik.ee.ethz.ch>
      Signed-off-by: Brian Trammell <trammell@tik.ee.ethz.ch>
      Cc: Eric Dumazet <edumazet@google.com>
      Cc: Dave Täht <dave.taht@gmail.com>
      Acked-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      49213555
  14. 18 May 2015 (2 commits)
  15. 10 May 2015 (2 commits)
  16. 24 Apr 2015 (1 commit)
    • tcp: avoid looping in tcp_send_fin() · 845704a5
      Authored by Eric Dumazet
      The presence of an unbounded loop in tcp_send_fin() has always been
      hard to explain when analyzing crash dumps involving gigantic dying
      processes with millions of sockets.
      
      Let's try a different strategy:
      
      In case of memory pressure, try to add the FIN flag to the last packet
      in the write queue, even if that packet was already sent. The TCP stack
      will be able to deliver this FIN after a timeout event. Note that since
      this FIN is delivered by a retransmit, it also carries a PSH flag in
      our current implementation.
      
      By checking sk_under_memory_pressure(), we anticipate that cooking
      many FIN packets might deplete TCP memory.
      
      In the case where we could not allocate a packet, even with a
      __GFP_WAIT allocation, not sending a FIN seems quite reasonable if it
      allows us to get rid of this socket, free memory, and not block the
      process from eventually doing other useful work.
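      
      A sketch of the strategy (the allocation and push paths are trimmed):
      
          void tcp_send_fin(struct sock *sk)
          {
                  struct sk_buff *tskb = tcp_write_queue_tail(sk);
                  struct tcp_sock *tp = tcp_sk(sk);
      
                  /* tack the FIN onto the tail skb if it was not sent yet,
                   * or if we are under memory pressure (even if it was sent:
                   * the retransmit path will deliver the FIN later)
                   */
                  if (tskb && (tcp_send_head(sk) ||
                               sk_under_memory_pressure(sk))) {
                          TCP_SKB_CB(tskb)->tcp_flags |= TCPHDR_FIN;
                          TCP_SKB_CB(tskb)->end_seq++;
                          tp->write_seq++;
                          if (!tcp_send_head(sk)) {
                                  /* tskb was already sent: pretend the FIN
                                   * was on it, since the retransmit path
                                   * does not advance tp->snd_nxt
                                   */
                                  tp->snd_nxt++;
                                  return;
                          }
                  }
                  /* else allocate a fresh FIN skb; if even a __GFP_WAIT
                   * allocation fails, give up instead of looping
                   */
          }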
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      845704a5
  17. 23 Apr 2015 (1 commit)
    • tcp: fix possible deadlock in tcp_send_fin() · d83769a5
      Authored by Eric Dumazet
      Using sk_stream_alloc_skb() in tcp_send_fin() is dangerous in
      case a huge process is killed by OOM, and tcp_mem[2] is hit.
      
      To be able to free memory we need to make progress, so this
      patch allows FIN packets to ignore tcp_mem[2] if the skb
      allocation succeeded.
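      
      A sketch of the idea: charge the memory unconditionally so the FIN skb
      is not blocked by tcp_mem[2] (the helper name and the status argument
      of sk_memory_allocated_add() follow the APIs of that era; details may
      differ from the actual patch):
      
          static void sk_forced_wmem_schedule(struct sock *sk, int size)
          {
                  int amt, status;
      
                  if (size <= sk->sk_forward_alloc)
                          return;
                  amt = sk_mem_pages(size);
                  sk->sk_forward_alloc += amt * SK_MEM_QUANTUM;
                  sk_memory_allocated_add(sk, amt, &status);
          }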
      
      In a follow-up patch, we might abort the tcp_send_fin() infinite loop
      when TIF_MEMDIE is set on this thread, as the memory allocator has
      already done its best to get extra memory.
      
      This patch reverts d22e1537 ("tcp: fix tcp fin memory accounting")
      
      Fixes: d22e1537 ("tcp: fix tcp fin memory accounting")
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      d83769a5
  18. 10 Apr 2015 (1 commit)
    • tcp: tcp_make_synack() should clear skb->tstamp · b50edd78
      Authored by Eric Dumazet
      I noticed tcpdump was giving funky timestamps for locally
      generated SYNACK messages on the loopback interface.
      
      11:42:46.938990 IP 127.0.0.1.48245 > 127.0.0.2.23850: S
      945476042:945476042(0) win 43690 <mss 65495,nop,nop,sackOK,nop,wscale 7>
      
      20:28:58.502209 IP 127.0.0.2.23850 > 127.0.0.1.48245: S
      3160535375:3160535375(0) ack 945476043 win 43690 <mss
      65495,nop,nop,sackOK,nop,wscale 7>
      
      This is because we need to clear skb->tstamp before
      entering the lower stack; otherwise net_timestamp_check()
      does not set skb->tstamp.
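      
      A minimal sketch of the fix at the end of tcp_make_synack(), assuming
      the pre-4.10 ktime_t layout with a .tv64 member:
      
          /* do not fool tcpdump (if any): clean our debris */
          skb->tstamp.tv64 = 0;
          return skb;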
      
      Fixes: 7faee5c0 ("tcp: remove TCP_SKB_CB(skb)->when")
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      b50edd78
  19. 08 Apr 2015 (2 commits)
  20. 04 Apr 2015 (2 commits)
  21. 25 Mar 2015 (3 commits)
  22. 21 Mar 2015 (2 commits)
  23. 07 Mar 2015 (2 commits)
    • ipv4: Create probe timer for tcp PMTU as per RFC4821 · 05cbc0db
      Authored by Fan Du
      As per RFC 4821, section 7.3 ("Selecting Probe Size"), a probe timer
      should be armed once probing has converged. Once this timer expires,
      probe again to take advantage of any path MTU change. The recommended
      probing interval is 10 minutes per RFC 1981. The probing interval can
      be tuned through sysctl_tcp_probe_interval.
      
      Eric Dumazet suggested implementing a pseudo timer based on the 32-bit
      jiffies tcp_time_stamp instead of using a classic timer for such a
      rare event.
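      
      A sketch of the pseudo timer, checked on the path that would otherwise
      keep the converged search range (field and sysctl names follow the
      description above; details may differ from the actual patch):
      
          static void tcp_mtu_check_reprobe(struct sock *sk)
          {
                  struct inet_connection_sock *icsk = inet_csk(sk);
                  struct tcp_sock *tp = tcp_sk(sk);
                  u32 interval = sysctl_tcp_probe_interval;
                  s32 delta;
      
                  /* compare 32-bit jiffies timestamps instead of arming a
                   * real timer for this rare event
                   */
                  delta = tcp_time_stamp - icsk->icsk_mtup.probe_timestamp;
                  if (unlikely(delta >= interval * HZ)) {
                          int mss = tcp_current_mss(sk);
      
                          /* path may have changed: reopen the search range */
                          icsk->icsk_mtup.probe_size = 0;
                          icsk->icsk_mtup.search_high = tp->rx_opt.mss_clamp +
                                  sizeof(struct tcphdr) +
                                  icsk->icsk_af_ops->net_header_len;
                          icsk->icsk_mtup.search_low = tcp_mss_to_mtu(sk, mss);
                          icsk->icsk_mtup.probe_timestamp = tcp_time_stamp;
                  }
          }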
      Signed-off-by: Fan Du <fan.du@intel.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      05cbc0db
    • ipv4: Use binary search to choose tcp PMTU probe_size · 6b58e0a5
      Authored by Fan Du
      The current probe_size is chosen by doubling mss_cache, so the
      probing process ends quickly with a sub-optimal MSS, and the link
      MTU is not taken full advantage of; in turn, this forces the user
      to tweak tcp_base_mss with care.
      
      Use binary search to choose probe_size at a finer granularity, so
      that an optimal MSS is found to maximize performance.
      
      In addition, introduce a sysctl_tcp_probe_threshold to control when
      probing stops, with respect to the width of the search range.
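      
      A sketch of the selection inside tcp_mtu_probe(), per the description
      above (the exact stop handling may differ):
      
          /* pick the midpoint of the current search window */
          probe_size = tcp_mtu_to_mss(sk, (icsk->icsk_mtup.search_high +
                                           icsk->icsk_mtup.search_low) >> 1);
          interval = icsk->icsk_mtup.search_high - icsk->icsk_mtup.search_low;
      
          /* back off once the window is narrower than the threshold */
          if (probe_size > tcp_mtu_to_mss(sk, icsk->icsk_mtup.search_high) ||
              interval < sysctl_tcp_probe_threshold)
                  return -1;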
      
      Test env:
      Docker instance with vxlan encapsulation (82599EB)
      iperf -c 10.0.0.24 -t 60
      
      Before this patch:
      1.26 Gbits/sec
      
      After this patch (a 26% increase):
      1.59 Gbits/sec
      Signed-off-by: Fan Du <fan.du@intel.com>
      Acked-by: John Heffner <johnwheffner@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      6b58e0a5
  24. 01 Mar 2015 (2 commits)
    • tcp: tso: allow CA_CWR state in tcp_tso_should_defer() · a0ea700e
      Authored by Eric Dumazet
      Another TCP issue is triggered by ECN.
      
      Under pressure, the receiver gets ECN marks and sends back ACK packets
      with the ECE TCP flag set. The sender then enters the CA_CWR state.
      
      In this state, tcp_tso_should_defer() is short-circuited:
      
      if (icsk->icsk_ca_state != TCP_CA_Open)
          goto send_now;
      
      This means that nearly all the ACK packets we receive trigger
      a partial send, and because cwnd is kept small, we can only send
      a small amount of data for each incoming ACK,
      which in turn generates more ACK packets.
      
      Allowing both the CA_Open and CA_CWR states to enable TSO deferral in
      tcp_tso_should_defer() brings performance back:
      TSO autodefer then has more chance to defer under pressure.
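      
      The relaxed check would look like this sketch (the in-tree form may be
      written differently):
      
          if (icsk->icsk_ca_state != TCP_CA_Open &&
              icsk->icsk_ca_state != TCP_CA_CWR)
                  goto send_now;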
      
      This patch increases TSO and LRO/GRO efficiency back to normal levels,
      and does not impact overall ECN behavior.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: Yuchung Cheng <ycheng@google.com>
      Signed-off-by: Neal Cardwell <ncardwell@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      a0ea700e
    • tcp: tso: restore IW10 after TSO autosizing · 50c8339e
      Authored by Eric Dumazet
      With sysctl_tcp_min_tso_segs being 4, it is very possible
      that tcp_tso_should_defer() decides not to send the last 2 MSS
      of an initial window of 10 packets. This also applies if
      autosizing decides to send X MSS per GSO packet and cwnd
      is not a multiple of X.
      
      This patch implements a heuristic based on the age of the first
      skb in the write queue: if it was sent very recently (less than half
      the srtt ago), we can predict that no ACK packet will arrive in less
      than half an rtt, so deferring might cause under-utilization of our
      window.
      
      This is visible on the initial send (IW10) on web servers,
      but more generally with some RPCs, as the last part of the message
      might otherwise need an extra RTT to get delivered.
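      
      A sketch of the heuristic inside tcp_tso_should_defer() (srtt_us
      stores srtt << 3, so >> 4 yields half the srtt):
      
          head = tcp_write_queue_head(sk);
          skb_mstamp_get(&now);
          age = skb_mstamp_us_delta(&now, &head->skb_mstamp);
          /* the head skb was sent less than srtt/2 ago, so no ACK can
           * arrive soon enough; send now instead of deferring
           */
          if (age < (tp->srtt_us >> 4))
                  goto send_now;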
      
      Tested:
      
      Ran the following packetdrill test:
      // A simple server-side test that sends exactly an initial window (IW10)
      // worth of packets.
      
      `sysctl -e -q net.ipv4.tcp_min_tso_segs=4`
      
      0.000 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
      +0    setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
      +0    bind(3, ..., ...) = 0
      +0    listen(3, 1) = 0
      
      +.1   < S 0:0(0) win 32792 <mss 1460,sackOK,nop,nop,nop,wscale 7>
      +0    > S. 0:0(0) ack 1 <mss 1460,nop,nop,sackOK,nop,wscale 6>
      +.1   < . 1:1(0) ack 1 win 257
      +0    accept(3, ..., ...) = 4
      
      +0    write(4, ..., 14600) = 14600
      +0    > . 1:5841(5840) ack 1 win 457
      +0    > . 5841:11681(5840) ack 1 win 457
      // Following packet should be sent right now.
      +0    > P. 11681:14601(2920) ack 1 win 457
      
      +.1   < . 1:1(0) ack 14601 win 257
      
      +0    close(4) = 0
      +0    > F. 14601:14601(0) ack 1
      +.1   < F. 1:1(0) ack 14602 win 257
      +0    > . 14602:14602(0) ack 2
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: Yuchung Cheng <ycheng@google.com>
      Signed-off-by: Neal Cardwell <ncardwell@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      50c8339e