1. 26 6月, 2017 1 次提交
  2. 01 6月, 2017 1 次提交
  3. 26 5月, 2017 1 次提交
    • W
      tcp: avoid fastopen API to be used on AF_UNSPEC · ba615f67
      Wei Wang 提交于
      Fastopen API should be used to perform fastopen operations on the TCP
      socket. It does not make sense to use fastopen API to perform disconnect
      by calling it with AF_UNSPEC. The fastopen data path is also prone to
      race conditions and bugs when using with AF_UNSPEC.
      
      One issue reported and analyzed by Vegard Nossum is as follows:
      +++++++++++++++++++++++++++++++++++++++++++++++++++++++++
      Thread A:                            Thread B:
      ------------------------------------------------------------------------
      sendto()
       - tcp_sendmsg()
           - sk_stream_memory_free() = 0
               - goto wait_for_sndbuf
      	     - sk_stream_wait_memory()
      	        - sk_wait_event() // sleep
                |                          sendto(flags=MSG_FASTOPEN, dest_addr=AF_UNSPEC)
      	  |                           - tcp_sendmsg()
      	  |                              - tcp_sendmsg_fastopen()
      	  |                                 - __inet_stream_connect()
      	  |                                    - tcp_disconnect() //because of AF_UNSPEC
      	  |                                       - tcp_transmit_skb()// send RST
      	  |                                    - return 0; // no reconnect!
      	  |                           - sk_stream_wait_connect()
      	  |                                 - sock_error()
      	  |                                    - xchg(&sk->sk_err, 0)
      	  |                                    - return -ECONNRESET
      	- ... // wake up, see sk->sk_err == 0
          - skb_entail() on TCP_CLOSE socket
      
      If the connection is reopened then we will send a brand new SYN packet
      after thread A has already queued a buffer. At this point I think the
      socket internal state (sequence numbers etc.) becomes messed up.
      
      When the new connection is closed, the FIN-ACK is rejected because the
      sequence number is outside the window. The other side tries to
      retransmit,
      but __tcp_retransmit_skb() calls tcp_trim_head() on an empty skb which
      corrupts the skb data length and hits a BUG() in copy_and_csum_bits().
      +++++++++++++++++++++++++++++++++++++++++++++++++++++++++
      
      Hence, this patch adds a check for AF_UNSPEC in the fastopen data path
      and return EOPNOTSUPP to user if such case happens.
      
      Fixes: cf60af03 ("tcp: Fast Open client - sendmsg(MSG_FASTOPEN)")
      Reported-by: NVegard Nossum <vegard.nossum@oracle.com>
      Signed-off-by: NWei Wang <weiwan@google.com>
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      ba615f67
  4. 22 5月, 2017 1 次提交
  5. 01 5月, 2017 1 次提交
  6. 27 4月, 2017 1 次提交
    • E
      tcp: switch rcv_rtt_est and rcvq_space to high resolution timestamps · 645f4c6f
      Eric Dumazet 提交于
      Some devices or distributions use HZ=100 or HZ=250
      
      TCP receive buffer autotuning has poor behavior caused by this choice.
      Since autotuning happens after 4 ms or 10 ms, short distance flows
      get their receive buffer tuned to a very high value, but after an initial
      period where it was frozen to (too small) initial value.
      
      With tp->tcp_mstamp introduction, we can switch to high resolution
      timestamps almost for free (at the expense of 8 additional bytes per
      TCP structure)
      
      Note that some TCP stacks use usec TCP timestamps where this
      patch makes even more sense : Many TCP flows have < 500 usec RTT.
      Hopefully this finer TS option can be standardized soon.
      
      Tested:
       HZ=100 kernel
       ./netperf -H lpaa24 -t TCP_RR -l 1000 -- -r 10000,10000 &
      
       Peer without patch :
       lpaa24:~# ss -tmi dst lpaa23
       ...
       skmem:(r0,rb8388608,...)
       rcv_rtt:10 rcv_space:3210000 minrtt:0.017
      
       Peer with the patch :
       lpaa23:~# ss -tmi dst lpaa24
       ...
       skmem:(r0,rb428800,...)
       rcv_rtt:0.069 rcv_space:30000 minrtt:0.017
      
      We can see saner RCVBUF, and more precise rcv_rtt information.
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Acked-by: NSoheil Hassas Yeganeh <soheil@google.com>
      Acked-by: NNeal Cardwell <ncardwell@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      645f4c6f
  7. 25 4月, 2017 1 次提交
    • W
      net/tcp_fastopen: Disable active side TFO in certain scenarios · cf1ef3f0
      Wei Wang 提交于
      Middlebox firewall issues can potentially cause server's data being
      blackholed after a successful 3WHS using TFO. Following are the related
      reports from Apple:
      https://www.nanog.org/sites/default/files/Paasch_Network_Support.pdf
      Slide 31 identifies an issue where the client ACK to the server's data
      sent during a TFO'd handshake is dropped.
      C ---> syn-data ---> S
      C <--- syn/ack ----- S
      C (accept & write)
      C <---- data ------- S
      C ----- ACK -> X     S
      		[retry and timeout]
      
      https://www.ietf.org/proceedings/94/slides/slides-94-tcpm-13.pdf
      Slide 5 shows a similar situation that the server's data gets dropped
      after 3WHS.
      C ---- syn-data ---> S
      C <--- syn/ack ----- S
      C ---- ack --------> S
      S (accept & write)
      C?  X <- data ------ S
      		[retry and timeout]
      
      This is the worst failure b/c the client can not detect such behavior to
      mitigate the situation (such as disabling TFO). Failing to proceed, the
      application (e.g., SSL library) may simply timeout and retry with TFO
      again, and the process repeats indefinitely.
      
      The proposed solution is to disable active TFO globally under the
      following circumstances:
      1. client side TFO socket detects out of order FIN
      2. client side TFO socket receives out of order RST
      
      We disable active side TFO globally for 1hr at first. Then if it
      happens again, we disable it for 2h, then 4h, 8h, ...
      And we reset the timeout to 1hr if a client side TFO sockets not opened
      on loopback has successfully received data segs from server.
      And we examine this condition during close().
      
      The rational behind it is that when such firewall issue happens,
      application running on the client should eventually close the socket as
      it is not able to get the data it is expecting. Or application running
      on the server should close the socket as it is not able to receive any
      response from client.
      In both cases, out of order FIN or RST will get received on the client
      given that the firewall will not block them as no data are in those
      frames.
      And we want to disable active TFO globally as it helps if the middle box
      is very close to the client and most of the connections are likely to
      fail.
      
      Also, add a debug sysctl:
        tcp_fastopen_blackhole_detect_timeout_sec:
          the initial timeout to use when firewall blackhole issue happens.
          This can be set and read.
          When setting it to 0, it means to disable the active disable logic.
      Signed-off-by: NWei Wang <weiwan@google.com>
      Acked-by: NYuchung Cheng <ycheng@google.com>
      Acked-by: NNeal Cardwell <ncardwell@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      cf1ef3f0
  8. 10 4月, 2017 1 次提交
    • E
      tcp: clear saved_syn in tcp_disconnect() · 17c3060b
      Eric Dumazet 提交于
      In the (very unlikely) case a passive socket becomes a listener,
      we do not want to duplicate its saved SYN headers.
      
      This would lead to double frees, use after free, and please hackers and
      various fuzzers
      
      Tested:
          0 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
         +0 setsockopt(3, IPPROTO_TCP, TCP_SAVE_SYN, [1], 4) = 0
         +0 fcntl(3, F_SETFL, O_RDWR|O_NONBLOCK) = 0
      
         +0 bind(3, ..., ...) = 0
         +0 listen(3, 5) = 0
      
         +0 < S 0:0(0) win 32972 <mss 1460,nop,wscale 7>
         +0 > S. 0:0(0) ack 1 <...>
        +.1 < . 1:1(0) ack 1 win 257
         +0 accept(3, ..., ...) = 4
      
         +0 connect(4, AF_UNSPEC, ...) = 0
         +0 close(3) = 0
         +0 bind(4, ..., ...) = 0
         +0 listen(4, 5) = 0
      
         +0 < S 0:0(0) win 32972 <mss 1460,nop,wscale 7>
         +0 > S. 0:0(0) ack 1 <...>
        +.1 < . 1:1(0) ack 1 win 257
      
      Fixes: cd8ae852 ("tcp: provide SYN headers for passive connections")
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      17c3060b
  9. 05 4月, 2017 1 次提交
  10. 23 3月, 2017 1 次提交
  11. 17 3月, 2017 1 次提交
  12. 03 3月, 2017 1 次提交
  13. 18 2月, 2017 1 次提交
  14. 07 2月, 2017 1 次提交
  15. 30 1月, 2017 1 次提交
  16. 26 1月, 2017 2 次提交
    • W
      net/tcp-fastopen: make connect()'s return case more consistent with non-TFO · 3979ad7e
      Willy Tarreau 提交于
      Without TFO, any subsequent connect() call after a successful one returns
      -1 EISCONN. The last API update ensured that __inet_stream_connect() can
      return -1 EINPROGRESS in response to sendmsg() when TFO is in use to
      indicate that the connection is now in progress. Unfortunately since this
      function is used both for connect() and sendmsg(), it has the undesired
      side effect of making connect() now return -1 EINPROGRESS as well after
      a successful call, while at the same time poll() returns POLLOUT. This
      can confuse some applications which happen to call connect() and to
      check for -1 EISCONN to ensure the connection is usable, and for which
      EINPROGRESS indicates a need to poll, causing a loop.
      
      This problem was encountered in haproxy where a call to connect() is
      precisely used in certain cases to confirm a connection's readiness.
      While arguably haproxy's behaviour should be improved here, it seems
      important to aim at a more robust behaviour when the goal of the new
      API is to make it easier to implement TFO in existing applications.
      
      This patch simply ensures that we preserve the same semantics as in
      the non-TFO case on the connect() syscall when using TFO, while still
      returning -1 EINPROGRESS on sendmsg(). For this we simply tell
      __inet_stream_connect() whether we're doing a regular connect() or in
      fact connecting for a sendmsg() call.
      
      Cc: Wei Wang <weiwan@google.com>
      Cc: Yuchung Cheng <ycheng@google.com>
      Cc: Eric Dumazet <edumazet@google.com>
      Signed-off-by: NWilly Tarreau <w@1wt.eu>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      3979ad7e
    • W
      net/tcp-fastopen: Add new API support · 19f6d3f3
      Wei Wang 提交于
      This patch adds a new socket option, TCP_FASTOPEN_CONNECT, as an
      alternative way to perform Fast Open on the active side (client). Prior
      to this patch, a client needs to replace the connect() call with
      sendto(MSG_FASTOPEN). This can be cumbersome for applications who want
      to use Fast Open: these socket operations are often done in lower layer
      libraries used by many other applications. Changing these libraries
      and/or the socket call sequences are not trivial. A more convenient
      approach is to perform Fast Open by simply enabling a socket option when
      the socket is created w/o changing other socket calls sequence:
        s = socket()
          create a new socket
        setsockopt(s, IPPROTO_TCP, TCP_FASTOPEN_CONNECT …);
          newly introduced sockopt
          If set, new functionality described below will be used.
          Return ENOTSUPP if TFO is not supported or not enabled in the
          kernel.
      
        connect()
          With cookie present, return 0 immediately.
          With no cookie, initiate 3WHS with TFO cookie-request option and
          return -1 with errno = EINPROGRESS.
      
        write()/sendmsg()
          With cookie present, send out SYN with data and return the number of
          bytes buffered.
          With no cookie, and 3WHS not yet completed, return -1 with errno =
          EINPROGRESS.
          No MSG_FASTOPEN flag is needed.
      
        read()
          Return -1 with errno = EWOULDBLOCK/EAGAIN if connect() is called but
          write() is not called yet.
          Return -1 with errno = EWOULDBLOCK/EAGAIN if connection is
          established but no msg is received yet.
          Return number of bytes read if socket is established and there is
          msg received.
      
      The new API simplifies life for applications that always perform a write()
      immediately after a successful connect(). Such applications can now take
      advantage of Fast Open by merely making one new setsockopt() call at the time
      of creating the socket. Nothing else about the application's socket call
      sequence needs to change.
      Signed-off-by: NWei Wang <weiwan@google.com>
      Acked-by: NEric Dumazet <edumazet@google.com>
      Acked-by: NYuchung Cheng <ycheng@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      19f6d3f3
  17. 21 1月, 2017 1 次提交
  18. 14 1月, 2017 2 次提交
  19. 10 1月, 2017 1 次提交
  20. 06 1月, 2017 1 次提交
    • S
      tcp: provide timestamps for partial writes · ad02c4f5
      Soheil Hassas Yeganeh 提交于
      For TCP sockets, TX timestamps are only captured when the user data
      is successfully and fully written to the socket. In many cases,
      however, TCP writes can be partial for which no timestamp is
      collected.
      
      Collect timestamps whenever any user data is (fully or partially)
      copied into the socket. Pass tcp_write_queue_tail to tcp_tx_timestamp
      instead of the local skb pointer since it can be set to NULL on
      the error path.
      
      Note that tcp_write_queue_tail can be NULL, even if bytes have been
      copied to the socket. This is because acknowledgements are being
      processed in tcp_sendmsg(), and by the time tcp_tx_timestamp is
      called tcp_write_queue_tail can be NULL. For such cases, this patch
      does not collect any timestamps (i.e., it is best-effort).
      
      This patch is written with suggestions from Willem de Bruijn and
      Eric Dumazet.
      
      Change-log V1 -> V2:
      	- Use sockc.tsflags instead of sk->sk_tsflags.
      	- Use the same code path for normal writes and errors.
      Signed-off-by: NSoheil Hassas Yeganeh <soheil@google.com>
      Acked-by: NYuchung Cheng <ycheng@google.com>
      Cc: Willem de Bruijn <willemb@google.com>
      Cc: Eric Dumazet <edumazet@google.com>
      Cc: Neal Cardwell <ncardwell@google.com>
      Cc: Martin KaFai Lau <kafai@fb.com>
      Acked-by: NWillem de Bruijn <willemb@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      ad02c4f5
  21. 30 12月, 2016 2 次提交
  22. 25 12月, 2016 1 次提交
  23. 06 12月, 2016 1 次提交
  24. 30 11月, 2016 3 次提交
  25. 16 11月, 2016 1 次提交
  26. 10 11月, 2016 3 次提交
  27. 04 11月, 2016 2 次提交
    • E
      tcp: fix return value for partial writes · 79d8665b
      Eric Dumazet 提交于
      After my commit, tcp_sendmsg() might restart its loop after
      processing socket backlog.
      
      If sk_err is set, we blindly return an error, even though we
      copied data to user space before.
      
      We should instead return number of bytes that could be copied,
      otherwise user space might resend data and corrupt the stream.
      
      This might happen if another thread is using recvmsg(MSG_ERRQUEUE)
      to process timestamps.
      
      Issue was diagnosed by Soheil and Willem, big kudos to them !
      
      Fixes: d41a69f1 ("tcp: make tcp_sendmsg() aware of socket backlog")
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Cc: Willem de Bruijn <willemb@google.com>
      Cc: Soheil Hassas Yeganeh <soheil@google.com>
      Cc: Yuchung Cheng <ycheng@google.com>
      Cc: Neal Cardwell <ncardwell@google.com>
      Tested-by: NSoheil Hassas Yeganeh <soheil@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      79d8665b
    • E
      tcp: fix potential memory corruption · ac9e70b1
      Eric Dumazet 提交于
      Imagine initial value of max_skb_frags is 17, and last
      skb in write queue has 15 frags.
      
      Then max_skb_frags is lowered to 14 or smaller value.
      
      tcp_sendmsg() will then be allowed to add additional page frags
      and eventually go past MAX_SKB_FRAGS, overflowing struct
      skb_shared_info.
      
      Fixes: 5f74f82e ("net:Add sysctl_max_skb_frags")
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Cc: Hans Westgaard Ry <hans.westgaard.ry@oracle.com>
      Cc: Håkon Bugge <haakon.bugge@oracle.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      ac9e70b1
  28. 08 10月, 2016 1 次提交
  29. 04 10月, 2016 1 次提交
    • A
      skb_splice_bits(): get rid of callback · 25869262
      Al Viro 提交于
      since pipe_lock is the outermost now, we don't need to drop/regain
      socket locks around the call of splice_to_pipe() from skb_splice_bits(),
      which kills the need to have a socket-specific callback; we can just
      call splice_to_pipe() and be done with that.
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      25869262
  30. 21 9月, 2016 3 次提交
    • Y
      tcp: export data delivery rate · eb8329e0
      Yuchung Cheng 提交于
      This commit export two new fields in struct tcp_info:
      
        tcpi_delivery_rate: The most recent goodput, as measured by
          tcp_rate_gen(). If the socket is limited by the sending
          application (e.g., no data to send), it reports the highest
          measurement instead of the most recent. The unit is bytes per
          second (like other rate fields in tcp_info).
      
        tcpi_delivery_rate_app_limited: A boolean indicating if the goodput
          was measured when the socket's throughput was limited by the
          sending application.
      
      This delivery rate information can be useful for applications that
      want to know the current throughput the TCP connection is seeing,
      e.g. adaptive bitrate video streaming. It can also be very useful for
      debugging or troubleshooting.
      Signed-off-by: NVan Jacobson <vanj@google.com>
      Signed-off-by: NNeal Cardwell <ncardwell@google.com>
      Signed-off-by: NYuchung Cheng <ycheng@google.com>
      Signed-off-by: NNandita Dukkipati <nanditad@google.com>
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NSoheil Hassas Yeganeh <soheil@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      eb8329e0
    • S
      tcp: track application-limited rate samples · d7722e85
      Soheil Hassas Yeganeh 提交于
      This commit adds code to track whether the delivery rate represented
      by each rate_sample was limited by the application.
      
      Upon each transmit, we store in the is_app_limited field in the skb a
      boolean bit indicating whether there is a known "bubble in the pipe":
      a point in the rate sample interval where the sender was
      application-limited, and did not transmit even though the cwnd and
      pacing rate allowed it.
      
      This logic marks the flow app-limited on a write if *all* of the
      following are true:
      
        1) There is less than 1 MSS of unsent data in the write queue
           available to transmit.
      
        2) There is no packet in the sender's queues (e.g. in fq or the NIC
           tx queue).
      
        3) The connection is not limited by cwnd.
      
        4) There are no lost packets to retransmit.
      
      The tcp_rate_check_app_limited() code in tcp_rate.c determines whether
      the connection is application-limited at the moment. If the flow is
      application-limited, it sets the tp->app_limited field. If the flow is
      application-limited then that means there is effectively a "bubble" of
      silence in the pipe now, and this silence will be reflected in a lower
      bandwidth sample for any rate samples from now until we get an ACK
      indicating this bubble has exited the pipe: specifically, until we get
      an ACK for the next packet we transmit.
      
      When we send every skb we record in scb->tx.is_app_limited whether the
      resulting rate sample will be application-limited.
      
      The code in tcp_rate_gen() checks to see when it is safe to mark all
      known application-limited bubbles of silence as having exited the
      pipe. It does this by checking to see when the delivered count moves
      past the tp->app_limited marker. At this point it zeroes the
      tp->app_limited marker, as all known bubbles are out of the pipe.
      
      We make room for the tx.is_app_limited bit in the skb by borrowing a
      bit from the in_flight field used by NV to record the number of bytes
      in flight. The receive window in the TCP header is 16 bits, and the
      max receive window scaling shift factor is 14 (RFC 1323). So the max
      receive window offered by the TCP protocol is 2^(16+14) = 2^30. So we
      only need 30 bits for the tx.in_flight used by NV.
      Signed-off-by: NVan Jacobson <vanj@google.com>
      Signed-off-by: NNeal Cardwell <ncardwell@google.com>
      Signed-off-by: NYuchung Cheng <ycheng@google.com>
      Signed-off-by: NNandita Dukkipati <nanditad@google.com>
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NSoheil Hassas Yeganeh <soheil@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      d7722e85
    • E
      tcp: switch back to proper tcp_skb_cb size check in tcp_init() · b2d3ea4a
      Eric Dumazet 提交于
      Revert to the tcp_skb_cb size check that tcp_init() had before commit
      b4772ef8 ("net: use common macro for assering skb->cb[] available
      size in protocol families"). As related commit 744d5a3e ("net:
      move skb->dropcount to skb->cb[]") explains, the
      sock_skb_cb_check_size() mechanism was added to ensure that there is
      space for dropcount, "for protocol families using it". But TCP is not
      a protocol using dropcount, so tcp_init() doesn't need to provision
      space for dropcount in the skb->cb[], and thus we can revert to the
      older form of the tcp_skb_cb size check. Doing so allows TCP to use 4
      more bytes of the skb->cb[] space.
      
      Fixes: b4772ef8 ("net: use common macro for assering skb->cb[] available size in protocol families")
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NSoheil Hassas Yeganeh <soheil@google.com>
      Signed-off-by: NNeal Cardwell <ncardwell@google.com>
      Signed-off-by: NYuchung Cheng <ycheng@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      b2d3ea4a