1. 16 Jun, 2017 1 commit
    • tcp: ULP infrastructure · 734942cc
      Dave Watson authored
      Add the infrastructure for attaching Upper Layer Protocols (ULPs) over TCP
      sockets. Based on a similar infrastructure in tcp_cong.  The idea is that any
      ULP can add its own logic by changing the TCP proto_ops structure to its own
      methods.
      
      Example usage:
      
      setsockopt(sock, SOL_TCP, TCP_ULP, "tls", sizeof("tls"));
      
      modules will call:
      tcp_register_ulp(&tcp_tls_ulp_ops);
      
      to register/unregister their ulp, with an init function and name.
      
      A list of registered ulps will be returned by tcp_get_available_ulp, which is
      hooked up to /proc.  Example:
      
      $ cat /proc/sys/net/ipv4/tcp_available_ulp
      tls
      
      There is currently no functionality to remove or chain ULPs, but
      it should be possible to add these in the future if needed.
      Signed-off-by: Boris Pismenny <borisp@mellanox.com>
      Signed-off-by: Dave Watson <davejwatson@fb.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
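As a rough user-space illustration of the usage described in this commit, the Python sketch below parses the sysctl listing and attaches a ULP by name. The helper names are hypothetical. The fallback value 31 for TCP_ULP comes from linux/tcp.h and is assumed here because a given Python build may not export the constant.

```python
import socket

# TCP_ULP is 31 in linux/tcp.h; fall back to that value if this Python
# build does not export the constant (assumption for illustration).
TCP_ULP = getattr(socket, "TCP_ULP", 31)

ULP_SYSCTL = "/proc/sys/net/ipv4/tcp_available_ulp"


def parse_available_ulps(text):
    """Split the whitespace-separated sysctl contents into ULP names."""
    return text.split()


def read_available_ulps(path=ULP_SYSCTL):
    """Return the registered ULP names, or [] if the sysctl is missing."""
    try:
        with open(path) as f:
            return parse_available_ulps(f.read())
    except OSError:
        return []


def attach_ulp(sock, name="tls"):
    """Python form of setsockopt(sock, SOL_TCP, TCP_ULP, "tls", sizeof("tls")).

    The trailing NUL mirrors sizeof("tls"), which counts the terminator.
    IPPROTO_TCP has the same value as SOL_TCP on Linux.
    """
    sock.setsockopt(socket.IPPROTO_TCP, TCP_ULP, name.encode() + b"\0")
```

A caller would typically check that "tls" is in read_available_ulps() before calling attach_ulp() on a connected socket, since the setsockopt fails when no matching ULP is registered.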
  2. 08 Jun, 2017 1 commit
    • tcp: add TCPMemoryPressuresChrono counter · 06044751
      Eric Dumazet authored
      The DRAM supply shortage and poor memory pressure tracking in the TCP
      stack make any change in SO_SNDBUF/SO_RCVBUF (or equivalent autotuning
      limits) and tcp_mem[] quite hazardous.
      
      The TCPMemoryPressures SNMP counter indicates that tcp_mem sysctl
      limits are being hit, but it only tracks the number of transitions.
      
      If TCP stack behavior under stress were perfect:
      1) It would maintain memory usage close to the limit.
      2) Memory pressure state would be entered for short times.
      
      We certainly prefer 100 events lasting 10ms compared to one event
      lasting 200 seconds.
      
      This patch adds a new SNMP counter tracking cumulative duration of
      memory pressure events, given in ms units.
      
      $ cat /proc/sys/net/ipv4/tcp_mem
      3088    4117    6176
      $ grep TCP /proc/net/sockstat
      TCP: inuse 180 orphan 0 tw 2 alloc 234 mem 4140
      $ nstat -n ; sleep 10 ; nstat |grep Pressure
      TcpExtTCPMemoryPressures        1700
      TcpExtTCPMemoryPressuresChrono  5209
      
      v2: Used EXPORT_SYMBOL_GPL() instead of EXPORT_SYMBOL() as David
      instructed.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
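The two counters in the sample output can be combined into an average duration per memory pressure event. The sketch below is only an illustration of that arithmetic; the function names are hypothetical and the parser assumes the "name value" column layout shown in the commit message.

```python
def parse_nstat(text):
    """Map counter name -> integer value from nstat-style output."""
    counters = {}
    for line in text.splitlines():
        fields = line.split()
        # Skip header or malformed lines; keep "name value ..." rows.
        if len(fields) >= 2 and fields[1].isdigit():
            counters[fields[0]] = int(fields[1])
    return counters


def mean_pressure_ms(counters):
    """Average time spent in memory pressure per event, in ms."""
    events = counters.get("TcpExtTCPMemoryPressures", 0)
    total_ms = counters.get("TcpExtTCPMemoryPressuresChrono", 0)
    return total_ms / events if events else 0.0


sample = """\
TcpExtTCPMemoryPressures        1700
TcpExtTCPMemoryPressuresChrono  5209
"""
print(round(mean_pressure_ms(parse_nstat(sample)), 2))  # prints 3.06
```

With the numbers from the commit message, 1700 events over 5209 ms give about 3 ms per event, which is the "many short events" pattern the message describes as preferable.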
  3. 01 Jun, 2017 1 commit
  4. 26 May, 2017 1 commit
    • tcp: avoid fastopen API to be used on AF_UNSPEC · ba615f67
      Wei Wang authored
      Fastopen API should be used to perform fastopen operations on the TCP
      socket. It does not make sense to use fastopen API to perform disconnect
      by calling it with AF_UNSPEC. The fastopen data path is also prone to
      race conditions and bugs when used with AF_UNSPEC.
      
      One issue reported and analyzed by Vegard Nossum is as follows:
      +++++++++++++++++++++++++++++++++++++++++++++++++++++++++
      Thread A:                            Thread B:
      ------------------------------------------------------------------------
      sendto()
       - tcp_sendmsg()
           - sk_stream_memory_free() = 0
               - goto wait_for_sndbuf
      	     - sk_stream_wait_memory()
      	        - sk_wait_event() // sleep
                |                          sendto(flags=MSG_FASTOPEN, dest_addr=AF_UNSPEC)
      	  |                           - tcp_sendmsg()
      	  |                              - tcp_sendmsg_fastopen()
      	  |                                 - __inet_stream_connect()
      	  |                                    - tcp_disconnect() //because of AF_UNSPEC
      	  |                                       - tcp_transmit_skb()// send RST
      	  |                                    - return 0; // no reconnect!
      	  |                           - sk_stream_wait_connect()
      	  |                                 - sock_error()
      	  |                                    - xchg(&sk->sk_err, 0)
      	  |                                    - return -ECONNRESET
      	- ... // wake up, see sk->sk_err == 0
          - skb_entail() on TCP_CLOSE socket
      
      If the connection is reopened then we will send a brand new SYN packet
      after thread A has already queued a buffer. At this point I think the
      socket internal state (sequence numbers etc.) becomes messed up.
      
      When the new connection is closed, the FIN-ACK is rejected because the
      sequence number is outside the window. The other side tries to
      retransmit, but __tcp_retransmit_skb() calls tcp_trim_head() on an empty
      skb which corrupts the skb data length and hits a BUG() in
      copy_and_csum_bits().
      +++++++++++++++++++++++++++++++++++++++++++++++++++++++++
      
      Hence, this patch adds a check for AF_UNSPEC in the fastopen data path
      and returns EOPNOTSUPP to the user in that case.
      
      Fixes: cf60af03 ("tcp: Fast Open client - sendmsg(MSG_FASTOPEN)")
      Reported-by: Vegard Nossum <vegard.nossum@oracle.com>
      Signed-off-by: Wei Wang <weiwan@google.com>
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  5. 23 May, 2017 1 commit
  6. 22 May, 2017 1 commit
  7. 18 May, 2017 4 commits
  8. 01 May, 2017 1 commit
  9. 27 Apr, 2017 1 commit
    • tcp: switch rcv_rtt_est and rcvq_space to high resolution timestamps · 645f4c6f
      Eric Dumazet authored
      Some devices or distributions use HZ=100 or HZ=250.
      
      TCP receive buffer autotuning behaves poorly with this choice.
      Since autotuning happens after 4 ms or 10 ms, short-distance flows
      get their receive buffer tuned to a very high value, but only after an
      initial period where it was frozen at a (too small) initial value.
      
      With tp->tcp_mstamp introduction, we can switch to high resolution
      timestamps almost for free (at the expense of 8 additional bytes per
      TCP structure)
      
      Note that some TCP stacks use usec TCP timestamps where this
      patch makes even more sense : Many TCP flows have < 500 usec RTT.
      Hopefully this finer TS option can be standardized soon.
      
      Tested:
       HZ=100 kernel
       ./netperf -H lpaa24 -t TCP_RR -l 1000 -- -r 10000,10000 &
      
       Peer without patch :
       lpaa24:~# ss -tmi dst lpaa23
       ...
       skmem:(r0,rb8388608,...)
       rcv_rtt:10 rcv_space:3210000 minrtt:0.017
      
       Peer with the patch :
       lpaa23:~# ss -tmi dst lpaa24
       ...
       skmem:(r0,rb428800,...)
       rcv_rtt:0.069 rcv_space:30000 minrtt:0.017
      
      We can see saner RCVBUF, and more precise rcv_rtt information.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
      Acked-by: Neal Cardwell <ncardwell@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
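The effect of the HZ setting on the measured receive RTT can be sketched with simple arithmetic. This is a simplified model, not the kernel code: it assumes a jiffies-based clock counts whole ticks and treats any sample shorter than one tick as one tick. The function names are hypothetical.

```python
def tick_us(hz):
    """Length of one jiffy in microseconds for a given HZ."""
    return 1_000_000 // hz


def jiffies_rtt_us(true_rtt_us, hz):
    """RTT as seen through a jiffies clock: whole ticks, at least one.

    Assumption: a sample shorter than one tick is counted as one tick.
    """
    ticks = max(1, true_rtt_us // tick_us(hz))
    return ticks * tick_us(hz)


# A 69 usec flow, as in the "with the patch" output above.
print(jiffies_rtt_us(69, 100))  # prints 10000 (10 ms at HZ=100)
print(jiffies_rtt_us(69, 250))  # prints 4000 (4 ms at HZ=250)
```

Under this model a 69 usec RTT is reported as 10 ms at HZ=100, about 145 times too large, which is consistent with rcv_rtt:10 in the unpatched output and rcv_rtt:0.069 in the patched one.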
  10. 25 Apr, 2017 1 commit
    • net/tcp_fastopen: Disable active side TFO in certain scenarios · cf1ef3f0
      Wei Wang authored
      Middlebox firewall issues can potentially cause the server's data to be
      blackholed after a successful 3WHS using TFO. The following are the related
      reports from Apple:
      https://www.nanog.org/sites/default/files/Paasch_Network_Support.pdf
      Slide 31 identifies an issue where the client ACK to the server's data
      sent during a TFO'd handshake is dropped.
      C ---> syn-data ---> S
      C <--- syn/ack ----- S
      C (accept & write)
      C <---- data ------- S
      C ----- ACK -> X     S
      		[retry and timeout]
      
      https://www.ietf.org/proceedings/94/slides/slides-94-tcpm-13.pdf
      Slide 5 shows a similar situation where the server's data gets dropped
      after 3WHS.
      C ---- syn-data ---> S
      C <--- syn/ack ----- S
      C ---- ack --------> S
      S (accept & write)
      C?  X <- data ------ S
      		[retry and timeout]
      
      This is the worst failure because the client cannot detect such behavior to
      mitigate the situation (such as disabling TFO). Failing to proceed, the
      application (e.g., SSL library) may simply time out and retry with TFO
      again, and the process repeats indefinitely.
      
      The proposed solution is to disable active TFO globally under the
      following circumstances:
      1. client side TFO socket detects out of order FIN
      2. client side TFO socket receives out of order RST
      
      We disable active side TFO globally for 1h at first. If it
      happens again, we disable it for 2h, then 4h, 8h, ...
      We reset the timeout to 1h once a client-side TFO socket not opened
      on loopback has successfully received data segments from the server.
      We examine this condition during close().
      
      The rationale is that when such a firewall issue happens, the
      application running on the client should eventually close the socket as
      it is not able to get the data it is expecting. Or the application running
      on the server should close the socket as it is not able to receive any
      response from the client.
      In both cases, an out-of-order FIN or RST will be received on the client,
      given that the firewall will not block them as those frames carry no
      data.
      We want to disable active TFO globally because it helps if the middlebox
      is very close to the client and most of the connections are likely to
      fail.
      
      Also, add a debug sysctl:
        tcp_fastopen_blackhole_detect_timeout_sec:
          the initial timeout to use when firewall blackhole issue happens.
          This can be set and read.
          Setting it to 0 disables the active disable logic.
      Signed-off-by: Wei Wang <weiwan@google.com>
      Acked-by: Yuchung Cheng <ycheng@google.com>
      Acked-by: Neal Cardwell <ncardwell@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
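The doubling schedule described in this commit (1h, 2h, 4h, 8h, ...) can be sketched as a small function. This is an illustration of the arithmetic only: the function name is hypothetical, and any upper bound the kernel may place on the doubling is not modeled.

```python
def tfo_disable_period_sec(initial_timeout_sec, times_disabled):
    """Seconds active TFO stays disabled after the Nth blackhole event.

    times_disabled counts consecutive detections (1 for the first).
    An initial timeout of 0 turns the mechanism off.
    """
    if initial_timeout_sec == 0 or times_disabled <= 0:
        return 0
    # Double the period for every additional consecutive detection.
    return initial_timeout_sec << (times_disabled - 1)


# Starting from the 1h initial timeout used in the commit message.
print([tfo_disable_period_sec(3600, n) // 3600 for n in range(1, 5)])
# prints [1, 2, 4, 8]
```

Resetting the count to zero after a successful non-loopback TFO data exchange brings the next period back to the initial timeout, matching the reset rule in the message.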
  11. 10 Apr, 2017 1 commit
    • tcp: clear saved_syn in tcp_disconnect() · 17c3060b
      Eric Dumazet authored
      In the (very unlikely) case a passive socket becomes a listener,
      we do not want to duplicate its saved SYN headers.
      
      This would lead to double frees and use-after-free, and would please
      hackers and various fuzzers.
      
      Tested:
          0 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
         +0 setsockopt(3, IPPROTO_TCP, TCP_SAVE_SYN, [1], 4) = 0
         +0 fcntl(3, F_SETFL, O_RDWR|O_NONBLOCK) = 0
      
         +0 bind(3, ..., ...) = 0
         +0 listen(3, 5) = 0
      
         +0 < S 0:0(0) win 32972 <mss 1460,nop,wscale 7>
         +0 > S. 0:0(0) ack 1 <...>
        +.1 < . 1:1(0) ack 1 win 257
         +0 accept(3, ..., ...) = 4
      
         +0 connect(4, AF_UNSPEC, ...) = 0
         +0 close(3) = 0
         +0 bind(4, ..., ...) = 0
         +0 listen(4, 5) = 0
      
         +0 < S 0:0(0) win 32972 <mss 1460,nop,wscale 7>
         +0 > S. 0:0(0) ack 1 <...>
        +.1 < . 1:1(0) ack 1 win 257
      
      Fixes: cd8ae852 ("tcp: provide SYN headers for passive connections")
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  12. 05 Apr, 2017 1 commit
  13. 23 Mar, 2017 1 commit
  14. 17 Mar, 2017 1 commit
  15. 03 Mar, 2017 1 commit
  16. 18 Feb, 2017 1 commit
  17. 07 Feb, 2017 1 commit
  18. 30 Jan, 2017 1 commit
  19. 26 Jan, 2017 2 commits
    • net/tcp-fastopen: make connect()'s return case more consistent with non-TFO · 3979ad7e
      Willy Tarreau authored
      Without TFO, any subsequent connect() call after a successful one returns
      -1 EISCONN. The last API update ensured that __inet_stream_connect() can
      return -1 EINPROGRESS in response to sendmsg() when TFO is in use to
      indicate that the connection is now in progress. Unfortunately since this
      function is used both for connect() and sendmsg(), it has the undesired
      side effect of making connect() now return -1 EINPROGRESS as well after
      a successful call, while at the same time poll() returns POLLOUT. This
      can confuse some applications which happen to call connect() and to
      check for -1 EISCONN to ensure the connection is usable, and for which
      EINPROGRESS indicates a need to poll, causing a loop.
      
      This problem was encountered in haproxy where a call to connect() is
      precisely used in certain cases to confirm a connection's readiness.
      While arguably haproxy's behaviour should be improved here, it seems
      important to aim at a more robust behaviour when the goal of the new
      API is to make it easier to implement TFO in existing applications.
      
      This patch simply ensures that we preserve the same semantics as in
      the non-TFO case on the connect() syscall when using TFO, while still
      returning -1 EINPROGRESS on sendmsg(). For this we simply tell
      __inet_stream_connect() whether we're doing a regular connect() or in
      fact connecting for a sendmsg() call.
      
      Cc: Wei Wang <weiwan@google.com>
      Cc: Yuchung Cheng <ycheng@google.com>
      Cc: Eric Dumazet <edumazet@google.com>
      Signed-off-by: Willy Tarreau <w@1wt.eu>
      Signed-off-by: David S. Miller <davem@davemloft.net>
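The application-side pattern this commit protects, calling connect() again and reading EISCONN as "usable" and EINPROGRESS as "poll and retry", might look like the following sketch. The function name and the three labels are hypothetical, and this is not haproxy's actual code.

```python
import errno


def connect_status(err):
    """Classify a non-blocking connect() result (0 or an errno value)."""
    if err in (0, errno.EISCONN):
        return "ready"    # connection established and usable
    if err in (errno.EINPROGRESS, errno.EALREADY):
        return "pending"  # handshake still running: poll, then retry
    return "failed"       # any other errno is treated as an error
```

In Python the input would typically come from socket.connect_ex(), which returns the errno instead of raising. An application that loops while the status is "pending" never terminates if connect() keeps returning EINPROGRESS on an already usable socket, which is the loop the commit message describes.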
    • net/tcp-fastopen: Add new API support · 19f6d3f3
      Wei Wang authored
      This patch adds a new socket option, TCP_FASTOPEN_CONNECT, as an
      alternative way to perform Fast Open on the active side (client). Prior
      to this patch, a client needs to replace the connect() call with
      sendto(MSG_FASTOPEN). This can be cumbersome for applications that want
      to use Fast Open: these socket operations are often done in lower layer
      libraries used by many other applications. Changing these libraries
      and/or the socket call sequences is not trivial. A more convenient
      approach is to perform Fast Open by simply enabling a socket option when
      the socket is created, without changing the rest of the socket call sequence:
        s = socket()
          create a new socket
        setsockopt(s, IPPROTO_TCP, TCP_FASTOPEN_CONNECT …);
          newly introduced sockopt
          If set, new functionality described below will be used.
          Return ENOTSUPP if TFO is not supported or not enabled in the
          kernel.
      
        connect()
          With cookie present, return 0 immediately.
          With no cookie, initiate 3WHS with TFO cookie-request option and
          return -1 with errno = EINPROGRESS.
      
        write()/sendmsg()
          With cookie present, send out SYN with data and return the number of
          bytes buffered.
          With no cookie, and 3WHS not yet completed, return -1 with errno =
          EINPROGRESS.
          No MSG_FASTOPEN flag is needed.
      
        read()
          Return -1 with errno = EWOULDBLOCK/EAGAIN if connect() is called but
          write() is not called yet.
          Return -1 with errno = EWOULDBLOCK/EAGAIN if connection is
          established but no msg is received yet.
          Return number of bytes read if socket is established and there is
          msg received.
      
      The new API simplifies life for applications that always perform a write()
      immediately after a successful connect(). Such applications can now take
      advantage of Fast Open by merely making one new setsockopt() call at the time
      of creating the socket. Nothing else about the application's socket call
      sequence needs to change.
      Signed-off-by: Wei Wang <weiwan@google.com>
      Acked-by: Eric Dumazet <edumazet@google.com>
      Acked-by: Yuchung Cheng <ycheng@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
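The call sequence described in this commit could be written in Python roughly as follows. This is a sketch under assumptions: the helper names are hypothetical, the fallback value 30 for TCP_FASTOPEN_CONNECT comes from linux/tcp.h, and a blocking socket is used for brevity, so the EINPROGRESS cases listed in the message do not appear here.

```python
import socket

# TCP_FASTOPEN_CONNECT is 30 in linux/tcp.h; fall back to that value if
# this Python build does not export it (assumption for illustration).
TCP_FASTOPEN_CONNECT = getattr(socket, "TCP_FASTOPEN_CONNECT", 30)


def enable_tfo_connect(sock):
    """Try to enable Fast Open on connect(); return True on success."""
    try:
        sock.setsockopt(socket.IPPROTO_TCP, TCP_FASTOPEN_CONNECT, 1)
    except OSError:
        # Kernel too old or TFO unavailable: keep the plain handshake.
        return False
    return True


def tfo_request(host, port, payload):
    """Open a connection and send payload, using TFO when available."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    enable_tfo_connect(sock)
    sock.connect((host, port))  # may return before the handshake finishes
    sock.sendall(payload)       # with a cached cookie, data can ride on the SYN
    return sock
```

The only change relative to an ordinary client is the single enable_tfo_connect() call after socket creation; connect() and the send call stay where they were, which is the point of the new option.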
  20. 21 Jan, 2017 1 commit
  21. 14 Jan, 2017 2 commits
  22. 10 Jan, 2017 1 commit
  23. 06 Jan, 2017 1 commit
    • tcp: provide timestamps for partial writes · ad02c4f5
      Soheil Hassas Yeganeh authored
      For TCP sockets, TX timestamps are only captured when the user data
      is successfully and fully written to the socket. In many cases,
      however, TCP writes can be partial, and then no timestamp is
      collected.
      
      Collect timestamps whenever any user data is (fully or partially)
      copied into the socket. Pass tcp_write_queue_tail to tcp_tx_timestamp
      instead of the local skb pointer since it can be set to NULL on
      the error path.
      
      Note that tcp_write_queue_tail can be NULL, even if bytes have been
      copied to the socket. This is because acknowledgements are being
      processed in tcp_sendmsg(), and by the time tcp_tx_timestamp is
      called tcp_write_queue_tail can be NULL. For such cases, this patch
      does not collect any timestamps (i.e., it is best-effort).
      
      This patch is written with suggestions from Willem de Bruijn and
      Eric Dumazet.
      
      Change-log V1 -> V2:
      	- Use sockc.tsflags instead of sk->sk_tsflags.
      	- Use the same code path for normal writes and errors.
      Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
      Acked-by: Yuchung Cheng <ycheng@google.com>
      Cc: Willem de Bruijn <willemb@google.com>
      Cc: Eric Dumazet <edumazet@google.com>
      Cc: Neal Cardwell <ncardwell@google.com>
      Cc: Martin KaFai Lau <kafai@fb.com>
      Acked-by: Willem de Bruijn <willemb@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  24. 30 Dec, 2016 2 commits
  25. 25 Dec, 2016 1 commit
  26. 06 Dec, 2016 1 commit
  27. 30 Nov, 2016 3 commits
  28. 16 Nov, 2016 1 commit
  29. 10 Nov, 2016 3 commits
  30. 04 Nov, 2016 1 commit
    • tcp: fix return value for partial writes · 79d8665b
      Eric Dumazet authored
      After my commit, tcp_sendmsg() might restart its loop after
      processing socket backlog.
      
      If sk_err is set, we blindly return an error, even though we
      copied data to user space before.
      
      We should instead return number of bytes that could be copied,
      otherwise user space might resend data and corrupt the stream.
      
      This might happen if another thread is using recvmsg(MSG_ERRQUEUE)
      to process timestamps.
      
      Issue was diagnosed by Soheil and Willem, big kudos to them !
      
      Fixes: d41a69f1 ("tcp: make tcp_sendmsg() aware of socket backlog")
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Cc: Willem de Bruijn <willemb@google.com>
      Cc: Soheil Hassas Yeganeh <soheil@google.com>
      Cc: Yuchung Cheng <ycheng@google.com>
      Cc: Neal Cardwell <ncardwell@google.com>
      Tested-by: Soheil Hassas Yeganeh <soheil@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
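Why the byte count matters to user space can be seen in a typical send loop, which resumes from the offset the kernel reports. The sketch below is a generic illustration with a hypothetical helper name, not code from the patch.

```python
def send_all(sock, data):
    """Send every byte of data, resuming after each partial write."""
    view = memoryview(data)
    sent_total = 0
    while sent_total < len(view):
        # send() may accept only part of the buffer; it returns how much.
        sent = sock.send(view[sent_total:])
        if sent == 0:
            raise ConnectionError("socket accepted no data")
        sent_total += sent
    return sent_total
```

If a write that already queued some bytes were reported as an error instead of a short count, a loop like this would treat the whole remaining buffer as unsent and transmit those bytes a second time, corrupting the stream as the commit message warns.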