1. 26 Jan, 2017 2 commits
    • net/tcp-fastopen: make connect()'s return case more consistent with non-TFO · 3979ad7e
      Willy Tarreau committed
      Without TFO, any subsequent connect() call after a successful one returns
      -1 EISCONN. The last API update ensured that __inet_stream_connect() can
      return -1 EINPROGRESS in response to sendmsg() when TFO is in use to
      indicate that the connection is now in progress. Unfortunately since this
      function is used both for connect() and sendmsg(), it has the undesired
      side effect of making connect() now return -1 EINPROGRESS as well after
      a successful call, while at the same time poll() returns POLLOUT. This
      can confuse some applications which happen to call connect() and to
      check for -1 EISCONN to ensure the connection is usable, and for which
      EINPROGRESS indicates a need to poll, causing a loop.
      
      This problem was encountered in haproxy where a call to connect() is
      precisely used in certain cases to confirm a connection's readiness.
      While arguably haproxy's behaviour should be improved here, it seems
      important to aim at a more robust behaviour when the goal of the new
      API is to make it easier to implement TFO in existing applications.
      
      This patch simply ensures that we preserve the same semantics as in
      the non-TFO case on the connect() syscall when using TFO, while still
      returning -1 EINPROGRESS on sendmsg(). For this we simply tell
      __inet_stream_connect() whether we're doing a regular connect() or in
      fact connecting for a sendmsg() call.
      
      Cc: Wei Wang <weiwan@google.com>
      Cc: Yuchung Cheng <ycheng@google.com>
      Cc: Eric Dumazet <edumazet@google.com>
      Signed-off-by: Willy Tarreau <w@1wt.eu>
      Signed-off-by: David S. Miller <davem@davemloft.net>
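
The readiness check described in this commit (calling connect() a second time and treating -1 EISCONN as "usable") can be sketched in user space as follows. This is a minimal illustration, not haproxy's code: probe_connected() and connect_loopback() are hypothetical helper names, and the loopback listener exists only to make the sketch self-contained.

```c
#include <arpa/inet.h>
#include <assert.h>
#include <errno.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/*
 * Hypothetical helper: probe a socket by calling connect() again.
 * Returns 1 if the connection is usable (0 or EISCONN), 0 if the
 * caller should poll (EINPROGRESS/EALREADY), and -1 on any other error.
 */
int probe_connected(int fd, const struct sockaddr *addr, socklen_t len)
{
    assert(addr != NULL);
    if (connect(fd, addr, len) == 0 || errno == EISCONN)
        return 1;
    if (errno == EINPROGRESS || errno == EALREADY)
        return 0;
    return -1;
}

/*
 * Scaffolding for the sketch: listen on an ephemeral loopback port and
 * connect a blocking client to it. Returns the client fd or -1.
 */
int connect_loopback(int *listener, struct sockaddr_in *addr)
{
    socklen_t len = sizeof(*addr);
    int fd;
    int lfd = socket(AF_INET, SOCK_STREAM, 0);

    if (lfd < 0)
        return -1;
    memset(addr, 0, sizeof(*addr));
    addr->sin_family = AF_INET;
    addr->sin_addr.s_addr = htonl(INADDR_LOOPBACK);
    addr->sin_port = 0; /* let the kernel pick a free port */
    if (bind(lfd, (struct sockaddr *)addr, sizeof(*addr)) < 0 ||
        listen(lfd, 1) < 0 ||
        getsockname(lfd, (struct sockaddr *)addr, &len) < 0) {
        close(lfd);
        return -1;
    }
    fd = socket(AF_INET, SOCK_STREAM, 0);
    if (fd < 0 || connect(fd, (struct sockaddr *)addr, sizeof(*addr)) < 0) {
        if (fd >= 0)
            close(fd);
        close(lfd);
        return -1;
    }
    *listener = lfd;
    return fd;
}
```

An application that treats a 0 return from probe_connected() as "go back to poll()" would loop forever if connect() kept answering EINPROGRESS while poll() reported POLLOUT, which is the situation this commit avoids.
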
    • net/tcp-fastopen: Add new API support · 19f6d3f3
      Wei Wang committed
      This patch adds a new socket option, TCP_FASTOPEN_CONNECT, as an
      alternative way to perform Fast Open on the active side (client). Prior
      to this patch, a client needs to replace the connect() call with
      sendto(MSG_FASTOPEN). This can be cumbersome for applications that want
      to use Fast Open: these socket operations are often done in lower layer
      libraries used by many other applications. Changing these libraries
      and/or the socket call sequences is not trivial. A more convenient
      approach is to perform Fast Open by simply enabling a socket option when
      the socket is created, without changing the rest of the socket call sequence:
        s = socket()
          create a new socket
        setsockopt(s, IPPROTO_TCP, TCP_FASTOPEN_CONNECT …);
          newly introduced sockopt
          If set, new functionality described below will be used.
          Return ENOTSUPP if TFO is not supported or not enabled in the
          kernel.
      
        connect()
          With cookie present, return 0 immediately.
          With no cookie, initiate 3WHS with TFO cookie-request option and
          return -1 with errno = EINPROGRESS.
      
        write()/sendmsg()
          With cookie present, send out SYN with data and return the number of
          bytes buffered.
          With no cookie, and 3WHS not yet completed, return -1 with errno =
          EINPROGRESS.
          No MSG_FASTOPEN flag is needed.
      
        read()
          Return -1 with errno = EWOULDBLOCK/EAGAIN if connect() is called but
          write() is not called yet.
          Return -1 with errno = EWOULDBLOCK/EAGAIN if connection is
          established but no msg is received yet.
          Return number of bytes read if socket is established and there is
          msg received.
      
      The new API simplifies life for applications that always perform a write()
      immediately after a successful connect(). Such applications can now take
      advantage of Fast Open by merely making one new setsockopt() call at the time
      of creating the socket. Nothing else about the application's socket call
      sequence needs to change.
      Signed-off-by: Wei Wang <weiwan@google.com>
      Acked-by: Eric Dumazet <edumazet@google.com>
      Acked-by: Yuchung Cheng <ycheng@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
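
The call sequence in this commit might look like this in a client. A minimal sketch, assuming a Linux kernel that carries this patch: tfo_socket() and connect_and_send() are hypothetical helpers, and the fallback #define uses the value 30 from the kernel's uapi linux/tcp.h for libc headers that predate the option. If the setsockopt() call fails, the sketch simply carries on without Fast Open.

```c
#include <assert.h>
#include <errno.h>
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

#ifndef TCP_FASTOPEN_CONNECT
#define TCP_FASTOPEN_CONNECT 30 /* from uapi linux/tcp.h; older libc headers lack it */
#endif

/*
 * Hypothetical helper: create a TCP socket and try to opt in to
 * Fast Open on connect(). *tfo_enabled is set to 1 if the kernel
 * accepted the option and 0 otherwise. Returns the fd, or -1.
 */
int tfo_socket(int *tfo_enabled)
{
    int one = 1;
    int fd = socket(AF_INET, SOCK_STREAM, 0);

    assert(tfo_enabled != NULL);
    if (fd < 0)
        return -1;
    *tfo_enabled = (setsockopt(fd, IPPROTO_TCP, TCP_FASTOPEN_CONNECT,
                               &one, sizeof(one)) == 0);
    return fd;
}

/*
 * The rest of the call sequence is the ordinary one: connect(), then
 * write(). EINPROGRESS from connect() is not treated as fatal here
 * because the commit message lists it as an expected return value.
 */
ssize_t connect_and_send(int fd, const struct sockaddr *addr, socklen_t alen,
                         const void *buf, size_t len)
{
    if (connect(fd, addr, alen) < 0 && errno != EINPROGRESS)
        return -1;
    return write(fd, buf, len);
}
```

Only tfo_socket() differs from a non-TFO client; connect_and_send() is what such an application would already be doing.
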
  2. 21 Jan, 2017 1 commit
  3. 14 Jan, 2017 2 commits
  4. 10 Jan, 2017 1 commit
  5. 06 Jan, 2017 1 commit
    • tcp: provide timestamps for partial writes · ad02c4f5
      Soheil Hassas Yeganeh committed
      For TCP sockets, TX timestamps are only captured when the user data
      is successfully and fully written to the socket. In many cases,
      however, TCP writes can be partial for which no timestamp is
      collected.
      
      Collect timestamps whenever any user data is (fully or partially)
      copied into the socket. Pass tcp_write_queue_tail to tcp_tx_timestamp
      instead of the local skb pointer since it can be set to NULL on
      the error path.
      
      Note that tcp_write_queue_tail can be NULL, even if bytes have been
      copied to the socket. This is because acknowledgements are being
      processed in tcp_sendmsg(), and by the time tcp_tx_timestamp is
      called tcp_write_queue_tail can be NULL. For such cases, this patch
      does not collect any timestamps (i.e., it is best-effort).
      
      This patch is written with suggestions from Willem de Bruijn and
      Eric Dumazet.
      
      Change-log V1 -> V2:
      	- Use sockc.tsflags instead of sk->sk_tsflags.
      	- Use the same code path for normal writes and errors.
      Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
      Acked-by: Yuchung Cheng <ycheng@google.com>
      Cc: Willem de Bruijn <willemb@google.com>
      Cc: Eric Dumazet <edumazet@google.com>
      Cc: Neal Cardwell <ncardwell@google.com>
      Cc: Martin KaFai Lau <kafai@fb.com>
      Acked-by: Willem de Bruijn <willemb@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  6. 30 Dec, 2016 2 commits
  7. 25 Dec, 2016 1 commit
  8. 06 Dec, 2016 1 commit
  9. 30 Nov, 2016 3 commits
  10. 16 Nov, 2016 1 commit
  11. 10 Nov, 2016 3 commits
  12. 04 Nov, 2016 2 commits
    • tcp: fix return value for partial writes · 79d8665b
      Eric Dumazet committed
      After my commit, tcp_sendmsg() might restart its loop after
      processing socket backlog.
      
      If sk_err is set, we blindly return an error, even though we
      copied data from user space before.
      
      We should instead return number of bytes that could be copied,
      otherwise user space might resend data and corrupt the stream.
      
      This might happen if another thread is using recvmsg(MSG_ERRQUEUE)
      to process timestamps.
      
      Issue was diagnosed by Soheil and Willem, big kudos to them!
      
      Fixes: d41a69f1 ("tcp: make tcp_sendmsg() aware of socket backlog")
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Cc: Willem de Bruijn <willemb@google.com>
      Cc: Soheil Hassas Yeganeh <soheil@google.com>
      Cc: Yuchung Cheng <ycheng@google.com>
      Cc: Neal Cardwell <ncardwell@google.com>
      Tested-by: Soheil Hassas Yeganeh <soheil@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
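
Seen from the application, the contract this fix restores is that a short return from send() means those bytes are already queued and must not be sent again. A minimal sketch of a loop that relies on it; send_all() is a hypothetical helper, not something from the kernel tree.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/socket.h>
#include <sys/types.h>

/*
 * Hypothetical helper: send len bytes, resuming after partial writes.
 * *sent_out always reports how many bytes the kernel accepted, so the
 * caller knows where the stream stands even when -1 is returned.
 */
int send_all(int fd, const void *buf, size_t len, size_t *sent_out)
{
    const char *p = buf;
    size_t sent = 0;

    assert(sent_out != NULL);
    while (sent < len) {
        ssize_t n = send(fd, p + sent, len - sent, MSG_NOSIGNAL);

        if (n < 0) {
            if (errno == EINTR)
                continue;      /* interrupted before any byte was taken */
            *sent_out = sent;  /* earlier bytes are queued: never resend */
            return -1;
        }
        sent += (size_t)n;     /* advance past the partial write */
    }
    *sent_out = sent;
    return 0;
}
```

If the kernel returned an error after having accepted part of the buffer, a loop like this would have no way to know that `sent` was stale, and a retry would duplicate data in the stream.
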
    • tcp: fix potential memory corruption · ac9e70b1
      Eric Dumazet committed
      Imagine initial value of max_skb_frags is 17, and last
      skb in write queue has 15 frags.
      
      Then max_skb_frags is lowered to 14 or smaller value.
      
      tcp_sendmsg() will then be allowed to add additional page frags
      and eventually go past MAX_SKB_FRAGS, overflowing struct
      skb_shared_info.
      
      Fixes: 5f74f82e ("net:Add sysctl_max_skb_frags")
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Cc: Hans Westgaard Ry <hans.westgaard.ry@oracle.com>
      Cc: Håkon Bugge <haakon.bugge@oracle.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
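
The scenario in this commit can be replayed with a small user-space model. This is only a sketch under an assumption: the diff is not shown here, so the model assumes the bug comes from comparing the frag count to the sysctl with an equality test, and that the fix is to compare with >= instead. All names below are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

#define MODEL_MAX_SKB_FRAGS 17 /* capacity of the frags[] array in this model */

/* Assumed pre-fix check: only an exact match stops appending. */
static bool must_start_new_skb_old(int nr_frags, int sysctl_max)
{
    return nr_frags == sysctl_max;
}

/* Assumed fixed check: stop as soon as the limit is reached or exceeded. */
static bool must_start_new_skb_new(int nr_frags, int sysctl_max)
{
    return nr_frags >= sysctl_max;
}

/*
 * Try to append up to `attempts` page frags to an skb that already
 * holds nr_frags, and return the resulting frag count.
 */
int simulate_appends(int nr_frags, int sysctl_max, int attempts, bool fixed)
{
    assert(attempts >= 0);
    while (attempts-- > 0) {
        bool stop = fixed ? must_start_new_skb_new(nr_frags, sysctl_max)
                          : must_start_new_skb_old(nr_frags, sysctl_max);
        if (stop)
            break;
        nr_frags++; /* in the kernel this would write frags[nr_frags] */
    }
    return nr_frags;
}
```

With 15 frags already present and the limit lowered to 14, the equality test never fires, so the count walks past MODEL_MAX_SKB_FRAGS; the >= test stops immediately.
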
  13. 08 Oct, 2016 1 commit
  14. 04 Oct, 2016 1 commit
    • skb_splice_bits(): get rid of callback · 25869262
      Al Viro committed
      since pipe_lock is the outermost now, we don't need to drop/regain
      socket locks around the call of splice_to_pipe() from skb_splice_bits(),
      which kills the need to have a socket-specific callback; we can just
      call splice_to_pipe() and be done with that.
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
  15. 21 Sep, 2016 4 commits
  16. 17 Sep, 2016 1 commit
    • tcp: prepare skbs for better sack shifting · 3613b3db
      Eric Dumazet committed
      With large BDP TCP flows and lossy networks, it is very important
      to keep a low number of skbs in the write queue.
      
      RACK and SACK processing can perform a linear scan of it.
      
      We should avoid putting any payload in skb->head, so that SACK
      shifting can be done if needed.
      
      With this patch, we allow packing ~0.5 MB per skb instead of
      the 64KB initially cooked at tcp_sendmsg() time.
      
      This reduces the number of skbs in the write queue by a factor of eight.
      tcp_rack_detect_loss() likes this.
      
      We still allow payload in skb->head for first skb put in the queue,
      to not impact RPC workloads.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Cc: Yuchung Cheng <ycheng@google.com>
      Acked-by: Yuchung Cheng <ycheng@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  17. 09 Sep, 2016 1 commit
    • tcp: use an RB tree for ooo receive queue · 9f5afeae
      Yaogong Wang committed
      Over the years, TCP BDP has increased by several orders of magnitude,
      and some people are considering reaching the 2 Gbytes limit.
      
      Even with current window scale limit of 14, ~1 Gbytes maps to ~740,000
      MSS.
      
      In presence of packet losses (or reorders), TCP stores incoming packets
      into an out of order queue, and number of skbs sitting there waiting for
      the missing packets to be received can be in the 10^5 range.
      
      Most packets are appended to the tail of this queue, and when
      packets can finally be transferred to receive queue, we scan the queue
      from its head.
      
      However, in presence of heavy losses, we might have to find an arbitrary
      point in this queue, involving a linear scan for every incoming packet,
      throwing away cpu caches.
      
      This patch converts it to an RB tree, to get bounded latencies.
      
      Yaogong wrote a preliminary patch about 2 years ago.
      Eric did the rebase, added ofo_last_skb cache, polishing and tests.
      
      Tested with network dropping between 1 and 10 % packets, with good
      success (about 30 % increase of throughput in stress tests)
      
      Next step would be to also use an RB tree for the write queue at sender
      side ;)
      Signed-off-by: Yaogong Wang <wygivan@google.com>
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Cc: Yuchung Cheng <ycheng@google.com>
      Cc: Neal Cardwell <ncardwell@google.com>
      Cc: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
      Acked-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
      Signed-off-by: David S. Miller <davem@davemloft.net>
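
The figures in this commit can be checked with a little integer arithmetic. The sketch below assumes an MSS of 1448 bytes (a 1500-byte MTU with TCP timestamps); the commit message does not state the MSS it used, and both function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Number of full-MSS segments that fit in a window of window_bytes. */
uint64_t segments_in_window(uint64_t window_bytes, uint32_t mss)
{
    assert(mss > 0);
    return window_bytes / mss;
}

/*
 * Smallest number of doublings needed to reach n, i.e. ceil(log2(n)).
 * This approximates the depth of a balanced binary tree holding n nodes.
 */
unsigned int log2_ceil(uint64_t n)
{
    unsigned int bits = 0;
    uint64_t cap = 1;

    while (cap < n) {
        cap <<= 1;
        bits++;
    }
    return bits;
}
```

Under that assumption a 1 Gbyte window holds 741,534 segments, in line with the "~740,000 MSS" quoted in the commit. A linear scan may touch all of them, while a balanced tree of that size is about 20 levels deep (a red-black tree is at most roughly twice that), which is where the bounded latency comes from.
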
  18. 29 Aug, 2016 1 commit
  19. 24 Aug, 2016 1 commit
  20. 20 Aug, 2016 1 commit
  21. 01 Jul, 2016 1 commit
  22. 30 Jun, 2016 1 commit
    • tcp: add an ability to dump and restore window parameters · b1ed4c4f
      Andrey Vagin committed
      We found that sometimes a restored tcp socket doesn't work.
      
      The cause of this bug is incorrect window parameters: in this case
      tcp_acceptable_seq() returns tcp_wnd_end(tp) instead of tp->snd_nxt. The
      other side drops packets with this seq, because seq is less than
      tp->rcv_nxt ( tcp_sequence() ).
      
      Data from a send queue is sent only if there is enough space in a
      window, so when we restore unacked data, we need to expand a window to
      fit this data.
      
      This was in the first version of this patch:
      "tcp: extend window to fit all restored unacked data in a send queue"
      
      Then Alexey recommended restoring the window parameters instead of
      adjusting them according to the data in the send queue. This sounds reasonable.
      
      rcv_wnd has to be restored, because it was reported to another side
      and the offered window is never shrunk.
      One of the reasons why we need to restore snd_wnd was described above.
      
      Cc: Pavel Emelyanov <xemul@parallels.com>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru>
      Cc: James Morris <jmorris@namei.org>
      Cc: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org>
      Cc: Patrick McHardy <kaber@trash.net>
      Signed-off-by: Andrey Vagin <avagin@openvz.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  23. 05 May, 2016 1 commit
  24. 03 May, 2016 3 commits
    • tcp: make tcp_sendmsg() aware of socket backlog · d41a69f1
      Eric Dumazet committed
      Large sendmsg()/write() hold socket lock for the duration of the call,
      unless sk->sk_sndbuf limit is hit. This is bad because incoming packets
      are parked into socket backlog for a long time.
      Critical decisions like fast retransmit might be delayed.
      Receivers have to maintain a big out of order queue with additional cpu
      overhead, and also possible stalls in TX once windows are full.
      
      Bidirectional flows are particularly hurt since the backlog can become
      quite big if the copy from user space triggers IO (page faults)
      
      Some applications learnt to use sendmsg() (or sendmmsg()) with small
      chunks to avoid this issue.
      
      Kernel should know better, right?
      
      Add a generic sk_flush_backlog() helper and use it right
      before a new skb is allocated. Typically we put 64KB of payload
      per skb (unless MSG_EOR is requested) and checking socket backlog
      every 64KB gives good results.
      
      As a matter of fact, tests with TSO/GSO disabled give very nice
      results, as we manage to keep a small write queue and smaller
      perceived rtt.
      
      Note that sk_flush_backlog() maintains socket ownership,
      so is not equivalent to a {release_sock(sk); lock_sock(sk);},
      to ensure implicit atomicity rules that sendmsg() was
      giving to (possibly buggy) applications.
      
      In this simple implementation, I chose to not call tcp_release_cb(),
      but we might consider this later.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Cc: Alexei Starovoitov <ast@fb.com>
      Cc: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
      Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • tcp: do not block bh during prequeue processing · fb3477c0
      Eric Dumazet committed
      AFAIK, nothing in current TCP stack absolutely wants BH
      being disabled once socket is owned by a thread running in
      process context.
      
      As mentioned in my prior patch ("tcp: give prequeue mode some care"),
      processing a batch of packets might take time, better not block BH
      at all.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • tcp: do not assume TCP code is non preemptible · c10d9310
      Eric Dumazet committed
      We want to make the TCP stack preemptible, as draining prequeue
      and backlog queues can take a lot of time.
      
      Many SNMP updates were assuming that BH (and preemption) was disabled.
      
      Need to convert some __NET_INC_STATS() calls to NET_INC_STATS()
      and some __TCP_INC_STATS() to TCP_INC_STATS()
      
      Before using this_cpu_ptr(net->ipv4.tcp_sk) in tcp_v4_send_reset()
      and tcp_v4_send_ack(), we add an explicit preempt disabled section.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  25. 29 Apr, 2016 3 commits
    • tcp: Make use of MSG_EOR in tcp_sendmsg · c134ecb8
      Martin KaFai Lau committed
      This patch adds an eor bit to the TCP_SKB_CB.  When MSG_EOR
      is passed to tcp_sendmsg, the eor bit will be set at the skb
      containing the last byte of the userland's msg.  The eor bit
      will prevent data from appending to that skb in the future.
      
      The change in do_tcp_sendpages is to honor the eor set
      during the previous tcp_sendmsg(MSG_EOR) call.
      
      This patch handles the tcp_sendmsg case.  The followup patches
      will handle other skb coalescing and fragment cases.
      
      One potential use case is to use MSG_EOR with
      SOF_TIMESTAMPING_TX_ACK to get a more accurate
      TCP ack timestamping on application protocol with
      multiple outgoing response messages (e.g. HTTP2).
      
      Packetdrill script for testing:
      ~~~~~~
      +0 `sysctl -q -w net.ipv4.tcp_min_tso_segs=10`
      +0 `sysctl -q -w net.ipv4.tcp_no_metrics_save=1`
      +0 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
      +0 setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
      +0 bind(3, ..., ...) = 0
      +0 listen(3, 1) = 0
      
      0.100 < S 0:0(0) win 32792 <mss 1460,sackOK,nop,nop,nop,wscale 7>
      0.100 > S. 0:0(0) ack 1 <mss 1460,nop,nop,sackOK,nop,wscale 7>
      0.200 < . 1:1(0) ack 1 win 257
      0.200 accept(3, ..., ...) = 4
      +0 setsockopt(4, SOL_TCP, TCP_NODELAY, [1], 4) = 0
      
      0.200 write(4, ..., 14600) = 14600
      0.200 sendto(4, ..., 730, MSG_EOR, ..., ...) = 730
      0.200 sendto(4, ..., 730, MSG_EOR, ..., ...) = 730
      
      0.200 > .  1:7301(7300) ack 1
      0.200 > P. 7301:14601(7300) ack 1
      
      0.300 < . 1:1(0) ack 14601 win 257
      0.300 > P. 14601:15331(730) ack 1
      0.300 > P. 15331:16061(730) ack 1
      
      0.400 < . 1:1(0) ack 16061 win 257
      0.400 close(4) = 0
      0.400 > F. 16061:16061(0) ack 1
      0.400 < F. 1:1(0) ack 16062 win 257
      0.400 > . 16062:16062(0) ack 2
      Signed-off-by: Martin KaFai Lau <kafai@fb.com>
      Cc: Eric Dumazet <edumazet@google.com>
      Cc: Neal Cardwell <ncardwell@google.com>
      Cc: Soheil Hassas Yeganeh <soheil@google.com>
      Cc: Willem de Bruijn <willemb@google.com>
      Cc: Yuchung Cheng <ycheng@google.com>
      Suggested-by: Eric Dumazet <edumazet@google.com>
      Acked-by: Eric Dumazet <edumazet@google.com>
      Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
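
On the application side, the use case in this commit amounts to passing MSG_EOR on the send that ends each response. A minimal sketch, mirroring the two sendto(..., MSG_EOR, ...) calls in the packetdrill script; send_record() and connect_loopback() are hypothetical helpers, and the loopback listener is only scaffolding so the sketch is self-contained.

```c
#include <arpa/inet.h>
#include <assert.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: send one application message and mark its last
 * byte with MSG_EOR, so later data is not appended to the same skb.
 */
ssize_t send_record(int fd, const void *msg, size_t len)
{
    assert(msg != NULL || len == 0);
    return send(fd, msg, len, MSG_EOR | MSG_NOSIGNAL);
}

/* Scaffolding: connect a blocking TCP client to a loopback listener. */
int connect_loopback(int *listener)
{
    struct sockaddr_in addr;
    socklen_t len = sizeof(addr);
    int fd;
    int lfd = socket(AF_INET, SOCK_STREAM, 0);

    if (lfd < 0)
        return -1;
    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
    addr.sin_port = 0; /* let the kernel pick a free port */
    if (bind(lfd, (struct sockaddr *)&addr, sizeof(addr)) < 0 ||
        listen(lfd, 1) < 0 ||
        getsockname(lfd, (struct sockaddr *)&addr, &len) < 0) {
        close(lfd);
        return -1;
    }
    fd = socket(AF_INET, SOCK_STREAM, 0);
    if (fd < 0 || connect(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0) {
        if (fd >= 0)
            close(fd);
        close(lfd);
        return -1;
    }
    *listener = lfd;
    return fd;
}
```

The return value of send_record() is the same as for a plain send(); the effect of the flag is only visible in how the kernel lays out skbs, and therefore in which ACK a per-message timestamp is tied to.
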
    • tcp: remove SKBTX_ACK_TSTAMP since it is redundant · 0a2cf20c
      Soheil Hassas Yeganeh committed
      The SKBTX_ACK_TSTAMP flag is set in skb_shinfo->tx_flags when
      the timestamp of the TCP acknowledgement should be reported on
      error queue. Since accessing skb_shinfo is likely to incur a
      cache-line miss at the time of receiving the ack, the
      txstamp_ack bit was added in tcp_skb_cb, which is set iff
      the SKBTX_ACK_TSTAMP flag is set for an skb. This makes
      SKBTX_ACK_TSTAMP flag redundant.
      
      Remove the SKBTX_ACK_TSTAMP and instead use the txstamp_ack bit
      everywhere.
      
      Note that this frees one bit in shinfo->tx_flags.
      Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
      Acked-by: Martin KaFai Lau <kafai@fb.com>
      Suggested-by: Willem de Bruijn <willemb@google.com>
      Acked-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • tcp: remove an unnecessary check in tcp_tx_timestamp · 863c1fd9
      Soheil Hassas Yeganeh committed
      Remove the redundant check for sk->sk_tsflags in tcp_tx_timestamp.
      
      tcp_tx_timestamp() receives the tsflags as a parameter. As a
      result the "sk->sk_tsflags || tsflags" is redundant, since
      tsflags already includes sk->sk_tsflags plus overrides from
      control messages.
      Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
      Acked-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>