1. 29 4月, 2016 6 次提交
    • E
      tcp: give prequeue mode some care · 0cef6a4c
      Eric Dumazet 提交于
      TCP prequeue goal is to defer processing of incoming packets
      to user space thread currently blocked in a recvmsg() system call.
      
      Intent is to spend less time processing these packets on behalf
      of softirq handler, as softirq handler is unfair to normal process
      scheduler decisions, as it might interrupt threads that do not
      even use networking.
      
      Current prequeue implementation has following issues :
      
      1) It only checks size of the prequeue against sk_rcvbuf
      
         It was fine 15 years ago when sk_rcvbuf was in the 64KB vicinity.
         But we now have ~8MB values to cope with modern networking needs.
         We have to add sk_rmem_alloc in the equation, since out of order
         packets can definitely use up to sk_rcvbuf memory themselves.
      
      2) Even with a fixed memory truesize check, prequeue can be filled
         by thousands of packets. When prequeue needs to be flushed, either
         from sofirq context (in tcp_prequeue() or timer code), or process
         context (in tcp_prequeue_process()), this adds a latency spike
         which is often not desirable.
         I added a fixed limit of 32 packets, as this translated to a max
         flush time of 60 us on my test hosts.
      
         Also note that all packets in prequeue are not accounted for tcp_mem,
         since they are not charged against sk_forward_alloc at this point.
         This is probably not a big deal.
      
      Note that this might increase LINUX_MIB_TCPPREQUEUEDROPPED counts,
      which is misnamed, as packets are not dropped at all, but rather pushed
      to the stack (where they can be either consumed or dropped)
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      0cef6a4c
    • M
      tcp: Handle eor bit when fragmenting a skb · a166140e
      Martin KaFai Lau 提交于
      When fragmenting a skb, the next_skb should carry
      the eor from prev_skb.  The eor of prev_skb should
      also be reset.
      
      Packetdrill script for testing:
      ~~~~~~
      +0 `sysctl -q -w net.ipv4.tcp_min_tso_segs=10`
      +0 `sysctl -q -w net.ipv4.tcp_no_metrics_save=1`
      +0 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
      +0 setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
      +0 bind(3, ..., ...) = 0
      +0 listen(3, 1) = 0
      
      0.100 < S 0:0(0) win 32792 <mss 1460,sackOK,nop,nop,nop,wscale 7>
      0.100 > S. 0:0(0) ack 1 <mss 1460,nop,nop,sackOK,nop,wscale 7>
      0.200 < . 1:1(0) ack 1 win 257
      0.200 accept(3, ..., ...) = 4
      +0 setsockopt(4, SOL_TCP, TCP_NODELAY, [1], 4) = 0
      
      0.200 sendto(4, ..., 15330, MSG_EOR, ..., ...) = 15330
      0.200 sendto(4, ..., 730, 0, ..., ...) = 730
      
      0.200 > .  1:7301(7300) ack 1
      0.200 > . 7301:14601(7300) ack 1
      
      0.300 < . 1:1(0) ack 14601 win 257
      0.300 > P. 14601:15331(730) ack 1
      0.300 > P. 15331:16061(730) ack 1
      
      0.400 < . 1:1(0) ack 16061 win 257
      0.400 close(4) = 0
      0.400 > F. 16061:16061(0) ack 1
      0.400 < F. 1:1(0) ack 16062 win 257
      0.400 > . 16062:16062(0) ack 2
      Signed-off-by: NMartin KaFai Lau <kafai@fb.com>
      Cc: Eric Dumazet <edumazet@google.com>
      Cc: Neal Cardwell <ncardwell@google.com>
      Cc: Soheil Hassas Yeganeh <soheil@google.com>
      Cc: Willem de Bruijn <willemb@google.com>
      Cc: Yuchung Cheng <ycheng@google.com>
      Acked-by: NEric Dumazet <edumazet@google.com>
      Acked-by: NSoheil Hassas Yeganeh <soheil@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      a166140e
    • M
      tcp: Handle eor bit when coalescing skb · a643b5d4
      Martin KaFai Lau 提交于
      This patch:
      1. Prevent next_skb from coalescing to the prev_skb if
         TCP_SKB_CB(prev_skb)->eor is set
      2. Update the TCP_SKB_CB(prev_skb)->eor if coalescing is
         allowed
      
      Packetdrill script for testing:
      ~~~~~~
      +0 `sysctl -q -w net.ipv4.tcp_min_tso_segs=10`
      +0 `sysctl -q -w net.ipv4.tcp_no_metrics_save=1`
      +0 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
      +0 setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
      +0 bind(3, ..., ...) = 0
      +0 listen(3, 1) = 0
      
      0.100 < S 0:0(0) win 32792 <mss 1460,sackOK,nop,nop,nop,wscale 7>
      0.100 > S. 0:0(0) ack 1 <mss 1460,nop,nop,sackOK,nop,wscale 7>
      0.200 < . 1:1(0) ack 1 win 257
      0.200 accept(3, ..., ...) = 4
      +0 setsockopt(4, SOL_TCP, TCP_NODELAY, [1], 4) = 0
      
      0.200 sendto(4, ..., 730, MSG_EOR, ..., ...) = 730
      0.200 sendto(4, ..., 730, MSG_EOR, ..., ...) = 730
      0.200 write(4, ..., 11680) = 11680
      
      0.200 > P. 1:731(730) ack 1
      0.200 > P. 731:1461(730) ack 1
      0.200 > . 1461:8761(7300) ack 1
      0.200 > P. 8761:13141(4380) ack 1
      
      0.300 < . 1:1(0) ack 1 win 257 <sack 1461:13141,nop,nop>
      0.300 > P. 1:731(730) ack 1
      0.300 > P. 731:1461(730) ack 1
      0.400 < . 1:1(0) ack 13141 win 257
      
      0.400 close(4) = 0
      0.400 > F. 13141:13141(0) ack 1
      0.500 < F. 1:1(0) ack 13142 win 257
      0.500 > . 13142:13142(0) ack 2
      Signed-off-by: NMartin KaFai Lau <kafai@fb.com>
      Cc: Eric Dumazet <edumazet@google.com>
      Cc: Neal Cardwell <ncardwell@google.com>
      Cc: Soheil Hassas Yeganeh <soheil@google.com>
      Cc: Willem de Bruijn <willemb@google.com>
      Cc: Yuchung Cheng <ycheng@google.com>
      Acked-by: NEric Dumazet <edumazet@google.com>
      Acked-by: NSoheil Hassas Yeganeh <soheil@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      a643b5d4
    • M
      tcp: Make use of MSG_EOR in tcp_sendmsg · c134ecb8
      Martin KaFai Lau 提交于
      This patch adds an eor bit to the TCP_SKB_CB.  When MSG_EOR
      is passed to tcp_sendmsg, the eor bit will be set at the skb
      containing the last byte of the userland's msg.  The eor bit
      will prevent data from appending to that skb in the future.
      
      The change in do_tcp_sendpages is to honor the eor set
      during the previous tcp_sendmsg(MSG_EOR) call.
      
      This patch handles the tcp_sendmsg case.  The followup patches
      will handle other skb coalescing and fragment cases.
      
      One potential use case is to use MSG_EOR with
      SOF_TIMESTAMPING_TX_ACK to get a more accurate
      TCP ack timestamping on application protocol with
      multiple outgoing response messages (e.g. HTTP2).
      
      Packetdrill script for testing:
      ~~~~~~
      +0 `sysctl -q -w net.ipv4.tcp_min_tso_segs=10`
      +0 `sysctl -q -w net.ipv4.tcp_no_metrics_save=1`
      +0 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
      +0 setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
      +0 bind(3, ..., ...) = 0
      +0 listen(3, 1) = 0
      
      0.100 < S 0:0(0) win 32792 <mss 1460,sackOK,nop,nop,nop,wscale 7>
      0.100 > S. 0:0(0) ack 1 <mss 1460,nop,nop,sackOK,nop,wscale 7>
      0.200 < . 1:1(0) ack 1 win 257
      0.200 accept(3, ..., ...) = 4
      +0 setsockopt(4, SOL_TCP, TCP_NODELAY, [1], 4) = 0
      
      0.200 write(4, ..., 14600) = 14600
      0.200 sendto(4, ..., 730, MSG_EOR, ..., ...) = 730
      0.200 sendto(4, ..., 730, MSG_EOR, ..., ...) = 730
      
      0.200 > .  1:7301(7300) ack 1
      0.200 > P. 7301:14601(7300) ack 1
      
      0.300 < . 1:1(0) ack 14601 win 257
      0.300 > P. 14601:15331(730) ack 1
      0.300 > P. 15331:16061(730) ack 1
      
      0.400 < . 1:1(0) ack 16061 win 257
      0.400 close(4) = 0
      0.400 > F. 16061:16061(0) ack 1
      0.400 < F. 1:1(0) ack 16062 win 257
      0.400 > . 16062:16062(0) ack 2
      Signed-off-by: NMartin KaFai Lau <kafai@fb.com>
      Cc: Eric Dumazet <edumazet@google.com>
      Cc: Neal Cardwell <ncardwell@google.com>
      Cc: Soheil Hassas Yeganeh <soheil@google.com>
      Cc: Willem de Bruijn <willemb@google.com>
      Cc: Yuchung Cheng <ycheng@google.com>
      Suggested-by: NEric Dumazet <edumazet@google.com>
      Acked-by: NEric Dumazet <edumazet@google.com>
      Acked-by: NSoheil Hassas Yeganeh <soheil@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      c134ecb8
    • S
      tcp: remove SKBTX_ACK_TSTAMP since it is redundant · 0a2cf20c
      Soheil Hassas Yeganeh 提交于
      The SKBTX_ACK_TSTAMP flag is set in skb_shinfo->tx_flags when
      the timestamp of the TCP acknowledgement should be reported on
      error queue. Since accessing skb_shinfo is likely to incur a
      cache-line miss at the time of receiving the ack, the
      txstamp_ack bit was added in tcp_skb_cb, which is set iff
      the SKBTX_ACK_TSTAMP flag is set for an skb. This makes
      SKBTX_ACK_TSTAMP flag redundant.
      
      Remove the SKBTX_ACK_TSTAMP and instead use the txstamp_ack bit
      everywhere.
      
      Note that this frees one bit in shinfo->tx_flags.
      Signed-off-by: NSoheil Hassas Yeganeh <soheil@google.com>
      Acked-by: NMartin KaFai Lau <kafai@fb.com>
      Suggested-by: NWillem de Bruijn <willemb@google.com>
      Acked-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      0a2cf20c
    • S
      tcp: remove an unnecessary check in tcp_tx_timestamp · 863c1fd9
      Soheil Hassas Yeganeh 提交于
      Remove the redundant check for sk->sk_tsflags in tcp_tx_timestamp.
      
      tcp_tx_timestamp() receives the tsflags as a parameter. As a
      result the "sk->sk_tsflags || tsflags" is redundant, since
      tsflags already includes sk->sk_tsflags plus overrides from
      control messages.
      Signed-off-by: NSoheil Hassas Yeganeh <soheil@google.com>
      Acked-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      863c1fd9
  2. 28 4月, 2016 9 次提交
  3. 27 4月, 2016 1 次提交
  4. 26 4月, 2016 3 次提交
  5. 25 4月, 2016 4 次提交
    • P
      ipv4/fib: don't warn when primary address is missing if in_dev is dead · 391a2033
      Paolo Abeni 提交于
      After commit fbd40ea0 ("ipv4: Don't do expensive useless work
      during inetdev destroy.") when deleting an interface,
      fib_del_ifaddr() can be executed without any primary address
      present on the dead interface.
      
      The above is safe, but triggers some "bug: prim == NULL" warnings.
      
      This commit avoids warning if the in_dev is dead
      Signed-off-by: NPaolo Abeni <pabeni@redhat.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      391a2033
    • E
      tcp-tso: do not split TSO packets at retransmit time · 10d3be56
      Eric Dumazet 提交于
      Linux TCP stack painfully segments all TSO/GSO packets before retransmits.
      
      This was fine back in the days when TSO/GSO were emerging, with their
      bugs, but we believe the dark age is over.
      
      Keeping big packets in write queues, but also in stack traversal
      has a lot of benefits.
       - Less memory overhead, because write queues have less skbs
       - Less cpu overhead at ACK processing.
       - Better SACK processing, as lot of studies mentioned how
         awful linux was at this ;)
       - Less cpu overhead to send the rtx packets
         (IP stack traversal, netfilter traversal, drivers...)
       - Better latencies in presence of losses.
       - Smaller spikes in fq like packet schedulers, as retransmits
         are not constrained by TCP Small Queues.
      
      1 % packet losses are common today, and at 100Gbit speeds, this
      translates to ~80,000 losses per second.
      Losses are often correlated, and we see many retransmit events
      leading to 1-MSS train of packets, at the time hosts are already
      under stress.
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Acked-by: NYuchung Cheng <ycheng@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      10d3be56
    • M
      tcp: Merge txstamp_ack in tcp_skb_collapse_tstamp · 2de8023e
      Martin KaFai Lau 提交于
      When collapsing skbs, txstamp_ack also needs to be merged.
      
      Retrans Collapse Test:
      ~~~~~~
      0.200 accept(3, ..., ...) = 4
      +0 setsockopt(4, SOL_TCP, TCP_NODELAY, [1], 4) = 0
      
      0.200 write(4, ..., 730) = 730
      +0 setsockopt(4, SOL_SOCKET, 37, [2688], 4) = 0
      0.200 write(4, ..., 730) = 730
      +0 setsockopt(4, SOL_SOCKET, 37, [2176], 4) = 0
      0.200 write(4, ..., 11680) = 11680
      
      0.200 > P. 1:731(730) ack 1
      0.200 > P. 731:1461(730) ack 1
      0.200 > . 1461:8761(7300) ack 1
      0.200 > P. 8761:13141(4380) ack 1
      
      0.300 < . 1:1(0) ack 1 win 257 <sack 1461:2921,nop,nop>
      0.300 < . 1:1(0) ack 1 win 257 <sack 1461:4381,nop,nop>
      0.300 < . 1:1(0) ack 1 win 257 <sack 1461:5841,nop,nop>
      0.300 > P. 1:1461(1460) ack 1
      0.400 < . 1:1(0) ack 13141 win 257
      
      BPF Output Before:
      ~~~~~
      <No output due to missing SCM_TSTAMP_ACK timestamp>
      
      BPF Output After:
      ~~~~~
      <...>-2027  [007] d.s.    79.765921: : ee_data:1459
      
      Sacks Collapse Test:
      ~~~~~
      0.200 accept(3, ..., ...) = 4
      +0 setsockopt(4, SOL_TCP, TCP_NODELAY, [1], 4) = 0
      
      0.200 write(4, ..., 1460) = 1460
      +0 setsockopt(4, SOL_SOCKET, 37, [2688], 4) = 0
      0.200 write(4, ..., 13140) = 13140
      +0 setsockopt(4, SOL_SOCKET, 37, [2176], 4) = 0
      
      0.200 > P. 1:1461(1460) ack 1
      0.200 > . 1461:8761(7300) ack 1
      0.200 > P. 8761:14601(5840) ack 1
      
      0.300 < . 1:1(0) ack 1 win 257 <sack 1461:14601,nop,nop>
      0.300 > P. 1:1461(1460) ack 1
      0.400 < . 1:1(0) ack 14601 win 257
      
      BPF Output Before:
      ~~~~~
      <No output due to missing SCM_TSTAMP_ACK timestamp>
      
      BPF Output After:
      ~~~~~
      <...>-2049  [007] d.s.    89.185538: : ee_data:14599
      Signed-off-by: NMartin KaFai Lau <kafai@fb.com>
      Cc: Eric Dumazet <edumazet@google.com>
      Cc: Neal Cardwell <ncardwell@google.com>
      Cc: Soheil Hassas Yeganeh <soheil@google.com>
      Cc: Willem de Bruijn <willemb@google.com>
      Cc: Yuchung Cheng <ycheng@google.com>
      Acked-by: NSoheil Hassas Yeganeh <soheil@google.com>
      Tested-by: NSoheil Hassas Yeganeh <soheil@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      2de8023e
    • M
      tcp: Carry txstamp_ack in tcp_fragment_tstamp · b51e13fa
      Martin KaFai Lau 提交于
      When a tcp skb is sliced into two smaller skbs (e.g. in
      tcp_fragment() and tso_fragment()),  it does not carry
      the txstamp_ack bit to the newly created skb if it is needed.
      The end result is a timestamping event (SCM_TSTAMP_ACK) will
      be missing from the sk->sk_error_queue.
      
      This patch carries this bit to the new skb2
      in tcp_fragment_tstamp().
      
      BPF Output Before:
      ~~~~~~
      <No output due to missing SCM_TSTAMP_ACK timestamp>
      
      BPF Output After:
      ~~~~~~
      <...>-2050  [000] d.s.   100.928763: : ee_data:14599
      
      Packetdrill Script:
      ~~~~~~
      +0 `sysctl -q -w net.ipv4.tcp_min_tso_segs=10`
      +0 `sysctl -q -w net.ipv4.tcp_no_metrics_save=1`
      +0 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
      +0 setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
      +0 bind(3, ..., ...) = 0
      +0 listen(3, 1) = 0
      
      0.100 < S 0:0(0) win 32792 <mss 1460,sackOK,nop,nop,nop,wscale 7>
      0.100 > S. 0:0(0) ack 1 <mss 1460,nop,nop,sackOK,nop,wscale 7>
      0.200 < . 1:1(0) ack 1 win 257
      0.200 accept(3, ..., ...) = 4
      +0 setsockopt(4, SOL_TCP, TCP_NODELAY, [1], 4) = 0
      
      +0 setsockopt(4, SOL_SOCKET, 37, [2688], 4) = 0
      0.200 write(4, ..., 14600) = 14600
      +0 setsockopt(4, SOL_SOCKET, 37, [2176], 4) = 0
      
      0.200 > . 1:7301(7300) ack 1
      0.200 > P. 7301:14601(7300) ack 1
      
      0.300 < . 1:1(0) ack 14601 win 257
      
      0.300 close(4) = 0
      0.300 > F. 14601:14601(0) ack 1
      0.400 < F. 1:1(0) ack 16062 win 257
      0.400 > . 14602:14602(0) ack 2
      Signed-off-by: NMartin KaFai Lau <kafai@fb.com>
      Cc: Eric Dumazet <edumazet@google.com>
      Cc: Neal Cardwell <ncardwell@google.com>
      Cc: Soheil Hassas Yeganeh <soheil@google.com>
      Cc: Willem de Bruijn <willemb@google.com>
      Cc: Yuchung Cheng <ycheng@google.com>
      Acked-by: NSoheil Hassas Yeganeh <soheil@google.com>
      Tested-by: NSoheil Hassas Yeganeh <soheil@google.com>
      Acked-by: NWillem de Bruijn <willemb@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      b51e13fa
  6. 24 4月, 2016 2 次提交
  7. 22 4月, 2016 5 次提交
    • M
      tcp: Merge tx_flags and tskey in tcp_shifted_skb · cfea5a68
      Martin KaFai Lau 提交于
      After receiving sacks, tcp_shifted_skb() will collapse
      skbs if possible.  tx_flags and tskey also have to be
      merged.
      
      This patch reuses the tcp_skb_collapse_tstamp() to handle
      them.
      
      BPF Output Before:
      ~~~~~
      <no-output-due-to-missing-tstamp-event>
      
      BPF Output After:
      ~~~~~
      <...>-2024  [007] d.s.    88.644374: : ee_data:14599
      
      Packetdrill Script:
      ~~~~~
      +0 `sysctl -q -w net.ipv4.tcp_min_tso_segs=10`
      +0 `sysctl -q -w net.ipv4.tcp_no_metrics_save=1`
      +0 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
      +0 setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
      +0 bind(3, ..., ...) = 0
      +0 listen(3, 1) = 0
      
      0.100 < S 0:0(0) win 32792 <mss 1460,sackOK,nop,nop,nop,wscale 7>
      0.100 > S. 0:0(0) ack 1 <mss 1460,nop,nop,sackOK,nop,wscale 7>
      0.200 < . 1:1(0) ack 1 win 257
      0.200 accept(3, ..., ...) = 4
      +0 setsockopt(4, SOL_TCP, TCP_NODELAY, [1], 4) = 0
      
      0.200 write(4, ..., 1460) = 1460
      +0 setsockopt(4, SOL_SOCKET, 37, [2688], 4) = 0
      0.200 write(4, ..., 13140) = 13140
      
      0.200 > P. 1:1461(1460) ack 1
      0.200 > . 1461:8761(7300) ack 1
      0.200 > P. 8761:14601(5840) ack 1
      
      0.300 < . 1:1(0) ack 1 win 257 <sack 1461:14601,nop,nop>
      0.300 > P. 1:1461(1460) ack 1
      0.400 < . 1:1(0) ack 14601 win 257
      
      0.400 close(4) = 0
      0.400 > F. 14601:14601(0) ack 1
      0.500 < F. 1:1(0) ack 14602 win 257
      0.500 > . 14602:14602(0) ack 2
      Signed-off-by: NMartin KaFai Lau <kafai@fb.com>
      Cc: Eric Dumazet <edumazet@google.com>
      Cc: Neal Cardwell <ncardwell@google.com>
      Cc: Soheil Hassas Yeganeh <soheil@google.com>
      Cc: Willem de Bruijn <willemb@google.com>
      Cc: Yuchung Cheng <ycheng@google.com>
      Acked-by: NSoheil Hassas Yeganeh <soheil@google.com>
      Tested-by: NSoheil Hassas Yeganeh <soheil@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      cfea5a68
    • M
      tcp: Merge tx_flags and tskey in tcp_collapse_retrans · 082ac2d5
      Martin KaFai Lau 提交于
      If two skbs are merged/collapsed during retransmission, the current
      logic does not merge the tx_flags and tskey.  The end result is
      the SCM_TSTAMP_ACK timestamp could be missing for a packet.
      
      The patch:
      1. Merge the tx_flags
      2. Overwrite the prev_skb's tskey with the next_skb's tskey
      
      BPF Output Before:
      ~~~~~~
      <no-output-due-to-missing-tstamp-event>
      
      BPF Output After:
      ~~~~~~
      packetdrill-2092  [001] d.s.   453.998486: : ee_data:1459
      
      Packetdrill Script:
      ~~~~~~
      +0 `sysctl -q -w net.ipv4.tcp_min_tso_segs=10`
      +0 `sysctl -q -w net.ipv4.tcp_no_metrics_save=1`
      +0 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
      +0 setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
      +0 bind(3, ..., ...) = 0
      +0 listen(3, 1) = 0
      
      0.100 < S 0:0(0) win 32792 <mss 1460,sackOK,nop,nop,nop,wscale 7>
      0.100 > S. 0:0(0) ack 1 <mss 1460,nop,nop,sackOK,nop,wscale 7>
      0.200 < . 1:1(0) ack 1 win 257
      0.200 accept(3, ..., ...) = 4
      +0 setsockopt(4, SOL_TCP, TCP_NODELAY, [1], 4) = 0
      
      0.200 write(4, ..., 730) = 730
      +0 setsockopt(4, SOL_SOCKET, 37, [2688], 4) = 0
      0.200 write(4, ..., 730) = 730
      +0 setsockopt(4, SOL_SOCKET, 37, [2176], 4) = 0
      0.200 write(4, ..., 11680) = 11680
      +0 setsockopt(4, SOL_SOCKET, 37, [2688], 4) = 0
      
      0.200 > P. 1:731(730) ack 1
      0.200 > P. 731:1461(730) ack 1
      0.200 > . 1461:8761(7300) ack 1
      0.200 > P. 8761:13141(4380) ack 1
      
      0.300 < . 1:1(0) ack 1 win 257 <sack 1461:2921,nop,nop>
      0.300 < . 1:1(0) ack 1 win 257 <sack 1461:4381,nop,nop>
      0.300 < . 1:1(0) ack 1 win 257 <sack 1461:5841,nop,nop>
      0.300 > P. 1:1461(1460) ack 1
      0.400 < . 1:1(0) ack 13141 win 257
      
      0.400 close(4) = 0
      0.400 > F. 13141:13141(0) ack 1
      0.500 < F. 1:1(0) ack 13142 win 257
      0.500 > . 13142:13142(0) ack 2
      Signed-off-by: NMartin KaFai Lau <kafai@fb.com>
      Cc: Eric Dumazet <edumazet@google.com>
      Cc: Neal Cardwell <ncardwell@google.com>
      Cc: Soheil Hassas Yeganeh <soheil@google.com>
      Cc: Willem de Bruijn <willemb@google.com>
      Cc: Yuchung Cheng <ycheng@google.com>
      Acked-by: NSoheil Hassas Yeganeh <soheil@google.com>
      Tested-by: NSoheil Hassas Yeganeh <soheil@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      082ac2d5
    • N
      a9a08042
    • X
      net: use jiffies_to_msecs to replace EXPIRES_IN_MS in inet/sctp_diag · b7de529c
      Xin Long 提交于
      EXPIRES_IN_MS macro comes from net/ipv4/inet_diag.c and dates
      back to before jiffies_to_msecs() has been introduced.
      
      Now we can remove it and use jiffies_to_msecs().
      Suggested-by: NJakub Sitnicki <jkbs@redhat.com>
      Signed-off-by: NXin Long <lucien.xin@gmail.com>
      Acked-by: NJakub Sitnicki <jkbs@redhat.com>
      Acked-by: NMarcelo Ricardo Leitner <marcelo.leitner@gmail.com>
      Acked-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      b7de529c
    • M
      tcp: Fix SOF_TIMESTAMPING_TX_ACK when handling dup acks · 479f85c3
      Martin KaFai Lau 提交于
      Assuming SOF_TIMESTAMPING_TX_ACK is on. When dup acks are received,
      it could incorrectly think that a skb has already
      been acked and queue a SCM_TSTAMP_ACK cmsg to the
      sk->sk_error_queue.
      
      In tcp_ack_tstamp(), it checks
      'between(shinfo->tskey, prior_snd_una, tcp_sk(sk)->snd_una - 1)'.
      If prior_snd_una == tcp_sk(sk)->snd_una like the following packetdrill
      script, between() returns true but the tskey is actually not acked.
      e.g. try between(3, 2, 1).
      
      The fix is to replace between() with one before() and one !before().
      By doing this, the -1 offset on the tcp_sk(sk)->snd_una can also be
      removed.
      
      A packetdrill script is used to reproduce the dup ack scenario.
      Due to the lacking cmsg support in packetdrill (may be I
      cannot find it),  a BPF prog is used to kprobe to
      sock_queue_err_skb() and print out the value of
      serr->ee.ee_data.
      
      Both the packetdrill and the bcc BPF script is attached at the end of
      this commit message.
      
      BPF Output Before Fix:
      ~~~~~~
            <...>-2056  [001] d.s.   433.927987: : ee_data:1459  #incorrect
      packetdrill-2056  [001] d.s.   433.929563: : ee_data:1459  #incorrect
      packetdrill-2056  [001] d.s.   433.930765: : ee_data:1459  #incorrect
      packetdrill-2056  [001] d.s.   434.028177: : ee_data:1459
      packetdrill-2056  [001] d.s.   434.029686: : ee_data:14599
      
      BPF Output After Fix:
      ~~~~~~
            <...>-2049  [000] d.s.   113.517039: : ee_data:1459
            <...>-2049  [000] d.s.   113.517253: : ee_data:14599
      
      BCC BPF Script:
      ~~~~~~
      #!/usr/bin/env python
      
      from __future__ import print_function
      from bcc import BPF
      
      bpf_text = """
      #include <uapi/linux/ptrace.h>
      #include <net/sock.h>
      #include <bcc/proto.h>
      #include <linux/errqueue.h>
      
      #ifdef memset
      #undef memset
      #endif
      
      int trace_err_skb(struct pt_regs *ctx)
      {
      	struct sk_buff *skb = (struct sk_buff *)ctx->si;
      	struct sock *sk = (struct sock *)ctx->di;
      	struct sock_exterr_skb *serr;
      	u32 ee_data = 0;
      
      	if (!sk || !skb)
      		return 0;
      
      	serr = SKB_EXT_ERR(skb);
      	bpf_probe_read(&ee_data, sizeof(ee_data), &serr->ee.ee_data);
      	bpf_trace_printk("ee_data:%u\\n", ee_data);
      
      	return 0;
      };
      """
      
      b = BPF(text=bpf_text)
      b.attach_kprobe(event="sock_queue_err_skb", fn_name="trace_err_skb")
      print("Attached to kprobe")
      b.trace_print()
      
      Packetdrill Script:
      ~~~~~~
      +0 `sysctl -q -w net.ipv4.tcp_min_tso_segs=10`
      +0 `sysctl -q -w net.ipv4.tcp_no_metrics_save=1`
      +0 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
      +0 setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
      +0 bind(3, ..., ...) = 0
      +0 listen(3, 1) = 0
      
      0.100 < S 0:0(0) win 32792 <mss 1460,sackOK,nop,nop,nop,wscale 7>
      0.100 > S. 0:0(0) ack 1 <mss 1460,nop,nop,sackOK,nop,wscale 7>
      0.200 < . 1:1(0) ack 1 win 257
      0.200 accept(3, ..., ...) = 4
      +0 setsockopt(4, SOL_TCP, TCP_NODELAY, [1], 4) = 0
      
      +0 setsockopt(4, SOL_SOCKET, 37, [2688], 4) = 0
      0.200 write(4, ..., 1460) = 1460
      0.200 write(4, ..., 13140) = 13140
      
      0.200 > P. 1:1461(1460) ack 1
      0.200 > . 1461:8761(7300) ack 1
      0.200 > P. 8761:14601(5840) ack 1
      
      0.300 < . 1:1(0) ack 1 win 257 <sack 1461:2921,nop,nop>
      0.300 < . 1:1(0) ack 1 win 257 <sack 1461:4381,nop,nop>
      0.300 < . 1:1(0) ack 1 win 257 <sack 1461:5841,nop,nop>
      0.300 > P. 1:1461(1460) ack 1
      0.400 < . 1:1(0) ack 14601 win 257
      
      0.400 close(4) = 0
      0.400 > F. 14601:14601(0) ack 1
      0.500 < F. 1:1(0) ack 14602 win 257
      0.500 > . 14602:14602(0) ack 2
      Signed-off-by: NMartin KaFai Lau <kafai@fb.com>
      Cc: Eric Dumazet <edumazet@google.com>
      Cc: Neal Cardwell <ncardwell@google.com>
      Cc: Soheil Hassas Yeganeh <soheil.kdev@gmail.com>
      Cc: Willem de Bruijn <willemb@google.com>
      Cc: Yuchung Cheng <ycheng@google.com>
      Acked-by: NSoheil Hassas Yeganeh <soheil@google.com>
      Tested-by: NSoheil Hassas Yeganeh <soheil@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      479f85c3
  8. 19 4月, 2016 1 次提交
  9. 17 4月, 2016 2 次提交
  10. 16 4月, 2016 3 次提交
  11. 15 4月, 2016 4 次提交
    • C
      soreuseport: fix ordering for mixed v4/v6 sockets · d894ba18
      Craig Gallek 提交于
      With the SO_REUSEPORT socket option, it is possible to create sockets
      in the AF_INET and AF_INET6 domains which are bound to the same IPv4 address.
      This is only possible with SO_REUSEPORT and when not using IPV6_V6ONLY on
      the AF_INET6 sockets.
      
      Prior to the commits referenced below, an incoming IPv4 packet would
      always be routed to a socket of type AF_INET when this mixed-mode was used.
      After those changes, the same packet would be routed to the most recently
      bound socket (if this happened to be an AF_INET6 socket, it would
      have an IPv4 mapped IPv6 address).
      
      The change in behavior occurred because the recent SO_REUSEPORT optimizations
      short-circuit the socket scoring logic as soon as they find a match.  They
      did not take into account the scoring logic that favors AF_INET sockets
      over AF_INET6 sockets in the event of a tie.
      
      To fix this problem, this patch changes the insertion order of AF_INET
      and AF_INET6 addresses in the TCP and UDP socket lists when the sockets
      have SO_REUSEPORT set.  AF_INET sockets will be inserted at the head of the
      list and AF_INET6 sockets with SO_REUSEPORT set will always be inserted at
      the tail of the list.  This will force AF_INET sockets to always be
      considered first.
      
      Fixes: e32ea7e7 ("soreuseport: fast reuseport UDP socket selection")
      Fixes: 125e80b88687 ("soreuseport: fast reuseport TCP socket selection")
      Reported-by: NMaciej Żenczykowski <maze@google.com>
      Signed-off-by: NCraig Gallek <kraig@google.com>
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      d894ba18
    • A
      GSO: Support partial segmentation offload · 802ab55a
      Alexander Duyck 提交于
      This patch adds support for something I am referring to as GSO partial.
      The basic idea is that we can support a broader range of devices for
      segmentation if we use fixed outer headers and have the hardware only
      really deal with segmenting the inner header.  The idea behind the naming
      is due to the fact that everything before csum_start will be fixed headers,
      and everything after will be the region that is handled by hardware.
      
      With the current implementation it allows us to add support for the
      following GSO types with an inner TSO_MANGLEID or TSO6 offload:
      NETIF_F_GSO_GRE
      NETIF_F_GSO_GRE_CSUM
      NETIF_F_GSO_IPIP
      NETIF_F_GSO_SIT
      NETIF_F_UDP_TUNNEL
      NETIF_F_UDP_TUNNEL_CSUM
      
      In the case of hardware that already supports tunneling we may be able to
      extend this further to support TSO_TCPV4 without TSO_MANGLEID if the
      hardware can support updating inner IPv4 headers.
      Signed-off-by: NAlexander Duyck <aduyck@mirantis.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      802ab55a
    • A
      GRO: Add support for TCP with fixed IPv4 ID field, limit tunnel IP ID values · 1530545e
      Alexander Duyck 提交于
      This patch does two things.
      
      First it allows TCP to aggregate TCP frames with a fixed IPv4 ID field.  As
      a result we should now be able to aggregate flows that were converted from
      IPv6 to IPv4.  In addition this allows us more flexibility for future
      implementations of segmentation as we may be able to use a fixed IP ID when
      segmenting the flow.
      
      The second thing this does is that it places limitations on the outer IPv4
      ID header in the case of tunneled frames.  Specifically it forces the IP ID
      to be incrementing by 1 unless the DF bit is set in the outer IPv4 header.
      This way we can avoid creating overlapping series of IP IDs that could
      possibly be fragmented if the frame goes through GRO and is then
      resegmented via GSO.
      Signed-off-by: NAlexander Duyck <aduyck@mirantis.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      1530545e
    • A
      GSO: Add GSO type for fixed IPv4 ID · cbc53e08
      Alexander Duyck 提交于
      This patch adds support for TSO using IPv4 headers with a fixed IP ID
      field.  This is meant to allow us to do a lossless GRO in the case of TCP
      flows that use a fixed IP ID such as those that convert IPv6 header to IPv4
      headers.
      
      In addition I am adding a feature that for now I am referring to TSO with
      IP ID mangling.  Basically when this flag is enabled the device has the
      option to either output the flow with incrementing IP IDs or with a fixed
      IP ID regardless of what the original IP ID ordering was.  This is useful
      in cases where the DF bit is set and we do not care if the original IP ID
      value is maintained.
      Signed-off-by: NAlexander Duyck <aduyck@mirantis.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      cbc53e08