1. 11 Jun 2014, 1 commit
  2. 23 May 2014, 1 commit
    • tcp: make cwnd-limited checks measurement-based, and gentler · ca8a2263
      Authored by Neal Cardwell
      Experience with the recent e114a710 ("tcp: fix cwnd limited
      checking to improve congestion control") has shown that there are
      common cases where that commit can cause cwnd to be much larger than
      necessary. This leads to TSO autosizing cooking skbs that are too
      large, among other things.
      
      The main problems seemed to be:
      
      (1) That commit attempted to predict the future behavior of the
      connection by looking at the write queue (if TSO or TSQ limit
      sending). That prediction sometimes overestimated future outstanding
      packets.
      
      (2) That commit always allowed cwnd to grow to twice the number of
      outstanding packets (even in congestion avoidance, where this is not
      needed).
      
      This commit improves both of these, by:
      
      (1) Switching to a measurement-based approach where we explicitly
      track the largest number of packets in flight during the past window
      ("max_packets_out"), and remember whether we were cwnd-limited at the
      moment we finished sending that flight.
      
      (2) Only allowing cwnd to grow to twice the number of outstanding
      packets ("max_packets_out") in slow start. In congestion avoidance
      mode we now only allow cwnd to grow if it was fully utilized.
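
      A minimal sketch of the resulting accounting (simplified; not the
      exact patch, and the max_packets_seq field name is an assumption):

      static void tcp_cwnd_validate(struct sock *sk, bool is_cwnd_limited)
      {
              struct tcp_sock *tp = tcp_sk(sk);

              /* Track the largest flight of the current window, and whether
               * we were cwnd-limited when we finished sending that flight.
               */
              if (!before(tp->snd_una, tp->max_packets_seq) ||
                  tp->packets_out > tp->max_packets_out) {
                      tp->max_packets_out = tp->packets_out;
                      tp->max_packets_seq = tp->snd_nxt;
                      tp->is_cwnd_limited = is_cwnd_limited;
              }
      }

      static inline bool tcp_is_cwnd_limited(const struct sock *sk)
      {
              const struct tcp_sock *tp = tcp_sk(sk);

              /* In slow start, let cwnd grow to 2 * max_packets_out. */
              if (tp->snd_cwnd <= tp->snd_ssthresh)
                      return tp->snd_cwnd < 2 * tp->max_packets_out;

              /* In congestion avoidance, grow only if cwnd was fully used. */
              return tp->is_cwnd_limited;
      }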
      Signed-off-by: Neal Cardwell <ncardwell@google.com>
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      ca8a2263
  3. 14 May 2014, 2 commits
  4. 04 May 2014, 1 commit
  5. 03 May 2014, 1 commit
    • tcp: fix cwnd limited checking to improve congestion control · e114a710
      Authored by Eric Dumazet
      Yuchung discovered tcp_is_cwnd_limited() was returning false in
      slow start phase even if the application filled the socket write queue.
      
      All congestion modules take tcp_is_cwnd_limited() into account
      before increasing cwnd, so this behavior prevents slow start from
      probing the bandwidth at full speed.
      
      The problem is that even if the write queue is full (i.e. we are
      _not_ application limited), cwnd can be underutilized if TSO
      auto-defer or TCP Small Queues decided to hold packets.
      
      So in_flight can be kept at a smaller value, and we can reach the
      point where tcp_is_cwnd_limited() returns false.
      
      With TCP Small Queues and FQ/pacing, this issue is more visible.
      
      We fix this by having tcp_cwnd_validate(), which is supposed to track
      such things, take into account unsent_segs, the number of segs that we
      are not sending at the moment due to TSO or TSQ, but intend to send
      real soon. Then when we are cwnd-limited, remember this fact while we
      are processing the window of ACKs that comes back.
      
      For example, suppose we have a brand new connection with cwnd=10; we
      are in slow start, and we send a flight of 9 packets. By the time we
      have received ACKs for all 9 packets we want our cwnd to be 18.
      We implement this by setting tp->lsnd_pending to 9, and
      considering ourselves to be cwnd-limited while cwnd is less than
      twice tp->lsnd_pending (2*9 -> 18).
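
      In code, the check described by this example looks roughly like the
      following (illustrative sketch only, reusing the tp->lsnd_pending
      name from above):

      static inline bool tcp_is_cwnd_limited(const struct sock *sk, u32 in_flight)
      {
              const struct tcp_sock *tp = tcp_sk(sk);

              /* cwnd-limited while cwnd is below twice the size of the last
               * flight we tried to send (e.g. 10 < 2 * 9 = 18 above).
               */
              return tp->snd_cwnd < 2 * tp->lsnd_pending;
      }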
      
      This makes tcp_is_cwnd_limited() more understandable, by removing
      the GSO/TSO kludge that tried to work around the issue.
      
      Note the in_flight parameter can be removed in a followup cleanup
      patch.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: Neal Cardwell <ncardwell@google.com>
      Signed-off-by: Yuchung Cheng <ycheng@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      e114a710
  6. 01 May 2014, 1 commit
  7. 23 Apr 2014, 1 commit
  8. 21 Apr 2014, 1 commit
  9. 16 Apr 2014, 1 commit
  10. 28 Mar 2014, 1 commit
  11. 27 Mar 2014, 1 commit
  12. 12 Mar 2014, 1 commit
    • tcp: tcp_release_cb() should release socket ownership · c3f9b018
      Authored by Eric Dumazet
      Lars Persson reported the following deadlock:
      
      -000 |M:0x0:0x802B6AF8(asm) <-- arch_spin_lock
      -001 |tcp_v4_rcv(skb = 0x8BD527A0) <-- sk = 0x8BE6B2A0
      -002 |ip_local_deliver_finish(skb = 0x8BD527A0)
      -003 |__netif_receive_skb_core(skb = 0x8BD527A0, ?)
      -004 |netif_receive_skb(skb = 0x8BD527A0)
      -005 |elk_poll(napi = 0x8C770500, budget = 64)
      -006 |net_rx_action(?)
      -007 |__do_softirq()
      -008 |do_softirq()
      -009 |local_bh_enable()
      -010 |tcp_rcv_established(sk = 0x8BE6B2A0, skb = 0x87D3A9E0, th = 0x814EBE14, ?)
      -011 |tcp_v4_do_rcv(sk = 0x8BE6B2A0, skb = 0x87D3A9E0)
      -012 |tcp_delack_timer_handler(sk = 0x8BE6B2A0)
      -013 |tcp_release_cb(sk = 0x8BE6B2A0)
      -014 |release_sock(sk = 0x8BE6B2A0)
      -015 |tcp_sendmsg(?, sk = 0x8BE6B2A0, ?, ?)
      -016 |sock_sendmsg(sock = 0x8518C4C0, msg = 0x87D8DAA8, size = 4096)
      -017 |kernel_sendmsg(?, ?, ?, ?, size = 4096)
      -018 |smb_send_kvec()
      -019 |smb_send_rqst(server = 0x87C4D400, rqst = 0x87D8DBA0)
      -020 |cifs_call_async()
      -021 |cifs_async_writev(wdata = 0x87FD6580)
      -022 |cifs_writepages(mapping = 0x852096E4, wbc = 0x87D8DC88)
      -023 |__writeback_single_inode(inode = 0x852095D0, wbc = 0x87D8DC88)
      -024 |writeback_sb_inodes(sb = 0x87D6D800, wb = 0x87E4A9C0, work = 0x87D8DD88)
      -025 |__writeback_inodes_wb(wb = 0x87E4A9C0, work = 0x87D8DD88)
      -026 |wb_writeback(wb = 0x87E4A9C0, work = 0x87D8DD88)
      -027 |wb_do_writeback(wb = 0x87E4A9C0, force_wait = 0)
      -028 |bdi_writeback_workfn(work = 0x87E4A9CC)
      -029 |process_one_work(worker = 0x8B045880, work = 0x87E4A9CC)
      -030 |worker_thread(__worker = 0x8B045880)
      -031 |kthread(_create = 0x87CADD90)
      -032 |ret_from_kernel_thread(asm)
      
      The bug occurs because __tcp_checksum_complete_user() enables BH,
      assuming it is running from softirq context.
      
      Lars' trace involved a NIC without RX checksum support, but other
      points are problematic as well, like the prequeue stuff.
      
      The problem is triggered by a timer that finds the socket owned by
      user context.
      
      tcp_release_cb() should call tcp_write_timer_handler() or
      tcp_delack_timer_handler() in the appropriate context:
      
      BH disabled and socket lock held, but the 'owned' field cleared,
      as if they were running from the timer handlers.
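
      A simplified sketch of that idea (flag handling trimmed; not the
      exact patch):

      static void sock_release_ownership(struct sock *sk)
      {
              sk->sk_lock.owned = 0;
      }

      void tcp_release_cb(struct sock *sk)
      {
              unsigned long flags = xchg(&tcp_sk(sk)->tsq_flags, 0);

              /* We run with BH disabled and sk_lock.slock held, but the
               * deferred handlers below expect a timer-like context where
               * the socket is not owned by user, so clear 'owned' first.
               */
              sock_release_ownership(sk);

              if (flags & (1UL << TCP_WRITE_TIMER_DEFERRED))
                      tcp_write_timer_handler(sk);
              if (flags & (1UL << TCP_DELACK_TIMER_DEFERRED))
                      tcp_delack_timer_handler(sk);
      }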
      
      Fixes: 6f458dfb ("tcp: improve latencies of timer triggered events")
      Reported-by: Lars Persson <lars.persson@axis.com>
      Tested-by: Lars Persson <lars.persson@axis.com>
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      c3f9b018
  13. 11 Mar 2014, 1 commit
  14. 08 Mar 2014, 1 commit
  15. 07 Mar 2014, 1 commit
  16. 04 Mar 2014, 2 commits
  17. 27 Feb 2014, 3 commits
    • tcp: switch rtt estimations to usec resolution · 740b0f18
      Authored by Eric Dumazet
      Upcoming congestion controls for TCP require usec resolution for RTT
      estimations. Millisecond resolution is simply not enough these days.
      
      FQ/pacing in DC environments also requires this change for finer control
      and removal of the bimodal behavior due to the current hack in
      tcp_update_pacing_rate() for 'small rtt'.
      
      TCP_CONG_RTT_STAMP is no longer needed.
      
      As Julian Anastasov pointed out, we need to keep user compatibility :
      tcp_metrics used to export RTT and RTTVAR in msec resolution,
      so we added RTT_US and RTTVAR_US. An iproute2 patch is needed
      to use the new attributes if provided by the kernel.
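
      Conceptually the estimator keeps the same RFC 6298 fixed-point EWMA,
      just fed with microsecond samples (sketch only; the real patch renames
      tp->srtt to tp->srtt_us and touches many more places, and rttvar
      handling is omitted here):

      static void tcp_rtt_estimator(struct sock *sk, long mrtt_us)
      {
              struct tcp_sock *tp = tcp_sk(sk);
              long m = mrtt_us;                       /* RTT sample in usec */

              if (tp->srtt_us) {
                      m -= (tp->srtt_us >> 3);        /* m is now error in rtt est */
                      tp->srtt_us += m;               /* srtt = 7/8 srtt + 1/8 new */
              } else {
                      tp->srtt_us = max(m << 3, 1L);  /* first measurement, scaled by 8 */
              }
      }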
      
      In this example ss command displays a srtt of 32 usecs (10Gbit link)
      
      lpk51:~# ./ss -i dst lpk52
      Netid  State      Recv-Q Send-Q   Local Address:Port       Peer Address:Port
      tcp    ESTAB      0      1         10.246.11.51:42959      10.246.11.52:64614
               cubic wscale:6,6 rto:201 rtt:0.032/0.001 ato:40 mss:1448 cwnd:10
               send 3620.0Mbps pacing_rate 7240.0Mbps unacked:1 rcv_rtt:993 rcv_space:29559
      
      Updated iproute2 ip command displays :
      
      lpk51:~# ./ip tcp_metrics | grep 10.246.11.52
      10.246.11.52 age 561.914sec cwnd 10 rtt 274us rttvar 213us source 10.246.11.51
      
      Old binary displays :
      
      lpk51:~# ip tcp_metrics | grep 10.246.11.52
      10.246.11.52 age 561.914sec cwnd 10 rtt 250us rttvar 125us source 10.246.11.51
      
      With help from Julian Anastasov, Stephen Hemminger and Yuchung Cheng
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Acked-by: Neal Cardwell <ncardwell@google.com>
      Cc: Stephen Hemminger <stephen@networkplumber.org>
      Cc: Yuchung Cheng <ycheng@google.com>
      Cc: Larry Brakmo <brakmo@google.com>
      Cc: Julian Anastasov <ja@ssi.bg>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      740b0f18
    • net: tcp: add mib counters to track zero window transitions · 8e165e20
      Authored by Florian Westphal
      Three counters are added:
      - one to track when we went from non-zero to zero window
      - one to track the reverse
      - one counter incremented when we want to announce zero window,
        but can't because we would shrink current window.
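
      Illustrative sketch of where such counters get bumped in
      tcp_select_window() (variable and MIB names here are assumptions,
      not necessarily the exact ones in the patch):

      if (new_win == 0) {
              if (old_win != 0)       /* non-zero -> zero transition */
                      NET_INC_STATS(sock_net(sk), LINUX_MIB_TCPTOZEROWINDOWADV);
      } else if (old_win == 0) {      /* zero -> non-zero transition */
              NET_INC_STATS(sock_net(sk), LINUX_MIB_TCPFROMZEROWINDOWADV);
      }

      /* wanted to announce a zero window, but could not because doing so
       * would shrink the window we already committed to:
       */
      if (wanted_zero_win && new_win != 0)
              NET_INC_STATS(sock_net(sk), LINUX_MIB_TCPWANTZEROWINDOWADV);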
      Suggested-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: Florian Westphal <fw@strlen.de>
      Acked-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      8e165e20
    • net: tcp: use NET_INC_STATS() · 9a9bfd03
      Authored by Eric Dumazet
      While LINUX_MIB_TCPSPURIOUS_RTX_HOSTQUEUES can only be incremented
      in tcp_transmit_skb() from softirq (incoming message or timer
      activation), it is better to use NET_INC_STATS() instead of
      NET_INC_STATS_BH() as tcp_transmit_skb() can be called from process
      context.
      
      This will avoid copy/paste confusion when/if we want to add
      other SNMP counters in tcp_transmit_skb().
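
      For illustration, the difference is only in what the SNMP helper
      assumes about the calling context:

      /* safe from process or BH context: */
      NET_INC_STATS(sock_net(sk), LINUX_MIB_TCPSPURIOUS_RTX_HOSTQUEUES);

      /* assumes BH is already disabled: */
      NET_INC_STATS_BH(sock_net(sk), LINUX_MIB_TCPSPURIOUS_RTX_HOSTQUEUES);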
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Cc: Hannes Frederic Sowa <hannes@stressinduktion.org>
      Cc: Florian Westphal <fw@strlen.de>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      9a9bfd03
  18. 22 Feb 2014, 1 commit
  19. 20 Feb 2014, 1 commit
    • tcp: use zero-window when free_space is low · 86c1a045
      Authored by Florian Westphal
      Currently the kernel tries to announce a zero window when free_space
      is below the current receiver mss estimate.
      
      When a sender is transmitting small packets and the reader consumes
      data slowly (or not at all), the receiver might be unable to shrink
      the receive window because
      
      a) we cannot withdraw an already-committed receive window, and
      b) we have to round the current rwin up to a multiple of the wscale
         factor, else we would shrink the current window.
      
      This causes the receive buffer to fill up until the rmem limit is hit.
      When this happens, we start dropping packets.
      
      Moreover, tcp_clamp_window may continue to grow sk_rcvbuf towards
      rmem[2] even if the socket is not being read from.
      
      As we cannot avoid the "current_win is rounded up to a multiple of
      mss" issue [we would violate a) above], at least try to prevent the
      receive buffer's growth towards the tcp_rmem[2] limit by moving to a
      zero-window announcement once free_space becomes less than 1/16 of
      the currently allowed receive buffer maximum.  If tcp_rmem[2] is
      large, this increases our chances of getting a zero-window
      announcement out in time.
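
      A sketch of the added check (simplified, assuming it lives in
      __tcp_select_window()):

      int full_space = min_t(int, tp->window_clamp, tcp_full_space(sk));
      int free_space = tcp_space(sk);

      if (free_space < (full_space >> 1)) {
              /* Try to get a zero-window announcement out early: stop
               * advertising window once free space drops below 1/16 of
               * the allowed maximum (or below one MSS).
               */
              if (free_space < (full_space >> 4) || free_space < mss)
                      return 0;
      }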
      
      Reproducer:
      On server:
      $ nc -l -p 12345
      <suspend it: CTRL-Z>
      
      Client:
      #!/usr/bin/env python
      import socket
      import time
      
      sock = socket.socket()
      sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
      sock.connect(("192.168.4.1", 12345))
      while True:
          sock.send(b'A' * 23)  # small write, well below one MSS
          time.sleep(0.005)
      
      The socket buffer on the server side will grow until tcp_rmem[2] is
      hit, at which point the client retransmits data until -ETIMEDOUT:
      
      tcp_data_queue invokes tcp_try_rmem_schedule which will call
      tcp_prune_queue which calls tcp_clamp_window().  And that function will
      grow sk->sk_rcvbuf up until it eventually hits tcp_rmem[2].
      
      Thanks to Eric Dumazet for running regression tests.
      
      Cc: Neal Cardwell <ncardwell@google.com>
      Cc: Yuchung Cheng <ycheng@google.com>
      Acked-by: Eric Dumazet <edumazet@google.com>
      Tested-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: Florian Westphal <fw@strlen.de>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      86c1a045
  20. 14 Feb 2014, 1 commit
  21. 11 Feb 2014, 1 commit
    • tcp: tsq: fix nonagle handling · bf06200e
      Authored by John Ogness
      Commit 46d3ceab ("tcp: TCP Small Queues") introduced a possible
      regression for applications using TCP_NODELAY.
      
      If a TCP session is throttled because of TSQ, we should consult
      tp->nonagle when TX completion is done and allow ourselves to send an
      additional segment, especially if this segment is not a full MSS.
      Otherwise this segment is only sent after an RTO.
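
      Sketch of the TX-completion side of the fix (simplified; the real
      handler also re-checks sk_wmem_alloc, as noted below):

      static void tcp_tsq_handler(struct sock *sk)
      {
              if ((1 << sk->sk_state) &
                  (TCPF_ESTABLISHED | TCPF_FIN_WAIT1 | TCPF_CLOSING |
                   TCPF_CLOSE_WAIT | TCPF_LAST_ACK))
                      /* pass tp->nonagle so a trailing sub-MSS segment can
                       * be sent right away instead of waiting for an RTO
                       */
                      tcp_write_xmit(sk, tcp_current_mss(sk),
                                     tcp_sk(sk)->nonagle, 0, GFP_ATOMIC);
      }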
      
      [edumazet] : Cooked the changelog, added another fix about testing
      sk_wmem_alloc twice because TX completion can happen right before
      setting TSQ_THROTTLED bit.
      
      This problem is particularly visible with recent auto corking,
      but might also be triggered with low tcp_limit_output_bytes
      values, NIC drivers delaying TX completion by hundreds of usec,
      and very low rtt.
      
      Thomas Glanzmann, for example, reported an iSCSI regression caused
      by TCP auto corking making this bug quite visible.
      
      Fixes: 46d3ceab ("tcp: TCP Small Queues")
      Signed-off-by: John Ogness <john.ogness@linutronix.de>
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Reported-by: Thomas Glanzmann <thomas@glanzmann.de>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      bf06200e
  22. 07 Feb 2014, 1 commit
    • tcp: remove 1ms offset in srtt computation · 4a5ab4e2
      Authored by Eric Dumazet
      TCP pacing depends on an accurate srtt estimation.
      
      Current srtt estimation is using jiffies resolution,
      and has an artificial offset of at least 1 ms, which can produce
      slowdowns when FQ/pacing is used, especially in the DC world,
      where typical rtt is below 1 ms.
      
      We are planning a switch to usec resolution for linux-3.15,
      but in the meantime, this patch removes the 1 ms offset.
      
      All we need is for tp->srtt to have a minimal value of 1 (not 8) to
      differentiate whether srtt has been initialized or not.
      
      The problematic behavior was observed on a 40Gbit testbed,
      where 32 concurrent netperf were reaching 12Gbps of aggregate
      speed, instead of line speed.
      
      This patch also has the effect of reporting more accurate srtt and send
      rates to iproute2 ss command as in :
      
      $ ss -i dst cca2
      Netid  State      Recv-Q Send-Q          Local Address:Port     Peer Address:Port
      tcp    ESTAB      0      0                10.244.129.1:56984    10.244.129.2:12865
               cubic wscale:6,6 rto:200 rtt:0.25/0.25 ato:40 mss:1448 cwnd:10
               send 463.4Mbps rcv_rtt:1 rcv_space:29200
      tcp    ESTAB      0      390960           10.244.129.1:60247    10.244.129.2:50204
               cubic wscale:6,6 rto:200 rtt:0.875/0.75 mss:1448 cwnd:73 ssthresh:51
               send 966.4Mbps unacked:73 retrans:0/121 rcv_space:29200
      Reported-by: Vytautas Valancius <valas@google.com>
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Cc: Yuchung Cheng <ycheng@google.com>
      Cc: Neal Cardwell <ncardwell@google.com>
      Acked-by: Neal Cardwell <ncardwell@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      4a5ab4e2
  23. 30 Dec 2013, 1 commit
  24. 18 Dec 2013, 1 commit
    • tcp: refine TSO splits · d4589926
      Authored by Eric Dumazet
      While investigating performance problems on small RPC workloads, I
      noticed the Linux TCP stack was always splitting the last TSO skb
      into two parts (skbs): one a multiple of MSS, and a small one with
      the PSH flag set.  This split is done even if TCP_NODELAY is set, or
      if no small packet is in flight.
      
      Example with request/response of 4K/4K
      
      IP A > B: . ack 68432 win 2783 <nop,nop,timestamp 6524593 6525001>
      IP A > B: . 65537:68433(2896) ack 69632 win 2783 <nop,nop,timestamp 6524593 6525001>
      IP A > B: P 68433:69633(1200) ack 69632 win 2783 <nop,nop,timestamp 6524593 6525001>
      IP B > A: . ack 68433 win 2768 <nop,nop,timestamp 6525001 6524593>
      IP B > A: . 69632:72528(2896) ack 69633 win 2768 <nop,nop,timestamp 6525001 6524593>
      IP B > A: P 72528:73728(1200) ack 69633 win 2768 <nop,nop,timestamp 6525001 6524593>
      IP A > B: . ack 72528 win 2783 <nop,nop,timestamp 6524593 6525001>
      IP A > B: . 69633:72529(2896) ack 73728 win 2783 <nop,nop,timestamp 6524593 6525001>
      IP A > B: P 72529:73729(1200) ack 73728 win 2783 <nop,nop,timestamp 6524593 6525001>
      
      We can avoid this split by including the Nagle tests at the right place.
      
      Note : If some NIC had trouble sending TSO packets with a partial
      last segment, we would have hit the problem in GRO/forwarding workload already.
      
      tcp_minshall_update() is moved to tcp_output.c and is updated as we might
      feed a TSO packet with a partial last segment.
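
      The updated helpers look roughly like this (sketch): a packet now
      counts as "small" whenever its last segment is partial, even for a
      TSO skb.

      /* Minshall's variant of the Nagle send check. */
      static bool tcp_minshall_check(const struct tcp_sock *tp)
      {
              return after(tp->snd_sml, tp->snd_una) &&
                     !after(tp->snd_sml, tp->snd_nxt);
      }

      /* Remember the end sequence of a packet whose final segment is not a
       * full MSS, whether it is a lone small packet or a TSO skb with a
       * partial last segment.
       */
      static void tcp_minshall_update(struct tcp_sock *tp, unsigned int mss_now,
                                      const struct sk_buff *skb)
      {
              if (skb->len < tcp_skb_pcount(skb) * mss_now)
                      tp->snd_sml = TCP_SKB_CB(skb)->end_seq;
      }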
      
      This patch tremendously improves performance, as the traffic now looks
      like :
      
      IP A > B: . ack 98304 win 2783 <nop,nop,timestamp 6834277 6834685>
      IP A > B: P 94209:98305(4096) ack 98304 win 2783 <nop,nop,timestamp 6834277 6834685>
      IP B > A: . ack 98305 win 2768 <nop,nop,timestamp 6834686 6834277>
      IP B > A: P 98304:102400(4096) ack 98305 win 2768 <nop,nop,timestamp 6834686 6834277>
      IP A > B: . ack 102400 win 2783 <nop,nop,timestamp 6834279 6834686>
      IP A > B: P 98305:102401(4096) ack 102400 win 2783 <nop,nop,timestamp 6834279 6834686>
      IP B > A: . ack 102401 win 2768 <nop,nop,timestamp 6834687 6834279>
      IP B > A: P 102400:106496(4096) ack 102401 win 2768 <nop,nop,timestamp 6834687 6834279>
      IP A > B: . ack 106496 win 2783 <nop,nop,timestamp 6834280 6834687>
      IP A > B: P 102401:106497(4096) ack 106496 win 2783 <nop,nop,timestamp 6834280 6834687>
      IP B > A: . ack 106497 win 2768 <nop,nop,timestamp 6834688 6834280>
      IP B > A: P 106496:110592(4096) ack 106497 win 2768 <nop,nop,timestamp 6834688 6834280>
      
      Before :
      
      lpq83:~# nstat >/dev/null;perf stat ./super_netperf 200 -t TCP_RR -H lpq84 -l 20 -- -r 4K,4K
      280774
      
       Performance counter stats for './super_netperf 200 -t TCP_RR -H lpq84 -l 20 -- -r 4K,4K':
      
           205719.049006 task-clock                #    9.278 CPUs utilized
               8,449,968 context-switches          #    0.041 M/sec
               1,935,997 CPU-migrations            #    0.009 M/sec
                 160,541 page-faults               #    0.780 K/sec
         548,478,722,290 cycles                    #    2.666 GHz                     [83.20%]
         455,240,670,857 stalled-cycles-frontend   #   83.00% frontend cycles idle    [83.48%]
         272,881,454,275 stalled-cycles-backend    #   49.75% backend  cycles idle    [66.73%]
         166,091,460,030 instructions              #    0.30  insns per cycle
                                                   #    2.74  stalled cycles per insn [83.39%]
          29,150,229,399 branches                  #  141.699 M/sec                   [83.30%]
           1,943,814,026 branch-misses             #    6.67% of all branches         [83.32%]
      
            22.173517844 seconds time elapsed
      
      lpq83:~# nstat | egrep "IpOutRequests|IpExtOutOctets"
      IpOutRequests                   16851063           0.0
      IpExtOutOctets                  23878580777        0.0
      
      After patch :
      
      lpq83:~# nstat >/dev/null;perf stat ./super_netperf 200 -t TCP_RR -H lpq84 -l 20 -- -r 4K,4K
      280877
      
       Performance counter stats for './super_netperf 200 -t TCP_RR -H lpq84 -l 20 -- -r 4K,4K':
      
           107496.071918 task-clock                #    4.847 CPUs utilized
               5,635,458 context-switches          #    0.052 M/sec
               1,374,707 CPU-migrations            #    0.013 M/sec
                 160,920 page-faults               #    0.001 M/sec
         281,500,010,924 cycles                    #    2.619 GHz                     [83.28%]
         228,865,069,307 stalled-cycles-frontend   #   81.30% frontend cycles idle    [83.38%]
         142,462,742,658 stalled-cycles-backend    #   50.61% backend  cycles idle    [66.81%]
          95,227,712,566 instructions              #    0.34  insns per cycle
                                                   #    2.40  stalled cycles per insn [83.43%]
          16,209,868,171 branches                  #  150.795 M/sec                   [83.20%]
             874,252,952 branch-misses             #    5.39% of all branches         [83.37%]
      
            22.175821286 seconds time elapsed
      
      lpq83:~# nstat | egrep "IpOutRequests|IpExtOutOctets"
      IpOutRequests                   11239428           0.0
      IpExtOutOctets                  23595191035        0.0
      
      Indeed, the occupancy of tx skbs (IpExtOutOctets/IpOutRequests) is higher :
      2099 instead of 1417, thus helping GRO to be more efficient when using FQ packet
      scheduler.
      
      Many thanks to Neal for review and ideas.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Cc: Yuchung Cheng <ycheng@google.com>
      Cc: Neal Cardwell <ncardwell@google.com>
      Cc: Nandita Dukkipati <nanditad@google.com>
      Cc: Van Jacobson <vanj@google.com>
      Acked-by: Neal Cardwell <ncardwell@google.com>
      Tested-by: Neal Cardwell <ncardwell@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      d4589926
  25. 11 Dec 2013, 1 commit
  26. 07 Dec 2013, 1 commit
  27. 20 Nov 2013, 1 commit
    • tcp: don't update snd_nxt, when a socket is switched from repair mode · dbde4979
      Authored by Andrey Vagin
      snd_nxt must be updated synchronously with sk_send_head.  Otherwise
      tp->packets_out may be updated incorrectly, which may cause a kernel panic.
      
      Here is a kernel panic from my host.
      [  103.043194] BUG: unable to handle kernel NULL pointer dereference at 0000000000000048
      [  103.044025] IP: [<ffffffff815aaaaf>] tcp_rearm_rto+0xcf/0x150
      ...
      [  146.301158] Call Trace:
      [  146.301158]  [<ffffffff815ab7f0>] tcp_ack+0xcc0/0x12c0
      
      Before this panic a TCP socket was restored. This socket had sent and
      unsent data in the write queue. Sent data was restored in repair mode,
      then the socket was switched out of repair mode and unsent data was
      restored. After that the socket was switched back into repair mode.
      
      At that moment we had a socket whose write queue looked like this:
      snd_una    snd_nxt   write_seq
         |_________|________|
                   |
      	  sk_send_head
      
      After switching out of repair mode a second time, the state of the
      socket changed:
      
      snd_una          snd_nxt, write_seq
         |_________ ________|
                   |
      	  sk_send_head
      
      This state is inconsistent, because snd_nxt and sk_send_head are not
      synchronized.
      
      Below is a call trace showing how packets_out can be incremented twice
      for one skb if snd_nxt and sk_send_head are not synchronized.
      In this case packets_out will always be positive, even when
      sk_write_queue is empty.
      
      tcp_write_wakeup
      	skb = tcp_send_head(sk);
      	tcp_fragment
      		if (!before(tp->snd_nxt, TCP_SKB_CB(buff)->end_seq))
      			tcp_adjust_pcount(sk, skb, diff);
      	tcp_event_new_data_sent
      		tp->packets_out += tcp_skb_pcount(skb);
      
      I think the update of snd_nxt isn't required when a socket is switched
      out of repair mode, because it's initialized in tcp_connect_init. Then
      when a write queue is restored, snd_nxt is incremented in
      tcp_event_new_data_sent, so it is always in a consistent state.
      
      I have checked that the bug is not reproduced with this patch and
      that all tests for restoring TCP connections pass.
      
      Cc: Pavel Emelyanov <xemul@parallels.com>
      Cc: Eric Dumazet <edumazet@google.com>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru>
      Cc: James Morris <jmorris@namei.org>
      Cc: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org>
      Cc: Patrick McHardy <kaber@trash.net>
      Signed-off-by: Andrey Vagin <avagin@openvz.org>
      Acked-by: Pavel Emelyanov <xemul@parallels.com>
      Acked-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      dbde4979
  28. 15 Nov 2013, 1 commit
    • tcp: tsq: restore minimal amount of queueing · 98e09386
      Authored by Eric Dumazet
      After commit c9eeec26 ("tcp: TSQ can use a dynamic limit"), several
      users reported throughput regressions, notably on mvneta and wifi
      adapters.
      
      802.11 AMPDU requires a fair amount of queueing to be effective.
      
      This patch partially reverts the change done in tcp_write_xmit()
      so that the minimal amount is sysctl_tcp_limit_output_bytes.
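
      Roughly, the TSQ limit computation in tcp_write_xmit() becomes
      (sketch):

      /* TSO autosizing still caps queueing to ~1 ms worth of data at the
       * current pacing rate, but never below the sysctl floor:
       */
      limit = max_t(unsigned int, sysctl_tcp_limit_output_bytes,
                    sk->sk_pacing_rate >> 10);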
      
      It also removes the use of this sysctl while building skbs stored in
      the write queue, as TSO autosizing does the right thing anyway.
      
      Users with well-behaving NICs and a correct qdisc (like sch_fq)
      can then lower the default sysctl_tcp_limit_output_bytes value from
      128KB to 8KB.
      
      This new usage of sysctl_tcp_limit_output_bytes permits each driver
      author to check how their driver performs when/if the value is set
      to a minimum of 4KB.
      
      Normally, line rate for a single TCP flow should be possible,
      but some drivers rely on timers to perform TX completion and
      too long TX completion delays prevent reaching full throughput.
      
      Fixes: c9eeec26 ("tcp: TSQ can use a dynamic limit")
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Reported-by: Sujith Manoharan <sujith@msujith.org>
      Reported-by: Arnaud Ebalard <arno@natisbad.org>
      Tested-by: Sujith Manoharan <sujith@msujith.org>
      Cc: Felix Fietkau <nbd@openwrt.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      98e09386
  29. 19 Oct 2013, 1 commit
  30. 18 Oct 2013, 2 commits
    • tcp: remove the sk_can_gso() check from tcp_set_skb_tso_segs() · 8f26fb1c
      Authored by Eric Dumazet
      sk_can_gso() should only be used as a (performance) hint in
      tcp_sendmsg() to build GSO packets in the first place.
      
      Once we have GSO packets in the write queue, we cannot decide they are
      no longer GSO just because the flow now uses a route which doesn't
      handle TSO/GSO.
      
      The core networking stack handles this case very well for us; all we
      need is to keep track of packet counts in MSS terms, regardless of
      the segmentation done later (in GSO or hardware).
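
      A sketch of the simplified helper (close to, but not necessarily
      identical to, the patched code):

      static void tcp_set_skb_tso_segs(const struct sock *sk, struct sk_buff *skb,
                                       unsigned int mss_now)
      {
              if (skb->len <= mss_now || skb->ip_summed == CHECKSUM_NONE) {
                      /* one segment, or it cannot be offloaded/GSOed anyway */
                      skb_shinfo(skb)->gso_segs = 1;
                      skb_shinfo(skb)->gso_size = 0;
                      skb_shinfo(skb)->gso_type = 0;
              } else {
                      /* always count packets in MSS terms, no sk_can_gso() check */
                      skb_shinfo(skb)->gso_segs = DIV_ROUND_UP(skb->len, mss_now);
                      skb_shinfo(skb)->gso_size = mss_now;
                      skb_shinfo(skb)->gso_type = sk->sk_gso_type;
              }
      }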
      
      Right now, if tcp_fragment() splits a GSO packet in two parts,
      @left and @right, and the route changed to a non-GSO device,
      both @left and @right have pcount set to 1, which is wrong
      and leads to incorrect packet count tracking.
      
      This problem was added in commit d5ac99a6 ("[TCP]: skb pcount with MTU
      discovery")
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: Neal Cardwell <ncardwell@google.com>
      Signed-off-by: Yuchung Cheng <ycheng@google.com>
      Reported-by: Maciej Żenczykowski <maze@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      8f26fb1c
    • tcp: must unclone packets before mangling them · c52e2421
      Authored by Eric Dumazet
      TCP stack should make sure it owns skbs before mangling them.
      
      We had various crashes using bnx2x, and it turned out gso_size was
      cleared right before the bnx2x driver was populating the TX descriptor
      of the _previous_ packet send.  The TCP stack can sometimes retransmit
      packets that are still sitting in a Qdisc.
      
      Of course we could make the bnx2x driver more robust (using
      ACCESS_ONCE(shinfo->gso_size) for example), but the bug is in the TCP stack.
      
      We have identified two points where skb_unclone() was needed.
      
      This patch adds a WARN_ON_ONCE() to warn us if we missed another
      fix of this kind.
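
      The pattern at both points looks roughly like this (sketch):

      /* Make sure we own the skb (and its shared info) before mangling it,
       * e.g. before a retransmit path changes gso_size/gso_segs:
       */
      if (skb_unclone(skb, GFP_ATOMIC))
              return -ENOMEM;

      /* and the safety net added by this patch, to catch any missed path: */
      WARN_ON_ONCE(skb_cloned(skb));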
      
      Kudos to Neal for finding the root cause of this bug. It's visible
      when using a small MSS.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: Neal Cardwell <ncardwell@google.com>
      Cc: Yuchung Cheng <ycheng@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      c52e2421
  31. 12 Oct 2013, 1 commit
  32. 11 Oct 2013, 1 commit
  33. 10 Oct 2013, 1 commit
    • inet: includes a sock_common in request_sock · 634fb979
      Authored by Eric Dumazet
      TCP listener refactoring, part 5 :
      
      We want to be able to insert request sockets (SYN_RECV) into the main
      ehash table instead of the per-listener hash table, to allow RCU
      lookups and remove listener lock contention.
      
      This patch includes the needed struct sock_common in front
      of struct request_sock.
      
      This means there is no longer an IPv6-specific inet6_request_sock
      structure.
      
      The following inet_request_sock fields were renamed as they became
      macros referencing fields from struct sock_common.
      The ir_ prefix was chosen to avoid name collisions.
      
      loc_port   -> ir_loc_port
      loc_addr   -> ir_loc_addr
      rmt_addr   -> ir_rmt_addr
      rmt_port   -> ir_rmt_port
      iif        -> ir_iif
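
      The resulting layout looks roughly like this (sketch; which
      sock_common field each macro maps to is an assumption here):

      struct request_sock {
              struct sock_common      __req_common;   /* must come first */
              struct request_sock     *dl_next;
              /* ... remaining request_sock fields ... */
      };

      /* inside struct inet_request_sock, whose first member is
       * "struct request_sock req", the old fields become macros:
       */
      #define ir_loc_addr     req.__req_common.skc_rcv_saddr
      #define ir_rmt_addr     req.__req_common.skc_daddr
      #define ir_loc_port     req.__req_common.skc_num
      #define ir_rmt_port     req.__req_common.skc_dport
      #define ir_iif          req.__req_common.skc_bound_dev_if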
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      634fb979
  34. 03 Oct 2013, 1 commit
  35. 01 Oct 2013, 1 commit
    • tcp: TSQ can use a dynamic limit · c9eeec26
      Authored by Eric Dumazet
      When TCP Small Queues was added, we used a sysctl to limit the amount
      of packets queued on Qdisc/device queues for a given TCP flow.
      
      The problem is that this limit is either too big for low rates or too
      small for high rates.
      
      Now that the TCP stack has rate estimation in sk->sk_pacing_rate and
      TSO autosizing, it can better control the number of packets in
      Qdisc/device queues.
      
      The new limit is two packets, or at least 1 to 2 ms worth of packets.
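
      Sketch of the dynamic limit in tcp_write_xmit() (simplified):
      sk_pacing_rate is in bytes per second, so "rate >> 10" is roughly one
      millisecond worth of bytes.

      limit = max(2 * skb->truesize, sk->sk_pacing_rate >> 10);
      if (atomic_read(&sk->sk_wmem_alloc) > limit) {
              set_bit(TSQ_THROTTLED, &tp->tsq_flags);
              break;
      }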
      
      Low-rate flows benefit from this patch by having an even smaller
      number of packets in queues, allowing for faster recovery and
      better RTT estimations.
      
      High-rate flows benefit from this patch by allowing more than 2 packets
      in flight, as we had reports this was a limiting factor in reaching line
      rate. [ In particular if TX completion is delayed because of coalescing
      parameters ]
      
      Example for a single flow on a 10Gbit link controlled by FQ/pacing:
      
      14 packets in flight instead of 2.
      
      $ tc -s -d qd
      qdisc fq 8001: dev eth0 root refcnt 32 limit 10000p flow_limit 100p
      buckets 1024 quantum 3028 initial_quantum 15140
       Sent 1168459366606 bytes 771822841 pkt (dropped 0, overlimits 0
      requeues 6822476)
       rate 9346Mbit 771713pps backlog 953820b 14p requeues 6822476
        2047 flow, 2046 inactive, 1 throttled, delay 15673 ns
        2372 gc, 0 highprio, 0 retrans, 9739249 throttled, 0 flows_plimit
      
      Note that sk_pacing_rate is currently set to twice the actual rate, but
      this might be refined in the future when a flow is in congestion
      avoidance.
      
      Additional change : skb->destructor should be set to tcp_wfree().
      
      A future patch (for Linux 3.13+) might remove tcp_limit_output_bytes.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Cc: Wei Liu <wei.liu2@citrix.com>
      Cc: Cong Wang <xiyou.wangcong@gmail.com>
      Cc: Yuchung Cheng <ycheng@google.com>
      Cc: Neal Cardwell <ncardwell@google.com>
      Acked-by: Neal Cardwell <ncardwell@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      c9eeec26