1. 28 January 2013 (1 commit)
    • net: fix possible wrong checksum generation · cef401de
      Eric Dumazet authored
      Pravin Shelar mentioned that GSO could potentially generate
      wrong TX checksum if skb has fragments that are overwritten
      by the user between the checksum computation and transmit.
      
      He suggested to linearize skbs but this extra copy can be
      avoided for normal tcp skbs cooked by tcp_sendmsg().
      
      This patch introduces a new SKB_GSO_SHARED_FRAG flag, set
      in skb_shinfo(skb)->gso_type if at least one frag can be
      modified by the user.
      
      Typical sources of such possible overwrites are {vm}splice(),
      sendfile(), and macvtap/tun/virtio_net drivers.
      
      Tested:
      
      $ netperf -H 7.7.8.84
      MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to
      7.7.8.84 () port 0 AF_INET
      Recv   Send    Send
      Socket Socket  Message  Elapsed
      Size   Size    Size     Time     Throughput
      bytes  bytes   bytes    secs.    10^6bits/sec
      
       87380  16384  16384    10.00    3959.52
      
      $ netperf -H 7.7.8.84 -t TCP_SENDFILE
      TCP SENDFILE TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 7.7.8.84 ()
      port 0 AF_INET
      Recv   Send    Send
      Socket Socket  Message  Elapsed
      Size   Size    Size     Time     Throughput
      bytes  bytes   bytes    secs.    10^6bits/sec
      
       87380  16384  16384    10.00    3216.80
      
      Performance of the SENDFILE test is impacted by the extra allocation and
      copy, and because it uses order-0 pages, while the TCP_STREAM test uses
      bigger pages. (A userspace sendfile()-over-TCP sketch follows this entry.)
      Reported-by: Pravin Shelar <pshelar@nicira.com>
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      cef401de
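      For context, a minimal userspace sketch of one affected path: pages queued by
      sendfile() stay shared with the page cache, so a writer can still modify them
      between checksum computation and transmit. This is illustrative only; the
      destination host comes from the test above, and the file name and port are
      placeholders.

          /* sendfile()-over-TCP sketch (userspace). Pages queued this way stay
           * shared with the page cache, the situation SKB_GSO_SHARED_FRAG flags. */
          #include <arpa/inet.h>
          #include <fcntl.h>
          #include <stdio.h>
          #include <sys/sendfile.h>
          #include <sys/socket.h>
          #include <sys/stat.h>
          #include <unistd.h>

          int main(void)
          {
              int fd = open("payload.bin", O_RDONLY);          /* placeholder file */
              int s  = socket(AF_INET, SOCK_STREAM, 0);
              struct stat st;
              struct sockaddr_in dst = { .sin_family = AF_INET,
                                         .sin_port   = htons(5001) }; /* placeholder port */

              if (fd < 0 || s < 0 || fstat(fd, &st) < 0) {
                  perror("setup");
                  return 1;
              }
              inet_pton(AF_INET, "7.7.8.84", &dst.sin_addr);   /* host from the test */
              if (connect(s, (struct sockaddr *)&dst, sizeof(dst)) < 0) {
                  perror("connect");
                  return 1;
              }

              off_t off = 0;
              while (off < st.st_size)                         /* zero-copy transmit */
                  if (sendfile(s, fd, &off, st.st_size - off) <= 0) {
                      perror("sendfile");
                      break;
                  }

              close(s);
              close(fd);
              return 0;
          }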
  2. 07 January 2013 (1 commit)
  3. 08 December 2012 (1 commit)
    • tcp: bug fix Fast Open client retransmission · 93b174ad
      Yuchung Cheng authored
      If the SYN-ACK partially acks the SYN-data, the client retransmits the
      remaining data via tcp_retransmit_skb(). This increments loss recovery
      state variables like tp->retrans_out in the Open state. If loss recovery
      happens before the retransmission is acked, it triggers the WARN_ON
      check in tcp_fastretrans_alert(). For example: the client sends
      SYN-data, gets a SYN-ACK acking only the ISN, retransmits the data, sends
      another 4 data packets and gets 3 dupacks.
      
      Since the retransmission is not caused by a network drop, it should not
      update the recovery state variables. Further, the server may return a
      smaller MSS than the cached MSS used for the SYN-data, so the retransmission
      needs a loop. Otherwise some data will not be retransmitted until a timeout
      or other loss recovery events.
      Signed-off-by: Yuchung Cheng <ycheng@google.com>
      Acked-by: Neal Cardwell <ncardwell@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      93b174ad
  4. 23 November 2012 (1 commit)
  5. 16 November 2012 (1 commit)
    • tcp: fix retransmission in repair mode · ec342325
      Andrew Vagin authored
      Currently, if a socket was repaired with a few packets in the write queue,
      a kernel bug may be triggered:
      
      kernel BUG at net/ipv4/tcp_output.c:2330!
      RIP: 0010:[<ffffffff8155784f>] tcp_retransmit_skb+0x5ff/0x610
      
      According to the initial implementation (v3.4-rc2-963-gc0e88ff0),
      all skbs should look as if they had already been sent. This patch fixes the
      code to match that intent.
      
      Three points were not handled in the initial patch:
      1. The TCP send head should not be changed
      2. The TSO state of an skb must be initialized
      3. The retransmission timer must be reset
      
      This patch moves logic from tcp_sendmsg to tcp_write_xmit. A packet
      passes through the usual path but isn't sent to the network. This patch
      solves all the problems described above and also handles tcp_sendpages.
      
      Cc: Pavel Emelyanov <xemul@parallels.com>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru>
      Cc: James Morris <jmorris@namei.org>
      Cc: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org>
      Cc: Patrick McHardy <kaber@trash.net>
      Signed-off-by: Andrey Vagin <avagin@openvz.org>
      Acked-by: Pavel Emelyanov <xemul@parallels.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      ec342325
  6. 04 September 2012 (1 commit)
    • tcp: use PRR to reduce cwin in CWR state · 684bad11
      Yuchung Cheng authored
      Use proportional rate reduction (PRR) algorithm to reduce cwnd in CWR state,
      in addition to Recovery state. Retire the current rate-halving in CWR.
      When losses are detected via ACKs in CWR state, the sender enters Recovery
      state but the cwnd reduction continues and does not restart.
      
      Rename and refactor the cwnd reduction functions since both CWR and Recovery
      use the same algorithm:
      tcp_init_cwnd_reduction() is new and initializes the reduction state variables.
      tcp_cwnd_reduction() was previously tcp_update_cwnd_in_recovery().
      tcp_end_cwnd_reduction() was previously tcp_complete_cwr().
      
      The rate-halving functions and logic such as tcp_cwnd_down(), tcp_min_cwnd(),
      and the cwnd moderation inside tcp_enter_cwr() are removed. The unused
      parameter, flag, in tcp_cwnd_reduction() is also removed. (A sketch of the
      PRR arithmetic follows this entry.)
      Signed-off-by: Yuchung Cheng <ycheng@google.com>
      Acked-by: Neal Cardwell <ncardwell@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      684bad11
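      As a reference for the algorithm named above, a simplified sketch of the
      proportional part of PRR (per RFC 6937), in whole packets. It is not the
      kernel code; the variable names are illustrative only.

          /* Proportional Rate Reduction, proportional phase (sketch). */
          #include <stdio.h>

          /* prr_delivered: packets newly delivered to the receiver since the
           * reduction started; prr_out: packets sent since then; ssthresh: the
           * target cwnd; prior_cwnd: cwnd when the reduction started. */
          static unsigned int prr_sndcnt(unsigned int prr_delivered,
                                         unsigned int prr_out,
                                         unsigned int ssthresh,
                                         unsigned int prior_cwnd)
          {
              /* Send ssthresh/prior_cwnd of what was delivered (rounded up),
               * minus what was already sent during the reduction. */
              unsigned long long budget =
                  ((unsigned long long)prr_delivered * ssthresh + prior_cwnd - 1) /
                  prior_cwnd;
              return budget > prr_out ? (unsigned int)(budget - prr_out) : 0;
          }

          int main(void)
          {
              /* cwnd 20 reduced towards 10: after 4 packets delivered and 1 sent
               * during the reduction, one more packet may go out. */
              printf("sndcnt = %u\n", prr_sndcnt(4, 1, 10, 20));
              return 0;
          }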
  7. 01 September 2012 (1 commit)
    • tcp: TCP Fast Open Server - support TFO listeners · 8336886f
      Jerry Chu authored
      This patch builds on top of the previous patch to add the support
      for TFO listeners. This includes -
      
      1. allocating, properly initializing, and managing the per listener
      fastopen_queue structure when TFO is enabled
      
      2. changes to the inet_csk_accept code to support TFO. E.g., the
      request_sock can no longer be freed upon accept(), not until 3WHS
      finishes
      
      3. allowing a TCP_SYN_RECV socket to properly poll() and sendmsg()
      if it's a TFO socket
      
      4. properly closing a TFO listener, and a TFO socket before 3WHS
      finishes
      
      5. supporting TCP_FASTOPEN socket option
      
      6. modifying tcp_check_req() to check a TFO socket as well
      as a request_sock
      
      7. supporting TCP's TFO cookie option
      
      8. adding a new SYN-ACK retransmit handler to use the timer directly
      off the TFO socket rather than the listener socket. Note that TFO
      server side will not retransmit anything other than SYN-ACK until
      the 3WHS is completed.
      
      The patch also contains an important function,
      "reqsk_fastopen_remove()", to manage the somewhat complex relation
      between a listener, its request_sock, and the corresponding child
      socket. See the comment above the function for details. (A userspace sketch
      of a listener using the new TCP_FASTOPEN option follows this entry.)
      Signed-off-by: H.K. Jerry Chu <hkchu@google.com>
      Cc: Yuchung Cheng <ycheng@google.com>
      Cc: Neal Cardwell <ncardwell@google.com>
      Cc: Eric Dumazet <edumazet@google.com>
      Cc: Tom Herbert <therbert@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      8336886f
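      A userspace sketch of a listener using the TCP_FASTOPEN socket option added
      by this series (item 5 above). Port and queue length are placeholders, error
      handling is minimal, and the net.ipv4.tcp_fastopen sysctl must have the
      server bit enabled for TFO to actually take effect.

          #include <netinet/in.h>
          #include <netinet/tcp.h>
          #include <stdio.h>
          #include <string.h>
          #include <sys/socket.h>
          #include <unistd.h>

          int main(void)
          {
              int s = socket(AF_INET, SOCK_STREAM, 0);
              struct sockaddr_in addr;
              int qlen = 16;                /* max pending TFO requests (placeholder) */

              memset(&addr, 0, sizeof(addr));
              addr.sin_family = AF_INET;
              addr.sin_port = htons(8000);  /* placeholder port */
              addr.sin_addr.s_addr = htonl(INADDR_ANY);

              if (bind(s, (struct sockaddr *)&addr, sizeof(addr)) < 0 ||
                  setsockopt(s, IPPROTO_TCP, TCP_FASTOPEN, &qlen, sizeof(qlen)) < 0 ||
                  listen(s, 128) < 0) {
                  perror("tfo listener");
                  return 1;
              }

              int c = accept(s, NULL, NULL);   /* with TFO, SYN data may already be queued */
              char buf[512];
              ssize_t n = read(c, buf, sizeof(buf));
              printf("first read returned %zd bytes\n", n);
              close(c);
              close(s);
              return 0;
          }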
  8. 22 August 2012 (1 commit)
    • tcp: fix possible socket refcount problem · 144d56e9
      Eric Dumazet authored
      Commit 6f458dfb (tcp: improve latencies of timer triggered events)
      added a bug leading to the following trace:
      
      [ 2866.131281] IPv4: Attempt to release TCP socket in state 1 ffff880019ec0000
      [ 2866.131726]
      [ 2866.132188] =========================
      [ 2866.132281] [ BUG: held lock freed! ]
      [ 2866.132281] 3.6.0-rc1+ #622 Not tainted
      [ 2866.132281] -------------------------
      [ 2866.132281] kworker/0:1/652 is freeing memory ffff880019ec0000-ffff880019ec0a1f, with a lock still held there!
      [ 2866.132281]  (sk_lock-AF_INET-RPC){+.+...}, at: [<ffffffff81903619>] tcp_sendmsg+0x29/0xcc6
      [ 2866.132281] 4 locks held by kworker/0:1/652:
      [ 2866.132281]  #0:  (rpciod){.+.+.+}, at: [<ffffffff81083567>] process_one_work+0x1de/0x47f
      [ 2866.132281]  #1:  ((&task->u.tk_work)){+.+.+.}, at: [<ffffffff81083567>] process_one_work+0x1de/0x47f
      [ 2866.132281]  #2:  (sk_lock-AF_INET-RPC){+.+...}, at: [<ffffffff81903619>] tcp_sendmsg+0x29/0xcc6
      [ 2866.132281]  #3:  (&icsk->icsk_retransmit_timer){+.-...}, at: [<ffffffff81078017>] run_timer_softirq+0x1ad/0x35f
      [ 2866.132281]
      [ 2866.132281] stack backtrace:
      [ 2866.132281] Pid: 652, comm: kworker/0:1 Not tainted 3.6.0-rc1+ #622
      [ 2866.132281] Call Trace:
      [ 2866.132281]  <IRQ>  [<ffffffff810bc527>] debug_check_no_locks_freed+0x112/0x159
      [ 2866.132281]  [<ffffffff818a0839>] ? __sk_free+0xfd/0x114
      [ 2866.132281]  [<ffffffff811549fa>] kmem_cache_free+0x6b/0x13a
      [ 2866.132281]  [<ffffffff818a0839>] __sk_free+0xfd/0x114
      [ 2866.132281]  [<ffffffff818a08c0>] sk_free+0x1c/0x1e
      [ 2866.132281]  [<ffffffff81911e1c>] tcp_write_timer+0x51/0x56
      [ 2866.132281]  [<ffffffff81078082>] run_timer_softirq+0x218/0x35f
      [ 2866.132281]  [<ffffffff81078017>] ? run_timer_softirq+0x1ad/0x35f
      [ 2866.132281]  [<ffffffff810f5831>] ? rb_commit+0x58/0x85
      [ 2866.132281]  [<ffffffff81911dcb>] ? tcp_write_timer_handler+0x148/0x148
      [ 2866.132281]  [<ffffffff81070bd6>] __do_softirq+0xcb/0x1f9
      [ 2866.132281]  [<ffffffff81a0a00c>] ? _raw_spin_unlock+0x29/0x2e
      [ 2866.132281]  [<ffffffff81a1227c>] call_softirq+0x1c/0x30
      [ 2866.132281]  [<ffffffff81039f38>] do_softirq+0x4a/0xa6
      [ 2866.132281]  [<ffffffff81070f2b>] irq_exit+0x51/0xad
      [ 2866.132281]  [<ffffffff81a129cd>] do_IRQ+0x9d/0xb4
      [ 2866.132281]  [<ffffffff81a0a3ef>] common_interrupt+0x6f/0x6f
      [ 2866.132281]  <EOI>  [<ffffffff8109d006>] ? sched_clock_cpu+0x58/0xd1
      [ 2866.132281]  [<ffffffff81a0a172>] ? _raw_spin_unlock_irqrestore+0x4c/0x56
      [ 2866.132281]  [<ffffffff81078692>] mod_timer+0x178/0x1a9
      [ 2866.132281]  [<ffffffff818a00aa>] sk_reset_timer+0x19/0x26
      [ 2866.132281]  [<ffffffff8190b2cc>] tcp_rearm_rto+0x99/0xa4
      [ 2866.132281]  [<ffffffff8190dfba>] tcp_event_new_data_sent+0x6e/0x70
      [ 2866.132281]  [<ffffffff8190f7ea>] tcp_write_xmit+0x7de/0x8e4
      [ 2866.132281]  [<ffffffff818a565d>] ? __alloc_skb+0xa0/0x1a1
      [ 2866.132281]  [<ffffffff8190f952>] __tcp_push_pending_frames+0x2e/0x8a
      [ 2866.132281]  [<ffffffff81904122>] tcp_sendmsg+0xb32/0xcc6
      [ 2866.132281]  [<ffffffff819229c2>] inet_sendmsg+0xaa/0xd5
      [ 2866.132281]  [<ffffffff81922918>] ? inet_autobind+0x5f/0x5f
      [ 2866.132281]  [<ffffffff810ee7f1>] ? trace_clock_local+0x9/0xb
      [ 2866.132281]  [<ffffffff8189adab>] sock_sendmsg+0xa3/0xc4
      [ 2866.132281]  [<ffffffff810f5de6>] ? rb_reserve_next_event+0x26f/0x2d5
      [ 2866.132281]  [<ffffffff8103e6a9>] ? native_sched_clock+0x29/0x6f
      [ 2866.132281]  [<ffffffff8103e6f8>] ? sched_clock+0x9/0xd
      [ 2866.132281]  [<ffffffff810ee7f1>] ? trace_clock_local+0x9/0xb
      [ 2866.132281]  [<ffffffff8189ae03>] kernel_sendmsg+0x37/0x43
      [ 2866.132281]  [<ffffffff8199ce49>] xs_send_kvec+0x77/0x80
      [ 2866.132281]  [<ffffffff8199cec1>] xs_sendpages+0x6f/0x1a0
      [ 2866.132281]  [<ffffffff8107826d>] ? try_to_del_timer_sync+0x55/0x61
      [ 2866.132281]  [<ffffffff8199d0d2>] xs_tcp_send_request+0x55/0xf1
      [ 2866.132281]  [<ffffffff8199bb90>] xprt_transmit+0x89/0x1db
      [ 2866.132281]  [<ffffffff81999bcd>] ? call_connect+0x3c/0x3c
      [ 2866.132281]  [<ffffffff81999d92>] call_transmit+0x1c5/0x20e
      [ 2866.132281]  [<ffffffff819a0d55>] __rpc_execute+0x6f/0x225
      [ 2866.132281]  [<ffffffff81999bcd>] ? call_connect+0x3c/0x3c
      [ 2866.132281]  [<ffffffff819a0f33>] rpc_async_schedule+0x28/0x34
      [ 2866.132281]  [<ffffffff810835d6>] process_one_work+0x24d/0x47f
      [ 2866.132281]  [<ffffffff81083567>] ? process_one_work+0x1de/0x47f
      [ 2866.132281]  [<ffffffff819a0f0b>] ? __rpc_execute+0x225/0x225
      [ 2866.132281]  [<ffffffff81083a6d>] worker_thread+0x236/0x317
      [ 2866.132281]  [<ffffffff81083837>] ? process_scheduled_works+0x2f/0x2f
      [ 2866.132281]  [<ffffffff8108b7b8>] kthread+0x9a/0xa2
      [ 2866.132281]  [<ffffffff81a12184>] kernel_thread_helper+0x4/0x10
      [ 2866.132281]  [<ffffffff81a0a4b0>] ? retint_restore_args+0x13/0x13
      [ 2866.132281]  [<ffffffff8108b71e>] ? __init_kthread_worker+0x5a/0x5a
      [ 2866.132281]  [<ffffffff81a12180>] ? gs_change+0x13/0x13
      [ 2866.308506] IPv4: Attempt to release TCP socket in state 1 ffff880019ec0000
      [ 2866.309689] =============================================================================
      [ 2866.310254] BUG TCP (Not tainted): Object already free
      [ 2866.310254] -----------------------------------------------------------------------------
      [ 2866.310254]
      
      The bug comes from the fact that the timer set in sk_reset_timer() can run
      before we actually do the sock_hold(). The socket refcount reaches zero and
      we free the socket too soon.
      
      The timer handler is not allowed to reduce the socket refcnt if the socket
      is owned by the user, or we need to change the sk_reset_timer() implementation.
      
      We should take a reference on the socket in case the TCP_WRITE_TIMER_DEFERRED
      or TCP_DELACK_TIMER_DEFERRED bits are set in tsq_flags.
      
      Also fix a typo in tcp_delack_timer(), where TCP_WRITE_TIMER_DEFERRED
      was used instead of TCP_DELACK_TIMER_DEFERRED.
      
      For consistency, use the same socket refcount change for TCP_MTU_REDUCED_DEFERRED,
      even if it is not fired from a timer.
      Reported-by: Fengguang Wu <fengguang.wu@intel.com>
      Tested-by: Fengguang Wu <fengguang.wu@intel.com>
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      144d56e9
  9. 07 August 2012 (1 commit)
  10. 02 August 2012 (1 commit)
  11. 01 August 2012 (1 commit)
  12. 23 July 2012 (1 commit)
    • tcp: dont drop MTU reduction indications · 563d34d0
      Eric Dumazet authored
      ICMP messages generated in the output path when the frame length is bigger
      than the MTU are actually lost, because the socket is owned by the user
      (doing the xmit).
      
      One example is the ipgre_tunnel_xmit() calling
      icmp_send(skb, ICMP_DEST_UNREACH, ICMP_FRAG_NEEDED, htonl(mtu));
      
      We had a similar case fixed in commit a34a101e (ipv6: disable GSO on
      sockets hitting dst_allfrag).
      
      The problem with such a fix is that it relied on retransmit timers, so short
      TCP sessions paid too large a latency-increase price.
      
      This patch uses the tcp_release_cb() infrastructure so that MTU
      reduction messages (ICMP messages) are not lost, and no extra delay
      is added in TCP transmits.
      Reported-by: Maciej Żenczykowski <maze@google.com>
      Diagnosed-by: Neal Cardwell <ncardwell@google.com>
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Cc: Nandita Dukkipati <nanditad@google.com>
      Cc: Tom Herbert <therbert@google.com>
      Cc: Tore Anderson <tore@fud.no>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      563d34d0
  13. 21 July 2012 (1 commit)
    • tcp: improve latencies of timer triggered events · 6f458dfb
      Eric Dumazet authored
      The modern TCP stack depends heavily on tcp_write_timer() having a small
      latency, but the current implementation doesn't exactly meet that
      expectation.
      
      When a timer fires but finds the socket is owned by the user, it rearms
      itself for an additional delay hoping next run will be more
      successful.
      
      tcp_write_timer(), for example, uses a 50ms delay for the next try, and that
      defeats many attempts to get predictable TCP behavior in terms of
      latency.
      
      Use the recently introduced tcp_release_cb(), so that the user owning
      the socket will run the various handlers right before releasing the socket.
      (A generic sketch of this deferred-flags pattern follows this entry.)
      
      This will permit us to post a follow-up patch to address the
      tcp_tso_should_defer() syndrome (some deferred packets have to wait for the
      RTO timer to be transmitted, while cwnd should allow us to send them
      sooner).
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Cc: Tom Herbert <therbert@google.com>
      Cc: Yuchung Cheng <ycheng@google.com>
      Cc: Neal Cardwell <ncardwell@google.com>
      Cc: Nandita Dukkipati <nanditad@google.com>
      Cc: H.K. Jerry Chu <hkchu@google.com>
      Cc: John Heffner <johnwheffner@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      6f458dfb
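      A generic userspace sketch of the deferral pattern described above (not the
      kernel code): a timer that fires while the socket is owned only records a
      flag, and the owner runs the pending handlers right before releasing the
      socket, instead of the timer rearming itself for a later retry.

          #include <stdatomic.h>
          #include <stdio.h>

          enum { WRITE_TIMER_DEFERRED = 1u << 0, DELACK_TIMER_DEFERRED = 1u << 1 };

          static atomic_uint tsq_flags;

          static void timer_fired(unsigned int what)
          {
              atomic_fetch_or(&tsq_flags, what);   /* socket owned: just mark it */
          }

          static void release_sock_cb(void)        /* run deferred work at unlock */
          {
              unsigned int flags = atomic_exchange(&tsq_flags, 0);

              if (flags & WRITE_TIMER_DEFERRED)
                  printf("running the retransmit timer handler now\n");
              if (flags & DELACK_TIMER_DEFERRED)
                  printf("running the delayed-ACK handler now\n");
          }

          int main(void)
          {
              timer_fired(WRITE_TIMER_DEFERRED);   /* fired while "owned" */
              release_sock_cb();                   /* handled at release, not 50ms later */
              return 0;
          }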
  14. 20 July 2012 (4 commits)
    • net-tcp: Fast Open client - cookie-less mode · 67da22d2
      Yuchung Cheng authored
      In trusted networks, e.g., an intranet or data-center, the client does not
      need to use a Fast Open cookie to mitigate DoS attacks. In cookie-less
      mode, sendmsg() with the MSG_FASTOPEN flag will send SYN-data regardless
      of cookie availability.
      Signed-off-by: Yuchung Cheng <ycheng@google.com>
      Acked-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      67da22d2
    • net-tcp: Fast Open client - detecting SYN-data drops · aab48743
      Yuchung Cheng authored
      On paths with firewalls dropping SYN packets carrying data or experimental
      TCP options, Fast Open connections will experience SYN timeouts and bad
      performance. The solution is to track such incidents in the cookie cache
      and disable Fast Open temporarily.
      
      Since only the original SYN includes data and/or the Fast Open option, the
      SYN-ACK has some tell-tale sign (tcp_rcv_fastopen_synack()) to detect
      such drops. If a path has recurring Fast Open SYN drops, Fast Open is
      disabled for 2^(recurring_losses) minutes, starting from four minutes up to
      roughly one and a half days. sendmsg() with the MSG_FASTOPEN flag will still
      succeed, but it behaves as connect() then write().
      Signed-off-by: Yuchung Cheng <ycheng@google.com>
      Acked-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      aab48743
    • net-tcp: Fast Open client - sending SYN-data · 783237e8
      Yuchung Cheng authored
      This patch implements sending SYN-data in tcp_connect(). The data is
      from tcp_sendmsg() with flag MSG_FASTOPEN (implemented in a later patch).
      
      The length of the cookie in tcp_fastopen_req, init'd to 0, controls the
      type of the SYN. If the cookie is not cached (len == 0), the host sends a
      data-less SYN with the Fast Open cookie request option to solicit a cookie
      from the remote. If a cookie is available (len > 0), the host sends
      a SYN-data with the Fast Open cookie option. If the cookie length is negative,
      the SYN will not include any Fast Open option (for fallback operations).
      
      To deal with middleboxes that may drop SYN packets with data or experimental
      TCP options, the SYN-data is only sent once. SYN retransmits do not include
      data or Fast Open options; the connection falls back to the regular TCP
      handshake. (A userspace client sketch using MSG_FASTOPEN follows this entry.)
      Signed-off-by: Yuchung Cheng <ycheng@google.com>
      Acked-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      783237e8
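      A userspace sketch of the client side this enables: the payload is handed to
      the kernel together with the connection attempt, so it can ride on the SYN
      when a cookie is cached, or trigger a cookie request otherwise. Address,
      port and payload are placeholders; the client bit of net.ipv4.tcp_fastopen
      must be enabled.

          #include <arpa/inet.h>
          #include <netinet/tcp.h>
          #include <stdio.h>
          #include <string.h>
          #include <sys/socket.h>
          #include <unistd.h>

          #ifndef MSG_FASTOPEN
          #define MSG_FASTOPEN 0x20000000
          #endif

          int main(void)
          {
              int s = socket(AF_INET, SOCK_STREAM, 0);
              struct sockaddr_in dst;
              const char req[] = "GET / HTTP/1.0\r\n\r\n";

              memset(&dst, 0, sizeof(dst));
              dst.sin_family = AF_INET;
              dst.sin_port = htons(80);
              inet_pton(AF_INET, "192.0.2.1", &dst.sin_addr);  /* placeholder host */

              /* No connect(): sendto() with MSG_FASTOPEN initiates the handshake
               * and queues the payload so it can be carried on the SYN. */
              if (sendto(s, req, sizeof(req) - 1, MSG_FASTOPEN,
                         (struct sockaddr *)&dst, sizeof(dst)) < 0) {
                  perror("sendto(MSG_FASTOPEN)");
                  return 1;
              }

              char buf[1024];
              ssize_t n = read(s, buf, sizeof(buf));
              printf("read %zd bytes\n", n);
              close(s);
              return 0;
          }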
    • net-tcp: Fast Open base · 2100c8d2
      Yuchung Cheng authored
      This patch implements the common code for both the client and server.
      
      1. TCP Fast Open option processing. Since Fast Open does not have an
         option number assigned by IANA yet, it shares the experimental option
         code 254 by implementing draft-ietf-tcpm-experimental-options
         with a 16-bit magic number 0xF989. This enables global experiments
         without clashing with the scarce (2) experimental option codes available for TCP.
      
         When the draft status becomes standard (maybe), the client should
         switch to the newly assigned option number while the server supports
         both numbers for the transition. (A byte-layout sketch of the shared
         experimental option follows this entry.)
      
      2. The new sysctl tcp_fastopen
      
      3. A placeholder init function
      Signed-off-by: Yuchung Cheng <ycheng@google.com>
      Acked-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      2100c8d2
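      A small sketch of how the Fast Open option rides on the shared experimental
      option described in item 1: kind 254, total length, the 16-bit magic 0xF989,
      then the cookie bytes (empty for a cookie request). Illustrative only; the
      cookie value is a placeholder.

          #include <stdint.h>
          #include <stdio.h>
          #include <string.h>

          static size_t build_tfo_exp_option(uint8_t *out, const uint8_t *cookie,
                                             size_t cookie_len)
          {
              size_t len = 4 + cookie_len;         /* kind + len + 16-bit magic + cookie */

              out[0] = 254;                        /* experimental option kind */
              out[1] = (uint8_t)len;               /* total option length      */
              out[2] = 0xF9;                       /* magic 0xF989, network byte order */
              out[3] = 0x89;
              memcpy(out + 4, cookie, cookie_len); /* zero-length for a cookie request */
              return len;
          }

          int main(void)
          {
              uint8_t cookie[8] = { 0x01, 0x23, 0x45, 0x67, 0x89, 0xab, 0xcd, 0xef };
              uint8_t opt[40];
              size_t n = build_tfo_exp_option(opt, cookie, sizeof(cookie));

              for (size_t i = 0; i < n; i++)
                  printf("%02x ", opt[i]);
              printf("\n");
              return 0;
          }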
  15. 13 July 2012 (1 commit)
  16. 12 July 2012 (1 commit)
    • tcp: TCP Small Queues · 46d3ceab
      Eric Dumazet authored
      This introduces TSQ (TCP Small Queues).
      
      TSQ's goal is to reduce the number of TCP packets in xmit queues (qdisc &
      device queues), to reduce RTT and cwnd bias, part of the bufferbloat
      problem.
      
      sk->sk_wmem_alloc is not allowed to grow above a given limit,
      allowing no more than ~128KB [1] per TCP socket in the qdisc/dev layers at a
      given time.
      
      TSO packets are sized/capped to half the limit, so that we have two
      TSO packets in flight, allowing better bandwidth use.
      
      As a side effect, setting the limit to 40000 automatically reduces the
      standard GSO max limit (65536) to 40000/2: it can help reduce the
      latencies of high-priority packets, thanks to smaller TSO packets.
      
      This means we divert sock_wfree() to a tcp_wfree() handler, to
      queue/send following frames when skb_orphan() [2] is called for the
      already queued skbs.
      
      Results on my dev machines (tg3/ixgbe nics) are really impressive,
      using standard pfifo_fast, and with or without TSO/GSO.
      
      Without reduction of nominal bandwidth, we have reduction of buffering
      per bulk sender :
      < 1ms on Gbit (instead of 50ms with TSO)
      < 8ms on 100Mbit (instead of 132 ms)
      
      I no longer have 4 MBytes backlogged in the qdisc by a single netperf
      session, and socket autotuning on both sides no longer uses 4 MBytes.
      
      As the skb destructor cannot restart xmit itself (as the qdisc lock might be
      taken at this point), we delegate the work to a tasklet. We use one
      tasklet per cpu for performance reasons.
      
      If the tasklet finds a socket owned by the user, it sets the TSQ_OWNED flag.
      This flag is tested in a new protocol method called from release_sock(),
      to eventually send new segments. (A simplified sketch of the queueing limit
      follows this entry.)
      
      [1] New /proc/sys/net/ipv4/tcp_limit_output_bytes tunable
      [2] skb_orphan() is usually called at TX completion time,
        but some drivers call it in their start_xmit() handler.
        These drivers should at least use BQL, or else a single TCP
        session can still fill the whole NIC TX ring, since TSQ will
        have no effect.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Cc: Dave Taht <dave.taht@bufferbloat.net>
      Cc: Tom Herbert <therbert@google.com>
      Cc: Matt Mathis <mattmathis@google.com>
      Cc: Yuchung Cheng <ycheng@google.com>
      Cc: Nandita Dukkipati <nanditad@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      46d3ceab
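      An illustrative userspace sketch of the queueing limit, not the kernel
      implementation: stop feeding the qdisc/device once a socket already has
      roughly the limit's worth of bytes below the TCP stack, and let the skb
      destructor restart transmission at TX completion time.

          #include <stdbool.h>
          #include <stdio.h>

          #define TSQ_LIMIT_BYTES (128 * 1024)  /* /proc/sys/net/ipv4/tcp_limit_output_bytes */

          struct fake_sock {
              unsigned long wmem_alloc;         /* bytes already in qdisc/device queues */
              bool throttled;                   /* set when the limit stops us */
          };

          static bool tsq_can_send(struct fake_sock *sk, unsigned long skb_len)
          {
              if (sk->wmem_alloc + skb_len > TSQ_LIMIT_BYTES) {
                  sk->throttled = true;         /* tcp_wfree() would resume us later */
                  return false;
              }
              sk->wmem_alloc += skb_len;
              return true;
          }

          int main(void)
          {
              struct fake_sock sk = { 0, false };
              unsigned long sent = 0;

              while (tsq_can_send(&sk, 64 * 1024))   /* two ~64KB TSO packets fit */
                  sent += 64 * 1024;
              printf("queued %lu bytes before throttling\n", sent);
              return 0;
          }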
  17. 04 June 2012 (2 commits)
  18. 18 May 2012 (1 commit)
  19. 16 May 2012 (2 commits)
  20. 03 May 2012 (1 commit)
  21. 27 April 2012 (1 commit)
    • ipv6: RTAX_FEATURE_ALLFRAG causes inefficient TCP segment sizing · 67469601
      Eric Dumazet authored
      Quoting Tore Anderson from :
      https://bugzilla.kernel.org/show_bug.cgi?id=42572
      
      When RTAX_FEATURE_ALLFRAG is set on a route, the effective TCP segment
      size does not take into account the size of the IPv6 Fragmentation
      header that needs to be included in outbound packets, causing every
      transmitted TCP segment to be fragmented across two IPv6 packets, the
      latter of which will only contain 8 bytes of actual payload.
      
      RTAX_FEATURE_ALLFRAG is typically set on a route in response to
      receiving an ICMPv6 Packet Too Big message indicating a Path MTU of less
      than 1280 bytes. 1280 bytes is the minimum IPv6 MTU, however ICMPv6
      PTBs with MTU < 1280 are still valid, in particular when an IPv6
      packet is sent to an IPv4 destination through a stateless translator.
      Any ICMPv4 Need To Fragment packets originated from the IPv4 part of
      the path will be translated to ICMPv6 PTB which may then indicate an
      MTU of less than 1280.
      
      The Linux kernel refuses to reduce the effective MTU to anything below
      1280 bytes, instead it sets it to exactly 1280 bytes, and
      RTAX_FEATURE_ALLFRAG is also set. However, the TCP segment size appears
      to be set to 1240 bytes (1280 Path MTU - 40 bytes of IPv6 header),
      instead of 1232 (additionally taking into account the 8 bytes required
      by the IPv6 Fragmentation extension header).
      
      This in turn results in rather inefficient transmission, as every
      transmitted TCP segment now is split in two fragments containing
      1232+8 bytes of payload.
      
      After this patch, all outgoing packets that include a
      Fragmentation header are "atomic" or "non-fragmented" fragments,
      i.e., they have both Offset=0 and More Fragments=0. (The segment-size
      arithmetic is sketched after this entry.)
      
      With help from David S. Miller
      Reported-by: Tore Anderson <tore@fud.no>
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Cc: Maciej Żenczykowski <maze@google.com>
      Cc: Tom Herbert <therbert@google.com>
      Tested-by: Tore Anderson <tore@fud.no>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      67469601
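      The arithmetic behind the fix, as a tiny sketch: with RTAX_FEATURE_ALLFRAG
      set, the 8-byte Fragmentation header has to come out of the segment size too.

          #include <stdio.h>

          int main(void)
          {
              int path_mtu = 1280;   /* minimum IPv6 MTU, as clamped by the kernel */
              int ipv6_hdr = 40;
              int frag_hdr = 8;      /* IPv6 Fragmentation extension header */

              printf("old segment size: %d\n", path_mtu - ipv6_hdr);            /* 1240 */
              printf("new segment size: %d\n", path_mtu - ipv6_hdr - frag_hdr); /* 1232 */
              return 0;
          }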
  22. 22 April 2012 (3 commits)
    • tcp: Repair socket queues · c0e88ff0
      Pavel Emelyanov authored
      Reading the queues under repair mode is done with a recvmsg call.
      The queue-under-repair set by the TCP_REPAIR_QUEUE option is used
      to determine which queue should be read. Thus both the send and
      receive queues can be read this way.
      
      The caller must pass the MSG_PEEK flag.
      
      Writing to the queues is done with a sendmsg call, and yet again
      the repair-queue option can be used to push data into the
      receive queue.
      
      When putting an skb into the receive queue, a zeroed TCP header is
      appended to its head to satisfy the tcp_hdr(skb)->syn and
      ->fin checks performed by tcp_recvmsg after repair. Both of these
      flags are set to zero, which is why the zeroed header works.
      
      A FIN cannot be met in the queue while reading the source
      socket, since the repair only works for closed/established
      sockets and queueing a FIN packet always changes their state.
      
      A SYN in the queue denotes that the respective skb's seq
      is "off-by-one" as compared to the actual payload length. Thus,
      at the rcv queue refill we can just drop this flag and set the
      skb's sequences to precise values.
      
      When the repair mode is turned off, the write queue seqs are
      updated so that the whole queue is considered to be 'already sent,
      waiting for ACKs' (write_seq = snd_nxt <= snd_una). From the
      protocol POV the send queue looks like it was sent, but the data
      between write_seq and snd_nxt is lost in the network.
      
      This helps to avoid another sockoption for setting the snd_nxt
      sequence. Leaving the whole queue in a 'not yet sent' state (as
      it would be after sendmsg-s) would not allow any ACKs to be received
      from the peer, since the ack_seq would be after snd_nxt. Thus
      even the ACK for the window probe would be dropped and the
      connection would be 'locked' with a zero peer window. (A userspace
      sketch of reading the send queue under repair follows this entry.)
      Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      c0e88ff0
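      A userspace sketch of reading a socket's send queue under repair mode, per
      the description above. The socket is assumed to be established and owned by
      a CAP_NET_ADMIN process; fallback defines cover older headers and error
      handling is trimmed.

          #include <netinet/tcp.h>
          #include <stdio.h>
          #include <sys/socket.h>

          #ifndef TCP_REPAIR
          #define TCP_REPAIR        19
          #define TCP_REPAIR_QUEUE  20
          #endif
          #ifndef TCP_SEND_QUEUE
          #define TCP_RECV_QUEUE    1
          #define TCP_SEND_QUEUE    2
          #endif

          static ssize_t dump_send_queue(int sk, char *buf, size_t len)
          {
              int on = 1, queue = TCP_SEND_QUEUE;

              setsockopt(sk, IPPROTO_TCP, TCP_REPAIR, &on, sizeof(on));
              setsockopt(sk, IPPROTO_TCP, TCP_REPAIR_QUEUE, &queue, sizeof(queue));

              /* MSG_PEEK is mandatory: the data must stay queued for the peer. */
              return recv(sk, buf, len, MSG_PEEK | MSG_DONTWAIT);
          }

          int main(void)
          {
              char buf[65536];
              int sk = -1;   /* placeholder: an established socket from the dumped task */
              ssize_t n = dump_send_queue(sk, buf, sizeof(buf));

              printf("send queue holds %zd bytes\n", n);
              return 0;
          }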
    • tcp: Initial repair mode · ee995283
      Pavel Emelyanov authored
      This includes (according to the previous description):
      
      * TCP_REPAIR sockoption
      
      This one just puts the socket in/out of the repair mode.
      Allowed for CAP_NET_ADMIN and for closed/established sockets only.
      When repair mode is turned off and the socket happens to be in
      the established state, a window probe is sent to the peer to
      'unlock' the connection.
      
      * TCP_REPAIR_QUEUE sockoption
      
      This one sets the queue which we're about to repair. The
      'no-queue' is set by default.
      
      * TCP_QUEUE_SEQ sockoption
      
      Sets the write_seq/rcv_nxt of a selected repaired queue.
      Allowed for TCP_CLOSE-d sockets only. When the socket changes
      its state, the other seq-s are changed by the kernel according
      to the protocol rules (most of the existing code is actually
      reused).
      
      * Ability to forcibly bind a socket to a port
      
      The sk->sk_reuse is set to SK_FORCE_REUSE.
      
      * Immediate connect modification
      
      The connect syscall initializes the connection, then directly jumps
      to the code which finalizes it.
      
      * Silent close modification
      
      The close just aborts the connection (similar to SO_LINGER with 0
      time) but without sending any FIN/RST-s to the peer. (A restore-flow
      sketch using these options follows this entry.)
      Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      ee995283
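      A restore-flow sketch tying these options together, roughly as a checkpoint
      or restore tool would use them. Addresses, ports and sequence numbers are
      placeholders, CAP_NET_ADMIN is required, fallback defines cover older
      headers, and error handling is trimmed.

          #include <arpa/inet.h>
          #include <netinet/tcp.h>
          #include <sys/socket.h>
          #include <unistd.h>

          #ifndef TCP_REPAIR
          #define TCP_REPAIR        19
          #define TCP_REPAIR_QUEUE  20
          #define TCP_QUEUE_SEQ     21
          #endif
          #ifndef TCP_SEND_QUEUE
          #define TCP_NO_QUEUE      0
          #define TCP_RECV_QUEUE    1
          #define TCP_SEND_QUEUE    2
          #endif

          int main(void)
          {
              int sk = socket(AF_INET, SOCK_STREAM, 0);
              int on = 1, q;
              unsigned int snd_seq = 0x11223344, rcv_seq = 0x55667788; /* from the dump */
              struct sockaddr_in self = { .sin_family = AF_INET, .sin_port = htons(34567) };
              struct sockaddr_in peer = { .sin_family = AF_INET, .sin_port = htons(80) };

              inet_pton(AF_INET, "192.0.2.10", &self.sin_addr);
              inet_pton(AF_INET, "192.0.2.1",  &peer.sin_addr);

              setsockopt(sk, IPPROTO_TCP, TCP_REPAIR, &on, sizeof(on));

              q = TCP_SEND_QUEUE;   /* preset write_seq of the send queue */
              setsockopt(sk, IPPROTO_TCP, TCP_REPAIR_QUEUE, &q, sizeof(q));
              setsockopt(sk, IPPROTO_TCP, TCP_QUEUE_SEQ, &snd_seq, sizeof(snd_seq));

              q = TCP_RECV_QUEUE;   /* preset rcv_nxt of the receive queue */
              setsockopt(sk, IPPROTO_TCP, TCP_REPAIR_QUEUE, &q, sizeof(q));
              setsockopt(sk, IPPROTO_TCP, TCP_QUEUE_SEQ, &rcv_seq, sizeof(rcv_seq));

              bind(sk, (struct sockaddr *)&self, sizeof(self));    /* forced port reuse */
              connect(sk, (struct sockaddr *)&peer, sizeof(peer)); /* completes without a handshake */

              on = 0;   /* leaving repair mode on an established socket sends the window probe */
              setsockopt(sk, IPPROTO_TCP, TCP_REPAIR, &on, sizeof(on));
              close(sk);
              return 0;
          }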
    • tcp: Move code around · 370816ae
      Pavel Emelyanov authored
      This is just a preparation patch, which makes the code needed for
      TCP repair ready for use.
      Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      370816ae
  23. 19 April 2012 (1 commit)
    • tcp: fix retransmit of partially acked frames · 22b4a4f2
      Eric Dumazet authored
      Alexander Beregalov reported skb_over_panic errors and provided a stack
      trace.
      
      It turns out commit a21d4572 (tcp: avoid order-1 allocations on wifi and
      tx path) added a regression when a retransmit is done after a partial
      ACK.
      
      tcp_retransmit_skb() tries to aggregate several frames if the first one
      has enough available room to hold the payload of the following ones. This is
      controlled by the /proc/sys/net/ipv4/tcp_retrans_collapse tunable (default:
      enabled).
      
      The problem is that we must make sure _pskb_trim_head() doesn't fool
      skb_availroom() when pulling some bytes from the skb (this pull is done when
      the receiver ACKs part of the frame).
      Reported-by: Alexander Beregalov <a.beregalov@gmail.com>
      Cc: Marc MERLIN <marc@merlins.org>
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      22b4a4f2
  24. 16 April 2012 (1 commit)
  25. 11 April 2012 (1 commit)
    • tcp: avoid order-1 allocations on wifi and tx path · a21d4572
      Eric Dumazet authored
      Marc Merlin reported many order-1 allocation failures in the TX path on his
      wireless setup, which don't make any sense on an MTU=1500 network with
      non-SG-capable hardware.
      
      After investigation, it turns out TCP uses sk_stream_alloc_skb() and
      uses skb_tailroom(skb), by convention, to know how many bytes of data
      payload can be put into the skb (for non-SG-capable devices).
      
      Note: these skbs use kmalloc-4096 (MTU=1500 + MAX_HEADER +
      sizeof(struct skb_shared_info) being above 2048).
      
      Later, the mac80211 layer needs to add some bytes at the tail of the skb
      (IEEE80211_ENCRYPT_TAILROOM = 18 bytes) and, since no more tailroom is
      available, has to call pskb_expand_head() and request order-1
      allocations.
      
      This patch changes sk_stream_alloc_skb() so that only
      sk->sk_prot->max_header bytes of headroom are reserved, and use a new
      skb field, avail_size to hold the data payload limit.
      
      This way, order-0 allocations done by the TCP stack can leave more than 2 KB
      of tailroom, and no further allocation is performed in the mac80211 layer (or
      any layer needing some tailroom).
      
      avail_size is unioned with mark/dropcount, since mark will be set later
      in IP stack for output packets. Therefore, skb size is unchanged.
      Reported-by: Marc MERLIN <marc@merlins.org>
      Tested-by: Marc MERLIN <marc@merlins.org>
      Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      a21d4572
  26. 31 January 2012 (1 commit)
    • tcp: fix tcp_trim_head() to adjust segment count with skb MSS · 5b35e1e6
      Neal Cardwell authored
      This commit fixes tcp_trim_head() to recalculate the number of
      segments in the skb with the skb's existing MSS, so trimming the head
      causes the skb segment count to be monotonically non-increasing - it
      should stay the same or go down, but not increase.
      
      Previously tcp_trim_head() used the current MSS of the connection. But
      if there was a decrease in MSS between original transmission and ACK
      (e.g. due to PMTUD), this could cause tcp_trim_head() to
      counter-intuitively increase the segment count when trimming bytes off
      the head of an skb. This violated assumptions in tcp_tso_acked() that
      tcp_trim_head() only decreases the packet count, so that packets_acked
      in tcp_tso_acked() could underflow, leading tcp_clean_rtx_queue() to
      pass u32 pkts_acked values as large as 0xffffffff to
      ca_ops->pkts_acked().
      
      As an aside, if tcp_trim_head() had really wanted the skb to reflect
      the current MSS, it should have called tcp_set_skb_tso_segs()
      unconditionally, since a decrease in MSS would mean that a
      single-packet skb should now be sliced into multiple segments. (The
      segment-count arithmetic is sketched after this entry.)
      Signed-off-by: Neal Cardwell <ncardwell@google.com>
      Acked-by: Nandita Dukkipati <nanditad@google.com>
      Acked-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      5b35e1e6
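      The arithmetic the fix addresses, as a small sketch: recomputing the segment
      count with a smaller current MSS after trimming acked bytes can make the
      count go up, which tcp_tso_acked() never expects.

          #include <stdio.h>

          static unsigned int seg_count(unsigned int len, unsigned int mss)
          {
              return (len + mss - 1) / mss;   /* packets needed to carry len bytes */
          }

          int main(void)
          {
              unsigned int orig_mss = 1460, new_mss = 512;  /* MSS dropped via PMTUD */
              unsigned int len = 4380;                      /* 3 segments at MSS 1460 */

              printf("before trim:                       %u segs\n", seg_count(len, orig_mss));
              printf("old code, trim 100 bytes, new MSS: %u segs\n", seg_count(len - 100, new_mss));
              printf("fixed, trim 100 bytes, skb MSS:    %u segs\n", seg_count(len - 100, orig_mss));
              return 0;
          }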
  27. 27 January 2012 (1 commit)
  28. 13 December 2011 (1 commit)
  29. 06 December 2011 (1 commit)
  30. 05 December 2011 (1 commit)
    • tcp: take care of misalignments · 117632e6
      Eric Dumazet authored
      We discovered that the TCP stack could retransmit misaligned skbs if a
      malicious peer acknowledged a sub-MSS frame. This can currently happen
      only if the output interface is not SG-enabled: if SG is enabled, TCP
      builds headless skbs (all payload is included in fragments), so the TCP
      trimming process only removes parts of skb fragments and the header stays
      aligned.
      
      Some arches can't handle misalignments, so force a head reallocation and
      shrink the headroom to MAX_TCP_HEADER.
      
      Don't care about misalignments on x86 and PPC (or other arches setting
      NET_IP_ALIGN to 0).
      
      This patch introduces __pskb_copy(), which can specify the headroom of
      the new head; pskb_copy() becomes a wrapper on top of __pskb_copy().
      Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      117632e6
  31. 29 November 2011 (1 commit)
  32. 09 November 2011 (1 commit)
  33. 21 October 2011 (1 commit)