1. 09 May, 2012 29 commits
  2. 05 May, 2012 1 commit
    • tcp: be more strict before accepting ECN negociation · bd14b1b2
      Committed by Eric Dumazet
      It appears some networks play bad games with the two bits reserved for
      ECN. This can trigger false congestion notifications and very slow
      transfers.
      
      Since RFC 3168 (6.1.1) forbids SYN packets from carrying CT bits, we can
      disable TCP ECN negotiation if we receive mangled CT bits in
      the SYN packet.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Cc: Perry Lorier <perryl@google.com>
      Cc: Matt Mathis <mattmathis@google.com>
      Cc: Yuchung Cheng <ycheng@google.com>
      Cc: Neal Cardwell <ncardwell@google.com>
      Cc: Wilmer van der Gaast <wilmer@google.com>
      Cc: Ankur Jain <jankur@google.com>
      Cc: Tom Herbert <therbert@google.com>
      Cc: Dave Täht <dave.taht@bufferbloat.net>
      Acked-by: Neal Cardwell <ncardwell@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
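The check described in this commit message can be sketched in userspace C. This is a minimal illustration and not the kernel patch itself: the helper name `syn_allows_ecn` and its parameters are hypothetical. It assumes that the ECN field is the two low-order bits of the IP TOS (or IPv6 traffic class) byte, as defined in RFC 3168, and that a SYN requesting ECN carries both the ECE and CWR TCP flags.

```c
#include <stdbool.h>
#include <stdint.h>

/* The ECN field occupies the two low-order bits of the IP TOS byte (RFC 3168). */
#define ECN_MASK    0x03u
#define ECN_NOT_ECT 0x00u

/*
 * Hypothetical helper: decide whether to negotiate ECN for an incoming SYN.
 *   tos      - IP TOS / traffic class byte of the SYN packet
 *   ece, cwr - TCP flags taken from the SYN header
 */
static bool syn_allows_ecn(uint8_t tos, bool ece, bool cwr)
{
    if (!ece || !cwr)
        return false;   /* the peer did not request ECN */

    /* A well-formed SYN is Not-ECT; any other value suggests mangled bits. */
    return (tos & ECN_MASK) == ECN_NOT_ECT;
}
```

With this sketch, a SYN whose TOS byte is 0x00 or 0xB8 (DSCP set, ECN field clear) still negotiates ECN, while one arriving with 0x02 or 0x03 in the TOS byte does not.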
  3. 03 May, 2012 3 commits
    • net: implement tcp coalescing in tcp_queue_rcv() · b081f85c
      Committed by Eric Dumazet
      Extend TCP coalescing by implementing it in tcp_queue_rcv(), the main
      receiver function when the application is not blocked in recvmsg().
      
      Function tcp_queue_rcv() is moved a bit so that it can be called from
      tcp_data_queue().
      
      This gives good results, especially if GRO could not kick in and if the
      skb head is a fragment.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Cc: Alexander Duyck <alexander.h.duyck@intel.com>
      Cc: Neal Cardwell <ncardwell@google.com>
      Cc: Tom Herbert <therbert@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • tcp: early retransmit: delayed fast retransmit · 750ea2ba
      Committed by Yuchung Cheng
      Implement the advanced early retransmit mode (sysctl_tcp_early_retrans==2),
      which delays the fast retransmit by an interval of RTT/4. We borrow the
      RTO timer to implement the delay. If we receive another ACK or send
      a new packet, the timer is cancelled and restored to the original RTO
      value, offset by the time elapsed. When the delayed-ER timer fires,
      we enter fast recovery and perform a fast retransmit.
      Signed-off-by: Yuchung Cheng <ycheng@google.com>
      Acked-by: Neal Cardwell <ncardwell@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
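The timer arithmetic in this message can be illustrated with a small sketch. The function names and the microsecond units are assumptions made for the example, and clamping the remaining time at zero is also an assumption; the kernel's actual timer handling is not reproduced here.

```c
#include <stdint.h>

/* Hypothetical: delay applied before the deferred fast retransmit (RTT/4). */
static uint32_t er_delay_us(uint32_t rtt_us)
{
    return rtt_us / 4;
}

/*
 * Hypothetical: time left on the original RTO when the borrowed timer is
 * cancelled and restored, i.e. the RTO value offset by the time elapsed.
 * Clamped at zero (an assumption) so an overdue RTO does not wrap around.
 */
static uint32_t rto_remaining_us(uint32_t rto_us, uint32_t elapsed_us)
{
    return elapsed_us < rto_us ? rto_us - elapsed_us : 0;
}
```

For a 100 ms RTT the delay would be 25 ms. If the delayed-ER timer is cancelled 30 ms into a 200 ms RTO, the timer would be re-armed for the remaining 170 ms.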
    • tcp: early retransmit · eed530b6
      Committed by Yuchung Cheng
      This patch implements RFC 5827 early retransmit (ER) for TCP.
      It reduces the DUPACK threshold (dupthresh) when fewer than 4 packets
      are outstanding, so that losses are recovered by fast recovery instead
      of a timeout.
      
      While the algorithm is simple, small but frequent network reordering
      makes this feature dangerous: the connection repeatedly enters
      false recovery and degrades performance. Therefore we implement
      a mitigation suggested in the appendix of the RFC that delays
      entering fast recovery by a small interval, i.e., RTT/4. Currently
      ER is conservative and is disabled for the rest of the connection
      after the first reordering event. A large scale web server
      experiment on the performance impact of ER is summarized in
      section 6 of the paper "Proportional Rate Reduction for TCP",
      IMC 2011. http://conferences.sigcomm.org/imc/2011/docs/p155.pdf
      
      Note that Linux has a similar feature called THIN_DUPACK. The
      differences are that THIN_DUPACK does not mitigate reordering and is
      only used after slow start. Currently ER is disabled if THIN_DUPACK is
      enabled. I would be happy to merge the THIN_DUPACK feature with ER if
      people think it's a good idea.
      
      ER is enabled by sysctl_tcp_early_retrans:
        0: Disables ER
      
        1: Reduce dupthresh to packets_out - 1 when outstanding packets < 4.
      
        2: (Default) reduce dupthresh like mode 1. In addition, delay
           entering fast recovery by RTT/4.
      
      Note: mode 2 is implemented in the third part of this patch series.
      Signed-off-by: Yuchung Cheng <ycheng@google.com>
      Acked-by: Neal Cardwell <ncardwell@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
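The threshold rule listed for the sysctl modes can be sketched as follows. This is an illustration under stated assumptions, not the patch's code: the function name `er_dupthresh` is hypothetical, the default threshold of 3 duplicate ACKs is the standard fast retransmit value, and the lower bound of 2 outstanding packets is an assumption added here because a single outstanding packet cannot generate a duplicate ACK.

```c
#include <stdint.h>

/* Standard fast retransmit threshold: three duplicate ACKs. */
#define DEFAULT_DUPTHRESH 3u

/*
 * Hypothetical helper: DUPACK threshold under early retransmit.
 *   packets_out        - number of outstanding packets
 *   early_retrans_mode - 0 disables ER; 1 and 2 both reduce dupthresh
 *                        (mode 2 additionally delays recovery by RTT/4,
 *                        which is not modelled here)
 */
static uint32_t er_dupthresh(uint32_t packets_out, int early_retrans_mode)
{
    if (early_retrans_mode == 0)
        return DEFAULT_DUPTHRESH;

    /* Lower bound of 2 is an assumption of this sketch. */
    if (packets_out >= 2 && packets_out < 4)
        return packets_out - 1;

    return DEFAULT_DUPTHRESH;
}
```

With 3 packets outstanding the threshold drops to 2, and with 2 outstanding it drops to 1. From 4 packets upward the usual threshold of 3 applies.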
  4. 01 May, 2012 1 commit
    • net: fix sk_sockets_allocated_read_positive · 518fbf9c
      Committed by Eric Dumazet
      Denys Fedoryshchenko reported frequent crashes on a proxy server and kindly
      provided a lockdep report that explains it all:
      
        [  762.903868]
        [  762.903880] =================================
        [  762.903890] [ INFO: inconsistent lock state ]
        [  762.903903] 3.3.4-build-0061 #8 Not tainted
        [  762.904133] ---------------------------------
        [  762.904344] inconsistent {IN-SOFTIRQ-W} -> {SOFTIRQ-ON-W} usage.
        [  762.904542] squid/1603 [HC0[0]:SC0[0]:HE1:SE1] takes:
        [  762.904542]  (key#3){+.?...}, at: [<c0232cc4>]
      __percpu_counter_sum+0xd/0x58
        [  762.904542] {IN-SOFTIRQ-W} state was registered at:
        [  762.904542]   [<c0158b84>] __lock_acquire+0x284/0xc26
        [  762.904542]   [<c01598e8>] lock_acquire+0x71/0x85
        [  762.904542]   [<c0349765>] _raw_spin_lock+0x33/0x40
        [  762.904542]   [<c0232c93>] __percpu_counter_add+0x58/0x7c
        [  762.904542]   [<c02cfde1>] sk_clone_lock+0x1e5/0x200
        [  762.904542]   [<c0303ee4>] inet_csk_clone_lock+0xe/0x78
        [  762.904542]   [<c0315778>] tcp_create_openreq_child+0x1b/0x404
        [  762.904542]   [<c031339c>] tcp_v4_syn_recv_sock+0x32/0x1c1
        [  762.904542]   [<c031615a>] tcp_check_req+0x1fd/0x2d7
        [  762.904542]   [<c0313f77>] tcp_v4_do_rcv+0xab/0x194
        [  762.904542]   [<c03153bb>] tcp_v4_rcv+0x3b3/0x5cc
        [  762.904542]   [<c02fc0c4>] ip_local_deliver_finish+0x13a/0x1e9
        [  762.904542]   [<c02fc539>] NF_HOOK.clone.11+0x46/0x4d
        [  762.904542]   [<c02fc652>] ip_local_deliver+0x41/0x45
        [  762.904542]   [<c02fc4d1>] ip_rcv_finish+0x31a/0x33c
        [  762.904542]   [<c02fc539>] NF_HOOK.clone.11+0x46/0x4d
        [  762.904542]   [<c02fc857>] ip_rcv+0x201/0x23e
        [  762.904542]   [<c02daa3a>] __netif_receive_skb+0x319/0x368
        [  762.904542]   [<c02dac07>] netif_receive_skb+0x4e/0x7d
        [  762.904542]   [<c02dacf6>] napi_skb_finish+0x1e/0x34
        [  762.904542]   [<c02db122>] napi_gro_receive+0x20/0x24
        [  762.904542]   [<f85d1743>] e1000_receive_skb+0x3f/0x45 [e1000e]
        [  762.904542]   [<f85d3464>] e1000_clean_rx_irq+0x1f9/0x284 [e1000e]
        [  762.904542]   [<f85d3926>] e1000_clean+0x62/0x1f4 [e1000e]
        [  762.904542]   [<c02db228>] net_rx_action+0x90/0x160
        [  762.904542]   [<c012a445>] __do_softirq+0x7b/0x118
        [  762.904542] irq event stamp: 156915469
        [  762.904542] hardirqs last  enabled at (156915469): [<c019b4f4>]
      __slab_alloc.clone.58.clone.63+0xc4/0x2de
        [  762.904542] hardirqs last disabled at (156915468): [<c019b452>]
      __slab_alloc.clone.58.clone.63+0x22/0x2de
        [  762.904542] softirqs last  enabled at (156915466): [<c02ce677>]
      lock_sock_nested+0x64/0x6c
        [  762.904542] softirqs last disabled at (156915464): [<c0349914>]
      _raw_spin_lock_bh+0xe/0x45
        [  762.904542]
        [  762.904542] other info that might help us debug this:
        [  762.904542]  Possible unsafe locking scenario:
        [  762.904542]
        [  762.904542]        CPU0
        [  762.904542]        ----
        [  762.904542]   lock(key#3);
        [  762.904542]   <Interrupt>
        [  762.904542]     lock(key#3);
        [  762.904542]
        [  762.904542]  *** DEADLOCK ***
        [  762.904542]
        [  762.904542] 1 lock held by squid/1603:
        [  762.904542]  #0:  (sk_lock-AF_INET){+.+.+.}, at: [<c03055c0>]
      lock_sock+0xa/0xc
        [  762.904542]
        [  762.904542] stack backtrace:
        [  762.904542] Pid: 1603, comm: squid Not tainted 3.3.4-build-0061 #8
        [  762.904542] Call Trace:
        [  762.904542]  [<c0347b73>] ? printk+0x18/0x1d
        [  762.904542]  [<c015873a>] valid_state+0x1f6/0x201
        [  762.904542]  [<c0158816>] mark_lock+0xd1/0x1bb
        [  762.904542]  [<c015876b>] ? mark_lock+0x26/0x1bb
        [  762.904542]  [<c015805d>] ? check_usage_forwards+0x77/0x77
        [  762.904542]  [<c0158bf8>] __lock_acquire+0x2f8/0xc26
        [  762.904542]  [<c0159b8e>] ? mark_held_locks+0x5d/0x7b
        [  762.904542]  [<c0159cf6>] ? trace_hardirqs_on+0xb/0xd
        [  762.904542]  [<c0158dd4>] ? __lock_acquire+0x4d4/0xc26
        [  762.904542]  [<c01598e8>] lock_acquire+0x71/0x85
        [  762.904542]  [<c0232cc4>] ? __percpu_counter_sum+0xd/0x58
        [  762.904542]  [<c0349765>] _raw_spin_lock+0x33/0x40
        [  762.904542]  [<c0232cc4>] ? __percpu_counter_sum+0xd/0x58
        [  762.904542]  [<c0232cc4>] __percpu_counter_sum+0xd/0x58
        [  762.904542]  [<c02cebc4>] __sk_mem_schedule+0xdd/0x1c7
        [  762.904542]  [<c02d178d>] ? __alloc_skb+0x76/0x100
        [  762.904542]  [<c0305e8e>] sk_wmem_schedule+0x21/0x2d
        [  762.904542]  [<c0306370>] sk_stream_alloc_skb+0x42/0xaa
        [  762.904542]  [<c0306567>] tcp_sendmsg+0x18f/0x68b
        [  762.904542]  [<c031f3dc>] ? ip_fast_csum+0x30/0x30
        [  762.904542]  [<c0320193>] inet_sendmsg+0x53/0x5a
        [  762.904542]  [<c02cb633>] sock_aio_write+0xd2/0xda
        [  762.904542]  [<c015876b>] ? mark_lock+0x26/0x1bb
        [  762.904542]  [<c01a1017>] do_sync_write+0x9f/0xd9
        [  762.904542]  [<c01a2111>] ? file_free_rcu+0x2f/0x2f
        [  762.904542]  [<c01a17a1>] vfs_write+0x8f/0xab
        [  762.904542]  [<c01a284d>] ? fget_light+0x75/0x7c
        [  762.904542]  [<c01a1900>] sys_write+0x3d/0x5e
        [  762.904542]  [<c0349ec9>] syscall_call+0x7/0xb
        [  762.904542]  [<c0340000>] ? rp_sidt+0x41/0x83
      
      The bug is that sk_sockets_allocated_read_positive() calls
      percpu_counter_sum_positive() without BH being disabled.
      
      This bug was added in commit 180d8cd9
      (foundations of per-cgroup memory pressure controlling.), since the
      previous code was using percpu_counter_read_positive(), which is IRQ safe.
      
      In __sk_mem_schedule() we don't need the precise count of allocated
      sockets and can revert to the previous behavior.
      Reported-by: Denys Fedoryshchenko <denys@visp.net.lb>
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Cc: Glauber Costa <glommer@parallels.com>
      Acked-by: Neal Cardwell <ncardwell@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
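The difference between the approximate read and the exact sum can be shown with a simplified, single-threaded model of a per-CPU counter. Everything in this sketch is illustrative: the struct, the function names, and the fixed `NR_CPUS` are assumptions, and no locking is modelled. In the kernel, the summing path takes the counter's spinlock (as the lockdep trace above shows), which is why it must not race with softirq users, whereas the approximate read only looks at the shared total.

```c
#include <stdint.h>

#define NR_CPUS 4   /* arbitrary value for the sketch */

struct pcpu_counter {
    int64_t count;            /* shared, approximate total */
    int32_t local[NR_CPUS];   /* per-CPU deltas not yet folded into count */
};

/* Accumulate locally; fold into the shared total once the batch is reached. */
static void counter_add(struct pcpu_counter *c, int cpu,
                        int32_t amount, int32_t batch)
{
    int32_t v = c->local[cpu] + amount;

    if (v >= batch || v <= -batch) {
        c->count += v;        /* the real counter takes its lock here */
        c->local[cpu] = 0;
    } else {
        c->local[cpu] = v;
    }
}

/* Cheap, approximate read: looks only at the shared total. */
static int64_t counter_read_positive(const struct pcpu_counter *c)
{
    return c->count > 0 ? c->count : 0;
}

/* Exact read: walks every per-CPU delta (done under the lock in the kernel). */
static int64_t counter_sum_positive(const struct pcpu_counter *c)
{
    int64_t sum = c->count;

    for (int cpu = 0; cpu < NR_CPUS; cpu++)
        sum += c->local[cpu];
    return sum > 0 ? sum : 0;
}
```

The approximate read can lag the true value by up to one batch per CPU. The commit message argues that this imprecision is acceptable in __sk_mem_schedule().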
  5. 30 April, 2012 2 commits
  6. 29 April, 2012 1 commit
  7. 27 April, 2012 1 commit
    • ipv6: RTAX_FEATURE_ALLFRAG causes inefficient TCP segment sizing · 67469601
      Committed by Eric Dumazet
      Quoting Tore Anderson from:
      https://bugzilla.kernel.org/show_bug.cgi?id=42572
      
      When RTAX_FEATURE_ALLFRAG is set on a route, the effective TCP segment
      size does not take into account the size of the IPv6 Fragmentation
      header that needs to be included in outbound packets, causing every
      transmitted TCP segment to be fragmented across two IPv6 packets, the
      latter of which will only contain 8 bytes of actual payload.
      
      RTAX_FEATURE_ALLFRAG is typically set on a route in response to
      receiving an ICMPv6 Packet Too Big message indicating a Path MTU of less
      than 1280 bytes. 1280 bytes is the minimum IPv6 MTU, however ICMPv6
      PTBs with MTU < 1280 are still valid, in particular when an IPv6
      packet is sent to an IPv4 destination through a stateless translator.
      Any ICMPv4 Need To Fragment packets originated from the IPv4 part of
      the path will be translated to ICMPv6 PTB which may then indicate an
      MTU of less than 1280.
      
      The Linux kernel refuses to reduce the effective MTU to anything below
      1280 bytes, instead it sets it to exactly 1280 bytes, and
      RTAX_FEATURE_ALLFRAG is also set. However, the TCP segment size appears
      to be set to 1240 bytes (1280 Path MTU - 40 bytes of IPv6 header),
      instead of 1232 (additionally taking into account the 8 bytes required
      by the IPv6 Fragmentation extension header).
      
      This in turn results in rather inefficient transmission, as every
      transmitted TCP segment now is split in two fragments containing
      1232+8 bytes of payload.
      
      After this patch, all outgoing packets that include a
      Fragmentation header are "atomic" or "non-fragmented" fragments,
      i.e., they have both Offset=0 and More Fragments=0.
      
      With help from David S. Miller
      Reported-by: Tore Anderson <tore@fud.no>
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Cc: Maciej Żenczykowski <maze@google.com>
      Cc: Tom Herbert <therbert@google.com>
      Tested-by: Tore Anderson <tore@fud.no>
      Signed-off-by: David S. Miller <davem@davemloft.net>
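The size arithmetic quoted in this report can be checked with a short sketch. The function name `tcp_segment_budget` is hypothetical. As in the report, "segment size" here means the bytes available after the IPv6 header, so it still includes the TCP header.

```c
#include <stdbool.h>
#include <stdint.h>

#define IPV6_HDR_LEN      40u   /* fixed IPv6 header */
#define IPV6_FRAG_HDR_LEN  8u   /* IPv6 Fragmentation extension header */

/*
 * Hypothetical helper: bytes available to TCP per packet for a given path
 * MTU. When allfrag is set, every packet also carries a Fragmentation
 * header, so its 8 bytes must be subtracted as well.
 */
static uint32_t tcp_segment_budget(uint32_t path_mtu, bool allfrag)
{
    uint32_t overhead = IPV6_HDR_LEN + (allfrag ? IPV6_FRAG_HDR_LEN : 0u);

    return path_mtu > overhead ? path_mtu - overhead : 0;
}
```

For a path MTU of 1280 this yields 1240 bytes without the fragmentation header and 1232 bytes with it, matching the figures in the report. Sizing segments to 1240 while still adding the header is what forced the 1232+8 split.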
  8. 24 April, 2012 2 commits