1. 03 5月, 2012 5 次提交
    • E
      net: implement tcp coalescing in tcp_queue_rcv() · b081f85c
      Eric Dumazet 提交于
      Extend tcp coalescing implementing it from tcp_queue_rcv(), the main
      receiver function when application is not blocked in recvmsg().
      
      Function tcp_queue_rcv() is moved a bit to allow its call from
      tcp_data_queue()
      
      This gives good results especially if GRO could not kick, and if skb
      head is a fragment.
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Cc: Alexander Duyck <alexander.h.duyck@intel.com>
      Cc: Neal Cardwell <ncardwell@google.com>
      Cc: Tom Herbert <therbert@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      b081f85c
    • E
      net: take care of cloned skbs in tcp_try_coalesce() · 923dd347
      Eric Dumazet 提交于
      Before stealing fragments or skb head, we must make sure skbs are not
      cloned.
      
      Alexander was worried about destination skb being cloned : In bridge
      setups, a driver could be fooled if skb->data_len would not match skb
      nr_frags.
      
      If source skb is cloned, we must take references on pages instead.
      
      Bug happened using tcpdump (if not using mmap())
      
      Introduce kfree_skb_partial() helper to cleanup code.
      Reported-by: NAlexander Duyck <alexander.h.duyck@intel.com>
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      923dd347
    • Y
      tcp: early retransmit: delayed fast retransmit · 750ea2ba
      Yuchung Cheng 提交于
      Implementing the advanced early retransmit (sysctl_tcp_early_retrans==2).
      Delays the fast retransmit by an interval of RTT/4. We borrow the
      RTO timer to implement the delay. If we receive another ACK or send
      a new packet, the timer is cancelled and restored to original RTO
      value offset by time elapsed.  When the delayed-ER timer fires,
      we enter fast recovery and perform fast retransmit.
      Signed-off-by: NYuchung Cheng <ycheng@google.com>
      Acked-by: NNeal Cardwell <ncardwell@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      750ea2ba
    • Y
      tcp: early retransmit · eed530b6
      Yuchung Cheng 提交于
      This patch implements RFC 5827 early retransmit (ER) for TCP.
      It reduces DUPACK threshold (dupthresh) if outstanding packets are
      less than 4 to recover losses by fast recovery instead of timeout.
      
      While the algorithm is simple, small but frequent network reordering
      makes this feature dangerous: the connection repeatedly enter
      false recovery and degrade performance. Therefore we implement
      a mitigation suggested in the appendix of the RFC that delays
      entering fast recovery by a small interval, i.e., RTT/4. Currently
      ER is conservative and is disabled for the rest of the connection
      after the first reordering event. A large scale web server
      experiment on the performance impact of ER is summarized in
      section 6 of the paper "Proportional Rate Reduction for TCP”,
      IMC 2011. http://conferences.sigcomm.org/imc/2011/docs/p155.pdf
      
      Note that Linux has a similar feature called THIN_DUPACK. The
      differences are THIN_DUPACK do not mitigate reorderings and is only
      used after slow start. Currently ER is disabled if THIN_DUPACK is
      enabled. I would be happy to merge THIN_DUPACK feature with ER if
      people think it's a good idea.
      
      ER is enabled by sysctl_tcp_early_retrans:
        0: Disables ER
      
        1: Reduce dupthresh to packets_out - 1 when outstanding packets < 4.
      
        2: (Default) reduce dupthresh like mode 1. In addition, delay
           entering fast recovery by RTT/4.
      
      Note: mode 2 is implemented in the third part of this patch series.
      Signed-off-by: NYuchung Cheng <ycheng@google.com>
      Acked-by: NNeal Cardwell <ncardwell@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      eed530b6
    • Y
      tcp: early retransmit: tcp_enter_recovery() · 1fbc3405
      Yuchung Cheng 提交于
      This a prepartion patch that refactors the code to enter recovery
      into a new function tcp_enter_recovery(). It's needed to implement
      the delayed fast retransmit in ER.
      Signed-off-by: NYuchung Cheng <ycheng@google.com>
      Acked-by: NNeal Cardwell <ncardwell@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      1fbc3405
  2. 01 5月, 2012 1 次提交
    • E
      tcp: makes tcp_try_coalesce aware of skb->head_frag · 329033f6
      Eric Dumazet 提交于
      TCP coalesce can check if skb to be merged has its skb->head mapped to a
      page fragment, instead of a kmalloc() area.
      
      We had to disable coalescing in this case, for performance reasons.
      
      We 'upgrade' skb->head as a fragment in itself.
      
      This reduces number of cache misses when user makes its copies, since a
      less sk_buff are fetched.
      
      This makes receive and ofo queues shorter and thus reduce cache line
      misses in TCP stack.
      
      This is a followup of patch "net: allow skb->head to be a page fragment"
      
      Tested with tg3 nic, with GRO on or off. We can see "TCPRcvCoalesce"
      counter being incremented.
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Cc: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
      Cc: Herbert Xu <herbert@gondor.apana.org.au>
      Cc: Maciej Żenczykowski <maze@google.com>
      Cc: Neal Cardwell <ncardwell@google.com>
      Cc: Tom Herbert <therbert@google.com>
      Cc: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
      Cc: Ben Hutchings <bhutchings@solarflare.com>
      Cc: Matt Carlson <mcarlson@broadcom.com>
      Cc: Michael Chan <mchan@broadcom.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      329033f6
  3. 27 4月, 2012 1 次提交
    • E
      ipv6: RTAX_FEATURE_ALLFRAG causes inefficient TCP segment sizing · 67469601
      Eric Dumazet 提交于
      Quoting Tore Anderson from :
      https://bugzilla.kernel.org/show_bug.cgi?id=42572
      
      When RTAX_FEATURE_ALLFRAG is set on a route, the effective TCP segment
      size does not take into account the size of the IPv6 Fragmentation
      header that needs to be included in outbound packets, causing every
      transmitted TCP segment to be fragmented across two IPv6 packets, the
      latter of which will only contain 8 bytes of actual payload.
      
      RTAX_FEATURE_ALLFRAG is typically set on a route in response to
      receving a ICMPv6 Packet Too Big message indicating a Path MTU of less
      than 1280 bytes. 1280 bytes is the minimum IPv6 MTU, however ICMPv6
      PTBs with MTU < 1280 are still valid, in particular when an IPv6
      packet is sent to an IPv4 destination through a stateless translator.
      Any ICMPv4 Need To Fragment packets originated from the IPv4 part of
      the path will be translated to ICMPv6 PTB which may then indicate an
      MTU of less than 1280.
      
      The Linux kernel refuses to reduce the effective MTU to anything below
      1280 bytes, instead it sets it to exactly 1280 bytes, and
      RTAX_FEATURE_ALLFRAG is also set. However, the TCP segment size appears
      to be set to 1240 bytes (1280 Path MTU - 40 bytes of IPv6 header),
      instead of 1232 (additionally taking into account the 8 bytes required
      by the IPv6 Fragmentation extension header).
      
      This in turn results in rather inefficient transmission, as every
      transmitted TCP segment now is split in two fragments containing
      1232+8 bytes of payload.
      
      After this patch, all the outgoing packets that includes a
      Fragmentation header all are "atomic" or "non-fragmented" fragments,
      i.e., they both have Offset=0 and More Fragments=0.
      
      With help from David S. Miller
      Reported-by: NTore Anderson <tore@fud.no>
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Cc: Maciej Żenczykowski <maze@google.com>
      Cc: Tom Herbert <therbert@google.com>
      Tested-by: NTore Anderson <tore@fud.no>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      67469601
  4. 26 4月, 2012 2 次提交
  5. 24 4月, 2012 5 次提交
  6. 23 4月, 2012 1 次提交
  7. 22 4月, 2012 7 次提交
    • N
      tcp: move duplicate code from tcp_v4_init_sock()/tcp_v6_init_sock() · 900f65d3
      Neal Cardwell 提交于
      This commit moves the (substantial) common code shared between
      tcp_v4_init_sock() and tcp_v6_init_sock() to a new address-family
      independent function, tcp_init_sock().
      
      Centralizing this functionality should help avoid drift issues,
      e.g. where the IPv4 side is updated without a corresponding update to
      IPv6. There was already some drift: IPv4 initialized snd_cwnd to
      TCP_INIT_CWND, while the IPv6 side was still initializing snd_cwnd to
      2 (in this case it should not matter, since snd_cwnd is also
      initialized in tcp_init_metrics(), but the general risks and
      maintenance overhead remain).
      
      When diffing the old and new code, note that new tcp_init_sock()
      function uses the order of steps from the tcp_v4_init_sock()
      implementation (the order is slightly different in
      tcp_v6_init_sock()).
      Signed-off-by: NNeal Cardwell <ncardwell@google.com>
      Acked-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      900f65d3
    • P
      tcp: Repair connection-time negotiated parameters · b139ba4e
      Pavel Emelyanov 提交于
      There are options, which are set up on a socket while performing
      TCP handshake. Need to resurrect them on a socket while repairing.
      A new sockoption accepts a buffer and parses it. The buffer should
      be CODE:VALUE sequence of bytes, where CODE is standard option
      code and VALUE is the respective value.
      
      Only 4 options should be handled on repaired socket.
      
      To read 3 out of 4 of these options the TCP_INFO sockoption can be
      used. An ability to get the last one (the mss_clamp) was added by
      the previous patch.
      
      Now the restore. Three of these options -- timestamp_ok, mss_clamp
      and snd_wscale -- are just restored on a coket.
      
      The sack_ok flags has 2 issues. First, whether or not to do sacks
      at all. This flag is just read and set back. No other sack  info is
      saved or restored, since according to the standart and the code
      dropping all sack-ed segments is OK, the sender will resubmit them
      again, so after the repair we will probably experience a pause in
      connection. Next, the fack bit. It's just set back on a socket if
      the respective sysctl is set. No collected stats about packets flow
      is preserved. As far as I see (plz, correct me if I'm wrong) the
      fack-based congestion algorithm survives dropping all of the stats
      and repairs itself eventually, probably losing the performance for
      that period.
      Signed-off-by: NPavel Emelyanov <xemul@openvz.org>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      b139ba4e
    • P
      tcp: Report mss_clamp with TCP_MAXSEG option in repair mode · 5e6a3ce6
      Pavel Emelyanov 提交于
      The mss_clamp is the only connection-time negotiated option which
      cannot be obtained from the user space. Make the TCP_MAXSEG sockopt
      report one in the repair mode.
      Signed-off-by: NPavel Emelyanov <xemul@openvz.org>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      5e6a3ce6
    • P
      tcp: Repair socket queues · c0e88ff0
      Pavel Emelyanov 提交于
      Reading queues under repair mode is done with recvmsg call.
      The queue-under-repair set by TCP_REPAIR_QUEUE option is used
      to determine which queue should be read. Thus both send and
      receive queue can be read with this.
      
      Caller must pass the MSG_PEEK flag.
      
      Writing to queues is done with sendmsg call and yet again --
      the repair-queue option can be used to push data into the
      receive queue.
      
      When putting an skb into receive queue a zero tcp header is
      appented to its head to address the tcp_hdr(skb)->syn and
      the ->fin checks by the (after repair) tcp_recvmsg. These
      flags flags are both set to zero and that's why.
      
      The fin cannot be met in the queue while reading the source
      socket, since the repair only works for closed/established
      sockets and queueing fin packet always changes its state.
      
      The syn in the queue denotes that the respective skb's seq
      is "off-by-one" as compared to the actual payload lenght. Thus,
      at the rcv queue refill we can just drop this flag and set the
      skb's sequences to precice values.
      
      When the repair mode is turned off, the write queue seqs are
      updated so that the whole queue is considered to be 'already sent,
      waiting for ACKs' (write_seq = snd_nxt <= snd_una). From the
      protocol POV the send queue looks like it was sent, but the data
      between the write_seq and snd_nxt is lost in the network.
      
      This helps to avoid another sockoption for setting the snd_nxt
      sequence. Leaving the whole queue in a 'not yet sent' state (as
      it will be after sendmsg-s) will not allow to receive any acks
      from the peer since the ack_seq will be after the snd_nxt. Thus
      even the ack for the window probe will be dropped and the
      connection will be 'locked' with the zero peer window.
      Signed-off-by: NPavel Emelyanov <xemul@parallels.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      c0e88ff0
    • P
      tcp: Initial repair mode · ee995283
      Pavel Emelyanov 提交于
      This includes (according the the previous description):
      
      * TCP_REPAIR sockoption
      
      This one just puts the socket in/out of the repair mode.
      Allowed for CAP_NET_ADMIN and for closed/establised sockets only.
      When repair mode is turned off and the socket happens to be in
      the established state the window probe is sent to the peer to
      'unlock' the connection.
      
      * TCP_REPAIR_QUEUE sockoption
      
      This one sets the queue which we're about to repair. The
      'no-queue' is set by default.
      
      * TCP_QUEUE_SEQ socoption
      
      Sets the write_seq/rcv_nxt of a selected repaired queue.
      Allowed for TCP_CLOSE-d sockets only. When the socket changes
      its state the other seq-s are changed by the kernel according
      to the protocol rules (most of the existing code is actually
      reused).
      
      * Ability to forcibly bind a socket to a port
      
      The sk->sk_reuse is set to SK_FORCE_REUSE.
      
      * Immediate connect modification
      
      The connect syscall initializes the connection, then directly jumps
      to the code which finalizes it.
      
      * Silent close modification
      
      The close just aborts the connection (similar to SO_LINGER with 0
      time) but without sending any FIN/RST-s to peer.
      Signed-off-by: NPavel Emelyanov <xemul@parallels.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      ee995283
    • P
      tcp: Move code around · 370816ae
      Pavel Emelyanov 提交于
      This is just the preparation patch, which makes the needed for
      TCP repair code ready for use.
      Signed-off-by: NPavel Emelyanov <xemul@parallels.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      370816ae
    • P
      sock: Introduce named constants for sk_reuse · 4a17fd52
      Pavel Emelyanov 提交于
      Name them in a "backward compatible" manner, i.e. reuse or not
      are still 1 and 0 respectively. The reuse value of 2 means that
      the socket with it will forcibly reuse everyone else's port.
      Signed-off-by: NPavel Emelyanov <xemul@openvz.org>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      4a17fd52
  8. 21 4月, 2012 7 次提交
  9. 20 4月, 2012 1 次提交
  10. 19 4月, 2012 2 次提交
  11. 18 4月, 2012 2 次提交
  12. 16 4月, 2012 2 次提交
  13. 15 4月, 2012 4 次提交