1. 27 Apr, 2012 1 commit
    • E
      ipv6: RTAX_FEATURE_ALLFRAG causes inefficient TCP segment sizing · 67469601
      Eric Dumazet committed
      Quoting Tore Anderson from:
      https://bugzilla.kernel.org/show_bug.cgi?id=42572
      
      When RTAX_FEATURE_ALLFRAG is set on a route, the effective TCP segment
      size does not take into account the size of the IPv6 Fragmentation
      header that needs to be included in outbound packets, causing every
      transmitted TCP segment to be fragmented across two IPv6 packets, the
      latter of which will only contain 8 bytes of actual payload.
      
      RTAX_FEATURE_ALLFRAG is typically set on a route in response to
      receiving an ICMPv6 Packet Too Big message indicating a Path MTU of less
      than 1280 bytes. 1280 bytes is the minimum IPv6 MTU, however ICMPv6
      PTBs with MTU < 1280 are still valid, in particular when an IPv6
      packet is sent to an IPv4 destination through a stateless translator.
      Any ICMPv4 Need To Fragment packets originated from the IPv4 part of
      the path will be translated to ICMPv6 PTB which may then indicate an
      MTU of less than 1280.
      
      The Linux kernel refuses to reduce the effective MTU to anything below
      1280 bytes, instead it sets it to exactly 1280 bytes, and
      RTAX_FEATURE_ALLFRAG is also set. However, the TCP segment size appears
      to be set to 1240 bytes (1280 Path MTU - 40 bytes of IPv6 header),
      instead of 1232 (additionally taking into account the 8 bytes required
      by the IPv6 Fragmentation extension header).
      
      This in turn results in rather inefficient transmission, as every
      transmitted TCP segment now is split in two fragments containing
      1232+8 bytes of payload.
      
      After this patch, all outgoing packets that include a
      Fragmentation header are "atomic" or "non-fragmented" fragments,
      i.e., they have both Offset=0 and More Fragments=0.
      
      With help from David S. Miller
      Reported-by: Tore Anderson <tore@fud.no>
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Cc: Maciej Żenczykowski <maze@google.com>
      Cc: Tom Herbert <therbert@google.com>
      Tested-by: Tore Anderson <tore@fud.no>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      67469601
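The arithmetic in the commit message above can be checked with a small sketch. This is a userspace illustration only, not kernel code: the function names `l4_payload_budget` and `fragment_sizes` are hypothetical, and the model assumes a plain 40-byte IPv6 header with no other extension headers. The "segment size" here follows the commit message's usage, i.e. the bytes available after the IPv6 header.

```python
IPV6_MIN_MTU = 1280      # minimum IPv6 MTU, per the commit message
IPV6_HEADER_LEN = 40     # fixed IPv6 header
FRAG_HEADER_LEN = 8      # IPv6 Fragmentation extension header


def l4_payload_budget(path_mtu, allfrag, account_for_frag_header):
    """Bytes available after the IPv6 header(s) for one outbound packet.

    The MTU is clamped to 1280, as the commit message describes.
    account_for_frag_header=False models the behaviour before the patch.
    """
    mtu = max(path_mtu, IPV6_MIN_MTU)
    budget = mtu - IPV6_HEADER_LEN
    if allfrag and account_for_frag_header:
        budget -= FRAG_HEADER_LEN
    return budget


def fragment_sizes(l4_len, mtu):
    """Split l4_len bytes into per-fragment payload sizes for the given MTU.

    Every fragment carries an IPv6 header plus a Fragmentation header, and
    non-final fragment payloads are rounded down to a multiple of 8 bytes.
    """
    per_fragment = (mtu - IPV6_HEADER_LEN - FRAG_HEADER_LEN) // 8 * 8
    sizes = []
    while l4_len > 0:
        take = min(per_fragment, l4_len)
        sizes.append(take)
        l4_len -= take
    return sizes


old = l4_payload_budget(1000, allfrag=True, account_for_frag_header=False)
new = l4_payload_budget(1000, allfrag=True, account_for_frag_header=True)
print(old, fragment_sizes(old, IPV6_MIN_MTU))
print(new, fragment_sizes(new, IPV6_MIN_MTU))
```

With a reported Path MTU below 1280, the unpatched budget of 1240 bytes splits into two fragments of 1232 and 8 bytes, while the patched budget of 1232 bytes fits in a single "atomic" fragment.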
  2. 15 Apr, 2012 1 commit
    • A
      tcp: bind() use stronger condition for bind_conflict · aacd9289
      Alex Copot committed
      We must try harder to get unique (addr, port) pairs when
      doing port autoselection for sockets with SO_REUSEADDR
      option set.
      
      We achieve this by adding a relaxation parameter to
      inet_csk_bind_conflict. When 'relax' parameter is off
      we return a conflict whenever the current searched
      pair (addr, port) is not unique.
      
      This tries to address the problems reported in patch:
      	8d238b25
      	Revert "tcp: bind() fix when many ports are bound"
      
      Tests were run creating and binding(0) many sockets
      on 100 IPs. The results are, on average:
      
      	* 60000 sockets, 600 ports / IP:
      		* 0.210 s, 620 (IP, port) duplicates without patch
      		* 0.219 s, no duplicates with patch
      	* 100000 sockets, 1000 ports / IP:
      		* 0.371 s, 1720 duplicates without patch
      		* 0.373 s, no duplicates with patch
      	* 200000 sockets, 2000 ports / IP:
      		* 0.766 s, 6900 duplicates without patch
      		* 0.768 s, no duplicates with patch
      	* 500000 sockets, 5000 ports / IP:
      		* 2.227 s, 41500 duplicates without patch
      		* 2.284 s, no duplicates with patch
      Signed-off-by: Alex Copot <alex.mihai.c@gmail.com>
      Signed-off-by: Daniel Baluta <dbaluta@ixiacom.com>
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      aacd9289
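The effect of the 'relax' parameter described above can be modelled with a toy bind table. This is a simplified userspace sketch, not the kernel's inet_csk_bind_conflict: the `Bound` record, the function names, and the port range passed in are all hypothetical, and wildcard addresses, bound devices, and SO_REUSEPORT are ignored.

```python
from typing import NamedTuple


class Bound(NamedTuple):
    """One already-bound socket in the toy table (hypothetical record)."""
    addr: str
    port: int
    reuseaddr: bool
    listening: bool = False


def bind_conflict(table, addr, port, reuseaddr, relax):
    """Return True if binding (addr, port) conflicts with an existing entry.

    With relax=True, two non-listening SO_REUSEADDR sockets may share a pair.
    With relax=False, any existing identical (addr, port) pair is a conflict,
    which is the stricter rule the commit message describes.
    """
    for other in table:
        if other.addr != addr or other.port != port:
            continue
        both_reuse = reuseaddr and other.reuseaddr and not other.listening
        if not both_reuse or not relax:
            return True
    return False


def autoselect_port(table, addr, reuseaddr, ports):
    """Pick the first port whose (addr, port) pair is unique, else None."""
    for port in ports:
        if not bind_conflict(table, addr, port, reuseaddr, relax=False):
            return port
    return None


table = [Bound("10.0.0.1", 40000, reuseaddr=True)]
print(bind_conflict(table, "10.0.0.1", 40000, True, relax=True))
print(bind_conflict(table, "10.0.0.1", 40000, True, relax=False))
print(autoselect_port(table, "10.0.0.1", True, range(40000, 40003)))
```

In this model the relaxed check lets a second SO_REUSEADDR socket land on the same pair, which is how duplicates could appear during autoselection; the strict check skips to the next free port instead.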
  3. 09 Nov, 2011 1 commit
  4. 19 May, 2011 1 commit
  5. 09 May, 2011 2 commits
  6. 20 Dec, 2010 1 commit
  7. 01 Dec, 2010 1 commit
  8. 31 Aug, 2010 1 commit
    • J
      tcp: Add TCP_USER_TIMEOUT socket option. · dca43c75
      Jerry Chu committed
      This patch provides "user timeout" support as described in RFC793. The
      socket option is also needed for the local half of RFC5482 "TCP User
      Timeout Option".
      
      TCP_USER_TIMEOUT is a TCP level socket option that takes an unsigned int,
      when > 0, to specify the maximum amount of time in ms that transmitted
      data may remain unacknowledged before TCP will forcefully close the
      corresponding connection and return ETIMEDOUT to the application. If
      0 is given, TCP will continue to use the system default.
      
      Increasing the user timeouts allows a TCP connection to survive extended
      periods without end-to-end connectivity. Decreasing the user timeouts
      allows applications to "fail fast" if so desired. Otherwise it may take
      up to 20 minutes with the current system defaults in a normal WAN
      environment.
      
      The socket option can be set during any state of a TCP connection, but
      is only effective during the synchronized states of a connection
      (ESTABLISHED, FIN-WAIT-1, FIN-WAIT-2, CLOSE-WAIT, CLOSING, or LAST-ACK).
      Moreover, when used with the TCP keepalive (SO_KEEPALIVE) option,
      TCP_USER_TIMEOUT will overtake keepalive to determine when to close a
      connection due to keepalive failure.
      
      The option does not change in any way when TCP retransmits a packet, nor
      when a keepalive probe will be sent.
      
      This option, like many others, will be inherited by an acceptor from its
      listener.
      Signed-off-by: H.K. Jerry Chu <hkchu@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      dca43c75
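From userspace, the option described above is set with an ordinary setsockopt call at the TCP level. The following is a minimal sketch, assuming a Linux host and a Python build that exposes the `socket.TCP_USER_TIMEOUT` constant; the helper name `set_user_timeout` is hypothetical.

```python
import socket


def set_user_timeout(sock, timeout_ms):
    """Set TCP_USER_TIMEOUT on sock and return the value read back.

    timeout_ms is in milliseconds, as the commit message specifies;
    0 keeps the system default behaviour.
    """
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_USER_TIMEOUT, timeout_ms)
    return sock.getsockopt(socket.IPPROTO_TCP, socket.TCP_USER_TIMEOUT)
```

A "fail fast" client might call `set_user_timeout(sock, 30000)` before connecting, so that data left unacknowledged for 30 seconds ends the connection with ETIMEDOUT. Since the commit message says the option is inherited from the listener, a server could set it once on the listening socket.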
  9. 16 Apr, 2010 1 commit
  10. 12 Apr, 2010 1 commit
  11. 01 Oct, 2009 1 commit
  12. 28 Aug, 2008 1 commit
  13. 04 Apr, 2008 1 commit
  14. 03 Feb, 2008 1 commit
    • A
      [SOCK] proto: Add hashinfo member to struct proto · ab1e0a13
      Arnaldo Carvalho de Melo committed
      This way we can remove TCP and DCCP specific versions of
      
      sk->sk_prot->get_port: both v4 and v6 use inet_csk_get_port
      sk->sk_prot->hash:     inet_hash is directly used, only v6 needs
                             a specific version to deal with mapped sockets
      sk->sk_prot->unhash:   both v4 and v6 use inet_hash directly
      
      struct inet_connection_sock_af_ops also gets a new member, bind_conflict, so
      that inet_csk_get_port can find the per family routine.
      
      Now only the lookup routines receive as a parameter a struct inet_hashtable.
      
      With this we further reuse code, reducing the difference among INET transport
      protocols.
      
      Eventually work has to be done on UDP and SCTP to make them share this
      infrastructure and get as a bonus inet_diag interfaces so that iproute can be
      used with these protocols.
      
      net-2.6/net/ipv4/inet_hashtables.c:
        struct proto			     |   +8
        struct inet_connection_sock_af_ops |   +8
       2 structs changed
        __inet_hash_nolisten               |  +18
        __inet_hash                        | -210
        inet_put_port                      |   +8
        inet_bind_bucket_create            |   +1
        __inet_hash_connect                |   -8
       5 functions changed, 27 bytes added, 218 bytes removed, diff: -191
      
      net-2.6/net/core/sock.c:
        proto_seq_show                     |   +3
       1 function changed, 3 bytes added, diff: +3
      
      net-2.6/net/ipv4/inet_connection_sock.c:
        inet_csk_get_port                  |  +15
       1 function changed, 15 bytes added, diff: +15
      
      net-2.6/net/ipv4/tcp.c:
        tcp_set_state                      |   -7
       1 function changed, 7 bytes removed, diff: -7
      
      net-2.6/net/ipv4/tcp_ipv4.c:
        tcp_v4_get_port                    |  -31
        tcp_v4_hash                        |  -48
        tcp_v4_destroy_sock                |   -7
        tcp_v4_syn_recv_sock               |   -2
        tcp_unhash                         | -179
       5 functions changed, 267 bytes removed, diff: -267
      
      net-2.6/net/ipv6/inet6_hashtables.c:
        __inet6_hash |   +8
       1 function changed, 8 bytes added, diff: +8
      
      net-2.6/net/ipv4/inet_hashtables.c:
        inet_unhash                        | +190
        inet_hash                          | +242
       2 functions changed, 432 bytes added, diff: +432
      
      vmlinux:
       16 functions changed, 485 bytes added, 492 bytes removed, diff: -7
      
      /home/acme/git/net-2.6/net/ipv6/tcp_ipv6.c:
        tcp_v6_get_port                    |  -31
        tcp_v6_hash                        |   -7
        tcp_v6_syn_recv_sock               |   -9
       3 functions changed, 47 bytes removed, diff: -47
      
      /home/acme/git/net-2.6/net/dccp/proto.c:
        dccp_destroy_sock                  |   -7
        dccp_unhash                        | -179
        dccp_hash                          |  -49
        dccp_set_state                     |   -7
        dccp_done                          |   +1
       5 functions changed, 1 bytes added, 242 bytes removed, diff: -241
      
      /home/acme/git/net-2.6/net/dccp/ipv4.c:
        dccp_v4_get_port                   |  -31
        dccp_v4_request_recv_sock          |   -2
       2 functions changed, 33 bytes removed, diff: -33
      
      /home/acme/git/net-2.6/net/dccp/ipv6.c:
        dccp_v6_get_port                   |  -31
        dccp_v6_hash                       |   -7
        dccp_v6_request_recv_sock          |   +5
       3 functions changed, 5 bytes added, 38 bytes removed, diff: -33
      Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      ab1e0a13
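The design idea in the commit above, one generic hash/unhash routine that finds its table through the protocol descriptor, can be sketched as a toy model. This is an illustration in Python rather than the kernel's C; the classes `HashInfo`, `Proto`, and `Sock` and their fields are hypothetical stand-ins, and only the function names mirror those in the commit message.

```python
class HashInfo:
    """Per-protocol socket table (stand-in for struct inet_hashinfo)."""
    def __init__(self):
        self.buckets = {}


class Proto:
    """Protocol descriptor carrying its own hashinfo member."""
    def __init__(self, name):
        self.name = name
        self.hashinfo = HashInfo()


class Sock:
    def __init__(self, prot, port):
        self.prot = prot
        self.port = port


def inet_hash(sk):
    """Generic hash: the table is reached via sk.prot, not passed in."""
    sk.prot.hashinfo.buckets.setdefault(sk.port, []).append(sk)


def inet_unhash(sk):
    """Generic unhash, shared by every protocol in this model."""
    bucket = sk.prot.hashinfo.buckets[sk.port]
    bucket.remove(sk)
    if not bucket:
        del sk.prot.hashinfo.buckets[sk.port]


tcp, dccp = Proto("TCP"), Proto("DCCP")
a, b = Sock(tcp, 80), Sock(dccp, 80)
inet_hash(a)
inet_hash(b)
print(len(tcp.hashinfo.buckets), len(dccp.hashinfo.buckets))
```

Because each protocol carries its own table, the same two functions serve both protocols in the model without protocol-specific wrappers, which is the kind of code sharing the size deltas above reflect.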
  15. 26 Jan, 2007 1 commit
  16. 04 Dec, 2006 1 commit
  17. 03 Dec, 2006 2 commits
    • A
      [INET_CONNECTION_SOCK]: Pack struct inet_connection_sock_af_ops · 850db6b8
      Arnaldo Carvalho de Melo committed
      We have a hole in:
      
      [acme@newtoy net-2.6.20]$ pahole net/ipv6/tcp_ipv6.o inet_connection_sock_af_ops
      /* /pub/scm/linux/kernel/git/acme/net-2.6.20/include/net/inet_connection_sock.h:38 */
      struct inet_connection_sock_af_ops {
              int                        (*queue_xmit)();      /*     0     4 */
              void                       (*send_check)();      /*     4     4 */
              int                        (*rebuild_header)();  /*     8     4 */
              int                        (*conn_request)();    /*    12     4 */
              struct sock *              (*syn_recv_sock)();   /*    16     4 */
              int                        (*remember_stamp)();  /*    20     4 */
              __u16                      net_header_len;       /*    24     2 */
      
              /* XXX 2 bytes hole, try to pack */
      
              int                        (*setsockopt)();      /*    28     4 */
              int                        (*getsockopt)();      /*    32     4 */
              int                        (*compat_setsockopt)(); /*    36     4 */
              int                        (*compat_getsockopt)(); /*    40     4 */
              void                       (*addr2sockaddr)();   /*    44     4 */
              int                        sockaddr_len;         /*    48     4 */
      }; /* size: 52, sum members: 50, holes: 1, sum holes: 2 */
      
      But we don't need sockaddr_len to be an int:
      
      [acme@newtoy net-2.6.20]$ find net -name "*.[ch]" | xargs grep '\.sockaddr_len.\+=' | sort -u
      net/dccp/ipv4.c:        .sockaddr_len      = sizeof(struct sockaddr_in),
      net/dccp/ipv6.c:        .sockaddr_len      = sizeof(struct sockaddr_in6),
      net/ipv4/tcp_ipv4.c:    .sockaddr_len      = sizeof(struct sockaddr_in),
      net/ipv6/tcp_ipv6.c:    .sockaddr_len      = sizeof(struct sockaddr_in6),
      net/sctp/ipv6.c:        .sockaddr_len      = sizeof(struct sockaddr_in6),
      net/sctp/protocol.c:    .sockaddr_len      = sizeof(struct sockaddr_in),
      
      [acme@newtoy net-2.6.20]$ pahole --sizes net/ipv6/tcp_ipv6.o | grep sockaddr_in
      struct sockaddr_in: 16 0
      struct sockaddr_in6: 28 0
      [acme@newtoy net-2.6.20]$
      
      So I turned sockaddr_len into a 'u16', and now:
      
      [acme@newtoy net-2.6.20]$ pahole net/ipv6/tcp_ipv6.o inet_connection_sock_af_ops
      /* /pub/scm/linux/kernel/git/acme/net-2.6.20/include/net/inet_connection_sock.h:38 */
      struct inet_connection_sock_af_ops {
              int            (*queue_xmit)();        /*     0   4 */
              void           (*send_check)();        /*     4   4 */
              int            (*rebuild_header)();    /*     8   4 */
              int            (*conn_request)();      /*    12   4 */
              struct sock *  (*syn_recv_sock)();     /*    16   4 */
              int            (*remember_stamp)();    /*    20   4 */
              u16            net_header_len;         /*    24   2 */
              u16            sockaddr_len;           /*    26   2 */
              int            (*setsockopt)();        /*    28   4 */
              int            (*getsockopt)();        /*    32   4 */
              int            (*compat_setsockopt)(); /*    36   4 */
              int            (*compat_getsockopt)(); /*    40   4 */
              void           (*addr2sockaddr)();     /*    44   4 */
      }; /* size: 48 */
      
      So we've saved 4 bytes:
      
      [acme@newtoy net-2.6.20]$ codiff -sV /tmp/tcp_ipv6.o.before net/ipv6/tcp_ipv6.o
      /pub/scm/linux/kernel/git/acme/net-2.6.20/net/ipv6/tcp_ipv6.c:
        struct inet_connection_sock_af_ops |   -4
          net_header_len;
           from: __u16                 /*    24(0)     2(0) */
           to:   u16                   /*    24(0)     2(0) */
          sockaddr_len;
           from: int                   /*    48(0)     4(0) */
           to:   u16                   /*    26(0)     2(0) */
       1 struct changed
      [acme@newtoy net-2.6.20]$
      Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com>
      850db6b8
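The hole and the 4-byte saving shown in the pahole output above can be reproduced outside the kernel with `ctypes`. This is a sketch under one loud assumption: each function pointer is modelled as a 4-byte integer (`FuncPtr32`, a hypothetical alias) to match the 32-bit build in the commit message, so the result is the same on a 64-bit host. The class names are also made up for the example.

```python
import ctypes

# Assumption: stand-in for a 4-byte function pointer on a 32-bit build.
FuncPtr32 = ctypes.c_uint32

HEAD = ["queue_xmit", "send_check", "rebuild_header",
        "conn_request", "syn_recv_sock", "remember_stamp"]
TAIL = ["setsockopt", "getsockopt", "compat_setsockopt",
        "compat_getsockopt", "addr2sockaddr"]


class AfOpsBefore(ctypes.Structure):
    # 2-byte member followed by a 4-byte-aligned member leaves a 2-byte hole.
    _fields_ = ([(name, FuncPtr32) for name in HEAD]
                + [("net_header_len", ctypes.c_uint16)]
                + [(name, FuncPtr32) for name in TAIL]
                + [("sockaddr_len", ctypes.c_int)])


class AfOpsAfter(ctypes.Structure):
    # sockaddr_len shrunk to 16 bits and moved into the former hole.
    _fields_ = ([(name, FuncPtr32) for name in HEAD]
                + [("net_header_len", ctypes.c_uint16),
                   ("sockaddr_len", ctypes.c_uint16)]
                + [(name, FuncPtr32) for name in TAIL])


print(ctypes.sizeof(AfOpsBefore), AfOpsBefore.sockaddr_len.offset)
print(ctypes.sizeof(AfOpsAfter), AfOpsAfter.sockaddr_len.offset)
```

The sizes and offsets match the pahole listings: 52 bytes with sockaddr_len at offset 48 before, 48 bytes with sockaddr_len at offset 26 after.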
    • D
      [TCP]: Don't set SKB owner in tcp_transmit_skb(). · 93173112
      David S. Miller committed
      The data itself is already charged to the SKB, doing
      the skb_set_owner_w() just generates a lot of noise and
      extra atomics we don't really need.
      
      Lmbench improvements on lat_tcp are minimal:
      
      before:
      TCP latency using localhost: 23.2701 microseconds
      TCP latency using localhost: 23.1994 microseconds
      TCP latency using localhost: 23.2257 microseconds
      
      after:
      TCP latency using localhost: 22.8380 microseconds
      TCP latency using localhost: 22.9465 microseconds
      TCP latency using localhost: 22.8462 microseconds
      Signed-off-by: David S. Miller <davem@davemloft.net>
      93173112
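To put "minimal" in numbers, the three lat_tcp samples on each side can be averaged. This is only arithmetic on the figures quoted in the commit message; the helper name `improvement` is hypothetical.

```python
from statistics import mean

# lat_tcp samples quoted in the commit message, in microseconds.
before = [23.2701, 23.1994, 23.2257]
after = [22.8380, 22.9465, 22.8462]


def improvement(before, after):
    """Return (absolute gain in microseconds, relative gain in percent)."""
    b, a = mean(before), mean(after)
    return b - a, (b - a) / b * 100.0


gain_us, gain_pct = improvement(before, after)
print(f"{gain_us:.4f} us faster ({gain_pct:.2f}%)")
```

The means differ by roughly 0.35 microseconds, about 1.5% of the original latency.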
  18. 29 Sep, 2006 2 commits
  19. 23 Sep, 2006 1 commit
    • A
      [TCP]: Send ACKs each 2nd received segment. · 1ef9696c
      Alexey Kuznetsov committed
      It does not affect either mss-sized connections (obviously) or
      connections controlled by Nagle (because there is only one small
      segment in flight).
      
      The idea is to record the fact that a small segment arrives on a
      connection, where one small segment has already been received and
      still not-ACKed. In this case ACK is forced after tcp_recvmsg() drains
      receive buffer.
      
      In other words, it is a "soft" each-2nd-segment ACK, which is enough
      to preserve ACK clock even when ABC is enabled.
      Signed-off-by: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      1ef9696c
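The "soft" each-2nd-segment rule described above can be expressed as a small state machine. This is a toy model of the idea only, not the kernel's implementation: the class `AckState`, its field names, and the method names are hypothetical, and delayed-ACK timers and full-sized segments are deliberately left out.

```python
class AckState:
    """Toy receiver state for the 'ACK each 2nd small segment' idea."""

    def __init__(self, mss):
        self.mss = mss
        self.small_pending = False   # a small segment is received but not ACKed
        self.second_small = False    # another small segment arrived meanwhile
        self.acks_sent = 0

    def on_segment(self, length):
        """Record an arriving segment; only sub-MSS segments matter here."""
        if length < self.mss:
            if self.small_pending:
                self.second_small = True
            self.small_pending = True

    def on_recvmsg_drained(self):
        """Called when the application has drained the receive buffer.

        Forces an ACK only if a second small segment was recorded.
        """
        if self.second_small:
            self.acks_sent += 1
            self.small_pending = False
            self.second_small = False
            return True
        return False


rx = AckState(mss=1460)
rx.on_segment(100)
print(rx.on_recvmsg_drained())
rx.on_segment(100)
print(rx.on_recvmsg_drained())
```

In this model a single small segment does not force an ACK when the buffer is drained, while a second small segment arriving before the first is acknowledged does, which is what keeps the ACK clock running.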
  20. 21 Mar, 2006 5 commits
  21. 11 Jan, 2006 1 commit
  22. 04 Jan, 2006 6 commits
  23. 09 Oct, 2005 1 commit
  24. 30 Aug, 2005 5 commits