1. 26 Aug 2010, 1 commit
    • tcp: fix three tcp sysctls tuning · c5ed63d6
      Committed by Eric Dumazet
      As discovered by Anton Blanchard, current code to autotune 
      tcp_death_row.sysctl_max_tw_buckets, sysctl_tcp_max_orphans and
      sysctl_max_syn_backlog makes little sense.
      
      The bigger a page is, the smaller tcp_max_orphans becomes: 4096 on a 512GB
      machine in Anton's case.
      
      (tcp_hashinfo.bhash_size * sizeof(struct inet_bind_hashbucket))
      is much bigger if spinlock debugging is on. It's wrong to select bigger
      limits in this case (where kernel structures are also bigger).
      
      bhash_size max is 65536, and we get this value even for small machines. 
      
      A better basis is the size of the ehash table; this also makes the code
      shorter and more obvious.
      
      Based on a patch from Anton, and another from David.
      Reported-and-tested-by: Anton Blanchard <anton@samba.org>
      Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      c5ed63d6
  2. 25 Aug 2010, 1 commit
  3. 24 Aug 2010, 1 commit
  4. 18 Aug 2010, 1 commit
  5. 08 Aug 2010, 1 commit
  6. 03 Aug 2010, 2 commits
  7. 02 Aug 2010, 4 commits
  8. 31 Jul 2010, 1 commit
  9. 23 Jul 2010, 4 commits
  10. 22 Jul 2010, 1 commit
  11. 20 Jul 2010, 1 commit
  12. 16 Jul 2010, 1 commit
  13. 15 Jul 2010, 1 commit
  14. 13 Jul 2010, 2 commits
  15. 09 Jul 2010, 1 commit
    • gre: propagate ipv6 transport class · dd4ba83d
      Committed by Stephen Hemminger
      This patch makes an IPv6-over-IPv4 GRE tunnel propagate the traffic
      class field from the underlying IPv6 header to the IPv4 Type Of Service
      field. Without the patch, all IPv6 packets in the tunnel look the same to QoS.

      This assumes that the IPv6 traffic class is exactly the same
      as the IPv4 TOS. Not sure if that is always the case?  Maybe some
      bits need to be masked off.

      The mask and shift to get tclass are copied from ipv6/datagram.c.
      Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      dd4ba83d
  16. 08 Jul 2010, 1 commit
  17. 06 Jul 2010, 1 commit
  18. 05 Jul 2010, 3 commits
  19. 01 Jul 2010, 2 commits
    • fragment: add fast path for in-order fragments · d6bebca9
      Committed by Changli Gao
      add fast path for in-order fragments
      
      As fragments are sent in order by most OSes, such as Windows, Darwin and
      FreeBSD, it is likely that new fragments arrive at the end of the
      inet_frag_queue.  In the fast path, we check whether the skb at the end
      of the inet_frag_queue is the prev we expect.
      Signed-off-by: Changli Gao <xiaosuo@gmail.com>
      ----
       include/net/inet_frag.h |    1 +
       net/ipv4/ip_fragment.c  |   12 ++++++++++++
       net/ipv6/reassembly.c   |   11 +++++++++++
       3 files changed, 24 insertions(+)
      Signed-off-by: David S. Miller <davem@davemloft.net>
      d6bebca9
    • snmp: 64bit ipstats_mib for all arches · 4ce3c183
      Committed by Eric Dumazet
      /proc/net/snmp and /proc/net/netstat expose SNMP counters.
      
      The width of these counters is either 32 or 64 bits, depending on the
      size of "unsigned long" in the kernel.

      This means a user program parsing these files must already be prepared
      to deal with 64-bit values, regardless of whether the program itself is
      32-bit or 64-bit.

      This patch introduces 64-bit snmp values for the IPSTAT mib, where some
      counters can wrap pretty fast if they are 32 bits wide.
      
      # netstat -s|egrep "InOctets|OutOctets"
          InOctets: 244068329096
          OutOctets: 244069348848
      Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      4ce3c183
  20. 29 Jun 2010, 2 commits
  21. 28 Jun 2010, 2 commits
  22. 27 Jun 2010, 2 commits
  23. 26 Jun 2010, 2 commits
  24. 25 Jun 2010, 1 commit
    • tcp: do not send reset to already closed sockets · 565b7b2d
      Committed by Konstantin Khorenko
      I've found that tcp_close() can be called for an already closed
      socket, but it still sends a reset in this case (tcp_send_active_reset()),
      which seems to be incorrect.  Moreover, the reset packet is sent
      with a different source port, as the original port number has already
      been cleared on the socket.  Besides that, incrementing the stat counter
      for LINUX_MIB_TCPABORTONCLOSE also does not look correct in this case.

      Initially this issue was found on a 2.6.18-x RHEL5 kernel, but the same
      seems to be true for the current mainstream kernel (checked on
      2.6.35-rc3).  Please correct me if I missed something.
      
      How that happens:
      
      1) the server receives a packet for socket in TCP_CLOSE_WAIT state
         that triggers a tcp_reset():
      
      Call Trace:
       <IRQ>  [<ffffffff8025b9b9>] tcp_reset+0x12f/0x1e8
       [<ffffffff80046125>] tcp_rcv_state_process+0x1c0/0xa08
       [<ffffffff8003eb22>] tcp_v4_do_rcv+0x310/0x37a
       [<ffffffff80028bea>] tcp_v4_rcv+0x74d/0xb43
       [<ffffffff8024ef4c>] ip_local_deliver_finish+0x0/0x259
       [<ffffffff80037131>] ip_local_deliver+0x200/0x2f4
       [<ffffffff8003843c>] ip_rcv+0x64c/0x69f
       [<ffffffff80021d89>] netif_receive_skb+0x4c4/0x4fa
       [<ffffffff80032eca>] process_backlog+0x90/0xec
       [<ffffffff8000cc50>] net_rx_action+0xbb/0x1f1
       [<ffffffff80012d3a>] __do_softirq+0xf5/0x1ce
       [<ffffffff8001147a>] handle_IRQ_event+0x56/0xb0
       [<ffffffff8006334c>] call_softirq+0x1c/0x28
       [<ffffffff80070476>] do_softirq+0x2c/0x85
       [<ffffffff80070441>] do_IRQ+0x149/0x152
       [<ffffffff80062665>] ret_from_intr+0x0/0xa
       <EOI>  [<ffffffff80008a2e>] __handle_mm_fault+0x6cd/0x1303
       [<ffffffff80008903>] __handle_mm_fault+0x5a2/0x1303
       [<ffffffff80033a9d>] cache_free_debugcheck+0x21f/0x22e
       [<ffffffff8006a263>] do_page_fault+0x49a/0x7dc
       [<ffffffff80066487>] thread_return+0x89/0x174
       [<ffffffff800c5aee>] audit_syscall_exit+0x341/0x35c
       [<ffffffff80062e39>] error_exit+0x0/0x84
      
      tcp_rcv_state_process()
      ...  // (sk_state == TCP_CLOSE_WAIT here)
      ...
              /* step 2: check RST bit */
              if(th->rst) {
                      tcp_reset(sk);
                      goto discard;
              }
      ...
      ---------------------------------
      tcp_rcv_state_process
       tcp_reset
        tcp_done
         tcp_set_state(sk, TCP_CLOSE);
           inet_put_port
            __inet_put_port
             inet_sk(sk)->num = 0;
      
         sk->sk_shutdown = SHUTDOWN_MASK;
      
      2) After that the process (socket owner) tries to write something to
         that socket and "inet_autobind" sets a _new_ (which differs from
         the original!) port number for the socket:
      
       Call Trace:
        [<ffffffff80255a12>] inet_bind_hash+0x33/0x5f
        [<ffffffff80257180>] inet_csk_get_port+0x216/0x268
        [<ffffffff8026bcc9>] inet_autobind+0x22/0x8f
        [<ffffffff80049140>] inet_sendmsg+0x27/0x57
        [<ffffffff8003a9d9>] do_sock_write+0xae/0xea
        [<ffffffff80226ac7>] sock_writev+0xdc/0xf6
        [<ffffffff800680c7>] _spin_lock_irqsave+0x9/0xe
        [<ffffffff8001fb49>] __pollwait+0x0/0xdd
        [<ffffffff8008d533>] default_wake_function+0x0/0xe
        [<ffffffff800a4f10>] autoremove_wake_function+0x0/0x2e
        [<ffffffff800f0b49>] do_readv_writev+0x163/0x274
        [<ffffffff80066538>] thread_return+0x13a/0x174
        [<ffffffff800145d8>] tcp_poll+0x0/0x1c9
        [<ffffffff800c56d3>] audit_syscall_entry+0x180/0x1b3
        [<ffffffff800f0dd0>] sys_writev+0x49/0xe4
        [<ffffffff800622dd>] tracesys+0xd5/0xe0
      
      3) sendmsg fails at last with -EPIPE (=> 'write' returns -EPIPE in userspace):
      
      F: tcp_sendmsg1 -EPIPE: sk=ffff81000bda00d0, sport=49847, old_state=7, new_state=7, sk_err=0, sk_shutdown=3
      
      Call Trace:
       [<ffffffff80027557>] tcp_sendmsg+0xcb/0xe87
       [<ffffffff80033300>] release_sock+0x10/0xae
       [<ffffffff8016f20f>] vgacon_cursor+0x0/0x1a7
       [<ffffffff8026bd32>] inet_autobind+0x8b/0x8f
       [<ffffffff8003a9d9>] do_sock_write+0xae/0xea
       [<ffffffff80226ac7>] sock_writev+0xdc/0xf6
       [<ffffffff800680c7>] _spin_lock_irqsave+0x9/0xe
       [<ffffffff8001fb49>] __pollwait+0x0/0xdd
       [<ffffffff8008d533>] default_wake_function+0x0/0xe
       [<ffffffff800a4f10>] autoremove_wake_function+0x0/0x2e
       [<ffffffff800f0b49>] do_readv_writev+0x163/0x274
       [<ffffffff80066538>] thread_return+0x13a/0x174
       [<ffffffff800145d8>] tcp_poll+0x0/0x1c9
       [<ffffffff800c56d3>] audit_syscall_entry+0x180/0x1b3
       [<ffffffff800f0dd0>] sys_writev+0x49/0xe4
       [<ffffffff800622dd>] tracesys+0xd5/0xe0
      
      tcp_sendmsg()
      ...
              /* Wait for a connection to finish. */
              if ((1 << sk->sk_state) & ~(TCPF_ESTABLISHED | TCPF_CLOSE_WAIT)) {
                      int old_state = sk->sk_state;
                      if ((err = sk_stream_wait_connect(sk, &timeo)) != 0) {
      if (f_d && (err == -EPIPE)) {
              printk("F: tcp_sendmsg1 -EPIPE: sk=%p, sport=%u, old_state=%d, new_state=%d, "
                      "sk_err=%d, sk_shutdown=%d\n",
                      sk, ntohs(inet_sk(sk)->sport), old_state, sk->sk_state,
                      sk->sk_err, sk->sk_shutdown);
              dump_stack();
      }
                              goto out_err;
                      }
              }
      ...
      
      4) Then the process (socket owner) understands that it's time to close
         that socket and does that (and thus triggers sending reset packet):
      
      Call Trace:
      ...
       [<ffffffff80032077>] dev_queue_xmit+0x343/0x3d6
       [<ffffffff80034698>] ip_output+0x351/0x384
       [<ffffffff80251ae9>] dst_output+0x0/0xe
       [<ffffffff80036ec6>] ip_queue_xmit+0x567/0x5d2
       [<ffffffff80095700>] vprintk+0x21/0x33
       [<ffffffff800070f0>] check_poison_obj+0x2e/0x206
       [<ffffffff80013587>] poison_obj+0x36/0x45
       [<ffffffff8025dea6>] tcp_send_active_reset+0x15/0x14d
       [<ffffffff80023481>] dbg_redzone1+0x1c/0x25
       [<ffffffff8025dea6>] tcp_send_active_reset+0x15/0x14d
       [<ffffffff8000ca94>] cache_alloc_debugcheck_after+0x189/0x1c8
       [<ffffffff80023405>] tcp_transmit_skb+0x764/0x786
       [<ffffffff8025df8a>] tcp_send_active_reset+0xf9/0x14d
       [<ffffffff80258ff1>] tcp_close+0x39a/0x960
       [<ffffffff8026be12>] inet_release+0x69/0x80
       [<ffffffff80059b31>] sock_release+0x4f/0xcf
       [<ffffffff80059d4c>] sock_close+0x2c/0x30
       [<ffffffff800133c9>] __fput+0xac/0x197
       [<ffffffff800252bc>] filp_close+0x59/0x61
       [<ffffffff8001eff6>] sys_close+0x85/0xc7
       [<ffffffff800622dd>] tracesys+0xd5/0xe0
      
      So, in brief:
      
      * a received packet for a socket in the TCP_CLOSE_WAIT state triggers
        tcp_reset(), which clears inet_sk(sk)->num and puts the socket into
        the TCP_CLOSE state

      * an attempt to write to that socket forces inet_autobind() to get a
        new port (but the write itself fails with -EPIPE)

      * tcp_close() called for a socket in the TCP_CLOSE state sends an active
        reset via the socket with the newly allocated port
      
      This adds an additional check in tcp_close() for already closed
      sockets. We do not want to send anything to closed sockets.
      Signed-off-by: Konstantin Khorenko <khorenko@openvz.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      565b7b2d
  25. 24 Jun 2010, 1 commit