1. 21 Sep, 2008 1 commit
    • tcp: advertise MSS requested by user · f5fff5dc
      Tom Quetchenbach committed
      I'm trying to use the TCP_MAXSEG option to setsockopt() to set the MSS
      for both sides of a bidirectional connection.
      
      man tcp says: "If this option is set before connection establishment, it
      also changes the MSS value announced to the other end in the initial
      packet."
      
      However, the kernel only uses the MTU/route cache to set the advertised
      MSS. That means if I set the MSS to, say, 500 before calling connect(),
      I will send at most 500-byte packets, but I will still receive 1500-byte
      packets in reply.
      
      This is a bug, either in the kernel or the documentation.
      
      This patch (applies to latest net-2.6) reduces the advertised value to
      that requested by the user as long as setsockopt() is called before
      connect() or accept(). This seems like the behavior that one would
      expect as well as that which is documented.
      
      I've tried to make sure that things that depend on the advertised MSS
      are set correctly.
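
The usage described above can be sketched as follows. This is a minimal user-space illustration, not part of the patch: the helper name tcp_socket_with_mss is hypothetical, and the example only shows setting TCP_MAXSEG on a fresh socket before connect() would be called.

```c
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>
#include <unistd.h>

/* Hypothetical helper: create a TCP socket and request `mss` through
 * TCP_MAXSEG before the connection is established.
 * Returns the socket descriptor, or -1 on failure. */
int tcp_socket_with_mss(int mss)
{
    int fd = socket(AF_INET, SOCK_STREAM, 0);
    if (fd < 0)
        return -1;

    /* Must happen before connect() (or before listen()/accept() on a
     * server) for the value to influence the MSS announced in the SYN. */
    if (setsockopt(fd, IPPROTO_TCP, TCP_MAXSEG, &mss, sizeof(mss)) < 0) {
        close(fd);
        return -1;
    }
    return fd;
}
```

A caller would then connect() the returned descriptor as usual; with the behavior described in this commit, the peer should see the requested value in the MSS option of the initial packet.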
      Signed-off-by: Tom Quetchenbach <virtualphtn@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  2. 09 Sep, 2008 1 commit
    • netns : fix kernel panic in timewait socket destruction · d315492b
      Daniel Lezcano committed
      How to reproduce:
       - create a network namespace
       - use tcp protocol and get timewait socket
       - exit the network namespace
       - after a moment (when the timewait socket is destroyed), the kernel
         panics.
      
      # BUG: unable to handle kernel NULL pointer dereference at
      0000000000000007
      IP: [<ffffffff821e394d>] inet_twdr_do_twkill_work+0x6e/0xb8
      PGD 119985067 PUD 11c5c0067 PMD 0
      Oops: 0000 [1] SMP
      CPU 1
      Modules linked in: ipv6 button battery ac loop dm_mod tg3 libphy ext3 jbd
      edd fan thermal processor thermal_sys sg sata_svw libata dock serverworks
      sd_mod scsi_mod ide_disk ide_core [last unloaded: freq_table]
      Pid: 0, comm: swapper Not tainted 2.6.27-rc2 #3
      RIP: 0010:[<ffffffff821e394d>] [<ffffffff821e394d>]
      inet_twdr_do_twkill_work+0x6e/0xb8
      RSP: 0018:ffff88011ff7fed0 EFLAGS: 00010246
      RAX: ffffffffffffffff RBX: ffffffff82339420 RCX: ffff88011ff7ff30
      RDX: 0000000000000001 RSI: ffff88011a4d03c0 RDI: ffff88011ac2fc00
      RBP: ffffffff823392e0 R08: 0000000000000000 R09: ffff88002802a200
      R10: ffff8800a5c4b000 R11: ffffffff823e4080 R12: ffff88011ac2fc00
      R13: 0000000000000001 R14: 0000000000000001 R15: 0000000000000000
      FS: 0000000041cbd940(0000) GS:ffff8800bff839c0(0000)
      knlGS:0000000000000000
      CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b
      CR2: 0000000000000007 CR3: 00000000bd87c000 CR4: 00000000000006e0
      DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
      Process swapper (pid: 0, threadinfo ffff8800bff9e000, task
      ffff88011ff76690)
      Stack: ffffffff823392e0 0000000000000100 ffffffff821e3a3a
      0000000000000008
      0000000000000000 ffffffff821e3a61 ffff8800bff7c000 ffffffff8203c7e7
      ffff88011ff7ff10 ffff88011ff7ff10 0000000000000021 ffffffff82351108
      Call Trace:
      <IRQ> [<ffffffff821e3a3a>] ? inet_twdr_hangman+0x0/0x9e
      [<ffffffff821e3a61>] ? inet_twdr_hangman+0x27/0x9e
      [<ffffffff8203c7e7>] ? run_timer_softirq+0x12c/0x193
      [<ffffffff820390d1>] ? __do_softirq+0x5e/0xcd
      [<ffffffff8200d08c>] ? call_softirq+0x1c/0x28
      [<ffffffff8200e611>] ? do_softirq+0x2c/0x68
      [<ffffffff8201a055>] ? smp_apic_timer_interrupt+0x8e/0xa9
      [<ffffffff8200cad6>] ? apic_timer_interrupt+0x66/0x70
      <EOI> [<ffffffff82011f4c>] ? default_idle+0x27/0x3b
      [<ffffffff8200abbd>] ? cpu_idle+0x5f/0x7d
      
      
      Code: e8 01 00 00 4c 89 e7 41 ff c5 e8 8d fd ff ff 49 8b 44 24 38 4c 89 e7
      65 8b 14 25 24 00 00 00 89 d2 48 8b 80 e8 00 00 00 48 f7 d0 <48> 8b 04 d0
      48 ff 40 58 e8 fc fc ff ff 48 89 df e8 c0 5f 04 00
      RIP [<ffffffff821e394d>] inet_twdr_do_twkill_work+0x6e/0xb8
      RSP <ffff88011ff7fed0>
      CR2: 0000000000000007
      
      This patch provides a function to purge all timewait sockets related
      to a network namespace. The life cycle of a timewait socket is not tied
      to the network namespace, so timewait sockets stay alive while the
      network namespace dies. Timewait sockets exist to avoid receiving
      duplicate packets from the network; once the network namespace is
      freed, its network stack is removed, so there is no chance of receiving
      any packets from the outside world. Furthermore, leaving a destruction
      timer pending on these sockets after the network namespace has been
      freed is not safe: it leads to an oops when the timer callback tries to
      access data belonging to the namespace, for example in:
      	inet_twdr_do_twkill_work
      		-> NET_INC_STATS_BH(twsk_net(tw), LINUX_MIB_TIMEWAITED);
      
      Purging the timewait sockets at the network namespace destruction will:
       1) speed up memory freeing for the namespace
       2) fix kernel panic on asynchronous timewait destruction
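
The purge idea can be modeled in user space as follows. This is only a sketch under stated assumptions: struct tw_sock, net_id, and the tw_* helpers are hypothetical stand-ins for the kernel's timewait structures, and the real code also has to deal with hash tables, locking, and timers.

```c
#include <stdlib.h>

/* Hypothetical stand-in for a timewait socket: only the owning
 * namespace (an integer id here) and the list link are modeled. */
struct tw_sock {
    int net_id;
    struct tw_sock *next;
};

/* Push a new entry for namespace `net_id`; returns the new list head. */
struct tw_sock *tw_add(struct tw_sock *head, int net_id)
{
    struct tw_sock *tw = malloc(sizeof(*tw));
    if (!tw)
        return head;
    tw->net_id = net_id;
    tw->next = head;
    return tw;
}

/* Unlink and free every entry that belongs to `net_id`, so that no
 * entry can later touch data of a namespace that is already gone.
 * Returns the number of entries purged. */
int tw_purge(struct tw_sock **head, int net_id)
{
    int purged = 0;
    struct tw_sock **link = head;

    while (*link) {
        struct tw_sock *tw = *link;
        if (tw->net_id == net_id) {
            *link = tw->next;   /* unlink before freeing */
            free(tw);
            purged++;
        } else {
            link = &tw->next;
        }
    }
    return purged;
}

/* Count the remaining entries. */
int tw_count(const struct tw_sock *head)
{
    int n = 0;
    for (; head; head = head->next)
        n++;
    return n;
}
```

Calling tw_purge() when a namespace is torn down removes its entries immediately instead of waiting for each one to expire.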
      Signed-off-by: Daniel Lezcano <dlezcano@fr.ibm.com>
      Acked-by: Denis V. Lunev <den@openvz.org>
      Acked-by: Eric W. Biederman <ebiederm@xmission.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  3. 28 Aug, 2008 1 commit
    • tcp: Skip empty hash buckets faster in /proc/net/tcp · 6eac5604
      Andi Kleen committed
      On most systems most of the TCP established/time-wait hash buckets are empty.
      When walking the hash table for /proc/net/tcp their read locks would
      always be acquired just to find out they're empty. This patch changes the code
      to first check whether a bucket has any entries before taking the lock, which
      is much cheaper than taking the lock. Since the hash tables are large
      this makes a measurable difference when processing /proc/net/tcp,
      especially on architectures with a slow read_lock (e.g. PPC).
      
      On a 2GB Core2 system time cat /proc/net/tcp > /dev/null (with a mostly
      empty hash table) goes from 0.046s to 0.005s.
      
      On systems with slower atomics (like P4 or POWER4) or larger hash tables
      (more RAM) the difference is much higher.
      
      This can be noticeable because some daemons regularly scan
      /proc/net/tcp.
      
      Original idea for this patch from Marcus Meissner, but redone by me.
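
The check-before-lock pattern described in this commit can be sketched as follows. This is a single-threaded model, not the kernel code: struct bucket, walk_table, and the counter that stands in for the read lock are all hypothetical names introduced for illustration.

```c
#include <stddef.h>

struct entry {
    struct entry *next;
    int id;
};

/* Hypothetical bucket: the lock is modeled by a counter so the number
 * of acquisitions can be observed. */
struct bucket {
    struct entry *chain;
    unsigned long lock_taken;
};

static void bucket_lock(struct bucket *b)   { b->lock_taken++; }
static void bucket_unlock(struct bucket *b) { (void)b; }

/* Walk every bucket and count entries, skipping empty buckets without
 * touching their lock. Returns the number of entries seen. */
size_t walk_table(struct bucket *table, size_t nbuckets)
{
    size_t seen = 0;

    for (size_t i = 0; i < nbuckets; i++) {
        struct bucket *b = &table[i];

        if (b->chain == NULL)   /* cheap unlocked emptiness check */
            continue;

        bucket_lock(b);
        for (struct entry *e = b->chain; e; e = e->next)
            seen++;
        bucket_unlock(b);
    }
    return seen;
}
```

The unlocked check can race with a concurrent insert, which is acceptable for a snapshot-style reader: an entry added at that instant is simply not reported in this pass.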
      Signed-off-by: Andi Kleen <ak@suse.de>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  4. 07 Aug, 2008 1 commit
    • tcp: Fix kernel panic when calling tcp_v(4/6)_md5_do_lookup · 6edafaaf
      Gui Jianfeng committed
      If the following packet flow happens, the kernel will panic.
      MachineA                        MachineB
                      SYN
              ---------------------->
                      SYN+ACK
              <----------------------
                      ACK(bad seq)
              ---------------------->
      When an ACK with a bad sequence number is received,
      tcp_v4_md5_do_lookup(skb->sk, ip_hdr(skb)->daddr) is eventually called
      by tcp_v4_reqsk_send_ack(), but its first parameter (skb->sk) is NULL
      at that moment, so the kernel panics.
      This patch fixes this bug.
      
      The oops output is as follows:
      [  302.812793] IP: [<c05cfaa6>] tcp_v4_md5_do_lookup+0x12/0x42
      [  302.817075] Oops: 0000 [#1] SMP 
      [  302.819815] Modules linked in: ipv6 loop dm_multipath rtc_cmos rtc_core rtc_lib pcspkr pcnet32 mii i2c_piix4 parport_pc i2c_core parport ac button ata_piix libata dm_mod mptspi mptscsih mptbase scsi_transport_spi sd_mod scsi_mod crc_t10dif ext3 jbd mbcache uhci_hcd ohci_hcd ehci_hcd [last unloaded: scsi_wait_scan]
      [  302.849946] 
      [  302.851198] Pid: 0, comm: swapper Not tainted (2.6.27-rc1-guijf #5)
      [  302.855184] EIP: 0060:[<c05cfaa6>] EFLAGS: 00010296 CPU: 0
      [  302.858296] EIP is at tcp_v4_md5_do_lookup+0x12/0x42
      [  302.861027] EAX: 0000001e EBX: 00000000 ECX: 00000046 EDX: 00000046
      [  302.864867] ESI: ceb69e00 EDI: 1467a8c0 EBP: cf75f180 ESP: c0792e54
      [  302.868333]  DS: 007b ES: 007b FS: 00d8 GS: 0000 SS: 0068
      [  302.871287] Process swapper (pid: 0, ti=c0792000 task=c0712340 task.ti=c0746000)
      [  302.875592] Stack: c06f413a 00000000 cf75f180 ceb69e00 00000000 c05d0d86 000016d0 ceac5400 
      [  302.883275]        c05d28f8 000016d0 ceb69e00 ceb69e20 681bf6e3 00001000 00000000 0a67a8c0 
      [  302.890971]        ceac5400 c04250a3 c06f413a c0792eb0 c0792edc cf59a620 cf59a620 cf59a634 
      [  302.900140] Call Trace:
      [  302.902392]  [<c05d0d86>] tcp_v4_reqsk_send_ack+0x17/0x35
      [  302.907060]  [<c05d28f8>] tcp_check_req+0x156/0x372
      [  302.910082]  [<c04250a3>] printk+0x14/0x18
      [  302.912868]  [<c05d0aa1>] tcp_v4_do_rcv+0x1d3/0x2bf
      [  302.917423]  [<c05d26be>] tcp_v4_rcv+0x563/0x5b9
      [  302.920453]  [<c05bb20f>] ip_local_deliver_finish+0xe8/0x183
      [  302.923865]  [<c05bb10a>] ip_rcv_finish+0x286/0x2a3
      [  302.928569]  [<c059e438>] dev_alloc_skb+0x11/0x25
      [  302.931563]  [<c05a211f>] netif_receive_skb+0x2d6/0x33a
      [  302.934914]  [<d0917941>] pcnet32_poll+0x333/0x680 [pcnet32]
      [  302.938735]  [<c05a3b48>] net_rx_action+0x5c/0xfe
      [  302.941792]  [<c042856b>] __do_softirq+0x5d/0xc1
      [  302.944788]  [<c042850e>] __do_softirq+0x0/0xc1
      [  302.948999]  [<c040564b>] do_softirq+0x55/0x88
      [  302.951870]  [<c04501b1>] handle_fasteoi_irq+0x0/0xa4
      [  302.954986]  [<c04284da>] irq_exit+0x35/0x69
      [  302.959081]  [<c0405717>] do_IRQ+0x99/0xae
      [  302.961896]  [<c040422b>] common_interrupt+0x23/0x28
      [  302.966279]  [<c040819d>] default_idle+0x2a/0x3d
      [  302.969212]  [<c0402552>] cpu_idle+0xb2/0xd2
      [  302.972169]  =======================
      [  302.974274] Code: fc ff 84 d2 0f 84 df fd ff ff e9 34 fe ff ff 83 c4 0c 5b 5e 5f 5d c3 90 90 57 89 d7 56 53 89 c3 50 68 3a 41 6f c0 e8 e9 55 e5 ff <8b> 93 9c 04 00 00 58 85 d2 59 74 1e 8b 72 10 31 db 31 c9 85 f6 
      [  303.011610] EIP: [<c05cfaa6>] tcp_v4_md5_do_lookup+0x12/0x42 SS:ESP 0068:c0792e54
      [  303.018360] Kernel panic - not syncing: Fatal exception in interrupt
      Signed-off-by: Gui Jianfeng <guijianfeng@cn.fujitsu.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  5. 01 Aug, 2008 1 commit
  6. 30 Jul, 2008 1 commit
  7. 26 Jul, 2008 1 commit
  8. 19 Jul, 2008 2 commits
    • tcp: fix kernel panic with listening_get_next · bdccc4ca
      Daniel Lezcano committed
      # BUG: unable to handle kernel NULL pointer dereference at
      0000000000000038
      IP: [<ffffffff821ed01e>] listening_get_next+0x50/0x1b3
      PGD 11e4b9067 PUD 11d16c067 PMD 0
      Oops: 0000 [1] SMP
      last sysfs file: /sys/devices/system/cpu/cpu3/cache/index2/shared_cpu_map
      CPU 3
      Modules linked in: bridge ipv6 button battery ac loop dm_mod tg3 ext3
      jbd edd fan thermal processor thermal_sys hwmon sg sata_svw libata dock
      serverworks sd_mod scsi_mod ide_disk ide_core [last unloaded: freq_table]
      Pid: 3368, comm: slpd Not tainted 2.6.26-rc2-mm1-lxc4 #1
      RIP: 0010:[<ffffffff821ed01e>] [<ffffffff821ed01e>]
      listening_get_next+0x50/0x1b3
      RSP: 0018:ffff81011e1fbe18 EFLAGS: 00010246
      RAX: 0000000000000000 RBX: ffff8100be0ad3c0 RCX: ffff8100619f50c0
      RDX: ffffffff82475be0 RSI: ffff81011d9ae6c0 RDI: ffff8100be0ad508
      RBP: ffff81011f4f1240 R08: 00000000ffffffff R09: ffff8101185b6780
      R10: 000000000000002d R11: ffffffff820fdbfa R12: ffff8100be0ad3c8
      R13: ffff8100be0ad6a0 R14: ffff8100be0ad3c0 R15: ffffffff825b8ce0
      FS: 00007f6a0ebd16d0(0000) GS:ffff81011f424540(0000)
      knlGS:0000000000000000
      CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
      CR2: 0000000000000038 CR3: 000000011dc20000 CR4: 00000000000006e0
      DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
      Process slpd (pid: 3368, threadinfo ffff81011e1fa000, task
      ffff81011f4b8660)
      Stack: 00000000000002ee ffff81011f5a57c0 ffff81011f4f1240
      ffff81011e1fbe90
      0000000000001000 0000000000000000 00007fff16bf2590 ffffffff821ed9c8
      ffff81011f5a57c0 ffff81011d9ae6c0 000000000000041a ffffffff820b0abd
      Call Trace:
      [<ffffffff821ed9c8>] ? tcp_seq_next+0x34/0x7e
      [<ffffffff820b0abd>] ? seq_read+0x1aa/0x29d
      [<ffffffff820d21b4>] ? proc_reg_read+0x73/0x8e
      [<ffffffff8209769c>] ? vfs_read+0xaa/0x152
      [<ffffffff82097a7d>] ? sys_read+0x45/0x6e
      [<ffffffff8200bd2b>] ? system_call_after_swapgs+0x7b/0x80
      
      
      Code: 31 a9 25 00 e9 b5 00 00 00 ff 45 20 83 7d 0c 01 75 79 4c 8b 75 10
      48 8b 0e eb 1d 48 8b 51 20 0f b7 45 08 39 02 75 0e 48 8b 41 28 <4c> 39
      78 38 0f 84 93 00 00 00 48 8b 09 48 85 c9 75 de 8b 55 1c
      RIP [<ffffffff821ed01e>] listening_get_next+0x50/0x1b3
      RSP <ffff81011e1fbe18>
      CR2: 0000000000000038
      
      This kernel panic appears with CONFIG_NET_NS=y.
      
      How to reproduce:
      
          On the buggy host (host A)
             * ip addr add 1.2.3.4/24 dev eth0
      
          On a remote host (host B)
             * ip addr add 1.2.3.5/24 dev eth0
             * iptables -A INPUT -p tcp -s 1.2.3.4 -j DROP
             * ssh 1.2.3.4
      
          On host A:
             * netstat -ta or cat /proc/net/tcp
      
      This bug happens when reading /proc/net/tcp[6] while there is a req_sock
      in the SYN_RECV state.
      
      When a SYN is received, the minisock is created and its sk field is set
      to NULL. In the listening_get_next function, we try to look at the field
      req->sk->sk_net.
      
      When looking at how to fix this bug, I noticed that it is useless to
      check whether the minisock belongs to the namespace. A minisock belongs
      to a listen point, and a listen point is per namespace, so the minisocks
      found while browsing a listen point always belong to its namespace.
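
The reasoning above can be sketched in a small user-space model. The struct and function names below are hypothetical simplifications of the kernel's listener and request-sock structures; the point is only that the namespace is checked once on the listener, and the pending requests are walked without dereferencing req->sk, which may still be NULL.

```c
#include <stddef.h>

/* Hypothetical, simplified stand-ins for the kernel structures. */
struct sock {
    int net_id;
};

struct request_sock {
    struct sock *sk;              /* NULL while in SYN_RECV */
    struct request_sock *next;
};

struct listener {
    int net_id;                   /* namespace of the listen point */
    struct request_sock *queue;   /* pending connection requests */
};

/* Count the pending requests of `l` that are visible from namespace
 * `net_id`. The namespace test is done on the listener only. */
size_t count_pending(const struct listener *l, int net_id)
{
    size_t n = 0;

    if (l->net_id != net_id)
        return 0;

    for (const struct request_sock *req = l->queue; req; req = req->next)
        n++;                      /* req->sk is never dereferenced */
    return n;
}
```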
      Signed-off-by: Daniel Lezcano <dlezcano@fr.ibm.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • tcp: Fix MD5 signatures for non-linear skbs · 49a72dfb
      Adam Langley committed
      Currently, the MD5 code assumes that the SKBs are linear and, in the case
      that they aren't, happily goes off and hashes off the end of the SKB and
      into random memory.
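
The difference between hashing a linear buffer and hashing a fragmented one can be sketched as follows. This is an illustration only: struct buf and struct frag are hypothetical simplifications of an skb, and FNV-1a is used as a stand-in for MD5 to keep the example short.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical fragment: a separate piece of payload memory. */
struct frag {
    const unsigned char *data;
    size_t len;
};

/* Hypothetical buffer: a linear head followed by fragments. */
struct buf {
    const unsigned char *head;
    size_t head_len;
    const struct frag *frags;
    size_t nr_frags;
};

/* FNV-1a, used here only as a stand-in for the real digest. */
static uint32_t fnv1a_update(uint32_t h, const unsigned char *p, size_t len)
{
    for (size_t i = 0; i < len; i++) {
        h ^= p[i];
        h *= 16777619u;
    }
    return h;
}

/* Hash a contiguous region. */
uint32_t hash_linear(const unsigned char *p, size_t len)
{
    return fnv1a_update(2166136261u, p, len);
}

/* Hash the head and then each fragment in order, never reading past
 * the end of any individual piece. */
uint32_t hash_buf(const struct buf *b)
{
    uint32_t h = fnv1a_update(2166136261u, b->head, b->head_len);

    for (size_t i = 0; i < b->nr_frags; i++)
        h = fnv1a_update(h, b->frags[i].data, b->frags[i].len);
    return h;
}
```

Treating such a buffer as linear would mean reading the total length starting at head, which runs past head_len into unrelated memory; walking the pieces gives the same digest as hashing the equivalent contiguous data.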
      
      Reported by Stephen Hemminger in [1]. Advice thanks to Stephen and Evgeniy
      Polyakov. Also includes a couple of missed route_caps from Stephen's patch
      in [2].
      
      [1] http://marc.info/?l=linux-netdev&m=121445989106145&w=2
      [2] http://marc.info/?l=linux-netdev&m=121459157816964&w=2
      Signed-off-by: Adam Langley <agl@imperialviolet.org>
      Acked-by: Stephen Hemminger <shemminger@vyatta.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  9. 17 Jul, 2008 4 commits
  10. 15 Jul, 2008 2 commits
  11. 28 Jun, 2008 1 commit
  12. 17 Jun, 2008 1 commit
  13. 15 Jun, 2008 1 commit
  14. 13 Jun, 2008 1 commit
    • tcp: Revert 'process defer accept as established' changes. · ec0a1966
      David S. Miller committed
      This reverts two changesets, ec3c0982
      ("[TCP]: TCP_DEFER_ACCEPT updates - process as established") and
      the follow-on bug fix 9ae27e0a
      ("tcp: Fix slab corruption with ipv6 and tcp6fuzz").
      
      This change causes several problems, first reported by Ingo Molnar
      as a distcc-over-loopback regression where connections were getting
      stuck.
      
      Ilpo Järvinen first spotted the locking problems.  The new function
      added by this code, tcp_defer_accept_check(), only has the
      child socket locked, yet it is modifying state of the parent
      listening socket.
      
      Fixing that is non-trivial at best, because we can't simply just grab
      the parent listening socket lock at this point, because it would
      create an ABBA deadlock.  The normal ordering is parent listening
      socket --> child socket, but this code path would require the
      reverse lock ordering.
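
The lock-ordering rule behind the ABBA concern can be sketched with a tiny order checker. This is a hypothetical user-space model, not kernel code: it tracks which lock ranks a context holds and rejects any acquisition that would go against the parent-before-child order.

```c
#include <stdbool.h>

/* Hypothetical ranks: a lower rank must be taken first. */
enum lock_rank {
    RANK_PARENT = 0,   /* parent listening socket */
    RANK_CHILD  = 1    /* child socket */
};

/* Bit r of held_mask is set while the lock of rank r is held. */
struct lock_ctx {
    unsigned held_mask;
};

/* Try to take the lock of rank `rank`. Fails if a lock of the same or
 * a higher rank is already held, because that would invert the order. */
bool ctx_acquire(struct lock_ctx *ctx, enum lock_rank rank)
{
    if ((ctx->held_mask >> rank) != 0)
        return false;
    ctx->held_mask |= 1u << rank;
    return true;
}

/* Release the lock of rank `rank`. */
void ctx_release(struct lock_ctx *ctx, enum lock_rank rank)
{
    ctx->held_mask &= ~(1u << rank);
}
```

In this model, a path that holds only the child lock and then asks for the parent lock is rejected. If two real threads take the same pair of locks in opposite orders, each can end up waiting for the lock the other holds.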
      
      Next is a problem noticed by Vitaliy Gusev, who noted:
      
      ----------------------------------------
      >--- a/net/ipv4/tcp_timer.c
      >+++ b/net/ipv4/tcp_timer.c
      >@@ -481,6 +481,11 @@ static void tcp_keepalive_timer (unsigned long data)
      > 		goto death;
      > 	}
      >
      >+	if (tp->defer_tcp_accept.request && sk->sk_state == TCP_ESTABLISHED) {
      >+		tcp_send_active_reset(sk, GFP_ATOMIC);
      >+		goto death;
      
      Here socket sk is not attached to listening socket's request queue. tcp_done()
      will not call inet_csk_destroy_sock() (and tcp_v4_destroy_sock() which should
      release this sk) as socket is not DEAD. Therefore socket sk will be lost for
      freeing.
      ----------------------------------------
      
      Finally, Alexey Kuznetsov argues that there might not even be any
      real value or advantage to these new semantics even if we fix all
      of the bugs:
      
      ----------------------------------------
      Hiding from accept() sockets with only out-of-order data only
      is the only thing which is impossible with old approach. Is this really
      so valuable? My opinion: no, this is nothing but a new loophole
      to consume memory without control.
      ----------------------------------------
      
      So revert this thing for now.
      Signed-off-by: David S. Miller <davem@davemloft.net>
  15. 12 Jun, 2008 5 commits
  16. 11 Jun, 2008 1 commit
  17. 02 May, 2008 1 commit
  18. 24 Apr, 2008 1 commit
  19. 14 Apr, 2008 7 commits
  20. 10 Apr, 2008 1 commit
    • [Syncookies]: Add support for TCP options via timestamps. · 4dfc2817
      Florian Westphal committed
      Allow the use of SACK and window scaling when syncookies are used
      and the client supports tcp timestamps. Options are encoded into
      the timestamp sent in the syn-ack and restored from the timestamp
      echo when the ack is received.
      
      Based on earlier work by Glenn Griffin.
      This patch avoids increasing the size of structs by encoding TCP
      options into the least significant bits of the timestamp and
      by not using any 'timestamp offset'.
      
      The downside is that the timestamp sent in the packet after the synack
      will increase by several seconds.
      
      changes since v1:
       don't duplicate timestamp echo decoding function, put it into ipv4/syncookie.c
       and have ipv6/syncookies.c use it.
       Feedback from Glenn Griffin: fix line indented with spaces, kill redundant if ()
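
The encoding idea can be sketched as follows. The bit layout here (4 bits of window scale plus 1 SACK bit in the low 5 bits) is an assumption chosen for illustration and is not taken from the patch; the function and macro names are hypothetical as well. The sketch also does not model how the real code adjusts the timestamp after replacing its low bits.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed layout, for illustration only. */
#define TS_OPT_BITS    5u
#define TS_OPT_MASK    ((1u << TS_OPT_BITS) - 1u)
#define TS_WSCALE_MASK 0x0fu
#define TS_SACK_BIT    0x10u

struct tcp_opts {
    bool sack_ok;
    uint8_t wscale;   /* 0..14 */
};

/* Replace the least significant bits of the timestamp `now` with the
 * encoded options; the result is what would be sent in the SYN-ACK. */
uint32_t ts_encode(uint32_t now, struct tcp_opts o)
{
    uint32_t bits = (o.wscale & TS_WSCALE_MASK) |
                    (o.sack_ok ? TS_SACK_BIT : 0u);

    return (now & ~TS_OPT_MASK) | bits;
}

/* Recover the options from the timestamp echoed back in the ACK. */
struct tcp_opts ts_decode(uint32_t tsecr)
{
    struct tcp_opts o;

    o.wscale = (uint8_t)(tsecr & TS_WSCALE_MASK);
    o.sack_ok = (tsecr & TS_SACK_BIT) != 0;
    return o;
}
```

Because the options travel in the echoed timestamp, the server needs no per-connection state between the SYN-ACK and the ACK, which is the property syncookies rely on.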
      Reviewed-by: Hagen Paul Pfeifer <hagen@jauu.net>
      Signed-off-by: Florian Westphal <fw@strlen.de>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  21. 04 Apr, 2008 5 commits