1. 24 10月, 2012 1 次提交
  2. 13 10月, 2012 1 次提交
    • A
      tcp: resets are misrouted · 4c675258
      Alexey Kuznetsov 提交于
      After commit e2446eaa ("tcp_v4_send_reset: binding oif to iif in no
      sock case").. tcp resets are always lost, when routing is asymmetric.
      Yes, backing out that patch will result in misrouting of resets for
      dead connections which used interface binding when were alive, but we
      actually cannot do anything here.  What's died that's died and correct
      handling normal unbound connections is obviously a priority.
      
      Comment to comment:
      > This has few benefits:
      >   1. tcp_v6_send_reset already did that.
      
      It was done to route resets for IPv6 link local addresses. It was a
      mistake to do so for global addresses. The patch fixes this as well.
      
      Actually, the problem appears to be even more serious than guaranteed
      loss of resets.  As reported by Sergey Soloviev <sol@eqv.ru>, those
      misrouted resets create a lot of arp traffic and huge amount of
      unresolved arp entires putting down to knees NAT firewalls which use
      asymmetric routing.
      Signed-off-by: NAlexey Kuznetsov <kuznet@ms2.inr.ac.ru>
      4c675258
  3. 02 10月, 2012 1 次提交
  4. 23 9月, 2012 2 次提交
  5. 06 9月, 2012 1 次提交
  6. 01 9月, 2012 1 次提交
    • J
      tcp: TCP Fast Open Server - support TFO listeners · 8336886f
      Jerry Chu 提交于
      This patch builds on top of the previous patch to add the support
      for TFO listeners. This includes -
      
      1. allocating, properly initializing, and managing the per listener
      fastopen_queue structure when TFO is enabled
      
      2. changes to the inet_csk_accept code to support TFO. E.g., the
      request_sock can no longer be freed upon accept(), not until 3WHS
      finishes
      
      3. allowing a TCP_SYN_RECV socket to properly poll() and sendmsg()
      if it's a TFO socket
      
      4. properly closing a TFO listener, and a TFO socket before 3WHS
      finishes
      
      5. supporting TCP_FASTOPEN socket option
      
      6. modifying tcp_check_req() to use to check a TFO socket as well
      as request_sock
      
      7. supporting TCP's TFO cookie option
      
      8. adding a new SYN-ACK retransmit handler to use the timer directly
      off the TFO socket rather than the listener socket. Note that TFO
      server side will not retransmit anything other than SYN-ACK until
      the 3WHS is completed.
      
      The patch also contains an important function
      "reqsk_fastopen_remove()" to manage the somewhat complex relation
      between a listener, its request_sock, and the corresponding child
      socket. See the comment above the function for the detail.
      Signed-off-by: NH.K. Jerry Chu <hkchu@google.com>
      Cc: Yuchung Cheng <ycheng@google.com>
      Cc: Neal Cardwell <ncardwell@google.com>
      Cc: Eric Dumazet <edumazet@google.com>
      Cc: Tom Herbert <therbert@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      8336886f
  7. 20 8月, 2012 1 次提交
  8. 15 8月, 2012 1 次提交
  9. 10 8月, 2012 2 次提交
    • E
      net: tcp: ipv6_mapped needs sk_rx_dst_set method · 63d02d15
      Eric Dumazet 提交于
      commit 5d299f3d (net: ipv6: fix TCP early demux) added a
      regression for ipv6_mapped case.
      
      [   67.422369] SELinux: initialized (dev autofs, type autofs), uses
      genfs_contexts
      [   67.449678] SELinux: initialized (dev autofs, type autofs), uses
      genfs_contexts
      [   92.631060] BUG: unable to handle kernel NULL pointer dereference at
      (null)
      [   92.631435] IP: [<          (null)>]           (null)
      [   92.631645] PGD 0
      [   92.631846] Oops: 0010 [#1] SMP
      [   92.632095] Modules linked in: autofs4 sunrpc ipv6 dm_mirror
      dm_region_hash dm_log dm_multipath dm_mod video sbs sbshc battery ac lp
      parport sg snd_hda_intel snd_hda_codec snd_seq_oss snd_seq_midi_event
      snd_seq snd_seq_device pcspkr snd_pcm_oss snd_mixer_oss snd_pcm
      snd_timer serio_raw button floppy snd i2c_i801 i2c_core soundcore
      snd_page_alloc shpchp ide_cd_mod cdrom microcode ehci_hcd ohci_hcd
      uhci_hcd
      [   92.634294] CPU 0
      [   92.634294] Pid: 4469, comm: sendmail Not tainted 3.6.0-rc1 #3
      [   92.634294] RIP: 0010:[<0000000000000000>]  [<          (null)>]
      (null)
      [   92.634294] RSP: 0018:ffff880245fc7cb0  EFLAGS: 00010282
      [   92.634294] RAX: ffffffffa01985f0 RBX: ffff88024827ad00 RCX:
      0000000000000000
      [   92.634294] RDX: 0000000000000218 RSI: ffff880254735380 RDI:
      ffff88024827ad00
      [   92.634294] RBP: ffff880245fc7cc8 R08: 0000000000000001 R09:
      0000000000000000
      [   92.634294] R10: 0000000000000000 R11: ffff880245fc7bf8 R12:
      ffff880254735380
      [   92.634294] R13: ffff880254735380 R14: 0000000000000000 R15:
      7fffffffffff0218
      [   92.634294] FS:  00007f4516ccd6f0(0000) GS:ffff880256600000(0000)
      knlGS:0000000000000000
      [   92.634294] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
      [   92.634294] CR2: 0000000000000000 CR3: 0000000245ed1000 CR4:
      00000000000007f0
      [   92.634294] DR0: 0000000000000000 DR1: 0000000000000000 DR2:
      0000000000000000
      [   92.634294] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7:
      0000000000000400
      [   92.634294] Process sendmail (pid: 4469, threadinfo ffff880245fc6000,
      task ffff880254b8cac0)
      [   92.634294] Stack:
      [   92.634294]  ffffffff813837a7 ffff88024827ad00 ffff880254b6b0e8
      ffff880245fc7d68
      [   92.634294]  ffffffff81385083 00000000001d2680 ffff8802547353a8
      ffff880245fc7d18
      [   92.634294]  ffffffff8105903a ffff88024827ad60 0000000000000002
      00000000000000ff
      [   92.634294] Call Trace:
      [   92.634294]  [<ffffffff813837a7>] ? tcp_finish_connect+0x2c/0xfa
      [   92.634294]  [<ffffffff81385083>] tcp_rcv_state_process+0x2b6/0x9c6
      [   92.634294]  [<ffffffff8105903a>] ? sched_clock_cpu+0xc3/0xd1
      [   92.634294]  [<ffffffff81059073>] ? local_clock+0x2b/0x3c
      [   92.634294]  [<ffffffff8138caf3>] tcp_v4_do_rcv+0x63a/0x670
      [   92.634294]  [<ffffffff8133278e>] release_sock+0x128/0x1bd
      [   92.634294]  [<ffffffff8139f060>] __inet_stream_connect+0x1b1/0x352
      [   92.634294]  [<ffffffff813325f5>] ? lock_sock_nested+0x74/0x7f
      [   92.634294]  [<ffffffff8104b333>] ? wake_up_bit+0x25/0x25
      [   92.634294]  [<ffffffff813325f5>] ? lock_sock_nested+0x74/0x7f
      [   92.634294]  [<ffffffff8139f223>] ? inet_stream_connect+0x22/0x4b
      [   92.634294]  [<ffffffff8139f234>] inet_stream_connect+0x33/0x4b
      [   92.634294]  [<ffffffff8132e8cf>] sys_connect+0x78/0x9e
      [   92.634294]  [<ffffffff813fd407>] ? sysret_check+0x1b/0x56
      [   92.634294]  [<ffffffff81088503>] ? __audit_syscall_entry+0x195/0x1c8
      [   92.634294]  [<ffffffff811cc26e>] ? trace_hardirqs_on_thunk+0x3a/0x3f
      [   92.634294]  [<ffffffff813fd3e2>] system_call_fastpath+0x16/0x1b
      [   92.634294] Code:  Bad RIP value.
      [   92.634294] RIP  [<          (null)>]           (null)
      [   92.634294]  RSP <ffff880245fc7cb0>
      [   92.634294] CR2: 0000000000000000
      [   92.648982] ---[ end trace 24e2bed94314c8d9 ]---
      [   92.649146] Kernel panic - not syncing: Fatal exception in interrupt
      
      Fix this using inet_sk_rx_dst_set(), and export this function in case
      IPv6 is modular.
      Reported-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      63d02d15
    • E
      time: jiffies_delta_to_clock_t() helper to the rescue · a399a805
      Eric Dumazet 提交于
      Various /proc/net files sometimes report crazy timer values, expressed
      in clock_t units.
      
      This happens when an expired timer delta (expires - jiffies) is passed
      to jiffies_to_clock_t().
      
      This function has an overflow in :
      
      return div_u64((u64)x * TICK_NSEC, NSEC_PER_SEC / USER_HZ);
      
      commit cbbc719f (time: Change jiffies_to_clock_t() argument type
      to unsigned long) only got around the problem.
      
      As we cant output negative values in /proc/net/tcp without breaking
      various tools, I suggest adding a jiffies_delta_to_clock_t() wrapper
      that caps the negative delta to a 0 value.
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Reported-by: NMaciej Żenczykowski <maze@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: hank <pyu@redhat.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      a399a805
  10. 07 8月, 2012 1 次提交
  11. 01 8月, 2012 2 次提交
  12. 27 7月, 2012 1 次提交
  13. 23 7月, 2012 1 次提交
    • E
      tcp: dont drop MTU reduction indications · 563d34d0
      Eric Dumazet 提交于
      ICMP messages generated in output path if frame length is bigger than
      mtu are actually lost because socket is owned by user (doing the xmit)
      
      One example is the ipgre_tunnel_xmit() calling
      icmp_send(skb, ICMP_DEST_UNREACH, ICMP_FRAG_NEEDED, htonl(mtu));
      
      We had a similar case fixed in commit a34a101e (ipv6: disable GSO on
      sockets hitting dst_allfrag).
      
      Problem of such fix is that it relied on retransmit timers, so short tcp
      sessions paid a too big latency increase price.
      
      This patch uses the tcp_release_cb() infrastructure so that MTU
      reduction messages (ICMP messages) are not lost, and no extra delay
      is added in TCP transmits.
      Reported-by: NMaciej Żenczykowski <maze@google.com>
      Diagnosed-by: NNeal Cardwell <ncardwell@google.com>
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Cc: Nandita Dukkipati <nanditad@google.com>
      Cc: Tom Herbert <therbert@google.com>
      Cc: Tore Anderson <tore@fud.no>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      563d34d0
  14. 20 7月, 2012 1 次提交
    • Y
      net-tcp: Fast Open base · 2100c8d2
      Yuchung Cheng 提交于
      This patch impelements the common code for both the client and server.
      
      1. TCP Fast Open option processing. Since Fast Open does not have an
         option number assigned by IANA yet, it shares the experiment option
         code 254 by implementing draft-ietf-tcpm-experimental-options
         with a 16 bits magic number 0xF989. This enables global experiments
         without clashing the scarce(2) experimental options available for TCP.
      
         When the draft status becomes standard (maybe), the client should
         switch to the new option number assigned while the server supports
         both numbers for transistion.
      
      2. The new sysctl tcp_fastopen
      
      3. A place holder init function
      Signed-off-by: NYuchung Cheng <ycheng@google.com>
      Acked-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      2100c8d2
  15. 17 7月, 2012 1 次提交
    • D
      net: Pass optional SKB and SK arguments to dst_ops->{update_pmtu,redirect}() · 6700c270
      David S. Miller 提交于
      This will be used so that we can compose a full flow key.
      
      Even though we have a route in this context, we need more.  In the
      future the routes will be without destination address, source address,
      etc. keying.  One ipv4 route will cover entire subnets, etc.
      
      In this environment we have to have a way to possess persistent storage
      for redirects and PMTU information.  This persistent storage will exist
      in the FIB tables, and that's why we'll need to be able to rebuild a
      full lookup flow key here.  Using that flow key will do a fib_lookup()
      and create/update the persistent entry.
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      6700c270
  16. 16 7月, 2012 1 次提交
  17. 12 7月, 2012 3 次提交
    • D
      net: Remove checks for dst_ops->redirect being NULL. · 1ed5c48f
      David S. Miller 提交于
      No longer necessary.
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      1ed5c48f
    • D
    • E
      tcp: TCP Small Queues · 46d3ceab
      Eric Dumazet 提交于
      This introduce TSQ (TCP Small Queues)
      
      TSQ goal is to reduce number of TCP packets in xmit queues (qdisc &
      device queues), to reduce RTT and cwnd bias, part of the bufferbloat
      problem.
      
      sk->sk_wmem_alloc not allowed to grow above a given limit,
      allowing no more than ~128KB [1] per tcp socket in qdisc/dev layers at a
      given time.
      
      TSO packets are sized/capped to half the limit, so that we have two
      TSO packets in flight, allowing better bandwidth use.
      
      As a side effect, setting the limit to 40000 automatically reduces the
      standard gso max limit (65536) to 40000/2 : It can help to reduce
      latencies of high prio packets, having smaller TSO packets.
      
      This means we divert sock_wfree() to a tcp_wfree() handler, to
      queue/send following frames when skb_orphan() [2] is called for the
      already queued skbs.
      
      Results on my dev machines (tg3/ixgbe nics) are really impressive,
      using standard pfifo_fast, and with or without TSO/GSO.
      
      Without reduction of nominal bandwidth, we have reduction of buffering
      per bulk sender :
      < 1ms on Gbit (instead of 50ms with TSO)
      < 8ms on 100Mbit (instead of 132 ms)
      
      I no longer have 4 MBytes backlogged in qdisc by a single netperf
      session, and both side socket autotuning no longer use 4 Mbytes.
      
      As skb destructor cannot restart xmit itself ( as qdisc lock might be
      taken at this point ), we delegate the work to a tasklet. We use one
      tasklest per cpu for performance reasons.
      
      If tasklet finds a socket owned by the user, it sets TSQ_OWNED flag.
      This flag is tested in a new protocol method called from release_sock(),
      to eventually send new segments.
      
      [1] New /proc/sys/net/ipv4/tcp_limit_output_bytes tunable
      [2] skb_orphan() is usually called at TX completion time,
        but some drivers call it in their start_xmit() handler.
        These drivers should at least use BQL, or else a single TCP
        session can still fill the whole NIC TX ring, since TSQ will
        have no effect.
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Cc: Dave Taht <dave.taht@bufferbloat.net>
      Cc: Tom Herbert <therbert@google.com>
      Cc: Matt Mathis <mattmathis@google.com>
      Cc: Yuchung Cheng <ycheng@google.com>
      Cc: Nandita Dukkipati <nanditad@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      46d3ceab
  18. 11 7月, 2012 3 次提交
  19. 05 7月, 2012 1 次提交
  20. 29 6月, 2012 3 次提交
  21. 26 6月, 2012 1 次提交
  22. 16 6月, 2012 1 次提交
    • D
      ipv6: Handle PMTU in ICMP error handlers. · 81aded24
      David S. Miller 提交于
      One tricky issue on the ipv6 side vs. ipv4 is that the ICMP callouts
      to handle the error pass the 32-bit info cookie in network byte order
      whereas ipv4 passes it around in host byte order.
      
      Like the ipv4 side, we have two helper functions.  One for when we
      have a socket context and one for when we do not.
      
      ip6ip6 tunnels are not handled here, because they handle PMTU events
      by essentially relaying another ICMP packet-too-big message back to
      the original sender.
      
      This patch allows us to get rid of rt6_do_pmtu_disc().  It handles all
      kinds of situations that simply cannot happen when we do the PMTU
      update directly using a fully resolved route.
      
      In fact, the "plen == 128" check in ip6_rt_update_pmtu() can very
      likely be removed or changed into a BUG_ON() check.  We should never
      have a prefixed ipv6 route when we get there.
      
      Another piece of strange history here is that TCP and DCCP, unlike in
      ipv4, never invoke the update_pmtu() method from their ICMP error
      handlers.  This is incredibly astonishing since this is the context
      where we have the most accurate context in which to make a PMTU
      update, namely we have a fully connected socket and associated cached
      socket route.
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      81aded24
  23. 10 6月, 2012 1 次提交
  24. 09 6月, 2012 3 次提交
    • D
      tcp: Get rid of inetpeer special cases. · 4670fd81
      David S. Miller 提交于
      The get_peer method TCP uses is full of special cases that make no
      sense accommodating, and it also gets in the way of doing more
      reasonable things here.
      
      First of all, if the socket doesn't have a usable cached route, there
      is no sense in trying to optimize timewait recycling.
      
      Likewise for the case where we have IP options, such as SRR enabled,
      that make the IP header destination address (and thus the destination
      address of the route key) differ from that of the connection's
      destination address.
      
      Just return a NULL peer in these cases, and thus we're also able to
      get rid of the clumsy inetpeer release logic.
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      4670fd81
    • D
      inet: Create and use rt{,6}_get_peer_create(). · fbfe95a4
      David S. Miller 提交于
      There's a lot of places that open-code rt{,6}_get_peer() only because
      they want to set 'create' to one.  So add an rt{,6}_get_peer_create()
      for their sake.
      
      There were also a few spots open-coding plain rt{,6}_get_peer() and
      those are transformed here as well.
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      fbfe95a4
    • G
      inetpeer: add parameter net for inet_getpeer_v4,v6 · 54db0cc2
      Gao feng 提交于
      add struct net as a parameter of inet_getpeer_v[4,6],
      use net to replace &init_net.
      
      and modify some places to provide net for inet_getpeer_v[4,6]
      Signed-off-by: NGao feng <gaofeng@cn.fujitsu.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      54db0cc2
  25. 04 6月, 2012 1 次提交
  26. 02 6月, 2012 1 次提交
    • E
      tcp: reflect SYN queue_mapping into SYNACK packets · fff32699
      Eric Dumazet 提交于
      While testing how linux behaves on SYNFLOOD attack on multiqueue device
      (ixgbe), I found that SYNACK messages were dropped at Qdisc level
      because we send them all on a single queue.
      
      Obvious choice is to reflect incoming SYN packet @queue_mapping to
      SYNACK packet.
      
      Under stress, my machine could only send 25.000 SYNACK per second (for
      200.000 incoming SYN per second). NIC : ixgbe with 16 rx/tx queues.
      
      After patch, not a single SYNACK is dropped.
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Cc: Hans Schillstrom <hans.schillstrom@ericsson.com>
      Cc: Jesper Dangaard Brouer <brouer@redhat.com>
      Cc: Neal Cardwell <ncardwell@google.com>
      Cc: Tom Herbert <therbert@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      fff32699
  27. 18 5月, 2012 1 次提交
  28. 16 5月, 2012 1 次提交
  29. 05 5月, 2012 1 次提交
    • E
      tcp: be more strict before accepting ECN negociation · bd14b1b2
      Eric Dumazet 提交于
      It appears some networks play bad games with the two bits reserved for
      ECN. This can trigger false congestion notifications and very slow
      transferts.
      
      Since RFC 3168 (6.1.1) forbids SYN packets to carry CT bits, we can
      disable TCP ECN negociation if it happens we receive mangled CT bits in
      the SYN packet.
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Cc: Perry Lorier <perryl@google.com>
      Cc: Matt Mathis <mattmathis@google.com>
      Cc: Yuchung Cheng <ycheng@google.com>
      Cc: Neal Cardwell <ncardwell@google.com>
      Cc: Wilmer van der Gaast <wilmer@google.com>
      Cc: Ankur Jain <jankur@google.com>
      Cc: Tom Herbert <therbert@google.com>
      Cc: Dave Täht <dave.taht@bufferbloat.net>
      Acked-by: NNeal Cardwell <ncardwell@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      bd14b1b2