1. 20 5月, 2012 3 次提交
  2. 18 5月, 2012 2 次提交
  3. 17 5月, 2012 1 次提交
  4. 16 5月, 2012 4 次提交
  5. 11 5月, 2012 4 次提交
  6. 09 5月, 2012 1 次提交
    • P
      netfilter: remove ip_queue support · d16cf20e
      Pablo Neira Ayuso 提交于
      This patch removes ip_queue support which was marked as obsolete
      years ago. The nfnetlink_queue modules provides more advanced
      user-space packet queueing mechanism.
      
      This patch also removes capability code included in SELinux that
      refers to ip_queue. Otherwise, we break compilation.
      
      Several warning has been sent regarding this to the mailing list
      in the past month without anyone rising the hand to stop this
      with some strong argument.
      Signed-off-by: NPablo Neira Ayuso <pablo@netfilter.org>
      d16cf20e
  7. 08 5月, 2012 1 次提交
  8. 05 5月, 2012 1 次提交
    • E
      tcp: be more strict before accepting ECN negociation · bd14b1b2
      Eric Dumazet 提交于
      It appears some networks play bad games with the two bits reserved for
      ECN. This can trigger false congestion notifications and very slow
      transferts.
      
      Since RFC 3168 (6.1.1) forbids SYN packets to carry CT bits, we can
      disable TCP ECN negociation if it happens we receive mangled CT bits in
      the SYN packet.
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Cc: Perry Lorier <perryl@google.com>
      Cc: Matt Mathis <mattmathis@google.com>
      Cc: Yuchung Cheng <ycheng@google.com>
      Cc: Neal Cardwell <ncardwell@google.com>
      Cc: Wilmer van der Gaast <wilmer@google.com>
      Cc: Ankur Jain <jankur@google.com>
      Cc: Tom Herbert <therbert@google.com>
      Cc: Dave Täht <dave.taht@bufferbloat.net>
      Acked-by: NNeal Cardwell <ncardwell@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      bd14b1b2
  9. 04 5月, 2012 1 次提交
  10. 03 5月, 2012 10 次提交
    • A
      tcp: move stats merge to the end of tcp_try_coalesce · 34a802a5
      Alexander Duyck 提交于
      This change cleans up the last bits of tcp_try_coalesce so that we only
      need one goto which jumps to the end of the function.  The idea is to make
      the code more readable by putting things in a linear order so that we start
      execution at the top of the function, and end it at the bottom.
      
      I also made a slight tweak to the code for handling frags when we are a
      clone.  Instead of making it an if (clone) loop else nr_frags = 0 I changed
      the logic so that if (!clone) we just set the number of frags to 0 which
      disables the for loop anyway.
      Signed-off-by: NAlexander Duyck <alexander.h.duyck@intel.com>
      Cc: Eric Dumazet <edumazet@google.com>
      Cc: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
      Acked-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      34a802a5
    • A
      tcp: Move code related to head frag in tcp_try_coalesce · 57b55a7e
      Alexander Duyck 提交于
      This change reorders the code related to the use of an skb->head_frag so it
      is placed before we check the rest of the frags.  This allows the code to
      read more linearly instead of like some sort of loop.
      Signed-off-by: NAlexander Duyck <alexander.h.duyck@intel.com>
      Cc: Eric Dumazet <edumazet@google.com>
      Cc: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
      Acked-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      57b55a7e
    • A
      tcp: Fix truesize accounting in tcp_try_coalesce · c73c3d9c
      Alexander Duyck 提交于
      This patch addresses several issues in the way we were tracking the
      truesize in tcp_try_coalesce.
      
      First it was using ksize which prevents us from having a 0 sized head frag
      and getting a usable result.  To resolve that this patch uses the end
      pointer which is set based off either ksize, or the frag_size supplied in
      build_skb.  This allows us to compute the original truesize of the entire
      buffer and remove that value leaving us with just what was added as pages.
      
      The second issue was the use of skb->len if there is a mergeable head frag.
      We should only need to remove the size of an data aligned sk_buff from our
      current skb->truesize to compute the delta for a buffer with a reused head.
      By using skb->len the value of truesize was being artificially reduced
      which means that head frags could use more memory than buffers using
      standard allocations.
      Signed-off-by: NAlexander Duyck <alexander.h.duyck@intel.com>
      Cc: Eric Dumazet <edumazet@google.com>
      Cc: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
      Acked-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      c73c3d9c
    • A
      net: Stop decapitating clones that have a head_frag · 2996d31f
      Alexander Duyck 提交于
      This change is meant ot prevent stealing the skb->head to use as a page in
      the event that the skb->head was cloned.  This allows the other clones to
      track each other via shinfo->dataref.
      
      Without this we break down to two methods for tracking the reference count,
      one being dataref, the other being the page count.  As a result it becomes
      difficult to track how many references there are to skb->head.
      Signed-off-by: NAlexander Duyck <alexander.h.duyck@intel.com>
      Cc: Eric Dumazet <edumazet@google.com>
      Cc: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
      Acked-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      2996d31f
    • E
      net: implement tcp coalescing in tcp_queue_rcv() · b081f85c
      Eric Dumazet 提交于
      Extend tcp coalescing implementing it from tcp_queue_rcv(), the main
      receiver function when application is not blocked in recvmsg().
      
      Function tcp_queue_rcv() is moved a bit to allow its call from
      tcp_data_queue()
      
      This gives good results especially if GRO could not kick, and if skb
      head is a fragment.
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Cc: Alexander Duyck <alexander.h.duyck@intel.com>
      Cc: Neal Cardwell <ncardwell@google.com>
      Cc: Tom Herbert <therbert@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      b081f85c
    • E
      net: take care of cloned skbs in tcp_try_coalesce() · 923dd347
      Eric Dumazet 提交于
      Before stealing fragments or skb head, we must make sure skbs are not
      cloned.
      
      Alexander was worried about destination skb being cloned : In bridge
      setups, a driver could be fooled if skb->data_len would not match skb
      nr_frags.
      
      If source skb is cloned, we must take references on pages instead.
      
      Bug happened using tcpdump (if not using mmap())
      
      Introduce kfree_skb_partial() helper to cleanup code.
      Reported-by: NAlexander Duyck <alexander.h.duyck@intel.com>
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      923dd347
    • E
      tcp: change tcp_adv_win_scale and tcp_rmem[2] · b49960a0
      Eric Dumazet 提交于
      tcp_adv_win_scale default value is 2, meaning we expect a good citizen
      skb to have skb->len / skb->truesize ratio of 75% (3/4)
      
      In 2.6 kernels we (mis)accounted for typical MSS=1460 frame :
      1536 + 64 + 256 = 1856 'estimated truesize', and 1856 * 3/4 = 1392.
      So these skbs were considered as not bloated.
      
      With recent truesize fixes, a typical MSS=1460 frame truesize is now the
      more precise :
      2048 + 256 = 2304. But 2304 * 3/4 = 1728.
      So these skb are not good citizen anymore, because 1460 < 1728
      
      (GRO can escape this problem because it build skbs with a too low
      truesize.)
      
      This also means tcp advertises a too optimistic window for a given
      allocated rcvspace : When receiving frames, sk_rmem_alloc can hit
      sk_rcvbuf limit and we call tcp_prune_queue()/tcp_collapse() too often,
      especially when application is slow to drain its receive queue or in
      case of losses (netperf is fast, scp is slow). This is a major latency
      source.
      
      We should adjust the len/truesize ratio to 50% instead of 75%
      
      This patch :
      
      1) changes tcp_adv_win_scale default to 1 instead of 2
      
      2) increase tcp_rmem[2] limit from 4MB to 6MB to take into account
      better truesize tracking and to allow autotuning tcp receive window to
      reach same value than before. Note that same amount of kernel memory is
      consumed compared to 2.6 kernels.
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Cc: Neal Cardwell <ncardwell@google.com>
      Cc: Tom Herbert <therbert@google.com>
      Cc: Yuchung Cheng <ycheng@google.com>
      Acked-by: NNeal Cardwell <ncardwell@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      b49960a0
    • Y
      tcp: early retransmit: delayed fast retransmit · 750ea2ba
      Yuchung Cheng 提交于
      Implementing the advanced early retransmit (sysctl_tcp_early_retrans==2).
      Delays the fast retransmit by an interval of RTT/4. We borrow the
      RTO timer to implement the delay. If we receive another ACK or send
      a new packet, the timer is cancelled and restored to original RTO
      value offset by time elapsed.  When the delayed-ER timer fires,
      we enter fast recovery and perform fast retransmit.
      Signed-off-by: NYuchung Cheng <ycheng@google.com>
      Acked-by: NNeal Cardwell <ncardwell@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      750ea2ba
    • Y
      tcp: early retransmit · eed530b6
      Yuchung Cheng 提交于
      This patch implements RFC 5827 early retransmit (ER) for TCP.
      It reduces DUPACK threshold (dupthresh) if outstanding packets are
      less than 4 to recover losses by fast recovery instead of timeout.
      
      While the algorithm is simple, small but frequent network reordering
      makes this feature dangerous: the connection repeatedly enter
      false recovery and degrade performance. Therefore we implement
      a mitigation suggested in the appendix of the RFC that delays
      entering fast recovery by a small interval, i.e., RTT/4. Currently
      ER is conservative and is disabled for the rest of the connection
      after the first reordering event. A large scale web server
      experiment on the performance impact of ER is summarized in
      section 6 of the paper "Proportional Rate Reduction for TCP”,
      IMC 2011. http://conferences.sigcomm.org/imc/2011/docs/p155.pdf
      
      Note that Linux has a similar feature called THIN_DUPACK. The
      differences are THIN_DUPACK do not mitigate reorderings and is only
      used after slow start. Currently ER is disabled if THIN_DUPACK is
      enabled. I would be happy to merge THIN_DUPACK feature with ER if
      people think it's a good idea.
      
      ER is enabled by sysctl_tcp_early_retrans:
        0: Disables ER
      
        1: Reduce dupthresh to packets_out - 1 when outstanding packets < 4.
      
        2: (Default) reduce dupthresh like mode 1. In addition, delay
           entering fast recovery by RTT/4.
      
      Note: mode 2 is implemented in the third part of this patch series.
      Signed-off-by: NYuchung Cheng <ycheng@google.com>
      Acked-by: NNeal Cardwell <ncardwell@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      eed530b6
    • Y
      tcp: early retransmit: tcp_enter_recovery() · 1fbc3405
      Yuchung Cheng 提交于
      This a prepartion patch that refactors the code to enter recovery
      into a new function tcp_enter_recovery(). It's needed to implement
      the delayed fast retransmit in ER.
      Signed-off-by: NYuchung Cheng <ycheng@google.com>
      Acked-by: NNeal Cardwell <ncardwell@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      1fbc3405
  11. 01 5月, 2012 2 次提交
    • E
      tcp: makes tcp_try_coalesce aware of skb->head_frag · 329033f6
      Eric Dumazet 提交于
      TCP coalesce can check if skb to be merged has its skb->head mapped to a
      page fragment, instead of a kmalloc() area.
      
      We had to disable coalescing in this case, for performance reasons.
      
      We 'upgrade' skb->head as a fragment in itself.
      
      This reduces number of cache misses when user makes its copies, since a
      less sk_buff are fetched.
      
      This makes receive and ofo queues shorter and thus reduce cache line
      misses in TCP stack.
      
      This is a followup of patch "net: allow skb->head to be a page fragment"
      
      Tested with tg3 nic, with GRO on or off. We can see "TCPRcvCoalesce"
      counter being incremented.
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Cc: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
      Cc: Herbert Xu <herbert@gondor.apana.org.au>
      Cc: Maciej Żenczykowski <maze@google.com>
      Cc: Neal Cardwell <ncardwell@google.com>
      Cc: Tom Herbert <therbert@google.com>
      Cc: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
      Cc: Ben Hutchings <bhutchings@solarflare.com>
      Cc: Matt Carlson <mcarlson@broadcom.com>
      Cc: Michael Chan <mchan@broadcom.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      329033f6
    • Y
      tcp: fix infinite cwnd in tcp_complete_cwr() · 1cebce36
      Yuchung Cheng 提交于
      When the cwnd reduction is done, ssthresh may be infinite
      if TCP enters CWR via ECN or F-RTO. If cwnd is not undone, i.e.,
      undo_marker is set, tcp_complete_cwr() falsely set cwnd to the
      infinite ssthresh value. The correct operation is to keep cwnd
      intact because it has been updated in ECN or F-RTO.
      Signed-off-by: NYuchung Cheng <ycheng@google.com>
      Acked-by: NNeal Cardwell <ncardwell@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      1cebce36
  12. 28 4月, 2012 1 次提交
  13. 27 4月, 2012 1 次提交
    • E
      ipv6: RTAX_FEATURE_ALLFRAG causes inefficient TCP segment sizing · 67469601
      Eric Dumazet 提交于
      Quoting Tore Anderson from :
      https://bugzilla.kernel.org/show_bug.cgi?id=42572
      
      When RTAX_FEATURE_ALLFRAG is set on a route, the effective TCP segment
      size does not take into account the size of the IPv6 Fragmentation
      header that needs to be included in outbound packets, causing every
      transmitted TCP segment to be fragmented across two IPv6 packets, the
      latter of which will only contain 8 bytes of actual payload.
      
      RTAX_FEATURE_ALLFRAG is typically set on a route in response to
      receving a ICMPv6 Packet Too Big message indicating a Path MTU of less
      than 1280 bytes. 1280 bytes is the minimum IPv6 MTU, however ICMPv6
      PTBs with MTU < 1280 are still valid, in particular when an IPv6
      packet is sent to an IPv4 destination through a stateless translator.
      Any ICMPv4 Need To Fragment packets originated from the IPv4 part of
      the path will be translated to ICMPv6 PTB which may then indicate an
      MTU of less than 1280.
      
      The Linux kernel refuses to reduce the effective MTU to anything below
      1280 bytes, instead it sets it to exactly 1280 bytes, and
      RTAX_FEATURE_ALLFRAG is also set. However, the TCP segment size appears
      to be set to 1240 bytes (1280 Path MTU - 40 bytes of IPv6 header),
      instead of 1232 (additionally taking into account the 8 bytes required
      by the IPv6 Fragmentation extension header).
      
      This in turn results in rather inefficient transmission, as every
      transmitted TCP segment now is split in two fragments containing
      1232+8 bytes of payload.
      
      After this patch, all the outgoing packets that includes a
      Fragmentation header all are "atomic" or "non-fragmented" fragments,
      i.e., they both have Offset=0 and More Fragments=0.
      
      With help from David S. Miller
      Reported-by: NTore Anderson <tore@fud.no>
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Cc: Maciej Żenczykowski <maze@google.com>
      Cc: Tom Herbert <therbert@google.com>
      Tested-by: NTore Anderson <tore@fud.no>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      67469601
  14. 26 4月, 2012 3 次提交
  15. 24 4月, 2012 5 次提交