1. 15 Apr, 2012 1 commit
  2. 11 Apr, 2012 1 commit
  3. 06 Apr, 2012 1 commit
  4. 20 Mar, 2012 2 commits
    • tcp: reduce out_of_order memory use · c8628155
      Eric Dumazet committed
      Receive windows keep growing while the speed of light does not, so the
      out-of-order queue can hold a huge number of skbs waiting to be moved
      to receive_queue once the missing packets fill the holes.
      
      Some devices happen to use fat skbs (truesize of 4096 + sizeof(struct
      sk_buff)) to store regular (MTU <= 1500) frames. This makes it highly
      probable that sk_rmem_alloc hits the sk_rcvbuf limit, which can be
      4 Mbytes in many cases.
      
      When the limit is hit, the TCP stack calls tcp_collapse_ofo_queue(), a
      true latency killer and CPU cache blower.
      
      Attempting to coalesce each time we add a frame to the ofo queue keeps
      memory use tight and in many cases avoids the tcp_collapse() work later.
      
      Tested on various wireless setups (b43, ath9k, ...) known to use a big
      skb truesize, this patch removed the "packets collapsed in receive queue
      due to low socket buffer" reports I had before.
      
      This also reduced the average memory used by TCP sockets.
      
      With help from Neal Cardwell.
      Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
      Cc: Neal Cardwell <ncardwell@google.com>
      Cc: Yuchung Cheng <ycheng@google.com>
      Cc: H.K. Jerry Chu <hkchu@google.com>
      Cc: Tom Herbert <therbert@google.com>
      Cc: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
      Acked-by: Neal Cardwell <ncardwell@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
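The coalescing idea in the commit above can be sketched in user space. This is a simplified model, not the kernel code: `struct ofo_seg`, `struct ofo_queue` and `ofo_add_tail()` are hypothetical names, and the sketch assumes that coalescing means copying a contiguous segment into the spare room of the previous buffer so that no additional truesize is charged.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define OFO_MAX 64

/* Simplified stand-in for an skb held in the out-of-order queue. */
struct ofo_seg {
    uint32_t seq;       /* first sequence number stored */
    uint32_t end_seq;   /* one past the last sequence number stored */
    size_t   room;      /* payload bytes the buffer can hold in total */
    size_t   truesize;  /* memory charged to the socket for this buffer */
};

struct ofo_queue {
    struct ofo_seg segs[OFO_MAX];
    size_t count;
    size_t rmem_alloc;  /* running sum of truesize, in the spirit of sk_rmem_alloc */
};

/*
 * Queue a segment that arrives after the current tail.
 * Returns 1 if it was coalesced into the tail (no new memory charged),
 * 0 if it was queued as a new entry, -1 if the model queue is full.
 */
int ofo_add_tail(struct ofo_queue *q, uint32_t seq, uint32_t len,
                 size_t room, size_t truesize)
{
    assert(len <= room);

    if (q->count > 0) {
        struct ofo_seg *tail = &q->segs[q->count - 1];
        size_t used = tail->end_seq - tail->seq;

        /* Contiguous with the tail and small enough for its spare room. */
        if (tail->end_seq == seq && len <= tail->room - used) {
            tail->end_seq += len;
            return 1;
        }
    }

    if (q->count == OFO_MAX)
        return -1;

    q->segs[q->count].seq = seq;
    q->segs[q->count].end_seq = seq + len;
    q->segs[q->count].room = room;
    q->segs[q->count].truesize = truesize;
    q->count++;
    q->rmem_alloc += truesize;
    return 0;
}
```

With 4096-byte buffers holding 1448-byte payloads, two contiguous segments share one buffer in this model, so the charged memory is that of a single buffer rather than two.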
    • tcp: introduce tcp_data_queue_ofo · e86b2919
      Eric Dumazet committed
      Split tcp_data_queue() into two parts for better readability.
      
      tcp_data_queue_ofo() is responsible for queueing an incoming skb into
      the out-of-order queue.
      
      Change the code layout so that skb_set_owner_r() is performed only if
      the skb is not dropped.
      
      This is a preparatory patch for the following "reduce out_of_order
      memory use" patch.
      Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
      Cc: Neal Cardwell <ncardwell@google.com>
      Cc: Yuchung Cheng <ycheng@google.com>
      Cc: H.K. Jerry Chu <hkchu@google.com>
      Cc: Tom Herbert <therbert@google.com>
      Cc: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
      Acked-by: Neal Cardwell <ncardwell@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  5. 13 Mar, 2012 1 commit
  6. 12 Mar, 2012 1 commit
    • net: Convert printks to pr_<level> · 058bd4d2
      Joe Perches committed
      Use a more current kernel messaging style.
      
      Convert a printk block to print_hex_dump.
      Coalesce formats, align arguments.
      Use %s, __func__ instead of embedding function names.
      
      Some messages that were prefixed with <foo>_close are
      now prefixed with <foo>_fini.  Some ah4 and esp messages
      are now not prefixed with "ip ".
      
      The intent of this patch is to later add something like
        #define pr_fmt(fmt) "IPv4: " fmt.
      to standardize the output messages.
      
      Text size is trivially reduced. (x86-32 allyesconfig)
      
      $ size net/ipv4/built-in.o*
         text	   data	    bss	    dec	    hex	filename
       887888	  31558	 249696	1169142	 11d6f6	net/ipv4/built-in.o.new
       887934	  31558	 249800	1169292	 11d78c	net/ipv4/built-in.o.old
      Signed-off-by: Joe Perches <joe@perches.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
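The effect of the `pr_fmt` define mentioned in the commit above can be illustrated outside the kernel. This is a user-space sketch only: `report_bad_port()` is a hypothetical function, and `snprintf()` stands in for the kernel's `pr_<level>` macros so that the result can be inspected.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Prefix every message from this file, as the commit message proposes. */
#define pr_fmt(fmt) "IPv4: " fmt

/*
 * Hypothetical caller: formats a message into buf the way a pr_<level>
 * call would, using "%s", __func__ instead of embedding the function name.
 */
int report_bad_port(char *buf, size_t size, int port)
{
    assert(buf != NULL);
    return snprintf(buf, size, pr_fmt("%s: unexpected port %d\n"),
                    __func__, port);
}
```

Because `pr_fmt` works by string literal concatenation, the prefix costs nothing at run time and can be changed in one place per file.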
  7. 07 Mar, 2012 1 commit
    • tcp: fix tcp_shift_skb_data() to not shift SACKed data below snd_una · 4648dc97
      Neal Cardwell committed
      This commit fixes tcp_shift_skb_data() so that it does not shift
      SACKed data below snd_una.
      
      This fixes an issue whose symptoms exactly match reports showing
      tp->sacked_out going negative since 3.3.0-rc4 (see "WARNING: at
      net/ipv4/tcp_input.c:3418" thread on netdev).
      
      Since 2008 (832d11c5) tcp_shift_skb_data() had been shifting SACKed
      ranges that were below snd_una. It checked that the *end* of the skb it
      was about to shift from was above snd_una, but did not check that the
      end of the actual shifted range was above snd_una; this commit adds
      that check.
      
      Shifting SACKed ranges below snd_una is problematic because for such
      ranges tcp_sacktag_one() short-circuits: it does not declare anything
      as SACKed and does not increase sacked_out.
      
      Before the fixes in commits cc9a672e and daef52ba, shifting SACKed
      ranges below snd_una happened to work because tcp_shifted_skb() was
      always (incorrectly) passing in to tcp_sacktag_one() an skb whose
      end_seq tcp_shift_skb_data() had already guaranteed was beyond snd_una.
      Hence tcp_sacktag_one() never short-circuited and always increased
      tp->sacked_out in this case.
      
      After those two fixes, my testing has verified that shifting SACKed
      ranges below snd_una could cause tp->sacked_out to go negative with
      the following sequence of events:
      
      (1) tcp_shift_skb_data() sees an skb whose end_seq is beyond snd_una,
          then shifts a prefix of that skb that is below snd_una
      
      (2) tcp_shifted_skb() increments the packet count of the
          already-SACKed prev sk_buff
      
      (3) tcp_sacktag_one() sees the end of the new SACKed range is below
          snd_una, so it short-circuits and doesn't increase tp->sacked_out
      
      (5) tcp_clean_rtx_queue() sees the SACKed skb has been ACKed,
          decrements tp->sacked_out by this "inflated" pcount that was
          missing a matching increase in tp->sacked_out, and hence
          tp->sacked_out underflows to a u32 like 0xFFFFFFFF, which cast
          to s32 is negative.
      
      (6) this leads to the warnings seen in the recent "WARNING: at
          net/ipv4/tcp_input.c:3418" thread on the netdev list; e.g.:
          tcp_input.c:3418  WARN_ON((int)tp->sacked_out < 0);
      
      More generally, I think this bug can be tickled in some cases where
      two or more ACKs from the receiver are lost and then a DSACK arrives
      that is immediately above an existing SACKed skb in the write queue.
      
      This fix changes tcp_shift_skb_data() to abort this sequence at step
      (1) in the scenario above by noticing that the bytes are below snd_una
      and not shifting them.
      Signed-off-by: Neal Cardwell <ncardwell@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
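The counter underflow described in the commit above can be reproduced with a small model. This is an illustrative sketch, not the kernel code: `struct tcp_counters`, `sacktag_one()` and `simulate_shift()` are hypothetical stand-ins, and the guard is modeled simply as "the end of the shifted range must be after snd_una".

```c
#include <assert.h>
#include <stdint.h>

struct tcp_counters {
    uint32_t snd_una;
    uint32_t sacked_out;
};

/* Wraparound-safe "a is after b" comparison on sequence numbers. */
static int seq_after(uint32_t a, uint32_t b)
{
    return (int32_t)(b - a) < 0;
}

/* Short-circuits for ranges that end at or below snd_una. */
static void sacktag_one(struct tcp_counters *tp, uint32_t end_seq,
                        uint32_t pcount)
{
    if (!seq_after(end_seq, tp->snd_una))
        return;
    tp->sacked_out += pcount;
}

/*
 * Replays the sequence of events from the commit message and returns the
 * final sacked_out viewed as a signed value. With apply_fix set, a shifted
 * range that ends below snd_una is not shifted at all.
 */
int32_t simulate_shift(int apply_fix)
{
    struct tcp_counters tp = { .snd_una = 5000, .sacked_out = 1 };
    uint32_t prev_pcount = 1;   /* already-SACKed prev skb, counted once */
    uint32_t shift_end = 4000;  /* end of the shifted prefix, below snd_una */

    assert(!seq_after(shift_end, tp.snd_una));

    if (!apply_fix || seq_after(shift_end, tp.snd_una)) {
        prev_pcount += 1;                /* prev skb's packet count grows */
        sacktag_one(&tp, shift_end, 1);  /* short-circuits: no increase */
    }

    tp.sacked_out -= prev_pcount;        /* the skb is later cumulatively ACKed */
    return (int32_t)tp.sacked_out;
}
```

Without the guard the model ends with a negative value, which is the condition the `WARN_ON((int)tp->sacked_out < 0)` check reports; with the guard the counter returns to zero.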
  8. 04 Mar, 2012 1 commit
    • tcp: don't fragment SACKed skbs in tcp_mark_head_lost() · c0638c24
      Neal Cardwell committed
      In tcp_mark_head_lost() we should not attempt to fragment a SACKed skb
      to mark the first portion as lost. This is for two primary reasons:
      
      (1) tcp_shifted_skb() coalesces adjacent regions of SACKed skbs. When
      doing this, it preserves the sum of their packet counts in order to
      reflect the real-world dynamics on the wire. But given that skbs can
      have remainders that do not align to MSS boundaries, this packet count
      preservation means that for SACKed skbs there is not necessarily a
      direct linear relationship between tcp_skb_pcount(skb) and
      skb->len. Thus tcp_mark_head_lost()'s previous attempts to fragment
      off and mark as lost a prefix of length (packets - oldcnt)*mss from
      SACKed skbs were leading to occasional failures of the WARN_ON(len >
      skb->len) in tcp_fragment() (which used to be a BUG_ON(); see the
      recent "crash in tcp_fragment" thread on netdev).
      
      (2) there is no real point in fragmenting off part of a SACKed skb and
      calling tcp_skb_mark_lost() on it, since tcp_skb_mark_lost() is a NOP
      for SACKed skbs.
      Signed-off-by: Neal Cardwell <ncardwell@google.com>
      Acked-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
      Acked-by: Yuchung Cheng <ycheng@google.com>
      Acked-by: Nandita Dukkipati <nanditad@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
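Reason (1) in the commit above comes down to simple arithmetic, which the following sketch works through. The types and functions here (`struct skb_model`, `coalesce_sacked()`, `head_lost_split_len()`) are hypothetical and only model the length and packet-count bookkeeping, under the assumption that coalescing sums both fields.

```c
#include <assert.h>
#include <stdint.h>

struct skb_model {
    uint32_t len;     /* payload bytes */
    uint32_t pcount;  /* packets this skb represents on the wire */
    int      sacked;  /* non-zero if the skb is SACKed */
};

/* Coalesce two SACKed skbs, preserving the sum of their packet counts. */
struct skb_model coalesce_sacked(struct skb_model a, struct skb_model b)
{
    struct skb_model m;

    assert(a.sacked && b.sacked);
    m.len = a.len + b.len;
    m.pcount = a.pcount + b.pcount;
    m.sacked = 1;
    return m;
}

/*
 * Length of the prefix that would be fragmented off and marked lost.
 * Returns 0 when skip_sacked is set and the skb is SACKed, meaning
 * "do not fragment this skb".
 */
uint32_t head_lost_split_len(const struct skb_model *skb, uint32_t packets,
                             uint32_t oldcnt, uint32_t mss, int skip_sacked)
{
    if (skip_sacked && skb->sacked)
        return 0;
    return (packets - oldcnt) * mss;
}
```

Three 500-byte SACKed pieces coalesce into an skb with len 1500 and pcount 3. Asking for a two-packet prefix at an MSS of 1448 yields 2896 bytes, which exceeds the skb length; skipping SACKed skbs avoids the request altogether.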
  9. 29 Feb, 2012 1 commit
    • tcp: fix false reordering signal in tcp_shifted_skb · 4c90d3b3
      Neal Cardwell committed
      When tcp_shifted_skb() shifts bytes from the skb that is currently
      pointed to by 'highest_sack' then the increment of
      TCP_SKB_CB(skb)->seq implicitly advances tcp_highest_sack_seq(). This
      implicit advancement, combined with the recent fix to pass the correct
      SACKed range into tcp_sacktag_one(), caused tcp_sacktag_one() to think
      that the newly SACKed range was before the tcp_highest_sack_seq(),
      leading to a call to tcp_update_reordering() with a degree of
      reordering matching the size of the newly SACKed range (typically just
      1 packet, which is a NOP, but potentially larger).
      
      This commit fixes this by simply calling tcp_sacktag_one() before the
      TCP_SKB_CB(skb)->seq advancement that can advance our notion of the
      highest SACKed sequence.
      
      Correspondingly, we can simplify the code a little now that
      tcp_shifted_skb() should update the lost_cnt_hint in all cases where
      skb == tp->lost_skb_hint.
      Signed-off-by: Neal Cardwell <ncardwell@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  10. 15 Feb, 2012 1 commit
  11. 13 Feb, 2012 2 commits
  12. 23 Jan, 2012 1 commit
    • tcp: detect loss above high_seq in recovery · 974c1236
      Yuchung Cheng committed
      Correctly implement a loss detection heuristic: New sequences (above
      high_seq) sent during the fast recovery are deemed lost when higher
      sequences are SACKed.
      
      Current code does not catch these losses, because tcp_mark_head_lost()
      does not check packets beyond high_seq. The fix is straightforward:
      check packets up to the highest SACKed packet. In addition, all the
      FLAG_DATA_LOST logic is ineffective and redundant and can be removed.
      
      Update the loss heuristic comments. The algorithm above is documented
      as heuristic B, but it is redundant too because heuristic A already
      covers B.
      
      Note that this change only marks some forward-retransmitted packets LOST.
      It does NOT forbid TCP from performing further CWR on new losses. A
      potential follow-up patch under preparation is to perform another CWR on
      "new" losses such as
      1) sequence above high_seq is lost (by resetting high_seq to snd_nxt)
      2) retransmission is lost.
      Signed-off-by: Yuchung Cheng <ycheng@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  13. 21 Dec, 2011 1 commit
  14. 13 Dec, 2011 1 commit
  15. 12 Dec, 2011 1 commit
  16. 04 Dec, 2011 1 commit
  17. 28 Nov, 2011 5 commits
  18. 21 Oct, 2011 3 commits
  19. 20 Oct, 2011 1 commit
  20. 14 Oct, 2011 1 commit
    • net: more accurate skb truesize · 87fb4b7b
      Eric Dumazet committed
      skb truesize currently accounts for sk_buff struct and part of skb head.
      kmalloc() roundings are also ignored.
      
      Considering that skb_shared_info is larger than sk_buff, it's time to
      take it into account for better memory accounting.
      
      This patch introduces SKB_TRUESIZE(X) macro to centralize various
      assumptions into a single place.
      
      At skb alloc phase, we put skb_shared_info struct at the exact end of
      skb head, to allow a better use of memory (lowering number of
      reallocations), since kmalloc() gives us power-of-two memory blocks.
      
      Unless SLAB/SLUB debug is active, both skb->head and skb_shared_info are
      aligned to cache lines, as before.
      
      Note: This patch might trigger performance regressions because of
      misconfigured protocol stacks, hitting per socket or global memory
      limits that were previously not reached. But it's a necessary step for
      more accurate memory accounting.
      Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
      CC: Andi Kleen <ak@linux.intel.com>
      CC: Ben Hutchings <bhutchings@solarflare.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
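The accounting change in the commit above can be made concrete with a user-space sketch of an `SKB_TRUESIZE(X)`-style macro. The structure sizes (232 and 320 bytes) and the 64-byte cache line are assumptions chosen for illustration, not values taken from any particular kernel build, and `old_truesize()` is a hypothetical helper modeling the earlier accounting as data size plus the sk_buff struct only.

```c
#include <assert.h>
#include <stddef.h>

#define SMP_CACHE_BYTES 64  /* assumed cache line size */

/* Round x up to the next multiple of the cache line size. */
#define SKB_DATA_ALIGN(x) \
    (((x) + (SMP_CACHE_BYTES - 1)) & ~((size_t)SMP_CACHE_BYTES - 1))

/* Stand-ins with assumed sizes; the real structures differ per kernel. */
struct sk_buff_model         { unsigned char pad[232]; };
struct skb_shared_info_model { unsigned char pad[320]; };

/* Charge the data plus both metadata structures, each cache aligned. */
#define SKB_TRUESIZE(x) \
    ((x) + SKB_DATA_ALIGN(sizeof(struct sk_buff_model)) + \
     SKB_DATA_ALIGN(sizeof(struct skb_shared_info_model)))

/* Earlier-style accounting in this model: data plus sk_buff only. */
size_t old_truesize(size_t data_len)
{
    assert(data_len > 0);
    return data_len + sizeof(struct sk_buff_model);
}
```

Under these assumed sizes a 1500-byte buffer is charged 2076 bytes instead of 1732, which shows why sockets may reach their memory limits sooner after the change.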
  21. 05 Oct, 2011 1 commit
    • tcp: properly update lost_cnt_hint during shifting · 1e5289e1
      Yan, Zheng committed
      lost_skb_hint is used by tcp_mark_head_lost() to mark the first unhandled skb.
      lost_cnt_hint is the number of packets or sacked packets before the lost_skb_hint.
      When shifting a skb that is before the lost_skb_hint: if tcp_is_fack() is true,
      the skb has already been counted in the lost_cnt_hint; if tcp_is_fack() is false,
      tcp_sacktag_one() will increase the lost_cnt_hint. So tcp_shifted_skb() does not
      need to adjust the lost_cnt_hint by itself. When shifting a skb that is equal to
      lost_skb_hint, the shifted packets will not be counted by tcp_mark_head_lost().
      So tcp_shifted_skb() should adjust the lost_cnt_hint even if tcp_is_fack(tp) is true.
      Signed-off-by: Zheng Yan <zheng.z.yan@intel.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  22. 28 Sep, 2011 1 commit
  23. 27 Sep, 2011 2 commits
  24. 19 Sep, 2011 1 commit
  25. 25 Aug, 2011 1 commit
    • Proportional Rate Reduction for TCP. · a262f0cd
      Nandita Dukkipati committed
      This patch implements Proportional Rate Reduction (PRR) for TCP.
      PRR is an algorithm that determines TCP's sending rate in fast
      recovery. PRR avoids excessive window reductions and aims for
      the actual congestion window size at the end of recovery to be as
      close as possible to the window determined by the congestion control
      algorithm. PRR also improves accuracy of the amount of data sent
      during loss recovery.
      
      The patch implements the recommended flavor of PRR called PRR-SSRB
      (Proportional rate reduction with slow start reduction bound) and
      replaces the existing rate halving algorithm. PRR improves upon the
      existing Linux fast recovery under a number of conditions including:
        1) burst losses where the losses implicitly reduce the amount of
      outstanding data (pipe) below the ssthresh value selected by the
      congestion control algorithm, and
        2) losses near the end of short flows where the application runs out of
      data to send.
      
      As an example, with the existing rate halving implementation a single
      loss event can cause a connection carrying short Web transactions to
      go into the slow start mode after the recovery. This is because during
      recovery Linux pulls the congestion window down to packets_in_flight+1
      on every ACK. A short Web response often runs out of new data to send
      and its pipe reduces to zero by the end of recovery when all its packets
      are drained from the network. Subsequent HTTP responses using the same
      connection will have to slow start to raise cwnd to ssthresh. PRR on
      the other hand aims for the cwnd to be as close as possible to ssthresh
      by the end of recovery.
      
      A description of PRR and a discussion of its performance can be found at
      the following links:
      - IETF Draft:
          http://tools.ietf.org/html/draft-mathis-tcpm-proportional-rate-reduction-01
      - IETF Slides:
          http://www.ietf.org/proceedings/80/slides/tcpm-6.pdf
          http://tools.ietf.org/agenda/81/slides/tcpm-2.pdf
      - Paper to appear in Internet Measurements Conference (IMC) 2011:
          Improving TCP Loss Recovery
          Nandita Dukkipati, Matt Mathis, Yuchung Cheng
      Signed-off-by: Nandita Dukkipati <nanditad@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
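The per-ACK computation behind PRR-SSRB can be sketched as follows. This follows the pseudocode published in RFC 6937 (the later form of the draft cited above) rather than the kernel patch itself; it counts in packets so that one MSS equals 1, and the names `struct prr_state`, `prr_sndcnt()` and `prr_on_sent()` are hypothetical. Clamping the result at zero is an addition made here for clarity.

```c
#include <assert.h>

struct prr_state {
    int ssthresh;       /* target window chosen by congestion control */
    int recover_fs;     /* packets in flight when recovery started */
    int prr_delivered;  /* packets delivered to the receiver in recovery */
    int prr_out;        /* packets sent during recovery */
};

static int imin(int a, int b) { return a < b ? a : b; }
static int imax(int a, int b) { return a > b ? a : b; }

/* Called once per ACK in recovery; returns how many packets may be sent. */
int prr_sndcnt(struct prr_state *s, int pipe, int delivered)
{
    int sndcnt;

    assert(s->recover_fs > 0);
    s->prr_delivered += delivered;

    if (pipe > s->ssthresh) {
        /* Proportional part: ceil(prr_delivered * ssthresh / RecoverFS). */
        sndcnt = (s->prr_delivered * s->ssthresh + s->recover_fs - 1)
                 / s->recover_fs - s->prr_out;
    } else {
        /* Slow start reduction bound: grow back toward ssthresh. */
        int limit = imax(s->prr_delivered - s->prr_out, delivered) + 1;

        sndcnt = imin(s->ssthresh - pipe, limit);
    }
    return imax(sndcnt, 0);
}

/* Record packets actually transmitted in response to an ACK. */
void prr_on_sent(struct prr_state *s, int packets)
{
    s->prr_out += packets;
}
```

With ssthresh at half of the flight size at the start of recovery, the proportional branch releases roughly one packet for every two delivered. Once pipe falls below ssthresh, the second branch lets the sender catch up instead of staying pinned near the current flight size.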
  26. 09 Jun, 2011 1 commit
    • tcp: RFC2988bis + taking RTT sample from 3WHS for the passive open side · 9ad7c049
      Jerry Chu committed
      This patch lowers the default initRTO from 3secs to 1sec per
      RFC2988bis. It falls back to 3secs if the SYN or SYN-ACK packet
      has been retransmitted, AND the TCP timestamp option is not on.
      
      It also adds support to take RTT sample during 3WHS on the passive
      open side, just like its active open counterpart, and uses it, if
      valid, to seed the initRTO for the data transmission phase.
      
      The patch also resets ssthresh to its initial default at the
      beginning of the data transmission phase, and reduces cwnd to 1 if
      there has been MORE THAN ONE retransmission during 3WHS per RFC5681.
      Signed-off-by: H.K. Jerry Chu <hkchu@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
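The two decision rules stated in the commit above can be written down directly. This is a sketch of the rules as described in the message, not the kernel implementation; the function names and the millisecond constants are hypothetical, and seeding the RTO from a valid handshake RTT sample is left out.

```c
#include <assert.h>

#define INIT_RTO_MS     1000  /* lowered default initRTO */
#define FALLBACK_RTO_MS 3000  /* previous default, kept as a fallback */

/*
 * Initial RTO when no valid RTT sample is available: fall back to 3 secs
 * only if the SYN or SYN-ACK was retransmitted AND timestamps are off.
 */
unsigned int init_rto_ms(int syn_retransmits, int timestamps_on)
{
    assert(syn_retransmits >= 0);
    if (syn_retransmits > 0 && !timestamps_on)
        return FALLBACK_RTO_MS;
    return INIT_RTO_MS;
}

/* cwnd drops to 1 only after MORE THAN ONE retransmission during 3WHS. */
unsigned int cwnd_after_handshake(unsigned int default_cwnd,
                                  int syn_retransmits)
{
    return syn_retransmits > 1 ? 1u : default_cwnd;
}
```

A single retransmitted SYN with timestamps enabled therefore keeps both the 1 second timer and the default window in this model.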
  27. 23 Mar, 2011 2 commits
  28. 15 Mar, 2011 1 commit
    • tcp: fix RTT for quick packets in congestion control · febf0819
      stephen hemminger committed
      In the congestion control interface, the callback for each ACK
      includes an estimated round trip time in microseconds.
      Some algorithms need high resolution (Vegas style) but most only
      need jiffy resolution.  If the RTT is not accurate (as for a
      retransmission), -1 is used as a flag value.
      
      When doing coarse resolution, if the RTT is less than a jiffy
      then 0 should be returned rather than no estimate. Otherwise algorithms
      that expect good ACKs to trigger slow start (like CUBIC Hystart)
      will be confused.
      Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
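The difference between "zero RTT" and "no estimate" can be shown with a small sketch. It assumes the coarse sample arrives as a jiffy count with a negative value meaning "no valid sample"; `HZ`, `coarse_rtt_us_old()` and `coarse_rtt_us_new()` are illustrative names and values, not the kernel's code.

```c
#include <assert.h>
#include <stdint.h>

#define HZ 250                          /* assumed tick rate */
#define USEC_PER_JIFFY (1000000 / HZ)

/* Behaviour described as the problem: a sub-jiffy RTT reads as "no estimate". */
int32_t coarse_rtt_us_old(long rtt_jiffies)
{
    if (rtt_jiffies > 0)
        return (int32_t)(rtt_jiffies * USEC_PER_JIFFY);
    return -1;
}

/* Behaviour described as the fix: 0 jiffies is a valid sample and yields 0. */
int32_t coarse_rtt_us_new(long rtt_jiffies)
{
    assert(rtt_jiffies >= -1);
    if (rtt_jiffies >= 0)
        return (int32_t)(rtt_jiffies * USEC_PER_JIFFY);
    return -1;
}
```

An algorithm that treats -1 as "ignore this ACK" would skip every ACK on a fast local path under the old behaviour, while the new behaviour still delivers a usable sample of 0.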
  29. 22 Feb, 2011 1 commit
  30. 03 Feb, 2011 1 commit