1. 21 Oct 2005 (1 commit)
  2. 14 Oct 2005 (1 commit)
  3. 13 Oct 2005 (1 commit)
  4. 09 Oct 2005 (1 commit)
  5. 30 Sep 2005 (1 commit)
    • [TCP]: Revert 6b251858 · 01ff367e
      Committed by David S. Miller
      But retain the comment fix.
      
      Alexey Kuznetsov has explained the situation as follows:
      
      --------------------
      
      I think the fix is incorrect. Look, the RFC function init_cwnd(mss) is
      not continuous: e.g. for mss=1095 it needs an initial window of 1095*4,
      but for mss=1096 it is 1096*3. We do not know exactly what mss the
      sender used for its calculations. If we advertise 1096 (and calculate
      an initial window of 3*1096), the sender could limit it to some value
      < 1096 and then it would need a window of his_mss*4 > 3*1096 to send
      its initial burst.
      
      See?
      
      So, the honest function for the initial rcv_wnd derived from
      tcp_init_cwnd() is:
      
      	init_rcv_wnd(mss)=
      	  min { init_cwnd(mss1)*mss1 for mss1 <= mss }
      
      It is something like:
      
      	if (mss < 1096)
      		return mss*4;
      	if (mss < 1096*2)
      		return 1096*4;
      	return mss*2;
      
      (I just scribbled a graph on a piece of paper; it is difficult to see
      or to explain without it.)
      
      I selected it differently, giving more window than is strictly
      required.  The initial receive window must be large enough to allow a
      sender following the RFC (or just setting its initial cwnd to 2) to
      send its initial burst.  But beyond that it is arbitrary, so I decided
      to give slack space of one segment.
      
      Actually, the logic was:
      
      If mss is low/normal (<= ethernet), set the window to receive more than
      the initial burst allowed by the RFC under the worst conditions,
      i.e. mss*4. This gives slack space of 1 segment for ethernet frames.
      
      For msses slightly larger than an ethernet frame, take 3. Try to give
      slack space of 1 frame again.
      
      If mss is huge, force 2*mss. No slack space.
      
      The value 1460*3 is really confusing. The minimal one is 1096*2, but
      beyond that it is an arbitrary value. It was meant to be ~4096. 1460*3
      is just the magic number from the RFC (1460*3 = 1095*4 is the magic
      :-)), so I guess my hands typed it by themselves.
      
      --------------------
      Signed-off-by: David S. Miller <davem@davemloft.net>
      01ff367e
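      For reference, a minimal C sketch of the window choice described above
      (the helper names and the init_cwnd shape are assumptions for
      illustration, not the kernel's actual code):
      
      	/* Assumed shape of tcp_init_cwnd() at the time: initial cwnd in
      	 * segments as a function of mss. */
      	static unsigned int init_cwnd_segs(unsigned int mss)
      	{
      		return (mss > 1460) ? 2 : (mss > 1095) ? 3 : 4;
      	}
      
      	/* The "honest" initial receive window sketched above: enough bytes
      	 * for the initial burst of any sender whose mss does not exceed
      	 * the advertised mss, with roughly one segment of slack below
      	 * 2*1096. */
      	static unsigned int honest_init_rcv_wnd(unsigned int mss)
      	{
      		if (mss < 1096)
      			return mss * 4;
      		if (mss < 1096 * 2)
      			return 1096 * 4;
      		return mss * 2;
      	}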
  6. 29 Sep 2005 (1 commit)
  7. 23 Sep 2005 (1 commit)
    • [TCP]: Adjust Reno SACK estimate in tcp_fragment · 83ca28be
      Committed by Herbert Xu
      Since the introduction of TSO pcount a year ago, it has been possible
      for tcp_fragment() to cause packets_out to decrease.  Prior to that,
      tcp_retrans_try_collapse() was the only way for that to happen on the
      retransmission path.
      
      When this happens with Reno, it is possible for sacked_out to become
      invalid because it is only an estimate and not tied to any particular
      packet on the retransmission queue.
      
      Therefore we need to adjust sacked_out as well as left_out in the Reno
      case.  The following patch does exactly that.
      
      This bug is pretty difficult to trigger in practice though since you
      need a SACKless peer with a retransmission that occurs just as the
      cached MTU value expires.
      Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      83ca28be
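      A standalone C sketch of the bookkeeping described above (illustrative
      only; field handling is approximated and this is not the literal patch):
      
      	/* When fragmentation removes 'removed' segments from packets_out on
      	 * the retransmit path and the peer is SACKless (Reno), shrink the
      	 * sacked_out estimate and keep left_out = sacked_out + lost_out. */
      	static void reno_adjust_on_fragment(unsigned int *sacked_out,
      					    unsigned int *left_out,
      					    unsigned int lost_out,
      					    unsigned int removed)
      	{
      		if (*sacked_out >= removed)
      			*sacked_out -= removed;
      		else
      			*sacked_out = 0;
      		*left_out = *sacked_out + lost_out;
      	}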
  8. 20 Sep 2005 (1 commit)
  9. 15 Sep 2005 (1 commit)
  10. 11 Sep 2005 (1 commit)
  11. 09 Sep 2005 (1 commit)
  12. 02 Sep 2005 (1 commit)
  13. 30 Aug 2005 (6 commits)
  14. 24 Aug 2005 (1 commit)
  15. 18 Aug 2005 (1 commit)
    • [TCP]: Fix bug #5070: kernel BUG at net/ipv4/tcp_output.c:864 · 35d59efd
      Committed by Herbert Xu
      1) We send out a normal sized packet with TSO on to start off.
      2) ICMP is received indicating a smaller MTU.
      3) We send the current sk_send_head which needs to be fragmented
      since it was created before the ICMP event.  The first fragment
      is then sent out.
      
      At this point the remaining fragment is allocated by tcp_fragment.
      However, its size is padded to fit the L1 cache-line size therefore
      creating tail-room up to 124 bytes long.
      
      This fragment will also be sitting at sk_send_head.
      
      4) tcp_sendmsg is called again and it stores data in the tail-room of
      the fragment.
      5) tcp_push_one is called by tcp_sendmsg which then calls tso_fragment
      since the packet as a whole exceeds the MTU.
      
      At this point we have a packet that has data in the head area being
      fed to tso_fragment which bombs out.
      
      My take on this is that we shouldn't ever call tcp_fragment on a TSO
      socket for a packet that is yet to be transmitted since this creates
      a packet on sk_send_head that cannot be extended.
      
      So here is a patch to change it so that tso_fragment is always used
      in this case.
      Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      35d59efd
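      In outline, the fix picks the splitting routine by whether the packet
      has ever been transmitted. A hedged sketch follows; the prototypes and
      the helper are assumptions approximating the era's tcp_output.c, not
      copied from the patch:
      
      	/* Assumed prototypes for the two splitting routines: */
      	int tcp_fragment(struct sock *sk, struct sk_buff *skb,
      			 unsigned int len, unsigned int mss_now);
      	int tso_fragment(struct sock *sk, struct sk_buff *skb,
      			 unsigned int len, unsigned int mss_now);
      
      	/* Split with tso_fragment() while the skb is still unsent, so the
      	 * remainder left at sk_send_head can keep growing; only packets
      	 * that were already on the wire go through tcp_fragment(). */
      	static int split_packet(struct sock *sk, struct sk_buff *skb,
      				unsigned int limit, unsigned int mss_now,
      				int already_transmitted)
      	{
      		if (!already_transmitted)
      			return tso_fragment(sk, skb, limit, mss_now);
      		return tcp_fragment(sk, skb, limit, mss_now);
      	}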
  16. 17 Aug 2005 (1 commit)
    • [TCP]: Fix bug #5070: kernel BUG at net/ipv4/tcp_output.c:864 · c8ac3774
      Committed by Herbert Xu
      1) We send out a normal sized packet with TSO on to start off.
      2) ICMP is received indicating a smaller MTU.
      3) We send the current sk_send_head which needs to be fragmented
      since it was created before the ICMP event.  The first fragment
      is then sent out.
      
      At this point the remaining fragment is allocated by tcp_fragment.
      However, its size is padded to fit the L1 cache-line size therefore
      creating tail-room up to 124 bytes long.
      
      This fragment will also be sitting at sk_send_head.
      
      4) tcp_sendmsg is called again and it stores data in the tail-room of
      the fragment.
      5) tcp_push_one is called by tcp_sendmsg which then calls tso_fragment
      since the packet as a whole exceeds the MTU.
      
      At this point we have a packet that has data in the head area being
      fed to tso_fragment which bombs out.
      
      My take on this is that we shouldn't ever call tcp_fragment on a TSO
      socket for a packet that is yet to be transmitted since this creates
      a packet on sk_send_head that cannot be extended.
      
      So here is a patch to change it so that tso_fragment is always used
      in this case.
      Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      c8ac3774
  17. 11 Aug 2005 (1 commit)
    • [TCP]: Adjust {p,f}ackets_out correctly in tcp_retransmit_skb() · b5da623a
      Committed by Herbert Xu
      Well I've only found one potential cause for the assertion
      failure in tcp_mark_head_lost.  First of all, this can only
      occur if cnt > 1 since tp->packets_out is never zero here.
      If it did hit zero we'd have much bigger problems.
      
      So cnt is equal to fackets_out - reordering.  Normally
      fackets_out is less than packets_out.  The only reason
      I've found that might cause fackets_out to exceed packets_out
      is if tcp_fragment is called from tcp_retransmit_skb with a
      TSO skb and the current MSS is greater than the MSS stored
      in the TSO skb.  This might occur as the result of an expiring
      dst entry.
      
      In that case, packets_out may decrease (line 1380-1381 in
      tcp_output.c).  However, fackets_out is unchanged which means
      that it may in fact exceed packets_out.
      
      Previously, tcp_retrans_try_collapse was the only place where
      packets_out could go down, and it takes care of this by decrementing
      fackets_out.
      
      So we should make sure that fackets_out is reduced by an appropriate
      amount here as well.
      Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      b5da623a
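      A standalone sketch of the bookkeeping described above (illustrative
      only, not the literal patch):
      
      	/* When splitting a TSO skb for retransmission lowers packets_out
      	 * by 'diff' segments, keep the fackets_out estimate in step so it
      	 * can never exceed the new packets_out. */
      	static void adjust_fackets_out(unsigned int *fackets_out,
      				       unsigned int packets_out,
      				       unsigned int diff)
      	{
      		if (*fackets_out >= diff)
      			*fackets_out -= diff;
      		else
      			*fackets_out = 0;
      		if (*fackets_out > packets_out)
      			*fackets_out = packets_out;
      	}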
  18. 05 Aug 2005 (2 commits)
  19. 09 Jul 2005 (1 commit)
  20. 06 Jul 2005 (12 commits)
    • [TCP]: Never TSO defer under periods of congestion. · 908a75c1
      Committed by David S. Miller
      Congestion window recovery after loss depends upon the fact
      that if we have a full MSS sized frame at the head of the
      send queue, we will send it.  TSO deferral can defeat the
      ACK clocking necessary to exit cleanly from recovery.
      Signed-off-by: David S. Miller <davem@davemloft.net>
      908a75c1
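      The rule reduces to an early bail-out in the deferral decision, roughly
      as below (a hypothetical helper using the usual TCP_CA_* congestion
      states; the real check lives in the TSO deferral logic):
      
      	/* Never defer a full-MSS segment while the connection is recovering
      	 * from loss, so the ACK clock keeps ticking. */
      	static int tso_may_defer(int ca_state)
      	{
      		if (ca_state != TCP_CA_Open)
      			return 0;	/* in recovery: send now, do not defer */
      		return 1;		/* otherwise deferral may be considered */
      	}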
    • [TCP]: Move to new TSO segmenting scheme. · c1b4a7e6
      Committed by David S. Miller
      Make TSO segment transmit size decisions at send time not earlier.
      
      The basic scheme is that we try to build as large a TSO frame as
      possible when pulling in the user data, but the size of the TSO frame
      output to the card is determined at transmit time.
      
      This is guided by tp->xmit_size_goal.  It is always set to a multiple
      of MSS and tells sendmsg/sendpage how large an SKB to try and build.
      
      Later, tcp_write_xmit() and tcp_push_one() chop up the packet if
      necessary and conditions warrant.  These routines can also decide to
      "defer" in order to wait for more ACKs to arrive and thus allow larger
      TSO frames to be emitted.
      
      A general observation is that TSO elongates the pipe, thus requiring a
      larger congestion window and larger buffering especially at the sender
      side.  Therefore, it is important that applications 1) get a large
      enough socket send buffer (this is accomplished by our dynamic send
      buffer expansion code) 2) do large enough writes.
      Signed-off-by: David S. Miller <davem@davemloft.net>
      c1b4a7e6
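      A rough illustration of the sizing rule (a hedged sketch with a
      hypothetical helper, not the kernel's code): xmit_size_goal is the
      largest whole multiple of the current MSS that fits the build target,
      and sendmsg/sendpage build SKBs up to that size while the on-the-wire
      split happens later in tcp_write_xmit()/tcp_push_one().
      
      	/* Clamp a desired TSO build size down to a whole number of
      	 * MSS-sized segments, never less than one segment. */
      	static unsigned int xmit_size_goal_sketch(unsigned int mss_now,
      						  unsigned int desired)
      	{
      		unsigned int goal = desired - (desired % mss_now);
      
      		return goal ? goal : mss_now;
      	}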
    • [TCP]: Eliminate redundant computations in tcp_write_xmit(). · aa93466b
      Committed by David S. Miller
      tcp_snd_test() is run for every packet output by a single
      call to tcp_write_xmit(), but this is not necessary.
      
      For one, the congestion window space needs to be calculated only
      once, then used throughout the duration of the loop.
      
      This cleanup also makes experimenting with different TSO
      packetization schemes much easier.
      Signed-off-by: David S. Miller <davem@davemloft.net>
      aa93466b
    • [TCP]: Break out tcp_snd_test() into it's constituent parts. · 7f4dd0a9
      Committed by David S. Miller
      tcp_snd_test() does several different things; use inline
      functions to express this more clearly.
      
      1) It initializes the TSO count of SKB, if necessary.
      2) It performs the Nagle test.
      3) It makes sure the congestion window is adhered to.
      4) It makes sure SKB fits into the send window.
      
      This cleanup also sets things up so that things like the
      available packets in the congestion window do not need
      to be calculated multiple times by packet sending loops
      such as tcp_write_xmit().
      Signed-off-by: David S. Miller <davem@davemloft.net>
      7f4dd0a9
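      Sketched in outline, the decomposition looks roughly like this (the
      helper names and signatures mirror the four steps above but are
      assumptions, not quoted from the patch):
      
      	static int snd_test_sketch(struct tcp_sock *tp, struct sk_buff *skb,
      				   unsigned int mss_now, unsigned int nonagle)
      	{
      		tcp_init_tso_segs(tp, skb);			/* 1) TSO count */
      
      		return tcp_nagle_test(tp, skb, mss_now, nonagle) &&	/* 2) Nagle */
      		       tcp_cwnd_test(tp, skb) &&			/* 3) cwnd */
      		       tcp_snd_wnd_test(tp, skb, mss_now);		/* 4) send window */
      	}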
    • [TCP]: Fix __tcp_push_pending_frames() 'nonagle' handling. · 55c97f3e
      Committed by David S. Miller
      'nonagle' should be passed to the tcp_snd_test() function
      as 'TCP_NAGLE_PUSH' if we are checking an SKB not at the
      tail of the write_queue.  This is because Nagle does not
      apply to such frames since we cannot possibly tack more
      data onto them.
      
      However, while doing this __tcp_push_pending_frames() makes
      all of the packets in the write_queue use this modified
      'nonagle' value.
      
      Fix the bug and simplify this function by just calling
      tcp_write_xmit() directly if sk_send_head is non-NULL.
      
      As a result, we can now make tcp_data_snd_check() just call
      tcp_push_pending_frames() instead of the specialized
      __tcp_data_snd_check().
      Signed-off-by: David S. Miller <davem@davemloft.net>
      55c97f3e
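      The rule itself fits in one line of C (a hypothetical helper, shown
      only to restate the reasoning above): Nagle can only ever apply to the
      last skb in the write queue, because only that one can still have data
      tacked onto it.
      
      	/* For any skb that is not the queue tail, force TCP_NAGLE_PUSH so
      	 * the Nagle test never holds it back. */
      	static unsigned int effective_nonagle(unsigned int nonagle,
      					      int skb_is_queue_tail)
      	{
      		return skb_is_queue_tail ? nonagle : TCP_NAGLE_PUSH;
      	}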
    • [TCP]: Fix redundant calculations of tcp_current_mss() · a2e2a59c
      Committed by David S. Miller
      tcp_write_xmit() uses tcp_current_mss(), but some of its callers,
      namely __tcp_push_pending_frames(), already have this value
      available.
      
      While we're here, fix the "cur_mss" argument to be "unsigned int"
      instead of plain "unsigned".
      Signed-off-by: David S. Miller <davem@davemloft.net>
      a2e2a59c
    • [TCP]: tcp_write_xmit() tabbing cleanup · 92df7b51
      Committed by David S. Miller
      Put the main basic block of work at the top-level of
      tabbing, and mark the TCP_CLOSE test with unlikely().
      Signed-off-by: David S. Miller <davem@davemloft.net>
      92df7b51
    • [TCP]: Kill extra cwnd validate in __tcp_push_pending_frames(). · a762a980
      Committed by David S. Miller
      The tcp_cwnd_validate() function should only be invoked
      if we actually send some frames, yet __tcp_push_pending_frames()
      will always invoke it.  tcp_write_xmit() does the call for us,
      so the call here can simply be removed.
      
      Also, tcp_write_xmit() can be marked static.
      Signed-off-by: David S. Miller <davem@davemloft.net>
      a762a980
    • [TCP]: Add missing skb_header_release() call to tcp_fragment(). · f44b5271
      Committed by David S. Miller
      When we add any new packet to the TCP socket write queue,
      we must call skb_header_release() on it in order for the
      TSO sharing checks in the drivers to work.
      Signed-off-by: David S. Miller <davem@davemloft.net>
      f44b5271
    • [TCP]: Move __tcp_data_snd_check into tcp_output.c · 84d3e7b9
      Committed by David S. Miller
      It reimplements portions of tcp_snd_check(), so if we
      move it to tcp_output.c we can consolidate its
      logic much more easily in a later change.
      Signed-off-by: David S. Miller <davem@davemloft.net>
      84d3e7b9
    • [TCP]: Move send test logic out of net/tcp.h · f6302d1d
      Committed by David S. Miller
      This just moves the code into tcp_output.c, no code logic changes are
      made by this patch.
      
      Using this as a baseline, we can begin to untangle the mess of
      comparisons for the Nagle test et al.  We will also be able to reduce
      all of the redundant computation that occurs when outputting data
      packets.
      Signed-off-by: David S. Miller <davem@davemloft.net>
      f6302d1d
    • [TCP]: Fix quick-ack decrementing with TSO. · fc6415bc
      Committed by David S. Miller
      On each packet output, we call tcp_dec_quickack_mode()
      if the ACK flag is set.  It drops tp->ack.quick until
      it hits zero, at which time we deflate the ATO value.
      
      When doing TSO, we are emitting multiple packets with
      ACK set, so we should decrement tp->ack.quick by that many
      segments.
      
      Note that, unlike this case, tcp_enter_cwr() should not
      take tcp_skb_pcount(skb) into consideration.  That
      function readjusts tp->snd_cwnd once and moves into the
      TCP_CA_CWR state.
      Signed-off-by: David S. Miller <davem@davemloft.net>
      fc6415bc
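      Conceptually the change is just a per-segment decrement (a hedged
      sketch; tp->ack.quick handling and the ATO deflation are simplified):
      
      	/* Drop the quick-ack counter once per segment carried by a TSO skb
      	 * instead of once per skb; the caller deflates the ATO at zero. */
      	static void dec_quickack_sketch(unsigned int *quick, unsigned int pcount)
      	{
      		if (*quick > pcount)
      			*quick -= pcount;
      		else
      			*quick = 0;
      	}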
  21. 24 Jun 2005 (1 commit)
  22. 19 Jun 2005 (2 commits)
    • [NET] Rename open_request to request_sock · 60236fdd
      Committed by Arnaldo Carvalho de Melo
      Ok, this one just renames some stuff to have a better namespace and to
      disassociate it from TCP:
      
      struct open_request  -> struct request_sock
      tcp_openreq_alloc    -> reqsk_alloc
      tcp_openreq_free     -> reqsk_free
      tcp_openreq_fastfree -> __reqsk_free
      
      With this most of the infrastructure closely resembles a struct
      sock methods subset.
      Signed-off-by: Arnaldo Carvalho de Melo <acme@ghostprotocols.net>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      60236fdd
    • [NET] Generalise TCP's struct open_request minisock infrastructure · 2e6599cb
      Committed by Arnaldo Carvalho de Melo
      Kept this first changeset minimal, without changing existing names to
      ease peer review.
      
      Basically, tcp_openreq_alloc now receives the or_calltable, which in
      turn has two new members:
      
      ->slab, which replaces tcp_openreq_cachep
      ->obj_size, which gives the size of the openreq descendant for
        a specific protocol
      
      The protocol-specific fields in struct open_request were moved to a
      class hierarchy, with the things that are common to all connection
      oriented PF_INET protocols in struct inet_request_sock, and the TCP
      ones in struct tcp_request_sock, which is an inet_request_sock, which
      is an open_request.
      
      I.e. this uses the same approach used for the struct sock class
      hierarchy, with sk_prot indicating if the protocol wants to use the
      open_request infrastructure by filling in sk_prot->rsk_prot with an
      or_calltable.
      
      Results? Performance is improved and TCP v4 now uses only 64 bytes per
      open request minisock, down from 96 without this patch :-)
      
      The next changeset will rename some of the structs, fields and
      functions mentioned above; struct or_calltable is far too unclear a
      name, so better to name it struct request_sock_ops,
      s/struct open_request/struct request_sock/g, etc.
      Signed-off-by: Arnaldo Carvalho de Melo <acme@ghostprotocols.net>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      2e6599cb
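      The class hierarchy being described can be pictured as nested structs,
      roughly as below (a simplified sketch with invented "_sketch" names and
      placeholder fields, not the exact definitions):
      
      	/* The per-protocol operations table: slab cache plus the size of
      	 * the protocol's openreq descendant. */
      	struct or_calltable_sketch {
      		struct kmem_cache *slab;	/* replaces tcp_openreq_cachep */
      		int obj_size;			/* size of the openreq descendant */
      	};
      
      	/* Common minisock state for all connection-oriented PF_INET
      	 * protocols; instances come from ops->slab, sized by ops->obj_size. */
      	struct request_sock_sketch {
      		const struct or_calltable_sketch *ops;
      		/* ... generic retransmit/queueing state ... */
      	};
      
      	struct inet_request_sock_sketch {
      		struct request_sock_sketch req;		/* "is a" request_sock */
      		/* ... addresses, ports, IP options ... */
      	};
      
      	struct tcp_request_sock_sketch {
      		struct inet_request_sock_sketch ireq;	/* "is a" inet_request_sock */
      		/* ... sequence numbers, timestamps ... */
      	};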