1. 25 11月, 2008 7 次提交
    • I
      111cc8b9
    • I
      tcp: Make shifting not clear the hints · 92ee76b6
      Ilpo Järvinen 提交于
      The earlier version was just very basic one which is "playing
      safe" by always clearing the hints. However, clearing of a hint
      is extremely costly operation with large windows, so it must be
      avoided at all cost whenever possible, there is a way with
      shifting too achieve not-clearing.
      Signed-off-by: NIlpo Järvinen <ilpo.jarvinen@helsinki.fi>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      92ee76b6
    • I
      tcp: Try to restore large SKBs while SACK processing · 832d11c5
      Ilpo Järvinen 提交于
      During SACK processing, most of the benefits of TSO are eaten by
      the SACK blocks that one-by-one fragment SKBs to MSS sized chunks.
      Then we're in problems when cleanup work for them has to be done
      when a large cumulative ACK comes. Try to return back to pre-split
      state already while more and more SACK info gets discovered by
      combining newly discovered SACK areas with the previous skb if
      that's SACKed as well.
      
      This approach has a number of benefits:
      
      1) The processing overhead is spread more equally over the RTT
      2) Write queue has less skbs to process (affect everything
         which has to walk in the queue past the sacked areas)
      3) Write queue is consistent whole the time, so no other parts
         of TCP has to be aware of this (this was not the case with
         some other approach that was, well, quite intrusive all
         around).
      4) Clean_rtx_queue can release most of the pages using single
         put_page instead of previous PAGE_SIZE/mss+1 calls
      
      In case a hole is fully filled by the new SACK block, we attempt
      to combine the next skb too which allows construction of skbs
      that are even larger than what tso split them to and it handles
      hole per on every nth patterns that often occur during slow start
      overshoot pretty nicely. Though this to be really useful also
      a retransmission would have to get lost since cumulative ACKs
      advance one hole at a time in the most typical case.
      
      TODO: handle upwards only merging. That should be rather easy
      when segment is fully sacked but I'm leaving that as future
      work item (it won't make very large difference anyway since
      this current approach already covers quite a lot of normal
      cases).
      
      I was earlier thinking of some sophisticated way of tracking
      timestamps of the first and the last segment but later on
      realized that it won't be that necessary at all to store the
      timestamp of the last segment. The cases that can occur are
      basically either:
        1) ambiguous => no sensible measurement can be taken anyway
        2) non-ambiguous is due to reordering => having the timestamp
           of the last segment there is just skewing things more off
           than does some good since the ack got triggered by one of
           the holes (besides some substle issues that would make
           determining right hole/skb even harder problem). Anyway,
           it has nothing to do with this change then.
      
      I choose to route some abnormal looking cases with goto noop,
      some could be handled differently (eg., by stopping the
      walking at that skb but again). In general, they either
      shouldn't happen at all or are rare enough to make no difference
      in practice.
      
      In theory this change (as whole) could cause some macroscale
      regression (global) because of cache misses that are taken over
      the round-trip time but it gets very likely better because of much
      less (local) cache misses per other write queue walkers and the
      big recovery clearing cumulative ack.
      
      Worth to note that these benefits would be very easy to get also
      without TSO/GSO being on as long as the data is in pages so that
      we can merge them. Currently I won't let that happen because
      DSACK splitting at fragment that would mess up pcounts due to
      sk_can_gso in tcp_set_skb_tso_segs. Once DSACKs fragments gets
      avoided, we have some conditions that can be made less strict.
      
      TODO: I will probably have to convert the excessive pointer
      passing to struct sacktag_state... :-)
      
      My testing revealed that considerable amount of skbs couldn't
      be shifted because they were cloned (most likely still awaiting
      tx reclaim)...
      
      [The rest is considering future work instead since I got
      repeatably EFAULT to tcpdump's recvfrom when I added
      pskb_expand_head to deal with clones, so I separated that
      into another, later patch]
      
      ...To counter that, I gave up on the fifth advantage:
      
      5) When growing previous SACK block, less allocs for new skbs
         are done, basically a new alloc is needed only when new hole
         is detected and when the previous skb runs out of frags space
      
      ...which now only happens of if reclaim is fast enough to dispose
      the clone before the SACK block comes in (the window is RTT long),
      otherwise we'll have to alloc some.
      
      With clones being handled I got these numbers (will be somewhat
      worse without that), taken with fine-grained mibs:
      
                        TCPSackShifted 398
                         TCPSackMerged 877
                  TCPSackShiftFallback 320
            TCPSACKCOLLAPSEFALLBACKGSO 0
        TCPSACKCOLLAPSEFALLBACKSKBBITS 0
        TCPSACKCOLLAPSEFALLBACKSKBDATA 0
          TCPSACKCOLLAPSEFALLBACKBELOW 0
          TCPSACKCOLLAPSEFALLBACKFIRST 1
       TCPSACKCOLLAPSEFALLBACKPREVBITS 318
            TCPSACKCOLLAPSEFALLBACKMSS 1
         TCPSACKCOLLAPSEFALLBACKNOHEAD 0
          TCPSACKCOLLAPSEFALLBACKSHIFT 0
                TCPSACKCOLLAPSENOOPSEQ 0
        TCPSACKCOLLAPSENOOPSMALLPCOUNT 0
           TCPSACKCOLLAPSENOOPSMALLLEN 0
                   TCPSACKCOLLAPSEHOLE 12
      Signed-off-by: NIlpo Järvinen <ilpo.jarvinen@helsinki.fi>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      832d11c5
    • I
      tcp: make tcp_sacktag_one able to handle partial skb too · f58b22fd
      Ilpo Järvinen 提交于
      This is preparatory work for SACK combiner patch which may
      have to count TCP state changes for only a part of the skb
      because it will intentionally avoids splitting skb to SACKed
      and not sacked parts.
      Signed-off-by: NIlpo Järvinen <ilpo.jarvinen@helsinki.fi>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      f58b22fd
    • I
      tcp: Make SACK code to split only at mss boundaries · adb92db8
      Ilpo Järvinen 提交于
      Sadly enough, this adds possible divide though we try to avoid
      it by checking one mss as common case.
      Signed-off-by: NIlpo Järvinen <ilpo.jarvinen@helsinki.fi>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      adb92db8
    • I
      tcp: more aggressive skipping · e8bae275
      Ilpo Järvinen 提交于
      I knew already when rewriting the sacktag that this condition
      was too conservative, change it now since it prevent lot of
      useless work (especially in the sack shifter decision code
      that is being added by a later patch). This shouldn't change
      anything really, just save some processing regardless of the
      shifter.
      Signed-off-by: NIlpo Järvinen <ilpo.jarvinen@helsinki.fi>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      e8bae275
    • I
      e1aa680f
  2. 31 10月, 2008 1 次提交
  3. 30 10月, 2008 1 次提交
  4. 29 10月, 2008 1 次提交
  5. 08 10月, 2008 3 次提交
    • A
      tcp: Fix possible double-ack w/ user dma · 53240c20
      Ali Saidi 提交于
      From: Ali Saidi <saidi@engin.umich.edu>
      
      When TCP receive copy offload is enabled it's possible that
      tcp_rcv_established() will cause two acks to be sent for a single
      packet. In the case that a tcp_dma_early_copy() is successful,
      copied_early is set to true which causes tcp_cleanup_rbuf() to be
      called early which can send an ack. Further along in
      tcp_rcv_established(), __tcp_ack_snd_check() is called and will
      schedule a delayed ACK. If no packets are processed before the delayed
      ack timer expires the packet will be acked twice.
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      53240c20
    • I
      tcp: cleanup messy initializer · 4a7e5609
      Ilpo Järvinen 提交于
      I'm quite sure that if I give this function in its old format
      for you to inspect, you start to wonder what is the type of
      demanded or if it's a global variable.
      Signed-off-by: NIlpo Järvinen <ilpo.jarvinen@helsinki.fi>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      4a7e5609
    • I
      tcp: kill pointless urg_mode · 33f5f57e
      Ilpo Järvinen 提交于
      It all started from me noticing that this urgent check in
      tcp_clean_rtx_queue is unnecessarily inside the loop. Then
      I took a longer look to it and found out that the users of
      urg_mode can trivially do without, well almost, there was
      one gotcha.
      
      Bonus: those funny people who use urg with >= 2^31 write_seq -
      snd_una could now rejoice too (that's the only purpose for the
      between being there, otherwise a simple compare would have done
      the thing). Not that I assume that the rest of the tcp code
      happily lives with such mind-boggling numbers :-). Alas, it
      turned out to be impossible to set wmem to such numbers anyway,
      yes I really tried a big sendfile after setting some wmem but
      nothing happened :-). ...Tcp_wmem is int and so is sk_sndbuf...
      So I hacked a bit variable to long and found out that it seems
      to work... :-)
      Signed-off-by: NIlpo Järvinen <ilpo.jarvinen@helsinki.fi>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      33f5f57e
  6. 23 9月, 2008 1 次提交
  7. 22 9月, 2008 1 次提交
  8. 21 9月, 2008 8 次提交
  9. 09 9月, 2008 1 次提交
  10. 04 9月, 2008 1 次提交
  11. 23 8月, 2008 3 次提交
  12. 26 7月, 2008 1 次提交
  13. 24 7月, 2008 1 次提交
    • D
      tcp: Clear probes_out more aggressively in tcp_ack(). · 4b53fb67
      David S. Miller 提交于
      This is based upon an excellent bug report from Eric Dumazet.
      
      tcp_ack() should clear ->icsk_probes_out even if there are packets
      outstanding.  Otherwise if we get a sequence of ACKs while we do have
      packets outstanding over and over again, we'll never clear the
      probes_out value and eventually think the connection is too sick and
      we'll reset it.
      
      This appears to be some "optimization" added to tcp_ack() in the 2.4.x
      timeframe.  In 2.2.x, probes_out is pretty much always cleared by
      tcp_ack().
      
      Here is Eric's original report:
      
      ----------------------------------------
      Apparently, we can in some situations reset TCP connections in a couple of seconds when some frames are lost.
      
      In order to reproduce the problem, please try the following program on linux-2.6.25.*
      
      Setup some iptables rules to allow two frames per second sent on loopback interface to tcp destination port 12000
      
      iptables -N SLOWLO
      iptables -A SLOWLO -m hashlimit --hashlimit 2 --hashlimit-burst 1 --hashlimit-mode dstip --hashlimit-name slow2 -j ACCEPT
      iptables -A SLOWLO -j DROP
      
      iptables -A OUTPUT -o lo -p tcp --dport 12000 -j SLOWLO
      
      Then run the attached program and see the output :
      
      # ./loop
      State      Recv-Q Send-Q                                  Local Address:Port                                    Peer Address:Port
      ESTAB      0      40                                          127.0.0.1:54455                                      127.0.0.1:12000  timer:(persist,200ms,1)
      State      Recv-Q Send-Q                                  Local Address:Port                                    Peer Address:Port
      ESTAB      0      40                                          127.0.0.1:54455                                      127.0.0.1:12000  timer:(persist,200ms,3)
      State      Recv-Q Send-Q                                  Local Address:Port                                    Peer Address:Port
      ESTAB      0      40                                          127.0.0.1:54455                                      127.0.0.1:12000  timer:(persist,200ms,5)
      State      Recv-Q Send-Q                                  Local Address:Port                                    Peer Address:Port
      ESTAB      0      40                                          127.0.0.1:54455                                      127.0.0.1:12000  timer:(persist,200ms,7)
      State      Recv-Q Send-Q                                  Local Address:Port                                    Peer Address:Port
      ESTAB      0      40                                          127.0.0.1:54455                                      127.0.0.1:12000  timer:(persist,200ms,9)
      State      Recv-Q Send-Q                                  Local Address:Port                                    Peer Address:Port
      ESTAB      0      40                                          127.0.0.1:54455                                      127.0.0.1:12000  timer:(persist,200ms,11)
      State      Recv-Q Send-Q                                  Local Address:Port                                    Peer Address:Port
      ESTAB      0      40                                          127.0.0.1:54455                                      127.0.0.1:12000  timer:(persist,201ms,13)
      State      Recv-Q Send-Q                                  Local Address:Port                                    Peer Address:Port
      ESTAB      0      40                                          127.0.0.1:54455                                      127.0.0.1:12000  timer:(persist,188ms,15)
      write(): Connection timed out
      wrote 890 bytes but was interrupted after 9 seconds
      ESTAB      0      0                 127.0.0.1:12000            127.0.0.1:54455
      Exiting read() because no data available (4000 ms timeout).
      read 860 bytes
      
      While this tcp session makes progress (sending frames with 50 bytes of payload, every 500ms), linux tcp stack decides to reset it, when tcp_retries 2 is reached (default value : 15)
      
      tcpdump :
      
      15:30:28.856695 IP 127.0.0.1.56554 > 127.0.0.1.12000: S 33788768:33788768(0) win 32792 <mss 16396,nop,nop,sackOK,nop,wscale 7>
      15:30:28.856711 IP 127.0.0.1.12000 > 127.0.0.1.56554: S 33899253:33899253(0) ack 33788769 win 32792 <mss 16396,nop,nop,sackOK,nop,wscale 7>
      15:30:29.356947 IP 127.0.0.1.56554 > 127.0.0.1.12000: P 1:61(60) ack 1 win 257
      15:30:29.356966 IP 127.0.0.1.12000 > 127.0.0.1.56554: . ack 61 win 257
      15:30:29.866415 IP 127.0.0.1.56554 > 127.0.0.1.12000: P 61:111(50) ack 1 win 257
      15:30:29.866427 IP 127.0.0.1.12000 > 127.0.0.1.56554: . ack 111 win 257
      15:30:30.366516 IP 127.0.0.1.56554 > 127.0.0.1.12000: P 111:161(50) ack 1 win 257
      15:30:30.366527 IP 127.0.0.1.12000 > 127.0.0.1.56554: . ack 161 win 257
      15:30:30.876196 IP 127.0.0.1.56554 > 127.0.0.1.12000: P 161:211(50) ack 1 win 257
      15:30:30.876207 IP 127.0.0.1.12000 > 127.0.0.1.56554: . ack 211 win 257
      15:30:31.376282 IP 127.0.0.1.56554 > 127.0.0.1.12000: P 211:261(50) ack 1 win 257
      15:30:31.376290 IP 127.0.0.1.12000 > 127.0.0.1.56554: . ack 261 win 257
      15:30:31.885619 IP 127.0.0.1.56554 > 127.0.0.1.12000: P 261:311(50) ack 1 win 257
      15:30:31.885631 IP 127.0.0.1.12000 > 127.0.0.1.56554: . ack 311 win 257
      15:30:32.385705 IP 127.0.0.1.56554 > 127.0.0.1.12000: P 311:361(50) ack 1 win 257
      15:30:32.385715 IP 127.0.0.1.12000 > 127.0.0.1.56554: . ack 361 win 257
      15:30:32.895249 IP 127.0.0.1.56554 > 127.0.0.1.12000: P 361:411(50) ack 1 win 257
      15:30:32.895266 IP 127.0.0.1.12000 > 127.0.0.1.56554: . ack 411 win 257
      15:30:33.395341 IP 127.0.0.1.56554 > 127.0.0.1.12000: P 411:461(50) ack 1 win 257
      15:30:33.395351 IP 127.0.0.1.12000 > 127.0.0.1.56554: . ack 461 win 257
      15:30:33.918085 IP 127.0.0.1.56554 > 127.0.0.1.12000: P 461:511(50) ack 1 win 257
      15:30:33.918096 IP 127.0.0.1.12000 > 127.0.0.1.56554: . ack 511 win 257
      15:30:34.418163 IP 127.0.0.1.56554 > 127.0.0.1.12000: P 511:561(50) ack 1 win 257
      15:30:34.418172 IP 127.0.0.1.12000 > 127.0.0.1.56554: . ack 561 win 257
      15:30:34.927685 IP 127.0.0.1.56554 > 127.0.0.1.12000: P 561:611(50) ack 1 win 257
      15:30:34.927698 IP 127.0.0.1.12000 > 127.0.0.1.56554: . ack 611 win 257
      15:30:35.427757 IP 127.0.0.1.56554 > 127.0.0.1.12000: P 611:661(50) ack 1 win 257
      15:30:35.427766 IP 127.0.0.1.12000 > 127.0.0.1.56554: . ack 661 win 257
      15:30:35.937359 IP 127.0.0.1.56554 > 127.0.0.1.12000: P 661:711(50) ack 1 win 257
      15:30:35.937376 IP 127.0.0.1.12000 > 127.0.0.1.56554: . ack 711 win 257
      15:30:36.437451 IP 127.0.0.1.56554 > 127.0.0.1.12000: P 711:761(50) ack 1 win 257
      15:30:36.437464 IP 127.0.0.1.12000 > 127.0.0.1.56554: . ack 761 win 257
      15:30:36.947022 IP 127.0.0.1.56554 > 127.0.0.1.12000: P 761:811(50) ack 1 win 257
      15:30:36.947039 IP 127.0.0.1.12000 > 127.0.0.1.56554: . ack 811 win 257
      15:30:37.447135 IP 127.0.0.1.56554 > 127.0.0.1.12000: P 811:861(50) ack 1 win 257
      15:30:37.447203 IP 127.0.0.1.12000 > 127.0.0.1.56554: . ack 861 win 257
      15:30:41.448171 IP 127.0.0.1.12000 > 127.0.0.1.56554: F 1:1(0) ack 861 win 257
      15:30:41.448189 IP 127.0.0.1.56554 > 127.0.0.1.12000: R 33789629:33789629(0) win 0
      
      Source of program :
      
      /*
       * small producer/consumer program.
       * setup a listener on 127.0.0.1:12000
       * Forks a child
       *   child connect to 127.0.0.1, and sends 10 bytes on this tcp socket every 100 ms
       * Father accepts connection, and read all data
       */
      #include <sys/types.h>
      #include <sys/socket.h>
      #include <netinet/in.h>
      #include <unistd.h>
      #include <stdio.h>
      #include <time.h>
      #include <sys/poll.h>
      
      int port = 12000;
      char buffer[4096];
      int main(int argc, char *argv[])
      {
              int lfd = socket(AF_INET, SOCK_STREAM, 0);
              struct sockaddr_in socket_address;
              time_t t0, t1;
              int on = 1, sfd, res;
              unsigned long total = 0;
              socklen_t alen = sizeof(socket_address);
              pid_t pid;
      
              time(&t0);
              socket_address.sin_family = AF_INET;
              socket_address.sin_port = htons(port);
              socket_address.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
      
              if (lfd == -1) {
                      perror("socket()");
                      return 1;
              }
              setsockopt(lfd, SOL_SOCKET, SO_REUSEADDR, &on, sizeof(int));
              if (bind(lfd, (struct sockaddr *)&socket_address, sizeof(socket_address)) == -1) {
                      perror("bind");
                      close(lfd);
                      return 1;
              }
              if (listen(lfd, 1) == -1) {
                      perror("listen()");
                      close(lfd);
                      return 1;
              }
              pid = fork();
              if (pid == 0) {
                      int i, cfd = socket(AF_INET, SOCK_STREAM, 0);
                      close(lfd);
                      if (connect(cfd, (struct sockaddr *)&socket_address, sizeof(socket_address)) == -1) {
                              perror("connect()");
                              return 1;
                              }
                      for (i = 0 ; ;) {
                              res = write(cfd, "blablabla\n", 10);
                              if (res > 0) total += res;
                              else if (res == -1) {
                                      perror("write()");
                                      break;
                              } else break;
                              usleep(100000);
                              if (++i == 10) {
                                      system("ss -on dst 127.0.0.1:12000");
                                      i = 0;
                              }
                      }
                      time(&t1);
                      fprintf(stderr, "wrote %lu bytes but was interrupted after %g seconds\n", total, difftime(t1, t0));
                      system("ss -on | grep 127.0.0.1:12000");
                      close(cfd);
                      return 0;
              }
              sfd = accept(lfd, (struct sockaddr *)&socket_address, &alen);
              if (sfd == -1) {
                      perror("accept");
                      return 1;
              }
              close(lfd);
              while (1) {
                      struct pollfd pfd[1];
                      pfd[0].fd = sfd;
                      pfd[0].events = POLLIN;
                      if (poll(pfd, 1, 4000) == 0) {
                              fprintf(stderr, "Exiting read() because no data available (4000 ms timeout).\n");
                              break;
                      }
                      res = read(sfd, buffer, sizeof(buffer));
                      if (res > 0) total += res;
                      else if (res == 0) break;
                      else perror("read()");
              }
              fprintf(stderr, "read %lu bytes\n", total);
              close(sfd);
              return 0;
      }
      ----------------------------------------
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      4b53fb67
  14. 19 7月, 2008 2 次提交
  15. 17 7月, 2008 3 次提交
  16. 03 7月, 2008 1 次提交
    • P
      tcp: de-bloat a bit with factoring NET_INC_STATS_BH out · 40b215e5
      Pavel Emelyanov 提交于
      There are some places in TCP that select one MIB index to
      bump snmp statistics like this:
      
      	if (<something>)
      		NET_INC_STATS_BH(<some_id>);
      	else if (<something_else>)
      		NET_INC_STATS_BH(<some_other_id>);
      	...
      	else
      		NET_INC_STATS_BH(<default_id>);
      
      or in a more tricky but still similar way.
      
      On the other hand, this NET_INC_STATS_BH is a camouflaged
      increment of percpu variable, which is not that small.
      
      Factoring those cases out de-bloats 235 bytes on non-preemptible
      i386 config and drives parts of the code into 80 columns.
      
      add/remove: 0/0 grow/shrink: 0/7 up/down: 0/-235 (-235)
      function                                     old     new   delta
      tcp_fastretrans_alert                       1437    1424     -13
      tcp_dsack_set                                137     124     -13
      tcp_xmit_retransmit_queue                    690     676     -14
      tcp_try_undo_recovery                        283     265     -18
      tcp_sacktag_write_queue                     1550    1515     -35
      tcp_update_reordering                        162     106     -56
      tcp_retransmit_timer                         990     904     -86
      Signed-off-by: NPavel Emelyanov <xemul@openvz.org>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      40b215e5
  17. 13 6月, 2008 1 次提交
    • D
      tcp: Revert 'process defer accept as established' changes. · ec0a1966
      David S. Miller 提交于
      This reverts two changesets, ec3c0982
      ("[TCP]: TCP_DEFER_ACCEPT updates - process as established") and
      the follow-on bug fix 9ae27e0a
      ("tcp: Fix slab corruption with ipv6 and tcp6fuzz").
      
      This change causes several problems, first reported by Ingo Molnar
      as a distcc-over-loopback regression where connections were getting
      stuck.
      
      Ilpo Järvinen first spotted the locking problems.  The new function
      added by this code, tcp_defer_accept_check(), only has the
      child socket locked, yet it is modifying state of the parent
      listening socket.
      
      Fixing that is non-trivial at best, because we can't simply just grab
      the parent listening socket lock at this point, because it would
      create an ABBA deadlock.  The normal ordering is parent listening
      socket --> child socket, but this code path would require the
      reverse lock ordering.
      
      Next is a problem noticed by Vitaliy Gusev, he noted:
      
      ----------------------------------------
      >--- a/net/ipv4/tcp_timer.c
      >+++ b/net/ipv4/tcp_timer.c
      >@@ -481,6 +481,11 @@ static void tcp_keepalive_timer (unsigned long data)
      > 		goto death;
      > 	}
      >
      >+	if (tp->defer_tcp_accept.request && sk->sk_state == TCP_ESTABLISHED) {
      >+		tcp_send_active_reset(sk, GFP_ATOMIC);
      >+		goto death;
      
      Here socket sk is not attached to listening socket's request queue. tcp_done()
      will not call inet_csk_destroy_sock() (and tcp_v4_destroy_sock() which should
      release this sk) as socket is not DEAD. Therefore socket sk will be lost for
      freeing.
      ----------------------------------------
      
      Finally, Alexey Kuznetsov argues that there might not even be any
      real value or advantage to these new semantics even if we fix all
      of the bugs:
      
      ----------------------------------------
      Hiding from accept() sockets with only out-of-order data only
      is the only thing which is impossible with old approach. Is this really
      so valuable? My opinion: no, this is nothing but a new loophole
      to consume memory without control.
      ----------------------------------------
      
      So revert this thing for now.
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      ec0a1966
  18. 12 6月, 2008 2 次提交
  19. 05 6月, 2008 1 次提交
    • I
      tcp: fix skb vs fack_count out-of-sync condition · a6604471
      Ilpo Järvinen 提交于
      This bug is able to corrupt fackets_out in very rare cases.
      In order for this to cause corruption:
        1) DSACK in the middle of previous SACK block must be generated.
        2) In order to take that particular branch, part or all of the
           DSACKed segment must already be SACKed so that we have that
           in cache in the first place.
        3) The new info must be top enough so that fackets_out will be
           updated on this iteration.
      ...then fack_count is updated while skb wasn't, then we walk again
      that particular segment thus updating fack_count twice for
      a single skb and finally that value is assigned to fackets_out
      by tcp_sacktag_one.
      
      It is safe to call tcp_sacktag_one just once for a segment (at
      DSACK), no need to call again for plain SACK.
      
      Potential problem of the miscount are limited to premature entry
      to recovery and to inflated reordering metric (which could even
      cancel each other out in the most the luckiest scenarios :-)).
      Both are quite insignificant in worst case too and there exists
      also code to reset them (fackets_out once sacked_out becomes zero
      and reordering metric on RTO).
      
      This has been reported by a number of people, because it occurred
      quite rarely, it has been very evasive. Andy Furniss was able to
      get it to occur couple of times so that a bit more info was
      collected about the problem using a debug patch, though it still
      required lot of checking around. Thanks also to others who have
      tried to help here.
      
      This is listed as Bugzilla #10346. The bug was introduced by
      me in commit 68f8353b ([TCP]: Rewrite SACK block processing & 
      sack_recv_cache use), I probably thought back then that there's
      need to scan that entry twice or didn't dare to make it go
      through it just once there. Going through twice would have
      required restoring fack_count after the walk but as noted above,
      I chose to drop the additional walk step altogether here.
      Signed-off-by: NIlpo Järvinen <ilpo.jarvinen@helsinki.fi>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      a6604471