1. 24 Nov, 2008 4 commits
  2. 20 Nov, 2008 3 commits
  3. 17 Nov, 2008 6 commits
    • dccp: Tidy up setsockopt calls · 19102996
      Gerrit Renker committed
      This splits the setsockopt calls into two groups, depending on whether an
      integer argument (val) is required and whether routines being called do
      their own locking.
      
      Some options (such as setting the CCID) use u8 rather than int, so that for
      these the test with regard to integer-sizeof can not be used.
      
      The second switch-case statement now only has those statements which need
      locking and which make use of `val'.
      Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
      Acked-by: Ian McDonald <ian.mcdonald@jandi.co.nz>
      Acked-by: Arnaldo Carvalho de Melo <acme@redhat.com>
      Reviewed-by: Eugene Teo <eugeneteo@kernel.sg>
      Signed-off-by: David S. Miller <davem@davemloft.net>
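The two-group split described in this commit can be pictured with a small userspace sketch. It is not the kernel code: the option numbers, `struct demo_sock`, `demo_setsockopt()` and the `lock_depth` counter (a stand-in for socket locking) are all hypothetical names chosen for illustration.

```c
#include <errno.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical option numbers, for illustration only. */
enum { OPT_CCID_LIST = 1, OPT_SERVER_TIMEWAIT = 2, OPT_SEND_CSCOV = 3 };

struct demo_sock {
    unsigned char ccids[8];
    size_t        nccids;
    int           server_timewait;
    int           send_cscov;
    int           lock_depth;   /* stand-in for taking/releasing the socket lock */
};

/* Byte-array option: no int argument, and the routine locks for itself. */
static int set_ccid_list(struct demo_sock *sk, const void *optval, size_t optlen)
{
    if (optlen < 1 || optlen > sizeof(sk->ccids))
        return -EINVAL;
    sk->lock_depth++;
    memcpy(sk->ccids, optval, optlen);
    sk->nccids = optlen;
    sk->lock_depth--;
    return 0;
}

int demo_setsockopt(struct demo_sock *sk, int optname,
                    const void *optval, size_t optlen)
{
    int val, err = 0;

    /* Group 1: options that do not take an int and do their own locking. */
    switch (optname) {
    case OPT_CCID_LIST:
        return set_ccid_list(sk, optval, optlen);
    }

    /* Everything below requires an int-sized argument. */
    if (optlen < sizeof(int))
        return -EINVAL;
    memcpy(&val, optval, sizeof(val));

    /* Group 2: options that use `val' and run under the lock. */
    sk->lock_depth++;
    switch (optname) {
    case OPT_SERVER_TIMEWAIT:
        sk->server_timewait = (val != 0);
        break;
    case OPT_SEND_CSCOV:
        if (val < 0 || val > 15)
            err = -EINVAL;
        else
            sk->send_cscov = val;
        break;
    default:
        err = -ENOPROTOOPT;
    }
    sk->lock_depth--;
    return err;
}
```

The sizeof(int) test sits between the two switch statements, so a one-byte CCID list is never rejected by it.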
    • dccp: Deprecate Ack Ratio sysctl · dd9c0e36
      Gerrit Renker committed
      This patch deprecates the Ack Ratio sysctl, since
       * Ack Ratio is entirely ignored by CCID-3 and CCID-4,
       * Ack Ratio currently doesn't work in CCID-2 (i.e. is always set to 1);
       * even if it would work in CCID-2, there is no point for a user to change it:
         - Ack Ratio is constrained by cwnd (RFC 4341, 6.1.2),
         - if Ack Ratio > cwnd, the system resorts to spurious RTO timeouts
           (since waiting for Acks which will never arrive in this window),
         - cwnd is not a user-configurable value.
      
      The only reasonable place for Ack Ratio is to print it for debugging. It is
      planned to do this later on, as part of e.g. dccp_probe.
      
      With this patch Ack Ratio is now under full control of feature negotiation:
       * Ack Ratio is resolved as a dependency of the selected CCID;
       * if the chosen CCID supports it (i.e. CCID == CCID-2), Ack Ratio is set to
         the default of 2, following RFC 4340, 11.3 - "New connections start with Ack
         Ratio 2 for both endpoints";
       * what happens then is part of another patch set, since it concerns the
         dynamic update of Ack Ratio while the connection is in full flight.
      
      Thanks to Tomasz Grobelny for discussion leading up to this patch.
      Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
      Acked-by: Arnaldo Carvalho de Melo <acme@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • dccp: Feature negotiation for minimum-checksum-coverage · 29450559
      Gerrit Renker committed
      This provides feature negotiation for server minimum checksum coverage
      which so far has been missing.
      
      Since sender/receiver coverage values range only from 0...15, their
      type has also been reduced in size from u16 to u4.
      
      Feature-negotiation options are now generated for both sender and receiver
      coverage, i.e. when the peer has `forgotten' to enable partial coverage
      then feature negotiation will automatically enable (negotiate) the partial
      coverage value for this connection.
      Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
      Acked-by: Ian McDonald <ian.mcdonald@jandi.co.nz>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • dccp: Deprecate old setsockopt framework · 49aebc66
      Gerrit Renker committed
      The previous setsockopt interface, which passed socket options via struct
      dccp_so_feat, is complicated/difficult to use. Continuing to support it leads to
      ugly code since the old approach did not distinguish between NN and SP values.
      
      This patch removes the old setsockopt interface and replaces it with two new
      functions to register NN/SP values for feature negotiation. 
      These are essentially wrappers around the internal __feat_register functions,
      with checking added to avoid
      
       * wrong usage (type);
       * changing values while the connection is in progress.
      Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • dccp: Mechanism to resolve CCID dependencies · 0c116839
      Gerrit Renker committed
      This adds a hook to resolve features whose value depends on the choice of
      CCID. It is done at the server since it can only be done after the CCID
      values have been negotiated; i.e. the client will add its CCID preference
      list on the Change options sent in the Request, which will be reconciled
      with the local preference list of the server.
      
      The concept is documented on
      http://www.erg.abdn.ac.uk/users/gerrit/dccp/notes/feature_negotiation/\
      				implementation_notes.html#ccid_dependencies
      Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
      Acked-by: Ian McDonald <ian.mcdonald@jandi.co.nz>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: Convert TCP & DCCP hash tables to use RCU / hlist_nulls · 3ab5aee7
      Eric Dumazet committed
      RCU was added to UDP lookups, using a fast infrastructure:
      - sockets kmem_cache use SLAB_DESTROY_BY_RCU and don't pay the
        price of call_rcu() at freeing time.
      - hlist_nulls permits the use of few memory barriers.
      
      This patch uses the same infrastructure for TCP/DCCP established
      and timewait sockets.
      
      Thanks to SLAB_DESTROY_BY_RCU, there is no slowdown for applications
      using short-lived TCP connections. A follow-up patch, converting
      rwlocks to spinlocks, will even speed up this case.
      
      __inet_lookup_established() is pretty fast now that we don't have to
      dirty a contended cache line (read_lock/read_unlock).
      
      Only established and timewait hashtable are converted to RCU
      (bind table and listen table are still using traditional locking)
      Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  4. 12 Nov, 2008 4 commits
    • dccp: Resolve dependencies of features on choice of CCID · 9eca0a47
      Gerrit Renker committed
      This provides a missing link in the code chain, as several features implicitly
      depend and/or rely on the choice of CCID. Most notably, this is the Send Ack Vector
      feature, but also Ack Ratio and Send Loss Event Rate (also taken care of).
      
      For Send Ack Vector, the situation is as follows:
       * since CCID2 mandates the use of Ack Vectors, there is no point in allowing
         endpoints which use CCID2 to disable Ack Vector features on such a connection;
      
       * a peer with a TX CCID of CCID2 will always expect Ack Vectors, and a peer
         with a RX CCID of CCID2 must always send Ack Vectors (RFC 4341, sec. 4);
      
       * for all other CCIDs, the use of (Send) Ack Vector is optional and thus
         negotiable. However, this implies that the code negotiating the use of Ack
         Vectors also supports it (i.e. is able to supply and to either parse or
         ignore received Ack Vectors). Since this is not the case (CCID-3 has no Ack
         Vector support), the use of Ack Vectors is here disabled, with a comment
         in the source code.
      
      An analogous consideration arises for the Send Loss Event Rate feature,
      since the CCID-3 implementation does not support the loss interval options
      of RFC 4342. To make such use explicit, corresponding feature-negotiation
      options are inserted which signal the use of the loss event rate option,
      as it is used by the CCID3 code.
      
      Lastly, the values of the Ack Ratio feature are matched to the choice of CCID.
      
      The patch implements this as a function which is called after the user has
      made all other registrations for changing default values of features.
      
      The table is variable-length, the reserved (and hence for feature-negotiation
      invalid, confirmed by considering section 19.4 of RFC 4340) feature number `0'
      is used to mark the end of the table.
      Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
      Acked-by: Ian McDonald <ian.mcdonald@jandi.co.nz>
      Signed-off-by: David S. Miller <davem@davemloft.net>
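A variable-length table terminated by the reserved feature number 0, as described in this commit, could look roughly like the sketch below. The struct layout, the function names and the exact table entries are assumptions for illustration. The feature numbers follow RFC 4340 (5 = Ack Ratio, 6 = Send Ack Vector) and RFC 4342 (192 = Send Loss Event Rate), and the CCID-2 Ack Ratio value of 2 is the default quoted from RFC 4340, 11.3.

```c
#include <stddef.h>
#include <stdint.h>

enum {
    FEAT_END             = 0,   /* reserved number, used as table terminator */
    FEAT_ACK_RATIO       = 5,
    FEAT_SEND_ACK_VECTOR = 6,
    FEAT_SEND_LEV_RATE   = 192
};

/* Hypothetical entry: a feature and the value it takes for a given CCID. */
struct ccid_dep {
    uint8_t feat;
    uint8_t val;
};

static const struct ccid_dep ccid2_deps[] = {
    { FEAT_SEND_ACK_VECTOR, 1 },    /* CCID2 mandates Ack Vectors */
    { FEAT_ACK_RATIO,       2 },
    { FEAT_END,             0 }
};

static const struct ccid_dep ccid3_deps[] = {
    { FEAT_SEND_ACK_VECTOR, 0 },    /* no Ack Vector support in CCID-3 */
    { FEAT_SEND_LEV_RATE,   1 },
    { FEAT_END,             0 }
};

static const struct ccid_dep no_deps[] = {
    { FEAT_END, 0 }
};

/* Return the dependency table for a CCID; unknown CCIDs get an empty table. */
const struct ccid_dep *ccid_deps(uint8_t ccid)
{
    switch (ccid) {
    case 2:  return ccid2_deps;
    case 3:  return ccid3_deps;
    default: return no_deps;
    }
}

/* Walk a table up to (not including) the terminating entry. */
size_t ccid_deps_count(const struct ccid_dep *table)
{
    size_t n = 0;

    while (table[n].feat != FEAT_END)
        n++;
    return n;
}
```

Because 0 can never be a valid feature number in a negotiation, no separate length field is needed.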
    • dccp: Query supported CCIDs · d90ebcbf
      Gerrit Renker committed
      This provides a data structure to record which CCIDs are locally supported
      and three accessor functions:
       - a test function for internal use which is used to validate CCID requests
         made by the user;
       - a copy function so that the list can be used for feature-negotiation;   
       - documented getsockopt() support so that the user can query capabilities.
      
      The data structure is a table which is filled in at compile-time with the
      list of available CCIDs (which in turn depends on the Kconfig choices).
      
      Using the copy function for cloning the list of supported CCIDs is useful for
      feature negotiation, since the negotiation is now with the full list of available
      CCIDs (e.g. {2, 3}) instead of the default value {2}. This means negotiation 
      will not fail if the peer requests to use CCID3 instead of CCID2. 
      Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
      Acked-by: Ian McDonald <ian.mcdonald@jandi.co.nz>
      Signed-off-by: David S. Miller <davem@davemloft.net>
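As a rough illustration of a compile-time CCID table with a test function and a copy function, consider the sketch below. The `DEMO_HAVE_*` macros stand in for Kconfig choices, and all names are hypothetical rather than the kernel's own.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Stand-ins for Kconfig choices; at least one must be 1 in this sketch. */
#define DEMO_HAVE_CCID2 1
#define DEMO_HAVE_CCID3 1

/* Table filled in at compile time with the available CCIDs. */
static const uint8_t supported_ccids[] = {
#if DEMO_HAVE_CCID2
    2,
#endif
#if DEMO_HAVE_CCID3
    3,
#endif
};

#define NUM_CCIDS (sizeof(supported_ccids) / sizeof(supported_ccids[0]))

/* Test function: validate a CCID requested by the user. */
bool ccid_is_supported(uint8_t id)
{
    size_t i;

    for (i = 0; i < NUM_CCIDS; i++)
        if (supported_ccids[i] == id)
            return true;
    return false;
}

/*
 * Copy function: clone the list for use as a preference list.
 * Returns the number of entries copied, or 0 if dst is too small.
 */
size_t ccid_copy_supported(uint8_t *dst, size_t dstlen)
{
    if (dstlen < NUM_CCIDS)
        return 0;
    memcpy(dst, supported_ccids, NUM_CCIDS);
    return NUM_CCIDS;
}
```

With both macros set, the copied list is {2, 3}, which matches the example in the commit text where a peer asking for CCID3 no longer causes negotiation to fail.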
    • dccp: Registration routines for changing feature values · e8ef967a
      Gerrit Renker committed
      Two registration routines, for SP and NN features, are provided by this patch,
      replacing a previous routine which was used for both feature types.
      
      These are internal-only routines and therefore start with `__feat_register'.
      
      It further exports the known limits of Sequence Window and Ack Ratio as symbolic
      constants.
      Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
      Acked-by: Ian McDonald <ian.mcdonald@jandi.co.nz>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • dccp: Limit feature negotiation to connection setup phase · f74e91b6
      Gerrit Renker committed
      This patch limits feature (capability) negotiation to the connection setup phase:
      
       1. Although it is theoretically possible to perform feature negotiation at any
          time (and RFC 4340 supports this), in practice this is prohibitively complex,
          as it requires putting traffic on hold for each new negotiation.
       2. As a byproduct of restricting feature negotiation to connection setup, the
          feature-negotiation retransmit timer is no longer required. This part is now
          mapped onto the protocol-level retransmission.
          Details indicating why timers are no longer needed can be found on
          http://www.erg.abdn.ac.uk/users/gerrit/dccp/notes/feature_negotiation/\
      	                                      implementation_notes.html
      
      This patch disables anytime negotiation, subsequent patches work out full
      feature negotiation support for connection setup.
      Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  5. 05 Nov, 2008 5 commits
  6. 31 Oct, 2008 1 commit
  7. 20 Oct, 2008 1 commit
  8. 17 Oct, 2008 1 commit
  9. 09 Oct, 2008 1 commit
  10. 08 Oct, 2008 1 commit
  11. 09 Sep, 2008 1 commit
  12. 04 Sep, 2008 12 commits
    • dccp ccid-3: Preventing Oscillations · a3cbdde8
      Gerrit Renker committed
      This implements [RFC 3448, 4.5], which performs congestion avoidance behaviour
      by reducing the transmit rate as the queueing delay (measured in terms of
      long-term RTT) increases.
      
      Oscillation prevention can be turned on/off via a module option (do_osc_prev) and
      via sysfs (using mode 0644); the default is off.
      
      Overflow analysis:
      ------------------
       * oscillation prevention is done after update_x(), so that t_ipi <= 64000;
       * hence the multiplication "t_ipi * sqrt(R_sample)" needs 64 bits;
       * done using u64 for sqrt_sample and explicit typecast of t_ipi;
       * the divisor, R_sqmean, is non-zero because oscillation prevention is first
         called when receiving the second feedback packet, and tfrc_scaled_rtt() > 0.
      
      A detailed discussion of the algorithm (with plots) is on
      http://www.erg.abdn.ac.uk/users/gerrit/dccp/notes/ccid3/sender_notes/oscillation_prevention/
      
      The algorithm has negative side effects:
        * when allowing to decrease t_ipi (leads to a large RTT) and
        * when using it during slow-start;
      both uses are therefore disabled.
      Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
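RFC 3448, 4.5 keeps a running average R_sqmean = 0.9 * R_sqmean + 0.1 * sqrt(R_sample) and sends at X_inst = X * R_sqmean / sqrt(R_sample). Expressed on the inter-packet interval, this becomes t_ipi * sqrt(R_sample) / R_sqmean. The sketch below is one possible integer rendering, with the product widened to 64 bits and only increases of t_ipi allowed, as the commit text describes. The function names, the seeding on the first sample and the integer square root are assumptions, not the kernel implementation.

```c
#include <stdint.h>

/* Integer square root (floor), digit-by-digit method. */
static uint32_t isqrt_u64(uint64_t x)
{
    uint64_t res = 0, bit = (uint64_t)1 << 62;

    while (bit > x)
        bit >>= 2;
    while (bit != 0) {
        if (x >= res + bit) {
            x -= res + bit;
            res = (res >> 1) + bit;
        } else {
            res >>= 1;
        }
        bit >>= 2;
    }
    return (uint32_t)res;
}

/*
 * t_ipi:    current inter-packet interval
 * r_sample: latest RTT sample
 * r_sqmean: running average of sqrt(RTT sample); 0 means "not seeded yet"
 * Returns the (possibly lengthened) inter-packet interval.
 */
uint64_t osc_prev_adjust(uint32_t t_ipi, uint32_t r_sample, uint32_t *r_sqmean)
{
    uint64_t sqrt_sample = isqrt_u64(r_sample);

    if (*r_sqmean == 0) {           /* first sample only seeds the average */
        *r_sqmean = (uint32_t)sqrt_sample;
        return t_ipi;
    }
    /* R_sqmean = 0.9 * R_sqmean + 0.1 * sqrt(R_sample) */
    *r_sqmean = (uint32_t)((9 * (uint64_t)*r_sqmean + sqrt_sample) / 10);

    /* Never shorten t_ipi; also guards the division below. */
    if (*r_sqmean == 0 || sqrt_sample <= *r_sqmean)
        return t_ipi;

    /* 64-bit product, then divide by the non-zero average. */
    return (uint64_t)t_ipi * sqrt_sample / *r_sqmean;
}
```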
    • dccp ccid-3: Simplify computing and range-checking of t_ipi · 53ac9570
      Gerrit Renker committed
      This patch simplifies the computation of t_ipi, avoiding expensive computations
      to enforce the minimum sending rate.
      
      Both RFC 3448 and rfc3448bis (revision #6), as well as RFC 4342 sec 5., require
      at various stages that at least one packet must be sent per t_mbi = 64 seconds.
      This requires frequent divisions of the type X_min = s/t_mbi, which are later
      converted back into an inter-packet-interval t_ipi_max = s/X_min = t_mbi.
      
      The patch removes the expensive indirection; in the unlikely case of having
      a sending rate less than one packet per 64 seconds, it also re-adjusts X.
      
      The following cases document conformance with RFC 3448  / rfc3448bis-06:
       1) Time until receiving the first feedback packet:
         * if the sender has no initial RTT sample then X = s/1 Bps > s/t_mbi;
         * if the sender has an initial RTT sample or when the first feedback
           packet is received, X = W_init/R > s/t_mbi.
      
       2) Slow-start (p == 0 and feedback packets come in):
         * RFC 3448  (current code) enforces a minimum of s/R > s/t_mbi;
         * rfc3448bis (future code) enforces an even higher minimum of W_init/R.
      
       3) Congestion avoidance with no absence of feedback (p > 0):
         * when X_calc or X_recv/2 are too low, the minimum of X_min = s/t_mbi
           is enforced in update_x() when calling update_send_interval();
         * update_send_interval() is, as before, only called when X changes
           (i.e. either when increasing or decreasing, not when in equilibrium).
      
       4) Reduction of X without prior feedback or during slow-start (p==0):
         * both RFC 3448 and rfc3448bis here halve X directly;
         * the associated constraint X >= s/t_mbi is enforced here by send_interval().
      
       5) Reduction of X when p > 0:
         * X is modified indirectly via X_recv (RFC 3448) or X_recv_set (rfc3448bis);
         * in both cases, control goes back to section 4.3 (in both documents);
         * since p > 0, both documents use X = max(min(...), s/t_mbi), which is
           enforced in this patch by calling send_interval() from update_x().
      
      I think that this analysis is exhaustive. Should I have forgotten a case,
      the worst-case consideration arises when X sinks below s/t_mbi, and is then
      increased back up to this minimum value. Even under this assumption, the
      behaviour is correct, since all lower limits of X in RFC 3448 / rfc3448bis
      are either equal to or greater than s/t_mbi.
      
      Note on the condition X >= s/t_mbi  <==> t_ipi = s/X <= t_mbi: since X is
      scaled by 64, and all time units are in microseconds, the coded condition is:
      
          t_ipi = s * 64 * 10^6 usec / X <= 64 * 10^6 usec
      
      This simplifies to s / X <= 1 second <==> X * 1 second >= s > 0.
      (A zero `s' is not allowed by the CCID-3 code).	
      Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
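The closing note on the coded condition can be checked with a small sketch. It assumes, as stated in the commit text, that X is stored scaled by 64 and that times are in microseconds. The function name and signature are hypothetical, and the guard for s == 0 is added only to keep the sketch safe.

```c
#include <stdint.h>

#define USEC_PER_SEC 1000000ULL
#define T_MBI_USEC   (64 * USEC_PER_SEC)    /* t_mbi = 64 seconds */

/*
 * s:        packet size in bytes (must be > 0)
 * x_scaled: sending rate in bytes/second, scaled by 64; re-adjusted in
 *           place if it is below one packet per 64 seconds
 * Returns t_ipi in microseconds, never more than T_MBI_USEC.
 */
uint64_t send_interval_usec(uint32_t s, uint64_t *x_scaled)
{
    if (s == 0)                     /* not allowed; avoid dividing by zero */
        return T_MBI_USEC;

    /* X >= s/t_mbi reduces to x_scaled >= s once the factor 64 cancels. */
    if (*x_scaled < s)
        *x_scaled = s;

    return (uint64_t)s * 64 * USEC_PER_SEC / *x_scaled;
}
```

For example, s = 1000 bytes at X = 1000 bytes/second (x_scaled = 64000) gives t_ipi = 1000000 usec, while any rate below the minimum is raised so that t_ipi is exactly 64000000 usec.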
    • dccp ccid-3: Measuring the packet size s with regard to rfc3448bis-06 · c8f41d50
      Gerrit Renker committed
      rfc3448bis allows three different ways of tracking the packet size `s': 
      
       1. using the MSS/MPS (at initialisation, 4.2, and in 4.1 (1));
       2. using the average of `s' (in 4.1);
       3. using the maximum of `s' (in 4.2).
      
      Instead of hard-coding a single interpretation of rfc3448bis, this implements
      a choice of all three alternatives and suggests the first as default, since it
      is the option which is most consistent with other parts of the specification.
      
      The patch further deprecates the update of t_ipi whenever `s' changes. The
      gains of doing this are only small since a change of s takes effect at the
      next instant X is updated:
       * when the next feedback comes in (within one RTT or less);
       * when the nofeedback timer expires (within at most 4 RTTs).
       
      Further, there are complications caused by updating t_ipi whenever s changes:
       * if t_ipi had previously been updated to effect oscillation prevention (4.5),
         then it is impossible to make the same adjustment to t_ipi again, thus
         counter-acting the algorithm;
       * s may be updated any time and a modification of t_ipi depends on the current
         state (e.g. no oscillation prevention is done in the absence of feedback);
       * in rev-06 of rfc3448bis, there are more possible cases, depending on whether
         the sender is in slow-start (t_ipi <= R/W_init), or in congestion-avoidance,
         limited by X_recv or the throughput equation (t_ipi <= t_mbi).
      
      Thus there are side effects of always updating t_ipi as s changes. These may not
      be desirable. The only case I can think of where such an update makes sense is
      to recompute X_calc when p > 0 and when s changes (not done by this patch).
      Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
    • dccp ccid-3: Tidy up CCID-Kconfig dependencies · 891e4d8a
      Gerrit Renker committed
      The per-CCID menu has several dependencies on EXPERIMENTAL. These are redundant,
      since net/dccp/ccids/Kconfig is sourced by net/dccp/Kconfig and since the
      latter menu in turn asserts a dependency on EXPERIMENTAL.
      
      The patch removes the redundant dependencies as well as the repeated reference
      within the sub-menu.
      
      Further changes:
      ----------------
      Two single dependencies on CCID-3 are replaced with a single enclosing `if'.
      Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
    • dccp ccid-3: Implement rfc3448bis change to initial-rate computation · 9d497a2c
      Gerrit Renker committed
      The patch updates CCID-3 with regard to the latest rfc3448bis-06: 
       * in the first revisions of the draft, MSS was used for the RFC 3390 window; 
       * then (from revision #1 to revision #2), it used the packet size `s';
       * now, in this revision (and apparently final), the value is back to MSS.
      
      This change has an implication for the case when no RTT sample is available,
      at the time of sending the first packet:
      
       * with RTT sample, 2*MSS/RTT <= initial_rate <= 4*MSS/RTT;
       * without RTT sample, the initial rate is one packet (s bytes) per second
         (sec. 4.2), but using s instead of MSS here creates an imbalance, since
         this would further reduce the initial sending rate.
      
      Hence the patch uses MSS (called MPS in RFC 4340) in all places.
      Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
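The bounds 2*MSS/RTT <= initial_rate <= 4*MSS/RTT follow from the RFC 3390 initial window, W_init = min(4*MSS, max(2*MSS, 4380)). The sketch below works through that arithmetic, using MSS in both branches as the commit describes. The function names and the convention that an RTT of 0 means "no sample yet" are assumptions.

```c
#include <stdint.h>

/* RFC 3390 initial window in bytes: min(4*MSS, max(2*MSS, 4380)). */
uint32_t rfc3390_initial_window(uint32_t mss)
{
    uint32_t lo = 2 * mss > 4380 ? 2 * mss : 4380;
    uint32_t hi = 4 * mss;

    return hi < lo ? hi : lo;
}

/*
 * Initial sending rate in bytes per second.
 * rtt_usec == 0 means no RTT sample is available yet.
 */
uint64_t initial_rate(uint32_t mss, uint32_t rtt_usec)
{
    if (rtt_usec == 0)
        return mss;     /* one MSS-sized packet per second */

    return (uint64_t)rfc3390_initial_window(mss) * 1000000ULL / rtt_usec;
}
```

With MSS = 1460 the window is 4380 bytes (three packets); with MSS = 536 it is capped at 4*MSS = 2144 bytes.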
    • dccp ccid-3: Update the RX history records in one place · 88e97a93
      Gerrit Renker committed
      This patch is a requirement for enabling ECN support later on. With that change
      in mind, the following preparations are done:
       * renamed handle_loss() into congestion_event() since it returns true when a
         congestion event happens (it will eventually also take care of ECN packets);
       * lets tfrc_rx_congestion_event() always update the RX history records, since
         this routine needs to be called for each non-duplicate packet anyway;
       * made all involved boolean-type functions return `bool'.
      
      Updating the RX history records is now only necessary for the packets received
      up to sending the first feedback. The receiver code becomes again simpler.
      Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
    • dccp ccid-3: Update the computation of X_recv · 68c89ee5
      Gerrit Renker committed
      This updates the computation of X_recv with regard to Errata 610/611 for
      RFC 4342 and draft rfc3448bis-06, ensuring that at least an interval of 1
      RTT is used to compute X_recv.  The change is wrapped into a new function
      ccid3_hc_rx_x_recv().
      
      Further changes:
      ----------------
       * feedback is not sent when no data packets arrived (bytes_recv == 0), as per
         rfc3448bis-06, 6.2;
       * take the timestamp for the feedback /after/ dccp_send_ack() returns, to avoid
         taking the transmission time into account (in case layer-2 is busy);
       * clearer handling of failure in ccid3_first_li().
      Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
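The "at least 1 RTT" rule for X_recv amounts to dividing the received bytes by the larger of the measurement interval and the RTT. The sketch below is a minimal rendering under that reading. It is not the body of ccid3_hc_rx_x_recv(), and its name, parameters and units are assumptions.

```c
#include <stdint.h>

/*
 * bytes_recv: bytes received since the last feedback
 * delta_usec: time since the last feedback, in microseconds
 * rtt_usec:   current RTT estimate, in microseconds
 * Returns the receive rate in bytes per second.
 */
uint64_t x_recv_bytes_per_sec(uint32_t bytes_recv, uint32_t delta_usec,
                              uint32_t rtt_usec)
{
    /* Use an interval of at least one RTT. */
    uint32_t interval = delta_usec > rtt_usec ? delta_usec : rtt_usec;

    if (interval == 0)
        return 0;
    return (uint64_t)bytes_recv * 1000000ULL / interval;
}
```

Dividing by a very short interval would overstate the rate; stretching the interval to one RTT avoids that.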
    • dccp tfrc: Increase number of RTT samples · 22338f09
      Gerrit Renker committed
      This improves the receiver RTT sampling algorithm so that it tries harder to get
      as many RTT samples as possible. 
      
      The algorithm is based on the concepts presented in RFC 4340, 8.1, using timestamps
      and the CCVal window counter. There exist 4 cases for the CCVal difference:
       * == 0: less than RTT/4 passed since last packet -- unusable;
       *  > 4: (much) more than 1 RTT has passed since last packet -- also unusable;
       * == 4: perfect sample (exactly one RTT has passed since last packet);
       * 1..3: sub-optimal sample (between RTT/4 and 3*RTT/4 has passed).
      
      In the last case the algorithm tried to optimise by storing away the candidate
      and then re-trying next time. The problem is that
       * a large number of samples is needed to smooth out the inaccuracies of the
         algorithm;
       * the sender may not be sending enough packets to warrant a "next time";
       * hence it is better to use suboptimal samples whenever possible.
      The algorithm now stores away the current sample only if the difference is 0.
      
      Applicability and background
      ----------------------------
      A realistic example is MP3 streaming where packets are sent at a rate of less
      than one packet per RTT, which means that suitable samples are absent for a
      very long time.
      
      The effectiveness of using suboptimal samples (with a delta between 1 and 4) was
      confirmed by instrumenting the algorithm with counters. The results of two 20
      second test runs were:
       * With the old algorithm and a total of 38442 function calls, only 394 of these
         calls resulted in usable RTT samples (about 1%), and 378 out of these were
         "perfect" samples and 28013 (unused) samples had a delta of 1..3.
       * With the new algorithm and a total of 37057 function calls, 1702 usable RTT
         samples were retrieved (about 4.6%), 5 out of these were "perfect" samples.
      Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
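The four CCVal cases can be illustrated as follows. The window counter is a 4-bit value that advances once per quarter RTT, so a difference of delta after `elapsed` microseconds suggests RTT = elapsed * 4 / delta. This is only a sketch of the classification and scaling step: the function name is hypothetical, and it leaves out the bookkeeping for which packet serves as the reference.

```c
#include <stdint.h>

/*
 * last_ccval, ccval: window counter values of the reference packet and
 *                    the current packet (4 bits each)
 * elapsed_usec:      time between the two packets, in microseconds
 * Returns an RTT estimate in microseconds, or 0 if the pair is unusable.
 */
uint64_t ccval_rtt_sample(uint8_t last_ccval, uint8_t ccval,
                          uint32_t elapsed_usec)
{
    /* Modulo-16 difference of the 4-bit counters. */
    uint8_t delta = (uint8_t)(ccval - last_ccval) & 0xF;

    if (delta == 0)     /* less than RTT/4 has passed: unusable */
        return 0;
    if (delta > 4)      /* (much) more than one RTT: unusable */
        return 0;

    /* delta == 4 is a perfect sample; 1..3 are scaled up to a full RTT. */
    return (uint64_t)elapsed_usec * 4 / delta;
}
```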
    • dccp: Clamping RTT values · 49ffc29a
      Gerrit Renker committed
      This extracts the clamping part of dccp_sample_rtt() and makes it available
      to other parts of the code (as e.g. used in the next patch).
      
      Note: The function dccp_sample_rtt() now reduces to subtracting the elapsed
      time. This could be eliminated but would require shorter prefixes and thus
      is not done by this patch - maybe an idea for later.
      Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
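A stand-alone clamping helper of the kind this commit extracts might look like the sketch below. The bounds of 100 microseconds and 3 seconds are assumptions chosen for illustration, as are the macro and function names; the commit text does not state the actual limits.

```c
#include <stdint.h>

/* Assumed sanity bounds, in microseconds (illustrative values). */
#define DEMO_RTT_MIN_USEC 100
#define DEMO_RTT_MAX_USEC (3 * 1000000)

/* Clamp a raw RTT sample (possibly negative after subtraction) to sane bounds. */
uint32_t clamp_rtt_usec(int64_t sample)
{
    if (sample < DEMO_RTT_MIN_USEC)
        return DEMO_RTT_MIN_USEC;
    if (sample > DEMO_RTT_MAX_USEC)
        return DEMO_RTT_MAX_USEC;
    return (uint32_t)sample;
}
```

Taking a signed input lets callers pass the result of "measured time minus elapsed time" directly, even when that difference goes negative.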
    • dccp ccid-3: Always perform receiver RTT sampling · 2b81143a
      Gerrit Renker committed
      This updates the CCID-3 receiver in part with regard to errata 610 and 611
      (http://www.rfc-editor.org/errata_list.php), which change RFC 4342 to use the
      Receive Rate as specified in rfc3448bis, requiring to constantly sample the
      RTT (or use a sender RTT).
      
      Doing this requires reusing the RX history structure after dealing with a loss.
      
      The patch does not resolve how to compute X_recv if the interval is less
      than 1 RTT. A FIXME has been added (and is resolved in subsequent patch).
      
      Furthermore, since this is all TFRC-based functionality, the RTT estimation
      is now also performed by the dccp_tfrc_lib module. This further simplifies
      the CCID-3 code.
      Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
    • dccp ccid-3: Remove duplicate RX states · 2f3e3bba
      Gerrit Renker committed
      The only state information that the CCID-3 receiver keeps is whether initial 
      feedback has been sent or not. Further, this overlaps with use of feedback:
      
       * state == TFRC_RSTATE_NO_DATA as long as no feedback has been sent;
       * state == TFRC_RSTATE_DATA    as soon as the first feedback has been sent.
      
      This patch reduces the duplication, by memorising the type of the last feedback.
      Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
    • dccp tfrc: Let dccp_tfrc_lib do the sampling work · 34a081be
      Gerrit Renker committed
      This migrates more TFRC-related code into the dccp_tfrc_lib:
       * sampling of the packet size `s' (which is only needed until the first
         loss interval is computed (ccid3_first_li));
       * updating the byte-counter `bytes_recvd' in between sending feedbacks.
      The result is a better separation of CCID-3 specific and TFRC specific
      code, which aids future integration with ECN and e.g. CCID-4.
      
      Further changes:
      ----------------
       * replaced magic number of 536 with equivalent constant TCP_MIN_RCVMSS;
         (this constant is also used when no estimate for `s' is available).
      Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>