1. 22 Jan 2014, 1 commit
      reciprocal_divide: update/correction of the algorithm · 809fa972
      Hannes Frederic Sowa authored
      Jakub Zawadzki noticed that some divisions by reciprocal_divide()
      were not correct [1][2]. He could also show this with BPF code:
      divisors are transformed via reciprocal_value() for runtime
      invariance and can then be passed to reciprocal_divide(), and
      reversing that transformation in a BPF dump ended up with a
      different, off-by-one K in some situations.
      
      This has been fixed by Eric Dumazet in commit aee636c4
      ("bpf: do not use reciprocal divide"). This follow-up patch
      improves reciprocal_value() and reciprocal_divide() to work in
      all cases by using the Granlund and Montgomery method, so that
      future use is also safe and free of non-obvious side effects.
      Known problems with the old implementation were that division by 1
      always returned 0, and that there were off-by-one errors when the
      dividend and divisor were very large. As far as we can tell, this
      was not problematic for its current users. Eric Dumazet checked
      the slab usage; we cannot say so with certainty in the case of
      flex_array. Still, in order to fix this, we propose an extension
      of the original implementation from commit 6a2d7a95 (cf. [3][4]),
      using the algorithm proposed in "Division by Invariant Integers
      Using Multiplication" [5] by Torbjörn Granlund and Peter L.
      Montgomery. In pseudocode, for q = n/d where q, n, d are in the
      u32 universe:
      
      1) Initialization:
      
        int l = ceil(log_2 d)
        uword m' = floor((1<<32)*((1<<l)-d)/d)+1
        int sh_1 = min(l,1)
        int sh_2 = max(l-1,0)
      
      2) For q = n/d, all uword:
      
        uword t = (n*m')>>32
        q = (t+((n-t)>>sh_1))>>sh_2
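      
      As a self-contained illustration of these two steps, here is a
      minimal sketch in plain C (struct and helper names are
      illustrative; the actual patch packs m', sh_1 and sh_2 into a
      struct that is passed to reciprocal_divide()):
      
        #include <stdint.h>
        #include <stdio.h>
        
        /* Precomputed constants for dividing by an invariant divisor d. */
        struct recip {
            uint32_t m;  /* magic multiplier m' */
            int sh1;     /* min(l, 1) */
            int sh2;     /* max(l - 1, 0) */
        };
        
        /* Step 1: initialization, valid for any d in [1, 2^32 - 1]. */
        static struct recip recip_init(uint32_t d)
        {
            struct recip r;
            int l = 0;
        
            while ((1ULL << l) < d)         /* l = ceil(log_2 d) */
                l++;
            r.m = (uint32_t)((((1ULL << 32) * ((1ULL << l) - d)) / d) + 1);
            r.sh1 = l < 1 ? l : 1;          /* min(l, 1) */
            r.sh2 = l > 1 ? l - 1 : 0;      /* max(l - 1, 0) */
            return r;
        }
        
        /* Step 2: q = n/d with one multiply, two adds/subs, two shifts. */
        static uint32_t recip_div(uint32_t n, struct recip r)
        {
            uint32_t t = (uint32_t)(((uint64_t)n * r.m) >> 32);
        
            return (t + ((n - t) >> r.sh1)) >> r.sh2;
        }
        
        int main(void)
        {
            /* 100/7 = 14; note d = 1 also works (the old code returned 0). */
            printf("%u %u\n", recip_div(100, recip_init(7)),
                   recip_div(5, recip_init(1)));
            return 0;
        }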
      
      The assembler implementation from Agner Fog [6] also helped a lot
      while implementing. We have tested the implementation on x86_64,
      ppc64, i686 and s390x; on x86_64/Haswell we still see half the
      latency of a normal divide.
      
      Joint work with Daniel Borkmann.
      
        [1] http://www.wireshark.org/~darkjames/reciprocal-buggy.c
        [2] http://www.wireshark.org/~darkjames/set-and-dump-filter-k-bug.c
        [3] https://gmplib.org/~tege/division-paper.pdf
        [4] http://homepage.cs.uiowa.edu/~jones/bcd/divide.html
        [5] http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.1.2556
        [6] http://www.agner.org/optimize/asmlib.zip
      Reported-by: Jakub Zawadzki <darkjames-ws@darkjames.pl>
      Cc: Eric Dumazet <eric.dumazet@gmail.com>
      Cc: Austin S Hemmelgarn <ahferroin7@gmail.com>
      Cc: linux-kernel@vger.kernel.org
      Cc: Jesse Gross <jesse@nicira.com>
      Cc: Jamal Hadi Salim <jhs@mojatatu.com>
      Cc: Stephen Hemminger <stephen@networkplumber.org>
      Cc: Matt Mackall <mpm@selenic.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: Christoph Lameter <cl@linux-foundation.org>
      Cc: Andy Gospodarek <andy@greyhouse.net>
      Cc: Veaceslav Falico <vfalico@redhat.com>
      Cc: Jay Vosburgh <fubar@us.ibm.com>
      Cc: Jakub Zawadzki <darkjames-ws@darkjames.pl>
      Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
      Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  2. 15 Jan 2014, 1 commit
  3. 17 Apr 2012, 1 commit
  4. 05 Mar 2012, 1 commit
      BUG: headers with BUG/BUG_ON etc. need linux/bug.h · 187f1882
      Paul Gortmaker authored
      If a header file is making use of BUG, BUG_ON, BUILD_BUG_ON, or any
      other BUG variant in a static inline (i.e. not in a #define) then
      that header really should be including <linux/bug.h> and not just
      expecting it to be implicitly present.
      
      We can make this change risk-free, since if the files using these
      headers didn't have exposure to linux/bug.h already, they would have
      been causing compile failures/warnings.
      Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
  5. 13 Jan 2012, 1 commit
      net_sched: sfq: add optional RED on top of SFQ · ddecf0f4
      Eric Dumazet authored
      Adds optional Random Early Detection (RED) on each SFQ flow queue.
      
      Traditional SFQ limits the count of packets, while RED also
      permits controlling the number of bytes per flow, and adds ECN
      capability as well.
      
      1) We don't handle idle time management in this RED
      implementation, since each 'new flow' begins with a null qavg. We
      really want to address backlogged flows.
      
      2) If headdrop is selected, we try to ECN-mark the first packet
      instead of the currently enqueued packet. This gives faster
      feedback for TCP flows compared to traditional RED [marking the
      last packet in the queue].
      
      Example of use:
      
      tc qdisc add dev $DEV parent 1:1 handle 10: est 1sec 4sec sfq \
      	limit 3000 headdrop flows 512 divisor 16384 \
      	redflowlimit 100000 min 8000 max 60000 probability 0.20 ecn
      
      qdisc sfq 10: parent 1:1 limit 3000p quantum 1514b depth 127 headdrop
      flows 512/16384 divisor 16384
       ewma 6 min 8000b max 60000b probability 0.2 ecn
       prob_mark 0 prob_mark_head 4876 prob_drop 6131
       forced_mark 0 forced_mark_head 0 forced_drop 0
       Sent 1175211782 bytes 777537 pkt (dropped 6131, overlimits 11007
      requeues 0)
       rate 99483Kbit 8219pps backlog 689392b 456p requeues 0
      
      In this test, with 64 netperf TCP_STREAM sessions, 50% of them on
      ECN-enabled flows, we can see that the number of CE-marked packets
      is smaller than the number of drops (for non-ECN flows).
      
      If the same test is run without RED, we can see the backlog is
      much bigger.
      
      qdisc sfq 10: parent 1:1 limit 3000p quantum 1514b depth 127 headdrop
      flows 512/16384 divisor 16384
       Sent 1148683617 bytes 795006 pkt (dropped 0, overlimits 0 requeues 0)
       rate 98429Kbit 8521pps backlog 1221290b 841p requeues 0
      Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
      CC: Stephen Hemminger <shemminger@vyatta.com>
      CC: Dave Taht <dave.taht@gmail.com>
      Tested-by: Dave Taht <dave.taht@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  6. 06 Jan 2012, 1 commit
  7. 10 Dec 2011, 1 commit
      sch_red: generalize accurate MAX_P support to RED/GRED/CHOKE · a73ed26b
      Eric Dumazet authored
      RED now uses a Q0.32 fixed-point number to store max_p (maximum
      marking probability); allow RED/GRED/CHOKE to use/report full
      resolution at config/dump time.
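      
      For reference, Q0.32 simply stores a probability p in [0, 1) as
      p * 2^32 rounded to a u32; a minimal conversion sketch (helper
      names are illustrative, not the kernel API):
      
        #include <stdint.h>
        #include <stdio.h>
        
        /* Q0.32: all 32 bits are fraction, so p in [0, 1) maps to p * 2^32. */
        static uint32_t prob_to_q032(double p)
        {
            return (uint32_t)(p * 4294967296.0 + 0.5);  /* p * 2^32, rounded */
        }
        
        static double q032_to_prob(uint32_t q)
        {
            return (double)q / 4294967296.0;
        }
        
        int main(void)
        {
            uint32_t max_p = prob_to_q032(0.09);  /* e.g. probability 0.09 as below */
            printf("max_p = %u (~%.6f)\n", max_p, q032_to_prob(max_p));
            return 0;
        }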
      
      Old tc binaries are not aware of the new attributes and still
      set/get Plog.
      
      New tc binaries set/get both Plog and max_p for backward
      compatibility; they display the probability value if they get
      max_p from new kernels.
      
      # tc -d  qdisc show dev ...
      ...
      qdisc red 10: parent 1:1 limit 360Kb min 30Kb max 90Kb ecn ewma 5
      probability 0.09 Scell_log 15
      
      Make sure we avoid potential divides by 0 in reciprocal_value()
      if (max_th - min_th) is big.
      Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  8. 09 Dec 2011, 1 commit
      sch_red: Adaptative RED AQM · 8af2a218
      Eric Dumazet authored
      Adaptive RED AQM for Linux, based on the paper by Sally Floyd,
      Ramakrishna Gummadi, and Scott Shenker, August 2001:
      
      http://icir.org/floyd/papers/adaptiveRed.pdf
      
      The goal of Adaptive RED is to make max_p a dynamic value between
      1% and 50%, in order to reach the target average queue length of
      (min_th + max_th) / 2.
      
      Every 500 ms:
       if (avg > target and max_p <= 0.5)
        increase max_p : max_p += alpha;
       else if (avg < target and max_p >= 0.01)
        decrease max_p : max_p *= beta;
      
      target: [min_th + 0.4*(max_th - min_th),
               min_th + 0.6*(max_th - min_th)]
      alpha: min(0.01, max_p / 4)
      beta: 0.9
      max_P is a Q0.32 fixed-point number (unsigned, 32-bit mantissa)
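      
      A floating-point sketch of that control loop (the kernel keeps
      max_p in Q0.32 fixed point; names here are illustrative):
      
        #include <stdio.h>
        
        /* One Adaptive RED adjustment step, run every 500 ms. */
        static void adapt_max_p(double qavg, double target, double *max_p)
        {
            /* alpha = min(0.01, max_p / 4), beta = 0.9 */
            double alpha = *max_p / 4.0 < 0.01 ? *max_p / 4.0 : 0.01;
            const double beta = 0.9;
        
            if (qavg > target && *max_p <= 0.5)
                *max_p += alpha;        /* queue too long: mark/drop more */
            else if (qavg < target && *max_p >= 0.01)
                *max_p *= beta;         /* queue below target: back off */
        }
        
        int main(void)
        {
            double max_p = 0.02;
        
            adapt_max_p(70000.0, 60000.0, &max_p);  /* avg above target */
            printf("max_p = %.4f\n", max_p);        /* 0.02 + 0.005 = 0.0250 */
            return 0;
        }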
      
      Changes against our RED implementation are:
      
      max_p is no longer a negative power of two (1/(2^Plog)), but a
      Q0.32 fixed-point number, to allow the full range described in the
      Adaptive RED paper.
      
      To deliver a random number, we now use a reciprocal divide (which
      is really a multiply), but this operation is done once per
      marked/dropped packet when in the RED_BETWEEN_TRESH window, so the
      added cost (compared to the previous AND operation) is near zero.
      
      The dump operation reports the current max_p value in a new
      TCA_RED_MAX_P attribute.
      
      Example on a 10Mbit link :
      
      tc qdisc add dev $DEV parent 1:1 handle 10: est 1sec 8sec red \
         limit 400000 min 30000 max 90000 avpkt 1000 \
         burst 55 ecn adaptative bandwidth 10Mbit
      
      # tc -s -d qdisc show dev eth3
      ...
      qdisc red 10: parent 1:1 limit 400000b min 30000b max 90000b ecn
      adaptative ewma 5 max_p=0.113335 Scell_log 15
       Sent 50414282 bytes 34504 pkt (dropped 35, overlimits 1392 requeues 0)
       rate 9749Kbit 831pps backlog 72056b 16p requeues 0
        marked 1357 early 35 pdrop 0 other 0
      Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  9. 01 Dec 2011, 1 commit
  10. 13 Jan 2011, 1 commit
  11. 04 Nov 2009, 1 commit
  12. 26 Apr 2007, 3 commits
  13. 05 Aug 2006, 1 commit
      [PKT_SCHED] RED: Fix overflow in calculation of queue average · c4c0ce5c
      Ilpo Järvinen authored
      Overflow can occur very easily with 32 bits, e.g., with 1 second
      of idle time us_idle is approx. 2^20 (10^6 microseconds), which
      leaves only 11-Wlog bits for the queue length. Since the EWMA
      exponent Wlog is typically around 9, queue lengths larger than 2^2
      cause overflow. Whether the affected branch is taken when us_idle
      is as high as 1 second depends on Scell_log, but with a rather
      reasonable configuration Scell_log is large enough to give p->Stab
      a zero index, which always results in a zero shift (typically a
      few other small indices also result in a zero shift).
      Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  14. 26 Apr 2006, 1 commit
  15. 06 Nov 2005, 1 commit
      [PKT_SCHED]: Generic RED layer · a7834745
      Thomas Graf authored
      Extracts the RED algorithm from sch_red.c and puts it into
      include/net/red.h for use by other RED-based modules. The
      statistics are extended to be more fine-grained in order to
      differentiate between probability and forced marks/drops. We now
      reset the average queue length when setting new parameters;
      leaving it might result in an unreasonable qavg for a while,
      depending on the value of W (see the EWMA sketch below).
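      
      For context, the core of this generic layer is the classic RED
      EWMA update, avg = (1 - W)*avg + W*qlen with W = 2^-Wlog. A
      minimal fixed-point sketch, assuming qavg is kept scaled by
      2^Wlog so the update needs no divides (names are illustrative):
      
        #include <stdio.h>
        
        /* RED EWMA: avg = (1 - W)*avg + W*qlen, with W = 2^-Wlog.
         * qavg is stored scaled by 2^Wlog, so this is shift-and-add only. */
        static unsigned long red_ewma(unsigned long qavg, unsigned int qlen,
                                      int Wlog)
        {
            return qavg + qlen - (qavg >> Wlog);
        }
        
        int main(void)
        {
            unsigned long qavg = 0;
            int i;
        
            /* With Wlog = 9 the average only creeps toward qlen * 2^9; a
             * stale qavg would likewise take long to decay, hence the reset. */
            for (i = 0; i < 5000; i++)
                qavg = red_ewma(qavg, 100, 9);
            printf("avg ~ %lu\n", qavg >> 9);   /* approaches 100 */
            return 0;
        }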
      Signed-off-by: Thomas Graf <tgraf@suug.ch>
      Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com>