1. 11 May, 2015 1 commit
    • E
      codel: add ce_threshold attribute · 80ba92fa
      Eric Dumazet committed
      For DCTCP or similar ECN based deployments on fabrics with shallow
      buffers, hosts are responsible for a good part of the buffering.
      
      This patch adds an optional ce_threshold to codel & fq_codel qdiscs,
      so that DCTCP can have feedback from queuing in the host.
      
      A DCTCP-enabled egress port simply has a queue occupancy threshold
      above which ECT packets get a CE mark.
      
      In codel language this translates to a sojourn time, so that one doesn't
      have to worry about bytes or bandwidth but delays.
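
The sojourn-time rule described above can be sketched as follows. This is only an illustration of the idea, not the kernel code: the function name, its arguments and the microsecond units are assumptions made for the example.

```python
# Illustrative sketch, not the kernel implementation: decide whether a
# dequeued packet gets a CE mark based on how long it sat in the queue.

def should_ce_mark(enqueue_us, dequeue_us, ce_threshold_us, ect):
    """Return True when an ECN-capable (ECT) packet stayed longer than the threshold."""
    sojourn_us = dequeue_us - enqueue_us
    return ect and sojourn_us > ce_threshold_us

# With a 1 ms threshold, a packet queued for 1.2 ms is marked,
# while one queued for 0.7 ms is left alone.
marked = should_ce_mark(0, 1200, 1000, ect=True)
unmarked = should_ce_mark(0, 700, 1000, ect=True)
```

Expressing the threshold as a delay means the same setting applies regardless of link speed or packet size.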
      
      This makes the host an active participant in the health of the whole
      network.
      
      This also helps when experimenting with DCTCP in a setup without a
      DCTCP-compliant fabric.
      
      In the following example, ce_threshold is set to 1ms, and we can see from
      'ldelay xxx us' that TCP is not trying to go around the 5ms codel
      target.
      
      The queue has more capacity to absorb inelastic bursts (say from UDP
      traffic), as queues are kept at an optimal level.
      
      lpaa23:~# ./tc -s -d qd sh dev eth1
      qdisc mq 1: dev eth1 root
       Sent 87910654696 bytes 58065331 pkt (dropped 0, overlimits 0 requeues 42961)
       backlog 3108242b 364p requeues 42961
      qdisc codel 8063: dev eth1 parent 1:1 limit 1000p target 5.0ms ce_threshold 1.0ms interval 100.0ms
       Sent 7363778701 bytes 4863809 pkt (dropped 0, overlimits 0 requeues 5503)
       rate 2348Mbit 193919pps backlog 255866b 46p requeues 5503
        count 0 lastcount 0 ldelay 1.0ms drop_next 0us
        maxpacket 68130 ecn_mark 0 drop_overlimit 0 ce_mark 72384
      qdisc codel 8064: dev eth1 parent 1:2 limit 1000p target 5.0ms ce_threshold 1.0ms interval 100.0ms
       Sent 7636486190 bytes 5043942 pkt (dropped 0, overlimits 0 requeues 5186)
       rate 2319Mbit 191538pps backlog 207418b 64p requeues 5186
        count 0 lastcount 0 ldelay 694us drop_next 0us
        maxpacket 68130 ecn_mark 0 drop_overlimit 0 ce_mark 69873
      qdisc codel 8065: dev eth1 parent 1:3 limit 1000p target 5.0ms ce_threshold 1.0ms interval 100.0ms
       Sent 11569360142 bytes 7641602 pkt (dropped 0, overlimits 0 requeues 5554)
       rate 3041Mbit 251096pps backlog 210446b 59p requeues 5554
        count 0 lastcount 0 ldelay 889us drop_next 0us
        maxpacket 68130 ecn_mark 0 drop_overlimit 0 ce_mark 37780
      ...
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Cc: Florian Westphal <fw@strlen.de>
      Cc: Daniel Borkmann <daniel@iogearbox.net>
      Cc: Glenn Judd <glenn.judd@morganstanley.com>
      Cc: Nandita Dukkipati <nanditad@google.com>
      Cc: Neal Cardwell <ncardwell@google.com>
      Cc: Yuchung Cheng <ycheng@google.com>
      Acked-by: Neal Cardwell <ncardwell@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  2. 05 Feb, 2015 1 commit
    • E
      pkt_sched: fq: better control of DDOS traffic · 06eb395f
      Eric Dumazet committed
      FQ has a fast path for skbs attached to a socket, as it does not
      have to compute a flow hash. But for other packets, FQ being non
      stochastic means that hosts exposed to random Internet traffic
      can allocate millions of flow structures (104 bytes each) pretty
      easily. Not only can the host OOM, but lookups in RB trees can take
      too much CPU and memory.
      
      This patch adds a new attribute, orphan_mask, which makes it
      possible to use a stochastic hash for orphaned skbs.
      
      Its default value is 1024 slots, to mimic SFQ behavior.
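
To illustrate why a mask bounds memory use, the sketch below maps arbitrary flow hashes onto a fixed number of slots. The names are hypothetical and the hash values are just integers. It assumes the slot count is a power of two so that a bitwise AND can be used.

```python
# Sketch only: map packets without a socket ("orphaned skb") onto a
# bounded set of flow slots, so random traffic cannot create
# an unbounded number of flow structures.
ORPHAN_SLOTS = 1024               # default from the commit message
ORPHAN_MASK = ORPHAN_SLOTS - 1    # assumes a power-of-two slot count

def orphan_slot(flow_hash):
    """Reduce a flow hash to one of ORPHAN_SLOTS buckets."""
    return flow_hash & ORPHAN_MASK

# A million distinct hashes still land in at most 1024 slots.
slots = {orphan_slot(h) for h in range(1_000_000)}
```

The trade-off is the usual one for stochastic schemes: several orphaned flows may share a slot.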
      
      Note: This does not apply to locally generated TCP traffic,
      and no locally generated traffic will share a flow structure
      with another perfect or stochastic flow.
      
      This patch also handles the specific case of SYNACK messages:
      
      They are attached to the listener socket, and therefore all map
      to a single hash bucket. If the listener has set SO_MAX_PACING_RATE,
      hoping to have newly accepted sockets inherit this rate, SYNACKs
      might be paced and even dropped.
      
      This is very similar to an internal patch Google has used for more
      than one year.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  3. 07 Jan, 2014 1 commit
    • V
      net: pkt_sched: PIE AQM scheme · d4b36210
      Vijay Subramanian committed
      Proportional Integral controller Enhanced (PIE) is a scheduler to address the
      bufferbloat problem.
      
      From the IETF draft below:
      " Bufferbloat is a phenomenon where excess buffers in the network cause high
      latency and jitter. As more and more interactive applications (e.g. voice over
      IP, real time video streaming and financial transactions) run in the Internet,
      high latency and jitter degrade application performance. There is a pressing
      need to design intelligent queue management schemes that can control latency and
      jitter; and hence provide desirable quality of service to users.
      
      We present here a lightweight design, PIE(Proportional Integral controller
      Enhanced) that can effectively control the average queueing latency to a target
      value. Simulation results, theoretical analysis and Linux testbed results have
      shown that PIE can ensure low latency and achieve high link utilization under
      various congestion situations. The design does not require per-packet
      timestamp, so it incurs very small overhead and is simple enough to implement
      in both hardware and software.  "
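
As a rough illustration of what a proportional-integral drop controller does, the sketch below updates a drop probability from the current and previous queueing-delay estimates. The update rule follows the general form given in the PIE draft. The function name and the `alpha`/`beta` values used in the example are placeholders chosen for illustration, not the tuned constants of this patch.

```python
# Sketch of a PI-style update of the drop probability.
# alpha weighs the distance from the target delay, beta weighs the
# trend (is the delay growing or shrinking since the last update).

def pie_update(p, delay_s, old_delay_s, target_s, alpha, beta):
    """Return the new drop probability, clamped to [0, 1]."""
    p += alpha * (delay_s - target_s) + beta * (delay_s - old_delay_s)
    return min(1.0, max(0.0, p))

# Delay is above target and rising, so the probability goes up.
p_next = pie_update(0.1, delay_s=0.030, old_delay_s=0.020,
                    target_s=0.020, alpha=0.125, beta=1.25)
```

Because the update needs only a delay estimate, not a timestamp on every packet, the per-packet cost stays small.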
      
      Many thanks to Dave Taht for extensive feedback, reviews, testing and
      suggestions. Thanks also to Stephen Hemminger and Eric Dumazet for reviews and
      suggestions.  Naeem Khademi and Dave Taht independently contributed to ECN
      support.
      
      For more information, please see the technical paper about PIE in the IEEE
      Conference on High Performance Switching and Routing 2013. A copy of the paper
      can be found at ftp://ftpeng.cisco.com/pie/.
      
      Please also refer to the IETF draft submission at
      http://tools.ietf.org/html/draft-pan-tsvwg-pie-00
      
      All relevant code, documents and test scripts and results can be found at
      ftp://ftpeng.cisco.com/pie/.
      
      For problems with the iproute2/tc or Linux kernel code, please contact Vijay
      Subramanian (vijaynsu@cisco.com or subramanian.vijay@gmail.com) or Mythili Prabhu
      (mysuryan@cisco.com).
      Signed-off-by: Vijay Subramanian <subramanian.vijay@gmail.com>
      Signed-off-by: Mythili Prabhu <mysuryan@cisco.com>
      CC: Dave Taht <dave.taht@bufferbloat.net>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  4. 01 Jan, 2014 1 commit
  5. 27 Dec, 2013 1 commit
  6. 20 Dec, 2013 1 commit
    • T
      net-qdisc-hhf: Heavy-Hitter Filter (HHF) qdisc · 10239edf
      Terry Lam committed
      This patch implements the first size-based qdisc that attempts to
      differentiate between small flows and heavy-hitters.  The goal is to
      catch the heavy-hitters and move them to a separate queue with less
      priority so that bulk traffic does not affect the latency of critical
      traffic.  Currently "less priority" means less weight (2:1 in
      particular) in a Weighted Deficit Round Robin (WDRR) scheduler.
      
      In essence, this patch addresses the "delay-bloat" problem due to
      bloated buffers. In some systems, large queues may be necessary for
      obtaining CPU efficiency, or due to the presence of unresponsive
      traffic like UDP, or just a large number of connections with each
      having a small amount of outstanding traffic. In these circumstances,
      HHF aims to reduce the HoL blocking for latency sensitive traffic,
      while not impacting the queues built up by bulk traffic.  HHF can also
      be used in conjunction with other AQM mechanisms such as CoDel.
      
      To capture heavy-hitters, we implement the "multi-stage filter" design
      in the following paper:
      C. Estan and G. Varghese, "New Directions in Traffic Measurement and
      Accounting", in ACM SIGCOMM, 2002.
      
      Some configurable qdisc settings through 'tc':
      - hhf_reset_timeout: period to reset counter values in the multi-stage
                           filter (default 40ms)
      - hhf_admit_bytes:   threshold to classify heavy-hitters
                           (default 128KB)
      - hhf_evict_timeout: threshold to evict idle heavy-hitters
                           (default 1s)
      - hhf_non_hh_weight: Weighted Deficit Round Robin (WDRR) weight for
                           non-heavy-hitters (default 2)
      - hh_flows_limit:    max number of heavy-hitter flow entries
                           (default 2048)
      
      Note that the ratio between hhf_admit_bytes and hhf_reset_timeout
      reflects the bandwidth of heavy-hitters that we attempt to capture
      (25Mbps with the above default settings).
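
The 25Mbps figure can be checked with a short calculation. The sketch assumes that "128KB" means 128 * 1024 bytes and that the quoted figure uses binary megabits; both are interpretations, not statements from the patch.

```python
# Worked arithmetic: bandwidth implied by the default HHF settings.
admit_bytes = 128 * 1024     # hhf_admit_bytes default, assuming KB = 1024 bytes
reset_timeout_ms = 40        # hhf_reset_timeout default

# Bytes admitted per reset period, converted to bits per second.
threshold_bps = admit_bytes * 8 * 1000 / reset_timeout_ms

threshold_mbps_decimal = threshold_bps / 1e6      # about 26.2 with 1 Mbit = 10^6 bits
threshold_mbps_binary = threshold_bps / 2 ** 20   # 25 with 1 Mbit = 2^20 bits
```

A flow has to sustain roughly this rate within one reset period to be classified as a heavy-hitter.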
      
      The false negative rate (heavy-hitter flows getting away unclassified)
      is zero by the design of the multi-stage filter algorithm.
      With 100 heavy-hitter flows, using four hashes and 4000 counters yields
      a false positive rate (non-heavy-hitters mistakenly classified as
      heavy-hitters) of less than 1e-4.
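
A toy version of a multi-stage filter shows why false negatives cannot occur: every counter a flow maps to is at least as large as that flow's own byte count. The class below is a simplified sketch, not the qdisc code. The class name, the use of SHA-256 as a hash, and the split of 4000 counters into four stages of 1000 are assumptions made for the example.

```python
import hashlib

class MultiStageFilter:
    """Toy multi-stage filter: flag a flow only if all its counters pass the threshold."""

    def __init__(self, stages=4, counters_per_stage=1000, admit_bytes=128 * 1024):
        self.admit_bytes = admit_bytes
        self.tables = [[0] * counters_per_stage for _ in range(stages)]

    def _index(self, stage, flow_id):
        # One hash per stage, derived by salting the flow id with the stage number.
        digest = hashlib.sha256(f"{stage}:{flow_id}".encode()).digest()
        return int.from_bytes(digest[:8], "big") % len(self.tables[stage])

    def add(self, flow_id, nbytes):
        """Account nbytes to the flow; return True if it now looks like a heavy-hitter."""
        smallest = None
        for stage, table in enumerate(self.tables):
            i = self._index(stage, flow_id)
            table[i] += nbytes
            smallest = table[i] if smallest is None else min(smallest, table[i])
        return smallest > self.admit_bytes

    def reset(self):
        """Clear all counters, as would happen every reset period."""
        for table in self.tables:
            for i in range(len(table)):
                table[i] = 0

f = MultiStageFilter()
small = f.add("flow-a", 1500)                               # one MTU-sized packet
big = any(f.add("flow-b", 100 * 1024) for _ in range(2))    # 200 KiB in total
```

False positives remain possible, since a small flow may share all of its counters with heavy flows; more stages and more counters make that less likely.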
      Signed-off-by: Terry Lam <vtlam@google.com>
      Acked-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  7. 16 Nov, 2013 2 commits
    • E
      pkt_sched: fq: fix pacing for small frames · f52ed899
      Eric Dumazet committed
      For performance reasons, sch_fq tried hard not to set up timers for every
      sent packet, using a quantum-based heuristic: a delay is set up only if
      the flow exhausted its credit.
      
      The problem is that application-limited flows can refill their credit
      for every queued packet, and thereby evade pacing.
      
      This problem can also be triggered when TCP flows use small MSS values,
      as TSO auto sizing builds packets that are smaller than the default fq
      quantum (3028 bytes).
      
      This patch adds a 40 ms delay to guard flow credit refill.
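
One way to picture the guard is as a condition on how long a flow has been idle before its credit may be topped up again. The sketch below is a simplified model written for illustration; the function name and the exact refill rule are assumptions, not a transcription of the patch.

```python
# Simplified model of a guarded credit refill.
REFILL_DELAY_MS = 40    # guard delay mentioned in the commit message
QUANTUM = 3028          # default fq quantum in bytes

def credit_on_reactivation(credit, idle_ms,
                           quantum=QUANTUM, refill_delay_ms=REFILL_DELAY_MS):
    """Credit of a flow that queues a packet after being idle for idle_ms."""
    if idle_ms > refill_delay_ms:
        # Idle long enough: the flow may start with at least one quantum.
        return max(credit, quantum)
    # Came back too soon: keep the old credit so pacing still applies.
    return credit

quick_return = credit_on_reactivation(-500, idle_ms=1)
long_idle = credit_on_reactivation(-500, idle_ms=100)
```

Without the guard, a flow that sends one small packet at a time would look new on every packet and never be throttled.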
      
      Fixes: afe4fd06 ("pkt_sched: fq: Fair Queue packet scheduler")
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Cc: Maciej Żenczykowski <maze@google.com>
      Cc: Willem de Bruijn <willemb@google.com>
      Cc: Yuchung Cheng <ycheng@google.com>
      Cc: Neal Cardwell <ncardwell@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • E
      pkt_sched: fq: warn users using defrate · 65c5189a
      Eric Dumazet committed
      Commit 7eec4174 ("pkt_sched: fq: fix non TCP flows pacing")
      obsoleted TCA_FQ_FLOW_DEFAULT_RATE without notice for the users.
      
      Suggested by David Miller
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  8. 10 Nov, 2013 1 commit
  9. 21 Sep, 2013 1 commit
  10. 30 Aug, 2013 1 commit
    • E
      pkt_sched: fq: Fair Queue packet scheduler · afe4fd06
      Eric Dumazet committed
      - Uses perfect flow match (not stochastic hash like SFQ/FQ_codel)
      - Uses the new_flow/old_flow separation from FQ_codel
      - New flows get an initial credit allowing IW10 without added delay.
      - Special FIFO queue for high prio packets (no need for PRIO + FQ)
      - Uses a hash table of RB trees to locate the flows at enqueue() time
      - Smart on demand gc (at enqueue() time, RB tree lookup evicts old
        unused flows)
      - Dynamic memory allocations.
      - Designed to allow millions of concurrent flows per Qdisc.
      - Small memory footprint : ~8K per Qdisc, and 104 bytes per flow.
      - Single high resolution timer for throttled flows (if any).
      - One RB tree to link throttled flows.
      - Ability to have a max rate per flow. We might add a socket option
        to add per socket limitation.
      
      Attempts have been made to add TCP pacing in TCP stack, but this
      seems to add complex code to an already complex stack.
      
      TCP pacing is welcomed for flows having idle times, as the cwnd
      permits TCP stack to queue a possibly large number of packets.
      
      This removes the need for the 'slow start after idle' choice, which badly
      hurts large BDP flows and applications delivering chunks of data,
      such as video streams.
      
      Nicely spaced packets :
      Here interface is 10Gbit, but flow bottleneck is ~20Mbit
      
      cwin is big, yet FQ avoids the typical bursts generated by TCP
      (as in netperf TCP_RR -- -r 100000,100000)
      
      15:01:23.545279 IP A > B: . 78193:81089(2896) ack 65248 win 3125 <nop,nop,timestamp 1115 11597805>
      15:01:23.545394 IP B > A: . ack 81089 win 3668 <nop,nop,timestamp 11597985 1115>
      15:01:23.546488 IP A > B: . 81089:83985(2896) ack 65248 win 3125 <nop,nop,timestamp 1115 11597805>
      15:01:23.546565 IP B > A: . ack 83985 win 3668 <nop,nop,timestamp 11597986 1115>
      15:01:23.547713 IP A > B: . 83985:86881(2896) ack 65248 win 3125 <nop,nop,timestamp 1115 11597805>
      15:01:23.547778 IP B > A: . ack 86881 win 3668 <nop,nop,timestamp 11597987 1115>
      15:01:23.548911 IP A > B: . 86881:89777(2896) ack 65248 win 3125 <nop,nop,timestamp 1115 11597805>
      15:01:23.548949 IP B > A: . ack 89777 win 3668 <nop,nop,timestamp 11597988 1115>
      15:01:23.550116 IP A > B: . 89777:92673(2896) ack 65248 win 3125 <nop,nop,timestamp 1115 11597805>
      15:01:23.550182 IP B > A: . ack 92673 win 3668 <nop,nop,timestamp 11597989 1115>
      15:01:23.551333 IP A > B: . 92673:95569(2896) ack 65248 win 3125 <nop,nop,timestamp 1115 11597805>
      15:01:23.551406 IP B > A: . ack 95569 win 3668 <nop,nop,timestamp 11597991 1115>
      15:01:23.552539 IP A > B: . 95569:98465(2896) ack 65248 win 3125 <nop,nop,timestamp 1115 11597805>
      15:01:23.552576 IP B > A: . ack 98465 win 3668 <nop,nop,timestamp 11597992 1115>
      15:01:23.553756 IP A > B: . 98465:99913(1448) ack 65248 win 3125 <nop,nop,timestamp 1115 11597805>
      15:01:23.554138 IP A > B: P 99913:100001(88) ack 65248 win 3125 <nop,nop,timestamp 1115 11597805>
      15:01:23.554204 IP B > A: . ack 100001 win 3668 <nop,nop,timestamp 11597993 1115>
      15:01:23.554234 IP B > A: . 65248:68144(2896) ack 100001 win 3668 <nop,nop,timestamp 11597993 1115>
      15:01:23.555620 IP B > A: . 68144:71040(2896) ack 100001 win 3668 <nop,nop,timestamp 11597993 1115>
      15:01:23.557005 IP B > A: . 71040:73936(2896) ack 100001 win 3668 <nop,nop,timestamp 11597993 1115>
      15:01:23.558390 IP B > A: . 73936:76832(2896) ack 100001 win 3668 <nop,nop,timestamp 11597993 1115>
      15:01:23.559773 IP B > A: . 76832:79728(2896) ack 100001 win 3668 <nop,nop,timestamp 11597993 1115>
      15:01:23.561158 IP B > A: . 79728:82624(2896) ack 100001 win 3668 <nop,nop,timestamp 11597994 1115>
      15:01:23.562543 IP B > A: . 82624:85520(2896) ack 100001 win 3668 <nop,nop,timestamp 11597994 1115>
      15:01:23.563928 IP B > A: . 85520:88416(2896) ack 100001 win 3668 <nop,nop,timestamp 11597994 1115>
      15:01:23.565313 IP B > A: . 88416:91312(2896) ack 100001 win 3668 <nop,nop,timestamp 11597994 1115>
      15:01:23.566698 IP B > A: . 91312:94208(2896) ack 100001 win 3668 <nop,nop,timestamp 11597994 1115>
      15:01:23.568083 IP B > A: . 94208:97104(2896) ack 100001 win 3668 <nop,nop,timestamp 11597994 1115>
      15:01:23.569467 IP B > A: . 97104:100000(2896) ack 100001 win 3668 <nop,nop,timestamp 11597994 1115>
      15:01:23.570852 IP B > A: . 100000:102896(2896) ack 100001 win 3668 <nop,nop,timestamp 11597994 1115>
      15:01:23.572237 IP B > A: . 102896:105792(2896) ack 100001 win 3668 <nop,nop,timestamp 11597994 1115>
      15:01:23.573639 IP B > A: . 105792:108688(2896) ack 100001 win 3668 <nop,nop,timestamp 11597994 1115>
      15:01:23.575024 IP B > A: . 108688:111584(2896) ack 100001 win 3668 <nop,nop,timestamp 11597994 1115>
      15:01:23.576408 IP B > A: . 111584:114480(2896) ack 100001 win 3668 <nop,nop,timestamp 11597994 1115>
      15:01:23.577793 IP B > A: . 114480:117376(2896) ack 100001 win 3668 <nop,nop,timestamp 11597994 1115>
      
      TCP timestamps show that most packets from B were queued in the same ms
      timeframe (TSval 1159799{3,4}), but FQ managed to send them right
      in time to avoid a big burst.
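
The spacing in the trace can be quantified from its timestamps. The sketch below takes the first and last data segments sent by B (15:01:23.554234 and 15:01:23.577793, 18 segments of 2896 bytes) and computes the mean gap and the payload rate. It counts TCP payload only, so header overhead is ignored.

```python
# Worked arithmetic on the tcpdump excerpt: pacing of the B > A data segments.
first_us = 554_234            # microseconds part of 15:01:23.554234
last_us = 577_793             # microseconds part of 15:01:23.577793
segments_after_first = 17     # 18 segments, so 17 gaps between them
payload_bytes = 2896          # payload of each segment in the trace

elapsed_s = (last_us - first_us) / 1e6
mean_gap_us = (last_us - first_us) / segments_after_first
rate_mbit = segments_after_first * payload_bytes * 8 / elapsed_s / 1e6
```

This gives a mean gap of a little under 1.4 ms and about 16.7 Mbit/s of payload, the same order as the ~20Mbit bottleneck mentioned above, even though all segments carry nearly the same TSval.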
      
      In slow start or steady state, very few packets are throttled [1]
      
      FQ gets a bunch of tunables:
      
        limit : max number of packets on whole Qdisc (default 10000)
      
        flow_limit : max number of packets per flow (default 100)
      
        quantum : the credit per RR round (default is 2 MTU)
      
        initial_quantum : initial credit for new flows (default is 10 MTU)
      
        maxrate : max per flow rate (default : unlimited)
      
        buckets : number of RB trees (default : 1024) in hash table.
                     (consumes 8 bytes per bucket)
      
        [no]pacing : disable/enable pacing (default is enable)
      
      All of them can be changed on a live qdisc.
      
      $ tc qd add dev eth0 root fq help
      Usage: ... fq [ limit PACKETS ] [ flow_limit PACKETS ]
                    [ quantum BYTES ] [ initial_quantum BYTES ]
                    [ maxrate RATE  ] [ buckets NUMBER ]
                    [ [no]pacing ]
      
      $ tc -s -d qd
      qdisc fq 8002: dev eth0 root refcnt 32 limit 10000p flow_limit 100p buckets 256 quantum 3028 initial_quantum 15140
       Sent 216532416 bytes 148395 pkt (dropped 0, overlimits 0 requeues 14)
       backlog 0b 0p requeues 14
        511 flows, 511 inactive, 0 throttled
        110 gc, 0 highprio, 0 retrans, 1143 throttled, 0 flows_plimit
      
      [1] Except if the initial srtt is overestimated, as when using a
      cached srtt from tcp metrics. We'll provide a fix for this issue.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Cc: Yuchung Cheng <ycheng@google.com>
      Cc: Neal Cardwell <ncardwell@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  11. 15 Aug, 2013 1 commit
    • J
      net_sched: restore "linklayer atm" handling · 8a8e3d84
      Jesper Dangaard Brouer committed
      commit 56b765b7 ("htb: improved accuracy at high rates")
      broke the "linklayer atm" handling.
      
       tc class add ... htb rate X ceil Y linklayer atm
      
      The linklayer setting is implemented by modifying the rate table
      which is sent to the kernel.  No direct parameter was
      transferred to the kernel indicating the linklayer setting.
      
      The commit 56b765b7 ("htb: improved accuracy at high rates")
      removed the use of the rate table system.
      
      To keep compatible with older iproute2 utils, this patch detects
      the linklayer by parsing the rate table.  It also supports future
      versions of iproute2 to send this linklayer parameter to the
      kernel directly. This is done by using the __reserved field in
      struct tc_ratespec, to convey the chosen linklayer option, but
      only using the lower 4 bits of this field.
      
      Linklayer detection is limited to speeds below 100Mbit/s, because
      at high rates the rtab gets too inaccurate, so bad that
      several fields contain the same values, thus resembling the ATM
      detect.  Fields even start to contain "0" time to send, e.g. at
      1000Mbit/s sending a 96 bytes packet costs "0", thus the rtab has
      been more broken than we first realized.
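
Carrying the linklayer in the lower 4 bits of a reserved field amounts to a simple mask on the receiving side. The sketch below shows the idea. The numeric codes and names in the table are hypothetical values chosen for the example, not taken from the patch.

```python
# Sketch: extract a linklayer code from the low nibble of a reserved field.
LINKLAYER_MASK = 0x0F   # lower 4 bits, as described in the commit message

# Hypothetical code table, for illustration only.
LINKLAYER_NAMES = {0: "unaware", 1: "ethernet", 2: "atm"}

def decode_linklayer(reserved_field):
    """Return the linklayer name encoded in the low 4 bits of the field."""
    return LINKLAYER_NAMES.get(reserved_field & LINKLAYER_MASK, "unknown")

# Upper bits are ignored, so 0xF2 and 0x02 decode to the same linklayer.
from_new_tool = decode_linklayer(0x02)
with_upper_bits = decode_linklayer(0xF2)
from_old_tool = decode_linklayer(0x00)
```

A zero value from an older tool that never sets the field can then be told apart from an explicit choice, which is when detection from the rate table is needed.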
      Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  12. 07 Mar, 2013 1 commit
  13. 13 Oct, 2012 1 commit
  14. 13 May, 2012 1 commit
    • E
      fq_codel: Fair Queue Codel AQM · 4b549a2e
      Eric Dumazet committed
      Fair Queue Codel packet scheduler
      
      Principles :
      
      - Packets are classified (internal classifier or external) into flows.
      - This is a Stochastic model (as we use a hash, several flows might
                                    be hashed on same slot)
      - Each flow has a CoDel managed queue.
      - Flows are linked onto two (Round Robin) lists,
        so that new flows have priority on old ones.
      
      - For a given flow, packets are not reordered (CoDel uses a FIFO)
      - head drops only.
      - ECN capability is on by default.
      - Very low memory footprint (64 bytes per flow)
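
The priority of new flows over old ones can be pictured with two queues of flow identifiers. The sketch below is deliberately minimal: it ignores the per-flow deficit and quantum and the case of a flow running out of packets, and all names are made up for the example.

```python
from collections import deque

def next_flow(new_flows, old_flows):
    """Pick the flow to serve next: any new flow first, then the old ones in turn."""
    if new_flows:
        flow = new_flows.popleft()
    elif old_flows:
        flow = old_flows.popleft()
    else:
        return None
    # Once served, a flow goes to the tail of the old list.
    old_flows.append(flow)
    return flow

new = deque(["n1"])
old = deque(["o1", "o2"])
order = [next_flow(new, old) for _ in range(4)]
```

Here the new flow `n1` is served before the backlogged flows and then takes its turn behind them, which is what keeps short or sparse flows responsive.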
      
      tc qdisc ... fq_codel [ limit PACKETS ] [ flows number ]
                            [ target TIME ] [ interval TIME ] [ noecn ]
                            [ quantum BYTES ]
      
      defaults : 1024 flows, 10240 packets limit, quantum : device MTU
                 target : 5ms (CoDel default)
                 interval : 100ms (CoDel default)
      
      Impressive results on load :
      
      class htb 1:1 root leaf 10: prio 0 quantum 1514 rate 200000Kbit ceil 200000Kbit burst 1475b/8 mpu 0b overhead 0b cburst 1475b/8 mpu 0b overhead 0b level 0
       Sent 43304920109 bytes 33063109 pkt (dropped 0, overlimits 0 requeues 0)
       rate 201691Kbit 28595pps backlog 0b 312p requeues 0
       lended: 33063109 borrowed: 0 giants: 0
       tokens: -912 ctokens: -912
      
      class fq_codel 10:1735 parent 10:
       (dropped 1292, overlimits 0 requeues 0)
       backlog 15140b 10p requeues 0
        deficit 1514 count 1 lastcount 1 ldelay 7.1ms
      class fq_codel 10:4524 parent 10:
       (dropped 1291, overlimits 0 requeues 0)
       backlog 16654b 11p requeues 0
        deficit 1514 count 1 lastcount 1 ldelay 7.1ms
      class fq_codel 10:4e74 parent 10:
       (dropped 1290, overlimits 0 requeues 0)
       backlog 6056b 4p requeues 0
        deficit 1514 count 1 lastcount 1 ldelay 6.4ms dropping drop_next 92.0ms
      class fq_codel 10:628a parent 10:
       (dropped 1289, overlimits 0 requeues 0)
       backlog 7570b 5p requeues 0
        deficit 1514 count 1 lastcount 1 ldelay 5.4ms dropping drop_next 90.9ms
      class fq_codel 10:a4b3 parent 10:
       (dropped 302, overlimits 0 requeues 0)
       backlog 16654b 11p requeues 0
        deficit 1514 count 1 lastcount 1 ldelay 7.1ms
      class fq_codel 10:c3c2 parent 10:
       (dropped 1284, overlimits 0 requeues 0)
       backlog 13626b 9p requeues 0
        deficit 1514 count 1 lastcount 1 ldelay 5.9ms
      class fq_codel 10:d331 parent 10:
       (dropped 299, overlimits 0 requeues 0)
       backlog 15140b 10p requeues 0
        deficit 1514 count 1 lastcount 1 ldelay 7.0ms
      class fq_codel 10:d526 parent 10:
       (dropped 12160, overlimits 0 requeues 0)
       backlog 35870b 211p requeues 0
        deficit 1508 count 12160 lastcount 1 ldelay 15.3ms dropping drop_next 247us
      class fq_codel 10:e2c6 parent 10:
       (dropped 1288, overlimits 0 requeues 0)
       backlog 15140b 10p requeues 0
        deficit 1514 count 1 lastcount 1 ldelay 7.1ms
      class fq_codel 10:eab5 parent 10:
       (dropped 1285, overlimits 0 requeues 0)
       backlog 16654b 11p requeues 0
        deficit 1514 count 1 lastcount 1 ldelay 5.9ms
      class fq_codel 10:f220 parent 10:
       (dropped 1289, overlimits 0 requeues 0)
       backlog 15140b 10p requeues 0
        deficit 1514 count 1 lastcount 1 ldelay 7.1ms
      
      qdisc htb 1: root refcnt 6 r2q 10 default 1 direct_packets_stat 0 ver 3.17
       Sent 43331086547 bytes 33092812 pkt (dropped 0, overlimits 66063544 requeues 71)
       rate 201697Kbit 28602pps backlog 0b 260p requeues 71
      qdisc fq_codel 10: parent 1:1 limit 10240p flows 65536 target 5.0ms interval 100.0ms ecn
       Sent 43331086547 bytes 33092812 pkt (dropped 949359, overlimits 0 requeues 0)
       rate 201697Kbit 28602pps backlog 189352b 260p requeues 0
        maxpacket 1514 drop_overlimit 0 new_flow_count 5582 ecn_mark 125593
        new_flows_len 0 old_flows_len 11
      
      PING 172.30.42.18 (172.30.42.18) 56(84) bytes of data.
      64 bytes from 172.30.42.18: icmp_req=1 ttl=64 time=0.227 ms
      64 bytes from 172.30.42.18: icmp_req=2 ttl=64 time=0.165 ms
      64 bytes from 172.30.42.18: icmp_req=3 ttl=64 time=0.166 ms
      64 bytes from 172.30.42.18: icmp_req=4 ttl=64 time=0.151 ms
      64 bytes from 172.30.42.18: icmp_req=5 ttl=64 time=0.164 ms
      64 bytes from 172.30.42.18: icmp_req=6 ttl=64 time=0.172 ms
      64 bytes from 172.30.42.18: icmp_req=7 ttl=64 time=0.175 ms
      64 bytes from 172.30.42.18: icmp_req=8 ttl=64 time=0.183 ms
      64 bytes from 172.30.42.18: icmp_req=9 ttl=64 time=0.158 ms
      64 bytes from 172.30.42.18: icmp_req=10 ttl=64 time=0.200 ms
      
      10 packets transmitted, 10 received, 0% packet loss, time 8999ms
      rtt min/avg/max/mdev = 0.151/0.176/0.227/0.022 ms
      
      Much better than SFQ because of priority given to new flows, and a fast
      path dirtying fewer cache lines.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  15. 11 May, 2012 1 commit
    • E
      codel: Controlled Delay AQM · 76e3cc12
      Eric Dumazet committed
      An implementation of CoDel AQM, from Kathleen Nichols and Van Jacobson.
      
      http://queue.acm.org/detail.cfm?id=2209336
      
      This AQM's main input is no longer queue size in bytes or packets, but the
      delay packets experience in the (FIFO) queue.
      
      As we don't have infinite memory, we can still drop packets in enqueue()
      in case of massive load, but the intent of CoDel is to drop packets in
      dequeue(), using a control law based on two simple parameters :
      
      target : target sojourn time (default 5ms)
      interval : width of moving time window (default 100ms)
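
For reference, the control law described in the CoDel paper linked above schedules each successive drop sooner, by dividing the interval by the square root of the number of drops so far. The sketch below shows only that spacing rule; the function name is an assumption, and the state machine that decides when to enter or leave the dropping state is left out.

```python
import math

def next_drop_time_ms(now_ms, interval_ms, count):
    """Time of the next drop while the sojourn time stays above target."""
    return now_ms + interval_ms / math.sqrt(count)

# With the default 100 ms interval, the gap shrinks as drops accumulate.
gaps = [next_drop_time_ms(0.0, 100.0, c) for c in (1, 4, 16)]
```

The growing drop rate keeps pressure on senders until the queueing delay falls back under the target.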
      
      Based on initial work from Dave Taht.
      
      Refactored to help future codel inclusion as a plugin for other linux
      qdisc (FQ_CODEL, ...), like RED.
      
      include/net/codel.h contains the codel algorithm, kept as close as possible to
      Kathleen's reference.
      
      net/sched/sch_codel.c contains the linux qdisc specific glue.
      
      Separate structures permit a memory efficient implementation of fq_codel
      (to be sent as a separate work) : Each flow has its own struct
      codel_vars.
      
      timestamps are taken at enqueue() time with 1024 ns precision, allowing
      a range of 2199 seconds in queue, and 100Gb links support. iproute2 uses
      usec as base unit.
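
The 2199 second figure follows from the timestamp unit if timestamps are 32-bit values compared as signed differences. That storage assumption is not stated in the text above; the sketch simply checks that it reproduces the quoted range.

```python
# Worked arithmetic: range covered by timestamps in units of 1024 ns.
UNIT_NS = 1024                 # timestamp precision from the commit message
signed_32bit_span = 2 ** 31    # assumption: 32-bit values, signed comparison

range_s = signed_32bit_span * UNIT_NS / 1e9   # about 2199.02 seconds
```

A unit of roughly one microsecond is also fine enough for fast links, where a full-sized packet takes only a fraction of a microsecond to send.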
      
      Selected packets are dropped, unless ECN is enabled and packets can get
      ECN mark instead.
      
      Tested from 2Mb to 10Gb speeds with no particular problems, on ixgbe and
      tg3 drivers (BQL enabled).
      
      Usage: tc qdisc ... codel [ limit PACKETS ] [ target TIME ]
                                [ interval TIME ] [ ecn ]
      
      qdisc codel 10: parent 1:1 limit 2000p target 3.0ms interval 60.0ms ecn
       Sent 13347099587 bytes 8815805 pkt (dropped 0, overlimits 0 requeues 0)
       rate 202365Kbit 16708pps backlog 113550b 75p requeues 0
        count 116 lastcount 98 ldelay 4.3ms dropping drop_next 816us
        maxpacket 1514 ecn_mark 84399 drop_overlimit 0
      
      CoDel must be seen as a base module, and should be used keeping in mind
      there is still a FIFO queue. So a typical setup will probably need a
      hierarchy of several qdiscs and packet classifiers to be able to meet
      whatever constraints a user might have.
      
      One possible example would be to use fq_codel, which combines Fair
      Queueing and CoDel, in replacement of sfq / sfq_red.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: Dave Taht <dave.taht@bufferbloat.net>
      Cc: Kathleen Nichols <nichols@pollere.com>
      Cc: Van Jacobson <van@pollere.net>
      Cc: Tom Herbert <therbert@google.com>
      Cc: Matt Mathis <mattmathis@google.com>
      Cc: Yuchung Cheng <ycheng@google.com>
      Cc: Stephen Hemminger <shemminger@vyatta.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  16. 01 May, 2012 1 commit
  17. 08 Feb, 2012 1 commit
    • S
      net/sched: sch_plug - Queue traffic until an explicit release command · c3059be1
      Shriram Rajagopalan committed
      The qdisc supports two operations - plug and unplug. When the
      qdisc receives a plug command via netlink request, packets arriving
      henceforth are buffered until a corresponding unplug command is received.
      Depending on the type of unplug command, the queue can be unplugged
      indefinitely or selectively.
      
      This qdisc can be used to implement output buffering, an essential
      functionality required for consistent recovery in checkpoint based
      fault-tolerance systems. Output buffering enables speculative execution
      by allowing generated network traffic to be rolled back. It is used to
      provide network protection for Xen Guests in the Remus high availability
      project, available as part of Xen.
      
      This module is generic enough to be used by any other system that wishes
      to add speculative execution and output buffering to its applications.
      
      This module was originally available in the linux 2.6.32 PV-OPS tree,
      used as dom0 for Xen.
      
      For more information, please refer to http://nss.cs.ubc.ca/remus/
      and http://wiki.xensource.com/xenwiki/Remus
      
      Changes in V3:
        * Removed debug output (printk) on queue overflow
        * Added TCQ_PLUG_RELEASE_INDEFINITE - that allows the user to
          use this qdisc, for simple plug/unplug operations.
        * Use of packet counts instead of pointers to keep track of
          the buffers in the queue.
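
The plug/unplug behaviour can be modelled with a FIFO and a few packet counters. The class below is a simplified sketch written for illustration, not the qdisc implementation: the class and method names are made up, and the model starts in the buffering state.

```python
from collections import deque

class PlugQueue:
    """Toy model of output buffering with selective and indefinite release."""

    def __init__(self):
        self.queue = deque()
        self.current_epoch = 0    # packets buffered since the last plug
        self.last_epoch = 0       # packets buffered before the last plug
        self.to_release = 0       # packets cleared for transmission
        self.indefinite = False   # True: behave like a plain FIFO

    def enqueue(self, pkt):
        self.queue.append(pkt)
        if self.indefinite:
            self.to_release += 1
        else:
            self.current_epoch += 1

    def plug(self):
        # Start a new epoch; everything queued so far belongs to the previous one.
        self.last_epoch += self.current_epoch
        self.current_epoch = 0
        self.indefinite = False

    def release_one(self):
        # Selective unplug: release only what was queued before the last plug.
        self.to_release += self.last_epoch
        self.last_epoch = 0

    def release_indefinite(self):
        # Indefinite unplug: release everything and stop buffering.
        self.to_release = len(self.queue)
        self.current_epoch = self.last_epoch = 0
        self.indefinite = True

    def dequeue(self):
        if self.to_release == 0 or not self.queue:
            return None
        self.to_release -= 1
        return self.queue.popleft()

q = PlugQueue()
q.enqueue("a")
q.enqueue("b")
q.plug()                # checkpoint: "a" and "b" belong to the finished epoch
q.enqueue("c")          # output of the next, still speculative epoch
q.release_one()
released = [q.dequeue(), q.dequeue(), q.dequeue()]
q.release_indefinite()
rest = q.dequeue()
```

Tracking counts rather than pointers into the queue keeps the bookkeeping independent of how packets are stored.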
      Signed-off-by: Shriram Rajagopalan <rshriram@cs.ubc.ca>
      Signed-off-by: Brendan Cully <brendan@cs.ubc.ca>
      [author of the code in the linux 2.6.32 pvops tree]
      Signed-off-by: David S. Miller <davem@davemloft.net>
  18. 13 Jan, 2012 1 commit
    • E
      net_sched: sfq: add optional RED on top of SFQ · ddecf0f4
      Eric Dumazet committed
      Adds an optional Random Early Detection on each SFQ flow queue.
      
      Traditional SFQ limits the count of packets, while RED also allows
      controlling the number of bytes per flow, and adds ECN capability as well.
      
      1) We don't handle the idle time management in this RED implementation,
      since each 'new flow' begins with a null qavg. We really want to address
      backlogged flows.
      
      2) If headdrop is selected, we try to ECN mark the first packet instead of
      the currently enqueued packet. This gives faster feedback for TCP flows
      compared to traditional RED [ marking the last packet in queue ]
      
      Example of use :
      
      tc qdisc add dev $DEV parent 1:1 handle 10: est 1sec 4sec sfq \
      	limit 3000 headdrop flows 512 divisor 16384 \
      	redflowlimit 100000 min 8000 max 60000 probability 0.20 ecn
      
      qdisc sfq 10: parent 1:1 limit 3000p quantum 1514b depth 127 headdrop flows 512/16384 divisor 16384
       ewma 6 min 8000b max 60000b probability 0.2 ecn
       prob_mark 0 prob_mark_head 4876 prob_drop 6131
       forced_mark 0 forced_mark_head 0 forced_drop 0
       Sent 1175211782 bytes 777537 pkt (dropped 6131, overlimits 11007 requeues 0)
       rate 99483Kbit 8219pps backlog 689392b 456p requeues 0
      
      In this test, with 64 netperf TCP_STREAM sessions, 50% using ECN enabled
      flows, we can see that the number of packets CE marked is smaller than the
      number of drops (for non ECN flows).
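
The `min`, `max` and `probability` values in the example command shape the classic RED curve: no action below `min`, a linearly growing probability up to `max`, and forced action beyond it. The sketch below shows that textbook curve with the example values as defaults. It works on an average backlog in bytes and is not the per-flow code of this patch.

```python
def red_mark_probability(qavg, qmin=8000, qmax=60000, max_p=0.20):
    """Probability of marking or dropping a packet for a given average backlog."""
    if qavg < qmin:
        return 0.0          # short queue: leave the packet alone
    if qavg >= qmax:
        return 1.0          # beyond max: forced mark or drop
    # In between: probability grows linearly from 0 to max_p.
    return max_p * (qavg - qmin) / (qmax - qmin)

halfway = red_mark_probability(34000)   # midpoint between min and max
```

The two regimes correspond to the `prob_*` and `forced_*` counters shown in the statistics.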
      
      If the same test is run without RED, we can see that the backlog is much bigger.
      
      qdisc sfq 10: parent 1:1 limit 3000p quantum 1514b depth 127 headdrop flows 512/16384 divisor 16384
       Sent 1148683617 bytes 795006 pkt (dropped 0, overlimits 0 requeues 0)
       rate 98429Kbit 8521pps backlog 1221290b 841p requeues 0
      Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
      CC: Stephen Hemminger <shemminger@vyatta.com>
      CC: Dave Taht <dave.taht@gmail.com>
      Tested-by: Dave Taht <dave.taht@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      ddecf0f4
  19. 06 Jan, 2012 1 commit
    • E
      net_sched: sfq: extend limits · 18cb8098
      Committed by Eric Dumazet
      SFQ as implemented in Linux is very limited, with at most 127 flows
      and a limit of 127 packets. [ So if 127 flows are active, we have one
      packet per flow ]
      
      This patch brings the following features to SFQ to cope with modern needs.
      
      - Ability to specify a smaller per flow limit of inflight packets.
          (default value being at 127 packets)
      
      - Ability to have up to 65408 active flows (instead of 127)
      
      - Ability to have head drops instead of tail drops
        (to drop old packets from a flow)
      
      Example of use : No more than 20 packets per flow, max 8000 flows, max
      20000 packets in SFQ qdisc, hash table of 65536 slots.
      
      tc qdisc add ... sfq \
              flows 8000 \
              depth 20 \
              headdrop \
              limit 20000 \
      	divisor 65536
      
      Ram usage :
      
      2 bytes per hash table entry (instead of previous 1 byte/entry)
      32 bytes per flow on 64bit arches, instead of 384 for QFQ, so much
      better cache hit ratio.
      Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
      CC: Dave Taht <dave.taht@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      18cb8098
  20. 13 Dec, 2011 1 commit
    • H
      netem: add cell concept to simulate special MAC behavior · 90b41a1c
      Committed by Hagen Paul Pfeifer
      This extension can be used to simulate special link layer
      characteristics. Simulate because packet data is not modified, only the
      calculation base is changed to delay a packet based on the original
      packet size and artificial cell information.
      
      packet_overhead can be used to simulate a link layer header compression
      scheme (e.g. set packet_overhead to -20) or with a positive
      packet_overhead value an additional MAC header can be simulated. It is
      also possible to "replace" the 14 byte Ethernet header with something
      else.
      
      cell_size and cell_overhead can be used to simulate link layer schemes
      based on cells, like some TDMA schemes. Another application area is MAC
      schemes that use link layer fragmentation with a (small) header per
      fragment. Cell size is the maximum number of data bytes within one cell.
      Cell overhead is an additional variable to change the per-cell overhead
      (e.g. a 5 byte header per fragment).
      
      Example (5 kbit/s, 20 byte per packet overhead, cell-size 100 byte, per
      cell overhead 5 byte):
      
        tc qdisc add dev eth0 root netem rate 5kbit 20 100 5
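A rough Python sketch of the size accounting described above. It assumes that packet_overhead is added to the packet length first, and that every cell, including a partially filled last one, is charged at cell_size + cell_overhead. The function names are illustrative and are not taken from the patch.

```python
def netem_wire_bytes(pkt_len, packet_overhead=0, cell_size=0, cell_overhead=0):
    """Bytes charged against the emulated rate for one packet (assumed model)."""
    length = pkt_len + packet_overhead
    if cell_size > 0:
        # Round up to a whole number of cells (ceiling division).
        cells = -(-length // cell_size)
        # Assumption: each cell costs its payload size plus the per-cell overhead.
        length = cells * (cell_size + cell_overhead)
    return length


def netem_tx_delay_s(pkt_len, rate_bits_per_s, **link):
    """Serialization delay in seconds for one packet at the given rate."""
    return netem_wire_bytes(pkt_len, **link) * 8 / rate_bits_per_s


if __name__ == "__main__":
    # Parameters from the example: 5 kbit/s, 20 byte packet overhead,
    # 100 byte cells, 5 byte per-cell overhead; a 1000 byte packet.
    print(netem_wire_bytes(1000, 20, 100, 5))
    print(netem_tx_delay_s(1000, 5000, packet_overhead=20,
                           cell_size=100, cell_overhead=5))
```

Under these assumptions a 1000 byte packet becomes 1020 bytes with overhead, occupies 11 cells, and is charged as 1155 bytes on the emulated link.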
      Signed-off-by: Hagen Paul Pfeifer <hagen@jauu.net>
      Signed-off-by: Florian Westphal <fw@strlen.de>
      Acked-by: Stephen Hemminger <shemminger@vyatta.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      90b41a1c
  21. 10 Dec, 2011 1 commit
    • E
      sch_red: generalize accurate MAX_P support to RED/GRED/CHOKE · a73ed26b
      Committed by Eric Dumazet
      Now that RED uses a Q0.32 number to store max_p (max probability), allow
      RED/GRED/CHOKE to use/report full resolution at config/dump time.
      
      Old tc binaries are not aware of the new attributes, and still set/get Plog.
      
      New tc binaries set/get both Plog and max_p for backward compatibility,
      and display "probability value" if they get max_p from new kernels.
      
      # tc -d  qdisc show dev ...
      ...
      qdisc red 10: parent 1:1 limit 360Kb min 30Kb max 90Kb ecn ewma 5
      probability 0.09 Scell_log 15
      
      Make sure we avoid potential divides by 0 in reciprocal_value(), if
      (max_th - min_th) is big.
      Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      a73ed26b
  22. 09 Dec, 2011 1 commit
    • E
      sch_red: Adaptative RED AQM · 8af2a218
      Committed by Eric Dumazet
      Adaptative RED AQM for Linux, based on the paper by Sally Floyd,
      Ramakrishna Gummadi, and Scott Shenker, August 2001 :
      
      http://icir.org/floyd/papers/adaptiveRed.pdf
      
      Goal of Adaptative RED is to make max_p a dynamic value between 1% and
      50% to reach the target average queue : min_th + (max_th - min_th) / 2
      
      Every 500 ms:
       if (avg > target and max_p <= 0.5)
        increase max_p : max_p += alpha;
       else if (avg < target and max_p >= 0.01)
        decrease max_p : max_p *= beta;
      
      target :[min_th + 0.4*(max_th - min_th),
                min_th + 0.6*(max_th - min_th)].
      alpha : min(0.01, max_p / 4)
      beta : 0.9
      max_P is a Q0.32 fixed point number (unsigned, with 32 bits mantissa)
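A minimal Python sketch of the adaptation step above, written with floats for readability. It assumes the target is the band between 40% and 60% of the way from min_th to max_th, so "avg > target" is read as "above the band" and "avg < target" as "below the band". The helper names and the Q0.32 conversion are illustrative, not the kernel code.

```python
Q32 = 1 << 32  # scale factor of a Q0.32 fixed point fraction


def to_q32(p):
    """Convert a probability in [0, 1] to an unsigned Q0.32 value (clamped)."""
    return min(int(p * Q32), Q32 - 1)


def adapt_max_p(max_p, avg, min_th, max_th):
    """One adaptation step, meant to be run every 500 ms."""
    span = max_th - min_th
    target_lo = min_th + 0.4 * span
    target_hi = min_th + 0.6 * span
    if avg > target_hi and max_p <= 0.5:
        max_p += min(0.01, max_p / 4)  # alpha: additive increase
    elif avg < target_lo and max_p >= 0.01:
        max_p *= 0.9                   # beta: multiplicative decrease
    return max_p


if __name__ == "__main__":
    # Thresholds from the 10Mbit example: min 30000, max 90000.
    p = adapt_max_p(0.10, avg=80000, min_th=30000, max_th=90000)
    print(p, to_q32(p))
```

With min 30000 and max 90000 the target band is 54000 to 66000 bytes, so an average of 80000 raises max_p by 0.01, while an average inside the band leaves it unchanged.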
      
      Changes against our RED implementation are :
      
      max_p is no longer a negative power of two (1/(2^Plog)), but a Q0.32
      fixed point number, to allow the full range described in the Adaptive
      RED paper.
      
      To deliver a random number, we now use a reciprocal divide (that's really
      a multiply), but this operation is done once per marked/dropped packet
      when in the RED_BETWEEN_TRESH window, so the added cost (compared to the
      previous AND operation) is near zero.
      
      dump operation gives current max_p value in a new TCA_RED_MAX_P
      attribute.
      
      Example on a 10Mbit link :
      
      tc qdisc add dev $DEV parent 1:1 handle 10: est 1sec 8sec red \
         limit 400000 min 30000 max 90000 avpkt 1000 \
         burst 55 ecn adaptative bandwidth 10Mbit
      
      # tc -s -d qdisc show dev eth3
      ...
      qdisc red 10: parent 1:1 limit 400000b min 30000b max 90000b ecn
      adaptative ewma 5 max_p=0.113335 Scell_log 15
       Sent 50414282 bytes 34504 pkt (dropped 35, overlimits 1392 requeues 0)
       rate 9749Kbit 831pps backlog 72056b 16p requeues 0
        marked 1357 early 35 pdrop 0 other 0
      Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      8af2a218
  23. 01 Dec, 2011 1 commit
    • H
      netem: rate extension · 7bc0f28c
      Committed by Hagen Paul Pfeifer
      Currently netem cannot emulate channel bandwidth. Only a static
      delay (and optional random jitter) can be configured.
      
      To emulate the channel rate, the token bucket filter (sch_tbf) can be used. But
      TBF has some major emulation flaws. The buffer (token bucket depth/rate) cannot
      be 0. Also, the idea behind TBF is that the credit (tokens in the bucket) fills
      up if no packet is transmitted, so there is always a "positive" credit for new
      packets. In real life this behavior contradicts the laws of nature, where
      nothing can travel faster than the speed of light. E.g.: on an emulated
      1000 byte/s link, a small IPv4/TCP SYN packet of ~50 bytes requires ~0.05
      seconds - not 0 seconds.
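The point about serialization delay can be illustrated with a small Python model. This is only a sketch of the idea that a packet cannot leave before the link has finished sending the previous one, with no accumulated credit for idle time; it is not how netem itself schedules packets, and the function name is made up for the example.

```python
def departure_times(arrivals, rate_bytes_per_s):
    """Departure time of each packet on a link with a fixed byte rate.

    arrivals: list of (arrival_time_s, size_bytes), sorted by arrival time.
    """
    link_free_at = 0.0
    departures = []
    for arrival, size in arrivals:
        # Transmission starts once the packet has arrived and the link is idle.
        start = max(arrival, link_free_at)
        link_free_at = start + size / rate_bytes_per_s
        departures.append(link_free_at)
    return departures


if __name__ == "__main__":
    # A ~50 byte SYN on an emulated 1000 byte/s link, after a long idle period.
    print(departure_times([(10.0, 50)], 1000))
```

Even after ten idle seconds the packet still needs about 0.05 s on the wire, which is the behavior a token bucket with stored credit does not reproduce.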
      
      Netem is an excellent place to implement a rate limiting feature: static
      delay is already implemented, tfifo already has time information and the
      user can skip TBF configuration completely.
      
      This patch implements the rate feature, which can be configured via tc, e.g.:
      
      	tc qdisc add dev eth0 root netem rate 10kbit
      
      To emulate a link of 5000byte/s and add an additional static delay of 10ms:
      
      	tc qdisc add dev eth0 root netem delay 10ms rate 5KBps
      
      Note: similar to TBF, the rate extension is bound to the kernel timing
      system. Depending on the architecture's timer granularity, higher rates (e.g.
      10mbit/s and higher) tend to produce transmission bursts. Also note that
      further queues live in network adaptors; see ethtool(8).
      Signed-off-by: Hagen Paul Pfeifer <hagen@jauu.net>
      Acked-by: Eric Dumazet <eric.dumazet@gmail.com>
      Signed-off-by: David S. Miller <davem@drr.davemloft.net>
      7bc0f28c
  24. 23 Nov, 2011 1 commit
  25. 05 Apr, 2011 1 commit
  26. 31 Mar, 2011 1 commit
  27. 25 Feb, 2011 2 commits
  28. 24 Feb, 2011 1 commit
    • E
      net_sched: SFB flow scheduler · e13e02a3
      Committed by Eric Dumazet
      This is the Stochastic Fair Blue scheduler, based on work from :
      
      W. Feng, D. Kandlur, D. Saha, K. Shin. Blue: A New Class of Active Queue
      Management Algorithms. U. Michigan CSE-TR-387-99, April 1999.
      
      http://www.thefengs.com/wuchang/blue/CSE-TR-387-99.pdf
      
      This implementation is based on work done by Juliusz Chroboczek
      
      General SFB algorithm can be found in figure 14, page 15:
      
      B[l][n] : L x N array of bins (L levels, N bins per level)
      enqueue()
      Calculate hash function values h{0}, h{1}, .. h{L-1}
      Update bins at each level
      for i = 0 to L - 1
         if (B[i][h{i}].qlen > bin_size)
            B[i][h{i}].p_mark += p_increment;
         else if (B[i][h{i}].qlen == 0)
            B[i][h{i}].p_mark -= p_decrement;
      p_min = min(B[0][h{0}].p_mark ... B[L-1][h{L-1}].p_mark);
      if (p_min == 1.0)
          ratelimit();
      else
          mark/drop with probability p_min;
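The pseudocode above can be sketched in Python as follows. This is a toy model under stated assumptions: crc32 stands in for the per-level hash functions, a "mark/drop" decision is treated as a drop so the packet is not queued, and the class name, default parameters and dequeue() bookkeeping are illustrative rather than taken from the kernel implementation.

```python
import random
import zlib
from dataclasses import dataclass


@dataclass
class Bin:
    qlen: int = 0
    p_mark: float = 0.0


class SfbSketch:
    def __init__(self, levels=2, nbins=16, bin_size=4,
                 p_increment=0.25, p_decrement=0.125, seed=1):
        self.nbins = nbins
        self.bin_size = bin_size
        self.p_increment = p_increment
        self.p_decrement = p_decrement
        self.bins = [[Bin() for _ in range(nbins)] for _ in range(levels)]
        self.rng = random.Random(seed)

    def _slots(self, flow):
        # One bin per level, chosen by an independent hash of the flow key.
        return [level[zlib.crc32(f"{i}:{flow}".encode()) % self.nbins]
                for i, level in enumerate(self.bins)]

    def enqueue(self, flow):
        slots = self._slots(flow)
        for b in slots:
            if b.qlen > self.bin_size:
                b.p_mark = min(1.0, b.p_mark + self.p_increment)
            elif b.qlen == 0:
                b.p_mark = max(0.0, b.p_mark - self.p_decrement)
        p_min = min(b.p_mark for b in slots)
        if p_min >= 1.0:
            return "ratelimit"
        if self.rng.random() < p_min:
            return "mark/drop"  # assumption: treated as a drop, not queued
        for b in slots:
            b.qlen += 1
        return "queued"

    def dequeue(self, flow):
        for b in self._slots(flow):
            b.qlen = max(0, b.qlen - 1)


if __name__ == "__main__":
    sfb = SfbSketch()
    print([sfb.enqueue("unresponsive-flow") for _ in range(10)])
```

A flow that keeps sending without ever being dequeued first fills its bins, then sees p_mark climb in every level until it is rate limited, while a flow hashing to at least one uncongested bin keeps p_min low.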
      
      I adapted Juliusz's code to meet current kernel standards, and made
      various changes to address previous comments :
      
      http://thread.gmane.org/gmane.linux.network/90225
      http://thread.gmane.org/gmane.linux.network/90375
      
      Default flow classifier is the rxhash introduced by RPS in 2.6.35, but
      we can use an external flow classifier if wanted.
      
      tc qdisc add dev $DEV parent 1:11 handle 11:  \
              est 0.5sec 2sec sfb limit 128
      
      tc filter add dev $DEV protocol ip parent 11: handle 3 \
              flow hash keys dst divisor 1024
      
      Notes:
      
      1) The SFB default child qdisc is pfifo_fast. It can be replaced by another
      qdisc, but a child qdisc MUST NOT drop a packet previously queued. This
      is because SFB needs to handle a dequeued packet in order to maintain
      its virtual queue states. pfifo_head_drop or CHOKe should not be used.
      
      2) ECN is enabled by default, unlike RED/CHOKe/GRED
      
      With help from Patrick McHardy & Andi Kleen
      Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
      CC: Juliusz Chroboczek <Juliusz.Chroboczek@pps.jussieu.fr>
      CC: Stephen Hemminger <shemminger@vyatta.com>
      CC: Patrick McHardy <kaber@trash.net>
      CC: Andi Kleen <andi@firstfloor.org>
      CC: John W. Linville <linville@tuxdriver.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      e13e02a3
  29. 03 Feb, 2011 1 commit
    • S
      sched: CHOKe flow scheduler · 45e14433
      Committed by Stephen Hemminger
      CHOKe ("CHOose and Kill" or "CHOose and Keep") is an alternative
      packet scheduler based on the Random Early Detection (RED) algorithm.
      
      The core idea is:
        For every packet arrival:
              Calculate Qave
              if (Qave < minth)
                   Queue the new packet
              else
                   Select randomly a packet from the queue
                   if (both packets from same flow)
                   then Drop both the packets
                   else if (Qave > maxth)
                        Drop packet
                   else
                        Admit packet with probability p (same as RED)
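A minimal Python sketch of the core idea above. The queue is modeled as a list of flow identifiers, and Qave is passed in rather than computed. One assumption to flag: "probability p (same as RED)" is read here as RED's drop probability, so the packet is admitted with probability 1 - p. The function name and return strings are illustrative only.

```python
import random


def choke_enqueue(queue, flow, qave, min_th, max_th, p, rng=random):
    """Decide the fate of an arriving packet; `queue` holds flow ids."""
    if qave < min_th:
        queue.append(flow)
        return "queued"
    if queue:
        # Compare the arrival with one randomly chosen queued packet.
        victim = rng.randrange(len(queue))
        if queue[victim] == flow:
            del queue[victim]  # drop the queued packet and the arrival
            return "dropped_both"
    if qave > max_th:
        return "dropped"
    # Assumption: p is the RED drop probability between the thresholds.
    if rng.random() < p:
        return "dropped"
    queue.append(flow)
    return "queued"


if __name__ == "__main__":
    q = ["a", "a", "a"]
    print(choke_enqueue(q, "a", qave=50, min_th=10, max_th=100, p=0.1), q)
```

A flow that dominates the queue is the most likely match for the random draw, which is how the scheme penalizes it without keeping per-flow state.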
      
      See also:
        Rong Pan, Balaji Prabhakar, Konstantinos Psounis, "CHOKe: a stateless active
         queue management scheme for approximating fair bandwidth allocation",
        Proceedings of INFOCOM'2000, March 2000.
      
      Help from:
           Eric Dumazet <eric.dumazet@gmail.com>
           Patrick McHardy <kaber@trash.net>
      Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
      Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      45e14433
  30. 20 Jan, 2011 1 commit
    • J
      net_sched: implement a root container qdisc sch_mqprio · b8970f0b
      Committed by John Fastabend
      This implements an mqprio queueing discipline that by default creates
      a pfifo_fast qdisc per tx queue and provides the needed configuration
      interface.
      
      Using the mqprio qdisc, the number of tcs currently in use, along
      with the range of queues allotted to each class, can be configured. By
      default skbs are mapped to traffic classes using the skb priority.
      This mapping is configurable.
      
      Configurable parameters,
      
      struct tc_mqprio_qopt {
      	__u8    num_tc;
      	__u8    prio_tc_map[TC_BITMASK + 1];
      	__u8    hw;
      	__u16   count[TC_MAX_QUEUE];
      	__u16   offset[TC_MAX_QUEUE];
      };
      
      Here the count/offset pairing gives the queue alignment and the
      prio_tc_map gives the mapping from skb->priority to tc.
      
      The hw bit determines if the hardware should configure the count
      and offset values. If the hardware bit is set then the operation
      will fail if the hardware does not implement the ndo_setup_tc
      operation. This is to avoid undetermined states where the hardware
      may or may not control the queue mapping. Also minimal bounds
      checking is done on the count/offset to verify a queue does not
      exceed num_tx_queues and that queue ranges do not overlap. Otherwise
      it is left to user policy or hardware configuration to create
      useful mappings.
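The bounds checks and the priority lookup described above can be sketched in Python. This is an illustration under assumptions: TC_BITMASK is taken to be 15 (the patch only shows the symbolic name), plain lists stand in for the struct fields, and the function names are hypothetical, not the kernel's.

```python
TC_BITMASK = 15  # assumed value; the patch only shows the symbolic name


def validate_mqprio(num_tc, prio_tc_map, count, offset, num_tx_queues):
    """Return a list of problems; an empty list means the mapping looks sane."""
    errors = []
    for prio, tc in enumerate(prio_tc_map):
        if tc >= num_tc:
            errors.append(f"priority {prio} maps to unknown tc {tc}")
    for i in range(num_tc):
        last = offset[i] + count[i]
        # A class needs at least one queue and must fit in the device queues.
        if count[i] == 0 or last > num_tx_queues:
            errors.append(f"tc {i} queue range [{offset[i]}, {last}) is invalid")
        # Queue ranges of different classes must not overlap.
        for j in range(i + 1, num_tc):
            if offset[i] < offset[j] + count[j] and offset[j] < last:
                errors.append(f"tc {i} and tc {j} queue ranges overlap")
    return errors


def queues_for_priority(priority, prio_tc_map, count, offset):
    """Map an skb priority to the range of tx queues of its traffic class."""
    tc = prio_tc_map[priority & TC_BITMASK]
    return range(offset[tc], offset[tc] + count[tc])


if __name__ == "__main__":
    prio_map = [0] * 8 + [1] * 8  # low priorities to tc 0, high to tc 1
    print(validate_mqprio(2, prio_map, [4, 4], [0, 4], 8))
    print(list(queues_for_priority(9, prio_map, [4, 4], [0, 4])))
```

Here two classes split eight tx queues evenly, so priority 9 lands on queues 4 through 7.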
      
      It is expected that hardware QOS schemes can be implemented by
      creating appropriate mappings of queues in ndo_setup_tc().
      
      One expected use case is that drivers will use ndo_setup_tc to map
      queue ranges onto 802.1Q traffic classes. This provides a generic
      mechanism to map network traffic onto these traffic classes and
      removes the need for lower layer drivers to know specifics about
      traffic types.
      Signed-off-by: John Fastabend <john.r.fastabend@intel.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      b8970f0b
  31. 05 Nov, 2009 1 commit
  32. 11 Feb, 2009 1 commit
  33. 31 Jan, 2009 1 commit
  34. 20 Nov, 2008 1 commit
  35. 13 Sep, 2008 1 commit
  36. 20 Jul, 2008 1 commit
  37. 18 Jul, 2008 1 commit
    • D
      pkt_sched: Remove RR scheduler. · 1d8ae3fd
      Committed by David S. Miller
      This actually fixes a bug added by the RR scheduler changes.  The
      ->bands and ->prio2band parameters were being set outside of the
      sch_tree_lock() and thus could result in strange behavior and
      inconsistencies.
      
      It might be possible, in the new design (where there will be one qdisc
      per device TX queue), to allow similar functionality via a TX hash
      algorithm for RR, but I really see no reason to export this aspect of
      how these multiqueue cards actually implement the scheduling of
      the individual DMA TX rings and the single physical MAC/PHY port.
      Signed-off-by: David S. Miller <davem@davemloft.net>
      1d8ae3fd
  38. 01 Feb, 2008 1 commit