1. 16 Mar 2012, 1 commit
    • sch_sfq: revert dont put new flow at the end of flows · cc34eb67
      Committed by Eric Dumazet
      This reverts commit d47a0ac7 (sch_sfq: dont put new flow at the end of
      flows)
      
      As Jesper found out, patch sounded great but has bad side effects.
      
      In a stress situation, pushing new flows in front of the queue can prevent
      old flows from making any progress. Packets can stay in the SFQ queue for
      an unlimited amount of time.
      
      It's possible to add heuristics to limit this problem, but this would
      add complexity outside of SFQ scope.
      
      A more sensible answer to Dave Taht's concerns (he reported the issue I
      tried to solve in the original commit) is probably to use a qdisc hierarchy
      so that high prio packets don't enter a potentially crowded SFQ qdisc.
      Reported-by: Jesper Dangaard Brouer <jdb@comx.dk>
      Cc: Dave Taht <dave.taht@gmail.com>
      Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      cc34eb67
  2. 20 Feb 2012, 1 commit
  3. 14 Feb 2012, 1 commit
  4. 10 Feb 2012, 1 commit
  5. 08 Feb 2012, 1 commit
    • net/sched: sch_plug - Queue traffic until an explicit release command · c3059be1
      Committed by Shriram Rajagopalan
      The qdisc supports two operations - plug and unplug. When the
      qdisc receives a plug command via netlink request, packets arriving
      henceforth are buffered until a corresponding unplug command is received.
      Depending on the type of unplug command, the queue can be unplugged
      indefinitely or selectively.
      
      This qdisc can be used to implement output buffering, an essential
      functionality required for consistent recovery in checkpoint based
      fault-tolerance systems. Output buffering enables speculative execution
      by allowing generated network traffic to be rolled back. It is used to
      provide network protection for Xen Guests in the Remus high availability
      project, available as part of Xen.
      
      This module is generic enough to be used by any other system that wishes
      to add speculative execution and output buffering to its applications.
      
      This module was originally available in the linux 2.6.32 PV-OPS tree,
      used as dom0 for Xen.
      
      For more information, please refer to http://nss.cs.ubc.ca/remus/
      and http://wiki.xensource.com/xenwiki/Remus
      
      Changes in V3:
        * Removed debug output (printk) on queue overflow
        * Added TCQ_PLUG_RELEASE_INDEFINITE - that allows the user to
          use this qdisc, for simple plug/unplug operations.
        * Use of packet counts instead of pointers to keep track of
          the buffers in the queue.
      Signed-off-by: Shriram Rajagopalan <rshriram@cs.ubc.ca>
      Signed-off-by: Brendan Cully <brendan@cs.ubc.ca>
      [author of the code in the linux 2.6.32 pvops tree]
      Signed-off-by: David S. Miller <davem@davemloft.net>
      c3059be1
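The plug/unplug behaviour described in c3059be1 can be illustrated with a toy model. This is a minimal sketch, not the kernel implementation: the class `PlugQueue` and its method names (`plug`, `release_one`, `release_indefinite`) are hypothetical, and the epoch semantics (each plug seals the packets queued since the previous plug, tracked as a packet count) are an assumption drawn from the commit text.

```python
from collections import deque


class PlugQueue:
    """Toy model of a plug/unplug buffer (hypothetical, not sch_plug itself)."""

    def __init__(self):
        self.queue = deque()          # packets waiting to leave
        self.closed_epochs = deque()  # packet counts of epochs sealed by plug()
        self.open_epoch = 0           # packets queued since the last plug()
        self.releasable = 0           # packets currently allowed to leave
        self.pass_through = False     # True after an indefinite release

    def enqueue(self, pkt):
        self.queue.append(pkt)
        if not self.pass_through:
            self.open_epoch += 1

    def plug(self):
        if self.pass_through:
            # Leave pass-through mode; leftovers were already cleared to leave.
            self.pass_through = False
            self.releasable = len(self.queue)
        else:
            # Seal the packets queued so far as one epoch (a count, not pointers).
            self.closed_epochs.append(self.open_epoch)
            self.open_epoch = 0

    def release_one(self):
        # Selective unplug: let the oldest sealed epoch drain.
        if self.closed_epochs:
            self.releasable += self.closed_epochs.popleft()

    def release_indefinite(self):
        # Indefinite unplug: everything flows until the next plug().
        self.closed_epochs.clear()
        self.open_epoch = 0
        self.releasable = 0
        self.pass_through = True

    def dequeue(self):
        if not self.queue:
            return None
        if self.pass_through:
            return self.queue.popleft()
        if self.releasable > 0:
            self.releasable -= 1
            return self.queue.popleft()
        return None  # plugged: packet stays buffered
```

In a checkpointing setup such as the one the commit describes, a controller would call `plug()` at each checkpoint and `release_one()` once that checkpoint is known to be safe, so that only output belonging to committed state leaves the host.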
  6. 07 Feb 2012, 1 commit
  7. 03 Feb 2012, 1 commit
    • cgroup: remove cgroup_subsys argument from callbacks · 761b3ef5
      Committed by Li Zefan
      The argument is not used at all, and it's not necessary, because
      a specific callback handler of course knows which subsys it
      belongs to.
      
      Now only ->populate() takes this argument, because the handlers of
      this callback always call cgroup_add_file()/cgroup_add_files().
      
      So we reduce a few lines of code, though the shrinking of object size
      is minimal.
      
       16 files changed, 113 insertions(+), 162 deletions(-)
      
         text    data     bss     dec     hex filename
      5486240  656987 7039960 13183187         c928d3 vmlinux.o.orig
      5486170  656987 7039960 13183117         c9288d vmlinux.o
      Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
      Signed-off-by: Tejun Heo <tj@kernel.org>
      761b3ef5
  8. 23 Jan 2012, 1 commit
  9. 13 Jan 2012, 1 commit
    • net_sched: sfq: add optional RED on top of SFQ · ddecf0f4
      Committed by Eric Dumazet
      Adds an optional Random Early Detection on each SFQ flow queue.
      
      Traditional SFQ limits the count of packets, while RED also permits
      controlling the number of bytes per flow, and adds ECN capability as well.
      
      1) We don't handle the idle time management in this RED implementation,
      since each 'new flow' begins with a null qavg. We really want to address
      backlogged flows.
      
      2) If headdrop is selected, we try to ECN-mark the first packet instead of
      the currently enqueued packet. This gives faster feedback for TCP flows
      compared to traditional RED [ marking the last packet in queue ]
      
      Example of use :
      
      tc qdisc add dev $DEV parent 1:1 handle 10: est 1sec 4sec sfq \
      	limit 3000 headdrop flows 512 divisor 16384 \
      	redflowlimit 100000 min 8000 max 60000 probability 0.20 ecn
      
      qdisc sfq 10: parent 1:1 limit 3000p quantum 1514b depth 127 headdrop
      flows 512/16384 divisor 16384
       ewma 6 min 8000b max 60000b probability 0.2 ecn
       prob_mark 0 prob_mark_head 4876 prob_drop 6131
       forced_mark 0 forced_mark_head 0 forced_drop 0
       Sent 1175211782 bytes 777537 pkt (dropped 6131, overlimits 11007
      requeues 0)
       rate 99483Kbit 8219pps backlog 689392b 456p requeues 0
      
      In this test, with 64 netperf TCP_STREAM sessions, 50% using ECN enabled
      flows, we can see number of packets CE marked is smaller than number of
      drops (for non ECN flows)
      
      If same test is run, without RED, we can check backlog is much bigger.
      
      qdisc sfq 10: parent 1:1 limit 3000p quantum 1514b depth 127 headdrop
      flows 512/16384 divisor 16384
       Sent 1148683617 bytes 795006 pkt (dropped 0, overlimits 0 requeues 0)
       rate 98429Kbit 8521pps backlog 1221290b 841p requeues 0
      Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
      CC: Stephen Hemminger <shemminger@vyatta.com>
      CC: Dave Taht <dave.taht@gmail.com>
      Tested-by: Dave Taht <dave.taht@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      ddecf0f4
  10. 06 Jan 2012, 3 commits
    • net_sched: red: split red_parms into parms and vars · eeca6688
      Committed by Eric Dumazet
      This patch splits the red_parms structure into two components.
      
      One holding the RED 'constant' parameters, and one containing the
      variables.
      
      This permits a size reduction of GRED qdisc, and is a preliminary step
      to add an optional RED unit to SFQ.
      
      SFQRED will have a single red_parms structure shared by all flows, and a
      private red_vars per flow.
      Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
      CC: Dave Taht <dave.taht@gmail.com>
      CC: Stephen Hemminger <shemminger@vyatta.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      eeca6688
    • net_sched: sfq: extend limits · 18cb8098
      Committed by Eric Dumazet
      SFQ as implemented in Linux is very limited, with at most 127 flows
      and limit of 127 packets. [ So if 127 flows are active, we have one
      packet per flow ]
      
      This patch brings the following features to SFQ to cope with modern needs.
      
      - Ability to specify a smaller per flow limit of inflight packets.
          (default value being at 127 packets)
      
      - Ability to have up to 65408 active flows (instead of 127)
      
      - Ability to have head drops instead of tail drops
        (to drop old packets from a flow)
      
      Example of use : No more than 20 packets per flow, max 8000 flows, max
      20000 packets in SFQ qdisc, hash table of 65536 slots.
      
      tc qdisc add ... sfq \
              flows 8000 \
              depth 20 \
              headdrop \
              limit 20000 \
              divisor 65536
      
      Ram usage :
      
      2 bytes per hash table entry (instead of previous 1 byte/entry)
      32 bytes per flow on 64bit arches, instead of 384 for QFQ, so much
      better cache hit ratio.
      Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
      CC: Dave Taht <dave.taht@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      18cb8098
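The RAM figures quoted in 18cb8098 can be turned into a rough estimate for a given configuration. The sketch below is only back-of-the-envelope arithmetic: the function name `sfq_table_bytes` is hypothetical, the per-entry and per-flow sizes are the ones stated in the commit message (64-bit arches), and queued packets and other bookkeeping are deliberately ignored.

```python
def sfq_table_bytes(flows, divisor, bytes_per_flow=32, bytes_per_hash_entry=2):
    """Rough table memory: hash table entries plus per-flow state.

    Sizes default to the figures quoted in the commit message; packet
    buffers and any other allocations are not counted (assumption).
    """
    return divisor * bytes_per_hash_entry + flows * bytes_per_flow


# Example configuration from the commit message: flows 8000, divisor 65536.
sfq_bytes = sfq_table_bytes(flows=8000, divisor=65536)

# Per-flow state alone, at the 384 bytes per flow quoted for QFQ.
qfq_flow_bytes = 8000 * 384

print(sfq_bytes, qfq_flow_bytes)
```

For the example configuration this gives 131072 bytes of hash table plus 256000 bytes of flow state, well below the per-flow state that 8000 flows would need at 384 bytes each.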
    • net_sched: Bug in netem reordering · eb101924
      Committed by Hagen Paul Pfeifer
      Not now, but it looks you are correct. q->qdisc is NULL until another
      additional qdisc is attached (beside tfifo). See 50612537.
      The following patch should work.
      
      From: Hagen Paul Pfeifer <hagen@jauu.net>
      
      netem: catch NULL pointer by updating the real qdisc statistic
      Reported-by: Vijay Subramanian <subramanian.vijay@gmail.com>
      Signed-off-by: Hagen Paul Pfeifer <hagen@jauu.net>
      Acked-by: Eric Dumazet <eric.dumazet@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      eb101924
  11. 05 Jan 2012, 2 commits
  12. 04 Jan 2012, 4 commits
  13. 31 Dec 2011, 1 commit
    • netem: fix classful handling · 50612537
      Committed by Eric Dumazet
      Commit 10f6dfcf (Revert "sch_netem: Remove classful functionality")
      reintroduced classful functionality to netem, but broke basic netem
      behavior :
      
      netem uses a t(ime)fifo queue, and stores timestamps in skb->cb[]
      
      If qdisc is changed, time constraints are not respected and other qdisc
      can destroy skb->cb[] and block netem at dequeue time.
      
      Fix this by always using internal tfifo, and optionally attach a child
      qdisc to netem (or a tree of qdiscs)
      
      Example of use :
      
      DEV=eth3
      tc qdisc del dev $DEV root
      tc qdisc add dev $DEV root handle 30: est 1sec 8sec netem delay 20ms 10ms
      tc qdisc add dev $DEV handle 40:0 parent 30:0 tbf \
      	burst 20480 limit 20480 mtu 1514 rate 32000bps
      
      qdisc netem 30: root refcnt 18 limit 1000 delay 20.0ms  10.0ms
       Sent 190792 bytes 413 pkt (dropped 0, overlimits 0 requeues 0)
       rate 18416bit 3pps backlog 0b 0p requeues 0
      qdisc tbf 40: parent 30: rate 256000bit burst 20Kb/8 mpu 0b lat 0us
       Sent 190792 bytes 413 pkt (dropped 6, overlimits 10 requeues 0)
       backlog 0b 5p requeues 0
      Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
      CC: Stephen Hemminger <shemminger@vyatta.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      50612537
  14. 30 Dec 2011, 1 commit
    • sch_tbf: report backlog information · b0460e44
      Committed by Eric Dumazet
      Provide child qdisc backlog (byte count) information so that "tc -s
      qdisc" can report it to user.
      
      qdisc netem 30: root refcnt 18 limit 1000 delay 20.0ms  10.0ms
       Sent 948517 bytes 898 pkt (dropped 0, overlimits 0 requeues 1)
       rate 175056bit 16pps backlog 114b 1p requeues 1
      qdisc tbf 40: parent 30: rate 256000bit burst 20Kb/8 mpu 0b lat 0us
       Sent 948517 bytes 898 pkt (dropped 15, overlimits 611 requeues 0)
       backlog 18168b 12p requeues 0
      Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      b0460e44
  15. 25 Dec 2011, 1 commit
  16. 24 Dec 2011, 2 commits
    • netem: loss model API sizes · 2494654d
      Committed by stephen hemminger
      The new netem loss model is configured with nested netlink messages.
      This code is being overly strict about sizes, and is easily confused
      by padding (or possible future expansion). Also, the message
      for gemodel is incorrect.
      Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      2494654d
    • sch_hfsc: report backlog information · f5a59b73
      Committed by Eric Dumazet
      Add backlog (byte count) information in hfsc classes and qdisc, so that
      "tc -s" can report it to user, instead of 0 values :
      
      qdisc hfsc 1: root refcnt 6 default 20
       Sent 45141660 bytes 30545 pkt (dropped 0, overlimits 91751 requeues 0)
       rate 1492Kbit 126pps backlog 103226b 74p requeues 0
      ...
      class hfsc 1:20 parent 1:1 leaf 1201: rt m1 0bit d 0us m2 400000bit ls m1 0bit d 0us m2 200000bit
       Sent 49534912 bytes 33519 pkt (dropped 0, overlimits 0 requeues 0)
       backlog 81822b 56p requeues 0
       period 23 work 49451576 bytes rtwork 13277552 bytes level 0
      ...
      Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
      CC: John A. Sullivan III <jsullivan@opensourcedevel.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      f5a59b73
  17. 23 Dec 2011, 1 commit
  18. 22 Dec 2011, 1 commit
    • sch_sfq: rehash queues in perturb timer · 225d9b89
      Committed by Eric Dumazet
      A known Out Of Order (OOO) problem hurts SFQ when timer changes
      perturbation value, since all new packets delivered to SFQ enqueue might
      end on different slots than previous in-flight packets.
      
      With round robin delivery, we can thus deliver packets in a different
      order.
      
      Since SFQ is limited to small amount of in-flight packets, we can rehash
      packets so that this OOO problem is fixed.
      
      This rehashing is performed only if internal flow classifier is in use.
      
      We now store in skb->cb[] the "struct flow_keys" so that we dont call
      skb_flow_dissect() again while rehashing.
      Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      225d9b89
  19. 17 Dec 2011, 1 commit
  20. 15 Dec 2011, 1 commit
  21. 13 Dec 2011, 2 commits
    • netem: add cell concept to simulate special MAC behavior · 90b41a1c
      Committed by Hagen Paul Pfeifer
      This extension can be used to simulate special link layer
      characteristics. Simulate because packet data is not modified, only the
      calculation base is changed to delay a packet based on the original
      packet size and artificial cell information.
      
      packet_overhead can be used to simulate a link layer header compression
      scheme (e.g. set packet_overhead to -20) or with a positive
      packet_overhead value an additional MAC header can be simulated. It is
      also possible to "replace" the 14 byte Ethernet header with something
      else.
      
      cell_size and cell_overhead can be used to simulate link layer schemes,
      based on cells, like some TDMA schemes. Another application area is MAC
      schemes using link layer fragmentation with a (small) header per fragment.
      Cell size is the maximum amount of data bytes within one cell. Cell
      overhead is an additional variable to change the per-cell-overhead
      (e.g.  5 byte header per fragment).
      
      Example (5 kbit/s, 20 byte per packet overhead, cell-size 100 byte, per
      cell overhead 5 byte):
      
        tc qdisc add dev eth0 root netem rate 5kbit 20 100 5
      Signed-off-by: Hagen Paul Pfeifer <hagen@jauu.net>
      Signed-off-by: Florian Westphal <fw@strlen.de>
      Acked-by: Stephen Hemminger <shemminger@vyatta.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      90b41a1c
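The size accounting described in 90b41a1c can be sketched as follows. This is an illustration under stated assumptions, not the kernel code: the function names are hypothetical, and the rule used here (add the packet overhead, round the result up to whole cells, charge each cell its size plus the per-cell overhead) is one reading of the commit text. The rate is passed in bytes per second; 5 kbit/s is taken as 625 bytes/s.

```python
def effective_length(packet_bytes, packet_overhead=0, cell_size=0, cell_overhead=0):
    """Length used for the delay calculation; the packet data is not modified."""
    length = packet_bytes + packet_overhead  # overhead may be negative
    if cell_size > 0:
        cells = -(-length // cell_size)  # ceiling division: partial cells count
        length = cells * (cell_size + cell_overhead)
    return length


def transmission_delay_s(length_bytes, rate_bytes_per_s):
    """Time to serialise `length_bytes` on a link of the given rate."""
    return length_bytes / rate_bytes_per_s


# Example from the commit message: 5 kbit/s, 20 byte packet overhead,
# 100 byte cells, 5 byte per-cell overhead, applied to a 1000 byte packet.
length = effective_length(1000, packet_overhead=20, cell_size=100, cell_overhead=5)
delay = transmission_delay_s(length, rate_bytes_per_s=625)
print(length, delay)
```

With these numbers the 1000 byte packet is accounted as 1020 bytes, which needs 11 cells of 105 bytes each, so the delay is computed from 1155 bytes rather than 1000.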
    • sch_gred: should not use GFP_KERNEL while holding a spinlock · 3f1e6d3f
      Committed by Eric Dumazet
      gred_change_vq() is called under sch_tree_lock(sch).
      
      This means a spinlock is held, and we are not allowed to sleep in this
      context.
      
      We might pre-allocate memory using GFP_KERNEL before taking spinlock,
      but this is not suitable for stable material.
      Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      3f1e6d3f
  22. 10 Dec 2011, 1 commit
    • sch_red: generalize accurate MAX_P support to RED/GRED/CHOKE · a73ed26b
      Committed by Eric Dumazet
      Now RED uses a Q0.32 number to store max_p (max probability), allow
      RED/GRED/CHOKE to use/report full resolution at config/dump time.
      
      Old tc binaries are not aware of the new attributes, and still set/get Plog.
      
      New tc binaries set/get both Plog and max_p for backward compatibility;
      they display "probability value" if they get max_p from new kernels.
      
      # tc -d  qdisc show dev ...
      ...
      qdisc red 10: parent 1:1 limit 360Kb min 30Kb max 90Kb ecn ewma 5
      probability 0.09 Scell_log 15
      
      Make sure we avoid potential divides by 0 in reciprocal_value(), if
      (max_th - min_th) is big.
      Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      a73ed26b
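The resolution gain mentioned in a73ed26b can be seen by comparing the two encodings of the probability 0.09 from the tc output. The helpers below are a hypothetical sketch: they assume that Plog encodes max_p as 1/2^Plog and that a Q0.32 value is an unsigned 32-bit integer v representing v/2^32. They are not the kernel or tc code.

```python
import math

Q32 = 1 << 32  # scale factor of an unsigned Q0.32 fixed point number


def to_q032(p):
    """Encode a probability in [0, 1] as Q0.32, saturating just below 1.0."""
    return min(int(p * Q32), Q32 - 1)


def from_q032(v):
    """Decode a Q0.32 value back to a float."""
    return v / Q32


def nearest_plog(p):
    """Closest exponent such that 1 / 2**Plog approximates p."""
    return round(-math.log2(p))


p = 0.09
plog = nearest_plog(p)
print("Plog encoding: ", 1 / 2 ** plog)          # limited to powers of two
print("Q0.32 encoding:", from_q032(to_q032(p)))  # close to the requested value
```

The power-of-two encoding can only offer 1/8 or 1/16 around 0.09, while the Q0.32 value reproduces it to within 2^-32.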
  23. 09 Dec 2011, 1 commit
    • sch_red: Adaptative RED AQM · 8af2a218
      Committed by Eric Dumazet
      Adaptative RED AQM for linux, based on paper from Sally Floyd,
      Ramakrishna Gummadi, and Scott Shenker, August 2001 :
      
      http://icir.org/floyd/papers/adaptiveRed.pdf
      
      Goal of Adaptative RED is to make max_p a dynamic value between 1% and
      50% to reach the target average queue : min_th + (max_th - min_th) / 2
      
      Every 500 ms:
       if (avg > target and max_p <= 0.5)
        increase max_p : max_p += alpha;
       else if (avg < target and max_p >= 0.01)
        decrease max_p : max_p *= beta;
      
      target :[min_th + 0.4*(max_th - min_th),
                min_th + 0.6*(max_th - min_th)].
      alpha : min(0.01, max_p / 4)
      beta : 0.9
      max_P is a Q0.32 fixed point number (unsigned, with 32 bits mantissa)
      
      Changes against our RED implementation are :
      
      max_p is no longer a negative power of two (1/(2^Plog)), but a Q0.32
      fixed point number, to allow the full range described in the Adaptive RED paper.
      
      To deliver a random number, we now use a reciprocal divide (that's really
      a multiply), but this operation is done once per marked/dropped packet
      when in RED_BETWEEN_TRESH window, so added cost (compared to previous
      AND operation) is near zero.
      
      dump operation gives current max_p value in a new TCA_RED_MAX_P
      attribute.
      
      Example on a 10Mbit link :
      
      tc qdisc add dev $DEV parent 1:1 handle 10: est 1sec 8sec red \
         limit 400000 min 30000 max 90000 avpkt 1000 \
         burst 55 ecn adaptative bandwidth 10Mbit
      
      # tc -s -d qdisc show dev eth3
      ...
      qdisc red 10: parent 1:1 limit 400000b min 30000b max 90000b ecn
      adaptative ewma 5 max_p=0.113335 Scell_log 15
       Sent 50414282 bytes 34504 pkt (dropped 35, overlimits 1392 requeues 0)
       rate 9749Kbit 831pps backlog 72056b 16p requeues 0
        marked 1357 early 35 pdrop 0 other 0
      Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      8af2a218
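The periodic adjustment described in 8af2a218 can be sketched in floating point. This is a minimal illustration, not the kernel routine: the function name `adapt_max_p` is hypothetical, probabilities are plain floats rather than Q0.32 values, and the target band is assumed to be the one from the Adaptive RED paper, from min_th + 0.4*(max_th - min_th) to min_th + 0.6*(max_th - min_th).

```python
def adapt_max_p(avg, max_p, min_th, max_th):
    """One adaptation step, meant to run every 500 ms.

    avg is the average queue size; returns the new max_p.
    """
    span = max_th - min_th
    target_low = min_th + 0.4 * span
    target_high = min_th + 0.6 * span

    if avg > target_high and max_p <= 0.5:
        # Queue above target: mark/drop more aggressively (additive increase).
        alpha = min(0.01, max_p / 4)
        max_p += alpha
    elif avg < target_low and max_p >= 0.01:
        # Queue below target: back off (multiplicative decrease, beta = 0.9).
        max_p *= 0.9
    return max_p


# Thresholds from the example in the commit message: min 30000, max 90000.
p = 0.1
for avg in (80000, 80000, 60000, 40000):
    p = adapt_max_p(avg, p, min_th=30000, max_th=90000)
    print(avg, round(p, 4))
```

Additive increase combined with multiplicative decrease keeps max_p moving slowly, so the average queue drifts toward the target band instead of oscillating around it.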
  24. 06 Dec 2011, 1 commit
  25. 02 Dec 2011, 2 commits
    • sch_red: fix red_change · 1ee5fa1e
      Committed by Eric Dumazet
      On Wednesday 30 November 2011 at 14:36 -0800, Stephen Hemminger wrote:
      
      > (Almost) nobody uses RED because they can't figure it out.
      > According to Wikipedia, VJ says that:
      >  "there are not one, but two bugs in classic RED."
      
      RED is useful for high throughput routers, I doubt many linux machines
      act as such devices.
      
      I was considering adding Adaptative RED (Sally Floyd, Ramakrishna
      Gummadi, Scott Shenker), August 2001
      
      In this version, maxp is dynamic (from 1% to 50%), and the user only has to
      set up min_th (target average queue size)
      (max_th and wq (burst in linux RED) are automatically set up)
      
      By the way it seems we have a small bug in red_change()
      
      if (skb_queue_empty(&sch->q))
      	red_end_of_idle_period(&q->parms);
      
      First, if queue is empty, we should call
      red_start_of_idle_period(&q->parms);
      
      Second, since we dont use anymore sch->q, but q->qdisc, the test is
      meaningless.
      
      Oh well...
      
      [PATCH] sch_red: fix red_change()
      
      Now RED is classful, we must check q->qdisc->q.qlen, and if queue is empty,
      we start an idle period, not end it.
      Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      1ee5fa1e
    • netem: fix build error on 32bit arches · fc33cc72
      Committed by Eric Dumazet
      ERROR: "__udivdi3" [net/sched/sch_netem.ko] undefined!
      Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
      Acked-by: Hagen Paul Pfeifer <hagen@jauu.net>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      fc33cc72
  26. 01 Dec 2011, 2 commits
    • netem: rate extension · 7bc0f28c
      Committed by Hagen Paul Pfeifer
      Currently netem is not able to emulate channel bandwidth. Only static
      delay (and optional random jitter) can be configured.
      
      To emulate the channel rate the token bucket filter (sch_tbf) can be used.  But
      TBF has some major emulation flaws. The buffer (token bucket depth/rate) cannot
      be 0. Also the idea behind TBF is that the credit (tokens in the bucket) fills up if
      no packet is transmitted, so that there is always a "positive" credit for new
      packets. In real life this behavior contradicts the law of nature where
      nothing can travel faster than the speed of light. E.g.: on an emulated 1000 byte/s
      link a small IPv4/TCP SYN packet of ~50 bytes requires ~0.05 seconds - not 0
      seconds.
      
      Netem is an excellent place to implement a rate limiting feature: static
      delay is already implemented, tfifo already has time information and the
      user can skip TBF configuration completely.
      
      This patch implements a rate feature which can be configured via tc, e.g.:
      
      	tc qdisc add dev eth0 root netem rate 10kbit
      
      To emulate a link of 5000byte/s and add an additional static delay of 10ms:
      
      	tc qdisc add dev eth0 root netem delay 10ms rate 5KBps
      
      Note: similar to TBF, the rate extension is bound to the kernel timing
      system. Depending on the architecture's timer granularity, higher rates (e.g.
      10mbit/s and higher) tend to produce transmission bursts. Also note: further queues
      live in network adaptors; see ethtool(8).
      Signed-off-by: Hagen Paul Pfeifer <hagen@jauu.net>
      Acked-by: Eric Dumazet <eric.dumazet@gmail.com>
      Signed-off-by: David S. Miller <davem@drr.davemloft.net>
      7bc0f28c
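The serialisation argument in 7bc0f28c is simple arithmetic and can be checked directly. The sketch below uses hypothetical helper names and takes the rate in bytes per second. How netem combines the rate delay with the static delay internally is not spelled out in the commit message, so the plain addition used here is an assumption.

```python
def rate_delay_s(packet_bytes, rate_bytes_per_s):
    """Time a packet occupies a link of the given rate."""
    if rate_bytes_per_s <= 0:
        raise ValueError("rate must be positive")
    return packet_bytes / rate_bytes_per_s


def total_delay_s(packet_bytes, rate_bytes_per_s, static_delay_s=0.0):
    """Rate delay plus a fixed propagation-style delay (assumed additive)."""
    return rate_delay_s(packet_bytes, rate_bytes_per_s) + static_delay_s


# The commit's example: a ~50 byte SYN on a 1000 byte/s link.
print(rate_delay_s(50, 1000))

# A 1500 byte packet on the 5000 byte/s link with 10 ms static delay.
print(total_delay_s(1500, 5000, static_delay_s=0.010))
```

A token bucket with stored credit would let the first packet through with no delay at all, which is the emulation flaw the commit message points out.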
    • sch_teql: fix lockdep splat · f7e57044
      Committed by Eric Dumazet
      We need rcu_read_lock() protection before using dst_get_neighbour(), and
      we must cache its value (pass it to __teql_resolve())
      
      teql_master_xmit() is called under rcu_read_lock_bh() protection, which is
      not enough.
      Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      f7e57044
  27. 30 Nov 2011, 3 commits
  28. 29 Nov 2011, 1 commit