1. 10 December 2011, 1 commit
    • sch_red: generalize accurate MAX_P support to RED/GRED/CHOKE · a73ed26b
      Eric Dumazet authored
      Now that RED uses a Q0.32 number to store max_p (max probability), allow
      RED/GRED/CHOKE to use/report full resolution at config/dump time.

      Old tc binaries are not aware of the new attributes and still set/get Plog.

      New tc binaries set/get both Plog and max_p for backward compatibility;
      they display a "probability value" if they get max_p from new kernels.
      
      # tc -d  qdisc show dev ...
      ...
      qdisc red 10: parent 1:1 limit 360Kb min 30Kb max 90Kb ecn ewma 5
      probability 0.09 Scell_log 15
      
      Make sure we avoid a potential divide by 0 in reciprocal_value() if
      (max_th - min_th) is big (see the sketch after this entry).
      Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      a73ed26b
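      For illustration, a minimal user-space sketch (not kernel code) of how a
      probability such as 0.09 maps to a Q0.32 value, and of one way to keep a
      reciprocal-divide setup away from zero when (max_th - min_th) is large;
      all helper names here are made up for the example, and the clamp shown is
      one plausible guard, not necessarily the kernel's exact fix:

      #include <stdint.h>
      #include <stdio.h>

      /* Convert a probability in [0, 1) to Q0.32 (32 fractional bits). */
      static uint32_t prob_to_q0_32(double p)
      {
              if (p <= 0.0)
                      return 0;
              if (p >= 1.0)
                      return UINT32_MAX;              /* saturate just below 1.0 */
              return (uint32_t)(p * 4294967296.0);    /* p * 2^32 */
      }

      int main(void)
      {
              uint32_t max_p = prob_to_q0_32(0.09);   /* ~0.09 as Q0.32 */
              uint32_t min_th = 30 * 1024, max_th = 90 * 1024;
              uint32_t delta = max_th - min_th;

              /*
               * If delta is big, max_p / delta can round down to 0, and a
               * reciprocal_value()-style helper would then divide by zero.
               * Clamp the scaled value to at least 1 before using it.
               */
              uint32_t scaled = max_p / delta;
              if (scaled == 0)
                      scaled = 1;

              printf("max_p = %u (Q0.32), max_p/delta = %u\n",
                     (unsigned)max_p, (unsigned)scaled);
              return 0;
      }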
  2. 09 December 2011, 1 commit
    • sch_red: Adaptative RED AQM · 8af2a218
      Eric Dumazet authored
      Adaptive RED AQM for Linux, based on the paper by Sally Floyd,
      Ramakrishna Gummadi, and Scott Shenker, August 2001:

      http://icir.org/floyd/papers/adaptiveRed.pdf

      The goal of Adaptive RED is to make max_p a dynamic value between 1% and
      50%, so as to reach the target average queue size: min_th + 0.5 * (max_th - min_th)
      
      Every 500 ms:
       if (avg > target and max_p <= 0.5)
        increase max_p : max_p += alpha;
       else if (avg < target and max_p >= 0.01)
        decrease max_p : max_p *= beta;
      
      target : [min_th + 0.4*(max_th - min_th),
                min_th + 0.6*(max_th - min_th)]
      alpha : min(0.01, max_p / 4)
      beta : 0.9
      max_p is a Q0.32 fixed-point number (unsigned, with a 32-bit mantissa)
      
      Changes against our RED implementation are:

      max_p is no longer a negative power of two (1/(2^Plog)), but a Q0.32
      fixed-point number, to allow the full range described in the Adaptive
      RED paper.

      To deliver a random number, we now use a reciprocal divide (that is really
      a multiply), but this operation is done once per marked/dropped packet
      when in the RED_BETWEEN_TRESH window, so the added cost (compared to the
      previous AND operation) is near zero.
      
      The dump operation reports the current max_p value in a new TCA_RED_MAX_P
      attribute (a sketch of the periodic max_p adjustment follows this entry).
      
      Example on a 10Mbit link:
      
      tc qdisc add dev $DEV parent 1:1 handle 10: est 1sec 8sec red \
         limit 400000 min 30000 max 90000 avpkt 1000 \
         burst 55 ecn adaptative bandwidth 10Mbit
      
      # tc -s -d qdisc show dev eth3
      ...
      qdisc red 10: parent 1:1 limit 400000b min 30000b max 90000b ecn
      adaptative ewma 5 max_p=0.113335 Scell_log 15
       Sent 50414282 bytes 34504 pkt (dropped 35, overlimits 1392 requeues 0)
       rate 9749Kbit 831pps backlog 72056b 16p requeues 0
        marked 1357 early 35 pdrop 0 other 0
      Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      8af2a218
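      A minimal user-space sketch of the 500 ms adaptation rule described above,
      using the midpoint of [min_th, max_th] as a simplified target; function
      and variable names are illustrative, not the kernel's:

      #include <stdio.h>

      /* One Adaptive RED adjustment step, nominally run every 500 ms. */
      static double adapt_max_p(double max_p, double avg, double target)
      {
              /* alpha = min(0.01, max_p / 4), beta = 0.9 as described above */
              double alpha = (0.01 < max_p / 4.0) ? 0.01 : max_p / 4.0;
              double beta = 0.9;

              if (avg > target && max_p <= 0.5)
                      max_p += alpha;   /* queue too long: mark/drop more aggressively */
              else if (avg < target && max_p >= 0.01)
                      max_p *= beta;    /* queue too short: back off */
              return max_p;
      }

      int main(void)
      {
              double min_th = 30000, max_th = 90000;
              double target = min_th + 0.5 * (max_th - min_th);
              double max_p = 0.02, avg = 75000;   /* assumed starting point */

              for (int i = 0; i < 5; i++) {
                      max_p = adapt_max_p(max_p, avg, target);
                      printf("step %d: max_p = %.4f\n", i, max_p);
              }
              return 0;
      }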
  3. 06 December 2011, 1 commit
  4. 02 December 2011, 2 commits
    • sch_red: fix red_change · 1ee5fa1e
      Eric Dumazet authored
      On Wednesday, 30 November 2011 at 14:36 -0800, Stephen Hemminger wrote:
      
      > (Almost) nobody uses RED because they can't figure it out.
      > According to Wikipedia, VJ says that:
      >  "there are not one, but two bugs in classic RED."
      
      RED is useful for high-throughput routers; I doubt many Linux machines
      act as such devices.

      I was considering adding Adaptive RED (Sally Floyd, Ramakrishna
      Gummadi, Scott Shenker), August 2001.

      In this version, max_p is dynamic (from 1% to 50%), and the user only has
      to set min_th (the target average queue size); max_th and wq (burst in
      Linux RED) are set up automatically.
      
      By the way, it seems we have a small bug in red_change():
      
      if (skb_queue_empty(&sch->q))
      	red_end_of_idle_period(&q->parms);
      
      First, if the queue is empty, we should call
      red_start_of_idle_period(&q->parms);

      Second, since we no longer use sch->q but q->qdisc, the test is
      meaningless.

      Oh well...
      
      [PATCH] sch_red: fix red_change()
      
      Now that RED is classful, we must check q->qdisc->q.qlen, and if the queue
      is empty, we start an idle period, not end it (see the sketch after this entry).
      Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      1ee5fa1e
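      A hedged, kernel-style sketch of the corrected check described above; it
      mirrors the reasoning in the message (using the identifiers already quoted
      there) rather than the literal patch:

      /* Before (buggy): tested the wrong queue and ended the idle period. */
      if (skb_queue_empty(&sch->q))
              red_end_of_idle_period(&q->parms);

      /*
       * After (as described above): RED is classful, so look at the child
       * qdisc's backlog, and an empty queue means an idle period *starts*.
       */
      if (!q->qdisc->q.qlen)
              red_start_of_idle_period(&q->parms);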
    • netem: fix build error on 32bit arches · fc33cc72
      Eric Dumazet authored
      ERROR: "__udivdi3" [net/sched/sch_netem.ko] undefined!
      Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
      Acked-by: Hagen Paul Pfeifer <hagen@jauu.net>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      fc33cc72
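      The error above is the classic symptom of an open-coded 64-bit division
      compiled on a 32-bit architecture, where gcc emits a call to the libgcc
      helper __udivdi3 that the kernel does not provide. A hedged kernel-style
      sketch of the usual remedy (not the literal netem patch), using the
      kernel's 64-bit division helper:

      #include <linux/math64.h>

      /* Avoid open-coded u64 division: 'bytes / rate' would become a
       * __udivdi3 call on 32-bit architectures. */
      static u64 scaled_rate(u64 bytes, u32 rate)
      {
              return div_u64(bytes, rate);
      }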
  5. 01 December 2011, 2 commits
    • netem: rate extension · 7bc0f28c
      Hagen Paul Pfeifer authored
      Currently netem is not able to emulate channel bandwidth. Only a static
      delay (and optional random jitter) can be configured.

      To emulate the channel rate, the token bucket filter (sch_tbf) can be used,
      but TBF has some major emulation flaws. The buffer (token bucket depth/rate)
      cannot be 0. Also, the idea behind TBF is that the credit (tokens in the
      bucket) fills up if no packet is transmitted, so that there is always a
      "positive" credit for new packets. In real life this behavior contradicts
      the laws of nature, where nothing can travel faster than the speed of light.
      E.g.: on an emulated 1000 byte/s link, a small IPv4/TCP SYN packet of
      ~50 bytes requires ~0.05 seconds, not 0 seconds.
      
      Netem is an excellent place to implement a rate limiting feature: static
      delay is already implemented, tfifo already has timing information, and the
      user can skip TBF configuration completely.

      This patch implements a rate feature which can be configured via tc, e.g.:
      
      	tc qdisc add dev eth0 root netem rate 10kbit
      
      To emulate a link of 5000 byte/s and add an additional static delay of 10 ms:
      
      	tc qdisc add dev eth0 root netem delay 10ms rate 5KBps
      
      Note: similar to TBF, the rate extension is bound to the kernel timing
      system. Depending on the architecture's timer granularity, higher rates
      (e.g. 10Mbit/s and higher) tend to produce transmission bursts. Also note:
      further queues live in network adapters; see ethtool(8). A sketch of the
      per-packet delay computation follows this entry.
      Signed-off-by: Hagen Paul Pfeifer <hagen@jauu.net>
      Acked-by: Eric Dumazet <eric.dumazet@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      7bc0f28c
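      A minimal user-space sketch of the idea: derive a per-packet transmission
      delay from the configured rate and the packet length. Names and units are
      illustrative, not netem's internals:

      #include <stdint.h>
      #include <stdio.h>

      #define NSEC_PER_SEC 1000000000ULL

      /* Time needed to put 'len' bytes on the wire at 'rate' bytes per second. */
      static uint64_t packet_time_ns(uint32_t len, uint64_t rate_Bps)
      {
              if (rate_Bps == 0)
                      return 0;   /* rate 0 means "no rate emulation" */
              return (uint64_t)len * NSEC_PER_SEC / rate_Bps;
      }

      int main(void)
      {
              /* 1000 byte/s link, ~50 byte TCP SYN: ~50 ms on the wire. */
              printf("%llu ns\n", (unsigned long long)packet_time_ns(50, 1000));
              /* 5000 byte/s link, 1500 byte packet: 300 ms. */
              printf("%llu ns\n", (unsigned long long)packet_time_ns(1500, 5000));
              return 0;
      }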
    • sch_teql: fix lockdep splat · f7e57044
      Eric Dumazet authored
      We need rcu_read_lock() protection before using dst_get_neighbour(), and
      we must cache its value (pass it to __teql_resolve()).

      teql_master_xmit() is called under rcu_read_lock_bh() protection, which is
      not enough.
      Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      f7e57044
  6. 30 November 2011, 3 commits
  7. 29 November 2011, 2 commits
  8. 17 November 2011, 1 commit
  9. 09 November 2011, 1 commit
  10. 01 November 2011, 2 commits
  11. 25 October 2011, 1 commit
  12. 16 September 2011, 1 commit
  13. 27 August 2011, 1 commit
  14. 18 August 2011, 1 commit
  15. 10 August 2011, 1 commit
    • net_sched: prio: use qdisc_dequeue_peeked · 3557619f
      Florian Westphal authored
      commit 07bd8df5
      (sch_sfq: fix peek() implementation) changed sfq to use the generic
      peek helper.

      This makes HFSC complain about a non-work-conserving child qdisc if
      prio with an sfq child is used within hfsc:

      hfsc peeks into the prio qdisc, which will then peek into sfq.
      The returned skb is stashed in sch->gso_skb.

      Next, hfsc tries to dequeue from prio, but prio calls sfq's dequeue
      directly, which may return NULL instead of the previously peeked-at skb.

      Have prio call qdisc_dequeue_peeked(), so sfq->dequeue() is
      not called in this case (see the sketch after this entry).
      
      Cc: Eric Dumazet <eric.dumazet@gmail.com>
      Signed-off-by: Florian Westphal <fw@strlen.de>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      3557619f
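      A small user-space model of the peek/dequeue contract involved here: peek
      stashes the item, and a dequeue_peeked-style helper must hand back the
      stashed item instead of dequeuing afresh. Types and names are illustrative,
      not the kernel's:

      #include <stdio.h>

      struct fake_qdisc {
              int items[4];
              int head, tail;
              int stash;       /* plays the role of sch->gso_skb */
              int has_stash;
      };

      static int raw_dequeue(struct fake_qdisc *q)
      {
              return (q->head == q->tail) ? -1 : q->items[q->head++];
      }

      /* Generic peek: dequeue once and stash the result for the next dequeue. */
      static int peek(struct fake_qdisc *q)
      {
              if (!q->has_stash) {
                      q->stash = raw_dequeue(q);
                      q->has_stash = (q->stash != -1);
              }
              return q->has_stash ? q->stash : -1;
      }

      /* What a qdisc_dequeue_peeked()-style helper guarantees: stash first. */
      static int dequeue_peeked(struct fake_qdisc *q)
      {
              if (q->has_stash) {
                      q->has_stash = 0;
                      return q->stash;
              }
              return raw_dequeue(q);
      }

      int main(void)
      {
              struct fake_qdisc q = { .items = {10, 20, 30}, .head = 0, .tail = 3 };

              printf("peek -> %d\n", peek(&q));                    /* 10, now stashed */
              /* Calling raw_dequeue() here (as old prio did) would wrongly give 20. */
              printf("dequeue_peeked -> %d\n", dequeue_peeked(&q)); /* 10 again */
              return 0;
      }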
  16. 01 August 2011, 1 commit
  17. 18 July 2011, 1 commit
  18. 15 July 2011, 2 commits
  19. 06 July 2011, 1 commit
  20. 27 June 2011, 1 commit
    • net_sched: fix dequeuer fairness · d5b8aa1d
      Jamal Hadi Salim authored
      Results on a dummy device can be seen in my netconf 2011
      slides. These results are for a 10GigE Intel IXGBE NIC, on another
      i5 machine with very similar specs to the one used in the netconf 2011
      results. It turns out this is a hell of a lot worse than dummy,
      and so this patch is even more beneficial for 10G.
      
      Test setup:
      ----------
      
      System under test sending packets out.
      An additional box connected directly, dropping packets.
      Installed a prio qdisc on the eth device; the default netdev
      queue length of 1000 was used as is.
      The 3 prio bands were each set to 100 (this didn't factor into
      the results).

      5 packet runs were made and the middle 3 were picked.
      
      results
      -------
      
      The "cpu" column indicates the which cpu the sample
      was taken on,
      The "Pkt runx" carries the number of packets a cpu
      dequeued when forced to be in the "dequeuer" role.
      The "avg" for each run is the number of times each
      cpu should be a "dequeuer" if the system was fair.
      
      3.0-rc4      (plain)
      cpu         Pkt run1        Pkt run2        Pkt run3
      ================================================
      cpu0        21853354        21598183        22199900
      cpu1          431058          473476          393159
      cpu2          481975          477529          458466
      cpu3        23261406        23412299        22894315
      avg         11506948        11490372        11486460
      
      3.0-rc4 with patch and default weight 64
      cpu         Pkt run1        Pkt run2        Pkt run3
      ================================================
      cpu0        13205312        13109359        13132333
      cpu1        10189914        10159127        10122270
      cpu2        10213871        10124367        10168722
      cpu3        13165760        13164767        13096705
      avg         11693714        11639405        11630008
      
      As you can see, the system is still not perfect, but it is a lot
      better than what it was before...

      At the moment we use the old backlog weight, weight_p, which is 64
      packets. It seems to be reasonably fine with that value.
      The system could be made more fair if we reduced weight_p (as per
      my presentation), but that would affect the shared backlog weight.
      Unless deemed necessary, I think the default value is fine. If not,
      we could add yet another knob. A sketch of the dequeue quota idea
      follows this entry.
      Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      d5b8aa1d
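      A hedged, user-space model of the fairness idea described above: the CPU
      holding the qdisc dequeues at most a fixed quota (weight_p, 64 by default)
      of packets before giving other CPUs a chance. This models the reasoning
      only; it is not the actual __qdisc_run() code:

      #include <stdio.h>

      #define WEIGHT_P 64   /* default dequeue quota, as described above */

      /* Pretend each call transmits one packet; return 0 when the queue drains. */
      static int qdisc_restart_sim(int *backlog)
      {
              if (*backlog <= 0)
                      return 0;
              (*backlog)--;
              return 1;
      }

      /* One "dequeuer" stint: stop after the quota so another cpu can take over. */
      static int run_qdisc_once(int *backlog)
      {
              int quota = WEIGHT_P;
              int sent = 0;

              while (qdisc_restart_sim(backlog)) {
                      sent++;
                      if (--quota <= 0)
                              break;   /* reschedule instead of hogging the qdisc */
              }
              return sent;
      }

      int main(void)
      {
              int backlog = 1000;
              int stints = 0;

              while (backlog > 0) {
                      run_qdisc_once(&backlog);
                      stints++;
              }
              printf("drained in %d stints of at most %d packets\n",
                     stints, WEIGHT_P);
              return 0;
      }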
  21. 22 June 2011, 2 commits
  22. 10 June 2011, 1 commit
    • rtnetlink: Compute and store minimum ifinfo dump size · c7ac8679
      Greg Rose authored
      The message size allocated for rtnl ifinfo dumps was limited to
      a single page. This is not enough for the additional interface info
      available with devices that support SR-IOV, and it caused a bug in
      which VF info would not be displayed if more than approximately
      40 VFs were created per interface.
      
      Implement a new function pointer for the rtnl_register service that will
      calculate the amount of data required for the ifinfo dump and allocate
      enough data to satisfy the request.
      Signed-off-by: Greg Rose <gregory.v.rose@intel.com>
      Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
      c7ac8679
  23. 07 June 2011, 2 commits
  24. 26 May 2011, 1 commit
  25. 24 May 2011, 1 commit
    • sch_sfq: avoid giving spurious NET_XMIT_CN signals · 8efa8854
      Eric Dumazet authored
      While chasing a possible net_sched bug, I found that IP fragments have
      little chance of passing a congested SFQ qdisc:

      - Say the SFQ qdisc is full because one flow is non-responsive.
      - ip_fragment() wants to send two fragments belonging to an idle flow.
      - sfq_enqueue() queues the first packet, but sees the queue limit reached:
      - sfq_enqueue() drops one packet from the 'big consumer', and returns
      NET_XMIT_CN.
      - ip_fragment() cancels the remaining fragments.

      This patch restores fairness, making sure we return NET_XMIT_CN only if
      we dropped a packet from the same flow (see the sketch after this entry).
      Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
      CC: Patrick McHardy <kaber@trash.net>
      CC: Jarek Poplawski <jarkao2@gmail.com>
      CC: Jamal Hadi Salim <hadi@cyberus.ca>
      CC: Stephen Hemminger <shemminger@vyatta.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      8efa8854
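      A minimal user-space model of the decision described above: after dropping
      a packet from the longest flow to make room, report congestion only when
      the dropped packet belonged to the same flow as the one being enqueued.
      The constants and names are illustrative, not the kernel's:

      #include <stdio.h>

      #define NET_XMIT_SUCCESS_SIM 0
      #define NET_XMIT_CN_SIM      2

      /*
       * 'enqueued_flow' is the hash of the flow we just queued a packet for;
       * 'dropped_flow' is the hash of the flow we dropped from to stay under
       * the limit. Only signal congestion to the sender whose flow paid.
       */
      static int enqueue_verdict(unsigned int enqueued_flow, unsigned int dropped_flow)
      {
              return (enqueued_flow == dropped_flow) ? NET_XMIT_CN_SIM
                                                     : NET_XMIT_SUCCESS_SIM;
      }

      int main(void)
      {
              /* Idle flow 7 enqueues while greedy flow 3 gets dropped: no CN. */
              printf("verdict = %d\n", enqueue_verdict(7, 3));
              /* Greedy flow 3 enqueues and its own packet is dropped: CN. */
              printf("verdict = %d\n", enqueue_verdict(3, 3));
              return 0;
      }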
  26. 23 May 2011, 1 commit
    • net: avoid synchronize_rcu() in dev_deactivate_many · 3137663d
      Eric Dumazet authored
      dev_deactivate_many() issues one synchronize_rcu() call after the qdiscs
      are set to noop_qdisc.

      This call is here to make sure there are no outstanding qdisc-less
      dev_queue_xmit() calls before returning to the caller.

      But in the dismantle phase we don't have to wait, because we won't activate
      the device again, and we are going to wait one RCU grace period later in
      rollback_registered_many().

      After this patch, device dismantle uses only one synchronize_net() and one
      rcu_barrier() call, so we get a ~30% speedup and a smaller RTNL
      latency.
      Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
      CC: Patrick McHardy <kaber@trash.net>
      CC: Ben Greear <greearb@candelatech.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      3137663d
  27. 20 May 2011, 2 commits
  28. 08 May 2011, 3 commits