1. 06 12月, 2016 1 次提交
    • E
      net_sched: gen_estimator: complete rewrite of rate estimators · 1c0d32fd
      Eric Dumazet 提交于
      1) Old code was hard to maintain, due to complex lock chains.
         (We probably will be able to remove some kfree_rcu() in callers)
      
      2) Using a single timer to update all estimators does not scale.
      
      3) Code was buggy on 32bit kernel (WRITE_ONCE() on 64bit quantity
         is not supposed to work well)
      
      In this rewrite :
      
      - I removed the RB tree that had to be scanned in
        gen_estimator_active(). qdisc dumps should be much faster.
      
      - Each estimator has its own timer.
      
      - Estimations are maintained in net_rate_estimator structure,
        instead of dirtying the qdisc. Minor, but part of the simplification.
      
      - Reading the estimator uses RCU and a seqcount to provide proper
        support for 32bit kernels.
      
      - We reduce memory need when estimators are not used, since
        we store a pointer, instead of the bytes/packets counters.
      
      - xt_rateest_mt() no longer has to grab a spinlock.
        (In the future, xt_rateest_tg() could be switched to per cpu counters)
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      1c0d32fd
  2. 27 10月, 2016 1 次提交
  3. 19 9月, 2016 1 次提交
  4. 19 7月, 2016 1 次提交
  5. 26 6月, 2016 2 次提交
    • E
      net_sched: sch_htb: export class backlog in dumps · 338ed9b4
      Eric Dumazet 提交于
      We already get child qdisc qlen, we also can get its backlog
      so that class dumps can report it.
      
      Also replace qstats by a single drop counter, but move it in
      a separate cache line so that drops do not dirty useful cache lines.
      
      Tested:
      
      $ tc -s cl sh dev eth0
      class htb 1:1 root leaf 3: prio 0 rate 1Gbit ceil 1Gbit burst 500000b cburst 500000b
       Sent 2183346912 bytes 9021815 pkt (dropped 2340774, overlimits 0 requeues 0)
       rate 1001Mbit 517543pps backlog 120758b 499p requeues 0
       lended: 9021770 borrowed: 0 giants: 0
       tokens: 9 ctokens: 9
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      338ed9b4
    • E
      net_sched: drop packets after root qdisc lock is released · 520ac30f
      Eric Dumazet 提交于
      Qdisc performance suffers when packets are dropped at enqueue()
      time because drops (kfree_skb()) are done while qdisc lock is held,
      delaying a dequeue() draining the queue.
      
      Nominal throughput can be reduced by 50 % when this happens,
      at a time we would like the dequeue() to proceed as fast as possible.
      
      Even FQ is vulnerable to this problem, while one of FQ goals was
      to provide some flow isolation.
      
      This patch adds a 'struct sk_buff **to_free' parameter to all
      qdisc->enqueue(), and in qdisc_drop() helper.
      
      I measured a performance increase of up to 12 %, but this patch
      is a prereq so that future batches in enqueue() can fly.
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Acked-by: NJesper Dangaard Brouer <brouer@redhat.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      520ac30f
  6. 16 6月, 2016 2 次提交
  7. 11 6月, 2016 1 次提交
    • E
      net_sched: remove generic throttled management · 45f50bed
      Eric Dumazet 提交于
      __QDISC_STATE_THROTTLED bit manipulation is rather expensive
      for HTB and few others.
      
      I already removed it for sch_fq in commit f2600cf0
      ("net: sched: avoid costly atomic operation in fq_dequeue()")
      and so far nobody complained.
      
      When one ore more packets are stuck in one or more throttled
      HTB class, a htb dequeue() performs two atomic operations
      to clear/set __QDISC_STATE_THROTTLED bit, while root qdisc
      lock is held.
      
      Removing this pair of atomic operations bring me a 8 % performance
      increase on 200 TCP_RR tests, in presence of throttled classes.
      
      This patch has no side effect, since nothing actually uses
      disc_is_throttled() anymore.
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      45f50bed
  8. 09 6月, 2016 1 次提交
  9. 08 6月, 2016 1 次提交
    • E
      net: sched: do not acquire qdisc spinlock in qdisc/class stats dump · edb09eb1
      Eric Dumazet 提交于
      Large tc dumps (tc -s {qdisc|class} sh dev ethX) done by Google BwE host
      agent [1] are problematic at scale :
      
      For each qdisc/class found in the dump, we currently lock the root qdisc
      spinlock in order to get stats. Sampling stats every 5 seconds from
      thousands of HTB classes is a challenge when the root qdisc spinlock is
      under high pressure. Not only the dumps take time, they also slow
      down the fast path (queue/dequeue packets) by 10 % to 20 % in some cases.
      
      An audit of existing qdiscs showed that sch_fq_codel is the only qdisc
      that might need the qdisc lock in fq_codel_dump_stats() and
      fq_codel_dump_class_stats()
      
      In v2 of this patch, I now use the Qdisc running seqcount to provide
      consistent reads of packets/bytes counters, regardless of 32/64 bit arches.
      
      I also changed rate estimators to use the same infrastructure
      so that they no longer need to lock root qdisc lock.
      
      [1]
      http://static.googleusercontent.com/media/research.google.com/en//pubs/archive/43838.pdfSigned-off-by: NEric Dumazet <edumazet@google.com>
      Cc: Cong Wang <xiyou.wangcong@gmail.com>
      Cc: Jamal Hadi Salim <jhs@mojatatu.com>
      Cc: John Fastabend <john.fastabend@gmail.com>
      Cc: Kevin Athey <kda@google.com>
      Cc: Xiaotian Pei <xiaotian@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      edb09eb1
  10. 25 5月, 2016 1 次提交
    • E
      net_sched: avoid too many hrtimer_start() calls · a9efad8b
      Eric Dumazet 提交于
      I found a serious performance bug in packet schedulers using hrtimers.
      
      sch_htb and sch_fq are definitely impacted by this problem.
      
      We constantly rearm high resolution timers if some packets are throttled
      in one (or more) class, and other packets are flying through qdisc on
      another (non throttled) class.
      
      hrtimer_start() does not have the mod_timer() trick of doing nothing if
      expires value does not change :
      
      	if (timer_pending(timer) &&
                  timer->expires == expires)
                      return 1;
      
      This issue is particularly visible when multiple cpus can queue/dequeue
      packets on the same qdisc, as hrtimer code has to lock a remote base.
      
      I used following fix :
      
      1) Change htb to use qdisc_watchdog_schedule_ns() instead of open-coding
      it.
      
      2) Cache watchdog prior expiration. hrtimer might provide this, but I
      prefer to not rely on some hrtimer internal.
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      a9efad8b
  11. 26 4月, 2016 1 次提交
  12. 01 3月, 2016 3 次提交
  13. 28 8月, 2015 1 次提交
    • D
      net: sched: consolidate tc_classify{,_compat} · 3b3ae880
      Daniel Borkmann 提交于
      For classifiers getting invoked via tc_classify(), we always need an
      extra function call into tc_classify_compat(), as both are being
      exported as symbols and tc_classify() itself doesn't do much except
      handling of reclassifications when tp->classify() returned with
      TC_ACT_RECLASSIFY.
      
      CBQ and ATM are the only qdiscs that directly call into tc_classify_compat(),
      all others use tc_classify(). When tc actions are being configured
      out in the kernel, tc_classify() effectively does nothing besides
      delegating.
      
      We could spare this layer and consolidate both functions. pktgen on
      single CPU constantly pushing skbs directly into the netif_receive_skb()
      path with a dummy classifier on ingress qdisc attached, improves
      slightly from 22.3Mpps to 23.1Mpps.
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      Acked-by: NAlexei Starovoitov <ast@plumgrid.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      3b3ae880
  14. 19 8月, 2015 1 次提交
  15. 30 9月, 2014 4 次提交
  16. 26 9月, 2014 1 次提交
    • E
      net: sched: use pinned timers · 4a8e320c
      Eric Dumazet 提交于
      While using a MQ + NETEM setup, I had confirmation that the default
      timer migration ( /proc/sys/kernel/timer_migration ) is killing us.
      
      Installing this on a receiver side of a TCP_STREAM test, (NIC has 8 TX
      queues) :
      
      EST="est 1sec 4sec"
      for ETH in eth1
      do
       tc qd del dev $ETH root 2>/dev/null
       tc qd add dev $ETH root handle 1: mq
       tc qd add dev $ETH parent 1:1 $EST netem limit 70000 delay 6ms
       tc qd add dev $ETH parent 1:2 $EST netem limit 70000 delay 8ms
       tc qd add dev $ETH parent 1:3 $EST netem limit 70000 delay 10ms
       tc qd add dev $ETH parent 1:4 $EST netem limit 70000 delay 12ms
       tc qd add dev $ETH parent 1:5 $EST netem limit 70000 delay 14ms
       tc qd add dev $ETH parent 1:6 $EST netem limit 70000 delay 16ms
       tc qd add dev $ETH parent 1:7 $EST netem limit 80000 delay 18ms
       tc qd add dev $ETH parent 1:8 $EST netem limit 90000 delay 20ms
      done
      
      We can see that timers get migrated into a single cpu, presumably idle
      at the time timers are set up.
      Then all qdisc dequeues run from this cpu and huge lock contention
      happens. This single cpu is stuck in softirq mode and cannot dequeue
      fast enough.
      
          39.24%  [kernel]          [k] _raw_spin_lock
           2.65%  [kernel]          [k] netem_enqueue
           1.80%  [kernel]          [k] netem_dequeue
           1.63%  [kernel]          [k] copy_user_enhanced_fast_string
           1.45%  [kernel]          [k] _raw_spin_lock_bh
      
      By pinning qdisc timers on the cpu running the qdisc, we respect proper
      XPS setting and remove this lock contention.
      
           5.84%  [kernel]          [k] netem_enqueue
           4.83%  [kernel]          [k] _raw_spin_lock
           2.92%  [kernel]          [k] copy_user_enhanced_fast_string
      
      Current Qdiscs that benefit from this change are :
      
      	netem, cbq, fq, hfsc, tbf, htb.
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      4a8e320c
  17. 20 9月, 2014 1 次提交
  18. 14 9月, 2014 2 次提交
  19. 23 8月, 2014 1 次提交
  20. 07 3月, 2014 1 次提交
  21. 23 1月, 2014 1 次提交
  22. 01 1月, 2014 2 次提交
  23. 12 12月, 2013 2 次提交
  24. 21 9月, 2013 2 次提交
  25. 12 9月, 2013 1 次提交
  26. 15 8月, 2013 1 次提交
    • J
      net_sched: restore "linklayer atm" handling · 8a8e3d84
      Jesper Dangaard Brouer 提交于
      commit 56b765b7 ("htb: improved accuracy at high rates")
      broke the "linklayer atm" handling.
      
       tc class add ... htb rate X ceil Y linklayer atm
      
      The linklayer setting is implemented by modifying the rate table
      which is send to the kernel.  No direct parameter were
      transferred to the kernel indicating the linklayer setting.
      
      The commit 56b765b7 ("htb: improved accuracy at high rates")
      removed the use of the rate table system.
      
      To keep compatible with older iproute2 utils, this patch detects
      the linklayer by parsing the rate table.  It also supports future
      versions of iproute2 to send this linklayer parameter to the
      kernel directly. This is done by using the __reserved field in
      struct tc_ratespec, to convey the choosen linklayer option, but
      only using the lower 4 bits of this field.
      
      Linklayer detection is limited to speeds below 100Mbit/s, because
      at high rates the rtab is gets too inaccurate, so bad that
      several fields contain the same values, this resembling the ATM
      detect.  Fields even start to contain "0" time to send, e.g. at
      1000Mbit/s sending a 96 bytes packet cost "0", thus the rtab have
      been more broken than we first realized.
      Signed-off-by: NJesper Dangaard Brouer <brouer@redhat.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      8a8e3d84
  27. 03 8月, 2013 1 次提交
  28. 20 6月, 2013 1 次提交
    • E
      htb: refactor struct htb_sched fields for performance · c9364636
      Eric Dumazet 提交于
      htb_sched structures are big, and source of false sharing on SMP.
      
      Every time a packet is queued or dequeue, many cache lines must be
      touched because structures are not lay out properly.
      
      By carefully splitting htb_sched in two parts, and define sub structures
      to increase data locality, we can improve performance dramatically on
      SMP.
      
      New htb_prio structure can also be used in htb_class to increase data
      locality.
      
      I got 26 % performance increase on a 24 threads machine, with 200
      concurrent netperf in TCP_RR mode, using a HTB hierarchy of 4 classes.
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Cc: Tom Herbert <therbert@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      c9364636
  29. 14 6月, 2013 1 次提交