1. 07 6月, 2017 1 次提交
  2. 18 5月, 2017 2 次提交
  3. 14 4月, 2017 1 次提交
  4. 13 3月, 2017 1 次提交
  5. 06 12月, 2016 1 次提交
    • E
      net_sched: gen_estimator: complete rewrite of rate estimators · 1c0d32fd
      Eric Dumazet 提交于
      1) Old code was hard to maintain, due to complex lock chains.
         (We probably will be able to remove some kfree_rcu() in callers)
      
      2) Using a single timer to update all estimators does not scale.
      
      3) Code was buggy on 32bit kernel (WRITE_ONCE() on 64bit quantity
         is not supposed to work well)
      
      In this rewrite :
      
      - I removed the RB tree that had to be scanned in
        gen_estimator_active(). qdisc dumps should be much faster.
      
      - Each estimator has its own timer.
      
      - Estimations are maintained in net_rate_estimator structure,
        instead of dirtying the qdisc. Minor, but part of the simplification.
      
      - Reading the estimator uses RCU and a seqcount to provide proper
        support for 32bit kernels.
      
      - We reduce memory need when estimators are not used, since
        we store a pointer, instead of the bytes/packets counters.
      
      - xt_rateest_mt() no longer has to grab a spinlock.
        (In the future, xt_rateest_tg() could be switched to per cpu counters)
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      1c0d32fd
  6. 09 8月, 2016 2 次提交
    • M
      net/sched/sch_hfsc.c: remove unused cl_myfadj · 37088f61
      Michal Soltys 提交于
      The code using this variable has been commented out in the past as it
      was causing issues in upperlimited link-sharing scenarios.
      Signed-off-by: NMichal Soltys <soltys@ziu.info>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      37088f61
    • M
      net/sched/sch_hfsc.c: keep fsc and virtual times in sync; fix an old bug · 678a6241
      Michal Soltys 提交于
      This patch simplifies how we update fsc and calculate vt from it - while
      keeping the expected functionality identical with how hfsc behaves
      curently. It also fixes a certain issue introduced with
      a very old patch.
      
      The idea is, that instead of correcting cl_vt before fsc curve update
      (rtsc_min) and correcting cl_vt after calculation (rtsc_y2x) to keep
      cl_vt local to the current period - we can simply rely on virtual times
      and curve values always being in sync - analogously to how rsc and usc
      function, except that we use virtual time here.
      
      Why hasn't it been done since the beginning this way ? The likely scenario
      (basing on the code trying to correct curves whenever possible) was to
      keep the virtual times as small as possible - as they have tendency to
      "gallop" forward whenever their siblings and other fair sharing
      subtrees are idling. On top of that, current code is subtly bugged, so
      cumulative time (without any corrections) is always kept and used in
      init_vf() when a new backlog period begins (using cl_cvtoff).
      
      Is cumulative value safe ? Generally yes, though corner cases are easy
      to create. For example consider:
      
      1gbit interface
      some 100kbit leaf, everything else idle
      
      With current tick (64ns) 1s is 15625000 ticks, but the leaf is alone and
      it's virtual time, so in reality it's 10000 times more. ITOW 38 bits are
      needed to hold 1 second. 54 - 1 day, 59 - 1 month, 63 - 1 year (all
      logarithms rounded up). It's getting somewhat dangerous, but also
      requires setup excusing this kind of values not mentioning permanently
      backlogged class for a year. In near most extreme case (10gbit, 10kbit
      leaf), we have "enough" to hold ~13.6 days in 64 bits.
      
      Well, the issue remains mostly theoretical and cl_cvtoff has been
      working fine for all those years. Sensible configuration are de-facto
      immune to this issue, and not so sensible can solve it with a cronjob
      and its period inversely proportional to the insanity of such setup =)
      
      Now let's explain the subtle bug mentioned earlier.
      
      The issue is related to how offsets are kept and how we calculate
      virtual times and update fair service curve(s). The issue itself is
      subtle, but easy to observe with long m1 segments. It was introduced in
      rather old patch:
      
      Commit 99296150c7: "[NET_SCHED]: O(1) children vtoff adjustment
      in HFSC scheduler"
      
      (available in git://git.kernel.org/pub/scm/linux/kernel/git/tglx/history.git)
      
      Originally when a new backlog period was started, cl_vtoff of each
      sibling was updated with cl_cvtmax from past period - naturally moving
      all cl_vt to proper starting point. That patch adjusted it so cumulative
      offset is kept in the parent, and there is no need for traversing the
      list (as any subsequent child activation derives new vt from already
      active sibling(s)).
      
      But with this change, cl_vtoff (of each sibling) is no longer persistent
      across the inactivity periods, as it's calculated from parent's
      cl_cvtoff on a new backlog period, conflicting with the following curve
      correction from the previous period:
      
      if (cl->cl_virtual.x == vt) {
              cl->cl_virtual.x -= cl->cl_vtoff;
      	cl->cl_vtoff = 0;
      }
      
      This essentially tries to keep curve as if it was local to the period
      and resets cl_vtoff (cumulative vt offset of the class) to 0 when
      possible (read: when we have an intersection or if a new curve is below
      the old one). But then it's recalculated from cl_cvtoff on next active
      period.  Then rtsc_min() call preceding the above if() doesn't really
      do what we expect it to do in such scenario - as it calculates the
      minimum of corrected curve (from the previous backlog period) and the
      new uncorrected curve (with offset derived from cl_cvtoff).
      
      Example:
      
      tc class add dev $ife parent 1:0 classid 1:1  hfsc ls m2 100mbit ul m2 100mbit
      tc class add dev $ife parent 1:1 classid 1:10 hfsc ls m1 80mbit d 10s m2 20mbit
      tc class add dev $ife parent 1:1 classid 1:11 hfsc ls m2 20mbit
      
      start B, keep it backlogged, let it run 6s (30s worth of vt as A is idle)
      pause B briefly to force cl_cvtoff update in parent (whole 1:1 going idle)
      start A, let it run 10s
      pause A briefly to force rtsc_min()
      
      At this point we would expect A to continue at 20mbit after a brief
      moment of 80mbit. But instead A will use 80mbit for full 10s again. It's
      the effect of first correcting A (during 'start A'), and then - after
      unpausing - calculating rtsc_min() from old corrected and new uncorrected
      curve.
      
      The patch fixes this bug and keepis vt and fsc in sync (virtual times
      are cumulative, not local to the backlog period).
      Signed-off-by: NMichal Soltys <soltys@ziu.info>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      678a6241
  7. 09 7月, 2016 1 次提交
    • F
      hfsc: reduce hfsc_sched to 14 cachelines · bba7eb5d
      Florian Westphal 提交于
      hfsc_sched is huge (size: 920, cachelines: 15), but we can get it to 14
      cachelines by placing level after filter_cnt (covering 4 byte hole) and
      reducing period/nactive/flags to u32 (period is just a counter,
      incremented when class becomes active -- 2**32 is plenty for this
      purpose, also, long is only 32bit wide on 32bit platforms anyway).
      
      cl_vtperiod is exported to userspace via tc_hfsc_stats, but its period
      member is already u32, so no precision is lost there either.
      
      Cc: Michal Soltys <soltys@ziu.info>
      Signed-off-by: NFlorian Westphal <fw@strlen.de>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      bba7eb5d
  8. 01 7月, 2016 5 次提交
  9. 26 6月, 2016 1 次提交
    • E
      net_sched: drop packets after root qdisc lock is released · 520ac30f
      Eric Dumazet 提交于
      Qdisc performance suffers when packets are dropped at enqueue()
      time because drops (kfree_skb()) are done while qdisc lock is held,
      delaying a dequeue() draining the queue.
      
      Nominal throughput can be reduced by 50 % when this happens,
      at a time we would like the dequeue() to proceed as fast as possible.
      
      Even FQ is vulnerable to this problem, while one of FQ goals was
      to provide some flow isolation.
      
      This patch adds a 'struct sk_buff **to_free' parameter to all
      qdisc->enqueue(), and in qdisc_drop() helper.
      
      I measured a performance increase of up to 12 %, but this patch
      is a prereq so that future batches in enqueue() can fly.
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Acked-by: NJesper Dangaard Brouer <brouer@redhat.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      520ac30f
  10. 11 6月, 2016 1 次提交
    • E
      net_sched: remove generic throttled management · 45f50bed
      Eric Dumazet 提交于
      __QDISC_STATE_THROTTLED bit manipulation is rather expensive
      for HTB and few others.
      
      I already removed it for sch_fq in commit f2600cf0
      ("net: sched: avoid costly atomic operation in fq_dequeue()")
      and so far nobody complained.
      
      When one ore more packets are stuck in one or more throttled
      HTB class, a htb dequeue() performs two atomic operations
      to clear/set __QDISC_STATE_THROTTLED bit, while root qdisc
      lock is held.
      
      Removing this pair of atomic operations bring me a 8 % performance
      increase on 200 TCP_RR tests, in presence of throttled classes.
      
      This patch has no side effect, since nothing actually uses
      disc_is_throttled() anymore.
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      45f50bed
  11. 09 6月, 2016 1 次提交
  12. 08 6月, 2016 1 次提交
    • E
      net: sched: do not acquire qdisc spinlock in qdisc/class stats dump · edb09eb1
      Eric Dumazet 提交于
      Large tc dumps (tc -s {qdisc|class} sh dev ethX) done by Google BwE host
      agent [1] are problematic at scale :
      
      For each qdisc/class found in the dump, we currently lock the root qdisc
      spinlock in order to get stats. Sampling stats every 5 seconds from
      thousands of HTB classes is a challenge when the root qdisc spinlock is
      under high pressure. Not only the dumps take time, they also slow
      down the fast path (queue/dequeue packets) by 10 % to 20 % in some cases.
      
      An audit of existing qdiscs showed that sch_fq_codel is the only qdisc
      that might need the qdisc lock in fq_codel_dump_stats() and
      fq_codel_dump_class_stats()
      
      In v2 of this patch, I now use the Qdisc running seqcount to provide
      consistent reads of packets/bytes counters, regardless of 32/64 bit arches.
      
      I also changed rate estimators to use the same infrastructure
      so that they no longer need to lock root qdisc lock.
      
      [1]
      http://static.googleusercontent.com/media/research.google.com/en//pubs/archive/43838.pdfSigned-off-by: NEric Dumazet <edumazet@google.com>
      Cc: Cong Wang <xiyou.wangcong@gmail.com>
      Cc: Jamal Hadi Salim <jhs@mojatatu.com>
      Cc: John Fastabend <john.fastabend@gmail.com>
      Cc: Kevin Athey <kda@google.com>
      Cc: Xiaotian Pei <xiaotian@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      edb09eb1
  13. 04 6月, 2016 1 次提交
  14. 01 3月, 2016 2 次提交
  15. 28 8月, 2015 1 次提交
    • D
      net: sched: consolidate tc_classify{,_compat} · 3b3ae880
      Daniel Borkmann 提交于
      For classifiers getting invoked via tc_classify(), we always need an
      extra function call into tc_classify_compat(), as both are being
      exported as symbols and tc_classify() itself doesn't do much except
      handling of reclassifications when tp->classify() returned with
      TC_ACT_RECLASSIFY.
      
      CBQ and ATM are the only qdiscs that directly call into tc_classify_compat(),
      all others use tc_classify(). When tc actions are being configured
      out in the kernel, tc_classify() effectively does nothing besides
      delegating.
      
      We could spare this layer and consolidate both functions. pktgen on
      single CPU constantly pushing skbs directly into the netif_receive_skb()
      path with a dummy classifier on ingress qdisc attached, improves
      slightly from 22.3Mpps to 23.1Mpps.
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      Acked-by: NAlexei Starovoitov <ast@plumgrid.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      3b3ae880
  16. 30 9月, 2014 4 次提交
  17. 14 9月, 2014 2 次提交
  18. 14 3月, 2014 1 次提交
  19. 11 6月, 2013 1 次提交
    • E
      net_sched: add 64bit rate estimators · 45203a3b
      Eric Dumazet 提交于
      struct gnet_stats_rate_est contains u32 fields, so the bytes per second
      field can wrap at 34360Mbit.
      
      Add a new gnet_stats_rate_est64 structure to get 64bit bps/pps fields,
      and switch the kernel to use this structure natively.
      
      This structure is dumped to user space as a new attribute :
      
      TCA_STATS_RATE_EST64
      
      Old tc command will now display the capped bps (to 34360Mbit), instead
      of wrapped values, and updated tc command will display correct
      information.
      
      Old tc command output, after patch :
      
      eric:~# tc -s -d qd sh dev lo
      qdisc pfifo 8001: root refcnt 2 limit 1000p
       Sent 80868245400 bytes 1978837 pkt (dropped 0, overlimits 0 requeues 0)
       rate 34360Mbit 189696pps backlog 0b 0p requeues 0
      
      This patch carefully reorganizes "struct Qdisc" layout to get optimal
      performance on SMP.
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Cc: Ben Hutchings <bhutchings@solarflare.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      45203a3b
  20. 28 2月, 2013 1 次提交
    • S
      hlist: drop the node parameter from iterators · b67bfe0d
      Sasha Levin 提交于
      I'm not sure why, but the hlist for each entry iterators were conceived
      
              list_for_each_entry(pos, head, member)
      
      The hlist ones were greedy and wanted an extra parameter:
      
              hlist_for_each_entry(tpos, pos, head, member)
      
      Why did they need an extra pos parameter? I'm not quite sure. Not only
      they don't really need it, it also prevents the iterator from looking
      exactly like the list iterator, which is unfortunate.
      
      Besides the semantic patch, there was some manual work required:
      
       - Fix up the actual hlist iterators in linux/list.h
       - Fix up the declaration of other iterators based on the hlist ones.
       - A very small amount of places were using the 'node' parameter, this
       was modified to use 'obj->member' instead.
       - Coccinelle didn't handle the hlist_for_each_entry_safe iterator
       properly, so those had to be fixed up manually.
      
      The semantic patch which is mostly the work of Peter Senna Tschudin is here:
      
      @@
      iterator name hlist_for_each_entry, hlist_for_each_entry_continue, hlist_for_each_entry_from, hlist_for_each_entry_rcu, hlist_for_each_entry_rcu_bh, hlist_for_each_entry_continue_rcu_bh, for_each_busy_worker, ax25_uid_for_each, ax25_for_each, inet_bind_bucket_for_each, sctp_for_each_hentry, sk_for_each, sk_for_each_rcu, sk_for_each_from, sk_for_each_safe, sk_for_each_bound, hlist_for_each_entry_safe, hlist_for_each_entry_continue_rcu, nr_neigh_for_each, nr_neigh_for_each_safe, nr_node_for_each, nr_node_for_each_safe, for_each_gfn_indirect_valid_sp, for_each_gfn_sp, for_each_host;
      
      type T;
      expression a,c,d,e;
      identifier b;
      statement S;
      @@
      
      -T b;
          <+... when != b
      (
      hlist_for_each_entry(a,
      - b,
      c, d) S
      |
      hlist_for_each_entry_continue(a,
      - b,
      c) S
      |
      hlist_for_each_entry_from(a,
      - b,
      c) S
      |
      hlist_for_each_entry_rcu(a,
      - b,
      c, d) S
      |
      hlist_for_each_entry_rcu_bh(a,
      - b,
      c, d) S
      |
      hlist_for_each_entry_continue_rcu_bh(a,
      - b,
      c) S
      |
      for_each_busy_worker(a, c,
      - b,
      d) S
      |
      ax25_uid_for_each(a,
      - b,
      c) S
      |
      ax25_for_each(a,
      - b,
      c) S
      |
      inet_bind_bucket_for_each(a,
      - b,
      c) S
      |
      sctp_for_each_hentry(a,
      - b,
      c) S
      |
      sk_for_each(a,
      - b,
      c) S
      |
      sk_for_each_rcu(a,
      - b,
      c) S
      |
      sk_for_each_from
      -(a, b)
      +(a)
      S
      + sk_for_each_from(a) S
      |
      sk_for_each_safe(a,
      - b,
      c, d) S
      |
      sk_for_each_bound(a,
      - b,
      c) S
      |
      hlist_for_each_entry_safe(a,
      - b,
      c, d, e) S
      |
      hlist_for_each_entry_continue_rcu(a,
      - b,
      c) S
      |
      nr_neigh_for_each(a,
      - b,
      c) S
      |
      nr_neigh_for_each_safe(a,
      - b,
      c, d) S
      |
      nr_node_for_each(a,
      - b,
      c) S
      |
      nr_node_for_each_safe(a,
      - b,
      c, d) S
      |
      - for_each_gfn_sp(a, c, d, b) S
      + for_each_gfn_sp(a, c, d) S
      |
      - for_each_gfn_indirect_valid_sp(a, c, d, b) S
      + for_each_gfn_indirect_valid_sp(a, c, d) S
      |
      for_each_host(a,
      - b,
      c) S
      |
      for_each_host_safe(a,
      - b,
      c, d) S
      |
      for_each_mesh_entry(a,
      - b,
      c, d) S
      )
          ...+>
      
      [akpm@linux-foundation.org: drop bogus change from net/ipv4/raw.c]
      [akpm@linux-foundation.org: drop bogus hunk from net/ipv6/raw.c]
      [akpm@linux-foundation.org: checkpatch fixes]
      [akpm@linux-foundation.org: fix warnings]
      [akpm@linux-foudnation.org: redo intrusive kvm changes]
      Tested-by: NPeter Senna Tschudin <peter.senna@gmail.com>
      Acked-by: NPaul E. McKenney <paulmck@linux.vnet.ibm.com>
      Signed-off-by: NSasha Levin <sasha.levin@oracle.com>
      Cc: Wu Fengguang <fengguang.wu@intel.com>
      Cc: Marcelo Tosatti <mtosatti@redhat.com>
      Cc: Gleb Natapov <gleb@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      b67bfe0d
  21. 11 5月, 2012 1 次提交
  22. 02 4月, 2012 1 次提交
  23. 24 12月, 2011 1 次提交
    • E
      sch_hfsc: report backlog information · f5a59b73
      Eric Dumazet 提交于
      Add backlog (byte count) information in hfsc classes and qdisc, so that
      "tc -s" can report it to user, instead of 0 values :
      
      qdisc hfsc 1: root refcnt 6 default 20
       Sent 45141660 bytes 30545 pkt (dropped 0, overlimits 91751 requeues 0)
       rate 1492Kbit 126pps backlog 103226b 74p requeues 0
      ...
      class hfsc 1:20 parent 1:1 leaf 1201: rt m1 0bit d 0us m2 400000bit ls m1 0bit d 0us m2 200000bit
       Sent 49534912 bytes 33519 pkt (dropped 0, overlimits 0 requeues 0)
       backlog 81822b 56p requeues 0
       period 23 work 49451576 bytes rtwork 13277552 bytes level 0
      ...
      Signed-off-by: NEric Dumazet <eric.dumazet@gmail.com>
      CC: John A. Sullivan III <jsullivan@opensourcedevel.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      f5a59b73
  24. 21 1月, 2011 2 次提交
    • E
      net_sched: accurate bytes/packets stats/rates · 9190b3b3
      Eric Dumazet 提交于
      In commit 44b82883 (net_sched: pfifo_head_drop problem), we fixed
      a problem with pfifo_head drops that incorrectly decreased
      sch->bstats.bytes and sch->bstats.packets
      
      Several qdiscs (CHOKe, SFQ, pfifo_head, ...) are able to drop a
      previously enqueued packet, and bstats cannot be changed, so
      bstats/rates are not accurate (over estimated)
      
      This patch changes the qdisc_bstats updates to be done at dequeue() time
      instead of enqueue() time. bstats counters no longer account for dropped
      frames, and rates are more correct, since enqueue() bursts dont have
      effect on dequeue() rate.
      Signed-off-by: NEric Dumazet <eric.dumazet@gmail.com>
      Acked-by: NStephen Hemminger <shemminger@vyatta.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      9190b3b3
    • E
      net_sched: move TCQ_F_THROTTLED flag · fd245a4a
      Eric Dumazet 提交于
      In commit 37112105 (net: QDISC_STATE_RUNNING dont need atomic bit
      ops) I moved QDISC_STATE_RUNNING flag to __state container, located in
      the cache line containing qdisc lock and often dirtied fields.
      
      I now move TCQ_F_THROTTLED bit too, so that we let first cache line read
      mostly, and shared by all cpus. This should speedup HTB/CBQ for example.
      
      Not using test_bit()/__clear_bit()/__test_and_set_bit allows to use an
      "unsigned int" for __state container, reducing by 8 bytes Qdisc size.
      
      Introduce helpers to hide implementation details.
      Signed-off-by: NEric Dumazet <eric.dumazet@gmail.com>
      CC: Patrick McHardy <kaber@trash.net>
      CC: Jesper Dangaard Brouer <hawk@diku.dk>
      CC: Jarek Poplawski <jarkao2@gmail.com>
      CC: Jamal Hadi Salim <hadi@cyberus.ca>
      CC: Stephen Hemminger <shemminger@vyatta.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      fd245a4a
  25. 20 1月, 2011 1 次提交
  26. 11 1月, 2011 1 次提交
  27. 21 10月, 2010 1 次提交
  28. 02 9月, 2010 1 次提交