1. 22 12月, 2017 8 次提交
  2. 22 10月, 2017 1 次提交
  3. 17 10月, 2017 1 次提交
  4. 22 9月, 2017 1 次提交
  5. 31 8月, 2017 1 次提交
  6. 26 8月, 2017 1 次提交
    • W
      net_sched: remove tc class reference counting · 143976ce
      WANG Cong 提交于
      For TC classes, their ->get() and ->put() are always paired, and the
      reference counting is completely useless, because:
      
      1) For class modification and dumping paths, we already hold RTNL lock,
         so all of these ->get(),->change(),->put() are atomic.
      
      2) For filter bindiing/unbinding, we use other reference counter than
         this one, and they should have RTNL lock too.
      
      3) For ->qlen_notify(), it is special because it is called on ->enqueue()
         path, but we already hold qdisc tree lock there, and we hold this
         tree lock when graft or delete the class too, so it should not be gone
         or changed until we release the tree lock.
      
      Therefore, this patch removes ->get() and ->put(), but:
      
      1) Adds a new ->find() to find the pointer to a class by classid, no
         refcnt.
      
      2) Move the original class destroy upon the last refcnt into ->delete(),
         right after releasing tree lock. This is fine because the class is
         already removed from hash when holding the lock.
      
      For those who also use ->put() as ->unbind(), just rename them to reflect
      this change.
      
      Cc: Jamal Hadi Salim <jhs@mojatatu.com>
      Signed-off-by: NCong Wang <xiyou.wangcong@gmail.com>
      Acked-by: NJiri Pirko <jiri@mellanox.com>
      Acked-by: NJamal Hadi Salim <jhs@mojatatu.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      143976ce
  7. 17 8月, 2017 2 次提交
  8. 16 8月, 2017 1 次提交
  9. 12 8月, 2017 1 次提交
  10. 07 6月, 2017 1 次提交
  11. 18 5月, 2017 2 次提交
  12. 14 4月, 2017 1 次提交
  13. 13 3月, 2017 1 次提交
  14. 06 12月, 2016 1 次提交
    • E
      net_sched: gen_estimator: complete rewrite of rate estimators · 1c0d32fd
      Eric Dumazet 提交于
      1) Old code was hard to maintain, due to complex lock chains.
         (We probably will be able to remove some kfree_rcu() in callers)
      
      2) Using a single timer to update all estimators does not scale.
      
      3) Code was buggy on 32bit kernel (WRITE_ONCE() on 64bit quantity
         is not supposed to work well)
      
      In this rewrite :
      
      - I removed the RB tree that had to be scanned in
        gen_estimator_active(). qdisc dumps should be much faster.
      
      - Each estimator has its own timer.
      
      - Estimations are maintained in net_rate_estimator structure,
        instead of dirtying the qdisc. Minor, but part of the simplification.
      
      - Reading the estimator uses RCU and a seqcount to provide proper
        support for 32bit kernels.
      
      - We reduce memory need when estimators are not used, since
        we store a pointer, instead of the bytes/packets counters.
      
      - xt_rateest_mt() no longer has to grab a spinlock.
        (In the future, xt_rateest_tg() could be switched to per cpu counters)
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      1c0d32fd
  15. 09 8月, 2016 2 次提交
    • M
      net/sched/sch_hfsc.c: remove unused cl_myfadj · 37088f61
      Michal Soltys 提交于
      The code using this variable has been commented out in the past as it
      was causing issues in upperlimited link-sharing scenarios.
      Signed-off-by: NMichal Soltys <soltys@ziu.info>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      37088f61
    • M
      net/sched/sch_hfsc.c: keep fsc and virtual times in sync; fix an old bug · 678a6241
      Michal Soltys 提交于
      This patch simplifies how we update fsc and calculate vt from it - while
      keeping the expected functionality identical with how hfsc behaves
      curently. It also fixes a certain issue introduced with
      a very old patch.
      
      The idea is, that instead of correcting cl_vt before fsc curve update
      (rtsc_min) and correcting cl_vt after calculation (rtsc_y2x) to keep
      cl_vt local to the current period - we can simply rely on virtual times
      and curve values always being in sync - analogously to how rsc and usc
      function, except that we use virtual time here.
      
      Why hasn't it been done since the beginning this way ? The likely scenario
      (basing on the code trying to correct curves whenever possible) was to
      keep the virtual times as small as possible - as they have tendency to
      "gallop" forward whenever their siblings and other fair sharing
      subtrees are idling. On top of that, current code is subtly bugged, so
      cumulative time (without any corrections) is always kept and used in
      init_vf() when a new backlog period begins (using cl_cvtoff).
      
      Is cumulative value safe ? Generally yes, though corner cases are easy
      to create. For example consider:
      
      1gbit interface
      some 100kbit leaf, everything else idle
      
      With current tick (64ns) 1s is 15625000 ticks, but the leaf is alone and
      it's virtual time, so in reality it's 10000 times more. ITOW 38 bits are
      needed to hold 1 second. 54 - 1 day, 59 - 1 month, 63 - 1 year (all
      logarithms rounded up). It's getting somewhat dangerous, but also
      requires setup excusing this kind of values not mentioning permanently
      backlogged class for a year. In near most extreme case (10gbit, 10kbit
      leaf), we have "enough" to hold ~13.6 days in 64 bits.
      
      Well, the issue remains mostly theoretical and cl_cvtoff has been
      working fine for all those years. Sensible configuration are de-facto
      immune to this issue, and not so sensible can solve it with a cronjob
      and its period inversely proportional to the insanity of such setup =)
      
      Now let's explain the subtle bug mentioned earlier.
      
      The issue is related to how offsets are kept and how we calculate
      virtual times and update fair service curve(s). The issue itself is
      subtle, but easy to observe with long m1 segments. It was introduced in
      rather old patch:
      
      Commit 99296150c7: "[NET_SCHED]: O(1) children vtoff adjustment
      in HFSC scheduler"
      
      (available in git://git.kernel.org/pub/scm/linux/kernel/git/tglx/history.git)
      
      Originally when a new backlog period was started, cl_vtoff of each
      sibling was updated with cl_cvtmax from past period - naturally moving
      all cl_vt to proper starting point. That patch adjusted it so cumulative
      offset is kept in the parent, and there is no need for traversing the
      list (as any subsequent child activation derives new vt from already
      active sibling(s)).
      
      But with this change, cl_vtoff (of each sibling) is no longer persistent
      across the inactivity periods, as it's calculated from parent's
      cl_cvtoff on a new backlog period, conflicting with the following curve
      correction from the previous period:
      
      if (cl->cl_virtual.x == vt) {
              cl->cl_virtual.x -= cl->cl_vtoff;
      	cl->cl_vtoff = 0;
      }
      
      This essentially tries to keep curve as if it was local to the period
      and resets cl_vtoff (cumulative vt offset of the class) to 0 when
      possible (read: when we have an intersection or if a new curve is below
      the old one). But then it's recalculated from cl_cvtoff on next active
      period.  Then rtsc_min() call preceding the above if() doesn't really
      do what we expect it to do in such scenario - as it calculates the
      minimum of corrected curve (from the previous backlog period) and the
      new uncorrected curve (with offset derived from cl_cvtoff).
      
      Example:
      
      tc class add dev $ife parent 1:0 classid 1:1  hfsc ls m2 100mbit ul m2 100mbit
      tc class add dev $ife parent 1:1 classid 1:10 hfsc ls m1 80mbit d 10s m2 20mbit
      tc class add dev $ife parent 1:1 classid 1:11 hfsc ls m2 20mbit
      
      start B, keep it backlogged, let it run 6s (30s worth of vt as A is idle)
      pause B briefly to force cl_cvtoff update in parent (whole 1:1 going idle)
      start A, let it run 10s
      pause A briefly to force rtsc_min()
      
      At this point we would expect A to continue at 20mbit after a brief
      moment of 80mbit. But instead A will use 80mbit for full 10s again. It's
      the effect of first correcting A (during 'start A'), and then - after
      unpausing - calculating rtsc_min() from old corrected and new uncorrected
      curve.
      
      The patch fixes this bug and keepis vt and fsc in sync (virtual times
      are cumulative, not local to the backlog period).
      Signed-off-by: NMichal Soltys <soltys@ziu.info>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      678a6241
  16. 09 7月, 2016 1 次提交
    • F
      hfsc: reduce hfsc_sched to 14 cachelines · bba7eb5d
      Florian Westphal 提交于
      hfsc_sched is huge (size: 920, cachelines: 15), but we can get it to 14
      cachelines by placing level after filter_cnt (covering 4 byte hole) and
      reducing period/nactive/flags to u32 (period is just a counter,
      incremented when class becomes active -- 2**32 is plenty for this
      purpose, also, long is only 32bit wide on 32bit platforms anyway).
      
      cl_vtperiod is exported to userspace via tc_hfsc_stats, but its period
      member is already u32, so no precision is lost there either.
      
      Cc: Michal Soltys <soltys@ziu.info>
      Signed-off-by: NFlorian Westphal <fw@strlen.de>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      bba7eb5d
  17. 01 7月, 2016 5 次提交
  18. 26 6月, 2016 1 次提交
    • E
      net_sched: drop packets after root qdisc lock is released · 520ac30f
      Eric Dumazet 提交于
      Qdisc performance suffers when packets are dropped at enqueue()
      time because drops (kfree_skb()) are done while qdisc lock is held,
      delaying a dequeue() draining the queue.
      
      Nominal throughput can be reduced by 50 % when this happens,
      at a time we would like the dequeue() to proceed as fast as possible.
      
      Even FQ is vulnerable to this problem, while one of FQ goals was
      to provide some flow isolation.
      
      This patch adds a 'struct sk_buff **to_free' parameter to all
      qdisc->enqueue(), and in qdisc_drop() helper.
      
      I measured a performance increase of up to 12 %, but this patch
      is a prereq so that future batches in enqueue() can fly.
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Acked-by: NJesper Dangaard Brouer <brouer@redhat.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      520ac30f
  19. 11 6月, 2016 1 次提交
    • E
      net_sched: remove generic throttled management · 45f50bed
      Eric Dumazet 提交于
      __QDISC_STATE_THROTTLED bit manipulation is rather expensive
      for HTB and few others.
      
      I already removed it for sch_fq in commit f2600cf0
      ("net: sched: avoid costly atomic operation in fq_dequeue()")
      and so far nobody complained.
      
      When one ore more packets are stuck in one or more throttled
      HTB class, a htb dequeue() performs two atomic operations
      to clear/set __QDISC_STATE_THROTTLED bit, while root qdisc
      lock is held.
      
      Removing this pair of atomic operations bring me a 8 % performance
      increase on 200 TCP_RR tests, in presence of throttled classes.
      
      This patch has no side effect, since nothing actually uses
      disc_is_throttled() anymore.
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      45f50bed
  20. 09 6月, 2016 1 次提交
  21. 08 6月, 2016 1 次提交
    • E
      net: sched: do not acquire qdisc spinlock in qdisc/class stats dump · edb09eb1
      Eric Dumazet 提交于
      Large tc dumps (tc -s {qdisc|class} sh dev ethX) done by Google BwE host
      agent [1] are problematic at scale :
      
      For each qdisc/class found in the dump, we currently lock the root qdisc
      spinlock in order to get stats. Sampling stats every 5 seconds from
      thousands of HTB classes is a challenge when the root qdisc spinlock is
      under high pressure. Not only the dumps take time, they also slow
      down the fast path (queue/dequeue packets) by 10 % to 20 % in some cases.
      
      An audit of existing qdiscs showed that sch_fq_codel is the only qdisc
      that might need the qdisc lock in fq_codel_dump_stats() and
      fq_codel_dump_class_stats()
      
      In v2 of this patch, I now use the Qdisc running seqcount to provide
      consistent reads of packets/bytes counters, regardless of 32/64 bit arches.
      
      I also changed rate estimators to use the same infrastructure
      so that they no longer need to lock root qdisc lock.
      
      [1]
      http://static.googleusercontent.com/media/research.google.com/en//pubs/archive/43838.pdfSigned-off-by: NEric Dumazet <edumazet@google.com>
      Cc: Cong Wang <xiyou.wangcong@gmail.com>
      Cc: Jamal Hadi Salim <jhs@mojatatu.com>
      Cc: John Fastabend <john.fastabend@gmail.com>
      Cc: Kevin Athey <kda@google.com>
      Cc: Xiaotian Pei <xiaotian@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      edb09eb1
  22. 04 6月, 2016 1 次提交
  23. 01 3月, 2016 2 次提交
  24. 28 8月, 2015 1 次提交
    • D
      net: sched: consolidate tc_classify{,_compat} · 3b3ae880
      Daniel Borkmann 提交于
      For classifiers getting invoked via tc_classify(), we always need an
      extra function call into tc_classify_compat(), as both are being
      exported as symbols and tc_classify() itself doesn't do much except
      handling of reclassifications when tp->classify() returned with
      TC_ACT_RECLASSIFY.
      
      CBQ and ATM are the only qdiscs that directly call into tc_classify_compat(),
      all others use tc_classify(). When tc actions are being configured
      out in the kernel, tc_classify() effectively does nothing besides
      delegating.
      
      We could spare this layer and consolidate both functions. pktgen on
      single CPU constantly pushing skbs directly into the netif_receive_skb()
      path with a dummy classifier on ingress qdisc attached, improves
      slightly from 22.3Mpps to 23.1Mpps.
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      Acked-by: NAlexei Starovoitov <ast@plumgrid.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      3b3ae880
  25. 30 9月, 2014 1 次提交