提交 · e25ea21ffa66a029acfa89d2611c0e7ef23e7d8c · openanolis / cloud-kernel

07 6月, 2017 1 次提交

net: sched: introduce a TRAP control action · e25ea21f

由 Jiri Pirko 提交于 6月 06, 2017

There is need to instruct the HW offloaded path to push certain matched
packets to cpu/kernel for further analysis. So this patch introduces a
new TRAP control action to TC.

For kernel datapath, this action does not make much sense. So with the
same logic as in HW, new TRAP behaves similar to STOLEN. The skb is just
dropped in the datapath (and virtually ejected to an upper level, which
does not exist in case of kernel).
Signed-off-by: NJiri Pirko <jiri@mellanox.com>
Reviewed-by: NYotam Gigi <yotamg@mellanox.com>
Reviewed-by: NAndrew Lunn <andrew@lunn.ch>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

e25ea21f

18 5月, 2017 2 次提交

net: sched: introduce tcf block infractructure · 6529eaba

由 Jiri Pirko 提交于 5月 17, 2017

Currently, the filter chains are direcly put into the private structures
of qdiscs. In order to be able to have multiple chains per qdisc and to
allow filter chains sharing among qdiscs, there is a need for common
object that would hold the chains. This introduces such object and calls
it "tcf_block".

Helpers to get and put the blocks are provided to be called from
individual qdisc code. Also, the original filter_list pointers are left
in qdisc privs to allow the entry into tcf_block processing without any
added overhead of possible multiple pointer dereference on fast path.
Signed-off-by: NJiri Pirko <jiri@mellanox.com>
Acked-by: NJamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

6529eaba

net: sched: move tc_classify function to cls_api.c · 87d83093

由 Jiri Pirko 提交于 5月 17, 2017

Move tc_classify function to cls_api.c where it belongs, rename it to
fit the namespace.
Signed-off-by: NJiri Pirko <jiri@mellanox.com>
Acked-by: NJamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

87d83093

14 4月, 2017 1 次提交

netlink: pass extended ACK struct to parsing functions · fceb6435

由 Johannes Berg 提交于 4月 12, 2017

Pass the new extended ACK reporting struct to all of the generic
netlink parsing functions. For now, pass NULL in almost all callers
(except for some in the core.)
Signed-off-by: NJohannes Berg <johannes.berg@intel.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

fceb6435

13 3月, 2017 1 次提交

net: sched: make default fifo qdiscs appear in the dump · 49b49971

由 Jiri Kosina 提交于 3月 08, 2017

The original reason [1] for having hidden qdiscs (potential scalability
issues in qdisc_match_from_root() with single linked list in case of large
amount of qdiscs) has been invalidated by 59cc1f61 ("net: sched: convert
qdisc linked list to hashtable").

This allows us for bringing more clarity and determinism into the dump by
making default pfifo qdiscs visible.

We're not turning this on by default though, at it was deemed [2] too
intrusive / unnecessary change of default behavior towards userspace.
Instead, TCA_DUMP_INVISIBLE netlink attribute is introduced, which allows
applications to request complete qdisc hierarchy dump, including the
ones that have always been implicit/invisible.

Singleton noop_qdisc stays invisible, as teaching the whole infrastructure
about singletons would require quite some surgery with very little gain
(seeing no qdisc or seeing noop qdisc in the dump is probably setting
the same user expectation).

[1] http://lkml.kernel.org/r/1460732328.10638.74.camel@edumazet-glaptop3.roam.corp.google.com
[2] http://lkml.kernel.org/r/20161021.105935.1907696543877061916.davem@davemloft.netSigned-off-by: NJiri Kosina <jkosina@suse.cz>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

49b49971

06 12月, 2016 1 次提交

net_sched: gen_estimator: complete rewrite of rate estimators · 1c0d32fd

由 Eric Dumazet 提交于 12月 04, 2016

1) Old code was hard to maintain, due to complex lock chains.
   (We probably will be able to remove some kfree_rcu() in callers)

2) Using a single timer to update all estimators does not scale.

3) Code was buggy on 32bit kernel (WRITE_ONCE() on 64bit quantity
   is not supposed to work well)

In this rewrite :

- I removed the RB tree that had to be scanned in
  gen_estimator_active(). qdisc dumps should be much faster.

- Each estimator has its own timer.

- Estimations are maintained in net_rate_estimator structure,
  instead of dirtying the qdisc. Minor, but part of the simplification.

- Reading the estimator uses RCU and a seqcount to provide proper
  support for 32bit kernels.

- We reduce memory need when estimators are not used, since
  we store a pointer, instead of the bytes/packets counters.

- xt_rateest_mt() no longer has to grab a spinlock.
  (In the future, xt_rateest_tg() could be switched to per cpu counters)
Signed-off-by: NEric Dumazet <edumazet@google.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

1c0d32fd

09 8月, 2016 2 次提交

net/sched/sch_hfsc.c: remove unused cl_myfadj · 37088f61

由 Michal Soltys 提交于 8月 03, 2016

The code using this variable has been commented out in the past as it
was causing issues in upperlimited link-sharing scenarios.
Signed-off-by: NMichal Soltys <soltys@ziu.info>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

37088f61

net/sched/sch_hfsc.c: keep fsc and virtual times in sync; fix an old bug · 678a6241

由 Michal Soltys 提交于 8月 03, 2016

This patch simplifies how we update fsc and calculate vt from it - while
keeping the expected functionality identical with how hfsc behaves
curently. It also fixes a certain issue introduced with
a very old patch.

The idea is, that instead of correcting cl_vt before fsc curve update
(rtsc_min) and correcting cl_vt after calculation (rtsc_y2x) to keep
cl_vt local to the current period - we can simply rely on virtual times
and curve values always being in sync - analogously to how rsc and usc
function, except that we use virtual time here.

Why hasn't it been done since the beginning this way ? The likely scenario
(basing on the code trying to correct curves whenever possible) was to
keep the virtual times as small as possible - as they have tendency to
"gallop" forward whenever their siblings and other fair sharing
subtrees are idling. On top of that, current code is subtly bugged, so
cumulative time (without any corrections) is always kept and used in
init_vf() when a new backlog period begins (using cl_cvtoff).

Is cumulative value safe ? Generally yes, though corner cases are easy
to create. For example consider:

1gbit interface
some 100kbit leaf, everything else idle

With current tick (64ns) 1s is 15625000 ticks, but the leaf is alone and
it's virtual time, so in reality it's 10000 times more. ITOW 38 bits are
needed to hold 1 second. 54 - 1 day, 59 - 1 month, 63 - 1 year (all
logarithms rounded up). It's getting somewhat dangerous, but also
requires setup excusing this kind of values not mentioning permanently
backlogged class for a year. In near most extreme case (10gbit, 10kbit
leaf), we have "enough" to hold ~13.6 days in 64 bits.

Well, the issue remains mostly theoretical and cl_cvtoff has been
working fine for all those years. Sensible configuration are de-facto
immune to this issue, and not so sensible can solve it with a cronjob
and its period inversely proportional to the insanity of such setup =)

Now let's explain the subtle bug mentioned earlier.

The issue is related to how offsets are kept and how we calculate
virtual times and update fair service curve(s). The issue itself is
subtle, but easy to observe with long m1 segments. It was introduced in
rather old patch:

Commit 99296150c7: "[NET_SCHED]: O(1) children vtoff adjustment
in HFSC scheduler"

(available in git://git.kernel.org/pub/scm/linux/kernel/git/tglx/history.git)

Originally when a new backlog period was started, cl_vtoff of each
sibling was updated with cl_cvtmax from past period - naturally moving
all cl_vt to proper starting point. That patch adjusted it so cumulative
offset is kept in the parent, and there is no need for traversing the
list (as any subsequent child activation derives new vt from already
active sibling(s)).

But with this change, cl_vtoff (of each sibling) is no longer persistent
across the inactivity periods, as it's calculated from parent's
cl_cvtoff on a new backlog period, conflicting with the following curve
correction from the previous period:

if (cl->cl_virtual.x == vt) {
        cl->cl_virtual.x -= cl->cl_vtoff;
	cl->cl_vtoff = 0;
}

This essentially tries to keep curve as if it was local to the period
and resets cl_vtoff (cumulative vt offset of the class) to 0 when
possible (read: when we have an intersection or if a new curve is below
the old one). But then it's recalculated from cl_cvtoff on next active
period.  Then rtsc_min() call preceding the above if() doesn't really
do what we expect it to do in such scenario - as it calculates the
minimum of corrected curve (from the previous backlog period) and the
new uncorrected curve (with offset derived from cl_cvtoff).

Example:

tc class add dev $ife parent 1:0 classid 1:1  hfsc ls m2 100mbit ul m2 100mbit
tc class add dev $ife parent 1:1 classid 1:10 hfsc ls m1 80mbit d 10s m2 20mbit
tc class add dev $ife parent 1:1 classid 1:11 hfsc ls m2 20mbit

start B, keep it backlogged, let it run 6s (30s worth of vt as A is idle)
pause B briefly to force cl_cvtoff update in parent (whole 1:1 going idle)
start A, let it run 10s
pause A briefly to force rtsc_min()

At this point we would expect A to continue at 20mbit after a brief
moment of 80mbit. But instead A will use 80mbit for full 10s again. It's
the effect of first correcting A (during 'start A'), and then - after
unpausing - calculating rtsc_min() from old corrected and new uncorrected
curve.

The patch fixes this bug and keepis vt and fsc in sync (virtual times
are cumulative, not local to the backlog period).
Signed-off-by: NMichal Soltys <soltys@ziu.info>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

678a6241

09 7月, 2016 1 次提交

hfsc: reduce hfsc_sched to 14 cachelines · bba7eb5d

由 Florian Westphal 提交于 7月 04, 2016

hfsc_sched is huge (size: 920, cachelines: 15), but we can get it to 14
cachelines by placing level after filter_cnt (covering 4 byte hole) and
reducing period/nactive/flags to u32 (period is just a counter,
incremented when class becomes active -- 2**32 is plenty for this
purpose, also, long is only 32bit wide on 32bit platforms anyway).

cl_vtperiod is exported to userspace via tc_hfsc_stats, but its period
member is already u32, so no precision is lost there either.

Cc: Michal Soltys <soltys@ziu.info>
Signed-off-by: NFlorian Westphal <fw@strlen.de>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

bba7eb5d

01 7月, 2016 5 次提交

net/sched/sch_hfsc.c: anchor virtual curve at proper vt in hfsc_change_fsc() · 33ef84a7

由 Michal Soltys 提交于 6月 30, 2016

cl->cl_vt alone is relative only to the current backlog period, while
the curve operates on cumulative virtual time. This patch adds missing
cl->cl_vtoff.
Signed-off-by: NMichal Soltys <soltys@ziu.info>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

33ef84a7

net/sched/sch_hfsc.c: go passive after vt update · ab12cb47

由 Michal Soltys 提交于 6月 30, 2016

When a class is going passive, it should update its cl_vt first
to be consistent with the last dequeue operation.

Otherwise its cl_vt will be one packet behind and parent's cvtmax might
not be updated as well.

One possible side effect is if some class goes passive and subsequently
goes active /without/ its parent going passive - with cl_vt lagging one
packet behind - comparison made in init_vf() will be affected (same
period).
Signed-off-by: NMichal Soltys <soltys@ziu.info>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

ab12cb47

net/sched/sch_hfsc.c: remove leftover dlist and droplist · 2354f056

由 Michal Soltys 提交于 6月 30, 2016

This is update to:
commit a09ceb0e ("sched: remove qdisc->drop")

That commit removed qdisc->drop, but left alone dlist and droplist
that no longer serve any meaningful purpose.
Signed-off-by: NMichal Soltys <soltys@ziu.info>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

2354f056

net/sched/sch_hfsc.c: add unlikely() in qdisc_peek_len() · d1d0fc5e

由 Michal Soltys 提交于 6月 30, 2016

The condition can only succeed on wrong configurations.
Signed-off-by: NMichal Soltys <soltys@ziu.info>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

d1d0fc5e

net/sched/sch_hfsc.c: handle corner cases where head may change invalidating calculated deadline · 12d0ad3b

由 Michal Soltys 提交于 6月 30, 2016

Realtime scheduling implemented in HFSC uses head of the queue to make
the decision about which packet to schedule next. But in case of any
head drop, the deadline calculated for the previous head is not
necessarily correct for the next head (unless both packets have the same
length).

Thanks to peek() function used during dequeue - which internally is a
dequeue operation - hfsc is almost safe from this issue, as peek()
dequeues and isolates the head storing it temporarily until the real
dequeue happens.

But there is one exception: if after the class activation a drop happens
before the first dequeue operation, there's never a chance to do the
peek().

Adding peek() call in enqueue - if this is the first packet in a new
backlog period AND the scheduler has realtime curve defined - fixes that
one corner case. The 1st hfsc_dequeue() will use that peeked packet,
similarly as every subsequent hfsc_dequeue() call uses packet peeked by
the previous call.
Signed-off-by: NMichal Soltys <soltys@ziu.info>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

12d0ad3b

26 6月, 2016 1 次提交

net_sched: drop packets after root qdisc lock is released · 520ac30f

由 Eric Dumazet 提交于 6月 21, 2016

Qdisc performance suffers when packets are dropped at enqueue()
time because drops (kfree_skb()) are done while qdisc lock is held,
delaying a dequeue() draining the queue.

Nominal throughput can be reduced by 50 % when this happens,
at a time we would like the dequeue() to proceed as fast as possible.

Even FQ is vulnerable to this problem, while one of FQ goals was
to provide some flow isolation.

This patch adds a 'struct sk_buff **to_free' parameter to all
qdisc->enqueue(), and in qdisc_drop() helper.

I measured a performance increase of up to 12 %, but this patch
is a prereq so that future batches in enqueue() can fly.
Signed-off-by: NEric Dumazet <edumazet@google.com>
Acked-by: NJesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

520ac30f

11 6月, 2016 1 次提交

net_sched: remove generic throttled management · 45f50bed

由 Eric Dumazet 提交于 6月 10, 2016

__QDISC_STATE_THROTTLED bit manipulation is rather expensive
for HTB and few others.

I already removed it for sch_fq in commit f2600cf0
("net: sched: avoid costly atomic operation in fq_dequeue()")
and so far nobody complained.

When one ore more packets are stuck in one or more throttled
HTB class, a htb dequeue() performs two atomic operations
to clear/set __QDISC_STATE_THROTTLED bit, while root qdisc
lock is held.

Removing this pair of atomic operations bring me a 8 % performance
increase on 200 TCP_RR tests, in presence of throttled classes.

This patch has no side effect, since nothing actually uses
disc_is_throttled() anymore.
Signed-off-by: NEric Dumazet <edumazet@google.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

45f50bed

09 6月, 2016 1 次提交

sched: remove qdisc->drop · a09ceb0e

由 Florian Westphal 提交于 6月 09, 2016

after removal of TCA_CBQ_OVL_STRATEGY from cbq scheduler, there are no
more callers of ->drop() outside of other ->drop functions, i.e.
nothing calls them.
Signed-off-by: NFlorian Westphal <fw@strlen.de>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

a09ceb0e

08 6月, 2016 1 次提交

net: sched: do not acquire qdisc spinlock in qdisc/class stats dump · edb09eb1

由 Eric Dumazet 提交于 6月 06, 2016

Large tc dumps (tc -s {qdisc|class} sh dev ethX) done by Google BwE host
agent [1] are problematic at scale :

For each qdisc/class found in the dump, we currently lock the root qdisc
spinlock in order to get stats. Sampling stats every 5 seconds from
thousands of HTB classes is a challenge when the root qdisc spinlock is
under high pressure. Not only the dumps take time, they also slow
down the fast path (queue/dequeue packets) by 10 % to 20 % in some cases.

An audit of existing qdiscs showed that sch_fq_codel is the only qdisc
that might need the qdisc lock in fq_codel_dump_stats() and
fq_codel_dump_class_stats()

In v2 of this patch, I now use the Qdisc running seqcount to provide
consistent reads of packets/bytes counters, regardless of 32/64 bit arches.

I also changed rate estimators to use the same infrastructure
so that they no longer need to lock root qdisc lock.

[1]
http://static.googleusercontent.com/media/research.google.com/en//pubs/archive/43838.pdfSigned-off-by: NEric Dumazet <edumazet@google.com>
Cc: Cong Wang <xiyou.wangcong@gmail.com>
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Cc: John Fastabend <john.fastabend@gmail.com>
Cc: Kevin Athey <kda@google.com>
Cc: Xiaotian Pei <xiaotian@google.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

edb09eb1

04 6月, 2016 1 次提交

sch_hfsc: always keep backlog updated · 357cc9b4

由 WANG Cong 提交于 6月 01, 2016

hfsc updates backlog lazily, that is only when we
dump the stats. This is problematic after we begin to
update backlog in qdisc_tree_reduce_backlog().
Reported-by: NStas Nichiporovich <stasn77@gmail.com>
Tested-by: NStas Nichiporovich <stasn77@gmail.com>
Fixes: 2ccccf5f ("net_sched: update hierarchical backlog too")
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: NCong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

357cc9b4

01 3月, 2016 2 次提交

net_sched: update hierarchical backlog too · 2ccccf5f

由 WANG Cong 提交于 2月 25, 2016

When the bottom qdisc decides to, for example, drop some packet,
it calls qdisc_tree_decrease_qlen() to update the queue length
for all its ancestors, we need to update the backlog too to
keep the stats on root qdisc accurate.

Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Acked-by: NJamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: NCong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

2ccccf5f

net_sched: introduce qdisc_replace() helper · 86a7996c

由 WANG Cong 提交于 2月 25, 2016

Remove nearly duplicated code and prepare for the following patch.

Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Acked-by: NJamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: NCong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

86a7996c

28 8月, 2015 1 次提交

net: sched: consolidate tc_classify{,_compat} · 3b3ae880

由 Daniel Borkmann 提交于 8月 26, 2015

For classifiers getting invoked via tc_classify(), we always need an
extra function call into tc_classify_compat(), as both are being
exported as symbols and tc_classify() itself doesn't do much except
handling of reclassifications when tp->classify() returned with
TC_ACT_RECLASSIFY.

CBQ and ATM are the only qdiscs that directly call into tc_classify_compat(),
all others use tc_classify(). When tc actions are being configured
out in the kernel, tc_classify() effectively does nothing besides
delegating.

We could spare this layer and consolidate both functions. pktgen on
single CPU constantly pushing skbs directly into the netif_receive_skb()
path with a dummy classifier on ingress qdisc attached, improves
slightly from 22.3Mpps to 23.1Mpps.
Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
Acked-by: NAlexei Starovoitov <ast@plumgrid.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

3b3ae880

30 9月, 2014 4 次提交

net: sched: enable per cpu qstats · b0ab6f92

由 John Fastabend 提交于 9月 28, 2014

After previous patches to simplify qstats the qstats can be
made per cpu with a packed union in Qdisc struct.
Signed-off-by: NJohn Fastabend <john.r.fastabend@intel.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

b0ab6f92

net: sched: restrict use of qstats qlen · 64015853

由 John Fastabend 提交于 9月 28, 2014

This removes the use of qstats->qlen variable from the classifiers
and makes it an explicit argument to gnet_stats_copy_queue().

The qlen represents the qdisc queue length and is packed into
the qstats at the last moment before passnig to user space. By
handling it explicitely we avoid, in the percpu stats case, having
to figure out which per_cpu variable to put it in.

It would probably be best to remove it from qstats completely
but qstats is a user space ABI and can't be broken. A future
patch could make an internal only qstats structure that would
avoid having to allocate an additional u32 variable on the
Qdisc struct. This would make the qstats struct 128bits instead
of 128+32.
Signed-off-by: NJohn Fastabend <john.r.fastabend@intel.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

64015853

net: sched: implement qstat helper routines · 25331d6c

由 John Fastabend 提交于 9月 28, 2014

This adds helpers to manipulate qstats logic and replaces locations
that touch the counters directly. This simplifies future patches
to push qstats onto per cpu counters.
Signed-off-by: NJohn Fastabend <john.r.fastabend@intel.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

25331d6c

net: sched: make bstats per cpu and estimator RCU safe · 22e0f8b9

由 John Fastabend 提交于 9月 28, 2014

In order to run qdisc's without locking statistics and estimators
need to be handled correctly.

To resolve bstats make the statistics per cpu. And because this is
only needed for qdiscs that are running without locks which is not
the case for most qdiscs in the near future only create percpu
stats when qdiscs set the TCQ_F_CPUSTATS flag.

Next because estimators use the bstats to calculate packets per
second and bytes per second the estimator code paths are updated
to use the per cpu statistics.
Signed-off-by: NJohn Fastabend <john.r.fastabend@intel.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

22e0f8b9

14 9月, 2014 2 次提交

net: rcu-ify tcf_proto · 25d8c0d5

由 John Fastabend 提交于 9月 12, 2014

rcu'ify tcf_proto this allows calling tc_classify() without holding
any locks. Updaters are protected by RTNL.

This patch prepares the core net_sched infrastracture for running
the classifier/action chains without holding the qdisc lock however
it does nothing to ensure cls_xxx and act_xxx types also work without
locking. Additional patches are required to address the fall out.
Signed-off-by: NJohn Fastabend <john.r.fastabend@intel.com>
Acked-by: NEric Dumazet <edumazet@google.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

25d8c0d5

net: rcu-ify tcf_proto · 80a735f7

由 John Fastabend 提交于 9月 12, 2014

rcu'ify tcf_proto this allows calling tc_classify() without holding
any locks. Updaters are protected by RTNL.

This patch prepares the core net_sched infrastracture for running
the classifier/action chains without holding the qdisc lock however
it does nothing to ensure cls_xxx and act_xxx types also work without
locking. Additional patches are required to address the fall out.
Signed-off-by: NJohn Fastabend <john.r.fastabend@intel.com>
Acked-by: NEric Dumazet <edumazet@google.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

80a735f7

14 3月, 2014 1 次提交

net_sched: return nla_nest_end() instead of skb->len · d59b7d80

由 Yang Yingliang 提交于 3月 12, 2014

nla_nest_end() already has return skb->len, so replace
return skb->len with return nla_nest_end instead().
Signed-off-by: NYang Yingliang <yangyingliang@huawei.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

d59b7d80

11 6月, 2013 1 次提交

net_sched: add 64bit rate estimators · 45203a3b

由 Eric Dumazet 提交于 6月 06, 2013

struct gnet_stats_rate_est contains u32 fields, so the bytes per second
field can wrap at 34360Mbit.

Add a new gnet_stats_rate_est64 structure to get 64bit bps/pps fields,
and switch the kernel to use this structure natively.

This structure is dumped to user space as a new attribute :

TCA_STATS_RATE_EST64

Old tc command will now display the capped bps (to 34360Mbit), instead
of wrapped values, and updated tc command will display correct
information.

Old tc command output, after patch :

eric:~# tc -s -d qd sh dev lo
qdisc pfifo 8001: root refcnt 2 limit 1000p
 Sent 80868245400 bytes 1978837 pkt (dropped 0, overlimits 0 requeues 0)
 rate 34360Mbit 189696pps backlog 0b 0p requeues 0

This patch carefully reorganizes "struct Qdisc" layout to get optimal
performance on SMP.
Signed-off-by: NEric Dumazet <edumazet@google.com>
Cc: Ben Hutchings <bhutchings@solarflare.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

45203a3b

28 2月, 2013 1 次提交

hlist: drop the node parameter from iterators · b67bfe0d

由 Sasha Levin 提交于 2月 27, 2013

I'm not sure why, but the hlist for each entry iterators were conceived

        list_for_each_entry(pos, head, member)

The hlist ones were greedy and wanted an extra parameter:

        hlist_for_each_entry(tpos, pos, head, member)

Why did they need an extra pos parameter? I'm not quite sure. Not only
they don't really need it, it also prevents the iterator from looking
exactly like the list iterator, which is unfortunate.

Besides the semantic patch, there was some manual work required:

 - Fix up the actual hlist iterators in linux/list.h
 - Fix up the declaration of other iterators based on the hlist ones.
 - A very small amount of places were using the 'node' parameter, this
 was modified to use 'obj->member' instead.
 - Coccinelle didn't handle the hlist_for_each_entry_safe iterator
 properly, so those had to be fixed up manually.

The semantic patch which is mostly the work of Peter Senna Tschudin is here:

@@
iterator name hlist_for_each_entry, hlist_for_each_entry_continue, hlist_for_each_entry_from, hlist_for_each_entry_rcu, hlist_for_each_entry_rcu_bh, hlist_for_each_entry_continue_rcu_bh, for_each_busy_worker, ax25_uid_for_each, ax25_for_each, inet_bind_bucket_for_each, sctp_for_each_hentry, sk_for_each, sk_for_each_rcu, sk_for_each_from, sk_for_each_safe, sk_for_each_bound, hlist_for_each_entry_safe, hlist_for_each_entry_continue_rcu, nr_neigh_for_each, nr_neigh_for_each_safe, nr_node_for_each, nr_node_for_each_safe, for_each_gfn_indirect_valid_sp, for_each_gfn_sp, for_each_host;

type T;
expression a,c,d,e;
identifier b;
statement S;
@@

-T b;
    <+... when != b
(
hlist_for_each_entry(a,
- b,
c, d) S
|
hlist_for_each_entry_continue(a,
- b,
c) S
|
hlist_for_each_entry_from(a,
- b,
c) S
|
hlist_for_each_entry_rcu(a,
- b,
c, d) S
|
hlist_for_each_entry_rcu_bh(a,
- b,
c, d) S
|
hlist_for_each_entry_continue_rcu_bh(a,
- b,
c) S
|
for_each_busy_worker(a, c,
- b,
d) S
|
ax25_uid_for_each(a,
- b,
c) S
|
ax25_for_each(a,
- b,
c) S
|
inet_bind_bucket_for_each(a,
- b,
c) S
|
sctp_for_each_hentry(a,
- b,
c) S
|
sk_for_each(a,
- b,
c) S
|
sk_for_each_rcu(a,
- b,
c) S
|
sk_for_each_from
-(a, b)
+(a)
S
+ sk_for_each_from(a) S
|
sk_for_each_safe(a,
- b,
c, d) S
|
sk_for_each_bound(a,
- b,
c) S
|
hlist_for_each_entry_safe(a,
- b,
c, d, e) S
|
hlist_for_each_entry_continue_rcu(a,
- b,
c) S
|
nr_neigh_for_each(a,
- b,
c) S
|
nr_neigh_for_each_safe(a,
- b,
c, d) S
|
nr_node_for_each(a,
- b,
c) S
|
nr_node_for_each_safe(a,
- b,
c, d) S
|
- for_each_gfn_sp(a, c, d, b) S
+ for_each_gfn_sp(a, c, d) S
|
- for_each_gfn_indirect_valid_sp(a, c, d, b) S
+ for_each_gfn_indirect_valid_sp(a, c, d) S
|
for_each_host(a,
- b,
c) S
|
for_each_host_safe(a,
- b,
c, d) S
|
for_each_mesh_entry(a,
- b,
c, d) S
)
    ...+>

[akpm@linux-foundation.org: drop bogus change from net/ipv4/raw.c]
[akpm@linux-foundation.org: drop bogus hunk from net/ipv6/raw.c]
[akpm@linux-foundation.org: checkpatch fixes]
[akpm@linux-foundation.org: fix warnings]
[akpm@linux-foudnation.org: redo intrusive kvm changes]
Tested-by: NPeter Senna Tschudin <peter.senna@gmail.com>
Acked-by: NPaul E. McKenney <paulmck@linux.vnet.ibm.com>
Signed-off-by: NSasha Levin <sasha.levin@oracle.com>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: Gleb Natapov <gleb@redhat.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

b67bfe0d

11 5月, 2012 1 次提交

net_sched: update bstats in dequeue() · 2dd875ff

由 Eric Dumazet 提交于 5月 10, 2012

Class bytes/packets stats can be misleading because they are updated in
enqueue() while packet might be dropped later.

We already fixed all qdiscs but sch_atm.

This patch makes the final cleanup.

class rate estimators can now match qdisc ones.
Signed-off-by: NEric Dumazet <edumazet@google.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

2dd875ff

02 4月, 2012 1 次提交

pkt_sched: Stop using NLA_PUT*(). · 1b34ec43

由 David S. Miller 提交于 3月 29, 2012

These macros contain a hidden goto, and are thus extremely error
prone and make code hard to audit.
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

1b34ec43

24 12月, 2011 1 次提交

sch_hfsc: report backlog information · f5a59b73

由 Eric Dumazet 提交于 12月 23, 2011

Add backlog (byte count) information in hfsc classes and qdisc, so that
"tc -s" can report it to user, instead of 0 values :

qdisc hfsc 1: root refcnt 6 default 20
 Sent 45141660 bytes 30545 pkt (dropped 0, overlimits 91751 requeues 0)
 rate 1492Kbit 126pps backlog 103226b 74p requeues 0
...
class hfsc 1:20 parent 1:1 leaf 1201: rt m1 0bit d 0us m2 400000bit ls m1 0bit d 0us m2 200000bit
 Sent 49534912 bytes 33519 pkt (dropped 0, overlimits 0 requeues 0)
 backlog 81822b 56p requeues 0
 period 23 work 49451576 bytes rtwork 13277552 bytes level 0
...
Signed-off-by: NEric Dumazet <eric.dumazet@gmail.com>
CC: John A. Sullivan III <jsullivan@opensourcedevel.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

f5a59b73

21 1月, 2011 2 次提交

net_sched: accurate bytes/packets stats/rates · 9190b3b3

由 Eric Dumazet 提交于 1月 20, 2011

In commit 44b82883 (net_sched: pfifo_head_drop problem), we fixed
a problem with pfifo_head drops that incorrectly decreased
sch->bstats.bytes and sch->bstats.packets

Several qdiscs (CHOKe, SFQ, pfifo_head, ...) are able to drop a
previously enqueued packet, and bstats cannot be changed, so
bstats/rates are not accurate (over estimated)

This patch changes the qdisc_bstats updates to be done at dequeue() time
instead of enqueue() time. bstats counters no longer account for dropped
frames, and rates are more correct, since enqueue() bursts dont have
effect on dequeue() rate.
Signed-off-by: NEric Dumazet <eric.dumazet@gmail.com>
Acked-by: NStephen Hemminger <shemminger@vyatta.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

9190b3b3

net_sched: move TCQ_F_THROTTLED flag · fd245a4a

由 Eric Dumazet 提交于 1月 20, 2011

In commit 37112105 (net: QDISC_STATE_RUNNING dont need atomic bit
ops) I moved QDISC_STATE_RUNNING flag to __state container, located in
the cache line containing qdisc lock and often dirtied fields.

I now move TCQ_F_THROTTLED bit too, so that we let first cache line read
mostly, and shared by all cpus. This should speedup HTB/CBQ for example.

Not using test_bit()/__clear_bit()/__test_and_set_bit allows to use an
"unsigned int" for __state container, reducing by 8 bytes Qdisc size.

Introduce helpers to hide implementation details.
Signed-off-by: NEric Dumazet <eric.dumazet@gmail.com>
CC: Patrick McHardy <kaber@trash.net>
CC: Jesper Dangaard Brouer <hawk@diku.dk>
CC: Jarek Poplawski <jarkao2@gmail.com>
CC: Jamal Hadi Salim <hadi@cyberus.ca>
CC: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

fd245a4a

20 1月, 2011 1 次提交

net_sched: cleanups · cc7ec456

由 Eric Dumazet 提交于 1月 19, 2011

Cleanup net/sched code to current CodingStyle and practices.

Reduce inline abuse
Signed-off-by: NEric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

cc7ec456

11 1月, 2011 1 次提交

net_sched: factorize qdisc stats handling · bfe0d029

由 Eric Dumazet 提交于 1月 09, 2011

HTB takes into account skb is segmented in stats updates.
Generalize this to all schedulers.

They should use qdisc_bstats_update() helper instead of manipulating
bstats.bytes and bstats.packets

Add bstats_update() helper too for classes that use
gnet_stats_basic_packed fields.

Note : Right now, TCQ_F_CAN_BYPASS shortcurt can be taken only if no
stab is setup on qdisc.
Signed-off-by: NEric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

bfe0d029

21 10月, 2010 1 次提交

net_sched: remove the unused parameter of qdisc_create_dflt() · 3511c913

由 Changli Gao 提交于 10月 16, 2010

The first parameter dev isn't in use in qdisc_create_dflt().
Signed-off-by: NChangli Gao <xiaosuo@gmail.com>
Acked-by: NJamal Hadi Salim <hadi@cyberus.ca>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

3511c913

02 9月, 2010 1 次提交

net/sched/sch_hfsc.c: initialize parent's cl_cfmin properly in init_vf() · 3b2eb613

由 Michal Soltys 提交于 8月 30, 2010

This patch fixes init_vf() function, so on each new backlog period parent's
cl_cfmin is properly updated (including further propgation towards the root),
even if the activated leaf has no upperlimit curve defined.
Signed-off-by: NMichal Soltys <soltys@ziu.info>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

3b2eb613

openanolis / cloud-kernel 接近 2 年 前同步成功

openanolis / cloud-kernel
接近 2 年前同步成功