  1. 05 Jul 2017, 1 commit
  2. 07 Apr 2017, 1 commit
  3. 13 Mar 2017, 1 commit
  4. 30 Dec 2016, 1 commit
    • net: dev_weight: TX/RX orthogonality · 3d48b53f
      Committed by Matthias Tafelmeier
      Often, adjusting one of TX/RX via sysctl introduces undesirable side
      effects on packet processing in the other half of the stack. There are
      cases that demand asymmetric, orthogonal configurability.
      
      This holds true especially for nodes where RPS with RFS on top is
      configured, which therefore use the 'old' dev_weight. This is quite a
      common base configuration nowadays, even with NICs that have superior
      processing support (e.g. aRFS).
      
      A good example use case is a node acting as a NoSQL database, handling a
      large number of tiny requests and sending rather fewer but larger packets
      as responses. A large budget and RX dev_weight are affordable for the
      requests, but as a side effect, having this large a number of packets
      processed on TX in one run can overwhelm drivers.
      
      This patch therefore exposes independent TX/RX configurability to
      userland via sysctl.
      Signed-off-by: Matthias Tafelmeier <matthias.tafelmeier@gmx.net>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      3d48b53f
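
      A minimal userspace sketch of the idea behind the commit above, assuming
      the per-direction knobs are exposed as
      /proc/sys/net/core/dev_weight_rx_bias and
      /proc/sys/net/core/dev_weight_tx_bias and that the effective budgets are
      dev_weight multiplied by the respective bias (both the knob names and the
      formula are assumptions here, not verified against the kernel source):

      /* Sketch only: models the per-direction budgets as dev_weight * bias. */
      #include <stdio.h>

      static long read_sysctl(const char *path, long fallback)
      {
              FILE *f = fopen(path, "r");
              long v = fallback;

              if (f && fscanf(f, "%ld", &v) != 1)
                      v = fallback;
              if (f)
                      fclose(f);
              return v;
      }

      int main(void)
      {
              long weight  = read_sysctl("/proc/sys/net/core/dev_weight", 64);
              long rx_bias = read_sysctl("/proc/sys/net/core/dev_weight_rx_bias", 1);
              long tx_bias = read_sysctl("/proc/sys/net/core/dev_weight_tx_bias", 1);

              printf("rx budget per softirq run: %ld\n", weight * rx_bias);
              printf("tx budget per qdisc run:   %ld\n", weight * tx_bias);
              return 0;
      }

      With such a split, a node of the NoSQL kind described above could raise
      only the RX budget while leaving the TX side untouched.
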
  5. 06 Dec 2016, 1 commit
    • net_sched: gen_estimator: complete rewrite of rate estimators · 1c0d32fd
      Committed by Eric Dumazet
      1) Old code was hard to maintain, due to complex lock chains.
         (We probably will be able to remove some kfree_rcu() in callers)
      
      2) Using a single timer to update all estimators does not scale.
      
      3) Code was buggy on 32-bit kernels (WRITE_ONCE() on a 64-bit quantity
         is not supposed to work well)
      
      In this rewrite:
      
      - I removed the RB tree that had to be scanned in
        gen_estimator_active(). qdisc dumps should be much faster.
      
      - Each estimator has its own timer.
      
      - Estimations are maintained in a net_rate_estimator structure,
        instead of dirtying the qdisc. Minor, but part of the simplification.
      
      - Reading the estimator uses RCU and a seqcount to provide proper
        support for 32bit kernels.
      
      - We reduce memory needs when estimators are not used, since
        we store a pointer instead of the bytes/packets counters.
      
      - xt_rateest_mt() no longer has to grab a spinlock.
        (In the future, xt_rateest_tg() could be switched to per cpu counters)
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      1c0d32fd
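
      A hedged userspace model of the 32-bit-safe read path mentioned above: the
      writer bumps a sequence counter around its update of the 64-bit pair, and
      readers retry until they observe an even, unchanged sequence. This is only
      a sketch of the seqcount idea with illustrative names, not the kernel's
      net_rate_estimator code, and it assumes a single writer:

      #include <stdatomic.h>
      #include <stdint.h>
      #include <stdio.h>

      struct rate_est {
              atomic_uint seq;        /* even = stable, odd = update in progress */
              uint64_t    bytes;
              uint64_t    packets;
      };

      static void est_update(struct rate_est *e, uint64_t b, uint64_t p)
      {
              atomic_fetch_add(&e->seq, 1);   /* becomes odd: update started */
              e->bytes += b;
              e->packets += p;
              atomic_fetch_add(&e->seq, 1);   /* becomes even: update finished */
      }

      static void est_read(struct rate_est *e, uint64_t *b, uint64_t *p)
      {
              unsigned int start;

              do {                            /* retry on concurrent update */
                      start = atomic_load(&e->seq);
                      *b = e->bytes;
                      *p = e->packets;
              } while ((start & 1) || start != atomic_load(&e->seq));
      }

      int main(void)
      {
              struct rate_est e = { 0 };
              uint64_t b, p;

              est_update(&e, 1500, 1);
              est_read(&e, &b, &p);
              printf("bytes=%llu packets=%llu\n",
                     (unsigned long long)b, (unsigned long long)p);
              return 0;
      }
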
  6. 19 Sep 2016, 3 commits
  7. 26 Aug 2016, 1 commit
  8. 11 Aug 2016, 1 commit
  9. 26 Jun 2016, 2 commits
    • net_sched: generalize bulk dequeue · 4d202a0d
      Committed by Eric Dumazet
      When qdisc bulk dequeue was added in linux-3.18 (commit
      5772e9a3 "qdisc: bulk dequeue support for qdiscs
      with TCQ_F_ONETXQUEUE"), it was constrained to some
      specific qdiscs.
      
      With some extra care, we can extend this to all qdiscs,
      so that typical traffic shaping solutions can benefit from
      small batches (8 packets in this patch).
      
      For example, HTB is often used on a multi-queue device,
      and bonding/team are multi-queue devices...
      
      The idea is to bulk-dequeue packets that map to the same transmit queue.
      
      This brings a 35 to 80 % performance increase for an HTB setup
      under pressure on top of bonding:
      
      1) NUMA node contention :   610,000 pps -> 1,110,000 pps
      2) No node contention   : 1,380,000 pps -> 1,930,000 pps
      
      Now we should work to add batches on the enqueue() side ;)
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Cc: John Fastabend <john.r.fastabend@intel.com>
      Cc: Jesper Dangaard Brouer <brouer@redhat.com>
      Cc: Hannes Frederic Sowa <hannes@stressinduktion.org>
      Cc: Florian Westphal <fw@strlen.de>
      Cc: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Jesper Dangaard Brouer <brouer@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      4d202a0d
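
      A simplified sketch of the dequeue policy described above: keep pulling
      packets while they map to the same transmit queue as the first one and the
      small batch limit (8, as in the patch) is not exceeded. Mock types and
      names, not the kernel's dequeue_skb()/try_bulk_dequeue_skb():

      #include <stdio.h>
      #include <stddef.h>

      struct pkt {
              struct pkt *next;
              int         txq;        /* transmit queue this packet maps to */
      };

      /* Dequeue up to 'limit' packets that all map to the same TXQ. */
      static struct pkt *bulk_dequeue(struct pkt **head, int limit)
      {
              struct pkt *first = *head, *tail = first;

              if (!first)
                      return NULL;
              *head = first->next;

              while (--limit > 0 && *head && (*head)->txq == first->txq) {
                      tail->next = *head;
                      tail = *head;
                      *head = (*head)->next;
              }
              tail->next = NULL;      /* one batch, one TXQ, one doorbell */
              return first;
      }

      int main(void)
      {
              struct pkt p3 = { NULL, 1 }, p2 = { &p3, 0 }, p1 = { &p2, 0 };
              struct pkt *q = &p1, *batch = bulk_dequeue(&q, 8);
              int n = 0;

              for (; batch; batch = batch->next)
                      n++;
              printf("bulked %d packets for txq 0\n", n);   /* prints 2 */
              return 0;
      }
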
    • net_sched: drop packets after root qdisc lock is released · 520ac30f
      Committed by Eric Dumazet
      Qdisc performance suffers when packets are dropped at enqueue()
      time because drops (kfree_skb()) are done while the qdisc lock is held,
      delaying a dequeue() draining the queue.
      
      Nominal throughput can be reduced by 50 % when this happens,
      at a time when we would like dequeue() to proceed as fast as possible.
      
      Even FQ is vulnerable to this problem, while one of FQ's goals was
      to provide some flow isolation.
      
      This patch adds a 'struct sk_buff **to_free' parameter to all
      qdisc->enqueue() implementations and to the qdisc_drop() helper.
      
      I measured a performance increase of up to 12 %, but this patch
      is a prerequisite so that future batching in enqueue() can fly.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Acked-by: Jesper Dangaard Brouer <brouer@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      520ac30f
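
      A hedged sketch of the pattern this commit describes: the enqueue path no
      longer frees a dropped packet directly; it chains it onto a caller-provided
      to_free list, and the caller frees that list only after the qdisc lock has
      been released. Mock types, not the kernel's exact signatures:

      #include <stdlib.h>

      struct pkt { struct pkt *next; };

      /* Called with the qdisc lock held: just chain the drop, do not free here. */
      static void drop_deferred(struct pkt *p, struct pkt **to_free)
      {
              p->next = *to_free;
              *to_free = p;
      }

      /* Called by the enqueue caller after the qdisc lock has been released. */
      static void free_deferred(struct pkt *to_free)
      {
              while (to_free) {
                      struct pkt *next = to_free->next;

                      free(to_free);          /* kfree_skb() in the kernel */
                      to_free = next;
              }
      }

      int main(void)
      {
              struct pkt *p = malloc(sizeof(*p));
              struct pkt *to_free = NULL;

              if (p)
                      drop_deferred(p, &to_free);
              /* ...the qdisc lock would be released here... */
              free_deferred(to_free);
              return 0;
      }
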
  10. 16 Jun 2016, 1 commit
    • net_sched: add the ability to defer skb freeing · 1b5c5493
      Committed by Eric Dumazet
      Qdiscs are changed under RTNL protection, often while BH is blocked
      and the root qdisc spinlock is held.
      
      When lots of skbs need to be dropped, we free
      them under these locks, causing TX/RX freezes
      and, more generally, latency spikes.
      
      This commit adds rtnl_kfree_skbs(), used to queue
      skbs for deferred freeing.
      
      Actual freeing happens right after RTNL is released,
      with appropriate scheduling points.
      
      rtnl_qdisc_drop() can also be used in place
      of qdisc_drop() when RTNL is held.
      
      qdisc_reset_queue() and __qdisc_reset_queue() get
      the new behavior, so standard qdiscs like pfifo, pfifo_fast...
      have their ->reset() method automatically handled.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      1b5c5493
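
      A hedged userspace sketch of the deferral described above: while the big
      lock (RTNL in the kernel) is held, skbs to be freed are only queued on a
      list; the list is flushed with scheduling points right after the lock is
      dropped. Names and types are illustrative, not the kernel's:

      #include <sched.h>
      #include <stdlib.h>

      struct pkt { struct pkt *next; };

      static struct pkt *defer_list;          /* only touched with the lock held */

      /* rtnl_kfree_skbs()-like: callable only while the big lock is held. */
      static void defer_free(struct pkt *p)
      {
              p->next = defer_list;
              defer_list = p;
      }

      /* Called right after the big lock is released. */
      static void flush_deferred(void)
      {
              int batch = 0;

              while (defer_list) {
                      struct pkt *next = defer_list->next;

                      free(defer_list);       /* kfree_skb() in the kernel */
                      defer_list = next;
                      if (++batch % 64 == 0)
                              sched_yield();  /* crude stand-in for cond_resched() */
              }
      }

      int main(void)
      {
              struct pkt *p = malloc(sizeof(*p));

              if (p)
                      defer_free(p);          /* e.g. from a ->reset() handler */
              /* ...the big lock would be released here... */
              flush_deferred();
              return 0;
      }
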
  11. 10 Jun 2016, 1 commit
  12. 08 Jun 2016, 1 commit
  13. 07 Jun 2016, 1 commit
  14. 05 May 2016, 2 commits
    • net: remove dev->trans_start · 9b36627a
      Committed by Florian Westphal
      Previous patches removed all direct accesses to dev->trans_start,
      so change the netif_trans_update helper to update trans_start of
      netdev queue 0 instead, and then remove trans_start from struct net_device.
      
      AFAICS a lot of the netif_trans_update() invocations are now useless
      because they occur in ndo_start_xmit and the driver doesn't set LLTX
      (i.e. the stack already took care of the update).
      
      As I can't test any of them it seems better to just leave them alone.
      Signed-off-by: Florian Westphal <fw@strlen.de>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      9b36627a
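
      A kernel-style sketch (with mock types, so it compiles stand-alone) of the
      helper change described above: with dev->trans_start gone, the helper
      stamps the trans_start of netdev queue 0 instead. This illustrates the
      described behaviour and is not a verbatim copy of the kernel helper:

      struct netdev_queue {
              unsigned long trans_start;
      };

      struct net_device {
              struct netdev_queue txq[1];     /* real devices can have many TXQs */
      };

      static unsigned long jiffies;           /* stand-in for the kernel tick count */

      /* Update the last-transmit timestamp; it now lives on queue 0, not the device. */
      static inline void netif_trans_update(struct net_device *dev)
      {
              struct netdev_queue *txq = &dev->txq[0];

              if (txq->trans_start != jiffies)
                      txq->trans_start = jiffies;
      }
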
    • treewide: replace dev->trans_start update with helper · 860e9538
      Committed by Florian Westphal
      Replace all trans_start updates with the netif_trans_update helper.
      The change was done via spatch:
      
      @@
      struct net_device *d;
      @@
      - d->trans_start = jiffies
      + netif_trans_update(d)
      
      Compile tested only.
      
      Cc: user-mode-linux-devel@lists.sourceforge.net
      Cc: linux-xtensa@linux-xtensa.org
      Cc: linux1394-devel@lists.sourceforge.net
      Cc: linux-rdma@vger.kernel.org
      Cc: netdev@vger.kernel.org
      Cc: MPT-FusionLinux.pdl@broadcom.com
      Cc: linux-scsi@vger.kernel.org
      Cc: linux-can@vger.kernel.org
      Cc: linux-parisc@vger.kernel.org
      Cc: linux-omap@vger.kernel.org
      Cc: linux-hams@vger.kernel.org
      Cc: linux-usb@vger.kernel.org
      Cc: linux-wireless@vger.kernel.org
      Cc: linux-s390@vger.kernel.org
      Cc: devel@driverdev.osuosl.org
      Cc: b.a.t.m.a.n@lists.open-mesh.org
      Cc: linux-bluetooth@vger.kernel.org
      Signed-off-by: Florian Westphal <fw@strlen.de>
      Acked-by: Felipe Balbi <felipe.balbi@linux.intel.com>
      Acked-by: Mugunthan V N <mugunthanvnm@ti.com>
      Acked-by: Antonio Quartulli <a@unstable.cc>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      860e9538
  15. 27 Apr 2016, 1 commit
  16. 14 Apr 2016, 1 commit
  17. 04 Mar 2016, 1 commit
  18. 06 Jan 2016, 1 commit
  19. 04 Dec 2015, 1 commit
    • net_sched: fix qdisc_tree_decrease_qlen() races · 4eaf3b84
      Committed by Eric Dumazet
      qdisc_tree_decrease_qlen() suffers from two problems on multiqueue
      devices.
      
      One problem is that it updates sch->q.qlen and sch->qstats.drops
      on the mq/mqprio root qdisc, while it should not: Daniele
      reported underflow errors:
      [  681.774821] PAX: sch->q.qlen: 0 n: 1
      [  681.774825] PAX: size overflow detected in function qdisc_tree_decrease_qlen net/sched/sch_api.c:769 cicus.693_49 min, count: 72, decl: qlen; num: 0; context: sk_buff_head;
      [  681.774954] CPU: 2 PID: 19 Comm: ksoftirqd/2 Tainted: G           O    4.2.6.201511282239-1-grsec #1
      [  681.774955] Hardware name: ASUSTeK COMPUTER INC. X302LJ/X302LJ, BIOS X302LJ.202 03/05/2015
      [  681.774956]  ffffffffa9a04863 0000000000000000 0000000000000000 ffffffffa990ff7c
      [  681.774959]  ffffc90000d3bc38 ffffffffa95d2810 0000000000000007 ffffffffa991002b
      [  681.774960]  ffffc90000d3bc68 ffffffffa91a44f4 0000000000000001 0000000000000001
      [  681.774962] Call Trace:
      [  681.774967]  [<ffffffffa95d2810>] dump_stack+0x4c/0x7f
      [  681.774970]  [<ffffffffa91a44f4>] report_size_overflow+0x34/0x50
      [  681.774972]  [<ffffffffa94d17e2>] qdisc_tree_decrease_qlen+0x152/0x160
      [  681.774976]  [<ffffffffc02694b1>] fq_codel_dequeue+0x7b1/0x820 [sch_fq_codel]
      [  681.774978]  [<ffffffffc02680a0>] ? qdisc_peek_dequeued+0xa0/0xa0 [sch_fq_codel]
      [  681.774980]  [<ffffffffa94cd92d>] __qdisc_run+0x4d/0x1d0
      [  681.774983]  [<ffffffffa949b2b2>] net_tx_action+0xc2/0x160
      [  681.774985]  [<ffffffffa90664c1>] __do_softirq+0xf1/0x200
      [  681.774987]  [<ffffffffa90665ee>] run_ksoftirqd+0x1e/0x30
      [  681.774989]  [<ffffffffa90896b0>] smpboot_thread_fn+0x150/0x260
      [  681.774991]  [<ffffffffa9089560>] ? sort_range+0x40/0x40
      [  681.774992]  [<ffffffffa9085fe4>] kthread+0xe4/0x100
      [  681.774994]  [<ffffffffa9085f00>] ? kthread_worker_fn+0x170/0x170
      [  681.774995]  [<ffffffffa95d8d1e>] ret_from_fork+0x3e/0x70
      
      mq/mqprio have their own ways to report qlen/drops by folding stats on
      all their queues, with appropriate locking.
      
      A second problem is that qdisc_tree_decrease_qlen() calls qdisc_lookup()
      without proper locking: concurrent qdisc updates could corrupt the list
      that qdisc_match_from_root() parses to find a qdisc given its handle.
      
      Fix the first problem by adding a TCQ_F_NOPARENT qdisc flag that
      qdisc_tree_decrease_qlen() can use to abort its tree traversal
      as soon as it meets an mq/mqprio qdisc child.
      
      The second problem can be fixed by RCU protection.
      Qdiscs are already freed after an RCU grace period, so qdisc_list_add() and
      qdisc_list_del() simply have to use the appropriate RCU list variants.
      
      A future patch will add a per struct netdev_queue list anchor, so that
      qdisc_tree_decrease_qlen() can have more efficient lookups.
      Reported-by: Daniele Fucini <dfucini@gmail.com>
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Cc: Cong Wang <cwang@twopensource.com>
      Cc: Jamal Hadi Salim <jhs@mojatatu.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      4eaf3b84
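
      A hedged sketch of the first fix: the qlen adjustment walks towards the
      root but stops once it reaches a qdisc flagged as sitting directly under
      an mq/mqprio root, so those roots (which fold stats from their children
      themselves) are never touched. Mock structures and flag value; only the
      flag name mirrors the commit:

      #define FLAG_NOPARENT  0x1              /* stand-in for TCQ_F_NOPARENT */

      struct qdisc {
              struct qdisc *parent;
              unsigned int  flags;
              unsigned int  qlen;
      };

      static void tree_decrease_qlen(struct qdisc *sch, unsigned int n)
      {
              while (sch) {
                      sch->qlen -= n;
                      if (sch->flags & FLAG_NOPARENT)
                              break;          /* never fold into mq/mqprio roots */
                      sch = sch->parent;
              }
      }
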
  20. 28 Aug 2015, 3 commits
  21. 18 Aug 2015, 1 commit
  22. 10 Oct 2014, 1 commit
    • net_sched: restore qdisc quota fairness limits after bulk dequeue · b8358d70
      Committed by Jesper Dangaard Brouer
      Restore the quota fairness between qdiscs that we broke with commit
      5772e9a3 ("qdisc: bulk dequeue support for qdiscs with TCQ_F_ONETXQUEUE").
      
      Before that commit, the quota in __qdisc_run() was in packets, as
      dequeue_skb() would only dequeue a single packet; that assumption
      broke with bulk dequeue.
      
      We choose not to account for the number of packets inside the TSO/GSO
      packets (accessible via "skb_gso_segs"), as the previous fairness
      also had this "defect". Thus, GSO/TSO packets count as a single
      packet.
      
      Furthermore, we choose to slack on accuracy by allowing a bulk
      dequeue in try_bulk_dequeue_skb() to exceed the "packets" limit,
      limited only by the BQL byte limit.  This is done because BQL prefers
      to get its full budget for appropriate feedback from TX completion.
      
      In the future, we might consider reworking this further and, if it allows,
      switching to a time-based model, as suggested by Eric. Right now, we only
      restore the old semantics.
      
      Joint work with Eric, Hannes, Daniel and Jesper.  Hannes wrote the
      first patch in cooperation with Daniel and Jesper.  Eric rewrote the
      patch.
      
      Fixes: 5772e9a3 ("qdisc: bulk dequeue support for qdiscs with TCQ_F_ONETXQUEUE")
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
      Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
      Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      b8358d70
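
      A small sketch of the restored accounting: a __qdisc_run()-style loop that
      charges the quota with the number of packets actually sent per bulk instead
      of one per iteration, and that may slightly overshoot as described above.
      The stub below fakes a queue of 20 packets; names are illustrative:

      #include <stdio.h>

      static int pending = 20;                /* pretend 20 packets are queued */

      /* Stand-in for qdisc_restart(): transmits a bulk of up to 8 packets. */
      static int qdisc_restart_stub(void)
      {
              int bulk = pending < 8 ? pending : 8;

              pending -= bulk;
              return bulk;
      }

      static void qdisc_run_quota(int quota)
      {
              int packets;

              while ((packets = qdisc_restart_stub()) > 0) {
                      quota -= packets;       /* was effectively "quota--" before */
                      if (quota <= 0) {
                              printf("quota exhausted, yield to other qdiscs\n");
                              break;
                      }
              }
      }

      int main(void)
      {
              qdisc_run_quota(16);            /* two bulks of 8 exhaust the quota */
              return 0;
      }
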
  23. 08 Oct 2014, 1 commit
    • net: better IFF_XMIT_DST_RELEASE support · 02875878
      Committed by Eric Dumazet
      Testing xmit_more support with netperf and connected UDP sockets,
      I found strange dst refcount false sharing.
      
      Current handling of IFF_XMIT_DST_RELEASE is not optimal.
      
      Dropping the dst in validate_xmit_skb() is certainly too late in case
      the packet was queued by CPU X but dequeued by CPU Y.
      
      The logical point to take care of drop/force is in __dev_queue_xmit(),
      before even taking the qdisc lock.
      
      As Julian Anastasov pointed out, the need for skb_dst() might come from
      some packet schedulers or classifiers.
      
      This patch adds a new helper to cleanly express the needs of various
      drivers or qdiscs/classifiers.
      
      Drivers that need skb_dst() in their ndo_start_xmit() should call the
      following helper in their setup instead of the prior:
      
      	dev->priv_flags &= ~IFF_XMIT_DST_RELEASE;
      ->
      	netif_keep_dst(dev);
      
      Instead of using a single bit, we use two bits, one being
      eventually rebuilt in bonding/team drivers.
      
      The other one is permanent and blocks IFF_XMIT_DST_RELEASE from being
      rebuilt in bonding/team. Eventually, we could add something
      smarter later.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Cc: Julian Anastasov <ja@ssi.bg>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      02875878
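
      A kernel-style sketch of the two-bit scheme described above, with made-up
      flag values and a mock struct: one releasable bit that bonding/team may
      rebuild, plus a permanent bit that the helper also clears so the releasable
      bit is never rebuilt for such a device. Treat it as an illustration of the
      commit text, not as the exact netdevice.h code:

      #define IFF_XMIT_DST_RELEASE       0x1  /* may be rebuilt by bonding/team */
      #define IFF_XMIT_DST_RELEASE_PERM  0x2  /* permanent, blocks the rebuild */

      struct net_device {
              unsigned int priv_flags;
      };

      /* Drivers needing skb_dst() in ndo_start_xmit() call this in their setup. */
      static inline void netif_keep_dst(struct net_device *dev)
      {
              dev->priv_flags &= ~(IFF_XMIT_DST_RELEASE | IFF_XMIT_DST_RELEASE_PERM);
      }
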
  24. 04 Oct 2014, 3 commits
    • qdisc: validate skb without holding lock · 55a93b3e
      Committed by Eric Dumazet
      Validation of an skb can be pretty expensive:
      
      GSO segmentation and/or checksum computations.
      
      We can do this without holding qdisc lock, so that other cpus
      can queue additional packets.
      
      The trick is that requeued packets were already validated, so we carry
      a boolean so that sch_direct_xmit() can either validate a fresh skb list
      or directly use an old one.
      
      Tested on a 40Gbit NIC (8 TX queues) with 200 concurrent flows on a
      48-thread host.
      
      Turning TSO on or off had no effect on throughput, only a few more CPU
      cycles. Lock contention on the qdisc lock disappeared.
      
      Same when disabling TX checksum offload.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      55a93b3e
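
      A hedged sketch of the trick described above: a boolean tells the transmit
      path whether the skb list is fresh (validate it, outside the qdisc lock) or
      consists of already-validated requeued packets (use it as-is). Mock types;
      the real change is in sch_direct_xmit() and validate_xmit_skb():

      #include <stdbool.h>
      #include <stddef.h>

      struct pkt { struct pkt *next; };

      /* Stand-in for validation work: GSO segmentation, checksums, ... */
      static struct pkt *validate_list(struct pkt *list)
      {
              return list;                    /* pretend validation succeeded */
      }

      /* Requeued packets pass validate=false and skip the expensive work. */
      static int direct_xmit(struct pkt *list, bool validate)
      {
              if (validate)                   /* done without holding the qdisc lock */
                      list = validate_list(list);

              /* ...hand 'list' to the driver under the TXQ lock... */
              return list != NULL;
      }
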
    • qdisc: dequeue bulking also pickup GSO/TSO packets · 808e7ac0
      Committed by Jesper Dangaard Brouer
      The TSO and GSO segmented packets already benefit from bulking
      on their own.
      
      The TSO packets have always taken advantage of updating the
      tailptr only once for a large packet.
      
      The GSO segmented packets have recently taken advantage of the
      bulking xmit_more API, via merge commit 53fda7f7 ("Merge
      branch 'xmit_list'"), specifically via commit 7f2e870f ("net:
      Move main gso loop out of dev_hard_start_xmit() into helper.")
      allowing qdisc requeue of the remaining list, and via commit
      ce93718f ("net: Don't keep around original SKB when we
      software segment GSO frames.").
      
      This patch allows further bulking of TSO/GSO packets together
      when dequeueing from the qdisc.
      
      Testing:
       Measuring HoL (Head-of-Line) blocking for TSO and GSO, with
      netperf-wrapper. Bulking several TSO packets shows no performance
      regression (requeues were in the area of 32 requeues/sec).
      
      Bulking several GSOs does show a small regression or a very small
      improvement (requeues were in the area of 8000 requeues/sec).
      
       Using ixgbe at 10Gbit/s with GSO bulking, we can measure some additional
      latency. The base case, which is "normal" GSO bulking, sees the high-prio
      queue delay vary between 0.38ms and 0.47ms.  Bulking several GSOs
      together results in a stable high-prio queue delay of 0.50ms.
      
       Using igb at 100Mbit/s with GSO bulking shows an improvement.
      The base case sees the high-prio queue delay vary between 2.23ms and 2.35ms.
      Signed-off-by: David S. Miller <davem@davemloft.net>
      808e7ac0
    • qdisc: bulk dequeue support for qdiscs with TCQ_F_ONETXQUEUE · 5772e9a3
      Committed by Jesper Dangaard Brouer
      Based on DaveM's recent API work on dev_hard_start_xmit(), which allows
      sending/processing an entire skb list.
      
      This patch implements qdisc bulk dequeue, by allowing multiple packets
      to be dequeued in dequeue_skb().
      
      The optimization principle for this is twofold: (1) amortize
      locking cost and (2) avoid the expensive tailptr update for notifying HW.
       (1) Several packets are dequeued while holding the qdisc root_lock,
      amortizing locking cost over several packets.  The dequeued SKB list is
      processed under the TXQ lock in dev_hard_start_xmit(), thus also
      amortizing the cost of the TXQ lock.
       (2) Furthermore, dev_hard_start_xmit() will utilize the skb->xmit_more
      API to delay the HW tailptr update, which also reduces the cost per
      packet.
      
      One restriction of the new API is that every SKB must belong to the
      same TXQ.  This patch takes the easy way out by restricting bulk
      dequeue to qdiscs with the TCQ_F_ONETXQUEUE flag, which specifies that
      the qdisc has only a single TXQ attached.
      
      Some detail about the flow: dev_hard_start_xmit() will process the skb
      list and transmit packets individually towards the driver (see
      xmit_one()).  In case the driver stops midway through the list, the
      remaining skb list is returned by dev_hard_start_xmit().  In
      sch_direct_xmit() this returned list is requeued by dev_requeue_skb().
      
      To avoid overshooting the HW limits, which results in requeuing, the
      patch limits the amount of bytes dequeued, based on the driver's BQL
      limits.  In effect, bulking will only happen for BQL-enabled drivers.
      
      Small amounts of extra HoL blocking (2x MTU / 0.24ms) were
      measured at 100Mbit/s with bulking of 8 packets, but the
      oscillating nature of the measurement indicates that something
      like sched latency might be causing this effect. More comparisons
      show that this oscillation occasionally goes away. Thus, we
      disregard this artifact completely and remove any "magic" bulking
      limit.
      
      For now, as a conservative approach, stop bulking when seeing TSO and
      segmented GSO packets.  They already benefit from bulking on their own.
      A followup patch adds this, to allow easier bisectability for finding
      regressions.
      
      Joint work with Hannes, Daniel and Florian.
      Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
      Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
      Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
      Signed-off-by: Florian Westphal <fw@strlen.de>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      5772e9a3
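
      A simplified sketch of the byte-limited bulking described above: the first
      packet is always dequeued, then bulking continues only while the driver's
      BQL byte budget has room, which is why bulking effectively happens only for
      BQL-enabled drivers. Mock structures, not the kernel's
      try_bulk_dequeue_skb():

      #include <stddef.h>

      struct pkt {
              struct pkt  *next;
              unsigned int len;
      };

      struct txq {
              struct pkt  *head;              /* qdisc queue feeding this TXQ */
              int          bql_budget;        /* bytes BQL still allows in flight */
      };

      static struct pkt *bulk_dequeue_bql(struct txq *q)
      {
              struct pkt *head = q->head, *tail = head;
              int budget;

              if (!head)
                      return NULL;
              q->head = head->next;
              budget = q->bql_budget - (int)head->len;

              while (budget > 0 && q->head) {  /* keep bulking within BQL budget */
                      struct pkt *p = q->head;

                      q->head = p->next;
                      budget -= (int)p->len;
                      tail->next = p;
                      tail = p;
              }
              tail->next = NULL;
              return head;    /* one list: one root_lock hold, one tailptr update */
      }
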
  25. 30 Sep 2014, 1 commit
    • net: sched: make bstats per cpu and estimator RCU safe · 22e0f8b9
      Committed by John Fastabend
      In order to run qdiscs without locking, statistics and estimators
      need to be handled correctly.
      
      To resolve this, make the bstats per-CPU. And because this is
      only needed for qdiscs that run without locks, which will not be the
      case for most qdiscs in the near future, only create percpu
      stats when a qdisc sets the TCQ_F_CPUSTATS flag.
      
      Next, because estimators use the bstats to calculate packets per
      second and bytes per second, the estimator code paths are updated
      to use the per-CPU statistics.
      Signed-off-by: John Fastabend <john.r.fastabend@intel.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      22e0f8b9
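
      A hedged model of the per-CPU byte/packet counters: the fast path bumps
      only its own CPU's slot with no qdisc lock, and the estimator / dump path
      folds all CPUs into one snapshot. A plain C array indexed by CPU id stands
      in for the kernel's real percpu allocations guarded by TCQ_F_CPUSTATS:

      #include <stdint.h>
      #include <stdio.h>

      #define NR_CPUS 8

      struct bstats {
              uint64_t bytes;
              uint64_t packets;
      };

      static struct bstats percpu_bstats[NR_CPUS];

      /* Fast path: each CPU touches only its own slot, so no qdisc lock is needed. */
      static void bstats_update(int cpu, unsigned int len)
      {
              percpu_bstats[cpu].bytes   += len;
              percpu_bstats[cpu].packets += 1;
      }

      /* Slow path (estimator, stats dump): fold all CPUs into one snapshot. */
      static struct bstats bstats_read(void)
      {
              struct bstats sum = { 0, 0 };

              for (int cpu = 0; cpu < NR_CPUS; cpu++) {
                      sum.bytes   += percpu_bstats[cpu].bytes;
                      sum.packets += percpu_bstats[cpu].packets;
              }
              return sum;
      }

      int main(void)
      {
              bstats_update(0, 1500);
              bstats_update(3, 64);
              printf("bytes=%llu packets=%llu\n",
                     (unsigned long long)bstats_read().bytes,
                     (unsigned long long)bstats_read().packets);
              return 0;
      }
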
  26. 20 Sep 2014, 1 commit
  27. 14 Sep 2014, 2 commits
  28. 04 Sep 2014, 1 commit
  29. 03 Sep 2014, 1 commit
  30. 02 Sep 2014, 2 commits