提交 · 808e7ac0bdef31204184904f6b3ea356a30a9ed5 · OpenHarmony / kernel_linux

04 10月, 2014 2 次提交

qdisc: dequeue bulking also pickup GSO/TSO packets · 808e7ac0

由 Jesper Dangaard Brouer 提交于 10月 01, 2014

The TSO and GSO segmented packets already benefit from bulking
on their own.

The TSO packets have always taken advantage of the only updating
the tailptr once for a large packet.

The GSO segmented packets have recently taken advantage of
bulking xmit_more API, via merge commit 53fda7f7 ("Merge
branch 'xmit_list'"), specifically via commit 7f2e870f ("net:
Move main gso loop out of dev_hard_start_xmit() into helper.")
allowing qdisc requeue of remaining list.  And via commit
ce93718f ("net: Don't keep around original SKB when we
software segment GSO frames.").

This patch allow further bulking of TSO/GSO packets together,
when dequeueing from the qdisc.

Testing:
 Measuring HoL (Head-of-Line) blocking for TSO and GSO, with
netperf-wrapper. Bulking several TSO show no performance regressions
(requeues were in the area 32 requeues/sec).

Bulking several GSOs does show small regression or very small
improvement (requeues were in the area 8000 requeues/sec).

 Using ixgbe 10Gbit/s with GSO bulking, we can measure some additional
latency. Base-case, which is "normal" GSO bulking, sees varying
high-prio queue delay between 0.38ms to 0.47ms.  Bulking several GSOs
together, result in a stable high-prio queue delay of 0.50ms.

 Using igb at 100Mbit/s with GSO bulking, shows an improvement.
Base-case sees varying high-prio queue delay between 2.23ms to 2.35ms
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

808e7ac0

qdisc: bulk dequeue support for qdiscs with TCQ_F_ONETXQUEUE · 5772e9a3

由 Jesper Dangaard Brouer 提交于 10月 01, 2014

Based on DaveM's recent API work on dev_hard_start_xmit(), that allows
sending/processing an entire skb list.

This patch implements qdisc bulk dequeue, by allowing multiple packets
to be dequeued in dequeue_skb().

The optimization principle for this is two fold, (1) to amortize
locking cost and (2) avoid expensive tailptr update for notifying HW.
 (1) Several packets are dequeued while holding the qdisc root_lock,
amortizing locking cost over several packet.  The dequeued SKB list is
processed under the TXQ lock in dev_hard_start_xmit(), thus also
amortizing the cost of the TXQ lock.
 (2) Further more, dev_hard_start_xmit() will utilize the skb->xmit_more
API to delay HW tailptr update, which also reduces the cost per
packet.

One restriction of the new API is that every SKB must belong to the
same TXQ.  This patch takes the easy way out, by restricting bulk
dequeue to qdisc's with the TCQ_F_ONETXQUEUE flag, that specifies the
qdisc only have attached a single TXQ.

Some detail about the flow; dev_hard_start_xmit() will process the skb
list, and transmit packets individually towards the driver (see
xmit_one()).  In case the driver stops midway in the list, the
remaining skb list is returned by dev_hard_start_xmit().  In
sch_direct_xmit() this returned list is requeued by dev_requeue_skb().

To avoid overshooting the HW limits, which results in requeuing, the
patch limits the amount of bytes dequeued, based on the drivers BQL
limits.  In-effect bulking will only happen for BQL enabled drivers.

Small amounts for extra HoL blocking (2x MTU/0.24ms) were
measured at 100Mbit/s, with bulking 8 packets, but the
oscillating nature of the measurement indicate something, like
sched latency might be causing this effect. More comparisons
show, that this oscillation goes away occationally. Thus, we
disregard this artifact completely and remove any "magic" bulking
limit.

For now, as a conservative approach, stop bulking when seeing TSO and
segmented GSO packets.  They already benefit from bulking on their own.
A followup patch add this, to allow easier bisect-ability for finding
regressions.

Jointed work with Hannes, Daniel and Florian.
Signed-off-by: NJesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: NHannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: NDaniel Borkmann <dborkman@redhat.com>
Signed-off-by: NFlorian Westphal <fw@strlen.de>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

5772e9a3

30 9月, 2014 1 次提交

net: sched: make bstats per cpu and estimator RCU safe · 22e0f8b9

由 John Fastabend 提交于 9月 28, 2014

In order to run qdisc's without locking statistics and estimators
need to be handled correctly.

To resolve bstats make the statistics per cpu. And because this is
only needed for qdiscs that are running without locks which is not
the case for most qdiscs in the near future only create percpu
stats when qdiscs set the TCQ_F_CPUSTATS flag.

Next because estimators use the bstats to calculate packets per
second and bytes per second the estimator code paths are updated
to use the per cpu statistics.
Signed-off-by: NJohn Fastabend <john.r.fastabend@intel.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

22e0f8b9

20 9月, 2014 1 次提交

net: sched: use __skb_queue_head_init() where applicable · ab34f648

由 Eric Dumazet 提交于 9月 17, 2014

pfifo_fast and htb use skb lists, without needing their spinlocks.
(They instead use the standard qdisc lock)

We can use __skb_queue_head_init() instead of skb_queue_head_init()
to be consistent.
Signed-off-by: NEric Dumazet <edumazet@google.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

ab34f648

14 9月, 2014 2 次提交

net: qdisc: use rcu prefix and silence sparse warnings · 46e5da40

由 John Fastabend 提交于 9月 12, 2014

Add __rcu notation to qdisc handling by doing this we can make
smatch output more legible. And anyways some of the cases should
be using rcu_dereference() see qdisc_all_tx_empty(),
qdisc_tx_chainging(), and so on.

Also *wake_queue() API is commonly called from driver timer routines
without rcu lock or rtnl lock. So I added rcu_read_lock() blocks
around netif_wake_subqueue and netif_tx_wake_queue.
Signed-off-by: NJohn Fastabend <john.r.fastabend@intel.com>
Acked-by: NEric Dumazet <edumazet@google.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

46e5da40

net: qdisc: use rcu prefix and silence sparse warnings · b26b0d1e

由 John Fastabend 提交于 9月 12, 2014

Add __rcu notation to qdisc handling by doing this we can make
smatch output more legible. And anyways some of the cases should
be using rcu_dereference() see qdisc_all_tx_empty(),
qdisc_tx_chainging(), and so on.

Also *wake_queue() API is commonly called from driver timer routines
without rcu lock or rtnl lock. So I added rcu_read_lock() blocks
around netif_wake_subqueue and netif_tx_wake_queue.
Signed-off-by: NJohn Fastabend <john.r.fastabend@intel.com>
Acked-by: NEric Dumazet <edumazet@google.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

b26b0d1e

04 9月, 2014 1 次提交

qdisc: exit case fixes for skb list handling in qdisc layer · 3f3c7eec

由 Jesper Dangaard Brouer 提交于 9月 03, 2014

More minor fixes to merge commit 53fda7f7 (Merge branch 'xmit_list')
that allows us to work with a list of SKBs.

Fixing exit cases in qdisc_reset() and qdisc_destroy(), where a
leftover requeued SKB (qdisc->gso_skb) can have the potential of
being a skb list, thus use kfree_skb_list().

This is a followup to commit 10770bc2 ("qdisc: adjustments for
API allowing skb list xmits").
Signed-off-by: NJesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

3f3c7eec

03 9月, 2014 1 次提交

qdisc: adjustments for API allowing skb list xmits · 10770bc2

由 Jesper Dangaard Brouer 提交于 9月 02, 2014

Minor adjustments for merge commit 53fda7f7 (Merge branch 'xmit_list')
that allows us to work with a list of SKBs.

Update code doc to function sch_direct_xmit().

In handle_dev_cpu_collision() use kfree_skb_list() in error handling.
Signed-off-by: NJesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

10770bc2

02 9月, 2014 2 次提交
- D
  net: Don't keep around original SKB when we software segment GSO frames. · ce93718f
  由 David S. Miller 提交于 8月 30, 2014
```
Just maintain the list properly by returning the head of the remaining
SKB list from dev_hard_start_xmit().
Signed-off-by: NDavid S. Miller <davem@davemloft.net>
```
  ce93718f
- D
  net: Validate xmit SKBs right when we pull them out of the qdisc. · 50cbe9ab
  由 David S. Miller 提交于 8月 30, 2014
```
Signed-off-by: NDavid S. Miller <davem@davemloft.net>
```
  50cbe9ab
30 8月, 2014 1 次提交

net: add skb_get_tx_queue() helper · 10c51b56

由 Daniel Borkmann 提交于 8月 27, 2014

Replace occurences of skb_get_queue_mapping() and follow-up
netdev_get_tx_queue() with an actual helper function.
Signed-off-by: NDaniel Borkmann <dborkman@redhat.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

10c51b56

02 7月, 2014 1 次提交

net: fix some typos in comment · 9bf2b8c2

由 Ying Xue 提交于 6月 26, 2014

In commit 37112105("net:
QDISC_STATE_RUNNING dont need atomic bit ops") the
__QDISC_STATE_RUNNING is renamed to __QDISC___STATE_RUNNING,
but the old names existing in comment are not replaced with
the new name completely.
Signed-off-by: NYing Xue <ying.xue@windriver.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

9bf2b8c2

01 4月, 2014 1 次提交

net-sysfs: expose number of carrier on/off changes · 2d3b479d

由 david decotigny 提交于 3月 29, 2014

This allows to monitor carrier on/off transitions and detect link
flapping issues:
 - new /sys/class/net/X/carrier_changes
 - new rtnetlink IFLA_CARRIER_CHANGES (getlink)

Tested:
  - grep . /sys/class/net/*/carrier_changes
    + ip link set dev X down/up
    + plug/unplug cable
  - updated iproute2: prints IFLA_CARRIER_CHANGES
  - iproute2 20121211-2 (debian): unchanged behavior
Signed-off-by: NDavid Decotigny <decot@googlers.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

2d3b479d

11 1月, 2014 1 次提交

net: core: explicitly select a txq before doing l2 forwarding · f663dd9a

由 Jason Wang 提交于 1月 10, 2014

Currently, the tx queue were selected implicitly in ndo_dfwd_start_xmit(). The
will cause several issues:

- NETIF_F_LLTX were removed for macvlan, so txq lock were done for macvlan
  instead of lower device which misses the necessary txq synchronization for
  lower device such as txq stopping or frozen required by dev watchdog or
  control path.
- dev_hard_start_xmit() was called with NULL txq which bypasses the net device
  watchdog.
- dev_hard_start_xmit() does not check txq everywhere which will lead a crash
  when tso is disabled for lower device.

Fix this by explicitly introducing a new param for .ndo_select_queue() for just
selecting queues in the case of l2 forwarding offload. netdev_pick_tx() was also
extended to accept this parameter and dev_queue_xmit_accel() was used to do l2
forwarding transmission.

With this fixes, NETIF_F_LLTX could be preserved for macvlan and there's no need
to check txq against NULL in dev_hard_start_xmit(). Also there's no need to keep
a dedicated ndo_dfwd_start_xmit() and we can just reuse the code of
dev_queue_xmit() to do the transmission.

In the future, it was also required for macvtap l2 forwarding support since it
provides a necessary synchronization method.

Cc: John Fastabend <john.r.fastabend@intel.com>
Cc: Neil Horman <nhorman@tuxdriver.com>
Cc: e1000-devel@lists.sourceforge.net
Signed-off-by: NJason Wang <jasowang@redhat.com>
Acked-by: NNeil Horman <nhorman@tuxdriver.com>
Acked-by: NJohn Fastabend <john.r.fastabend@intel.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

f663dd9a

14 12月, 2013 1 次提交

pkt_sched: set root qdisc before change() in attach_default_qdiscs() · e57a784d

由 Eric Dumazet 提交于 12月 12, 2013

After commit 95dc1929 ("pkt_sched: give visibility to mq slave
qdiscs") we call disc_list_add() while the device qdisc might be
the noop_qdisc one.

This shows up as duplicates in "tc qdisc show", as all inactive devices
point to noop_qdisc.

Fix this by setting dev->qdisc to the new qdisc before calling
ops->change() in attach_default_qdiscs()

Add a WARN_ON_ONCE() to catch any future similar problem.
Signed-off-by: NEric Dumazet <edumazet@google.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

e57a784d

11 12月, 2013 1 次提交

net_sched: change "foo* bar" to "foo *bar" · 82d567c2

由 Yang Yingliang 提交于 12月 10, 2013

"foo* bar" or "foo * bar" should be "foo *bar".
Signed-off-by: NYang Yingliang <yangyingliang@huawei.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

82d567c2

08 11月, 2013 1 次提交

net: Add layer 2 hardware acceleration operations for macvlan devices · a6cc0cfa

由 John Fastabend 提交于 11月 06, 2013

Add a operations structure that allows a network interface to export
the fact that it supports package forwarding in hardware between
physical interfaces and other mac layer devices assigned to it (such
as macvlans). This operaions structure can be used by virtual mac
devices to bypass software switching so that forwarding can be done
in hardware more efficiently.
Signed-off-by: NJohn Fastabend <john.r.fastabend@intel.com>
Signed-off-by: NNeil Horman <nhorman@tuxdriver.com>
CC: Andy Gospodarek <andy@greyhouse.net>
CC: "David S. Miller" <davem@davemloft.net>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

a6cc0cfa

08 10月, 2013 1 次提交

net: Separate the close_list and the unreg_list v2 · 5cde2829

由 Eric W. Biederman 提交于 10月 05, 2013

Separate the unreg_list and the close_list in dev_close_many preventing
dev_close_many from permuting the unreg_list.  The permutations of the
unreg_list have resulted in cases where the loopback device is accessed
it has been freed in code such as dst_ifdown.  Resulting in subtle memory
corruption.

This is the second bug from sharing the storage between the close_list
and the unreg_list.  The issues that crop up with sharing are
apparently too subtle to show up in normal testing or usage, so let's
forget about being clever and use two separate lists.

v2: Make all callers pass in a close_list to dev_close_many
Signed-off-by: N"Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

5cde2829

21 9月, 2013 1 次提交

net_sched: add u64 rate to psched_ratecfg_precompute() · 3e1e3aae

由 Eric Dumazet 提交于 9月 19, 2013

Add an extra u64 rate parameter to psched_ratecfg_precompute()
so that some qdisc can opt-in for 64bit rates in the future,
to overcome the ~34 Gbits limit.

psched_ratecfg_getrate() reports a legacy structure to
tc utility, so if actual rate is above the 32bit rate field,
cap it to the 34Gbit limit.
Signed-off-by: NEric Dumazet <edumazet@google.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

3e1e3aae

01 9月, 2013 2 次提交

qdisc: fix build with !CONFIG_NET_SCHED · 34aedd3f

由 stephen hemminger 提交于 8月 31, 2013

Multiqueue scheduler refers to default_qdisc_ops; therefore the
variable definition needs to be moved to handle case where net
scheduler API is not available.
Signed-off-by: NStephen Hemminger <stephen@networkplumber.org>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

34aedd3f

qdisc: make args to qdisc_create_default const · d2a7f269

由 stephen hemminger 提交于 8月 31, 2013

Fixes warnings introduced by the qdisc default patch.
Signed-off-by: NStephen Hemminger <stephen@networkplumber.org>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

d2a7f269

31 8月, 2013 1 次提交

qdisc: allow setting default queuing discipline · 6da7c8fc

由 stephen hemminger 提交于 8月 27, 2013

By default, the pfifo_fast queue discipline has been used by default
for all devices. But we have better choices now.

This patch allow setting the default queueing discipline with sysctl.
This allows easy use of better queueing disciplines on all devices
without having to use tc qdisc scripts. It is intended to allow
an easy path for distributions to make fq_codel or sfq the default
qdisc.

This patch also makes pfifo_fast more of a first class qdisc, since
it is now possible to manually override the default and explicitly
use pfifo_fast. The behavior for systems who do not use the sysctl
is unchanged, they still get pfifo_fast

Also removes leftover random # in sysctl net core.
Signed-off-by: NStephen Hemminger <stephen@networkplumber.org>
Acked-by: NEric Dumazet <edumazet@google.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

6da7c8fc

15 8月, 2013 1 次提交

net_sched: restore "linklayer atm" handling · 8a8e3d84

由 Jesper Dangaard Brouer 提交于 8月 14, 2013

commit 56b765b7 ("htb: improved accuracy at high rates")
broke the "linklayer atm" handling.

 tc class add ... htb rate X ceil Y linklayer atm

The linklayer setting is implemented by modifying the rate table
which is send to the kernel.  No direct parameter were
transferred to the kernel indicating the linklayer setting.

The commit 56b765b7 ("htb: improved accuracy at high rates")
removed the use of the rate table system.

To keep compatible with older iproute2 utils, this patch detects
the linklayer by parsing the rate table.  It also supports future
versions of iproute2 to send this linklayer parameter to the
kernel directly. This is done by using the __reserved field in
struct tc_ratespec, to convey the choosen linklayer option, but
only using the lower 4 bits of this field.

Linklayer detection is limited to speeds below 100Mbit/s, because
at high rates the rtab is gets too inaccurate, so bad that
several fields contain the same values, this resembling the ATM
detect.  Fields even start to contain "0" time to send, e.g. at
1000Mbit/s sending a 96 bytes packet cost "0", thus the rtab have
been more broken than we first realized.
Signed-off-by: NJesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

8a8e3d84

06 8月, 2013 1 次提交

net_sched: make dev_trans_start return vlan's real dev trans_start · 07ce76aa

由 nikolay@redhat.com 提交于 8月 03, 2013

Vlan devices are LLTX and don't update their own trans_start, so if
dev_trans_start has to be called with a vlan device then 0 or a stale
value will be returned. Currently the bonding is the only such user, and
it's needed for proper arp monitoring when the slaves are vlans.
Fix this by extracting the vlan's real device trans_start.
Suggested-by: NDavid Miller <davem@davemloft.net>
Signed-off-by: NNikolay Aleksandrov <nikolay@redhat.com>
Acked-by: NVeaceslav Falico <vfalico@redhat.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

07ce76aa

12 6月, 2013 1 次提交

net_sched: psched_ratecfg_precompute() improvements · 130d3d68

由 Eric Dumazet 提交于 6月 06, 2013

Before allowing 64bits bytes rates, refactor
psched_ratecfg_precompute() to get better comments
and increased accuracy.

rate_bps field is renamed to rate_bytes_ps, as we only
have to worry about bytes per second.
Signed-off-by: NEric Dumazet <edumazet@google.com>
Cc: Ben Greear <greearb@candelatech.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

130d3d68

03 6月, 2013 1 次提交

net_sched: restore "overhead xxx" handling · 01cb71d2

由 Eric Dumazet 提交于 6月 02, 2013

commit 56b765b7 ("htb: improved accuracy at high rates")
broke the "overhead xxx" handling, as well as the "linklayer atm"
attribute.

tc class add ... htb rate X ceil Y linklayer atm overhead 10

This patch restores the "overhead xxx" handling, for htb, tbf
and act_police

The "linklayer atm" thing needs a separate fix.
Reported-by: NJesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: NEric Dumazet <edumazet@google.com>
Cc: Vimalkumar <j.vimal@gmail.com>
Cc: Jiri Pirko <jpirko@redhat.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

01cb71d2

28 3月, 2013 1 次提交

sch: add missing u64 in psched_ratecfg_precompute() · ea872d77

由 Sergey Popovich 提交于 3月 27, 2013

It seems that commit

commit 292f1c7f
Author: Jiri Pirko <jiri@resnulli.us>
Date:   Tue Feb 12 00:12:03 2013 +0000

    sch: make htb_rate_cfg and functions around that generic

adds little regression.

Before:

# tc qdisc add dev eth0 root handle 1: htb default ffff
# tc class add dev eth0 classid 1:ffff htb rate 5Gbit
# tc -s class show dev eth0
class htb 1:ffff root prio 0 rate 5000Mbit ceil 5000Mbit burst 625b cburst
625b
 Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
 rate 0bit 0pps backlog 0b 0p requeues 0
 lended: 0 borrowed: 0 giants: 0
 tokens: 31 ctokens: 31

After:

# tc qdisc add dev eth0 root handle 1: htb default ffff
# tc class add dev eth0 classid 1:ffff htb rate 5Gbit
# tc -s class show dev eth0
class htb 1:ffff root prio 0 rate 1544Mbit ceil 1544Mbit burst 625b cburst
625b
 Sent 5073 bytes 41 pkt (dropped 0, overlimits 0 requeues 0)
 rate 1976bit 2pps backlog 0b 0p requeues 0
 lended: 41 borrowed: 0 giants: 0
 tokens: 1802 ctokens: 1802

This probably due to lost u64 cast of rate parameter in
psched_ratecfg_precompute() (net/sched/sch_generic.c).
Signed-off-by: NSergey Popovich <popovich_sergei@mail.ru>
Acked-by: NEric Dumazet <edumazet@google.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

ea872d77

13 2月, 2013 1 次提交

sch: make htb_rate_cfg and functions around that generic · 292f1c7f

由 Jiri Pirko 提交于 2月 12, 2013

As it is going to be used in tbf as well, push these to generic code.
Signed-off-by: NJiri Pirko <jiri@resnulli.us>
Acked-by: NEric Dumazet <edumazet@google.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

292f1c7f

12 12月, 2012 1 次提交

pkt_sched: avoid requeues if possible · 1abbe139

由 Eric Dumazet 提交于 12月 11, 2012

With BQL being deployed, we can more likely have following behavior :

We dequeue a packet from qdisc in dequeue_skb(), then we realize target
tx queue is in XOFF state in sch_direct_xmit(), and we have to hold the
skb into gso_skb for later.

This shows in stats (tc -s qdisc dev eth0) as requeues.

Problem of these requeues is that high priority packets can not be
dequeued as long as this (possibly low prio and big TSO packet) is not
removed from gso_skb.

At 1Gbps speed, a full size TSO packet is 500 us of extra latency.

In some cases, we know that all packets dequeued from a qdisc are
for a particular and known txq :

- If device is non multi queue
- For all MQ/MQPRIO slave qdiscs

This patch introduces a new qdisc flag, TCQ_F_ONETXQUEUE to mark
this capability, so that dequeue_skb() is allowed to dequeue a packet
only if the associated txq is not stopped.

This indeed reduce latencies for high prio packets (or improve fairness
with sfq/fq_codel), and almost remove qdisc 'requeues'.
Signed-off-by: NEric Dumazet <edumazet@google.com>
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Cc: John Fastabend <john.r.fastabend@intel.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

1abbe139

06 9月, 2012 1 次提交

net: qdisc busylock needs lockdep annotations · 23d3b8bf

由 Eric Dumazet 提交于 9月 05, 2012

It seems we need to provide ability for stacked devices
to use specific lock_class_key for sch->busylock

We could instead default l2tpeth tx_queue_len to 0 (no qdisc), but
a user might use a qdisc anyway.

(So same fixes are probably needed on non LLTX stacked drivers)

Noticed while stressing L2TPV3 setup :

======================================================
 [ INFO: possible circular locking dependency detected ]
 3.6.0-rc3+ #788 Not tainted
 -------------------------------------------------------
 netperf/4660 is trying to acquire lock:
  (l2tpsock){+.-...}, at: [<ffffffffa0208db2>] l2tp_xmit_skb+0x172/0xa50 [l2tp_core]

 but task is already holding lock:
  (&(&sch->busylock)->rlock){+.-...}, at: [<ffffffff81596595>] dev_queue_xmit+0xd75/0xe00

 which lock already depends on the new lock.

 the existing dependency chain (in reverse order) is:

 -> #1 (&(&sch->busylock)->rlock){+.-...}:
        [<ffffffff810a5df0>] lock_acquire+0x90/0x200
        [<ffffffff817499fc>] _raw_spin_lock_irqsave+0x4c/0x60
        [<ffffffff81074872>] __wake_up+0x32/0x70
        [<ffffffff8136d39e>] tty_wakeup+0x3e/0x80
        [<ffffffff81378fb3>] pty_write+0x73/0x80
        [<ffffffff8136cb4c>] tty_put_char+0x3c/0x40
        [<ffffffff813722b2>] process_echoes+0x142/0x330
        [<ffffffff813742ab>] n_tty_receive_buf+0x8fb/0x1230
        [<ffffffff813777b2>] flush_to_ldisc+0x142/0x1c0
        [<ffffffff81062818>] process_one_work+0x198/0x760
        [<ffffffff81063236>] worker_thread+0x186/0x4b0
        [<ffffffff810694d3>] kthread+0x93/0xa0
        [<ffffffff81753e24>] kernel_thread_helper+0x4/0x10

 -> #0 (l2tpsock){+.-...}:
        [<ffffffff810a5288>] __lock_acquire+0x1628/0x1b10
        [<ffffffff810a5df0>] lock_acquire+0x90/0x200
        [<ffffffff817498c1>] _raw_spin_lock+0x41/0x50
        [<ffffffffa0208db2>] l2tp_xmit_skb+0x172/0xa50 [l2tp_core]
        [<ffffffffa021a802>] l2tp_eth_dev_xmit+0x32/0x60 [l2tp_eth]
        [<ffffffff815952b2>] dev_hard_start_xmit+0x502/0xa70
        [<ffffffff815b63ce>] sch_direct_xmit+0xfe/0x290
        [<ffffffff81595a05>] dev_queue_xmit+0x1e5/0xe00
        [<ffffffff815d9d60>] ip_finish_output+0x3d0/0x890
        [<ffffffff815db019>] ip_output+0x59/0xf0
        [<ffffffff815da36d>] ip_local_out+0x2d/0xa0
        [<ffffffff815da5a3>] ip_queue_xmit+0x1c3/0x680
        [<ffffffff815f4192>] tcp_transmit_skb+0x402/0xa60
        [<ffffffff815f4a94>] tcp_write_xmit+0x1f4/0xa30
        [<ffffffff815f5300>] tcp_push_one+0x30/0x40
        [<ffffffff815e6672>] tcp_sendmsg+0xe82/0x1040
        [<ffffffff81614495>] inet_sendmsg+0x125/0x230
        [<ffffffff81576cdc>] sock_sendmsg+0xdc/0xf0
        [<ffffffff81579ece>] sys_sendto+0xfe/0x130
        [<ffffffff81752c92>] system_call_fastpath+0x16/0x1b
  Possible unsafe locking scenario:

        CPU0                    CPU1
        ----                    ----
   lock(&(&sch->busylock)->rlock);
                                lock(l2tpsock);
                                lock(&(&sch->busylock)->rlock);
   lock(l2tpsock);

  *** DEADLOCK ***

 5 locks held by netperf/4660:
  #0:  (sk_lock-AF_INET){+.+.+.}, at: [<ffffffff815e581c>] tcp_sendmsg+0x2c/0x1040
  #1:  (rcu_read_lock){.+.+..}, at: [<ffffffff815da3e0>] ip_queue_xmit+0x0/0x680
  #2:  (rcu_read_lock_bh){.+....}, at: [<ffffffff815d9ac5>] ip_finish_output+0x135/0x890
  #3:  (rcu_read_lock_bh){.+....}, at: [<ffffffff81595820>] dev_queue_xmit+0x0/0xe00
  #4:  (&(&sch->busylock)->rlock){+.-...}, at: [<ffffffff81596595>] dev_queue_xmit+0xd75/0xe00

 stack backtrace:
 Pid: 4660, comm: netperf Not tainted 3.6.0-rc3+ #788
 Call Trace:
  [<ffffffff8173dbf8>] print_circular_bug+0x1fb/0x20c
  [<ffffffff810a5288>] __lock_acquire+0x1628/0x1b10
  [<ffffffff810a334b>] ? check_usage+0x9b/0x4d0
  [<ffffffff810a3f44>] ? __lock_acquire+0x2e4/0x1b10
  [<ffffffff810a5df0>] lock_acquire+0x90/0x200
  [<ffffffffa0208db2>] ? l2tp_xmit_skb+0x172/0xa50 [l2tp_core]
  [<ffffffff817498c1>] _raw_spin_lock+0x41/0x50
  [<ffffffffa0208db2>] ? l2tp_xmit_skb+0x172/0xa50 [l2tp_core]
  [<ffffffffa0208db2>] l2tp_xmit_skb+0x172/0xa50 [l2tp_core]
  [<ffffffffa021a802>] l2tp_eth_dev_xmit+0x32/0x60 [l2tp_eth]
  [<ffffffff815952b2>] dev_hard_start_xmit+0x502/0xa70
  [<ffffffff81594e0e>] ? dev_hard_start_xmit+0x5e/0xa70
  [<ffffffff81595961>] ? dev_queue_xmit+0x141/0xe00
  [<ffffffff815b63ce>] sch_direct_xmit+0xfe/0x290
  [<ffffffff81595a05>] dev_queue_xmit+0x1e5/0xe00
  [<ffffffff81595820>] ? dev_hard_start_xmit+0xa70/0xa70
  [<ffffffff815d9d60>] ip_finish_output+0x3d0/0x890
  [<ffffffff815d9ac5>] ? ip_finish_output+0x135/0x890
  [<ffffffff815db019>] ip_output+0x59/0xf0
  [<ffffffff815da36d>] ip_local_out+0x2d/0xa0
  [<ffffffff815da5a3>] ip_queue_xmit+0x1c3/0x680
  [<ffffffff815da3e0>] ? ip_local_out+0xa0/0xa0
  [<ffffffff815f4192>] tcp_transmit_skb+0x402/0xa60
  [<ffffffff815fa25e>] ? tcp_md5_do_lookup+0x18e/0x1a0
  [<ffffffff815f4a94>] tcp_write_xmit+0x1f4/0xa30
  [<ffffffff815f5300>] tcp_push_one+0x30/0x40
  [<ffffffff815e6672>] tcp_sendmsg+0xe82/0x1040
  [<ffffffff81614495>] inet_sendmsg+0x125/0x230
  [<ffffffff81614370>] ? inet_create+0x6b0/0x6b0
  [<ffffffff8157e6e2>] ? sock_update_classid+0xc2/0x3b0
  [<ffffffff8157e750>] ? sock_update_classid+0x130/0x3b0
  [<ffffffff81576cdc>] sock_sendmsg+0xdc/0xf0
  [<ffffffff81162579>] ? fget_light+0x3f9/0x4f0
  [<ffffffff81579ece>] sys_sendto+0xfe/0x130
  [<ffffffff810a69ad>] ? trace_hardirqs_on+0xd/0x10
  [<ffffffff8174a0b0>] ? _raw_spin_unlock_irq+0x30/0x50
  [<ffffffff810757e3>] ? finish_task_switch+0x83/0xf0
  [<ffffffff810757a6>] ? finish_task_switch+0x46/0xf0
  [<ffffffff81752cb7>] ? sysret_check+0x1b/0x56
  [<ffffffff81752c92>] system_call_fastpath+0x16/0x1b
Signed-off-by: NEric Dumazet <edumazet@google.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

23d3b8bf

15 8月, 2012 1 次提交

net: move and rename netif_notify_peers() · ee89bab1

由 Amerigo Wang 提交于 8月 09, 2012

I believe net/core/dev.c is a better place for netif_notify_peers(),
because other net event notify functions also stay in this file.

And rename it to netdev_notify_peers().

Cc: David S. Miller <davem@davemloft.net>
Cc: Ian Campbell <Ian.Campbell@citrix.com>
Signed-off-by: NCong Wang <amwang@redhat.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

ee89bab1

16 5月, 2012 1 次提交

net: Convert net_ratelimit uses to net_<level>_ratelimited · e87cc472

由 Joe Perches 提交于 5月 13, 2012

Standardize the net core ratelimited logging functions.

Coalesce formats, align arguments.
Change a printk then vprintk sequence to use printf extension %pV.
Signed-off-by: NJoe Perches <joe@perches.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

e87cc472

02 4月, 2012 1 次提交

pkt_sched: Stop using NLA_PUT*(). · 1b34ec43

由 David S. Miller 提交于 3月 29, 2012

These macros contain a hidden goto, and are thus extremely error
prone and make code hard to audit.
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

1b34ec43

30 11月, 2011 1 次提交

net: Add queue state xoff flag for stack · 73466498

由 Tom Herbert 提交于 11月 28, 2011

Create separate queue state flags so that either the stack or drivers
can turn on XOFF.  Added a set of functions used in the stack to determine
if a queue is really stopped (either by stack or driver)
Signed-off-by: NTom Herbert <therbert@google.com>
Acked-by: NEric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

73466498

17 11月, 2011 1 次提交

net: new counter for tx_timeout errors in sysfs · ccf5ff69

由 david decotigny 提交于 11月 16, 2011

This adds the /sys/class/net/DEV/queues/Q/tx_timeout attribute
containing the total number of timeout events on the given queue. It
is always available with CONFIG_SYSFS, independently of
CONFIG_RPS/XPS.

Credits to Stephen Hemminger for a preliminary version of this patch.

Tested:
  without CONFIG_SYSFS (compilation only)
  with sysfs and without CONFIG_RPS & CONFIG_XPS
  with sysfs and without CONFIG_RPS
  with sysfs and without CONFIG_XPS
  with defaults
Signed-off-by: NDavid Decotigny <david.decotigny@google.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

ccf5ff69

15 7月, 2011 1 次提交

Remove redundant variable/code in __qdisc_run · f0c50c7c

由 Krishna Kumar 提交于 7月 14, 2011

Remove redundant variable "work".
Signed-off-by: NKrishna Kumar <krkumar2@in.ibm.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

f0c50c7c

27 6月, 2011 1 次提交

net_sched: fix dequeuer fairness · d5b8aa1d

由 jamal 提交于 6月 26, 2011

Results on dummy device can be seen in my netconf 2011
slides. These results are for a 10Gige IXGBE intel
nic - on another i5 machine, very similar specs to
the one used in the netconf2011 results.
It turns out - this is a hell lot worse than dummy
and so this patch is even more beneficial for 10G.

Test setup:
----------

System under test sending packets out.
Additional box connected directly dropping packets.
Installed prio qdisc on the eth device and default
netdev default length of 1000 used as is.
The 3 prio bands each were set to 100 (didnt factor in
the results).

5 packet runs were made and the middle 3 picked.

results
-------

The "cpu" column indicates the which cpu the sample
was taken on,
The "Pkt runx" carries the number of packets a cpu
dequeued when forced to be in the "dequeuer" role.
The "avg" for each run is the number of times each
cpu should be a "dequeuer" if the system was fair.

3.0-rc4      (plain)
cpu         Pkt run1        Pkt run2        Pkt run3
================================================
cpu0        21853354        21598183        22199900
cpu1          431058          473476          393159
cpu2          481975          477529          458466
cpu3        23261406        23412299        22894315
avg         11506948        11490372        11486460

3.0-rc4 with patch and default weight 64
cpu 	     Pkt run1        Pkt run2        Pkt run3
================================================
cpu0        13205312        13109359        13132333
cpu1        10189914        10159127        10122270
cpu2        10213871        10124367        10168722
cpu3        13165760        13164767        13096705
avg         11693714        11639405        11630008

As you can see the system is still not perfect but
is a lot better than what it was before...

At the moment we use the old backlog weight, weight_p
which is 64 packets. It seems to be reasonably fine
with that value.
The system could be made more fair if we reduce the
weight_p (as per my presentation), but we are going
to affect the shared backlog weight. Unless deemed
necessary, I think the default value is fine. If not
we could add yet another knob.
Signed-off-by: NJamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

d5b8aa1d

07 6月, 2011 1 次提交

net: Rework netdev_drivername() to avoid warning. · 3019de12

由 David S. Miller 提交于 6月 06, 2011

This interface uses a temporary buffer, but for no real reason.
And now can generate warnings like:

net/sched/sch_generic.c: In function dev_watchdog
net/sched/sch_generic.c:254:10: warning: unused variable drivername

Just return driver->name directly or "".
Reported-by: NConnor Hansen <cmdkhh@gmail.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

3019de12

23 5月, 2011 1 次提交

net: avoid synchronize_rcu() in dev_deactivate_many · 3137663d

由 Eric Dumazet 提交于 5月 19, 2011

dev_deactivate_many() issues one synchronize_rcu() call after qdiscs set
to noop_qdisc.

This call is here to make sure they are no outstanding qdisc-less
dev_queue_xmit calls before returning to caller.

But in dismantle phase, we dont have to wait, because we wont activate
again the device, and we are going to wait one rcu grace period later in
rollback_registered_many().

After this patch, device dismantle uses one synchronize_net() and one
rcu_barrier() call only, so we have a ~30% speedup and a smaller RTNL
latency.
Signed-off-by: NEric Dumazet <eric.dumazet@gmail.com>
CC: Patrick McHardy <kaber@trash.net>,
CC: Ben Greear <greearb@candelatech.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

3137663d

04 3月, 2011 1 次提交

net_sched: reduce fifo qdisc size · d276055c

由 Eric Dumazet 提交于 3月 03, 2011

Because of various alignements [SLUB / qdisc], we use 512 bytes of
memory for one {p|b}fifo qdisc, instead of 256 bytes on 64bit arches and
192 bytes on 32bit ones.

Move the "u32 limit" inside "struct Qdisc" (no impact on other qdiscs)

Change qdisc_alloc(), first trying a regular allocation before an
oversized one.
Signed-off-by: NEric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

d276055c

OpenHarmony / kernel_linux 上一次同步 大约 4 年

OpenHarmony / kernel_linux
上一次同步大约 4 年