Commits · 278b20837511776dc9d5f6ee1c7fabd5479838bb · openanolis / cloud-kernel

02 Aug, 2013 24 commits

bonding: initial RCU conversion · 278b2083

Committed by nikolay@redhat.com on Aug 01, 2013

This patch does the initial bonding conversion to RCU. After it the
following modes are protected by RCU alone: roundrobin, active-backup,
broadcast and xor. Modes ALB/TLB and 3ad still acquire bond->lock for
reading, and will be dealt with later. curr_active_slave needs to be
dereferenced via rcu in the converted modes because the only thing
protecting the slave after this patch is rcu_read_lock, so we need the
proper barrier for weakly ordered archs and to make sure we don't have a
stale pointer. It's not tagged with __rcu yet because there's still work
to be done to remove the curr_slave_lock, so sparse will complain when
rcu_assign_pointer and rcu_dereference are used, but the alternative to use
rcu_dereference_protected would've created much bigger code churn which is
more difficult to test and review. That will be converted in time.

1. Active-backup mode
 1.1 Perf recording while doing iperf -P 4
  - old bonding: iperf spent 0.55% in bonding, system spent 0.29% CPU
                 in bonding
  - new bonding: iperf spent 0.29% in bonding, system spent 0.15% CPU
                 in bonding
 1.2. Bandwidth measurements
  - old bonding: 16.1 gbps consistently
  - new bonding: 17.5 gbps consistently

2. Round-robin mode
 2.1 Perf recording while doing iperf -P 4
  - old bonding: iperf spent 0.51% in bonding, system spent 0.24% CPU
                 in bonding
  - new bonding: iperf spent 0.16% in bonding, system spent 0.11% CPU
                 in bonding
 2.2 Bandwidth measurements
  - old bonding: 8 gbps (variable due to packet reorderings)
  - new bonding: 10 gbps (variable due to packet reorderings)

Of course the latency has improved in all converted modes, and moreover
while doing enslave/release (since it doesn't affect tx anymore).

Also I've stress tested all modes doing enslave/release in a loop while
transmitting traffic.
Signed-off-by: Nikolay Aleksandrov <nikolay@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

278b2083
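
The pointer-publication requirement described in this commit can be illustrated with a rough userspace analogy. The sketch below uses C11 release/acquire atomics in place of rcu_assign_pointer and rcu_dereference; it is not kernel code and not real RCU (there is no grace period and no reclamation), and the names slave_sketch, publish_active and active_slave_id are hypothetical. Acquire ordering is also stronger than what rcu_dereference needs, so treat this only as a picture of the ordering idea.

```c
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical stand-in for struct slave; not the kernel definition. */
struct slave_sketch {
	int id;
	int link_up;
};

/* Stand-in for bond->curr_active_slave; starts out NULL. */
static _Atomic(struct slave_sketch *) curr_active;

/* Writer side, analogous in spirit to rcu_assign_pointer(): the release
 * store orders the slave's initialisation before the pointer publication. */
static void publish_active(struct slave_sketch *s)
{
	atomic_store_explicit(&curr_active, s, memory_order_release);
}

/* Reader side, analogous in spirit to rcu_dereference(): load the pointer
 * once and use only that snapshot, which may be NULL. */
static int active_slave_id(void)
{
	struct slave_sketch *s =
		atomic_load_explicit(&curr_active, memory_order_acquire);

	return s ? s->id : -1;
}
```

The reader takes a single snapshot of the pointer and checks it for NULL before use, which is the same shape the converted xmit paths need once bond->lock no longer protects the slave.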

bonding: factor out slave id tx code and simplify xmit paths · 15077228

Committed by Nikolay Aleksandrov on Aug 01, 2013

I factored out the tx xmit code which relies on slave id in
bond_xmit_slave_id. It is global because later it can be used also in
3ad mode xmit. Unnecessary obvious comments are removed. Active-backup
mode is simplified because bond_dev_queue_xmit always consumes the skb.
bond_xmit_xor becomes one line because of bond_xmit_slave_id.
bond_for_each_slave_from is not used in bond_xmit_slave_id because later
when RCU is used we can avoid an important race condition by using standard
rculist routines.
Signed-off-by: Nikolay Aleksandrov <nikolay@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

15077228

bonding: simplify broadcast_xmit function · 78a646ce

Committed by Nikolay Aleksandrov on Aug 01, 2013

We don't need to start from the curr_active_slave as the frame will be
sent to all eligible slaves anyway, so we remove the unnecessary local
variables, checks and comments, and make it use the standard list API.
This has the nice side-effect that later when it's converted to RCU
a race condition will be avoided which could lead to double packet tx.
Signed-off-by: Nikolay Aleksandrov <nikolay@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

78a646ce

bonding: remove unnecessary read_locks of curr_slave_lock · 71bc3b2d

Committed by nikolay@redhat.com on Aug 01, 2013

In all the cases we already hold bond->lock for reading, so the slave
can't get away and the check != NULL is sufficient. curr_active_slave
can still change after the read_lock is unlocked prior to use of the
dereferenced value, so there's no need for it. It either contains a
valid slave which we use (and can't get away), or it is NULL which is
checked.
In some places the read_lock of curr_slave_lock was left because we need
it not to change while performing some action (e.g. syncing current
active slave's addresses, sending ARP requests through the active slave);
such cases will be dealt with individually while converting to RCU.
Signed-off-by: Nikolay Aleksandrov <nikolay@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

71bc3b2d

bonding: convert to list API and replace bond's custom list · dec1e90e

Committed by nikolay@redhat.com on Aug 01, 2013

This patch aims to remove struct bonding's first_slave and struct
slave's next and prev pointers, and replace them with the standard Linux
list API. The old macros are converted to list API as well and some new
primitives are available now. The checks if there're slaves that used
slave_cnt have been replaced by the list_empty macro.
Also a few small style fixes, changing longest -> shortest line in local
variable declarations, leaving an empty line before return and removing
unnecessary brackets.
This is the first step to gradual RCU conversion.
Signed-off-by: Nikolay Aleksandrov <nikolay@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

dec1e90e
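
To make the list conversion concrete, here is a minimal userspace re-creation of the circular doubly linked list idea, showing how an emptiness test replaces a slave_cnt check and how an embedded node replaces hand-rolled next/prev pointers. This is a sketch only: the helper names (init_list_head, list_is_empty, list_add_tail_sketch, list_del_sketch, slave_node, sum_slave_ids) are hypothetical and are not the kernel's list.h or the bonding driver's macros.

```c
#include <stddef.h>

/* Minimal circular doubly linked list, for illustration only. */
struct list_head {
	struct list_head *next, *prev;
};

static void init_list_head(struct list_head *h)
{
	h->next = h;
	h->prev = h;
}

/* An empty list is a head that points at itself; no counter is needed. */
static int list_is_empty(const struct list_head *h)
{
	return h->next == h;
}

static void list_add_tail_sketch(struct list_head *n, struct list_head *h)
{
	n->prev = h->prev;
	n->next = h;
	h->prev->next = n;
	h->prev = n;
}

static void list_del_sketch(struct list_head *n)
{
	n->prev->next = n->next;
	n->next->prev = n->prev;
	n->next = n;
	n->prev = n;
}

/* Hypothetical slave with an embedded list node. */
struct slave_node {
	int id;
	struct list_head list;
};

/* Recover the containing slave from its embedded node. */
#define slave_of(node) \
	((struct slave_node *)((char *)(node) - offsetof(struct slave_node, list)))

/* Walk every slave on the list and sum the ids. */
static int sum_slave_ids(const struct list_head *h)
{
	int sum = 0;

	for (const struct list_head *p = h->next; p != h; p = p->next)
		sum += slave_of(p)->id;
	return sum;
}
```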

ipv6: bump genid when delete/add address · 439677d7

Committed by fan.du on Aug 01, 2013

Server           Client
2001:1::803/64  <-> 2001:1::805/64
2001:2::804/64  <-> 2001:2::806/64

Server side fib binary tree looks like this:

                                   (2001:/64)
                                   /
                                  /
                   ffff88002103c380
                 /                 \
     (2)        /                   \
 (2001::803/128)                     ffff880037ac07c0
                                    /               \
                                   /                 \  (3)
                      ffff880037ac0640               (2001::806/128)
                       /             \
             (1)      /               \
        (2001::804/128)               (2001::805/128)

Deleting 2001::804/64 does not delete the prefix route, nor the rt in (3)
destined to 2001::806 with source address 2001::804/64. That's because
2001::803/64 is still alive, which makes onlink=1 in ipv6_del_addr; this is
the substantial difference between same-prefix configuration and
different-prefix configuration :) So packets are still transmitted to
2001::806 with source address 2001::804/64.

So bumping genid will clear the rt in (3), and upper-layer protocols will
eventually find the right one for themselves.

This problem arose from the discussion here:
http://marc.info/?l=linux-netdev&m=137404469219410&w=4
Signed-off-by: Fan Du <fan.du@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

439677d7

Merge branch 'for-davem' of git://gitorious.org/linux-can/linux-can-next · c1fc20aa

Committed by David S. Miller on Aug 01, 2013

Marc Kleine-Budde says:

====================
this is a pull-request for net-next/master. It consists of two patches
by Fabio Estevam. The first converts the flexcan driver to use
devm_ioremap_resource(), the second adds return value checking for
clk_prepare_enable().
====================
Signed-off-by: David S. Miller <davem@davemloft.net>

c1fc20aa

bnx2x: Revising locking scheme for MAC configuration · 8b09be5f

Committed by Yuval Mintz on Aug 01, 2013

On very rare occasions, repeated load/unload stress test in the presence of
our storage driver (bnx2i/bnx2fc) causes a kernel panic in bnx2x code
(NULL pointer dereference). Stack traces indicate the issue happens during MAC
configuration; thorough code review showed that indeed several races exist
in which one thread can iterate over the list of configured MACs while another
deletes entries from the same list.

This patch adds a variant on the single-writer/multiple-reader lock mechanism -
it utilizes an already existing bottom-half lock, using it so that whenever
a writer is unable to continue due to the existence of another writer/reader,
it pends its request for future deliverance.
The writer / last readers will check for the existence of such requests and
perform them instead of the original initiator.
This prevents the writer from having to sleep while waiting for the lock
to be accessible, which might cause deadlocks given the locks already
held by the writer.

Another result of this patch is that setting of Rx Mode is now made in
sleepable context - Setting of Rx Mode is made under a bottom-half lock, which
was always nontrivial for the bnx2x driver, as the HW/FW configuration requires
wait for completions.
Since sleep was impossible (due to the sleepless-context), various mechanisms
were utilized to prevent the calling thread from sleep, but the truth was that
when the caller thread (i.e., the one calling ndo_set_rx_mode()) returned, the
Rx mode was still not set in HW/FW.

bnx2x_set_rx_mode() will now overtly schedule for the Rx changes to be
configured by the sp_rtnl_task, which holds the RTNL lock and is a sleepable
context.
Signed-off-by: Yuval Mintz <yuvalmin@broadcom.com>
Signed-off-by: Ariel Elior <ariele@broadcom.com>
Signed-off-by: Eilon Greenstein <eilong@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

8b09be5f

bonding: fix system hang due to fast igmp timer rescheduling · 4beac029

Committed by Nikolay Aleksandrov on Aug 01, 2013

After commit 4aa5dee4 ("net: convert resend IGMP to notifier event")
we try to acquire rtnl in bond_resend_igmp_join_requests but it can be
scheduled with rtnl already held (e.g. when bond_change_active_slave is
called with rtnl) causing a loop of immediate reschedules + calls because
rtnl_trylock fails each time since it's being already held.
For me this issue leads to system hangs very easily:
modprobe bonding; ifconfig bond0 up; ifenslave bond0 eth0; rmmod bonding;

The fix is to introduce a small (1 jiffy) delay which is enough for the
sections holding rtnl to finish without putting any strain on the system.
Also adjust the timer in bond_change_active_slave to be 1 jiffy, since
most of the time it's called with rtnl already held.
Signed-off-by: Nikolay Aleksandrov <nikolay@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

4beac029

tile: support PTP using the tilegx mPIPE (IEEE 1588) · 9ab5ec59

Committed by Chris Metcalf on Aug 01, 2013

Signed-off-by: Chris Metcalf <cmetcalf@tilera.com>
Acked-by: Richard Cochran <richardcochran@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

9ab5ec59

tile: remove deprecated NETIF_F_LLTX flag from tile drivers · 84e181ba

Committed by Chris Metcalf on Aug 01, 2013

Signed-off-by: Chris Metcalf <cmetcalf@tilera.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

84e181ba

tile: make "tile_net.custom" a proper bool module parameter · 4aa02644

Committed by Chris Metcalf on Aug 01, 2013

Signed-off-by: Chris Metcalf <cmetcalf@tilera.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

4aa02644

tile: support TSO for IPv6 in tilegx network driver · 2c7d04a9

Committed by Chris Metcalf on Aug 01, 2013

Signed-off-by: Chris Metcalf <cmetcalf@tilera.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

2c7d04a9

tile: support multiple mPIPE shims in tilegx network driver · f3286a3a

Committed by Chris Metcalf on Aug 01, 2013

The initial driver support was for a single mPIPE shim on the chip
(as is the case for the Gx36 hardware).  The Gx72 chip has two mPIPE
shims, so we extend the driver to handle that case.
Signed-off-by: Chris Metcalf <cmetcalf@tilera.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

f3286a3a

tile: enable GRO in the tilegx network driver · 6ab4ae9a

Committed by Chris Metcalf on Aug 01, 2013

Signed-off-by: Chris Metcalf <cmetcalf@tilera.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

6ab4ae9a

tile: fix panic bug in napi support for tilegx network driver · 5e7a54a2

Committed by Chris Metcalf on Aug 01, 2013

The code used to call napi_disable() in an interrupt handler
(from smp_call_function), which in turn could call msleep().
Unfortunately you can't sleep in an interrupt context.

Luckily it turns out all the NAPI support functions are
just operating on data structures and not on any deeply
per-cpu data, so we can arrange to set up and tear down all
the NAPI state on the core driving the process, and just
do the IRQ enable/disable as a smp_call_function thing.
Signed-off-by: Chris Metcalf <cmetcalf@tilera.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

5e7a54a2

tile: update dev->stats directly in tilegx network driver · ad018185

Committed by Chris Metcalf on Aug 01, 2013

Signed-off-by: Chris Metcalf <cmetcalf@tilera.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

ad018185

tile: support jumbo frames in the tilegx network driver · 2628e8af

Committed by Chris Metcalf on Aug 01, 2013

Signed-off-by: Chris Metcalf <cmetcalf@tilera.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

2628e8af

tile: remove dead is_dup_ack() function from tilepro net driver · 48f2a4e1

Committed by Chris Metcalf on Aug 01, 2013

Signed-off-by: Chris Metcalf <cmetcalf@tilera.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

48f2a4e1

tile: avoid bug in tilepro net driver built with old hypervisor · 815d3bae

Committed by Chris Metcalf on Aug 01, 2013

Building against headers from an older Tilera hypervisor can cause
the frags[] array to be overrun.  Don't enable TSO in that case.
Signed-off-by: Chris Metcalf <cmetcalf@tilera.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

815d3bae

tile: support rx_dropped/rx_errors in tilepro net driver · 439a93a0

Committed by Chris Metcalf on Aug 01, 2013

Signed-off-by: Chris Metcalf <cmetcalf@tilera.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

439a93a0

tile: set hw_features and vlan_features in setup · a8eaed55

Committed by Chris Metcalf on Aug 01, 2013

This change allows the user to configure various features of the tile
networking drivers on and off. There is no change to the default
initialization state of either the tilegx or tilepro drivers.

Neither driver needs the ndo_fix_features or ndo_set_features callbacks,
since the generic code already handles the dependencies for
fix_features, and there is no hardware state to tweak in set_features.
Signed-off-by: Chris Metcalf <cmetcalf@tilera.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

a8eaed55

gianfar: Remove unused field grp_id from gfar_priv_grp · 84915c64

Committed by Claudiu Manoil on Aug 01, 2013

grp->grp_id is obsolete. It has no use in the current driver.
Remove it from gfar_priv_grp and put the 'rstat' member
in its place, in the 2nd cache line, as rstat needs fast access.
Signed-off-by: Claudiu Manoil <claudiu.manoil@freescale.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

84915c64

net: add a temporary sanity check in skb_orphan() · 376c7311

Committed by Eric Dumazet on Aug 01, 2013

David suggested to add a BUG_ON() to catch if some layer
sets skb->sk pointer without a corresponding destructor.

As skb can sit in a queue, it's mandatory to make sure the
socket cannot disappear, and it's usually done by taking a
reference on the socket, then releasing it from the skb
destructor.

This patch is a follow-up to commit c34a7612
("net: skb_orphan() changes") and will be reverted after
catching all possible offenders if any.
Suggested-by: David Miller <davem@davemloft.net>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

376c7311

01 Aug, 2013 16 commits

can: flexcan: Check the return value from clk_prepare_enable() · aa10181b

Committed by Fabio Estevam on Jul 22, 2013

clk_prepare_enable() may fail, so let's check its return value and propagate it
in the case of error.
Signed-off-by: Fabio Estevam <fabio.estevam@freescale.com>
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>

aa10181b
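
The check-and-propagate pattern this commit describes might look roughly like the sketch below. It is a userspace illustration under stated assumptions: clk_sketch, clk_enable_sketch, clk_disable_sketch and open_sketch are hypothetical stand-ins, not the clk framework API or the flexcan driver's actual functions, and the use of two clocks with an unwind label is an assumption about the shape of the fix rather than something stated in the message.

```c
/* Hypothetical clock handle; `fail` forces the stub to report an error. */
struct clk_sketch {
	int enabled;
	int fail;
};

/* Stub in place of clk_prepare_enable(): 0 on success, negative on error. */
static int clk_enable_sketch(struct clk_sketch *clk)
{
	if (clk->fail)
		return -1;
	clk->enabled = 1;
	return 0;
}

static void clk_disable_sketch(struct clk_sketch *clk)
{
	clk->enabled = 0;
}

/* Enable two clocks; on failure undo what succeeded and pass the error up. */
static int open_sketch(struct clk_sketch *ipg, struct clk_sketch *per)
{
	int err;

	err = clk_enable_sketch(ipg);
	if (err)
		return err;

	err = clk_enable_sketch(per);
	if (err)
		goto out_disable_ipg;

	return 0;

out_disable_ipg:
	clk_disable_sketch(ipg);
	return err;
}
```

The point of the unwind label is that a failure part-way through leaves no clock enabled, so the caller sees either full success or a clean error.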

can: flexcan: Use devm_ioremap_resource() · 933e4af4

Committed by Fabio Estevam on Jul 22, 2013

Using devm_ioremap_resource() can make the code simpler and smaller.

Also, place alloc_candev() after of_match_device() to make error handling
easier.
Signed-off-by: Fabio Estevam <fabio.estevam@freescale.com>
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>

933e4af4

ipv6: fib6_rules should return exact return value · 46b3a421

Committed by Hannes Frederic Sowa on Aug 01, 2013

With the addition of the suppress operation
(7764a45a ("fib_rules: add .suppress
operation")) we rely on accurate error reporting of the fib_rules.actions.

fib6_rule_action always returned -EAGAIN in case we could not find a
matching route and 0 if a rule was matched. This also included a match
for blackhole or prohibited rule actions which could get suppressed by
the new logic.

So adapt fib6_rule_action to always return the correct error code as
its counterpart fib4_rule_action does. This also fixes a possibility of
nullptr-deref where we don't find a table, thus rt == NULL. Because
the condition rt != ip6_null_entry still holds it seems we could later
get a nullptr bug on dereference rt->dst.

v2:
a) Fixed a brain fart in the commit msg (the rule => a table, etc). No
   changes to the patch.

Cc: Stefan Tomanek <stefan.tomanek@wertarbyte.de>
Cc: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org>
Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

46b3a421

cls_cgroup.h netprio_cgroup.h: Remove extern from function prototypes · 37830721

Committed by Joe Perches on Jul 31, 2013

There is a mix of function prototypes with and without extern
in the kernel sources.  Standardize on not using extern for
function prototypes.

Function prototypes don't need to be written with extern.
extern is assumed by the compiler.  Its use is as unnecessary as
using auto to declare automatic/local variables in a block.
Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

37830721
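
A small illustration of the point made in this series: for function prototypes, a declaration with extern and one without are equivalent, so both can coexist and be satisfied by one definition. The function name checksum_sketch is hypothetical and chosen only for this example.

```c
/* Both declarations below name the same function with external linkage;
 * the `extern` on the second one is redundant for function prototypes. */
int checksum_sketch(const unsigned char *buf, int len);
extern int checksum_sketch(const unsigned char *buf, int len);

/* A single definition satisfies both declarations. */
int checksum_sketch(const unsigned char *buf, int len)
{
	int sum = 0;

	for (int i = 0; i < len; i++)
		sum += buf[i];
	return sum;
}
```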

checksum: Remove extern from function prototypes · 4fc70747

Committed by Joe Perches on Jul 31, 2013

There is a mix of function prototypes with and without extern
in the kernel sources.  Standardize on not using extern for
function prototypes.

Function prototypes don't need to be written with extern.
extern is assumed by the compiler.  Its use is as unnecessary as
using auto to declare automatic/local variables in a block.
Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

4fc70747

cfg80211.h/mac80211.h: Remove extern from function prototypes · 10dd9b7c

Committed by Joe Perches on Jul 31, 2013

There is a mix of function prototypes with and without extern
in the kernel sources.  Standardize on not using extern for
function prototypes.

Function prototypes don't need to be written with extern.
extern is assumed by the compiler.  Its use is as unnecessary as
using auto to declare automatic/local variables in a block.

Reflow modified prototypes to 80 columns.
Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

10dd9b7c

ax25.h: Remove extern from function prototypes · c1d8f804

Committed by Joe Perches on Jul 31, 2013

There is a mix of function prototypes with and without extern
in the kernel sources.  Standardize on not using extern for
function prototypes.

Function prototypes don't need to be written with extern.
extern is assumed by the compiler.  Its use is as unnecessary as
using auto to declare automatic/local variables in a block.

Reflow modified prototypes to 80 columns.
Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

c1d8f804

arp/neighbour.h: Remove extern from function prototypes · 90972b22

Committed by Joe Perches on Jul 31, 2013

There is a mix of function prototypes with and without extern
in the kernel sources.  Standardize on not using extern for
function prototypes.

Function prototypes don't need to be written with extern.
extern is assumed by the compiler.  Its use is as unnecessary as
using auto to declare automatic/local variables in a block.

Reflow modified prototypes to 80 columns.
Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

90972b22

af_rxrpc.h: Remove extern from function prototypes · cd2cf63a

Committed by Joe Perches on Jul 31, 2013

There is a mix of function prototypes with and without extern
in the kernel sources.  Standardize on not using extern for
function prototypes.

Function prototypes don't need to be written with extern.
extern is assumed by the compiler.  Its use is as unnecessary as
using auto to declare automatic/local variables in a block.

Reflow modified prototypes to 80 columns.
Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

cd2cf63a

af_unix.h: Remove extern from function prototypes · b60a8280

Committed by Joe Perches on Jul 31, 2013

There is a mix of function prototypes with and without extern
in the kernel sources.  Standardize on not using extern for
function prototypes.

Function prototypes don't need to be written with extern.
extern is assumed by the compiler.  Its use is as unnecessary as
using auto to declare automatic/local variables in a block.

Reflow modified prototypes to 80 columns.
Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

b60a8280

addrconf.h: Remove extern function prototypes · e8e54d3c

Committed by Joe Perches on Jul 31, 2013

There is a mix of function prototypes with and without extern
in the kernel sources.  Standardize on not using extern for
function prototypes.

Function prototypes don't need to be written with extern.
extern is assumed by the compiler.  Its use is as unnecessary as
using auto to declare automatic/local variables in a block.

Reflow modified prototypes to 80 columns.
Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

e8e54d3c

Documentation: add networking/netdev-FAQ.txt · 49dfe762

Committed by Paul Gortmaker on Jul 31, 2013

A collection of expectations and operational details about how
networking development takes place in the context of the netdev
mailing list.

The content is meant to capture specific items that are unique
to netdev workflow, and not re-document generic linux expectations
that are already captured elsewhere.

This was originally proposed[1] as a regular posting mailing list
FAQ, but it probably is more universally accessible here in tree.

[1] https://lwn.net/Articles/559211/
Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

49dfe762

fib_rules: add .suppress operation · 7764a45a

Committed by Stefan Tomanek on Aug 01, 2013

This change adds a new operation to the fib_rules_ops struct; it allows the
suppression of routing decisions if certain criteria are not met by its
results.

The first implemented constraint is a minimum prefix length added to the
structures of routing rules. If a rule is added with a minimum prefix length
>0, only routes meeting this threshold will be considered. Any other (more
general) routing table entries will be ignored.

When configuring a system with multiple network uplinks and default routes, it
is often convenient to reference the main routing table multiple times - but
omitting the default route. Using this patch and a modified "ip" utility, this
can be achieved by using the following command sequence:

  $ ip route add table secuplink default via 10.42.23.1

  $ ip rule add pref 100            table main prefixlength 1
  $ ip rule add pref 150 fwmark 0xA table secuplink

With this setup, packets marked 0xA will be processed by the additional routing
table "secuplink", but only if no suitable route in the main routing table can
be found. By using a minimal prefixlength of 1, the default route (/0) of the
table "main" is hidden to packets processed by rule 100; packets traveling to
destinations with more specific routing entries are processed as usual.
Signed-off-by: Stefan Tomanek <stefan.tomanek@wertarbyte.de>
Signed-off-by: David S. Miller <davem@davemloft.net>

7764a45a
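
The minimum-prefix-length constraint described in this commit can be pictured with a small predicate. This is a sketch based only on the behaviour stated in the message (a rule with a minimum prefix length greater than 0 ignores more general routes); the struct and function names are hypothetical and are not the kernel's fib_rules structures or the actual .suppress callback.

```c
#include <stdbool.h>

/* Hypothetical, simplified lookup result; not the kernel's structure. */
struct route_result_sketch {
	int prefixlen;        /* prefix length of the matched route */
};

/* Hypothetical, simplified rule; not the kernel's structure. */
struct rule_sketch {
	int min_prefixlen;    /* 0 means no constraint */
};

/* Return true when the matched route is too general for this rule, so the
 * lookup should fall through to the next rule. */
static bool rule_suppresses(const struct rule_sketch *rule,
			    const struct route_result_sketch *res)
{
	return rule->min_prefixlen > 0 &&
	       res->prefixlen < rule->min_prefixlen;
}
```

With a minimum prefix length of 1, as in the command sequence in the message, a default route (/0) is suppressed while any more specific route is still used.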

net: Remove extern from include/net/ scheduling prototypes · 5c15257f

Committed by Joe Perches on Jul 30, 2013

There is a mix of function prototypes with and without extern
in the kernel sources.  Standardize on not using extern for
function prototypes.

Function prototypes don't need to be written with extern.
extern is assumed by the compiler.  Its use is as unnecessary as
using auto to declare automatic/local variables in a block.

Reflow modified prototypes to 80 columns.
Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

5c15257f

net: skb_orphan() changes · c34a7612

Committed by Eric Dumazet on Jul 30, 2013

It is illegal to set skb->sk without corresponding destructor.

It's therefore safe for skb_orphan() to not clear skb->sk if
skb->destructor is not set.

Also avoid clearing skb->destructor if already NULL.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

c34a7612
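
The behaviour described in this commit can be sketched in userspace as follows. The types sock_sketch and skb_sketch and the functions sock_put_destructor and skb_orphan_sketch are hypothetical, heavily reduced stand-ins for illustration; this is not the kernel's struct sk_buff or its skb_orphan() source.

```c
#include <stddef.h>

/* Hypothetical stand-in for struct sock: just a reference count. */
struct sock_sketch {
	int refcnt;
};

/* Hypothetical stand-in for struct sk_buff: owner plus destructor. */
struct skb_sketch {
	struct sock_sketch *sk;
	void (*destructor)(struct skb_sketch *skb);
};

/* Example destructor: drop the socket reference taken when sk was set. */
static void sock_put_destructor(struct skb_sketch *skb)
{
	skb->sk->refcnt--;
}

/* Orphan the skb. Only when a destructor is set is there an owner to
 * release, so the no-destructor path performs no writes at all. */
static void skb_orphan_sketch(struct skb_sketch *skb)
{
	if (skb->destructor) {
		skb->destructor(skb);
		skb->destructor = NULL;
		skb->sk = NULL;
	}
}
```

Because sk is only ever set together with a destructor, skipping both stores when the destructor is NULL loses nothing, and a second call on an already orphaned skb is a no-op.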

netem: Introduce skb_orphan_partial() helper · f2f872f9

Committed by Eric Dumazet on Jul 30, 2013

Commit 547669d4 ("tcp: xps: fix reordering issues") added
unexpected reorders in case netem is used in a MQ setup for high
performance test bed.

ETH=eth0
tc qd del dev $ETH root 2>/dev/null
tc qd add dev $ETH root handle 1: mq
for i in `seq 1 32`
do
 tc qd add dev $ETH parent 1:$i netem delay 100ms
done

As all tcp packets are orphaned by netem, TCP stack believes it can
set skb->ooo_okay on all packets.

In order to allow producers to send more packets, we want to
keep sk_wmem_alloc from reaching sk_sndbuf limit.

We can do that by accounting one byte per skb in netem queues,
so that TCP stack is not fooled too much.

Tested:

With above MQ/netem setup, scaling number of concurrent flows gives
linear results and no reorders/retransmits

lpq83:~# for n in 1 10 20 30 40 50 60 70 80 90 100
 do echo -n "n:$n " ; ./super_netperf $n -H 10.7.7.84; done
n:1 198.46
n:10 2002.69
n:20 4000.98
n:30 6006.35
n:40 8020.93
n:50 10032.3
n:60 12081.9
n:70 13971.3
n:80 16009.7
n:90 17117.3
n:100 17425.5
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

f2f872f9

openanolis / cloud-kernel synced successfully more than 1 year ago