Commits · ed9af2e839c06c18f721da2c768fbb444c4a10e5 · OpenHarmony / kernel_linux

16 Nov, 2010 4 commits

net: Move TX queue allocation to alloc_netdev_mq · ed9af2e8

Committed by Tom Herbert on Nov 09, 2010

TX queues are now allocated in alloc_netdev_mq and freed in
free_netdev.

Signed-off-by: Tom Herbert <therbert@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

ed9af2e8

offloading: Force software GSO for multiple vlan tags. · 58e998c6

Committed by Jesse Gross on Oct 29, 2010

We currently use vlan_features to check for TSO support if there is
a vlan tag.  However, it's quite likely that the NIC is not able to
do TSO when there is an arbitrary number of tags.  Therefore if there
is more than one tag (in-band or out-of-band), fall back to software
emulation.

Signed-off-by: Jesse Gross <jesse@nicira.com>
CC: Ben Hutchings <bhutchings@solarflare.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

58e998c6

offloading: Support multiple vlan tags in GSO. · c8d5bcd1

Committed by Jesse Gross on Oct 29, 2010

We assume that hardware TSO can't support multiple levels of vlan tags
but we allow it to be done.  Therefore, enable GSO to parse these tags
so we can fall back to software.

Signed-off-by: Jesse Gross <jesse@nicira.com>
CC: Ben Hutchings <bhutchings@solarflare.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

c8d5bcd1

offloading: Make scatter/gather more tolerant of vlans. · e1e78db6

Committed by Jesse Gross on Oct 29, 2010

When checking if it is necessary to linearize a packet, we currently
use vlan_features if the packet contains either an in-band or
out-of-band vlan tag.  However, in-band tags aren't special in any way
for scatter/gather since they are part of the packet buffer and are
simply more data to DMA.  Therefore, only use vlan_features for
out-of-band tags, which could potentially have some interaction with
scatter/gather.

Signed-off-by: Jesse Gross <jesse@nicira.com>
CC: Ben Hutchings <bhutchings@solarflare.com>
Reviewed-by: Ben Hutchings <bhutchings@solarflare.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

e1e78db6

13 Nov, 2010 1 commit

rtnetlink: Fix message size calculation for link messages · 369cf77a

Committed by Thomas Graf on Nov 11, 2010

nlmsg_total_size() calculates the length of a netlink message
including header and alignment. nla_total_size() calculates the
space an individual attribute consumes which was meant to be used
in this context.

Also, ensure to account for the attribute header for the
IFLA_INFO_XSTATS attribute as implementations of get_xstats_size()
seem to assume that we do so.

The addition of two message headers minus the missing attribute
header resulted in a calculated message size that was larger than
required. Therefore we never risked running out of skb tailroom.

Signed-off-by: Thomas Graf <tgraf@infradead.org>
Acked-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>

369cf77a
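The difference between the two size helpers named in this commit can be sketched in plain C. This is a userspace illustration only, not the kernel code: the `demo_` names are hypothetical, and the sketch assumes 4-byte alignment, a 16-byte message header, and a 4-byte attribute header.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed sizes: aligned netlink message header and attribute header. */
#define DEMO_NLMSG_HDRLEN 16u
#define DEMO_NLA_HDRLEN 4u

/* Round len up to the next multiple of 4. */
static size_t demo_align4(size_t len)
{
    assert(len <= SIZE_MAX - 3);
    return (len + 3) & ~(size_t)3;
}

/* Space for a whole message: message header plus payload, aligned. */
static size_t demo_nlmsg_total_size(size_t payload)
{
    return demo_align4(DEMO_NLMSG_HDRLEN + payload);
}

/* Space for a single attribute: attribute header plus payload, aligned. */
static size_t demo_nla_total_size(size_t payload)
{
    return demo_align4(DEMO_NLA_HDRLEN + payload);
}
```

Under these assumptions a 10-byte payload needs 28 bytes as a message but only 16 bytes as an attribute, which is why sizing attributes with the message helper overestimates.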

11 Nov, 2010 2 commits

net: avoid limits overflow · 8d987e5c

Committed by Eric Dumazet on Nov 09, 2010

Robin Holt tried to boot a 16TB machine and found some limits were
reached: sysctl_tcp_mem[2], sysctl_udp_mem[2]

We can switch infrastructure to use "long" instead of "int", now that
atomic_long_t primitives are available for free.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Reported-by: Robin Holt <holt@sgi.com>
Reviewed-by: Robin Holt <holt@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

8d987e5c

filter: make sure filters dont read uninitialized memory · 57fe93b3

Committed by David S. Miller on Nov 10, 2010

There is a possibility that malicious users can get limited information
about the uninitialized stack mem[] array. Even if the sk_run_filter()
result is bound to the packet length (0 .. 65535), we could imagine this
being used by a hostile user.

Initializing the mem[] array, as Dan Rosenberg suggested in his patch, is
expensive since most filters don't even use this array.

It's hard to do the filter validation in sk_chk_filter() because of
the jumps. This might be done later.

In this patch, I use a bitmap (a single long var) so that only filters
using mem[] loads/stores pay the price of added security checks.

For other filters, the additional cost is a single instruction.

[ Since we access fentry->k a lot now, cache it in a local variable
  and mark filter entry pointer as const. -DaveM ]

Reported-by: Dan Rosenberg <drosenberg@vsecurity.com>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

57fe93b3
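The bitmap idea in this commit can be sketched as follows. This is a minimal userspace model, not the kernel patch: the `demo_` names and the 16-word scratch size are assumptions, and returning 0 for a slot that was never written is this sketch's choice of "safe" behavior.

```c
#include <assert.h>
#include <stdint.h>

#define DEMO_MEMWORDS 16

/* Hypothetical scratch state for one filter run. */
struct demo_scratch {
    uint32_t mem[DEMO_MEMWORDS];  /* never cleared up front */
    unsigned long valid;          /* bit k set: mem[k] has been stored to */
};

/* Starting a run costs a single store, however large mem[] is. */
static void demo_scratch_reset(struct demo_scratch *s)
{
    s->valid = 0;
}

/* A store records the slot as initialized. */
static void demo_store(struct demo_scratch *s, unsigned int k, uint32_t v)
{
    assert(k < DEMO_MEMWORDS);
    s->mem[k] = v;
    s->valid |= 1UL << k;
}

/* A load from a slot that was never stored yields 0, not stale memory. */
static uint32_t demo_load(const struct demo_scratch *s, unsigned int k)
{
    assert(k < DEMO_MEMWORDS);
    return (s->valid & (1UL << k)) ? s->mem[k] : 0;
}
```

Only filters that actually touch the scratch array pay for the extra bit test; every other filter pays for the one reset store.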

10 Nov, 2010 2 commits

net/dst: dst_dev_event() called after other notifiers · 332dd96f

Committed by Eric Dumazet on Nov 09, 2010

Followup of commit ef885afb (net: use rcu_barrier() in
rollback_registered_many)

dst_dev_event() scans a garbage dst list that might be fed by various
network notifiers at device dismantle time.

It's important to call dst_dev_event() after other notifiers, or we might
enter the infamous msleep(250) in netdev_wait_allrefs(), and wait one
second before calling call_netdevice_notifiers(NETDEV_UNREGISTER, dev)
again to properly remove the last device references.

Use priority -10 to let dst_dev_notifier be called after other network
notifiers (they have the default 0 priority)

Reported-by: Ben Greear <greearb@candelatech.com>
Reported-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Reported-by: Octavian Purdila <opurdila@ixiacom.com>
Reported-by: Benjamin LaHaise <bcrl@kvack.org>
Tested-by: Ben Greear <greearb@candelatech.com>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

332dd96f
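Why a priority of -10 makes a notifier run last can be sketched with a small priority-ordered chain. This is an illustrative model with hypothetical `demo_` names, assuming the chain is kept in descending priority order and that equal priorities keep their registration order.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical notifier entry: a higher priority runs earlier. */
struct demo_notifier {
    const char *name;
    int priority;
    struct demo_notifier *next;
};

/*
 * Insert n into a chain kept in descending priority order. Entries with
 * equal priority stay in the order in which they were registered.
 */
static void demo_register(struct demo_notifier **head, struct demo_notifier *n)
{
    assert(head != NULL && n != NULL);
    while (*head != NULL && (*head)->priority >= n->priority)
        head = &(*head)->next;
    n->next = *head;
    *head = n;
}
```

With this ordering, an entry registered at priority -10 ends up behind every default-priority entry, regardless of when it was registered.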

net/core/dev.c: Update WARN uses · b194a367

Committed by Joe Perches on Oct 30, 2010

Coalesce long formats.
Add missing newlines.

Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

b194a367

09 Nov, 2010 1 commit

pktgen: correct uninitialized queue_map · eb589063

Committed by Junchang Wang on Nov 07, 2010

This fixes a bug reported by backyes.
The first time pktgen uses queue_map, it has not yet been initialized
by set_cur_queue_map(pkt_dev);

Signed-off-by: Junchang Wang <junchangwang@gmail.com>
Signed-off-by: Backyes <backyes@mail.ustc.edu.cn>
Signed-off-by: David S. Miller <davem@davemloft.net>

eb589063

07 Nov, 2010 1 commit

NET: pktgen - fix compile warning · 86c2c0a8

Committed by Dmitry Torokhov on Nov 06, 2010

This should fix the following warning:

net/core/pktgen.c: In function ‘pktgen_if_write’:
net/core/pktgen.c:890: warning: comparison of distinct pointer types lacks a cast

Signed-off-by: Dmitry Torokhov <dtor@mail.ru>
Reviewed-by: Nelson Elhage <nelhage@ksplice.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

86c2c0a8

02 Nov, 2010 1 commit

net: check queue_index from sock is valid for device · df32cc19

Committed by Tom Herbert on Nov 01, 2010

In dev_pick_tx recompute the queue index if the value stored in the
socket is greater than or equal to the number of real queues for the
device.  The saved index in the sock structure is not guaranteed to
be appropriate for the egress device (this could happen on a route
change or in presence of tunnelling).  The result of the queue index
being bad would be to return a bogus queue (crash could presumably
follow).

Signed-off-by: Tom Herbert <therbert@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

df32cc19
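The range check described in this commit can be sketched as a small helper. This is not the dev_pick_tx code: the function name is hypothetical, a negative cached value stands for "nothing saved", and the modulo is only a placeholder for whatever hash-to-queue mapping the stack uses.

```c
#include <assert.h>

/*
 * Return a TX queue index that is valid for a device with
 * real_num_tx_queues queues. cached is the index saved in the socket
 * (negative: nothing saved); hash stands in for the flow hash.
 */
static unsigned int demo_pick_tx(int cached, unsigned int hash,
                                 unsigned int real_num_tx_queues)
{
    assert(real_num_tx_queues >= 1);

    /* Trust the saved index only if it is in range for this device. */
    if (cached >= 0 && (unsigned int)cached < real_num_tx_queues)
        return (unsigned int)cached;

    /* Otherwise recompute; the modulo is a placeholder mapping. */
    return hash % real_num_tx_queues;
}
```

For a 4-queue device, a saved index of 2 is reused, while a saved index of 4 (valid on some other device) is discarded and recomputed.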

29 Oct, 2010 2 commits

pktgen: Limit how much data we copy onto the stack. · 448d7b5d

Committed by Nelson Elhage on Oct 28, 2010

A program that accidentally writes too much data to the pktgen file can overflow
the kernel stack and oops the machine. This is only triggerable by root, so
there's no security issue, but it's still an unfortunate bug.

printk() won't print more than 1024 bytes in a single call, anyways, so let's
just never copy more than that much data. We're on a fairly shallow stack, so
that should be safe even with CONFIG_4KSTACKS.

Signed-off-by: Nelson Elhage <nelhage@ksplice.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

448d7b5d

net: Limit socket I/O iovec total length to INT_MAX. · 8acfe468

Committed by David S. Miller on Oct 28, 2010

This helps protect us from overflow issues down in the
individual protocol sendmsg/recvmsg handlers.  Once
we hit INT_MAX we truncate out the rest of the iovec
by setting the iov_len members to zero.

This works because:

1) For SOCK_STREAM and SOCK_SEQPACKET sockets, partial
   writes are allowed and the application will just continue
   with another write to send the rest of the data.

2) For datagram oriented sockets, where there must be a
   one-to-one correspondence between write() calls and
   packets on the wire, INT_MAX is going to be far larger
   than the packet size limit the protocol is going to
   check for and signal with -EMSGSIZE.

Based upon a patch by Linus Torvalds.

Signed-off-by: David S. Miller <davem@davemloft.net>

8acfe468
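The truncation rule in this commit can be sketched in userspace C. The struct and function names are hypothetical stand-ins for the real iovec handling, and the sketch assumes that the segment crossing INT_MAX is shortened while all later segments end up with a length of zero.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

/* Hypothetical stand-in for struct iovec. */
struct demo_iovec {
    void *base;
    size_t len;
};

/*
 * Walk the vector and let the running total grow to at most INT_MAX.
 * The segment that crosses the limit is shortened, and every later
 * segment gets len == 0. Returns the clamped total length.
 */
static int demo_clamp_iovec(struct demo_iovec *iov, size_t count)
{
    int total = 0;

    assert(iov != NULL || count == 0);
    for (size_t i = 0; i < count; i++) {
        size_t room = (size_t)(INT_MAX - total);

        if (iov[i].len > room)
            iov[i].len = room;
        total += (int)iov[i].len;
    }
    return total;
}
```

Because the total is an int that never exceeds INT_MAX, protocol handlers further down can add lengths without worrying about signed overflow.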

28 Oct, 2010 3 commits

fib_rules: __rcu annotates ctarget · 7a2b03c5

Committed by Eric Dumazet on Oct 26, 2010

Adds __rcu annotation to (struct fib_rule)->ctarget

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

7a2b03c5

net: NETIF_F_HW_CSUM does not imply FCoE CRC offload · 66c68bcc

Committed by Ben Hutchings on Oct 22, 2010

NETIF_F_HW_CSUM indicates the ability to update a TCP/IP-style 16-bit
checksum with the checksum of an arbitrary part of the packet data,
whereas the FCoE CRC is something entirely different.

Signed-off-by: Ben Hutchings <bhutchings@solarflare.com>
Cc: stable@kernel.org [2.6.32+]
Signed-off-by: David S. Miller <davem@davemloft.net>

66c68bcc

net: Fix some corner cases in dev_can_checksum() · af1905db

Committed by Ben Hutchings on Oct 22, 2010

dev_can_checksum() incorrectly returns true in these cases:

1. The skb has both out-of-band and in-band VLAN tags and the device
   supports checksum offload for the encapsulated protocol but only with
   one layer of encapsulation.
2. The skb has a VLAN tag and the device supports generic checksumming
   but not in conjunction with VLAN encapsulation.

Rearrange the VLAN tag checks to avoid these.

Signed-off-by: Ben Hutchings <bhutchings@solarflare.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

af1905db

27 Oct, 2010 1 commit

fib: fix fib_nl_newrule() · ebb9fed2

Committed by Eric Dumazet on Oct 23, 2010

Some panic reports in fib_rules_lookup() show a rule could have a NULL
pointer as a next pointer in the rules_list.

This can actually happen because of a bug in fib_nl_newrule(): it
checks if the current rule is the destination of unresolved gotos (other
rules have gotos to this about-to-be-inserted rule).

The problem is that it resolves the gotos before the rule is
inserted in the rules_list (and has a valid next pointer).

Fix this by moving the rules_list insertion before the changes on gotos.

A lockless reader can no longer follow a ctarget pointer unless the
destination is ready (has a valid next pointer).

Reported-by: Oleg A. Arkhangelsky <sysoleg@yandex.ru>
Reported-by: Joe Buehler <aspam@cox.net>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

ebb9fed2

26 Oct, 2010 5 commits

net: add __rcu annotation to sk_filter · 0d7da9dd

Committed by Eric Dumazet on Oct 25, 2010

Add __rcu annotation to:
        (struct sock)->sk_filter

And use appropriate rcu primitives to reduce sparse warnings if
CONFIG_SPARSE_RCU_POINTER=y

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

0d7da9dd

net_ns: add __rcu annotations · 1c87733d

Committed by Eric Dumazet on Oct 25, 2010

Add __rcu annotation to (struct net)->gen, and use
rcu_dereference_protected() in net_assign_generic()

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

1c87733d

rps: add __rcu annotations · 6e3f7faf

Committed by Eric Dumazet on Oct 25, 2010

Add __rcu annotations to:
        (struct netdev_rx_queue)->rps_map
        (struct netdev_rx_queue)->rps_flow_table
        struct rps_sock_flow_table *rps_sock_flow_table;

And use appropriate rcu primitives.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

6e3f7faf

ipv6: ip6_ptr rcu annotations · 198caeca

Committed by Eric Dumazet on Oct 24, 2010

(struct net_device)->ip6_ptr is rcu protected:

add __rcu annotation and proper rcu primitives.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

198caeca

net: Increase xmit RECURSION_LIMIT to 10. · 11a766ce

Committed by David S. Miller on Oct 25, 2010

Three is definitely too low, and we know from reports that GRE tunnels
stacked as deeply as 37 levels cause stack overflows, so pick some
reasonable value between those two.

Signed-off-by: David S. Miller <davem@davemloft.net>

11a766ce
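A recursion guard of this kind can be sketched with a nesting counter. This is a simplified single-threaded model, not the kernel transmit path: the `demo_` names are hypothetical, the per-CPU counter is modeled as a plain variable, and the comparison "depth greater than the limit" is an assumption of the sketch.

```c
#include <assert.h>

#define DEMO_RECURSION_LIMIT 10

/* Hypothetical per-CPU nesting counter, modeled as a plain variable. */
static int demo_xmit_depth;

/*
 * Model a transmit that re-enters itself tunnels_left more times, as a
 * stack of tunnel devices would. Returns 0 on success, -1 if dropped.
 */
static int demo_xmit(int tunnels_left)
{
    int ret = 0;

    assert(tunnels_left >= 0);
    if (demo_xmit_depth > DEMO_RECURSION_LIMIT)
        return -1;              /* too deep: drop instead of overflowing */

    demo_xmit_depth++;
    if (tunnels_left > 0)
        ret = demo_xmit(tunnels_left - 1);
    demo_xmit_depth--;
    return ret;
}
```

In this model ten levels of stacked tunnels still transmit, while eleven or more are dropped well before the 37 levels reported to overflow the stack.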

25 Oct, 2010 1 commit

pktgen: clean up handling of local/transient counter vars · d6182223

Committed by Paul Gortmaker on Oct 18, 2010

The temporary variable "i" is needlessly initialized to zero
in two distinct cases in this file:

1) where it is set to zero and then used as an argument in an addition
before being assigned a non-zero value.

2) where it is only used in a standard/typical loop counter

For (1), simply delete assignment to zero and usages while still
zero; for (2) simply make the loop start at zero as per standard
practice as seen everywhere else in the same file.

Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

d6182223

21 Oct, 2010 7 commits

napi: unexport napi_reuse_skb · d0c2b0d2

Committed by stephen hemminger on Oct 19, 2010

The function napi_reuse_skb is only used inside core.

Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

d0c2b0d2

net/neighbour: cancel_delayed_work() + flush_scheduled_work() -> cancel_delayed_work_sync() · a5c30b34

Committed by Tejun Heo on Oct 19, 2010

flush_scheduled_work() is going away.  Prepare for it.

Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

a5c30b34

net/core: Allow tagged VLAN packets to flow through VETH devices. · d2ed8177

Committed by Ben Greear on Oct 21, 2010

When there are VLANs on a VETH device, the packets being transmitted
through the VETH device may be 4 bytes bigger than MTU.  A check
in dev_forward_skb did not take this into account and so dropped
these packets.

This patch is needed at least as far back as 2.6.34.7 and should
be considered for -stable.

Signed-off-by: Ben Greear <greearb@candelatech.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

d2ed8177
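The length check this commit relaxes can be sketched as a predicate. The function name and the exact form of the comparison are assumptions for illustration; the only facts taken from the commit are that a tagged frame may be 4 bytes larger than the MTU would otherwise allow and must not be dropped for it.

```c
#include <assert.h>

#define DEMO_VLAN_HLEN 4u   /* size of one 802.1Q tag */

/*
 * Return nonzero if a frame of len bytes (link-layer header included)
 * may be forwarded to a device with the given MTU and header length,
 * leaving room for one VLAN tag.
 */
static int demo_len_ok(unsigned int len, unsigned int mtu,
                       unsigned int hard_header_len)
{
    assert(mtu > 0);
    return len <= mtu + hard_header_len + DEMO_VLAN_HLEN;
}
```

With an MTU of 1500 and a 14-byte Ethernet header, a 1518-byte tagged frame passes, and only frames larger than that are rejected.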

rtnetlink: remove rtnl_kill_links · 8d8a0b1c

Committed by stephen hemminger on Oct 15, 2010

The function rtnl_kill_links is defined but never used.

Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

8d8a0b1c

ethtool: Add support for vlan accleration. · d5dbda23

Committed by Jesse Gross on Oct 20, 2010

Now that vlan acceleration is handled consistently regardless of usage,
it is possible to enable and disable it at will.  This adds support for
Ethtool operations that change the offloading status for debugging
purposes, similar to other forms of hardware acceleration.

Signed-off-by: Jesse Gross <jesse@nicira.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

d5dbda23

vlan: Centralize handling of hardware acceleration. · 3701e513

Committed by Jesse Gross on Oct 20, 2010

Currently each driver that is capable of vlan hardware acceleration
must be aware of the vlan groups that are configured and then pass
the stripped tag to a specialized receive function.  This is
different from other types of hardware offload in that it places a
significant amount of knowledge in the driver itself rather than
keeping it in the networking core.

This makes vlan offloading function more similarly to other forms
of offloading (such as checksum offloading or TSO) by doing the
following:
* On receive, stripped vlans are passed directly to the network
  core, without attempting to check for vlan groups or reconstructing
  the header if no group
* vlans are made less special by folding the logic into the main
  receive routines
* On transmit, the device layer will add the vlan header in software
  if the hardware doesn't support it, instead of spreading that logic
  out in upper layers, such as bonding.

There are a number of advantages to this:
* Fixes all bugs with drivers incorrectly dropping vlan headers at once.
* Avoids having to disable VLAN acceleration when in promiscuous mode
  (good for bridging since it always puts devices in promiscuous mode).
* Keeps VLAN tag separate until given to ultimate consumer, which
  avoids needing to do header reconstruction as in tg3 unless absolutely
  necessary.
* Consolidates common code in core networking.

Signed-off-by: Jesse Gross <jesse@nicira.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

3701e513

vlan: Enable software emulation for vlan accleration. · 7b9c6090

Committed by Jesse Gross on Oct 20, 2010

Currently users of hardware vlan acceleration need to know whether
the device supports it before generating packets.  However, vlan
acceleration will soon be available in a more flexible manner so
knowing ahead of time becomes much more difficult.  This adds
a software fallback path for vlan packets on devices without the
necessary offloading support, similar to other types of hardware
acceleration.

Signed-off-by: Jesse Gross <jesse@nicira.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

7b9c6090

20 Oct, 2010 5 commits

net: avoid RCU for NOCACHE dst · 27b75c95

Committed by Eric Dumazet on Oct 15, 2010

There is no point using RCU for dst we allocate for a very short time
(used once).

Change dst_release() to take DST_NOCACHE into account, but also change
skb_dst_set_noref() to force a refcount increment for such dst.

This is a _huge_ gain, because we don't waste memory to store thousands
of dsts. Instead of queueing them to RCU, we can free them instantly.

CPU caches can stay hot, re-using same memory blocks to hold temporary
dsts.

Note : remove unneeded smp_mb__before_atomic_dec(); in dst_release(),
since atomic_dec_return() implies a full memory barrier.

Stress test, 160.000.000 udp frames sent, IP route cache disabled
(DDOS).

Before:

real    0m38.091s
user    0m13.189s
sys     7m53.018s

After:

real    0m29.946s
user    0m12.157s
sys     7m40.605s

For reference, if IP route cache was enabled:

real    0m32.030s
user    0m10.521s
sys     8m15.243s

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

27b75c95

net: allocate tx queues in register_netdevice · e6484930

Committed by Tom Herbert on Oct 18, 2010

This patch introduces netif_alloc_netdev_queues which is called from
register_netdevice instead of alloc_netdev_mq.  This makes TX queue
allocation symmetric with RX allocation.  Also, queue locks allocation
is done in netdev_init_one_queue.  Change set_real_num_tx_queues to
fail if requested number < 1 or greater than number of allocated
queues.

Signed-off-by: Tom Herbert <therbert@google.com>
Acked-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

e6484930
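The new failure condition for setting the number of active TX queues can be sketched as follows. The struct, its field names, and the choice of -EINVAL as the error code are assumptions of this sketch; the commit only states that a request below 1 or above the allocated count must fail.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical, heavily reduced view of a network device. */
struct demo_netdev {
    unsigned int num_tx_queues;       /* queues allocated up front */
    unsigned int real_num_tx_queues;  /* queues currently in use */
};

/* Reject requests for zero queues or for more queues than were allocated. */
static int demo_set_real_num_tx_queues(struct demo_netdev *dev,
                                       unsigned int txq)
{
    assert(dev != NULL);
    if (txq < 1 || txq > dev->num_tx_queues)
        return -EINVAL;
    dev->real_num_tx_queues = txq;
    return 0;
}
```

A failed call leaves the device's active queue count unchanged, so callers can simply propagate the error.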

net: cleanups in RX queue allocation · bd25fa7b

Committed by Tom Herbert on Oct 18, 2010

Clean up in RX queue allocation.  In netif_set_real_num_rx_queues
return error on attempt to set zero queues, or requested number is
greater than number of allocated queues.  In netif_alloc_rx_queues,
do BUG_ON if queue_count is zero.

Signed-off-by: Tom Herbert <therbert@google.com>
Acked-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

bd25fa7b

net: fail alloc_netdev_mq if queue count < 1 · 55513fb4

Committed by Tom Herbert on Oct 18, 2010

In alloc_netdev_mq fail if requested queue_count < 1.

Signed-off-by: Tom Herbert <therbert@google.com>
Acked-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

55513fb4

netpoll: Revert napi_poll fix for bonding driver · f13d493d

Committed by Neil Horman on Oct 19, 2010

In an earlier patch I modified napi_poll so that devices with IFF_MASTER polled
the per_cpu list instead of the device list for napi. I did this because the
bonding driver has no napi instances to poll; it instead expects to check the
slave devices' napi instances, which napi_poll was unaware of. Looking at this
more closely however, I now see this isn't strictly needed, as the bond driver's
poll_controller calls the slaves' poll_controller via netpoll_poll_dev, which
recursively calls poll_napi on each slave, allowing those napi instances to get
serviced. The earlier patch isn't at all harmful, it's just not needed, so let's
revert it to make the code cleaner. Sorry for the noise.

Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
Reviewed-by: WANG Cong <amwang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

f13d493d

18 Oct, 2010 2 commits

bonding: Fix napi poll for bonding driver · 990c3d6f

Committed by Neil Horman on Oct 13, 2010

Usually the netpoll path, when performing a napi poll, can get away with just
polling all the napi instances of the configured device. That's not the case for
the bonding driver however, as the napi instances which may wind up getting
flagged as needing polling after the poll_controller call don't belong to the
bonded device, but rather to the slave devices. Fix this by checking the device
in question for the IFF_MASTER flag; if set, we know we need to check the full
poll list for this cpu, rather than just the device's napi instance list.

Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

990c3d6f

bonding: Fix bonding drivers improper modification of netpoll structure · c2355e1a

Committed by Neil Horman on Oct 13, 2010

The bonding driver currently modifies the netpoll structure in its xmit path
while sending frames from netpoll. This is racy, as other cpus can access the
netpoll structure in parallel. Since the bonding driver points np->dev to a
slave device, other cpus can inadvertently attempt to send data directly to
slave devices, leading to improper locking with the bonding master, lost frames,
and deadlocks. This patch fixes that up.

This patch also removes the real_dev pointer from the netpoll structure as that
data is really only used by bonding in the poll_controller, and we can emulate
its behavior by checking each slave for IS_UP.

Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

c2355e1a

17 Oct, 2010 2 commits

fib: remove a useless synchronize_rcu() call · a0a4a85a

Committed by Eric Dumazet on Oct 13, 2010

fib_nl_delrule() calls synchronize_rcu() for no apparent reason,
while rtnl is held.

I suspect it was done to avoid an atomic_inc_not_zero() in
fib_rules_lookup(), which commit 7fa7cb71 added anyway.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

a0a4a85a

net: allocate skbs on local node · 564824b0

Committed by Eric Dumazet on Oct 11, 2010

commit b30973f8 (node-aware skb allocation) spread a wrong habit of
allocating net drivers' skbs on a given memory node: the one closest to
the NIC hardware. This is wrong because as soon as we try to scale the
network stack, we need to use many cpus to handle traffic and hit
slub/slab management on cross-node allocations/frees when these cpus
have to alloc/free skbs bound to a central node.

skbs allocated in the RX path are ephemeral, they have a very short
lifetime: the extra cost to maintain NUMA affinity is too expensive. What
appeared as a nice idea four years ago is in fact a bad one.

In 2010, NIC hardware is multiqueue, or we use RPS to spread the load,
and two 10Gb NICs might deliver more than 28 million packets per second,
needing all the available cpus.

The cost of cross-node handling in the network and vm stacks outweighs the
small benefit the hardware had when doing its DMA transfer in its 'local'
memory node at RX time. Even trying to differentiate the two allocations
done for one skb (the sk_buff on local node, the data part on NIC
hardware node) is not enough to bring good performance.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Acked-by: Tom Herbert <therbert@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

564824b0

OpenHarmony / kernel_linux · last synced more than 3 years ago