Commits · 5219e4c93c281377700206ae2b3ba4d91653d2ba · openanolis / cloud-kernel

15 Nov, 2011 7 commits

bnx2x: add endline at end of message · 5219e4c9

Authored by Dmitry Kravkov on Nov 14, 2011

Reported-by: Joe Perches <joe@perches.com>
Signed-off-by: Dmitry Kravkov <dmitry@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

5219e4c9

IPv6 routing, NLM_F_* flag support: REPLACE and EXCL flags support, warn about missing CREATE flag · 4a287eba

Authored by Matti Vaittinen on Nov 14, 2011

Add support for NLM_F_* flags in IPv6 routing requests.

If the NLM_F_CREATE flag is not set for an RTM_NEWROUTE request, a
warning is printed but no error is returned; the new route is added
anyway. NLM_F_CREATE may be required for new route creation in the
future.

The exception is when the NLM_F_REPLACE flag is given without
NLM_F_CREATE and no matching route is found. In this case it should be
safe to assume that the request issuer is familiar with NLM_F_* flags
and really does not want the route to be created.

Specifying the NLM_F_REPLACE flag now makes the kernel search for a
matching route and replace it with the new one. If no route is found and
NLM_F_CREATE is specified as well, a new route is created.

Also, specifying NLM_F_EXCL returns an error if a matching route is
found.

Patch created against linux-3.2-rc1
Signed-off-by: Matti Vaittinen <Mazziesaccount@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

4a287eba
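
The flag handling described in commit 4a287eba above can be read as a small decision table. The following is a minimal user-space sketch of that table, not the kernel code: the function name, the outcome enum, and the `SK_`-prefixed constants are hypothetical (the constant values mirror `NLM_F_REPLACE`, `NLM_F_EXCL` and `NLM_F_CREATE` from `<linux/netlink.h>`). Two points are assumptions that the commit message does not state: that `NLM_F_EXCL` is checked before `NLM_F_REPLACE` when a match exists, and that a request carrying neither flag is treated as a plain add when a match exists.

```c
#include <stdbool.h>

/* Values mirror <linux/netlink.h>; sketch-local names so this builds anywhere. */
enum {
	SK_NLM_F_REPLACE = 0x100,
	SK_NLM_F_EXCL    = 0x200,
	SK_NLM_F_CREATE  = 0x400,
};

/* Hypothetical outcome codes; these are not kernel identifiers. */
enum route_action {
	ROUTE_ADD,           /* add the new route */
	ROUTE_ADD_WITH_WARN, /* add it, but warn that NLM_F_CREATE was missing */
	ROUTE_REPLACE,       /* replace the matching route with the new one */
	ROUTE_ERR_EXISTS,    /* NLM_F_EXCL given and a matching route exists */
	ROUTE_ERR_NOT_FOUND, /* NLM_F_REPLACE without NLM_F_CREATE, no match */
};

/* Decide what an RTM_NEWROUTE request should do, per the commit message. */
enum route_action newroute_action(unsigned int flags, bool match_found)
{
	if (match_found) {
		/* Assumption: EXCL takes precedence over REPLACE. */
		if (flags & SK_NLM_F_EXCL)
			return ROUTE_ERR_EXISTS;
		if (flags & SK_NLM_F_REPLACE)
			return ROUTE_REPLACE;
		/* Assumption: neither flag set means a plain add. */
		return ROUTE_ADD;
	}

	if (flags & SK_NLM_F_CREATE)
		return ROUTE_ADD;
	if (flags & SK_NLM_F_REPLACE)
		return ROUTE_ERR_NOT_FOUND; /* issuer does not want a new route */
	return ROUTE_ADD_WITH_WARN;     /* tolerated for now, with a warning */
}
```

For example, `newroute_action(SK_NLM_F_REPLACE, false)` yields `ROUTE_ERR_NOT_FOUND`, while `newroute_action(0, false)` yields `ROUTE_ADD_WITH_WARN`.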

IPv6 routing, NLM_F_* flag support: warn if new route is created without NLM_F_CREATE · d71314b4

Authored by Matti Vaittinen on Nov 14, 2011

Add support for NLM_F_* flags in IPv6 routing requests.

Warn if the NLM_F_CREATE flag is not set for an RTM_NEWROUTE request
that creates a new table. NLM_F_CREATE may be required for new route
creation in the future.

Patch created against linux-3.2-rc1
Signed-off-by: Matti Vaittinen <Mazziesaccount@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

d71314b4

net/can/mscan: Fix buggy listen only mode setting · abbd00b8

Authored by Wolfgang Grandegger on Nov 14, 2011

This patch fixes an issue introduced recently with commit
452448f9.

CC: Marc Kleine-Budde <mkl@pengutronix.de>
Signed-off-by: Wolfgang Grandegger <wg@grandegger.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

abbd00b8

Sweep the last of the active .get_drvinfo floors under ethernet/ · 612a94d6

Authored by Rick Jones on Nov 14, 2011

This round of floor sweeping converts strncpy calls in various .get_drvinfo
routines to the preferred strlcpy.  It also does a modicum of other
cleaning in those routines.
Signed-off-by: Rick Jones <rick.jones2@hp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

612a94d6
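
To illustrate why commit 612a94d6 above prefers strlcpy: strncpy leaves the destination without a NUL terminator when the source is at least as long as the buffer, whereas strlcpy always terminates and returns the source length so truncation can be detected. The sketch below uses a local stand-in named `sketch_strlcpy` (a hypothetical name, written here because strlcpy is not available in every C library); it is not the kernel implementation.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Stand-in with strlcpy() semantics: copies at most size - 1 bytes,
 * always NUL-terminates when size > 0, and returns strlen(src). */
size_t sketch_strlcpy(char *dst, const char *src, size_t size)
{
	size_t len;

	assert(dst != NULL && src != NULL);
	len = strlen(src);

	if (size > 0) {
		size_t n = len < size - 1 ? len : size - 1;

		memcpy(dst, src, n);
		dst[n] = '\0';
	}
	return len; /* a value >= size tells the caller the copy was truncated */
}
```

A caller filling a fixed-size field, such as the strings reported by a .get_drvinfo routine, can pass `sizeof(field)` as the size and rely on the result being a valid C string.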

bnx2x: uses build_skb() in receive path · e52fcb24

Authored by Eric Dumazet on Nov 14, 2011

bnx2x uses following formula to compute its rx_buf_sz :

dev->mtu + 2*L1_CACHE_BYTES + 14 + 8 + 8 + 2

Then core network adds NET_SKB_PAD and SKB_DATA_ALIGN(sizeof(struct
skb_shared_info))

Final allocated size for skb head on x86_64 (L1_CACHE_BYTES = 64,
MTU=1500) : 2112 bytes : SLUB/SLAB round this to 4096 bytes.

Since skb truesize is then bigger than SK_MEM_QUANTUM, we have a lot of
false sharing because of mem_reclaim in the UDP stack.

One possible way to halve truesize is to reduce the need by 64 bytes
(2112 -> 2048 bytes)

Instead of allocating a full cache line at the end of packet for
alignment, we can use the fact that skb_shared_info sits at the end of
skb->head, and we can use this room, if we convert bnx2x to new
build_skb() infrastructure.

skb_shared_info will be initialized after the hardware has finished its
transfer, so we can eventually overwrite the final padding.

Using build_skb() also reduces cache line misses in the driver, since we
use cache hot skb instead of cold ones. Number of in-flight sk_buff
structures is lower, they are recycled while still hot.

Performance results :

(820.000 pps on a rx UDP monothread benchmark, instead of 720.000 pps)
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
CC: Eilon Greenstein <eilong@broadcom.com>
CC: Ben Hutchings <bhutchings@solarflare.com>
CC: Tom Herbert <therbert@google.com>
CC: Jamal Hadi Salim <hadi@mojatatu.com>
CC: Stephen Hemminger <shemminger@vyatta.com>
CC: Thomas Graf <tgraf@infradead.org>
CC: Herbert Xu <herbert@gondor.apana.org.au>
CC: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Acked-by: Eilon Greenstein <eilong@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

e52fcb24
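
The size argument in commit e52fcb24 above can be checked with a little arithmetic. This sketch evaluates the rx_buf_sz formula quoted in the message and models the allocator rounding as "next power of two", which is an assumption about the slab size classes for requests of this size rather than something taken from the patch. The function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* rx_buf_sz formula as quoted in the commit message. */
size_t sketch_rx_buf_sz(size_t mtu, size_t l1_cache_bytes)
{
	return mtu + 2 * l1_cache_bytes + 14 + 8 + 8 + 2;
}

/* Assumption: the request is served from power-of-two slab caches,
 * so its size is rounded up to the next power of two. */
size_t sketch_slab_bucket(size_t size)
{
	size_t bucket = 1;

	assert(size > 0);
	while (bucket < size)
		bucket <<= 1;
	return bucket;
}
```

With MTU 1500 and 64-byte cache lines the formula gives 1660 bytes for the driver's part. Under the power-of-two assumption, a 2112-byte head lands in a 4096-byte bucket while a 2048-byte head fits a 2048-byte bucket exactly, which is why shaving 64 bytes halves the allocation.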

net: introduce build_skb() · b2b5ce9d

Authored by Eric Dumazet on Nov 14, 2011

One of the things we discussed during the netdev 2011 conference was the idea
to change some network drivers to allocate/populate their skb at RX
completion time, right before feeding the skb to network stack.

In old days, we allocated skbs when populating the RX ring.

This means bringing into cpu cache sk_buff and skb_shared_info cache
lines (since we clear/initialize them), then 'queue' skb->data to NIC.

By the time the NIC fills a frame in the skb->data buffer and the host can
process it, the cpu has probably thrown away the cache lines from its caches,
because a lot of things happened between the allocation and final use.

So the deal would be to allocate only the data buffer for the NIC to
populate its RX ring buffer. And use build_skb() at RX completion to
attach a data buffer (now filled with an ethernet frame) to a new skb,
initialize the skb_shared_info portion, and give the hot skb to network
stack.

build_skb() is the function to allocate an skb, caller providing the
data buffer that should be attached to it. Drivers are expected to call
skb_reserve() right after build_skb() to adjust skb->data to the
Ethernet frame (usually skipping NET_SKB_PAD and NET_IP_ALIGN, but some
drivers might add a hardware provided alignment)

Data provided to build_skb() MUST have been allocated by a prior
kmalloc() call, with enough room to add SKB_DATA_ALIGN(sizeof(struct
skb_shared_info)) bytes at the end of the data without corrupting
incoming frame.

data = kmalloc(NET_SKB_PAD + NET_IP_ALIGN + 1536 +
               SKB_DATA_ALIGN(sizeof(struct skb_shared_info)),
	       GFP_ATOMIC);
...
skb = build_skb(data);
if (!skb) {
	recycle_data(data);
} else {
	skb_reserve(skb, NET_SKB_PAD + NET_IP_ALIGN);
	...
}
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
CC: Eilon Greenstein <eilong@broadcom.com>
CC: Ben Hutchings <bhutchings@solarflare.com>
CC: Tom Herbert <therbert@google.com>
CC: Jamal Hadi Salim <hadi@mojatatu.com>
CC: Stephen Hemminger <shemminger@vyatta.com>
CC: Thomas Graf <tgraf@infradead.org>
CC: Herbert Xu <herbert@gondor.apana.org.au>
CC: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

b2b5ce9d

14 Nov, 2011 27 commits

net: fsl_pq_mdio: fix non tbi phy access · c3e072f8

Authored by Baruch Siach on Nov 14, 2011

Since 952c5ca1 (fsl_pq_mdio: Clean up tbi address configuration) .probe returns
-EBUSY when the "tbi-phy" node is missing. Fix this.

Cc: Andy Fleming <afleming@freescale.com>
Signed-off-by: Baruch Siach <baruch@tkos.co.il>
Signed-off-by: David S. Miller <davem@davemloft.net>

c3e072f8

net/can/mscan: add listen only mode · 452448f9

Authored by Marc Kleine-Budde on Nov 09, 2011

This patch adds listen only mode to the mscan controller.
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
Acked-by: Wolfgang Grandegger <wg@grandegger.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

452448f9

neigh: new unresolved queue limits · 8b5c171b

Authored by Eric Dumazet on Nov 09, 2011

On Wednesday 09 November 2011 at 16:21 -0500, David Miller wrote:
> From: David Miller <davem@davemloft.net>
> Date: Wed, 09 Nov 2011 16:16:44 -0500 (EST)
>
> > From: Eric Dumazet <eric.dumazet@gmail.com>
> > Date: Wed, 09 Nov 2011 12:14:09 +0100
> >
> >> unres_qlen is the number of frames we are able to queue per unresolved
> >> neighbour. Its default value (3) was never changed and is responsible
> >> for strange drops, especially if IP fragments are used, or multiple
> >> sessions start in parallel. Even a single tcp flow can hit this limit.
> >  ...
> >
> > Ok, I've applied this, let's see what happens :-)
>
> Early answer, build fails.
>
> Please test build this patch with DECNET enabled and resubmit.  The
> decnet neigh layer still refers to the removed ->queue_len member.
>
> Thanks.

Ouch, this was fixed on one machine yesterday, but not the other one I
used this morning, sorry.

[PATCH V5 net-next] neigh: new unresolved queue limits

unres_qlen is the number of frames we are able to queue per unresolved
neighbour. Its default value (3) was never changed and is responsible
for strange drops, especially if IP fragments are used, or multiple
sessions start in parallel. Even a single tcp flow can hit this limit.

$ arp -d 192.168.20.108 ; ping -c 2 -s 8000 192.168.20.108
PING 192.168.20.108 (192.168.20.108) 8000(8028) bytes of data.
8008 bytes from 192.168.20.108: icmp_seq=2 ttl=64 time=0.322 ms
Signed-off-by: David S. Miller <davem@davemloft.net>

8b5c171b

bridge: add NTF_USE support · 292d1398

Authored by stephen hemminger on Nov 09, 2011

More changes to the recent code to support control of forwarding
database via netlink.
   * Support NTF_USE like neighbour table
   * Validate state bits from application
   * Only send notifications (and change bits) if new entry is
     different.
Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

292d1398

Sweep additional floors of strcpy in .get_drvinfo routines · 23020ab3

Authored by Rick Jones on Nov 09, 2011

Perform another round of floor sweeping, converting the .get_drvinfo
routines of additional drivers from strcpy to strlcpy along with
some conversion of sprintf to snprintf.
Signed-off-by: Rick Jones <rick.jones2@hp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

23020ab3

fsl_pq_mdio: Clean up tbi address configuration · 952c5ca1

Authored by Andy Fleming on Nov 11, 2011

The code for setting the address of the internal TBI PHY was
convoluted enough without a maze of ifdefs. Clean it up a bit
so we allow the logic to fall through to -ENODEV at the end of
the if/else ladder, rather than using ifdefs to repeat the same
failure code over and over.

Also, remove the support for the auto-configuration. I'm not aware of
anyone using it, and it ends up using the bus mutex before it's been
initialized.
Signed-off-by: Andy Fleming <afleming@freescale.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

952c5ca1

net-forcedeth: Add internal loopback support for forcedeth NICs. · e19df76a

Authored by Sanjay Hortikar on Nov 11, 2011

Support enabling/disabling/querying internal loopback mode for
forcedeth NICs using ethtool.
Signed-off-by: Sanjay Hortikar <horti@google.com>
Signed-off-by: Mahesh Bandewar <maheshb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

e19df76a

6LoWPAN: update documentation · 63ce40e4

Authored by alex.bluesman.smirnov@gmail.com on Nov 10, 2011

This patch adds a chapter to the documentation which describes how to use
6lowpan technology.
Signed-off-by: Alexander Smirnov <alex.bluesman.smirnov@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

63ce40e4

6LoWPAN: UDP header decompression · f8b1b5d2

Authored by alex.bluesman.smirnov@gmail.com on Nov 10, 2011

This patch makes it possible to decompress UDP headers.
Derived from Contiki OS.
Signed-off-by: Alexander Smirnov <alex.bluesman.smirnov@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

f8b1b5d2

6LoWPAN: UDP header compression · 3bd5b958

Authored by alex.bluesman.smirnov@gmail.com on Nov 10, 2011

This patch adds support for UDP header compression.
Derived from Contiki OS.
Signed-off-by: Alexander Smirnov <alex.bluesman.smirnov@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

3bd5b958

6LoWPAN: set proper netdev flags · 4d039f68

Authored by alex.bluesman.smirnov@gmail.com on Nov 10, 2011

This patch fixes the device initialization settings, which makes it possible
to use NDISC and TCP.
Signed-off-by: Alexander Smirnov <alex.bluesman.smirnov@gmail.com>
Acked-by: Dmitry Eremin-Solenikov <dbaryshkov@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

4d039f68

6LoWPAN: disable debugging by default · e86586ba

Authored by alex.bluesman.smirnov@gmail.com on Nov 10, 2011

This patch disables the debug output that was enabled by default.
Signed-off-by: Alexander Smirnov <alex.bluesman.smirnov@gmail.com>
Acked-by: Dmitry Eremin-Solenikov <dbaryshkov@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

e86586ba

6LoWPAN: add fragmentation support · 719269af

Authored by alex.bluesman.smirnov@gmail.com on Nov 10, 2011

This patch adds support for frame fragmentation.
Signed-off-by: Alexander Smirnov <alex.bluesman.smirnov@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

719269af

ipv6: reduce percpu needs for icmpv6msg mibs · 2a24444f

Authored by Eric Dumazet on Nov 13, 2011

Reading /proc/net/snmp6 on a machine with a lot of cpus is very
expensive (can be ~88000 us).

This is because ICMPV6MSG MIB uses 4096 bytes per cpu, and folding
values for all possible cpus can read 16 Mbytes of memory (32MBytes on
non x86 arches)

ICMP messages are not considered as fast path on a typical server, and
eventually few cpus handle them anyway. We can afford an atomic
operation instead of using percpu data.

This saves 4096 bytes per cpu and per network namespace.
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

2a24444f
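
The trade-off in commit 2a24444f above (per-CPU counters that must be folded on every read, versus one shared array updated atomically) can be sketched in user space with C11 atomics. Everything here is illustrative: the names are hypothetical, the CPU count is a small placeholder, and the slot count of 512 is an assumption chosen so that 512 eight-byte counters give the 4096 bytes per CPU quoted in the message. The 16 Mbytes figure in the message is consistent with folding 4096 bytes for each of 4096 possible CPUs.

```c
#include <assert.h>
#include <stdatomic.h>

#define SKETCH_NR_CPUS   4   /* placeholder CPU count */
#define SKETCH_MIB_SLOTS 512 /* assumed: 512 * 8 bytes = 4096 bytes */

/* Before: one private counter array per CPU. Updates are cheap,
 * but every read has to walk all CPUs. */
static unsigned long percpu_mib[SKETCH_NR_CPUS][SKETCH_MIB_SLOTS];

void percpu_inc(unsigned int cpu, unsigned int slot)
{
	assert(cpu < SKETCH_NR_CPUS && slot < SKETCH_MIB_SLOTS);
	percpu_mib[cpu][slot]++;
}

unsigned long percpu_fold(unsigned int slot)
{
	unsigned long sum = 0;

	assert(slot < SKETCH_MIB_SLOTS);
	for (unsigned int cpu = 0; cpu < SKETCH_NR_CPUS; cpu++)
		sum += percpu_mib[cpu][slot];
	return sum;
}

/* After: a single shared array. Each update is an atomic operation,
 * and a read touches one counter instead of one per CPU. */
static atomic_ulong shared_mib[SKETCH_MIB_SLOTS];

void shared_inc(unsigned int slot)
{
	assert(slot < SKETCH_MIB_SLOTS);
	atomic_fetch_add(&shared_mib[slot], 1);
}

unsigned long shared_read(unsigned int slot)
{
	assert(slot < SKETCH_MIB_SLOTS);
	return atomic_load(&shared_mib[slot]);
}
```

The shared variant pays for an atomic operation on each update, which the commit message argues is affordable because ICMP messages are not on the fast path of a typical server.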

net: introduce ethernet teaming device · 3d249d4c

Authored by Jiri Pirko on Nov 11, 2011

This patch introduces a new network device called team. It is supposed
to be a very fast, simple, userspace-driven alternative to the existing
bonding driver.

A userspace library called libteam, with a couple of demo apps, is
available here:
https://github.com/jpirko/libteam
Note it's still in its diapers atm.

team<->libteam use generic netlink for communication. That and rtnl
are supposed to be the only ways to configure a team device, no sysfs etc.

A Python binding of libteam was recently introduced.
A daemon providing arpmon/miimon active-backup functionality will be
introduced shortly. Everything necessary is already implemented in the
kernel team driver.

v7->v8:
	- check ndo_ndo_vlan_rx_[add/kill]_vid functions before calling
	  them.
	- use dev_kfree_skb_any() instead of dev_kfree_skb()

v6->v7:
	- transmit and receive functions are not checked in hot paths.
	  That also resolves memory leak on transmit when no port is
	  present

v5->v6:
	- changed couple of _rcu calls to non _rcu ones in non-readers

v4->v5:
	- team_change_mtu() uses team->lock while traversing through port
	  list
	- mac address changes are moved completely to jurisdiction of
	  userspace daemon. This way the daemon can do FOM1, FOM2 and
	  possibly other weird things with mac addresses.
	  Only round-robin mode sets up all ports to bond's address when
	  enslaved.
	- Extended Kconfig text

v3->v4:
	- remove redundant synchronize_rcu from __team_change_mode()
	- revert "set and clear of mode_ops happens per pointer, not per
	  byte"
	- extend comment of function __team_change_mode()

v2->v3:
	- team_change_mtu() uses rcu version of list traversal to unwind
	- set and clear of mode_ops happens per pointer, not per byte
	- port hashlist changed to be embedded into team structure
	- error branch in team_port_enter() does cleanup now
	- fixed rtln->rtnl

v1->v2:
	- modes are made as modules. Makes team more modular and
	  extendable.
	- several commenters' nitpicks found on v1 were fixed
	- several other bugs were fixed.
	- note I ignored Eric's comment about roundrobin port selector
	  as Eric's way may be easily implemented as another mode (mode
	  "random") in future.
Signed-off-by: Jiri Pirko <jpirko@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

3d249d4c

bnx2x: update driver version to 1.70.35-0 · 5d70b88c

Authored by Dmitry Kravkov on Nov 13, 2011

Signed-off-by: Dmitry Kravkov <dmitry@broadcom.com>
Signed-off-by: Eilon Greenstein <eilong@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

5d70b88c

bnx2x: Remove on-stack napi struct variable · 72754080

Authored by Ariel Elior on Nov 13, 2011

Signed-off-by: Ariel Elior <ariele@broadcom.com>
Signed-off-by: Eilon Greenstein <eilong@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

72754080

bnx2x: prevent race in statistics flow · 4a025f49

Authored by Dmitry Kravkov on Nov 13, 2011

The race may cause registers to be accessed while the MAC hw block is
in reset state. As a result, syslog will show error messages.
We can prevent this by using the state from a local variable.
Signed-off-by: Dmitry Kravkov <dmitry@broadcom.com>
Signed-off-by: Eilon Greenstein <eilong@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

4a025f49

bnx2x: add fan failure event handling · 8304859a

Authored by Ariel Elior on Nov 13, 2011

Shut down the device in case of fan failure to prevent HW damage.
Signed-off-by: Dmitry Kravkov <dmitry@broadcom.com>
Signed-off-by: Eilon Greenstein <eilong@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

8304859a

bnx2x: remove unused #define · 46fa1309

Authored by Dmitry Kravkov on Nov 13, 2011

Signed-off-by: Dmitry Kravkov <dmitry@broadcom.com>
Signed-off-by: Eilon Greenstein <eilong@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

46fa1309

bnx2x: simplify definition of RX_SGE_MASK_LEN and use it. · b3637827

Authored by Dmitry Kravkov on Nov 13, 2011

Signed-off-by: Dmitry Kravkov <dmitry@broadcom.com>
Signed-off-by: Eilon Greenstein <eilong@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

b3637827

bnx2x: DCBX: use #define instead of magic · f9c058b6

Authored by Dmitry Kravkov on Nov 13, 2011

Signed-off-by: Dmitry Kravkov <dmitry@broadcom.com>
Signed-off-by: Eilon Greenstein <eilong@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

f9c058b6

bnx2x: propagate DCBX negotiation · 00253a8c

Authored by Dmitry Kravkov on Nov 13, 2011

We need to propagate the DCBX results from the PMF to other functions
on the same port, in order to properly update the netdev structure
and allow following new ETS and PFC configurations.
Signed-off-by: Dmitry Kravkov <dmitry@broadcom.com>
Signed-off-by: Eilon Greenstein <eilong@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

00253a8c

bnx2x: separate FCoE and iSCSI license initialization. · b306f5ed

Authored by Dmitry Kravkov on Nov 13, 2011

FCoE license info must be initialized at probe(), but
iSCSI at open().
Signed-off-by: Dmitry Kravkov <dmitry@broadcom.com>
Signed-off-by: Eilon Greenstein <eilong@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

b306f5ed

bnx2x: remove unused variable · ad756594

Authored by Dmitry Kravkov on Nov 13, 2011

Signed-off-by: Dmitry Kravkov <dmitry@broadcom.com>
Signed-off-by: Eilon Greenstein <eilong@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

ad756594

bnx2x: use rx_queue index for skb_record_rx_queue() · f233cafe

Authored by Dmitry Kravkov on Nov 13, 2011

Signed-off-by: Dmitry Kravkov <dmitry@broadcom.com>
Signed-off-by: Eilon Greenstein <eilong@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

f233cafe

bnx2x: allow FCoE and DCB for 578xx · 62ac0dc9

Authored by Dmitry Kravkov on Nov 13, 2011

Signed-off-by: Dmitry Kravkov <dmitry@broadcom.com>
Signed-off-by: Eilon Greenstein <eilong@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

62ac0dc9

13 Nov, 2011 4 commits

be2net: stop issuing FW cmds if any cmd times out · 6589ade0

Authored by Sathya Perla on Nov 10, 2011

A FW cmd timeout (with a sufficiently large timeout value in the
order of tens of seconds) indicates an unresponsive FW. In this state
issuing further cmds and waiting for a completion will only stall the process.
Signed-off-by: Sathya Perla <sathya.perla@emulex.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

6589ade0

be2net: don't log more than one error on detecting EEH/UE errors · 434b3648

Authored by Sathya Perla on Nov 10, 2011

Currently we're spamming error messages each time a FW cmd call is made
while in EEH/UE error state. One log msg on error detection is enough.
Signed-off-by: Sathya Perla <sathya.perla@emulex.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

434b3648

be2net: stop checking the UE registers after an EEH error · 72f02485

Authored by Sathya Perla on Nov 10, 2011

Signed-off-by: Sathya Perla <sathya.perla@emulex.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

72f02485

be2net: init (vf)_if_handle/vf_pmac_id to handle failure scenarios · 30128031

Authored by Sathya Perla on Nov 10, 2011

Initialize if_handle, vf_if_handle and vf_pmac_id with "-1" so that in
failure cases when be_clear() is called, we can skip over
if_destroy/pmac_del cmds if they have not been created.
Signed-off-by: Sathya Perla <sathya.perla@emulex.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

30128031
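
The pattern described in commit 30128031 above, initializing handles to -1 so that teardown can skip resources that were never created, might look like the following in isolation. This is a simplified sketch with hypothetical names; it does not use the be2net structures or firmware commands, and a counter stands in for the destroy commands that would be issued.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified adapter state; not the be2net structures. */
struct sketch_adapter {
	int if_handle;     /* -1 means "never created" */
	int pmac_id;       /* -1 means "never created" */
	int destroy_calls; /* teardown commands issued, for illustration */
};

void sketch_init(struct sketch_adapter *a)
{
	assert(a != NULL);
	a->if_handle = -1;
	a->pmac_id = -1;
	a->destroy_calls = 0;
}

/* Teardown that is safe to call after a partially failed setup. */
void sketch_clear(struct sketch_adapter *a)
{
	assert(a != NULL);
	if (a->pmac_id != -1) {
		a->destroy_calls++; /* stands in for a pmac_del command */
		a->pmac_id = -1;
	}
	if (a->if_handle != -1) {
		a->destroy_calls++; /* stands in for an if_destroy command */
		a->if_handle = -1;
	}
}
```

If setup fails before either resource exists, `sketch_clear()` issues no commands; if only the interface was created, it issues exactly one.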

10 Nov, 2011 2 commits

ipv4: PKTINFO doesnt need dst reference · d826eb14

Authored by Eric Dumazet on Nov 09, 2011

On Monday 07 November 2011 at 15:33 +0100, Eric Dumazet wrote:

> At least, in recent kernels we don't change dst->refcnt in the forwarding
> path (using NOREF skb->dst)
>
> One particular point is the atomic_inc(dst->refcnt) we have to perform
> when queuing an UDP packet if socket asked PKTINFO stuff (for example a
> typical DNS server has to setup this option)
>
> I have one patch somewhere that stores the information in skb->cb[] and
> avoid the atomic_{inc|dec}(dst->refcnt).
>

OK I found it, I did some extra tests and believe it's ready.

[PATCH net-next] ipv4: IP_PKTINFO doesnt need dst reference

When a socket uses IP_PKTINFO notifications, we currently force a dst
reference for each received skb. Reader has to access dst to get needed
information (rt_iif & rt_spec_dst) and must release dst reference.

We also forced a dst reference if skb was put in socket backlog, even
without IP_PKTINFO handling. This happens under stress/load.

We can instead store the needed information in skb->cb[], so that only
softirq handler really access dst, improving cache hit ratios.

This removes two atomic operations per packet, and false sharing as
well.

On a benchmark using a mono threaded receiver (doing only recvmsg()
calls), I can reach 720.000 pps instead of 570.000 pps.

IP_PKTINFO is typically used by DNS servers, and any multihomed aware
UDP application.
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

d826eb14
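
For readers unfamiliar with the socket option that commit d826eb14 above optimizes, here is a minimal user-space sketch of how a UDP receiver on Linux might request and read IP_PKTINFO data. It shows only the application side and says nothing about the kernel change itself. The helper names `enable_pktinfo` and `recv_pktinfo` are hypothetical; `setsockopt`, `recvmsg`, the `CMSG_*` macros and `struct in_pktinfo` are the standard Linux interfaces.

```c
#define _GNU_SOURCE /* exposes struct in_pktinfo in glibc */
#include <assert.h>
#include <arpa/inet.h>
#include <netinet/in.h>
#include <stddef.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/uio.h>

/* Ask the kernel to attach an IP_PKTINFO control message to each
 * received datagram. Returns 0 on success, -1 on error. */
int enable_pktinfo(int fd)
{
	int on = 1;

	return setsockopt(fd, IPPROTO_IP, IP_PKTINFO, &on, sizeof(on));
}

/* Receive one datagram and copy its IP_PKTINFO data into *info.
 * Returns the payload length, or -1 on error or when no IP_PKTINFO
 * control message was attached. */
ssize_t recv_pktinfo(int fd, void *buf, size_t len, struct in_pktinfo *info)
{
	/* Union keeps the control buffer aligned for struct cmsghdr. */
	union {
		char buf[CMSG_SPACE(sizeof(struct in_pktinfo))];
		struct cmsghdr align;
	} ctl;
	struct iovec iov = { .iov_base = buf, .iov_len = len };
	struct msghdr msg = {
		.msg_iov = &iov,
		.msg_iovlen = 1,
		.msg_control = ctl.buf,
		.msg_controllen = sizeof(ctl.buf),
	};
	ssize_t n;

	assert(info != NULL);
	n = recvmsg(fd, &msg, 0);
	if (n < 0)
		return -1;

	for (struct cmsghdr *c = CMSG_FIRSTHDR(&msg); c != NULL;
	     c = CMSG_NXTHDR(&msg, c)) {
		if (c->cmsg_level == IPPROTO_IP && c->cmsg_type == IP_PKTINFO) {
			memcpy(info, CMSG_DATA(c), sizeof(*info));
			return n;
		}
	}
	return -1;
}
```

After a successful call, `info->ipi_ifindex` holds the index of the receiving interface and `info->ipi_addr` the destination address from the packet header, which is what lets a multihomed server reply from the address the client actually used.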

ipv4: reduce percpu needs for icmpmsg mibs · acb32ba3

Authored by Eric Dumazet on Nov 08, 2011

Reading /proc/net/snmp on a machine with a lot of cpus is very expensive
(can be ~88000 us).

This is because ICMPMSG MIB uses 4096 bytes per cpu, and folding values
for all possible cpus can read 16 Mbytes of memory.

ICMP messages are not considered as fast path on a typical server, and
eventually few cpus handle them anyway. We can afford an atomic
operation instead of using percpu data.

This saves 4096 bytes per cpu and per network namespace.
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

acb32ba3

openanolis / cloud-kernel synced successfully more than 1 year ago