提交 · b10e3d6c2e557e1dd62c0b6276707b68aa0e5287 · openeuler / raspberrypi-kernel

25 5月, 2015 3 次提交

net: af_unix: implement splice for stream af_unix sockets · 2b514574

由 Hannes Frederic Sowa 提交于 5月 21, 2015

unix_stream_recvmsg is refactored to unix_stream_read_generic in this
patch and enhanced to deal with pipe splicing. The refactoring is
inneglible, we mostly have to deal with a non-existing struct msghdr
argument.
Signed-off-by: NHannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

2b514574

net: make skb_splice_bits more configureable · a60e3cc7

由 Hannes Frederic Sowa 提交于 5月 21, 2015

Prepare skb_splice_bits to be able to deal with AF_UNIX sockets.

AF_UNIX sockets don't use lock_sock/release_sock and thus we have to
use a callback to make the locking and unlocking configureable.
Signed-off-by: NHannes Frederic Sowa <hannes@stressinduktion.org>
Acked-by: NEric Dumazet <edumazet@google.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

a60e3cc7

net: skbuff: add skb_append_pagefrags and use it · be12a1fe

由 Hannes Frederic Sowa 提交于 5月 21, 2015

Signed-off-by: NHannes Frederic Sowa <hannes@stressinduktion.org>
Acked-by: NEric Dumazet <edumazet@google.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

be12a1fe

14 5月, 2015 1 次提交

net: Reserve skb headroom and set skb->dev even if using __alloc_skb · a080e7bd

由 Alexander Duyck 提交于 5月 13, 2015

When I had inlined __alloc_rx_skb into __netdev_alloc_skb and
__napi_alloc_skb I had overlooked the fact that there was a return in the
__alloc_rx_skb.  As a result we weren't reserving headroom or setting the
skb->dev in certain cases.  This change corrects that by adding a couple of
jump labels to jump to depending on __alloc_skb either succeeding or failing.

Fixes: 9451980a ("net: Use cached copy of pfmemalloc to avoid accessing page")
Reported-by: NFelipe Balbi <balbi@ti.com>
Signed-off-by: NAlexander Duyck <alexander.h.duyck@redhat.com>
Tested-by: NKevin Hilman <khilman@linaro.org>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

a080e7bd

12 5月, 2015 4 次提交

net: Add skb_free_frag to replace use of put_page in freeing skb->head · 181edb2b

由 Alexander Duyck 提交于 5月 06, 2015

This change adds a function called skb_free_frag which is meant to
compliment the function netdev_alloc_frag. The general idea is to enable a
more lightweight version of page freeing since we don't actually need all
the overhead of a put_page, and we don't quite fit the model of __free_pages.
Signed-off-by: NAlexander Duyck <alexander.h.duyck@redhat.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

181edb2b

mm/net: Rename and move page fragment handling from net/ to mm/ · b63ae8ca

由 Alexander Duyck 提交于 5月 06, 2015

This change moves the __alloc_page_frag functionality out of the networking
stack and into the page allocation portion of mm. The idea it so help make
this maintainable by placing it with other page allocation functions.

Since we are moving it from skbuff.c to page_alloc.c I have also renamed
the basic defines and structure from netdev_alloc_cache to page_frag_cache
to reflect that this is now part of a different kernel subsystem.

I have also added a simple __free_page_frag function which can handle
freeing the frags based on the skb->head pointer. The model for this is
based off of __free_pages since we don't actually need to deal with all of
the cases that put_page handles. I incorporated the virt_to_head_page call
and compound_order into the function as it actually allows for a signficant
size reduction by reducing code duplication.
Signed-off-by: NAlexander Duyck <alexander.h.duyck@redhat.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

b63ae8ca

net: Store virtual address instead of page in netdev_alloc_cache · 0e392508

由 Alexander Duyck 提交于 5月 06, 2015

This change makes it so that we store the virtual address of the page
in the netdev_alloc_cache instead of the page pointer. The idea behind
this is to avoid multiple calls to page_address since the virtual address
is required for every access, but the page pointer is only needed at
allocation or reset of the page.

While I was at it I also reordered the netdev_alloc_cache structure a bit
so that the size is always 16 bytes by dropping size in the case where
PAGE_SIZE is greater than or equal to 32KB.
Signed-off-by: NAlexander Duyck <alexander.h.duyck@redhat.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

0e392508

net: Use cached copy of pfmemalloc to avoid accessing page · 9451980a

由 Alexander Duyck 提交于 5月 06, 2015

While testing I found that the testing for pfmemalloc in build_skb was
rather expensive.  I found the issue to be two-fold.  First we have to get
from the virtual address to the head page and that comes at the cost of
something like 11 cycles.  Then there is the cost for reading pfmemalloc out
of the head page which can be cache cold due to the fact that
put_page_testzero is likely invalidating the cache-line on one or more
CPUs as the fragments can be shared.

To avoid this extra expense I have added a pfmemalloc member to the
netdev_alloc_cache.  I then pushed pieces of __alloc_rx_skb into
__napi_alloc_skb and __netdev_alloc_skb so that I could rewrite them to
make use of the cached pfmemalloc value.  The result is that my perf traces
show a reduction from 9.28% overhead to 3.7% for the code covered by
build_skb, __alloc_rx_skb, and __napi_alloc_skb when performing a test with
the packet being dropped instead of being handed to napi_gro_receive.
Signed-off-by: NAlexander Duyck <alexander.h.duyck@redhat.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

9451980a

05 5月, 2015 2 次提交

net: fix two sparse warnings introduced by IGMP/MLD parsing exports · fcba67c9

由 Linus Lüssing 提交于 5月 05, 2015

> net/core/skbuff.c:4108:13: sparse: incorrect type in assignment (different base types)
> net/ipv6/mcast_snoop.c:63 ipv6_mc_check_exthdrs() warn: unsigned 'offset' is never less than zero.

Introduced by 9afd85c9
("net: Export IGMP/MLD message validation code")
Reported-by: Nkbuild test robot <fengguang.wu@intel.com>
Signed-off-by: NLinus Lüssing <linus.luessing@c0d3.blue>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

fcba67c9

net: Export IGMP/MLD message validation code · 9afd85c9

由 Linus Lüssing 提交于 5月 02, 2015

With this patch, the IGMP and MLD message validation functions are moved
from the bridge code to IPv4/IPv6 multicast files. Some small
refactoring was done to enhance readibility and to iron out some
differences in behaviour between the IGMP and MLD parsing code (e.g. the
skb-cloning of MLD messages is now only done if necessary, just like the
IGMP part always did).

Finally, these IGMP and MLD message validation functions are exported so
that not only the bridge can use it but batman-adv later, too.
Signed-off-by: NLinus Lüssing <linus.luessing@c0d3.blue>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

9afd85c9

26 4月, 2015 1 次提交

net: fix crash in build_skb() · 2ea2f62c

由 Eric Dumazet 提交于 4月 24, 2015

When I added pfmemalloc support in build_skb(), I forgot netlink
was using build_skb() with a vmalloc() area.

In this patch I introduce __build_skb() for netlink use,
and build_skb() is a wrapper handling both skb->head_frag and
skb->pfmemalloc

This means netlink no longer has to hack skb->head_frag

[ 1567.700067] kernel BUG at arch/x86/mm/physaddr.c:26!
[ 1567.700067] invalid opcode: 0000 [#1] PREEMPT SMP KASAN
[ 1567.700067] Dumping ftrace buffer:
[ 1567.700067]    (ftrace buffer empty)
[ 1567.700067] Modules linked in:
[ 1567.700067] CPU: 9 PID: 16186 Comm: trinity-c182 Not tainted 4.0.0-next-20150424-sasha-00037-g4796e21 #2167
[ 1567.700067] task: ffff880127efb000 ti: ffff880246770000 task.ti: ffff880246770000
[ 1567.700067] RIP: __phys_addr (arch/x86/mm/physaddr.c:26 (discriminator 3))
[ 1567.700067] RSP: 0018:ffff8802467779d8  EFLAGS: 00010202
[ 1567.700067] RAX: 000041000ed8e000 RBX: ffffc9008ed8e000 RCX: 000000000000002c
[ 1567.700067] RDX: 0000000000000004 RSI: 0000000000000000 RDI: ffffffffb3fd6049
[ 1567.700067] RBP: ffff8802467779f8 R08: 0000000000000019 R09: ffff8801d0168000
[ 1567.700067] R10: ffff8801d01680c7 R11: ffffed003a02d019 R12: ffffc9000ed8e000
[ 1567.700067] R13: 0000000000000f40 R14: 0000000000001180 R15: ffffc9000ed8e000
[ 1567.700067] FS:  00007f2a7da3f700(0000) GS:ffff8801d1000000(0000) knlGS:0000000000000000
[ 1567.700067] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 1567.700067] CR2: 0000000000738308 CR3: 000000022e329000 CR4: 00000000000007e0
[ 1567.700067] Stack:
[ 1567.700067]  ffffc9000ed8e000 ffff8801d0168000 ffffc9000ed8e000 ffff8801d0168000
[ 1567.700067]  ffff880246777a28 ffffffffad7c0a21 0000000000001080 ffff880246777c08
[ 1567.700067]  ffff88060d302e68 ffff880246777b58 ffff880246777b88 ffffffffad9a6821
[ 1567.700067] Call Trace:
[ 1567.700067] build_skb (include/linux/mm.h:508 net/core/skbuff.c:316)
[ 1567.700067] netlink_sendmsg (net/netlink/af_netlink.c:1633 net/netlink/af_netlink.c:2329)
[ 1567.774369] ? sched_clock_cpu (kernel/sched/clock.c:311)
[ 1567.774369] ? netlink_unicast (net/netlink/af_netlink.c:2273)
[ 1567.774369] ? netlink_unicast (net/netlink/af_netlink.c:2273)
[ 1567.774369] sock_sendmsg (net/socket.c:614 net/socket.c:623)
[ 1567.774369] sock_write_iter (net/socket.c:823)
[ 1567.774369] ? sock_sendmsg (net/socket.c:806)
[ 1567.774369] __vfs_write (fs/read_write.c:479 fs/read_write.c:491)
[ 1567.774369] ? get_lock_stats (kernel/locking/lockdep.c:249)
[ 1567.774369] ? default_llseek (fs/read_write.c:487)
[ 1567.774369] ? vtime_account_user (kernel/sched/cputime.c:701)
[ 1567.774369] ? rw_verify_area (fs/read_write.c:406 (discriminator 4))
[ 1567.774369] vfs_write (fs/read_write.c:539)
[ 1567.774369] SyS_write (fs/read_write.c:586 fs/read_write.c:577)
[ 1567.774369] ? SyS_read (fs/read_write.c:577)
[ 1567.774369] ? __this_cpu_preempt_check (lib/smp_processor_id.c:63)
[ 1567.774369] ? trace_hardirqs_on_caller (kernel/locking/lockdep.c:2594 kernel/locking/lockdep.c:2636)
[ 1567.774369] ? trace_hardirqs_on_thunk (arch/x86/lib/thunk_64.S:42)
[ 1567.774369] system_call_fastpath (arch/x86/kernel/entry_64.S:261)

Fixes: 79930f58 ("net: do not deplete pfmemalloc reserve")
Signed-off-by: NEric Dumazet <edumazet@google.com>
Reported-by: NSasha Levin <sasha.levin@oracle.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

2ea2f62c

23 4月, 2015 1 次提交

net: do not deplete pfmemalloc reserve · 79930f58

由 Eric Dumazet 提交于 4月 22, 2015

build_skb() should look at the page pfmemalloc status.
If set, this means page allocator allocated this page in the
expectation it would help to free other pages. Networking
stack can do that only if skb->pfmemalloc is also set.

Also, we must refrain using high order pages from the pfmemalloc
reserve, so __page_frag_refill() must also use __GFP_NOMEMALLOC for
them. Under memory pressure, using order-0 pages is probably the best
strategy.
Signed-off-by: NEric Dumazet <edumazet@google.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

79930f58

17 4月, 2015 2 次提交

skbuff: Do not scrub skb mark within the same name space · 213dd74a

由 Herbert Xu 提交于 4月 16, 2015

On Wed, Apr 15, 2015 at 05:41:26PM +0200, Nicolas Dichtel wrote:
> Le 15/04/2015 15:57, Herbert Xu a écrit :
> >On Wed, Apr 15, 2015 at 06:22:29PM +0800, Herbert Xu wrote:
> [snip]
> >Subject: skbuff: Do not scrub skb mark within the same name space
> >
> >The commit ea23192e ("tunnels:
> Maybe add a Fixes tag?
> Fixes: ea23192e ("tunnels: harmonize cleanup done on skb on rx path")
>
> >harmonize cleanup done on skb on rx path") broke anyone trying to
> >use netfilter marking across IPv4 tunnels.  While most of the
> >fields that are cleared by skb_scrub_packet don't matter, the
> >netfilter mark must be preserved.
> >
> >This patch rearranges skb_scurb_packet to preserve the mark field.
> nit: s/scurb/scrub
>
> Else it's fine for me.

Sure.

PS I used the wrong email for James the first time around.  So
let me repeat the question here.  Should secmark be preserved
or cleared across tunnels within the same name space? In fact,
do our security models even support name spaces?

---8<---
The commit ea23192e ("tunnels:
harmonize cleanup done on skb on rx path") broke anyone trying to
use netfilter marking across IPv4 tunnels.  While most of the
fields that are cleared by skb_scrub_packet don't matter, the
netfilter mark must be preserved.

This patch rearranges skb_scrub_packet to preserve the mark field.

Fixes: ea23192e ("tunnels: harmonize cleanup done on skb on rx path")
Signed-off-by: NHerbert Xu <herbert@gondor.apana.org.au>
Acked-by: NThomas Graf <tgraf@suug.ch>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

213dd74a

Revert "net: Reset secmark when scrubbing packet" · 4c0ee414

由 Herbert Xu 提交于 4月 16, 2015

This patch reverts commit b8fb4e06
because the secmark must be preserved even when a packet crosses
namespace boundaries.  The reason is that security labels apply to
the system as a whole and is not per-namespace.
Signed-off-by: NHerbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

4c0ee414

08 4月, 2015 1 次提交

net: remove extra newlines · 8bc0034c

由 Sheng Yong 提交于 4月 08, 2015

Signed-off-by: NSheng Yong <shengyong1@huawei.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

8bc0034c

12 3月, 2015 2 次提交

sock: fix possible NULL sk dereference in __skb_tstamp_tx · 3a8dd971

由 Willem de Bruijn 提交于 3月 11, 2015

Test that sk != NULL before reading sk->sk_tsflags.

Fixes: 49ca0d8b ("net-timestamp: no-payload option")
Reported-by: NOne Thousand Gnomes <gnomes@lxorguk.ukuu.org.uk>
Signed-off-by: NWillem de Bruijn <willemb@google.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

3a8dd971

xps: must clear sender_cpu before forwarding · c29390c6

由 Eric Dumazet 提交于 3月 11, 2015

John reported that my previous commit added a regression
on his router.

This is because sender_cpu & napi_id share a common location,
so get_xps_queue() can see garbage and perform an out of bound access.

We need to make sure sender_cpu is cleared before doing the transmit,
otherwise any NIC busy poll enabled (skb_mark_napi_id()) can trigger
this bug.
Signed-off-by: NEric Dumazet <edumazet@google.com>
Reported-by: NJohn <jw@nuclearfallout.net>
Bisected-by: NJohn <jw@nuclearfallout.net>
Fixes: 2bd82484 ("xps: fix xps for stacked devices")
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

c29390c6

07 3月, 2015 1 次提交

net: gro: remove obsolete code from skb_gro_receive() · 58025e46

由 Eric Dumazet 提交于 3月 05, 2015

Some drivers use copybreak to copy tiny frames into smaller skb,
and this smaller skb might not have skb->head_frag set for various
reasons.

skb_gro_receive() currently doesn't allow to aggregate the smaller skb
into the previous GRO packet if this GRO packet has at least 2 MSS in
it.

Following workload easily demonstrates the problem.

netperf -t TCP_RR -H target -- -r 3000,3000

(tcpdump shows one GRO packet with 2 MSS, plus one additional packet of
104 bytes that should have been appended.)

It turns out that we can remove code from skb_gro_receive(), because
commit 8a29111c ("net: gro: allow to build full sized skb") and its
followups removed the assumption that a GRO packet with a frag_list had
to have an empty head.

Removing this code allows the aggregation of the last (incomplete) frame
in some RPC workloads. Note that tcp_gro_receive() already takes care of
forcing a flush if necessary, including this case.

If we want to avoid using frag_list in the first place (in forwarding
workloads for example, as the outgoing NIC is generally not able to cope
with skbs having a frag_list), we need to address this separately.
Signed-off-by: NEric Dumazet <edumazet@google.com>
Cc: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

58025e46

23 2月, 2015 1 次提交

net: Remove state argument from skb_find_text() · 059a2440

由 Bojan Prtvar 提交于 2月 22, 2015

Although it is clear that textsearch state is intentionally passed to
skb_find_text() as uninitialized argument, it was never used by the
callers. Therefore, we can simplify skb_find_text() by making it
local variable.
Signed-off-by: NBojan Prtvar <prtvar.b@gmail.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

059a2440

21 2月, 2015 1 次提交

sock: sock_dequeue_err_skb() needs hard irq safety · 997d5c3f

由 Eric Dumazet 提交于 2月 18, 2015

Non NAPI drivers can call skb_tstamp_tx() and then sock_queue_err_skb()
from hard IRQ context.

Therefore, sock_dequeue_err_skb() needs to block hard irq or
corruptions or hangs can happen.
Signed-off-by: NEric Dumazet <edumazet@google.com>
Fixes: 364a9e93 ("sock: deduplicate errqueue dequeue")
Fixes: cb820f8e ("net: Provide a generic socket error queue delivery method for Tx time stamps.")
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

997d5c3f

05 2月, 2015 1 次提交

xps: fix xps for stacked devices · 2bd82484

由 Eric Dumazet 提交于 2月 03, 2015

A typical qdisc setup is the following :

bond0 : bonding device, using HTB hierarchy
eth1/eth2 : slaves, multiqueue NIC, using MQ + FQ qdisc

XPS allows to spread packets on specific tx queues, based on the cpu
doing the send.

Problem is that dequeues from bond0 qdisc can happen on random cpus,
due to the fact that qdisc_run() can dequeue a batch of packets.

CPUA -> queue packet P1 on bond0 qdisc, P1->ooo_okay=1
CPUA -> queue packet P2 on bond0 qdisc, P2->ooo_okay=0

CPUB -> dequeue packet P1 from bond0
        enqueue packet on eth1/eth2
CPUC -> dequeue packet P2 from bond0
        enqueue packet on eth1/eth2 using sk cache (ooo_okay is 0)

get_xps_queue() then might select wrong queue for P1, since current cpu
might be different than CPUA.

P2 might be sent on the old queue (stored in sk->sk_tx_queue_mapping),
if CPUC runs a bit faster (or CPUB spins a bit on qdisc lock)

Effect of this bug is TCP reorders, and more generally not optimal
TX queue placement. (A victim bulk flow can be migrated to the wrong TX
queue for a while)

To fix this, we have to record sender cpu number the first time
dev_queue_xmit() is called for one tx skb.

We can union napi_id (used on receive path) and sender_cpu,
granted we clear sender_cpu in skb_scrub_packet() (credit to Willem for
this union idea)
Signed-off-by: NEric Dumazet <edumazet@google.com>
Cc: Willem de Bruijn <willemb@google.com>
Cc: Nandita Dukkipati <nanditad@google.com>
Cc: Yuchung Cheng <ycheng@google.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

2bd82484

03 2月, 2015 2 次提交

net-timestamp: no-payload only sysctl · b245be1f

由 Willem de Bruijn 提交于 1月 30, 2015

Tx timestamps are looped onto the error queue on top of an skb. This
mechanism leaks packet headers to processes unless the no-payload
options SOF_TIMESTAMPING_OPT_TSONLY is set.

Add a sysctl that optionally drops looped timestamp with data. This
only affects processes without CAP_NET_RAW.

The policy is checked when timestamps are generated in the stack.
It is possible for timestamps with data to be reported after the
sysctl is set, if these were queued internally earlier.

No vulnerability is immediately known that exploits knowledge
gleaned from packet headers, but it may still be preferable to allow
administrators to lock down this path at the cost of possible
breakage of legacy applications.
Signed-off-by: NWillem de Bruijn <willemb@google.com>

----

Changes
  (v1 -> v2)
  - test socket CAP_NET_RAW instead of capable(CAP_NET_RAW)
  (rfc -> v1)
  - document the sysctl in Documentation/sysctl/net.txt
  - fix access control race: read .._OPT_TSONLY only once,
        use same value for permission check and skb generation.
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

b245be1f

net-timestamp: no-payload option · 49ca0d8b

由 Willem de Bruijn 提交于 1月 30, 2015

Add timestamping option SOF_TIMESTAMPING_OPT_TSONLY. For transmit
timestamps, this loops timestamps on top of empty packets.

Doing so reduces the pressure on SO_RCVBUF. Payload inspection and
cmsg reception (aside from timestamps) are no longer possible. This
works together with a follow on patch that allows administrators to
only allow tx timestamping if it does not loop payload or metadata.
Signed-off-by: NWillem de Bruijn <willemb@google.com>

----

Changes (rfc -> v1)
  - add documentation
  - remove unnecessary skb->len test (thanks to Richard Cochran)
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

49ca0d8b

14 1月, 2015 1 次提交

net: rename vlan_tx_* helpers since "tx" is misleading there · df8a39de

由 Jiri Pirko 提交于 1月 13, 2015

The same macros are used for rx as well. So rename it.
Signed-off-by: NJiri Pirko <jiri@resnulli.us>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

df8a39de

03 1月, 2015 1 次提交

net: skbuff: don't zero tc members when freeing skb · e8768f97

由 Florian Westphal 提交于 12月 31, 2014

Not needed, only four cases:
 - kfree_skb (or one of its aliases).
   Don't need to zero, memory will be freed.
 - kfree_skb_partial and head was stolen:  memory will be freed.
 - skb_morph:  The skb header fields (including tc ones) will be
   copied over from the 'to-be-morphed' skb right after
   skb_release_head_state returns.
 - skb_segment:  Same as before, all the skb header
   fields are copied over from the original skb right away.
Signed-off-by: NFlorian Westphal <fw@strlen.de>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

e8768f97

24 12月, 2014 1 次提交

net: Reset secmark when scrubbing packet · b8fb4e06

由 Thomas Graf 提交于 12月 23, 2014

skb_scrub_packet() is called when a packet switches between a context
such as between underlay and overlay, between namespaces, or between
L3 subnets.

While we already scrub the packet mark, connection tracking entry,
and cached destination, the security mark/context is left intact.

It seems wrong to inherit the security context of a packet when going
from overlay to underlay or across forwarding paths.
Signed-off-by: NThomas Graf <tgraf@suug.ch>
Acked-by: NFlavio Leitner <fbl@sysclose.org>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

b8fb4e06

11 12月, 2014 2 次提交

net: Pull out core bits of __netdev_alloc_skb and add __napi_alloc_skb · fd11a83d

由 Alexander Duyck 提交于 12月 09, 2014

This change pulls the core functionality out of __netdev_alloc_skb and
places them in a new function named __alloc_rx_skb. The reason for doing
this is to make these bits accessible to a new function __napi_alloc_skb.
In addition __alloc_rx_skb now has a new flags value that is used to
determine which page frag pool to allocate from. If the SKB_ALLOC_NAPI
flag is set then the NAPI pool is used. The advantage of this is that we
do not have to use local_irq_save/restore when accessing the NAPI pool from
NAPI context.

In my test setup I saw at least 11ns of savings using the napi_alloc_skb
function versus the netdev_alloc_skb function, most of this being due to
the fact that we didn't have to call local_irq_save/restore.

The main use case for napi_alloc_skb would be for things such as copybreak
or page fragment based receive paths where an skb is allocated after the
data has been received instead of before.
Signed-off-by: NAlexander Duyck <alexander.h.duyck@redhat.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

fd11a83d

net: Split netdev_alloc_frag into __alloc_page_frag and add __napi_alloc_frag · ffde7328

由 Alexander Duyck 提交于 12月 09, 2014

This patch splits the netdev_alloc_frag function up so that it can be used
on one of two page frag pools instead of being fixed on the
netdev_alloc_cache.  By doing this we can add a NAPI specific function
__napi_alloc_frag that accesses a pool that is only used from softirq
context.  The advantage to this is that we do not need to call
local_irq_save/restore which can be a significant savings.

I also took the opportunity to refactor the core bits that were placed in
__alloc_page_frag.  First I updated the allocation to do either a 32K
allocation or an order 0 page.  This is based on the changes in commmit
d9b2938a where it was found that latencies could be reduced in case of
failures.  Then I also rewrote the logic to work from the end of the page to
the start.  By doing this the size value doesn't have to be used unless we
have run out of space for page fragments.  Finally I cleaned up the atomic
bits so that we just do an atomic_sub_and_test and if that returns true then
we set the page->_count via an atomic_set.  This way we can remove the extra
conditional for the atomic_read since it would have led to an atomic_inc in
the case of success anyway.
Signed-off-by: NAlexander Duyck <alexander.h.duyck@redhat.com>
Acked-by: NAlexei Starovoitov <ast@plumgrid.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

ffde7328

10 12月, 2014 1 次提交

net: avoid two atomic operations in fast clones · 6ffe75eb

由 Eric Dumazet 提交于 12月 03, 2014

Commit ce1a4ea3 ("net: avoid one atomic operation in skb_clone()")
took the wrong way to save one atomic operation.

It is actually possible to avoid two atomic operations, if we
do not change skb->fclone values, and only rely on clone_ref
content to signal if the clone is available or not.

skb_clone() can simply use the fast clone if clone_ref is 1.

kfree_skbmem() can avoid the atomic_dec_and_test() if clone_ref is 1.

Note that because we usually free the clone before the original skb,
this particular attempt is only done for the original skb to have better
branch prediction.

SKB_FCLONE_FREE is removed.
Signed-off-by: NEric Dumazet <edumazet@google.com>
Cc: Chris Mason <clm@fb.com>
Cc: Sabrina Dubroca <sd@queasysnail.net>
Cc: Vijay Subramanian <subramanian.vijay@gmail.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

6ffe75eb

22 11月, 2014 3 次提交

net: Revert "net: avoid one atomic operation in skb_clone()" · e7820e39

由 Eric Dumazet 提交于 11月 21, 2014

Not sure what I was thinking, but doing anything after
releasing a refcount is suicidal or/and embarrassing.

By the time we set skb->fclone to SKB_FCLONE_FREE, another cpu
could have released last reference and freed whole skb.

We potentially corrupt memory or trap if CONFIG_DEBUG_PAGEALLOC is set.
Reported-by: NChris Mason <clm@fb.com>
Fixes: ce1a4ea3 ("net: avoid one atomic operation in skb_clone()")
Signed-off-by: NEric Dumazet <edumazet@google.com>
Cc: Sabrina Dubroca <sd@queasysnail.net>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

e7820e39

net: move vlan pop/push functions into common code · 93515d53

由 Jiri Pirko 提交于 11月 19, 2014

So it can be used from out of openvswitch code.
Did couple of cosmetic changes on the way, namely variable naming and
adding support for 8021AD proto.
Signed-off-by: NJiri Pirko <jiri@resnulli.us>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

93515d53

net: move make_writable helper into common code · e2195121

由 Jiri Pirko 提交于 11月 19, 2014

note that skb_make_writable already exists in net/netfilter/core.c
but does something slightly different.
Suggested-by: NEric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: NJiri Pirko <jiri@resnulli.us>
Acked-by: NPravin B Shelar <pshelar@nicira.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

e2195121

06 11月, 2014 1 次提交

udp: Changes to udp_offload to support remote checksum offload · e585f236

由 Tom Herbert 提交于 11月 04, 2014

Add a new GSO type, SKB_GSO_TUNNEL_REMCSUM, which indicates remote
checksum offload being done (in this case inner checksum must not
be offloaded to the NIC).

Added logic in __skb_udp_tunnel_segment to handle remote checksum
offload case.
Signed-off-by: NTom Herbert <therbert@google.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

e585f236

30 10月, 2014 1 次提交

net: skb_segment() should preserve backpressure · 432c856f

由 Toshiaki Makita 提交于 10月 27, 2014

This patch generalizes commit d6a4a104 ("tcp: GSO should be TSQ
friendly") to protocols using skb_set_owner_w()

TCP uses its own destructor (tcp_wfree) and needs a more complex scheme
as explained in commit 6ff50cd5 ("tcp: gso: do not generate out of
order packets")

This allows UDP sockets using UFO to get proper backpressure,
thus avoiding qdisc drops and excessive cpu usage.

Here are performance test results (macvlan on vlan):

- Before
# netperf -t UDP_STREAM ...
Socket  Message  Elapsed      Messages
Size    Size     Time         Okay Errors   Throughput
bytes   bytes    secs            #      #   10^6bits/sec

212992   65507   60.00      144096 1224195    1258.56
212992           60.00          51              0.45

Average:        CPU     %user     %nice   %system   %iowait    %steal     %idle
Average:        all      0.23      0.00     25.26      0.08      0.00     74.43

- After
# netperf -t UDP_STREAM ...
Socket  Message  Elapsed      Messages
Size    Size     Time         Okay Errors   Throughput
bytes   bytes    secs            #      #   10^6bits/sec

212992   65507   60.00      109593      0     957.20
212992           60.00      109593            957.20

Average:        CPU     %user     %nice   %system   %iowait    %steal     %idle
Average:        all      0.18      0.00      8.38      0.02      0.00     91.43

[edumazet] Rewrote patch and changelog.
Signed-off-by: NToshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
Signed-off-by: NEric Dumazet <edumazet@google.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

432c856f

21 10月, 2014 1 次提交

net: core: handle encapsulation offloads when computing segment lengths · f993bc25

由 Florian Westphal 提交于 10月 20, 2014

if ->encapsulation is set we have to use inner_tcp_hdrlen and add the
size of the inner network headers too.

This is 'mostly harmless'; tbf might send skb that is slightly over
quota or drop skb even if it would have fit.
Signed-off-by: NFlorian Westphal <fw@strlen.de>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

f993bc25

11 10月, 2014 1 次提交

net: fix races in page->_count manipulation · 4c450583

由 Eric Dumazet 提交于 10月 10, 2014

This is illegal to use atomic_set(&page->_count, ...) even if we 'own'
the page. Other entities in the kernel need to use get_page_unless_zero()
to get a reference to the page before testing page properties, so we could
loose a refcount increment.

The only case it is valid is when page->_count is 0

Fixes: 540eb7bf ("net: Update alloc frag to reduce get/put page usage and recycle pages")
Signed-off-by: NEric Dumaze <edumazet@google.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

4c450583

10 10月, 2014 1 次提交

net: Missing @ before descriptions cause make xmldocs warning · de3f0d0e

由 Masanari Iida 提交于 10月 09, 2014

This patch fix following warning.
Warning(.//net/core/skbuff.c:4142): No description found for parameter 'header_len'
Warning(.//net/core/skbuff.c:4142): No description found for parameter 'data_len'
Warning(.//net/core/skbuff.c:4142): No description found for parameter 'max_page_order'
Warning(.//net/core/skbuff.c:4142): No description found for parameter 'errcode'
Warning(.//net/core/skbuff.c:4142): No description found for parameter 'gfp_mask'

Acutually the descriptions exist, but missing "@" in front.

This problem start to happen when following commit was merged
into Linus's tree during 3.18-rc1 merge period.
commit 2e4e4410
net: add alloc_skb_with_frags() helper
Signed-off-by: NMasanari Iida <standby24x7@gmail.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

de3f0d0e

06 10月, 2014 1 次提交

net: skb_segment() provides list head and tail · bec3cfdc

由 Eric Dumazet 提交于 10月 03, 2014

Its unfortunate we have to walk again skb list to find the tail
after segmentation, even if data is probably hot in cpu caches.

skb_segment() can store the tail of the list into segs->prev,
and validate_xmit_skb_list() can immediately get the tail.
Signed-off-by: NEric Dumazet <edumazet@google.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

bec3cfdc

05 10月, 2014 1 次提交

net: Cleanup skb cloning by adding SKB_FCLONE_FREE · c8753d55

由 Vijay Subramanian 提交于 10月 02, 2014

SKB_FCLONE_UNAVAILABLE has overloaded meaning depending on type of skb.
1: If skb is allocated from head_cache, it indicates fclone is not available.
2: If skb is a companion fclone skb (allocated from fclone_cache), it indicates
it is available to be used.

To avoid confusion for case 2 above, this patch  replaces
SKB_FCLONE_UNAVAILABLE with SKB_FCLONE_FREE where appropriate. For fclone
companion skbs, this indicates it is free for use.

SKB_FCLONE_UNAVAILABLE will now simply indicate skb is from head_cache and
cannot / will not have a companion fclone.
Signed-off-by: NVijay Subramanian <subramanian.vijay@gmail.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

c8753d55

04 10月, 2014 1 次提交

net: do not export skb_gro_receive() · 01291202

由 Eric Dumazet 提交于 10月 02, 2014

skb_gro_receive() is only called from tcp_gro_receive() which is
not in a module.
Signed-off-by: NEric Dumazet <edumazet@google.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

01291202