- 11 March 2021, 2 commits
-
-
Submitted by Daniel Borkmann

I ran into a crash where setting up an ip6ip6 tunnel device which was /not/ set to collect_md mode was receiving collect_md-populated skbs for xmit. The BPF prog was populating the skb via bpf_skb_set_tunnel_key(), which assigns a special metadata dst entry, and then redirecting the skb to the device, taking ip6_tnl_start_xmit() -> ipxip6_tnl_xmit() -> ip6_tnl_xmit(). The latter performs a neigh lookup based on skb_dst(skb), where we trigger a NULL pointer dereference on dst->ops->neigh_lookup(), since the md_dst_ops do not populate the neigh_lookup callback with a fake handler.

Transform the md_dst_ops into generic dst_blackhole_ops that can also be reused elsewhere when needed, and use them as callback ops for the metadata dst entries. Also, remove the dst_md_discard{,_out}() ops and rely on dst_discard{,_out}() from dst_init(), which free the skb the same way modulo the splat. Given we can recover just fine from there, avoid any potential splats if this ever gets triggered in the future (or worse, panics when warnings are set to panic).

Fixes: f38a9eb1 ("dst: Metadata destinations")
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
-
Submitted by Daniel Borkmann

Move the generic blackhole dst ops to the core and use them from both ipv4_dst_blackhole_ops and ip6_dst_blackhole_ops where possible. No functional change otherwise. We need these in other locations as well, and having to define them over and over again is not great.

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
-
- 04 March 2021, 1 commit
-
-
Submitted by zhang kai

Signed-off-by: zhang kai <zhangkaiheb@126.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
-
- 02 March 2021, 1 commit
-
-
Submitted by Willem de Bruijn

The referenced commit expands the skb_seq_state used by skb_find_text with a 4B frag_off field, growing it to 48B. This exceeds the containing ts_state->cb, causing a stack corruption:

[   73.238353] Kernel panic - not syncing: stack-protector: Kernel stack is corrupted in: skb_find_text+0xc5/0xd0
[   73.247384] CPU: 1 PID: 376 Comm: nping Not tainted 5.11.0+ #4
[   73.252613] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.14.0-2 04/01/2014
[   73.260078] Call Trace:
[   73.264677]  dump_stack+0x57/0x6a
[   73.267866]  panic+0xf6/0x2b7
[   73.270578]  ? skb_find_text+0xc5/0xd0
[   73.273964]  __stack_chk_fail+0x10/0x10
[   73.277491]  skb_find_text+0xc5/0xd0
[   73.280727]  string_mt+0x1f/0x30
[   73.283639]  ipt_do_table+0x214/0x410

The struct is passed between skb_find_text and its callbacks skb_prepare_seq_read, skb_seq_read and skb_abort_seq_read through the textsearch interface using TS_SKB_CB. I assumed that this mapped to skb->cb like other .._SKB_CB wrappers. skb->cb is 48B. But it maps to ts_state->cb, which is only 40B. skb->cb was increased from 40B to 48B after ts_state was introduced, in commit 3e3850e9 ("[NETFILTER]: Fix xfrm lookup in ip_route_me_harder/ip6_route_me_harder").

Increase ts_state.cb[] to 48 to fit the struct. Also add a BUILD_BUG_ON to avoid a repeat. The alternative is to directly add a dependency from textsearch onto linux/skbuff.h, but I think the intent is for textsearch to have no such dependencies on its callers.

Link: https://bugzilla.kernel.org/show_bug.cgi?id=211911
Fixes: 97550f6f ("net: compound page support in skb_seq_read")
Reported-by: Kris Karas <bugs-a17@moonlit-rail.com>
Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
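Illustration (a minimal sketch of the fix described above; sizes per the commit, not necessarily the verbatim hunk):

    /* include/linux/textsearch.h: grow the control buffer */
    struct ts_state {
            unsigned int    offset;
            char            cb[48];  /* was 40; must fit struct skb_seq_state */
    };

    /* net/core/skbuff.c: skb_find_text() asserts the fit at build time */
    BUILD_BUG_ON(sizeof(struct skb_seq_state) > sizeof(state.cb));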
-
- 14 February 2021, 11 commits
-
-
Submitted by Alexander Lobakin

napi_frags_finish() and napi_skb_finish() can only be called inside the NAPI Rx context, so we can feed the NAPI cache with skbuff_heads that got a NAPI_MERGED_FREE verdict instead of freeing them immediately. Replace __kfree_skb() with __kfree_skb_defer() in napi_skb_finish() and move napi_skb_free_stolen_head() to skbuff.c, so it can drop skbs into the NAPI cache. As many drivers call napi_alloc_skb()/napi_get_frags() on their receive path, this becomes especially useful.

Signed-off-by: Alexander Lobakin <alobakin@pm.me>
Signed-off-by: David S. Miller <davem@davemloft.net>
-
Submitted by Alexander Lobakin

{,__}napi_alloc_skb() is mostly used either for optional non-linear receive methods (usually controlled via Ethtool private flags and off by default) and/or for Rx copybreaks. Use __napi_build_skb() here to obtain skbuff_heads from the NAPI cache instead of in-place allocations. This covers both the kmalloc and the page-frag paths.

Signed-off-by: Alexander Lobakin <alobakin@pm.me>
Signed-off-by: David S. Miller <davem@davemloft.net>
-
Submitted by Alexander Lobakin

Reuse the old and forgotten SKB_ALLOC_NAPI to add an option to get an skbuff_head from the NAPI cache instead of an in-place allocation inside __alloc_skb(). This implies that the function is called from softirq or BH-off context, not for allocating a clone and not from a distant node.

Cc: Alexander Duyck <alexander.duyck@gmail.com> # Simplified flags check
Signed-off-by: Alexander Lobakin <alobakin@pm.me>
Signed-off-by: David S. Miller <davem@davemloft.net>
-
Submitted by Alexander Lobakin

Instead of just bulk-flushing skbuff_heads queued up through napi_consume_skb() or __kfree_skb_defer(), try to reuse them on the allocation path. If the cache is empty on allocation, bulk-allocate the first 16 elements, which is more efficient than per-skb allocation. If the cache is full on freeing, bulk-wipe the second half of the cache (32 elements). This also includes custom KASAN poisoning/unpoisoning to be doubly sure there are no use-after-free cases.

To not change current behaviour, introduce a new function, napi_build_skb(), so drivers can optionally adopt the new approach later.

Note on the selected bulk size, 16:
- it equals XDP_BULK_QUEUE_SIZE, DEV_MAP_BULK_SIZE and especially VETH_XDP_BATCH, which is also used to bulk-allocate skbuff_heads and was tested on powerful setups;
- it also showed the best performance in the actual test series (out of {8, 16, 32}).

Suggested-by: Edward Cree <ecree.xilinx@gmail.com> # Divide on two halves
Suggested-by: Eric Dumazet <edumazet@google.com> # KASAN poisoning
Cc: Dmitry Vyukov <dvyukov@google.com> # Help with KASAN
Cc: Paolo Abeni <pabeni@redhat.com> # Reduced batch size
Signed-off-by: Alexander Lobakin <alobakin@pm.me>
Signed-off-by: David S. Miller <davem@davemloft.net>
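Illustration (a simplified sketch of the allocation side; names follow the description above, with the KASAN handling and the page-frag part of the cache omitted):

    #define NAPI_SKB_CACHE_SIZE  64
    #define NAPI_SKB_CACHE_BULK  16

    struct napi_alloc_cache {
            unsigned int  skb_count;
            void          *skb_cache[NAPI_SKB_CACHE_SIZE];
    };
    static DEFINE_PER_CPU(struct napi_alloc_cache, napi_alloc_cache);

    static struct sk_buff *napi_skb_cache_get(void)
    {
            struct napi_alloc_cache *nc = this_cpu_ptr(&napi_alloc_cache);

            /* empty cache: refill with one bulk allocation of 16 heads */
            if (unlikely(!nc->skb_count))
                    nc->skb_count = kmem_cache_alloc_bulk(skbuff_head_cache,
                                                          GFP_ATOMIC,
                                                          NAPI_SKB_CACHE_BULK,
                                                          nc->skb_cache);
            if (unlikely(!nc->skb_count))
                    return NULL;

            return nc->skb_cache[--nc->skb_count];
    }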
-
Submitted by Alexander Lobakin

The NAPI cache structures will be used for allocating skbuff_heads, so move their declarations a bit higher up.

Signed-off-by: Alexander Lobakin <alobakin@pm.me>
Signed-off-by: David S. Miller <davem@davemloft.net>
-
Submitted by Alexander Lobakin

This function isn't much needed, as the NAPI skb queue gets bulk-freed anyway when there's no more room, and it may even reduce the efficiency of bulk operations. It will be needed even less after reusing the skb cache on the allocation path, so remove it and lighten network softirqs a bit.

Suggested-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Alexander Lobakin <alobakin@pm.me>
Signed-off-by: David S. Miller <davem@davemloft.net>
-
Submitted by Alexander Lobakin

Just call __build_skb_around() instead of open-coding it.

Signed-off-by: Alexander Lobakin <alobakin@pm.me>
Signed-off-by: David S. Miller <davem@davemloft.net>
-
Submitted by Alexander Lobakin

Use unlikely() annotations for skbuff_head and data, similarly to the two other allocation functions, and remove a totally redundant goto.

Signed-off-by: Alexander Lobakin <alobakin@pm.me>
Signed-off-by: David S. Miller <davem@davemloft.net>
-
Submitted by Alexander Lobakin

__build_skb_around() can never fail and always returns the passed skb. Make it return void to simplify and optimize the code.

Signed-off-by: Alexander Lobakin <alobakin@pm.me>
Signed-off-by: David S. Miller <davem@davemloft.net>
-
Submitted by Alexander Lobakin

Ever since the introduction of __kmalloc_reserve(), the "ip" argument hasn't been used. _RET_IP_ is embedded inside kmalloc_node_track_caller(). Remove the redundant macro and rename the function after it.

Signed-off-by: Alexander Lobakin <alobakin@pm.me>
Signed-off-by: David S. Miller <davem@davemloft.net>
-
Submitted by Alexander Lobakin

In preparation for reusing several functions in all three skb allocation variants, move __alloc_skb() next to __netdev_alloc_skb() and __napi_alloc_skb(). No functional changes.

Signed-off-by: Alexander Lobakin <alobakin@pm.me>
Signed-off-by: David S. Miller <davem@davemloft.net>
-
- 13 February 2021, 6 commits
-
-
Submitted by Davide Caratti

The following command:

 # tc filter add dev $h2 ingress protocol ip pref 1 handle 101 flower \
   $tcflags dst_ip 192.0.2.2 ip_ttl 63 action drop

doesn't drop all IPv4 packets that match the configured TTL / destination address. In particular, if "fragment offset" or "more fragments" have a non-zero value in the IPv4 header, the setting of FLOW_DISSECTOR_KEY_IP is simply ignored. Fix this by dissecting IPv4 TTL and TOS before the fragment info; while at it, add a selftest for tc flower's match on 'ip_ttl' that verifies the correct behavior.

Fixes: 518d8a2e ("net/flow_dissector: add support for dissection of misc ip header fields")
Reported-by: Shuang Li <shuali@redhat.com>
Signed-off-by: Davide Caratti <dcaratti@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
-
Submitted by Jesper Dangaard Brouer

The use case for dropping the MTU check when TC-BPF does redirect to ingress is described by Eyal Birger in email [0]. The summary is the ability to increase packet size (e.g. with IPv6 headers for NAT64), redirect the packet to ingress, and let the normal netstack fragment the packet as needed.

[0] https://lore.kernel.org/netdev/CAHsH6Gug-hsLGHQ6N0wtixdOa85LDZ3HNRHVd0opR=19Qo4W4Q@mail.gmail.com/

V15:
- missing static for function declaration
V9:
- Make net_device "up" (IFF_UP) check explicit in skb_do_redirect
V4:
- Keep net_device "up" (IFF_UP) check.
- Adjustment to handle bpf_redirect_peer() helper

Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Link: https://lore.kernel.org/bpf/161287790971.790810.11785274340154740591.stgit@firesoul
-
Submitted by Jesper Dangaard Brouer

This BPF helper, bpf_check_mtu(), works for both XDP and TC-BPF programs. The SKB object is complex, and the skb->len value (accessible from the BPF prog) also includes the length of any extra GRO/GSO segments, but without taking into account that these GRO/GSO segments get transport (L4) and network (L3) headers added before being transmitted. Thus, this BPF helper is created so that the BPF programmer doesn't need to handle these details in the BPF prog.

The API is designed to help BPF programmers who want to do packet context size changes, which involve other helpers. These other helpers usually do a delta size adjustment. This helper also supports a delta size (len_diff), which allows the BPF programmer to reuse arguments needed by those other helpers and perform the MTU check prior to doing any actual size adjustment of the packet context.

It is on purpose that we allow the len adjustment to become a negative result that will pass the MTU check. This might seem weird, but it's not this helper's responsibility to "catch" wrong len_diff adjustments. Other helpers will take care of those checks, if the BPF programmer chooses to do an actual size adjustment.

V14:
- Improve man-page desc of len_diff.
V13:
- Enforce flag BPF_MTU_CHK_SEGS cannot use len_diff.
V12:
- Simplify segment check that calls skb_gso_validate_network_len.
- Helpers should return long
V9:
- Use dev->hard_header_len (instead of ETH_HLEN)
- Annotate with unlikely req from Daniel
- Fix logic error using skb_gso_validate_network_len from Daniel
V6:
- Took John's advice and dropped BPF_MTU_CHK_RELAX
- Returned MTU is kept at L3-level (like fib_lookup)
V4: Lots of changes
- ifindex 0 now uses current netdev for MTU lookup
- rename helper from bpf_mtu_check to bpf_check_mtu
- fix bug for GSO pkt length (as skb->len is total len)
- remove __bpf_len_adj_positive, simply allow negative len adj

Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Link: https://lore.kernel.org/bpf/161287790461.790810.3429728639563297353.stgit@firesoul
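Illustration (a hedged usage sketch from the TC side; the program name and encap length are invented for the example):

    #include <linux/bpf.h>
    #include <linux/pkt_cls.h>
    #include <bpf/bpf_helpers.h>

    SEC("tc")
    int mtu_guard(struct __sk_buff *ctx)
    {
            __u32 mtu_len = 0;
            const __s32 encap_len = 40;  /* e.g. an IPv6 header to be pushed */

            /* ifindex 0: check against the current netdev's MTU */
            if (bpf_check_mtu(ctx, 0, &mtu_len, encap_len, 0))
                    return TC_ACT_SHOT;  /* packet + encap would exceed MTU */

            /* safe to grow the packet now, e.g. via bpf_skb_adjust_room() */
            return TC_ACT_OK;
    }

    char LICENSE[] SEC("license") = "GPL";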
-
Submitted by Jesper Dangaard Brouer

The BPF helpers for FIB lookup (bpf_xdp_fib_lookup and bpf_skb_fib_lookup) can perform an MTU check and return BPF_FIB_LKUP_RET_FRAG_NEEDED. The BPF prog doesn't know the MTU value that caused this rejection. If the BPF prog wants to implement PMTU (Path MTU Discovery, RFC 1191), it needs to know this MTU value for the ICMP packet.

This patch changes the lookup and result struct bpf_fib_lookup to contain this MTU value as output, via a union with 'tot_len', as this is the value used for the MTU lookup.

V5:
- Fixed uninit value spotted by Dan Carpenter.
- Name struct output member mtu_result

Reported-by: kernel test robot <lkp@intel.com>
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Link: https://lore.kernel.org/bpf/161287789952.790810.13134700381067698781.stgit@firesoul
-
Submitted by Jesper Dangaard Brouer

A BPF end-user on the Cilium Slack channel (Carlo Carraro) wants to use bpf_fib_lookup for doing an MTU check, but *prior* to extending the packet size, by adjusting fib_params 'tot_len' with the packet length plus the expected encap size (just like the bpf_check_mtu helper supports). He discovered that for the SKB ctx, params->tot_len was not used; instead skb->len was used (via the MTU check in is_skb_forwardable(), which checks against the netdev MTU).

Fix this by using fib_params 'tot_len' for the MTU check. If not provided (i.e. zero), keep the existing TC behaviour intact. Notice that the 'tot_len' MTU check is done like the XDP code path, which checks against the FIB-dst MTU.

V16:
- Revert V13 optimization; the 2nd lookup is against the egress/resulting netdev
V13:
- Only do ifindex lookup one time, calling dev_get_by_index_rcu().
V10:
- Use same method as XDP for 'tot_len' MTU check

Fixes: 4c79579b ("bpf: Change bpf_fib_lookup to return lookup status")
Reported-by: Carlo Carraro <colrack@gmail.com>
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/161287789444.790810.15247494756551413508.stgit@firesoul
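Illustration (a sketch of the intended usage from a TC program; addressing fields are elided, encap_len is assumed computed earlier, and mtu_result is the union member from the companion patch above):

    struct bpf_fib_lookup params = {};
    long rc;

    params.family  = 2;  /* AF_INET */
    params.ifindex = ctx->ingress_ifindex;
    /* let the FIB MTU check account for the encap we plan to add */
    params.tot_len = ctx->len + encap_len;
    /* ... fill ipv4_src/ipv4_dst etc. as usual ... */

    rc = bpf_fib_lookup(ctx, &params, sizeof(params), 0);
    if (rc == BPF_FIB_LKUP_RET_FRAG_NEEDED) {
            /* params.mtu_result holds the MTU that was exceeded */
            return TC_ACT_SHOT;
    }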
-
Submitted by Jesper Dangaard Brouer

Multiple BPF helpers that can manipulate/increase the size of the SKB use __bpf_skb_max_len() as the max length. This function limits the size against the current net_device MTU (skb->dev->mtu). When a BPF prog grows the packet size, it should not be limited to the MTU. The MTU is a transmit limitation, and software receiving this packet should be allowed to increase the size. Furthermore, the current MTU check in __bpf_skb_max_len uses the MTU from the ingress/current net_device, which in the case of redirects uses the wrong net_device.

This patch keeps a sanity max limit of SKB_MAX_ALLOC (16KiB). The real limit is elsewhere in the system. Jesper's testing [1] showed it was not possible to exceed 8KiB when expanding the SKB size via a BPF helper. The limiting factor is the define KMALLOC_MAX_CACHE_SIZE, which is 8192 for the SLUB allocator (CONFIG_SLUB) in case PAGE_SIZE is 4096; this define is in effect because this is called from softirq context, see __gfp_pfmemalloc_flags() and __do_kmalloc_node(). Jakub's testing showed that frames above 16KiB can cause NICs to reset (but not crash). Keep the sanity limit at this level, as the memory layer can differ based on kernel config.

[1] https://github.com/xdp-project/bpf-examples/tree/master/MTU-tests

Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Link: https://lore.kernel.org/bpf/161287788936.790810.2937823995775097177.stgit@firesoul
-
- 12 February 2021, 4 commits
-
-
Submitted by Tariq Toukan

Use a new config, SOCK_RX_QUEUE_MAPPING, to compile in the socket RX queue field and logic, instead of the XPS config. This breaks the dependency on XPS and allows selecting it from non-XPS use cases, as we do in the next patch.

In addition, use the new flag to wrap the logic in sk_rx_queue_get() and protect access to the sk_rx_queue_mapping field, while keeping the function exposed unconditionally, just like sk_rx_queue_set() and sk_rx_queue_clear().

Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Reviewed-by: Maxim Mikityanskiy <maximmi@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
-
Submitted by Cong Wang

dev_ifsioc_locked() is called with only the RCU read lock, so when there is a parallel writer changing the mac address, it could read a partially updated mac address, as shown below:

Thread 1                                    Thread 2
// eth_commit_mac_addr_change()
memcpy(dev->dev_addr, addr->sa_data, ETH_ALEN);
                                            // dev_ifsioc_locked()
                                            memcpy(ifr->ifr_hwaddr.sa_data, dev->dev_addr,...);

Close this race condition by guarding them with an RW semaphore, like netdev_get_name(). We can not use a seqlock here as it does not allow blocking. The writers already take RTNL anyway, so this does not affect the slow path. To avoid bothering existing dev_set_mac_address() callers in drivers, introduce a new wrapper just for user-facing callers on the ioctl and rtnetlink paths.

Note, bonding also changes slave mac addresses, but that requires a separate patch due to the complexity of the bonding code.

Fixes: 3710becf ("net: RCU locking for simple ioctl()")
Reported-by: "Gong, Sishuai" <sishuai@purdue.edu>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Cong Wang <cong.wang@bytedance.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
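Illustration (the locking shape, simplified; dev_addr_sem and the user-facing wrapper as described above, shown as a sketch rather than the verbatim patch):

    static DECLARE_RWSEM(dev_addr_sem);

    /* reader (ioctl path): a consistent address snapshot is guaranteed */
    down_read(&dev_addr_sem);
    memcpy(ifr->ifr_hwaddr.sa_data, dev->dev_addr,
           min(sizeof(ifr->ifr_hwaddr.sa_data), (size_t)dev->addr_len));
    up_read(&dev_addr_sem);

    /* writer: wrapper used only by user-facing ioctl/rtnetlink callers */
    int dev_set_mac_address_user(struct net_device *dev, struct sockaddr *sa,
                                 struct netlink_ext_ack *extack)
    {
            int ret;

            down_write(&dev_addr_sem);
            ret = dev_set_mac_address(dev, sa, extack);
            up_write(&dev_addr_sem);

            return ret;
    }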
-
Submitted by Florent Revest

This needs a new helper that:
- can work in a sleepable context (using sock_gen_cookie)
- takes a struct sock pointer and checks that it's not NULL

Signed-off-by: Florent Revest <revest@chromium.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: KP Singh <kpsingh@kernel.org>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20210210111406.785541-2-revest@chromium.org
-
Submitted by Eric Dumazet

It is simpler to make net->net_cookie a plain u64 written once in setup_net(), instead of looping and using atomic64 helpers. Lorenz Bauer wants to add a SO_NETNS_COOKIE socket option, and this patch makes his patch series simpler.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: Lorenz Bauer <lmb@cloudflare.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Tested-by: Lorenz Bauer <lmb@cloudflare.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
-
- 10 February 2021, 3 commits
-
-
Submitted by Wei Wang

This patch adds a new sysfs attribute to the network device class. The attribute provides a per-device control to enable/disable the threaded mode for all the napi instances of the given network device, without the need for a device up/down. The user sets it to 1 or 0 to enable or disable threaded mode.

Note: when switching between threaded and the current softirq-based mode for a napi instance, the switch will not take effect immediately if the napi is currently being polled. The mode switch will happen the next time napi_schedule() is called.

Co-developed-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Co-developed-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Co-developed-by: Felix Fietkau <nbd@nbd.name>
Signed-off-by: Felix Fietkau <nbd@nbd.name>
Signed-off-by: Wei Wang <weiwan@google.com>
Reviewed-by: Alexander Duyck <alexanderduyck@fb.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
-
Submitted by Wei Wang

This patch allows running each napi poll loop inside its own kernel thread. The kthread is created during netif_napi_add() if dev->threaded is set, and threaded mode is enabled in napi_enable(). We will provide a way to set dev->threaded and enable threaded mode without a device up/down in the following patch.

Once threaded mode is enabled and the kthread is started, napi_schedule() will wake up that thread instead of scheduling the softirq. The threaded poll loop behaves much like net_rx_action, but it does not have to manipulate local irqs and uses an explicit scheduling point based on netdev_budget.

Co-developed-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Co-developed-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Co-developed-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Wei Wang <weiwan@google.com>
Reviewed-by: Alexander Duyck <alexanderduyck@fb.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
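Illustration (a condensed sketch of the kthread body; the in-tree napi_threaded_poll() carries more state handling, this keeps only the shape described above):

    static int napi_threaded_poll(void *data)
    {
            struct napi_struct *napi = data;

            while (!kthread_should_stop()) {
                    bool repoll;

                    /* sleep until napi_schedule() wakes this thread */
                    set_current_state(TASK_INTERRUPTIBLE);
                    if (!test_bit(NAPI_STATE_SCHED, &napi->state))
                            schedule();
                    __set_current_state(TASK_RUNNING);

                    do {
                            repoll = false;

                            local_bh_disable();
                            __napi_poll(napi, &repoll);
                            local_bh_enable();

                            /* explicit scheduling point, no softirq budget */
                            cond_resched();
                    } while (repoll && !kthread_should_stop());
            }
            return 0;
    }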
-
Submitted by Felix Fietkau

This commit introduces a new function, __napi_poll(), which does the main logic of the existing napi_poll() function and will be called by other functions in later commits. The idea and implementation are by Felix Fietkau <nbd@nbd.name> and were proposed as part of a patch to move napi work to workqueue context. This commit by itself is a code restructure.

Signed-off-by: Felix Fietkau <nbd@nbd.name>
Signed-off-by: Wei Wang <weiwan@google.com>
Reviewed-by: Alexander Duyck <alexanderduyck@fb.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
-
- 09 February 2021, 1 commit
-
-
Submitted by Alexander Duyck

In order to access the subordinate dev for a device, we should be holding the rtnl_lock when outside of the transmit path. The existing code was not doing that for the sysfs dump function, and as a result we were open to a possible race. To resolve that, take the rtnl lock prior to accessing the sb_dev field of the Tx queue and release it after we have retrieved the tc for the queue.

Signed-off-by: Alexander Duyck <alexanderduyck@fb.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
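Illustration (the likely shape of the fix in the sysfs show path; rtnl_trylock()/restart_syscall() is the usual sysfs-under-rtnl idiom, and the exact hunk may differ):

    if (!rtnl_trylock())
            return restart_syscall();

    /* queue->sb_dev is only stable against writers while rtnl is held */
    tc = netdev_txq_to_tc(dev, index);

    rtnl_unlock();

    if (tc < 0)
            return -EINVAL;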
-
- 07 February 2021, 2 commits
-
-
Submitted by Vladimir Oltean

This reverts commit 1532b977. The above commit is good and it works; however, it was meant as a bugfix for stable kernels, and now we have more self-contained ways in DSA to handle the situation where the DSA master must be brought up.

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
-
Submitted by Kevin Hao

The current implementation of {netdev,napi}_alloc_frag() gives no alignment guarantee for the returned buffer address. But some hardware requires the DMA buffer to be aligned correctly, so we would have to use workarounds like the one below if buffers allocated by {netdev,napi}_alloc_frag() are used by such hardware for DMA:

    buf = napi_alloc_frag(really_needed_size + align);
    buf = PTR_ALIGN(buf, align);

This code seems ugly and would waste a lot of memory if the buffers are used in a network driver for TX/RX. We have added align support to the page_frag functions, so add the corresponding {netdev,napi} frag functions.

Signed-off-by: Kevin Hao <haokexin@gmail.com>
Reviewed-by: Alexander Duyck <alexanderduyck@fb.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
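Illustration (a usage sketch of the new helper; align must be a power of two, as in the underlying page_frag align support):

    buf = napi_alloc_frag_align(really_needed_size, align);
    if (unlikely(!buf))
            return NULL;
    /* buf is correctly aligned for the DMA engine, with no
     * PTR_ALIGN() over-allocation waste */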
-
- 06 February 2021, 1 commit
-
-
Submitted by Eric Dumazet

Commit c8079432 ("net: Fix packet reordering caused by GRO and listified RX cooperation") had the unfortunate effect of adding latencies in common workloads. Before the patch, GRO packets were immediately passed to the upper stacks. After the patch, we can accumulate quite a lot of GRO packets (depending on the NAPI budget).

My fix is to count the number of segments in napi->rx_count instead of the number of logical packets.

Fixes: c8079432 ("net: Fix packet reordering caused by GRO and listified RX cooperation")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Bisected-by: John Sperbeck <jsperbeck@google.com>
Tested-by: Jian Yang <jianyang@google.com>
Cc: Maxim Mikityanskiy <maximmi@mellanox.com>
Reviewed-by: Saeed Mahameed <saeedm@nvidia.com>
Reviewed-by: Edward Cree <ecree.xilinx@gmail.com>
Reviewed-by: Alexander Lobakin <alobakin@pm.me>
Link: https://lore.kernel.org/r/20210204213146.4192368-1-eric.dumazet@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
-
- 05 February 2021, 3 commits
-
-
Submitted by Willem de Bruijn

When iteratively computing a checksum with csum_block_add, track the offset "pos" to correctly rotate in csum_block_add when the offset is odd. The open-coded implementation of skb_copy_and_csum_datagram did this. With the switch to __skb_datagram_iter calling csum_and_copy_to_iter, pos was reinitialized to 0 on each call. Bring back the pos by passing it along with the csum to the callback.

Changes v1->v2:
- pass csum value, instead of csump pointer (Alexander Duyck)

Link: https://lore.kernel.org/netdev/20210128152353.GB27281@optiplex/
Fixes: 950fcaec ("datagram: consolidate datagram copy to iter helpers")
Reported-by: Oliver Graute <oliver.graute@gmail.com>
Signed-off-by: Willem de Bruijn <willemb@google.com>
Reviewed-by: Alexander Duyck <alexanderduyck@fb.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://lore.kernel.org/r/20210203192952.1849843-1-willemdebruijn.kernel@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
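Illustration (an assumed helper, not the patch itself, showing why the running offset must survive across chunks: csum_block_add() byte-rotates a block's checksum when the block starts at an odd offset):

    static __wsum csum_chunks(const struct kvec *vec, int n)
    {
            __wsum csum = 0;
            size_t pos = 0;  /* running offset across all chunks */
            int i;

            for (i = 0; i < n; i++) {
                    __wsum csum2 = csum_partial(vec[i].iov_base,
                                                vec[i].iov_len, 0);

                    /* rotates csum2 when pos is odd */
                    csum = csum_block_add(csum, csum2, pos);
                    pos += vec[i].iov_len;
            }
            return csum;
    }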
-
Submitted by Leon Romanovsky

Fix the following compilation warnings:

 1031 | INDIRECT_CALLABLE_SCOPE void udp_v6_early_demux(struct sk_buff *skb)

net/ipv6/ip6_offload.c:182:41: warning: no previous prototype for ‘ipv6_gro_receive’ [-Wmissing-prototypes]
  182 | INDIRECT_CALLABLE_SCOPE struct sk_buff *ipv6_gro_receive(struct list_head *head,
      |                                         ^~~~~~~~~~~~~~~~
net/ipv6/ip6_offload.c:320:29: warning: no previous prototype for ‘ipv6_gro_complete’ [-Wmissing-prototypes]
  320 | INDIRECT_CALLABLE_SCOPE int ipv6_gro_complete(struct sk_buff *skb, int nhoff)
      |                             ^~~~~~~~~~~~~~~~~

Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
-
Submitted by Alexander Lobakin

pool_page_reusable() is a leftover from pre-NUMA-aware times. For now, this function is just a redundant wrapper over page_is_pfmemalloc(), so inline it into its sole call site.

Signed-off-by: Alexander Lobakin <alobakin@pm.me>
Acked-by: Jesper Dangaard Brouer <brouer@redhat.com>
Reviewed-by: Ilias Apalodimas <ilias.apalodimas@linaro.org>
Reviewed-by: Jesse Brandeburg <jesse.brandeburg@intel.com>
Acked-by: David Rientjes <rientjes@google.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
-
- 04 February 2021, 2 commits
-
-
Submitted by Lorenzo Bianconi

Split the ndo_xdp_xmit and ndo_start_xmit use cases in the veth_xdp_rcv routine in order to alloc skbs in bulk for the XDP_PASS verdict. Introduce the xdp_alloc_skb_bulk utility routine to alloc an skb bulk list. The proposed approach has been tested in the following scenario:

eth (ixgbe) --> XDP_REDIRECT --> veth0 --> (remote-ns) veth1 --> XDP_PASS

XDP_REDIRECT: xdp_redirect_map bpf sample
XDP_PASS: xdp_rxq_info bpf sample
traffic generator: pkt_gen sending udp traffic on a remote device

bpf-next master: ~3.64Mpps
bpf-next + skb bulking allocation: ~3.79Mpps

Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Toshiaki Makita <toshiaki.makita1@gmail.com>
Acked-by: Jesper Dangaard Brouer <brouer@redhat.com>
Link: https://lore.kernel.org/bpf/a14a30d3c06fff24e13f836c733d80efc0bd6eb5.1611957532.git.lorenzo@kernel.org
-
Submitted by Brian Vazquez

This patch avoids the indirect call for the common cases: ip6_dst_check and ipv4_dst_check.

Signed-off-by: Brian Vazquez <brianvv@google.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
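Illustration (the standard retpoline-avoidance idiom this presumably applies; exact call sites may differ):

    INDIRECT_CALLABLE_DECLARE(struct dst_entry *ip6_dst_check(struct dst_entry *,
                                                              u32));
    INDIRECT_CALLABLE_DECLARE(struct dst_entry *ipv4_dst_check(struct dst_entry *,
                                                               u32));

    /* instead of a plain indirect call through dst->ops->check() */
    dst = INDIRECT_CALL_INET(dst->ops->check, ip6_dst_check, ipv4_dst_check,
                             dst, cookie);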
-
- 03 February 2021, 1 commit
-
-
Submitted by Marco Elver

Avoid the assumption that ksize(kmalloc(S)) == ksize(kmalloc(S)): when cloning an skb, save and restore truesize after pskb_expand_head(). This can occur if the allocator decides to service an allocation of the same size differently (e.g. use a different size class, or pass the allocation on to KFENCE). Because truesize is used for bookkeeping (such as sk_wmem_queued), a modified truesize of a cloned skb may result in corrupt bookkeeping and related warnings (such as in sk_stream_kill_queues()).

Link: https://lkml.kernel.org/r/X9JR/J6dMMOy1obu@elver.google.com
Reported-by: syzbot+7b99aafdcc2eedea6178@syzkaller.appspotmail.com
Suggested-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Marco Elver <elver@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Link: https://lore.kernel.org/r/20210201160420.2826895-1-elver@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
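Illustration (the core of the idea as a sketch; placement within the skb code paths is per the commit description):

    /* reallocating the head of a cloned skb must not let a differently
     * serviced allocation change the bookkeeping value */
    unsigned int saved_truesize = skb->truesize;
    int ret;

    ret = pskb_expand_head(skb, 0, 0, GFP_ATOMIC);
    skb->truesize = saved_truesize;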
-
- 31 January 2021, 1 commit
-
-
Submitted by Chinmay Agarwal

The following race condition was detected:

<CPU A, t0> - neigh_flush_dev() is under execution and calls neigh_mark_dead(n), marking the neighbour entry 'n' as dead.
<CPU B, t1> - Executing: __netif_receive_skb() -> __netif_receive_skb_core() -> arp_rcv() -> arp_process(). arp_process() calls __neigh_lookup(), which takes a reference on neighbour entry 'n'.
<CPU A, t2> - Moves further along neigh_flush_dev() and calls neigh_cleanup_and_release(n), but since the reference count was increased in t1, 'n' couldn't be destroyed.
<CPU B, t3> - Moves further along arp_process() and calls neigh_update() -> __neigh_update() -> neigh_update_gc_list(), which adds the neighbour entry back to the gc_list (neigh_mark_dead() removed it from the gc_list earlier, in t0).
<CPU B, t4> - arp_process() finally calls neigh_release(n), destroying the neighbour entry.

This leads to 'n' still being part of the gc_list while the actual neighbour structure has been freed. The situation can be prevented from happening if we disallow a dead entry from having any possibility of updating the gc_list. That is what this patch intends to achieve.

Fixes: 9c29a2f5 ("neighbor: Fix locking order for gc_list changes")
Signed-off-by: Chinmay Agarwal <chinagar@codeaurora.org>
Reviewed-by: Cong Wang <xiyou.wangcong@gmail.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Link: https://lore.kernel.org/r/20210127165453.GA20514@chinagar-linux.qualcomm.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
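Illustration (a hedged sketch of the guard; locking follows the neighbour code conventions, and the actual hunk may differ):

    static void neigh_update_gc_list(struct neighbour *n)
    {
            write_lock_bh(&n->tbl->lock);
            write_lock(&n->lock);

            /* a dead entry must never be re-linked onto the gc_list */
            if (n->dead)
                    goto out;

            /* ... existing add/remove logic ... */
    out:
            write_unlock(&n->lock);
            write_unlock_bh(&n->tbl->lock);
    }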
-
- 30 January 2021, 1 commit
-
-
Submitted by Kuniyuki Iwashima

Commit 41b14fb8 ("net: Do not clear the sock TX queue in sk_set_socket()") removes sk_tx_queue_clear() from sk_set_socket() and adds it instead in sk_alloc() and sk_clone_lock() to fix an issue introduced in commit e022f0b4 ("net: Introduce sk_tx_queue_mapping"). On the other hand, the original commit had already put sk_tx_queue_clear() in sk_prot_alloc(): the callee of sk_alloc() and sk_clone_lock(). Thus sk_tx_queue_clear() is called twice in each path.

If we remove sk_tx_queue_clear() from sk_alloc() and sk_clone_lock(), it currently works well because (i) sk_tx_queue_mapping is defined between sk_dontcopy_begin and sk_dontcopy_end, and (ii) sock_copy(), called after sk_prot_alloc() in sk_clone_lock(), does not overwrite sk_tx_queue_mapping. However, if we move sk_tx_queue_mapping out of the no-copy area, it introduces a bug unintentionally.

Therefore, this patch adds a compile-time check to take care of the order of sock_copy() and sk_tx_queue_clear(), and removes sk_tx_queue_clear() from sk_prot_alloc() so that it does only the allocation and its callers initialize the fields.

CC: Boris Pismenny <borisp@mellanox.com>
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.co.jp>
Acked-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://lore.kernel.org/r/20210128150217.6060-1-kuniyu@amazon.co.jp
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
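Illustration (a plausible form of the compile-time check; field names per struct sock, not necessarily the verbatim hunk):

    /* sock_copy() skips [sk_dontcopy_begin, sk_dontcopy_end); pin
     * sk_tx_queue_mapping inside that window so the value cleared by
     * sk_tx_queue_clear() can never be overwritten by sock_copy() */
    BUILD_BUG_ON(offsetof(struct sock, sk_tx_queue_mapping) <
                 offsetof(struct sock, sk_dontcopy_begin) ||
                 offsetof(struct sock, sk_tx_queue_mapping) >=
                 offsetof(struct sock, sk_dontcopy_end));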
-