提交 · 8fd813441e78f1fc7918f55dba6553943eaf1085 · openeuler / Kernel

30 4月, 2022 1 次提交

net: inline sock_alloc_send_skb · de32bc6a

由 Pavel Begunkov 提交于 4月 28, 2022

sock_alloc_send_skb() is simple and just proxying to another function,
so we can inline it and cut associated overhead.
Signed-off-by: NPavel Begunkov <asml.silence@gmail.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

de32bc6a

29 4月, 2022 1 次提交

net: SO_RCVMARK socket option for SO_MARK with recvmsg() · 6fd1d51c

由 Erin MacNeil 提交于 4月 27, 2022

Adding a new socket option, SO_RCVMARK, to indicate that SO_MARK
should be included in the ancillary data returned by recvmsg().

Renamed the sock_recv_ts_and_drops() function to sock_recv_cmsgs().
Signed-off-by: NErin MacNeil <lnx.erin@gmail.com>
Reviewed-by: NEric Dumazet <edumazet@google.com>
Reviewed-by: NDavid Ahern <dsahern@kernel.org>
Acked-by: NMarc Kleine-Budde <mkl@pengutronix.de>
Link: https://lore.kernel.org/r/20220427200259.2564-1-lnx.erin@gmail.comSigned-off-by: NJakub Kicinski <kuba@kernel.org>

6fd1d51c

27 4月, 2022 1 次提交

net: generalize skb freeing deferral to per-cpu lists · 68822bdf

由 Eric Dumazet 提交于 4月 22, 2022

Logic added in commit f35f8219 ("tcp: defer skb freeing after socket
lock is released") helped bulk TCP flows to move the cost of skbs
frees outside of critical section where socket lock was held.

But for RPC traffic, or hosts with RFS enabled, the solution is far from
being ideal.

For RPC traffic, recvmsg() has to return to user space right after
skb payload has been consumed, meaning that BH handler has no chance
to pick the skb before recvmsg() thread. This issue is more visible
with BIG TCP, as more RPC fit one skb.

For RFS, even if BH handler picks the skbs, they are still picked
from the cpu on which user thread is running.

Ideally, it is better to free the skbs (and associated page frags)
on the cpu that originally allocated them.

This patch removes the per socket anchor (sk->defer_list) and
instead uses a per-cpu list, which will hold more skbs per round.

This new per-cpu list is drained at the end of net_action_rx(),
after incoming packets have been processed, to lower latencies.

In normal conditions, skbs are added to the per-cpu list with
no further action. In the (unlikely) cases where the cpu does not
run net_action_rx() handler fast enough, we use an IPI to raise
NET_RX_SOFTIRQ on the remote cpu.

Also, we do not bother draining the per-cpu list from dev_cpu_dead()
This is because skbs in this list have no requirement on how fast
they should be freed.

Note that we can add in the future a small per-cpu cache
if we see any contention on sd->defer_lock.

Tested on a pair of hosts with 100Gbit NIC, RFS enabled,
and /proc/sys/net/ipv4/tcp_rmem[2] tuned to 16MB to work around
page recycling strategy used by NIC driver (its page pool capacity
being too small compared to number of skbs/pages held in sockets
receive queues)

Note that this tuning was only done to demonstrate worse
conditions for skb freeing for this particular test.
These conditions can happen in more general production workload.

10 runs of one TCP_STREAM flow

Before:
Average throughput: 49685 Mbit.

Kernel profiles on cpu running user thread recvmsg() show high cost for
skb freeing related functions (*)

    57.81%  [kernel]       [k] copy_user_enhanced_fast_string
(*) 12.87%  [kernel]       [k] skb_release_data
(*)  4.25%  [kernel]       [k] __free_one_page
(*)  3.57%  [kernel]       [k] __list_del_entry_valid
     1.85%  [kernel]       [k] __netif_receive_skb_core
     1.60%  [kernel]       [k] __skb_datagram_iter
(*)  1.59%  [kernel]       [k] free_unref_page_commit
(*)  1.16%  [kernel]       [k] __slab_free
     1.16%  [kernel]       [k] _copy_to_iter
(*)  1.01%  [kernel]       [k] kfree
(*)  0.88%  [kernel]       [k] free_unref_page
     0.57%  [kernel]       [k] ip6_rcv_core
     0.55%  [kernel]       [k] ip6t_do_table
     0.54%  [kernel]       [k] flush_smp_call_function_queue
(*)  0.54%  [kernel]       [k] free_pcppages_bulk
     0.51%  [kernel]       [k] llist_reverse_order
     0.38%  [kernel]       [k] process_backlog
(*)  0.38%  [kernel]       [k] free_pcp_prepare
     0.37%  [kernel]       [k] tcp_recvmsg_locked
(*)  0.37%  [kernel]       [k] __list_add_valid
     0.34%  [kernel]       [k] sock_rfree
     0.34%  [kernel]       [k] _raw_spin_lock_irq
(*)  0.33%  [kernel]       [k] __page_cache_release
     0.33%  [kernel]       [k] tcp_v6_rcv
(*)  0.33%  [kernel]       [k] __put_page
(*)  0.29%  [kernel]       [k] __mod_zone_page_state
     0.27%  [kernel]       [k] _raw_spin_lock

After patch:
Average throughput: 73076 Mbit.

Kernel profiles on cpu running user thread recvmsg() looks better:

    81.35%  [kernel]       [k] copy_user_enhanced_fast_string
     1.95%  [kernel]       [k] _copy_to_iter
     1.95%  [kernel]       [k] __skb_datagram_iter
     1.27%  [kernel]       [k] __netif_receive_skb_core
     1.03%  [kernel]       [k] ip6t_do_table
     0.60%  [kernel]       [k] sock_rfree
     0.50%  [kernel]       [k] tcp_v6_rcv
     0.47%  [kernel]       [k] ip6_rcv_core
     0.45%  [kernel]       [k] read_tsc
     0.44%  [kernel]       [k] _raw_spin_lock_irqsave
     0.37%  [kernel]       [k] _raw_spin_lock
     0.37%  [kernel]       [k] native_irq_return_iret
     0.33%  [kernel]       [k] __inet6_lookup_established
     0.31%  [kernel]       [k] ip6_protocol_deliver_rcu
     0.29%  [kernel]       [k] tcp_rcv_established
     0.29%  [kernel]       [k] llist_reverse_order

v2: kdoc issue (kernel bots)
    do not defer if (alloc_cpu == smp_processor_id()) (Paolo)
    replace the sk_buff_head with a single-linked list (Jakub)
    add a READ_ONCE()/WRITE_ONCE() for the lockless read of sd->defer_list
Signed-off-by: NEric Dumazet <edumazet@google.com>
Acked-by: NPaolo Abeni <pabeni@redhat.com>
Link: https://lore.kernel.org/r/20220422201237.416238-1-eric.dumazet@gmail.comSigned-off-by: NJakub Kicinski <kuba@kernel.org>

68822bdf

12 4月, 2022 1 次提交

net: remove noblock parameter from recvmsg() entities · ec095263

由 Oliver Hartkopp 提交于 4月 11, 2022

The internal recvmsg() functions have two parameters 'flags' and 'noblock'
that were merged inside skb_recv_datagram(). As a follow up patch to commit
f4b41f06 ("net: remove noblock parameter from skb_recv_datagram()")
this patch removes the separate 'noblock' parameter for recvmsg().

Analogue to the referenced patch for skb_recv_datagram() the 'flags' and
'noblock' parameters are unnecessarily split up with e.g.

err = sk->sk_prot->recvmsg(sk, msg, size, flags & MSG_DONTWAIT,
                           flags & ~MSG_DONTWAIT, &addr_len);

or in

err = INDIRECT_CALL_2(sk->sk_prot->recvmsg, tcp_recvmsg, udp_recvmsg,
                      sk, msg, size, flags & MSG_DONTWAIT,
                      flags & ~MSG_DONTWAIT, &addr_len);

instead of simply using only flags all the time and check for MSG_DONTWAIT
where needed (to preserve for the formerly separated no(n)block condition).
Signed-off-by: NOliver Hartkopp <socketcan@hartkopp.net>
Link: https://lore.kernel.org/r/20220411124955.154876-1-socketcan@hartkopp.netSigned-off-by: NPaolo Abeni <pabeni@redhat.com>

ec095263

11 4月, 2022 1 次提交

net: sock: introduce sock_queue_rcv_skb_reason() · c1b8a567

由 Menglong Dong 提交于 4月 07, 2022

In order to report the reasons of skb drops in 'sock_queue_rcv_skb()',
introduce the function 'sock_queue_rcv_skb_reason()'.

As the return value of 'sock_queue_rcv_skb()' is used as the error code,
we can't make it as drop reason and have to pass extra output argument.
'sock_queue_rcv_skb()' is used in many places, so we can't change it
directly.

Introduce the new function 'sock_queue_rcv_skb_reason()' and make
'sock_queue_rcv_skb()' an inline call to it.
Reviewed-by: NHao Peng <flyingpeng@tencent.com>
Reviewed-by: NJiang Biao <benbjiang@tencent.com>
Signed-off-by: NMenglong Dong <imagedong@tencent.com>
Reviewed-by: NDavid Ahern <dsahern@kernel.org>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

c1b8a567

08 4月, 2022 1 次提交

net: extract a few internals from netdevice.h · 6264f58c

由 Jakub Kicinski 提交于 4月 06, 2022

There's a number of functions and static variables used
under net/core/ but not from the outside. We currently
dump most of them into netdevice.h. That bad for many
reasons:
 - netdevice.h is very cluttered, hard to figure out
   what the APIs are;
 - netdevice.h is very long;
 - we have to touch netdevice.h more which causes expensive
   incremental builds.

Create a header under net/core/ and move some declarations.

The new header is also a bit of a catch-all but that's
fine, if we create more specific headers people will
likely over-think where their declaration fit best.
And end up putting them in netdevice.h, again.

More work should be done on splitting netdevice.h into more
targeted headers, but that'd be more time consuming so small
steps.
Signed-off-by: NJakub Kicinski <kuba@kernel.org>

6264f58c

09 3月, 2022 1 次提交

SO_ZEROCOPY should return -EOPNOTSUPP rather than -ENOTSUPP · 869420a8

由 Samuel Thibault 提交于 3月 07, 2022

ENOTSUPP is documented as "should never be seen by user programs",
and thus not exposed in <errno.h>, and thus applications cannot safely
check against it (they get "Unknown error 524" as strerror). We should
rather return the well-known -EOPNOTSUPP.

This is similar to 2230a7ef ("drop_monitor: Use correct error
code") and 4a5cdc60 ("net/tls: Fix return values to avoid
ENOTSUPP"), which did not seem to cause problems.
Signed-off-by: NSamuel Thibault <samuel.thibault@labri.fr>
Acked-by: NWillem de Bruijn <willemb@google.com>
Link: https://lore.kernel.org/r/20220307223126.djzvg44v2o2jkjsx@beginSigned-off-by: NJakub Kicinski <kuba@kernel.org>

869420a8

18 2月, 2022 2 次提交

net-timestamp: convert sk->sk_tskey to atomic_t · a1cdec57

由 Eric Dumazet 提交于 2月 17, 2022

UDP sendmsg() can be lockless, this is causing all kinds
of data races.

This patch converts sk->sk_tskey to remove one of these races.

BUG: KCSAN: data-race in __ip_append_data / __ip_append_data

read to 0xffff8881035d4b6c of 4 bytes by task 8877 on cpu 1:
 __ip_append_data+0x1c1/0x1de0 net/ipv4/ip_output.c:994
 ip_make_skb+0x13f/0x2d0 net/ipv4/ip_output.c:1636
 udp_sendmsg+0x12bd/0x14c0 net/ipv4/udp.c:1249
 inet_sendmsg+0x5f/0x80 net/ipv4/af_inet.c:819
 sock_sendmsg_nosec net/socket.c:705 [inline]
 sock_sendmsg net/socket.c:725 [inline]
 ____sys_sendmsg+0x39a/0x510 net/socket.c:2413
 ___sys_sendmsg net/socket.c:2467 [inline]
 __sys_sendmmsg+0x267/0x4c0 net/socket.c:2553
 __do_sys_sendmmsg net/socket.c:2582 [inline]
 __se_sys_sendmmsg net/socket.c:2579 [inline]
 __x64_sys_sendmmsg+0x53/0x60 net/socket.c:2579
 do_syscall_x64 arch/x86/entry/common.c:50 [inline]
 do_syscall_64+0x44/0xd0 arch/x86/entry/common.c:80
 entry_SYSCALL_64_after_hwframe+0x44/0xae

write to 0xffff8881035d4b6c of 4 bytes by task 8880 on cpu 0:
 __ip_append_data+0x1d8/0x1de0 net/ipv4/ip_output.c:994
 ip_make_skb+0x13f/0x2d0 net/ipv4/ip_output.c:1636
 udp_sendmsg+0x12bd/0x14c0 net/ipv4/udp.c:1249
 inet_sendmsg+0x5f/0x80 net/ipv4/af_inet.c:819
 sock_sendmsg_nosec net/socket.c:705 [inline]
 sock_sendmsg net/socket.c:725 [inline]
 ____sys_sendmsg+0x39a/0x510 net/socket.c:2413
 ___sys_sendmsg net/socket.c:2467 [inline]
 __sys_sendmmsg+0x267/0x4c0 net/socket.c:2553
 __do_sys_sendmmsg net/socket.c:2582 [inline]
 __se_sys_sendmmsg net/socket.c:2579 [inline]
 __x64_sys_sendmmsg+0x53/0x60 net/socket.c:2579
 do_syscall_x64 arch/x86/entry/common.c:50 [inline]
 do_syscall_64+0x44/0xd0 arch/x86/entry/common.c:80
 entry_SYSCALL_64_after_hwframe+0x44/0xae

value changed: 0x0000054d -> 0x0000054e

Reported by Kernel Concurrency Sanitizer on:
CPU: 0 PID: 8880 Comm: syz-executor.5 Not tainted 5.17.0-rc2-syzkaller-00167-gdcb85f85-dirty #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011

Fixes: 09c2d251 ("net-timestamp: add key to disambiguate concurrent datagrams")
Signed-off-by: NEric Dumazet <edumazet@google.com>
Cc: Willem de Bruijn <willemb@google.com>
Reported-by: Nsyzbot <syzkaller@googlegroups.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

a1cdec57

net: add sanity check in proto_register() · f20cfd66

由 Eric Dumazet 提交于 2月 16, 2022

prot->memory_allocated should only be set if prot->sysctl_mem
is also set.

This is a followup of commit 25206111 ("crypto: af_alg - get
rid of alg_memory_allocated").
Signed-off-by: NEric Dumazet <edumazet@google.com>
Link: https://lore.kernel.org/r/20220216171801.3604366-1-eric.dumazet@gmail.comSigned-off-by: NJakub Kicinski <kuba@kernel.org>

f20cfd66

02 2月, 2022 1 次提交

net: allow SO_MARK with CAP_NET_RAW via cmsg · 91f0d8a4

由 Jakub Kicinski 提交于 1月 31, 2022

There's not reason SO_MARK would be allowed via setsockopt()
and not via cmsg, let's keep the two consistent. See
commit 079925cc ("net: allow SO_MARK with CAP_NET_RAW")
for justification why NET_RAW -> SO_MARK is safe.
Reviewed-by: NMaciej Żenczykowski <maze@google.com>
Reviewed-by: NDavid Ahern <dsahern@kernel.org>
Link: https://lore.kernel.org/r/20220131233357.52964-1-kuba@kernel.orgSigned-off-by: NJakub Kicinski <kuba@kernel.org>

91f0d8a4

31 1月, 2022 2 次提交

tcp: Change SYN ACK retransmit behaviour to account for rehash · cb6cd2ce

由 Akhmat Karakotov 提交于 1月 31, 2022

Disabling rehash behavior did not affect SYN ACK retransmits because hash
was forcefully changed bypassing the sk_rethink_hash function. This patch
adds a condition which checks for rehash mode before resetting hash.
Signed-off-by: NAkhmat Karakotov <hmukos@yandex-team.ru>
Reviewed-by: NEric Dumazet <edumazet@google.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

cb6cd2ce

txhash: Add socket option to control TX hash rethink behavior · 26859240

由 Akhmat Karakotov 提交于 1月 31, 2022

Add the SO_TXREHASH socket option to control hash rethink behavior per socket.
When default mode is set, sockets disable rehash at initialization and use
sysctl option when entering listen state. setsockopt() overrides default
behavior.
Signed-off-by: NAkhmat Karakotov <hmukos@yandex-team.ru>
Reviewed-by: NEric Dumazet <edumazet@google.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

26859240

26 1月, 2022 1 次提交

net: Adjust sk_gso_max_size once when set · ab14f180

由 David Ahern 提交于 1月 24, 2022

sk_gso_max_size is set based on the dst dev. Both users of it
adjust the value by the same offset - (MAX_TCP_HEADER + 1). Rather
than compute the same adjusted value on each call do the adjustment
once when set.
Signed-off-by: NDavid Ahern <dsahern@kernel.org>
Reviewed-by: NEric Dumazet <edumazet@google.com>
Link: https://lore.kernel.org/r/20220125024511.27480-1-dsahern@kernel.orgSigned-off-by: NJakub Kicinski <kuba@kernel.org>

ab14f180

17 1月, 2022 1 次提交

net: Flush deferred skb free on socket destroy · 79074a72

由 Gal Pressman 提交于 1月 17, 2022

The cited Fixes patch moved to a deferred skb approach where the skbs
are not freed immediately under the socket lock.  Add a WARN_ON_ONCE()
to verify the deferred list is empty on socket destroy, and empty it to
prevent potential memory leaks.

Fixes: f35f8219 ("tcp: defer skb freeing after socket lock is released")
Signed-off-by: NGal Pressman <gal@nvidia.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

79074a72

12 1月, 2022 1 次提交

net: fix sock_timestamping_bind_phc() to release device · 2a4d75bf

由 Miroslav Lichvar 提交于 1月 11, 2022

Don't forget to release the device in sock_timestamping_bind_phc() after
it was used to get the vclock indices.

Fixes: d463126e ("net: sock: extend SO_TIMESTAMPING for PHC binding")
Signed-off-by: NMiroslav Lichvar <mlichvar@redhat.com>
Cc: Yangbo Lu <yangbo.lu@nxp.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

2a4d75bf

11 12月, 2021 1 次提交

sock: Use sock_owned_by_user_nocheck() instead of sk_lock.owned. · 33d60fbd

由 Kuniyuki Iwashima 提交于 12月 08, 2021

This patch moves sock_release_ownership() down in include/net/sock.h and
replaces some sk_lock.owned tests with sock_owned_by_user_nocheck().
Signed-off-by: NKuniyuki Iwashima <kuniyu@amazon.co.jp>
Link: https://lore.kernel.org/r/20211208062158.54132-1-kuniyu@amazon.co.jpSigned-off-by: NJakub Kicinski <kuba@kernel.org>

33d60fbd

10 12月, 2021 1 次提交

net: add netns refcount tracker to struct sock · ffa84b5f

由 Eric Dumazet 提交于 12月 09, 2021

Signed-off-by: NEric Dumazet <edumazet@google.com>
Signed-off-by: NJakub Kicinski <kuba@kernel.org>

ffa84b5f

25 11月, 2021 2 次提交

net: allow SO_MARK with CAP_NET_RAW · 079925cc

由 Maciej Żenczykowski 提交于 11月 23, 2021

A CAP_NET_RAW capable process can already spoof (on transmit) anything
it desires via raw packet sockets...  There is no good reason to not
allow it to also be able to play routing tricks on packets from its
own normal sockets.

There is a desire to be able to use SO_MARK for routing table selection
(via ip rule fwmark) from within a user process without having to run
it as root.  Granting it CAP_NET_RAW is much less dangerous than
CAP_NET_ADMIN (CAP_NET_RAW doesn't permit persistent state change,
while CAP_NET_ADMIN does - by for example allowing the reconfiguration
of the routing tables and/or bringing up/down devices).

Let's keep CAP_NET_ADMIN for persistent state changes,
while using CAP_NET_RAW for non-configuration related stuff.
Signed-off-by: NMaciej Żenczykowski <maze@google.com>
Link: https://lore.kernel.org/r/20211123203715.193413-1-zenczykowski@gmail.comSigned-off-by: NJakub Kicinski <kuba@kernel.org>

079925cc

net: allow CAP_NET_RAW to setsockopt SO_PRIORITY · a1b519b7

由 Maciej Żenczykowski 提交于 11月 23, 2021

CAP_NET_ADMIN is and should continue to be about configuring the
system as a whole, not about configuring per-socket or per-packet
parameters.
Sending and receiving raw packets is what CAP_NET_RAW is all about.

It can already send packets with any VLAN tag, and any IPv4 TOS
mark, and any IPv6 TCLASS mark, simply by virtue of building
such a raw packet.  Not to mention using any protocol and source/
/destination ip address/port tuple.

These are the fields that networking gear uses to prioritize packets.

Hence, a CAP_NET_RAW process is already capable of affecting traffic
prioritization after it hits the wire.  This change makes it capable
of affecting traffic prioritization even in the host at the nic and
before that in the queueing disciplines (provided skb->priority is
actually being used for prioritization, and not the TOS/TCLASS field)

Hence it makes sense to allow a CAP_NET_RAW process to set the
priority of sockets and thus packets it sends.
Signed-off-by: NMaciej Żenczykowski <maze@google.com>
Link: https://lore.kernel.org/r/20211123203702.193221-1-zenczykowski@gmail.comSigned-off-by: NJakub Kicinski <kuba@kernel.org>

a1b519b7

22 11月, 2021 2 次提交

net: annotate accesses to dev->gso_max_segs · 6d872df3

由 Eric Dumazet 提交于 11月 19, 2021

dev->gso_max_segs is written under RTNL protection, or when the device is
not yet visible, but is read locklessly.

Add netif_set_gso_max_segs() helper.

Add the READ_ONCE()/WRITE_ONCE() pairs, and use netif_set_gso_max_segs()
where we can to better document what is going on.
Signed-off-by: NEric Dumazet <edumazet@google.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

6d872df3

net: annotate accesses to dev->gso_max_size · 4b66d216

由 Eric Dumazet 提交于 11月 19, 2021

dev->gso_max_size is written under RTNL protection, or when the device is
not yet visible, but is read locklessly.

Add the READ_ONCE()/WRITE_ONCE() pairs, and use netif_set_gso_max_size()
where we can to better document what is going on.
Signed-off-by: NEric Dumazet <edumazet@google.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

4b66d216

16 11月, 2021 7 次提交

net: merge net->core.prot_inuse and net->core.sock_inuse · 4199bae1