1. 18 July 2023 (1 commit)
  2. 22 June 2023 (1 commit)
    • revert "net: align SO_RCVMARK required privileges with SO_MARK" · a9628e88
      Committed by Maciej Żenczykowski
      This reverts commit 1f86123b ("net: align SO_RCVMARK required
      privileges with SO_MARK") because the reasoning in the commit message
      is not really correct:
        SO_RCVMARK is used for 'reading' incoming skb mark (via cmsg), as such
        it is more equivalent to 'getsockopt(SO_MARK)' which has no priv check
        and retrieves the socket mark, rather than 'setsockopt(SO_MARK) which
        sets the socket mark and does require privs.
      
        Additionally incoming skb->mark may already be visible if
        sysctl_fwmark_reflect and/or sysctl_tcp_fwmark_accept are enabled.
      
        Furthermore, it is easier to block the getsockopt via bpf
        (either cgroup setsockopt hook, or via syscall filters)
        than to unblock it if it requires CAP_NET_RAW/ADMIN.
      
      On Android the socket mark is (among other things) used to store
      the network identifier a socket is bound to.  Setting it is privileged,
      but retrieving it is not.  We'd like unprivileged userspace to be able
      to read the network id of incoming packets (where mark is set via
      iptables [to be moved to bpf])...
      
      An alternative would be to add another sysctl to control whether
      setting SO_RCVMARK is privileged or not.
      (or even a MASK of which bits in the mark can be exposed)
      But this seems like over-engineering...
      
      Note: This is a non-trivial revert, due to later merged commit e42c7bee
      ("bpf: net: Consider has_current_bpf_ctx() when testing capable() in sk_setsockopt()")
      which changed both 'ns_capable' into 'sockopt_ns_capable' calls.
      
      Fixes: 1f86123b ("net: align SO_RCVMARK required privileges with SO_MARK")
      Cc: Larysa Zaremba <larysa.zaremba@intel.com>
      Cc: Simon Horman <simon.horman@corigine.com>
      Cc: Paolo Abeni <pabeni@redhat.com>
      Cc: Eyal Birger <eyal.birger@gmail.com>
      Cc: Jakub Kicinski <kuba@kernel.org>
      Cc: Eric Dumazet <edumazet@google.com>
      Cc: Patrick Rohr <prohr@google.com>
      Signed-off-by: Maciej Żenczykowski <maze@google.com>
      Reviewed-by: Simon Horman <simon.horman@corigine.com>
      Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
      Link: https://lore.kernel.org/r/20230618103130.51628-1-maze@google.com
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
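      The behaviour restored by this revert can be exercised from plain,
      unprivileged userspace. A minimal sketch (not part of the commit):
      enable SO_RCVMARK on a UDP socket bound to an arbitrary port (5555
      here) and read the incoming skb mark, which the kernel delivers as an
      SO_MARK control message. The fallback defines carry the asm-generic
      uapi values and are only needed with older libc headers.

        #include <stdio.h>
        #include <stdint.h>
        #include <string.h>
        #include <sys/socket.h>
        #include <netinet/in.h>

        #ifndef SO_MARK
        #define SO_MARK 36          /* asm-generic uapi value */
        #endif
        #ifndef SO_RCVMARK
        #define SO_RCVMARK 75       /* asm-generic uapi value */
        #endif

        int main(void)
        {
                int one = 1;
                int fd = socket(AF_INET, SOCK_DGRAM, 0);
                struct sockaddr_in addr = {
                        .sin_family = AF_INET,
                        .sin_port = htons(5555),
                        .sin_addr.s_addr = htonl(INADDR_ANY),
                };
                union {                          /* aligned cmsg buffer */
                        char buf[CMSG_SPACE(sizeof(uint32_t))];
                        struct cmsghdr align;
                } control;
                char buf[2048];
                struct iovec iov = { .iov_base = buf, .iov_len = sizeof(buf) };
                struct msghdr msg = {
                        .msg_iov = &iov, .msg_iovlen = 1,
                        .msg_control = control.buf,
                        .msg_controllen = sizeof(control.buf),
                };
                struct cmsghdr *cmsg;
                uint32_t mark;

                if (fd < 0 || bind(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0)
                        return 1;
                /* With the revert, no CAP_NET_RAW/CAP_NET_ADMIN is needed here. */
                if (setsockopt(fd, SOL_SOCKET, SO_RCVMARK, &one, sizeof(one)) < 0)
                        return 1;
                if (recvmsg(fd, &msg, 0) < 0)
                        return 1;
                /* The mark arrives as a SOL_SOCKET/SO_MARK cmsg. */
                for (cmsg = CMSG_FIRSTHDR(&msg); cmsg; cmsg = CMSG_NXTHDR(&msg, cmsg)) {
                        if (cmsg->cmsg_level == SOL_SOCKET &&
                            cmsg->cmsg_type == SO_MARK) {
                                memcpy(&mark, CMSG_DATA(cmsg), sizeof(mark));
                                printf("incoming skb mark: %u\n", mark);
                        }
                }
                return 0;
        }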
  3. 31 May 2023 (1 commit)
    • udp6: Fix race condition in udp6_sendmsg & connect · 448a5ce1
      Committed by Vladislav Efanov
      Syzkaller got the following report:
      BUG: KASAN: use-after-free in sk_setup_caps+0x621/0x690 net/core/sock.c:2018
      Read of size 8 at addr ffff888027f82780 by task syz-executor276/3255
      
      The function sk_setup_caps() (called via ip6_sk_dst_store_flow->
      ip6_dst_store) referenced memory that had already been freed by a
      parallel task in udpv6_sendmsg->ip6_sk_dst_lookup_flow->
      sk_dst_check.
      
                task1 (connect)              task2 (udp6_sendmsg)
              sk_setup_caps->sk_dst_set |
                                        |  sk_dst_check->
                                        |      sk_dst_set
                                        |      dst_release
              sk_setup_caps references  |
              to already freed dst_entry|
      
      The reason for this race condition is: sk_setup_caps() keeps using
      the dst after transferring the ownership to the dst cache.
      
      Found by Linux Verification Center (linuxtesting.org) with syzkaller.
      
      Fixes: 1da177e4 ("Linux-2.6.12-rc2")
      Signed-off-by: Vladislav Efanov <VEfanov@ispras.ru>
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
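      The fix implied above is an ordering change: finish every use of the
      dst before handing ownership to the socket's dst cache. Below is a
      toy, single-threaded C model of the two orderings; the names (dst,
      sk_dst_set, sk_setup_caps) only mirror the kernel, this is not kernel
      code.

        #include <stdio.h>
        #include <stdlib.h>

        struct dst  { int refcnt; unsigned int features; };
        struct sock { struct dst *sk_dst_cache; unsigned int sk_route_caps; };

        static void dst_release(struct dst *d)
        {
                if (--d->refcnt == 0)
                        free(d);
        }

        /* Publish dst into the socket's dst cache, dropping the previous
         * entry. After this point another CPU (sk_dst_check() in
         * udpv6_sendmsg) may swap and release the entry at any time. */
        static void sk_dst_set(struct sock *sk, struct dst *d)
        {
                struct dst *old = sk->sk_dst_cache;

                sk->sk_dst_cache = d;
                if (old)
                        dst_release(old);
        }

        /* Pre-fix ordering: ownership is transferred first, then *d is
         * still read. With a concurrent dst_release() that read is the
         * use-after-free KASAN reported. */
        static void sk_setup_caps_buggy(struct sock *sk, struct dst *d)
        {
                sk_dst_set(sk, d);
                sk->sk_route_caps = d->features;   /* racy read after handover */
        }

        /* Fixed ordering: finish every use of *d, transfer ownership last. */
        static void sk_setup_caps_fixed(struct sock *sk, struct dst *d)
        {
                sk->sk_route_caps = d->features;
                sk_dst_set(sk, d);
        }

        int main(void)
        {
                struct sock sk = { 0 };
                struct dst *d = calloc(1, sizeof(*d));

                d->refcnt = 1;
                d->features = 0x1;
                sk_setup_caps_fixed(&sk, d);
                (void)sk_setup_caps_buggy;         /* kept for comparison only */
                printf("caps: %x\n", sk.sk_route_caps);
                dst_release(sk.sk_dst_cache);
                return 0;
        }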
  4. 08 April 2023 (1 commit)
  5. 02 March 2023 (1 commit)
  6. 15 February 2023 (1 commit)
  7. 08 February 2023 (1 commit)
    • txhash: fix sk->sk_txrehash default · c11204c7
      Committed by Kevin Yang
      This fixes a bug where sk->sk_txrehash got its default enable
      value from sysctl_txrehash only when the socket was a TCP listener.

      sysctl_txrehash should set the default sk->sk_txrehash for every
      socket, no matter whether it is TCP or UDP, listener or connector.
      
      Tested by following packetdrill:
        0 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
        +0 socket(..., SOCK_DGRAM, IPPROTO_UDP) = 4
        // SO_TXREHASH == 74, default to sysctl_txrehash == 1
        +0 getsockopt(3, SOL_SOCKET, 74, [1], [4]) = 0
        +0 getsockopt(4, SOL_SOCKET, 74, [1], [4]) = 0
      
      Fixes: 26859240 ("txhash: Add socket option to control TX hash rethink behavior")
      Signed-off-by: Kevin Yang <yyd@google.com>
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
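      The packetdrill script can be mirrored with a plain C check. A small
      sketch (not from the commit): query SO_TXREHASH on a fresh TCP and a
      fresh UDP socket; with the net.core.txrehash sysctl left at its
      default of 1, both should now report 1. The fallback define uses the
      value 74 quoted in the commit message.

        #include <stdio.h>
        #include <sys/socket.h>
        #include <netinet/in.h>

        #ifndef SO_TXREHASH
        #define SO_TXREHASH 74      /* value quoted above */
        #endif

        int main(void)
        {
                int fds[2] = { socket(AF_INET, SOCK_STREAM, 0),
                               socket(AF_INET, SOCK_DGRAM, 0) };

                for (int i = 0; i < 2; i++) {
                        int val = -1;
                        socklen_t len = sizeof(val);

                        if (getsockopt(fds[i], SOL_SOCKET, SO_TXREHASH,
                                       &val, &len) == 0)
                                printf("fd %d: SO_TXREHASH = %d\n", fds[i], val);
                }
                return 0;
        }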
  8. 06 February 2023 (1 commit)
  9. 02 February 2023 (1 commit)
    • net: add support for ipv4 big tcp · b1a78b9b
      Committed by Xin Long
      Similar to Eric's IPv6 BIG TCP, this patch enables IPv4 BIG TCP.
      
      Firstly, allow sk->sk_gso_max_size to be set to a value greater than
      GSO_LEGACY_MAX_SIZE by not trimming gso_max_size in sk_trim_gso_size()
      for IPv4 TCP sockets.
      
      Then, on the TX path, set the IP header tot_len to 0 when skb->len >
      IP_MAX_MTU in __ip_local_out() to allow sending BIG TCP packets; this
      implies that skb->len is the length of an IPv4 packet. On the RX path,
      use skb->len as the length of the IPv4 packet when the IP header
      tot_len is 0 and skb->len > IP_MAX_MTU in ip_rcv_core(). As the APIs
      iph_set_totlen() and skb_ip_totlen() are used in __ip_local_out() and
      ip_rcv_core(), we only need to update these APIs.
      
      Also, in GRO receive, add the check for ETH_P_IP/IPPROTO_TCP and allow
      the merged packet size to be >= GRO_LEGACY_MAX_SIZE in
      skb_gro_receive(). In GRO complete, set the IP header tot_len to 0 in
      iph_set_totlen() when the merged packet size is greater than
      IP_MAX_MTU, so that it can be processed on the RX path.
      
      Note that checking skb_is_gso_tcp() in the iph_totlen() API makes it
      safe for this implementation to use iph->len == 0 to indicate IPv4
      BIG TCP packets.
      Signed-off-by: Xin Long <lucien.xin@gmail.com>
      Reviewed-by: David Ahern <dsahern@kernel.org>
      Reviewed-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
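      The tot_len convention described above boils down to two small
      helpers. Below is a standalone sketch that mirrors the described
      behaviour with simplified signatures; it is not the kernel's
      iph_set_totlen()/skb_ip_totlen() source.

        #include <stdio.h>
        #include <stdint.h>
        #include <arpa/inet.h>

        #define IP_MAX_MTU 0xFFFFU

        struct iphdr_totlen { uint16_t tot_len; };  /* only the field we need */

        /* TX / GRO-complete side: lengths above IP_MAX_MTU cannot be
         * represented in the 16-bit field, so store 0 and let the receiver
         * fall back to skb->len. */
        static void iph_set_totlen(struct iphdr_totlen *iph, unsigned int len)
        {
                iph->tot_len = len <= IP_MAX_MTU ? htons(len) : 0;
        }

        /* RX side: a zero tot_len on a GSO TCP skb means "use skb->len". */
        static unsigned int skb_ip_totlen(unsigned int skb_len, int is_gso_tcp,
                                          const struct iphdr_totlen *iph)
        {
                if (iph->tot_len == 0 && is_gso_tcp)
                        return skb_len;
                return ntohs(iph->tot_len);
        }

        int main(void)
        {
                struct iphdr_totlen iph;

                iph_set_totlen(&iph, 200000);           /* BIG TCP sized */
                printf("tot_len on wire: %u\n", ntohs(iph.tot_len));   /* 0 */
                printf("effective len:   %u\n", skb_ip_totlen(200000, 1, &iph));
                return 0;
        }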
  10. 23 January 2023 (1 commit)
  11. 20 December 2022 (1 commit)
    • net: Introduce sk_use_task_frag in struct sock. · fb87bd47
      Committed by Guillaume Nault
      Sockets that can be used while recursing into memory reclaim, like
      those used by network block devices and file systems, mustn't use
      current->task_frag: if the current process is already using it, then
      the inner memory reclaim call would corrupt the task_frag structure.
      
      To avoid this, sk_page_frag() uses ->sk_allocation to detect sockets
      that mustn't use current->task_frag, assuming that those used during
      memory reclaim had their allocation constraints reflected in
      ->sk_allocation.
      
      This unfortunately doesn't cover all cases: in an attempt to remove all
      usage of GFP_NOFS and GFP_NOIO, sunrpc stopped setting these flags in
      ->sk_allocation, and used memalloc_nofs critical sections instead.
      This breaks the sk_page_frag() heuristic since the allocation
      constraints are now stored in current->flags, which sk_page_frag()
      can't read without risking triggering a cache miss and slowing down
      TCP's fast path.
      
      This patch creates a new field in struct sock, named sk_use_task_frag,
      which sockets with memory reclaim constraints can set to false if they
      can't safely use current->task_frag. In such cases, sk_page_frag() now
      always returns the socket's page_frag (->sk_frag). The first user is
      sunrpc, which needs to avoid using current->task_frag but can keep
      ->sk_allocation set to GFP_KERNEL otherwise.
      
      Eventually, it might be possible to simplify sk_page_frag() by only
      testing ->sk_use_task_frag and avoid relying on the ->sk_allocation
      heuristic entirely (assuming other sockets will set ->sk_use_task_frag
      according to their constraints in the future).
      
      The new ->sk_use_task_frag field is placed in a hole in struct sock and
      belongs to a cache line shared with ->sk_shutdown. Therefore it should
      be hot and shouldn't have negative performance impacts on TCP's fast
      path (sk_shutdown is tested just before the while() loop in
      tcp_sendmsg_locked()).
      
      Link: https://lore.kernel.org/netdev/b4d8cb09c913d3e34f853736f3f5628abfd7f4b6.1656699567.git.gnault@redhat.com/
      Signed-off-by: Guillaume Nault <gnault@redhat.com>
      Reviewed-by: Benjamin Coddington <bcodding@redhat.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
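      A simplified userspace model of the decision sk_page_frag() makes
      after this patch; it models only the new field (the real helper also
      keeps the existing ->sk_allocation heuristic) and is not kernel code.

        #include <stdbool.h>
        #include <stdio.h>

        struct page_frag   { void *page; unsigned int offset, size; };
        struct task_struct { struct page_frag task_frag; };
        struct sock        { bool sk_use_task_frag; struct page_frag sk_frag; };

        static struct task_struct current_task;   /* stand-in for 'current' */

        static struct page_frag *sk_page_frag(struct sock *sk)
        {
                /* Sockets that may be used from memory reclaim (e.g. sunrpc)
                 * set sk_use_task_frag = false and always get their own
                 * ->sk_frag, so they can never corrupt current->task_frag
                 * while recursing into reclaim. */
                if (sk->sk_use_task_frag)
                        return &current_task.task_frag;
                return &sk->sk_frag;
        }

        int main(void)
        {
                struct sock tcp_sk = { .sk_use_task_frag = true  };
                struct sock nbd_sk = { .sk_use_task_frag = false };

                printf("tcp shares task_frag: %d\n",
                       sk_page_frag(&tcp_sk) == &current_task.task_frag);
                printf("nbd shares task_frag: %d\n",
                       sk_page_frag(&nbd_sk) == &current_task.task_frag);
                return 0;
        }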
  12. 09 December 2022 (1 commit)
    • net_tstamp: add SOF_TIMESTAMPING_OPT_ID_TCP · b534dc46
      Committed by Willem de Bruijn
      Add an option to initialize SOF_TIMESTAMPING_OPT_ID for TCP from
      write_seq sockets instead of snd_una.
      
      This should have been the behavior from the start. Because processes
      may now exist that rely on the established behavior, do not change
      behavior of the existing option, but add the right behavior with a new
      flag. It is encouraged to always set SOF_TIMESTAMPING_OPT_ID_TCP on
      stream sockets along with the existing SOF_TIMESTAMPING_OPT_ID.
      
      Intuitively the contract is that the counter is zero after the
      setsockopt, so that the next write N results in a notification for
      the last byte N - 1.
      
      On idle sockets snd_una == write_seq and this holds for both. But on
      sockets with data in transmission, snd_una records the unacked offset
      in the stream. This depends on the ACK response from the peer. A
      process cannot learn this in a race free manner (ioctl SIOCOUTQ is one
      racy approach).
      
      write_seq records the offset at the last byte written by the process.
      This is a better starting point. It matches the intuitive contract in
      all circumstances, unaffected by external behavior.
      
      The new timestamp flag necessitates increasing sk_tsflags to 32 bits.
      Move the field in struct sock to avoid growing the socket (for some
      common CONFIG variants). The UAPI interface so_timestamping.flags is
      already int, so 32 bits wide.
      Reported-by: Sotirios Delimanolis <sotodel@meta.com>
      Signed-off-by: Willem de Bruijn <willemb@google.com>
      Link: https://lore.kernel.org/r/20221207143701.29861-1-willemdebruijn.kernel@gmail.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
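      A minimal usage sketch (not from the commit): request software TX ACK
      timestamps on a TCP socket with both OPT_ID and the new OPT_ID_TCP
      flag, so the returned IDs count from write_seq as described above.
      The SOF_TIMESTAMPING_OPT_ID_TCP fallback define carries the value
      from this patch (1 << 16) for older <linux/net_tstamp.h> headers.

        #include <stdio.h>
        #include <sys/socket.h>
        #include <netinet/in.h>
        #include <linux/net_tstamp.h>

        #ifndef SOF_TIMESTAMPING_OPT_ID_TCP
        #define SOF_TIMESTAMPING_OPT_ID_TCP (1 << 16)
        #endif

        int main(void)
        {
                int fd = socket(AF_INET, SOCK_STREAM, 0);
                unsigned int flags = SOF_TIMESTAMPING_TX_ACK |
                                     SOF_TIMESTAMPING_SOFTWARE |
                                     SOF_TIMESTAMPING_OPT_ID |
                                     SOF_TIMESTAMPING_OPT_ID_TCP;

                /* The counter starts at zero relative to write_seq, so the
                 * next write of N bytes reports an ID for byte N - 1. */
                if (setsockopt(fd, SOL_SOCKET, SO_TIMESTAMPING,
                               &flags, sizeof(flags)) < 0)
                        perror("setsockopt(SO_TIMESTAMPING)");
                return 0;
        }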
  13. 05 November 2022 (1 commit)
    • lsm: make security_socket_getpeersec_stream() sockptr_t safe · b10b9c34
      Committed by Paul Moore
      Commit 4ff09db1 ("bpf: net: Change sk_getsockopt() to take the
      sockptr_t argument") made it possible to call sk_getsockopt()
      with both user and kernel address space buffers through the use of
      the sockptr_t type.  Unfortunately at the time of conversion the
      security_socket_getpeersec_stream() LSM hook was written to only
      accept userspace buffers, and in a desire to avoid having to change
      the LSM hook the commit author simply passed the sockptr_t's
      userspace buffer pointer.  Since the only sk_getsockopt() callers
      at the time of conversion which used kernel sockptr_t buffers did
      not allow SO_PEERSEC, and hence the
      security_socket_getpeersec_stream() hook, this was acceptable but
      also very fragile as future changes presented the possibility of
      silently passing kernel space pointers to the LSM hook.
      
      There are several ways to protect against this, including careful
      code review of future commits, but since relying on code review to
      catch bugs is a recipe for disaster and the upstream eBPF maintainer
      is "strongly against defensive programming", this patch updates the
      LSM hook, and all of the implementations to support sockptr_t and
      safely handle both user and kernel space buffers.
      Acked-by: Casey Schaufler <casey@schaufler-ca.com>
      Acked-by: John Johansen <john.johansen@canonical.com>
      Signed-off-by: Paul Moore <paul@paul-moore.com>
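      For readers unfamiliar with sockptr_t, below is a userspace
      illustration of the idea only; the real definition and copy helpers
      live in include/linux/sockptr.h and use copy_to_user() for the user
      branch, so this is not kernel code.

        #include <stdbool.h>
        #include <stdio.h>
        #include <string.h>

        /* One handle that remembers whether the buffer is a kernel or a
         * user pointer, so a hook can copy to either safely instead of
         * assuming __user. */
        typedef struct {
                void *ptr;
                bool  is_kernel;
        } sockptr_t;

        static int copy_to_sockptr_model(sockptr_t dst, const void *src,
                                         size_t size)
        {
                if (dst.is_kernel) {
                        memcpy(dst.ptr, src, size);  /* direct kernel copy */
                        return 0;
                }
                /* The kernel would call copy_to_user() here; modeled as a
                 * plain copy in this userspace sketch. */
                memcpy(dst.ptr, src, size);
                return 0;
        }

        int main(void)
        {
                char out[32];
                sockptr_t dst = { .ptr = out, .is_kernel = true };

                copy_to_sockptr_model(dst, "label", sizeof("label"));
                printf("%s\n", out);
                return 0;
        }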
  14. 25 October 2022 (1 commit)
    • soreuseport: Fix socket selection for SO_INCOMING_CPU. · b261eda8
      Committed by Kuniyuki Iwashima
      Kazuho Oku reported that setsockopt(SO_INCOMING_CPU) does not work
      with setsockopt(SO_REUSEPORT) since v4.6.
      
      With the combination of SO_REUSEPORT and SO_INCOMING_CPU, we could
      build a highly efficient server application.
      
      setsockopt(SO_INCOMING_CPU) associates a CPU with a TCP listener
      or UDP socket, and then incoming packets processed on the CPU will
      likely be distributed to the socket.  Technically, a socket could
      even receive packets handled on another CPU if no sockets in the
      reuseport group have the same CPU receiving the flow.
      
      The logic exists in compute_score() so that a socket will get a higher
      score if it has the same CPU as the flow.  However, the score gets
      ignored after the two blamed commits, which introduced a faster socket
      selection algorithm for SO_REUSEPORT.
      
      This patch introduces a counter of sockets with SO_INCOMING_CPU in
      a reuseport group to check if we should iterate all sockets to find
      a proper one.  We increment the counter when
      
        * calling listen() if the socket has SO_INCOMING_CPU and SO_REUSEPORT
      
        * enabling SO_INCOMING_CPU if the socket is in a reuseport group
      
      Also, we decrement it when
      
        * detaching a socket out of the group to apply SO_INCOMING_CPU to
          migrated TCP requests
      
        * disabling SO_INCOMING_CPU if the socket is in a reuseport group
      
      When the counter reaches 0, we can get back to the O(1) selection
      algorithm.
      
      The overall changes are negligible for the non-SO_INCOMING_CPU case,
      and the only notable thing is that we have to update sk_incoming_cpu
      under reuseport_lock.  Otherwise, the race prevents transitioning to
      the O(n) algorithm and results in the wrong socket selection.
      
       cpu1 (setsockopt)               cpu2 (listen)
      +-----------------+             +-------------+
      
      lock_sock(sk1)                  lock_sock(sk2)
      
      reuseport_update_incoming_cpu(sk1, val)
      .
      |  /* set CPU as 0 */
      |- WRITE_ONCE(sk1->incoming_cpu, val)
      |
      |                               spin_lock_bh(&reuseport_lock)
      |                               reuseport_grow(sk2, reuse)
      |                               .
      |                               |- more_socks_size = reuse->max_socks * 2U;
      |                               |- if (more_socks_size > U16_MAX &&
      |                               |       reuse->num_closed_socks)
      |                               |  .
      |                               |  |- RCU_INIT_POINTER(sk1->sk_reuseport_cb, NULL);
      |                               |  `- __reuseport_detach_closed_sock(sk1, reuse)
      |                               |     .
      |                               |     `- reuseport_put_incoming_cpu(sk1, reuse)
      |                               |        .
      |                               |        |  /* Read shutdown()ed sk1's sk_incoming_cpu
      |                               |        |   * without lock_sock().
      |                               |        |   */
      |                               |        `- if (sk1->sk_incoming_cpu >= 0)
      |                               |           .
      |                               |           |  /* decrement not-yet-incremented
      |                               |           |   * count, which is never incremented.
      |                               |           |   */
      |                               |           `- __reuseport_put_incoming_cpu(reuse);
      |                               |
      |                               `- spin_lock_bh(&reuseport_lock)
      |
      |- spin_lock_bh(&reuseport_lock)
      |
      |- reuse = rcu_dereference_protected(sk1->sk_reuseport_cb, ...)
      |- if (!reuse)
      |  .
      |  |  /* Cannot increment reuse->incoming_cpu. */
      |  `- goto out;
      |
      `- spin_unlock_bh(&reuseport_lock)
      
      Fixes: e32ea7e7 ("soreuseport: fast reuseport UDP socket selection")
      Fixes: c125e80b ("soreuseport: fast reuseport TCP socket selection")
      Reported-by: Kazuho Oku <kazuhooku@gmail.com>
      Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
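      A userspace sketch of the pattern this fix targets (not from the
      commit): one TCP listener per CPU, all in the same reuseport group,
      each pinned with SO_INCOMING_CPU before listen(). The port (8080) and
      the 4-CPU loop are arbitrary; the SO_INCOMING_CPU fallback define (49)
      is the asm-generic uapi value.

        #include <stdio.h>
        #include <unistd.h>
        #include <sys/socket.h>
        #include <netinet/in.h>

        #ifndef SO_INCOMING_CPU
        #define SO_INCOMING_CPU 49      /* asm-generic uapi value */
        #endif

        static int make_listener(int cpu, int port)
        {
                int one = 1;
                int fd = socket(AF_INET, SOCK_STREAM, 0);
                struct sockaddr_in addr = {
                        .sin_family = AF_INET,
                        .sin_port = htons(port),
                        .sin_addr.s_addr = htonl(INADDR_ANY),
                };

                setsockopt(fd, SOL_SOCKET, SO_REUSEPORT, &one, sizeof(one));
                /* Associate this listener with one CPU of the group; packets
                 * processed on that CPU should be steered to this socket. */
                setsockopt(fd, SOL_SOCKET, SO_INCOMING_CPU, &cpu, sizeof(cpu));
                bind(fd, (struct sockaddr *)&addr, sizeof(addr));
                listen(fd, 128);
                return fd;
        }

        int main(void)
        {
                for (int cpu = 0; cpu < 4; cpu++)
                        make_listener(cpu, 8080);
                pause();                        /* serve until interrupted */
                return 0;
        }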
  15. 24 October 2022 (2 commits)
  16. 13 October 2022 (1 commit)
  17. 03 September 2022 (3 commits)
  18. 24 August 2022 (3 commits)
  19. 19 August 2022 (4 commits)
  20. 06 July 2022 (1 commit)
  21. 13 June 2022 (1 commit)
  22. 11 June 2022 (3 commits)
  23. 10 June 2022 (1 commit)
  24. 16 May 2022 (2 commits)
    • net: core: add READ_ONCE/WRITE_ONCE annotations for sk->sk_bound_dev_if · e5fccaa1
      Committed by Eric Dumazet
      sock_bindtoindex_locked() needs to use WRITE_ONCE(sk->sk_bound_dev_if, val),
      because other cpus/threads might locklessly read this field.
      
      sock_getbindtodevice(), sock_getsockopt() need READ_ONCE()
      because they run without socket lock held.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
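      A minimal userspace illustration of the annotation pattern; the
      kernel's READ_ONCE/WRITE_ONCE in <linux/compiler.h> are more
      elaborate, and the volatile-cast versions below only model the intent
      of a single, tear-free access.

        #include <stdio.h>

        #define WRITE_ONCE(x, val) (*(volatile __typeof__(x) *)&(x) = (val))
        #define READ_ONCE(x)       (*(volatile __typeof__(x) *)&(x))

        struct sock { int sk_bound_dev_if; };

        /* Writer side: sock_bindtoindex_locked() runs under the socket
         * lock, but lockless readers exist, so the store is annotated. */
        static void bind_to_index(struct sock *sk, int ifindex)
        {
                WRITE_ONCE(sk->sk_bound_dev_if, ifindex);
        }

        /* Reader side: sock_getsockopt()/sock_getbindtodevice() run without
         * the socket lock, so the load is annotated as well. */
        static int bound_dev_if(struct sock *sk)
        {
                return READ_ONCE(sk->sk_bound_dev_if);
        }

        int main(void)
        {
                struct sock sk = { 0 };

                bind_to_index(&sk, 2);
                printf("sk_bound_dev_if = %d\n", bound_dev_if(&sk));
                return 0;
        }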
    • net: allow gso_max_size to exceed 65536 · 7c4e983c
      Committed by Alexander Duyck
      The code for gso_max_size was added originally to allow for debugging
      and working around buggy devices that couldn't support TSO with blocks
      64K in size. The original reason for limiting it to 64K was that this
      is the existing limit of the IPv4 and non-jumbogram IPv6 length fields.
      
      With the addition of Big TCP we can remove this limit and allow the value
      to potentially go up to UINT_MAX and instead be limited by the tso_max_size
      value.
      
      So in order to support this we need to go through and clean up the
      remaining users of the gso_max_size value so that the values will cap at
      64K for non-TCPv6 flows. In addition we can clean up the GSO_MAX_SIZE value
      so that 64K becomes GSO_LEGACY_MAX_SIZE and UINT_MAX will now be the upper
      limit for GSO_MAX_SIZE.
      
      v6: (edumazet) fixed a compile error if CONFIG_IPV6=n,
                     in a new sk_trim_gso_size() helper.
                     netif_set_tso_max_size() caps the requested TSO size
                     with GSO_MAX_SIZE.
      Signed-off-by: Alexander Duyck <alexanderduyck@fb.com>
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
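      A simplified model of the capping rule described above; it is not the
      kernel's sk_trim_gso_size() helper, just the rule that at this point
      in the series only IPv6 TCP flows may keep a gso_max_size above the
      legacy 64K limit.

        #include <stdio.h>

        #define GSO_LEGACY_MAX_SIZE 65536u

        static unsigned int trim_gso_size(unsigned int gso_max_size,
                                          int is_ipv6_tcp)
        {
                if (is_ipv6_tcp)
                        return gso_max_size;        /* BIG TCP may exceed 64K */
                return gso_max_size < GSO_LEGACY_MAX_SIZE ? gso_max_size
                                                          : GSO_LEGACY_MAX_SIZE;
        }

        int main(void)
        {
                printf("tcpv6: %u\n", trim_gso_size(262144, 1)); /* 262144 */
                printf("other: %u\n", trim_gso_size(262144, 0)); /* 65536  */
                return 0;
        }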
  25. 06 May 2022 (1 commit)
  26. 01 May 2022 (3 commits)
  27. 30 April 2022 (1 commit)