- 08 Sep 2022, 1 commit
-
-
Committed by Peter Zijlstra

Rewrite the core freezer to behave better wrt thawing and be simpler in general.

By replacing PF_FROZEN with TASK_FROZEN, a special block state, it is ensured frozen tasks stay frozen until thawed and don't randomly wake up early, as is currently possible. As such, it does away with PF_FROZEN and PF_FREEZER_SKIP, freeing up two PF_flags (yay!).

Specifically, the current scheme works a little like:

    freezer_do_not_count();
    schedule();
    freezer_count();

And either the task is blocked, or it lands in try_to_freeze() through freezer_count(). Now, when it is blocked, the freezer considers it frozen and continues. However, on thawing, once pm_freezing is cleared, freezer_count() stops working, and any random/spurious wakeup will let a task run before its time.

That is, thawing tries to thaw things in an explicit order: kernel threads and workqueues before bringing SMP back, before userspace etc. However, due to the above-mentioned races it is entirely possible for userspace tasks to thaw (by accident) before SMP is back. This can be a fatal problem in asymmetric ISA architectures (e.g. ARMv9) where the userspace task requires a special CPU to run.

As said, replace this with a special task state TASK_FROZEN and add the following state transitions:

    TASK_FREEZABLE  -> TASK_FROZEN
    __TASK_STOPPED  -> TASK_FROZEN
    __TASK_TRACED   -> TASK_FROZEN

The new TASK_FREEZABLE can be set on any state that is part of TASK_NORMAL (IOW. TASK_INTERRUPTIBLE and TASK_UNINTERRUPTIBLE) -- any such state is already required to deal with spurious wakeups, and the freezer causes one such when thawing the task (since the original state is lost). The special __TASK_{STOPPED,TRACED} states *can* be restored since their canonical state is in ->jobctl.

With this, frozen tasks need an explicit TASK_FROZEN wakeup and are free of undue (early / spurious) wakeups.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Link: https://lore.kernel.org/r/20220822114649.055452969@infradead.org
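As a hedged sketch of the two wait patterns described above (simplified from the description, not lifted from the patch):

    /* Old scheme: tell the freezer not to count this task while it
     * blocks. Once pm_freezing is cleared on thaw, freezer_count()
     * stops freezing, so a spurious wakeup can let the task run
     * before its time. */
    freezer_do_not_count();
    schedule();
    freezer_count();

    /* New scheme: freezability is part of the block state itself. A
     * frozen task sits in TASK_FROZEN and needs an explicit
     * TASK_FROZEN wakeup; random/early wakeups cannot release it. */
    set_current_state(TASK_INTERRUPTIBLE | TASK_FREEZABLE);
    schedule();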
-
- 07 Jul 2022, 1 commit
-
-
Committed by Kuniyuki Iwashima

Commit 6dd4142f ("Merge branch 'af_unix-per-netns-socket-hash'") and commit 51bae889 ("af_unix: Put pathname sockets in the global hash table.") changed the hash table layout.

Before:

    unix_socket_table  [0 - 255]   : abstract & pathname sockets
                       [256 - 511] : unnamed sockets

After:

    per-netns table    [0 - 255]   : abstract & pathname sockets
                       [256 - 511] : unnamed sockets
    bsd_socket_table   [0 - 255]   : pathname sockets (sk_bind_node)

Now, while looking up sockets, we traverse the global table for the pathname sockets and the first half of each per-netns hash table for abstract sockets, where pathname sockets are also linked. Thus, the more pathname sockets we have, the longer we take to look up abstract sockets. This characteristic existed before the layout change, but we can improve it now.

This patch changes the per-netns hash table's layout so that sockets not requiring lookup reside in the first half and do not impact the lookup of abstract sockets.

    per-netns table    [0 - 255]   : pathname & unnamed sockets
                       [256 - 511] : abstract sockets
    bsd_socket_table   [0 - 255]   : pathname sockets (sk_bind_node)

We have run a test that bind()s 100,000 abstract/pathname sockets each, then bind()s an abstract socket 100,000 times and measures the time spent in __unix_find_socket_byname(). The result shows that the patch makes each lookup faster.

Without this patch:

    $ sudo ./funclatency -p 2278 --microseconds __unix_find_socket_byname.isra.44
    usec       : count  distribution
    0 -> 1     : 0      |                                        |
    2 -> 3     : 0      |                                        |
    4 -> 7     : 0      |                                        |
    8 -> 15    : 126    |                                        |
    16 -> 31   : 1438   |*                                       |
    32 -> 63   : 4150   |***                                     |
    64 -> 127  : 9049   |*******                                 |
    128 -> 255 : 37704  |*******************************         |
    256 -> 511 : 47533  |****************************************|

With this patch:

    $ sudo ./funclatency -p 3648 --microseconds __unix_find_socket_byname.isra.46
    usec       : count  distribution
    0 -> 1     : 109    |                                        |
    2 -> 3     : 318    |                                        |
    4 -> 7     : 725    |                                        |
    8 -> 15    : 2501   |*                                       |
    16 -> 31   : 3061   |**                                      |
    32 -> 63   : 4028   |***                                     |
    64 -> 127  : 9312   |*******                                 |
    128 -> 255 : 51372  |****************************************|
    256 -> 511 : 28574  |**********************                  |

Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Link: https://lore.kernel.org/r/20220705233715.759-1-kuniyu@amazon.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
-
- 05 Jul 2022, 1 commit
-
-
Committed by Kuniyuki Iwashima

Commit cf2f225e ("af_unix: Put a socket into a per-netns hash table.") accidentally broke the user API for pathname sockets. A socket was able to connect() to a pathname socket whose file was visible even if they were in different network namespaces.

The commit puts all sockets into a per-netns hash table. As a result, connect() to a pathname socket in a different netns fails to find it in the caller's per-netns hash table and returns -ECONNREFUSED even when the task can view the peer socket file.

We can reproduce this issue by:

Console A:

    # python3
    >>> from socket import *
    >>> s = socket(AF_UNIX, SOCK_STREAM, 0)
    >>> s.bind('test')
    >>> s.listen(32)

Console B:

    # ip netns add test
    # ip netns exec test sh
    # python3
    >>> from socket import *
    >>> s = socket(AF_UNIX, SOCK_STREAM, 0)
    >>> s.connect('test')

Note that when dumping sockets by sock_diag, procfs, and bpf_iter, they are filtered only by netns. In other words, even if they are visible and connect()able, all sockets in different netns are skipped while iterating sockets. Thus, we need a fix only for finding a peer pathname socket.

This patch adds a global hash table for pathname sockets, links them with sk_bind_node, and uses it in unix_find_socket_byinode(). By doing so, we can keep sockets in per-netns hash tables and dump them easily.

Thanks to Sachin Sant and Leonard Crestez for reports, logs and a reproducer.

Fixes: cf2f225e ("af_unix: Put a socket into a per-netns hash table.")
Reported-by: Sachin Sant <sachinp@linux.ibm.com>
Reported-by: Leonard Crestez <cdleonard@gmail.com>
Tested-by: Sachin Sant <sachinp@linux.ibm.com>
Tested-by: Nathan Chancellor <nathan@kernel.org>
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Tested-by: Leonard Crestez <cdleonard@gmail.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
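A hedged sketch of how such a global, inode-keyed lookup could look; the names (bsd_socket_locks, bsd_socket_buckets, UNIX_HASH_MOD) follow the surrounding commits, but the body is illustrative rather than the literal patch:

    /* Walk the global BSD table bucket for this inode; pathname sockets
     * are linked here via sk_bind_node regardless of their netns, so a
     * peer in another netns is still found if its socket file is
     * visible to the caller. */
    static struct sock *unix_find_socket_byinode(struct inode *i)
    {
            unsigned int hash = i->i_ino & UNIX_HASH_MOD;
            struct sock *sk;

            spin_lock(&bsd_socket_locks[hash]);
            sk_for_each_bound(sk, &bsd_socket_buckets[hash]) {
                    struct dentry *dentry = unix_sk(sk)->path.dentry;

                    if (dentry && d_backing_inode(dentry) == i) {
                            sock_hold(sk);
                            spin_unlock(&bsd_socket_locks[hash]);
                            return sk;
                    }
            }
            spin_unlock(&bsd_socket_locks[hash]);
            return NULL;
    }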
-
- 22 Jun 2022, 6 commits
-
-
Committed by Kuniyuki Iwashima

unix_table_locks protect the global hash table, unix_socket_table. The previous commit removed the table, so let's clean up the now-unnecessary locks.

Here is a test result on EC2 c5.9xlarge where 10 processes run concurrently in different netns and bind 100,000 sockets each.

    without this series : 1m 38s
    with this series    :    11s

It is ~10x faster because the global hash table is split into 10 netns in this case.

Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
-
Committed by Kuniyuki Iwashima

This commit replaces the global hash table with a per-netns one and removes the global one. We now link each socket in its own netns's hash table, so we can save some netns comparisons when iterating through a hash bucket.

Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
-
Committed by Kuniyuki Iwashima

This commit adds an extra spin_lock/spin_unlock() for the per-netns hash table inside the existing ones for unix_table_locks. As of this commit, sockets are still linked in the global hash table. After putting sockets in a per-netns hash table and removing the old one in the next patch, we remove the global locks in the last patch.

Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
-
Committed by Kuniyuki Iwashima

This commit adds a per-netns hash table for AF_UNIX, whose size is fixed as UNIX_HASH_SIZE for now.

The first implementation defined the per-netns hash table as a single array of lock and list:

    struct unix_hashbucket {
            spinlock_t        lock;
            struct hlist_head head;
    };

    struct netns_unix {
            struct unix_hashbucket *hash;
            ...
    };

But Eric pointed out the memory cost: the structure has holes because of sizeof(spinlock_t), which is 4 (or more if LOCKDEP is enabled). [0] It could be expensive on a host with thousands of netns and few AF_UNIX sockets. For this reason, the per-netns hash table uses two dense arrays.

    struct unix_table {
            spinlock_t        *locks;
            struct hlist_head *buckets;
    };

    struct netns_unix {
            struct unix_table table;
            ...
    };

Note that the length of the hash list has a more significant impact than lock contention, so having shared locks could be an option; but per-netns locks and lists still perform better than global locks with per-netns lists. [1]

Also, this patch adds a change so that struct netns_unix disappears from struct net if CONFIG_UNIX is disabled.

[0]: https://lore.kernel.org/netdev/CANn89iLVxO5aqx16azNU7p7Z-nz5NrnM5QTqOzueVxEnkVTxyg@mail.gmail.com/
[1]: https://lore.kernel.org/netdev/20220617175215.1769-1-kuniyu@amazon.com/

Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
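A hedged sketch of how the two dense arrays might be set up per netns; the field names mirror the description above, but the function name is ours and error unwinding is trimmed:

    static int unix_table_alloc(struct net *net)
    {
            int i;

            net->unx.table.locks = kvmalloc_array(UNIX_HASH_SIZE,
                                                  sizeof(spinlock_t),
                                                  GFP_KERNEL);
            net->unx.table.buckets = kvmalloc_array(UNIX_HASH_SIZE,
                                                    sizeof(struct hlist_head),
                                                    GFP_KERNEL);
            if (!net->unx.table.locks || !net->unx.table.buckets)
                    return -ENOMEM; /* real code would free the survivor */

            for (i = 0; i < UNIX_HASH_SIZE; i++) {
                    spin_lock_init(&net->unx.table.locks[i]);
                    INIT_HLIST_HEAD(&net->unx.table.buckets[i]);
            }
            return 0;
    }

-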
Committed by Kuniyuki Iwashima

Currently, the size of the AF_UNIX hash table is UNIX_HASH_SIZE * 2: the first half for bind()ed sockets and the second half for unbound ones. UNIX_HASH_SIZE * 2 is used to define the table and iterate over it. In some places, we use ARRAY_SIZE(unix_socket_table) instead of UNIX_HASH_SIZE * 2. However, we cannot use it anymore because we will allocate the hash table dynamically. Then we would have to add UNIX_HASH_SIZE * 2 in many places, which would be troublesome.

This patch adapts the UNIX_HASH_SIZE definition to include bound and unbound sockets and defines a new UNIX_HASH_MOD macro to ease calculations.

Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
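A hedged sketch of what the adapted definitions could look like, assuming 256 buckets for each half as the surrounding commits describe:

    /* Covers both halves: [0, UNIX_HASH_MOD] for bound sockets and
     * (UNIX_HASH_MOD, UNIX_HASH_SIZE) for unbound ones. */
    #define UNIX_HASH_MOD   (256 - 1)
    #define UNIX_HASH_SIZE  (256 * 2)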
-
Committed by Kuniyuki Iwashima

Some functions define a net pointer only for one-shot use. Others call sock_net() redundantly even when a net pointer is available. Let's fix these and make the code simpler.

Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
-
- 20 Jun 2022, 1 commit
-
-
Committed by Cong Wang

Currently both splice() and sockmap use ->read_sock() to read skbs from the receive queue, but for sockmap we only read one entire skb at a time, so ->read_sock() is too conservative to use. Introduce a new proto_ops ->read_skb() which supports this semantic; with it we can finally pass ownership of the skb to the recv actors. For non-TCP protocols, all ->read_sock() implementations can be simply converted to ->read_skb().

Signed-off-by: Cong Wang <cong.wang@bytedance.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: John Fastabend <john.fastabend@gmail.com>
Link: https://lore.kernel.org/bpf/20220615162014.89193-3-xiyou.wangcong@gmail.com
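A hedged sketch of the hook's shape as this description implies it (the exact kernel signature may differ): the recv actor receives one whole skb per call and takes ownership of it.

    /* The recv actor consumes exactly one skb and owns it afterwards. */
    typedef int (*skb_read_actor_t)(struct sock *sk, struct sk_buff *skb);

    /* New proto_ops member, alongside the older ->read_sock(). */
    int (*read_skb)(struct sock *sk, skb_read_actor_t recv_actor);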
-
- 10 Jun 2022, 1 commit
-
-
Committed by Eric Dumazet

Replace four WARN_ON() that have not triggered recently with DEBUG_NET_WARN_ON_ONCE().

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
-
- 07 Jun 2022, 1 commit
-
-
Committed by Kuniyuki Iwashima

unix_dgram_poll() calls unix_dgram_peer_wake_me() without `other`'s lock held and checks if its receive queue is full. Here we need to use unix_recvq_full_lockless() instead of unix_recvq_full(); otherwise KCSAN will report a data-race.

Fixes: 7d267278 ("unix: avoid use-after-free in ep_remove_wait_queue")
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Link: https://lore.kernel.org/r/20220605232325.11804-1-kuniyu@amazon.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
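A hedged sketch of what the lockless variant amounts to; the helper names exist in af_unix, but treat the body as illustrative:

    /* Both loads are lockless, so annotate them for KCSAN: the queue
     * length uses the _lockless helper and the backlog uses READ_ONCE(). */
    static bool unix_recvq_full_lockless(const struct sock *sk)
    {
            return skb_queue_len_lockless(&sk->sk_receive_queue) >
                   READ_ONCE(sk->sk_max_ack_backlog);
    }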
-
- 17 May 2022, 1 commit
-
-
Committed by Kees Cook

While preparing for Clang randstruct support (which duplicated many of the warnings the randstruct GCC plugin warned about), one strange warning remained only for the randstruct GCC plugin. Eliminating it rids the plugin of its last exception. It seems the plugin is happy to dereference individual members of a cross-struct cast, but it is upset about casting to a whole object pointer. This only manifests in one place in the kernel, so just replace the variable with individual member accesses. There is no change in executable instruction output.

Drop the last exception from the randstruct GCC plugin.

Cc: "David S. Miller" <davem@davemloft.net>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Paolo Abeni <pabeni@redhat.com>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Cong Wang <cong.wang@bytedance.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: netdev@vger.kernel.org
Cc: linux-hardening@vger.kernel.org
Acked-by: Kuniyuki Iwashima <kuniyu@amazon.co.jp>
Link: https://lore.kernel.org/lkml/20220511022217.58586-1-kuniyu@amazon.co.jp
Acked-by: Jakub Kicinski <kuba@kernel.org>
Link: https://lore.kernel.org/lkml/20220511151542.4cb3ff17@kernel.org
Signed-off-by: Kees Cook <keescook@chromium.org>
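A generic, hedged illustration of the pattern (not the kernel's actual types; consume() and struct parms are stand-ins):

    void consume(int pid, int uid);     /* hypothetical sink */

    struct parms { int pid; int uid; }; /* stand-in types */

    void before(char *cb)
    {
            /* Holding the whole cross-struct cast in a variable is what
             * the randstruct plugin objects to. */
            struct parms *p = (struct parms *)cb;

            consume(p->pid, p->uid);
    }

    void after(char *cb)
    {
            /* Dereferencing individual members through the cast at each
             * use keeps the plugin happy; codegen is identical. */
            consume(((struct parms *)cb)->pid, ((struct parms *)cb)->uid);
    }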
-
- 12 Apr 2022, 1 commit
-
-
Committed by Oliver Hartkopp

The internal recvmsg() functions have two parameters, 'flags' and 'noblock', that were merged inside skb_recv_datagram(). As a follow-up to commit f4b41f06 ("net: remove noblock parameter from skb_recv_datagram()"), this patch removes the separate 'noblock' parameter from recvmsg().

Analogous to the referenced patch for skb_recv_datagram(), the 'flags' and 'noblock' parameters are unnecessarily split up, e.g. with

    err = sk->sk_prot->recvmsg(sk, msg, size, flags & MSG_DONTWAIT,
                               flags & ~MSG_DONTWAIT, &addr_len);

or in

    err = INDIRECT_CALL_2(sk->sk_prot->recvmsg, tcp_recvmsg, udp_recvmsg,
                          sk, msg, size, flags & MSG_DONTWAIT,
                          flags & ~MSG_DONTWAIT, &addr_len);

instead of simply using only 'flags' all the time and checking for MSG_DONTWAIT where needed (to preserve the formerly separate no(n)block condition).

Signed-off-by: Oliver Hartkopp <socketcan@hartkopp.net>
Link: https://lore.kernel.org/r/20220411124955.154876-1-socketcan@hartkopp.net
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
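For contrast, a hedged sketch of the simplified call once 'noblock' is gone, as this description implies it:

    /* 'flags' carries MSG_DONTWAIT itself; no bit-splitting at the caller. */
    err = sk->sk_prot->recvmsg(sk, msg, size, flags, &addr_len);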
-
- 06 Apr 2022, 1 commit
-
-
Committed by Oliver Hartkopp

skb_recv_datagram() has two parameters, 'flags' and 'noblock', that are merged inside skb_recv_datagram() by

    flags | (noblock ? MSG_DONTWAIT : 0)

Since 'flags' may itself contain MSG_DONTWAIT, most callers split 'flags' into 'flags' and 'noblock' with ultimately obsolete bit operations like this:

    skb_recv_datagram(sk, flags & ~MSG_DONTWAIT, flags & MSG_DONTWAIT, &rc);

And this is not even done consistently with the 'flags' parameter.

This patch removes the obsolete and costly splitting into two parameters and only performs bit operations when really needed on the caller side.

One missing conversion was thankfully reported by the kernel test robot; I had missed enabling the kunit tests needed to build the mctp code.

Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Oliver Hartkopp <socketcan@hartkopp.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
-
- 19 Mar 2022, 1 commit
-
-
Committed by Kuniyuki Iwashima

Let's remove the unnecessary brackets around CONFIG_AF_UNIX_OOB.

Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.co.jp>
Link: https://lore.kernel.org/r/20220317032308.65372-1-kuniyu@amazon.co.jp
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
-
- 18 Mar 2022, 2 commits
-
-
Committed by Kuniyuki Iwashima

The commit 314001f0 ("af_unix: Add OOB support") introduced OOB for AF_UNIX, but it lacks some changes for POLLPRI. Let's add the missing piece.

In the selftest, normal datagrams are sent followed by OOB data, so this commit replaces `POLLIN | POLLPRI` with just `POLLPRI` in the first test case.

Fixes: 314001f0 ("af_unix: Add OOB support")
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.co.jp>
Signed-off-by: David S. Miller <davem@davemloft.net>
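A hedged userspace sketch of what the selftest exercises: waiting for the OOB byte with POLLPRI alone, then reading it with MSG_OOB (the helper name is ours):

    #include <poll.h>
    #include <sys/socket.h>

    /* Wait for urgent data only, then pull the single OOB byte. */
    static ssize_t recv_oob_byte(int fd, char *byte)
    {
            struct pollfd pfd = { .fd = fd, .events = POLLPRI };

            if (poll(&pfd, 1, -1) <= 0 || !(pfd.revents & POLLPRI))
                    return -1;
            return recv(fd, byte, 1, MSG_OOB);
    }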
-
Committed by Kuniyuki Iwashima

Out-of-band data automatically places a "mark" showing where in the sequence the out-of-band data would have been. If the out-of-band data implies cancelling everything sent so far, the "mark" is helpful to flush them. When the socket's read pointer reaches the "mark", the ioctl() below sets a non-zero value to the arg `atmark`:

    ioctl(fd, SIOCATMARK, &atmark)

The out-of-band data is queued in sk->sk_receive_queue as well as ordinary data, and is also saved in unix_sk(sk)->oob_skb. It can be used to test if the head of the receive queue is the out-of-band data, meaning the socket is at the "mark".

While testing that, unix_ioctl() reads unix_sk(sk)->oob_skb locklessly. Thus, all accesses to oob_skb need some basic protection to avoid load/store tearing, which KCSAN detects when these are called concurrently:

    - ioctl(fd_a, SIOCATMARK, &atmark, sizeof(atmark))
    - send(fd_b_connected_to_a, buf, sizeof(buf), MSG_OOB)

    BUG: KCSAN: data-race in unix_ioctl / unix_stream_sendmsg

    write to 0xffff888003d9cff0 of 8 bytes by task 175 on cpu 1:
     unix_stream_sendmsg (net/unix/af_unix.c:2087 net/unix/af_unix.c:2191)
     sock_sendmsg (net/socket.c:705 net/socket.c:725)
     __sys_sendto (net/socket.c:2040)
     __x64_sys_sendto (net/socket.c:2048)
     do_syscall_64 (arch/x86/entry/common.c:50 arch/x86/entry/common.c:80)
     entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:113)

    read to 0xffff888003d9cff0 of 8 bytes by task 176 on cpu 0:
     unix_ioctl (net/unix/af_unix.c:3101 (discriminator 1))
     sock_do_ioctl (net/socket.c:1128)
     sock_ioctl (net/socket.c:1242)
     __x64_sys_ioctl (fs/ioctl.c:52 fs/ioctl.c:874 fs/ioctl.c:860 fs/ioctl.c:860)
     do_syscall_64 (arch/x86/entry/common.c:50 arch/x86/entry/common.c:80)
     entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:113)

    value changed: 0xffff888003da0c00 -> 0xffff888003da0d00

    Reported by Kernel Concurrency Sanitizer on:
    CPU: 0 PID: 176 Comm: unix_race_oob_i Not tainted 5.17.0-rc5-59529-g83dc4c2a #12
    Hardware name: Red Hat KVM, BIOS 1.11.0-2.amzn2 04/01/2014

Fixes: 314001f0 ("af_unix: Add OOB support")
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.co.jp>
Signed-off-by: David S. Miller <davem@davemloft.net>
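A hedged userspace sketch of the SIOCATMARK check described above; the helper name is ours:

    #include <stdio.h>
    #include <sys/ioctl.h>

    /* Returns 1 if the socket's read pointer sits at the OOB "mark",
     * 0 if not, -1 on error. */
    static int at_oob_mark(int fd)
    {
            int atmark = 0;

            if (ioctl(fd, SIOCATMARK, &atmark) < 0) {
                    perror("ioctl(SIOCATMARK)");
                    return -1;
            }
            return atmark != 0;
    }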
-
- 19 Jan 2022, 3 commits
-
-
Committed by Kuniyuki Iwashima

This patch makes bpf_(get|set)sockopt() available when iterating AF_UNIX sockets.

Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.co.jp>
Link: https://lore.kernel.org/r/20220113002849.4384-4-kuniyu@amazon.co.jp
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
-
Committed by Kuniyuki Iwashima

The commit 04c7820b ("bpf: tcp: Bpf iter batching and lock_sock") introduced a batching algorithm to iterate TCP sockets with more consistency. This patch uses the same algorithm to iterate AF_UNIX sockets.

Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.co.jp>
Link: https://lore.kernel.org/r/20220113002849.4384-3-kuniyu@amazon.co.jp
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
-
Committed by Kuniyuki Iwashima

Currently, unix_next_socket() is overloaded depending on its 2nd argument. If it is NULL, unix_next_socket() returns the first socket in the hash. If not NULL, it returns the next socket in the same hash list, or the first socket in the next non-empty hash list.

This patch refactors unix_next_socket() into two functions, unix_get_first() and unix_get_next(). unix_get_first() newly acquires a lock and returns the first socket in the list. unix_get_next() returns the next socket in a list, or releases a lock and falls back to unix_get_first().

In the following patch, the bpf iter keeps batches of sockets in a list and always releases the lock before .show(); it always calls unix_get_first() to acquire a lock in each iteration. So this patch makes that change easier to follow.

Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.co.jp>
Link: https://lore.kernel.org/r/20220113002849.4384-2-kuniyu@amazon.co.jp
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
-
- 30 Dec 2021, 1 commit
-
-
Committed by Jakub Kicinski

sock.h is pretty heavily used (5k objects rebuilt on x86 after it's touched). We can drop the include of filter.h from it and add a forward declaration of struct sk_filter instead. This decreases the number of rebuilt objects when bpf.h is touched from ~5k to ~1k.

There were a lot of missing includes this was masking; primarily in networking, this time.

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Marc Kleine-Budde <mkl@pengutronix.de>
Acked-by: Florian Fainelli <f.fainelli@gmail.com>
Acked-by: Nikolay Aleksandrov <nikolay@nvidia.com>
Acked-by: Stefano Garzarella <sgarzare@redhat.com>
Link: https://lore.kernel.org/bpf/20211229004913.513372-1-kuba@kernel.org
-
- 27 Nov 2021, 13 commits
-
-
Committed by Kuniyuki Iwashima

When we bind() an AF_UNIX socket without a name specified, the kernel selects an available one from 0x00000 to 0xFFFFF. unix_autobind() starts searching from a number held in a 'static' variable and increments it after acquiring two locks.

If multiple processes try to autobind, they obtain the same lock and check if a socket in the hash list has the same name. If not, one process uses it, and all the others end up retrying the _next_ number (not exactly the next one, actually; it may have been incremented further by the other processes). The more sockets we autobind in parallel, the longer the latency gets. We can avoid such a race by searching for a name starting from a random number.

These show latency in unix_autobind() while 64 CPUs are simultaneously autobind-ing 1024 sockets each.

Without this patch:

    usec : count  distribution
    0    : 1176   |***                                     |
    2    : 3655   |***********                             |
    4    : 4094   |*************                           |
    6    : 3831   |************                            |
    8    : 3829   |************                            |
    10   : 3844   |************                            |
    12   : 3638   |***********                             |
    14   : 2992   |*********                               |
    16   : 2485   |*******                                 |
    18   : 2230   |*******                                 |
    20   : 2095   |******                                  |
    22   : 1853   |*****                                   |
    24   : 1827   |*****                                   |
    26   : 1677   |*****                                   |
    28   : 1473   |****                                    |
    30   : 1573   |*****                                   |
    32   : 1417   |****                                    |
    34   : 1385   |****                                    |
    36   : 1345   |****                                    |
    38   : 1344   |****                                    |
    40   : 1200   |***                                     |

With this patch:

    usec : count  distribution
    0    : 1855   |******                                  |
    2    : 6464   |*********************                   |
    4    : 9936   |********************************        |
    6    : 12107  |****************************************|
    8    : 10441  |**********************************      |
    10   : 7264   |***********************                 |
    12   : 4254   |**************                          |
    14   : 2538   |********                                |
    16   : 1596   |*****                                   |
    18   : 1088   |***                                     |
    20   : 800    |**                                      |
    22   : 670    |**                                      |
    24   : 601    |*                                       |
    26   : 562    |*                                       |
    28   : 525    |*                                       |
    30   : 446    |*                                       |
    32   : 378    |*                                       |
    34   : 337    |*                                       |
    36   : 317    |*                                       |
    38   : 314    |*                                       |
    40   : 298    |                                        |

Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.co.jp>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
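A hedged sketch of the retry loop with a randomized starting ordinal; name_in_use() and claim_name() are hypothetical stand-ins for the hash-table checks the real function performs:

    static int unix_autobind_sketch(struct sock *sk)
    {
            unsigned int ordernum = prandom_u32(); /* random start, not static */
            unsigned int retries = 0;
            char name[16];

            do {
                    ordernum = (ordernum + 1) & 0xFFFFF;
                    /* Abstract names: a leading '\0' plus five hex digits. */
                    snprintf(name, sizeof(name), "%05x", ordernum);
                    if (!name_in_use(sk, name))        /* hypothetical */
                            return claim_name(sk, name); /* hypothetical */
            } while (++retries < 0xFFFFF);

            return -ENOSPC;
    }

-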
Committed by Kuniyuki Iwashima

The hash table of AF_UNIX sockets is protected by a single lock. This patch replaces it with per-hash locks.

The effect is noticeable when we handle multiple sockets simultaneously. Here is a test result on an EC2 c5.24xlarge instance. It shows latency (under 10us only) in unix_insert_unbound_socket() while 64 CPUs are creating 1024 sockets each in parallel.

Without this patch:

    nsec  : count  distribution
    0     : 179    |                                        |
    500   : 3021   |*********                               |
    1000  : 6271   |*******************                     |
    1500  : 6318   |*******************                     |
    2000  : 5828   |*****************                       |
    2500  : 5124   |***************                         |
    3000  : 4426   |*************                           |
    3500  : 3672   |***********                             |
    4000  : 3138   |*********                               |
    4500  : 2811   |********                                |
    5000  : 2384   |*******                                 |
    5500  : 2023   |******                                  |
    6000  : 1954   |*****                                   |
    6500  : 1737   |*****                                   |
    7000  : 1749   |*****                                   |
    7500  : 1520   |****                                    |
    8000  : 1469   |****                                    |
    8500  : 1394   |****                                    |
    9000  : 1232   |***                                     |
    9500  : 1138   |***                                     |
    10000 : 994    |***                                     |

With this patch:

    nsec  : count  distribution
    0     : 1634   |****                                    |
    500   : 13170  |****************************************|
    1000  : 13156  |*************************************** |
    1500  : 9010   |***************************             |
    2000  : 6363   |*******************                     |
    2500  : 4443   |*************                           |
    3000  : 3240   |*********                               |
    3500  : 2549   |*******                                 |
    4000  : 1872   |*****                                   |
    4500  : 1504   |****                                    |
    5000  : 1247   |***                                     |
    5500  : 1035   |***                                     |
    6000  : 889    |**                                      |
    6500  : 744    |**                                      |
    7000  : 634    |*                                       |
    7500  : 498    |*                                       |
    8000  : 433    |*                                       |
    8500  : 355    |*                                       |
    9000  : 336    |*                                       |
    9500  : 284    |                                        |
    10000 : 243    |                                        |

Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.co.jp>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

-
Committed by Kuniyuki Iwashima

To replace unix_table_lock with per-hash locks in the next patch, we need to save a hash in each socket, because /proc/net/unix and the BPF prog iterate sockets while holding a hash table lock and release it later in a different function.

Currently, we store a real/pseudo hash in struct unix_address. However, we do not allocate one for unbound sockets, nor should we do so just for this. For this purpose, we can use sk_hash. Then we no longer use the hash field in struct unix_address and can remove it.

Also, this patch does:

    - rename unix_insert_socket() to unix_insert_unbound_socket()
    - remove the redundant list argument from __unix_insert_socket()
      and unix_insert_unbound_socket()
    - use 'unsigned int' instead of 'unsigned' in __unix_set_addr_hash()
    - remove 'inline' from unix_remove_socket() and
      unix_insert_unbound_socket()

Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.co.jp>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

-
Committed by Kuniyuki Iwashima

This patch adds three helper functions that calculate hashes for unbound sockets and for bound sockets with BSD/abstract addresses.

Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.co.jp>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
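A hedged sketch of what the three helpers could look like; the real implementations differ in detail (e.g. how the abstract hash is folded and offset), so treat this as illustrative only:

    /* Unbound sockets: spread by the socket's own identity. */
    static unsigned int unix_unbound_hash(struct sock *sk)
    {
            unsigned long hash = (unsigned long)sk;

            hash ^= hash >> 16;
            hash ^= hash >> 8;
            return hash & UNIX_HASH_MOD;
    }

    /* BSD (pathname) sockets: keyed by the backing inode. */
    static unsigned int unix_bsd_hash(struct inode *i)
    {
            return i->i_ino & UNIX_HASH_MOD;
    }

    /* Abstract sockets: keyed by a checksum of the address bytes. */
    static unsigned int unix_abstract_hash(struct sockaddr_un *sunaddr,
                                           int addr_len, int type)
    {
            unsigned int hash;

            hash = (__force unsigned int)
                   csum_fold(csum_partial(sunaddr, addr_len, 0));
            hash ^= hash >> 8;
            hash ^= type;
            return hash & UNIX_HASH_MOD;
    }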
-
Committed by Kuniyuki Iwashima

In the BSD and abstract address cases, we store sockets in the hash table with keys between 0 and UNIX_HASH_SIZE - 1. However, the hash saved in a socket varies depending on its address type; sockets with BSD addresses always have UNIX_HASH_SIZE in their unix_sk(sk)->addr->hash.

This is done just for the UNIX_ABSTRACT() macro used to check the address type. The difference in the saved hashes comes from the first byte of the address in the first place, so we can test that directly. Then we can keep a real hash in each socket and replace unix_table_lock with per-hash locks in a later patch.

Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.co.jp>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
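A hedged sketch of the direct first-byte test (the helper name is ours): abstract AF_UNIX addresses start with a NUL byte, pathname addresses with a path character.

    static bool unix_addr_is_abstract(const struct sockaddr_un *sunaddr)
    {
            /* '\0' in sun_path[0] marks the abstract namespace. */
            return sunaddr->sun_path[0] == '\0';
    }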
-
Committed by Kuniyuki Iwashima

To terminate the address with '\0' in unix_bind_bsd(), we add unix_create_addr() and call it in unix_bind_bsd() and unix_bind_abstract().

Also, unix_bind_abstract() does not return -EEXIST. Only kern_path_create() and vfs_mknod() in unix_bind_bsd() can return it, so we move the last error check in unix_bind() to unix_bind_bsd().

Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.co.jp>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
-
Committed by Kuniyuki Iwashima

This patch removes unix_mkname() and postpones calculating a hash until unix_bind_abstract(). Some BSD-specific logic still remains in unix_bind(), though; the next patch packs it into unix_bind_bsd().

Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.co.jp>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
-
Committed by Kuniyuki Iwashima

We should not call unix_mkname() before unix_find_other(); instead, do the same processing where necessary based on the address type:

    - terminating the address with '\0' in unix_find_bsd()
    - calculating the hash in unix_find_abstract()

Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.co.jp>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
-
Committed by Kuniyuki Iwashima

unix_mkname() tests the socket address length and family and does some processing based on the address type. It is called at an early stage, so some of its work is redundant and can end up being done in vain.

The address length/family tests are done twice in unix_bind(). Also, the address type is rechecked later in unix_bind() and unix_find_other(), where we can do the same processing. Moreover, in the BSD address case, the hash is set to 0 but never used, which is confusing.

This patch moves the address tests out of unix_mkname(), and the following patches move the other parts into appropriate places and finally remove unix_mkname().

Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.co.jp>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
-
Committed by Kuniyuki Iwashima

We can return an error as a pointer and thus need not pass an additional argument to unix_find_other().

Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.co.jp>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
-
Committed by Kuniyuki Iwashima

As done in commit fa42d910 ("unix_bind(): take BSD and abstract address cases into new helpers"), this patch moves the BSD and abstract address cases from unix_find_other() into unix_find_bsd() and unix_find_abstract().

Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.co.jp>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
-
Committed by Kuniyuki Iwashima

We do not use struct socket in unix_autobind() and already pass struct sock to unix_bind_bsd() and unix_bind_abstract(). Let's pass it to unix_autobind() as well.

Also, this patch fixes these errors reported by checkpatch.pl:

    ERROR: do not use assignment in if condition
    #1795: FILE: net/unix/af_unix.c:1795:
    +	if (test_bit(SOCK_PASSCRED, &sock->flags) && !u->addr

    CHECK: Logical continuations should be on the previous line
    #1796: FILE: net/unix/af_unix.c:1796:
    +	if (test_bit(SOCK_PASSCRED, &sock->flags) && !u->addr
    +	    && (err = unix_autobind(sock)) != 0)

Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.co.jp>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
-
Committed by Kuniyuki Iwashima

The length of an AF_UNIX socket address contains an offset to the member sun_path of struct sockaddr_un. Currently, the preceding member is just sun_family, whose type sa_family_t resolves to short, so the offset is represented by sizeof(short). However, this is unclear and fragile against changes in struct sockaddr_storage or sockaddr_un.

This commit makes it clear and robust by rewriting sizeof() with offsetof().

Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.co.jp>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
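A hedged userspace sketch of the same idea: computing the address length from offsetof(struct sockaddr_un, sun_path) rather than sizeof(short), so it stays correct whatever precedes sun_path (the path is illustrative):

    #include <stddef.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/socket.h>
    #include <sys/un.h>

    int main(void)
    {
            struct sockaddr_un addr = { .sun_family = AF_UNIX };
            const char *path = "/tmp/example.sock"; /* illustrative path */
            socklen_t len;

            strncpy(addr.sun_path, path, sizeof(addr.sun_path) - 1);
            /* Offset of sun_path, not sizeof(short). */
            len = offsetof(struct sockaddr_un, sun_path) + strlen(path);

            printf("sun_path offset = %zu, address length = %u\n",
                   offsetof(struct sockaddr_un, sun_path), (unsigned)len);
            return 0;
    }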
-
- 20 Nov 2021, 1 commit
-
-
Committed by Vincent Whitchurch

On kernels before v5.15, calling read() on a unix socket after shutdown(SHUT_RD) or shutdown(SHUT_RDWR) would return the data previously written, or EOF. But now, while read() after shutdown(SHUT_RD) still behaves the same way, read() after shutdown(SHUT_RDWR) always fails with -EINVAL.

This behaviour change was apparently inadvertently introduced as part of a bug fix for a different regression caused by the commit adding sockmap support to af_unix, commit 94531cfc ("af_unix: Add unix_stream_proto for sockmap"). Those commits, for unclear reasons, started setting the socket state to TCP_CLOSE on shutdown(SHUT_RDWR), while this state change had previously only been done in unix_release_sock().

Restore the original behaviour. The sockmap tests in tests/selftests/bpf continue to pass after this patch.

Fixes: d0c6416b ("unix: Fix an issue in unix_shutdown causing the other end read/write failures")
Link: https://lore.kernel.org/lkml/20211111140000.GA10779@axis.com/
Signed-off-by: Vincent Whitchurch <vincent.whitchurch@axis.com>
Tested-by: Casey Schaufler <casey@schaufler-ca.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
-
- 16 Nov 2021, 1 commit
-
-
Committed by Eric Dumazet

This is really distracting; let's make this simpler, because many callers had to take care of it by themselves, even if on x86 this adds more code than really needed.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
-
- 27 Oct 2021, 1 commit
-
-
Committed by Cong Wang

Yucong noticed we can't poll() sockets in sockmap even when they are the destination sockets of redirections. This is because we never poll any psock queues in ->poll(), except for TCP. With the new ->sock_is_readable() hook, we can now override it and implement it for both UDP and AF_UNIX sockets.

Reported-by: Yucong Sun <sunyucong@gmail.com>
Signed-off-by: Cong Wang <cong.wang@bytedance.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20211008203306.37525-4-xiyou.wangcong@gmail.com
-
- 12 Oct 2021, 1 commit
-
-
Committed by Stephen Boyd

The name of this protocol changed in commit 94531cfc ("af_unix: Add unix_stream_proto for sockmap") because that commit added stream support to the af_unix protocol. Renaming the existing protocol makes a ChromeOS protocol test[1] fail now that the name has changed in /proc/net/protocols from "UNIX" to "UNIX-DGRAM".

Let's put the name back to how it was while keeping the stream protocol as "UNIX-STREAM", so that the procfs interface doesn't change. This fixes the test and maintains backwards compatibility in proc.

Cc: Jiang Wang <jiang.wang@bytedance.com>
Cc: Andrii Nakryiko <andrii@kernel.org>
Cc: Cong Wang <cong.wang@bytedance.com>
Cc: Jakub Sitnicki <jakub@cloudflare.com>
Cc: John Fastabend <john.fastabend@gmail.com>
Cc: Dmitry Osipenko <digetx@gmail.com>
Link: https://source.chromium.org/chromiumos/chromiumos/codesearch/+/main:src/platform/tast-tests/src/chromiumos/tast/local/bundles/cros/network/supported_protocols.go;l=50;drc=e8b1c3f94cb40a054f4aa1ef1aff61e75dc38f18 [1]
Fixes: 94531cfc ("af_unix: Add unix_stream_proto for sockmap")
Signed-off-by: Stephen Boyd <swboyd@chromium.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
-
- 06 Oct 2021, 1 commit
-
-
Committed by Jiang Wang

Commit 94531cfc ("af_unix: Add unix_stream_proto for sockmap") sets the unix domain socket peer state to TCP_CLOSE in unix_shutdown. This could happen when the local end is shut down but the other end is not; then the other end gets read or write failures, which is not expected. Fix the issue by setting the local state to shutdown.

Fixes: 94531cfc ("af_unix: Add unix_stream_proto for sockmap")
Reported-by: Casey Schaufler <casey@schaufler-ca.com>
Suggested-by: Cong Wang <cong.wang@bytedance.com>
Signed-off-by: Jiang Wang <jiang.wang@bytedance.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Tested-by: Casey Schaufler <casey@schaufler-ca.com>
Reviewed-by: Casey Schaufler <casey@schaufler-ca.com>
Acked-by: Song Liu <songliubraving@fb.com>
Link: https://lore.kernel.org/bpf/20211004232530.2377085-1-jiang.wang@bytedance.com
-