提交 · 8217ca653ec601246832d562207bc24bdf652d2f · openeuler / Kernel

11 8月, 2018 2 次提交

bpf: Enable BPF_PROG_TYPE_SK_REUSEPORT bpf prog in reuseport selection · 8217ca65

由 Martin KaFai Lau 提交于 8月 08, 2018

This patch allows a BPF_PROG_TYPE_SK_REUSEPORT bpf prog to select a
SO_REUSEPORT sk from a BPF_MAP_TYPE_REUSEPORT_ARRAY introduced in
the earlier patch. "bpf_run_sk_reuseport()" will return -ECONNREFUSED
when the BPF_PROG_TYPE_SK_REUSEPORT prog returns SK_DROP.
The callers, in inet[6]_hashtable.c and ipv[46]/udp.c, are modified to
handle this case and return NULL immediately instead of continuing the
sk search from its hashtable.

It re-uses the existing SO_ATTACH_REUSEPORT_EBPF setsockopt to attach
BPF_PROG_TYPE_SK_REUSEPORT. The "sk_reuseport_attach_bpf()" will check
if the attaching bpf prog is in the new SK_REUSEPORT or the existing
SOCKET_FILTER type and then check different things accordingly.

One level of "__reuseport_attach_prog()" call is removed. The
"sk_unhashed() && ..." and "sk->sk_reuseport_cb" tests are pushed
back to "reuseport_attach_prog()" in sock_reuseport.c. sock_reuseport.c
seems to have more knowledge on those test requirements than filter.c.
In "reuseport_attach_prog()", after new_prog is attached to reuse->prog,
the old_prog (if any) is also directly freed instead of returning the
old_prog to the caller and asking the caller to free.

The sysctl_optmem_max check is moved back to the
"sk_reuseport_attach_filter()" and "sk_reuseport_attach_bpf()".
As of other bpf prog types, the new BPF_PROG_TYPE_SK_REUSEPORT is only
bounded by the usual "bpf_prog_charge_memlock()" during load time
instead of bounded by both bpf_prog_charge_memlock and sysctl_optmem_max.
Signed-off-by: NMartin KaFai Lau <kafai@fb.com>
Acked-by: NAlexei Starovoitov <ast@kernel.org>
Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>

8217ca65

bpf: Introduce BPF_PROG_TYPE_SK_REUSEPORT · 2dbb9b9e

由 Martin KaFai Lau 提交于 8月 08, 2018

This patch adds a BPF_PROG_TYPE_SK_REUSEPORT which can select
a SO_REUSEPORT sk from a BPF_MAP_TYPE_REUSEPORT_ARRAY.  Like other
non SK_FILTER/CGROUP_SKB program, it requires CAP_SYS_ADMIN.

BPF_PROG_TYPE_SK_REUSEPORT introduces "struct sk_reuseport_kern"
to store the bpf context instead of using the skb->cb[48].

At the SO_REUSEPORT sk lookup time, it is in the middle of transiting
from a lower layer (ipv4/ipv6) to a upper layer (udp/tcp).  At this
point,  it is not always clear where the bpf context can be appended
in the skb->cb[48] to avoid saving-and-restoring cb[].  Even putting
aside the difference between ipv4-vs-ipv6 and udp-vs-tcp.  It is not
clear if the lower layer is only ipv4 and ipv6 in the future and
will it not touch the cb[] again before transiting to the upper
layer.

For example, in udp_gro_receive(), it uses the 48 byte NAPI_GRO_CB
instead of IP[6]CB and it may still modify the cb[] after calling
the udp[46]_lib_lookup_skb().  Because of the above reason, if
sk->cb is used for the bpf ctx, saving-and-restoring is needed
and likely the whole 48 bytes cb[] has to be saved and restored.

Instead of saving, setting and restoring the cb[], this patch opts
to create a new "struct sk_reuseport_kern" and setting the needed
values in there.

The new BPF_PROG_TYPE_SK_REUSEPORT and "struct sk_reuseport_(kern|md)"
will serve all ipv4/ipv6 + udp/tcp combinations.  There is no protocol
specific usage at this point and it is also inline with the current
sock_reuseport.c implementation (i.e. no protocol specific requirement).

In "struct sk_reuseport_md", this patch exposes data/data_end/len
with semantic similar to other existing usages.  Together
with "bpf_skb_load_bytes()" and "bpf_skb_load_bytes_relative()",
the bpf prog can peek anywhere in the skb.  The "bind_inany" tells
the bpf prog that the reuseport group is bind-ed to a local
INANY address which cannot be learned from skb.

The new "bind_inany" is added to "struct sock_reuseport" which will be
used when running the new "BPF_PROG_TYPE_SK_REUSEPORT" bpf prog in order
to avoid repeating the "bind INANY" test on
"sk_v6_rcv_saddr/sk->sk_rcv_saddr" every time a bpf prog is run.  It can
only be properly initialized when a "sk->sk_reuseport" enabled sk is
adding to a hashtable (i.e. during "reuseport_alloc()" and
"reuseport_add_sock()").

The new "sk_select_reuseport()" is the main helper that the
bpf prog will use to select a SO_REUSEPORT sk.  It is the only function
that can use the new BPF_MAP_TYPE_REUSEPORT_ARRAY.  As mentioned in
the earlier patch, the validity of a selected sk is checked in
run time in "sk_select_reuseport()".  Doing the check in
verification time is difficult and inflexible (consider the map-in-map
use case).  The runtime check is to compare the selected sk's reuseport_id
with the reuseport_id that we want.  This helper will return -EXXX if the
selected sk cannot serve the incoming request (e.g. reuseport_id
not match).  The bpf prog can decide if it wants to do SK_DROP as its
discretion.

When the bpf prog returns SK_PASS, the kernel will check if a
valid sk has been selected (i.e. "reuse_kern->selected_sk != NULL").
If it does , it will use the selected sk.  If not, the kernel
will select one from "reuse->socks[]" (as before this patch).

The SK_DROP and SK_PASS handling logic will be in the next patch.
Signed-off-by: NMartin KaFai Lau <kafai@fb.com>
Acked-by: NAlexei Starovoitov <ast@kernel.org>
Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>

2dbb9b9e

20 6月, 2018 1 次提交

net/tcp: Fix socket lookups with SO_BINDTODEVICE · 8c43bd17

由 David Ahern 提交于 6月 18, 2018

Similar to 69678bcd ("udp: fix SO_BINDTODEVICE"), TCP socket lookups
need to fail if dev_match is not true. Currently, a packet to a given port
can match a socket bound to device when it should not. In the VRF case,
this causes the lookup to hit a VRF socket and not a global socket
resulting in a response trying to go through the VRF when it should not.

Fixes: 3fa6f616 ("net: ipv4: add second dif to inet socket lookups")
Fixes: 4297a0ef ("net: ipv6: add second dif to inet6 socket lookups")
Reported-by: NLou Berger <lberger@labn.net>
Diagnosed-by: NRenato Westphal <renato@opensourcerouting.org>
Tested-by: NRenato Westphal <renato@opensourcerouting.org>
Signed-off-by: NDavid Ahern <dsahern@gmail.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

8c43bd17

01 2月, 2018 1 次提交

inet: Avoid unitialized variable warning in inet_unhash() · 0ba98718

由 Geert Uytterhoeven 提交于 2月 01, 2018

With gcc-4.1.2:

    net/ipv4/inet_hashtables.c: In function ‘inet_unhash’:
    net/ipv4/inet_hashtables.c:628: warning: ‘ilb’ may be used uninitialized in this function

While this is a false positive, it can easily be avoided by using the
pointer itself as the canary variable.
Signed-off-by: NGeert Uytterhoeven <geert@linux-m68k.org>
Acked-by: NArnd Bergmann <arnd@arndb.de>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

0ba98718

21 12月, 2017 1 次提交

net: tracepoint: replace tcp_set_state tracepoint with inet_sock_set_state tracepoint · 563e0bb0

由 Yafang Shao 提交于 12月 20, 2017

As sk_state is a common field for struct sock, so the state
transition tracepoint should not be a TCP specific feature.
Currently it traces all AF_INET state transition, so I rename this
tracepoint to inet_sock_set_state tracepoint with some minor changes and move it
into trace/events/sock.h.
We dont need to create a file named trace/events/inet_sock.h for this one single
tracepoint.

Two helpers are introduced to trace sk_state transition
    - void inet_sk_state_store(struct sock *sk, int newstate);
    - void inet_sk_set_state(struct sock *sk, int state);
As trace header should not be included in other header files,
so they are defined in sock.c.

The protocol such as SCTP maybe compiled as a ko, hence export
inet_sk_set_state().
Signed-off-by: NYafang Shao <laoar.shao@gmail.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

563e0bb0

03 12月, 2017 2 次提交

inet: Add a 2nd listener hashtable (port+addr) · 61b7c691

由 Martin KaFai Lau 提交于 12月 01, 2017

The current listener hashtable is hashed by port only.
When a process is listening at many IP addresses with the same port (e.g.
[IP1]:443, [IP2]:443... [IPN]:443), the inet[6]_lookup_listener()
performance is degraded to a link list.  It is prone to syn attack.

UDP had a similar issue and a second hashtable was added to resolve it.

This patch adds a second hashtable for the listener's sockets.
The second hashtable is hashed by port and address.

It cannot reuse the existing skc_portaddr_node which is shared
with skc_bind_node.  TCP listener needs to use skc_bind_node.
Instead, this patch adds a hlist_node 'icsk_listen_portaddr_node' to
the inet_connection_sock which the listener (like TCP) also belongs to.

The new portaddr hashtable may need two lookup (First by IP:PORT.
Second by INADDR_ANY:PORT if the IP:PORT is a not found).   Hence,
it implements a similar cut off as UDP such that it will only consult the
new portaddr hashtable if the current port-only hashtable has >10
sk in the link-list.

lhash2 and lhash2_mask are added to 'struct inet_hashinfo'.  I take
this chance to plug a 4 bytes hole.  It is done by first moving
the existing bind_bucket_cachep up and then add the new
(int lhash2_mask, *lhash2) after the existing bhash_size.
Signed-off-by: NMartin KaFai Lau <kafai@fb.com>
Reviewed-by: NEric Dumazet <edumazet@google.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

61b7c691

inet: Add a count to struct inet_listen_hashbucket · 76d013b2

由 Martin KaFai Lau 提交于 12月 01, 2017

This patch adds a count to the 'struct inet_listen_hashbucket'.
It counts how many sk is hashed to a bucket.  It will be
used to decide if the (to-be-added) portaddr listener's hashtable
should be used during inet[6]_lookup_listener().
Signed-off-by: NMartin KaFai Lau <kafai@fb.com>
Reviewed-by: NEric Dumazet <edumazet@google.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

76d013b2

30 11月, 2017 1 次提交

net/reuseport: drop legacy code · e94a62f5

由 Paolo Abeni 提交于 11月 30, 2017

Since commit e32ea7e7 ("soreuseport: fast reuseport UDP socket
selection") and commit c125e80b ("soreuseport: fast reuseport
TCP socket selection") the relevant reuseport socket matching the current
packet is selected by the reuseport_select_sock() call. The only
exceptions are invalid BPF filters/filters returning out-of-range
indices.
In the latter case the code implicitly falls back to using the hash
demultiplexing, but instead of selecting the socket inside the
reuseport_select_sock() function, it relies on the hash selection
logic introduced with the early soreuseport implementation.

With this patch, in case of a BPF filter returning a bad socket
index value, we fall back to hash-based selection inside the
reuseport_select_sock() body, so that we can drop some duplicate
code in the ipv4 and ipv6 stack.

This also allows faster lookup in the above scenario and will allow
us to avoid computing the hash value for successful, BPF based
demultiplexing - in a later patch.
Signed-off-by: NPaolo Abeni <pabeni@redhat.com>
Acked-by: NCraig Gallek <kraig@google.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

e94a62f5

22 10月, 2017 1 次提交

soreuseport: fix initialization race · 1b5f962e

由 Craig Gallek 提交于 10月 19, 2017

Syzkaller stumbled upon a way to trigger
WARNING: CPU: 1 PID: 13881 at net/core/sock_reuseport.c:41
reuseport_alloc+0x306/0x3b0 net/core/sock_reuseport.c:39

There are two initialization paths for the sock_reuseport structure in a
socket: Through the udp/tcp bind paths of SO_REUSEPORT sockets or through
SO_ATTACH_REUSEPORT_[CE]BPF before bind. The existing implementation
assumedthat the socket lock protected both of these paths when it actually
only protects the SO_ATTACH_REUSEPORT path. Syzkaller triggered this
double allocation by running these paths concurrently.

This patch moves the check for double allocation into the reuseport_alloc
function which is protected by a global spin lock.

Fixes: e32ea7e7 ("soreuseport: fast reuseport UDP socket selection")
Fixes: c125e80b ("soreuseport: fast reuseport TCP socket selection")
Signed-off-by: NCraig Gallek <kraig@google.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

1b5f962e

08 8月, 2017 1 次提交

net: ipv4: add second dif to inet socket lookups · 3fa6f616

由 David Ahern 提交于 8月 07, 2017

Add a second device index, sdif, to inet socket lookups. sdif is the
index for ingress devices enslaved to an l3mdev. It allows the lookups
to consider the enslaved device as well as the L3 domain when searching
for a socket.

TCP moves the data in the cb. Prior to tcp_v4_rcv (e.g., early demux) the
ingress index is obtained from IPCB using inet_sdif and after the cb move
in  tcp_v4_rcv the tcp_v4_sdif helper is used.
Signed-off-by: NDavid Ahern <dsahern@gmail.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

3fa6f616

03 7月, 2017 1 次提交

net: make sk_ehashfn() static · 784c372a

由 Eric Dumazet 提交于 7月 03, 2017

sk_ehashfn() is only used from a single file.
Signed-off-by: NEric Dumazet <edumazet@google.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

784c372a

01 7月, 2017 1 次提交

net: convert sock.sk_refcnt from atomic_t to refcount_t · 41c6d650

由 Reshetova, Elena 提交于 6月 30, 2017

refcount_t type and corresponding API should be
used instead of atomic_t when the variable is used as
a reference counter. This allows to avoid accidental
refcounter overflows that might lead to use-after-free
situations.

This patch uses refcount_inc_not_zero() instead of
atomic_inc_not_zero_hint() due to absense of a _hint()
version of refcount API. If the hint() version must
be used, we might need to revisit API.
Signed-off-by: NElena Reshetova <elena.reshetova@intel.com>
Signed-off-by: NHans Liljestrand <ishkamiel@gmail.com>
Signed-off-by: NKees Cook <keescook@chromium.org>
Signed-off-by: NDavid Windsor <dwindsor@gmail.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

41c6d650

09 5月, 2017 1 次提交

treewide: use kv[mz]alloc* rather than opencoded variants · 752ade68

由 Michal Hocko 提交于 5月 08, 2017

There are many code paths opencoding kvmalloc.  Let's use the helper
instead.  The main difference to kvmalloc is that those users are
usually not considering all the aspects of the memory allocator.  E.g.
allocation requests <= 32kB (with 4kB pages) are basically never failing
and invoke OOM killer to satisfy the allocation.  This sounds too
disruptive for something that has a reasonable fallback - the vmalloc.
On the other hand those requests might fallback to vmalloc even when the
memory allocator would succeed after several more reclaim/compaction
attempts previously.  There is no guarantee something like that happens
though.

This patch converts many of those places to kv[mz]alloc* helpers because
they are more conservative.

Link: http://lkml.kernel.org/r/20170306103327.2766-2-mhocko@kernel.orgSigned-off-by: NMichal Hocko <mhocko@suse.com>
Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com> # Xen bits
Acked-by: NKees Cook <keescook@chromium.org>
Acked-by: NVlastimil Babka <vbabka@suse.cz>
Acked-by: Andreas Dilger <andreas.dilger@intel.com> # Lustre
Acked-by: Christian Borntraeger <borntraeger@de.ibm.com> # KVM/s390
Acked-by: Dan Williams <dan.j.williams@intel.com> # nvdim
Acked-by: David Sterba <dsterba@suse.com> # btrfs
Acked-by: Ilya Dryomov <idryomov@gmail.com> # Ceph
Acked-by: Tariq Toukan <tariqt@mellanox.com> # mlx4
Acked-by: Leon Romanovsky <leonro@mellanox.com> # mlx5
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Herbert Xu <herbert@gondor.apana.org.au>
Cc: Anton Vorontsov <anton@enomsg.org>
Cc: Colin Cross <ccross@android.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
Cc: Ben Skeggs <bskeggs@redhat.com>
Cc: Kent Overstreet <kent.overstreet@gmail.com>
Cc: Santosh Raspatur <santosh@chelsio.com>
Cc: Hariprasad S <hariprasad@chelsio.com>
Cc: Yishai Hadas <yishaih@mellanox.com>
Cc: Oleg Drokin <oleg.drokin@intel.com>
Cc: "Yan, Zheng" <zyan@redhat.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Cc: David Miller <davem@davemloft.net>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

752ade68

19 1月, 2017 2 次提交

inet: kill smallest_size and smallest_port · b9470c27

由 Josef Bacik 提交于 1月 17, 2017

In inet_csk_get_port we seem to be using smallest_port to figure out where the
best place to look for a SO_REUSEPORT sk that matches with an existing set of
SO_REUSEPORT's.  However if we get to the logic

if (smallest_size != -1) {
	port = smallest_port;
	goto have_port;
}

we will do a useless search, because we would have already done the
inet_csk_bind_conflict for that port and it would have returned 1, otherwise we
would have gone to found_tb and succeeded.  Since this logic makes us do yet
another trip through inet_csk_bind_conflict for a port we know won't work just
delete this code and save us the time.
Signed-off-by: NJosef Bacik <jbacik@fb.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

b9470c27

inet: collapse ipv4/v6 rcv_saddr_equal functions into one · fe38d2a1

由 Josef Bacik 提交于 1月 17, 2017

We pass these per-protocol equal functions around in various places, but
we can just have one function that checks the sk->sk_family and then do
the right comparison function.  I've also changed the ipv4 version to
not cast to inet_sock since it is unneeded.
Signed-off-by: NJosef Bacik <jbacik@fb.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

fe38d2a1

17 10月, 2016 1 次提交

net: Require exact match for TCP socket lookups if dif is l3mdev · a04a480d

由 David Ahern 提交于 10月 16, 2016

Currently, socket lookups for l3mdev (vrf) use cases can match a socket
that is bound to a port but not a device (ie., a global socket). If the
sysctl tcp_l3mdev_accept is not set this leads to ack packets going out
based on the main table even though the packet came in from an L3 domain.
The end result is that the connection does not establish creating
confusion for users since the service is running and a socket shows in
ss output. Fix by requiring an exact dif to sk_bound_dev_if match if the
skb came through an interface enslaved to an l3mdev device and the
tcp_l3mdev_accept is not set.

skb's through an l3mdev interface are marked by setting a flag in
inet{6}_skb_parm. The IPv6 variant is already set; this patch adds the
flag for IPv4. Using an skb flag avoids a device lookup on the dif. The
flag is set in the VRF driver using the IP{6}CB macros. For IPv4, the
inet_skb_parm struct is moved in the cb per commit 971f10ec, so the
match function in the TCP stack needs to use TCP_SKB_CB. For IPv6, the
move is done after the socket lookup, so IP6CB is used.

The flags field in inet_skb_parm struct needs to be increased to add
another flag. There is currently a 1-byte hole following the flags,
so it can be expanded to u16 without increasing the size of the struct.

Fixes: 193125db ("net: Introduce VRF device driver")
Signed-off-by: NDavid Ahern <dsa@cumulusnetworks.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

a04a480d

02 5月, 2016 1 次提交

soreuseport: Fix TCP listener hash collision · 90e5d0db

由 Craig Gallek 提交于 4月 28, 2016

I forgot to include a check for listener port equality when deciding
if two sockets should belong to the same reuseport group.  This was
not caught previously because it's only necessary when two listening
sockets for the same user happen to hash to the same listener bucket.
The same error does not exist in the UDP path.

Fixes: c125e80b("soreuseport: fast reuseport TCP socket selection")
Signed-off-by: NCraig Gallek <kraig@google.com>
Acked-by: NEric Dumazet <edumazet@google.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

90e5d0db

28 4月, 2016 1 次提交

net: rename NET_{ADD|INC}_STATS_BH() · 02a1d6e7

由 Eric Dumazet 提交于 4月 27, 2016

Rename NET_INC_STATS_BH() to __NET_INC_STATS()
and NET_ADD_STATS_BH() to __NET_ADD_STATS()
Signed-off-by: NEric Dumazet <edumazet@google.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

02a1d6e7

26 4月, 2016 1 次提交

soreuseport: Resolve merge conflict for v4/v6 ordering fix · d296ba60

由 Craig Gallek 提交于 4月 25, 2016

d894ba18 ("soreuseport: fix ordering for mixed v4/v6 sockets")
was merged as a bug fix to the net tree.  Two conflicting changes
were committed to net-next before the above fix was merged back to
net-next:
ca065d0c ("udp: no longer use SLAB_DESTROY_BY_RCU")
3b24d854 ("tcp/dccp: do not touch listener sk_refcnt under synflood")

These changes switched the datastructure used for TCP and UDP sockets
from hlist_nulls to hlist.  This patch applies the necessary parts
of the net tree fix to net-next which were not automatic as part of the
merge.

Fixes: 1602f49b ("Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net")
Signed-off-by: NCraig Gallek <kraig@google.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

d296ba60

08 4月, 2016 1 次提交

tcp/dccp: fix inet_reuseport_add_sock() · 85017869

由 Eric Dumazet 提交于 4月 06, 2016

David Ahern reported panics in __inet_hash() caused by my recent commit.

The reason is inet_reuseport_add_sock() was still using
sk_nulls_for_each_rcu() instead of sk_for_each_rcu().
SO_REUSEPORT enabled listeners were causing an instant crash.

While chasing this bug, I found that I forgot to clear SOCK_RCU_FREE
flag, as it is inherited from the parent at clone time.

Fixes: 3b24d854 ("tcp/dccp: do not touch listener sk_refcnt under synflood")
Signed-off-by: NEric Dumazet <edumazet@google.com>
Reported-by: NDavid Ahern <dsa@cumulusnetworks.com>
Tested-by: NDavid Ahern <dsa@cumulusnetworks.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

85017869

05 4月, 2016 2 次提交

tcp/dccp: do not touch listener sk_refcnt under synflood · 3b24d854

由 Eric Dumazet 提交于 4月 01, 2016

When a SYNFLOOD targets a non SO_REUSEPORT listener, multiple
cpus contend on sk->sk_refcnt and sk->sk_wmem_alloc changes.

By letting listeners use SOCK_RCU_FREE infrastructure,
we can relax TCP_LISTEN lookup rules and avoid touching sk_refcnt

Note that we still use SLAB_DESTROY_BY_RCU rules for other sockets,
only listeners are impacted by this change.

Peak performance under SYNFLOOD is increased by ~33% :

On my test machine, I could process 3.2 Mpps instead of 2.4 Mpps

Most consuming functions are now skb_set_owner_w() and sock_wfree()
contending on sk->sk_wmem_alloc when cooking SYNACK and freeing them.
Signed-off-by: NEric Dumazet <edumazet@google.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

3b24d854

tcp/dccp: use rcu locking in inet_diag_find_one_icsk() · 2d331915

由 Eric Dumazet 提交于 4月 01, 2016

RX packet processing holds rcu_read_lock(), so we can remove
pairs of rcu_read_lock()/rcu_read_unlock() in lookup functions
if inet_diag also holds rcu before calling them.

This is needed anyway as __inet_lookup_listener() and
inet6_lookup_listener() will soon no longer increment
refcount on the found listener.
Signed-off-by: NEric Dumazet <edumazet@google.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

2d331915

12 2月, 2016 1 次提交

tcp/dccp: better use of ephemeral ports in connect() · 1580ab63

由 Eric Dumazet 提交于 2月 11, 2016

In commit 07f4c900 ("tcp/dccp: try to not exhaust ip_local_port_range
in connect()"), I added a very simple heuristic, so that we got better
chances to use even ports, and allow bind() users to have more available
slots.

It gave nice results, but with more than 200,000 TCP sessions on a typical
server, the ~30,000 ephemeral ports are still a rare resource.

I chose to go a step further, by looking at all even ports, and if none
was available, fallback to odd ports.

The companion patch does the same in bind(), but in opposite way.

I've seen exec times of up to 30ms on busy servers, so I no longer
disable BH for the whole traversal, but only for each hash bucket.
I also call cond_resched() to be gentle to other tasks.
Signed-off-by: NEric Dumazet <edumazet@google.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

1580ab63

11 2月, 2016 3 次提交

soreuseport: fast reuseport TCP socket selection · c125e80b

由 Craig Gallek 提交于 2月 10, 2016

This change extends the fast SO_REUSEPORT socket lookup implemented
for UDP to TCP.  Listener sockets with SO_REUSEPORT and the same
receive address are additionally added to an array for faster
random access.  This means that only a single socket from the group
must be found in the listener list before any socket in the group can
be used to receive a packet.  Previously, every socket in the group
needed to be considered before handing off the incoming packet.

This feature also exposes the ability to use a BPF program when
selecting a socket from a reuseport group.
Signed-off-by: NCraig Gallek <kraig@google.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

c125e80b

inet: refactor inet[6]_lookup functions to take skb · a583636a

由 Craig Gallek 提交于 2月 10, 2016

This is a preliminary step to allow fast socket lookup of SO_REUSEPORT
groups.  Doing so with a BPF filter will require access to the
skb in question.  This change plumbs the skb (and offset to payload
data) through the call stack to the listening socket lookup
implementations where it will be used in a following patch.
Signed-off-by: NCraig Gallek <kraig@google.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

a583636a

sock: struct proto hash function may error · 086c653f

由 Craig Gallek 提交于 2月 10, 2016

In order to support fast reuseport lookups in TCP, the hash function
defined in struct proto must be capable of returning an error code.
This patch changes the function signature of all related hash functions
to return an integer and handles or propagates this return value at
all call sites.
Signed-off-by: NCraig Gallek <kraig@google.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

086c653f

23 10月, 2015 1 次提交

tcp/dccp: fix hashdance race for passive sessions · 5e0724d0

由 Eric Dumazet 提交于 10月 22, 2015

Multiple cpus can process duplicates of incoming ACK messages
matching a SYN_RECV request socket. This is a rare event under
normal operations, but definitely can happen.

Only one must win the race, otherwise corruption would occur.

To fix this without adding new atomic ops, we use logic in
inet_ehash_nolisten() to detect the request was present in the same
ehash bucket where we try to insert the new child.

If request socket was not found, we have to undo the child creation.

This actually removes a spin_lock()/spin_unlock() pair in
reqsk_queue_unlink() for the fast path.

Fixes: e994b2f0 ("tcp: do not lock listener to process SYN packets")
Fixes: 079096f1 ("tcp/dccp: install syn_recv requests into ehash table")
Signed-off-by: NEric Dumazet <edumazet@google.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

5e0724d0

15 10月, 2015 1 次提交

tcp/dccp: fix potential NULL deref in __inet_inherit_port() · c2f34a65

由 Eric Dumazet 提交于 10月 14, 2015

As we no longer hold listener lock in fast path, it is possible that a
child is created right after listener freed its bound port, if a close()
is done while incoming packets are processed.

__inet_inherit_port() must detect this and return an error,
so that caller can free the child earlier.

Fixes: e994b2f0 ("tcp: do not lock listener to process SYN packets")
Fixes: 079096f1 ("tcp/dccp: install syn_recv requests into ehash table")
Signed-off-by: NEric Dumazet <edumazet@google.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

c2f34a65

13 10月, 2015 1 次提交

net: SO_INCOMING_CPU setsockopt() support · 70da268b

由 Eric Dumazet 提交于 10月 08, 2015

SO_INCOMING_CPU as added in commit 2c8c56e1 was a getsockopt() command
to fetch incoming cpu handling a particular TCP flow after accept()

This commits adds setsockopt() support and extends SO_REUSEPORT selection
logic : If a TCP listener or UDP socket has this option set, a packet is
delivered to this socket only if CPU handling the packet matches the specified
one.

This allows to build very efficient TCP servers, using one listener per
RX queue, as the associated TCP listener should only accept flows handled
in softirq by the same cpu.
This provides optimal NUMA behavior and keep cpu caches hot.

Note that __inet_lookup_listener() still has to iterate over the list of
all listeners. Following patch puts sk_refcnt in a different cache line
to let this iteration hit only shared and read mostly cache lines.
Signed-off-by: NEric Dumazet <edumazet@google.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

70da268b

03 10月, 2015 1 次提交

tcp/dccp: install syn_recv requests into ehash table · 079096f1

由 Eric Dumazet 提交于 10月 02, 2015

In this patch, we insert request sockets into TCP/DCCP
regular ehash table (where ESTABLISHED and TIMEWAIT sockets
are) instead of using the per listener hash table.

ACK packets find SYN_RECV pseudo sockets without having
to find and lock the listener.

In nominal conditions, this halves pressure on listener lock.

Note that this will allow for SO_REUSEPORT refinements,
so that we can select a listener using cpu/numa affinities instead
of the prior 'consistent hash', since only SYN packets will
apply this selection logic.

We will shrink listen_sock in the following patch to ease
code review.
Signed-off-by: NEric Dumazet <edumazet@google.com>
Cc: Ying Cai <ycai@google.com>
Cc: Willem de Bruijn <willemb@google.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

079096f1

30 9月, 2015 1 次提交

inet: constify __inet_inherit_port() sock argument · 1ce31c9e

由 Eric Dumazet 提交于 9月 29, 2015

socket is not touched, make it const.
Signed-off-by: NEric Dumazet <edumazet@google.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

1ce31c9e

22 7月, 2015 1 次提交

tcp: suppress a division by zero warning · 89e478a2

由 Eric Dumazet 提交于 7月 22, 2015

Andrew Morton reported following warning on one ARM build
with gcc-4.4 :

net/ipv4/inet_hashtables.c: In function 'inet_ehash_locks_alloc':
net/ipv4/inet_hashtables.c:617: warning: division by zero

Even guarded with a test on sizeof(spinlock_t), compiler does not
like current construct on a !CONFIG_SMP build.

Remove the warning by using a temporary variable.

Fixes: 095dc8e0 ("tcp: fix/cleanup inet_ehash_locks_alloc()")
Reported-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NEric Dumazet <edumazet@google.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

89e478a2

10 7月, 2015 2 次提交

inet: inet_twsk_deschedule factorization · dbe7faa4

由 Eric Dumazet 提交于 7月 08, 2015

inet_twsk_deschedule() calls are followed by inet_twsk_put().

Only particular case is in inet_twsk_purge() but there is no point
to defer the inet_twsk_put() after re-enabling BH.

Lets rename inet_twsk_deschedule() to inet_twsk_deschedule_put()
and move the inet_twsk_put() inside.
Signed-off-by: NEric Dumazet <edumazet@google.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

dbe7faa4

inet: simplify timewait refcounting · fc01538f

由 Eric Dumazet 提交于 7月 08, 2015

timewait sockets have a complex refcounting logic.
Once we realize it should be similar to established and
syn_recv sockets, we can use sk_nulls_del_node_init_rcu()
and remove inet_twsk_unhash()

In particular, deferred inet_twsk_put() added in commit
13475a30 ("tcp: connect() race with timewait reuse")
looks unecessary : When removing a timewait socket from
ehash or bhash, caller must own a reference on the socket
anyway.
Signed-off-by: NEric Dumazet <edumazet@google.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

fc01538f

28 5月, 2015 2 次提交

tcp: connect() from bound sockets can be faster · e2baad9e

由 Eric Dumazet 提交于 5月 27, 2015

__inet_hash_connect() does not use its third argument (port_offset)
if socket was already bound to a source port.

No need to perform useless but expensive md5 computations.
Reported-by: NCrestez Dan Leonard <cdleonard@gmail.com>
Signed-off-by: NEric Dumazet <edumazet@google.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

e2baad9e

tcp/dccp: try to not exhaust ip_local_port_range in connect() · 07f4c900

由 Eric Dumazet 提交于 5月 24, 2015

A long standing problem on busy servers is the tiny available TCP port
range (/proc/sys/net/ipv4/ip_local_port_range) and the default
sequential allocation of source ports in connect() system call.

If a host is having a lot of active TCP sessions, chances are
very high that all ports are in use by at least one flow,
and subsequent bind(0) attempts fail, or have to scan a big portion of
space to find a slot.

In this patch, I changed the starting point in __inet_hash_connect()
so that we try to favor even [1] ports, leaving odd ports for bind()
users.

We still perform a sequential search, so there is no guarantee, but
if connect() targets are very different, end result is we leave
more ports available to bind(), and we spread them all over the range,
lowering time for both connect() and bind() to find a slot.

This strategy only works well if /proc/sys/net/ipv4/ip_local_port_range
is even, ie if start/end values have different parity.

Therefore, default /proc/sys/net/ipv4/ip_local_port_range was changed to
32768 - 60999 (instead of 32768 - 61000)

There is no change on security aspects here, only some poor hashing
schemes could be eventually impacted by this change.

[1] : The odd/even property depends on ip_local_port_range values parity
Signed-off-by: NEric Dumazet <edumazet@google.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

07f4c900

27 5月, 2015 1 次提交

tcp: fix/cleanup inet_ehash_locks_alloc() · 095dc8e0

由 Eric Dumazet 提交于 5月 26, 2015

If tcp ehash table is constrained to a very small number of buckets
(eg boot parameter thash_entries=128), then we can crash if spinlock
array has more entries.

While we are at it, un-inline inet_ehash_locks_alloc() and make
following changes :

- Budget 2 cache lines per cpu worth of 'spinlocks'
- Try to kmalloc() the array to avoid extra TLB pressure.
  (Most servers at Google allocate 8192 bytes for this hash table)
- Get rid of various #ifdef
Signed-off-by: NEric Dumazet <edumazet@google.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

095dc8e0

22 5月, 2015 1 次提交

inet_hashinfo: remove bsocket counter · f5af1f57

由 Eric Dumazet 提交于 5月 20, 2015

We no longer need bsocket atomic counter, as inet_csk_get_port()
calls bind_conflict() regardless of its value, after commit
2b05ad33 ("tcp: bind() fix autoselection to share ports")

This patch removes overhead of maintaining this counter and
double inet_csk_get_port() calls under pressure.
Signed-off-by: NEric Dumazet <edumazet@google.com>
Cc: Marcelo Ricardo Leitner <mleitner@redhat.com>
Cc: Flavio Leitner <fbl@redhat.com>
Acked-by: NFlavio Leitner <fbl@redhat.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

f5af1f57

14 4月, 2015 1 次提交

tcp/dccp: get rid of central timewait timer · 789f558c

由 Eric Dumazet 提交于 4月 12, 2015

Using a timer wheel for timewait sockets was nice ~15 years ago when
memory was expensive and machines had a single processor.

This does not scale, code is ugly and source of huge latencies
(Typically 30 ms have been seen, cpus spinning on death_lock spinlock.)

We can afford to use an extra 64 bytes per timewait sock and spread
timewait load to all cpus to have better behavior.

Tested:

On following test, /proc/sys/net/ipv4/tcp_tw_recycle is set to 1
on the target (lpaa24)

Before patch :

lpaa23:~# ./super_netperf 200 -H lpaa24 -t TCP_CC -l 60 -- -p0,0
419594

lpaa23:~# ./super_netperf 200 -H lpaa24 -t TCP_CC -l 60 -- -p0,0
437171

While test is running, we can observe 25 or even 33 ms latencies.

lpaa24:~# ping -c 1000 -i 0.02 -qn lpaa23
...
1000 packets transmitted, 1000 received, 0% packet loss, time 20601ms
rtt min/avg/max/mdev = 0.020/0.217/25.771/1.535 ms, pipe 2

lpaa24:~# ping -c 1000 -i 0.02 -qn lpaa23
...
1000 packets transmitted, 1000 received, 0% packet loss, time 20702ms
rtt min/avg/max/mdev = 0.019/0.183/33.761/1.441 ms, pipe 2

After patch :

About 90% increase of throughput :

lpaa23:~# ./super_netperf 200 -H lpaa24 -t TCP_CC -l 60 -- -p0,0
810442

lpaa23:~# ./super_netperf 200 -H lpaa24 -t TCP_CC -l 60 -- -p0,0
800992

And latencies are kept to minimal values during this load, even
if network utilization is 90% higher :

lpaa24:~# ping -c 1000 -i 0.02 -qn lpaa23
...
1000 packets transmitted, 1000 received, 0% packet loss, time 19991ms
rtt min/avg/max/mdev = 0.023/0.064/0.360/0.042 ms
Signed-off-by: NEric Dumazet <edumazet@google.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

789f558c

04 4月, 2015 1 次提交

ipv4: coding style: comparison for inequality with NULL · 00db4124

由 Ian Morris 提交于 4月 03, 2015

The ipv4 code uses a mixture of coding styles. In some instances check
for non-NULL pointer is done as x != NULL and sometimes as x. x is
preferred according to checkpatch and this patch makes the code
consistent by adopting the latter form.

No changes detected by objdiff.
Signed-off-by: NIan Morris <ipm@chirality.org.uk>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

00db4124

openeuler / Kernel 1 年多 前同步成功

openeuler / Kernel
1 年多前同步成功