- 21 March 2015 (2 commits)
-
Submitted by Eric Dumazet
One of the major issues for TCP is the SYNACK rtx handling, done by inet_csk_reqsk_queue_prune(), fired by the keepalive timer of a TCP_LISTEN socket. This function runs for awfully long times, with the socket lock held, meaning that other cpus needing this lock have to spin for hundreds of ms. SYNACKs are sent in huge bursts, likely to cause severe drops anyway. This model was OK 15 years ago when memory was very tight. We can now afford to have a timer per request sock. Timer invocations no longer need to lock the listener, and can be run from all cpus in parallel. With the following patch increasing somaxconn width to 32 bits, I tested a listener with more than 4 million active request sockets, and a steady SYNFLOOD of ~200,000 SYN per second. The host was sending ~830,000 SYNACK per second. This is ~100 times more than what we could achieve before this patch. Later, we will get rid of the listener hash and use ehash instead. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
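A kernel-flavored sketch of the mechanism described here; every name and field below is illustrative, not the actual patch. The point is structural: each request sock owns its own retransmit timer, so the handler never takes the listener lock and different requests can fire on different cpus.

```c
#include <linux/timer.h>
#include <linux/jiffies.h>

/* Illustrative request sock: a private timer replaces the global
 * inet_csk_reqsk_queue_prune() walk done under the listener lock.
 */
struct my_request_sock {
	struct timer_list	rsk_timer;
	int			num_retrans;
};

/* Timer callback: runs in softirq context on whatever cpu armed it,
 * without the listener socket lock.
 */
static void my_reqsk_timer_handler(unsigned long data)
{
	struct my_request_sock *req = (struct my_request_sock *)data;

	if (++req->num_retrans < 5)
		/* retransmit the SYNACK and re-arm */
		mod_timer(&req->rsk_timer, jiffies + HZ);
	/* otherwise: drop the request */
}

static void my_reqsk_start_timer(struct my_request_sock *req)
{
	setup_timer(&req->rsk_timer, my_reqsk_timer_handler,
		    (unsigned long)req);
	mod_timer(&req->rsk_timer, jiffies + HZ);
}
```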
-
Submitted by Eric Dumazet
When request socks are put in the ehash table, the whole notion of keeping a previous request to update dl_next is pointless. Also, the following patch will get rid of the big purge timer, so we want to be able to delete a request sock without holding the listener lock. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
- 19 March 2015 (9 commits)
-
Submitted by Eric Dumazet
On a large hash table, we can easily spend seconds walking over all entries. Add a cond_resched() to yield the cpu if necessary. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
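The pattern being added is roughly this sketch; the bucket type and per-bucket helper are assumptions for illustration, while cond_resched() is the real primitive.

```c
#include <linux/sched.h>

struct my_bucket;				/* illustrative */
void my_process_bucket(struct my_bucket *b);	/* assumed helper */

/* Walk a large table in process context, yielding the cpu between
 * buckets so a multi-second walk cannot stall other tasks.
 */
static void my_walk_big_table(struct my_bucket **table, unsigned int slots)
{
	unsigned int i;

	for (i = 0; i < slots; i++) {
		my_process_bucket(table[i]);
		cond_resched();
	}
}
```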
-
Submitted by Marcelo Ricardo Leitner
Drop the outer multicast join/leave wrappers in favor of their inner __ ones, which don't grab rtnl. As these functions need to operate on a locked socket, we can't be grabbing rtnl by then; it's too late, and doing so causes reversed locking. So this patch: - moves rtnl handling to the callers instead, while also fixing some reversed-locking situations, as in the vxlan and ipvs code; - renames the __ ones so they no longer carry the __ mark: __ip_mc_{join,leave}_group -> ip_mc_{join,leave}_group __ipv6_sock_mc_{join,drop} -> ipv6_sock_mc_{join,drop} Signed-off-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Submitted by Marcelo Ricardo Leitner
Some setsockopt operations in ipv4 and ipv6 grab rtnl after having grabbed the socket lock. Yet this makes it impossible to perform operations that have to lock the socket when already within an rtnl-protected scope, like ndo dev_open and dev_stop. We normally take coarse-grained locks first, but setsockopt inverted that. So this patch inverts the lock logic for these operations and makes setsockopt grab rtnl, if it will be needed, prior to grabbing the socket lock. Signed-off-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: David S. Miller <davem@davemloft.net>
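In sketch form, the corrected ordering looks like this; the function is illustrative, while rtnl_lock()/lock_sock() are the real primitives.

```c
#include <linux/rtnetlink.h>
#include <net/sock.h>

/* Take the coarse-grained lock (rtnl) first and the per-socket lock
 * second, matching the order used by ndo dev_open/dev_stop paths.
 */
static int my_setsockopt_needing_rtnl(struct sock *sk)
{
	int err = 0;

	rtnl_lock();
	lock_sock(sk);

	/* ... perform the multicast join/leave ... */

	release_sock(sk);
	rtnl_unlock();
	return err;
}
```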
-
Submitted by Eric Dumazet
In order to be able to use sk_ehashfn() for request socks, we need to initialize their IPv6/IPv4 addresses. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Submitted by Eric Dumazet
We now always call __inet_hash_nolisten(), so there is no need to pass it as an argument. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Submitted by Eric Dumazet
We can now use inet_hash() and __inet_hash() instead of private functions. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Submitted by Eric Dumazet
The intent is to converge the IPv4 & IPv6 inet_hash functions to factorize code. IPv4 sockets initialize sk_rcv_saddr and sk_v6_daddr in this patch, thanks to the new sk_daddr_set() and sk_rcv_saddr_set() helpers. __inet6_hash can now use sk_ehashfn() instead of a private inet6_sk_ehashfn(), and will simply use __inet_hash() in a following patch. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
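A sketch of what such a helper plausibly does, not verified against the actual patch: store the IPv4 destination and mirror it into the IPv6 field as a v4-mapped address, so a single ehash function can serve both families.

```c
#include <net/sock.h>
#include <net/ipv6.h>

/* Keep the IPv4 address and its v4-mapped IPv6 twin in sync, so
 * sk_ehashfn() can treat IPv4 and IPv6 sockets uniformly.
 */
static inline void my_sk_daddr_set(struct sock *sk, __be32 addr)
{
	sk->sk_daddr = addr;
#if IS_ENABLED(CONFIG_IPV6)
	ipv6_addr_set_v4mapped(addr, &sk->sk_v6_daddr);
#endif
}
```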
-
Submitted by Eric Dumazet
The goal is to unify IPv4/IPv6 inet_hash handling and use common helpers for all kinds of sockets (full sockets, timewait and request sockets). inet_sk_ehashfn() becomes sk_ehashfn(), but still only copes with IPv4. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Submitted by Eric Dumazet
const qualifiers ease code review by making clear which objects are not written to in a function. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
- 18 March 2015 (8 commits)
-
Submitted by Eric Dumazet
While testing the last patch series, I found that req sock refcounting was wrong. We must set skc_refcnt to 1 for all request socks added to hashes, but also for request sockets created by FastOpen or syncookies. It is tricky because we need to defer this initialization so that future RCU lookups do not try to take a refcount on a not yet fully initialized request socket. Also get rid of the ireq_refcnt alias. Signed-off-by: Eric Dumazet <edumazet@google.com> Fixes: 13854e5a ("inet: add proper refcounting to request sock") Signed-off-by: David S. Miller <davem@davemloft.net>
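The deferred-initialization constraint can be sketched as follows, with illustrative types: the refcount becomes non-zero only once the object is fully built, and publication uses an RCU-safe insertion that orders the stores for concurrent lookups doing atomic_inc_not_zero().

```c
#include <linux/rculist.h>
#include <linux/atomic.h>
#include <linux/types.h>

struct my_req {
	u32			saddr, daddr;
	atomic_t		refcnt;
	struct hlist_node	node;
};

/* Caller fully initializes the fields first; we set the refcount and
 * only then publish.  hlist_add_head_rcu() provides the write barrier,
 * so an RCU lookup can never take a reference on a half-built request.
 */
static void my_publish_req(struct my_req *req, struct hlist_head *chain)
{
	atomic_set(&req->refcnt, 1);
	hlist_add_head_rcu(&req->node, chain);
}
```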
-
Submitted by Eric Dumazet
A TCP listener being FastOpen-ready does not mean that every incoming socket actually used FastOpen. Avoid taking queue->fastopenq->lock if it is not needed. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
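A minimal sketch of the resulting fast path, with illustrative names: test a per-request flag before touching the fastopen queue lock.

```c
#include <linux/spinlock.h>
#include <linux/types.h>

struct my_fastopen_queue { spinlock_t lock; };
struct my_req { bool tfo_listener; };

/* Common case: the request did not use TCP Fast Open, so there is
 * nothing to unlink and no reason to take the queue lock.
 */
static void my_reqsk_fastopen_remove(struct my_req *req,
				     struct my_fastopen_queue *q)
{
	if (!req->tfo_listener)
		return;

	spin_lock_bh(&q->lock);
	/* ... unlink the request from the fastopen queue ... */
	spin_unlock_bh(&q->lock);
}
```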
-
Submitted by Eric Dumazet
The listener field in struct tcp_request_sock is a pointer back to the listener. We now have req->rsk_listener, so TCP only needs one boolean and not a full pointer. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Submitted by Eric Dumazet
Once we are able to look up request sockets in the ehash table, we'll need access to the listener which created a given request. This avoids doing a lookup to find the listener, which benefits a more solid SO_REUSEPORT, and is needed once we no longer queue request socks into a listener-private queue. Note that 'struct tcp_request_sock'->listener could be reduced to a single bit, as the TFO listener should match req->rsk_listener. TFO will no longer need to hold a reference on the listener. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Submitted by Eric Dumazet
inet_reqsk_alloc() is becoming fat and should not be inlined. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Submitted by Eric Dumazet
The listener socket can be used to set the net pointer, and will later be used to hold a reference on the listener. Add a const qualifier to the first argument (struct request_sock_ops *), and factorize all the write_pnet(&ireq->ireq_net, sock_net(sk)); calls. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Submitted by Eric Dumazet
tcp_oow_rate_limited() is hardly used in the fast path; there is no point inlining it. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Submitted by Eric Dumazet
This big helper is called once from tcp_conn_request(); there is no point having it in an include file. The compiler will inline it anyway. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
- 17 March 2015 (5 commits)
-
Submitted by Eric Dumazet
Changes in the tcp_metrics hash table are protected by tcp_metrics_lock only, not by genl_mutex. While we are at it, use deref_locked() instead of rcu_dereference() in tcp_new() to avoid an unnecessary barrier, as we hold tcp_metrics_lock as well. Reported-by: Andrew Vagin <avagin@parallels.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Fixes: 098a697b ("tcp_metrics: Use a single hash table for all network namespaces.") Reviewed-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: David S. Miller <davem@davemloft.net>
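deref_locked() presumably wraps rcu_dereference_protected(); the general pattern, sketched with illustrative names, is to skip the read-side barrier and instead let lockdep verify that the update-side lock is held.

```c
#include <linux/rcupdate.h>
#include <linux/spinlock.h>

struct my_entry;

static DEFINE_SPINLOCK(my_metrics_lock);
static struct my_entry __rcu *my_hash_slot;

/* Safe because writers serialize on my_metrics_lock: no read barrier
 * is needed, and lockdep will complain if the lock is not held.
 */
static struct my_entry *my_deref_locked(void)
{
	return rcu_dereference_protected(my_hash_slot,
					 lockdep_is_held(&my_metrics_lock));
}
```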
-
Submitted by Eric Dumazet
reqsk_put() is the generic function that should be used to release a refcount (and automatically call reqsk_free()). reqsk_free() may only be called if the refcount is known to be 0 or undefined. The refcnt is set to one in inet_csk_reqsk_queue_add(). As request socks are not yet in the global ehash table, I added temporary debugging checks in reqsk_put() and reqsk_free(). Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
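The contract between the two functions can be sketched like this (simplified, illustrative names):

```c
#include <linux/atomic.h>
#include <linux/slab.h>
#include <linux/bug.h>

struct my_req { atomic_t refcnt; };

/* Only legal when the caller knows no other reference can exist. */
static void my_reqsk_free(struct my_req *req)
{
	WARN_ON_ONCE(atomic_read(&req->refcnt) != 0);	/* debug check */
	kfree(req);
}

/* The generic release: drop one reference, free on the last one. */
static void my_reqsk_put(struct my_req *req)
{
	if (atomic_dec_and_test(&req->refcnt))
		my_reqsk_free(req);
}
```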
-
Submitted by Eric Dumazet
sock_edemux() is not used in the fast path, and should really call sock_gen_put() to save some code. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Submitted by Eric Dumazet
inet_diag_fill_req() is renamed to inet_req_diag_fill() and moved up, so that it can be called from sk_diag_fill(). inet_diag_bc_sk() is ready to handle request socks. inet_twsk_diag_dump() is no longer needed. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Submitted by Eric Dumazet
When a request socket is created, we do not cache the ip route dst entry, just as for timewait sockets. Let's use the sk_fullsock() helper. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
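A sketch consistent with the helper's description; the exact state mask is an assumption, not taken from the patch.

```c
#include <net/sock.h>
#include <net/tcp_states.h>

/* Request and timewait sockets are "mini" sockets; everything else
 * is a full socket that, for instance, caches a dst entry.
 */
static inline bool my_sk_fullsock(const struct sock *sk)
{
	return (1 << sk->sk_state) & ~(TCPF_TIME_WAIT | TCPF_NEW_SYN_RECV);
}
```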
-
- 15 March 2015 (3 commits)
-
Submitted by Eric Dumazet
Now that the three types of sockets share a common base, we can factorize code in inet_diag_msg_common_fill(). inet_diag_entry no longer requires saddr_storage & daddr_storage and the extra copies. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Submitted by Eric Dumazet
inet_sk_diag_fill() only copes with non-timewait and non-request socks. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Submitted by Eric Dumazet
Once request socks are in the ehash table, they will need to have a valid ir_iif field. This is currently true only for IPv6; this patch extends support to IPv4 as well. This means inet_diag_fill_req() can now properly use ir_iif, which is better for IPv6 link-locals anyway, as request sockets and established sockets will propagate a consistent netlink idiag_if. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
- 13 March 2015 (13 commits)
-
Submitted by Eric W. Biederman
Now that all of the operations are safe on a single hash table across network namespaces, allocate a single global hash table and update the code to use it. Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Submitted by Eric W. Biederman
Rewrite tcp_metrics_flush_all so that it can cope with entries from different network namespaces on its hash chain. This is based on the logic in tcp_metrics_nl_cmd_del for deleting a selection of entries from a tcp metrics hash chain. Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Submitted by Eric W. Biederman
tcp_metrics_flush_all always returns 0. Remove the unnecessary return code. Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Submitted by Eric W. Biederman
In preparation for using one tcp metrics hash table for all network namespaces, add a field tcpm_net to struct tcp_metrics_block, and verify that field on all hash table lookups. Make the field tcpm_net of type possible_net_t so it takes no space when network namespaces are disabled. Further, add a function tm_net to read that field, so we can be efficient when network namespaces are disabled and concise the rest of the time. Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: David S. Miller <davem@davemloft.net>
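In sketch form (the layout is illustrative), the struct and its accessor might look like:

```c
#include <net/net_namespace.h>

struct my_tcp_metrics_block {
	struct my_tcp_metrics_block __rcu	*next;
	possible_net_t				tcpm_net;
	/* ... the metrics themselves ... */
};

/* When CONFIG_NET_NS is off, read_pnet() returns &init_net
 * unconditionally, so the net_eq() check in lookups folds away.
 */
static inline struct net *my_tm_net(const struct my_tcp_metrics_block *tm)
{
	return read_pnet(&tm->tcpm_net);
}

/* lookup sketch:  if (!net_eq(my_tm_net(tm), net)) continue; */
```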
-
Submitted by Eric W. Biederman
In preparation for using one hash table for all network namespaces, mix the network namespace into the hash value. Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: David S. Miller <davem@davemloft.net>
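A sketch of the idea; the helper names here are assumptions. Fold a per-namespace value into the bucket hash so entries from different namespaces spread across the shared table instead of piling onto the same chains.

```c
#include <linux/hash.h>
#include <linux/types.h>
#include <net/netns/hash.h>

/* Mix the namespace into the hash: two identical peer addresses in
 * different namespaces now usually land in different buckets.
 */
static unsigned int my_tcpm_hash(struct net *net, u32 addr,
				 unsigned int hash_log)
{
	u32 hash = addr;

	hash ^= net_hash_mix(net);
	return hash_32(hash, hash_log);
}
```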
-
Submitted by Eric W. Biederman
There is no practical way to clean up during boot, so just panic if there is a problem initializing tcp_metrics. That will at least give us a clear place to start debugging if something does go wrong. Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Submitted by Eric Dumazet
Before inserting request socks into the general hash table, fill in their socket family. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Submitted by Eric Dumazet
ireq->ir_num contains the local port; use it. Also, get_openreq4() dumping listen_sk->refcnt makes little sense. inet_diag_fill_req() can also use ireq->ir_num. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Submitted by Eric Dumazet
sock_edemux() & sock_gen_put() should be ready to cope with request socks. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Submitted by Eric Dumazet
I forgot to update dccp_v6_conn_request() & cookie_v6_check(). They both need to set ireq->ireq_net and ireq->ir_cookie. Let's clear ireq->ir_cookie in inet_reqsk_alloc(). Signed-off-by: Eric Dumazet <edumazet@google.com> Fixes: 33cf7c90 ("net: add real socket cookies") Signed-off-by: David S. Miller <davem@davemloft.net>
-
Submitted by Alexander Duyck
This change makes it so that we always have a deterministic ordering for the main and local aliases within the merged table when two leaves overlap. So, for example, take a leaf with a key of 192.168.254.0: if we previously added two aliases with a prefix length of 24 from both local and main, the first entry added would be ordered first and the second one second. When I was coding this I had added a WARN_ON should such a situation occur, as I wasn't sure how likely it would be. However, this WARN_ON has been triggered, so this is something that should be addressed. With this patch the ordering of the aliases is as follows: first they are sorted on prefix length, then on their table ID, then tos, and finally priority. This way, what we end up doing is essentially interleaving the two tables on what used to be leaf_info structure boundaries. Fixes: 0ddcf43d ("ipv4: FIB Local/MAIN table collapse") Reported-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Alexander Duyck <alexander.h.duyck@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
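Expressed as a comparison function, the ordering reads like this sketch; the field names are illustrative and the sort direction is arbitrary here.

```c
#include <linux/types.h>

struct my_alias {
	unsigned char	slen;		/* prefix length */
	u32		tb_id;		/* table ID (local vs main) */
	u8		tos;
	u32		priority;
};

/* Keys compared in order of significance: prefix length first,
 * then table ID, then tos, and finally priority.
 */
static int my_fib_alias_cmp(const struct my_alias *a,
			    const struct my_alias *b)
{
	if (a->slen != b->slen)
		return a->slen < b->slen ? -1 : 1;
	if (a->tb_id != b->tb_id)
		return a->tb_id < b->tb_id ? -1 : 1;
	if (a->tos != b->tos)
		return a->tos < b->tos ? -1 : 1;
	if (a->priority != b->priority)
		return a->priority < b->priority ? -1 : 1;
	return 0;
}
```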
-
Submitted by Alexander Duyck
The function fib_unmerge assumed the local table had already been allocated. If that is not the case, however, such as when custom rules are applied, this can result in a NULL pointer dereference. In order to prevent this, we must check the value of the local table pointer and, if it is NULL, simply return 0, as there is no local table to separate from the main one. Fixes: 0ddcf43d ("ipv4: FIB Local/MAIN table collapse") Reported-by: Madhu Challa <challa@noironetworks.com> Signed-off-by: Alexander Duyck <alexander.h.duyck@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
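The fix amounts to an early NULL check; a sketch:

```c
struct fib_table;	/* opaque here */

/* With custom rules the local table may never have been allocated;
 * in that case there is nothing to separate from the main table.
 */
static int my_fib_unmerge(struct fib_table *local_table)
{
	if (!local_table)
		return 0;

	/* ... move local entries out of the merged main table ... */
	return 0;
}
```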
-
Submitted by Eric W. Biederman
Having to say
> #ifdef CONFIG_NET_NS
> struct net *net;
> #endif
in structures is a little bit wordy and a little bit error-prone. Instead it is possible to say:
> typedef struct {
> #ifdef CONFIG_NET_NS
> struct net *net;
> #endif
> } possible_net_t;
And then in a header say:
> possible_net_t net;
which is cleaner, easier to use and easier to test, as the possible_net_t is always there no matter what the compile options are. Further, this allows read_pnet and write_pnet to be functions in all cases, which is better at catching typos. This change adds possible_net_t, updates the definitions of read_pnet and write_pnet, updates the optional struct net * variables that write_pnet operates on to have the type possible_net_t, and finally fixes up the b0rked users of read_pnet and write_pnet. Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
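The accessors presumably reduce to something like this sketch; the &init_net fallback is the usual convention when namespaces are compiled out.

```c
struct net;
extern struct net init_net;

typedef struct {
#ifdef CONFIG_NET_NS
	struct net *net;	/* only present with namespaces enabled */
#endif
} possible_net_t;

static inline void write_pnet(possible_net_t *pnet, struct net *net)
{
#ifdef CONFIG_NET_NS
	pnet->net = net;
#endif
}

static inline struct net *read_pnet(const possible_net_t *pnet)
{
#ifdef CONFIG_NET_NS
	return pnet->net;
#else
	return &init_net;	/* the single namespace */
#endif
}
```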
-