- 24 3月, 2015 3 次提交
-
-
由 Eric Dumazet 提交于
listener can be source of false sharing. request sock has some useful information like : ireq->ir_iif, ireq->ir_num, ireq->ireq_net This patch does not solve the major problem of having to read sk->sk_protocol which is sharing a cache line with sk->sk_wmem_alloc. (This same field is read later in ip_build_and_send_pkt()) One idea would be to move sk_protocol close to sk_family (using 8 bits instead of 16 for sk_family seems enough) Signed-off-by: NEric Dumazet <edumazet@google.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Eric Dumazet 提交于
It is not needed, and req->sk_listener points to the listener anyway. request_sock argument can be const. Signed-off-by: NEric Dumazet <edumazet@google.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Eric Dumazet 提交于
Cache listen_sock_qlen() to limit false sharing, and read rskq_defer_accept once as it might change under us. Signed-off-by: NEric Dumazet <edumazet@google.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 21 3月, 2015 4 次提交
-
-
由 Eric Dumazet 提交于
This reverts commit ca10b9e9. No longer needed after commit eb8895de ("tcp: tcp_make_synack() should use sock_wmalloc") When under SYNFLOOD, we build lot of SYNACK and hit false sharing because of multiple modifications done on sk_listener->sk_wmem_alloc Since tcp_make_synack() uses sock_wmalloc(), there is no need to call skb_set_owner_w() again, as this adds two atomic operations. Signed-off-by: NEric Dumazet <edumazet@google.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Josh Hunt 提交于
tcp_send_fin() does not account for the memory it allocates properly, so sk_forward_alloc can be negative in cases where we've sent a FIN: ss example output (ss -amn | grep -B1 f4294): tcp FIN-WAIT-1 0 1 192.168.0.1:45520 192.0.2.1:8080 skmem:(r0,rb87380,t0,tb87380,f4294966016,w1280,o0,bl0) Acked-by: NEric Dumazet <edumazet@google.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Eric Dumazet 提交于
One of the major issue for TCP is the SYNACK rtx handling, done by inet_csk_reqsk_queue_prune(), fired by the keepalive timer of a TCP_LISTEN socket. This function runs for awful long times, with socket lock held, meaning that other cpus needing this lock have to spin for hundred of ms. SYNACK are sent in huge bursts, likely to cause severe drops anyway. This model was OK 15 years ago when memory was very tight. We now can afford to have a timer per request sock. Timer invocations no longer need to lock the listener, and can be run from all cpus in parallel. With following patch increasing somaxconn width to 32 bits, I tested a listener with more than 4 million active request sockets, and a steady SYNFLOOD of ~200,000 SYN per second. Host was sending ~830,000 SYNACK per second. This is ~100 times more what we could achieve before this patch. Later, we will get rid of the listener hash and use ehash instead. Signed-off-by: NEric Dumazet <edumazet@google.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Eric Dumazet 提交于
When request sock are put in ehash table, the whole notion of having a previous request to update dl_next is pointless. Also, following patch will get rid of big purge timer, so we want to delete a request sock without holding listener lock. Signed-off-by: NEric Dumazet <edumazet@google.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 19 3月, 2015 9 次提交
-
-
由 Eric Dumazet 提交于
On a large hash table, we can easily spend seconds to walk over all entries. Add a cond_resched() to yield cpu if necessary. Signed-off-by: NEric Dumazet <edumazet@google.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Marcelo Ricardo Leitner 提交于
in favor of their inner __ ones, which doesn't grab rtnl. As these functions need to operate on a locked socket, we can't be grabbing rtnl by then. It's too late and doing so causes reversed locking. So this patch: - move rtnl handling to callers instead while already fixing some reversed locking situations, like on vxlan and ipvs code. - renames __ ones to not have the __ mark: __ip_mc_{join,leave}_group -> ip_mc_{join,leave}_group __ipv6_sock_mc_{join,drop} -> ipv6_sock_mc_{join,drop} Signed-off-by: NMarcelo Ricardo Leitner <marcelo.leitner@gmail.com> Acked-by: NHannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Marcelo Ricardo Leitner 提交于
There are some setsockopt operations in ipv4 and ipv6 that are grabbing rtnl after having grabbed the socket lock. Yet this makes it impossible to do operations that have to lock the socket when already within a rtnl protected scope, like ndo dev_open and dev_stop. We normally take coarse grained locks first but setsockopt inverted that. So this patch invert the lock logic for these operations and makes setsockopt grab rtnl if it will be needed prior to grabbing socket lock. Signed-off-by: NMarcelo Ricardo Leitner <marcelo.leitner@gmail.com> Acked-by: NHannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Eric Dumazet 提交于
In order to be able to use sk_ehashfn() for request socks, we need to initialize their IPv6/IPv4 addresses. Signed-off-by: NEric Dumazet <edumazet@google.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Eric Dumazet 提交于
We now always call __inet_hash_nolisten(), no need to pass it as an argument. Signed-off-by: NEric Dumazet <edumazet@google.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Eric Dumazet 提交于
We can now use inet_hash() and __inet_hash() instead of private functions. Signed-off-by: NEric Dumazet <edumazet@google.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Eric Dumazet 提交于
Intent is to converge IPv4 & IPv6 inet_hash functions to factorize code. IPv4 sockets initialize sk_rcv_saddr and sk_v6_daddr in this patch, thanks to new sk_daddr_set() and sk_rcv_saddr_set() helpers. __inet6_hash can now use sk_ehashfn() instead of a private inet6_sk_ehashfn() and will simply use __inet_hash() in a following patch. Signed-off-by: NEric Dumazet <edumazet@google.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Eric Dumazet 提交于
Goal is to unify IPv4/IPv6 inet_hash handling, and use common helpers for all kind of sockets (full sockets, timewait and request sockets) inet_sk_ehashfn() becomes sk_ehashfn() but still only copes with IPv4 Signed-off-by: NEric Dumazet <edumazet@google.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Eric Dumazet 提交于
const qualifiers ease code review by making clear which objects are not written in a function. Signed-off-by: NEric Dumazet <edumazet@google.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 18 3月, 2015 9 次提交
-
-
由 Eric Dumazet 提交于
While testing last patch series, I found req sock refcounting was wrong. We must set skc_refcnt to 1 for all request socks added in hashes, but also on request sockets created by FastOpen or syncookies. It is tricky because we need to defer this initialization so that future RCU lookups do not try to take a refcount on a not yet fully initialized request socket. Also get rid of ireq_refcnt alias. Signed-off-by: NEric Dumazet <edumazet@google.com> Fixes: 13854e5a ("inet: add proper refcounting to request sock") Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Eric Dumazet 提交于
It is not because a TCP listener is FastOpen ready that all incoming sockets actually used FastOpen. Avoid taking queue->fastopenq->lock if not needed. Signed-off-by: NEric Dumazet <edumazet@google.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Eric Dumazet 提交于
The listener field in struct tcp_request_sock is a pointer back to the listener. We now have req->rsk_listener, so TCP only needs one boolean and not a full pointer. Signed-off-by: NEric Dumazet <edumazet@google.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Eric Dumazet 提交于
Once we'll be able to lookup request sockets in ehash table, we'll need to get access to listener which created this request. This avoid doing a lookup to find the listener, which benefits for a more solid SO_REUSEPORT, and is needed once we no longer queue request sock into a listener private queue. Note that 'struct tcp_request_sock'->listener could be reduced to a single bit, as TFO listener should match req->rsk_listener. TFO will no longer need to hold a reference on the listener. Signed-off-by: NEric Dumazet <edumazet@google.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Eric Dumazet 提交于
inet_reqsk_alloc() is becoming fat and should not be inlined. Signed-off-by: NEric Dumazet <edumazet@google.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Eric Dumazet 提交于
listener socket can be used to set net pointer, and will be later used to hold a reference on listener. Add a const qualifier to first argument (struct request_sock_ops *), and factorize all write_pnet(&ireq->ireq_net, sock_net(sk)); Signed-off-by: NEric Dumazet <edumazet@google.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Eric Dumazet 提交于
tcp_oow_rate_limited() is hardly used in fast path, there is no point inlining it. Signed-of-by: NEric Dumazet <edumazet@google.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Eric Dumazet 提交于
This big helper is called once from tcp_conn_request(), there is no point having it in an include. Compiler will inline it anyway. Signed-off-by: NEric Dumazet <edumazet@google.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Eric Dumazet 提交于
I got the following trace with current net-next kernel : [14723.885290] WARNING: CPU: 26 PID: 22658 at kernel/sched/core.c:7285 __might_sleep+0x89/0xa0() [14723.885325] do not call blocking ops when !TASK_RUNNING; state=1 set at [<ffffffff810e8734>] prepare_to_wait_exclusive+0x34/0xa0 [14723.885355] CPU: 26 PID: 22658 Comm: netserver Not tainted 4.0.0-dbg-DEV #1379 [14723.885359] ffffffff81a223a8 ffff881fae9e7ca8 ffffffff81650b5d 0000000000000001 [14723.885364] ffff881fae9e7cf8 ffff881fae9e7ce8 ffffffff810a72e7 0000000000000000 [14723.885367] ffffffff81a57620 000000000000093a 0000000000000000 ffff881fae9e7e64 [14723.885371] Call Trace: [14723.885377] [<ffffffff81650b5d>] dump_stack+0x4c/0x65 [14723.885382] [<ffffffff810a72e7>] warn_slowpath_common+0x97/0xe0 [14723.885386] [<ffffffff810a73e6>] warn_slowpath_fmt+0x46/0x50 [14723.885390] [<ffffffff810f4c5d>] ? trace_hardirqs_on_caller+0x10d/0x1d0 [14723.885393] [<ffffffff810e8734>] ? prepare_to_wait_exclusive+0x34/0xa0 [14723.885396] [<ffffffff810e8734>] ? prepare_to_wait_exclusive+0x34/0xa0 [14723.885399] [<ffffffff810ccdc9>] __might_sleep+0x89/0xa0 [14723.885403] [<ffffffff81581846>] lock_sock_nested+0x36/0xb0 [14723.885406] [<ffffffff815829a3>] ? release_sock+0x173/0x1c0 [14723.885411] [<ffffffff815ea1f7>] inet_csk_accept+0x157/0x2a0 [14723.885415] [<ffffffff810e8900>] ? abort_exclusive_wait+0xc0/0xc0 [14723.885419] [<ffffffff8161b96d>] inet_accept+0x2d/0x150 [14723.885424] [<ffffffff8157db6f>] SYSC_accept4+0xff/0x210 [14723.885428] [<ffffffff8165a451>] ? retint_swapgs+0xe/0x44 [14723.885431] [<ffffffff810f4c5d>] ? trace_hardirqs_on_caller+0x10d/0x1d0 [14723.885437] [<ffffffff81369c0e>] ? trace_hardirqs_on_thunk+0x3a/0x3f [14723.885441] [<ffffffff8157ef40>] SyS_accept+0x10/0x20 [14723.885444] [<ffffffff81659872>] system_call_fastpath+0x12/0x17 [14723.885447] ---[ end trace ff74cd83355b1873 ]--- In commit 26cabd31 Peter added a sched_annotate_sleep() in sk_wait_event() Is the following patch needed as well ? Alternative would be to use sk_wait_event() from inet_csk_wait_for_connect() Signed-off-by: NEric Dumazet <edumazet@google.com> Acked-by: NPeter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 17 3月, 2015 5 次提交
-
-
由 Eric Dumazet 提交于
Changes in tcp_metric hash table are protected by tcp_metrics_lock only, not by genl_mutex While we are at it use deref_locked() instead of rcu_dereference() in tcp_new() to avoid unnecessary barrier, as we hold tcp_metrics_lock as well. Reported-by: NAndrew Vagin <avagin@parallels.com> Signed-off-by: NEric Dumazet <edumazet@google.com> Fixes: 098a697b ("tcp_metrics: Use a single hash table for all network namespaces.") Reviewed-by: N"Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Eric Dumazet 提交于
reqsk_put() is the generic function that should be used to release a refcount (and automatically call reqsk_free()) reqsk_free() might be called if refcount is known to be 0 or undefined. refcnt is set to one in inet_csk_reqsk_queue_add() As request socks are not yet in global ehash table, I added temporary debugging checks in reqsk_put() and reqsk_free() Signed-off-by: NEric Dumazet <edumazet@google.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Eric Dumazet 提交于
sock_edemux() is not used in fast path, and should really call sock_gen_put() to save some code. Signed-off-by: NEric Dumazet <edumazet@google.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Eric Dumazet 提交于
inet_diag_fill_req() is renamed to inet_req_diag_fill() and moved up, so that it can be called fom sk_diag_fill() inet_diag_bc_sk() is ready to handle request socks. inet_twsk_diag_dump() is no longer needed. Signed-off-by: NEric Dumazet <edumazet@google.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Eric Dumazet 提交于
When a request socket is created, we do not cache ip route dst entry, like for timewait sockets. Let's use sk_fullsock() helper. Signed-off-by: NEric Dumazet <edumazet@google.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 15 3月, 2015 3 次提交
-
-
由 Eric Dumazet 提交于
Now the three type of sockets share a common base, we can factorize code in inet_diag_msg_common_fill(). inet_diag_entry no longer requires saddr_storage & daddr_storage and the extra copies. Signed-off-by: NEric Dumazet <edumazet@google.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Eric Dumazet 提交于
inet_sk_diag_fill() only copes with non timewait and non request socks Signed-off-by: NEric Dumazet <edumazet@google.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Eric Dumazet 提交于
Once request socks will be in ehash table, they will need to have a valid ir_iff field. This is currently true only for IPv6. This patch extends support for IPv4 as well. This means inet_diag_fill_req() can now properly use ir_iif, which is better for IPv6 link locals anyway, as request sockets and established sockets will propagate consistent netlink idiag_if. Signed-off-by: NEric Dumazet <edumazet@google.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 14 3月, 2015 1 次提交
-
-
由 Eric Dumazet 提交于
inet_diag_dump_one_icsk() allocates too small skb. Add inet_sk_attr_size() helper right before inet_sk_diag_fill() so that it can be updated if/when new attributes are added. iproute2/ss currently does not use this dump_one() interface, this might explain nobody noticed this problem yet. Signed-off-by: NEric Dumazet <edumazet@google.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 13 3月, 2015 6 次提交
-
-
由 Eric W. Biederman 提交于
Now that all of the operations are safe on a single hash table accross network namespaces, allocate a single global hash table and update the code to use it. Signed-off-by: N"Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Eric W. Biederman 提交于
Rewrite tcp_metrics_flush_all so that it can cope with entries from different network namespaces on it's hash chain. This is based on the logic in tcp_metrics_nl_cmd_del for deleting a selection of entries from a tcp metrics hash chain. Signed-off-by: N"Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Eric W. Biederman 提交于
tcp_metrics_flush_all always returns 0. Remove the unnecessary return code. Signed-off-by: N"Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Eric W. Biederman 提交于
In preparation for using one tcp metrics hash table for all network namespaces add a field tcpm_net to struct tcp_metrics_block, and verify that field on all hash table lookups. Make the field tcpm_net of type possible_net_t so it takes no space when network namespaces are disabled. Further add a function tm_net to read that field so we can be efficient when network namespaces are disabled and concise the rest of the time. Signed-off-by: N"Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Eric W. Biederman 提交于
In preparation for using one hash table for all network namespaces mix the network namespace into the hash value. Signed-off-by: N"Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Eric W. Biederman 提交于
There is not a practical way to cleanup during boot so just panic if there is a problem initializing tcp_metrics. That will at least give us a clear place to start debugging if something does go wrong. Signed-off-by: N"Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-