Commits · ca6fb06518836ef9b65dc0aac02ff97704d52a05 · openanolis / cloud-kernel

03 Oct, 2015 10 commits

tcp: attach SYNACK messages to request sockets instead of listener · ca6fb065

Committed by Eric Dumazet on Oct 02, 2015

If a listen backlog is very big (to avoid syncookies), then
the listener sk->sk_wmem_alloc is the main source of false
sharing, as we need to touch it twice per SYNACK re-transmit
and TX completion.

(One SYN packet takes listener lock once, but up to 6 SYNACK
are generated)

By attaching the skb to the request socket, we remove this
source of contention.

Tested:

 listen(fd, 10485760); // single listener (no SO_REUSEPORT)
 16 RX/TX queue NIC
 Sustain a SYNFLOOD attack of ~320,000 SYN per second,
 Sending ~1,400,000 SYNACK per second.
 Perf profiles now show listener spinlock being next bottleneck.

    20.29%  [kernel]  [k] queued_spin_lock_slowpath
    10.06%  [kernel]  [k] __inet_lookup_established
     5.12%  [kernel]  [k] reqsk_timer_handler
     3.22%  [kernel]  [k] get_next_timer_interrupt
     3.00%  [kernel]  [k] tcp_make_synack
     2.77%  [kernel]  [k] ipt_do_table
     2.70%  [kernel]  [k] run_timer_softirq
     2.50%  [kernel]  [k] ip_finish_output
     2.04%  [kernel]  [k] cascade

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
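
The accounting change described in the commit above can be illustrated with a small user-space sketch. The types and helpers here (`toy_sock`, `toy_skb`, `toy_skb_set_owner`, `toy_skb_tx_complete`) are hypothetical stand-ins rather than kernel APIs; the sketch only shows that the socket a buffer is charged to is written twice per transmit, so charging each SYNACK to its own request socket keeps those writes off the listener's cache line.

```c
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical, simplified stand-ins; not the kernel's struct sock or sk_buff. */
struct toy_sock {
    atomic_int wmem_alloc;      /* bytes of in-flight TX buffers charged here */
};

struct toy_skb {
    struct toy_sock *owner;     /* socket charged for this buffer */
    int truesize;               /* bytes accounted for this buffer */
};

/* Transmit: charge the buffer to `owner` (first write to owner's counter). */
static void toy_skb_set_owner(struct toy_skb *skb, struct toy_sock *owner)
{
    skb->owner = owner;
    atomic_fetch_add(&owner->wmem_alloc, skb->truesize);
}

/* TX completion: uncharge the buffer (second write to the same counter). */
static void toy_skb_tx_complete(struct toy_skb *skb)
{
    atomic_fetch_sub(&skb->owner->wmem_alloc, skb->truesize);
    skb->owner = NULL;
}
```

When the owner passed to `toy_skb_set_owner` is a per-connection request object instead of the single listener, CPUs handling different connections no longer update one shared counter.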

tcp/dccp: install syn_recv requests into ehash table · 079096f1

Committed by Eric Dumazet on Oct 02, 2015

In this patch, we insert request sockets into TCP/DCCP
regular ehash table (where ESTABLISHED and TIMEWAIT sockets
are) instead of using the per listener hash table.

ACK packets find SYN_RECV pseudo sockets without having
to find and lock the listener.

In nominal conditions, this halves pressure on listener lock.

Note that this will allow for SO_REUSEPORT refinements,
so that we can select a listener using cpu/numa affinities instead
of the prior 'consistent hash', since only SYN packets will
apply this selection logic.

We will shrink listen_sock in the following patch to ease
code review.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Ying Cai <ycai@google.com>
Cc: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
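
The lookup idea in the commit above can be sketched as a single hash table keyed by the connection 4-tuple that holds pending requests alongside established sockets. Everything in this sketch (`eh_sock`, `eh_table`, `eh_hash`, `eh_insert`, `eh_lookup`, the bucket count and the hash function) is a hypothetical simplification for illustration, not the kernel's ehash implementation, and it omits all locking and RCU handling.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified types; not the kernel's structures. */
enum eh_state { EH_ESTABLISHED, EH_TIME_WAIT, EH_SYN_RECV };

struct eh_sock {
    uint32_t saddr, daddr;      /* remote and local address */
    uint16_t sport, dport;      /* remote and local port */
    enum eh_state state;
    struct eh_sock *listener;   /* set only for EH_SYN_RECV entries */
    struct eh_sock *next;       /* bucket chain */
};

#define EH_BUCKETS 256

struct eh_table {
    struct eh_sock *bucket[EH_BUCKETS];
};

/* Toy 4-tuple hash: the top 8 bits of the product select one of 256 buckets. */
static unsigned int eh_hash(uint32_t saddr, uint32_t daddr,
                            uint16_t sport, uint16_t dport)
{
    uint32_t h = saddr ^ daddr ^ (((uint32_t)sport << 16) | dport);

    return (uint32_t)(h * 2654435761u) >> 24;
}

/* Insert any socket-like entry, including a pending request. */
static void eh_insert(struct eh_table *t, struct eh_sock *sk)
{
    unsigned int i = eh_hash(sk->saddr, sk->daddr, sk->sport, sk->dport);

    sk->next = t->bucket[i];
    t->bucket[i] = sk;
}

/* An incoming ACK finds the pending request directly by its 4-tuple. */
static struct eh_sock *eh_lookup(const struct eh_table *t,
                                 uint32_t saddr, uint32_t daddr,
                                 uint16_t sport, uint16_t dport)
{
    struct eh_sock *sk = t->bucket[eh_hash(saddr, daddr, sport, dport)];

    for (; sk; sk = sk->next)
        if (sk->saddr == saddr && sk->daddr == daddr &&
            sk->sport == sport && sk->dport == dport)
            return sk;
    return NULL;
}
```

A caller that gets back an entry in state `EH_SYN_RECV` can reach the listener through the entry's `listener` pointer, so the listener does not have to be looked up or locked first.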

tcp/dccp: remove inet_csk_reqsk_queue_added() timeout argument · 2feda341

Committed by Eric Dumazet on Oct 02, 2015

This is no longer used.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

tcp: get_openreq[46]() changes · aa3a0c8c

Committed by Eric Dumazet on Oct 02, 2015

When request sockets are no longer in a per listener hash table
but on regular TCP ehash, we need to access listener uid
through req->rsk_listener

get_openreq6() also gets a const for its request socket argument.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

tcp: remove BUG_ON() in tcp_check_req() · 9cfd0860

Committed by Eric Dumazet on Oct 02, 2015

Once listener is lockless, its sk_state can change anytime.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

tcp: cleanup tcp_v[46]_inbound_md5_hash() · ba8e275a

Committed by Eric Dumazet on Oct 02, 2015

We'll soon have to call tcp_v[46]_inbound_md5_hash() twice.
Also add const attribute to the socket, as it might be the
unlocked listener for SYN packets.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

tcp: call sk_mark_napi_id() on the child, not the listener · 38cb5245

Committed by Eric Dumazet on Oct 02, 2015

This fixes a typo : We want to store the NAPI id on child socket.
Presumably nobody really uses busy polling, on short lived flows.

Fixes: 3d97379a ("tcp: move sk_mark_napi_id() at the right place")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

tcp: move synflood_warned into struct request_sock_queue · 8d2675f1

Committed by Eric Dumazet on Oct 02, 2015

long term plan is to remove struct listen_sock when its hash
table is no longer there.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

tcp: move qlen/young out of struct listen_sock · aac065c5

Committed by Eric Dumazet on Oct 02, 2015

qlen_inc & young_inc were protected by listener lock,
while qlen_dec & young_dec were atomic fields.

Everything needs to be atomic for upcoming lockless listener.

Also move qlen/young in request_sock_queue as we'll get rid
of struct listen_sock eventually.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
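
A rough user-space sketch of the counter change in the commit above, using C11 atomics in place of the kernel's `atomic_t`. The names (`rq_queue`, `rq_added`, `rq_aged`, `rq_removed`) are hypothetical, and the sketch assumes that "young" counts requests whose SYNACK has not yet been retransmitted.

```c
#include <stdatomic.h>

/* Hypothetical stand-in for the counters kept in the request queue. */
struct rq_queue {
    atomic_int qlen;    /* pending requests */
    atomic_int young;   /* pending requests never retransmitted (assumption) */
};

static void rq_init(struct rq_queue *q)
{
    atomic_init(&q->qlen, 0);
    atomic_init(&q->young, 0);
}

/* New request queued: both increments are atomic, no listener lock needed. */
static void rq_added(struct rq_queue *q)
{
    atomic_fetch_add(&q->young, 1);
    atomic_fetch_add(&q->qlen, 1);
}

/* First retransmit timer expiry: the request stops counting as young. */
static void rq_aged(struct rq_queue *q)
{
    atomic_fetch_sub(&q->young, 1);
}

/* Request removed; `was_young` says whether it still counted as young. */
static void rq_removed(struct rq_queue *q, int was_young)
{
    if (was_young)
        atomic_fetch_sub(&q->young, 1);
    atomic_fetch_sub(&q->qlen, 1);
}

static int rq_len(struct rq_queue *q)
{
    return atomic_load(&q->qlen);
}

static int rq_young(struct rq_queue *q)
{
    return atomic_load(&q->young);
}
```

With every update atomic, increments and decrements may run on different CPUs without any shared lock, which is the property the commit message asks for.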

tcp: add a spinlock to protect struct request_sock_queue · fff1f300

Committed by Eric Dumazet on Oct 02, 2015

struct request_sock_queue fields are currently protected
by the listener 'lock' (not a real spinlock)

We need to add a private spinlock instead, so that softirq handlers
creating children do not have to worry with backlog notion
that the listener 'lock' carries.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
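
As an illustration of the idea in the commit above, here is a minimal sketch of a queue that carries its own spinlock instead of relying on the lock of the socket that owns it. It uses a C11 `atomic_flag` as a toy spinlock; all names (`sq_queue`, `sq_req`, `sq_add`, `sq_remove`) are hypothetical, and the kernel's spinlock primitives behave differently (for example with respect to softirq context).

```c
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical queue entry standing in for a request or child socket. */
struct sq_req {
    struct sq_req *next;
};

struct sq_queue {
    atomic_flag lock;           /* private lock, independent of the owner */
    struct sq_req *head;
    struct sq_req *tail;
};

static void sq_init(struct sq_queue *q)
{
    atomic_flag_clear(&q->lock);
    q->head = q->tail = NULL;
}

/* Toy spinlock: busy-wait until the flag was previously clear. */
static void sq_lock(struct sq_queue *q)
{
    while (atomic_flag_test_and_set_explicit(&q->lock, memory_order_acquire))
        ;
}

static void sq_unlock(struct sq_queue *q)
{
    atomic_flag_clear_explicit(&q->lock, memory_order_release);
}

/* Append under the queue's own lock. */
static void sq_add(struct sq_queue *q, struct sq_req *req)
{
    sq_lock(q);
    req->next = NULL;
    if (q->tail)
        q->tail->next = req;
    else
        q->head = req;
    q->tail = req;
    sq_unlock(q);
}

/* Pop the oldest entry, or NULL when the queue is empty. */
static struct sq_req *sq_remove(struct sq_queue *q)
{
    struct sq_req *req;

    sq_lock(q);
    req = q->head;
    if (req) {
        q->head = req->next;
        if (!q->head)
            q->tail = NULL;
    }
    sq_unlock(q);
    return req;
}
```

Because the lock lives in the queue, a producer only has to take this one short-lived lock and never has to reason about the state of the owning socket's lock.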

30 Sep, 2015 23 commits

net: Initialize flow flags in input path · b84f7878

Committed by David Ahern on Sep 29, 2015

The fib_table_lookup tracepoint found 2 places where the flowi4_flags is
not initialized.

Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: Replace calls to vrf_dev_get_rth · 8e1ed705

Committed by David Ahern on Sep 29, 2015

Replace calls to vrf_dev_get_rth with l3mdev_get_rtable.
The check on the flow flags is handled in the l3mdev operation.

Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: Replace vrf_dev_table and friends · 3236b004

Committed by David Ahern on Sep 29, 2015

Replace calls to vrf_dev_table and friends with l3mdev_fib_table
and kin.

Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: Replace vrf_master_ifindex{, _rcu} with l3mdev equivalents · 385add90

Committed by David Ahern on Sep 29, 2015

Replace calls to vrf_master_ifindex_rcu and vrf_master_ifindex with either
l3mdev_master_ifindex_rcu or l3mdev_master_ifindex.

The pattern:
    oif = vrf_master_ifindex(dev) ? : dev->ifindex;
is replaced with
    oif = l3mdev_fib_oif(dev);

And remove the now unused vrf macros.

Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
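
The `a ? : b` form in the pattern quoted above is the GNU C conditional with an omitted middle operand: it yields `a` when `a` is non-zero and `b` otherwise. A minimal sketch of a helper with that behaviour, written in standard C, is shown below. The `l3_dev` type and the `l3_master_ifindex` and `l3_fib_oif` helpers are hypothetical simplifications, not the kernel's `struct net_device` or its real l3mdev functions.

```c
/* Hypothetical, simplified device; not the kernel's struct net_device. */
struct l3_dev {
    int ifindex;            /* index of this device */
    int master_ifindex;     /* index of its L3 master, 0 if it has none */
};

/* Return the master's ifindex, or 0 when the device has no L3 master. */
static int l3_master_ifindex(const struct l3_dev *dev)
{
    return dev->master_ifindex;
}

/* Prefer the master's ifindex; fall back to the device's own ifindex. */
static int l3_fib_oif(const struct l3_dev *dev)
{
    int master = l3_master_ifindex(dev);

    return master ? master : dev->ifindex;
}
```

Wrapping the fallback in one helper means each call site no longer needs to spell out the conditional itself.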

net: Rename IFF_VRF_MASTER to IFF_L3MDEV_MASTER · 007979ea

Committed by David Ahern on Sep 29, 2015

Rename IFF_VRF_MASTER to IFF_L3MDEV_MASTER and update the name of the
netif_is_vrf and netif_index_is_vrf macros.

Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

tcp: prepare fastopen code for upcoming listener changes · 0536fcc0