1. 23 9月, 2017 1 次提交
  2. 13 9月, 2017 1 次提交
    • E
      tcp/dccp: remove reqsk_put() from inet_child_forget() · da8ab578
      Eric Dumazet 提交于
      Back in linux-4.4, I inadvertently put a call to reqsk_put() in
      inet_child_forget(), forgetting it could be called from two different
      points.
      
      In the case it is called from inet_csk_reqsk_queue_add(), we want to
      keep the reference on the request socket, since it is released later by
      the caller (tcp_v{4|6}_rcv())
      
      This bug never showed up because atomic_dec_and_test() was not signaling
      the underflow, and SLAB_DESTROY_BY RCU semantic for request sockets
      prevented the request to be put in quarantine.
      
      Recent conversion of socket refcount from atomic_t to refcount_t finally
      exposed the bug.
      
      So move the reqsk_put() to inet_csk_listen_stop() to fix this.
      
      Thanks to Shankara Pailoor for using syzkaller and providing
      a nice set of .config and C repro.
      
      WARNING: CPU: 2 PID: 4277 at lib/refcount.c:186
      refcount_sub_and_test+0x167/0x1b0 lib/refcount.c:186
      Kernel panic - not syncing: panic_on_warn set ...
      
      CPU: 2 PID: 4277 Comm: syz-executor0 Not tainted 4.13.0-rc7 #3
      Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS
      Ubuntu-1.8.2-1ubuntu1 04/01/2014
      Call Trace:
       <IRQ>
       __dump_stack lib/dump_stack.c:16 [inline]
       dump_stack+0xf7/0x1aa lib/dump_stack.c:52
       panic+0x1ae/0x3a7 kernel/panic.c:180
       __warn+0x1c4/0x1d9 kernel/panic.c:541
       report_bug+0x211/0x2d0 lib/bug.c:183
       fixup_bug+0x40/0x90 arch/x86/kernel/traps.c:190
       do_trap_no_signal arch/x86/kernel/traps.c:224 [inline]
       do_trap+0x260/0x390 arch/x86/kernel/traps.c:273
       do_error_trap+0x118/0x340 arch/x86/kernel/traps.c:310
       do_invalid_op+0x1b/0x20 arch/x86/kernel/traps.c:323
       invalid_op+0x18/0x20 arch/x86/entry/entry_64.S:846
      RIP: 0010:refcount_sub_and_test+0x167/0x1b0 lib/refcount.c:186
      RSP: 0018:ffff88006e006b60 EFLAGS: 00010286
      RAX: 0000000000000026 RBX: 0000000000000000 RCX: 0000000000000000
      RDX: 0000000000000026 RSI: 1ffff1000dc00d2c RDI: ffffed000dc00d60
      RBP: ffff88006e006bf0 R08: 0000000000000001 R09: 0000000000000000
      R10: 0000000000000000 R11: 0000000000000000 R12: 1ffff1000dc00d6d
      R13: 00000000ffffffff R14: 0000000000000001 R15: ffff88006ce9d340
       refcount_dec_and_test+0x1a/0x20 lib/refcount.c:211
       reqsk_put+0x71/0x2b0 include/net/request_sock.h:123
       tcp_v4_rcv+0x259e/0x2e20 net/ipv4/tcp_ipv4.c:1729
       ip_local_deliver_finish+0x2e2/0xba0 net/ipv4/ip_input.c:216
       NF_HOOK include/linux/netfilter.h:248 [inline]
       ip_local_deliver+0x1ce/0x6d0 net/ipv4/ip_input.c:257
       dst_input include/net/dst.h:477 [inline]
       ip_rcv_finish+0x8db/0x19c0 net/ipv4/ip_input.c:397
       NF_HOOK include/linux/netfilter.h:248 [inline]
       ip_rcv+0xc3f/0x17d0 net/ipv4/ip_input.c:488
       __netif_receive_skb_core+0x1fb7/0x31f0 net/core/dev.c:4298
       __netif_receive_skb+0x2c/0x1b0 net/core/dev.c:4336
       process_backlog+0x1c5/0x6d0 net/core/dev.c:5102
       napi_poll net/core/dev.c:5499 [inline]
       net_rx_action+0x6d3/0x14a0 net/core/dev.c:5565
       __do_softirq+0x2cb/0xb2d kernel/softirq.c:284
       do_softirq_own_stack+0x1c/0x30 arch/x86/entry/entry_64.S:898
       </IRQ>
       do_softirq.part.16+0x63/0x80 kernel/softirq.c:328
       do_softirq kernel/softirq.c:176 [inline]
       __local_bh_enable_ip+0x84/0x90 kernel/softirq.c:181
       local_bh_enable include/linux/bottom_half.h:31 [inline]
       rcu_read_unlock_bh include/linux/rcupdate.h:705 [inline]
       ip_finish_output2+0x8ad/0x1360 net/ipv4/ip_output.c:231
       ip_finish_output+0x74e/0xb80 net/ipv4/ip_output.c:317
       NF_HOOK_COND include/linux/netfilter.h:237 [inline]
       ip_output+0x1cc/0x850 net/ipv4/ip_output.c:405
       dst_output include/net/dst.h:471 [inline]
       ip_local_out+0x95/0x160 net/ipv4/ip_output.c:124
       ip_queue_xmit+0x8c6/0x1810 net/ipv4/ip_output.c:504
       tcp_transmit_skb+0x1963/0x3320 net/ipv4/tcp_output.c:1123
       tcp_send_ack.part.35+0x38c/0x620 net/ipv4/tcp_output.c:3575
       tcp_send_ack+0x49/0x60 net/ipv4/tcp_output.c:3545
       tcp_rcv_synsent_state_process net/ipv4/tcp_input.c:5795 [inline]
       tcp_rcv_state_process+0x4876/0x4b60 net/ipv4/tcp_input.c:5930
       tcp_v4_do_rcv+0x58a/0x820 net/ipv4/tcp_ipv4.c:1483
       sk_backlog_rcv include/net/sock.h:907 [inline]
       __release_sock+0x124/0x360 net/core/sock.c:2223
       release_sock+0xa4/0x2a0 net/core/sock.c:2715
       inet_wait_for_connect net/ipv4/af_inet.c:557 [inline]
       __inet_stream_connect+0x671/0xf00 net/ipv4/af_inet.c:643
       inet_stream_connect+0x58/0xa0 net/ipv4/af_inet.c:682
       SYSC_connect+0x204/0x470 net/socket.c:1628
       SyS_connect+0x24/0x30 net/socket.c:1609
       entry_SYSCALL_64_fastpath+0x18/0xad
      RIP: 0033:0x451e59
      RSP: 002b:00007f474843fc08 EFLAGS: 00000216 ORIG_RAX: 000000000000002a
      RAX: ffffffffffffffda RBX: 0000000000718000 RCX: 0000000000451e59
      RDX: 0000000000000010 RSI: 0000000020002000 RDI: 0000000000000007
      RBP: 0000000000000046 R08: 0000000000000000 R09: 0000000000000000
      R10: 0000000000000000 R11: 0000000000000216 R12: 0000000000000000
      R13: 00007ffc040a0f8f R14: 00007f47484409c0 R15: 0000000000000000
      
      Fixes: ebb516af ("tcp/dccp: fix race at listener dismantle phase")
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Reported-by: NShankara Pailoor <sp3485@columbia.edu>
      Tested-by: NShankara Pailoor <sp3485@columbia.edu>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      da8ab578
  3. 01 7月, 2017 1 次提交
  4. 05 6月, 2017 1 次提交
  5. 22 5月, 2017 1 次提交
  6. 10 5月, 2017 1 次提交
  7. 10 3月, 2017 1 次提交
    • D
      net: Work around lockdep limitation in sockets that use sockets · cdfbabfb
      David Howells 提交于
      Lockdep issues a circular dependency warning when AFS issues an operation
      through AF_RXRPC from a context in which the VFS/VM holds the mmap_sem.
      
      The theory lockdep comes up with is as follows:
      
       (1) If the pagefault handler decides it needs to read pages from AFS, it
           calls AFS with mmap_sem held and AFS begins an AF_RXRPC call, but
           creating a call requires the socket lock:
      
      	mmap_sem must be taken before sk_lock-AF_RXRPC
      
       (2) afs_open_socket() opens an AF_RXRPC socket and binds it.  rxrpc_bind()
           binds the underlying UDP socket whilst holding its socket lock.
           inet_bind() takes its own socket lock:
      
      	sk_lock-AF_RXRPC must be taken before sk_lock-AF_INET
      
       (3) Reading from a TCP socket into a userspace buffer might cause a fault
           and thus cause the kernel to take the mmap_sem, but the TCP socket is
           locked whilst doing this:
      
      	sk_lock-AF_INET must be taken before mmap_sem
      
      However, lockdep's theory is wrong in this instance because it deals only
      with lock classes and not individual locks.  The AF_INET lock in (2) isn't
      really equivalent to the AF_INET lock in (3) as the former deals with a
      socket entirely internal to the kernel that never sees userspace.  This is
      a limitation in the design of lockdep.
      
      Fix the general case by:
      
       (1) Double up all the locking keys used in sockets so that one set are
           used if the socket is created by userspace and the other set is used
           if the socket is created by the kernel.
      
       (2) Store the kern parameter passed to sk_alloc() in a variable in the
           sock struct (sk_kern_sock).  This informs sock_lock_init(),
           sock_init_data() and sk_clone_lock() as to the lock keys to be used.
      
           Note that the child created by sk_clone_lock() inherits the parent's
           kern setting.
      
       (3) Add a 'kern' parameter to ->accept() that is analogous to the one
           passed in to ->create() that distinguishes whether kernel_accept() or
           sys_accept4() was the caller and can be passed to sk_alloc().
      
           Note that a lot of accept functions merely dequeue an already
           allocated socket.  I haven't touched these as the new socket already
           exists before we get the parameter.
      
           Note also that there are a couple of places where I've made the accepted
           socket unconditionally kernel-based:
      
      	irda_accept()
      	rds_rcp_accept_one()
      	tcp_accept_from_sock()
      
           because they follow a sock_create_kern() and accept off of that.
      
      Whilst creating this, I noticed that lustre and ocfs don't create sockets
      through sock_create_kern() and thus they aren't marked as for-kernel,
      though they appear to be internal.  I wonder if these should do that so
      that they use the new set of lock keys.
      Signed-off-by: NDavid Howells <dhowells@redhat.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      cdfbabfb
  8. 21 1月, 2017 2 次提交
  9. 19 1月, 2017 6 次提交
    • J
      inet: reset tb->fastreuseport when adding a reuseport sk · 637bc8bb
      Josef Bacik 提交于
      If we have non reuseport sockets on a tb we will set tb->fastreuseport to 0 and
      never set it again.  Which means that in the future if we end up adding a bunch
      of reuseport sk's to that tb we'll have to do the expensive scan every time.
      Instead add the ipv4/ipv6 saddr fields to the bind bucket, as well as the family
      so we know what comparison to make, and the ipv6 only setting so we can make
      sure to compare with new sockets appropriately.  Once one sk has made it onto
      the list we know that there are no potential bind conflicts on the owners list
      that match that sk's rcv_addr.  So copy the sk's information into our bind
      bucket and set tb->fastruseport to FASTREUSESOCK_STRICT so we know we have to do
      an extra check for subsequent reuseport sockets and skip the expensive bind
      conflict check.
      Signed-off-by: NJosef Bacik <jbacik@fb.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      637bc8bb
    • J
      inet: split inet_csk_get_port into two functions · 289141b7
      Josef Bacik 提交于
      inet_csk_get_port does two different things, it either scans for an open port,
      or it tries to see if the specified port is available for use.  Since these two
      operations have different rules and are basically independent lets split them
      into two different functions to make them both more readable.
      Signed-off-by: NJosef Bacik <jbacik@fb.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      289141b7
    • J
      inet: don't check for bind conflicts twice when searching for a port · 6cd66616
      Josef Bacik 提交于
      This is just wasted time, we've already found a tb that doesn't have a bind
      conflict, and we don't drop the head lock so scanning again isn't going to give
      us a different answer.  Instead move the tb->reuse setting logic outside of the
      found_tb path and put it in the success: path.  Then make it so that we don't
      goto again if we find a bind conflict in the found_tb path as we won't reach
      this anymore when we are scanning for an ephemeral port.
      Signed-off-by: NJosef Bacik <jbacik@fb.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      6cd66616
    • J
      inet: kill smallest_size and smallest_port · b9470c27
      Josef Bacik 提交于
      In inet_csk_get_port we seem to be using smallest_port to figure out where the
      best place to look for a SO_REUSEPORT sk that matches with an existing set of
      SO_REUSEPORT's.  However if we get to the logic
      
      if (smallest_size != -1) {
      	port = smallest_port;
      	goto have_port;
      }
      
      we will do a useless search, because we would have already done the
      inet_csk_bind_conflict for that port and it would have returned 1, otherwise we
      would have gone to found_tb and succeeded.  Since this logic makes us do yet
      another trip through inet_csk_bind_conflict for a port we know won't work just
      delete this code and save us the time.
      Signed-off-by: NJosef Bacik <jbacik@fb.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      b9470c27
    • J
      inet: drop ->bind_conflict · aa078842
      Josef Bacik 提交于
      The only difference between inet6_csk_bind_conflict and inet_csk_bind_conflict
      is how they check the rcv_saddr, so delete this call back and simply
      change inet_csk_bind_conflict to call inet_rcv_saddr_equal.
      Signed-off-by: NJosef Bacik <jbacik@fb.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      aa078842
    • J
      inet: collapse ipv4/v6 rcv_saddr_equal functions into one · fe38d2a1
      Josef Bacik 提交于
      We pass these per-protocol equal functions around in various places, but
      we can just have one function that checks the sk->sk_family and then do
      the right comparison function.  I've also changed the ipv4 version to
      not cast to inet_sock since it is unneeded.
      Signed-off-by: NJosef Bacik <jbacik@fb.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      fe38d2a1
  10. 18 12月, 2016 2 次提交
    • T
      inet: Fix get port to handle zero port number with soreuseport set · 0643ee4f
      Tom Herbert 提交于
      A user may call listen with binding an explicit port with the intent
      that the kernel will assign an available port to the socket. In this
      case inet_csk_get_port does a port scan. For such sockets, the user may
      also set soreuseport with the intent a creating more sockets for the
      port that is selected. The problem is that the initial socket being
      opened could inadvertently choose an existing and unreleated port
      number that was already created with soreuseport.
      
      This patch adds a boolean parameter to inet_bind_conflict that indicates
      rather soreuseport is allowed for the check (in addition to
      sk->sk_reuseport). In calls to inet_bind_conflict from inet_csk_get_port
      the argument is set to true if an explicit port is being looked up (snum
      argument is nonzero), and is false if port scan is done.
      Signed-off-by: NTom Herbert <tom@herbertland.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      0643ee4f
    • T
      inet: Don't go into port scan when looking for specific bind port · 9af7e923
      Tom Herbert 提交于
      inet_csk_get_port is called with port number (snum argument) that may be
      zero or nonzero. If it is zero, then the intent is to find an available
      ephemeral port number to bind to. If snum is non-zero then the caller
      is asking to allocate a specific port number. In the latter case we
      never want to perform the scan in ephemeral port range. It is
      conceivable that this can happen if the "goto again" in "tb_found:"
      is done. This patch adds a check that snum is zero before doing
      the "goto again".
      Signed-off-by: NTom Herbert <tom@herbertland.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      9af7e923
  11. 05 11月, 2016 1 次提交
    • L
      net: inet: Support UID-based routing in IP protocols. · e2d118a1
      Lorenzo Colitti 提交于
      - Use the UID in routing lookups made by protocol connect() and
        sendmsg() functions.
      - Make sure that routing lookups triggered by incoming packets
        (e.g., Path MTU discovery) take the UID of the socket into
        account.
      - For packets not associated with a userspace socket, (e.g., ping
        replies) use UID 0 inside the user namespace corresponding to
        the network namespace the socket belongs to. This allows
        all namespaces to apply routing and iptables rules to
        kernel-originated traffic in that namespaces by matching UID 0.
        This is better than using the UID of the kernel socket that is
        sending the traffic, because the UID of kernel sockets created
        at namespace creation time (e.g., the per-processor ICMP and
        TCP sockets) is the UID of the user that created the socket,
        which might not be mapped in the namespace.
      
      Tested: compiles allnoconfig, allyesconfig, allmodconfig
      Tested: https://android-review.googlesource.com/253302Signed-off-by: NLorenzo Colitti <lorenzo@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      e2d118a1
  12. 07 7月, 2016 1 次提交
  13. 05 5月, 2016 1 次提交
  14. 28 4月, 2016 1 次提交
  15. 08 4月, 2016 1 次提交
  16. 25 2月, 2016 1 次提交
  17. 19 2月, 2016 1 次提交
    • E
      tcp/dccp: fix another race at listener dismantle · 7716682c
      Eric Dumazet 提交于
      Ilya reported following lockdep splat:
      
      kernel: =========================
      kernel: [ BUG: held lock freed! ]
      kernel: 4.5.0-rc1-ceph-00026-g5e0a311 #1 Not tainted
      kernel: -------------------------
      kernel: swapper/5/0 is freeing memory
      ffff880035c9d200-ffff880035c9dbff, with a lock still held there!
      kernel: (&(&queue->rskq_lock)->rlock){+.-...}, at:
      [<ffffffff816f6a88>] inet_csk_reqsk_queue_add+0x28/0xa0
      kernel: 4 locks held by swapper/5/0:
      kernel: #0:  (rcu_read_lock){......}, at: [<ffffffff8169ef6b>]
      netif_receive_skb_internal+0x4b/0x1f0
      kernel: #1:  (rcu_read_lock){......}, at: [<ffffffff816e977f>]
      ip_local_deliver_finish+0x3f/0x380
      kernel: #2:  (slock-AF_INET){+.-...}, at: [<ffffffff81685ffb>]
      sk_clone_lock+0x19b/0x440
      kernel: #3:  (&(&queue->rskq_lock)->rlock){+.-...}, at:
      [<ffffffff816f6a88>] inet_csk_reqsk_queue_add+0x28/0xa0
      
      To properly fix this issue, inet_csk_reqsk_queue_add() needs
      to return to its callers if the child as been queued
      into accept queue.
      
      We also need to make sure listener is still there before
      calling sk->sk_data_ready(), by holding a reference on it,
      since the reference carried by the child can disappear as
      soon as the child is put on accept queue.
      Reported-by: NIlya Dryomov <idryomov@gmail.com>
      Fixes: ebb516af ("tcp/dccp: fix race at listener dismantle phase")
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      7716682c
  18. 12 2月, 2016 1 次提交
  19. 11 2月, 2016 2 次提交
    • C
      soreuseport: fast reuseport TCP socket selection · c125e80b
      Craig Gallek 提交于
      This change extends the fast SO_REUSEPORT socket lookup implemented
      for UDP to TCP.  Listener sockets with SO_REUSEPORT and the same
      receive address are additionally added to an array for faster
      random access.  This means that only a single socket from the group
      must be found in the listener list before any socket in the group can
      be used to receive a packet.  Previously, every socket in the group
      needed to be considered before handing off the incoming packet.
      
      This feature also exposes the ability to use a BPF program when
      selecting a socket from a reuseport group.
      Signed-off-by: NCraig Gallek <kraig@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      c125e80b
    • C
      sock: struct proto hash function may error · 086c653f
      Craig Gallek 提交于
      In order to support fast reuseport lookups in TCP, the hash function
      defined in struct proto must be capable of returning an error code.
      This patch changes the function signature of all related hash functions
      to return an integer and handles or propagates this return value at
      all call sites.
      Signed-off-by: NCraig Gallek <kraig@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      086c653f
  20. 08 2月, 2016 1 次提交
  21. 16 11月, 2015 1 次提交
  22. 23 10月, 2015 1 次提交
    • E
      tcp/dccp: fix hashdance race for passive sessions · 5e0724d0
      Eric Dumazet 提交于
      Multiple cpus can process duplicates of incoming ACK messages
      matching a SYN_RECV request socket. This is a rare event under
      normal operations, but definitely can happen.
      
      Only one must win the race, otherwise corruption would occur.
      
      To fix this without adding new atomic ops, we use logic in
      inet_ehash_nolisten() to detect the request was present in the same
      ehash bucket where we try to insert the new child.
      
      If request socket was not found, we have to undo the child creation.
      
      This actually removes a spin_lock()/spin_unlock() pair in
      reqsk_queue_unlink() for the fast path.
      
      Fixes: e994b2f0 ("tcp: do not lock listener to process SYN packets")
      Fixes: 079096f1 ("tcp/dccp: install syn_recv requests into ehash table")
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      5e0724d0
  23. 16 10月, 2015 2 次提交
  24. 15 10月, 2015 1 次提交
  25. 07 10月, 2015 1 次提交
  26. 05 10月, 2015 1 次提交
    • E
      inet: fix race in reqsk_queue_unlink() · 2306c704
      Eric Dumazet 提交于
      reqsk_timer_handler() tests if icsk_accept_queue.listen_opt
      is NULL at its beginning.
      
      By the time it calls inet_csk_reqsk_queue_drop() and
      reqsk_queue_unlink(), listener might have been closed and
      inet_csk_listen_stop() had called reqsk_queue_yank_acceptq()
      which sets icsk_accept_queue.listen_opt to NULL
      
      We therefore need to correctly check listen_opt being NULL
      after holding syn_wait_lock for proper synchronization.
      
      Fixes: fa76ce73 ("inet: get rid of central tcp/dccp listener timer")
      Fixes: b357a364 ("inet: fix possible panic in reqsk_queue_unlink()")
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Cc: Yuchung Cheng <ycheng@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      2306c704
  27. 03 10月, 2015 5 次提交
    • E
      tcp/dccp: add a reschedule point in inet_csk_listen_stop() · 92d6f176
      Eric Dumazet 提交于
      If a listener with thousands of children in accept queue
      is dismantled, it can take a while to close all of them.
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      92d6f176
    • E
      tcp: remove max_qlen_log · ef547f2a
      Eric Dumazet 提交于
      This control variable was set at first listen(fd, backlog)
      call, but not updated if application tried to increase or decrease
      backlog. It made sense at the time listener had a non resizeable
      hash table.
      
      Also rounding to powers of two was not very friendly.
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      ef547f2a
    • E
      tcp/dccp: remove struct listen_sock · 10cbc8f1
      Eric Dumazet 提交于
      It is enough to check listener sk_state, no need for an extra
      condition.
      
      max_qlen_log can be moved into struct request_sock_queue
      
      We can remove syn_wait_lock and the alignment it enforced.
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      10cbc8f1
    • E
      tcp: attach SYNACK messages to request sockets instead of listener · ca6fb065
      Eric Dumazet 提交于
      If a listen backlog is very big (to avoid syncookies), then
      the listener sk->sk_wmem_alloc is the main source of false
      sharing, as we need to touch it twice per SYNACK re-transmit
      and TX completion.
      
      (One SYN packet takes listener lock once, but up to 6 SYNACK
      are generated)
      
      By attaching the skb to the request socket, we remove this
      source of contention.
      
      Tested:
      
       listen(fd, 10485760); // single listener (no SO_REUSEPORT)
       16 RX/TX queue NIC
       Sustain a SYNFLOOD attack of ~320,000 SYN per second,
       Sending ~1,400,000 SYNACK per second.
       Perf profiles now show listener spinlock being next bottleneck.
      
          20.29%  [kernel]  [k] queued_spin_lock_slowpath
          10.06%  [kernel]  [k] __inet_lookup_established
           5.12%  [kernel]  [k] reqsk_timer_handler
           3.22%  [kernel]  [k] get_next_timer_interrupt
           3.00%  [kernel]  [k] tcp_make_synack
           2.77%  [kernel]  [k] ipt_do_table
           2.70%  [kernel]  [k] run_timer_softirq
           2.50%  [kernel]  [k] ip_finish_output
           2.04%  [kernel]  [k] cascade
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      ca6fb065
    • E
      tcp/dccp: install syn_recv requests into ehash table · 079096f1
      Eric Dumazet 提交于
      In this patch, we insert request sockets into TCP/DCCP
      regular ehash table (where ESTABLISHED and TIMEWAIT sockets
      are) instead of using the per listener hash table.
      
      ACK packets find SYN_RECV pseudo sockets without having
      to find and lock the listener.
      
      In nominal conditions, this halves pressure on listener lock.
      
      Note that this will allow for SO_REUSEPORT refinements,
      so that we can select a listener using cpu/numa affinities instead
      of the prior 'consistent hash', since only SYN packets will
      apply this selection logic.
      
      We will shrink listen_sock in the following patch to ease
      code review.
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Cc: Ying Cai <ycai@google.com>
      Cc: Willem de Bruijn <willemb@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      079096f1