1. 13 10月, 2015 1 次提交
    • E
      net: align sk_refcnt on 128 bytes boundary · 8e5eb54d
      Eric Dumazet 提交于
      sk->sk_refcnt is dirtied for every TCP/UDP incoming packet.
      This is a performance issue if multiple cpus hit a common socket,
      or multiple sockets are chained due to SO_REUSEPORT.
      
      By moving sk_refcnt 8 bytes further, first 128 bytes of sockets
      are mostly read. As they contain the lookup keys, this has
      a considerable performance impact, as cpus can cache them.
      
      These 8 bytes are not wasted, we use them as a place holder
      for various fields, depending on the socket type.
      
      Tested:
       SYN flood hitting a 16 RX queues NIC.
       TCP listener using 16 sockets and SO_REUSEPORT
       and SO_INCOMING_CPU for proper siloing.
      
       Could process 6.0 Mpps SYN instead of 4.2 Mpps
      
       Kernel profile looked like :
          11.68%  [kernel]  [k] sha_transform
           6.51%  [kernel]  [k] __inet_lookup_listener
           5.07%  [kernel]  [k] __inet_lookup_established
           4.15%  [kernel]  [k] memcpy_erms
           3.46%  [kernel]  [k] ipt_do_table
           2.74%  [kernel]  [k] fib_table_lookup
           2.54%  [kernel]  [k] tcp_make_synack
           2.34%  [kernel]  [k] tcp_conn_request
           2.05%  [kernel]  [k] __netif_receive_skb_core
           2.03%  [kernel]  [k] kmem_cache_alloc
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      8e5eb54d
  2. 05 10月, 2015 2 次提交
  3. 04 10月, 2015 1 次提交
    • E
      tcp/dccp: add SLAB_DESTROY_BY_RCU flag for request sockets · e96f78ab
      Eric Dumazet 提交于
      Before letting request sockets being put in TCP/DCCP regular
      ehash table, we need to add either :
      
      - SLAB_DESTROY_BY_RCU flag to their kmem_cache
      - add RCU grace period before freeing them.
      
      Since we carefully respected the SLAB_DESTROY_BY_RCU protocol
      like ESTABLISH and TIMEWAIT sockets, use it here.
      
      req_prot_init() being only used by TCP and DCCP, I did not add
      a new slab_flags into their rsk_prot, but reuse prot->slab_flags
      
      Since all reqsk_alloc() users are correctly dealing with a failure,
      add the __GFP_NOWARN flag to avoid traces under pressure.
      
      Fixes: 079096f1 ("tcp/dccp: install syn_recv requests into ehash table")
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      e96f78ab
  4. 03 10月, 2015 8 次提交
  5. 30 9月, 2015 3 次提交
  6. 26 9月, 2015 2 次提交
  7. 06 5月, 2015 1 次提交
    • E
      tcp: provide SYN headers for passive connections · cd8ae852
      Eric Dumazet 提交于
      This patch allows a server application to get the TCP SYN headers for
      its passive connections.  This is useful if the server is doing
      fingerprinting of clients based on SYN packet contents.
      
      Two socket options are added: TCP_SAVE_SYN and TCP_SAVED_SYN.
      
      The first is used on a socket to enable saving the SYN headers
      for child connections. This can be set before or after the listen()
      call.
      
      The latter is used to retrieve the SYN headers for passive connections,
      if the parent listener has enabled TCP_SAVE_SYN.
      
      TCP_SAVED_SYN is read once, it frees the saved SYN headers.
      
      The data returned in TCP_SAVED_SYN are network (IPv4/IPv6) and TCP
      headers.
      
      Original patch was written by Tom Herbert, I changed it to not hold
      a full skb (and associated dst and conntracking reference).
      
      We have used such patch for about 3 years at Google.
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Acked-by: NNeal Cardwell <ncardwell@google.com>
      Tested-by: NNeal Cardwell <ncardwell@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      cd8ae852
  8. 24 4月, 2015 1 次提交
    • E
      inet: fix possible panic in reqsk_queue_unlink() · b357a364
      Eric Dumazet 提交于
      [ 3897.923145] BUG: unable to handle kernel NULL pointer dereference at
       0000000000000080
      [ 3897.931025] IP: [<ffffffffa9f27686>] reqsk_timer_handler+0x1a6/0x243
      
      There is a race when reqsk_timer_handler() and tcp_check_req() call
      inet_csk_reqsk_queue_unlink() on the same req at the same time.
      
      Before commit fa76ce73 ("inet: get rid of central tcp/dccp listener
      timer"), listener spinlock was held and race could not happen.
      
      To solve this bug, we change reqsk_queue_unlink() to not assume req
      must be found, and we return a status, to conditionally release a
      refcount on the request sock.
      
      This also means tcp_check_req() in non fastopen case might or not
      consume req refcount, so tcp_v6_hnd_req() & tcp_v4_hnd_req() have
      to properly handle this.
      
      (Same remark for dccp_check_req() and its callers)
      
      inet_csk_reqsk_queue_drop() is now too big to be inlined, as it is
      called 4 times in tcp and 3 times in dccp.
      
      Fixes: fa76ce73 ("inet: get rid of central tcp/dccp listener timer")
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Reported-by: NYuchung Cheng <ycheng@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      b357a364
  9. 24 3月, 2015 2 次提交
  10. 21 3月, 2015 2 次提交
    • E
      inet: get rid of central tcp/dccp listener timer · fa76ce73
      Eric Dumazet 提交于
      One of the major issue for TCP is the SYNACK rtx handling,
      done by inet_csk_reqsk_queue_prune(), fired by the keepalive
      timer of a TCP_LISTEN socket.
      
      This function runs for awful long times, with socket lock held,
      meaning that other cpus needing this lock have to spin for hundred of ms.
      
      SYNACK are sent in huge bursts, likely to cause severe drops anyway.
      
      This model was OK 15 years ago when memory was very tight.
      
      We now can afford to have a timer per request sock.
      
      Timer invocations no longer need to lock the listener,
      and can be run from all cpus in parallel.
      
      With following patch increasing somaxconn width to 32 bits,
      I tested a listener with more than 4 million active request sockets,
      and a steady SYNFLOOD of ~200,000 SYN per second.
      Host was sending ~830,000 SYNACK per second.
      
      This is ~100 times more what we could achieve before this patch.
      
      Later, we will get rid of the listener hash and use ehash instead.
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      fa76ce73
    • E
      inet: drop prev pointer handling in request sock · 52452c54
      Eric Dumazet 提交于
      When request sock are put in ehash table, the whole notion
      of having a previous request to update dl_next is pointless.
      
      Also, following patch will get rid of big purge timer,
      so we want to delete a request sock without holding listener lock.
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      52452c54
  11. 19 3月, 2015 1 次提交
  12. 18 3月, 2015 2 次提交
    • E
      inet: fix request sock refcounting · 0470c8ca
      Eric Dumazet 提交于
      While testing last patch series, I found req sock refcounting was wrong.
      
      We must set skc_refcnt to 1 for all request socks added in hashes,
      but also on request sockets created by FastOpen or syncookies.
      
      It is tricky because we need to defer this initialization so that
      future RCU lookups do not try to take a refcount on a not yet
      fully initialized request socket.
      
      Also get rid of ireq_refcnt alias.
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Fixes: 13854e5a ("inet: add proper refcounting to request sock")
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      0470c8ca
    • E
      inet: add rsk_listener field to struct request_sock · 4e9a578e
      Eric Dumazet 提交于
      Once we'll be able to lookup request sockets in ehash table,
      we'll need to get access to listener which created this request.
      
      This avoid doing a lookup to find the listener, which benefits
      for a more solid SO_REUSEPORT, and is needed once we no
      longer queue request sock into a listener private queue.
      
      Note that 'struct tcp_request_sock'->listener could be reduced
      to a single bit, as TFO listener should match req->rsk_listener.
      TFO will no longer need to hold a reference on the listener.
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      4e9a578e
  13. 17 3月, 2015 1 次提交
  14. 13 3月, 2015 2 次提交
  15. 10 10月, 2013 1 次提交
    • E
      inet: includes a sock_common in request_sock · 634fb979
      Eric Dumazet 提交于
      TCP listener refactoring, part 5 :
      
      We want to be able to insert request sockets (SYN_RECV) into main
      ehash table instead of the per listener hash table to allow RCU
      lookups and remove listener lock contention.
      
      This patch includes the needed struct sock_common in front
      of struct request_sock
      
      This means there is no more inet6_request_sock IPv6 specific
      structure.
      
      Following inet_request_sock fields were renamed as they became
      macros to reference fields from struct sock_common.
      Prefix ir_ was chosen to avoid name collisions.
      
      loc_port   -> ir_loc_port
      loc_addr   -> ir_loc_addr
      rmt_addr   -> ir_rmt_addr
      rmt_port   -> ir_rmt_port
      iif        -> ir_iif
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      634fb979
  16. 23 9月, 2013 1 次提交
  17. 23 4月, 2013 1 次提交
  18. 18 3月, 2013 1 次提交
    • C
      tcp: Remove TCPCT · 1a2c6181
      Christoph Paasch 提交于
      TCPCT uses option-number 253, reserved for experimental use and should
      not be used in production environments.
      Further, TCPCT does not fully implement RFC 6013.
      
      As a nice side-effect, removing TCPCT increases TCP's performance for
      very short flows:
      
      Doing an apache-benchmark with -c 100 -n 100000, sending HTTP-requests
      for files of 1KB size.
      
      before this patch:
      	average (among 7 runs) of 20845.5 Requests/Second
      after:
      	average (among 7 runs) of 21403.6 Requests/Second
      Signed-off-by: NChristoph Paasch <christoph.paasch@uclouvain.be>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      1a2c6181
  19. 04 11月, 2012 1 次提交
    • E
      tcp: better retrans tracking for defer-accept · e6c022a4
      Eric Dumazet 提交于
      For passive TCP connections using TCP_DEFER_ACCEPT facility,
      we incorrectly increment req->retrans each time timeout triggers
      while no SYNACK is sent.
      
      SYNACK are not sent for TCP_DEFER_ACCEPT that were established (for
      which we received the ACK from client). Only the last SYNACK is sent
      so that we can receive again an ACK from client, to move the req into
      accept queue. We plan to change this later to avoid the useless
      retransmit (and potential problem as this SYNACK could be lost)
      
      TCP_INFO later gives wrong information to user, claiming imaginary
      retransmits.
      
      Decouple req->retrans field into two independent fields :
      
      num_retrans : number of retransmit
      num_timeout : number of timeouts
      
      num_timeout is the counter that is incremented at each timeout,
      regardless of actual SYNACK being sent or not, and used to
      compute the exponential timeout.
      
      Introduce inet_rtx_syn_ack() helper to increment num_retrans
      only if ->rtx_syn_ack() succeeded.
      
      Use inet_rtx_syn_ack() from tcp_check_req() to increment num_retrans
      when we re-send a SYNACK in answer to a (retransmitted) SYN.
      Prior to this patch, we were not counting these retransmits.
      
      Change tcp_v[46]_rtx_synack() to increment TCP_MIB_RETRANSSEGS
      only if a synack packet was successfully queued.
      Reported-by: NYuchung Cheng <ycheng@google.com>
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Cc: Julian Anastasov <ja@ssi.bg>
      Cc: Vijay Subramanian <subramanian.vijay@gmail.com>
      Cc: Elliott Hughes <enh@google.com>
      Cc: Neal Cardwell <ncardwell@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      e6c022a4
  20. 01 9月, 2012 2 次提交
    • J
      tcp: TCP Fast Open Server - support TFO listeners · 8336886f
      Jerry Chu 提交于
      This patch builds on top of the previous patch to add the support
      for TFO listeners. This includes -
      
      1. allocating, properly initializing, and managing the per listener
      fastopen_queue structure when TFO is enabled
      
      2. changes to the inet_csk_accept code to support TFO. E.g., the
      request_sock can no longer be freed upon accept(), not until 3WHS
      finishes
      
      3. allowing a TCP_SYN_RECV socket to properly poll() and sendmsg()
      if it's a TFO socket
      
      4. properly closing a TFO listener, and a TFO socket before 3WHS
      finishes
      
      5. supporting TCP_FASTOPEN socket option
      
      6. modifying tcp_check_req() to use to check a TFO socket as well
      as request_sock
      
      7. supporting TCP's TFO cookie option
      
      8. adding a new SYN-ACK retransmit handler to use the timer directly
      off the TFO socket rather than the listener socket. Note that TFO
      server side will not retransmit anything other than SYN-ACK until
      the 3WHS is completed.
      
      The patch also contains an important function
      "reqsk_fastopen_remove()" to manage the somewhat complex relation
      between a listener, its request_sock, and the corresponding child
      socket. See the comment above the function for the detail.
      Signed-off-by: NH.K. Jerry Chu <hkchu@google.com>
      Cc: Yuchung Cheng <ycheng@google.com>
      Cc: Neal Cardwell <ncardwell@google.com>
      Cc: Eric Dumazet <edumazet@google.com>
      Cc: Tom Herbert <therbert@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      8336886f
    • J
      tcp: TCP Fast Open Server - header & support functions · 10467163
      Jerry Chu 提交于
      This patch adds all the necessary data structure and support
      functions to implement TFO server side. It also documents a number
      of flags for the sysctl_tcp_fastopen knob, and adds a few Linux
      extension MIBs.
      
      In addition, it includes the following:
      
      1. a new TCP_FASTOPEN socket option an application must call to
      supply a max backlog allowed in order to enable TFO on its listener.
      
      2. A number of key data structures:
      "fastopen_rsk" in tcp_sock - for a big socket to access its
      request_sock for retransmission and ack processing purpose. It is
      non-NULL iff 3WHS not completed.
      
      "fastopenq" in request_sock_queue - points to a per Fast Open
      listener data structure "fastopen_queue" to keep track of qlen (# of
      outstanding Fast Open requests) and max_qlen, among other things.
      
      "listener" in tcp_request_sock - to point to the original listener
      for book-keeping purpose, i.e., to maintain qlen against max_qlen
      as part of defense against IP spoofing attack.
      
      3. various data structure and functions, many in tcp_fastopen.c, to
      support server side Fast Open cookie operations, including
      /proc/sys/net/ipv4/tcp_fastopen_key to allow manual rekeying.
      Signed-off-by: NH.K. Jerry Chu <hkchu@google.com>
      Cc: Yuchung Cheng <ycheng@google.com>
      Cc: Neal Cardwell <ncardwell@google.com>
      Cc: Eric Dumazet <edumazet@google.com>
      Cc: Tom Herbert <therbert@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      10467163
  21. 16 9月, 2011 1 次提交
    • E
      tcp: Change possible SYN flooding messages · 946cedcc
      Eric Dumazet 提交于
      "Possible SYN flooding on port xxxx " messages can fill logs on servers.
      
      Change logic to log the message only once per listener, and add two new
      SNMP counters to track :
      
      TCPReqQFullDoCookies : number of times a SYNCOOKIE was replied to client
      
      TCPReqQFullDrop : number of times a SYN request was dropped because
      syncookies were not enabled.
      
      Based on a prior patch from Tom Herbert, and suggestions from David.
      Signed-off-by: NEric Dumazet <eric.dumazet@gmail.com>
      CC: Tom Herbert <therbert@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      946cedcc
  22. 18 1月, 2010 1 次提交
    • O
      tcp: account SYN-ACK timeouts & retransmissions · 72659ecc
      Octavian Purdila 提交于
      Currently we don't increment SYN-ACK timeouts & retransmissions
      although we do increment the same stats for SYN. We seem to have lost
      the SYN-ACK accounting with the introduction of tcp_syn_recv_timer
      (commit 2248761e in the netdev-vger-cvs tree).
      
      This patch fixes this issue. In the process we also rename the v4/v6
      syn/ack retransmit functions for clarity. We also add a new
      request_socket operations (syn_ack_timeout) so we can keep code in
      inet_connection_sock.c protocol agnostic.
      Signed-off-by: NOctavian Purdila <opurdila@ixiacom.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      72659ecc
  23. 03 12月, 2009 1 次提交
  24. 22 11月, 2008 1 次提交