1. 14 April 2015 (1 commit)
    • tcp/dccp: get rid of central timewait timer · 789f558c
      Authored by Eric Dumazet
      Using a timer wheel for timewait sockets was nice ~15 years ago when
      memory was expensive and machines had a single processor.
      
      This does not scale, the code is ugly, and it is a source of huge latencies
      (typically 30 ms have been observed, with cpus spinning on the death_lock spinlock).
      
      We can afford to use an extra 64 bytes per timewait sock and spread
      the timewait load across all cpus for better behavior.
      
      Tested:
      
      In the following test, /proc/sys/net/ipv4/tcp_tw_recycle is set to 1
      on the target (lpaa24).
      
      Before patch:
      
      lpaa23:~# ./super_netperf 200 -H lpaa24 -t TCP_CC -l 60 -- -p0,0
      419594
      
      lpaa23:~# ./super_netperf 200 -H lpaa24 -t TCP_CC -l 60 -- -p0,0
      437171
      
      While the test is running, we can observe 25 or even 33 ms latencies.
      
      lpaa24:~# ping -c 1000 -i 0.02 -qn lpaa23
      ...
      1000 packets transmitted, 1000 received, 0% packet loss, time 20601ms
      rtt min/avg/max/mdev = 0.020/0.217/25.771/1.535 ms, pipe 2
      
      lpaa24:~# ping -c 1000 -i 0.02 -qn lpaa23
      ...
      1000 packets transmitted, 1000 received, 0% packet loss, time 20702ms
      rtt min/avg/max/mdev = 0.019/0.183/33.761/1.441 ms, pipe 2
      
      After patch:
      
      About a 90% increase in throughput:
      
      lpaa23:~# ./super_netperf 200 -H lpaa24 -t TCP_CC -l 60 -- -p0,0
      810442
      
      lpaa23:~# ./super_netperf 200 -H lpaa24 -t TCP_CC -l 60 -- -p0,0
      800992
      
      And latencies are kept to minimal values during this load, even
      though network utilization is 90% higher:
      
      lpaa24:~# ping -c 1000 -i 0.02 -qn lpaa23
      ...
      1000 packets transmitted, 1000 received, 0% packet loss, time 19991ms
      rtt min/avg/max/mdev = 0.023/0.064/0.360/0.042 ms
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      789f558c
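      
      The core of the change is replacing the shared timewait timer wheel
      (guarded by death_lock) with one timer embedded in each timewait sock,
      so expirations run on whichever cpu armed them. Below is a minimal
      kernel-style sketch of that pattern; names are illustrative rather than
      the patch's exact ones, and it uses the modern timer API (the 2015 patch
      itself predates timer_setup()).
      
      #include <linux/jiffies.h>
      #include <linux/slab.h>
      #include <linux/timer.h>
      
      /* Hypothetical, trimmed-down timewait sock: the extra ~64 bytes are
       * essentially this embedded timer, one per tw sock instead of one
       * central wheel. */
      struct tw_sock_sketch {
              struct timer_list tw_timer;
              /* ... protocol state (ports, addresses, timestamps) ... */
      };
      
      static void tw_timer_expired(struct timer_list *t)
      {
              struct tw_sock_sketch *tw = from_timer(tw, t, tw_timer);
      
              /* Runs on the cpu that armed the timer: no global death_lock,
               * no single cpu walking every timewait socket. */
              kfree(tw);      /* stand-in for unhash + release */
      }
      
      static void tw_schedule(struct tw_sock_sketch *tw, unsigned long timeo)
      {
              timer_setup(&tw->tw_timer, tw_timer_expired, 0);
              mod_timer(&tw->tw_timer, jiffies + timeo);
      }
      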
  2. 01 April 2015 (2 commits)
  3. 30 March 2015 (1 commit)
  4. 25 March 2015 (4 commits)
  5. 24 March 2015 (2 commits)
  6. 21 March 2015 (2 commits)
    • inet: get rid of central tcp/dccp listener timer · fa76ce73
      Authored by Eric Dumazet
      One of the major issues for TCP is the SYNACK rtx handling,
      done by inet_csk_reqsk_queue_prune(), fired by the keepalive
      timer of a TCP_LISTEN socket.
      
      This function runs for awfully long times with the socket lock held,
      meaning that other cpus needing this lock have to spin for hundreds of ms.
      
      SYNACKs are sent in huge bursts, likely causing severe drops anyway.
      
      This model was OK 15 years ago when memory was very tight.
      
      We now can afford to have a timer per request sock.
      
      Timer invocations no longer need to lock the listener,
      and can be run from all cpus in parallel.
      
      With the following patch increasing somaxconn width to 32 bits,
      I tested a listener with more than 4 million active request sockets,
      and a steady SYNFLOOD of ~200,000 SYN per second.
      The host was sending ~830,000 SYNACK per second.
      
      This is ~100 times more than what we could achieve before this patch.
      
      Later, we will get rid of the listener hash and use ehash instead.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      fa76ce73
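      
      The same pattern is applied to pending connection requests: each request
      sock carries its own timer that retransmits the SYNACK and re-arms itself
      with backoff, so no listener lock is taken and expirations run on all
      cpus in parallel. A hedged kernel-style sketch (illustrative names and
      constants, not the patch's exact code):
      
      #include <linux/jiffies.h>
      #include <linux/kernel.h>
      #include <linux/timer.h>
      
      /* Hypothetical request sock: each pending connection has its own timer. */
      struct reqsk_sketch {
              struct timer_list rsk_timer;
              unsigned int num_retrans;
              /* ... addresses, ISN, negotiated options ... */
      };
      
      static void reqsk_timer_expired(struct timer_list *t)
      {
              struct reqsk_sketch *req = from_timer(req, t, rsk_timer);
      
              if (req->num_retrans++ >= 5) {  /* e.g. tcp_synack_retries */
                      /* too many unanswered SYNACKs: drop the request */
                      return;
              }
      
              /* ... retransmit the SYNACK; the listener lock is not needed ... */
      
              /* exponential backoff, capped, re-arming this request's own timer */
              mod_timer(&req->rsk_timer,
                        jiffies + min_t(unsigned long,
                                        (1UL * HZ) << req->num_retrans, 120UL * HZ));
      }
      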
    • inet: drop prev pointer handling in request sock · 52452c54
      Authored by Eric Dumazet
      When request socks are put in the ehash table, the whole notion
      of having a previous request to update dl_next is pointless.
      
      Also, a following patch will get rid of the big purge timer,
      so we want to be able to delete a request sock without holding the listener lock.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      52452c54
  7. 19 March 2015 (2 commits)
  8. 17 March 2015 (1 commit)
  9. 15 March 2015 (1 commit)
  10. 13 March 2015 (2 commits)
  11. 06 January 2015 (1 commit)
    • net: tcp: add per route congestion control · 81164413
      Authored by Daniel Borkmann
      This work adds the possibility to define a per route/destination
      congestion control algorithm. Generally, this opens up the possibility
      for a machine with different links to enforce specific congestion
      control algorithms with optimal strategies for each of them based
      on their network characteristics, even transparently for a single
      application listening on all links.
      
      For our specific use case, this additionally facilitates deployment
      of DCTCP: for example, applications can easily serve internal
      traffic/dsts with DCTCP and external ones with CUBIC. Other scenarios
      would also allow utilizing e.g. long-lived, low-priority background
      flows for certain destinations/routes while normal traffic can still
      use the default congestion control algorithm. We also thought about a
      per-netns setting (where different defaults are possible), but given
      it's actually a link-specific property, we argue that a per
      route/destination setting is the most natural and flexible.
      
      The administrator can utilize this through ip-route(8) by appending
      "congctl [lock] <name>", where <name> denotes the name of a
      congestion control algorithm and the optional lock parameter
      enforces the given algorithm, so that applications in user space
      are not allowed to override that algorithm for that destination.
      
      The dst metric lookups are done when a dst entry is already
      available, in order to avoid a costly lookup, and still before the
      algorithms are initialized, so the overhead is very low when the
      feature is not being used. While the client side would need to drop
      the current reference on the module, on the server side this can
      actually even be avoided, as we just got a flat-copied socket clone.
      
      Joint work with Florian Westphal.
      Suggested-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
      Signed-off-by: Florian Westphal <fw@strlen.de>
      Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      81164413
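      
      For context on what the optional "lock" flag guards against: without it,
      an application can still pick its own congestion control per socket via
      the long-standing TCP_CONGESTION socket option, as in this hedged
      userspace sketch (error handling trimmed; a matching route configured
      with "congctl lock <name>" is assumed to exist):
      
      #include <netinet/in.h>
      #include <netinet/tcp.h>
      #include <stdio.h>
      #include <string.h>
      #include <sys/socket.h>
      #include <unistd.h>
      
      int main(void)
      {
              char name[16] = "cubic";        /* ask for CUBIC on this socket */
              socklen_t len = sizeof(name);
              int fd = socket(AF_INET, SOCK_STREAM, 0);
      
              /* Per-socket override of the congestion control algorithm. With
               * "congctl lock <name>" on the route, the kernel keeps the
               * route-mandated algorithm for that destination instead. */
              if (setsockopt(fd, IPPROTO_TCP, TCP_CONGESTION, name, strlen(name)))
                      perror("setsockopt(TCP_CONGESTION)");
      
              /* Read back what is actually in effect for this socket. */
              if (!getsockopt(fd, IPPROTO_TCP, TCP_CONGESTION, name, &len))
                      printf("congestion control in use: %s\n", name);
      
              close(fd);
              return 0;
      }
      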
  12. 23 December 2014 (1 commit)
  13. 10 December 2014 (1 commit)
  14. 26 November 2014 (1 commit)
  15. 12 November 2014 (2 commits)
    • net: introduce SO_INCOMING_CPU · 2c8c56e1
      Authored by Eric Dumazet
      An alternative to RPS/RFS is to use hardware support for multiple
      queues.
      
      Then split a set of millions of sockets among worker threads, each
      one using epoll() to manage events on its own socket pool.
      
      Ideally, we want one thread per RX/TX queue/cpu, but we have no way to
      know after accept() or connect() on which queue/cpu a socket is managed.
      
      We normally use one cpu per RX queue (IRQ smp_affinity being properly
      set), so remembering in the socket structure which cpu delivered the
      last packet is enough to solve the problem.
      
      After accept(), connect(), or even after passing the file descriptor
      between processes, applications can use:
      
       int cpu;
       socklen_t len = sizeof(cpu);
      
       getsockopt(fd, SOL_SOCKET, SO_INCOMING_CPU, &cpu, &len);
      
      and use this information to put the socket into the right silo
      for optimal performance, as the whole networking stack then runs
      on the appropriate cpu, without needing to send IPIs (as RPS/RFS do).
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      2c8c56e1
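      
      A slightly fuller userspace sketch of the pattern described above: after
      accept(), query SO_INCOMING_CPU and hand the fd to the worker pinned to
      that cpu. The worker hand-off is hypothetical application logic, not part
      of the kernel change.
      
      #include <stdio.h>
      #include <sys/socket.h>
      
      #ifndef SO_INCOMING_CPU
      #define SO_INCOMING_CPU 49      /* value from asm-generic/socket.h; older
                                       * libc headers may not define it yet */
      #endif
      
      /* Hypothetical helper: one epoll-based worker thread pinned per cpu. */
      static void hand_off_to_worker(int worker, int fd)
      {
              printf("fd %d -> worker %d\n", fd, worker);
      }
      
      static void route_accepted_socket(int fd, int nr_workers)
      {
              int cpu = 0;
              socklen_t len = sizeof(cpu);
      
              /* Ask the kernel which cpu delivered this flow's last packet. */
              if (getsockopt(fd, SOL_SOCKET, SO_INCOMING_CPU, &cpu, &len) == 0)
                      hand_off_to_worker(cpu % nr_workers, fd);
              else
                      hand_off_to_worker(fd % nr_workers, fd);   /* fallback */
      }
      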
    • tcp: move sk_mark_napi_id() at the right place · 3d97379a
      Authored by Eric Dumazet
      sk_mark_napi_id() is used to record, for a flow, the napi id of
      incoming packets, for busy polling's sake.
      We should do this only on established flows, not on listeners.
      
      This was 'working' by virtue of the socket cloning, but doing
      this on SYN packets is unnecessary cache line dirtying.
      
      Even if we moved sk_napi_id into the same cache line as sk_lock,
      we are working to make SYN processing lockless, so it is desirable
      to set sk_napi_id only for established flows.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      3d97379a
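      
      In other words, the napi id should be recorded only once the skb is known
      to belong to an established socket. A hedged kernel-style sketch of that
      placement (simplified, not the exact hunk the patch moves):
      
      #include <net/busy_poll.h>
      #include <net/tcp.h>
      
      /* Only established flows benefit from busy polling, so only they should
       * pay the cache line dirtying of recording the napi id. */
      static void note_napi_id_if_established(struct sock *sk,
                                              const struct sk_buff *skb)
      {
              if (sk->sk_state == TCP_ESTABLISHED)
                      sk_mark_napi_id(sk, skb);
      }
      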
  16. 23 October 2014 (1 commit)
    • net: fix saving TX flow hash in sock for outgoing connections · 9e7ceb06
      Authored by Sathya Perla
      The commit "net: Save TX flow hash in sock and set in skbuf on xmit"
      introduced the inet_set_txhash() and ip6_set_txhash() routines to calculate
      and record flow hash(sk_txhash) in the socket structure. sk_txhash is used
      to set skb->hash which is used to spread flows across multiple TXQs.
      
      But the above routines are invoked before the source port of the connection
      is chosen. Because of this, all outgoing connections that differ only in the
      source port get hashed into the same TXQ.
      
      This patch fixes this problem for IPv4/6 by invoking the above routines
      after the source port is available for the socket.
      
      Fixes: b73c3d0e ("net: Save TX flow hash in sock and set in skbuf on xmit")
      Signed-off-by: Sathya Perla <sathya.perla@emulex.com>
      Acked-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      9e7ceb06
  17. 18 October 2014 (1 commit)
  18. 29 September 2014 (2 commits)
    • tcp: better TCP_SKB_CB layout to reduce cache line misses · 971f10ec
      Authored by Eric Dumazet
      TCP maintains lists of skbs in the write queue and in the receive queues
      (in-order and out-of-order queues).
      
      Scanning these lists in both the input and output paths usually requires
      access to skb->next, TCP_SKB_CB(skb)->seq, and TCP_SKB_CB(skb)->end_seq.
      
      These fields are currently in two different cache lines, meaning we
      waste a lot of memory bandwidth when these queues are big and flows
      have either packet drops or packet reorders.
      
      We can move TCP_SKB_CB(skb)->header to the end of TCP_SKB_CB, because
      this header is not used in the fast path. This allows TCP to search much
      faster in the skb lists.
      
      Even with regular flows, we save one cache line miss in the fast path.
      
      Thanks to Christoph Paasch for noticing we need to clean up
      skb->cb[] (IPCB/IP6CB) before entering the IP stack in the tx path,
      and that I forgot the IPCB use in tcp_v4_hnd_req() and tcp_v4_save_options().
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      971f10ec
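      
      The gain comes purely from field ordering inside the 48-byte skb->cb[]
      area: the fields scanned on every queue walk stay at the front, sharing a
      cache line with skb->next, while the IP-layer control block, only needed
      before the packet reaches TCP, moves to the end. An illustrative (not
      verbatim) layout:
      
      #include <linux/ipv6.h>
      #include <linux/types.h>
      #include <net/ip.h>
      
      /* Sketch of the reordering idea, not the exact kernel struct. */
      struct tcp_skb_cb_sketch {
              __u32 seq;              /* hot: read on every list scan */
              __u32 end_seq;          /* hot: read on every list scan */
              __u8  tcp_flags;        /* ... other small fast-path fields ... */
      
              /* cold: only used while the skb is still in the IP layer, so it
               * no longer pushes seq/end_seq into a second cache line */
              union {
                      struct inet_skb_parm  h4;
      #if IS_ENABLED(CONFIG_IPV6)
                      struct inet6_skb_parm h6;
      #endif
              } header;
      };
      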
    • ipv6: add a struct inet6_skb_parm param to ipv6_opt_accepted() · a224772d
      Authored by Eric Dumazet
      ipv6_opt_accepted() assumes IP6CB(skb) holds the struct inet6_skb_parm
      that it needs. Let's not assume this, as the TCP stack might use a
      different place.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      a224772d
  19. 28 September 2014 (1 commit)
  20. 16 September 2014 (1 commit)
    • tcp: use TCP_SKB_CB(skb)->tcp_flags in input path · e11ecddf
      Authored by Eric Dumazet
      The TCP input path does not currently use TCP_SKB_CB(skb)->tcp_flags,
      which is only used in the output path.
      
      tcp_recvmsg() looks at tcp_hdr(skb)->syn for every skb found in the receive
      queue, which is unfortunate because this bit is located in a cache line right
      before the payload.
      
      We can simplify TCP by copying tcp flags into TCP_SKB_CB(skb)->tcp_flags.
      
      This patch does so, and avoids the cache line miss in tcp_recvmsg()
      
      Following patches will:
      - allow a segment with a FIN to be coalesced in tcp_try_coalesce()
      - simplify tcp_collapse() by not copying the headers
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Acked-by: Neal Cardwell <ncardwell@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      e11ecddf
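      
      The mechanism is simply to copy the TCP flag byte into the control block
      at receive time, while the header cache line is already hot, so later
      consumers never touch it again. A hedged sketch of the two halves
      (simplified, not the exact patch hunks):
      
      #include <net/tcp.h>
      
      /* Receive side: stash the flag byte while tcp_hdr(skb) is already hot. */
      static void stash_tcp_flags(struct sk_buff *skb)
      {
              TCP_SKB_CB(skb)->tcp_flags = tcp_flag_byte(tcp_hdr(skb));
      }
      
      /* Consumer side (e.g. recvmsg-style code): test the cached copy instead
       * of dereferencing tcp_hdr(skb)->syn and pulling in the header line. */
      static bool skb_starts_with_syn(const struct sk_buff *skb)
      {
              return TCP_SKB_CB(skb)->tcp_flags & TCPHDR_SYN;
      }
      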
  21. 10 September 2014 (1 commit)
    • tcp: remove dst refcount false sharing for prequeue mode · ca777eff
      Authored by Eric Dumazet
      Alexander Duyck reported high false sharing on the dst refcount in the TCP
      stack when prequeue is used. prequeue is the mechanism used when a thread is
      blocked in recvmsg()/read() on a TCP socket, using a blocking model
      rather than the non-blocking select()/poll()/epoll() one.
      
      We already try to use RCU in the input path as much as possible, but we were
      forced to take a refcount on the dst when the skb escaped the RCU-protected
      region. When/if the user thread runs on a different cpu, dst_release()
      will then touch the dst refcount again.
      
      Commit 09316255 (tcp: force a dst refcount when prequeue packet)
      was an example of a race fix.
      
      It turns out the only remaining usage of skb->dst for a packet stored
      in a TCP socket prequeue is IP early demux.
      
      We can add logic to detect when IP early demux is probably going
      to use skb->dst. Because we do an optimistic check rather than duplicating
      existing logic, we need to guard inet_sk_rx_dst_set() and
      inet6_sk_rx_dst_set() against using a NULL dst.
      
      Many thanks to Alexander for providing a nice bug report, git bisection,
      and reproducer.
      
      Tested using Alexander's script on a 40Gb NIC, 8 RX queues.
      Hosts have 24 cores, 48 hyperthreads.
      
      echo 0 >/proc/sys/net/ipv4/tcp_autocorking
      
      for i in `seq 0 47`
      do
        for j in `seq 0 2`
        do
           netperf -H $DEST -t TCP_STREAM -l 1000 \
                   -c -C -T $i,$i -P 0 -- \
                   -m 64 -s 64K -D &
        done
      done
      
      Before patch: ~6Mpps and ~95% cpu usage on receiver
      After patch: ~9Mpps and ~35% cpu usage on receiver
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Reported-by: Alexander Duyck <alexander.h.duyck@intel.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      ca777eff
  22. 06 September 2014 (1 commit)
  23. 15 August 2014 (1 commit)
  24. 07 August 2014 (1 commit)
  25. 08 July 2014 (2 commits)
    • net: Save TX flow hash in sock and set in skbuf on xmit · b73c3d0e
      Authored by Tom Herbert
      For a connected socket we can precompute the flow hash for setting
      in skb->hash on output. This is a performance advantage over
      calculating skb->hash for every packet on the connection. The
      computation is done using the common hash algorithm to be consistent
      with computations done for packets of the connection in other states
      where there is no socket (e.g. time-wait, syn-recv, syn-cookies).
      
      This patch adds sk_txhash to the sock structure. inet_set_txhash and
      ip6_set_txhash functions are added, which are called from points in
      TCP and UDP where the socket moves to the established state.
      
      skb_set_hash_from_sk is a function which sets skb->hash from the
      sock txhash value. This is called in the UDP and TCP transmit paths when
      transmitting within the context of a socket.
      
      Tested: ran super_netperf with 200 TCP_RR streams over a vxlan
      interface (in this case skb_get_hash is called on every TX packet to
      create a UDP source port).
      
      Before fix:
      
        95.02% CPU utilization
        154/256/505 90/95/99% latencies
        1.13042e+06 tps
      
        Time in functions:
          0.28% skb_flow_dissect
          0.21% __skb_get_hash
      
      After fix:
      
        94.95% CPU utilization
        156/254/485 90/95/99% latencies
        1.15447e+06
      
        Neither __skb_get_hash nor skb_flow_dissect appear in perf
      Signed-off-by: Tom Herbert <therbert@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      b73c3d0e
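      
      Conceptually this is a compute-once / reuse-many pattern: derive the flow
      hash when the 4-tuple becomes final, store it on the socket, and copy it
      onto every outgoing skb. A hedged sketch of both halves (simplified; the
      commit's real helpers are inet_set_txhash(), ip6_set_txhash() and
      skb_set_hash_from_sk()):
      
      #include <linux/skbuff.h>
      #include <net/sock.h>
      
      /* Establishment time: remember the flow hash on the socket. The real
       * helpers derive it from the 4-tuple with the stack's common flow hash,
       * so it matches hashes computed for states without a socket
       * (time-wait, syn-recv, syn-cookies). */
      static void sk_sketch_set_txhash(struct sock *sk, u32 flow_hash)
      {
              sk->sk_txhash = flow_hash;
      }
      
      /* Transmit time: reuse the precomputed value instead of dissecting the
       * packet headers for every skb on the connection. */
      static void skb_sketch_set_hash_from_sk(struct sk_buff *skb, struct sock *sk)
      {
              if (sk->sk_txhash) {
                      skb->l4_hash = 1;       /* hash covers L4 ports */
                      skb->hash = sk->sk_txhash;
              }
      }
      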
    • tcp: switch snt_synack back to measuring transmit time of first SYNACK · 86c6a2c7
      Authored by Neal Cardwell
      Always store in snt_synack the time at which the server received the
      first client SYN and attempted to send the first SYNACK.
      
      The recent commit aa27fc50 ("tcp: tcp_v[46]_conn_request: fix snt_synack
      initialization") resolved an inconsistency between IPv4 and IPv6 in
      the initialization of snt_synack. This commit brings back the idea
      from 843f4a55 ("tcp: use tcp_v4_send_synack on first SYN-ACK"), which
      aimed for the original behavior of snt_synack from the commit where it
      was added, 9ad7c049 ("tcp: RFC2988bis + taking RTT sample from 3WHS
      for the passive open side") in v3.1.
      
      In addition to being simpler (and probably a tiny bit faster),
      unconditionally storing the time of the first SYNACK attempt has been
      useful because it allows calculating a performance metric quantifying
      how long it took to establish a passive TCP connection.
      Signed-off-by: Neal Cardwell <ncardwell@google.com>
      Signed-off-by: Yuchung Cheng <ycheng@google.com>
      Cc: Octavian Purdila <octavian.purdila@intel.com>
      Cc: Jerry Chu <hkchu@google.com>
      Acked-by: Octavian Purdila <octavian.purdila@intel.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      86c6a2c7
  26. 28 June 2014 (4 commits)