1. 24 April 2015, 1 commit
    • inet: fix possible panic in reqsk_queue_unlink() · b357a364
      Eric Dumazet authored
      [ 3897.923145] BUG: unable to handle kernel NULL pointer dereference at
       0000000000000080
      [ 3897.931025] IP: [<ffffffffa9f27686>] reqsk_timer_handler+0x1a6/0x243
      
      There is a race when reqsk_timer_handler() and tcp_check_req() call
      inet_csk_reqsk_queue_unlink() on the same req at the same time.
      
      Before commit fa76ce73 ("inet: get rid of central tcp/dccp listener
      timer"), the listener spinlock was held and the race could not happen.
      
      To solve this bug, change reqsk_queue_unlink() so it no longer assumes
      the req must be found, and return a status so the caller can
      conditionally release a refcount on the request sock (see the sketch
      after this entry).
      
      This also means that, in the non-fastopen case, tcp_check_req() might or
      might not consume the req refcount, so tcp_v6_hnd_req() and
      tcp_v4_hnd_req() have to handle this properly.
      
      (Same remark for dccp_check_req() and its callers)
      
      inet_csk_reqsk_queue_drop() is now too big to be inlined, as it is
      called 4 times in tcp and 3 times in dccp.
      
      Fixes: fa76ce73 ("inet: get rid of central tcp/dccp listener timer")
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Reported-by: Yuchung Cheng <ycheng@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
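      A hedged C sketch of the pattern this commit describes; queue_remove_req()
      and the *_sketch names are illustrative, not the verbatim patch:

          #include <net/inet_connection_sock.h>
          #include <net/request_sock.h>

          static bool queue_remove_req(struct request_sock_queue *q,
                                       struct request_sock *req);  /* hypothetical helper */

          /* The unlink helper reports whether it actually found and removed the
           * request, so the caller drops the queue's reference just once even
           * when reqsk_timer_handler() and tcp_check_req() race on the same req.
           */
          static bool reqsk_queue_unlink_sketch(struct request_sock_queue *queue,
                                                struct request_sock *req)
          {
                  bool found = queue_remove_req(queue, req);

                  if (found && timer_pending(&req->rsk_timer))
                          del_timer_sync(&req->rsk_timer); /* stop the per-req SYNACK timer */
                  return found;
          }

          static void reqsk_queue_drop_sketch(struct sock *sk, struct request_sock *req)
          {
                  if (reqsk_queue_unlink_sketch(&inet_csk(sk)->icsk_accept_queue, req)) {
                          reqsk_queue_removed(&inet_csk(sk)->icsk_accept_queue, req);
                          reqsk_put(req); /* release the reference only if we unlinked it */
                  }
          }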
  2. 14 April 2015, 1 commit
    • tcp/dccp: get rid of central timewait timer · 789f558c
      Eric Dumazet authored
      Using a timer wheel for timewait sockets was nice ~15 years ago when
      memory was expensive and machines had a single processor.
      
      This does not scale, the code is ugly, and it is a source of huge latencies
      (typically 30 ms have been seen, with cpus spinning on the death_lock spinlock).
      
      We can afford to use an extra 64 bytes per timewait sock and spread the
      timewait load across all cpus for better behavior (a sketch of the
      per-socket timer idea follows this entry).
      
      Tested:
      
      In the following test, /proc/sys/net/ipv4/tcp_tw_recycle is set to 1
      on the target (lpaa24).
      
      Before patch :
      
      lpaa23:~# ./super_netperf 200 -H lpaa24 -t TCP_CC -l 60 -- -p0,0
      419594
      
      lpaa23:~# ./super_netperf 200 -H lpaa24 -t TCP_CC -l 60 -- -p0,0
      437171
      
      While the test is running, we can observe 25 or even 33 ms latencies.
      
      lpaa24:~# ping -c 1000 -i 0.02 -qn lpaa23
      ...
      1000 packets transmitted, 1000 received, 0% packet loss, time 20601ms
      rtt min/avg/max/mdev = 0.020/0.217/25.771/1.535 ms, pipe 2
      
      lpaa24:~# ping -c 1000 -i 0.02 -qn lpaa23
      ...
      1000 packets transmitted, 1000 received, 0% packet loss, time 20702ms
      rtt min/avg/max/mdev = 0.019/0.183/33.761/1.441 ms, pipe 2
      
      After patch :
      
      About a 90% increase in throughput :
      
      lpaa23:~# ./super_netperf 200 -H lpaa24 -t TCP_CC -l 60 -- -p0,0
      810442
      
      lpaa23:~# ./super_netperf 200 -H lpaa24 -t TCP_CC -l 60 -- -p0,0
      800992
      
      And latencies are kept to minimal values during this load, even
      though network utilization is 90% higher :
      
      lpaa24:~# ping -c 1000 -i 0.02 -qn lpaa23
      ...
      1000 packets transmitted, 1000 received, 0% packet loss, time 19991ms
      rtt min/avg/max/mdev = 0.023/0.064/0.360/0.042 ms
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
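      A minimal, hedged sketch of the per-socket timer idea (illustrative names,
      written against the current timer API rather than the 2015 one; not the
      exact kernel symbols):

          #include <linux/timer.h>
          #include <linux/jiffies.h>

          /* The "extra 64 bytes" roughly correspond to embedding a timer and
           * related state directly in the timewait sock, instead of chaining it
           * into one central wheel guarded by death_lock.
           */
          struct tw_sock_sketch {
                  struct timer_list tw_timer;   /* per-socket timer, armed on this CPU */
                  /* ... protocol state ... */
          };

          static void tw_expire_sketch(struct timer_list *t)
          {
                  struct tw_sock_sketch *tw = from_timer(tw, t, tw_timer);

                  /* Unhash and free the timewait sock here; expiry runs on the CPU
                   * that armed the timer, so there is no global lock to spin on.
                   */
                  (void)tw;
          }

          static void tw_schedule_sketch(struct tw_sock_sketch *tw, unsigned long timeo)
          {
                  timer_setup(&tw->tw_timer, tw_expire_sketch, 0);
                  mod_timer(&tw->tw_timer, jiffies + timeo);
          }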
  3. 10 April 2015, 1 commit
  4. 04 April 2015, 2 commits
  5. 30 March 2015, 1 commit
  6. 25 March 2015, 4 commits
  7. 24 March 2015, 4 commits
    • tcp: prevent fetching dst twice in early demux code · d0c294c5
      Michal Kubeček authored
      On s390x, gcc 4.8 compiles this part of tcp_v6_early_demux()
      
              struct dst_entry *dst = sk->sk_rx_dst;
      
              if (dst)
                      dst = dst_check(dst, inet6_sk(sk)->rx_dst_cookie);
      
      to code that reads sk->sk_rx_dst twice, once for the test and once for
      the argument of ip6_dst_check() (dst_check() is inline). This allows
      ip6_dst_check() to be called with a NULL first argument, causing a crash.
      
      Protect the sk->sk_rx_dst access with READ_ONCE() in both the IPv4 and
      IPv6 TCP early demux code (see the sketch after this entry).
      
      Fixes: 41063e9d ("ipv4: Early TCP socket demux.")
      Fixes: c7109986 ("ipv6: Early TCP socket demux")
      Signed-off-by: Michal Kubecek <mkubecek@suse.cz>
      Acked-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
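      A hedged sketch of the fix on the IPv6 side, following the description
      above (the surrounding early-demux code is omitted; only the guarded
      access pattern is shown):

          static void early_demux_fix_sketch(struct sock *sk, struct sk_buff *skb)
          {
                  /* Single load: the compiler cannot reload sk->sk_rx_dst between
                   * the NULL test and the dst_check() call.
                   */
                  struct dst_entry *dst = READ_ONCE(sk->sk_rx_dst);

                  if (dst)
                          dst = dst_check(dst, inet6_sk(sk)->rx_dst_cookie);
                  if (dst)
                          skb_dst_set_noref(skb, dst); /* reuse the cached route */
          }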
    • inet: fix double request socket freeing · c6973669
      Fan Du authored
      Erik Hugne reported the following error:
      
      I'm hitting this warning on latest net-next when I try to SSH into a machine
      with eth0 added to a bridge (but I think the problem is older than that)
      
      Steps to reproduce:
      node2 ~ # brctl addif br0 eth0
      [  223.758785] device eth0 entered promiscuous mode
      node2 ~ # ip link set br0 up
      [  244.503614] br0: port 1(eth0) entered forwarding state
      [  244.505108] br0: port 1(eth0) entered forwarding state
      node2 ~ # [  251.160159] ------------[ cut here ]------------
      [  251.160831] WARNING: CPU: 0 PID: 3 at include/net/request_sock.h:102 tcp_v4_err+0x6b1/0x720()
      [  251.162077] Modules linked in:
      [  251.162496] CPU: 0 PID: 3 Comm: ksoftirqd/0 Not tainted 4.0.0-rc3+ #18
      [  251.163334] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
      [  251.164078]  ffffffff81a8365c ffff880038a6ba18 ffffffff8162ace4 0000000000009898
      [  251.165084]  0000000000000000 ffff880038a6ba58 ffffffff8104da85 ffff88003fa437c0
      [  251.166195]  ffff88003fa437c0 ffff88003fa74e00 ffff88003fa43bb8 ffff88003fad99a0
      [  251.167203] Call Trace:
      [  251.167533]  [<ffffffff8162ace4>] dump_stack+0x45/0x57
      [  251.168206]  [<ffffffff8104da85>] warn_slowpath_common+0x85/0xc0
      [  251.169239]  [<ffffffff8104db65>] warn_slowpath_null+0x15/0x20
      [  251.170271]  [<ffffffff81559d51>] tcp_v4_err+0x6b1/0x720
      [  251.171408]  [<ffffffff81630d03>] ? _raw_read_lock_irq+0x3/0x10
      [  251.172589]  [<ffffffff81534e20>] ? inet_del_offload+0x40/0x40
      [  251.173366]  [<ffffffff81569295>] icmp_socket_deliver+0x65/0xb0
      [  251.174134]  [<ffffffff815693a2>] icmp_unreach+0xc2/0x280
      [  251.174820]  [<ffffffff8156a82d>] icmp_rcv+0x2bd/0x3a0
      [  251.175473]  [<ffffffff81534ea2>] ip_local_deliver_finish+0x82/0x1e0
      [  251.176282]  [<ffffffff815354d8>] ip_local_deliver+0x88/0x90
      [  251.177004]  [<ffffffff815350f0>] ip_rcv_finish+0xf0/0x310
      [  251.177693]  [<ffffffff815357bc>] ip_rcv+0x2dc/0x390
      [  251.178336]  [<ffffffff814f5da3>] __netif_receive_skb_core+0x713/0xa20
      [  251.179170]  [<ffffffff814f7fca>] __netif_receive_skb+0x1a/0x80
      [  251.179922]  [<ffffffff814f97d4>] process_backlog+0x94/0x120
      [  251.180639]  [<ffffffff814f9612>] net_rx_action+0x1e2/0x310
      [  251.181356]  [<ffffffff81051267>] __do_softirq+0xa7/0x290
      [  251.182046]  [<ffffffff81051469>] run_ksoftirqd+0x19/0x30
      [  251.182726]  [<ffffffff8106cc23>] smpboot_thread_fn+0x153/0x1d0
      [  251.183485]  [<ffffffff8106cad0>] ? SyS_setgroups+0x130/0x130
      [  251.184228]  [<ffffffff8106935e>] kthread+0xee/0x110
      [  251.184871]  [<ffffffff81069270>] ? kthread_create_on_node+0x1b0/0x1b0
      [  251.185690]  [<ffffffff81631108>] ret_from_fork+0x58/0x90
      [  251.186385]  [<ffffffff81069270>] ? kthread_create_on_node+0x1b0/0x1b0
      [  251.187216] ---[ end trace c947fc7b24e42ea1 ]---
      [  259.542268] br0: port 1(eth0) entered forwarding state
      
      Remove the double calls to reqsk_put()
      
      [edumazet] :
      
      I got confused because reqsk_timer_handler() _has_ to call
      reqsk_put(req) after calling inet_csk_reqsk_queue_drop(), as
      the timer handler holds a reference on req.
      Signed-off-by: Fan Du <fan.du@intel.com>
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Reported-by: Erik Hugne <erik.hugne@ericsson.com>
      Fixes: fa76ce73 ("inet: get rid of central tcp/dccp listener timer")
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • ipv4: tcp: handle ICMP messages on TCP_NEW_SYN_RECV request sockets · 26e37360
      Eric Dumazet authored
      tcp_v4_err() can restrict lookups to the ehash table, and not to listeners.
      
      Note this patch creates the infrastructure, but it means that ICMP
      messages for request sockets are ignored until the conversion is complete.
      
      The new tcp_req_err() helper is exported so that we can use it in IPv6
      in a following patch.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: convert syn_wait_lock to a spinlock · b2827053
      Eric Dumazet authored
      This is low hanging fruit, as we'll get rid of syn_wait_lock eventually.
      
      We hold syn_wait_lock for such short sections that it makes no sense to use
      a read/write lock. A spin lock is simply faster.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  8. 21 March 2015, 2 commits
    • inet: get rid of central tcp/dccp listener timer · fa76ce73
      Eric Dumazet authored
      One of the major issues for TCP is SYNACK rtx handling,
      done by inet_csk_reqsk_queue_prune(), fired from the keepalive
      timer of a TCP_LISTEN socket.
      
      This function runs for awfully long times with the socket lock held,
      meaning that other cpus needing this lock have to spin for hundreds of ms.
      
      SYNACK are sent in huge bursts, likely to cause severe drops anyway.
      
      This model was OK 15 years ago when memory was very tight.
      
      We can now afford to have a timer per request sock.
      
      Timer invocations no longer need to lock the listener
      and can run on all cpus in parallel (see the sketch after this entry).
      
      With a following patch increasing the somaxconn width to 32 bits,
      I tested a listener with more than 4 million active request sockets
      and a steady SYNFLOOD of ~200,000 SYN per second.
      The host was sending ~830,000 SYNACK per second.
      
      This is ~100 times more than what we could achieve before this patch.
      
      Later, we will get rid of the listener hash and use ehash instead.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
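      A hedged sketch of the per-request timer model described above; the
      handler shape, retry limit, and resend_synack() helper are illustrative
      (the era's timer callbacks take an unsigned long argument):

          #include <linux/timer.h>
          #include <net/tcp.h>
          #include <net/request_sock.h>
          #include <net/inet_connection_sock.h>

          static void resend_synack(struct request_sock *req); /* hypothetical helper */

          /* Each request sock owns rsk_timer, so SYNACK retransmits no longer
           * walk the whole queue under the listener lock.
           */
          static void reqsk_timer_sketch(unsigned long data)
          {
                  struct request_sock *req = (struct request_sock *)data;

                  if (req->num_timeout < 5) {            /* illustrative retry limit */
                          resend_synack(req);
                          req->num_timeout++;
                          mod_timer(&req->rsk_timer, jiffies + TCP_TIMEOUT_INIT);
                  } else {
                          /* Give up: unlink from the listener queue, then drop the
                           * timer's own reference exactly once.
                           */
                          inet_csk_reqsk_queue_drop(req->rsk_listener, req);
                          reqsk_put(req);
                  }
          }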
    • inet: drop prev pointer handling in request sock · 52452c54
      Eric Dumazet authored
      When request socks are put in the ehash table, the whole notion
      of having a previous request to update dl_next is pointless.
      
      Also, a following patch will get rid of the big purge timer,
      so we want to be able to delete a request sock without holding the listener lock.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  9. 19 March 2015, 2 commits
  10. 17 March 2015, 1 commit
  11. 13 March 2015, 2 commits
  12. 07 March 2015, 2 commits
    • ipv4: Create probe timer for tcp PMTU as per RFC4821 · 05cbc0db
      Fan Du authored
      As per RFC 4821 section 7.3, "Selecting Probe Size", a probe timer should
      be armed once probing has converged. Once this timer expires, probe
      again to take advantage of any path MTU change. The recommended probing
      interval is 10 minutes per RFC 1981. The probing interval can be tuned
      via sysctl_tcp_probe_interval.
      
      Eric Dumazet suggested implementing a pseudo timer based on the 32-bit
      jiffies-backed tcp_time_stamp instead of using a classic timer for such a
      rare event (see the sketch after this entry).
      Signed-off-by: Fan Du <fan.du@intel.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
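      A hedged sketch of the "pseudo timer" idea: remember a 32-bit
      tcp_time_stamp when probing converges and compare it on normal transmit
      paths, rather than arming a real timer for a roughly once-per-10-minutes
      event. Names and the seconds-to-jiffies conversion are illustrative:

          #include <net/tcp.h>

          /* Wrap-safe comparison of 32-bit, jiffies-based TCP timestamps. */
          static bool probe_interval_expired_sketch(u32 probe_timestamp,
                                                    u32 probe_interval_secs)
          {
                  u32 deadline = probe_timestamp + probe_interval_secs * HZ;

                  return (s32)(tcp_time_stamp - deadline) > 0; /* handles wraparound */
          }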
    • ipv4: Use binary search to choose tcp PMTU probe_size · 6b58e0a5
      Fan Du authored
      Currently probe_size is chosen by doubling mss_cache, so the probing
      process ends quickly with a sub-optimal mss, and the link mtu is not
      taken full advantage of; in turn, this forces users to tweak
      tcp_base_mss with care.
      
      Use a binary search to choose probe_size at fine granularity, so an
      optimal mss is found and performance is maximized (see the sketch after
      this entry).
      
      In addition, introduce sysctl_tcp_probe_threshold to control when
      probing stops, based on the width of the search range.
      
      Test env:
      Docker instance with vxlan encapsulation (82599EB)
      iperf -c 10.0.0.24  -t 60
      
      before this patch:
      1.26 Gbits/sec
      
      After this patch (a 26% increase):
      1.59 Gbits/sec
      Signed-off-by: Fan Du <fan.du@intel.com>
      Acked-by: John Heffner <johnwheffner@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
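      A hedged sketch of the binary-search selection, with illustrative names;
      the real code keeps the search window in the connection's MTU-probing
      state:

          /* Probe the midpoint of [search_low, search_high] and stop once the
           * window is narrower than the probe threshold.
           */
          static int next_probe_size_sketch(int search_low, int search_high,
                                            int probe_threshold)
          {
                  if (search_high - search_low < probe_threshold)
                          return 0;                        /* converged: stop probing */
                  return (search_low + search_high) / 2;   /* try the midpoint next */
          }

      On a successful probe the low end moves up to the probed size; on failure
      the high end moves down, halving the window each round.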
  13. 10 February 2015, 1 commit
    • ipv4: Namespecify TCP PMTU mechanism · b0f9ca53
      Fan Du authored
      Packetization Layer Path MTU Discovery works separately from Path MTU
      Discovery at the IP level, and different net namespaces have different
      requirements on which one to choose; e.g., a virtualized container
      instance would require TCP PMTU to probe a usable effective mtu for an
      underlying tunnel, while the host would employ classic ICMP-based PMTU.
      
      Hence, make the TCP PMTU mechanism per net namespace to decouple the two
      functionalities. Furthermore, the probe base MSS should also be
      configured separately for each namespace.
      Signed-off-by: Fan Du <fan.du@intel.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  14. 02 February 2015, 1 commit
    • ipv4: tcp: get rid of ugly unicast_sock · bdbbb852
      Eric Dumazet authored
      In commit be9f4a44 ("ipv4: tcp: remove per net tcp_sock")
      I tried to address contention on a socket lock, but the solution
      I chose was horrible :
      
      commit 3a7c384f ("ipv4: tcp: unicast_sock should not land outside
      of TCP stack") addressed a selinux regression.
      
      commit 0980e56e ("ipv4: tcp: set unicast_sock uc_ttl to -1")
      took care of another regression.
      
      commit b5ec8eea ("ipv4: fix ip_send_skb()") fixed another regression.
      
      commit 811230cd ("tcp: ipv4: initialize unicast_sock sk_pacing_rate")
      was another shot in the dark.
      
      Really, just use a proper socket per cpu, and remove the skb_orphan()
      call, to re-enable flow control (a rough per-cpu socket sketch follows
      this entry).
      
      This solves a serious problem with the FQ packet scheduler when used in
      hostile environments, as we do not want to allocate a flow structure
      for every RST packet sent in response to a spoofed packet.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
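      A hedged sketch of the "proper socket per cpu" idea; variable names are
      illustrative and error unwinding is omitted:

          #include <linux/percpu.h>
          #include <net/inet_common.h>

          static struct sock * __percpu *reply_sk_sketch;

          /* Allocate one kernel control socket per possible CPU so replies
           * (e.g. RST/ACK) never contend on a single shared socket.
           */
          static int init_reply_socks_sketch(struct net *net)
          {
                  int cpu;

                  reply_sk_sketch = alloc_percpu(struct sock *);
                  if (!reply_sk_sketch)
                          return -ENOMEM;

                  for_each_possible_cpu(cpu) {
                          struct sock *sk;
                          int err = inet_ctl_sock_create(&sk, PF_INET, SOCK_RAW,
                                                         IPPROTO_TCP, net);
                          if (err)
                                  return err; /* real code would unwind earlier CPUs */
                          *per_cpu_ptr(reply_sk_sketch, cpu) = sk;
                  }
                  return 0;
          }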
  15. 06 January 2015, 1 commit
    • net: tcp: add per route congestion control · 81164413
      Daniel Borkmann authored
      This work adds the possibility to define a per route/destination
      congestion control algorithm. Generally, this opens up the possibility
      for a machine with different links to enforce specific congestion
      control algorithms with optimal strategies for each of them based
      on their network characteristics, even transparently for a single
      application listening on all links.
      
      For our specific use case, this additionally facilitates deployment
      of DCTCP, for example, applications can easily serve internal
      traffic/dsts in DCTCP and external one with CUBIC. Other scenarios
      would also allow for utilizing e.g. long living, low priority
      background flows for certain destinations/routes while still being
      able for normal traffic to utilize the default congestion control
      algorithm. We also thought about a per-netns setting (where different
      defaults are possible), but given that it's actually a link-specific
      property, we argue that a per route/destination setting is the most
      natural and flexible.
      
      The administrator can utilize this through ip-route(8) by appending
      "congctl [lock] <name>", where <name> denotes the name of a
      congestion control algorithm; the optional lock parameter enforces
      the given algorithm so that applications in user space are not
      allowed to override it for that destination.
      
      The dst metric lookups are being done when a dst entry is already
      available in order to avoid a costly lookup and still before the
      algorithms are being initialized, thus overhead is very low when the
      feature is not being used. While the client side would need to drop
      the current reference on the module, on server side this can actually
      even be avoided as we just got a flat-copied socket clone.
      
      Joint work with Florian Westphal.
      Suggested-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
      Signed-off-by: Florian Westphal <fw@strlen.de>
      Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  16. 10 December 2014, 1 commit
  17. 26 November 2014, 1 commit
  18. 12 November 2014, 2 commits
    • net: introduce SO_INCOMING_CPU · 2c8c56e1
      Eric Dumazet authored
      An alternative to RPS/RFS is to use hardware support for multiple
      queues.
      
      Then split a set of millions of sockets across worker threads, each
      one using epoll() to manage events on its own socket pool.
      
      Ideally, we want one thread per RX/TX queue/cpu, but we have no way to
      know after accept() or connect() on which queue/cpu a socket is managed.
      
      We normally use one cpu per RX queue (IRQ smp_affinity being properly
      set), so remembering on the socket structure which cpu delivered the last
      packet is enough to solve the problem.
      
      After accept(), connect(), or even file descriptor passing between
      processes, applications can use :
      
       int cpu;
       socklen_t len = sizeof(cpu);
      
       getsockopt(fd, SOL_SOCKET, SO_INCOMING_CPU, &cpu, &len);
      
      And use this information to put the socket into the right silo
      for optimal performance, as the whole networking stack then runs
      on the appropriate cpu, without needing to send IPIs (RPS/RFS).
      A hedged usage sketch follows this entry.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
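      A hedged userspace sketch building on the snippet above: after accept(),
      read SO_INCOMING_CPU and map it to a worker thread. The cpu-to-worker
      mapping is application-defined and purely illustrative, and the build
      needs kernel/libc headers that define SO_INCOMING_CPU:

          #include <sys/socket.h>

          /* Pick the worker whose CPU matches the one that received the packets
           * for this connection, falling back to worker 0 on error.
           */
          static int pick_worker_sketch(int fd, int nr_workers)
          {
                  int cpu = 0;
                  socklen_t len = sizeof(cpu);

                  if (getsockopt(fd, SOL_SOCKET, SO_INCOMING_CPU, &cpu, &len) < 0)
                          return 0;                /* option unsupported: default worker */
                  return cpu % nr_workers;         /* simple cpu -> worker mapping */
          }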
    • tcp: move sk_mark_napi_id() at the right place · 3d97379a
      Eric Dumazet authored
      sk_mark_napi_id() is used to record, for a flow, the napi id of incoming
      packets, for busy-polling purposes.
      We should do this only on established flows, not on listeners.
      
      This was 'working' by virtue of the socket cloning, but doing
      this on SYN packets is unnecessary cache line dirtying.
      
      Even if we moved sk_napi_id into the same cache line as sk_lock,
      we are working to make SYN processing lockless, so it is desirable
      to set sk_napi_id only for established flows.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  19. 23 October 2014, 1 commit
    • net: fix saving TX flow hash in sock for outgoing connections · 9e7ceb06
      Sathya Perla authored
      The commit "net: Save TX flow hash in sock and set in skbuf on xmit"
      introduced the inet_set_txhash() and ip6_set_txhash() routines to calculate
      and record flow hash(sk_txhash) in the socket structure. sk_txhash is used
      to set skb->hash which is used to spread flows across multiple TXQs.
      
      But, the above routines are invoked before the source port of the connection
      is created. Because of this all outgoing connections that just differ in the
      source port get hashed into the same TXQ.
      
      This patch fixes this problem for IPv4/6 by invoking the the above routines
      after the source port is available for the socket.
      
      Fixes: b73c3d0e ("net: Save TX flow hash in sock and set in skbuf on xmit")
      Signed-off-by: Sathya Perla <sathya.perla@emulex.com>
      Acked-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  20. 18 October 2014, 2 commits
  21. 29 September 2014, 2 commits
    • tcp: better TCP_SKB_CB layout to reduce cache line misses · 971f10ec
      Eric Dumazet authored
      TCP maintains lists of skbs in the write queue and in the receive queues
      (in-order and out-of-order queues).
      
      Scanning these lists in both the input and output paths usually requires
      access to skb->next, TCP_SKB_CB(skb)->seq, and TCP_SKB_CB(skb)->end_seq.
      
      These fields are currently in two different cache lines, meaning we
      waste a lot of memory bandwidth when these queues are big and flows
      have either packet drops or packet reorders.
      
      We can move TCP_SKB_CB(skb)->header to the end of TCP_SKB_CB, because
      this header is not used in the fast path. This allows TCP to search the
      skb lists much faster (see the layout sketch after this entry).
      
      Even with regular flows, we save one cache line miss in the fast path.
      
      Thanks to Christoph Paasch for noticing that we need to clean up
      skb->cb[] (IPCB/IP6CB) before entering the IP stack in the tx path,
      and that I forgot the IPCB use in tcp_v4_hnd_req() and tcp_v4_save_options().
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
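      A hedged layout sketch (field order and names illustrative, not the exact
      struct): the point is that the cold IPCB/IP6CB header union moves behind
      the hot fields that the list scans touch:

          #include <linux/ipv6.h>
          #include <net/ip.h>

          struct tcp_skb_cb_sketch {
                  __u32   seq;            /* hot: compared on every queue walk */
                  __u32   end_seq;        /* hot */
                  __u8    tcp_flags;      /* hot */
                  /* ... other fast-path fields ... */
                  union {                 /* cold: IPCB/IP6CB space, now at the end */
                          struct inet_skb_parm    h4;
          #if IS_ENABLED(CONFIG_IPV6)
                          struct inet6_skb_parm   h6;
          #endif
                  } header;
          };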
    • ipv4: rename ip_options_echo to __ip_options_echo() · 24a2d43d
      Eric Dumazet authored
      ip_options_echo() assumes the struct ip_options is provided in &IPCB(skb)->opt.
      Let's break this assumption, but provide a helper so we do not have to change
      all call points.
      
      ip_send_unicast_reply() gets a new struct ip_options pointer.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  22. 28 September 2014, 1 commit
  23. 23 September 2014, 1 commit
  24. 16 September 2014, 1 commit
    • tcp: use TCP_SKB_CB(skb)->tcp_flags in input path · e11ecddf
      Eric Dumazet authored
      The TCP input path does not currently use TCP_SKB_CB(skb)->tcp_flags,
      which is only used in the output path.
      
      tcp_recvmsg() looks at tcp_hdr(skb)->syn for every skb found in the receive
      queue, which is unfortunate because this bit is located in a cache line
      right before the payload.
      
      We can simplify TCP by copying the tcp flags into TCP_SKB_CB(skb)->tcp_flags.
      
      This patch does so, and avoids the cache line miss in tcp_recvmsg()
      (see the sketch after this entry).
      
      Following patches will
      - allow a segment with FIN to be coalesced in tcp_try_coalesce()
      - simplify tcp_collapse() by not copying the headers.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Acked-by: Neal Cardwell <ncardwell@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
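      A hedged before/after sketch of the check in tcp_recvmsg() described
      above; the surrounding code is omitted:

          /* Before: re-reads the TCP header, which lives in a cache line right
           * before the payload.
           */
          if (tcp_hdr(skb)->syn)
                  offset--;

          /* After: tests the flag already copied into the skb control block,
           * which the receive path touches anyway.
           */
          if (TCP_SKB_CB(skb)->tcp_flags & TCPHDR_SYN)
                  offset--;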
  25. 10 September 2014, 1 commit
    • tcp: remove dst refcount false sharing for prequeue mode · ca777eff
      Eric Dumazet authored
      Alexander Duyck reported high false sharing on the dst refcount in the tcp
      stack when prequeue is used. prequeue is the mechanism used when a thread is
      blocked in recvmsg()/read() on a TCP socket, using a blocking model
      rather than the non-blocking select()/poll()/epoll() one.
      
      We already try to use RCU in input path as much as possible, but we were
      forced to take a refcount on the dst when skb escaped RCU protected
      region. When/if the user thread runs on different cpu, dst_release()
      will then touch dst refcount again.
      
      Commit 09316255 (tcp: force a dst refcount when prequeue packet)
      was an example of a race fix.
      
      It turns out the only remaining usage of skb->dst for a packet stored
      in a TCP socket prequeue is IP early demux.
      
      We can add logic to detect when IP early demux is probably going
      to use skb->dst. Because we do an optimistic check rather than duplicating
      existing logic, we need to guard inet_sk_rx_dst_set() and
      inet6_sk_rx_dst_set() from using a NULL dst.
      
      Many thanks to Alexander for providing a nice bug report, git bisection,
      and reproducer.
      
      Tested using Alexander's script on a 40Gb NIC with 8 RX queues.
      Hosts have 24 cores, 48 hyper threads.
      
      echo 0 >/proc/sys/net/ipv4/tcp_autocorking
      
      for i in `seq 0 47`
      do
        for j in `seq 0 2`
        do
           netperf -H $DEST -t TCP_STREAM -l 1000 \
                   -c -C -T $i,$i -P 0 -- \
                   -m 64 -s 64K -D &
        done
      done
      
      Before patch : ~6Mpps and ~95% cpu usage on receiver
      After patch : ~9Mpps and ~35% cpu usage on receiver.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Reported-by: Alexander Duyck <alexander.h.duyck@intel.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  26. 06 September 2014, 1 commit