1. 26 5月, 2020 1 次提交
    • E
      tcp: allow traceroute -Mtcp for unpriv users · 45af29ca
      Eric Dumazet 提交于
      Unpriv users can use traceroute over plain UDP sockets, but not TCP ones.
      
      $ traceroute -Mtcp 8.8.8.8
      You do not have enough privileges to use this traceroute method.
      
      $ traceroute -n -Mudp 8.8.8.8
      traceroute to 8.8.8.8 (8.8.8.8), 30 hops max, 60 byte packets
       1  192.168.86.1  3.631 ms  3.512 ms  3.405 ms
       2  10.1.10.1  4.183 ms  4.125 ms  4.072 ms
       3  96.120.88.125  20.621 ms  19.462 ms  20.553 ms
       4  96.110.177.65  24.271 ms  25.351 ms  25.250 ms
       5  69.139.199.197  44.492 ms  43.075 ms  44.346 ms
       6  68.86.143.93  27.969 ms  25.184 ms  25.092 ms
       7  96.112.146.18  25.323 ms 96.112.146.22  25.583 ms 96.112.146.26  24.502 ms
       8  72.14.239.204  24.405 ms 74.125.37.224  16.326 ms  17.194 ms
       9  209.85.251.9  18.154 ms 209.85.247.55  14.449 ms 209.85.251.9  26.296 ms^C
      
      We can easily support traceroute over TCP, by queueing an error message
      into socket error queue.
      
      Note that applications need to set IP_RECVERR/IPV6_RECVERR option to
      enable this feature, and that the error message is only queued
      while in SYN_SNT state.
      
      socket(AF_INET6, SOCK_STREAM, IPPROTO_IP) = 3
      setsockopt(3, SOL_IPV6, IPV6_RECVERR, [1], 4) = 0
      setsockopt(3, SOL_SOCKET, SO_TIMESTAMP_OLD, [1], 4) = 0
      setsockopt(3, SOL_IPV6, IPV6_UNICAST_HOPS, [5], 4) = 0
      connect(3, {sa_family=AF_INET6, sin6_port=htons(8787), sin6_flowinfo=htonl(0),
              inet_pton(AF_INET6, "2002:a05:6608:297::", &sin6_addr), sin6_scope_id=0}, 28) = -1 EHOSTUNREACH (No route to host)
      recvmsg(3, {msg_name={sa_family=AF_INET6, sin6_port=htons(8787), sin6_flowinfo=htonl(0),
              inet_pton(AF_INET6, "2002:a05:6608:297::", &sin6_addr), sin6_scope_id=0},
              msg_namelen=1024->28, msg_iov=[{iov_base="`\r\337\320\0004\6\1&\7\370\260\200\231\16\27\0\0\0\0\0\0\0\0 \2\n\5f\10\2\227"..., iov_len=1024}],
              msg_iovlen=1, msg_control=[{cmsg_len=32, cmsg_level=SOL_SOCKET, cmsg_type=SO_TIMESTAMP_OLD, cmsg_data={tv_sec=1590340680, tv_usec=272424}},
                                         {cmsg_len=60, cmsg_level=SOL_IPV6, cmsg_type=IPV6_RECVERR}],
              msg_controllen=96, msg_flags=MSG_ERRQUEUE}, MSG_ERRQUEUE) = 144
      
      Suggested-by: Maciej Żenczykowski <maze@google.com
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Cc: Willem de Bruijn <willemb@google.com>
      Reviewed-by: NMaciej Żenczykowski <maze@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      45af29ca
  2. 01 5月, 2020 1 次提交
  3. 13 3月, 2020 1 次提交
  4. 25 2月, 2020 1 次提交
  5. 10 1月, 2020 2 次提交
  6. 03 1月, 2020 4 次提交
  7. 14 12月, 2019 1 次提交
  8. 10 12月, 2019 1 次提交
    • K
      net-tcp: Disable TCP ssthresh metrics cache by default · 65e6d901
      Kevin(Yudong) Yang 提交于
      This patch introduces a sysctl knob "net.ipv4.tcp_no_ssthresh_metrics_save"
      that disables TCP ssthresh metrics cache by default. Other parts of TCP
      metrics cache, e.g. rtt, cwnd, remain unchanged.
      
      As modern networks becoming more and more dynamic, TCP metrics cache
      today often causes more harm than benefits. For example, the same IP
      address is often shared by different subscribers behind NAT in residential
      networks. Even if the IP address is not shared by different users,
      caching the slow-start threshold of a previous short flow using loss-based
      congestion control (e.g. cubic) often causes the future longer flows of
      the same network path to exit slow-start prematurely with abysmal
      throughput.
      
      Caching ssthresh is very risky and can lead to terrible performance.
      Therefore it makes sense to make disabling ssthresh caching by
      default and opt-in for specific networks by the administrators.
      This practice also has worked well for several years of deployment with
      CUBIC congestion control at Google.
      Acked-by: NEric Dumazet <edumazet@google.com>
      Acked-by: NNeal Cardwell <ncardwell@google.com>
      Acked-by: NYuchung Cheng <ycheng@google.com>
      Signed-off-by: NKevin(Yudong) Yang <yyd@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      65e6d901
  9. 07 11月, 2019 1 次提交
  10. 02 11月, 2019 1 次提交
    • E
      inet: stop leaking jiffies on the wire · a904a069
      Eric Dumazet 提交于
      Historically linux tried to stick to RFC 791, 1122, 2003
      for IPv4 ID field generation.
      
      RFC 6864 made clear that no matter how hard we try,
      we can not ensure unicity of IP ID within maximum
      lifetime for all datagrams with a given source
      address/destination address/protocol tuple.
      
      Linux uses a per socket inet generator (inet_id), initialized
      at connection startup with a XOR of 'jiffies' and other
      fields that appear clear on the wire.
      
      Thiemo Nagel pointed that this strategy is a privacy
      concern as this provides 16 bits of entropy to fingerprint
      devices.
      
      Let's switch to a random starting point, this is just as
      good as far as RFC 6864 is concerned and does not leak
      anything critical.
      
      Fixes: 1da177e4 ("Linux-2.6.12-rc2")
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Reported-by: NThiemo Nagel <tnagel@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      a904a069
  11. 01 11月, 2019 1 次提交
  12. 14 10月, 2019 4 次提交
    • E
      tcp: annotate tp->write_seq lockless reads · 0f317464
      Eric Dumazet 提交于
      There are few places where we fetch tp->write_seq while
      this field can change from IRQ or other cpu.
      
      We need to add READ_ONCE() annotations, and also make
      sure write sides use corresponding WRITE_ONCE() to avoid
      store-tearing.
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      0f317464
    • E
      tcp: annotate tp->copied_seq lockless reads · 7db48e98
      Eric Dumazet 提交于
      There are few places where we fetch tp->copied_seq while
      this field can change from IRQ or other cpu.
      
      We need to add READ_ONCE() annotations, and also make
      sure write sides use corresponding WRITE_ONCE() to avoid
      store-tearing.
      
      Note that tcp_inq_hint() was already using READ_ONCE(tp->copied_seq)
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      7db48e98
    • E
      tcp: annotate tp->rcv_nxt lockless reads · dba7d9b8
      Eric Dumazet 提交于
      There are few places where we fetch tp->rcv_nxt while
      this field can change from IRQ or other cpu.
      
      We need to add READ_ONCE() annotations, and also make
      sure write sides use corresponding WRITE_ONCE() to avoid
      store-tearing.
      
      Note that tcp_inq_hint() was already using READ_ONCE(tp->rcv_nxt)
      
      syzbot reported :
      
      BUG: KCSAN: data-race in tcp_poll / tcp_queue_rcv
      
      write to 0xffff888120425770 of 4 bytes by interrupt on cpu 0:
       tcp_rcv_nxt_update net/ipv4/tcp_input.c:3365 [inline]
       tcp_queue_rcv+0x180/0x380 net/ipv4/tcp_input.c:4638
       tcp_rcv_established+0xbf1/0xf50 net/ipv4/tcp_input.c:5616
       tcp_v4_do_rcv+0x381/0x4e0 net/ipv4/tcp_ipv4.c:1542
       tcp_v4_rcv+0x1a03/0x1bf0 net/ipv4/tcp_ipv4.c:1923
       ip_protocol_deliver_rcu+0x51/0x470 net/ipv4/ip_input.c:204
       ip_local_deliver_finish+0x110/0x140 net/ipv4/ip_input.c:231
       NF_HOOK include/linux/netfilter.h:305 [inline]
       NF_HOOK include/linux/netfilter.h:299 [inline]
       ip_local_deliver+0x133/0x210 net/ipv4/ip_input.c:252
       dst_input include/net/dst.h:442 [inline]
       ip_rcv_finish+0x121/0x160 net/ipv4/ip_input.c:413
       NF_HOOK include/linux/netfilter.h:305 [inline]
       NF_HOOK include/linux/netfilter.h:299 [inline]
       ip_rcv+0x18f/0x1a0 net/ipv4/ip_input.c:523
       __netif_receive_skb_one_core+0xa7/0xe0 net/core/dev.c:5004
       __netif_receive_skb+0x37/0xf0 net/core/dev.c:5118
       netif_receive_skb_internal+0x59/0x190 net/core/dev.c:5208
       napi_skb_finish net/core/dev.c:5671 [inline]
       napi_gro_receive+0x28f/0x330 net/core/dev.c:5704
       receive_buf+0x284/0x30b0 drivers/net/virtio_net.c:1061
      
      read to 0xffff888120425770 of 4 bytes by task 7254 on cpu 1:
       tcp_stream_is_readable net/ipv4/tcp.c:480 [inline]
       tcp_poll+0x204/0x6b0 net/ipv4/tcp.c:554
       sock_poll+0xed/0x250 net/socket.c:1256
       vfs_poll include/linux/poll.h:90 [inline]
       ep_item_poll.isra.0+0x90/0x190 fs/eventpoll.c:892
       ep_send_events_proc+0x113/0x5c0 fs/eventpoll.c:1749
       ep_scan_ready_list.constprop.0+0x189/0x500 fs/eventpoll.c:704
       ep_send_events fs/eventpoll.c:1793 [inline]
       ep_poll+0xe3/0x900 fs/eventpoll.c:1930
       do_epoll_wait+0x162/0x180 fs/eventpoll.c:2294
       __do_sys_epoll_pwait fs/eventpoll.c:2325 [inline]
       __se_sys_epoll_pwait fs/eventpoll.c:2311 [inline]
       __x64_sys_epoll_pwait+0xcd/0x170 fs/eventpoll.c:2311
       do_syscall_64+0xcf/0x2f0 arch/x86/entry/common.c:296
       entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
      Reported by Kernel Concurrency Sanitizer on:
      CPU: 1 PID: 7254 Comm: syz-fuzzer Not tainted 5.3.0+ #0
      Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Reported-by: Nsyzbot <syzkaller@googlegroups.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      dba7d9b8
    • E
      tcp: add rcu protection around tp->fastopen_rsk · d983ea6f
      Eric Dumazet 提交于
      Both tcp_v4_err() and tcp_v6_err() do the following operations
      while they do not own the socket lock :
      
      	fastopen = tp->fastopen_rsk;
       	snd_una = fastopen ? tcp_rsk(fastopen)->snt_isn : tp->snd_una;
      
      The problem is that without appropriate barrier, the compiler
      might reload tp->fastopen_rsk and trigger a NULL deref.
      
      request sockets are protected by RCU, we can simply add
      the missing annotations and barriers to solve the issue.
      
      Fixes: 168a8f58 ("tcp: TCP Fast Open Server - main code path")
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      d983ea6f
  13. 10 10月, 2019 1 次提交
    • E
      net: silence KCSAN warnings around sk_add_backlog() calls · 8265792b
      Eric Dumazet 提交于
      sk_add_backlog() callers usually read sk->sk_rcvbuf without
      owning the socket lock. This means sk_rcvbuf value can
      be changed by other cpus, and KCSAN complains.
      
      Add READ_ONCE() annotations to document the lockless nature
      of these reads.
      
      Note that writes over sk_rcvbuf should also use WRITE_ONCE(),
      but this will be done in separate patches to ease stable
      backports (if we decide this is relevant for stable trees).
      
      BUG: KCSAN: data-race in tcp_add_backlog / tcp_recvmsg
      
      write to 0xffff88812ab369f8 of 8 bytes by interrupt on cpu 1:
       __sk_add_backlog include/net/sock.h:902 [inline]
       sk_add_backlog include/net/sock.h:933 [inline]
       tcp_add_backlog+0x45a/0xcc0 net/ipv4/tcp_ipv4.c:1737
       tcp_v4_rcv+0x1aba/0x1bf0 net/ipv4/tcp_ipv4.c:1925
       ip_protocol_deliver_rcu+0x51/0x470 net/ipv4/ip_input.c:204
       ip_local_deliver_finish+0x110/0x140 net/ipv4/ip_input.c:231
       NF_HOOK include/linux/netfilter.h:305 [inline]
       NF_HOOK include/linux/netfilter.h:299 [inline]
       ip_local_deliver+0x133/0x210 net/ipv4/ip_input.c:252
       dst_input include/net/dst.h:442 [inline]
       ip_rcv_finish+0x121/0x160 net/ipv4/ip_input.c:413
       NF_HOOK include/linux/netfilter.h:305 [inline]
       NF_HOOK include/linux/netfilter.h:299 [inline]
       ip_rcv+0x18f/0x1a0 net/ipv4/ip_input.c:523
       __netif_receive_skb_one_core+0xa7/0xe0 net/core/dev.c:5004
       __netif_receive_skb+0x37/0xf0 net/core/dev.c:5118
       netif_receive_skb_internal+0x59/0x190 net/core/dev.c:5208
       napi_skb_finish net/core/dev.c:5671 [inline]
       napi_gro_receive+0x28f/0x330 net/core/dev.c:5704
       receive_buf+0x284/0x30b0 drivers/net/virtio_net.c:1061
       virtnet_receive drivers/net/virtio_net.c:1323 [inline]
       virtnet_poll+0x436/0x7d0 drivers/net/virtio_net.c:1428
       napi_poll net/core/dev.c:6352 [inline]
       net_rx_action+0x3ae/0xa50 net/core/dev.c:6418
      
      read to 0xffff88812ab369f8 of 8 bytes by task 7271 on cpu 0:
       tcp_recvmsg+0x470/0x1a30 net/ipv4/tcp.c:2047
       inet_recvmsg+0xbb/0x250 net/ipv4/af_inet.c:838
       sock_recvmsg_nosec net/socket.c:871 [inline]
       sock_recvmsg net/socket.c:889 [inline]
       sock_recvmsg+0x92/0xb0 net/socket.c:885
       sock_read_iter+0x15f/0x1e0 net/socket.c:967
       call_read_iter include/linux/fs.h:1864 [inline]
       new_sync_read+0x389/0x4f0 fs/read_write.c:414
       __vfs_read+0xb1/0xc0 fs/read_write.c:427
       vfs_read fs/read_write.c:461 [inline]
       vfs_read+0x143/0x2c0 fs/read_write.c:446
       ksys_read+0xd5/0x1b0 fs/read_write.c:587
       __do_sys_read fs/read_write.c:597 [inline]
       __se_sys_read fs/read_write.c:595 [inline]
       __x64_sys_read+0x4c/0x60 fs/read_write.c:595
       do_syscall_64+0xcf/0x2f0 arch/x86/entry/common.c:296
       entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
      Reported by Kernel Concurrency Sanitizer on:
      CPU: 0 PID: 7271 Comm: syz-fuzzer Not tainted 5.3.0+ #0
      Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Reported-by: Nsyzbot <syzkaller@googlegroups.com>
      Signed-off-by: NJakub Kicinski <jakub.kicinski@netronome.com>
      8265792b
  14. 02 10月, 2019 2 次提交
  15. 27 9月, 2019 1 次提交
    • E
      tcp: honor SO_PRIORITY in TIME_WAIT state · f6c0f5d2
      Eric Dumazet 提交于
      ctl packets sent on behalf of TIME_WAIT sockets currently
      have a zero skb->priority, which can cause various problems.
      
      In this patch we :
      
      - add a tw_priority field in struct inet_timewait_sock.
      
      - populate it from sk->sk_priority when a TIME_WAIT is created.
      
      - For IPv4, change ip_send_unicast_reply() and its two
        callers to propagate tw_priority correctly.
        ip_send_unicast_reply() no longer changes sk->sk_priority.
      
      - For IPv6, make sure TIME_WAIT sockets pass their tw_priority
        field to tcp_v6_send_response() and tcp_v6_send_ack().
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      f6c0f5d2
  16. 10 8月, 2019 1 次提交
    • J
      tcp: add new tcp_mtu_probe_floor sysctl · c04b79b6
      Josh Hunt 提交于
      The current implementation of TCP MTU probing can considerably
      underestimate the MTU on lossy connections allowing the MSS to get down to
      48. We have found that in almost all of these cases on our networks these
      paths can handle much larger MTUs meaning the connections are being
      artificially limited. Even though TCP MTU probing can raise the MSS back up
      we have seen this not to be the case causing connections to be "stuck" with
      an MSS of 48 when heavy loss is present.
      
      Prior to pushing out this change we could not keep TCP MTU probing enabled
      b/c of the above reasons. Now with a reasonble floor set we've had it
      enabled for the past 6 months.
      
      The new sysctl will still default to TCP_MIN_SND_MSS (48), but gives
      administrators the ability to control the floor of MSS probing.
      Signed-off-by: NJosh Hunt <johunt@akamai.com>
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Acked-by: NNeal Cardwell <ncardwell@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      c04b79b6
  17. 31 7月, 2019 1 次提交
  18. 16 6月, 2019 1 次提交
    • E
      tcp: add tcp_min_snd_mss sysctl · 5f3e2bf0
      Eric Dumazet 提交于
      Some TCP peers announce a very small MSS option in their SYN and/or
      SYN/ACK messages.
      
      This forces the stack to send packets with a very high network/cpu
      overhead.
      
      Linux has enforced a minimal value of 48. Since this value includes
      the size of TCP options, and that the options can consume up to 40
      bytes, this means that each segment can include only 8 bytes of payload.
      
      In some cases, it can be useful to increase the minimal value
      to a saner value.
      
      We still let the default to 48 (TCP_MIN_SND_MSS), for compatibility
      reasons.
      
      Note that TCP_MAXSEG socket option enforces a minimal value
      of (TCP_MIN_MSS). David Miller increased this minimal value
      in commit c39508d6 ("tcp: Make TCP_MAXSEG minimum more correct.")
      from 64 to 88.
      
      We might in the future merge TCP_MIN_SND_MSS and TCP_MIN_MSS.
      
      CVE-2019-11479 -- tcp mss hardcoded to 48
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Suggested-by: NJonathan Looney <jtl@netflix.com>
      Acked-by: NNeal Cardwell <ncardwell@google.com>
      Cc: Yuchung Cheng <ycheng@google.com>
      Cc: Tyler Hicks <tyhicks@canonical.com>
      Cc: Bruce Curtis <brucec@netflix.com>
      Cc: Jonathan Lemon <jonathan.lemon@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      5f3e2bf0
  19. 15 6月, 2019 1 次提交
  20. 13 6月, 2019 1 次提交
    • E
      tcp: add optional per socket transmit delay · a842fe14
      Eric Dumazet 提交于
      Adding delays to TCP flows is crucial for studying behavior
      of TCP stacks, including congestion control modules.
      
      Linux offers netem module, but it has unpractical constraints :
      - Need root access to change qdisc
      - Hard to setup on egress if combined with non trivial qdisc like FQ
      - Single delay for all flows.
      
      EDT (Earliest Departure Time) adoption in TCP stack allows us
      to enable a per socket delay at a very small cost.
      
      Networking tools can now establish thousands of flows, each of them
      with a different delay, simulating real world conditions.
      
      This requires FQ packet scheduler or a EDT-enabled NIC.
      
      This patchs adds TCP_TX_DELAY socket option, to set a delay in
      usec units.
      
        unsigned int tx_delay = 10000; /* 10 msec */
      
        setsockopt(fd, SOL_TCP, TCP_TX_DELAY, &tx_delay, sizeof(tx_delay));
      
      Note that FQ packet scheduler limits might need some tweaking :
      
      man tc-fq
      
      PARAMETERS
         limit
             Hard  limit  on  the  real  queue  size. When this limit is
             reached, new packets are dropped. If the value is  lowered,
             packets  are  dropped so that the new limit is met. Default
             is 10000 packets.
      
         flow_limit
             Hard limit on the maximum  number  of  packets  queued  per
             flow.  Default value is 100.
      
      Use of TCP_TX_DELAY option will increase number of skbs in FQ qdisc,
      so packets would be dropped if any of the previous limit is hit.
      
      Use of a jump label makes this support runtime-free, for hosts
      never using the option.
      
      Also note that TSQ (TCP Small Queues) limits are slightly changed
      with this patch : we need to account that skbs artificially delayed
      wont stop us providind more skbs to feed the pipe (netem uses
      skb_orphan_partial() for this purpose, but FQ can not use this trick)
      
      Because of that, using big delays might very well trigger
      old bugs in TSO auto defer logic and/or sndbuf limited detection.
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      a842fe14
  21. 04 6月, 2019 1 次提交
  22. 31 5月, 2019 1 次提交
  23. 30 4月, 2019 1 次提交
    • E
      tcp: add sanity tests in tcp_add_backlog() · ca2fe295
      Eric Dumazet 提交于
      Richard and Bruno both reported that my commit added a bug,
      and Bruno was able to determine the problem came when a segment
      wih a FIN packet was coalesced to a prior one in tcp backlog queue.
      
      It turns out the header prediction in tcp_rcv_established()
      looks back to TCP headers in the packet, not in the metadata
      (aka TCP_SKB_CB(skb)->tcp_flags)
      
      The fast path in tcp_rcv_established() is not supposed to
      handle a FIN flag (it does not call tcp_fin())
      
      Therefore we need to make sure to propagate the FIN flag,
      so that the coalesced packet does not go through the fast path,
      the same than a GRO packet carrying a FIN flag.
      
      While we are at it, make sure we do not coalesce packets with
      RST or SYN, or if they do not have ACK set.
      
      Many thanks to Richard and Bruno for pinpointing the bad commit,
      and to Richard for providing a first version of the fix.
      
      Fixes: 4f693b55 ("tcp: implement coalescing on backlog queue")
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Reported-by: NRichard Purdie <richard.purdie@linuxfoundation.org>
      Reported-by: NBruno Prémont <bonbons@sysophe.eu>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      ca2fe295
  24. 02 4月, 2019 1 次提交
  25. 24 3月, 2019 1 次提交
    • E
      tcp: add one skb cache for rx · 8b27dae5
      Eric Dumazet 提交于
      Often times, recvmsg() system calls and BH handling for a particular
      TCP socket are done on different cpus.
      
      This means the incoming skb had to be allocated on a cpu,
      but freed on another.
      
      This incurs a high spinlock contention in slab layer for small rpc,
      but also a high number of cache line ping pongs for larger packets.
      
      A full size GRO packet might use 45 page fragments, meaning
      that up to 45 put_page() can be involved.
      
      More over performing the __kfree_skb() in the recvmsg() context
      adds a latency for user applications, and increase probability
      of trapping them in backlog processing, since the BH handler
      might found the socket owned by the user.
      
      This patch, combined with the prior one increases the rpc
      performance by about 10 % on servers with large number of cores.
      
      (tcp_rr workload with 10,000 flows and 112 threads reach 9 Mpps
       instead of 8 Mpps)
      
      This also increases single bulk flow performance on 40Gbit+ links,
      since in this case there are often two cpus working in tandem :
      
       - CPU handling the NIC rx interrupts, feeding the receive queue,
        and (after this patch) freeing the skbs that were consumed.
      
       - CPU in recvmsg() system call, essentially 100 % busy copying out
        data to user space.
      
      Having at most one skb in a per-socket cache has very little risk
      of memory exhaustion, and since it is protected by socket lock,
      its management is essentially free.
      
      Note that if rps/rfs is used, we do not enable this feature, because
      there is high chance that the same cpu is handling both the recvmsg()
      system call and the TCP rx path, but that another cpu did the skb
      allocations in the device driver right before the RPS/RFS logic.
      
      To properly handle this case, it seems we would need to record
      on which cpu skb was allocated, and use a different channel
      to give skbs back to this cpu.
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Acked-by: NSoheil Hassas Yeganeh <soheil@google.com>
      Acked-by: NWillem de Bruijn <willemb@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      8b27dae5
  26. 12 3月, 2019 1 次提交
    • C
      tcp: Don't access TCP_SKB_CB before initializing it · f2feaefd
      Christoph Paasch 提交于
      Since commit eeea10b8 ("tcp: add
      tcp_v4_fill_cb()/tcp_v4_restore_cb()"), tcp_vX_fill_cb is only called
      after tcp_filter(). That means, TCP_SKB_CB(skb)->end_seq still points to
      the IP-part of the cb.
      
      We thus should not mock with it, as this can trigger bugs (thanks
      syzkaller):
      [   12.349396] ==================================================================
      [   12.350188] BUG: KASAN: slab-out-of-bounds in ip6_datagram_recv_specific_ctl+0x19b3/0x1a20
      [   12.351035] Read of size 1 at addr ffff88006adbc208 by task test_ip6_datagr/1799
      
      Setting end_seq is actually no more necessary in tcp_filter as it gets
      initialized later on in tcp_vX_fill_cb.
      
      Cc: Eric Dumazet <edumazet@google.com>
      Fixes: eeea10b8 ("tcp: add tcp_v4_fill_cb()/tcp_v4_restore_cb()")
      Signed-off-by: NChristoph Paasch <cpaasch@apple.com>
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      f2feaefd
  27. 27 2月, 2019 1 次提交
  28. 18 2月, 2019 1 次提交
  29. 28 1月, 2019 1 次提交
  30. 01 12月, 2018 2 次提交
  31. 21 11月, 2018 1 次提交
    • E
      tcp: drop dst in tcp_add_backlog() · ade9628e
      Eric Dumazet 提交于
      Under stress, softirq rx handler often hits a socket owned by the user,
      and has to queue the packet into socket backlog.
      
      When this happens, skb dst refcount is taken before we escape rcu
      protected region. This is done from __sk_add_backlog() calling
      skb_dst_force().
      
      Consumer will have to perform the opposite costly operation.
      
      AFAIK nothing in tcp stack requests the dst after skb was stored
      in the backlog. If this was the case, we would have had failures
      already since skb_dst_force() can end up clearing skb dst anyway.
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      ade9628e