1. 11 9月, 2020 2 次提交
  2. 25 8月, 2020 2 次提交
    • R
      net: ipv4: delete repeated words · 2bdcc73c
      Randy Dunlap 提交于
      Drop duplicate words in comments in net/ipv4/.
      Signed-off-by: NRandy Dunlap <rdunlap@infradead.org>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Jakub Kicinski <kuba@kernel.org>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      2bdcc73c
    • M
      bpf: tcp: Add bpf_skops_hdr_opt_len() and bpf_skops_write_hdr_opt() · 331fca43
      Martin KaFai Lau 提交于
      The bpf prog needs to parse the SYN header to learn what options have
      been sent by the peer's bpf-prog before writing its options into SYNACK.
      This patch adds a "syn_skb" arg to tcp_make_synack() and send_synack().
      This syn_skb will eventually be made available (as read-only) to the
      bpf prog.  This will be the only SYN packet available to the bpf
      prog during syncookie.  For other regular cases, the bpf prog can
      also use the saved_syn.
      
      When writing options, the bpf prog will first be called to tell the
      kernel its required number of bytes.  It is done by the new
      bpf_skops_hdr_opt_len().  The bpf prog will only be called when the new
      BPF_SOCK_OPS_WRITE_HDR_OPT_CB_FLAG is set in tp->bpf_sock_ops_cb_flags.
      When the bpf prog returns, the kernel will know how many bytes are needed
      and then update the "*remaining" arg accordingly.  4 byte alignment will
      be included in the "*remaining" before this function returns.  The 4 byte
      aligned number of bytes will also be stored into the opts->bpf_opt_len.
      "bpf_opt_len" is a newly added member to the struct tcp_out_options.
      
      Then the new bpf_skops_write_hdr_opt() will call the bpf prog to write the
      header options.  The bpf prog is only called if it has reserved spaces
      before (opts->bpf_opt_len > 0).
      
      The bpf prog is the last one getting a chance to reserve header space
      and writing the header option.
      
      These two functions are half implemented to highlight the changes in
      TCP stack.  The actual codes preparing the bpf running context and
      invoking the bpf prog will be added in the later patch with other
      necessary bpf pieces.
      Signed-off-by: NMartin KaFai Lau <kafai@fb.com>
      Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
      Reviewed-by: NEric Dumazet <edumazet@google.com>
      Link: https://lore.kernel.org/bpf/20200820190052.2885316-1-kafai@fb.com
      331fca43
  3. 26 7月, 2020 2 次提交
  4. 25 7月, 2020 1 次提交
  5. 22 7月, 2020 1 次提交
  6. 20 7月, 2020 2 次提交
  7. 02 7月, 2020 1 次提交
    • E
      tcp: md5: refine tcp_md5_do_add()/tcp_md5_hash_key() barriers · e6ced831
      Eric Dumazet 提交于
      My prior fix went a bit too far, according to Herbert and Mathieu.
      
      Since we accept that concurrent TCP MD5 lookups might see inconsistent
      keys, we can use READ_ONCE()/WRITE_ONCE() instead of smp_rmb()/smp_wmb()
      
      Clearing all key->key[] is needed to avoid possible KMSAN reports,
      if key->keylen is increased. Since tcp_md5_do_add() is not fast path,
      using __GFP_ZERO to clear all struct tcp_md5sig_key is simpler.
      
      data_race() was added in linux-5.8 and will prevent KCSAN reports,
      this can safely be removed in stable backports, if data_race() is
      not yet backported.
      
      v2: use data_race() both in tcp_md5_hash_key() and tcp_md5_do_add()
      
      Fixes: 6a2febec ("tcp: md5: add missing memory barriers in tcp_md5_do_add()/tcp_md5_hash_key()")
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
      Cc: Herbert Xu <herbert@gondor.apana.org.au>
      Cc: Marco Elver <elver@google.com>
      Reviewed-by: NMathieu Desnoyers <mathieu.desnoyers@efficios.com>
      Acked-by: NHerbert Xu <herbert@gondor.apana.org.au>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      e6ced831
  8. 01 7月, 2020 1 次提交
    • E
      tcp: md5: add missing memory barriers in tcp_md5_do_add()/tcp_md5_hash_key() · 6a2febec
      Eric Dumazet 提交于
      MD5 keys are read with RCU protection, and tcp_md5_do_add()
      might update in-place a prior key.
      
      Normally, typical RCU updates would allocate a new piece
      of memory. In this case only key->key and key->keylen might
      be updated, and we do not care if an incoming packet could
      see the old key, the new one, or some intermediate value,
      since changing the key on a live flow is known to be problematic
      anyway.
      
      We only want to make sure that in the case key->keylen
      is changed, cpus in tcp_md5_hash_key() wont try to use
      uninitialized data, or crash because key->keylen was
      read twice to feed sg_init_one() and ahash_request_set_crypt()
      
      Fixes: 9ea88a15 ("tcp: md5: check md5 signature without socket lock")
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      6a2febec
  9. 25 6月, 2020 2 次提交
  10. 29 5月, 2020 1 次提交
    • E
      tcp: ipv6: support RFC 6069 (TCP-LD) · d2924569
      Eric Dumazet 提交于
      Make tcp_ld_RTO_revert() helper available to IPv6, and
      implement RFC 6069 :
      
      Quoting this RFC :
      
      3. Connectivity Disruption Indication
      
         For Internet Protocol version 6 (IPv6) [RFC2460], the counterpart of
         the ICMP destination unreachable message of code 0 (net unreachable)
         and of code 1 (host unreachable) is the ICMPv6 destination
         unreachable message of code 0 (no route to destination) [RFC4443].
         As with IPv4, a router should generate an ICMPv6 destination
         unreachable message of code 0 in response to a packet that cannot be
         delivered to its destination address because it lacks a matching
         entry in its routing table.
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Acked-by: NYuchung Cheng <ycheng@google.com>
      Acked-by: NNeal Cardwell <ncardwell@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      d2924569
  11. 28 5月, 2020 2 次提交
  12. 27 5月, 2020 1 次提交
  13. 26 5月, 2020 1 次提交
    • E
      tcp: allow traceroute -Mtcp for unpriv users · 45af29ca
      Eric Dumazet 提交于
      Unpriv users can use traceroute over plain UDP sockets, but not TCP ones.
      
      $ traceroute -Mtcp 8.8.8.8
      You do not have enough privileges to use this traceroute method.
      
      $ traceroute -n -Mudp 8.8.8.8
      traceroute to 8.8.8.8 (8.8.8.8), 30 hops max, 60 byte packets
       1  192.168.86.1  3.631 ms  3.512 ms  3.405 ms
       2  10.1.10.1  4.183 ms  4.125 ms  4.072 ms
       3  96.120.88.125  20.621 ms  19.462 ms  20.553 ms
       4  96.110.177.65  24.271 ms  25.351 ms  25.250 ms
       5  69.139.199.197  44.492 ms  43.075 ms  44.346 ms
       6  68.86.143.93  27.969 ms  25.184 ms  25.092 ms
       7  96.112.146.18  25.323 ms 96.112.146.22  25.583 ms 96.112.146.26  24.502 ms
       8  72.14.239.204  24.405 ms 74.125.37.224  16.326 ms  17.194 ms
       9  209.85.251.9  18.154 ms 209.85.247.55  14.449 ms 209.85.251.9  26.296 ms^C
      
      We can easily support traceroute over TCP, by queueing an error message
      into socket error queue.
      
      Note that applications need to set IP_RECVERR/IPV6_RECVERR option to
      enable this feature, and that the error message is only queued
      while in SYN_SNT state.
      
      socket(AF_INET6, SOCK_STREAM, IPPROTO_IP) = 3
      setsockopt(3, SOL_IPV6, IPV6_RECVERR, [1], 4) = 0
      setsockopt(3, SOL_SOCKET, SO_TIMESTAMP_OLD, [1], 4) = 0
      setsockopt(3, SOL_IPV6, IPV6_UNICAST_HOPS, [5], 4) = 0
      connect(3, {sa_family=AF_INET6, sin6_port=htons(8787), sin6_flowinfo=htonl(0),
              inet_pton(AF_INET6, "2002:a05:6608:297::", &sin6_addr), sin6_scope_id=0}, 28) = -1 EHOSTUNREACH (No route to host)
      recvmsg(3, {msg_name={sa_family=AF_INET6, sin6_port=htons(8787), sin6_flowinfo=htonl(0),
              inet_pton(AF_INET6, "2002:a05:6608:297::", &sin6_addr), sin6_scope_id=0},
              msg_namelen=1024->28, msg_iov=[{iov_base="`\r\337\320\0004\6\1&\7\370\260\200\231\16\27\0\0\0\0\0\0\0\0 \2\n\5f\10\2\227"..., iov_len=1024}],
              msg_iovlen=1, msg_control=[{cmsg_len=32, cmsg_level=SOL_SOCKET, cmsg_type=SO_TIMESTAMP_OLD, cmsg_data={tv_sec=1590340680, tv_usec=272424}},
                                         {cmsg_len=60, cmsg_level=SOL_IPV6, cmsg_type=IPV6_RECVERR}],
              msg_controllen=96, msg_flags=MSG_ERRQUEUE}, MSG_ERRQUEUE) = 144
      
      Suggested-by: Maciej Żenczykowski <maze@google.com
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Cc: Willem de Bruijn <willemb@google.com>
      Reviewed-by: NMaciej Żenczykowski <maze@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      45af29ca
  14. 01 5月, 2020 1 次提交
  15. 13 3月, 2020 1 次提交
  16. 25 2月, 2020 1 次提交
  17. 10 1月, 2020 2 次提交
  18. 03 1月, 2020 4 次提交
  19. 14 12月, 2019 1 次提交
  20. 10 12月, 2019 1 次提交
    • K
      net-tcp: Disable TCP ssthresh metrics cache by default · 65e6d901
      Kevin(Yudong) Yang 提交于
      This patch introduces a sysctl knob "net.ipv4.tcp_no_ssthresh_metrics_save"
      that disables TCP ssthresh metrics cache by default. Other parts of TCP
      metrics cache, e.g. rtt, cwnd, remain unchanged.
      
      As modern networks becoming more and more dynamic, TCP metrics cache
      today often causes more harm than benefits. For example, the same IP
      address is often shared by different subscribers behind NAT in residential
      networks. Even if the IP address is not shared by different users,
      caching the slow-start threshold of a previous short flow using loss-based
      congestion control (e.g. cubic) often causes the future longer flows of
      the same network path to exit slow-start prematurely with abysmal
      throughput.
      
      Caching ssthresh is very risky and can lead to terrible performance.
      Therefore it makes sense to make disabling ssthresh caching by
      default and opt-in for specific networks by the administrators.
      This practice also has worked well for several years of deployment with
      CUBIC congestion control at Google.
      Acked-by: NEric Dumazet <edumazet@google.com>
      Acked-by: NNeal Cardwell <ncardwell@google.com>
      Acked-by: NYuchung Cheng <ycheng@google.com>
      Signed-off-by: NKevin(Yudong) Yang <yyd@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      65e6d901
  21. 07 11月, 2019 1 次提交
  22. 02 11月, 2019 1 次提交
    • E
      inet: stop leaking jiffies on the wire · a904a069
      Eric Dumazet 提交于
      Historically linux tried to stick to RFC 791, 1122, 2003
      for IPv4 ID field generation.
      
      RFC 6864 made clear that no matter how hard we try,
      we can not ensure unicity of IP ID within maximum
      lifetime for all datagrams with a given source
      address/destination address/protocol tuple.
      
      Linux uses a per socket inet generator (inet_id), initialized
      at connection startup with a XOR of 'jiffies' and other
      fields that appear clear on the wire.
      
      Thiemo Nagel pointed that this strategy is a privacy
      concern as this provides 16 bits of entropy to fingerprint
      devices.
      
      Let's switch to a random starting point, this is just as
      good as far as RFC 6864 is concerned and does not leak
      anything critical.
      
      Fixes: 1da177e4 ("Linux-2.6.12-rc2")
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Reported-by: NThiemo Nagel <tnagel@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      a904a069
  23. 01 11月, 2019 1 次提交
  24. 14 10月, 2019 4 次提交
    • E
      tcp: annotate tp->write_seq lockless reads · 0f317464
      Eric Dumazet 提交于
      There are few places where we fetch tp->write_seq while
      this field can change from IRQ or other cpu.
      
      We need to add READ_ONCE() annotations, and also make
      sure write sides use corresponding WRITE_ONCE() to avoid
      store-tearing.
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      0f317464
    • E
      tcp: annotate tp->copied_seq lockless reads · 7db48e98
      Eric Dumazet 提交于
      There are few places where we fetch tp->copied_seq while
      this field can change from IRQ or other cpu.
      
      We need to add READ_ONCE() annotations, and also make
      sure write sides use corresponding WRITE_ONCE() to avoid
      store-tearing.
      
      Note that tcp_inq_hint() was already using READ_ONCE(tp->copied_seq)
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      7db48e98
    • E
      tcp: annotate tp->rcv_nxt lockless reads · dba7d9b8
      Eric Dumazet 提交于
      There are few places where we fetch tp->rcv_nxt while
      this field can change from IRQ or other cpu.
      
      We need to add READ_ONCE() annotations, and also make
      sure write sides use corresponding WRITE_ONCE() to avoid
      store-tearing.
      
      Note that tcp_inq_hint() was already using READ_ONCE(tp->rcv_nxt)
      
      syzbot reported :
      
      BUG: KCSAN: data-race in tcp_poll / tcp_queue_rcv
      
      write to 0xffff888120425770 of 4 bytes by interrupt on cpu 0:
       tcp_rcv_nxt_update net/ipv4/tcp_input.c:3365 [inline]
       tcp_queue_rcv+0x180/0x380 net/ipv4/tcp_input.c:4638
       tcp_rcv_established+0xbf1/0xf50 net/ipv4/tcp_input.c:5616
       tcp_v4_do_rcv+0x381/0x4e0 net/ipv4/tcp_ipv4.c:1542
       tcp_v4_rcv+0x1a03/0x1bf0 net/ipv4/tcp_ipv4.c:1923
       ip_protocol_deliver_rcu+0x51/0x470 net/ipv4/ip_input.c:204
       ip_local_deliver_finish+0x110/0x140 net/ipv4/ip_input.c:231
       NF_HOOK include/linux/netfilter.h:305 [inline]
       NF_HOOK include/linux/netfilter.h:299 [inline]
       ip_local_deliver+0x133/0x210 net/ipv4/ip_input.c:252
       dst_input include/net/dst.h:442 [inline]
       ip_rcv_finish+0x121/0x160 net/ipv4/ip_input.c:413
       NF_HOOK include/linux/netfilter.h:305 [inline]
       NF_HOOK include/linux/netfilter.h:299 [inline]
       ip_rcv+0x18f/0x1a0 net/ipv4/ip_input.c:523
       __netif_receive_skb_one_core+0xa7/0xe0 net/core/dev.c:5004
       __netif_receive_skb+0x37/0xf0 net/core/dev.c:5118
       netif_receive_skb_internal+0x59/0x190 net/core/dev.c:5208
       napi_skb_finish net/core/dev.c:5671 [inline]
       napi_gro_receive+0x28f/0x330 net/core/dev.c:5704
       receive_buf+0x284/0x30b0 drivers/net/virtio_net.c:1061
      
      read to 0xffff888120425770 of 4 bytes by task 7254 on cpu 1:
       tcp_stream_is_readable net/ipv4/tcp.c:480 [inline]
       tcp_poll+0x204/0x6b0 net/ipv4/tcp.c:554
       sock_poll+0xed/0x250 net/socket.c:1256
       vfs_poll include/linux/poll.h:90 [inline]
       ep_item_poll.isra.0+0x90/0x190 fs/eventpoll.c:892
       ep_send_events_proc+0x113/0x5c0 fs/eventpoll.c:1749
       ep_scan_ready_list.constprop.0+0x189/0x500 fs/eventpoll.c:704
       ep_send_events fs/eventpoll.c:1793 [inline]
       ep_poll+0xe3/0x900 fs/eventpoll.c:1930
       do_epoll_wait+0x162/0x180 fs/eventpoll.c:2294
       __do_sys_epoll_pwait fs/eventpoll.c:2325 [inline]
       __se_sys_epoll_pwait fs/eventpoll.c:2311 [inline]
       __x64_sys_epoll_pwait+0xcd/0x170 fs/eventpoll.c:2311
       do_syscall_64+0xcf/0x2f0 arch/x86/entry/common.c:296
       entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
      Reported by Kernel Concurrency Sanitizer on:
      CPU: 1 PID: 7254 Comm: syz-fuzzer Not tainted 5.3.0+ #0
      Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Reported-by: Nsyzbot <syzkaller@googlegroups.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      dba7d9b8
    • E
      tcp: add rcu protection around tp->fastopen_rsk · d983ea6f
      Eric Dumazet 提交于
      Both tcp_v4_err() and tcp_v6_err() do the following operations
      while they do not own the socket lock :
      
      	fastopen = tp->fastopen_rsk;
       	snd_una = fastopen ? tcp_rsk(fastopen)->snt_isn : tp->snd_una;
      
      The problem is that without appropriate barrier, the compiler
      might reload tp->fastopen_rsk and trigger a NULL deref.
      
      request sockets are protected by RCU, we can simply add
      the missing annotations and barriers to solve the issue.
      
      Fixes: 168a8f58 ("tcp: TCP Fast Open Server - main code path")
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      d983ea6f
  25. 10 10月, 2019 1 次提交
    • E
      net: silence KCSAN warnings around sk_add_backlog() calls · 8265792b
      Eric Dumazet 提交于
      sk_add_backlog() callers usually read sk->sk_rcvbuf without
      owning the socket lock. This means sk_rcvbuf value can
      be changed by other cpus, and KCSAN complains.
      
      Add READ_ONCE() annotations to document the lockless nature
      of these reads.
      
      Note that writes over sk_rcvbuf should also use WRITE_ONCE(),
      but this will be done in separate patches to ease stable
      backports (if we decide this is relevant for stable trees).
      
      BUG: KCSAN: data-race in tcp_add_backlog / tcp_recvmsg
      
      write to 0xffff88812ab369f8 of 8 bytes by interrupt on cpu 1:
       __sk_add_backlog include/net/sock.h:902 [inline]
       sk_add_backlog include/net/sock.h:933 [inline]
       tcp_add_backlog+0x45a/0xcc0 net/ipv4/tcp_ipv4.c:1737
       tcp_v4_rcv+0x1aba/0x1bf0 net/ipv4/tcp_ipv4.c:1925
       ip_protocol_deliver_rcu+0x51/0x470 net/ipv4/ip_input.c:204
       ip_local_deliver_finish+0x110/0x140 net/ipv4/ip_input.c:231
       NF_HOOK include/linux/netfilter.h:305 [inline]
       NF_HOOK include/linux/netfilter.h:299 [inline]
       ip_local_deliver+0x133/0x210 net/ipv4/ip_input.c:252
       dst_input include/net/dst.h:442 [inline]
       ip_rcv_finish+0x121/0x160 net/ipv4/ip_input.c:413
       NF_HOOK include/linux/netfilter.h:305 [inline]
       NF_HOOK include/linux/netfilter.h:299 [inline]
       ip_rcv+0x18f/0x1a0 net/ipv4/ip_input.c:523
       __netif_receive_skb_one_core+0xa7/0xe0 net/core/dev.c:5004
       __netif_receive_skb+0x37/0xf0 net/core/dev.c:5118
       netif_receive_skb_internal+0x59/0x190 net/core/dev.c:5208
       napi_skb_finish net/core/dev.c:5671 [inline]
       napi_gro_receive+0x28f/0x330 net/core/dev.c:5704
       receive_buf+0x284/0x30b0 drivers/net/virtio_net.c:1061
       virtnet_receive drivers/net/virtio_net.c:1323 [inline]
       virtnet_poll+0x436/0x7d0 drivers/net/virtio_net.c:1428
       napi_poll net/core/dev.c:6352 [inline]
       net_rx_action+0x3ae/0xa50 net/core/dev.c:6418
      
      read to 0xffff88812ab369f8 of 8 bytes by task 7271 on cpu 0:
       tcp_recvmsg+0x470/0x1a30 net/ipv4/tcp.c:2047
       inet_recvmsg+0xbb/0x250 net/ipv4/af_inet.c:838
       sock_recvmsg_nosec net/socket.c:871 [inline]
       sock_recvmsg net/socket.c:889 [inline]
       sock_recvmsg+0x92/0xb0 net/socket.c:885
       sock_read_iter+0x15f/0x1e0 net/socket.c:967
       call_read_iter include/linux/fs.h:1864 [inline]
       new_sync_read+0x389/0x4f0 fs/read_write.c:414
       __vfs_read+0xb1/0xc0 fs/read_write.c:427
       vfs_read fs/read_write.c:461 [inline]
       vfs_read+0x143/0x2c0 fs/read_write.c:446
       ksys_read+0xd5/0x1b0 fs/read_write.c:587
       __do_sys_read fs/read_write.c:597 [inline]
       __se_sys_read fs/read_write.c:595 [inline]
       __x64_sys_read+0x4c/0x60 fs/read_write.c:595
       do_syscall_64+0xcf/0x2f0 arch/x86/entry/common.c:296
       entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
      Reported by Kernel Concurrency Sanitizer on:
      CPU: 0 PID: 7271 Comm: syz-fuzzer Not tainted 5.3.0+ #0
      Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Reported-by: Nsyzbot <syzkaller@googlegroups.com>
      Signed-off-by: NJakub Kicinski <jakub.kicinski@netronome.com>
      8265792b
  26. 02 10月, 2019 2 次提交