1. 15 10月, 2021 3 次提交
  2. 19 4月, 2021 1 次提交
  3. 10 12月, 2020 1 次提交
  4. 25 11月, 2020 1 次提交
    • A
      tcp: Set ECT0 bit in tos/tclass for synack when BPF needs ECN · 407c85c7
      Alexander Duyck 提交于
      When a BPF program is used to select between a type of TCP congestion
      control algorithm that uses either ECN or not there is a case where the
      synack for the frame was coming up without the ECT0 bit set. A bit of
      research found that this was due to the final socket being configured to
      dctcp while the listener socket was staying in cubic.
      
      To reproduce it all that is needed is to monitor TCP traffic while running
      the sample bpf program "samples/bpf/tcp_cong_kern.c". What is observed,
      assuming tcp_dctcp module is loaded or compiled in and the traffic matches
      the rules in the sample file, is that for all frames with the exception of
      the synack the ECT0 bit is set.
      
      To address that it is necessary to make one additional call to
      tcp_bpf_ca_needs_ecn using the request socket and then use the output of
      that to set the ECT0 bit for the tos/tclass of the packet.
      
      Fixes: 91b5b21c ("bpf: Add support for changing congestion control")
      Signed-off-by: NAlexander Duyck <alexanderduyck@fb.com>
      Link: https://lore.kernel.org/r/160593039663.2604.1374502006916871573.stgit@localhost.localdomainSigned-off-by: NJakub Kicinski <kuba@kernel.org>
      407c85c7
  5. 24 11月, 2020 1 次提交
    • R
      tcp: fix race condition when creating child sockets from syncookies · 01770a16
      Ricardo Dias 提交于
      When the TCP stack is in SYN flood mode, the server child socket is
      created from the SYN cookie received in a TCP packet with the ACK flag
      set.
      
      The child socket is created when the server receives the first TCP
      packet with a valid SYN cookie from the client. Usually, this packet
      corresponds to the final step of the TCP 3-way handshake, the ACK
      packet. But is also possible to receive a valid SYN cookie from the
      first TCP data packet sent by the client, and thus create a child socket
      from that SYN cookie.
      
      Since a client socket is ready to send data as soon as it receives the
      SYN+ACK packet from the server, the client can send the ACK packet (sent
      by the TCP stack code), and the first data packet (sent by the userspace
      program) almost at the same time, and thus the server will equally
      receive the two TCP packets with valid SYN cookies almost at the same
      instant.
      
      When such event happens, the TCP stack code has a race condition that
      occurs between the momement a lookup is done to the established
      connections hashtable to check for the existence of a connection for the
      same client, and the moment that the child socket is added to the
      established connections hashtable. As a consequence, this race condition
      can lead to a situation where we add two child sockets to the
      established connections hashtable and deliver two sockets to the
      userspace program to the same client.
      
      This patch fixes the race condition by checking if an existing child
      socket exists for the same client when we are adding the second child
      socket to the established connections socket. If an existing child
      socket exists, we drop the packet and discard the second child socket
      to the same client.
      Signed-off-by: NRicardo Dias <rdias@singlestore.com>
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Link: https://lore.kernel.org/r/20201120111133.GA67501@rdias-suse-pc.lanSigned-off-by: NJakub Kicinski <kuba@kernel.org>
      01770a16
  6. 21 11月, 2020 1 次提交
    • A
      tcp: Allow full IP tos/IPv6 tclass to be reflected in L3 header · 861602b5
      Alexander Duyck 提交于
      An issue was recently found where DCTCP SYN/ACK packets did not have the
      ECT bit set in the L3 header. A bit of code review found that the recent
      change referenced below had gone though and added a mask that prevented the
      ECN bits from being populated in the L3 header.
      
      This patch addresses that by rolling back the mask so that it is only
      applied to the flags coming from the incoming TCP request instead of
      applying it to the socket tos/tclass field. Doing this the ECT bits were
      restored in the SYN/ACK packets in my testing.
      
      One thing that is not addressed by this patch set is the fact that
      tcp_reflect_tos appears to be incompatible with ECN based congestion
      avoidance algorithms. At a minimum the feature should likely be documented
      which it currently isn't.
      
      Fixes: ac8f1710 ("tcp: reflect tos value received in SYN to the socket")
      Signed-off-by: NAlexander Duyck <alexanderduyck@fb.com>
      Acked-by: NWei Wang <weiwan@google.com>
      Signed-off-by: NJakub Kicinski <kuba@kernel.org>
      861602b5
  7. 19 9月, 2020 1 次提交
  8. 11 9月, 2020 1 次提交
    • W
      tcp: reflect tos value received in SYN to the socket · ac8f1710
      Wei Wang 提交于
      This commit adds a new TCP feature to reflect the tos value received in
      SYN, and send it out on the SYN-ACK, and eventually set the tos value of
      the established socket with this reflected tos value. This provides a
      way to set the traffic class/QoS level for all traffic in the same
      connection to be the same as the incoming SYN request. It could be
      useful in data centers to provide equivalent QoS according to the
      incoming request.
      This feature is guarded by /proc/sys/net/ipv4/tcp_reflect_tos, and is by
      default turned off.
      Signed-off-by: NWei Wang <weiwan@google.com>
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      ac8f1710
  9. 09 9月, 2020 1 次提交
  10. 25 8月, 2020 1 次提交
    • M
      bpf: tcp: Add bpf_skops_hdr_opt_len() and bpf_skops_write_hdr_opt() · 331fca43
      Martin KaFai Lau 提交于
      The bpf prog needs to parse the SYN header to learn what options have
      been sent by the peer's bpf-prog before writing its options into SYNACK.
      This patch adds a "syn_skb" arg to tcp_make_synack() and send_synack().
      This syn_skb will eventually be made available (as read-only) to the
      bpf prog.  This will be the only SYN packet available to the bpf
      prog during syncookie.  For other regular cases, the bpf prog can
      also use the saved_syn.
      
      When writing options, the bpf prog will first be called to tell the
      kernel its required number of bytes.  It is done by the new
      bpf_skops_hdr_opt_len().  The bpf prog will only be called when the new
      BPF_SOCK_OPS_WRITE_HDR_OPT_CB_FLAG is set in tp->bpf_sock_ops_cb_flags.
      When the bpf prog returns, the kernel will know how many bytes are needed
      and then update the "*remaining" arg accordingly.  4 byte alignment will
      be included in the "*remaining" before this function returns.  The 4 byte
      aligned number of bytes will also be stored into the opts->bpf_opt_len.
      "bpf_opt_len" is a newly added member to the struct tcp_out_options.
      
      Then the new bpf_skops_write_hdr_opt() will call the bpf prog to write the
      header options.  The bpf prog is only called if it has reserved spaces
      before (opts->bpf_opt_len > 0).
      
      The bpf prog is the last one getting a chance to reserve header space
      and writing the header option.
      
      These two functions are half implemented to highlight the changes in
      TCP stack.  The actual codes preparing the bpf running context and
      invoking the bpf prog will be added in the later patch with other
      necessary bpf pieces.
      Signed-off-by: NMartin KaFai Lau <kafai@fb.com>
      Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
      Reviewed-by: NEric Dumazet <edumazet@google.com>
      Link: https://lore.kernel.org/bpf/20200820190052.2885316-1-kafai@fb.com
      331fca43
  11. 25 7月, 2020 1 次提交
  12. 20 7月, 2020 1 次提交
  13. 21 6月, 2020 1 次提交
  14. 02 6月, 2020 1 次提交
  15. 29 5月, 2020 1 次提交
    • E
      tcp: ipv6: support RFC 6069 (TCP-LD) · d2924569
      Eric Dumazet 提交于
      Make tcp_ld_RTO_revert() helper available to IPv6, and
      implement RFC 6069 :
      
      Quoting this RFC :
      
      3. Connectivity Disruption Indication
      
         For Internet Protocol version 6 (IPv6) [RFC2460], the counterpart of
         the ICMP destination unreachable message of code 0 (net unreachable)
         and of code 1 (host unreachable) is the ICMPv6 destination
         unreachable message of code 0 (no route to destination) [RFC4443].
         As with IPv4, a router should generate an ICMPv6 destination
         unreachable message of code 0 in response to a packet that cannot be
         delivered to its destination address because it lacks a matching
         entry in its routing table.
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Acked-by: NYuchung Cheng <ycheng@google.com>
      Acked-by: NNeal Cardwell <ncardwell@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      d2924569
  16. 26 5月, 2020 1 次提交
    • E
      tcp: allow traceroute -Mtcp for unpriv users · 45af29ca
      Eric Dumazet 提交于
      Unpriv users can use traceroute over plain UDP sockets, but not TCP ones.
      
      $ traceroute -Mtcp 8.8.8.8
      You do not have enough privileges to use this traceroute method.
      
      $ traceroute -n -Mudp 8.8.8.8
      traceroute to 8.8.8.8 (8.8.8.8), 30 hops max, 60 byte packets
       1  192.168.86.1  3.631 ms  3.512 ms  3.405 ms
       2  10.1.10.1  4.183 ms  4.125 ms  4.072 ms
       3  96.120.88.125  20.621 ms  19.462 ms  20.553 ms
       4  96.110.177.65  24.271 ms  25.351 ms  25.250 ms
       5  69.139.199.197  44.492 ms  43.075 ms  44.346 ms
       6  68.86.143.93  27.969 ms  25.184 ms  25.092 ms
       7  96.112.146.18  25.323 ms 96.112.146.22  25.583 ms 96.112.146.26  24.502 ms
       8  72.14.239.204  24.405 ms 74.125.37.224  16.326 ms  17.194 ms
       9  209.85.251.9  18.154 ms 209.85.247.55  14.449 ms 209.85.251.9  26.296 ms^C
      
      We can easily support traceroute over TCP, by queueing an error message
      into socket error queue.
      
      Note that applications need to set IP_RECVERR/IPV6_RECVERR option to
      enable this feature, and that the error message is only queued
      while in SYN_SNT state.
      
      socket(AF_INET6, SOCK_STREAM, IPPROTO_IP) = 3
      setsockopt(3, SOL_IPV6, IPV6_RECVERR, [1], 4) = 0
      setsockopt(3, SOL_SOCKET, SO_TIMESTAMP_OLD, [1], 4) = 0
      setsockopt(3, SOL_IPV6, IPV6_UNICAST_HOPS, [5], 4) = 0
      connect(3, {sa_family=AF_INET6, sin6_port=htons(8787), sin6_flowinfo=htonl(0),
              inet_pton(AF_INET6, "2002:a05:6608:297::", &sin6_addr), sin6_scope_id=0}, 28) = -1 EHOSTUNREACH (No route to host)
      recvmsg(3, {msg_name={sa_family=AF_INET6, sin6_port=htons(8787), sin6_flowinfo=htonl(0),
              inet_pton(AF_INET6, "2002:a05:6608:297::", &sin6_addr), sin6_scope_id=0},
              msg_namelen=1024->28, msg_iov=[{iov_base="`\r\337\320\0004\6\1&\7\370\260\200\231\16\27\0\0\0\0\0\0\0\0 \2\n\5f\10\2\227"..., iov_len=1024}],
              msg_iovlen=1, msg_control=[{cmsg_len=32, cmsg_level=SOL_SOCKET, cmsg_type=SO_TIMESTAMP_OLD, cmsg_data={tv_sec=1590340680, tv_usec=272424}},
                                         {cmsg_len=60, cmsg_level=SOL_IPV6, cmsg_type=IPV6_RECVERR}],
              msg_controllen=96, msg_flags=MSG_ERRQUEUE}, MSG_ERRQUEUE) = 144
      
      Suggested-by: Maciej Żenczykowski <maze@google.com
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Cc: Willem de Bruijn <willemb@google.com>
      Reviewed-by: NMaciej Żenczykowski <maze@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      45af29ca
  17. 13 3月, 2020 1 次提交
  18. 30 1月, 2020 1 次提交
    • G
      mptcp: Fix undefined mptcp_handle_ipv6_mapped for modular IPV6 · 31484d56
      Geert Uytterhoeven 提交于
      If CONFIG_MPTCP=y, CONFIG_MPTCP_IPV6=n, and CONFIG_IPV6=m:
      
          ERROR: "mptcp_handle_ipv6_mapped" [net/ipv6/ipv6.ko] undefined!
      
      This does not happen if CONFIG_MPTCP_IPV6=y, as CONFIG_MPTCP_IPV6
      selects CONFIG_IPV6, and thus forces CONFIG_IPV6 builtin.
      
      As exporting a symbol for an empty function would be a bit wasteful, fix
      this by providing a dummy version of mptcp_handle_ipv6_mapped() for the
      CONFIG_MPTCP_IPV6=n case.
      
      Rename mptcp_handle_ipv6_mapped() to mptcpv6_handle_mapped(), to make it
      clear this is a pure-IPV6 function, just like mptcpv6_init().
      
      Fixes: cec37a6e ("mptcp: Handle MP_CAPABLE options for outgoing connections")
      Signed-off-by: NGeert Uytterhoeven <geert@linux-m68k.org>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      31484d56
  19. 24 1月, 2020 2 次提交
  20. 10 1月, 2020 1 次提交
  21. 03 1月, 2020 3 次提交
    • D
      net: Add device index to tcp_md5sig · 6b102db5
      David Ahern 提交于
      Add support for userspace to specify a device index to limit the scope
      of an entry via the TCP_MD5SIG_EXT setsockopt. The existing __tcpm_pad
      is renamed to tcpm_ifindex and the new field is only checked if the new
      TCP_MD5SIG_FLAG_IFINDEX is set in tcpm_flags. For now, the device index
      must point to an L3 master device (e.g., VRF). The API and error
      handling are setup to allow the constraint to be relaxed in the future
      to any device index.
      Signed-off-by: NDavid Ahern <dsahern@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      6b102db5
    • D
      tcp: Add l3index to tcp_md5sig_key and md5 functions · dea53bb8
      David Ahern 提交于
      Add l3index to tcp_md5sig_key to represent the L3 domain of a key, and
      add l3index to tcp_md5_do_add and tcp_md5_do_del to fill in the key.
      
      With the key now based on an l3index, add the new parameter to the
      lookup functions and consider the l3index when looking for a match.
      
      The l3index comes from the skb when processing ingress packets leveraging
      the helpers created for socket lookups, tcp_v4_sdif and inet_iif (and the
      v6 variants). When the sdif index is set it means the packet ingressed a
      device that is part of an L3 domain and inet_iif points to the VRF device.
      For egress, the L3 domain is determined from the socket binding and
      sk_bound_dev_if.
      Signed-off-by: NDavid Ahern <dsahern@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      dea53bb8
    • D
      ipv6/tcp: Pass dif and sdif to tcp_v6_inbound_md5_hash · d14c77e0
      David Ahern 提交于
      The original ingress device index is saved to the cb space of the skb
      and the cb is moved during tcp processing. Since tcp_v6_inbound_md5_hash
      can be called before and after the cb move, pass dif and sdif to it so
      the caller can save both prior to the cb move. Both are used by a later
      patch.
      Signed-off-by: NDavid Ahern <dsahern@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      d14c77e0
  22. 05 12月, 2019 1 次提交
  23. 07 11月, 2019 1 次提交
  24. 14 10月, 2019 4 次提交
    • E
      tcp: annotate tp->write_seq lockless reads · 0f317464
      Eric Dumazet 提交于
      There are few places where we fetch tp->write_seq while
      this field can change from IRQ or other cpu.
      
      We need to add READ_ONCE() annotations, and also make
      sure write sides use corresponding WRITE_ONCE() to avoid
      store-tearing.
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      0f317464
    • E
      tcp: annotate tp->copied_seq lockless reads · 7db48e98
      Eric Dumazet 提交于
      There are few places where we fetch tp->copied_seq while
      this field can change from IRQ or other cpu.
      
      We need to add READ_ONCE() annotations, and also make
      sure write sides use corresponding WRITE_ONCE() to avoid
      store-tearing.
      
      Note that tcp_inq_hint() was already using READ_ONCE(tp->copied_seq)
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      7db48e98
    • E
      tcp: annotate tp->rcv_nxt lockless reads · dba7d9b8
      Eric Dumazet 提交于
      There are few places where we fetch tp->rcv_nxt while
      this field can change from IRQ or other cpu.
      
      We need to add READ_ONCE() annotations, and also make
      sure write sides use corresponding WRITE_ONCE() to avoid
      store-tearing.
      
      Note that tcp_inq_hint() was already using READ_ONCE(tp->rcv_nxt)
      
      syzbot reported :
      
      BUG: KCSAN: data-race in tcp_poll / tcp_queue_rcv
      
      write to 0xffff888120425770 of 4 bytes by interrupt on cpu 0:
       tcp_rcv_nxt_update net/ipv4/tcp_input.c:3365 [inline]
       tcp_queue_rcv+0x180/0x380 net/ipv4/tcp_input.c:4638
       tcp_rcv_established+0xbf1/0xf50 net/ipv4/tcp_input.c:5616
       tcp_v4_do_rcv+0x381/0x4e0 net/ipv4/tcp_ipv4.c:1542
       tcp_v4_rcv+0x1a03/0x1bf0 net/ipv4/tcp_ipv4.c:1923
       ip_protocol_deliver_rcu+0x51/0x470 net/ipv4/ip_input.c:204
       ip_local_deliver_finish+0x110/0x140 net/ipv4/ip_input.c:231
       NF_HOOK include/linux/netfilter.h:305 [inline]
       NF_HOOK include/linux/netfilter.h:299 [inline]
       ip_local_deliver+0x133/0x210 net/ipv4/ip_input.c:252
       dst_input include/net/dst.h:442 [inline]
       ip_rcv_finish+0x121/0x160 net/ipv4/ip_input.c:413
       NF_HOOK include/linux/netfilter.h:305 [inline]
       NF_HOOK include/linux/netfilter.h:299 [inline]
       ip_rcv+0x18f/0x1a0 net/ipv4/ip_input.c:523
       __netif_receive_skb_one_core+0xa7/0xe0 net/core/dev.c:5004
       __netif_receive_skb+0x37/0xf0 net/core/dev.c:5118
       netif_receive_skb_internal+0x59/0x190 net/core/dev.c:5208
       napi_skb_finish net/core/dev.c:5671 [inline]
       napi_gro_receive+0x28f/0x330 net/core/dev.c:5704
       receive_buf+0x284/0x30b0 drivers/net/virtio_net.c:1061
      
      read to 0xffff888120425770 of 4 bytes by task 7254 on cpu 1:
       tcp_stream_is_readable net/ipv4/tcp.c:480 [inline]
       tcp_poll+0x204/0x6b0 net/ipv4/tcp.c:554
       sock_poll+0xed/0x250 net/socket.c:1256
       vfs_poll include/linux/poll.h:90 [inline]
       ep_item_poll.isra.0+0x90/0x190 fs/eventpoll.c:892
       ep_send_events_proc+0x113/0x5c0 fs/eventpoll.c:1749
       ep_scan_ready_list.constprop.0+0x189/0x500 fs/eventpoll.c:704
       ep_send_events fs/eventpoll.c:1793 [inline]
       ep_poll+0xe3/0x900 fs/eventpoll.c:1930
       do_epoll_wait+0x162/0x180 fs/eventpoll.c:2294
       __do_sys_epoll_pwait fs/eventpoll.c:2325 [inline]
       __se_sys_epoll_pwait fs/eventpoll.c:2311 [inline]
       __x64_sys_epoll_pwait+0xcd/0x170 fs/eventpoll.c:2311
       do_syscall_64+0xcf/0x2f0 arch/x86/entry/common.c:296
       entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
      Reported by Kernel Concurrency Sanitizer on:
      CPU: 1 PID: 7254 Comm: syz-fuzzer Not tainted 5.3.0+ #0
      Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Reported-by: Nsyzbot <syzkaller@googlegroups.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      dba7d9b8
    • E
      tcp: add rcu protection around tp->fastopen_rsk · d983ea6f
      Eric Dumazet 提交于
      Both tcp_v4_err() and tcp_v6_err() do the following operations
      while they do not own the socket lock :
      
      	fastopen = tp->fastopen_rsk;
       	snd_una = fastopen ? tcp_rsk(fastopen)->snt_isn : tp->snd_una;
      
      The problem is that without appropriate barrier, the compiler
      might reload tp->fastopen_rsk and trigger a NULL deref.
      
      request sockets are protected by RCU, we can simply add
      the missing annotations and barriers to solve the issue.
      
      Fixes: 168a8f58 ("tcp: TCP Fast Open Server - main code path")
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      d983ea6f
  25. 27 9月, 2019 3 次提交
  26. 31 7月, 2019 1 次提交
  27. 12 7月, 2019 1 次提交
  28. 09 7月, 2019 1 次提交
    • W
      ipv6: elide flowlabel check if no exclusive leases exist · 59c820b2
      Willem de Bruijn 提交于
      Processes can request ipv6 flowlabels with cmsg IPV6_FLOWINFO.
      If not set, by default an autogenerated flowlabel is selected.
      
      Explicit flowlabels require a control operation per label plus a
      datapath check on every connection (every datagram if unconnected).
      This is particularly expensive on unconnected sockets multiplexing
      many flows, such as QUIC.
      
      In the common case, where no lease is exclusive, the check can be
      safely elided, as both lease request and check trivially succeed.
      Indeed, autoflowlabel does the same even with exclusive leases.
      
      Elide the check if no process has requested an exclusive lease.
      
      fl6_sock_lookup previously returns either a reference to a lease or
      NULL to denote failure. Modify to return a real error and update
      all callers. On return NULL, they can use the label and will elide
      the atomic_dec in fl6_sock_release.
      
      This is an optimization. Robust applications still have to revert to
      requesting leases if the fast path fails due to an exclusive lease.
      
      Changes RFC->v1:
        - use static_key_false_deferred to rate limit jump label operations
          - call static_key_deferred_flush to stop timers on exit
        - move decrement out of RCU context
        - defer optimization also if opt data is associated with a lease
        - updated all fp6_sock_lookup callers, not just udp
      Signed-off-by: NWillem de Bruijn <willemb@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      59c820b2
  29. 02 7月, 2019 1 次提交
  30. 15 6月, 2019 1 次提交