1. 24 7月, 2021 7 次提交
  2. 03 7月, 2021 1 次提交
  3. 30 6月, 2021 1 次提交
  4. 16 6月, 2021 1 次提交
    • K
      tcp: Migrate TCP_NEW_SYN_RECV requests at receiving the final ACK. · d4f2c86b
      Kuniyuki Iwashima 提交于
      This patch also changes the code to call reuseport_migrate_sock() and
      inet_reqsk_clone(), but unlike the other cases, we do not call
      inet_reqsk_clone() right after reuseport_migrate_sock().
      
      Currently, in the receive path for TCP_NEW_SYN_RECV sockets, its listener
      has three kinds of refcnt:
      
        (A) for listener itself
        (B) carried by reuqest_sock
        (C) sock_hold() in tcp_v[46]_rcv()
      
      While processing the req, (A) may disappear by close(listener). Also, (B)
      can disappear by accept(listener) once we put the req into the accept
      queue. So, we have to hold another refcnt (C) for the listener to prevent
      use-after-free.
      
      For socket migration, we call reuseport_migrate_sock() to select a listener
      with (A) and to increment the new listener's refcnt in tcp_v[46]_rcv().
      This refcnt corresponds to (C) and is cleaned up later in tcp_v[46]_rcv().
      Thus we have to take another refcnt (B) for the newly cloned request_sock.
      
      In inet_csk_complete_hashdance(), we hold the count (B), clone the req, and
      try to put the new req into the accept queue. By migrating req after
      winning the "own_req" race, we can avoid such a worst situation:
      
        CPU 1 looks up req1
        CPU 2 looks up req1, unhashes it, then CPU 1 loses the race
        CPU 3 looks up req2, unhashes it, then CPU 2 loses the race
        ...
      Signed-off-by: NKuniyuki Iwashima <kuniyu@amazon.co.jp>
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      Reviewed-by: NEric Dumazet <edumazet@google.com>
      Acked-by: NMartin KaFai Lau <kafai@fb.com>
      Link: https://lore.kernel.org/bpf/20210612123224.12525-8-kuniyu@amazon.co.jp
      d4f2c86b
  5. 15 5月, 2021 1 次提交
    • J
      tcp: add tracepoint for checksum errors · 709c0314
      Jakub Kicinski 提交于
      Add a tracepoint for capturing TCP segments with
      a bad checksum. This makes it easy to identify
      sources of bad frames in the fleet (e.g. machines
      with faulty NICs).
      
      It should also help tools like IOvisor's tcpdrop.py
      which are used today to get detailed information
      about such packets.
      
      We don't have a socket in many cases so we must
      open code the address extraction based just on
      the skb.
      
      v2: add missing export for ipv6=m
      Signed-off-by: NJakub Kicinski <kuba@kernel.org>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      709c0314
  6. 03 4月, 2021 1 次提交
    • F
      mptcp: add mptcp reset option support · dc87efdb
      Florian Westphal 提交于
      The MPTCP reset option allows to carry a mptcp-specific error code that
      provides more information on the nature of a connection reset.
      
      Reset option data received gets stored in the subflow context so it can
      be sent to userspace via the 'subflow closed' netlink event.
      
      When a subflow is closed, the desired error code that should be sent to
      the peer is also placed in the subflow context structure.
      
      If a reset is sent before subflow establishment could complete, e.g. on
      HMAC failure during an MP_JOIN operation, the mptcp skb extension is
      used to store the reset information.
      Signed-off-by: NFlorian Westphal <fw@strlen.de>
      Signed-off-by: NMat Martineau <mathew.j.martineau@linux.intel.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      dc87efdb
  7. 02 4月, 2021 1 次提交
  8. 04 2月, 2021 1 次提交
  9. 21 1月, 2021 2 次提交
    • S
      bpf: Remove extra lock_sock for TCP_ZEROCOPY_RECEIVE · 9cacf81f
      Stanislav Fomichev 提交于
      Add custom implementation of getsockopt hook for TCP_ZEROCOPY_RECEIVE.
      We skip generic hooks for TCP_ZEROCOPY_RECEIVE and have a custom
      call in do_tcp_getsockopt using the on-stack data. This removes
      3% overhead for locking/unlocking the socket.
      
      Without this patch:
           3.38%     0.07%  tcp_mmap  [kernel.kallsyms]  [k] __cgroup_bpf_run_filter_getsockopt
                  |
                   --3.30%--__cgroup_bpf_run_filter_getsockopt
                             |
                              --0.81%--__kmalloc
      
      With the patch applied:
           0.52%     0.12%  tcp_mmap  [kernel.kallsyms]  [k] __cgroup_bpf_run_filter_getsockopt_kern
      
      Note, exporting uapi/tcp.h requires removing netinet/tcp.h
      from test_progs.h because those headers have confliciting
      definitions.
      Signed-off-by: NStanislav Fomichev <sdf@google.com>
      Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
      Acked-by: NMartin KaFai Lau <kafai@fb.com>
      Link: https://lore.kernel.org/bpf/20210115163501.805133-2-sdf@google.com
      9cacf81f
    • K
      tcp: Fix potential use-after-free due to double kfree() · c89dffc7
      Kuniyuki Iwashima 提交于
      Receiving ACK with a valid SYN cookie, cookie_v4_check() allocates struct
      request_sock and then can allocate inet_rsk(req)->ireq_opt. After that,
      tcp_v4_syn_recv_sock() allocates struct sock and copies ireq_opt to
      inet_sk(sk)->inet_opt. Normally, tcp_v4_syn_recv_sock() inserts the full
      socket into ehash and sets NULL to ireq_opt. Otherwise,
      tcp_v4_syn_recv_sock() has to reset inet_opt by NULL and free the full
      socket.
      
      The commit 01770a16 ("tcp: fix race condition when creating child
      sockets from syncookies") added a new path, in which more than one cores
      create full sockets for the same SYN cookie. Currently, the core which
      loses the race frees the full socket without resetting inet_opt, resulting
      in that both sock_put() and reqsk_put() call kfree() for the same memory:
      
        sock_put
          sk_free
            __sk_free
              sk_destruct
                __sk_destruct
                  sk->sk_destruct/inet_sock_destruct
                    kfree(rcu_dereference_protected(inet->inet_opt, 1));
      
        reqsk_put
          reqsk_free
            __reqsk_free
              req->rsk_ops->destructor/tcp_v4_reqsk_destructor
                kfree(rcu_dereference_protected(inet_rsk(req)->ireq_opt, 1));
      
      Calling kmalloc() between the double kfree() can lead to use-after-free, so
      this patch fixes it by setting NULL to inet_opt before sock_put().
      
      As a side note, this kind of issue does not happen for IPv6. This is
      because tcp_v6_syn_recv_sock() clones both ipv6_opt and pktopts which
      correspond to ireq_opt in IPv4.
      
      Fixes: 01770a16 ("tcp: fix race condition when creating child sockets from syncookies")
      CC: Ricardo Dias <rdias@singlestore.com>
      Signed-off-by: NKuniyuki Iwashima <kuniyu@amazon.co.jp>
      Reviewed-by: NBenjamin Herrenschmidt <benh@amazon.com>
      Reviewed-by: NEric Dumazet <edumazet@google.com>
      Link: https://lore.kernel.org/r/20210118055920.82516-1-kuniyu@amazon.co.jpSigned-off-by: NJakub Kicinski <kuba@kernel.org>
      c89dffc7
  10. 20 1月, 2021 1 次提交
  11. 10 12月, 2020 1 次提交
  12. 04 12月, 2020 1 次提交
    • F
      tcp: merge 'init_req' and 'route_req' functions · 7ea851d1
      Florian Westphal 提交于
      The Multipath-TCP standard (RFC 8684) says that an MPTCP host should send
      a TCP reset if the token in a MP_JOIN request is unknown.
      
      At this time we don't do this, the 3whs completes and the 'new subflow'
      is reset afterwards.  There are two ways to allow MPTCP to send the
      reset.
      
      1. override 'send_synack' callback and emit the rst from there.
         The drawback is that the request socket gets inserted into the
         listeners queue just to get removed again right away.
      
      2. Send the reset from the 'route_req' function instead.
         This avoids the 'add&remove request socket', but route_req lacks the
         skb that is required to send the TCP reset.
      
      Instead of just adding the skb to that function for MPTCP sake alone,
      Paolo suggested to merge init_req and route_req functions.
      
      This saves one indirection from syn processing path and provides the skb
      to the merged function at the same time.
      
      'send reset on unknown mptcp join token' is added in next patch.
      Suggested-by: NPaolo Abeni <pabeni@redhat.com>
      Cc: Eric Dumazet <edumazet@google.com>
      Signed-off-by: NFlorian Westphal <fw@strlen.de>
      Signed-off-by: NJakub Kicinski <kuba@kernel.org>
      7ea851d1
  13. 25 11月, 2020 1 次提交
    • A
      tcp: Set ECT0 bit in tos/tclass for synack when BPF needs ECN · 407c85c7
      Alexander Duyck 提交于
      When a BPF program is used to select between a type of TCP congestion
      control algorithm that uses either ECN or not there is a case where the
      synack for the frame was coming up without the ECT0 bit set. A bit of
      research found that this was due to the final socket being configured to
      dctcp while the listener socket was staying in cubic.
      
      To reproduce it all that is needed is to monitor TCP traffic while running
      the sample bpf program "samples/bpf/tcp_cong_kern.c". What is observed,
      assuming tcp_dctcp module is loaded or compiled in and the traffic matches
      the rules in the sample file, is that for all frames with the exception of
      the synack the ECT0 bit is set.
      
      To address that it is necessary to make one additional call to
      tcp_bpf_ca_needs_ecn using the request socket and then use the output of
      that to set the ECT0 bit for the tos/tclass of the packet.
      
      Fixes: 91b5b21c ("bpf: Add support for changing congestion control")
      Signed-off-by: NAlexander Duyck <alexanderduyck@fb.com>
      Link: https://lore.kernel.org/r/160593039663.2604.1374502006916871573.stgit@localhost.localdomainSigned-off-by: NJakub Kicinski <kuba@kernel.org>
      407c85c7
  14. 24 11月, 2020 1 次提交
    • R
      tcp: fix race condition when creating child sockets from syncookies · 01770a16
      Ricardo Dias 提交于
      When the TCP stack is in SYN flood mode, the server child socket is
      created from the SYN cookie received in a TCP packet with the ACK flag
      set.
      
      The child socket is created when the server receives the first TCP
      packet with a valid SYN cookie from the client. Usually, this packet
      corresponds to the final step of the TCP 3-way handshake, the ACK
      packet. But is also possible to receive a valid SYN cookie from the
      first TCP data packet sent by the client, and thus create a child socket
      from that SYN cookie.
      
      Since a client socket is ready to send data as soon as it receives the
      SYN+ACK packet from the server, the client can send the ACK packet (sent
      by the TCP stack code), and the first data packet (sent by the userspace
      program) almost at the same time, and thus the server will equally
      receive the two TCP packets with valid SYN cookies almost at the same
      instant.
      
      When such event happens, the TCP stack code has a race condition that
      occurs between the momement a lookup is done to the established
      connections hashtable to check for the existence of a connection for the
      same client, and the moment that the child socket is added to the
      established connections hashtable. As a consequence, this race condition
      can lead to a situation where we add two child sockets to the
      established connections hashtable and deliver two sockets to the
      userspace program to the same client.
      
      This patch fixes the race condition by checking if an existing child
      socket exists for the same client when we are adding the second child
      socket to the established connections socket. If an existing child
      socket exists, we drop the packet and discard the second child socket
      to the same client.
      Signed-off-by: NRicardo Dias <rdias@singlestore.com>
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Link: https://lore.kernel.org/r/20201120111133.GA67501@rdias-suse-pc.lanSigned-off-by: NJakub Kicinski <kuba@kernel.org>
      01770a16
  15. 21 11月, 2020 1 次提交
    • A
      tcp: Allow full IP tos/IPv6 tclass to be reflected in L3 header · 861602b5
      Alexander Duyck 提交于
      An issue was recently found where DCTCP SYN/ACK packets did not have the
      ECT bit set in the L3 header. A bit of code review found that the recent
      change referenced below had gone though and added a mask that prevented the
      ECN bits from being populated in the L3 header.
      
      This patch addresses that by rolling back the mask so that it is only
      applied to the flags coming from the incoming TCP request instead of
      applying it to the socket tos/tclass field. Doing this the ECT bits were
      restored in the SYN/ACK packets in my testing.
      
      One thing that is not addressed by this patch set is the fact that
      tcp_reflect_tos appears to be incompatible with ECN based congestion
      avoidance algorithms. At a minimum the feature should likely be documented
      which it currently isn't.
      
      Fixes: ac8f1710 ("tcp: reflect tos value received in SYN to the socket")
      Signed-off-by: NAlexander Duyck <alexanderduyck@fb.com>
      Acked-by: NWei Wang <weiwan@google.com>
      Signed-off-by: NJakub Kicinski <kuba@kernel.org>
      861602b5
  16. 15 11月, 2020 1 次提交
  17. 06 10月, 2020 1 次提交
    • E
      tcp: fix receive window update in tcp_add_backlog() · 86bccd03
      Eric Dumazet 提交于
      We got reports from GKE customers flows being reset by netfilter
      conntrack unless nf_conntrack_tcp_be_liberal is set to 1.
      
      Traces seemed to suggest ACK packet being dropped by the
      packet capture, or more likely that ACK were received in the
      wrong order.
      
       wscale=7, SYN and SYNACK not shown here.
      
       This ACK allows the sender to send 1871*128 bytes from seq 51359321 :
       New right edge of the window -> 51359321+1871*128=51598809
      
       09:17:23.389210 IP A > B: Flags [.], ack 51359321, win 1871, options [nop,nop,TS val 10 ecr 999], length 0
      
       09:17:23.389212 IP B > A: Flags [.], seq 51422681:51424089, ack 1577, win 268, options [nop,nop,TS val 999 ecr 10], length 1408
       09:17:23.389214 IP A > B: Flags [.], ack 51422681, win 1376, options [nop,nop,TS val 10 ecr 999], length 0
       09:17:23.389253 IP B > A: Flags [.], seq 51424089:51488857, ack 1577, win 268, options [nop,nop,TS val 999 ecr 10], length 64768
       09:17:23.389272 IP A > B: Flags [.], ack 51488857, win 859, options [nop,nop,TS val 10 ecr 999], length 0
       09:17:23.389275 IP B > A: Flags [.], seq 51488857:51521241, ack 1577, win 268, options [nop,nop,TS val 999 ecr 10], length 32384
      
       Receiver now allows to send 606*128=77568 from seq 51521241 :
       New right edge of the window -> 51521241+606*128=51598809
      
       09:17:23.389296 IP A > B: Flags [.], ack 51521241, win 606, options [nop,nop,TS val 10 ecr 999], length 0
      
       09:17:23.389308 IP B > A: Flags [.], seq 51521241:51553625, ack 1577, win 268, options [nop,nop,TS val 999 ecr 10], length 32384
      
       It seems the sender exceeds RWIN allowance, since 51611353 > 51598809
      
       09:17:23.389346 IP B > A: Flags [.], seq 51553625:51611353, ack 1577, win 268, options [nop,nop,TS val 999 ecr 10], length 57728
       09:17:23.389356 IP B > A: Flags [.], seq 51611353:51618393, ack 1577, win 268, options [nop,nop,TS val 999 ecr 10], length 7040
      
       09:17:23.389367 IP A > B: Flags [.], ack 51611353, win 0, options [nop,nop,TS val 10 ecr 999], length 0
      
       netfilter conntrack is not happy and sends RST
      
       09:17:23.389389 IP A > B: Flags [R], seq 92176528, win 0, length 0
       09:17:23.389488 IP B > A: Flags [R], seq 174478967, win 0, length 0
      
       Now imagine ACK were delivered out of order and tcp_add_backlog() sets window based on wrong packet.
       New right edge of the window -> 51521241+859*128=51631193
      
      Normally TCP stack handles OOO packets just fine, but it
      turns out tcp_add_backlog() does not. It can update the window
      field of the aggregated packet even if the ACK sequence
      of the last received packet is too old.
      
      Many thanks to Alexandre Ferrieux for independently reporting the issue
      and suggesting a fix.
      
      Fixes: 4f693b55 ("tcp: implement coalescing on backlog queue")
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Reported-by: NAlexandre Ferrieux <alexandre.ferrieux@orange.com>
      Acked-by: NSoheil Hassas Yeganeh <soheil@google.com>
      Acked-by: NNeal Cardwell <ncardwell@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      86bccd03
  18. 11 9月, 2020 2 次提交
  19. 25 8月, 2020 2 次提交
    • R
      net: ipv4: delete repeated words · 2bdcc73c
      Randy Dunlap 提交于
      Drop duplicate words in comments in net/ipv4/.
      Signed-off-by: NRandy Dunlap <rdunlap@infradead.org>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Jakub Kicinski <kuba@kernel.org>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      2bdcc73c
    • M
      bpf: tcp: Add bpf_skops_hdr_opt_len() and bpf_skops_write_hdr_opt() · 331fca43
      Martin KaFai Lau 提交于
      The bpf prog needs to parse the SYN header to learn what options have
      been sent by the peer's bpf-prog before writing its options into SYNACK.
      This patch adds a "syn_skb" arg to tcp_make_synack() and send_synack().
      This syn_skb will eventually be made available (as read-only) to the
      bpf prog.  This will be the only SYN packet available to the bpf
      prog during syncookie.  For other regular cases, the bpf prog can
      also use the saved_syn.
      
      When writing options, the bpf prog will first be called to tell the
      kernel its required number of bytes.  It is done by the new
      bpf_skops_hdr_opt_len().  The bpf prog will only be called when the new
      BPF_SOCK_OPS_WRITE_HDR_OPT_CB_FLAG is set in tp->bpf_sock_ops_cb_flags.
      When the bpf prog returns, the kernel will know how many bytes are needed
      and then update the "*remaining" arg accordingly.  4 byte alignment will
      be included in the "*remaining" before this function returns.  The 4 byte
      aligned number of bytes will also be stored into the opts->bpf_opt_len.
      "bpf_opt_len" is a newly added member to the struct tcp_out_options.
      
      Then the new bpf_skops_write_hdr_opt() will call the bpf prog to write the
      header options.  The bpf prog is only called if it has reserved spaces
      before (opts->bpf_opt_len > 0).
      
      The bpf prog is the last one getting a chance to reserve header space
      and writing the header option.
      
      These two functions are half implemented to highlight the changes in
      TCP stack.  The actual codes preparing the bpf running context and
      invoking the bpf prog will be added in the later patch with other
      necessary bpf pieces.
      Signed-off-by: NMartin KaFai Lau <kafai@fb.com>
      Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
      Reviewed-by: NEric Dumazet <edumazet@google.com>
      Link: https://lore.kernel.org/bpf/20200820190052.2885316-1-kafai@fb.com
      331fca43
  20. 26 7月, 2020 2 次提交
  21. 25 7月, 2020 1 次提交
  22. 22 7月, 2020 1 次提交
  23. 20 7月, 2020 2 次提交
  24. 02 7月, 2020 1 次提交
    • E
      tcp: md5: refine tcp_md5_do_add()/tcp_md5_hash_key() barriers · e6ced831
      Eric Dumazet 提交于
      My prior fix went a bit too far, according to Herbert and Mathieu.
      
      Since we accept that concurrent TCP MD5 lookups might see inconsistent
      keys, we can use READ_ONCE()/WRITE_ONCE() instead of smp_rmb()/smp_wmb()
      
      Clearing all key->key[] is needed to avoid possible KMSAN reports,
      if key->keylen is increased. Since tcp_md5_do_add() is not fast path,
      using __GFP_ZERO to clear all struct tcp_md5sig_key is simpler.
      
      data_race() was added in linux-5.8 and will prevent KCSAN reports,
      this can safely be removed in stable backports, if data_race() is
      not yet backported.
      
      v2: use data_race() both in tcp_md5_hash_key() and tcp_md5_do_add()
      
      Fixes: 6a2febec ("tcp: md5: add missing memory barriers in tcp_md5_do_add()/tcp_md5_hash_key()")
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
      Cc: Herbert Xu <herbert@gondor.apana.org.au>
      Cc: Marco Elver <elver@google.com>
      Reviewed-by: NMathieu Desnoyers <mathieu.desnoyers@efficios.com>
      Acked-by: NHerbert Xu <herbert@gondor.apana.org.au>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      e6ced831
  25. 01 7月, 2020 1 次提交
    • E
      tcp: md5: add missing memory barriers in tcp_md5_do_add()/tcp_md5_hash_key() · 6a2febec
      Eric Dumazet 提交于
      MD5 keys are read with RCU protection, and tcp_md5_do_add()
      might update in-place a prior key.
      
      Normally, typical RCU updates would allocate a new piece
      of memory. In this case only key->key and key->keylen might
      be updated, and we do not care if an incoming packet could
      see the old key, the new one, or some intermediate value,
      since changing the key on a live flow is known to be problematic
      anyway.
      
      We only want to make sure that in the case key->keylen
      is changed, cpus in tcp_md5_hash_key() wont try to use
      uninitialized data, or crash because key->keylen was
      read twice to feed sg_init_one() and ahash_request_set_crypt()
      
      Fixes: 9ea88a15 ("tcp: md5: check md5 signature without socket lock")
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      6a2febec
  26. 25 6月, 2020 2 次提交
  27. 29 5月, 2020 1 次提交
    • E
      tcp: ipv6: support RFC 6069 (TCP-LD) · d2924569
      Eric Dumazet 提交于
      Make tcp_ld_RTO_revert() helper available to IPv6, and
      implement RFC 6069 :
      
      Quoting this RFC :
      
      3. Connectivity Disruption Indication
      
         For Internet Protocol version 6 (IPv6) [RFC2460], the counterpart of
         the ICMP destination unreachable message of code 0 (net unreachable)
         and of code 1 (host unreachable) is the ICMPv6 destination
         unreachable message of code 0 (no route to destination) [RFC4443].
         As with IPv4, a router should generate an ICMPv6 destination
         unreachable message of code 0 in response to a packet that cannot be
         delivered to its destination address because it lacks a matching
         entry in its routing table.
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Acked-by: NYuchung Cheng <ycheng@google.com>
      Acked-by: NNeal Cardwell <ncardwell@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      d2924569
  28. 28 5月, 2020 1 次提交