1. 13 12月, 2022 2 次提交
  2. 18 11月, 2022 2 次提交
  3. 10 11月, 2022 13 次提交
  4. 18 10月, 2022 1 次提交
    • E
      tcp: tcp_rtx_synack() can be called from process context · af8fc5d2
      Eric Dumazet 提交于
      stable inclusion
      from stable-v5.10.122
      commit 0a0f7f84148445c9f02f226928803a870139d820
      category: bugfix
      bugzilla: https://gitee.com/openeuler/kernel/issues/I5W6OE
      
      Reference: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=0a0f7f84148445c9f02f226928803a870139d820
      
      --------------------------------
      
      [ Upstream commit 0a375c82 ]
      
      Laurent reported the enclosed report [1]
      
      This bug triggers with following coditions:
      
      0) Kernel built with CONFIG_DEBUG_PREEMPT=y
      
      1) A new passive FastOpen TCP socket is created.
         This FO socket waits for an ACK coming from client to be a complete
         ESTABLISHED one.
      2) A socket operation on this socket goes through lock_sock()
         release_sock() dance.
      3) While the socket is owned by the user in step 2),
         a retransmit of the SYN is received and stored in socket backlog.
      4) At release_sock() time, the socket backlog is processed while
         in process context.
      5) A SYNACK packet is cooked in response of the SYN retransmit.
      6) -> tcp_rtx_synack() is called in process context.
      
      Before blamed commit, tcp_rtx_synack() was always called from BH handler,
      from a timer handler.
      
      Fix this by using TCP_INC_STATS() & NET_INC_STATS()
      which do not assume caller is in non preemptible context.
      
      [1]
      BUG: using __this_cpu_add() in preemptible [00000000] code: epollpep/2180
      caller is tcp_rtx_synack.part.0+0x36/0xc0
      CPU: 10 PID: 2180 Comm: epollpep Tainted: G           OE     5.16.0-0.bpo.4-amd64 #1  Debian 5.16.12-1~bpo11+1
      Hardware name: Supermicro SYS-5039MC-H8TRF/X11SCD-F, BIOS 1.7 11/23/2021
      Call Trace:
       <TASK>
       dump_stack_lvl+0x48/0x5e
       check_preemption_disabled+0xde/0xe0
       tcp_rtx_synack.part.0+0x36/0xc0
       tcp_rtx_synack+0x8d/0xa0
       ? kmem_cache_alloc+0x2e0/0x3e0
       ? apparmor_file_alloc_security+0x3b/0x1f0
       inet_rtx_syn_ack+0x16/0x30
       tcp_check_req+0x367/0x610
       tcp_rcv_state_process+0x91/0xf60
       ? get_nohz_timer_target+0x18/0x1a0
       ? lock_timer_base+0x61/0x80
       ? preempt_count_add+0x68/0xa0
       tcp_v4_do_rcv+0xbd/0x270
       __release_sock+0x6d/0xb0
       release_sock+0x2b/0x90
       sock_setsockopt+0x138/0x1140
       ? __sys_getsockname+0x7e/0xc0
       ? aa_sk_perm+0x3e/0x1a0
       __sys_setsockopt+0x198/0x1e0
       __x64_sys_setsockopt+0x21/0x30
       do_syscall_64+0x38/0xc0
       entry_SYSCALL_64_after_hwframe+0x44/0xae
      
      Fixes: 168a8f58 ("tcp: TCP Fast Open Server - main code path")
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Reported-by: NLaurent Fasnacht <laurent.fasnacht@proton.ch>
      Acked-by: NNeal Cardwell <ncardwell@google.com>
      Link: https://lore.kernel.org/r/20220530213713.601888-1-eric.dumazet@gmail.comSigned-off-by: NJakub Kicinski <kuba@kernel.org>
      Signed-off-by: NSasha Levin <sashal@kernel.org>
      Signed-off-by: NZheng Zengkai <zhengzengkai@huawei.com>
      Reviewed-by: NWei Li <liwei391@huawei.com>
      af8fc5d2
  5. 02 8月, 2022 1 次提交
    • E
      tcp: fix potential xmit stalls caused by TCP_NOTSENT_LOWAT · e86a4849
      Eric Dumazet 提交于
      stable inclusion
      from stable-v5.10.114
      commit 8a9d6ca3608f76beac1c40a02073c3c52a572b10
      category: bugfix
      bugzilla: https://gitee.com/openeuler/kernel/issues/I5IY1V
      
      Reference: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=8a9d6ca3608f76beac1c40a02073c3c52a572b10
      
      --------------------------------
      
      [ Upstream commit 4bfe744f ]
      
      I had this bug sitting for too long in my pile, it is time to fix it.
      
      Thanks to Doug Porter for reminding me of it!
      
      We had various attempts in the past, including commit
      0cbe6a8f ("tcp: remove SOCK_QUEUE_SHRUNK"),
      but the issue is that TCP stack currently only generates
      EPOLLOUT from input path, when tp->snd_una has advanced
      and skb(s) cleaned from rtx queue.
      
      If a flow has a big RTT, and/or receives SACKs, it is possible
      that the notsent part (tp->write_seq - tp->snd_nxt) reaches 0
      and no more data can be sent until tp->snd_una finally advances.
      
      What is needed is to also check if POLLOUT needs to be generated
      whenever tp->snd_nxt is advanced, from output path.
      
      This bug triggers more often after an idle period, as
      we do not receive ACK for at least one RTT. tcp_notsent_lowat
      could be a fraction of what CWND and pacing rate would allow to
      send during this RTT.
      
      In a followup patch, I will remove the bogus call
      to tcp_chrono_stop(sk, TCP_CHRONO_SNDBUF_LIMITED)
      from tcp_check_space(). Fact that we have decided to generate
      an EPOLLOUT does not mean the application has immediately
      refilled the transmit queue. This optimistic call
      might have been the reason the bug seemed not too serious.
      
      Tested:
      
      200 ms rtt, 1% packet loss, 32 MB tcp_rmem[2] and tcp_wmem[2]
      
      $ echo 500000 >/proc/sys/net/ipv4/tcp_notsent_lowat
      $ cat bench_rr.sh
      SUM=0
      for i in {1..10}
      do
       V=`netperf -H remote_host -l30 -t TCP_RR -- -r 10000000,10000 -o LOCAL_BYTES_SENT | egrep -v "MIGRATED|Bytes"`
       echo $V
       SUM=$(($SUM + $V))
      done
      echo SUM=$SUM
      
      Before patch:
      $ bench_rr.sh
      130000000
      80000000
      140000000
      140000000
      140000000
      140000000
      130000000
      40000000
      90000000
      110000000
      SUM=1140000000
      
      After patch:
      $ bench_rr.sh
      430000000
      590000000
      530000000
      450000000
      450000000
      350000000
      450000000
      490000000
      480000000
      460000000
      SUM=4680000000  # This is 410 % of the value before patch.
      
      Fixes: c9bee3b7 ("tcp: TCP_NOTSENT_LOWAT socket option")
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Reported-by: NDoug Porter <dsp@fb.com>
      Cc: Soheil Hassas Yeganeh <soheil@google.com>
      Cc: Neal Cardwell <ncardwell@google.com>
      Acked-by: NSoheil Hassas Yeganeh <soheil@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      Signed-off-by: NSasha Levin <sashal@kernel.org>
      Signed-off-by: NZheng Zengkai <zhengzengkai@huawei.com>
      Acked-by: NXie XiuQi <xiexiuqi@huawei.com>
      e86a4849
  6. 05 7月, 2022 1 次提交
  7. 27 1月, 2022 1 次提交
  8. 07 1月, 2022 3 次提交
  9. 15 10月, 2021 1 次提交
    • E
      ipv6: tcp: drop silly ICMPv6 packet too big messages · 8cb8fbad
      Eric Dumazet 提交于
      stable inclusion
      from stable-5.10.53
      commit 2fee3cf4c97b8913c0c907dd7bb1988baa8faa01
      bugzilla: 175574 https://gitee.com/openeuler/kernel/issues/I4DTUX
      
      Reference: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=2fee3cf4c97b8913c0c907dd7bb1988baa8faa01
      
      --------------------------------
      
      commit c7bb4b89 upstream.
      
      While TCP stack scales reasonably well, there is still one part that
      can be used to DDOS it.
      
      IPv6 Packet too big messages have to lookup/insert a new route,
      and if abused by attackers, can easily put hosts under high stress,
      with many cpus contending on a spinlock while one is stuck in fib6_run_gc()
      
      ip6_protocol_deliver_rcu()
       icmpv6_rcv()
        icmpv6_notify()
         tcp_v6_err()
          tcp_v6_mtu_reduced()
           inet6_csk_update_pmtu()
            ip6_rt_update_pmtu()
             __ip6_rt_update_pmtu()
              ip6_rt_cache_alloc()
               ip6_dst_alloc()
                dst_alloc()
                 ip6_dst_gc()
                  fib6_run_gc()
                   spin_lock_bh() ...
      
      Some of our servers have been hit by malicious ICMPv6 packets
      trying to _increase_ the MTU/MSS of TCP flows.
      
      We believe these ICMPv6 packets are a result of a bug in one ISP stack,
      since they were blindly sent back for _every_ (small) packet sent to them.
      
      These packets are for one TCP flow:
      09:24:36.266491 IP6 Addr1 > Victim ICMP6, packet too big, mtu 1460, length 1240
      09:24:36.266509 IP6 Addr1 > Victim ICMP6, packet too big, mtu 1460, length 1240
      09:24:36.316688 IP6 Addr1 > Victim ICMP6, packet too big, mtu 1460, length 1240
      09:24:36.316704 IP6 Addr1 > Victim ICMP6, packet too big, mtu 1460, length 1240
      09:24:36.608151 IP6 Addr1 > Victim ICMP6, packet too big, mtu 1460, length 1240
      
      TCP stack can filter some silly requests :
      
      1) MTU below IPV6_MIN_MTU can be filtered early in tcp_v6_err()
      2) tcp_v6_mtu_reduced() can drop requests trying to increase current MSS.
      
      This tests happen before the IPv6 routing stack is entered, thus
      removing the potential contention and route exhaustion.
      
      Note that IPv6 stack was performing these checks, but too late
      (ie : after the route has been added, and after the potential
      garbage collect war)
      
      v2: fix typo caught by Martin, thanks !
      v3: exports tcp_mtu_to_mss(), caught by David, thanks !
      
      Fixes: 1da177e4 ("Linux-2.6.12-rc2")
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Reviewed-by: NMaciej Żenczykowski <maze@google.com>
      Cc: Martin KaFai Lau <kafai@fb.com>
      Acked-by: NMartin KaFai Lau <kafai@fb.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: NChen Jun <chenjun102@huawei.com>
      Acked-by: NWeilong Chen <chenweilong@huawei.com>
      Signed-off-by: NChen Jun <chenjun102@huawei.com>
      Signed-off-by: NZheng Zengkai <zhengzengkai@huawei.com>
      8cb8fbad
  10. 09 2月, 2021 1 次提交
  11. 08 2月, 2021 1 次提交
  12. 10 12月, 2020 1 次提交
    • N
      tcp: fix cwnd-limited bug for TSO deferral where we send nothing · 299bcb55
      Neal Cardwell 提交于
      When cwnd is not a multiple of the TSO skb size of N*MSS, we can get
      into persistent scenarios where we have the following sequence:
      
      (1) ACK for full-sized skb of N*MSS arrives
        -> tcp_write_xmit() transmit full-sized skb with N*MSS
        -> move pacing release time forward
        -> exit tcp_write_xmit() because pacing time is in the future
      
      (2) TSQ callback or TCP internal pacing timer fires
        -> try to transmit next skb, but TSO deferral finds remainder of
           available cwnd is not big enough to trigger an immediate send
           now, so we defer sending until the next ACK.
      
      (3) repeat...
      
      So we can get into a case where we never mark ourselves as
      cwnd-limited for many seconds at a time, even with
      bulk/infinite-backlog senders, because:
      
      o In case (1) above, every time in tcp_write_xmit() we have enough
      cwnd to send a full-sized skb, we are not fully using the cwnd
      (because cwnd is not a multiple of the TSO skb size). So every time we
      send data, we are not cwnd limited, and so in the cwnd-limited
      tracking code in tcp_cwnd_validate() we mark ourselves as not
      cwnd-limited.
      
      o In case (2) above, every time in tcp_write_xmit() that we try to
      transmit the "remainder" of the cwnd but defer, we set the local
      variable is_cwnd_limited to true, but we do not send any packets, so
      sent_pkts is zero, so we don't call the cwnd-limited logic to update
      tp->is_cwnd_limited.
      
      Fixes: ca8a2263 ("tcp: make cwnd-limited checks measurement-based, and gentler")
      Reported-by: NIngemar Johansson <ingemar.s.johansson@ericsson.com>
      Signed-off-by: NNeal Cardwell <ncardwell@google.com>
      Signed-off-by: NYuchung Cheng <ycheng@google.com>
      Acked-by: NSoheil Hassas Yeganeh <soheil@google.com>
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Link: https://lore.kernel.org/r/20201209035759.1225145-1-ncardwell.kernel@gmail.comSigned-off-by: NJakub Kicinski <kuba@kernel.org>
      299bcb55
  13. 01 10月, 2020 2 次提交
  14. 15 9月, 2020 1 次提交
    • E
      tcp: remove SOCK_QUEUE_SHRUNK · 0cbe6a8f
      Eric Dumazet 提交于
      SOCK_QUEUE_SHRUNK is currently used by TCP as a temporary state
      that remembers if some room has been made in the rtx queue
      by an incoming ACK packet.
      
      This is later used from tcp_check_space() before
      considering to send EPOLLOUT.
      
      Problem is: If we receive SACK packets, and no packet
      is removed from RTX queue, we can send fresh packets, thus
      moving them from write queue to rtx queue and eventually
      empty the write queue.
      
      This stall can happen if TCP_NOTSENT_LOWAT is used.
      
      With this fix, we no longer risk stalling sends while holes
      are repaired, and we can fully use socket sndbuf.
      
      This also removes a cache line dirtying for typical RPC
      workloads.
      
      Fixes: c9bee3b7 ("tcp: TCP_NOTSENT_LOWAT socket option")
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Cc: Soheil Hassas Yeganeh <soheil@google.com>
      Acked-by: NNeal Cardwell <ncardwell@google.com>
      Acked-by: NSoheil Hassas Yeganeh <soheil@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      0cbe6a8f
  15. 25 8月, 2020 3 次提交
    • M
      bpf: tcp: Allow bpf prog to write and parse TCP header option · 0813a841
      Martin KaFai Lau 提交于
      [ Note: The TCP changes here is mainly to implement the bpf
        pieces into the bpf_skops_*() functions introduced
        in the earlier patches. ]
      
      The earlier effort in BPF-TCP-CC allows the TCP Congestion Control
      algorithm to be written in BPF.  It opens up opportunities to allow
      a faster turnaround time in testing/releasing new congestion control
      ideas to production environment.
      
      The same flexibility can be extended to writing TCP header option.
      It is not uncommon that people want to test new TCP header option
      to improve the TCP performance.  Another use case is for data-center
      that has a more controlled environment and has more flexibility in
      putting header options for internal only use.
      
      For example, we want to test the idea in putting maximum delay
      ACK in TCP header option which is similar to a draft RFC proposal [1].
      
      This patch introduces the necessary BPF API and use them in the
      TCP stack to allow BPF_PROG_TYPE_SOCK_OPS program to parse
      and write TCP header options.  It currently supports most of
      the TCP packet except RST.
      
      Supported TCP header option:
      ───────────────────────────
      This patch allows the bpf-prog to write any option kind.
      Different bpf-progs can write its own option by calling the new helper
      bpf_store_hdr_opt().  The helper will ensure there is no duplicated
      option in the header.
      
      By allowing bpf-prog to write any option kind, this gives a lot of
      flexibility to the bpf-prog.  Different bpf-prog can write its
      own option kind.  It could also allow the bpf-prog to support a
      recently standardized option on an older kernel.
      
      Sockops Callback Flags:
      ──────────────────────
      The bpf program will only be called to parse/write tcp header option
      if the following newly added callback flags are enabled
      in tp->bpf_sock_ops_cb_flags:
      BPF_SOCK_OPS_PARSE_UNKNOWN_HDR_OPT_CB_FLAG
      BPF_SOCK_OPS_PARSE_ALL_HDR_OPT_CB_FLAG
      BPF_SOCK_OPS_WRITE_HDR_OPT_CB_FLAG
      
      A few words on the PARSE CB flags.  When the above PARSE CB flags are
      turned on, the bpf-prog will be called on packets received
      at a sk that has at least reached the ESTABLISHED state.
      The parsing of the SYN-SYNACK-ACK will be discussed in the
      "3 Way HandShake" section.
      
      The default is off for all of the above new CB flags, i.e. the bpf prog
      will not be called to parse or write bpf hdr option.  There are
      details comment on these new cb flags in the UAPI bpf.h.
      
      sock_ops->skb_data and bpf_load_hdr_opt()
      ─────────────────────────────────────────
      sock_ops->skb_data and sock_ops->skb_data_end covers the whole
      TCP header and its options.  They are read only.
      
      The new bpf_load_hdr_opt() helps to read a particular option "kind"
      from the skb_data.
      
      Please refer to the comment in UAPI bpf.h.  It has details
      on what skb_data contains under different sock_ops->op.
      
      3 Way HandShake
      ───────────────
      The bpf-prog can learn if it is sending SYN or SYNACK by reading the
      sock_ops->skb_tcp_flags.
      
      * Passive side
      
      When writing SYNACK (i.e. sock_ops->op == BPF_SOCK_OPS_WRITE_HDR_OPT_CB),
      the received SYN skb will be available to the bpf prog.  The bpf prog can
      use the SYN skb (which may carry the header option sent from the remote bpf
      prog) to decide what bpf header option should be written to the outgoing
      SYNACK skb.  The SYN packet can be obtained by getsockopt(TCP_BPF_SYN*).
      More on this later.  Also, the bpf prog can learn if it is in syncookie
      mode (by checking sock_ops->args[0] == BPF_WRITE_HDR_TCP_SYNACK_COOKIE).
      
      The bpf prog can store the received SYN pkt by using the existing
      bpf_setsockopt(TCP_SAVE_SYN).  The example in a later patch does it.
      [ Note that the fullsock here is a listen sk, bpf_sk_storage
        is not very useful here since the listen sk will be shared
        by many concurrent connection requests.
      
        Extending bpf_sk_storage support to request_sock will add weight
        to the minisock and it is not necessary better than storing the
        whole ~100 bytes SYN pkt. ]
      
      When the connection is established, the bpf prog will be called
      in the existing PASSIVE_ESTABLISHED_CB callback.  At that time,
      the bpf prog can get the header option from the saved syn and
      then apply the needed operation to the newly established socket.
      The later patch will use the max delay ack specified in the SYN
      header and set the RTO of this newly established connection
      as an example.
      
      The received ACK (that concludes the 3WHS) will also be available to
      the bpf prog during PASSIVE_ESTABLISHED_CB through the sock_ops->skb_data.
      It could be useful in syncookie scenario.  More on this later.
      
      There is an existing getsockopt "TCP_SAVED_SYN" to return the whole
      saved syn pkt which includes the IP[46] header and the TCP header.
      A few "TCP_BPF_SYN*" getsockopt has been added to allow specifying where to
      start getting from, e.g. starting from TCP header, or from IP[46] header.
      
      The new getsockopt(TCP_BPF_SYN*) will also know where it can get
      the SYN's packet from:
        - (a) the just received syn (available when the bpf prog is writing SYNACK)
              and it is the only way to get SYN during syncookie mode.
        or
        - (b) the saved syn (available in PASSIVE_ESTABLISHED_CB and also other
              existing CB).
      
      The bpf prog does not need to know where the SYN pkt is coming from.
      The getsockopt(TCP_BPF_SYN*) will hide this details.
      
      Similarly, a flags "BPF_LOAD_HDR_OPT_TCP_SYN" is also added to
      bpf_load_hdr_opt() to read a particular header option from the SYN packet.
      
      * Fastopen
      
      Fastopen should work the same as the regular non fastopen case.
      This is a test in a later patch.
      
      * Syncookie
      
      For syncookie, the later example patch asks the active
      side's bpf prog to resend the header options in ACK.  The server
      can use bpf_load_hdr_opt() to look at the options in this
      received ACK during PASSIVE_ESTABLISHED_CB.
      
      * Active side
      
      The bpf prog will get a chance to write the bpf header option
      in the SYN packet during WRITE_HDR_OPT_CB.  The received SYNACK
      pkt will also be available to the bpf prog during the existing
      ACTIVE_ESTABLISHED_CB callback through the sock_ops->skb_data
      and bpf_load_hdr_opt().
      
      * Turn off header CB flags after 3WHS
      
      If the bpf prog does not need to write/parse header options
      beyond the 3WHS, the bpf prog can clear the bpf_sock_ops_cb_flags
      to avoid being called for header options.
      Or the bpf-prog can select to leave the UNKNOWN_HDR_OPT_CB_FLAG on
      so that the kernel will only call it when there is option that
      the kernel cannot handle.
      
      [1]: draft-wang-tcpm-low-latency-opt-00
           https://tools.ietf.org/html/draft-wang-tcpm-low-latency-opt-00Signed-off-by: NMartin KaFai Lau <kafai@fb.com>
      Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
      Link: https://lore.kernel.org/bpf/20200820190104.2885895-1-kafai@fb.com
      0813a841
    • M
      bpf: tcp: Add bpf_skops_hdr_opt_len() and bpf_skops_write_hdr_opt() · 331fca43
      Martin KaFai Lau 提交于
      The bpf prog needs to parse the SYN header to learn what options have
      been sent by the peer's bpf-prog before writing its options into SYNACK.
      This patch adds a "syn_skb" arg to tcp_make_synack() and send_synack().
      This syn_skb will eventually be made available (as read-only) to the
      bpf prog.  This will be the only SYN packet available to the bpf
      prog during syncookie.  For other regular cases, the bpf prog can
      also use the saved_syn.
      
      When writing options, the bpf prog will first be called to tell the
      kernel its required number of bytes.  It is done by the new
      bpf_skops_hdr_opt_len().  The bpf prog will only be called when the new
      BPF_SOCK_OPS_WRITE_HDR_OPT_CB_FLAG is set in tp->bpf_sock_ops_cb_flags.
      When the bpf prog returns, the kernel will know how many bytes are needed
      and then update the "*remaining" arg accordingly.  4 byte alignment will
      be included in the "*remaining" before this function returns.  The 4 byte
      aligned number of bytes will also be stored into the opts->bpf_opt_len.
      "bpf_opt_len" is a newly added member to the struct tcp_out_options.
      
      Then the new bpf_skops_write_hdr_opt() will call the bpf prog to write the
      header options.  The bpf prog is only called if it has reserved spaces
      before (opts->bpf_opt_len > 0).
      
      The bpf prog is the last one getting a chance to reserve header space
      and writing the header option.
      
      These two functions are half implemented to highlight the changes in
      TCP stack.  The actual codes preparing the bpf running context and
      invoking the bpf prog will be added in the later patch with other
      necessary bpf pieces.
      Signed-off-by: NMartin KaFai Lau <kafai@fb.com>
      Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
      Reviewed-by: NEric Dumazet <edumazet@google.com>
      Link: https://lore.kernel.org/bpf/20200820190052.2885316-1-kafai@fb.com
      331fca43
    • M
      tcp: bpf: Add TCP_BPF_DELACK_MAX setsockopt · 2b8ee4f0
      Martin KaFai Lau 提交于
      This change is mostly from an internal patch and adapts it from sysctl
      config to the bpf_setsockopt setup.
      
      The bpf_prog can set the max delay ack by using
      bpf_setsockopt(TCP_BPF_DELACK_MAX).  This max delay ack can be communicated
      to its peer through bpf header option.  The receiving peer can then use
      this max delay ack and set a potentially lower rto by using
      bpf_setsockopt(TCP_BPF_RTO_MIN) which will be introduced
      in the next patch.
      
      Another later selftest patch will also use it like the above to show
      how to write and parse bpf tcp header option.
      Signed-off-by: NMartin KaFai Lau <kafai@fb.com>
      Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
      Reviewed-by: NEric Dumazet <edumazet@google.com>
      Acked-by: NJohn Fastabend <john.fastabend@gmail.com>
      Link: https://lore.kernel.org/bpf/20200820190021.2884000-1-kafai@fb.com
      2b8ee4f0
  16. 01 8月, 2020 1 次提交
  17. 24 7月, 2020 1 次提交
    • Y
      tcp: allow at most one TLP probe per flight · 76be93fc
      Yuchung Cheng 提交于
      Previously TLP may send multiple probes of new data in one
      flight. This happens when the sender is cwnd limited. After the
      initial TLP containing new data is sent, the sender receives another
      ACK that acks partial inflight.  It may re-arm another TLP timer
      to send more, if no further ACK returns before the next TLP timeout
      (PTO) expires. The sender may send in theory a large amount of TLP
      until send queue is depleted. This only happens if the sender sees
      such irregular uncommon ACK pattern. But it is generally undesirable
      behavior during congestion especially.
      
      The original TLP design restrict only one TLP probe per inflight as
      published in "Reducing Web Latency: the Virtue of Gentle Aggression",
      SIGCOMM 2013. This patch changes TLP to send at most one probe
      per inflight.
      
      Note that if the sender is app-limited, TLP retransmits old data
      and did not have this issue.
      Signed-off-by: NYuchung Cheng <ycheng@google.com>
      Signed-off-by: NNeal Cardwell <ncardwell@google.com>
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      76be93fc
  18. 14 7月, 2020 1 次提交
  19. 02 7月, 2020 1 次提交
    • E
      tcp: md5: do not send silly options in SYNCOOKIES · e114e1e8
      Eric Dumazet 提交于
      Whenever cookie_init_timestamp() has been used to encode
      ECN,SACK,WSCALE options, we can not remove the TS option in the SYNACK.
      
      Otherwise, tcp_synack_options() will still advertize options like WSCALE
      that we can not deduce later when receiving the packet from the client
      to complete 3WHS.
      
      Note that modern linux TCP stacks wont use MD5+TS+SACK in a SYN packet,
      but we can not know for sure that all TCP stacks have the same logic.
      
      Before the fix a tcpdump would exhibit this wrong exchange :
      
      10:12:15.464591 IP C > S: Flags [S], seq 4202415601, win 65535, options [nop,nop,md5 valid,mss 1400,sackOK,TS val 456965269 ecr 0,nop,wscale 8], length 0
      10:12:15.464602 IP S > C: Flags [S.], seq 253516766, ack 4202415602, win 65535, options [nop,nop,md5 valid,mss 1400,nop,nop,sackOK,nop,wscale 8], length 0
      10:12:15.464611 IP C > S: Flags [.], ack 1, win 256, options [nop,nop,md5 valid], length 0
      10:12:15.464678 IP C > S: Flags [P.], seq 1:13, ack 1, win 256, options [nop,nop,md5 valid], length 12
      10:12:15.464685 IP S > C: Flags [.], ack 13, win 65535, options [nop,nop,md5 valid], length 0
      
      After this patch the exchange looks saner :
      
      11:59:59.882990 IP C > S: Flags [S], seq 517075944, win 65535, options [nop,nop,md5 valid,mss 1400,sackOK,TS val 1751508483 ecr 0,nop,wscale 8], length 0
      11:59:59.883002 IP S > C: Flags [S.], seq 1902939253, ack 517075945, win 65535, options [nop,nop,md5 valid,mss 1400,sackOK,TS val 1751508479 ecr 1751508483,nop,wscale 8], length 0
      11:59:59.883012 IP C > S: Flags [.], ack 1, win 256, options [nop,nop,md5 valid,nop,nop,TS val 1751508483 ecr 1751508479], length 0
      11:59:59.883114 IP C > S: Flags [P.], seq 1:13, ack 1, win 256, options [nop,nop,md5 valid,nop,nop,TS val 1751508483 ecr 1751508479], length 12
      11:59:59.883122 IP S > C: Flags [.], ack 13, win 256, options [nop,nop,md5 valid,nop,nop,TS val 1751508483 ecr 1751508483], length 0
      11:59:59.883152 IP S > C: Flags [P.], seq 1:13, ack 13, win 256, options [nop,nop,md5 valid,nop,nop,TS val 1751508484 ecr 1751508483], length 12
      11:59:59.883170 IP C > S: Flags [.], ack 13, win 256, options [nop,nop,md5 valid,nop,nop,TS val 1751508484 ecr 1751508484], length 0
      
      Of course, no SACK block will ever be added later, but nothing should break.
      Technically, we could remove the 4 nops included in MD5+TS options,
      but again some stacks could break seeing not conventional alignment.
      
      Fixes: 4957faad ("TCPCT part 1g: Responder Cookie => Initiator")
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Cc: Florian Westphal <fw@strlen.de>
      Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      e114e1e8
  20. 21 6月, 2020 2 次提交