1. 02 Nov, 2021 3 commits
    • J
      Revert "net: avoid double accounting for pure zerocopy skbs" · 84882cf7
      Jakub Kicinski committed
      This reverts commit f1a456f8.
      
        WARNING: CPU: 1 PID: 6819 at net/core/skbuff.c:5429 skb_try_coalesce+0x78b/0x7e0
        CPU: 1 PID: 6819 Comm: xxxxxxx Kdump: loaded Tainted: G S                5.15.0-04194-gd852503f7711 #16
        RIP: 0010:skb_try_coalesce+0x78b/0x7e0
        Code: e8 2a bf 41 ff 44 8b b3 bc 00 00 00 48 8b 7c 24 30 e8 19 c0 41 ff 44 89 f0 48 03 83 c0 00 00 00 48 89 44 24 40 e9 47 fb ff ff <0f> 0b e9 ca fc ff ff 4c 8d 70 ff 48 83 c0 07 48 89 44 24 38 e9 61
        RSP: 0018:ffff88881f449688 EFLAGS: 00010282
        RAX: 00000000fffffe96 RBX: ffff8881566e4460 RCX: ffffffff82079f7e
        RDX: 0000000000000003 RSI: dffffc0000000000 RDI: ffff8881566e47b0
        RBP: ffff8881566e46e0 R08: ffffed102619235d R09: ffffed102619235d
        R10: ffff888130c91ae3 R11: ffffed102619235c R12: ffff88881f4498a0
        R13: 0000000000000056 R14: 0000000000000009 R15: ffff888130c91ac0
        FS:  00007fec2cbb9700(0000) GS:ffff88881f440000(0000) knlGS:0000000000000000
        CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
        CR2: 00007fec1b060d80 CR3: 00000003acf94005 CR4: 00000000003706e0
        DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
        DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
        Call Trace:
         <IRQ>
         tcp_try_coalesce+0xeb/0x290
         ? tcp_parse_options+0x610/0x610
         ? mark_held_locks+0x79/0xa0
         tcp_queue_rcv+0x69/0x2f0
         tcp_rcv_established+0xa49/0xd40
         ? tcp_data_queue+0x18a0/0x18a0
         tcp_v6_do_rcv+0x1c9/0x880
         ? rt6_mtu_change_route+0x100/0x100
         tcp_v6_rcv+0x1624/0x1830
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • T
      net: avoid double accounting for pure zerocopy skbs · f1a456f8
      Talal Ahmad committed
      Track skbs with only zerocopy data and avoid charging them to kernel
      memory to correctly account the memory utilization for msg_zerocopy.
      All of the data in such skbs is held in user pages which are already
      accounted to user. Before this change, they are charged again in
      kernel in __zerocopy_sg_from_iter. The charging in kernel is
      excessive because data is not being copied into skb frags. This
      excessive charging can lead to kernel going into memory pressure
      state which impacts all sockets in the system adversely. Mark pure
      zerocopy skbs with a SKBFL_PURE_ZEROCOPY flag and remove
      charge/uncharge for data in such skbs.
      
      Initially, an skb is marked pure zerocopy when it is empty and in
      zerocopy path. skb can then change from a pure zerocopy skb to mixed
      data skb (zerocopy and copy data) if it is at tail of write queue and
      there is room available in it and non-zerocopy data is being sent in
      the next sendmsg call. At this time sk_mem_charge is done for the pure
      zerocopied data and the pure zerocopy flag is unmarked. We found that
      this happens very rarely on workloads that pass MSG_ZEROCOPY.
      
      A pure zerocopy skb can later be coalesced into normal skb if they are
      next to each other in queue but this patch prevents coalescing from
      happening. This avoids complexity of charging when skb downgrades from
      pure zerocopy to mixed. This is also rare.
      
      In sk_wmem_free_skb, if it is a pure zerocopy skb, an sk_mem_uncharge
      for SKB_TRUESIZE(MAX_TCP_HEADER) is done for sk_mem_charge in
      tcp_skb_entail for an skb without data.
      
      Testing with the msg_zerocopy.c benchmark between two hosts (100G NICs)
      with zerocopy showed that before this patch the 'sock' variable in
      memory.stat for cgroup2 that tracks sum of sk_forward_alloc,
      sk_rmem_alloc and sk_wmem_queued is around 1822720 and with this
      change it is 0. This is due to no charge to sk_forward_alloc for
      zerocopy data and shows memory utilization for kernel is lowered.
      Signed-off-by: Talal Ahmad <talalahmad@google.com>
      Acked-by: Arjun Roy <arjunroy@google.com>
      Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
      Signed-off-by: Willem de Bruijn <willemb@google.com>
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
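The accounting rule described in this (later reverted) commit can be sketched as a small user-space model. This is only an illustration of the message text, not the kernel code: `struct demo_skb` and both `demo_*` functions are hypothetical names, and the model ignores the header charge done in tcp_skb_entail.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for sk_buff; only the fields this sketch needs. */
struct demo_skb {
    size_t copied_bytes;   /* payload copied into kernel memory */
    size_t zerocopy_bytes; /* payload left in pinned user pages */
    bool pure_zerocopy;    /* models the SKBFL_PURE_ZEROCOPY flag */
};

/* Returns the bytes to charge to the socket when zerocopy data is appended. */
size_t demo_charge_for_zerocopy_append(struct demo_skb *skb, size_t len)
{
    /* An empty skb entering the zerocopy path is marked pure zerocopy. */
    if (skb->copied_bytes == 0 && skb->zerocopy_bytes == 0)
        skb->pure_zerocopy = true;
    skb->zerocopy_bytes += len;
    /* Pure zerocopy data is already accounted to the user: no charge. */
    return skb->pure_zerocopy ? 0 : len;
}

/* Returns the bytes to charge when copied (non-zerocopy) data is appended. */
size_t demo_charge_for_copy_append(struct demo_skb *skb, size_t len)
{
    size_t charge = len;

    if (skb->pure_zerocopy) {
        /* Downgrade to a mixed skb: charge the zerocopy bytes skipped so far. */
        charge += skb->zerocopy_bytes;
        skb->pure_zerocopy = false;
    }
    skb->copied_bytes += len;
    return charge;
}
```

In this model a pure zerocopy skb contributes nothing to the kernel-side charge until copied data is appended to it, at which point the deferred charge is applied in one step.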
    • T
      tcp: rename sk_wmem_free_skb · 03271f3a
      Talal Ahmad committed
      sk_wmem_free_skb() is only used by TCP.
      
      Rename it to make this clear, and move its declaration to
      include/net/tcp.h
      Signed-off-by: Talal Ahmad <talalahmad@google.com>
      Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
      Acked-by: Arjun Roy <arjunroy@google.com>
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
  2. 28 Oct, 2021 4 commits
  3. 26 Oct, 2021 1 commit
  4. 30 Sep, 2021 1 commit
  5. 24 Sep, 2021 1 commit
  6. 18 Aug, 2021 1 commit
  7. 09 Jul, 2021 1 commit
    • E
      ipv6: tcp: drop silly ICMPv6 packet too big messages · c7bb4b89
      Eric Dumazet committed
      While TCP stack scales reasonably well, there is still one part that
      can be used to DDOS it.
      
      IPv6 Packet too big messages have to lookup/insert a new route,
      and if abused by attackers, can easily put hosts under high stress,
      with many cpus contending on a spinlock while one is stuck in fib6_run_gc()
      
      ip6_protocol_deliver_rcu()
       icmpv6_rcv()
        icmpv6_notify()
         tcp_v6_err()
          tcp_v6_mtu_reduced()
           inet6_csk_update_pmtu()
            ip6_rt_update_pmtu()
             __ip6_rt_update_pmtu()
              ip6_rt_cache_alloc()
               ip6_dst_alloc()
                dst_alloc()
                 ip6_dst_gc()
                  fib6_run_gc()
                   spin_lock_bh() ...
      
      Some of our servers have been hit by malicious ICMPv6 packets
      trying to _increase_ the MTU/MSS of TCP flows.
      
      We believe these ICMPv6 packets are a result of a bug in one ISP stack,
      since they were blindly sent back for _every_ (small) packet sent to them.
      
      These packets are for one TCP flow:
      09:24:36.266491 IP6 Addr1 > Victim ICMP6, packet too big, mtu 1460, length 1240
      09:24:36.266509 IP6 Addr1 > Victim ICMP6, packet too big, mtu 1460, length 1240
      09:24:36.316688 IP6 Addr1 > Victim ICMP6, packet too big, mtu 1460, length 1240
      09:24:36.316704 IP6 Addr1 > Victim ICMP6, packet too big, mtu 1460, length 1240
      09:24:36.608151 IP6 Addr1 > Victim ICMP6, packet too big, mtu 1460, length 1240
      
      TCP stack can filter some silly requests :
      
      1) MTU below IPV6_MIN_MTU can be filtered early in tcp_v6_err()
      2) tcp_v6_mtu_reduced() can drop requests trying to increase current MSS.
      
      These tests happen before the IPv6 routing stack is entered, thus
      removing the potential contention and route exhaustion.
      
      Note that IPv6 stack was performing these checks, but too late
      (ie : after the route has been added, and after the potential
      garbage collect war)
      
      v2: fix typo caught by Martin, thanks !
      v3: exports tcp_mtu_to_mss(), caught by David, thanks !
      
      Fixes: 1da177e4 ("Linux-2.6.12-rc2")
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Reviewed-by: Maciej Żenczykowski <maze@google.com>
      Cc: Martin KaFai Lau <kafai@fb.com>
      Acked-by: Martin KaFai Lau <kafai@fb.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
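The two early filters listed in this commit message can be sketched as a single predicate. This is a simplified model under stated assumptions: the `demo_*` names are hypothetical, and the MTU-to-MSS conversion only subtracts fixed IPv6 and TCP header sizes, whereas the kernel's tcp_mtu_to_mss() also accounts for options and extension headers.

```c
#include <stdbool.h>
#include <stdint.h>

#define DEMO_IPV6_MIN_MTU 1280u /* minimum IPv6 link MTU */
#define DEMO_IPV6_HDR_LEN 40u   /* fixed IPv6 header, no extension headers */
#define DEMO_TCP_HDR_LEN  20u   /* TCP header without options */

/* Simplified MTU -> MSS conversion (assumption: no options, no ext headers). */
static uint32_t demo_mtu_to_mss(uint32_t mtu)
{
    return mtu - DEMO_IPV6_HDR_LEN - DEMO_TCP_HDR_LEN;
}

/* Returns true when a Packet Too Big message can be dropped before routing. */
bool demo_ptb_is_silly(uint32_t reported_mtu, uint32_t current_mss)
{
    /* 1) An MTU below the IPv6 minimum is never legitimate. */
    if (reported_mtu < DEMO_IPV6_MIN_MTU)
        return true;
    /* 2) A message that would not reduce the current MSS is useless. */
    if (demo_mtu_to_mss(reported_mtu) >= current_mss)
        return true;
    return false;
}
```

Both checks need only the reported MTU and the flow's current MSS, which is why they can run before any route lookup or insertion.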
  8. 12 Mar, 2021 2 commits
    • E
      tcp: remove obsolete check in __tcp_retransmit_skb() · ac3959fd
      Eric Dumazet committed
      TSQ provides a nice way to avoid bufferbloat on individual socket,
      including retransmit packets. We can get rid of the old
      heuristic:
      
      	/* Do not sent more than we queued. 1/4 is reserved for possible
      	 * copying overhead: fragmentation, tunneling, mangling etc.
      	 */
      	if (refcount_read(&sk->sk_wmem_alloc) >
      	    min_t(u32, sk->sk_wmem_queued + (sk->sk_wmem_queued >> 2),
      		  sk->sk_sndbuf))
      		return -EAGAIN;
      
      This heuristic was giving false positives according to Jakub,
      whenever TX completions are delayed above RTT. (Ack packets
      are processed by TCP stack before clones are orphaned/freed)
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Reported-by: Jakub Kicinski <kuba@kernel.org>
      Cc: Neal Cardwell <ncardwell@google.com>
      Cc: Yuchung Cheng <ycheng@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
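To see why the removed heuristic could misfire, it helps to evaluate it with plain numbers. The function below restates the quoted condition in user-space C; the `demo_` name and the plain integer parameters are assumptions made for illustration (the kernel reads these values from struct sock).

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Restates the removed check: refuse to retransmit (the kernel returned
 * -EAGAIN) when the bytes still held by the lower layers exceed the queued
 * bytes plus a 25% allowance, capped at the send buffer size.
 */
bool demo_old_retransmit_would_defer(uint32_t wmem_alloc,
                                     uint32_t wmem_queued,
                                     uint32_t sndbuf)
{
    uint32_t limit = wmem_queued + (wmem_queued >> 2);

    if (limit > sndbuf)
        limit = sndbuf;
    return wmem_alloc > limit;
}
```

With delayed TX completions, clones that were already acknowledged still count toward the allocated bytes, so the first argument can exceed the limit even though nothing useful is pending, which matches the false positives described above.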
    • E
      tcp: plug skb_still_in_host_queue() to TSQ · f4dae54e
      Eric Dumazet committed
      Jakub and Neil reported an increase of RTO timers whenever
      TX completions are delayed a bit more (by increasing
      NIC TX coalescing parameters)
      
      The main issue is that the TCP stack has logic preventing a packet
      from being retransmitted if the prior clone has not yet been
      orphaned or freed.
      
      This logic came with commit 1f3279ae ("tcp: avoid
      retransmits of TCP packets hanging in host queues")
      
      Thankfully, in the case skb_still_in_host_queue() detects
      the initial clone is still in flight, it can use TSQ logic
      that will eventually retry later, at the moment the clone
      is freed or orphaned.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Reported-by: Neil Spring <ntspring@fb.com>
      Reported-by: Jakub Kicinski <kuba@kernel.org>
      Cc: Neal Cardwell <ncardwell@google.com>
      Cc: Yuchung Cheng <ycheng@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  9. 24 Jan, 2021 1 commit
  10. 19 Jan, 2021 1 commit
  11. 14 Jan, 2021 1 commit
  12. 10 Dec, 2020 1 commit
    • N
      tcp: fix cwnd-limited bug for TSO deferral where we send nothing · 299bcb55
      Neal Cardwell committed
      When cwnd is not a multiple of the TSO skb size of N*MSS, we can get
      into persistent scenarios where we have the following sequence:
      
      (1) ACK for full-sized skb of N*MSS arrives
        -> tcp_write_xmit() transmit full-sized skb with N*MSS
        -> move pacing release time forward
        -> exit tcp_write_xmit() because pacing time is in the future
      
      (2) TSQ callback or TCP internal pacing timer fires
        -> try to transmit next skb, but TSO deferral finds remainder of
           available cwnd is not big enough to trigger an immediate send
           now, so we defer sending until the next ACK.
      
      (3) repeat...
      
      So we can get into a case where we never mark ourselves as
      cwnd-limited for many seconds at a time, even with
      bulk/infinite-backlog senders, because:
      
      o In case (1) above, every time in tcp_write_xmit() we have enough
      cwnd to send a full-sized skb, we are not fully using the cwnd
      (because cwnd is not a multiple of the TSO skb size). So every time we
      send data, we are not cwnd limited, and so in the cwnd-limited
      tracking code in tcp_cwnd_validate() we mark ourselves as not
      cwnd-limited.
      
      o In case (2) above, every time in tcp_write_xmit() that we try to
      transmit the "remainder" of the cwnd but defer, we set the local
      variable is_cwnd_limited to true, but we do not send any packets, so
      sent_pkts is zero, so we don't call the cwnd-limited logic to update
      tp->is_cwnd_limited.
      
      Fixes: ca8a2263 ("tcp: make cwnd-limited checks measurement-based, and gentler")
      Reported-by: Ingemar Johansson <ingemar.s.johansson@ericsson.com>
      Signed-off-by: Neal Cardwell <ncardwell@google.com>
      Signed-off-by: Yuchung Cheng <ycheng@google.com>
      Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Link: https://lore.kernel.org/r/20201209035759.1225145-1-ncardwell.kernel@gmail.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
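The message above describes the failure mode but not the code change. The sketch below contrasts the condition described in case (2), where the cwnd-limited state is only updated when packets were sent, with one plausible relaxed condition. The second predicate is an assumption made for illustration, not a quotation of the patch, and both `demo_*` names are hypothetical.

```c
#include <stdbool.h>

/*
 * Behavior described in the message: the cwnd-limited tracking logic runs
 * only when at least one packet was sent in this tcp_write_xmit() pass.
 */
bool demo_updates_cwnd_limited_before(unsigned int sent_pkts,
                                      bool is_cwnd_limited)
{
    (void)is_cwnd_limited; /* ignored: this is the reported problem */
    return sent_pkts > 0;
}

/*
 * Assumed shape of a fix: also run the tracking logic when TSO deferral
 * found the flow cwnd-limited, even though nothing was sent.
 */
bool demo_updates_cwnd_limited_after(unsigned int sent_pkts,
                                     bool is_cwnd_limited)
{
    return sent_pkts > 0 || is_cwnd_limited;
}
```

For the deferral pass in case (2), with zero packets sent and the local flag set, the first predicate is false and the second is true, which is the difference the sketch is meant to highlight.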
  13. 21 Nov, 2020 1 commit
  14. 08 Nov, 2020 1 commit
  15. 05 Nov, 2020 1 commit
    • P
      tcp: propagate MPTCP skb extensions on xmit splits · 5a369ca6
      Paolo Abeni committed
      When the TCP stack splits a packet on the write queue, the tail
      half currently loses the associated skb extensions, and will not
      carry the DSM on the wire.
      
      The above does not cause functional problems and is allowed by
      the RFC, but interacts badly with GRO and RX coalescing, as possible
      candidates for aggregation will carry different TCP options.
      
      This change tries to improve the MPTCP behavior, propagating the
      skb extensions on split.
      
      Additionally, we must prevent the MPTCP stack from updating the
      mapping after the split occurs: that would both violate the RFC and
      fool the reader.
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
      Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
  16. 01 Oct, 2020 2 commits
  17. 15 Sep, 2020 1 commit
    • E
      tcp: remove SOCK_QUEUE_SHRUNK · 0cbe6a8f
      Eric Dumazet committed
      SOCK_QUEUE_SHRUNK is currently used by TCP as a temporary state
      that remembers if some room has been made in the rtx queue
      by an incoming ACK packet.
      
      This is later used from tcp_check_space() before
      considering to send EPOLLOUT.
      
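      Problem is: if we receive SACK packets and no packet
      is removed from the RTX queue, we can still send fresh packets, thus
      moving them from the write queue to the RTX queue and eventually
      emptying the write queue. Since no room was made in the RTX queue,
      SOCK_QUEUE_SHRUNK is not set and tcp_check_space() does not
      consider sending EPOLLOUT.
      
      This stall can happen if TCP_NOTSENT_LOWAT is used.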
      
      With this fix, we no longer risk stalling sends while holes
      are repaired, and we can fully use socket sndbuf.
      
      This also removes a cache line dirtying for typical RPC
      workloads.
      
      Fixes: c9bee3b7 ("tcp: TCP_NOTSENT_LOWAT socket option")
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Cc: Soheil Hassas Yeganeh <soheil@google.com>
      Acked-by: Neal Cardwell <ncardwell@google.com>
      Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  18. 25 Aug, 2020 3 commits
    • M
      bpf: tcp: Allow bpf prog to write and parse TCP header option · 0813a841
      Martin KaFai Lau committed
      [ Note: The TCP changes here are mainly to implement the bpf
        pieces in the bpf_skops_*() functions introduced
        in the earlier patches. ]
      
      The earlier effort in BPF-TCP-CC allows the TCP Congestion Control
      algorithm to be written in BPF.  It opens up opportunities to allow
      a faster turnaround time in testing/releasing new congestion control
      ideas to production environment.
      
      The same flexibility can be extended to writing TCP header option.
      It is not uncommon that people want to test new TCP header option
      to improve the TCP performance.  Another use case is for data-center
      that has a more controlled environment and has more flexibility in
      putting header options for internal only use.
      
      For example, we want to test the idea in putting maximum delay
      ACK in TCP header option which is similar to a draft RFC proposal [1].
      
      This patch introduces the necessary BPF API and use them in the
      TCP stack to allow BPF_PROG_TYPE_SOCK_OPS program to parse
      and write TCP header options.  It currently supports most of
      the TCP packet except RST.
      
      Supported TCP header option:
      ───────────────────────────
      This patch allows the bpf-prog to write any option kind.
      Different bpf-progs can write its own option by calling the new helper
      bpf_store_hdr_opt().  The helper will ensure there is no duplicated
      option in the header.
      
      By allowing bpf-prog to write any option kind, this gives a lot of
      flexibility to the bpf-prog.  Different bpf-prog can write its
      own option kind.  It could also allow the bpf-prog to support a
      recently standardized option on an older kernel.
      
      Sockops Callback Flags:
      ──────────────────────
      The bpf program will only be called to parse/write tcp header option
      if the following newly added callback flags are enabled
      in tp->bpf_sock_ops_cb_flags:
      BPF_SOCK_OPS_PARSE_UNKNOWN_HDR_OPT_CB_FLAG
      BPF_SOCK_OPS_PARSE_ALL_HDR_OPT_CB_FLAG
      BPF_SOCK_OPS_WRITE_HDR_OPT_CB_FLAG
      
      A few words on the PARSE CB flags.  When the above PARSE CB flags are
      turned on, the bpf-prog will be called on packets received
      at a sk that has at least reached the ESTABLISHED state.
      The parsing of the SYN-SYNACK-ACK will be discussed in the
      "3 Way HandShake" section.
      
      The default is off for all of the above new CB flags, i.e. the bpf prog
      will not be called to parse or write bpf hdr option.  There are
      detailed comments on these new cb flags in the UAPI bpf.h.
      
      sock_ops->skb_data and bpf_load_hdr_opt()
      ─────────────────────────────────────────
      sock_ops->skb_data and sock_ops->skb_data_end covers the whole
      TCP header and its options.  They are read only.
      
      The new bpf_load_hdr_opt() helps to read a particular option "kind"
      from the skb_data.
      
      Please refer to the comment in UAPI bpf.h.  It has details
      on what skb_data contains under different sock_ops->op.
      
      3 Way HandShake
      ───────────────
      The bpf-prog can learn if it is sending SYN or SYNACK by reading the
      sock_ops->skb_tcp_flags.
      
      * Passive side
      
      When writing SYNACK (i.e. sock_ops->op == BPF_SOCK_OPS_WRITE_HDR_OPT_CB),
      the received SYN skb will be available to the bpf prog.  The bpf prog can
      use the SYN skb (which may carry the header option sent from the remote bpf
      prog) to decide what bpf header option should be written to the outgoing
      SYNACK skb.  The SYN packet can be obtained by getsockopt(TCP_BPF_SYN*).
      More on this later.  Also, the bpf prog can learn if it is in syncookie
      mode (by checking sock_ops->args[0] == BPF_WRITE_HDR_TCP_SYNACK_COOKIE).
      
      The bpf prog can store the received SYN pkt by using the existing
      bpf_setsockopt(TCP_SAVE_SYN).  The example in a later patch does it.
      [ Note that the fullsock here is a listen sk, bpf_sk_storage
        is not very useful here since the listen sk will be shared
        by many concurrent connection requests.
      
        Extending bpf_sk_storage support to request_sock will add weight
        to the minisock and it is not necessarily better than storing the
        whole ~100 bytes SYN pkt. ]
      
      When the connection is established, the bpf prog will be called
      in the existing PASSIVE_ESTABLISHED_CB callback.  At that time,
      the bpf prog can get the header option from the saved syn and
      then apply the needed operation to the newly established socket.
      The later patch will use the max delay ack specified in the SYN
      header and set the RTO of this newly established connection
      as an example.
      
      The received ACK (that concludes the 3WHS) will also be available to
      the bpf prog during PASSIVE_ESTABLISHED_CB through the sock_ops->skb_data.
      It could be useful in syncookie scenario.  More on this later.
      
      There is an existing getsockopt "TCP_SAVED_SYN" to return the whole
      saved syn pkt which includes the IP[46] header and the TCP header.
      A few "TCP_BPF_SYN*" getsockopt has been added to allow specifying where to
      start getting from, e.g. starting from TCP header, or from IP[46] header.
      
      The new getsockopt(TCP_BPF_SYN*) will also know where it can get
      the SYN's packet from:
        - (a) the just received syn (available when the bpf prog is writing SYNACK)
              and it is the only way to get SYN during syncookie mode.
        or
        - (b) the saved syn (available in PASSIVE_ESTABLISHED_CB and also other
              existing CB).
      
      The bpf prog does not need to know where the SYN pkt is coming from.
      The getsockopt(TCP_BPF_SYN*) will hide these details.
      
      Similarly, a flag "BPF_LOAD_HDR_OPT_TCP_SYN" is also added to
      bpf_load_hdr_opt() to read a particular header option from the SYN packet.
      
      * Fastopen
      
      Fastopen should work the same as the regular non fastopen case.
      This is a test in a later patch.
      
      * Syncookie
      
      For syncookie, the later example patch asks the active
      side's bpf prog to resend the header options in ACK.  The server
      can use bpf_load_hdr_opt() to look at the options in this
      received ACK during PASSIVE_ESTABLISHED_CB.
      
      * Active side
      
      The bpf prog will get a chance to write the bpf header option
      in the SYN packet during WRITE_HDR_OPT_CB.  The received SYNACK
      pkt will also be available to the bpf prog during the existing
      ACTIVE_ESTABLISHED_CB callback through the sock_ops->skb_data
      and bpf_load_hdr_opt().
      
      * Turn off header CB flags after 3WHS
      
      If the bpf prog does not need to write/parse header options
      beyond the 3WHS, the bpf prog can clear the bpf_sock_ops_cb_flags
      to avoid being called for header options.
      Or the bpf-prog can choose to leave the UNKNOWN_HDR_OPT_CB_FLAG on
      so that the kernel will only call it when there is an option that
      the kernel cannot handle.
      
      [1]: draft-wang-tcpm-low-latency-opt-00
           https://tools.ietf.org/html/draft-wang-tcpm-low-latency-opt-00
      Signed-off-by: Martin KaFai Lau <kafai@fb.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Link: https://lore.kernel.org/bpf/20200820190104.2885895-1-kafai@fb.com
    • M
      bpf: tcp: Add bpf_skops_hdr_opt_len() and bpf_skops_write_hdr_opt() · 331fca43
      Martin KaFai Lau committed
      The bpf prog needs to parse the SYN header to learn what options have
      been sent by the peer's bpf-prog before writing its options into SYNACK.
      This patch adds a "syn_skb" arg to tcp_make_synack() and send_synack().
      This syn_skb will eventually be made available (as read-only) to the
      bpf prog.  This will be the only SYN packet available to the bpf
      prog during syncookie.  For other regular cases, the bpf prog can
      also use the saved_syn.
      
      When writing options, the bpf prog will first be called to tell the
      kernel its required number of bytes.  It is done by the new
      bpf_skops_hdr_opt_len().  The bpf prog will only be called when the new
      BPF_SOCK_OPS_WRITE_HDR_OPT_CB_FLAG is set in tp->bpf_sock_ops_cb_flags.
      When the bpf prog returns, the kernel will know how many bytes are needed
      and then update the "*remaining" arg accordingly.  4 byte alignment will
      be included in the "*remaining" before this function returns.  The 4 byte
      aligned number of bytes will also be stored into the opts->bpf_opt_len.
      "bpf_opt_len" is a newly added member to the struct tcp_out_options.
      
      Then the new bpf_skops_write_hdr_opt() will call the bpf prog to write the
      header options.  The bpf prog is only called if it has reserved space
      before (opts->bpf_opt_len > 0).
      
      The bpf prog is the last one getting a chance to reserve header space
      and writing the header option.
      
      These two functions are half implemented to highlight the changes in
      the TCP stack.  The actual code preparing the bpf running context and
      invoking the bpf prog will be added in a later patch with other
      necessary bpf pieces.
      Signed-off-by: Martin KaFai Lau <kafai@fb.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Reviewed-by: Eric Dumazet <edumazet@google.com>
      Link: https://lore.kernel.org/bpf/20200820190052.2885316-1-kafai@fb.com
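The space reservation step described in this commit (round the requested length up to a 4-byte boundary, then subtract it from the remaining option space) can be sketched as follows. The function name is hypothetical, and returning 0 when the aligned request does not fit is an assumption of this sketch rather than documented kernel behavior.

```c
/*
 * Rounds the requested option length up to a multiple of 4 and takes it
 * out of *remaining. Returns the aligned length that was reserved, which
 * plays the role of opts->bpf_opt_len, or 0 if nothing was reserved.
 */
unsigned int demo_reserve_hdr_opt(unsigned int requested,
                                  unsigned int *remaining)
{
    unsigned int aligned = (requested + 3u) & ~3u;

    /* Assumption: a request that does not fit reserves nothing. */
    if (aligned == 0 || aligned > *remaining)
        return 0;
    *remaining -= aligned;
    return aligned;
}
```

A later write step would then be invoked only when the reserved length is greater than zero, mirroring the `opts->bpf_opt_len > 0` condition in the message.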
    • M
      tcp: bpf: Add TCP_BPF_DELACK_MAX setsockopt · 2b8ee4f0
      Martin KaFai Lau committed
      This change is mostly from an internal patch and adapts it from sysctl
      config to the bpf_setsockopt setup.
      
      The bpf_prog can set the max delay ack by using
      bpf_setsockopt(TCP_BPF_DELACK_MAX).  This max delay ack can be communicated
      to its peer through bpf header option.  The receiving peer can then use
      this max delay ack and set a potentially lower rto by using
      bpf_setsockopt(TCP_BPF_RTO_MIN) which will be introduced
      in the next patch.
      
      Another later selftest patch will also use it like the above to show
      how to write and parse bpf tcp header option.
      Signed-off-by: Martin KaFai Lau <kafai@fb.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Reviewed-by: Eric Dumazet <edumazet@google.com>
      Acked-by: John Fastabend <john.fastabend@gmail.com>
      Link: https://lore.kernel.org/bpf/20200820190021.2884000-1-kafai@fb.com
  19. 01 Aug, 2020 1 commit
  20. 24 Jul, 2020 1 commit
    • Y
      tcp: allow at most one TLP probe per flight · 76be93fc
      Yuchung Cheng committed
      Previously TLP may send multiple probes of new data in one
      flight. This happens when the sender is cwnd limited. After the
      initial TLP containing new data is sent, the sender receives another
      ACK that acks partial inflight.  It may re-arm another TLP timer
      to send more, if no further ACK returns before the next TLP timeout
      (PTO) expires. In theory the sender may send a large number of TLP
      probes until the send queue is depleted. This only happens if the sender
      sees such an irregular, uncommon ACK pattern, but it is generally
      undesirable behavior, especially during congestion.
      
      The original TLP design restricts the sender to one TLP probe per inflight, as
      published in "Reducing Web Latency: the Virtue of Gentle Aggression",
      SIGCOMM 2013. This patch changes TLP to send at most one probe
      per inflight.
      
      Note that if the sender is app-limited, TLP retransmits old data
      and does not have this issue.
      Signed-off-by: Yuchung Cheng <ycheng@google.com>
      Signed-off-by: Neal Cardwell <ncardwell@google.com>
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  21. 14 Jul, 2020 1 commit
  22. 02 Jul, 2020 1 commit
    • E
      tcp: md5: do not send silly options in SYNCOOKIES · e114e1e8
      Eric Dumazet committed
      Whenever cookie_init_timestamp() has been used to encode
      ECN,SACK,WSCALE options, we can not remove the TS option in the SYNACK.
      
      Otherwise, tcp_synack_options() will still advertise options like WSCALE
      that we can not deduce later when receiving the packet from the client
      to complete 3WHS.
      
      Note that modern linux TCP stacks won't use MD5+TS+SACK in a SYN packet,
      but we can not know for sure that all TCP stacks have the same logic.
      
      Before the fix a tcpdump would exhibit this wrong exchange :
      
      10:12:15.464591 IP C > S: Flags [S], seq 4202415601, win 65535, options [nop,nop,md5 valid,mss 1400,sackOK,TS val 456965269 ecr 0,nop,wscale 8], length 0
      10:12:15.464602 IP S > C: Flags [S.], seq 253516766, ack 4202415602, win 65535, options [nop,nop,md5 valid,mss 1400,nop,nop,sackOK,nop,wscale 8], length 0
      10:12:15.464611 IP C > S: Flags [.], ack 1, win 256, options [nop,nop,md5 valid], length 0
      10:12:15.464678 IP C > S: Flags [P.], seq 1:13, ack 1, win 256, options [nop,nop,md5 valid], length 12
      10:12:15.464685 IP S > C: Flags [.], ack 13, win 65535, options [nop,nop,md5 valid], length 0
      
      After this patch the exchange looks saner :
      
      11:59:59.882990 IP C > S: Flags [S], seq 517075944, win 65535, options [nop,nop,md5 valid,mss 1400,sackOK,TS val 1751508483 ecr 0,nop,wscale 8], length 0
      11:59:59.883002 IP S > C: Flags [S.], seq 1902939253, ack 517075945, win 65535, options [nop,nop,md5 valid,mss 1400,sackOK,TS val 1751508479 ecr 1751508483,nop,wscale 8], length 0
      11:59:59.883012 IP C > S: Flags [.], ack 1, win 256, options [nop,nop,md5 valid,nop,nop,TS val 1751508483 ecr 1751508479], length 0
      11:59:59.883114 IP C > S: Flags [P.], seq 1:13, ack 1, win 256, options [nop,nop,md5 valid,nop,nop,TS val 1751508483 ecr 1751508479], length 12
      11:59:59.883122 IP S > C: Flags [.], ack 13, win 256, options [nop,nop,md5 valid,nop,nop,TS val 1751508483 ecr 1751508483], length 0
      11:59:59.883152 IP S > C: Flags [P.], seq 1:13, ack 13, win 256, options [nop,nop,md5 valid,nop,nop,TS val 1751508484 ecr 1751508483], length 12
      11:59:59.883170 IP C > S: Flags [.], ack 13, win 256, options [nop,nop,md5 valid,nop,nop,TS val 1751508484 ecr 1751508484], length 0
      
      Of course, no SACK block will ever be added later, but nothing should break.
      Technically, we could remove the 4 nops included in MD5+TS options,
      but again some stacks could break on seeing unconventional alignment.
      
      Fixes: 4957faad ("TCPCT part 1g: Responder Cookie => Initiator")
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Cc: Florian Westphal <fw@strlen.de>
      Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  23. 21 Jun, 2020 2 commits
  24. 07 May, 2020 2 commits
    • E
      tcp: defer xmit timer reset in tcp_xmit_retransmit_queue() · 916e6d1a
      Eric Dumazet committed
      As hinted in prior change ("tcp: refine tcp_pacing_delay()
      for very low pacing rates"), it is probably best arming
      the xmit timer only when all the packets have been scheduled,
      rather than when the head of rtx queue has been re-sent.
      
      This does matter for flows having extremely low pacing rates,
      since their tp->tcp_wstamp_ns could be far in the future.
      
      Note that the regular xmit path has a stronger limit
      in tcp_small_queue_check(), meaning it is less likely to
      go beyond the pacing horizon.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • E
      tcp: refine tcp_pacing_delay() for very low pacing rates · 8dc242ad
      Eric Dumazet committed
      With the addition of horizon feature to sch_fq, we noticed some
      suboptimal behavior of extremely low pacing rate TCP flows, especially
      when TCP is not aware of a drop happening in lower stacks.
      
      Back in commit 3f80e08f ("tcp: add tcp_reset_xmit_timer() helper"),
      tcp_pacing_delay() was added to estimate an extra delay to add to standard
      rto timers.
      
      This patch removes the skb argument from this helper and
      tcp_reset_xmit_timer() because it makes more sense to simply
      consider the time at which next packet is allowed to be sent,
      instead of the time of whatever packet has been sent.
      
      This avoids arming RTO timer too soon and removes
      spurious horizon drops.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  25. 01 May, 2020 1 commit
    • E
      tcp: add tp->dup_ack_counter · 2b195850
      Eric Dumazet committed
      In commit 86de5921 ("tcp: defer SACK compression after DupThresh")
      I added a TCP_FASTRETRANS_THRESH bias to tp->compressed_ack in order
      to enable sack compression only after 3 dupacks.
      
      Since we plan to relax this rule for flows that involve
      stacks not requiring this old rule, this patch adds
      a distinct tp->dup_ack_counter.
      
      This means the TCP_FASTRETRANS_THRESH value is now used
      in a single location that a future patch can adjust:
      
      	if (tp->dup_ack_counter < TCP_FASTRETRANS_THRESH) {
      		tp->dup_ack_counter++;
      		goto send_now;
      	}
      
      This patch also introduces tcp_sack_compress_send_ack()
      helper to ease following patch comprehension.
      
      This patch refines LINUX_MIB_TCPACKCOMPRESSED to not
      count the acks that we had to send if the timer expires
      or tcp_sack_compress_send_ack() is sending an ack.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
      Acked-by: Neal Cardwell <ncardwell@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      2b195850
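The gating described in this commit (send the first few duplicate ACKs immediately, compress only afterwards) can be modelled in a few lines of standalone C. This is a simplified sketch, not the kernel code: `struct dupack_sketch` and `ack_send_now_sketch()` are hypothetical names, and the timer handling and counter cap of the real implementation are left out.

```c
#include <stdbool.h>

#define TCP_FASTRETRANS_THRESH 3

/* Hypothetical stand-in for the two per-socket counters. */
struct dupack_sketch {
	unsigned int dup_ack_counter; /* dupacks sent immediately so far */
	unsigned int compressed_ack;  /* ACKs held back for compression */
};

/* Returns true if this ACK must be sent right away, false if it may be
 * deferred and merged with later ACKs.
 */
static bool ack_send_now_sketch(struct dupack_sketch *tp)
{
	if (tp->dup_ack_counter < TCP_FASTRETRANS_THRESH) {
		tp->dup_ack_counter++;
		return true; /* corresponds to "goto send_now" */
	}
	tp->compressed_ack++;
	return false;
}
```

Keeping the threshold in a counter of its own means `compressed_ack` counts only ACKs that were actually deferred, which matches the commit's stated goal of having the threshold used in a single place.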
  26. 26 April, 2020 1 commit
    • F
      tcp: mptcp: use mptcp receive buffer space to select rcv window · 071c8ed6
      Florian Westphal committed
      In MPTCP, the receive window is shared across all subflows, because it
      refers to the mptcp-level sequence space.
      
      MPTCP receivers already place incoming packets on the mptcp socket
      receive queue and will charge it to the mptcp socket rcvbuf until
      userspace consumes the data.
      
      Update __tcp_select_window to use the occupancy of the parent/mptcp
      socket instead of the subflow socket in case the tcp socket is part
      of a logical mptcp connection.
      
      This commit doesn't change choice of initial window for passive or active
      connections.
      While it would be possible to change those as well, this adds complexity
      (especially when handling MP_JOIN requests).  Furthermore, the MPTCP RFC
      specifically says that a MPTCP sender 'MUST NOT use the RCV.WND field
      of a TCP segment at the connection level if it does not also carry a DSS
      option with a Data ACK field.'
      
      SYN/SYNACK packets do not carry a DSS option with a Data ACK field.
      Signed-off-by: Florian Westphal <fw@strlen.de>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      071c8ed6
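The window selection change can be pictured as choosing which socket's buffer occupancy to measure. The sketch below is a userspace illustration under assumptions: `struct sock_sketch`, its fields and `free_space_sketch()` are hypothetical, and the scaling and clamping done by the real window code are omitted.

```c
#include <stddef.h>

/* Hypothetical stand-in for a socket's receive buffer accounting. */
struct sock_sketch {
	int rcvbuf;     /* receive buffer limit, in bytes */
	int rmem_alloc; /* bytes currently charged to the receive queue */
	const struct sock_sketch *mptcp_parent; /* NULL for plain TCP */
};

/* Free receive space used to pick the advertised window.
 * A subflow of an MPTCP connection reports the parent's occupancy,
 * because the window refers to the MPTCP-level sequence space.
 */
static int free_space_sketch(const struct sock_sketch *sk)
{
	const struct sock_sketch *src = sk->mptcp_parent ? sk->mptcp_parent : sk;
	int space = src->rcvbuf - src->rmem_alloc;

	return space > 0 ? space : 0;
}
```

A subflow with an empty queue of its own still advertises a reduced window here when the parent socket is holding data that userspace has not consumed yet.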
  27. 21 March, 2020 1 commit
  28. 20 March, 2020 1 commit
    • E
      tcp: ensure skb->dev is NULL before leaving TCP stack · b738a185
      Eric Dumazet committed
      skb->rbnode is sharing three skb fields : next, prev, dev
      
      When a packet is sent, TCP keeps the original skb (master)
      in a rtx queue, which was converted to rbtree a while back.
      
      __tcp_transmit_skb() is responsible for cloning the master skb
      and adding the TCP header to the clone before sending it
      to the network layer.
      
      skb_clone() already clears skb->next and skb->prev, but copies
      the master oskb->dev into the clone.
      
      We need to clear skb->dev, otherwise lower layers could interpret
      the value as a pointer to a netdev.
      
      This old bug surfaced recently when commit 28f8bfd1
      ("netfilter: Support iif matches in POSTROUTING") was merged.
      
      Before this netfilter commit, skb->dev value was ignored and
      changed before reaching dev_queue_xmit()
      
      Fixes: 75c119af ("tcp: implement rb-tree based retransmit queue")
      Fixes: 28f8bfd1 ("netfilter: Support iif matches in POSTROUTING")
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Reported-by: Martin Zaharinov <micron10@gmail.com>
      Cc: Florian Westphal <fw@strlen.de>
      Cc: Pablo Neira Ayuso <pablo@netfilter.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      b738a185
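Why a stale skb->dev is dangerous becomes clearer when the field sharing is written out. The following standalone C sketch uses hypothetical `*_sketch` types that only mimic the overlap described in the commit message (next, prev and dev sharing storage with an rbtree node). The exact layout is an assumption for illustration, not the kernel's struct definition.

```c
#include <stddef.h>

struct net_device_sketch; /* opaque, never dereferenced here */

/* Hypothetical three-word rbtree node. */
struct rb_node_sketch {
	unsigned long parent_color;
	struct rb_node_sketch *right;
	struct rb_node_sketch *left;
};

/* next/prev/dev share storage with the rbtree node. */
struct skb_sketch {
	union {
		struct {
			struct skb_sketch *next;
			struct skb_sketch *prev;
			struct net_device_sketch *dev;
		};
		struct rb_node_sketch rbnode;
	};
};

/* Mimics the described clone behavior: next and prev are cleared,
 * but dev is copied from the master as-is.
 */
static void clone_sketch(struct skb_sketch *clone,
			 const struct skb_sketch *master)
{
	*clone = *master;
	clone->next = NULL;
	clone->prev = NULL;
}

/* The fix: clear dev before the clone leaves the TCP stack. */
static void sanitize_before_xmit_sketch(struct skb_sketch *clone)
{
	clone->dev = NULL;
}
```

While the master sits in the rtx rbtree, its third word holds an rbtree pointer. In this sketch the clone inherits that word as `dev`, so a lower layer reading `dev` would treat an rbtree node as a network device unless the field is cleared first.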
  29. 24 January, 2020 1 commit
    • C
      mptcp: parse and emit MP_CAPABLE option according to v1 spec · cc7972ea
      Christoph Paasch committed
      This implements MP_CAPABLE options parsing and writing according
      to RFC 6824 bis / RFC 8684: MPTCP v1.
      
      The local key is sent on the SYN/ACK, and both keys are sent on the
      third ACK. MP_CAPABLE option lengths are updated accordingly. We need
      the skbuff to correctly emit the above, so we push the skbuff struct as
      an argument all the way from tcp code to the relevant mptcp callbacks.
      
      When processing incoming MP_CAPABLE + data, build a full blown DSS-like
      map info, to simplify later processing.  On child socket creation, we
      need to record the remote key, if available.
      Signed-off-by: Christoph Paasch <cpaasch@apple.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      cc7972ea
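The relationship between handshake stage and MP_CAPABLE option length in MPTCP v1 can be sketched as below. The lengths follow RFC 8684 (4-byte option header, 8-byte keys, a 2-byte data-level length when data is carried). The enum and function names are hypothetical, and the optional checksum is not modelled.

```c
/* Hypothetical names for the handshake stages carrying MP_CAPABLE. */
enum mpc_stage_sketch {
	MPC_SKETCH_SYN,      /* no key */
	MPC_SKETCH_SYNACK,   /* local key only */
	MPC_SKETCH_ACK,      /* both keys */
	MPC_SKETCH_ACK_DATA  /* both keys plus data-level length */
};

/* MP_CAPABLE option length in bytes for MPTCP v1, without checksum. */
static int mpc_option_len_sketch(enum mpc_stage_sketch stage)
{
	const int base = 4;     /* kind, length, subtype/version, flags */
	const int key = 8;      /* one 64-bit key */
	const int data_len = 2; /* data-level length field */

	switch (stage) {
	case MPC_SKETCH_SYN:
		return base;
	case MPC_SKETCH_SYNACK:
		return base + key;
	case MPC_SKETCH_ACK:
		return base + 2 * key;
	case MPC_SKETCH_ACK_DATA:
		return base + 2 * key + data_len;
	}
	return -1; /* unknown stage */
}
```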