1. 13 July 2022, 2 commits
  2. 01 June 2022, 1 commit
    • tcp: tcp_rtx_synack() can be called from process context · 0a375c82
      Eric Dumazet authored
      Laurent reported the enclosed report [1]
      
      This bug triggers under the following conditions:
      
      0) Kernel built with CONFIG_DEBUG_PREEMPT=y
      
      1) A new passive FastOpen TCP socket is created.
         This FO socket waits for the ACK coming from the client before
         becoming a fully ESTABLISHED socket.
      2) A socket operation on this socket goes through the lock_sock() /
         release_sock() dance.
      3) While the socket is owned by the user in step 2),
         a retransmit of the SYN is received and stored in socket backlog.
      4) At release_sock() time, the socket backlog is processed while
         in process context.
      5) A SYNACK packet is cooked in response to the SYN retransmit.
      6) -> tcp_rtx_synack() is called in process context.
      
      Before the blamed commit, tcp_rtx_synack() was always called from a BH
      handler (a timer handler).
      
      Fix this by using TCP_INC_STATS() & NET_INC_STATS(),
      which do not assume the caller is in a non-preemptible context.
      
      [1]
      BUG: using __this_cpu_add() in preemptible [00000000] code: epollpep/2180
      caller is tcp_rtx_synack.part.0+0x36/0xc0
      CPU: 10 PID: 2180 Comm: epollpep Tainted: G           OE     5.16.0-0.bpo.4-amd64 #1  Debian 5.16.12-1~bpo11+1
      Hardware name: Supermicro SYS-5039MC-H8TRF/X11SCD-F, BIOS 1.7 11/23/2021
      Call Trace:
       <TASK>
       dump_stack_lvl+0x48/0x5e
       check_preemption_disabled+0xde/0xe0
       tcp_rtx_synack.part.0+0x36/0xc0
       tcp_rtx_synack+0x8d/0xa0
       ? kmem_cache_alloc+0x2e0/0x3e0
       ? apparmor_file_alloc_security+0x3b/0x1f0
       inet_rtx_syn_ack+0x16/0x30
       tcp_check_req+0x367/0x610
       tcp_rcv_state_process+0x91/0xf60
       ? get_nohz_timer_target+0x18/0x1a0
       ? lock_timer_base+0x61/0x80
       ? preempt_count_add+0x68/0xa0
       tcp_v4_do_rcv+0xbd/0x270
       __release_sock+0x6d/0xb0
       release_sock+0x2b/0x90
       sock_setsockopt+0x138/0x1140
       ? __sys_getsockname+0x7e/0xc0
       ? aa_sk_perm+0x3e/0x1a0
       __sys_setsockopt+0x198/0x1e0
       __x64_sys_setsockopt+0x21/0x30
       do_syscall_64+0x38/0xc0
       entry_SYSCALL_64_after_hwframe+0x44/0xae
      
      Fixes: 168a8f58 ("tcp: TCP Fast Open Server - main code path")
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Reported-by: Laurent Fasnacht <laurent.fasnacht@proton.ch>
      Acked-by: Neal Cardwell <ncardwell@google.com>
      Link: https://lore.kernel.org/r/20220530213713.601888-1-eric.dumazet@gmail.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
  3. 16 May 2022, 1 commit
    • net: allow gso_max_size to exceed 65536 · 7c4e983c
      Alexander Duyck authored
      The code for gso_max_size was added originally to allow for debugging and
      workaround of buggy devices that couldn't support TSO with blocks 64K in
      size. The original reason for limiting it to 64K was that this was the
      existing limit of the IPv4 and non-jumbogram IPv6 length fields.
      
      With the addition of Big TCP we can remove this limit and allow the value
      to potentially go up to UINT_MAX and instead be limited by the tso_max_size
      value.
      
      So in order to support this we need to go through and clean up the
      remaining users of the gso_max_size value so that the values will cap at
      64K for non-TCPv6 flows. In addition we can clean up the GSO_MAX_SIZE value
      so that 64K becomes GSO_LEGACY_MAX_SIZE and UINT_MAX will now be the upper
      limit for GSO_MAX_SIZE.
      
      v6: (edumazet) fixed a compile error if CONFIG_IPV6=n,
                     in a new sk_trim_gso_size() helper.
                     netif_set_tso_max_size() caps the requested TSO size
                     with GSO_MAX_SIZE.
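      The capping rule described above can be sketched in a few lines of
      standalone C. This is an illustrative helper under assumptions drawn
      from the changelog, not the kernel's sk_trim_gso_size(): only TCPv6
      flows (which can use Big TCP) may exceed the legacy 64K limit.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define GSO_LEGACY_MAX_SIZE 65536u  /* the old GSO_MAX_SIZE */

/* Hypothetical sketch, not kernel code: clamp the GSO size to the
 * legacy 64K limit for every flow except TCPv6, which may exceed it
 * via Big TCP (IPv6 jumbograms). */
static uint32_t cap_gso_size(uint32_t requested, bool is_tcpv6)
{
    if (!is_tcpv6 && requested > GSO_LEGACY_MAX_SIZE)
        return GSO_LEGACY_MAX_SIZE;
    return requested;
}
```

      The real kernel additionally caps the requested size with the device's
      tso_max_size, per the v6 note above.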
      Signed-off-by: Alexander Duyck <alexanderduyck@fb.com>
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  4. 06 May 2022, 1 commit
  5. 25 April 2022, 1 commit
    • tcp: fix potential xmit stalls caused by TCP_NOTSENT_LOWAT · 4bfe744f
      Eric Dumazet authored
      I had this bug sitting for too long in my pile, it is time to fix it.
      
      Thanks to Doug Porter for reminding me of it!
      
      We had various attempts in the past, including commit
      0cbe6a8f ("tcp: remove SOCK_QUEUE_SHRUNK"),
      but the issue is that the TCP stack currently only generates
      EPOLLOUT from the input path, when tp->snd_una has advanced
      and skb(s) have been cleaned from the rtx queue.
      
      If a flow has a big RTT, and/or receives SACKs, it is possible
      that the notsent part (tp->write_seq - tp->snd_nxt) reaches 0
      and no more data can be sent until tp->snd_una finally advances.
      
      What is needed is to also check whether POLLOUT needs to be generated
      whenever tp->snd_nxt is advanced, from the output path.
      
      This bug triggers more often after an idle period, as
      we do not receive ACK for at least one RTT. tcp_notsent_lowat
      could be a fraction of what CWND and pacing rate would allow to
      send during this RTT.
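      The "notsent" accounting behind this is plain u32 sequence arithmetic;
      a hedged standalone sketch (function and parameter names are
      illustrative, not the kernel's):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Illustrative sketch of the TCP_NOTSENT_LOWAT test: the not-yet-sent
 * part of the write queue is write_seq - snd_nxt (u32 wraparound is
 * handled by unsigned arithmetic).  Per the fix above, POLLOUT must be
 * considered not only when snd_una advances (input path) but also when
 * snd_nxt advances (output path), since that too shrinks notsent. */
static bool tcp_notsent_below_lowat(uint32_t write_seq, uint32_t snd_nxt,
                                    uint32_t notsent_lowat)
{
    uint32_t notsent = write_seq - snd_nxt;

    return notsent < notsent_lowat;
}
```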
      
      In a followup patch, I will remove the bogus call
      to tcp_chrono_stop(sk, TCP_CHRONO_SNDBUF_LIMITED)
      from tcp_check_space(). The fact that we have decided to generate
      an EPOLLOUT does not mean the application has immediately
      refilled the transmit queue. This optimistic call
      might have been the reason the bug seemed not too serious.
      
      Tested:
      
      200 ms rtt, 1% packet loss, 32 MB tcp_rmem[2] and tcp_wmem[2]
      
      $ echo 500000 >/proc/sys/net/ipv4/tcp_notsent_lowat
      $ cat bench_rr.sh
      SUM=0
      for i in {1..10}
      do
       V=`netperf -H remote_host -l30 -t TCP_RR -- -r 10000000,10000 -o LOCAL_BYTES_SENT | egrep -v "MIGRATED|Bytes"`
       echo $V
       SUM=$(($SUM + $V))
      done
      echo SUM=$SUM
      
      Before patch:
      $ bench_rr.sh
      130000000
      80000000
      140000000
      140000000
      140000000
      140000000
      130000000
      40000000
      90000000
      110000000
      SUM=1140000000
      
      After patch:
      $ bench_rr.sh
      430000000
      590000000
      530000000
      450000000
      450000000
      350000000
      450000000
      490000000
      480000000
      460000000
      SUM=4680000000  # This is 410% of the value before the patch.
      
      Fixes: c9bee3b7 ("tcp: TCP_NOTSENT_LOWAT socket option")
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Reported-by: Doug Porter <dsp@fb.com>
      Cc: Soheil Hassas Yeganeh <soheil@google.com>
      Cc: Neal Cardwell <ncardwell@google.com>
      Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  6. 07 April 2022, 1 commit
  7. 22 March 2022, 1 commit
  8. 10 March 2022, 1 commit
    • tcp: adjust TSO packet sizes based on min_rtt · 65466904
      Eric Dumazet authored
      Back when tcp_tso_autosize() and TCP pacing were introduced,
      our focus was really to reduce burst sizes for long distance
      flows.
      
      The simple heuristic of using sk_pacing_rate/1024 has worked
      well, but can lead to too small packets for hosts in the same
      rack/cluster, when thousands of flows compete for the bottleneck.
      
      Neal Cardwell had the idea of making the TSO burst size
      a function of both sk_pacing_rate and tcp_min_rtt().
      
      Indeed, for local flows, sending bigger bursts is better
      to reduce cpu costs, as occasional losses can be repaired
      quite fast.
      
      This patch is based on an implementation Neal Cardwell
      wrote more than two years ago.
      bbr adjusts max_pacing_rate based on measured bandwidth,
      while cubic would overestimate max_pacing_rate.
      
      /proc/sys/net/ipv4/tcp_tso_rtt_log can be used to tune or disable
      this new feature, in logarithmic steps.
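      As a rough standalone sketch of the heuristic (simplified from the
      description above; the fixed >> 10 pacing shift and the 32-bit shift
      guard are assumptions, not a copy of kernel code): the burst budget
      starts from the classic rate/1024, then earns a bonus of
      gso_max_size >> (min_rtt >> tcp_tso_rtt_log), so only small-RTT
      flows get bigger bursts.

```c
#include <assert.h>
#include <stdint.h>

/* Simplified sketch of a tcp_tso_autosize()-style budget with the
 * min_rtt bonus.  pacing_rate is in bytes/sec, min_rtt_us in
 * microseconds; returns a TSO size in segments. */
static uint32_t tso_autosize(uint64_t pacing_rate, uint32_t min_rtt_us,
                             uint32_t gso_max_size, uint32_t mss,
                             uint32_t tso_rtt_log)
{
    uint64_t bytes = pacing_rate >> 10;  /* classic sk_pacing_rate/1024 */

    if (tso_rtt_log) {
        uint32_t r = min_rtt_us >> tso_rtt_log;  /* logarithmic steps */

        if (r < 32)  /* only small-RTT flows earn the bonus */
            bytes += (uint64_t)gso_max_size >> r;
    }
    if (bytes > gso_max_size)
        bytes = gso_max_size;
    return bytes >= mss ? (uint32_t)(bytes / mss) : 1;
}
```

      With the test setup below (20 MB/s pacing, 4K MSS), this reproduces
      the "20000000/1024/4096 -> 4 segments" arithmetic quoted in the
      changelog, while a sub-millisecond min_rtt earns the full bonus.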
      
      Tested:
      
      100Gbit NIC, two hosts in the same rack, 4K MTU.
      600 flows rate-limited to 20000000 bytes per second.
      
      Before patch: (TSO sizes would be limited to 20000000/1024/4096 -> 4 segments per TSO)
      
      ~# echo 0 >/proc/sys/net/ipv4/tcp_tso_rtt_log
      ~# nstat -n;perf stat ./super_netperf 600 -H otrv6 -l 20 -- -K dctcp -q 20000000;nstat|egrep "TcpInSegs|TcpOutSegs|TcpRetransSegs|Delivered"
        96005
      
       Performance counter stats for './super_netperf 600 -H otrv6 -l 20 -- -K dctcp -q 20000000':
      
               65,945.29 msec task-clock                #    2.845 CPUs utilized
               1,314,632      context-switches          # 19935.279 M/sec
                   5,292      cpu-migrations            #   80.249 M/sec
                 940,641      page-faults               # 14264.023 M/sec
         201,117,030,926      cycles                    # 3049769.216 GHz                   (83.45%)
          17,699,435,405      stalled-cycles-frontend   #    8.80% frontend cycles idle     (83.48%)
         136,584,015,071      stalled-cycles-backend    #   67.91% backend cycles idle      (83.44%)
          53,809,530,436      instructions              #    0.27  insn per cycle
                                                        #    2.54  stalled cycles per insn  (83.36%)
           9,062,315,523      branches                  # 137422329.563 M/sec               (83.22%)
             153,008,621      branch-misses             #    1.69% of all branches          (83.32%)
      
            23.182970846 seconds time elapsed
      
      TcpInSegs                       15648792           0.0
      TcpOutSegs                      58659110           0.0  # Average of 3.7 4K segments per TSO packet
      TcpExtTCPDelivered              58654791           0.0
      TcpExtTCPDeliveredCE            19                 0.0
      
      After patch:
      
      ~# echo 9 >/proc/sys/net/ipv4/tcp_tso_rtt_log
      ~# nstat -n;perf stat ./super_netperf 600 -H otrv6 -l 20 -- -K dctcp -q 20000000;nstat|egrep "TcpInSegs|TcpOutSegs|TcpRetransSegs|Delivered"
        96046
      
       Performance counter stats for './super_netperf 600 -H otrv6 -l 20 -- -K dctcp -q 20000000':
      
               48,982.58 msec task-clock                #    2.104 CPUs utilized
                 186,014      context-switches          # 3797.599 M/sec
                   3,109      cpu-migrations            #   63.472 M/sec
                 941,180      page-faults               # 19214.814 M/sec
         153,459,763,868      cycles                    # 3132982.807 GHz                   (83.56%)
          12,069,861,356      stalled-cycles-frontend   #    7.87% frontend cycles idle     (83.32%)
         120,485,917,953      stalled-cycles-backend    #   78.51% backend cycles idle      (83.24%)
          36,803,672,106      instructions              #    0.24  insn per cycle
                                                        #    3.27  stalled cycles per insn  (83.18%)
           5,947,266,275      branches                  # 121417383.427 M/sec               (83.64%)
              87,984,616      branch-misses             #    1.48% of all branches          (83.43%)
      
            23.281200256 seconds time elapsed
      
      TcpInSegs                       1434706            0.0
      TcpOutSegs                      58883378           0.0  # Average of 41 4K segments per TSO packet
      TcpExtTCPDelivered              58878971           0.0
      TcpExtTCPDeliveredCE            9664               0.0
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Reviewed-by: Neal Cardwell <ncardwell@google.com>
      Link: https://lore.kernel.org/r/20220309015757.2532973-1-eric.dumazet@gmail.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
  9. 03 March 2022, 1 commit
    • net: Add skb->mono_delivery_time to distinguish mono delivery_time from (rcv) timestamp · a1ac9c8a
      Martin KaFai Lau authored
      skb->tstamp was first used as the (rcv) timestamp.
      The major usage is to report it to the user (e.g. SO_TIMESTAMP).
      
      Later, skb->tstamp is also set as the (future) delivery_time (e.g. EDT in TCP)
      during egress and used by the qdisc (e.g. sch_fq) to make decision on when
      the skb can be passed to the dev.
      
      Currently, there is no way to tell whether skb->tstamp holds the (rcv)
      timestamp or the delivery_time, so it is always reset to 0 whenever
      forwarded between egress and ingress.
      
      While it makes sense to always clear the (rcv) timestamp in skb->tstamp
      to avoid confusing sch_fq that expects the delivery_time, it is a
      performance issue [0] to clear the delivery_time if the skb finally
      egress to a fq@phy-dev.  For example, when forwarding from egress to
      ingress and then finally back to egress:
      
                  tcp-sender => veth@netns => veth@hostns => fq@eth0@hostns
                                        ^              ^
                                        reset          reset
      
      This patch adds one bit skb->mono_delivery_time to flag the skb->tstamp
      is storing the mono delivery_time (EDT) instead of the (rcv) timestamp.
      
      The current use case is to keep the TCP mono delivery_time (EDT) and
      to be used with sch_fq.  A later patch will also allow tc-bpf@ingress
      to read and change the mono delivery_time.
      
      In the future, another bit (e.g. skb->user_delivery_time) can be added
      for the SCM_TXTIME where the clock base is tracked by sk->sk_clockid.
      
      [ This patch is a prep work.  The following patches will
        get the other parts of the stack ready first.  Then another patch
        after that will finally set the skb->mono_delivery_time. ]
      
      skb_set_delivery_time() function is added.  It is used by the tcp_output.c
      and during ip[6] fragmentation to assign the delivery_time to
      the skb->tstamp and also set the skb->mono_delivery_time.
      
      A note on the change in ip_send_unicast_reply() in ip_output.c.
      It is only used by TCP to send reset/ack out of a ctl_sk.
      Like the new skb_set_delivery_time(), this patch sets
      the skb->mono_delivery_time to 0 for now as a place
      holder.  It will be enabled in a later patch.
      A similar case in tcp_ipv6 can be done with
      skb_set_delivery_time() in tcp_v6_send_response().
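      A minimal standalone sketch of the bit's semantics (a toy struct, not
      the kernel's sk_buff; the clear-on-forward helper is an assumption
      based on the description above):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Toy model: one extra bit records whether tstamp holds a mono
 * delivery_time (EDT) or an rcv timestamp. */
struct sk_buff_sketch {
    uint64_t tstamp;
    bool mono_delivery_time;
};

static void skb_set_delivery_time(struct sk_buff_sketch *skb,
                                  uint64_t kt, bool mono)
{
    skb->tstamp = kt;
    skb->mono_delivery_time = mono;
}

/* On forward, only rcv timestamps need clearing; a mono delivery
 * time can survive to the egress qdisc (e.g. sch_fq), avoiding the
 * performance issue described in [0]. */
static void skb_clear_tstamp(struct sk_buff_sketch *skb)
{
    if (skb->mono_delivery_time)
        return;
    skb->tstamp = 0;
}
```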
      
      [0] (slide 22): https://linuxplumbersconf.org/event/11/contributions/953/attachments/867/1658/LPC_2021_BPF_Datapath_Extensions.pdf
      Signed-off-by: Martin KaFai Lau <kafai@fb.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  10. 31 January 2022, 1 commit
  11. 26 January 2022, 1 commit
  12. 02 December 2021, 1 commit
  13. 29 November 2021, 1 commit
  14. 16 November 2021, 1 commit
  15. 04 November 2021, 1 commit
  16. 03 November 2021, 2 commits
    • net: avoid double accounting for pure zerocopy skbs · 9b65b17d
      Talal Ahmad authored
      Track skbs containing only zerocopy data and avoid charging them to
      kernel memory to correctly account the memory utilization for
      msg_zerocopy. All of the data in such skbs is held in user pages which
      are already accounted to user. Before this change, they are charged
      again in kernel in __zerocopy_sg_from_iter. The charging in kernel is
      excessive because data is not being copied into skb frags. This
      excessive charging can lead to kernel going into memory pressure
      state which impacts all sockets in the system adversely. Mark pure
      zerocopy skbs with a SKBFL_PURE_ZEROCOPY flag and remove
      charge/uncharge for data in such skbs.
      
      Initially, an skb is marked pure zerocopy when it is empty and in
      zerocopy path. skb can then change from a pure zerocopy skb to mixed
      data skb (zerocopy and copy data) if it is at tail of write queue and
      there is room available in it and non-zerocopy data is being sent in
      the next sendmsg call. At this time sk_mem_charge is done for the pure
      zerocopied data and the pure zerocopy flag is unmarked. We found that
      this happens very rarely on workloads that pass MSG_ZEROCOPY.
      
      A pure zerocopy skb can later be coalesced into normal skb if they are
      next to each other in queue but this patch prevents coalescing from
      happening. This avoids complexity of charging when skb downgrades from
      pure zerocopy to mixed. This is also rare.
      
      In sk_wmem_free_skb, if it is a pure zerocopy skb, an sk_mem_uncharge
      for SKB_TRUESIZE(skb_end_offset(skb)) is done for sk_mem_charge in
      tcp_skb_entail for an skb without data.
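      The accounting rule above reduces to: charge the skb overhead always,
      charge the data only when it was actually copied into kernel frags.
      A toy sketch (the helper name and its parameters are illustrative):

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Illustrative only: how much of an skb to charge to kernel socket
 * memory.  Pure-zerocopy data lives in user pages that are already
 * accounted to the user, so only the skb head overhead (cf.
 * SKB_TRUESIZE(skb_end_offset(skb)) above) is charged. */
static size_t sk_mem_to_charge(size_t data_len, size_t head_truesize,
                               bool pure_zerocopy)
{
    return pure_zerocopy ? head_truesize : head_truesize + data_len;
}
```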
      
      Testing with the msg_zerocopy.c benchmark between two hosts (100G NICs)
      with zerocopy showed that before this patch the 'sock' variable in
      memory.stat for cgroup2 that tracks sum of sk_forward_alloc,
      sk_rmem_alloc and sk_wmem_queued is around 1822720 and with this
      change it is 0. This is due to no charge to sk_forward_alloc for
      zerocopy data and shows memory utilization for kernel is lowered.
      
      With this commit we no longer see the warning triggered by the
      previous submission, which resulted in the revert commit 84882cf7.
      Signed-off-by: Talal Ahmad <talalahmad@google.com>
      Acked-by: Arjun Roy <arjunroy@google.com>
      Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
      Signed-off-by: Willem de Bruijn <willemb@google.com>
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: add and use skb_unclone_keeptruesize() helper · c4777efa
      Eric Dumazet authored
      While commit 097b9146 ("net: fix up truesize of cloned
      skb in skb_prepare_for_shift()") fixed immediate issues found
      when KFENCE was enabled/tested, there are still similar issues,
      when tcp_trim_head() hits KFENCE while the master skb
      is cloned.
      
      This happens under heavy networking TX workloads,
      when the TX completion might be delayed after incoming ACK.
      
      This patch fixes the WARNING in sk_stream_kill_queues
      when sk->sk_mem_queued/sk->sk_forward_alloc are not zero.
      
      Fixes: d3fb45f3 ("mm, kfence: insert KFENCE hooks for SLAB")
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Acked-by: Marco Elver <elver@google.com>
      Link: https://lore.kernel.org/r/20211102004555.1359210-1-eric.dumazet@gmail.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
  17. 02 November 2021, 3 commits
    • Revert "net: avoid double accounting for pure zerocopy skbs" · 84882cf7
      Jakub Kicinski authored
      This reverts commit f1a456f8.
      
        WARNING: CPU: 1 PID: 6819 at net/core/skbuff.c:5429 skb_try_coalesce+0x78b/0x7e0
        CPU: 1 PID: 6819 Comm: xxxxxxx Kdump: loaded Tainted: G S                5.15.0-04194-gd852503f7711 #16
        RIP: 0010:skb_try_coalesce+0x78b/0x7e0
        Code: e8 2a bf 41 ff 44 8b b3 bc 00 00 00 48 8b 7c 24 30 e8 19 c0 41 ff 44 89 f0 48 03 83 c0 00 00 00 48 89 44 24 40 e9 47 fb ff ff <0f> 0b e9 ca fc ff ff 4c 8d 70 ff 48 83 c0 07 48 89 44 24 38 e9 61
        RSP: 0018:ffff88881f449688 EFLAGS: 00010282
        RAX: 00000000fffffe96 RBX: ffff8881566e4460 RCX: ffffffff82079f7e
        RDX: 0000000000000003 RSI: dffffc0000000000 RDI: ffff8881566e47b0
        RBP: ffff8881566e46e0 R08: ffffed102619235d R09: ffffed102619235d
        R10: ffff888130c91ae3 R11: ffffed102619235c R12: ffff88881f4498a0
        R13: 0000000000000056 R14: 0000000000000009 R15: ffff888130c91ac0
        FS:  00007fec2cbb9700(0000) GS:ffff88881f440000(0000) knlGS:0000000000000000
        CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
        CR2: 00007fec1b060d80 CR3: 00000003acf94005 CR4: 00000000003706e0
        DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
        DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
        Call Trace:
         <IRQ>
         tcp_try_coalesce+0xeb/0x290
         ? tcp_parse_options+0x610/0x610
         ? mark_held_locks+0x79/0xa0
         tcp_queue_rcv+0x69/0x2f0
         tcp_rcv_established+0xa49/0xd40
         ? tcp_data_queue+0x18a0/0x18a0
         tcp_v6_do_rcv+0x1c9/0x880
         ? rt6_mtu_change_route+0x100/0x100
         tcp_v6_rcv+0x1624/0x1830
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • net: avoid double accounting for pure zerocopy skbs · f1a456f8
      Talal Ahmad authored
      Track skbs with only zerocopy data and avoid charging them to kernel
      memory to correctly account the memory utilization for msg_zerocopy.
      All of the data in such skbs is held in user pages which are already
      accounted to user. Before this change, they are charged again in
      kernel in __zerocopy_sg_from_iter. The charging in kernel is
      excessive because data is not being copied into skb frags. This
      excessive charging can lead to kernel going into memory pressure
      state which impacts all sockets in the system adversely. Mark pure
      zerocopy skbs with a SKBFL_PURE_ZEROCOPY flag and remove
      charge/uncharge for data in such skbs.
      
      Initially, an skb is marked pure zerocopy when it is empty and in
      zerocopy path. skb can then change from a pure zerocopy skb to mixed
      data skb (zerocopy and copy data) if it is at tail of write queue and
      there is room available in it and non-zerocopy data is being sent in
      the next sendmsg call. At this time sk_mem_charge is done for the pure
      zerocopied data and the pure zerocopy flag is unmarked. We found that
      this happens very rarely on workloads that pass MSG_ZEROCOPY.
      
      A pure zerocopy skb can later be coalesced into normal skb if they are
      next to each other in queue but this patch prevents coalescing from
      happening. This avoids complexity of charging when skb downgrades from
      pure zerocopy to mixed. This is also rare.
      
      In sk_wmem_free_skb, if it is a pure zerocopy skb, an sk_mem_uncharge
      for SKB_TRUESIZE(MAX_TCP_HEADER) is done for sk_mem_charge in
      tcp_skb_entail for an skb without data.
      
      Testing with the msg_zerocopy.c benchmark between two hosts (100G NICs)
      with zerocopy showed that before this patch the 'sock' variable in
      memory.stat for cgroup2 that tracks sum of sk_forward_alloc,
      sk_rmem_alloc and sk_wmem_queued is around 1822720 and with this
      change it is 0. This is due to no charge to sk_forward_alloc for
      zerocopy data and shows memory utilization for kernel is lowered.
      Signed-off-by: Talal Ahmad <talalahmad@google.com>
      Acked-by: Arjun Roy <arjunroy@google.com>
      Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
      Signed-off-by: Willem de Bruijn <willemb@google.com>
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • tcp: rename sk_wmem_free_skb · 03271f3a
      Talal Ahmad authored
      sk_wmem_free_skb() is only used by TCP.
      
      Rename it to make this clear, and move its declaration to
      include/net/tcp.h
      Signed-off-by: Talal Ahmad <talalahmad@google.com>
      Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
      Acked-by: Arjun Roy <arjunroy@google.com>
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
  18. 28 October 2021, 4 commits
  19. 26 October 2021, 1 commit
  20. 30 September 2021, 1 commit
  21. 24 September 2021, 1 commit
  22. 18 August 2021, 1 commit
  23. 09 July 2021, 1 commit
    • ipv6: tcp: drop silly ICMPv6 packet too big messages · c7bb4b89
      Eric Dumazet authored
      While the TCP stack scales reasonably well, there is still one part
      that can be used to DDoS it.
      
      IPv6 Packet too big messages have to lookup/insert a new route,
      and if abused by attackers, can easily put hosts under high stress,
      with many cpus contending on a spinlock while one is stuck in fib6_run_gc()
      
      ip6_protocol_deliver_rcu()
       icmpv6_rcv()
        icmpv6_notify()
         tcp_v6_err()
          tcp_v6_mtu_reduced()
           inet6_csk_update_pmtu()
            ip6_rt_update_pmtu()
             __ip6_rt_update_pmtu()
              ip6_rt_cache_alloc()
               ip6_dst_alloc()
                dst_alloc()
                 ip6_dst_gc()
                  fib6_run_gc()
                   spin_lock_bh() ...
      
      Some of our servers have been hit by malicious ICMPv6 packets
      trying to _increase_ the MTU/MSS of TCP flows.
      
      We believe these ICMPv6 packets are a result of a bug in one ISP stack,
      since they were blindly sent back for _every_ (small) packet sent to them.
      
      These packets are for one TCP flow:
      09:24:36.266491 IP6 Addr1 > Victim ICMP6, packet too big, mtu 1460, length 1240
      09:24:36.266509 IP6 Addr1 > Victim ICMP6, packet too big, mtu 1460, length 1240
      09:24:36.316688 IP6 Addr1 > Victim ICMP6, packet too big, mtu 1460, length 1240
      09:24:36.316704 IP6 Addr1 > Victim ICMP6, packet too big, mtu 1460, length 1240
      09:24:36.608151 IP6 Addr1 > Victim ICMP6, packet too big, mtu 1460, length 1240
      
      The TCP stack can filter out some silly requests:
      
      1) MTU below IPV6_MIN_MTU can be filtered early in tcp_v6_err()
      2) tcp_v6_mtu_reduced() can drop requests trying to increase current MSS.
      
      These tests happen before the IPv6 routing stack is entered, thus
      removing the potential contention and route exhaustion.
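      Both filters are cheap comparisons done before touching the routing
      stack. A hedged sketch (the function name, and folding the MSS check
      into a plain MTU comparison, are assumptions for illustration):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define IPV6_MIN_MTU 1280u

/* Illustrative sketch of the two early drops described above:
 * 1) an advertised MTU below IPV6_MIN_MTU is bogus;
 * 2) an advertised MTU that would not shrink the current value
 *    (a "packet too big" trying to *increase* MTU/MSS) is useless. */
static bool ptb_worth_processing(uint32_t icmp_mtu, uint32_t current_mtu)
{
    if (icmp_mtu < IPV6_MIN_MTU)
        return false;
    if (icmp_mtu >= current_mtu)
        return false;
    return true;
}
```

      The malicious packets in the capture above (mtu 1460 against flows
      already using a smaller MSS) would be rejected by the second check.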
      
      Note that the IPv6 stack was performing these checks, but too late
      (i.e. after the route has been added, and after the potential
      garbage collect war).
      
      v2: fix typo caught by Martin, thanks !
      v3: exports tcp_mtu_to_mss(), caught by David, thanks !
      
      Fixes: 1da177e4 ("Linux-2.6.12-rc2")
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Reviewed-by: Maciej Żenczykowski <maze@google.com>
      Cc: Martin KaFai Lau <kafai@fb.com>
      Acked-by: Martin KaFai Lau <kafai@fb.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  24. 12 March 2021, 2 commits
    • tcp: remove obsolete check in __tcp_retransmit_skb() · ac3959fd
      Eric Dumazet authored
      TSQ provides a nice way to avoid bufferbloat on individual sockets,
      including retransmit packets. We can get rid of the old
      heuristic:
      
      	/* Do not sent more than we queued. 1/4 is reserved for possible
      	 * copying overhead: fragmentation, tunneling, mangling etc.
      	 */
      	if (refcount_read(&sk->sk_wmem_alloc) >
      	    min_t(u32, sk->sk_wmem_queued + (sk->sk_wmem_queued >> 2),
      		  sk->sk_sndbuf))
      		return -EAGAIN;
      
      This heuristic was giving false positives according to Jakub,
      whenever TX completions are delayed beyond the RTT. (Ack packets
      are processed by the TCP stack before clones are orphaned/freed.)
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Reported-by: Jakub Kicinski <kuba@kernel.org>
      Cc: Neal Cardwell <ncardwell@google.com>
      Cc: Yuchung Cheng <ycheng@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • tcp: plug skb_still_in_host_queue() to TSQ · f4dae54e
      Eric Dumazet authored
      Jakub and Neil reported an increase in RTO timers whenever
      TX completions are delayed a bit more (by increasing
      NIC TX coalescing parameters).
      
      The main issue is that the TCP stack has logic preventing a packet
      from being retransmitted if the prior clone has not yet been
      orphaned or freed.
      
      This logic came with commit 1f3279ae ("tcp: avoid
      retransmits of TCP packets hanging in host queues")
      
      Thankfully, in the case skb_still_in_host_queue() detects
      the initial clone is still in flight, it can use TSQ logic
      that will eventually retry later, at the moment the clone
      is freed or orphaned.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Reported-by: Neil Spring <ntspring@fb.com>
      Reported-by: Jakub Kicinski <kuba@kernel.org>
      Cc: Neal Cardwell <ncardwell@google.com>
      Cc: Yuchung Cheng <ycheng@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  25. 24 January 2021, 1 commit
  26. 19 January 2021, 1 commit
  27. 14 January 2021, 1 commit
  28. 10 December 2020, 1 commit
    • tcp: fix cwnd-limited bug for TSO deferral where we send nothing · 299bcb55
      Neal Cardwell authored
      When cwnd is not a multiple of the TSO skb size of N*MSS, we can get
      into persistent scenarios where we have the following sequence:
      
      (1) ACK for full-sized skb of N*MSS arrives
        -> tcp_write_xmit() transmit full-sized skb with N*MSS
        -> move pacing release time forward
        -> exit tcp_write_xmit() because pacing time is in the future
      
      (2) TSQ callback or TCP internal pacing timer fires
        -> try to transmit next skb, but TSO deferral finds remainder of
           available cwnd is not big enough to trigger an immediate send
           now, so we defer sending until the next ACK.
      
      (3) repeat...
      
      So we can get into a case where we never mark ourselves as
      cwnd-limited for many seconds at a time, even with
      bulk/infinite-backlog senders, because:
      
      o In case (1) above, every time in tcp_write_xmit() we have enough
      cwnd to send a full-sized skb, we are not fully using the cwnd
      (because cwnd is not a multiple of the TSO skb size). So every time we
      send data, we are not cwnd limited, and so in the cwnd-limited
      tracking code in tcp_cwnd_validate() we mark ourselves as not
      cwnd-limited.
      
      o In case (2) above, every time in tcp_write_xmit() that we try to
      transmit the "remainder" of the cwnd but defer, we set the local
      variable is_cwnd_limited to true, but we do not send any packets, so
      sent_pkts is zero, so we don't call the cwnd-limited logic to update
      tp->is_cwnd_limited.
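      The fix boils down to also running the cwnd-limited bookkeeping when a
      TSO deferral happened but nothing was sent. A sketch of the corrected
      condition (the helper name is illustrative, not kernel code):

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative: should tcp_write_xmit() run the cwnd-limited
 * tracking (tcp_cwnd_validate())?  Per case (2) above, checking
 * sent_pkts alone misses TSO deferrals that sent nothing even
 * though the flow was in fact cwnd-limited. */
static bool should_run_cwnd_validate(int sent_pkts, bool is_cwnd_limited)
{
    return sent_pkts > 0 || is_cwnd_limited;
}
```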
      
      Fixes: ca8a2263 ("tcp: make cwnd-limited checks measurement-based, and gentler")
      Reported-by: Ingemar Johansson <ingemar.s.johansson@ericsson.com>
      Signed-off-by: Neal Cardwell <ncardwell@google.com>
      Signed-off-by: Yuchung Cheng <ycheng@google.com>
      Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Link: https://lore.kernel.org/r/20201209035759.1225145-1-ncardwell.kernel@gmail.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
  29. 21 November 2020, 1 commit
  30. 08 November 2020, 1 commit
  31. 05 November 2020, 1 commit
    • tcp: propagate MPTCP skb extensions on xmit splits · 5a369ca6
      Paolo Abeni authored
      When the TCP stack splits a packet on the write queue, the tail
      half currently loses the associated skb extensions, and will not
      carry the DSM on the wire.
      
      The above does not cause functional problems and is allowed by
      the RFC, but interacts badly with GRO and RX coalescing, as possible
      candidates for aggregation will carry different TCP options.
      
      This change tries to improve the MPTCP behavior, propagating the
      skb extensions on split.
      
      Additionally, we must prevent the MPTCP stack from updating the
      mapping after the split occurs: that would both violate the RFC and
      fool the reader.
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
      Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
  32. 01 October 2020, 1 commit