1. 31 December 2019 (1 commit)
    • tcp: Fix highest_sack and highest_sack_seq · 85369750
      Committed by Cambda Zhu
      Commit 50895b9d ("tcp: highest_sack fix") removed the logic that
      set tp->highest_sack to the head of the send queue. That logic was
      error prone, but it had a purpose. Until the pointer to the highest
      SACKed skb is replaced by a sequence number, tp->highest_sack must
      be set to NULL when there is no skb after the last SACKed one, and
      then replaced with the real skb when a new skb is inserted into the
      rtx queue, because NULL means the highest SACKed seq is tp->snd_nxt.
      If tp->highest_sack is NULL and new data is sent, the next ACK
      carrying a SACK option will increase tp->reordering unexpectedly.
      
      This patch sets tp->highest_sack to the tail of the rtx queue when
      it is NULL and new data is sent. The rule that highest_sack is
      otherwise maintained only by SACK processing is kept; this is the
      single exception.
      
      Fixes: 50895b9d ("tcp: highest_sack fix")
      Signed-off-by: Cambda Zhu <cambda@linux.alibaba.com>
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
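      A minimal sketch of the idea, not the verbatim upstream diff; it
      assumes the code runs where a freshly sent skb has just been moved
      onto the rtx queue:
      
        /* Sketch only: NULL stands for "highest SACKed seq is snd_nxt",
         * so once new data is queued behind that point, anchor the
         * pointer to the newly queued skb to keep SACK processing from
         * mis-computing reordering.
         */
        if (tp->highest_sack == NULL)
                tp->highest_sack = skb;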
  2. 18 December 2019 (1 commit)
  3. 14 December 2019 (2 commits)
  4. 07 December 2019 (1 commit)
  5. 08 November 2019 (1 commit)
  6. 14 October 2019 (5 commits)
  7. 12 September 2019 (1 commit)
    • tcp: force a PSH flag on TSO packets · 051ba674
      Committed by Eric Dumazet
      When TCP sends a TSO packet, adding a PSH flag to it
      reduces the sojourn time of the GRO packet in GRO receivers.
      
      This is particularly the case under pressure, since RX queues
      receive packets for many concurrent flows.
      
      A sender can give a hint to GRO engines about when it is
      appropriate to flush a super-packet, especially when pacing
      is in the picture, since the next packet is probably delayed
      by about one ms.
      
      Having fewer packets in the GRO engine reduces the chance
      of LRU eviction or inflated RTT, and reduces GRO cost.
      
      We found recently that we must not set the PSH flag on
      individual full-size MSS segments [1] :
      
       Under pressure (CWR state), we are better off letting the
       packet sit for a small delay (depending on NAPI logic) so
       that the ACK packet is delayed, and thus the next packet we
       send is also delayed a bit. Eventually the bottleneck queue
       can be drained. DCTCP flows with CWND=1 have demonstrated
       the issue.
      
      This patch allows the aggregate traffic to be slowed down
      without involving high-resolution timers on senders and/or
      receivers.
      
      It has been used at Google for about four years,
      and has been discussed at various networking conferences.
      
      [1] segments smaller than MSS already have PSH flag set
          by tcp_sendmsg() / tcp_mark_push(), unless MSG_MORE
          has been requested by the user.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Cc: Soheil Hassas Yeganeh <soheil@google.com>
      Cc: Neal Cardwell <ncardwell@google.com>
      Cc: Yuchung Cheng <ycheng@google.com>
      Cc: Daniel Borkmann <daniel@iogearbox.net>
      Cc: Tariq Toukan <tariqt@mellanox.com>
      Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
      Acked-by: Neal Cardwell <ncardwell@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
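      Conceptually the change amounts to marking multi-segment skbs with
      PSH before they are handed to the TSO engine. A hedged sketch, not
      the exact upstream diff, using only existing helpers:
      
        /* Sketch only: a packet that will leave as several MSS-sized
         * segments is a natural GRO flush point, so let the receiver
         * know via PSH.
         */
        if (tcp_skb_pcount(skb) > 1)
                TCP_SKB_CB(skb)->tcp_flags |= TCPHDR_PSH;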
  8. 29 August 2019 (1 commit)
  9. 09 August 2019 (1 commit)
    • net/tls: prevent skb_orphan() from leaking TLS plain text with offload · 41477662
      Committed by Jakub Kicinski
      sk_validate_xmit_skb() and drivers depend on the sk member of
      struct sk_buff to identify segments requiring encryption.
      Any operation which removes or does not preserve the original TLS
      socket, such as skb_orphan() or skb_clone(), will cause clear-text
      leaks.
      
      Make the TCP socket underlying an offloaded TLS connection
      mark all skbs as decrypted if TLS TX is in offload mode.
      Then, in sk_validate_xmit_skb(), catch skbs which have no socket
      (or a socket with no validation hook) but have the decrypted flag set.
      
      Note that CONFIG_SOCK_VALIDATE_XMIT, CONFIG_TLS_DEVICE and
      sk->sk_validate_xmit_skb are somewhat interchangeable right now;
      they all imply TLS offload. The new checks are guarded by
      CONFIG_TLS_DEVICE because that is the option guarding the
      sk_buff->decrypted member.
      
      A second, smaller issue with orphaning is that it breaks
      the guarantee that packets will be delivered to device
      queues in order. All TLS offload drivers depend on that
      scheduling property. This means skb_orphan_partial()'s
      trick of preserving partial socket references will cause
      issues in the drivers. We need a full orphan, and as a
      result netem delay/throttling will cause all TLS offload
      skbs to be dropped.
      
      Reusing the sk_buff->decrypted flag also protects from
      leaking clear text when an incoming, decrypted skb is redirected
      (e.g. by TC).
      
      See commit 0608c69c ("bpf: sk_msg, sock{map|hash} redirect
      through ULP") for justification why the internal flag is safe.
      The only location through which the flag could leak in is
      tcp_bpf_sendmsg(), which is taken care of by clearing the
      previously unused bit.
      
      v2:
       - remove superfluous decrypted mark copy (Willem);
       - remove the stale doc entry (Boris);
       - rely entirely on EOR marking to prevent coalescing (Boris);
       - use an internal sendpages flag instead of marking the socket
         (Boris).
      v3 (Willem):
       - reorganize the can_skb_orphan_partial() condition;
       - fix the flag leak-in through tcp_bpf_sendmsg.
      Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
      Acked-by: Willem de Bruijn <willemb@google.com>
      Reviewed-by: Boris Pismenny <borisp@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
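      The validation-side guard can be pictured roughly as below. This is
      a hedged sketch modeled on the validate-xmit path, not the exact
      upstream code; it only uses the fields the message mentions
      (sk_buff->decrypted, sk->sk_validate_xmit_skb):
      
        #ifdef CONFIG_TLS_DEVICE
                /* Sketch only: an skb marked decrypted that has lost its
                 * TLS socket, or reached one without a validation hook,
                 * must not go out as clear text.
                 */
                if (unlikely(skb->decrypted &&
                             (!skb->sk || !skb->sk->sk_validate_xmit_skb))) {
                        kfree_skb(skb);
                        return NULL;
                }
        #endif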
  10. 31 July 2019 (1 commit)
  11. 22 July 2019 (1 commit)
  12. 22 June 2019 (1 commit)
  13. 16 June 2019 (3 commits)
    • tcp: add tcp_min_snd_mss sysctl · 5f3e2bf0
      Committed by Eric Dumazet
      Some TCP peers announce a very small MSS option in their SYN and/or
      SYN/ACK messages.
      
      This forces the stack to send packets with a very high network/cpu
      overhead.
      
      Linux has enforced a minimum value of 48. Since this value includes
      the size of TCP options, and options can consume up to 40 bytes,
      each segment may carry as little as 8 bytes of payload.
      
      In some cases, it can be useful to increase this minimum
      to a saner value.
      
      We still leave the default at 48 (TCP_MIN_SND_MSS) for
      compatibility reasons.
      
      Note that the TCP_MAXSEG socket option enforces a minimum value
      of TCP_MIN_MSS. David Miller increased this minimum from 64 to 88
      in commit c39508d6 ("tcp: Make TCP_MAXSEG minimum more correct.").
      
      We might in the future merge TCP_MIN_SND_MSS and TCP_MIN_MSS.
      
      CVE-2019-11479 -- tcp mss hardcoded to 48
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Suggested-by: Jonathan Looney <jtl@netflix.com>
      Acked-by: Neal Cardwell <ncardwell@google.com>
      Cc: Yuchung Cheng <ycheng@google.com>
      Cc: Tyler Hicks <tyhicks@canonical.com>
      Cc: Bruce Curtis <brucec@netflix.com>
      Cc: Jonathan Lemon <jonathan.lemon@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
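      For administrators, the knob is exposed as net.ipv4.tcp_min_snd_mss.
      A small hedged example of raising it from a C program; the value 536
      is only an illustrative choice, and writing the sysctl needs root:
      
        #include <stdio.h>
        
        int main(void)
        {
                FILE *f = fopen("/proc/sys/net/ipv4/tcp_min_snd_mss", "w");
        
                if (!f) {
                        perror("tcp_min_snd_mss");
                        return 1;
                }
                fprintf(f, "536\n");    /* illustrative value */
                return fclose(f) ? 1 : 0;
        }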
    • tcp: tcp_fragment() should apply sane memory limits · f070ef2a
      Committed by Eric Dumazet
      Jonathan Looney reported that a malicious peer can force a sender
      to fragment its retransmit queue into tiny skbs, inflating memory
      usage and/or overflowing 32-bit counters.
      
      TCP allows an application to queue up to sk_sndbuf bytes,
      so we need to give some allowance for non-malicious splitting
      of the retransmit queue.
      
      A new SNMP counter is added to monitor how many times TCP
      refused to split an skb because the allowance was exceeded.
      
      Note that this counter might increase in cases where applications
      use the SO_SNDBUF socket option to lower sk_sndbuf.
      
      CVE-2019-11478 : tcp_fragment, prevent fragmenting a packet when the
      	socket is already using more than half the allowed space
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Reported-by: Jonathan Looney <jtl@netflix.com>
      Acked-by: Neal Cardwell <ncardwell@google.com>
      Acked-by: Yuchung Cheng <ycheng@google.com>
      Reviewed-by: Tyler Hicks <tyhicks@canonical.com>
      Cc: Bruce Curtis <brucec@netflix.com>
      Cc: Jonathan Lemon <jonathan.lemon@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
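      The guard can be sketched as below. This is a hedged reconstruction,
      not the verbatim patch, and the SNMP counter name is an assumption
      based on the description above:
      
        /* Sketch only: refuse to split when the socket already holds far
         * more queued bytes than its send buffer allows, and account the
         * event.
         */
        if (unlikely((sk->sk_wmem_queued >> 1) > sk->sk_sndbuf)) {
                NET_INC_STATS(sock_net(sk), LINUX_MIB_TCPWQUEUETOOBIG);
                return -ENOMEM;
        }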
    • tcp: limit payload size of sacked skbs · 3b4929f6
      Committed by Eric Dumazet
      Jonathan Looney reported that TCP can trigger the following crash
      in tcp_shifted_skb() :
      
      	BUG_ON(tcp_skb_pcount(skb) < pcount);
      
      This can happen if the remote peer has advertised the smallest
      MSS that Linux TCP accepts: 48.
      
      An skb can hold 17 fragments, and each fragment can hold 32KB
      on x86, or 64KB on PowerPC.
      
      This means that the 16-bit width of TCP_SKB_CB(skb)->tcp_gso_segs
      can overflow.
      
      Note that tcp_sendmsg() builds skbs with less than 64KB
      of payload, so this problem needs SACK to be enabled.
      SACK blocks allow TCP to coalesce multiple skbs in the retransmit
      queue, thus filling the 17 fragments to maximum capacity.
      
      CVE-2019-11477 -- u16 overflow of TCP_SKB_CB(skb)->tcp_gso_segs
      
      Fixes: 832d11c5 ("tcp: Try to restore large SKBs while SACK processing")
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Reported-by: Jonathan Looney <jtl@netflix.com>
      Acked-by: Neal Cardwell <ncardwell@google.com>
      Reviewed-by: Tyler Hicks <tyhicks@canonical.com>
      Cc: Yuchung Cheng <ycheng@google.com>
      Cc: Bruce Curtis <brucec@netflix.com>
      Cc: Jonathan Lemon <jonathan.lemon@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
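      For scale: 17 fragments of 64KB hold roughly 1,114,112 bytes of
      payload in a single skb; at the 8 bytes of payload per segment that
      a 48-byte MSS allows, that is about 139,264 segments, far beyond the
      65,535 that a 16-bit tcp_gso_segs field can represent.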
  14. 13 June 2019 (1 commit)
    • tcp: add optional per socket transmit delay · a842fe14
      Committed by Eric Dumazet
      Adding delays to TCP flows is crucial for studying behavior
      of TCP stacks, including congestion control modules.
      
      Linux offers the netem module, but it has impractical constraints:
      - It needs root access to change the qdisc.
      - It is hard to set up on egress if combined with a non-trivial qdisc like FQ.
      - It applies a single delay to all flows.
      
      EDT (Earliest Departure Time) adoption in the TCP stack allows us
      to enable a per-socket delay at a very small cost.
      
      Networking tools can now establish thousands of flows, each of them
      with a different delay, simulating real world conditions.
      
      This requires the FQ packet scheduler or an EDT-enabled NIC.
      
      This patch adds the TCP_TX_DELAY socket option, which sets a delay
      in usec units.
      
        unsigned int tx_delay = 10000; /* 10 msec */
      
        setsockopt(fd, SOL_TCP, TCP_TX_DELAY, &tx_delay, sizeof(tx_delay));
      
      Note that FQ packet scheduler limits might need some tweaking :
      
      man tc-fq
      
      PARAMETERS
         limit
             Hard  limit  on  the  real  queue  size. When this limit is
             reached, new packets are dropped. If the value is  lowered,
             packets  are  dropped so that the new limit is met. Default
             is 10000 packets.
      
         flow_limit
             Hard limit on the maximum  number  of  packets  queued  per
             flow.  Default value is 100.
      
      Use of the TCP_TX_DELAY option will increase the number of skbs in
      the FQ qdisc, so packets would be dropped if either of the limits
      above is hit.
      
      Use of a jump label makes this support free at run time for hosts
      that never use the option.
      
      Also note that TSQ (TCP Small Queues) limits are slightly changed
      with this patch: we need to account for the fact that artificially
      delayed skbs won't stop us from providing more skbs to feed the pipe
      (netem uses skb_orphan_partial() for this purpose, but FQ cannot use
      this trick).
      
      Because of that, using big delays might very well trigger
      old bugs in the TSO auto-defer logic and/or sndbuf-limited detection.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
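      A hedged, self-contained usage sketch expanding on the snippet above.
      The address and port are placeholders, and the fallback define is
      only needed when the libc headers predate the option:
      
        #include <arpa/inet.h>
        #include <netinet/in.h>
        #include <netinet/tcp.h>
        #include <stdio.h>
        #include <string.h>
        #include <sys/socket.h>
        #include <unistd.h>
        
        #ifndef TCP_TX_DELAY
        #define TCP_TX_DELAY 37         /* from include/uapi/linux/tcp.h */
        #endif
        
        int main(void)
        {
                unsigned int tx_delay = 10000;  /* usec, i.e. 10 msec */
                struct sockaddr_in addr;
                int fd = socket(AF_INET, SOCK_STREAM, 0);
        
                if (fd < 0 || setsockopt(fd, IPPROTO_TCP, TCP_TX_DELAY,
                                         &tx_delay, sizeof(tx_delay)) < 0) {
                        perror("TCP_TX_DELAY");
                        return 1;
                }
                memset(&addr, 0, sizeof(addr));
                addr.sin_family = AF_INET;
                addr.sin_port = htons(5001);                    /* placeholder */
                addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);  /* placeholder */
                if (connect(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0)
                        perror("connect");
                close(fd);
                return 0;
        }
      
      Remember that the delay only takes effect when the flow is paced by
      FQ or an EDT-capable NIC, as the message notes.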
  15. 21 May 2019 (1 commit)
  16. 01 May 2019 (1 commit)
  17. 07 April 2019 (1 commit)
  18. 24 March 2019 (1 commit)
  19. 27 February 2019 (2 commits)
  20. 24 February 2019 (1 commit)
    • tcp: repaired skbs must init their tso_segs · bf50b606
      Committed by Eric Dumazet
      syzbot reported a WARN_ON(!tcp_skb_pcount(skb))
      in tcp_send_loss_probe() [1].
      
      This was caused by skbs sent via TCP_REPAIR that were inadvertently
      missing a call to tcp_init_tso_segs().
      
      [1]
      WARNING: CPU: 1 PID: 0 at net/ipv4/tcp_output.c:2534 tcp_send_loss_probe+0x771/0x8a0 net/ipv4/tcp_output.c:2534
      Kernel panic - not syncing: panic_on_warn set ...
      CPU: 1 PID: 0 Comm: swapper/1 Not tainted 5.0.0-rc7+ #77
      Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
      Call Trace:
       <IRQ>
       __dump_stack lib/dump_stack.c:77 [inline]
       dump_stack+0x172/0x1f0 lib/dump_stack.c:113
       panic+0x2cb/0x65c kernel/panic.c:214
       __warn.cold+0x20/0x45 kernel/panic.c:571
       report_bug+0x263/0x2b0 lib/bug.c:186
       fixup_bug arch/x86/kernel/traps.c:178 [inline]
       fixup_bug arch/x86/kernel/traps.c:173 [inline]
       do_error_trap+0x11b/0x200 arch/x86/kernel/traps.c:271
       do_invalid_op+0x37/0x50 arch/x86/kernel/traps.c:290
       invalid_op+0x14/0x20 arch/x86/entry/entry_64.S:973
      RIP: 0010:tcp_send_loss_probe+0x771/0x8a0 net/ipv4/tcp_output.c:2534
      Code: 88 fc ff ff 4c 89 ef e8 ed 75 c8 fb e9 c8 fc ff ff e8 43 76 c8 fb e9 63 fd ff ff e8 d9 75 c8 fb e9 94 f9 ff ff e8 bf 03 91 fb <0f> 0b e9 7d fa ff ff e8 b3 03 91 fb 0f b6 1d 37 43 7a 03 31 ff 89
      RSP: 0018:ffff8880ae907c60 EFLAGS: 00010206
      RAX: ffff8880a989c340 RBX: 0000000000000000 RCX: ffffffff85dedbdb
      RDX: 0000000000000100 RSI: ffffffff85dee0b1 RDI: 0000000000000005
      RBP: ffff8880ae907c90 R08: ffff8880a989c340 R09: ffffed10147d1ae1
      R10: ffffed10147d1ae0 R11: ffff8880a3e8d703 R12: ffff888091b90040
      R13: ffff8880a3e8d540 R14: 0000000000008000 R15: ffff888091b90860
       tcp_write_timer_handler+0x5c0/0x8a0 net/ipv4/tcp_timer.c:583
       tcp_write_timer+0x10e/0x1d0 net/ipv4/tcp_timer.c:607
       call_timer_fn+0x190/0x720 kernel/time/timer.c:1325
       expire_timers kernel/time/timer.c:1362 [inline]
       __run_timers kernel/time/timer.c:1681 [inline]
       __run_timers kernel/time/timer.c:1649 [inline]
       run_timer_softirq+0x652/0x1700 kernel/time/timer.c:1694
       __do_softirq+0x266/0x95a kernel/softirq.c:292
       invoke_softirq kernel/softirq.c:373 [inline]
       irq_exit+0x180/0x1d0 kernel/softirq.c:413
       exiting_irq arch/x86/include/asm/apic.h:536 [inline]
       smp_apic_timer_interrupt+0x14a/0x570 arch/x86/kernel/apic/apic.c:1062
       apic_timer_interrupt+0xf/0x20 arch/x86/entry/entry_64.S:807
       </IRQ>
      RIP: 0010:native_safe_halt+0x2/0x10 arch/x86/include/asm/irqflags.h:58
      Code: ff ff ff 48 89 c7 48 89 45 d8 e8 59 0c a1 fa 48 8b 45 d8 e9 ce fe ff ff 48 89 df e8 48 0c a1 fa eb 82 90 90 90 90 90 90 fb f4 <c3> 0f 1f 00 66 2e 0f 1f 84 00 00 00 00 00 f4 c3 90 90 90 90 90 90
      RSP: 0018:ffff8880a98afd78 EFLAGS: 00000286 ORIG_RAX: ffffffffffffff13
      RAX: 1ffffffff1125061 RBX: ffff8880a989c340 RCX: 0000000000000000
      RDX: dffffc0000000000 RSI: 0000000000000001 RDI: ffff8880a989cbbc
      RBP: ffff8880a98afda8 R08: ffff8880a989c340 R09: 0000000000000000
      R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000001
      R13: ffffffff889282f8 R14: 0000000000000001 R15: 0000000000000000
       arch_cpu_idle+0x10/0x20 arch/x86/kernel/process.c:555
       default_idle_call+0x36/0x90 kernel/sched/idle.c:93
       cpuidle_idle_call kernel/sched/idle.c:153 [inline]
       do_idle+0x386/0x570 kernel/sched/idle.c:262
       cpu_startup_entry+0x1b/0x20 kernel/sched/idle.c:353
       start_secondary+0x404/0x5c0 arch/x86/kernel/smpboot.c:271
       secondary_startup_64+0xa4/0xb0 arch/x86/kernel/head_64.S:243
      Kernel Offset: disabled
      Rebooting in 86400 seconds..
      
      Fixes: 79861919 ("tcp: fix TCP_REPAIR xmit queue setup")
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Reported-by: syzbot <syzkaller@googlegroups.com>
      Cc: Andrey Vagin <avagin@openvz.org>
      Cc: Soheil Hassas Yeganeh <soheil@google.com>
      Cc: Neal Cardwell <ncardwell@google.com>
      Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
      Acked-by: Neal Cardwell <ncardwell@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
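      The shape of the fix is a single initialization in the TCP_REPAIR
      transmit path; a hedged sketch, not the verbatim diff, using only
      helpers that already exist in tcp_output.c:
      
        /* Sketch only: repaired skbs bypass the normal transmit path, so
         * give them a valid tso_segs before loss-probe code reads
         * tcp_skb_pcount().
         */
        tcp_init_tso_segs(skb, tcp_current_mss(sk));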
  21. 28 January 2019 (2 commits)
  22. 26 January 2019 (1 commit)
  23. 18 January 2019 (3 commits)
    • tcp: less aggressive window probing on local congestion · c1d5674f
      Committed by Yuchung Cheng
      Previously, when the sender failed to send an (original) data packet
      or window probe due to congestion in the local host (e.g. throttling
      in the qdisc), it would retry within an RTO or two, up to 500ms.
      
      In low-RTT networks such as data centers, RTO is often far below
      the default minimum of 200ms. Local host congestion could then
      trigger a retry storm, pouring gas on the fire. Worse yet, the probe
      counter (icsk_probes_out) is not properly updated, so the aggressive
      retries may exceed the system limit (15 rounds) before the packet
      finally slips through.
      
      On such rare events, it is wise to retry more conservatively
      (500ms) and update the stats properly to reflect these incidents
      and follow the system limit. Note that this is consistent with
      the behavior when a keep-alive probe or RTO retry is dropped
      due to local congestion.
      Signed-off-by: Yuchung Cheng <ycheng@google.com>
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Reviewed-by: Neal Cardwell <ncardwell@google.com>
      Reviewed-by: Soheil Hassas Yeganeh <soheil@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
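      A hedged sketch of the conservative path described above (not the
      upstream diff); TCP_RESOURCE_PROBE_INTERVAL is the existing 500ms
      constant in include/net/tcp.h:
      
        /* Sketch only: a probe that failed to leave the host is still
         * counted, and the next attempt waits a fixed 500ms instead of
         * an RTT-derived interval.
         */
        icsk->icsk_probes_out++;
        inet_csk_reset_xmit_timer(sk, ICSK_TIME_PROBE0,
                                  TCP_RESOURCE_PROBE_INTERVAL, TCP_RTO_MAX);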
    • tcp: always set retrans_stamp on recovery · 7ae18975
      Committed by Yuchung Cheng
      Previously, a TCP socket's retrans_stamp was not set if the
      retransmission failed to send. As a result, if a socket is
      experiencing local issues retransmitting packets, determining when
      to abort it is complicated without knowing the start time of the
      recovery, since retrans_stamp may remain zero.
      
      This complication causes sub-optimal behavior: TCP may use the
      latest, instead of the first, retransmission time to compute the
      elapsed time of a connection stalling due to local issues. TCP may
      then disregard the TCP retries settings and keep retrying until it
      finally succeeds: not a good idea when the local host is already
      strained.
      
      The simple fix is to always timestamp the start of a recovery.
      It's worth noting that retrans_stamp is also used to compare echo
      timestamp values to detect spurious recovery. This patch does
      not break that because retrans_stamp is still later than when the
      original packet was sent.
      Signed-off-by: Yuchung Cheng <ycheng@google.com>
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Reviewed-by: Neal Cardwell <ncardwell@google.com>
      Reviewed-by: Soheil Hassas Yeganeh <soheil@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
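      The essence of the fix, as a hedged sketch rather than the verbatim
      patch:
      
        /* Sketch only: record the start of recovery even if the
         * retransmit itself failed locally, so abort decisions measure
         * elapsed time from the first attempt.
         */
        if (!tp->retrans_stamp)
                tp->retrans_stamp = tcp_time_stamp(tp);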
    • tcp: always timestamp on every skb transmission · 7f12422c
      Committed by Yuchung Cheng
      Previously, TCP skbs were not always timestamped if the transmission
      failed due to memory or other local issues. This made deciding
      when to abort a socket tricky and complicated because the first
      unacknowledged skb's timestamp may be 0 at TCP timeout.
      
      The straightforward fix is to always timestamp the skb on every
      transmission attempt. Every skb retransmission also needs to be
      flagged properly to avoid RTT under-estimation, which can happen
      when an ACK for the original packet arrives after a previous
      (spurious) retransmission has failed.
      
      It is worth noting that this reverts to the old time-stamping
      style used before commit 8c72c65b ("tcp: update skb->skb_mstamp more
      carefully"), which addressed a problem in computing the elapsed time
      of a stalled window-probing socket. That problem will be addressed
      differently, with a simpler approach, in the next patches.
      Signed-off-by: Yuchung Cheng <ycheng@google.com>
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Reviewed-by: Neal Cardwell <ncardwell@google.com>
      Reviewed-by: Soheil Hassas Yeganeh <soheil@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
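      A hedged sketch of the transmit-side idea (not the upstream diff);
      the fields exist in current kernels, but their exact placement here
      is illustrative:
      
        /* Sketch only: stamp every attempt before trying to send. */
        skb->skb_mstamp_ns = tp->tcp_wstamp_ns;
        
        /* On retransmit paths, mark the skb even if the send later
         * fails, so RTT sampling skips it.
         */
        TCP_SKB_CB(skb)->sacked |= TCPCB_EVER_RETRANS;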
  24. 11 December 2018 (1 commit)
  25. 08 December 2018 (1 commit)
  26. 06 December 2018 (2 commits)
  27. 01 December 2018 (2 commits)