1. 25 Jan, 2018 1 commit
    • net: tcp: close sock if net namespace is exiting · 4ee806d5
      Committed by Dan Streetman
      When a tcp socket is closed, if it detects that its net namespace is
      exiting, close immediately and do not wait for FIN sequence.
      
      For normal sockets, a reference is taken to their net namespace, so it will
      never exit while the socket is open.  However, kernel sockets do not take a
      reference to their net namespace, so it may begin exiting while the kernel
      socket is still open.  In this case if the kernel socket is a tcp socket,
      it will stay open trying to complete its close sequence.  The sock's dst(s)
      hold a reference to their interface, which are all transferred to the
      namespace's loopback interface when the real interfaces are taken down.
      When the namespace tries to take down its loopback interface, it hangs
      waiting for all references to the loopback interface to release, which
      results in messages like:
      
      unregister_netdevice: waiting for lo to become free. Usage count = 1
      
      These messages continue until the socket finally times out and closes.
      Since the net namespace cleanup holds the net_mutex while calling its
      registered pernet callbacks, any new net namespace initialization is
      blocked until the current net namespace finishes exiting.
      
      After this change, the tcp socket notices the exiting net namespace, and
      closes immediately, releasing its dst(s) and their reference to the
      loopback interface, which lets the net namespace continue exiting.
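The close-time decision described above can be modelled in user space. This is only a sketch under stated assumptions, not the kernel patch: `struct net_model`, `struct sock_model`, `net_is_alive()` and `close_sock()` are hypothetical names invented for illustration.

```c
#include <stdbool.h>

/* Hypothetical stand-in for a net namespace: only its reference count. */
struct net_model {
    int refcount;            /* reaches 0 once the namespace starts exiting */
};

/* Hypothetical stand-in for a kernel-owned TCP socket. */
struct sock_model {
    struct net_model *net;   /* kernel sockets take no reference on this */
    bool holds_dst;          /* true while route entries pin an interface */
    bool waiting_for_fin;    /* true while the close sequence is pending */
};

static bool net_is_alive(const struct net_model *net)
{
    return net->refcount > 0;
}

static void close_sock(struct sock_model *sk)
{
    if (!net_is_alive(sk->net)) {
        /* Namespace is exiting: tear down now and drop dst references. */
        sk->holds_dst = false;
        sk->waiting_for_fin = false;
        return;
    }
    /* Normal path: keep state until the FIN exchange completes. */
    sk->waiting_for_fin = true;
}
```

In this model, a socket whose namespace has already dropped to zero references stops pinning the loopback interface as soon as it is closed, which is the effect the commit message describes.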
      
      Link: https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1711407
      Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=97811
      Signed-off-by: Dan Streetman <ddstreet@canonical.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  2. 08 Dec, 2017 1 commit
  3. 11 Nov, 2017 2 commits
  4. 10 Nov, 2017 1 commit
  5. 05 Nov, 2017 1 commit
    • tcp: higher throughput under reordering with adaptive RACK reordering wnd · 1f255691
      Committed by Priyaranjan Jha
      Currently TCP RACK loss detection does not work well if packets are
      being reordered beyond its static reordering window (min_rtt/4). Under
      such reordering it may falsely trigger loss recoveries and reduce TCP
      throughput significantly.
      
      This patch improves that by increasing and reducing the reordering
      window based on DSACK, which is now supported in major TCP implementations.
      It makes RACK's reo_wnd adaptive based on DSACK and no. of recoveries.
      
      - If DSACK is received, increment reo_wnd by min_rtt/4 (upper bounded
        by srtt), since there is possibility that spurious retransmission was
        due to reordering delay longer than reo_wnd.
      
      - Persist the current reo_wnd value for TCP_RACK_RECOVERY_THRESH (16)
        no. of successful recoveries (accounts for full DSACK-based loss
        recovery undo). After that, reset it to default (min_rtt/4).
      
      - At max, reo_wnd is incremented only once per rtt. So that the new
        DSACK on which we are reacting, is due to the spurious retx (approx)
        after the reo_wnd has been updated last time.
      
      - reo_wnd is tracked in terms of steps (of min_rtt/4), rather than
        absolute value to account for change in rtt.
      
      In our internal testing, we observed significant increase in throughput,
      in scenarios where reordering exceeds min_rtt/4 (previous static value).
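The adaptation rules listed above can be sketched as a small user-space state machine. This is a simplified model, not the kernel implementation: `struct rack_model`, its field names, and the three functions are assumptions made for illustration; only the threshold of 16 and the min_rtt/4 step come from the text.

```c
#include <stdbool.h>
#include <stdint.h>

#define RACK_RECOVERY_THRESH 16   /* value quoted in the commit message */

/* Hypothetical per-connection state; field names are illustrative. */
struct rack_model {
    uint32_t steps;        /* reo_wnd in units of min_rtt/4, starts at 1 */
    uint32_t persist;      /* recoveries left before resetting to default */
    bool     dsack_seen;   /* a DSACK arrived since the last update */
    bool     rtt_elapsed;  /* at least one RTT passed since the last increase */
};

static uint32_t min_u32(uint32_t a, uint32_t b)
{
    return a < b ? a : b;
}

/* Effective window in microseconds: steps * min_rtt/4, capped by srtt. */
static uint32_t rack_reo_wnd_us(const struct rack_model *r,
                                uint32_t min_rtt_us, uint32_t srtt_us)
{
    return min_u32((min_rtt_us / 4) * r->steps, srtt_us);
}

/* Called once per ACK in this model. */
static void rack_update(struct rack_model *r)
{
    if (r->dsack_seen && r->rtt_elapsed) {
        r->steps++;                        /* widen by one min_rtt/4 step */
        r->persist = RACK_RECOVERY_THRESH; /* hold the wider window */
        r->rtt_elapsed = false;            /* at most one increase per RTT */
    } else if (r->persist == 0) {
        r->steps = 1;                      /* fall back to min_rtt/4 */
    }
    r->dsack_seen = false;
}

/* Called when a loss recovery completes successfully. */
static void rack_recovery_done(struct rack_model *r)
{
    if (r->persist > 0)
        r->persist--;
}
```

Tracking `steps` rather than an absolute duration means the window rescales automatically when min_rtt changes, which is the point of the last bullet above.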
      Signed-off-by: Priyaranjan Jha <priyarjha@google.com>
      Signed-off-by: Yuchung Cheng <ycheng@google.com>
      Signed-off-by: Neal Cardwell <ncardwell@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  6. 28 Oct, 2017 2 commits
  7. 27 Oct, 2017 1 commit
  8. 26 Oct, 2017 1 commit
  9. 24 Oct, 2017 2 commits
  10. 20 Oct, 2017 1 commit
  11. 07 Oct, 2017 3 commits
    • tcp: implement rb-tree based retransmit queue · 75c119af
      Committed by Eric Dumazet
      Using a linear list to store all skbs in write queue has been okay
      for quite a while : O(N) is not too bad when N < 500.
      
      Things get messy when N is on the order of 100,000 : Modern TCP stacks
      want 10Gbit+ of throughput even with 200 ms RTT flows.
      
      40 ns per cache line miss means a full scan can use 4 ms,
      blowing away CPU caches.
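The figures above follow from simple arithmetic, shown here as a sketch. The helper names are hypothetical, and the 1500-byte packet size used in the usage note is an assumption that does not appear in the commit message.

```c
#include <stdint.h>

/* Worst-case time to walk a linked queue, one cache line miss per skb. */
static uint64_t full_scan_ns(uint64_t skbs, uint64_t miss_ns)
{
    return skbs * miss_ns;
}

/* Packets in flight needed to fill a path: bandwidth * RTT / packet size. */
static uint64_t packets_in_flight(uint64_t bits_per_sec, uint64_t rtt_ms,
                                  uint64_t pkt_bytes)
{
    uint64_t bytes = bits_per_sec / 8 * rtt_ms / 1000;
    return bytes / pkt_bytes;
}
```

With 100,000 skbs and 40 ns per miss, `full_scan_ns` gives 4,000,000 ns, i.e. the 4 ms quoted above. For a 10 Gbit/s flow with 200 ms RTT and 1500-byte packets, `packets_in_flight` gives about 166,000 packets, which matches the stated order of magnitude.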
      
      SACK processing often can use various hints to avoid parsing
      whole retransmit queue. But with high packet losses and/or high
      reordering, hints no longer work.
      
      Sender has to process thousands of unfriendly SACK, accumulating
      a huge socket backlog, burning a cpu and massively dropping packets.
      
      Using an rb-tree for retransmit queue has been avoided for years
      because it added complexity and overhead, but now is the time
      to be more resistant and say no to quadratic behavior.
      
      1) RTX queue is no longer part of the write queue : already sent skbs
      are stored in one rb-tree.
      
      2) Since reaching the head of write queue no longer needs
      sk->sk_send_head, we added a union of sk_send_head and tcp_rtx_queue
      
      Tested:
      
       On receiver :
       netem on ingress : delay 150ms 200us loss 1
       GRO disabled to force stress and SACK storms.
      
      for f in `seq 1 10`
      do
       ./netperf -H lpaa6 -l30 -- -K bbr -o THROUGHPUT|tail -1
      done | awk '{print $0} {sum += $0} END {printf "%7u\n",sum}'
      
      Before patch :
      
      323.87
      351.48
      339.59
      338.62
      306.72
      204.07
      304.93
      291.88
      202.47
      176.88
         2840
      
      After patch:
      
      1700.83
      2207.98
      2070.17
      1544.26
      2114.76
      2124.89
      1693.14
      1080.91
      2216.82
      1299.94
        18053
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • tcp: tcp_tx_timestamp() cleanup · 4e8cc228
      Committed by Eric Dumazet
      tcp_write_queue_tail() call can be factorized.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • tcp: uninline tcp_write_queue_purge() · ac3f09ba
      Committed by Eric Dumazet
      Since the upcoming rtx rbtree will add some extra code,
      it is time to not inline this fat function anymore.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  12. 06 Oct, 2017 2 commits
    • tcp: new list for sent but unacked skbs for RACK recovery · e2080072
      Committed by Eric Dumazet
      This patch adds a new queue (list) that tracks the sent but not yet
      acked or SACKed skbs for a TCP connection. The list is chronologically
      ordered by skb->skb_mstamp (the head is the oldest sent skb).
      
      This list will be used to optimize TCP Rack recovery, which checks
      an skb's timestamp to judge if it has been lost and needs to be
      retransmitted. Since TCP write queue is ordered by sequence instead
      of sent time, RACK has to scan over the write queue to catch all
      eligible packets to detect lost retransmission, and iterates through
      SACKed skbs repeatedly.
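The benefit of ordering by send time can be sketched as follows. This is a simplified model and not the kernel's RACK code: `struct skb_model` and `mark_lost_time_ordered()` are hypothetical, and an array stands in for the list.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified skb: only the fields this sketch needs. */
struct skb_model {
    uint64_t sent_us;   /* transmit timestamp */
    int      lost;      /* set to 1 once judged lost */
};

/*
 * Walk a queue sorted by sent_us (oldest first) and mark skbs lost when
 * they were sent more than reo_wnd_us before the most recently delivered
 * skb. Returns how many entries were visited.
 */
static size_t mark_lost_time_ordered(struct skb_model *q, size_t n,
                                     uint64_t newest_acked_sent_us,
                                     uint64_t reo_wnd_us)
{
    size_t visited = 0;

    for (size_t i = 0; i < n; i++) {
        visited++;
        if (q[i].sent_us + reo_wnd_us >= newest_acked_sent_us)
            break;      /* everything after this was sent even later */
        q[i].lost = 1;
    }
    return visited;
}
```

Because the queue is chronological, the walk can stop at the first skb that is still inside the reordering window. A sequence-ordered queue gives no such early exit, so every skb would have to be checked.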
      
      Special cares for rare events:
      1. TCP repair fakes skb transmission so the send queue needs to be adjusted
      2. SACK reneging would require re-inserting SACKed skbs into the
         send queue. For now I believe it's not worth the complexity to
         make RACK work perfectly on SACK reneging, so we do nothing here.
      3. Fast Open: currently for non-TFO, send-queue correctly queues
         the pure SYN packet. For TFO which queues a pure SYN and
         then a data packet, send-queue only queues the data packet but
         not the pure SYN due to the structure of TFO code. This is okay
         because the SYN receiver would never respond with a SACK on a
         missing SYN (i.e. SYN is never fast-retransmitted by SACK/RACK).
      
      In order to not grow sk_buff, we use a union for the new list and
      _skb_refdst/destructor fields. This is a bit complicated because
      we need to make sure _skb_refdst and destructor are properly zeroed
      before skb is cloned/copied at transmit, and before being freed.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: Yuchung Cheng <ycheng@google.com>
      Signed-off-by: Neal Cardwell <ncardwell@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • tcp: uniform the set up of sockets after successful connection · 27204aaa
      Committed by Wei Wang
      Currently in the TCP code, the initialization sequence for cached
      metrics, congestion control, BPF, etc, after successful connection
      is very inconsistent. This introduces inconsistent behavior and is
      prone to bugs. The current call sequence is as follows:
      
      (1) for active case (tcp_finish_connect() case):
              tcp_mtup_init(sk);
              icsk->icsk_af_ops->rebuild_header(sk);
              tcp_init_metrics(sk);
              tcp_call_bpf(sk, BPF_SOCK_OPS_ACTIVE_ESTABLISHED_CB);
              tcp_init_congestion_control(sk);
              tcp_init_buffer_space(sk);
      
      (2) for passive case (tcp_rcv_state_process() TCP_SYN_RECV case):
              icsk->icsk_af_ops->rebuild_header(sk);
              tcp_call_bpf(sk, BPF_SOCK_OPS_PASSIVE_ESTABLISHED_CB);
              tcp_init_congestion_control(sk);
              tcp_mtup_init(sk);
              tcp_init_buffer_space(sk);
              tcp_init_metrics(sk);
      
      (3) for TFO passive case (tcp_fastopen_create_child()):
              inet_csk(child)->icsk_af_ops->rebuild_header(child);
              tcp_init_congestion_control(child);
              tcp_mtup_init(child);
              tcp_init_metrics(child);
              tcp_call_bpf(child, BPF_SOCK_OPS_PASSIVE_ESTABLISHED_CB);
              tcp_init_buffer_space(child);
      
      This commit unifies the above functions to use the following sequence:
              tcp_mtup_init(sk);
              icsk->icsk_af_ops->rebuild_header(sk);
              tcp_init_metrics(sk);
              tcp_call_bpf(sk, BPF_SOCK_OPS_ACTIVE/PASSIVE_ESTABLISHED_CB);
              tcp_init_congestion_control(sk);
              tcp_init_buffer_space(sk);
      This sequence is the same as the (1) active case. We pick this sequence
      because this order correctly allows BPF to override the settings
      including congestion control module and initial cwnd, etc from
      the route, and then allows the CC module to see those settings.
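Why the order matters can be shown with a toy model. Everything here is hypothetical: `struct conn_model`, the stub functions, and the cwnd values 10 and 40 are made up for illustration and are not taken from the kernel.

```c
#include <stdint.h>

/* Hypothetical connection state; not the kernel's struct tcp_sock. */
struct conn_model {
    uint32_t init_cwnd;      /* initial congestion window */
    uint32_t cc_seen_cwnd;   /* what the CC module observed at init */
};

/* Value taken from the route (assumed to be 10 in this model). */
static void init_metrics(struct conn_model *c) { c->init_cwnd = 10; }

/* BPF program overrides the setting (assumed to be 40 in this model). */
static void call_bpf(struct conn_model *c) { c->init_cwnd = 40; }

/* CC module records the window it sees at init time. */
static void init_cc(struct conn_model *c) { c->cc_seen_cwnd = c->init_cwnd; }

/* Unified order: metrics, then BPF, then congestion control. */
static void setup_unified(struct conn_model *c)
{
    init_metrics(c);
    call_bpf(c);
    init_cc(c);
}

/* Order of the old passive path: BPF, congestion control, then metrics. */
static void setup_old_passive(struct conn_model *c)
{
    call_bpf(c);
    init_cc(c);
    init_metrics(c);
}
```

In the unified order the CC module and the socket agree on the BPF-chosen value. In the old passive order the route metric is applied last and overwrites the BPF override after the CC module has already read it.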
      Suggested-by: Neal Cardwell <ncardwell@google.com>
      Tested-by: Neal Cardwell <ncardwell@google.com>
      Signed-off-by: Wei Wang <weiwan@google.com>
      Acked-by: Neal Cardwell <ncardwell@google.com>
      Acked-by: Yuchung Cheng <ycheng@google.com>
      Acked-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  13. 02 Oct, 2017 3 commits
  14. 02 Sep, 2017 1 commit
  15. 31 Aug, 2017 1 commit
  16. 26 Aug, 2017 2 commits
    • tcp: fix hang in tcp_sendpage_locked() · bd9dfc54
      Committed by Eric Dumazet
      syzkaller got a hang in tcp stack, related to a bug in
      tcp_sendpage_locked()
      
      root@syzkaller:~# cat /proc/3059/stack
      [<ffffffff83de926c>] __lock_sock+0x1dc/0x2f0
      [<ffffffff83de9473>] lock_sock_nested+0xf3/0x110
      [<ffffffff8408ce01>] tcp_sendmsg+0x21/0x50
      [<ffffffff84163b6f>] inet_sendmsg+0x11f/0x5e0
      [<ffffffff83dd8eea>] sock_sendmsg+0xca/0x110
      [<ffffffff83dd9547>] kernel_sendmsg+0x47/0x60
      [<ffffffff83de35dc>] sock_no_sendpage+0x1cc/0x280
      [<ffffffff8408916b>] tcp_sendpage_locked+0x10b/0x160
      [<ffffffff84089203>] tcp_sendpage+0x43/0x60
      [<ffffffff841641da>] inet_sendpage+0x1aa/0x660
      [<ffffffff83dd4fcd>] kernel_sendpage+0x8d/0xe0
      [<ffffffff83dd50ac>] sock_sendpage+0x8c/0xc0
      [<ffffffff81b63300>] pipe_to_sendpage+0x290/0x3b0
      [<ffffffff81b67243>] __splice_from_pipe+0x343/0x750
      [<ffffffff81b6a459>] splice_from_pipe+0x1e9/0x330
      [<ffffffff81b6a5e0>] generic_splice_sendpage+0x40/0x50
      [<ffffffff81b6b1d7>] SyS_splice+0x7b7/0x1610
      [<ffffffff84d77a01>] entry_SYSCALL_64_fastpath+0x1f/0xbe
      
      Fixes: 306b13eb ("proto_ops: Add locked held versions of sendmsg and sendpage")
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Reported-by: Dmitry Vyukov <dvyukov@google.com>
      Cc: Tom Herbert <tom@quantonium.net>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • tcp: fix refcnt leak with ebpf congestion control · ebfa00c5
      Committed by Sabrina Dubroca
      There are a few bugs around refcnt handling in the new BPF congestion
      control setsockopt:
      
       - The new ca is assigned to icsk->icsk_ca_ops even in the case where we
         cannot get a reference on it. This would lead to a use after free,
         since that ca is going away soon.
      
       - Changing the congestion control case doesn't release the refcnt on
         the previous ca.
      
       - In the reinit case, we first leak a reference on the old ca, then we
         call tcp_reinit_congestion_control on the ca that we have just
         assigned, leading to deinitializing the wrong ca (->release of the
         new ca on the old ca's data) and releasing the refcount on the ca
         that we actually want to use.
      
      This is visible by building (for example) BIC as a module and setting
      net.ipv4.tcp_congestion_control=bic, and using tcp_cong_kern.c from
      samples/bpf.
      
      This patch fixes the refcount issues, and moves reinit back into tcp
      core to avoid passing a ca pointer back to BPF.
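The reference-counting discipline the fix restores can be sketched in user space. This is a generic pattern under stated assumptions, not the patched kernel function: `struct ca_model`, `ca_try_get()`, `ca_put()` and `set_ca()` are hypothetical names.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical congestion-control module with a reference count. */
struct ca_model {
    int  refcnt;
    bool unloading;     /* when true, no new reference can be taken */
};

static bool ca_try_get(struct ca_model *ca)
{
    if (ca->unloading)
        return false;
    ca->refcnt++;
    return true;
}

static void ca_put(struct ca_model *ca)
{
    ca->refcnt--;
}

/* Switch *cur to new_ca; on failure *cur and all counts are unchanged. */
static int set_ca(struct ca_model **cur, struct ca_model *new_ca)
{
    if (*cur == new_ca)
        return 0;
    if (!ca_try_get(new_ca))
        return -1;      /* never install a ca we hold no reference on */
    ca_put(*cur);       /* release the reference on the previous ca */
    *cur = new_ca;
    return 0;
}
```

Taking the new reference before touching the current pointer covers the first bug in the list above, and the explicit `ca_put()` on the old module covers the second.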
      
      Fixes: 91b5b21c ("bpf: Add support for changing congestion control")
      Signed-off-by: Sabrina Dubroca <sd@queasysnail.net>
      Acked-by: Lawrence Brakmo <brakmo@fb.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  17. 24 Aug, 2017 1 commit
    • tcp: Extend SOF_TIMESTAMPING_RX_SOFTWARE to TCP recvmsg · 98aaa913
      Committed by Mike Maloney
      When SOF_TIMESTAMPING_RX_SOFTWARE is enabled for tcp sockets, return the
      timestamp corresponding to the highest sequence number data returned.
      
      Previously the skb->tstamp is overwritten when a TCP packet is placed
      in the out of order queue.  While the packet is in the ooo queue, save the
      timestamp in the TCP_SKB_CB.  This space is shared with the gso_*
      options which are only used on the tx path, and a previously unused 4
      byte hole.
      
      When skbs are coalesced either in the sk_receive_queue or the
      out_of_order_queue always choose the timestamp of the appended skb to
      maintain the invariant of returning the timestamp of the last byte in
      the recvmsg buffer.
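From user space, the feature is requested through the `SO_TIMESTAMPING` socket option. The following is a minimal sketch for Linux; the helper name `enable_rx_sw_timestamps()` is hypothetical, while the option and flag names are the standard ones from the Linux headers.

```c
#include <sys/socket.h>
#include <linux/net_tstamp.h>

/*
 * Ask the kernel to generate software receive timestamps and to report
 * them. Returns 0 on success, -1 on error (errno set by setsockopt).
 */
static int enable_rx_sw_timestamps(int fd)
{
    int flags = SOF_TIMESTAMPING_RX_SOFTWARE | SOF_TIMESTAMPING_SOFTWARE;

    return setsockopt(fd, SOL_SOCKET, SO_TIMESTAMPING,
                      &flags, sizeof(flags));
}
```

After enabling the option, an application would read the timestamp from the `SCM_TIMESTAMPING` control message returned by `recvmsg()`. Per the description above, for TCP that value corresponds to the last byte placed in the receive buffer.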
      Signed-off-by: Mike Maloney <maloney@google.com>
      Acked-by: Willem de Bruijn <willemb@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  18. 17 Aug, 2017 1 commit
  19. 04 Aug, 2017 1 commit
    • tcp: enable MSG_ZEROCOPY · f214f915
      Committed by Willem de Bruijn
      Enable support for MSG_ZEROCOPY to the TCP stack. TSO and GSO are
      both supported. Only data sent to remote destinations is sent without
      copying. Packets looped onto a local destination have their payload
      copied to avoid unbounded latency.
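A sender opts in per socket and per call. The sketch below shows the two steps for Linux; `send_zerocopy()` is a hypothetical helper, and the fallback `#define` values are the generic Linux ones, provided only in case older userspace headers lack the constants.

```c
#include <stddef.h>
#include <sys/types.h>
#include <sys/socket.h>

#ifndef SO_ZEROCOPY
#define SO_ZEROCOPY 60            /* generic Linux value, for older headers */
#endif
#ifndef MSG_ZEROCOPY
#define MSG_ZEROCOPY 0x4000000    /* generic Linux value, for older headers */
#endif

/*
 * Enable zerocopy on the socket, then send one buffer with MSG_ZEROCOPY.
 * Returns the number of bytes queued, or -1 on error.
 */
static ssize_t send_zerocopy(int fd, const void *buf, size_t len)
{
    int one = 1;

    if (setsockopt(fd, SOL_SOCKET, SO_ZEROCOPY, &one, sizeof(one)) != 0)
        return -1;
    return send(fd, buf, len, MSG_ZEROCOPY);
}
```

The caller must leave the buffer unmodified until the kernel reports completion on the socket error queue (read with `recvmsg()` and `MSG_ERRQUEUE`). That bookkeeping is omitted here.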
      
      Tested:
        A 10x TCP_STREAM between two hosts showed a reduction in netserver
        process cycles by up to 70%, depending on packet size. Systemwide,
        savings are of course much less pronounced, at up to 20% best case.
      
        msg_zerocopy.sh 4 tcp:
      
        without zerocopy
          tx=121792 (7600 MB) txc=0 zc=n
          rx=60458 (7600 MB)
      
        with zerocopy
          tx=286257 (17863 MB) txc=286257 zc=y
          rx=140022 (17863 MB)
      
        This test opens a pair of sockets over veth, one of which calls send with
        64KB and optionally MSG_ZEROCOPY while the other reads the initial
        bytes. The receiver truncates, so this is strictly an upper bound on
        what is achievable. It is more representative of sending data out of
        a physical NIC (when payload is not touched, either).
      Signed-off-by: Willem de Bruijn <willemb@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  20. 02 Aug, 2017 1 commit
  21. 01 Aug, 2017 4 commits
  22. 02 Jul, 2017 1 commit
  23. 01 Jul, 2017 1 commit
  24. 28 Jun, 2017 1 commit
  25. 26 Jun, 2017 1 commit
  26. 20 Jun, 2017 1 commit
  27. 16 Jun, 2017 2 commits