1. 08 10月, 2017 2 次提交
  2. 07 10月, 2017 4 次提交
    • E
      tcp: implement rb-tree based retransmit queue · 75c119af
      Eric Dumazet 提交于
      Using a linear list to store all skbs in write queue has been okay
      for quite a while : O(N) is not too bad when N < 500.
      
      Things get messy when N is the order of 100,000 : Modern TCP stacks
      want 10Gbit+ of throughput even with 200 ms RTT flows.
      
      40 ns per cache line miss means a full scan can use 4 ms,
      blowing away CPU caches.
      
      SACK processing often can use various hints to avoid parsing
      whole retransmit queue. But with high packet losses and/or high
      reordering, hints no longer work.
      
      Sender has to process thousands of unfriendly SACK, accumulating
      a huge socket backlog, burning a cpu and massively dropping packets.
      
      Using an rb-tree for retransmit queue has been avoided for years
      because it added complexity and overhead, but now is the time
      to be more resistant and say no to quadratic behavior.
      
      1) RTX queue is no longer part of the write queue : already sent skbs
      are stored in one rb-tree.
      
      2) Since reaching the head of write queue no longer needs
      sk->sk_send_head, we added an union of sk_send_head and tcp_rtx_queue
      
      Tested:
      
       On receiver :
       netem on ingress : delay 150ms 200us loss 1
       GRO disabled to force stress and SACK storms.
      
      for f in `seq 1 10`
      do
       ./netperf -H lpaa6 -l30 -- -K bbr -o THROUGHPUT|tail -1
      done | awk '{print $0} {sum += $0} END {printf "%7u\n",sum}'
      
      Before patch :
      
      323.87
      351.48
      339.59
      338.62
      306.72
      204.07
      304.93
      291.88
      202.47
      176.88
         2840
      
      After patch:
      
      1700.83
      2207.98
      2070.17
      1544.26
      2114.76
      2124.89
      1693.14
      1080.91
      2216.82
      1299.94
        18053
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      75c119af
    • E
      tcp: uninline tcp_write_queue_purge() · ac3f09ba
      Eric Dumazet 提交于
      Since the upcoming rtx rbtree will add some extra code,
      it is time to not inline this fat function anymore.
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      ac3f09ba
    • E
      net: add rb_to_skb() and other rb tree helpers · 18a4c0ea
      Eric Dumazet 提交于
      Geeralize private netem_rb_to_skb()
      
      TCP rtx queue will soon be converted to rb-tree,
      so we will need skb_rbtree_walk() helpers.
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      18a4c0ea
    • J
      net/ipv6: Convert icmpv6_push_pending_frames to void · 4e64b1ed
      Joe Perches 提交于
      commit cc71b7b0 ("net/ipv6: remove unused err variable on
      icmpv6_push_pending_frames") exposed icmpv6_push_pending_frames
      return value not being used.
      
      Remove now unnecessary int err declarations and uses.
      
      Miscellanea:
      
      o Remove unnecessary goto and out: labels
      o Realign arguments
      Signed-off-by: NJoe Perches <joe@perches.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      4e64b1ed
  3. 06 10月, 2017 6 次提交
    • E
      tcp: new list for sent but unacked skbs for RACK recovery · e2080072
      Eric Dumazet 提交于
      This patch adds a new queue (list) that tracks the sent but not yet
      acked or SACKed skbs for a TCP connection. The list is chronologically
      ordered by skb->skb_mstamp (the head is the oldest sent skb).
      
      This list will be used to optimize TCP Rack recovery, which checks
      an skb's timestamp to judge if it has been lost and needs to be
      retransmitted. Since TCP write queue is ordered by sequence instead
      of sent time, RACK has to scan over the write queue to catch all
      eligible packets to detect lost retransmission, and iterates through
      SACKed skbs repeatedly.
      
      Special cares for rare events:
      1. TCP repair fakes skb transmission so the send queue needs adjusted
      2. SACK reneging would require re-inserting SACKed skbs into the
         send queue. For now I believe it's not worth the complexity to
         make RACK work perfectly on SACK reneging, so we do nothing here.
      3. Fast Open: currently for non-TFO, send-queue correctly queues
         the pure SYN packet. For TFO which queues a pure SYN and
         then a data packet, send-queue only queues the data packet but
         not the pure SYN due to the structure of TFO code. This is okay
         because the SYN receiver would never respond with a SACK on a
         missing SYN (i.e. SYN is never fast-retransmitted by SACK/RACK).
      
      In order to not grow sk_buff, we use an union for the new list and
      _skb_refdst/destructor fields. This is a bit complicated because
      we need to make sure _skb_refdst and destructor are properly zeroed
      before skb is cloned/copied at transmit, and before being freed.
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NYuchung Cheng <ycheng@google.com>
      Signed-off-by: NNeal Cardwell <ncardwell@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      e2080072
    • W
      tcp: uniform the set up of sockets after successful connection · 27204aaa
      Wei Wang 提交于
      Currently in the TCP code, the initialization sequence for cached
      metrics, congestion control, BPF, etc, after successful connection
      is very inconsistent. This introduces inconsistent bevhavior and is
      prone to bugs. The current call sequence is as follows:
      
      (1) for active case (tcp_finish_connect() case):
              tcp_mtup_init(sk);
              icsk->icsk_af_ops->rebuild_header(sk);
              tcp_init_metrics(sk);
              tcp_call_bpf(sk, BPF_SOCK_OPS_ACTIVE_ESTABLISHED_CB);
              tcp_init_congestion_control(sk);
              tcp_init_buffer_space(sk);
      
      (2) for passive case (tcp_rcv_state_process() TCP_SYN_RECV case):
              icsk->icsk_af_ops->rebuild_header(sk);
              tcp_call_bpf(sk, BPF_SOCK_OPS_PASSIVE_ESTABLISHED_CB);
              tcp_init_congestion_control(sk);
              tcp_mtup_init(sk);
              tcp_init_buffer_space(sk);
              tcp_init_metrics(sk);
      
      (3) for TFO passive case (tcp_fastopen_create_child()):
              inet_csk(child)->icsk_af_ops->rebuild_header(child);
              tcp_init_congestion_control(child);
              tcp_mtup_init(child);
              tcp_init_metrics(child);
              tcp_call_bpf(child, BPF_SOCK_OPS_PASSIVE_ESTABLISHED_CB);
              tcp_init_buffer_space(child);
      
      This commit uniforms the above functions to have the following sequence:
              tcp_mtup_init(sk);
              icsk->icsk_af_ops->rebuild_header(sk);
              tcp_init_metrics(sk);
              tcp_call_bpf(sk, BPF_SOCK_OPS_ACTIVE/PASSIVE_ESTABLISHED_CB);
              tcp_init_congestion_control(sk);
              tcp_init_buffer_space(sk);
      This sequence is the same as the (1) active case. We pick this sequence
      because this order correctly allows BPF to override the settings
      including congestion control module and initial cwnd, etc from
      the route, and then allows the CC module to see those settings.
      Suggested-by: NNeal Cardwell <ncardwell@google.com>
      Tested-by: NNeal Cardwell <ncardwell@google.com>
      Signed-off-by: NWei Wang <weiwan@google.com>
      Acked-by: NNeal Cardwell <ncardwell@google.com>
      Acked-by: NYuchung Cheng <ycheng@google.com>
      Acked-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      27204aaa
    • S
      VSOCK: add sock_diag interface · 413a4317
      Stefan Hajnoczi 提交于
      This patch adds the sock_diag interface for querying sockets from
      userspace.  Tools like ss(8) and netstat(8) can use this interface to
      list open sockets.
      
      The userspace ABI is defined in <linux/vm_sockets_diag.h> and includes
      netlink request and response structs.  The request can query sockets
      based on their sk_state (e.g. listening sockets only) and the response
      contains socket information fields including the local/remote addresses,
      inode number, etc.
      
      This patch does not dump VMCI pending sockets because I have only tested
      the virtio transport, which does not use pending sockets.  Support can
      be added later by extending vsock_diag_dump() if needed by VMCI users.
      Signed-off-by: NStefan Hajnoczi <stefanha@redhat.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      413a4317
    • S
      VSOCK: use TCP state constants for sk_state · 3b4477d2
      Stefan Hajnoczi 提交于
      There are two state fields: socket->state and sock->sk_state.  The
      socket->state field uses SS_UNCONNECTED, SS_CONNECTED, etc while the
      sock->sk_state typically uses values that match TCP state constants
      (TCP_CLOSE, TCP_ESTABLISHED).  AF_VSOCK does not follow this convention
      and instead uses SS_* constants for both fields.
      
      The sk_state field will be exposed to userspace through the vsock_diag
      interface for ss(8), netstat(8), and other programs.
      
      This patch switches sk_state to TCP state constants so that the meaning
      of this field is consistent with other address families.  Not just
      AF_INET and AF_INET6 use the TCP constants, AF_UNIX and others do too.
      
      The following mapping was used to convert the code:
      
        SS_FREE -> TCP_CLOSE
        SS_UNCONNECTED -> TCP_CLOSE
        SS_CONNECTING -> TCP_SYN_SENT
        SS_CONNECTED -> TCP_ESTABLISHED
        SS_DISCONNECTING -> TCP_CLOSING
        VSOCK_SS_LISTEN -> TCP_LISTEN
      
      In __vsock_create() the sk_state initialization was dropped because
      sock_init_data() already initializes sk_state to TCP_CLOSE.
      Signed-off-by: NStefan Hajnoczi <stefanha@redhat.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      3b4477d2
    • S
      VSOCK: move __vsock_in_bound/connected_table() to af_vsock.h · bf359b81
      Stefan Hajnoczi 提交于
      The vsock_diag.ko module will need to check socket table membership.
      Signed-off-by: NStefan Hajnoczi <stefanha@redhat.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      bf359b81
    • S
      VSOCK: export socket tables for sock_diag interface · 44f20980
      Stefan Hajnoczi 提交于
      The socket table symbols need to be exported from vsock.ko so that the
      vsock_diag.ko module will be able to traverse sockets.
      Signed-off-by: NStefan Hajnoczi <stefanha@redhat.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      44f20980
  4. 05 10月, 2017 8 次提交
  5. 04 10月, 2017 18 次提交
  6. 03 10月, 2017 2 次提交