1. 31 January 2017 (1 commit)
    • net: Avoid receiving packets with an l3mdev on unbound UDP sockets · 63a6fff3
      Committed by Robert Shearman
      Packets arriving in a VRF currently are delivered to UDP sockets that
      aren't bound to any interface. TCP defaults to not delivering packets
      arriving in a VRF to unbound sockets. IP route lookup and socket
      transmit both assume that unbound means using the default table and
      UDP applications that haven't been changed to be aware of VRFs may not
      function correctly in this case since they may not be able to handle
      overlapping IP address ranges, or be able to send packets back to the
      original sender if required.
      
      So add a sysctl, udp_l3mdev_accept, to control this behaviour. It is
      analogous to the existing tcp_l3mdev_accept, namely allowing a
      process to have a VRF-global listen socket. Have this default to off
      as this is the behaviour that users will expect, given that there is
      no explicit mechanism to place an unmodified, VRF-unaware application
      into a default VRF.
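      For illustration, a minimal sketch that turns the new behaviour on at
      run time, assuming the sysctl is exposed as
      /proc/sys/net/ipv4/udp_l3mdev_accept (by analogy with the existing
      tcp_l3mdev_accept path):

        #include <stdio.h>

        int main(void)
        {
            /* assumed path, analogous to net.ipv4.tcp_l3mdev_accept */
            FILE *f = fopen("/proc/sys/net/ipv4/udp_l3mdev_accept", "w");

            if (!f) {
                perror("udp_l3mdev_accept");
                return 1;
            }
            fputs("1\n", f);  /* let unbound UDP sockets accept VRF traffic */
            return fclose(f) == 0 ? 0 : 1;
        }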
      Signed-off-by: Robert Shearman <rshearma@brocade.com>
      Acked-by: David Ahern <dsa@cumulusnetworks.com>
      Tested-by: David Ahern <dsa@cumulusnetworks.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  2. 30 January 2017 (2 commits)
  3. 26 January 2017 (5 commits)
    • net/tcp-fastopen: make connect()'s return case more consistent with non-TFO · 3979ad7e
      Committed by Willy Tarreau
      Without TFO, any subsequent connect() call after a successful one returns
      -1 EISCONN. The last API update ensured that __inet_stream_connect() can
      return -1 EINPROGRESS in response to sendmsg() when TFO is in use to
      indicate that the connection is now in progress. Unfortunately since this
      function is used both for connect() and sendmsg(), it has the undesired
      side effect of making connect() now return -1 EINPROGRESS as well after
      a successful call, while at the same time poll() returns POLLOUT. This
      can confuse some applications which happen to call connect() and to
      check for -1 EISCONN to ensure the connection is usable, and for which
      EINPROGRESS indicates a need to poll, causing a loop.
      
      This problem was encountered in haproxy where a call to connect() is
      precisely used in certain cases to confirm a connection's readiness.
      While arguably haproxy's behaviour should be improved here, it seems
      important to aim at a more robust behaviour when the goal of the new
      API is to make it easier to implement TFO in existing applications.
      
      This patch simply ensures that we preserve the same semantics as in
      the non-TFO case on the connect() syscall when using TFO, while still
      returning -1 EINPROGRESS on sendmsg(). For this we simply tell
      __inet_stream_connect() whether we're doing a regular connect() or in
      fact connecting for a sendmsg() call.
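      For illustration, the application pattern this preserves might look
      like the following user-space sketch (a hypothetical helper, not taken
      from haproxy or from the patch): a second connect() on an established
      socket keeps failing with EISCONN rather than EINPROGRESS, so it can
      still be used as a readiness check.

        #include <errno.h>
        #include <stdbool.h>
        #include <sys/socket.h>

        /* True once the connection is usable; with this patch the result is
         * the same whether or not TCP Fast Open was used on the socket. */
        static bool connection_ready(int fd, const struct sockaddr *sa,
                                     socklen_t len)
        {
            if (connect(fd, sa, len) == 0)
                return true;          /* connected on this call */
            if (errno == EISCONN)
                return true;          /* already established */
            return false;             /* EINPROGRESS/EALREADY: keep polling */
        }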
      
      Cc: Wei Wang <weiwan@google.com>
      Cc: Yuchung Cheng <ycheng@google.com>
      Cc: Eric Dumazet <edumazet@google.com>
      Signed-off-by: Willy Tarreau <w@1wt.eu>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net/tcp-fastopen: Add new API support · 19f6d3f3
      Committed by Wei Wang
      This patch adds a new socket option, TCP_FASTOPEN_CONNECT, as an
      alternative way to perform Fast Open on the active side (client). Prior
      to this patch, a client needs to replace the connect() call with
      sendto(MSG_FASTOPEN). This can be cumbersome for applications that want
      to use Fast Open: these socket operations are often done in lower-layer
      libraries used by many other applications. Changing these libraries
      and/or the socket call sequences is not trivial. A more convenient
      approach is to perform Fast Open by simply enabling a socket option when
      the socket is created, without changing the rest of the socket call sequence:
        s = socket()
          create a new socket
        setsockopt(s, IPPROTO_TCP, TCP_FASTOPEN_CONNECT …);
          newly introduced sockopt
          If set, new functionality described below will be used.
          Return ENOTSUPP if TFO is not supported or not enabled in the
          kernel.
      
        connect()
          With cookie present, return 0 immediately.
          With no cookie, initiate 3WHS with TFO cookie-request option and
          return -1 with errno = EINPROGRESS.
      
        write()/sendmsg()
          With cookie present, send out SYN with data and return the number of
          bytes buffered.
          With no cookie, and 3WHS not yet completed, return -1 with errno =
          EINPROGRESS.
          No MSG_FASTOPEN flag is needed.
      
        read()
          Return -1 with errno = EWOULDBLOCK/EAGAIN if connect() is called but
          write() is not called yet.
          Return -1 with errno = EWOULDBLOCK/EAGAIN if the connection is
          established but no message has been received yet.
          Return the number of bytes read if the socket is established and a
          message has been received.
      
      The new API simplifies life for applications that always perform a write()
      immediately after a successful connect(). Such applications can now take
      advantage of Fast Open by merely making one new setsockopt() call at the time
      of creating the socket. Nothing else about the application's socket call
      sequence needs to change.
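      As a minimal client sketch of the sequence above (example address, no
      error handling; the fallback #define is only for older userspace headers):

        #include <arpa/inet.h>
        #include <netinet/in.h>
        #include <netinet/tcp.h>
        #include <sys/socket.h>
        #include <unistd.h>

        #ifndef TCP_FASTOPEN_CONNECT
        #define TCP_FASTOPEN_CONNECT 30
        #endif

        int main(void)
        {
            int one = 1;
            struct sockaddr_in addr = { .sin_family = AF_INET,
                                        .sin_port   = htons(80) };
            int fd = socket(AF_INET, SOCK_STREAM, 0);

            inet_pton(AF_INET, "192.0.2.1", &addr.sin_addr);

            /* opt in to Fast Open on the active side */
            setsockopt(fd, IPPROTO_TCP, TCP_FASTOPEN_CONNECT, &one, sizeof(one));

            /* 0 if a cookie is already cached, otherwise -1/EINPROGRESS while
             * the cookie-requesting 3WHS proceeds in the background */
            connect(fd, (struct sockaddr *)&addr, sizeof(addr));

            /* with a cookie this data rides on the SYN; no MSG_FASTOPEN needed */
            write(fd, "GET / HTTP/1.0\r\n\r\n", 18);

            close(fd);
            return 0;
        }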
      Signed-off-by: Wei Wang <weiwan@google.com>
      Acked-by: Eric Dumazet <edumazet@google.com>
      Acked-by: Yuchung Cheng <ycheng@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net/tcp-fastopen: refactor cookie check logic · 065263f4
      Committed by Wei Wang
      Refactor the cookie check logic in tcp_send_syn_data() into a function.
      This function will be called elsewhere in later changes.
      Signed-off-by: Wei Wang <weiwan@google.com>
      Acked-by: Eric Dumazet <edumazet@google.com>
      Acked-by: Yuchung Cheng <ycheng@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • tcp: correct memory barrier usage in tcp_check_space() · 56d80622
      Committed by Jason Baron
      sock_reset_flag() maps to __clear_bit(), not the atomic version clear_bit().
      Thus we need smp_mb(); smp_mb__after_atomic() is not sufficient.
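      As a rough kernel-context sketch of where this matters (based on the
      description above, not copied from the patch), the check in
      tcp_check_space() would take roughly this shape:

        if (sock_flag(sk, SOCK_QUEUE_SHRUNK)) {
            sock_reset_flag(sk, SOCK_QUEUE_SHRUNK);  /* plain __clear_bit() */
            /* a full barrier is required before testing SOCK_NOSPACE,
             * because the flag clear above is not an atomic RMW */
            smp_mb();
            if (sk->sk_socket &&
                test_bit(SOCK_NOSPACE, &sk->sk_socket->flags))
                tcp_new_space(sk);
        }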
      
      Fixes: 3c715127 ("tcp: add memory barriers to write space paths")
      Cc: Eric Dumazet <eric.dumazet@gmail.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Signed-off-by: Jason Baron <jbaron@akamai.com>
      Acked-by: Eric Dumazet <edumazet@google.com>
      Reported-by: Oleg Nesterov <oleg@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • tcp: reduce skb overhead in selected places · 60b1af33
      Committed by Eric Dumazet
      tcp_add_backlog() can use the skb_condense() helper to get better
      gains and less SKB_TRUESIZE() magic. This only happens when the socket
      backlog has to be used.
      
      Some attacks involve specially crafted out of order tiny TCP packets,
      clogging the ofo queue of (many) sockets.
      Then later, expensive collapse happens, trying to copy all these skbs
      into single ones.
      This unfortunately does not work if each skb has no neighbor in TCP
      sequence order.
      
      By using skb_condense() if the skb could not be coalesced to a prior
      one, we defeat these kinds of threats, potentially saving 4K per skb
      (or more, since this is one page fragment).

      A typical NAPI driver allocates GRO packets with GRO_MAX_HEAD bytes
      in skb->head, meaning the copy done by skb_condense() is limited to
      about 200 bytes.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  4. 25 January 2017 (2 commits)
  5. 21 January 2017 (2 commits)
  6. 20 January 2017 (1 commit)
    • tcp: initialize max window for a new fastopen socket · 0dbd7ff3
      Committed by Alexey Kodanev
      Found that if we run the LTP netstress test with a large MSS (65K),
      the first attempt from the server to send data comparable to this
      MSS on a fastopen connection will be delayed by the probe timer.
      
      Here is an example:
      
           < S  seq 0:0 win 43690 options [mss 65495 wscale 7 tfo cookie] length 32
           > S. seq 0:0 ack 1 win 43690 options [mss 65495 wscale 7] length 0
           < .  ack 1 win 342 length 0
      
      Inside tcp_sendmsg(), tcp_send_mss() returns max MSS in 'mss_now',
      as well as in 'size_goal'. This results in the segment not being queued
      for transmission until all the data has been copied from the user buffer.
      Then, inside __tcp_push_pending_frames(), it breaks on the send window
      test and continues with the probe timer check.
      
      Fragmentation occurs in tcp_write_wakeup()...
      
      +0.2 > P. seq 1:43777 ack 1 win 342 length 43776
           < .  ack 43777, win 1365 length 0
           > P. seq 43777:65001 ack 1 win 342 options [...] length 21224
           ...
      
      This also contradicts the fact that we should be bound to half of
      the window if it is large.
      
      Fix this flaw by correctly initializing max_window. Before that, it
      could have large values that affect further calculations of 'size_goal'.
      
      Fixes: 168a8f58 ("tcp: TCP Fast Open Server - main code path")
      Signed-off-by: Alexey Kodanev <alexey.kodanev@oracle.com>
      Acked-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  7. 19 January 2017 (8 commits)
    • lwtunnel: fix autoload of lwt modules · 9ed59592
      Committed by David Ahern
      Trying to add an mpls encap route when the MPLS modules are not loaded
      hangs. For example:
      
          CONFIG_MPLS=y
          CONFIG_NET_MPLS_GSO=m
          CONFIG_MPLS_ROUTING=m
          CONFIG_MPLS_IPTUNNEL=m
      
          $ ip route add 10.10.10.10/32 encap mpls 100 via inet 10.100.1.2
      
      The ip command hangs:
      root       880   826  0 21:25 pts/0    00:00:00 ip route add 10.10.10.10/32 encap mpls 100 via inet 10.100.1.2
      
          $ cat /proc/880/stack
          [<ffffffff81065a9b>] call_usermodehelper_exec+0xd6/0x134
          [<ffffffff81065efc>] __request_module+0x27b/0x30a
          [<ffffffff814542f6>] lwtunnel_build_state+0xe4/0x178
          [<ffffffff814aa1e4>] fib_create_info+0x47f/0xdd4
          [<ffffffff814ae451>] fib_table_insert+0x90/0x41f
          [<ffffffff814a8010>] inet_rtm_newroute+0x4b/0x52
          ...
      
      modprobe is trying to load rtnl-lwt-MPLS:
      
      root       881     5  0 21:25 ?        00:00:00 /sbin/modprobe -q -- rtnl-lwt-MPLS
      
      and it hangs after loading mpls_router:
      
          $ cat /proc/881/stack
          [<ffffffff81441537>] rtnl_lock+0x12/0x14
          [<ffffffff8142ca2a>] register_netdevice_notifier+0x16/0x179
          [<ffffffffa0033025>] mpls_init+0x25/0x1000 [mpls_router]
          [<ffffffff81000471>] do_one_initcall+0x8e/0x13f
          [<ffffffff81119961>] do_init_module+0x5a/0x1e5
          [<ffffffff810bd070>] load_module+0x13bd/0x17d6
          ...
      
      The problem is that lwtunnel_build_state is called with the rtnl lock
      held, preventing mpls_init from registering.

      Given the references potentially held by the time lwtunnel_build_state
      is called, it cannot drop the rtnl lock to load the module. So, extract
      the module loading code from lwtunnel_build_state into a new function
      to validate the encap type. The new function is called while converting
      the user request into a fib_config, which is well before any table,
      device or fib entries are examined.
      
      Fixes: 745041e2 ("lwtunnel: autoload of lwt modules")
      Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • netfilter: ipt_CLUSTERIP: fix build error without procfs · 3fd0b634
      Committed by Arnd Bergmann
      We can't access c->pde if CONFIG_PROC_FS is disabled:
      
      net/ipv4/netfilter/ipt_CLUSTERIP.c: In function 'clusterip_config_find_get':
      net/ipv4/netfilter/ipt_CLUSTERIP.c:147:9: error: 'struct clusterip_config' has no member named 'pde'
      
      This moves the check inside another #ifdef.
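      A generic, self-contained illustration of the approach (not the driver
      code; the struct and field names here are only stand-ins): any use of
      the procfs-only member is itself guarded.

        #include <stdio.h>

        struct config {
            int refcount;
        #ifdef CONFIG_PROC_FS
            void *pde;              /* only exists when procfs is enabled */
        #endif
        };

        static int config_usable(const struct config *c)
        {
        #ifdef CONFIG_PROC_FS
            if (!c->pde)            /* guarded: the field may not exist */
                return 0;
        #endif
            return c->refcount > 0;
        }

        int main(void)
        {
            struct config c = { .refcount = 1 };
            printf("%d\n", config_usable(&c));
            return 0;
        }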
      
      Fixes: 6c5d5cfb ("netfilter: ipt_CLUSTERIP: check duplicate config when initializing")
      Signed-off-by: Arnd Bergmann <arnd@arndb.de>
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
    • inet: reset tb->fastreuseport when adding a reuseport sk · 637bc8bb
      Committed by Josef Bacik
      If we have non-reuseport sockets on a tb we will set tb->fastreuseport to 0 and
      never set it again, which means that in the future, if we end up adding a bunch
      of reuseport sks to that tb, we'll have to do the expensive scan every time.
      Instead add the ipv4/ipv6 saddr fields to the bind bucket, as well as the family
      so we know what comparison to make, and the ipv6-only setting so we can make
      sure to compare with new sockets appropriately.  Once one sk has made it onto
      the list we know that there are no potential bind conflicts on the owners list
      that match that sk's rcv_addr.  So copy the sk's information into our bind
      bucket and set tb->fastreuseport to FASTREUSESOCK_STRICT so we know we have to do
      an extra check for subsequent reuseport sockets and skip the expensive bind
      conflict check.
      Signed-off-by: Josef Bacik <jbacik@fb.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • inet: split inet_csk_get_port into two functions · 289141b7
      Committed by Josef Bacik
      inet_csk_get_port does two different things: it either scans for an open port,
      or it tries to see if the specified port is available for use.  Since these two
      operations have different rules and are basically independent, let's split them
      into two different functions to make them both more readable.
      Signed-off-by: Josef Bacik <jbacik@fb.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • inet: don't check for bind conflicts twice when searching for a port · 6cd66616
      Committed by Josef Bacik
      This is just wasted time: we've already found a tb that doesn't have a bind
      conflict, and we don't drop the head lock, so scanning again isn't going to give
      us a different answer.  Instead move the tb->reuse setting logic outside of the
      found_tb path and put it in the success: path.  Then make it so that we don't
      goto again if we find a bind conflict in the found_tb path, as we won't reach
      this anymore when we are scanning for an ephemeral port.
      Signed-off-by: Josef Bacik <jbacik@fb.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • inet: kill smallest_size and smallest_port · b9470c27
      Committed by Josef Bacik
      In inet_csk_get_port we seem to be using smallest_port to figure out the best
      place to look for a SO_REUSEPORT sk that matches with an existing set of
      SO_REUSEPORT sockets.  However, if we get to the logic
      
      if (smallest_size != -1) {
      	port = smallest_port;
      	goto have_port;
      }
      
      we will do a useless search, because we would have already done the
      inet_csk_bind_conflict for that port and it would have returned 1, otherwise we
      would have gone to found_tb and succeeded.  Since this logic makes us do yet
      another trip through inet_csk_bind_conflict for a port we know won't work, just
      delete this code and save us the time.
      Signed-off-by: Josef Bacik <jbacik@fb.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • inet: drop ->bind_conflict · aa078842
      Committed by Josef Bacik
      The only difference between inet6_csk_bind_conflict and inet_csk_bind_conflict
      is how they check the rcv_saddr, so delete this callback and simply
      change inet_csk_bind_conflict to call inet_rcv_saddr_equal.
      Signed-off-by: Josef Bacik <jbacik@fb.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • inet: collapse ipv4/v6 rcv_saddr_equal functions into one · fe38d2a1
      Committed by Josef Bacik
      We pass these per-protocol equal functions around in various places, but
      we can just have one function that checks the sk->sk_family and then does
      the right comparison.  I've also changed the ipv4 version to
      not cast to inet_sock since it is unneeded.
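      A small user-space analogue of the idea (hypothetical helper, not the
      kernel function, which operates on struct sock): one comparison routine
      that dispatches on the address family instead of per-protocol callbacks.

        #include <netinet/in.h>
        #include <stdbool.h>
        #include <string.h>
        #include <sys/socket.h>

        static bool rcv_saddr_equal(const struct sockaddr_storage *a,
                                    const struct sockaddr_storage *b)
        {
            if (a->ss_family != b->ss_family)
                return false;
            if (a->ss_family == AF_INET6)
                return memcmp(&((const struct sockaddr_in6 *)a)->sin6_addr,
                              &((const struct sockaddr_in6 *)b)->sin6_addr,
                              sizeof(struct in6_addr)) == 0;
            return ((const struct sockaddr_in *)a)->sin_addr.s_addr ==
                   ((const struct sockaddr_in *)b)->sin_addr.s_addr;
        }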
      Signed-off-by: Josef Bacik <jbacik@fb.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  8. 18 January 2017 (2 commits)
    • tcp: accept RST for rcv_nxt - 1 after receiving a FIN · 0e40f4c9
      Committed by Jason Baron
      Using a Mac OSX box as a client connecting to a Linux server, we have found
      that when certain applications (such as 'ab') are abruptly terminated
      (via ^C), a FIN is sent followed by a RST packet on tcp connections. The
      FIN is accepted by the Linux stack but the RST is sent with the same
      sequence number as the FIN, and Linux responds with a challenge ACK per
      RFC 5961. The OSX client then sometimes (they are rate-limited) does not
      reply with any RST, as would be expected on a closed socket.
      
      This results in sockets accumulating on the Linux server left mostly in
      the CLOSE_WAIT state, although LAST_ACK and CLOSING are also possible.
      This sequence of events can tie up a lot of resources on the Linux server
      since there may be a lot of data in write buffers at the time of the RST.
      Accepting a RST equal to rcv_nxt - 1, after we have already successfully
      processed a FIN, has made a significant difference for us in practice, by
      freeing up unneeded resources in a more expedient fashion.
      
      A packetdrill test demonstrating the behavior:
      
      // testing mac osx rst behavior
      
      // Establish a connection
      0.000 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
      0.000 setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
      0.000 bind(3, ..., ...) = 0
      0.000 listen(3, 1) = 0
      
      0.100 < S 0:0(0) win 32768 <mss 1460,nop,wscale 10>
      0.100 > S. 0:0(0) ack 1 <mss 1460,nop,wscale 5>
      0.200 < . 1:1(0) ack 1 win 32768
      0.200 accept(3, ..., ...) = 4
      
      // Client closes the connection
      0.300 < F. 1:1(0) ack 1 win 32768
      
      // now send rst with same sequence
      0.300 < R. 1:1(0) ack 1 win 32768
      
      // make sure we are in TCP_CLOSE
      0.400 %{
      assert tcpi_state == 7
      }%
      Signed-off-by: Jason Baron <jbaron@akamai.com>
      Cc: Eric Dumazet <edumazet@google.com>
      Acked-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: ping: Use right format specifier to avoid type casting · a7ef6715
      Committed by Gao Feng
      The inet_num is u16, so use %hu instead of casting it to int. And
      sk_bound_dev_if is actually an int, so it needn't be cast to int.
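      A self-contained illustration of the same rule (ordinary printf and
      stand-in variables, not the ping code):

        #include <stdint.h>
        #include <stdio.h>

        int main(void)
        {
            uint16_t num = 1025;      /* stands in for inet_num (u16) */
            int bound_dev_if = 3;     /* stands in for sk_bound_dev_if (int) */

            printf("num %d\n", (int)num);         /* old style: cast for %d */
            printf("num %hu\n", num);             /* %hu matches a u16 */
            printf("ifindex %d\n", bound_dev_if); /* already an int, no cast */
            return 0;
        }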
      Signed-off-by: Gao Feng <fgao@ikuai8.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  9. 16 January 2017 (1 commit)
    • netfilter: rpfilter: fix incorrect loopback packet judgment · 6443ebc3
      Committed by Liping Zhang
      Currently, we check the existing rtable in the PREROUTING hook: if
      RTCF_LOCAL is set, we assume that the packet is loopback.

      But this assumption is incorrect. For example, when a packet encapsulated
      in ipsec transport mode is received and routed to local, after
      decapsulation it is delivered to local again, and since the rtable
      has not been dropped, the RTCF_LOCAL check triggers even though the
      packet is not loopback.
      
      So for these normal loopback packets, we can check whether the in device
      is IFF_LOOPBACK or not. For these locally generated broadcast/multicast,
      we can check whether the skb->pkt_type is PACKET_LOOPBACK or not.
      
      Finally, there's a subtle difference between the nft fib expr and the
      xtables rpfilter extension: a user can add the following nft rule to do
      a strict rpfilter check:
        # nft add rule x y meta iif eth0 fib saddr . iif oif != eth0 drop
      
      So when the packet is loopback, it's better to store the in device
      instead of the LOOPBACK_IFINDEX, otherwise, after adding the above
      nft rule, locally generated broad/multicast packets will be dropped
      incorrectly.
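      A user-space rendering of the revised check described above (field
      names are illustrative; the in-kernel helper operates on the skb and
      the input net_device):

        #include <linux/if_packet.h>  /* PACKET_LOOPBACK */
        #include <net/if.h>           /* IFF_LOOPBACK */
        #include <stdbool.h>
        #include <stdio.h>

        static bool is_loopback(unsigned char pkt_type, unsigned int in_dev_flags)
        {
            return pkt_type == PACKET_LOOPBACK ||
                   (in_dev_flags & IFF_LOOPBACK);
        }

        int main(void)
        {
            printf("%d\n", is_loopback(PACKET_LOOPBACK, 0));  /* 1 */
            printf("%d\n", is_loopback(0, IFF_LOOPBACK));     /* 1 */
            printf("%d\n", is_loopback(0, 0));                /* 0 */
            return 0;
        }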
      
      Fixes: f83a7ea2 ("netfilter: xt_rpfilter: skip locally generated broadcast/multicast, too")
      Fixes: f6d0cbcf ("netfilter: nf_tables: add fib expression")
      Signed-off-by: Liping Zhang <zlpnobody@gmail.com>
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
  10. 14 January 2017 (14 commits)
    • tcp: disable fack by default · 94bdc978
      Committed by Yuchung Cheng
      This patch disables FACK by default as RACK is the successor of FACK
      (inspired by the insights behind FACK).
      
      FACK[1] in Linux works as follows: a packet P is deemed lost
      if packet Q of higher sequence is s/acked and P and Q are distant
      by at least dupthresh number of packets in sequence space.

      FACK is more aggressive than the IETF recommended recovery for SACK
      (RFC3517 A Conservative Selective Acknowledgment (SACK)-based Loss
       Recovery Algorithm for TCP), because a single SACK may trigger
      fast recovery. This obviously won't work well with reordering so
      FACK is dynamically disabled upon detecting reordering.
      
      RACK supersedes FACK by using time distance instead of sequence
      distance. On reordering, RACK waits for a quarter of the RTT after
      receiving a single SACK before starting recovery. (The timer can be
      made more adaptive in the future by measuring reordering distance in
      time, but currently RTT/4 seems to work well.) Once the recovery
      starts, RACK behaves almost like FACK because it reduces the
      reordering window to 1ms, so it fast retransmits quickly. In addition
      RACK can detect loss retransmission as it does not care about the
      packet sequences (being repeated or not), which is extremely useful
      when the connection is going through a traffic policer.
      
      Google server experiments indicate that disabling FACK after enabling
      RACK has negligible impact on the overall loss recovery performance,
      with more reordering events detected.  But we still keep the FACK
      implementation as a backup in case RACK has bugs and needs to be disabled.
      
      [1] M. Mathis, J. Mahdavi, "Forward Acknowledgment: Refining
      TCP Congestion Control," In Proceedings of SIGCOMM '96, August 1996.
      Signed-off-by: Yuchung Cheng <ycheng@google.com>
      Signed-off-by: Neal Cardwell <ncardwell@google.com>
      Acked-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • tcp: remove thin_dupack feature · 4a7f6009
      Committed by Yuchung Cheng
      Thin stream DUPACK is to start fast recovery on only one DUPACK,
      provided the connection is a thin stream (i.e., low inflight).  But
      this older feature is now subsumed by RACK. If a connection
      receives only a single DUPACK, RACK would arm a reordering timer
      and soon start fast recovery instead of a timeout if no further
      ACKs are received.

      The socket option (THIN_DUPACK) is kept as a nop for compatibility.
      Note that this patch does not change another thin-stream feature
      which enables linear RTO, although it might be good to generalize
      that in the future (i.e., linear RTO for, say, the first 3 retries).
      Signed-off-by: Yuchung Cheng <ycheng@google.com>
      Signed-off-by: Neal Cardwell <ncardwell@google.com>
      Acked-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • tcp: remove RFC4653 NCR · ac229dca
      Committed by Yuchung Cheng
      This patch removes the (partial) implementation of the aggressive
      limited transmit in RFC4653 TCP Non-Congestion Robustness (NCR).
      
      NCR is a mitigation to the problem created by the dynamic
      DUPACK threshold.  The current adaptive DUPACK threshold
      (tp->reordering) can cause timeouts by preventing fast recovery.
      For example, if the last packet of a cwnd burst was reordered, the
      threshold will be set to the size of cwnd. But if the next application
      burst is smaller than the threshold and has drops instead of reorderings,
      the sender would not trigger fast recovery but instead resorts to a
      timeout recovery.

      NCR mitigates this issue by additionally checking the number of DUPACKs
      against the current flight size. The technique is similar to
      the early retransmit RFC.
      
      With RACK loss detection, this mitigation is not needed, because RACK
      does not use DUPACK threshold to detect losses. RACK arms a reordering
      timer to fire at most a quarter RTT later to start fast recovery.
      Signed-off-by: Yuchung Cheng <ycheng@google.com>
      Signed-off-by: Neal Cardwell <ncardwell@google.com>
      Acked-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • tcp: remove early retransmit · bec41a11
      Committed by Yuchung Cheng
      This patch removes the support of RFC5827 early retransmit (i.e.,
      fast recovery on small inflight with <3 dupacks) because it is
      subsumed by the new RACK loss detection. More specifically when
      RACK receives DUPACKs, it'll arm a reordering timer to start fast
      recovery after a quarter of the (min)RTT, hence it covers early
      retransmit, except that RACK does not limit itself to a specific
      inflight or dupack count.
      Signed-off-by: Yuchung Cheng <ycheng@google.com>
      Signed-off-by: Neal Cardwell <ncardwell@google.com>
      Acked-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • tcp: remove forward retransmit feature · 840a3cbe
      Committed by Yuchung Cheng
      Forward retransmit is an esoteric feature in RFC3517 (condition(3)
      in the NextSeg()). Basically if a packet is not considered lost by
      the current criteria (# of dupacks etc), but the congestion window
      has room for more packets, then retransmit this packet.
      
      However it actually conflicts with the rest of the recovery design. For
      example, when reordering is detected we want to be conservative
      in retransmitting packets, but the forward-retransmit feature would
      break that by forcing more retransmissions. Also the implementation is
      fairly complicated inside the retransmission logic, inducing extra
      iterations in the write queue. With RACK, losses are detected
      in a timely manner and this heuristic is no longer necessary, so this
      patch removes the feature.
      Signed-off-by: Yuchung Cheng <ycheng@google.com>
      Signed-off-by: Neal Cardwell <ncardwell@google.com>
      Acked-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • tcp: extend F-RTO to catch more spurious timeouts · 89fe18e4
      Committed by Yuchung Cheng
      Current F-RTO reverts the cwnd reset whenever a never-retransmitted
      packet was (s)acked. The timeout can be declared spurious because
      the packets acknowledged with this ACK were transmitted before the
      timeout, so clearly not all the packets are lost to warrant resetting
      the cwnd.

      This nice detection does not really depend on F-RTO internals. This
      patch applies the detection universally. On Google servers this
      change detected 20% more spurious timeouts.
      Suggested-by: Neal Cardwell <ncardwell@google.com>
      Signed-off-by: Yuchung Cheng <ycheng@google.com>
      Signed-off-by: Neal Cardwell <ncardwell@google.com>
      Acked-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • tcp: enable RACK loss detection to trigger recovery · a0370b3f
      Committed by Yuchung Cheng
      This patch changes two things:
      
      1. Start fast recovery with RACK in addition to other heuristics
         (e.g., DUPACK threshold, FACK). Prior to this change RACK
         is enabled to detect losses only after recovery has been
         started by other algorithms.

      2. Disable TCP early retransmit. RACK subsumes early retransmit
         with the new reordering timer feature. A later patch in this
         series removes the early retransmit code.
      Signed-off-by: Yuchung Cheng <ycheng@google.com>
      Signed-off-by: Neal Cardwell <ncardwell@google.com>
      Acked-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • tcp: check undo conditions before detecting losses · 98e36d44
      Committed by Yuchung Cheng
      Currently RACK would mark loss before the undo operations in TCP
      loss recovery. This could incorrectly identify real losses as
      spurious. For example a sender first experiences a delay spike and
      then eventually some packets are lost due to buffer overrun.
      In this case, the sender should perform fast recovery b/c not all
      the packets were lost.

      But the sender may first trigger a (spurious) RTO and reset
      cwnd to 1. The following ACKs may be used to mark real losses by
      tcp_rack_mark_lost. Then in tcp_process_loss this ACK could trigger
      the F-RTO undo condition, unmark real losses and revert the cwnd
      reduction. If there are no more ACKs coming back, eventually the
      sender would timeout again instead of performing fast recovery.
      
      The patch fixes this incorrect process by always performing
      the undo checks before detecting losses.
      
      Fixes: 4f41b1c5 ("tcp: use RACK to detect losses")
      Signed-off-by: Yuchung Cheng <ycheng@google.com>
      Signed-off-by: Neal Cardwell <ncardwell@google.com>
      Acked-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • tcp: use sequence to break TS ties for RACK loss detection · 1d0833df
      Committed by Yuchung Cheng
      The packets inside a jumbo skb (e.g., TSO) share the same skb
      timestamp, even though they are sent sequentially on the wire. Since
      RACK is based on time, it cannot detect that some packets inside the
      same skb are lost.  However, we can leverage the packet sequence
      numbers as extended timestamps to detect losses. Therefore, when the
      RACK timestamp is identical to an skb's timestamp (i.e., one of the
      packets of the skb is acked or sacked), we use the sequence numbers
      of the acked and unacked packets to break ties.
      
      We can use the same sequence logic to advance RACK xmit time as
      well to detect more losses and avoid timeout.
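      For illustration, the tie-break can be written as a small, self-contained
      comparison (a hypothetical helper, not the kernel function):

        #include <stdbool.h>
        #include <stdint.h>
        #include <stdio.h>

        /* Order two transmissions by send time, falling back to sequence
         * numbers when the timestamps are equal (packets from one TSO skb). */
        static bool sent_after(int64_t t1, uint32_t seq1,
                               int64_t t2, uint32_t seq2)
        {
            return t1 > t2 || (t1 == t2 && (int32_t)(seq1 - seq2) > 0);
        }

        int main(void)
        {
            printf("%d\n", sent_after(100, 2000, 100, 1000));  /* 1: later seq */
            printf("%d\n", sent_after(100, 1000, 100, 2000));  /* 0 */
            return 0;
        }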
      Signed-off-by: Yuchung Cheng <ycheng@google.com>
      Signed-off-by: Neal Cardwell <ncardwell@google.com>
      Acked-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • tcp: add reordering timer in RACK loss detection · 57dde7f7
      Committed by Yuchung Cheng
      This patch makes RACK install a reordering timer when it suspects
      some packets might be lost, but wants to delay the decision
      a little bit to accommodate reordering.
      
      It does not create a new timer but instead repurposes the existing
      RTO timer, because both are meant to retransmit packets.
      Specifically it arms a timer ICSK_TIME_REO_TIMEOUT when
      the RACK timing check fails. The wait time is set to
      
        RACK.RTT + RACK.reo_wnd - (NOW - Packet.xmit_time) + fudge
      
      This translates to expecting a packet (Packet) should take
      (RACK.RTT + RACK.reo_wnd + fudge) to deliver after it was sent.
      
      When there are multiple packets that need a timer, we use one timer
      with the maximum timeout. Therefore the timer conservatively uses
      the maximum window to expire N packets by one timeout, instead of
      N timeouts to expire N packets sent at different times.
      
      The fudge factor is 2 jiffies to ensure that when the timer fires, all
      the suspected packets will have exceeded the deadline and be marked lost
      by tcp_rack_detect_loss(). It has to be at least 1 jiffy because the
      clock may tick between calling icsk_reset_xmit_timer(timeout) and the
      timer actually being armed. The next jiffy is to lower-bound the timeout
      to 2 jiffies when reo_wnd is < 1ms.
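      The wait-time arithmetic above can be illustrated with a self-contained
      sketch (hypothetical helper, millisecond units for readability):

        #include <stdio.h>

        /* Remaining time before Packet should be declared lost. */
        static long reo_timeout_ms(long rack_rtt, long reo_wnd, long now,
                                   long xmit_time, long fudge)
        {
            long remaining = rack_rtt + reo_wnd - (now - xmit_time) + fudge;
            return remaining > 0 ? remaining : 0;
        }

        int main(void)
        {
            /* sent 30ms ago, RTT 40ms, reo_wnd 1ms, fudge 2ms -> wait 13ms */
            printf("%ld\n", reo_timeout_ms(40, 1, 130, 100, 2));
            return 0;
        }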
      
      When the reordering timer fires (tcp_rack_reo_timeout): If we aren't
      in Recovery we'll enter fast recovery and force fast retransmit.
      This is very similar to the early retransmit (RFC5827) except RACK
      is not constrained to only enter recovery for small outstanding
      flights.
      Signed-off-by: Yuchung Cheng <ycheng@google.com>
      Signed-off-by: Neal Cardwell <ncardwell@google.com>
      Acked-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • tcp: record most recent RTT in RACK loss detection · deed7be7
      Committed by Yuchung Cheng
      Record the most recent RTT in RACK. It is often identical to the
      "ca_rtt_us" values in tcp_clean_rtx_queue. But when the packet has
      been retransmitted, RACK chooses to believe the ACK is for the
      (latest) retransmitted packet if the RTT is over the minimum RTT.
      
      This requires passing the arrival time of the most recent ACK to
      RACK routines. The timestamp is now recorded in the "ack_time"
      in tcp_sacktag_state during the ACK processing.
      
      This patch does not change the RACK algorithm itself. It only adds
      the RTT variable to prepare the next main patch.
      Signed-off-by: Yuchung Cheng <ycheng@google.com>
      Signed-off-by: Neal Cardwell <ncardwell@google.com>
      Acked-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • tcp: new helper for RACK to detect loss · e636f8b0
      Committed by Yuchung Cheng
      Create a new helper tcp_rack_detect_loss to prepare the upcoming
      RACK reordering timer patch.
      Signed-off-by: Yuchung Cheng <ycheng@google.com>
      Signed-off-by: Neal Cardwell <ncardwell@google.com>
      Acked-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • tcp: new helper function for RACK loss detection · db8da6bb
      Committed by Yuchung Cheng
      Create a new helper tcp_rack_mark_skb_lost to prepare the
      upcoming RACK reordering timer support.
      Signed-off-by: Yuchung Cheng <ycheng@google.com>
      Signed-off-by: Neal Cardwell <ncardwell@google.com>
      Acked-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • tcp: fix tcp_fastopen unaligned access complaints on sparc · 003c9410
      Committed by Shannon Nelson
      Fix up a data alignment issue on sparc by swapping the order
      of the cookie byte array field with the length field in
      struct tcp_fastopen_cookie, and making it a proper union
      to clean up the typecasting.
      
      This addresses log complaints like these:
          log_unaligned: 113 callbacks suppressed
          Kernel unaligned access at TPC[976490] tcp_try_fastopen+0x2d0/0x360
          Kernel unaligned access at TPC[9764ac] tcp_try_fastopen+0x2ec/0x360
          Kernel unaligned access at TPC[9764c8] tcp_try_fastopen+0x308/0x360
          Kernel unaligned access at TPC[9764e4] tcp_try_fastopen+0x324/0x360
          Kernel unaligned access at TPC[976490] tcp_try_fastopen+0x2d0/0x360
      
      Cc: Eric Dumazet <eric.dumazet@gmail.com>
      Signed-off-by: Shannon Nelson <shannon.nelson@oracle.com>
      Acked-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  11. 13 January 2017 (2 commits)
    • ipmr: improve hash scalability · 8fb472c0
      Committed by Nikolay Aleksandrov
      Recently we started using ipmr with thousands of entries and easily hit
      soft lockups on smaller devices. The reason is that the hash function
      uses the high-order bits from the src and dst, but those don't change in
      many common cases; also, the hash table has only 64 elements, so with
      thousands of entries it doesn't scale at all.
      This patch migrates the hash table to rhashtable, and in particular the
      rhl interface, which allows duplicate elements to be chained. This is
      needed because of the MFC_PROXY support (*,G; *,*,oif cases), which
      allows multiple duplicate entries to be added with different interfaces
      (IMO wrong, but it's been in for a long time).
      
      And here are some results from tests I've run in a VM:
       mr_table size (default, allocated for all namespaces):
        Before                    After
         49304 bytes               2400 bytes
      
       Add 65000 routes (the diff is much larger on smaller devices):
        Before                    After
         1m42s                     58s
      
       Forwarding 256 byte packets with 65000 routes (test done in a VM):
        Before                    After
         3 Mbps / ~1465 pps        122 Mbps / ~59000 pps
      
      As a bonus we no longer see the soft lockups on smaller devices which
      showed up even with 2000 entries before.
      Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: ipv4: fix table id in getroute response · 8a430ed5
      Committed by David Ahern
      rtm_table is an 8-bit field while table ids are allowed up to u32. Commit
      709772e6 ("net: Fix routing tables with id > 255 for legacy software")
      added the preference to set rtm_table in dumps to RT_TABLE_COMPAT if the
      table id is > 255. The table id returned on get route requests should do
      the same.
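      A self-contained sketch of the compat rule described above (a
      hypothetical helper, not the patch itself; RT_TABLE_COMPAT comes from
      the rtnetlink uapi header):

        #include <linux/rtnetlink.h>
        #include <stdint.h>
        #include <stdio.h>

        /* The 8-bit rtm_table field cannot carry ids above 255, so report
         * RT_TABLE_COMPAT there; the full id travels in the RTA_TABLE attr. */
        static uint8_t rtm_table_value(uint32_t table_id)
        {
            return table_id < 256 ? (uint8_t)table_id : RT_TABLE_COMPAT;
        }

        int main(void)
        {
            printf("%d %d\n", rtm_table_value(100), rtm_table_value(1000));
            return 0;
        }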
      
      Fixes: c36ba660 ("net: Allow user to get table id from route lookup")
      Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>