1. 14 Jan, 2017 13 commits
    • tcp: disable fack by default · 94bdc978
      Yuchung Cheng committed
      This patch disables FACK by default as RACK is the successor of FACK
      (inspired by the insights behind FACK).
      
      FACK[1] in Linux works as follows: a packet P is deemed lost,
      if packet Q of higher sequence is s/acked and P and Q are distant
      by at least dupthresh number of packets in sequence space.
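
      The FACK rule above can be illustrated with a minimal sketch. It is
      not the kernel implementation: the function name is hypothetical and
      sequence numbers count whole packets rather than bytes, purely to keep
      the example short.

```python
def fack_lost(unacked_seqs, highest_sacked, dupthresh=3):
    """Return the outstanding packets that the FACK rule would deem lost.

    Illustrative only: sequence numbers are in units of packets.
    A packet P is deemed lost when a packet Q of higher sequence was
    s/acked and Q - P >= dupthresh.
    """
    return [p for p in unacked_seqs if highest_sacked - p >= dupthresh]


# Packets 1..4 are outstanding and a single SACK arrives for packet 5.
print(fack_lost([1, 2, 3, 4], highest_sacked=5))
```

      Here a single SACK is enough to mark packets 1 and 2 as lost, which is
      the aggressiveness the message describes.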
      
      FACK is more aggressive than the IETF recommended recovery for SACK
      (RFC3517 A Conservative Selective Acknowledgment (SACK)-based Loss
       Recovery Algorithm for TCP), because a single SACK may trigger
      fast recovery. This obviously won't work well with reordering so
      FACK is dynamically disabled upon detecting reordering.
      
      RACK supersedes FACK by using time distance instead of sequence
      distance. On reordering, RACK waits for a quarter of RTT after
      receiving a single SACK before starting recovery. (The timer can be
      made more adaptive in the future by measuring reordering distance in
      time, but currently RTT/4 seems to work well.) Once the recovery
      starts, RACK behaves almost like FACK because it reduces the
      reordering window to 1ms, so it fast retransmits quickly. In addition,
      RACK can detect lost retransmissions as it does not care about the
      packet sequences (being repeated or not), which is extremely useful
      when the connection is going through a traffic policer.
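
      A simplified sketch of the time-distance idea follows. The function
      names and microsecond units are assumptions made for illustration, and
      the real kernel check involves more state than shown here.

```python
def rack_reo_wnd_us(rtt_us, in_recovery):
    """Reordering window: RTT/4 before recovery, 1 ms once recovery starts."""
    return 1000 if in_recovery else rtt_us // 4


def rack_lost(pkt_xmit_us, rack_xmit_us, reo_wnd_us):
    """A packet is deemed lost if a packet sent more than reo_wnd later
    has already been delivered.

    rack_xmit_us is the send time of the most recently sent packet that
    is known to be delivered (hypothetical parameter name).
    """
    return rack_xmit_us - pkt_xmit_us > reo_wnd_us


rtt = 40_000  # 40 ms
# A packet sent 5 ms before the latest delivered packet:
print(rack_lost(0, 5_000, rack_reo_wnd_us(rtt, in_recovery=False)))
print(rack_lost(0, 5_000, rack_reo_wnd_us(rtt, in_recovery=True)))
```

      Sequence numbers never enter the decision, which is why a lost
      retransmission can be detected the same way as a lost original.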
      
      Google server experiments indicate that disabling FACK after enabling
      RACK has negligible impact on the overall loss recovery performance
      with more reordering events detected.  But we still keep the FACK
      implementation as a backup in case RACK has bugs and needs to be disabled.
      
      [1] M. Mathis, J. Mahdavi, "Forward Acknowledgment: Refining
      TCP Congestion Control," In Proceedings of SIGCOMM '96, August 1996.
      Signed-off-by: Yuchung Cheng <ycheng@google.com>
      Signed-off-by: Neal Cardwell <ncardwell@google.com>
      Acked-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • tcp: remove thin_dupack feature · 4a7f6009
      Yuchung Cheng committed
      Thin stream DUPACK is to start fast recovery on only one DUPACK
      provided the connection is a thin stream (i.e., low inflight).  But
      this older feature is now subsumed by RACK. If a connection
      receives only a single DUPACK, RACK would arm a reordering timer
      and soon start fast recovery instead of timeout if no further
      ACKs are received.

      The socket option (THIN_DUPACK) is kept as a nop for compatibility.
      Note that this patch does not change another thin-stream feature
      which enables linear RTO, although it might be good to generalize
      that in the future (i.e., linear RTO for the first say 3 retries).
      Signed-off-by: Yuchung Cheng <ycheng@google.com>
      Signed-off-by: Neal Cardwell <ncardwell@google.com>
      Acked-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • tcp: remove RFC4653 NCR · ac229dca
      Yuchung Cheng committed
      This patch removes the (partial) implementation of the aggressive
      limited transmit in RFC4653 TCP Non-Congestion Robustness (NCR).
      
      NCR is a mitigation to the problem created by the dynamic
      DUPACK threshold.  The current adaptive DUPACK threshold
      (tp->reordering) could cause timeouts by preventing fast recovery.
      For example, if the last packet of a cwnd burst was reordered, the
      threshold will be set to the size of cwnd. But if the next application
      burst is smaller than the threshold and has drops instead of
      reorderings, the sender would not trigger fast recovery but instead
      resort to a timeout recovery.

      NCR mitigates this issue by additionally checking the number of
      DUPACKs against the current flight size. The technique is similar to
      the early retransmit RFC.
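
      The failure mode in the example can be worked through with a small
      sketch. The helper names are hypothetical and packets are numbered
      from 0 within a burst; it only counts DUPACKs and compares them with
      a threshold, and does not model NCR itself.

```python
def max_dupacks(burst_size, dropped):
    """Count the DUPACKs a burst can elicit: one per packet delivered
    after the first hole (packets are numbered 0..burst_size-1)."""
    first_hole = min(dropped)
    return sum(1 for p in range(first_hole + 1, burst_size) if p not in dropped)


def triggers_fast_recovery(dupacks, dupthresh):
    """Classic DUPACK-threshold rule."""
    return dupacks >= dupthresh


# Earlier reordering raised the threshold to the cwnd size, say 10.
# The next burst has only 4 packets and the first one is dropped.
dupacks = max_dupacks(burst_size=4, dropped={0})
print(dupacks, triggers_fast_recovery(dupacks, dupthresh=10))
print(dupacks, triggers_fast_recovery(dupacks, dupthresh=3))
```

      With the inflated threshold the burst can never produce enough
      DUPACKs, so only a timeout recovers the loss.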
      
      With RACK loss detection, this mitigation is not needed, because RACK
      does not use DUPACK threshold to detect losses. RACK arms a reordering
      timer to fire at most a quarter RTT later to start fast recovery.
      Signed-off-by: Yuchung Cheng <ycheng@google.com>
      Signed-off-by: Neal Cardwell <ncardwell@google.com>
      Acked-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • tcp: remove early retransmit · bec41a11
      Yuchung Cheng committed
      This patch removes the support of RFC5827 early retransmit (i.e.,
      fast recovery on small inflight with <3 dupacks) because it is
      subsumed by the new RACK loss detection. More specifically when
      RACK receives DUPACKs, it'll arm a reordering timer to start fast
      recovery after a quarter of (min)RTT, hence it covers the early
      retransmit except RACK does not limit itself to specific inflight
      or dupack numbers.
      Signed-off-by: Yuchung Cheng <ycheng@google.com>
      Signed-off-by: Neal Cardwell <ncardwell@google.com>
      Acked-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • tcp: remove forward retransmit feature · 840a3cbe
      Yuchung Cheng committed
      Forward retransmit is an esoteric feature in RFC3517 (condition(3)
      in the NextSeg()). Basically if a packet is not considered lost by
      the current criteria (# of dupacks etc), but the congestion window
      has room for more packets, then retransmit this packet.
      
      However it actually conflicts with the rest of recovery design. For
      example, when reordering is detected we want to be conservative
      in retransmitting packets but the forward-retransmit feature would
      break that to force more retransmission. Also the implementation is
      fairly complicated inside the retransmission logic, inducing extra
      iterations in the write queue. With RACK, losses are detected in a
      timely manner and this heuristic is no longer necessary. Therefore
      this patch removes the feature.
      Signed-off-by: Yuchung Cheng <ycheng@google.com>
      Signed-off-by: Neal Cardwell <ncardwell@google.com>
      Acked-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • tcp: extend F-RTO to catch more spurious timeouts · 89fe18e4
      Yuchung Cheng committed
      Current F-RTO reverts cwnd reset whenever a never-retransmitted
      packet was (s)acked. The timeout can be declared spurious because
      the packets acknowledged with this ACK were transmitted before the
      timeout, so clearly not all the packets were lost and resetting the
      cwnd was not warranted.

      This nice detection does not really depend on F-RTO internals. This
      patch applies the detection universally. On Google servers this
      change detected 20% more spurious timeouts.
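
      The detection condition can be sketched as below. The dataclass and
      its field names are hypothetical, chosen only to mirror the wording of
      the message; the kernel tracks this with flags during ACK processing.

```python
from dataclasses import dataclass


@dataclass
class AckedPacket:
    ever_retransmitted: bool   # was this packet ever retransmitted?
    sent_before_timeout: bool  # was its (only) transmission before the RTO?


def timeout_was_spurious(newly_acked):
    """The timeout is declared spurious if this ACK (s)acks at least one
    never-retransmitted packet that was sent before the timeout."""
    return any(
        not p.ever_retransmitted and p.sent_before_timeout for p in newly_acked
    )


print(timeout_was_spurious([AckedPacket(True, True), AckedPacket(False, True)]))
print(timeout_was_spurious([AckedPacket(True, True)]))
```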
      Suggested-by: Neal Cardwell <ncardwell@google.com>
      Signed-off-by: Yuchung Cheng <ycheng@google.com>
      Signed-off-by: Neal Cardwell <ncardwell@google.com>
      Acked-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • tcp: enable RACK loss detection to trigger recovery · a0370b3f
      Yuchung Cheng committed
      This patch changes two things:
      
      1. Start fast recovery with RACK in addition to other heuristics
         (e.g., DUPACK threshold, FACK). Prior to this change RACK
         is enabled to detect losses only after the recovery has
         started by other algorithms.
      
      2. Disable TCP early retransmit. RACK subsumes the early retransmit
         with the new reordering timer feature. A later patch in this
         series removes the early retransmit code.
      Signed-off-by: Yuchung Cheng <ycheng@google.com>
      Signed-off-by: Neal Cardwell <ncardwell@google.com>
      Acked-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • tcp: check undo conditions before detecting losses · 98e36d44
      Yuchung Cheng committed
      Currently RACK would mark loss before the undo operations in TCP
      loss recovery. This could incorrectly identify real losses as
      spurious. For example, a sender first experiences a delay spike and
      then eventually some packets are lost due to buffer overrun.
      In this case, the sender should perform fast recovery because not all
      the packets were lost.

      But the sender may first trigger a (spurious) RTO and reset
      cwnd to 1. The following ACKs may be used to mark real losses by
      tcp_rack_mark_lost. Then in tcp_process_loss this ACK could trigger
      the F-RTO undo condition, unmark real losses and revert the cwnd
      reduction. If there are no more ACKs coming back, eventually the
      sender would time out again instead of performing fast recovery.
      
      The patch fixes this incorrect process by always performing
      the undo checks before detecting losses.
      
      Fixes: 4f41b1c5 ("tcp: use RACK to detect losses")
      Signed-off-by: Yuchung Cheng <ycheng@google.com>
      Signed-off-by: Neal Cardwell <ncardwell@google.com>
      Acked-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • tcp: use sequence to break TS ties for RACK loss detection · 1d0833df
      Yuchung Cheng committed
      The packets inside a jumbo skb (e.g., TSO) share the same skb
      timestamp, even though they are sent sequentially on the wire. Since
      RACK is based on time, it cannot detect that some packets inside the
      same skb are lost.  However, we can leverage the packet sequence
      numbers as extended timestamps to detect losses. Therefore, when
      RACK timestamp is identical to skb's timestamp (i.e., one of the
      packets of the skb is acked or sacked), we use the sequence numbers
      of the acked and unacked packets to break ties.
      
      We can use the same sequence logic to advance RACK xmit time as
      well to detect more losses and avoid timeout.
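
      The tie-breaking comparison might look like the following sketch. The
      function names are illustrative, and the 32-bit wraparound handling is
      an assumption about how TCP sequence numbers would be compared.

```python
def seq_after(seq1, seq2):
    """True if 32-bit sequence number seq1 is after seq2 (wraparound-safe)."""
    return 0 < ((seq1 - seq2) & 0xFFFFFFFF) < 0x80000000


def sent_after(t1, seq1, t2, seq2):
    """Order two packets by transmit timestamp, using the sequence number
    as an extended timestamp when the timestamps are identical."""
    return t1 > t2 or (t1 == t2 and seq_after(seq1, seq2))


# Two packets from the same TSO skb share timestamp 100.
print(sent_after(100, 2000, 100, 1000))  # higher sequence counts as later
print(sent_after(100, 1000, 100, 2000))
```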
      Signed-off-by: Yuchung Cheng <ycheng@google.com>
      Signed-off-by: Neal Cardwell <ncardwell@google.com>
      Acked-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • tcp: add reordering timer in RACK loss detection · 57dde7f7
      Yuchung Cheng committed
      This patch makes RACK install a reordering timer when it suspects
      some packets might be lost, but wants to delay the decision
      a little bit to accommodate reordering.
      
      It does not create a new timer but instead repurposes the existing
      RTO timer, because both are meant to retransmit packets.
      Specifically it arms a timer ICSK_TIME_REO_TIMEOUT when
      the RACK timing check fails. The wait time is set to
      
        RACK.RTT + RACK.reo_wnd - (NOW - Packet.xmit_time) + fudge
      
      This translates to expecting a packet (Packet) should take
      (RACK.RTT + RACK.reo_wnd + fudge) to deliver after it was sent.
      
      When there are multiple packets that need a timer, we use one timer
      with the maximum timeout. Therefore the timer conservatively uses
      the maximum window to expire N packets by one timeout, instead of
      N timeouts to expire N packets sent at different times.
      
      The fudge factor is 2 jiffies to ensure that when the timer fires, all
      the suspected packets would exceed the deadline and be marked lost
      by tcp_rack_detect_loss(). It has to be at least 1 jiffy because the
      clock may tick between calling icsk_reset_xmit_timer(timeout) and
      actually hanging the timer. The other jiffy is to lower-bound the
      timeout to 2 jiffies when reo_wnd is < 1ms.
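
      The wait-time formula and the single-timer rule can be sketched as
      follows. This is an illustration only: the function name is
      hypothetical and all quantities, including the fudge, share one
      abstract time unit, whereas the kernel mixes microseconds and jiffies.

```python
FUDGE = 2  # the 2-jiffy fudge factor, in the sketch's abstract time unit


def reo_timeout(now, rack_rtt, reo_wnd, xmit_times, fudge=FUDGE):
    """Return one timeout covering every packet that still needs a timer.

    Per packet: RACK.RTT + RACK.reo_wnd - (NOW - Packet.xmit_time).
    Packets whose remaining time is not positive are already past their
    deadline and need no timer. Returns None if no timer is needed.
    """
    remaining = [rack_rtt + reo_wnd - (now - t) for t in xmit_times]
    pending = [r for r in remaining if r > 0]
    if not pending:
        return None
    # One timer with the maximum timeout expires all N packets at once.
    return max(pending) + fudge


print(reo_timeout(now=100, rack_rtt=40, reo_wnd=10, xmit_times=[55, 70, 90]))
```

      For the three packets above the remaining times are 5, 20 and 40, so
      one timer of 42 units is armed.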
      
      When the reordering timer fires (tcp_rack_reo_timeout): If we aren't
      in Recovery we'll enter fast recovery and force fast retransmit.
      This is very similar to the early retransmit (RFC5827) except RACK
      is not constrained to only enter recovery for small outstanding
      flights.
      Signed-off-by: Yuchung Cheng <ycheng@google.com>
      Signed-off-by: Neal Cardwell <ncardwell@google.com>
      Acked-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • tcp: record most recent RTT in RACK loss detection · deed7be7
      Yuchung Cheng committed
      Record the most recent RTT in RACK. It is often identical to the
      "ca_rtt_us" values in tcp_clean_rtx_queue. But when the packet has
      been retransmitted, RACK chooses to believe the ACK is for the
      (latest) retransmitted packet if the RTT is over minimum RTT.
      
      This requires passing the arrival time of the most recent ACK to
      RACK routines. The timestamp is now recorded in the "ack_time"
      in tcp_sacktag_state during the ACK processing.
      
      This patch does not change the RACK algorithm itself. It only adds
      the RTT variable to prepare the next main patch.
      Signed-off-by: Yuchung Cheng <ycheng@google.com>
      Signed-off-by: Neal Cardwell <ncardwell@google.com>
      Acked-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • tcp: new helper for RACK to detect loss · e636f8b0
      Yuchung Cheng committed
      Create a new helper tcp_rack_detect_loss to prepare the upcoming
      RACK reordering timer patch.
      Signed-off-by: Yuchung Cheng <ycheng@google.com>
      Signed-off-by: Neal Cardwell <ncardwell@google.com>
      Acked-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • tcp: new helper function for RACK loss detection · db8da6bb
      Yuchung Cheng committed
      Create a new helper tcp_rack_mark_skb_lost to prepare the
      upcoming RACK reordering timer support.
      Signed-off-by: Yuchung Cheng <ycheng@google.com>
      Signed-off-by: Neal Cardwell <ncardwell@google.com>
      Acked-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  2. 13 Jan, 2017 1 commit
    • ipmr: improve hash scalability · 8fb472c0
      Nikolay Aleksandrov committed
      Recently we started using ipmr with thousands of entries and easily hit
      soft lockups on smaller devices. The reason is that the hash function
      uses the high order bits from the src and dst, but those don't change in
      many common cases. Also, the hash table has only 64 elements, so with
      thousands of entries it doesn't scale at all.
      This patch migrates the hash table to rhashtable, and in particular the
      rhl interface, which allows for duplicate elements to be chained. This is
      needed because of the MFC_PROXY support (*,G; *,*,oif cases), which
      allows for multiple duplicate entries to be added with different
      interfaces (IMO wrong, but it's been in for a long time).
      
      And here are some results from tests I've run in a VM:
       mr_table size (default, allocated for all namespaces):
        Before                    After
         49304 bytes               2400 bytes
      
       Add 65000 routes (the diff is much larger on smaller devices):
        Before                    After
         1m42s                     58s
      
       Forwarding 256 byte packets with 65000 routes (test done in a VM):
        Before                    After
         3 Mbps / ~1465 pps        122 Mbps / ~59000 pps
      
      As a bonus we no longer see the soft lockups on smaller devices which
      showed up even with 2000 entries before.
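
      The scaling problem with the old table can be illustrated with a toy
      hash. The function below is a hypothetical stand-in, not the kernel's
      actual hash; it only shares the two properties named in the message:
      it looks at high-order address bits and it has 64 buckets.

```python
from collections import Counter

BUCKETS = 64  # the old table size mentioned in the message


def toy_high_bits_hash(src, dst):
    """Hypothetical hash that uses only high-order bits of src and dst."""
    return ((src >> 24) ^ (dst >> 26)) % BUCKETS


def ip(a, b, c, d):
    """Pack a dotted quad into a 32-bit integer."""
    return (a << 24) | (b << 16) | (c << 8) | d


# 1000 (S,G) entries: sources in 10.0.0.0/24, groups in 239.1.0.0/16.
entries = [(ip(10, 0, 0, s), ip(239, 1, g, 1))
           for s in range(1, 5) for g in range(250)]
counts = Counter(toy_high_bits_hash(src, dst) for src, dst in entries)
print(len(counts), max(counts.values()))
```

      Because the high-order bits never change across these entries, every
      one of them lands in the same bucket and each lookup degenerates into
      a walk over a 1000-element chain.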
      Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  3. 11 Jan, 2017 1 commit
    • net: ipv4: Fix multipath selection with vrf · 7a18c5b9
      David Ahern committed
      fib_select_path does not call fib_select_multipath if oif is set in the
      flow struct. For VRF use cases oif is always set, so multipath route
      selection is bypassed. Use the FLOWI_FLAG_SKIP_NH_OIF to skip the oif
      check similar to what is done in fib_table_lookup.
      
      Add saddr and proto to the flow struct for the fib lookup done by the
      VRF driver to better match hash computation for a flow.
      
      Fixes: 613d09b3 ("net: Use VRF device index for lookups on TX")
      Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  4. 10 Jan, 2017 9 commits
  5. 09 Jan, 2017 3 commits
  6. 07 Jan, 2017 3 commits
  7. 06 Jan, 2017 1 commit
    • tcp: provide timestamps for partial writes · ad02c4f5
      Soheil Hassas Yeganeh committed
      For TCP sockets, TX timestamps are only captured when the user data
      is successfully and fully written to the socket. In many cases,
      however, TCP writes can be partial for which no timestamp is
      collected.
      
      Collect timestamps whenever any user data is (fully or partially)
      copied into the socket. Pass tcp_write_queue_tail to tcp_tx_timestamp
      instead of the local skb pointer since it can be set to NULL on
      the error path.
      
      Note that tcp_write_queue_tail can be NULL, even if bytes have been
      copied to the socket. This is because acknowledgements are being
      processed in tcp_sendmsg(), and by the time tcp_tx_timestamp is
      called tcp_write_queue_tail can be NULL. For such cases, this patch
      does not collect any timestamps (i.e., it is best-effort).
      
      This patch is written with suggestions from Willem de Bruijn and
      Eric Dumazet.
      
      Change-log V1 -> V2:
      	- Use sockc.tsflags instead of sk->sk_tsflags.
      	- Use the same code path for normal writes and errors.
      Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
      Acked-by: Yuchung Cheng <ycheng@google.com>
      Cc: Willem de Bruijn <willemb@google.com>
      Cc: Eric Dumazet <edumazet@google.com>
      Cc: Neal Cardwell <ncardwell@google.com>
      Cc: Martin KaFai Lau <kafai@fb.com>
      Acked-by: Willem de Bruijn <willemb@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  8. 05 Jan, 2017 1 commit
  9. 03 Jan, 2017 3 commits
    • ipmr, ip6mr: add RTNH_F_UNRESOLVED flag to unresolved cache entries · 1708ebc9
      Nikolay Aleksandrov committed
      While working with ipmr, we noticed that it is impossible to determine
      if an entry is actually unresolved or its IIF interface has disappeared
      (e.g. virtual interface got deleted). These entries look almost
      identical to user-space when dumping or receiving notifications. So in
      order to recognize them add a new RTNH_F_UNRESOLVED flag which is set when
      sending an unresolved cache entry to user-space.
      Suggested-by: Roopa Prabhu <roopa@cumulusnetworks.com>
      Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • ipv4: Do not allow MAIN to be alias for new LOCAL w/ custom rules · 5350d54f
      Alexander Duyck committed
      In the case of custom rules being present we need to handle the case of the
      LOCAL table being initialized after the new rule has been added.  To address
      that I am adding a new check so that we can make certain we don't use an
      alias of MAIN for LOCAL when allocating a new table.
      
      Fixes: 0ddcf43d ("ipv4: FIB Local/MAIN table collapse")
      Reported-by: Oliver Brunel <jjk@jjacky.com>
      Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • igmp: Make igmp group member RFC 3376 compliant · 7ababb78
      Michal Tesar committed
      5.2. Action on Reception of a Query
      
       When a system receives a Query, it does not respond immediately.
       Instead, it delays its response by a random amount of time, bounded
       by the Max Resp Time value derived from the Max Resp Code in the
       received Query message.  A system may receive a variety of Queries on
       different interfaces and of different kinds (e.g., General Queries,
       Group-Specific Queries, and Group-and-Source-Specific Queries), each
       of which may require its own delayed response.
      
       Before scheduling a response to a Query, the system must first
       consider previously scheduled pending responses and in many cases
       schedule a combined response.  Therefore, the system must be able to
       maintain the following state:
      
       o A timer per interface for scheduling responses to General Queries.
      
       o A per-group and interface timer for scheduling responses to Group-
         Specific and Group-and-Source-Specific Queries.
      
       o A per-group and interface list of sources to be reported in the
         response to a Group-and-Source-Specific Query.
      
       When a new Query with the Router-Alert option arrives on an
       interface, provided the system has state to report, a delay for a
       response is randomly selected in the range (0, [Max Resp Time]) where
       Max Resp Time is derived from Max Resp Code in the received Query
       message.  The following rules are then used to determine if a Report
       needs to be scheduled and the type of Report to schedule.  The rules
       are considered in order and only the first matching rule is applied.
      
       1. If there is a pending response to a previous General Query
          scheduled sooner than the selected delay, no additional response
          needs to be scheduled.
      
       2. If the received Query is a General Query, the interface timer is
          used to schedule a response to the General Query after the
          selected delay.  Any previously pending response to a General
          Query is canceled.
      --8<--
      
      Currently the timer is rearmed with a new random expiration time for
      every incoming query regardless of a possibly already pending report,
      which is not aligned with the above RFC.
      It also might happen that a higher rate of incoming queries can
      postpone the report past the expiration time of the first query,
      causing group membership loss.
      
      Now the per interface general query timer is rearmed only
      when there is no pending report already scheduled on that interface or
      the newly selected expiration time is before the already pending
      scheduled report.
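
      The new rearm rule can be sketched as below. The function and
      parameter names are hypothetical, and the injectable random source
      exists only so the behaviour can be shown deterministically.

```python
import random


def schedule_general_query_report(pending_expiry, now, max_resp_time,
                                  rng=random.random):
    """Return the expiry time of the per-interface general query timer.

    pending_expiry is None when no report is scheduled. The timer is
    rearmed only if nothing is pending or the newly selected expiry is
    earlier than the pending one.
    """
    candidate = now + rng() * max_resp_time
    if pending_expiry is not None and pending_expiry <= candidate:
        return pending_expiry  # keep the already scheduled report
    return candidate


half = lambda: 0.5  # fixed "random" draw for the demonstration
print(schedule_general_query_report(None, 0, 10, rng=half))
print(schedule_general_query_report(3.0, 0, 10, rng=half))
print(schedule_general_query_report(8.0, 0, 10, rng=half))
```

      A stream of queries can therefore only pull the report earlier, never
      push it later.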
      Signed-off-by: Michal Tesar <mtesar@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  10. 31 Dec, 2016 1 commit
    • net: Allow IP_MULTICAST_IF to set index to L3 slave · 7bb387c5
      David Ahern committed
      IP_MULTICAST_IF fails if sk_bound_dev_if is already set and the new index
      does not match it. e.g.,
      
          ntpd[15381]: setsockopt IP_MULTICAST_IF 192.168.1.23 fails: Invalid argument
      
      Relax the check in setsockopt to allow setting mc_index to an L3 slave if
      sk_bound_dev_if points to an L3 master.
      
      Make a similar change for IPv6. In this case change the device lookup to
      take the rcu_read_lock avoiding a refcnt. The rcu lock is also needed for
      the lookup of a potential L3 master device.
      
      This really only silences a setsockopt failure since uses of mc_index are
      secondary to sk_bound_dev_if if it is set. In both cases, if either index
      is an L3 slave or master, lookups are directed to the same FIB table so
      relaxing the check at setsockopt time causes no harm.
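
      The relaxed check could be modelled roughly as follows. The function
      name and the dictionary mapping a device index to its L3 master index
      are assumptions made for illustration, not kernel interfaces.

```python
def multicast_if_allowed(bound_dev_if, new_index, l3_master_of):
    """Decide whether IP_MULTICAST_IF may set mc_index to new_index.

    bound_dev_if: sk_bound_dev_if, 0 when the socket is not bound.
    l3_master_of: hypothetical map from an L3 slave ifindex to its master.
    """
    if not bound_dev_if:
        return True                      # socket not bound to a device
    if new_index == bound_dev_if:
        return True                      # same device as before the patch
    # Relaxed case: new_index is an L3 slave of the bound L3 master.
    return l3_master_of.get(new_index) == bound_dev_if


masters = {3: 10}  # ifindex 3 is enslaved to L3 master ifindex 10
print(multicast_if_allowed(10, 3, masters))
print(multicast_if_allowed(10, 4, masters))
```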
      
      Patch is based on a suggested change by Darwin for a problem noted in
      their code base.
      Suggested-by: Darwin Dingel <darwin.dingel@alliedtelesis.co.nz>
      Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  11. 30 Dec, 2016 4 commits