1. 15 8月, 2014 1 次提交
    • A
      tcp: don't use timestamp from repaired skb-s to calculate RTT (v2) · 9d186cac
      Andrey Vagin 提交于
      We don't know right timestamp for repaired skb-s. Wrong RTT estimations
      isn't good, because some congestion modules heavily depends on it.
      
      This patch adds the TCPCB_REPAIRED flag, which is included in
      TCPCB_RETRANS.
      
      Thanks to Eric for the advice how to fix this issue.
      
      This patch fixes the warning:
      [  879.562947] WARNING: CPU: 0 PID: 2825 at net/ipv4/tcp_input.c:3078 tcp_ack+0x11f5/0x1380()
      [  879.567253] CPU: 0 PID: 2825 Comm: socket-tcpbuf-l Not tainted 3.16.0-next-20140811 #1
      [  879.567829] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
      [  879.568177]  0000000000000000 00000000c532680c ffff880039643d00 ffffffff817aa2d2
      [  879.568776]  0000000000000000 ffff880039643d38 ffffffff8109afbd ffff880039d6ba80
      [  879.569386]  ffff88003a449800 000000002983d6bd 0000000000000000 000000002983d6bc
      [  879.569982] Call Trace:
      [  879.570264]  [<ffffffff817aa2d2>] dump_stack+0x4d/0x66
      [  879.570599]  [<ffffffff8109afbd>] warn_slowpath_common+0x7d/0xa0
      [  879.570935]  [<ffffffff8109b0ea>] warn_slowpath_null+0x1a/0x20
      [  879.571292]  [<ffffffff816d0a05>] tcp_ack+0x11f5/0x1380
      [  879.571614]  [<ffffffff816d10bd>] tcp_rcv_established+0x1ed/0x710
      [  879.571958]  [<ffffffff816dc9da>] tcp_v4_do_rcv+0x10a/0x370
      [  879.572315]  [<ffffffff81657459>] release_sock+0x89/0x1d0
      [  879.572642]  [<ffffffff816c81a0>] do_tcp_setsockopt.isra.36+0x120/0x860
      [  879.573000]  [<ffffffff8110a52e>] ? rcu_read_lock_held+0x6e/0x80
      [  879.573352]  [<ffffffff816c8912>] tcp_setsockopt+0x32/0x40
      [  879.573678]  [<ffffffff81654ac4>] sock_common_setsockopt+0x14/0x20
      [  879.574031]  [<ffffffff816537b0>] SyS_setsockopt+0x80/0xf0
      [  879.574393]  [<ffffffff817b40a9>] system_call_fastpath+0x16/0x1b
      [  879.574730] ---[ end trace a17cbc38eb8c5c00 ]---
      
      v2: moving setting of skb->when for repaired skb-s in tcp_write_xmit,
          where it's set for other skb-s.
      
      Fixes: 431a9124 ("tcp: timestamp SYN+DATA messages")
      Fixes: 740b0f18 ("tcp: switch rtt estimations to usec resolution")
      Cc: Eric Dumazet <edumazet@google.com>
      Cc: Pavel Emelyanov <xemul@parallels.com>
      Cc: "David S. Miller" <davem@davemloft.net>
      Signed-off-by: NAndrey Vagin <avagin@openvz.org>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      9d186cac
  2. 14 8月, 2014 2 次提交
  3. 09 8月, 2014 1 次提交
  4. 07 8月, 2014 3 次提交
  5. 06 8月, 2014 4 次提交
    • W
      net-timestamp: ACK timestamp for bytestreams · e1c8a607
      Willem de Bruijn 提交于
      Add SOF_TIMESTAMPING_TX_ACK, a request for a tstamp when the last byte
      in the send() call is acknowledged. It implements the feature for TCP.
      
      The timestamp is generated when the TCP socket cumulative ACK is moved
      beyond the tracked seqno for the first time. The feature ignores SACK
      and FACK, because those acknowledge the specific byte, but not
      necessarily the entire contents of the buffer up to that byte.
      Signed-off-by: NWillem de Bruijn <willemb@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      e1c8a607
    • W
      net-timestamp: TCP timestamping · 4ed2d765
      Willem de Bruijn 提交于
      TCP timestamping extends SO_TIMESTAMPING to bytestreams.
      
      Bytestreams do not have a 1:1 relationship between send() buffers and
      network packets. The feature interprets a send call on a bytestream as
      a request for a timestamp for the last byte in that send() buffer.
      
      The choice corresponds to a request for a timestamp when all bytes in
      the buffer have been sent. That assumption depends on in-order kernel
      transmission. This is the common case. That said, it is possible to
      construct a traffic shaping tree that would result in reordering.
      The guarantee is strong, then, but not ironclad.
      
      This implementation supports send and sendpages (splice). GSO replaces
      one large packet with multiple smaller packets. This patch also copies
      the option into the correct smaller packet.
      
      This patch does not yet support timestamping on data in an initial TCP
      Fast Open SYN, because that takes a very different data path.
      
      If ID generation in ee_data is enabled, bytestream timestamps return a
      byte offset, instead of the packet counter for datagrams.
      
      The implementation supports a single timestamp per packet. It silenty
      replaces requests for previous timestamps. To avoid missing tstamps,
      flush the tcp queue by disabling Nagle, cork and autocork. Missing
      tstamps can be detected by offset when the ee_data ID is enabled.
      
      Implementation details:
      
      - On GSO, the timestamping code can be included in the main loop. I
      moved it into its own loop to reduce the impact on the common case
      to a single branch.
      
      - To avoid leaking the absolute seqno to userspace, the offset
      returned in ee_data must always be relative. It is an offset between
      an skb and sk field. The first is always set (also for GSO & ACK).
      The second must also never be uninitialized. Only allow the ID
      option on sockets in the ESTABLISHED state, for which the seqno
      is available. Never reset it to zero (instead, move it to the
      current seqno when reenabling the option).
      Signed-off-by: NWillem de Bruijn <willemb@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      4ed2d765
    • W
      net-timestamp: add key to disambiguate concurrent datagrams · 09c2d251
      Willem de Bruijn 提交于
      Datagrams timestamped on transmission can coexist in the kernel stack
      and be reordered in packet scheduling. When reading looped datagrams
      from the socket error queue it is not always possible to unique
      correlate looped data with original send() call (for application
      level retransmits). Even if possible, it may be expensive and complex,
      requiring packet inspection.
      
      Introduce a data-independent ID mechanism to associate timestamps with
      send calls. Pass an ID alongside the timestamp in field ee_data of
      sock_extended_err.
      
      The ID is a simple 32 bit unsigned int that is associated with the
      socket and incremented on each send() call for which software tx
      timestamp generation is enabled.
      
      The feature is enabled only if SOF_TIMESTAMPING_OPT_ID is set, to
      avoid changing ee_data for existing applications that expect it 0.
      The counter is reset each time the flag is reenabled. Reenabling
      does not change the ID of already submitted data. It is possible
      to receive out of order IDs if the timestamp stream is not quiesced
      first.
      Signed-off-by: NWillem de Bruijn <willemb@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      09c2d251
    • N
      tcp: reduce spurious retransmits due to transient SACK reneging · 5ae344c9
      Neal Cardwell 提交于
      This commit reduces spurious retransmits due to apparent SACK reneging
      by only reacting to SACK reneging that persists for a short delay.
      
      When a sequence space hole at snd_una is filled, some TCP receivers
      send a series of ACKs as they apparently scan their out-of-order queue
      and cumulatively ACK all the packets that have now been consecutiveyly
      received. This is essentially misbehavior B in "Misbehaviors in TCP
      SACK generation" ACM SIGCOMM Computer Communication Review, April
      2011, so we suspect that this is from several common OSes (Windows
      2000, Windows Server 2003, Windows XP). However, this issue has also
      been seen in other cases, e.g. the netdev thread "TCP being hoodwinked
      into spurious retransmissions by lack of timestamps?" from March 2014,
      where the receiver was thought to be a BSD box.
      
      Since snd_una would temporarily be adjacent to a previously SACKed
      range in these scenarios, this receiver behavior triggered the Linux
      SACK reneging code path in the sender. This led the sender to clear
      the SACK scoreboard, enter CA_Loss, and spuriously retransmit
      (potentially) every packet from the entire write queue at line rate
      just a few milliseconds before the ACK for each packet arrives at the
      sender.
      
      To avoid such situations, now when a sender sees apparent reneging it
      does not yet retransmit, but rather adjusts the RTO timer to give the
      receiver a little time (max(RTT/2, 10ms)) to send us some more ACKs
      that will restore sanity to the SACK scoreboard. If the reneging
      persists until this RTO then, as before, we clear the SACK scoreboard
      and enter CA_Loss.
      
      A 10ms delay tolerates a receiver sending such a stream of ACKs at
      56Kbit/sec. And to allow for receivers with slower or more congested
      paths, we wait for at least RTT/2.
      
      We validated the resulting max(RTT/2, 10ms) delay formula with a mix
      of North American and South American Google web server traffic, and
      found that for ACKs displaying transient reneging:
      
       (1) 90% of inter-ACK delays were less than 10ms
       (2) 99% of inter-ACK delays were less than RTT/2
      
      In tests on Google web servers this commit reduced reneging events by
      75%-90% (as measured by the TcpExtTCPSACKReneging counter), without
      any measurable impact on latency for user HTTP and SPDY requests.
      Signed-off-by: NNeal Cardwell <ncardwell@google.com>
      Signed-off-by: NYuchung Cheng <ycheng@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      5ae344c9
  6. 05 8月, 2014 1 次提交
  7. 03 8月, 2014 5 次提交
  8. 01 8月, 2014 6 次提交
  9. 31 7月, 2014 3 次提交
  10. 30 7月, 2014 4 次提交
  11. 29 7月, 2014 1 次提交
    • E
      ip: make IP identifiers less predictable · 04ca6973
      Eric Dumazet 提交于
      In "Counting Packets Sent Between Arbitrary Internet Hosts", Jeffrey and
      Jedidiah describe ways exploiting linux IP identifier generation to
      infer whether two machines are exchanging packets.
      
      With commit 73f156a6 ("inetpeer: get rid of ip_id_count"), we
      changed IP id generation, but this does not really prevent this
      side-channel technique.
      
      This patch adds a random amount of perturbation so that IP identifiers
      for a given destination [1] are no longer monotonically increasing after
      an idle period.
      
      Note that prandom_u32_max(1) returns 0, so if generator is used at most
      once per jiffy, this patch inserts no hole in the ID suite and do not
      increase collision probability.
      
      This is jiffies based, so in the worst case (HZ=1000), the id can
      rollover after ~65 seconds of idle time, which should be fine.
      
      We also change the hash used in __ip_select_ident() to not only hash
      on daddr, but also saddr and protocol, so that ICMP probes can not be
      used to infer information for other protocols.
      
      For IPv6, adds saddr into the hash as well, but not nexthdr.
      
      If I ping the patched target, we can see ID are now hard to predict.
      
      21:57:11.008086 IP (...)
          A > target: ICMP echo request, seq 1, length 64
      21:57:11.010752 IP (... id 2081 ...)
          target > A: ICMP echo reply, seq 1, length 64
      
      21:57:12.013133 IP (...)
          A > target: ICMP echo request, seq 2, length 64
      21:57:12.015737 IP (... id 3039 ...)
          target > A: ICMP echo reply, seq 2, length 64
      
      21:57:13.016580 IP (...)
          A > target: ICMP echo request, seq 3, length 64
      21:57:13.019251 IP (... id 3437 ...)
          target > A: ICMP echo reply, seq 3, length 64
      
      [1] TCP sessions uses a per flow ID generator not changed by this patch.
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Reported-by: NJeffrey Knockel <jeffk@cs.unm.edu>
      Reported-by: NJedidiah R. Crandall <crandall@cs.unm.edu>
      Cc: Willy Tarreau <w@1wt.eu>
      Cc: Hannes Frederic Sowa <hannes@redhat.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      04ca6973
  12. 28 7月, 2014 9 次提交