You need to sign in or sign up before continuing.
  1. 03 12月, 2016 1 次提交
    • F
      tcp: randomize tcp timestamp offsets for each connection · 95a22cae
      Florian Westphal 提交于
      jiffies based timestamps allow for easy inference of number of devices
      behind NAT translators and also makes tracking of hosts simpler.
      
      commit ceaa1fef ("tcp: adding a per-socket timestamp offset")
      added the main infrastructure that is needed for per-connection ts
      randomization, in particular writing/reading the on-wire tcp header
      format takes the offset into account so rest of stack can use normal
      tcp_time_stamp (jiffies).
      
      So only two items are left:
       - add a tsoffset for request sockets
       - extend the tcp isn generator to also return another 32bit number
         in addition to the ISN.
      
      Re-use of ISN generator also means timestamps are still monotonically
      increasing for same connection quadruple, i.e. PAWS will still work.
      
      Includes fixes from Eric Dumazet.
      Signed-off-by: NFlorian Westphal <fw@strlen.de>
      Acked-by: NEric Dumazet <edumazet@google.com>
      Acked-by: NYuchung Cheng <ycheng@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      95a22cae
  2. 21 9月, 2016 2 次提交
    • S
      tcp: track application-limited rate samples · d7722e85
      Soheil Hassas Yeganeh 提交于
      This commit adds code to track whether the delivery rate represented
      by each rate_sample was limited by the application.
      
      Upon each transmit, we store in the is_app_limited field in the skb a
      boolean bit indicating whether there is a known "bubble in the pipe":
      a point in the rate sample interval where the sender was
      application-limited, and did not transmit even though the cwnd and
      pacing rate allowed it.
      
      This logic marks the flow app-limited on a write if *all* of the
      following are true:
      
        1) There is less than 1 MSS of unsent data in the write queue
           available to transmit.
      
        2) There is no packet in the sender's queues (e.g. in fq or the NIC
           tx queue).
      
        3) The connection is not limited by cwnd.
      
        4) There are no lost packets to retransmit.
      
      The tcp_rate_check_app_limited() code in tcp_rate.c determines whether
      the connection is application-limited at the moment. If the flow is
      application-limited, it sets the tp->app_limited field. If the flow is
      application-limited then that means there is effectively a "bubble" of
      silence in the pipe now, and this silence will be reflected in a lower
      bandwidth sample for any rate samples from now until we get an ACK
      indicating this bubble has exited the pipe: specifically, until we get
      an ACK for the next packet we transmit.
      
      When we send every skb we record in scb->tx.is_app_limited whether the
      resulting rate sample will be application-limited.
      
      The code in tcp_rate_gen() checks to see when it is safe to mark all
      known application-limited bubbles of silence as having exited the
      pipe. It does this by checking to see when the delivered count moves
      past the tp->app_limited marker. At this point it zeroes the
      tp->app_limited marker, as all known bubbles are out of the pipe.
      
      We make room for the tx.is_app_limited bit in the skb by borrowing a
      bit from the in_flight field used by NV to record the number of bytes
      in flight. The receive window in the TCP header is 16 bits, and the
      max receive window scaling shift factor is 14 (RFC 1323). So the max
      receive window offered by the TCP protocol is 2^(16+14) = 2^30. So we
      only need 30 bits for the tx.in_flight used by NV.
      Signed-off-by: NVan Jacobson <vanj@google.com>
      Signed-off-by: NNeal Cardwell <ncardwell@google.com>
      Signed-off-by: NYuchung Cheng <ycheng@google.com>
      Signed-off-by: NNandita Dukkipati <nanditad@google.com>
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NSoheil Hassas Yeganeh <soheil@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      d7722e85
    • N
      tcp: use windowed min filter library for TCP min_rtt estimation · 64033892
      Neal Cardwell 提交于
      Refactor the TCP min_rtt code to reuse the new win_minmax library in
      lib/win_minmax.c to simplify the TCP code.
      
      This is a pure refactor: the functionality is exactly the same. We
      just moved the windowed min code to make TCP easier to read and
      maintain, and to allow other parts of the kernel to use the windowed
      min/max filter code.
      Signed-off-by: NVan Jacobson <vanj@google.com>
      Signed-off-by: NNeal Cardwell <ncardwell@google.com>
      Signed-off-by: NYuchung Cheng <ycheng@google.com>
      Signed-off-by: NNandita Dukkipati <nanditad@google.com>
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NSoheil Hassas Yeganeh <soheil@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      64033892
  3. 09 9月, 2016 1 次提交
    • Y
      tcp: use an RB tree for ooo receive queue · 9f5afeae
      Yaogong Wang 提交于
      Over the years, TCP BDP has increased by several orders of magnitude,
      and some people are considering to reach the 2 Gbytes limit.
      
      Even with current window scale limit of 14, ~1 Gbytes maps to ~740,000
      MSS.
      
      In presence of packet losses (or reorders), TCP stores incoming packets
      into an out of order queue, and number of skbs sitting there waiting for
      the missing packets to be received can be in the 10^5 range.
      
      Most packets are appended to the tail of this queue, and when
      packets can finally be transferred to receive queue, we scan the queue
      from its head.
      
      However, in presence of heavy losses, we might have to find an arbitrary
      point in this queue, involving a linear scan for every incoming packet,
      throwing away cpu caches.
      
      This patch converts it to a RB tree, to get bounded latencies.
      
      Yaogong wrote a preliminary patch about 2 years ago.
      Eric did the rebase, added ofo_last_skb cache, polishing and tests.
      
      Tested with network dropping between 1 and 10 % packets, with good
      success (about 30 % increase of throughput in stress tests)
      
      Next step would be to also use an RB tree for the write queue at sender
      side ;)
      Signed-off-by: NYaogong Wang <wygivan@google.com>
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Cc: Yuchung Cheng <ycheng@google.com>
      Cc: Neal Cardwell <ncardwell@google.com>
      Cc: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
      Acked-By: NIlpo Järvinen <ilpo.jarvinen@helsinki.fi>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      9f5afeae
  4. 03 5月, 2016 1 次提交
  5. 28 4月, 2016 2 次提交
  6. 05 4月, 2016 1 次提交
  7. 15 3月, 2016 1 次提交
    • M
      tcp: Add RFC4898 tcpEStatsPerfDataSegsOut/In · a44d6eac
      Martin KaFai Lau 提交于
      Per RFC4898, they count segments sent/received
      containing a positive length data segment (that includes
      retransmission segments carrying data).  Unlike
      tcpi_segs_out/in, tcpi_data_segs_out/in excludes segments
      carrying no data (e.g. pure ack).
      
      The patch also updates the segs_in in tcp_fastopen_add_skb()
      so that segs_in >= data_segs_in property is kept.
      
      Together with retransmission data, tcpi_data_segs_out
      gives a better signal on the rxmit rate.
      
      v6: Rebase on the latest net-next
      
      v5: Eric pointed out that checking skb->len is still needed in
      tcp_fastopen_add_skb() because skb can carry a FIN without data.
      Hence, instead of open coding segs_in and data_segs_in, tcp_segs_in()
      helper is used.  Comment is added to the fastopen case to explain why
      segs_in has to be reset and tcp_segs_in() has to be called before
      __skb_pull().
      
      v4: Add comment to the changes in tcp_fastopen_add_skb()
      and also add remark on this case in the commit message.
      
      v3: Add const modifier to the skb parameter in tcp_segs_in()
      
      v2: Rework based on recent fix by Eric:
      commit a9d99ce2 ("tcp: fix tcpi_segs_in after connection establishment")
      Signed-off-by: NMartin KaFai Lau <kafai@fb.com>
      Cc: Chris Rapier <rapier@psc.edu>
      Cc: Eric Dumazet <edumazet@google.com>
      Cc: Marcelo Ricardo Leitner <mleitner@redhat.com>
      Cc: Neal Cardwell <ncardwell@google.com>
      Cc: Yuchung Cheng <ycheng@google.com>
      Acked-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      a44d6eac
  8. 08 3月, 2016 1 次提交
    • E
      tcp: fix tcpi_segs_in after connection establishment · a9d99ce2
      Eric Dumazet 提交于
      If final packet (ACK) of 3WHS is lost, it appears we do not properly
      account the following incoming segment into tcpi_segs_in
      
      While we are at it, starts segs_in with one, to count the SYN packet.
      
      We do not yet count number of SYN we received for a request sock, we
      might add this someday.
      
      packetdrill script showing proper behavior after fix :
      
      // Tests tcpi_segs_in when 3rd packet (ACK) of 3WHS is lost
      0.000 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
         +0 setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
         +0 bind(3, ..., ...) = 0
         +0 listen(3, 1) = 0
      
         +0 < S 0:0(0) win 32792 <mss 1000,sackOK,nop,nop>
         +0 > S. 0:0(0) ack 1 <mss 1460,nop,nop,sackOK>
      +.020 < P. 1:1001(1000) ack 1 win 32792
      
         +0 accept(3, ..., ...) = 4
      
      +.000 %{ assert tcpi_segs_in == 2, 'tcpi_segs_in=%d' % tcpi_segs_in }%
      
      Fixes: 2efd055c ("tcp: add tcpi_segs_in and tcpi_segs_out to tcp_info")
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      a9d99ce2
  9. 08 2月, 2016 1 次提交
  10. 23 12月, 2015 1 次提交
    • F
      tcp: honour SO_BINDTODEVICE for TW_RST case too · 271c3b9b
      Florian Westphal 提交于
      Hannes points out that when we generate tcp reset for timewait sockets we
      pretend we found no socket and pass NULL sk to tcp_vX_send_reset().
      
      Make it cope with inet tw sockets and then provide tw sk.
      
      This makes RSTs appear on correct interface when SO_BINDTODEVICE is used.
      
      Packetdrill test case:
      // want default route to be used, we rely on BINDTODEVICE
      `ip route del 192.0.2.0/24 via 192.168.0.2 dev tun0`
      
      0.000 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
      // test case still works due to BINDTODEVICE
      0.001 setsockopt(3, SOL_SOCKET, SO_BINDTODEVICE, "tun0", 4) = 0
      0.100...0.200 connect(3, ..., ...) = 0
      
      0.100 > S 0:0(0) <mss 1460,sackOK,nop,nop>
      0.200 < S. 0:0(0) ack 1 win 32792 <mss 1460,sackOK,nop,nop>
      0.200 > . 1:1(0) ack 1
      
      0.210 close(3) = 0
      
      0.210 > F. 1:1(0) ack 1 win 29200
      0.300 < . 1:1(0) ack 2 win 46
      
      // more data while in FIN_WAIT2, expect RST
      1.300 < P. 1:1001(1000) ack 1 win 46
      
      // fails without this change -- default route is used
      1.301 > R 1:1(0) win 0
      Reported-by: NHannes Frederic Sowa <hannes@stressinduktion.org>
      Signed-off-by: NFlorian Westphal <fw@strlen.de>
      Acked-by: NEric Dumazet <edumazet@google.com>
      Acked-by: NHannes Frederic Sowa <hannes@stressinduktion.org>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      271c3b9b
  11. 06 11月, 2015 1 次提交
  12. 23 10月, 2015 1 次提交
    • E
      tcp/dccp: fix hashdance race for passive sessions · 5e0724d0
      Eric Dumazet 提交于
      Multiple cpus can process duplicates of incoming ACK messages
      matching a SYN_RECV request socket. This is a rare event under
      normal operations, but definitely can happen.
      
      Only one must win the race, otherwise corruption would occur.
      
      To fix this without adding new atomic ops, we use logic in
      inet_ehash_nolisten() to detect the request was present in the same
      ehash bucket where we try to insert the new child.
      
      If request socket was not found, we have to undo the child creation.
      
      This actually removes a spin_lock()/spin_unlock() pair in
      reqsk_queue_unlink() for the fast path.
      
      Fixes: e994b2f0 ("tcp: do not lock listener to process SYN packets")
      Fixes: 079096f1 ("tcp/dccp: install syn_recv requests into ehash table")
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      5e0724d0
  13. 21 10月, 2015 2 次提交
    • Y
      tcp: track the packet timings in RACK · 659a8ad5
      Yuchung Cheng 提交于
      This patch is the first half of the RACK loss recovery.
      
      RACK loss recovery uses the notion of time instead
      of packet sequence (FACK) or counts (dupthresh). It's inspired by the
      previous FACK heuristic in tcp_mark_lost_retrans(): when a limited
      transmit (new data packet) is sacked, then current retransmitted
      sequence below the newly sacked sequence must been lost,
      since at least one round trip time has elapsed.
      
      But it has several limitations:
      1) can't detect tail drops since it depends on limited transmit
      2) is disabled upon reordering (assumes no reordering)
      3) only enabled in fast recovery ut not timeout recovery
      
      RACK (Recently ACK) addresses these limitations with the notion
      of time instead: a packet P1 is lost if a later packet P2 is s/acked,
      as at least one round trip has passed.
      
      Since RACK cares about the time sequence instead of the data sequence
      of packets, it can detect tail drops when later retransmission is
      s/acked while FACK or dupthresh can't. For reordering RACK uses a
      dynamically adjusted reordering window ("reo_wnd") to reduce false
      positives on ever (small) degree of reordering.
      
      This patch implements tcp_advanced_rack() which tracks the
      most recent transmission time among the packets that have been
      delivered (ACKed or SACKed) in tp->rack.mstamp. This timestamp
      is the key to determine which packet has been lost.
      
      Consider an example that the sender sends six packets:
      T1: P1 (lost)
      T2: P2
      T3: P3
      T4: P4
      T100: sack of P2. rack.mstamp = T2
      T101: retransmit P1
      T102: sack of P2,P3,P4. rack.mstamp = T4
      T205: ACK of P4 since the hole is repaired. rack.mstamp = T101
      
      We need to be careful about spurious retransmission because it may
      falsely advance tp->rack.mstamp by an RTT or an RTO, causing RACK
      to falsely mark all packets lost, just like a spurious timeout.
      
      We identify spurious retransmission by the ACK's TS echo value.
      If TS option is not applicable but the retransmission is acknowledged
      less than min-RTT ago, it is likely to be spurious. We refrain from
      using the transmission time of these spurious retransmissions.
      
      The second half is implemented in the next patch that marks packet
      lost using RACK timestamp.
      Signed-off-by: NYuchung Cheng <ycheng@google.com>
      Signed-off-by: NNeal Cardwell <ncardwell@google.com>
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      659a8ad5
    • Y
      tcp: track min RTT using windowed min-filter · f6722583
      Yuchung Cheng 提交于
      Kathleen Nichols' algorithm for tracking the minimum RTT of a
      data stream over some measurement window. It uses constant space
      and constant time per update. Yet it almost always delivers
      the same minimum as an implementation that has to keep all
      the data in the window. The measurement window is tunable via
      sysctl.net.ipv4.tcp_min_rtt_wlen with a default value of 5 minutes.
      
      The algorithm keeps track of the best, 2nd best & 3rd best min
      values, maintaining an invariant that the measurement time of
      the n'th best >= n-1'th best. It also makes sure that the three
      values are widely separated in the time window since that bounds
      the worse case error when that data is monotonically increasing
      over the window.
      
      Upon getting a new min, we can forget everything earlier because
      it has no value - the new min is less than everything else in the
      window by definition and it's the most recent. So we restart fresh
      on every new min and overwrites the 2nd & 3rd choices. The same
      property holds for the 2nd & 3rd best.
      
      Therefore we have to maintain two invariants to maximize the
      information in the samples, one on values (1st.v <= 2nd.v <=
      3rd.v) and the other on times (now-win <=1st.t <= 2nd.t <= 3rd.t <=
      now). These invariants determine the structure of the code
      
      The RTT input to the windowed filter is the minimum RTT measured
      from ACK or SACK, or as the last resort from TCP timestamps.
      
      The accessor tcp_min_rtt() returns the minimum RTT seen in the
      window. ~0U indicates it is not available. The minimum is 1usec
      even if the true RTT is below that.
      Signed-off-by: NYuchung Cheng <ycheng@google.com>
      Signed-off-by: NNeal Cardwell <ncardwell@google.com>
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      f6722583
  14. 13 10月, 2015 1 次提交
  15. 11 10月, 2015 1 次提交
  16. 03 10月, 2015 1 次提交
  17. 30 9月, 2015 3 次提交
  18. 26 9月, 2015 1 次提交
  19. 25 9月, 2015 1 次提交
  20. 22 9月, 2015 2 次提交
    • E
      tcp/dccp: fix timewait races in timer handling · ed2e9239
      Eric Dumazet 提交于
      When creating a timewait socket, we need to arm the timer before
      allowing other cpus to find it. The signal allowing cpus to find
      the socket is setting tw_refcnt to non zero value.
      
      As we set tw_refcnt in __inet_twsk_hashdance(), we therefore need to
      call inet_twsk_schedule() first.
      
      This also means we need to remove tw_refcnt changes from
      inet_twsk_schedule() and let the caller handle it.
      
      Note that because we use mod_timer_pinned(), we have the guarantee
      the timer wont expire before we set tw_refcnt as we run in BH context.
      
      To make things more readable I introduced inet_twsk_reschedule() helper.
      
      When rearming the timer, we can use mod_timer_pending() to make sure
      we do not rearm a canceled timer.
      
      Note: This bug can possibly trigger if packets of a flow can hit
      multiple cpus. This does not normally happen, unless flow steering
      is broken somehow. This explains this bug was spotted ~5 months after
      its introduction.
      
      A similar fix is needed for SYN_RECV sockets in reqsk_queue_hash_req(),
      but will be provided in a separate patch for proper tracking.
      
      Fixes: 789f558c ("tcp/dccp: get rid of central timewait timer")
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Reported-by: NYing Cai <ycai@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      ed2e9239
    • Y
      tcp: usec resolution SYN/ACK RTT · 0f1c28ae
      Yuchung Cheng 提交于
      Currently SYN/ACK RTT is measured in jiffies. For LAN the SYN/ACK
      RTT is often measured as 0ms or sometimes 1ms, which would affect
      RTT estimation and min RTT samping used by some congestion control.
      
      This patch improves SYN/ACK RTT to be usec resolution if platform
      supports it. While the timestamping of SYN/ACK is done in request
      sock, the RTT measurement is carefully arranged to avoid storing
      another u64 timestamp in tcp_sock.
      
      For regular handshake w/o SYNACK retransmission, the RTT is sampled
      right after the child socket is created and right before the request
      sock is released (tcp_check_req() in tcp_minisocks.c)
      
      For Fast Open the child socket is already created when SYN/ACK was
      sent, the RTT is sampled in tcp_rcv_state_process() after processing
      the final ACK an right before the request socket is released.
      
      If the SYN/ACK was retransmistted or SYN-cookie was used, we rely
      on TCP timestamps to measure the RTT. The sample is taken at the
      same place in tcp_rcv_state_process() after the timestamp values
      are validated in tcp_validate_incoming(). Note that we do not store
      TS echo value in request_sock for SYN-cookies, because the value
      is already stored in tp->rx_opt used by tcp_ack_update_rtt().
      
      One side benefit is that the RTT measurement now happens before
      initializing congestion control (of the passive side). Therefore
      the congestion control can use the SYN/ACK RTT.
      Signed-off-by: NYuchung Cheng <ycheng@google.com>
      Signed-off-by: NNeal Cardwell <ncardwell@google.com>
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      0f1c28ae
  21. 10 7月, 2015 1 次提交
  22. 01 6月, 2015 1 次提交
    • N
      tcp: fix child sockets to use system default congestion control if not set · 9f950415
      Neal Cardwell 提交于
      Linux 3.17 and earlier are explicitly engineered so that if the app
      doesn't specifically request a CC module on a listener before the SYN
      arrives, then the child gets the system default CC when the connection
      is established. See tcp_init_congestion_control() in 3.17 or earlier,
      which says "if no choice made yet assign the current value set as
      default". The change ("net: tcp: assign tcp cong_ops when tcp sk is
      created") altered these semantics, so that children got their parent
      listener's congestion control even if the system default had changed
      after the listener was created.
      
      This commit returns to those original semantics from 3.17 and earlier,
      since they are the original semantics from 2007 in 4d4d3d1e ("[TCP]:
      Congestion control initialization."), and some Linux congestion
      control workflows depend on that.
      
      In summary, if a listener socket specifically sets TCP_CONGESTION to
      "x", or the route locks the CC module to "x", then the child gets
      "x". Otherwise the child gets current system default from
      net.ipv4.tcp_congestion_control. That's the behavior in 3.17 and
      earlier, and this commit restores that.
      
      Fixes: 55d8694f ("net: tcp: assign tcp cong_ops when tcp sk is created")
      Cc: Florian Westphal <fw@strlen.de>
      Cc: Daniel Borkmann <dborkman@redhat.com>
      Cc: Glenn Judd <glenn.judd@morganstanley.com>
      Cc: Stephen Hemminger <stephen@networkplumber.org>
      Signed-off-by: NNeal Cardwell <ncardwell@google.com>
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NYuchung Cheng <ycheng@google.com>
      Acked-by: NDaniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      9f950415
  23. 22 5月, 2015 1 次提交
    • M
      tcp: add tcpi_segs_in and tcpi_segs_out to tcp_info · 2efd055c
      Marcelo Ricardo Leitner 提交于
      This patch tracks the total number of inbound and outbound segments on a
      TCP socket. One may use this number to have an idea on connection
      quality when compared against the retransmissions.
      
      RFC4898 named these : tcpEStatsPerfSegsIn and tcpEStatsPerfSegsOut
      
      These are a 32bit field each and can be fetched both from TCP_INFO
      getsockopt() if one has a handle on a TCP socket, or from inet_diag
      netlink facility (iproute2/ss patch will follow)
      
      Note that tp->segs_out was placed near tp->snd_nxt for good data
      locality and minimal performance impact, while tp->segs_in was placed
      near tp->bytes_received for the same reason.
      
      Join work with Eric Dumazet.
      
      Note that received SYN are accounted on the listener, but sent SYNACK
      are not accounted.
      Signed-off-by: NMarcelo Ricardo Leitner <mleitner@redhat.com>
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      2efd055c
  24. 18 5月, 2015 1 次提交
  25. 06 5月, 2015 1 次提交
    • E
      tcp: provide SYN headers for passive connections · cd8ae852
      Eric Dumazet 提交于
      This patch allows a server application to get the TCP SYN headers for
      its passive connections.  This is useful if the server is doing
      fingerprinting of clients based on SYN packet contents.
      
      Two socket options are added: TCP_SAVE_SYN and TCP_SAVED_SYN.
      
      The first is used on a socket to enable saving the SYN headers
      for child connections. This can be set before or after the listen()
      call.
      
      The latter is used to retrieve the SYN headers for passive connections,
      if the parent listener has enabled TCP_SAVE_SYN.
      
      TCP_SAVED_SYN is read once, it frees the saved SYN headers.
      
      The data returned in TCP_SAVED_SYN are network (IPv4/IPv6) and TCP
      headers.
      
      Original patch was written by Tom Herbert, I changed it to not hold
      a full skb (and associated dst and conntracking reference).
      
      We have used such patch for about 3 years at Google.
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Acked-by: NNeal Cardwell <ncardwell@google.com>
      Tested-by: NNeal Cardwell <ncardwell@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      cd8ae852
  26. 24 4月, 2015 1 次提交
    • E
      inet: fix possible panic in reqsk_queue_unlink() · b357a364
      Eric Dumazet 提交于
      [ 3897.923145] BUG: unable to handle kernel NULL pointer dereference at
       0000000000000080
      [ 3897.931025] IP: [<ffffffffa9f27686>] reqsk_timer_handler+0x1a6/0x243
      
      There is a race when reqsk_timer_handler() and tcp_check_req() call
      inet_csk_reqsk_queue_unlink() on the same req at the same time.
      
      Before commit fa76ce73 ("inet: get rid of central tcp/dccp listener
      timer"), listener spinlock was held and race could not happen.
      
      To solve this bug, we change reqsk_queue_unlink() to not assume req
      must be found, and we return a status, to conditionally release a
      refcount on the request sock.
      
      This also means tcp_check_req() in non fastopen case might or not
      consume req refcount, so tcp_v6_hnd_req() & tcp_v4_hnd_req() have
      to properly handle this.
      
      (Same remark for dccp_check_req() and its callers)
      
      inet_csk_reqsk_queue_drop() is now too big to be inlined, as it is
      called 4 times in tcp and 3 times in dccp.
      
      Fixes: fa76ce73 ("inet: get rid of central tcp/dccp listener timer")
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Reported-by: NYuchung Cheng <ycheng@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      b357a364
  27. 14 4月, 2015 1 次提交
    • E
      tcp/dccp: get rid of central timewait timer · 789f558c
      Eric Dumazet 提交于
      Using a timer wheel for timewait sockets was nice ~15 years ago when
      memory was expensive and machines had a single processor.
      
      This does not scale, code is ugly and source of huge latencies
      (Typically 30 ms have been seen, cpus spinning on death_lock spinlock.)
      
      We can afford to use an extra 64 bytes per timewait sock and spread
      timewait load to all cpus to have better behavior.
      
      Tested:
      
      On following test, /proc/sys/net/ipv4/tcp_tw_recycle is set to 1
      on the target (lpaa24)
      
      Before patch :
      
      lpaa23:~# ./super_netperf 200 -H lpaa24 -t TCP_CC -l 60 -- -p0,0
      419594
      
      lpaa23:~# ./super_netperf 200 -H lpaa24 -t TCP_CC -l 60 -- -p0,0
      437171
      
      While test is running, we can observe 25 or even 33 ms latencies.
      
      lpaa24:~# ping -c 1000 -i 0.02 -qn lpaa23
      ...
      1000 packets transmitted, 1000 received, 0% packet loss, time 20601ms
      rtt min/avg/max/mdev = 0.020/0.217/25.771/1.535 ms, pipe 2
      
      lpaa24:~# ping -c 1000 -i 0.02 -qn lpaa23
      ...
      1000 packets transmitted, 1000 received, 0% packet loss, time 20702ms
      rtt min/avg/max/mdev = 0.019/0.183/33.761/1.441 ms, pipe 2
      
      After patch :
      
      About 90% increase of throughput :
      
      lpaa23:~# ./super_netperf 200 -H lpaa24 -t TCP_CC -l 60 -- -p0,0
      810442
      
      lpaa23:~# ./super_netperf 200 -H lpaa24 -t TCP_CC -l 60 -- -p0,0
      800992
      
      And latencies are kept to minimal values during this load, even
      if network utilization is 90% higher :
      
      lpaa24:~# ping -c 1000 -i 0.02 -qn lpaa23
      ...
      1000 packets transmitted, 1000 received, 0% packet loss, time 19991ms
      rtt min/avg/max/mdev = 0.023/0.064/0.360/0.042 ms
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      789f558c
  28. 09 4月, 2015 1 次提交
  29. 04 4月, 2015 2 次提交
  30. 21 3月, 2015 2 次提交
    • E
      inet: get rid of central tcp/dccp listener timer · fa76ce73
      Eric Dumazet 提交于
      One of the major issue for TCP is the SYNACK rtx handling,
      done by inet_csk_reqsk_queue_prune(), fired by the keepalive
      timer of a TCP_LISTEN socket.
      
      This function runs for awful long times, with socket lock held,
      meaning that other cpus needing this lock have to spin for hundred of ms.
      
      SYNACK are sent in huge bursts, likely to cause severe drops anyway.
      
      This model was OK 15 years ago when memory was very tight.
      
      We now can afford to have a timer per request sock.
      
      Timer invocations no longer need to lock the listener,
      and can be run from all cpus in parallel.
      
      With following patch increasing somaxconn width to 32 bits,
      I tested a listener with more than 4 million active request sockets,
      and a steady SYNFLOOD of ~200,000 SYN per second.
      Host was sending ~830,000 SYNACK per second.
      
      This is ~100 times more what we could achieve before this patch.
      
      Later, we will get rid of the listener hash and use ehash instead.
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      fa76ce73
    • E
      inet: drop prev pointer handling in request sock · 52452c54
      Eric Dumazet 提交于
      When request sock are put in ehash table, the whole notion
      of having a previous request to update dl_next is pointless.
      
      Also, following patch will get rid of big purge timer,
      so we want to delete a request sock without holding listener lock.
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      52452c54
  31. 08 2月, 2015 2 次提交