1. 08 Feb 2016, 4 commits
  2. 07 Feb 2016, 1 commit
    • tcp: fastopen: call tcp_fin() if FIN present in SYNACK · e3e17b77
      Committed by Eric Dumazet
      When we acknowledge a FIN, it is not enough to ack the sequence number
      and queue the skb into receive queue. We also have to call tcp_fin()
      to properly update socket state and send proper poll() notifications.
      
      It seems we would also have this problem if we received a SYN packet
      with the FIN flag set, but it does not appear to be an urgent issue,
      as no known implementation can do that.
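
      As a rough illustration of why the state transition matters (a minimal
      user-space sketch with invented names, not the kernel's tcp_fin()):
      acking the FIN sequence alone is not enough; the socket must also
      change state and wake poll() waiters.

      /* Illustrative sketch only -- simplified, not kernel code. */
      #include <stdio.h>

      enum tcp_state_sketch { ESTABLISHED, CLOSE_WAIT };

      struct sock_sketch {
          unsigned int rcv_nxt;           /* next sequence we expect */
          enum tcp_state_sketch state;
          int poll_in;                    /* would poll() report EOF/readable? */
      };

      /* Receiving a segment carrying FIN: advancing rcv_nxt past the FIN is
       * not enough; the socket must also change state and notify poll(),
       * which is what a tcp_fin()-like helper is responsible for. */
      static void receive_fin(struct sock_sketch *sk, unsigned int fin_seq)
      {
          sk->rcv_nxt = fin_seq + 1;      /* ack the FIN's sequence number */
          sk->state = CLOSE_WAIT;         /* proper state transition */
          sk->poll_in = 1;                /* wake poll()/select() waiters */
      }

      int main(void)
      {
          struct sock_sketch sk = { .rcv_nxt = 1, .state = ESTABLISHED, .poll_in = 0 };
          receive_fin(&sk, 1);
          printf("state=%d rcv_nxt=%u poll_in=%d\n", sk.state, sk.rcv_nxt, sk.poll_in);
          return 0;
      }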
      
      Fixes: 61d2bcae ("tcp: fastopen: accept data/FIN present in SYNACK message")
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Cc: Yuchung Cheng <ycheng@google.com>
      Cc: Neal Cardwell <ncardwell@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  3. 06 Feb 2016, 1 commit
  4. 30 Jan 2016, 1 commit
  5. 15 Jan 2016, 2 commits
    • net: tcp_memcontrol: simplify linkage between socket and page counter · baac50bb
      Committed by Johannes Weiner
      There won't be any separate counters for socket memory consumed by
      protocols other than TCP in the future.  Remove the indirection and link
      sockets directly to their owning memory cgroup.
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Reviewed-by: Vladimir Davydov <vdavydov@virtuozzo.com>
      Acked-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • net: tcp_memcontrol: sanitize tcp memory accounting callbacks · e805605c
      Committed by Johannes Weiner
      There won't be a tcp control soft limit, so integrating the memcg code
      into the global skmem limiting scheme complicates things unnecessarily.
      Replace this with simple and clear charge and uncharge calls--hidden
      behind a jump label--to account skb memory.
      
      Note that this is not purely aesthetic: as a result of shoehorning the
      per-memcg code into the same memory accounting functions that handle the
      global level, the old code would compare the per-memcg consumption
      against the smaller of the per-memcg limit and the global limit.  This
      allowed the total consumption of multiple sockets to exceed the global
      limit, as long as the individual sockets stayed within bounds.  After
      this change, the code will always compare the per-memcg consumption to
      the per-memcg limit, and the global consumption to the global limit, and
      thus close this loophole.
      
      Without a soft limit, the per-memcg memory pressure state in sockets is
      generally questionable.  However, we did it until now, so we continue to
      enter it when the hard limit is hit, and packets are dropped, to let
      other sockets in the cgroup know that they shouldn't grow their transmit
      windows, either. However, to keep things simple in the new callback model,
      we leave memory pressure lazily, when the next packet is accepted (as
      opposed to doing it synchronously when packets are processed).  When
      packets are dropped, network performance will already be in the toilet,
      so that should be a reasonable trade-off.
      
      As described above, consumption is now checked on the per-memcg level
      and the global level separately.  Likewise, memory pressure states are
      maintained on both the per-memcg level and the global level, and a
      socket is considered under pressure when either level asserts as much.
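
      A minimal user-space sketch of the resulting accounting model (function
      and structure names below are invented for illustration, not the kernel
      API): each level is charged against its own limit, and a refusal on
      either side signals pressure.

      /* Illustrative sketch of the two independent charge checks. */
      #include <stdbool.h>
      #include <stdio.h>

      struct counter { long usage, limit; };

      static bool try_charge(struct counter *c, long nr)
      {
          if (c->usage + nr > c->limit)
              return false;               /* over this counter's own limit */
          c->usage += nr;
          return true;
      }

      /* Hypothetical skmem charge: the per-memcg counter and the global
       * counter are each checked against their *own* limit, closing the old
       * loophole where per-memcg usage was compared against min(limits). */
      static bool sk_charge(struct counter *memcg, struct counter *global, long nr)
      {
          bool ok_memcg  = try_charge(memcg, nr);
          bool ok_global = try_charge(global, nr);
          return ok_memcg && ok_global;   /* pressure if either side refuses */
      }

      int main(void)
      {
          struct counter memcg = { 0, 100 }, global = { 0, 1000 };
          printf("charge ok: %d\n", sk_charge(&memcg, &global, 50));
          return 0;
      }
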
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Reviewed-by: Vladimir Davydov <vdavydov@virtuozzo.com>
      Acked-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  6. 11 Jan 2016, 3 commits
  7. 23 Dec 2015, 1 commit
  8. 16 Dec 2015, 1 commit
  9. 23 Oct 2015, 1 commit
    • tcp/dccp: fix hashdance race for passive sessions · 5e0724d0
      Committed by Eric Dumazet
      Multiple cpus can process duplicates of incoming ACK messages
      matching a SYN_RECV request socket. This is a rare event under
      normal operations, but definitely can happen.
      
      Only one must win the race, otherwise corruption would occur.
      
      To fix this without adding new atomic ops, we use logic in
      inet_ehash_nolisten() to detect whether the request was still present
      in the same ehash bucket where we try to insert the new child.

      If the request socket was not found, we have to undo the child creation.
      
      This actually removes a spin_lock()/spin_unlock() pair in
      reqsk_queue_unlink() for the fast path.
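
      A hedged, user-space sketch of the "only one wins" rule (the data
      structures and names below are invented; the real logic lives in
      inet_ehash_nolisten()): the CPU that still finds the request socket in
      the bucket wins, and a caller that does not find it must undo its child
      creation.

      /* Illustrative sketch only -- not kernel code. */
      #include <stdbool.h>
      #include <stdio.h>
      #include <string.h>

      #define BUCKET_SIZE 16

      struct bucket { const void *entries[BUCKET_SIZE]; int n; };

      static void bucket_del(struct bucket *b, int i)
      {
          memmove(&b->entries[i], &b->entries[i + 1],
                  (size_t)(--b->n - i) * sizeof(b->entries[0]));
      }

      /* Insert the child and, in the same critical section, look for the
       * request socket in the bucket.  Only the caller that still finds the
       * request "wins"; a caller that gets false must undo the child. */
      static bool ehash_insert_child(struct bucket *b, const void *child, const void *req)
      {
          bool found = false;
          for (int i = 0; i < b->n; i++) {
              if (b->entries[i] == req) {
                  bucket_del(b, i);       /* the child takes the request's place */
                  found = true;
                  break;
              }
          }
          if (found && b->n < BUCKET_SIZE)
              b->entries[b->n++] = child;
          return found;
      }

      int main(void)
      {
          int req, child1, child2;        /* stand-ins for sockets */
          struct bucket b = { .entries = { &req }, .n = 1 };

          printf("first cpu wins:  %d\n", ehash_insert_child(&b, &child1, &req));
          printf("second cpu wins: %d\n", ehash_insert_child(&b, &child2, &req));
          return 0;
      }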
      
      Fixes: e994b2f0 ("tcp: do not lock listener to process SYN packets")
      Fixes: 079096f1 ("tcp/dccp: install syn_recv requests into ehash table")
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  10. 21 Oct 2015, 3 commits
    • tcp: use RACK to detect losses · 4f41b1c5
      Committed by Yuchung Cheng
      This patch implements the second half of RACK, which uses the most
      recent transmit time among all delivered packets to detect losses.

      tcp_rack_mark_lost() is called upon receiving a dubious ACK.
      It then checks if a not-yet-sacked packet was sent at least
      "reo_wnd" prior to the sent time of the most recently delivered packet.
      If so the packet is deemed lost.
      
      The "reo_wnd" reordering window starts with 1msec for fast loss
      detection and changes to min-RTT/4 when reordering is observed.
      We found 1msec accommodates a tiny degree of reordering
      (<3 pkts) well on faster links. We use min-RTT instead of SRTT because
      reordering is more of a path property but SRTT can be inflated by
      self-inflicted congestion. The factor of 4 is borrowed from the
      delayed early retransmit and seems to work reasonably well.
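
      A simplified sketch of the lost test in plain C (illustrative only, not
      the kernel function): a packet is deemed lost if it is not yet s/acked
      and was sent at least reo_wnd before the transmit time of the most
      recently delivered packet.

      /* Illustrative sketch only -- not kernel code. */
      #include <stdbool.h>
      #include <stdint.h>
      #include <stdio.h>

      /* Lost iff not s/acked and sent at least reo_wnd (usec) before the
       * transmit time of the most recently delivered packet (rack_mstamp). */
      static bool rack_packet_lost(uint64_t xmit_us, bool sacked,
                                   uint64_t rack_mstamp_us, uint64_t reo_wnd_us)
      {
          if (sacked)
              return false;
          return xmit_us + reo_wnd_us <= rack_mstamp_us;
      }

      int main(void)
      {
          uint64_t rack_mstamp = 5000;   /* xmit time of newest delivered pkt */
          uint64_t reo_wnd = 1000;       /* 1 msec; min-RTT/4 once reordering seen */

          printf("old unsacked pkt lost:    %d\n",
                 rack_packet_lost(3000, false, rack_mstamp, reo_wnd));
          printf("recent unsacked pkt lost: %d\n",
                 rack_packet_lost(4500, false, rack_mstamp, reo_wnd));
          return 0;
      }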
      
      Since RACK is still experimental, it is now used as a supplemental
      loss detection on top of existing algorithms. It is only effective
      after the fast recovery starts or after the timeout occurs. The
      fast recovery is still triggered by FACK and/or dupack threshold
      instead of RACK.
      
      We introduce a new sysctl net.ipv4.tcp_recovery for future
      experiments of loss recoveries. For now RACK can be disabled by
      setting it to 0.
      Signed-off-by: Yuchung Cheng <ycheng@google.com>
      Signed-off-by: Neal Cardwell <ncardwell@google.com>
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • tcp: track the packet timings in RACK · 659a8ad5
      Committed by Yuchung Cheng
      This patch is the first half of the RACK loss recovery.
      
      RACK loss recovery uses the notion of time instead
      of packet sequence (FACK) or counts (dupthresh). It's inspired by the
      previous FACK heuristic in tcp_mark_lost_retrans(): when a limited
      transmit (new data packet) is sacked, then the current retransmitted
      sequence below the newly sacked sequence must have been lost,
      since at least one round trip time has elapsed.
      
      But it has several limitations:
      1) can't detect tail drops since it depends on limited transmit
      2) is disabled upon reordering (assumes no reordering)
      3) only enabled in fast recovery but not timeout recovery
      
      RACK (Recent ACK) addresses these limitations with the notion
      of time instead: a packet P1 is lost if a later packet P2 is s/acked,
      as at least one round trip has passed.

      Since RACK cares about the time sequence instead of the data sequence
      of packets, it can detect tail drops when a later retransmission is
      s/acked, while FACK or dupthresh can't. For reordering RACK uses a
      dynamically adjusted reordering window ("reo_wnd") to reduce false
      positives on every (small) degree of reordering.
      
      This patch implements tcp_advanced_rack() which tracks the
      most recent transmission time among the packets that have been
      delivered (ACKed or SACKed) in tp->rack.mstamp. This timestamp
      is the key to determine which packet has been lost.
      
      Consider an example that the sender sends six packets:
      T1: P1 (lost)
      T2: P2
      T3: P3
      T4: P4
      T100: sack of P2. rack.mstamp = T2
      T101: retransmit P1
      T102: sack of P2,P3,P4. rack.mstamp = T4
      T205: ACK of P4 since the hole is repaired. rack.mstamp = T101
      
      We need to be careful about spurious retransmission because it may
      falsely advance tp->rack.mstamp by an RTT or an RTO, causing RACK
      to falsely mark all packets lost, just like a spurious timeout.
      
      We identify spurious retransmission by the ACK's TS echo value.
      If TS option is not applicable but the retransmission is acknowledged
      less than min-RTT ago, it is likely to be spurious. We refrain from
      using the transmission time of these spurious retransmissions.
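
      A simplified user-space sketch of advancing the RACK timestamp with the
      spurious-retransmission guard (the names below are invented for this
      example, not the kernel's):

      /* Illustrative sketch only -- not kernel code. */
      #include <stdbool.h>
      #include <stdint.h>
      #include <stdio.h>

      struct rack { uint64_t mstamp_us; };

      /* Advance rack.mstamp to the newest transmit time among delivered
       * packets, but skip retransmissions that look spurious, so an RTT or
       * RTO is not falsely added to the timestamp. */
      static void rack_advance(struct rack *r, uint64_t xmit_us,
                               bool retransmitted, bool likely_spurious)
      {
          if (retransmitted && likely_spurious)
              return;                     /* ignore spurious retransmit times */
          if (xmit_us > r->mstamp_us)
              r->mstamp_us = xmit_us;     /* keep the most recent transmit time */
      }

      int main(void)
      {
          struct rack r = { 0 };
          rack_advance(&r, 2, false, false);     /* T2: P2 sacked */
          rack_advance(&r, 4, false, false);     /* T4: P4 sacked */
          rack_advance(&r, 101, true, false);    /* T101: retransmit of P1 acked */
          printf("rack.mstamp = %llu\n", (unsigned long long)r.mstamp_us);
          return 0;
      }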
      
      The second half is implemented in the next patch that marks packet
      lost using RACK timestamp.
      Signed-off-by: Yuchung Cheng <ycheng@google.com>
      Signed-off-by: Neal Cardwell <ncardwell@google.com>
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • tcp: track min RTT using windowed min-filter · f6722583
      Committed by Yuchung Cheng
      Kathleen Nichols' algorithm for tracking the minimum RTT of a
      data stream over some measurement window. It uses constant space
      and constant time per update. Yet it almost always delivers
      the same minimum as an implementation that has to keep all
      the data in the window. The measurement window is tunable via
      sysctl.net.ipv4.tcp_min_rtt_wlen with a default value of 5 minutes.
      
      The algorithm keeps track of the best, 2nd best & 3rd best min
      values, maintaining an invariant that the measurement time of
      the n'th best >= n-1'th best. It also makes sure that the three
      values are widely separated in the time window since that bounds
      the worst case error when that data is monotonically increasing
      over the window.
      
      Upon getting a new min, we can forget everything earlier because
      it has no value - the new min is less than everything else in the
      window by definition and it's the most recent. So we restart fresh
      on every new min and overwrite the 2nd & 3rd choices. The same
      property holds for the 2nd & 3rd best.
      
      Therefore we have to maintain two invariants to maximize the
      information in the samples, one on values (1st.v <= 2nd.v <=
      3rd.v) and the other on times (now-win <= 1st.t <= 2nd.t <= 3rd.t <=
      now). These invariants determine the structure of the code.
      
      The RTT input to the windowed filter is the minimum RTT measured
      from ACK or SACK, or as the last resort from TCP timestamps.
      
      The accessor tcp_min_rtt() returns the minimum RTT seen in the
      window. ~0U indicates it is not available. The minimum is 1usec
      even if the true RTT is below that.
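
      A compact user-space rendering of the best/2nd/3rd tracking (a sketch of
      the idea only; the kernel's data structure and corner cases differ):

      /* Illustrative sketch of the windowed min filter. */
      #include <stdint.h>
      #include <stdio.h>

      struct sample { uint32_t t, v; };           /* time (sec) and RTT (usec) */
      struct minfilter { struct sample s[3]; };   /* s[0] = best (smallest) */

      #define WIN 300                             /* 5 minute window, per the sysctl default */

      static void min_update(struct minfilter *m, uint32_t now, uint32_t rtt_us)
      {
          struct sample cur = { now, rtt_us };

          /* New overall min, or all three choices have aged out of the window:
           * forget everything earlier and restart from this sample. */
          if (rtt_us <= m->s[0].v || now - m->s[2].t > WIN) {
              m->s[0] = m->s[1] = m->s[2] = cur;
              return;
          }
          if (rtt_us <= m->s[1].v)
              m->s[1] = m->s[2] = cur;            /* same property for 2nd & 3rd */
          else if (rtt_us <= m->s[2].v)
              m->s[2] = cur;

          /* If the best aged out of the window, promote 2nd -> best, 3rd -> 2nd. */
          if (now - m->s[0].t > WIN) {
              m->s[0] = m->s[1];
              m->s[1] = m->s[2];
              m->s[2] = cur;
          }
      }

      int main(void)
      {
          struct minfilter m = { { { 0, 100 }, { 0, 100 }, { 0, 100 } } };
          min_update(&m, 10, 80);
          min_update(&m, 20, 120);
          printf("min rtt = %u usec\n", m.s[0].v);   /* expected: 80 */
          return 0;
      }
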
      Signed-off-by: Yuchung Cheng <ycheng@google.com>
      Signed-off-by: Neal Cardwell <ncardwell@google.com>
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  11. 19 Oct 2015, 1 commit
  12. 03 Oct 2015, 3 commits
    • tcp: attach SYNACK messages to request sockets instead of listener · ca6fb065
      Committed by Eric Dumazet
      If a listen backlog is very big (to avoid syncookies), then
      the listener sk->sk_wmem_alloc is the main source of false
      sharing, as we need to touch it twice per SYNACK re-transmit
      and TX completion.
      
      (One SYN packet takes the listener lock once, but up to 6 SYNACKs
      are generated)
      
      By attaching the skb to the request socket, we remove this
      source of contention.
      
      Tested:
      
       listen(fd, 10485760); // single listener (no SO_REUSEPORT)
       16 RX/TX queue NIC
       Sustain a SYNFLOOD attack of ~320,000 SYN per second,
       sending ~1,400,000 SYNACK per second.
       Perf profiles now show listener spinlock being next bottleneck.
      
          20.29%  [kernel]  [k] queued_spin_lock_slowpath
          10.06%  [kernel]  [k] __inet_lookup_established
           5.12%  [kernel]  [k] reqsk_timer_handler
           3.22%  [kernel]  [k] get_next_timer_interrupt
           3.00%  [kernel]  [k] tcp_make_synack
           2.77%  [kernel]  [k] ipt_do_table
           2.70%  [kernel]  [k] run_timer_softirq
           2.50%  [kernel]  [k] ip_finish_output
           2.04%  [kernel]  [k] cascade
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • tcp/dccp: install syn_recv requests into ehash table · 079096f1
      Committed by Eric Dumazet
      In this patch, we insert request sockets into the regular TCP/DCCP
      ehash table (where ESTABLISHED and TIMEWAIT sockets
      are) instead of using the per-listener hash table.
      
      ACK packets find SYN_RECV pseudo sockets without having
      to find and lock the listener.
      
      In nominal conditions, this halves pressure on listener lock.
      
      Note that this will allow for SO_REUSEPORT refinements,
      so that we can select a listener using cpu/numa affinities instead
      of the prior 'consistent hash', since only SYN packets will
      apply this selection logic.
      
      We will shrink listen_sock in the following patch to ease
      code review.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Cc: Ying Cai <ycai@google.com>
      Cc: Willem de Bruijn <willemb@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • tcp: get_openreq[46]() changes · aa3a0c8c
      Committed by Eric Dumazet
      When request sockets are no longer in a per-listener hash table
      but in the regular TCP ehash table, we need to access the listener uid
      through req->rsk_listener.
      
      get_openreq6() also gets a const for its request socket argument.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  13. 30 Sep 2015, 6 commits
  14. 29 Sep 2015, 1 commit
    • tcp: avoid reorders for TFO passive connections · 7c85af88
      Committed by Eric Dumazet
      We found that a TCP Fast Open passive connection was vulnerable
      to reorders, as the exchange might look like
      
      [1] C -> S S <FO ...> <request>
      [2] S -> C S. ack request <options>
      [3] S -> C . <answer>
      
      packets [2] and [3] can be generated at almost the same time.
      
      If C receives the 3rd packet before the 2nd, it will drop it as
      the socket is in SYN_SENT state and expects a SYNACK.
      
      S will have to retransmit the answer.
      
      Current OOO avoidance in Linux is defeated because SYNACK
      packets are attached to the LISTEN socket, while DATA packets
      are attached to the children. They might be sent by different cpus,
      and different TX queues might be selected.
      
      It turns out that for TFO, we create a child, which is a
      full blown socket in TCP_SYN_RECV state, and we can simply attach
      the SYNACK packet to this socket.
      
      This means that at the time tcp_sendmsg() pushes DATA packet,
      skb->ooo_okay will be set iff the SYNACK packet had been sent
      and TX completed.
      
      This removes the reorder source at the host level.
      
      We also removed the export of tcp_try_fastopen(), as it is no
      longer called from IPv6.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: Yuchung Cheng <ycheng@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  15. 26 Sep 2015, 6 commits
  16. 22 Sep 2015, 1 commit
    • tcp: usec resolution SYN/ACK RTT · 0f1c28ae
      Committed by Yuchung Cheng
      Currently SYN/ACK RTT is measured in jiffies. On a LAN the SYN/ACK
      RTT is often measured as 0ms or sometimes 1ms, which would affect
      RTT estimation and the min RTT sampling used by some congestion controls.
      
      This patch improves SYN/ACK RTT to be usec resolution if platform
      supports it. While the timestamping of SYN/ACK is done in request
      sock, the RTT measurement is carefully arranged to avoid storing
      another u64 timestamp in tcp_sock.
      
      For regular handshake w/o SYNACK retransmission, the RTT is sampled
      right after the child socket is created and right before the request
      sock is released (tcp_check_req() in tcp_minisocks.c)
      
      For Fast Open the child socket is already created when the SYN/ACK was
      sent, so the RTT is sampled in tcp_rcv_state_process() after processing
      the final ACK and right before the request socket is released.
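
      As a rough illustration of the direct (non-retransmit) measurement (a
      user-space sketch with invented names, not the kernel code): a usec
      timestamp is taken when the SYN/ACK is sent and the delta is computed
      when the final ACK is processed.

      /* Illustrative sketch only -- not kernel code. */
      #include <stdint.h>
      #include <stdio.h>
      #include <time.h>
      #include <unistd.h>

      static uint64_t now_usec(void)
      {
          struct timespec ts;
          clock_gettime(CLOCK_MONOTONIC, &ts);
          return (uint64_t)ts.tv_sec * 1000000 + (uint64_t)(ts.tv_nsec / 1000);
      }

      int main(void)
      {
          /* Timestamp taken when the SYN/ACK is sent (kept in the request sock). */
          uint64_t synack_sent = now_usec();

          usleep(25000);   /* stand-in for the final ACK of the 3WHS arriving */

          /* RTT sampled right before the request sock would be released. */
          uint64_t rtt_us = now_usec() - synack_sent;
          printf("SYN/ACK RTT: %llu usec\n", (unsigned long long)rtt_us);
          return 0;
      }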
      
      If the SYN/ACK was retransmitted or a SYN-cookie was used, we rely
      on TCP timestamps to measure the RTT. The sample is taken at the
      same place in tcp_rcv_state_process() after the timestamp values
      are validated in tcp_validate_incoming(). Note that we do not store
      TS echo value in request_sock for SYN-cookies, because the value
      is already stored in tp->rx_opt used by tcp_ack_update_rtt().
      
      One side benefit is that the RTT measurement now happens before
      initializing congestion control (of the passive side). Therefore
      the congestion control can use the SYN/ACK RTT.
      Signed-off-by: Yuchung Cheng <ycheng@google.com>
      Signed-off-by: Neal Cardwell <ncardwell@google.com>
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  17. 01 Sep 2015, 1 commit
    • tcp: use dctcp if enabled on the route to the initiator · c3a8d947
      Committed by Daniel Borkmann
      Currently, the following case doesn't use DCTCP, even if it should:
      A responder has f.e. Cubic as system wide default, but for a specific
      route to the initiating host, DCTCP is being set in RTAX_CC_ALGO. The
      initiating host then uses DCTCP as congestion control, but since the
      initiator sets ECT(0), tcp_ecn_create_request() doesn't set ecn_ok,
      and we have to fall back to Reno after 3WHS completes.
      
      We were thinking about how to solve this in a minimal, non-intrusive
      way without bloating tcp_ecn_create_request() needlessly: let's cache
      the CA ecn option flag in RTAX_FEATURES. In other words, when ECT(0)
      is set on the SYN packet, set ecn_ok=1 iff the route's RTAX_FEATURES
      contains the unexposed (internal-only) DST_FEATURE_ECN_CA. This allows
      doing only a single metric feature lookup inside tcp_ecn_create_request().
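
      A hedged sketch of that decision in plain C (the helper below is
      invented for illustration; only DST_FEATURE_ECN_CA corresponds to the
      internal flag named in this change):

      /* Illustrative sketch only -- not kernel code. */
      #include <stdbool.h>
      #include <stdio.h>

      #define DST_FEATURE_ECN_CA  0x1   /* illustrative bit for the internal flag */

      /* ecn_ok is granted either by the usual ECN negotiation, or when ECT(0)
       * is seen on the SYN and the route's cached features say the configured
       * congestion control wants ECN. */
      static bool ecn_ok_for_request(bool syn_has_ect0, unsigned route_features,
                                     bool classic_ecn_negotiated)
      {
          if (classic_ecn_negotiated)
              return true;
          return syn_has_ect0 && (route_features & DST_FEATURE_ECN_CA);
      }

      int main(void)
      {
          printf("%d\n", ecn_ok_for_request(true, DST_FEATURE_ECN_CA, false)); /* 1 */
          printf("%d\n", ecn_ok_for_request(true, 0, false));                  /* 0 */
          return 0;
      }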
      
      Joint work with Florian Westphal.
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: Florian Westphal <fw@strlen.de>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  18. 26 Aug 2015, 2 commits
    • tcp: refine pacing rate determination · 43e122b0
      Committed by Eric Dumazet
      When TCP pacing was added back in linux-3.12, we chose
      to apply a fixed ratio of 200 % against the current rate,
      to allow probing for optimal throughput even during the
      slow start phase, where cwnd can be doubled every other RTT.
      
      At Google, we found it was better applying a different ratio
      while in Congestion Avoidance phase.
      This ratio was set to 120 %.
      
      We've used the normal tcp_in_slow_start() helper for a while,
      then tuned the condition to select the conservative ratio
      as soon as cwnd >= ssthresh/2 (see the sketch after the list below):
      
      - After cwnd reduction, it is safer to ramp up more slowly,
        as we approach optimal cwnd.
      - Initial ramp up (ssthresh == INFINITY) still allows doubling
        cwnd every other RTT.
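
      A rough user-space sketch of the ratio selection under these
      assumptions (the helper name and hard-coded ratios below are
      illustrative, not the kernel code):

      /* Illustrative sketch only -- not kernel code. */
      #include <stdint.h>
      #include <stdio.h>

      /* 200% of the current rate while ramping up (initial slow start, or
       * cwnd still below ssthresh/2), 120% once we approach the optimal cwnd. */
      static uint64_t pacing_rate(uint64_t rate_bytes_per_sec,
                                  uint32_t cwnd, uint32_t ssthresh)
      {
          unsigned int ratio = (cwnd < ssthresh / 2) ? 200 : 120;

          return rate_bytes_per_sec * ratio / 100;
      }

      int main(void)
      {
          printf("slow start:           %llu\n",
                 (unsigned long long)pacing_rate(1000000, 10, 0xFFFFFFFF));
          printf("congestion avoidance: %llu\n",
                 (unsigned long long)pacing_rate(1000000, 100, 80));
          return 0;
      }
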
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Cc: Neal Cardwell <ncardwell@google.com>
      Cc: Yuchung Cheng <ycheng@google.com>
      Acked-by: Neal Cardwell <ncardwell@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • tcp: fix slow start after idle vs TSO/GSO · 6f021c62
      Committed by Eric Dumazet
      Slow start after idle might reduce cwnd, but we perform this
      after the first packet was cooked and sent.
      
      With TSO/GSO, it means that we might send a full TSO packet
      even if cwnd should have been reduced to IW10.
      
      Moving the SSAI check into skb_entail() makes sense, because
      we slightly reduce the number of times this check is done,
      especially for large send() calls and TCP Small Queues callbacks from
      softirq context.
      
      As Neal pointed out, we also need to perform the check
      if/when receive window opens.
      
      Tested:
      
      The following packetdrill test demonstrates the problem:
      // Test of slow start after idle
      
      `sysctl -q net.ipv4.tcp_slow_start_after_idle=1`
      
      0.000 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
      +0    setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
      +0    bind(3, ..., ...) = 0
      +0    listen(3, 1) = 0
      
      +0    < S 0:0(0) win 65535 <mss 1000,sackOK,nop,nop,nop,wscale 7>
      +0    > S. 0:0(0) ack 1 <mss 1460,nop,nop,sackOK,nop,wscale 6>
      +.100 < . 1:1(0) ack 1 win 511
      +0    accept(3, ..., ...) = 4
      +0    setsockopt(4, SOL_SOCKET, SO_SNDBUF, [200000], 4) = 0
      
      +0    write(4, ..., 26000) = 26000
      +0    > . 1:5001(5000) ack 1
      +0    > . 5001:10001(5000) ack 1
      +0    %{ assert tcpi_snd_cwnd == 10 }%
      
      +.100 < . 1:1(0) ack 10001 win 511
      +0    %{ assert tcpi_snd_cwnd == 20, tcpi_snd_cwnd }%
      +0    > . 10001:20001(10000) ack 1
      +0    > P. 20001:26001(6000) ack 1
      
      +.100 < . 1:1(0) ack 26001 win 511
      +0    %{ assert tcpi_snd_cwnd == 36, tcpi_snd_cwnd }%
      
      +4 write(4, ..., 20000) = 20000
      // If slow start after idle works properly, we should send 5 MSS here (cwnd/2)
      +0    > . 26001:31001(5000) ack 1
      +0    %{ assert tcpi_snd_cwnd == 10, tcpi_snd_cwnd }%
      +0    > . 31001:36001(5000) ack 1
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Cc: Neal Cardwell <ncardwell@google.com>
      Cc: Yuchung Cheng <ycheng@google.com>
      Acked-by: Neal Cardwell <ncardwell@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  19. 10 Jul 2015, 1 commit