1. 27 May 2015 (1 commit)
  2. 22 May 2015 (2 commits)
    • tcp: add tcpi_segs_in and tcpi_segs_out to tcp_info · 2efd055c
      Committed by Marcelo Ricardo Leitner
      This patch tracks the total number of inbound and outbound segments on
      a TCP socket. One may compare these numbers against the retransmission
      counters to get an idea of connection quality.

      RFC 4898 names these tcpEStatsPerfSegsIn and tcpEStatsPerfSegsOut.

      Each is a 32-bit field and can be fetched both via the TCP_INFO
      getsockopt(), if one has a handle on a TCP socket, and via the
      inet_diag netlink facility (an iproute2/ss patch will follow).
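
      For illustration, a minimal user-space sketch of reading the new
      counters (it assumes a connected TCP socket and a libc whose
      struct tcp_info already carries these fields):

      #include <stdio.h>
      #include <netinet/in.h>
      #include <netinet/tcp.h>
      #include <sys/socket.h>

      /* Print segment counters for an established TCP socket fd. */
      static void print_seg_counters(int fd)
      {
              struct tcp_info info;
              socklen_t len = sizeof(info);

              if (getsockopt(fd, IPPROTO_TCP, TCP_INFO, &info, &len) == 0)
                      printf("segs_in=%u segs_out=%u retrans=%u\n",
                             info.tcpi_segs_in, info.tcpi_segs_out,
                             info.tcpi_total_retrans);
      }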
      
      Note that tp->segs_out was placed near tp->snd_nxt for good data
      locality and minimal performance impact, while tp->segs_in was placed
      near tp->bytes_received for the same reason.
      
      Joint work with Eric Dumazet.
      
      Note that received SYNs are accounted on the listener, but sent
      SYNACKs are not.
      Signed-off-by: Marcelo Ricardo Leitner <mleitner@redhat.com>
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • tcp: add a force_schedule argument to sk_stream_alloc_skb() · eb934478
      Committed by Eric Dumazet
      In commit 8e4d980a ("tcp: fix behavior for epoll edge trigger")
      we fixed a possible hang of TCP sockets under memory pressure
      by allowing sk_stream_alloc_skb() to use sk_forced_mem_schedule()
      if no packet is in the socket write queue.
      
      It turns out there are other cases where we want to force memory
      scheduling:

      tcp_fragment() and tso_fragment() need to split a big TSO packet into
      two smaller ones. If we block here because of TCP memory pressure,
      we can effectively block the TCP socket from sending new data.
      If no further ACK is coming, this hang would be permanent, and the
      socket would have no chance to effectively reduce its memory usage.
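
      Paraphrased, the allocator's new behavior looks roughly like this
      (a sketch of the idea, not the verbatim kernel code):

      struct sk_buff *sk_stream_alloc_skb(struct sock *sk, int size, gfp_t gfp,
                                          bool force_schedule)
      {
              struct sk_buff *skb;

              skb = alloc_skb_fclone(size + sk->sk_prot->max_header, gfp);
              if (likely(skb)) {
                      bool mem_scheduled;

                      if (force_schedule) {
                              /* bypass the tcp_mem[] pressure checks */
                              mem_scheduled = true;
                              sk_forced_mem_schedule(sk, skb->truesize);
                      } else {
                              mem_scheduled = sk_wmem_schedule(sk, skb->truesize);
                      }
                      if (likely(mem_scheduled))
                              return skb;
                      __kfree_skb(skb);
              }
              return NULL;
      }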
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  3. 20 May 2015 (1 commit)
    • tcp: add rfc3168, section 6.1.1.1. fallback · 49213555
      Committed by Daniel Borkmann
      This work, a follow-up to commit f7b3bec6 ("net: allow setting ecn
      via routing table"), adds the RFC 3168 section 6.1.1.1 fallback for
      outgoing ECN connections. In other words, it adds a retry with a
      non-ECN-setup SYN packet on the first timeout, as suggested by the RFC:
      
        [...] A host that receives no reply to an ECN-setup SYN within the
        normal SYN retransmission timeout interval MAY resend the SYN and
        any subsequent SYN retransmissions with CWR and ECE cleared. [...]
      
      Schematic client-side view, assuming the server is in tcp_ecn=2 mode
      (the Linux default since 2009 via commit 255cac91 ("tcp: extend
      ECN sysctl to allow server-side only ECN")):
      
       1) Normal ECN-capable path:
      
          SYN ECE CWR ----->
                      <----- SYN ACK ECE
                  ACK ----->
      
       2) Path with broken middlebox, when client has fallback:
      
          SYN ECE CWR ----X crappy middlebox drops packet
                            (timeout, rtx)
                  SYN ----->
                      <----- SYN ACK
                  ACK ----->
      
      Had the fallback not been implemented, the middlebox drop
      point would basically end up as:
      
          SYN ECE CWR ----X crappy middlebox drops packet
                            (timeout, rtx)
          SYN ECE CWR ----X crappy middlebox drops packet
                            (timeout, rtx)
          SYN ECE CWR ----X crappy middlebox drops packet
                            (timeout, rtx)
      
      In any case, only a rather small percentage of sites would incur such
      additional setup latency: measurements at the end of 2014 found that
      ~56% of IPv4 and 65% of IPv6 servers on the Alexa 1 million list would
      negotiate ECN (i.e., the tcp_ecn=2 default), while 0.42% of these
      webservers fail to connect when ECN is requested (tcp_ecn=1) due to
      timeouts, which the fallback would mitigate with a slight latency
      trade-off. A recent related paper on this topic:
      
        Brian Trammell, Mirja Kühlewind, Damiano Boppart, Iain Learmonth,
        Gorry Fairhurst, and Richard Scheffenegger:
          "Enabling Internet-Wide Deployment of Explicit Congestion Notification."
          Proc. PAM 2015, New York.
        http://ecn.ethz.ch/ecn-pam15.pdf
      
      Thus, when net.ipv4.tcp_ecn=1 is set, the patch performs the RFC 3168
      section 6.1.1.1 fallback on timeout. For users who explicitly do not
      want this (e.g., in data-center use cases), we add a
      net.ipv4.tcp_ecn_fallback knob that allows disabling the fallback.
      
      tp->ecn_flags are not cleared in tcp_ecn_clear_syn() on output;
      rather, we let tcp_ecn_rcv_synack() take over on the input path in
      case a SYN ACK ECE was delayed. Thus a spurious SYN retransmission
      will not prevent ECN from eventually being negotiated in that case.
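
      On the output side, the change amounts to clearing ECE/CWR on the
      retransmitted SYN when the knob permits it; roughly (a sketch after
      the patch, not necessarily verbatim):

      static void tcp_ecn_clear_syn(struct sock *sk, struct sk_buff *skb)
      {
              /* tp->ecn_flags are cleared later, when the SYN ACK is
               * ultimately received (see tcp_ecn_rcv_synack()).
               */
              if (sock_net(sk)->ipv4.sysctl_tcp_ecn_fallback)
                      TCP_SKB_CB(skb)->tcp_flags &= ~(TCPHDR_ECE | TCPHDR_CWR);
      }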
      
      Reference: https://www.ietf.org/proceedings/92/slides/slides-92-iccrg-1.pdf
      Reference: https://www.ietf.org/proceedings/89/slides/slides-89-tsvarea-1.pdf
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: Florian Westphal <fw@strlen.de>
      Signed-off-by: Mirja Kühlewind <mirja.kuehlewind@tik.ee.ethz.ch>
      Signed-off-by: Brian Trammell <trammell@tik.ee.ethz.ch>
      Cc: Eric Dumazet <edumazet@google.com>
      Cc: Dave Taht <dave.taht@gmail.com>
      Acked-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  4. 18 May 2015 (2 commits)
  5. 10 May 2015 (2 commits)
  6. 24 Apr 2015 (1 commit)
    • tcp: avoid looping in tcp_send_fin() · 845704a5
      Committed by Eric Dumazet
      The presence of an unbounded loop in tcp_send_fin() had always been
      hard to explain when analyzing crash dumps involving gigantic dying
      processes with millions of sockets.
      
      Let's try a different strategy:
      
      In case of memory pressure, try to add the FIN flag to the last packet
      in the write queue, even if that packet was already sent. The TCP
      stack will be able to deliver this FIN after a timeout event. Note
      that since this FIN is delivered by a retransmit, it also carries a
      PSH flag in our current implementation.
      
      By checking sk_under_memory_pressure(), we anticipate that cooking
      many FIN packets might deplete tcp memory.
      
      In the case where we cannot allocate a packet, even with a __GFP_WAIT
      allocation, not sending a FIN seems quite reasonable if it allows us
      to get rid of this socket, free memory, and not block the process from
      eventually doing other useful work.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  7. 23 Apr 2015 (1 commit)
    • tcp: fix possible deadlock in tcp_send_fin() · d83769a5
      Committed by Eric Dumazet
      Using sk_stream_alloc_skb() in tcp_send_fin() is dangerous in
      case a huge process is killed by OOM and tcp_mem[2] is hit.

      To be able to free memory we need to make progress, so this
      patch allows FIN packets to not care about tcp_mem[2], provided
      the skb allocation succeeds.
      
      In a follow-up patch, we might abort the tcp_send_fin() infinite loop
      in case TIF_MEMDIE is set on this thread, as the memory allocator has
      already done its best to get extra memory.
      
      This patch reverts d22e1537 ("tcp: fix tcp fin memory accounting")
      
      Fixes: d22e1537 ("tcp: fix tcp fin memory accounting")
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  8. 10 Apr 2015 (1 commit)
    • tcp: tcp_make_synack() should clear skb->tstamp · b50edd78
      Committed by Eric Dumazet
      I noticed tcpdump was giving funky timestamps for locally
      generated SYNACK messages on the loopback interface.
      
      11:42:46.938990 IP 127.0.0.1.48245 > 127.0.0.2.23850: S
      945476042:945476042(0) win 43690 <mss 65495,nop,nop,sackOK,nop,wscale 7>
      
      20:28:58.502209 IP 127.0.0.2.23850 > 127.0.0.1.48245: S
      3160535375:3160535375(0) ack 945476043 win 43690 <mss
      65495,nop,nop,sackOK,nop,wscale 7>
      
      This is because we need to clear skb->tstamp before entering the
      lower stack; otherwise net_timestamp_check() does not set
      skb->tstamp.
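
      The fix itself is a one-liner at the end of tcp_make_synack();
      roughly (a sketch against the kernels of that era, where skb->tstamp
      was a union with a tv64 member):

      /* Do not fool tcpdump (if any), clean our debris */
      skb->tstamp.tv64 = 0;
      return skb;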
      
      Fixes: 7faee5c0 ("tcp: remove TCP_SKB_CB(skb)->when")
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  9. 08 Apr 2015 (2 commits)
  10. 04 Apr 2015 (2 commits)
  11. 25 Mar 2015 (3 commits)
  12. 21 Mar 2015 (2 commits)
  13. 07 Mar 2015 (2 commits)
    • ipv4: Create probe timer for tcp PMTU as per RFC4821 · 05cbc0db
      Committed by Fan Du
      As per RFC 4821, section 7.3 ("Selecting Probe Size"), a probe timer
      should be armed once probing has converged. Once this timer expires,
      probe again to take advantage of any path MTU change. The recommended
      probing interval is 10 minutes per RFC 1981. The probing interval can
      be tuned via sysctl_tcp_probe_interval.
      
      Eric Dumazet suggested implementing a pseudo timer based on the 32-bit
      jiffies tcp_time_stamp instead of using a classic timer for such a
      rare event.
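
      The pseudo timer boils down to comparing a saved probe timestamp
      against tcp_time_stamp whenever probing is considered; a simplified
      sketch (it omits resetting search_high, and is not the verbatim patch):

      static void tcp_mtu_check_reprobe(struct sock *sk)
      {
              struct inet_connection_sock *icsk = inet_csk(sk);
              struct net *net = sock_net(sk);
              u32 interval = net->ipv4.sysctl_tcp_probe_interval;
              s32 delta = tcp_time_stamp - icsk->icsk_mtup.probe_timestamp;

              if (unlikely(delta >= interval * HZ)) {
                      /* Interval elapsed: reset the search range so the
                       * next tcp_mtu_probe() starts over, and restamp.
                       */
                      icsk->icsk_mtup.probe_size = 0;
                      icsk->icsk_mtup.search_low =
                              tcp_mss_to_mtu(sk, tcp_current_mss(sk));
                      icsk->icsk_mtup.probe_timestamp = tcp_time_stamp;
              }
      }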
      Signed-off-by: Fan Du <fan.du@intel.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • ipv4: Use binary search to choose tcp PMTU probe_size · 6b58e0a5
      Committed by Fan Du
      Currently probe_size is chosen by doubling mss_cache; the probing
      process ends quickly with a sub-optimal MSS, and the link MTU is not
      taken full advantage of; in return, this forces the user to tweak
      tcp_base_mss with care.

      Use binary search to choose probe_size at a fine granularity, so that
      an optimal MSS is found to maximize performance.

      In addition, introduce sysctl_tcp_probe_threshold to control when
      probing stops, with respect to the width of the search range.
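
      The core of the change in tcp_mtu_probe() is to probe the midpoint of
      the current [search_low, search_high] MTU range and back off once the
      range is narrower than the threshold; schematically (a sketch, not the
      verbatim patch):

      /* Probe the midpoint of the current MTU search range. */
      probe_size = tcp_mtu_to_mss(sk, (icsk->icsk_mtup.search_high +
                                       icsk->icsk_mtup.search_low) >> 1);
      interval = icsk->icsk_mtup.search_high - icsk->icsk_mtup.search_low;

      /* Range exhausted or narrower than the threshold: probing has
       * converged; only re-check once the probe interval has elapsed
       * (see tcp_mtu_check_reprobe()).
       */
      if (probe_size > tcp_mtu_to_mss(sk, icsk->icsk_mtup.search_high) ||
          interval < net->ipv4.sysctl_tcp_probe_threshold) {
              tcp_mtu_check_reprobe(sk);
              return -1;
      }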
      
      Test env:
      Docker instance with vxlan encapsulation (82599EB)
      iperf -c 10.0.0.24  -t 60
      
      Before this patch:
      1.26 Gbits/sec

      After this patch (a 26% increase):
      1.59 Gbits/sec
      Signed-off-by: Fan Du <fan.du@intel.com>
      Acked-by: John Heffner <johnwheffner@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  14. 01 Mar 2015 (3 commits)
    • tcp: tso: allow CA_CWR state in tcp_tso_should_defer() · a0ea700e
      Committed by Eric Dumazet
      Another TCP issue is triggered by ECN.
      
      Under pressure, the receiver gets ECN marks and sends back ACK packets
      with the ECE TCP flag set. The sender then enters the CA_CWR state.
      
      In this state, tcp_tso_should_defer() is short-circuited:
      
      if (icsk->icsk_ca_state != TCP_CA_Open)
          goto send_now;
      
      This means that almost every ACK packet we receive triggers a partial
      send, and because cwnd is kept small, we can only send a small amount
      of data for each incoming ACK, which in turn generates more ACK
      packets.
      
      Allowing both the CA_Open and CA_CWR states to enable TSO defer in
      tcp_tso_should_defer() brings performance back:
      TSO autodefer has a better chance to defer under pressure.
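
      A sketch of the relaxed check (the exact predicate in the patch may
      differ slightly):

      /* Defer is still allowed in CA_Open and CA_CWR; bail out in the
       * Disorder/Recovery/Loss states.
       */
      if (((1 << icsk->icsk_ca_state) & (TCPF_CA_Open | TCPF_CA_CWR)) == 0)
              goto send_now;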
      
      This patch increases TSO and LRO/GRO efficiency back to normal levels,
      and does not impact overall ECN behavior.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: Yuchung Cheng <ycheng@google.com>
      Signed-off-by: Neal Cardwell <ncardwell@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • tcp: tso: restore IW10 after TSO autosizing · 50c8339e
      Committed by Eric Dumazet
      With sysctl_tcp_min_tso_segs set to 4, it is quite possible that
      tcp_tso_should_defer() decides not to send the last 2 MSS of the
      initial window of 10 packets. This also applies if autosizing decides
      to send X MSS per GSO packet and cwnd is not a multiple of X.
      
      This patch implements a heuristic based on the age of the first skb in
      the write queue: if it was sent very recently (less than half the
      SRTT), we can predict that no ACK will arrive in less than half an
      RTT, so deferring might cause under-utilization of our window.
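
      Roughly, in tcp_tso_should_defer() (a sketch close to, but not
      necessarily identical with, the patch):

      struct sk_buff *head;
      struct skb_mstamp now;
      u32 age;

      /* If the head of the write queue was sent less than half an SRTT
       * ago, an ACK is unlikely to arrive soon enough to justify
       * deferring: send now.
       */
      head = tcp_write_queue_head(sk);
      skb_mstamp_get(&now);
      age = skb_mstamp_us_delta(&now, &head->skb_mstamp);
      if (age < (tp->srtt_us >> 4))   /* tp->srtt_us is srtt << 3 */
              goto send_now;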
      
      This is visible on initial sends (IW10) on web servers, but more
      generally on some RPCs, where the last part of the message might
      otherwise need an extra RTT to get delivered.
      
      Tested:
      
      Ran the following packetdrill test:
      // A simple server-side test that sends exactly an initial window (IW10)
      // worth of packets.
      
      `sysctl -e -q net.ipv4.tcp_min_tso_segs=4`
      
      0.000 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
      +0    setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
      +0    bind(3, ..., ...) = 0
      +0    listen(3, 1) = 0
      
      +.1   < S 0:0(0) win 32792 <mss 1460,sackOK,nop,nop,nop,wscale 7>
      +0    > S. 0:0(0) ack 1 <mss 1460,nop,nop,sackOK,nop,wscale 6>
      +.1   < . 1:1(0) ack 1 win 257
      +0    accept(3, ..., ...) = 4
      
      +0    write(4, ..., 14600) = 14600
      +0    > . 1:5841(5840) ack 1 win 457
      +0    > . 5841:11681(5840) ack 1 win 457
      // Following packet should be sent right now.
      +0    > P. 11681:14601(2920) ack 1 win 457
      
      +.1   < . 1:1(0) ack 14601 win 257
      
      +0    close(4) = 0
      +0    > F. 14601:14601(0) ack 1
      +.1   < F. 1:1(0) ack 14602 win 257
      +0    > . 14602:14602(0) ack 2
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: Yuchung Cheng <ycheng@google.com>
      Signed-off-by: Neal Cardwell <ncardwell@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • tcp: tso: remove tp->tso_deferred · 5f852eb5
      Committed by Eric Dumazet
      TSO relies on the ability to defer sending a small number of packets.
      The heuristic is to wait for future ACKs in the hope of sending more
      packets at once. The current algorithm uses a per-socket tso_deferred
      field as a pseudo timer.
      
      This pseudo timer relies on future ACKs, but there is no guarantee
      we receive them in time.
      
      A fix would be to use a real timer, but the cost of such a timer is
      probably too high for typical cases.
      
      This patch changes the logic to test the time of the last transmit,
      because we should not add bursts of more than 1 ms for any given flow.
      
      We used this patch for about two years at Google, before FQ/pacing,
      as it reduced a fair amount of bursts.
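
      The replacement check compares the last transmit time against the
      jiffies clock (1 ms granularity at HZ=1000); a sketch:

      /* Avoid bursty behavior: only defer if the last write happened
       * within the current jiffy, i.e. less than ~1 ms ago.
       */
      if ((s32)(tcp_time_stamp - tp->lsndtime) > 0)
              goto send_now;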
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: Yuchung Cheng <ycheng@google.com>
      Signed-off-by: Neal Cardwell <ncardwell@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  15. 10 Feb 2015 (1 commit)
    • ipv4: Namespecify TCP PMTU mechanism · b0f9ca53
      Committed by Fan Du
      Packetization Layer Path MTU Discovery works separately from Path MTU
      Discovery at the IP level, and different net namespaces have different
      requirements on which one to choose; e.g., a virtualized container
      instance would require TCP PMTU to probe a usable effective MTU for
      the underlying tunnel, while the host would employ classical
      ICMP-based PMTU discovery.

      Hence, make the TCP PMTU mechanism per net namespace to decouple the
      two functionalities. Furthermore, the probe base MSS should also be
      configurable separately for each namespace.
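
      For instance, TCP-layer probing might be enabled only inside a
      container's namespace (a usage sketch; "container" is an illustrative
      netns name):

      # inside the container netns: probe the tunnel path at the TCP layer
      ip netns exec container sysctl -w net.ipv4.tcp_mtu_probing=1
      ip netns exec container sysctl -w net.ipv4.tcp_base_mss=1024

      # the host keeps classical ICMP-based PMTU discovery
      sysctl -n net.ipv4.tcp_mtu_probing    # still 0 on the host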
      Signed-off-by: Fan Du <fan.du@intel.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  16. 05 Feb 2015 (1 commit)
    • tcp: do not pace pure ack packets · 98781965
      Committed by Eric Dumazet
      When we added pacing to TCP, we decided to let sch_fq take care
      of the actual pacing.

      All TCP had to do was compute sk->pacing_rate using a simple formula:
      
      sk->pacing_rate = 2 * cwnd * mss / rtt
      
      It works well for senders (bulk flows), but not very well for
      receivers or even RPC:

      cwnd on the receiver can be less than 10, rtt can be around 100 ms, so
      we can end up pacing ACK packets, slowing down the sender.
      
      Really, only the sender should pace, according to its own logic.
      
      Instead of adding a new bit in the skb, or calling yet another flow
      dissection, we tweak skb->truesize to a small value (2), and we
      instruct sch_fq to use a new helper and not pace pure ACKs.
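
      The helper added for sch_fq is essentially a one-liner keyed on that
      sentinel truesize (sketch following the patch):

      /* A pure ACK carries no payload; this patch marks such skbs with
       * truesize == 2 so qdiscs can recognize them and skip pacing.
       */
      static inline bool skb_is_tcp_pure_ack(const struct sk_buff *skb)
      {
              return skb->truesize == 2;
      }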
      
      Note this also helps TCP Small Queues, as ACK packets present in the
      qdisc/NIC do not prevent sending a data packet (RPC workload).
      
      This also reduces TX completion overhead: ACK packets can use regular
      sock_wfree() instead of tcp_wfree(), which is a bit more expensive.
      
      This has no impact when packets are sent over the loopback interface,
      as we do not coalesce ACK packets there (where we would detect the
      skb->truesize lie).
      
      In case netem (with a delay) is used, skb_orphan_partial() also sets
      skb->truesize to 1.
      
      This patch is a combination of two patches we used for about one year at
      Google.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  17. 04 Feb 2015 (1 commit)
    • ip: convert tcp_sendmsg() to iov_iter primitives · 57be5bda
      Committed by Al Viro
      The patch is actually smaller than it seems to be; most of it is
      unindenting the inner loop body in tcp_sendmsg() itself.

      The bit in tcp_input.c is going to get reverted very soon; that's what
      memcpy_from_msg() will become, but not in this commit. Let's keep it
      reasonably contained.
      
      There's one potentially subtle change here: in case of a short copy
      from userland, mainline tcp_send_syn_data() discards the skb it has
      allocated and falls back to the normal path, where we'll send as much
      as possible after re-reading the same data. This patch trims the
      SYN+data skb instead; that way we don't need to copy from the same
      place twice.
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
  18. 06 Jan 2015 (1 commit)
    • net: tcp: add per route congestion control · 81164413
      Committed by Daniel Borkmann
      This work adds the possibility to define a per route/destination
      congestion control algorithm. Generally, this opens up the possibility
      for a machine with different links to enforce specific congestion
      control algorithms with optimal strategies for each of them based
      on their network characteristics, even transparently for a single
      application listening on all links.
      
      For our specific use case, this additionally facilitates deployment
      of DCTCP: for example, applications can easily serve internal
      traffic/dsts with DCTCP and external traffic with CUBIC. Other
      scenarios would also allow for utilizing e.g. long-living,
      low-priority background flows for certain destinations/routes while
      still allowing normal traffic to utilize the default congestion
      control algorithm. We also thought about a per-netns setting (where
      different defaults are possible), but given it's actually a
      link-specific property, we argue that a per-route/destination setting
      is the most natural and flexible.
      
      The administrator can utilize this through ip-route(8) by appending
      "congctl [lock] <name>", where <name> denotes the name of a
      congestion control algorithm and the optional lock parameter allows
      enforcing the given algorithm, so that applications in user space
      are not allowed to override it for that destination.
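
      For example, an administrator might pin DCTCP for an internal prefix
      while external traffic keeps the default (a usage sketch; the prefix
      and device are illustrative):

      # internal destinations: enforce DCTCP, not overridable from user space
      ip route add 10.0.0.0/8 dev eth0 congctl lock dctcp
      ip route show 10.0.0.0/8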
      
      The dst metric lookups are done when a dst entry is already available,
      in order to avoid a costly lookup, and still before the algorithms are
      initialized; thus overhead is very low when the feature is not being
      used. While the client side would need to drop its current reference
      on the module, on the server side this can actually even be avoided,
      as we just got a flat-copied socket clone.
      
      Joint work with Florian Westphal.
      Suggested-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
      Signed-off-by: Florian Westphal <fw@strlen.de>
      Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  19. 03 Jan 2015 (1 commit)
  20. 10 Dec 2014 (2 commits)
    • tcp: refine TSO autosizing · 605ad7f1
      Committed by Eric Dumazet
      Commit 95bd09eb ("tcp: TSO packets automatic sizing") tried to
      control TSO size, but did this at the wrong place (sendmsg() time).

      At sendmsg() time, we might have a pessimistic view of the flow rate,
      and we end up building very small skbs (with 2 MSS per skb).
      
      This is bad because:
      
       - It sends small TSO packets even in Slow Start where rate quickly
         increases.
       - It tends to make socket write queue very big, increasing tcp_ack()
         processing time, but also increasing memory needs, not necessarily
         accounted for, as fast clones overhead is currently ignored.
       - Lower GRO efficiency and more ACK packets.
      
      Servers with a lot of short-lived connections suffer from this.
      
      Let's instead fill skbs as much as possible (64 KB of payload), but
      split them at xmit time, when we have a precise idea of the flow rate.
      The skb split is actually quite efficient.
      
      The patch looks bigger than necessary, because the TCP Small Queues
      decision now has to take place after the eventual split.
      
      As Neal suggested, introduce a new tcp_tso_autosize() helper, so that
      tcp_tso_should_defer() can be synchronized on the same goal.

      Rename tp->xmit_size_goal_segs to tp->gso_segs, as this variable
      contains the number of MSS that we can put in a GSO packet, and is no
      longer related to the autosizing goal.
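
      The new helper, roughly as introduced by the patch (a sketch; see the
      patch itself for the verbatim version):

      static u32 tcp_tso_autosize(const struct sock *sk, unsigned int mss_now)
      {
              u32 bytes, segs;

              /* Aim for ~1 ms worth of data at the current pacing rate,
               * capped by what a single GSO packet may carry.
               */
              bytes = min(sk->sk_pacing_rate >> 10,
                          sk->sk_gso_max_size - 1 - MAX_TCP_HEADER);

              /* But never go below the configured minimum of segments. */
              segs = max_t(u32, bytes / mss_now, sysctl_tcp_min_tso_segs);

              return min_t(u32, segs, sk->sk_gso_max_segs);
      }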
      
      Tested:
      
      40 ms rtt link
      
      nstat >/dev/null
      netperf -H remote -l -2000000 -- -s 1000000
      nstat | egrep "IpInReceives|IpOutRequests|TcpOutSegs|IpExtOutOctets"
      
      Before patch :
      
      Recv   Send    Send
      Socket Socket  Message  Elapsed
      Size   Size    Size     Time     Throughput
      bytes  bytes   bytes    secs.    10^6bits/s
      
       87380 2000000 2000000    0.36         44.22
      IpInReceives                    600                0.0
      IpOutRequests                   599                0.0
      TcpOutSegs                      1397               0.0
      IpExtOutOctets                  2033249            0.0
      
      After patch :
      
      Recv   Send    Send
      Socket Socket  Message  Elapsed
      Size   Size    Size     Time     Throughput
      bytes  bytes   bytes    secs.    10^6bits/sec
      
       87380 2000000 2000000    0.36       44.27
      IpInReceives                    221                0.0
      IpOutRequests                   232                0.0
      TcpOutSegs                      1397               0.0
      IpExtOutOctets                  2013953            0.0
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: Neal Cardwell <ncardwell@google.com>
      Acked-by: Yuchung Cheng <ycheng@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • put iov_iter into msghdr · c0371da6
      Committed by Al Viro
      Note that the code _using_ ->msg_iter at that point will be very
      unhappy with anything other than an unshifted iovec-backed iov_iter.
      We still need to convert users to proper primitives.
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
  21. 20 Nov 2014 (1 commit)
    • tcp: make connect() mem charging friendly · 355a901e
      Committed by Eric Dumazet
      While working on sk_forward_alloc problems reported by Denys
      Fedoryshchenko, we found that tcp connect() (and fastopen) do not call
      sk_wmem_schedule() for the SYN packet (and/or the SYN/DATA packet), so
      sk_forward_alloc is negative while connect() is in progress.
      
      We can fix this by calling the regular sk_stream_alloc_skb() both for
      the SYN packet (in tcp_connect()) and for the syn_data packet in
      tcp_send_syn_data().
      
      Then, tcp_send_syn_data() can avoid copying syn_data, as we can simply
      manipulate syn_data->cb[] to remove the SYN flag (and increment the
      sequence number).

      Instead of open-coding memcpy_fromiovecend(), simply use this helper.

      This leaves clean fast-clone skbs in the socket write queue.
      
      This was tested against our fastopen packetdrill tests.
      Reported-by: Denys Fedoryshchenko <nuclearcat@nuclearcat.com>
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Acked-by: Yuchung Cheng <ycheng@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  22. 14 Nov 2014 (1 commit)
    • tcp: limit GSO packets to half cwnd · d649a7a8
      Committed by Eric Dumazet
      In the DC world, GSO packets initially cooked by tcp_sendmsg() are
      usually big, as sk_pacing_rate is high.

      When the network is congested, cwnd can be smaller than the GSO
      packets found in the socket write queue. tcp_write_xmit() splits GSO
      packets using the available cwnd, and we end up sending a single GSO
      packet, consuming all of the available cwnd.
      
      With GRO aggregation on the receiver, we might handle a single GRO
      packet, sending back a single ACK.
      
      1) This single ACK might be lost;
         TLP or RTO are then forced to attempt a retransmit.
      2) This ACK releases a full cwnd, and the sender sends another big GSO
         packet, in a ping-pong mode.
      
      This behavior does not fill the pipes in the best way, because of
      scheduling artifacts.
      
      Make sure we always have at least two GSO packets in flight.
      
      This allows us to safely increase GRO efficiency without risking
      spurious retransmits.
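
      Conceptually, the GSO sizing gains an upper bound of half the
      congestion window (an illustrative sketch, not the exact patch):

      /* Never let a single GSO packet consume the whole cwnd: cap it
       * at cwnd/2 so at least two packets can be in flight.
       */
      max_segs = min_t(u32, sk->sk_gso_max_segs,
                       max_t(u32, tp->snd_cwnd >> 1, 1));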
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Acked-by: Neal Cardwell <ncardwell@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  23. 05 Nov 2014 (1 commit)
    • net: allow setting ecn via routing table · f7b3bec6
      Committed by Florian Westphal
      This patch allows setting ECN on a per-route basis when the sysctl
      tcp_ecn is not set to 1. In other words, when ECN is enabled for
      specific routes, it provides tcp_ecn=1 behaviour for those routes
      while the rest of the stack acts according to the global settings.
      
      One can use 'ip route change dev $dev $net features ecn' to toggle this.
      
      Having a more fine-grained per-route setting can be beneficial for various
      reasons, for example, 1) within data centers, or 2) local ISPs may deploy
      ECN support for their own video/streaming services [1], etc.
      
      A recent measurement study/paper [2] scanned Alexa's publicly
      available top-million websites list from vantage points in the US,
      Europe and Asia:

      Half of the Alexa list will now happily use ECN (tcp_ecn=2, most
      likely thanks to commit 255cac91 ("tcp: extend ECN sysctl to allow
      server-side only ECN")); an on-path break in connectivity was found in
      about 1 in 10,000 cases. Timeouts, rather than receiving back RSTs,
      were much more common in the negotiation phase (and mostly seen in the
      Alexa middle band, ranks around 50k-150k): of the ~12,000 hosts on
      which there _may_ be ECN-linked connection failures, only 79 failed
      with RST while _not_ failing with RST when ECN is not requested.
      
      It's unclear, though, how much equipment in the wild actually marks CE
      when buffers start to fill up.
      
      We thought about a fallback to non-ECN for retransmitted SYNs as another
      global option (which could perhaps one day be made default), but as Eric
      points out, there's much more work needed to detect broken middleboxes.
      
      Two examples Eric mentioned are buggy firewalls that accept only a single
      SYN per flow, and middleboxes that successfully let an ECN flow establish,
      but later mark CE for all packets (so cwnd converges to 1).
      
       [1] http://www.ietf.org/proceedings/89/slides/slides-89-tsvarea-1.pdf, p.15
       [2] http://ecn.ethz.ch/
      
      Joint work with Daniel Borkmann.
      
      Reference: http://thread.gmane.org/gmane.linux.network/335797
      Suggested-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
      Acked-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
      Signed-off-by: Florian Westphal <fw@strlen.de>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  24. 31 Oct 2014 (1 commit)
  25. 15 Oct 2014 (2 commits)
    • tcp: TCP Small Queues and strange attractors · 9b462d02
      Committed by Eric Dumazet
      TCP Small Queues tries to keep the number of packets in the qdisc as
      small as possible, and depends on a tasklet to feed the following
      packets at TX completion time.
      The choice of a tasklet was driven by latency requirements.
      
      Then, the TCP stack tries to avoid reordering, by locking flows with
      outstanding packets in the qdisc to a given TX queue.
      
      What can happen is that many flows get attracted by a low-performing
      TX queue, and the cpu servicing TX completion has to feed packets for
      all of them, making this cpu 100% busy in softirq mode.
      
      This became particularly visible with the recent skb->xmit_more
      support.
      
      The strategy adopted in this patch is to detect when tcp_wfree() is
      called from ksoftirqd and to let the outstanding queue for this flow
      drain before feeding additional packets, so that skb->ooo_okay can be
      set to allow select_queue() to select the optimal queue.
      
      Incoming ACKs are normally handled by different cpus, so this patch
      gives these cpus a better chance to take over the burden of feeding
      the qdisc with future packets.
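
      The detection in tcp_wfree() is roughly (a sketch close to the patch):

      /* If this softirq is serviced by ksoftirqd, we are likely under
       * stress: wait until our queues (qdisc + device) are drained,
       * giving incoming ACKs (processed by other cpus) a chance to
       * migrate this flow to another TX queue via skb->ooo_okay.
       */
      if (wmem >= SKB_TRUESIZE(1) && this_cpu_ksoftirqd() == current)
              goto out;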
      
      Tested:
      
      lpaa23:~# ./super_netperf 1400 --google-pacing-rate 3028000 -H lpaa24 -l 3600 &
      
      lpaa23:~# sar -n DEV 1 10 | grep eth1
      06:16:18 AM      eth1 595448.00 1190564.00  38381.09 1760253.12      0.00      0.00      1.00
      06:16:19 AM      eth1 594858.00 1189686.00  38340.76 1758952.72      0.00      0.00      0.00
      06:16:20 AM      eth1 597017.00 1194019.00  38480.79 1765370.29      0.00      0.00      1.00
      06:16:21 AM      eth1 595450.00 1190936.00  38380.19 1760805.05      0.00      0.00      0.00
      06:16:22 AM      eth1 596385.00 1193096.00  38442.56 1763976.29      0.00      0.00      1.00
      06:16:23 AM      eth1 598155.00 1195978.00  38552.97 1768264.60      0.00      0.00      0.00
      06:16:24 AM      eth1 594405.00 1188643.00  38312.57 1757414.89      0.00      0.00      1.00
      06:16:25 AM      eth1 593366.00 1187154.00  38252.16 1755195.83      0.00      0.00      0.00
      06:16:26 AM      eth1 593188.00 1186118.00  38232.88 1753682.57      0.00      0.00      1.00
      06:16:27 AM      eth1 596301.00 1192241.00  38440.94 1762733.09      0.00      0.00      0.00
      Average:         eth1 595457.30 1190843.50  38381.69 1760664.84      0.00      0.00      0.50
      lpaa23:~# ./tc -s -d qd sh dev eth1 | grep backlog
       backlog 7606336b 2513p requeues 167982
       backlog 224072b 74p requeues 566
       backlog 581376b 192p requeues 5598
       backlog 181680b 60p requeues 1070
       backlog 5305056b 1753p requeues 110166    // Here, this TX queue is attracting flows
       backlog 157456b 52p requeues 1758
       backlog 672216b 222p requeues 3025
       backlog 60560b 20p requeues 24541
       backlog 448144b 148p requeues 21258
      
      lpaa23:~# echo 1 >/proc/sys/net/ipv4/tcp_tsq_enable_tcp_wfree_ksoftirqd_detect
      
      Immediate jump to full bandwidth, and traffic is properly
      shared across all tx queues.
      
      lpaa23:~# sar -n DEV 1 10 | grep eth1
      06:16:46 AM      eth1 1397632.00 2795397.00  90081.87 4133031.26      0.00      0.00      1.00
      06:16:47 AM      eth1 1396874.00 2793614.00  90032.99 4130385.46      0.00      0.00      0.00
      06:16:48 AM      eth1 1395842.00 2791600.00  89966.46 4127409.67      0.00      0.00      1.00
      06:16:49 AM      eth1 1395528.00 2791017.00  89946.17 4126551.24      0.00      0.00      0.00
      06:16:50 AM      eth1 1397891.00 2795716.00  90098.74 4133497.39      0.00      0.00      1.00
      06:16:51 AM      eth1 1394951.00 2789984.00  89908.96 4125022.51      0.00      0.00      0.00
      06:16:52 AM      eth1 1394608.00 2789190.00  89886.90 4123851.36      0.00      0.00      1.00
      06:16:53 AM      eth1 1395314.00 2790653.00  89934.33 4125983.09      0.00      0.00      0.00
      06:16:54 AM      eth1 1396115.00 2792276.00  89984.25 4128411.21      0.00      0.00      1.00
      06:16:55 AM      eth1 1396829.00 2793523.00  90030.19 4130250.28      0.00      0.00      0.00
      Average:         eth1 1396158.40 2792297.00  89987.09 4128439.35      0.00      0.00      0.50
      
      lpaa23:~# tc -s -d qd sh dev eth1 | grep backlog
       backlog 7900052b 2609p requeues 173287
       backlog 878120b 290p requeues 589
       backlog 1068884b 354p requeues 5621
       backlog 996212b 329p requeues 1088
       backlog 984100b 325p requeues 115316
       backlog 956848b 316p requeues 1781
       backlog 1080996b 357p requeues 3047
       backlog 975016b 322p requeues 24571
       backlog 990156b 327p requeues 21274
      
      (All 8 TX queues get a fair share of the traffic)
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • tcp: fix ooo_okay setting vs Small Queues · b2532eb9
      Committed by Eric Dumazet
      TCP Small Queues (tcp_tsq_handler()) can hold one reference on
      sk->sk_wmem_alloc, preventing skb->ooo_okay from being set.
      
      We should relax the test done to set skb->ooo_okay so that it
      accounts for this extra reference.
      
      The minimal truesize of an skb containing one byte of payload is
      SKB_TRUESIZE(1).
      
      Without this fix, we have a greater chance of locking flows onto the
      wrong transmit queue.
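
      The relaxed test is essentially a one-line change (a sketch following
      the patch description):

      /* Allow ooo_okay even while TSQ holds its own reference: anything
       * below the truesize of a one-byte skb means no real payload is
       * outstanding in the qdisc/NIC queues.
       */
      skb->ooo_okay = sk_wmem_alloc_get(sk) < SKB_TRUESIZE(1);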
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  26. 02 Oct 2014 (1 commit)
  27. 30 Sep 2014 (1 commit)