1. 12 Aug 2018, 1 commit
  2. 11 May 2018, 1 commit
  3. 01 May 2018, 1 commit
  4. 01 Mar 2018, 1 commit
  5. 12 Feb 2018, 1 commit
    • vfs: do bulk POLL* -> EPOLL* replacement · a9a08845
      Committed by Linus Torvalds
      This is the mindless scripted replacement of kernel use of POLL*
      variables as described by Al, done by this script:
      
          for V in IN OUT PRI ERR RDNORM RDBAND WRNORM WRBAND HUP RDHUP NVAL MSG; do
              L=`git grep -l -w POLL$V | grep -v '^t' | grep -v /um/ | grep -v '^sa' | grep -v '/poll.h$'|grep -v '^D'`
              for f in $L; do sed -i "-es/^\([^\"]*\)\(\<POLL$V\>\)/\\1E\\2/" $f; done
          done
      
      with de-mangling cleanups yet to come.
      
      NOTE! On almost all architectures, the EPOLL* constants have the same
      values as the POLL* constants do.  But the keyword here is "almost".
      For various bad reasons they aren't the same, and epoll() doesn't
      actually work quite correctly in some cases due to this on Sparc et al.
      
      The next patch from Al will sort out the final differences, and we
      should be all done.
      Scripted-by: Al Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      a9a08845
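      As editorial context, a minimal sketch of what a converted driver ->poll
      method looks like after this series; the device structure and its fields
      are hypothetical, only the __poll_t/EPOLL* usage mirrors the change:

          #include <linux/poll.h>

          /* Hypothetical device, not from the patch; just enough for the sketch. */
          struct mydev {
                  wait_queue_head_t waitq;
                  bool data_ready;
                  bool error;
          };

          static __poll_t mydev_poll(struct file *file, poll_table *wait)
          {
                  struct mydev *dev = file->private_data;
                  __poll_t mask = 0;

                  poll_wait(file, &dev->waitq, wait);
                  if (dev->data_ready)
                          mask |= EPOLLIN | EPOLLRDNORM;  /* was POLLIN | POLLRDNORM */
                  if (dev->error)
                          mask |= EPOLLERR;               /* was POLLERR */
                  return mask;
          }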
  6. 03 Dec 2017, 1 commit
    • inet: Add a 2nd listener hashtable (port+addr) · 61b7c691
      Committed by Martin KaFai Lau
      The current listener hashtable is hashed by port only.
      When a process is listening on many IP addresses with the same port (e.g.
      [IP1]:443, [IP2]:443... [IPN]:443), the inet[6]_lookup_listener()
      performance degrades to a linked-list walk.  It is prone to SYN attacks.
      
      UDP had a similar issue and a second hashtable was added to resolve it.
      
      This patch adds a second hashtable for the listener's sockets.
      The second hashtable is hashed by port and address.
      
      It cannot reuse the existing skc_portaddr_node which is shared
      with skc_bind_node.  TCP listener needs to use skc_bind_node.
      Instead, this patch adds a hlist_node 'icsk_listen_portaddr_node' to
      the inet_connection_sock which the listener (like TCP) also belongs to.
      
      The new portaddr hashtable may need two lookups (first by IP:PORT,
      then by INADDR_ANY:PORT if IP:PORT is not found).  Hence,
      it implements a cutoff similar to UDP's, such that it will only consult
      the new portaddr hashtable if the current port-only hashtable has more
      than 10 sk in its linked list.
      
      lhash2 and lhash2_mask are added to 'struct inet_hashinfo'.  I take
      this chance to plug a 4-byte hole.  It is done by first moving
      the existing bind_bucket_cachep up and then adding the new members
      (int lhash2_mask, *lhash2) after the existing bhash_size.
      Signed-off-by: Martin KaFai Lau <kafai@fb.com>
      Reviewed-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      61b7c691
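      The lookup policy described above reduces to roughly the following shape;
      a sketch with illustrative helper names and an illustrative chain counter,
      not the exact kernel functions:

          #define PORTADDR_CUTOFF 10  /* consult lhash2 only past this chain length */

          static struct sock *listener_lookup_sketch(struct inet_hashinfo *h,
                                                     __be32 daddr, unsigned short hnum)
          {
                  struct inet_listen_hashbucket *ilb = port_only_bucket(h, hnum);
                  struct sock *sk;

                  if (ilb->count <= PORTADDR_CUTOFF)
                          return walk_port_chain(ilb, daddr, hnum);

                  /* Two lookups: first IP:PORT, then INADDR_ANY:PORT. */
                  sk = walk_lhash2(h, daddr, hnum);
                  return sk ? sk : walk_lhash2(h, htonl(INADDR_ANY), hnum);
          }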
  7. 28 Nov 2017, 1 commit
  8. 18 Oct 2017, 1 commit
  9. 16 Jun 2017, 1 commit
    • tcp: ULP infrastructure · 734942cc
      Committed by Dave Watson
      Add the infrastructure for attaching Upper Layer Protocols (ULPs) over TCP
      sockets, based on a similar infrastructure in tcp_cong.  The idea is that any
      ULP can add its own logic by changing the TCP proto_ops structure to its own
      methods.
      
      Example usage:
      
      setsockopt(sock, SOL_TCP, TCP_ULP, "tls", sizeof("tls"));
      
      modules will call:
      tcp_register_ulp(&tcp_tls_ulp_ops);
      
      to register/unregister their ulp, with an init function and name.
      
      A list of registered ulps will be returned by tcp_get_available_ulp, which is
      hooked up to /proc.  Example:
      
      $ cat /proc/sys/net/ipv4/tcp_available_ulp
      tls
      
      There is currently no functionality to remove or chain ULPs, but
      it should be possible to add these in the future if needed.
      Signed-off-by: Boris Pismenny <borisp@mellanox.com>
      Signed-off-by: Dave Watson <davejwatson@fb.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      734942cc
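      For the module side, a sketch of registering a ULP so the setsockopt above
      can find it by name; struct tcp_ulp_ops and tcp_register_ulp() /
      tcp_unregister_ulp() are from this patch, the init body is a placeholder:

          #include <linux/module.h>
          #include <net/tcp.h>

          static int my_ulp_init(struct sock *sk)
          {
                  /* Point the socket's proto_ops at this ULP's methods here. */
                  return 0;
          }

          static struct tcp_ulp_ops my_ulp_ops __read_mostly = {
                  .name  = "myulp",
                  .owner = THIS_MODULE,
                  .init  = my_ulp_init,
          };

          static int __init my_ulp_module_init(void)
          {
                  return tcp_register_ulp(&my_ulp_ops);
          }

          static void __exit my_ulp_module_exit(void)
          {
                  tcp_unregister_ulp(&my_ulp_ops);
          }

          module_init(my_ulp_module_init);
          module_exit(my_ulp_module_exit);
          MODULE_LICENSE("GPL");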
  10. 10 Mar 2017, 1 commit
    • net: Work around lockdep limitation in sockets that use sockets · cdfbabfb
      Committed by David Howells
      Lockdep issues a circular dependency warning when AFS issues an operation
      through AF_RXRPC from a context in which the VFS/VM holds the mmap_sem.
      
      The theory lockdep comes up with is as follows:
      
       (1) If the pagefault handler decides it needs to read pages from AFS, it
           calls AFS with mmap_sem held and AFS begins an AF_RXRPC call, but
           creating a call requires the socket lock:
      
      	mmap_sem must be taken before sk_lock-AF_RXRPC
      
       (2) afs_open_socket() opens an AF_RXRPC socket and binds it.  rxrpc_bind()
           binds the underlying UDP socket whilst holding its socket lock.
           inet_bind() takes its own socket lock:
      
      	sk_lock-AF_RXRPC must be taken before sk_lock-AF_INET
      
       (3) Reading from a TCP socket into a userspace buffer might cause a fault
           and thus cause the kernel to take the mmap_sem, but the TCP socket is
           locked whilst doing this:
      
      	sk_lock-AF_INET must be taken before mmap_sem
      
      However, lockdep's theory is wrong in this instance because it deals only
      with lock classes and not individual locks.  The AF_INET lock in (2) isn't
      really equivalent to the AF_INET lock in (3) as the former deals with a
      socket entirely internal to the kernel that never sees userspace.  This is
      a limitation in the design of lockdep.
      
      Fix the general case by:
      
       (1) Double up all the locking keys used in sockets so that one set are
           used if the socket is created by userspace and the other set is used
           if the socket is created by the kernel.
      
       (2) Store the kern parameter passed to sk_alloc() in a variable in the
           sock struct (sk_kern_sock).  This informs sock_lock_init(),
           sock_init_data() and sk_clone_lock() as to the lock keys to be used.
      
           Note that the child created by sk_clone_lock() inherits the parent's
           kern setting.
      
       (3) Add a 'kern' parameter to ->accept() that is analogous to the one
           passed in to ->create() that distinguishes whether kernel_accept() or
           sys_accept4() was the caller and can be passed to sk_alloc().
      
           Note that a lot of accept functions merely dequeue an already
           allocated socket.  I haven't touched these as the new socket already
           exists before we get the parameter.
      
           Note also that there are a couple of places where I've made the accepted
           socket unconditionally kernel-based:
      
      	irda_accept()
	rds_tcp_accept_one()
      	tcp_accept_from_sock()
      
           because they follow a sock_create_kern() and accept off of that.
      
      Whilst creating this, I noticed that lustre and ocfs don't create sockets
      through sock_create_kern() and thus they aren't marked as for-kernel,
      though they appear to be internal.  I wonder if these should do that so
      that they use the new set of lock keys.
      Signed-off-by: David Howells <dhowells@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      cdfbabfb
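      A condensed sketch of points (1) and (2): two per-family lock-class key
      sets, chosen by the stored kern flag. The key arrays and helper are
      illustrative; sk->sk_kern_sock is the field added by the patch:

          static struct lock_class_key kern_sk_keys[AF_MAX];
          static struct lock_class_key user_sk_keys[AF_MAX];

          static void sock_lock_init_sketch(struct sock *sk, bool kern)
          {
                  struct lock_class_key *keys = kern ? kern_sk_keys : user_sk_keys;

                  sk->sk_kern_sock = kern;  /* consulted again by sk_clone_lock() */
                  lockdep_set_class(&sk->sk_lock.slock, &keys[sk->sk_family]);
          }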
  11. 19 Jan 2017, 1 commit
  12. 14 Jan 2017, 1 commit
    • tcp: add reordering timer in RACK loss detection · 57dde7f7
      Committed by Yuchung Cheng
      This patch makes RACK install a reordering timer when it suspects
      some packets might be lost, but wants to delay the decision
      a little bit to accommodate reordering.
      
      It does not create a new timer but instead repurposes the existing
      RTO timer, because both are meant to retransmit packets.
      Specifically it arms a timer ICSK_TIME_REO_TIMEOUT when
      the RACK timing check fails. The wait time is set to
      
        RACK.RTT + RACK.reo_wnd - (NOW - Packet.xmit_time) + fudge
      
      This translates to expecting that a packet should take at most
      (RACK.RTT + RACK.reo_wnd + fudge) to be delivered after it was sent.
      
      When there are multiple packets that need a timer, we use one timer
      with the maximum timeout. Therefore the timer conservatively uses
      the maximum window to expire N packets by one timeout, instead of
      N timeouts to expire N packets sent at different times.
      
      The fudge factor is 2 jiffies to ensure that when the timer fires, all
      the suspected packets would exceed the deadline and be marked lost
      by tcp_rack_detect_loss(). It has to be at least 1 jiffy because the
      clock may tick between calling icsk_reset_xmit_timer(timeout) and
      actually arming the timer. The second jiffy lower-bounds the timeout
      to 2 jiffies when reo_wnd is < 1ms.
      
      When the reordering timer fires (tcp_rack_reo_timeout): If we aren't
      in Recovery we'll enter fast recovery and force fast retransmit.
      This is very similar to the early retransmit (RFC5827) except RACK
      is not constrained to only enter recovery for small outstanding
      flights.
      Signed-off-by: Yuchung Cheng <ycheng@google.com>
      Signed-off-by: Neal Cardwell <ncardwell@google.com>
      Acked-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      57dde7f7
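      The wait-time formula above, sketched as a plain helper in jiffies
      (illustrative, not the kernel function):

          #define RACK_FUDGE 2  /* jiffies; see the reasoning above */

          static unsigned long rack_reo_timeout_sketch(unsigned long rack_rtt,
                                                       unsigned long reo_wnd,
                                                       unsigned long now,
                                                       unsigned long xmit_time)
          {
                  unsigned long deadline = rack_rtt + reo_wnd + RACK_FUDGE;
                  unsigned long elapsed = now - xmit_time;

                  /* RACK.RTT + RACK.reo_wnd - (NOW - Packet.xmit_time) + fudge,
                   * clamped so the timer always waits at least one tick. */
                  return deadline > elapsed ? deadline - elapsed : 1;
          }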
  13. 18 Dec 2016, 1 commit
    • inet: Fix get port to handle zero port number with soreuseport set · 0643ee4f
      Committed by Tom Herbert
      A user may call listen without binding an explicit port, with the intent
      that the kernel will assign an available port to the socket. In this
      case inet_csk_get_port does a port scan. For such sockets, the user may
      also set soreuseport with the intent of creating more sockets for the
      port that is selected. The problem is that the initial socket being
      opened could inadvertently choose an existing and unrelated port
      number that was already created with soreuseport.
      
      This patch adds a boolean parameter to inet_bind_conflict that indicates
      whether soreuseport is allowed for the check (in addition to
      sk->sk_reuseport). In calls to inet_bind_conflict from inet_csk_get_port
      the argument is set to true if an explicit port is being looked up (the
      snum argument is nonzero), and to false if a port scan is done.
      Signed-off-by: Tom Herbert <tom@herbertland.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      0643ee4f
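      The rule boils down to a single predicate; a sketch with an illustrative
      name:

          /* soreuseport may satisfy the bind-conflict check only when the user
           * asked for a specific port; a kernel port scan (snum == 0) must not
           * silently join an existing reuseport group. */
          static bool reuseport_ok_sketch(unsigned short snum)
          {
                  return snum != 0;
          }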
  14. 30 Oct 2016, 1 commit
  15. 21 Sep 2016, 1 commit
  16. 19 Feb 2016, 1 commit
    • tcp/dccp: fix another race at listener dismantle · 7716682c
      Committed by Eric Dumazet
      Ilya reported the following lockdep splat:
      
      kernel: =========================
      kernel: [ BUG: held lock freed! ]
      kernel: 4.5.0-rc1-ceph-00026-g5e0a311 #1 Not tainted
      kernel: -------------------------
      kernel: swapper/5/0 is freeing memory
      ffff880035c9d200-ffff880035c9dbff, with a lock still held there!
      kernel: (&(&queue->rskq_lock)->rlock){+.-...}, at:
      [<ffffffff816f6a88>] inet_csk_reqsk_queue_add+0x28/0xa0
      kernel: 4 locks held by swapper/5/0:
      kernel: #0:  (rcu_read_lock){......}, at: [<ffffffff8169ef6b>]
      netif_receive_skb_internal+0x4b/0x1f0
      kernel: #1:  (rcu_read_lock){......}, at: [<ffffffff816e977f>]
      ip_local_deliver_finish+0x3f/0x380
      kernel: #2:  (slock-AF_INET){+.-...}, at: [<ffffffff81685ffb>]
      sk_clone_lock+0x19b/0x440
      kernel: #3:  (&(&queue->rskq_lock)->rlock){+.-...}, at:
      [<ffffffff816f6a88>] inet_csk_reqsk_queue_add+0x28/0xa0
      
      To properly fix this issue, inet_csk_reqsk_queue_add() needs
      to tell its callers whether the child has been queued
      into the accept queue.
      
      We also need to make sure the listener is still there before
      calling sk->sk_data_ready(), by holding a reference on it,
      since the reference carried by the child can disappear as
      soon as the child is put on the accept queue.
      Reported-by: Ilya Dryomov <idryomov@gmail.com>
      Fixes: ebb516af ("tcp/dccp: fix race at listener dismantle phase")
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      7716682c
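      The resulting calling pattern, sketched; queue_child_sketch() stands in
      for the reworked inet_csk_reqsk_queue_add(), and names are illustrative:

          static void deliver_child_sketch(struct sock *listener,
                                           struct request_sock *req,
                                           struct sock *child)
          {
                  sock_hold(listener);  /* child's ref may vanish once queued */
                  if (queue_child_sketch(listener, req, child))
                          listener->sk_data_ready(listener);
                  else
                          sock_put(child);  /* listener was being dismantled */
                  sock_put(listener);
          }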
  17. 23 Oct 2015, 1 commit
    • tcp/dccp: fix hashdance race for passive sessions · 5e0724d0
      Committed by Eric Dumazet
      Multiple cpus can process duplicates of incoming ACK messages
      matching a SYN_RECV request socket. This is a rare event under
      normal operations, but definitely can happen.
      
      Only one must win the race, otherwise corruption would occur.
      
      To fix this without adding new atomic ops, we use logic in
      inet_ehash_nolisten() to detect whether the request was still present
      in the same ehash bucket where we try to insert the new child.

      If the request socket was not found, we have to undo the child creation.
      
      This actually removes a spin_lock()/spin_unlock() pair in
      reqsk_queue_unlink() for the fast path.
      
      Fixes: e994b2f0 ("tcp: do not lock listener to process SYN packets")
      Fixes: 079096f1 ("tcp/dccp: install syn_recv requests into ehash table")
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      5e0724d0
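      The detection logic, sketched with illustrative helpers: insertion
      succeeds only if the request is still in the bucket, so exactly one cpu
      wins the race and the others undo their child:

          /* Returns the winning child, or NULL for the losing cpu(s). */
          static struct sock *hashdance_sketch(struct request_sock *req,
                                               struct sock *child)
          {
                  /* Insert succeeds only if req is still in the same ehash
                   * bucket; duplicates see a failure and undo the child. */
                  if (ehash_insert_if_req_present_sketch(child, req))
                          return child;

                  undo_child_create_sketch(child);
                  return NULL;
          }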
  18. 16 Oct 2015, 2 commits
  19. 15 Oct 2015, 1 commit
  20. 03 Oct 2015, 3 commits
  21. 30 Sep 2015, 2 commits
  22. 26 Sep 2015, 1 commit
  23. 01 Jun 2015, 1 commit
    • tcp: fix child sockets to use system default congestion control if not set · 9f950415
      Committed by Neal Cardwell
      Linux 3.17 and earlier are explicitly engineered so that if the app
      doesn't specifically request a CC module on a listener before the SYN
      arrives, then the child gets the system default CC when the connection
      is established. See tcp_init_congestion_control() in 3.17 or earlier,
      which says "if no choice made yet assign the current value set as
      default". The change ("net: tcp: assign tcp cong_ops when tcp sk is
      created") altered these semantics, so that children got their parent
      listener's congestion control even if the system default had changed
      after the listener was created.
      
      This commit returns to those original semantics from 3.17 and earlier,
      since they are the original semantics from 2007 in 4d4d3d1e ("[TCP]:
      Congestion control initialization."), and some Linux congestion
      control workflows depend on that.
      
      In summary, if a listener socket specifically sets TCP_CONGESTION to
      "x", or the route locks the CC module to "x", then the child gets
      "x". Otherwise the child gets current system default from
      net.ipv4.tcp_congestion_control. That's the behavior in 3.17 and
      earlier, and this commit restores that.
      
      Fixes: 55d8694f ("net: tcp: assign tcp cong_ops when tcp sk is created")
      Cc: Florian Westphal <fw@strlen.de>
      Cc: Daniel Borkmann <dborkman@redhat.com>
      Cc: Glenn Judd <glenn.judd@morganstanley.com>
      Cc: Stephen Hemminger <stephen@networkplumber.org>
      Signed-off-by: Neal Cardwell <ncardwell@google.com>
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: Yuchung Cheng <ycheng@google.com>
      Acked-by: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      9f950415
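      The restored rule, sketched; the flag and helper names are illustrative
      (the kernel tracks whether CC was explicitly set on the listener):

          static void child_cc_init_sketch(struct sock *child,
                                           bool listener_set_tcp_congestion,
                                           bool route_locked_ca)
          {
                  /* Inherit only an explicit choice; otherwise take the current
                   * net.ipv4.tcp_congestion_control default. */
                  if (listener_set_tcp_congestion || route_locked_ca)
                          keep_parent_ca_sketch(child);
                  else
                          assign_system_default_ca_sketch(child);
          }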
  24. 19 May 2015, 1 commit
  25. 24 Apr 2015, 1 commit
    • inet: fix possible panic in reqsk_queue_unlink() · b357a364
      Committed by Eric Dumazet
      [ 3897.923145] BUG: unable to handle kernel NULL pointer dereference at
       0000000000000080
      [ 3897.931025] IP: [<ffffffffa9f27686>] reqsk_timer_handler+0x1a6/0x243
      
      There is a race when reqsk_timer_handler() and tcp_check_req() call
      inet_csk_reqsk_queue_unlink() on the same req at the same time.
      
      Before commit fa76ce73 ("inet: get rid of central tcp/dccp listener
      timer"), listener spinlock was held and race could not happen.
      
      To solve this bug, we change reqsk_queue_unlink() to not assume the req
      must be found, and to return a status, so callers can conditionally
      release a refcount on the request sock.
      
      This also means that tcp_check_req(), in the non-fastopen case, might or
      might not consume the req refcount, so tcp_v6_hnd_req() & tcp_v4_hnd_req()
      have to handle this properly.
      
      (Same remark for dccp_check_req() and its callers)
      
      inet_csk_reqsk_queue_drop() is now too big to be inlined, as it is
      called 4 times in tcp and 3 times in dccp.
      
      Fixes: fa76ce73 ("inet: get rid of central tcp/dccp listener timer")
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Reported-by: Yuchung Cheng <ycheng@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      b357a364
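      The shape of the fix, sketched with illustrative helpers around the real
      reqsk_put() and rsk_timer pieces:

          static bool reqsk_queue_unlink_sketch(struct request_sock *req)
          {
                  bool found = ehash_remove_sketch(req);  /* may already be gone */

                  if (timer_pending(&req->rsk_timer))
                          del_timer_sync(&req->rsk_timer);
                  return found;
          }

          static void reqsk_queue_drop_sketch(struct sock *listener,
                                              struct request_sock *req)
          {
                  if (reqsk_queue_unlink_sketch(req)) {
                          reqsk_queue_removed_sketch(listener, req);
                          reqsk_put(req);  /* release only if we unlinked it */
                  }
          }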
  26. 21 Mar 2015, 2 commits
    • inet: get rid of central tcp/dccp listener timer · fa76ce73
      Committed by Eric Dumazet
      One of the major issues for TCP is the SYNACK rtx handling,
      done by inet_csk_reqsk_queue_prune(), fired by the keepalive
      timer of a TCP_LISTEN socket.

      This function runs for awfully long times, with the socket lock held,
      meaning that other cpus needing this lock have to spin for hundreds of ms.
      
      SYNACKs are sent in huge bursts, likely to cause severe drops anyway.
      
      This model was OK 15 years ago when memory was very tight.
      
      We now can afford to have a timer per request sock.
      
      Timer invocations no longer need to lock the listener,
      and can be run from all cpus in parallel.
      
      With the following patch increasing the somaxconn width to 32 bits,
      I tested a listener with more than 4 million active request sockets,
      and a steady SYN flood of ~200,000 SYN per second.
      The host was sending ~830,000 SYNACK per second.
      
      This is ~100 times more than what we could achieve before this patch.
      
      Later, we will get rid of the listener hash and use ehash instead.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      fa76ce73
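      The per-request timer, sketched with today's timer API for readability
      (timer_setup()/from_timer() postdate this patch, which used setup_timer();
      the handler body is illustrative):

          static void reqsk_timer_sketch(struct timer_list *t)
          {
                  struct request_sock *req = from_timer(req, t, rsk_timer);

                  /* Retransmit the SYNACK or expire the request; no listener
                   * lock is taken, so handlers run on all cpus in parallel. */
                  syn_ack_rtx_or_expire_sketch(req);
          }

          static void reqsk_arm_timer_sketch(struct request_sock *req,
                                             unsigned long timeout)
          {
                  timer_setup(&req->rsk_timer, reqsk_timer_sketch, TIMER_PINNED);
                  mod_timer(&req->rsk_timer, jiffies + timeout);
          }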
    • inet: drop prev pointer handling in request sock · 52452c54
      Committed by Eric Dumazet
      When request socks are put in the ehash table, the whole notion
      of having a previous request to update dl_next is pointless.

      Also, the following patch will get rid of the big purge timer,
      so we want to delete a request sock without holding the listener lock.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      52452c54
  27. 18 Mar 2015, 1 commit
    • inet: fix request sock refcounting · 0470c8ca
      Committed by Eric Dumazet
      While testing the last patch series, I found req sock refcounting was wrong.
      
      We must set skc_refcnt to 1 for all request socks added in hashes,
      but also on request sockets created by FastOpen or syncookies.
      
      It is tricky because we need to defer this initialization so that
      future RCU lookups do not try to take a refcount on a not yet
      fully initialized request socket.
      
      Also get rid of ireq_refcnt alias.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Fixes: 13854e5a ("inet: add proper refcounting to request sock")
      Signed-off-by: David S. Miller <davem@davemloft.net>
      0470c8ca
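      The ordering constraint, sketched: publish the refcount only after the
      request sock is fully initialized and just before hashing it, so a
      concurrent RCU lookup never takes a reference on a half-built req.
      Helpers are illustrative; the 2015 code used atomic_set() rather than
      refcount_set():

          static void publish_reqsk_sketch(struct request_sock *req)
          {
                  init_reqsk_fields_sketch(req);      /* all fields first */

                  /* Paired with the lookup side: the refcount becomes visible
                   * only after the fields it guards are written. */
                  smp_wmb();
                  refcount_set(&req->rsk_refcnt, 1);

                  hash_insert_reqsk_sketch(req);      /* now RCU-visible */
          }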
  28. 17 Mar 2015, 1 commit
  29. 07 Mar 2015, 1 commit
    • ipv4: Create probe timer for tcp PMTU as per RFC4821 · 05cbc0db
      Committed by Fan Du
      As per RFC4821 7.3 ("Selecting Probe Size"), a probe timer should
      be armed once probing has converged. Once this timer expires, we
      probe again to take advantage of any path PMTU change. The
      recommended probing interval is 10 minutes per RFC1981. The probing
      interval can be tuned via sysctl_tcp_probe_interval.
      
      Eric Dumazet suggested implementing a pseudo timer based on the 32-bit
      jiffies tcp_time_stamp instead of using a classic timer for such a
      rare event.
      Signed-off-by: Fan Du <fan.du@intel.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      05cbc0db
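      A sketch of the pseudo-timer check: record a 32-bit timestamp when probing
      converges and compare on later sends, wrap-safe, rather than running a
      real timer (names illustrative):

          static bool pmtu_reprobe_due_sketch(u32 now, u32 probe_timestamp,
                                              u32 probe_interval)
          {
                  /* 32-bit wrap-safe comparison, same trick as tcp_time_stamp. */
                  return (s32)(now - (probe_timestamp + probe_interval)) >= 0;
          }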
  30. 06 Jan 2015, 1 commit
    • net: tcp: add key management to congestion control · c5c6a8ab
      Committed by Daniel Borkmann
      This patch adds the necessary infrastructure to the congestion control
      framework for later per route congestion control support.

      For a per route congestion control possibility, our aim is to store
      a unique u32 key identifier into dst metrics, which can then be
      mapped into a tcp_congestion_ops struct. We argue that having a
      RTAX key entry is the simplest, most generic and easiest way to manage,
      and it also keeps the memory footprint of dst entries lower on 64 bit
      than storing a pointer directly, for example. Having a unique
      key id also allows for decoupling actual TCP congestion control
      module management from the FIB layer, i.e. we don't have to care
      about expensive module refcounting inside the FIB at this point.
      
      We first thought of using an IDR store for the realization, which
      takes over dynamic assignment of unused key space and also performs
      the key to pointer mapping in RCU. While doing so, we stumbled upon
      the issue that, due to the nature of dynamic key distribution, it
      can happen, albeit in very rare cases, that excessive module loads
      and unloads lead to a possible reuse of previously used key space.
      Thus, previously stale keys in the dst metric are then reassigned
      to a different congestion control algorithm, which might lead to
      unexpected behaviour. One way to resolve this would have been to
      walk FIBs on the actually rare occasion of a module unload and
      reset the metric keys for each FIB in each netns, but that's just
      very costly.
      
      Therefore, we argue a better solution is to reuse the unique
      congestion control algorithm name member and map that into u32 key
      space through jhash. For that, we split the flags attribute (as it
      currently uses 2 bits only anyway) into two u32 attributes, flags
      and key, so that we can keep the cacheline boundary of 2 cachelines
      on x86_64 and cache the precalculated key at registration time for
      the fast path. On average we might expect 2 - 4 modules being loaded,
      worst case perhaps 15, so the key collision possibility is extremely
      low, and guaranteed collision-free on LE/BE for all in-tree modules.
      Overall this results in much simpler code, and all without the
      overhead of an IDR. Due to the deterministic nature, modules can
      now be unloaded; the congestion control algorithm for a specific
      but unloaded key will fall back to the default one, and on module
      reload it will switch back to the expected algorithm
      transparently.
      
      Joint work with Florian Westphal.
      Signed-off-by: Florian Westphal <fw@strlen.de>
      Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      c5c6a8ab
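      The name-to-key mapping, sketched with the real jhash() helper; the exact
      hash inputs in the patch may differ, the point is a deterministic u32 per
      algorithm name, precomputed at registration time:

          #include <linux/jhash.h>
          #include <linux/string.h>

          static u32 tcp_ca_key_sketch(const char *name)
          {
                  /* Deterministic: the same algorithm name always yields the
                   * same u32 key, across module unload/reload. */
                  return jhash(name, strlen(name), 0);
          }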
  31. 23 Sep 2014, 1 commit
  32. 15 Aug 2014, 1 commit
  33. 16 Apr 2014, 1 commit
  34. 22 Sep 2013, 1 commit
  35. 12 Mar 2013, 1 commit
    • tcp: Tail loss probe (TLP) · 6ba8a3b1
      Committed by Nandita Dukkipati
      This patch series implements the Tail loss probe (TLP) algorithm described
      in http://tools.ietf.org/html/draft-dukkipati-tcpm-tcp-loss-probe-01. The
      first patch implements the basic algorithm.
      
      TLP's goal is to reduce tail latency of short transactions. It achieves
      this by converting retransmission timeouts (RTOs) occurring due
      to tail losses (losses at the end of transactions) into fast recovery.
      TLP transmits one packet in two round-trips when a connection is in
      Open state and isn't receiving any ACKs. The transmitted packet, aka
      loss probe, can be either new or a retransmission. When there is tail
      loss, the ACK from a loss probe triggers FACK/early-retransmit based
      fast recovery, thus avoiding a costly RTO. In the absence of loss,
      there is no change in the connection state.
      
      PTO stands for probe timeout. It is a timer event indicating
      that an ACK is overdue and triggers a loss probe packet. The PTO value
      is set to max(2*SRTT, 10ms) and is adjusted to account for the delayed
      ACK timer when there is only one outstanding packet.
      
      TLP Algorithm
      
      On transmission of new data in Open state:
        -> packets_out > 1: schedule PTO in max(2*SRTT, 10ms).
        -> packets_out == 1: schedule PTO in max(2*RTT, 1.5*RTT + 200ms)
        -> PTO = min(PTO, RTO)
      
      Conditions for scheduling PTO:
        -> Connection is in Open state.
        -> Connection is either cwnd limited or no new data to send.
        -> Number of probes per tail loss episode is limited to one.
        -> Connection is SACK enabled.
      
      When PTO fires:
        new_segment_exists:
          -> transmit new segment.
          -> packets_out++. cwnd remains same.
      
        no_new_packet:
          -> retransmit the last segment.
             Its ACK triggers FACK or early retransmit based recovery.
      
      ACK path:
        -> rearm RTO at start of ACK processing.
        -> reschedule PTO if need be.
      
      In addition, the patch includes a small variation to the Early Retransmit
      (ER) algorithm, such that ER and TLP together can in principle recover any
      N-degree of tail loss through fast recovery. TLP is controlled by the same
      sysctl as ER, the tcp_early_retrans sysctl.
      tcp_early_retrans==0; disables TLP and ER.
      		 ==1; enables RFC5827 ER.
      		 ==2; delayed ER.
      		 ==3; TLP and delayed ER. [DEFAULT]
      		 ==4; TLP only.
      
      The TLP patch series has been extensively tested on Google Web servers.
      It is most effective for short Web transactions, where it reduced RTOs by 15%
      and improved HTTP response time (on average by 6%, at the 99th percentile by 10%).
      The transmitted probes account for <0.5% of the overall transmissions.
      Signed-off-by: Nandita Dukkipati <nanditad@google.com>
      Acked-by: Neal Cardwell <ncardwell@google.com>
      Acked-by: Yuchung Cheng <ycheng@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      6ba8a3b1
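      The PTO arithmetic from the algorithm above, as a sketch in milliseconds
      (plain helper, not the kernel code):

          static unsigned int tlp_pto_ms_sketch(unsigned int srtt_ms,
                                                unsigned int rto_ms,
                                                unsigned int packets_out)
          {
                  unsigned int pto;

                  if (packets_out > 1) {
                          pto = 2 * srtt_ms;
                          if (pto < 10)
                                  pto = 10;           /* max(2*SRTT, 10ms) */
                  } else {
                          unsigned int dack = (3 * srtt_ms) / 2 + 200;

                          pto = 2 * srtt_ms;          /* lone packet: allow for */
                          if (pto < dack)             /* the delayed-ACK timer, */
                                  pto = dack;         /* max(2*RTT, 1.5*RTT+200ms) */
                  }
                  return pto < rto_ms ? pto : rto_ms; /* PTO = min(PTO, RTO) */
          }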