1. 03 12月, 2016 1 次提交
    • F
      tcp: randomize tcp timestamp offsets for each connection · 95a22cae
      Florian Westphal 提交于
      jiffies based timestamps allow for easy inference of number of devices
      behind NAT translators and also makes tracking of hosts simpler.
      
      commit ceaa1fef ("tcp: adding a per-socket timestamp offset")
      added the main infrastructure that is needed for per-connection ts
      randomization, in particular writing/reading the on-wire tcp header
      format takes the offset into account so rest of stack can use normal
      tcp_time_stamp (jiffies).
      
      So only two items are left:
       - add a tsoffset for request sockets
       - extend the tcp isn generator to also return another 32bit number
         in addition to the ISN.
      
      Re-use of ISN generator also means timestamps are still monotonically
      increasing for same connection quadruple, i.e. PAWS will still work.
      
      Includes fixes from Eric Dumazet.
      Signed-off-by: NFlorian Westphal <fw@strlen.de>
      Acked-by: NEric Dumazet <edumazet@google.com>
      Acked-by: NYuchung Cheng <ycheng@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      95a22cae
  2. 30 11月, 2016 2 次提交
  3. 22 11月, 2016 1 次提交
  4. 14 11月, 2016 1 次提交
  5. 04 11月, 2016 1 次提交
    • D
      net: tcp: check skb is non-NULL for exact match on lookups · da96786e
      David Ahern 提交于
      Andrey reported the following error report while running the syzkaller
      fuzzer:
      
      general protection fault: 0000 [#1] SMP KASAN
      Dumping ftrace buffer:
         (ftrace buffer empty)
      Modules linked in:
      CPU: 0 PID: 648 Comm: syz-executor Not tainted 4.9.0-rc3+ #333
      Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
      task: ffff8800398c4480 task.stack: ffff88003b468000
      RIP: 0010:[<ffffffff83091106>]  [<     inline     >]
      inet_exact_dif_match include/net/tcp.h:808
      RIP: 0010:[<ffffffff83091106>]  [<ffffffff83091106>]
      __inet_lookup_listener+0xb6/0x500 net/ipv4/inet_hashtables.c:219
      RSP: 0018:ffff88003b46f270  EFLAGS: 00010202
      RAX: 0000000000000004 RBX: 0000000000004242 RCX: 0000000000000001
      RDX: 0000000000000000 RSI: ffffc90000e3c000 RDI: 0000000000000054
      RBP: ffff88003b46f2d8 R08: 0000000000004000 R09: ffffffff830910e7
      R10: 0000000000000000 R11: 000000000000000a R12: ffffffff867fa0c0
      R13: 0000000000004242 R14: 0000000000000003 R15: dffffc0000000000
      FS:  00007fb135881700(0000) GS:ffff88003ec00000(0000) knlGS:0000000000000000
      CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      CR2: 0000000020cc3000 CR3: 000000006d56a000 CR4: 00000000000006f0
      Stack:
       0000000000000000 000000000601a8c0 0000000000000000 ffffffff00004242
       424200003b9083c2 ffff88003def4041 ffffffff84e7e040 0000000000000246
       ffff88003a0911c0 0000000000000000 ffff88003a091298 ffff88003b9083ae
      Call Trace:
       [<ffffffff831100f4>] tcp_v4_send_reset+0x584/0x1700 net/ipv4/tcp_ipv4.c:643
       [<ffffffff83115b1b>] tcp_v4_rcv+0x198b/0x2e50 net/ipv4/tcp_ipv4.c:1718
       [<ffffffff83069d22>] ip_local_deliver_finish+0x332/0xad0
      net/ipv4/ip_input.c:216
      ...
      
      MD5 has a code path that calls __inet_lookup_listener with a null skb,
      so inet{6}_exact_dif_match needs to check skb against null before pulling
      the flag.
      
      Fixes: a04a480d ("net: Require exact match for TCP socket lookups if
             dif is l3mdev")
      Reported-by: NAndrey Konovalov <andreyknvl@google.com>
      Signed-off-by: NDavid Ahern <dsa@cumulusnetworks.com>
      Tested-by: NAndrey Konovalov <andreyknvl@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      da96786e
  6. 17 10月, 2016 1 次提交
    • D
      net: Require exact match for TCP socket lookups if dif is l3mdev · a04a480d
      David Ahern 提交于
      Currently, socket lookups for l3mdev (vrf) use cases can match a socket
      that is bound to a port but not a device (ie., a global socket). If the
      sysctl tcp_l3mdev_accept is not set this leads to ack packets going out
      based on the main table even though the packet came in from an L3 domain.
      The end result is that the connection does not establish creating
      confusion for users since the service is running and a socket shows in
      ss output. Fix by requiring an exact dif to sk_bound_dev_if match if the
      skb came through an interface enslaved to an l3mdev device and the
      tcp_l3mdev_accept is not set.
      
      skb's through an l3mdev interface are marked by setting a flag in
      inet{6}_skb_parm. The IPv6 variant is already set; this patch adds the
      flag for IPv4. Using an skb flag avoids a device lookup on the dif. The
      flag is set in the VRF driver using the IP{6}CB macros. For IPv4, the
      inet_skb_parm struct is moved in the cb per commit 971f10ec, so the
      match function in the TCP stack needs to use TCP_SKB_CB. For IPv6, the
      move is done after the socket lookup, so IP6CB is used.
      
      The flags field in inet_skb_parm struct needs to be increased to add
      another flag. There is currently a 1-byte hole following the flags,
      so it can be expanded to u16 without increasing the size of the struct.
      
      Fixes: 193125db ("net: Introduce VRF device driver")
      Signed-off-by: NDavid Ahern <dsa@cumulusnetworks.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      a04a480d
  7. 21 9月, 2016 7 次提交
    • Y
      tcp: new CC hook to set sending rate with rate_sample in any CA state · c0402760
      Yuchung Cheng 提交于
      This commit introduces an optional new "omnipotent" hook,
      cong_control(), for congestion control modules. The cong_control()
      function is called at the end of processing an ACK (i.e., after
      updating sequence numbers, the SACK scoreboard, and loss
      detection). At that moment we have precise delivery rate information
      the congestion control module can use to control the sending behavior
      (using cwnd, TSO skb size, and pacing rate) in any CA state.
      
      This function can also be used by a congestion control that prefers
      not to use the default cwnd reduction approach (i.e., the PRR
      algorithm) during CA_Recovery to control the cwnd and sending rate
      during loss recovery.
      
      We take advantage of the fact that recent changes defer the
      retransmission or transmission of new data (e.g. by F-RTO) in recovery
      until the new tcp_cong_control() function is run.
      
      With this commit, we only run tcp_update_pacing_rate() if the
      congestion control is not using this new API. New congestion controls
      which use the new API do not want the TCP stack to run the default
      pacing rate calculation and overwrite whatever pacing rate they have
      chosen at initialization time.
      Signed-off-by: NVan Jacobson <vanj@google.com>
      Signed-off-by: NNeal Cardwell <ncardwell@google.com>
      Signed-off-by: NYuchung Cheng <ycheng@google.com>
      Signed-off-by: NNandita Dukkipati <nanditad@google.com>
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NSoheil Hassas Yeganeh <soheil@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      c0402760
    • Y
      tcp: allow congestion control to expand send buffer differently · 77bfc174
      Yuchung Cheng 提交于
      Currently the TCP send buffer expands to twice cwnd, in order to allow
      limited transmits in the CA_Recovery state. This assumes that cwnd
      does not increase in the CA_Recovery.
      
      For some congestion control algorithms, like the upcoming BBR module,
      if the losses in recovery do not indicate congestion then we may
      continue to raise cwnd multiplicatively in recovery. In such cases the
      current multiplier will falsely limit the sending rate, much as if it
      were limited by the application.
      
      This commit adds an optional congestion control callback to use a
      different multiplier to expand the TCP send buffer. For congestion
      control modules that do not specificy this callback, TCP continues to
      use the previous default of 2.
      Signed-off-by: NVan Jacobson <vanj@google.com>
      Signed-off-by: NNeal Cardwell <ncardwell@google.com>
      Signed-off-by: NYuchung Cheng <ycheng@google.com>
      Signed-off-by: NNandita Dukkipati <nanditad@google.com>
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NSoheil Hassas Yeganeh <soheil@google.com>
      Acked-by: NStephen Hemminger <stephen@networkplumber.org>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      77bfc174
    • N
      tcp: export tcp_tso_autosize() and parameterize minimum number of TSO segments · 1b3878ca
      Neal Cardwell 提交于
      To allow congestion control modules to use the default TSO auto-sizing
      algorithm as one of the ingredients in their own decision about TSO sizing:
      
      1) Export tcp_tso_autosize() so that CC modules can use it.
      
      2) Change tcp_tso_autosize() to allow callers to specify a minimum
         number of segments per TSO skb, in case the congestion control
         module has a different notion of the best floor for TSO skbs for
         the connection right now. For very low-rate paths or policed
         connections it can be appropriate to use smaller TSO skbs.
      Signed-off-by: NVan Jacobson <vanj@google.com>
      Signed-off-by: NNeal Cardwell <ncardwell@google.com>
      Signed-off-by: NYuchung Cheng <ycheng@google.com>
      Signed-off-by: NNandita Dukkipati <nanditad@google.com>
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NSoheil Hassas Yeganeh <soheil@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      1b3878ca
    • N
      tcp: allow congestion control module to request TSO skb segment count · ed6e7268
      Neal Cardwell 提交于
      Add the tso_segs_goal() function in tcp_congestion_ops to allow the
      congestion control module to specify the number of segments that
      should be in a TSO skb sent by tcp_write_xmit() and
      tcp_xmit_retransmit_queue(). The congestion control module can either
      request a particular number of segments in TSO skb that we transmit,
      or return 0 if it doesn't care.
      
      This allows the upcoming BBR congestion control module to select small
      TSO skb sizes if the module detects that the bottleneck bandwidth is
      very low, or that the connection is policed to a low rate.
      Signed-off-by: NVan Jacobson <vanj@google.com>
      Signed-off-by: NNeal Cardwell <ncardwell@google.com>
      Signed-off-by: NYuchung Cheng <ycheng@google.com>
      Signed-off-by: NNandita Dukkipati <nanditad@google.com>
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NSoheil Hassas Yeganeh <soheil@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      ed6e7268
    • S
      tcp: track application-limited rate samples · d7722e85
      Soheil Hassas Yeganeh 提交于
      This commit adds code to track whether the delivery rate represented
      by each rate_sample was limited by the application.
      
      Upon each transmit, we store in the is_app_limited field in the skb a
      boolean bit indicating whether there is a known "bubble in the pipe":
      a point in the rate sample interval where the sender was
      application-limited, and did not transmit even though the cwnd and
      pacing rate allowed it.
      
      This logic marks the flow app-limited on a write if *all* of the
      following are true:
      
        1) There is less than 1 MSS of unsent data in the write queue
           available to transmit.
      
        2) There is no packet in the sender's queues (e.g. in fq or the NIC
           tx queue).
      
        3) The connection is not limited by cwnd.
      
        4) There are no lost packets to retransmit.
      
      The tcp_rate_check_app_limited() code in tcp_rate.c determines whether
      the connection is application-limited at the moment. If the flow is
      application-limited, it sets the tp->app_limited field. If the flow is
      application-limited then that means there is effectively a "bubble" of
      silence in the pipe now, and this silence will be reflected in a lower
      bandwidth sample for any rate samples from now until we get an ACK
      indicating this bubble has exited the pipe: specifically, until we get
      an ACK for the next packet we transmit.
      
      When we send every skb we record in scb->tx.is_app_limited whether the
      resulting rate sample will be application-limited.
      
      The code in tcp_rate_gen() checks to see when it is safe to mark all
      known application-limited bubbles of silence as having exited the
      pipe. It does this by checking to see when the delivered count moves
      past the tp->app_limited marker. At this point it zeroes the
      tp->app_limited marker, as all known bubbles are out of the pipe.
      
      We make room for the tx.is_app_limited bit in the skb by borrowing a
      bit from the in_flight field used by NV to record the number of bytes
      in flight. The receive window in the TCP header is 16 bits, and the
      max receive window scaling shift factor is 14 (RFC 1323). So the max
      receive window offered by the TCP protocol is 2^(16+14) = 2^30. So we
      only need 30 bits for the tx.in_flight used by NV.
      Signed-off-by: NVan Jacobson <vanj@google.com>
      Signed-off-by: NNeal Cardwell <ncardwell@google.com>
      Signed-off-by: NYuchung Cheng <ycheng@google.com>
      Signed-off-by: NNandita Dukkipati <nanditad@google.com>
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NSoheil Hassas Yeganeh <soheil@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      d7722e85
    • Y
      tcp: track data delivery rate for a TCP connection · b9f64820
      Yuchung Cheng 提交于
      This patch generates data delivery rate (throughput) samples on a
      per-ACK basis. These rate samples can be used by congestion control
      modules, and specifically will be used by TCP BBR in later patches in
      this series.
      
      Key state:
      
      tp->delivered: Tracks the total number of data packets (original or not)
      	       delivered so far. This is an already-existing field.
      
      tp->delivered_mstamp: the last time tp->delivered was updated.
      
      Algorithm:
      
      A rate sample is calculated as (d1 - d0)/(t1 - t0) on a per-ACK basis:
      
        d1: the current tp->delivered after processing the ACK
        t1: the current time after processing the ACK
      
        d0: the prior tp->delivered when the acked skb was transmitted
        t0: the prior tp->delivered_mstamp when the acked skb was transmitted
      
      When an skb is transmitted, we snapshot d0 and t0 in its control
      block in tcp_rate_skb_sent().
      
      When an ACK arrives, it may SACK and ACK some skbs. For each SACKed
      or ACKed skb, tcp_rate_skb_delivered() updates the rate_sample struct
      to reflect the latest (d0, t0).
      
      Finally, tcp_rate_gen() generates a rate sample by storing
      (d1 - d0) in rs->delivered and (t1 - t0) in rs->interval_us.
      
      One caveat: if an skb was sent with no packets in flight, then
      tp->delivered_mstamp may be either invalid (if the connection is
      starting) or outdated (if the connection was idle). In that case,
      we'll re-stamp tp->delivered_mstamp.
      
      At first glance it seems t0 should always be the time when an skb was
      transmitted, but actually this could over-estimate the rate due to
      phase mismatch between transmit and ACK events. To track the delivery
      rate, we ensure that if packets are in flight then t0 and and t1 are
      times at which packets were marked delivered.
      
      If the initial and final RTTs are different then one may be corrupted
      by some sort of noise. The noise we see most often is sending gaps
      caused by delayed, compressed, or stretched acks. This either affects
      both RTTs equally or artificially reduces the final RTT. We approach
      this by recording the info we need to compute the initial RTT
      (duration of the "send phase" of the window) when we recorded the
      associated inflight. Then, for a filter to avoid bandwidth
      overestimates, we generalize the per-sample bandwidth computation
      from:
      
          bw = delivered / ack_phase_rtt
      
      to the following:
      
          bw = delivered / max(send_phase_rtt, ack_phase_rtt)
      
      In large-scale experiments, this filtering approach incorporating
      send_phase_rtt is effective at avoiding bandwidth overestimates due to
      ACK compression or stretched ACKs.
      Signed-off-by: NVan Jacobson <vanj@google.com>
      Signed-off-by: NNeal Cardwell <ncardwell@google.com>
      Signed-off-by: NYuchung Cheng <ycheng@google.com>
      Signed-off-by: NNandita Dukkipati <nanditad@google.com>
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NSoheil Hassas Yeganeh <soheil@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      b9f64820
    • N
      tcp: use windowed min filter library for TCP min_rtt estimation · 64033892
      Neal Cardwell 提交于
      Refactor the TCP min_rtt code to reuse the new win_minmax library in
      lib/win_minmax.c to simplify the TCP code.
      
      This is a pure refactor: the functionality is exactly the same. We
      just moved the windowed min code to make TCP easier to read and
      maintain, and to allow other parts of the kernel to use the windowed
      min/max filter code.
      Signed-off-by: NVan Jacobson <vanj@google.com>
      Signed-off-by: NNeal Cardwell <ncardwell@google.com>
      Signed-off-by: NYuchung Cheng <ycheng@google.com>
      Signed-off-by: NNandita Dukkipati <nanditad@google.com>
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NSoheil Hassas Yeganeh <soheil@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      64033892
  8. 09 9月, 2016 1 次提交
    • Y
      tcp: use an RB tree for ooo receive queue · 9f5afeae
      Yaogong Wang 提交于
      Over the years, TCP BDP has increased by several orders of magnitude,
      and some people are considering to reach the 2 Gbytes limit.
      
      Even with current window scale limit of 14, ~1 Gbytes maps to ~740,000
      MSS.
      
      In presence of packet losses (or reorders), TCP stores incoming packets
      into an out of order queue, and number of skbs sitting there waiting for
      the missing packets to be received can be in the 10^5 range.
      
      Most packets are appended to the tail of this queue, and when
      packets can finally be transferred to receive queue, we scan the queue
      from its head.
      
      However, in presence of heavy losses, we might have to find an arbitrary
      point in this queue, involving a linear scan for every incoming packet,
      throwing away cpu caches.
      
      This patch converts it to a RB tree, to get bounded latencies.
      
      Yaogong wrote a preliminary patch about 2 years ago.
      Eric did the rebase, added ofo_last_skb cache, polishing and tests.
      
      Tested with network dropping between 1 and 10 % packets, with good
      success (about 30 % increase of throughput in stress tests)
      
      Next step would be to also use an RB tree for the write queue at sender
      side ;)
      Signed-off-by: NYaogong Wang <wygivan@google.com>
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Cc: Yuchung Cheng <ycheng@google.com>
      Cc: Neal Cardwell <ncardwell@google.com>
      Cc: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
      Acked-By: NIlpo Järvinen <ilpo.jarvinen@helsinki.fi>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      9f5afeae
  9. 29 8月, 2016 3 次提交
    • E
      tcp: add tcp_add_backlog() · c9c33212
      Eric Dumazet 提交于
      When TCP operates in lossy environments (between 1 and 10 % packet
      losses), many SACK blocks can be exchanged, and I noticed we could
      drop them on busy senders, if these SACK blocks have to be queued
      into the socket backlog.
      
      While the main cause is the poor performance of RACK/SACK processing,
      we can try to avoid these drops of valuable information that can lead to
      spurious timeouts and retransmits.
      
      Cause of the drops is the skb->truesize overestimation caused by :
      
      - drivers allocating ~2048 (or more) bytes as a fragment to hold an
        Ethernet frame.
      
      - various pskb_may_pull() calls bringing the headers into skb->head
        might have pulled all the frame content, but skb->truesize could
        not be lowered, as the stack has no idea of each fragment truesize.
      
      The backlog drops are also more visible on bidirectional flows, since
      their sk_rmem_alloc can be quite big.
      
      Let's add some room for the backlog, as only the socket owner
      can selectively take action to lower memory needs, like collapsing
      receive queues or partial ofo pruning.
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Cc: Yuchung Cheng <ycheng@google.com>
      Cc: Neal Cardwell <ncardwell@google.com>
      Acked-by: NNeal Cardwell <ncardwell@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      c9c33212
    • T
      tcp: Set read_sock and peek_len proto_ops · 32035585
      Tom Herbert 提交于
      In inet_stream_ops we set read_sock to tcp_read_sock and peek_len to
      tcp_peek_len (which is just a stub function that calls tcp_inq).
      Signed-off-by: NTom Herbert <tom@herbertland.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      32035585
    • T
      net: Add read_sock proto_op · 0294b625
      Tom Herbert 提交于
      Add new function in proto_ops structure. This includes moving the
      typedef got sk_read_actor into net.h and removing the definition from
      tcp.h.
      Signed-off-by: NTom Herbert <tom@herbertland.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      0294b625
  10. 24 8月, 2016 1 次提交
  11. 19 8月, 2016 1 次提交
    • E
      tcp: fix use after free in tcp_xmit_retransmit_queue() · bb1fceca
      Eric Dumazet 提交于
      When tcp_sendmsg() allocates a fresh and empty skb, it puts it at the
      tail of the write queue using tcp_add_write_queue_tail()
      
      Then it attempts to copy user data into this fresh skb.
      
      If the copy fails, we undo the work and remove the fresh skb.
      
      Unfortunately, this undo lacks the change done to tp->highest_sack and
      we can leave a dangling pointer (to a freed skb)
      
      Later, tcp_xmit_retransmit_queue() can dereference this pointer and
      access freed memory. For regular kernels where memory is not unmapped,
      this might cause SACK bugs because tcp_highest_sack_seq() is buggy,
      returning garbage instead of tp->snd_nxt, but with various debug
      features like CONFIG_DEBUG_PAGEALLOC, this can crash the kernel.
      
      This bug was found by Marco Grassi thanks to syzkaller.
      
      Fixes: 6859d494 ("[TCP]: Abstract tp->highest_sack accessing & point to next skb")
      Reported-by: NMarco Grassi <marco.gra@gmail.com>
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Cc: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
      Cc: Yuchung Cheng <ycheng@google.com>
      Cc: Neal Cardwell <ncardwell@google.com>
      Acked-by: NNeal Cardwell <ncardwell@google.com>
      Reviewed-by: NCong Wang <xiyou.wangcong@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      bb1fceca
  12. 01 7月, 2016 1 次提交
  13. 30 6月, 2016 1 次提交
  14. 11 6月, 2016 1 次提交
  15. 12 5月, 2016 2 次提交
    • D
      net: l3mdev: Add hook in ip and ipv6 · 74b20582
      David Ahern 提交于
      Currently the VRF driver uses the rx_handler to switch the skb device
      to the VRF device. Switching the dev prior to the ip / ipv6 layer
      means the VRF driver has to duplicate IP/IPv6 processing which adds
      overhead and makes features such as retaining the ingress device index
      more complicated than necessary.
      
      This patch moves the hook to the L3 layer just after the first NF_HOOK
      for PRE_ROUTING. This location makes exposing the original ingress device
      trivial (next patch) and allows adding other NF_HOOKs to the VRF driver
      in the future.
      
      dev_queue_xmit_nit is exported so that the VRF driver can cycle the skb
      with the switched device through the packet taps to maintain current
      behavior (tcpdump can be used on either the vrf device or the enslaved
      devices).
      Signed-off-by: NDavid Ahern <dsa@cumulusnetworks.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      74b20582
    • L
      tcp: replace cnt & rtt with struct in pkts_acked() · 756ee172
      Lawrence Brakmo 提交于
      Replace 2 arguments (cnt and rtt) in the congestion control modules'
      pkts_acked() function with a struct. This will allow adding more
      information without having to modify existing congestion control
      modules (tcp_nv in particular needs bytes in flight when packet
      was sent).
      
      As proposed by Neal Cardwell in his comments to the tcp_nv patch.
      Signed-off-by: NLawrence Brakmo <brakmo@fb.com>
      Acked-by: NYuchung Cheng <ycheng@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      756ee172
  16. 09 5月, 2016 1 次提交
  17. 29 4月, 2016 1 次提交
    • M
      tcp: Make use of MSG_EOR in tcp_sendmsg · c134ecb8
      Martin KaFai Lau 提交于
      This patch adds an eor bit to the TCP_SKB_CB.  When MSG_EOR
      is passed to tcp_sendmsg, the eor bit will be set at the skb
      containing the last byte of the userland's msg.  The eor bit
      will prevent data from appending to that skb in the future.
      
      The change in do_tcp_sendpages is to honor the eor set
      during the previous tcp_sendmsg(MSG_EOR) call.
      
      This patch handles the tcp_sendmsg case.  The followup patches
      will handle other skb coalescing and fragment cases.
      
      One potential use case is to use MSG_EOR with
      SOF_TIMESTAMPING_TX_ACK to get a more accurate
      TCP ack timestamping on application protocol with
      multiple outgoing response messages (e.g. HTTP2).
      
      Packetdrill script for testing:
      ~~~~~~
      +0 `sysctl -q -w net.ipv4.tcp_min_tso_segs=10`
      +0 `sysctl -q -w net.ipv4.tcp_no_metrics_save=1`
      +0 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
      +0 setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
      +0 bind(3, ..., ...) = 0
      +0 listen(3, 1) = 0
      
      0.100 < S 0:0(0) win 32792 <mss 1460,sackOK,nop,nop,nop,wscale 7>
      0.100 > S. 0:0(0) ack 1 <mss 1460,nop,nop,sackOK,nop,wscale 7>
      0.200 < . 1:1(0) ack 1 win 257
      0.200 accept(3, ..., ...) = 4
      +0 setsockopt(4, SOL_TCP, TCP_NODELAY, [1], 4) = 0
      
      0.200 write(4, ..., 14600) = 14600
      0.200 sendto(4, ..., 730, MSG_EOR, ..., ...) = 730
      0.200 sendto(4, ..., 730, MSG_EOR, ..., ...) = 730
      
      0.200 > .  1:7301(7300) ack 1
      0.200 > P. 7301:14601(7300) ack 1
      
      0.300 < . 1:1(0) ack 14601 win 257
      0.300 > P. 14601:15331(730) ack 1
      0.300 > P. 15331:16061(730) ack 1
      
      0.400 < . 1:1(0) ack 16061 win 257
      0.400 close(4) = 0
      0.400 > F. 16061:16061(0) ack 1
      0.400 < F. 1:1(0) ack 16062 win 257
      0.400 > . 16062:16062(0) ack 2
      Signed-off-by: NMartin KaFai Lau <kafai@fb.com>
      Cc: Eric Dumazet <edumazet@google.com>
      Cc: Neal Cardwell <ncardwell@google.com>
      Cc: Soheil Hassas Yeganeh <soheil@google.com>
      Cc: Willem de Bruijn <willemb@google.com>
      Cc: Yuchung Cheng <ycheng@google.com>
      Suggested-by: NEric Dumazet <edumazet@google.com>
      Acked-by: NEric Dumazet <edumazet@google.com>
      Acked-by: NSoheil Hassas Yeganeh <soheil@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      c134ecb8
  18. 28 4月, 2016 4 次提交
  19. 25 4月, 2016 1 次提交
    • E
      tcp-tso: do not split TSO packets at retransmit time · 10d3be56
      Eric Dumazet 提交于
      Linux TCP stack painfully segments all TSO/GSO packets before retransmits.
      
      This was fine back in the days when TSO/GSO were emerging, with their
      bugs, but we believe the dark age is over.
      
      Keeping big packets in write queues, but also in stack traversal
      has a lot of benefits.
       - Less memory overhead, because write queues have less skbs
       - Less cpu overhead at ACK processing.
       - Better SACK processing, as lot of studies mentioned how
         awful linux was at this ;)
       - Less cpu overhead to send the rtx packets
         (IP stack traversal, netfilter traversal, drivers...)
       - Better latencies in presence of losses.
       - Smaller spikes in fq like packet schedulers, as retransmits
         are not constrained by TCP Small Queues.
      
      1 % packet losses are common today, and at 100Gbit speeds, this
      translates to ~80,000 losses per second.
      Losses are often correlated, and we see many retransmit events
      leading to 1-MSS train of packets, at the time hosts are already
      under stress.
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Acked-by: NYuchung Cheng <ycheng@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      10d3be56
  20. 22 4月, 2016 1 次提交
    • M
      tcp: Merge tx_flags and tskey in tcp_shifted_skb · cfea5a68
      Martin KaFai Lau 提交于
      After receiving sacks, tcp_shifted_skb() will collapse
      skbs if possible.  tx_flags and tskey also have to be
      merged.
      
      This patch reuses the tcp_skb_collapse_tstamp() to handle
      them.
      
      BPF Output Before:
      ~~~~~
      <no-output-due-to-missing-tstamp-event>
      
      BPF Output After:
      ~~~~~
      <...>-2024  [007] d.s.    88.644374: : ee_data:14599
      
      Packetdrill Script:
      ~~~~~
      +0 `sysctl -q -w net.ipv4.tcp_min_tso_segs=10`
      +0 `sysctl -q -w net.ipv4.tcp_no_metrics_save=1`
      +0 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
      +0 setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
      +0 bind(3, ..., ...) = 0
      +0 listen(3, 1) = 0
      
      0.100 < S 0:0(0) win 32792 <mss 1460,sackOK,nop,nop,nop,wscale 7>
      0.100 > S. 0:0(0) ack 1 <mss 1460,nop,nop,sackOK,nop,wscale 7>
      0.200 < . 1:1(0) ack 1 win 257
      0.200 accept(3, ..., ...) = 4
      +0 setsockopt(4, SOL_TCP, TCP_NODELAY, [1], 4) = 0
      
      0.200 write(4, ..., 1460) = 1460
      +0 setsockopt(4, SOL_SOCKET, 37, [2688], 4) = 0
      0.200 write(4, ..., 13140) = 13140
      
      0.200 > P. 1:1461(1460) ack 1
      0.200 > . 1461:8761(7300) ack 1
      0.200 > P. 8761:14601(5840) ack 1
      
      0.300 < . 1:1(0) ack 1 win 257 <sack 1461:14601,nop,nop>
      0.300 > P. 1:1461(1460) ack 1
      0.400 < . 1:1(0) ack 14601 win 257
      
      0.400 close(4) = 0
      0.400 > F. 14601:14601(0) ack 1
      0.500 < F. 1:1(0) ack 14602 win 257
      0.500 > . 14602:14602(0) ack 2
      Signed-off-by: NMartin KaFai Lau <kafai@fb.com>
      Cc: Eric Dumazet <edumazet@google.com>
      Cc: Neal Cardwell <ncardwell@google.com>
      Cc: Soheil Hassas Yeganeh <soheil@google.com>
      Cc: Willem de Bruijn <willemb@google.com>
      Cc: Yuchung Cheng <ycheng@google.com>
      Acked-by: NSoheil Hassas Yeganeh <soheil@google.com>
      Tested-by: NSoheil Hassas Yeganeh <soheil@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      cfea5a68
  21. 16 4月, 2016 1 次提交
    • E
      tcp: do not mess with listener sk_wmem_alloc · b3d05147
      Eric Dumazet 提交于
      When removing sk_refcnt manipulation on synflood, I missed that
      using skb_set_owner_w() was racy, if sk->sk_wmem_alloc had already
      transitioned to 0.
      
      We should hold sk_refcnt instead, but this is a big deal under attack.
      (Doing so increase performance from 3.2 Mpps to 3.8 Mpps only)
      
      In this patch, I chose to not attach a socket to syncookies skb.
      
      Performance is now 5 Mpps instead of 3.2 Mpps.
      
      Following patch will remove last known false sharing in
      tcp_rcv_state_process()
      
      Fixes: 3b24d854 ("tcp/dccp: do not touch listener sk_refcnt under synflood")
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      b3d05147
  22. 05 4月, 2016 2 次提交
  23. 03 4月, 2016 1 次提交
    • Y
      tcp: remove cwnd moderation after recovery · 23492623
      Yuchung Cheng 提交于
      For non-SACK connections, cwnd is lowered to inflight plus 3 packets
      when the recovery ends. This is an optional feature in the NewReno
      RFC 2582 to reduce the potential burst when cwnd is "re-opened"
      after recovery and inflight is low.
      
      This feature is questionably effective because of PRR: when
      the recovery ends (i.e., snd_una == high_seq) NewReno holds the
      CA_Recovery state for another round trip to prevent false fast
      retransmits. But if the inflight is low, PRR will overwrite the
      moderated cwnd in tcp_cwnd_reduction() later regardlessly. So if a
      receiver responds bogus ACKs (i.e., acking future data) to speed up
      transfer after recovery, it can only induce a burst up to a window
      worth of data packets by acking up to SND.NXT. A restart from (short)
      idle or receiving streched ACKs can both cause such bursts as well.
      
      On the other hand, if the recovery ends because the sender
      detects the losses were spurious (e.g., reordering). This feature
      unconditionally lowers a reverted cwnd even though nothing
      was lost.
      
      By principle loss recovery module should not update cwnd. Further
      pacing is much more effective to reduce burst. Hence this patch
      removes the cwnd moderation feature.
      
      v2 changes: revised commit message on bogus ACKs and burst, and
                  missing signature
      Signed-off-by: NMatt Mathis <mattmathis@google.com>
      Signed-off-by: NNeal Cardwell <ncardwell@google.com>
      Signed-off-by: NSoheil Hassas Yeganeh <soheil@google.com>
      Signed-off-by: NYuchung Cheng <ycheng@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      23492623
  24. 15 3月, 2016 1 次提交
    • M
      tcp: Add RFC4898 tcpEStatsPerfDataSegsOut/In · a44d6eac
      Martin KaFai Lau 提交于
      Per RFC4898, they count segments sent/received
      containing a positive length data segment (that includes
      retransmission segments carrying data).  Unlike
      tcpi_segs_out/in, tcpi_data_segs_out/in excludes segments
      carrying no data (e.g. pure ack).
      
      The patch also updates the segs_in in tcp_fastopen_add_skb()
      so that segs_in >= data_segs_in property is kept.
      
      Together with retransmission data, tcpi_data_segs_out
      gives a better signal on the rxmit rate.
      
      v6: Rebase on the latest net-next
      
      v5: Eric pointed out that checking skb->len is still needed in
      tcp_fastopen_add_skb() because skb can carry a FIN without data.
      Hence, instead of open coding segs_in and data_segs_in, tcp_segs_in()
      helper is used.  Comment is added to the fastopen case to explain why
      segs_in has to be reset and tcp_segs_in() has to be called before
      __skb_pull().
      
      v4: Add comment to the changes in tcp_fastopen_add_skb()
      and also add remark on this case in the commit message.
      
      v3: Add const modifier to the skb parameter in tcp_segs_in()
      
      v2: Rework based on recent fix by Eric:
      commit a9d99ce2 ("tcp: fix tcpi_segs_in after connection establishment")
      Signed-off-by: NMartin KaFai Lau <kafai@fb.com>
      Cc: Chris Rapier <rapier@psc.edu>
      Cc: Eric Dumazet <edumazet@google.com>
      Cc: Marcelo Ricardo Leitner <mleitner@redhat.com>
      Cc: Neal Cardwell <ncardwell@google.com>
      Cc: Yuchung Cheng <ycheng@google.com>
      Acked-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      a44d6eac
  25. 10 3月, 2016 1 次提交
  26. 09 2月, 2016 1 次提交