1. 24 10月, 2016 1 次提交
    • C
      net: ip, diag -- Add diag interface for raw sockets · 432490f9
      Cyrill Gorcunov 提交于
      In criu we are actively using diag interface to collect sockets
      present in the system when dumping applications. And while for
      unix, tcp, udp[lite], packet, netlink it works as expected,
      the raw sockets do not have. Thus add it.
      
      v2:
       - add missing sock_put calls in raw_diag_dump_one (by eric.dumazet@)
       - implement @destroy for diag requests (by dsa@)
      
      v3:
       - add export of raw_abort for IPv6 (by dsa@)
       - pass net-admin flag into inet_sk_diag_fill due to
         changes in net-next branch (by dsa@)
      
      v4:
       - use @pad in struct inet_diag_req_v2 for raw socket
         protocol specification: raw module carries sockets
         which may have custom protocol passed from socket()
         syscall and sole @sdiag_protocol is not enough to
         match underlied ones
       - start reporting protocol specifed in socket() call
         when sockets are raw ones for the same reason: user
         space tools like ss may parse this attribute and use
         it for socket matching
      
      v5 (by eric.dumazet@):
       - use sock_hold in raw_sock_get instead of atomic_inc,
         we're holding (raw_v4_hashinfo|raw_v6_hashinfo)->lock
         when looking up so counter won't be zero here.
      
      v6:
       - use sdiag_raw_protocol() helper which will access @pad
         structure used for raw sockets protocol specification:
         we can't simply rename this member without breaking uapi
      
      v7:
       - sine sdiag_raw_protocol() helper is not suitable for
         uapi lets rather make an alias structure with proper
         names. __check_inet_diag_req_raw helper will catch
         if any of structure unintentionally changed.
      
      CC: David S. Miller <davem@davemloft.net>
      CC: Eric Dumazet <eric.dumazet@gmail.com>
      CC: David Ahern <dsa@cumulusnetworks.com>
      CC: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru>
      CC: James Morris <jmorris@namei.org>
      CC: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org>
      CC: Patrick McHardy <kaber@trash.net>
      CC: Andrey Vagin <avagin@openvz.org>
      CC: Stephen Hemminger <stephen@networkplumber.org>
      Signed-off-by: NCyrill Gorcunov <gorcunov@openvz.org>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      432490f9
  2. 21 9月, 2016 2 次提交
    • N
      tcp_bbr: add BBR congestion control · 0f8782ea
      Neal Cardwell 提交于
      This commit implements a new TCP congestion control algorithm: BBR
      (Bottleneck Bandwidth and RTT). A detailed description of BBR will be
      published in ACM Queue, Vol. 14 No. 5, September-October 2016, as
      "BBR: Congestion-Based Congestion Control".
      
      BBR has significantly increased throughput and reduced latency for
      connections on Google's internal backbone networks and google.com and
      YouTube Web servers.
      
      BBR requires only changes on the sender side, not in the network or
      the receiver side. Thus it can be incrementally deployed on today's
      Internet, or in datacenters.
      
      The Internet has predominantly used loss-based congestion control
      (largely Reno or CUBIC) since the 1980s, relying on packet loss as the
      signal to slow down. While this worked well for many years, loss-based
      congestion control is unfortunately out-dated in today's networks. On
      today's Internet, loss-based congestion control causes the infamous
      bufferbloat problem, often causing seconds of needless queuing delay,
      since it fills the bloated buffers in many last-mile links. On today's
      high-speed long-haul links using commodity switches with shallow
      buffers, loss-based congestion control has abysmal throughput because
      it over-reacts to losses caused by transient traffic bursts.
      
      In 1981 Kleinrock and Gale showed that the optimal operating point for
      a network maximizes delivered bandwidth while minimizing delay and
      loss, not only for single connections but for the network as a
      whole. Finding that optimal operating point has been elusive, since
      any single network measurement is ambiguous: network measurements are
      the result of both bandwidth and propagation delay, and those two
      cannot be measured simultaneously.
      
      While it is impossible to disambiguate any single bandwidth or RTT
      measurement, a connection's behavior over time tells a clearer
      story. BBR uses a measurement strategy designed to resolve this
      ambiguity. It combines these measurements with a robust servo loop
      using recent control systems advances to implement a distributed
      congestion control algorithm that reacts to actual congestion, not
      packet loss or transient queue delay, and is designed to converge with
      high probability to a point near the optimal operating point.
      
      In a nutshell, BBR creates an explicit model of the network pipe by
      sequentially probing the bottleneck bandwidth and RTT. On the arrival
      of each ACK, BBR derives the current delivery rate of the last round
      trip, and feeds it through a windowed max-filter to estimate the
      bottleneck bandwidth. Conversely it uses a windowed min-filter to
      estimate the round trip propagation delay. The max-filtered bandwidth
      and min-filtered RTT estimates form BBR's model of the network pipe.
      
      Using its model, BBR sets control parameters to govern sending
      behavior. The primary control is the pacing rate: BBR applies a gain
      multiplier to transmit faster or slower than the observed bottleneck
      bandwidth. The conventional congestion window (cwnd) is now the
      secondary control; the cwnd is set to a small multiple of the
      estimated BDP (bandwidth-delay product) in order to allow full
      utilization and bandwidth probing while bounding the potential amount
      of queue at the bottleneck.
      
      When a BBR connection starts, it enters STARTUP mode and applies a
      high gain to perform an exponential search to quickly probe the
      bottleneck bandwidth (doubling its sending rate each round trip, like
      slow start). However, instead of continuing until it fills up the
      buffer (i.e. a loss), or until delay or ACK spacing reaches some
      threshold (like Hystart), it uses its model of the pipe to estimate
      when that pipe is full: it estimates the pipe is full when it notices
      the estimated bandwidth has stopped growing. At that point it exits
      STARTUP and enters DRAIN mode, where it reduces its pacing rate to
      drain the queue it estimates it has created.
      
      Then BBR enters steady state. In steady state, PROBE_BW mode cycles
      between first pacing faster to probe for more bandwidth, then pacing
      slower to drain any queue that created if no more bandwidth was
      available, and then cruising at the estimated bandwidth to utilize the
      pipe without creating excess queue. Occasionally, on an as-needed
      basis, it sends significantly slower to probe for RTT (PROBE_RTT
      mode).
      
      BBR has been fully deployed on Google's wide-area backbone networks
      and we're experimenting with BBR on Google.com and YouTube on a global
      scale.  Replacing CUBIC with BBR has resulted in significant
      improvements in network latency and application (RPC, browser, and
      video) metrics. For more details please refer to our upcoming ACM
      Queue publication.
      
      Example performance results, to illustrate the difference between BBR
      and CUBIC:
      
      Resilience to random loss (e.g. from shallow buffers):
        Consider a netperf TCP_STREAM test lasting 30 secs on an emulated
        path with a 10Gbps bottleneck, 100ms RTT, and 1% packet loss
        rate. CUBIC gets 3.27 Mbps, and BBR gets 9150 Mbps (2798x higher).
      
      Low latency with the bloated buffers common in today's last-mile links:
        Consider a netperf TCP_STREAM test lasting 120 secs on an emulated
        path with a 10Mbps bottleneck, 40ms RTT, and 1000-packet bottleneck
        buffer. Both fully utilize the bottleneck bandwidth, but BBR
        achieves this with a median RTT 25x lower (43 ms instead of 1.09
        secs).
      
      Our long-term goal is to improve the congestion control algorithms
      used on the Internet. We are hopeful that BBR can help advance the
      efforts toward this goal, and motivate the community to do further
      research.
      
      Test results, performance evaluations, feedback, and BBR-related
      discussions are very welcome in the public e-mail list for BBR:
      
        https://groups.google.com/forum/#!forum/bbr-dev
      
      NOTE: BBR *must* be used with the fq qdisc ("man tc-fq") with pacing
      enabled, since pacing is integral to the BBR design and
      implementation. BBR without pacing would not function properly, and
      may incur unnecessary high packet loss rates.
      Signed-off-by: NVan Jacobson <vanj@google.com>
      Signed-off-by: NNeal Cardwell <ncardwell@google.com>
      Signed-off-by: NYuchung Cheng <ycheng@google.com>
      Signed-off-by: NNandita Dukkipati <nanditad@google.com>
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NSoheil Hassas Yeganeh <soheil@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      0f8782ea
    • Y
      tcp: track data delivery rate for a TCP connection · b9f64820
      Yuchung Cheng 提交于
      This patch generates data delivery rate (throughput) samples on a
      per-ACK basis. These rate samples can be used by congestion control
      modules, and specifically will be used by TCP BBR in later patches in
      this series.
      
      Key state:
      
      tp->delivered: Tracks the total number of data packets (original or not)
      	       delivered so far. This is an already-existing field.
      
      tp->delivered_mstamp: the last time tp->delivered was updated.
      
      Algorithm:
      
      A rate sample is calculated as (d1 - d0)/(t1 - t0) on a per-ACK basis:
      
        d1: the current tp->delivered after processing the ACK
        t1: the current time after processing the ACK
      
        d0: the prior tp->delivered when the acked skb was transmitted
        t0: the prior tp->delivered_mstamp when the acked skb was transmitted
      
      When an skb is transmitted, we snapshot d0 and t0 in its control
      block in tcp_rate_skb_sent().
      
      When an ACK arrives, it may SACK and ACK some skbs. For each SACKed
      or ACKed skb, tcp_rate_skb_delivered() updates the rate_sample struct
      to reflect the latest (d0, t0).
      
      Finally, tcp_rate_gen() generates a rate sample by storing
      (d1 - d0) in rs->delivered and (t1 - t0) in rs->interval_us.
      
      One caveat: if an skb was sent with no packets in flight, then
      tp->delivered_mstamp may be either invalid (if the connection is
      starting) or outdated (if the connection was idle). In that case,
      we'll re-stamp tp->delivered_mstamp.
      
      At first glance it seems t0 should always be the time when an skb was
      transmitted, but actually this could over-estimate the rate due to
      phase mismatch between transmit and ACK events. To track the delivery
      rate, we ensure that if packets are in flight then t0 and and t1 are
      times at which packets were marked delivered.
      
      If the initial and final RTTs are different then one may be corrupted
      by some sort of noise. The noise we see most often is sending gaps
      caused by delayed, compressed, or stretched acks. This either affects
      both RTTs equally or artificially reduces the final RTT. We approach
      this by recording the info we need to compute the initial RTT
      (duration of the "send phase" of the window) when we recorded the
      associated inflight. Then, for a filter to avoid bandwidth
      overestimates, we generalize the per-sample bandwidth computation
      from:
      
          bw = delivered / ack_phase_rtt
      
      to the following:
      
          bw = delivered / max(send_phase_rtt, ack_phase_rtt)
      
      In large-scale experiments, this filtering approach incorporating
      send_phase_rtt is effective at avoiding bandwidth overestimates due to
      ACK compression or stretched ACKs.
      Signed-off-by: NVan Jacobson <vanj@google.com>
      Signed-off-by: NNeal Cardwell <ncardwell@google.com>
      Signed-off-by: NYuchung Cheng <ycheng@google.com>
      Signed-off-by: NNandita Dukkipati <nanditad@google.com>
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NSoheil Hassas Yeganeh <soheil@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      b9f64820
  3. 11 6月, 2016 1 次提交
  4. 18 2月, 2016 1 次提交
  5. 21 1月, 2016 2 次提交
  6. 21 10月, 2015 1 次提交
    • Y
      tcp: track the packet timings in RACK · 659a8ad5
      Yuchung Cheng 提交于
      This patch is the first half of the RACK loss recovery.
      
      RACK loss recovery uses the notion of time instead
      of packet sequence (FACK) or counts (dupthresh). It's inspired by the
      previous FACK heuristic in tcp_mark_lost_retrans(): when a limited
      transmit (new data packet) is sacked, then current retransmitted
      sequence below the newly sacked sequence must been lost,
      since at least one round trip time has elapsed.
      
      But it has several limitations:
      1) can't detect tail drops since it depends on limited transmit
      2) is disabled upon reordering (assumes no reordering)
      3) only enabled in fast recovery ut not timeout recovery
      
      RACK (Recently ACK) addresses these limitations with the notion
      of time instead: a packet P1 is lost if a later packet P2 is s/acked,
      as at least one round trip has passed.
      
      Since RACK cares about the time sequence instead of the data sequence
      of packets, it can detect tail drops when later retransmission is
      s/acked while FACK or dupthresh can't. For reordering RACK uses a
      dynamically adjusted reordering window ("reo_wnd") to reduce false
      positives on ever (small) degree of reordering.
      
      This patch implements tcp_advanced_rack() which tracks the
      most recent transmission time among the packets that have been
      delivered (ACKed or SACKed) in tp->rack.mstamp. This timestamp
      is the key to determine which packet has been lost.
      
      Consider an example that the sender sends six packets:
      T1: P1 (lost)
      T2: P2
      T3: P3
      T4: P4
      T100: sack of P2. rack.mstamp = T2
      T101: retransmit P1
      T102: sack of P2,P3,P4. rack.mstamp = T4
      T205: ACK of P4 since the hole is repaired. rack.mstamp = T101
      
      We need to be careful about spurious retransmission because it may
      falsely advance tp->rack.mstamp by an RTT or an RTO, causing RACK
      to falsely mark all packets lost, just like a spurious timeout.
      
      We identify spurious retransmission by the ACK's TS echo value.
      If TS option is not applicable but the retransmission is acknowledged
      less than min-RTT ago, it is likely to be spurious. We refrain from
      using the transmission time of these spurious retransmissions.
      
      The second half is implemented in the next patch that marks packet
      lost using RACK timestamp.
      Signed-off-by: NYuchung Cheng <ycheng@google.com>
      Signed-off-by: NNeal Cardwell <ncardwell@google.com>
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      659a8ad5
  7. 28 8月, 2015 1 次提交
  8. 11 6月, 2015 1 次提交
    • K
      tcp: add CDG congestion control · 2b0a8c9e
      Kenneth Klette Jonassen 提交于
      CAIA Delay-Gradient (CDG) is a TCP congestion control that modifies
      the TCP sender in order to [1]:
      
        o Use the delay gradient as a congestion signal.
        o Back off with an average probability that is independent of the RTT.
        o Coexist with flows that use loss-based congestion control, i.e.,
          flows that are unresponsive to the delay signal.
        o Tolerate packet loss unrelated to congestion. (Disabled by default.)
      
      Its FreeBSD implementation was presented for the ICCRG in July 2012;
      slides are available at http://www.ietf.org/proceedings/84/iccrg.html
      
      Running the experiment scenarios in [1] suggests that our implementation
      achieves more goodput compared with FreeBSD 10.0 senders, although it also
      causes more queueing delay for a given backoff factor.
      
      The loss tolerance heuristic is disabled by default due to safety concerns
      for its use in the Internet [2, p. 45-46].
      
      We use a variant of the Hybrid Slow start algorithm in tcp_cubic to reduce
      the probability of slow start overshoot.
      
      [1] D.A. Hayes and G. Armitage. "Revisiting TCP congestion control using
          delay gradients." In Networking 2011, pages 328-341. Springer, 2011.
      [2] K.K. Jonassen. "Implementing CAIA Delay-Gradient in Linux."
          MSc thesis. Department of Informatics, University of Oslo, 2015.
      
      Cc: Eric Dumazet <edumazet@google.com>
      Cc: Yuchung Cheng <ycheng@google.com>
      Cc: Stephen Hemminger <stephen@networkplumber.org>
      Cc: Neal Cardwell <ncardwell@google.com>
      Cc: David Hayes <davihay@ifi.uio.no>
      Cc: Andreas Petlund <apetlund@simula.no>
      Cc: Dave Taht <dave.taht@bufferbloat.net>
      Cc: Nicolas Kuhn <nicolas.kuhn@telecom-bretagne.eu>
      Signed-off-by: NKenneth Klette Jonassen <kennetkl@ifi.uio.no>
      Acked-by: NYuchung Cheng <ycheng@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      2b0a8c9e
  9. 14 5月, 2015 1 次提交
  10. 06 10月, 2014 1 次提交
  11. 29 9月, 2014 1 次提交
    • D
      net: tcp: add DCTCP congestion control algorithm · e3118e83
      Daniel Borkmann 提交于
      This work adds the DataCenter TCP (DCTCP) congestion control
      algorithm [1], which has been first published at SIGCOMM 2010 [2],
      resp. follow-up analysis at SIGMETRICS 2011 [3] (and also, more
      recently as an informational IETF draft available at [4]).
      
      DCTCP is an enhancement to the TCP congestion control algorithm for
      data center networks. Typical data center workloads are i.e.
      i) partition/aggregate (queries; bursty, delay sensitive), ii) short
      messages e.g. 50KB-1MB (for coordination and control state; delay
      sensitive), and iii) large flows e.g. 1MB-100MB (data update;
      throughput sensitive). DCTCP has therefore been designed for such
      environments to provide/achieve the following three requirements:
      
        * High burst tolerance (incast due to partition/aggregate)
        * Low latency (short flows, queries)
        * High throughput (continuous data updates, large file
          transfers) with commodity, shallow buffered switches
      
      The basic idea of its design consists of two fundamentals: i) on the
      switch side, packets are being marked when its internal queue
      length > threshold K (K is chosen so that a large enough headroom
      for marked traffic is still available in the switch queue); ii) the
      sender/host side maintains a moving average of the fraction of marked
      packets, so each RTT, F is being updated as follows:
      
       F := X / Y, where X is # of marked ACKs, Y is total # of ACKs
       alpha := (1 - g) * alpha + g * F, where g is a smoothing constant
      
      The resulting alpha (iow: probability that switch queue is congested)
      is then being used in order to adaptively decrease the congestion
      window W:
      
       W := (1 - (alpha / 2)) * W
      
      The means for receiving marked packets resp. marking them on switch
      side in DCTCP is the use of ECN.
      
      RFC3168 describes a mechanism for using Explicit Congestion Notification
      from the switch for early detection of congestion, rather than waiting
      for segment loss to occur.
      
      However, this method only detects the presence of congestion, not
      the *extent*. In the presence of mild congestion, it reduces the TCP
      congestion window too aggressively and unnecessarily affects the
      throughput of long flows [4].
      
      DCTCP, as mentioned, enhances Explicit Congestion Notification (ECN)
      processing to estimate the fraction of bytes that encounter congestion,
      rather than simply detecting that some congestion has occurred. DCTCP
      then scales the TCP congestion window based on this estimate [4],
      thus it can derive multibit feedback from the information present in
      the single-bit sequence of marks in its control law. And thus act in
      *proportion* to the extent of congestion, not its *presence*.
      
      Switches therefore set the Congestion Experienced (CE) codepoint in
      packets when internal queue lengths exceed threshold K. Resulting,
      DCTCP delivers the same or better throughput than normal TCP, while
      using 90% less buffer space.
      
      It was found in [2] that DCTCP enables the applications to handle 10x
      the current background traffic, without impacting foreground traffic.
      Moreover, a 10x increase in foreground traffic did not cause any
      timeouts, and thus largely eliminates TCP incast collapse problems.
      
      The algorithm itself has already seen deployments in large production
      data centers since then.
      
      We did a long-term stress-test and analysis in a data center, short
      summary of our TCP incast tests with iperf compared to cubic:
      
      This test measured DCTCP throughput and latency and compared it with
      CUBIC throughput and latency for an incast scenario. In this test, 19
      senders sent at maximum rate to a single receiver. The receiver simply
      ran iperf -s.
      
      The senders ran iperf -c <receiver> -t 30. All senders started
      simultaneously (using local clocks synchronized by ntp).
      
      This test was repeated multiple times. Below shows the results from a
      single test. Other tests are similar. (DCTCP results were extremely
      consistent, CUBIC results show some variance induced by the TCP timeouts
      that CUBIC encountered.)
      
      For this test, we report statistics on the number of TCP timeouts,
      flow throughput, and traffic latency.
      
      1) Timeouts (total over all flows, and per flow summaries):
      
                  CUBIC            DCTCP
        Total     3227             25
        Mean       169.842          1.316
        Median     183              1
        Max        207              5
        Min        123              0
        Stddev      28.991          1.600
      
      Timeout data is taken by measuring the net change in netstat -s
      "other TCP timeouts" reported. As a result, the timeout measurements
      above are not restricted to the test traffic, and we believe that it
      is likely that all of the "DCTCP timeouts" are actually timeouts for
      non-test traffic. We report them nevertheless. CUBIC will also include
      some non-test timeouts, but they are drawfed by bona fide test traffic
      timeouts for CUBIC. Clearly DCTCP does an excellent job of preventing
      TCP timeouts. DCTCP reduces timeouts by at least two orders of
      magnitude and may well have eliminated them in this scenario.
      
      2) Throughput (per flow in Mbps):
      
                  CUBIC            DCTCP
        Mean      521.684          521.895
        Median    464              523
        Max       776              527
        Min       403              519
        Stddev    105.891            2.601
        Fairness    0.962            0.999
      
      Throughput data was simply the average throughput for each flow
      reported by iperf. By avoiding TCP timeouts, DCTCP is able to
      achieve much better per-flow results. In CUBIC, many flows
      experience TCP timeouts which makes flow throughput unpredictable and
      unfair. DCTCP, on the other hand, provides very clean predictable
      throughput without incurring TCP timeouts. Thus, the standard deviation
      of CUBIC throughput is dramatically higher than the standard deviation
      of DCTCP throughput.
      
      Mean throughput is nearly identical because even though cubic flows
      suffer TCP timeouts, other flows will step in and fill the unused
      bandwidth. Note that this test is something of a best case scenario
      for incast under CUBIC: it allows other flows to fill in for flows
      experiencing a timeout. Under situations where the receiver is issuing
      requests and then waiting for all flows to complete, flows cannot fill
      in for timed out flows and throughput will drop dramatically.
      
      3) Latency (in ms):
      
                  CUBIC            DCTCP
        Mean      4.0088           0.04219
        Median    4.055            0.0395
        Max       4.2              0.085
        Min       3.32             0.028
        Stddev    0.1666           0.01064
      
      Latency for each protocol was computed by running "ping -i 0.2
      <receiver>" from a single sender to the receiver during the incast
      test. For DCTCP, "ping -Q 0x6 -i 0.2 <receiver>" was used to ensure
      that traffic traversed the DCTCP queue and was not dropped when the
      queue size was greater than the marking threshold. The summary
      statistics above are over all ping metrics measured between the single
      sender, receiver pair.
      
      The latency results for this test show a dramatic difference between
      CUBIC and DCTCP. CUBIC intentionally overflows the switch buffer
      which incurs the maximum queue latency (more buffer memory will lead
      to high latency.) DCTCP, on the other hand, deliberately attempts to
      keep queue occupancy low. The result is a two orders of magnitude
      reduction of latency with DCTCP - even with a switch with relatively
      little RAM. Switches with larger amounts of RAM will incur increasing
      amounts of latency for CUBIC, but not for DCTCP.
      
      4) Convergence and stability test:
      
      This test measured the time that DCTCP took to fairly redistribute
      bandwidth when a new flow commences. It also measured DCTCP's ability
      to remain stable at a fair bandwidth distribution. DCTCP is compared
      with CUBIC for this test.
      
      At the commencement of this test, a single flow is sending at maximum
      rate (near 10 Gbps) to a single receiver. One second after that first
      flow commences, a new flow from a distinct server begins sending to
      the same receiver as the first flow. After the second flow has sent
      data for 10 seconds, the second flow is terminated. The first flow
      sends for an additional second. Ideally, the bandwidth would be evenly
      shared as soon as the second flow starts, and recover as soon as it
      stops.
      
      The results of this test are shown below. Note that the flow bandwidth
      for the two flows was measured near the same time, but not
      simultaneously.
      
      DCTCP performs nearly perfectly within the measurement limitations
      of this test: bandwidth is quickly distributed fairly between the two
      flows, remains stable throughout the duration of the test, and
      recovers quickly. CUBIC, in contrast, is slow to divide the bandwidth
      fairly, and has trouble remaining stable.
      
        CUBIC                      DCTCP
      
        Seconds  Flow 1  Flow 2    Seconds  Flow 1  Flow 2
         0       9.93    0          0       9.92    0
         0.5     9.87    0          0.5     9.86    0
         1       8.73    2.25       1       6.46    4.88
         1.5     7.29    2.8        1.5     4.9     4.99
         2       6.96    3.1        2       4.92    4.94
         2.5     6.67    3.34       2.5     4.93    5
         3       6.39    3.57       3       4.92    4.99
         3.5     6.24    3.75       3.5     4.94    4.74
         4       6       3.94       4       5.34    4.71
         4.5     5.88    4.09       4.5     4.99    4.97
         5       5.27    4.98       5       4.83    5.01
         5.5     4.93    5.04       5.5     4.89    4.99
         6       4.9     4.99       6       4.92    5.04
         6.5     4.93    5.1        6.5     4.91    4.97
         7       4.28    5.8        7       4.97    4.97
         7.5     4.62    4.91       7.5     4.99    4.82
         8       5.05    4.45       8       5.16    4.76
         8.5     5.93    4.09       8.5     4.94    4.98
         9       5.73    4.2        9       4.92    5.02
         9.5     5.62    4.32       9.5     4.87    5.03
        10       6.12    3.2       10       4.91    5.01
        10.5     6.91    3.11      10.5     4.87    5.04
        11       8.48    0         11       8.49    4.94
        11.5     9.87    0         11.5     9.9     0
      
      SYN/ACK ECT test:
      
      This test demonstrates the importance of ECT on SYN and SYN-ACK packets
      by measuring the connection probability in the presence of competing
      flows for a DCTCP connection attempt *without* ECT in the SYN packet.
      The test was repeated five times for each number of competing flows.
      
                    Competing Flows  1 |    2 |    4 |    8 |   16
                                     ------------------------------
      Mean Connection Probability    1 | 0.67 | 0.45 | 0.28 |    0
      Median Connection Probability  1 | 0.65 | 0.45 | 0.25 |    0
      
      As the number of competing flows moves beyond 1, the connection
      probability drops rapidly.
      
      Enabling DCTCP with this patch requires the following steps:
      
      DCTCP must be running both on the sender and receiver side in your
      data center, i.e.:
      
        sysctl -w net.ipv4.tcp_congestion_control=dctcp
      
      Also, ECN functionality must be enabled on all switches in your
      data center for DCTCP to work. The default ECN marking threshold (K)
      heuristic on the switch for DCTCP is e.g., 20 packets (30KB) at
      1Gbps, and 65 packets (~100KB) at 10Gbps (K > 1/7 * C * RTT, [4]).
      
      In above tests, for each switch port, traffic was segregated into two
      queues. For any packet with a DSCP of 0x01 - or equivalently a TOS of
      0x04 - the packet was placed into the DCTCP queue. All other packets
      were placed into the default drop-tail queue. For the DCTCP queue,
      RED/ECN marking was enabled, here, with a marking threshold of 75 KB.
      More details however, we refer you to the paper [2] under section 3).
      
      There are no code changes required to applications running in user
      space. DCTCP has been implemented in full *isolation* of the rest of
      the TCP code as its own congestion control module, so that it can run
      without a need to expose code to the core of the TCP stack, and thus
      nothing changes for non-DCTCP users.
      
      Changes in the CA framework code are minimal, and DCTCP algorithm
      operates on mechanisms that are already available in most Silicon.
      The gain (dctcp_shift_g) is currently a fixed constant (1/16) from
      the paper, but we leave the option that it can be chosen carefully
      to a different value by the user.
      
      In case DCTCP is being used and ECN support on peer site is off,
      DCTCP falls back after 3WHS to operate in normal TCP Reno mode.
      
      ss {-4,-6} -t -i diag interface:
      
        ... dctcp wscale:7,7 rto:203 rtt:2.349/0.026 mss:1448 cwnd:2054
        ssthresh:1102 ce_state 0 alpha 15 ab_ecn 0 ab_tot 735584
        send 10129.2Mbps pacing_rate 20254.1Mbps unacked:1822 retrans:0/15
        reordering:101 rcv_space:29200
      
        ... dctcp-reno wscale:7,7 rto:201 rtt:0.711/1.327 ato:40 mss:1448
        cwnd:10 ssthresh:1102 fallback_mode send 162.9Mbps pacing_rate
        325.5Mbps rcv_rtt:1.5 rcv_space:29200
      
      More information about DCTCP can be found in [1-4].
      
        [1] http://simula.stanford.edu/~alizade/Site/DCTCP.html
        [2] http://simula.stanford.edu/~alizade/Site/DCTCP_files/dctcp-final.pdf
        [3] http://simula.stanford.edu/~alizade/Site/DCTCP_files/dctcp_analysis-full.pdf
        [4] http://tools.ietf.org/html/draft-bensley-tcpm-dctcp-00
      
      Joint work with Florian Westphal and Glenn Judd.
      Signed-off-by: NDaniel Borkmann <dborkman@redhat.com>
      Signed-off-by: NFlorian Westphal <fw@strlen.de>
      Signed-off-by: NGlenn Judd <glenn.judd@morganstanley.com>
      Acked-by: NStephen Hemminger <stephen@networkplumber.org>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      e3118e83
  12. 20 9月, 2014 1 次提交
    • T
      fou: Support for foo-over-udp RX path · 23461551
      Tom Herbert 提交于
      This patch provides a receive path for foo-over-udp. This allows
      direct encapsulation of IP protocols over UDP. The bound destination
      port is used to map to an IP protocol, and the XFRM framework
      (udp_encap_rcv) is used to receive encapsulated packets. Upon
      reception, the encapsulation header is logically removed (pointer
      to transport header is advanced) and the packet is reinjected into
      the receive path with the IP protocol indicated by the mapping.
      
      Netlink is used to configure FOU ports. The configuration information
      includes the port number to bind to and the IP protocol corresponding
      to that port.
      
      This should support GRE/UDP
      (http://tools.ietf.org/html/draft-yong-tsvwg-gre-in-udp-encap-02),
      as will as the other IP tunneling protocols (IPIP, SIT).
      Signed-off-by: NTom Herbert <therbert@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      23461551
  13. 15 7月, 2014 1 次提交
  14. 25 2月, 2014 1 次提交
  15. 07 1月, 2014 1 次提交
  16. 04 7月, 2013 1 次提交
  17. 20 6月, 2013 1 次提交
  18. 12 6月, 2013 1 次提交
  19. 08 6月, 2013 1 次提交
  20. 27 3月, 2013 1 次提交
    • P
      GRE: Refactor GRE tunneling code. · c5441932
      Pravin B Shelar 提交于
      Following patch refactors GRE code into ip tunneling code and GRE
      specific code. Common tunneling code is moved to ip_tunnel module.
      ip_tunnel module is written as generic library which can be used
      by different tunneling implementations.
      
      ip_tunnel module contains following components:
       - packet xmit and rcv generic code. xmit flow looks like
         (gre_xmit/ipip_xmit)->ip_tunnel_xmit->ip_local_out.
       - hash table of all devices.
       - lookup for tunnel devices.
       - control plane operations like device create, destroy, ioctl, netlink
         operations code.
       - registration for tunneling modules, like gre, ipip etc.
       - define single pcpu_tstats dev->tstats.
       - struct tnl_ptk_info added to pass parsed tunnel packet parameters.
      
      ipip.h header is renamed to ip_tunnel.h
      Signed-off-by: NPravin B Shelar <pshelar@nicira.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      c5441932
  21. 01 8月, 2012 1 次提交
    • A
      memcg: rename config variables · c255a458
      Andrew Morton 提交于
      Sanity:
      
      CONFIG_CGROUP_MEM_RES_CTLR -> CONFIG_MEMCG
      CONFIG_CGROUP_MEM_RES_CTLR_SWAP -> CONFIG_MEMCG_SWAP
      CONFIG_CGROUP_MEM_RES_CTLR_SWAP_ENABLED -> CONFIG_MEMCG_SWAP_ENABLED
      CONFIG_CGROUP_MEM_RES_CTLR_KMEM -> CONFIG_MEMCG_KMEM
      
      [mhocko@suse.cz: fix missed bits]
      Cc: Glauber Costa <glommer@parallels.com>
      Acked-by: NMichal Hocko <mhocko@suse.cz>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      c255a458
  22. 20 7月, 2012 1 次提交
    • Y
      net-tcp: Fast Open base · 2100c8d2
      Yuchung Cheng 提交于
      This patch impelements the common code for both the client and server.
      
      1. TCP Fast Open option processing. Since Fast Open does not have an
         option number assigned by IANA yet, it shares the experiment option
         code 254 by implementing draft-ietf-tcpm-experimental-options
         with a 16 bits magic number 0xF989. This enables global experiments
         without clashing the scarce(2) experimental options available for TCP.
      
         When the draft status becomes standard (maybe), the client should
         switch to the new option number assigned while the server supports
         both numbers for transistion.
      
      2. The new sysctl tcp_fastopen
      
      3. A place holder init function
      Signed-off-by: NYuchung Cheng <ycheng@google.com>
      Acked-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      2100c8d2
  23. 19 7月, 2012 1 次提交
  24. 11 7月, 2012 1 次提交
  25. 13 12月, 2011 1 次提交
  26. 10 12月, 2011 1 次提交
  27. 14 5月, 2011 1 次提交
    • V
      net: ipv4: add IPPROTO_ICMP socket kind · c319b4d7
      Vasiliy Kulikov 提交于
      This patch adds IPPROTO_ICMP socket kind.  It makes it possible to send
      ICMP_ECHO messages and receive the corresponding ICMP_ECHOREPLY messages
      without any special privileges.  In other words, the patch makes it
      possible to implement setuid-less and CAP_NET_RAW-less /bin/ping.  In
      order not to increase the kernel's attack surface, the new functionality
      is disabled by default, but is enabled at bootup by supporting Linux
      distributions, optionally with restriction to a group or a group range
      (see below).
      
      Similar functionality is implemented in Mac OS X:
      http://www.manpagez.com/man/4/icmp/
      
      A new ping socket is created with
      
          socket(PF_INET, SOCK_DGRAM, PROT_ICMP)
      
      Message identifiers (octets 4-5 of ICMP header) are interpreted as local
      ports. Addresses are stored in struct sockaddr_in. No port numbers are
      reserved for privileged processes, port 0 is reserved for API ("let the
      kernel pick a free number"). There is no notion of remote ports, remote
      port numbers provided by the user (e.g. in connect()) are ignored.
      
      Data sent and received include ICMP headers. This is deliberate to:
      1) Avoid the need to transport headers values like sequence numbers by
      other means.
      2) Make it easier to port existing programs using raw sockets.
      
      ICMP headers given to send() are checked and sanitized. The type must be
      ICMP_ECHO and the code must be zero (future extensions might relax this,
      see below). The id is set to the number (local port) of the socket, the
      checksum is always recomputed.
      
      ICMP reply packets received from the network are demultiplexed according
      to their id's, and are returned by recv() without any modifications.
      IP header information and ICMP errors of those packets may be obtained
      via ancillary data (IP_RECVTTL, IP_RETOPTS, and IP_RECVERR). ICMP source
      quenches and redirects are reported as fake errors via the error queue
      (IP_RECVERR); the next hop address for redirects is saved to ee_info (in
      network order).
      
      socket(2) is restricted to the group range specified in
      "/proc/sys/net/ipv4/ping_group_range".  It is "1 0" by default, meaning
      that nobody (not even root) may create ping sockets.  Setting it to "100
      100" would grant permissions to the single group (to either make
      /sbin/ping g+s and owned by this group or to grant permissions to the
      "netadmins" group), "0 4294967295" would enable it for the world, "100
      4294967295" would enable it for the users, but not daemons.
      
      The existing code might be (in the unlikely case anyone needs it)
      extended rather easily to handle other similar pairs of ICMP messages
      (Timestamp/Reply, Information Request/Reply, Address Mask Request/Reply
      etc.).
      
      Userspace ping util & patch for it:
      http://openwall.info/wiki/people/segoon/ping
      
      For Openwall GNU/*/Linux it was the last step on the road to the
      setuid-less distro.  A revision of this patch (for RHEL5/OpenVZ kernels)
      is in use in Owl-current, such as in the 2011/03/12 LiveCD ISOs:
      http://mirrors.kernel.org/openwall/Owl/current/iso/
      
      Initially this functionality was written by Pavel Kankovsky for
      Linux 2.4.32, but unfortunately it was never made public.
      
      All ping options (-b, -p, -Q, -R, -s, -t, -T, -M, -I), are tested with
      the patch.
      
      PATCH v3:
          - switched to flowi4.
          - minor changes to be consistent with raw sockets code.
      
      PATCH v2:
          - changed ping_debug() to pr_debug().
          - removed CONFIG_IP_PING.
          - removed ping_seq_fops.owner field (unused for procfs).
          - switched to proc_net_fops_create().
          - switched to %pK in seq_printf().
      
      PATCH v1:
          - fixed checksumming bug.
          - CAP_NET_RAW may not create icmp sockets anymore.
      
      RFC v2:
          - minor cleanups.
          - introduced sysctl'able group range to restrict socket(2).
      Signed-off-by: NVasiliy Kulikov <segoon@openwall.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      c319b4d7
  28. 02 2月, 2011 1 次提交
    • D
      ipv4: Remove fib_hash. · 3630b7c0
      David S. Miller 提交于
      The time has finally come to remove the hash based routing table
      implementation in ipv4.
      
      FIB Trie is mature, well tested, and I've done an audit of it's code
      to confirm that it implements insert, delete, and lookup with the same
      identical semantics as fib_hash did.
      
      If there are any semantic differences found in fib_trie, we should
      simply fix them.
      
      I've placed the trie statistic config option under advanced router
      configuration.
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      Acked-by: NStephen Hemminger <shemminger@vyatta.com>
      3630b7c0
  29. 22 8月, 2010 1 次提交
    • D
      PPTP: PPP over IPv4 (Point-to-Point Tunneling Protocol) · 00959ade
      Dmitry Kozlov 提交于
      PPP: introduce "pptp" module which implements point-to-point tunneling protocol using pppox framework
      NET: introduce the "gre" module for demultiplexing GRE packets on version criteria
           (required to pptp and ip_gre may coexists)
      NET: ip_gre: update to use the "gre" module
      
      This patch introduces then pptp support to the linux kernel which
      dramatically speeds up pptp vpn connections and decreases cpu usage in
      comparison of existing user-space implementation
      (poptop/pptpclient). There is accel-pptp project
      (https://sourceforge.net/projects/accel-pptp/) to utilize this module,
      it contains plugin for pppd to use pptp in client-mode and modified
      pptpd (poptop) to build high-performance pptp NAS.
      
      There was many changes from initial submitted patch, most important are:
      1. using rcu instead of read-write locks
      2. using static bitmap instead of dynamically allocated
      3. using vmalloc for memory allocation instead of BITS_PER_LONG + __get_free_pages
      4. fixed many coding style issues
      Thanks to Eric Dumazet.
      Signed-off-by: NDmitry Kozlov <xeb@mail.ru>
      Signed-off-by: NEric Dumazet <eric.dumazet@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      00959ade
  30. 07 10月, 2008 1 次提交
  31. 07 3月, 2008 1 次提交
    • D
      [UDP]: Revert udplite and code split. · db8dac20
      David S. Miller 提交于
      This reverts commit db1ed684 ("[IPV6]
      UDP: Rename IPv6 UDP files."), commit
      8be8af8f ("[IPV4] UDP: Move
      IPv4-specific bits to other file.") and commit
      e898d4db ("[UDP]: Allow users to
      configure UDP-Lite.").
      
      First, udplite is of such small cost, and it is a core protocol just
      like TCP and normal UDP are.
      
      We spent enormous amounts of effort to make udplite share as much code
      with core UDP as possible.  All of that work is less valuable if we're
      just going to slap a config option on udplite support.
      
      It is also causing build failures, as reported on linux-next, showing
      that the changeset was not tested very well.  In fact, this is the
      second build failure resulting from the udplite change.
      
      Finally, the config options provided was a bool, instead of a modular
      option.  Meaning the udplite code does not even get build tested
      by allmodconfig builds, and furthermore the user is not presented
      with a reasonable modular build option which is particularly needed
      by distribution vendors.
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      db8dac20
  32. 04 3月, 2008 2 次提交
  33. 29 1月, 2008 1 次提交
    • P
      [IPV4]: Cleanup the sysctl_net_ipv4.c file · 9ba63979
      Pavel Emelyanov 提交于
      This includes several cleanups:
      
       * tune Makefile to compile out this file when SYSCTL=n. Now
         it looks like net/core/sysctl_net_core.c one;
       * move the ipv4_config to af_inet.c to exist all the time;
       * remove additional sysctl_ip_nonlocal_bind declaration
         (it is already declared in net/ip.h);
       * remove no nonger needed ifdefs from this file.
      
      This is a preparation for using ctl paths for net/ipv4/
      sysctl table.
      Signed-off-by: NPavel Emelyanov <xemul@openvz.org>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      9ba63979
  34. 16 10月, 2007 1 次提交
    • P
      [INET]: Collect frag queues management objects together · 7eb95156
      Pavel Emelyanov 提交于
      There are some objects that are common in all the places
      which are used to keep track of frag queues, they are:
      
       * hash table
       * LRU list
       * rw lock
       * rnd number for hash function
       * the number of queues
       * the amount of memory occupied by queues
       * secret timer
      
      Move all this stuff into one structure (struct inet_frags)
      to make it possible use them uniformly in the future. Like
      with the previous patch this mostly consists of hunks like
      
      -    write_lock(&ipfrag_lock);
      +    write_lock(&ip4_frags.lock);
      
      To address the issue with exporting the number of queues and
      the amount of memory occupied by queues outside the .c file
      they are declared in, I introduce a couple of helpers.
      Signed-off-by: NPavel Emelyanov <xemul@openvz.org>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      7eb95156
  35. 11 10月, 2007 1 次提交
  36. 11 7月, 2007 1 次提交
  37. 26 4月, 2007 1 次提交